SLM architecture in MegaCpp: hybrid patterns, block ownership, and why the letters matter
A grounded architectural read of the MegaCpp small-model stack: hybrid patterns, block semantics, schedule ownership, and why names like ablock, mblock, and eblock are operational rather than decorative.

MegaCpp does not describe its small-model stack as "a transformer with extras."
It describes it as a pattern of block families with different ownership,
runtime costs, and scheduling rules. That is why strings like AEMEAEMEAEMR
matter. They are not branding. They are the shortest honest description of how
capacity is distributed. Decode the letters before anything else: A means an
attention block, M means a Mamba-style state-space block, E means an expert
or MoE block, and R means the recurrent-style tail. The more
implementation-focused version of that claim lives in Unique additions and why
they exist, while the data and optimization
companions are SLM data and SLM training:
architecture only stays useful if the corpus contract and training contract
respect the same block map.
If those letters are not already familiar, use MegaCpp model glossary as the companion while reading this page. The checked-in NAM56R block taxonomy sample and NAM56R pattern composition sample are the shortest public-safe receipts for the same decoder.
Glossary checkpoint
The public NAM56R example pack gives one concrete baseline for reading the
stack. The short pattern AEMEAEMEAEMR expands to 52 layers, which the checked-in
composition sample counts as 13 attention slots, 22 expert slots, 13
Mamba-family slots, and 4 recurrent-tail slots. That expansion is what turns
"hybrid" into an actual scheduling problem rather than a marketing adjective.
In other words, the short string already encodes a hardware story: 13 slots pay attention-style costs, 22 slots pay expert-routing and expert-GEMM costs, 13 slots pay state-space costs, and the remaining 4 slots keep a recurrent-tail seam. If a later post talks about NAM56R without decoding the letters first, this is the page it should send you back to.
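The slot counts above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the 12-letter pattern is tiled cyclically until it covers 52 layers. That tiling rule is an assumption made for illustration, not something the composition sample is quoted as stating, and `expand_pattern` is a hypothetical helper name.

```python
from collections import Counter
from itertools import cycle, islice

PATTERN = "AEMEAEMEAEMR"  # short pattern string quoted in the text
DEPTH = 52                # layer count quoted in the text


def expand_pattern(pattern: str, depth: int) -> str:
    """Tile the short pattern cyclically until it covers `depth` layers.

    Cyclic tiling is an assumption of this sketch, not a documented rule.
    """
    return "".join(islice(cycle(pattern), depth))


layers = expand_pattern(PATTERN, DEPTH)
counts = Counter(layers)
print(dict(counts))  # {'A': 13, 'E': 22, 'M': 13, 'R': 4}
```

Under that assumption the counts match the 13 / 22 / 13 / 4 split quoted above. The positions of individual slots depend on the tiling rule, so treat them as illustrative rather than as the recipe's actual layer indices.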
The quickest checked-in path is:
- NAM56R block taxonomy sample
- NAM56R pattern composition sample
- NAM56R feature placement sample
- NAM56R launch policy
What the letters mean
MegaCpp uses a project-specific architectural vocabulary:
| Token | MegaCpp name | Main job |
|---|---|---|
| A | ablock | attention-heavy token mixing |
| M | mblock | Mamba-style or state-space sequence mixing |
| E | eblock | MoE or other conditional-capacity blocks |
| R | rblock | recurrent-style or tail consolidation blocks |
| C | cblock | context-defined coordination or wrapper blocks |
Those names are useful because they force the engineering question that matters:
which subsystem owns this block? An ablock is not maintained the same way an
eblock is. The attention backend, the state-space scan, the router, and the
tail path all have different failure modes and different scaling behavior. That
ownership view is also what makes sequence, context, and expert
splits readable: once you know which block
family owns which semantics, the distributed-axis question stops being a vague
"parallelism" problem.
The compact checked-in walk from letter to workload is NAM56R block taxonomy sample -> NAM56R pattern composition sample -> NAM56R feature placement sample. The glossary companion is MegaCpp model glossary, which explains where those local words came from and how they change once launch rewriting enters the picture.
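As a rough illustration of the letter-to-family walk, the vocabulary table can be held as a small lookup that refuses unknown letters. The rows come from the table above. The `BlockFamily` dataclass, the `TAXONOMY` dict, and the `decode` helper are hypothetical names for this sketch and are not taken from the checked-in samples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockFamily:
    token: str
    name: str
    main_job: str


# Rows copied from the vocabulary table; the container itself is illustrative.
TAXONOMY = {
    family.token: family
    for family in (
        BlockFamily("A", "ablock", "attention-heavy token mixing"),
        BlockFamily("M", "mblock", "Mamba-style or state-space sequence mixing"),
        BlockFamily("E", "eblock", "MoE or other conditional-capacity blocks"),
        BlockFamily("R", "rblock", "recurrent-style or tail consolidation blocks"),
        BlockFamily("C", "cblock", "context-defined coordination or wrapper blocks"),
    )
}


def decode(pattern: str) -> list[str]:
    """Translate a pattern string into family names, failing on unknown letters."""
    unknown = sorted(set(pattern) - set(TAXONOMY))
    if unknown:
        raise ValueError(f"unknown block tokens: {unknown}")
    return [TAXONOMY[token].name for token in pattern]


print(decode("AEMEAEMEAEMR"))
```

Failing loudly on an unknown letter is a design choice of the sketch: a pattern string that silently tolerates stray tokens stops being a reliable description of ownership.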
That last qualifier matters for cblock, and it should be stated early because
the letter can be misleading on first touch. In the broader public-safe
examples, cblock is not one fixed NAM56R primitive. The core-block sample
uses it for a concept-retrieval block, while the hybrid layout notes use it
more loosely for coordination or wrapper logic. The base NAM56R pattern in the
public recipe pack remains A/E/M/R, so treat cblock as context-defined
glue, not as a default fifth NAM56R family.
Why a small model needs explicit block ownership
Large dense models can hide a lot of architectural redundancy behind scale. Small specialist models cannot. Every repeated mechanism competes for the same parameter budget, activation budget, and compile budget.
MegaCpp's hybrid pattern is a way of allocating that budget intentionally:
- attention anchors preserve general token interaction
- state-space blocks carry efficient sequential dynamics
- expert blocks add bursty conditional capacity
- tail blocks handle consolidation or project-specific end-of-pattern work
That is a more informative story than saying "the model is hybrid." It tells
you where the compute is supposed to go and what kind of runtime support each
part needs. On real hardware, those choices show up directly in H200 memory
geometry and activation recompute
boundaries, because each block family
stresses a different part of the runtime budget. The expert-heavy slice of that
budget is the exact continuation in specialist routing, which is
why this architecture note should not be read as if E blocks were just a side
flag.
The letters also do not tell the whole execution story on their own. The
checked-in feature-placement receipt keeps a second map beside the pattern:
input-side n-gram hash and structure enrichment sit before the main stack, MTP
lives on the objective suffix, A blocks host MLA plus selected DSA, Engram,
and mHC placements, E blocks carry MoE and MoD, M blocks carry Mamba and
MIMO, and R blocks stay on an explicit recurrent index list. That extra
placement layer is what keeps a hybrid recipe from drifting into "same letters,
different model" territory.
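One way to picture that second map is a per-family allow-list checked against a proposed placement. The family-to-feature pairs below restate the paragraph above. The data layout, the `ALLOWED` name, and the `placement_violations` helper are assumptions made for this sketch, not the format of the feature-placement receipt. Input-side enrichment and MTP are left out because the text places them outside the main stack.

```python
# Family -> features it may host, restated from the paragraph above.
ALLOWED = {
    "A": {"MLA", "DSA", "Engram", "mHC"},
    "E": {"MoE", "MoD"},
    "M": {"Mamba", "MIMO"},
    "R": set(),  # the text only says R blocks sit on an explicit index list
}


def placement_violations(pattern: str, placement: dict[int, set[str]]) -> list[str]:
    """Report features placed on a layer whose family is not listed as hosting them."""
    problems = []
    for index, features in sorted(placement.items()):
        if not 0 <= index < len(pattern):
            problems.append(f"layer {index}: outside the pattern")
            continue
        family = pattern[index]
        extra = features - ALLOWED.get(family, set())
        if extra:
            problems.append(f"layer {index} ({family}): unexpected {sorted(extra)}")
    return problems


# Hypothetical placement: layer 2 is an M slot, so MLA there is flagged.
placement = {0: {"MLA"}, 1: {"MoE", "MoD"}, 2: {"MLA"}}
print(placement_violations("AEMEAEMEAEMR", placement))
```

A check like this is what separates two recipes that print the same letters but attach features to different families.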
Pattern strings are only useful if runtime respects them
The pattern stops being real the moment the runtime treats every layer as if it were the same class. That is why MegaCpp ties the pattern to three concrete surfaces:
- the recipe layer, which expands the declared pattern
- the schedule layer, which decides how each block family is executed
- the verification layer, which checks the implementation still matches the declaration
A pattern-aware model can drift silently if the schedule becomes too generic.
You can still print AEMEAEMEAEMR in logs while routing most of the stack
through transformer-default code paths. The fix is not better prose. The fix
is keeping block ownership explicit in code and tests.
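A minimal sketch of what "explicit in code and tests" could look like: a per-family dispatch table with no generic fallback, plus a drift check that compares the declared pattern with the families the runtime actually executed. All function names and the string-valued handlers are hypothetical stand-ins, not MegaCpp's schedule layer.

```python
from typing import Callable


def attention_path(index: int) -> str:
    return f"{index}:attention"


def state_space_path(index: int) -> str:
    return f"{index}:state-space"


def expert_path(index: int) -> str:
    return f"{index}:expert"


def recurrent_path(index: int) -> str:
    return f"{index}:recurrent"


# One handler per family; deliberately no default entry.
SCHEDULE: dict[str, Callable[[int], str]] = {
    "A": attention_path,
    "M": state_space_path,
    "E": expert_path,
    "R": recurrent_path,
}


def plan_schedule(pattern: str) -> list[str]:
    """Pick a handler per layer, refusing letters that have no owner."""
    plan = []
    for index, token in enumerate(pattern):
        if token not in SCHEDULE:
            raise ValueError(f"layer {index}: no schedule owner for {token!r}")
        plan.append(SCHEDULE[token](index))
    return plan


def find_drift(declared: str, executed: list[str]) -> list[tuple[int, str, str]]:
    """Return (index, declared, executed) for every layer that ran as another family."""
    if len(declared) != len(executed):
        raise ValueError("declared pattern and executed layer list differ in length")
    return [(i, d, e) for i, (d, e) in enumerate(zip(declared, executed)) if d != e]


# Hypothetical drift: the E slot at index 1 was executed through the attention path.
print(find_drift("AEMR", ["A", "A", "M", "R"]))  # [(1, 'E', 'A')]
```

Leaving out a default handler is the point of the sketch: a generic fallback is exactly how a schedule becomes "too generic" without anyone noticing.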
The architecture claim
The defensible public claim is narrower than a marketing slogan:
- MegaCpp uses a pattern-driven hybrid architecture.
- The letters correspond to block families with different runtime ownership.
- The runtime is allowed to treat those families differently.
- Parameter accounting should distinguish total capacity from active capacity.
That last point matters especially for MoE-heavy variants. Once routed experts, shared experts, attention blocks, and state-space blocks coexist, "model size" stops being one obvious number. Total parameters and active parameters should not be collapsed into one headline. For launch and sizing conversations, that is the same distinction carried into NAM56R launch policy: total stored parameters and active per-token capacity should not be presented as the same number.
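The total-versus-active distinction is simple arithmetic once it is written down. The sketch below assumes a top-k routed layer in which every token uses the always-on parameters plus `top_k` of the routed experts. The `MoELayerSpec` class and every number in the example are hypothetical and do not describe any NAM56R configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayerSpec:
    shared_params: int  # always-on parameters (for example shared experts and router)
    expert_params: int  # parameters in a single routed expert
    num_experts: int    # routed experts stored in the layer
    top_k: int          # routed experts each token actually uses

    def __post_init__(self) -> None:
        if not 0 < self.top_k <= self.num_experts:
            raise ValueError("top_k must be between 1 and num_experts")

    @property
    def total(self) -> int:
        """Parameters that must be stored."""
        return self.shared_params + self.num_experts * self.expert_params

    @property
    def active(self) -> int:
        """Parameters a single token passes through."""
        return self.shared_params + self.top_k * self.expert_params


# Hypothetical numbers chosen only to make the gap visible.
layer = MoELayerSpec(shared_params=1_000_000, expert_params=2_000_000,
                     num_experts=16, top_k=2)
print(layer.total, layer.active)  # 33000000 5000000
```

In this made-up layer the stored count is more than six times the per-token count, which is why a single headline number would mislead in either direction.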
A practical reading of the pattern
For a pattern like AEMEAEMEAEMR, the useful interpretation is:
A E M E A E M E A E M R
- attention anchors keep broad token interaction alive
- expert blocks inject conditional capacity between anchors
- mamba-style blocks carry efficient sequence mixing
- the tail block is reserved for specialized consolidation logic
This is why MegaCpp keeps a glossary for ablock, mblock, eblock,
rblock, and cblock. The goal is not to invent fancy names. The goal is to
make it obvious which part of the stack owns which semantics.
For the family-specific history behind M and R, the checked-in local
continuations are author Mamba3 spec and
M2RNN mixer spec sample:
they explain why those letters are schedule families, not generic buzzwords for
"anything recurrent."
For the model-family labels, use the same caution. NAM56R is the better
grounded public recipe family because the site samples include its pattern,
feature placement, launch split, and translation plan. NAM52 appears in the
public article set as an earlier hybrid family label and comparison anchor, but
it is not exposed as one public recipe surface as complete as NAM56R.
What survived into the current architecture story
Several ideas are durable enough to keep in public copy:
- pattern notation is meaningful only if the runtime and tests respect it
- MoE blocks should be treated as first-class architectural objects, not as a hidden option bit
- state-space blocks should not be described as ordinary transformer layers if the runtime gives them their own path
- parameter accounting must stay explicit about total versus active capacity
That is also why the MegaCpp glossary matters. The vocabulary is part of the architecture discipline. It gives scheduling, profiling, and evaluation notes a shared language.
What to avoid in public wording
Public wording should avoid two shortcuts.
The first is treating MegaCpp-specific names as universal standards. ablock and
mblock are useful names inside MegaCpp. They are not industry-wide terms.
The second is flattening the architecture into one feature list. A hybrid model is not just "attention plus Mamba plus MoE." The relevant claim is that those mechanisms are arranged as a pattern, and that the runtime is allowed to honor that pattern instead of pretending every depth slot is equivalent.
Frequently asked questions
Why not just call this a transformer with a few extras?
What do the pattern letters buy in practice?
Does the pattern string also tell you where MLA, Engram, MTP, or MoD live?
A, E, M, or R families, and which ones belong to the training objective rather than the stack itself.
What does AEMEAEMEAEMR decode to on first read?
Why separate total and active parameters in public descriptions?
How should this be compared with public hybrid models like Jamba or StripedHyena?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
mblock: The state-space or Mamba-family block in MegaCpp's A/M/E/R notation.
NAM56R: A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.
AEMEAEMEAEMR: A concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.
MLA: Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.
DSA: DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.
Architecture: This naming scheme is an execution vocabulary, not branding. In this stack, A, M, E, and R are stable symbols for attention, Mamba-family state-space layers, expert/MoE layers, and recurrent tails. Strings such as AEMEAEMEAEMR are used by launch builders and tests to describe ordered hybrid layouts. Names such as NAM52 and NAM56R are shorthand for concrete recipes whose real meaning lives in builder code, schedule patches, precision gates, and launch assertions rather than in any single README.
SLM training: A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…
MoE: Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
Mamba: A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…
H200: NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.