MegaCpp Engineering: Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published · 7 min read · David Gornshtein
Architecture
SLM
Hybrid Models
Mamba
MoE
NAM56R

SLM architecture in MegaCpp: hybrid patterns, block ownership, and why the letters matter

A grounded architectural read of the MegaCpp small-model stack: hybrid patterns, block semantics, schedule ownership, and why names like ablock, mblock, and eblock are operational rather than decorative.

MegaCpp does not describe its small-model stack as "a transformer with extras." It describes it as a pattern of block families with different ownership, runtime costs, and scheduling rules. That is why strings like AEMEAEMEAEMR matter. They are not branding. They are the shortest honest description of how capacity is distributed. Decode the letters before anything else: A means an attention block, M means a Mamba-style state-space block, E means an expert or MoE block, and R means the recurrent-style tail. The more implementation-focused version of that claim lives in Unique additions and why they exist, while the data and optimization companions are SLM data and SLM training: architecture only stays useful if the corpus contract and training contract respect the same block map.

If those letters are not already familiar, use MegaCpp model glossary as the companion while reading this page. The checked-in NAM56R block taxonomy sample and NAM56R pattern composition sample are the shortest public-safe receipts for the same decoder.

Glossary checkpoint

The public NAM56R example pack gives one concrete baseline for reading the stack. The short pattern AEMEAEMEAEMR expands to 52 layers, which the checked-in composition sample counts as 13 attention slots, 22 expert slots, 13 Mamba-family slots, and 4 recurrent-tail slots. That expansion is what turns "hybrid" into an actual scheduling problem rather than a marketing adjective.

In other words, the short string already encodes a hardware story: 13 slots pay attention-style costs, 22 slots pay expert-routing and expert-GEMM costs, 13 slots pay state-space costs, and the last 4 slots keep a recurrent-tail seam. If a later post talks about NAM56R without decoding the letters first, this is the page it should send you back to.
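
The arithmetic behind those counts can be checked in a few lines. The sketch below is not the checked-in composition sample. It assumes the 12-letter pattern is tiled cyclically and truncated at 52 slots, an expansion rule chosen here only because it reproduces the counts quoted above. The names `expand_pattern`, `PATTERN`, and `NUM_LAYERS` are hypothetical, and the sketch makes no claim about where each family sits in the real layout.

```python
from collections import Counter
from itertools import cycle, islice

PATTERN = "AEMEAEMEAEMR"  # pattern string quoted in the article
NUM_LAYERS = 52           # layer count quoted in the article


def expand_pattern(pattern: str, num_layers: int) -> str:
    """Tile the pattern cyclically and truncate it to num_layers slots.

    Assumption: cyclic tiling is a guess at the expansion rule. It is used
    here only because it reproduces the published per-family counts.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    return "".join(islice(cycle(pattern), num_layers))


layers = expand_pattern(PATTERN, NUM_LAYERS)
counts = Counter(layers)

# Print one line per family so the split is easy to compare with the article.
for token in "AEMR":
    print(token, counts[token])
```

Under that assumption the script reports 13 A slots, 22 E slots, 13 M slots, and 4 R slots, which sum to 52.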

The quickest checked-in path is:

What the letters mean

MegaCpp uses a project-specific architectural vocabulary:

Token   MegaCpp name   Main job
A       ablock         attention-heavy token mixing
M       mblock         Mamba-style or state-space sequence mixing
E       eblock         MoE or other conditional-capacity blocks
R       rblock         recurrent-style or tail consolidation blocks
C       cblock         context-defined coordination or wrapper blocks

Those names are useful because they force the engineering question that matters: which subsystem owns this block? An ablock is not maintained the same way an eblock is. The attention backend, the state-space scan, the router, and the tail path all have different failure modes and different scaling behavior. That ownership view is also what makes sequence, context, and expert splits readable: once you know which block family owns which semantics, the distributed-axis question stops being a vague "parallelism" problem.
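
One way to make the ownership question concrete is a lookup from pattern letter to owning subsystem that refuses letters it does not know. This is a minimal sketch, not MegaCpp code: the `BlockFamily` dataclass, the `FAMILIES` table, and `owners_for` are hypothetical names, and the owner labels simply reuse the wording of the paragraph above. cblock is left out on purpose, since the article treats it as context-defined glue rather than a base pattern family.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockFamily:
    token: str  # pattern letter
    name: str   # MegaCpp block name from the table above
    owner: str  # owning subsystem (illustrative label, not a real identifier)


FAMILIES = {
    family.token: family
    for family in (
        BlockFamily("A", "ablock", "attention backend"),
        BlockFamily("M", "mblock", "state-space scan"),
        BlockFamily("E", "eblock", "router"),
        BlockFamily("R", "rblock", "tail path"),
    )
}


def owners_for(pattern: str) -> list[str]:
    """Return the owning subsystem for each slot; reject unknown letters."""
    unknown = sorted(set(pattern) - set(FAMILIES))
    if unknown:
        raise ValueError(f"unknown block tokens: {unknown}")
    return [FAMILIES[token].owner for token in pattern]


print(owners_for("AEMR"))
```

Failing loudly on an unknown letter is the design choice that matters: a pattern that silently accepts a new letter has already stopped being an ownership map.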

The compact checked-in walk from letter to workload is NAM56R block taxonomy sample -> NAM56R pattern composition sample -> NAM56R feature placement sample. The glossary companion is MegaCpp model glossary, which explains where those local words came from and how they change once launch rewriting enters the picture.

That last qualifier matters for cblock, and it should be stated early because the letter can be misleading on first touch. In the broader public-safe examples, cblock is not one fixed NAM56R primitive. The core-block sample uses it for a concept-retrieval block, while the hybrid layout notes use it more loosely for coordination or wrapper logic. The base NAM56R pattern in the public recipe pack remains A/E/M/R, so treat cblock as context-defined glue, not as a default fifth NAM56R family.

Why a small model needs explicit block ownership

Large dense models can hide a lot of architectural redundancy behind scale. Small specialist models cannot. Every repeated mechanism competes for the same parameter budget, activation budget, and compile budget.

MegaCpp's hybrid pattern is a way of allocating that budget intentionally:

That is a more informative story than saying "the model is hybrid." It tells you where the compute is supposed to go and what kind of runtime support each part needs. On real hardware, those choices show up directly in H200 memory geometry and activation recompute boundaries, because each block family stresses a different part of the runtime budget. The expert-heavy slice of that budget is the exact continuation in specialist routing, which is why this architecture note should not be read as if E blocks were just a side flag.

The letters also do not tell the whole execution story on their own. The checked-in feature-placement receipt keeps a second map beside the pattern: input-side n-gram hash and structure enrichment sit before the main stack, MTP lives on the objective suffix, A blocks host MLA plus selected DSA, Engram, and mHC placements, E blocks carry MoE and MoD, M blocks carry Mamba and MIMO, and R blocks stay on an explicit recurrent index list. That extra placement layer is what keeps a hybrid recipe from drifting into "same letters, different model" territory.
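
A second map of this kind can be sketched as a plain dictionary plus one consistency check against the pattern. The feature names below come from the paragraph above. The dictionary layout, the site keys `"input"` and `"objective"`, and the `check_placement` helper are hypothetical and do not reflect the format of the checked-in receipt. The explicit recurrent index list for R blocks is omitted because the article does not give its contents.

```python
# Feature names are taken from the article; the structure is an assumption.
FEATURE_PLACEMENT = {
    "input": ["ngram_hash", "structure_enrichment"],  # before the main stack
    "A": ["MLA", "DSA", "Engram", "mHC"],
    "E": ["MoE", "MoD"],
    "M": ["Mamba", "MIMO"],
    "objective": ["MTP"],                             # objective suffix
}

# Sites that are not block families and so never appear in a pattern string.
NON_STACK_SITES = {"input", "objective"}


def check_placement(pattern: str, placement: dict[str, list[str]]) -> None:
    """Raise if features are attached to a block family the pattern lacks."""
    families = set(pattern)
    for site, features in placement.items():
        if site in NON_STACK_SITES:
            continue
        if site not in families:
            raise ValueError(f"{features} attached to absent family {site!r}")


check_placement("AEMEAEMEAEMR", FEATURE_PLACEMENT)
print("placement is consistent with the pattern")
```

The check is deliberately one-directional: it catches a feature attached to a family that the pattern no longer contains, which is the "same letters, different model" failure in reverse.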

Pattern strings are only useful if runtime respects them

The pattern stops being real the moment the runtime treats every layer as if it were the same class. That is why MegaCpp ties the pattern to three concrete surfaces:

  1. the recipe layer, which expands the declared pattern
  2. the schedule layer, which decides how each block family is executed
  3. the verification layer, which checks the implementation still matches the declaration

A pattern-aware model can drift silently if the schedule becomes too generic. You can still print AEMEAEMEAEMR in logs while routing most of the stack through transformer-default code paths. The fix is not better prose. The fix is keeping block ownership explicit in code and tests.
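
The drift described here is easy to reproduce in miniature. The sketch below is a toy, not the MegaCpp schedule or verification layer: the path labels in `EXPECTED_PATH`, the two schedules, and the helpers `execute` and `drifted_slots` are all hypothetical. It shows a declared pattern staying unchanged while a generic schedule routes every slot through the attention default, and a check that reports which slots no longer match their family.

```python
from typing import Callable, Dict, List

PATTERN = "AEMEAEMEAEMR"

# Code path each family is expected to take (illustrative labels only).
EXPECTED_PATH = {"A": "attention", "E": "moe", "M": "ssm_scan", "R": "recurrent_tail"}

Schedule = Dict[str, Callable[[int], str]]

# Pattern-aware schedule: one handler per family, each reporting its own path.
SPECIFIC: Schedule = {
    token: (lambda slot, path=path: path) for token, path in EXPECTED_PATH.items()
}

# Drifted schedule: every family falls through to the attention default.
GENERIC: Schedule = {token: (lambda slot: "attention") for token in EXPECTED_PATH}


def execute(pattern: str, schedule: Schedule) -> List[str]:
    """Return the code path taken at each depth slot."""
    return [schedule[token](slot) for slot, token in enumerate(pattern)]


def drifted_slots(pattern: str, executed: List[str]) -> List[int]:
    """Return slot indices whose executed path differs from the declaration."""
    return [
        slot
        for slot, (token, got) in enumerate(zip(pattern, executed))
        if EXPECTED_PATH[token] != got
    ]


print(drifted_slots(PATTERN, execute(PATTERN, SPECIFIC)))
print(drifted_slots(PATTERN, execute(PATTERN, GENERIC)))
```

With the generic schedule, every non-attention slot is reported even though the pattern string printed in a log would look identical in both runs.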

The architecture claim

The defensible public claim is narrower than a marketing slogan:

  • MegaCpp uses a pattern-driven hybrid architecture.
  • The letters correspond to block families with different runtime ownership.
  • The runtime is allowed to treat those families differently.
  • Parameter accounting should distinguish total capacity from active capacity.

That last point matters especially for MoE-heavy variants. Once routed experts, shared experts, attention blocks, and state-space blocks coexist, "model size" stops being one obvious number. Total parameters and active parameters should not be collapsed into one headline. For launch and sizing conversations, that is the same distinction carried into NAM56R launch policy: total stored parameters and active per-token capacity should not be presented as the same number.
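
A small accounting sketch shows how far the two numbers can diverge. Everything numeric below is made up for illustration and is not a NAM56R figure: the expert counts, the top-k value, the per-expert size, and the dense parameter count are assumptions, and router weights are ignored for simplicity. Only the count of 22 expert slots is borrowed from the article. `MoELayer` and `account` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayer:
    routed_experts: int     # experts stored in the layer
    shared_experts: int     # experts every token passes through
    top_k: int              # routed experts activated per token
    params_per_expert: int

    @property
    def total_params(self) -> int:
        # Everything that must be stored, whether or not a token uses it.
        return (self.routed_experts + self.shared_experts) * self.params_per_expert

    @property
    def active_params(self) -> int:
        # What a single token actually touches in this layer.
        return (self.top_k + self.shared_experts) * self.params_per_expert


def account(dense_params: int, moe_layers: list[MoELayer]) -> tuple[int, int]:
    """Return (total, active) parameter counts for a dense-plus-MoE stack."""
    total = dense_params + sum(layer.total_params for layer in moe_layers)
    active = dense_params + sum(layer.active_params for layer in moe_layers)
    return total, active


# Illustrative values only; none of these come from a MegaCpp recipe.
layer = MoELayer(routed_experts=16, shared_experts=1, top_k=2, params_per_expert=1_000_000)
total, active = account(dense_params=50_000_000, moe_layers=[layer] * 22)
print(f"total={total:,} active={active:,}")
```

With these made-up values the stack stores 424,000,000 parameters while a single token activates 116,000,000, which is why one headline number would hide either the memory story or the compute story.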

A practical reading of the pattern

For a pattern like AEMEAEMEAEMR, the useful interpretation is:

A E M E A E M E A E M R

- attention anchors keep broad token interaction alive
- expert blocks inject conditional capacity between anchors
- mamba-style blocks carry efficient sequence mixing
- the tail block is reserved for specialized consolidation logic

This is why MegaCpp keeps a glossary for ablock, mblock, eblock, rblock, and cblock. The goal is not to invent fancy names. The goal is to make it obvious which part of the stack owns which semantics.

For the family-specific history behind M and R, the checked-in local continuations are author Mamba3 spec and M2RNN mixer spec sample: they explain why those letters are schedule families, not generic buzzwords for "anything recurrent."

For the model-family labels, use the same caution. NAM56R is the better grounded public recipe family because the site samples include its pattern, feature placement, launch split, and translation plan. NAM52 appears in the public article set as an earlier hybrid family label and comparison anchor, but it is not exposed as one public recipe surface as complete as NAM56R.

What survived into the current architecture story

Several ideas are durable enough to keep in public copy:

  • pattern notation is meaningful only if the runtime and tests respect it
  • MoE blocks should be treated as first-class architectural objects, not as a hidden option bit
  • state-space blocks should not be described as ordinary transformer layers if the runtime gives them their own path
  • parameter accounting must stay explicit about total versus active capacity

That is also why the MegaCpp glossary matters. The vocabulary is part of the architecture discipline. It gives scheduling, profiling, and evaluation notes a shared language.

What to avoid in public wording

Public wording should avoid two shortcuts.

The first is treating MegaCpp-specific names as universal standards. ablock and mblock are useful names inside MegaCpp. They are not industry-wide terms.

The second is flattening the architecture into one feature list. A hybrid model is not just "attention plus Mamba plus MoE." The relevant claim is that those mechanisms are arranged as a pattern, and that the runtime is allowed to honor that pattern instead of pretending every depth slot is equivalent.

FAQ

Frequently asked questions

Why not just call this a transformer with a few extras?
Because the runtime does not own every block the same way. Attention, state-space, and expert blocks have different scheduling paths, scaling limits, and failure modes.
What do the pattern letters buy in practice?
They give launch planning, profiling, and debugging a stable map of which subsystem owns each depth slot. Without that, logs can preserve the pattern string while the runtime quietly collapses back to generic defaults.
Does the pattern string also tell you where MLA, Engram, MTP, or MoD live?
No. The pattern tells you block-family ownership, not every feature attachment. The companion NAM56R feature placement sample is the public-safe receipt for which features live on the input side, which ones attach to A, E, M, or R families, and which ones belong to the training objective rather than the stack itself.
What does AEMEAEMEAEMR decode to on first read?
Read it as attention anchor, expert block, Mamba-style block, expert block, then repeat that motif before a recurrent-style tail. The checked-in NAM56R pattern composition sample is the public-safe receipt for the 52-layer expansion and block counts.
Why separate total and active parameters in public descriptions?
Because routed experts and hybrid blocks store more capacity than any one token activates at once. In MoE-heavy variants, collapsing those numbers into one headline hides the actual compute and memory story.
How should this be compared with public hybrid models like Jamba or StripedHyena?
Use those systems as context, not as a naming authority. Public hybrid work already mixes attention with Mamba or MoE blocks, and other public stacks mix attention with gated-convolution operators. MegaCpp's narrower claim is that its own pattern string must preserve block ownership through scheduling, verification, and parameter accounting.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

ablock

The attention-heavy block family in MegaCpp's A/M/E/R notation.

mblock

The state-space or Mamba-family block in MegaCpp's A/M/E/R notation.

eblock

The expert / MoE block family in MegaCpp's A/M/E/R notation.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

AEMEAEMEAEMR

A concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

rblock

The recurrent tail block family in MegaCpp's A/M/E/R notation.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

Architecture

This naming scheme is an execution vocabulary, not branding. In this stack, A, M, E, and R are stable symbols for attention, Mamba-family state-space layers, expert/MoE layers, and recurrent tails. Strings such as AEMEAEMEAEMR are used by launch builders and tests to describe ordered hybrid layouts. Names such as NAM52 and NAM56R are shorthand for concrete recipes whose real meaning lives in builder code, schedule patches, precision gates, and launch assertions rather than in any single README.

SLM

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Mamba

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…