Published April 18, 2026 • 10 min read • David Gornshtein

MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode

A grounded glossary for MegaCpp model notation, hybrid layer patterns, and block-family names, tied back to live builder code, launch helpers, and regression tests in MegaCpp.

This naming scheme is an execution vocabulary, not branding. In this stack, A, M, E, and R are stable symbols for attention, Mamba-family state-space layers, expert/MoE layers, and recurrent tails. Strings such as AEMEAEMEAEMR are used by launch builders and tests to describe ordered hybrid layouts. Names such as NAM52 and NAM56R are shorthand for concrete recipes whose real meaning lives in builder code, schedule patches, precision gates, and launch assertions rather than in any single README.

When people first meet this codebase, the model names look denser than they really are. The confusion comes from seeing a launch string, a report, and a unit test each preserving a different slice of the same contract. The way to read the notation correctly is to start from the code paths that consume it. In MegaCpp's Megatron-args tests, the feature-plan builder is called with pattern="AEMEAEMEAEMR" and depth=52, then checked for downstream flags such as MLA, MTP, FIM, MoE, and DSA. In the corresponding NAM56R launch coverage, the same pattern is expanded into the launch notation used by the Megatron-side recipe. FP8-related examples explain why only the expert layers run in FP8 for the NAM56R family. Those sources together tell you that the notation is operational: it decides how the model is assembled, scheduled, and optimized.

The stable symbols: A, M, E, and R

The safest part of the glossary is the four-letter alphabet. It is repeated in tests, builder helpers, and patch docstrings, and it is the minimum information you need before reading any pattern string.

Symbol	Meaning in this stack	Where the meaning is grounded
`A`	attention layer	launch-pattern tests and selective precision docs
`M`	Mamba-family state-space layer	the public TE mixer sample, the public TE stack spec sample
`E`	expert / MoE layer	selective FP8 MoE path, hybrid schedule logic
`R`	recurrent tail or custom recurrent-style layer	verified by the launch-notation tests that track custom-layer indices

That mapping is not inferred from blog prose. The clearest compact summary is the module docstring in the public selective-FP8 MoE patch sample, which says that the NAM56R family uses the pattern AEMEAEMEAEMR and explicitly glosses A=attention, E=MoE, M=Mamba/Mamba3, and R=M2RNN. The launch tests then prove that the pattern is not just documentation. get_custom_layer_indices(pattern="AEMEAEMEAEMR", depth=52, custom_symbols=("R",)) is expected to return (12, 24, 36, 48), which means the recurrent symbol is discovered programmatically after tiling the pattern through depth.
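The tiling step can be reproduced in a few lines. The sketch below is not the MegaCpp helper; `tile_pattern` and `custom_layer_indices` are hypothetical stand-ins written only to show why the expected tuple is (12, 24, 36, 48). It assumes the pattern is repeated cyclically, truncated to `depth`, and indexed from one, which is the counting that agrees with the test values quoted above.

```python
from itertools import cycle, islice


def tile_pattern(pattern: str, depth: int) -> str:
    """Repeat `pattern` cyclically and truncate the result to `depth` symbols."""
    return "".join(islice(cycle(pattern), depth))


def custom_layer_indices(pattern: str, depth: int, custom_symbols=("R",)) -> tuple:
    """Hypothetical stand-in: one-based positions of the custom symbols after tiling."""
    tiled = tile_pattern(pattern, depth)
    return tuple(i for i, symbol in enumerate(tiled, start=1) if symbol in custom_symbols)


indices = custom_layer_indices("AEMEAEMEAEMR", 52)
print(indices)  # (12, 24, 36, 48)
```

R is the twelfth symbol of a twelve-symbol pattern, so it lands on every multiple of 12 that fits inside depth 52; the next candidate, 60, falls outside the stack.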

That is the first useful mental model: the symbols are the architectural primitive types, and the pattern string is the ordered list from which deeper launch syntax is derived.

What pattern strings actually do

A string like AEMEAEMEAEMR is not a label pasted on the side of the model after the fact. It is a compact serialization of layer order. That matters because this repo does not treat the stack as one repeated homogeneous block. It composes multiple families and then feeds that composition into launch helpers.

The pattern-expansion behavior is easiest to see in the NAM56R launch-notation samples. One example calls build_nam56r_lite_main_pattern(pattern="AEMEAEMEAEMR", depth=52, mtp_depths=1) and checks that the output contains 17 M symbols, 22 E symbols, and 14 attention positions. Another flips use_dsa_symbol=True and checks that all attention positions become D, producing 14 D symbols and an ending of /D-. That tells you two important things.

First, the repeated source pattern is a template that is expanded across model depth. Second, the expanded form can be rewritten for downstream execution modes. Attention is not always emitted as plain attention in the launch string; when the DSA path is enabled, the symbolic schedule reflects that. In other words, the notation is not frozen. It is a boundary representation between architectural intent and runtime specialization.
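The symbol counts quoted for build_nam56r_lite_main_pattern can be cross-checked against a plain tiling of the base pattern. This is a minimal sketch that only tiles and counts; it does not reproduce the builder. Raw tiling gives 22 E symbols, which matches the test directly. The quoted 17 M symbols and 14 attention positions are consistent with the four R slots being emitted as M and with one extra attention slot coming from mtp_depths=1, but that reading is an inference from the arithmetic, not something taken from the builder source.

```python
from collections import Counter
from itertools import cycle, islice

PATTERN = "AEMEAEMEAEMR"
DEPTH = 52

# Tile the 12-symbol template across all 52 layers, then count each family.
tiled = "".join(islice(cycle(PATTERN), DEPTH))
counts = Counter(tiled)

print(dict(counts))  # {'A': 13, 'E': 22, 'M': 13, 'R': 4}
print(counts["M"] + counts["R"])  # 17, if R slots are emitted as M (assumption)
```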

One modifier rule is worth stating explicitly because the research pack makes it easy to overread the launch-facing letters. D and similar symbols are usually downstream rewrites of the base pattern, not replacements for the stable A/M/E/R alphabet. In this article set, D is the launch-facing marker for an attention slot that has been specialized into the DSA path, while the underlying architectural family is still attention. The safe decode order is: read the base pattern first, then apply the execution-mode rewrite.

A small example helps:

Base reference pattern:   A E M E A E M E A E M R
Tiled through depth:      ... repeated to 52 layers ...
Launch-facing rewrite:    attention may become '*' or 'D'
Custom layer extraction:  R positions become tracked tail indices

This is why you should resist the temptation to read a single literal string as the whole truth. The authoritative reading is “pattern plus expansion rules plus feature toggles.”
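The decode order (base pattern first, execution-mode rewrite second) can be sketched as a two-step function. `to_launch_pattern` is a hypothetical name, not a MegaCpp helper, and the sketch covers only the attention rewrite: the '*' and 'D' markers come from the example above, while the MTP suffix and the handling of R are deliberately left out, so R is still visible in the output.

```python
from itertools import cycle, islice


def to_launch_pattern(base: str, depth: int, use_dsa_symbol: bool = False) -> str:
    """Tile the base pattern, then rewrite attention slots for the launcher."""
    tiled = "".join(islice(cycle(base), depth))
    # '*' for plain attention, 'D' when the DSA path is enabled.
    attention_marker = "D" if use_dsa_symbol else "*"
    return tiled.replace("A", attention_marker)


base = "AEMEAEMEAEMR"
plain = to_launch_pattern(base, 52)
dsa = to_launch_pattern(base, 52, use_dsa_symbol=True)

print(plain[:12])  # *EME*EME*EMR
print(dsa[:12])    # DEMEDEMEDEMR
```

The base string is never mutated, which is the point of the rule: the architectural family stays attention, and only the launch-facing spelling changes with the feature toggle.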

NAM52 and NAM56R are recipe handles, not universal standards

The names NAM52 and NAM56R look like official external model families, but in MegaCpp they behave more like recipe handles. They point to a bundle of assumptions: depth, hybrid layout, enabled features, and sometimes hardware-facing expectations captured in code comments or implementation notes.

NAM56R is the more explicit one in the current code and documentation. It appears in launch paths, memory-budget discussion, FP8 work, and Mamba-related code. Examples mention memory savings or budget limits at NAM56R batch sizes because the recipe is used as a practical calibration target for throughput and memory work.

NAM52 shows up more in reports, for example in modern-accelerator prefill receipts. In practice that means NAM52 and NAM56R are best read as stable checkpoints in the project’s experimentation history. They are useful because different patches, reports, and launch helpers agree on them. They are risky when treated as self-explanatory. If you need exact semantics, you still have to read the launch recipe or the feature plan that is active in that lane.

The public-safe research pack adds one useful caution here: NAM56R is the clearer recipe handle because the recurrent R slots, feature placement, and launch translation are all shown directly in checked-in surfaces. That also means a NAM56R throughput or memory receipt usually bakes in the recurring R placements and the active feature plan, not just a bare depth number. NAM52 is still useful, but it more often behaves like an evaluation-family label than a fully spelled-out public recipe.

Name	Safe interpretation	What not to assume
`NAM52`	a specific local recipe / benchmark family used in reports and bring-up	not a universal external architecture standard
`NAM56R`	a local hybrid recipe with a well-used pattern and recurrent tail notation	not a guarantee that every run with the label has identical features

The practical rule is simple: use the names as anchors for discussions, but resolve the actual behavior from the builder path and the launch flags.
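One way to picture that rule is to treat the name as a label attached to a resolved recipe and to compare runs on the resolved fields rather than on the label. The sketch below is illustrative only: `ResolvedRecipe` and the lowercase feature names are hypothetical, and the pattern and depth are the values used in the tests quoted in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolvedRecipe:
    handle: str          # discussion anchor, e.g. "NAM56R"
    pattern: str         # base authoring pattern
    depth: int           # number of layers the pattern is tiled through
    features: frozenset  # active feature toggles (hypothetical names)

    def behaves_like(self, other: "ResolvedRecipe") -> bool:
        """Compare what actually runs, ignoring the handle."""
        return (self.pattern, self.depth, self.features) == (
            other.pattern, other.depth, other.features)


run_a = ResolvedRecipe("NAM56R", "AEMEAEMEAEMR", 52,
                       frozenset({"mla", "mtp", "fim", "moe", "dsa"}))
run_b = ResolvedRecipe("NAM56R", "AEMEAEMEAEMR", 52,
                       frozenset({"mla", "mtp", "fim", "moe"}))

print(run_a.handle == run_b.handle)  # True
print(run_a.behaves_like(run_b))     # False
```

Two receipts can share a handle and still differ in behavior, which is why the handle alone should not be read as a full specification.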

What M actually means here: not generic Mamba, but a specific author path

The M symbol is where generic vocabulary becomes dangerously imprecise. In this repo, “Mamba” could mean the general state-space family, the Transformer Engine stack wrapper, or the authored Mamba3 kernel path. Those are related, but they are not interchangeable.

The top-level description in the public Mamba TE mixer sample is the most direct source. The file defines an authored Mamba3 TE mixer as a drop-in replacement for MambaMixer that keeps TE projection layers while replacing the upstream convolution-plus-SSD scan path with authored Mamba3 kernels. The docstring names the supported behaviors explicitly: trapezoidal discretization, QK-Norm via gated RMSNorm on B/C, learnable B/C bias, complex RoPE on B/C, data-dependent A, and MIMO. It also states that there is no conv1d in this path; the authored scan owns the state-space computation.

The public Mamba TE stack spec sample reinforces the same picture from the stack side. Its module comment describes a stack that preserves upstream TE submodules while swapping in the authored Mamba3 mixer. So when someone casually says “this is an MBlock,” they may be referring to one of three levels:

Informal term	Usually points to
`mblock` / `M` layer	a Mamba-family slot in the hybrid schedule
Mamba TE stack	the stack spec that preserves TE scaffolding
authored Mamba3 path	the specific authored Mamba3 TE mixer with trapezoidal, RoPE, data-dependent `A`, and MIMO support

That distinction matters for debugging. If a report says the Mamba lane is slow, the next question is whether the problem is in the stack integration, the authored kernel path, or an adjacent precision policy. The glossary should preserve that hierarchy instead of flattening it into “M means state space, done.”

What E means: expert layers are runtime islands with their own precision policy

The E symbol is also more specific than “there is some MoE somewhere.” In the current tree, expert layers are treated as the place where certain optimizations make sense even when they do not help the rest of the model. The strongest example is the public selective-FP8 MoE patch sample.

That module explains the central claim in plain language: on the NAM56R family, running FP8 everywhere is slower because Mamba scans and attention remain bandwidth-bound while paying conversion overhead, but the expert FFN GEMMs inside MoE layers do benefit. The patch therefore monkey-patches Megatron’s FP8 context helper so that only the MoE layers stay in FP8 while non-MoE layers fall back to nullcontext() and therefore allocate in BF16. This is not a paper definition of MoE. It is a runtime definition of the expert blocks as the compute islands where reduced precision buys something measurable.
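The selection logic described above can be sketched without touching Megatron or Transformer Engine. Everything here is a stand-in: `fake_fp8_context` only records enter and exit events, and `precision_context` is a hypothetical helper, not the patched Megatron function. The one real API used is `contextlib.nullcontext`, which is the no-op fallback the patch description names.

```python
from contextlib import contextmanager, nullcontext


@contextmanager
def fake_fp8_context(log):
    """Stand-in for a real FP8 autocast context; it only logs its boundaries."""
    log.append("enter-fp8")
    try:
        yield
    finally:
        log.append("exit-fp8")


def precision_context(layer_symbol, fp8_factory):
    """Return the FP8 context for expert layers and a no-op context otherwise."""
    return fp8_factory() if layer_symbol == "E" else nullcontext()


log = []
for symbol in "AEME":
    with precision_context(symbol, lambda: fake_fp8_context(log)):
        log.append(f"run-{symbol}")

print(log)
# ['run-A', 'enter-fp8', 'run-E', 'exit-fp8', 'run-M', 'enter-fp8', 'run-E', 'exit-fp8']
```

Only the E layers are wrapped; the A and M layers run under the no-op context and therefore keep whatever default precision surrounds them.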

That behavior also explains why pattern notation is useful. Once you know where the E positions are, you can target them for feature gating, performance patches, or schedule experiments without rewriting the entire stack. The helper _compute_moe_layer_indices() resolves the active pattern and returns the exact zero-based E positions. In other words, the symbol is not just descriptive. It is how the code discovers which layers should receive a special policy.
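A minimal sketch of that discovery step follows. `moe_layer_indices` is a hypothetical stand-in for the real helper, and it assumes the active pattern is the tiled AEMEAEMEAEMR at depth 52. It uses zero-based positions, as the text above states for the E lookup; the R indices quoted from the launch tests line up with one-based counting instead, so the two should not be mixed.

```python
from itertools import cycle, islice


def moe_layer_indices(pattern: str, depth: int) -> list:
    """Hypothetical stand-in: zero-based positions of E layers after tiling."""
    tiled = "".join(islice(cycle(pattern), depth))
    return [i for i, symbol in enumerate(tiled) if symbol == "E"]


e_positions = moe_layer_indices("AEMEAEMEAEMR", 52)
print(len(e_positions))   # 22
print(e_positions[:5])    # [1, 3, 5, 7, 9]
print(e_positions[-2:])   # [49, 51]
```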

What R means, and why it is preserved instead of erased

The R symbol is easy to miss because there are fewer references to it than to A, M, or E. But the launch tests make clear that it is intentionally preserved long enough to drive special handling. In the NAM56R launch-notation tests, recurrent positions are extracted with custom_symbols=("R",), and the expected indices are hard-coded. Another launch test checks that the final emitted pattern contains no literal R after the main rewrite, because the launch-facing pattern maps it into a form the downstream system understands.

That tells you the right glossary entry for R: it is not “a mysterious spare letter.” It marks custom recurrent-tail layers in the authoring notation, then participates in a rewrite step before the final launch string is emitted. The exact backend implementation may vary, but the presence of the symbol is a first-class part of how the recipe is specified.
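The preserve-then-rewrite sequence can be sketched as one function that returns both the tracked indices and the emitted string. `plan_and_emit` is a hypothetical name, and the replacement symbol is an assumption: "M" is used here only because it agrees with the 17 M symbols quoted for the launch pattern, not because the builder source confirms it. The sketch also skips the attention rewrite and the MTP suffix.

```python
from itertools import cycle, islice


def plan_and_emit(pattern: str, depth: int,
                  custom_symbol: str = "R", launch_symbol: str = "M"):
    """Record custom-layer positions first, then emit a string without them."""
    tiled = "".join(islice(cycle(pattern), depth))
    # Step 1: the authoring symbol is still visible, so its slots can be tracked.
    tail_indices = tuple(i for i, s in enumerate(tiled, start=1) if s == custom_symbol)
    # Step 2: rewrite into a symbol the launcher accepts (assumed to be 'M').
    launch = tiled.replace(custom_symbol, launch_symbol)
    return tail_indices, launch


tail_indices, launch = plan_and_emit("AEMEAEMEAEMR", 52)
print(tail_indices)        # (12, 24, 36, 48)
print("R" in launch)       # False
print(launch.count("M"))   # 17
```

The indices survive even though the letter does not, which is how a recipe can depend on R after the launch string has stopped showing it.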

This is also why local terms such as rblock stay useful. They give people a short name for a structural family that still needs custom handling when the launch pattern is derived.

Where ablock, mblock, eblock, rblock, and cblock fit

The lowercase block-family words are less canonical than the four letters, but they are still useful shorthand when grounded. The project most strongly supports ablock, mblock, eblock, and rblock, because those correspond directly to the stable symbol set. Reports also use terms such as EBlock in exactly that spirit, for example when discussing compile behavior around MoE layers. They should be presented as MegaCpp-specific shorthand, not as industry-standard taxonomy.

The safest way to use these words is as prose aliases for the symbol families:

Term	Safe local reading
`ablock`	an attention-family layer or slot
`mblock`	a Mamba-family layer or slot
`eblock`	an expert/MoE layer or slot
`rblock`	a recurrent-tail layer or slot
`cblock`	use only when the surrounding code or article defines it explicitly

That last line matters. cblock can be a tempting extension, but unless the immediate code or doc surface defines it, the term should be treated cautiously. The rest of the block-family words are grounded by the stable symbol map and by launch tests. cblock is only safe when the surrounding source defines what C stands for in that context.
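The same caution can be expressed as a lookup that resolves the four grounded aliases and refuses anything else unless the caller supplies a local definition. This is an illustrative sketch; `BLOCK_ALIASES` and `symbol_for` are hypothetical names, and the "C" mapping in the usage lines is a placeholder, not a claim about what cblock means.

```python
BLOCK_ALIASES = {"ablock": "A", "mblock": "M", "eblock": "E", "rblock": "R"}


def symbol_for(term: str, local_definitions=None) -> str:
    """Map a block-family word to its symbol, or fail if it is not grounded."""
    key = term.lower()
    if key in BLOCK_ALIASES:
        return BLOCK_ALIASES[key]
    if local_definitions and key in local_definitions:
        return local_definitions[key]
    raise KeyError(f"{term!r} has no grounded meaning in this context")


print(symbol_for("EBlock"))  # E
try:
    symbol_for("cblock")
except KeyError as err:
    print("refused:", err)
# Only resolvable when the surrounding source defines it (placeholder mapping).
print(symbol_for("cblock", {"cblock": "C"}))  # C
```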

How to read a model name without lying to yourself

The best practical workflow is to decode names in layers.

1. Read the stable symbols first: A, M, E, R.
2. Check whether you are looking at the authoring pattern, the expanded launch pattern, or a performance-oriented explanation.
3. Resolve the active feature plan, because MLA, MTP, FIM, DSA, and precision patches can all change how the same structural recipe behaves.
4. Treat NAM52 and NAM56R as local recipe handles whose exact meaning depends on the builder path.

A concrete snippet from the tests captures that mindset:

plan = build_nam56r_feature_plan(pattern="AEMEAEMEAEMR", depth=52, mtp_depths=1)
bundle = build_megatron_args_bundle(
    plan=plan,
    use_mla=True,
    use_mtp=True,
    use_fim=True,
    use_moe=True,
    use_dsa=True,
)

The recipe name alone is not enough. The effective model is the pattern plus the feature plan plus the execution policy. Once you read the notation that way, the glossary stops being mysterious. It becomes a compact map from architectural intent to runtime behavior.

FAQ

Frequently asked questions

Why can a recipe depend on R even when the final launch string no longer shows it?

Because R is an authoring-time symbol before it is a launch-facing one. The feature plan keeps those recurrent-tail slots visible long enough to place the custom handling correctly, then the final emitted string is rewritten into the subset the downstream launcher actually accepts. Losing the literal letter at the end of the handoff does not mean the recurrent-tail budget disappeared. The next useful local companions are M2RNN and Engram memory and NAM56R launch policy.

How should I decode NAM56R when a receipt only gives the recipe name?

Start with the recipe name as a pointer, not as the full architecture. Decode the base AEMEAEMEAEMR pattern, apply the active feature plan, then read the launch-facing rewrite that may turn attention slots into DSA slots or hide custom recurrent-tail markers from the final launcher string. The checked-in NAM56R launch policy and NAM56R translation note are the safest local companions when a benchmark or memory note only says NAM56R.

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

AEMEAEMEAEMR

A concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

A / M / E / R

MegaCpp shorthand for the four main block families: attention, Mamba/state-space, expert/MoE, and recurrent tail layers.

MLA

Multi-head Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

ablock

The attention-heavy block family in MegaCpp's A/M/E/R notation.

mblock

The state-space or Mamba-family block in MegaCpp's A/M/E/R notation.

MoE

Mixture of Experts: the expert-routing layer family. The linked post covers Token Choice vs Expert Choice routing, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

eblock

The expert / MoE block family in MegaCpp's A/M/E/R notation.

rblock

The recurrent tail block family in MegaCpp's A/M/E/R notation.

RMSNorm

Root-mean-square normalization used as an explicit contract seam in the wrapped Mamba3 and Megatron integration paths.

RoPE

Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.

Model notation

A grounded architectural read of the MegaCpp small-model stack: hybrid patterns, block semantics, schedule ownership, and why names like ablock,…

.nv.capmerc

Observed section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.

TMA multicast

The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.

tcgen05.alloc

Documented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.

tcgen05.ld

Documented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.

tcgen05.mma

The Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.

WGMMA

Hopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.

TMEM

Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.

e_flags

The ELF header field whose low architecture bits are rewritten from sm_100a to sm_121a in the public GB10 arch-patch lane.
