Throughput vs quality knobs: which trade-offs are real
A grounded map of the knobs that actually move the throughput-quality frontier in hybrid NAM52 and NAM56R training, based on public code, articles, and upstream references.

The real throughput-versus-quality knobs are not cosmetic flags. The big levers are block pattern, expert routing behavior, auxiliary-head policy, precision scope, and checkpoint or recompute policy. Those knobs matter because they change which work dominates a step. A pattern like AEMEAEMEAEMR pays for attention, expert routing, and recurrent-state handling differently than a mostly dense stack, so the same optimization can be a major gain in one family and almost irrelevant in another.
Start with the pattern notation instead of the family nickname: in this stack, A means an attention block, M means a Mamba-style state-space block, E means an expert or MoE block, and R means the recurrent-style tail. The quickest local decoders are SLM architecture, model glossary, and the checked-in NAM56R block taxonomy sample.
Public code and notes
- DSA indexer sample
- Mamba linear CE parity sample
- NAM56R Megatron plan sample
- NAM56R block taxonomy sample
- Distributed debugging notes
- Hybrid layout notes
The easiest mistake in model tuning is to treat throughput and quality as separate checklists. Public materials point the other way. The published architecture description combines attention, Mamba, expert, and recurrent-style pieces, and the public implementation surfaces make it clear that runtime changes can also affect parity and training behavior. If you want a useful knob map, you have to start from the pattern notation and from the exact block mix that a run uses.
Start from the block pattern, not from folklore
The hybrid notation used here is the right place to start. In public code and docs that same split often appears as ablock, mblock, eblock, and rblock. The checked-in taxonomy sample makes the mapping explicit: A-blocks own attention-heavy token mixing, M-blocks own state-space sequence mixing, E-blocks own routed expert capacity, and the R-block tail keeps its own recurrent-style seam. The published notes also describe the stack as combining Mamba-3 hybrid layers, sparse attention work, MTP, and other extensions rather than a single uniform transformer shape. That is why pattern strings such as AEMEAEMEAEMR are more than labels: they are cost models in shorthand.
Once you read the training stack this way, the throughput-quality frontier becomes much less mysterious. An A-heavy model responds strongly to attention-kernel changes, KV handling, normalization, and sequence-length policy. An E-heavy model responds much more to router behavior, token distribution, expert overlap, and capacity handling. An M-heavy model shifts the story again, because recurrent state and Mamba-specific projections can dominate the non-attention part of the step.
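To make the "cost model in shorthand" reading concrete, here is a minimal sketch that decodes a pattern string into per-family block counts and shares. The letter mapping follows the A/M/E/R notation described above; the function name `decode_pattern` and the family labels are hypothetical and not taken from the public code.

```python
from collections import Counter

# Letter-to-family mapping from the A/M/E/R notation described above.
BLOCK_FAMILIES = {
    "A": "attention",
    "M": "mamba_state_space",
    "E": "expert_moe",
    "R": "recurrent_tail",
}


def decode_pattern(pattern: str) -> dict[str, dict[str, float]]:
    """Return the count and share of each block family in a pattern string."""
    unknown = set(pattern) - set(BLOCK_FAMILIES)
    if unknown:
        raise ValueError(f"unknown block letters: {sorted(unknown)}")
    counts = Counter(pattern)
    total = len(pattern)
    return {
        BLOCK_FAMILIES[letter]: {
            "count": counts[letter],
            "share": counts[letter] / total,
        }
        for letter in BLOCK_FAMILIES
        if counts[letter]
    }


if __name__ == "__main__":
    for family, stats in decode_pattern("AEMEAEMEAEMR").items():
        print(f"{family:18s} count={stats['count']} share={stats['share']:.2f}")
```

Block share is not time share. It only says which families are present and in what proportion, which is the starting point for deciding where to profile.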
| Pattern element | Main compute pressure | Typical quality lever | Why it changes the frontier |
|---|---|---|---|
| A / ablock | attention kernels, QKV projection, sequence length | context handling, mask semantics, RoPE behavior | attention cost scales differently from expert or recurrent work |
| M / mblock | state update, mixer kernels, specialized projections | long-context recurrence behavior | recurrent compute can replace or complement attention cost |
| E / eblock | router, expert GEMMs, token shuffles | specialization, capacity, token coverage | routing policy changes both speed and learning behavior |
| R / rblock | recurrent memory update and scheduling | persistence and temporal bias | different state path than pure attention or MoE |
That table is the reason generic advice fails. "Enable the faster kernel" is not a universal tuning strategy when half the time is not in the kernel you are staring at.
Architectural knobs are the first-order ones
The biggest real knobs are architectural. In practice that means block pattern, MoE routing policy, shared-versus-routed expert balance, MTP depth, and whether a feature is truly on or merely parsed. Public issue discussions and code examples show why this matters. For example, if a routing mode is exposed in configuration but execution still follows a different policy, a user can think they changed a quality or throughput knob when they actually did not. That is the worst kind of knob: visible in config, absent in execution.
MoE settings are especially high leverage because they change both math and traffic. Public materials provide several useful reality checks: router dtype is not an independent truth when a run is already under bf16 autocast, shared expert overlap is not free if there is no concurrent stream path, and FP8-related claims need to be tied to the exact expert path rather than treated as a blanket model speedup. In other words, expert knobs are real, but only when they are wired into the path that actually executes.
The same logic applies to MTP. Public examples separate MTP as a training-only feature and make mtp=False the real off switch, with explicit mtp_depth=1 and mtp_depth=3 lanes. That matters because auxiliary heads are not just a quality idea; they add projection, loss, memory, and optimizer work. If you compare two runs and forget to name MTP depth, you are not comparing like with like. Mamba linear CE parity deep dive is the narrower continuation for the places where a throughput fix is really an output-contract fix.
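One way to catch a knob that is visible in config but absent in execution is to diff the requested settings against whatever the run recorded from the code path it really took. The sketch below assumes a hypothetical `effective` dictionary filled in by the training loop at execution time; the function name, keys, and values are illustrative and do not come from the public code.

```python
def find_fake_knobs(requested: dict, effective: dict) -> dict:
    """Return knobs whose requested value is missing from, or differs in, the executed path."""
    mismatches = {}
    for name, wanted in requested.items():
        if name not in effective:
            # Parsed from config but never reported by the executing code path.
            mismatches[name] = {
                "requested": wanted,
                "effective": None,
                "reason": "not reported",
            }
        elif effective[name] != wanted:
            # Reported, but the run used a different value than the config asked for.
            mismatches[name] = {
                "requested": wanted,
                "effective": effective[name],
                "reason": "overridden",
            }
    return mismatches


if __name__ == "__main__":
    requested = {"router_policy": "expert_choice", "mtp_depth": 3, "recompute": "selective"}
    # Hypothetical values recorded by the run itself.
    effective = {"router_policy": "topk", "mtp_depth": 3}
    for name, info in find_fake_knobs(requested, effective).items():
        print(name, info)
```

An empty result does not prove a knob matters, only that the requested value reached the path that executed.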
model_family: NAM56R
pattern: AEMEAEMEAEMR
major_knobs:
  moe_enabled: true
  mtp_depth: 3
  regional_compile: true
  grad_reduce_in_fp32: true
  recompute: selective
report_rule: always publish pattern plus active architectural knobs
That kind of structured report is much more useful than a one-line claim that one run was "faster" or "better."
Precision and communication knobs are real, but conditional
The second tier of knobs sits around precision, gradient movement, and overlap. These are still real, but they only pay off if the bottleneck actually matches the knob. Public examples are clear on this point. One visible setting is grad_reduce_in_fp32 through the optimizer path, explicitly keeping gradient buffers in float32 for the communication and writeback route. Distributed optimizer stress is the direct companion for that knob because it shows the drift and collective-order bugs that appear when the optimizer contract is wrong.
Those changes matter because communication policy can alter both stability and throughput. But they are not interchangeable with architecture. If attention is only a small fraction of the step, then an attention-kernel win moves the total less than you expect. In public analysis, attention can be a small share of step time under some regional-compile configurations. That single detail is enough to kill a lot of misleading performance narratives.
A similar story shows up in the Mamba path. Public notes describe targeted work on Mamba in-proj fusion and also explain why a full replacement is higher risk: state-dict migration, extension parity, and FP8 integration all complicate the move. That is a good example of a knob that looks local but is actually architectural. Fusing an in-proj can help throughput, but if the rest of the Mamba path is still excluded from the broader precision stack, the total impact is narrower than a dashboard might suggest.
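The arithmetic behind that point is Amdahl's law: if a component takes fraction f of the step and becomes s times faster, the whole step speeds up by 1 / ((1 - f) + f / s). The sketch below applies it to a few attention shares; the shares and the 2x kernel speedup are illustrative assumptions, not measurements from any named run.

```python
def step_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: overall speedup when `fraction` of step time gets `local_speedup`x faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


if __name__ == "__main__":
    # Illustrative shares only: a 2x faster attention kernel at different attention shares.
    for share in (0.05, 0.20, 0.50):
        print(f"attention share {share:.0%}: step speedup {step_speedup(share, 2.0):.3f}x")
```

With a 5% attention share, a 2x kernel win yields only about a 1.03x step speedup, while the same win at a 50% share yields about 1.33x.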
| Knob | Throughput upside | Quality or stability risk | Grounded reading |
|---|---|---|---|
| grad_reduce_in_fp32 | better reduction robustness, sometimes steadier scaling | higher comm cost than bf16-only path | useful when reduction quality is part of the bottleneck |
| FP8-scoped expert path | expert GEMM speedups | recipe mismatch, coverage gaps, parity debt | only count wins on the path actually using FP8 |
| TE fusion in Mamba ingress | lower projection overhead | migration and extension compatibility risk | useful, but not a blanket Mamba rewrite |
| bucket and overlap tuning | better comm-compute overlap | easy to mis-measure if the step is compute-bound elsewhere | worth naming in structured reports, not overgeneralizing |
The key idea is that these are path-sensitive knobs. You have to know where the time is before you can rank them.
Checkpointing and recompute policy are often the most honest trade-off
Engineers often describe checkpointing as a pure throughput loss taken only to fit memory. The current codebase gives a more nuanced picture. Recompute policy changes the shape of the step, the peak-memory envelope, and sometimes which model size is even runnable on a given lane. That means it is one of the cleanest real throughput-quality knobs because it determines whether you can afford a larger or richer model at all.
Public notes repeatedly separate memory-saving moves from architectural ones. There are explicit discussions in activation checkpointing deep dive, deferred comparisons on recompute policy, and the importance of reporting exact runtime lanes instead of collapsing everything into one speed number. In the Mamba path this is even clearer: the implementation includes a dedicated recompute surface, which is a reminder that recurrent-style paths have their own memory behavior and their own honest trade-off surface.
For MoE, the memory story is even sharper. Public examples show how changing an implementation surface can shrink a large intermediate into a much smaller fused buffer while preserving the intended math. The point is not only that one kernel is better; it is that memory shape changes what model and batch configurations are feasible. On a real training lane, that feasibility boundary feeds back into quality because the affordable model, context, and batch policy change.
This is why recompute and memory-shape decisions deserve to sit next to routing and MTP in any serious tuning discussion. They are not housekeeping.
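To illustrate how a recompute policy reshapes the peak-memory envelope, here is a toy model of textbook uniform-segment checkpointing: keep one checkpoint per segment and hold one full segment live while it is recomputed in the backward pass. This is an assumption-laden sketch, not the selective recompute policy used in the stack, and the layer count is illustrative rather than a NAM52 or NAM56R figure.

```python
import math


def peak_activation_units(num_layers: int, segment: int) -> int:
    """Toy peak count of stored layer activations under uniform segment checkpointing.

    One checkpoint is kept per segment, and one full segment is live while it
    is recomputed during the backward pass.
    """
    if num_layers <= 0 or not 1 <= segment <= num_layers:
        raise ValueError("need num_layers > 0 and 1 <= segment <= num_layers")
    checkpoints = math.ceil(num_layers / segment)
    return checkpoints + segment


if __name__ == "__main__":
    layers = 64  # illustrative depth only
    for segment in (1, 4, 8, 16, 64):
        print(f"segment={segment:2d} peak_units={peak_activation_units(layers, segment)}")
```

In this toy model the peak is smallest near a segment length of sqrt(num_layers), and the price is roughly one extra forward pass per segment. The freed memory is what lets a larger model, context, or batch fit on the same lane.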
Quality knobs must be described with activation windows
Some features do not impose steady per-step cost. They activate later, activate conditionally, or matter only after a schedule boundary. If you measure them without stating the active window, you can easily mark a costly feature as free or a useful feature as irrelevant.
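A minimal sketch of window-aware measurement: split per-step timings at the activation boundary and report the two means separately instead of one blended number. The step times and the `active_from` boundary are made-up illustrative values, and the function name is hypothetical.

```python
from statistics import mean


def windowed_step_time(step_times: list[float], active_from: int) -> dict[str, float]:
    """Mean step time before and after a feature's activation step (0-indexed)."""
    before = step_times[:active_from]
    after = step_times[active_from:]
    if not before or not after:
        raise ValueError("need samples on both sides of the activation boundary")
    return {
        "inactive_mean": mean(before),
        "active_mean": mean(after),
        "blended_mean": mean(step_times),
    }


if __name__ == "__main__":
    # Illustrative timings in seconds: the feature turns on at step 6 of 8.
    times = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.4, 1.4]
    print(windowed_step_time(times, active_from=6))
```

Here the blended mean of about 1.1 s hides a 40% per-step cost once the feature is active, which is exactly the kind of result that gets a costly feature marked as nearly free.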
Public discussions already show concrete examples of this measurement problem. When quality-facing behavior is silently different from what a flag suggests, the benchmark itself becomes hard to trust. Those are not just correctness bugs; they are benchmarking traps. If a feature activates differently than you think, your throughput-quality chart is fiction.
That is also why family labels such as NAM52 and NAM56R should never be dropped from performance reports. They encode scale, intended recipe, and usually a different mix of active paths. A sentence like "better on a modern accelerator" is vague. A sentence like "better on NAM52, AEME-leaning pattern, no MTP, selective recompute" is usable.
Here the articles and public examples complement each other well. The article side provides the model taxonomy and training-path framing, while the public examples show how a local kernel or class-parity fix can move memory or throughput on named configurations. For example, the Mamba linear cross-entropy example shows how restoring output-layer parity removes an unnecessary logits allocation and stabilizes an otherwise OOM-prone lane. That is a throughput gain with a direct architectural interpretation: it comes from changing which output-layer implementation is in play, not from generic optimization folklore.
What to standardize in every tuning report
If you want throughput-versus-quality discussions to become cumulative instead of repetitive, standardize the report format. The strongest version is simple.
Include the model family, the exact pattern string, the active architectural knobs, the precision or communication knobs, the memory policy, the feasible batch or context envelope, and the activation window for any delayed feature. Then publish both the observed speed metric and the quality metric in the same report. That does not solve every interpretation problem, but it removes most of the avoidable ones.
One practical template is:
family=NAME
pattern=AEMEAEMEAEMR
lane=dense accelerator | dense TPU | MoE eval
arch_knobs=moe_enabled,mtp_depth=3,router_policy=topk
runtime_knobs=regional_compile,grad_reduce_in_fp32,recompute=selective
fit_window=context=64k,microbatch=2
quality_window=steps[1000:2000]
throughput_metric=tokens_per_second
quality_metric=task_loss_or_eval
That kind of structure turns performance discussion into engineering evidence. It also makes it easier to compare notes across the articles and public examples, because both speak in terms of concrete lanes, concrete modules, and reproducible reports.
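A report in this shape is easy to check mechanically. The sketch below parses the key=value template and rejects reports with missing fields. The required-field list mirrors the template above, while the function name and the validation rules are assumptions made for illustration.

```python
REQUIRED_FIELDS = (
    "family", "pattern", "lane", "arch_knobs", "runtime_knobs",
    "fit_window", "quality_window", "throughput_metric", "quality_metric",
)


def parse_report(text: str) -> dict[str, str]:
    """Parse key=value lines and raise if any required field is missing or empty."""
    report = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first '=' only, so values such as 'mtp_depth=3' survive intact.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        report[key.strip()] = value.strip()
    missing = [name for name in REQUIRED_FIELDS if not report.get(name)]
    if missing:
        raise ValueError(f"report is missing fields: {missing}")
    return report


if __name__ == "__main__":
    sample = """
    family=NAM56R
    pattern=AEMEAEMEAEMR
    lane=dense accelerator
    arch_knobs=moe_enabled,mtp_depth=3,router_policy=topk
    runtime_knobs=regional_compile,grad_reduce_in_fp32,recompute=selective
    fit_window=context=64k,microbatch=2
    quality_window=steps[1000:2000]
    throughput_metric=tokens_per_second
    quality_metric=task_loss_or_eval
    """
    print(parse_report(sample)["arch_knobs"])
```

Rejecting incomplete reports at parse time keeps a speed claim from circulating without its pattern, lane, and activation window attached.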
The short conclusion is straightforward. Real knobs are the ones that change executed structure, memory shape, routing behavior, or communication semantics. Fake knobs are the ones that exist only in parsed args, in partial audits, or in reports that omit the active path. Once you enforce that distinction, throughput-versus-quality trade-offs stop looking mystical and start looking like what they are: model- and lane-specific engineering choices.
Frequently asked questions
Which knobs are real and which are fake?
Why report the pattern string every time?
A, M, E, and R blocks spend time and memory differently. Without the pattern string, two similar runs can be optimizing different bottlenecks.
Why are recompute and memory-shape decisions treated as quality knobs here?
When does grad_reduce_in_fp32 count as a real knob instead of report noise?
What is the minimum measurement tuple before you claim a knob win?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
- NAM56R: A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.
- AEMEAEMEAEMR: A concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.
- A/M/E/R: MegaCpp shorthand for the four main block families: attention, Mamba/state-space, expert/MoE, and recurrent tail layers.
- ablock: The attention-heavy block family in MegaCpp's A/M/E/R notation.
- mblock: The state-space or Mamba-family block in MegaCpp's A/M/E/R notation.
- MoE: Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
- eblock: The expert / MoE block family in MegaCpp's A/M/E/R notation.
- Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
- rblock: The recurrent tail block family in MegaCpp's A/M/E/R notation.
- RoPE: Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.
- DSA: DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.
- Mamba: A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…
- Training: A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…
- FP8: Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.