MegaCpp Engineering: Applied C++ model systems


Article

Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 15 min read • David Gornshtein

H200

Training

Performance

Nam52

NAM56R

MoE

Mamba

Training speed anatomy on H200

Q: What do goodput and badput mean in this article?

Goodput is the fraction of wall time spent doing useful training-step work; badput is the wall time lost to compilation, checkpointing, data loading, eval, or idle gaps. The checked-in goodput tracker sample is the compact local definition.
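A minimal sketch of that definition, assuming wall time is logged as labeled intervals. The category names and the `goodput_report` helper are hypothetical and are not taken from the checked-in goodput tracker sample; badput is reported here as a fraction of wall time so that it sits next to goodput on the same scale.

```python
from collections import defaultdict

# Hypothetical category names; only "step" counts as useful work in this sketch.
USEFUL = {"step"}


def goodput_report(intervals):
    """Summarize (category, seconds) pairs that together cover the wall time."""
    totals = defaultdict(float)
    for category, seconds in intervals:
        totals[category] += seconds
    wall = sum(totals.values())
    if wall == 0:
        return {"goodput": 0.0, "badput": 0.0, "by_category": {}}
    useful = sum(t for c, t in totals.items() if c in USEFUL)
    return {
        "goodput": useful / wall,
        "badput": (wall - useful) / wall,
        "by_category": dict(totals),
    }


report = goodput_report([
    ("compile", 120.0),
    ("step", 780.0),
    ("checkpoint", 60.0),
    ("idle", 40.0),
])
print(f"goodput={report['goodput']:.2f} badput={report['badput']:.2f}")
```

Keeping the per-category totals in the report is the point of the design: a single goodput number cannot say whether the loss came from compile, checkpointing, or idle gaps.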

Q: Why did an attention optimization not move end-to-end throughput?

Because attention may be a small slice of the real step on a mixed or expert-heavy lane. GPU profile receipt sample and measured optimization receipts are the local proof surfaces for that claim: they keep observed dispatch, throughput, and memory together so a narrow kernel win cannot be mistaken for a whole-step win.

Q: What overlap facts should a speed receipt record?

At minimum: whether CUDA_DEVICE_MAX_CONNECTIONS was left unset or overridden, whether communication and compute actually ran on separate streams, whether bucket sizes or overlap flags were active, and whether compile warmup was explicit or skipped. Without those fields, "overlap enabled" is not enough to separate a genuinely concurrent lane from a serialized one. The checked-in compile/runtime receipt sample is the compact lane header, and NCCL and collective hangs is the transport-side companion. On H200-class lanes it is also worth saying whether the bucket sweep stayed in the larger regime these runs can actually use instead of inheriting tiny older defaults.

Q: Why keep both allocated and reserved memory in an H200 speed receipt?

Because they answer different questions. The allocated figure tracks live tensor demand, while the reserved figure tracks the larger pool the caching allocator is still holding. If reserved climbs while allocated stays relatively flat, the lane is often seeing fragmentation or cache growth rather than a new steady-state model footprint. That distinction matters on H200 because allocator drift can look like a throughput regression long before it becomes an outright OOM. Profiler and receipts is the article-length explanation, and GPU profile receipt sample is the compact local companion.
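One way to read the two counters together is sketched below. The `classify_memory_drift` helper and its 10% threshold are illustrative assumptions, not part of the checked-in receipts. On a CUDA lane the inputs could come from `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`; plain integers are used here so the sketch runs without a GPU.

```python
GiB = 1024 ** 3


def classify_memory_drift(samples, growth_threshold=0.10):
    """Compare first and last (allocated_bytes, reserved_bytes) samples.

    The threshold is a hypothetical cutoff for "relatively flat".
    Both counters are assumed to be positive in the first sample.
    """
    (a0, r0), (a1, r1) = samples[0], samples[-1]
    alloc_growth = (a1 - a0) / a0
    reserved_growth = (r1 - r0) / r0
    if alloc_growth > growth_threshold:
        return "live footprint grew"
    if reserved_growth > growth_threshold:
        return "allocator drift (fragmentation or cache growth)"
    return "stable"


# Made-up numbers: allocated stays near 60 GiB while reserved climbs.
samples = [(60 * GiB, 70 * GiB), (61 * GiB, 78 * GiB), (61 * GiB, 92 * GiB)]
verdict = classify_memory_drift(samples)
print(verdict)
```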

What actually sets training speed on H200 in public MegaCpp reporting: compile warmup policy, block mix, memory shape, and why local wins often fail to move whole-step throughput.

By David Gornshtein

MegaCpp

Focused on applied C++ model engineering


H200 training speed in the current stack is shaped less by any single kernel headline and more by step anatomy: compile policy, block mix, communication overlap, memory shape, and whether a supposed fast path is even active. H200 here means NVIDIA's Hopper H200 GPU platform, typically discussed as an 8-GPU training node with large HBM capacity and NVLink-connected ranks. A lane here means one concrete runtime shape: hardware, compile policy, parallelism, and feature mix taken together, not just "runs on H200." The checked-in public receipts used here show three things clearly: regional_compile + MoE can still stall in explicit warmup, attention can be a small minority of total step time on some lanes, and some of the biggest gains come from removing unnecessary allocations rather than from rewriting the hottest-looking kernel. Batch geometry here means the concrete per-device microbatch, sequence length, and accumulation shape that decides both how much work each step does and how much activation pressure the lane carries while doing it.

The quickest proof surfaces are the checked-in compile warmup policy sample, goodput tracker sample, and measured optimization receipts. That is also why this post crosses naturally into Activation checkpointing deep dive, Mamba 3 kernel journey, Specialists, and Training on 8x H200 SXM: the operator playbook.

People like to ask for a single answer to "what makes H200 fast?" The useful answer is structural. Training speed is the sum of setup costs, compile behavior, forward and backward mix, communication policy, and memory pressure. On a hybrid stack like NAM52 or NAM56R, those pieces vary by pattern. That is why the best H200 notes here are not generic benchmark blurbs; they are receipts tied to exact lanes and exact code paths. The compile-policy half of that story is spelled out in The Compile-Time Tax We Accept for Runtime Speed.

Start with the lane, not the GPU

The checked-in H200 status excerpt and neighboring receipt posts already frame H200 correctly: keep the runtime recipe explicit, and do not blend a stable dense lane with adjacent partial or unstable lanes into one generic "H200" story. That wording matters because it refuses to treat "H200" as a single performance fact. In checked-in form, the shortest lane description is the tuple from the compile/runtime receipt sample plus the block-pattern context in the NAM56R pattern composition sample.

A dense lane, an MoE lane, and a hybrid Mamba lane can all run on the same accelerator and still have very different speed anatomy, which is the same reason Modal vs Owned H200:8 vs TPU treats hardware choice as a lane-choice problem rather than a logo-choice problem.

The same point shows up in public instrumentation patterns: attention can be a modest share of end-to-end step time on a compile-heavy or expert-heavy lane. That is a deceptively important constraint. It means many attention-centric optimization claims are bounded before they start. Even a perfect attention win cannot move total step speed much if attention is a thin slice of the step.

Layer of analysis	Wrong question	Better question
hardware	"How fast is H200?"	"Which H200 lane and which pattern?"
kernels	"Did attention get faster?"	"How much of the step is attention on this lane?"
compile	"Did compile complete?"	"Did explicit warmup help or did it stall the lane?"
memory	"Did peak memory drop?"	"Did the drop enable a larger or more stable training configuration?"

This framing sounds obvious, but it filters out most low-value performance discussion immediately.

Compile policy is part of speed, not just startup overhead

A checked-in compile-warmup sample isolates one of the clearest bottlenecks in this stack: explicit CUDA compile warmup on regional_compile + MoE lanes. Regional compile here means compiling narrower subgraphs or model regions instead of one monolithic end-to-end graph, which can reduce some runtime costs but also creates its own startup-policy edge cases. Compile warmup here means an intentional pre-step compile pass that tries to populate kernels and caches before real training work starts. The sample makes clear that the slowdown is not well explained by blaming a newer PyTorch alone. Instead, the strongest regression sits in compile-warmup policy. That is the same contract problem described in The Compile-Time Tax We Accept for Runtime Speed, just seen from the throughput side instead of the compiler side.

The policy interaction matters. Earlier guardrails narrowed some compile failures, but later logic still re-enabled explicit compile warmup on a lane shape where the checked-in receipts show that warmup remained a blocker. The live repros disproved the idea that this class of H200 regional_compile + MoE lane was now safe to pre-warm unconditionally.

The sample then walks through three useful receipts.

A stalled warmup lane with warmup enabled still gets stuck in compile warmup even with effective MoD bypass.
A manual no-warmup workaround lane skips warmup, keeps caches growing lazily, and progresses into autotune and MoE forward work.
A patched default lane adopts the same practical behavior automatically by skipping explicit warmup on CUDA regional_compile + moe_enabled paths.

This is a strong lesson because it shows how startup policy changes total runtime behavior. Compile warmup is often described as a harmless front-loaded cost that pays off later. Here it was an execution blocker. On this class of H200 lanes, skipping explicit warmup was not a micro-optimization. It was the difference between stalling in setup and reaching the real training path.

Another useful decoder is that regional_compile is not just a smaller-graph option. On current expert-heavy lanes it also gives up part of the static CUDA-graph-friendly story, which is one reason MoE dispatch can turn warmup into CPU-managed churn instead of a one-time precompile win.

# Skip explicit warmup only on the lane shape where the receipts show it stalls.
if device == "cuda" and regional_compile and moe_enabled:
    compile_warmup = "skipped"
else:
    compile_warmup = "normal_policy"

That simple rule is more valuable than a dozen vague claims about "compiler stability." The checked-in compile warmup policy sample is the narrow local proof surface for that exact skip logic, and regional compile dynamic batch sample is the compact companion showing how compile policy and batch geometry interact on a real lane instead of as separate abstractions.

The practical fallback suggested by those receipts is narrower than it sounds: skip explicit warmup for the full expert-heavy lane, but still precompile the clearly static pieces on their own. Dense attention or MLP regions can still benefit from precompile, while dynamic routing stays on the eager path and fills caches lazily during real steps. That keeps the useful part of warmup without pretending the whole regional_compile + MoE lane has become graph-friendly.
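That fallback can be sketched as a small planning function. Everything named here is hypothetical: the `Region` type, its `dynamic` flag, the region names, and the assumption that the normal policy warms every region. It only illustrates the split between static regions that are precompiled and routing regions that are left to fill caches lazily.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    dynamic: bool  # True when shapes or dispatch depend on routing decisions


def plan_warmup(regions, device, regional_compile, moe_enabled):
    """Return (precompile, lazy) region-name lists for one lane."""
    expert_heavy = device == "cuda" and regional_compile and moe_enabled
    if not expert_heavy:
        # Assumed normal policy: warm every region up front.
        return [r.name for r in regions], []
    precompile = [r.name for r in regions if not r.dynamic]
    lazy = [r.name for r in regions if r.dynamic]
    return precompile, lazy


regions = [
    Region("dense_attention", dynamic=False),
    Region("mlp", dynamic=False),
    Region("moe_routing", dynamic=True),
]
precompile, lazy = plan_warmup(regions, "cuda", True, True)
print("precompile:", precompile)
print("lazy:", lazy)
```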

Whole-step speed comes from dominant work, not from the most fashionable kernel

Once the lane gets past setup, the next question is where the time actually goes. The checked-in public samples provide several grounded examples that push against hype-driven optimization. The shortest measurement readback is measured optimization receipts, which keeps the condition, throughput delta, and memory delta together instead of flattening them into one number.

An attention-light lane is one example. Another comes from the checked-in Mamba linear cross-entropy sample, where the output-layer loss contract stays explicit during refactors instead of being inferred from matching shapes alone. That kind of receipt matters because memory-shape cleanups only count if the logits-to-loss contract is still correct after the cleanup.

The checked-in DSA indexer memory sample tells a similar story from another direction. It makes the dense score surface explicit and shows why a non-materializing or fused path exists in the first place. On paper that is a memory optimization; in practice it is exactly the kind of change that can pay back in allocator pressure, launch stability, and configuration headroom.

That is the same reason some of the most valuable H200 wins do not look glamorous in isolation. A fused DSA (DeepSeek Sparse Attention) indexer or a cut-cross-entropy style loss path helps because it deletes giant intermediates, allocator churn, and logit materialization from the dominant lane, not because the kernel name is fashionable. On H200, memory-shape repairs often become speed work precisely when they keep the real microbatch and communication plan intact, which is the same boundary explored in DSA indexer memory fix and Mamba linear CE parity deep dive.

The lesson is straightforward. Kernel-level work matters, but whole-step speed on H200 is often governed by:

whether the kernel is on the active path
how much of the step that path occupies
whether memory shape is forcing smaller microbatches or triggering instability
whether communication or compile policy is the real limiter

This is just Amdahl's law in lane form. On a hybrid H200 step, making a 15% slice twice as fast removes only 7.5% of the step time, roughly a 1.08x whole-step speedup, while removing a compile-warmup blocker or deleting a giant intermediate can change the whole run because those costs sit on the dominant path. That is why the best H200 receipts keep dominant step share next to the measured delta instead of celebrating kernel-local speedups in isolation.
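The arithmetic behind that bound is short enough to keep next to any receipt. The sketch below is plain Amdahl's law; the `whole_step_speedup` helper name is illustrative, and the 15% share is used only as an example value.

```python
def whole_step_speedup(slice_share, slice_speedup):
    """Amdahl's law for one slice of the step.

    slice_share: fraction of step time spent in the slice (0..1).
    slice_speedup: factor by which that slice alone gets faster (> 0).
    """
    if not 0.0 <= slice_share <= 1.0:
        raise ValueError("slice_share must be between 0 and 1")
    if slice_speedup <= 0:
        raise ValueError("slice_speedup must be positive")
    return 1.0 / ((1.0 - slice_share) + slice_share / slice_speedup)


two_x = whole_step_speedup(0.15, 2.0)
# Upper bound: the slice becomes free.
limit = whole_step_speedup(0.15, float("inf"))
print(f"2x on a 15% slice: {two_x:.3f}x whole step")
print(f"infinitely fast 15% slice: {limit:.3f}x whole step")
```

Even the limiting case stays below 1.18x, which is why the share of the step belongs in the receipt alongside the kernel-local number.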

For expert-heavy NAM52 and NAM56R-style lanes, that attention share can be lower than people expect: low-teens on one lane, mid-teens on another, with routing, loss, and communication taking the rest of the step. That is not a universal benchmark claim. It is a useful smell test for hybrid receipts. If the lane is truly expert-heavy and the report still implies attention owns most of wall time, the first question should be whether the receipt measured the real production path or a narrower kernel-focused slice.

That is why local profiler wins frequently fail to move the end-to-end number. They solve the wrong slice.

Communication and overlap still matter, but only when the lane can use them

The same checked-in receipts also record several communication-side changes with real performance implications: Megatron-style bucket handling, overlap-related harness patterns, and optional fp32 gradient-reduction support in the optimizer path. These are meaningful because they alter how much of the step is hidden behind communication and how stable the reduction path remains. For the expert-specific side of that story, Expert parallel and MoE sharding is the natural companion, while goodput tracker sample is the smallest checked-in proof of how compile, step, checkpoint, and idle time stay separated. That same separation is why Profiler and receipts belongs in this lane too: if the wall clock is dominated by startup badput, a steady-state kernel story alone is incomplete.

But once again, the repo is careful not to overclaim. Some items are explicitly called out as no-ops or partial truths. Shared expert overlap is not treated as a free speedup if the concurrency path is not really there. Router dtype claims are checked against autocast reality. And several deferred items are named honestly instead of being retroactively counted as delivered throughput work.

Overlap also needs a real concurrency contract. Bucket tuning only helps if the buckets are large enough to keep the link busy but not so large that compute waits on one giant reduction, and shared-expert overlap only counts when communication and compute actually live on separate streams under a runtime configuration that allows them to overlap. A flag that says "overlap" without that stream reality is still a serialized lane.

The minimal overlap contract is concrete. Record the actual CUDA_DEVICE_MAX_CONNECTIONS setting for the lane instead of assuming one universal H200 default, put NCCL traffic on a real communication stream, leave shared-expert or dense compute on the compute stream, and use buckets large enough to saturate the H200 link instead of tiny legacy defaults that only create launch overhead. If any one of those conditions is missing, the lane may still be correct, but it should not be reported as communication-hidden.
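A receipt that carries those fields might look like the sketch below. The `OverlapReceipt` type, its field names, and the bucket floor are hypothetical. The only real identifier is the `CUDA_DEVICE_MAX_CONNECTIONS` environment variable, which is recorded verbatim rather than judged, since the text above asks for the actual setting and not an assumed default.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OverlapReceipt:
    max_connections: Optional[str]  # raw CUDA_DEVICE_MAX_CONNECTIONS, None if unset
    comm_on_separate_stream: bool   # observed at runtime, not inferred from a flag
    bucket_bytes: int
    min_bucket_bytes: int           # hypothetical lane-scaled floor

    def may_report_overlap(self) -> bool:
        # Stream split and bucket size are both required before claiming overlap.
        return (self.comm_on_separate_stream
                and self.bucket_bytes >= self.min_bucket_bytes)


def capture_receipt(comm_on_separate_stream: bool, bucket_bytes: int,
                    min_bucket_bytes: int,
                    env: Mapping[str, str] = os.environ) -> OverlapReceipt:
    return OverlapReceipt(
        max_connections=env.get("CUDA_DEVICE_MAX_CONNECTIONS"),
        comm_on_separate_stream=comm_on_separate_stream,
        bucket_bytes=bucket_bytes,
        min_bucket_bytes=min_bucket_bytes,
    )


MiB = 1024 ** 2
good = capture_receipt(True, 256 * MiB, 64 * MiB,
                       env={"CUDA_DEVICE_MAX_CONNECTIONS": "1"})
flag_only = capture_receipt(False, 256 * MiB, 64 * MiB, env={})
print(good.may_report_overlap(), flag_only.may_report_overlap())
```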

That last detail is why bucket numbers belong in the receipt instead of in a side note. The checked-in communication write-up keeps the public rule simple: use the lane-scaled bucket policy, record whether the larger-bucket regime was actually active, and do not report "overlap" unless the stream split and launch policy made overlap possible in the first place. Comms cost and overlap is the local continuation when the question stops being "did overlap exist?" and becomes "which communication policy hid real wall time?"

This is exactly how H200 performance work should be reported. If overlap exists only on paper, do not count it. If a precision policy helps only one branch of the model, say so. If a parity gap means a supposedly optimized path is not running in the current recipe, then the right answer is not "performance is disappointing" but "the fast path is not actually active."

Speed lever	When it helps	When it disappoints
comm bucket tuning	communication-bound or overlap-friendly lanes	compute-dominant lanes where comm is already hidden
fp32 grad reduction	stability-sensitive large runs	pure speed chasing when bf16 reduction was already sufficient
expert overlap	real concurrent execution exists	overlap flag is present but runtime does not overlap meaningful work
attention-kernel upgrades	`A`-heavy lanes with big attention share	mixed or `E`-heavy lanes where attention is a minor slice

That table sounds modest, but it is the honest one.

Memory shape is often the hidden governor of H200 throughput

On large accelerators it is tempting to assume memory is no longer the main issue. The receipts here say otherwise. Several of the most meaningful wins in MegaCpp come from reducing useless buffer materialization or from keeping an intermediate out of the graph entirely. The Mamba output-layer parity fix is a good example because it removes a large per-slot allocation. The DSA indexer reproducer is another because it collapses an extremely large intermediate into a much smaller fused buffer with matching math. If your next question is whether to trade memory for recompute instead, Activation checkpointing deep dive is the right adjacent post.

Why does that matter for speed instead of only feasibility? Because memory shape changes everything else. It can determine whether a target microbatch fits, whether pipeline slots remain stable, whether the runtime spends time fighting allocator pressure, and whether a lane can stay on the intended fast path instead of dropping into a fallback.

Scale makes that claim concrete. A naive long-context DSA indexer can transiently want tens of GB of routing state, and a naive loss path can create its own short-lived giant logit spike at the output boundary. The fused or on-the-fly variants matter because they delete those peaks rather than merely moving them around. On H200 that usually shows up as the real win: the intended microbatch fits, allocator churn stops dominating the wall clock, and the overlap plan has a chance to be real.
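A back-of-the-envelope calculation shows how quickly such peaks reach that range. All shapes below are made up for illustration, and the dense `[batch, heads, seq, seq]` layout is an assumption about what a naive score surface looks like; it may not match the actual indexer or loss path in the checked-in samples.

```python
GiB = 1024 ** 3


def dense_score_bytes(batch, heads, seq_len, bytes_per_elem):
    """Bytes for a fully materialized [batch, heads, seq, seq] score tensor."""
    return batch * heads * seq_len * seq_len * bytes_per_elem


def logits_bytes(tokens, vocab_size, bytes_per_elem):
    """Bytes for a fully materialized [tokens, vocab] logits tensor."""
    return tokens * vocab_size * bytes_per_elem


# Hypothetical long-context shapes: 64K tokens, 8 heads, bf16 scores.
score = dense_score_bytes(batch=1, heads=8, seq_len=65536, bytes_per_elem=2)
# Hypothetical output boundary: 64K tokens, 128K vocabulary, fp32 logits.
logits = logits_bytes(tokens=65536, vocab_size=131072, bytes_per_elem=4)
print(f"dense scores: {score / GiB:.0f} GiB, logits: {logits / GiB:.0f} GiB")
```

Under these assumed shapes either intermediate alone is a large fraction of a single device's memory, which is why a path that never materializes it changes which microbatch fits.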

The same principle appears in the checked-in compile and runtime samples. If an explicit warmup policy or a bulky intermediate pushes the lane into instability, then the theoretical kernel speed on the steady-state path becomes irrelevant. The run never reaches the clean steady state you thought you were measuring.

For NAM56R-scale discussions this is especially important. Model family labels are not decorative. They tell you the likely pressure points. A NAM52 run and a NAM56R run can differ not just in size but in which path becomes memory-sensitive first. The checked-in NAM56R NeMo recipe sample, MLA integration pattern sample, and index-cache patch nearcopy are the compact local proof surfaces for those shape-specific pressure points.

Mamba and expert paths need separate H200 narratives

Another recurring mistake is to collapse all non-dense work into one bucket. The sources point in the opposite direction. Mamba-specific work, expert-routing work, and sparse-attention work each have their own limiting factors. That is why The Mamba 3 Kernel Journey and Specialists need to stay separate references instead of being rolled into one generic "advanced kernels" story.

The checked-in Mamba linear cross-entropy sample shows a subtler point: on modern Triton and H200, a seemingly obvious backward-kernel cleanup can be mostly neutral because compiler passes already remove the redundant work. That is exactly the kind of receipt that keeps an optimization program honest. Not every attractive kernel diff translates into a big H200 win on a current toolchain.

The compiler-side decoder is sharper than "sometimes the gain is small." On a current Triton toolchain, masked or otherwise dead backward paths can vanish before they ever become real H200 work: arithmetic disappears, and so do the corresponding load/store instructions when they do not reach the final gradient. That is why some kernel-local cleanups benchmark as nearly neutral. If the old path was already dead-code-eliminated, the next real win has to come from removing a live buffer, allocator churn, or a synchronization edge instead.

Expert paths tell a different story. Public H200 receipts show that regional_compile + MoE has special sensitivity during startup, and that router and overlap claims have to be checked against runtime truth. Together these examples say that expert-heavy H200 speed is influenced as much by compile policy and routing execution reality as by pure GEMM speed.

The result is that the H200 speed story should usually be split three ways:

dense or mostly attention-heavy lanes
expert-heavy lanes with routing and regional-compile considerations
Mamba or recurrent-heavy lanes where projection and state behavior matter more than attention folklore

That split is far more predictive than any single accelerator-wide headline number.
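As a minimal sketch of that three-way split, the snippet below labels a lane by whichever component takes the largest share of measured step time. The component keys (`attention`, `expert`, `mamba`), the label strings, and the function name are hypothetical names chosen for illustration; they are not taken from the checked-in samples.

```python
from typing import Mapping

# Hypothetical component names mapped to the three lane kinds named in the text.
LANE_BY_COMPONENT = {
    "attention": "dense_or_attention_heavy",
    "expert": "expert_heavy",
    "mamba": "mamba_or_recurrent_heavy",
}


def classify_lane(step_share: Mapping[str, float]) -> str:
    """Return the lane label for the component with the largest step-time share."""
    known = {name: share for name, share in step_share.items()
             if name in LANE_BY_COMPONENT}
    if not known:
        raise ValueError("step_share contains no known component")
    dominant = max(known, key=known.get)
    return LANE_BY_COMPONENT[dominant]


# Illustrative shares only; real values would come from a profile of the lane.
print(classify_lane({"attention": 0.15, "expert": 0.60, "mamba": 0.25}))
```

The point of the sketch is only that the label follows the measured share, not the model name or the chip.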

What to put in an H200 speed receipt

The right output format is boring and specific. That is good. A credible H200 speed receipt should include the model family, pattern string, compile policy, whether warmup was explicit or skipped, whether the lane is dense or MoE-heavy, any major precision or overlap knobs, and one note about dominant step share if you have it.

For example:

family: NAM52
pattern: AEME
device: H200
lane: cuda_regional_compile
moe_enabled: true
compile_warmup: skipped
dominant_cost: expert_forward_plus_comm
attention_share: low
notes: local attention win will not move total much on this lane

That kind of record lets future readers compare apples to apples. It also prevents performance folklore from spreading across lanes that do not share the same anatomy.
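A record like the one above is easy to check mechanically. The following is a rough sketch, assuming the receipt is stored as plain `key: value` lines exactly as shown; the `REQUIRED_FIELDS` tuple and both helper names are illustrative assumptions, not the project's actual schema.

```python
from typing import Dict, List

# Assumed minimum set of fields, taken from the example record in the text.
REQUIRED_FIELDS = (
    "family", "pattern", "device", "lane",
    "moe_enabled", "compile_warmup", "dominant_cost",
)


def parse_receipt(text: str) -> Dict[str, str]:
    """Parse 'key: value' lines into a dict, splitting on the first colon."""
    receipt: Dict[str, str] = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {line!r}")
        receipt[key.strip()] = value.strip()
    return receipt


def missing_fields(receipt: Dict[str, str]) -> List[str]:
    """List required fields that the receipt does not carry."""
    return [name for name in REQUIRED_FIELDS if name not in receipt]


EXAMPLE = """
family: NAM52
pattern: AEME
device: H200
lane: cuda_regional_compile
moe_enabled: true
compile_warmup: skipped
dominant_cost: expert_forward_plus_comm
attention_share: low
notes: local attention win will not move total much on this lane
"""

receipt = parse_receipt(EXAMPLE)
print(missing_fields(receipt))
```

A receipt that fails this kind of check is usually not worth comparing against anything else yet.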

One more field pays for itself in public receipts: the disqualifier. If regional_compile is still forcing warmup churn, if batch geometry changed between runs, or if overlap never became concurrent on real streams, say that in the same record as the win. That keeps a training receipt from collapsing into kernel theater, and it is why Profiler and receipts, Comms cost and overlap, and Graph recompilation hell are companion reads rather than separate topics.

The same discipline applies to measurement conditions. The checked-in optimization-receipt helper keeps warm-cache state, memory delta, and the comparison condition on the same row, which is exactly the right antidote to loose H200 speed claims. If one result used a warm Inductor or Triton cache and another changed the live batch shape, the comparison is already contaminated before anyone argues about kernels. Measured optimization receipts and Profiler and receipts are the local proof surfaces for keeping those conditions attached to the claim.

The main conclusion is simple. H200 training speed is determined by the executed lane, not by the chip name alone. In the current stack, compile warmup policy, block mix, communication overlap, and memory shape explain more than fashionable kernel narratives do. The best wins come from removing fake work, activating the right real path, and reporting speed with enough structure that someone else can tell what was actually measured. If the next question is how those same shape choices turn into fit or OOM boundaries, the direct companions are H200 memory geometry and Why a 4B-8B model fills an H200 and still OOMs.

FAQ

Frequently asked questions

What should I verify first on a slow H200 lane?

Start with the lane definition, not the profiler screenshot. Compile/runtime receipt sample tells you whether the lane is compiled, regional, dynamic-batch, or MoE-heavy; compile warmup policy sample tells you whether startup should have been explicit or skipped; and goodput tracker sample tells you whether the slowdown is really in step time or is still compile, checkpoint, or idle badput.

What do goodput and badput mean in this article?

goodput is the fraction of wall time spent doing useful training-step work; badput is the wall time lost to compilation, checkpointing, data loading, eval, or idle gaps. The checked-in goodput tracker sample is the compact local definition.
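The definition can be sketched as simple arithmetic. This is an illustrative helper under the assumption that badput is tracked as seconds per category; the function name and the numbers are hypothetical and are not taken from the goodput tracker sample.

```python
from typing import Mapping


def goodput_fraction(wall_seconds: float, badput_seconds: Mapping[str, float]) -> float:
    """Fraction of wall time left for training steps after subtracting badput."""
    lost = sum(badput_seconds.values())
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    if lost > wall_seconds:
        raise ValueError("badput exceeds wall time; check the accounting")
    return (wall_seconds - lost) / wall_seconds


# Made-up one-hour window: 600 s lost in total, 3000 s of useful step work.
badput = {
    "compilation": 300.0,
    "checkpointing": 120.0,
    "data_loading": 60.0,
    "eval": 90.0,
    "idle": 30.0,
}
print(goodput_fraction(3600.0, badput))  # 3000 / 3600
```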

Why can high kernel-active time still mislead on H200?

Because a busy kernel trace can still sit inside a bad lane if compile, data loading, checkpointing, or memory-shape stalls dominate total wall time. Goodput tracker sample, GPU profile receipt sample, and measured optimization receipts only become useful together: one keeps wall time honest, one keeps the kernel slice visible, and one ties the measured delta back to throughput and memory instead of to activity alone.

How does batch geometry change speed even when the kernels stay the same?

Because microbatch size, sequence length, and accumulation shape decide whether the lane reaches a stable fast path or spends its time in smaller, more interruption-prone steps. The local companion is regional compile dynamic batch sample, and the memory-side explanation is H200 memory geometry.
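A small worked example makes the geometry point concrete. The sketch below uses the standard token-count product for data-parallel training with gradient accumulation; the specific numbers are illustrative assumptions, not measured lane settings.

```python
def tokens_per_optimizer_step(microbatch: int, seq_len: int,
                              grad_accum_steps: int, data_parallel_ranks: int) -> int:
    """Tokens consumed by one optimizer step across all data-parallel ranks."""
    return microbatch * seq_len * grad_accum_steps * data_parallel_ranks


# Two geometries with the same token budget per optimizer step.
wide = tokens_per_optimizer_step(microbatch=2, seq_len=4096,
                                 grad_accum_steps=8, data_parallel_ranks=8)
narrow = tokens_per_optimizer_step(microbatch=1, seq_len=4096,
                                   grad_accum_steps=16, data_parallel_ranks=8)
print(wide, narrow)  # same total, but the narrow lane runs twice as many micro-steps
```

Both geometries feed the optimizer the same number of tokens, yet the second one executes twice as many smaller forward and backward passes, which is where the interruption-prone behavior comes from.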

Why did an attention optimization not move end-to-end throughput?

Because attention may be a small slice of the real step on a mixed or expert-heavy lane. GPU profile receipt sample and measured optimization receipts are the local proof surfaces for that claim: they keep observed dispatch, throughput, and memory together so a narrow kernel win cannot be mistaken for a whole-step win.

Why can an overlap flag still be a no-op on H200?

Because overlap needs real concurrent execution: communication on its own stream, compute on another, and bucket sizes plus runtime settings that let both stay busy. If the runtime still serializes those phases, the flag changes the config more than the wall clock.

What overlap facts should a speed receipt record?

At minimum: whether CUDA_DEVICE_MAX_CONNECTIONS was left unset or overridden, whether communication and compute actually ran on separate streams, whether bucket sizes or overlap flags were active, and whether compile warmup was explicit or skipped. Without those fields, "overlap enabled" is not enough to separate a genuinely concurrent lane from a serialized one. The checked-in compile/runtime receipt sample is the compact lane header, and NCCL and collective hangs is the transport-side companion. On H200-class lanes it is also worth saying whether the bucket sweep stayed in the larger regime these runs can actually use instead of inheriting tiny older defaults.
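Those fields can be captured in a few lines. The sketch below is an assumption about how such a record might be built: `CUDA_DEVICE_MAX_CONNECTIONS` is the real environment variable named above, but the function, its parameters, and the output keys are hypothetical, and whether streams were actually separate has to be supplied from a profile rather than inferred.

```python
import os
from typing import Dict, Mapping, Optional


def overlap_facts(
    separate_streams_observed: bool,
    overlap_flag: bool,
    bucket_size_mb: float,
    compile_warmup: str,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, object]:
    """Collect the overlap-related facts a speed receipt should state explicitly."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_DEVICE_MAX_CONNECTIONS")
    return {
        # Record "unset" distinctly from any overridden value.
        "cuda_device_max_connections": "unset" if raw is None else raw,
        "separate_streams_observed": separate_streams_observed,
        "overlap_flag": overlap_flag,
        "bucket_size_mb": bucket_size_mb,
        "compile_warmup": compile_warmup,
        # The flag alone is not enough; concurrency must have been observed.
        "overlap_concurrent": overlap_flag and separate_streams_observed,
    }


facts = overlap_facts(
    separate_streams_observed=False,
    overlap_flag=True,
    bucket_size_mb=256.0,
    compile_warmup="explicit",
    env={"CUDA_DEVICE_MAX_CONNECTIONS": "1"},
)
print(facts["overlap_concurrent"])
```

In this made-up case the flag is on but the streams were not observed to be separate, so the record says the lane was not concurrent.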

What makes two H200 speed receipts comparable?

Only compare receipts that keep the lane header stable: same pattern family, same batch geometry, same compile policy, same overlap policy, and the same note about whether the lane is dense, expert-heavy, or recurrent-heavy. If one run changed microbatch shape or warmup policy while the other changed a kernel, that is not one experiment; it is two lane changes. Compile/runtime receipt sample, regional compile dynamic batch sample, and Training speed by feature are the compact local surfaces for making that comparison honestly.
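The comparison rule can be sketched as a header diff. The key names in `LANE_HEADER` are hypothetical stand-ins for the fields listed above; a real receipt would use whatever names the checked-in samples define.

```python
from typing import Dict, Mapping, Tuple

# Assumed header keys mirroring the fields named in the text.
LANE_HEADER = (
    "pattern_family", "batch_geometry", "compile_policy",
    "overlap_policy", "lane_kind",
)


def header_diff(a: Mapping[str, str], b: Mapping[str, str]) -> Dict[str, Tuple]:
    """Return header fields whose values differ between two receipts."""
    return {key: (a.get(key), b.get(key))
            for key in LANE_HEADER if a.get(key) != b.get(key)}


def comparable(a: Mapping[str, str], b: Mapping[str, str]) -> bool:
    """Two receipts are comparable only when the whole lane header matches."""
    return not header_diff(a, b)


baseline = {
    "pattern_family": "NAM52", "batch_geometry": "mb2_seq4096_acc8",
    "compile_policy": "regional", "overlap_policy": "on",
    "lane_kind": "expert_heavy",
}
candidate = dict(baseline, batch_geometry="mb1_seq4096_acc16")
print(header_diff(baseline, candidate))
```

A non-empty diff means the kernel delta and the lane change are entangled, so the two numbers should not be reported as one experiment.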

When is a memory fix more valuable than a new kernel?

When the old memory shape is forcing smaller microbatches, causing allocator churn, or pushing the lane off the intended fast path. The checked-in DSA indexer memory sample is a good local example: the change looks like a memory cleanup, but the operational effect is lane stability and batch headroom. In those cases the memory fix changes the whole run, not just one operator.

Why keep both allocated and reserved memory in an H200 speed receipt?

Because they answer different questions. allocated tracks live tensor demand, while reserved tracks the larger pool the caching allocator is still holding. If reserved climbs while allocated stays relatively flat, the lane is often seeing fragmentation or cache growth rather than a new steady-state model footprint. That distinction matters on H200 because allocator drift can look like a throughput regression long before it becomes an outright OOM. Profiler and receipts is the article-length explanation, and GPU profile receipt sample is the compact local companion.
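The "reserved climbs while allocated stays flat" pattern can be checked with a small helper. This is a sketch under stated assumptions: the samples are taken to be `(allocated_bytes, reserved_bytes)` pairs, for example read from PyTorch's `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` at step boundaries, and both thresholds are arbitrary illustrative values rather than recommended settings.

```python
from typing import Sequence, Tuple


def reserved_drift(
    samples: Sequence[Tuple[float, float]],
    allocated_tolerance: float = 0.05,
    reserved_growth: float = 0.20,
) -> bool:
    """True when reserved memory grew notably while allocated stayed roughly flat.

    Compares the first and last (allocated, reserved) samples only.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    first_alloc, first_reserved = samples[0]
    last_alloc, last_reserved = samples[-1]
    if first_alloc <= 0 or first_reserved <= 0:
        raise ValueError("first sample must be positive")
    alloc_change = (last_alloc - first_alloc) / first_alloc
    reserved_change = (last_reserved - first_reserved) / first_reserved
    return abs(alloc_change) <= allocated_tolerance and reserved_change >= reserved_growth


# Made-up byte counts: allocated moves 2.5 percent, reserved grows 40 percent.
drifting = [(40e9, 50e9), (40.5e9, 60e9), (41e9, 70e9)]
print(reserved_drift(drifting))
```

Comparing only the endpoints is a deliberate simplification; a longer run would probably want a windowed or per-step view.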

Which checked-in files best anchor the claims here?

Use compile/runtime receipt sample plus compile warmup policy sample for the startup rule, goodput tracker sample for wall-time accounting, measured optimization receipts and GPU profile receipt sample for matched throughput and memory deltas, and the NAM56R NeMo recipe sample plus DSA indexer memory sample for shape-specific pressure.

Where should I go after this if I need the rest of the H200 lane?

Use Training on 8x H200 SXM: the operator playbook for the full reading order. It keeps the operator bring-up, memory cliffs, and kernel follow-through in one place instead of making you reconstruct the lane article by article.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.

Expert parallelism

Expert parallelism (EP) partitions MoE experts across GPUs: 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then a second all-to-all sends the outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.
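The ownership arithmetic can be sketched in a few lines. This assumes contiguous block assignment (experts 0 to 7 on rank 0, 8 to 15 on rank 1, and so on), which is a common layout but is an assumption here rather than something the glossary states; the function names are hypothetical.

```python
from typing import List, Sequence


def expert_owner(expert_id: int, num_experts: int = 64, ep_size: int = 8) -> int:
    """Rank that owns an expert under contiguous block assignment (assumed layout)."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must divide evenly across EP ranks")
    if not 0 <= expert_id < num_experts:
        raise ValueError("expert_id out of range")
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank


def tokens_per_rank(expert_choices: Sequence[int],
                    num_experts: int = 64, ep_size: int = 8) -> List[int]:
    """Count how many routed tokens land on each EP rank after the first all-to-all."""
    counts = [0] * ep_size
    for expert_id in expert_choices:
        counts[expert_owner(expert_id, num_experts, ep_size)] += 1
    return counts


# Four tokens pick experts owned by rank 0, one picks an expert on rank 7.
print(tokens_per_rank([0, 1, 2, 3, 63]))
```

The skewed count in the example is the load-imbalance cost in miniature: one rank does most of the FFN work while the others wait for the return all-to-all.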

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

Grounding

SLM training in MegaCpp: what the stack optimizes for and what stays explicit

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

Mamba

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

Topic hubs

H200 Training and Kernel Bring-Up

A curated path through the H200 lane: operator bring-up, step-time anatomy, memory pressure, and the NVIDIA kernel surfaces that actually moved the stack.
