MegaCpp EngineeringApplied C++ model systems
</>
Article
Grounded engineering note from the MegaCpp stack
Published 4 min readDavid Gornshtein
NAM56R
Launch
Runtime
Megatron

NAM56R launch policy

Why a NAM56R launcher is more than translated Megatron arguments, and why runtime policy has to stay explicit alongside the pattern plan.

MegaCpp
Focused on applied C++ model engineering
Article Preview
NAM56R launch policy
Published 4 min readDavid Gornshtein

One mistake repeats in model-porting writeups: the translated model pattern gets published, but the launcher policy stays implicit.

That is not enough for NAM56RQuick term guideNAM56RA concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.GroundingAbout: NAM56R Megatron translation About: MegaCpp model glossary Example: NAM56R Megatron plan sample. The layer plan is only one surface. A real launcher also has to pin runtimeQuick term guideRuntime boundariesWhy MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,…GroundingThe Compile-Time Tax We Accept for Runtime Speed Regional compile without losing the plot flags, parallelism choices, MTP depth policy, and the set of custom seams the run still depends on.

The glossary-first version of that claim is simple: NAM56RQuick term guideNAM56RA concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.GroundingAbout: NAM56R Megatron translation About: MegaCpp model glossary Example: NAM56R Megatron plan sample is not just a name, it is a depth-52 hybrid recipe family whose public-safe examples use the pattern AEMEAEMEAEMRQuick term guideAEMEAEMEAEMRA concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.GroundingAbout: MegaCpp model glossary Example: NAM56R pattern composition sample Example: NAM56R Megatron plan sample. In the checked-in composition sample, that means 13 attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns slots, 22 expert slots, 13 Mamba-family slots, and 4 recurrent-tail slots. A launcher policy exists because those block families do not all need the same runtimeQuick term guideRuntime boundariesWhy MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,…GroundingThe Compile-Time Tax We Accept for Runtime Speed Regional compile without losing the plot treatment.

For first touch, keep three launch-policy surfaces separate:

That is why the checked-in example splits the launch contract into two parts: generated MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split-facing args and fixed runtimeQuick term guideRuntime boundariesWhy MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,…GroundingThe Compile-Time Tax We Accept for Runtime Speed Regional compile without losing the plot policy. The translation half is the subject of NAM56R Megatron translation; the runtimeQuick term guideRuntime boundariesWhy MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,…GroundingThe Compile-Time Tax We Accept for Runtime Speed Regional compile without losing the plot half is where porting to Megatron friction and Muon on Hopper and Blackwell stop being side notes and start changing what the launch actually has to pin.

If you need the naming side before the operator side, start with MegaCpp model glossary and SLM architecture. This article assumes the reader already knows why A/M/E/R are not interchangeable.

Glossary checkpoint: what the launcher is launching

The public-safe samples split the job cleanly:

NAM56R launcher profile sample then shows the operator-facing shape of that contract: layout, r_layer_indices, selected DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample-capable A-layer ranks, TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding/PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample/VPP/EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding, batch geometry, MTP depth, sparse mode, and CUDA-graphQuick term guideCUDA GraphsCUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.GroundingAbout: DSA and CUDA graph safety Example: DSA CUDA graph safety sample Example: CUDA graph block validation sample policy. That is the right public-safe level of detail for a launch article. It is concrete enough to be inspected, but it does not expose machine-local paths or private runbooks.

That split matters because A-block policy and E-block policy diverge quickly. AttentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns-family slots are where MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: shared MLA adapter boundaries, DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample, Engram, and mHC decisions attach. E-blocks are where MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack and MoD policy attach. M-blocks and R-blocks are where custom Mamba and recurrent seams remain visible. A launcher that hides those differences behind one long flag string stops being inspectable.

Why the split is worth documenting

If those two surfaces get mixed together, it becomes hard to tell whether a run is failing because the model plan is wrong or because the launcher policy is wrong. Keeping them separate makes the public contract easier to inspect and easier to change safely. The same separation also keeps operator questions readable: NAM56R Megatron translation answers what lowered, while porting to Megatron friction answers why some seams still need explicit adapters at all.

That split becomes clearer in Throughput vs quality knobs, Sequence, Context, and Expert Splits in the Hybrid Stack, and MoD, MoDA, and MTP: Dynamic Depth and Multi-Token Heads: pattern translation is only one surface, while runtimeQuick term guideRuntime boundariesWhy MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,…GroundingThe Compile-Time Tax We Accept for Runtime Speed Regional compile without losing the plot policy decides which distributed axes, auxiliary heads, and stability guardrails are actually in force.

In the checked-in example set, NAM56R launch contract sample is the compact contract, NAM56R launcher profile sample is the operator-facing profile and environment view, and NAM56R runtime patch-surface sample marks the places where the launch still depends on explicit runtimeQuick term guideRuntime boundariesWhy MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,…GroundingThe Compile-Time Tax We Accept for Runtime Speed Regional compile without losing the plot seams.

Example -> article -> upstream docs

The distributed consequences of this launcher split are easier to read next to sequence, context, and expert splits in the hybrid stack, dualpipe and 3D parallelism on H200 and GB10, and the operator-side H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 posts H200 bringup and naming and training on 8x H200 SXM: the operator playbook.

FAQ

Frequently asked questions

Why separate generated args from runtime policy?+
Because the model pattern and the launch behavior fail differently. Mixing them turns debugging into guesswork.
Is launcher policy just a deployment detail?+
No. Distributed axes, MTP depth, and custom seam activation all change the real runtimeQuick term guideRuntime boundariesWhy MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,… contract.
What stays in fixed policy after the recipe emits native args?+
Pre-import cache setup, recipe-mode selection, CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes./NCCLQuick term guideNCCLNVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes. environment choices, graph policy, batch geometry, and MTP depth stay in the launcher surface. They are not layer-pattern facts. They decide when the process initializes runtimeQuick term guideRuntime boundariesWhy MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,… state and which distributed or auxiliary-head paths are live, so they belong beside H200 memory geometry and CPU Offload and Startup Memory Calibration on H200 and GB10, not inside the translated NAM56RQuick term guideNAM56RA concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label. pattern.
Which checked-in file is the fastest proof of the split?+
NAM56R launch contract sample is the shortest proof because it keeps generated_native_args and fixed_policy as different fields instead of flattening them into one launcher string. NAM56R launcher profile sample is the next file to open if you need the operator-facing layout and runtimeQuick term guideRuntime boundariesWhy MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,… view.
What does the R in NAM56R change for launch policy?+
It keeps the recurrent-tail seam visible. Even when one emitter rewrites the pattern for a downstream syntax, the launcher still has to know whether the run depends on a custom recurrent path rather than a fully native layer.
Is NAM56R just a size label?+
No. In the checked-in recipe sample it also fixes the hybrid pattern, head layout, sequence length, expert settings, and the launch-side decisions that make those pieces runnable together.
Why does mtp_depths stay in launch policy instead of disappearing into pattern translation?+
Because it changes executed runtimeQuick term guideRuntime boundariesWhy MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,… shape, not only recipe syntax. The checked-in NAM56R launcher profile sample keeps mtp_depths beside tpQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node., ppQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors., vpp, and batch geometry for exactly that reason: extra prediction depths change the auxiliary-loss path, memory pressure, and where the training layout has to carry that work. The local follow-ons are MoD, MoDA, and MTP: Dynamic Depth and Multi-Token Heads, which explains why MTP lives with the last pipeline stage, and Throughput vs quality knobs: which trade-offs are real, which treats MTP depth as a reportable runtime knob rather than as invisible recipe metadata.
What should I read next if I care about the distributed side of NAM56R?+
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

TP

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

PP

Pipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.

DualPipe

Bidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.

A / M / E / R

MegaCpp shorthand for the four main block families: attention, Mamba/state-space, expert/MoE, and recurrent tail layers.

AEMEAEMEAEMR

A concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.

EP

Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

Runtime boundaries

Why MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,…

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

Megatron

Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.