MegaCpp EngineeringApplied C++ model systems
</>
Article
Grounded engineering note from the MegaCpp stack
Published 3 min readDavid Gornshtein
Megatron
Nemotron
Recipes
Launchers
NAM56R

How to express a Nemotron-style recipe as pure Megatron CLI

Why MegaCpp keeps high-level recipe objects and then lowers them into a smaller native Megatron flag surface instead of treating one giant launcher as the source of truth.

MegaCpp
Focused on applied C++ model engineering
Article Preview
How to express a Nemotron-style recipe as pure Megatron CLI
Published 3 min readDavid Gornshtein

The useful question is not whether a model can be launched from a shell script. Almost anything can. The useful question is whether the launch surface still distinguishes native MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split flags from custom model seams once the recipe gets large.

MegaCpp's answer is to keep a recipe-level authoring surface and then lower it into a smaller native MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split CLI bundle. That is what makes the launch contract readable. The recipe carries pattern, topology, and feature intent. The emitted CLI carries the part MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split can actually own.

The checked-in route makes that split concrete. The local path is NAM56R NeMo recipe contract sample for the authoring surface, Nemotron recipe to Megatron sample for the lowering step, and Megatron args sample for the emitted native bundle.

Why a recipe object is better than a shell blob

The two closest neighboring posts are NAM56R Megatron translation, which shows the fail-closed pattern mapping on one concrete hybrid, and What Megatron can and cannot split, which explains why some runtime boundaries stay outside one flat launcher even after the pattern has been lowered.

Hybrid models accumulate too many interacting switches to keep the launcher string as the only source of truth. Pattern layout, Mamba or recurrent seams, MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: public-safe MLA integration patterns Reference: fused MLA on NVIDIA, MTP, CUDA graphQuick term guideCUDA GraphsCUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.GroundingAbout: DSA and CUDA graph safety Example: DSA CUDA graph safety sample Example: CUDA graph block validation sample policy, and MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack settings are not all the same kind of decision. Treating them as one flat shell string is how teams lose track of which part is native and which part is local.

The public MegaCpp examples keep that separation visible:

  • a Nemotron-style recipe sample holds the full authoring intent
  • a MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split-args sample emits only the native runtime flags
  • a launcher sample shows how the two are combined without pretending they are the same layer of abstraction

The launch helper matters too. NAM56R launcher profile sample combines the emitted native bundle with the remaining operator policy without pretending the two layers are the same abstraction. That is the right public contract. Native flags should stay native. Custom seams should stay explicit.

What gets lowered and what does not

The practical lowering rule is simple.

MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split-native runtime concerns such as tensor parallelQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding size, pipeline parallelQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample size, sequence parallelQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel, and other well-supported execution flags can be emitted directly. Custom hybrid surfaces such as unsupported block families, custom recurrent seams, or special launch-time graph policy should remain as recipe-level notes or separate launch helpers.

That is why a good recipe translator is not only an arg emitter. It is also a boundary keeper.

In the checked-in examples, TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample, SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel, and the standard MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: public-safe MLA integration patterns Reference: fused MLA on NVIDIA, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack, and DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample flag surfaces lower cleanly into the emitted args bundle. Recipe-mode choice, cache or setup policy, and graph-policy decisions stay outside that CLI and remain part of the launch helper. That is the same containment rule discussed in Shared MLA adapter boundaries and Porting to Megatron friction.

A useful test is whether upstream MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split already documents the feature as a first-class runtime surface. MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: public-safe MLA integration patterns Reference: fused MLA on NVIDIA has an explicit --multi-latent-attention path, MTP has its own prediction-depth configuration and pipeline-layout rules, and MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack exposes concrete router and dispatcher flags such as expert count, top-k, dispatcher type, and grouped GEMM. Those are good lowering targets because MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split owns both their semantics and their runtime constraints.

Where this lands in MegaCpp

MegaCpp keeps this flow as a stack of explicit surfaces:

That stack is more useful than one long launcher because it makes migration and debugging cheaper. If a feature cannot lower cleanly into native MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split, the translator should say so instead of silently approximating it. The clearest checked-in bridge between the recipe object and the emitted CLI is NAM56R Megatron plan sample, and the wrapper around it stays visible in NAM56R launcher profile sample.

FAQ

Frequently asked questions

What should stay outside pure Megatron CLI?+
Anything MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and… cannot honestly own: unsupported hybrid block families, custom recurrent seams, and launch-policy surfaces that exist only to refuse unsafe approximations. If the emitted CLI pretends those are ordinary native flags, it stops being an honest translation.
How do you decide that a feature should graduate from recipe note to native CLI?+
When upstream MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and… documents it as a supported subsystem with explicit flags and runtime rules. MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion., MTP, and standard MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble. controls clear that bar. Pattern strings, fail-closed translation checks, and launcher-only graph policy do not, because they exist to preserve intent or to reject unsafe approximations rather than to configure a native Megatron subsystem.
What should the recipe object look like before lowering?+
It should be boring structured intent, not a hidden launcher. The checked-in contract sample keeps pattern, depth, model shape, precision mode, and parallelism mode together while staying import-safe away from CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes. machines. The lowering step can then emit native flags and leave runtime policy in the launcher profile instead of smuggling it into one shell fragment.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

TP

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

PP

Pipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.

SP

Sequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.

Megatron

Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

Topic hubs