Porting To Megatron-Core Is Harder Than It Looks
Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and the bridge layer that makes them line up.

Porting our hybrid stack into NVIDIA Megatron-Core is the single largest framework integration we have done, and it is genuinely hard in ways that are not obvious until the second week. The reason is not that Megatron is badly written - it is not - but that Megatron is a shape, and our models are a slightly different shape. This post is a concrete walk through the adapters we had to build, what they actually paper over, and which pieces are real gaps we could not close.
The first-touch ownership rule is simple. Megatron should own the regular dense
surfaces it already names directly: TransformerConfig, TE-native layer specs,
standard TP or PP machinery, and the optimizer or scheduler surfaces around
them. The bridge should own the irregular seams that still do not have an
honest native home: hybrid A/M/E/R lowering, recurrent mixers, and the
remaining warn-and-map fields.
Why MegaCpp cares about this
The shortest companion read is How to express a Nemotron-style recipe as pure Megatron CLI, because both posts are really about the same rule: keep the native Megatron contract narrow enough that the remaining custom seams stay auditable.
Transformer Engine is where the best kernels live on Hopper and Blackwell today: TELayerNormColumnParallelLinear, TERowParallelLinear, TEDotProductAttention, fused RoPE, fused masked softmax, userbuffer TP comm-overlap, and a working FP8 recipe via fp8_autocast. We want all of it for the NVIDIA training lane.
Megatron-Core also already has a production-tested pipeline-parallel scheduler, expert-parallel all-to-all with grouped GEMM, a distributed optimizer with overlapped reduce-scatter, and a DDP whose bucketing is aware of the TP and EP process groups. Writing our own versions of these is possible - we have done pieces of it - but it is years of work to catch up on kernel fusion alone.
So the question was never "should we use Megatron?" - it was "how much of our architecture survives intact when we do?". The hybrid stack is the hard part: dense attention blocks, MLA blocks, DSA blocks, Mamba/M2RNN blocks, and MoE blocks, composed with an mHC residual stream and MTP heads on top.
Megatron's TransformerConfig was designed around one reasonably regular layer shape repeated num_layers times with optional MoE and optional Mamba. Our layer pattern is irregular on purpose. Everything downstream of that mismatch is friction.
If these bridge terms are new
- TransformerConfig is Megatron-Core's main model-configuration object for the transformer stack. If a field has no honest equivalent there, the bridge has to warn, map only part of it, or keep it custom.
- A bridge layer is the compatibility module that translates MegaCpp's flat runtime/config surface into Megatron-owned objects without making the rest of the model code import Megatron everywhere.
- Warn-and-map means "emit the native subset and loudly document what did not carry across exactly." It is weaker than a perfect native mapping, but stronger than a silent approximation.
- Sequence-first means Megatron's common (T, B, D) tensor order, while batch-first is MegaCpp's usual (B, T, D) order. Adapters in this lane often exist just to cross that ownership boundary honestly.
The fastest local reading order after those definitions is NAM56R Megatron plan sample for the high-level lowering, NAM56R runtime patch surface sample for the seams that stay custom at runtime, and M2RNN mixer spec sample for the most obviously non-native block contract. If you want the emitted native flag bundle after that, Megatron args sample is the companion surface.
What we built in MegaCpp
The same containment strategy appears again in Shared MLA adapter boundaries, where one narrow adapter seam is cheaper than letting MLA-specific conditions leak through every builder path.
The checked-in public samples expose the integration as a family of narrow adapters. They are independent on purpose: each one adapts exactly one Megatron surface so features can be turned on and off without rewriting the rest of the model. NAM56R Megatron plan sample, NAM56R Megatron recipe near-copy, NAM56R runtime patch surface sample, and M2RNN mixer spec sample are the shortest checked-in path through that boundary: plan, fail-closed translation, runtime patch layer, and recurrent seam.
The bridge layer is the spine. It lazy-imports Megatron runtime surfaces, guards every call site behind availability checks so non-Megatron environments never trip over missing CUDA dependencies, and owns the lifecycle for Megatron's process groups. The centerpiece is a config-mapping function that translates our flat training configuration into Megatron's transformer configuration. The mapping is dense and painful. A minimal version looks like this:

    # bridge-layer config mapping (abbreviated)
    import torch
    import torch.nn.functional as F
    from megatron.core.transformer.transformer_config import TransformerConfig

    def get_megatron_config(gpt_config, **overrides):
        hidden = gpt_config.n_embd
        head_dim = hidden // gpt_config.n_head
        # GQA: Megatron's num_query_groups is our n_kv_head, with MHA encoded as None
        num_query_groups = None if gpt_config.n_kv_head == gpt_config.n_head else gpt_config.n_kv_head
        swiglu = gpt_config.activation == "swiglu"
        if swiglu:
            ffn = int(8 * hidden / 3)
            ffn = ((ffn + 7) // 8) * 8  # match our SwiGLU round_up_8
        else:
            ffn = 4 * hidden
        cfg = dict(
            num_layers=gpt_config.n_layer,
            hidden_size=hidden,
            num_attention_heads=gpt_config.n_head,
            num_query_groups=num_query_groups,
            kv_channels=head_dim,
            ffn_hidden_size=ffn,
            gated_linear_unit=swiglu,
            normalization="RMSNorm",
            add_bias_linear=False, add_qkv_bias=False,
            bf16=True, params_dtype=torch.bfloat16,
        )
        if swiglu:
            # SwiGLU is the gated linear unit plus SiLU; the config default is GELU
            cfg["activation_func"] = F.silu
        # caller-supplied overrides win over the derived values
        cfg.update(overrides)
        return TransformerConfig(**cfg)

The comments in the real file are almost longer than the code. Every line encodes a decision: SwiGLU wants gated_linear_unit=True plus activation_func=silu because Megatron folds the gate into the linear; relu2 has no Megatron-native equivalent and falls back to relu with a warning; rope_theta is attached after construction because the base TransformerConfig does not accept rotary_base as a dataclass field (only the MLA subclass does); attn_softcap has no equivalent field at all and has to be applied externally or routed through TE's attn_logit_softcapping.
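The warn-and-fallback behaviour in that paragraph can be sketched without Megatron installed. This is a minimal illustration, not the bridge's real code: `ACTIVATION_MAP`, `map_activation`, and `attach_rope_theta` are hypothetical names, the `rope_theta` value is illustrative, and only the two activations named in this post are covered.

```python
import warnings
from types import SimpleNamespace

# Hypothetical lookup: our activation name -> (Megatron-side name, gated?, exact?)
ACTIVATION_MAP = {
    "swiglu": ("silu", True, True),    # gated linear unit + SiLU
    "relu2": ("relu", False, False),   # no native squared ReLU: inexact fallback
}


def map_activation(name):
    """Return (activation, gated_linear_unit); warn when the mapping is inexact."""
    if name not in ACTIVATION_MAP:
        # Fail closed rather than guess at an unknown activation.
        raise ValueError(f"no Megatron mapping for activation {name!r}")
    target, gated, exact = ACTIVATION_MAP[name]
    if not exact:
        warnings.warn(
            f"{name!r} has no native equivalent; falling back to {target!r}",
            stacklevel=2,
        )
    return target, gated


def attach_rope_theta(config, rope_theta):
    """Set rope_theta as a plain attribute after the config object exists."""
    config.rope_theta = rope_theta
    return config


# Stand-in for a constructed config (assumption: attributes are settable).
cfg = attach_rope_theta(SimpleNamespace(hidden_size=1024), rope_theta=10000.0)
```

Keeping the exactness flag in the table, rather than in scattered `if` branches, makes it easy to audit which mappings are lossy.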
MoE is where the bridge is leakiest. Sink-token and some expert-choice variants
still have no direct TransformerConfig analogue. Group-capped routing is not
always in that same bucket: if the semantics line up with Megatron's native
router grouping, the bridge can map that subset upstream; if not, it should
stay explicit. Loss-free load balancing is still the cleanest supported subset:
disable the native aux-loss path, enable expert bias, and pass the bias update
rate through explicitly.
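As a rough sketch of that loss-free subset, the bridge can emit a small override dictionary. `loss_free_lb_overrides` is a hypothetical helper, and the field names below are what recent Megatron-Core releases appear to use; treat them as an assumption and check them against the installed version.

```python
def loss_free_lb_overrides(bias_update_rate):
    """Build config overrides for loss-free load balancing (illustrative only)."""
    if not bias_update_rate > 0:
        raise ValueError("bias_update_rate must be positive")
    return {
        # Disable the native aux-loss path.
        "moe_router_load_balancing_type": "none",
        # Assumed field names for the expert-bias path; verify before use.
        "moe_router_enable_expert_bias": True,
        "moe_router_bias_update_rate": float(bias_update_rate),
    }


overrides = loss_free_lb_overrides(1e-3)
```

The result is meant to be passed as keyword overrides into a config-mapping function, so the bias update rate is never left to a framework default.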
Sink-token behavior is even less native than grouped routing, because it belongs to the attention-mask contract rather than the expert router itself. That distinction matters operationally: a bridge can often map grouped load-balancing knobs into Megatron config, while sink-token behavior usually stays an explicit attention-side seam.
That is the ablation rule in practice. If the constraint is honestly expressible as native router grouping, the bridge should stop pretending it is a bespoke surface forever. If the behavior changes the attention mask, adds a different expert-choice rule, or otherwise escapes the router's native grouping contract, the bridge should keep it fail-closed instead of silently flattening it into the nearest Megatron flag.
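One way to picture the fail-closed rule is a lowering function that separates what maps natively from what must stay custom. Everything here is a hypothetical sketch: `lower_routing`, `UnsupportedRoutingError`, and the `matches_native` flag (which a caller would set only after checking the semantics) are invented for illustration.

```python
class UnsupportedRoutingError(ValueError):
    """Raised when a routing feature has no honest native lowering."""


def lower_routing(sink_token=False, dual_topk=False, group_cap=None,
                  allow_custom=False):
    """Split routing features into native overrides and custom seams.

    group_cap is None or a dict with num_groups, group_topk, and a
    matches_native flag set by the caller.
    """
    native, custom = {}, []
    if group_cap is not None:
        if group_cap["matches_native"]:
            native["moe_router_num_groups"] = group_cap["num_groups"]
            native["moe_router_group_topk"] = group_cap["group_topk"]
        else:
            custom.append("group_cap")
    if sink_token:
        custom.append("sink_token")  # attention-mask contract, not a router knob
    if dual_topk:
        custom.append("dual_topk")
    if custom and not allow_custom:
        # Fail closed: never flatten into the nearest native flag.
        raise UnsupportedRoutingError(
            f"no native lowering for: {', '.join(custom)}"
        )
    return native, custom
```

Callers that knowingly keep a custom seam pass `allow_custom=True` and receive the list of features they still own.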
The transformer-block adapter makes a single Megatron TransformerLayer look like our native block interface. Three annoyances live in this wrapper. First, layout: Megatron is sequence-first (T, B, D), we are batch-first (B, T, D), so the wrapper transposes on entry and exit. Second, RoPE: Megatron's training path does not accept external cosine/sine rotary kwargs on the generic training surface, so we build a rotary-embedding module at initialization time and cache the positional embedding per (seq_len, device) on the first forward. Third, features: the wrapper has to ignore or translate several side-channel arguments such as local-window hints, KV-cache state, document IDs, attention metadata, augmented-residual kwargs, and extra experimental controls. That is fine for plain attention layers, but it is exactly why we cannot simply replace the whole block list with generic Megatron block instances.
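The second and third annoyances reduce to two small mechanisms: a cache keyed by (seq_len, device) and a filter for kwargs the native block cannot consume. The sketch below uses plain Python stand-ins; `BlockAdapterSketch`, the kwarg names in `IGNORED_KWARGS`, and the `build_rotary` callable are all hypothetical.

```python
class BlockAdapterSketch:
    # Hypothetical names for side-channel kwargs the native block ignores.
    IGNORED_KWARGS = frozenset(
        {"window_hint", "kv_cache", "doc_ids", "attn_metadata"}
    )

    def __init__(self, build_rotary):
        # build_rotary: callable(seq_len, device) -> positional embedding
        self._build_rotary = build_rotary
        self._rotary_cache = {}

    def rotary_for(self, seq_len, device):
        """Build the embedding on first use, then reuse it per (seq_len, device)."""
        key = (seq_len, device)
        if key not in self._rotary_cache:
            self._rotary_cache[key] = self._build_rotary(seq_len, device)
        return self._rotary_cache[key]

    def split_kwargs(self, kwargs):
        """Return (kwargs to forward, sorted names of kwargs that were dropped)."""
        passed = {k: v for k, v in kwargs.items() if k not in self.IGNORED_KWARGS}
        dropped = sorted(k for k in kwargs if k in self.IGNORED_KWARGS)
        return passed, dropped
```

Returning the dropped names, instead of discarding them silently, lets the caller log or assert on exactly what the wrapper ignored.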
The Mamba-style mixer adapter is the most honest illustration of what an adapter actually is. Megatron's MambaLayer expects a mixer object that follows a specific protocol: construction receives configuration plus pipeline metadata, forward receives sequence-first tensors plus inference context, and it returns (output, bias) so the outer layer's fused bias-dropout-add path can consume it. Our recurrent mixer takes a different constructor, expects batch-first tensors, and returns a single tensor. The adapter reconciles all of it:

    # recurrent-mixer adapter (abbreviated)
    from torch import nn

    class MegaCppM2RNNMixer(nn.Module):
        def __init__(self, config, *, d_model, submodules=None,
                     layer_number=None, pg_collection=None,
                     pp_layer_offset=0, inner_config=None):
            super().__init__()
            from .m2rnn import M2RNNLayer  # imported lazily, inside the constructor
            cfg = inner_config or config
            if not hasattr(cfg, "n_embd"):
                # object.__setattr__ bypasses any custom __setattr__ on the config
                # (for example a frozen dataclass)
                object.__setattr__(cfg, "n_embd", d_model)
            self._inner = M2RNNLayer(cfg, layer_idx=layer_number or 0, tp_degree=1)

        def forward(self, hidden_states, *, inference_context=None,
                    packed_seq_params=None):
            # [s,b,h] -> [b,s,h] -> inner -> [s,b,h]; bias = None for mamba_bda
            y = self._inner(hidden_states.transpose(0, 1).contiguous())
            return y.transpose(0, 1).contiguous(), None

Wrapping the mixer, not the whole layer, buys us the things Megatron is
actually good at: participation in Megatron DDP's overlap_grad_reduce and
overlap_param_gather, the fused bias_dropout_add / norm / residual
plumbing, correct [s,b,h] pipelining, and integration with Megatron's
mixer-spec surface. It does not buy us an inference cache. Our matrix state
(B, N, K, V) still does not match the simpler cache tuple Megatron's
generation path expects, so that wiring remains outstanding work.
The deeper issue is topology, not syntax. Sequence-first transport is what lets Megatron keep TP, PP, and CP bookkeeping honest, while recurrent caches are usually organized around batch-local state and long-lived mixer state. That means the adapter has to preserve more than a transpose: it has to mark the point where sequence-sharded ownership becomes recurrent-state ownership, and it has to keep cache-shape reporting honest enough that graph capture, pipeline preallocation, and generation do not make the wrong assumptions about what the block will save between calls.
The Megatron MoE integration layer is much thinner. It reuses the upstream permutation and top-k utilities where they help, but keeps MegaCpp routing logic intact. The real work is normalizing our routing inputs - indices [num_tokens, top_k] plus aligned weights vs. a dense [num_tokens, num_experts] mask with probabilities - into whichever layout the upstream helper expects. The alternative was porting the whole router onto Megatron's native MoE layer and giving up routing seams we still need.
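The two routing layouts carry the same information, which a small round-trip makes concrete. This sketch uses nested Python lists in place of tensors; `topk_to_dense` and `dense_to_topk` are hypothetical names, and the real code would operate on GPU tensors rather than loops.

```python
def topk_to_dense(indices, weights, num_experts):
    """[num_tokens, top_k] indices + weights -> dense probs and boolean mask."""
    probs = [[0.0] * num_experts for _ in indices]
    mask = [[False] * num_experts for _ in indices]
    for token, (idx_row, w_row) in enumerate(zip(indices, weights)):
        for expert, weight in zip(idx_row, w_row):
            probs[token][expert] = weight
            mask[token][expert] = True
    return probs, mask


def dense_to_topk(probs, mask):
    """Dense probs + mask -> per-token expert indices sorted by weight, descending."""
    indices, weights = [], []
    for p_row, m_row in zip(probs, mask):
        chosen = [e for e, selected in enumerate(m_row) if selected]
        chosen.sort(key=lambda e: p_row[e], reverse=True)
        indices.append(chosen)
        weights.append([p_row[e] for e in chosen])
    return indices, weights


idx = [[2, 0], [1, 3]]
wts = [[0.7, 0.3], [0.6, 0.4]]
dense_probs, dense_mask = topk_to_dense(idx, wts, num_experts=4)
```

The round-trip only reproduces the original ordering when each token's experts were already listed by descending weight, which is one of the details a normalizing adapter has to pin down.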
The optimizer bridge is the biggest of the adapters and the one that does real work rather than shape-shifting. It keeps bucket sizing, overlap, and shard-local Muon/AdamW handling aligned with Megatron's distributed optimizer rules, then batches reduce-scatter and all-gather work so the step path stays close to the upstream communication pattern.
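For readers unfamiliar with bucket math, the core idea is greedy packing plus padding so each bucket divides evenly across data-parallel ranks. The sketch below is a simplified illustration under that assumption, not Megatron's actual bucketing code; `plan_buckets` and its return layout are hypothetical.

```python
def plan_buckets(param_numels, bucket_size, dp_world_size):
    """Greedily pack parameters into buckets padded to a multiple of dp_world_size."""
    buckets, current, count = [], [], 0

    def close():
        # Round up so reduce-scatter hands every rank an equal shard.
        padded = -(-count // dp_world_size) * dp_world_size
        buckets.append({
            "params": list(current),
            "numel": count,
            "padded_numel": padded,
            "shard_numel": padded // dp_world_size,
        })

    for index, numel in enumerate(param_numels):
        current.append(index)
        count += numel
        if count >= bucket_size:
            close()
            current, count = [], 0
    if current:
        close()  # final partial bucket
    return buckets


plan = plan_buckets([10, 20, 5, 7], bucket_size=25, dp_world_size=4)
```

Each rank then owns `shard_numel` elements of every bucket, which is the unit a shard-local optimizer step would operate on.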
How it lands in production
If you are reading this specifically for the NAM56R migration path, NAM56R Megatron translation is the cleaner follow-on because it shows one concrete fail-closed lowering instead of the whole adapter catalog at once.
In the current MegaCpp architecture the shape is inverted. Megatron is the framework and the custom pieces slot into Megatron-native specs. The checked-in NAM56R Megatron plan sample, NAM56R Megatron recipe near-copy, NAM56R runtime patch surface sample, and M2RNN mixer spec sample keep those seams explicit. The recurrent path now runs through fused chunk and Triton kernels rather than the pure-PyTorch recurrence the early adapter wrapped.
The lift-as-is set is small and boring: the bridge itself, the MoE primitive wrappers, and the distributed optimizer's bucket math. Rewrites: the early block wrapper gives way to production ModuleSpec objects that are fully TE-native from end to end. Drops: the pure Python recurrent mixer becomes a Triton fused chunk kernel, and the PyTorch AdamW fallback gives way to TE fused optimizers when FP8 is on. Feature flags for block choice, recurrent-path enablement, DDP mode, and MoE dispatcher choice survive into production as operator-visible toggles.
Ablations and what we kept
The research comparison adds one more rule for deciding when to upstream or delete a bridge seam. If an irregular pattern can be expressed honestly in native layer-spec selection or native pipeline layout, keeping a Python dispatcher in the middle usually stops being a bridge and starts being a tax. It widens the compile surface, makes recompute boundaries harder to describe, and keeps paying per-step dispatch overhead after Megatron can already own the topology directly.
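A sketch of what "native layer-spec selection" could look like: the irregular pattern is resolved once, at build time, into a per-stage list of block families, so no dispatcher runs per step. The pattern string, `BLOCK_FAMILIES`, and `lower_pattern` are hypothetical; a real lowering would produce ModuleSpec objects rather than strings.

```python
# A/M/E/R shorthand from this post, mapped to illustrative family names.
BLOCK_FAMILIES = {
    "A": "attention",
    "M": "mamba",
    "E": "moe",
    "R": "recurrent",
}


def lower_pattern(pattern, pp_stages):
    """Resolve a layer pattern into one list of block families per pipeline stage."""
    layers = []
    for position, letter in enumerate(pattern):
        if letter not in BLOCK_FAMILIES:
            # Fail closed on unknown block letters.
            raise ValueError(f"unknown block {letter!r} at layer {position}")
        layers.append(BLOCK_FAMILIES[letter])
    if len(layers) % pp_stages != 0:
        raise ValueError("layer count must divide evenly across pipeline stages")
    per_stage = len(layers) // pp_stages
    return [layers[s * per_stage:(s + 1) * per_stage] for s in range(pp_stages)]


stages = lower_pattern("AMEMAEMR", pp_stages=2)
```

Because the result is static data, recompute boundaries and pipeline layout can be described from it directly.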
The bridge surface, with the leaks called out:
| Our field / feature | Megatron TransformerConfig analogue | Status |
|---|---|---|
| n_kv_head | num_query_groups (None for MHA) | mapped |
| swiglu | gated_linear_unit=True + activation_func=silu | mapped |
| relu2 | relu | warn-and-fallback |
| rope_theta | post-construction attribute on RoPE module | mapped |
| attn_softcap | TE attn_logit_softcapping (when TE owns DPA) | partial |
| Loss-free LB | moe_router_load_balancing_type="none" + bias hooks | mapped |
| sink-token routing | none | warn-only |
| dual-top-k expert choice | none | warn-only |
| group-capped routing | moe_router_num_groups + moe_router_group_topk | partial / map when semantics match |
| TP comm overlap | requires explicit userbuffer init | off by default |
The integration change notes record the messy middle. Our first iteration of the transformer-block adapter set tp_comm_overlap=True by default and was 17x slower than our own block - we reverted it and documented the reason in-line. The fp8_autocast scope audit found a mismatch where Megatron enters the FP8 region at the transformer block level and we were entering it per-linear; we now match the block-level scope, and Mamba is explicitly excluded from FP8 because TE's FP8 GEMM paths do not compose with our SSD recurrence.
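The scope fix amounts to entering the FP8 context once around the whole block, and not at all for excluded block families. The sketch below shows that shape with a pluggable context factory; `block_fp8_scope`, `run_block`, and `FP8_EXCLUDED_BLOCKS` are hypothetical names, and in a real setup the factory would wrap something like Transformer Engine's fp8_autocast.

```python
from contextlib import nullcontext

# Block families that never enter the FP8 region (Mamba, per the audit above).
FP8_EXCLUDED_BLOCKS = frozenset({"mamba"})


def block_fp8_scope(block_kind, fp8_enabled, fp8_context_factory):
    """Return the context manager that should wrap one whole block."""
    if fp8_enabled and block_kind not in FP8_EXCLUDED_BLOCKS:
        return fp8_context_factory()
    return nullcontext()


def run_block(block_kind, sublayers, fp8_enabled, fp8_context_factory):
    """Run every sublayer inside a single block-level scope."""
    with block_fp8_scope(block_kind, fp8_enabled, fp8_context_factory):
        # The scope is entered once here, not once per linear.
        return [sublayer() for sublayer in sublayers]
```

Putting the exclusion list next to the scope helper keeps the "which blocks get FP8" decision in one auditable place.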
The SP norm-grad all-reduce fix in the tensor-parallel adapter was another silent-correctness bug: under --sequence_parallel --megatron_tp, norm parameters see only a shard of the sequence on each rank, and without an explicit all-reduce of their grads the training diverges. We now install a hook that mirrors Megatron's own final gradient-reduction path. In current Megatron-Core terms, that usually means tagging custom norm parameters with average_gradients_across_tp_domain=True so finalize_model_grads picks them up. The loss-free LB global-sync patch in the MoE path was the same kind of bug in a different location.
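A toy calculation shows why the reduction is needed. Assume a scalar norm weight g with y_t = g * x_t and a sum-reduced loss, so the full gradient is the sum over positions of upstream_t * x_t. With the sequence sharded across ranks, each rank only computes a partial sum. The helpers below are hypothetical and simulate ranks with plain lists; whether the real reduction is a sum or an average depends on how the loss is normalized.

```python
def norm_weight_grad(xs, upstream):
    """Gradient of a scalar norm weight over the positions this rank sees."""
    return sum(x * u for x, u in zip(xs, upstream))


def shard(sequence, tp_size):
    """Split a sequence into equal contiguous shards, one per rank."""
    per_rank = len(sequence) // tp_size
    return [sequence[r * per_rank:(r + 1) * per_rank] for r in range(tp_size)]


def all_reduce_sum(per_rank_values):
    """Simulated all-reduce: every rank ends up with the total."""
    total = sum(per_rank_values)
    return [total] * len(per_rank_values)


xs = [1, 2, 3, 4]
upstream = [1, 1, 2, 2]
full_grad = norm_weight_grad(xs, upstream)
partials = [
    norm_weight_grad(x_shard, u_shard)
    for x_shard, u_shard in zip(shard(xs, 2), shard(upstream, 2))
]
reduced = all_reduce_sum(partials)
```

Without the reduction, each rank would step its replicated norm weight with a different partial gradient, and the copies would drift apart.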
Things that survived: the bridge itself as a thin CUDA-only module, the Mamba mixer adapter pattern (wrap the mixer, not the layer), the distributed optimizer's bucket math, and the process-group lifecycle. Things we dropped: our first attempt at mapping null_rho onto Megatron's capacity-factor fields, because the semantics did not match, and the idea that TransformerConfig could carry all of our routing fields directly.
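A minimal sketch of the "wrap the mixer, not the layer" pattern, using plain Python lists instead of tensors. `MixerAdapter`, `toy_mixer`, and this `bias_dropout_add` are hypothetical stand-ins rather than Megatron classes. The assumption illustrated is that the surrounding layer expects an (output, bias) pair from its mixer, so a wrapped mixer with no separate bias returns (output, None).

```python
class MixerAdapter:
    """Hypothetical wrapper: adapts a single-output mixer to an (output, bias) contract."""

    def __init__(self, mixer):
        self.mixer = mixer

    def __call__(self, hidden):
        # No separate bias to fuse, so the second slot is None.
        return self.mixer(hidden), None

def bias_dropout_add(x_with_bias, residual):
    """Toy stand-in for a fused bias-dropout-add (dropout omitted)."""
    x, bias = x_with_bias
    if bias is not None:
        x = [a + b for a, b in zip(x, bias)]
    return [a + r for a, r in zip(x, residual)]

def toy_mixer(hidden):
    return [2.0 * v for v in hidden]

adapter = MixerAdapter(toy_mixer)
hidden = [1.0, -0.5, 3.0]
out = bias_dropout_add(adapter(hidden), residual=hidden)
```

The layer-level code never changes. Only the mixer is adapted, which is why this pattern outlived the attempts to wrap whole layers.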
Production checklist
- Keep the bridge CUDA-only. The TPU/XLA path must never import megatron.core.
- Gate every Megatron call site behind explicit availability checks so CPU unit tests stay green.
- When a field has no TransformerConfig analogue, warn loudly and document the workaround - do not drop it silently.
- Drive DP/TP/PP/EP sizes through one shared Megatron parallel-state initializer; do not hand-roll process groups in the model code.
- When wrapping a mixer for the Mamba block surface, return (output, None) so the fused bias-dropout-add path stays valid.
- Set persist_layer_norm=False whenever a custom norm wrapper owns the layer-norm path.
- For MoE, route through alltoall when EP > 1 and allgather otherwise.
- For FP8, enter fp8_autocast at block scope, not per-linear; keep Mamba and M2RNN out of the FP8 region.
- Validate loss-free LB mapping on a canary before every release: moe_router_load_balancing_type="none", moe_aux_loss_coeff=0, expert-bias enabled, bias update rate matches our config.
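Several of these checks are mechanical enough to sketch. The helpers below are hypothetical and stdlib-only; they are not our bridge code. `moe_router_load_balancing_type` and `moe_aux_loss_coeff` are the field names from the checklist, while `expert_bias_enabled` and `bias_update_rate` are placeholder keys invented for this example.

```python
import importlib.util
import warnings

def megatron_available():
    """Probe for megatron.core; a missing package yields False, not an ImportError."""
    try:
        return importlib.util.find_spec("megatron.core") is not None
    except ModuleNotFoundError:
        return False

def moe_dispatcher(ep_size):
    """Checklist rule: alltoall when EP > 1, allgather otherwise."""
    return "alltoall" if ep_size > 1 else "allgather"

def map_fields(ours, known):
    """Keep fields with a known analogue; warn loudly about the rest."""
    for key in sorted(set(ours) - set(known)):
        warnings.warn(f"no TransformerConfig analogue for {key!r}; see workaround notes")
    return {k: v for k, v in ours.items() if k in known}

def check_loss_free_lb(cfg, expected_bias_rate):
    """Return a list of problems; an empty list means the canary passes."""
    problems = []
    if cfg.get("moe_router_load_balancing_type") != "none":
        problems.append("load balancing type must be 'none'")
    if cfg.get("moe_aux_loss_coeff") != 0:
        problems.append("aux loss coefficient must be 0")
    if not cfg.get("expert_bias_enabled"):  # placeholder key
        problems.append("expert bias must be enabled")
    if cfg.get("bias_update_rate") != expected_bias_rate:  # placeholder key
        problems.append("bias update rate does not match our config")
    return problems

canary_cfg = {
    "moe_router_load_balancing_type": "none",
    "moe_aux_loss_coeff": 0,
    "expert_bias_enabled": True,
    "bias_update_rate": 1e-3,
}
canary_problems = check_loss_free_lb(canary_cfg, expected_bias_rate=1e-3)
```

Returning a list of problems instead of raising on the first one keeps the canary report complete, so a release blocker shows every mismatched field at once.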
Frequently asked questions
Why keep TP communication overlap off by default on bridge lanes?
Why is a translated Megatron flag bundle not the whole port by itself?
custom_notes, and the runtime patch surface sample keeps schedule, fused-loss, and shared-spec behavior explicit instead of pretending one translated launcher owns all live runtime behavior.
When should an irregular layer pattern move out of the bridge?