Muon on Hopper and Blackwell: The NVIDIA Lane of the MegaCpp Optimizer Stack
How Muon, MuonClip, and the QK-clip family get from a single-file research implementation into a production AdamW-coexistent optimizer path for the MegaCpp ensemble on H200 and GB10.

Muon is the optimizer we keep circling back to whenever we want a cheaper training run at the same loss. On the NVIDIA lane of the MegaCpp ensemble, "cheaper" means Hopper H200 for the heavy dense baselines and Blackwell GB10 for the small-cluster development loop. The single-file Muon reference works on paper, but once the model has mixed fused-QKV projections, deep hyper-connection stacks, and rectangular MoE experts, there is a fairly long list of knobs you have to get right before it stops diverging. This post walks through the optimizer path as it currently stands in the public MegaCpp Muon surface and what ships to the deployment stack.
Why MegaCpp cares about this
Our scaling recipe is an ensemble of specialist SLMs rather than one big model, so every training dollar we save at specialist-scale compounds. Muon gives orthogonalized updates that, per the PyTorch torch.optim.Muon documentation and the later large-scale Muon scaling paper, target the 2D linear layers that dominate attention and MoE blocks. The catch is that Muon's orthogonalization changes gradient magnitude handling, and on our H200 dense baseline we needed three additional controls before the hybrid preset stayed stable: split-QKV orthogonalization, shape-symmetric learning-rate scaling across tall and wide expert matrices, and a post-step Q/K stabilization rule in the style later described by Kimi K2. Everything downstream in this post is about making that combination predictable enough to keep in a production optimizer lane.
The shortest checked-in proof surface for the sharding side is the FSDP2 local-shard optimizer sample. The neighboring runtime context is The MegaCpp precision recipe, training speed by feature, and H200 bringup and naming, plus the local GPU profile receipt sample and goodput tracker sample, so compile cost and optimizer-step cost do not get mixed together.
What we built in the public MegaCpp Muon path
The public Muon implementation in MegaCpp deliberately tries to be boring. The core step is _muon_step_fused_impl: momentum accumulation, Polar Express orthogonalization, optional variance reduction, cautious weight decay, and the parameter update are all in one compiled graph. The variance reduction uses a factored second-moment buffer (per-row when the matrix is tall, per-column otherwise) so we get AdamW-like variance damping on top of the orthogonalized update without materialising a full second moment. Cautious weight decay is gated on agreement between the update and the current parameter sign, not the pre-orthogonalization raw gradient sign. That is not the convention you will see in most reference code; we tried the raw-gradient gate once, it regressed a deep dense receipt back to immediate NaNs, and we left a comment next to the gate telling the next person to stop trying to "fix" it.
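The sign-agreement gate described above can be sketched in a few lines. This is a minimal illustration, not the MegaCpp step: the name `cautious_decay_step` and its arguments are hypothetical, decoupled weight decay scaled by the learning rate is an assumption, and momentum, orthogonalization, and variance reduction are left out.

```python
import torch

def cautious_decay_step(param, update, lr, weight_decay):
    """Apply `update` plus weight decay gated on sign agreement.

    Hypothetical helper: `update` stands in for the orthogonalized Muon
    update, not the raw gradient.
    """
    # Decay only the entries where the update and the parameter share a sign,
    # i.e. where stepping along -update already shrinks the weight.
    gate = (update * param > 0).to(param.dtype)
    return param - lr * update - lr * weight_decay * gate * param


param = torch.tensor([[1.0, -2.0], [3.0, -4.0]])
update = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
new_param = cautious_decay_step(param, update, lr=0.1, weight_decay=0.1)
```

In this toy case only the two diagonal entries are decayed, because those are the only entries where the update and the weight agree in sign.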
Orthogonalization itself is Polar Express rather than the classic quintic Newton-Schulz iteration. The coefficients in polar_express_coeffs are precomputed for five iterations with a small safety factor and a cushion of two. The iteration runs in bf16 unconditionally. fp32 Polar Express on TPU triggered compile-time HBM OOM with stacked all-reduce buffers on the compiled-optimizer path, and once we confirmed the bf16 path is numerically fine after a 10-to-20-step warmup on NVIDIA too, we stopped maintaining the fp32 branch. The 1.02 * ||X|| pre-normalization is how we keep Polar Express strictly inside its convergence envelope even when gradients spike.
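To make the pre-normalization concrete, here is a hedged sketch of an odd-polynomial orthogonalization loop with the 1.02 safety factor applied to the input norm. The coefficients are the classic quintic Newton-Schulz ones from the original Muon reference, used purely as a stand-in: the Polar Express coefficients are not reproduced here, and `orthogonalize_sketch` is a hypothetical name.

```python
import torch

# Classic quintic Newton-Schulz coefficients from the original Muon reference,
# used here only as a stand-in; they are NOT the Polar Express coefficients.
NS_COEFFS = (3.4445, -4.7750, 2.0315)

def orthogonalize_sketch(grad, steps=5, safety=1.02, dtype=torch.bfloat16):
    """Approximately orthogonalize a 2D matrix with a fixed odd polynomial."""
    a, b, c = NS_COEFFS
    x = grad.to(dtype)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so x @ x.T is the small Gram matrix
    # Pre-normalize by safety * ||X||_F so every singular value starts below 1.
    x = x / (safety * x.norm() + 1e-7)
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * (gram @ gram)) @ x
    return x.T if transposed else x


torch.manual_seed(0)
g = torch.randn(64, 32)
out = orthogonalize_sketch(g, dtype=torch.float32)  # fp32 here only for inspection
singular_values = torch.linalg.svdvals(out)
```

Because the Frobenius norm upper-bounds the spectral norm, dividing by 1.02 times the norm leaves a small margin below 1 even when a gradient spikes. The resulting singular values cluster near 1 rather than hitting it exactly, which is the expected behavior of this family of iterations.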
The one piece that is easy to miss is qkv_split_sizes. Muon's Newton-Schulz / Polar Express cross-contaminates Q, K, and V subspaces if we orthogonalize a fused QKV matrix as a whole. MegaCpp mirrors the split-QKV pattern used in production NVIDIA stacks: when a parameter carries a _qkv_split_sizes tag, the orthogonalization splits along the output dim, runs Polar Express per Q/K/V slice, and concatenates. That single change turned a depth-52 hybrid preset from "NaN at step 4 universally" into a lane that could stay on the rails, and it is the reason split-QKV is not a toggle in production.
Three distributed variants coexist in the public Muon module. The plain Muon class is the reference optimizer for single-device debugging. DistMuon uses batched stacking plus vectorized reduce_scatter_tensor and all_gather_into_tensor so the orthogonalization runs on one rank per shard and broadcasts. FSDP2Muon is the one we actually run on H200: it consumes FSDP2 DTensor local shards directly, infers Shard(0) bounds to re-map grads onto the local shape, and rescales qkv_split_sizes to the local row count so the split-QKV logic keeps working after the parameter has been sharded along the head dimension. The checked-in FSDP2 local-shard optimizer sample makes that boundary explicit: recover the shard bounds first, then scale the split metadata against the rows this rank actually owns.
On H200-class HBM lanes that extra reconstruct-and-update contract can still be a reasonable systems tax. On GB10-class LPDDR lanes, the same contract is much more likely to turn into a memory-traffic cost first, which is why GB10 is a parity and debugging lane for Muon rather than the place we promise the same wall-clock win as H200.
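The "recover the bounds, then scale the split metadata" order can be sketched without any distributed setup. Both helpers below are hypothetical illustrations, not the checked-in sample. They assume chunk-style Shard(0) partitioning (ceil-sized chunks per rank) and a fused layout in which every shard keeps the global Q:K:V ratio; a layout that does not satisfy the second assumption would need an intersection-based rule instead.

```python
import math

def shard0_bounds(total_rows, rank, world_size):
    """Row range [start, end) owned by `rank` under chunk-style Shard(0)."""
    chunk = math.ceil(total_rows / world_size)
    start = min(rank * chunk, total_rows)
    end = min(start + chunk, total_rows)
    return start, end

def scale_split_sizes(split_sizes, local_rows):
    """Scale global (q, k, v) row counts down to the rows this rank owns.

    Assumes the shard keeps the global Q:K:V ratio; raises otherwise.
    """
    total = sum(split_sizes)
    if any(s * local_rows % total for s in split_sizes):
        raise ValueError("local rows do not divide the Q/K/V split evenly")
    return [s * local_rows // total for s in split_sizes]


global_split = (4096, 1024, 1024)      # hypothetical fused QKV row counts
start, end = shard0_bounds(sum(global_split), rank=3, world_size=8)
local_split = scale_split_sizes(global_split, end - start)
```

Raising on an uneven split, rather than rounding, is the defensive choice here: a silently rounded split would orthogonalize slices that straddle Q, K, and V.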
Per-parameter learning-rate scaling lives in _adjust_lr. Two modes match the PyTorch torch.optim.Muon API: Keller's sqrt(max(1, A/B)) correction for tall matrices, and the later match_rms_adamw-style rule that equalizes update RMS across all shapes. The shape-symmetric mode is the one our MoE presets need. With the older asymmetric rule, expert w1 of shape (D, h) and expert w2 of shape (h, D) can get meaningfully different effective LR even though they are the two halves of one expert MLP. The safer public takeaway is not the formula itself; it is the contract: a Muon lane for experts needs the up- and down-projection to stay on comparable update scales, which is why this article belongs next to expert parallel and MoE sharding rather than as isolated optimizer trivia.
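A small worked example shows the asymmetry. The function below is a sketch of the two scaling modes as we read the PyTorch rule; the 0.2 constant in the symmetric mode and the expert sizes are assumptions for illustration, and `adjust_lr` is not the MegaCpp `_adjust_lr` itself.

```python
import math

def adjust_lr(lr, shape, mode):
    """Scale `lr` for a 2D parameter of shape (A, B)."""
    a, b = shape[:2]
    if mode == "original":
        return lr * math.sqrt(max(1.0, a / b))      # only boosts tall matrices
    if mode == "match_rms_adamw":
        return lr * 0.2 * math.sqrt(max(a, b))      # symmetric in (A, B)
    raise ValueError(f"unknown mode: {mode}")


D, h = 1024, 4096                                    # hypothetical expert sizes
for mode in ("original", "match_rms_adamw"):
    w1 = adjust_lr(1.0, (D, h), mode)
    w2 = adjust_lr(1.0, (h, D), mode)
    print(f"{mode}: w1 x{w1:.1f}, w2 x{w2:.1f}")
```

With these sizes the original rule gives w1 a multiplier of 1.0 and w2 a multiplier of 2.0, while the symmetric rule gives both 12.8. The larger common multiplier is also why the symmetric mode is paired with a lower base matrix learning rate.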
The public AdamW companion module is the coresident reference. It is the same fused-step idiom - a single compiled graph with 0-D CPU scalar tensors for hyperparameters so hyperparameter changes do not trigger recompiles - and it is the fallback whenever a parameter is 1D, an embedding table, the LM head, or otherwise excluded by Muon's shape contract. In practice every training run uses both: Muon for 2D hidden linears, AdamW for embeddings, biases, norms, and the MTP head.
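The Muon/AdamW split can be expressed as a small routing function. This is a sketch under stated assumptions: the name fragments (`embed`, `lm_head`, `mtp_head`) and the parameter names are hypothetical, and expert weights stored as stacked 3D tensors would need their own rule that is not shown here.

```python
def route_parameter(name, shape):
    """Return "muon" or "adamw" for one parameter.

    The name fragments below are hypothetical; real recipes would key off
    their own module names or explicit tags.
    """
    adamw_only = ("embed", "lm_head", "mtp_head")
    if len(shape) != 2:
        return "adamw"            # biases, norm scales, anything not a matrix
    if any(fragment in name for fragment in adamw_only):
        return "adamw"            # 2D, but excluded from Muon's shape contract
    return "muon"


params = {
    "embed.weight": (32000, 1024),
    "blocks.0.attn.qkv.weight": (3072, 1024),
    "blocks.0.norm.weight": (1024,),
    "blocks.0.mlp.w1.bias": (4096,),
    "lm_head.weight": (32000, 1024),
}
groups = {name: route_parameter(name, shape) for name, shape in params.items()}
```

Checking the dimensionality first keeps the rule safe by default: anything that is not clearly a 2D hidden linear falls through to AdamW.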
The QK-clip story spans the public clip helper and the optimizer policy. The clip helper is the activation-space clip: before softmax it scales the query tensor by min(1, threshold / (||q||_max * ||k||_max / sqrt(d))). It uses amax on tensors instead of scalar reductions to avoid a host sync on XLA. The MuonClip module is the weight-space clip: MuonClipState.record accumulates a Cauchy-Schwarz upper bound on per-head max logits during the forward pass in O(T) rather than the true O(T^2) max; the post-step QK clip walks attention modules after the optimizer step, and when a head exceeds tau it rescales the corresponding Q/K rows in weight space. The important reader-safe distinction is ownership: activation-space clipping changes the live forward activations, while MuonClip is a post-step stability rule that changes the next parameter state. That is why the runtime side reads best next to activations and how we split them, while the training-health side belongs with loss curves and the divergence playbook.
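The activation-space formula can be sketched directly. The function below is an illustration, not the public clip helper: `qk_clip_activations` is a hypothetical name, and computing the bound per (batch, head) pair is an assumption about granularity.

```python
import math
import torch

def qk_clip_activations(q, k, threshold):
    """Scale q per head so no pre-softmax logit can exceed `threshold`.

    q, k: (batch, heads, tokens, head_dim). Uses the Cauchy-Schwarz bound
    |q . k| <= ||q|| * ||k||, so the cost is O(T) per head, not O(T^2).
    """
    d = q.shape[-1]
    # amax keeps everything as tensors; no .item() call, so no host sync.
    q_max = q.norm(dim=-1).amax(dim=-1, keepdim=True)      # (B, H, 1)
    k_max = k.norm(dim=-1).amax(dim=-1, keepdim=True)      # (B, H, 1)
    bound = q_max * k_max / math.sqrt(d)                   # upper bound on max logit
    scale = torch.clamp(threshold / bound.clamp_min(1e-12), max=1.0)
    return q * scale.unsqueeze(-1), bound.squeeze(-1)


torch.manual_seed(0)
q = 20.0 * torch.randn(2, 4, 16, 64)
k = 20.0 * torch.randn(2, 4, 16, 64)
q_clipped, bound = qk_clip_activations(q, k, threshold=100.0)
logits = q_clipped @ k.transpose(-1, -2) / math.sqrt(64)
```

The bound is conservative: the true maximum logit is usually well below it, so this clip can scale more aggressively than an exact O(T^2) maximum would. That is the price of keeping the reduction linear in sequence length.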
How it lands in MegaCpp
The MegaCpp deployment stack is our stack on top of Megatron-Core. The optimizer story lands there in three pieces.
First, the core Muon step, the split-QKV orthogonalization, and the shape-symmetric LR mode are lifted as-is. The public MegaCpp Muon module is the source of truth for the step contract; the MegaCpp recipe layer ingests the same hyperparameters through a Megatron-style optimizer config. We do not re-derive Polar Express coefficients in the deployment stack.
Second, the distributed Muon variant we keep in deployment is the FSDP2-style interface, not DistMuon. Megatron's distributed optimizer handles the reduce-scatter side, so the part of the Muon module we actually run under Megatron-Core is the shard-aware local step. This is the hard integration: Megatron ships gradients on shard-local tensor views, and the local step has to accept the same kind of Shard(0) inputs the FSDP2 path already handled. That is why shard-shape helpers are written defensively around local row counts. DistMuon stays as the public reference we can diff against when a Megatron integration regresses.
Third, MuonClip is a feature-flagged hook in the deployment stack, not a loss on the default path. The clip threshold tau=100 matches the Kimi K2 paper default; 0 disables the clip and is the setting for presets that already use a smaller depth or for runs we specifically want to burn to see whether logits still explode. The forward-side recording goes into the attention module; the post-step rescale is a post-optimizer hook mirroring the public MegaCpp training loop. The activation-space QK clip stays as a fallback for experiments that cannot afford the per-step weight rescale and as the XLA-side variant because it does not need a host sync.
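A post-step weight-space rescale in this style might look like the sketch below. It is an assumption-laden illustration rather than the deployment hook: `post_step_qk_clip` is a hypothetical name, the even square-root split between Q and K follows the Kimi K2-style rule as we understand it, and Q and K are assumed to have the same number of heads (no grouped-query layout).

```python
import torch

def post_step_qk_clip(w_q, w_k, max_logits, tau=100.0):
    """Rescale per-head Q/K rows in place when a head's max logit exceeds tau.

    w_q, w_k: (heads * head_dim, model_dim); max_logits: (heads,).
    Returns the number of heads that were clipped. tau == 0 disables the clip.
    """
    if tau <= 0:
        return 0
    heads = max_logits.numel()
    gamma = torch.clamp(tau / max_logits.clamp_min(1e-12), max=1.0)   # (heads,)
    # Split the correction evenly: sqrt(gamma) on Q and sqrt(gamma) on K,
    # so the q.k product for that head shrinks by gamma.
    row_scale = gamma.sqrt().repeat_interleave(w_q.shape[0] // heads).unsqueeze(-1)
    with torch.no_grad():
        w_q.mul_(row_scale)
        w_k.mul_(row_scale)
    return int((gamma < 1.0).sum())


w_q, w_k = torch.ones(8, 4), torch.ones(8, 4)             # 2 heads, head_dim 4
n_clipped = post_step_qk_clip(w_q, w_k, torch.tensor([400.0, 50.0]), tau=100.0)
```

Here the first head recorded a max logit of 400, so its Q and K rows are each halved and the product drops by a factor of four to the threshold; the second head is untouched.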
What does not land: raw Nesterov-only Muon without variance reduction, the unsplit-QKV path, and the fp32 Polar Express branch. The first two are strictly worse on our deep-dense receipts; the third has no numerical benefit on NVIDIA once the safety factor in the input normalization is in place.
Ablations and what we kept
The interesting part of the history is a narrow one: the H200 bring-up and the Kimi K2 follow-through.
The deep dense baseline went NaN at step 1 through step 6 under every Muon configuration we tried that was not "split-QKV plus hybrid architecture". Lowering the matrix LR did not help; a 20x lower LR still NaN'd at step 2. A matched AdamW run at the same base LR completed 20 steps cleanly. That was the evidence that identified Muon's orthogonalized magnitude as the cause of the forward-pass activation explosion in the presence of deep hyper-connections, not any LR or init pathology. The fix that actually worked was split-QKV orthogonalization plus hybrid Transformer + structured-state layer interleaving.
On top of that we added MuonClip. Without MuonClip, long runs with Muon drifted toward rising max attention logits that, left unchecked, pushed us back into the earlier NaN regime. The muon_clip/max_logit, muon_clip/n_clipped, and muon_clip/n_total metrics are the leading indicator; when the number of clipped heads per step creeps up over a training window, something upstream has changed.
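One way to turn "creeps up over a training window" into a check is to compare the clipped-head fraction across two adjacent windows. The helper and the doubling rule below are hypothetical; only the metric names come from the text above.

```python
def clipped_fraction_trend(n_clipped, n_total, window):
    """Compare the clipped-head fraction in the latest window to the one before.

    n_clipped, n_total: per-step values of muon_clip/n_clipped and
    muon_clip/n_total. Returns (previous_fraction, latest_fraction).
    """
    if len(n_clipped) < 2 * window:
        raise ValueError("need at least two full windows of history")

    def fraction(lo, hi):
        return sum(n_clipped[lo:hi]) / max(1, sum(n_total[lo:hi]))

    end = len(n_clipped)
    return fraction(end - 2 * window, end - window), fraction(end - window, end)


clipped = [0, 0, 1, 0, 2, 3, 3, 5]
total = [64] * 8
previous, latest = clipped_fraction_trend(clipped, total, window=4)
rising = latest > 2 * previous          # hypothetical alert rule
```

Normalizing by n_total keeps the signal comparable across presets with different head counts.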
Muon weight decay went through its own small ablation. MegaCpp originally decayed weights even when update and parameter disagreed in sign. The aligned-gate mode (the current behavior) cuts the nominal weight decay by about an order of magnitude on some layers, so we had to re-tune matrix_lr and weight_decay together when we turned it on; the payoff is that long Muon lanes stop drifting in a way that only shows up late in training.
On GB10 (Blackwell consumer variant) the story is different: the optimizer path is much more bandwidth-bound, so the optimizer wall-clock fraction barely moves between Muon and AdamW. We still run Muon on GB10 for numerical parity with H200, but the compiled-step kernel is the cheaper decision, not the difference maker.
For the optimizer the only GB10-specific policy difference is letting the smaller board autotune the fused step instead of hard-coding an H200-sized assumption.
The same caution applies to compiled Muon on NVIDIA more generally. The unstable part is not the Muon math itself; it is the shape-varying stacked-group call pattern around it. If stacked groups keep arriving with new local row counts, the compiler spends its budget re-deriving guards instead of accelerating the step. That is why compiled Muon remains opt-in on NVIDIA and why Dynamo and torch.compile breakage and the compile-time tax we accept for runtime speed are adjacent context rather than a separate story.
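A cheap pre-flight check for the shape problem is to count how many distinct local shapes the stacked groups would present before turning compilation on. The sketch below is hypothetical tooling with made-up parameter names and shapes; it only counts shapes and does not interact with the compiler.

```python
from collections import defaultdict

def bucket_by_shape(local_shapes):
    """Group parameter names by local shard shape.

    Each distinct shape is one stacked-group call signature, so the number of
    buckets approximates how many specializations a compiled step would need.
    """
    buckets = defaultdict(list)
    for name, shape in local_shapes.items():
        buckets[tuple(shape)].append(name)
    return dict(buckets)


local_shapes = {                       # hypothetical local shard shapes
    "blocks.0.mlp.w1": (128, 4096),
    "blocks.1.mlp.w1": (128, 4096),
    "blocks.0.mlp.w2": (512, 1024),
    "blocks.1.mlp.w2": (480, 1024),    # uneven shard: a new local row count
}
buckets = bucket_by_shape(local_shapes)
distinct_shapes = len(buckets)
```

If the bucket count keeps growing as ranks or presets change, the compiled path is likely to spend its budget on new specializations rather than on the step itself.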
Production checklist
- Muon is the default only for 2D hidden linears. Embeddings, LM head, biases, and norms stay on the public AdamW path. No exceptions.
Muon vs AdamW routing in the MegaCpp stack:
| Parameter group | Optimizer | State bytes/param | Notes |
|---|---|---|---|
| 2D hidden linears, experts | Muon | 2 (bf16 momentum) | Polar Express in bf16 |
| Fused QKV projections | Muon | 2 | split-QKV orthogonalization |
| Embeddings, LM head | AdamW | 8 (fp32 m+v) | no exceptions |
| Biases, RMSNorm scales | AdamW | 8 | excluded from Muon's contract |
| MTP head | AdamW | 8 | public AdamW path |
Split-QKV orthogonalization, sketched:
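# Muon optimizer: split-QKV orthogonalization (sketch)
import torch

def orthogonalize(param, grad):
    # polar_express is the module's Polar Express routine, defined elsewhere.
    sizes = getattr(param, "_qkv_split_sizes", None)
    if sizes is None:
        return polar_express(grad)
    # Orthogonalize Q, K, and V separately so their subspaces do not mix.
    q, k, v = grad.split(sizes, dim=0)
    return torch.cat([polar_express(q),
                      polar_express(k),
                      polar_express(v)], dim=0)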
- Fused QKV parameters must carry _qkv_split_sizes. If a recipe introduces a new attention projection layout, the split metadata is part of the parameter construction, not of the optimizer.
- Shape-symmetric match_rms_adamw LR scaling is the MoE default, with a correspondingly lower matrix_lr. The older asymmetric mode is documented but gated behind an explicit opt-in.
- MuonClip is on for every deep-dense or deep-hybrid preset, with tau=100. muon_clip/max_logit is a training-health signal, not a debug aid.
- Distributed Muon under Megatron-Core uses the FSDP2-style local-shard contract. Shard row counts flow into the local split-scaling helper so the split-QKV invariant survives sharding.
- Compile of the Muon step is opt-in on NVIDIA until the shape-varying stacked-grad recompile limit is healthier. TPU/XLA keeps the compiled path.
- bf16 Polar Express everywhere. The safety factor in the input normalization is not tunable without a new receipt.
- Post-step hook order is fixed: optimizer step, then MuonClip, then LR scheduler. Reordering breaks the recording contract.
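The fixed hook order from the last checklist item can be sketched as a tiny training loop. Everything here is a stand-in: SGD replaces the Muon plus AdamW pair, `muon_clip_hook` is a hypothetical placeholder for the post-step QK clip, and the `calls` list exists only to make the order visible.

```python
import torch

calls = []

def muon_clip_hook():
    """Hypothetical stand-in for the post-step QK clip over attention modules."""
    calls.append("muon_clip")

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # stand-in for Muon + AdamW
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: 0.5 ** step)

for _ in range(2):
    loss = model(torch.randn(2, 4)).pow(2).mean()   # forward: logits get recorded here
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()          # 1. parameter update
    calls.append("optimizer")
    muon_clip_hook()          # 2. rescale Q/K using what the forward recorded
    scheduler.step()          # 3. advance the LR schedule last
    calls.append("scheduler")
```

Running the clip between the optimizer step and the scheduler step means the rescale always acts on the weights that the recorded logits came from, before anything else about the next step changes.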
Frequently asked questions
What is the smallest checked-in proof that Muon is shard-aware rather than pretending every rank owns the full tensor?
The FSDP2 local-shard optimizer sample: recover Shard(0) bounds first, then scale fused-QKV split metadata against the rows the rank actually owns before the step runs. That is the same ownership boundary behind FSDP2: pain and payoff and why compiled Muon only pays when local shard shapes stay stable enough to avoid a recompile storm.
Why is GB10 a parity and debugging lane for Muon rather than a promised throughput lane?
On GB10-class LPDDR lanes the reconstruct-and-update contract is much more likely to show up as a memory-traffic cost first, and the optimizer wall-clock fraction barely moves between Muon and AdamW. We still run Muon there for numerical parity with H200.
Why does MoE expert shape symmetry belong in the Muon receipt?
MoE presets need the shape-symmetric match_rms_adamw scaling mode; otherwise an expert can look stable while its paired matrices receive different effective update scales. That is the optimizer-side continuation of Expert parallel and MoE sharding and the shard-local proof in the FSDP2 local-shard optimizer sample.
Why does compiled Muon stay opt-in on NVIDIA even after the shard-aware local step is working?
The local step has to recover Shard(0) bounds, then rescale fused-QKV split metadata against the rows this rank actually owns. Once those local row counts keep changing across stacked groups, the compiled optimizer path sees new shapes, burns its budget on fresh Dynamo guards, and stops paying back the compile cost. That is why the NVIDIA lane still keeps compiled Muon as an explicit opt-in and why Dynamo and torch.compile breakage and The compile-time tax we accept for runtime speed are the right adjacent receipts.
Why gate cautious weight decay on the Muon update instead of the raw gradient?
We tried the raw-gradient gate once and it regressed a deep dense receipt back to immediate NaNs, so the gate stays on agreement between the update and the current parameter sign.
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
FSDP2: PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.
GB10: Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.
H200: NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
Expert parallelism (EP): Partitions MoE experts across GPUs. For example, 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.
MoE: Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
DTensor: PyTorch's mesh-backed distributed-tensor abstraction: one logical tensor with explicit shard or replica metadata across ranks.
RMSNorm: Root-mean-square normalization used as an explicit contract seam in the wrapped Mamba3 and Megatron integration paths.
Megatron Core: The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.
Scheduler (serving): The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.