Grounded engineering note from the MegaCpp stack
Published · 12 min read · David Gornshtein
Tags: MoE, DeepEP, NVSHMEM, All-to-All, H200, NVIDIA, Fused MoE

Fused MoE and DeepEP on NVIDIA: the dispatch layer we ship

How MegaCpp dispatches MoE tokens on H200 and GB10: DeepEP NVSHMEM all-to-all on NVLink and IB, fused expert GEMM, expert sharding, drop policies, and how the kernel layer interacts with our eight-specialist routing.


If you ship MoE on NVIDIA at any non-trivial expert count, the dispatch layer is the entire performance story. The router is a few percent of the FLOPs and almost none of the wall clock; the all-to-all and the expert GEMM are everything. This post is the NVIDIA-only counterpart to the routing decisions writeup: the same 64-expert / top-6 / 8-specialist setup, but with the focus on how DeepEP, the Megatron-Core flex dispatcher, and the fused expert path actually move tokens on H200 multi-node systems and GB10. Routing policy is taken as given; we are talking about the wires under the floor.

Why MegaCpp cares about this

An 8-specialist ensemble runs MoE layers at every E-block in a depth-52 hybrid preset. With 64 routed experts, top-6 routing, and EP=4 on a single 8-GPU H200 node (two MoE EP groups co-located with FSDP), profiling on the standard NCCL all-to-all path attributed roughly a third of the step time to the dispatch / combine pair, with another double-digit share burned on per-call buffer fills and token re-accumulation during the compact-buffer build. The Megatron-Core team had already published the cure in the Megatron-Core fused all-to-all path (NVIDIA Megatron-LM): NVSHMEM-direct dispatch via DeepEP, with overlap handles for true comm/compute overlap, plus the fused permute kernel from Transformer Engine on the local side. That is the path we adapted.

The constraint that shaped the design is mundane: we have to fall back. CI runs on macOS, ablation paths run on a single H200 without IB, GB10 has no second host, and the offline harness needs to import the MoE module without NVSHMEM in the process. Every layer of the dispatch stack has a non-DeepEP twin, and the choice between them is made at construction time from the environment, not at every forward call.

What we built in the public MegaCpp MoE dispatch path

The MoE substrate in the codebase is three modules deep, plus a bridge.

The main MoE runtime module is the configuration and the soft-routed reference. Its config object carries the static knobs (routed and shared expert counts, top_k, expert widths, the FP8 alignment constant, and the auxiliary-loss coefficients), and the layer ships a fully-soft variant that computes all experts and weights the outputs. The soft variant is XLA-safe and exists for ablation, never for deployment. The interesting bookkeeping is the per-rank expert-slice mapping, which is the contract every dispatcher reads to know what slice of the expert table this rank owns. The router itself runs in fp32 because sigmoid logits at bf16 produce just enough scoring noise to flip top-k membership across reruns; we kept fp32 routing as a hard rule across every dispatcher variant.
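As a rough illustration of the per-rank expert-slice contract, the sketch below assumes experts are split into equal contiguous blocks across the expert-parallel group. The function names and the contiguous layout are assumptions made for illustration, not the module's actual API.

```python
def expert_slice(ep_rank: int, num_experts: int, ep_size: int) -> range:
    """Return the contiguous block of expert ids owned by one EP rank.

    Assumes an even, contiguous split; a real layout may differ.
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank out of range")
    per_rank = num_experts // ep_size
    return range(ep_rank * per_rank, (ep_rank + 1) * per_rank)


def owner_of(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Inverse lookup: which EP rank owns a given expert id."""
    return expert_id // (num_experts // ep_size)


# With the 64-expert, EP=4 setup from this post, each rank owns 16 experts.
print(len(expert_slice(0, 64, 4)))  # 16
print(owner_of(47, 64, 4))          # 2
```

Every dispatcher variant can read the same mapping, which is what keeps the fallback paths interchangeable.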

The MoE dispatch runtime module is the token-exchange dispatcher. It implements two modes: a CUDA variable-split path that packs only active tokens contiguously and exchanges them with per-rank split sizes through the standard all-to-all primitive, and an XLA equal-split path that pads every rank's slice to a fixed capacity so the compile graph stays static. The CUDA path is the one that matters for NVIDIA. It owns a warm buffer cache (a Megatron-style GlobalMemoryBuffer clone that pre-allocates flat storage and returns a fresh tensor view per call) to kill the cudaMalloc and buffer-fill storm we saw in nsys. The differentiable backward goes through either the synchronous exchange wrapper or the overlap-friendly async wrapper. On top of that scaffold, three opt-in paths layer:
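To make the variable-split idea concrete, here is a minimal pure-Python sketch of the bookkeeping: given each token's chosen experts, it packs the (token, expert) slots contiguously by destination rank and derives the per-rank split sizes. The function name and the contiguous expert-to-rank layout are assumptions; the real path works on tensors.

```python
def build_send_plan(topk_experts, experts_per_rank, ep_size):
    """Pack routed slots by destination rank and count them per rank.

    topk_experts: one list of chosen expert ids per local token.
    Returns (packed_token_indices, send_splits).
    """
    slots = []
    for token_idx, experts in enumerate(topk_experts):
        for expert_id in experts:
            dest_rank = expert_id // experts_per_rank
            slots.append((dest_rank, expert_id, token_idx))

    # Sorting groups slots by destination rank, then by expert within a rank.
    slots.sort()

    send_splits = [0] * ep_size
    for dest_rank, _, _ in slots:
        send_splits[dest_rank] += 1

    packed_token_indices = [token_idx for _, _, token_idx in slots]
    return packed_token_indices, send_splits


# Three tokens, top-2 routing, 8 experts over 2 ranks (4 experts per rank).
packed, splits = build_send_plan([[0, 5], [4, 1], [7, 6]], 4, 2)
print(packed)  # [0, 1, 1, 0, 2, 2]
print(splits)  # [2, 4]
```

Split sizes like these are the kind of input a variable-size all-to-all takes (for example the `input_split_sizes` argument of `torch.distributed.all_to_all_single`); only active slots travel, so no padding is exchanged.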

First, the TE permute path. When the TE permute lane is enabled and TE is importable, the dispatcher swaps the local compact-buffer build for the vendor fused permute kernel. The unpermute side still goes through the argsort inverse for the foreseeable horizon because the combine path is harder to validate against the local baseline.

Second, the Megatron MoE utils path. The Megatron MoE integration module is a thin wrapper around Megatron-Core's MoE helpers for permute, unpermute, and group-limited top-k with our index/probability conventions. When the Megatron module is importable and the env flag is on, the dispatcher calls into it instead of the local token re-accumulation path. This is the bridge we use on machines that have Megatron installed but not DeepEP.

Third, the DeepEP path. When the DeepEP lane is enabled, the dispatcher delegates to the public DeepEP bridge for the actual exchange.

The public DeepEP bridge wraps the transport buffer object. The interesting code is the buffer cache: transport buffers are sized once per process group from the vendor's NVLink and RDMA size hints, then memoised by group identity. We never re-create a buffer unless the group changes or the required size grows. This is the same pattern Megatron uses in its buffer helper, and it matters because allocating an NVSHMEM slab is not cheap and doing it on every forward would defeat the point. The dispatcher returns the received activations, routing metadata, per-expert token counts, and the opaque handles needed to reverse the exchange. One handle carries the layout DeepEP needs to undo the dispatch; the overlap handle is the synchronisation point we defer until just before the combine fires, so the expert GEMM runs in parallel with the dispatch wait.
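The memoise-by-group, grow-only behaviour described above can be sketched without any GPU dependency. In this illustration the allocator is injected as a callable standing in for the real transport-buffer constructor, and the class and parameter names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class _Entry:
    nvl_bytes: int
    rdma_bytes: int
    buffer: object


class TransportBufferCache:
    """Keep one transport buffer per process group; reallocate only on growth."""

    def __init__(self, allocate):
        # allocate(nvl_bytes, rdma_bytes) -> buffer; a stand-in for the
        # expensive NVSHMEM-backed allocation.
        self._allocate = allocate
        self._entries = {}

    def get(self, group_key, nvl_bytes: int, rdma_bytes: int):
        entry = self._entries.get(group_key)
        if (
            entry is not None
            and entry.nvl_bytes >= nvl_bytes
            and entry.rdma_bytes >= rdma_bytes
        ):
            return entry.buffer  # cache hit: the existing buffer is big enough

        if entry is not None:
            # Never shrink either dimension when growing the other.
            nvl_bytes = max(nvl_bytes, entry.nvl_bytes)
            rdma_bytes = max(rdma_bytes, entry.rdma_bytes)

        buffer = self._allocate(nvl_bytes, rdma_bytes)
        self._entries[group_key] = _Entry(nvl_bytes, rdma_bytes, buffer)
        return buffer
```

In steady state every forward call is a dictionary lookup; the allocator only runs on the first call for a group or when the size hints grow.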

That overlap handle only matters when the surrounding schedule has real work to hide the transport under. In our hybrid plan that usually means an adjacent Mamba or DSA node; on tiny decode-style batches it mostly trims jitter, while the durable wall-clock win shows up once prefill or medium-to-large decode batches leave enough independent work on the other stream.

The fused MoE kernel module is the device-local expert compute. It picks one of three implementations at runtime: a Triton fused kernel that ties top-k routing, sort-by-expert, jagged grouped GEMM, activation, second GEMM, and weighted scatter into one pipeline; a torch.grouped_mm persistent grouped GEMM path on torch ≥ 2.10; and a pure PyTorch reference loop. The Triton path is the one we evaluated against; its own warm buffer cache keeps the per-call permute / combine buffers hot for the same reason the MoE dispatch runtime module does. The cuDNN SwiGLU grouped GEMM path on SM100 is wired through a Blackwell-specific helper; on SM90 we fall back to the Triton + grouped-mm pair.

The Megatron MoE integration module is the Megatron-Core glue mentioned above. The two helpers worth flagging are the permute path and the group-limited top-k path. Group-limited routing (limit routing to a subset of expert groups per token, then take top-k within those groups) composes with our 8-specialist hierarchical routing because each specialist owns its own group. We do not invoke it in the main training loop yet (the specialist boundary is enforced higher up), but it is wired through so we can A/B against the global top-k policy when the spec changes.
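For readers who have not seen group-limited routing, a single-token sketch in plain Python follows. It assumes a group is scored by its best expert; that scoring rule is an assumption for illustration, and the Megatron-Core helper may rank groups differently.

```python
def group_limited_topk(scores, num_groups, groups_per_token, top_k):
    """Pick top_k experts for one token, restricted to the best groups.

    scores: per-expert router scores, experts laid out group by group.
    """
    if len(scores) % num_groups != 0:
        raise ValueError("expert count must be divisible by num_groups")
    size = len(scores) // num_groups

    # Step 1: score each group (assumption: by its maximum expert score).
    group_scores = [max(scores[g * size:(g + 1) * size]) for g in range(num_groups)]
    kept = sorted(range(num_groups), key=lambda g: group_scores[g], reverse=True)
    kept = kept[:groups_per_token]

    # Step 2: ordinary top-k, but only over experts in the kept groups.
    candidates = [e for g in kept for e in range(g * size, (g + 1) * size)]
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]
    return sorted(chosen)


scores = [0.9, 0.1, 0.2, 0.3, 0.8, 0.7, 0.05, 0.75]
# Global top-3 would pick experts 0, 4, 7; limiting to 2 of 4 groups excludes 7.
print(group_limited_topk(scores, num_groups=4, groups_per_token=2, top_k=3))  # [0, 4, 5]
```

The example shows where the two policies diverge: expert 7 has the third-highest score overall but sits in a group that did not make the cut.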

Drop policy in the public MegaCpp MoE path is simple: capacity-factor drops are off by construction. The variable-split CUDA path sends every token, and the XLA equal-split path pads to a static capacity that is provably larger than the worst-case load (cap_per_rank = BT * max_active_slots_per_token). When the auxiliary load-balance loss does its job, the padded headroom is small. When it does not, we surface the imbalance as a metric rather than silently dropping; that decision predates the DeepEP lift and we kept it. DeepEP itself supports a drop mode; we do not enable it.
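The capacity formula can be checked with a few lines of arithmetic. The sketch below reads BT as the number of tokens on the sending rank and treats max_active_slots_per_token as a given input; both readings are assumptions, as are the helper names.

```python
def cap_per_rank(bt: int, max_active_slots_per_token: int) -> int:
    """Static per-destination capacity for the equal-split path."""
    return bt * max_active_slots_per_token


def fits_without_drops(send_splits, cap: int) -> bool:
    """True if every destination's slot count fits inside the padded slice."""
    return all(0 <= n <= cap for n in send_splits)


cap = cap_per_rank(bt=8, max_active_slots_per_token=6)
print(cap)  # 48

# Worst case: all 8 tokens send all 6 slots to the same rank. Still fits.
print(fits_without_drops([48, 0, 0, 0], cap))    # True
# Balanced case: the same 48 slots spread evenly.
print(fits_without_drops([12, 12, 12, 12], cap)) # True
```

Since a sender cannot produce more than BT times max_active_slots_per_token slots in total, no single destination can overflow its slice, which is why no drop branch is needed.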

How it lands in MegaCpp

The deployment substrate is Megatron-LM with the flex MoE dispatcher.

The Megatron argument bridge builds the argument fragment that turns this on. The relevant chunk: --moe-token-dispatcher-type flex, --moe-router-dtype fp32, --moe-permute-fusion, and conditionally --moe-grouped-gemm. The flex dispatcher is Megatron's name for the DeepEP path: pre-allocated fixed buffers instead of the alltoall path that creates very large transient tensors at higher EP, NVLink for intranode transfer, and NVSHMEM IBGDA for multi-node. It silently falls back to the alltoall dispatcher if deep_ep is not importable, which is the behavior we want on the paths where DeepEP is not installed. --moe-router-dtype fp32 is non-negotiable because DeepEP only consumes fp32 router probabilities; this matches the public MegaCpp hard rule.
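A minimal sketch of such an argument fragment builder is shown below. It uses only the four flags named above; the function name and the `use_grouped_gemm` parameter are hypothetical, and the condition that decides grouped GEMM is left to the caller.

```python
def build_moe_dispatch_args(use_grouped_gemm: bool) -> list[str]:
    """Return the CLI fragment that selects the flex dispatcher lane."""
    args = [
        "--moe-token-dispatcher-type", "flex",
        "--moe-router-dtype", "fp32",   # fp32 router probabilities, always
        "--moe-permute-fusion",
    ]
    if use_grouped_gemm:
        args.append("--moe-grouped-gemm")
    return args


print(" ".join(build_moe_dispatch_args(use_grouped_gemm=True)))
```

Returning a flat list keeps the fragment easy to splice into a larger launcher command and easy to assert on in tests.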

The expert sharding is straightforward. With 64 experts and EP=4, each rank owns 16 experts. The TE GroupedLinear path described in the Transformer Engine bridge writeup is what executes the expert GEMM per rank; jagged token counts dispatch as a single fused kernel and the FP8 current-scaling recipe carries through.

The hybrid schedule plan is what makes the dispatch overlap actually work in our hybrid layer mix. Megatron's combined-1F1B schedule plan assumes a stack of pure transformer layers; our model interleaves Mamba M-blocks, MoE E-blocks, and DSA attention layers. Opaque schedule wrappers wrap the Mamba and DSA layers so they execute as compute-only nodes while the MoE dispatch / combine on adjacent layers progresses on the comm stream. The MoE layers stay decomposed into separate dispatch, expert-compute, and combine schedule nodes so the all-to-all overlaps with the next layer's compute, the expert GEMM overlaps with the previous layer's combine wait, and the schedule plan keeps two streams busy across the hybrid pattern. Without this patch, the comm stream goes idle every time we hit an M or DSA layer, and the DeepEP win evaporates.

The selective FP8 MoE patch enforces the layer-aware FP8 scope: only E-blocks enter the FP8 compute zone. The MoE side of the dispatch (router fp32 → permute fp8 → grouped GEMM fp8 → activation fp8 → grouped GEMM fp8 → unpermute fp8 → combine bf16) is the only place FP8 fires. Attention and Mamba stay bf16. This survives because the loss curve does not drift the way it does when FP8 enters early-layer attention.

The pieces that did not survive the lift: the local Triton fused MoE kernel from the fused MoE kernel module is not what runs in deployment. Megatron's TE-backed grouped GEMM is faster on H200 and integrates with the FP8 global state manager; the Triton path stays as the single-GPU offline kernel and the comparison baseline. The reference token-exchange dispatcher is also not on the main critical path; Megatron's flex dispatcher owns dispatch / combine. We kept both around because they are the only thing that runs on paths without Megatron, and because the metric instrumentation in the reference dispatcher is what we use to debug routing imbalance.

We keep both the TE permute lane and the Triton jagged lane because they optimize different shapes. TE is the clean dense path when routed tokens already fit the vendor permutation contract; the Triton fused lane is the escape hatch when token counts per expert get ragged and padding waste would dominate.

The expert-bank pad/strip helpers move with the TE GroupedLinear wrapper into the main path. FP8 tensor cores require splits divisible by 16, and that is true on every path.
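The pad/strip idea is simple enough to show on plain lists. In this sketch each expert's token rows are padded up to the next multiple of 16 before the grouped GEMM and the padding is removed afterwards; the function names and the list-of-rows representation are illustrative assumptions, not the real helpers.

```python
def pad_expert_bank(rows, token_counts, multiple=16, pad_row=None):
    """Pad each expert's run of rows up to a multiple of `multiple`.

    rows: expert-sorted rows, concatenated; token_counts: rows per expert.
    Returns (padded_rows, padded_counts).
    """
    padded, padded_counts, start = [], [], 0
    for n in token_counts:
        target = -(-n // multiple) * multiple  # ceiling to the next multiple
        padded.extend(rows[start:start + n])
        padded.extend([pad_row] * (target - n))
        padded_counts.append(target)
        start += n
    return padded, padded_counts


def strip_expert_bank(padded, token_counts, padded_counts):
    """Inverse of pad_expert_bank: drop the padding rows again."""
    out, start = [], 0
    for n, p in zip(token_counts, padded_counts):
        out.extend(padded[start:start + n])
        start += p
    return out


rows = list(range(21))
padded, counts = pad_expert_bank(rows, token_counts=[5, 0, 16])
print(counts)       # [16, 0, 16]
print(len(padded))  # 32
```

An expert that received no tokens stays at zero rows, so padding cost only appears where there is real work.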

Ablations and what we kept

The wins we kept: DeepEP via --moe-token-dispatcher-type flex for any run with EP>1 and a working deep_ep install, TE permute fusion via --moe-permute-fusion on the local side of the dispatch, fp32 router probabilities everywhere, FP8 grouped GEMM for the expert compute, the warm buffer-cache pattern (lifted into Megatron's GlobalMemoryBuffer where it already exists), and the layer-aware FP8 scope.

The wins we did not keep: Expert Choice routing (broken at autoregressive batch size 1, see the routing post), capacity-factor token dropping (we surface imbalance as a metric instead), and the post-hoc TE Linear walk on top of Megatron's TE spec (double-wrap).

Expert Choice stayed out of the deployed lane for a structural reason: at autoregressive batch size 1, each expert has only one token to rank, so expert-first selection has no meaningful batch to work with. The production lane therefore stays with token-choice routing plus explicit imbalance metrics rather than trying to rescue Expert Choice in inference, the same split argued in the routing decisions writeup.
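The batch-size-1 problem can be seen in a toy router. This is an illustrative sketch with made-up scores and hypothetical function names, not the production router: token choice ranks experts per token, while expert choice ranks tokens per expert and degenerates when only one token exists.

```python
from typing import Dict, List


def token_choice(scores: List[List[float]], top_k: int) -> List[List[int]]:
    """Each token picks its top_k experts; scores[t][e] is the router score."""
    picks = []
    for row in scores:
        ranked = sorted(range(len(row)), key=lambda e: row[e], reverse=True)
        picks.append(ranked[:top_k])
    return picks


def expert_choice(scores: List[List[float]], capacity: int) -> Dict[int, List[int]]:
    """Each expert picks its `capacity` highest-scoring tokens."""
    num_tokens = len(scores)
    num_experts = len(scores[0])
    picks = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        picks[e] = ranked[:capacity]
    return picks


# One decode step: a single token scored against four experts (made-up values).
one_token = [[0.05, 0.60, 0.10, 0.25]]

per_token = token_choice(one_token, top_k=2)     # ranks experts for the token
per_expert = expert_choice(one_token, capacity=1)  # every expert ranks one token
```

Here token choice still selects the two best-scoring experts, while every expert in the expert-choice variant "selects" token 0 regardless of its score, because there is nothing else to rank against.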

The neutral outcomes: blanket --moe-grouped-gemm is essentially a no-op on the TE path because TE GroupedLinear is already the GEMM, but the flag stays on for clarity and for the paths that drop back to Megatron's non-TE GroupedMLP. The overlap handle from DeepEP composes cleanly with the hybrid schedule plan; the gain over the synchronous variant is in the 8-15% range on the depth-52 preset, which is the difference between "DeepEP is a nice-to-have" and "DeepEP is the entire point".
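To make the overlap claim concrete, here is a toy cost model, assuming the transport can hide fully under independent adjacent work. The millisecond values are invented for illustration and are not measurements from the depth-52 preset; the function names are hypothetical.

```python
def step_time_sync(expert_ms: float, adjacent_ms: float, comm_ms: float) -> float:
    """Synchronous variant: transport sits on the critical path."""
    return expert_ms + adjacent_ms + comm_ms


def step_time_overlapped(expert_ms: float, adjacent_ms: float, comm_ms: float) -> float:
    """Overlapped variant: transport runs concurrently with the adjacent work."""
    return expert_ms + max(adjacent_ms, comm_ms)


def overlap_gain(expert_ms: float, adjacent_ms: float, comm_ms: float) -> float:
    """Fraction of the synchronous step time removed by overlapping."""
    sync = step_time_sync(expert_ms, adjacent_ms, comm_ms)
    return (sync - step_time_overlapped(expert_ms, adjacent_ms, comm_ms)) / sync


# Made-up numbers: 30 ms expert GEMM, 50 ms adjacent work, 20 ms transport.
example_gain = overlap_gain(expert_ms=30.0, adjacent_ms=50.0, comm_ms=20.0)
```

In this model the gain shrinks to zero when there is no adjacent work to hide behind, which is the intuition behind treating the overlap handle as the point of the integration rather than an extra.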

The boring engineering: every PR runs the dispatcher A/B harness (local re-accumulation baseline, TE permute path, Megatron MoE utils path, DeepEP path) on a 2-rank smoke test. If any of the four paths regresses by more than 2% on token throughput at the same model snapshot, the PR does not land. This is the only reason the fallback paths still work after a year of moving targets on the DeepEP and Megatron sides.
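The 2% gate itself is simple to express. The sketch below assumes the harness already produces one tokens-per-second number per path; the path keys and the function name are hypothetical labels for the four paths listed above.

```python
from typing import Dict, List

# Hypothetical keys for the four dispatcher paths named in the text.
PATHS = ("local_reaccumulation", "te_permute", "megatron_moe_utils", "deepep")
MAX_REGRESSION = 0.02  # from the text: more than 2% blocks the PR


def regressed_paths(baseline: Dict[str, float], candidate: Dict[str, float],
                    max_regression: float = MAX_REGRESSION) -> List[str]:
    """Return the paths whose token throughput dropped by more than the limit."""
    failing = []
    for path in PATHS:
        drop = (baseline[path] - candidate[path]) / baseline[path]
        if drop > max_regression:
            failing.append(path)
    return failing


def pr_may_land(baseline: Dict[str, float], candidate: Dict[str, float]) -> bool:
    """A PR lands only when no path regresses beyond the limit."""
    return not regressed_paths(baseline, candidate)
```

Checking all four paths, not just the DeepEP one, is what keeps the fallbacks honest: a change that speeds up the main path but slows a fallback by 3% still fails the gate.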

Production checklist

DeepEP dispatch surface

| Path | Transport | Where it wins | Where it loses |
| --- | --- | --- | --- |
| Intra-node DeepEP | NVSHMEM over NVLink | H200:8 single host, low-jitter | offers nothing on a 1-GPU box |
| Inter-node DeepEP | NVSHMEM over IB | multi-node MoE training | needs IB topology to actually be IB |
| Megatron-Core a2a fallback | NCCL all-to-all | small expert counts, debugging | jitter at high EP, no overlap with GEMM |
| GroupedGEMM expert path | CUTLASS grouped GEMM | dense per-expert tokens | thin experts under-utilize the SMs |

A representative dispatch shape we use on H200:8:

dispatch_profile = {
    "num_experts": 64,
    "top_k": 2,
    "capacity_factor": 1.25,
    "dispatch": "DeepEP over NVSHMEM with combine overlap",
    "expert_gemm": "grouped_gemm",
}
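For a sense of scale, the sketch below derives two numbers from the profile above: how many experts each rank owns at EP=8, and a per-expert buffer size under one common reading of capacity_factor (perfectly balanced load times the factor, rounded up). The contiguous expert-to-rank layout, the capacity formula, and the helper names are assumptions for illustration, not statements about the shipped dispatcher.

```python
import math

# Numeric fields copied from the profile above; EP_SIZE assumes one H200:8 host.
profile = {"num_experts": 64, "top_k": 2, "capacity_factor": 1.25}
EP_SIZE = 8


def experts_per_rank(num_experts: int, ep_size: int) -> int:
    """Experts owned by each rank when experts divide evenly across EP ranks."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    return num_experts // ep_size


def owner_rank(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Rank holding an expert, assuming a contiguous block layout."""
    return expert_id // experts_per_rank(num_experts, ep_size)


def expert_capacity(num_tokens: int, cfg: dict) -> int:
    """Per-expert slot count: balanced load scaled by capacity_factor."""
    balanced = num_tokens * cfg["top_k"] / cfg["num_experts"]
    return math.ceil(balanced * cfg["capacity_factor"])
```

With 4096 tokens in flight, the balanced load is 128 token slots per expert and the 1.25 factor sizes each expert buffer at 160, while each of the 8 ranks owns 8 of the 64 experts.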
FAQ

Frequently asked questions

When does the DeepEP overlap handle actually buy wall-clock time?
When the schedule has real adjacent work to hide transport under. In the hybrid lane that usually means a neighboring Mamba or DSA node wrapped as an opaque compute step while the DeepEP dispatch or combine is still in flight. On tiny decode-style batches there is less work to overlap, so the gain is mostly lower jitter rather than a full step-profile rewrite.
Why does Expert Choice stay out of the deployed decode lane?
Because at autoregressive batch size 1 there is no meaningful expert-first batch to rank across. The deployed lane therefore stays with token-choice routing plus explicit imbalance metrics instead of pretending Expert Choice is solving a batch-structure problem that is not present in single-sequence decode.
Why keep both TE permute and the Triton jagged lane?
Because they win on different token shapes. TE permute is the clean local path when routed tokens already fit the vendor permutation contract, while the Triton lane is the escape hatch when expert loads are ragged enough that padding waste would dominate a denser permute-first path.
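The "ragged enough that padding waste would dominate" condition from the last answer can be approximated with a small heuristic. This sketch assumes a pad-to-widest-expert layout as a stand-in for the denser path, and the 0.5 threshold is an arbitrary placeholder; neither the real padding rule nor the real switch point is given in this post.

```python
from typing import List


def padding_waste(tokens_per_expert: List[int]) -> float:
    """Fraction of a pad-to-widest buffer that would hold filler, not tokens."""
    if not tokens_per_expert:
        return 0.0
    widest = max(tokens_per_expert)
    if widest == 0:
        return 0.0
    total_slots = widest * len(tokens_per_expert)
    return 1.0 - sum(tokens_per_expert) / total_slots


def pick_local_lane(tokens_per_expert: List[int], max_waste: float = 0.5) -> str:
    """Hypothetical selector: jagged lane when padding would dominate."""
    if padding_waste(tokens_per_expert) > max_waste:
        return "triton_jagged"
    return "te_permute"
```

A balanced load such as 100 tokens on each of four experts wastes nothing and stays on the permute path, while a load like 400, 10, 5 and 1 tokens would leave about three quarters of a padded buffer empty and tips the choice toward the jagged lane.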
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

EP

Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.

sm_100

Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

CUTLASS

NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

Topic hubs