DualPipe and 3D Parallelism on H200 and GB10
How MegaCpp lays out the TP × PP × DP × EP cube on H200 multi-node systems and GB10, integrates DualPipe / DualPipeV with our hybrid layer pattern, accounts for pipeline bubbles, and launches the deployment training job.

Pipeline parallelismQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample is the axis everyone wants to skip and nobody can. With a depth-52 hybrid model and an 8-way specialist ensemble we cannot fit the largest preset under FSDP-only sharding on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200:8. The 3D cube — TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding × PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample × DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP × EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding — is the answer, and DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingExample: DualPipe schedule sample Example: DualPipe stage contract sample Reference: parallelism map overview / DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview is the schedule that buys back the bubble pure 1F1B leaves on the floor. This post is the layout we run, the DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingExample: DualPipe schedule sample Example: DualPipe stage contract sample Reference: parallelism map overview integration that survived our hybrid layer pattern, and the launch story.
One first-touch distinction prevents a lot of confusion here. Megatron virtual pipeline parallelismQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample, usually shortened to VPP, is still PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample with interleaved or virtual stages inside the Megatron schedulerQuick term guideschedulerThe per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.GroundingAbout: inference serving stack Reference: observability and SLO dashboards Reference: KV cache and paged attention. DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview is a public DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingExample: DualPipe schedule sample Example: DualPipe stage contract sample Reference: parallelism map overview variant with a V-shaped overlap schedule. They attack related bubble costs, but they are not the same runtime surface and not the same schedule owner.
Why MegaCpp cares about this
The constraint is shape, not size. A depth-52 hybrid preset interleaves A-blocks (dense attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns), M-blocks (Mamba), E-blocks (MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack), and DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample-attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns layers in patterns like AEMEAEMEAEMRQuick term guideAEMEAEMEAEMRA concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.GroundingAbout: MegaCpp model glossary Example: NAM56R pattern composition sample Example: NAM56R Megatron plan sample. The blocks have wildly different parameter counts and activation costs: E-blocks dominate parameters (64 experts each), M-blocks dominate activation memory (selective-scan state), DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample layers carry the indexer and absorbed-MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: shared MLA adapter boundaries caches. Naive layer-count partitioning produces stages that differ by 5-10x and pushes us off the device on the heaviest stage.
So the job is twofold: lay out a meshQuick term guidemeshThe named device grid that defines which logical axis maps to which TPU or distributed-device axis before sharding annotations make sense.GroundingAbout: XLA SPMD sharding annotations Example: 3D parallelism sample Reference: FSDP2 on XLA TPU that fits — TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding for dense, PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample for depth, DP/FSDP2 for replication, EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding for MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack — then schedule the pipeline so the bubble stops being the dominant cost. DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingExample: DualPipe schedule sample Example: DualPipe stage contract sample Reference: parallelism map overview and DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview give us bidirectional F/B overlap plus a W (weight-grad) decomposition that drives the bubble below 1F1B at the same stage count. The codebase built the cube and the DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingExample: DualPipe schedule sample Example: DualPipe stage contract sample Reference: parallelism map overview wrapper end-to-end; MegaCpp lifts the layout planning while delegating the schedule to a focused DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview integration on top of upstream Megatron.
What we built in the public MegaCpp parallelism path
Five modules carry the parallelism story.
The distributed parallelism surface is the cube. ParallelismConfig(pp, dp, tp, ep) plus pp_schedule and pp_microbatches is the entire surface; validate_3d_config checks the layout against the model config. The validator is load-bearing: PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample must divide layer count, TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding must divide n_head and n_kv_head, EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding must divide moe_n_routed_experts, and the depth-52 preset gets its own checks (pp ∈ {1, 2, 4, 13}, tp ∈ {1, 2, 4, 8} to keep n_kv_head=8 divisible). build_3d_mesh calls init_device_mesh with named axes ("pp", "dp", "tp", "ep"). apply_3d_parallelism applies the strategies in the only order that composes: TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample, FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample, EP. This is the torchtitan-canonical order and what Megatron's parallel_state follows; deviating leaves FSDP trying to shard parameters PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample has already moved off-rank.
pp_microbatches sits in that validator as a planner warning, not as a
shape-validity law. The checked-in
3D parallelism sample only
warns when pp_microbatches < pp because the run can still be correct; what
degrades is overlap and bubble amortization, not tensor legality. That is why
we still allow low-microbatch smoke runs for correctness bringup and keep the
throughput bar in
Gradient accumulation and microbatching.
The pipeline-partitioning module owns the layer partition. partition_model does layer-count balanced splits; partition_model_weighted does parameter-count balanced splits, which is what we use for any preset with E-blocks (a 64-expert E-block is roughly 10× the weight of an A-block). create_pipeline_stage materialises a PipelineStageModule that owns the embedding on stage 0 and the final norm + lm_head on the last stage. The schedule abstraction routes through torch.distributed.pipelining. The aux-loss injector (_AuxLossInjector, mirroring Megatron's MoEAuxLossAutoScaler) attaches MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack / MoD aux losses to the hidden-state backward graph on non-terminal stages; without it those losses get cleared at the next forward and never see a gradient.
The Megatron TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding adapter is the TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding layer swap. After model construction and before FSDP, it walks the module tree and replaces nn.Linear with Megatron's ColumnParallelLinear / RowParallelLinear. The choice is communication direction: column-parallel splits output dim and reduces in backward (QKV), row-parallel splits input dim and reduces in forward (output projection, second MLP linear). The wrapper unwraps Megatron's (output, bias) return so callers see a plain tensor. Faster than DTensorQuick term guideDTensorPyTorch's mesh-backed distributed-tensor abstraction: one logical tensor with explicit shard or replica metadata across ranks.GroundingAbout: EP / PP / TP / CP / SP / DP overview Example: 3D parallelism sample Reference: FSDP2 on XLA TPU TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding and the path we use for TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding > 1 on Hopper.
The Megatron block wrapper is the alternative: wrap a Megatron TransformerLayer (TE spec) and call it as a research-project-shaped block. It does not own our hybrid features (Engram, MHC, DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample, Mamba, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack), so it is only useful for A-blocks where the architectural features live elsewhere. The Megatron bridge is the lazy import firewall; same pattern as the TE bridge, TPU path unaffected.
The checked-in STP sample is the Semantic Tube Prediction auxiliary-loss reminder in this bundle. It is here because layer placement matters: Variant C samples hidden states from layers [0, 4, 8, 12], and that aggregation must live on a stage that owns those layers. Under PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample > 1 we restrict STP to single-layer mode because the cross-stage gather forces an extra P2P that defeats the bubble reduction.
The research DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingExample: DualPipe schedule sample Example: DualPipe stage contract sample Reference: parallelism map overview schedule layer is the schedule. Standard DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingExample: DualPipe schedule sample Example: DualPipe stage contract sample Reference: parallelism map overview pairs ranks: N ranks hold N stages, each rank stores 2 stages (its own + its mirror's), mirror ranks must sync grads, memory per rank is 2/N — meaningful only at N ≥ 4. DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview is the V-shape variant: N ranks hold 2N stages, each rank holds 2 unique stages (no mirror), only rank 0 provides input. Memory per rank is 1/N — half DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingExample: DualPipe schedule sample Example: DualPipe stage contract sample Reference: parallelism map overview's at the same rank count. The bubble formula (PP/2 - 1) × (F + B - 3W) collapses to zero at PP=2 (no real overlap, only two stages); at PP ≥ 4 the bidirectional schedule eats most of the bubble and the W decomposition fills the rest. The stage builder constructs the stage modules, the P2P configurator sets the inter-stage shape, the DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingExample: DualPipe schedule sample Example: DualPipe stage contract sample Reference: parallelism map overview loss wrapper owns the criterion, and the gradient sync closes the loop on reduction.
Non-obvious bookkeeping inside the integration: non-terminal stages must enable the aux-loss injector (otherwise MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack / MoD aux losses are dropped at the next forward), output deallocation must be disabled (DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingExample: DualPipe schedule sample Example: DualPipe stage contract sample Reference: parallelism map overview overlaps F and B across micro-batches; a stage's output can be needed after the next micro-batch's forward starts, and Megatron's deallocation pattern would corrupt the autograd graph), and loss must be scaled by grad_accum_steps × num_chunks so accumulated gradients match the non-PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample path. These are the bugs we hit in that order.
How it lands in MegaCpp
The deployment substrate is Megatron-LM with a 2-rank DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview carved out of the world.
The deployment DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview wrapper is the bridge. The decision that shaped everything else: Megatron runs with --pipeline-model-parallel-size 1. Each Megatron rank holds the full 52-layer model; we do not let Megatron split the pipeline. We carve a dedicated 2-rank DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingExample: DualPipe schedule sample Example: DualPipe stage contract sample Reference: parallelism map overview process group out of the world — on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200:8 the pipe groups are (0,1), (2,3), (4,5), (6,7), giving DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP=4 replicas. DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview splits the 52 layers into 4 virtual stages of 13 layers, assigning (stage0, stage3) to pipe_rank=0 and (stage1, stage2) to pipe_rank=1 — the canonical V-shape. DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP gradient sync runs across same-pipe_rank peers via standard new_group collectives.
Why PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample=1 inside Megatron and DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview outside? Megatron's PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample schedule and DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview both want to own the forward/backward dance; layering them produces an intractable cross-product schedule. Carving DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview out of the world keeps TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding and DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP entirely under Megatron's parallel_state. The cost is that each rank holds the full model, so DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview's 1/N per-rank memory relief is what buys us the depth.
The DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview activation patch wires setup_dualpipev_from_args into Megatron's setup_model_and_optimizer. Activation is gated by an explicit runtime flag; failures crash loudly because silent fallback on a 2-rank pipeline is worse than a hard error.
The hybrid schedule plan patch handles our layer mix under Megatron's combined-1F1B path. The stock plan assumes a uniform transformer stack and stalls the comm stream every time we hit a Mamba or DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample layer. The patch wraps Mamba and DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample layers as opaque schedule nodes on the compute stream, leaves MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack layers decomposed into dispatch / MLP / combine nodes on the comm stream, and keeps both streams busy across the hybrid pattern. What Megatron can and cannot split is the narrower continuation for why that heterogenous ownership boundary matters.
The family-aware spec builder and TE spec builder assemble the MambaStackSubmodules spec consumed by Megatron's MambaStack. They override the Mamba mixer (Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode vs M²RNN per layer), keep upstream TE submodules for everything else, and route attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns through the DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample-absorbed MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: shared MLA adapter boundaries spec on DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample layers.
The Megatron launch-argument builder emits the runtime flags. The 8-GPU H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 single-node fragment is --tensor-model-parallel-size 2 --pipeline-model-parallel-size 1 --expert-model-parallel-size 4 --sequence-parallel. Multi-node H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 moves TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding up to 4 and EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding up to 8. DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP is implicit (world_size / (TP × PP × EP)). GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story collapses the cube to TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding=1, PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample=1, EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding=1, DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP=1 — on that path the parallelism story is "all the work is in the kernel, not the cube".
MegaCpp pieces that did not survive the lift: our early pipeline partitioner, the Megatron block wrapper, and the in-process aux-loss injector — Megatron's TransformerLayer and parallel_state already do those jobs. The stage modules in the deployment DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview wrapper are a focused re-implementation that knows how to slice a Megatron MambaModel into the 4 virtual stages DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview needs.
Bubble accounting
The honest version. DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview's bubble formula (PP/2 - 1) × (F + B - 3W) is a lower bound, not a measurement. On our depth-52 preset with pp=2 (4 stages V-shape, pipe_rank=0 holds stages (0, 3) and pipe_rank=1 holds stages (1, 2)), the bubble term collapses to zero by the formula but the real bubble is the warmup + cooldown of the schedule plus the P2P round-trip latency, which is in the 3-5% range on a single 8-GPU H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 node. At pp=4 (8 stages V-shape) on a multi-node H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 cluster, the formula bubble is small and the measured bubble is dominated by the cross-host P2P, which we keep in the 5-8% range with bf16 P2P tensors. The W decomposition matters because it lets the schedule fill the would-be bubble with weight-gradient compute on micro-batches whose backward-input has already finished.
The hybrid pattern adds a confound the formula does not see. Mamba layers have a different F + B - W ratio than dense attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns, and a stage that contains a high Mamba count is slower per micro-batch than a stage of pure attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns. We balance stages by parameter count (partition_model_weighted) and verify the actual per-stage step time matches within 5% via instrumentation. When it does not, we adjust the explicit boundary list (pp_boundaries) until it does. There is no algorithm here; it is a feedback loop on instrumentation.
The other easy way to lie to yourself is to count only physical pipe ranks when
writing pp_boundaries. The checked-in
DualPipe stage contract sample
makes the contract explicit: DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview with pp_degree = N expects 2 * N
total stage segments because each physical rank carries two virtual stages. A
PP=4 plan with four explicit segments is not a rough draft of the right
schedule; it is a different schedule description.
Ablations and what we kept
The wins: DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview over standard DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingExample: DualPipe schedule sample Example: DualPipe stage contract sample Reference: parallelism map overview (1/N vs 2/N memory at the same rank count), Megatron ColumnParallelLinear / RowParallelLinear over DTensorQuick term guideDTensorPyTorch's mesh-backed distributed-tensor abstraction: one logical tensor with explicit shard or replica metadata across ranks.GroundingAbout: EP / PP / TP / CP / SP / DP overview Example: 3D parallelism sample Reference: FSDP2 on XLA TPU TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding (faster custom NCCLQuick term guideNCCLNVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.GroundingAbout: NCCL and collective hangs Example: pipeline parallel sample Reference: training on 8x H200 overlap, async allreduce), the TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding-then-PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample-then-FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample-then-EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding application order, parameter-weighted layer partitioning for any MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack-bearing preset, the aux-loss injector for non-terminal PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample stages, output-deallocation off for DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingExample: DualPipe schedule sample Example: DualPipe stage contract sample Reference: parallelism map overview stages, the explicit-boundary path through pp_boundaries for hybrid presets where the equal split is too uneven, and bf16 P2P tensors (fp32 P2P is bandwidth-bound on cross-host).
The losses: DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingExample: DualPipe schedule sample Example: DualPipe stage contract sample Reference: parallelism map overview (non-V) at pp=2 (no memory relief, no real overlap), DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview at pp=1 on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story (no second rank, the schedule degenerates), pipe-delimited pp_boundaries patterns that did not produce the right total stage count for DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview (4 segments for PP=4 is wrong, we need 8), and Megatron PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample > 1 layered under DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview (the cross-product schedule is not tractable).
The neutrals: TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding=2 vs TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding=4 on a single 8-GPU H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 node is roughly a wash for the dense layers; we run TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding=2 because it leaves more room for EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding=4 and DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP=2 inside the same world. Sequence parallelismQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel is on whenever TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding > 1; the activation savings on MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: shared MLA adapter boundaries projections justify the allgather/reduce-scatter, as discussed further in context parallel and sequence parallel. STP Variant C is disabled under PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample > 1.
The boring engineering: the validator in validate_3d_config runs on every job submission, not just at startup. Every smoke run also executes the dualpipev_forward_backward path on a depth-13 toy model so the schedule code itself has continuous coverage. The mirror-grad and DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP-grad sync paths are tested on a 2-rank and 4-rank smoke, respectively, on every PR.
Production checklist
- Apply parallelism in TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding → PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample → FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample → EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding order. Out-of-order produces silent sharding corruption.
- The depth-52 hybrid preset uses Megatron
--pipeline-model-parallel-size 1and a 2-rank DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview carved out of the world. - Each Megatron rank holds the full model; DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview splits 52 layers into 4 virtual stages of 13 layers, assigned in V-shape to
pipe_rank0 and 1. - DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP gradient sync runs across same-
pipe_rankpeers via standardtorch.distributed.new_groupcollectives. - Aux-loss injection is enabled on non-terminal stages.
- Output deallocation is disabled on DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview stages (otherwise the autograd graph corrupts under bidirectional overlap).
- Loss is scaled by
grad_accum_steps × num_chunksto match the non-PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample gradient magnitude. - The hybrid schedule plan patch is applied; opaque nodes wrap Mamba and DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample layers so the comm stream stays busy through the AEMEM... pattern.
- Layer partition is parameter-weighted on any MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack-bearing preset; explicit
pp_boundariesoverrides apply when the auto-balanced split exceeds 5% per-stage step-time imbalance. - TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding layers go through Megatron
ColumnParallelLinear/RowParallelLinear. Sequence parallelismQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel is on whenever TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding > 1. - P2P tensors are bf16; the inter-stage shape is
(B, T, D)set viaconfigure_p2p_tensorsbefore the first step. - The 3D validator runs on every job submission; failures are hard errors, warnings are logged.
- GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story paths ignore DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingExample: DualPipe stage contract sample Example: DualPipe schedule sample Reference: parallelism map overview entirely. The cube collapses to TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding=1, PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample=1, EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding=1, DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP=1; throughput comes from the kernels.
- STP is restricted to single-layer mode under PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample > 1.
Application order and what it costs
| Axis | What it splits | Comm | Notes |
|---|---|---|---|
| TP | weight/activation | all-reduce per layer | Megatron-Core preferred over DTensor on CUDA |
| PP (DualPipeV) | layers (V-shape) | P2P across pipe_rank |
bidirectional overlap |
| FSDP2 | params + grads + optimizer | all-gather + reduce-scatter | applied after TP+PP |
| EP | experts | all-to-all | applied last |
# Apply order: TP -> PP -> FSDP2 -> EP. Out-of-order silently corrupts shards.
apply_megatron_tp(model, tp_size=tp)
apply_dualpipe_v(model, num_chunks=4, virtual_stages=4, pipe_rank=rank)
apply_fully_shard(model, mesh=fsdp_mesh)
apply_expert_parallel(model, ep_size=ep)
Frequently asked questions
Why keep Megatron PP at 1 and run DualPipeV outside it?+
Why is parameter-weighted partitioning emphasized so much?+
Where do the microbatch and loss-scaling details for this schedule live?+
grad_accum_steps × num_chunks divisor, BF16-vs-FP16 scaling, and the planner loop we use before promoting a DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you. lane.Why does the bringup checklist care so much about configure_p2p_tensors?+
Where is context parallelism in this specific layout?+
TP -> PP -> FSDP2 -> EP DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you. lane, where the hard problem is stage ownership and bidirectional overlap across depth. Context parallelismQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job. is a different contract: it changes who owns the sequence when one-rank context residency stops being viable, which is why the follow-on reads are Context parallel and sequence parallel and EP, PP, TP, CP, SP, DP: The Parallelism Map We Actually Use, not another DualPipeV launch flag inside this article.Why does the GB10 path collapse the cube instead of keeping DualPipeV?+
pp=1 gives DualPipeV no second rank to overlap against, and the same box is already operating under a tighter unified-memory ceiling than H200.When do you stop trusting the auto split and write explicit pp_boundaries?+
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.
Pipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.
Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.
Data parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.
DualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.
Bidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.
A concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.
Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.
DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.
Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.
The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.
NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
Context parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.
Sequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.
Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.
The named device grid that defines which logical axis maps to which TPU or distributed-device axis before sharding annotations make sense.
The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
PyTorch's mesh-backed distributed-tensor abstraction: one logical tensor with explicit shard or replica metadata across ranks.
The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.
NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.
A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…
NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.