MegaCpp EngineeringApplied C++ model systems
</>
Article
Grounded engineering note from the MegaCpp stack
Published 8 min readDavid Gornshtein
NCCL
XLA
Fsdp
Performance
H200
TPU v6e

Communication cost and overlap: NCCL on H200, XLA collectives on TPU v6e

How MegaCpp budgets all-reduce, reduce-scatter, and all-gather against compute on the hybrid stack, including bucket sizing, launch coalescing, alignment, and the overlap windows that actually matter.

MegaCpp
Focused on applied C++ model engineering
Article Preview
Communication cost and overlap: NCCL on H200, XLA collectives on TPU v6e
Published 8 min readDavid Gornshtein

Distributed training on a hybrid CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 and TPU stack is a communication problem first and a compute problem second. Dense and MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack configurations live well inside the regime where a careless bucket boundary can erase a large share of step throughput, and where one mis-tuned communication setting on a multi-host H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 lane decides whether a reduce-scatter overlaps the backward pass or stalls it. This post focuses on what the communication cost looks like on the two backends, how bucket sizing and launch coalescing affect overlap, what alignment buys in practice, and which ideas survive the move from research-stack to production.

Why MegaCpp cares about this

Our hybrid stack runs on two very different fabrics. On NVIDIA we target H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200:8 single-host with NVLink/NVSwitch as the bread-and-butter rung, and 4xH200:8 multi-host as the upper bound for any preset that does not already burn its budget on memory. On Google we target TPU v6e-8 and v6e-32, where collectives are issued by the XLA SPMDQuick term guideXLA SPMDThe explicit TPU sharding mode where one compiled program carries placement rules instead of rank-local imperative code.GroundingAbout: XLA SPMD sharding annotations About: XLA SPMD tokenizer and vocab on TPU Example: TPU backend ownership note compiler and the unit of optimization is graph shape rather than NCCLQuick term guideNCCLNVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.GroundingAbout: NCCL and collective hangs Example: pipeline parallel sample Reference: training on 8x H200 call ordering. The dense path is comms-bound past dpQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP=8 with bf16 grads. The MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack path is comms-bound from the very first step because expert parallelismQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding adds an AlltoAll on top of the AllReduce/ReduceScatter we already pay for FSDP-style sharding.

The brief is simple: every microsecond of compute that is not also a microsecond of communication is wasted, and every collective that we cannot hide behind compute is a tax on every step for the rest of the run. The implementation is less simple, because it splits across PyTorch's NCCLQuick term guideNCCLNVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.GroundingAbout: NCCL and collective hangs Example: pipeline parallel sample Reference: training on 8x H200 bindings, the FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample wrapping logic, our Megatron-style overlapped reducer, and the XLA SPMDQuick term guideXLA SPMDThe explicit TPU sharding mode where one compiled program carries placement rules instead of rank-local imperative code.GroundingAbout: XLA SPMD sharding annotations About: XLA SPMD tokenizer and vocab on TPU Example: TPU backend ownership note lowering on the TPU side.

What MegaCpp kept after repeated overlap tuning

On the CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 path, the central idea is a Megatron-style distributed optimizer with explicit gradient buckets, an overlapped reducer, and a per-bucket all-gather ladder that keeps the next parameter shard in flight by the time the previous wait returns. The chain is built once at initialization and then walked from model hooks so communication is launched from one place instead of racing between multiple call sites.

Bucket size is the first knob. The public default bucket-size helper mirrors Megatron's max(40_000_000, 1_000_000 * dp_world_size) parameters formula, multiplied by dtype bytes. At dpQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP=8 that lands around 76 MB; at dpQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP=64 around 122 MB. The old static 40 MB ceiling we used to ship was leaving busbw on the table at high dpQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP because the ring reduce-scatter never reached its asymptote before the next bucket fired.

The second knob is alignment. Padding the flat gradient buffer to a multiple of world_size * 65536 makes each per-rank shard end on a 64K-element boundary. That mirrors a well-known Megatron-style hint: NCCLQuick term guideNCCLNVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.GroundingAbout: NCCL and collective hangs Example: pipeline parallel sample Reference: training on 8x H200 ring reduce-scatter on NVLink and NVSwitch tends to hit peak bandwidth on shards that are clean multiples of 64K elements, while leftover tails slow the last chunk. In practice, the lesson is simple: shard alignment is not cosmetic. It changes achieved bandwidth.

The third knob is launch coalescing. Wrapping per-bucket reduce-scatter calls in a coalescing layer lets NCCLQuick term guideNCCLNVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.GroundingAbout: NCCL and collective hangs Example: pipeline parallel sample Reference: training on 8x H200 fuse multiple kernel dispatches into one launch. That usually helps, but it is still worth keeping a rollback path because communication regressions in specific framework builds do happen.

The fourth knob is grad dtype. --grad_reduce_in_fp32 allocates _flat_grad/_flat_grad_shard in fp32, up-casts bf16 grads on copy, runs the reduce-scatter as fp32 AVG, then casts back to param.dtype on write_reduced_grads_to_params. For our fp32 master-grad optimizer paths (AdamW, Muon) the down-cast is a no-op. The trade is bandwidth: fp32 reduce-scatter doubles the wire bytes versus bf16, so we only flip it on for the dense lane where reductions in low precision were measurably perturbing loss curves on long runs.

On the FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample lane, the same overlap intent shows up through prefetching. Disabling reshard-after-forward and allowing a small prefetch window issues more all-gathers early, at the cost of memory. On H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200, a prefetch limit of two was a practical ceiling for the shapes discussed here.

On the TPU path the picture is different because we do not call collectives by hand. The public TPU FSDP notes document the contract: SPMD ZeRO-3 via XLAQuick term guideXLAThe compiler/runtime layer that lowers frontend tensor programs into executable TPU or accelerator graphs, with shape stability and ownership boundaries as the main operational concerns here.GroundingAbout: XLA vs CUDA stack decisions Reference: Torch XLA / PJRT reality Reference: XLA SPMD sharding annotations, where the compiler inserts the all-gather of sharded parameters before forward, the reduce-scatter of gradients after backward, and the optimizer state stays sharded across the data axis. We do not get to hand-tune launch ordering; we get to shape the graph so XLAQuick term guideXLAThe compiler/runtime layer that lowers frontend tensor programs into executable TPU or accelerator graphs, with shape stability and ownership boundaries as the main operational concerns here.GroundingAbout: XLA vs CUDA stack decisions Reference: Torch XLA / PJRT reality Reference: XLA SPMD sharding annotations can. That means avoiding host-device syncs such as .item() and .nonzero() in hot paths, avoiding data-dependent shapes that would break SPMD propagation, and explicitly replicating tensors that would otherwise pick up the wrong sharding. The MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack path on TPU adds AlltoAll on the expert axis, and the planner has to respect user-pinned tensor and expert parallelQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding choices when the chosen meshQuick term guidemeshThe named device grid that defines which logical axis maps to which TPU or distributed-device axis before sharding annotations make sense.GroundingAbout: XLA SPMD sharding annotations Example: 3D parallelism sample Reference: FSDP2 on XLA TPU fits.

How it lands in production

The lift into production is a careful subset. The Megatron-style overlap discipline carries over directly: the full training runtime drives a distributed optimizer that already knows the bucket-chain semantics, the reset points, and the alignment rules. What remains at the recipe level is discipline: pin the overlap flags, keep the normalization runtime settings compatible with collective scheduling, and export the small set of NCCLQuick term guideNCCLNVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.GroundingAbout: NCCL and collective hangs Example: pipeline parallel sample Reference: training on 8x H200 defaults that consistently help on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200.

Three things are being rewritten on the way in. First, a hand-rolled coalescing wrapper retires in favor of an upstream coalesced reduce-scatter path. Second, the bucket-alignment flag becomes a recipe default rather than a per-run knob, because we have not found a workload that benefits from turning it off. Third, the FP32 gradient-reduction choice becomes part of the precision recipe rather than a standalone performance knob, so it composes cleanly with mixed-precision settings.

The two parallelism modes in the production recipe are explicit about the trade. The nemo_native mode (TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding=2, SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel=True) uses Megatron's built-in Mamba mixer and gets full TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding-comm overlap. The author_dp mode (TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding=1, PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample=1, DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP=8) uses our selective mixer with the Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode/M2RNN extensions but accepts a slightly lower comm-overlap potential because there is no TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding collective to overlap with the next layer's compute. The recipe documents this trade in plain English; we do not pretend it does not exist.

On TPU, the production lift is mostly graph hygiene. The XLA SPMDQuick term guideXLA SPMDThe explicit TPU sharding mode where one compiled program carries placement rules instead of rank-local imperative code.GroundingAbout: XLA SPMD sharding annotations About: XLA SPMD tokenizer and vocab on TPU Example: TPU backend ownership note path does not change shape; what changes is that we treat .item(), .nonzero(), and any data-dependent control flow as a hard CI failure on the TPU lane. A safe fallback helper pattern is the template: on XLAQuick term guideXLAThe compiler/runtime layer that lowers frontend tensor programs into executable TPU or accelerator graphs, with shape stability and ownership boundaries as the main operational concerns here.GroundingAbout: XLA vs CUDA stack decisions Reference: Torch XLA / PJRT reality Reference: XLA SPMD sharding annotations the fallback branch is taken, no host-device sync is forced, and no graph break is induced. The TPU comm cost story is then determined entirely by the SPMD-inserted collectives, and our job is to make sure the graph the compiler sees is the graph we intended.

Ablations and what we kept

The rough expectation on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 was that aggressive FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample overlap plus launch coalescing could buy a noticeable step-time improvement, while a smaller set of communication-friendly runtime settings added a few more percent on top. It is better to keep that claim qualitative unless you have side-by-side traces for attribution.

What survived contact with real hardware:

What did not survive:

The multi-host story is the most painful one to characterize cleanly. Allreduce stragglers — one rank 100-300 ms slower than the others, often correlating with a specific PCIe slot — show up as the median step time creeping up while the fast ranks idle on the collective. We do not have a magic fix; the playbook is the boring one: pin the slow rank, swap the bench, and re-run. The collective hangs and watchdog tuning are covered separately in the NCCLQuick term guideNCCLNVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.GroundingAbout: NCCL and collective hangs Example: pipeline parallel sample Reference: training on 8x H200 hangs post; what matters here is that comm cost on multi-host is a distribution, not a number, and the tail dominates.

Public checklist

  • Pin pad_buckets_for_high_nccl_busbw=True and confirm the rank-0 announce fires on every CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 run.
  • Use the dpQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP-scaled bucket size formula from the public distributed-optimizer helper; do not hardcode bucket size unless profiling a specific config.
  • Set NCCL_P2P_NET_CHUNKSIZE=524288, TORCH_NCCL_HIGH_PRIORITY=1, CUDA_DEVICE_MAX_CONNECTIONS=1, TORCH_NCCL_AVOID_RECORD_STREAMS=1 on every CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 rank by default.
  • On FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample, default to reshard_after_forward=False and prefetch_limit=2; raise only if memory headroom permits.
  • On TPU v6e, treat any new .item(), .nonzero(), or data-dependent shape in a hot path as a CI failure on the XLAQuick term guideXLAThe compiler/runtime layer that lowers frontend tensor programs into executable TPU or accelerator graphs, with shape stability and ownership boundaries as the main operational concerns here.GroundingAbout: XLA vs CUDA stack decisions Reference: Torch XLA / PJRT reality Reference: XLA SPMD sharding annotations lane.
  • Keep --grad_reduce_in_fp32 opt-in per precision recipe; do not flip it globally.
  • Ship the _coalescing_manager opt-out env var until the upstream coalesced-collective path is clean on every wheel we support.
  • Record the comm-overlap flag set in the run summary so historical comparisons stay honest.
FAQ

Frequently asked questions

Why keep NCCL_P2P_NET_CHUNKSIZE=524288 and TORCH_NCCL_HIGH_PRIORITY=1 as recipe defaults instead of treating them as generic speed flags?+
Because they tune different parts of the communication path. The high-priority setting controls whether the NCCLQuick term guideNCCLNVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes. communicator uses a high-priority stream, which matters when overlap depends on collectives getting out quickly. The chunk-size setting changes the message size used by NCCL's network send and receive path, so it is only relevant on the parts of the job that actually traverse the network. That is why this article keeps them as H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks. recipe defaults with trace-based revalidation instead of presenting them as universal "faster NCCL" folklore.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.

EP

Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.

DP

Data parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.

XLA SPMD

The explicit TPU sharding mode where one compiled program carries placement rules instead of rank-local imperative code.

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

mesh

The named device grid that defines which logical axis maps to which TPU or distributed-device axis before sharding annotations make sense.

TP

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

PP

Pipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.

SP

Sequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

XLA

The compiler/runtime layer that lowers frontend tensor programs into executable TPU or accelerator graphs, with shape stability and ownership boundaries as the main operational concerns here.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…