Gradient Accumulation and Microbatching Under FSDP2: How We Stopped Guessing the Knobs
Microbatch sizing under FSDP2, accumulation boundaries that respect TP/EP/SP, loss scaling under FP16/BF16, and the tuning loop that finally converged on H200.

Gradient Accumulation and Microbatching Under FSDP2
Most of the time, "tune the microbatch" sounds like a one-line job: pick the
largest device_batch_size that fits in HBM, divide your global token budget
by it, and call the quotient grad_accum_steps. On a clean dense model that is
roughly true. In a hybrid trainingQuick term guideTrainingA grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…GroundingSLM training in MegaCpp: what the stack optimizes for and what stays explicit Training speed anatomy on H200 stack, where FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample sits next to TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding, SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel,
MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack dispatch, fused attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns, and mixed optimizer state, the same one-line job
can eat weeks.
This post is the honest version of how we sized microbatches, where the accumulation boundary actually lives, what loss scaling looks like in a world that is mostly BF16 with FP16 and FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper islands, and the inner loop we now use to converge on a config without brute-forcing the search space. It pairs naturally with training on 8x H200 SXM, FSDP2 pain and payoff, FSDP on CUDA and Megatron DDP, and regional compile without losing the plot.
In this article, device_batch_size means the per-rank microbatch shape that
actually enters the forward pass, grad_accum_steps means how many of those
microbatches are accumulated before the optimizer may step, and pipeline
num_chunks means the extra schedule-side split that matters only when PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample or
DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingAbout: DualPipe and 3D parallelism on NVIDIA Example: DualPipe schedule sample Example: DualPipe stage contract sample overlap is active. Keeping those three counts separate is the
difference between a real receipt and a runbook that hides where the loss is
scaled.
The shortest checked-in decoder is pipeline parallel sample for where microbatches actually flow, pipeline activation sample for warmup-count math, and DualPipe schedule sample plus DualPipe stage contract sample for where an overlapped schedule changes the divisor.
What Sets the Microbatch in Practice
The naive formula starts with
tokens_per_microstep = device_batch_size * max_seq_len * dp, and
grad_accum_steps = total_batch_size // tokens_per_microstep. A useful planner
starts there and then layers on the constraints that actually decide whether a
configuration survives:
- HBM headroom after FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample sharded params, FP32 buffers, and optimizer state.
- MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack dispatch capacity. The dispatch buffer scales with effective tokens per step, not per microbatch, because XLA can fuse across grad-accum micro-steps into a single graph.
- Comm shapes.
device_batch_sizecontrols the size of every reduce-scatter and all-gather inside the FSDP2 group, the all-to-all volumes inside EP, and the SP halo. - Compile graph identity. Under regional compile, every distinct microbatch shape is a new graph.
The auto-fit ranker scores candidates roughly in this order: fits at all, then
larger dbs, then gradient_checkpointing off, then fewer grad_accum_steps,
then more headroom, then smaller TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, then smaller EP. The order matters: we
explicitly prefer dbs=2 / ga=8 to dbs=1 / ga=16 even when memory is
identical, because the comm and kernel-launch overhead per micro-step is
non-trivial on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200.
The public-safe companions for this section are FSDP sharding sample, Expert-parallel routing sample, measured optimization receipts sample, and goodput tracker sample.
Where the accumulation boundary actually lives
Once you have FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample and a real TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding/EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding/SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel topology, "accumulate gradients for
N microbatches, then step" is no longer a single line of trainingQuick term guideTrainingA grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…GroundingSLM training in MegaCpp: what the stack optimizes for and what stays explicit Training speed anatomy on H200 code. The
boundary has at least four moving pieces.
First, FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample's delayed-gradient-sync switch has to be set on every microbatch
except the last, or the reduce-scatter fires N times and you pay N times
the communication. Missing it is silent: the math is fine, but throughput
collapses.
The important subtlety is that parameter materialization and gradient
accumulation are different clocks. fully_shard decides when a parameter shard
is gathered for compute; grad_accum_steps decides when gradient communication
is allowed to finish and when the optimizer may step.
Second, the optimizer step has to sit outside the accumulation loop, after the
final reduce-scatter has produced sharded gradients. Under our Megatron
optimizer wrapper this is enforced by the wrapper itself, but the loss scaling
is divided by grad_accum_steps inside the loop. For pipeline parallelQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample, the
divisor is grad_accum_steps * num_chunks instead, which is a bug we hit and
fixed in the pipeline integration layer. The pipeline-side version of that same
rule is spelled out in
DualPipe and 3D parallelism on NVIDIA.
Third, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack auxiliary losses live on a different scaler. We follow Megatron's
MoEAuxLossAutoScaler.set_loss_scale pattern: the LM loss is divided by
grad_accum_steps, the auxiliary load-balancing loss is multiplied through the
same factor, and the two are summed before backward. Forgetting the aux scaler
turns a 1e-3 aux coefficient into something effectively N times that.
Fourth, the data loader has to advance one microbatch at a time, not one trainingQuick term guideTrainingA grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…GroundingSLM training in MegaCpp: what the stack optimizes for and what stays explicit Training speed anatomy on H200 step at a time. Our pipeline-parallel path also has to advance the chunk schedule independently. Conflating these has been the cause of more than one "loss is fine but it does not learn" puzzle.
There is a schedule-side constraint that belongs in the same bucket: on the
current DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingAbout: DualPipe and 3D parallelism on NVIDIA Example: DualPipe schedule sample Example: DualPipe stage contract sample and DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingAbout: DualPipe and 3D parallelism on NVIDIA Example: DualPipe stage contract sample Example: DualPipe schedule sample lane, the pipeline validator requires at least
2 * pp_degree microbatches and exact
device_batch_size % pp_microbatches == 0 divisibility so the declared P2P
tensor shapes match on every receiving stage.
Loss scaling: BF16, FP16, and the FP8 islands
Keep three dtype stories separate here: activation dtype, gradient communication dtype, and optimizer or master-state dtype. They interact, but they are not one knob. The MegaCpp precision recipe is the adjacent post for that wider picture.
We run BF16 by default and have for the entire H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 program. BF16 has the
exponent range of FP32, so we do not run a GradScaler for the BF16 path.
FP16 is a different story. When we exercise the FP16 lane, we use a dynamic
GradScaler with the standard powers-of-two backoff and a 2000-step growth
interval. The interesting interaction is with skipped-step handling: the
wrapper detects non-finite gradients before sanitization and, when the flag is
on, skips the optimizer step entirely. With FP16 plus GradScaler, the scaler
also halves the loss scale on the same event.
The FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper islands need their own discipline. FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper master copies live in FP32; the model still produces BF16 activations and gradients; the FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper scales are owned by TE and updated each forward. None of this changes the gradient-accumulation math, but it changes what "loss scaling" means at the kernel boundary.
Pre-reduce NaN sanitization
There is one more piece that touches both accumulation and loss scaling: the
pre-collective NaN or Inf sanitization. In the optimizer layer we
nan_to_num the flattened gradient buffer before the reduce-scatter when the
pre-reduce NaN check remains enabled.
The reasoning is concrete: a single rank producing a NaN will, without
sanitization, poison every other rank's gradient through the average. The
post-collective nan_to_num cannot recover from that.
The cost is real: an extra full-tensor pass per accumulation boundary. On H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 it is in the noise; on smaller links it is not. Sanitization belongs at the accumulation boundary rather than at every microbatch. Running it before the final reduce-scatter protects the whole DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP group while keeping the extra tensor pass proportional to optimizer steps rather than to micro-steps.
The tuning loop that actually converged
After a year of ad-hoc spreadsheets we standardized on a five-step inner loop:
- Run the planner. It proposes a ranked list of
(tp, ep, dp, dbs, ga, gradient_checkpointing)tuples. - Smoke each candidate for four steps with metrics and save disabled.
- If a candidate fails OOM, prefer the next ranked candidate that preserves
the current
gradient_checkpointingstate before dropping rematerialization. - Read the steady-state median of steps
1..3, not step0. - Validate gradient norm against an EMA and remove lanes that never settle.
The loop is short but it has rules. We do not change two knobs at once. We do not promote a config to a long run on the strength of step-0 throughput. We do not accept a config that the planner did not score.
Batch warmup
One subtlety we did not initially expect: a "batch warmup" schedule sometimes
outperforms a fixed accumulation count for the first few thousand steps. The
trainer exposes batch_warmup_steps; when set, the grad-accum count ramps from
max(1, max_grad_accum_steps // 8) to max_grad_accum_steps over the warmup
window.
There is one important constraint: on XLA we force batch_warmup_steps = 0,
because varying the per-step micro-step count would invalidate the compiled
graph at every change. CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 is permissive; XLA is not.
What we threw away
- Manual
dbsselection by humans. dbs=1as a default. It is a fallback for OOM, not a goal.- Treating
grad_accum_stepsas independent ofgradient_checkpointing. - A separate FP16 default path.
What still hurts
The biggest remaining sharp edge is that grad_accum_steps and pipeline
parallelQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample num_chunks co-divide the loss, and the divisor depends on which
stage and which microbatch you are in. We also still rely on a stack-wide
environment flag for the sanitization toggle, where it ought to be a per-run
flag.
The shorter version of the whole post: microbatching is a planning problem, not a guessing problem; the accumulation boundary is a contract that touches FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample, the optimizer, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack aux losses, and the data loader; and the inner loop that converges is the one that fails fast and changes one knob at a time.
Inner-loop knobs at a glance
| Knob | Owner | Default | When to change |
|---|---|---|---|
dbs (device microbatch) |
planner | planner top-3 | only via planner |
grad_accum_steps |
trainer + PP scheduler | derived from target tokens/step | co-tuned with gradient_checkpointing |
gradient_checkpointing |
trainer | on for deep or large | retry ladder preserves it before dropping dbs |
batch_warmup_steps |
trainer | 0 on XLA |
early-step compile pressure |
| pre-reduce NaN-check toggle | optimizer env | unset (sanitize on) | only when the stack is provably clean |
Smoke-test shape we use for every planner candidate:
python -m <training-entrypoint> --config <candidate-config> --dbs 4 --grad_accum_steps 8 --save_every 0 --core_metric_every 0 --sample_every 0 --eval_every 0 --warmup_ratio 0.0 --max_steps 4
Frequently asked questions
Is dbs=1 the safe default?+
dbs=2 the more honest baseline.Does BF16 still need dynamic loss scaling here?+
GradScaler behavior mostly matters on the explicit FP16 lane.Why prefer dbs=2 + ga=8 over dbs=1 + ga=16 if total tokens are equal?+
Why does the divisor become grad_accum_steps × num_chunks under DualPipeV?+
What changes on TPU XLA?+
batch_warmup_steps=0 there and describe the TPU lane separately in ZeRO-3-shaped sharding on the XLA backend. The checked-in TPU backend ownership notes and PP compile warmup sample, plus the graph-safe batch warmup sample, are the compact local reminders for that backend-specific rule.Does grad_accum_steps change when FSDP2 all-gathers parameters?+
fully_shard still decides parameter materialization per compute use. grad_accum_steps only decides when gradient communication is allowed to finish and when the optimizer may step.Why can a missed delayed-gradient-sync boundary tank throughput without changing the math?+
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.
Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.
Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.
Sequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.
NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
Pipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.
Data parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.
DualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.
Bidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.
The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.
Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.
NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…