MegaCpp EngineeringApplied C++ model systems

</>

Article

Grounded engineering note from the MegaCpp stack

Published April 18, 20268 min readDavid Gornshtein

FSDP2

Gradient Accumulation

Microbatching

Training

H200

Gradient Accumulation and Microbatching Under FSDP2: How We Stopped Guessing the Knobs

Q: Does BF16 still need dynamic loss scaling here?

Usually no. BF16 is the default path; the dynamic GradScaler behavior mostly matters on the explicit FP16 lane.

Q: Why does the divisor become grad_accum_steps × num_chunks under DualPipeV?

Because DualPipeV has no later cleanup pass that rescales gradients for you. The stage-local loss and any injected MoE aux-loss path have to match the overlapped schedule directly.

Q: Does grad_accum_steps change when FSDP2 all-gathers parameters?

No. fully_shard still decides parameter materialization per compute use. grad_accum_steps only decides when gradient communication is allowed to finish and when the optimizer may step.

Microbatch sizing under FSDP2, accumulation boundaries that respect TP/EP/SP, loss scaling under FP16/BF16, and the tuning loop that finally converged on H200.

By David Gornshtein

MegaCpp

Focused on applied C++ model engineering

Article Preview

Published April 18, 2026•8 min read•David Gornshtein

Gradient Accumulation and Microbatching Under FSDP2

Most of the time, "tune the microbatch" sounds like a one-line job: pick the largest device_batch_size that fits in HBM, divide your global token budget by it, and call the quotient grad_accum_steps. On a clean dense model that is roughly true. In a hybrid trainingQuick term guideTrainingA grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…GroundingSLM training in MegaCpp: what the stack optimizes for and what stays explicit Training speed anatomy on H200 stack, where FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample sits next to TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding, SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack dispatch, fused attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns, and mixed optimizer state, the same one-line job can eat weeks.

This post is the honest version of how we sized microbatches, where the accumulation boundary actually lives, what loss scaling looks like in a world that is mostly BF16 with FP16 and FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper islands, and the inner loop we now use to converge on a config without brute-forcing the search space. It pairs naturally with training on 8x H200 SXM, FSDP2 pain and payoff, FSDP on CUDA and Megatron DDP, and regional compile without losing the plot.

In this article, device_batch_size means the per-rank microbatch shape that actually enters the forward pass, grad_accum_steps means how many of those microbatches are accumulated before the optimizer may step, and pipeline num_chunks means the extra schedule-side split that matters only when PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample or DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingAbout: DualPipe and 3D parallelism on NVIDIA Example: DualPipe schedule sample Example: DualPipe stage contract sample overlap is active. Keeping those three counts separate is the difference between a real receipt and a runbook that hides where the loss is scaled.

The shortest checked-in decoder is pipeline parallel sample for where microbatches actually flow, pipeline activation sample for warmup-count math, and DualPipe schedule sample plus DualPipe stage contract sample for where an overlapped schedule changes the divisor.

What Sets the Microbatch in Practice

The naive formula starts with tokens_per_microstep = device_batch_size * max_seq_len * dp, and grad_accum_steps = total_batch_size // tokens_per_microstep. A useful planner starts there and then layers on the constraints that actually decide whether a configuration survives:

HBM headroom after FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample sharded params, FP32 buffers, and optimizer state.
MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack dispatch capacity. The dispatch buffer scales with effective tokens per step, not per microbatch, because XLA can fuse across grad-accum micro-steps into a single graph.
Comm shapes. device_batch_size controls the size of every reduce-scatter and all-gather inside the FSDP2 group, the all-to-all volumes inside EP, and the SP halo.
Compile graph identity. Under regional compile, every distinct microbatch shape is a new graph.

The auto-fit ranker scores candidates roughly in this order: fits at all, then larger dbs, then gradient_checkpointing off, then fewer grad_accum_steps, then more headroom, then smaller TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, then smaller EP. The order matters: we explicitly prefer dbs=2 / ga=8 to dbs=1 / ga=16 even when memory is identical, because the comm and kernel-launch overhead per micro-step is non-trivial on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200.

The public-safe companions for this section are FSDP sharding sample, Expert-parallel routing sample, measured optimization receipts sample, and goodput tracker sample.

Where the accumulation boundary actually lives

Once you have FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample and a real TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding/EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding/SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel topology, "accumulate gradients for N microbatches, then step" is no longer a single line of trainingQuick term guideTrainingA grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…GroundingSLM training in MegaCpp: what the stack optimizes for and what stays explicit Training speed anatomy on H200 code. The boundary has at least four moving pieces.

First, FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample's delayed-gradient-sync switch has to be set on every microbatch except the last, or the reduce-scatter fires N times and you pay N times the communication. Missing it is silent: the math is fine, but throughput collapses.

The important subtlety is that parameter materialization and gradient accumulation are different clocks. fully_shard decides when a parameter shard is gathered for compute; grad_accum_steps decides when gradient communication is allowed to finish and when the optimizer may step.

Second, the optimizer step has to sit outside the accumulation loop, after the final reduce-scatter has produced sharded gradients. Under our Megatron optimizer wrapper this is enforced by the wrapper itself, but the loss scaling is divided by grad_accum_steps inside the loop. For pipeline parallelQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample, the divisor is grad_accum_steps * num_chunks instead, which is a bug we hit and fixed in the pipeline integration layer. The pipeline-side version of that same rule is spelled out in DualPipe and 3D parallelism on NVIDIA.

Third, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack auxiliary losses live on a different scaler. We follow Megatron's MoEAuxLossAutoScaler.set_loss_scale pattern: the LM loss is divided by grad_accum_steps, the auxiliary load-balancing loss is multiplied through the same factor, and the two are summed before backward. Forgetting the aux scaler turns a 1e-3 aux coefficient into something effectively N times that.

Fourth, the data loader has to advance one microbatch at a time, not one trainingQuick term guideTrainingA grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…GroundingSLM training in MegaCpp: what the stack optimizes for and what stays explicit Training speed anatomy on H200 step at a time. Our pipeline-parallel path also has to advance the chunk schedule independently. Conflating these has been the cause of more than one "loss is fine but it does not learn" puzzle.

There is a schedule-side constraint that belongs in the same bucket: on the current DualPipeQuick term guideDualPipeBidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.GroundingAbout: DualPipe and 3D parallelism on NVIDIA Example: DualPipe schedule sample Example: DualPipe stage contract sample and DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.GroundingAbout: DualPipe and 3D parallelism on NVIDIA Example: DualPipe stage contract sample Example: DualPipe schedule sample lane, the pipeline validator requires at least 2 * pp_degree microbatches and exact device_batch_size % pp_microbatches == 0 divisibility so the declared P2P tensor shapes match on every receiving stage.

Loss scaling: BF16, FP16, and the FP8 islands

Keep three dtype stories separate here: activation dtype, gradient communication dtype, and optimizer or master-state dtype. They interact, but they are not one knob. The MegaCpp precision recipe is the adjacent post for that wider picture.

We run BF16 by default and have for the entire H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 program. BF16 has the exponent range of FP32, so we do not run a GradScaler for the BF16 path.

FP16 is a different story. When we exercise the FP16 lane, we use a dynamic GradScaler with the standard powers-of-two backoff and a 2000-step growth interval. The interesting interaction is with skipped-step handling: the wrapper detects non-finite gradients before sanitization and, when the flag is on, skips the optimizer step entirely. With FP16 plus GradScaler, the scaler also halves the loss scale on the same event.

The FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper islands need their own discipline. FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper master copies live in FP32; the model still produces BF16 activations and gradients; the FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper scales are owned by TE and updated each forward. None of this changes the gradient-accumulation math, but it changes what "loss scaling" means at the kernel boundary.

Pre-reduce NaN sanitization

There is one more piece that touches both accumulation and loss scaling: the pre-collective NaN or Inf sanitization. In the optimizer layer we nan_to_num the flattened gradient buffer before the reduce-scatter when the pre-reduce NaN check remains enabled.

The reasoning is concrete: a single rank producing a NaN will, without sanitization, poison every other rank's gradient through the average. The post-collective nan_to_num cannot recover from that.

The cost is real: an extra full-tensor pass per accumulation boundary. On H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 it is in the noise; on smaller links it is not. Sanitization belongs at the accumulation boundary rather than at every microbatch. Running it before the final reduce-scatter protects the whole DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP group while keeping the extra tensor pass proportional to optimizer steps rather than to micro-steps.

The tuning loop that actually converged

After a year of ad-hoc spreadsheets we standardized on a five-step inner loop:

Run the planner. It proposes a ranked list of (tp, ep, dp, dbs, ga, gradient_checkpointing) tuples.
Smoke each candidate for four steps with metrics and save disabled.
If a candidate fails OOM, prefer the next ranked candidate that preserves the current gradient_checkpointing state before dropping rematerialization.
Read the steady-state median of steps 1..3, not step 0.
Validate gradient norm against an EMA and remove lanes that never settle.

The loop is short but it has rules. We do not change two knobs at once. We do not promote a config to a long run on the strength of step-0 throughput. We do not accept a config that the planner did not score.

Batch warmup

One subtlety we did not initially expect: a "batch warmup" schedule sometimes outperforms a fixed accumulation count for the first few thousand steps. The trainer exposes batch_warmup_steps; when set, the grad-accum count ramps from max(1, max_grad_accum_steps // 8) to max_grad_accum_steps over the warmup window.

There is one important constraint: on XLA we force batch_warmup_steps = 0, because varying the per-step micro-step count would invalidate the compiled graph at every change. CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 is permissive; XLA is not.

What we threw away

Manual dbs selection by humans.
dbs=1 as a default. It is a fallback for OOM, not a goal.
Treating grad_accum_steps as independent of gradient_checkpointing.
A separate FP16 default path.

What still hurts

The biggest remaining sharp edge is that grad_accum_steps and pipeline parallelQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample num_chunks co-divide the loss, and the divisor depends on which stage and which microbatch you are in. We also still rely on a stack-wide environment flag for the sanitization toggle, where it ought to be a per-run flag.

The shorter version of the whole post: microbatching is a planning problem, not a guessing problem; the accumulation boundary is a contract that touches FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample, the optimizer, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack aux losses, and the data loader; and the inner loop that converges is the one that fails fast and changes one knob at a time.

Inner-loop knobs at a glance

Knob	Owner	Default	When to change
`dbs` (device microbatch)	planner	planner top-3	only via planner
`grad_accum_steps`	trainer + PP scheduler	derived from target tokens/step	co-tuned with `gradient_checkpointing`
`gradient_checkpointing`	trainer	on for deep or large	retry ladder preserves it before dropping `dbs`
`batch_warmup_steps`	trainer	`0` on XLA	early-step compile pressure
pre-reduce NaN-check toggle	optimizer env	unset (sanitize on)	only when the stack is provably clean

Smoke-test shape we use for every planner candidate:

python -m <training-entrypoint> --config <candidate-config> --dbs 4 --grad_accum_steps 8 --save_every 0 --core_metric_every 0 --sample_every 0 --eval_every 0 --warmup_ratio 0.0 --max_steps 4

FAQ

Frequently asked questions

Is dbs=1 the safe default?+

No. It is the fallback when memory is tight, not the target. On H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks. the communication and launch overhead often make dbs=2 the more honest baseline.

Does BF16 still need dynamic loss scaling here?+

Usually no. BF16 is the default path; the dynamic GradScaler behavior mostly matters on the explicit FP16 lane.

Why prefer dbs=2 + ga=8 over dbs=1 + ga=16 if total tokens are equal?+

Because equal token counts do not imply equal step cost. The smaller microbatch sends more, smaller collectives and pays launch overhead more often, so H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks. usually prefers the larger microbatch if memory and schedule constraints allow it.

Why does the divisor become grad_accum_steps × num_chunks under DualPipeV?+

Because DualPipeVQuick term guideDualPipeVDualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you. has no later cleanup pass that rescales gradients for you. The stage-local loss and any injected MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble. aux-loss path have to match the overlapped schedule directly.

What changes on TPU XLA?+

The main difference is that XLA treats micro-step count and batch-shape variation as compile-shaping inputs. That is why we force batch_warmup_steps=0 there and describe the TPU lane separately in ZeRO-3-shaped sharding on the XLA backend. The checked-in TPU backend ownership notes and PP compile warmup sample, plus the graph-safe batch warmup sample, are the compact local reminders for that backend-specific rule.

Does grad_accum_steps change when FSDP2 all-gathers parameters?+

No. fully_shard still decides parameter materialization per compute use. grad_accum_steps only decides when gradient communication is allowed to finish and when the optimizer may step.

Why can a missed delayed-gradient-sync boundary tank throughput without changing the math?+

Because the expensive mistake is communicative, not algebraic. If the FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism. lane forgets to defer gradient communication until the last microbatch in the window, the DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job. group pays the same reduce-scatter boundary on every microbatch instead of once per optimizer step.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

Grounding

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

Grounding

Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.

Grounding

Sequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.

Grounding

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Grounding

Pipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.

Grounding

Data parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.

Grounding

DualPipeV

DualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement — useful when plain DualPipe's peak activation memory is what's blocking you.

Grounding

DualPipe

Bidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.

Grounding

scheduler

The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.

Grounding

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Grounding

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Grounding

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

Grounding

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

Grounding

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

Grounding

David Gornshtein • MegaCppMore posts →

Gradient Accumulation and Microbatching Under FSDP2: How We Stopped Guessing the Knobs

Gradient Accumulation and Microbatching Under FSDP2

What Sets the Microbatch in Practice

Where the accumulation boundary actually lives

Loss scaling: BF16, FP16, and the FP8 islands

Pre-reduce NaN sanitization

The tuning loop that actually converged

Batch warmup

What we threw away

What still hurts

Inner-loop knobs at a glance

Read next

References

Frequently asked questions

Terms used in this article