Why a 4B-8B model fills an H200 and still OOMs
A detailed accounting of where 141 GB of HBM goes when you train a 4B-8B hybrid Mamba 3, Transformer, and MoE specialist: parameters, gradients, optimizer state, activations, KV cache, MoE routing buffers, and allocator fragmentation.

The first time an engineer used to LLaMA-class trainingQuick term guideTrainingA grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…GroundingSLM training in MegaCpp: what the stack optimizes for and what stays explicit Training speed anatomy on H200 looks at a hybrid trainingQuick term guideTrainingA grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…GroundingSLM training in MegaCpp: what the stack optimizes for and what stays explicit Training speed anatomy on H200 stack, they say the same thing: "It is a 4B model, you have 141 GB per GPU, why are we OOMing." The short answer is that the parameter count is almost the smallest term in the memory budget once every contributor is written down. The long answer is this post. The same arithmetic that makes a dense 7B look comfortable on an 80 GB card makes a hybrid Mamba 3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode + Transformer + MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack specialist at 4K-16K context push the H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200's 141 GB ceiling, and turns the 80 GB H100 into "literally does not fit at our target microbatch". The cleanest adjacent lenses are H200 memory geometry, distributed optimizer stress, and expert parallel and MoE sharding. The quickest public-safe proof surfaces are the checked-in memory budget sample, OOM triage sample, parallelism glossary sample, and EP capacity planning sample.
For first touch, separate four buckets before reading any incident report: parameters, optimizer state, activations, and routing or allocator overhead. The checked-in memory budget sample and distributed memory notes use exactly that split, which is why this article treats "4B model" as an incomplete memory description rather than the answer.
Why this matters
We train specialists, not generalists. A 4B-8B model for code-like domains is the right size for quality, latency and cost. That choice is non-negotiable; what we negotiate is every other knob that lets the model actually fit on the device. The pre-flight estimator is not a nice-to-have, it is the gate on whether a preset is allowed to run at all, and it is the tool we reach for before buying more compute. An operator who cannot articulate where the last 10 GB went is one step from "oh, it worked on Monday".
Where the bytes go
Three components do this job: an analytical pre-flight memory estimator, runtime allocator inspection via torch.cuda.memory_stats, and a persisted calibration record that the auto-fit retry ladder reads. The estimator carries per-component fields on a structured memory estimate, and the rest of the system reasons over those fields rather than over a single peak number. If you want the bucket legend before the incident report, the checked-in memory budget sample, distributed memory notes, and OOM triage sample use the same categories this post does. If you want the ownership rules behind the same buckets, open parallelism glossary sample, FSDP sharding sample, and EP capacity planning sample.
| Component | Typical share at TP=2 EP=4 | Lever |
|---|---|---|
| Parameters (bf16) | 2-4 GB / device | TP, EP shard |
| Gradients | ~params (eager); /dp (FSDP2) | FSDP2 |
| Optimizer state | 8-32 GB (Muon vs pure AdamW) | Muon, FSDP2, DP degree |
| Activations (4K ctx, depth 52) | 10-20 GB | Gradient checkpointing, recompute |
| MoE routing/dispatch | several GB | Capacity factor, EP, dispatcher |
| KV cache (training: none; eval: hot) | 0 / variable | Paged KV at eval only |
| Allocator overhead | 5-10% of peak | expandable_segments:True |
Parameters
The parameter footprint is the term operators intuit correctly. At bf16, a 4B-8B model is 8-16 GB of weights before sharding. A serious estimator breaks this down by shard class such as replicated, TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding-sharded, EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding-sharded, or CPU-only, and by optimizer class such as Muon or AdamW. First touch: TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding means tensor parallelismQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, where compatible tensor dimensions are split across ranks, and EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding means expert parallelismQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding, where the routed-expert bank is partitioned across expert peers. That matters because TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding shards Q/K/V/O and MLP weights across the tensor-parallel axis while EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding shards the routed-expert bank across the expert-parallel axis. Replicated weights (embeddings, norm affines, router projections, some feature banks) stay full-size on every rank, which is the same ownership split described in tensor parallel and sharding and the checked-in parallelism glossary sample.
Gradients
For the eager CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 path, the estimator counts materialised gradients at the same byte-for-byte size as the sharded parameter set. FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample keeps gradients sharded along the DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP axis after the reduce-scatter, so the gradient term scales down with dp_degree for anything that is not replicated. First touch: data parallelQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP sharding here means each rank owns only a shard after reduction, not that every rank keeps a full copy forever. For replicated tensors, gradients are not sharded; that is a small tax but not negligible once the feature stack gets dense. On XLA SPMDQuick term guideXLA SPMDThe explicit TPU sharding mode where one compiled program carries placement rules instead of rank-local imperative code.GroundingAbout: XLA SPMD sharding annotations About: XLA SPMD tokenizer and vocab on TPU Example: TPU backend ownership note we deliberately do not count a full extra gradient copy, because the compiled step folds update scratch into a fused region; counting it separately would overstate TPU HBM. On CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200, we count materialised, because it is materialised. That is why FSDP CUDA and Megatron DDP and FSDP2 on XLA TPU do not produce the same memory story even when the model is nominally the same. The checked-in FSDP sharding sample and FSDP2 Muon local shard sample are the local proof surfaces for that ownership story.
The official PyTorch FSDP material describes the same residency pattern in general terms: parameters are all-gathered into unsharded views before forward and backward work, gradients are reduce-scattered afterward, and the optimizer updates local sharded state. That is why the resting per-rank footprint can stay low even though short-lived full-parameter views still exist around compute.
Optimizer state
This is where the arithmetic surprises the newcomer. AdamW keeps m and v moments in fp32, which is 8 bytes per parameter. Muon keeps a single bf16 momentum buffer at ~2 bytes per parameter. First touch: Muon only makes sense on matrix-like 2D weights, so the practical comparison is not "all AdamW versus all Muon" but "Muon on the large matrix surfaces, AdamW on the scalar and embedding-heavy leftovers." A 4B model under pure AdamW needs 32 GB of optimizer state before sharding; the same model under Muon needs 8 GB. On a well-sharded FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample run the optimizer state divides across the DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP group, but it is the single largest memory component on any configuration with a small DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP degree. Our production route is Muon on most matrix parameters and selective AdamW on specific output projections; the 4x saving over pure AdamW is the single reason the 4B-8B size is feasible on this topology at all. The optimizer-side drift and sharding assumptions behind that choice are unpacked separately in Distributed Optimizer Stress, and the checked-in CPU optimizer offload sample shows the same pressure from the memory side.
The part that still surprises people is that this is mostly a steady-state win, not a promise that the optimizer step is flat. Orthogonalizing a sharded 2D weight can briefly need an all-gathered matrix view before the update, so the real launch gate is peak step memory, not only resting optimizer residency.
Activations
Activation memory scales with batch * seq * hidden * layers * dtype * K where K is the implementation-specific constant of how many intermediates per layer are saved for backward. For a dense attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns block with flash attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns (no materialised QK^T), the Megatron formula gives roughly 34 half-precision elements per token-channel per layer. In bytes, B * T * D * 34 * n_layer * bpe / 2, At B=1, T=4096, D=1536, n_layer=52, bpe=2 that is on the order of 10-20 GB before any hybrid-specific extras. First touch: checkpointing or recompute does not make activations free; it changes how long those intermediates stay resident by replaying some forward work in backward instead of saving every tensor. With gradient checkpointing the estimator switches to a much smaller formula, which is the single biggest memory lever available.
The hybrid stack adds real terms. MambaQuick term guideMambaA grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode layers in fp32 are roughly 2x the per-layer activation of a bf16 attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns layer (we scale by the M-layer fraction). MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack routing adds per-layer router logits in fp32 plus gathered input and expert output buffers sized by capacity factor, summed over the MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack layer count. DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample indexer and Engram add their own intermediates. The accumulated total dominates when context grows.
MoE routing, KV cache, and allocator overhead
MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack routing buffers are the term newcomers forget. With capacity factor 1.25, top-6 routing on 64 experts, and a 4 K context, the per-layer dispatch buffers are several hundred MB; summed over the MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack layer count of the deep preset, that is several GB. First touch: capacity factor is the router's slack multiplier that reserves extra per-expert token slots above the ideal average load so a hot expert does not overflow immediately. The estimator computes per-layer buffers as B * T * N * 4 (router logits, fp32, replicated) plus local_experts * C_e * D * bpe twice (gathered input and expert output), with C_e = ceil(1.25 * B * T * top_k / N). This is also why the MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack preset cares about the dispatcher: the alltoall path holds a smaller working set than the padded path, but the padded path is faster on small EP. The operational side of that tradeoff lines up with MoE routing we actually shipped, training speed by feature, and the checked-in EP capacity planning sample.
One easy mistake is to assume that more experts automatically means proportionally larger dispatch buffers. The routed-activation side does not really work that way. At fixed batch, sequence length, hidden size, and capacity factor, the hot dispatch-buffer term is driven much more directly by token load and reserved slack than by raw expert count. Extra experts usually hurt the static parameter budget first; capacity factor is what directly inflates the hot routing buffer.
KV cacheQuick term guideKV CacheThe stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.GroundingAbout: KV cache and paged attention Example: Dense FA4 KV-cache decode sample Reference: inference serving stack during trainingQuick term guideTrainingA grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…GroundingSLM training in MegaCpp: what the stack optimizes for and what stays explicit Training speed anatomy on H200 is zero because we recompute attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns via flash. KV cacheQuick term guideKV CacheThe stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.GroundingAbout: KV cache and paged attention Example: Dense FA4 KV-cache decode sample Reference: inference serving stack during eval is variable, capped by paged KVQuick term guidePaged attentionThe decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.GroundingAbout: KV cache and paged attention Example: Dense FA4 KV-cache decode sample Reference: inference serving stack, and is what makes the inference memory budget look completely different from the trainingQuick term guideTrainingA grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…GroundingSLM training in MegaCpp: what the stack optimizes for and what stays explicit Training speed anatomy on H200 memory budget. The estimator does not count it on the trainingQuick term guideTrainingA grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…GroundingSLM training in MegaCpp: what the stack optimizes for and what stays explicit Training speed anatomy on H200 path; the eval path has its own calibration.
Allocator overhead is the last term and the one that breaks late. Without PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack activation churn fragments the caching allocator into a state where reserved - allocated drifts upward and a step that had headroom at step 50 OOMs at step 500. With it on, fragmentation is bounded; we still budget 5-10% of peak as allocator overhead because the allocator does not give back what it does not have to.
The official CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 allocator note describes the same flag in slightly lower-level terms: it lets a segment grow when allocation sizes drift, which avoids creating lots of partially filled segments and unusable memory slivers when each slightly larger request would otherwise need a fresh allocation.
That flag is allocator hygiene, not a secretQuick term guideSecretsModal's credential-injection surface for environment variables and access tokens, kept separate from both the pinned image and writable Volumes.GroundingAbout: Modal training platform overview Reference: Modal debugging playbook Reference: Modal Secrets docs model-size reduction. It does not shrink parameters, activations, or routing buffers, and it does not change trainingQuick term guideTrainingA grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…GroundingSLM training in MegaCpp: what the stack optimizes for and what stays explicit Training speed anatomy on H200 math. What it changes is whether fragmented reserved memory turns into a fifth hidden budget line that grows until a previously healthy run OOMs for allocator reasons rather than model reasons.
from memory_estimator import estimate_memory
est = estimate_memory(
model_config=cfg, world_size=8, tp_degree=2, ep_degree=4,
dp_degree=1, batch_size=1, seq_len=4096,
grad_checkpoint=True, optimizer="muon+adamw",
)
for name, gb in est.as_dict().items():
print(f"{name:24s} {gb:6.2f} GiB")
How the estimator drives auto-fit
The startup calibration pass writes a small JSON record at the end of every successful step 0 with the per-component sizes the estimator predicted versus what torch.cuda.memory_stats actually reported. The auto-fit retry ladder reads that record before the next launch: if predicted activations were within 5% of measured, the next launch trusts the estimator; if they were off by more, the next launch widens the safety margin. The same record is what lets dbs step down from 8 to 4 to 2 without re-discovering the memory cliff every time. Without the calibration loop, the estimator is just a calculator; with it, the estimator is the contract that decides whether a launch is allowed.
The runtime forensics path pairs torch.cuda.memory_stats with memory snapshotsQuick term guideMemory SnapshotsModal's snapshot restore surface for import-time or initialization-time startup state. In these articles it narrows cold-start tax but does not replace warm compile caches or host-local storage semantics.GroundingAbout: Modal vs owned hardware History: multi-GPU Modal benchmarks Reference: Modal image and cold start. When a step OOMs, we capture a memory snapshotQuick term guideMemory SnapshotsModal's snapshot restore surface for import-time or initialization-time startup state. In these articles it narrows cold-start tax but does not replace warm compile caches or host-local storage semantics.GroundingAbout: Modal vs owned hardware History: multi-GPU Modal benchmarks Reference: Modal image and cold start, diff it against the previous step's snapshot, and look for the largest growing bucket. Most late-stage OOMs are not "model too big"; they are "MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack dispatch buffer grew because a new expert got hot" or "MambaQuick term guideMambaA grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode conv1d state grew because the chunk size changed". The forensics has to be granular enough to tell those apart, which is the same shorter triage flow condensed in OOM Debugging Playbook for H200 Training Runs.
What proved worth keeping
The analytical estimator stays as the launch gate, the calibration loop stays as the auto-fit input, Muon plus selective AdamW remains the optimizer baseline, gradient checkpointing remains standard above shallow presets, expandable_segments:True stays the allocator default, and any new feature should ship with an estimator entry before it ships with a launcher entry.
Pure AdamW at the 4B-8B size is not attractive because a 32 GB optimizer state for a 4 GB parameter set is hard to justify. Per-step memory snapshotsQuick term guideMemory SnapshotsModal's snapshot restore surface for import-time or initialization-time startup state. In these articles it narrows cold-start tax but does not replace warm compile caches or host-local storage semantics.GroundingAbout: Modal vs owned hardware History: multi-GPU Modal benchmarks Reference: Modal image and cold start in the hot path are also too expensive. It is also a mistake to assume attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns dominates activation memory on every deep preset, or to plan capacity while ignoring MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack dispatch buffers and allocator overhead. That discipline is not glamorous, but it is the reason a 4B model can fit on an H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 without relying on guesswork.
The 16K context wall
At 4K context the deep preset fits on an H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 with room to breathe; at 16K the same preset is on the edge of OOM, and the contributors that grow are predictable. Activation memory scales linearly with sequence length, so the bf16 attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns term roughly quadruples between 4K and 16K. MambaQuick term guideMambaA grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode's per-layer fp32 activation also grows linearly. MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack dispatch buffers grow linearly because C_e is proportional to B * T. KV cacheQuick term guideKV CacheThe stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.GroundingAbout: KV cache and paged attention Example: Dense FA4 KV-cache decode sample Reference: inference serving stack during trainingQuick term guideTrainingA grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…GroundingSLM training in MegaCpp: what the stack optimizes for and what stays explicit Training speed anatomy on H200 is still zero (we recompute), so the long-context tax is entirely in activations and dispatch. Gradient checkpointing absorbs most of the activation growth; what it does not absorb is the dispatch buffer growth, which is why the 16K configuration shifts the dominant memory term from "MambaQuick term guideMambaA grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode activations" to "MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack dispatch".
The implication for capacity planning is that the 16K configuration cannot run at the same EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding and TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding shapes as the 4K configuration. We documented two operating points: 4K at TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding=2 EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding=4 dpQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP=1 with full FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample, and 16K at TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding=2 EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding=4 dpQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP=1 with FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample plus selective recompute on the MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack blocks. The 16K configuration is not faster per token; it produces longer sequences for the data mixes that need them, and the memory budget is the binding constraint.
Operationally, this is why we move one long-context lever at a time. Recompute policy, capacity factor, top-k, and microbatch push different budget lines, so changing two together makes it harder to tell whether you hit an activation wall or a routing wall; H200 Memory Geometry for the Hybrid Stack and OOM Debugging Playbook for H200 Training Runs are the quickest companion reads when the estimate and the live run disagree.
What CUDA reports versus what is true
torch.cuda.memory_allocated() and torch.cuda.memory_reserved() are different numbers and they diverge during a run. Allocated is what tensors currently hold; reserved is what the caching allocator has held onto. The gap is allocator overhead, and it grows under MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack workloads because the dispatch buffer cycles through different shapes per expert popularity wave. With expandable_segments:True the gap is bounded; without it, the gap drifts upward indefinitely.
Two diagnostics matter. The first is torch.cuda.memory_stats(), which exposes per-segment counters that let us tell "we held a 2 GB chunk for 30 seconds and freed it" from "we held it for the whole run". The second is the snapshot diff: capture a snapshot at step N and step N+100, sort buckets by growth, and the largest growing bucket is almost always the diagnosis. We snapshot every 1000 steps in production, not every step, because the snapshot itself takes a few hundred milliseconds and we do not want to perturb the steady-state.
The measurement record should capture peak memory_reserved per rank and the largest growing bucket between the first and last steady-state snapshot. When a regression appears, that pair of numbers is usually enough to point at the cause.
What this implies for the next architecture
The memory accounting above is what disqualifies a number of attractive architecture moves. Doubling the expert count from 64 to 128 is not free, but at fixed token load and capacity factor the first pain usually lands in static parameter storage rather than in a direct doubling of the hot dispatch buffer. Moving to a denser MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack (top-k = 8 instead of 6) or a larger capacity factor pushes the live routed-token buffer directly, which the deep preset cannot absorb on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 at 16K context. Adding a second MambaQuick term guideMambaA grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode branch per layer doubles the per-layer fp32 activation, which the gradient checkpointing absorbs but only at the cost of an extra recompute pass.
Each of these is a real proposal someone has made; each was rejected on the strength of the analytical estimator, not on a benchmark. The estimator is the gate. The benchmark exists to confirm the estimator and to catch the cases where the estimator is wrong about a specific shape. If we lose the estimator we lose the gate, and architecture proposals start landing on hope rather than arithmetic.
The clearest companion posts are H200 Memory Geometry for the Hybrid Stack, Activation Recompute Boundaries in Hybrid Stacks, and Sequence, Context, and Expert Splits in the Hybrid Stack: one names the memory surfaces, one explains which activations can be replayed safely, and one explains which distributed axes can actually move the large terms.
Frequently asked questions
Why is parameter size not the main memory term here?+
Why does 16K context hurt even with Flash Attention?+
Does adding more experts automatically double the dispatch-buffer memory?+
Why is capacity factor usually the first MoE knob to revisit at 16K?+
CF=1.0 to CF=1.25 buys 25% more routing slack and roughly 25% more live routing-buffer memory; pushing to CF=2.0 can nearly double that line. That makes capacity factor a more direct long-context memory lever than raw expert count when the lane is already near the wall. EP capacity planning sample and Expert parallel and MoE sharding are the shortest local follow-on reads.If lowering capacity factor saves memory, why not always push it down?+
Why can a small microbatch change still push a run over the wall?+
dbs, sequence length, and routing capacity as first-class inputs instead of assuming that a "4B model" has one stable per-rank footprint. Memory budget sample, distributed memory notes, and Gradient accumulation and microbatching are the shortest local follow-ons.What does allocator overhead mean in practice?+
overhead_gb and runtime_reserved_gb separate for exactly that reason, and distributed memory notes explains why reserved and allocated diverge. It is also why expandable_segments:True should be read as a fragmentation control, not as a substitute for arithmetic. If the model only fits after that flag but the routed-token and activation lines are already at the wall, the lane is still structurally too large; the allocator just stopped being the first thing to fail.Why can a run fit at step 50 and still OOM later?+
allocated can look flat while reserved keeps drifting upward. Distributed memory notes, OOM triage sample, and OOM debugging playbook for H200 training runs are the shortest local readback for that late-failure pattern.Does Muon plus FSDP2 ever create a temporary memory spike?+
Why is peak step memory a better launch gate than resting memory?+
Which checked-in files should I inspect first if a small-model H200 run still looks too large?+
Where should I go next if the problem is clearly MoE routing pressure rather than dense activations?+
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.
The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.
NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.
Data parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.
A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…
Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.
A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…
A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…
The explicit TPU sharding mode where one compiled program carries placement rules instead of rank-local imperative code.
DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.
Modal's credential-injection surface for environment variables and access tokens, kept separate from both the pinned image and writable Volumes.
Modal's snapshot restore surface for import-time or initialization-time startup state. In these articles it narrows cold-start tax but does not replace warm compile caches or host-local storage semantics.
The decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.
The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.