OOM Debugging Playbook for H200 Training Runs
A practical playbook for triaging H200 out-of-memory failures: distinguish fragmentation from true exhaustion, isolate the largest activation surfaces, and apply the cheapest fix first.

Out-of-memory failures on modern accelerators are often diagnosed too loosely. "Needs a smaller batch" is only one of several possibilities. In practice, most H200 OOMs fall into one of four buckets:
- True activation pressure: the model really does not fit at the current geometry.
- Fragmentation: the allocator has enough total free memory but cannot serve the next large request cleanly.
- Workspace spikes: a fused kernel, attention backend, or MoE path asks for a temporary buffer much larger than usual.
- Optimizer or cache bursts: the step fits most of the time, then a later phase produces a short-lived peak.
The playbook is to identify which bucket you are in before changing anything, rather than changing everything at once.
Step 1: separate fragmentation from real exhaustion
Look at allocator retry counts, inactive split bytes, and the largest failed allocation request. If retries are climbing and inactive split memory is large, the run is fragmentation-bound. If retries are low but the requested allocation itself is too large relative to free space, the run is truly out of memory.
That distinction matters because the fixes are different. Fragmentation problems usually respond to allocator configuration or a slightly smaller peak burst. True exhaustion requires reducing retained state or batch geometry.
A quick readback is reserved versus allocated memory. If reserved keeps widening away from live allocated bytes while the working set is otherwise flat, the allocator tail is becoming the story rather than model geometry. On the H200 side, allocator growth comes first and split-size tuning is a follow-on move only if retries and inactive split bytes still climb after the safer fragmentation fixes. "Why a 4B-8B model fills an H200 and still OOMs" and "A Memory-Budget Anatomy for One Specialist on H200:8" are the local companions for that split.
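The Step 1 decision can be sketched as a small classifier over allocator counters. This is a minimal sketch, assuming a PyTorch-style caching allocator whose `torch.cuda.memory_stats()` dictionary exposes `num_alloc_retries`, `inactive_split_bytes.all.current`, `reserved_bytes.all.current`, and `allocated_bytes.all.current`. The names `AllocatorReadback`, `read_allocator`, and `classify_oom`, and both thresholds, are hypothetical illustration values rather than tuned recommendations.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AllocatorReadback:
    num_alloc_retries: int
    inactive_split_bytes: int
    reserved_bytes: int
    allocated_bytes: int


def read_allocator(stats: Mapping[str, int]) -> AllocatorReadback:
    """Extract the Step 1 signals from a memory_stats()-style mapping."""
    return AllocatorReadback(
        num_alloc_retries=int(stats.get("num_alloc_retries", 0)),
        inactive_split_bytes=int(stats.get("inactive_split_bytes.all.current", 0)),
        reserved_bytes=int(stats.get("reserved_bytes.all.current", 0)),
        allocated_bytes=int(stats.get("allocated_bytes.all.current", 0)),
    )


def classify_oom(
    rb: AllocatorReadback,
    failed_request_bytes: int,
    device_free_bytes: int,
    retry_threshold: int = 1,        # assumption: any retry counts as a signal
    split_fraction: float = 0.10,    # assumption: "large" = 10% of reserved
) -> str:
    # Memory the allocator holds but is not using for live tensors.
    cached_unused = rb.reserved_bytes - rb.allocated_bytes
    total_free = cached_unused + device_free_bytes

    # The request exceeds everything that is free: no allocator tuning helps.
    if failed_request_bytes > total_free:
        return "true_exhaustion"

    # Enough free memory in total, but retries and split blocks are high.
    fragmented = (
        rb.num_alloc_retries >= retry_threshold
        and rb.inactive_split_bytes >= split_fraction * rb.reserved_bytes
    )
    return "fragmentation" if fragmented else "inconclusive"
```

In a PyTorch process, `device_free_bytes` could be taken from the first element of `torch.cuda.mem_get_info()`, and the failed request size from the OOM error message.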
Step 2: locate the largest activation surface
The next question is where the peak comes from.
- If the largest temporary is attention workspace, reduce the number of layers using the most memory-hungry attention backend or lower the sequence-related pressure that drives workspace size.
- If the largest temporary is inside MoE, check whether dispatch scratch and expert activations are being reused efficiently.
- If the largest burst appears during the optimizer step, focus on optimizer partitioning rather than activation checkpointing.
- If the peak comes from evaluation or serving mixed into the same process, check KV-cache growth before touching training settings.
The fastest way to waste time is to tune recompute when the real culprit is a cache or optimizer burst.
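One way to find where the peak comes from is to bracket each phase of a step and record its peak separately. The sketch below assumes nothing about the framework: it takes two callables, one that resets the peak counter and one that reads it. In a PyTorch process these could plausibly be `torch.cuda.reset_peak_memory_stats` and `torch.cuda.max_memory_allocated`; the `PhasePeaks` class itself is a hypothetical helper.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class PhasePeaks:
    """Record the peak memory reading observed inside each named phase."""

    def __init__(self, reset_peak: Callable[[], None], read_peak: Callable[[], int]) -> None:
        self._reset_peak = reset_peak
        self._read_peak = read_peak
        self.peaks: Dict[str, int] = {}

    @contextmanager
    def phase(self, name: str) -> Iterator[None]:
        # Reset so the reading reflects only this phase (plus whatever
        # was already live when the phase started).
        self._reset_peak()
        try:
            yield
        finally:
            peak = self._read_peak()
            self.peaks[name] = max(self.peaks.get(name, 0), peak)

    def dominant(self) -> str:
        """Name of the phase with the highest recorded peak."""
        return max(self.peaks, key=lambda name: self.peaks[name])
```

Wrapping forward, backward, optimizer step, and any co-located evaluation in separate `phase(...)` blocks gives one number per bucket, which is enough to decide whether recompute, optimizer partitioning, or a cache cap is the relevant lever.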
Step 3: apply the cheapest structural fix first
Once the peak surface is known, fix it in order of cost.
Allocator-level fixes
If fragmentation is the issue, use allocator settings that favor expandable segments and avoid unnecessarily aggressive segment splitting. These changes are low-risk and should be confirmed before reshaping the model.
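As an illustration of the allocator-level path, the sketch below builds an allocator configuration string that turns on expandable segments and only optionally adds a split-size limit. It assumes the PyTorch CUDA caching allocator, which reads comma-separated `key:value` options from the `PYTORCH_CUDA_ALLOC_CONF` environment variable; the helper names `build_alloc_conf` and `apply_alloc_conf` are hypothetical.

```python
import os
from typing import MutableMapping, Optional


def build_alloc_conf(
    expandable_segments: bool = True,
    max_split_size_mb: Optional[int] = None,
) -> str:
    """Compose an allocator option string; split-size is opt-in only."""
    parts = []
    if expandable_segments:
        parts.append("expandable_segments:True")
    if max_split_size_mb is not None:
        if max_split_size_mb <= 0:
            raise ValueError("max_split_size_mb must be positive")
        parts.append(f"max_split_size_mb:{max_split_size_mb}")
    return ",".join(parts)


def apply_alloc_conf(
    env: Optional[MutableMapping[str, str]] = None,
    **options,
) -> str:
    """Write the option string into the environment and return it.

    Assumption: this runs before the process initializes CUDA, since the
    allocator reads its configuration at startup.
    """
    target = os.environ if env is None else env
    conf = build_alloc_conf(**options)
    target["PYTORCH_CUDA_ALLOC_CONF"] = conf
    return conf
```

Calling `apply_alloc_conf()` at the top of the launch script gives the expandable-segments default; passing `max_split_size_mb=...` is kept as a separate, explicit step so it is only introduced when the evidence calls for it.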
Activation-memory fixes
If attention dominates, use selective recompute or reduce the set of layers using the heaviest attention backend.
If MoE dominates, prefer selective expert recompute over full-block checkpointing. Replaying dispatch and collective-heavy paths is usually too expensive.
If recurrent or Mamba-style blocks dominate, look for narrow in-module recompute before wrapping the whole block.
Optimizer-state fixes
If the burst appears on the optimizer step, verify that optimizer state is actually partitioned as intended. Silent regressions here can multiply the optimizer footprint by the data-parallel degree.
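The size of such a regression is easy to estimate with back-of-the-envelope arithmetic. The sketch below assumes an Adam-style optimizer holding fp32 master weights plus two fp32 moments, roughly 12 bytes per parameter; that figure and the function name are assumptions for illustration, and the real footprint depends on the optimizer and precision recipe in use.

```python
def optimizer_state_bytes_per_rank(
    num_params: int,
    dp_degree: int,
    partitioned: bool,
    bytes_per_param: int = 12,  # assumption: fp32 master + two fp32 Adam moments
) -> int:
    """Estimate optimizer-state bytes held by one data-parallel rank."""
    if dp_degree < 1:
        raise ValueError("dp_degree must be at least 1")
    total = num_params * bytes_per_param
    if not partitioned:
        # Every rank holds a full replica of the optimizer state.
        return total
    # State is sharded across ranks; round up for the largest shard.
    return -(-total // dp_degree)


GB = 10**9
sharded = optimizer_state_bytes_per_rank(8 * 10**9, dp_degree=8, partitioned=True)
replicated = optimizer_state_bytes_per_rank(8 * 10**9, dp_degree=8, partitioned=False)
print(f"sharded: {sharded / GB:.0f} GB per rank, replicated: {replicated / GB:.0f} GB per rank")
```

Under these assumptions an 8B-parameter model on 8 data-parallel ranks holds about 12 GB of optimizer state per rank when partitioned and about 96 GB when partitioning has silently fallen back to replication, which is the factor-of-dp_degree jump described above.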
Cache and serving fixes
If a long-context evaluation or serving path shares the node, cap KV cache explicitly. Unbounded cache growth is one of the most avoidable OOM sources in mixed workloads.
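A cap is easier to enforce when it is expressed in tokens rather than bytes. The sketch below assumes a conventional cache layout that stores one key and one value vector per KV head per layer for every token; an MLA-style compressed cache would need a different per-token formula. The model dimensions and function names are hypothetical.

```python
def kv_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # assumption: bf16 or fp16 cache entries
) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(cap_bytes: int, per_token_bytes: int) -> int:
    """Largest token count whose cache fits under the byte cap."""
    return cap_bytes // per_token_bytes


def admit(current_tokens: int, new_tokens: int, cap_tokens: int) -> bool:
    """Return True only if adding new_tokens stays within the cap."""
    return current_tokens + new_tokens <= cap_tokens


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
cap_tokens = max_cached_tokens(8 * (1 << 30), per_token)  # 8 GiB budget
```

With these illustrative dimensions each token costs 128 KiB of cache, so an 8 GiB budget admits 65,536 cached tokens in total across all concurrent sequences; requests beyond that are rejected or evicted by policy instead of growing into training memory.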
Step 4: verify buffer reuse and temporary sizing
Large scratch buffers are often supposed to be reused. When reuse breaks because shapes drift between calls, every block allocates a fresh buffer and the run drowns in temporaries.
That is especially important for MoE dispatch scratch and attention workspace. Check whether metadata stays shape-stable across layers and whether temporary buffers are sized to actual per-rank token counts rather than pessimistic worst-case maxima.
Attention workspaces have the same problem on packed or ragged batches: the temporary can scale with true valid-token counts plus backend bookkeeping such as split-k or tile metadata, not only with one padded B x T headline. If the receipt records only padded shape, it can hide the thing that actually spiked. "Packed rows as the real training contract" is the local continuation for that boundary.
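The interaction between shape drift and buffer reuse can be shown with a toy pool. This is a sketch only: `ScratchPool` and `bucket_tokens` are hypothetical names, a `bytearray` stands in for a device buffer, and rounding token counts up to a fixed multiple is one possible way to stay shape-stable while still sizing to actual traffic, not a description of any particular backend.

```python
import math
from typing import Dict, Sequence, Tuple


def bucket_tokens(actual_tokens: int, multiple: int = 256) -> int:
    """Round an actual token count up to the next bucket boundary."""
    return max(multiple, math.ceil(actual_tokens / multiple) * multiple)


class ScratchPool:
    """Reuse scratch buffers keyed by exact shape and element size."""

    def __init__(self) -> None:
        self._buffers: Dict[Tuple[Tuple[int, ...], int], bytearray] = {}
        self.fresh_allocations = 0
        self.reuses = 0

    def get(self, shape: Sequence[int], itemsize: int = 2) -> bytearray:
        key = (tuple(shape), itemsize)
        buf = self._buffers.get(key)
        if buf is None:
            # Shape not seen before: reuse breaks and a new buffer is made.
            buf = bytearray(math.prod(shape) * itemsize)
            self._buffers[key] = buf
            self.fresh_allocations += 1
        else:
            self.reuses += 1
        return buf


per_layer_tokens = [1000, 1010, 990, 1020]  # illustrative per-rank counts

drifting = ScratchPool()
for n in per_layer_tokens:
    drifting.get((n, 64))                   # raw counts: every shape differs

stable = ScratchPool()
for n in per_layer_tokens:
    stable.get((bucket_tokens(n), 64))      # bucketed counts share one shape
```

Here the drifting pool makes four fresh allocations while the bucketed pool makes one and reuses it three times, with a buffer sized to 1,024 tokens rather than to a worst-case maximum.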
Step 5: keep the debug loop minimal
A good OOM debug loop is short.
- Run one step with allocator and memory debugging enabled.
- Read retry counts, inactive split bytes, and top allocation sites.
- Decide whether the failure is fragmentation, activations, workspace, optimizer, or cache.
- Change one class of fix at a time.
- Re-run and compare the same signals.
Per-step memory tracing is usually too expensive to leave on. A targeted early-step snapshot is usually enough to identify the dominant pressure source. When you do sample allocator counters, do it at stable step boundaries such as the end of a microbatch or the optimizer step so you compare like with like instead of chasing transient kernel noise.
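Comparing like with like can be made mechanical by tagging each readback with the boundary at which it was taken and refusing to diff mismatched ones. The sketch below again assumes `torch.cuda.memory_stats()`-style key names; `snapshot`, `compare`, and the boundary labels are hypothetical.

```python
from typing import Dict, Mapping, Union

SIGNALS = (
    "num_alloc_retries",
    "inactive_split_bytes.all.current",
    "reserved_bytes.all.current",
    "allocated_bytes.all.current",
)

Snapshot = Dict[str, Union[int, str]]


def snapshot(stats: Mapping[str, int], step: int, boundary: str) -> Snapshot:
    """Capture the tracked signals at a named step boundary."""
    snap: Snapshot = {"step": step, "boundary": boundary}
    for key in SIGNALS:
        snap[key] = int(stats.get(key, 0))
    return snap


def compare(before: Snapshot, after: Snapshot) -> Dict[str, int]:
    """Per-signal delta between two snapshots taken at the same boundary."""
    if before["boundary"] != after["boundary"]:
        raise ValueError(
            f"boundary mismatch: {before['boundary']!r} vs {after['boundary']!r}"
        )
    return {key: int(after[key]) - int(before[key]) for key in SIGNALS}
```

Taking one snapshot before a fix and one after it, both at the end of the optimizer step for example, yields a small delta table that shows whether retries and inactive split bytes actually moved.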
What we generally keep
- Selective activation recompute as the default, rather than full checkpointing everywhere.
- Partitioned optimizer state for larger training jobs.
- Allocator settings that reduce fragmentation under bursty demand.
- Explicit KV-cache limits on mixed training and evaluation workloads.
- Shape-stable temporary-buffer reuse on MoE and attention-heavy paths.
What we avoid by default
- Full checkpointing everywhere.
- Over-provisioned temporary buffers sized to worst-case token counts when actual traffic is much lower.
- Long-running, per-step memory tracing during normal training.
- Mixing serving-style KV growth into a training node without an explicit cap.
Fast triage checklist
| Signal | Likely issue | First move |
|---|---|---|
| High allocator retries and large inactive split memory | Fragmentation | Favor expandable segments; reduce peak burst slightly |
| Largest allocation inside attention workspace | Attention backend pressure | Lower heavy-backend usage or sequence-related pressure |
| Largest allocation inside MoE scratch | Dispatch or expert temporary growth | Verify scratch reuse and actual token-based sizing |
| Peak arrives on optimizer step | Optimizer-state burst | Verify optimizer partitioning |
| Peak arrives during co-located eval or serving | KV-cache growth | Apply an explicit KV cap |
The point of the checklist is not to be exhaustive. It is to keep you from treating every OOM as the same bug.
Frequently asked questions
Why can attention workspace spike even when padded sequence length did not change?
Workspace size can scale with true valid-token counts and backend bookkeeping such as split-k or tile metadata, not only with the padded shape. The same padded B x T can therefore ask for different temporary buffers once the token distribution changes inside that shape. "Packed rows as the real training contract" is the local continuation for that seam.
What should the allocator snapshot record before changing the batch shape?
At minimum, the signals used in Steps 1 and 5: allocator retry counts, inactive split bytes, reserved versus allocated memory, the largest failed allocation request, and the top allocation sites, sampled at a stable step boundary.
When should max_split_size_mb enter the OOM debug loop?
If expandable_segments is available and the evidence says fragmentation, use that path first; reach for max_split_size_mb only when the native allocator still shows rising inactive split bytes and allocation retries. Tune it near the largest recurring temporary you are trying to preserve, rather than copying a small value from another workload. "A Memory-Budget Anatomy for One Specialist on H200:8" keeps the same rule in the capacity-planning checklist.
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
- H200: NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
- Debugging: A grounded guide for debugging Modal failures in MegaCpp: cold starts, multi-GPU hangs, image drift, detached collector issues, and volume or…
- Training: What actually sets training speed on H200 in public MegaCpp reporting: compile warmup policy, block mix, memory shape, and why local wins often fail…
- KV cache: The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.
- Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
- Packed rows as the real training contract: Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…