MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
Published · 5 min read · David Gornshtein
Tags: OOM, H200, Memory, Debugging, Training

OOM Debugging Playbook for H200 Training Runs

A practical playbook for triaging H200 out-of-memory failures: distinguish fragmentation from true exhaustion, isolate the largest activation surfaces, and apply the cheapest fix first.


Out-of-memory failures on modern accelerators are often diagnosed too loosely. "Needs a smaller batch" is only one of several possible diagnoses. In practice, most H200 OOMs fall into one of four buckets:

  1. True activation pressure: the model really does not fit at the current geometry.
  2. Fragmentation: the allocator has enough total free memory but cannot serve the next large request cleanly.
  3. Workspace spikes: a fused kernel, attention backend, or MoE path asks for a temporary buffer much larger than usual.
  4. Optimizer or cache bursts: the step fits most of the time, then a later phase produces a short-lived peak.

The playbook is to identify which bucket you are in before changing everything at once.

Step 1: separate fragmentation from real exhaustion

Look at allocator retry counts, inactive split bytes, and the largest failed allocation request. If retries are climbing and inactive split memory is large, the run is fragmentation-bound. If retries are low but the requested allocation itself is too large relative to free space, the run is truly out of memory.

That distinction matters because the fixes are different. Fragmentation problems usually respond to allocator configuration or a slightly smaller peak burst. True exhaustion requires reducing retained state or batch geometry.
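The split above can be written down as a small decision function. This is a minimal sketch, not the MegaCpp tooling: the `AllocatorReading` record, the `classify_oom` name, and both thresholds are illustrative assumptions. If the run uses the PyTorch caching allocator, the counters would most likely be read from `torch.cuda.memory_stats()`; here they are passed in as plain integers so the logic can be checked on its own.

```python
from dataclasses import dataclass


@dataclass
class AllocatorReading:
    """One allocator readback. All sizes are in bytes. Hypothetical record."""
    alloc_retries: int          # cumulative allocator retry count
    inactive_split_bytes: int   # free memory stranded inside split blocks
    reserved_bytes: int         # memory held by the caching allocator
    allocated_bytes: int        # memory backing live tensors
    device_free_bytes: int      # free device memory outside the allocator
    failed_request_bytes: int   # size of the request that raised the OOM


def classify_oom(r: AllocatorReading,
                 retry_threshold: int = 1,
                 split_fraction: float = 0.10) -> str:
    """Return 'fragmentation', 'exhaustion', or 'unclear'.

    Both thresholds are assumptions for illustration, not tuned values.
    """
    cached_free = r.reserved_bytes - r.allocated_bytes
    total_free = cached_free + r.device_free_bytes

    # The request is larger than all free memory combined: no allocator
    # setting can serve it, so the run is truly out of memory.
    if r.failed_request_bytes > total_free:
        return "exhaustion"

    # Enough free bytes exist in total, but retries are climbing and a
    # large share of reserved memory is stranded in inactive splits.
    stranded = r.inactive_split_bytes >= split_fraction * r.reserved_bytes
    if r.alloc_retries >= retry_threshold and stranded:
        return "fragmentation"

    return "unclear"


GIB = 1 << 30
fragmented = AllocatorReading(12, 20 * GIB, 130 * GIB, 100 * GIB, 2 * GIB, 8 * GIB)
exhausted = AllocatorReading(0, 1 * GIB, 135 * GIB, 134 * GIB, 1 * GIB, 8 * GIB)
```

An "unclear" result is a prompt to collect more signals, not a verdict. It usually means the peak surface needs to be located before choosing a fix.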

A quick readback is reserved versus allocated memory. If reserved bytes keep widening away from live allocated bytes while the working set is otherwise flat, the allocator tail is becoming the story rather than model geometry. On H200, allocator growth comes first; split-size tuning is a follow-on move, used only if retries and inactive split bytes still climb after the safer fragmentation fixes. "Why a 4B-8B model fills an H200 and still OOMs" and "A Memory-Budget Anatomy for One Specialist on H200:8" are the local companions for that split.
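The reserved-versus-allocated readback can be reduced to a single trend check. The sketch below assumes samples are taken at the same step boundary each time; the function names and both tolerances are hypothetical.

```python
def reserved_slack(reserved_bytes: int, allocated_bytes: int) -> float:
    """Fraction of reserved memory that is not backing live tensors."""
    if reserved_bytes <= 0:
        return 0.0
    return (reserved_bytes - allocated_bytes) / reserved_bytes


def slack_is_widening(samples, min_growth=0.05, flat_tolerance=0.02) -> bool:
    """samples: list of (reserved_bytes, allocated_bytes) in time order.

    True when allocated bytes stay roughly flat (within flat_tolerance)
    while the slack fraction grows by at least min_growth between the
    first and last sample. Both tolerances are illustrative assumptions.
    """
    if len(samples) < 2:
        return False
    first_res, first_alloc = samples[0]
    last_res, last_alloc = samples[-1]
    alloc_drift = abs(last_alloc - first_alloc) / max(first_alloc, 1)
    growth = (reserved_slack(last_res, last_alloc)
              - reserved_slack(first_res, first_alloc))
    return alloc_drift <= flat_tolerance and growth >= min_growth


# Working set flat, reserved growing: the allocator tail is the story.
allocator_tail = [(100, 90), (110, 90), (125, 91)]
# Reserved and allocated grow together: this is model geometry.
real_growth = [(100, 90), (120, 108)]
```

Comparing only the first and last sample keeps the check cheap. A noisier run may need a median over several boundaries instead.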

Step 2: locate the largest activation surface

The next question is where the peak comes from.

The fastest way to waste time is to tune recompute when the real culprit is a cache or optimizer burst.
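One way to answer the "where does the peak come from" question is to tag what is live at the peak and rank the totals. The sketch below assumes you already have a list of (category, bytes) records from whatever memory snapshot tool the run uses; the record format and the category names are hypothetical.

```python
from collections import defaultdict


def rank_peak_surfaces(records, top_k=3):
    """records: iterable of (category, nbytes) pairs live at the peak.

    Returns up to top_k tuples of (category, total_bytes, share_of_peak),
    largest first.
    """
    totals = defaultdict(int)
    for category, nbytes in records:
        totals[category] += nbytes
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (category, total, total / grand_total if grand_total else 0.0)
        for category, total in ranked[:top_k]
    ]


# Hypothetical peak contents, in arbitrary byte units.
peak_records = [
    ("attention_workspace", 6),
    ("moe_scratch", 10),
    ("optimizer_state", 4),
    ("moe_scratch", 5),
    ("kv_cache", 1),
]
top_surfaces = rank_peak_surfaces(peak_records)
```

If the top entry is not an activation surface, recompute tuning is unlikely to be the right first move.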

Step 3: apply the cheapest structural fix first

Once the peak surface is known, fix it in order of cost.

Allocator-level fixes

If fragmentation is the issue, use allocator settings that favor expandable segments and avoid unnecessarily aggressive segment splitting. These changes are low-risk and should be confirmed before reshaping the model.
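As a concrete example, and assuming the run uses the PyTorch CUDA caching allocator, these settings are passed through the `PYTORCH_CUDA_ALLOC_CONF` environment variable as comma-separated `key:value` pairs. The helper names below are hypothetical. The variable must be set before the process initializes the CUDA allocator, so in practice it belongs in the launcher environment.

```python
import os


def build_alloc_conf(expandable_segments=True, max_split_size_mb=None) -> str:
    """Build a PYTORCH_CUDA_ALLOC_CONF value.

    expandable_segments is the first-line fragmentation setting;
    max_split_size_mb is left unset unless explicitly requested.
    """
    parts = []
    if expandable_segments:
        parts.append("expandable_segments:True")
    if max_split_size_mb is not None:
        if max_split_size_mb <= 0:
            raise ValueError("max_split_size_mb must be positive")
        parts.append(f"max_split_size_mb:{int(max_split_size_mb)}")
    return ",".join(parts)


def apply_alloc_conf(conf: str, env=os.environ) -> None:
    """Write the setting into an environment mapping.

    Only takes effect for processes that start (or initialize CUDA)
    after this point.
    """
    env["PYTORCH_CUDA_ALLOC_CONF"] = conf


default_conf = build_alloc_conf()
fallback_conf = build_alloc_conf(max_split_size_mb=512)  # 512 is a placeholder value
```

Keeping the string in one helper makes it easy to log the exact allocator configuration next to each OOM receipt.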

Activation-memory fixes

If attention dominates, use selective recompute or reduce the set of layers using the heaviest attention backend.

If MoE dominates, prefer selective expert recompute over full-block checkpointing. Replaying dispatch and collective-heavy paths is usually too expensive.

If recurrent or Mamba-style blocks dominate, look for narrow in-module recompute before wrapping the whole block.

Optimizer-state fixes

If the burst appears on the optimizer step, verify that optimizer state is actually partitioned as intended. Silent regressions here can multiply the optimizer footprint by the data-parallel degree.
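The size of that regression is simple arithmetic. The sketch below assumes an Adam-style optimizer holding two fp32 moments per parameter (8 bytes per parameter) and ignores master weights and gradients; the function name and the 4B-parameter, 8-rank example are illustrative.

```python
def optimizer_state_bytes_per_rank(num_params: int,
                                   dp_degree: int,
                                   partitioned: bool,
                                   state_bytes_per_param: int = 8) -> int:
    """Optimizer-state bytes held by one data-parallel rank.

    state_bytes_per_param=8 assumes two fp32 moments per parameter.
    With partitioning, each rank holds 1/dp_degree of the state
    (rounded up); without it, every rank holds the full state.
    """
    if dp_degree < 1:
        raise ValueError("dp_degree must be >= 1")
    total = num_params * state_bytes_per_param
    if partitioned:
        return -(-total // dp_degree)  # ceiling division
    return total


params = 4_000_000_000
sharded = optimizer_state_bytes_per_rank(params, dp_degree=8, partitioned=True)
replicated = optimizer_state_bytes_per_rank(params, dp_degree=8, partitioned=False)
# sharded is 4e9 bytes per rank; replicated is 32e9 bytes per rank.
```

Comparing the measured optimizer-step peak against both numbers is a quick way to tell which regime the run is actually in.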

Cache and serving fixes

If a long-context evaluation or serving path shares the node, cap the KV cache explicitly. Unbounded cache growth is one of the most avoidable OOM sources in mixed workloads.
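A cap is easier to defend when it is derived from a byte budget. The sketch below uses the standard per-token size for a cache that stores separate K and V tensors per layer (multi-head or grouped-query attention); compressed layouts such as MLA would need a different formula. The model dimensions and the 8 GiB budget are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per cached token.

    The leading factor 2 covers keys and values; bytes_per_elem=2
    assumes a 16-bit cache dtype.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(budget_bytes: int, num_layers: int, num_kv_heads: int,
                      head_dim: int, bytes_per_elem: int = 2) -> int:
    """Largest total token count (summed over sequences) within the budget."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim,
                                   bytes_per_elem)
    return budget_bytes // per_token


# Hypothetical geometry: 32 layers, 8 KV heads, head_dim 128, 16-bit cache.
per_token = kv_bytes_per_token(32, 8, 128)           # 131072 bytes (128 KiB)
token_cap = max_cached_tokens(8 * (1 << 30), 32, 8, 128)  # 65536 tokens
```

The cap is expressed in total cached tokens, so it bounds batch size times context length rather than either one alone.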

Step 4: verify buffer reuse and temporary sizing

Large scratch buffers are often supposed to be reused. When reuse breaks because shapes drift between calls, every block allocates a fresh buffer and the run drowns in temporaries.

That is especially important for MoE dispatch scratch and attention workspace. Check whether metadata stays shape-stable across layers and whether temporary buffers are sized to actual per-rank token counts rather than pessimistic worst-case maxima.
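Both checks can be rehearsed without a GPU. The sketch below is a toy model, not the MegaCpp scratch allocator: `ScratchTracker` only counts how often a named buffer would need a fresh allocation, and `dispatch_scratch_bytes` assumes one hidden-size vector per routed (token, expert) pair. All names and sizes are hypothetical.

```python
class ScratchTracker:
    """Counts fresh allocations for named scratch buffers.

    exact_shape_match=True models a reuse path that only hits when the
    requested size equals the held size, so any shape drift misses.
    exact_shape_match=False models capacity-based reuse, which hits
    whenever the held buffer is large enough.
    """

    def __init__(self, exact_shape_match: bool):
        self.exact_shape_match = exact_shape_match
        self.capacity = {}          # buffer name -> bytes currently held
        self.fresh_allocations = 0

    def request(self, name: str, nbytes: int) -> bool:
        """Return True if the existing buffer was reused."""
        held = self.capacity.get(name)
        if held is None:
            reusable = False
        elif self.exact_shape_match:
            reusable = held == nbytes
        else:
            reusable = held >= nbytes
        if not reusable:
            self.capacity[name] = nbytes
            self.fresh_allocations += 1
        return reusable


def dispatch_scratch_bytes(tokens_on_rank: int, top_k: int, hidden_dim: int,
                           bytes_per_elem: int = 2) -> int:
    """Assumed sizing: one hidden vector per routed (token, expert) pair."""
    return tokens_on_rank * top_k * hidden_dim * bytes_per_elem


drifting_sizes = [1000, 1010, 990, 1005]   # per-layer requests with shape drift
exact = ScratchTracker(exact_shape_match=True)
by_capacity = ScratchTracker(exact_shape_match=False)
for size in drifting_sizes:
    exact.request("moe_dispatch", size)
    by_capacity.request("moe_dispatch", size)

actual = dispatch_scratch_bytes(tokens_on_rank=3000, top_k=2, hidden_dim=4096)
worst_case = dispatch_scratch_bytes(tokens_on_rank=8192, top_k=2, hidden_dim=4096)
```

With drifting sizes, the exact-match tracker allocates on every request while the capacity-based one settles after the largest request. The gap between `actual` and `worst_case` is the memory that pessimistic sizing holds without using.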

Attention workspaces have the same problem on packed or ragged batches: the temporary can scale with true valid-token counts plus backend bookkeeping such as split-k or tile metadata, not only with one padded B x T headline. If the receipt records only the padded shape, it can hide the thing that actually spiked. "Packed rows as the real training contract" is the local continuation for that boundary.
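A receipt that records both views is cheap to build. The sketch below assumes each packed row is described by the list of document lengths inside it; the `batch_receipt` name and the field names are hypothetical.

```python
def batch_receipt(rows, padded_len: int) -> dict:
    """rows: list of lists; each inner list holds the lengths of the
    documents packed into one row. padded_len is the row length T.
    """
    for lengths in rows:
        if sum(lengths) > padded_len:
            raise ValueError("row holds more tokens than padded_len")
    num_rows = len(rows)
    valid_tokens = sum(sum(lengths) for lengths in rows)
    padded_tokens = num_rows * padded_len
    return {
        "padded_shape": (num_rows, padded_len),           # the B x T headline
        "padded_tokens": padded_tokens,
        "valid_tokens": valid_tokens,                     # what kernels actually see
        "num_segments": sum(len(lengths) for lengths in rows),
        "fill_ratio": valid_tokens / padded_tokens if padded_tokens else 0.0,
    }


dense = batch_receipt([[512, 512], [1024]], padded_len=1024)
sparse = batch_receipt([[100, 200], [300]], padded_len=1024)
# Same padded shape, very different valid-token traffic.
```

Logging `valid_tokens` and `num_segments` next to the padded shape makes it possible to line up a workspace spike with the batch that caused it.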

Step 5: keep the debug loop minimal

A good OOM debug loop is short.

  1. Run one step with allocator and memory debugging enabled.
  2. Read retry counts, inactive split bytes, and top allocation sites.
  3. Decide whether the failure is fragmentation, activations, workspace, optimizer, or cache.
  4. Change one class of fix at a time.
  5. Re-run and compare the same signals.

Per-step memory tracing is usually too expensive to leave on. A targeted early-step snapshot is usually enough to identify the dominant pressure source. When you do sample allocator counters, do it at stable step boundaries such as the end of a microbatch or the optimizer step so you compare like with like instead of chasing transient kernel noise.
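The "compare the same signals" step is easier when each sample is a small fixed record tied to a phase boundary. The sketch below is one possible shape for that record; the class, field, and function names are hypothetical, and the phase labels are whatever the training loop already uses.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class MemorySnapshot:
    phase: str                          # e.g. "end_of_microbatch", "optimizer_step"
    allocated_bytes: int
    reserved_bytes: int
    inactive_split_bytes: int
    alloc_retries: int
    oom_count: int
    largest_failed_request_bytes: int


def compare(before: MemorySnapshot, after: MemorySnapshot) -> dict:
    """Per-field deltas (after minus before).

    Refuses to compare snapshots from different phase boundaries, since
    that would mix stable readings with transient kernel noise.
    """
    if before.phase != after.phase:
        raise ValueError("snapshots must come from the same phase boundary")
    b, a = asdict(before), asdict(after)
    return {key: a[key] - b[key] for key in b if key != "phase"}


baseline = MemorySnapshot("optimizer_step", 100, 130, 20, 12, 1, 8)
after_fix = MemorySnapshot("optimizer_step", 100, 112, 4, 0, 0, 0)
delta = compare(baseline, after_fix)
```

A change that shrinks reserved bytes, inactive split bytes, and retries while leaving allocated bytes flat is the signature of a fragmentation fix that worked.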

What we generally keep

What we avoid by default

Fast triage checklist

| Signal | Likely issue | First move |
| --- | --- | --- |
| High allocator retries and large inactive split memory | Fragmentation | Favor expandable segments; reduce peak burst slightly |
| Largest allocation inside attention workspace | Attention backend pressure | Lower heavy-backend usage or sequence-related pressure |
| Largest allocation inside MoE scratch | Dispatch or expert temporary growth | Verify scratch reuse and actual token-based sizing |
| Peak arrives on optimizer step | Optimizer-state burst | Verify optimizer partitioning |
| Peak arrives during co-located eval or serving | KV-cache growth | Apply an explicit KV cap |
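The checklist can also be kept as a lookup next to the training scripts. The sketch below encodes the same five rows; the input names, the string labels for allocation sites and phases, the rule order, and the fallback row are assumptions added for illustration.

```python
def triage(high_retries: bool,
           large_inactive_split: bool,
           largest_alloc_site: str = None,
           peak_phase: str = None):
    """Return (likely_issue, first_move) for the observed signals.

    Rules are checked in checklist order; the order itself is an
    assumption, as is the final fallback.
    """
    if high_retries and large_inactive_split:
        return ("Fragmentation",
                "Favor expandable segments; reduce peak burst slightly")
    if largest_alloc_site == "attention_workspace":
        return ("Attention backend pressure",
                "Lower heavy-backend usage or sequence-related pressure")
    if largest_alloc_site == "moe_scratch":
        return ("Dispatch or expert temporary growth",
                "Verify scratch reuse and actual token-based sizing")
    if peak_phase == "optimizer_step":
        return ("Optimizer-state burst", "Verify optimizer partitioning")
    if peak_phase == "eval_or_serving":
        return ("KV-cache growth", "Apply an explicit KV cap")
    # Not part of the checklist: a placeholder for anything unmatched.
    return ("Unclassified", "Collect a snapshot before changing anything")


issue, first_move = triage(False, False, largest_alloc_site="moe_scratch")
```

Returning a single first move per call matches the rule of changing one class of fix at a time.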

The point of the checklist is not to be exhaustive. It is to keep you from treating every OOM as the same bug.

FAQ

Why can attention workspace spike even when padded sequence length did not change?

Because some backends size workspace from real valid-token traffic plus extra split-k or tile metadata rather than from one flat padded-shape number. Two batches with the same outer B x T can therefore ask for different temporary buffers once the token distribution changes inside that shape. "Packed rows as the real training contract" is the local continuation for that seam.

What should the allocator snapshot record before changing the batch shape?

Keep the snapshot small: live allocated bytes, reserved bytes, inactive split bytes, allocation retries, OOM count, largest failed request, and the phase boundary where the sample was taken. That is enough to tell whether the next move should be allocator tuning, temporary-buffer reduction, optimizer partitioning, or a real batch/sequence cut.

When should max_split_size_mb enter the OOM debug loop?

Treat it as a fallback, not as the first allocator knob. If expandable_segments is available and the evidence says fragmentation, use that path first; reach for max_split_size_mb only when the native allocator still shows rising inactive split bytes and allocation retries. Tune it near the largest recurring temporary you are trying to preserve, rather than copying a small value from another workload. "A Memory-Budget Anatomy for One Specialist on H200:8" keeps the same rule in the capacity-planning checklist.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Debugging

A grounded guide for debugging Modal failures in MegaCpp: cold starts, multi-GPU hangs, image drift, detached collector issues, and volume or…

Training

What actually sets training speed on H200 in public MegaCpp reporting: compile warmup policy, block mix, memory shape, and why local wins often fail…

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…
