Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 5 min read • David Gornshtein

OOM

H200

Memory

Debugging

Training

OOM Debugging Playbook for H200 Training Runs

A practical playbook for triaging H200 out-of-memory failures: distinguish fragmentation from true exhaustion, isolate the largest activation surfaces, and apply the cheapest fix first.

Out-of-memory failures on modern accelerators are often diagnosed too loosely. "Needs a smaller batch" is only one of several possibilities. In practice, most H200 OOMs fall into one of four buckets:

True activation pressure: the model really does not fit at the current geometry.
Fragmentation: the allocator has enough total free memory but cannot serve the next large request cleanly.
Workspace spikes: a fused kernel, attention backend, or MoE path asks for a temporary buffer much larger than usual.
Optimizer or cache bursts: the step fits most of the time, then a later phase produces a short-lived peak.

The playbook is to identify which bucket you are in before changing everything at once.

Step 1: separate fragmentation from real exhaustion

Look at allocator retry counts, inactive split bytes, and the largest failed allocation request. If retries are climbing and inactive split memory is large, the run is fragmentation-bound. If retries are low but the requested allocation itself is too large relative to free space, the run is truly out of memory.

That distinction matters because the fixes are different. Fragmentation problems usually respond to allocator configuration or a slightly smaller peak burst. True exhaustion requires reducing retained state or batch geometry.
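The split above can be sketched as a small classifier. This is a minimal sketch, assuming a PyTorch-style caching allocator whose counters arrive as a flat dict; the key names follow what `torch.cuda.memory_stats()` reports, while the `classify_oom` helper and both thresholds are hypothetical and should be tuned per workload.

```python
def classify_oom(stats, requested_bytes, free_bytes,
                 retry_threshold=1, split_fraction=0.10):
    """Rough split between fragmentation and true exhaustion.

    stats: flat dict of allocator counters (assumed PyTorch-style keys).
    requested_bytes: size of the allocation that failed.
    free_bytes: device memory still free when the request failed.
    The thresholds are illustrative assumptions, not measured values.
    """
    retries = stats.get("num_alloc_retries", 0)
    inactive_split = stats.get("inactive_split_bytes.all.current", 0)
    reserved = stats.get("reserved_bytes.all.current", 0)

    # Fragmentation: retries are climbing and a large share of reserved
    # memory sits in inactive split blocks.
    split_heavy = reserved > 0 and inactive_split / reserved >= split_fraction
    if retries >= retry_threshold and split_heavy:
        return "fragmentation"

    # True exhaustion: few retries, and the request simply exceeds free space.
    if retries < retry_threshold and requested_bytes > free_bytes:
        return "exhaustion"

    return "unclear"
```

An "unclear" result is a prompt to collect another sample at a stable step boundary rather than a verdict.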

A quick readback is reserved versus allocated memory. If reserved keeps widening away from live allocated bytes while the working set is otherwise flat, the allocator tail is becoming the story rather than model geometry. On the H200 side, allocator growth (expandable segments) comes first, and split-size tuning is a follow-on move only if retries and inactive split bytes still climb after the safer fragmentation fixes. "Why a 4B-8B model fills an H200 and still OOMs" and "A Memory-Budget Anatomy for One Specialist on H200:8" are the local companions for that split.
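The reserved-versus-allocated readback can be turned into a simple trend check. The sketch below assumes samples are taken at the same step boundary each time; the helper names and the 256 MiB tolerance are hypothetical.

```python
def reserved_gap_trend(samples):
    """samples: list of (allocated_bytes, reserved_bytes), oldest first.

    Returns (gap_growth, live_growth): how much the reserved-minus-allocated
    gap grew, and how much live allocated bytes grew, between the first and
    last sample.
    """
    first_alloc, first_reserved = samples[0]
    last_alloc, last_reserved = samples[-1]
    gap_growth = (last_reserved - last_alloc) - (first_reserved - first_alloc)
    live_growth = last_alloc - first_alloc
    return gap_growth, live_growth


def allocator_tail_is_growing(samples, tolerance_bytes=256 << 20):
    """True when reserved pulls away from allocated while the working set is flat."""
    gap_growth, live_growth = reserved_gap_trend(samples)
    return gap_growth > tolerance_bytes and abs(live_growth) <= tolerance_bytes
```

If the working set itself is growing, this check stays false on purpose: that case points back at model geometry, not at the allocator.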

Step 2: locate the largest activation surface

The next question is where the peak comes from.

If the largest temporary is attention workspace, reduce the number of layers using the most memory-hungry attention backend or lower the sequence-related pressure that drives workspace size.
If the largest temporary is inside MoE, check whether dispatch scratch and expert activations are being reused efficiently.
If the largest burst appears during the optimizer step, focus on optimizer partitioning rather than activation checkpointing.
If the peak comes from evaluation or serving mixed into the same process, check KV-cache growth before touching training settings.
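The four cases above amount to a lookup on the largest temporary near the peak. A minimal sketch, assuming you already have a list of `(surface, bytes)` records for the top allocations; the surface labels and the `dominant_surface` helper are hypothetical names chosen for illustration.

```python
# Hypothetical surface labels mapped to the first thing to inspect.
FIRST_LOOK = {
    "attention_workspace": "reduce heavy-backend layers or sequence-related pressure",
    "moe": "check dispatch scratch and expert activation reuse",
    "optimizer": "check optimizer partitioning, not activation checkpointing",
    "kv_cache": "check KV-cache growth before touching training settings",
}


def dominant_surface(records):
    """records: iterable of (surface, bytes) for the largest temporaries at the peak.

    Returns (surface, bytes, first_look) for the single largest record.
    """
    surface, size = max(records, key=lambda record: record[1])
    hint = FIRST_LOOK.get(surface, "unclassified: inspect the allocation site")
    return surface, size, hint
```

Keying on the single largest record, rather than a sum per surface, mirrors the "largest temporary" framing; summing per surface is a reasonable alternative when many mid-sized buffers dominate.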

The fastest way to waste time is to tune recompute when the real culprit is a cache or optimizer burst.

Step 3: apply the cheapest structural fix first

Once the peak surface is known, fix it in order of cost.

Allocator-level fixes

If fragmentation is the issue, use allocator settings that favor expandable segments and avoid unnecessarily aggressive segment splitting. These changes are low-risk and should be confirmed before reshaping the model.
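As a config sketch, assuming the run uses PyTorch's CUDA caching allocator (the article does not name the framework): the allocator reads comma-separated `key:value` options from the `PYTORCH_CUDA_ALLOC_CONF` environment variable, which has to be set before the process first initializes CUDA. The two helper functions are hypothetical.

```python
import os


def build_alloc_conf(expandable_segments=True, max_split_size_mb=None):
    """Build a PYTORCH_CUDA_ALLOC_CONF value.

    max_split_size_mb is left unset by default, so split-size tuning stays
    a deliberate follow-on step rather than a default.
    """
    parts = []
    if expandable_segments:
        parts.append("expandable_segments:True")
    if max_split_size_mb is not None:
        parts.append(f"max_split_size_mb:{int(max_split_size_mb)}")
    return ",".join(parts)


def apply_alloc_conf(conf, env=os.environ):
    """Write the setting into an environment mapping (os.environ by default)."""
    env["PYTORCH_CUDA_ALLOC_CONF"] = conf
    return env
```

Passing a plain dict as `env` makes it easy to build the environment for a launcher subprocess without mutating the current process.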

Activation-memory fixes

If attention dominates, use selective recompute or reduce the set of layers using the heaviest attention backend.

If MoE dominates, prefer selective expert recompute over full-block checkpointing. Replaying dispatch and collective-heavy paths is usually too expensive.

If recurrent or Mamba-style blocks dominate, look for narrow in-module recompute before wrapping the whole block.

Optimizer-state fixes

If the burst appears on the optimizer step, verify that optimizer state is actually partitioned as intended. Silent regressions here can multiply the optimizer footprint by the data-parallel degree.
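The size of that regression is simple arithmetic. The sketch below assumes an Adam-style optimizer holding fp32 master weights plus two fp32 moments, so 12 bytes per parameter; that figure and the helper names are assumptions, not numbers from the MegaCpp stack.

```python
def optimizer_state_bytes_per_rank(num_params, dp_degree, partitioned,
                                   bytes_per_param=12):
    """Optimizer-state bytes held by one rank.

    bytes_per_param=12 assumes fp32 master weights + two fp32 Adam moments.
    When partitioned, state is split evenly across the data-parallel group.
    """
    total = num_params * bytes_per_param
    return total // dp_degree if partitioned else total


def regression_factor(num_params, dp_degree, bytes_per_param=12):
    """How much larger the per-rank footprint is if partitioning silently fails."""
    intended = optimizer_state_bytes_per_rank(num_params, dp_degree, True, bytes_per_param)
    actual = optimizer_state_bytes_per_rank(num_params, dp_degree, False, bytes_per_param)
    return actual / intended
```

For a 4B-parameter model on 8 data-parallel ranks, this gives 6 GB per rank when partitioned and 48 GB when replicated, which is the kind of jump that shows up only on the optimizer step.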

Cache and serving fixes

If a long-context evaluation or serving path shares the node, cap KV cache explicitly. Unbounded cache growth is one of the most avoidable OOM sources in mixed workloads.
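An explicit cap is easiest to reason about in tokens. The sketch below assumes a conventional per-head key/value layout (one K and one V vector per layer, KV head, and token); compressed layouts such as MLA-style caches will have a different per-token cost, and the helper names and example dimensions are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, tokens, dtype_bytes=2):
    """Bytes for a conventional KV cache holding `tokens` tokens in total.

    The leading factor of 2 covers keys and values; dtype_bytes=2 assumes
    a 16-bit cache dtype.
    """
    return 2 * num_layers * num_kv_heads * head_dim * tokens * dtype_bytes


def max_cached_tokens(cap_bytes, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Largest total token count (summed over sequences) that fits under the cap."""
    per_token = kv_cache_bytes(num_layers, num_kv_heads, head_dim, 1, dtype_bytes)
    return cap_bytes // per_token
```

With 32 layers, 8 KV heads, and a head dimension of 128, each cached token costs 128 KiB, so an 8 GiB cap admits 65,536 tokens across all co-located sequences.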

Step 4: verify buffer reuse and temporary sizing

Large scratch buffers are often supposed to be reused. When reuse breaks because shapes drift between calls, every block allocates a fresh buffer and the run drowns in temporaries.

That is especially important for MoE dispatch scratch and attention workspace. Check whether metadata stays shape-stable across layers and whether temporary buffers are sized to actual per-rank token counts rather than pessimistic worst-case maxima.
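To see how shape drift defeats reuse, here is a toy pool in plain Python. It is an illustration only, not the MegaCpp buffer manager: the `ScratchPool` class is hypothetical, and rounding requests up to a bucket is just one way to keep keys stable without jumping to a worst-case maximum.

```python
class ScratchPool:
    """Toy scratch pool that reuses a buffer only when the size key matches."""

    def __init__(self, bucket=None):
        self.bucket = bucket          # optional rounding granularity, in bytes
        self.buffers = {}             # size key -> buffer
        self.fresh_allocations = 0    # how many times reuse failed

    def _key(self, num_bytes):
        if self.bucket is None:
            return num_bytes          # exact-size key: any drift misses
        # Round up to the next multiple of the bucket size.
        return -(-num_bytes // self.bucket) * self.bucket

    def get(self, num_bytes):
        key = self._key(num_bytes)
        if key not in self.buffers:
            self.buffers[key] = bytearray(key)
            self.fresh_allocations += 1
        return self.buffers[key]


def count_allocations(pool, requests):
    """Replay a sequence of request sizes and report fresh allocations."""
    for size in requests:
        pool.get(size)
    return pool.fresh_allocations
```

Replaying the request sizes 4000, 4050, 3990, 4096 against an exact-key pool allocates four buffers, while a pool with a 1024-byte bucket allocates one and reuses it.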

Attention workspaces have the same problem on packed or ragged batches: the temporary can scale with true valid-token counts plus backend bookkeeping such as split-k or tile metadata, not only with one padded B x T headline. If the receipt records only padded shape, it can hide the thing that actually spiked. "Packed rows as the real training contract" is the local continuation for that boundary.
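A receipt that records both numbers might look like the sketch below. It assumes the temporary scales linearly with valid tokens times a hidden width and leaves out backend bookkeeping entirely; the `batch_receipt` helper and its field names are hypothetical.

```python
def batch_receipt(valid_lengths, padded_len, hidden, dtype_bytes=2):
    """Record padded shape and valid-token traffic for one batch.

    valid_lengths: number of real (non-padding) tokens in each row.
    The byte figures assume a buffer linear in tokens * hidden, which is a
    simplification: real backends add split-k or tile metadata on top.
    """
    batch = len(valid_lengths)
    valid_tokens = sum(valid_lengths)
    return {
        "padded_shape": (batch, padded_len),
        "valid_tokens": valid_tokens,
        "padded_bytes": batch * padded_len * hidden * dtype_bytes,
        "valid_bytes": valid_tokens * hidden * dtype_bytes,
    }
```

Two batches with the same padded shape of 2 x 4096 but valid lengths of [4096, 4096] and [4096, 512] report identical `padded_bytes` and very different `valid_bytes`, which is the difference a padded-only receipt would hide.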

Step 5: keep the debug loop minimal

A good OOM debug loop is short.

Run one step with allocator and memory debugging enabled.
Read retry counts, inactive split bytes, and top allocation sites.
Decide whether the failure is fragmentation, activations, workspace, optimizer, or cache.
Change one class of fix at a time.
Re-run and compare the same signals.

Per-step memory tracing is usually too expensive to leave on. A targeted early-step snapshot is usually enough to identify the dominant pressure source. When you do sample allocator counters, do it at stable step boundaries such as the end of a microbatch or the optimizer step so you compare like with like instead of chasing transient kernel noise.
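Comparing "the same signals" between two runs can be as small as a dictionary diff. The sketch assumes both samples are flat counter dicts with PyTorch-style key names (as reported by `torch.cuda.memory_stats()`) taken at the same step boundary; the `diff_snapshots` helper is hypothetical.

```python
# Counters compared between runs; key names assume a PyTorch-style allocator.
WATCHED = (
    "allocated_bytes.all.current",
    "reserved_bytes.all.current",
    "inactive_split_bytes.all.current",
    "num_alloc_retries",
    "num_ooms",
)


def diff_snapshots(before, after, keys=WATCHED):
    """Return after - before for each watched counter (missing keys count as 0)."""
    return {key: after.get(key, 0) - before.get(key, 0) for key in keys}
```

Because only one class of fix changes per iteration, a diff in which retries and inactive split bytes fall while allocated bytes stay flat is a fairly direct confirmation that an allocator-level fix worked.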

What we generally keep

Selective activation recompute as the default, rather than full checkpointing everywhere.
Partitioned optimizer state for larger training jobs.
Allocator settings that reduce fragmentation under bursty demand.
Explicit KV-cache limits on mixed training and evaluation workloads.
Shape-stable temporary-buffer reuse on MoE and attention-heavy paths.

What we avoid by default

Full checkpointing everywhere.
Over-provisioned temporary buffers sized to worst-case token counts when actual traffic is much lower.
Long-running, per-step memory tracing during normal training.
Mixing serving-style KV growth into a training node without an explicit cap.

Fast triage checklist

Signal	Likely issue	First move
High allocator retries and large inactive split memory	Fragmentation	Favor expandable segments; reduce peak burst slightly
Largest allocation inside attention workspace	Attention backend pressure	Lower heavy-backend usage or sequence-related pressure
Largest allocation inside MoE scratch	Dispatch or expert temporary growth	Verify scratch reuse and actual token-based sizing
Peak arrives on optimizer step	Optimizer-state burst	Verify optimizer partitioning
Peak arrives during co-located eval or serving	KV-cache growth	Apply an explicit KV cap

The point of the checklist is not to be exhaustive. It is to keep you from treating every OOM as the same bug.

Frequently asked questions

Why can attention workspace spike even when padded sequence length did not change?

Because some backends size workspace from real valid-token traffic plus extra split-k or tile metadata rather than from one flat padded-shape number. Two batches with the same outer B x T can therefore ask for different temporary buffers once the token distribution changes inside that shape. "Packed rows as the real training contract" is the local continuation for that seam.

What should the allocator snapshot record before changing the batch shape?

Keep the snapshot small: live allocated bytes, reserved bytes, inactive split bytes, allocation retries, OOM count, largest failed request, and the phase boundary where the sample was taken. That is enough to tell whether the next move should be allocator tuning, temporary-buffer reduction, optimizer partitioning, or a real batch/sequence cut.
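One way to pin that record down is a small frozen dataclass. This is a sketch under assumptions: the counter keys follow PyTorch's `torch.cuda.memory_stats()` naming, the largest failed request is passed in by the caller because it is not part of those counters, and the `AllocatorSnapshot` type is hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AllocatorSnapshot:
    phase: str                    # e.g. "end_of_microbatch" or "optimizer_step"
    allocated_bytes: int
    reserved_bytes: int
    inactive_split_bytes: int
    alloc_retries: int
    oom_count: int
    largest_failed_request: int   # supplied by the caller, e.g. from the OOM error


def snapshot_from_stats(stats, phase, largest_failed_request=0):
    """Build a snapshot from a flat counter dict (assumed PyTorch-style keys)."""
    return AllocatorSnapshot(
        phase=phase,
        allocated_bytes=stats.get("allocated_bytes.all.current", 0),
        reserved_bytes=stats.get("reserved_bytes.all.current", 0),
        inactive_split_bytes=stats.get("inactive_split_bytes.all.current", 0),
        alloc_retries=stats.get("num_alloc_retries", 0),
        oom_count=stats.get("num_ooms", 0),
        largest_failed_request=largest_failed_request,
    )
```

`asdict(snapshot)` gives a plain dict that can be logged as one line per sample, which keeps the record cheap enough to take at every phase boundary of a single debug step.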

When should max_split_size_mb enter the OOM debug loop?

Treat it as a fallback, not as the first allocator knob. If expandable_segments is available and the evidence says fragmentation, use that path first; reach for max_split_size_mb only when the native allocator still shows rising inactive split bytes and allocation retries. Tune it near the largest recurring temporary you are trying to preserve, rather than copying a small value from another workload. "A Memory-Budget Anatomy for One Specialist on H200:8" keeps the same rule in the capacity-planning checklist.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Debugging

A grounded guide for debugging Modal failures in MegaCpp: cold starts, multi-GPU hangs, image drift, detached collector issues, and volume or…

Grounding: Modal Debugging Guide for Training and Benchmark Failures

Training

What actually sets training speed on H200 in public MegaCpp reporting: compile warmup policy, block mix, memory shape, and why local wins often fail…

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…
