Modal Multi-GPU Pain and the Fixes That Actually Landed
NCCL topology, GPU isolation, eviction and OOM-kill behavior, observability gaps, and the guide we follow when a Modal multi-GPU job hangs on the first forward pass.

Single-GPU Modal is mostly straightforward. Multi-GPU Modal is where the failure modes become expensive. The useful distinction in our records is not "Modal was flaky" versus "Modal was fine." It is whether the lane failed at startup, at the first compiled collective, or because one rank died and the others only reported the NCCL consequence afterward. This post is the narrow runbook for that boundary.
That failure surface reads more clearly next to Modal benchmark receipts, Benchmarking the MegaCpp stack on Modal: multi-GPU lessons from rented boxes, and Modal debugging playbook.
Why MegaCpp cares about this
Modal is where we run ad-hoc multi-GPU smokes and bounded benchmark waves between longer runs on owned H200 systems and TPU slices. If the 8-GPU surface cannot tell us quickly whether a new block or kernel path is viable, it stops being a useful test surface and starts burning budget.
That is why we separate "multi-GPU hang" into a few smaller questions:
- did the launch reach the real training lane at all
- did the first compiled collective diverge across ranks
- did one rank die first and leave the others to time out
The cold-start side of the same story lives in Modal image and cold-start. This article starts once the multi-GPU path itself is the problem.
What MegaCpp uses today
First touch definition: on this Modal lane, H200:8 means eight GPUs attached to one host for one container. It is a single-host surface, not a multi-node claim.
That single-host boundary matters because it keeps one part of the diagnosis simple. When a run hangs here, the first question is usually compile timeline or rank death, not cross-node fabric behavior.
The working hardening is intentionally narrow:
- cap Triton staging so autotune does not ask for an invalid shared-memory budget
- keep structured per-rank failure output so one dead rank is visible as more than a generic timeout
- keep distributed debug at INFO, not the heavier validation mode that changes bootstrap behavior
- relax or disable the compile-phase heartbeat so long warmup does not look like a dead worker
- turn FSDP autotune off on the fragile lane where the autotune subprocess competes for the same memory budget as the model
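As a minimal sketch of how part of that hardening could be pinned in one place, the helper below builds the environment overrides for the lane. It assumes the debug level in question is PyTorch's `TORCH_DISTRIBUTED_DEBUG` variable, which the text does not name explicitly. The `MEGACPP_*` keys are hypothetical placeholders for the heartbeat and FSDP autotune switches, not real PyTorch or Modal flags, and the Triton staging cap is left out because the source does not name the knob.

```python
import os

# Levels that torch.distributed documents for TORCH_DISTRIBUTED_DEBUG.
VALID_DEBUG_LEVELS = {"OFF", "INFO", "DETAIL"}


def fragile_lane_env(debug_level: str = "INFO") -> dict:
    """Build env overrides for the fragile multi-GPU lane (illustrative only)."""
    level = debug_level.upper()
    if level not in VALID_DEBUG_LEVELS:
        raise ValueError(f"unknown TORCH_DISTRIBUTED_DEBUG level: {debug_level!r}")
    return {
        "TORCH_DISTRIBUTED_DEBUG": level,
        # Hypothetical project-level switches, named here only for illustration.
        "MEGACPP_COMPILE_HEARTBEAT": "off",
        "MEGACPP_FSDP_AUTOTUNE": "0",
    }


def apply_env(overrides: dict) -> None:
    """Apply overrides; call this before torch.distributed is initialized."""
    os.environ.update(overrides)
```

Keeping the defaults in one function makes it harder for a single rank to start with a different debug level than its peers.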
The data path is equally simple. Bucket-backed storage is the cold import surface; the hot training path reads copied shards from a Volume. That keeps the multi-GPU run from turning into a filesystem benchmark.
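The copy step between those two surfaces can be sketched with plain filesystem calls. This is an illustration under assumptions: `stage_shards`, the `*.bin` pattern, and both directory arguments are hypothetical names, and the sketch treats the bucket mount and the Volume as ordinary directories inside the container.

```python
import shutil
from pathlib import Path


def stage_shards(bucket_dir: str, volume_dir: str, pattern: str = "*.bin") -> list:
    """Copy shards that are missing or size-mismatched; return the copied names."""
    src, dst = Path(bucket_dir), Path(volume_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for shard in sorted(src.glob(pattern)):
        target = dst / shard.name
        if target.exists() and target.stat().st_size == shard.stat().st_size:
            continue  # already staged with the same size; skip the copy
        partial = target.with_name(target.name + ".partial")
        shutil.copyfile(shard, partial)
        partial.replace(target)  # rename so readers never see a half-written shard
        copied.append(shard.name)
    return copied
```

Running the staging once before the heavy launch means the eight workers read only from the copied shards during training.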
The same startup rule applies to snapshots. A snapshot can reduce import-time overhead. It does not replace a real warmed distributed compile cache, because the first-forward collective boundary is still about whether ranks reach the same compiled state at roughly the same time.
How MegaCpp currently uses this
The live blocker on this lane is still cold-cache compile divergence. Different ranks can spend different wall time inside first-use compilation, which means some reach the first collective while others are still compiling. That looks like a generic NCCL hang if the receipt is weak and like distributed compile divergence if the receipt is good enough.
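A toy model makes the mechanism concrete. It assumes every rank starts compiling at the same moment, arrives at the first collective when its own compile finishes, and then waits for the slowest rank. The function name and the numbers in the usage note are illustrative, not measurements from our runs.

```python
def ranks_that_time_out(compile_seconds: list, timeout_seconds: float) -> list:
    """Return ranks whose wait at the first collective exceeds the timeout."""
    last_arrival = max(compile_seconds)
    return [
        rank
        for rank, arrived in enumerate(compile_seconds)
        if last_arrival - arrived > timeout_seconds  # time spent waiting on the slowest rank
    ]
```

With `compile_seconds=[40, 45, 700, 50]` and a 600-second timeout, ranks 0, 1, and 3 all exceed the timeout while rank 2, the one still compiling, reports nothing. From the outside that is indistinguishable from a transport hang unless per-rank timing was recorded.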
Our practical fix stack is smaller than the upstream idea space:
- require a real multi-GPU cache seed before the heavy launch
- keep a last-known-good seed in the image for fresh deployments
- keep a reduced-complexity diagnostic preset so the hang can be separated from broader model complexity
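The "real cache seed" requirement can be approximated with a presence-and-size check before launch. This is a sketch under assumptions: `cache_seed_status` and `min_bytes` are hypothetical, and total size is only a rough proxy for whether the seed covers the intended launch shape.

```python
import os


def cache_seed_status(cache_dir: str, min_bytes: int) -> dict:
    """Summarize whether a compile-cache seed exists and is plausibly complete."""
    total, files = 0, 0
    # os.walk yields nothing for a missing directory, so counts stay at zero.
    for root, _dirs, names in os.walk(cache_dir):
        for name in names:
            total += os.path.getsize(os.path.join(root, name))
            files += 1
    return {
        "present": os.path.isdir(cache_dir),
        "files": files,
        "bytes": total,
        "usable": files > 0 and total >= min_bytes,
    }
```

Writing this status into the run record at startup gives later triage a starting cache state to compare against.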
The useful receipt seam is the first-forward one. A durable record that keeps distributed mode, GPU count, starting cache state, and per-rank first-forward timing does far more for this lane than another generic timeout log.
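One possible shape for such a record is sketched below. The class and field names are assumptions chosen to mirror the four items listed above; they are not the checked-in receipt schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FirstForwardReceipt:
    distributed_mode: str   # illustrative label, for example "fsdp"
    gpu_count: int
    cache_state: str        # illustrative label, for example "cold" or "seeded"
    first_forward_seconds: dict = field(default_factory=dict)  # rank -> seconds

    def skew_seconds(self) -> float:
        """Gap between the slowest and fastest rank that reported a time."""
        times = list(self.first_forward_seconds.values())
        return max(times) - min(times) if times else 0.0

    def missing_ranks(self) -> list:
        """Ranks that never reported a first-forward time."""
        return [r for r in range(self.gpu_count) if r not in self.first_forward_seconds]

    def to_json(self) -> str:
        record = asdict(self)
        record["skew_seconds"] = self.skew_seconds()
        record["missing_ranks"] = self.missing_ranks()
        return json.dumps(record, sort_keys=True)
```

A large skew with no missing ranks points toward compile divergence; a missing rank points toward a rank that died or never finished compiling.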
The checked-in compile runtime env sample, PP compile warmup sample, GPU profile receipt sample, and CUDA graph env defaults sample are the compact public-safe examples of that contract.
Ablations and what we kept
We did not keep the story that this is simply "an NCCL bug." The useful failure class is compile divergence at the first compiled collective, with rank death as the other main branch.
We did not keep sequential rank-by-rank warmup as the fix. It sounds attractive and does not match the real distributed graph boundary that the launch actually exercises.
We also did not keep the idea that bucket-mounted storage can stay on the hot path for eight active workers. It is a good import surface and a weak training surface.
What survived is narrower:
- warm the real multi-GPU cache for the real launch shape
- keep the hot data path on a Volume
- preserve a first-forward receipt rich enough to separate compile skew from rank death
- move the lane to owned hardware when the problem stops being a fair single-host Modal question
Guide: when a Modal multi-GPU job hangs
When an 8-GPU job goes quiet, we use this order:
- Read the last useful log or app-state line and ask whether some ranks were still compiling while others were already near a collective.
- Pull the benchmark record and compare distributed mode, starting cache state, and any per-rank first-forward timing.
- Check the per-rank failure dump if one exists. One real traceback plus several timeouts is a rank-death story, not a transport mystery.
- Check whether the cache seed was actually present and large enough to represent the intended launch.
- If the evidence still points at compile divergence, warm the cache properly and rerun.
- If the signal is still ambiguous, move to the reduced-complexity diagnostic preset.
The point of that order is to stop treating every silent job as the same category of failure.
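The order above can be compressed into a small decision function. This is a sketch, not our tooling: `classify_hang`, the status strings, and the 60-second skew threshold are all assumptions made for illustration.

```python
from typing import Optional


def classify_hang(rank_status: dict, skew_seconds: Optional[float],
                  cache_seeded: bool, skew_threshold: float = 60.0) -> str:
    """Map receipt evidence to the next action in the triage order."""
    # One real traceback among timeouts is a rank-death story.
    tracebacks = [rank for rank, status in rank_status.items() if status == "traceback"]
    if tracebacks:
        return f"rank_death: inspect rank {min(tracebacks)} first"
    # A missing seed or a large first-forward skew points at compile divergence.
    if not cache_seeded:
        return "compile_divergence_suspected: warm the cache and rerun"
    if skew_seconds is not None and skew_seconds > skew_threshold:
        return "compile_divergence_suspected: warm the cache and rerun"
    return "ambiguous: rerun with the reduced-complexity diagnostic preset"
```

The value is less in the function than in forcing each silent job through the same questions in the same order.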
Frequently asked questions
Why can an empty compile cache look like a hang instead of a slow launch?
Why not turn on distributed GEMM autotune as the main sync fix?
Why keep distributed debug at INFO instead of the heavier validation mode?
What do you use when INFO is not enough?
DETAIL. That keeps collective timing and timeout evidence available without widening the bootstrap surface that already makes this lane fragile. If the lane has narrowed that far, NCCL and collective hangs and Distributed debugging notes are the next local reads.
Where should the distributed and compile environment flags be set?
Why is this described as compile divergence rather than generic NCCL pain?
Why are Memory Snapshots not the main fix for this lane?
When should this stop being a Modal problem and move to owned hardware?
What does a Modal H200:8 result prove and not prove?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Modal: A container-first GPU execution surface with explicit image, GPU, Volume, Secret, and detached-launch primitives that MegaCpp uses for isolated benchmark and validation lanes.
NCCL: NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.
H200: NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
Memory Snapshots: Modal's snapshot restore surface for import-time or initialization-time startup state. In these articles it narrows cold-start tax but does not replace warm compile caches or host-local storage semantics.
FSDP2: PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.
Pipeline parallelism (PP): cuts the model by depth, so each GPU gets a contiguous range of layers. 32 transformer blocks on 8x H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... on the forward pass, and gradients flow back in the reverse order on the backward pass. Cost: a pipeline bubble whose share of step time shrinks roughly as 1/microbatches, so you need many microbatches per step to amortize it. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is small stage-boundary activations, not full weight or gradient tensors.
CUDA graphs: CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.
CUDA: NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
Volume: Modal's writable persistence surface for cache, checkpoints, copied shards, and other mutable state that must survive container turnover.
Bucket mount: Modal's object-storage mount surface for large read-mostly artifacts such as datasets and wheel mirrors; MegaCpp keeps hot writable state on Volumes instead.
Detached launch: A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.