Why can an empty compile cache look like a hang instead of a slow launch?

Because different ranks do not finish first-use compile work at exactly the same moment. Under a fragile first collective, that timing skew can produce a deadlock-looking receipt instead of a clean "still warming up" phase.

Why not turn on distributed GEMM autotune as the main sync fix?

Distributed GEMM autotune is a real PyTorch knob, but it is not the default fix for this Modal H200 lane. It adds another cross-rank synchronization surface while Inductor is still benchmarking kernels, so we treat it as a controlled experiment after the cache seed and first-forward receipts are clean. The longer trade-off lives in "NCCL and collective hangs" and "The Compile-Time Tax We Accept for Runtime Speed."

Why keep distributed debug at INFO instead of the heavier validation mode?

Because on this lane the heavier mode changes bootstrap behavior enough to become part of the problem. The first goal is a truthful launch; deeper collective diagnosis comes afterward, and only if the lane has already narrowed that far.

What do you use when INFO is not enough?

We prefer the lighter NCCL timing and Flight Recorder path plus per-rank receipts before turning on DETAIL. That keeps collective timing and timeout evidence available without widening the bootstrap surface that already makes this lane fragile. If the lane has narrowed that far, "NCCL and collective hangs" and "Distributed debugging notes" are the next local reads.

Where should the distributed and compile environment flags be set?

Treat them as launch-time settings, not late toggles inside the training body. On Modal, that means setting image or launcher environment before worker code imports PyTorch or initializes a process group, so the same NCCL, TorchInductor, and distributed-debug contract is visible to every rank from bootstrap onward.
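
As a rough illustration of that launch-time rule, the sketch below exports one environment contract and refuses to run once PyTorch has already been imported. The variable names are real PyTorch and NCCL settings, but the values, the cache path, and the `apply_launch_env` helper are assumptions made for illustration, not the lane's checked-in configuration.

```python
import os
import sys

# Variable names are real PyTorch/NCCL settings; every value and path below
# is an illustrative assumption, not a recommended or checked-in default.
LAUNCH_ENV = {
    "TORCH_DISTRIBUTED_DEBUG": "INFO",             # INFO, not the heavier DETAIL mode
    "NCCL_DEBUG": "WARN",                          # NCCL's own log level
    "TORCHINDUCTOR_CACHE_DIR": "/cache/inductor",  # hypothetical cache mount
    "TORCHINDUCTOR_FX_GRAPH_CACHE": "1",           # reuse cached compiled graphs
    "TORCH_NCCL_TRACE_BUFFER_SIZE": "2000",        # turns on the Flight Recorder buffer
    "TORCH_NCCL_DUMP_ON_TIMEOUT": "1",             # dump the recorder on collective timeout
}


def apply_launch_env(env, environ=os.environ, loaded_modules=sys.modules):
    """Export launch-time settings, refusing to run once torch is imported."""
    if "torch" in loaded_modules:
        raise RuntimeError("launch env must be set before torch is imported")
    for key, value in env.items():
        environ[key] = value
    return dict(env)
```

On Modal, the same dictionary could instead be baked into the image with `Image.env`, so every rank inherits it before any worker code runs. The guard exists only to make a late toggle fail loudly instead of silently applying to some ranks and not others.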

Why is this described as compile divergence rather than generic NCCL pain?

Because the operational fix is different. A transport problem wants one class of evidence. A first-forward compile-timeline mismatch wants another. Calling them both "NCCL" hides the remedy.

Why are Memory Snapshots not the main fix for this lane?

Because they target startup overhead, not the real distributed compile boundary. They can help the launch feel lighter without solving the first-forward skew that matters most here.

When should this stop being a Modal problem and move to owned hardware?

When the diagnostic preset still fails after a proper warm-cache rerun, or when the question has turned into a fabric, persistence, or multi-node claim that the single-host Modal lane no longer represents honestly.

What does a Modal H200:8 result prove and not prove?

It proves single-host multi-GPU behavior on one machine. It does not prove cross-node transport behavior or anything about a different cluster surface.

Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 5 min read • David Gornshtein

Tags: Modal, Multi-GPU, NCCL, FSDP2, H200, B200, Runbook

Modal Multi-GPU Pain and the Fixes That Actually Landed

NCCL topology, GPU isolation, eviction and OOM-kill behavior, observability gaps, and the guide we follow when a Modal multi-GPU job hangs on the first forward pass.

Single-GPU Modal is mostly straightforward. Multi-GPU Modal is where the failure modes become expensive. The useful distinction in our records is not "Modal was flaky" versus "Modal was fine." It is whether the lane failed at startup, at the first compiled collective, or because one rank died and the others only reported the NCCL consequence afterward. This post is the narrow runbook for that boundary.

That failure surface reads more clearly next to "Modal benchmark receipts," "Benchmarking the MegaCpp stack on Modal: multi-GPU lessons from rented boxes," and "Modal debugging playbook."

Why MegaCpp cares about this

Modal is where we run ad-hoc multi-GPU smokes and bounded benchmark waves between longer runs on owned H200 systems and TPU slices. If the 8-GPU surface cannot tell us quickly whether a new block or kernel path is viable, it stops being a useful test surface and starts burning budget.

That is why we separate "multi-GPU hang" into a few smaller questions:

did the launch reach the real training lane at all
did the first compiled collective diverge across ranks
did one rank die first and leave the others to time out

The cold-start side of the same story lives in Modal image and cold-start. This article starts once the multi-GPU path itself is the problem.

What MegaCpp uses today

First touch definition: on this Modal lane, H200:8 means eight GPUs attached to one host for one container. It is a single-host surface, not a multi-node claim.

That single-host boundary matters because it keeps one part of the diagnosis simple. When a run hangs here, the first question is usually compile timeline or rank death, not cross-node fabric behavior.

The working hardening is intentionally narrow:

cap Triton staging so autotune does not ask for an invalid shared-memory budget
keep structured per-rank failure output so one dead rank is visible as more than a generic timeout
keep distributed debug at INFO, not the heavier validation mode that changes bootstrap behavior
relax or disable the compile-phase heartbeat so long warmup does not look like a dead worker
turn FSDP autotune off on the fragile lane where the autotune subprocess competes for the same memory budget as the model
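
The per-rank failure output in that list can be sketched with the standard library alone. This is a minimal illustration, assuming a torchrun-style launcher that sets `RANK`; the `run_with_rank_receipt` helper, the file naming, and the JSON fields are hypothetical, not the lane's actual format.

```python
import json
import os
import traceback
from pathlib import Path


def run_with_rank_receipt(train_fn, out_dir):
    """Run train_fn; on failure write a per-rank JSON dump, then re-raise."""
    # RANK is set by torchrun-style launchers; default to 0 for single-process runs.
    rank = int(os.environ.get("RANK", "0"))
    try:
        return train_fn()
    except Exception as exc:
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        dump = {
            "rank": rank,
            "error_type": type(exc).__name__,
            "message": str(exc),
            "traceback": traceback.format_exc(),
        }
        # One file per rank, so a single dead rank stands out from the timeouts.
        (out / f"rank{rank}_failure.json").write_text(json.dumps(dump, indent=2))
        raise
```

PyTorch ships a similar idea as the `record` decorator in `torch.distributed.elastic.multiprocessing.errors`; the sketch only shows the shape of the evidence, which is one real traceback tied to one rank.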

The data path is equally simple. Bucket-backed storage is the cold import surface; the hot training path reads copied shards from a Volume. That keeps the multi-GPU run from turning into a filesystem benchmark.
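
A minimal sketch of that cold-to-hot copy, assuming the bucket mount and the Volume both appear as ordinary directories inside the container. The `stage_shards` helper, the directory arguments, and the `*.bin` pattern are hypothetical.

```python
import shutil
from pathlib import Path


def stage_shards(cold_dir, hot_dir, pattern="*.bin"):
    """Copy shards missing from hot_dir; return the names copied this call."""
    cold, hot = Path(cold_dir), Path(hot_dir)
    hot.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(cold.glob(pattern)):
        dst = hot / src.name
        # Skip shards already staged with the same size; recopy partial ones.
        if dst.exists() and dst.stat().st_size == src.stat().st_size:
            continue
        tmp = dst.with_name(dst.name + ".tmp")
        shutil.copyfile(src, tmp)
        tmp.replace(dst)  # rename last, so readers never see a half-written shard
        copied.append(src.name)
    return copied
```

Running this once before the workers start keeps all eight ranks reading from the hot directory; a second call returns an empty list when nothing needs staging.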

The same startup rule applies to snapshots. A snapshot can reduce import-time overhead. It does not replace a real warmed distributed compile cache, because the first-forward collective boundary is still about whether ranks reach the same compiled state at roughly the same time.

How MegaCpp currently uses this

The live blocker on this lane is still cold-cache compile divergence. Different ranks can spend different wall time inside first-use compilation, which means some reach the first collective while others are still compiling. That looks like a generic NCCL hang if the receipt is weak and like distributed compile divergence if the receipt is good enough.

Our practical fix stack is smaller than the upstream idea space:

require a real multi-GPU cache seed before the heavy launch
keep a last-known-good seed in the image for fresh deployments
keep a reduced-complexity diagnostic preset so the hang can be separated from broader model complexity
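
The cache-seed requirement can be checked before the heavy launch with something like the sketch below. It assumes the compile cache is a plain directory tree; the `cache_seed_status` helper and the `min_bytes` threshold are hypothetical, and size is only a proxy for "represents the intended launch shape".

```python
from pathlib import Path


def cache_seed_status(cache_dir, min_bytes):
    """Report whether a compile-cache directory looks like a real seed."""
    root = Path(cache_dir)
    files = [p for p in root.rglob("*") if p.is_file()] if root.is_dir() else []
    total = sum(p.stat().st_size for p in files)
    return {
        "present": bool(files),
        "files": len(files),
        "bytes": total,
        "large_enough": total >= min_bytes,
    }
```

Recording this dictionary next to the run gives the later triage an honest starting cache state instead of an assumption that the seed was there.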

The useful receipt seam is the first-forward one. A durable record that keeps distributed mode, GPU count, starting cache state, and per-rank first-forward timing does far more for this lane than another generic timeout log.
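
One possible shape for that record, as a sketch rather than the checked-in receipt format: the `FirstForwardReceipt` class, its field names, and the example labels are assumptions. It keeps the four items named above and derives the cross-rank skew from them.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FirstForwardReceipt:
    distributed_mode: str   # e.g. "fsdp2"; the label is illustrative
    gpu_count: int
    cache_state: str        # e.g. "cold" or "seeded"
    first_forward_s: dict = field(default_factory=dict)  # rank -> seconds

    def skew_s(self):
        """Gap between the fastest and slowest rank to finish first forward."""
        times = list(self.first_forward_s.values())
        return max(times) - min(times) if times else None

    def missing_ranks(self):
        """Ranks that never reported a first-forward time."""
        return [r for r in range(self.gpu_count) if r not in self.first_forward_s]

    def to_json(self):
        record = asdict(self)
        record["skew_s"] = self.skew_s()
        record["missing_ranks"] = self.missing_ranks()
        return json.dumps(record, sort_keys=True)
```

A large skew with every rank reporting points toward compile divergence; a missing rank points toward rank death, which is the distinction a generic timeout log cannot make.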

The checked-in compile runtime env sample, PP compile warmup sample, GPU profile receipt sample, and CUDA graph env defaults sample are the compact public-safe examples of that contract.

Ablations and what we kept

We did not keep the story that this is simply "an NCCL bug." The useful failure class is compile divergence at the first compiled collective, with rank death as the other main branch.

We did not keep sequential rank-by-rank warmup as the fix. It sounds attractive and does not match the real distributed graph boundary that the launch actually exercises.

We also did not keep the idea that bucket-mounted storage can stay on the hot path for eight active workers. It is a good import surface and a weak training surface.

What survived is narrower:

warm the real multi-GPU cache for the real launch shape
keep the hot data path on a Volume
preserve a first-forward receipt rich enough to separate compile skew from rank death
move the lane to owned hardware when the problem stops being a fair single-host Modal question

Guide: when a Modal multi-GPU job hangs

When an 8-GPU job goes quiet, we use this order:

1. Read the last useful log or app-state line and ask whether some ranks were still compiling while others were already near a collective.
2. Pull the benchmark record and compare distributed mode, starting cache state, and any per-rank first-forward timing.
3. Check the per-rank failure dump if one exists. One real traceback plus several timeouts is a rank-death story, not a transport mystery.
4. Check whether the cache seed was actually present and large enough to represent the intended launch.
5. If the evidence still points at compile divergence, warm the cache properly and rerun.
6. If the signal is still ambiguous, move to the reduced-complexity diagnostic preset.

The point of that order is to stop treating every silent job as the same category of failure.
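
The same order can be written down as a small decision function. This is a sketch under stated assumptions: the `next_step` helper, its inputs, the returned labels, and the skew threshold are all hypothetical, and real triage still starts from reading the logs.

```python
def next_step(real_tracebacks, cache_seeded, skew_s, skew_limit_s):
    """Map receipt evidence to the next triage action."""
    # One real traceback, with timeouts everywhere else, is a rank-death story.
    if real_tracebacks >= 1:
        return "rank_death: debug the failing rank"
    # A missing seed or a large first-forward skew points at compile divergence.
    if not cache_seeded or (skew_s is not None and skew_s > skew_limit_s):
        return "compile_divergence: warm the cache and rerun"
    # Seeded cache, no traceback, no clear skew: the signal is ambiguous.
    return "ambiguous: run the diagnostic preset"
```

For example, `next_step(0, True, 54.0, 30.0)` lands on the compile-divergence branch, while the same receipt with one real traceback is routed to rank death first.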

FAQ

Frequently asked questions

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Modal

A container-first GPU execution surface with explicit image, GPU, Volume, Secret, and detached-launch primitives that MegaCpp uses for isolated benchmark and validation lanes.

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Memory Snapshots

Modal's snapshot restore surface for import-time or initialization-time startup state. In these articles it narrows cold-start tax but does not replace warm compile caches or host-local storage semantics.

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

Pipeline parallelism

Pipeline parallelism cuts the model by depth: each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... on the forward pass and gradients flow back in reverse order on the backward pass. Cost: a pipeline bubble that shrinks roughly in proportion to 1/microbatches, so you need many microbatches per step to amortize it. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is small stage-boundary activations, not full weight or gradient tensors.
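
The arithmetic in that definition can be checked with a short sketch. It assumes an even layer split and a simple GPipe-style schedule, where the idle share of a step is (stages − 1)/(microbatches + stages − 1); the helper names are hypothetical.

```python
def layers_per_stage(num_layers, pp):
    """Layers owned by each pipeline stage under an even split."""
    if num_layers % pp:
        raise ValueError("this sketch assumes layers divide evenly across stages")
    return num_layers // pp


def bubble_fraction(pp, microbatches):
    """Idle share of one step for a simple GPipe-style schedule."""
    return (pp - 1) / (microbatches + pp - 1)
```

With 32 layers and PP=8, `layers_per_stage` gives 4. At 8 microbatches the bubble is 7/15 of the step; at 64 microbatches it drops to 7/71, which is why many microbatches per step are needed.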

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

Volume

Modal's writable persistence surface for cache, checkpoints, copied shards, and other mutable state that must survive container turnover.

cloud bucket mount

Modal's object-storage mount surface for large read-mostly artifacts such as datasets and wheel mirrors; MegaCpp keeps hot writable state on Volumes instead.

detached

A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.
