MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
Published | 5 min read | David Gornshtein
Tags: Modal, Multi-GPU, NCCL, FSDP2, H200, B200, Runbook

Modal Multi-GPU Pain and the Fixes That Actually Landed

NCCL topology, GPU isolation, eviction and OOM-kill behavior, observability gaps, and the guide we follow when a Modal multi-GPU job hangs on the first forward pass.

Single-GPU Modal is mostly straightforward. Multi-GPU Modal is where the failure modes become expensive. The useful distinction in our records is not "Modal was flaky" versus "Modal was fine." It is whether the lane failed at startup, at the first compiled collective, or because one rank died and the others only reported the NCCL consequence afterward. This post is the narrow runbook for that boundary.

That failure surface reads more clearly next to "Modal benchmark receipts", "Benchmarking the MegaCpp stack on Modal: multi-GPU lessons from rented boxes", and "Modal debugging playbook".

Why MegaCpp cares about this

Modal is where we run ad-hoc multi-GPU smokes and bounded benchmark waves between longer runs on owned H200 systems and TPU slices. If the 8-GPU surface cannot tell us quickly whether a new block or kernel path is viable, it stops being a useful test surface and starts burning budget.

That is why we separate "multi-GPU hang" into a few smaller questions:

  • did the launch reach the real training lane at all
  • did the first compiled collective diverge across ranks
  • did one rank die first and leave the others to time out

The cold-start side of the same story lives in Modal image and cold-start. This article starts once the multi-GPU path itself is the problem.

What MegaCpp uses today

First touch definition: on this Modal lane, H200:8 means eight GPUs attached to one host for one container. It is a single-host surface, not a multi-node claim.

That single-host boundary matters because it keeps one part of the diagnosis simple. When a run hangs here, the first question is usually compile timeline or rank death, not cross-node fabric behavior.

The working hardening is intentionally narrow:

  • cap Triton staging so autotune does not ask for an invalid shared-memory budget
  • keep structured per-rank failure output so one dead rank is visible as more than a generic timeout
  • keep distributed debug at INFO, not the heavier validation mode that changes bootstrap behavior
  • relax or disable the compile-phase heartbeat so long warmup does not look like a dead worker
  • turn FSDP autotune off on the fragile lane where the autotune subprocess competes for the same memory budget as the model
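The hardening list above can be read as a small set of environment overrides applied before the launch. The following is a minimal sketch, not the checked-in MegaCpp configuration. `TORCH_DISTRIBUTED_DEBUG` is a documented PyTorch variable. `TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC` is assumed here to be the relevant heartbeat knob and should be verified against the PyTorch version in the image. The two `EXAMPLE_*` names are hypothetical placeholders for whatever project-specific switches cap Triton staging and disable FSDP autotune.

```python
import os
from typing import MutableMapping


def hardened_env(compile_warmup_s: int = 1800) -> dict[str, str]:
    """Build illustrative env overrides for the fragile multi-GPU lane."""
    return {
        # INFO keeps extra logging without the heavier DETAIL validation mode.
        "TORCH_DISTRIBUTED_DEBUG": "INFO",
        # Assumption: the heartbeat in question is the ProcessGroupNCCL monitor;
        # a long timeout keeps slow compile warmup from looking like a dead worker.
        "TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC": str(compile_warmup_s),
        # Hypothetical placeholder: cap on Triton pipeline stages during autotune.
        "EXAMPLE_TRITON_MAX_STAGES": "2",
        # Hypothetical placeholder: switch that turns FSDP autotune off.
        "EXAMPLE_FSDP_AUTOTUNE": "0",
    }


def apply_env(overrides: dict[str, str],
              environ: MutableMapping[str, str] = os.environ) -> None:
    """Write the overrides into the target environment mapping."""
    environ.update(overrides)
```

Passing a plain dict as `environ` makes the overrides easy to inspect or record in a receipt before they touch the real process environment.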

The data path is equally simple. Bucket-backed storage is the cold import surface; the hot training path reads copied shards from a Volume. That keeps the multi-GPU run from turning into a filesystem benchmark.
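A minimal sketch of that cold-to-hot copy, using only the standard library. The directory arguments and the `*.bin` shard pattern are illustrative assumptions; in practice the cold directory would be the bucket mount and the hot directory a path on the Volume.

```python
import shutil
from pathlib import Path


def stage_shards(cold_dir: Path, hot_dir: Path, pattern: str = "*.bin") -> list[Path]:
    """Copy shards from the cold import surface to the hot training surface.

    Shards already present with a matching size are skipped, so reruns are cheap.
    """
    hot_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for src in sorted(cold_dir.glob(pattern)):
        dst = hot_dir / src.name
        if not dst.exists() or dst.stat().st_size != src.stat().st_size:
            # Copy to a temporary name first so a killed container
            # never leaves a truncated shard under the final name.
            tmp = dst.with_name(dst.name + ".tmp")
            shutil.copyfile(src, tmp)
            tmp.replace(dst)
        staged.append(dst)
    return staged
```

The training loop then reads only from the returned paths, so all eight workers hit the Volume rather than the bucket mount.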

The same startup rule applies to snapshots. A snapshot can reduce import-time overhead. It does not replace a real warmed distributed compile cache, because the first-forward collective boundary is still about whether ranks reach the same compiled state at roughly the same time.

How MegaCpp currently uses this

The live blocker on this lane is still cold-cache compile divergence. Different ranks can spend different wall time inside first-use compilation, which means some reach the first collective while others are still compiling. That looks like a generic NCCL hang if the receipt is weak and like distributed compile divergence if the receipt is good enough.

Our practical fix stack is smaller than the upstream idea space:

  1. require a real multi-GPU cache seed before the heavy launch
  2. keep a last-known-good seed in the image for fresh deployments
  3. keep a reduced-complexity diagnostic preset so the hang can be separated from broader model complexity
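The first item in that stack implies a gate before the heavy launch: refuse to start if the cache seed is missing or implausibly small. A minimal sketch follows; the directory layout and the byte threshold are assumptions for illustration, not values from the MegaCpp launcher.

```python
from pathlib import Path


def cache_seed_ok(cache_dir: Path, min_bytes: int) -> tuple[bool, int]:
    """Return (ok, total_bytes) for a compile-cache seed directory.

    The seed counts as usable only if the directory exists and the files
    under it add up to at least min_bytes.
    """
    if not cache_dir.is_dir():
        return False, 0
    total = sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())
    return total >= min_bytes, total
```

A size check is a crude proxy. It catches an empty or single-GPU seed, but it cannot prove the seed was produced by the same distributed configuration as the intended launch.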

The useful receipt seam is the first-forward one. A durable record that keeps distributed mode, GPU count, starting cache state, and per-rank first-forward timing does far more for this lane than another generic timeout log.
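One way to picture such a record is a small dataclass holding exactly those four fields. This is a sketch under assumptions: the class name, field names, and the example cache-state labels are hypothetical and do not describe the actual MegaCpp receipt schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FirstForwardReceipt:
    distributed_mode: str   # e.g. "fsdp2"
    gpu_count: int
    cache_state: str        # e.g. "cold", "seeded", "warm" (illustrative labels)
    # Seconds from launch to completion of the first forward pass, keyed by rank.
    first_forward_s: dict[int, float] = field(default_factory=dict)

    def spread_s(self) -> float:
        """Gap between the fastest and slowest reporting rank."""
        times = list(self.first_forward_s.values())
        return max(times) - min(times) if times else 0.0

    def missing_ranks(self) -> list[int]:
        """Ranks that never reported a first-forward time."""
        return [r for r in range(self.gpu_count) if r not in self.first_forward_s]

    def to_json(self) -> str:
        """Serialize for durable storage; rank keys become strings in JSON."""
        return json.dumps(asdict(self), sort_keys=True)
```

A large `spread_s()` on a cold cache points toward compile divergence, while a non-empty `missing_ranks()` points toward a rank that died or never reached the first forward pass.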

The checked-in compile runtime env sample, PP compile warmup sample, GPU profile receipt sample, and CUDA graph env defaults sample are the compact public-safe examples of that contract.

Ablations and what we kept

We did not keep the story that this is simply "an NCCL bug." The useful failure class is compile divergence at the first compiled collective, with rank death as the other main branch.

We did not keep sequential rank-by-rank warmup as the fix. It sounds attractive but does not match the real distributed graph boundary that the launch actually exercises.

We also did not keep the idea that bucket-mounted storage can stay on the hot path for eight active workers. It is a good import surface and a weak training surface.

What survived is narrower:

Guide: when a Modal multi-GPU job hangs

When an 8-GPU job goes quiet, we use this order:

  1. Read the last useful log or app-state line and ask whether some ranks were still compiling while others were already near a collective.
  2. Pull the benchmark record and compare distributed mode, starting cache state, and any per-rank first-forward timing.
  3. Check the per-rank failure dump if one exists. One real traceback plus several timeouts is a rank-death story, not a transport mystery.
  4. Check whether the cache seed was actually present and large enough to represent the intended launch.
  5. If the evidence still points at compile divergence, warm the cache properly and rerun.
  6. If the signal is still ambiguous, move to the reduced-complexity diagnostic preset.

The point of that order is to stop treating every silent job as the same category of failure.
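The order above can be sketched as a small decision function. The inputs are assumed to have been extracted from logs and the benchmark record by hand or by separate tooling; the names and verdict labels are hypothetical, and the function only encodes the ordering of the checks, not how the evidence is gathered.

```python
from enum import Enum


class Verdict(Enum):
    RANK_DEATH = "read the real traceback; the timeouts are a consequence"
    MISSING_CACHE_SEED = "seed the multi-GPU compile cache, then rerun"
    COMPILE_DIVERGENCE = "warm the cache properly and rerun"
    AMBIGUOUS = "rerun with the reduced-complexity diagnostic preset"


def triage(tracebacks: int, timeouts: int,
           ranks_still_compiling: int, ranks_at_collective: int,
           cache_seed_present: bool) -> Verdict:
    """Classify a quiet multi-GPU job using the guide's order of checks."""
    # One real traceback plus timeouts elsewhere is a rank-death story.
    if tracebacks >= 1 and timeouts >= 1:
        return Verdict.RANK_DEATH
    # Some ranks compiling while others wait at a collective is divergence.
    diverged = ranks_still_compiling > 0 and ranks_at_collective > 0
    if diverged and not cache_seed_present:
        return Verdict.MISSING_CACHE_SEED
    if diverged:
        return Verdict.COMPILE_DIVERGENCE
    # Nothing conclusive: separate the hang from model complexity.
    return Verdict.AMBIGUOUS
```

For example, `triage(1, 7, 0, 0, True)` returns `Verdict.RANK_DEATH`, and each verdict's value is the suggested next action.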

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Modal

A container-first GPU execution surface with explicit image, GPU, Volume, Secret, and detached-launch primitives that MegaCpp uses for isolated benchmark and validation lanes.

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Memory Snapshots

Modal's snapshot restore surface for import-time or initialization-time startup state. In these articles it narrows cold-start tax but does not replace warm compile caches or host-local storage semantics.

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

PP

Pipeline parallelism cuts the model by depth: each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... on the forward pass, and gradients flow back in the reverse order on the backward pass. Cost: a pipeline bubble. For a simple GPipe-style schedule with p stages and m microbatches, the idle fraction is roughly (p-1)/(m+p-1), so you need many microbatches per step to amortize it. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is comparatively small stage-boundary activations, not full weight or gradient tensors.

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

Volume

Modal's writable persistence surface for cache, checkpoints, copied shards, and other mutable state that must survive container turnover.

cloud bucket mount

Modal's object-storage mount surface for large read-mostly artifacts such as datasets and wheel mirrors; MegaCpp keeps hot writable state on Volumes instead.

detached

A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.
