Benchmarking the MegaCpp stack on Modal: multi-GPU lessons from rented boxes
What we learned running the training stack on rented H100, H200, and B200 boxes through Modal: three benchmark lanes, an 8-GPU FSDP2 hang, and the bookkeeping that lets the numbers survive a week.

Modal is, for us, a benchmarking surface and overflow capacity pool, not the production training cluster. The useful output from that surface is not one headline throughput number. It is a bounded benchmark record that says which lane ran, which hardware class was requested, what hardware class was actually observed, what startup state the run began from, and which nearby lanes were still outside the safe comparison class.
That evidence-first posture is spelled out more directly in Modal benchmark receipts and contrasted with the warm-host lane in Modal vs owned hardware.
Why this matters
Rented Hopper- and Blackwell-class boxes expose exactly the things a warm long-lived host can hide: cold compile state, image drift, and a storage path that may or may not survive the first serious multi-GPU run gracefully. That makes Modal useful for reality checks and dangerous for lazy benchmarking.
The operational question is not "what was the fastest number." It is "what exactly did that number prove."
1. Three lanes, not one
There is no single "Modal benchmark." We keep three separate lanes because they answer different questions.
Whole-model training benchmarks
This is the lane for steady-state training throughput. The number belongs to the measured loop, not step 0, and it only makes sense when distributed mode and startup state are preserved alongside it.
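A minimal sketch of what "the number belongs to the measured loop, not step 0" can mean in practice. The function name, the `warmup_steps` parameter, and the step timings are illustrative assumptions, not the stack's actual timing code.

```python
def measured_loop_throughput(step_seconds, tokens_per_step, warmup_steps=1):
    """Return tokens/second over the measured loop, skipping warmup steps.

    step_seconds: wall-clock duration of each step, in launch order.
    tokens_per_step: tokens processed per step (assumed constant here).
    warmup_steps: leading steps to discard (step 0 carries compile cost).
    """
    measured = list(step_seconds)[warmup_steps:]
    if not measured:
        raise ValueError("no steps left after discarding warmup")
    return tokens_per_step * len(measured) / sum(measured)


# Hypothetical timings: step 0 includes compile time, the rest are steady state.
steps = [42.0, 0.5, 0.5, 0.5, 0.5]
print(measured_loop_throughput(steps, tokens_per_step=1000))  # 2000.0
```

Averaging over all five steps instead would fold the 42-second compile step into the figure and report a number that says more about startup than about training.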
Exact-token sparse detached benchmark
This lane is about runtime identity and telemetry more than raw throughput. The detached launcher and later collector preserve durable run identity so the result can be reread after the fact instead of reconstructed from a terminal.
Sparse validation and FA4 promotion
This lane is about acceptance and promotion status, not the fastest throughput figure. A green promotion receipt does not automatically imply that the throughput lane is healthy on the same image.
That is why the three lanes stay separate. A good result from one is not evidence for the others.
2. What worked and what it proved
The single-GPU story was the easy one. Full training of the dense preset ran end to end on H100, H200, and B200, and the B200 lane showed the strongest single-device throughput in this wave. The interesting point was not that one GPU class was faster. It was that the receipt still needed to say which class was requested and which one the run actually saw.
That requested-versus-observed seam matters because a hosted platform can make the table look cleaner than the launch really was. For apples-to-apples benchmarking we keep the exact requested GPU class explicit and preserve the observed class in the record beside it.
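One way to keep both sides of that seam in the record is sketched below. The class and function names are hypothetical, and the reported device name is assumed to come from whatever the run can query at startup (for example `torch.cuda.get_device_name()` or `nvidia-smi`); the example strings are illustrative.

```python
from dataclasses import dataclass


def gpu_class(device_name: str) -> str:
    """Reduce a reported device name to a coarse class such as 'H100'."""
    upper = device_name.upper()
    for known in ("H100", "H200", "B200"):
        if known in upper:
            return known
    return "UNKNOWN"


@dataclass(frozen=True)
class HardwareField:
    requested: str  # class asked for at launch
    observed: str   # class derived from what the run actually saw

    @property
    def matches(self) -> bool:
        return self.requested == self.observed


def record_hardware(requested: str, reported_name: str) -> HardwareField:
    """Store both values; never overwrite one with the other."""
    return HardwareField(requested=requested.upper(),
                         observed=gpu_class(reported_name))


field = record_hardware("H100", "NVIDIA H200")
print(field, field.matches)
```

The point of keeping two fields instead of one is that a mismatch stays visible in the record rather than being silently normalized into the table.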
The B200 result also did not automatically settle procurement. A stronger single-device number is not the same thing as "default fleet answer." Once startup class, saturation, and the surrounding engineering cost enter the picture, the comparison gets more nuanced. That is why the follow-up read is Modal vs owned hardware, not a bigger chart.
3. The 8-GPU hang
The honest part of this benchmark wave is the failure mode that cost the most time: the 8-GPU FSDP2 plus compile lane hanging on the first forward pass. The useful interpretation is not "Modal multi-GPU is broken." It is "cold compile work reached the first collective unevenly across ranks."
The same code path behaves differently on warm owned H200 systems because the compile cache is already populated. That is the main reason Modal image and cold-start and Modal multi-GPU issues and fixes sit next to this article rather than far away from it.
We evaluated a few practical options and kept the narrow ones:
- seed a real multi-GPU compile cache before the heavy launch
- keep a last-known-good seed in the image for fresh deployments
- use a reduced-complexity diagnostic preset when the lane needs disentangling
What we did not keep is a generic story about offline precompile solving the training lane by itself. The real launch still depends on the actual runtime shape.
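A rough sketch of a pre-launch gate in the spirit of the first two options: classify the starting line from the compile cache directory and refuse a cold multi-GPU launch. Both function names are hypothetical, and `cache_dir` is assumed to be wherever the image points its compile cache (for instance the directory named by `TORCHINDUCTOR_CACHE_DIR`, if that is how the runtime is configured).

```python
from pathlib import Path


def startup_state(cache_dir: str) -> str:
    """Classify the starting line by whether the compile cache has entries."""
    root = Path(cache_dir)
    if not root.is_dir():
        return "cold"
    has_files = any(p.is_file() for p in root.rglob("*"))
    return "warm-cache" if has_files else "cold"


def check_launch(cache_dir: str, world_size: int) -> str:
    """Return the startup state, refusing a cold multi-GPU launch."""
    state = startup_state(cache_dir)
    if world_size > 1 and state == "cold":
        raise RuntimeError(
            f"compile cache at {cache_dir!r} is empty; seed it before "
            f"launching {world_size} ranks"
        )
    return state
```

A non-empty directory only shows that some cache exists. It does not show that the entries match the current runtime shape, which is why the returned state belongs in the record rather than being treated as proof the launch is safe.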
4. Data plumbing
Object storage is the cold source of truth. It is not the hot path for eight active training workers. The benchmark lane got more believable once the training shards were staged into a Volume before the measured run and the hot read path stopped depending directly on the mounted object store.
That staging rule is part of benchmark validity, not just storage hygiene. If the run is still paying storage turbulence inside the measured loop, the record is saying as much about the filesystem as it is about the model.
Fused-kernel artifacts follow the same rule. The useful receipt keeps the runtime surface tied to the image and the staged state rather than letting them drift apart silently.
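The staging rule can be sketched as a plain copy step that runs before the measured loop. This assumes the object store and the Volume both appear as ordinary directories inside the container; the function name, the paths passed in, and the `*.bin` shard pattern are illustrative assumptions.

```python
import shutil
from pathlib import Path


def stage_shards(source_dir, staged_dir, pattern="*.bin"):
    """Copy shards into the staged directory and return the staged paths.

    Files already present with the same size are left alone, so calling
    this again before the measured run is cheap.
    """
    src, dst = Path(source_dir), Path(staged_dir)
    dst.mkdir(parents=True, exist_ok=True)
    staged = []
    for shard in sorted(src.glob(pattern)):
        target = dst / shard.name
        if not target.exists() or target.stat().st_size != shard.stat().st_size:
            shutil.copy2(shard, target)
        staged.append(target)
    if not staged:
        raise FileNotFoundError(f"no shards matching {pattern!r} in {src}")
    return staged
```

The training loop then reads only from the returned paths, so any storage turbulence is paid before the timer starts rather than inside the measured loop. A size comparison is a weak integrity check; a checksum would be stricter at the cost of a full read.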
5. Bookkeeping is the deliverable
A throughput number without provenance is a rumor. A benchmark record is the durable receipt for one lane. For the training lane that means keeping at least:
- requested and observed GPU class
- distributed mode
- measured-loop throughput rather than raw step-0 noise
- startup state, such as cold boot, warm cache, or another staged starting line
- the hot data path class used by the run
For the detached sparse lane it means runtime telemetry and backend identity. For the promotion lane it means pass or promotion status rather than pretending every green validation receipt is a throughput claim.
The local GPU profile receipt sample, FA4 receipt summary sample, compile runtime env sample, compile warmup policy sample, and exact-token sparse telemetry sample are the compact public-safe examples of those record shapes.
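For orientation, a training-lane record carrying the fields listed above could be shaped roughly like this. It is a sketch, not the schema of the linked samples: the class name, field names, and every value in the example are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingLaneRecord:
    requested_gpu: str
    observed_gpu: str
    distributed_mode: str
    tokens_per_second: float  # measured loop only, step 0 excluded
    startup_state: str        # e.g. "cold" or "warm-cache"
    hot_data_path: str        # e.g. "volume" or "object-store-mount"

    def __post_init__(self):
        # A record with a blank provenance field is rejected outright.
        for name, value in asdict(self).items():
            if value is None or value == "":
                raise ValueError(f"missing provenance field: {name}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


record = TrainingLaneRecord(
    requested_gpu="H100",
    observed_gpu="H100",
    distributed_mode="fsdp2",
    tokens_per_second=1000.0,  # placeholder, not a measured result
    startup_state="warm-cache",
    hot_data_path="volume",
)
print(record.to_json())
```

Failing at construction time is a deliberate choice in this sketch: a number that cannot name its startup state or data path never becomes a record in the first place.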
6. Practical routing
The useful routing is straightforward:
- use the whole-model lane for training throughput questions
- use the detached sparse lane for exact runtime identity and telemetry
- use the validation lane for promotion or acceptance questions
When the question becomes "is this record even comparable," the next read is Modal benchmark receipts. When it becomes "why did this run hang," the next read is Modal multi-GPU issues and fixes.
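As a rough illustration of the "is this record even comparable" question, the check below reports which fields keep two records out of the same comparison class. The field list is an assumption made for this sketch, not the rule defined in Modal benchmark receipts.

```python
COMPARISON_FIELDS = (
    "lane",
    "observed_gpu",
    "distributed_mode",
    "startup_state",
    "hot_data_path",
)


def comparison_blockers(a: dict, b: dict) -> list:
    """Return the fields on which two records differ; empty means comparable."""
    return [key for key in COMPARISON_FIELDS if a.get(key) != b.get(key)]


# Hypothetical records: same lane and hardware, different starting line.
cold = {"lane": "whole-model", "observed_gpu": "H100",
        "distributed_mode": "fsdp2", "startup_state": "cold",
        "hot_data_path": "volume"}
warm = dict(cold, startup_state="warm-cache")
print(comparison_blockers(cold, warm))  # ['startup_state']
```

Returning the blocking fields rather than a bare boolean keeps the reason for a rejected comparison next to the rejection.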
What we kept and what we will not claim
We kept three explicit benchmark lanes, explicit requested-versus-observed hardware fields, warm-cache awareness, and staged Volume data for the hot path.
We did not keep the convenience story that one nice number from one easy lane says the whole stack is healthy. We also do not treat older successful outputs as proof that the current heavy multi-GPU compile lane is universally solved.
Frequently asked questions
Why split Modal work into three benchmark lanes?
What caused the 8-GPU FSDP2 hang?
Why pin the exact GPU type in the record?
Why stage shards into a Volume before the measured run?
Which starting-line state belongs in the benchmark record?
Do Modal Memory Snapshots replace the warm-cache lane?
torch.compile caveats. For this benchmark lane, that makes snapshots useful to record, not a substitute for the warmed distributed compile cache described in Modal image and cold-start and Modal multi-GPU issues and fixes.
Why not replace the cache-baked image with a generic offline precompile story?
Why not just raise the distributed timeout?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Modal: A container-first GPU execution surface with explicit image, GPU, Volume, Secret, and detached-launch primitives that MegaCpp uses for isolated benchmark and validation lanes.
FSDP2: PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.
detached: A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.
Volume: Modal's writable persistence surface for cache, checkpoints, copied shards, and other mutable state that must survive container turnover.
Memory Snapshots: Modal's snapshot restore surface for import-time or initialization-time startup state. In these articles it narrows cold-start tax but does not replace warm compile caches or host-local storage semantics.
H200: NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.