Benchmarking the MegaCpp Stack on Modal: Multi-GPU Lessons From Rented Boxes
What we learned running our training stack on rented H100, H200, and B200 boxes through Modal — three benchmark lanes, an 8-GPU FSDP2 hang, and the bookkeeping that lets the numbers survive a week.

Modal is, for us, a benchmarking surface and an overflow capacity pool — not the production training cluster. We use it to answer questions of the form "is the current MegaCpp stack still the best variant on H200?" and "how much faster is B200 than H100 on a real training step, not a synthetic kernel?" The numbers we get back have a much shorter half-life than people assume; the bookkeeping around them is what keeps them honest after a week.
This post is the operating manual we wish we'd had before the first 8-GPU run. It covers the three Modal lanes we treat as distinct, why we will not let them collapse into one "Modal benchmark" story, and the multi-GPU failure mode that ate the most time so far.
Three lanes, not one
There is a recurring temptation to wave at "the Modal benchmark" as if it were a single number. It is not. The repo carries three distinct Modal surfaces, and treating them as interchangeable is how stale claims sneak back into reports.
The first lane is whole-model training benchmarks. This is where we measure
H200 throughput in real tok/sec from the global batch (not per-GPU), compare
structural variants under a controlled launch regime, and answer questions like
"does removing MTP hurt steady-state throughput?" or "is FSDP2 still worth it
versus DDP on this geometry?" Steady-state means post-warmup steps; step 0 is
discarded as a matter of hygiene. Eval is disabled for the duration of the
benchmark window so the number reported is throughput, not throughput plus
eval overhead.
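As a concrete illustration of the steady-state convention, here is a minimal sketch of the arithmetic, assuming a plain list of per-step wall-clock durations; the warmup cutoff and function name are illustrative, not the repo's actual parser:

```python
def steady_state_tok_per_sec(step_times, global_batch_tokens, warmup_steps=1):
    """Mean tok/sec over post-warmup steps.

    step_times: per-step wall-clock durations in seconds, indexed from
    step 0. Step 0 (and any further warmup steps) is discarded so
    JIT/compile time never pollutes the number.
    global_batch_tokens: tokens processed per step across ALL ranks --
    the global batch, not per-GPU.
    """
    steady = step_times[warmup_steps:]
    if not steady:
        raise ValueError("no post-warmup steps to measure")
    return global_batch_tokens * len(steady) / sum(steady)
```

For example, with a 10-second compile-heavy step 0 followed by two 2-second steps at a global batch of 8,192 tokens, the reported number is 4,096 tok/sec; naively including step 0 would report under 1,800.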
The second lane is the exact-token sparse detached benchmark. This one
benchmarks the sparse attention path in isolated eval/no-grad form so we can
record exact runtime telemetry and the backend identity that actually ran. The
supported launcher uses an explicit app.run(detach=True) lifecycle plus a
collector script; the local modal run ... --detach shortcut is intentionally
not the accepted contract here, because we want lifecycle objects we can audit
later. The artifacts on this lane are not throughput numbers; they are
bench_result, bench_telemetry, backend_identity, the remote runtime
provenance blob, and the saved app_id plus function_call_id.
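The saved app_id and function_call_id are what the collector needs later to reattach, so persisting them is the one step that cannot be skipped. A minimal sketch of that bookkeeping step, with illustrative field and directory names (the real launcher has its own schema):

```python
import json
import time
from pathlib import Path


def save_detach_receipt(app_id: str, function_call_id: str,
                        out_dir: str = "bench_artifacts") -> dict:
    """Persist the lifecycle IDs a detached run leaves behind.

    These two IDs are what the collector script uses to reattach and
    pull bench_result / bench_telemetry; losing them means losing the
    run. Field names here are illustrative, not the repo's schema.
    """
    receipt = {
        "app_id": app_id,
        "function_call_id": function_call_id,
        "saved_at_unix": time.time(),
    }
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    (path / f"{function_call_id}.json").write_text(json.dumps(receipt, indent=2))
    return receipt
```

The point of writing this to disk immediately after spawn, rather than at the end, is that a detached run by definition outlives the launching process.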
The third lane is sparse validation and FA4 promotion. This is bounded
acceptance: parity checks, promotion-readiness, summary manifests for a wave.
The success criterion is not "highest tok/sec" — it is promotion_status,
promotion_ready, and pass/fail summaries. A receipt here showing Triton
promotion success does not imply that FA4 runtime import and bootstrap are
healthy on the same image; that's a different field on the manifest, and you
have to read it.
The reason we keep these separate is purely operational: collapsing them produces wrong claims. A green sparse validation does not justify a number on the training lane. A throughput number from the training lane does not prove the sparse acceptance lane works. The cost of conflation is that someone quotes one as the other in a Slack thread two weeks later, and we end up re-running the box to disprove the misquote.
What works today
The single-GPU story is straightforward. Full training of our current dense NAM52-class model runs end-to-end on H100, H200, and B200 instances. The full pytest suite — 914 tests — passes on a Modal H100 image without skips.
The cross-GPU-class throughput comparison is where Modal earned its keep. Same model, same recipe, same image, three GPU classes:
| GPU | Best tok/sec (single device) | Relative to H100 |
|---|---|---|
| H100 | 2,780 | 1.00x |
| H200 | (intermediate) | varies by recipe |
| B200 | 4,316 | 1.55x |
The B200 number is the one that actually matters for capacity planning. It is 1.55x H100 on the same recipe — not 2x, not "Blackwell magic", just 1.55x — and at the spot price quoted below it does not pay for itself versus H200 unless we can keep it saturated. We mention that explicitly because the internal version of this table had a "B200 is the future" caption that was charitable to silicon and unfair to procurement.
Pricing context for capacity planning, per GPU-hour at the time of writing:
| GPU | $/hr | 8-GPU $/hr |
|---|---|---|
| B200 | 6.25 | 50.00 |
| H200 | 4.54 | 36.32 |
| H100 | 3.95 | 31.60 |
These move; the ratios move slower. The decision rule we use is: B200 only when we are bottlenecked on memory bandwidth or HBM capacity for the specific recipe, H200 by default, H100 only for cheap regression sweeps where a 1.55x gap is irrelevant.
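The decision rule falls out of back-of-envelope tokens-per-dollar arithmetic using the two tables above, ignoring utilization and queueing. A sketch (H200 throughput is recipe-dependent in the post, so it is omitted here):

```python
# tok/sec and $/hr taken from the tables above; spot prices move,
# so treat the ratio, not the absolute value, as the signal.
GPUS = {
    "H100": {"tok_per_sec": 2780, "usd_per_hr": 3.95},
    "B200": {"tok_per_sec": 4316, "usd_per_hr": 6.25},
}


def tokens_per_dollar(gpu: str) -> float:
    """Tokens trained per dollar at full saturation."""
    g = GPUS[gpu]
    return g["tok_per_sec"] * 3600 / g["usd_per_hr"]
```

At these prices, B200 comes out slightly *worse* per dollar than H100 at full saturation (~2.49M vs ~2.53M tokens/$), which is exactly why the rule above reserves B200 for recipes that are genuinely bandwidth- or HBM-bound.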
The 8-GPU hang
The honest part of any benchmarking post is the failure mode that cost the
most time. For us, on Modal, that was the 8-GPU FSDP2 path with
regional_compile enabled. The symptom is the worst kind: the run launches,
ranks initialize, the first forward pass enters — and then nothing. No
traceback, no NCCL timeout for a long while, no useful log slice.
The root cause is mundane once you see it. With a cold inductor cache, Triton JITs each kernel on first use. JIT time is not deterministic across ranks. Eight ranks therefore enter the first NCCL collective at eight different moments, and the collective deadlocks because some ranks are still inside the compiler.
The same code does not hang on our long-lived H200 training hosts because those hosts have warm inductor caches from prior runs; the JIT path is effectively a cache lookup, the rank skew collapses, and the collective proceeds. Modal containers are clean by default, so they hit the slow path every time.
There is no clever single-line fix. The options are:
- Pre-bake the inductor cache into the Docker image, sourced from a warm H200 VM. This is the cleanest fix and the one we are converging on; it moves the variance off the hot path and into image build time.
- Mount a Modal Volume with a pre-populated cache from a prior 8-GPU run on the same image. This works, but the cache has to come from an 8-GPU run on the matching image; an 8-GPU cache from a different image, or a 1-GPU cache from the right image, does not cover the kernel set.
- Sequential compile warmup. Tempting, but FSDP2 changes the graph in ways that make a "compile once on rank 0 and fan out" strategy unsafe, so we discarded it.
- Reduce model complexity for the Modal lane — fewer MoE experts, no MoD — to shrink the kernel set the first compile has to JIT. This is what we actually do for quick acceptance runs while the cache-baked image is being prepared.
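A sketch of the cache-baking flow from the first option, assuming TORCHINDUCTOR_CACHE_DIR (PyTorch Inductor's real cache-location env var) and placeholder script and path names; the step count and file layout are illustrative:

```shell
# On a warm H200 host: pin the inductor cache to one directory and run
# a couple of 8-GPU steps so every kernel in the set gets JIT'd once.
# (train.py and --max-steps are placeholders for the real entrypoint.)
export TORCHINDUCTOR_CACHE_DIR=/tmp/inductor-cache
torchrun --nproc_per_node=8 train.py --max-steps 2

# Ship the populated cache into the image build context...
tar czf inductor-cache.tgz -C /tmp inductor-cache

# ...and in the Dockerfile, unpack it and pin the same env var:
#   COPY inductor-cache.tgz /opt/
#   RUN tar xzf /opt/inductor-cache.tgz -C /opt && rm /opt/inductor-cache.tgz
#   ENV TORCHINDUCTOR_CACHE_DIR=/opt/inductor-cache
```

The constraint noted above still applies: the cache must come from the matching image and the matching 8-GPU geometry, or it will not cover the kernel set and the cold path returns.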
The relevant lesson is that "works on the long-lived training host" was hiding a real determinism gap. Modal forced it into the open by giving us a fresh container every time.
Data plumbing
Training data lives in a private GCS bucket (workspace placeholder name). For multi-GPU training, GCS FUSE parallel reads from inside Modal's container did not survive eight concurrent readers — we saw read stalls and partial-shard reads, not corruption, but the training step took the latency hit. The fix is dull and effective: pre-copy the relevant shards into a Modal Volume once, then mount the Volume and read from local disk. Throughput becomes deterministic and the GCS bill drops.
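The pre-copy step can be sketched with the stock gcloud and Modal CLIs; the bucket, wave, and volume names below are placeholders:

```shell
# One-time pre-copy of a shard wave into a Modal Volume.
# <bucket> and wave-42 are placeholders for the real GCS layout.
gcloud storage cp -r "gs://<bucket>/shards/wave-42" ./shards-local/
modal volume create training-shards
modal volume put training-shards ./shards-local/wave-42 /wave-42
```

After that, training functions mount the volume (e.g. via modal.Volume.from_name("training-shards")) and all eight ranks read from local disk instead of eight FUSE readers contending on GCS.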
Fused kernel wheels live in the same GCS area. We pin them by image so the "which kernels does this run actually use" question always has a single answer in the run manifest.
Bookkeeping is the deliverable
A throughput number with no provenance is a rumor. A throughput number with the right metadata is a receipt that survives the next stack upgrade. For each lane we record a different set of fields, deliberately.
For the whole-model training lane: app_id, function_call_id, the exact
launch flags (verbatim, not paraphrased), the parsed steady-state step
metrics, and the exact distributed mode. Without the distributed mode, "180k
tok/sec on H200" is unreproducible — it could be DDP, FSDP2, FSDP2 with
compile, or Megatron-style.
For the exact-token sparse lane: launcher args, case metadata, the exact
sparse env selectors, the runtime telemetry payload, the backend_identity
the run actually used, the remote runtime provenance, and the detached
collector's state transitions. The last one is what lets us reconstruct
"did this finish on its own or did the collector reattach?"
For the sparse validation / FA4 lane: the validation/promotion mode,
promotion_status, promotion_ready, the saved summary manifests, and
whether the run was detached or blocking.
If a Modal artifact does not carry these, we treat it as anecdote. If it does, it stays useful for months.
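The receipt-versus-anecdote rule is mechanical enough to express as a check. A sketch with illustrative lane and field names condensed from the lists above (the real manifests use their own schema):

```python
REQUIRED_FIELDS = {
    "training": {"app_id", "function_call_id", "launch_flags",
                 "steady_state_metrics", "distributed_mode"},
    "sparse_exact_token": {"launcher_args", "case_metadata",
                           "sparse_env_selectors", "telemetry",
                           "backend_identity", "provenance",
                           "collector_state_transitions"},
    "sparse_validation": {"mode", "promotion_status", "promotion_ready",
                          "summary_manifests", "detached"},
}


def classify_artifact(lane: str, artifact: dict):
    """Return ('receipt', []) if every lane-required field is present,
    otherwise ('anecdote', missing_fields)."""
    missing = sorted(REQUIRED_FIELDS[lane] - artifact.keys())
    return ("receipt", []) if not missing else ("anecdote", missing)
```

Running something like this at artifact-ingest time, rather than relying on reviewers to eyeball manifests, is what keeps a lane's numbers quotable weeks later.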
Practical routing
For anyone using the same surfaces, the routing is:
- scripts/modal_matrix.py for whole-model benchmark intent.
- scripts/modal_bench_dsa_backend_detach.py (paired with the collect script) for exact-token sparse acceptance.
- scripts/modal_sparse_validation_detach.py for sparse and FA4 promotion waves.
The convenience harness modal_benchmark.py is fine for one-off curiosity
runs. It is not the source of truth for distributed H200 claims, and we do
not let it become one.
What we will not claim
We will not claim that older checked-in Modal training JSON artifacts prove the current training lane is universally healthy. They are dated evidence on the wave they came from. The multi-GPU FSDP2 + compile lane is alive on warm-cache H200 hosts and is being made reliable on Modal via the cache-baked image; it is not yet a one-command experience for anyone who clones the repo and types "go". When it is, we will say so on the receipt, not in a tweet.
References
- MODAL_BENCHMARK_PLAN.md
- MODAL_MULTI_GPU_STATUS.md
- H200_STACK_SETUP.md
- tp_sp_ep_fsdp_h200_bringup_2026-04-07.md