Benchmarking the MegaCpp Stack on Modal: Multi-GPU Lessons From Rented Boxes
What we learned running our training stack on rented H100, H200, and B200 boxes through Modal — three benchmark lanes, an 8-GPU FSDP2 hang, and the bookkeeping that lets the numbers survive a week.

Modal is, for us, a benchmarking surface and an overflow capacity pool — not the production training cluster. We use it to answer questions of the form "is the current MegaCpp stack still the best variant on H200?" and "how much faster is B200 than H100 on a real training step, not a synthetic kernel?" The numbers we get back have a much shorter half-life than people assume; the bookkeeping around them is what keeps them honest after a week.
This post is the operating manual we wish we'd had before the first 8-GPU run. It covers the three Modal lanes we treat as distinct, why we will not let them collapse into one "Modal benchmark" story, and the multi-GPU failure mode that ate the most time so far.
Three lanes, not one
There is a recurring temptation to wave at "the Modal benchmark" as if it were a single number. It is not. The repo carries three distinct Modal surfaces, and treating them as interchangeable is how stale claims sneak back into reports.
The first lane is whole-model training benchmarks. This is where we measure
H200 throughput in real tok/sec from the global batch (not per-GPU), compare
structural variants under a controlled launch regime, and answer questions like
"does removing MTP hurt steady-state throughput?" or "is FSDP2 still worth it
versus DDP on this geometry?" Steady-state means post-warmup steps; step 0 is
discarded as a matter of hygiene. Eval is disabled for the duration of the
benchmark window so the number reported is throughput, not throughput plus
eval overhead.
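As a concrete illustration of the steady-state convention, here is a minimal sketch of the arithmetic, assuming a plain list of per-step wall-clock durations; the warmup cutoff and function name are illustrative, not the repo's actual parser:

```python
def steady_state_tok_per_sec(step_times, global_batch_tokens, warmup_steps=1):
    """Mean tok/sec over post-warmup steps.

    step_times: per-step wall-clock durations in seconds, indexed from
    step 0. Step 0 (and any further warmup steps) is discarded so
    JIT/compile time never pollutes the number.
    global_batch_tokens: tokens processed per step across ALL ranks --
    the global batch, not per-GPU.
    """
    steady = step_times[warmup_steps:]
    if not steady:
        raise ValueError("no post-warmup steps to measure")
    return global_batch_tokens * len(steady) / sum(steady)
```

For example, with a 10-second compile-heavy step 0 followed by two 2-second steps at a global batch of 8,192 tokens, the reported number is 4,096 tok/sec; naively including step 0 would report under 1,800.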
The second lane is the exact-token sparse detached benchmark. This one
benchmarks the sparse attention path in isolated eval/no-grad form so we can
record exact runtime telemetry and the backend identity that actually ran. The
supported launcher uses an explicit app.run(detach=True) lifecycle plus a
collector script; the local modal run ... --detach shortcut is intentionally
not the accepted contract here, because we want lifecycle objects we can audit
later. The artifacts on this lane are not throughput numbers; they are
bench_result, bench_telemetry, backend_identity, the remote runtime
provenance blob, and the saved app_id plus function_call_id.
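The saved app_id and function_call_id are what the collector needs later to reattach, so persisting them is the one step that cannot be skipped. A minimal sketch of that bookkeeping step, with illustrative field and directory names (the real launcher has its own schema):

```python
import json
import time
from pathlib import Path


def save_detach_receipt(app_id: str, function_call_id: str,
                        out_dir: str = "bench_artifacts") -> dict:
    """Persist the lifecycle IDs a detached run leaves behind.

    These two IDs are what the collector script uses to reattach and
    pull bench_result / bench_telemetry; losing them means losing the
    run. Field names here are illustrative, not the repo's schema.
    """
    receipt = {
        "app_id": app_id,
        "function_call_id": function_call_id,
        "saved_at_unix": time.time(),
    }
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    (path / f"{function_call_id}.json").write_text(json.dumps(receipt, indent=2))
    return receipt
```

The point of writing this to disk immediately after spawn, rather than at the end, is that a detached run by definition outlives the launching process.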
The third lane is sparse validation and FA4 promotion. This is bounded
acceptance: parity checks, promotion-readiness, summary manifests for a wave.
The success criterion is not "highest tok/sec" — it is promotion_status,
promotion_ready, and pass/fail summaries. A receipt here showing Triton
promotion success does not imply that FA4 runtime import and bootstrap are
healthy on the same image; that's a different field on the manifest, and you
have to read it.
The reason we keep these separate is purely operational: collapsing them produces wrong claims. A green sparse validation does not justify a number on the training lane. A throughput number from the training lane does not prove the sparse acceptance lane works. The cost of conflation is that someone quotes one as the other in a Slack thread two weeks later, and we end up re-running the box to disprove the misquote.
What works today
The single-GPU story is straightforward. Full training of our current dense NAM52-class model runs end-to-end on H100, H200, and B200 instances. The full pytest suite — 914 tests — passes on a Modal H100 image without skips.
The cross-GPU-class throughput comparison is where Modal earned its keep. Same model, same recipe, same image, three GPU classes:
| GPU | Best tok/sec (single device) | Relative to H100 |
|---|---|---|
| H100 | 2,780 | 1.00x |
| H200 | (intermediate) | varies by recipe |
| B200 | 4,316 | 1.55x |
The B200 number is the one that actually matters for capacity planning. It is 1.55x H100 on the same recipe — not 2x, not "Blackwell magic", just 1.55x — and at the spot price quoted below it does not pay for itself versus H200 unless we can keep it saturated. We mention that explicitly because the internal version of this table had a "B200 is the future" caption that was charitable to silicon and unfair to procurement.
Pricing context for capacity planning, per GPU-hour at the time of writing:
| GPU | $/hr | 8-GPU $/hr |
|---|---|---|
| B200 | 6.25 | 50.00 |
| H200 | 4.54 | 36.32 |
| H100 | 3.95 | 31.60 |
These move; the ratios move slower. The decision rule we use is: B200 only when we are bottlenecked on memory bandwidth or HBM capacity for the specific recipe, H200 by default, H100 only for cheap regression sweeps where a 1.55x gap is irrelevant.
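The decision rule falls out of back-of-envelope tokens-per-dollar arithmetic using the two tables above, ignoring utilization and queueing. A sketch (H200 throughput is recipe-dependent in the post, so it is omitted here):

```python
# tok/sec and $/hr taken from the tables above; spot prices move,
# so treat the ratio, not the absolute value, as the signal.
GPUS = {
    "H100": {"tok_per_sec": 2780, "usd_per_hr": 3.95},
    "B200": {"tok_per_sec": 4316, "usd_per_hr": 6.25},
}


def tokens_per_dollar(gpu: str) -> float:
    """Tokens trained per dollar at full saturation."""
    g = GPUS[gpu]
    return g["tok_per_sec"] * 3600 / g["usd_per_hr"]
```

At these prices, B200 comes out slightly *worse* per dollar than H100 at full saturation (~2.49M vs ~2.53M tokens/$), which is exactly why the rule above reserves B200 for recipes that are genuinely bandwidth- or HBM-bound.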
The 8-GPU hang
The honest part of any benchmarking post is the failure mode that cost the
most time. For us, on Modal, that was the 8-GPU FSDP2 path with
regional_compile enabled. The symptom is the worst kind: the run launches,
ranks initialize, the first forward pass enters — and then nothing. No
traceback, no NCCL timeout for a long while, no useful log slice.
The root cause is mundane once you see it. With a cold inductor cache, Triton JITs each kernel on first use. JIT time is not deterministic across ranks. Eight ranks therefore enter the first NCCL collective at eight different moments, and the collective deadlocks because some ranks are still inside the compiler.
The same code does not hang on our long-lived H200 training hosts because those hosts have warm inductor caches from prior runs; the JIT path is effectively a cache lookup, the rank skew collapses, and the collective proceeds. Modal containers are clean by default, so they hit the slow path every time.
There is no clever single-line fix. The options are:
- Pre-bake the inductor cache into the Docker image, sourced from a warm H200 VM. This is the cleanest fix and the one we are converging on; it moves the variance off the hot path and into image build time.
- Mount a Modal Volume with a pre-populated cache from a prior 8-GPU run on the same image. This works, but the cache has to come from an 8-GPU run on the matching image; an 8-GPU cache from a different image, or a 1-GPU cache from the right image, does not cover the kernel set.
- Sequential compile warmup. Tempting, but FSDP2 changes the graph in ways that make a "compile once on rank 0 and fan out" strategy unsafe, so we discarded it.
- Reduce model complexity for the Modal lane — fewer MoE experts, no MoD — to shrink the kernel set the first compile has to JIT. This is what we actually do for quick acceptance runs while the cache-baked image is being prepared.
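A sketch of the cache-baking flow from the first option, assuming TORCHINDUCTOR_CACHE_DIR (PyTorch Inductor's real cache-location env var) and placeholder script and path names; the step count and file layout are illustrative:

```shell
# On a warm H200 host: pin the inductor cache to one directory and run
# a couple of 8-GPU steps so every kernel in the set gets JIT'd once.
# (train.py and --max-steps are placeholders for the real entrypoint.)
export TORCHINDUCTOR_CACHE_DIR=/tmp/inductor-cache
torchrun --nproc_per_node=8 train.py --max-steps 2

# Ship the populated cache into the image build context...
tar czf inductor-cache.tgz -C /tmp inductor-cache

# ...and in the Dockerfile, unpack it and pin the same env var:
#   COPY inductor-cache.tgz /opt/
#   RUN tar xzf /opt/inductor-cache.tgz -C /opt && rm /opt/inductor-cache.tgz
#   ENV TORCHINDUCTOR_CACHE_DIR=/opt/inductor-cache
```

The constraint noted above still applies: the cache must come from the matching image and the matching 8-GPU geometry, or it will not cover the kernel set and the cold path returns.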
The relevant lesson is that "works on the long-lived training host" was hiding a real determinism gap. Modal forced it into the open by giving us a fresh container every time.
Data plumbing
Training data lives in a private GCS bucket (workspace placeholder name). For multi-GPU training, GCS FUSE parallel reads from inside Modal's container did not survive eight concurrent readers — we saw read stalls and partial-shard reads, not corruption, but the training step took the latency hit. The fix is dull and effective: pre-copy the relevant shards into a Modal Volume once, then mount the Volume and read from local disk. Throughput becomes deterministic and the GCS bill drops.
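The pre-copy step can be sketched with the stock gcloud and Modal CLIs; the bucket, wave, and volume names below are placeholders:

```shell
# One-time pre-copy of a shard wave into a Modal Volume.
# <bucket> and wave-42 are placeholders for the real GCS layout.
gcloud storage cp -r "gs://<bucket>/shards/wave-42" ./shards-local/
modal volume create training-shards
modal volume put training-shards ./shards-local/wave-42 /wave-42
```

After that, training functions mount the volume (e.g. via modal.Volume.from_name("training-shards")) and all eight ranks read from local disk instead of eight FUSE readers contending on GCS.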
Fused kernel wheels live in the same GCS area. We pin them by image so the "which kernels does this run actually use" question always has a single answer in the run manifest.
Bookkeeping is the deliverable
A throughput number with no provenance is a rumor. A throughput number with the right metadata is a receipt that survives the next stack upgrade. For each lane we record a different set of fields, deliberately.
For the whole-model training lane: app_id, function_call_id, the exact
launch flags (verbatim, not paraphrased), the parsed steady-state step
metrics, and the exact distributed mode. Without the distributed mode, "180k
tok/sec on H200" is unreproducible — it could be DDP, FSDP2, FSDP2 with
compile, or Megatron-style.
For the exact-token sparse lane: launcher args, case metadata, the exact
sparse env selectors, the runtime telemetry payload, the backend_identity
the run actually used, the remote runtime provenance, and the detached
collector's state transitions. The last one is what lets us reconstruct
"did this finish on its own or did the collector reattach?"
For the sparse validation / FA4 lane: the validation/promotion mode,
promotion_status, promotion_ready, the saved summary manifests, and
whether the run was detached or blocking.
If a Modal artifact does not carry these, we treat it as anecdote. If it does, it stays useful for months.
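The receipt-versus-anecdote rule is mechanical enough to express as a check. A sketch with illustrative lane and field names condensed from the lists above (the real manifests use their own schema):

```python
REQUIRED_FIELDS = {
    "training": {"app_id", "function_call_id", "launch_flags",
                 "steady_state_metrics", "distributed_mode"},
    "sparse_exact_token": {"launcher_args", "case_metadata",
                           "sparse_env_selectors", "telemetry",
                           "backend_identity", "provenance",
                           "collector_state_transitions"},
    "sparse_validation": {"mode", "promotion_status", "promotion_ready",
                          "summary_manifests", "detached"},
}


def classify_artifact(lane: str, artifact: dict):
    """Return ('receipt', []) if every lane-required field is present,
    otherwise ('anecdote', missing_fields)."""
    missing = sorted(REQUIRED_FIELDS[lane] - artifact.keys())
    return ("receipt", []) if not missing else ("anecdote", missing)
```

Running something like this at artifact-ingest time, rather than relying on reviewers to eyeball manifests, is what keeps a lane's numbers quotable weeks later.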
Practical routing
For anyone using the same surfaces, the routing is:
- scripts/modal_matrix.py for whole-model benchmark intent.
- scripts/modal_bench_dsa_backend_detach.py (paired with the collect script) for exact-token sparse acceptance.
- scripts/modal_sparse_validation_detach.py for sparse and FA4 promotion waves.
The convenience harness modal_benchmark.py is fine for one-off curiosity
runs. It is not the source of truth for distributed H200 claims, and we do
not let it become one.
What we will not claim
We will not claim that older checked-in Modal training JSON artifacts prove the current training lane is universally healthy. They are dated evidence on the wave they came from. The multi-GPU FSDP2 + compile lane is alive on warm-cache H200 hosts and is being made reliable on Modal via the cache-baked image; it is not yet a one-command experience for anyone who clones the repo and types "go". When it is, we will say so on the receipt, not in a tweet.
References
- MODAL_BENCHMARK_PLAN.md
- MODAL_MULTI_GPU_STATUS.md
- H200_STACK_SETUP.md
- tp_sp_ep_fsdp_h200_bringup_2026-04-07.md