Modal image construction and the cold-start tax we actually pay
How we layer the Modal training image, why every wheel is pinned to the training stack, how persistent volumes absorb the inductor-cache hit, and the 30-90 second startup tax we accept as the price of burst compute.

A Modal container is only "fast" if you already paid for the expensive parts somewhere else. For our training lane that usually means a pinned base image, a writable compile-cache Volume, and a data path that keeps object storage out of the hot loop. This post is the narrow operational version of that story: which startup layers we keep warm, which ones we still accept as tax, and why "fast boot" and "ready to benchmark" are not the same thing.
Why MegaCpp cares about this
We use Modal for short or medium jobs where startup time can dominate the value of the run: a quick FA4 smoke, a bounded preset sanity check, a regression probe after a kernel change. If startup costs twenty minutes, Modal is a worse version of a warm reserved host. If startup stays under about a minute and a half on the warm path, Modal becomes a genuinely better surface for that kind of work.
The second reason is compatibility. The training image is where the pinned torch build, Triton build, fused-kernel wheels, Python version, and CUDA-facing package set are forced to agree. If that contract drifts between runs, the result is not "the same benchmark with a different startup time." It is a different runtime surface entirely. The platform-level map lives in Modal training platform overview; this post stays on the image and cache side.
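One way to make that contract checkable is to fingerprint the pinned package set and compare the fingerprint across runs. The following is a minimal sketch, not the MegaCpp implementation: the function name `stack_fingerprint` and every version string are hypothetical placeholders, since this post does not state the real pins.

```python
from __future__ import annotations

import hashlib
import json


def stack_fingerprint(pins: dict[str, str]) -> str:
    """Return a short, stable hash of a package -> version mapping."""
    # Sort keys so the fingerprint does not depend on insertion order.
    canonical = json.dumps(pins, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical pins; placeholders only, not the real image contents.
baseline = {"python": "3.x", "torch": "A", "triton": "B"}
reordered = {"triton": "B", "torch": "A", "python": "3.x"}
drifted = {**baseline, "triton": "C"}

same_contract = stack_fingerprint(baseline) == stack_fingerprint(reordered)
changed_contract = stack_fingerprint(baseline) != stack_fingerprint(drifted)
```

Two runs whose fingerprints differ should be treated as running on different runtime surfaces, whatever their startup times look like.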
What MegaCpp built around this workflow
The working lane uses one pinned registry image plus a very small copied source overlay. The image already contains the training stack; the overlay only carries the code that actually changed. That separation matters because a large runtime-mounted tree can make startup noisy enough that the image contract stops being the thing you are measuring.
There is still a slower rebuild-from-scratch path, but it is a recovery tool, not the normal receipt-bearing path. The benchmarkable surface is the pinned image.
The next boundary is runtime eligibility. A container can boot cleanly and still be wrong for the intended kernel lane. That is why the startup lane keeps a small runtime check alongside the image: the local compile runtime env sample and CUDA graph env defaults sample are the compact public-safe examples of "runtime surface is healthy enough to proceed."
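The idea of a runtime check that gates the run can be sketched as a comparison between an expected contract and what the booted container actually reports. This is an assumption-laden illustration: `runtime_mismatches` and the keys below are hypothetical and do not reproduce the contents of the linked samples.

```python
from __future__ import annotations


def runtime_mismatches(expected: dict[str, str], actual: dict[str, str]) -> list[str]:
    """List every expected key whose observed value differs or is missing."""
    problems = []
    for key, want in sorted(expected.items()):
        got = actual.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems


# Hypothetical contract; keys and values are placeholders.
expected = {"cuda_available": "true", "compile_cache_writable": "true"}
healthy = {"cuda_available": "true", "compile_cache_writable": "true", "extra": "ignored"}
booted_but_wrong = {"cuda_available": "true"}  # boots fine, cache is not usable

ok_report = runtime_mismatches(expected, healthy)
bad_report = runtime_mismatches(expected, booted_but_wrong)
```

An empty report means the lane may proceed. A non-empty report means the container booted but is not eligible for the intended kernel lane.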
Storage is the other half of the cold-start design. We keep four classes of state separate:
- bucket-backed storage for large external artifacts such as datasets and wheel mirrors
- a checkpoint Volume for long-lived run outputs
- an inductor-cache Volume for writable compile artifacts
- a local-data Volume for copied hot shards used during training
That split is not decorative. A writable compile cache belongs on a Volume because it changes during the run. A bucket mount is better as the cold source of truth for large read-mostly inputs. The same distinction shows up operationally in Modal multi-GPU issues and fixes, where training stops reading the hot path directly from mounted object storage.
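The placement rule above can be written down as a small policy check. This is a sketch under stated assumptions: the `StateClass` type, the `misplaced` helper, and the class names are hypothetical, and it models only the rule from the text (state written during the run should not be backed by a bucket), not Modal's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateClass:
    name: str
    backing: str              # "bucket" or "volume"
    written_during_run: bool  # True if the lane mutates this state


def misplaced(classes):
    """Return names of run-time writable state that is backed by a bucket."""
    return [c.name for c in classes if c.written_during_run and c.backing == "bucket"]


# The four classes from the list above, with hypothetical names.
layout = [
    StateClass("external-artifacts", "bucket", written_during_run=False),
    StateClass("checkpoints", "volume", written_during_run=True),
    StateClass("inductor-cache", "volume", written_during_run=True),
    StateClass("local-data", "volume", written_during_run=True),
]

# A deliberately wrong variant: the compile cache placed on the bucket.
bad_layout = [StateClass("inductor-cache", "bucket", written_during_run=True)]

layout_problems = misplaced(layout)
bad_layout_problems = misplaced(bad_layout)
```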
The compile-cache seed itself is layered. The warm path tries the nearest durable seed first, then falls back outward:
- a tarball captured from a previous good run
- a pre-warmed tarball stored with the other external artifacts
- a seed directory already baked into the image
- a slower object-by-object recovery path
The important point is not the exact implementation detail. It is that warm startup is treated as an explicit artifact flow, not as luck.
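The fallback order above can be sketched as a function that tries each seed in turn and reports which one it used. This is a minimal illustration, not the actual MegaCpp flow: `seed_compile_cache`, the paths, and the returned labels are hypothetical, and the slow object-by-object recovery step is left to the caller.

```python
from __future__ import annotations

import tarfile
import tempfile
from pathlib import Path
from typing import Optional


def seed_compile_cache(cache_dir: Path, tarballs: list[Path], baked_seed: Optional[Path]) -> str:
    """Populate cache_dir from the first usable seed and report its source."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for tarball in tarballs:  # nearest durable seed first
        if tarball.is_file() and tarfile.is_tarfile(str(tarball)):
            # Only extract tarballs you produced yourself.
            with tarfile.open(str(tarball)) as archive:
                archive.extractall(str(cache_dir))
            return f"tarball:{tarball.name}"
    if baked_seed is not None and baked_seed.is_dir():
        for src in baked_seed.rglob("*"):
            if src.is_file():
                dst = cache_dir / src.relative_to(baked_seed)
                dst.parent.mkdir(parents=True, exist_ok=True)
                dst.write_bytes(src.read_bytes())
        return "baked-seed"
    # Nothing usable: the caller falls back to the slow recovery path.
    return "needs-object-recovery"


# Demo with throwaway files standing in for real cache artifacts.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    seed = root / "seed"
    seed.mkdir()
    (seed / "graph.bin").write_bytes(b"compiled")
    good = root / "previous_run.tar"
    with tarfile.open(str(good), "w") as archive:
        archive.add(str(seed / "graph.bin"), arcname="graph.bin")
    missing = root / "missing.tar"

    used_tarball = seed_compile_cache(root / "cache_a", [missing, good], seed)
    used_baked = seed_compile_cache(root / "cache_b", [missing], seed)
    used_none = seed_compile_cache(root / "cache_c", [missing], None)
    restored = (root / "cache_a" / "graph.bin").read_bytes()
```

Returning the seed source as a string makes it easy to record which layer actually warmed the run, which is the "explicit artifact flow" part of the design.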
How it lands in MegaCpp
The image lane stays honest by keeping the overlay small and the mutable state off the image. That means copied code for the receipt-bearing lane, not a large runtime-mounted working tree; writable Volumes for compile artifacts, copied shards, and checkpoints; and bucket-backed storage for the large external inputs that should not be rebuilt into every image.
The warm path also has limits. A run can look warm at step 0 and still pay a smaller compile bill later when new backward graphs materialize. That is why the local compile warmup policy sample and PP compile warmup sample matter: the useful question is not only "did startup look fast" but also "which compile work was deferred rather than removed."
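A rough way to surface deferred compile work is to look for steps after step 0 that take far longer than a typical step. The sketch below is a wall-clock heuristic only, assuming per-step durations are already being logged; `late_compile_steps`, the `factor` threshold, and the timings are hypothetical, and a slow step may have causes other than compilation.

```python
from __future__ import annotations

import statistics


def late_compile_steps(step_seconds: list[float], factor: float = 5.0) -> list[int]:
    """Return indices after step 0 whose duration exceeds factor x the median."""
    if len(step_seconds) < 3:
        return []  # too few steps to estimate a typical duration
    typical = statistics.median(step_seconds[1:])
    return [i for i, t in enumerate(step_seconds) if i > 0 and t > factor * typical]


# Made-up timings: step 0 looks warm, step 3 pays a deferred bill.
timings = [1.2, 0.5, 0.5, 9.0, 0.5, 0.5]
flagged = late_compile_steps(timings)
too_short = late_compile_steps([1.0, 0.5])
```

Step 0 is excluded on purpose: it answers "did startup look fast", while the flagged indices answer "which work was deferred".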
That is also where startup snapshots and compile caches part ways. A snapshot can help with import-time and initialization work. It does not replace a real warmed compile cache, because the first-forward and later backward graph surfaces still depend on the actual launch shape and runtime state. In practice we treat snapshots as startup accelerators and warmed cache state as the thing that keeps the measured lane comparable.
The copied-overlay versus runtime-mounted-tree boundary matters for the same reason. Both can be convenient. Only one gives a small, stable, receipt-bearing startup contract. For benchmark or evidence lanes we keep the copied path boring on purpose.
Ablations and what we kept
We did not keep the idea that rebuilding from scratch on every run is cleaner. It is slower, noisier, and much less useful for comparisons.
We did not keep the idea that a bucket mount should hold the active compile cache. That surface is excellent for large reads and the wrong fit for a hot writable working set.
We also did not keep the idea that an offline export or other static precompile story can replace the warmed image-plus-cache path for this training lane. The real training graph still depends on the actual launch geometry and runtime surface, so the public-safe answer remains smaller: use a pinned image, warm the cache that belongs to that image, and record the starting state honestly.
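Recording the starting state can be as simple as emitting one structured line before step 0. This is a sketch under assumptions: the `StartingState` fields and all values below are hypothetical placeholders, not the format of an actual MegaCpp receipt.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StartingState:
    image_ref: str          # pinned image identifier
    cache_seed_source: str  # which seed layer populated the compile cache
    cache_entries: int      # files present in the cache before step 0
    startup_seconds: float  # time from launch to ready


def to_receipt_line(state: StartingState) -> str:
    """Serialize the starting state as one stable JSON line."""
    return json.dumps(asdict(state), sort_keys=True)


# Placeholder values for illustration only.
record = StartingState(
    image_ref="registry.example/train@sha256:<digest>",
    cache_seed_source="previous-run-tarball",
    cache_entries=128,
    startup_seconds=42.0,
)
line = to_receipt_line(record)
```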
What survived is narrower:
- one pinned registry image as the normal path
- a small copied source overlay
- layered compile-cache seeding
- writable Volumes for hot mutable state
- explicit separation between "startup was fast" and "the measured lane stayed warm later"
Frequently asked questions
Why not rebuild the image from scratch on every run?
What actually has to persist for a warm start to stay warm?
Why keep the active compile cache on a Volume instead of only on a bucket mount?
Why can a run still pause later after a warm step 0?
How do you keep a warmed cache from turning into stale state?
Where do Memory Snapshots fit in this story?
Why is a booted container not enough to start a benchmark receipt?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
- Modal: A container-first GPU execution surface with explicit image, GPU, Volume, Secret, and detached-launch primitives that MegaCpp uses for isolated benchmark and validation lanes.
- FA4: FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.
- Volume: Modal's writable persistence surface for cache, checkpoints, copied shards, and other mutable state that must survive container turnover.
- Cloud bucket mount: Modal's object-storage mount surface for large read-mostly artifacts such as datasets and wheel mirrors; MegaCpp keeps hot writable state on Volumes instead.
- Memory Snapshots: Modal's snapshot restore surface for import-time or initialization-time startup state. In these articles it narrows cold-start tax but does not replace warm compile caches or host-local storage semantics.
- H200: NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.