Frequently asked questions

Why not rebuild the image from scratch on every run?

Because the point of this lane is comparable startup, not maximum rebuild purity. The rebuild path is useful as recovery and provenance. The pinned image is the thing that keeps repeated runs readable.

What actually has to persist for a warm start to stay warm?

The compile cache is the biggest piece, but not the only one. Local shard copies and checkpoint-adjacent state matter too. Without them, a run can avoid the biggest compile cliff and still waste time on avoidable setup work.

Why keep the active compile cache on a Volume instead of only on a bucket mount?

Because the cache is a live writable working set. Bucket-backed storage is the cold source of truth for large artifacts; a Volume is the right place for state that changes during the run.

Why can a run still pause later after a warm step 0?

Because the warm start usually removes the biggest compile hit, not every later one. New backward graphs or later-shape subgraphs can still appear deeper into the run.

How do you keep a warmed cache from turning into stale state?

By treating cache warmth as a property of the image and launch shape, not as a reusable global good. A stack bump, changed graph shape, or suspect Volume starts a new cache lineage; the run is either a deliberate cold compile or a new seed, and the receipt says which one. The same provenance rule shows up in Modal benchmark receipts and Compile-time vs runtime tradeoffs.

Where do Memory Snapshots fit in this story?

They help with startup work such as imports and initialization. They do not replace the compile-cache boundary for training. The useful mental model is snapshot for startup reduction, warmed cache for compile comparability.

Why is a booted container not enough to start a benchmark receipt?

Because boot only proves that the image launched. The receipt-bearing lane still has to prove the CUDA, TorchInductor, distributed, and CUDA-graph environment is the one the run expects; otherwise a warm image can hide a different runtime contract. The local compile runtime env sample and CUDA graph env defaults sample are the public-safe checks we point readers to for that boundary.

Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 5 min read • MegaCpp Engineering

Modal

Docker

Cold Start

Inductor Cache

Triton

H200

Modal image construction and the cold-start tax we actually pay

How we layer the Modal training image, why every wheel is pinned to the training stack, how persistent volumes absorb the inductor-cache hit, and the 30-90 second startup tax we accept as the price of burst compute.

By MegaCpp Engineering

A Modal container is only "fast" if you already paid for the expensive parts somewhere else. For our training lane that usually means a pinned base image, a writable compile-cache Volume, and a data path that keeps object storage out of the hot loop. This post is the narrow operational version of that story: which startup layers we keep warm, which ones we still accept as tax, and why "fast boot" and "ready to benchmark" are not the same thing.

Why MegaCpp cares about this

We use Modal for short or medium jobs where startup time can dominate the value of the run: a quick FA4 smoke, a bounded preset sanity check, a regression probe after a kernel change. If startup costs twenty minutes, Modal is a worse version of a warm reserved host. If startup stays under about a minute and a half on the warm path, Modal becomes a genuinely better surface for that kind of work.
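To make that threshold concrete, here is a small arithmetic sketch of how much of a run's wall clock goes to startup. The 10-minute job length is an illustrative assumption, not a measured MegaCpp figure; only the twenty-minute and roughly 90-second startup values come from the paragraph above.

```python
def startup_fraction(startup_s: float, work_s: float) -> float:
    """Fraction of total wall-clock time spent before useful work begins."""
    if startup_s < 0 or work_s <= 0:
        raise ValueError("startup_s must be >= 0 and work_s must be > 0")
    return startup_s / (startup_s + work_s)


# Hypothetical 10-minute probe; only the startup figures come from the text.
WORK_S = 10 * 60
cold = startup_fraction(20 * 60, WORK_S)  # twenty-minute startup
warm = startup_fraction(90, WORK_S)       # roughly 90 s warm path

print(f"cold: {cold:.0%} of wall clock is startup")
print(f"warm: {warm:.0%} of wall clock is startup")
```

Under that assumption the cold path spends about two thirds of the run on startup, while the warm path spends about 13%.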

The second reason is compatibility. The training image is where the pinned torch build, Triton build, fused-kernel wheels, Python version, and CUDA-facing package set are forced to agree. If that contract drifts between runs, the result is not "the same benchmark with a different startup time." It is a different runtime surface entirely. The platform-level map lives in Modal training platform overview; this post stays on the image and cache side.

What MegaCpp built around this workflow

The working lane uses one pinned registry image plus a very small copied source overlay. The image already contains the training stack; the overlay only carries the code that actually changed. That separation matters because a large runtime-mounted tree can make startup noisy enough that the image contract stops being the thing you are measuring.

There is still a slower rebuild-from-scratch path, but it is a recovery tool, not the normal receipt-bearing path. The benchmarkable surface is the pinned image.

The next boundary is runtime eligibility. A container can boot cleanly and still be wrong for the intended kernel lane. That is why the startup lane keeps a small runtime check alongside the image: the local compile runtime env sample and CUDA graph env defaults sample are the compact public-safe examples of "runtime surface is healthy enough to proceed."
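The samples named above are not reproduced here. As a rough illustration of what such a check can look like, the sketch below compares a few environment variables against an expected contract. The variable names are real PyTorch, Triton, and CUDA settings, but the expected values and the `check_runtime_env` helper are hypothetical.

```python
import os
from typing import Mapping

# Hypothetical contract: real variable names, illustrative values only.
EXPECTED_ENV = {
    "TORCHINDUCTOR_CACHE_DIR": "/cache/inductor",
    "TRITON_CACHE_DIR": "/cache/triton",
    "CUDA_DEVICE_ORDER": "PCI_BUS_ID",
}


def check_runtime_env(expected: Mapping[str, str],
                      env: Mapping[str, str] = os.environ) -> list[str]:
    """Return human-readable mismatches; an empty list means eligible."""
    problems = []
    for name, want in expected.items():
        got = env.get(name)
        if got is None:
            problems.append(f"{name} is unset (expected {want!r})")
        elif got != want:
            problems.append(f"{name}={got!r} (expected {want!r})")
    return problems


# Demo against explicit dicts instead of the live process environment.
good = dict(EXPECTED_ENV)
drifted = {**EXPECTED_ENV, "TRITON_CACHE_DIR": "/tmp/triton"}
print(check_runtime_env(EXPECTED_ENV, good))     # no mismatches
print(check_runtime_env(EXPECTED_ENV, drifted))  # one mismatch reported
```

A real eligibility check would also have to cover library versions and GPU visibility; environment variables are only the cheapest part of the contract to verify.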

Storage is the other half of the cold-start design. We keep four classes of state separate:

bucket-backed storage for large external artifacts such as datasets and wheel mirrors
a checkpoint Volume for long-lived run outputs
an inductor-cache Volume for writable compile artifacts
a local-data Volume for copied hot shards used during training

That split is not decorative. A writable compile cache belongs on a Volume because it changes during the run. A bucket mount is better as the cold source of truth for large read-mostly inputs. The same distinction shows up operationally in Modal multi-GPU issues and fixes, where training stops reading the hot path directly from mounted object storage.
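One way to keep that split from eroding is to write it down as data and check it. The sketch below is illustrative only: the `StateClass` type, the class names, and the `violations` helper are hypothetical and are not taken from the MegaCpp codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateClass:
    name: str
    backing: str    # "bucket" or "volume"
    writable: bool  # mutated while the run is in flight?


# Mirrors the four classes listed above; mount paths are deliberately omitted.
STORAGE_PLAN = [
    StateClass("external-artifacts", "bucket", writable=False),
    StateClass("checkpoints", "volume", writable=True),
    StateClass("inductor-cache", "volume", writable=True),
    StateClass("local-data", "volume", writable=True),
]


def violations(plan):
    """Names of state classes that put hot writable state on a bucket mount."""
    return [s.name for s in plan if s.writable and s.backing == "bucket"]


# A plan that moves the active compile cache onto a bucket mount is flagged.
bad_plan = [StateClass("inductor-cache", "bucket", writable=True)]
print(violations(STORAGE_PLAN))
print(violations(bad_plan))
```

The point of the check is narrow: it catches the case where a live writable working set ends up on read-mostly object storage.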

The compile-cache seed itself is layered. The warm path tries the nearest durable seed first, then falls back outward:

a tarball captured from a previous good run
a pre-warmed tarball stored with the other external artifacts
a seed directory already baked into the image
a slower object-by-object recovery path

The important point is not the exact implementation detail. It is that warm startup is treated as an explicit artifact flow, not as luck.
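The fallback order above can be sketched as a single function that tries each seed in turn and reports which one it used. This is a minimal sketch under stated assumptions, not the MegaCpp implementation: the function name, parameters, and returned labels are hypothetical, and the slower object-by-object recovery is left to the caller.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path
from typing import Optional


def seed_compile_cache(cache_dir: Path,
                       last_good_tar: Optional[Path],
                       prewarmed_tar: Optional[Path],
                       baked_seed_dir: Optional[Path]) -> str:
    """Populate cache_dir from the nearest available seed and name the source."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for label, tar_path in (("last-good-tarball", last_good_tar),
                            ("prewarmed-tarball", prewarmed_tar)):
        if tar_path is not None and tar_path.is_file():
            with tarfile.open(tar_path) as tar:
                # Only unpack tarballs produced by a trusted earlier run.
                tar.extractall(cache_dir)
            return label
    if baked_seed_dir is not None and baked_seed_dir.is_dir():
        shutil.copytree(baked_seed_dir, cache_dir, dirs_exist_ok=True)
        return "baked-seed-dir"
    # Nothing cheap was available; the caller falls back to slow recovery.
    return "needs-object-recovery"


# Demo in a throwaway directory: no tarball exists, so the baked seed wins.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    baked = root / "image_seed"
    baked.mkdir()
    (baked / "graph.bin").write_bytes(b"compiled")
    source = seed_compile_cache(root / "cache", None, root / "absent.tar", baked)
    restored = (root / "cache" / "graph.bin").exists()
print(source, restored)
```

Returning the label matters as much as restoring the files, because it lets the run record which seed it actually started from.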

How it lands in MegaCpp

The image lane stays honest by keeping the overlay small and the mutable state off the image. That means copied code for the receipt-bearing lane, not a large runtime-mounted working tree; writable Volumes for compile artifacts, copied shards, and checkpoints; and bucket-backed storage for the large external inputs that should not be rebuilt into every image.

The warm path also has limits. A run can look warm at step 0 and still pay a smaller compile bill later when new backward graphs materialize. That is why the local compile warmup policy sample and PP compile warmup sample matter: the useful question is not only "did startup look fast" but also "which compile work was deferred rather than removed."
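A cheap way to notice deferred compile work is to look for step-time spikes after step 0. The sketch below is a heuristic under assumed inputs: the timings are made up, the 3x threshold is arbitrary, and a spike is only a hint that should be confirmed against compile logs.

```python
from statistics import median


def deferred_compile_steps(step_seconds: list[float],
                           spike_factor: float = 3.0,
                           skip_first: int = 1) -> list[int]:
    """Indices of post-warmup steps whose duration spikes far above the median."""
    steady = step_seconds[skip_first:]
    if not steady:
        return []
    baseline = median(steady)
    return [i for i, t in enumerate(step_seconds)
            if i >= skip_first and t > spike_factor * baseline]


# Made-up timings: step 0 looks warm, step 4 pays a later compile.
timings = [1.2, 1.0, 1.1, 1.0, 9.5, 1.0, 1.1]
late = deferred_compile_steps(timings)
print(late)
```

Here step 0 looks warm, yet step 4 stands out, which is the "deferred rather than removed" case described above.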

That is also where startup snapshots and compile caches part ways. A snapshot can help with import-time and initialization work. It does not replace a real warmed compile cache, because the first-forward and later backward graph surfaces still depend on the actual launch shape and runtime state. In practice we treat snapshots as startup accelerators and warmed cache state as the thing that keeps the measured lane comparable.

The copied-overlay versus runtime-mounted-tree boundary matters for the same reason. Both can be convenient. Only one gives a small, stable, receipt-bearing startup contract. For benchmark or evidence lanes we keep the copied path boring on purpose.

Ablations and what we kept

We did not keep the idea that rebuilding from scratch on every run is cleaner. It is slower, noisier, and much less useful for comparisons.

We did not keep the idea that a bucket mount should hold the active compile cache. That surface is excellent for large reads and the wrong fit for a hot writable working set.

We also did not keep the idea that an offline export or other static precompile story can replace the warmed image-plus-cache path for this training lane. The real training graph still depends on the actual launch geometry and runtime surface, so the public-safe answer remains smaller: use a pinned image, warm the cache that belongs to that image, and record the starting state honestly.
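"The cache that belongs to that image" can be made checkable by deriving a lineage key from the things the cache is assumed to depend on. This is a sketch, not the MegaCpp receipt format: the function names, the key fields, and the placeholder version strings are all hypothetical.

```python
import hashlib
import json


def cache_lineage_key(image_digest: str, stack: dict, launch_shape: dict) -> str:
    """Short stable hash over everything the warmed cache is assumed to depend on."""
    payload = json.dumps(
        {"image": image_digest, "stack": stack, "launch": launch_shape},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def start_state(current_key: str, cached_key) -> str:
    """What a receipt could record about the cache at launch."""
    return "warm" if cached_key == current_key else "new-lineage"


# Placeholder values only; real digests and versions would come from the image.
stack = {"torch": "A", "triton": "B", "python": "C"}
launch = {"gpus": 8, "seq_len": 4096}
key_a = cache_lineage_key("sha256:placeholder", stack, launch)
key_b = cache_lineage_key("sha256:placeholder", {**stack, "torch": "A2"}, launch)

print(start_state(key_a, key_a))  # same lineage
print(start_state(key_b, key_a))  # stack bump starts a new lineage
```

Whether a new lineage is then run as a deliberate cold compile or seeded afresh is an operator decision; the key only makes the mismatch visible.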

What survived is narrower:

one pinned registry image as the normal path
a small copied source overlay
layered compile-cache seeding
writable Volumes for hot mutable state
explicit separation between "startup was fast" and "the measured lane stayed warm later"

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Modal

A container-first GPU execution surface with explicit image, GPU, Volume, Secret, and detached-launch primitives that MegaCpp uses for isolated benchmark and validation lanes.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

Volume

Modal's writable persistence surface for cache, checkpoints, copied shards, and other mutable state that must survive container turnover.

cloud bucket mount

Modal's object-storage mount surface for large read-mostly artifacts such as datasets and wheel mirrors; MegaCpp keeps hot writable state on Volumes instead.

Memory Snapshots

Modal's snapshot restore surface for import-time or initialization-time startup state. In these articles it narrows cold-start tax but does not replace warm compile caches or host-local storage semantics.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
