When does Modal beat a reserved H200 box?

When the job is detached, reproducibility matters more than warm local state, and operator time is part of the real cost. On the MegaCpp side that usually means a fresh single-GPU smoke, a detached validation wave, or a bounded benchmark lane whose receipt is more important than absolute warm-start speed. Modal vs owned hardware is the direct continuation when the question becomes surface selection instead of platform shape.

Why not publish one Modal-versus-H200 crossover number here?

Because the crossover depends on the same receipt fields the benchmark lane already records: requested and observed GPU, cold or warm start, staged data path, compile-cache state, and whether the claim is still single-host. This overview keeps the rule qualitative; the quantitative answer belongs beside Modal benchmark receipts and Modal vs owned hardware, where cost, startup, storage, and topology can stay separate instead of becoming one blended comparison.

Should credentials or cache state live in the image?

No. Secrets belong in Modal Secrets or the cloud secret manager, and persistent state belongs in explicitly mounted Volumes. Secrets inject credentials; Volumes preserve writable runtime state; images hold the pinned software stack. Modal image and cold start is the sibling post for the image/cache split in practice.

Why split compiler cache, checkpoints, and scratch into separate volumes?

Because they have different retention and failure rules. Compiler cache can be cheaply rebuilt, checkpoints need resume semantics, and scratch data should be easy to rotate without risking the state that makes later runs reproducible. The local compile runtime env sample and compile warmup policy sample are the fastest checked-in proof surfaces for why compile state is its own lane.

What turns a detached Modal job into a real benchmark receipt?

Keeping the launch metadata, GPU choice, mounted state, and collected outputs together. The local GPU profile receipt sample shows the shape directly: measured loop only, requested-vs-observed dispatch, and peak memory next to throughput. The broader write-ups are Modal benchmark receipts and multi-GPU Modal benchmarks.

Which public Modal primitives does this overview lean on most?

GPU selection and count, Volumes, Secrets, cloud bucket mounts, and background or detached execution. The public docs also make two edge rules worth remembering: standard multi-GPU stays on one machine, and benchmarking should pin the exact GPU class instead of relying on auto-upgrades. MegaCpp layers its own lane-specific receipts and runbooks on top of those documented Modal surfaces, which is why this overview hands off to Modal image and cold start, Modal debugging playbook, and Modal benchmark receipts.

What does H200:8 mean in this Modal lane?

On Modal's standard multi-GPU surface it means one container with eight GPUs on the same physical machine. That is enough for the single-host FSDP2 and DDP stories in this lane, but it is not a claim about cross-node fabric behavior.

Why is a warm Modal Volume still not the same as a warm reserved host?

Because the storage class is part of the runtime contract. A Modal job can reuse Volumes and feel much less ephemeral than a fresh container, but that still is not the same as a reserved box reading from its own local NVMe with the same host-local caches and topology already in place. The honest comparison is cold-to-cold or warm-to-warm with the storage surface called out explicitly. Modal image and cold start, Modal vs owned hardware, and Training on 8x H200 SXM are the local proof surfaces for that distinction.

Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 8 min read • MegaCpp Engineering

Modal

Training

Benchmarks

Infrastructure

Modal Training Platform Overview

Why we use Modal for ad-hoc training and benchmark jobs, how the image, GPU, volume, and secret model is wired, and when Modal wins against reserved H200 or TPU capacity.

We run training and benchmarks on three surfaces: reserved H200 hosts, TPU slices, and Modal. Modal is not always the cheapest per GPU-hour and it is not always the fastest to warm up, but it is unusually good at letting one engineer launch a clean, isolated job without first coordinating access to a shared machine. This post is about where that trade lands in practice: what the image, GPU, volume, and secret model looks like, and the specific regimes where Modal beats reserved capacity and where it loses.

Why This Matters

For ad-hoc training and benchmarking, operator time is part of the cost model. A platform that takes longer to provision but makes every run clean and reproducible can be a better choice than a nominally cheaper machine that needs manual prep each time.

That trade-off reads more clearly next to Modal image and cold start, Modal debugging playbook, and Modal vs owned hardware, which cover the same platform from image, operations, and routing angles.

The Modal Surface

The Modal surface is not "one harness." A typical setup breaks into three distinct lanes:

whole-model training benchmarks
detached backend or microbenchmark runs with explicit manifests and result collection
validation runs where the result is acceptance or promotion, not just throughput

In the whole-model lane, a single modal.App can own the image, the GPU spec, the mounted storage, and the entrypoint function. Modal's GPU configuration supports a wide range of accelerator types and counts, which means the same basic control surface can cover anything from a single-GPU smoke test to a larger multi-GPU launch.

One public Modal detail matters for readers coming from bare-metal training: on the standard GPU surface, a multi-GPU request lands on one physical machine. That is the right mental model for the H200:8 and B200:8 discussion in Modal multi-GPU issues and fixes. If the question is true multi-node behavior, that is a different lane from the one this overview is describing.

First-touch definition: in these articles, H200:8 means one Modal container requesting eight GPUs on one host, not a general cross-node cluster claim. The checked-in GPU profile receipt sample is the compact local reminder that requested shape and observed dispatch both belong in the record.
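
The requested-versus-observed rule can be sketched in a few lines. This is a minimal illustration, not the checked-in receipt format: `GpuShape`, `parse_gpu_spec`, and `dispatch_record` are hypothetical names, and the observed device names are assumed to be whatever strings the job collects at startup (for example one `torch.cuda.get_device_name(i)` per visible device).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuShape:
    gpu_type: str
    count: int


def parse_gpu_spec(spec: str) -> GpuShape:
    """Parse a 'TYPE' or 'TYPE:COUNT' request string such as 'H200:8'."""
    gpu_type, _, count = spec.partition(":")
    if not gpu_type:
        raise ValueError(f"empty GPU type in {spec!r}")
    return GpuShape(gpu_type.upper(), int(count) if count else 1)


def dispatch_record(requested: str, observed_names: list[str]) -> dict:
    """Compare the requested shape with the device names the job actually saw."""
    want = parse_gpu_spec(requested)
    # Substring match on the device name is an assumption about naming, not a guarantee.
    type_ok = all(want.gpu_type in name.upper() for name in observed_names)
    return {
        "requested": requested,
        "observed": observed_names,
        "count_matches": len(observed_names) == want.count,
        "type_matches": bool(observed_names) and type_ok,
    }
```

Keeping both sides in one record means a silent substitution shows up as a failed `type_matches` field instead of as an unexplained throughput difference.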

The public Modal primitives behind those lanes are straightforward:

MegaCpp need	Modal surface	Why it matters here
Pick the accelerator shape	GPU type and count	Keeps single-GPU smokes, H200:8 launches, and B200 checks on one control surface
Keep writable state across runs	Volumes	Preserve compiler cache, checkpoints, and scratch state between detached jobs
Mount large external artifacts	Cloud bucket mounts	Expose datasets and wheel mirrors without baking them into every image
Inject credentials safely	Secrets	Keep storage and service credentials out of images and out of the repo
Submit work and collect receipts later	Background or detached execution	Preserve launch metadata and collected outputs instead of relying on one long terminal session
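
As a rough sketch of how the rows above fit together in one app file, the block below wires a GPU request, three named Volumes, and one Secret onto a single function. Every name in it (app name, Volume names, Secret name, mount paths, the unpinned `torch` install) is a placeholder chosen for illustration and not taken from the MegaCpp repository. Cloud bucket mounts are left out to keep the sketch short.

```python
# Hypothetical layout: Volume names, Secret name, and mount paths are placeholders.
VOLUME_MOUNTS = {
    "/state/compile-cache": "megacpp-compile-cache",
    "/state/checkpoints": "megacpp-checkpoints",
    "/state/scratch": "megacpp-scratch",
}
GPU_REQUEST = "H200:8"  # one container, eight GPUs, one physical machine
SECRET_NAME = "megacpp-storage-credentials"

try:
    import modal
except ImportError:  # lets the layout above be inspected without the SDK installed
    modal = None

if modal is not None:
    app = modal.App("megacpp-bench-sketch")
    # A real image would pin exact versions; "torch" alone is only a stand-in.
    image = modal.Image.debian_slim(python_version="3.11").pip_install("torch")

    @app.function(
        image=image,
        gpu=GPU_REQUEST,
        volumes={
            path: modal.Volume.from_name(name, create_if_missing=True)
            for path, name in VOLUME_MOUNTS.items()
        },
        secrets=[modal.Secret.from_name(SECRET_NAME)],
        timeout=60 * 60,
    )
    def benchmark_lane() -> dict:
        import torch

        # Record what the container actually received, not only what was requested.
        return {
            "requested_gpu": GPU_REQUEST,
            "observed_gpus": [
                torch.cuda.get_device_name(i)
                for i in range(torch.cuda.device_count())
            ],
        }
```

A file like this would typically be launched with `modal run --detach`, so the job outlives the caller's terminal session and results are collected afterwards.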

The first-touch distinction that trips people up most is storage. Cloud bucket mounts expose external object storage inside the container; Volumes carry the mutable working set that the job itself grows over time. In practice we keep datasets and wheel mirrors in bucket-backed storage, then copy the hot shards, caches, and receipts into Volumes before training starts.

More concretely:

a Modal Volume is the writable persistence surface for caches, checkpoints, copied shards, and receipts that must outlive one container
a cloud bucket mount is the object-storage view inside the container; it is a good fit for large sequential reads and mirrored artifacts, but a worse fit for hot random-write or append-heavy working state

That distinction is grounded locally, not just rhetorically. The checked-in compile runtime env sample and GPU profile receipt sample both assume a writable state surface that survives container turnover.
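
The copy step from a bucket-backed view into writable state can be sketched with the standard library alone. This is an illustration under assumptions: `stage_hot_shards` is a hypothetical helper, both directories are plain paths inside the container, and a size match is used as a cheap reuse check, not as a content guarantee.

```python
import shutil
from pathlib import Path


def stage_hot_shards(
    bucket_dir: Path, volume_dir: Path, shard_names: list[str]
) -> dict[str, str]:
    """Copy selected shards from a read-mostly mount into writable state.

    Returns a per-shard status ('staged' or 'reused') for the run receipt.
    """
    volume_dir.mkdir(parents=True, exist_ok=True)
    status: dict[str, str] = {}
    for name in shard_names:
        src = bucket_dir / name
        dst = volume_dir / name
        if dst.exists() and dst.stat().st_size == src.stat().st_size:
            status[name] = "reused"
            continue
        tmp = dst.with_name(dst.name + ".partial")
        shutil.copyfile(src, tmp)  # one sequential read from the bucket view
        tmp.replace(dst)  # rename so readers never see a half-copied shard
        status[name] = "staged"
    return status
```

Returning the status map, and not just copying silently, is the point: the receipt can then say which shards were reused and which were staged fresh for this run.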

Volumes are where Modal earns its keep. In practice, we separate at least four kinds of state:

a compiler-cache volume so compile-heavy jobs do not start from absolute zero every time
a checkpoint volume so long runs can resume cleanly
a local-dataset or scratch volume so preprocessed data and copied shards can survive container turnover
a user-home or tool-state volume for credentials, caches, and generated local artifacts that should persist across runs

That is the core Modal trick: make an ephemeral container behave just statefully enough to be useful. If you skip the volumes, every run is a cold-start experiment. If you separate them properly, Modal becomes fast enough for repeated benchmark waves while staying disposable.

A Volume is also a lifecycle boundary, not just a mount point. Writes made by one container need to become visible to later readers through the Volume commit and reload path, so a benchmark receipt should say whether the cache, checkpoint, or shard view was freshly staged, reused, or refreshed during the run. The debugging version of the same rule lives in Modal debugging playbook; the overview version is simpler: do not treat "same named Volume" as proof that two containers saw the same state.
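
One way to make that rule checkable is to fingerprint the mounted view at the start and end of a run and store the label in the receipt. The sketch below is an assumption-laden illustration: `volume_fingerprint` and `classify_view` are hypothetical helpers, and the fingerprint covers relative paths and sizes only, so it detects added, removed, or resized files but not same-size content changes.

```python
import hashlib
from pathlib import Path

# Fingerprint of a directory tree that contains no files.
EMPTY_FINGERPRINT = hashlib.sha256().hexdigest()[:16]


def volume_fingerprint(root: Path) -> str:
    """Hash relative paths and sizes under root into a short, stable digest."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        digest.update(f"{rel}\0{path.stat().st_size}\n".encode())
    return digest.hexdigest()[:16]


def classify_view(at_start: str, at_end: str) -> str:
    """Label the Volume view for the receipt from start and end fingerprints."""
    if at_start == EMPTY_FINGERPRINT:
        return "freshly_staged"
    return "reused" if at_start == at_end else "refreshed"
```

Two containers that report the same start fingerprint have at least seen the same file listing, which is a stronger statement than sharing a Volume name.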

A compiler-cache Volume only makes benchmark waves comparable when the receipt says whether the run was a warm-cache hit or a cache miss. A leaf edit that reuses most compiled objects belongs in a different bucket from a header or shape-contract change that invalidates the tree and rebuilds the CUDA/C++ surface. The local compile runtime env sample and compile warmup policy sample show the fields we keep explicit before comparing Modal against the cold-start lane in Modal image and cold start or the receipt lane in Modal benchmark receipts.

Secrets are boring on purpose. Credentials should live in Modal Secrets or a cloud-provider secret manager, not in the image and not in the repository. If object storage is mounted into the job, Modal's documented storage integrations are the right way to do it. Secrets inject environment variables; Volumes preserve mutable state; bucket mounts expose external files. Treating those as interchangeable is how a clean benchmark job turns into a cleanup job.

The benchmark lanes layer on top of that. The useful pattern is detach-and-collect: submit an explicit job, keep the launch metadata, and collect structured results later instead of relying on a long interactive terminal session. "Detached" here means the launcher preserves durable run identity and the result lands in a durable sink later; it does not mean "we left a terminal tab open and hoped logs were enough."

That detach-and-collect pattern is where Modal benchmark receipts and multi-GPU Modal benchmarks become more useful follow-ups than generic platform docs. The checked-in compile runtime env sample, compile warmup policy sample, GPU profile receipt sample, FA4 receipt summary sample, and CUDA graph env defaults sample are the compact local proof surfaces for that contract.
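
A minimal detach-and-collect sketch, assuming the durable sink is a directory (for example a path on a mounted Volume). The file names `manifest.json` and `result.json`, the field names, and both helper functions are hypothetical and are not the checked-in receipt schema.

```python
import json
import time
import uuid
from pathlib import Path


def write_launch_manifest(
    sink: Path, gpu_request: str, volumes: dict[str, str], command: list[str]
) -> str:
    """Persist launch metadata before the job starts and return its run id."""
    run_id = uuid.uuid4().hex
    run_dir = sink / run_id
    run_dir.mkdir(parents=True)
    manifest = {
        "run_id": run_id,
        "launched_at": time.time(),
        "gpu_request": gpu_request,
        "volumes": volumes,
        "command": command,
    }
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return run_id


def collect_receipt(sink: Path, run_id: str) -> dict:
    """Join the launch manifest with the job's result file, if it has landed."""
    run_dir = sink / run_id
    manifest = json.loads((run_dir / "manifest.json").read_text())
    result_path = run_dir / "result.json"
    if not result_path.exists():
        return {"status": "pending", "manifest": manifest}
    result = json.loads(result_path.read_text())
    return {"status": "collected", "manifest": manifest, "result": result}
```

The launcher only needs to remember the run id. Collection can happen from a different session or a different machine, which is what separates this from watching a terminal.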

Where Modal Wins

The practical routing rules ended up being simple:

Modal wins for detached benchmark waves, batch validation, quick single-GPU smokes, and situations where operator time matters more than warm local state.
Reserved H200 hosts win for tightly coupled multi-GPU training, cache-sensitive bringup, and runs where we want the machine to look similar from one day to the next.
TPU wins when the model and runtime are already aligned with the XLA lane and the question is scale or TPU economics rather than CUDA-specific behavior.

That sounds generic, but Modal's product model makes the first line genuinely strong. Its docs are explicit about the core building blocks: GPU selection, persistent Volumes, named Secrets, and detached execution. Those features are enough to make isolated runs pleasant without pretending Modal is the right answer for every distributed job.
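
For readers who prefer the three routing rules as a decision function, here is one possible encoding. It is a sketch only: the `Job` fields, the order in which the rules are checked, and the `"undecided"` fallback are assumptions added for illustration, and the article itself keeps the rule qualitative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    detached_benchmark: bool = False
    batch_validation: bool = False
    single_gpu_smoke: bool = False
    tightly_coupled_multi_gpu: bool = False
    cache_sensitive_bringup: bool = False
    needs_stable_host: bool = False
    xla_aligned: bool = False
    cuda_specific_question: bool = False


def route(job: Job) -> str:
    """Return 'tpu', 'reserved_h200', 'modal', or 'undecided'."""
    # TPU only when the stack is already on the XLA lane and the question is not CUDA-specific.
    if job.xla_aligned and not job.cuda_specific_question:
        return "tpu"
    # Warm, stable hosts for coupled training and cache-sensitive work.
    if job.tightly_coupled_multi_gpu or job.cache_sensitive_bringup or job.needs_stable_host:
        return "reserved_h200"
    # Isolated, receipt-driven work where operator time dominates.
    if job.detached_benchmark or job.batch_validation or job.single_gpu_smoke:
        return "modal"
    return "undecided"
```

The explicit `"undecided"` branch is deliberate: a job that matches none of the three lines needs a human decision, not a default platform.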

What Changed Our Minds

We stopped describing Modal as "just another cloud GPU provider." The useful distinction is the execution model: detached jobs, persistent mounted state, explicit app lifecycle, and fast operator handoff.

We also stopped describing Modal as "ephemeral" in the simplistic sense. A Modal container is ephemeral, but the working set does not have to be. Once cache, checkpoints, and scratch space are split into separate persistent Volumes, the platform behaves much more like a controlled disposable worker than like a stateless demo environment.

That distinction matters even more once you compare it with the cold-start and compile-skew problems in multi-GPU Modal benchmarks and the first-forward failure modes in Modal debugging playbook.

The storage boundary changed our language too. A warm Modal run reusing mounted Volumes is still not the same startup class as a reserved H200 host rereading state from local NVMe, and neither one is the same as a TPU lane reusing a populated XLA cache. Once we started writing receipts that kept those storage classes separate, a lot of fake "accelerator comparisons" collapsed back into what they really were: cold-versus-warm and network-backed-versus-local state comparisons. Modal vs owned hardware and Training on 8x H200 SXM are the local continuations when the storage contract becomes the main variable instead of the platform label.

That does not make Modal a universal answer. For tightly coupled multi-GPU training, warm dedicated hosts still have obvious advantages. But for detached jobs and fast iteration, Modal is one of the cleanest public platforms available.
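
The storage-class discipline described in this section can be sketched as a comparability check between two startup receipts. The `StartupReceipt` fields and the label strings below are hypothetical; the only rule taken from the text is that comparisons should be cold-to-cold or warm-to-warm, with the storage surface called out explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartupReceipt:
    surface: str        # e.g. "modal", "reserved_h200", "tpu"
    start: str          # "cold" or "warm"
    storage_class: str  # e.g. "modal_volume", "local_nvme", "xla_cache"
    compile_cache: str  # "hit" or "miss"
    single_host: bool


def is_like_for_like(a: StartupReceipt, b: StartupReceipt) -> bool:
    """Same start temperature, same cache state, same topology claim."""
    return (
        a.start == b.start
        and a.compile_cache == b.compile_cache
        and a.single_host == b.single_host
    )


def comparison_caveats(a: StartupReceipt, b: StartupReceipt) -> list[str]:
    """List every difference that must be stated next to a comparison."""
    caveats = []
    if a.start != b.start:
        caveats.append(f"start differs: {a.start} vs {b.start}")
    if a.compile_cache != b.compile_cache:
        caveats.append(f"compile cache differs: {a.compile_cache} vs {b.compile_cache}")
    if a.storage_class != b.storage_class:
        caveats.append(f"storage class differs: {a.storage_class} vs {b.storage_class}")
    if a.single_host != b.single_host:
        caveats.append("topology claim differs: single-host vs multi-host")
    return caveats
```

A differing storage class does not block the comparison in this sketch; it only has to be reported, which matches the rule of naming the storage surface, not hiding it inside a platform label.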

FAQ

Frequently asked questions

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Secrets

Modal's credential-injection surface for environment variables and access tokens, kept separate from both the pinned image and writable Volumes.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Volume

Modal's writable persistence surface for cache, checkpoints, copied shards, and other mutable state that must survive container turnover.

cloud bucket mount

Modal's object-storage mount surface for large read-mostly artifacts such as datasets and wheel mirrors; MegaCpp keeps hot writable state on Volumes instead.

detached

A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
