MegaCpp Engineering | Applied C++ model systems
Grounded engineering note from the MegaCpp stack
Published | 8 min read | MegaCpp Engineering
Modal
Training
Benchmarks
Infrastructure

Modal Training Platform Overview

Why we use Modal for ad-hoc training and benchmark jobs, how the image, GPU, volume, and secret model is wired, and when Modal wins against reserved H200 or TPU capacity.

We run training and benchmarks on three surfaces: reserved H200 hosts, TPU slices, and Modal. Modal is not always the cheapest per GPU-hour and it is not always the fastest to warm up, but it is unusually good at letting one engineer launch a clean, isolated job without first coordinating access to a shared machine. This post is about where that trade lands in practice: what the image, GPU, volume, and secret model looks like, and the specific regimes where Modal beats reserved capacity and where it loses.

Why This Matters

For ad-hoc training and benchmarking, operator time is part of the cost model. A platform that takes longer to provision but makes every run clean and reproducible can be a better choice than a nominally cheaper machine that needs manual prep each time.

That trade-off reads more clearly next to Modal image and cold start, Modal debugging playbook, and Modal vs owned hardware, which cover the same platform from image, operations, and routing angles.

The Modal Surface

The Modal surface is not "one harness." A typical setup breaks into three distinct lanes:

  1. whole-model training benchmarks
  2. detached backend or microbenchmark runs with explicit manifests and result collection
  3. validation runs where the result is acceptance or promotion, not just throughput

In the whole-model lane, a single modal.App can own the image, the GPU spec, the mounted storage, and the entrypoint function. Modal's GPU configuration supports a wide range of accelerator types and counts, which means the same basic control surface can cover anything from a single-GPU smoke test to a larger multi-GPU launch.

One public Modal detail matters for readers coming from bare-metal training: on the standard GPU surface, a multi-GPU request lands on one physical machine. That is the right mental model for the H200:8 and B200:8 discussion in Modal multi-GPU issues and fixes. If the question is true multi-node behavior, that is a different lane from the one this overview is describing.

First-touch definition: in these articles, H200:8 means one Modal container requesting eight GPUs on one host, not a general cross-node cluster claim. The checked-in GPU profile receipt sample is the compact local reminder that requested shape and observed dispatch both belong in the record.
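To make the requested-versus-observed rule concrete, here is a minimal Python sketch. It assumes only the TYPE:COUNT string form used above (for example H200:8). The names `GpuShape`, `parse_gpu_shape`, and `shape_record`, and the record layout itself, are hypothetical and are not taken from the checked-in receipt sample.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuShape:
    gpu_type: str
    count: int


def parse_gpu_shape(spec: str) -> GpuShape:
    """Parse 'H200:8' into type and count; a bare 'H200' means one GPU."""
    gpu_type, sep, count_text = spec.strip().partition(":")
    if not gpu_type:
        raise ValueError(f"missing GPU type in {spec!r}")
    count = int(count_text) if sep else 1
    if count < 1:
        raise ValueError(f"GPU count must be positive in {spec!r}")
    return GpuShape(gpu_type.upper(), count)


def shape_record(requested: str, observed_type: str, observed_count: int) -> dict:
    """Keep the requested shape and the observed dispatch side by side."""
    req = parse_gpu_shape(requested)
    observed = GpuShape(observed_type.upper(), observed_count)
    return {
        "requested": {"gpu_type": req.gpu_type, "count": req.count},
        "observed": {"gpu_type": observed.gpu_type, "count": observed.count},
        "matches_request": req == observed,
    }
```

How the observed type and count are gathered inside the container is left out on purpose; the point of the sketch is only that a mismatch is stored in the record instead of being silently dropped.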

The public Modal primitives behind those lanes are straightforward:

| MegaCpp need | Modal surface | Why it matters here |
| --- | --- | --- |
| Pick the accelerator shape | GPU type and count | Keeps single-GPU smokes, H200:8 launches, and B200 checks on one control surface |
| Keep writable state across runs | Volumes | Preserve compiler cache, checkpoints, and scratch state between detached jobs |
| Mount large external artifacts | Cloud bucket mounts | Expose datasets and wheel mirrors without baking them into every image |
| Inject credentials safely | Secrets | Keep storage and service credentials out of images and out of the repo |
| Submit work and collect receipts later | Background or detached execution | Preserve launch metadata and collected outputs instead of relying on one long terminal session |

The first-touch distinction that trips people up most is storage. Cloud bucket mounts expose external object storage inside the container; Volumes carry the mutable working set that the job itself grows over time. In practice we keep datasets and wheel mirrors in bucket-backed storage, then copy the hot shards, caches, and receipts into Volumes before training starts.

More concretely:

That distinction is grounded locally, not just rhetorically. The checked-in compile runtime env sample and GPU profile receipt sample both assume a writable state surface that survives container turnover.
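The staging step described above, copying hot shards from the read-mostly bucket view into writable state, can be sketched in plain Python. This is an illustration under stated assumptions, not the MegaCpp staging code: both roots are ordinary directories here, the function name `stage_hot_shards` is hypothetical, and "already staged" is approximated by a matching file size.

```python
import shutil
from pathlib import Path


def stage_hot_shards(bucket_root: Path, volume_root: Path, shard_names: list[str]) -> dict:
    """Copy the named shards from a read-mostly source into a writable directory.

    Returns which shards were copied now and which were already present, so a
    receipt can tell a fresh staging run from a reuse run.
    """
    volume_root.mkdir(parents=True, exist_ok=True)
    copied, reused = [], []
    for name in shard_names:
        src = bucket_root / name
        dst = volume_root / name
        # Simplification: a same-size file at the destination counts as staged.
        if dst.exists() and dst.stat().st_size == src.stat().st_size:
            reused.append(name)
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(name)
    return {"copied": copied, "reused": reused}
```

A stricter version would compare checksums instead of sizes. The returned dictionary is the part that matters for receipts, because it records whether the run started from freshly copied or reused shards.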

Volumes are where Modal earns its keep. In practice, we separate at least four kinds of state:

  1. a compiler-cache volume so compile-heavy jobs do not start from absolute zero every time
  2. a checkpoint volume so long runs can resume cleanly
  3. a local-dataset or scratch volume so preprocessed data and copied shards can survive container turnover
  4. a user-home or tool-state volume for credentials, caches, and generated local artifacts that should persist across runs
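One way to keep those four kinds of state from bleeding into each other is to declare the layout once and check it before launch. The sketch below is an assumption-laden illustration: the role names and mount paths in `VOLUME_LAYOUT` are hypothetical placeholders, not the names used in the MegaCpp repository, and `overlapping_mounts` is a helper invented for this example.

```python
from pathlib import PurePosixPath

# Hypothetical roles and mount paths, one per kind of state listed above.
VOLUME_LAYOUT = {
    "compiler-cache": "/vol/compile-cache",
    "checkpoints": "/vol/checkpoints",
    "scratch-data": "/vol/scratch",
    "tool-state": "/vol/home",
}


def overlapping_mounts(layout: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of roles whose mount paths are equal or nested."""
    items = sorted(layout.items())
    clashes = []
    for i, (role_a, path_a) in enumerate(items):
        for role_b, path_b in items[i + 1:]:
            a, b = PurePosixPath(path_a), PurePosixPath(path_b)
            if a == b or a in b.parents or b in a.parents:
                clashes.append((role_a, role_b))
    return clashes
```

An empty result from `overlapping_mounts(VOLUME_LAYOUT)` means each kind of state has its own mount point, so clearing the scratch area cannot also wipe the compiler cache or the checkpoints.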

That is the core Modal trick: make an ephemeral container behave just statefully enough to be useful. If you skip the volumes, every run is a cold-start experiment. If you separate them properly, Modal becomes fast enough for repeated benchmark waves while staying disposable.

A Volume is also a lifecycle boundary, not just a mount point. Writes made by one container need to become visible to later readers through the Volume commit and reload path, so a benchmark receipt should say whether the cache, checkpoint, or shard view was freshly staged, reused, or refreshed during the run. The debugging version of the same rule lives in Modal debugging playbook; the overview version is simpler: do not treat "same named Volume" as proof that two containers saw the same state.
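A receipt can carry that label if the job fingerprints the mounted directory when the container starts and again when it finishes. The following is a rough sketch, assuming the mount is visible as a local directory; `fingerprint` and `classify_view` are hypothetical helpers, and whatever commit and reload calls the platform requires happen outside them. Hashing full file contents is only reasonable for small state, so a real version would likely hash sizes and modification times for large checkpoints.

```python
import hashlib
from pathlib import Path
from typing import Optional


def fingerprint(root: Path) -> Optional[str]:
    """Hash relative paths and file contents under root; None if empty or missing."""
    if not root.is_dir():
        return None
    files = sorted(p for p in root.rglob("*") if p.is_file())
    if not files:
        return None
    digest = hashlib.sha256()
    for path in files:
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def classify_view(before: Optional[str], after: Optional[str]) -> str:
    """Label how the run treated the state it saw at container start."""
    if before is None:
        return "freshly_staged" if after is not None else "empty"
    return "reused" if before == after else "refreshed"
```

Two containers that mount the same named Volume but record different starting fingerprints did not see the same state, which is exactly the case the receipt needs to expose.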

A compiler-cache Volume only makes benchmark waves comparable when the receipt says whether the run was a warm-cache hit or a cache miss. A leaf edit that reuses most compiled objects belongs in a different bucket from a header or shape-contract change that invalidates the tree and rebuilds the CUDA/C++ surface. The local compile runtime env sample and compile warmup policy sample show the fields we keep explicit before comparing Modal against the cold-start lane in Modal image and cold start or the receipt lane in Modal benchmark receipts.
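The bucketing rule can be enforced mechanically when receipts are aggregated. The sketch below assumes each receipt is a dictionary with `run_id`, `cache_state`, and `step_time_s` keys and that the cache state is one of `warm_hit` or `cold_miss`; those field names and values are hypothetical and may not match the checked-in samples.

```python
from collections import defaultdict
from statistics import median


def median_by_cache_state(receipts: list[dict]) -> dict[str, float]:
    """Group receipts by recorded cache state and take the median per group.

    Receipts without a usable cache_state are rejected rather than guessed,
    because an unlabeled run cannot be placed in either bucket.
    """
    groups: dict[str, list[float]] = defaultdict(list)
    for receipt in receipts:
        state = receipt.get("cache_state")
        if state not in {"warm_hit", "cold_miss"}:
            raise ValueError(f"receipt {receipt.get('run_id')!r} has no usable cache_state")
        groups[state].append(float(receipt["step_time_s"]))
    return {state: median(values) for state, values in groups.items()}
```

Rejecting unlabeled receipts is a deliberate choice: averaging a cold rebuild into a warm wave would make the summary number describe neither case.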

Secrets are boring on purpose. Credentials should live in Modal Secrets or a cloud-provider secret manager, not in the image and not in the repository. If object storage is mounted into the job, Modal's documented storage integrations are the right way to do it. Secrets inject environment variables; Volumes preserve mutable state; bucket mounts expose external files. Treating those as interchangeable is how a clean benchmark job turns into a cleanup job.
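Because injected secrets arrive as environment variables, the job side can stay small: check that the expected names are present, fail early if not, and never write the values into logs or receipts. A minimal sketch follows, with hypothetical helper names `require_env` and `redacted`; the variable names passed in are whatever the injected secret defines.

```python
import os
from typing import Mapping, Optional


def require_env(names: list[str], environ: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return the named variables; on failure, report missing names only, never values."""
    env = os.environ if environ is None else environ
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing credentials: " + ", ".join(sorted(missing)))
    return {name: env[name] for name in names}


def redacted(creds: Mapping[str, str]) -> dict[str, str]:
    """Produce a log-safe view that records which credentials were present."""
    return {name: "<set>" for name in creds}
```

Writing `redacted(creds)` into a receipt records that the credential was available at launch without copying it into a Volume or a log, which keeps the three surfaces separate.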

The benchmark lanes layer on top of that. The useful pattern is detach-and-collect: submit an explicit job, keep the launch metadata, and collect structured results later instead of relying on a long interactive terminal session. "Detached" here means the launcher preserves durable run identity and the result lands in a durable sink later; it does not mean "we left a terminal tab open and hoped logs were enough."

That detach-and-collect pattern is where Modal benchmark receipts and multi-GPU Modal benchmarks become the more useful follow-ups than generic platform docs. The checked-in compile runtime env sample, compile warmup policy sample, GPU profile receipt sample, FA4 receipt summary sample, and CUDA graph env defaults sample are the compact local proof surfaces for that contract.
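The shape of detach-and-collect can be shown without any platform calls. In this sketch the durable sink is just a directory, and the three functions (`write_manifest`, `write_result`, `collect`) and the JSON layout are hypothetical; the actual job submission is deliberately omitted.

```python
import json
import uuid
from pathlib import Path
from typing import Optional


def write_manifest(sink: Path, command: list[str], gpu: str) -> str:
    """Launcher side: record launch metadata under a fresh run id before submitting."""
    run_id = uuid.uuid4().hex
    run_dir = sink / run_id
    run_dir.mkdir(parents=True)
    manifest = {"run_id": run_id, "command": command, "gpu": gpu}
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return run_id


def write_result(sink: Path, run_id: str, result: dict) -> None:
    """Job side: write to a temporary name, then rename so readers never see a partial file."""
    tmp = sink / run_id / "result.json.tmp"
    tmp.write_text(json.dumps(result))
    tmp.replace(sink / run_id / "result.json")


def collect(sink: Path, run_id: str) -> Optional[dict]:
    """Collector side: return manifest plus result if the job finished, else None."""
    run_dir = sink / run_id
    result_path = run_dir / "result.json"
    if not result_path.exists():
        return None
    manifest = json.loads((run_dir / "manifest.json").read_text())
    return {"manifest": manifest, "result": json.loads(result_path.read_text())}
```

The run id is created before submission, so the launcher can exit immediately and a later session can call `collect` with nothing but the id and the sink path.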

Where Modal Wins

The practical routing rules ended up being simple:

That sounds generic, but Modal's product model makes the first line genuinely strong. Its docs are explicit about the core building blocks: GPU selection, persistent Volumes, named Secrets, and detached execution. Those features are enough to make isolated runs pleasant without pretending Modal is the right answer for every distributed job.

What Changed Our Minds

We stopped describing Modal as "just another cloud GPU provider." The useful distinction is the execution model: detached jobs, persistent mounted state, explicit app lifecycle, and fast operator handoff.

We also stopped describing Modal as "ephemeral" in the simplistic sense. A Modal container is ephemeral, but the working set does not have to be. Once cache, checkpoints, and scratch space are split into separate persistent Volumes, the platform behaves much more like a controlled disposable worker than like a stateless demo environment.

That distinction matters even more once you compare it with the cold-start and compile-skew problems in multi-GPU Modal benchmarks and the first-forward failure modes in Modal debugging playbook.

The storage boundary changed our language too. A warm Modal run reusing mounted Volumes is still not the same startup class as a reserved H200 host rereading state from local NVMe, and neither one is the same as a TPU lane reusing a populated XLA cache. Once we started writing receipts that kept those storage classes separate, a lot of fake "accelerator comparisons" collapsed back into what they really were: cold-versus-warm and network-backed-versus-local state comparisons. Modal vs owned hardware and Training on 8x H200 SXM are the local continuations when the storage contract becomes the main variable instead of the platform label.

That does not make Modal a universal answer. For tightly coupled multi-GPU training, warm dedicated hosts still have obvious advantages. But for detached jobs and fast iteration, Modal is one of the cleanest public platforms available.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Modal

A container-first GPU execution surface with explicit image, GPU, Volume, Secret, and detached-launch primitives that MegaCpp uses for isolated benchmark and validation lanes.

Secrets

Modal's credential-injection surface for environment variables and access tokens, kept separate from both the pinned image and writable Volumes.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Volume

Modal's writable persistence surface for cache, checkpoints, copied shards, and other mutable state that must survive container turnover.

cloud bucket mount

Modal's object-storage mount surface for large read-mostly artifacts such as datasets and wheel mirrors; MegaCpp keeps hot writable state on Volumes instead.

detached

A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
