MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
Published · 4 min read · David Gornshtein
Tags: Modal, H200, TPU, Infrastructure, Benchmarks

Modal vs Owned H200:8 vs TPU: Which Surface We Use and Why

How we decide between Modal, reserved H200:8 hosts, and TPU slices based on operator overhead, latency to first useful step, benchmark hygiene, and failure isolation.

We do not treat Modal, owned H200 hosts, and TPU slices as interchangeable compute. They are different operating surfaces with different startup contracts, different persistence stories, and different claims they can support honestly. Modal is strongest when one clean detached run matters more than resident host state. Owned H200 wins when warm local state and tightly coupled multi-GPU behavior dominate. TPU is the right lane when the question itself is XLA-shaped rather than CUDA-shaped.

Why We Care

The wrong surface wastes more time than a mediocre kernel. In practice we keep seeing three job shapes:

  1. long-running distributed training where warm state and stable topology matter most
  2. detached benchmark or validation waves where one engineer wants a clean receipt quickly
  3. TPU-aligned work where the right question is already on the XLA side

That is why the choice is operational rather than ideological. A result is only as trustworthy as the surface that produced it.

How The Split Works

Modal works best when the unit of work is a self-contained job with explicit image, storage, and receipt boundaries. That makes it a good surface for detached validations, fresh-environment benchmark waves, and short runs where operator turnaround matters more than host continuity. The closest companion posts are Modal training platform overview, Modal benchmark receipts, and Modal debugging playbook.

One public caveat matters immediately: the standard multi-GPU Modal lane is still a single-host lane. In these articles, H200:8 means one container with eight GPUs on one machine, not a general multi-node claim.
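As a small illustration of that caveat, the sketch below turns a GPU string such as H200:8 into an explicit claim scope. The function names and the dictionary keys are hypothetical, chosen for this example; the only thing taken from the text is that the count after the colon is GPUs in one container on one machine, so the host count is pinned to one.

```python
def parse_gpu_spec(spec: str) -> tuple[str, int]:
    """Split a "TYPE:COUNT" GPU string; a bare "TYPE" is treated as one GPU."""
    gpu_type, sep, count = spec.partition(":")
    return gpu_type, (int(count) if sep else 1)


def claim_scope(spec: str) -> dict:
    """Describe what a result from this lane can claim (hypothetical field names)."""
    gpu_type, gpu_count = parse_gpu_spec(spec)
    return {
        "gpu_type": gpu_type,
        "gpus_per_host": gpu_count,
        # Single-host lane: the count never implies more than one machine.
        "hosts": 1,
    }


print(claim_scope("H200:8"))
```

Writing the host count into the receipt makes it harder to read an eight-GPU result as a multi-node result later.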

Owned H200 hosts are the opposite trade. They cost more operator effort and return a more stable state surface: warm local NVMe, resident compile artifacts, checkpoints that stay close to the training process, and a launch shape that changes less from run to run. That is why Training on H200:8 is the closer sibling when the job is already warm-state heavy.

TPU is different again. The useful comparison there is not "another accelerator with different pricing." It is a different runtime and compiler surface with its own cache and startup rules. That is why the natural follow-up reading for TPU is TPU v6e Host Bringup and ZeRO-3-shaped sharding on the XLA backend, not a flattened "GPU versus TPU" slogan.

The fairness rule across all three surfaces is the same: warm and cold state must be called out explicitly. A warm Modal run reusing Volumes is not the same startup class as a warm host rereading from local NVMe, and neither one is the same as a TPU run reusing populated XLA-side state.
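One way to make that rule mechanical is to label every run with a startup class and refuse to compare startup timings across classes. This is a minimal sketch: the enum members and type names are illustrative labels derived from the paragraph above, not fields from an actual MegaCpp receipt format.

```python
from dataclasses import dataclass
from enum import Enum


class Surface(Enum):
    MODAL = "modal"
    OWNED_H200 = "owned_h200"
    TPU = "tpu"


class StartState(Enum):
    COLD = "cold"
    WARM_VOLUME = "warm_volume"          # Modal run reusing Volumes
    WARM_LOCAL_NVME = "warm_local_nvme"  # warm host rereading from local NVMe
    WARM_XLA_STATE = "warm_xla_state"    # TPU run reusing populated XLA-side state


@dataclass(frozen=True)
class StartupClass:
    surface: Surface
    state: StartState


def same_startup_class(a: StartupClass, b: StartupClass) -> bool:
    """Startup timings are comparable only within one (surface, state) pair."""
    return a == b


warm_modal = StartupClass(Surface.MODAL, StartState.WARM_VOLUME)
warm_host = StartupClass(Surface.OWNED_H200, StartState.WARM_LOCAL_NVME)
print(same_startup_class(warm_modal, warm_host))
```

Both runs in the usage lines are "warm", yet the check fails, which is the point: warmth alone does not make two startups comparable.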

The checked-in proof surface for that comparison is intentionally small: Distributed debugging notes, compile runtime env sample, GPU profile receipt sample, and TPU bringup notes. Together they keep the comparison anchored in startup class, receipt fields, and debug boundaries instead of in a price chart with missing context.

| Surface | Best for | Main upside | Main downside | Closest local source of truth |
|---|---|---|---|---|
| Modal | Detached benchmark waves, validation jobs, single-host smokes | Low operator overhead, clean image boundary, easy receipt capture | Cold-start tax and weaker fit for tightly coupled multi-GPU work | Modal training platform overview |
| Owned H200:8 | Multi-GPU training, cache-sensitive bringup, warm reruns | Warm local state, stable topology, direct control over artifacts | Higher scheduling and operator burden | Training on H200:8 |
| TPU | XLA-aligned training and scale-up work | Good fit for XLA graphs and TPU-side economics | Different runtime model and weaker fit for CUDA-specific experiments | TPU v6e Host Bringup |
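The split above can be read as a small routing function. The sketch below is one possible encoding, not the actual MegaCpp scheduler: the `Job` fields, the returned labels, and the order in which the checks run are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical flags summarizing the job shape.
    xla_aligned: bool = False
    needs_warm_local_state: bool = False
    tightly_coupled_multi_gpu: bool = False


def route(job: Job) -> str:
    """Pick a surface for a job; check order is an assumed precedence."""
    if job.xla_aligned:
        return "tpu"
    if job.needs_warm_local_state or job.tightly_coupled_multi_gpu:
        return "owned_h200_8"
    # Self-contained detached benchmark, validation, or smoke job.
    return "modal"


print(route(Job()))
print(route(Job(needs_warm_local_state=True)))
print(route(Job(xla_aligned=True)))
```

Putting the XLA check first reflects the idea that a TPU-shaped question stays on the TPU lane regardless of the other flags; a different team could reasonably order the checks differently.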

Practical Routing Rules

We keep claims surface-aware. A Modal result supports a Modal claim. A warm host result supports a warm host claim. A TPU result supports a TPU claim. Treating them as directly interchangeable is how startup state, topology, and persistence get accidentally laundered into "the accelerator was faster."

That is also why provenance matters more than raw throughput. The useful unit is the tuple of code revision, surface, launch shape, starting state, and artifact bundle. Without that context, the comparison is usually saying less about hardware than it seems.
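The provenance tuple can be sketched as a frozen record plus a helper that lists which parts of the context differ between two runs. The field names follow the sentence above; the class name, the helper, and the example values are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunProvenance:
    code_revision: str
    surface: str
    launch_shape: str
    starting_state: str
    artifact_bundle: str


def differing_fields(a: RunProvenance, b: RunProvenance) -> list[str]:
    """Names of provenance fields that differ, in declaration order."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Placeholder values, not real run identifiers.
run_a = RunProvenance("rev-1", "modal", "H200:8", "cold", "bundle-a")
run_b = RunProvenance("rev-1", "owned_h200", "H200:8", "warm_local_nvme", "bundle-b")
print(differing_fields(run_a, run_b))
```

If the list contains anything besides the one variable under study, a throughput difference between the two runs cannot be attributed to that variable alone.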

The most practical heuristic is not pure dollars per GPU-hour. It is latency to first trustworthy result. A detached run can be more expensive per hour and still cheaper in engineering time if it produces a clean answer faster.
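A back-of-the-envelope version of that heuristic adds engineer time to the compute bill. Every number below is made up for illustration and does not reflect real Modal, H200, or staffing prices; the function name and parameters are likewise hypothetical.

```python
def cost_to_first_result(
    gpu_hourly_rate: float,
    gpu_count: int,
    wall_hours: float,
    engineer_hourly_rate: float,
    engineer_hours: float,
) -> float:
    """Compute spend plus engineer time needed to reach one trustworthy result."""
    compute = gpu_hourly_rate * gpu_count * wall_hours
    people = engineer_hourly_rate * engineer_hours
    return compute + people


# Illustrative only: a pricier-per-hour detached lane with little operator work,
# versus a cheaper-per-hour lane that needs more hands-on setup.
detached = cost_to_first_result(5.0, 8, 1.0, 100.0, 0.5)
hands_on = cost_to_first_result(3.0, 8, 1.0, 100.0, 2.0)
print(detached, hands_on)
```

With these invented inputs the lane with the higher GPU-hour rate is cheaper overall, because engineer time dominates the total. Different inputs can flip the result, which is why the heuristic is evaluated per job rather than fixed per surface.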

What Changed Our Thinking

We stopped treating Modal as just "cloud H200 with nicer UX." The important distinction is execution model: detached jobs, explicit mounted state, and a cleaner boundary between the image and the run receipt.

We also stopped flattening the comparison to "owned is stable, Modal is ephemeral." Modal can be meaningfully stateful when Volumes hold the warm working set. The owned lane simply inherits more of that state naturally.

And we became stricter about storage. Once a comparison mixes local NVMe, network-backed mounted state, and XLA-side cache reuse without saying so, it is no longer a fair accelerator comparison.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Modal

A container-first GPU execution surface with explicit image, GPU, Volume, Secret, and detached-launch primitives that MegaCpp uses for isolated benchmark and validation lanes.

detached

A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.

Volume

Modal's writable persistence surface for cache, checkpoints, copied shards, and other mutable state that must survive container turnover.

cloud bucket mount

Modal's object-storage mount surface for large read-mostly artifacts such as datasets and wheel mirrors; MegaCpp keeps hot writable state on Volumes instead.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
