What is the first artifact to read for a detached Modal run?

The manifest. It tells you whether the lane is stuck in launch, collector, or result-persistence state before you waste time on a blind re-run. Read it together with the first typed receipt you have, even if that receipt is only partial, because the pair tells you whether the failure is lifecycle, runtime, or reporting. If you need the evidence contract behind that advice, read Modal benchmark receipts next.

What should I capture before cancelling a stuck detached call?

Capture the Modal function-call identity, the latest manifest state, and the call graph or input status before you terminate anything. Modal exposes a FunctionCall object for spawned work, and that handle can be used to fetch results, inspect a call graph, or cancel the call; once cancellation is the right move, keep the call identity beside the manifest so the next person can tell whether they are looking at a timeout, terminated input, or collector-side gap. Observability and SLO dashboards is the closer sibling when that evidence needs to become a durable dashboard signal instead of a one-off note.
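The capture-before-cancel order can be sketched as a small helper. This is a minimal illustration, not MegaCpp's collector: `call` is assumed to be a handle that exposes `get_call_graph()` and `cancel()` (for example one obtained through `modal.FunctionCall.from_id`), while the manifest key `state` and the evidence file layout are hypothetical names chosen for the example.

```python
import json
from datetime import datetime, timezone


def capture_then_cancel(call, call_id, manifest, out_path, cancel=False):
    """Persist call identity, manifest state, and call graph, then optionally cancel."""
    try:
        # The call graph is stored as text so the evidence file stays JSON-safe.
        graph = repr(call.get_call_graph())
    except Exception as exc:
        # A failed lookup is itself evidence, so record it instead of raising.
        graph = f"call graph unavailable: {exc!r}"

    evidence = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "function_call_id": call_id,
        "manifest_state": manifest.get("state"),  # hypothetical manifest key
        "call_graph": graph,
        "cancel_requested": cancel,
    }
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(evidence, fh, indent=2)

    # Cancel only after the evidence is on disk.
    if cancel:
        call.cancel()
    return evidence
```

Writing the file before calling `cancel()` is the point of the sketch: if cancellation itself fails or the session dies, the identity and last known state are already durable.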

When should I stop retrying on Modal and move the lane elsewhere?

When correctness now depends on warm compile state, host-resident caches, or other state that Modal does not preserve well enough for the lane. At that point owned H200:8 is the more honest surface. Modal vs owned hardware is the routing guide; Modal multi-GPU issues and fixes is the closer runbook for first-forward hangs.

Why is provenance as important as stdout?

Because many apparent regressions are really image, volume, or environment drift. Without provenance you can compare two runs that never executed the same contract. The local GPU profile receipt sample and FA4 receipt summary sample are compact checked-in examples of measured outputs, backend truth, GPU shape, and execution identity living in the same record.

What should I compare before I call a run a real regression?

Compare the manifest state and the receipt provenance before you compare the headline number. If backend identity, image inputs, or visible GPU shape do not match, you do not yet have a benchmark regression; you have a comparison class problem. Modal benchmark receipts is the closest evidence-contract companion, and the local compile runtime env sample keeps the minimum public-safe shape of that check small.

Why do Modal Volume commit and reload semantics matter during debugging?

Because two runs can point at the same named Volume and still see different state if one run never committed its writes or another run never reloaded after an external update. When a rerun looks mysterious, Volume lifecycle can be part of the explanation rather than a side detail. Modal image and cold-start is the closest sibling post because cache warmth is often the visible symptom of the same lifecycle rule.

What is the fastest sanity check for a first-collective hang?

Check whether the sharded dimension was evenly divisible across the world size. If one rank can end up with an empty shard, you may have compile-time collective divergence rather than a transport failure. Padding to an evenly divisible shape or shrinking to a smaller repro is a better first move than treating it as generic fabric flakiness.

Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 9 min read • MegaCpp Engineering

Modal

Debugging

Benchmarks

Training

Observability

Modal Debugging Guide for Training and Benchmark Failures

A grounded guide for debugging Modal failures in MegaCpp: cold starts, multi-GPU hangs, image drift, detached collector issues, and volume or output-state bugs.

By MegaCpp Engineering

When a Modal run fails, the useful starting point is not raw stdout. It is classification: first by lane, then by lifecycle stage such as image bootstrap, detached launch, remote execution, collector state, benchmark-record generation, or output persistence. MegaCpp already has enough structure for that workflow, and the key is to inspect the artifact layer that matches the failure mode.

Here, debugging means operational triage, not single-stepping through code inside a live Modal container. The job is to identify which launch contract failed and which durable artifact narrows it fastest. If the lane is already running and you need kernel timing or hot-path analysis, profiler-guided optimization is the better next read; if the problem has narrowed to communicator behavior, NCCL and collective hangs is the closer sibling.

Why MegaCpp cares

Training failures are expensive, but ambiguous failures are worse. A clean error that says "image missing dependency" costs us minutes. A detached Modal app that appears to launch, hangs later, and leaves partial logs with no manifest update can waste a day.

MegaCpp makes this manageable because it already separates surfaces. The public benchmark notes in Modal training platform overview and Benchmarking the MegaCpp stack on Modal: multi-GPU lessons from rented boxes describe three Modal lanes with different success criteria and bookkeeping. That means the first debugging question is not "why did Modal fail". It is "which contract failed".

That distinction matters because the remedies differ:

whole-model training failures usually involve startup state, volumes, distributed mode, or steady-state metric parsing
exact-token sparse detached failures usually involve manifest lifecycle, backend identity, or remote runtime provenance
sparse validation failures are often promotion-status or backend-bootstrap problems rather than throughput problems

Once we debug at the contract boundary, the failures stop looking random.

What MegaCpp already exposes

MegaCpp already has the pieces of a real debugging guide.

The whole-model training launcher wires the app, image, GPU choice, mounted volumes, and runtime environment. That makes it the first place to look for cold-start or image drift issues. If the run never reaches useful work, the failure is often in this bootstrap layer, which is the same startup surface unpacked in Modal image and cold-start.

The benchmark matrix adds structure around that launch path by naming cases, setting flags, detaching execution, and recording per-case outputs. If one case regresses and the others do not, the matrix itself becomes the smallest useful repro. When the question is whether the recorded number is still comparable at all, Modal benchmark receipts is the companion read instead of another blind rerun.

The checked-in compile runtime env sample, GPU profile receipt sample, FA4 receipt summary sample, and PP compile warmup sample are the compact local proof surfaces for those bootstrap, provenance, and warmup boundaries.

First-touch debugging vocabulary for these articles:

manifest: the durable launch and collector state for a detached run
receipt or benchmark record: the structured result for one run, including provenance and measured outputs
provenance: the execution identity fields that let you tell whether two runs even belong to the same comparison class

For detached sparse runs, the launch path creates the remote call and writes a manifest, while the collector advances that manifest through detached, running, ok, or error. The collector also snapshots call-graph state on timeouts and errors, and it writes bench_result, bench_telemetry, backend_identity, remote_output_json, and remote_runtime_provenance once the remote result lands. That is already a debugging state machine.
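The lifecycle described above can be pictured as a tiny state machine. The sketch below is illustrative only: the four state names come from the text, but the allowed-transition table, the `state` and `transitions` keys, and the `advance` helper are assumptions made for the example, not the collector's actual code.

```python
from datetime import datetime, timezone

# Assumed transition table: a run may fail from any non-terminal state.
ALLOWED = {
    "detached": {"running", "error"},
    "running": {"ok", "error"},
    "ok": set(),
    "error": set(),
}


def advance(manifest, new_state, now=None):
    """Move a manifest dict to new_state and append a timestamped transition."""
    current = manifest["state"]
    if new_state not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new_state}")
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    manifest["state"] = new_state
    manifest.setdefault("transitions", []).append(
        {"from": current, "to": new_state, "at": stamp}
    )
    return manifest
```

Keeping the transition history in the manifest is what lets a later reader phrase a failure as a state transition that never happened rather than as an anecdote.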

A useful first read is the pair of lifecycle state and provenance payload. A manifest stuck in detached usually points at launch or scheduling. A manifest stuck in running with no durable receipt usually points at collector or remote execution. A result that lands with changed backend_identity or remote_runtime_provenance is often not a regression yet; it is proof that the comparison class changed underneath the same benchmark name.

The receipt and readback utilities provide the comparison layer. They normalize receipt shape, timestamps, run identity, summaries, and error fields so we can compare runs without reverse-engineering each file by hand. That is the same narrow evidence posture used in profiler and receipts and Modal benchmark receipts.

The observability layer covers the persistence side. It is not just logging. It defines structured artifact categories, manifest generation, staged local writes, and cloud upload behavior for reports, traces, summaries, and logs. When debugging feels impossible, it is usually because this layer was underused or bypassed.

Volume lifecycle belongs in that same category. Modal's current Volume docs still require explicit commit/reload thinking when multiple runs share a named Volume. That matters during debugging because "same Volume name" does not guarantee "same visible state right now."
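A minimal sketch of that commit/reload discipline follows. It assumes `volume` is an object with `commit()` and `reload()` methods, as `modal.Volume` provides, and that `mount_path` is where the Volume is mounted in the container; the helper names and the file layout are hypothetical.

```python
from pathlib import Path


def write_and_commit(volume, mount_path, relative_name, text):
    """Writer side: persist the write explicitly so other containers can see it."""
    target = Path(mount_path) / relative_name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    volume.commit()  # without this, another run may never observe the file
    return target


def reload_and_read(volume, mount_path, relative_name):
    """Reader side: refresh the local view before trusting what is on disk."""
    volume.reload()  # pick up commits made elsewhere since this container started
    target = Path(mount_path) / relative_name
    return target.read_text(encoding="utf-8") if target.exists() else None
```

When a rerun looks mysterious, checking whether both halves of this pair actually ran is usually cheaper than another launch.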

Backend drift belongs in the same first-pass check. A surprising number of "the run regressed" stories are really "the image, kernel set, or runtime identity changed underneath the same benchmark name." That is why the first artifact pair is not only manifest plus stdout, but manifest plus receipt provenance. The compact checked-in compile runtime env sample and GPU profile receipt sample are useful here because they keep backend identity and measured output in one small surface instead of splitting them across ad-hoc logs.
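The provenance-first comparison can be sketched in a few lines. The field names `backend_identity` and `remote_runtime_provenance` are the ones named in this article; treating a receipt as a plain dict, and the helper names themselves, are assumptions for illustration.

```python
PROVENANCE_KEYS = ("backend_identity", "remote_runtime_provenance")


def provenance_diff(baseline, candidate, keys=PROVENANCE_KEYS):
    """Return {key: (baseline_value, candidate_value)} for every mismatched key."""
    return {
        key: (baseline.get(key), candidate.get(key))
        for key in keys
        if baseline.get(key) != candidate.get(key)
    }


def same_comparison_class(baseline, candidate, keys=PROVENANCE_KEYS):
    """Headline numbers are only comparable when no provenance key differs."""
    return not provenance_diff(baseline, candidate, keys)
```

A non-empty diff means the question is "what changed underneath the benchmark name", not yet "why did the number move".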

The reporting layer matters here too, even though it reads more like benchmarking infrastructure than debugging infrastructure. In practice, a surprising number of "training failures" turn out to be provenance failures: different code revision, different visible GPU set, different machine shape, or a changed environment that nobody wrote down. Once the report layer records that information, a whole class of phantom regressions disappears.

The other useful property of the current code is that it gives us natural cut points. If the manifest exists but never advances beyond detached, the problem is usually launch or remote scheduling. If the manifest reaches running but not ok, the collector and call graph become the next stop. If the receipt exists but the artifact bundle is thin, the issue is no longer execution but observability discipline. Those are much better failure categories than "Modal is flaky".
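Those cut points map naturally onto a first-pass classifier. The sketch below assumes a manifest dict with a `state` key, an optional receipt dict, and a list of artifact names; the `required_artifacts` default and the handling of the `error` state are hypothetical choices for the example, since the text does not define what counts as a thin bundle.

```python
def classify(manifest, receipt=None, artifacts=(), required_artifacts=("summary",)):
    """Map manifest state, receipt presence, and artifact names to a failure category."""
    state = manifest.get("state")
    if state == "detached":
        return "launch or remote scheduling"
    if state == "running":
        return "collector or call graph"
    if state == "error":
        # Assumption: an explicit error state means the typed payload is the next read.
        return "remote error payload"
    if receipt is None:
        return "benchmark record missing"
    missing = [name for name in required_artifacts if name not in artifacts]
    if missing:
        return "observability discipline"
    return "compare provenance next"
```

For example, `classify({"state": "running"})` returns `"collector or call graph"`, which tells the reader where to look before any rerun is considered.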

How it lands in MegaCpp

The debugging posture we want in MegaCpp is straightforward: classify first, then inspect the right artifact.

Here is the operational table we follow.

Symptom	Likely layer	First surface to inspect	Typical fix direction
Long startup before any useful step	image/bootstrap or cold cache	image build, runtime environment, and the related cold-start notes	bake or pin the image, reduce bootstrap drift, preserve caches intentionally
8-GPU run hangs on first forward or first collective	distributed compile divergence	training receipts, launch flags, and multi-GPU status tracking	move back to owned H200:8, warm compile state, avoid treating Modal as current truth for that lane
Detached run exists but collector never finishes	manifest lifecycle or remote call state	manifest state, collector outputs, and receipt summaries	verify the remote call identity, poll state, inspect the call graph, capture the error payload
Result lands but numbers are suspect	lane mismatch or wrong bookkeeping	benchmark notes, receipts, and provenance metadata	compare only within the same lane and same metric contract
Logs exist but no durable report bundle	artifact persistence gap	observability settings and generated manifests	ensure artifact categories, summary files, and uploads are enabled
Re-run behaves differently from prior run	image drift or volume/state drift	image definition, mounted volumes, receipt provenance	pin image inputs, reload/commit volumes intentionally, compare provenance fields

That table is grounded in the implementation. The collector explicitly records last_polled_at, call-graph snapshots, completed_at, and typed error strings. The reporting layer records revision, host-class, runtime, and hardware metadata. The observability layer gives artifacts names and categories. This is already enough to stop most "it just hung" debugging from turning into folklore.
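One way those collector fields could be used is a staleness check that separates "still being polled" from "nobody is watching this run". The field names `last_polled_at` and `completed_at` come from the text; storing them as ISO 8601 strings with a UTC offset, the `state` key, and the ten-minute default are assumptions for this sketch.

```python
from datetime import datetime, timedelta, timezone


def is_stale(manifest, now, max_silence=timedelta(minutes=10)):
    """True when an unfinished run has not been polled within max_silence."""
    if manifest.get("state") not in ("detached", "running"):
        return False  # terminal states cannot be stale
    if manifest.get("completed_at"):
        return False
    last = manifest.get("last_polled_at")
    if last is None:
        return True  # never polled: the collector side never started
    return now - datetime.fromisoformat(last) > max_silence
```

A stale manifest points at the collector, while a freshly polled manifest that is still in running points at the remote call, so the same state name leads to two different next steps.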

One especially important subcase hides inside the generic "first collective hang" bucket: uneven sharding at compile time. If one rank gets an empty shard while the others still expect a collective, the compiler can optimize that rank differently and leave the rest waiting forever. Before blaming the network or NCCL in the abstract, check whether the failing sequence length, padding policy, or batch split left any rank with zero-length work; restoring even divisibility is often the fastest falsification step.
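The divisibility check is cheap enough to run before any launch. The sketch below assumes ceil-sized contiguous chunks, similar in spirit to how `torch.chunk` splits a dimension; real sharding code may pad or split differently, so treat the helper names and the splitting rule as illustrative assumptions.

```python
def shard_sizes(length, world_size):
    """Per-rank sizes when a dimension is split into ceil-sized contiguous chunks."""
    chunk = -(-length // world_size)  # ceiling division
    return [max(0, min(chunk, length - rank * chunk)) for rank in range(world_size)]


def empty_ranks(length, world_size):
    """Ranks that would receive zero-length work under the split above."""
    return [rank for rank, size in enumerate(shard_sizes(length, world_size)) if size == 0]


def padded_length(length, world_size):
    """Smallest length >= the input that divides evenly across world_size."""
    return -(-length // world_size) * world_size
```

With a dimension of 9 on 8 ranks, `empty_ranks(9, 8)` reports ranks 5, 6, and 7 as empty, and `padded_length(9, 8)` gives 16, which splits into two elements per rank.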

The same structure should shape our runbooks in MegaCpp. Instead of a single generic troubleshooting page, we want a small set of decision trees tied to receipt type. Detached sparse runs should send people toward manifest and backend-identity inspection. Training regressions should send them toward steady-state metrics, launch flags, and stateful volumes. Artifact gaps should send them toward the observability layer rather than back toward the launcher.

That sounds procedural, but it is really about compressing time to root cause. Most benchmark teams lose hours because they repeat the wrong first five steps. MegaCpp is already opinionated enough that we do not need more theory. We just need to keep the investigation aligned with the contracts the code already exposes.

Ablations and what we kept

The first ablation we rejected is treating all hangs as NCCL bugs. The known failing lane is 8-GPU FSDP2 plus compile with cold inductor state, where ranks spend different amounts of time in Triton compilation and then deadlock at collectives. That is a different class of bug than a generic network failure, and the workaround is different too, which is why the dedicated NCCL and collective hangs playbook exists alongside this Modal-specific one.

The second ablation we rejected is using only stdout as the debugging surface. Detached Modal execution breaks that habit. If the repo's detached contract is app.run(detach=True) plus a collector, then the manifest is not optional bookkeeping. It is part of the debugging interface. The collector proves this by storing state transitions and preserving remote-output payloads even when the remote path is inconvenient to inspect live.

The third ablation we rejected is assuming a fresh image is automatically good. Modal makes image drift easier to hide because launching is so convenient. If a dependency, wheel, or runtime patch changed between runs, the right response is to compare provenance and artifact bundles, not to assume the benchmark regressed. That is why the reporting layer and the receipt surfaces matter.

The fourth ablation we rejected is using one shared volume for everything. The training lane separates checkpoint state, compile cache, and data-locality state. That turns debugging from "something in storage is weird" into a narrower question: did the checkpoint persist, did the cache warm, did the copied dataset exist, did the right run name land in the benchmark record.

What we kept is a very explicit stage-based playbook.

Identify the lane from the benchmark notes for the run you are investigating.
Decide whether the failure is bootstrap, remote execution, collector, benchmark record, or output persistence.
Read the manifest and benchmark record before re-running.
Compare provenance, not just metrics.
Only then decide whether to retry on Modal, switch to owned H200:8, or move the question to TPU.

We also kept a bias toward explaining failures in terms of state transitions instead of one-off anecdotes. If the same bug can be phrased as "collector stayed in running because remote result never satisfied expected contract" or "a benchmark record existed but the summary never uploaded", it becomes fixable by another engineer later. If it stays as "that one weird Modal hang from Tuesday", it dies as tribal knowledge.

And we kept the rule that a clean no-go verdict is often the fastest fix. The public H200 training notes already give us a known boundary where owned H200:8 is the safer lane. The guide is not supposed to prove Modal can do everything. It is supposed to tell us quickly whether the current failure belongs to launch hygiene, manifest handling, output persistence, or a lane we should move elsewhere.

FAQ

Frequently asked questions

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

detached

A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

Volume

Modal's writable persistence surface for cache, checkpoints, copied shards, and other mutable state that must survive container turnover.

SLO

A single service-level objective such as deadline miss rate or first-token latency floor that the serving stack publishes per specialist rather than as one blended ensemble number.

Pipeline parallelism cuts the model by depth: each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... on the forward pass, and gradients flow back in the reverse order on the backward pass. Cost: a pipeline bubble whose idle fraction is roughly (stages − 1)/microbatches, so you need many microbatches per step to amortize it. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.
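The arithmetic in this definition can be checked with a short sketch. The helper names are hypothetical, and the bubble estimate assumes a GPipe-style schedule with idle fraction (stages − 1) / (microbatches + stages − 1); other schedules will give different numbers.

```python
def stage_layers(num_layers, pp):
    """Contiguous layer ranges per stage; earlier stages absorb any remainder."""
    base, extra = divmod(num_layers, pp)
    ranges, start = [], 0
    for stage in range(pp):
        size = base + (1 if stage < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def bubble_fraction(pp, microbatches):
    """Idle fraction of a step under the assumed GPipe-style schedule."""
    return (pp - 1) / (microbatches + pp - 1)


# 32 blocks with PP=8: every stage owns 4 consecutive layers.
layout = stage_layers(32, 8)
print([len(r) for r in layout])

# More microbatches per step shrink the bubble.
for m in (8, 32, 128):
    print(m, round(bubble_fraction(8, m), 3))
```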

Debugging

A practical playbook for triaging H200 out-of-memory failures: distinguish fragmentation from true exhaustion, isolate the largest activation…

Grounding

OOM Debugging Playbook for H200 Training Runs

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Secrets

Modal's credential-injection surface for environment variables and access tokens, kept separate from both the pinned image and writable Volumes.

Topic hubs

Entity Hub

Modal Training and Benchmark Operations

A curated Modal reading path: when the rented-GPU surface was useful, what broke on multi-GPU launches, how receipts were recorded, and how we kept the lane debuggable.

MegaCpp Engineering • MegaCpp

Modal Debugging Guide for Training and Benchmark Failures
