MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
MegaCpp Engineering, 9 min read
Topics: Modal, Debugging, Benchmarks, Training, Observability

Modal Debugging Guide for Training and Benchmark Failures

A grounded guide for debugging Modal failures in MegaCpp: cold starts, multi-GPU hangs, image drift, detached collector issues, and volume or output-state bugs.

When a Modal run fails, the useful starting point is not raw stdout. It is classification: first by lane, then by lifecycle stage such as image bootstrap, detached launch, remote execution, collector state, benchmark-record generation, or output persistence. MegaCpp already has enough structure for that workflow, and the key is to inspect the artifact layer that matches the failure mode.

Here, debugging means operational triage, not single-stepping through code inside a live Modal container. The job is to identify which launch contract failed and which durable artifact narrows it fastest. If the lane is already running and you need kernel timing or hot-path analysis, profiler-guided optimization is the better next read; if the problem has narrowed to communicator behavior, NCCL and collective hangs is the closer sibling.

Why MegaCpp cares

Training failures are expensive, but ambiguous failures are worse. A clean error that says "image missing dependency" costs us minutes. A detached Modal app that appears to launch, hangs later, and leaves partial logs with no manifest update can waste a day.

MegaCpp makes this manageable because it already separates surfaces. The public benchmark notes in Modal training platform overview and Benchmarking the MegaCpp stack on Modal: multi-GPU lessons from rented boxes describe three Modal lanes with different success criteria and bookkeeping. That means the first debugging question is not "why did Modal fail". It is "which contract failed".

That distinction matters because the remedies differ from one contract to the next.

Once we debug at the contract boundary, the failures stop looking random.

What MegaCpp already exposes

MegaCpp already has the pieces of a real debugging guide.

The whole-model training launcher wires the app, image, GPU choice, mounted volumes, and runtime environment. That makes it the first place to look for cold-start or image drift issues. If the run never reaches useful work, the failure is often in this bootstrap layer, which is the same startup surface unpacked in Modal image and cold-start.

The benchmark matrix adds structure around that launch path by naming cases, setting flags, detaching execution, and recording per-case outputs. If one case regresses and the others do not, the matrix itself becomes the smallest useful repro. When the question is whether the recorded number is still comparable at all, Modal benchmark receipts is the companion read instead of another blind rerun.

The checked-in compile runtime env sample, GPU profile receipt sample, FA4 receipt summary sample, and PP compile warmup sample are the compact local proof surfaces for those bootstrap, provenance, and warmup boundaries.

First-touch debugging vocabulary for these articles:

  • manifest: the durable launch and collector state for a detached run
  • receipt or benchmark record: the structured result for one run, including provenance and measured outputs
  • provenance: the execution identity fields that let you tell whether two runs even belong to the same comparison class

For detached sparse runs, the launch path creates the remote call and writes a manifest, while the collector advances that manifest through detached, running, ok, or error. The collector also snapshots call-graph state on timeouts and errors, and it writes bench_result, bench_telemetry, backend_identity, remote_output_json, and remote_runtime_provenance once the remote result lands. That is already a debugging state machine.
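
To make the state-machine reading concrete, here is a minimal sketch of a manifest that only advances along allowed edges. The state names and the last_polled_at and completed_at fields come from this article; the transition graph, the Manifest class, and the advance helper are assumptions for illustration, not the MegaCpp collector.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# Assumed edges: the article names the states, not the legal transitions.
ALLOWED = {
    "detached": {"running", "ok", "error"},
    "running": {"ok", "error"},
    "ok": set(),
    "error": set(),
}


@dataclass
class Manifest:
    """Hypothetical in-memory stand-in for a detached-run manifest."""
    run_id: str
    state: str = "detached"
    last_polled_at: Optional[str] = None
    completed_at: Optional[str] = None
    history: List[Tuple[str, str, str]] = field(default_factory=list)


def advance(manifest: Manifest, new_state: str,
            now: Optional[datetime] = None) -> Manifest:
    """Record one collector poll and move the manifest if the edge is legal."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    manifest.last_polled_at = stamp
    if new_state == manifest.state:
        return manifest  # a poll with no change only refreshes last_polled_at
    if new_state not in ALLOWED[manifest.state]:
        raise ValueError(f"illegal transition {manifest.state} -> {new_state}")
    manifest.history.append((manifest.state, new_state, stamp))
    manifest.state = new_state
    if new_state in ("ok", "error"):
        manifest.completed_at = stamp  # terminal states close the run
    return manifest
```

Keeping the transition history next to the current state is what lets a later reader tell "never left detached" apart from "reached running and stalled".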

A useful first read is the pair of lifecycle state and provenance payload. A manifest stuck in detached usually points at launch or scheduling. A manifest stuck in running with no durable receipt usually points at collector or remote execution. A result that lands with changed backend_identity or remote_runtime_provenance is often not a regression yet; it is proof that the comparison class changed underneath the same benchmark name.
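
The lifecycle half of that first read can be written down as a small lookup. This is a sketch under assumptions: the triage function and its return strings are hypothetical, and the handling of the ok and error states is an extrapolation from the two stuck cases described above.

```python
from typing import Optional


def triage(manifest_state: str, receipt: Optional[dict]) -> str:
    """Map (manifest state, receipt presence) to the layer to inspect first."""
    if manifest_state == "detached":
        return "launch or scheduling"
    if manifest_state == "running" and receipt is None:
        return "collector or remote execution"
    if manifest_state == "error":
        return "remote failure: read the captured error payload"
    if manifest_state == "ok" and receipt is None:
        return "output persistence: run finished but no receipt landed"
    if receipt is not None:
        return "receipt present: compare provenance before metrics"
    return "unknown state: inspect the manifest by hand"


# Usage: a run that reports running but has produced no receipt.
print(triage("running", None))
```

The point of the function is ordering: it answers "where do I look first" before anyone reruns the job.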

The receipt and readback utilities provide the comparison layer. They normalize receipt shape, timestamps, run identity, summaries, and error fields so we can compare runs without reverse-engineering each file by hand. That is the same narrow evidence posture used in profiler and receipts and Modal benchmark receipts.

The observability layer covers the persistence side. It is not just logging. It defines structured artifact categories, manifest generation, staged local writes, and cloud upload behavior for reports, traces, summaries, and logs. When debugging feels impossible, it is usually because this layer was underused or bypassed.

Volume lifecycle belongs in that same category. Modal's current Volume docs still require explicit commit/reload thinking when multiple runs share a named Volume. That matters during debugging because "same Volume name" does not guarantee "same visible state right now."
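
A toy model helps show why a shared name is not the same as shared visible state. The classes below are hypothetical and deliberately simplified: they borrow only the commit and reload names from Modal's Volume API, and they do not model background commits, deletions, or how the real system treats uncommitted writes on reload.

```python
class ToyVolume:
    """Stand-in for one named volume: holds only the committed state."""

    def __init__(self) -> None:
        self.committed: dict = {}


class ToyMount:
    """Stand-in for one container's view of that volume."""

    def __init__(self, volume: ToyVolume) -> None:
        self.volume = volume
        self.view = dict(volume.committed)  # snapshot taken at mount time

    def write(self, path: str, data: str) -> None:
        self.view[path] = data  # visible locally only

    def commit(self) -> None:
        self.volume.committed.update(self.view)  # publish local writes

    def reload(self) -> None:
        self.view = dict(self.volume.committed)  # refresh from committed state


volume = ToyVolume()
writer, reader = ToyMount(volume), ToyMount(volume)

writer.write("ckpt/step_100", "weights")
writer.commit()
stale = "ckpt/step_100" in reader.view   # reader has not reloaded yet
reader.reload()
fresh = "ckpt/step_100" in reader.view   # now the checkpoint is visible
print(stale, fresh)
```

In this toy the reader sees nothing until it reloads, even though the writer has already committed, which is the "same name, different visible state" case the paragraph above warns about.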

Backend drift belongs in the same first-pass check. A surprising number of "the run regressed" stories are really "the image, kernel set, or runtime identity changed underneath the same benchmark name." That is why the first artifact pair is not only manifest plus stdout, but manifest plus receipt provenance. The compact checked-in compile runtime env sample and GPU profile receipt sample are useful here because they keep backend identity and measured output in one small surface instead of splitting them across ad-hoc logs.
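
A provenance comparison of that kind can be sketched as a flattened diff over the two identity payloads. The keys backend_identity and remote_runtime_provenance are the field names used in this article; the nested fields in the usage example (image, torch, gpu) and the helper names are assumptions for illustration.

```python
from typing import Any, Dict, Tuple

PROVENANCE_KEYS = ("backend_identity", "remote_runtime_provenance")


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Turn nested dicts into dotted-path keys so fields compare one by one."""
    if isinstance(value, dict):
        out: Dict[str, Any] = {}
        for key, inner in value.items():
            out.update(flatten(inner, f"{prefix}.{key}" if prefix else str(key)))
        return out
    return {prefix: value}


def provenance_diff(old: dict, new: dict) -> Dict[str, Tuple[Any, Any]]:
    """Return {dotted_field: (old_value, new_value)} for changed provenance."""
    a = flatten({k: old.get(k) for k in PROVENANCE_KEYS})
    b = flatten({k: new.get(k) for k in PROVENANCE_KEYS})
    return {
        k: (a.get(k), b.get(k))
        for k in sorted(set(a) | set(b))
        if a.get(k) != b.get(k)
    }


# Usage with made-up receipts: metrics differ, but so does the image.
before = {
    "backend_identity": {"image": "sha-a", "torch": "2.x"},
    "remote_runtime_provenance": {"gpu": "H200"},
    "bench_result": {"tok_s": 1000},
}
after = {
    "backend_identity": {"image": "sha-b", "torch": "2.x"},
    "remote_runtime_provenance": {"gpu": "H200"},
    "bench_result": {"tok_s": 900},
}
print(provenance_diff(before, after))
```

A non-empty diff means the two runs are not yet in the same comparison class, so the metric delta should not be read as a regression.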

The reporting layer matters here too, even though it reads more like benchmarking infrastructure than debugging infrastructure. In practice, a surprising number of "training failures" turn out to be provenance failures: different code revision, different visible GPU set, different machine shape, or a changed environment that nobody wrote down. Once the report layer records that information, a whole class of phantom regressions disappears.

The other useful property of the current code is that it gives us natural cut points. If the manifest exists but never advances beyond detached, the problem is usually launch or remote scheduling. If the manifest reaches running but not ok, the collector and call graph become the next stop. If the receipt exists but the artifact bundle is thin, the issue is no longer execution but observability discipline. Those are much better failure categories than "Modal is flaky".

How it lands in MegaCpp

The debugging posture we want in MegaCpp is straightforward: classify first, then inspect the right artifact.

Here is the operational table we follow.

Symptom | Likely layer | First surface to inspect | Typical fix direction
Long startup before any useful step | image/bootstrap or cold cache | image build, runtime environment, and the related cold-start notes | bake or pin the image, reduce bootstrap drift, preserve caches intentionally
8-GPU run hangs on first forward or first collective | distributed compile divergence | training receipts, launch flags, and multi-GPU status tracking | move back to owned H200:8, warm compile state, avoid treating Modal as current truth for that lane
Detached run exists but collector never finishes | manifest lifecycle or remote call state | manifest state, collector outputs, and receipt summaries | verify the remote call identity, poll state, inspect the call graph, capture the error payload
Result lands but numbers are suspect | lane mismatch or wrong bookkeeping | benchmark notes, receipts, and provenance metadata | compare only within the same lane and same metric contract
Logs exist but no durable report bundle | artifact persistence gap | observability settings and generated manifests | ensure artifact categories, summary files, and uploads are enabled
Re-run behaves differently from prior run | image drift or volume/state drift | image definition, mounted volumes, receipt provenance | pin image inputs, reload/commit volumes intentionally, compare provenance fields

That table is grounded in the implementation. The collector explicitly records last_polled_at, call-graph snapshots, completed_at, and typed error strings. The reporting layer records revision, host-class, runtime, and hardware metadata. The observability layer gives artifacts names and categories. This is already enough to stop most "it just hung" debugging from turning into folklore.

One especially important subcase hides inside the generic "first collective hang" bucket: uneven sharding at compile time. If one rank gets an empty shard while the others still expect a collective, the compiler can optimize that rank differently and leave the rest waiting forever. Before blaming the network or NCCL in the abstract, check whether the failing sequence length, padding policy, or batch split left any rank with zero-length work; restoring even divisibility is often the fastest falsification step.
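
The zero-length-work check is cheap to run before launch. The sketch below assumes a contiguous ceil-sized split, which is one common way to divide a batch across ranks; the real sharding policy in a given lane may differ, and the function names are hypothetical.

```python
from typing import List


def shard_sizes(num_items: int, world_size: int) -> List[int]:
    """Per-rank item counts under an assumed contiguous ceil-sized split."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    chunk = -(-num_items // world_size)  # ceil division
    sizes = []
    remaining = num_items
    for _ in range(world_size):
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


def empty_ranks(num_items: int, world_size: int) -> List[int]:
    """Ranks that would receive no work and could skip a collective."""
    return [r for r, n in enumerate(shard_sizes(num_items, world_size)) if n == 0]


# Usage: 9 items over 8 ranks looks harmless but starves the last three ranks.
print(shard_sizes(9, 8))   # [2, 2, 2, 2, 1, 0, 0, 0]
print(empty_ranks(9, 8))   # [5, 6, 7]
print(empty_ranks(16, 8))  # []
```

If empty_ranks returns anything for the failing configuration, padding or resizing the batch to an even multiple of the world size is the quick experiment to try before looking at the network.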

The same structure should shape our runbooks in MegaCpp. Instead of a single generic troubleshooting page, we want a small set of decision trees tied to receipt type. Detached sparse runs should send people toward manifest and backend-identity inspection. Training regressions should send them toward steady-state metrics, launch flags, and stateful volumes. Artifact gaps should send them toward the observability layer rather than back toward the launcher.

That sounds procedural, but it is really about compressing time to root cause. Most benchmark teams lose hours because they repeat the wrong first five steps. MegaCpp is already opinionated enough that we do not need more theory. We just need to keep the investigation aligned with the contracts the code already exposes.

Ablations and what we kept

The first ablation we rejected is treating all hangs as NCCL bugs. The known failing lane is 8-GPU FSDP2 plus compile with cold inductor state, where ranks spend different amounts of time in Triton compilation and then deadlock at collectives. That is a different class of bug than a generic network failure, and the workaround is different too, which is why the dedicated NCCL and collective hangs playbook exists alongside this Modal-specific one.

The second ablation we rejected is using only stdout as the debugging surface. Detached Modal execution breaks that habit. If the repo's detached contract is app.run(detach=True) plus a collector, then the manifest is not optional bookkeeping. It is part of the debugging interface. The collector proves this by storing state transitions and preserving remote-output payloads even when the remote path is inconvenient to inspect live.

The third ablation we rejected is assuming a fresh image is automatically good. Modal makes image drift easier to hide because launching is so convenient. If a dependency, wheel, or runtime patch changed between runs, the right response is to compare provenance and artifact bundles, not to assume the benchmark regressed. That is why the reporting layer and the receipt surfaces matter.

The fourth ablation we rejected is using one shared volume for everything. The training lane separates checkpoint state, compile cache, and data-locality state. That turns debugging from "something in storage is weird" into a narrower set of questions: did the checkpoint persist, did the cache warm, did the copied dataset exist, and did the right run name land in the benchmark record?

What we kept is a very explicit stage-based playbook.

  1. Identify the lane from the benchmark notes for the run you are investigating.
  2. Decide whether the failure is bootstrap, remote execution, collector, benchmark record, or output persistence.
  3. Read the manifest and benchmark record before re-running.
  4. Compare provenance, not just metrics.
  5. Only then decide whether to retry on Modal, switch to owned H200:8, or move the question to TPU.

We also kept a bias toward explaining failures in terms of state transitions instead of one-off anecdotes. If the same bug can be phrased as "collector stayed in running because remote result never satisfied expected contract" or "a benchmark record existed but the summary never uploaded", it becomes fixable by another engineer later. If it stays as "that one weird Modal hang from Tuesday", it dies as tribal knowledge.

And we kept the rule that a clean no-go verdict is often the fastest fix. The public H200 training notes already give us a known boundary where owned H200:8 is the safer lane. The guide is not supposed to prove Modal can do everything. It is supposed to tell us quickly whether the current failure belongs to launch hygiene, manifest handling, output persistence, or a lane we should move elsewhere.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Modal

A container-first GPU execution surface with explicit image, GPU, Volume, Secret, and detached-launch primitives that MegaCpp uses for isolated benchmark and validation lanes.

detached

A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

Volume

Modal's writable persistence surface for cache, checkpoints, copied shards, and other mutable state that must survive container turnover.

SLO

A single service-level objective such as deadline miss rate or first-token latency floor that the serving stack publishes per specialist rather than as one blended ensemble number.

PP

Pipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.

Debugging

A practical playbook for triaging H200 out-of-memory failures: distinguish fragmentation from true exhaustion, isolate the largest activation…

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Secrets

Modal's credential-injection surface for environment variables and access tokens, kept separate from both the pinned image and writable Volumes.
