MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
Published · 8 min read · David Gornshtein
H200
NCCL
NVLink
FSDP2
Torchrun
Training
Operations

Training on 8x H200 SXM: the operator playbook

End-to-end operator notes for driving an 8x H200 SXM node: topology, NCCL tuning, storage layout, and the invariants that keep a run from silently drifting.


An 8x H200 SXM node is a practical unit for training a mid-sized specialist model from scratch. On paper it looks like a larger-memory Hopper system. In practice the gap between a fresh machine and steady-state high-throughput training with reliable checkpoints is a sequence of small operational choices that, taken in the wrong order, cost days. This post focuses on that operator surface: how to drive the node, what the topology forces on the launch flow, which NCCL settings are worth making explicit, and how we keep receipts comparable. Two terms up front: NCCL hang triage means separating communicator bootstrap failures, compile-era watchdog timeouts, and steady-state collective skew before you change launcher policy, and a receipt is the compact run record that carries topology, env overlay, effective lane, and the first bounded step window together. For the knobs that become active once the launch lands, the closest follow-ups are H200 memory geometry, gradient accumulation and microbatching, checkpoint format and resume, and NCCL and collective hangs.

Why the operator surface is the contract

Training on this class of hardware is not about one heroic run. It is about repeating comparable launches across the same hardware class. That is a discipline problem, not just a performance problem. The launch surface is the contract. If two operators can produce different steady-state throughput on the same configuration because one forgot to pin a compile cache or left a stale NCCL setting behind, the comparison stops meaning anything.

The H200 memory geometry changes the shape of the decision surface, but it does not eliminate activation pressure. A model that fits comfortably at short context can still run into trouble once sequence length, activation checkpointing policy, or expert routing buffers move together. We want one launcher that lands cleanly in both regimes and escalates predictably when it does not.

The practical framing versus H100 is simple: the common eight-way H100 SXM lane gives you 80 GB per GPU, while the matching H200 SXM lane gives you 141 GB per GPU. That bigger H200 envelope often lets you postpone the most aggressive checkpointing policy at the same batch shape, but it does not remove the need for one once context length, routed-token buffers, or activation fan-out start rising together. The adjacent decision surfaces are H200 memory geometry, activation checkpointing deep dive, and activation checkpointing policy.

Topology, launcher, and what the wrapper owns

Each rank owns one H200. The eight GPUs sit on an NVLink and NVSwitch fabric, and NVIDIA's public H200 materials frame that fabric as a first-class part of the Hopper-era multi-GPU story. What the operator has to get right is that NCCL actually uses the expected topology and that nothing in the host image is silently routing traffic through a slower path, which is the same failure surface that later shows up in NCCL and collective hangs.

A fresh node still deserves one hard preflight: nvidia-smi topo -m should show the expected NVLink or NVSwitch fabric rather than an accidental fallback path.

On a healthy eight-way node, the GPU-to-GPU cells should read NV18 across the matrix. If you see SYS or PHB, stop and fix the topology first: that is a fallback onto CPU or PCIe paths, and any later NCCL tuning result will be noise. The same preflight is a good time to check that host automatic NUMA balancing is off and to run a quick all_reduce_perf sanity pass before attaching the full training stack.
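The matrix check above can be automated before every launch. The following is a minimal sketch, assuming the text of nvidia-smi topo -m has already been captured into a string; the function name `find_fallback_links` and the three-GPU `SAMPLE` are illustrative, not real output from a node, and the parser assumes the usual layout of one header row followed by one row per GPU.

```python
import re

GPU_NAME = re.compile(r"^GPU\d+$")
ANSI = re.compile(r"\x1b\[[0-9;]*m")  # some driver versions decorate the header row


def find_fallback_links(topo_text, expected="NV18"):
    """Return (row_gpu, col_gpu, link) for each GPU-to-GPU cell that is not `expected`."""
    header = None
    problems = []
    for line in ANSI.sub("", topo_text).splitlines():
        tokens = line.split()
        if not tokens or not GPU_NAME.match(tokens[0]):
            continue  # legend lines, NIC rows, blank lines
        if header is None:
            # The first GPU-led line is the column header; keep only GPU columns.
            header = [t for t in tokens if GPU_NAME.match(t)]
            continue
        row_gpu, cells = tokens[0], tokens[1:1 + len(header)]
        for col_gpu, link in zip(header, cells):
            if link != "X" and link != expected:
                problems.append((row_gpu, col_gpu, link))
    return problems


# Illustrative, hand-written matrix (not captured from hardware).
SAMPLE = """\
        GPU0    GPU1    GPU2    CPU Affinity
GPU0     X      NV18    SYS     0-47
GPU1    NV18     X      NV18    0-47
GPU2    SYS     NV18     X      48-95
"""

if __name__ == "__main__":
    for row, col, link in find_fallback_links(SAMPLE):
        print(f"{row} -> {col}: {link} (expected NV18)")
```

An empty result is the only passing state; anything else is a reason to stop before touching NCCL settings.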

| Concern | Knob | What it does |
| --- | --- | --- |
| stream overlap | unset CUDA_DEVICE_MAX_CONNECTIONS | preserves communication and compute overlap |
| stream priority | TORCH_NCCL_HIGH_PRIORITY=1 | helps comm streams avoid starvation |
| allocator policy | PYTORCH_ALLOC_CONF=expandable_segments:True | reduces fragmentation under changing workloads |
| heartbeat budget | TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=7200 | survives long compile windows |

One knob is worth singling out because older recipes get it backwards: CUDA_DEVICE_MAX_CONNECTIONS=1 is not a safe H200/FSDP2 baseline. On this lane it tends to serialize the very all_gather and reduce_scatter overlap that keeps step time sane. Unless you have measured reason to do otherwise, leave it unset and record that fact in the receipt.
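One way to keep that baseline out of shell history is to build the child environment in code and emit the receipt lines at the same time. This is a sketch only: the variable names and values come from the table above, while `apply_overlay`, `BASELINE_OVERLAY`, and `MUST_BE_UNSET` are hypothetical names chosen for illustration.

```python
BASELINE_OVERLAY = {
    "TORCH_NCCL_HIGH_PRIORITY": "1",
    "PYTORCH_ALLOC_CONF": "expandable_segments:True",
    "TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC": "7200",
}
MUST_BE_UNSET = ("CUDA_DEVICE_MAX_CONNECTIONS",)


def apply_overlay(base_env):
    """Return (env, receipt_lines) for the child process without mutating base_env."""
    env = dict(base_env)
    receipt = []
    for name in MUST_BE_UNSET:
        previous = env.pop(name, None)
        if previous is None:
            receipt.append(f"{name}: unset")
        else:
            # An inherited value is removed, and the receipt says so explicitly.
            receipt.append(f"{name}: unset (removed inherited value {previous!r})")
    for name, value in BASELINE_OVERLAY.items():
        env[name] = value
        receipt.append(f"{name}={value}")
    return env, receipt
```

A launcher could call `apply_overlay(os.environ)` and pass the returned mapping as the `env` argument of `subprocess.run`, writing the receipt lines to the log first.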

The broader lesson is more stable than any single knob: environment-sensitive runtime behavior belongs in the receipt surface, not in tribal memory. That is the same reporting contract we use in Profiler and performance reports. The checked-in compile runtime env sample, compile/runtime receipt sample, and GPU profile receipt sample are the compact local versions of that environment-and-provenance contract.

One detail from the longer H200 memory story is worth keeping visible here: expandable_segments:True is allocator-fragmentation control, not a magic model-size increase. On long runs with changing routing or batch shape, the failure often shows up as reserved drifting away from allocated even though the model math did not change. The more detailed readback is in why a 4B-8B model fills an H200 and still OOMs; this post keeps it in the operator checklist because late allocator drift is operationally indistinguishable from "the node just became unreliable" unless the receipt names it.
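The reserved-versus-allocated drift described above can be turned into a small check. The sketch below assumes the operator already samples both counters at the same point of each step (in PyTorch they would typically come from `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`); the 256 MiB tolerance and the name `drift_report` are hypothetical choices, not values from this post.

```python
MIB = 2**20


def drift_report(samples, tolerance_mib=256):
    """samples: list of (allocated_bytes, reserved_bytes), oldest first.

    Compares the first and last sample. A growing reserved-minus-allocated gap
    with flat allocated memory points at fragmentation; growing allocated
    memory points at real model or activation pressure.
    """
    first_alloc, first_res = samples[0]
    last_alloc, last_res = samples[-1]
    gap_growth = ((last_res - last_alloc) - (first_res - first_alloc)) / MIB
    alloc_growth = (last_alloc - first_alloc) / MIB
    if alloc_growth > tolerance_mib:
        verdict = "real memory pressure"
    elif gap_growth > tolerance_mib:
        verdict = "allocator drift"
    else:
        verdict = "stable"
    return {
        "gap_growth_mib": gap_growth,
        "alloc_growth_mib": alloc_growth,
        "verdict": verdict,
    }
```

Comparing only two samples is deliberately crude; a real receipt would likely keep the whole series so the trend can be re-read later.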

PyTorch's current CUDA memory notes name PYTORCH_ALLOC_CONF as the allocator control and keep PYTORCH_CUDA_ALLOC_CONF as a backward-compatible alias. If an existing launcher still exports the alias, the launch invariant is that the receipt records which allocator overlay actually reached the CUDA context.
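A launcher can at least record which of the two variables it exported. The sketch below is an assumption-heavy illustration: `effective_alloc_conf` is a hypothetical helper, it only inspects the environment mapping rather than the allocator itself, and because this post does not state which variable wins when both are set, it refuses to guess and raises instead.

```python
def effective_alloc_conf(env):
    """Return (variable_name, value) for the allocator overlay found in env.

    Returns (None, None) when neither variable is set. Raises ValueError when
    both are set to different values, since precedence is not assumed here.
    """
    current = env.get("PYTORCH_ALLOC_CONF")
    legacy = env.get("PYTORCH_CUDA_ALLOC_CONF")
    if current is not None and legacy is not None and current != legacy:
        raise ValueError(
            "PYTORCH_ALLOC_CONF and PYTORCH_CUDA_ALLOC_CONF disagree: "
            f"{current!r} vs {legacy!r}"
        )
    if current is not None:
        return ("PYTORCH_ALLOC_CONF", current)
    if legacy is not None:
        return ("PYTORCH_CUDA_ALLOC_CONF", legacy)
    return (None, None)
```

Writing the returned pair into the receipt makes a leftover alias visible instead of silently inherited.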

What the wrapper owns

A good launcher owns provenance, environment, the torchrun line, and log extraction. It should record the source revision, a short working-tree status, and the machine class into the log before the first Python import. That provenance boundary is the operator-side companion to Checkpoint format and resume.
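The provenance header can be sketched in a few lines. This example assumes the launcher runs inside a git checkout and that the machine class is a label the operator supplies; `git_output`, `provenance_lines`, and the label `8xH200-SXM` are hypothetical names used only for illustration.

```python
import subprocess


def git_output(*args):
    """Run a git command; return stripped stdout, or None if git or the repo is unavailable."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def provenance_lines(revision, status, machine_class):
    """Format the header that goes into the log before the training import."""
    if status is None:
        tree = "unknown"
    elif status == "":
        tree = "clean"
    else:
        tree = f"{len(status.splitlines())} changed path(s)"
    return [
        f"revision: {revision or 'unknown'}",
        f"working_tree: {tree}",
        f"machine_class: {machine_class}",
    ]
```

A wrapper would call `provenance_lines(git_output("rev-parse", "HEAD"), git_output("status", "--short"), "8xH200-SXM")` and write the result to the log before launching torchrun.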

It also has to preserve the node's real topology. The safe default is one node-local process group over all eight GPUs, with the receipt recording the effective mapping, rather than a wrapper that permutes local device IDs and leaves NCCL reasoning over a partial or renumbered view of the host. If the launcher hides the topology, the preflight and the later throughput receipt stop describing the same lane.

That is also why MegaCpp prefers torchrun or srun-style launchers over per-process GPU masking schemes. If each worker only sees a renumbered CUDA_VISIBLE_DEVICES slice and believes it owns its own cuda:0, topology discovery becomes less trustworthy, NVLink assumptions get harder to reason about, and later NCCL symptoms stop matching the preflight you ran on the full node.
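Each rank can report the view it was actually given, which makes a masked launch easy to spot in the receipt. The sketch below is illustrative: it reads `LOCAL_RANK` (set by torchrun) and `CUDA_VISIBLE_DEVICES` from an environment mapping, and `describe_local_mapping` together with its output keys are hypothetical names.

```python
def describe_local_mapping(env, expected_gpus=8):
    """Summarize the device view a rank was launched with."""
    visible = env.get("CUDA_VISIBLE_DEVICES")
    devices = None if visible is None else [d for d in visible.split(",") if d]
    local_rank = env.get("LOCAL_RANK")
    return {
        "local_rank": None if local_rank is None else int(local_rank),
        # None means no masking variable was present at all.
        "visible_devices": devices,
        # A shorter list than the node size means the rank sees a partial host.
        "masked_view": devices is not None and len(devices) < expected_gpus,
    }
```

A rank launched with `CUDA_VISIBLE_DEVICES=3` would report `masked_view: True`, which is the pattern this lane avoids.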

That launcher choice matters for the network side too. On H200-class NVSwitch nodes, hiding the full local GPU set behind per-process masking can leave the full-node nvidia-smi topo -m preflight and the later NCCL communicator describing different local maps. NVIDIA's NCCL tuning guidance says the library selects algorithms from detected topology, including NVLink domain size, and keeps NVLS enabled by default on supported NVLink systems. That is why the receipt should keep the effective local rank mapping and NIC affinity beside the launch line: a lane that looked like clean NV18 fabric in preflight still needs a matching communicator view once collectives start. NCCL and collective hangs plus training speed anatomy on H200 are the right next surfaces when a supposedly healthy node suddenly loses overlap.

That same rule gets stricter once the lane grows beyond a single node or starts using RDMA-attached storage and networking. NV18 across the local GPU fabric is necessary, but it is not sufficient: you also want the receipt to keep GPU-to-NIC affinity and the actual communicator map visible so that a multi-node slowdown does not get misread as an intra-node NVLink regression. For the operator-facing follow-up, NCCL and collective hangs covers the failure taxonomy and comms cost and overlap covers the "healthy but slower" case where the node topology is fine and the overlap budget is not.

exec torchrun --standalone --nproc_per_node=8 \
  -m <training-entrypoint> \
  --config "$CONFIG" --run_name "$RUN_NAME" \
  --device_batch_size "$DBS" --max_seq_len "$SEQ"
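If the wrapper is written in Python rather than shell, the same launch line can be built as a list and recorded verbatim in the receipt. This is a sketch under assumptions: the flags mirror the command above, `torchrun_argv` and `launch_line` are hypothetical helpers, and the entrypoint `my_pkg.train` in the usage note stands in for the elided training module.

```python
import shlex


def torchrun_argv(entrypoint, config, run_name, device_batch_size, max_seq_len, nproc=8):
    """Build the single-node torchrun argument list used by the launcher."""
    return [
        "torchrun", "--standalone", f"--nproc_per_node={nproc}",
        "-m", entrypoint,
        "--config", config,
        "--run_name", run_name,
        "--device_batch_size", str(device_batch_size),
        "--max_seq_len", str(max_seq_len),
    ]


def launch_line(argv):
    """Render argv as one shell-quoted line suitable for the receipt."""
    return shlex.join(argv)
```

For example, `launch_line(torchrun_argv("my_pkg.train", "cfg/base.yaml", "run1", 4, 8192))` yields a line that can be pasted back into a shell to reproduce the launch.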

State on disk and live monitoring

There are three categories of state and they should go to three places.

  • persistent training state belongs on a durable high-capacity data volume
  • per-process artifacts such as compile caches belong in per-run scratch space
  • operator receipts belong with the launch materials for that run

In practice the split is about traffic shape, not naming purity. Training reads want sustained throughput from the intended data root, checkpoint writes want burst-safe durable storage, and receipts should stay small and easy to collect beside the launch materials instead of disappearing into scratch. If those three paths collapse into one blurry directory tree, input stalls, checkpoint pauses, and provenance loss all look like the same "slow node" failure.

The invariant that should not drift is simple: the training data root must point at the intended data volume, not at an incidental home-directory path. Resume semantics only stay trustworthy if that split matches Checkpoint format and resume.
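That invariant is cheap to enforce before launch. The sketch below compares paths lexically only (it does not resolve symlinks or mounts), and `layout_problems` with its argument names is a hypothetical helper, not part of any existing launcher.

```python
from pathlib import PurePosixPath


def layout_problems(data_root, scratch, receipts, home):
    """Return human-readable problems with the three-way storage split."""
    paths = {
        "data_root": PurePosixPath(data_root),
        "scratch": PurePosixPath(scratch),
        "receipts": PurePosixPath(receipts),
    }
    home_path = PurePosixPath(home)
    problems = []
    # The data root must not be the home directory or anything beneath it.
    if paths["data_root"] == home_path or home_path in paths["data_root"].parents:
        problems.append("data_root is under the home directory")
    names = list(paths)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if paths[first] == paths[second]:
                problems.append(f"{first} and {second} point at the same directory")
    return problems
```

An empty list means the split is at least nominally intact; it says nothing about whether each volume has the bandwidth or durability the run needs.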

The useful operational addition is that the split is not just durable versus temporary. It is read bandwidth versus burst-write safety versus metadata noise. Keep the dataloader on the path built for sustained reads, keep checkpoint bursts on storage that can absorb them without stalling the node, and keep compile caches off that same hot path so a flood of small files does not masquerade as a training regression.

Takeaway

The H200 operator story is not that one knob solved training. It is that topology, launcher policy, and receipt discipline together make the node trustworthy enough to compare runs.

Once the node is stable, the next knobs are memory and batch-shape policies rather than topology. H200 memory geometry, Gradient Accumulation and Microbatching Under FSDP2, Training speed anatomy on H200, and CPU Offload and Startup Memory Calibration on H200 and GB10 are the next operator-facing layers of the same workflow. If you want the broader bring-up companion rather than one more knob-specific article, continue to H200 bring-up and naming.

FAQ

Frequently asked questions

What is the first preflight on a fresh H200 node?
Verify the topology you think you bought. nvidia-smi topo -m should confirm the expected NVLink or NVSwitch path before a long run starts.
What counts as a failed topology preflight?
Any GPU-to-GPU cell that shows SYS or PHB instead of NV18, or a quick NCCL bandwidth sanity pass that lands far below the expected NVLink domain. Fix that before you touch model knobs or env tuning.
What belongs in the same preflight besides nvidia-smi topo -m?
Two cheap checks pay for themselves: confirm host automatic NUMA balancing is disabled, and run one bounded NCCL bandwidth sanity test before the full stack attaches. If those fail, you still have a node problem, not a model problem, and NCCL and collective hangs is the right next surface.
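The NUMA half of that check can be scripted. This sketch assumes a Linux host that exposes the setting at /proc/sys/kernel/numa_balancing, where 0 means disabled; `numa_balancing_enabled` is a hypothetical helper name, and any non-zero value is treated as enabled.

```python
def numa_balancing_enabled(path="/proc/sys/kernel/numa_balancing"):
    """Return True or False from the sysctl file, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read().strip() != "0"
    except OSError:
        # Missing file or no permission: report "unknown" instead of guessing.
        return None
```

A preflight would fail the node when this returns `True` and note "unknown" in the receipt when it returns `None`.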
Why do receipts matter so much on this lane?
Because the whole point of the node is repeatable comparison. If provenance, env state, and launch policy drift, the throughput number stops meaning anything. The quickest checked-in proof surface is compile/runtime receipt sample plus compile runtime env sample: one records the effective lane, the other records the launcher overlay that produced it.
What belongs on durable storage versus scratch?
Checkpoints and other resume-critical training state belong on the durable data volume. Compile caches and other per-run accelerants can live in scratch, as long as the launcher records that split clearly enough to reconstruct the run and resume without guessing.
Why keep compile caches off the same hot path as data and checkpoints?
Because compile caches fail differently from the other two storage classes. Training data wants sustained reads, checkpoints want burst-write safety, and compile caches create lots of small-file and metadata churn. If all three land on the same hot path, metadata noise can look like a dataloader stall or a checkpoint pause even when the GPUs are fine. Regional compile without losing the plot is the compile-side follow-up, and Checkpoint format and resume is the durable-state companion.
Why does H200 reduce checkpointing pressure without removing it?
Because the main win is headroom, not immunity. Moving from the common H100 SXM 80 GB lane to the H200 SXM 141 GB lane gives activations, temporary buffers, and routed-token work more room to stay resident, so some runs can delay the first aggressive checkpointing step. But once context length, microbatch shape, and MoE traffic rise together, the same activation checkpointing policy and activation checkpointing deep dive decisions come back.
Is the H200 win only more memory, or also more memory bandwidth?
It is both. NVIDIA's published H200 specs pair 141 GB of HBM3e with about 4.8 TB/s of memory bandwidth, while the H100 SXM baseline is 80 GB of HBM3 at about 3.35 TB/s. That extra bandwidth does not remove the need for good overlap or checkpoint policy, but it does change how quickly a healthy lane can feed long-context and buffer-heavy training once the launch and topology surfaces are already clean.
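The two published figures above can be put side by side as ratios. The snippet uses only the numbers quoted in the answer; the dictionary and function names are illustrative.

```python
H100_SXM = {"hbm_gb": 80, "bandwidth_tb_s": 3.35}
H200_SXM = {"hbm_gb": 141, "bandwidth_tb_s": 4.8}


def ratio(key):
    """H200 figure divided by the H100 figure for the same key."""
    return H200_SXM[key] / H100_SXM[key]


capacity_ratio = ratio("hbm_gb")
bandwidth_ratio = ratio("bandwidth_tb_s")
print(f"capacity x{capacity_ratio:.2f}, bandwidth x{bandwidth_ratio:.2f}")
# prints: capacity x1.76, bandwidth x1.43
```

Capacity grows faster than bandwidth, which is consistent with the framing that the main win is headroom.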
What is the minimum useful H200 operator receipt?
Keep the topology preflight result, the launcher env overlay, the effective lane, the first bounded step window, and one peak-memory readback. In checked-in form that means compile runtime env sample, compile/runtime receipt sample, and GPU profile receipt sample. If the run was slow rather than dead, add goodput tracker sample so compile or checkpoint badput stays separate from steady-state step time.
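The five items in that answer map naturally onto a small record type. The sketch below is one possible shape, not the format of the checked-in samples: every field name is a hypothetical choice, and the step window is assumed to be a list of step times in seconds.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class OperatorReceipt:
    topology_preflight: str      # e.g. "NV18 on all GPU pairs"
    env_overlay: dict            # launcher-exported variables
    effective_lane: str          # free-form description of the lane that ran
    first_step_window: list      # step times (seconds) for the first bounded window
    peak_allocated_bytes: int    # peak-memory readback, allocated
    peak_reserved_bytes: int     # peak-memory readback, reserved

    def to_json(self):
        """Serialize with stable key order so receipts diff cleanly."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Sorted keys keep two receipts from the same configuration byte-comparable, which is the point of having them.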
Why make NCCL settings explicit instead of leaving them in shell history?
Because they change the runtime enough to break comparisons when they drift silently. The checked-in compile runtime env sample keeps the launcher overlay visible, and compile/runtime receipt sample keeps the resulting lane readable after those toggles take effect.
Why keep TORCH_NCCL_HIGH_PRIORITY=1 in the baseline?
Because it is a variance-control knob before it is a speed headline. On long H200 runs the communication stream has to stay visible enough that dense compute or compile windows do not starve collectives and turn a healthy lane into watchdog noise or step-time jitter. NCCL and collective hangs is the failure-mode companion, and compile runtime env sample keeps the toggle attached to the receipt instead of to somebody's shell memory.
Why leave CUDA_DEVICE_MAX_CONNECTIONS unset?
Because FSDP2 wants communication overlap, not forced serialization. Pinning it to 1 can turn a healthy H200 lane into a queueing problem by blocking the prefetch and reduction overlap that the step depends on.
What if the platform requires CUDA_DEVICE_MAX_CONNECTIONS to be explicit?
Treat that as a compatibility constraint, not as a new tuning baseline. Start from CUDA's default 8, record the forced value in the receipt, and only move away from it after a bounded overlap measurement. The failure mode to avoid is still the same: too few work queues can create false dependencies between otherwise independent streams, so 1 remains a diagnostic probe rather than a normal H200/FSDP2 launch setting. Comms cost and overlap is the next place to check whether the collectives are healthy but no longer hidden under useful compute.
Why is TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=7200 part of the baseline?
Because a cold compiled lane can spend a long window in graph build before the first steady-state collective shows up. If the watchdog budget is shorter than that compile window, a healthy H200 run dies looking like a network problem. Dynamo and torch.compile breakage is the compile-side explanation, and the checked-in goodput tracker sample shows why we keep compilation time separate from step time in the receipt.
Why is expandable_segments:True part of the baseline?
Because long H200 runs fail just as often from allocator drift as from raw model bytes. The flag does not make the model smaller; it keeps fragmentation from turning a healthy early receipt into a late OOM after the routed-token or batch shape changes. Why a 4B-8B model fills an H200 and still OOMs is the memory-side explanation.
How do I tell whether expandable_segments:True actually helped?
Do not judge it from a single OOM line. Keep one before/after peak-memory receipt with allocated, reserved, batch shape, sequence length, and env overlay. The flag is doing useful work only if the late-run gap between reserved and allocated memory stops growing under the same workload shape; if both numbers move together, you are looking at real model or activation pressure instead. The checked-in GPU profile receipt sample is the local proof surface, and Why a 4B-8B model fills an H200 and still OOMs is the longer memory-side follow-up.
Why should the launcher avoid per-process device renumbering?
Because NCCL needs a truthful view of the node-local topology. If a wrapper presents each rank with a different renumbered slice of the same host, the topology preflight, communicator behavior, and measured receipt can drift apart even though the hardware is unchanged. That is why this lane prefers torchrun or srun-style launchers over actor patterns that remap CUDA_VISIBLE_DEVICES per process: keep the eight local GPUs in one launcher-owned view and record the effective mapping in the receipt.
What extra preflight matters on multi-node or RDMA-attached H200 lanes?
Keep the local NV18 check, but add one more boundary check: the communicator should describe the same GPU ordering and NIC affinity that the host topology reported before launch. That means one bounded collective sanity pass plus a receipt that records local rank mapping and NIC placement, so you can tell the difference between an inter-node transport problem and a broken intra-node topology. NCCL and collective hangs and comms cost and overlap are the next two reads when that seam looks suspicious.
Should I force NCCL_MNNVL_ENABLE?
Usually no. NVIDIA's current NCCL guidance for supported multi-node NVLink systems is that topology detection and algorithm choice should work without extra tuning, and NCCL_MNNVL_ENABLE=0 is the documented escape hatch when you explicitly want to disable that path. The safer operator rule is to keep the variable out of the baseline unless you are debugging a specific transport issue, and to treat the receipt plus preflight evidence as more trustworthy than cargo-culted env overlays. If a diagnostic run needs the setting to be explicit, record 2 as automatic detection or 0 as disable. Forcing 1 is a failure-seeking probe because NCCL initialization is allowed to fail when MNNVL is not supported or cannot be enabled, so it does not belong in the normal H200 launch overlay.
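The rule in that answer can be written as a small classifier for the receipt. The value meanings (0, 1, 2) are taken from the answer above; `classify_mnnvl` and the wording of its labels are hypothetical.

```python
def classify_mnnvl(env):
    """Describe how NCCL_MNNVL_ENABLE is set in an environment mapping."""
    value = env.get("NCCL_MNNVL_ENABLE")
    if value is None:
        return "baseline: variable not set"
    meanings = {
        "0": "disabled (escape hatch)",
        "1": "forced on (diagnostic probe only)",
        "2": "automatic detection",
    }
    # Anything else is reported verbatim rather than interpreted.
    return meanings.get(value, f"unrecognized value {value!r}")
```

Recording the returned label next to the env overlay keeps a diagnostic setting from quietly becoming part of the normal launch.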
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

MoE

Mixture of Experts. The linked MegaCpp material covers Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
