h200
h100
bringup
naming
tp
sp
fsdp2
moe

H200 Bring-up and Naming: How We Stopped Confusing Our Own Receipts

The MegaCpp H200 software stack, the bring-up of TP+SP+EP+FSDP2 on a fresh us-west1-c host, and the naming glossary that prevents two engineers from quoting different runs as the same number.

9 min readDavid Gornshtein
H200 Bring-up and Naming: How We Stopped Confusing Our Own Receipts

H200 Bring-up and Naming: How We Stopped Confusing Our Own Receipts

There are two failure modes you keep hitting when you move a serious training stack from H100 to H200. The first is the obvious one: a CUDA toolkit, a torch nightly, a flash-attn build, and a fused MoE dispatch path that all need to land on the same ABI without any of them silently downgrading the others. The second is unglamorous: the names. The host you ran on three weeks ago, the preset whose label says dense but means "MoE on, sparse off, mHC enabled", the lineage tag in a handoff note, the current lane in a benchmark quartet — none of them carry their meaning on their face, and an experienced team will still misquote them.

This post is the H200 bring-up story for our stack — the actual stack table, the install order that survives, and the moves that took the TP + SP + EP + FSDP2 lane from "compile/hang/forensics" to "step completes, checkpoint saves" — followed by the naming glossary we now treat as a contract, not a wiki page.

The stack we actually run

There is a one-page truth for the H200:8 stack. Everything else is commentary.

Component Version Source
Python 3.13.12 deadsnakes PPA
torch 2.12.0.dev20260405+cu132 PyTorch nightly cu132 index
triton 3.7.0+git9c288bc5 bundled with torch nightly
CUDA toolkit 13.2 apt install cuda-toolkit-13-2
NVIDIA driver 590.48+ pre-installed on host image
flash-attn 2.8.3 pre-built wheel (cu132)
flash-attn-4 4.0.0b7 PyPI (pip install --pre flash-attn-4)
mamba-ssm 2.3.1 pre-built wheel (cu132)
causal-conv1d 1.6.1 pre-built wheel (cu132)
torchao 0.17.0 PyPI
flash-linear-attention 0.5.0 git HEAD with --no-deps
fla-core 0.4.2 PyPI with --no-deps
dualpipe 1.0.0 PyPI
cutlass-dsl 4.4.2 bundled with flash-attn-4
quack-kernels 0.3.9 bundled with flash-attn-4
qoptim-cuda 0.0.0 pre-built wheel (cu132)
cuDNN 9.20.0.48 bundled with torch nightly
NCCL 2.29.7 bundled with torch nightly
nsys 2025.6.3 bundled with CUDA toolkit
gcc 11.4.0 system (Ubuntu 22.04)

The pre-built CUDA wheels are built once on a machine that has the cu132 toolkit installed and republished as artifacts; they have no pip dependency metadata, so installing them does not pull torch back in and does not have a chance to downgrade the nightly.

Disk layout

Everything that grows lives on /mnt/data, with friendly symlinks under $HOME so the operator's shell history keeps working:

/mnt/data/
  venv/        # Python 3.13 virtualenv
  nanochat/    # repo
  data/        # training data
    parquet/

~/venv     -> /mnt/data/venv
~/nanochat -> /mnt/data/nanochat
~/data     -> /mnt/data/data

The "everything on /mnt/data" rule is not aesthetic. The root disk on a rented GPU host is small, and pip caches plus a single torch nightly will exhaust it during a build. We have lost a host to a full root volume more than once. The symlinks make this invisible to scripts; the layout makes it invisible to disk pressure.

Install order matters

The single most expensive lesson on H200 bring-up is that pip will silently downgrade torch if you let it. Several packages — most aggressively mamba-ssm, causal-conv1d, and flash-linear-attention — declare torch as a runtime dependency. If you pip install them from PyPI without --no-deps, pip happily replaces your cu132 nightly with whatever stable build it can resolve, and the next training launch crashes in torch._inductor because the ABI no longer matches your fused kernels.

The order that survives is:

  1. Create the venv on the data disk and update pip/setuptools/wheel.
  2. Install the cu132 torch nightly first. This sets the ABI baseline.
  3. Install the four pre-built CUDA wheels (flash_attn, causal_conv1d, mamba_ssm, qoptim_cuda). They have no dependency metadata; they cannot touch torch.
  4. Install flash-attn-4 from PyPI. It is pure Python and JITs CUDA at runtime.
  5. Install flash-linear-attention and fla-core with --no-deps. This is the trap.
  6. Install the rest (torchao, dualpipe, then the data and tooling packages). These are torch-safe.
  7. Verify torch was not downgraded: python -c "import torch; assert 'cu132' in torch.__version__".

If that assert fails, stop. Do not "just try training" to see if it works. The symptoms of a partial downgrade range from immediate import errors to a subtle numerical drift that only shows up after 1k steps, and you will spend the rest of the day chasing the wrong bug.

Bringing up TP + SP + EP + FSDP2

The bring-up that took the most receipts was the four-way composition: tensor parallel, sequence parallel, expert parallel, FSDP2. On a fresh H200 host, with our recipe, this composition initially looked like a torch/Inductor problem. It was not — or rather, it was not only one. The two real bugs were inside our own training script.

The first bug was an EP-active predicate. The launch used --expert_parallel=2 with --expert_tensor_parallel=0 (which means "follow TP"). Our wiring computed "EP is active" from expert_tp_mesh is not None, which stayed None in this configuration. So the model thought EP was off, attached a manual SP hook to inner.mlp, and expanded the router input from a local shard back to the full sequence. The collective looked legal; the math underneath it was wrong, and the hang showed up later as an Inductor assertion. The fix was to pass the real expert_parallel_degree into _apply_cuda_tensor_parallel() and compute "EP active" from degree > 1. After the fix, the trace showed ep_active=True ep_local=True manual_sp=False ep_lane=True, no stale hooks, and a clean router_in shape of (1, 128, 128).

With that fix, the eager TP + SP + EP + FSDP2 lane reached compute_loss return, optimizer_step_done, and a clean checkpoint save in a single step on a fresh us-west1-c host. Multi-step (num_iterations=5) passed the next day, no hang, no collective failure.

The second bug was the optimizer's notion of effective DP size when PP was added. The pre-existing path computed effective DP as world_size / (tp * ep) — correct without PP, wrong with PP. With PP turned on, the right divisor is world_size / (pp * tp * ep), and PP + TP singleton-DP lanes were silently being routed to the wrong distributed Adam path. After threading pp_degree through the optimizer routing, the PP + TP + SP eager lane on depth=4 reached step 2 of 3 with a clean checkpoint save and no traceback.

Compile, then the next compile bug

Eager correctness is half the work. Compile expansion was its own gauntlet. PP + TP + SP + compile on depth=4 passed once we stopped duplicating the PP process group when the pre-PP DP group already covered the right rank membership; reusing the existing dp_process_group as _pp_group removed a long-tail hang in dist.new_group.

The next blocker was a torch pipelining detail. Schedule1F1B's _batch_p2p intentionally degrades homogeneous send/recv batches into raw dist.isend / dist.irecv, which trips the unbatched-P2P warning during eager-init neighbor handshake. The workaround is local: inside the PP schedule.step(...) window, we override _batch_p2p to force dist.batch_isend_irecv(p2p_ops), gated by NANOCHAT_FORCE_BATCHED_PP_INIT_P2P=1. After the override, the clean run shows steady-state steps around 200ms with no unbatched-P2P warnings in the receipt slice.

The last failure class — both a fallback and a decomp for same op: aten.index_add.default — narrowed cleanly through ablations. Dense GPT under TP + SP + FSDP2 + compile passed. MoE combine alone, MoE router-gather-combine, all passed under compile. The standalone real TokenChoiceMoELayer was the failing primitive. The fix was twofold: disable the Megatron-permute padded branch on the compiled CUDA path by default, and move EP combine() out of the compile-sensitive region with a selective @torch.compiler.disable. After that, the real base_train TP + SP + EP + FSDP2 + compile lane passes on H200, and the active frontier moved from correctness to scaling and recompile behavior.

Moving from H100 to H200: the gotchas

On paper, H100 → H200 is "more HBM, otherwise the same". In practice, the gotchas that bit us were not silicon, they were build matrix:

  • Torch nightly cu132 wheels for flash-attn, mamba-ssm, and causal-conv1d are not on PyPI. You either build them once and republish as wheels, or you spend 10–15 minutes per package on every fresh box compiling from source.
  • A cold inductor cache has different first-compile latency on H200 than on H100 because the kernel set is different, which is why the multi-GPU cold-cache hang we wrote about separately shows up earlier on H200 hosts with rich recipes than on a stripped H100.
  • The gcc 11.4 baseline is fine for everything we build, but newer nvcc features will warn if you try to push to gcc 13. Stay on 11.4 unless you have a concrete reason.

The HBM headroom on H200 changes which recipes fit. A configuration that OOMs on 80GB H100 with sequence length 8K may run cleanly on H200 at 16K with room to spare; we now publish recipes with the HBM pressure noted, because "works on our H200 box" is a claim that does not transport to an H100 reader.

The naming glossary

The reason we wrote a glossary at all is that the same English words mean different things in our preset names, our log labels, and our handoff notes.

Hosts. The old March 2026 H200:8 box in europe-west4-a and the current us-west1-c bench and bring-up boxes are different runtime surfaces — same GPU class, different software histories. We use distinct host names per box rather than abstracting them into "the H200". Old reports that say "the H200" almost always mean the historical Europe host; current bring-up logs almost always mean a us-west1-c host. Comparing two tok/sec numbers across these is comparing two systems.

Presets. The most consequential trap is the word dense. Our nam52_h200_dense_ref_v1 preset means dsa=False — sparse attention off — and not "no MoE". The MoE stack is on; the receipted March 2026 baseline also carried mHC with mhc_n_streams=4. We now also publish nam52_h200_moe_no_dsa_ref_v1 as an alias of the same preset with the intended meaning spelled out, and nam52_h200_dense_no_mhc_recovery_v1 as the explicit bring-up/bisect helper with mHC deliberately disabled. The pinned NAM52 sparse support-region presets all live on nem_dsa_layers="8,9,10,11" over the 52-layer AEME substrate (4 sparse A-ranks, 9 full-attention A-ranks); the NAM52R production candidate uses pattern AEMEAEMEAEMR with the inverse mix (4 full, 9 sparse). Despite both having a historical dense key in some preset names, they are not all-dense and they are not the same attention substrate. We added nam52r_h200_4full_9sparse_candidate_v1 and nam52r_h200_prod_candidate_v1 as aliases that read correctly without context.

Log labels. The word current means "the baseline recipe used in this wave", not a timeless contract, and it is also unrelated to the --kernel=current loss-kernel option. The word head means "the run was launched from the repo HEAD on that host" — useful in current-head provenance reports for comparing the live repo against an older receipted lineage on the same runtime surface. The lineage tag v19c is operator shorthand for the March 19, 2026 H200 wave — it is not a preset family or a stable launcher contract, just a tag on a slice of receipts.

The practical rule

Before any two H200 numbers get compared, three things have to be identified explicitly: the host lineage (old Europe canonical or which current us-west1-c host), the preset/runtime family (dense no-DSA baseline, NAM52 pinned support-region sparse, or NAM52R 4 full / 9 sparse), and the launch regime (Megatron, DDP, FSDP2, no-compile, regional-compile, and so on). Anything missing one of those three is a rumor. Anything carrying all three is a comparison.

References

  • H200_STACK_SETUP.md
  • H200_NAMING.md
  • tp_sp_ep_fsdp_h200_bringup_2026-04-07.md
  • MODAL_MULTI_GPU_STATUS.md
David Gornshtein • Datasunrise OÜMore posts →