Topic Hub

H200 Training and Kernel Bring-Up

A curated path through the H200 lane: operator bring-up, step-time anatomy, memory pressure, and the NVIDIA kernel surfaces that actually moved the stack.

This hub is for readers who want the shortest path from H200 operations to the runtime choices underneath. Start with the operator and perf notes, then move into the kernel-specific articles once the execution model is clear.

H200

training

kernels

memory

performance

Curated set

Articles in reading order

Why this hub

Best if you care about real multi-GPU bring-up, memory cliffs, and which Hopper-era optimizations survived contact with production runs.

Read These First

Build the runtime picture before dropping into individual kernel families.

01
April 18, 2026 • 13 min read • MegaCpp Engineering
H200 Bringup and Naming: What Had to Be Made Explicit
A code- and doc-grounded look at H200 bringup, why naming mattered, how a flagship hybrid recipe was encoded across launch surfaces, and which infrastructure assumptions had to be turned into explicit contracts.
The cleanest opening read for what the H200 lane actually was, how it was named, and what assumptions belonged to the first stable baseline.
H200
Bringup
Distributed Training
Naming
02
April 18, 2026 • 8 min read • David Gornshtein
Training on 8x H200 SXM: the operator playbook
End-to-end operator notes for driving an 8x H200 SXM node: topology, NCCL tuning, storage layout, and the invariants that keep a run from silently drifting.
The operator-level playbook for an 8x H200 run: process layout, NCCL assumptions, and the baseline execution surface.
H200
NCCL
NVLink
FSDP2
03
April 18, 2026 • 15 min read • David Gornshtein
Training speed anatomy on H200
What actually sets training speed on H200 in public MegaCpp reporting: compile warmup policy, block mix, memory shape, and why local wins often fail to move whole-step throughput.
A grounded breakdown of what actually consumes step time once the lane is alive.
H200
Training
Performance
Nam52
04
April 18, 2026 • 5 min read • David Gornshtein
OOM Debugging Playbook for H200 Training Runs
A practical playbook for triaging H200 out-of-memory failures: distinguish fragmentation from true exhaustion, isolate the largest activation surfaces, and apply the cheapest fix first.
Use this before chasing fancy optimizations; it explains the recurring memory failure modes on the real lane.
OOM
H200
Memory
Debugging
05
April 18, 2026 • 12 min read • David Gornshtein
Training speed by feature: which parts of the stack really move step time
A grounded feature-by-feature look at training speed across a modern hybrid stack: Mamba fused paths, memory-traffic cleanup, MLA pieces, MoE dispatch, routing bridges, and feature taxes that should stay experimental.
The feature-by-feature companion once you need to explain where step time moved after the baseline run was stable.
Performance
Kernels
Mamba3
MoE

Memory, Scaling, and Failure Surfaces

These explain the recurring H200 cliffs before the kernel catalog starts to matter.

06
April 18, 2026 • 9 min read • David Gornshtein
H200 Memory Geometry for the Hybrid Stack
How weights, gradients, optimizer state, activations, routing scratch, runtime reserve, and fragmentation stack up on one H200 device in a hybrid training stack.
The memory-topology explanation that makes later H200 capacity and launch behavior much easier to reason about.
H200
Memory
Muon
MoE
07
April 18, 2026 • 9 min read • David Gornshtein
A Memory-Budget Anatomy for One Specialist on H200:8
Line-by-line breakdown of weights, gradients, Muon+AdamW state, activations, KV cache, communication buffers, allocator overhead, and fragmentation for a single specialist trained on 8x H200, with the GB10 contrast.
The budgeting lens for understanding why model shape, cache policy, and optimizer state hit the wall where they do.
Memory
H200
GB10
FP8
08
April 18, 2026 • 14 min read • David Gornshtein
Why a 4B-8B model fills an H200 and still OOMs
A detailed accounting of where 141 GB of HBM goes when you train a 4B-8B hybrid Mamba 3, Transformer, and MoE specialist: parameters, gradients, optimizer state, activations, KV cache, MoE routing buffers, and allocator fragmentation.
The best companion piece once H200 memory use looks irrational at first glance.
Memory
H200
MoE
Mamba
09
April 18, 2026 • 12 min read • David Gornshtein
NCCL and collective hangs: the H200 multi-host timeout playbook
Allreduce stragglers, NCCL deadlocks, P2P env vars, ibverbs quirks, and the liveness/timeout playbook we run on MegaCpp's H200 multi-host CUDA lanes.
The practical distributed-systems readback when H200 trouble is really in the collective layer rather than in a single kernel.
NCCL
H200
Distributed
MegaCpp

Kernel and Precision Surfaces

After the baseline is stable, these are the high-value implementation notes.

10
April 18, 2026 • 10 min read • David Gornshtein
Flash Attention 4 in practice: what we shipped and what we cut
Our hybrid stack's applicability matrix for Flash Attention 4, the validation profiles, the dense-full rollout gates, and the regressions that killed the first FA4 variants before they reached deployment.
What the FA4 rollout looked like in practice, including what stayed and what was cut.
Flash Attention
FA4
CuTe
H200
11
April 18, 2026 • 12 min read • David Gornshtein
Fused MLA on Hopper and Blackwell: projection, RoPE, and the KV cache that ships
The NVIDIA side of Multi-Latent Attention in the MegaCpp ensemble: a fused down-norm-up projection, a fused split-RoPE-concat Triton kernel, a compressed KV cache, and how it all lands on Megatron-Core.
The MLA path that shipped on Hopper and Blackwell-class GPUs, including the KV-cache implications.
MLA
Triton
H200
Blackwell
12
April 18, 2026 • 9 min read • MegaCpp Engineering
Kernel Catalog and Impact: Why the Runtime Needed a Real Map
A grounded tour of the kernel catalog across attention, sparse MLA, MoE, MTP, and dispatch/combine paths, with emphasis on why naming the kernel family and backend contract changed system-level decisions.
The kernel map that ties H200 runtime wins to the concrete families we kept, including MLA and MoE-related paths.
Kernels
H200
MoE
Attention
13
April 18, 2026 • 11 min read • David Gornshtein
The FA4 Catalog on Blackwell: Variants, sm Guards, and Runtime Selection
Inside the Flash Attention 4 catalog MegaCpp ships: which kernel variants we keep, the sm_100 / sm_121a guards, the selection policy at runtime, and the validity checks that fail closed.
A more surgical follow-through when the FA4 family needs to be understood as a catalog of actual usable paths instead of one banner label.
Flash Attention
FA4
CuTe
Blackwell
14
April 18, 2026 • 9 min read • David Gornshtein
FP8 in the training stack: what shipped and what we rolled back
An engineer's account of rolling FP8 through the training stack: DeepGEMM block-scaled GEMMs, torchao Float8Linear, TransformerEngine FP8-aware activation checkpointing, and the parts that looked good on paper but lost the benchmark.
The precision-policy readback: where FP8 helped, where it complicated the stack, and what was rolled back.
FP8
Training
DeepGEMM
TorchAO
15
April 18, 2026 • 9 min read • David Gornshtein
Transformer Engine on H200 and Blackwell-class GPUs: the bridge we use
How MegaCpp wires NVIDIA Transformer Engine into the training stack on Hopper and Blackwell, where TE replaces native PyTorch layers, the FP8 interaction, and the fallback path that keeps non-NVIDIA lanes alive.
The bridge-layer companion once H200 kernels have to coexist with the rest of the model-definition stack.
Transformer Engine
FP8
H200
Blackwell

Keep exploring

Adjacent topic hubs

These hubs cover nearby parts of the blog without turning the archive into a giant taxonomy.

GB10 and Blackwell Bring-Up

Modal Training and Benchmark Operations

Mamba3 Architecture, Kernels, and Runtime Tradeoffs

MLA Integration, Dispatch, and Weight Absorption

Evaluation, Benchmarks, and Verifier Loops

Megatron Parallelism and Layout Boundaries

TPU Sparse Attention and Pallas Kernels

TPU v6e and XLA Runtime Surfaces

C++ Data Pipelines and Corpus Packaging

MoE, Routing, and Distributed Model Splits