Topic Hub

H200 Training and Kernel Bring-Up

A curated path through the H200 lane: operator bring-up, step-time anatomy, memory pressure, and the NVIDIA kernel surfaces that actually moved the stack.

This hub is for readers who want the shortest path from H200 operations to the runtime choices underneath. Start with the operator and perf notes, then move into the kernel-specific articles once the execution model is clear.

H200
training
kernels
memory
performance
Curated set: 15 articles in reading order
Why this hub

Best if you care about real multi-GPU bring-up, memory cliffs, and which Hopper-era optimizations survived contact with production runs.

Read These First

Build the runtime picture before dropping into individual kernel families.

  1. 01
    April 18, 2026 · 13 min read · MegaCpp Engineering

    H200 Bringup and Naming: What Had to Be Made Explicit

    A code- and doc-grounded look at H200 bringup, why naming mattered, how a flagship hybrid recipe was encoded across launch surfaces, and which infrastructure assumptions had to be turned into explicit contracts.

    The cleanest opening read for what the H200 lane actually was, how it was named, and what assumptions belonged to the first stable baseline.

    H200
    Bringup
    Distributed Training
    Naming
  2. 02
    April 18, 2026 · 8 min read · David Gornshtein

    Training on 8x H200 SXM: the operator playbook

    End-to-end operator notes for driving an 8x H200 SXM node: topology, NCCL tuning, storage layout, and the invariants that keep a run from silently drifting.

    The operator-level playbook for an 8x H200 run: process layout, NCCL assumptions, and the baseline execution surface.

    H200
    NCCL
    NVLink
    FSDP2
  3. 03
    April 18, 2026 · 15 min read · David Gornshtein

    Training speed anatomy on H200

    What actually sets training speed on H200 in public MegaCpp reporting: compile warmup policy, block mix, memory shape, and why local wins often fail to move whole-step throughput.

    A grounded breakdown of what actually consumes step time once the lane is alive.

    H200
    Training
    Performance
    Nam52
  4. 04
    April 18, 2026 · 5 min read · David Gornshtein

    OOM Debugging Playbook for H200 Training Runs

    A practical playbook for triaging H200 out-of-memory failures: distinguish fragmentation from true exhaustion, isolate the largest activation surfaces, and apply the cheapest fix first.

    Use this before chasing fancy optimizations; it explains the recurring memory failure modes on the real lane.

    OOM
    H200
    Memory
    Debugging
  5. 05
    April 18, 2026 · 12 min read · David Gornshtein

    Training speed by feature: which parts of the stack really move step time

    A grounded feature-by-feature look at training speed across a modern hybrid stack: Mamba fused paths, memory-traffic cleanup, MLA pieces, MoE dispatch, routing bridges, and feature taxes that should stay experimental.

    The feature-by-feature companion once you need to explain where step time moved after the baseline run was stable.

    Performance
    Kernels
    Mamba3
    MoE

Memory, Scaling, and Failure Surfaces

These explain the recurring H200 cliffs before the kernel catalog starts to matter.

  1. 06
    April 18, 2026 · 9 min read · David Gornshtein

    H200 Memory Geometry for the Hybrid Stack

    How weights, gradients, optimizer state, activations, routing scratch, runtime reserve, and fragmentation stack up on one H200 device in a hybrid training stack.

    The memory-topology explanation that makes later H200 capacity and launch behavior much easier to reason about.

    H200
    Memory
    Muon
    MoE
  2. 07
    April 18, 2026 · 9 min read · David Gornshtein

    A Memory-Budget Anatomy for One Specialist on H200:8

    Line-by-line breakdown of weights, gradients, Muon+AdamW state, activations, KV cache, communication buffers, allocator overhead, and fragmentation for a single specialist trained on 8x H200, with the GB10 contrast.

    The budgeting lens for understanding why model shape, cache policy, and optimizer state hit the wall where they do.

    Memory
    H200
    GB10
    FP8
  3. 08
    April 18, 2026 · 14 min read · David Gornshtein

    Why a 4B-8B model fills an H200 and still OOMs

    A detailed accounting of where 141 GB of HBM goes when you train a 4B-8B hybrid Mamba 3, Transformer, and MoE specialist: parameters, gradients, optimizer state, activations, KV cache, MoE routing buffers, and allocator fragmentation.

    The best companion piece once H200 memory use looks irrational at first glance.

    Memory
    H200
    MoE
    Mamba
  4. 09
    April 18, 2026 · 12 min read · David Gornshtein

    NCCL and collective hangs: the H200 multi-host timeout playbook

    Allreduce stragglers, NCCL deadlocks, P2P env vars, ibverbs quirks, and the liveness/timeout playbook we run on MegaCpp's H200 multi-host CUDA lanes.

    The practical distributed-systems readback when H200 trouble is really in the collective layer rather than in a single kernel.

    NCCL
    H200
    Distributed
    MegaCpp

Kernel and Precision Surfaces

After the baseline is stable, these are the high-value implementation notes.

  1. 10
    April 18, 2026 · 10 min read · David Gornshtein

    Flash Attention 4 in practice: what we shipped and what we cut

    Our hybrid stack's applicability matrix for Flash Attention 4, the validation profiles, the dense-full rollout gates, and the regressions that killed the first FA4 variants before they reached deployment.

    What the FA4 rollout looked like in practice, including what stayed and what was cut.

    Flash Attention
    FA4
    CuTe
    H200
  2. 11
    April 18, 2026 · 12 min read · David Gornshtein

    Fused MLA on Hopper and Blackwell: projection, RoPE, and the KV cache that ships

    The NVIDIA side of Multi-Latent Attention in the MegaCpp ensemble: a fused down-norm-up projection, a fused split-RoPE-concat Triton kernel, a compressed KV cache, and how it all lands on Megatron-Core.

    The MLA path that shipped on Hopper and Blackwell-class GPUs, including the KV-cache implications.

    MLA
    Triton
    H200
    Blackwell
  3. 12
    April 18, 2026 · 9 min read · MegaCpp Engineering

    Kernel Catalog and Impact: Why the Runtime Needed a Real Map

    A grounded tour of the kernel catalog across attention, sparse MLA, MoE, MTP, and dispatch/combine paths, with emphasis on why naming the kernel family and backend contract changed system-level decisions.

    The kernel map that ties H200 runtime wins to the concrete families we kept, including MLA and MoE-related paths.

    Kernels
    H200
    MoE
    Attention
  4. 13
    April 18, 2026 · 11 min read · David Gornshtein

    The FA4 Catalog on Blackwell: Variants, sm Guards, and Runtime Selection

    Inside the Flash Attention 4 catalog MegaCpp ships: which kernel variants we keep, the sm_100 / sm_121a guards, the selection policy at runtime, and the validity checks that fail closed.

    A more surgical follow-through when the FA4 family needs to be understood as a catalog of actual usable paths instead of one banner label.

    Flash Attention
    FA4
    CuTe
    Blackwell
  5. 14
    April 18, 2026 · 9 min read · David Gornshtein

    FP8 in the training stack: what shipped and what we rolled back

    An engineer's account of rolling FP8 through the training stack: DeepGEMM block-scaled GEMMs, torchao Float8Linear, TransformerEngine FP8-aware activation checkpointing, and the parts that looked good on paper but lost the benchmark.

    The precision-policy readback: where FP8 helped, where it complicated the stack, and what was rolled back.

    FP8
    Training
    DeepGEMM
    TorchAO
  6. 15
    April 18, 2026 · 9 min read · David Gornshtein

    Transformer Engine on H200 and Blackwell-class GPUs: the bridge we use

    How MegaCpp wires NVIDIA Transformer Engine into the training stack on Hopper and Blackwell, where TE replaces native PyTorch layers, the FP8 interaction, and the fallback path that keeps non-NVIDIA lanes alive.

    The bridge-layer companion once H200 kernels have to coexist with the rest of the model-definition stack.

    Transformer Engine
    FP8
    H200
    Blackwell
