
Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 9 min read • David Gornshtein

Transformer Engine

FP8

H200

Blackwell

NVIDIA

Training

Transformer Engine on H200 and Blackwell-class GPUs: the bridge we use


How MegaCpp wires NVIDIA Transformer Engine into the training stack on Hopper and Blackwell, where TE replaces native PyTorch layers, the FP8 interaction, and the fallback path that keeps non-NVIDIA lanes alive.


Transformer Engine (TE) is one of the biggest performance levers NVIDIA provides on Hopper and Blackwell-class GPUs, and also one of the easier ways to destabilize a multi-host training run. The integration here keeps every TE call optional, late-bound, and behind a per-feature flag, then lifts only the modules that materially improve MFU into a stable deployment layer. This post explains that bridge: what TE buys, where it replaces native PyTorch layers, how FP8 composes on H200 versus smaller Blackwell-class targets, and how MegaCpp keeps a clean path back to a TE-free build for lanes that need it.

Why we own a TE bridge at all

The hybrid architecture is layout-heavy: dense attention A-blocks, Mamba M-blocks, MoE E-blocks, and DSA-attention layers all share the same depth budget. Each block class has a different FLOP/byte ratio and a different dominant kernel. Running the canonical PyTorch forward leaves a meaningful fraction of achievable MFU on the table; the Megatron-Core TE spec shows the ceiling with a single fused dynamo graph per layer, fused pre-norm + linear, cuDNN flash attention, and a fp8_autocast scope that picks per-layer precision. The bridge keeps that ceiling available without making TE a hard dependency on lanes where it is unavailable or incomplete.

The bridge has three jobs: keep the import surface non-fatal when TE is missing, expose TE classes one at a time so we can A/B them against our native blocks, and degrade to the FA3 / native MLP path on the same calling convention. Everything else is downstream of those three.

The seven modules that make up the bridge

The TE surface is seven small modules, each one targeting a single replacement.

Bridge component	Replaces	Notes
TE import bridge	n/a	Import firewall; lazy accessors; one line per re-exported TE entry point
TE attention wrapper	Custom DPA path	Wraps `DotProductAttention` for our `[B, H, T, D]` layout; GQA via `num_gqa_groups`; RoPE applied before the call
TE native block wrapper	Native A-block	Full TE-native ordering: `LayerNormLinear` + `DotProductAttention` + `Linear` + `LayerNormMLP`
TE layer-spec dispatcher	Spec dispatcher	`dict[str, type | None]` spec
TE linear replacement walker	Walks the model tree	Swaps `nn.Linear -> te.pytorch.Linear`; exclusion list pinned
TE expert GEMM wrapper	Custom MoE GEMM	Wraps `GroupedLinear` for FP8 expert compute; permute/unpermute via TE primitives
TE permute bridge	Custom permute path	Bridges `moe_permute_with_probs` and friends to our dispatcher

The TE import bridge is the import firewall. Every TE-aware component goes through is_te_available() and the te_* lazy accessors here, so importing any TE-aware module on a TE-less machine never raises. The first call sets a process-wide cache; subsequent calls are free. The bridge re-exports only the TE entry points we actually use; if we need a new primitive, it gets a one-liner here, then a typed wrapper at the call site. This pattern is what makes the rest of the stack import-safe.

The TE attention wrapper documents what it cannot do (packed-doc doc_ids masking, DSA local-window masks, MoBA top-k routing, soft-cap logits) and falls back to our FA3 path for those. This component locks us into one of the two attention worlds: TE for unconstrained transformer-shape attention on Hopper/Blackwell, FA3 for everything else.

The TE native block wrapper's value is not the wrappers but the ordering: lifting the entire block into TE classes was the only stable fix for fusion-window breaks where saved-activation layout flipped between eager and compile.

The TE linear replacement walker's exclusion list is the interesting part: token embeddings (tied with lm_head), MoE routers, mHC score projections, MoD routers, n-gram hash projections, structure embeddings, and LoRA adapters are all skipped because they are either too small to win on FP8, structurally weight-shared, or wrappers around an existing Linear that the swap would clobber.
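The exclusion logic of the walker can be illustrated without PyTorch by filtering qualified module names, the kind of strings `model.named_modules()` would yield. The patterns below are hypothetical placeholders chosen to mirror the categories listed above; the real pinned list is not reproduced here.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence

# Hypothetical name patterns, one per excluded category from the text.
EXCLUDE_PATTERNS: Sequence[str] = (
    "*embed*",       # token and structure embeddings (tied with lm_head)
    "*lm_head*",     # output head sharing the embedding weight
    "*router*",      # MoE and MoD routers
    "*mhc_score*",   # mHC score projections
    "*ngram_hash*",  # n-gram hash projections
    "*lora_*",       # LoRA adapters wrapping an existing Linear
)


def select_swappable(
    linear_names: Iterable[str],
    exclude: Sequence[str] = EXCLUDE_PATTERNS,
) -> List[str]:
    """Return the Linear module names that are safe to swap to TE."""
    return [
        name
        for name in linear_names
        if not any(fnmatchcase(name, pattern) for pattern in exclude)
    ]


names = [
    "layers.0.attn.q_proj",
    "layers.0.moe.router.gate",
    "tok_embed.proj",
    "layers.1.mlp.lora_A",
    "layers.1.mlp.up_proj",
]
print(select_swappable(names))
```

Keeping the decision as a pure function over names makes the exclusion list easy to pin in a test, independent of whether TE is installed.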

FP8 composition on H200 and on smaller Blackwell-class targets

FP8 is where the bridge stops being convenient and starts being a contract. On H200 (Hopper), TE's fp8_autocast plus DelayedScaling on the standard recipe is the production path; the GEMMs use cuBLASLt FP8 paths and the activations are stored in FP8 with per-tensor scaling. We wrap every TE call site in a fp8_autocast context, but only inside the FP8 zone of the model: the BF16 first/last layers and any auxiliary head stay outside. The current bridge formalises this for the mHC group loop: a per-group helper probes each layer's installed factory; if any layer in the group sits in the BF16 zone the whole group runs under nullcontext, otherwise the group runs under the factory at index 0. That is coarser than per-layer but safer, because double-entering fp8_autocast inside _mhc_group_forward produced silent precision drift in earlier iterations.

It also helps to separate DelayedScaling from MXFP8. NVIDIA documents DelayedScaling as the per-tensor FP8 recipe that tracks amax_history and reduces scale state through amax_reduction_group, while MXFP8 is the newer Blackwell microscaling recipe with one local scale per 32-value block on SM100+ devices. In other words, MXFP8 is useful vocabulary for smaller Blackwell-class lanes here, but it is not the H200 baseline behind the throughput claims in this post.
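The difference between the two recipes is easiest to see as arithmetic. The sketch below is a simplified pure-Python illustration, not Transformer Engine's implementation: it assumes the E4M3 maximum of 448, takes the maximum over the amax history for the per-tensor case, and rounds each block scale down to a power of two for the block case. The function names are hypothetical.

```python
import math
from typing import List, Sequence

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def delayed_scale(amax_history: Sequence[float]) -> float:
    """Per-tensor scale derived from a window of past amax values."""
    amax = max(amax_history)
    return E4M3_MAX / amax if amax > 0.0 else 1.0


def block_scales(values: Sequence[float], block: int = 32) -> List[float]:
    """One power-of-two scale per block of `block` consecutive values."""
    scales = []
    for start in range(0, len(values), block):
        amax = max(abs(v) for v in values[start:start + block])
        exponent = math.floor(math.log2(E4M3_MAX / amax)) if amax > 0.0 else 0
        scales.append(2.0 ** exponent)
    return scales


# A tensor whose two halves differ in magnitude by 100x.
tensor = [1.0] * 32 + [100.0] * 32
print(delayed_scale([1.0, 2.0, 4.0]))  # one scale for the whole tensor
print(block_scales(tensor))            # one scale per 32-value block
```

With per-tensor scaling the large half dictates the scale for the small half as well; with block scaling each 32-value block keeps its own range, and there is no history to carry or reduce across ranks.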

On smaller Blackwell-class targets the picture changes. FP8 paths exist, but the cuBLASLt cooperative algorithms used on H200 are not all available, and flash-attn may require target-specific builds. Those systems are treated as compatibility lanes: they must build and run a bf16 forward, but they are not used as the reference for H200 throughput claims. Any FP8 measurements from them are tracked separately from the H200 performance baseline.

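# Stylised wiring inside one mHC group: the FP8 scope is entered once for
# the whole group, so fp8_autocast is never re-entered per layer.
with self._mhc_group_fp8_ctx(group_indices):
    out = _mhc_group_forward(
        layers=[self.layers[i] for i in group_indices],
        x=x,
        **kwargs,
    )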
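The per-group helper used in the snippet above could look roughly like this. It is a sketch under stated assumptions: each layer's installed factory is modelled as a zero-argument callable returning a context manager, and a BF16-zone layer is modelled as `None`. Both representations, and the names `mhc_group_fp8_ctx` and `fake_fp8`, are hypothetical.

```python
from contextlib import contextmanager, nullcontext
from typing import Callable, ContextManager, List, Optional, Sequence

Fp8Factory = Callable[[], ContextManager]


def mhc_group_fp8_ctx(
    layer_factories: Sequence[Optional[Fp8Factory]],
    group_indices: Sequence[int],
) -> ContextManager:
    """Pick one context for the whole group, never one per layer."""
    factories = [layer_factories[i] for i in group_indices]
    # Any BF16-zone layer (None) pulls the whole group out of FP8.
    if not factories or any(f is None for f in factories):
        return nullcontext()
    # Otherwise the group runs under the factory at index 0.
    return factories[0]()


entered: List[str] = []


@contextmanager
def fake_fp8():
    """Stand-in for an fp8_autocast factory; records each entry."""
    entered.append("fp8")
    yield


# Layers 0 and 3 sit in the BF16 zone; layers 1 and 2 are FP8.
layer_factories = [None, fake_fp8, fake_fp8, None]
```

Entering the context for group `[1, 2]` records exactly one FP8 entry, while group `[0, 1]` records none because layer 0 is in the BF16 zone.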

Where the bridge meets parallelism

The TE block plays nicely with Megatron's TP because both speak the same column/row contract; the only adjustment is that the QKV column-parallel split has to be segment-aware so head-grouped attention slices correctly. With sequence parallelism on, LayerNormLinear does the all-gather on input and Linear does the reduce-scatter on output. The norm and QK-norm parameters are replicated across TP ranks while each rank sees only its slice of the sequence, so the SP gradient all-reduce on those parameters has to be installed (we do this in _install_sp_norm_grad_allreduce because not every code path of ours goes through Megatron's finalize_model_grads). With FSDP2 we wrap TE blocks at the block boundary and let fully_shard handle the rest; the only gotcha is reshard_after_forward=False for the layers that are immediately re-used by the MTP head. The FP8-specific rule is that amax_reduction_group has to follow the ranks that actually own the sharded tensor, rather than one convenience group, or the scale state stops matching the real ownership boundary.
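To make "segment-aware" concrete, here is a small index calculation. It assumes a fused QKV weight whose output rows are laid out as `[Q | K | V]` with grouped-query attention; that layout and the function name `qkv_rank_slices` are assumptions for illustration, and real frameworks may interleave heads differently.

```python
from typing import List, Tuple


def qkv_rank_slices(
    num_heads: int,
    num_kv_heads: int,
    head_dim: int,
    tp_size: int,
    rank: int,
) -> List[Tuple[int, int]]:
    """Row ranges (start, stop) that one TP rank owns in a fused [Q|K|V] weight."""
    if num_heads % tp_size or num_kv_heads % tp_size:
        raise ValueError("head counts must divide evenly across TP ranks")
    q_rows = num_heads * head_dim
    kv_rows = num_kv_heads * head_dim
    q_per_rank = q_rows // tp_size
    kv_per_rank = kv_rows // tp_size
    # Slice each segment separately instead of cutting the fused rows in
    # tp_size contiguous chunks, which would hand rank 0 only Q rows.
    q_start = rank * q_per_rank
    k_start = q_rows + rank * kv_per_rank
    v_start = q_rows + kv_rows + rank * kv_per_rank
    return [
        (q_start, q_start + q_per_rank),
        (k_start, k_start + kv_per_rank),
        (v_start, v_start + kv_per_rank),
    ]


# 8 query heads, 2 KV heads, head_dim 4, split across 2 ranks.
for r in range(2):
    print(r, qkv_rank_slices(8, 2, 4, tp_size=2, rank=r))
```

Each rank ends up with query heads and the KV heads they attend to, which is the property a naive contiguous split breaks.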

The MoE path is more interesting. The TE expert GEMM wrapper plus the TE permute bridge give us FP8 grouped GEMM for the expert bank with permute/unpermute primitives that handle padded and jagged dispatch. On H200 this is a real win, especially with EP > 1 because GroupedLinear avoids per-expert kernel launches.
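What permute and unpermute do around a grouped GEMM can be shown on plain lists. This is a conceptual sketch with top-1 routing and no routing probabilities, not TE's `moe_permute_with_probs`; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def permute_by_expert(
    expert_ids: Sequence[int], num_experts: int
) -> Tuple[List[int], List[int]]:
    """Return (token order grouped by expert, tokens per expert)."""
    # sorted() is stable, so tokens keep their relative order per expert.
    order = sorted(range(len(expert_ids)), key=lambda t: expert_ids[t])
    splits = [0] * num_experts
    for expert in expert_ids:
        splits[expert] += 1
    return order, splits


def unpermute(permuted: Sequence, order: Sequence[int]) -> list:
    """Scatter expert outputs back to the original token positions."""
    restored = [None] * len(order)
    for position, token_index in enumerate(order):
        restored[token_index] = permuted[position]
    return restored


tokens = ["a", "b", "c", "d"]
expert_ids = [2, 0, 1, 0]  # expert chosen for each token

order, splits = permute_by_expert(expert_ids, num_experts=3)
permuted = [tokens[i] for i in order]  # contiguous per-expert segments
print(order, splits, permuted)
print(unpermute(permuted, order))
```

The `splits` list is what lets a grouped GEMM treat the permuted buffer as consecutive per-expert segments in one launch instead of one kernel per expert.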

Design choices that held up

The seven-module bridge layout held up, along with the import-firewall pattern, the dict[str, type | None] spec, the per-feature TE flag rather than a global TE switch, the per-group FP8 scope helper, and the rule that auxiliary heads stay native PyTorch. The BF16 first/last zone remains outside fp8_autocast, and measurements from smaller Blackwell-class systems remain separated from the H200 baseline.

Three things did not survive: the early "wrap once at module init" pattern, because it broke under TE upgrades; per-layer fp8_autocast inside _mhc_group_forward, because it double-entered the context; and TE for token embeddings and routers, because the measured win was negligible relative to the correctness risk. The throughput win is selective; the bridge exists because selective adoption is the sustainable shape.
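The `dict[str, type | None]` spec combined with per-feature flags can be sketched as a small resolver. The slot names, the placeholder classes, and `resolve_layer_spec` are all hypothetical; the point is only that a `None` entry or a disabled flag degrades a single slot to its native class.

```python
from typing import Dict, Optional


class NativeAttention:
    """Placeholder for the native attention module."""


class NativeMLP:
    """Placeholder for the native MLP module."""


class FakeTEAttention:
    """Placeholder standing in for a TE-backed attention wrapper."""


NATIVE_SPEC: Dict[str, type] = {"attention": NativeAttention, "mlp": NativeMLP}


def resolve_layer_spec(
    te_spec: Dict[str, Optional[type]],
    te_flags: Dict[str, bool],
) -> Dict[str, type]:
    """Choose TE or native per slot; None or a disabled flag means native."""
    resolved: Dict[str, type] = {}
    for slot, native_cls in NATIVE_SPEC.items():
        te_cls = te_spec.get(slot)
        use_te = te_flags.get(slot, False) and te_cls is not None
        resolved[slot] = te_cls if use_te else native_cls
    return resolved


# TE attention is available; the TE MLP is missing on this host.
te_spec = {"attention": FakeTEAttention, "mlp": None}
print(resolve_layer_spec(te_spec, {"attention": True, "mlp": True}))
```

Because each slot is resolved on its own, a TE class can be A/B tested against the native block by flipping one flag.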

How the bridge survives a TE upgrade

TE upgrades are a recurring source of regressions because the project moves fast and the public surface is large. Three rules keep the bridge stable across upgrades.

First, every TE entry point we use is re-exported through the TE import bridge. When TE renames a class or moves it between submodules, exactly one bridge component changes; the call sites do not. That has paid for itself three times in the last six months: the moe_permute_with_probs rename (the previous name was a private _te_* symbol), the LayerNormLinear move into te.pytorch, and the fp8_autocast signature change that added a recipe keyword.
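One way a single re-export point can absorb a rename or a move is to try a list of known locations in order. The sketch below uses standard-library modules so it runs anywhere; `resolve_first` is a hypothetical helper, and in the bridge the candidates would be TE module paths and symbol names.

```python
import importlib
from typing import Any, Iterable, Optional, Tuple


def resolve_first(candidates: Iterable[Tuple[str, str]]) -> Optional[Any]:
    """Return the first (module, attribute) pair that resolves, else None."""
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module missing in this version; try the next location
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    return None


# Stand-in for "symbol moved between submodules across versions".
mapping_cls = resolve_first(
    [
        ("no_such_module_xyz", "Mapping"),  # location that does not exist
        ("collections", "NoSuchName"),      # module exists, name does not
        ("collections.abc", "Mapping"),     # current location
    ]
)
print(mapping_cls)
```

When an upgrade moves a symbol, only the candidate list changes; call sites keep importing from the bridge.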

Second, every TE wrapper carries a parity test against the native PyTorch reference. The attention wrapper has a parity test that builds a small QKV input, runs both paths at fp32, and asserts max-abs and max-rel error against tolerances written next to the math. The native block wrapper has a similar test for the full block. When TE bumps and a parity test fails, we know exactly where the divergence is and we can decide whether to bump TE further or pin to the previous version while we investigate.
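The max-abs and max-rel checks reduce to a few lines. This sketch works on flat lists of floats so it runs without PyTorch; the helper names and the `eps` guard against division by zero are assumptions, and a real test would compare tensors from the two attention paths.

```python
from typing import Sequence, Tuple


def parity_errors(
    reference: Sequence[float],
    candidate: Sequence[float],
    eps: float = 1e-12,
) -> Tuple[float, float]:
    """Return (max absolute error, max relative error) element-wise."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    max_abs = max(diffs)
    max_rel = max(d / max(abs(r), eps) for d, r in zip(diffs, reference))
    return max_abs, max_rel


def assert_parity(
    reference: Sequence[float],
    candidate: Sequence[float],
    atol: float,
    rtol: float,
) -> None:
    """Fail loudly when either tolerance is exceeded."""
    max_abs, max_rel = parity_errors(reference, candidate)
    assert max_abs <= atol, f"max-abs {max_abs} exceeds {atol}"
    assert max_rel <= rtol, f"max-rel {max_rel} exceeds {rtol}"


native_out = [1.0, 2.0, -4.0]
wrapped_out = [1.0, 2.5, -4.0]
print(parity_errors(native_out, wrapped_out))
```

Reporting both numbers matters because a small absolute error can still be a large relative error on near-zero activations, and the reverse holds for large ones.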

Third, every TE-using preset has an env-var fallback that disables TE for that preset only. Setting MEGACPP_DISABLE_TE=1 at launch causes is_te_available() to return False even on a TE-installed host, which forces every wrapper to fall back to its native path. That fallback is there so validation and production can drop to the non-TE path quickly if a TE upgrade or driver change regresses correctness.

Observed H200 impact

Per-block MFU numbers are not published because they are noisy and shape-dependent, but the qualitative picture is stable on H200. Lifting the dense A-block to a TE-native block with LayerNormLinear, DotProductAttention, and LayerNormMLP improved steady-state throughput on the deep hybrid by a low-double-digit percentage. Adding TE FP8 GroupedLinear for the MoE expert bank on top of that added a further high-single-digit percentage on EP > 1 configurations. The TE in-proj fusion for the Mamba 2/3 layer (te.LayerNormLinear with normalization='RMSNorm') added a smaller but still measurable win on Mamba-heavy presets. None of these gains came from a single switch; they depended on the call-site refactors and parity tests described above.

On smaller Blackwell-class targets the picture is different. TE works, but FP8 paths are not all available, and the wins measured on H200 do not transfer directly. Those lanes are treated as build-and-correctness targets only; their performance numbers are not part of the H200 steady-state matrix.

Reusable bridge pattern

The selective-import pattern in the TE import bridge has held up well enough to reuse for other optional vendor-library integrations, including the Liger fused norm/CE path and the cut-cross-entropy path. The same three rules apply: one re-export component, parity tests against the native reference, and an env-var fallback per preset. That bridge layer is what makes "use TE where it wins, fall back gracefully where it does not" operational instead of aspirational.

FAQ

Frequently asked questions

Is MXFP8 the same thing as the H200 FP8 path in this bridge?

No. In this post the H200 baseline is fp8_autocast with DelayedScaling. MXFP8 is the newer Blackwell block-scaling vocabulary from current Transformer Engine docs, so it is adjacent context for Blackwell-oriented lanes, not the recipe behind the H200 numbers here.

Why does the FP8 scale-reduction group follow the shard owner?

DelayedScaling carries runtime scale state, not just a dtype choice. If a TE block is wrapped by FSDP2 or split by TP/SP, the amax_reduction_group must match the ranks that actually own that tensor slice; otherwise scale updates can describe a convenience process group instead of the shard that will execute the next FP8 GEMM. That is why this bridge keeps TE scope decisions next to the parallelism boundary rather than hiding them in a global FP8 helper. The related ownership seam is FSDP, CUDA and Megatron DDP: where they help and where they do not.
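A toy calculation shows why the choice of reduction group changes the scale. It assumes the reduction takes the maximum amax over the ranks in the group and that the FP8 format is E4M3 with maximum 448; the rank layout, the amax values, and the helper name are invented for illustration only.

```python
from typing import Dict, Sequence

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def reduced_scale(local_amax: Dict[int, float], group: Sequence[int]) -> float:
    """Scale every rank in `group` would share after a max-reduction of amax."""
    amax = max(local_amax[rank] for rank in group)
    return E4M3_MAX / amax


# Illustrative per-rank amax values; ranks 0 and 1 hold shards of one tensor,
# ranks 2 and 3 hold shards of a different, larger-magnitude tensor.
local_amax = {0: 2.0, 1: 8.0, 2: 64.0, 3: 4.0}
owner_group = (0, 1)
convenience_group = (0, 1, 2, 3)

print(reduced_scale(local_amax, owner_group))
print(reduced_scale(local_amax, convenience_group))
```

Reducing over the convenience group lets an unrelated tensor's amax set the scale, so the shard that actually runs the GEMM uses only a fraction of the FP8 range it could have had.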

Why not replace every Linear-like module with Transformer Engine?

Because the bridge is a whitelist, not a tree-wide rewrite. Dense projections and expert GEMMs are the lanes where TE usually pays for itself; token embeddings tied to output heads, routers, structure projections, mHC score projections, n-gram hash projections, adapters, and wrapper-owned modules stay native unless a local parity receipt proves that TE preserves the ownership contract and actually improves the hot path. The related boundaries are Expert Parallel and MoE Sharding, Fused MoE and DeepEP on NVIDIA, and Adapter system and LoRA stack.

Why keep the TE bridge separate from the MLA adapter seam and the FP8 dispatch seam?

Because those are different failure boundaries. The TE bridge makes imports, class replacement, and per-feature fallback safe. The MLA adapter seam still owns constructor and layer-spec normalization, while the FP8 dispatch seam still owns the "storage payload plus scale metadata" handoff for wrapper-backed tensors. Keeping those contracts separate is how a TE upgrade stays a narrow bridge edit instead of silently redefining MLA builder wiring or FP8 kernel selection. The nearby runtime surfaces are Public MLA integration patterns for Megatron, Shared MLA adapter boundaries, and Sparse MLA FP8 dispatch.
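
The import-and-fallback part of that contract can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `te_linear` feature flag and hypothetical helper names; only `transformer_engine.pytorch` and its `Linear` class are real names, and the sketch returns a backend label rather than constructing any layer.

```python
import importlib


def load_optional(module_name):
    """Import a module at call time; return None if it is unavailable."""
    try:
        return importlib.import_module(module_name)
    except (ImportError, OSError):
        # OSError covers a package that is installed but cannot load its shared libraries.
        return None


def pick_linear_backend(feature_flags, te_module_name="transformer_engine.pytorch"):
    """Choose 'te' only when the flag is on and the TE class is importable."""
    if not feature_flags.get("te_linear", False):
        return "native"
    te = load_optional(te_module_name)
    if te is None or not hasattr(te, "Linear"):
        return "native"
    return "te"


print(pick_linear_backend({}))                   # flag off: stays native
print(pick_linear_backend({"te_linear": True}))  # depends on the local install
```

Because the import happens inside the function rather than at module load, a lane without TE installed never touches the library, and the decision stays local to one feature flag.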

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

sm_100

Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.

TP

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP, which is bandwidth-heavy, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

EP

Expert parallelism partitions MoE experts across GPUs: 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then a second all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.

SP

Sequence parallelism is a TP-region activation saver, not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP, because the single all-reduce becomes an all-gather + reduce-scatter pair. Weights are identical to plain TP; only the activation tensors shrink. Turn it on whenever TP is on: the memory savings are near-free, which is what makes long contexts fit under TP.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

MoE

Mixture of Experts. The linked post covers Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

RMSNorm

Root-mean-square normalization used as an explicit contract seam in the wrapped Mamba3 and Megatron integration paths.

RoPE

Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.

doc_ids

The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

Topic hubs

Topic Hub

H200 Training and Kernel Bring-Up

A curated path through the H200 lane: operator bring-up, step-time anatomy, memory pressure, and the NVIDIA kernel surfaces that actually moved the stack.
