Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 9 min read • David Gornshtein

FP8

Training

DeepGEMM

TorchAO

Transformer Engine

H200

GB10

FP8 in the training stack: what shipped and what we rolled back

Q: Did full-model FP8 become the default story for this stack?

No. The retained story is selective FP8 on the surfaces that benefit from it, not a blanket claim that the whole hybrid model now "runs in FP8."

Q: Why does Transformer Engine show up in an FP8 rollout article at all?

Because once FP8 blocks move under Transformer Engine ownership, recompute-safe activation handling moves with them too. The checkpoint boundary is part of the FP8 boundary.

Q: Why can TE plus CUDA graphs still regress instead of win?

Because graph capture wants a static replay shape while FP8 still carries live scale and cache state. If the lane fragments into many captures, amax reductions or transpose-cache work can come back on every replay. The useful question is not "graphs on?" but "did the captured path actually remove launch overhead without reintroducing FP8 bookkeeping?"

Q: What order keeps torchao FP8 compatible with sharding and compile?

Convert the linear modules into their FP8-aware form first, then shard them, then compile around that already-shaped model. If you invert that order, the training lane can keep running while dropping back to a less interesting communication or scaling path. The better local handoff is FSDP2 pain and payoff together with The Compile-Time Tax We Accept for Runtime Speed. On an FSDP2 lane, the reader-safe version is still: initialize on device="meta", run convert_to_float8_training, apply fully_shard, and only then hand the model to compile. After the optimizer step, precompute_float8_dynamic_scale_for_fsdp is the matching maintenance pass; it updates float8 scales in one collective instead of rediscovering them layer by layer.

Q: Why is the FSDP2 pre-all-gather seam part of the FP8 decision?

Because the useful distributed win is not just FP8 math inside a linear layer. The wrapper has to be present before FSDP2 rebuilds sharded weights for the forward path, otherwise the run can stay alive while the communication boundary falls back to a less interesting payload or scale path. That is why this article treats FP8 conversion order as an ownership rule, not a cosmetic dtype switch.

Q: Why can Hopper WGMMA still lose on small-M or scan-heavy work?

Because the headline FP8 path is tuned for healthy GEMMs, not for tiny or ragged M dimensions. Once the kernel starts padding small M, paying cast overhead, or waiting on scan-style data movement, the Tensor Core headline stops predicting end-to-end speed.

An engineer's account of rolling FP8 through the training stack: DeepGEMM block-scaled GEMMs, torchao Float8Linear, TransformerEngine FP8-aware activation checkpointing, and the parts that looked good on paper but lost the benchmark.

FP8 is the first precision step where the textbook story and the measured story split apart on this stack. The useful public summary is not "FP8 on" or "FP8 off." The useful summary is which FP8 surfaces stayed worth keeping once the full hybrid recipe was profiled.

For first touch, keep four boundaries separate:

model-level training dtype: the format the training lane mostly lives in
kernel-local FP8: weight or activation transport inside selected GEMM-heavy kernels
wrapper-aware dispatch: the runtime rule that decides whether an FP8 wrapper stays on an FP8 path or silently falls back
checkpoint ownership: which runtime owns recompute-safe activation state once FP8 blocks move under Transformer Engine

The fastest checked-in route through those boundaries is the sparse MLA FP8 dispatch example for the wrapper-aware dispatch seam, the runtime patch-surface example for recipe-versus-runtime ownership, and the precision recipe overview for the cross-phase map.

Why the rollout had to stay selective

The real hybrid lane behind this article is mixed A/E/M/R: attention-family layers, expert layers, Mamba-family layers, and recurrent-tail layers do not all respond to FP8 the same way. The public-safe way to say that is simple: MoE and other GEMM-heavy surfaces are the cleanest FP8 targets, while scan-heavy, dispatch-heavy, or wrapper-sensitive surfaces can pay more conversion and ownership overhead than they recover.

That same boundary is visible in the checked-in examples. The dispatch-side hazard appears in the sparse MLA FP8 dispatch example, and the runtime-side boundary appears in the runtime patch-surface example. Together they show why a public FP8 story has to stay per-surface rather than turning into a single headline about "the model running in FP8."
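The dispatch hazard can be sketched in a few lines. This is not the checked-in sparse MLA example; `Fp8Wrapped`, `FP8_KERNELS`, and `dispatch` are hypothetical names, and plain Python lists stand in for tensors. What it shows is the rule itself: an op either consumes the payload and scale directly, or it takes a dequantize fallback that is recorded rather than silent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Fp8Wrapped:
    """Hypothetical FP8 wrapper: quantized payload plus the scale used to make it."""
    payload: List[float]  # stand-in for FP8 codes (real_value * scale)
    scale: float


def dequantize(t: Fp8Wrapped) -> List[float]:
    """Recover high-precision values from the wrapper."""
    return [v / t.scale for v in t.payload]


def fp8_sum(t: Fp8Wrapped) -> float:
    """FP8-aware kernel stand-in: works on the payload, applies the scale once."""
    return sum(t.payload) / t.scale


# Ops that have an FP8-aware implementation (hypothetical registry).
FP8_KERNELS: Dict[str, Callable[[Fp8Wrapped], float]] = {"sum": fp8_sum}
# High-precision implementations used by the fallback path.
HP_OPS: Dict[str, Callable[[List[float]], float]] = {"sum": sum, "max": max}


def dispatch(op: str, t: Fp8Wrapped, fallback_log: List[str]) -> Tuple[str, float]:
    """Stay on the FP8 path when a kernel exists; otherwise fall back and record it."""
    kernel = FP8_KERNELS.get(op)
    if kernel is not None:
        return "fp8", kernel(t)
    fallback_log.append(op)  # make the fallback visible instead of silent
    return "dequant_fallback", HP_OPS[op](dequantize(t))


log: List[str] = []
t = Fp8Wrapped(payload=[2.0, 4.0], scale=2.0)
print(dispatch("sum", t, log))  # FP8-aware path
print(dispatch("max", t, log))  # dequantize fallback, recorded in log
print(log)
```

The design point is the log: a lane that keeps running on the fallback path looks healthy unless the dispatch layer reports which ops left FP8.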

The three FP8 surfaces that stayed real

1. Block-scaled GEMMs for the MoE side

DeepGEMM is the cleanest FP8 win in this slice because it targets exactly the kind of work FP8 is good at: large GEMM-heavy kernels with explicit scaling discipline. The public-safe claim is not that every layer got faster. It is that grouped MoE GEMMs are a natural home for block-scaled FP8, especially on Hopper-class training lanes where those expert GEMMs are already a dominant, well-shaped kernel family.

That is why the kept FP8 story routes through MoE and not through every layer family indiscriminately. The runtime ownership sample in the runtime patch-surface example is the checked-in reminder that the hybrid stack still has several non-MoE surfaces with different contracts.

The reason DeepGEMM stays on the MoE side is not just that it is fast. Its block-scaled layout matches large aligned expert GEMMs and hardware-friendly transfer shapes, which is exactly where the hybrid lane has obvious FP8 arithmetic to harvest. That same coarser scaling is a less clean fit for smaller dense modules or outlier-heavy rows, where numerical headroom and integration friction dominate.
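A toy comparison makes the outlier point concrete. The sketch below is an illustration under simplifying assumptions, not DeepGEMM's or torchao's actual code: the whole matrix is treated as one scaling block, and 448 is the largest finite E4M3 value. It shows how a single outlier row pulls down the shared block scale for every other row, while row-wise scales keep each row's headroom independent.

```python
from typing import List

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format
_TINY = 1e-12         # guard against all-zero rows or blocks


def rowwise_scales(matrix: List[List[float]]) -> List[float]:
    """One scale per row: each row maps its own amax to the FP8 maximum."""
    return [FP8_E4M3_MAX / max(max(abs(v) for v in row), _TINY) for row in matrix]


def block_scale(matrix: List[List[float]]) -> float:
    """One scale for the whole block: the largest element anywhere sets it."""
    amax = max(abs(v) for row in matrix for v in row)
    return FP8_E4M3_MAX / max(amax, _TINY)


# Two ordinary rows and one outlier-heavy row (made-up numbers).
block = [
    [0.5, -1.0],
    [0.25, 0.5],
    [100.0, 2.0],
]

shared = block_scale(block)
for row, own in zip(block, rowwise_scales(block)):
    # A ratio above 1 means the shared scale leaves that row's values
    # far below the top of the FP8 range.
    print(row, "row-wise scale:", own, "shared scale:", shared, "ratio:", own / shared)
```

With large, well-aligned expert GEMMs the shared scale is usually a fair price for the simpler layout; the ratio only becomes a problem when rows inside one block differ by orders of magnitude.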

There is also an operator boundary outside the steady-state kernel numbers. DeepGEMM can be the right FP8 kernel choice and still create a bad rollout if every rank reaches first-use compilation at once. That means the compile cache, warmup order, and restart discipline belong in the FP8 story too: the profitable MoE lane is not just "faster kernels," it is "faster kernels without turning cold start into a synchronized JIT storm." That is the same reason this article hands off to The Compile-Time Tax We Accept for Runtime Speed and Training on 8x H200 SXM: the operator playbook. Concretely, the safe rollout is a deterministic cache root plus rank-0 or single-node prewarm; otherwise first-use JIT can turn a good kernel choice into a thundering herd against shared storage.

The public repository makes that operator seam explicit too. DeepGEMM's JIT cache root defaults to $HOME/.deep_gemm, and the project also exposes an NVRTC switch for faster compilation. The narrow operational lesson is not "everyone should flip every JIT knob"; it is that an FP8 rollout has a cache-policy surface. If the cache root lands on slow shared storage, or every rank reaches first-use compilation together, the kernel win can disappear into cold-start coordination overhead before the first real step.

The repository documents that faster-compile path as DG_JIT_USE_NVRTC=1. That does not make NVRTC the universal answer; it makes compile mode part of the rollout contract. For restart-heavy training or prewarm lanes, the better question is whether the operator is measuring kernel economics or measuring the policy around first-touch compilation and cache reuse.
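One way to picture the cache-policy surface is a launcher helper that pins the cache root and lets a single rank compile first. This is a minimal stdlib sketch, not the project's launcher. `DG_JIT_USE_NVRTC` comes from the text above; the cache-root variable name `DG_JIT_CACHE_DIR` is an assumption that should be checked against the installed DeepGEMM version. The sentinel file, `deepgemm_env`, and `prewarm_then_release` are hypothetical.

```python
import time
from pathlib import Path
from typing import Callable, Dict


def deepgemm_env(cache_root: str, use_nvrtc: bool) -> Dict[str, str]:
    """Environment overrides for a deterministic JIT cache policy."""
    env = {"DG_JIT_CACHE_DIR": cache_root}  # assumed variable name, verify locally
    if use_nvrtc:
        env["DG_JIT_USE_NVRTC"] = "1"       # faster-compile path named in the text
    return env


def prewarm_then_release(
    rank: int,
    cache_root: str,
    warmup: Callable[[], None],
    timeout_s: float = 600.0,
    poll_s: float = 0.5,
) -> bool:
    """Rank 0 compiles into the cache; other ranks wait for a sentinel file.

    Returns True on the rank that ran the warmup, False on ranks that waited.
    Assumes every rank sees the same cache_root. With a node-local cache, apply
    the same pattern per node using the local rank instead.
    """
    sentinel = Path(cache_root) / ".prewarm_done"
    if rank == 0:
        Path(cache_root).mkdir(parents=True, exist_ok=True)
        warmup()          # first-use JIT happens once, here
        sentinel.touch()  # release the waiting ranks
        return True
    deadline = time.monotonic() + timeout_s
    while not sentinel.exists():
        if time.monotonic() > deadline:
            raise TimeoutError(f"prewarm sentinel missing under {cache_root}")
        time.sleep(poll_s)
    return False
```

In a real job a collective barrier would normally replace the polling loop; the file-based version is only meant to keep the sketch free of framework dependencies.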

2. `torchao` Float8Linear on the dense tail

For dense linears that do fit the expected GEMM shape, torchao is the lowest-friction way to keep FP8 explicit. The practical lesson from the real code is that this is a module-boundary decision, not just a dtype flag. If the FP8 module wrapper is applied too late in the build flow, later sharding and compile passes can miss the intended subtype and the training lane quietly falls back to a less interesting path.

The public-safe version of that lesson is narrow: wrapper placement and sharding order matter. Treat FP8 conversion as an ownership boundary, not as a post-hoc decoration.

The public API makes that boundary concrete. convert_to_float8_training swaps eligible torch.nn.Linear modules into Float8Linear, and the current documented recipe names are tensorwise, rowwise, and rowwise_with_gw_hp. That is why this article treats FP8 conversion as an early module-shaping step: the wrapper has to exist before sharding and compile decide what object they are actually carrying forward.
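The ordering can be written down as a build function. The sketch below is an assumption-laden outline rather than this stack's build code: `model_factory` and `init_weights` are hypothetical callables supplied by the caller, the imports are deferred because the function needs torch, torchao, CUDA, and an initialized process group to actually run, and the `torch.distributed.fsdp` import path for `fully_shard` assumes a recent PyTorch release.

```python
# Order described in the text, recorded as data so a build script can assert on it.
BUILD_ORDER = ("meta_init", "convert_to_float8_training", "fully_shard", "compile")


def build_fp8_fsdp2_model(model_factory, init_weights):
    """Convert, then shard, then compile. Call only inside a distributed job."""
    import torch
    from torch.distributed.fsdp import fully_shard  # FSDP2 entry point
    from torchao.float8 import Float8LinearConfig, convert_to_float8_training

    # 1. Build the module tree without allocating real storage.
    with torch.device("meta"):
        model = model_factory()

    # 2. Swap eligible nn.Linear modules to Float8Linear before anything else
    #    decides what parameter objects it is carrying.
    config = Float8LinearConfig(enable_fsdp_float8_all_gather=True)
    convert_to_float8_training(model, config=config)

    # 3. Shard the already-converted modules, then materialize and initialize.
    fully_shard(model)
    model.to_empty(device="cuda")
    init_weights(model)

    # 4. Compile around the final module shapes.
    return torch.compile(model)


def after_optimizer_step(model):
    """Maintenance pass from the text: refresh float8 scales after optimizer.step()."""
    from torchao.float8 import precompute_float8_dynamic_scale_for_fsdp

    precompute_float8_dynamic_scale_for_fsdp(model)
```

Real models usually shard per block before sharding the root module; the single `fully_shard` call here keeps the ordering visible and is not a tuning recommendation.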

That ordering rule matters most when FP8 meets shard-native training. If the float8 module swap happens after sharding or after compile has already fixed the expected parameter layout, the job can keep training while quietly losing the interesting FP8 communication or scale-handling path you thought you had kept. That is why this article belongs next to FSDP2 pain and payoff, The Compile-Time Tax We Accept for Runtime Speed, and Graph recompilation hell: the FP8 win is as much about initialization order as it is about datatype choice.

That split is why torchao and DeepGEMM are complementary rather than interchangeable in this article. torchao's row-wise scaling and higher-precision weight-gradient handling make it easier to compose with ordinary PyTorch module boundaries, even when it gives up some of the absolute peak-shape advantage that block-scaled expert kernels can claim.

3. Transformer Engine-owned checkpointing for FP8 blocks

Once FP8 blocks move under Transformer Engine ownership, the recompute contract moves with them too. That is the same boundary discussed in Transformer Engine on H200 and Blackwell-class GPUs and activations and how we split them: the runtime that owns the FP8 tensors also has to own the recompute-safe activation state for those tensors.

This is not just a performance detail. It is a correctness and integration detail. A public FP8 article that skips the checkpoint boundary usually ends up making stronger claims than the runtime can actually support.

The Transformer Engine FP8 primer also makes clear why this is not one global loss-scaling trick. FP8 uses per-tensor scaling, and the practical high-throughput recipe is delayed scaling, where operators carry forward amax history and turn it into the next iteration's scale. Once that state is owned inside TE, recompute and capture boundaries have to preserve it honestly or the "FP8 enabled" label stops describing the actual runtime contract.
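The state that delayed scaling carries is small, and a toy version shows why recompute and capture have to respect it. The class below is a simplified sketch, not Transformer Engine's implementation: `DelayedScale` is a hypothetical name, the history reduction is a plain max, and the margin term is left out. Each step quantizes with the scale derived from earlier iterations and only then folds the newly observed amax into the history.

```python
from collections import deque

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


class DelayedScale:
    """Per-tensor delayed scaling: this step's scale comes from past amax values."""

    def __init__(self, history_len: int = 4, initial_scale: float = 1.0) -> None:
        self.history = deque(maxlen=history_len)  # rolling amax window
        self.scale = initial_scale

    def step(self, observed_amax: float) -> float:
        """Return the scale used for this iteration, then update for the next one."""
        scale_used = self.scale
        self.history.append(observed_amax)
        amax = max(self.history)
        if amax > 0.0:
            self.scale = FP8_E4M3_MAX / amax
        return scale_used


state = DelayedScale(history_len=2)
for amax in [2.0, 4.0, 1.0, 1.0]:
    print("amax seen:", amax, "scale used:", state.step(amax))
```

If a recomputed forward pass or a replayed graph calls `step` an extra time, or skips it, the history diverges from what the original forward saw. That is the bookkeeping the checkpoint owner has to keep consistent.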

There is a second boundary hiding inside that one: graph capture. Transformer Engine is not only a dtype wrapper; it also owns live FP8 scale state and a few cache-sensitive backward surfaces. If a lane fragments into many layer-by-microbatch captures, per-microbatch amax reductions and transpose-cache work can leak back into replay. That is why "TE plus CUDA graphs" is not automatically a free win. The clean path depends on shape discipline and capture boundaries; the bad path spends the saved launch overhead on FP8 bookkeeping instead. Graph recompilation hell and Transformer Engine on H200 and Blackwell-class GPUs are the better follow-ons when the graph story is the real blocker.

What we rolled back

Three classes of FP8 claim did not survive this slice.

First, blanket FP8 across the whole hybrid model is too coarse. It hides the difference between GEMM-heavy MoE layers and the less cooperative M or R surfaces, and it overstates what a mixed stack can recover from FP8 after dispatch, wrapper, and recompute overhead are counted.

Second, FP8 on scan-heavy Mamba-family work was not a compelling public story. The profitable part of FP8 is the GEMM, but those paths also carry non-GEMM work that does not automatically get cheaper when one internal matmul changes format.

That is the step-time version of Amdahl's law: once FP8 shrinks the GEMM share, the same scan, routing, and memory-bound surfaces become a larger fraction of the step. A kernel-local win can therefore be real and still disappear at the job level after the unchanged parts of the stack reclaim the critical path. For that accounting view, Training speed by feature is the better companion than any one kernel receipt.
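The arithmetic behind that accounting is short enough to run. The function below is the standard Amdahl's law formula applied to step time; the fractions and the 2x kernel speedup in the loop are made-up illustration values, not measurements from this stack.

```python
def step_speedup(gemm_fraction: float, gemm_speedup: float) -> float:
    """Whole-step speedup when only the GEMM share of the step gets faster."""
    if not 0.0 <= gemm_fraction <= 1.0:
        raise ValueError("gemm_fraction must be in [0, 1]")
    if gemm_speedup <= 0.0:
        raise ValueError("gemm_speedup must be positive")
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)


def gemm_share_after(gemm_fraction: float, gemm_speedup: float) -> float:
    """Fraction of the new, shorter step that is still GEMM time."""
    new_gemm = gemm_fraction / gemm_speedup
    return new_gemm / ((1.0 - gemm_fraction) + new_gemm)


# Hypothetical lanes: a GEMM-heavy expert lane versus a scan-heavy lane.
for share in (0.8, 0.5, 0.2):
    print(
        f"GEMM share {share:.0%}: step speedup {step_speedup(share, 2.0):.3f}x, "
        f"GEMM share afterwards {gemm_share_after(share, 2.0):.0%}"
    )
```

The same 2x kernel win is worth much less on the lane where GEMMs were only a fifth of the step, which is the per-surface argument in numeric form.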

The hardware trap is that Hopper-class FP8 fast paths still want healthy GEMM shapes. Once the live M dimension gets small, ragged, or scan-adjacent, the lane starts paying for casts, padding, and data movement that do not show up in the optimistic Tensor Core headline. A WGMMA-capable kernel can therefore be technically "on the FP8 path" while still losing end-to-end because the real surface is small-M or memory-bound. Mamba 3 kernel journey and Training speed anatomy H200 are the better neighbors when that boundary is the real story.
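The padding cost alone can be estimated with a one-line model. This sketch assumes a kernel that pads M up to a multiple of a 64-row tile; the tile size is an illustrative assumption, real kernels choose their own tiling, and the model ignores cast and data-movement costs entirely.

```python
import math


def tile_utilization(m: int, tile_m: int = 64) -> float:
    """Fraction of the padded M dimension that holds real rows."""
    if m <= 0 or tile_m <= 0:
        raise ValueError("m and tile_m must be positive")
    padded = math.ceil(m / tile_m) * tile_m
    return m / padded


for m in (8, 48, 64, 65, 4096):
    print(f"M={m}: {tile_utilization(m):.1%} of the padded tile rows are useful")
```

A ragged M just past a tile boundary wastes almost half of the last tile pair, and a tiny M wastes most of a single tile, before any of the other small-M overheads are counted.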

Third, FP8 plus every other advanced runtime toggle at once is not a clean public claim. CUDA-graph policy, selective recompute, and FP8 ownership can all interact. The stable public-safe story is the one where each ownership boundary is named explicitly.

What ships, what stays explicit

Surface	Status	Public-safe summary
DeepGEMM grouped MoE FP8 GEMMs	kept	good fit for expert-side GEMM-heavy work
`torchao` Float8Linear on selected dense linears	kept	useful where the module boundary and sharding order stay honest
Transformer Engine-owned FP8 checkpointing	kept	recompute ownership has to follow FP8 block ownership
Blanket all-layer FP8	rolled back	too coarse for a mixed `A/E/M/R` stack
FP8 as the main story for Mamba-family scan paths	rolled back	scan-heavy work is not the same as a GEMM-only lane
"FP8 everywhere plus every advanced toggle"	rolled back	too many interacting ownership boundaries for one clean public claim

The useful end state is therefore selective. BF16 still describes the training lane at the model level. FP8 describes a set of retained kernel-local or module-local optimizations inside that lane.

Frequently asked questions

Which checked-in file shows the FP8 dispatch hazard fastest?

The sparse MLA FP8 dispatch example is the shortest proof surface. It keeps raw dispatch, dequantize fallback, and FP8-aware dispatch visible in one file.

Why does Transformer Engine show up in an FP8 rollout article at all?

Because once FP8 blocks move under Transformer Engine ownership, recompute-safe activation handling moves with them too. The checkpoint boundary is part of the FP8 boundary.

Why can TE plus CUDA graphs still regress instead of win?

Because graph capture wants a static replay shape while FP8 still carries live scale and cache state. If the lane fragments into many captures, amax reductions or transpose-cache work can come back on every replay. The useful question is not "graphs on?" but "did the captured path actually remove launch overhead without reintroducing FP8 bookkeeping?"

Do newer Transformer Engine releases remove that graph-capture caveat?

No. They can remove specific bugs or overheads, so the right answer is always versioned measurement. NVIDIA's CUDA Graph guidance still treats FP8 training as a special case because global FP8 buffers, dynamic scaling state, and weight quantization caches have to keep stable replay semantics. Recent Transformer Engine release notes also show targeted fixes, such as a chunked-layer backward weight-gradient execution-order fix, which is useful but narrower than "all FP8 graph combinations are now safe."

What order keeps torchao FP8 compatible with sharding and compile?

Convert the linear modules into their FP8-aware form first, then shard them, then compile around that already-shaped model. If you invert that order, the training lane can keep running while dropping back to a less interesting communication or scaling path. The better local handoff is "FSDP2 pain and payoff" together with "The Compile-Time Tax We Accept for Runtime Speed." On an FSDP2 lane, the reader-safe version is still: initialize on `device="meta"`, run `convert_to_float8_training`, apply `fully_shard`, and only then hand the model to compile. After the optimizer step, `precompute_float8_dynamic_scale_for_fsdp` is the matching maintenance pass; it updates float8 scales in one collective instead of rediscovering them layer by layer.
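The ordering in that answer can be sketched as follows. This is a minimal outline under stated assumptions, not the stack's actual setup code: `build_model` and `select_linear` are hypothetical arguments, a process group and CUDA device are assumed to exist already, and weight initialization after materialization is left as a comment. The import path for `fully_shard` assumes a recent PyTorch release; older releases expose it under `torch.distributed._composable.fsdp`.

```python
def build_fp8_fsdp2_model(build_model, select_linear):
    """Convert, then shard, then compile.

    build_model: hypothetical zero-argument factory returning an nn.Module.
    select_linear: hypothetical (module, fqn) -> bool filter that picks which
        nn.Linear layers are swapped to their float8 form.
    """
    # Imports are local so this sketch can be read and imported on a machine
    # without torch/torchao; running it needs both plus torch.distributed.
    import torch
    from torch.distributed.fsdp import fully_shard
    from torchao.float8 import Float8LinearConfig, convert_to_float8_training

    # 1. Build on the meta device: shapes only, no real memory yet.
    with torch.device("meta"):
        model = build_model()

    # 2. Swap selected linears to float8 before any sharding happens.
    config = Float8LinearConfig(enable_fsdp_float8_all_gather=True)
    convert_to_float8_training(model, config=config, module_filter_fn=select_linear)

    # 3. Shard the already-converted model (root-only here for brevity;
    #    real setups usually shard each block before the root).
    fully_shard(model)

    # 4. Materialize storage. Real weights must be initialized or loaded here.
    model.to_empty(device="cuda")

    # 5. Compile last, around the converted and sharded model.
    return torch.compile(model)


def train_step(model, optimizer, batch, loss_fn):
    """One step, followed by the float8 scale maintenance pass."""
    from torchao.float8 import precompute_float8_dynamic_scale_for_fsdp

    loss = loss_fn(model(batch))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    # Refresh float8 weight scales for all converted layers together.
    precompute_float8_dynamic_scale_for_fsdp(model)
    return loss
```

The filter argument is what keeps the conversion selective: layers it rejects remain ordinary linears in the model's base dtype.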

Why is the FSDP2 pre-all-gather seam part of the FP8 decision?

Because the useful distributed win is not just FP8 math inside a linear layer. The wrapper has to be present before FSDP2 rebuilds sharded weights for the forward path, otherwise the run can stay alive while the communication boundary falls back to a less interesting payload or scale path. That is why this article treats FP8 conversion order as an ownership rule, not a cosmetic dtype switch.
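A rough sense of what the payload fallback costs comes from counting bytes. The helper below is a back-of-the-envelope sketch: `all_gather_bytes` is a hypothetical name, the weight is assumed to divide evenly across ranks, and the small per-tensor scale that travels with an FP8 payload is ignored.

```python
def all_gather_bytes(num_params: int, bytes_per_elem: int, world_size: int) -> int:
    """Bytes one rank receives when an all-gather rebuilds a sharded weight.

    Each rank already holds its own shard, so it receives the other
    (world_size - 1) shards.
    """
    shard_elems = num_params // world_size
    return shard_elems * (world_size - 1) * bytes_per_elem


# Illustrative numbers only: one 4096 x 4096 weight sharded over 8 ranks.
params = 4096 * 4096
bf16_payload = all_gather_bytes(params, bytes_per_elem=2, world_size=8)
fp8_payload = all_gather_bytes(params, bytes_per_elem=1, world_size=8)
print(bf16_payload, fp8_payload)  # 29360128 14680064
```

If the conversion happens after sharding, the run can silently stay on the two-byte payload even though the matmul itself uses FP8.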

Why keep DeepGEMM and torchao as separate FP8 lanes instead of picking one?

Because they solve different problems. DeepGEMM is the better fit for large, aligned, expert-side GEMMs where block-scaled FP8 can stay on the hot MoE path, while torchao fits ordinary dense linears and shard-aware PyTorch module boundaries more naturally. Treating them as interchangeable hides the real MoE-versus-dense split in the rollout.

What proves an FP8 win survived compile tax and cold start?

A matched lane receipt, not the first warm kernel number. Pair the compile/runtime receipt example with the distributed goodput tracker example: one fixes the effective compile and warmup lane, the other tells you whether compilation or checkpoint badput erased the claimed FP8 gain after the kernels got faster. If DeepGEMM is part of that lane, the same receipt should say where the JIT cache lived and whether the run favored NVCC or the faster NVRTC path. Those details matter because they change whether the operator is measuring kernel economics or a cache-miss storm.
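The receipt logic reduces to a break-even calculation. The sketch below assumes steady-state step times and a single lump of extra cold-start cost (compile, JIT cache misses, warmup); `break_even_steps` and the numbers are hypothetical, not measurements from this stack.

```python
import math
from typing import Optional


def break_even_steps(bf16_step_s: float, fp8_step_s: float,
                     extra_cold_start_s: float) -> Optional[int]:
    """Steps needed before the FP8 lane repays its extra cold-start cost.

    Returns None when the FP8 step is not actually faster, in which case
    the extra cold start is never repaid.
    """
    saved_per_step = bf16_step_s - fp8_step_s
    if saved_per_step <= 0:
        return None
    return math.ceil(extra_cold_start_s / saved_per_step)


# Illustrative: 1.0 s BF16 step, 0.875 s FP8 step, 600 s extra cold start.
print(break_even_steps(1.0, 0.875, 600.0))  # 4800
```

A run shorter than the break-even count, or one that restarts often enough to pay the cold start repeatedly, can report faster kernels and still lose wall-clock time.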

Why can Hopper WGMMA still lose on small-M or scan-heavy work?

Because the headline FP8 path is tuned for healthy GEMMs, not for tiny or ragged M dimensions. Once the kernel starts padding small M, paying cast overhead, or waiting on scan-style data movement, the Tensor Core headline stops predicting end-to-end speed.

Why can faster FP8 GEMMs still leave end-to-end throughput almost unchanged?

Because FP8 only accelerates the compute-bound slice that actually moved onto the FP8 path. If scan-heavy, routing-heavy, or other memory-bound kernels still own the rest of the step, their unchanged cost expands to fill the budget once the GEMMs get faster. The right reader-facing question is "what fraction of the step did FP8 really shorten?" not "did one GEMM benchmark improve?"
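That question is Amdahl's law applied to a training step. The sketch below assumes the step splits cleanly into an FP8-accelerated fraction and an unchanged remainder, with no overlap between them; `step_speedup` and the numbers are illustrative, not measurements.

```python
def step_speedup(fp8_fraction: float, gemm_speedup: float) -> float:
    """End-to-end step speedup when only `fp8_fraction` of the step gets faster.

    fp8_fraction: share of the original step time spent in GEMMs that moved to FP8.
    gemm_speedup: how much faster those GEMMs run on the FP8 path.
    """
    if not 0.0 <= fp8_fraction <= 1.0:
        raise ValueError("fp8_fraction must be in [0, 1]")
    if gemm_speedup <= 0.0:
        raise ValueError("gemm_speedup must be positive")
    return 1.0 / ((1.0 - fp8_fraction) + fp8_fraction / gemm_speedup)


# If a quarter of the step moved to FP8 and those GEMMs got 2x faster,
# the whole step improves by about 1.14x, not 2x.
print(round(step_speedup(0.25, 2.0), 3))  # 1.143
```

Even an infinitely fast GEMM caps the gain at `1 / (1 - fp8_fraction)`, which for a 25% slice is about 1.33x.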

What is the public-safe way to summarize the real NAM56R-like hybrid lane?

Say that BF16 remains the training floor, FP8 survives on selected GEMM-heavy surfaces, and the runtime still keeps explicit patch and ownership boundaries around those surfaces. The checked-in public-safe proof surface for that claim is the runtime patch-surface example.

Does this article imply the same model ships in NVFP4?

No. This post is training-side FP8 only. The serving-side Blackwell artifact story is NVFP4 inference.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

WGMMA

Hopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

NVFP4

NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…


2. `torchao` Float8Linear on the dense tail