FP8 in the training stack: what shipped and what we rolled back
An engineer's account of rolling FP8 through the training stack: DeepGEMM block-scaled GEMMs, torchao Float8Linear, TransformerEngine FP8-aware activation checkpointing, and the parts that looked good on paper but lost the benchmark.

FP8 is the first precision step where the textbook story and the measured story split apart on this stack. The useful public summary is not "FP8 on" or "FP8 off." The useful summary is which FP8 surfaces stayed worth keeping once the full hybrid recipe was profiled.
For first touch, keep four boundaries separate:
- model-level training dtype: the format the training lane mostly lives in
- kernel-local FP8: weight or activation transport inside selected GEMM-heavy kernels
- wrapper-aware dispatch: the runtime rule that decides whether an FP8 wrapper stays on an FP8 path or silently falls back
- checkpoint ownership: which runtime owns recompute-safe activation state once FP8 blocks move under Transformer Engine
The fastest checked-in route through those boundaries is the sparse MLA FP8 dispatch example for the wrapper-aware dispatch seam, the runtime patch-surface example for recipe-versus-runtime ownership, and the precision recipe overview for the cross-phase map.
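The wrapper-aware dispatch boundary is easier to reason about when the fallback is reported rather than silent. The following is a minimal sketch, not the checked-in sparse MLA example: `DispatchDecision`, `choose_path`, and the two boolean eligibility inputs are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchDecision:
    path: str    # "fp8" or "bf16"
    reason: str  # human-readable explanation, suitable for logging


def choose_path(wrapper_is_fp8: bool, kernel_supports_fp8: bool,
                strict: bool = False) -> DispatchDecision:
    """Pick a compute path and always say why (hypothetical helper)."""
    if wrapper_is_fp8 and kernel_supports_fp8:
        return DispatchDecision("fp8", "wrapper and kernel both on FP8")
    if wrapper_is_fp8:
        # This is the hazardous case: an FP8 wrapper exists, but the
        # kernel behind it cannot honor it.
        reason = "FP8 wrapper present but kernel has no FP8 path"
        if strict:
            raise RuntimeError(reason)
        return DispatchDecision("bf16", reason)
    return DispatchDecision("bf16", "no FP8 wrapper requested")
```

Running with `strict=True` in benchmarks turns a silent fallback into a hard failure, so a lane cannot be labeled FP8 while actually executing the fallback path.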
Why the rollout had to stay selective
The real hybrid lane behind this article is mixed A/E/M/R: attention-family
layers, expert layers, Mamba-family layers, and recurrent-tail layers do not
all respond to FP8 the same way. The public-safe way to say that is simple:
MoE and other GEMM-heavy surfaces are the cleanest FP8 targets, while scan-heavy,
dispatch-heavy, or wrapper-sensitive surfaces can pay more conversion and
ownership overhead than they recover.
That same boundary is visible in the checked-in examples. The dispatch-side hazard appears in the sparse MLA FP8 dispatch example, and the runtime-side boundary appears in the runtime patch-surface example. Together they show why a public FP8 story has to stay per-surface rather than turning into a single headline about "the model running in FP8."
The three FP8 surfaces that stayed real
1. Block-scaled GEMMs for the MoE side
DeepGEMM is the cleanest FP8 win in this slice because it targets exactly the kind of work FP8 is good at: large GEMM-heavy kernels with explicit scaling discipline. The public-safe claim is not that every layer got faster. It is that grouped MoE GEMMs are a natural home for block-scaled FP8, especially on Hopper-class training lanes where those expert GEMMs are already a dominant, well-shaped kernel family.
That is why the kept FP8 story routes through MoE and not through every layer family indiscriminately. The runtime ownership sample in the runtime patch-surface example is the checked-in reminder that the hybrid stack still has several non-MoE surfaces with different contracts.
The reason DeepGEMM stays on the MoE side is not just that it is fast. Its block-scaled layout matches large aligned expert GEMMs and hardware-friendly transfer shapes, which is exactly where the hybrid lane has obvious FP8 arithmetic to harvest. That same coarser scaling is a less clean fit for smaller dense modules or outlier-heavy rows, where numerical headroom and integration friction dominate.
There is also an operator boundary outside the steady-state kernel numbers. DeepGEMM can be the right FP8 kernel choice and still create a bad rollout if every rank reaches first-use compilation at once. In practice that means the compile cache, warmup order, and restart discipline belong in the FP8 story too: the profitable MoE lane is not just "faster kernels," it is "faster kernels without turning cold start into a synchronized JIT storm." That is the same reason this article hands off to The Compile-Time Tax We Accept for Runtime Speed and Training on 8x H200 SXM: the operator playbook. Concretely, the safe rollout is a deterministic cache root plus rank-0 or single-node prewarm; otherwise first-use JIT can turn a good kernel choice into a thundering herd against shared storage.
The public repository makes that operator seam explicit too. DeepGEMM's JIT
cache root defaults to $HOME/.deep_gemm, and the project also exposes an
NVRTC switch for faster compilation. The narrow operational lesson is not
"everyone should flip every JIT knob"; it is that an FP8 rollout has a cache-policy
surface. If the cache root lands on slow shared storage, or every rank reaches
first-use compilation together, the kernel win can disappear into cold-start
coordination overhead before the first real step.
The repository documents that faster-compile path as DG_JIT_USE_NVRTC=1.
That does not make NVRTC the universal answer; it makes compile mode part of
the rollout contract. For restart-heavy training or prewarm lanes, the better
question is whether the operator is measuring kernel economics or measuring the
policy around first-touch compilation and cache reuse.
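The cache-policy surface described above can be sketched as two small helpers. This is an illustration under stated assumptions, not the project's launcher: `deepgemm_jit_env` and `is_prewarm_rank` are hypothetical names, and `DG_JIT_CACHE_DIR` is assumed here to be the environment variable DeepGEMM reads for its cache root (check the README of the version in use). `DG_JIT_USE_NVRTC=1` is the switch named in the text.

```python
def deepgemm_jit_env(cache_root: str, use_nvrtc: bool = False) -> dict:
    """Build the environment overrides for a deterministic JIT cache root.

    DG_JIT_CACHE_DIR is an assumption about the variable name; the
    default root when nothing is set is $HOME/.deep_gemm.
    """
    env = {"DG_JIT_CACHE_DIR": cache_root}
    if use_nvrtc:
        # Faster compilation path documented by the repository.
        env["DG_JIT_USE_NVRTC"] = "1"
    return env


def is_prewarm_rank(rank: int, local_rank: int, cache_is_shared: bool) -> bool:
    """Decide which ranks compile first (hypothetical policy).

    Shared cache root: only global rank 0 compiles, everyone else reuses.
    Node-local cache root: one rank per node (local rank 0) compiles.
    """
    return rank == 0 if cache_is_shared else local_rank == 0
```

In a launcher, the prewarm ranks would run the warmup GEMM shapes first while the remaining ranks wait at a barrier and then load from the populated cache. That keeps first-use compilation from hitting shared storage on every rank at once.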
2. torchao Float8Linear on the dense tail
For dense linears that do fit the expected GEMM shape, torchao is the
lowest-friction way to keep FP8 explicit. The practical lesson from the real code is
that this is a module-boundary decision, not just a dtype flag. If the FP8
module wrapper is applied too late in the build flow, later sharding and compile
passes can miss the intended subtype and the training lane quietly falls back to
a less interesting path.
The public-safe version of that lesson is narrow: wrapper placement and sharding order matter. Treat FP8 conversion as an ownership boundary, not as a post-hoc decoration.
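Selecting only the linears that "fit the expected GEMM shape" is typically expressed as a filter over modules. The sketch below is duck-typed so it runs without PyTorch installed; `fp8_linear_filter` and the `skip_fragments` names are hypothetical, and the divisible-by-16 rule is an assumption about the shape constraint of the float8 matmul path that should be checked against the torchao version in use.

```python
from types import SimpleNamespace


def fp8_linear_filter(module, fqn: str,
                      skip_fragments=("router", "lm_head")) -> bool:
    """Return True if this module should be swapped to a float8 linear.

    `module` only needs `in_features` and `out_features` attributes.
    `skip_fragments` is an illustrative deny-list, not a project setting.
    """
    in_f = getattr(module, "in_features", None)
    out_f = getattr(module, "out_features", None)
    if in_f is None or out_f is None:
        return False  # not a linear-like module
    if any(fragment in fqn for fragment in skip_fragments):
        return False  # keep numerically sensitive layers in high precision
    # Assumed shape rule: both dimensions must be multiples of 16.
    return in_f % 16 == 0 and out_f % 16 == 0


# Stand-in for a linear layer so the sketch is runnable on its own.
demo_linear = SimpleNamespace(in_features=4096, out_features=11008)
demo_decision = fp8_linear_filter(demo_linear, "layers.0.mlp.up_proj")
```

A function with this `(module, fqn)` shape is what `convert_to_float8_training` accepts as its `module_filter_fn` argument, which keeps the eligibility decision at the module boundary instead of in a global dtype flag.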
The public API makes that boundary concrete. convert_to_float8_training
swaps eligible torch.nn.Linear modules into Float8Linear, and the current
documented recipe names are tensorwise, rowwise, and rowwise_with_gw_hp.
That is why this article treats FP8 conversion as an early module-shaping step:
the wrapper has to exist before sharding and compile decide what object they
are actually carrying forward.
That ordering rule matters most when FP8 meets shard-native training. If the float8 module swap happens after sharding or after compile has already fixed the expected parameter layout, the job can keep training while quietly losing the interesting FP8 communication or scale-handling path you thought you had kept. That is why this article belongs next to FSDP2 pain and payoff, The Compile-Time Tax We Accept for Runtime Speed, and Graph recompilation hell: the FP8 win is as much about initialization order as it is about datatype choice.
That split is why torchao and DeepGEMM are complementary rather than interchangeable in this article. torchao's row-wise scaling and higher-precision weight-gradient handling make it easier to compose with ordinary PyTorch module boundaries, even when it gives up some of the absolute peak-shape advantage that block-scaled expert kernels can claim.
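One way to keep the initialization-order rule from regressing is to assert it in the build flow. The guard below is a sketch: the step names and `check_build_order` are hypothetical labels, and the order itself (meta-device init, float8 conversion, `fully_shard`, then compile) follows the sequence this article's FAQ gives for torchao.

```python
REQUIRED_ORDER = ("meta_init", "float8_convert", "fully_shard", "compile")


def check_build_order(steps) -> None:
    """Raise ValueError if the recorded build steps violate REQUIRED_ORDER.

    `steps` is the sequence of step names the build flow actually ran.
    Extra steps are allowed; only the relative order of the four
    required steps is checked.
    """
    first_seen = {}
    for index, name in enumerate(steps):
        first_seen.setdefault(name, index)

    missing = [name for name in REQUIRED_ORDER if name not in first_seen]
    if missing:
        raise ValueError(f"missing build steps: {missing}")

    positions = [first_seen[name] for name in REQUIRED_ORDER]
    if positions != sorted(positions):
        raise ValueError(
            f"build steps out of order: expected {REQUIRED_ORDER}, got {tuple(steps)}"
        )
```

Recording step names as the model is built and calling `check_build_order` before the first training step makes a late float8 swap fail loudly instead of silently training on the fallback path.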
3. Transformer Engine-owned checkpointing for FP8 blocks
Once FP8 blocks move under Transformer Engine ownership, the recompute contract moves with them too. That is the same boundary discussed in Transformer Engine on H200 and Blackwell-class GPUs and activations and how we split them: the runtime that owns the FP8 tensors also has to own the recompute-safe activation state for those tensors.
This is not just a performance detail. It is a correctness and integration detail. A public FP8 article that skips the checkpoint boundary usually ends up making stronger claims than the runtime can actually support.
The Transformer Engine FP8 primer also makes clear why this is not one global loss-scaling trick. FP8 uses per-tensor scaling, and the practical high-throughput recipe is delayed scaling, where operators carry forward amax history and turn it into the next iteration's scale. Once that state is owned inside TE, recompute and capture boundaries have to preserve it honestly or the "FP8 enabled" label stops describing the actual runtime contract.
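The delayed-scaling idea can be shown in a few lines of plain Python. This is a simplified sketch of the technique, not Transformer Engine's implementation: `DelayedScale` is a hypothetical class, the history length is arbitrary, and it assumes the "max over history" reduction with the E4M3 maximum of 448.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


class DelayedScale:
    """Per-tensor delayed scaling: the next scale comes from past amax values."""

    def __init__(self, history_len: int = 16, fp8_max: float = E4M3_MAX,
                 margin: int = 0):
        self.history = deque(maxlen=history_len)  # rolling amax window
        self.fp8_max = fp8_max
        self.margin = margin
        self.scale = 1.0  # used until the first amax is observed

    def update(self, observed_amax: float) -> float:
        """Record this iteration's amax and return the scale for the next one."""
        self.history.append(observed_amax)
        amax = max(self.history)
        if amax > 0.0:
            # Map the largest recent magnitude onto the FP8 range,
            # leaving 2**margin of headroom.
            self.scale = self.fp8_max / amax / (2 ** self.margin)
        return self.scale
```

The relevance to checkpointing is that `history` and `scale` are live state. In this simplified model, a recompute pass that called `update` a second time for the same iteration would advance the window twice, which is the kind of drift the owning runtime has to prevent.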
There is a second boundary hiding inside that one: graph capture. Transformer Engine is not only a dtype wrapper; it also owns live FP8 scale state and a few cache-sensitive backward surfaces. If a lane fragments into many layer-by-microbatch captures, per-microbatch amax reductions and transpose-cache work can leak back into replay. That is why "TE plus CUDA graphs" is not automatically a free win.
The clean path depends on shape discipline and capture boundaries; the bad path spends the saved launch overhead on FP8 bookkeeping instead. Graph recompilation hell and Transformer Engine on H200 and Blackwell-class GPUs are the better follow-ons when the graph story is the real blocker.
What we rolled back
Three classes of FP8 claim did not survive this slice.
First, blanket FP8 across the whole hybrid model is too coarse. It hides
the difference between GEMM-heavy MoE layers and the less cooperative M or
R surfaces, and it overstates what a mixed stack can recover from FP8 after
dispatch, wrapper, and recompute overhead are counted.
Second, FP8 on scan-heavy Mamba-family work was not a compelling public story. The profitable part of FP8 is the GEMM, but those paths also carry non-GEMM work that does not automatically get cheaper when one internal matmul changes format.
That is the step-time version of Amdahl's law: once FP8 shrinks the GEMM share, the same scan, routing, and memory-bound surfaces become a larger fraction of the step. A kernel-local win can therefore be real and still disappear at the job level after the unchanged parts of the stack reclaim the critical path. For that accounting view, Training speed by feature is the better companion than any one kernel receipt.
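The Amdahl's-law accounting can be made concrete with a small calculator. The function name `step_speedup` is hypothetical, and the fractions used in the note after the code are illustrative inputs, not measurements from this stack.

```python
def step_speedup(gemm_fraction: float, gemm_speedup: float) -> float:
    """End-to-end step speedup when only the GEMM share gets faster.

    gemm_fraction: share of step time spent in the accelerated GEMMs (0..1).
    gemm_speedup:  kernel-local speedup of those GEMMs (> 0).
    """
    if not 0.0 <= gemm_fraction <= 1.0:
        raise ValueError("gemm_fraction must be in [0, 1]")
    if gemm_speedup <= 0.0:
        raise ValueError("gemm_speedup must be positive")
    # Unchanged work keeps its cost; accelerated work shrinks by the speedup.
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)
```

With an assumed 40% GEMM share and a 2x kernel win, the step improves by 1.25x. With a 10% share, as on a hypothetical scan-dominated block, the same 2x kernel win yields only about 1.05x, before any cast or dispatch overhead is subtracted.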
The hardware trap is that Hopper-class FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 Reference: Megatron FLCE on Hopper fast paths still want healthy GEMM
shapes. Once the live M dimension gets small, ragged, or scan-adjacent, the
lane starts paying for casts, padding, and data movement that do not show up in
the optimistic Tensor Core headline. A WGMMA-capable kernel can therefore be
technically "on the FP8 path" while still losing end-to-end because the real
surface is small-M or memory-bound. Mamba 3 kernel journey
and Training speed anatomy H200 are the
better neighbors when that boundary is the real story.
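The small-M padding cost can be estimated with simple arithmetic. This sketch assumes rows are padded up to a multiple of a fixed tile height; `tile_utilization` is a hypothetical helper and the default `tile_m=64` is an illustrative value, since real kernels choose their own tile shapes.

```python
def tile_utilization(m: int, tile_m: int = 64) -> float:
    """Fraction of padded rows that carry real work.

    m:      live M dimension of the GEMM (rows of useful data).
    tile_m: rows per tile the kernel pads up to (illustrative default).
    """
    if m <= 0 or tile_m <= 0:
        raise ValueError("m and tile_m must be positive")
    padded_rows = -(-m // tile_m) * tile_m  # ceiling to a tile multiple
    return m / padded_rows
```

Under this assumption, `m=8` uses 12.5% of the padded rows and `m=65` spills into a second tile at about 51% utilization. In both cases the kernel is nominally on the FP8 path while most of the issued math is padding.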
Third, FP8 plus every other advanced runtime toggle at once is not a clean public claim. CUDA-graph policy, selective recompute, and FP8 ownership can all interact. The stable public-safe story is the one where each ownership boundary is named explicitly.
What ships, what stays explicit
| Surface | Status | Public-safe summary |
|---|---|---|
| DeepGEMM grouped MoE FP8 GEMMs | kept | good fit for expert-side GEMM-heavy work |
| torchao Float8Linear on selected dense linears | kept | useful where the module boundary and sharding order stay honest |
| Transformer Engine-owned FP8 checkpointing | kept | recompute ownership has to follow FP8 block ownership |
| Blanket all-layer FP8 | rolled back | too coarse for a mixed A/E/M/R stack |
| FP8 as the main story for Mamba-family scan paths | rolled back | scan-heavy work is not the same as a GEMM-only lane |
| "FP8 everywhere plus every advanced toggle" | rolled back | too many interacting ownership boundaries for one clean public claim |
The useful end state is therefore selective. BF16 still describes the training lane at the model level. FP8 describes a set of retained kernel-local or module-local optimizations inside that lane.
Frequently asked questions
Did full-model FP8 become the default story for this stack?
Which checked-in file shows the FP8 dispatch hazard fastest?
Why does Transformer Engine show up in an FP8 rollout article at all?
Why can TE plus CUDA graphs still regress instead of win?
Do newer Transformer Engine releases remove that graph-capture caveat?
What order keeps torchao FP8 compatible with sharding and compile?
… device="meta", run convert_to_float8_training, apply fully_shard, and only then hand the model to compile. After the optimizer step, precompute_float8_dynamic_scale_for_fsdp is the matching maintenance pass; it updates float8 scales in one collective instead of rediscovering them layer by layer.
Why is the FSDP2 pre-all-gather seam part of the FP8 decision?
Why keep DeepGEMM and torchao as separate FP8 lanes instead of picking one?
… torchao fits ordinary dense linears and shard-aware PyTorch module boundaries more naturally. Treating them as interchangeable hides the real MoE-versus-dense split in the rollout.
What proves an FP8 win survived compile tax and cold start?
Why can Hopper WGMMA still lose on small-M or scan-heavy work?
… M dimensions. Once the kernel starts padding small M, paying cast overhead, or waiting on scan-style data movement, the Tensor Core headline stops predicting end-to-end speed.
Why can faster FP8 GEMMs still leave end-to-end throughput almost unchanged?
What is the public-safe way to summarize the real NAM56R-like hybrid lane?
Does this article imply the same model ships in NVFP4?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Hopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.
Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.
A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.
NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.
Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.
Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.
A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…
CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.
Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.
NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.
NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…