Sparse MLA FP8 dispatch
Why SparseMLA needs an FP8-aware dispatch contract when Transformer Engine wrappers hide FP8 storage behind a bf16-looking logical surface.

The failure here is not just "FP8 is hard." The real problem is that generic dispatch logic can mistake a wrapper type for an ordinary dense tensor.
In the SparseMLA path, that matters because the wrapper can report a logical bf16-looking surface while the real storage is FP8 and the raw pointer surface is not what a naive kernel dispatch expects. That creates two bad outcomes:
- hard failure, if the kernel reaches a NULL-facing pointer surface
- silent downgrade, if the wrapper gets dequantized implicitly and the bf16 path runs instead of the requested FP8 path
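Both outcomes can be reproduced with a small model of the problem. The sketch below is illustrative only: `DenseTensor`, `QuantizedWrapper`, `naive_dispatch`, and the string dtypes are hypothetical stand-ins, not Transformer Engine or SparseMLA APIs, and a `None` pointer is used to model the NULL-facing surface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DenseTensor:
    """Hypothetical stand-in for an ordinary bf16 tensor."""
    dtype: str
    data_ptr: Optional[int]  # address a kernel would read from


@dataclass
class QuantizedWrapper:
    """Hypothetical stand-in for an FP8 wrapper with a bf16-looking surface."""
    dtype: str               # logical dtype reported to callers
    data_ptr: Optional[int]  # wrapper-level pointer; None models a NULL surface
    storage_dtype: str       # what is actually stored
    storage_ptr: int         # where the FP8 bytes really live

    def dequantize(self) -> DenseTensor:
        # Models an implicit conversion that produces a real bf16 copy.
        return DenseTensor(dtype="bfloat16", data_ptr=0xBEEF)


def naive_dispatch(x) -> str:
    """Chooses a kernel from the logical dtype alone."""
    if x.dtype == "bfloat16":
        if x.data_ptr is None:
            raise RuntimeError("bf16 kernel was handed a NULL data pointer")
        return "bf16_kernel"
    return "fp8_kernel"


wrapper = QuantizedWrapper("bfloat16", None, "float8_e4m3", 0x1000)

# Outcome 1: hard failure when the raw wrapper reaches the bf16 kernel.
try:
    naive_dispatch(wrapper)
    hard_failure = False
except RuntimeError:
    hard_failure = True

# Outcome 2: silent downgrade once something dequantizes the wrapper first.
downgraded_kernel = naive_dispatch(wrapper.dequantize())
```

In this model the FP8 branch of `naive_dispatch` is never reached for the wrapper, because the only field it inspects is the one that reports bf16.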
The public-safe proof surfaces are the Sparse MLA FP8 dispatch example, MLA integration pattern, and Shared MLA adapter seam. Read this next to Sparse MLA dimension generalization and Transformer Engine bridge on NVIDIA, because a dispatch contract is only worth optimizing after the shape contract and the wrapper boundary are both honest.
Why the explicit FP8 path matters
The dequantize fallback is a real fix for correctness, but it is not the same thing as an FP8-aware runtime contract. A fallback path pays extra movement and can silently erase the reason FP8 was enabled in the first place. The better public design is to keep dispatch honest: if the input is a quantized FP8 wrapper, route it to the FP8-capable kernel surface explicitly.
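One way to picture that routing rule is a chooser that tests for the wrapper type before it ever looks at the logical dtype, and that only takes the dequantize lane when a caller has opted in. This is a minimal sketch under stated assumptions: the class names, the kernel-lane strings, and the `allow_dequantize_fallback` flag are all hypothetical, not part of any real library.

```python
from dataclasses import dataclass


@dataclass
class DenseInput:
    """Hypothetical genuine bf16 tensor."""
    dtype: str = "bfloat16"


@dataclass
class Fp8WrapperInput:
    """Hypothetical quantized wrapper: logical dtype says bf16, storage is FP8."""
    dtype: str = "bfloat16"
    storage_dtype: str = "float8_e4m3"


def choose_kernel(x, fp8_kernel_available: bool = True,
                  allow_dequantize_fallback: bool = False) -> str:
    """Return the name of the kernel lane to run (names are illustrative)."""
    if isinstance(x, Fp8WrapperInput):
        # Route on the wrapper type first; x.dtype would claim bf16 here.
        if fp8_kernel_available:
            return "sparse_mla_fp8"
        if allow_dequantize_fallback:
            # The fallback still exists, but only as a visible, opted-in lane.
            return "dequantize_then_sparse_mla_bf16"
        raise RuntimeError("FP8 input but no FP8 kernel; fallback not opted in")
    if x.dtype == "bfloat16":
        return "sparse_mla_bf16"
    raise TypeError(f"unsupported input dtype: {x.dtype}")
```

The design choice here is that the fallback is never the default: a wrapper either reaches the FP8 lane or the caller sees an explicit error or an explicitly named fallback lane.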
The research-backed seam worth keeping visible is the split payload boundary. In practice the safe handoff is not "one logical tensor goes downstream" but "storage payload plus scale metadata go downstream." That keeps the wrapper-specific unwrap at the adapter boundary, gives the lower CUDA or Triton chooser the real storage facts it needs, and makes it harder for a later refactor to trigger an accidental dequantize fallback.
The safest place to do that unwrap is the Python autograd boundary, before a generic pybind or C++ signature collapses everything back into one logical tensor again. That keeps the wrapper-specific logic narrow: extract the real FP8 storage plus scale metadata once, hand the lower kernel chooser an honest payload, and leave the ordinary bf16 lane intact for genuine dense inputs. The checked-in Sparse MLA FP8 dispatch example keeps that boundary visible on purpose, while Precision recipe: FP16, BF16, FP8, NVFP4 and NVFP4 inference cover the separate low-precision serving lane.
That is the same basic engineering rule as the CUDA-graph and TileLang examples already in this pack. The bug is not abstract. The bug is that one runtime surface lies about the contract another runtime surface actually needs.
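The split payload and the single unwrap point can be sketched in plain Python. Everything named here is an assumption for illustration: `Fp8Payload`, `Fp8Wrapper`, its `_data` and `_scale_inv` fields, `DenseInput`, and the lane names are hypothetical, and in a real integration the entry function would presumably sit at the Python autograd boundary rather than being a free function.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Fp8Payload:
    """What goes downstream: raw storage plus scale metadata, no wrapper."""
    storage: bytes          # raw FP8 bytes, one byte per element
    scale_inv: float        # dequantization scale carried as metadata
    shape: Tuple[int, ...]


@dataclass
class Fp8Wrapper:
    """Hypothetical quantized wrapper with a bf16-looking logical surface."""
    _data: bytes
    _scale_inv: float
    shape: Tuple[int, ...]
    dtype: str = "bfloat16"


@dataclass
class DenseInput:
    """Hypothetical genuine bf16 input."""
    shape: Tuple[int, ...]
    dtype: str = "bfloat16"


def unwrap_at_boundary(x):
    """Wrapper-specific logic lives here and nowhere below this point."""
    if isinstance(x, Fp8Wrapper):
        return Fp8Payload(x._data, x._scale_inv, x.shape)
    return x  # genuine dense inputs pass through untouched


def lower_chooser(payload) -> str:
    """Sees only honest payloads; a leaked wrapper is treated as a bug."""
    if isinstance(payload, Fp8Wrapper):
        raise TypeError("wrapper leaked past the adapter boundary")
    if isinstance(payload, Fp8Payload):
        return "fp8_lane"
    return "bf16_lane"


def sparse_mla_entry(kv) -> str:
    """Hypothetical Python-level entry point: unwrap once, then dispatch."""
    return lower_chooser(unwrap_at_boundary(kv))
```

Because `lower_chooser` rejects the wrapper type outright, a later refactor that bypasses `unwrap_at_boundary` fails loudly in this sketch instead of drifting into a dequantize path.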
Frequently asked questions
Why split the handoff into storage payload plus scale metadata?
Should the scale broadcast stay in Python forever?
expand() itself is just a view and does not allocate new memory. But it should be treated as a seam, not the final optimization target: once the FP8 dispatch contract is stable, the better long-term shape is usually to keep scale handling as lightweight metadata or a kernel-side broadcast instead of materializing a larger per-token buffer just to satisfy a signature.
Why is checking the logical dtype not enough for dispatch?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.
Transformer Engine: NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.
FP8: Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.
CUDA Graphs: CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.
NVFP4: NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.
TileLang: A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.