MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
Published · 2 min read · David Gornshtein
Sparse MLA
FP8
Transformer Engine
Dispatch

Sparse MLA FP8 dispatch

Why SparseMLA needs an FP8-aware dispatch contract when Transformer Engine wrappers hide FP8 storage behind a bf16-looking logical surface.


The failure here is not just "FP8 is hard." The real problem is that generic dispatch logic can mistake a wrapper type for an ordinary dense tensor.

In the SparseMLA path, that matters because the wrapper can report a logical bf16-looking surface while the real storage is FP8 and the raw pointer surface is not what a naive kernel dispatch expects. That creates two bad outcomes:

  • hard failure, if the kernel reaches a NULL-facing pointer surface
  • silent downgrade, if the wrapper gets dequantized implicitly and the bf16 path runs instead of the requested FP8 path
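
The silent-downgrade case can be reproduced with a toy model. The sketch below is not Transformer Engine code: `DenseTensor`, `QuantizedWrapper`, the `storage_dtype` attribute, and the kernel names are hypothetical stand-ins chosen for illustration. It shows a chooser that branches only on the logical dtype taking the bf16 lane for an FP8-backed wrapper, next to a storage-aware chooser that does not.

```python
from dataclasses import dataclass


@dataclass
class DenseTensor:
    """Hypothetical ordinary tensor: logical dtype and storage agree."""
    dtype: str = "bf16"


@dataclass
class QuantizedWrapper:
    """Hypothetical wrapper: reports bf16 logically, stores FP8 underneath."""
    storage_dtype: str = "fp8_e4m3"
    dtype: str = "bf16"  # the logical surface a generic chooser sees


def naive_dispatch(t) -> str:
    # Branches on the logical dtype only, so the wrapper looks dense.
    if t.dtype == "bf16":
        return "bf16_kernel"
    raise TypeError(f"unsupported logical dtype: {t.dtype}")


def storage_aware_dispatch(t) -> str:
    # Check the real storage first; use the dense lane only for dense inputs.
    storage = getattr(t, "storage_dtype", None)
    if storage is not None and storage.startswith("fp8"):
        return "fp8_kernel"
    if t.dtype == "bf16":
        return "bf16_kernel"
    raise TypeError(f"unsupported logical dtype: {t.dtype}")


wrapped = QuantizedWrapper()
print(naive_dispatch(wrapped))                # bf16_kernel (the silent downgrade)
print(storage_aware_dispatch(wrapped))        # fp8_kernel
print(storage_aware_dispatch(DenseTensor()))  # bf16_kernel
```

The two choosers agree on genuine dense inputs and differ only on the wrapper, which is why the bug is easy to miss in tests that only cover the bf16 lane.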

The public-safe proof surfaces are the Sparse MLA FP8 dispatch example, MLA integration pattern, and Shared MLA adapter seam. Read this next to Sparse MLA dimension generalization and Transformer Engine bridge on NVIDIA, because a dispatch contract is only worth optimizing after the shape contract and the wrapper boundary are both honest.

Why the explicit FP8 path matters

The dequantize fallback is a real fix for correctness, but it is not the same thing as an FP8-aware runtime contract. A fallback path pays extra movement and can silently erase the reason FP8 was enabled in the first place. The better public design is to keep dispatch honest: if the input is a quantized FP8 wrapper, route it to the FP8-capable kernel surface explicitly.
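
One way to keep dispatch honest is to make the dequantize fallback opt-in rather than implicit. The following is a minimal sketch under that assumption; `choose_lane`, its parameters, and the lane names are hypothetical and do not come from the checked-in example.

```python
def choose_lane(input_is_fp8: bool,
                fp8_kernel_available: bool,
                allow_dequant_fallback: bool = False) -> str:
    """Pick a kernel lane without ever downgrading silently."""
    if not input_is_fp8:
        # Genuine dense inputs keep the ordinary bf16 lane.
        return "bf16"
    if fp8_kernel_available:
        # Quantized input goes to the FP8-capable kernel explicitly.
        return "fp8"
    if allow_dequant_fallback:
        # The caller asked for the correctness fallback and accepts its cost.
        return "bf16_after_dequantize"
    raise RuntimeError(
        "FP8 input but no FP8 kernel available; "
        "refusing an implicit dequantize fallback"
    )


print(choose_lane(input_is_fp8=True, fp8_kernel_available=True))
print(choose_lane(input_is_fp8=True, fp8_kernel_available=False,
                  allow_dequant_fallback=True))
```

With this shape, a run that requested FP8 either gets the FP8 lane, gets a clearly named fallback lane, or fails loudly. It never ends up on the bf16 path unannounced.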

The research-backed seam worth keeping visible is the split payload boundary. In practice the safe handoff is not "one logical tensor goes downstream" but "storage payload plus scale metadata go downstream." That keeps the wrapper-specific unwrap at the adapter boundary, gives the lower CUDA or Triton chooser the real storage facts it needs, and makes it harder for a later refactor to trigger an accidental dequantize fallback.

The safest place to do that unwrap is the Python autograd boundary, before a generic pybind or C++ signature collapses everything back into one logical tensor again. That keeps the wrapper-specific logic narrow: extract the real FP8 storage plus scale metadata once, hand the lower kernel chooser an honest payload, and leave the ordinary bf16 lane intact for genuine dense inputs. The checked-in Sparse MLA FP8 dispatch example keeps that boundary visible on purpose, while Precision recipe: FP16, BF16, FP8, NVFP4 and NVFP4 inference cover the separate low-precision serving lane.

That is the same basic engineering rule as the CUDA-graph and TileLang examples already in this pack. The bug is not abstract. The bug is that one runtime surface lies about the contract another runtime surface actually needs.
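
The split handoff described above can be sketched in plain Python. Everything here is illustrative: `FakeFp8Wrapper`, its `raw` and `scale_inv` attributes, and the payload classes are hypothetical names, not the Transformer Engine API or the checked-in example. The point is only that the unwrap happens once at the boundary, and the lower chooser branches on an explicit payload type instead of a logical dtype.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Fp8Payload:
    storage: bytes    # raw one-byte-per-element FP8 storage
    scale_inv: float  # metadata needed to interpret the storage


@dataclass(frozen=True)
class DensePayload:
    values: Tuple[float, ...]  # stand-in for a genuine dense bf16 tensor


class FakeFp8Wrapper:
    """Hypothetical wrapper; attribute names are illustrative only."""

    def __init__(self, raw: bytes, scale_inv: float):
        self.raw = raw
        self.scale_inv = scale_inv


Payload = Union[Fp8Payload, DensePayload]


def unwrap_at_boundary(x) -> Payload:
    # The only place that knows about the wrapper type.
    if isinstance(x, FakeFp8Wrapper):
        return Fp8Payload(storage=x.raw, scale_inv=x.scale_inv)
    return DensePayload(values=tuple(float(v) for v in x))


def lower_chooser(p: Payload) -> str:
    # Branches on the explicit payload type, never on a logical dtype.
    return "fp8_kernel" if isinstance(p, Fp8Payload) else "bf16_kernel"


fp8_in = unwrap_at_boundary(FakeFp8Wrapper(b"\x38\x40", scale_inv=0.5))
dense_in = unwrap_at_boundary([1.0, 2.0])
print(lower_chooser(fp8_in), lower_chooser(dense_in))  # fp8_kernel bf16_kernel
```

Because the lower layer never receives the wrapper itself, a later refactor there has no object on which an implicit dequantize could be triggered.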

FAQ

Frequently asked questions

Why split the handoff into storage payload plus scale metadata?
Because the single logical tensor is exactly where the wrapper hides the real contract. The split handoff keeps the low-level chooser honest about what it is branching on: real FP8 storage, the scale metadata that makes it numerically usable, and a separate dense fallback lane for genuine bf16 inputs.
Should the scale broadcast stay in Python forever?
Not necessarily. Python-side expansion is a good compatibility seam because it makes the kernel contract explicit and keeps wrapper handling out of the fast path. At the PyTorch level, expand() itself is just a view and does not allocate new memory. But it should be treated as a seam, not the final optimization target: once the FP8 dispatch contract is stable, the better long-term shape is usually to keep scale handling as lightweight metadata or a kernel-side broadcast instead of materializing a larger per-token buffer just to satisfy a signature.
Why is checking the logical dtype not enough for dispatch?
Because the checked-in Sparse MLA FP8 dispatch example shows the exact trap: the wrapper can report a logical bf16-looking surface while the real storage dtype and pointer surface still belong to FP8. Branching only on the logical dtype is how a generic chooser ends up taking the bf16 lane for a wrapper-backed FP8 tensor.
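
On the scale-broadcast question above, a small pure-Python sketch shows why a materialized per-token buffer adds nothing numerically when the scale is per-tensor. The function names and the toy integer "quantized" values are assumptions for illustration; real FP8 decoding is not modeled here.

```python
def dequantize_scalar_scale(q_rows, scale_inv):
    # One scalar travels as metadata and is broadcast inside the loop,
    # which stands in for a kernel-side broadcast.
    return [[q * scale_inv for q in row] for row in q_rows]


def dequantize_materialized(q_rows, per_token_scale_inv):
    # Needs a buffer with one scale entry per token just to fit the signature.
    if len(per_token_scale_inv) != len(q_rows):
        raise ValueError("need exactly one scale per token")
    return [[q * s for q in row]
            for row, s in zip(q_rows, per_token_scale_inv)]


q_rows = [[2, 4], [6, 8], [10, 12]]   # toy quantized values, 3 tokens
scale_inv = 0.5
expanded = [scale_inv] * len(q_rows)  # materialized per-token buffer

a = dequantize_scalar_scale(q_rows, scale_inv)
b = dequantize_materialized(q_rows, expanded)
print(a == b, len(expanded))  # True 3
```

The results match, but the materialized buffer grows with the token count while the scalar does not.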
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

MLA

Multi-head Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

NVFP4

NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.
