Sparse MLA dimension generalization
Why SparseMLA kernels that hardcode DeepSeek-sized dimensions fail to scale down cleanly to NAM56R-style shapes, and what a generalized contract changes.

This class of bug looks like a numerical issue from the outside, but the first
real failure is simpler: one kernel family assumes a fixed dimension contract.
Here SparseMLA means the sparse multi-head latent attention path, and the
contract that matters first is shape plumbing, not new math.
The key shape words are small but important: d_total is the latent-side QK
width the sparse path actually receives, and d_v is the value width that has
to survive both forward and backward unchanged. In the checked-in example,
d_total is computed from the MLA latent rank plus the RoPE slice rather than
being smuggled in as a magic constant.
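A minimal sketch of that derivation follows. The class and field names (SparseMLAShape, kv_latent_rank, rope_dim) are hypothetical and not taken from the checked-in example, and the 512 + 64 split is an assumption chosen so the result matches the DeepSeek-shaped d_total=576 quoted later in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparseMLAShape:
    """Hypothetical shape contract; names are illustrative only."""

    kv_latent_rank: int  # width of the compressed latent path
    rope_dim: int        # width of the RoPE-carrying slice
    d_v: int             # value width that forward and backward must preserve

    @property
    def d_total(self) -> int:
        # Latent-side QK width the sparse path receives: derived, never typed in.
        return self.kv_latent_rank + self.rope_dim


# Assumed split for a DeepSeek-shaped lane (512 latent + 64 RoPE).
deepseek_like = SparseMLAShape(kv_latent_rank=512, rope_dim=64, d_v=512)
print(deepseek_like.d_total)
```

Because d_total is a property rather than a stored field, changing the latent rank changes the QK width everywhere it is read, which is the point of deriving it instead of hardcoding it.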
Compile/launch parity here means only that the fused path admits and launches
the target shape; it does not claim convergence parity or a full numerical
proof by itself.
If a SparseMLA path hardcodes one set of dimensions for QK and V channels, it can pass on one model family and fail on another even when the algorithm is the same. That is why the public example compares a DeepSeek-shaped hardcoded lane with a generalized lane that accepts smaller NAM56R-style dimensions. The same boundary discipline appears in Public MLA integration patterns for Megatron and the kernel-side follow-up Fused MLA on Hopper and Blackwell: projection, RoPE, and the KV cache that ships. The checked-in Sparse MLA dimension generalization example is the fastest way to see the exact DeepSeek-shaped lane versus NAM56R-shaped lane comparison.
What the generalized path is actually fixing
The generalized path is not inventing a new kernel idea. It is removing fixed dimension assumptions from the contract surface and threading the real values through forward and backward plumbing. That makes it the sparse-side version of the same contract cleanup described in shared MLA adapter boundaries: one path becomes reusable only after hidden assumptions stop masquerading as architecture.
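To make "threading the real values through forward and backward plumbing" concrete, here is a shape-only sketch. It assumes a simplified tensor layout (tokens, heads, top-k selected keys) and hypothetical function names; it does not reproduce the real kernel signatures. It contrasts backward plumbing that still bakes in 576 and 512 with plumbing that receives d_total and d_v as arguments.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def forward_shapes(tokens: int, heads: int, topk: int,
                   d_total: int, d_v: int) -> Dict[str, Shape]:
    """Shapes a sparse forward pass would see (illustrative layout)."""
    return {
        "q": (tokens, heads, d_total),   # queries at latent-side QK width
        "kv": (tokens, topk, d_total),   # selected keys per query token
        "out": (tokens, heads, d_v),     # output at value width
    }


def hardcoded_backward_shapes(tokens: int, heads: int, topk: int) -> Dict[str, Shape]:
    """Backward plumbing that still assumes one dimension family."""
    return {
        "dq": (tokens, heads, 576),
        "dkv": (tokens, topk, 576),
        "d_out": (tokens, heads, 512),
    }


def threaded_backward_shapes(tokens: int, heads: int, topk: int,
                             d_total: int, d_v: int) -> Dict[str, Shape]:
    """Backward plumbing that receives the same values as forward."""
    return {
        "dq": (tokens, heads, d_total),
        "dkv": (tokens, topk, d_total),
        "d_out": (tokens, heads, d_v),
    }


def grads_match(fwd: Dict[str, Shape], bwd: Dict[str, Shape]) -> bool:
    """Every gradient must have the shape of the tensor it belongs to."""
    return (bwd["dq"] == fwd["q"]
            and bwd["dkv"] == fwd["kv"]
            and bwd["d_out"] == fwd["out"])


small_fwd = forward_shapes(4, 8, 16, d_total=128, d_v=64)
small_hardcoded_ok = grads_match(small_fwd, hardcoded_backward_shapes(4, 8, 16))
small_threaded_ok = grads_match(
    small_fwd, threaded_backward_shapes(4, 8, 16, d_total=128, d_v=64))
print(small_hardcoded_ok, small_threaded_ok)
```

The hardcoded variant only lines up when the forward pass happens to use the original shape family, so the mismatch surfaces as a shape error long before any numerical question arises.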
That distinction matters. Once d_total and d_v become parameters instead of
magic constants, the same sparse MLA idea can survive outside the original
shape family it was first authored around. For the FP8-oriented branch, see
Sparse MLA FP8 dispatch. If you want the
broader model-layout reason this matters, SLM architecture
is the architectural companion: block families stay meaningful only if their
dimension contracts are explicit enough to survive a different model shape.
The checked-in example is already explicit enough to make the bug concrete.
DEEPSEEK_STYLE carries d_total=576 and d_v=512, while NAM56R_STYLE
carries d_total=128 and d_v=64. The whole point of the generalized path is
that both shapes survive the same forward and backward contract instead of
forcing the smaller lane onto an unfused fallback.
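The lane comparison can be sketched as two admission checks. The dictionary representation and the function names are assumptions for illustration; only the four dimension values come from the checked-in example as quoted above. The d_v <= d_total rule in the generalized check is also an illustrative assumption, not a documented constraint.

```python
DEEPSEEK_STYLE = {"d_total": 576, "d_v": 512}
NAM56R_STYLE = {"d_total": 128, "d_v": 64}


def hardcoded_admits(shape: dict) -> bool:
    """Mimics a kernel family that bakes in one dimension contract."""
    return shape["d_total"] == 576 and shape["d_v"] == 512


def generalized_admits(shape: dict) -> bool:
    """Accepts any positive widths where the value width fits in d_total.

    The d_v <= d_total bound is an assumed sanity rule for this sketch.
    """
    return 0 < shape["d_v"] <= shape["d_total"]


def pick_lane(shape: dict, admits) -> str:
    # A shape the fused path does not admit lands on the unfused fallback.
    return "fused" if admits(shape) else "unfused_fallback"


for name, shape in (("DEEPSEEK_STYLE", DEEPSEEK_STYLE),
                    ("NAM56R_STYLE", NAM56R_STYLE)):
    print(name,
          pick_lane(shape, hardcoded_admits),
          pick_lane(shape, generalized_admits))
```

Under the hardcoded check only the DeepSeek-shaped entry stays fused; under the generalized check both do, which is compile/launch admission only and says nothing yet about numerical parity.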
That is still narrower than saying every dimension family wants the same
kernel geometry. The checked-in config surfaces reach the smaller lane by
changing latent rank and value width, not by pretending a d_total=128 path
and a d_total=576 path should share one launch policy forever. Once the
contract is honest, the next layer of work is dispatch policy: tile choices,
precision paths, and sparse top-k layouts can stay shape-sensitive without
turning shape support back into a hidden constant. The DSA-specific version of
top-k reuse lives in DSA index-cache patch, which
is a separate cache/reuse boundary rather than this SparseMLA shape-plumbing
fix.
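One way to keep dispatch policy shape-sensitive without reintroducing hidden constants is to make the policy a function of the explicit contract. The sketch below is hypothetical throughout: LaunchPolicy, its fields, the 512 threshold, and the tile numbers are untuned placeholders, not values from any real kernel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchPolicy:
    """Hypothetical launch knobs; values below are untuned placeholders."""

    block_topk: int   # how many selected keys one tile processes
    num_stages: int   # software pipelining depth


def select_policy(d_total: int, d_v: int) -> LaunchPolicy:
    """Choose launch geometry from the explicit shape contract."""
    if d_total <= 0 or not 0 < d_v <= d_total:
        # Shape support is decided here, in the open, not inside the kernel.
        raise ValueError(f"unsupported shape: d_total={d_total}, d_v={d_v}")
    if d_total >= 512:
        # Wide lanes: smaller tiles to limit per-tile memory (placeholder).
        return LaunchPolicy(block_topk=32, num_stages=2)
    # Narrow lanes: larger tiles and deeper pipelining (placeholder).
    return LaunchPolicy(block_topk=64, num_stages=3)


wide = select_policy(d_total=576, d_v=512)
narrow = select_policy(d_total=128, d_v=64)
print(wide, narrow)
```

The two lanes get different geometry, but both are admitted by the same contract, so a new shape family changes the policy table rather than the kernel's assumptions.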
Frequently asked questions
What was the real bug here?
Why does parameterizing dimensions matter so much?
With real d_total and d_v values instead of hidden constants, the same sparse MLA logic can survive on smaller or different model shapes without silent drift.
Why not just stay on the unfused fallback for smaller shapes?
Why does this article stop at compile/launch parity?
What should be measured after shape plumbing works?
How does this relate to the FP8 sparse MLA path?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
NAM56R: A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.
MLA: Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.
RoPE: Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.
DSA: DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.
Sparse top-k: The sparse-selection stage that keeps only a bounded key set before the later score or masking path runs.
KV cache: The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.
Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
FP8: Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.