MegaCpp Engineering: Applied C++ model systems

Grounded engineering note from the MegaCpp stack

Published April 19, 2026 • 3 min read • David Gornshtein

Sparse MLA dimension generalization

Why SparseMLA kernels that hardcode DeepSeek-sized dimensions fail to scale down cleanly to NAM56R-style shapes, and what a generalized contract changes.

By David Gornshtein

This class of bug looks like a numerical issue from the outside, but the first real failure is simpler: one kernel family assumes a fixed dimension contract. Here SparseMLA means the sparse multi-head latent attention path, and the contract that matters first is shape plumbing, not new math. The key shape words are small but important: d_total is the latent-side QK width the sparse path actually receives, and d_v is the value width that has to survive both forward and backward unchanged. In the checked-in example, d_total is computed from the MLA latent rank plus the RoPE slice rather than being smuggled in as a magic constant. Compile/launch parity here means only that the fused path admits and launches the target shape; it does not claim convergence parity or a full numerical proof by itself.
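As a minimal sketch of what "computed rather than smuggled in" can look like, the snippet below derives d_total from a latent rank and a RoPE slice width. The class name `SparseMLAShape`, its field names, and the specific rank/RoPE splits are illustrative assumptions; the article only states the resulting totals for its two configs (d_total=576, d_v=512 and d_total=128, d_v=64), not how each total is split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparseMLAShape:
    """Hypothetical shape record; not the checked-in example's actual type."""
    latent_rank: int  # compressed latent width
    rope_dim: int     # width of the RoPE-carrying slice
    d_v: int          # value width seen by forward and backward

    @property
    def d_total(self) -> int:
        # Derived, never stored: latent rank plus the RoPE slice.
        return self.latent_rank + self.rope_dim


# The splits below are assumptions chosen to reproduce the totals the
# article quotes; only d_total and d_v come from the article itself.
DEEPSEEK_STYLE = SparseMLAShape(latent_rank=512, rope_dim=64, d_v=512)
NAM56R_STYLE = SparseMLAShape(latent_rank=64, rope_dim=64, d_v=64)

for name, shape in (("DEEPSEEK_STYLE", DEEPSEEK_STYLE),
                    ("NAM56R_STYLE", NAM56R_STYLE)):
    print(name, "d_total =", shape.d_total, "d_v =", shape.d_v)
```

Keeping d_total as a derived property means a config change to the latent rank or RoPE width cannot silently disagree with the width the kernel is told to expect.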

If a SparseMLA path hardcodes one set of dimensions for QK and V channels, it can pass on one model family and fail on another even when the algorithm is the same. That is why the public example compares a DeepSeek-shaped hardcoded lane with a generalized lane that accepts smaller NAM56R-style dimensions. The same boundary discipline appears in "Public MLA integration patterns for Megatron" and the kernel-side follow-up "Fused MLA on Hopper and Blackwell: projection, RoPE, and the KV cache that ships". The checked-in Sparse MLA dimension generalization example is the fastest way to see the exact DeepSeek-shaped lane versus NAM56R-shaped lane comparison.

What the generalized path is actually fixing

The generalized path is not inventing a new kernel idea. It is removing fixed dimension assumptions from the contract surface and threading the real values through forward and backward plumbing. That makes it the sparse-side version of the same contract cleanup described in shared MLA adapter boundaries: one path becomes reusable only after hidden assumptions stop masquerading as architecture.

That distinction matters. Once d_total and d_v become parameters instead of magic constants, the same sparse MLA idea can survive outside the original shape family it was first authored around. For the FP8-oriented branch, see Sparse MLA FP8 dispatch. If you want the broader model-layout reason this matters, SLM architecture is the architectural companion: block families stay meaningful only if their dimension contracts are explicit enough to survive a different model shape.

The checked-in example is already explicit enough to make the bug concrete. DEEPSEEK_STYLE carries d_total=576 and d_v=512, while NAM56R_STYLE carries d_total=128 and d_v=64. The whole point of the generalized path is that both shapes survive the same forward and backward contract instead of forcing the smaller lane onto an unfused fallback.
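The difference between the two lanes can be sketched with a pure-Python reference forward for a single query and head. This is not the fused kernel and not the checked-in example: the function names are hypothetical, the convention that values are the first d_v channels of each KV row is an assumption, and the default softmax scale is a placeholder. The point is only that the generalized function reads d_total from its inputs and takes d_v as an argument, while the hardcoded wrapper rejects anything that is not 576/512.

```python
import math


def sparse_mla_forward(q, kv, indices, d_v, scale=None):
    """Reference forward for one query over a sparse key set.

    q: list of d_total floats; kv: list of rows, each d_total floats;
    indices: row ids kept by the sparse selection stage.
    Assumption: values are the first d_v channels of each KV row.
    """
    d_total = len(q)  # taken from the input, not from a constant
    if not 0 < d_v <= d_total:
        raise ValueError(f"d_v={d_v} must be in 1..d_total={d_total}")
    rows = [kv[i] for i in indices]
    if not rows:
        raise ValueError("need at least one selected key")
    if any(len(row) != d_total for row in rows):
        raise ValueError("KV row width does not match d_total")
    if scale is None:
        scale = 1.0 / math.sqrt(d_total)  # placeholder scale, an assumption
    scores = [scale * sum(a * b for a, b in zip(q, row)) for row in rows]
    peak = max(scores)  # subtract the max for a stable softmax
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    return [sum(w * row[c] for w, row in zip(weights, rows)) / norm
            for c in range(d_v)]


def sparse_mla_forward_hardcoded(q, kv, indices):
    """The bug pattern: the shape contract lives in constants."""
    if len(q) != 576:
        raise ValueError("hardcoded lane only admits d_total=576")
    return sparse_mla_forward(q, kv, indices, d_v=512)


# NAM56R-sized inputs: admitted by the generalized lane only.
q_small = [0.01] * 128
kv_small = [[2.0] * 128 for _ in range(4)]
out_small = sparse_mla_forward(q_small, kv_small, [1, 3], d_v=64)
print(len(out_small))
```

Calling `sparse_mla_forward_hardcoded(q_small, kv_small, [1, 3])` raises `ValueError`, which is the admission failure that would otherwise push the smaller lane onto a fallback.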

That is still narrower than saying every dimension family wants the same kernel geometry. The checked-in config surfaces reach the smaller lane by changing latent rank and value width, not by pretending a d_total=128 path and a d_total=576 path should share one launch policy forever. Once the contract is honest, the next layer of work is dispatch policy: tile choices, precision paths, and sparse top-k layouts can stay shape-sensitive without turning shape support back into a hidden constant. The DSA-specific version of top-k reuse lives in DSA index-cache patch, which is a separate cache/reuse boundary rather than this SparseMLA shape-plumbing fix.
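One way to keep admission and dispatch separate is sketched below. Everything here is hypothetical: the `LaunchPolicy` fields, the tile and stage numbers, the 16-channel alignment rule, and the d_total threshold are placeholders for illustration, not values taken from the MegaCpp kernels. The sketch only shows the structure the paragraph describes, where the admission check is shape-generic and the launch policy is allowed to differ per shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchPolicy:
    """Hypothetical launch knobs; real kernels will expose different ones."""
    block_kv: int  # selected keys processed per tile
    stages: int    # pipeline depth


def admit(d_total: int, d_v: int) -> None:
    """Shape admission: generic rules only, no model-specific constants."""
    if d_total <= 0 or d_v <= 0 or d_v > d_total:
        raise ValueError(f"invalid shape: d_total={d_total}, d_v={d_v}")
    if d_total % 16 or d_v % 16:
        # Placeholder alignment rule, assumed for illustration.
        raise ValueError("widths must be multiples of 16 in this sketch")


def select_policy(d_total: int, d_v: int) -> LaunchPolicy:
    """Dispatch: shape-sensitive, but it never decides admission."""
    admit(d_total, d_v)
    if d_total >= 512:
        return LaunchPolicy(block_kv=64, stages=2)   # placeholder numbers
    return LaunchPolicy(block_kv=128, stages=3)      # placeholder numbers


wide = select_policy(576, 512)
narrow = select_policy(128, 64)
print(wide, narrow)
```

Both shapes are admitted, and each gets its own policy. Tuning the policy table later does not change which shapes the fused path accepts.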

Frequently asked questions

What was the real bug here?

Not a new attention algorithm failure. The real bug was a hardcoded dimension contract that only happened to match one original model family. The checked-in Sparse MLA dimension generalization example shows the exact before-and-after contract.

Why does parameterizing dimensions matter so much?

Because once the kernel takes real d_total and d_v values instead of hidden constants, the same sparse MLA logic can survive on smaller or different model shapes without silent drift.

Why not just stay on the unfused fallback for smaller shapes?

Because the fallback changes the runtime cost surface completely. This article's point is that NAM56R-sized shapes should not be forced off the fused path just because one earlier kernel family hardcoded DeepSeek-sized dimensions.

Why does this article stop at compile/launch parity?

Because the checked-in Sparse MLA dimension generalization example proves dimension plumbing, not full training parity or every precision-specific backward lane. The companion upstream-pack article keeps the same scope explicit: this fix is about admitting the real shape contract first, then measuring broader numerical parity separately.

What should be measured after shape plumbing works?

The next evidence should compare the generalized lane against the original fixed-shape lane on its native dimensions, then measure the smaller lane's tile choices, memory behavior, and backward tolerances separately. Shape support is the admission ticket, not proof that one dispatch policy is optimal everywhere.

How does this relate to the FP8 sparse MLA path?

The FP8 branch is a separate kernel and dispatch story. This article is about making the base sparse MLA contract shape-aware before you add precision-specific fast paths on top. Read the Sparse MLA FP8 dispatch checked-in example only after the shape contract is already honest.

Glossary

Terms used in this article

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

MLA

Multi-head Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

RoPE

Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

sparse top-k

The sparse-selection stage that keeps only a bounded key set before the later score or masking path runs.

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.
