DSA indexer memory fix
Why MegaCpp replaces a memory-hungry DSA score path with a fused top-k scoring surface and treats that change as a systems fix, not just a kernel tweak.

Some attention fixes are really memory fixes in disguise. The DSA indexer path is one of them: in DeepSeek Sparse Attention, the later sparse top-k lane keeps only a bounded key set instead of needing the whole dense score slab. The argument here is the short companion to DSA indexer memory fix deep dive. If the score path materializes the wrong intermediate, the runtime spends memory on a tensor the later top-k logic did not actually need.
The public sample keeps the right lesson visible: fused top-k scoring is not only about speed. It is about removing an avoidable memory bill from the hot path, while still staying compatible with the graph-capture rules described in DSA and CUDA graph safety.
That is also why this post sits naturally beside Training speed by feature and Profiler and receipts: a memory fix only counts if the receipt shows the hot path actually got cheaper.
The checked-in near-copy makes the shape jump visible instead of hiding it
behind prose. The upstream-style helper builds an fp32 [sq, b, h, sk] slab
with einsum and only then collapses heads, while the fused helper reuses one
fp32 [b, sq, sk] output buffer and streams per-head bmm contributions into
it. DSA indexer memory checked-in example
is the shortest local proof of why the extra head axis is the real memory bill.
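The shape difference can be sketched in a few lines of NumPy. This is a minimal illustration, not the checked-in helper: the function names are hypothetical, and the scoring rule (ReLU on per-head dot products, then a per-head weight, then a sum over heads) is an assumption chosen so that both paths have something concrete to compute. The point is only which intermediate each path keeps resident.

```python
import numpy as np

def dense_index_scores(q, k, w):
    """Upstream-style sketch: materialize [sq, b, h, sk], then collapse heads.

    q: [sq, b, h, d], k: [sk, b, d], w: [sq, b, h] (assumed per-head weights).
    """
    slab = np.einsum("sbhd,tbd->sbht",
                     q.astype(np.float32), k.astype(np.float32))  # [sq, b, h, sk]
    slab = np.maximum(slab, 0.0) * w.astype(np.float32)[..., None]
    return slab.sum(axis=2).transpose(1, 0, 2)                     # [b, sq, sk]

def fused_index_scores(q, k, w):
    """Fused-style sketch: one [b, sq, sk] buffer, heads streamed into it."""
    sq, b, h, _ = q.shape
    sk = k.shape[0]
    out = np.zeros((b, sq, sk), dtype=np.float32)       # the only large resident
    k_t = k.astype(np.float32).transpose(1, 2, 0)       # [b, d, sk]
    for head in range(h):
        q_h = q[:, :, head, :].astype(np.float32).transpose(1, 0, 2)  # [b, sq, d]
        s_h = np.matmul(q_h, k_t)                       # batched matmul, [b, sq, sk]
        np.maximum(s_h, 0.0, out=s_h)                   # ReLU in place
        s_h *= w[:, :, head].astype(np.float32).T[:, :, None]
        out += s_h                                      # accumulate this head
    return out
```

Both functions return the same [b, sq, sk] result. The dense version peaks at h times that size because the head axis is alive until the final sum, while the fused version never holds more than one head's contribution next to the output buffer.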
The research-side sizing model is useful because it turns "large intermediate"
into an actual failure boundary. In a single-head B=1 illustration, the
materialized slab is already about 8.1 GB at L=65,536 and about 32.7 GB
at L=131,072. Holding K=1024 fixed, the fused lane's bounded outputs are
about 384 MB and 768 MB instead. Those are illustrative, not universal,
numbers, but they show why the problem arrives as a residency cliff rather than
as a small steady slowdown.
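The scaling behind those figures can be written down as a small sizing helper. This is a sketch under stated assumptions: the function names are hypothetical, the byte widths are placeholder defaults, and it is not claimed to reproduce the exact figures quoted above, which depend on dtype and bookkeeping choices. What it does show is the shape of the curve: the dense slab grows with the square of the sequence length, while a bounded top-k output grows linearly.

```python
def dense_slab_bytes(seq_len, batch=1, heads=1, bytes_per_score=4):
    """Bytes for a materialized [sq, b, h, sk] slab with sq == sk == seq_len."""
    return seq_len * batch * heads * seq_len * bytes_per_score

def bounded_topk_bytes(seq_len, top_k, batch=1,
                       bytes_per_value=4, bytes_per_index=4):
    """Bytes for a [b, sq, top_k] value buffer plus a matching index buffer."""
    return batch * seq_len * top_k * (bytes_per_value + bytes_per_index)

if __name__ == "__main__":
    for L in (65_536, 131_072):
        dense_gib = dense_slab_bytes(L) / 2**30
        fused_mib = bounded_topk_bytes(L, top_k=1024) / 2**20
        print(f"L={L}: dense slab {dense_gib:.1f} GiB, bounded top-k {fused_mib:.0f} MiB")
```

Doubling the sequence length quadruples the dense bill and only doubles the bounded one, which matches the roughly fourfold and twofold jumps in the numbers above. Batch and head count multiply straight into the dense term as well.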
The bounded footprint is not just about peak bytes. A fused lane that writes into fixed-size top-k and running-stats buffers gives capture and replay the same bounded buffer geometry every step, which is why this post keeps handing off to DSA and CUDA graph safety instead of treating graph safety as a separate afterthought.
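The fixed-geometry idea can be illustrated with a chunked top-k merge. This is a CPU-side NumPy sketch, not the fused kernel: the function name, the chunk size, and the single-batch single-head layout are all assumptions made for brevity, and running statistics are not modeled. It only shows that the output buffers are allocated once at [sq, top_k] and keep that shape no matter how many keys stream past.

```python
import numpy as np

def streaming_topk(q, k, top_k, chunk):
    """Keep the top_k key scores per query while scanning keys in chunks.

    q: [sq, d], k: [sk, d]. Returns (values, indices), each [sq, top_k].
    """
    sq, sk = q.shape[0], k.shape[0]
    # Fixed-size outputs: their shape never depends on sk.
    best_val = np.full((sq, top_k), -np.inf, dtype=np.float32)
    best_idx = np.full((sq, top_k), -1, dtype=np.int64)
    for start in range(0, sk, chunk):
        stop = min(start + chunk, sk)
        scores = (q @ k[start:stop].T).astype(np.float32)      # [sq, stop - start]
        idx = np.broadcast_to(np.arange(start, stop), scores.shape)
        # Temporary candidates are at most [sq, top_k + chunk], also bounded.
        cand_val = np.concatenate([best_val, scores], axis=1)
        cand_idx = np.concatenate([best_idx, idx], axis=1)
        keep = np.argpartition(-cand_val, top_k - 1, axis=1)[:, :top_k]
        best_val[:] = np.take_along_axis(cand_val, keep, axis=1)
        best_idx[:] = np.take_along_axis(cand_idx, keep, axis=1)
    return best_val, best_idx
```

The largest temporary here is [sq, top_k + chunk], so neither the outputs nor the scratch space scale with the full key length. A real GPU lane would hold the same invariant with preallocated device buffers rather than per-chunk NumPy temporaries.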
Frequently asked questions
Was this mainly a math fix or a memory fix?
Why mention CUDA graph safety in a memory article?
What does the long-context cliff look like in concrete terms?
In the single-head B=1 illustration, the materialized slab is about 8.1 GB at L=65,536 and about 32.7 GB at L=131,072. With K=1024 fixed, the fused lane stays around 384 MB and 768 MB instead. The exact numbers move with dtype, kept-k, and concurrency, but the systems lesson does not: once the dense slab survives long enough to span those axes, the failure mode stops being subtle.
Does this cliff depend only on context length?
No. Batch size and head count are also axes of the [sq, b, h, sk] slab, so more concurrent batch or more attention heads move the cliff left even before the rest of the stack changes. The checked-in near-copy keeps the shape story visible with the larger [sq, b, h, sk] resident, and the compact sample's memory helper makes the same scaling rule explicit by multiplying batch and head count straight into the dense score bill. That is the systems reason this bug can surface earlier under more concurrency even when the algorithm itself has not changed.
Does the fused lane still keep any state of its own?
Yes, but only the fixed-size top-k and running-stats buffers, not the [sq, b, h, sk] resident that made the original path fall off a memory cliff.
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
DSA (DeepSeek Sparse Attention): a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.
CUDA graphs: CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.
The mask tensor DSA updates after sparse top-k selection so only the chosen key positions remain available to the later score path.
Sparse top-k: the sparse-selection stage that keeps only a bounded key set before the later score or masking path runs.
Key set: the selected sparse key positions that survive routing and stay visible to the later score or mask update path.
H200: NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.