MegaCpp Engineering
Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published · 2 min read · David Gornshtein
DSA
Memory
Attention
H200

DSA indexer memory fix

Why MegaCpp replaces a memory-hungry DSA score path with a fused top-k scoring surface and treats that change as a systems fix, not just a kernel tweak.

Some attention fixes are really memory fixes in disguise. The DSA indexer path is one of them: in DeepSeek Sparse Attention, the later sparse top-k lane keeps only a bounded key set instead of needing the whole dense score slab. The argument here is the short companion to DSA indexer memory fix deep dive. If the score path materializes the wrong intermediate, the runtime spends memory on a tensor the later top-k logic did not actually need.

The public sample keeps the right lesson visible: fused top-k scoring is not only about speed. It is about removing an avoidable memory bill from the hot path, while still staying compatible with the graph-capture rules described in DSA and CUDA graph safety.

That is also why this post sits naturally beside Training speed by feature and Profiler and receipts: a memory fix only counts if the receipt shows the hot path actually got cheaper.

The checked-in near-copy makes the shape jump visible instead of hiding it behind prose. The upstream-style helper builds an fp32 [sq, b, h, sk] slab with einsum and only then collapses heads, while the fused helper reuses one fp32 [b, sq, sk] output buffer and streams per-head bmm contributions into it. DSA indexer memory checked-in example is the shortest local proof of why the extra head axis is the real memory bill.
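
The shape difference can be sketched in NumPy as a stand-in for the checked-in sample. Everything in this sketch is an assumption made for illustration: the names `dense_index_scores` and `fused_index_scores`, the per-head weights `w`, the ReLU-then-weighted-sum head collapse, and the use of `np.matmul` where the real helper would use a batched `bmm`. The only point it is meant to show is that the first helper holds a `[sq, b, h, sk]` intermediate, while the second never holds more than one head-wide `[b, sq, sk]` temporary next to its output buffer.

```python
import numpy as np


def dense_index_scores(q, k, w):
    """Upstream-style sketch: materialize [sq, b, h, sk], then collapse heads.

    q: [sq, b, h, d], k: [sk, b, d], w: [sq, b, h] (hypothetical head weights).
    """
    slab = np.einsum("sbhd,tbd->sbht", q, k)       # fp32 [sq, b, h, sk]
    slab = np.maximum(slab, 0.0) * w[..., None]    # assumed ReLU + head weight
    return slab.sum(axis=2).transpose(1, 0, 2)     # [b, sq, sk]


def fused_index_scores(q, k, w):
    """Fused-style sketch: stream per-head contributions into one buffer."""
    sq, b, h, _ = q.shape
    sk = k.shape[0]
    out = np.zeros((b, sq, sk), dtype=np.float32)  # running output, no head axis
    k_t = k.transpose(1, 2, 0)                     # [b, d, sk]
    for head in range(h):
        q_h = q[:, :, head, :].transpose(1, 0, 2)  # [b, sq, d]
        s = np.matmul(q_h, k_t)                    # one head only: [b, sq, sk]
        np.maximum(s, 0.0, out=s)                  # in-place ReLU
        s *= w[:, :, head].T[..., None]            # in-place head weight
        out += s
    return out


rng = np.random.default_rng(0)
sq, sk, b, h, d = 16, 32, 2, 4, 8
q = rng.standard_normal((sq, b, h, d), dtype=np.float32)
k = rng.standard_normal((sk, b, d), dtype=np.float32)
w = rng.random((sq, b, h), dtype=np.float32)

dense = dense_index_scores(q, k, w)
fused = fused_index_scores(q, k, w)
```

Both helpers return the same `[b, sq, sk]` surface up to floating-point summation order. What changes is the peak intermediate: the dense path pays for `h` copies of that surface at once, and the streamed path pays for one.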

The research-side sizing model is useful because it turns "large intermediate" into an actual failure boundary. In a single-head B=1 illustration, the materialized slab is already about 8.1 GB at L=65,536 and about 32.7 GB at L=131,072. Holding K=1024 fixed, the fused lane's bounded outputs are about 384 MB and 768 MB instead. Those are illustrative, not universal, numbers, but they show why the problem arrives as a residency cliff rather than as a small steady slowdown.
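
The sizing rule behind figures like these is plain arithmetic, sketched below. The function names and the byte widths are assumptions: the article does not state which dtypes its numbers use, so this sketch picks 2 bytes per dense score and a 2-byte value plus a 4-byte index per kept entry. Under those assumptions the bounded outputs come to 384 MiB and 768 MiB and the dense slab to 8 GiB and 32 GiB, which is the same range as the figures quoted above.

```python
import math


def dense_slab_bytes(sq, sk, b=1, h=1, score_bytes=2):
    """Bytes held by a materialized [sq, b, h, sk] score slab."""
    return sq * b * h * sk * score_bytes


def bounded_topk_bytes(sq, k_keep, b=1, value_bytes=2, index_bytes=4):
    """Bytes held by fixed [b, sq, k_keep] value and index outputs."""
    return b * sq * k_keep * (value_bytes + index_bytes)


def max_context_for_budget(budget_bytes, b=1, h=1, score_bytes=2):
    """Largest square context L whose dense slab fits in budget_bytes."""
    return math.isqrt(budget_bytes // (b * h * score_bytes))


GIB, MIB = 2**30, 2**20
for L in (65_536, 131_072):
    dense = dense_slab_bytes(L, L) / GIB
    fused = bounded_topk_bytes(L, 1024) / MIB
    print(f"L={L}: dense {dense:.0f} GiB, bounded {fused:.0f} MiB")
```

The dense term is quadratic in context length and linear in batch and heads, while the bounded term is linear in context length only. With a hypothetical 8 GiB budget for the slab, `max_context_for_budget(8 * GIB)` returns 65,536 for one head and `max_context_for_budget(8 * GIB, h=4)` returns 32,768, which is the cliff moving left as the head axis grows.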

The bounded footprint matters for more than peak bytes. A fused lane that writes into fixed-size top-k and running-stats buffers gives capture and replay the same buffer geometry every step, which is why this post keeps handing off to DSA and CUDA graph safety instead of treating graph safety as an afterthought.
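
A minimal sketch of what "fixed-size top-k buffers" can mean, again in NumPy and under stated assumptions: `streaming_topk` is a hypothetical name, the key axis is processed in equal chunks, and NumPy has no graph capture, so this only illustrates the buffer geometry and not a capture-safe kernel. The two output buffers are allocated once at `[b, sq, k_keep]` and overwritten in place for every chunk, so their shape never depends on how many keys have been scanned.

```python
import numpy as np


def streaming_topk(score_chunks, b, sq, k_keep):
    """Fold key chunks into fixed [b, sq, k_keep] value and index buffers."""
    top_val = np.full((b, sq, k_keep), -np.inf, dtype=np.float32)
    top_idx = np.full((b, sq, k_keep), -1, dtype=np.int64)
    offset = 0
    for chunk in score_chunks:                      # each chunk: [b, sq, c]
        c = chunk.shape[-1]
        chunk_idx = np.broadcast_to(np.arange(offset, offset + c), chunk.shape)
        # Candidates are the current winners plus this chunk: [b, sq, k_keep + c].
        cand_val = np.concatenate([top_val, chunk], axis=-1)
        cand_idx = np.concatenate([top_idx, chunk_idx], axis=-1)
        # Positions of the k_keep largest candidates (order within is arbitrary).
        keep = np.argpartition(-cand_val, k_keep - 1, axis=-1)[..., :k_keep]
        top_val[...] = np.take_along_axis(cand_val, keep, axis=-1)
        top_idx[...] = np.take_along_axis(cand_idx, keep, axis=-1)
        offset += c
    return top_val, top_idx


rng = np.random.default_rng(0)
b, sq, sk, k_keep, chunk = 2, 4, 64, 8, 16
scores = rng.standard_normal((b, sq, sk), dtype=np.float32)
chunks = [scores[..., i:i + chunk] for i in range(0, sk, chunk)]
top_val, top_idx = streaming_topk(chunks, b, sq, k_keep)
```

The per-step temporary is `[b, sq, k_keep + c]`, which is bounded by the chunk width rather than by the full key length. In this sketch the full `scores` array exists only to build test input; a fused lane would produce each chunk on the fly.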

FAQ

Frequently asked questions

Was this mainly a math fix or a memory fix?
It was mainly a memory fix. The fused score path still has to preserve the same later top-k decision, but the important change is that it stops allocating the large score intermediate that the downstream selection path never needed to keep around.
Why mention CUDA graph safety in a memory article?
Because on this stack a "memory fix" only counts if the faster path is still runnable inside the real execution envelope. If the fused score path saves memory but reintroduces capture-hostile branching or allocation churn, it has simply moved the bug from allocator pressure into the launch contract.
What does the long-context cliff look like in concrete terms?
In the single-head sizing example behind this article, the materialized score surface is already about 8.1 GB at L=65,536 and about 32.7 GB at L=131,072. With K=1024 fixed, the fused lane stays around 384 MB and 768 MB instead. The exact numbers move with dtype, kept-k, and concurrency, but the systems lesson does not: once the dense slab survives long enough to span those axes, the failure mode stops being subtle.
Does this cliff depend only on context length?
No. The upstream-style path keeps the full [sq, b, h, sk] slab resident, so its footprint grows linearly with batch size and head count as well as with both sequence axes. More concurrent batch or more attention heads therefore move the cliff left even before the rest of the stack changes. The checked-in near-copy shows this through the larger [sq, b, h, sk] shape, and the compact sample's memory helper makes the same rule explicit by multiplying batch and head count straight into the dense score bill. That is why this bug can surface earlier under more concurrency even when the algorithm itself has not changed.
Does the fused lane still keep any state of its own?
Yes, but that is not the same thing as reviving the bug. The important distinction is between bounded bookkeeping, such as fixed top-k outputs or running stats, and materializing the full dense score slab. The fused lane can keep the former without bringing back the [sq, b, h, sk] resident that made the original path fall off a memory cliff.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

index mask

The mask tensor DSA updates after sparse top-k selection so only the chosen key positions remain available to the later score path.

sparse top-k

The sparse-selection stage that keeps only a bounded key set before the later score or masking path runs.

key set

The selected sparse key positions that survive routing and stay visible to the later score or mask update path.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.