DSA indexer memory fix deep dive
A reproducer-driven look at how a fused DSA score path avoids a large upstream-style intermediate while preserving the same output contract.

The compact companion article, "DSA indexer memory fix", states the systems lesson. The checked-in example is useful because it preserves the structure of the original comparison: an upstream-style path that materializes a larger score tensor, and a fused path that computes the same contract more directly.
That is the right public framing. In this reproducer, the main difference is memory-residency shape; any speed effect is secondary and workload-dependent.
The checked-in near-copy also makes that memory bill concrete instead of
leaving it as a vague "larger intermediate" story. The upstream-style path
first materializes an fp32 score tensor shaped like [sq, b, h, sk] and only
later reduces over heads into the final [b, sq, sk] contract. The fused path
accumulates directly into that final shape. That is why this bug shows up as a
real HBM spike in one helper: the extra head axis stays resident until after
the reduction, so sequence length and head count multiply the cost before the
later DSA sparse top-k/index-cache contract ever
gets to reduce the selected key set.
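The shape difference between the two paths can be sketched in a few lines of PyTorch. This is a minimal illustration, not the checked-in code: the function names, the input layout (q as [sq, b, h, d], weights as [sq, b, h], k as [sk, b, d]), and the ReLU-gated, head-weighted score formula are all assumptions made for the example, and the Python head loop only stands in for whatever fused kernel the real fix uses.

```python
import torch


def scores_materialized(q, weights, k):
    """Upstream-style sketch: build [sq, b, h, sk] in fp32, then reduce heads."""
    # Assumed layout: q [sq, b, h, d], weights [sq, b, h], k [sk, b, d].
    s = torch.einsum("sbhd,tbd->sbht", q.float(), k.float())  # [sq, b, h, sk]
    s = torch.relu(s) * weights.float().unsqueeze(-1)  # still [sq, b, h, sk]
    return s.sum(dim=2).transpose(0, 1)  # [b, sq, sk]


def scores_fused(q, weights, k):
    """Fused-style sketch: accumulate one head at a time into [b, sq, sk]."""
    sq, b, h, _ = q.shape
    sk = k.shape[0]
    k_t = k.float().permute(1, 2, 0)  # [b, d, sk]
    out = torch.zeros(b, sq, sk, dtype=torch.float32, device=q.device)
    for head in range(h):
        q_h = q[:, :, head, :].float().transpose(0, 1)  # [b, sq, d]
        w_h = weights[:, :, head].float().transpose(0, 1).unsqueeze(-1)  # [b, sq, 1]
        # Only output-sized temporaries exist here; the head axis is never resident.
        out = out + torch.relu(torch.bmm(q_h, k_t)) * w_h
    return out
```

Both functions return the same [b, sq, sk] contract. In the second one the largest live tensors are a few output-sized temporaries per head, rather than a slab that is h times the output.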
The checked-in sizing path keeps that shape argument concrete. Its full
configuration uses b=8, sq=4096, sk=4096, and h=32, so the upstream
fp32 intermediate is sq * b * h * sk * 4 bytes = 2^34 bytes = 16 GiB, all of it
resident before the reduction leaves the [b, sq, sk] output (512 MiB in fp32)
as the only live score surface. The exact crossover
still depends on shape and dtype, but the main engineering point is stable:
once the extra head axis stays materialized, long context turns into a real
residency cliff.
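The arithmetic is easy to reproduce. The helper below is a hypothetical one written for this article, not the sample's sizing function, and it assumes 4-byte fp32 elements for both tensors.

```python
def score_bytes(*dims, itemsize=4):
    """Bytes needed for a dense tensor with the given dims (fp32 by default)."""
    total = itemsize
    for dim in dims:
        total *= dim
    return total


GIB = 2**30
b, sq, sk, h = 8, 4096, 4096, 32  # full configuration quoted in the text

upstream_bytes = score_bytes(sq, b, h, sk)  # [sq, b, h, sk] intermediate
final_bytes = score_bytes(b, sq, sk)        # [b, sq, sk] output

print(upstream_bytes / GIB)           # 16.0
print(final_bytes / GIB)              # 0.5
print(upstream_bytes // final_bytes)  # 32, i.e. exactly the head count
```

The ratio between the two is always h, which is why the head axis is the part worth keeping out of memory.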
The checked-in sample also keeps the gradient-check lane. That matters because a
memory fix that silently changes the forward or backward contract is not a fix.
run_gradcheck switches to a small float64 shape and compares forward output
plus q, weights, and k gradients instead of pretending finite differences
should be the large-shape validation path. That separation is the real contract:
use the parity lane to prove the fused boundary kept the math, and use the
larger shape lane to prove the dense resident is gone.
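A parity lane of this kind might look like the sketch below. It is not the sample's run_gradcheck: the function name parity_check, the tiny shapes, the tolerances, and the ReLU-gated score formula are assumptions for illustration. It compares the forward output and the q, weights, and k gradients between a materialized and a head-accumulating path in float64, then runs torch.autograd.gradcheck on the fused-style function.

```python
import torch


def materialized(q, w, k):
    # Assumed layout: q [sq, b, h, d], w [sq, b, h], k [sk, b, d].
    s = torch.einsum("sbhd,tbd->sbht", q, k)  # [sq, b, h, sk]
    return (torch.relu(s) * w.unsqueeze(-1)).sum(dim=2).transpose(0, 1)


def fused(q, w, k):
    k_t = k.permute(1, 2, 0)  # [b, d, sk]
    out = q.new_zeros(q.shape[1], q.shape[0], k.shape[0])  # [b, sq, sk]
    for head in range(q.shape[2]):
        q_h = q[:, :, head, :].transpose(0, 1)  # [b, sq, d]
        w_h = w[:, :, head].transpose(0, 1).unsqueeze(-1)  # [b, sq, 1]
        out = out + torch.relu(torch.bmm(q_h, k_t)) * w_h
    return out


def parity_check(sq=3, b=2, h=2, d=4, sk=5, seed=0):
    """Small float64 lane: forward parity, gradient parity, finite differences."""
    torch.manual_seed(seed)

    def make(*shape):
        return torch.randn(*shape, dtype=torch.float64, requires_grad=True)

    q, w, k = make(sq, b, h, d), make(sq, b, h), make(sk, b, d)
    ref, out = materialized(q, w, k), fused(q, w, k)

    # Use one random upstream gradient for both paths so the comparison is fair.
    grad_out = torch.randn_like(ref)
    ref_grads = torch.autograd.grad(ref, (q, w, k), grad_out)
    out_grads = torch.autograd.grad(out, (q, w, k), grad_out)

    forward_ok = torch.allclose(ref, out)
    grads_ok = all(
        torch.allclose(g_ref, g_out) for g_ref, g_out in zip(ref_grads, out_grads)
    )
    # gradcheck raises on failure and returns True on success.
    fd_ok = torch.autograd.gradcheck(fused, (q, w, k), eps=1e-6, atol=1e-4)
    return forward_ok, grads_ok, fd_ok
```

Keeping the shapes tiny matters here: gradcheck perturbs each input element individually, so its cost grows with the number of input elements and it is not a sensible tool for the large-shape memory lane.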
Example -> article -> upstream docs
- example: DSA indexer memory checked-in example
- article: this deep-dive route, with the compact companion in DSA indexer memory fix
- upstream docs: PyTorch einsum, bmm and matmul, gradcheck, and topk
Frequently asked questions
Is the public claim about speed or about memory shape?
Why keep gradcheck in a memory-focused reproducer?
What exact intermediate is the fix removing?
The upstream-style path first materializes an fp32 score tensor shaped [sq, b, h, sk], then reduces it into [b, sq, sk]. The fused path skips that larger residency step and accumulates straight into [b, sq, sk].
Why does a later topk call not solve the same memory problem?
If a path first materializes [sq, b, h, sk] scores and only later applies selection, the allocator has already paid for that dense slab. The fused lane changes the order of work so selection happens without keeping the larger score tensor alive first.
Should the fused path match the materialized path bit-for-bit?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
DSA: DeepSeek Sparse Attention, a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.
CUDA graphs: CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.
The mask tensor DSA updates after sparse top-k selection so only the chosen key positions remain available to the later score path.
Sparse top-k: the sparse-selection stage that keeps only a bounded key set before the later score or masking path runs.
Key set: the selected sparse key positions that survive routing and stay visible to the later score or mask update path.
H200: NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.