DSA index-cache patch
Why caching sparse top-k indices across selected DSA layers is not just a speed trick, and why the shared path has to fail closed back to a full layer when no valid cache is available.

This patch is interesting because it looks like a local optimization, but the real contract is larger.
Here DSA means DeepSeek Sparse Attention: selected indexer layers choose a sparse top-k key set, and nearby layers may reuse those indices instead of refreshing the full selection path. That only works if the reuse path is explicit and the failure mode is safe. A shared layer without a valid preceding cache cannot keep pretending it is on the cheap path. It has to promote itself back to a full path and recompute.
That is the public rule the checked-in sample preserves: cache when the contract exists, fail closed when it does not.
The same fail-closed discipline shows up in "DSA and CUDA graph safety" and "DSA indexer memory fix": the speed story is only useful if the fallback behavior stays explicit.
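The rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the checked-in sample: `IndexCache`, `select_indices`, and `topk_indices` are hypothetical names, the indices are toy lists rather than tensors, and the choice to let a promoted layer refresh the cache for later layers is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class IndexCache:
    """Top-k indices from the most recent refreshing layer, or nothing."""
    indices: Optional[List[int]] = None
    source_layer: Optional[int] = None

    def store(self, layer_id: int, indices: List[int]) -> None:
        self.indices = list(indices)
        self.source_layer = layer_id

    def invalidate(self) -> None:
        self.indices = None
        self.source_layer = None


def topk_indices(scores: List[float], k: int) -> List[int]:
    """Positions of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def select_indices(
    layer_id: int,
    is_shared: bool,
    cache: IndexCache,
    compute: Callable[[int], List[int]],
) -> Tuple[List[int], str]:
    """Return (indices, path). A shared layer with no valid cache fails closed."""
    if is_shared and cache.indices is not None:
        return cache.indices, "reused"
    # Either a declared full layer, or a shared layer promoted back to full.
    path = "promoted_full" if is_shared else "full"
    indices = compute(layer_id)
    cache.store(layer_id, indices)  # assumption: promoted layers refresh the cache
    return indices, path


# Toy run: layer 0 is declared shared but has nothing to reuse.
scores = [0.1, 0.9, 0.3, 0.7]
cache = IndexCache()
for layer_id, is_shared in [(0, True), (1, True), (2, False)]:
    idx, path = select_indices(layer_id, is_shared, cache,
                               lambda _layer: topk_indices(scores, 2))
    print(layer_id, path, idx)
```

The returned path label is the point of the sketch: the caller can always tell whether a layer reused indices, ran as declared, or was promoted, so a missing cache never turns into a silent cheap path.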
Why this matters beyond one patch
Sparse-attention caches are tempting to describe as obvious wins. They are not obvious if the lifecycle is underspecified. Cache invalidation, nearest valid source, and cross-stage absence all change whether reuse is safe.
That is why this public example is worth keeping. It documents the schedule and the fallback rule instead of implying cached sparse indices are globally valid.
That schedule vocabulary matters. A useful public description is not "some later layers get a free cache." It is "a minority of full or anchor layers refresh sparse indices, nearby shared layers reuse them, and any consumer that cannot prove a valid local source becomes full again." The checked-in sample makes those states visible directly, which keeps the reader focused on contract boundaries instead of benchmark folklore.
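One way to make that vocabulary concrete is a static planner that turns declared roles into effective states. The sketch below is an assumption-laden illustration: `State`, `plan_schedule`, and especially `max_reuse_distance` (a stand-in for what "nearby" means) are hypothetical and do not come from the checked-in sample.

```python
from enum import Enum
from typing import List, Optional, Tuple


class State(Enum):
    FULL = "full"               # declared full or anchor layer, refreshes indices
    SHARED = "shared"           # reuses indices from a nearby source layer
    PROMOTED = "promoted_full"  # declared shared, but no valid source: full again


def plan_schedule(
    declared: List[str], max_reuse_distance: int
) -> List[Tuple[State, Optional[int]]]:
    """Map declared roles ("full" or "shared") to (effective state, source layer)."""
    plan: List[Tuple[State, Optional[int]]] = []
    source: Optional[int] = None  # nearest layer that refreshed indices
    for layer, role in enumerate(declared):
        if role not in ("full", "shared"):
            raise ValueError(f"unknown role {role!r} at layer {layer}")
        can_reuse = (
            role == "shared"
            and source is not None
            and layer - source <= max_reuse_distance
        )
        if can_reuse:
            plan.append((State.SHARED, source))
        else:
            # Fail closed: anything that cannot prove a source recomputes.
            state = State.FULL if role == "full" else State.PROMOTED
            source = layer
            plan.append((state, None))
    return plan


declared = ["shared", "full", "shared", "shared", "shared"]
for layer, (state, src) in enumerate(plan_schedule(declared, max_reuse_distance=2)):
    print(layer, state.value, src)
```

In this toy schedule the first layer and the last layer are both declared shared but end up promoted: the first has no source at all, and the last is farther from its source than the assumed reuse distance allows.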
That contract is also narrower than "layer N hands indices to layer N+1 forever." The checked-in pipeline sample is a good reminder that a stage boundary changes the economics: once reuse has to cross to another pipeline stage, the index payload can be expensive enough that the first consumer on the new stage is better modeled as a deliberate full recompute than as a cache import lane.
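The economics can be checked with back-of-the-envelope arithmetic using the figures quoted in the FAQ below. The sketch assumes that L is the sequence length, that H is the number of heads, and that one index is stored per (position, head, top-k slot); the article does not spell out that layout, and `index_payload_bytes` is a hypothetical helper name.

```python
def index_payload_bytes(seq_len: int, num_heads: int, top_k: int,
                        bytes_per_index: int = 4) -> int:
    """Size of a dense [seq_len, num_heads, top_k] index tensor (int32 by default)."""
    return seq_len * num_heads * top_k * bytes_per_index


# Figures from the FAQ: L=200,000, H=32, top_k=2048, int32 indices.
payload = index_payload_bytes(seq_len=200_000, num_heads=32, top_k=2048)
print(f"{payload:,} bytes")
print(f"{payload / 1e9:.1f} GB, {payload / 2**30:.1f} GiB")
```

Under those assumptions the payload is 52,428,800,000 bytes per sample, which is where the "about 50 GB" figure comes from. Shipping a payload of that size across a stage boundary is the cost that a deliberate full recompute on the new stage avoids.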
Frequently asked questions
Why is the first layer on a new pipeline stage often a full layer again?
With L=200,000, H=32, top_k=2048, and int32 indices, the sparse index payload comes to about 50 GB per sample (200,000 × 32 × 2,048 × 4 bytes ≈ 52.4 GB) before the cache even crosses a stage boundary. The safe default is still fail-closed promotion: treat the first consumer on the new stage as full again unless the cache-transfer path is explicitly proven worth keeping.
What invalidates a cached sparse-index path before a stage boundary?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
DSA: DeepSeek Sparse Attention, a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.
CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.
The mask tensor DSA updates after sparse top-k selection so only the chosen key positions remain available to the later score path.
sparse top-k: the sparse-selection stage that keeps only a bounded key set before the later score or masking path runs.
key set: the selected sparse key positions that survive routing and stay visible to the later score or mask update path.
NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.