Why is the first layer on a new pipeline stage often a full layer again?

Because stage-boundary reuse turns a compute optimization into a transport problem. Once a cache has to move to another stage, the system is no longer choosing only between "reuse" and "recompute"; it is choosing between communication plus reuse and a local rebuild. At long-context scale, L=200,000, H=32, top_k=2048, and int32 indices imply about 50 GB of sparse index payload per sample before the cache even crosses a stage boundary. The safe default is still fail-closed promotion: treat the first consumer on the new stage as full again unless the cache-transfer path is explicitly proven worth keeping.
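The payload figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes L is the sequence length, H is the number of heads, and the cache stores one int32 index per query position, per head, per top-k slot; the function name is hypothetical and not taken from the patch.

```python
def sparse_index_payload_bytes(seq_len: int, num_heads: int, top_k: int,
                               bytes_per_index: int = 4) -> int:
    """Bytes for one index per (query position, head, top-k slot).

    Assumption: this layout is inferred from the quoted figures, not
    read from the checked-in sample.
    """
    return seq_len * num_heads * top_k * bytes_per_index


payload = sparse_index_payload_bytes(seq_len=200_000, num_heads=32, top_k=2048)
print(f"{payload:,} bytes")                      # 52,428,800,000 bytes
print(f"{payload / 1e9:.1f} GB")                 # 52.4 GB
print(f"{payload / 2**30:.1f} GiB")              # 48.8 GiB
```

Under those assumptions the result lands at roughly 52 GB (about 49 GiB), which is consistent with the "about 50 GB" figure quoted above.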

What invalidates a cached sparse-index path before a stage boundary?

A cache hit is not proof by itself. The consumer still has to prove the expected index tensor is present locally and usable for the current context. If that local-source contract fails, the DSA index-cache patch sample keeps the behavior boring: promote the layer to a full indexer path and rebuild instead of pretending the sparse reuse lane is still valid.
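One way to picture the local-source contract is a small resolver that returns the cheap path only when every check passes. This is an illustrative sketch, not the patch's code: `CachedIndices`, `resolve_index_path`, and the specific checks (same stage, same context length) are assumptions chosen to mirror the rule described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedIndices:
    """Hypothetical record of a cached sparse-index tensor."""
    source_layer: int   # layer that produced the indices
    stage: int          # pipeline stage where the tensor lives
    context_len: int    # context length the indices were computed for


def resolve_index_path(cache: Optional[CachedIndices],
                       consumer_stage: int,
                       context_len: int) -> str:
    """Return "shared" only if the local-source contract holds, else "full"."""
    if cache is None:
        return "full"   # nothing to reuse
    if cache.stage != consumer_stage:
        return "full"   # tensor is not present on this stage
    if cache.context_len != context_len:
        return "full"   # indices do not match the current context
    return "shared"


cache = CachedIndices(source_layer=4, stage=0, context_len=8192)
print(resolve_index_path(cache, consumer_stage=0, context_len=8192))  # shared
print(resolve_index_path(cache, consumer_stage=1, context_len=8192))  # full
print(resolve_index_path(None, consumer_stage=0, context_len=8192))   # full
```

The design choice worth noting is that every failed check falls through to "full": there is no branch where a doubtful cache is reused.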

DSA index-cache patch

This patch is interesting because it looks like a local optimization, but the real contract is larger.

Here DSA means DeepSeek Sparse Attention: selected indexer layers choose a sparse top-k key set, and nearby layers may reuse those indices instead of refreshing the full selection path. That only works if the reuse path is explicit and the failure mode is safe. A shared layer without a valid preceding cache cannot keep pretending it is on the cheap path. It has to promote itself back to a full path and recompute.

That is the public rule the checked-in sample preserves: cache when the contract exists, fail closed when it does not.

The same fail-closed discipline shows up in "DSA and CUDA graph safety" and "DSA indexer memory fix": the speed story is only useful if the fallback behavior stays explicit.

Why this matters beyond one patch

Sparse-attention caches are tempting to describe as obvious wins. They are not obvious if the lifecycle is underspecified. Cache invalidation, nearest valid source, and cross-stage absence all change whether reuse is safe.

That is why this public example is worth keeping. It documents the schedule and the fallback rule instead of implying cached sparse indices are globally valid.

That schedule vocabulary matters. A useful public description is not "some later layers get a free cache." It is "a minority of full or anchor layers refresh sparse indices, nearby shared layers reuse them, and any consumer that cannot prove a valid local source becomes full again." The checked-in sample makes those states visible directly, which keeps the reader focused on contract boundaries instead of benchmark folklore.

That contract is also narrower than "layer N hands indices to layer N+1 forever." The checked-in pipeline sample is a good reminder that a stage boundary changes the economics: once reuse has to cross to another pipeline stage, the index payload can be expensive enough that the first consumer on the new stage is better modeled as a deliberate full recompute than as a cache import lane.
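The schedule and the stage-boundary rule can be sketched as a small planner that turns requested layer roles into effective ones. This is a hypothetical illustration, not the checked-in sample: the function name, the role strings, and the assumption that a promoted layer becomes a valid source for later shared layers on the same stage are all mine.

```python
from typing import List, Optional


def plan_index_schedule(requested: List[str], stage_of: List[int]) -> List[str]:
    """Resolve each layer to "full" or "shared" under a fail-closed rule.

    A layer stays "shared" only if the most recent full layer ran on the
    same pipeline stage. Otherwise it is promoted to "full" and becomes
    the new local source (an assumption of this sketch).
    """
    effective: List[str] = []
    last_full_stage: Optional[int] = None  # stage of the latest index refresh
    for role, stage in zip(requested, stage_of):
        if role == "shared" and last_full_stage == stage:
            effective.append("shared")
        else:
            effective.append("full")
            last_full_stage = stage
    return effective


requested = ["full", "shared", "shared", "shared", "full", "shared"]
stage_of = [0, 0, 0, 1, 1, 1]
plan = plan_index_schedule(requested, stage_of)
print(plan)  # ['full', 'shared', 'shared', 'full', 'full', 'shared']
```

In this example the fourth layer asked for reuse, but it is the first consumer on stage 1, so it is promoted to a full recompute rather than importing indices across the boundary.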
