DSA and CUDA graph safety
Why DSA index mask updates need branchless graph-capture-safe logic, and why small host-sync accidents can break an otherwise valid CUDA graph path.

CUDA graph capture is unforgiving about hidden host sync points. CUDA graph capture here means recording one fixed GPU work graph and replaying it later
without re-running the same CPU-side launch logic. That makes DSA index mask
updates a good example of a broader MegaCpp rule: math parity is not enough if
the path still branches on GPU reductions or validation checks that become
Python booleans.
For first touch, DSA here means DeepSeek Sparse Attention: an attention lane
that first chooses a sparse top-k key set, then updates an index_mask so only
those key positions stay available to the later score path. The checked-in
samples make that mask concrete: selected slots are scattered back to a legal
value, while everything else stays blocked. That is why a "small" host-visible
branch inside the mask update is enough to break an otherwise valid capture
path.
The public sample keeps the fix simple: branchless scatter plus a small fixup. That is more useful than a broad CUDA-graphs slogan because it shows the exact runtime pattern that becomes safe. The compact checked-in proof is DSA CUDA graph safety sample, and the near-copy lane that preserves the same boundary more literally is DSA CUDA graph safety nearcopy.
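A minimal sketch of what a branchless scatter with a small fixup can look like in PyTorch. It is not the checked-in sample: the function name, the `-1` convention for unused top-k slots, the `0.0` / `-inf` mask values, and the trash-column trick are all assumptions made for illustration.

```python
import torch

BLOCKED = float("-inf")  # assumed value for key positions that stay blocked
OPEN = 0.0               # assumed "legal" value for selected key positions


def update_index_mask(topk_idx: torch.Tensor, num_keys: int) -> torch.Tensor:
    """Build a (rows, num_keys) mask from (rows, k) int64 top-k indices.

    Assumption: unused top-k slots are marked with -1.
    """
    rows = topk_idx.shape[0]
    # Not graph-safe: `if (topk_idx < 0).any(): ...` would turn a GPU
    # reduction into a Python bool and force a host sync.
    # Branchless alternative: route invalid slots to one extra trash column.
    trash = num_keys
    idx = torch.where(topk_idx >= 0, topk_idx, torch.full_like(topk_idx, trash))
    mask = torch.full((rows, num_keys + 1), BLOCKED, device=topk_idx.device)
    mask.scatter_(1, idx, OPEN)  # selected slots become legal
    # Fixup: drop the trash column so invalid slots never open a real key.
    return mask[:, :num_keys]
```

The sequence of operations is the same whether or not any slot is invalid, so the launch topology does not depend on the selected key set.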
If you want the longer kernel-side version, continue to DSA CUDA graph safety deep dive. If the bug you are chasing is not capture legality but the DSA indexing surface itself, the adjacent local continuations are DSA index-cache patch for cross-layer index reuse and DSA indexer memory fix for the score-path working-set rewrite. The TPU-side cousin is Graph recompilation hell: different backend, same lesson that a small host-visible condition can still break the graph contract without changing the math.
One practical footgun is that sparse helpers often look safer than they are.
torch.nonzero, torch.unique, boolean indexing that shrinks a tensor, or any
host-side if driven by a GPU result all force the runtime to learn the sparse
set before launch. The graph-safe rewrite is narrower: fixed-capacity buffers,
device-side offsets, and a small fixup pass that keeps one captured launch
topology even when the selected key set changes.
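As a hedged illustration of that rewrite, the sketch below replaces a `torch.nonzero`-style compaction with a fixed-capacity buffer and device-side write offsets. The function name, the `-1` padding value, and the overflow policy (silently dropping entries beyond `capacity`) are assumptions, not the project's actual helper.

```python
import torch


def compact_fixed_capacity(selected: torch.Tensor, capacity: int):
    """Compact the True positions of a 1-D bool tensor into a fixed-size buffer.

    Returns (slots, count): `slots` always has shape (capacity,), padded
    with -1; `count` is a 0-dim tensor that is never read on the host here.
    """
    n = selected.numel()
    positions = torch.arange(n, device=selected.device)
    # Device-side write offsets: the i-th selected element goes to slot i.
    offsets = torch.cumsum(selected.to(torch.int64), dim=0) - 1
    keep = selected & (offsets < capacity)
    # Unselected or overflowing elements are routed to a trash slot.
    dest = torch.where(keep, offsets, torch.full_like(offsets, capacity))
    slots = torch.full((capacity + 1,), -1, dtype=torch.int64,
                       device=selected.device)
    slots.scatter_(0, dest, positions)
    count = selected.sum().clamp(max=capacity)
    return slots[:capacity], count  # fixup: drop the trash slot
```

Unlike `torch.nonzero`, the output shape depends only on `capacity`, so downstream kernels can consume `slots` and `count` without the host ever learning how many keys were selected.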
The validation rule is therefore not "remove every check." It is "move the check to a graph-safe surface." Capture-aware eager-only guards are fine. Device-side failure paths are fine. What is not fine is a check whose answer escapes into Python control flow before the captured region is complete.
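One way to read "capture-aware eager-only guard" is sketched below. The function name and the index convention (`-1` for an unused slot) are hypothetical; `torch.cuda.is_current_stream_capturing()` is the PyTorch call that reports whether the current stream is being captured.

```python
import torch


def validate_topk_indices(topk_idx: torch.Tensor, num_keys: int) -> None:
    """Raise on out-of-range indices, but only when running eagerly."""
    capturing = (torch.cuda.is_available()
                 and torch.cuda.is_current_stream_capturing())
    if capturing:
        # Inside capture, the bool() below would sync with the host,
        # so the check is skipped rather than run.
        return
    out_of_range = (topk_idx < -1) | (topk_idx >= num_keys)
    if bool(out_of_range.any()):  # host sync, acceptable in eager mode
        raise ValueError("top-k index outside [-1, num_keys)")
```

The check still exists and still fails loudly during eager warm-up runs; it simply never leaks a GPU result into Python control flow while a capture is in progress.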
One useful extension is to think in bounded shape classes instead of one unbounded padded scratch surface. A shape class is a pre-captured bucket with a fixed maximum active-token count and fixed buffer sizes, not a promise that the captured graph accepts arbitrary dynamic shapes. If the selector can land in only a few maximum-active-token ranges, capture those classes separately and size each scratch buffer to its own ceiling. That keeps replay topology static while reducing the dead bandwidth and OOM risk that come from padding every batch to the global worst case.
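The host-side class selection can be sketched in a few lines. The ceilings, the function names, and the choice to return `None` when no class fits are illustrative assumptions; what the caller does on a miss is a policy decision this sketch does not make.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def pick_shape_class(active_tokens: int,
                     ceilings: Sequence[int]) -> Optional[int]:
    """Return the smallest captured ceiling that fits, or None on a miss.

    `ceilings` must be sorted ascending; each entry stands for one
    pre-captured graph with buffers sized to that ceiling.
    """
    i = bisect_left(ceilings, active_tokens)
    return ceilings[i] if i < len(ceilings) else None


def padded_slots(active_tokens: int, ceiling: int) -> int:
    """Dead (padded) slots when replaying the class with this ceiling."""
    return ceiling - active_tokens


CEILINGS = [256, 1024, 4096]  # hypothetical classes
```

With these hypothetical ceilings, a batch of 200 active tokens replays the 256 class and pads 56 slots, instead of padding 3896 slots against the global 4096 ceiling.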
Frequently asked questions
Why is branchless logic emphasized here instead of just "use CUDA graphs carefully"?
Does math parity prove a graph path is safe?
What should happen when a runtime shape misses every captured class?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
DSA: DeepSeek Sparse Attention, a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.
CUDA Graphs: CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.
H200: NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
index mask: The mask tensor DSA updates after sparse top-k selection so only the chosen key positions remain available to the later score path.
sparse top-k: The sparse-selection stage that keeps only a bounded key set before the later score or masking path runs.
key set: The selected sparse key positions that survive routing and stay visible to the later score or mask update path.
shape class: A bounded family of tensor shapes captured or compiled as one stable runtime topology instead of treating every dynamic case as one global worst case.