MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
By David Gornshtein, 3 min read
Tags: DSA, CUDA Graphs, Runtime, Kernels

DSA and CUDA graph safety

Why DSA index mask updates need branchless graph-capture-safe logic, and why small host-sync accidents can break an otherwise valid CUDA graph path.

CUDA graph capture is unforgiving about hidden host sync points. CUDA graph capture here means recording one fixed GPU work graph and replaying it later without re-running the same CPU-side launch logic. That makes DSA index mask updates a good example of a broader MegaCpp rule: math parity is not enough if the path still branches on GPU reductions or validation checks that become Python booleans.

For readers new to the term, DSA here means DeepSeek Sparse Attention: an attention lane that first chooses a sparse top-k key set, then updates an index_mask so only those key positions stay available to the later score path. The checked-in samples make that mask concrete: selected slots are scattered back to a legal value, while everything else stays blocked. That is why a "small" host-visible branch inside the mask update is enough to break an otherwise valid capture path.

The public sample keeps the fix simple: branchless scatter plus a small fixup. That is more useful than a broad CUDA-graphs slogan because it shows the exact runtime pattern that becomes safe. The compact checked-in proof is DSA CUDA graph safety sample, and the near-copy lane that preserves the same boundary more literally is DSA CUDA graph safety nearcopy.
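The shape of that pattern can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the checked-in sample: the function name update_index_mask, the -1 sentinel for unused top-k slots, and the choice of 0.0 as the legal value and -inf as the blocked value are all hypothetical.

```python
import torch


def update_index_mask(topk_indices: torch.Tensor, num_keys: int) -> torch.Tensor:
    """Return an additive mask: 0.0 at selected key slots, -inf elsewhere.

    topk_indices: int64 tensor of shape [..., k]. A value of -1 marks an
    unused slot (an assumed convention for this sketch).
    """
    # Clamp sentinels to a legal index instead of asking the host
    # "are there any -1 entries?" before scattering.
    safe_idx = topk_indices.clamp(min=0)

    mask = torch.full(
        (*topk_indices.shape[:-1], num_keys),
        float("-inf"),
        dtype=torch.float32,
        device=topk_indices.device,
    )
    # Branchless scatter: open every (clamped) selected slot.
    mask.scatter_(-1, safe_idx, 0.0)

    # Small fixup: slot 0 may have been opened only by clamped sentinels.
    # Keep it open only where index 0 was genuinely selected. The reduction
    # stays on the device as a tensor; it never becomes a Python bool.
    col0_selected = (topk_indices == 0).any(dim=-1)
    mask[..., 0] = mask[..., 0].masked_fill(~col0_selected, float("-inf"))
    return mask
```

The unsafe variant of the same logic would wrap the fixup in something like a Python if on (topk_indices < 0).any(), which forces a device-to-host read in the middle of the captured region. Here the fixup always runs, so the launch sequence is identical for every input.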

If you want the longer kernel-side version, continue to DSA CUDA graph safety deep dive. If the bug you are chasing is not capture legality but the DSA indexing surface itself, the adjacent local continuations are DSA index-cache patch for cross-layer index reuse and DSA indexer memory fix for the score-path working-set rewrite. The TPU-side cousin is Graph recompilation hell: different backend, same lesson that a small host-visible condition can still break the graph contract without changing the math.

One practical footgun is that sparse helpers often look safer than they are. torch.nonzero, torch.unique, boolean indexing that shrinks a tensor, or any host-side if driven by a GPU result all force the runtime to learn the sparse set before launch. The graph-safe rewrite is narrower: fixed-capacity buffers, device-side offsets, and a small fixup pass that keeps one captured launch topology even when the selected key set changes.
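As a hedged illustration of the fixed-capacity idea, the sketch below replaces a torch.nonzero-style compaction with a prefix sum and a scatter into a preallocated buffer. The function name compact_fixed_capacity, the -1 padding value, and the extra trash slot are assumptions made for this example; they are not taken from the MegaCpp samples.

```python
import torch


def compact_fixed_capacity(keep: torch.Tensor, capacity: int):
    """Compact the positions where `keep` is True into a fixed-size buffer.

    keep: bool tensor of shape [n].
    Returns (slots, count): slots is int64 of shape [capacity], padded with
    -1; count is a 0-dim tensor that stays on the device.
    """
    n = keep.numel()
    positions = torch.arange(n, device=keep.device)

    # Device-side offsets: the running count of kept flags, minus one,
    # is the output slot of each kept element.
    offsets = torch.cumsum(keep.to(torch.int64), dim=0) - 1

    # Dropped or overflowing elements are routed to a trash slot at index
    # `capacity`, so the scatter always has the same shape.
    in_range = keep & (offsets < capacity)
    dest = torch.where(in_range, offsets, torch.full_like(offsets, capacity))

    slots = torch.full((capacity + 1,), -1, dtype=torch.int64, device=keep.device)
    slots.scatter_(0, dest, positions)

    # Small fixup: slice off the trash slot and cap the count.
    count = keep.sum().clamp(max=capacity)
    return slots[:capacity], count
```

The output shape depends only on capacity, never on how many flags are set, so downstream kernels can consume slots and count without the host ever learning the size of the sparse set.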

The validation rule is therefore not "remove every check." It is "move the check to a graph-safe surface." Capture-aware eager-only guards are fine. Device-side failure paths are fine. What is not fine is a check whose answer escapes into Python control flow before the captured region is complete.
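A capture-aware eager-only guard might look like the sketch below. The helper names are hypothetical; torch.cuda.is_current_stream_capturing is the existing PyTorch query for whether the current CUDA stream is under capture. The availability check in front of it is a defensive assumption so the sketch also runs on CPU-only machines.

```python
import torch


def is_capturing() -> bool:
    """True only when a CUDA stream capture is in progress."""
    return torch.cuda.is_available() and torch.cuda.is_current_stream_capturing()


def check_topk_indices(topk_indices: torch.Tensor, num_keys: int) -> None:
    """Validate index range in eager mode; do nothing during capture."""
    if is_capturing():
        # Under capture the answer must not escape into Python control
        # flow, so the host-visible check is skipped entirely.
        return
    out_of_range = (topk_indices >= num_keys).any()
    # bool(...) forces a device-to-host sync, which is acceptable only
    # because this line is unreachable while capturing.
    if bool(out_of_range):
        raise ValueError(f"top-k index out of range for {num_keys} keys")
```

The design choice is that the check still exists and still fires during eager warmup runs, where a sync is harmless, while the captured region sees no host-dependent branch at all.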

One useful extension is to think in bounded shape classes instead of one unbounded padded scratch surface. A shape class is a pre-captured bucket with a fixed maximum active-token count and fixed buffer sizes, not a promise that the captured graph accepts arbitrary dynamic shapes. If the selector can land in only a few maximum-active-token ranges, capture those classes separately and size each scratch buffer to its own ceiling. That keeps replay topology static while reducing the dead bandwidth and OOM risk that come from padding every batch to the global worst case.
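The dispatch side of shape classes can be sketched in plain Python. Everything here is illustrative: the function names, the example ceilings, and the replay and eager callables are hypothetical, and the sketch assumes the active-token count is already known on the host (for example from batch metadata) and is not read back from a GPU result.

```python
from bisect import bisect_left
from typing import Callable, Optional, Sequence


def pick_shape_class(active_tokens: int, ceilings: Sequence[int]) -> Optional[int]:
    """Return the smallest captured ceiling that fits, or None on a miss.

    ceilings must be sorted in ascending order.
    """
    i = bisect_left(ceilings, active_tokens)
    return ceilings[i] if i < len(ceilings) else None


def run_step(
    active_tokens: int,
    ceilings: Sequence[int],
    replay: Callable[[int, int], str],
    eager: Callable[[int], str],
) -> str:
    """Replay the matching pre-captured class, or fall back to eager."""
    ceiling = pick_shape_class(active_tokens, ceilings)
    if ceiling is None:
        # No captured class is large enough: run outside the graph path
        # rather than allocating a new buffer or capturing mid-step.
        return eager(active_tokens)
    return replay(ceiling, active_tokens)
```

With ceilings of (256, 1024, 4096), a batch with 300 active tokens replays the 1024 class and pads by 724 slots instead of 3796, while a batch with 5000 active tokens takes the eager path.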

FAQ

Frequently asked questions

Why is branchless logic emphasized here instead of just "use CUDA graphs carefully"?
Because the actual failure mode is concrete: a tensor reduction or equality check turns into a host-visible boolean, which turns a graph-safe device path into host-controlled branching. The shortest local proof is the pair of checked-in examples DSA CUDA graph safety sample and DSA CUDA graph safety nearcopy, plus CUDA graph block validation sample for the lane-level capture boundary.
Does math parity prove a graph path is safe?
No. A path can be numerically correct in eager mode and still be unusable under capture if it hides host syncs, shape checks, or conditional control flow. That is why this article stays paired with DSA CUDA graph safety deep dive and Regional compile without losing the plot: one explains the capture boundary, the other explains where that boundary sits in the wider runtime lane.
What should happen when a runtime shape misses every captured class?
Fall back to eager mode or a smaller safe piecewise path, then decide outside capture whether that shape is worth adding later. Do not allocate a new variable-sized buffer or invent a new graph topology from inside the captured step.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

index mask

The mask tensor DSA updates after sparse top-k selection so only the chosen key positions remain available to the later score path.

sparse top-k

The sparse-selection stage that keeps only a bounded key set before the later score or masking path runs.

key set

The selected sparse key positions that survive routing and stay visible to the later score or mask update path.

shape class

A bounded family of tensor shapes captured or compiled as one stable runtime topology instead of treating every dynamic case as one global worst case.