The FA4 Catalog on Blackwell: Variants, sm Guards, and Runtime Selection
Inside the Flash Attention 4 catalog MegaCpp ships: which kernel variants we keep, the sm_100 / sm_121a guards, the selection policy at runtime, and the validity checks that fail closed.

The cross-path Flash Attention 4 in practice post explains why we treat FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample as several separate lines of work instead of one migration. This post is the implementation companion. It describes the catalog itself: which surfaces we register, what backends each surface accepts, the compute-capability gates that decide whether a request is even legal on a given device, the runtime selection policy, and the typed validity checks that fail closed.
Why MegaCpp cares about this
FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample is two things at once. It is a fast CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Example: tcgen05 gate matrix sample kernel family on the hardware lanes it actually targets. And it is a backend surface that does not cover every mask, layout, prefix rule, or decode shape our hybrid stack uses.
For first touch:
- FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample means FlashAttention-4: the newer attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns-kernel family MegaCpp can call through a CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Example: tcgen05 gate matrix sample runtime when the device, mask contract, and layout are eligible. The smallest checked-in runtime proof surface is Dense FA4 execute proof sample.
- CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample is NVIDIA's tensor-kernel stack; CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Example: tcgen05 gate matrix sample is the Python-facing tile and layout layer inside that stack that FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample builds on. If you want the shortest local decoder for where CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingAbout: CuTe DSL experiments Example: MegaCpp model wiring example index Reference: TileLang and CuTe boundary ownership begins and ends, use Our honest experience with CuTe DSL, Flash Attention 4 in practice, and Kernel examples overview.
- Backend eligibility means only "is this request allowed to try this backend on this device?" It is not the same thing as import success, benchmark naming, or runtime proof.
- Execute proof means a checked-in surface that records what actually ran, not just what a preset requested. In this lane the compact local receipts are Dense FA4 execute proof sample, Dense FA4 KV-cache decode sample, and FA4 receipt summary sample.
tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough, Tensor Memory (TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity), and TMA are different Blackwell-side nouns. In public-safe wording here,tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough is the SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro Tensor Core instruction family, TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity is the distinct tensor-memory surface, and TMA is the copy/transport surface next to them rather than a synonym for either one. If you want the shortest noun map before the routing story, start with MegaCpp model glossary. If the copy-side TMA part is the thing you are actually trying to decode, use TileLang TMA bulk-copy 3D shared-memory deep dive as the sibling article.
UMMAQuick term guideUMMAA Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths.GroundingAbout: GB10 tensor-path proof summary Example: TileLang TMA bulk-copy sample is another matrix-instruction noun in the Blackwell conversation, but this FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample catalog still treats it as a hardware capability term rather than as an attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns execute-proof receipt by itself.
A useful boundary to keep in your head before the catalog details: CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample is the kernel-building stack, CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Example: tcgen05 gate matrix sample is the Python authoring layer inside that stack, dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample is one backend family built on that layer, and FlexAttention is a frontend request surface that may ask for a FLASH-family backend without turning every FLASH receipt into dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample proof. The smallest checked-in pair for that boundary is FlexAttention compile sample and MoBA block-sparse decode sample: the first keeps the optional FLASH or CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingAbout: CuTe DSL experiments Example: MegaCpp model wiring example index Reference: TileLang and CuTe boundary probe separate from the basic FlexAttention import and compile path, and the second keeps a sparse FLASH-style request separate from the backend that actually executed.
If we wire FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample as one global flag, we create two failure modes:
- silent fallback, where the runtime did not actually execute FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample but the operator thinks it did
- overclaim, where a planner or config row names FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample on a device or shape that was never eligible
That is why this article uses backend eligibility as a first-touch term. It means one question only: "is this request allowed to try this backend on this device at all?" It does not mean "the import worked." It does not mean "the config asked for it." It does not mean "the benchmark row had FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample in the preset name."
The second reason this matters is the Blackwell hardware split. CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample's Blackwell documentation keeps the public names explicit: tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Reference: GB10 claim-scope guardrails for the SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro-side tensor instructions, tmemQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity / Tensor Memory as a separate data locale, and TMA copy primitives as an adjacent transport surface rather than a synonym for the tensor-math path. That naming discipline is useful here because datacenter Blackwell (sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough, B200) and consumer Blackwell (sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough, GB10 / DGX Spark) do not share one proof boundary. For first-touch reading, keep the nouns separate: sm_100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro is the baseline target name, sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough and sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough are architecture-specific target labels, tcgen05.allocQuick term guidetcgen05.allocDocumented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.GroundingAbout: GB10 cubin patch repro Example: GB10 repro walkthrough Example: GB10 gate walkthrough / tcgen05.ldQuick term guidetcgen05.ldDocumented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough / tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Reference: GB10 claim-scope guardrails are separate SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro-side probe surfaces, and TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 repro walkthrough is the clustered cp.async.bulk.tensor...multicast::cluster copy probe rather than a synonym for the matrix path. The checked-in GB10 bundle is the proof boundary for what we can safely say about the consumer lane, and it is also where .nv.capmercQuick term guide.nv.capmercObserved section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.GroundingAbout: GB10 driver gates warning History: libcuda patch lane Example: GB10 gate repro stays in the "observed metadata, not public API" bucket unless NVIDIA publishes a stronger definition. The shortest reader-safe decoder for that target split is GB10 Blackwell tensor paths: what we actually proved, with the production-side compile-label nuance expanded in Training the MegaCpp SLM Ensemble on GB10 and NVFP4 Inference for the MegaCpp SLM Ensemble. If you want the public-safe proof surfaces behind those terms instead of a prose summary, the right neighbor is the GB10 bundle:
- GB10
sm_100acubin patch repro - GB10 gate matrix
- GB10 repro walkthrough
- minimal TMEM allocation probe source
- full GB10 tensor-path probe source
- GB10 public claims guardrail
What the catalog contains
The implementation is split across three public surfaces:
- the catalog itself
- the dense full-attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns dispatcher
- the typed validity layer the dispatcher consumes
The catalog is intentionally boring data. It exists so the planner, the tests, the runtime receipts, and the prose summary all talk about the same surface IDs and the same proof rules.
The fields that matter most to an operator or reviewer are:
architectural_fit: whether a surface is applicable, conditional, or non-applicableimplementation_state: whether it is opt-in, experimental, bounded, or shadow-onlyproof_policy: what kind of receipt counts as a real execute proofrequested_backend_aliasesandaccepted_actual_backend_values: what users may ask for and what the runtime is allowed to reportconstraint_tags: the hard backend-eligibility rulesblocked_reasons: the recognized but intentionally non-productized cases
There is no single reader-safe monolithic FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample catalog in this tree on purpose. The closest checked-in public-safe catalog slice is the combination of Kernel examples overview, Attention validity prefix sample, Dense FA4 execute proof sample, Dense FA4 KV-cache decode sample, and FA4 receipt summary sample. Together they keep the same nouns this article uses: applicability, proof policy, requested backend, actual backend, and blocked reasons.
The dense FA4 line
The dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample surfaces are where deployment pressure lives:
dense_full.train_evaldense_full.prefilldense_full.decode
The important point is not the names. It is the rule that dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample has an explicit eligibility contract. Typical rejection reasons include:
- causal policy required
- non-uniform
doc_idsQuick term guidedoc_idsThe fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.GroundingAbout: XLA SPMD tokenizer and vocab on TPU About: Block-sparse attention on TPU Example: document-mask segment ID sample not supported - sliding-window mode not supported on that line
- partial valid-token counts not supported on that line
- CUDA-only
The smallest checked-in contract sketch for that validity side is Attention validity prefix sample. The dense execute-proof companion is Dense FA4 execute proof sample. The bounded decode-side companion is Dense FA4 KV-cache decode sample.
That is also why we do not treat PyTorch's BACKEND='FLASH' knob as a blanket dense-FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample proof by itself. On the frontend, FLASH is still a requested backend, and the official PyTorch docs keep backend choice separate from whether that backend is actually usable for the call. The recent PyTorch FlexAttention plus FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample write-up keeps the Flash backend's FP32-reference validation story explicit, but it also keeps the limitations explicit, especially around block-size constraints, captured-buffer gradients, and block-sparse backward determinism. Our catalog follows that same separation: dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample only counts as proven when the request is legal for the dense lane and the runtime receipt still shows the dense lane actually executed.
One useful distinction is that dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample parity and sparse FLASH determinism are different claims. The dense train/eval lane is the one that inherits the usual FP32-reference parity story, while dynamic block-sparse routes still carry a separate determinism warning because gradient accumulation order can move with the sparse schedule. That is exactly why the catalog keeps dense and blockized sparse as different proof surfaces instead of one generic "FLASH is on" state.
Precision is a separate axis again: an FP8 training recipe can make the dense lane worth measuring, but it does not by itself prove that FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample was selected, legal, or parity-checked for the call.
If your next question is "what actually executed in rollout?" rather than "which surface is legal?", jump to Flash Attention 4 in practice. If the confusion is really about training FP8 versus Blackwell inference formats, keep FP8 in the training stack, Precision recipe: FP16, BF16, FP8, NVFP4, and NVFP4 inference separate from the FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample eligibility story.
Sparse and exact-token lines stay separate
We keep the sparse and exact-token lines separate because they fail for different reasons and therefore need different proof contracts.
block_sparse_flash.*is the blockized sparse CUDA line. Shadow mode never counts as execute proof.exact_token.*is the exact-token DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample line. It is not a substitute for dense deployment proof.
That separation is the whole reason the catalog exists. One line executing cleanly does not make another line eligible or proven.
The blockized sparse FLASH line is also a different frontend contract, not just a different mask. On that route, Python-side score or mask logic can still lower into a CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingAbout: CuTe DSL experiments Example: MegaCpp model wiring example index Reference: TileLang and CuTe boundary-backed FLASH execution path, which is powerful but makes bookkeeping stricter: the request name, the eligibility result, and the actual executed backend all need to stay visible separately.
That is also why a blockized sparse BACKEND='FLASH' request cannot inherit the dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample proof story by name alone. The dense line is the place where we expect the tighter FP32-reference parity narrative and the smaller execute-proof samples to stay aligned. The blockized sparse line is a different contract: it keeps the Python-side routing flexibility, but its proof boundary is narrower, its runtime evidence is more observational, and its determinism story remains separate from the dense lane. In other words, sharing the word FLASH does not collapse those routes into one evidence class.
If you want the shortest checked-in receipt for that sparse boundary, open MoBA block-sparse decode sample. It exists specifically so a blockized sparse request can ask for a FLASH-style backend and still report Triton as the actual backend without accidentally claiming dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample execution.
Device guards and runtime selection
The runtime selection side has three jobs:
- decide whether a request is eligible
- choose among the eligible backends
- record what actually executed
The third job is the one teams usually underbuild. That is why the checked-in receipt surfaces matter:
- Dense FA4 execute proof sample
- Dense FA4 KV-cache decode sample
- FA4 receipt summary sample
- Kernel examples overview
- Kernel examples catalog
In practice that means requested, eligible, and executed are three different states. Attention validity prefix sample is the checked-in reminder that validity and prefix normalization decide whether a request may try the backend at all. A dense_fa4 or BACKEND='FLASH' request only tells the selector what to attempt. Execute proof begins only when the runtime-side samples record an actual forward pass or a structured receipt with the backend that really ran.
The rollout-side receipt sample keeps that proof bar explicit by separating kernel-truth fields from speed fields. FA4 receipt summary sample records backend_truth and fa4_kernels_verified separately from throughput, peak memory, and compile_time_sec, which means a fast run without backend confirmation still does not count as dense-FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample evidence. If your question is "what really executed?" rather than "was the smoke fast?", pair that sample with Profiler and receipts instead of collapsing the two stories into one number.
For consumer versus datacenter Blackwell, the reader-facing rule is intentionally narrow:
- GB10 does not inherit every B200 expectation just because both say "Blackwell."
- The GB10 public-safe evidence supports a narrower BF16 or FP16 CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingAbout: CuTe DSL experiments Example: MegaCpp model wiring example index Reference: TileLang and CuTe boundary-style warp-level line, not a datacenter
tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough plus TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity claim. - The public-safe evidence does not justify claiming datacenter-only
tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough or TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity parity on GB10.
The official CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Blackwell docs keep the same split. SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro pages describe the tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Reference: GB10 claim-scope guardrails-oriented datacenter contract, while separate SM120 pages document the consumer-Blackwell GEMM lane. That is the right grounding for our routing language here: a consumer-family compile target such as sm_120fQuick term guidesm_120fFamily-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a.GroundingAbout: GB10 stack parity History: GB10 journey Example: GB10 cubin patch repro can still be the correct shipping choice without proving that the device exposes the same SM100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough or TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity path the datacenter docs describe.
That distinction is not just a performance note. On the consumer lane, the safer fallback path is closer to register-backed mma.sync.aligned plus ldmatrix staging than to the datacenter TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity / tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough contract, which is why fa3_checked_first is a routing guardrail rather than a benchmark preference.
How it lands in MegaCpp
The catalog is the contract. The dispatch layer is where the runtime truth is attached. The validity layer is the typed gate that keeps the dispatch from guessing.
That split matters because the worst FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample bugs are not "the kernel crashed." They are:
- the runtime executed a fallback without exposing it clearly
- the config or preset name implied a stronger claim than the runtime delivered
- a shape-specific restriction leaked through as a benchmark-only curiosity instead of a typed rejection
Production checklist
- Keep dense, sparse, and exact-token FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample lines separate.
- Keep backend eligibility and execute proof as separate concepts.
- Fail closed on unsupported validity shapes instead of guessing.
- Treat shadow paths as telemetry, never as execute proof.
- Record requested backend and actual backend separately.
- Keep GB10 claims inside the public-safe boundary defined by the checked-in GB10 bundle.
Frequently asked questions
Why keep separate dense, sparse, and exact-token FA4 surfaces?+
What is FA4 in plain language?+
Does GB10 run the same FA4 line as datacenter Blackwell?+
sm_120fQuick term guidesm_120fFamily-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a., that is a consumer-family compile choice, not permission to treat GB10 as the SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane. tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path. plus TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable. lane.What do sm_100a, sm_121a, and sm_120f mean for FA4 routing?+
sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro. is the datacenter Blackwell target associated with the broader SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.-side tensor-path contract. sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro. is the exact GB10 device target. sm_120fQuick term guidesm_120fFamily-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a. is the consumer-Blackwell family compile target we use when we want family-common optimizations for shipping kernels. In FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. terms, that distinction matters because backend eligibility is decided against the real device and the checked-in proof surfaces, not against the most optimistic compile label. Use MegaCpp model wiring examples for the checked-in GB10 bring-up map, GB10 repro bundle overview plus GB10 repro walkthrough for the exact gate walk, and GB10 sm_100a cubin patch repro plus GB10 Blackwell tensor paths: what we actually proved for the runtime-proof boundary.What does "backend eligible" mean in plain language?+
Where do tcgen05, TMEM, and TMA multicast expectations actually live in the checked-in tree?+
sm_100a cubin patch repro, Driver signal vs runtime proof sample, Baseline arch-patch probe sample, GB10 gate matrix, GB10 repro walkthrough, minimal TMEM allocation probe source, full GB10 tensor-path probe source, and GB10 public claims guardrail. Those links keep separate what the driver loads, what helper metadata says, what the staged tcgen05.allocQuick term guidetcgen05.allocDocumented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe. / tcgen05.ldQuick term guidetcgen05.ldDocumented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes. walk reaches, and what still lacks an end-to-end runtime proof on GB10.Are OMMA or UTCOMMA official names I should use in FA4-facing docs?+
tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path., TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable., TMA, and UMMAQuick term guideUMMAA Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths.. If OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable. or UTCOMMAQuick term guideUTCOMMAA datacenter-oriented Blackwell tensor-path feature called out in the GB10 hardware notes as unavailable on consumer parts. appear in internal notes or disassembly discussion, label them explicitly as repo-local shorthand or low-level instruction names rather than as NVIDIA product terminology.Which checked-in files show the difference between "eligible", "requested", and "executed"?+
Does requesting BACKEND='FLASH' prove FA4 executed?+
Why doesn't BACKEND='FLASH' make every FlexAttention route a dense FA4 proof?+
FLASH on routes that still have extra mask or score-shaping behavior, sparse bookkeeping, or backend fallbacks in play. Dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. proof is the narrower claim: the request matched the dense lane's validity contract, the device was eligible for that lane, and the runtime receipt still shows the dense FA4 path rather than a different backend or a sparse-only observation lane.Where does TensorSSA fit in the FlexAttention FLASH path?+
score_mod or mask_mod expressions can be represented below the frontend before a FLASH-family backend is attempted, but that does not prove dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. executed. The catalog still requires the same three facts: the request was eligible, the requested backend stayed visible, and the runtime receipt recorded the actual backend.Do FA4's softmax changes change the catalog decision?+
exp2 pieces that reduce pressure on special-function units. That is useful kernel context, but it is not a new catalog state. MegaCpp still decides eligibility from the device, mask, and validity contract, then counts proof only from the runtime receipt.Where should I look for the FlexAttention boundary in checked-in examples?+
Why is blockized sparse BACKEND='FLASH' still treated as experimental when dense FA4 is cataloged more tightly?+
Why prefer an FA3-first fallback on consumer Blackwell instead of dropping straight to SDPA?+
fa3_checked_first style policy is meant to protect.Why does the sparse BACKEND='FLASH' route still need stricter bookkeeping than a plain dense FA4 call?+
score_mod or mask_mod logic down into the FLASH-family backend, so the request shape, the requested backend, and the executed backend have to stay visible as separate facts. If you want the neighboring runtime story, pair this post with Attention Validity and Structure-Aware Attention and Flash Attention 4 in practice rather than treating the sparse line as an automatic dense-FA4 proof.Which checked-in file should I open if I want the whole applicability matrix, not just the samples?+
What is the shortest checked-in reading path if I only need the catalog nouns?+
sm_100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane. / sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro. / sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro. split is the thing you still need to decode.Does block-scaled FP4 or NVFP4 support make this a dense FA4 proof?+
Where should I go if my question is really about precision, not FA4 eligibility?+
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.
Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.
FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.
CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.
The CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.
Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.
Family-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a.
Documented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.
Documented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.
The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.
The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.
Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.
The older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.
A Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths.
A datacenter-oriented Blackwell tensor-path feature called out in the GB10 hardware notes as unavailable on consumer parts.
The Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.
The older warp-level matrix-multiply path that predates Hopper WGMMA and the SM100 tcgen05 family.
Observed section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.
The validity carrier built from row-level counts or masks so sparse or structured attention paths know which token prefix is real without re-inferring it inside the compiled region.
The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.
NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.
The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.
DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.