MegaCpp EngineeringApplied C++ model systems

</>

Article

Grounded engineering note from the MegaCpp stack

Published April 18, 202611 min readDavid Gornshtein

Flash Attention

FA4

CuTe

Blackwell

H200

Attention Kernels

The FA4 Catalog on Blackwell: Variants, sm Guards, and Runtime Selection

Q: What is FA4 in plain language?

In these posts, FA4 means "the CuTe DSL FlashAttention-4 backend family," not a universal replacement for every attention path in the stack. Some dense causal requests are eligible, some decode requests are only bounded helpers, and some sparse or exact-token requests belong to completely different proof lanes. The fastest checked-in decoder is Kernel examples overview plus Dense FA4 execute proof sample.

Q: What do sm_100a, sm_121a, and sm_120f mean for FA4 routing?

sm_100a is the datacenter Blackwell target associated with the broader SM100-side tensor-path contract. sm_121a is the exact GB10 device target. sm_120f is the consumer-Blackwell family compile target we use when we want family-common optimizations for shipping kernels. In FA4 terms, that distinction matters because backend eligibility is decided against the real device and the checked-in proof surfaces, not against the most optimistic compile label. Use MegaCpp model wiring examples for the checked-in GB10 bring-up map, GB10 repro bundle overview plus GB10 repro walkthrough for the exact gate walk, and GB10 sm_100a cubin patch repro plus GB10 Blackwell tensor paths: what we actually proved for the runtime-proof boundary.

Inside the Flash Attention 4 catalog MegaCpp ships: which kernel variants we keep, the sm_100 / sm_121a guards, the selection policy at runtime, and the validity checks that fail closed.

By David Gornshtein

MegaCpp

Focused on applied C++ model engineering

Article Preview

The FA4 Catalog on Blackwell: Variants, sm Guards, and Runtime Selection

Published April 18, 2026•11 min read•David Gornshtein

The cross-path Flash Attention 4 in practice post explains why we treat FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample as several separate lines of work instead of one migration. This post is the implementation companion. It describes the catalog itself: which surfaces we register, what backends each surface accepts, the compute-capability gates that decide whether a request is even legal on a given device, the runtime selection policy, and the typed validity checks that fail closed.

Why MegaCpp cares about this

FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample is two things at once. It is a fast CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Example: tcgen05 gate matrix sample kernel family on the hardware lanes it actually targets. And it is a backend surface that does not cover every mask, layout, prefix rule, or decode shape our hybrid stack uses.

For first touch:

FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample means FlashAttention-4: the newer attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns-kernel family MegaCpp can call through a CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Example: tcgen05 gate matrix sample runtime when the device, mask contract, and layout are eligible. The smallest checked-in runtime proof surface is Dense FA4 execute proof sample.
CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample is NVIDIA's tensor-kernel stack; CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Example: tcgen05 gate matrix sample is the Python-facing tile and layout layer inside that stack that FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample builds on. If you want the shortest local decoder for where CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingAbout: CuTe DSL experiments Example: MegaCpp model wiring example index Reference: TileLang and CuTe boundary ownership begins and ends, use Our honest experience with CuTe DSL, Flash Attention 4 in practice, and Kernel examples overview.
Backend eligibility means only "is this request allowed to try this backend on this device?" It is not the same thing as import success, benchmark naming, or runtime proof.
Execute proof means a checked-in surface that records what actually ran, not just what a preset requested. In this lane the compact local receipts are Dense FA4 execute proof sample, Dense FA4 KV-cache decode sample, and FA4 receipt summary sample.
tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough, Tensor Memory (TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity), and TMA are different Blackwell-side nouns. In public-safe wording here, tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough is the SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro Tensor Core instruction family, TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity is the distinct tensor-memory surface, and TMA is the copy/transport surface next to them rather than a synonym for either one. If you want the shortest noun map before the routing story, start with MegaCpp model glossary. If the copy-side TMA part is the thing you are actually trying to decode, use TileLang TMA bulk-copy 3D shared-memory deep dive as the sibling article.

UMMAQuick term guideUMMAA Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths.GroundingAbout: GB10 tensor-path proof summary Example: TileLang TMA bulk-copy sample is another matrix-instruction noun in the Blackwell conversation, but this FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample catalog still treats it as a hardware capability term rather than as an attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns execute-proof receipt by itself.

A useful boundary to keep in your head before the catalog details: CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample is the kernel-building stack, CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Example: tcgen05 gate matrix sample is the Python authoring layer inside that stack, dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample is one backend family built on that layer, and FlexAttention is a frontend request surface that may ask for a FLASH-family backend without turning every FLASH receipt into dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample proof. The smallest checked-in pair for that boundary is FlexAttention compile sample and MoBA block-sparse decode sample: the first keeps the optional FLASH or CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingAbout: CuTe DSL experiments Example: MegaCpp model wiring example index Reference: TileLang and CuTe boundary probe separate from the basic FlexAttention import and compile path, and the second keeps a sparse FLASH-style request separate from the backend that actually executed.

If we wire FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample as one global flag, we create two failure modes:

silent fallback, where the runtime did not actually execute FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample but the operator thinks it did
overclaim, where a planner or config row names FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample on a device or shape that was never eligible

That is why this article uses backend eligibility as a first-touch term. It means one question only: "is this request allowed to try this backend on this device at all?" It does not mean "the import worked." It does not mean "the config asked for it." It does not mean "the benchmark row had FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample in the preset name."

The second reason this matters is the Blackwell hardware split. CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample's Blackwell documentation keeps the public names explicit: tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Reference: GB10 claim-scope guardrails for the SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro-side tensor instructions, tmemQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity / Tensor Memory as a separate data locale, and TMA copy primitives as an adjacent transport surface rather than a synonym for the tensor-math path. That naming discipline is useful here because datacenter Blackwell (sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough, B200) and consumer Blackwell (sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough, GB10 / DGX Spark) do not share one proof boundary. For first-touch reading, keep the nouns separate: sm_100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro is the baseline target name, sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough and sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough are architecture-specific target labels, tcgen05.allocQuick term guidetcgen05.allocDocumented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.GroundingAbout: GB10 cubin patch repro Example: GB10 repro walkthrough Example: GB10 gate walkthrough / tcgen05.ldQuick term guidetcgen05.ldDocumented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough / tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Reference: GB10 claim-scope guardrails are separate SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro-side probe surfaces, and TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 repro walkthrough is the clustered cp.async.bulk.tensor...multicast::cluster copy probe rather than a synonym for the matrix path. The checked-in GB10 bundle is the proof boundary for what we can safely say about the consumer lane, and it is also where .nv.capmercQuick term guide.nv.capmercObserved section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.GroundingAbout: GB10 driver gates warning History: libcuda patch lane Example: GB10 gate repro stays in the "observed metadata, not public API" bucket unless NVIDIA publishes a stronger definition. The shortest reader-safe decoder for that target split is GB10 Blackwell tensor paths: what we actually proved, with the production-side compile-label nuance expanded in Training the MegaCpp SLM Ensemble on GB10 and NVFP4 Inference for the MegaCpp SLM Ensemble. If you want the public-safe proof surfaces behind those terms instead of a prose summary, the right neighbor is the GB10 bundle:

What the catalog contains

The implementation is split across three public surfaces:

the catalog itself
the dense full-attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns dispatcher
the typed validity layer the dispatcher consumes

The catalog is intentionally boring data. It exists so the planner, the tests, the runtime receipts, and the prose summary all talk about the same surface IDs and the same proof rules.

The fields that matter most to an operator or reviewer are:

architectural_fit: whether a surface is applicable, conditional, or non-applicable
implementation_state: whether it is opt-in, experimental, bounded, or shadow-only
proof_policy: what kind of receipt counts as a real execute proof
requested_backend_aliases and accepted_actual_backend_values: what users may ask for and what the runtime is allowed to report
constraint_tags: the hard backend-eligibility rules
blocked_reasons: the recognized but intentionally non-productized cases

There is no single reader-safe monolithic FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample catalog in this tree on purpose. The closest checked-in public-safe catalog slice is the combination of Kernel examples overview, Attention validity prefix sample, Dense FA4 execute proof sample, Dense FA4 KV-cache decode sample, and FA4 receipt summary sample. Together they keep the same nouns this article uses: applicability, proof policy, requested backend, actual backend, and blocked reasons.

The dense FA4 line

The dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample surfaces are where deployment pressure lives:

dense_full.train_eval
dense_full.prefill
dense_full.decode

The important point is not the names. It is the rule that dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample has an explicit eligibility contract. Typical rejection reasons include:

causal policy required
non-uniform doc_idsQuick term guidedoc_idsThe fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.GroundingAbout: XLA SPMD tokenizer and vocab on TPU About: Block-sparse attention on TPU Example: document-mask segment ID sample not supported
sliding-window mode not supported on that line
partial valid-token counts not supported on that line
CUDA-only

The smallest checked-in contract sketch for that validity side is Attention validity prefix sample. The dense execute-proof companion is Dense FA4 execute proof sample. The bounded decode-side companion is Dense FA4 KV-cache decode sample.

That is also why we do not treat PyTorch's BACKEND='FLASH' knob as a blanket dense-FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample proof by itself. On the frontend, FLASH is still a requested backend, and the official PyTorch docs keep backend choice separate from whether that backend is actually usable for the call. The recent PyTorch FlexAttention plus FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample write-up keeps the Flash backend's FP32-reference validation story explicit, but it also keeps the limitations explicit, especially around block-size constraints, captured-buffer gradients, and block-sparse backward determinism. Our catalog follows that same separation: dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample only counts as proven when the request is legal for the dense lane and the runtime receipt still shows the dense lane actually executed.

One useful distinction is that dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample parity and sparse FLASH determinism are different claims. The dense train/eval lane is the one that inherits the usual FP32-reference parity story, while dynamic block-sparse routes still carry a separate determinism warning because gradient accumulation order can move with the sparse schedule. That is exactly why the catalog keeps dense and blockized sparse as different proof surfaces instead of one generic "FLASH is on" state.

Precision is a separate axis again: an FP8 training recipe can make the dense lane worth measuring, but it does not by itself prove that FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample was selected, legal, or parity-checked for the call.

If your next question is "what actually executed in rollout?" rather than "which surface is legal?", jump to Flash Attention 4 in practice. If the confusion is really about training FP8 versus Blackwell inference formats, keep FP8 in the training stack, Precision recipe: FP16, BF16, FP8, NVFP4, and NVFP4 inference separate from the FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample eligibility story.

Sparse and exact-token lines stay separate

We keep the sparse and exact-token lines separate because they fail for different reasons and therefore need different proof contracts.

block_sparse_flash.* is the blockized sparse CUDA line. Shadow mode never counts as execute proof.
exact_token.* is the exact-token DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample line. It is not a substitute for dense deployment proof.

That separation is the whole reason the catalog exists. One line executing cleanly does not make another line eligible or proven.

The blockized sparse FLASH line is also a different frontend contract, not just a different mask. On that route, Python-side score or mask logic can still lower into a CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingAbout: CuTe DSL experiments Example: MegaCpp model wiring example index Reference: TileLang and CuTe boundary-backed FLASH execution path, which is powerful but makes bookkeeping stricter: the request name, the eligibility result, and the actual executed backend all need to stay visible separately.

That is also why a blockized sparse BACKEND='FLASH' request cannot inherit the dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample proof story by name alone. The dense line is the place where we expect the tighter FP32-reference parity narrative and the smaller execute-proof samples to stay aligned. The blockized sparse line is a different contract: it keeps the Python-side routing flexibility, but its proof boundary is narrower, its runtime evidence is more observational, and its determinism story remains separate from the dense lane. In other words, sharing the word FLASH does not collapse those routes into one evidence class.

If you want the shortest checked-in receipt for that sparse boundary, open MoBA block-sparse decode sample. It exists specifically so a blockized sparse request can ask for a FLASH-style backend and still report Triton as the actual backend without accidentally claiming dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample execution.

Device guards and runtime selection

The runtime selection side has three jobs:

decide whether a request is eligible
choose among the eligible backends
record what actually executed

The third job is the one teams usually underbuild. That is why the checked-in receipt surfaces matter:

In practice that means requested, eligible, and executed are three different states. Attention validity prefix sample is the checked-in reminder that validity and prefix normalization decide whether a request may try the backend at all. A dense_fa4 or BACKEND='FLASH' request only tells the selector what to attempt. Execute proof begins only when the runtime-side samples record an actual forward pass or a structured receipt with the backend that really ran.

The rollout-side receipt sample keeps that proof bar explicit by separating kernel-truth fields from speed fields. FA4 receipt summary sample records backend_truth and fa4_kernels_verified separately from throughput, peak memory, and compile_time_sec, which means a fast run without backend confirmation still does not count as dense-FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample evidence. If your question is "what really executed?" rather than "was the smoke fast?", pair that sample with Profiler and receipts instead of collapsing the two stories into one number.

For consumer versus datacenter Blackwell, the reader-facing rule is intentionally narrow:

GB10 does not inherit every B200 expectation just because both say "Blackwell."
The GB10 public-safe evidence supports a narrower BF16 or FP16 CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingAbout: CuTe DSL experiments Example: MegaCpp model wiring example index Reference: TileLang and CuTe boundary-style warp-level line, not a datacenter tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough plus TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity claim.
The public-safe evidence does not justify claiming datacenter-only tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough or TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity parity on GB10.

The official CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Blackwell docs keep the same split. SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro pages describe the tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Reference: GB10 claim-scope guardrails-oriented datacenter contract, while separate SM120 pages document the consumer-Blackwell GEMM lane. That is the right grounding for our routing language here: a consumer-family compile target such as sm_120fQuick term guidesm_120fFamily-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a.GroundingAbout: GB10 stack parity History: GB10 journey Example: GB10 cubin patch repro can still be the correct shipping choice without proving that the device exposes the same SM100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough or TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity path the datacenter docs describe.

That distinction is not just a performance note. On the consumer lane, the safer fallback path is closer to register-backed mma.sync.aligned plus ldmatrix staging than to the datacenter TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity / tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough contract, which is why fa3_checked_first is a routing guardrail rather than a benchmark preference.

How it lands in MegaCpp

The catalog is the contract. The dispatch layer is where the runtime truth is attached. The validity layer is the typed gate that keeps the dispatch from guessing.

That split matters because the worst FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample bugs are not "the kernel crashed." They are:

the runtime executed a fallback without exposing it clearly
the config or preset name implied a stronger claim than the runtime delivered
a shape-specific restriction leaked through as a benchmark-only curiosity instead of a typed rejection

Production checklist

Keep dense, sparse, and exact-token FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FlashAttention 4 in practice Example: Dense FA4 execute proof sample lines separate.
Keep backend eligibility and execute proof as separate concepts.
Fail closed on unsupported validity shapes instead of guessing.
Treat shadow paths as telemetry, never as execute proof.
Record requested backend and actual backend separately.
Keep GB10 claims inside the public-safe boundary defined by the checked-in GB10 bundle.

FAQ

Frequently asked questions

Why keep separate dense, sparse, and exact-token FA4 surfaces?+

Because they have different eligibility cliffs, fallback reasons, and proof contracts. Treating them as one backend flag hides the exact place where a request stopped being valid.

What is FA4 in plain language?+

In these posts, FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. means "the CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang. FlashAttention-4 backend family," not a universal replacement for every attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand. path in the stack. Some dense causal requests are eligible, some decode requests are only bounded helpers, and some sparse or exact-token requests belong to completely different proof lanes. The fastest checked-in decoder is Kernel examples overview plus Dense FA4 execute proof sample.

Does GB10 run the same FA4 line as datacenter Blackwell?+

No. The safe public claim is narrower. GB10 is handled as its own consumer-Blackwell lane with its own eligibility boundary and its own checked-in proof surfaces. Even when a shipping GB10 kernel compiles as sm_120fQuick term guidesm_120fFamily-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a., that is a consumer-family compile choice, not permission to treat GB10 as the SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane. tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path. plus TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable. lane.

What do sm_100a, sm_121a, and sm_120f mean for FA4 routing?+

sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro. is the datacenter Blackwell target associated with the broader SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.-side tensor-path contract. sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro. is the exact GB10 device target. sm_120fQuick term guidesm_120fFamily-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a. is the consumer-Blackwell family compile target we use when we want family-common optimizations for shipping kernels. In FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. terms, that distinction matters because backend eligibility is decided against the real device and the checked-in proof surfaces, not against the most optimistic compile label. Use MegaCpp model wiring examples for the checked-in GB10 bring-up map, GB10 repro bundle overview plus GB10 repro walkthrough for the exact gate walk, and GB10 sm_100a cubin patch repro plus GB10 Blackwell tensor paths: what we actually proved for the runtime-proof boundary.

What does "backend eligible" mean in plain language?+

That the request satisfies the hard preconditions for a backend on the current device: the right compute capability, causal policy, validity shape, and mask contract. It does not mean the backend executed.

Where do tcgen05, TMEM, and TMA multicast expectations actually live in the checked-in tree?+

Use the narrow GB10 proof surfaces, not a marketing summary: MegaCpp model wiring examples, GB10 tensor paths, GB10 sm_100a cubin patch repro, Driver signal vs runtime proof sample, Baseline arch-patch probe sample, GB10 gate matrix, GB10 repro walkthrough, minimal TMEM allocation probe source, full GB10 tensor-path probe source, and GB10 public claims guardrail. Those links keep separate what the driver loads, what helper metadata says, what the staged tcgen05.allocQuick term guidetcgen05.allocDocumented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe. / tcgen05.ldQuick term guidetcgen05.ldDocumented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes. walk reaches, and what still lacks an end-to-end runtime proof on GB10.

Are OMMA or UTCOMMA official names I should use in FA4-facing docs?+

Not as first-choice reader-facing terms. For official naming here, prefer tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path., TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable., TMA, and UMMAQuick term guideUMMAA Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths.. If OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable. or UTCOMMAQuick term guideUTCOMMAA datacenter-oriented Blackwell tensor-path feature called out in the GB10 hardware notes as unavailable on consumer parts. appear in internal notes or disassembly discussion, label them explicitly as repo-local shorthand or low-level instruction names rather than as NVIDIA product terminology.

Which checked-in files show the difference between "eligible", "requested", and "executed"?+

Attention validity prefix sample shows the validity-side contract, Dense FA4 execute proof sample shows the minimal execute-proof surface, Dense FA4 KV-cache decode sample shows the bounded decode-side checks, and FA4 receipt summary sample shows how rollout summaries keep the requested backend and the backend that actually executed separate. If the confusion is really about helper signals versus runtime proof, Driver signal vs runtime proof sample is the shortest checked-in reminder that a driver-visible hint is still not an FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. execution receipt.

Does requesting BACKEND='FLASH' prove FA4 executed?+

No. It proves only that the caller asked the selector to try the FLASH or CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.-style lane. Eligibility can still reject the request, and the runtime can still record a different actual backend. The execute-proof boundary is the checked-in runtime evidence: Dense FA4 execute proof sample checks a real CUDA forward pass, while FA4 receipt summary sample keeps backend truth separate from the verification flags and speed summary.

Why doesn't BACKEND='FLASH' make every FlexAttention route a dense FA4 proof?+

Because frontend lowering, backend eligibility, and executed backend are three different facts. FlexAttention can legally ask for FLASH on routes that still have extra mask or score-shaping behavior, sparse bookkeeping, or backend fallbacks in play. Dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. proof is the narrower claim: the request matched the dense lane's validity contract, the device was eligible for that lane, and the runtime receipt still shows the dense FA4 path rather than a different backend or a sparse-only observation lane.

Where does TensorSSA fit in the FlexAttention FLASH path?+

Treat TensorSSA as a compiler-lowering detail, not as an execution receipt. In the FlexAttention path, some Python-side score_mod or mask_mod expressions can be represented below the frontend before a FLASH-family backend is attempted, but that does not prove dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. executed. The catalog still requires the same three facts: the request was eligible, the requested backend stayed visible, and the runtime receipt recorded the actual backend.

Do FA4's softmax changes change the catalog decision?+

No. The FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. research treats softmax work as part of the kernel pipeline: the backend may schedule exponentiation and rescaling work differently from older attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand. kernels, including software-emulated exp2 pieces that reduce pressure on special-function units. That is useful kernel context, but it is not a new catalog state. MegaCpp still decides eligibility from the device, mask, and validity contract, then counts proof only from the runtime receipt.

Where should I look for the FlexAttention boundary in checked-in examples?+

Use FlexAttention compile sample for the frontend-side boundary and MoBA block-sparse decode sample for the sparse requested-vs-actual receipt. The first keeps the optional FLASH or CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface. probe separate from loading FlexAttention at all; the second shows why a sparse FLASH-style request still needs its own runtime receipt instead of inheriting dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. proof by name.

Why is blockized sparse BACKEND='FLASH' still treated as experimental when dense FA4 is cataloged more tightly?+

Because the sparse line keeps extra moving parts alive at the frontend boundary. Dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. mainly has to answer "was this dense causal request legal, and what actually executed?" The blockized sparse line also has to preserve Python-side score or mask shaping, keep requested and actual backend names visible, and avoid overclaiming determinism just because a FLASH-family kernel appeared in the receipt. That is why the catalog treats it as a separate proof lane rather than a denser alias.

Why prefer an FA3-first fallback on consumer Blackwell instead of dropping straight to SDPA?+

Because the safer failure mode is still a typed, architecture-verified attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand. surface. When FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. eligibility fails on the consumer lane, the FA3 fallback preserves a closer performance and contract boundary than an immediate drop to generic SDPA or an unsupported Triton attempt, which is exactly what the fa3_checked_first style policy is meant to protect.

Why does the sparse BACKEND='FLASH' route still need stricter bookkeeping than a plain dense FA4 call?+

Because the sparse route is not only "dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. with a different mask." It can carry Python-side score_mod or mask_mod logic down into the FLASH-family backend, so the request shape, the requested backend, and the executed backend have to stay visible as separate facts. If you want the neighboring runtime story, pair this post with Attention Validity and Structure-Aware Attention and Flash Attention 4 in practice rather than treating the sparse line as an automatic dense-FA4 proof.

Which checked-in file should I open if I want the whole applicability matrix, not just the samples?+

Start with Kernel examples overview, then Attention validity prefix sample, Dense FA4 execute proof sample, and FA4 receipt summary sample. This tree deliberately keeps the applicability matrix split across smaller checked-in contracts instead of hiding it in one large implementation file.

What is the shortest checked-in reading path if I only need the catalog nouns?+

Open Kernel examples overview first, then Attention validity prefix sample, Dense FA4 execute proof sample, and FA4 receipt summary sample. Jump into the GB10 bundle only if the sm_100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane. / sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro. / sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro. split is the thing you still need to decode.

Does block-scaled FP4 or NVFP4 support make this a dense FA4 proof?+

No. Block-scaled FP4, NVFP4, and SM120 narrow-precision GEMM support are adjacent precision and GEMM facts, not attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand. execution receipts. For this catalog, the proof bar is still narrower: the dense attention request has to pass backend eligibility, and the runtime receipt has to show the dense FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. path actually executed. Use Precision recipe: FP16, BF16, FP8, NVFP4 and NVFP4 inference for the dtype story, then pair Dense FA4 execute proof sample with GB10 Blackwell tensor paths: what we actually proved before making a runtime claim.

Where should I go if my question is really about precision, not FA4 eligibility?+

Use the sibling precision posts on purpose: FP8 in the training stack for selective FP8 training surfaces, Precision recipe: FP16, BF16, FP8, NVFP4 for the full per-phase dtype map, and NVFP4 inference for the serving-side Blackwell path. They explain adjacent decisions, not FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. execute proof.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

sm_100

Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.

Grounding

sm_121a

Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.

Grounding

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

Grounding

CuTe

CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.

Grounding

CuTe DSL

The CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.

Grounding

sm_100a

Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.

Grounding

sm_120f

Family-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a.

Grounding

tcgen05.alloc

Documented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.

Grounding

tcgen05.ld

Documented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.

Grounding

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Grounding

tcgen05

The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.

Grounding

TMA multicast

The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.

Grounding

TMEM

Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.

Grounding

OMMA

The older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.

Grounding

UMMA

A Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths.

Grounding

UTCOMMA

A datacenter-oriented Blackwell tensor-path feature called out in the GB10 hardware notes as unavailable on consumer parts.

Grounding

tcgen05.mma

The Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.

Grounding

mma.sync

The older warp-level matrix-multiply path that predates Hopper WGMMA and the SM100 tcgen05 family.

Grounding

.nv.capmerc

Observed section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.

Grounding

AttentionValidity

The validity carrier built from row-level counts or masks so sparse or structured attention paths know which token prefix is real without re-inferring it inside the compiled region.

Grounding

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

Grounding

CUTLASS

NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.

Grounding

doc_ids

The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.

Grounding

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

Grounding

Topic hubs

Topic Hub

H200 Training and Kernel Bring-Up

A curated path through the H200 lane: operator bring-up, step-time anatomy, memory pressure, and the NVIDIA kernel surfaces that actually moved the stack.

David Gornshtein • MegaCppMore posts →

The FA4 Catalog on Blackwell: Variants, sm Guards, and Runtime Selection

Why MegaCpp cares about this

What the catalog contains

The dense FA4 line

Sparse and exact-token lines stay separate

Device guards and runtime selection

How it lands in MegaCpp

Production checklist

Read next

References

Frequently asked questions

Terms used in this article

Continue with a curated reading path

H200 Training and Kernel Bring-Up