Flash Attention 4 in practice: what we shipped and what we cut
Our hybrid stack's applicability matrix for Flash Attention 4, the validation profiles, the dense-full rollout gates, and the regressions that killed the first FA4 variants before they reached deployment.

Flash Attention 4 is not a drop-in replacement for Flash Attention 3. In our hybrid Mamba 3 plus Transformer stack, the dense Transformer blocks are a minority of compute but most of the attention-shaped risk, and FA4 changes enough of the kernel contract that "just turn it on" is the wrong verb.
What we actually shipped is a bounded opt-in path on the canonical H200 stack, with a fail-closed applicability matrix, a staged rollout, and a list of variants that looked fine in microbenchmark and were rejected on contract grounds.
This post describes that work the way we track it in code and public receipts: applicability split, validation test profiles, the dense/full rollout manifest, the hybrid prefill/decode plan, and the FA3 control-path regressions that killed the early FA4 candidates before we ever let them touch a main training run. The companion FA4 catalog on Blackwell covers device-side variant selection; this article stays on the rollout and evidence boundary.
Not One Migration, Four Separate Lines
The first non-negotiable rule we wrote down is that "FA4 migration" is not a single line of work. Collapsing the lines is how you end up promoting sparse donor-side evidence as if it were dense deployment proof. The taxonomy we actually track has five entries, four lines of work plus one explicit non-applicable bucket:
- dense/full FA4: the upstream flash_attn/cute dense full-attention path in our main dispatcher.
- blockized sparse CUDA: blockized sparse routing that may execute on Triton or on CuTe/FLASH, with MoBA-style block selection.
- exact-token DSA: token-topk sparse semantics including the bounded eval/no-grad fa4_gather path.
- serving/decode: product-surface decode truth, which is stricter than model-level plumbing.
- non-applicable: Mamba-3 and Gated DeltaNet surfaces, which are not attention and must stay labeled that way.
Each line gets its own execution evidence, its own promotion gate, and its own
fail-closed guards. A checked-in execution record on one line never implies parity on
another. We encode this in a machine-readable catalog so that the planner, the test runner, and the evidence grader all agree on surface IDs, profile IDs, and family boundaries, and we use a companion stoplight matrix for prose synthesis. Truth order when they
disagree: current code on main wins, then open task state, then the
catalog, then the matrix, then dated reports and execution records. Docs never
outrank code.
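The truth order reads naturally as a priority list. The following is a minimal sketch, assuming each source reports its claim about one surface as a plain string; the source labels and the `resolve_claim` helper are hypothetical names chosen for illustration, not the project's catalog API.

```python
# Hypothetical labels for the sources named above, highest authority first.
TRUTH_ORDER = [
    "code_on_main",
    "open_task_state",
    "catalog",
    "stoplight_matrix",
    "dated_report_or_execution_record",
]


def resolve_claim(claims):
    """Return (source, value) from the highest-ranked source that has a claim.

    `claims` maps a source label to what that source says about one surface.
    Unknown labels are rejected so a new kind of document cannot silently
    outrank code.
    """
    unknown = set(claims) - set(TRUTH_ORDER)
    if unknown:
        raise ValueError(f"unranked sources: {sorted(unknown)}")
    for source in TRUTH_ORDER:
        if source in claims:
            return source, claims[source]
    raise LookupError("no source has a claim for this surface")


# The matrix is stale in this example; current code on main wins.
winner = resolve_claim({
    "stoplight_matrix": "dense_fa4 decode: green",
    "code_on_main": "dense_fa4 decode: blocked",
})
print(winner)
```

Rejecting unranked sources is a design choice in this sketch: it keeps the ordering fail-closed in the same spirit as the rest of the evidence pipeline.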
The public-safe proof surface is intentionally small: dense FA4 execute proof sample, dense FA4 KV-cache decode sample, attention validity prefix sample, and FA4 receipt summary sample.
Applicability: What Actually Counts As Proof
Applicability is where most FA4 enthusiasm goes to die. Our working matrix is roughly this:
- Dense/full FA4 applies now to bounded CUDA causal fixed-length train/eval/prefill, plus bounded contiguous-KV and paged serving/runtime decode. Proof requires actual_backend=dense_fa4 on a real bounded execute surface. A --dense_fa4 CLI flag, a prefill_backend=dense_fa4 config entry, or a planner manifest does not count.
- Blockized sparse FA4 applies to bounded CUDA blockized sparse train/eval. Proof requires the full four-tuple: requested_backend, actual_backend, runtime_mode, and fallback_reason together. Preset names and shadow runs do not count.
- Exact-token DSA applies only to exact-token semantics, including bounded eval/no-grad fa4_gather. Donor/runtime compare and blockized sparse FLASH execution records are not substitutes.
- Serving/decode applies to bounded dense-first serving with contiguous-KV and to paged or scheduler-managed bounded serving paths. Proof requires engine/runtime evidence on the real bounded serving path; control-path config acceptance does not count.
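The proof rules in the matrix above can be sketched as two small predicates over an evidence record. This is an illustration under assumptions: the record is a plain dict, the `executed` flag and the sparse `actual_backend` value `"fa4"` are hypothetical, and only the field names `requested_backend`, `actual_backend`, `runtime_mode`, and `fallback_reason` come from the text.

```python
SPARSE_PROOF_FIELDS = (
    "requested_backend",
    "actual_backend",
    "runtime_mode",
    "fallback_reason",
)


def dense_fa4_proven(record):
    """Dense/full proof: the record comes from an executed surface and
    reports dense_fa4 as the backend that actually ran."""
    return (
        record.get("executed") is True
        and record.get("actual_backend") == "dense_fa4"
    )


def sparse_fa4_proven(record):
    """Blockized sparse proof: all four fields present, not a shadow run,
    FA4 actually ran, and nothing fell back."""
    if any(field not in record for field in SPARSE_PROOF_FIELDS):
        return False
    if record["runtime_mode"] == "shadow":
        return False
    return record["actual_backend"] == "fa4" and record["fallback_reason"] is None


# A flag plus a config entry is a request, not proof.
flag_only = {"cli_flags": ["--dense_fa4"], "prefill_backend": "dense_fa4"}
executed = {"executed": True, "actual_backend": "dense_fa4"}
shadow = {
    "requested_backend": "fa4",
    "actual_backend": "fa4",
    "runtime_mode": "shadow",
    "fallback_reason": None,
}
print(dense_fa4_proven(flag_only), dense_fa4_proven(executed), sparse_fa4_proven(shadow))
```

Both predicates default to False when a field is missing, so an incomplete record can never be read as proof.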
The non-negotiable guards behind the matrix: shadow never counts as sparse FA4 execute proof; donor runtime comparison never counts as mainline execute proof; TPU rows stay TPU-native and non-FA4; Mamba-3 and Gated DeltaNet stay non-applicable; and a ServingConfig(prefill_backend="dense_fa4") plus a bounded decode_backend="dense_fa4" still do not imply paged or scheduler-managed productization. Those knobs are control-path surface until an engine-level FA4 prefill path actually executes.
That serving split is easiest to reason about next to KV cache and paged attention: config acceptance, contiguous-KV smoke, and paged-KV productization are different receipts.
FA4 Validation Profiles
The local validation lane covers the FA4 code and test substrate. The profile catalog is code, not prose: a machine-readable wave structure lives alongside the dispatcher and is rendered into pytest commands by a helper script instead of being hand-maintained in shell history. Each profile carries four schema fields that keep overclaim out of the evidence stream:
- proof_class, one of planner, observational, validation, or execute.
- artifact_kind, the concrete substrate: contract_test, planner_manifest, launch_manifest_pending, observational_profile_record, matrix_summary_record, microbenchmark_case_record, validated_remote_execution, or negative_guard.
- evidence_grade, matching proof_class.
- counts_for_status, which is True only for real execute-grade rows.
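The four schema fields above lend themselves to a small validated record type. A minimal sketch, assuming a Python dataclass is an acceptable stand-in for the real catalog entry: the class name `ValidationProfile`, the `profile_id` field, and the example IDs are hypothetical, while the four field names and their allowed values are taken from the list.

```python
from dataclasses import dataclass

PROOF_CLASSES = ("planner", "observational", "validation", "execute")
ARTIFACT_KINDS = (
    "contract_test",
    "planner_manifest",
    "launch_manifest_pending",
    "observational_profile_record",
    "matrix_summary_record",
    "microbenchmark_case_record",
    "validated_remote_execution",
    "negative_guard",
)


@dataclass(frozen=True)
class ValidationProfile:
    profile_id: str  # hypothetical identifier field
    proof_class: str
    artifact_kind: str
    evidence_grade: str
    counts_for_status: bool

    def __post_init__(self):
        if self.proof_class not in PROOF_CLASSES:
            raise ValueError(f"unknown proof_class: {self.proof_class}")
        if self.artifact_kind not in ARTIFACT_KINDS:
            raise ValueError(f"unknown artifact_kind: {self.artifact_kind}")
        if self.evidence_grade != self.proof_class:
            raise ValueError("evidence_grade must match proof_class")
        # Only execute-grade rows may ever count toward status.
        if self.counts_for_status and self.proof_class != "execute":
            raise ValueError("only execute-grade rows may count for status")


contract = ValidationProfile(
    profile_id="dispatcher_contract_no_kv",
    proof_class="validation",
    artifact_kind="contract_test",
    evidence_grade="validation",
    counts_for_status=False,
)
print(contract.counts_for_status)
```

Validating at construction time means an overclaiming row fails when the catalog is loaded, not when someone reads a report.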
The first validation tier is dispatcher-contract coverage on the fixed-length no-KV slice. It runs CPU and small-CUDA tests on the public flash-attention backend matrix, serving config rejection of prefill_backend="dense_fa4" with paged-KV, the train-args wiring, and the Modal CuTe contract. The second tier crosses into bounded H200 canonical execution records: a CuTe import-plus-tiny-forward smoke, an H200 dense-decode smoke, an exact-token fa4_gather H200 smoke, and the small-shape CuTe train ladder evidence family. The second tier does not produce promotion rows; it produces the evidence that the rollout execution tier is allowed to run at all.
A launch manifest is still a planning artifact until it carries runtime truth such as actual backend, request shape, fallback reason, and whether the receipt is prefill-only or decode-shaped. The local FA4 receipt summary sample is the small version of that boundary.
The Real Blocker: Dispatch Order
The single biggest reason dense/full FA4 spent a month as "real code, not yet a runtime" is dispatch ordering. Our flash_attn_func checks FA3 before dense/full FA4 on CUDA paths where FA3 is available. That means --dense_fa4 is not equivalent to "must execute dense/full FA4" on a normal H200; it means "dense/full FA4 is enabled as an optional branch" while FA3 usually wins first. The fix is not a raise on FA3 availability; the fix is a dedicated selector:
- enable_dense_fa4_attention() and disable_dense_fa4_attention() as explicit entry points.
- A _use_dense_fa4(device) helper that stays separate from the sparse backend knobs (moba_backend=fa4, donor_runtime_compare=fa4, block_sparse_runtime_mode).
- A doc_ids contract that is an explicit policy decision and not an implicit fallback. Uniform rows normalize away upstream and may still dispatch on dense FA4; non-uniform rows are rejected with a machine-readable tag (dense_fa4_no_doc_ids_support) and fall through to another dense backend rather than raising.
Per-call truth is recorded in _last_dense_fa4_result, so a test or an execution record can assert that the dispatcher actually took the FA4 branch rather than inferring it from presets.
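A toy version of that selector makes the ordering concrete. This is a sketch under stated assumptions, not the project's dispatcher: `select_dense_backend`, `last_dense_fa4_result`, and `_rows_are_uniform` are hypothetical, devices are plain strings, "uniform" is assumed to mean every token in a row shares one document id, and FA3 stands in for "another dense backend". The entry-point names, `_use_dense_fa4`, `_last_dense_fa4_result`, and the fallback tag come from the text.

```python
_dense_fa4_enabled = False
_last_dense_fa4_result = None


def enable_dense_fa4_attention():
    global _dense_fa4_enabled
    _dense_fa4_enabled = True


def disable_dense_fa4_attention():
    global _dense_fa4_enabled
    _dense_fa4_enabled = False


def _use_dense_fa4(device):
    # Only the dense opt-in and the device are consulted; sparse knobs are not.
    return _dense_fa4_enabled and str(device).startswith("cuda")


def _rows_are_uniform(doc_ids):
    # Assumption: a row is uniform when all of its tokens share one document id.
    return doc_ids is None or all(len(set(row)) <= 1 for row in doc_ids)


def select_dense_backend(device, doc_ids=None):
    """Check the dense FA4 selector before FA3 and record what happened."""
    global _last_dense_fa4_result
    requested = _use_dense_fa4(device)
    if requested and _rows_are_uniform(doc_ids):
        actual, reason = "dense_fa4", None
    elif requested:
        # Fall through to another dense backend instead of raising.
        actual, reason = "fa3", "dense_fa4_no_doc_ids_support"
    else:
        actual, reason = "fa3", None
    _last_dense_fa4_result = {
        "requested": requested,
        "actual_backend": actual,
        "fallback_reason": reason,
    }
    return actual


def last_dense_fa4_result():
    """Accessor so tests assert on recorded truth instead of presets."""
    return _last_dense_fa4_result


enable_dense_fa4_attention()
print(select_dense_backend("cuda:0", doc_ids=[[7, 7, 7], [9, 9]]))
print(select_dense_backend("cuda:0", doc_ids=[[7, 7, 8]]), last_dense_fa4_result())
```

The point of the recorded result is that a caller who requested FA4 and got FA3 can see both facts, plus the reason, without re-deriving them from configuration.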
Dense/Full Rollout: L0 to R7
The rollout is a manifest, not a vibe. The machine-readable companion lives in the project and drives the helper that emits manifests and runnable command templates. The rungs:
- L0: reference dispatcher contract is green on the fixed-length no-KV slice.
- L1: comparison/profile evidence schema is stable; candidate rows fail closed when FA4 did not actually execute.
- R1/R2: bounded H200 canonical execute proof with explicit environment-bound rows.
- R3: Modal detached execute proof with the same shape (batch=1, seq=128, n_head=8, head_dim=64, bf16, causal-only, no KV cache, CuTe interface import check).
- R4: no-KV dense comparison matrix, six configs plus baseline, with complete perf/memory/diff fields and preserved actual-backend truth.
- R5: profiler evidence on the canonical stack, driven by the public FA4-vs-Triton dense profiling harness.
- R6: bounded 2-step train and 100-step short-train execution records, no NaN loss, loss-convergence ratio not worse than 1.05 vs the dense reference preset.
- R7: hybrid prefill execution record with explicit decode truth; promotion gate passes with explicit execute-proof evidence.
Stop rules are concrete: max_abs_diff > 0.01 on any required row, any NaN/Inf output, throughput regression over 10% versus the dense H200 reference, memory increase over 5%, a failed 2-step train, or any execution record claiming decode/KV-cache support for dense/full FA4 that has not actually executed that path. The last one is the rule that catches the most eager reports.
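The stop rules translate directly into a checker. A minimal sketch, assuming each candidate row and the dense reference are plain dicts: the thresholds and the metric names `max_abs_diff`, `median_tok_per_sec`, and `peak_memory_mib` come from this article, while the remaining field names, the reason tags, and the numbers in the example are hypothetical and purely illustrative.

```python
import math


def stop_reasons(row, reference):
    """Return the stop rules a candidate row trips; an empty list means continue."""
    reasons = []
    diff = row["max_abs_diff"]
    # NaN compares False against any threshold, so treat it as a stop explicitly.
    if math.isnan(diff) or diff > 0.01:
        reasons.append("max_abs_diff")
    if not row["output_finite"]:
        reasons.append("nan_or_inf_output")
    # Throughput regression over 10% versus the dense reference.
    if row["median_tok_per_sec"] < 0.90 * reference["median_tok_per_sec"]:
        reasons.append("throughput_regression")
    # Memory increase over 5% versus the dense reference.
    if row["peak_memory_mib"] > 1.05 * reference["peak_memory_mib"]:
        reasons.append("memory_increase")
    if not row["two_step_train_ok"]:
        reasons.append("two_step_train_failed")
    # Claiming decode/KV-cache support without having executed that path.
    if row["claims_decode_support"] and not row["decode_executed"]:
        reasons.append("decode_claim_without_execution")
    return reasons


reference = {"median_tok_per_sec": 100000.0, "peak_memory_mib": 40000.0}
eager_report = {
    "max_abs_diff": 0.004,
    "output_finite": True,
    "median_tok_per_sec": 85000.0,
    "peak_memory_mib": 40500.0,
    "two_step_train_ok": True,
    "claims_decode_support": True,
    "decode_executed": False,
}
print(stop_reasons(eager_report, reference))
```

Returning every tripped rule, rather than stopping at the first, keeps the report useful when a candidate fails for more than one reason.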
Hybrid Prefill, Explicit Decode
The first honest hybrid plan is narrow: prompt-prefill may use upstream flash_attn/cute; decode stays on an already-supported path. Receipts at this rung record machine binding, the exact prompt length, causal policy, doc_ids presence, the observed prefill result (including the dispatcher's requested-vs-effective backend truth and any fallback reason), and an explicit decode field: either decode_executed=false or decode_backend="fa3" with the name spelled out. An execution record that silently mixes FA4 prefill with unlabeled decode fallback is not valid evidence.
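A receipt check for this rung might look like the following. It is a sketch under assumptions: the receipt is a plain dict, and every key except `decode_executed` and `decode_backend` is a hypothetical name for one of the items the paragraph above says a receipt must record.

```python
# Hypothetical key names for the items a hybrid prefill receipt must record.
REQUIRED_RECEIPT_FIELDS = (
    "machine",
    "prompt_len",
    "causal",
    "doc_ids_present",
    "prefill_requested_backend",
    "prefill_actual_backend",
    "prefill_fallback_reason",
)


def decode_field_is_explicit(receipt):
    """Either decode did not run, or the decode backend is named outright."""
    if receipt.get("decode_executed") is False:
        return True
    backend = receipt.get("decode_backend")
    return isinstance(backend, str) and backend != ""


def receipt_problems(receipt):
    """List what keeps a hybrid prefill receipt from being valid evidence."""
    problems = [f"missing:{key}" for key in REQUIRED_RECEIPT_FIELDS if key not in receipt]
    if not decode_field_is_explicit(receipt):
        problems.append("decode_not_explicit")
    return problems


receipt = {
    "machine": "h200_canonical",
    "prompt_len": 128,
    "causal": True,
    "doc_ids_present": False,
    "prefill_requested_backend": "dense_fa4",
    "prefill_actual_backend": "dense_fa4",
    "prefill_fallback_reason": None,
    "decode_executed": True,
    # No decode_backend: decode ran, but the receipt does not say on what.
}
print(receipt_problems(receipt))
```

Adding `decode_backend="fa3"` to the example, or recording `decode_executed=False`, clears the one remaining problem.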
Decode itself is still blocked on prerequisites we refuse to wave through: a public decode/KV-cache contract for dense/full FA4 in the dispatcher, a paged-KV decode semantics proof on that line, a fail-closed doc_ids and packed-doc policy for decode, a bounded canonical-H200 token-by-token decode execution record, and a parity matrix after the prefill handoff. The promotion-gate config keeps no_kv_cache_support active for both prefill_promotion and full_train_promotion, so the gate itself blocks honestly when these are missing.
The dense FA4 KV-cache decode sample keeps that claim bounded: no-KV prefill, contiguous-KV decode, and paged-KV or continuously batched decode are separate receipts.
Regressions That Killed Early Variants
Three classes of regression accounted for nearly every rejected candidate.
The first was fail-open inference in the promotion gate. An earlier gate
marked a dense/full candidate ready as soon as all comparison rows passed,
without requiring explicit execute-proof rows for both h200_canonical and
modal_h200. That is exactly how "all green" promotions happen without a
single real execute record. We removed the inference, forced
environment-bound rows, and renamed the canonical metrics to
median_tok_per_sec and peak_memory_mib so legacy field names
(throughput_toks, peak_memory_gb) can still parse but cannot be the only
evidence.
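The repaired gate can be sketched as follows. This is an illustration, not the project's gate: rows are assumed to be plain dicts, and the `passed` and `environment` keys plus both function names are hypothetical. The environment names and the canonical and legacy metric names are the ones given above.

```python
REQUIRED_ENVIRONMENTS = ("h200_canonical", "modal_h200")
CANONICAL_METRICS = ("median_tok_per_sec", "peak_memory_mib")


def has_canonical_metrics(row):
    """Legacy names (throughput_toks, peak_memory_gb) are tolerated in a row,
    but only the canonical names make it count as evidence."""
    return all(name in row for name in CANONICAL_METRICS)


def candidate_ready(comparison_rows, execute_rows):
    """Ready only if comparisons pass AND each required environment has an
    execute row where dense_fa4 actually ran."""
    if not comparison_rows or not all(row.get("passed") for row in comparison_rows):
        return False
    proven = {
        row.get("environment")
        for row in execute_rows
        if row.get("actual_backend") == "dense_fa4" and has_canonical_metrics(row)
    }
    return all(env in proven for env in REQUIRED_ENVIRONMENTS)


comparisons = [{"passed": True}, {"passed": True}]
h200 = {
    "environment": "h200_canonical",
    "actual_backend": "dense_fa4",
    "median_tok_per_sec": 1.0,
    "peak_memory_mib": 1.0,
}
# "All green" comparisons with one environment missing must not promote.
print(candidate_ready(comparisons, [h200]))
```

The old fail-open behavior corresponds to returning True right after the comparison check; the sketch makes the extra requirement a separate, visible step.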
The second was a control-path dtype regression on the exact-token FA3
path. Our chunk-metadata candidate vectorized planning work and restored
row_cu_seqlens to int64 on the layout-facing side while keeping cu_k
int32 on the kernel-facing side. Baseline: 485,376 tok/s; candidate:
552,397 tok/s; delta: +13.8%; peak_memory_mib unchanged. Same exact-token
backend, same packer path, same chunk_plan_count=64, same
runtime_row_metadata_prepared_once=true. We kept it because the runtime
identity did not drift. The earlier non-dtype-correct candidate looked
similarly good in microbenchmark but silently changed the metadata contract
consumed by tests and chunk-layout code, and that is the kind of
"improvement" that turns into a week of triage two refactors later.
The third was serving-config overclaim. An older shape of AdapterServingEngine.from_config() accepted prefill_backend="dense_fa4" combined with paged-KV and continuous batching, because the engine routed prefill through a replay path that never asked the dispatcher which backend actually ran. We made ServingConfig fail closed on those combinations and made AdapterServingEngine.from_config() also fail closed until an engine-level FA4 prefill path exists. Now the config surface blocks the claim instead of enabling the illusion.
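A fail-closed config of this kind can be sketched in a few lines. This is a stand-in, not the real ServingConfig: the class name `ServingConfigSketch`, the `paged_kv` and `continuous_batching` field names, and the default backend values are assumptions; only `prefill_backend`, `decode_backend`, and the rejected combination come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingConfigSketch:
    prefill_backend: str = "fa3"
    decode_backend: str = "fa3"
    paged_kv: bool = False             # hypothetical field name
    continuous_batching: bool = False  # hypothetical field name

    def __post_init__(self):
        # Fail closed: no engine-level FA4 prefill path exists for these
        # combinations, so the config must not be able to claim one.
        if self.prefill_backend == "dense_fa4" and (
            self.paged_kv or self.continuous_batching
        ):
            raise ValueError(
                "prefill_backend='dense_fa4' is rejected with paged KV "
                "or continuous batching"
            )


# The bounded contiguous-KV case is still accepted.
bounded = ServingConfigSketch(prefill_backend="dense_fa4", decode_backend="dense_fa4")

try:
    ServingConfigSketch(prefill_backend="dense_fa4", paged_kv=True)
    rejected = False
except ValueError:
    rejected = True
print(bounded.prefill_backend, rejected)
```

Accepting the bounded configuration here says nothing about what executes; that still has to come from dispatcher-recorded truth.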
The Honest Verdict
Dense/full FA4 is real code, a real bounded opt-in path, and not yet a general deployment-grade runtime. Code-wired: yes. Dispatcher-contract tests: yes. Canonical H200 execute proof: yes. Modal execute proof: the path exists and the helper emits the command, but the checked-in execution record is still pending. No-KV dense comparison matrix, bounded prefill, and bounded short-train: planned and gated, not executed on main yet. Decode and KV-cache: blocked on the prerequisites above.
Shortest honest description for the rest of the team: dense/full FA4 is a bounded experimental execution path with one canonical H200 smoke record and a staged rollout, separate from exact-token DSA and from blockized sparse CUDA/FA4. The value of writing it down this way, rather than collapsing it into a single "FA4 is live" bullet, is that the next person to touch the dispatcher can see exactly which line they are standing on and which execution records they still owe.
Lane status
| Lane | Code wired | Dispatcher tests | Canonical H200 execution record | Modal execution record |
|---|---|---|---|---|
| dense / full FA4 | yes | yes | yes (smoke) | helper emits, evidence pending |
| blockized sparse FA4 | yes | yes | partial | not started |
| exact-token DSA + FA4 | yes | yes | yes | partial |
| FA3 fallback | yes | yes | yes | yes |
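# Fail-closed applicability - dense FA4 is opt-in only.
# Always returns (backend, fallback_reason) so callers never have to guess the shape.
def select_backend(req):
    if not req.opt_in_fa4:
        # FA4 was never requested, so there is no fallback to report.
        return "fa3", None
    # _validate_dense_fa4_eligibility is the project's eligibility check; it returns (ok, reason).
    ok, reason = _validate_dense_fa4_eligibility(req.shape, req.dtype, req.sm)
    return ("dense_fa4", None) if ok else ("fa3", reason)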
Frequently asked questions
Is --dense_fa4 enough to prove FA4 executed?
No. Proof requires actual_backend=dense_fa4 on a bounded execution record. A flag, config field, or launch manifest can only say the path was requested or allowed.
Why keep dense/full FA4 separate from exact-token DSA and sparse FA4?
What is still blocked?
Does speculative target verification make dense FA4 a decode proof?
Why is contiguous-KV decode not the same as paged-KV decode?
Which local files show the requested-versus-executed distinction?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
FA4: FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.
The validity carrier built from row-level counts or masks so sparse or structured attention paths know which token prefix is real without re-inferring it inside the compiled region.
KV Cache: The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.
Paged attention: The decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.
Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
CuTe: CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.
DSA: DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.
scheduler: The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.
doc_ids: The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.
Modal: A container-first GPU execution surface with explicit image, GPU, Volume, Secret, and detached-launch primitives that MegaCpp uses for isolated benchmark and validation lanes.
detached: A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.
continuous batching: The serving scheduler policy that keeps admitting decode and prefill work into the same rolling batch window instead of waiting for one whole batch to finish first.
H200: NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
Mamba3: A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…
CUDA: NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.