Flash Attention 4 in Practice: What We Shipped, What We Cut
Our hybrid stack's applicability matrix for Flash Attention 4, the Build4 validation profiles, the dense-full rollout gates, and the regressions that killed the first FA4 variants before they reached production.

Flash Attention 4 in Practice: What We Shipped, What We Cut
Flash Attention 4 is not a drop-in replacement for Flash Attention 3. In our hybrid Mamba 3 plus Transformer stack, the dense Transformer blocks are a minority of compute but most of the attention-shaped risk, and FA4 changes enough of the kernel contract that "just turn it on" is the wrong verb. What we actually shipped is a bounded opt-in lane on the canonical H200 stack, with a fail-closed applicability matrix, a staged rollout, and a list of variants that looked fine in microbench and were rejected on contract grounds.
This post describes that work the way we track it internally: applicability split, Build4 test profiles, the dense/full rollout manifest, the hybrid prefill/decode plan, and the FA3 control-plane regressions that killed the early FA4 candidates before we ever let them touch a main training run.
Not One Migration, Four Separate Lines
The first non-negotiable rule we wrote down is that "FA4 migration" is not a single line of work. Collapsing the lines is how you end up promoting sparse donor receipts as if they were dense production proof. The taxonomy we actually track splits along five axes:
dense/full FA4: the upstreamflash_attn/cutedense full-attention path in our main dispatcher.blockized sparse CUDA: blockized sparse routing that may execute on Triton or on CuTe/FLASH, with MoBA-style block selection.exact-token DSA: token-topk sparse semantics including the bounded eval/no-gradfa4_gatherpath.serving/decode: product-surface decode truth, which is stricter than model-level plumbing.non-applicable: Mamba-3 and Gated DeltaNet surfaces, which are not attention and must stay labeled that way.
Each line gets its own receipts, its own promotion gate, and its own
fail-closed guards. A checked-in receipt on one line never implies parity on
another. We machine-readably encode this in a catalog
(fa4_catalog.py) so that the planner, the test runner, and the receipt
graders all agree on surface_id, profile_id, and family boundaries, and we
use a companion stoplight matrix for prose synthesis. Truth order when they
disagree: current code on main wins, then open task state, then the
catalog, then the matrix, then dated reports and receipts. Docs never
outrank code.
Applicability: What Actually Counts As Proof
Applicability is where most FA4 enthusiasm goes to die. Our working matrix is roughly this:
- Dense/full FA4 applies now to bounded CUDA causal fixed-length
train/eval/prefill, plus bounded contiguous-KV and paged serving/runtime
decode. Proof requires
actual_backend=dense_fa4on a real bounded execute surface. A--dense_fa4CLI flag, aprefill_backend=dense_fa4config entry, or a planner manifest does not count. - Blockized sparse FA4 applies to bounded CUDA blockized sparse train/eval.
Proof requires the full four-tuple:
requested_backend,actual_backend,runtime_mode, andfallback_reasontogether. Preset names andshadowruns do not count. - Exact-token DSA applies only to exact-token semantics, including bounded
eval/no-grad
fa4_gather. Donor/runtime compare and blockized sparse FLASH receipts are not substitutes. - Serving/decode applies to bounded dense-first serving with contiguous-KV and to paged or scheduler-managed bounded serving lanes. Proof requires engine/runtime evidence on the real bounded serving path; control-plane config acceptance does not count.
The non-negotiable guards behind the matrix: shadow never counts as sparse
FA4 execute proof; donor runtime comparison never counts as mainline execute
proof; TPU rows stay TPU-native and non-FA4; Mamba-3 and Gated DeltaNet stay
non-applicable; and a ServingConfig(prefill_backend="dense_fa4") plus a
bounded decode_backend="dense_fa4" still do not imply paged or
scheduler-managed productization. Those knobs are control-plane surface
until an engine-level FA4 prefill path actually executes.
Build4 Validation Profiles
Build4 is our local validation host for the FA4 code/test substrate. The profile catalog is code, not prose: a machine-readable wave structure lives alongside the dispatcher and is rendered into pytest commands by a helper script instead of being hand-maintained in shell history. Each profile carries four schema fields that keep overclaim out of the receipt stream:
proof_class, one ofplanner,observational,validation, orexecute.artifact_kind, the concrete substrate:contract_test,planner_manifest,launch_manifest_pending,observational_profile_receipt,matrix_summary_receipt,microbench_case_receipt,validated_remote_receipt, ornegative_guard.receipt_grade, matchingproof_class.counts_for_status, which isTrueonly for real execute-grade rows.
Wave 1 is dispatcher-contract coverage on the fixed-length no-KV slice. It
runs CPU and small-CUDA tests on flash_attention.py, the backend matrix,
serving config rejection of prefill_backend="dense_fa4" with paged-KV, the
train-args wiring, and the Modal CuTe contract. Wave 2 crosses into bounded
H200 canonical receipts: a CuTe import-plus-tiny-forward smoke, an H200
dense-decode smoke, an exact-token fa4_gather H200 smoke, and the NAM52
CuTe train ladder receipt family. Wave 2 does not produce promotion rows;
it produces the evidence that wave 3 (rollout receipts) is allowed to run at
all.
The Real Blocker: Dispatch Order
The single biggest reason dense/full FA4 spent a month as "real code, not yet
a production runtime" is dispatch ordering. Our flash_attn_func checks FA3
before dense/full FA4 on CUDA paths where FA3 is available. That means
--dense_fa4 is not equivalent to "must execute dense/full FA4" on a normal
H200; it means "dense/full FA4 is enabled as an optional branch" while FA3
usually wins first. The fix is not a raise on FA3 availability; the fix
is a dedicated selector:
enable_dense_fa4_attention()anddisable_dense_fa4_attention()as explicit entry points.- An internal
_use_dense_fa4(device)helper that stays separate from the sparse backend knobs (moba_backend=fa4,donor_runtime_compare=fa4,block_sparse_runtime_mode). - A
doc_idscontract that is an explicit policy decision and not an implicit fallback. Uniform rows normalize away upstream and may still dispatch on dense FA4; non-uniform rows are rejected with a machine-readable tag (dense_fa4_no_doc_ids_support) and fall through to another dense backend rather than raising.
Per-call truth is recorded in _last_dense_fa4_result, so a test or a
receipt can assert the dispatcher actually took the FA4 branch rather than
inferring it from presets.
Dense/Full Rollout: L0 to R7
The rollout is a manifest, not a vibe. The machine-readable companion lives in the repo and drives the helper that emits manifests and launcher templates. The rungs:
- L0: local dispatcher contract is green on the fixed-length no-KV slice.
- L1: comparison/profile receipt schema is stable; candidate rows fail closed when FA4 did not actually execute.
- R1/R2: bounded H200 canonical execute proof with explicit environment-bound rows.
- R3: Modal detached execute proof with the same shape (batch=1, seq=128, n_head=8, head_dim=64, bf16, causal-only, no KV cache, CuTe interface import check).
- R4: no-KV dense comparison matrix, six configs plus baseline, with complete perf/memory/diff fields and preserved actual-backend truth.
- R5: profiler receipt on the canonical stack, driven by
profile_fa4_vs_triton.pyin--mode=dense. - R6: bounded 2-step train and 100-step short-train receipts, no NaN loss, loss-convergence ratio not worse than 1.05 vs the dense reference preset.
- R7: hybrid prefill receipt with explicit decode truth; promotion gate passes with explicit execute-proof evidence.
Stop rules are concrete: max_abs_diff > 0.01 on any required row, any
NaN/Inf output, throughput regression over 10% versus the dense H200
reference, memory increase over 5%, a failed 2-step train, or any receipt
claiming decode/KV-cache support for dense/full FA4 that has not actually
executed that path. The last one is the rule that catches the most eager
reports.
Hybrid Prefill, Explicit Decode
The first honest hybrid plan is narrow: prompt-prefill may use upstream
flash_attn/cute; decode stays on an already-supported path. Receipts at
this rung record machine binding, the exact prompt length, causal policy,
doc_ids presence, the observed prefill result (including the dispatcher's
requested-vs-effective backend truth and any fallback reason), and an
explicit decode field: either decode_executed=false or
decode_backend="fa3" with the name spelled out. A receipt that silently
mixes FA4 prefill with unlabeled decode fallback is not a receipt.
Decode itself is still blocked on prerequisites we refuse to wave through:
a public decode/KV-cache contract for dense/full FA4 in the dispatcher, a
paged-KV decode semantics proof on that line, a fail-closed doc_ids and
packed-doc policy for decode, a bounded canonical-H200 token-by-token
decode receipt, and a parity matrix after the prefill handoff. The
promotion-gate config keeps no_kv_cache_support active for both
prefill_promotion and full_train_promotion, so the gate itself blocks
honestly when these are missing.
Regressions That Killed Early Variants
Three classes of regression accounted for nearly every rejected candidate.
The first was fail-open inference in the promotion gate. An earlier gate
marked a dense/full candidate ready as soon as all comparison rows passed,
without requiring explicit execute-proof rows for both h200_canonical and
modal_h200. That is exactly how "all green" promotions happen without a
single real execute receipt. We removed the inference, forced
environment-bound rows, and renamed the canonical metrics to
median_tok_per_sec and peak_memory_mib so legacy field names
(throughput_toks, peak_memory_gb) can still parse but cannot be the only
evidence.
The second was a control-plane dtype regression on the exact-token FA3
path. Our chunk-metadata candidate vectorized planning work and restored
row_cu_seqlens to int64 on the layout-facing side while keeping cu_k
int32 on the kernel-facing side. Baseline: 485,376 tok/s; candidate:
552,397 tok/s; delta: +13.8%; peak_memory_mib unchanged. Same exact-token
backend, same packer path, same chunk_plan_count=64, same
runtime_row_metadata_prepared_once=true. We kept it because the runtime
identity did not drift. The earlier non-dtype-correct candidate looked
similarly good in microbench but silently changed the metadata contract
consumed by tests and chunk-layout code, and that is the kind of
"improvement" that turns into a week of triage two refactors later.
The third was serving-config overclaim. An older shape of
AdapterServingEngine.from_config() accepted
prefill_backend="dense_fa4" combined with paged-KV and continuous
batching, because the engine routed prefill through a replay path that
never asked the dispatcher which backend actually ran. We made
ServingConfig fail closed on those combinations and made
AdapterServingEngine.from_config() also fail closed until an engine-level
FA4 prefill path exists. Now the config surface blocks the claim instead
of enabling the illusion.
The Honest Verdict
Dense/full FA4 is real code, a real bounded opt-in lane, and not yet a
general production-grade runtime. Code-wired: yes. Dispatcher-contract
tests: yes. Canonical H200 execute proof: yes. Modal execute proof: the
lane exists and the helper emits the command, but the checked-in receipt is
still pending. No-KV dense comparison matrix, bounded prefill, and bounded
short-train: planned and gated, not executed on main yet. Decode and
KV-cache: blocked on the prerequisites above.
Shortest honest description for the rest of the team: dense/full FA4 is a bounded experimental execution lane with one canonical H200 smoke receipt and a staged rollout, separate from exact-token DSA and from blockized sparse CUDA/FA4. The value of writing it down this way, rather than collapsing it into a single "FA4 is live" bullet, is that the next person to touch the dispatcher can see exactly which line they are standing on and which receipts they still owe.
References (filenames only): FA4_APPLICABILITY_AND_TEST_PROFILES.md,
FA4_PRODUCTION_EXECUTION_PLAN.ru.md,
dense_full_fa4_first_slice_2026-03-13.md,
dense_full_fa4_execution_rollout_2026-03-17.md,
dense_fa4_hybrid_prefill_decode_plan_2026-03-14.md,
dense_full_fa4_rollout_manifest_2026-03-17.json,
fa3_chunk_metadata_candidate_2026-03-11.md,
fa3_chunk_metadata_dtype_candidate_2026-03-11.md.