MegaCpp Engineering • Applied C++ model systems

Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 10 min read • David Gornshtein

Flash Attention

FA4

CuTe

H200

Attention Kernels

Dense Attention

Flash Attention 4 in practice: what we shipped and what we cut

Our hybrid stack's applicability matrix for Flash Attention 4, the validation profiles, the dense-full rollout gates, and the regressions that killed the first FA4 variants before they reached deployment.

Flash Attention 4 is not a drop-in replacement for Flash Attention 3. In our hybrid Mamba 3 plus Transformer stack, the dense Transformer blocks are a minority of compute but most of the attention-shaped risk, and FA4 changes enough of the kernel contract that "just turn it on" is the wrong verb. What we actually shipped is a bounded opt-in path on the canonical H200 stack, with a fail-closed applicability matrix, a staged rollout, and a list of variants that looked fine in microbenchmark and were rejected on contract grounds.

This post describes that work the way we track it in code and public receipts: applicability split, validation test profiles, the dense/full rollout manifest, the hybrid prefill/decode plan, and the FA3 control-path regressions that killed the early FA4 candidates before we ever let them touch a main training run. The companion FA4 catalog on Blackwell covers device-side variant selection; this article stays on the rollout and evidence boundary.

Not One Migration, Four Separate Lines

The first non-negotiable rule we wrote down is that "FA4 migration" is not a single line of work. Collapsing the lines is how you end up promoting sparse donor-side evidence as if it were dense deployment proof. The taxonomy we actually track has four lines plus an explicit non-applicable bucket:

dense/full FA4: the upstream flash_attn/cute dense full-attention path in our main dispatcher.
blockized sparse CUDA: blockized sparse routing that may execute on Triton or on CuTe/FLASH, with MoBA-style block selection.
exact-token DSA: token-topk sparse semantics including the bounded eval/no-grad fa4_gather path.
serving/decode: product-surface decode truth, which is stricter than model-level plumbing.
non-applicable: Mamba-3 and Gated DeltaNet surfaces, which are not attention and must stay labeled that way.

Each line gets its own execution evidence, its own promotion gate, and its own fail-closed guards. A checked-in execution record on one line never implies parity on another. We encode this in a machine-readable catalog so that the planner, the test runner, and the evidence grader all agree on surface IDs, profile IDs, and family boundaries, and we use a companion stoplight matrix for prose synthesis. Truth order when they disagree: current code on main wins, then open task state, then the catalog, then the matrix, then dated reports and execution records. Docs never outrank code.
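The truth order above can be sketched as a ranked lookup. This is a minimal illustration, assuming each source reports a status string for one surface; the source labels, the status values, and the `resolve_status` helper are hypothetical names and not the project's catalog schema.

```python
# Ranked sources, highest authority first (order taken from the paragraph above).
# The labels themselves are hypothetical.
TRUTH_ORDER = [
    "code_on_main",
    "open_task_state",
    "catalog",
    "stoplight_matrix",
    "dated_reports_and_records",
]


def resolve_status(claims):
    """Return (source, status) from the highest-ranked source that has a claim.

    `claims` maps a source label to the status it reports for one surface.
    """
    for source in TRUTH_ORDER:
        if source in claims:
            return source, claims[source]
    raise KeyError("no source has a claim for this surface")


# Three sources disagree; code on main wins.
claims = {
    "stoplight_matrix": "green",
    "catalog": "yellow",
    "code_on_main": "blocked",
}
print(resolve_status(claims))  # ('code_on_main', 'blocked')
```

The point of the ordering is that a stale matrix or report can never upgrade a status that current code contradicts.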

The public-safe proof surface is intentionally small: dense FA4 execute proof sample, dense FA4 KV-cache decode sample, attention validity prefix sample, and FA4 receipt summary sample.

Applicability: What Actually Counts As Proof

Applicability is where most FA4 enthusiasm goes to die. Our working matrix is roughly this:

Dense/full FA4 applies now to bounded CUDA causal fixed-length train/eval/prefill, plus bounded contiguous-KV and paged serving/runtime decode. Proof requires actual_backend=dense_fa4 on a real bounded execute surface. A --dense_fa4 CLI flag, a prefill_backend=dense_fa4 config entry, or a planner manifest does not count.
Blockized sparse FA4 applies to bounded CUDA blockized sparse train/eval. Proof requires the full four-tuple: requested_backend, actual_backend, runtime_mode, and fallback_reason together. Preset names and shadow runs do not count.
Exact-token DSA applies only to exact-token semantics, including bounded eval/no-grad fa4_gather. Donor/runtime compare and blockized sparse FLASH execution records are not substitutes.
Serving/decode applies to bounded dense-first serving with contiguous-KV and to paged or scheduler-managed bounded serving paths. Proof requires engine/runtime evidence on the real bounded serving path; control-path config acceptance does not count.

The non-negotiable guards behind the matrix: shadow never counts as sparse FA4 execute proof; donor runtime comparison never counts as mainline execute proof; TPU rows stay TPU-native and non-FA4; Mamba-3 and Gated DeltaNet stay non-applicable; and a ServingConfig(prefill_backend="dense_fa4") plus a bounded decode_backend="dense_fa4" still do not imply paged or scheduler-managed productization. Those knobs are control-path surface until an engine-level FA4 prefill path actually executes.

That serving split is easiest to reason about next to KV cache and paged attention: config acceptance, contiguous-KV smoke, and paged-KV productization are different receipts.
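The distinction between "requested" and "proven" can be sketched as two small predicates over an execution record. This is an illustration only: the record is assumed to be a plain dict, the four field names come from the matrix above, and the helper names plus the "shadow" value for runtime_mode are assumptions.

```python
# The four fields a blockized sparse record must carry together (from the matrix above).
SPARSE_FOUR_TUPLE = (
    "requested_backend",
    "actual_backend",
    "runtime_mode",
    "fallback_reason",
)


def dense_fa4_proven(record):
    """Dense/full line: only observed runtime truth counts as proof."""
    return record.get("actual_backend") == "dense_fa4"


def sparse_fa4_tuple_complete(record):
    """Blockized sparse line: all four fields must be present on one record.

    A shadow run is never proof; the "shadow" mode value is an assumption.
    """
    if any(key not in record for key in SPARSE_FOUR_TUPLE):
        return False
    return record["runtime_mode"] != "shadow"


# A flag and a config entry only say the path was requested.
requested_only = {"prefill_backend": "dense_fa4", "cli_flags": ["--dense_fa4"]}
executed = {"requested_backend": "dense_fa4", "actual_backend": "dense_fa4"}
fell_back = {"requested_backend": "dense_fa4", "actual_backend": "fa3"}

print(dense_fa4_proven(requested_only))  # False
print(dense_fa4_proven(executed))        # True
print(dense_fa4_proven(fell_back))       # False
```

Both predicates fail closed: a missing field is treated the same way as a wrong one.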

FA4 Validation Profiles

The local validation lane covers the FA4 code and test substrate. The profile catalog is code, not prose: a machine-readable wave structure lives alongside the dispatcher and is rendered into pytest commands by a helper script instead of being hand-maintained in shell history. Each profile carries four schema fields that keep overclaim out of the evidence stream:

proof_class, one of planner, observational, validation, or execute.
artifact_kind, the concrete substrate: contract_test, planner_manifest, launch_manifest_pending, observational_profile_record, matrix_summary_record, microbenchmark_case_record, validated_remote_execution, or negative_guard.
evidence_grade, matching proof_class.
counts_for_status, which is True only for real execute-grade rows.
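One way to picture those four fields is a small validated record type. The sketch below is not the project's schema: the `Profile` class, the `profile_id` field, and the validation messages are assumptions, while the allowed values are the ones listed above.

```python
from dataclasses import dataclass

PROOF_CLASSES = ("planner", "observational", "validation", "execute")
ARTIFACT_KINDS = (
    "contract_test",
    "planner_manifest",
    "launch_manifest_pending",
    "observational_profile_record",
    "matrix_summary_record",
    "microbenchmark_case_record",
    "validated_remote_execution",
    "negative_guard",
)


@dataclass(frozen=True)
class Profile:
    profile_id: str  # hypothetical identifier field
    proof_class: str
    artifact_kind: str
    evidence_grade: str
    counts_for_status: bool

    def __post_init__(self):
        if self.proof_class not in PROOF_CLASSES:
            raise ValueError(f"unknown proof_class: {self.proof_class}")
        if self.artifact_kind not in ARTIFACT_KINDS:
            raise ValueError(f"unknown artifact_kind: {self.artifact_kind}")
        if self.evidence_grade != self.proof_class:
            raise ValueError("evidence_grade must match proof_class")
        # Only execute-grade rows may count toward status.
        if self.counts_for_status and self.proof_class != "execute":
            raise ValueError("only execute-grade rows may count for status")


smoke = Profile("h200_smoke", "execute", "validated_remote_execution", "execute", True)
plan = Profile("modal_plan", "planner", "planner_manifest", "planner", False)
print(smoke.counts_for_status, plan.counts_for_status)  # True False
```

Rejecting a bad row at construction time means a planner manifest cannot be marked as counting for status by accident.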

The first validation tier is dispatcher-contract coverage on the fixed-length no-KV slice. It runs CPU and small-CUDA tests on the public flash-attention backend matrix, serving config rejection of prefill_backend="dense_fa4" with paged-KV, the train-args wiring, and the Modal CuTe contract.
The second tier crosses into bounded H200 canonical execution records: a CuTe import-plus-tiny-forward smoke, an H200 dense-decode smoke, an exact-token fa4_gather H200 smoke, and the small-shape CuTe train ladder evidence family. The second tier does not produce promotion rows; it produces the evidence that the rollout execution tier is allowed to run at all.

A launch manifest is still a planning artifact until it carries runtime truth such as actual backend, request shape, fallback reason, and whether the receipt is prefill-only or decode-shaped. The local FA4 receipt summary sample is the small version of that boundary.

The Real Blocker: Dispatch Order

The single biggest reason dense/full FA4 spent a month as "real code, not yet a runtime" is dispatch ordering. Our flash_attn_func checks FA3 before dense/full FA4 on CUDA paths where FA3 is available. That means --dense_fa4 is not equivalent to "must execute dense/full FA4" on a normal H200; it means "dense/full FA4 is enabled as an optional branch" while FA3 usually wins first. The fix is not a raise on FA3 availability; the fix is a dedicated selector:

enable_dense_fa4_attention() and disable_dense_fa4_attention() as explicit entry points.
A _use_dense_fa4(device) helper that stays separate from the sparse backend knobs (moba_backend=fa4, donor_runtime_compare=fa4, block_sparse_runtime_mode).
A doc_ids contract that is an explicit policy decision and not an implicit fallback. Uniform rows normalize away upstream and may still dispatch on dense FA4; non-uniform rows are rejected with a machine-readable tag (dense_fa4_no_doc_ids_support) and fall through to another dense backend rather than raising.

Per-call truth is recorded in _last_dense_fa4_result, so a test or an execution record can assert that the dispatcher actually took the FA4 branch rather than inferring it from presets.
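A rough shape for such a selector, in plain Python with no GPU dependency, might look like the sketch below. The entry-point names, `_use_dense_fa4`, `_last_dense_fa4_result`, and the `dense_fa4_no_doc_ids_support` tag come from the text; everything else is an assumption, including the `select_dense_backend` wrapper, the result-dict layout, the "other_dense" label, the "dense_fa4_not_enabled" tag, and the reading of "uniform" as "every token in a row shares one document id".

```python
_DENSE_FA4_ENABLED = False
_last_dense_fa4_result = None


def enable_dense_fa4_attention():
    global _DENSE_FA4_ENABLED
    _DENSE_FA4_ENABLED = True


def disable_dense_fa4_attention():
    global _DENSE_FA4_ENABLED
    _DENSE_FA4_ENABLED = False


def _use_dense_fa4(device):
    # Deliberately ignores the sparse backend knobs; only the dense opt-in
    # and the device kind matter here.
    return _DENSE_FA4_ENABLED and device.startswith("cuda")


def _rows_are_uniform(doc_ids):
    # Assumption: a row is uniform when all of its tokens share one document id.
    return all(len(set(row)) <= 1 for row in doc_ids)


def select_dense_backend(device, doc_ids=None):
    """Pick a dense backend and record per-call truth (hypothetical wrapper)."""
    global _last_dense_fa4_result
    if not _use_dense_fa4(device):
        # "dense_fa4_not_enabled" is a made-up tag for this sketch.
        result = {"actual_backend": "other_dense",
                  "fallback_reason": "dense_fa4_not_enabled"}
    elif doc_ids is not None and not _rows_are_uniform(doc_ids):
        # Fall through with a machine-readable tag instead of raising.
        result = {"actual_backend": "other_dense",
                  "fallback_reason": "dense_fa4_no_doc_ids_support"}
    else:
        result = {"actual_backend": "dense_fa4", "fallback_reason": None}
    _last_dense_fa4_result = result
    return result


enable_dense_fa4_attention()
print(select_dense_backend("cuda:0", doc_ids=[[0, 0, 0], [1, 1, 1]]))
print(select_dense_backend("cuda:0", doc_ids=[[0, 0, 1]]))
```

A test can then assert on the recorded result of the most recent call instead of trusting that an enable flag was set.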

Dense/Full Rollout: L0 to R7

The rollout is a manifest, not a vibe. The machine-readable companion lives in the project and drives the helper that emits manifests and runnable command templates. The rungs:

L0: reference dispatcher contract is green on the fixed-length no-KV slice.
L1: comparison/profile evidence schema is stable; candidate rows fail closed when FA4 did not actually execute.
R1/R2: bounded H200 canonical execute proof with explicit environment-bound rows.
R3: Modal detached execute proof with the same shape (batch=1, seq=128, n_head=8, head_dim=64, bf16, causal-only, no KV cache, CuTe interface import check).
R4: no-KV dense comparison matrix, six configs plus baseline, with complete perf/memory/diff fields and preserved actual-backend truth.
R5: profiler evidence on the canonical stack, driven by the public FA4-vs-Triton dense profiling harness.
R6: bounded 2-step train and 100-step short-train execution records, no NaN loss, loss-convergence ratio not worse than 1.05 vs the dense reference preset.
R7: hybrid prefill execution record with explicit decode truth; promotion gate passes with explicit execute-proof evidence.

Stop rules are concrete: max_abs_diff > 0.01 on any required row, any NaN/Inf output, throughput regression over 10% versus the dense H200 reference, memory increase over 5%, a failed 2-step train, or any execution record claiming decode/KV-cache support for dense/full FA4 that has not actually executed that path. The last one is the rule that catches the most eager reports.
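The stop rules translate almost directly into a checker. The sketch below assumes one comparison row and one dense reference row as dicts; the thresholds come from the paragraph above and `median_tok_per_sec` / `peak_memory_mib` are the canonical metric names used in this post, but the other field names and the reason labels are hypothetical.

```python
import math

MAX_ABS_DIFF = 0.01
MAX_THROUGHPUT_REGRESSION = 0.10  # 10% versus the dense reference
MAX_MEMORY_INCREASE = 0.05        # 5% versus the dense reference


def stop_reasons(row, reference):
    """Return every stop rule this row trips; an empty list means continue."""
    reasons = []
    diff = row["max_abs_diff"]
    # A NaN diff compares False against any threshold, so check finiteness first.
    if not math.isfinite(diff) or diff > MAX_ABS_DIFF:
        reasons.append("max_abs_diff")
    if row["has_nan_or_inf"]:
        reasons.append("nan_or_inf_output")
    min_tps = reference["median_tok_per_sec"] * (1 - MAX_THROUGHPUT_REGRESSION)
    if row["median_tok_per_sec"] < min_tps:
        reasons.append("throughput_regression")
    max_mem = reference["peak_memory_mib"] * (1 + MAX_MEMORY_INCREASE)
    if row["peak_memory_mib"] > max_mem:
        reasons.append("memory_increase")
    if not row["two_step_train_ok"]:
        reasons.append("two_step_train_failed")
    if row["claims_decode_support"] and not row["decode_executed"]:
        reasons.append("unexecuted_decode_claim")
    return reasons


# Illustrative numbers only, not measured results.
reference = {"median_tok_per_sec": 500_000.0, "peak_memory_mib": 40_000.0}
eager = {
    "max_abs_diff": 0.002,
    "has_nan_or_inf": False,
    "median_tok_per_sec": 510_000.0,
    "peak_memory_mib": 40_500.0,
    "two_step_train_ok": True,
    "claims_decode_support": True,  # claims decode support...
    "decode_executed": False,       # ...without ever executing that path
}
print(stop_reasons(eager, reference))  # ['unexecuted_decode_claim']
```

Returning all tripped rules, rather than stopping at the first, keeps the receipt useful for triage.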

Hybrid Prefill, Explicit Decode

The first honest hybrid plan is narrow: prompt-prefill may use upstream flash_attn/cute; decode stays on an already-supported path. Receipts at this rung record machine binding, the exact prompt length, causal policy, doc_ids presence, the observed prefill result (including the dispatcher's requested-vs-effective backend truth and any fallback reason), and an explicit decode field: either decode_executed=false or decode_backend="fa3" with the name spelled out. An execution record that silently mixes FA4 prefill with unlabeled decode fallback is not valid evidence.
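The "explicit decode field" rule can be checked mechanically. A minimal sketch, assuming the receipt is a dict: `decode_executed` and `decode_backend` are the field names from the text, while the helper name and the other receipt keys are hypothetical.

```python
def decode_truth_is_explicit(receipt):
    """True only when the receipt states what happened on the decode side.

    Either decode did not run (decode_executed is exactly False), or the
    backend that served decode is spelled out by name.
    """
    if receipt.get("decode_executed") is False:
        return True
    backend = receipt.get("decode_backend")
    return isinstance(backend, str) and backend != ""


prefill_only = {"prefill_actual_backend": "dense_fa4", "decode_executed": False}
named_decode = {"prefill_actual_backend": "dense_fa4", "decode_backend": "fa3"}
silent_mix = {"prefill_actual_backend": "dense_fa4"}  # decode side unlabeled

print(decode_truth_is_explicit(prefill_only))  # True
print(decode_truth_is_explicit(named_decode))  # True
print(decode_truth_is_explicit(silent_mix))    # False
```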

Decode itself is still blocked on prerequisites we refuse to wave through: a public decode/KV-cache contract for dense/full FA4 in the dispatcher, a paged-KV decode semantics proof on that line, a fail-closed doc_ids and packed-doc policy for decode, a bounded canonical-H200 token-by-token decode execution record, and a parity matrix after the prefill handoff. The promotion-gate config keeps no_kv_cache_support active for both prefill_promotion and full_train_promotion, so the gate itself blocks honestly when these are missing.

The dense FA4 KV-cache decode sample keeps that claim bounded: no-KV prefill, contiguous-KV decode, and paged-KV or continuously batched decode are separate receipts.

Regressions That Killed Early Variants

Three classes of regression accounted for nearly every rejected candidate.

The first was fail-open inference in the promotion gate. An earlier gate marked a dense/full candidate ready as soon as all comparison rows passed, without requiring explicit execute-proof rows for both h200_canonical and modal_h200. That is exactly how "all green" promotions happen without a single real execute record. We removed the inference, forced environment-bound rows, and renamed the canonical metrics to median_tok_per_sec and peak_memory_mib so legacy field names (throughput_toks, peak_memory_gb) can still parse but cannot be the only evidence.
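The difference between the fail-open gate and the fixed one fits in a few lines. This is a sketch under assumptions: rows are dicts, `h200_canonical` and `modal_h200` are the environment names from the text, and the `passed` / `environment` keys and both function names are hypothetical.

```python
REQUIRED_ENVIRONMENTS = ("h200_canonical", "modal_h200")


def gate_ready_fail_open(comparison_rows):
    """The old behavior: readiness inferred from comparison rows alone."""
    return bool(comparison_rows) and all(row["passed"] for row in comparison_rows)


def gate_ready(comparison_rows, execute_proof_rows):
    """Fail closed: every required environment needs its own execute-proof row."""
    if not gate_ready_fail_open(comparison_rows):
        return False
    proven = {
        row["environment"]
        for row in execute_proof_rows
        if row.get("actual_backend") == "dense_fa4"
    }
    return all(env in proven for env in REQUIRED_ENVIRONMENTS)


comparison = [{"passed": True}, {"passed": True}]
proofs = [{"environment": "h200_canonical", "actual_backend": "dense_fa4"}]

print(gate_ready_fail_open(comparison))  # True: "all green" with no Modal proof
print(gate_ready(comparison, proofs))    # False: modal_h200 row is missing
```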

The second was a control-path dtype regression on the exact-token FA3 path. Our chunk-metadata candidate vectorized planning work and restored row_cu_seqlens to int64 on the layout-facing side while keeping cu_k int32 on the kernel-facing side. Baseline: 485,376 tok/s; candidate: 552,397 tok/s; delta: +13.8%; peak_memory_mib unchanged. Same exact-token backend, same packer path, same chunk_plan_count=64, same runtime_row_metadata_prepared_once=true. We kept it because the runtime identity did not drift. The earlier non-dtype-correct candidate looked similarly good in microbenchmark but silently changed the metadata contract consumed by tests and chunk-layout code, and that is the kind of "improvement" that turns into a week of triage two refactors later.
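The quoted delta and the "runtime identity did not drift" check can both be reproduced in a few lines. The throughput numbers, chunk_plan_count, runtime_row_metadata_prepared_once, and the int64/int32 split come from the paragraph above; the dict layout and the `identity_drift` helper are hypothetical, and the int32 value on the rejected candidate is an illustrative guess at what a dtype drift would look like.

```python
baseline_tok_per_sec = 485_376
candidate_tok_per_sec = 552_397

delta_pct = (candidate_tok_per_sec - baseline_tok_per_sec) / baseline_tok_per_sec * 100
print(f"delta: {delta_pct:+.1f}%")  # delta: +13.8%


def identity_drift(baseline, candidate):
    """Return the identity fields whose values differ between two runs."""
    keys = baseline.keys() | candidate.keys()
    return sorted(k for k in keys if baseline.get(k) != candidate.get(k))


baseline_identity = {
    "chunk_plan_count": 64,
    "runtime_row_metadata_prepared_once": True,
    "row_cu_seqlens_dtype": "int64",  # layout-facing side
    "cu_k_dtype": "int32",            # kernel-facing side
}
kept_candidate = dict(baseline_identity)
# Illustrative only: a faster candidate that changed the layout-facing dtype.
rejected_candidate = dict(baseline_identity, row_cu_seqlens_dtype="int32")

print(identity_drift(baseline_identity, kept_candidate))      # []
print(identity_drift(baseline_identity, rejected_candidate))  # ['row_cu_seqlens_dtype']
```

A speedup is only comparable to the baseline when the drift list is empty; otherwise the two rows are measuring different contracts.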

The third was serving-config overclaim. An older shape of AdapterServingEngine.from_config() accepted prefill_backend="dense_fa4" combined with paged-KV and continuous batching, because the engine routed prefill through a replay path that never asked the dispatcher which backend actually ran. We made ServingConfig fail closed on those combinations and made AdapterServingEngine.from_config() also fail closed until an engine-level FA4 prefill path exists. Now the config surface blocks the claim instead of enabling the illusion.
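A fail-closed config of this kind can be sketched with a validating dataclass. The class below is a stand-in and not the project's ServingConfig: `prefill_backend` and `decode_backend` are names from the text, while `paged_kv`, `continuous_batching`, the "fa3" defaults, and the error message are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingConfigSketch:
    prefill_backend: str = "fa3"       # default is an assumption
    decode_backend: str = "fa3"        # default is an assumption
    paged_kv: bool = False             # hypothetical field name
    continuous_batching: bool = False  # hypothetical field name

    def __post_init__(self):
        # Reject the combination at construction time instead of accepting a
        # config whose prefill would be routed around the dispatcher.
        if self.prefill_backend == "dense_fa4" and (
            self.paged_kv or self.continuous_batching
        ):
            raise ValueError(
                "dense_fa4 prefill is not supported with paged-KV or "
                "continuous batching"
            )


bounded = ServingConfigSketch(prefill_backend="dense_fa4")
print(bounded.prefill_backend)  # dense_fa4

try:
    ServingConfigSketch(prefill_backend="dense_fa4", paged_kv=True)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Blocking at the config layer means the unsupported combination never reaches an engine that could silently serve it through another path.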

The Honest Verdict

Dense/full FA4 is real code, a real bounded opt-in path, and not yet a general deployment-grade runtime. Code-wired: yes. Dispatcher-contract tests: yes. Canonical H200 execute proof: yes. Modal execute proof: the path exists and the helper emits the command, but the checked-in execution record is still pending. No-KV dense comparison matrix, bounded prefill, and bounded short-train: planned and gated, not executed on main yet. Decode and KV-cache: blocked on the prerequisites above.

Shortest honest description for the rest of the team: dense/full FA4 is a bounded experimental execution path with one canonical H200 smoke record and a staged rollout, separate from exact-token DSA and from blockized sparse CUDA/FA4. The value of writing it down this way, rather than collapsing it into a single "FA4 is live" bullet, is that the next person to touch the dispatcher can see exactly which line they are standing on and which execution records they still owe.

Lane status

Lane	Code wired	Dispatcher tests	Canonical H200 execution record	Modal execution record
dense / full FA4	yes	yes	yes (smoke)	helper emits, evidence pending
blockized sparse FA4	yes	yes	partial	not started
exact-token DSA + FA4	yes	yes	yes	partial
FA3 fallback	yes	yes	yes	yes

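# Fail-closed applicability: dense FA4 is opt-in only.
def select_backend(req):
    """Return (backend, reason); every failed check falls back to FA3."""
    if not req.opt_in_fa4:
        return "fa3", "fa4 not requested"
    # The validator is defined elsewhere and returns (ok, reason).
    ok, reason = _validate_dense_fa4_eligibility(req.shape, req.dtype, req.sm)
    # Always return a 2-tuple so callers never have to branch on the return type.
    return ("fa4_dense", reason) if ok else ("fa3", reason)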
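The dispatcher snippet above relies on a validator that is not shown. As a minimal, self-contained sketch of the same fail-closed pattern, the block below fills in a hypothetical validator. The `Request` dataclass, the reason strings, and the limits in `ALLOWED_DTYPES`, `ALLOWED_SM`, and `MAX_HEAD_DIM` are placeholders invented for illustration. They are not the real FA4 eligibility rules or the MegaCpp implementation.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical limits, chosen only so the example runs.
# They are NOT the real dense FA4 eligibility rules.
ALLOWED_DTYPES = {"bf16", "fp16"}
ALLOWED_SM = {90, 100}
MAX_HEAD_DIM = 128


@dataclass(frozen=True)
class Request:
    opt_in_fa4: bool
    shape: Tuple[int, ...]  # assumed layout: (batch, seq_len, num_heads, head_dim)
    dtype: str
    sm: int


def validate_dense_fa4_eligibility(shape, dtype, sm) -> Tuple[bool, str]:
    """Return (ok, reason). Anything not explicitly allowed is rejected."""
    if sm not in ALLOWED_SM:
        return False, f"unsupported sm: {sm}"
    if dtype not in ALLOWED_DTYPES:
        return False, f"unsupported dtype: {dtype}"
    if len(shape) != 4 or shape[-1] > MAX_HEAD_DIM:
        return False, f"unsupported shape: {shape}"
    return True, "eligible"


def select_backend(req: Request) -> Tuple[str, str]:
    """Always return (backend, reason) so the fallback cause can be logged."""
    if not req.opt_in_fa4:
        return "fa3", "fa4 not requested"
    ok, reason = validate_dense_fa4_eligibility(req.shape, req.dtype, req.sm)
    return ("fa4_dense", reason) if ok else ("fa3", reason)
```

The point of the design is that the only way to reach `"fa4_dense"` is an explicit opt-in plus a passing validator; every other path, including inputs the validator does not recognize, lands on FA3 with a reason attached.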

Frequently asked questions

Is --dense_fa4 enough to prove FA4 executed?

No. The proof surface is runtime truth such as actual_backend=dense_fa4 on a bounded execution record. A flag, config field, or launch manifest can only say the path was requested or allowed.
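A small sketch of that distinction, assuming an execution record is a plain mapping. The `actual_backend=dense_fa4` value comes from the answer above; the `requested_backend` field name and the three status labels are hypothetical and only illustrate the idea.

```python
from typing import Mapping, Optional


def classify_record(record: Mapping[str, Optional[str]]) -> str:
    """Classify one execution record by what it proves.

    'requested_backend' is a hypothetical field name; 'actual_backend'
    follows the runtime-truth field named in the text.
    """
    requested = record.get("requested_backend")
    actual = record.get("actual_backend")
    if actual is None:
        # A flag or manifest alone: the path was requested or allowed,
        # but nothing observed at runtime backs it up.
        return "unproven"
    if actual == requested:
        return "executed"
    return "fallback"


def proves_dense_fa4(record: Mapping[str, Optional[str]]) -> bool:
    """Only the runtime-reported backend counts as proof."""
    return record.get("actual_backend") == "dense_fa4"
```

A record that carries only the request classifies as `unproven`, which is exactly the case the flag-only argument falls into.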

Why keep dense/full FA4 separate from exact-token DSA and sparse FA4?

Those lines fail for different reasons and need different receipts. Dense/full FA4 is mostly an attention-dispatch and validity-shape problem, exact-token DSA is a gather-and-validity problem, and sparse FA4 depends on routing contracts that do not transfer from the dense lane.

What is still blocked?

Broad deployment-grade decode and KV-cache support. Prefill and smoke proofs exist on narrower surfaces, but paged-KV and continuously batched decode need their own runtime receipts.

Does speculative target verification make dense FA4 a decode proof?

No. Target verification can look like bounded prefill only when the drafted token block is contiguous and verified as a batch. That is useful for speculative decoding, but it still does not prove paged-KV or continuously batched decode. MoBA belongs to the blockized sparse line for the same reason: the local MoBA block-sparse decode sample is a requested-versus-actual backend receipt, not evidence that dense/full FA4 covered the sparse path.

Why is contiguous-KV decode not the same as paged-KV decode?

Contiguous-KV decode can append K/V and gather a bounded prefix before calling a dense varlen kernel; paged-KV decode has to respect block tables, scheduler moves, and cache residency. The dense FA4 KV-cache decode sample is useful because it keeps that distinction explicit instead of treating an append-style helper as a general serving engine.
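The difference shows up even in a toy model. The sketch below uses plain Python lists in place of tensors, and the function names, the `block_table` layout, and the block size are all hypothetical; it is not the sample referenced above. It only contrasts a contiguous prefix slice with a gather through block indirection.

```python
from typing import List, Sequence


def gather_contiguous(kv: Sequence[str], prefix_len: int) -> List[str]:
    """Contiguous cache: the bounded prefix is a plain slice."""
    return list(kv[:prefix_len])


def gather_paged(
    pool: Sequence[Sequence[str]],
    block_table: Sequence[int],
    block_size: int,
    prefix_len: int,
) -> List[str]:
    """Paged cache: each logical position is resolved through the block table.

    pool[b] is physical block b; block_table[i] is the physical block that
    holds logical positions [i * block_size, (i + 1) * block_size).
    """
    out = []
    for pos in range(prefix_len):
        physical_block = block_table[pos // block_size]
        out.append(pool[physical_block][pos % block_size])
    return out


# Same logical prefix, two storage contracts.
contiguous = ["k0", "k1", "k2", "k3", "k4"]
pool = [["k2", "k3"], ["k4", "pad"], ["k0", "k1"]]  # physical blocks, out of order
block_table = [2, 0, 1]                              # logical block -> physical block
```

Both calls return the same five entries for `prefix_len=5`, but only the paged path depends on `block_table`, which a scheduler may rewrite between steps. A receipt for the contiguous helper therefore says nothing about whether the indirection, residency, or scheduler moves were handled.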

Which local files show the requested-versus-executed distinction?

Attention validity prefix sample shows the validity-side contract, dense FA4 execute proof sample shows the minimal execute-proof surface, and FA4 receipt summary sample shows how rollout summaries keep requested, effective, and fallback states separate.

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

AttentionValidity

The validity carrier built from row-level counts or masks so sparse or structured attention paths know which token prefix is real without re-inferring it inside the compiled region.

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

Paged attention

The decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

CuTe

CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

scheduler

The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.

doc_ids

The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.

detached

A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.

continuous batching

The serving scheduler policy that keeps admitting decode and prefill work into the same rolling batch window instead of waiting for one whole batch to finish first.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
