Attention Validity and Structure-Aware Attention
A packed-row validity regression, the clustered-sparse follow-up it forced, and the structure-aware attention plan we are integrating into the MegaCpp training stack.

A lot of model quality on long C++ contexts hides inside boring metadata: which tokens in a packed row are actually valid, which blocks a clustered sparse router is allowed to attend to, and whether a piece of structural information is expected to be present, absent, or explicitly zero. When any one of those three states gets silently collapsed into another, the consequences are not noisy - they are quiet and plausible, and they survive through training because nothing obviously explodes. This post walks through one regression we fixed, the clustered-sparse follow-up that regression forced into the open, and how the result is feeding the structure-aware attention plan we are integrating next.
The Packed-Row Validity Regression
We train on packed rows, where multiple documents share a single fixed-length sequence. Attention-validity metadata tells downstream kernels which tokens in the row are real, where each document starts and ends, and how to mask across document boundaries. On current main, the training script canonicalized that metadata for two paths in particular: CUDA + FSDP distributed, and XLA. Both canonicalizers had a seemingly harmless rule: if any validity tensor is present in the batch, zero-fill all the other validity fields that are missing, so that every path downstream sees a uniformly shaped dict.
For packed-row metadata that is slot-prefix-only - meaning the batch supplies block-level slot counts but deliberately omits token-level counts - the "zero-fill missing" rule rewrote several absent fields into zero tensors:
- missing row_valid_token_counts became a zero tensor.
- missing row_valid_block_counts became a zero tensor.
- missing row_block_size_tokens became a zero tensor.
- missing base_block_tokens became a zero tensor.
The smallest checked-in sketch of that contract is the
attention-validity prefix sample: token_prefix means explicit per-row
token counts, slot_prefix means slot counts plus base_block_tokens, and
mode="none" is the intentional absence case. That one sample is enough
to explain why "absent" and "present but zero" cannot be collapsed.
The function is called shape canonicalization, but the semantic content
it was producing was not shape. It flipped "token-prefix is absent" into
"token-prefix is explicitly zero." Downstream consumers of
normalize_attention_validity() read that as a real, zero-length
token prefix, which is exactly the same as telling the kernel "this row
has zero valid tokens." No crashes, no NaNs, just silently-masked rows
inside training batches.
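To make the flip concrete, here is a minimal sketch of how a normalizer might choose between the three modes. The mode names come from the text above; the dict keys (token_counts, slot_counts), the dataclass layout, and the function name normalize_validity_sketch are assumptions for illustration, not the checked-in normalize_attention_validity().

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AttentionValidity:
    """Hypothetical carrier; the real class may hold tensors and more fields."""
    mode: str  # "token_prefix", "slot_prefix", or "none"
    token_counts: Optional[Sequence[int]] = None
    slot_counts: Optional[Sequence[int]] = None
    base_block_tokens: Optional[int] = None


def normalize_validity_sketch(meta: dict) -> AttentionValidity:
    """Pick the strongest contract the batch actually supplies."""
    if meta.get("token_counts") is not None:
        # Explicit per-row token counts, including a deliberate zero.
        return AttentionValidity("token_prefix", token_counts=meta["token_counts"])
    if meta.get("slot_counts") is not None and meta.get("base_block_tokens") is not None:
        return AttentionValidity(
            "slot_prefix",
            slot_counts=meta["slot_counts"],
            base_block_tokens=meta["base_block_tokens"],
        )
    # Nothing supplied: intentional absence, not "zero valid tokens".
    return AttentionValidity("none")


slot_only = {"slot_counts": [3, 2], "base_block_tokens": 128}
zero_filled = {**slot_only, "token_counts": [0, 0]}  # what the old rule produced

print(normalize_validity_sketch(slot_only).mode)    # slot_prefix
print(normalize_validity_sketch(zero_filled).mode)  # token_prefix
```

Under these assumptions the zero-filled batch is indistinguishable from a loader that really meant "zero valid tokens per row", which is why the distinction has to be preserved before normalization rather than recovered after it.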
The affected code was upstream of the attention path itself. The bug
was not in the attention-validity helper, the FlashAttention backend
adapter, or the clustered sparse module; it was in the main training
entrypoint, in _canonicalize_structure_meta_for_fsdp_cuda and
_canonicalize_structure_meta_for_xla. The fix is a one-line philosophy
applied in two places: missing validity fields stay absent. Structural,
platform, and tree metadata continue to be shape-stabilized as before
- those were never the problem - but optional validity fields are now
preserved in their original present/absent state. Slot-only metadata
now stays slot-only, and normalize_attention_validity() sees slot_counts
present and token_prefix absent unless the loader actually supplied one.
The fix landed with regression tests across the obvious surfaces: attention-validity tests, targeted coverage in training entrypoint regression tests for the CUDA and XLA canonicalizers preserving absent token-validity fields while still injecting the missing required CUDA/XLA keys, the attention-validity integration tests, and a flash-attention test run to confirm nothing downstream regressed.
The durable rule for future canonicalizers is now written down: missing validity fields must stay absent unless the runtime is intentionally deriving them from a stronger contract. Shape stabilization and semantic injection are different operations and should never share a code path.
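A minimal sketch of what "different operations" can look like in one function follows. The four optional validity keys are the ones named earlier in this post; the required key names (structure_ids, platform_ids), the list placeholders, and the function name are hypothetical and stand in for whatever the real canonicalizers stabilize.

```python
from typing import Any, Dict

# Hypothetical required keys; the real canonicalizers define their own lists.
REQUIRED_SHAPE_KEYS = ("structure_ids", "platform_ids")
OPTIONAL_VALIDITY_KEYS = (
    "row_valid_token_counts",
    "row_valid_block_counts",
    "row_block_size_tokens",
    "base_block_tokens",
)


def canonicalize_structure_meta(meta: Dict[str, Any], batch_size: int) -> Dict[str, Any]:
    """Shape-stabilize required keys; pass optional validity keys through untouched."""
    out: Dict[str, Any] = {}
    for key in REQUIRED_SHAPE_KEYS:
        value = meta.get(key)
        # Shape stabilization: a missing required key gets a neutral placeholder.
        out[key] = value if value is not None else [0] * batch_size
    for key in OPTIONAL_VALIDITY_KEYS:
        # Semantic fields: copy only what the loader supplied, never invent zeros.
        if meta.get(key) is not None:
            out[key] = meta[key]
    return out
```

The point of the split is that the two loops cannot be merged by accident: adding a key to the optional tuple can never cause it to be zero-filled.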
That direction also lines up with the public framework surface. PyTorch's
varlen_attn
API carries packed-sequence boundaries through cumulative sequence tensors
(cu_seq_q, cu_seq_k) instead of a padded dense batch, and the
PyTorch/XLA recompilation guide
calls out data-dependent outputs such as torch.nonzero plus real-dimension
queries as recurring sources of recompilation pressure. That is not proof
that every packed-row runtime must use the same interface, but it is strong
external evidence for the same bias: keep validity metadata compact and
typed, and avoid turning it into a dynamic-mask problem by accident.
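As an illustration of the compact representation, the sketch below builds cumulative sequence offsets for a packed row in plain Python. It deliberately does not call varlen_attn; in a real pipeline the offsets would be an integer tensor, and the function name here is an assumption.

```python
from itertools import accumulate
from typing import List


def cumulative_seq_offsets(doc_lengths: List[int]) -> List[int]:
    """Return offsets [0, l0, l0+l1, ...] marking document boundaries in a packed row."""
    if any(length < 0 for length in doc_lengths):
        raise ValueError("document lengths must be non-negative")
    return [0, *accumulate(doc_lengths)]


offsets = cumulative_seq_offsets([5, 3, 8])
# Longest document in the row, recovered from the offsets alone.
max_len = max(end - start for start, end in zip(offsets, offsets[1:]))
print(offsets, max_len)  # [0, 5, 8, 16] 8
```

Three documents cost four integers regardless of row length, and a zero-length document shows up as two equal adjacent offsets rather than as a missing entry.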
The Clustered-Sparse Follow-Up
Fixing the canonicalizer closed the source of the regression but did not
close an ambiguity it exposed downstream. Our clustered sparse
attention contract has a function
(_resolve_attention_validity_contract) that keeps slot_prefix
metadata only when the batch's base_block_tokens equals the attention
kernel's query_tile_size. When those do not match, the code had a
fallback path, and the fallback had never been covered by a test.
Enumerating the cases: slot_prefix is present, but base_block_tokens
does not match query_tile_size. Two subpaths:
- If an auxiliary token_prefix is present in the batch, clustered sparse falls back to token_prefix and proceeds with token-level validity.
- If no auxiliary token_prefix exists, the contract degrades to AttentionValidity(mode="none"). Clustered sparse attention then behaves as if no validity metadata was ever supplied.
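The two subpaths, plus the matching case, can be sketched as a small resolver. This is an illustrative reading of the behavior described above, not the body of _resolve_attention_validity_contract; the argument names and the (mode, payload) return shape are assumptions.

```python
from typing import Optional, Sequence, Tuple


def resolve_validity_contract(
    slot_counts: Optional[Sequence[int]],
    base_block_tokens: Optional[int],
    token_prefix: Optional[Sequence[int]],
    query_tile_size: int,
) -> Tuple[str, Optional[Sequence[int]]]:
    """Return (mode, payload) for the clustered-sparse path."""
    if slot_counts is not None and base_block_tokens == query_tile_size:
        # Block sizes agree: keep the slot contract.
        return "slot_prefix", slot_counts
    if token_prefix is not None:
        # Mismatch, but token-level counts exist: fall back to them.
        return "token_prefix", token_prefix
    # Mismatch and no token fallback: explicit absence, never zero-fill.
    return "none", None


print(resolve_validity_contract([2, 1], 64, None, 128))  # ('none', None)
```

With the old zero-fill rule, the last call would have arrived with token_prefix=[0, 0] and taken the second branch instead.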
That second branch is what the packed-row fix made visible. Before the
fix, slot-only rows with mismatched block sizes would pick up a
zero-filled token_prefix from the canonicalizer and hit the
"fall back to token_prefix" branch, with token_prefix semantically
meaning "zero valid tokens." After the fix, they hit the
mode="none" branch, which is also the explicitly documented behavior.
Neither behavior is a crash; both are product decisions about what
"slot-only metadata, wrong block size, no token fallback" should mean
on the clustered sparse path.
The follow-up added one targeted regression test that pins the current
explicit behavior: slot-prefix-only metadata, mismatched
base_block_tokens, no auxiliary token_prefix, the contract degrades
to mode="none". The residual risk is honest and written into the
report: this is now explicit, but it is still a product decision rather
than a final semantic contract. A future follow-up may choose stricter
fallback behavior, or preserve a coarse slot contract instead. The
value of the test is that whichever choice we make later, we will make
it on purpose.
Attention Validity, Presence, and Absence
The general rule the packed-row incident forced us to write down: attention validity metadata has three states, not two.
- Present and populated. Use it. The kernel gets real counts and masks.
- Absent. Do not invent values. Downstream code selects a lower-signal contract (for example, degrading from token-prefix to slot-prefix, or from slot-prefix to mode="none"), and does so explicitly.
- Present and zero. This is a real semantic state that means "zero valid tokens," and it must only ever arise because the loader or runtime intentionally produced zero. It is never the result of a missing tensor being normalized to zero.
Canonicalization code at any level of the stack must preserve the absence vs present-zero distinction. That is the rule we now enforce in tests, and the rule we require of any new metadata path.
Why This Matters for a Code Model
The failure mode here is not theoretical. For C++ code packed into long training rows, slot-only metadata is common: we often know block boundaries from the chunker without materializing token-level counts in the loader. A quiet degradation to "zero valid tokens" on a fraction of those rows looks, in loss curves, like a subtle data-quality problem or a subtle learning-rate problem, depending on your priors. It does not look like a canonicalizer bug, which is what it is.
That is why backend eligibility is downstream of metadata semantics, not a substitute for them. A faster FA4, FlexAttention, or Pallas lane does not repair a canonicalizer that already collapsed "absent" into "present-zero"; it only makes the wrong contract execute on a different kernel. Backend promotion is evidence for execution, not a replacement for the validity rule itself.
Recommendation Hierarchy Before We Change the Kernel
The packed-row fix plus the clustered-sparse test kept the existing dense path honest. The next move is a post-attention gate, not a kernel rewrite. Our review of the sink/spike, gated-attention, streaming-sinks, and DynamicTanh papers produced an explicit recommendation order that lines up with what our backends can actually accept without breaking contracts:
- First: an optional query-dependent sigmoid gate after the attention output, applied before c_proj. This addresses attention sinks more directly than bounded softcap squashing, stays backend-agnostic across FA3, CuTe-backed FA4, FlexAttention, Pallas, and Splash, and preserves the current qk_norm, qk_clip, and softcap logic instead of replacing them.
- Second: instrumentation and a packed-doc audit. First-token attention mass, max and high-percentile hidden activations, prefix-vs-suffix usage on packed documents, and sink behavior per document rather than per row. Part of the observed bias likely comes from best_fit packing cropping document prefixes and oversampling document starts, which is a data question, not an attention-math question.
- Third: DynamicTanh as a separate research track. Larger architectural blast radius, initialization sensitivity, and full train-dynamics impact make it wrong as the first production-facing mitigation step.
- Fourth: sink-aware serving and KV-cache follow-up. Sparse decode can reduce KV reads; paged KV is the real route to storage savings; sink-window retention is a bounded serving heuristic, not a claim of long-dependency preservation for code.
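For the instrumentation item, one of the simplest per-document measurements is first-token attention mass. The sketch below assumes a single head's attention probabilities as a nested list indexed [query][key] and a per-token doc_ids list for the packed row; the function name and the choice to skip each document's first query are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def first_token_mass_per_doc(attn: List[List[float]], doc_ids: List[int]) -> Dict[int, float]:
    """Mean attention probability a document's queries give to that document's first token."""
    first_pos: Dict[int, int] = {}
    for pos, doc in enumerate(doc_ids):
        first_pos.setdefault(doc, pos)

    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for q, doc in enumerate(doc_ids):
        if q == first_pos[doc]:
            # Under causal + document masking the first token can only attend
            # to itself, so including it would inflate the average.
            continue
        totals[doc] += attn[q][first_pos[doc]]
        counts[doc] += 1
    # Single-token documents have no qualifying queries and are omitted.
    return {doc: totals[doc] / counts[doc] for doc in counts}
```

Reporting the value per document rather than per row is what lets the audit separate sink behavior from the packing bias described in the second item.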
The gated-attention V1 spec is deliberately narrow: one config group
(attn_output_gate, attn_output_gate_granularity="head",
attn_output_gate_bias, attn_output_gate_init="identity_bias",
attn_output_gate_log_stats). The gate initializes close to identity -
weight near zero, bias positive so sigmoid(bias) is near 1 - so V1
does not destabilize existing checkpoints. The dense module gains a
per-head c_gate linear and multiplies the per-head attention output
by sigmoid(c_gate(x)) before flattening into c_proj. The sparse
module mirrors the same math through a shared helper in the
_full_attention path and the _finalize_sparse_output exit, so
dense and sparse do not drift semantically. Checkpoint compatibility
is explicit: gate parameters exist only when the feature is enabled,
old checkpoints load silently with the feature off, and turning the
feature on against an old checkpoint produces clean missing-key
behavior for the gate parameters, which operators see only when they
enable it deliberately.
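The gate math is small enough to show without a framework. The sketch below handles one token in plain Python: x is the hidden vector feeding c_gate, head_outputs holds one output vector per head, and the result is the flattened vector that would enter c_proj. The helper names, the exactly-zero weights, and the bias value 4.0 are assumptions; the text only specifies "weight near zero, bias positive".

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def init_identity_bias_gate(
    n_heads: int, d_model: int, bias: float = 4.0
) -> Tuple[List[List[float]], List[float]]:
    """Zero weights and a positive bias, so every gate starts at sigmoid(bias), close to 1."""
    weight = [[0.0] * d_model for _ in range(n_heads)]
    return weight, [bias] * n_heads


def gate_head_outputs(
    x: List[float],
    head_outputs: List[List[float]],
    weight: List[List[float]],
    bias: List[float],
) -> List[float]:
    """Scale each head's output by sigmoid(w_h . x + b_h), then flatten for c_proj."""
    flat: List[float] = []
    for w_h, b_h, out_h in zip(weight, bias, head_outputs):
        gate = sigmoid(sum(w * xi for w, xi in zip(w_h, x)) + b_h)
        flat.extend(gate * v for v in out_h)
    return flat
```

At this initialization the gate ignores x and scales every head by the same constant just under 1, which is why enabling it is expected to perturb an existing checkpoint only slightly until the weights start to train.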
Structure-Aware Attention Integration
Longer-term, the attention-validity work feeds into the structure-aware attention plan. The thesis is short: code structure is mostly an offline fact. We already have tree-sitter chunks, clang call/type edges, and token-aligned AST metadata available in the enriched parquet contract; the plan is simply to stop re-deriving weak versions of them online.
The proposed end-state attention contract for code has three parts:
a local causal window, exact structural neighbors from an offline
graph IR, and a small learned overflow budget for edges the graph did
not capture. Execution stays block-friendly: semantic blocks derived
from chunk_boundaries replace fixed 128-token blocks, fixed-K
neighbor lists come from offline preprocessing, and the runtime layout
fits existing block-sparse plumbing. The four-path incremental rollout
sequences it as graph-bias first (relation-aware bias, no hard masking
beyond causal/doc masks), then semantic block sparse (chunk-aligned
blocks in place of fixed-size MoBA blocks), then offline sparse
structure IR (fixed-K neighbors fed to the model, eliminating the
online generic router for structure-driven layers), then a hybrid
overflow router on top.
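At block granularity, the three-part contract reduces to a set union followed by a causal filter. The sketch below assumes semantic blocks are already numbered in sequence order and that the offline IR supplies a neighbor list per query block; the function name, argument names, and the way the overflow budget is applied are illustrative assumptions.

```python
from typing import Dict, List, Set


def allowed_key_blocks(
    query_block: int,
    local_window: int,
    neighbors: Dict[int, List[int]],
    overflow: List[int],
    overflow_budget: int,
) -> List[int]:
    """Union of local causal window, offline structural neighbors, and capped overflow."""
    allowed: Set[int] = set(range(max(0, query_block - local_window), query_block + 1))
    # Fixed-K structural neighbors from the offline graph IR.
    allowed.update(neighbors.get(query_block, []))
    # Learned overflow picks, truncated to the budget.
    allowed.update(overflow[:overflow_budget])
    # Block-level causality: never attend to a later block.
    return sorted(b for b in allowed if 0 <= b <= query_block)


print(allowed_key_blocks(6, 1, {6: [0, 2, 8]}, [3, 1], 1))  # [0, 2, 3, 5, 6]
```

A query block with no entry in the neighbor map simply gets its local window plus overflow, which is the block-level analogue of "absent means degrade explicitly".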
The required integration contract is strict: training consumes token
IDs, token-aligned metadata, and precomputed sparse/graph IR. No
text-to-token, no int-to-string-to-AST, no runtime tree-sitter or
clang. Parquet rows used for training should already contain or
cheaply derive input_ids, target_ids, doc_ids (or enough to
reconstruct them from BOS), token_structure_ids,
token_dep_levels, optional token_ast_depth, token_sibling_index,
token_ast_node_type, token/chunk graph metadata (token_chunk_starts,
token_chunk_ends, token_chunk_dep_levels, token_call_edges,
token_type_edges), and, once available, the precomputed sparse
structure IR itself. The preprocessing parser stack branches cleanly:
tree-sitter v11 for the syntactic line, clang v12 for the semantic
line, both converging on one token-only consumer contract with
identical packed-row semantics.
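A loader-side check for that contract can stay small. The sketch below uses column names from the list above, but the split between required and merely token-aligned columns is an assumption for illustration, as is the function name; it reports absent columns instead of filling them.

```python
from typing import Any, Dict, List, Tuple

# Column names follow the text; the required/optional split is assumed here.
REQUIRED = ("input_ids", "target_ids")
TOKEN_ALIGNED = (
    "doc_ids", "token_structure_ids", "token_dep_levels",
    "token_ast_depth", "token_sibling_index", "token_ast_node_type",
)


def check_row(row: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Return (present, absent) token-aligned columns; raise on missing or misaligned data."""
    for name in REQUIRED:
        if row.get(name) is None:
            raise KeyError(f"missing required column: {name}")
    n = len(row["input_ids"])
    if len(row["target_ids"]) != n:
        raise ValueError("target_ids is not aligned with input_ids")

    present: List[str] = []
    absent: List[str] = []
    for name in TOKEN_ALIGNED:
        col = row.get(name)
        if col is None:
            absent.append(name)  # stays absent; nothing is zero-filled here
        elif len(col) != n:
            raise ValueError(f"{name} has {len(col)} entries, expected {n}")
        else:
            present.append(name)
    return present, absent
```

Returning the absent list, rather than a row padded with zeros, keeps the presence information available to whatever selects the attention contract later.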
Where the Validity Rule Meets the Structure Plan
The packed-row fix is the reason the structure plan can move forward without relitigating basic metadata semantics. Offline graph IR, semantic blocks, and fixed-K neighbor lists all expand the set of optional tensors each batch can carry. If the canonicalizer's old "zero-fill missing" rule were still in place, every new optional tensor would be one more silent zero-equals-absent trap waiting to happen at scale. With the absence-preserving rule enforced and tested, we can add structural fields to the contract confidently: present and populated means use the graph, absent means degrade to a lower-signal contract explicitly, and present-zero is reserved for the case where the preprocessing actually emitted zero.
Two immediate next moves follow from this. The gated-attention V1 ablation runs on the current dense and sparse paths without touching validity, to separate the architecture effect from the packing effect. In parallel, the structure pipeline produces its first fixed-K neighbor IR per chunk, with explicit presence/absence semantics, loaded through the same canonicalization discipline. By the time the semantic-block sparse path lands, the validity contract has already been proven to survive two new optional-metadata introductions without quiet regressions. That is the only scaffolding that makes the structure-aware attention integration safe enough to run.
The Short Version
A canonicalizer that zero-filled missing attention-validity fields
silently converted "absent" into "zero valid tokens" for slot-only
packed rows. The fix preserves absence; the regression tests pin it;
the clustered-sparse follow-up pins the explicit fallback to
mode="none" when slot_prefix metadata and query_tile_size do not
agree and no token_prefix exists. The durable rule - present,
absent, and present-zero are three distinct states and none of them
may be invented by shape canonicalization - is now enforced across
the CUDA+FSDP and XLA paths. On that foundation, the gated-attention
V1 work and the structure-aware attention integration can proceed
without paying rent to a metadata ambiguity we have already fixed.
Validity states at a glance
| State | Meaning | What the kernel must do |
|---|---|---|
| Present, nonzero | explicit valid-token count | use as-is |
| Present, zero | row is intentionally empty | skip row, do not synthesise |
| Absent | metadata omitted for this row | fall back via contract, never invent |

    # fallback rule applied when resolving the clustered-sparse validity contract
    # (the canonicalizer itself only preserves absence)
    if token_prefix is None and not slot_prefix_matches(query_tile_size):
        return ValidityMode.NONE  # explicit absence, not zero-fill

Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
- AttentionValidity: The validity carrier built from row-level counts or masks so sparse or structured attention paths know which token prefix is real without re-inferring it inside the compiled region.
- Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
- FA4: FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.
- CuTe: CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.
- Pallas: JAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.
- Splash: The stable TPU attention family used for dense or local-mask lanes before MegaCpp drops to narrower planner-driven sparse contracts.
- Attention sink: The long-context failure mode where a few tokens, often the first token, absorb disproportionate attention mass and behave like a null-attention valve.
- doc_ids: The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.
- KV Cache: The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.
- Paged KV: The decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.
- Packed rows: Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…
- CUDA: NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.