MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
Published · 11 min read · David Gornshtein
Attention Validity
Packed Rows
Clustered Sparse
Structure Aware Attention
Code Modeling

Attention Validity and Structure-Aware Attention

A packed-row validity regression, the clustered-sparse follow-up it forced, and the structure-aware attention plan we are integrating into the MegaCpp training stack.


A lot of model quality on long C++ contexts hides inside boring metadata: which tokens in a packed row are actually valid, which blocks a clustered sparse router is allowed to attend to, and whether a piece of structural information is expected to be present, absent, or explicitly zero. When any one of those three states gets silently collapsed into another, the consequences are not noisy - they are quiet and plausible, and they survive through training because nothing obviously explodes. This post walks through one regression we fixed, the clustered-sparse follow-up that regression forced into the open, and how the result is feeding the structure-aware attention plan we are integrating next.

The Packed-Row Validity Regression

We train on packed rows, where multiple documents share a single fixed-length sequence. Attention-validity metadata tells downstream kernels which tokens in the row are real, where each document starts and ends, and how to mask across document boundaries. On current main, the training script canonicalized that metadata for two paths in particular: CUDA + FSDP distributed, and XLA. Both canonicalizers had a seemingly harmless rule: if any validity tensor is present in the batch, zero-fill all the other validity fields that are missing, so that every path downstream sees a uniformly shaped dict.

For packed-row metadata that is slot-prefix-only - meaning the batch supplies block-level slot counts but deliberately omits token-level counts - the "zero-fill missing" rule rewrote several absent fields into zero tensors:

  • missing row_valid_token_counts became a zero tensor.
  • missing row_valid_block_counts became a zero tensor.
  • missing row_block_size_tokens became a zero tensor.
  • missing base_block_tokens became a zero tensor.

The smallest checked-in sketch of that contract is the Attention-validity prefix sample: token_prefix means explicit per-row token counts, slot_prefix means slot counts plus base_block_tokens, and mode="none" is the intentional absence case. That one sample is enough to explain why "absent" and "present but zero" cannot be collapsed.

The function is called shape canonicalization, but the semantic content it was producing was not shape. It flipped "token-prefix is absent" into "token-prefix is explicitly zero." Downstream consumers of normalize_attention_validity() read that as a real, zero-length token prefix, which is exactly the same as telling the kernel "this row has zero valid tokens." No crashes, no NaNs, just silently-masked rows inside training batches.

The affected code was upstream of the attention path itself. The bug was not in the attention-validity helper, the FlashAttention backend adapter, or the clustered sparse module; it was in the main training entrypoint, in _canonicalize_structure_meta_for_fsdp_cuda and _canonicalize_structure_meta_for_xla. The fix is a one-line philosophy applied in two places: missing validity fields stay absent. Structural, platform, and tree metadata continue to be shape-stabilized as before (those were never the problem), but optional validity fields are now preserved in their original present/absent state. Slot-only metadata now stays slot-only, and normalize_attention_validity() sees slot_counts present and token_prefix absent unless the loader actually supplied one.
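The difference between shape stabilization and semantic injection can be sketched in a few lines. This is a minimal illustration, not the training entrypoint's code: the function name, the `platform_ids` and `tree_depths` keys, and the `row_valid_slot_counts` key are hypothetical stand-ins, and plain lists stand in for tensors.

```python
from typing import Any, Dict

# Hypothetical key sets; the real canonicalizers' key lists are not shown in this post.
SHAPE_STABILIZED_KEYS = ("platform_ids", "tree_depths")
OPTIONAL_VALIDITY_KEYS = (
    "row_valid_token_counts",
    "row_valid_block_counts",
    "row_block_size_tokens",
    "base_block_tokens",
)


def canonicalize_structure_meta(meta: Dict[str, Any], batch_size: int) -> Dict[str, Any]:
    """Shape-stabilize non-validity keys; leave optional validity keys untouched."""
    out = dict(meta)
    for key in SHAPE_STABILIZED_KEYS:
        # Shape stabilization: a missing structural key gets a zero placeholder.
        out.setdefault(key, [0] * batch_size)
    # Deliberately no loop over OPTIONAL_VALIDITY_KEYS: absent stays absent.
    return out


# Slot-only batch: slot counts and block size supplied, token counts omitted.
slot_only = {"row_valid_slot_counts": [3, 2], "base_block_tokens": [128, 128]}
result = canonicalize_structure_meta(slot_only, batch_size=2)
```

After the call, `result` carries the placeholder structural keys, but `row_valid_token_counts` is still missing, so a downstream normalizer can tell "not supplied" apart from "supplied as zero".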

The fix landed with regression tests across the obvious surfaces: the attention-validity tests; targeted coverage in the training-entrypoint regression tests, which check that the CUDA and XLA canonicalizers preserve absent token-validity fields while still injecting the missing required CUDA/XLA keys; the attention-validity integration tests; and a flash-attention test run to confirm nothing downstream regressed. The durable rule for future canonicalizers is now written down: missing validity fields must stay absent unless the runtime is intentionally deriving them from a stronger contract. Shape stabilization and semantic injection are different operations and should never share a code path.

That direction also lines up with the public framework surface. PyTorch's varlen_attn API carries packed-sequence boundaries through cumulative sequence tensors (cu_seq_q, cu_seq_k) instead of a padded dense batch, and the PyTorch/XLA recompilation guide calls out data-dependent outputs such as torch.nonzero plus real-dimension queries as recurring sources of recompilation pressure. That is not proof that every packed-row runtime must use the same interface, but it is strong external evidence for the same bias: keep validity metadata compact and typed, and avoid turning it into a dynamic-mask problem by accident.
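To make the "compact and typed" point concrete, here is a small sketch of the cumulative-offset layout that packed-sequence interfaces of this kind rely on. It does not call varlen_attn or any framework API; the helper name is hypothetical and it only shows how per-document lengths turn into boundary offsets.

```python
from itertools import accumulate
from typing import List


def cumulative_seq_offsets(doc_lengths: List[int]) -> List[int]:
    """Return [0, l0, l0+l1, ...]; document i spans offsets[i]:offsets[i+1]."""
    return [0, *accumulate(doc_lengths)]


# Three documents of 5, 3, and 8 tokens packed back to back.
offsets = cumulative_seq_offsets([5, 3, 8])
# Document 1 occupies packed positions offsets[1] up to offsets[2] - 1.
doc1_span = (offsets[1], offsets[2])
```

The metadata grows with the number of documents rather than the number of tokens, and its shape does not depend on which tokens are valid, which is the property the recompilation guidance cares about.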

The Clustered-Sparse Follow-Up

Fixing the canonicalizer closed the source of the regression but did not close an ambiguity it exposed downstream. Our clustered sparse attention contract has a function (_resolve_attention_validity_contract) that keeps slot_prefix metadata only when the batch's base_block_tokens equals the attention kernel's query_tile_size. When those do not match, the code had a fallback path, and the fallback had never been covered by a test.

Enumerating the cases: slot_prefix is present, but base_block_tokens does not match query_tile_size. Two subpaths:

  • If an auxiliary token_prefix is present in the batch, clustered sparse falls back to token_prefix and proceeds with token-level validity.
  • If no auxiliary token_prefix exists, the contract degrades to AttentionValidity(mode="none"). Clustered sparse attention then behaves as if no validity metadata was ever supplied.

That second branch is what the packed-row fix made visible. Before the fix, slot-only rows with mismatched block sizes would pick up a zero-filled token_prefix from the canonicalizer and hit the "fall back to token_prefix" branch, with token_prefix semantically meaning "zero valid tokens." After the fix, they hit the mode="none" branch, which is also the explicitly documented behavior. Neither behavior is a crash; both are product decisions about what "slot-only metadata, wrong block size, no token fallback" should mean on the clustered sparse path.
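The branch structure described above can be sketched as follows. This is an assumption-laden stand-in rather than the real _resolve_attention_validity_contract: the dataclass, the function signature, and the use of a scalar base_block_tokens are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AttentionValidity:
    """Stand-in carrier; the real class has more fields than shown here."""
    mode: str  # "slot_prefix", "token_prefix", or "none"
    counts: Optional[List[int]] = None


def resolve_validity_contract(
    slot_counts: Optional[List[int]],
    base_block_tokens: Optional[int],
    token_prefix: Optional[List[int]],
    query_tile_size: int,
) -> AttentionValidity:
    # Slot-prefix metadata is kept only when its block size matches the kernel tile.
    if slot_counts is not None and base_block_tokens == query_tile_size:
        return AttentionValidity("slot_prefix", slot_counts)
    # Otherwise fall back to token-level validity, if the loader supplied it.
    if token_prefix is not None:
        return AttentionValidity("token_prefix", token_prefix)
    # No usable metadata: explicit absence, never an invented zero prefix.
    return AttentionValidity("none")


matched = resolve_validity_contract([3, 2], 128, None, query_tile_size=128)
fallback = resolve_validity_contract([3, 2], 64, [300, 120], query_tile_size=128)
degraded = resolve_validity_contract([3, 2], 64, None, query_tile_size=128)
```

With the old zero-filling canonicalizer, the `degraded` case would have arrived with a token_prefix of zeros and taken the second branch instead, which is the behavior change the paragraph above describes.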

The follow-up added one targeted regression test that pins the current explicit behavior: with slot-prefix-only metadata, mismatched base_block_tokens, and no auxiliary token_prefix, the contract degrades to mode="none". The residual risk is honest and written into the report: this is now explicit, but it is still a product decision rather than a final semantic contract. A future follow-up may choose stricter fallback behavior, or preserve a coarse slot contract instead. The value of the test is that whichever choice we make later, we will make it on purpose.

Attention Validity, Presence, and Absence

The general rule the packed-row incident forced us to write down: attention validity metadata has three states, not two.

  • Present and populated. Use it. The kernel gets real counts and masks.
  • Absent. Do not invent values. Downstream code selects a lower-signal contract (for example, degrading from token-prefix to slot-prefix, or from slot-prefix to mode="none"), and does so explicitly.
  • Present and zero. This is a real semantic state that means "zero valid tokens," and it must only ever arise because the loader or runtime intentionally produced zero. It is never the result of a missing tensor being normalized to zero.

Canonicalization code at any level of the stack must preserve the absence vs present-zero distinction. That is the rule we now enforce in tests, and the rule we require of any new metadata path.
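One way to turn that rule into a reusable test helper is to compare which validity keys are present before and after canonicalization. The sketch below is illustrative only: the helper names and the two toy canonicalizers are hypothetical, and lists stand in for tensors.

```python
from typing import Any, Callable, Dict, FrozenSet, Iterable

Meta = Dict[str, Any]
VALIDITY_KEYS = ("row_valid_token_counts", "base_block_tokens")


def presence_signature(meta: Meta, keys: Iterable[str]) -> FrozenSet[str]:
    """Keys that are actually supplied; a present-zero value still counts as present."""
    return frozenset(k for k in keys if meta.get(k) is not None)


def preserves_presence(canonicalize: Callable[[Meta], Meta], meta: Meta) -> bool:
    before = presence_signature(meta, VALIDITY_KEYS)
    after = presence_signature(canonicalize(meta), VALIDITY_KEYS)
    return before == after


def zero_fill(meta: Meta) -> Meta:
    # Old behaviour, reproduced for contrast: missing validity keys become zeros.
    return {**{k: [0] for k in VALIDITY_KEYS}, **meta}


def keep_absent(meta: Meta) -> Meta:
    # Absence-preserving behaviour: validity keys pass through untouched.
    return dict(meta)


slot_only = {"base_block_tokens": [128]}
old_ok = preserves_presence(zero_fill, slot_only)
new_ok = preserves_presence(keep_absent, slot_only)
```

The check fails for the zero-filling variant and passes for the absence-preserving one, without needing to know anything about what the kernel does with the values.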

Why This Matters for a Code Model

The failure mode here is not theoretical. For C++ code packed into long training rows, slot-only metadata is common: we often know block boundaries from the chunker without materializing token-level counts in the loader. A quiet degradation to "zero valid tokens" on a fraction of those rows looks, in loss curves, like a subtle data-quality problem or a subtle learning-rate problem, depending on your priors. It does not look like a canonicalizer bug, which is what it is.

That is why backend eligibility is downstream of metadata semantics, not a substitute for them. A faster FA4, FlexAttention, or Pallas lane does not repair a canonicalizer that already collapsed "absent" into "present-zero"; it only makes the wrong contract execute on a different kernel. Backend promotion is evidence for execution, not a replacement for the validity rule itself.

Recommendation Hierarchy Before We Change the Kernel

The packed-row fix plus the clustered-sparse test kept the existing dense path honest. The next move is a post-attention gate, not a kernel rewrite. Our review of the sink/spike, gated-attention, streaming-sinks, and DynamicTanh papers produced an explicit recommendation order that lines up with what our backends can actually accept without breaking contracts:

  • First: an optional query-dependent sigmoid gate after the attention output, applied before c_proj. This addresses attention sinks more directly than bounded softcap squashing, stays backend-agnostic across FA3, CuTe-backed FA4, FlexAttention, Pallas, and Splash, and preserves the current qk_norm, qk_clip, and softcap logic instead of replacing them.
  • Second: instrumentation and a packed-doc audit. First-token attention mass, max and high-percentile hidden activations, prefix-vs-suffix usage on packed documents, and sink behavior per document rather than per row. Part of the observed bias likely comes from best_fit packing cropping document prefixes and oversampling document starts, which is a data question, not an attention-math question.
  • Third: DynamicTanh as a separate research track. Larger architectural blast radius, initialization sensitivity, and full train-dynamics impact make it wrong as the first production-facing mitigation step.
  • Fourth: sink-aware serving and KV-cache follow-up. Sparse decode can reduce KV reads; paged KV is the real route to storage savings; sink-window retention is a bounded serving heuristic, not a claim of long-dependency preservation for code.
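For the instrumentation item in the list above, "sink behavior per document rather than per row" could be measured along these lines. This is a hedged sketch: the function name is hypothetical, the attention probabilities are a toy nested list rather than a real kernel output, and it assumes every query may attend to the first token of its own document.

```python
from collections import defaultdict
from typing import Dict, List


def first_token_mass_by_doc(attn: List[List[float]], doc_ids: List[int]) -> Dict[int, float]:
    """Average attention mass that each document's queries put on that document's first token."""
    first_pos: Dict[int, int] = {}
    for pos, doc in enumerate(doc_ids):
        first_pos.setdefault(doc, pos)  # first occurrence marks the document start

    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for query_pos, doc in enumerate(doc_ids):
        totals[doc] += attn[query_pos][first_pos[doc]]
        counts[doc] += 1
    return {doc: totals[doc] / counts[doc] for doc in totals}


# Two packed documents of two tokens each; rows are queries, columns are keys.
doc_ids = [0, 0, 1, 1]
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.75, 0.25, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
mass = first_token_mass_by_doc(attn, doc_ids)
```

Measuring against each document's own first token, rather than position 0 of the row, keeps the packing effect and the attention-math effect from being mixed in one number.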

The gated-attention V1 spec is deliberately narrow: one config group (attn_output_gate, attn_output_gate_granularity="head", attn_output_gate_bias, attn_output_gate_init="identity_bias", attn_output_gate_log_stats). The gate initializes close to identity - weight near zero, bias positive so sigmoid(bias) is near 1 - so V1 does not destabilize existing checkpoints. The dense module gains a per-head c_gate linear and multiplies the per-head attention output by sigmoid(c_gate(x)) before flattening into c_proj. The sparse module mirrors the same math through a shared helper in the _full_attention path and the _finalize_sparse_output exit, so dense and sparse do not drift semantically. Checkpoint compatibility is explicit: gate parameters exist only when the feature is enabled, old checkpoints load silently with the feature off, and enabling the feature produces clean missing-key behavior when operators do it deliberately.
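The gate math is small enough to sketch without a framework. The class below is a pure-Python illustration, not the dense module: the class name, the exact-zero weight initialization, and the bias value of 4.0 are assumptions chosen to show the near-identity starting point.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class HeadOutputGate:
    """Per-head scalar sigmoid gate on attention output, initialized near identity."""

    def __init__(self, model_dim: int, num_heads: int, init_bias: float = 4.0):
        # Zero weights plus a positive bias give sigmoid(bias) close to 1 at init.
        self.weight = [[0.0] * model_dim for _ in range(num_heads)]
        self.bias = [init_bias] * num_heads

    def gates(self, x: List[float]) -> List[float]:
        # One query-dependent gate per head: sigmoid(w_h . x + b_h).
        return [
            sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.weight, self.bias)
        ]

    def apply(self, x: List[float], per_head_out: List[List[float]]) -> List[List[float]]:
        # Scale each head's output vector by its gate before the output projection.
        return [[g * v for v in head] for g, head in zip(self.gates(x), per_head_out)]


gate = HeadOutputGate(model_dim=4, num_heads=2)
x = [0.3, -1.2, 0.5, 2.0]
gate_values = gate.gates(x)
gated = gate.apply(x, [[1.0, 2.0], [3.0, 4.0]])
```

At initialization every gate sits at sigmoid(4.0), roughly 0.98, so the gated output is a slightly scaled copy of the ungated one; the gate only becomes query-dependent once training moves the weights away from zero.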

Structure-Aware Attention Integration

Longer-term, the attention-validity work feeds into the structure-aware attention plan. The thesis is short: code structure is mostly an offline fact. We already have tree-sitter chunks, clang call/type edges, and token-aligned AST metadata available in the enriched parquet contract; we just need to stop re-deriving weak versions of them online.

The proposed end-state attention contract for code has three parts: a local causal window, exact structural neighbors from an offline graph IR, and a small learned overflow budget for edges the graph did not capture. Execution stays block-friendly: semantic blocks derived from chunk_boundaries replace fixed 128-token blocks, fixed-K neighbor lists come from offline preprocessing, and the runtime layout fits existing block-sparse plumbing. The four-path incremental rollout sequences it as graph-bias first (relation-aware bias, no hard masking beyond causal/doc masks), then semantic block sparse (chunk-aligned blocks in place of fixed-size MoBA blocks), then offline sparse structure IR (fixed-K neighbors fed to the model, eliminating the online generic router for structure-driven layers), then a hybrid overflow router on top.
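At block granularity, the three-part contract amounts to a set union under a causal filter. The sketch below is a simplified illustration under stated assumptions: the function name is hypothetical, the neighbor table stands in for the offline graph IR, and the overflow list stands in for whatever a learned router would select.

```python
from typing import Dict, List


def allowed_key_blocks(
    query_block: int,
    local_window_blocks: int,
    neighbors: Dict[int, List[int]],
    overflow: List[int],
) -> List[int]:
    """Blocks a query block may attend to: local window, structural neighbors, overflow."""
    # Local causal window: the query block itself plus the blocks just before it.
    start = max(0, query_block - local_window_blocks + 1)
    local = set(range(start, query_block + 1))
    # Structural neighbors from the offline table, filtered to causal positions.
    structural = {b for b in neighbors.get(query_block, []) if b <= query_block}
    # Overflow budget: extra blocks, filtered the same way.
    extra = {b for b in overflow if b <= query_block}
    return sorted(local | structural | extra)


# Block 6 has offline neighbors 1, 3, and 9; block 9 is in the future and is dropped.
blocks = allowed_key_blocks(
    query_block=6,
    local_window_blocks=2,
    neighbors={6: [1, 3, 9]},
    overflow=[0],
)
```

Because the neighbor table is an optional input, a missing entry simply leaves the local window and overflow in place, which matches the "absent means degrade explicitly" rule rather than inventing edges.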

The required integration contract is strict: training consumes token IDs, token-aligned metadata, and precomputed sparse/graph IR. No text-to-token, no int-to-string-to-AST, no runtime tree-sitter or clang. Parquet rows used for training should already contain or cheaply derive input_ids, target_ids, doc_ids (or enough to reconstruct them from BOS), token_structure_ids, token_dep_levels, optional token_ast_depth, token_sibling_index, token_ast_node_type, token/chunk graph metadata (token_chunk_starts, token_chunk_ends, token_chunk_dep_levels, token_call_edges, token_type_edges), and, once available, the precomputed sparse structure IR itself. The preprocessing parser stack branches cleanly: tree-sitter v11 for the syntactic line, clang v12 for the semantic line, both converging on one token-only consumer contract with identical packed-row semantics.
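The "reconstruct them from BOS" clause is cheap to implement when the row is well-formed. The following is a minimal sketch, assuming each document starts with exactly one BOS token and the BOS id never appears mid-document; the function name and the BOS id of 1 are hypothetical.

```python
from typing import List


def doc_ids_from_bos(input_ids: List[int], bos_id: int) -> List[int]:
    """Assign a per-token document index by counting BOS tokens along the row."""
    doc_ids: List[int] = []
    current = 0
    for pos, token in enumerate(input_ids):
        # A BOS after position 0 opens a new document; a leading BOS opens document 0.
        if token == bos_id and pos > 0:
            current += 1
        doc_ids.append(current)
    return doc_ids


BOS = 1  # hypothetical BOS token id
full_row = doc_ids_from_bos([BOS, 7, 8, BOS, 9], BOS)
cropped_row = doc_ids_from_bos([7, 8, BOS, 9], BOS)
```

The second call covers a row whose first document had its prefix cropped by packing: the leading tokens still get document 0 even though no BOS precedes them.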

Where the Validity Rule Meets the Structure Plan

The packed-row fix is the reason the structure plan can move forward without relitigating basic metadata semantics. Offline graph IR, semantic blocks, and fixed-K neighbor lists all expand the set of optional tensors each batch can carry. If the canonicalizer's old "zero-fill missing" rule were still in place, every new optional tensor would be one more silent zero-equals-absent trap waiting to happen at scale. With the absence-preserving rule enforced and tested, we can add structural fields to the contract confidently: present and populated means use the graph, absent means degrade to a lower-signal contract explicitly, and present-zero is reserved for the case where the preprocessing actually emitted zero.

Two immediate next moves follow from this. The gated-attention V1 ablation runs on the current dense and sparse paths without touching validity, to separate the architecture effect from the packing effect. In parallel, the structure pipeline produces its first fixed-K neighbor IR per chunk, with explicit presence/absence semantics, loaded through the same canonicalization discipline. By the time the semantic-block sparse path lands, the validity contract has already been proven to survive two new optional-metadata introductions without quiet regressions. That is the only scaffolding that makes the structure-aware attention integration safe enough to run.

The Short Version

A canonicalizer that zero-filled missing attention-validity fields silently converted "absent" into "zero valid tokens" for slot-only packed rows. The fix preserves absence; the regression tests pin it; the clustered-sparse follow-up pins the explicit fallback to mode="none" when slot_prefix metadata and query_tile_size do not agree and no token_prefix exists. The durable rule - present, absent, and present-zero are three distinct states and none of them may be invented by shape canonicalization - is now enforced across the CUDA+FSDP and XLA paths. On that foundation, the gated-attention V1 work and the structure-aware attention integration can proceed without paying rent to a metadata ambiguity we have already fixed.


Validity states at a glance

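| State            | Meaning                       | What the kernel must do              |
|------------------|-------------------------------|--------------------------------------|
| Present, nonzero | explicit valid-token count    | use as-is                            |
| Present, zero    | row is intentionally empty    | skip row, do not synthesise          |
| Absent           | metadata omitted for this row | fall back via contract, never invent |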

# Fallback rule applied when the validity contract is resolved
# (fragment from inside the resolver function).
if token_prefix is None and not slot_prefix_matches(query_tile_size):
    return ValidityMode.NONE  # explicit absence, not zero-fill

FAQ

Frequently asked questions

Why is "absent" different from zero for token-prefix metadata?
Zero is a real instruction that the row has no valid tokens. Absent means the loader did not provide token-prefix metadata, so the runtime must choose an explicit fallback instead of inventing an empty row.
Does post-attention gating replace validity checks?
No. A post-attention gate can dampen sink-like behavior after the attention result is computed, but it cannot recover rows that were masked incorrectly before the kernel ran. Validity metadata still has to preserve the difference between absent, present, and present-zero before any backend or architectural mitigation gets involved.
Where is the checked-in prefix example?
Attention-validity prefix sample shows the three states directly: token_prefix, slot_prefix with base_block_tokens, and mode="none".
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

AttentionValidity

The validity carrier built from row-level counts or masks so sparse or structured attention paths know which token prefix is real without re-inferring it inside the compiled region.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

CuTe

CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.

Pallas

JAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.

Splash

The stable TPU attention family used for dense or local-mask lanes before MegaCpp drops to narrower planner-driven sparse contracts.

Attention sinks

The long-context failure mode where a few tokens, often the first token, absorb disproportionate attention mass and behave like a null-attention valve.

doc_ids

The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

Paged attention

The decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.