MegaCpp Engineering: Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published · 10 min read · David Gornshtein
Long Context
Yarn
RNoPE
Attention Sinks
Gated Attention
RoPE
C++

Long context and attention sinks: what actually held up past 16K

YaRN, RNoPE, packed-document masking, attention sinks, massive activations, and query-dependent output gating: a field report on which long-context techniques survived contact with the MegaCpp C++ corpus.



The MegaCpp specialists have to read repositories, not snippets. A realistic training sample from our v4 context-graph sampler is a Callers -> Target -> Callees bundle that reaches 64K tokens of C++ before the budget cuts in; a realistic inference sample is a translation unit plus its headers and a stack trace, which comfortably exceeds 32K. That puts us in the bucket of "every long-context paper applies in principle, none of them apply cleanly". This post is a field report on which long-context techniques survived contact with the C++ corpus, what broke in ways that did not show up on the short-context ablations, and which mitigations we actually kept past 16K.

The two axes we care about

Long-context work on our stack breaks into two non-overlapping axes that are easy to confuse. The first is positional extrapolation: the model was pretrained at one sequence length and we want it to behave reasonably at a much longer one. YaRN, RNoPE, RoPE theta scaling, and a context-length curriculum all live on this axis. The second is numerical stability at long range: attention sinks, massive activations, first-token bias, packed-document prefix bias. These are properties of what the model actually computes once the positions are correct, and they are orthogonal to whether the positional encoding extrapolates. We hit both axes in the same training cycle and it took us a while to stop treating them as the same problem.

Positional extrapolation: YaRN, RNoPE, and the parts that worked

The design target is 128K tokens today with a clear path to 1M. The FLOP math dictates that we cannot just pretrain at 1M. A single 1M-token sample through a 4B model is enough to stall any single-node training budget for a week, and the KV-cache alone at 1M tokens and standard GQA is comfortably north of 130 GiB before MLA compression. KV cache and paged attention matter on the serving side even when the training story is mostly curriculum and masking. The industry pattern is to pretrain short and extend through a staged curriculum, and that is what we do.
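The KV-cache figure above is easy to sanity-check with back-of-envelope arithmetic. The sketch below is only an illustration: the layer count, KV-head count, head dimension, and bf16 storage are hypothetical values chosen to show the order of magnitude, not the actual specialist configuration.

```python
def kv_cache_gib(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of a dense K+V cache for one sequence, in GiB."""
    # The factor 2 covers the separate K and V tensors kept per layer.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token_bytes / 2**30


# Hypothetical GQA layout (assumption, not the real model config).
HYPOTHETICAL = dict(n_layers=36, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

gib_at_1m = kv_cache_gib(2**20, **HYPOTHETICAL)
gib_at_128k = kv_cache_gib(128 * 1024, **HYPOTHETICAL)
print(f"1M tokens: {gib_at_1m:.1f} GiB, 128K tokens: {gib_at_128k:.1f} GiB")
```

Under these assumed dimensions a single 1M-token sequence needs 144 GiB of dense K/V storage, which is why latent compression and paging matter on the serving side.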

The concrete schedule on the specialists is four stages: 8K pretrain (the bulk of tokens), then LCFT-1 at 32K (YaRN scale 4), LCFT-2 at 128K (YaRN scale 16), and LCFT-3 at 512K (YaRN scale 64). Each stage is short relative to the pretrain budget (roughly 140h : 8h : 8h : 4h of wall-clock for the 4B backbone), and each stage consumes training samples built from concatenated v4 context graphs sized to the stage's target length.
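The YaRN scale in each stage is simply the ratio of the stage's target length to the 8K pretrain length. A minimal sketch of that bookkeeping follows; the `STAGES` structure and the `yarn_scale` helper are hypothetical names for illustration, not the real training config.

```python
PRETRAIN_LEN = 8 * 1024  # 8K pretrain context from the schedule above

# (stage name, target context length, approximate wall-clock hours)
STAGES = [
    ("pretrain", 8 * 1024, 140),
    ("LCFT-1", 32 * 1024, 8),
    ("LCFT-2", 128 * 1024, 8),
    ("LCFT-3", 512 * 1024, 4),
]


def yarn_scale(target_len, pretrain_len=PRETRAIN_LEN):
    """YaRN scale factor as the ratio of target to pretrain context."""
    if target_len % pretrain_len:
        raise ValueError("target length should be a multiple of the pretrain length")
    return target_len // pretrain_len


schedule = {name: yarn_scale(length) for name, length, _ in STAGES}
lcft_hours = sum(hours for name, _, hours in STAGES if name != "pretrain")
print(schedule, lcft_hours)
```

The three extension stages together take roughly 20 hours against about 140 hours of pretraining, which is the sense in which each stage is short relative to the pretrain budget.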

YaRN is the positional scaling we kept. The core of the algorithm is the frequency-band split: the high RoPE frequencies encode local position and extrapolate cleanly, the low frequencies encode global position and must be interpolated down when the max sequence length grows, and the middle band is smoothed. yarn_find_correction_range computes the two band edges from beta_fast and beta_slow, and yarn_get_mscale returns 0.1 * log(scale) + 1.0 for scale > 1, which is the attention-logit rescaling that keeps softmax behaviour consistent with the original training temperature after the frequencies change. This is plain RoPE with a frequency-aware twist: it is checkpoint-compatible with the short-context pretrain (the weights are unchanged, only the positional embedding computation changes), and it does not require a separate training phase to activate. It does require the context-length curriculum to activate well, which is the next-biggest lesson from this cycle.
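The band split can be sketched in pure Python. This follows the structure of the public YaRN reference implementation rather than the MegaCpp code itself; the head dimension of 128, RoPE base of 10000, and the beta_fast=32 / beta_slow=1 defaults are assumptions used for illustration, and `yarn_inv_freq` and `yarn_find_correction_dim` are helper names not confirmed by the article.

```python
import math


def yarn_find_correction_dim(num_rotations, dim, base, orig_max_pos):
    """Rotary dimension index whose wavelength completes `num_rotations`
    turns over the original context window."""
    return (dim * math.log(orig_max_pos / (num_rotations * 2 * math.pi))) / (
        2 * math.log(base)
    )


def yarn_find_correction_range(beta_fast, beta_slow, dim, base, orig_max_pos):
    """Band edges: below `low` keep RoPE as is, above `high` interpolate."""
    low = math.floor(yarn_find_correction_dim(beta_fast, dim, base, orig_max_pos))
    high = math.ceil(yarn_find_correction_dim(beta_slow, dim, base, orig_max_pos))
    return max(low, 0), min(high, dim - 1)


def yarn_get_mscale(scale=1.0):
    """Attention-logit rescaling applied once the frequencies change."""
    if scale <= 1:
        return 1.0
    return 0.1 * math.log(scale) + 1.0


def yarn_inv_freq(dim, base, scale, orig_max_pos, beta_fast=32, beta_slow=1):
    """Blend original and interpolated inverse frequencies per rotary pair."""
    low, high = yarn_find_correction_range(beta_fast, beta_slow, dim, base, orig_max_pos)
    if low == high:
        high += 0.001  # avoid a zero-width ramp
    inv_freq = []
    for i in range(dim // 2):
        extrapolated = 1.0 / (base ** (2 * i / dim))  # unchanged RoPE frequency
        interpolated = extrapolated / scale           # position-interpolated frequency
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        keep = 1.0 - ramp  # 1 keeps the original frequency, 0 fully interpolates
        inv_freq.append(interpolated * (1.0 - keep) + extrapolated * keep)
    return inv_freq


# Illustrative call for the 8K -> 32K stage (scale 4).
inv_freq = yarn_inv_freq(dim=128, base=10000.0, scale=4, orig_max_pos=8192)
mscale = yarn_get_mscale(4)
```

With these assumed values the highest-frequency pair is left untouched, the lowest-frequency pair is divided by the scale, and pairs between the two band edges are linearly blended.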

RNoPE (the hybrid RoPE/NoPE scheme where a fraction of the heads do not receive a positional encoding at all) is on our roadmap but not in the specialists as shipped. The short-context ablations showed that RNoPE interacts badly with our packed-document training regime until intra-document masking is correct, and we fixed the masking first. The non-positional heads are supposed to provide a positionless "retrieval" path that the YaRN-scaled positional heads cannot; without clean document masking, those heads learn spurious cross-document retrievals instead. Once masking is solid the RNoPE ablation becomes meaningful, and that is the next step on the extrapolation axis.

The three things that actually moved long-context accuracy past 16K were not glamorous. They were: fixing intra-document masking, switching from best_fit packing to a packing policy that does not crop document prefixes, and extending the context curriculum to actually reach the target length. YaRN by itself on the pretrain checkpoint did not clear any of our long-context evals; YaRN on top of a correctly curriculum-extended model did. That is why document masking and curriculum, and packed rows as the real training contract, matter as much as the positional recipe here.

Packed-document masking, which turned out to be load-bearing

The short version: a transformer trained on packed sequences that ignores document boundaries learns to attend across documents, and a transformer that does that looks fine at 4K and degrades badly at 32K. The long version is worse, because our original packing policy was best_fit: pack the largest document first, fill remaining slack with the next-largest that fits, repeat. That policy systematically crops document prefixes: the first 4K tokens of a 10K-token document land in one packed row, and the remaining 6K either get cropped or land in a different row. A model trained on that distribution oversamples document starts and undertrains document interiors, which is exactly the kind of bias that surfaces as "the model has a weird affinity for line 1 of any file" at eval time.

The fix is two changes. First, an intra-document additive mask that blocks attention from a token in one packed document to tokens in any other document in the same row. This is straightforward to build from a per-row doc_ids tensor and applies cleanly to FA3, FlexAttention, and the manual SDPA fallback; we carry the mask through the same sequence-parallel path as the causal mask. Second, a packing policy that respects document boundaries and does not crop prefixes: either bin-packing variants that take the full document as an atomic unit and accept the resulting packing inefficiency, or a continuous-packing scheme with explicit boundary markers that the masking layer reads. We run the latter in production. The document-level eval metric that was silently regressing on best_fit recovered as soon as prefixes stopped getting cropped; the long-context eval that had been blamed on YaRN turned out to be a masking bug in disguise.
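Both changes can be sketched together on toy data. The packer below is the atomic bin-packing variant (first-fit decreasing), not the continuous-packing scheme used in production, and the function names are hypothetical. The mask is written as nested Python lists for readability; a real implementation would build it as a tensor.

```python
NEG_INF = float("-inf")


def pack_atomic(doc_lengths, row_len):
    """First-fit-decreasing packing where every document stays whole.

    Returns (rows, oversized): rows is a list of [(doc_index, length), ...];
    documents longer than row_len are reported instead of being cropped.
    """
    rows, free, oversized = [], [], []
    order = sorted(range(len(doc_lengths)), key=lambda i: -doc_lengths[i])
    for i in order:
        n = doc_lengths[i]
        if n > row_len:
            oversized.append(i)
            continue
        for r, slack in enumerate(free):
            if n <= slack:
                rows[r].append((i, n))
                free[r] -= n
                break
        else:  # no existing row has room, so open a new one
            rows.append([(i, n)])
            free.append(row_len - n)
    return rows, oversized


def row_doc_ids(row):
    """Expand [(doc_index, length), ...] into one document id per token."""
    return [doc for doc, n in row for _ in range(n)]


def intra_doc_causal_mask(doc_ids):
    """Additive mask: 0.0 where query i may attend key j, -inf elsewhere."""
    n = len(doc_ids)
    return [
        [0.0 if j <= i and doc_ids[i] == doc_ids[j] else NEG_INF for j in range(n)]
        for i in range(n)
    ]


rows, oversized = pack_atomic([5, 3, 2, 9], row_len=8)
doc_ids = row_doc_ids(rows[0])
mask = intra_doc_causal_mask(doc_ids)
```

In this toy run the 9-token document is reported as oversized rather than cropped, and the first token of the second document in row 0 cannot see any token of the first document.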

Attention sinks, massive activations, and the gated-attention RFC

The other axis is numerical. The short version of the phenomenon, from the spike/sparse/sink literature, is that pre-norm transformers trained on long sequences develop two related pathologies. The first is the attention sink: a few tokens (often the first token) receive a disproportionate fraction of attention mass, regardless of content, as a kind of "null attention" valve. The second is massive activations: a few hidden units in a few tokens grow to magnitudes orders of magnitude larger than the rest of the tensor, which is the activation-space counterpart of the sink behaviour in attention-space. They are related but not identical; fixing one does not automatically fix the other.

The mitigations we evaluated fall into four buckets.

Static controls (already in mainline): qk_norm, qk_clip_threshold with an epsilon, the attn_softcap on attention logits, and the output_softcap on the LM head. These are cheap and they compose with everything else. qk_norm is the one we are explicitly careful about: the spike paper is fine with it but the long-context ablations in our own code showed that qk_norm interacts with long-range positional retrieval, and we keep it off on heads whose long-context eval suffers from it. Softcaps stay on everywhere; they are a clean guard against logit explosion and they compose with FA3 through the kernel's softcap argument.
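To make the softcap guard concrete, here is a minimal sketch assuming the common scaled-tanh form of logit softcapping. Whether attn_softcap and output_softcap use exactly this form is an assumption, and the cap value of 50 is purely illustrative.

```python
import math


def softcap(logits, cap):
    """Smoothly bound each logit to [-cap, cap] with a scaled tanh."""
    return [cap * math.tanh(x / cap) for x in logits]


raw = [0.0, 1.0, -3.0, 1000.0, -1000.0]
capped = softcap(raw, cap=50.0)
print(capped)
```

Small logits pass through almost unchanged while extreme ones saturate at the cap, which is why the control is cheap to leave on everywhere.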

Streaming sinks (StreamingLLM-style): retain the first K "sink" tokens in the KV-cache forever plus a sliding window of the most recent tokens. This bounds decode memory growth and is useful as a serving heuristic. It is not a substitute for long-range recall; pinning tokens 0..3 plus the last 2048 does not let the model answer a question whose answer lives at token 8000. For our packed documents the sink policy also has to be document-relative, not row-global: a sink token in the middle of a row from the previous document is not a sink for the current document. We kept the document-relative variant for serving only, and we explicitly do not claim it as a long-context solution.
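The document-relative retention rule can be sketched as a pure function over cache positions. The name `retained_positions` and its parameters are hypothetical; the defaults mirror the 4-sink, 2048-token window used as the example above.

```python
def retained_positions(doc_start, cur_pos, n_sink=4, window=2048):
    """Absolute cache positions kept when decoding the token at `cur_pos`.

    Sinks are counted from `doc_start` (document-relative), not from
    position 0 of the packed row.
    """
    sinks = range(doc_start, min(doc_start + n_sink, cur_pos + 1))
    recent_start = max(doc_start, cur_pos + 1 - window)
    recent = range(recent_start, cur_pos + 1)
    return sorted(set(sinks) | set(recent))


# A single document starting at position 0, decoding token 15999.
kept = retained_positions(doc_start=0, cur_pos=15_999)

# A document that starts mid-row at position 5000.
kept_mid_row = retained_positions(doc_start=5_000, cur_pos=9_000)
```

In the first case position 8000 is simply gone from the cache, which is the concrete reason this policy bounds memory but cannot provide long-range recall. In the second case the sinks are positions 5000 to 5003, and nothing from the previous document is retained.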

Post-attention output gating (from the Qwen gated-attention paper). This is the mitigation we moved first on. A query-dependent sigmoid gate multiplies the attention output before the c_proj projection; the gate is computed from the query state, stays outside the attention kernel contract, and is checkpoint-compatible because it collapses to identity when the gate weights are initialised to produce ~1. It addresses sinks more directly than pure bounded squashing, it applies uniformly to dense and sparse attention paths (including our DSA indexer), and it does not require any kernel changes. The decision in our attention-sink-mitigation RFC was to ship this as the first-line mitigation and measure before touching anything else.
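A per-token sketch of the gate, written with plain lists so the arithmetic is visible. The zero-weight, positive-bias initialisation in `near_identity_gate` is one hypothetical way to get a gate of about 1; the article only states that the gate weights are initialised to produce ~1, so the bias value and helper names here are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention_output(attn_out, query_state, gate_w, gate_b):
    """Multiply each attention-output channel by a query-dependent sigmoid gate.

    attn_out:    attention output for one token, before c_proj
    query_state: query-side vector the gate is computed from
    gate_w:      weights of shape (len(attn_out), len(query_state))
    gate_b:      biases of shape (len(attn_out),)
    """
    gated = []
    for out_c, w_row, b in zip(attn_out, gate_w, gate_b):
        logit = sum(w * q for w, q in zip(w_row, query_state)) + b
        gated.append(out_c * sigmoid(logit))
    return gated


def near_identity_gate(n_out, n_in, bias=6.0):
    """Zero weights plus a positive bias give a gate of sigmoid(bias), close to 1."""
    return [[0.0] * n_in for _ in range(n_out)], [bias] * n_out


attn_out = [1.0, -2.0, 0.5]
query_state = [0.3, -0.7]
gate_w, gate_b = near_identity_gate(n_out=3, n_in=2)
gated = gated_attention_output(attn_out, query_state, gate_w, gate_b)
```

Because the gate runs on the attention output rather than inside the softmax, the same function applies regardless of which attention backend produced `attn_out`.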

DynamicTanh (from the "Transformers without Normalization" paper). Tempting because it simultaneously addresses massive activations and removes a whole class of norm layers, but the blast radius is enormous: normalization replacement changes full training dynamics, not just attention routing, and initialisation sensitivity is real. We kept it in a separate research track. If gated attention closes the measured gap on first-token mass and outlier percentiles, DynamicTanh becomes unnecessary; if it does not, DynamicTanh is the next thing to try, not the first.
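For reference, the DynamicTanh layer replaces a norm with an elementwise gamma * tanh(alpha * x) + beta. The sketch below shows only the forward arithmetic; the alpha, gamma, and beta values are illustrative, not tuned settings from our research track.

```python
import math


def dynamic_tanh(x, alpha, gamma, beta):
    """Elementwise DyT: a learnable scalar alpha inside tanh, then a per-channel affine."""
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]


hidden = [0.1, -0.4, 10_000.0]  # last channel mimics a massive activation
out = dynamic_tanh(hidden, alpha=0.5, gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0])
print(out)
```

The outlier channel saturates at the magnitude of its gamma, which is the mechanism behind the appeal for massive activations, and also why it changes training dynamics well beyond attention routing.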

The public papers line up with this split. YaRN is a positional extrapolation recipe for RoPE models, StreamingLLM is a decode-time sink-plus-window policy for stable streaming, and the recent gated-attention paper treats a post-SDPA sigmoid gate as a direct sink mitigation rather than another positional trick. That is why we treat extrapolation, sink control, and packing discipline as separate knobs instead of collapsing them into one "long context" feature.

The instrumentation that forced these decisions into an order was the other surviving piece of the RFC. Before any mitigation ships, we measure: first-token attention mass on a fixed eval prompt set, max and high-percentile hidden activations per layer, prefix-versus-suffix attention usage on packed documents, and sink behaviour per document rather than per packed row. A mitigation that does not move at least one of those numbers is not a mitigation, it is a refactor, and we do not ship it as the former.
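Two of these measurements can be sketched on toy data: first-token attention mass computed per document rather than per packed row, and a high-percentile statistic over activation magnitudes. The function names, the nearest-rank percentile, and the choice to skip each document's first query are assumptions for illustration, not the RFC's exact definitions.

```python
import math


def first_token_mass_per_doc(attn_probs, doc_ids):
    """Mean attention mass a document's queries place on that document's first token."""
    first_pos = {}
    for pos, doc in enumerate(doc_ids):
        first_pos.setdefault(doc, pos)
    totals, counts = {}, {}
    for q, doc in enumerate(doc_ids):
        if q == first_pos[doc]:
            continue  # the first token can only attend to itself under a causal mask
        totals[doc] = totals.get(doc, 0.0) + attn_probs[q][first_pos[doc]]
        counts[doc] = counts.get(doc, 0) + 1
    return {doc: totals[doc] / counts[doc] for doc in totals}


def activation_percentile(values, pct):
    """Nearest-rank percentile of absolute activation magnitudes."""
    mags = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(pct / 100 * len(mags)))
    return mags[rank - 1]


# One head, one packed row holding two documents (rows are queries).
doc_ids = [0, 0, 0, 1, 1]
attn_probs = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0, 0.0],
    [0.8, 0.1, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5],
]
sink_mass = first_token_mass_per_doc(attn_probs, doc_ids)
p99 = activation_percentile([0.1, -0.2, 0.3, 50.0], 99)
```

Keying the statistic by document means the second document's first token is measured as its own potential sink instead of being averaged into a row-global number.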

What actually held up past 16K

On the 4B specialist backbone evaluated at 4K / 16K / 32K / 64K on a code-reasoning harness built from repository-level context graphs:

Below 16K, nothing we changed moved the number meaningfully - well-behaved RoPE and reasonable packing already cover that range. At 16K the intra-document masking fix and the packing policy change were the dominant improvements: the model stopped spuriously attending across concatenated training documents, document-interior accuracy recovered, and a family of cross-translation-unit hallucinations dropped out. YaRN was present but not load-bearing here.

At 32K and 64K, YaRN with the correct frequency-band correction was load-bearing. Without it, precision on long-range signature matching ("the exact signature of Buffer::append declared 30K tokens ago") collapsed across every eval we had. With it, long-context accuracy tracked the 16K eval within single-digit points, the best result we have on this axis. Output gating was net positive at this length; the outlier-channel mass moved off the first token and onto the gate.

At 128K the picture is less complete. MLA weight absorption and paged KV cache make 128K tractable on a single B200 for serving; whether the model actually uses the full window for retrieval, versus relying on the streaming-sink slice plus recent context, is a question the current eval set does not cleanly separate. A middle-of-context retrieval eval is on the roadmap.

We do not claim streaming sinks plus a recent window are a long-context solution; they are a decode-memory heuristic that composes with everything above. We also do not claim any of this transfers to the Mamba-3 M-blocks, which have their own recurrence dynamics; this post is about the minority of attention blocks (MLA and DSA) in the hybrid stack. In practice that keeps the long-context story tied to the SLM architecture.

Long-context techniques: what held up

| Technique | Window | Status on the C++ corpus | Notes |
| --- | --- | --- | --- |
| YaRN rope scaling | up to 32K | kept | clean extrapolation past 16K once attn_factor is tuned |
| RNoPE (rotary skip-band) | 16K-32K | kept on a subset of layers | helps long-doc recall, costs a few MFU points |
| Packed-document masking | any | kept | cu_seqlens boundaries, document-aware attention |
| Attention sinks (first 4 tokens) | any | kept on inference path | lower training-time perplexity drift past 8K |
| Massive activations (handful of channels) | any | tracked, not actively suppressed | telemetry only; intervention regressed loss |
| Query-dependent output gating | any | kept | small win, composes with gated attention |

The packed-mask shape we feed Flash Attention:

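```python
from flash_attn import flash_attn_varlen_func

# q, k, v: packed (total_tokens, n_heads, head_dim) tensors for the row batch.
# cu_seqlens: int32 cumulative document boundaries, shape (num_docs + 1,).
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max_len, max_seqlen_k=max_len,
    causal=True, softcap=softcap_value,
)
```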
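The cu_seqlens and max_len arguments in that call can be derived from the per-token doc_ids of a packed row. A minimal sketch follows, assuming documents are contiguous within the row; `cu_seqlens_from_doc_ids` is a hypothetical helper, and a real pipeline would convert the result to an int32 tensor on the device before passing it to the kernel.

```python
def cu_seqlens_from_doc_ids(doc_ids):
    """Cumulative sequence boundaries for a packed row with contiguous documents."""
    if not doc_ids:
        return [0]
    cu = [0]
    for pos in range(1, len(doc_ids)):
        if doc_ids[pos] != doc_ids[pos - 1]:
            cu.append(pos)  # a new document starts at this token
    cu.append(len(doc_ids))
    return cu


doc_ids = [7, 7, 7, 9, 9, 4]
cu_seqlens_list = cu_seqlens_from_doc_ids(doc_ids)
# Longest document in the row, used for max_seqlen_q / max_seqlen_k.
max_len_value = max(b - a for a, b in zip(cu_seqlens_list, cu_seqlens_list[1:]))
```

With causal=True and these boundaries, the varlen kernel applies causality within each document and never attends across a boundary, which is the same contract the additive mask enforces on the SDPA fallback.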
FAQ

Frequently asked questions

Is YaRN enough by itself to make a short-context checkpoint work at 32K or 64K?
No. In these runs, YaRN mattered only after masking, prefix-preserving packing, and the staged context curriculum were already correct. If packed rows still leak across document boundaries, fix document masking and curriculum before blaming positional scaling.
Are attention sinks a replacement for real long-range retrieval?
No. A sink policy is a decode-memory and stability heuristic. Keeping the first few sink tokens plus a recent window can help serving, but it does not prove the model can retrieve a dependency that lives tens of thousands of tokens back in the middle of the sequence.
Where do document-relative sink tokens actually belong?
On the serving side, after the cache policy knows which request or document owns the retained prefix. A packed training row should not promote row-global tokens into universal sinks; it should preserve boundaries through masks and cu_seqlens. During iterative decode, a document-relative sink window can then compose with the recent-token window and the KV-cache policy described in KV cache and paged attention. If you are measuring the effect on TPU, keep the sink summary fixed-shape as described in attention sinks and telemetry on TPU.
What should be checked first when long-context evals collapse?
Check the boundary contracts before the positional story: packed-document masking, prefix-preserving packing, and whether the curriculum ever trained the target length. In this lane those fixes moved the metric earlier than any more exotic sink or gating tweak.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

YaRN

A RoPE extrapolation recipe that rescales the low-frequency rotary bands and applies a small log-based attention correction so a short-context checkpoint can survive longer windows.

RNoPE

A hybrid RoPE/NoPE layout where some heads keep rotary position while a subset stay positionless for retrieval-style matching once masking is trustworthy.

Attention sinks

The long-context failure mode where a few tokens, often the first token, absorb disproportionate attention mass and behave like a null-attention valve.

Gated attention

A query-dependent sigmoid gate on the post-attention output, applied before c_proj so sink mitigation stays backend-agnostic instead of rewriting the attention kernel.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

doc_ids

The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.