MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
David Gornshtein · 8 min read
Tags: Curriculum, Doc Masking, Long Context, Training, C++

Document masking and the curriculum: what to feed each specialist first

Why MegaCpp masks documents inside packed sequences, how the four-phase curriculum runs from 4K syntax to 64K repository graphs, and what the ablations told us about the right starting diet for each specialist.

If you train a code model long enough on packed sequences without document masking, it will eventually learn the wrong lesson: that the function in this file might secretly know about the class three documents back in the same packed row. At 4K context that is a noise problem; at 64K it is a correctness problem. This post explains why we mask documents end to end at MegaCpp, how the resulting four-phase curriculum is shaped, and what the ablations told us about the order in which different specialists should see different kinds of context.

The masking contract only makes sense alongside "Packed rows as the real training contract" and "Building the C++ training data pipeline", because packing policy, masking policy, and promotion policy have to agree on where a document begins and ends.

For a first read, the important terms are simple. A packed row is one fixed-length sequence assembled from one or more source documents. doc_ids are the per-token document identifiers that preserve which packed tokens came from the same original document. segment_ids are the segment labels some backends derive from those document boundaries so kernels can enforce the same mask contract without guessing resets. loss_mask is the per-token training mask that says which positions should contribute to loss. The curriculum is the staged schedule that decides which context lengths and dataset families a specialist sees first. The local proof surfaces are Document-mask segment IDs sample, Packed rows schema sample, Packed row builder example, and Tokenized enriched pipeline on TPU.

Why mask documents at all

Sequence packing is non-negotiable on hardware that bills you per second. We pack multiple short documents into a single fixed-length training sequence to push Model FLOPs Utilization (MFU) up instead of spending compute on explicit pad tokens. The catch is that vanilla causal self-attention, given a packed row of [doc A | doc B | doc C], lets every token in B attend to every token in A. There is no architectural reason it should not: the tokens are just earlier in the same sequence.
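To make the packing step concrete, here is a minimal sketch of a greedy packer. It is an illustration under stated assumptions, not the MegaCpp row builder: `pack_documents`, `BOS_ID`, `PAD_ID`, the greedy policy, and the truncation of over-long documents are all hypothetical choices made for the example.

```python
from typing import List

BOS_ID = 1  # hypothetical BOS token id, chosen for the example
PAD_ID = 0  # hypothetical pad token id, chosen for the example

def pack_documents(docs: List[List[int]], row_len: int) -> List[List[int]]:
    """Greedily pack tokenized documents into fixed-length rows.

    Every document gets a leading BOS. A document that does not fit in the
    space left in the current row starts a new row. Documents longer than
    a row are truncated, purely to keep the sketch short.
    """
    rows: List[List[int]] = []
    current: List[int] = []
    for doc in docs:
        tokens = ([BOS_ID] + doc)[:row_len]
        if len(current) + len(tokens) > row_len:
            # Close the current row, padding only the leftover tail.
            rows.append(current + [PAD_ID] * (row_len - len(current)))
            current = []
        current.extend(tokens)
    if current:
        rows.append(current + [PAD_ID] * (row_len - len(current)))
    return rows

rows = pack_documents([[5, 6, 7], [8, 9], [10, 11, 12, 13]], row_len=8)
# rows[0] holds two documents back to back, which is exactly the situation
# where plain causal attention would let the second read the first.
```

In this toy run the first row carries two unrelated documents and a single pad token, where unpacked rows would have spent most of their length on padding.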

On 4K pretraining the cost of this contamination is small. On 16K and 64K it is not. A long row commonly carries unrelated documents from unrelated repositories, and unmasked attention will teach the model that any function might depend on any other function in the same packed window. That is the opposite of repository-level reasoning.

The mechanism is straightforward in concept. Every document gets a leading BOS token, and a doc_ids tensor is computed at the start of GPT.forward() as a cumulative sum over BOS positions in input_ids. Two tokens with the same doc_id may attend to each other causally; two tokens with different doc_ids may not.
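The rule above can be sketched in a few lines of plain Python. This is a dense boolean mask for illustration only; production backends typically never materialize it. `BOS_ID` and both function names are assumptions made for the example.

```python
from itertools import accumulate
from typing import List

BOS_ID = 1  # hypothetical BOS token id, chosen for the example

def doc_ids_from_bos(input_ids: List[int]) -> List[int]:
    """Running count of BOS tokens: every token is labeled with its document."""
    return list(accumulate(int(t == BOS_ID) for t in input_ids))

def document_causal_mask(doc_ids: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when query q may attend to key k.

    Two conditions must both hold: k is not in the future (causal),
    and k belongs to the same document as q.
    """
    n = len(doc_ids)
    return [[k <= q and doc_ids[k] == doc_ids[q] for k in range(n)]
            for q in range(n)]

input_ids = [1, 5, 6, 1, 8, 9]           # two packed documents
doc_ids = doc_ids_from_bos(input_ids)    # [1, 1, 1, 2, 2, 2]
mask = document_causal_mask(doc_ids)
```

Row 3 of `mask` is the first token of the second document: it can see only itself, even though three earlier tokens sit in the same packed row.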

In the public samples, doc_ids can either be derived at runtime from the packed tokens or carried as an explicit audit/debug column in the row schema. The invariant that matters is the same in both cases: boundary information stays stable enough that masking, packing, and loader behavior agree. Document-mask segment IDs sample is the checked-in receipt for the doc_ids -> segment_ids boundary conversion. That keeps the row contract aligned with SLM data: boundary information stays derivable from the same packed example the loader already consumes. The runtime cost side of the same boundary story is covered in dataloader throughput and stalls.

That boundary is enforced differently by different backends. Attention helpers usually consume segment_ids or an equivalent mask-facing form, while stateful mixers consume the same boundary as an explicit reset or carry rule. The public samples keep that split visible on purpose: one row contract can feed both an attention mask and a state-reset path without pretending every backend wants the same tensor.
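As a rough illustration of that split, the sketch below derives two backend-facing forms from the same doc_ids. The exact conversion used in the checked-in sample is not reproduced in this article, so the remapping rule here (contiguous 0-based numbers in order of first appearance) and the names `to_segment_ids` and `reset_flags` are assumptions.

```python
from typing import Dict, List

def to_segment_ids(doc_ids: List[int]) -> List[int]:
    """Remap arbitrary document identifiers to contiguous 0-based segment
    numbers, assigned in order of first appearance within the row."""
    mapping: Dict[int, int] = {}
    out: List[int] = []
    for d in doc_ids:
        if d not in mapping:
            mapping[d] = len(mapping)
        out.append(mapping[d])
    return out

def reset_flags(doc_ids: List[int]) -> List[bool]:
    """True where a token starts a new document, i.e. where a stateful
    mixer should drop whatever state it carried from the previous token."""
    return [i == 0 or doc_ids[i] != doc_ids[i - 1]
            for i in range(len(doc_ids))]

doc_ids = [17, 17, 17, 42, 42, 99]       # raw identifiers for one packed row
segment_ids = to_segment_ids(doc_ids)    # mask-facing form
resets = reset_flags(doc_ids)            # state-reset form
```

Both outputs encode the same three boundaries; only the shape handed to the backend differs.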

End-to-end means every layer, not just attention

The naive read of "document masking" is "mask attention." That is necessary but not sufficient. Our model uses several layer families, and each one can leak across documents in its own way:

  1. Attention. This is the obvious one. The backend-specific implementation changes, but the invariant does not: tokens with different doc_ids must not exchange context.
  2. Mamba SSM state. Hidden state from document A has to be reset before document B.
  3. Mamba conv1d. Boundary tokens cannot leak through the convolution buffer.
  4. Other stateful mixers. Anything that accumulates state has to respect the same document boundaries.
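The non-attention cases in the list can be sketched with two toy layers. These are deliberately simplified stand-ins, not Mamba kernels: a scalar linear recurrence for the SSM state and a short causal convolution for the conv1d path. All names and the `decay` parameter are hypothetical.

```python
from typing import List

def scan_with_reset(xs: List[float], doc_ids: List[int],
                    decay: float = 0.5) -> List[float]:
    """Toy recurrence h_t = decay * h_{t-1} + x_t, with the hidden state
    zeroed whenever a new document starts."""
    h = 0.0
    out: List[float] = []
    for i, x in enumerate(xs):
        if i == 0 or doc_ids[i] != doc_ids[i - 1]:
            h = 0.0  # document boundary: drop state from the previous document
        h = decay * h + x
        out.append(h)
    return out

def causal_conv_masked(xs: List[float], doc_ids: List[int],
                       kernel: List[float]) -> List[float]:
    """Causal 1-D convolution where kernel[j] weights the token j steps back.
    Taps that would reach into a different document contribute zero."""
    out: List[float] = []
    for i in range(len(xs)):
        acc = 0.0
        for j, w in enumerate(kernel):
            k = i - j
            if k >= 0 and doc_ids[k] == doc_ids[i]:
                acc += w * xs[k]
        out.append(acc)
    return out

xs = [1.0, 2.0, 3.0, 4.0]
doc_ids = [1, 1, 2, 2]
states = scan_with_reset(xs, doc_ids)
conv = causal_conv_masked(xs, doc_ids, kernel=[1.0, 1.0])
```

At index 2, the first token of the second document, both layers produce exactly the value they would produce if that document stood alone.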

The success criterion is not "the loss looks better." It is contractual: boundary tests must show that cross-document context really stays blocked and that stateful components really reset at the boundary.
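A boundary test in this spirit compares the packed case against the isolated case. The sketch below uses a toy stateful layer so it stays self-contained; `toy_mixer`, `leaks_across_boundary`, and the exact-equality comparison are assumptions for illustration, and a test against real kernels would compare with a numeric tolerance.

```python
from typing import Callable, List

Layer = Callable[[List[float], List[int]], List[float]]

def toy_mixer(xs: List[float], doc_ids: List[int],
              respect_boundaries: bool = True,
              decay: float = 0.5) -> List[float]:
    """Toy recurrence h_t = decay * h_{t-1} + x_t with optional boundary reset."""
    h = 0.0
    out: List[float] = []
    for i, x in enumerate(xs):
        if respect_boundaries and i > 0 and doc_ids[i] != doc_ids[i - 1]:
            h = 0.0
        h = decay * h + x
        out.append(h)
    return out

def leaks_across_boundary(layer: Layer, prefix: List[float],
                          target: List[float]) -> bool:
    """True if the target document's outputs change when an unrelated
    document is packed in front of it."""
    isolated = layer(target, [1] * len(target))
    packed = layer(prefix + target, [1] * len(prefix) + [2] * len(target))
    return packed[len(prefix):] != isolated

def leaky_mixer(xs: List[float], doc_ids: List[int]) -> List[float]:
    """Same layer with the reset disabled, to show the test can fail."""
    return toy_mixer(xs, doc_ids, respect_boundaries=False)

target = [3.0, 4.0]
prefix = [100.0, -7.0]  # unrelated document packed before the target
clean_result = leaks_across_boundary(toy_mixer, prefix, target)
leaky_result = leaks_across_boundary(leaky_mixer, prefix, target)
```

The useful property is that the test never inspects a mask tensor: it only asks whether the target's outputs depend on what was packed before it.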

The four-phase curriculum

With masking in place, the curriculum is what teaches each specialist to reason. We use four phases of progressively longer context, mapped to dataset families produced by different stages of the data pipeline.

Phase 1 — Syntax mastery (4K context). The goal is C++ syntax, basic structures, and short-range dependencies.

Phase 2 — File-level reasoning (16K context). The goal is to learn how functions and classes inside the same file relate.

Phase 3 — Repository graph reasoning (64K context). The goal is project-level awareness across files and build-resolved semantic links. That build-resolved lane is the training-side consumer of compile commands and semantic graphs.

Phase 4 — Structure-aware training (all context lengths). The goal is to teach the model code structure through learnable embeddings and relation-aware signals rather than through flat text alone.

The order matters. Trying to teach repository graph reasoning to a model that has not yet learned syntax produces a model that is confidently wrong about both. Trying to teach syntax to a model that has already been overfit on long graphs wastes the long-graph data.
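One way to keep that ordering explicit is to carry the phases as data rather than as prose. The sketch below is a hypothetical schema, not the MegaCpp configuration format. It follows the phase descriptions above and assumes that 4K, 16K, and 64K mean 4096, 16384, and 65536 tokens.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Phase:
    number: int
    name: str
    context_len: Optional[int]  # None means "all context lengths"
    goal: str

# Token counts are an assumption: the article only says 4K / 16K / 64K.
CURRICULUM: Tuple[Phase, ...] = (
    Phase(1, "syntax", 4096, "C++ syntax and short-range dependencies"),
    Phase(2, "file", 16384, "relations between functions and classes in one file"),
    Phase(3, "repo", 65536, "project-level awareness across files"),
    Phase(4, "structure", None, "code structure via learnable embeddings"),
)

def is_ordered(phases: Tuple[Phase, ...]) -> bool:
    """Check that phase numbers are consecutive and that fixed context
    lengths never shrink from one phase to the next."""
    numbers_ok = [p.number for p in phases] == list(range(1, len(phases) + 1))
    lengths = [p.context_len for p in phases if p.context_len is not None]
    return numbers_ok and lengths == sorted(lengths)

ordered = is_ordered(CURRICULUM)
```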

That is also why this article routes naturally into dataset versions: the phase labels only make sense because different dataset generations are optimized for different kinds of context and structure.

What the ablations told us

We ran a battery of ablations to size the per-phase data mix per specialist. Three findings were robust enough to act on.

First, Phase 1 dominates early loss for every specialist, but the kind of Phase 1 data matters per specialist. Some specialists benefit most from simple short-context shards, while others benefit earlier from more structured diff-style or doc-rich inputs.

Second, the long-context win is not uniform. Specialists whose work is genuinely repository-level benefit much earlier from Phase 3 than specialists whose work is mostly local.

Third, Phase 4 structure-aware inputs help most after the model already understands the syntax. Structural signals do not rescue an under-trained short-context base.

Why we mask documentation specifically

There is a separate decision about what to do with documentation comments. Stripping them entirely is wrong: the model gets worse at writing readable code if it never sees the natural-language explanations engineers attach to functions.

But in some contexts we do want the model to learn from documentation without being able to cheat off it. The clearest case is fill-in-the-middle training on a function body where a Doxygen header already states the answer. The fix is not to delete the header globally. The fix is to use a finer-grained loss_mask so the objective reflects what the model is supposed to infer.

That is why doc_ids and loss_mask need to stay separate. doc_ids preserve boundaries between documents. loss_mask decides which positions contribute to the objective. They often interact, but they are solving different problems.
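A small sketch shows why the two signals are independent. Here loss_mask only selects which per-token losses are averaged, while doc_ids play no part in the reduction at all. `masked_mean_loss` and the numbers are hypothetical.

```python
from typing import List

def masked_mean_loss(token_losses: List[float], loss_mask: List[int]) -> float:
    """Average the per-token losses over positions where loss_mask is set.
    Returns 0.0 when no position contributes, to avoid dividing by zero."""
    kept = [loss for loss, keep in zip(token_losses, loss_mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0

# One packed row with two documents. doc_ids say who may see whom;
# loss_mask says which positions are scored. Neither implies the other.
doc_ids = [1, 1, 2, 2]
loss_mask = [0, 1, 1, 0]
token_losses = [2.0, 4.0, 6.0, 8.0]

loss = masked_mean_loss(token_losses, loss_mask)
```

Note that the scored positions straddle the document boundary: a position can be in a different document from its neighbor and still contribute to the same averaged objective.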

Fill-in-the-middle makes that separation even stricter. Once a row is permuted into prefix, suffix, and middle order, the metadata has to move with it; rebuilding loss_mask or documentation boundaries heuristically after the permutation is exactly how a model ends up learning from the answer span it was supposed to predict. The checked-in FIM long-context metadata sample is the compact receipt for that rule.
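The "metadata moves with the tokens" rule can be sketched by applying one index permutation to both lists. This is a simplified illustration: real FIM formats also insert sentinel tokens, which are omitted here, and `fim_permute` plus the choice to score only the middle span are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

def fim_permute(tokens: Sequence[str], loss_mask: Sequence[int],
                start: int, end: int) -> Tuple[List[str], List[int]]:
    """Reorder a row into prefix, suffix, middle, where the middle is
    tokens[start:end]. The same index permutation is applied to loss_mask,
    so the mask is carried along rather than rebuilt afterwards."""
    n = len(tokens)
    order = list(range(0, start)) + list(range(end, n)) + list(range(start, end))
    return [tokens[i] for i in order], [loss_mask[i] for i in order]

tokens = ["a", "b", "M1", "M2", "y", "z"]
loss_mask = [0, 0, 1, 1, 0, 0]  # in this example only the middle span is scored

new_tokens, new_mask = fim_permute(tokens, loss_mask, start=2, end=4)
```

After the permutation the scored positions sit at the end of the row, still attached to the same two tokens they marked before.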

What we feed each specialist first

Pulling the threads together, the per-specialist starting diet is roughly:

  • Algorithms: heavy Phase 1 simple short-context shards, then Phase 2 file-context graph mix.
  • Templates: heavy Phase 1 structured-doc shards, then Phase 4 enriched parquet as soon as it is available.
  • Memory/RAII: heavy Phase 1 mixed with Phase 2; Phase 3 useful but not critical.
  • Build/Toolchain: light Phase 1, heavy Phase 2 and Phase 3.
  • Service-Framework / Orchestration: light Phase 1, heavy Phase 3 and Phase 4.
  • Testing: heavy Phase 1 simple short-context shards with over-sampled test-like patterns.
  • Systems/C: balanced across Phase 1-3.
  • Compilers/Toolchain: heavy Phase 2 and Phase 3 over LLVM-style corpora.

These are starting diets, not fixed recipes. Each specialist is later fine-tuned on a domain-skewed mix of the same base corpus.

The curriculum ablations also changed how we think about "skipping ahead." A specialist that benefits early from Phase 2 or Phase 3 is not skipping syntax; it is reaching the promotion threshold for those later phases sooner. The short-context floor still matters, because long-context repository signals do not compensate for a brittle syntax or local-control-flow prior.
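To illustrate the difference between skipping a phase and being promoted out of it sooner, here is a purely hypothetical promotion rule. The article does not specify the promotion metric or any threshold values, so validation loss, the `next_phase` function, and every number below are assumptions.

```python
from typing import Dict

def next_phase(current_phase: int, val_loss: float,
               thresholds: Dict[int, float]) -> int:
    """Advance one phase once validation loss on the current phase reaches
    its threshold. A phase with no threshold is terminal."""
    threshold = thresholds.get(current_phase)
    if threshold is not None and val_loss <= threshold:
        return current_phase + 1
    return current_phase

# Hypothetical per-phase thresholds; Phase 4 has none and is terminal.
thresholds = {1: 1.20, 2: 1.00, 3: 0.90}

still_in_phase_1 = next_phase(1, val_loss=1.35, thresholds=thresholds)
promoted_to_2 = next_phase(1, val_loss=1.18, thresholds=thresholds)
```

Under a rule like this, every specialist starts in Phase 1; the ones that move on early are simply the ones that clear the short-context bar in fewer steps.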

The unsexy summary

Document masking and curriculum are the boring parts of training a code model, and they are also the parts that decide whether 64K context is real or theater. Masking is end to end or it is not real. Curriculum is empirical or it is cargo cult.

Phase-to-data map

| Phase | Context | Datasets | Goal |
|---|---|---|---|
| 1 syntax | 4K | simple and doc-rich short-context mixes | C++ syntax + short-range deps |
| 2 file | 16K | file-context graph mix | file-level reasoning |
| 3 repo | 64K | build-resolved repository mix | repository-level reasoning |
| 4 structure | 64K | enriched structure-aware mix | structure-aware bias |
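
```python
import torch

# doc_ids inferred at GPT.forward() entry: each BOS token opens a new document,
# so a running count of BOS positions labels every token with its document.
is_bos = (input_ids == BOS_ID)
doc_ids = torch.cumsum(is_bos.to(torch.int32), dim=-1)
```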

Frequently asked questions

Why not just rely on causal masking inside a packed row?
Because causal masking alone still lets tokens from document B attend to earlier tokens from document A when both sit in the same packed sequence. Document masking adds the boundary rule that unrelated documents must not share context.
Why infer doc_ids from BOS tokens instead of storing them as a separate column?
Because the boundary signal is already present in input_ids, and deriving doc_ids from that keeps the storage contract smaller and easier to version. It also avoids maintaining two boundary sources that can drift apart.
What is the difference between doc_ids and loss_mask?
doc_ids preserve document boundaries for attention and state-reset logic. loss_mask decides which tokens contribute to the objective. They often interact, especially around documentation headers or fill-in-the-middle targets, but they should not be conflated.
What are segment_ids doing if doc_ids already exist?
They are the backend-facing boundary labels derived from doc_ids when a kernel wants contiguous segment numbers rather than raw document identifiers. Document-mask segment IDs sample shows the conversion directly: the same row boundary story is preserved, but in the shape the masking backend expects.
Do all backends enforce document masking with the same tensor?
No. Some paths derive doc_ids on the fly, some consume segment_ids, and stateful mixers use boundary resets or carry masks instead of an attention matrix. The contract is shared even when the implementation surface differs.
Does a specialist that moves early to Phase 2 or Phase 3 skip Phase 1?
No. Early promotion means the specialist starts paying for longer-context data sooner, not that syntax training disappears. The short-context floor is still what keeps later repository-scale signals from landing on a brittle local model.
When should a specialist actually pay for 64K context?
When the task is genuinely repository-scale: build logic, orchestration, and cross-file reasoning benefit early. Specialists that mostly solve local problems still learn better by mastering short-context syntax first and only then moving into the longer phases.
Why keep doc_ids in public samples if some runtimes can infer them from BOS tokens?
Because they are both a runtime signal and an audit signal. Some backends derive them on the fly, others prefer to carry them as explicit row columns, and public-safe samples need to show both the derivable rule and the stable row contract without assuming one backend owns the whole story.
What is still open in this masking lane?
The remaining open questions are about measurement, not about whether boundaries matter. The current research brief still leaves the exact 64K MFU delta for document-masked long-context paths, the precise per-boundary reset cost for stateful mixers, and the exact specialist-by-specialist lift from Phase 4 structure-aware signals as benchmark questions. The grounded claim in this article is narrower: boundary enforcement and phase ordering are required contracts, while those cost curves still need dedicated measurement work.
What should a boundary test prove?
It should compare the packed case against the isolated case, not just inspect a mask tensor. If the same target document produces different activations or predictions depending on which unrelated document was packed before it, the boundary contract is leaking. For stateful layers, the equivalent proof is that the carry state at a document boundary cannot influence the next document.
Which public examples show the masking contract end to end?
Start with Packed row builder example for how rows are assembled, then Document-mask segment IDs sample for the boundary conversion, then Packed rows schema sample for the required row fields, then Tokenized enriched pipeline on TPU for one deployment-facing runtime path.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

doc_ids

The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.

segment_ids

The fixed-width segment labeling used to preserve document boundaries without changing the TPU kernel shape.

loss_mask

The per-token training mask that decides which positions contribute to loss after packing, FIM rearrangement, or documentation-aware masking.