Published April 19, 2026 • 7 min read • David Gornshtein

Packed rows as the real training contract

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a storage detail.

The most honest place to describe a training data pipeline is not the crawler and not the tokenizer. It is the packed row. That is the first format the model actually consumes as a stable contract.

MegaCpp's public data examples make that unusually clear. They show enriched records, masking transforms, schema samples, and row builders separately, but all of them are converging on the same operational boundary: one row that tells the runtime what tokens are valid, where document boundaries are, what the loss should ignore, and which structure fields are still aligned to those tokens. That is why this post sits in the middle of The C/C++ Data Preparation Pipeline, End to End and Tokenized enriched packed rows on TPU. The policy-side lead-in is Building the C++ Training Data Pipeline: What Worked, What Broke, and the schema-side lead-in is C++ Data Versioning and Schema.

If these row-contract terms are new

A packed row is one fixed-length training example after tokenization and packing, not one raw document.
A loss mask says which positions should contribute to training loss.
doc_ids are the per-token document labels that preserve document boundaries after many documents are packed into one row.
segment_ids are the compact, contiguous boundary labels some masking backends derive from doc_ids when they want per-segment numbering rather than raw document labels.
valid_token_count is the count of non-pad tokens the row actually contains.
Aligned side columns are optional metadata columns that still line up token-by-token with the packed text, such as structure or chunk fields.
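
To make these terms concrete, here is a minimal sketch of a row packer, not the checked-in Packed row builder example. The names `pack_row`, `PAD_ID`, and `PAD_DOC` are hypothetical, and `target_ids` is omitted for brevity.

```python
from typing import Dict, List

PAD_ID = 0    # hypothetical pad token id
PAD_DOC = -1  # hypothetical doc label for padding positions


def pack_row(docs: List[List[int]], row_len: int) -> Dict[str, object]:
    """Concatenate tokenized documents into one fixed-length row."""
    input_ids: List[int] = []
    doc_ids: List[int] = []
    for doc_index, tokens in enumerate(docs):
        input_ids.extend(tokens)
        # Every token remembers which document it came from.
        doc_ids.extend([doc_index] * len(tokens))
    if len(input_ids) > row_len:
        raise ValueError("documents do not fit in one row")
    valid = len(input_ids)
    pad = row_len - valid
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "doc_ids": doc_ids + [PAD_DOC] * pad,
        "loss_mask": [1] * valid + [0] * pad,  # padding never carries loss
        "valid_token_count": valid,
        "num_docs": len(docs),
    }


row = pack_row([[11, 12, 13], [21, 22]], row_len=8)
print(row["doc_ids"])            # [0, 0, 0, 1, 1, -1, -1, -1]
print(row["valid_token_count"])  # 5
```

Every per-token column has the same fixed length, and `valid_token_count` marks where real tokens stop and padding begins.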

The quickest local proof bundle is the Packed rows schema sample, Packed row builder example, Masking pipeline excerpt, Loader enriched columns sample, and Document-mask segment IDs sample.

Why the packed row is more important than the intermediate artifacts

Intermediates matter, but they are not the model contract. The model does not consume a repo clone, a build graph, or a raw enriched JSONL record directly. It consumes packed, token-aligned rows with explicit masks and metadata. That is why this contract has to stay connected to long-context and attention sinks: bad packing and bad masks do not just waste tokens, they teach the model false cross-document relationships.

That is why the local example pack is structured the way it is:

one fixture for enriched records
one row-builder example
one schema sample
one masking excerpt that preserves alignment through transformations
one loader-side sample for reading optional enriched columns

Taken together, these files say something stronger than "we have a data pipeline." They say the pipeline is only finished once all those fields survive packing in a form the model can actually train on.

What a packed row has to preserve

The public examples support a practical row contract built around a few durable surfaces:

token ids and target ids
a loss mask
valid-token accounting
document-boundary or segment information
optional enriched columns that still line up with token positions

The checked-in schema sample makes that split precise. It keeps the minimal loader-required columns (input_ids, target_ids, loss_mask) separate from the packer-required boundary columns (doc_ids, valid_token_count, num_docs) and then lists the optional token- and chunk-level metadata columns that only matter if they stay aligned through packing.

The useful rule is that every runtime view must be derivable from row metadata, not from a later guess. If an attention backend wants cumulative sequence lengths, a sparse block mask, or compact segment IDs, the packer still needs to emit enough per-token boundary evidence for that view to be reconstructed and audited. Otherwise the same row can appear valid to the loader while loss, attention, and document-reset logic silently disagree.
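
As an illustration of "derivable from row metadata", the sketch below rebuilds cumulative sequence lengths from doc_ids and valid_token_count alone. The function name `cu_seqlens_from_doc_ids` is hypothetical and the code is not taken from the MegaCpp loader. It assumes padding sits after the valid prefix.

```python
from typing import List


def cu_seqlens_from_doc_ids(doc_ids: List[int], valid_token_count: int) -> List[int]:
    """Return offsets [0, end_of_seg_0, end_of_seg_1, ...] over the valid prefix."""
    offsets = [0]
    for i in range(1, valid_token_count):
        # A change in document label marks the start of a new subsequence.
        if doc_ids[i] != doc_ids[i - 1]:
            offsets.append(i)
    if valid_token_count > 0:
        offsets.append(valid_token_count)
    return offsets


doc_ids = [7, 7, 7, 9, 9, -1, -1, -1]
print(cu_seqlens_from_doc_ids(doc_ids, valid_token_count=5))  # [0, 3, 5]
```

Because the offsets are recomputed from the per-token labels, an auditor can check them against the row instead of trusting a separately stored copy.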

That boundary split is why doc_ids and segment_ids should not be treated as synonyms. doc_ids preserve provenance from the original packed documents. segment_ids are the backend-facing relabeling of those boundaries when the masking path needs compact segment numbers. The row contract is correct only if those two views still describe the same boundaries. Document masking and curriculum explains the masking side; the checked-in receipt is Document-mask segment IDs sample.
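
A small sketch of that relabeling, under stated assumptions: the helper names and the `-1` pad label are hypothetical, and the checked-in Document-mask segment IDs sample may use a different convention. It also shows the "same boundaries" check, and why the two columns differ when one document label appears in two separate runs.

```python
from typing import List


def segment_ids_from_doc_ids(doc_ids: List[int], valid_token_count: int,
                             pad_segment: int = -1) -> List[int]:
    """Relabel each contiguous same-document run as 0, 1, 2, ..."""
    segment_ids: List[int] = []
    current = -1
    for i in range(len(doc_ids)):
        if i >= valid_token_count:
            segment_ids.append(pad_segment)
            continue
        if i == 0 or doc_ids[i] != doc_ids[i - 1]:
            current += 1  # a new contiguous run starts here
        segment_ids.append(current)
    return segment_ids


def boundaries(labels: List[int], valid_token_count: int) -> List[int]:
    """Positions in the valid prefix where the label changes."""
    return [i for i in range(1, valid_token_count) if labels[i] != labels[i - 1]]


doc_ids = [104, 104, 37, 37, 37, 104, -1, -1]  # doc 104 appears in two runs
segment_ids = segment_ids_from_doc_ids(doc_ids, valid_token_count=6)
print(segment_ids)  # [0, 0, 1, 1, 1, 2, -1, -1]
# Both views must agree on where the boundaries are.
print(boundaries(doc_ids, 6) == boundaries(segment_ids, 6))  # True
```

The provenance label 104 repeats, but the backend sees three distinct segments; the boundary positions are identical in both views.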

The validation cost should stay proportional to the row, not to a rebuilt pipeline. Check required columns, fixed lengths, loss_mask, doc_ids, and optional side-column shapes at ingest, then let the loader normalize optional metadata without inventing new batch shapes. The runtime side of that boundary is Dataloader throughput and stalls.
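
A row-proportional ingest check could look like the following sketch. The column names come from the schema split described above; the function `validate_row`, its error strings, and the rule that padding positions carry no loss are assumptions made for illustration.

```python
from typing import Any, Dict, List, Sequence

REQUIRED_COLUMNS = ("input_ids", "target_ids", "loss_mask",
                    "doc_ids", "valid_token_count", "num_docs")
TOKEN_ALIGNED = ("input_ids", "target_ids", "loss_mask", "doc_ids")


def validate_row(row: Dict[str, Any], row_len: int,
                 optional_token_columns: Sequence[str] = ()) -> List[str]:
    """Return a list of problems; an empty list means the row passed."""
    missing = [name for name in REQUIRED_COLUMNS if name not in row]
    if missing:
        return [f"missing columns: {missing}"]
    errors: List[str] = []
    # Required and enabled optional per-token columns share one fixed length.
    for name in TOKEN_ALIGNED + tuple(optional_token_columns):
        if name in row and len(row[name]) != row_len:
            errors.append(f"{name}: length {len(row[name])} != {row_len}")
    valid = row["valid_token_count"]
    if not 0 <= valid <= row_len:
        errors.append(f"valid_token_count {valid} outside [0, {row_len}]")
    elif any(row["loss_mask"][valid:]):
        # Assumption: padding positions must never contribute to loss.
        errors.append("loss_mask is nonzero on padding positions")
    return errors


good = {"input_ids": [1, 2, 3, 0], "target_ids": [2, 3, 0, 0],
        "loss_mask": [1, 1, 0, 0], "doc_ids": [0, 0, 0, -1],
        "valid_token_count": 3, "num_docs": 1}
print(validate_row(good, row_len=4))  # []
```

Each check is a single pass over one row, so the cost scales with row length rather than with corpus size.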

That is already enough to explain why packed rows deserve their own article. Once the row exists, many earlier data-pipeline arguments stop being abstract. You can now ask whether masking survived, whether enrichment stayed aligned, and whether the loader still knows how to read the optional columns without turning everything into an opaque sidecar. That is also why the TPU-facing continuation in OOM on v6e matters operationally: once packed rows define the contract, TPU memory failures and recompilation problems can be discussed against a stable input boundary instead of against vague "data pipeline" language.

Packing is not just an efficiency trick

Sequence packing is often described as a throughput optimization, and that is true but incomplete. In MegaCpp it is also a correctness boundary.

If the packed row fails to encode boundaries and masks correctly, long-context training will teach the model false relationships across unrelated documents. If the row loses alignment between tokens and enriched structure fields, later structure-aware work becomes guesswork. If the row builder and schema disagree, the loader can still run while silently training on the wrong contract.

That is why the row builder, schema sample, and masking excerpt belong in the same public surface. They are three views of the same training boundary. The schema side of that boundary is also why C++ data versioning and schema belongs nearby. If the declared row shape drifts from what the loader or mask builder expects, the pipeline can keep running while the training contract is already broken.

Why this article belongs next to FIM and long-context notes

The packed row is where fill-in-the-middle, document masking, and long-context packing finally meet one another. FIM is not just a transform on raw text. It changes what part of the row carries loss. Document masking is not just an attention idea. It depends on boundaries that the packed row still needs to preserve. Long-context packing is not just "put more text in one example." It is the discipline of packing without losing row-level semantic truth.

FIM also makes the alignment requirement visible in a way normal next-token rows can hide. The marker tokens that separate prefix, suffix, and middle spans must move with the same loss_mask, doc_ids, and token-aligned metadata that the packer later validates; otherwise the row can look full while the infill objective is training across the wrong boundary. The checked-in FIM long-context metadata sample shows the token-metadata permutation, and the chunk boundary remap sample shows the matching structure-span remap.
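
The alignment requirement can be sketched by building one index plan and applying it to every per-token column. This is an illustration, not the checked-in sample: the sentinel ids, the prefix-suffix-middle layout, the `structure_ids` column, and the fill values used at marker positions are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = 9001, 9002, 9003  # hypothetical sentinel ids


def fim_plan(n: int, lo: int, hi: int) -> List[Tuple[str, int]]:
    """Plan for <PRE> prefix <SUF> suffix <MID> middle, where middle = [lo, hi)."""
    plan: List[Tuple[str, int]] = [("marker", FIM_PREFIX)]
    plan += [("src", i) for i in range(0, lo)]
    plan += [("marker", FIM_SUFFIX)]
    plan += [("src", i) for i in range(hi, n)]
    plan += [("marker", FIM_MIDDLE)]
    plan += [("src", i) for i in range(lo, hi)]
    return plan


def apply_plan(values: List[int], plan: List[Tuple[str, int]],
               marker_fill: Optional[int] = None) -> List[int]:
    """Permute one column; marker slots get the sentinel id or a fill value."""
    out: List[int] = []
    for kind, x in plan:
        if kind == "src":
            out.append(values[x])
        else:
            out.append(x if marker_fill is None else marker_fill)
    return out


row: Dict[str, List[int]] = {
    "input_ids": [1, 2, 3, 4, 5],
    "doc_ids": [7, 7, 7, 7, 7],
    "structure_ids": [10, 20, 30, 40, 50],  # hypothetical aligned side column
}
plan = fim_plan(n=5, lo=1, hi=3)
permuted = {
    "input_ids": apply_plan(row["input_ids"], plan),
    "doc_ids": apply_plan(row["doc_ids"], plan, marker_fill=7),
    "structure_ids": apply_plan(row["structure_ids"], plan, marker_fill=0),
}
print(permuted["input_ids"])      # [9001, 1, 9002, 4, 5, 9003, 2, 3]
print(permuted["structure_ids"])  # [0, 10, 0, 40, 50, 0, 20, 30]
```

Because every column goes through the same plan, token 2 still carries structure id 20 after the rearrangement. The document also grows by three marker positions, which the packer has to budget for.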

This is exactly why the public examples in the data and long-context packs reinforce each other. The row contract is what lets those two families talk to each other honestly. Seen this way, document masking and curriculum is not an adjacent topic but a direct dependency of the row contract. The same input-boundary logic also matters for custom TPU kernels such as Pallas FlashAttention with logit softcap on TPU v6e, where segment and masking metadata only stay correct if the packed row stayed correct first.

Prior art and context

The general idea of packing sequences efficiently without cross-sample leakage is well established. There is prior art on efficient sequence packing with attention isolation, official Megatron and Torchtune docs on packed sequence formats, canonical FIM work, and broader long-context papers that explain why middle-position and boundary effects matter. MegaCpp's local contribution is the public-safe contract surface: examples that show how enriched records, masking, and row building remain aligned all the way to the model-facing row.

The newer packed-sequence docs also sharpen why row metadata is not only a correctness receipt. A naive block-triangular mask can isolate subsequences, but it changes the attention work from the sum of per-sequence squares to the square of the whole packed length. The variable-length packed-sequence path instead passes cumulative sequence lengths (cu_seqlens) into the attention kernel so attention between packed subsequences is never computed in the first place. That is exactly why row-level boundary fields belong in the contract here: they are what let the runtime keep isolation without paying a fake quadratic tax.
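
The size of that tax is easy to work out. The sketch below counts query-key score pairs only, ignoring causal halving and kernel details; the function names and the example lengths are illustrative, not measurements.

```python
from typing import List


def dense_masked_work(seq_lens: List[int]) -> int:
    """Scores computed when the whole packed row is attended, then masked."""
    total = sum(seq_lens)
    return total * total


def varlen_work(seq_lens: List[int]) -> int:
    """Scores computed when each subsequence attends only to itself."""
    return sum(n * n for n in seq_lens)


seq_lens = [1024, 1024, 1024, 1024]  # four documents packed into a 4096-token row
print(dense_masked_work(seq_lens))   # 16777216
print(varlen_work(seq_lens))         # 4194304
print(dense_masked_work(seq_lens) // varlen_work(seq_lens))  # 4
```

With k equal-length documents in a row, the masked dense path does k times the score work of the variable-length path, so the gap widens as more short documents share a row.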

Frequently asked questions

If tokenizer output looks fine, is the pipeline ready for training?

No. The model never trains on raw token streams in isolation; it trains on packed rows with masks, boundaries, doc_ids, valid_token_count, and any enabled aligned side columns intact. A tokenizer-only sanity check can still hide broken packing, broken masking, or side-column drift that only shows up once the runtime loader reconstructs the row contract. The Packed row builder example and Loader enriched columns sample are the shortest checked-in pair for that point.

What breaks first when the packed row contract is wrong?

Usually either cross-document leakage or column misalignment. The first shows up as false relationships across unrelated documents; the second shows up when enriched fields stop lining up with token positions. In practice that means the first debugging move is to inspect doc_ids, loss_mask, and token-aligned side columns together rather than blaming the model for bad long-context behavior. The Packed rows schema sample is the compact receipt for which columns belong in that inspection.

What is the practical difference between doc_ids and segment_ids?

doc_ids answer "which original document did this token come from?" while segment_ids answer "which contiguous same-document segment should the masking backend treat as one unit?" In many rows the numbering will look similar, but they serve different consumers. The row packer needs durable document provenance; the masking backend often wants compact segment numbering. Packed row builder example shows the document side and Document-mask segment IDs sample shows the segment conversion.

Should the packed row store the final attention mask?

Not necessarily. The stable contract is the boundary evidence, not one framework's mask representation. A runtime can derive cu_seqlens, a block mask, compact segment_ids, or a dense fallback from the same doc_ids and valid-token accounting. Storing only an opaque mask makes debugging harder because it hides whether loss masking, attention isolation, and document-reset logic came from the same boundaries.
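
As one example of such a derived view, a dense fallback mask can be rebuilt from doc_ids and valid_token_count. This is a sketch under stated assumptions: the function name is hypothetical, attention is causal, and each doc_id occupies one contiguous run (otherwise segment-level labels would be needed).

```python
from typing import List


def dense_doc_causal_mask(doc_ids: List[int], valid_token_count: int) -> List[List[bool]]:
    """mask[q][k] is True when query q may attend to key k."""
    n = len(doc_ids)
    return [
        [
            k <= q                          # causal: no looking ahead
            and q < valid_token_count       # padding queries attend to nothing
            and doc_ids[q] == doc_ids[k]    # stay inside one document
            for k in range(n)
        ]
        for q in range(n)
    ]


mask = dense_doc_causal_mask([0, 0, 1, -1], valid_token_count=3)
print(mask[1])  # [True, True, False, False]
print(mask[2])  # [False, False, True, False]
print(mask[3])  # [False, False, False, False]
```

The mask is disposable: if it ever disagrees with loss masking, the row's boundary columns are still there to show which side is wrong.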

Is cu_seqlens part of the durable schema or just a runtime adapter?

It is a runtime view derived from the durable row boundary fields. The packed row should preserve enough doc_ids, valid-token accounting, and aligned side-column evidence for the loader to regenerate cu_seqlens or a sparse mask without guessing. Treating cu_seqlens as the only source of truth would make the fast attention path work while hiding whether the original document boundaries, loss mask, and enriched columns still agree. The checked-in Packed row builder example and Document-mask segment IDs sample show those two views of the same boundary.

Where do input_pos and BlockMask fit in this contract?

They are runtime views, not replacements for the row contract. A loader can reset input_pos at document boundaries or build a sparse BlockMask for an attention backend, but both views should be regenerated from the same durable row evidence: doc_ids, valid-token accounting, and aligned side columns. That keeps attention isolation, positional resets, and loss masking tied to one auditable source instead of three framework-specific guesses.
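
A minimal sketch of the positional reset, assuming input_pos restarts at 0 whenever the document label changes and that padding positions are simply assigned 0. The function name is hypothetical and the padding convention is an assumption, not something the checked-in samples are known to do.

```python
from typing import List


def input_pos_from_doc_ids(doc_ids: List[int], valid_token_count: int) -> List[int]:
    """Per-token positions that restart at every document boundary."""
    positions: List[int] = []
    pos = 0
    for i in range(len(doc_ids)):
        if i >= valid_token_count:
            positions.append(0)  # assumption: padding gets position 0
            continue
        if i > 0 and doc_ids[i] != doc_ids[i - 1]:
            pos = 0  # new document: restart the position counter
        positions.append(pos)
        pos += 1
    return positions


print(input_pos_from_doc_ids([5, 5, 5, 8, 8, -1], valid_token_count=5))
# [0, 1, 2, 0, 1, 0]
```

The reset points are the same boundary positions an attention mask would use, which is the point of deriving both from one column.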

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Packed rows

How the v6_enriched packed-rows pipeline feeds per-token structure IDs, chunk boundaries, and call edges into the XLA dataloader on TPU v6e without…

Grounding: Tokenized enriched packed rows on TPU: feeding structure to XLA without recompiles

loss_mask

The per-token training mask that decides which positions contribute to loss after packing, FIM rearrangement, or documentation-aware masking.

valid_token_count

The per-row prefix length of non-pad tokens; runtimes use it as the cheap validity receipt instead of rescanning variable-length packed payloads.

doc_ids

The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.

segment_ids

The fixed-width segment labeling used to preserve document boundaries without changing the TPU kernel shape.

cu_seqlens

The cumulative sequence-length offsets passed to varlen attention kernels so packed subsequences stay isolated without computing then masking cross-document attention.

Grounding: Tokenized enriched packed rows on TPU
