Packed rows as the real training contract
Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a storage detail.

The most honest place to describe a training data pipeline is not the crawler and not the tokenizer. It is the packed row. That is the first format the model actually consumes as a stable contract.
MegaCpp's public data examples make that unusually clear. They show enriched records, masking transforms, schema samples, and row builders separately, but all of them are converging on the same operational boundary: one row that tells the runtime what tokens are valid, where document boundaries are, what the loss should ignore, and which structure fields are still aligned to those tokens. That is why this post sits in the middle of The C/C++ Data Preparation Pipeline, End to End and Tokenized enriched packed rows on TPU. The policy-side lead-in is Building the C++ Training Data Pipeline: What Worked, What Broke, and the schema-side lead-in is C++ Data Versioning and Schema.
If these row-contract terms are new
- A packed row is one fixed-length training example after tokenization and packing, not one raw document.
- A loss mask says which positions should contribute to training loss.
- doc_ids are the per-token document labels that preserve document boundaries after many documents are packed into one row.
- segment_ids are the compact, contiguous boundary labels some masking backends derive from doc_ids when they want per-segment numbering rather than raw document labels.
- valid_token_count is the count of non-pad tokens the row actually contains.
- Aligned side columns are optional metadata columns that still line up token-by-token with the packed text, such as structure or chunk fields.
The quickest local proof bundle is the Packed rows schema sample, Packed row builder example, Masking pipeline excerpt, Loader enriched columns sample, and Document-mask segment IDs sample.
Why the packed row is more important than the intermediate artifacts
Intermediates matter, but they are not the model contract. The model does not consume a repo clone, a build graph, or a raw enriched JSONL record directly. It consumes packed, token-aligned rows with explicit masks and metadata. That is why this contract has to stay connected to long-context and attention sinks: bad packing and bad masks do not just waste tokens, they teach the model false cross-document relationships.
That is why the local example pack is structured the way it is:
- one fixture for enriched records
- one row-builder example
- one schema sample
- one masking excerpt that preserves alignment through transformations
- one loader-side sample for reading optional enriched columns
Taken together, these files say something stronger than "we have a data pipeline." They say the pipeline is only finished once all those fields survive packing in a form the model can actually train on.
What a packed row has to preserve
The public examples support a practical row contract built around a few durable surfaces:
- token ids and target ids
- a loss mask
- valid-token accounting
- document-boundary or segment information
- optional enriched columns that still line up with token positions
The checked-in schema sample makes that split precise. It keeps the minimal
loader-required columns (input_ids, target_ids, loss_mask) separate from
the packer-required boundary columns (doc_ids, valid_token_count,
num_docs) and then lists the optional token- and chunk-level metadata columns
that only matter if they stay aligned through packing.
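To make the column split concrete, here is a minimal sketch of a row type and a toy packer. The column names come from the schema sample described above; everything else (the `PackedRow` class, `pack_documents`, the pad values, and the choice to mask loss on the last token of each document) is a hypothetical illustration, not the checked-in row builder.

```python
from dataclasses import dataclass
from typing import List, Sequence

PAD_ID = 0    # hypothetical pad token id
PAD_DOC = -1  # hypothetical doc label for padding positions


@dataclass
class PackedRow:
    # Loader-required columns.
    input_ids: List[int]
    target_ids: List[int]
    loss_mask: List[int]
    # Packer-required boundary columns.
    doc_ids: List[int]
    valid_token_count: int
    num_docs: int


def pack_documents(docs: Sequence[Sequence[int]], seq_len: int) -> PackedRow:
    """Concatenate tokenized documents into one fixed-length row.

    Assumption: targets are the next token inside the same document, so the
    last token of each document has no target and its loss is masked out.
    """
    input_ids: List[int] = []
    target_ids: List[int] = []
    loss_mask: List[int] = []
    doc_ids: List[int] = []
    num_docs = 0
    for doc_index, doc in enumerate(docs):
        room = seq_len - len(input_ids)
        tokens = list(doc)[:room]  # truncate a document that would overflow
        if not tokens:
            continue
        num_docs += 1
        for pos, tok in enumerate(tokens):
            has_next = pos + 1 < len(tokens)
            input_ids.append(tok)
            target_ids.append(tokens[pos + 1] if has_next else PAD_ID)
            loss_mask.append(1 if has_next else 0)
            doc_ids.append(doc_index)
    valid = len(input_ids)
    pad = seq_len - valid
    return PackedRow(
        input_ids=input_ids + [PAD_ID] * pad,
        target_ids=target_ids + [PAD_ID] * pad,
        loss_mask=loss_mask + [0] * pad,
        doc_ids=doc_ids + [PAD_DOC] * pad,
        valid_token_count=valid,
        num_docs=num_docs,
    )
```

Packing `[[5, 6, 7], [8, 9]]` into a row of length 8 under these assumptions yields five valid tokens, two documents, and a loss mask that is zero at each document's final token and on all padding.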
The useful rule is that every runtime view must be derivable from row metadata, not from a later guess. If an attention backend wants cumulative sequence lengths, a sparse block mask, or compact segment IDs, the packer still needs to emit enough per-token boundary evidence for that view to be reconstructed and audited. Otherwise the same row can appear valid to the loader while loss, attention, and document-reset logic silently disagree.
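As one example of a derivable view, cumulative sequence lengths can be rebuilt from doc_ids and valid_token_count alone. The helper name below is hypothetical, and the sketch assumes each document occupies one contiguous run of the valid prefix.

```python
from typing import List


def cu_seqlens_from_doc_ids(doc_ids: List[int], valid_token_count: int) -> List[int]:
    """Return offsets [0, end_of_doc_0, end_of_doc_1, ...] over the valid prefix."""
    offsets = [0]
    for i in range(1, valid_token_count):
        if doc_ids[i] != doc_ids[i - 1]:
            offsets.append(i)  # a new document starts at position i
    if valid_token_count > 0:
        offsets.append(valid_token_count)
    return offsets
```

For doc_ids `[0, 0, 0, 1, 1, -1, -1, -1]` with five valid tokens this gives `[0, 3, 5]`. Padding never becomes a subsequence because the scan stops at valid_token_count.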
That boundary split is why doc_ids and segment_ids should not be treated as
synonyms. doc_ids preserve provenance from the original packed documents.
segment_ids are the backend-facing relabeling of those boundaries when the
masking path needs compact segment numbers. The row contract is correct only if
those two views still describe the same boundaries. Document masking and
curriculum explains the masking side; the
checked-in receipt is Document-mask segment IDs sample.
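A small sketch of that relabeling, plus the consistency check it implies, might look like the following. The function names, the choice to number segments from 1, and the use of 0 for padding are all assumptions made for illustration; the checked-in sample may use different conventions.

```python
from typing import List


def segment_ids_from_doc_ids(doc_ids: List[int], valid_token_count: int,
                             pad_segment: int = 0) -> List[int]:
    """Relabel contiguous document runs as 1, 2, 3, ...; padding gets pad_segment."""
    segment_ids: List[int] = []
    current = 0
    for i in range(len(doc_ids)):
        if i >= valid_token_count:
            segment_ids.append(pad_segment)
            continue
        if i == 0 or doc_ids[i] != doc_ids[i - 1]:
            current += 1
        segment_ids.append(current)
    return segment_ids


def boundaries(labels: List[int], valid_token_count: int) -> List[int]:
    """Positions in the valid prefix where the label changes."""
    return [i for i in range(1, valid_token_count) if labels[i] != labels[i - 1]]


def same_boundaries(doc_ids: List[int], segment_ids: List[int],
                    valid_token_count: int) -> bool:
    """True when both labelings cut the valid prefix at the same positions."""
    return boundaries(doc_ids, valid_token_count) == boundaries(segment_ids, valid_token_count)
```

With doc_ids `[17, 17, 42, 42, 42, 9, -1, -1]` and six valid tokens, the derived segment_ids are `[1, 1, 2, 2, 2, 3, 0, 0]`: the numbering differs from the document labels, but both views cut the row at positions 2 and 5.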
The validation cost should stay proportional to the row, not to a rebuilt
pipeline. Check required columns, fixed lengths, loss_mask, doc_ids, and
optional side-column shapes at ingest, then let the loader normalize optional
metadata without inventing new batch shapes. The runtime side of that boundary
is Dataloader throughput and stalls.
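An ingest check along those lines can stay linear in the row length. The sketch below is hypothetical: `validate_row`, the dict-shaped row, and the specific rules (no loss on padding, num_docs equal to the number of contiguous document runs) are assumptions chosen to illustrate the idea, not the loader's actual checks.

```python
from typing import List, Sequence

TOKEN_COLUMNS = ("input_ids", "target_ids", "loss_mask", "doc_ids")
SCALAR_COLUMNS = ("valid_token_count", "num_docs")


def validate_row(row: dict, seq_len: int,
                 side_columns: Sequence[str] = ()) -> List[str]:
    """Return a list of contract violations; an empty list means the row passes."""
    errors = [f"missing column: {name}"
              for name in TOKEN_COLUMNS + SCALAR_COLUMNS if name not in row]
    if errors:
        return errors
    # Required per-token columns and any present side columns share one length.
    for name in TOKEN_COLUMNS + tuple(side_columns):
        if name in row and len(row[name]) != seq_len:
            errors.append(f"{name} has length {len(row[name])}, expected {seq_len}")
    n = row["valid_token_count"]
    if not 0 <= n <= seq_len:
        errors.append(f"valid_token_count {n} is outside [0, {seq_len}]")
    if errors:
        return errors
    if any(row["loss_mask"][n:]):
        errors.append("loss_mask is set on padding positions")
    docs = row["doc_ids"][:n]
    runs = sum(1 for i, d in enumerate(docs) if i == 0 or d != docs[i - 1])
    if runs != row["num_docs"]:
        errors.append(f"num_docs is {row['num_docs']} but doc_ids show {runs} runs")
    return errors
```

Optional side columns are only checked when they are present, so a row without enrichment still validates while a misaligned enriched column is rejected before it reaches a batch.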
That is already enough to explain why packed rows deserve their own article. Once the row exists, many earlier data-pipeline arguments stop being abstract. You can now ask whether masking survived, whether enrichment stayed aligned, and whether the loader still knows how to read the optional columns without turning everything into an opaque sidecar. That is also why the TPU-facing continuation in OOM on v6e matters operationally: once packed rows define the contract, TPU memory failures and recompilation problems can be discussed against a stable input boundary instead of against vague "data pipeline" language.
Packing is not just an efficiency trick
Sequence packing is often described as a throughput optimization, and that is true but incomplete. In MegaCpp it is also a correctness boundary.
If the packed row fails to encode boundaries and masks correctly, long-context training will teach the model false relationships across unrelated documents. If the row loses alignment between tokens and enriched structure fields, later structure-aware work becomes guesswork. If the row builder and schema disagree, the loader can still run while silently training on the wrong contract.
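The first failure mode is easy to see in a dense reference mask. The sketch below builds a document-isolated causal mask from doc_ids and valid_token_count; the function name is hypothetical, and a real backend would normally use a sparse or variable-length form rather than this quadratic fallback.

```python
from typing import List


def document_causal_mask(doc_ids: List[int], valid_token_count: int) -> List[List[bool]]:
    """mask[q][k] is True when query q may attend to key k.

    A query sees only earlier-or-equal positions from its own document,
    and padding positions attend to nothing.
    """
    n = len(doc_ids)
    return [
        [k <= q and q < valid_token_count and doc_ids[q] == doc_ids[k]
         for k in range(n)]
        for q in range(n)
    ]
```

If doc_ids were wrong or missing, the only remaining constraint would be `k <= q`, which is exactly the plain causal mask that lets a token attend into the previous, unrelated document.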
That is why the row builder, schema sample, and masking excerpt belong in the same public surface. They are three views of the same training boundary. The schema side of that boundary is also why C++ data versioning and schema belongs nearby. If the declared row shape drifts from what the loader or mask builder expects, the pipeline can keep running while the training contract is already broken.
Why this article belongs next to FIM and long-context notes
The packed row is where fill-in-the-middle, document masking, and long-context packing finally meet one another. FIM is not just a transform on raw text. It changes what part of the row carries loss. Document masking is not just an attention idea. It depends on boundaries that the packed row still needs to preserve. Long-context packing is not just "put more text in one example." It is the discipline of packing without losing row-level semantic truth.
FIM also makes the alignment requirement visible in a way normal next-token
rows can hide. The marker tokens that separate prefix, suffix, and middle spans
must move with the same loss_mask, doc_ids, and token-aligned metadata that
the packer later validates; otherwise the row can look full while the infill
objective is training across the wrong boundary. The checked-in
FIM long-context metadata sample
shows the token-metadata permutation, and the
chunk boundary remap sample
shows the matching structure-span remap.
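One way to keep that alignment is to compute the FIM reordering once as an index list and apply it to every token-aligned column. The sketch below assumes a prefix-suffix-middle layout with three inserted marker slots; the helper names, marker token ids, and fill values are hypothetical and are not taken from the checked-in samples.

```python
from typing import List, Optional, Sequence


def fim_psm_order(n: int, mid_start: int, mid_end: int) -> List[Optional[int]]:
    """Source indices in prefix-suffix-middle order; None marks an inserted marker slot."""
    prefix = list(range(0, mid_start))
    middle = list(range(mid_start, mid_end))
    suffix = list(range(mid_end, n))
    return [None] + prefix + [None] + suffix + [None] + middle


def apply_order(column: Sequence[int], order: Sequence[Optional[int]],
                marker_values: Sequence[int]) -> List[int]:
    """Permute one token-aligned column, filling marker slots from marker_values in order."""
    markers = iter(marker_values)
    return [next(markers) if src is None else column[src] for src in order]


# Hypothetical marker token ids for the prefix, suffix, and middle sentinels.
PRE, SUF, MID = 1, 2, 3

tokens = [10, 11, 12, 13, 14, 15]
structure_ids = [7, 7, 8, 8, 9, 9]  # an example token-aligned side column
order = fim_psm_order(len(tokens), mid_start=2, mid_end=4)

fim_tokens = apply_order(tokens, order, [PRE, SUF, MID])
fim_structure = apply_order(structure_ids, order, [0, 0, 0])
```

Because both columns go through the same `order`, the structure label for token 12 still sits under token 12 after the rearrangement. The row also grows by three marker positions, which the packer has to count against the fixed row length.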
This is exactly why the public examples in the data and long-context packs reinforce each other. The row contract is what lets those two families talk to each other honestly. Seen this way, document masking and curriculum is not an adjacent topic but a direct dependency of the row contract. The same input-boundary logic also matters for custom TPU kernels such as Pallas FlashAttention with logit softcap on TPU v6e, where segment and masking metadata only stay correct if the packed row stayed correct first.
Prior art and context
The general idea of packing sequences efficiently without cross-sample leakage is well established. There is prior art on efficient sequence packing with attention isolation, official Megatron and Torchtune docs on packed sequence formats, canonical FIM work, and broader long-context papers that explain why middle-position and boundary effects matter. MegaCpp's local contribution is the public-safe contract surface: examples that show how enriched records, masking, and row building remain aligned all the way to the model-facing row.
The newer packed-sequence docs also sharpen why row metadata is not only a
correctness receipt. A naive block-triangular mask can isolate subsequences,
but it changes the attention work from the sum of per-sequence squares to the
square of the whole packed length. The variable-length packed-sequence path
instead passes cumulative sequence lengths (cu_seqlens) into the attention
kernel so attention between packed subsequences is never computed in the first
place. That is exactly why row-level boundary fields belong in the contract
here: they are what let the runtime keep isolation without paying a fake
quadratic tax.
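The size of that tax is simple arithmetic. The sketch below counts query-key pairs for a row described by cu_seqlens; the function name and the example lengths are illustrative, and the count ignores the roughly constant factor that causal masking removes from both sides.

```python
from typing import List, Tuple


def attention_pairs(cu_seqlens: List[int]) -> Tuple[int, int]:
    """Return (varlen_pairs, dense_pairs) for one packed row."""
    lengths = [end - start for start, end in zip(cu_seqlens, cu_seqlens[1:])]
    varlen_pairs = sum(n * n for n in lengths)  # sum of per-sequence squares
    dense_pairs = sum(lengths) ** 2             # square of the whole packed length
    return varlen_pairs, dense_pairs


varlen, dense = attention_pairs([0, 1024, 3072, 4096])
```

For three documents of 1024, 2048, and 1024 tokens packed into 4096 positions, the variable-length path touches 6,291,456 pairs while the masked dense path computes 16,777,216, so the isolated form does 37.5% of the work for this row.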
Frequently asked questions
If tokenizer output looks fine, is the pipeline ready for training?
[…] doc_ids, valid_token_count, and any enabled aligned side columns intact. A tokenizer-only sanity check can still hide broken packing, broken masking, or side-column drift that only shows up once the runtime loader reconstructs the row contract. The Packed row builder example and Loader enriched columns sample are the shortest checked-in pair for that point.
What breaks first when the packed row contract is wrong?
[…] doc_ids, loss_mask, and token-aligned side columns together rather than blaming the model for bad long-context behavior. The Packed rows schema sample is the compact receipt for which columns belong in that inspection.
What is the practical difference between doc_ids and segment_ids?
doc_ids answer "which original document did this token come from?" while segment_ids answer "which contiguous same-document segment should the masking backend treat as one unit?" In many rows the numbering will look similar, but they serve different consumers. The row packer needs durable document provenance; the masking backend often wants compact segment numbering. Packed row builder example shows the document side and Document-mask segment IDs sample shows the segment conversion.
Should the packed row store the final attention mask?
[…] cu_seqlens, a block mask, compact segment_ids, or a dense fallback from the same doc_ids and valid-token accounting. Storing only an opaque mask makes debugging harder because it hides whether loss masking, attention isolation, and document-reset logic came from the same boundaries.
Is cu_seqlens part of the durable schema or just a runtime adapter?
[…] doc_ids, valid-token accounting, and aligned side-column evidence for the loader to regenerate cu_seqlens or a sparse mask without guessing. Treating cu_seqlens as the only source of truth would make the fast attention path work while hiding whether the original document boundaries, loss mask, and enriched columns still agree. The checked-in Packed row builder example and Document-mask segment IDs sample show those two views of the same boundary.
Where do input_pos and BlockMask fit in this contract?
[…] input_pos at document boundaries or build a sparse BlockMask for an attention backend, but both views should be regenerated from the same durable row evidence: doc_ids, valid-token accounting, and aligned side columns. That keeps attention isolation, positional resets, and loss masking tied to one auditable source instead of three framework-specific guesses.
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
- Packed rows: How the v6_enriched packed-rows pipeline feeds per-token structure IDs, chunk boundaries, and call edges into the XLA dataloader on TPU v6e without…
- loss_mask: The per-token training mask that decides which positions contribute to loss after packing, FIM rearrangement, or documentation-aware masking.
- valid_token_count: The per-row prefix length of non-pad tokens; runtimes use it as the cheap validity receipt instead of rescanning variable-length packed payloads.
- doc_ids: The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.
- segment_ids: The fixed-width segment labeling used to preserve document boundaries without changing the TPU kernel shape.
- cu_seqlens: The cumulative sequence-length offsets passed to varlen attention kernels so packed subsequences stay isolated without computing then masking cross-document attention.