Building the MegaCpp Training Corpus: Data, Tokenization, and Document Masking
How MegaCpp curates C++ data for eight specialist models, tokenizes it for long-context training, and prevents cross-document leakage during sequence packing.

MegaCpp is not one model; it is a family of eight C++ specialists that share a tokenizer, a data pipeline, and a curriculum, but diverge in which slice of the corpus they consume most heavily. The quality ceiling of every specialist is set long before a single optimizer step runs — it is set in the data preparation stack. This post walks through that stack end to end: which repositories feed the corpus, how we turn raw commits into uint16/uint32 token streams, how we pack them into 4K, 16K, and 64K sequences without letting documents contaminate each other, and how the eight specialists are differentiated inside one shared dataset.
Why C++ Is Hard to Feed a Model
Unlike natural language, a C++ training example rarely stands alone. A single modified function depends on base classes three headers away, on template instantiations in a different translation unit, and on a call graph that is only knowable after the compiler has walked compile_commands.json. Naively shuffling source files into a tokenizer would teach the model to autocomplete syntax but not to reason about a repository. The MegaCpp pipeline therefore layers four dataset versions, from "pure diff" to "full Clang semantic graph", and trains the model on progressively longer contexts so local syntax is learned before global reasoning is demanded of it.
The Eight Specialists
The operational corpus — the one actually wired into the Megatron launchers — is eight public C/C++ repositories, cloned shallow and pinned to explicit tags. Each specialist is defined by which region of that corpus it over-samples during fine-tuning, while the base model sees all eight. The canonical list is documented in [data_preparation.md]:
llvm/llvm-project at llvmorg-19.1.0 — the compilers/toolchain specialist. Modern C++17/20 idioms, pass infrastructure, IR manipulation.
boostorg/boost at boost-1.86.0 with submodules — the template-heavy specialist. Expression templates, SFINAE, CRTP-grade metaprogramming.
torvalds/linux at v6.10 — the systems/C specialist. Kernel patterns, macro-driven dispatch, RCU, lockless data structures.
fmtlib/fmt at 11.0.0 — the "small, high-quality C++" specialist. Compact API design, constexpr, zero-cost abstraction patterns.
google/googletest at v1.15.0 — the testing specialist. Fixture/macro patterns, death tests, mocking idioms.
abseil/abseil-cpp at tip — the Google-commons specialist. absl:: containers, synchronization primitives, Flags/Status.
facebook/folly at tip — the high-performance C++ specialist. Futures, coroutines, lock-free queues.
grpc/grpc at v1.67.0 — the large-service specialist. Cross-language glue, async state machines, codegen consumers.
The rationale for exactly these eight is pragmatic: license-clean (Apache 2.0, BSL-1.0, GPL-2.0 headers, MIT), no credentials needed, combined ~15 GB after shallow clone, and collectively they cover the shapes of C++ that production teams actually ship — low-level kernel C, heavy-template generic C++, modern application C++, and service-framework C++. We deliberately keep the operational list small so the data pipeline is reproducible on a single workstation.
That operational list is a subset of a much larger catalog we track for future specialists and corpus expansion. The extended catalog in [cpp-training-corpus-repos.md] enumerates 142 repositories across 16 categories — OS kernels (Linux, FreeBSD, XNU, seL4, Zephyr), compilers and runtimes (GCC, CPython, Ruby, V8, LuaJIT, musl), databases (PostgreSQL, SQLite, RocksDB, DuckDB, ClickHouse), networking stacks (curl, nginx, HAProxy, gRPC, ZeroMQ), browsers, game engines (Unreal, CryEngine, PhysX), the GNOME and KDE ecosystems, ML/scientific libraries, crypto, and embedded RTOSes. Each entry is tagged by on-disk size bucket (S/M/L/H) so we can budget ingestion. The catalog also documents the awkward sources — SQLite's Fossil repo, Chromium/V8/Fuchsia on googlesource, VLC/x264 on VideoLAN GitLab, Unreal requiring an Epic-linked GitHub account ([cpp-training-corpus-repos.md]) — so a future corpus expansion does not re-discover the same infrastructure traps.
Tokenizer
All eight specialists share one tokenizer. It has a 131 072-token vocabulary — large enough that uint16 would overflow, which is why the Megatron .bin files are written in uint32 ([data_preparation.md]). The implementation is a hybrid: a fixed hand-curated C++ vocabulary (keywords, operators, common stdlib identifiers) merged with a learned BPE layer using BERT-style whitespace handling, implemented in nanochat/cpp_tokenizer.py on the HuggingFace tokenizers backend ([data_preparation.md]).
The fixed-vocab half matters for C++: tokens like std::, ->, ::, constexpr, template<, #include get stable single-token representations instead of being shattered across subword merges. That keeps attention maps interpretable and helps the model learn structural regularities early. The BPE half absorbs identifiers, literals, and comments. The artifact (tokenizer.json, ~2.2 MiB) is not vendored into the cppmega repo — it is owned by the nanochat checkout and copied into ${MEGACPP_DATA_ROOT}/tokenizer/tokenizer.json during stage 2 of the pipeline. Every checkpoint must therefore be paired with the nanochat commit hash that produced its tokenizer, otherwise decoding silently drifts.
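To make the fixed-vocab/BPE split concrete, here is a toy longest-match pre-tokenizer in pure Python. This is an illustration of the idea, not the nanochat implementation; the FIXED_VOCAB excerpt is hypothetical, and the real hybrid runs on the HuggingFace tokenizers backend.

```python
# Illustrative sketch (not the nanochat implementation): a greedy longest-match
# pass that reserves fixed C++ tokens before any BPE merges would run.
# FIXED_VOCAB is a hypothetical excerpt of the hand-curated table.
FIXED_VOCAB = ["#include", "template<", "constexpr", "std::", "::", "->"]
_FIXED = sorted(FIXED_VOCAB, key=len, reverse=True)  # prefer the longest match

def pretokenize(text: str) -> list[str]:
    """Split text into fixed tokens and residual pieces (BPE absorbs the rest)."""
    out, i = [], 0
    while i < len(text):
        for tok in _FIXED:
            if text.startswith(tok, i):
                out.append(tok)
                i += len(tok)
                break
        else:
            # No fixed token here: emit one char; a real implementation would
            # accumulate a residual span and hand it to the learned BPE layer.
            out.append(text[i])
            i += 1
    return out
```

The point the sketch makes: `std::` and `->` survive as atomic units no matter what identifiers surround them, which is what keeps their representations stable across contexts.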
A detail worth calling out: the tokenizer also prepends a BOS token to every document. We use that downstream for document-boundary inference in attention masking — see the doc-masking section.
The Five-Stage Pipeline
The MegaCpp data build is orchestrated by scripts/data/prepare_data.sh and has five stages ([data_preparation.md]):
Stage 1 — download. prepare_download_megacpp.sh shallow-clones the eight repos into ${MEGACPP_DATA_ROOT}/cpp_raw/. Pinned refs are baked into the script so re-running on a fresh machine produces the same tree up to upstream retagging. It is idempotent; existing directories are skipped.
Stage 2 — tokenize. prepare_tokenize_megacpp.py dispatches to nanochat/scripts/data/run_clang_pipeline.sh, which runs libclang-based semantic indexing over each project, emits enriched JSONL at one document per semantic chunk (≤4096 tokens), tokenizes with the hybrid BPE, streams into parquet shards of 50 000 docs each plus a val_shard.parquet, and drops a _COMPLETE sentinel ([data_preparation.md]). Output lands in ${MEGACPP_DATA_ROOT}/parquet/clang_semantic_4k_v10/. We deliberately delegate to nanochat rather than vendoring the clang indexer — it is several thousand lines of maintained upstream code pulling in libclang bindings, and duplicating it would be worse than a path dependency.
Stage 3 — format. prepare_format_megacpp.py converts parquet into Megatron's .bin/.idx format. The .bin is flat packed token IDs in uint32; the .idx is an MMIDIDX\x00\x00-magic header plus sizes, pointers, and a doc index matching Megatron-core's MMapIndexedDataset reader ([data_preparation.md]). Implementation prefers megatron.core.datasets.indexed_dataset.IndexedDatasetBuilder and falls back to a raw writer when megatron-core is absent.
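A simplified sketch of what the fallback raw writer does, under the assumed layout described above (9-byte magic, uint64 version, 1-byte dtype code, two uint64 counts, then sizes/pointers/doc-index arrays). The dtype code constant is illustrative; the authoritative table lives in megatron-core's indexed_dataset module, and the real path prefers IndexedDatasetBuilder.

```python
import struct
import numpy as np

MAGIC = b"MMIDIDX\x00\x00"
DTYPE_CODE_UINT32 = 7  # assumption for illustration, not the canonical code

def write_bin_idx(prefix: str, docs: list[np.ndarray]) -> None:
    """Write flat uint32 tokens to <prefix>.bin and a matching <prefix>.idx."""
    sizes = np.array([len(d) for d in docs], dtype=np.int32)
    # Byte offset of each document inside the flat .bin stream (4 bytes/token).
    pointers = np.zeros(len(docs), dtype=np.int64)
    np.cumsum(sizes[:-1].astype(np.int64) * 4, out=pointers[1:])
    doc_idx = np.arange(len(docs) + 1, dtype=np.int64)
    with open(prefix + ".bin", "wb") as f:
        for d in docs:
            f.write(d.astype(np.uint32).tobytes())
    with open(prefix + ".idx", "wb") as f:
        f.write(MAGIC)
        f.write(struct.pack("<Q", 1))                  # format version
        f.write(struct.pack("<B", DTYPE_CODE_UINT32))  # token dtype code
        f.write(struct.pack("<Q", len(sizes)))         # number of sequences
        f.write(struct.pack("<Q", len(doc_idx)))       # number of documents + 1
        f.write(sizes.tobytes())
        f.write(pointers.tobytes())
        f.write(doc_idx.tobytes())
```

The sizes/pointers split is what lets the reader memmap the .bin and seek to any document without scanning.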
Stage 4 — cache. prepare_cache_megacpp.py memmaps the .bin/.idx and reports docs, tokens, dtype. We intentionally do not pre-build Megatron's GPTDataset sample index here, because that index is a function of --seed, --seq-length, --global-batch-size, and --train-iters and Megatron rebuilds it in seconds at the first training launch ([data_preparation.md]). Pre-caching would be a fragility tax.
Stage 5 — verify. verify_dataset_megacpp.py confirms both files exist and are non-empty, parses the index, asserts max(token_id) < 131072, and prints the first 64 tokens of document 0. It exits non-zero on any failure — no silent fallbacks, because an undetected vocab mismatch poisons every downstream checkpoint.
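The core of that stage-5 check fits in a few lines. A minimal sketch (the real verify_dataset_megacpp.py also parses the index and checks both files):

```python
import numpy as np

VOCAB_SIZE = 131_072  # matches the shared tokenizer's vocabulary

def verify_bin(path: str) -> None:
    """Fail loudly on an empty .bin or any out-of-vocab token id."""
    tokens = np.memmap(path, dtype=np.uint32, mode="r")
    assert tokens.size > 0, f"{path} is empty"
    max_id = int(tokens.max())
    assert max_id < VOCAB_SIZE, f"vocab overflow: {max_id} >= {VOCAB_SIZE}"
    print(f"{tokens.size} tokens, first 64: {tokens[:64].tolist()}")
```

Because memmap never loads the whole file, this stays cheap even on multi-GB binaries; the max() scan is the only full pass.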
Four Dataset Versions, One Corpus
Within that pipeline, the corpus exists at four increasing levels of structural enrichment, documented in [DATA_GENERATION_STATUS_EN.md]:
v2 and v3 — the base datasets. v2 extracts full files pre- and post-commit; v3 is a structured inline-diff view where removed lines are emitted as C++ comments (// Removed: ...) and added lines as live code, under a synthesized file header. Two header styles exist: v3_doxygen uses Javadoc-style /** @file ... @brief ... */ headers, v3_simple uses plain // File: ... comments. [training_data_examples.md] shows a concrete Abseil commit rendered in both. The raw v2/v3 archives live at /home/dave/commit_chains_new/*.jsonl.gz (~1.6 TB, 27.6 M documents) and the tokenized uint16 binaries at /home/dave/final_bin_data/. This layer teaches the model what a diff looks like and what "before/after" means in source form.
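A toy rendering of the v3_simple style makes the inline-diff view concrete. The function below is illustrative only; field names and the exact header text are assumptions, not the pipeline's schema.

```python
# Toy v3_simple-style rendering: removed lines become "// Removed:" comments,
# added lines appear as live code, under a plain "// File:" header.
# This sketch is illustrative; the real pipeline's schema differs in detail.
def render_v3_simple(path: str, removed: list[str], added: list[str]) -> str:
    lines = [f"// File: {path}"]
    lines += [f"// Removed: {line}" for line in removed]
    lines += added
    return "\n".join(lines)
```

For example, a one-line rename renders as a `// Removed:` comment immediately followed by its replacement, so the model sees before and after in one token stream.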
v4 — the Tree-sitter context graph. For each modified function we build a strict 64K window containing the target plus its direct callers and direct callees, extracted with a Tree-sitter AST walker in Rust. An early bug emitted empty files because of a JSON schema deserialization mismatch; that is fixed, the Rust binary is recompiled, and the pipeline now saturates 40 cores writing into /mnt/v4_data/v4_context_graph_output/v4_extracted/ ([DATA_GENERATION_STATUS_EN.md]). v4 is an approximate graph — cheap and fast, good enough for 16K/64K curriculum.
v5 — the Clang semantic graph. v5 reaches 100 %-accurate semantic relationships by driving Clang with each project's compile_commands.json and walking git history incrementally. It runs on a dedicated GKE cluster (v5-clang-cluster, 50 v5-clang-worker pods). The deployment unblocked after the node service account 8067557205-compute@developer.gserviceaccount.com was granted roles/artifactregistry.reader, which cleared an ImagePullBackOff on the heavy LLM-toolchain image ([DATA_GENERATION_STATUS_EN.md]). v5 is the ground-truth layer — slower, but it is what the long-context specialists learn repository reasoning from.
v6 — the enriched parquet. v6 does not replace the text; it augments each record with structural metadata emitted by the Rust cpp-chunker, so the model can learn code structure via backpropagation rather than by inference ([training_data_examples.md]). Added columns are structure_ids (per-character category, one of nine: other, preamble, func_sig, func_body, class_decl, class_member, comment, typedef, namespace), chunk_boundaries ({char_offset, kind, name, dep_level, is_leaf}), call_edges ({caller_idx, callee_idx}), and type_edges ({type_idx, user_idx}). The text column is unchanged, so the format is backwards compatible: a naive dataloader ignores the extra columns and sees flat text, while the structure-aware dataloader feeds structure_ids as input embeddings and call_edges/type_edges as learned relation bias in attention (Variant C of the design). Output will live at gs://nanochat-training-data-2026/data/cpp_enriched_16k/ and cpp_enriched_64k/.
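The embedding half of that structure-aware path can be sketched as follows. Dimensions and class names here are illustrative; the actual Variant C design also feeds call_edges/type_edges as relation bias inside attention, which this sketch omits.

```python
import torch
import torch.nn as nn

NUM_STRUCTURE_KINDS = 9  # other, preamble, func_sig, func_body, class_decl, ...

class StructureAwareEmbedding(nn.Module):
    """Illustrative input layer: token embeddings plus a learned embedding
    of the nine structure_ids categories, summed elementwise."""

    def __init__(self, vocab_size: int = 131_072, d_model: int = 64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.struct = nn.Embedding(NUM_STRUCTURE_KINDS, d_model)

    def forward(self, input_ids: torch.Tensor, structure_ids: torch.Tensor) -> torch.Tensor:
        # A structure-unaware dataloader can pass all-zero structure_ids and
        # recover plain token embeddings plus a constant learned bias.
        return self.tok(input_ids) + self.struct(structure_ids)
```

This is what backwards compatibility buys: the same checkpoint architecture degrades gracefully when the extra columns are absent.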
Curriculum: 4K → 16K → 64K
Training proceeds in four phases of progressively longer context, mapped to those four dataset versions in [corpus_curriculum_mapping.md].
Phase 1 (4K context) is syntax mastery. The model is fed v2_simple, v3_simple, v2_doxygen, and v3_doxygen as pre-tokenized .bin files, packed into 4096-token sequences by dataloader.py via memmap over the uint16 binaries. No deep call graphs — just dense code-plus-diff-plus-comment at short range. The point is to learn C++ as a language before we ask it to understand a project.
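The packing step in that phase reduces to slicing fixed windows off a memmapped stream. A simplified sketch in the spirit of dataloader.py (shuffling, epoch handling, and label shifting omitted):

```python
import numpy as np

SEQ_LEN = 4096

def iter_packed(bin_path: str, seq_len: int = SEQ_LEN):
    """Yield contiguous seq_len-token windows from a flat uint16 binary."""
    stream = np.memmap(bin_path, dtype=np.uint16, mode="r")
    n_full = stream.size // seq_len  # drop the ragged tail
    for i in range(n_full):
        yield np.asarray(stream[i * seq_len : (i + 1) * seq_len])
```

Because the windows ignore document boundaries, this is exactly the "sequence packing" whose contamination risk the doc-masking section below addresses.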
Phase 2 (16K context) is file-level reasoning. We reuse v4_context_graph but truncate the loader to 16 384 tokens. Because the v4 packing algorithm places the most critical context (target → direct callers → direct callees) nearest the modification, the 16K window captures file-local and immediate cross-file dependencies without wasting tokens on far-away code.
Phase 3 (64K context) is full repository-graph reasoning. Here we use full v4_context_graph plus v5_clang_graph, with max_seq_len=65536. Up to 64 000 tokens of heavily interconnected C++ — callers-of-callers, base interfaces, template instantiation chains — are injected immediately before the target modification so the model can trace variables, inheritance, and side effects across many files.
Phase 4 is structure-aware training and can overlap Phases 2–3 because v6_enriched is backwards compatible. The same compilable C++ is served with structure_ids, chunk_boundaries, call_edges, and type_edges so the model receives structure and dependency-level embeddings at the input layer and learned relation bias in attention.
Eight specialists then differentiate off this shared foundation by weighted over-sampling: the LLVM-heavy specialist up-weights llvm-project shards, the Boost specialist up-weights Boost shards, and so on. The base model sees everything uniformly.
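The over-sampling scheme can be sketched as a reweighting of per-repo shard counts. The boost factor and the helper below are illustrative, not the launcher's actual sampling code:

```python
# Illustrative sketch of specialist differentiation: multiply the home repo's
# corpus share by a boost factor, then renormalize to a sampling distribution.
def specialist_weights(shard_counts: dict[str, int],
                       home_repo: str,
                       boost: float = 4.0) -> dict[str, float]:
    raw = {repo: n * (boost if repo == home_repo else 1.0)
           for repo, n in shard_counts.items()}
    total = sum(raw.values())
    return {repo: w / total for repo, w in raw.items()}
```

With boost=1.0 this collapses to the uniform distribution the base model trains on, which is the property that lets all nine runs share one corpus.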
Document Masking: Why Packing Is Not Free
At 4K context the naive trick of concatenating short documents into one fixed-length sequence — "sequence packing" — is nearly free. At 16K and 64K it becomes catastrophic unless you mask document boundaries. Without masking, a token in Document B can causally attend to tokens in Document A just because they share a packed sequence. The model then hallucinates dependencies between unrelated files, and every gain from long-context training is eaten by cross-document contamination ([doc_masking_design_en.md]).
Our solution is a doc_ids tensor aligned with input_ids, where tokens inside the same document share an id and tokens in different documents do not. Rather than storing doc_ids in the parquet, we infer it on-the-fly from the BOS token that the tokenizer already prepends to every document:
doc_ids = torch.cumsum(input_ids == BOS_TOKEN_ID, dim=1) - 1
This requires zero changes to the data pipeline, runs in O(T), and can be computed in the dataloader or at the top of GPT.forward() ([doc_masking_design_en.md]).
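The rule the mask must enforce is easiest to see in its dense form: position j may attend to position i only if i ≤ j and both positions carry the same doc id. A sketch, with an illustrative BOS id (this is the predicate a FlexAttention mask_mod expresses per (q_idx, kv_idx) pair; the dense tensor is for exposition, not for 64K production use):

```python
import torch

BOS_TOKEN_ID = 0  # illustrative id

def doc_causal_mask(input_ids: torch.Tensor) -> torch.Tensor:
    """Return (B, T, T) bool mask, True where attention is allowed."""
    doc_ids = torch.cumsum(input_ids == BOS_TOKEN_ID, dim=1) - 1
    same_doc = doc_ids[:, :, None] == doc_ids[:, None, :]
    T = input_ids.size(1)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=input_ids.device))
    return same_doc & causal
```

Positions where this mask is False are exactly the ones whose pre-softmax scores must be −∞.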
From there, the mask is routed per attention backend. On CUDA with PyTorch ≥ 2.5 the primary path is FlexAttention: a custom mask_mod combines causal masking with document boundaries, and the resulting BlockMask is block-sparse so entire all-False blocks are skipped — minimal MFU overhead even at 64K. Softcapping composes naturally as a score_mod. The alternative CUDA path is Flash Attention 3 varlen: convert doc_ids to cu_seqlens, unpad (B, T, H, D) into a flat (total_tokens, H, D), call flash_attn_varlen_func, and re-pad the output. FA2 supports the same signature. For small contexts a 2D SDPA mask fallback exists, but it is memory-bound — a 64K 2D mask is 4 GB per sample, so we cap SDPA at T ≤ 8192. On TPU both Pallas FlashAttention (q_segment_ids/kv_segment_ids) and JAX Splash Attention (SegmentIds plus fused attn_logits_soft_cap) are supported; torch_xla hides the Pallas details.
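The varlen handoff in the FA3 path amounts to converting doc_ids into a cumulative-lengths vector. A sketch for one packed sequence (the int32 prefix-sum shape is what flash_attn_varlen_func consumes):

```python
import torch

def doc_ids_to_cu_seqlens(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (T,) non-decreasing ids for one packed sequence.
    Returns cumulative sequence lengths, starting at 0 and ending at T."""
    boundaries = torch.nonzero(doc_ids[1:] != doc_ids[:-1]).flatten() + 1
    zero = torch.zeros(1, dtype=torch.int32)
    end = torch.tensor([doc_ids.numel()], dtype=torch.int32)
    return torch.cat([zero, boundaries.to(torch.int32), end])
```

Each adjacent pair in the result delimits one document inside the packed sequence, which is what lets the kernel treat the batch as a ragged set of independent sequences.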
Mamba blocks need extra care. The SSM hidden state would leak across document boundaries just as attention would, so we compute reset_mask = (doc_ids[:, 1:] != doc_ids[:, :-1]) and zero the SSM carry at each boundary. On XLA, where the scan is a compiled loop and we cannot stop-and-restart mid-sequence, we fold the reset into the scan body as a multiplicative mask on the carry state. The kernel_size=4 conv1d at the head of each Mamba block also leaks, because each output depends on the three preceding tokens; we mask the conv1d input buffer at boundary positions so those three tokens cannot belong to a prior document ([doc_masking_design_en.md]).
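The multiplicative-reset trick can be shown on a toy scalar scan. The decay constant stands in for the real SSM dynamics; the point is only that folding the reset into the step as a multiplier on the carry zeroes the state at boundaries without breaking a compiled loop:

```python
import torch

def scan_with_reset(x: torch.Tensor, doc_ids: torch.Tensor, decay: float = 0.9) -> torch.Tensor:
    """Toy linear scan: carry decays within a document, resets at boundaries.
    x, doc_ids: (T,). Stand-in for the real SSM recurrence."""
    reset = torch.zeros_like(x, dtype=torch.bool)
    reset[1:] = doc_ids[1:] != doc_ids[:-1]  # True where a new document starts
    out, h = [], torch.zeros(())
    for t in range(x.numel()):
        # Multiplicative reset folded into the step: the carry is multiplied
        # by 0 at a boundary, so no state crosses documents.
        h = h * decay * (~reset[t]).float() + x[t]
        out.append(h)
    return torch.stack(out)
```

On XLA the same multiplier rides inside the scan body, which is why no stop-and-restart of the compiled loop is needed.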
The success bar is quantitative: attention scores across different doc_ids must be exactly −∞ pre-softmax (verified by unit tests), MFU must not regress by more than 3–5 % versus naive packing, and at 16K+ packed sequences val_bpb must beat the unmasked baseline. On short 4K runs with few documents per sequence, the improvement is negligible and that is expected — the real signal shows up as context grows.
Where It Lands
The finished artifacts are flat: ${MEGACPP_DATA_ROOT}/megatron/clang_semantic_4k_v10_{train,valid}.{bin,idx}, with the tokenizer sibling at ${MEGACPP_DATA_ROOT}/tokenizer/tokenizer.json. Launchers like scripts/remote_smoke_h200_dsa_9_4_m.sh hard-code --data-path, --tokenizer-type HuggingFaceTokenizer, --tokenizer-model, and --split 98,1,1, so as long as ${REMOTE_ROOT}/data/ mirrors ${MEGACPP_DATA_ROOT} (bench3 uses /mnt/data/cppmega-root/data, europe uses /home/dave/cppmega-root/data) training picks it up with no edits ([data_preparation.md]).
Two honesty notes are worth repeating. The raw corpus (cpp_raw/, ~15 GB) lives outside git; if you need bitwise reproducibility after upstream retagging, mirror it to cold storage before discarding. And the clang indexer is order-sensitive on filesystem enumeration, so while the streaming parquet writer is seeded (--seed=42) the overall pipeline is not guaranteed bitwise-reproducible across kernels. Pin the sha256sum of the resulting .bin in the experiment log alongside the nanochat commit that produced the tokenizer.
That is the MegaCpp data stack: eight pinned repos, one 131 072-token hybrid tokenizer, four progressively enriched dataset versions (v2 → v6), a 4K→16K→64K curriculum, and a document-masking layer that makes long-context packing honest. The specialists are distinguishable not by separate pipelines but by how they weight this shared, structurally-aware corpus.