MegaCpp Engineering · Applied C++ model systems
Grounded engineering note from the MegaCpp stack
By David Gornshtein · 6 min read
Tags: Data, SLM, Training, Dataloader, Dataset, Packing

SLM data: what the pipeline optimizes for and why the loader contract matters most

A grounded walkthrough of the MegaCpp data path: parquet shards, split logic, packed rows, metadata columns, and the interface choices documented in the public sample corpus.


Small-model data discussions often stay too abstract. People argue about corpus mix, synthetic ratios, or token budgets without showing what the training input actually looks like. The public MegaCpp sample packs are useful because they keep the discussion on concrete surfaces: pinned-input rules, masking examples, compile-command examples, and notes about structure-aware metadata. Taken together, those files show that in this stack, "SLM data" is not just a bag of documents. It is the whole path from pinned source inputs to packed training rows with explicit metadata and compatibility rules.

That framing matters because most data failures are interface failures. A pipeline can store tokens in a perfectly reasonable format and still train on the wrong thing if split rules, metadata defaults, or schema evolution are underspecified.

The base contract starts with pinned inputs and explicit splits

The first useful constraint is boring on purpose: inputs should be pinned and described as structured data. The public data-prep notes say to pin every upstream input to an explicit release tag, commit hash, or dataset revision, and the pinning note gives concrete examples such as llvm-project@llvmorg-19.1.0 rather than a floating branch. That is the right starting point because reproducible data work begins before tokenization.
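A minimal sketch of what a pin check could look like, assuming a hypothetical manifest that maps each source name to a revision string. The names `FLOATING_REFS`, `is_pinned`, and `unpinned_sources` are illustrative and do not come from the public notes, and the "contains a digit" rule is only a rough heuristic for spotting release tags.

```python
import re

# Hypothetical list of branch-like names that should never appear as a pin.
FLOATING_REFS = {"main", "master", "HEAD", "latest", "trunk"}
_FULL_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")


def is_pinned(revision: str) -> bool:
    """Return True when a revision looks like a full commit hash or a release tag."""
    if not revision or revision in FLOATING_REFS:
        return False
    if _FULL_COMMIT_HASH.fullmatch(revision):
        return True
    # Heuristic (an assumption): release tags such as llvmorg-19.1.0 carry digits.
    return any(ch.isdigit() for ch in revision)


def unpinned_sources(manifest: dict[str, str]) -> list[str]:
    """List the sources whose revision does not look pinned, sorted by name."""
    return sorted(name for name, rev in manifest.items() if not is_pinned(rev))
```

A check like this would run before any preprocessing, so that a floating reference blocks the snapshot instead of silently changing it.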

The same notes also make the split story more concrete than many training writeups do. They describe a staged pipeline: collect public inputs, normalize them, attach license and provenance metadata, deduplicate, extract structure-aware metadata, write explicit columnar artifacts, and only then promote a snapshot after schema and consumer checks. In other words, a split is not just a train/validation percentage; it is part of a larger contract about what qualifies as a publishable snapshot.

| Layer | What it does | Why it matters |
| --- | --- | --- |
| input pinning | records exact source revisions and licenses | makes the corpus auditable and repeatable |
| preprocessing | normalization, masking, deduplication | removes accidental noise before tokenization |
| row materialization | writes columnar artifacts with explicit fields | defines what the trainer can actually read |
| schema checks | validates shape and field compatibility | prevents silent drift between producer and consumer |
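The ordering of the preprocessing layer can be sketched in a few lines. This is an illustration under simplifying assumptions, not the MegaCpp implementation: `normalize` only collapses whitespace, `deduplicate` uses exact content hashes, and `chunk` splits on whitespace-separated words. All function names are hypothetical.

```python
import hashlib


def normalize(text: str) -> str:
    """Collapse whitespace runs; a stand-in for real normalization."""
    return " ".join(text.split())


def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates by content hash, keeping first occurrences in order."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def chunk(doc: str, size: int) -> list[str]:
    """Split a document into chunks of at most `size` whitespace-separated words."""
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_chunks(docs: list[str], size: int = 4) -> list[str]:
    """Normalize, then deduplicate whole documents, and only then chunk them."""
    cleaned = [normalize(d) for d in docs]
    return [c for d in deduplicate(cleaned) for c in chunk(d, size)]
```

Deduplicating whole documents before chunking means two copies of the same file never produce two sets of near-identical rows.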

The real training unit is the packed row

The strongest theme across the public notes is that the important unit is the row consumed by the trainer, not the raw source document. The data-prep note explicitly calls out deduplication before chunking, keeping build-aware metadata separate from plain lexical chunks, and running consumer smoke checks before promotion. That is exactly the mindset you want for an SLM pipeline: optimize for what the loader sees, not for a storage format headline.

The local Packed rows schema sample makes that boundary concrete. The loader-required columns stay narrow, the packer-required columns add row bookkeeping such as document IDs and valid-token counts, and the optional metadata families come with declared fallback fill values instead of ad hoc per-batch guesses. That is the practical version of a loader contract: upstream variation has to collapse into canonical row columns before promotion.
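One way to picture that three-tier contract is sketched below. Only `doc_ids` is named in this article's glossary; `input_ids`, `num_valid_tokens`, `structure_ids`, `chunk_boundaries`, and the fill values are hypothetical placeholders, not the actual schema sample.

```python
# Hypothetical column groups; the real schema sample may use different names.
LOADER_REQUIRED = ("input_ids",)
PACKER_REQUIRED = ("doc_ids", "num_valid_tokens")
OPTIONAL_FILL = {"structure_ids": -1, "chunk_boundaries": 0}


def canonicalize_row(row: dict, seq_len: int) -> dict:
    """Collapse an upstream row into the canonical columns or fail loudly."""
    required = LOADER_REQUIRED + PACKER_REQUIRED
    missing = [name for name in required if name not in row]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    for name in ("input_ids", "doc_ids"):
        if len(row[name]) != seq_len:
            raise ValueError(f"{name} must have exactly {seq_len} entries")

    out = {name: row[name] for name in required}
    for name, fill in OPTIONAL_FILL.items():
        value = row.get(name)
        # Missing or wrongly shaped optional metadata falls back to the declared fill.
        if isinstance(value, list) and len(value) == seq_len:
            out[name] = value
        else:
            out[name] = [fill] * seq_len
    return out
```

Required columns raise on violation while optional families degrade to a declared default, so the loader never needs per-batch branching.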

Even the small runnable examples reinforce that point. The masking example is tiny, but it is conceptually important because it treats document structure as something the pipeline preserves and edits intentionally rather than as an accident of text concatenation.

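def mask_document_sections(tokens: list[str], mask_token: str = "<mask>") -> list[str]:
    """Replace every document-marker token (prefix "DOC_") with mask_token."""
    masked = []
    for token in tokens:
        # Document markers are masked; all other tokens pass through unchanged.
        if token.startswith("DOC_"):
            masked.append(mask_token)
        else:
            masked.append(token)
    return masked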

The point is not that this sample is a full trainer. The point is that the public example already encodes a loader-facing assumption: document markers survive far enough into preprocessing to be masked deterministically.

Enriched metadata turns the loader into a feature boundary

The public notes also make clear that "data" means more than token IDs. The pipeline description explicitly separates build-aware metadata from plain lexical chunks, and the semantic indexing note describes structure-aware metadata as a first-class export surface rather than a side channel. That changes what the loader boundary is responsible for.

Once batches may contain token-aligned structure IDs, chunk boundaries, or graph-derived relations, the loader is no longer a passive transport layer. It becomes the feature boundary between corpus construction and model consumption. That is why schema discipline matters more than format branding. You can store rows in Parquet, Arrow IPC, or another columnar format and still fail if the meaning of a metadata field is unstable across versions.
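A small sketch of why field meaning has to be part of the check, not just field name and type. `FieldSpec` and `incompatible_fields` are hypothetical names, and the `meaning` tag is an assumed convention for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    dtype: str
    meaning: str  # short semantic tag, e.g. "token-aligned structure id"


def incompatible_fields(producer: list[FieldSpec], consumer: list[FieldSpec]) -> list[str]:
    """Describe every consumer field the producer cannot satisfy."""
    produced = {spec.name: spec for spec in producer}
    problems = []
    for want in consumer:
        have = produced.get(want.name)
        if have is None:
            problems.append(f"{want.name}: missing from producer")
        elif have.dtype != want.dtype:
            problems.append(f"{want.name}: dtype {have.dtype} != {want.dtype}")
        elif have.meaning != want.meaning:
            # Same name and type but a different meaning is the silent failure case.
            problems.append(f"{want.name}: meaning changed to {have.meaning!r}")
    return problems
```

The last branch is the one a storage format cannot catch on its own: the column reads fine and the values are simply wrong for the consumer.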

That same split is why Parquet can still be useful upstream without becoming the model-facing contract. The local notes and the adjacent articles "C++ data versioning and schema" and "Converting Parquet token shards into Megatron indexed datasets" point to the same boundary: columnar Parquet snapshots are a good curation surface, while the loader-facing path wants a narrower typed handoff so schema repair does not turn into a per-batch CPU cost.

The compile-command example is a good illustration of why this boundary matters:

{
  "directory": "/workspace/build",
  "file": "src/indexer.cpp",
  "arguments": [
    "clang++",
    "-std=c++20",
    "-Iinclude",
    "-Igenerated",
    "-c",
    "src/indexer.cpp"
  ]
}

This is not training data by itself. It is build context. But it is exactly the kind of structured input that can be threaded into chunk metadata or later retrieval features. If that context is kept separate, typed, and pinned, it can enrich the corpus. If it is smeared into free-form text, it becomes hard to validate and harder to evolve.
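Keeping that context typed could look like the sketch below, which parses one compile-command entry of the shape shown above into a frozen record. `BuildContext` and `parse_entry` are hypothetical names, and the parser only handles the joined `-std=` and `-I<dir>` forms that appear in the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BuildContext:
    directory: str
    file: str
    standard: Optional[str]
    include_dirs: tuple[str, ...]


def parse_entry(entry: dict) -> BuildContext:
    """Extract a typed build context from one compile-command entry."""
    args = entry["arguments"]
    # Take the first -std=... flag, if any, and strip the prefix.
    standard = next((a[len("-std="):] for a in args if a.startswith("-std=")), None)
    # Collect joined -I<dir> flags; a bare "-I" followed by a separate path is ignored here.
    include_dirs = tuple(a[2:] for a in args if a.startswith("-I") and len(a) > 2)
    return BuildContext(entry["directory"], entry["file"], standard, include_dirs)
```

A frozen record like this can be validated and versioned alongside the rows, which free-form text cannot.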

Most failures in a pipeline like this are boundary failures

The public files do not claim access to every internal failure mode, and they do not need to. They already point to the likely weak points.

One weak point is split integrity. If a promoted snapshot does not define train and validation materialization rules clearly, later comparisons become meaningless.
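One common way to make split materialization explicit is to derive the split from a hash of a stable document identifier. The sketch below is an assumption about how such a rule could be written, not the rule the MegaCpp notes define; `assign_split`, `snapshot_salt`, and the default fraction are hypothetical.

```python
import hashlib


def assign_split(doc_id: str, snapshot_salt: str, val_fraction: float = 0.01) -> str:
    """Deterministically map a document to "train" or "validation"."""
    digest = hashlib.sha256(f"{snapshot_salt}:{doc_id}".encode("utf-8")).digest()
    # Compare integers so the boundary cases 0.0 and 1.0 behave exactly.
    bucket = int.from_bytes(digest[:8], "big")
    threshold = int(val_fraction * 2**64)
    return "validation" if bucket < threshold else "train"
```

Because the assignment depends only on the identifier and the salt, rerunning the pipeline or adding new documents never moves an existing document across the split.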

Another is metadata decoding. The more structure-aware fields a corpus carries, the more important it becomes to define canonical missing values and canonical field shapes before rows reach a model-facing loader. The Loader enriched columns sample is the compact local proof surface: missing or malformed optional metadata falls back to declared defaults instead of forcing special-case loader branches.

A third is resume compatibility in the broad sense: when a dataset snapshot evolves, consumers need a stable rule for what happens to old rows, new rows, and missing fields. The data-prep note's instruction to run schema and round-trip checks before promotion is a compact way of stating that requirement.
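As a rough sketch, a round-trip check writes rows out, reads them back, and compares the result. JSON stands in here for the real columnar format purely to keep the example dependency-free, and `round_trips` and `upgrade_row` are hypothetical helpers.

```python
import json


def round_trips(rows: list[dict]) -> bool:
    """Return True when rows survive a serialize/deserialize cycle unchanged."""
    encoded = [json.dumps(row, sort_keys=True) for row in rows]
    decoded = [json.loads(line) for line in encoded]
    return decoded == rows


def upgrade_row(row: dict, defaults: dict) -> dict:
    """Fill fields added by a newer schema so that old rows stay readable."""
    # Values already present in the row win over the declared defaults.
    return {**defaults, **row}
```

A row that stores a tuple fails this check because JSON reads it back as a list, which is the kind of quiet type drift the check exists to surface.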

| Failure surface | Publicly grounded signal | Why it matters |
| --- | --- | --- |
| floating inputs | Reference corpus pins forbids floating revisions | prevents irreproducible corpora |
| mixed metadata shapes | Semantic indexing notes treats structure metadata as explicit export data | avoids consumer ambiguity |
| lossy preprocessing | Data preparation notes separates normalization, dedup, and metadata extraction | keeps transformations inspectable |
| build-context drift | Compile commands fixture shows typed build inputs | keeps structure features reproducible |

What a robust SLM data pipeline should preserve

The public MegaCpp sample packs support a fairly strict checklist.

  • Pin every upstream repository, dataset, and tokenizer artifact to an exact revision.
  • Keep license metadata and provenance records as structured side data, not prose.
  • Deduplicate before chunking when possible.
  • Keep lexical content, build-aware metadata, and structure-aware metadata as distinct layers.
  • Validate snapshots with schema checks and at least one consumer smoke pass before promotion.
  • Treat missing metadata as a typed compatibility case, not as a reason for ad hoc loader branching.
  • Keep masking and similar preprocessing transforms deterministic and inspectable.

That is the useful summary. The important claim is not "this project uses Parquet" or "this project has metadata." The useful claim is that the published samples define a pipeline where pinned inputs become columnar training rows through explicit preprocessing, explicit metadata boundaries, and explicit compatibility checks. That is what makes the data path legible enough to debug.

FAQ

Frequently asked questions

Why keep Parquet in the pipeline if the loader contract starts later?
Because storage format and consumer contract solve different problems. The checked-in pipeline notes still want explicit columnar artifacts and promotion checks during curation, but the training-facing lane wants canonical packed rows with typed defaults rather than repeated schema repair on every batch. The shortest local proof surfaces for that split are Data preparation notes, Packed rows schema sample, and Loader enriched columns sample.
Why should masking be part of the row contract instead of a late loader trick?
Because masks have to stay aligned with tokens, chunk boundaries, and optional structure fields after packing or FIM-style transforms. The checked-in Masking pipeline excerpt remaps token-aligned metadata through the transform and drops chunks that cross the split, which is the same boundary this article is arguing for: make masking deterministic and inspectable before promotion, then let the loader consume the promoted row shape instead of recomputing ad hoc structure in the hot path. For the training-side rationale, see Documentation masking and curriculum.
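To make the remapping concrete, here is a minimal sketch of a FIM-style reorder that carries token-aligned metadata along and drops chunk spans that cross a split point. It is not the checked-in excerpt: the sentinel strings, the prefix/suffix/middle layout, and the names `fim_transform`, `structure_ids`, and `chunks` are assumptions for illustration.

```python
def fim_transform(tokens, structure_ids, chunks, start, end, fill=-1):
    """Reorder tokens as <PRE> prefix <SUF> suffix <MID> middle, keeping metadata aligned.

    chunks holds half-open (begin, stop) spans over the original token positions.
    """
    if len(tokens) != len(structure_ids):
        raise ValueError("structure_ids must be token-aligned")

    segments = [("<PRE>", 0, start), ("<SUF>", end, len(tokens)), ("<MID>", start, end)]
    new_tokens, new_ids, shifts = [], [], []
    for sentinel, lo, hi in segments:
        new_tokens.append(sentinel)
        new_ids.append(fill)  # sentinels carry the declared fill value
        # Original positions in [lo, hi) move by this offset in the new sequence.
        shifts.append((lo, hi, len(new_tokens) - lo))
        new_tokens.extend(tokens[lo:hi])
        new_ids.extend(structure_ids[lo:hi])

    new_chunks = []
    for begin, stop in chunks:
        for lo, hi, shift in shifts:
            # Keep a chunk only if it lies entirely inside one segment.
            if lo <= begin < stop <= hi:
                new_chunks.append((begin + shift, stop + shift))
                break
    return new_tokens, new_ids, new_chunks
```

Doing this once before promotion means the loader receives tokens, metadata, and chunk spans that already agree with each other.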
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

compile command database

The compilation-command manifest that grounds semantic indexing, compiler receipts, and structure-aware data enrichment.

Semantic indexing

The structure-aware indexing lane that turns compile commands and parsed symbols into reusable training metadata.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

Input pinning

The rule that source corpora and shard manifests are pinned before materialization so training rows can be reproduced exactly.

doc_ids

The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.

Data pipeline

An honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…