Building the C++ Training Data Pipeline: What Worked, What Broke
An honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and the quality gates that catch our own mistakes.

The most important data decision in MegaCpp is not a model hyperparameter. It is deciding what bytes the model is allowed to see, how those bytes are pinned, and what checks are required before a dataset snapshot is promoted into a real training lane.
This article focuses on the public engineering contract behind that pipeline.
Start with a small pinned operational slice
MegaCpp keeps a clear distinction between:
- the operational slice that is actively wired into training
- the catalog of additional sources that may become future inputs
That split matters because data-pipeline work fails differently from model work. If the team tries to ingest every interesting repository on day one, most of the debugging time is spent on storage, format drift, and tooling gaps rather than on quality.
The public rule is simple: keep the active training slice pinned to explicit revisions, and keep the larger catalog as metadata until it is needed.
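One way to enforce that rule mechanically is sketched below. The `SourcePin` record, its field names, and the rule "only a full 40-character commit hash counts as pinned" are assumptions made for illustration, not the MegaCpp manifest format.

```python
import re
from dataclasses import dataclass


# Hypothetical record; the real manifest layout is not described in this article.
@dataclass(frozen=True)
class SourcePin:
    name: str
    url: str
    revision: str      # expected: a full commit hash, never a branch or tag name
    spdx_license: str  # license metadata recorded at collection time


_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned(pin: SourcePin) -> bool:
    """True only when the revision looks like an immutable 40-hex commit id."""
    return _FULL_SHA.fullmatch(pin.revision) is not None


def split_catalog(pins: list[SourcePin]) -> tuple[list[SourcePin], list[SourcePin]]:
    """Separate entries eligible for the active slice from catalog-only entries."""
    active = [p for p in pins if is_pinned(p)]
    catalog_only = [p for p in pins if not is_pinned(p)]
    return active, catalog_only
```

With a check like this, a source that still points at a floating name such as `main` can stay in the catalog as metadata but cannot silently enter the training slice.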
The pipeline shape that survived
MegaCpp's pipeline can be summarized in five stages:
| Stage | Output | What must be true before promotion |
|---|---|---|
| collect | pinned public inputs | revision and license metadata recorded |
| normalize | cleaned source tree | encodings and obvious noise normalized |
| enrich | structure-aware records | provenance of syntax-only vs build-aware signals preserved |
| tokenize and store | explicit columnar artifacts | schema and token checks pass |
| verify | candidate training snapshot | round-trip decode and consumer smoke checks pass |
This is deliberately conservative. The pipeline is designed so that a broken promotion fails on a measurable check rather than surviving as a vague feeling that "the data looked fine."
What we filter before chunking
Three filters do most of the work:
- Language and structural filtering: keep the language mix intentional. Drop obvious binaries, blobs, and files that are clearly generated noise.
- License and provenance filtering: treat license metadata as structured data, not as a comment someone might remember to read later. SPDX expressions and REUSE-style headers are useful because they make this machine-readable.
- PII and secret scrubbing: secret-like tokens, direct personal addresses, and machine-local paths should be normalized before tokenization, not after.
The cheap filters should run before expensive enrichment. File size, line-length outliers, and low text density are not proof of corpus quality, but they catch obvious generated or non-source rows before tokenizer and build-aware stages spend time on them. Secret and PII scrubbing follows the same rule: normalize emails, IP addresses, API keys, SSH keys, and machine-local paths into stable placeholders before the shard is tokenized.
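A minimal sketch of both steps follows. The thresholds, placeholder names, and regular expressions are illustrative assumptions, not the pipeline's real rules; the patterns are deliberately simple and will produce both misses and false positives (the IP pattern, for example, also matches four-part version strings).

```python
import re

# Illustrative thresholds; the article does not state the real ones.
MAX_BYTES = 1_000_000
MAX_LINE_LEN = 1_000
MIN_TEXT_DENSITY = 0.8  # share of characters that are printable or whitespace


def passes_cheap_filters(text: str) -> bool:
    """Reject obvious generated or non-source rows before expensive stages."""
    if not text or len(text.encode("utf-8")) > MAX_BYTES:
        return False
    if max((len(line) for line in text.splitlines()), default=0) > MAX_LINE_LEN:
        return False
    printable = sum(ch.isprintable() or ch in "\n\t\r" for ch in text)
    return printable / len(text) >= MIN_TEXT_DENSITY


# Ordered rules: (pattern, stable placeholder). All of these are assumptions.
_SCRUB_RULES = [
    (re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
                re.S), "[PRIVATE_KEY]"),
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[API_KEY]"),  # one well-known key shape only
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP_ADDRESS]"),
    (re.compile(r"/(?:home|Users)/[^/\s\"']+"), "/[HOME]"),
]


def scrub(text: str) -> str:
    """Replace secret-like and personal strings with stable placeholders."""
    for pattern, placeholder in _SCRUB_RULES:
        text = pattern.sub(placeholder, text)
    return text
```

Because the placeholders are fixed strings, the same input always scrubs to the same output, which keeps later hashing and dedup steps deterministic.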
Repository-level license badges are too coarse for this job. Mixed-license trees, vendored directories, and exception clauses often diverge within the same checkout, so the promotion gate has to read machine-parseable file-level metadata rather than trusting one repository label.
A visible SPDX-License-Identifier header is still only the start of that
gate. Once a file carries a composite expression or an exception, promotion has
to parse the expression and validate that the referenced licenses or
exceptions are real SPDX identifiers, rather than treating header presence
alone as proof that the file is ready for ingestion.
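The sketch below shows what "parse the expression and validate the identifiers" can mean in practice. It is a small recursive-descent checker for the `AND` / `OR` / `WITH` / parenthesis forms of SPDX license expressions; the two identifier sets are a tiny hand-picked subset of real SPDX identifiers, and a real gate would load the full published license and exception lists instead. Operators are matched as uppercase keywords only, and `LicenseRef-` identifiers are not handled.

```python
import re

# Tiny illustrative subset; a real gate would load the full SPDX lists.
KNOWN_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-2.0-only", "GPL-3.0-or-later"}
KNOWN_EXCEPTIONS = {"LLVM-exception", "GCC-exception-3.1", "Classpath-exception-2.0"}

_TOKEN = re.compile(r"\s*(\(|\)|[A-Za-z0-9.+-]+)")


def _tokenize(expr: str) -> list[str]:
    expr, pos, tokens = expr.strip(), 0, []
    while pos < len(expr):
        match = _TOKEN.match(expr, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _parse_or(tokens: list[str], pos: int) -> int:
    pos = _parse_and(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "OR":
        pos = _parse_and(tokens, pos + 1)
    return pos


def _parse_and(tokens: list[str], pos: int) -> int:
    pos = _parse_with(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "AND":
        pos = _parse_with(tokens, pos + 1)
    return pos


def _parse_with(tokens: list[str], pos: int) -> int:
    if pos >= len(tokens):
        raise ValueError("unexpected end of expression")
    if tokens[pos] == "(":
        pos = _parse_or(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing closing parenthesis")
        return pos + 1
    token = tokens[pos]
    license_id = token[:-1] if token.endswith("+") else token  # "X+" means "X or later"
    if license_id not in KNOWN_LICENSES:
        raise ValueError(f"unknown license id: {token}")
    pos += 1
    if pos < len(tokens) and tokens[pos] == "WITH":
        if pos + 1 >= len(tokens) or tokens[pos + 1] not in KNOWN_EXCEPTIONS:
            raise ValueError("WITH must be followed by a known exception id")
        pos += 2
    return pos


def validate_spdx(expr: str) -> bool:
    """True when the whole expression parses and every identifier is known."""
    try:
        tokens = _tokenize(expr)
        return _parse_or(tokens, 0) == len(tokens)
    except ValueError:
        return False
```

For example, `validate_spdx("Apache-2.0 WITH LLVM-exception")` passes, while an expression with a dangling `AND` or a made-up exception name is rejected even though a header is present.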
The point is not to claim perfect safety. The point is to make the pipeline less likely to promote obviously bad inputs.
Deduplicate before you believe the token counts
Deduplication is valuable for code data, but it is easy to describe too strongly. MegaCpp uses dedup as a mitigation, not as proof that memorization risk or contamination risk is gone.
The safest public version of the claim is:
- exact duplicates should be removed before chunking
- near-duplicate handling is valuable for vendored and lightly modified code
- dedup helps training quality, but it does not by itself prove the corpus is safe
That wording is closer to what public code-model literature supports and avoids promising more than the data pipeline can actually guarantee.
The ordering matters too. If dedup runs after chunking, repeated libraries and boilerplate can be split into many orphan fragments that look artificially unique once they are detached from their original file context. Pre-chunk dedup keeps the unit of judgment at the file or snapshot level, so generated or vendored repeats are filtered before the packer turns them into training rows. Code deduplication at scale is the longer version of that trade-off.
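A file-level sketch of that ordering is shown below: exact duplicates are dropped by content hash, then near-duplicates by Jaccard similarity over token shingles, all before any chunking happens. The shingle size and the 0.8 threshold are illustrative assumptions, and the pairwise comparison is quadratic; at corpus scale an approximate scheme such as MinHash with locality-sensitive hashing is the usual substitute.

```python
import hashlib


def content_key(text: str) -> str:
    """Hash of whitespace-normalized content, used for exact-duplicate removal."""
    lines = text.replace("\r\n", "\n").split("\n")
    normalized = "\n".join(line.rstrip() for line in lines).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 5) -> set:
    """Set of k-token windows over a whitespace tokenization of the file."""
    tokens = text.split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0


def dedup_files(files: dict[str, str], near_threshold: float = 0.8) -> list[str]:
    """Return the paths kept after exact, then near-duplicate, filtering."""
    kept, seen_keys, kept_shingles = [], set(), []
    for path in sorted(files):  # deterministic order decides which copy survives
        text = files[path]
        key = content_key(text)
        if key in seen_keys:
            continue  # exact duplicate of a kept file
        candidate = shingles(text)
        if any(jaccard(candidate, other) >= near_threshold for other in kept_shingles):
            continue  # lightly modified copy, e.g. a vendored file
        seen_keys.add(key)
        kept.append(path)
        kept_shingles.append(candidate)
    return kept
```

The unit of judgment here is the whole file, which is the property the paragraph above argues for: a vendored copy is rejected once, instead of reappearing later as many fragments that each look unique.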
Why build-aware enrichment stays in the loop
For C++, plain lexical chunking is not enough. MegaCpp therefore keeps a structure-aware enrichment lane that can use build context when it exists and syntax-only structure when it does not. That is how the data story connects to the semantic-indexing story: broad coverage and semantic trust are different axes, and the pipeline records which one produced a given artifact.
The practical effect is that later training or evaluation code can treat a build-aware slice differently from a syntax-only slice instead of pretending they are the same kind of evidence.
The public artifact should not be a copied build database. It should be a sanitized build-context record: compiler family or mode, selected include roots or definitions, and the normalized translation-unit identity, with local directories removed. The checked-in compile commands context example shows the boundary between useful build signal and environment-specific command text.
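As a rough illustration of that boundary, the sketch below reduces one entry of a JSON compilation database (`directory`, `file`, and either `arguments` or `command`) to a sanitized record. The output field names are hypothetical, only `-I`, `-D`, and `-std=` are handled, and include paths outside the repository root are simply dropped; a real sanitizer would need to cover more flags such as `-isystem`.

```python
import posixpath
import shlex


def _relative_to_root(path: str, directory: str, repo_root: str):
    """Resolve against the entry's directory, then express relative to the repo root."""
    absolute = posixpath.normpath(posixpath.join(directory, path))
    rel = posixpath.relpath(absolute, repo_root)
    return None if rel == ".." or rel.startswith("../") else rel


def sanitize_entry(entry: dict, repo_root: str) -> dict:
    """Keep compiler, language mode, include roots, defines, and TU identity only."""
    directory = entry["directory"]
    args = entry.get("arguments") or shlex.split(entry.get("command", ""))
    include_roots, defines, std = [], [], None
    remaining = iter(args[1:])
    for arg in remaining:
        if arg in ("-I", "-D"):  # separated form: the value is the next argument
            arg += next(remaining, "")
        if arg.startswith("-I"):
            rel = _relative_to_root(arg[2:], directory, repo_root)
            if rel is not None:  # drop system or machine-local include paths
                include_roots.append(rel)
        elif arg.startswith("-D"):
            defines.append(arg[2:])
        elif arg.startswith("-std="):
            std = arg[len("-std="):]
    return {
        "compiler": posixpath.basename(args[0]) if args else None,
        "std": std,
        "include_roots": sorted(set(include_roots)),
        "defines": sorted(set(defines)),
        "translation_unit": _relative_to_root(entry["file"], directory, repo_root),
    }
```

The record carries no absolute paths and no raw command line, so it can be published alongside the shard without leaking the layout of the machine that produced it.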
Tokenization and storage are part of the contract
Tokenizer reproducibility is not just "use the same tokenizer name." The safer rule is:
- pin the tokenizer artifact by revision or saved files
- record special-token and normalization settings
- store the resulting dataset in an explicit schema
MegaCpp uses explicit columnar artifacts for this reason. Columnar storage is not the schema itself, but it is a good fit for large corpora because it keeps the stored contract visible: token columns, structure columns, metadata columns, and per-snapshot versioning.
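One way to make "pin the tokenizer artifact" checkable is to fingerprint the saved files together with the recorded settings and store that fingerprint in each snapshot. The sketch below assumes the tokenizer is available as a set of named byte blobs plus a settings dictionary; the function name and the example file and setting names are hypothetical.

```python
import hashlib
import json


def tokenizer_fingerprint(files: dict[str, bytes], settings: dict) -> str:
    """Stable digest over tokenizer files and settings, independent of dict order."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(files[name]).digest())
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    digest.update(canonical.encode("utf-8"))
    return digest.hexdigest()
```

Two snapshots tokenized under the same tokenizer name but with a different special-token list or normalization flag then produce different fingerprints, which is exactly the drift a name-only pin would miss.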
One useful refinement from the research lane is that persistence and the training-facing handoff do not need to be the same artifact. Parquet is a good long-lived storage envelope because additive nullable fields age well there, but the hot consumer path usually wants a narrower typed surface that carries row-core text, token columns, and only the small build context the loader actually consumes. That is the same boundary called out in C++ data versioning and schema and the downstream Converting Parquet token shards into Megatron indexed datasets handoff: rich cold metadata can stay in storage while the hot path reads a much smaller contract.
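The hot and cold split can be pictured as a projection from a wide stored record onto a narrow typed row. The `HotRow` type and every column name below are assumptions chosen for illustration; the actual loader contract is defined in the linked schema article, not here.

```python
from dataclasses import dataclass


# Hypothetical loader-facing surface; column names are illustrative.
@dataclass(frozen=True)
class HotRow:
    text: str
    token_ids: tuple
    include_roots: tuple


def to_hot_row(stored: dict) -> HotRow:
    """Project a wide stored record onto the narrow surface the loader reads."""
    build = stored.get("build_context") or {}  # nullable cold field may be absent
    return HotRow(
        text=stored["text"],                    # required: KeyError if missing
        token_ids=tuple(stored["token_ids"]),   # required: KeyError if missing
        include_roots=tuple(build.get("include_roots", ())),
    )
```

Extra stored fields are ignored rather than rejected, which is what lets additive nullable columns accumulate in storage without changing what the hot path sees.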
Long-context training made document masking non-optional
Once documents are packed into long sequences, document boundaries stop being a nice-to-have. They become part of the correctness story. A long packed row that does not preserve boundaries can quietly teach the model relationships between unrelated files.
That is why MegaCpp treats document masking as a first-class data contract. The public point is not one exact implementation. The public point is that packing, masking, and evaluation must agree about where a document ends.
That agreement needs an explicit boundary receipt the loader and masking path
can both read: doc_ids when the packed row keeps original document labels
directly, or segment_ids when a backend derives contiguous boundary labels
from them. An end-of-document token by itself is weaker, because a snapshot can
deserialize cleanly while different parts of the stack still disagree about
where one document stops and the next starts.
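The relationship between the two labelings can be sketched in a few lines. This is a plain-Python illustration of the idea, not the MegaCpp kernel path: per-token doc_ids are relabeled into contiguous segment_ids, and a causal mask then allows attention only within the same segment.

```python
def segment_ids_from_doc_ids(doc_ids: list) -> list:
    """Relabel per-token document ids as contiguous segment ids starting at 0."""
    segment_ids, current = [], -1
    previous = object()  # sentinel that never equals a real id
    for doc_id in doc_ids:
        if doc_id != previous:  # a change in doc id marks a document boundary
            current += 1
            previous = doc_id
        segment_ids.append(current)
    return segment_ids


def causal_document_mask(segment_ids: list) -> list:
    """mask[i][j] is True when token i may attend to token j."""
    n = len(segment_ids)
    return [
        [j <= i and segment_ids[i] == segment_ids[j] for j in range(n)]
        for i in range(n)
    ]
```

For a packed row with doc_ids `[17, 17, 17, 42, 42, 9]` the derived segment_ids are `[0, 0, 0, 1, 1, 2]`, and the first token of document 42 can attend only to itself. Note that in this sketch a document id that reappears later in the same row starts a new segment rather than rejoining the earlier one.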
The real moat is quality gates
The pipeline only becomes believable once promotion is blocked by explicit checks. MegaCpp's checks include:
- schema validation
- token-range and dtype validation
- round-trip decode checks
- sample-level sanity checks
- a small consumer smoke run before a snapshot is promoted
That last consumer smoke run is load-bearing. A shard can satisfy declared Parquet types and still fail the first real loader handoff if indexed-dataset headers, offsets, or dtype expectations drift. Promotion therefore has to prove both row-level validity and one downstream consumer read, not just that a generic table reader succeeded. C++ data versioning and schema covers the typed-row side of that contract; Converting Parquet token shards into Megatron indexed datasets is the adjacent boundary where those rows become a training artifact.
The exact thresholds may change. The idea should not. A dataset snapshot either survives promotion checks or it does not.
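A minimal sketch of the row-level part of such a gate is shown below. The required column names are hypothetical, the `decode` callable stands in for whatever pinned tokenizer the snapshot uses, and the consumer smoke run is intentionally not covered because it depends on the downstream loader.

```python
# Hypothetical required columns; the real schema is defined elsewhere.
REQUIRED_COLUMNS = {"text", "token_ids", "doc_id"}


def check_token_range(token_ids, vocab_size: int) -> bool:
    """Every token must be an integer inside [0, vocab_size)."""
    return all(isinstance(t, int) and 0 <= t < vocab_size for t in token_ids)


def run_promotion_gates(rows, vocab_size: int, decode) -> list[str]:
    """Return failure messages; an empty list means the row-level gates passed."""
    failures = []
    for index, row in enumerate(rows):
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            failures.append(f"row {index}: missing columns {sorted(missing)}")
            continue
        if not check_token_range(row["token_ids"], vocab_size):
            failures.append(f"row {index}: token id outside vocabulary")
            continue
        try:
            round_trip_ok = decode(row["token_ids"]) == row["text"]
        except Exception:  # a decode crash is a gate failure, not a pipeline crash
            round_trip_ok = False
        if not round_trip_ok:
            failures.append(f"row {index}: round-trip decode mismatch")
    return failures
```

The function returns evidence rather than a boolean, so a blocked promotion comes with the specific rows and checks that failed.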
What the public claim should be
The strongest defensible public claim is:
- the active corpus is pinned
- license and provenance metadata are recorded explicitly
- dedup happens before promotion
- build-aware enrichment is kept separate from syntax-only coverage
- tokenizer and dataset revisions are versioned
- long-context packing requires explicit document-boundary handling
- dataset snapshots must pass promotion checks before training uses them
That is a stronger and more useful story than listing many sources without explaining the contract that ties them together.
Frequently asked questions
Why do exact and near-dedup happen before chunking instead of after?
Why is schema-valid Parquet still not enough for promotion?
Why keep loader-core columns separate from richer enriched metadata?
Why is a visible SPDX header still not enough for promotion?
A file can carry an SPDX-License-Identifier line that still combines terms incorrectly or references identifiers and exceptions that do not parse cleanly, so promotion should validate the expression itself before the file is treated as a usable training input.