MegaCpp Engineering: Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published · 7 min read · David Gornshtein
Tags: Data, Pipeline, C++, Tokenizer, Quality

Building the C++ Training Data Pipeline: What Worked, What Broke

An honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and the quality gates that catch our own mistakes.


The most important data decision in MegaCpp is not a model hyperparameter. It is deciding what bytes the model is allowed to see, how those bytes are pinned, and what checks are required before a dataset snapshot is promoted into a real training lane.

This article focuses on the public engineering contract behind that pipeline.

Start with a small pinned operational slice

MegaCpp keeps a clear distinction between:

  • the operational slice that is actively wired into training
  • the catalog of additional sources that may become future inputs

That split matters because data-pipeline work fails differently from model work. If the team tries to ingest every interesting repository on day one, most of the debugging time is spent on storage, format drift, and tooling gaps rather than on quality.

The public rule is simple: keep the active training slice pinned to explicit revisions, and keep the larger catalog as metadata until it is needed.
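
A minimal sketch of that rule, assuming a hypothetical manifest format: the `SourceEntry` fields and the example URLs are illustrative, not the MegaCpp schema. It flags any active source whose revision is a branch name or short hash rather than a full commit hash, and it assumes 40-character SHA-1 object names.

```python
import re
from dataclasses import dataclass

# Hypothetical manifest entry; field names are assumptions for illustration.
@dataclass(frozen=True)
class SourceEntry:
    name: str
    url: str
    revision: str      # expected to be a full commit hash when active
    license_expr: str  # SPDX expression recorded at collection time
    active: bool       # True = operational slice, False = catalog only

# Assumes SHA-1 object names (40 lowercase hex characters).
FULL_SHA = re.compile(r"[0-9a-f]{40}")

def unpinned_active_sources(entries):
    """Return names of active entries whose revision is not a full commit hash."""
    return [e.name for e in entries
            if e.active and not FULL_SHA.fullmatch(e.revision)]

entries = [
    SourceEntry("libfoo", "https://example.invalid/libfoo.git",
                "0123456789abcdef0123456789abcdef01234567", "MIT", True),
    SourceEntry("libbar", "https://example.invalid/libbar.git",
                "main", "Apache-2.0", True),
    # Catalog-only entries are metadata, so an empty revision is tolerated.
    SourceEntry("libbaz", "https://example.invalid/libbaz.git",
                "", "BSL-1.0", False),
]

problems = unpinned_active_sources(entries)
```

Here `libbar` is flagged because `main` can move, while `libbaz` is ignored because it is not yet wired into training.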

The pipeline shape that survived

MegaCpp's pipeline can be summarized in five stages:

Stage              | Output                      | What must be true before promotion
collect            | pinned public inputs        | revision and license metadata recorded
normalize          | cleaned source tree         | encodings and obvious noise normalized
enrich             | structure-aware records     | provenance of syntax-only vs build-aware signals preserved
tokenize and store | explicit columnar artifacts | schema and token checks pass
verify             | candidate training snapshot | round-trip decode and consumer smoke checks pass

This is deliberately conservative. The pipeline is designed so that a broken promotion fails on a measurable check rather than surviving as a vague feeling that "the data looked fine."

What we filter before chunking

Three filters do most of the work:

  1. language and structural filtering
    Keep the language mix intentional. Drop obvious binaries, blobs, and files that are clearly generated noise.

  2. license and provenance filtering
    Treat license metadata as structured data, not as a comment someone might remember to read later. SPDX expressions and REUSE-style headers are useful because they make this machine-readable.

  3. PII and secret scrubbing
    Secret-like tokens, direct personal addresses, and machine-local paths should be normalized before tokenization, not after.

The cheap filters should run before expensive enrichment. File size, line-length outliers, and low text density are not proof of corpus quality, but they catch obvious generated or non-source rows before tokenizer and build-aware stages spend time on them. Secret and PII scrubbing follows the same rule: normalize emails, IP addresses, API keys, SSH keys, and machine-local paths into stable placeholders before the shard is tokenized.
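
The sketch below shows one way the cheap filters and the placeholder scrubbing could look. Every threshold, pattern, and placeholder name is an assumption for illustration. The patterns are deliberately narrow: IPv4 only, one well-known access-key prefix, and PEM-style private key blocks. A production scrubber would need a much broader rule set.

```python
import re

# Illustrative thresholds; real values would be tuned per corpus.
MAX_BYTES = 1_000_000
MAX_LINE_LEN = 1_000
MIN_ALNUM_FRACTION = 0.25

def passes_cheap_filters(text):
    """Reject obvious non-source rows before any expensive stage runs."""
    if not text or len(text.encode("utf-8")) > MAX_BYTES:
        return False
    if any(len(line) > MAX_LINE_LEN for line in text.splitlines()):
        return False
    alnum = sum(ch.isalnum() for ch in text)
    return alnum / len(text) >= MIN_ALNUM_FRACTION

# Order matters: multi-line key blocks are replaced before narrower patterns.
SCRUB_RULES = [
    (re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?"
                r"-----END [A-Z ]*PRIVATE KEY-----", re.DOTALL), "<PRIVATE_KEY>"),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<EMAIL>"),
    # Note: this also matches four-part version strings such as 1.2.3.4.
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IPV4>"),
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "<API_KEY>"),
    (re.compile(r"/(?:home|Users)/[^/\s\"']+"), "<HOME>"),
]

def scrub(text):
    """Replace secret-like and machine-local spans with stable placeholders."""
    for pattern, placeholder in SCRUB_RULES:
        text = pattern.sub(placeholder, text)
    return text

sample = ('contact: dev@example.com host 192.168.1.10 '
          'path "/home/alice/src/a.cpp" key AKIAABCDEFGHIJKLMNOP')
cleaned = scrub(sample)
```

Because the placeholders are fixed strings, scrubbing the same input twice gives the same result, which keeps dedup keys and token counts stable across reruns.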

Repository-level license badges are too coarse for this job. Mixed-license trees, vendored directories, and exception clauses often diverge within the same checkout, so the promotion gate has to read machine-parseable file-level metadata rather than trusting one repository label.

A visible SPDX-License-Identifier header is still only the start of that gate. Once a file carries a composite expression or an exception, promotion has to parse the expression and validate that the referenced licenses or exceptions are real SPDX identifiers, rather than treating header presence alone as proof that the file is ready for ingestion.
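
To make "parse the expression" concrete, here is a small recursive-descent sketch. It is not a full SPDX implementation: the identifier sets are a hand-picked subset (a real gate would load the complete SPDX license and exception lists), `LicenseRef-` identifiers are not handled, and operators are accepted in upper case only, which is an assumption of this sketch.

```python
import re

# Illustrative subset only; a real gate would load the full SPDX lists.
KNOWN_LICENSES = {s.lower() for s in (
    "MIT", "Apache-2.0", "BSD-3-Clause", "BSL-1.0",
    "GPL-2.0-only", "GPL-2.0-or-later", "LGPL-2.1-or-later")}
KNOWN_EXCEPTIONS = {s.lower() for s in (
    "LLVM-exception", "GCC-exception-3.1", "Classpath-exception-2.0")}
OPERATORS = {"AND", "OR", "WITH"}
TOKEN = re.compile(r"\s*(\(|\)|[A-Za-z0-9.+-]+)")

def tokenize(expr):
    tokens, pos, expr = [], 0, expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def is_valid_spdx(expr):
    """True if expr parses and every identifier is in the known sets."""
    try:
        tokens = tokenize(expr)
    except ValueError:
        return False
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def parse_or():            # lowest precedence
        ok = parse_and()
        while ok and peek() == "OR":
            take()
            ok = parse_and()
        return ok

    def parse_and():
        ok = parse_with()
        while ok and peek() == "AND":
            take()
            ok = parse_with()
        return ok

    def parse_with():          # WITH binds tightest
        tok = take()
        if tok == "(":
            return parse_or() and take() == ")"
        if tok is None or tok == ")" or tok in OPERATORS:
            return False
        license_id = tok[:-1] if tok.endswith("+") else tok
        if license_id.lower() not in KNOWN_LICENSES:
            return False
        if peek() == "WITH":
            take()
            exception = take()
            return exception is not None and exception.lower() in KNOWN_EXCEPTIONS
        return True

    return bool(tokens) and parse_or() and pos == len(tokens)
```

With this sketch, `Apache-2.0 WITH LLVM-exception` passes, while a dangling `MIT OR` or an unknown exception name fails even though a header line is present.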

The point is not to claim perfect safety. The point is to make the pipeline less likely to promote obviously bad inputs.

Deduplicate before you believe the token counts

Deduplication is valuable for code data, but it is easy to describe too strongly. MegaCpp uses dedup as a mitigation, not as proof that memorization risk or contamination risk is gone.

The safest public version of the claim is:

  • exact duplicates should be removed before chunking
  • near-duplicate handling is valuable for vendored and lightly modified code
  • dedup helps training quality, but it does not by itself prove the corpus is safe

That wording is closer to what public code-model literature supports and avoids promising more than the data pipeline can actually guarantee.

The ordering matters too. If dedup runs after chunking, repeated libraries and boilerplate can be split into many orphan fragments that look artificially unique once they are detached from their original file context. Pre-chunk dedup keeps the unit of judgment at the file or snapshot level, so generated or vendored repeats are filtered before the packer turns them into training rows. Code deduplication at scale is the longer version of that trade-off.
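
A file-level sketch of that ordering, under stated assumptions: exact duplicates are detected with a SHA-256 over whitespace-normalized content, and near duplicates with Jaccard similarity over three-line shingles. The threshold and the pairwise comparison are illustrative. At corpus scale the pairwise loop is usually replaced by something like MinHash with locality-sensitive hashing, which is not shown here.

```python
import hashlib

def content_key(text):
    """Hash a whitespace-normalized copy so trivial edits do not hide exact repeats."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def line_shingles(text, k=3):
    """Set of k consecutive non-empty lines, used as a cheap similarity signature."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < k:
        return {tuple(lines)}
    return {tuple(lines[i:i + k]) for i in range(len(lines) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0

def dedup_files(files, near_threshold=0.8):
    """files: list of (path, text). Keeps the first copy seen.

    Returns (kept_paths, dropped) where dropped holds (path, reason) pairs.
    """
    seen_keys, kept, kept_shingles, dropped = set(), [], [], []
    for path, text in files:
        key = content_key(text)
        if key in seen_keys:
            dropped.append((path, "exact"))
            continue
        shingles = line_shingles(text)
        if any(jaccard(shingles, other) >= near_threshold for other in kept_shingles):
            dropped.append((path, "near"))
            continue
        seen_keys.add(key)
        kept.append(path)
        kept_shingles.append(shingles)
    return kept, dropped

base = "\n".join(f"int f{i}() {{ return {i}; }}" for i in range(20))
files = [
    ("a/util.cpp", base),
    ("third_party/util.cpp", base + "  \n"),        # same file, trailing whitespace
    ("vendor/util.cpp", base + "\n// local patch"),  # lightly modified copy
    ("b/widget.cpp", "struct Widget {\n  int id;\n};\n"),
]
kept, dropped = dedup_files(files)
```

Only the survivors in `kept` would then be handed to the chunker, so a vendored copy never becomes a set of fragments that each look unique.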

Why build-aware enrichment stays in the loop

For C++, plain lexical chunking is not enough. MegaCpp therefore keeps a structure-aware enrichment lane that can use build context when it exists and syntax-only structure when it does not. That is how the data story connects to the semantic-indexing story: broad coverage and semantic trust are different axes, and the pipeline records which one produced a given artifact.

The practical effect is that later training or evaluation code can treat a build-aware slice differently from a syntax-only slice instead of pretending they are the same kind of evidence.

The public artifact should not be a copied build database. It should be a sanitized build-context record: compiler family or mode, selected include roots or definitions, and the normalized translation-unit identity, with local directories removed. The checked-in compile commands context example shows the boundary between useful build signal and environment-specific command text.
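
As an illustration of that boundary, the sketch below reduces one `compile_commands.json` entry (the Clang JSON compilation database format, with `directory`, `file`, and `command` or `arguments` fields) to a portable record. The output field names, the `<EXTERNAL>` and `<PATH>` placeholders, and the set of flags it keeps are assumptions of this sketch, and it assumes POSIX-style paths.

```python
import os
import shlex

def sanitize_entry(entry, project_root):
    """Reduce one compile_commands.json entry to a portable build-context record."""
    args = entry.get("arguments") or shlex.split(entry["command"])
    directory = entry["directory"]

    def rel(path):
        absolute = os.path.normpath(os.path.join(directory, path))
        relative = os.path.relpath(absolute, project_root)
        # Paths outside the project are machine-specific; keep only a marker.
        if relative == ".." or relative.startswith(".." + os.sep):
            return "<EXTERNAL>"
        return relative.replace(os.sep, "/")

    def clean_define(define):
        name, _, value = define.partition("=")
        # Definitions can smuggle local paths; keep the name, drop the path.
        return f"{name}=<PATH>" if "/" in value else define

    compiler = os.path.basename(args[0])
    if "clang" in compiler:
        family = "clang"
    elif "g++" in compiler or "gcc" in compiler:
        family = "gcc"
    else:
        family = "unknown"

    includes, defines, std = [], [], None
    it = iter(args[1:])
    for arg in it:
        if arg in ("-I", "-isystem"):
            value = next(it, None)
            if value is not None:
                includes.append(rel(value))
        elif arg.startswith("-I"):
            includes.append(rel(arg[2:]))
        elif arg == "-D":
            value = next(it, None)
            if value is not None:
                defines.append(clean_define(value))
        elif arg.startswith("-D"):
            defines.append(clean_define(arg[2:]))
        elif arg.startswith("-std="):
            std = arg[len("-std="):]

    return {
        "compiler_family": family,
        "std": std,
        "include_roots": sorted(set(includes)),
        "defines": sorted(defines),
        "translation_unit": rel(entry["file"]),
    }

entry = {
    "directory": "/home/alice/proj/build",
    "file": "../src/main.cpp",
    "command": "/usr/bin/clang++ -std=c++20 -I../include -isystem /opt/sdk/include "
               "-DNDEBUG -DDATA_DIR=/home/alice/data -c ../src/main.cpp -o main.o",
}
record = sanitize_entry(entry, "/home/alice/proj")
```

The record keeps the language mode, project-relative include roots, and definitions, while the user name, the absolute build directory, and the raw command text do not survive.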

Tokenization and storage are part of the contract

Tokenizer reproducibility is not just "use the same tokenizer name." The safer rule is:

  • pin the tokenizer artifact by revision or saved files
  • record special-token and normalization settings
  • store the resulting dataset in an explicit schema

MegaCpp uses explicit columnar artifacts for this reason. Columnar storage is not the schema itself, but it is a good fit for large corpora because it keeps the stored contract visible: token columns, structure columns, metadata columns, and per-snapshot versioning.
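
One way to make the tokenizer pin checkable is to fingerprint the saved files together with the recorded settings. The sketch below is an assumption about how that could be done, not a description of MegaCpp's manifest: the file names, the settings keys, and the idea of storing the digest next to the snapshot are all illustrative.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def tokenizer_fingerprint(files, settings):
    """Hash tokenizer file bytes plus settings into one stable identifier."""
    digest = hashlib.sha256()
    # Sort by file name so the result does not depend on argument order
    # or on where the files happen to be stored.
    for path in sorted((Path(p) for p in files), key=lambda p: p.name):
        data = path.read_bytes()
        digest.update(path.name.encode("utf-8") + b"\0")
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    # sort_keys makes the settings hash independent of dict ordering.
    digest.update(json.dumps(settings, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    vocab = Path(tmp) / "vocab.json"
    merges = Path(tmp) / "merges.txt"
    vocab.write_text('{"int": 0, "main": 1}', encoding="utf-8")
    merges.write_text("i n\nin t\n", encoding="utf-8")

    settings = {"special_tokens": ["<eod>"], "normalization": "none"}
    pinned = tokenizer_fingerprint([vocab, merges], settings)

    # Same content, different argument and key order: same fingerprint.
    reordered = tokenizer_fingerprint(
        [merges, vocab],
        {"normalization": "none", "special_tokens": ["<eod>"]})

    # A changed special-token list is a different tokenizer contract.
    changed = tokenizer_fingerprint(
        [vocab, merges],
        {"special_tokens": ["<eod>", "<pad>"], "normalization": "none"})
```

A snapshot that records this digest can refuse to load when the tokenizer on disk no longer matches, even if the tokenizer name is unchanged.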

One useful refinement from the research lane is that persistence and the training-facing handoff do not need to be the same artifact. Parquet is a good long-lived storage envelope because additive nullable fields age well there, but the hot consumer path usually wants a narrower typed surface that carries row-core text, token columns, and only the small build context the loader actually consumes. That is the same boundary called out in C++ data versioning and schema and the downstream Converting Parquet token shards into Megatron indexed datasets handoff: rich cold metadata can stay in storage while the hot path reads a much smaller contract.
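
A sketch of that split on the read side, with hypothetical column names: the required core columns fail loudly, optional enriched columns fall back to defaults, and anything else in the stored record is ignored by the hot path.

```python
from dataclasses import dataclass

# Hypothetical column names; the real row surface may differ.
REQUIRED = {"text": str, "token_ids": list}

@dataclass(frozen=True)
class HotRow:
    text: str
    token_ids: list
    build_mode: str = "syntax_only"   # or "build_aware" when build context existed
    include_roots: tuple = ()

def to_hot_row(record):
    """Project a rich stored record onto the narrow loader contract."""
    for name, expected in REQUIRED.items():
        if not isinstance(record.get(name), expected):
            raise ValueError(f"missing or mistyped required column: {name}")

    # Optional enriched columns: a malformed value degrades to the default
    # instead of failing the whole row.
    build_mode = record.get("build_mode")
    if build_mode not in ("syntax_only", "build_aware"):
        build_mode = "syntax_only"

    roots = record.get("include_roots")
    if isinstance(roots, (list, tuple)) and all(isinstance(r, str) for r in roots):
        roots = tuple(roots)
    else:
        roots = ()

    return HotRow(record["text"], record["token_ids"], build_mode, roots)

stored = {
    "text": "int x;",
    "token_ids": [5, 9, 2],
    "build_mode": "build_aware",
    "include_roots": ["include"],
    "license_expr": "MIT",        # cold metadata, not read by the hot path
}
row = to_hot_row(stored)

# A malformed optional field falls back to the default.
degraded = to_hot_row({"text": "int y;", "token_ids": [5, 7, 2], "include_roots": 42})
```

The design choice is that old snapshots without the enriched columns still load, while a shard missing `token_ids` is rejected immediately.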

Long-context training made document masking non-optional

Once documents are packed into long sequences, document boundaries stop being a nice-to-have. They become part of the correctness story. A long packed row that does not preserve boundaries can quietly teach the model relationships between unrelated files.

That is why MegaCpp treats document masking as a first-class data contract. The public point is not one exact implementation. The public point is that packing, masking, and evaluation must agree about where a document ends.

That agreement needs an explicit boundary receipt the loader and masking path can both read: doc_ids when the packed row keeps original document labels directly, or segment_ids when a backend derives contiguous boundary labels from them. An end-of-document token by itself is weaker, because a snapshot can deserialize cleanly while different parts of the stack still disagree about where one document stops and the next starts.
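
The relationship between the two labelings can be sketched in a few lines. This is a plain-Python illustration, not the TPU kernel path: it derives contiguous segment_ids from per-token doc_ids and builds a causal mask in which a token may only attend to earlier tokens of the same document.

```python
def segment_ids_from_doc_ids(doc_ids):
    """Relabel per-token document ids as contiguous segments 0, 1, 2, ..."""
    segments, current, previous = [], -1, object()
    for doc in doc_ids:
        if doc != previous:      # a change of document id starts a new segment
            current += 1
            previous = doc
        segments.append(current)
    return segments

def document_causal_mask(segment_ids):
    """mask[i][j] is True when token i may attend to token j."""
    n = len(segment_ids)
    return [[j <= i and segment_ids[i] == segment_ids[j] for j in range(n)]
            for i in range(n)]

# One packed row holding three documents of lengths 3, 2, and 1.
doc_ids = [907, 907, 907, 12, 12, 4411]
segment_ids = segment_ids_from_doc_ids(doc_ids)
mask = document_causal_mask(segment_ids)
```

In this example the first token of the second document (index 3) can attend only to itself, so nothing from the unrelated file before it leaks into its context.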

The real moat is quality gates

The pipeline only becomes believable once promotion is blocked by explicit checks. MegaCpp's checks include:

  • schema validation
  • token-range and dtype validation
  • round-trip decode checks
  • sample-level sanity checks
  • a small consumer smoke run before a snapshot is promoted

That last consumer smoke run is load-bearing. A shard can satisfy declared Parquet types and still fail the first real loader handoff if indexed-dataset headers, offsets, or dtype expectations drift. Promotion therefore has to prove both row-level validity and one downstream consumer read, not just that a generic table reader succeeded. C++ data versioning and schema covers the typed-row side of that contract; Converting Parquet token shards into Megatron indexed datasets is the adjacent boundary where those rows become a training artifact.
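
The row-level half of that gate can be sketched as a report of named checks. The check names, the dtype table, and the toy byte-level decoder are assumptions for illustration. The downstream consumer read is not modeled here, because it depends on the real loader.

```python
# Exclusive upper bound on ids that each storage dtype can represent.
DTYPE_LIMITS = {"uint16": 2**16, "int32": 2**31, "uint32": 2**32}

def byte_decode(token_ids):
    """Toy byte-level decoder standing in for the pinned tokenizer."""
    return bytes(token_ids).decode("utf-8")

def round_trips(row, decode):
    try:
        return decode(row["token_ids"]) == row["text"]
    except ValueError:           # includes UnicodeDecodeError
        return False

def promotion_report(rows, vocab_size, dtype, decode):
    """Return a dict of check name -> bool; promote only if all are True."""
    report = {"schema": False, "dtype": False,
              "token_range": False, "round_trip": False}
    report["schema"] = bool(rows) and all(
        isinstance(r.get("text"), str) and isinstance(r.get("token_ids"), list)
        for r in rows)
    if not report["schema"]:
        return report
    # The declared storage width must be able to hold every vocabulary id.
    report["dtype"] = vocab_size <= DTYPE_LIMITS.get(dtype, 0)
    report["token_range"] = all(
        isinstance(t, int) and 0 <= t < vocab_size
        for r in rows for t in r["token_ids"])
    if report["token_range"]:
        report["round_trip"] = all(round_trips(r, decode) for r in rows)
    return report

good = [{"text": "int x = 0;", "token_ids": list(b"int x = 0;")}]
out_of_range = [{"text": "a", "token_ids": [300]}]
drifted = [{"text": "int y;", "token_ids": list(b"int x;")}]

ok = promotion_report(good, 256, "uint16", byte_decode)
range_fail = promotion_report(out_of_range, 256, "uint16", byte_decode)
drift_fail = promotion_report(drifted, 256, "uint16", byte_decode)
width_fail = promotion_report(good, 70_000, "uint16", byte_decode)
```

Each failure names the check that blocked promotion, which is the difference between a measurable gate and a snapshot that merely looked fine.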

The exact thresholds may change. The idea should not. A dataset snapshot either survives promotion checks or it does not.

What the public claim should be

The strongest defensible public claim is:

  • the active corpus is pinned
  • license and provenance metadata are recorded explicitly
  • dedup happens before promotion
  • build-aware enrichment is kept separate from syntax-only coverage
  • tokenizer and dataset revisions are versioned
  • long-context packing requires explicit document-boundary handling
  • dataset snapshots must pass promotion checks before training uses them

That is a stronger and more useful story than listing many sources without explaining the contract that ties them together.

FAQ

Frequently asked questions

Why do exact and near-dedup happen before chunking instead of after?
Because post-chunk dedup can turn one repeated file into many semantically orphaned fragments. The safer place to decide uniqueness is before AST or packed-row chunking, while the file still carries its real provenance and local context. Code deduplication at scale is the longer version of that argument.

Why is schema-valid Parquet still not enough for promotion?
Because the first real consumer may still disagree about offsets, dtypes, or indexed-dataset headers even when the shard looks fine at the table level. Promotion has to prove both "the rows decode under the declared schema" and "the next loader-facing handoff still reads the same contract."

Why keep loader-core columns separate from richer enriched metadata?
Because the hot training path needs a narrow typed contract that every shard can satisfy, while richer build-aware fields need room to evolve without breaking old snapshots. The Packed rows schema sample shows the small required row surface; the Loader enriched columns sample shows the matching rule on the read side, where optional enriched columns stay defaultable instead of turning one malformed metadata field into a hard loader failure.

Why is a visible SPDX header still not enough for promotion?
Because the gate is trying to prove machine-readable licensing, not just comment presence. A file can expose an SPDX-License-Identifier line that still combines terms incorrectly or references identifiers and exceptions that do not parse cleanly, so promotion should validate the expression itself before the file is treated as a usable training input.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

compile command database

The compilation-command manifest that grounds semantic indexing, compiler receipts, and structure-aware data enrichment.

segment_ids

The fixed-width segment labeling used to preserve document boundaries without changing the TPU kernel shape.

doc_ids

The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.

Secrets

Modal's credential-injection surface for environment variables and access tokens, kept separate from both the pinned image and writable Volumes.

detached

A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.

Data pipeline

A grounded walkthrough of the MegaCpp data path: parquet shards, split logic, packed rows, metadata columns, and the interface choices documented in…
