MegaCpp Engineering: Applied C++ model systems


Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 12 min read • David Gornshtein

The C/C++ Data Preparation Pipeline, End to End

Q: Why keep the pipeline split into so many explicit stages?

Because the expensive failures happen at different boundaries. Near-dedup has different failure modes than license filtering, BOS-aware masking fails differently from token packing, and schema verification fails differently from semantic enrichment. Keeping those as explicit stages gives each one a reviewable output contract and lets the verify gate tell you whether the regression came from source selection, dedup, tokenization, or packing instead of giving you one opaque "data got worse" answer.

Q: Why is BOS-based document masking treated as a pipeline invariant?

Because the runtime side is allowed to trust it. The loader, packer, and any later structure-aware consumer all assume BOS marks real document starts, so the moment that invariant drifts, doc_ids, masks, and packed rows stop describing the same sample boundaries. That is how you get cross-document leakage that looks like a model bug but is really a producer bug.

Q: Why does this article emphasize schema tolerance so much?

Because producer code changes faster than long-lived consumers. The stable thing is the parquet contract, not whichever script emitted it this month. A tolerant loader plus a strict schema/verify gate lets you evolve non-essential fields, add enriched columns, or change stage internals without forcing every downstream reader to be rewritten or every older shard to be re-materialized.

Q: Why stage in parquet if the training run ultimately consumes .bin/.idx?

Because the two formats solve different jobs. The parquet stage is the typed columnar staging format where enrichment columns, provenance, and verification receipts stay inspectable; the Megatron indexed dataset is the narrower final read format the trainer streams. If you want that bridge spelled out end to end, keep Parquet to Megatron indexed dataset sample and Converting parquet token shards into Megatron indexed datasets nearby.

Q: Why keep the AST-aware chunker around if bucket labels already exist?

Because bucket labels are planning labels, not proof of a token-faithful split. The AST-aware lane gives you cleaner semantic boundaries and a tighter token budget than the old chars-per-token heuristic, which is exactly why nominal 4k shards from the heuristic path can drift into the mid-4K range after real tokenization. If the chunking question turns into "what structural context are we preserving?" rather than "how many tokens fit?", keep compile commands and semantic graphs and the Clang semantic indexer nearby.

Q: Why not run chunk-level dedup on every semantic unit?

Because some chunks are only meaningful inside their surrounding file context. Restricting chunk-level dedup to self-contained kinds such as bodies, declarations, typedefs, and namespaces avoids deleting preambles, forward declarations, or class members that later stages still need to interpret correctly. The more general dedup tradeoff is spelled out in Code Deduplication at Scale.

Q: Where do tokenizer, provenance, and eval posts fit into this pipeline?

They describe adjacent boundaries of the same data story. Inside the MegaCpp C++ tokenizer and Tokenizer evolution for C++ code explain the 131K vocabulary and why uint32 tokens are required. License Hygiene and Provenance for a C++ Training Corpus explains the pinning and SPDX side of stage 0 through stage 3. Eval harness plumbing and Verifier-first C++ evals are downstream consumers of the same contract: they only make sense if the promoted dataset snapshot is pinned, typed, and auditable.

Q: Which public examples are the fastest proof surfaces for this lane?

Start with Data preparation notes for the public pipeline shape, then Data and masking examples for the local example map, then the narrow stage receipts: Enriched JSONL record to parquet, Packed row builder example, Token-level enriched parquet materialization example, Prepare-format MegaCpp sample, and Parquet to Megatron indexed dataset sample.

Every stage of the MegaCpp data preparation pipeline: ingest, dedup, license filtering, document masking, tokenization, packed rows, and the checks that keep dataset snapshots trustworthy.

This is the implementation-focused view of how raw C/C++ source becomes packed training rows for MegaCpp. It is the sibling post to Building the C++ Training Data Pipeline: What Worked, What Broke: that one frames the design decisions, this one walks through the stages and the checks that keep them stable. It also pairs naturally with compile commands and semantic graphs, because the later semantic-enrichment stages only make sense if the preparation contract stays deterministic. For the source-selection side of the same lane, read Building a C/C++ corpus for training: what we keep, what we throw away, and why.

Why MegaCpp cares about this

The model only ever sees what the pipeline emits. A duplicated repo doubles the training weight of someone's preferred coding style. A missed license header bakes copyleft into the weights. A broken document mask lets one file leak attention into the next, and at 64K context that is the difference between repository reasoning and confabulation. The pipeline is the gatekeeper, and its quality gates are the only thing standing between a clean training run and a model that has memorized a well-known systems library surface instead of learning to write it.

Two engineering facts shape every decision below. First, MegaCpp's hybrid C++ tokenizer has 131,072 entries, which means token IDs no longer fit in uint16 and the on-disk format must use uint32; Inside the MegaCpp C++ tokenizer and Tokenizer evolution for C++ code are the companion posts if you want the vocabulary story behind that number. Second, the columnar dataset contract is the stable interface, while producer implementations may evolve. In this lane a columnar dataset contract means the exported field names, types, fallback values, and verification gates that every later loader or formatter is allowed to trust, even if the producer code changes. Loaders should be tolerant and producers should be replaceable, which is the same schema-discipline argument made in C++ Data Versioning and Schema.
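The uint16 limit is plain arithmetic, and a short sketch makes it concrete. The helper name `pack_tokens` and the little-endian layout below are illustrative assumptions, not the MegaCpp writer; only the vocabulary size comes from the text above.

```python
import struct

VOCAB_SIZE = 131_072            # from the article: 2**17 entries
MAX_TOKEN_ID = VOCAB_SIZE - 1   # 131_071
UINT16_MAX = 2**16 - 1          # 65_535

fits_uint16 = MAX_TOKEN_ID <= UINT16_MAX   # False: the top half of the vocab overflows
fits_uint32 = MAX_TOKEN_ID <= 2**32 - 1    # True


def pack_tokens(token_ids, fmt="I"):
    """Serialize token IDs as little-endian unsigned ints ("I" = 4 bytes, "H" = 2 bytes).

    struct raises struct.error for values outside the format's range, so an
    attempt to write a large token ID as uint16 fails instead of wrapping.
    """
    return struct.pack(f"<{len(token_ids)}{fmt}", *token_ids)


payload = pack_tokens([1, 42, MAX_TOKEN_ID])   # 3 tokens * 4 bytes each

try:
    pack_tokens([MAX_TOKEN_ID], fmt="H")       # hypothetical uint16 writer
    uint16_rejected = False
except struct.error:
    uint16_rejected = True
```

A writer that casts without range checking would wrap the ID silently, which is why the width is treated as part of the contract rather than an implementation detail.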

Public pipeline contract

The public-facing data-preparation pipeline has five numbered stages: download, tokenize, format, cache, verify. Underneath, the actual work spans a semantic chunker, dedup passes, enrichment jobs, and packing stages. Stage by stage:

Stage 0 - acquisition. Start from a pinned set of public C and C++ repositories at explicit revisions. A broader public source list can be tracked separately for future evaluation, but keeping the working set small makes it easier to debug data quality instead of spending early effort on ingestion overhead. The public source-selection contract is the same narrower one described in Building a C/C++ corpus for training: what we keep, what we throw away, and why and Reference corpus pinning notes.

Stage 1 - ingest and chunking. Two coexisting producers are common during transitions. The mainline chunker can split at function boundaries with an AST-aware budget, while an older path may split at top-level brace boundaries and budget by approximate token count. Both write normalized text records. Bucket labels such as 4k, 8k, and 16k should be treated as planning labels unless the producer enforces exact token budgets.

There is one trap in this stage that will bite anyone who skips the docs. Bucket names like 4k, 8k, 16k, 64k, 128k are target buckets, not exact-token guarantees. The legacy chunker budgets by a chars-per-token heuristic, which is wrong by 5-15% under the current tokenizer. A nominal 4k shard often contains documents that tokenize to 4400-4800 tokens. The strict producer lanes are exact-token-budgeted; the older ones are not. The loader will silently crop if you trust the bucket name as a contract.
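Re-measuring a bucket is cheap enough to sketch. The example below assumes a tokenizer exposed as a plain callable; `whitespace_tokenize` is only a stand-in, and the 4.0 chars-per-token ratio is an illustrative value, not the legacy chunker's actual setting.

```python
def heuristic_token_count(text, chars_per_token=4.0):
    """Legacy-style estimate: character length divided by an assumed ratio."""
    return int(len(text) / chars_per_token)


def over_budget(docs, tokenize, budget):
    """Return (index, real_token_count) for every document above the budget."""
    offenders = []
    for index, text in enumerate(docs):
        count = len(tokenize(text))
        if count > budget:
            offenders.append((index, count))
    return offenders


def whitespace_tokenize(text):
    # Stand-in tokenizer for the demo; swap in the real tokenizer's encode call.
    return text.split()


docs = ["a b c d e f g h", "int x;"]
budget = 4

estimate = heuristic_token_count(docs[0])                   # heuristic says it fits
offenders = over_budget(docs, whitespace_tokenize, budget)  # real count disagrees
```

The point of the check is the disagreement itself: the heuristic accepts the first document while the real token count does not, which is the same drift that pushes nominal 4k shards past their label.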

Stage 2 - dedup. Two passes, two scopes. First is within-corpus dedup: SHA-256 exact first, then MinHash-LSH near-dedup with 128 permutations, Jaccard 0.7, 5-token shingles. Second is cross-source dedup: provenance-aware grouping, 112-permutation LSH (14 bands of 8 rows, the MixMinMatch parameterization), Union-Find to cluster, plus optional chunk-level dedup restricted to self-contained semantic units. The whitelist matters (FUNC_BODY, CLASS_DECL, TYPEDEF, NAMESPACE); the blacklist matters more (OTHER, PREAMBLE, FUNC_SIGNATURE, CLASS_MEMBER, COMMENT). Deduping an #include block in isolation breaks surrounding code; deduping a forward declaration silently drops something downstream expects.
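The two passes can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the production dedup code: whitespace splitting stands in for the real tokenizer, the universal-hash MinHash construction and all function names are choices made for this example, and only the parameters (5-token shingles, 14 bands of 8 rows) come from the text above.

```python
import hashlib
import random
from collections import defaultdict

_P = (1 << 61) - 1  # Mersenne prime used for the universal hash family


def exact_dedup(docs):
    """Pass 1: keep the first document seen for each SHA-256 digest."""
    seen, kept = set(), []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


def shingles(tokens, k=5):
    """Overlapping k-token shingles; inputs shorter than k yield one shingle."""
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def make_coeffs(num_perm, seed=0):
    rng = random.Random(seed)
    return [(rng.randrange(1, _P), rng.randrange(_P)) for _ in range(num_perm)]


def minhash(shingle_set, coeffs):
    """Pass 2 sketch: one minimum per hash function over all shingles."""
    base = [int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
            for s in shingle_set]
    return [min((a * h + b) % _P for h in base) for a, b in coeffs]


def lsh_clusters(signatures, bands, rows):
    """Union-Find over documents that share at least one identical band."""
    parent = list(range(len(signatures)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    buckets = defaultdict(list)
    for doc, sig in enumerate(signatures):
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc)
    for members in buckets.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])
    return [find(i) for i in range(len(signatures))]


docs = ["int add(int a, int b) { return a + b; }"] * 2 + \
       ["struct Node { Node* next; int value; };"]
unique = exact_dedup(docs)                       # the byte-identical copy is dropped
coeffs = make_coeffs(14 * 8)                     # 112 hash functions
sigs = [minhash(shingles(d.split()), coeffs) for d in unique]
roots = lsh_clusters(sigs, bands=14, rows=8)     # one cluster root per document
```

With b bands of r rows, the candidate probability curve is steepest near (1/b)^(1/r); for 14 bands of 8 rows that is roughly 0.72, close to the 0.7 Jaccard threshold quoted for the within-corpus pass.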

Operational gotcha: the MinHash-LSH index is single-process and memory-bound. On the 27.6 M-document corpus we hit ~40 GB resident before tuning the shingle iterator to stream rather than materialize. The two-pass design is not optional - exact dedup removes 30-40% before the expensive near-dedup pass even starts. The operational reason for keeping those stages explicit is the same one described in dataset versions v2 to v6: producer evolution is manageable only when each stage has a stable, reviewable output contract. The concrete MinHash/LSH tradeoffs are broken out in Code Deduplication at Scale.

Stage 3 - license and quality filter. ScanCode-style license scan per file, accepting the permissive set plus weak copyleft, with Linux GPL-2.0 tagged so downstream mixes can opt in or out. Heuristic quality filters: max 1 MB per file, max line length 1000, min size 100 B, unique-lines ratio > 30%, comment-to-code ratio < 80%, strict extension whitelist. Auto-generated markers (// Generated by, DO NOT EDIT) are cheap regex wins. An entropy check above 4.5 bits/byte catches binary-in-ASCII dumps.
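The heuristic filters compose into a single reject-reason function. The sketch below uses the thresholds quoted above, but several details are assumptions made for illustration: 1 MB is read as 1,000,000 bytes, entropy is computed as order-0 byte entropy, and the comment-to-code ratio and extension whitelist are left out because they need language-aware parsing.

```python
import math
import re
from collections import Counter
from typing import Optional

GENERATED_MARKER = re.compile(r"//\s*Generated by|DO NOT EDIT")


def byte_entropy(data: bytes) -> float:
    """Order-0 Shannon entropy in bits per byte (an assumed estimator)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(n / total * math.log2(n / total) for n in Counter(data).values())


def reject_reason(text: str, max_entropy: float = 4.5) -> Optional[str]:
    """Return the first failing check by name, or None when the file passes."""
    data = text.encode("utf-8")
    lines = text.splitlines()
    if len(data) < 100:
        return "too_small"
    if len(data) > 1_000_000:
        return "too_large"
    if any(len(line) > 1000 for line in lines):
        return "long_line"
    if GENERATED_MARKER.search(text):
        return "generated"
    if len(set(lines)) / len(lines) <= 0.30:
        return "low_unique_lines"
    if byte_entropy(data) > max_entropy:
        return "high_entropy"
    return None


tiny = "int x;"
repetitive = "int x;\n" * 50
generated = "// Generated by protoc\n" + "\n".join(
    f"int v{i} = {i};" for i in range(20))

reasons = [reject_reason(t) for t in (tiny, repetitive, generated)]
```

Returning a named reason rather than a boolean keeps the stage output reviewable: a shift in the reject-reason histogram between snapshots points at which filter moved.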

PII and secret scrubbing runs before tokenization, not after. Email addresses become the synthetic marker <redacted-email>, network addresses become <redacted-network-address>, high-entropy strings get replaced with API_KEY_REDACTED, and any user paths that survived in source comments are normalized to <redacted-path>/. The order matters: scrubbing after tokenization means you have to round-trip through detokenize, which is fragile, and you lose the ability to fail closed on an unredacted token leak. The provenance and refusal-list side of that filter is the same policy described in License Hygiene and Provenance for a C++ Training Corpus and Reference corpus pinning notes. The checked-in normalization proof surfaces for that stage are Enriched record normalization example and Enriched JSONL record to parquet, which keep provenance-bearing fields explicit instead of burying them in prose.
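A sketch of the scrub-then-tokenize ordering, covering only the email and network-address markers named above. The regular expressions are deliberately simple assumptions (the IPv4 pattern will also match four-part version strings), and `tokenize_stage` is a hypothetical entry point that fails closed when handed unscrubbed text.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub(text):
    """Replace matches with the synthetic markers named in the article."""
    text = EMAIL.sub("<redacted-email>", text)
    return IPV4.sub("<redacted-network-address>", text)


def tokenize_stage(text, tokenize):
    """Refuse to tokenize anything that still looks unredacted (fail closed)."""
    if EMAIL.search(text) or IPV4.search(text):
        raise ValueError("unredacted PII reached the tokenizer")
    return tokenize(text)


raw = "// contact dev@example.com, host 10.0.0.1"
clean = scrub(raw)
tokens = tokenize_stage(clean, str.split)        # str.split is a stand-in tokenizer

try:
    tokenize_stage(raw, str.split)               # skipping the scrub must fail
    leak_blocked = False
except ValueError:
    leak_blocked = True
```

Running the check on text rather than token IDs is what makes the gate cheap; the same check after tokenization would need a detokenize round trip first.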

Stage 4 - doc-mask preparation. Document masking is not a separate file format in our pipeline; it is an invariant the producer respects so the consumer can recover boundaries cheaply. Every document gets a leading BOS token. That is the contract. The training loader infers doc_ids at runtime via a cumulative sum over BOS positions, which is O(T) per batch and requires zero storage-format change. The reason this is a stage at all: producers that pre-pack documents into rows must guarantee that BOS-aligned best-fit packing never inserts a document without a BOS, or the inferred doc_ids will silently merge two documents into one. We have hit this. Fix: a producer-side assert that the number of BOS positions in every packed row equals its num_docs value. The minimal public proof surfaces for that invariant are Document-mask segment IDs sample and Packed row builder example.
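The cumulative-sum inference and the producer-side check fit in a few lines. This is a plain-Python sketch of the idea, not the vectorized loader; the BOS id of 1 and the function names are assumptions for the example.

```python
from itertools import accumulate

BOS_ID = 1  # hypothetical BOS token id for this example


def infer_doc_ids(tokens, bos_id=BOS_ID):
    """0-based document index per token via a running count of BOS tokens.

    Tokens that appear before the first BOS get -1.
    """
    return [count - 1 for count in accumulate(int(t == bos_id) for t in tokens)]


def check_packed_row(tokens, num_docs, bos_id=BOS_ID):
    """Producer-side invariant: one BOS per packed document, no more, no fewer."""
    bos_count = sum(1 for t in tokens if t == bos_id)
    if bos_count != num_docs:
        raise AssertionError(f"row has {bos_count} BOS tokens but num_docs={num_docs}")


row = [1, 5, 6, 1, 7, 1, 8, 9]          # three documents, each starting with BOS
doc_ids = infer_doc_ids(row)
check_packed_row(row, num_docs=3)        # passes

broken = [1, 5, 6, 7, 1, 8, 9]           # second document lost its BOS
merged_ids = infer_doc_ids(broken)       # two documents collapse into doc 0
try:
    check_packed_row(broken, num_docs=3)
    caught = False
except AssertionError:
    caught = True
```

The broken row shows why the check lives on the producer side: the inferred ids are still well-formed, so nothing downstream can tell that two documents were merged.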

Stage 5 - tokenize. The tokenizer has its own writeup; pipeline-relevant facts: 131,072 entries, BOS-prepended per document, uint32 output. A pretokenized column here means the token IDs and token-aligned side arrays are already materialized offline into the shard, rather than reconstructed inside the training dataloader. The offline tokenization step emits those pretokenized columns and stores per-token character spans next to IDs. The spans bridge to enrichment columns (structure IDs, dep levels, AST features) that live at character level; without spans we fail closed rather than emit unaligned metadata. The narrow checked-in materialization surfaces for that handoff are Enriched JSONL record to parquet, Token-level enriched parquet materialization example, and Token chunk layout sample.
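How spans bridge character-level enrichment to tokens can be sketched as follows. The projection rule (take the label of the span's first character) and the treatment of zero-width spans such as BOS are assumptions made for this example; the real materializer may aggregate differently.

```python
def token_level_ids(spans, char_ids, default_id=0):
    """Project per-character structure IDs onto tokens using (start, end) spans.

    Raises instead of guessing when spans are missing or fall outside the
    document, which is the fail-closed behaviour described above.
    """
    if spans is None:
        raise ValueError("no character spans: refusing to emit unaligned metadata")
    out = []
    for start, end in spans:
        if not 0 <= start <= end <= len(char_ids):
            raise ValueError(f"span ({start}, {end}) falls outside the document")
        # Zero-width spans (for example a BOS token) get the default id.
        out.append(char_ids[start] if start < end else default_id)
    return out


text = "int x;"
char_ids = [1, 1, 1, 0, 2, 2]               # hypothetical structure id per character
spans = [(0, 0), (0, 3), (4, 5), (5, 6)]    # BOS, "int", "x", ";"
structure_ids = token_level_ids(spans, char_ids)

try:
    token_level_ids(None, char_ids)
    failed_closed = False
except ValueError:
    failed_closed = True
```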

Stage 6 - packed-row shard. Offline packing is the enriched-row packing stage, which takes per-document tokenized rows and repacks them into fixed-length training rows without truncation: best-fit decreasing, padded on the right, emitted as input_ids / target_ids / loss_mask plus document boundary metadata (pack_id, doc_ids, valid_token_count, num_docs, slack tokens, source provenance). The packed-row contract is the same runtime boundary described in Packed rows as the real training contract, and the runtime loader reads exactly those columns. The smallest checked-in row-contract surfaces are Packed rows schema sample, Packed row builder example, and Packed row example. Shard size is 50,000 docs per parquet file, 1024-row row-groups for fast random access, plus a validation shard carved off as the last 1% and a completion sentinel written when the producer is done.
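Best-fit decreasing is a standard bin-packing heuristic, and a small sketch shows the shape of the output row. The field subset, the pad id of 0, and the loss mask that simply marks non-pad positions are assumptions for illustration; target_ids, doc_ids, and provenance columns are omitted for brevity.

```python
def pack_best_fit_decreasing(docs, row_len, pad_id=0):
    """Pack token lists into fixed-length rows without truncating any document."""
    rows, free = [], []                      # documents per row, remaining capacity
    for doc in sorted(docs, key=len, reverse=True):
        if len(doc) > row_len:
            raise ValueError("document longer than row; this sketch never truncates")
        candidates = [i for i, capacity in enumerate(free) if capacity >= len(doc)]
        if candidates:
            best = min(candidates, key=lambda i: free[i])   # tightest row that fits
        else:
            rows.append([])
            free.append(row_len)
            best = len(rows) - 1
        rows[best].append(doc)
        free[best] -= len(doc)

    packed = []
    for pack_id, members in enumerate(rows):
        tokens = [t for doc in members for t in doc]
        valid = len(tokens)
        packed.append({
            "pack_id": pack_id,
            "input_ids": tokens + [pad_id] * (row_len - valid),   # right padding
            "loss_mask": [1] * valid + [0] * (row_len - valid),
            "valid_token_count": valid,
            "num_docs": len(members),
        })
    return packed


# Four documents, each starting with a hypothetical BOS id of 1.
docs = [[1, 10, 11, 12, 13], [1, 20, 21], [1, 30, 31, 32], [1, 40]]
packed = pack_best_fit_decreasing(docs, row_len=8)
```

In this example the 5-token and 3-token documents fill the first row exactly, and the remaining two share the second row with two pad tokens of slack.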

Stage 7 - format and verify. In production, the parquet shards are converted to Megatron's .bin/.idx pair through a deterministic formatter that prefers the standard indexed-dataset builder and falls back to a raw writer when that dependency is absent. uint32 token width is mandatory at 131K vocab. A verify gate here is the small set of explicit checks that must pass before a shard set is allowed to be promoted: artifact presence, parseable index, token range, and a narrow sample readback. The verify step is prepare_verify, which checks that the .bin/.idx pair exists and is non-empty, parses the index, asserts max(token_id) < vocab_size, prints the first 64 tokens of document zero, and returns non-zero on any failure. No silent fallbacks at verify time. The narrow checked-in bridge is visible in Parquet to Megatron indexed dataset sample, Prepare-format MegaCpp sample, and the storage-side explainer Converting parquet token shards into Megatron indexed datasets.
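The shape of such a gate can be sketched without the real tooling. The function below is not prepare_verify: it assumes the .bin file is a flat little-endian uint32 stream, does not parse the Megatron index format (it only checks that the .idx file is present and non-empty), and prints the first 64 tokens of the file rather than of document zero.

```python
import os
import struct
import sys
import tempfile


def verify_pair(bin_path, idx_path, vocab_size=131_072):
    """Return 0 when every check passes and 1 on the first failure."""
    for path in (bin_path, idx_path):
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            print(f"FAIL missing or empty: {path}", file=sys.stderr)
            return 1
    with open(bin_path, "rb") as handle:
        data = handle.read()        # fine for a sketch; stream in chunks for real shards
    if len(data) % 4 != 0:
        print("FAIL .bin size is not a multiple of 4 bytes", file=sys.stderr)
        return 1
    tokens = [t for (t,) in struct.iter_unpack("<I", data)]
    if max(tokens) >= vocab_size:
        print(f"FAIL token id {max(tokens)} >= vocab size {vocab_size}", file=sys.stderr)
        return 1
    print("first tokens:", tokens[:64])
    return 0


with tempfile.TemporaryDirectory() as tmp:
    bin_path = os.path.join(tmp, "sample.bin")
    idx_path = os.path.join(tmp, "sample.idx")
    with open(idx_path, "wb") as handle:
        handle.write(b"placeholder index")          # stand-in, not a real index
    with open(bin_path, "wb") as handle:
        handle.write(struct.pack("<3I", 1, 42, 131_071))
    ok_code = verify_pair(bin_path, idx_path)

    with open(bin_path, "wb") as handle:
        handle.write(struct.pack("<2I", 1, 131_072))   # one id past the vocab
    bad_code = verify_pair(bin_path, idx_path)

    missing_code = verify_pair(bin_path, os.path.join(tmp, "absent.idx"))
```

Returning an exit code rather than logging a warning is the part worth copying: a promotion script can then stop on the first failed shard set without parsing output.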

How it lands in MegaCpp

The lift is small because the contract is small. MegaCpp owns the public data-preparation pipeline plus the five numbered stages plus the Megatron .bin/.idx writer. Everything below stage 2 - the semantic indexer, the tokenizer, the enrichment materializer - is handled by separately maintained tooling at pinned versions. Vendoring those pieces into MegaCpp would duplicate several thousand lines of actively maintained tokenizer and indexer code; the dependency is the smaller cost.

What is being lifted as-is: the parquet schema, the tolerant loader contract, the BOS-based doc-mask inference, the offline packer, and the verify gate. What is being rewritten: the legacy flat-text producer is sunset in MegaCpp; only the strict producer with exact-token budgeting and pretokenized columns ships. What is being dropped: the uint16 binary dataset path. What is moving to a kernel path: nothing in data prep is on the kernel critical path; the structure-aware consumer side is where accelerator-friendly kernels matter. What remains a feature flag: the chunk-level dedup whitelist, because some public corpora benefit from preserving more raw context. On the training side, packed rows as the real training contract is the article that explains why that loader boundary matters more than any single upstream producer implementation.

The old multi-environment split that historically lived in separate launch paths is collapsed in MegaCpp to a single configurable data root. As long as the public data-preparation pipeline and the launcher agree on that root, no script edits are required to move data between environments. The sibling checked-in public operator view is Data preparation notes plus Prepare-format MegaCpp sample, which together record the stage handoff, artifact naming, and output-family boundary without depending on a private training-tree checkout.

Ablations and what we kept

The ablations that survived contact with real GPUs are not the headline ones. They are the boring ones.

The pretokenized-vs-char-level choice. We keep the pretokenized path because moving char-to-token alignment out of the hot loop and into offline materialization scales more cleanly to long-context loaders. The char-level path remains useful as an offline materialization input, not as the runtime contract.

The lazy-vs-eager segment materialization choice. We kept eager precomputation for the fully enriched path because relation metadata is cheaper to validate once per document than to rebuild repeatedly inside the row-pack hot loop. Partial-enriched configurations can still justify lazier materialization when the extra metadata is absent.

The document-mask implementation choice. We keep the vectorized masking path and avoid Python-loop handling in the hot path. The lesson is simple: document-boundary logic belongs in fixed-shape tensor operations, not in per-batch Python control flow.

The bottleneck dimension on the structure embedding path. The public contract keeps a narrow bottleneck because structure features must stay cheap enough to justify carrying them through the loader boundary.

The shape of MinHash-LSH itself we did not ablate; we adopted the bigcode parameterization (numPerm=128, threshold=0.7, shingleK=5) for within-corpus and the MixMinMatch parameterization for cross-source. Both have published evidence behind them and our role here is data engineering, not novel similarity research.

Production checklist

Pin all repository refs by tag, never by branch. Mirror raw clones to cold storage if absolute reproducibility matters.
Treat bucket names (4k, 16k, 64k) as targets, not contracts. Re-measure with the tokenizer in use when in doubt.
Run exact dedup before MinHash-LSH; the cheap pass removes 30-40% before the expensive pass starts.
Restrict chunk-level dedup to the whitelisted self-contained kinds. Never deduplicate preambles, forward declarations, or class members in isolation.
Scrub PII and secrets before tokenization, not after.
Every document gets a leading BOS. Producer-side assert num_docs == count(BOS positions) on every packed row.
uint32 token width at 131K vocab. uint16 is invalid and the verify gate must catch it.
The producer writes a completion sentinel only after the last shard is closed. Consumers must refuse incomplete directories.
Verify returns non-zero on any failure: missing index, parse error, out-of-vocab token, broken round-trip on document zero.
The training loader fails closed on wrong-length doc_ids, malformed token-structure arrays, or invalid valid_token_count. Optional metadata may fall back to deterministic defaults; required metadata must not.
A pipeline-level dashboard alerts on running-pod count, never on scheduled-pod count. We learned this the hard way during a Kubernetes ImagePullBackOff outage that produced zero data while reporting healthy.
Keep producer-revision labels separate from schema-version labels in launcher configs and in human writeups. Conflating them costs onboarding hours.
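
The row-level and directory-level items in this checklist can be sketched in a few lines. This is a minimal illustration, not the MegaCpp producer: `BOS_ID`, `VOCAB_SIZE = 131_072` (one reading of "131K"), the `_COMPLETE` sentinel filename, and every function name are assumptions made for the example.

```python
from pathlib import Path

BOS_ID = 1              # hypothetical BOS token id
VOCAB_SIZE = 131_072    # assumption: one reading of "131K"
SENTINEL = "_COMPLETE"  # hypothetical sentinel filename


def check_packed_row(tokens, num_docs):
    """Raise ValueError if the row breaks the BOS-count or vocab-range invariant."""
    bos_count = sum(1 for t in tokens if t == BOS_ID)
    if bos_count != num_docs:
        raise ValueError(f"num_docs={num_docs}, BOS count={bos_count}")
    if any(t < 0 or t >= VOCAB_SIZE for t in tokens):
        raise ValueError("out-of-vocab token id")


def token_width_ok(itemsize_bytes, vocab_size=VOCAB_SIZE):
    """True if an unsigned integer of this byte width can hold every token id."""
    return vocab_size - 1 <= 2 ** (8 * itemsize_bytes) - 1


def mark_complete(shard_dir):
    """Producer side: call only after the last shard has been closed."""
    (Path(shard_dir) / SENTINEL).touch()


def require_complete(shard_dir):
    """Consumer side: refuse a directory that lacks the sentinel."""
    if not (Path(shard_dir) / SENTINEL).is_file():
        raise RuntimeError(f"incomplete shard directory: {shard_dir}")
```

Under these assumptions `token_width_ok(2)` is false and `token_width_ok(4)` is true, which is the uint16 versus uint32 item restated as arithmetic.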

Pipeline snapshot

Stage	Input	Output	Gate
Ingest	raw repos	normalized docs	license allow-list
Dedup	normalized docs	unique docs	minhash-LSH threshold
License filter	unique docs	permissive subset	SPDX match
Doc-mask	permissive subset	docs + loss mask	schema check
Tokenize	masked docs	token streams	vocab coverage check
Pack	token streams	packed shards	row-validity contract

Single-stage rerun example:
- stage: pack
- slice: core_cpp
- input: tokenized shards
- output: packed shards
- row length: 8192

FAQ

Frequently asked questions

Why keep the pipeline split into so many explicit stages?

Because the expensive failures happen at different boundaries. Near-dedup has different failure modes than license filtering, BOS-aware masking fails differently from token packing, and schema verification fails differently from semantic enrichment. Keeping those as explicit stages gives each one a reviewable output contract and lets the verify gate tell you whether the regression came from source selection, dedup, tokenization, or packing instead of giving you one opaque "data got worse" answer.

Why is BOS-based document masking treated as a pipeline invariant?

Because the runtime side is allowed to trust it. The loader, packer, and any later structure-aware consumer all assume BOS marks real document starts, so the moment that invariant drifts, doc_ids, masks, and packed rows stop describing the same sample boundaries. That is how you get cross-document leakage that looks like a model bug but is really a producer bug.
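
To make the leakage concrete, here is a small sketch of how a consumer might derive per-token document ids and a document-aware causal mask from BOS positions. It is an illustration under assumptions, not the MegaCpp loader: `BOS_ID` and both function names are hypothetical.

```python
BOS_ID = 1  # hypothetical BOS token id


def doc_ids_from_bos(tokens, bos_id=BOS_ID):
    """Give each token the 0-based index of the document it belongs to.

    Tokens that appear before the first BOS get -1.
    """
    doc_ids = []
    current = -1
    for t in tokens:
        if t == bos_id:
            current += 1
        doc_ids.append(current)
    return doc_ids


def same_doc_causal_mask(doc_ids):
    """mask[i][j] is True when position i may attend to position j."""
    n = len(doc_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]


good = doc_ids_from_bos([1, 10, 11, 1, 12])   # two documents
drifted = doc_ids_from_bos([1, 10, 11, 12])   # second BOS was dropped
```

With the second BOS present, the token `12` cannot attend to `11`. With it dropped, every token lands in document 0 and the same position pair becomes visible, which is the cross-document leak described above.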

Why does this article emphasize schema tolerance so much?

Because producer code changes faster than long-lived consumers. The stable thing is the parquet contract, not whichever script emitted it this month. A tolerant loader plus a strict schema/verify gate lets you evolve non-essential fields, add enriched columns, or change stage internals without forcing every downstream reader to be rewritten or every older shard to be re-materialized.
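
A tolerant loader with a strict core can be sketched as follows. Only `doc_ids` and `valid_token_count` are names taken from this article; the `tokens` column name, the optional keys, and their defaults are hypothetical and stand in for whatever the real parquet contract specifies.

```python
# "tokens" is a hypothetical column name; doc_ids and valid_token_count come from the article.
REQUIRED = ("tokens", "doc_ids", "valid_token_count")

# Hypothetical optional metadata with deterministic fallbacks.
OPTIONAL_DEFAULTS = {"source_slice": "unknown", "bucket": "unbucketed"}


def load_row(record):
    """Validate required fields strictly, default optional ones, ignore the rest."""
    missing = [key for key in REQUIRED if key not in record]
    if missing:
        raise KeyError(f"missing required fields: {missing}")

    tokens = record["tokens"]
    doc_ids = record["doc_ids"]
    valid = record["valid_token_count"]
    if len(doc_ids) != len(tokens):
        raise ValueError("doc_ids length does not match tokens length")
    if not 0 <= valid <= len(tokens):
        raise ValueError("valid_token_count out of range")

    row = {key: record[key] for key in REQUIRED}
    for key, default in OPTIONAL_DEFAULTS.items():
        row[key] = record.get(key, default)
    return row  # unknown extra columns are dropped, not rejected
```

The design choice is that a producer can add an enrichment column without breaking this reader, while a row with a missing or inconsistent required field still fails closed.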

Why stage in parquet if the training run ultimately consumes .bin/.idx?

Because the two formats solve different jobs. The parquet stage is the typed columnar staging format where enrichment columns, provenance, and verification receipts stay inspectable; the Megatron indexed dataset is the narrower final read format the trainer streams. If you want that bridge spelled out end to end, keep Parquet to Megatron indexed dataset sample and Converting parquet token shards into Megatron indexed datasets nearby.

Why keep the AST-aware chunker around if bucket labels already exist?

Because bucket labels are planning labels, not proof of a token-faithful split. The AST-aware lane gives you cleaner semantic boundaries and a tighter token budget than the old chars-per-token heuristic, which is exactly why nominal 4k shards from the heuristic path can drift into the mid-4K range after real tokenization. If the chunking question turns into "what structural context are we preserving?" rather than "how many tokens fit?", keep compile commands and semantic graphs and the Clang semantic indexer nearby.
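
The drift is plain arithmetic once the heuristic is written down. The sketch below assumes "nominal 4k" means 4096 tokens and uses made-up chars-per-token ratios purely for illustration; neither ratio is a measured MegaCpp value.

```python
def realized_tokens(nominal_tokens, assumed_chars_per_token, actual_chars_per_token):
    """Token count a chunk really has when it was sized by a chars-per-token guess."""
    char_budget = nominal_tokens * assumed_chars_per_token
    return char_budget / actual_chars_per_token


# Illustrative ratios only: the heuristic assumes 4.0 chars/token,
# but the tokenizer actually averages 3.5 on this (hypothetical) slice.
drifted = realized_tokens(4096, 4.0, 3.5)
```

Here a chunk planned as 4096 tokens realizes roughly 4,680 tokens, because the character budget was computed with a ratio the tokenizer does not actually achieve.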

Why not run chunk-level dedup on every semantic unit?

Because some chunks are only meaningful inside their surrounding file context. Restricting chunk-level dedup to self-contained kinds such as bodies, declarations, typedefs, and namespaces avoids deleting preambles, forward declarations, or class members that later stages still need to interpret correctly. The more general dedup tradeoff is spelled out in Code Deduplication at Scale.
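
A kind-restricted dedup pass might look like the sketch below. The kind labels are hypothetical spellings of the categories named above, and exact SHA-256 matching stands in for whatever similarity test the real stage uses.

```python
import hashlib

# Hypothetical labels for the self-contained kinds named in the text.
SELF_CONTAINED_KINDS = {"body", "declaration", "typedef", "namespace"}


def dedup_chunks(chunks):
    """Drop repeated chunks, but only among self-contained kinds.

    chunks: iterable of (kind, text) pairs, in file order.
    Context-dependent kinds (preambles, forward declarations, class
    members, ...) always pass through untouched.
    """
    seen = set()
    kept = []
    for kind, text in chunks:
        if kind in SELF_CONTAINED_KINDS:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # duplicate of an earlier self-contained chunk
            seen.add(digest)
        kept.append((kind, text))
    return kept
```

Two identical `#include` preambles both survive this pass, while a function body repeated across files is kept only once.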

Where do tokenizer, provenance, and eval posts fit into this pipeline?

They describe adjacent boundaries of the same data story. Inside the MegaCpp C++ tokenizer and Tokenizer evolution for C++ code explain the 131K vocabulary and why uint32 tokens are required. License Hygiene and Provenance for a C++ Training Corpus explains the pinning and SPDX side of stage 0 through stage 3. Eval harness plumbing and Verifier-first C++ evals are downstream consumers of the same contract: they only make sense if the promoted dataset snapshot is pinned, typed, and auditable.

Which public examples are the fastest proof surfaces for this lane?

Start with Data preparation notes for the public pipeline shape, then Data and masking examples for the local example map, then the narrow stage receipts: Enriched JSONL record to parquet, Packed row builder example, Token-level enriched parquet materialization example, Prepare-format MegaCpp sample, and Parquet to Megatron indexed dataset sample.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

compile command database

The compilation-command manifest that grounds semantic indexing, compiler receipts, and structure-aware data enrichment.

Semantic indexing

The structure-aware indexing lane that turns compile commands and parsed symbols into reusable training metadata.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

doc_ids

The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Data pipeline

An honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…

Tokenizer

A deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and…

segment_ids

The fixed-width segment labeling used to preserve document boundaries without changing the TPU kernel shape.

MinHash

A compact sketch that approximates set overlap so large corpora can be deduplicated without comparing every full token or shingle set directly.

loss_mask

The per-token training mask that decides which positions contribute to loss after packing, FIM rearrangement, or documentation-aware masking.

LSH

Locality-sensitive hashing: a bucketing scheme that groups similar sketches so near-duplicate candidates can be found without exhaustive pairwise search.

Shingling

The step that turns text or token streams into overlapping k-grams so similarity can be estimated from shared local fragments rather than exact string identity.

valid_token_count

The per-row prefix length of non-pad tokens; runtimes use it as the cheap validity receipt instead of rescanning variable-length packed payloads.

Jaccard threshold

The overlap threshold used to decide when two shingle sets are similar enough to treat as near-duplicates during corpus cleanup.

Secrets

Modal's credential-injection surface for environment variables and access tokens, kept separate from both the pinned image and writable Volumes.
