The C/C++ Data Preparation Pipeline, End to End
Every stage of the MegaCpp data preparation pipeline: ingest, dedup, license filtering, document masking, tokenization, packed rows, and the checks that keep dataset snapshots trustworthy.

This is the implementation-focused view of how raw C/C++ source becomes packed training rows for MegaCpp. It is the sibling post to Building the C++ Training Data Pipeline: What Worked, What Broke; that one frames the design decisions, while this one walks through the stages and the checks that keep them stable. It also pairs naturally with compile commands and semantic graphs, because the later semantic-enrichment stages only make sense if the preparation contract stays deterministic. For the source-selection side of the same lane, read Building a C/C++ corpus for training: what we keep, what we throw away, and why.
Why MegaCpp cares about this
The model only ever sees what the pipeline emits. A duplicated repo doubles the training weight of someone's preferred coding style. A missed license header bakes copyleft into the weights. A broken document mask lets one file leak attention into the next, and at 64K context that is the difference between repository reasoning and confabulation. The pipeline is the gatekeeper, and its quality gates are the only thing standing between a clean training run and a model that has memorized a well-known systems library surface instead of learning to write it.
Two engineering facts shape every decision below. First, MegaCpp's hybrid C++ tokenizer is 131,072 entries, which means token IDs no longer fit in uint16 and the on-disk format must use uint32; Inside the MegaCpp C++ tokenizer and Tokenizer evolution for C++ code are the companion posts if you want the vocabulary story behind that number. Second, the columnar dataset contract is the stable interface, while producer implementations may evolve. In this lane a columnar dataset contract means the exported field names, types, fallback values, and verification gates that every later loader or formatter is allowed to trust, even if the producer code changes. Loaders should be tolerant and producers should be replaceable, which is the same schema-discipline argument made in C++ Data Versioning and Schema.
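The width requirement is plain arithmetic: 131,072 is 2^17, one bit past what uint16 can index. A minimal sketch of that check follows; the helper names token_width_bytes and encode_ids are ours for illustration and are not part of the pipeline.

```python
import struct

VOCAB_SIZE = 131_072  # tokenizer size quoted in this article (2**17)

def token_width_bytes(vocab_size: int) -> int:
    """Smallest unsigned on-disk width (2 or 4 bytes) that holds every token ID."""
    max_id = vocab_size - 1
    if max_id <= 0xFFFF:
        return 2  # uint16 covers IDs 0..65535
    if max_id <= 0xFFFFFFFF:
        return 4  # uint32
    raise ValueError("vocabulary does not fit in uint32")

def encode_ids(ids, width: int) -> bytes:
    """Little-endian encode; struct raises struct.error if an ID overflows the width."""
    fmt = "<%d%s" % (len(ids), "H" if width == 2 else "I")
    return struct.pack(fmt, *ids)
```

With these definitions, token_width_bytes(VOCAB_SIZE) returns 4, and encode_ids([131071], 2) raises struct.error instead of wrapping around, which is the loud failure a uint16 path should produce.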
Public pipeline contract
The public-facing data-preparation pipeline has five numbered stages: download, tokenize, format, cache, verify. Underneath, the actual work spans a semantic chunker, dedup passes, enrichment jobs, and packing stages. Stage by stage:
Stage 0 - acquisition. Start from a pinned set of public C and C++ repositories at explicit revisions. A broader public source list can be tracked separately for future evaluation, but keeping the working set small makes it easier to debug data quality instead of spending early effort on ingestion overhead. The public source-selection contract is the same narrower one described in Building a C/C++ corpus for training: what we keep, what we throw away, and why and Reference corpus pinning notes.
Stage 1 - ingest and chunking. Two coexisting producers are common during transitions. The mainline chunker can split at function boundaries with an AST-aware budget, while an older path may split at top-level brace boundaries and budget by approximate token count. Both write normalized text records. Bucket labels such as 4k, 8k, and 16k should be treated as planning labels unless the producer enforces exact token budgets.
There is one trap in this stage that will bite anyone who skips the docs. Bucket names like 4k, 8k, 16k, 64k, 128k are target buckets, not exact-token guarantees. The legacy chunker budgets by a chars-per-token heuristic, which is wrong by 5-15% under the current tokenizer. A nominal 4k shard often contains documents that tokenize to 4400-4800 tokens. The strict producer lanes are exact-token-budgeted; the older ones are not. The loader will silently crop if you trust the bucket name as a contract.
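Re-measuring a bucket is cheap to script. The sketch below assumes a budget table (the article does not say whether 4k means 4096 or 4000 tokens, so 4096 is our assumption) and takes the tokenizer as a plain callable. The whitespace counter is a stand-in for the demo, not the real tokenizer.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed budgets for illustration only.
BUCKET_BUDGETS: Dict[str, int] = {"4k": 4096, "8k": 8192, "16k": 16384}

def find_over_budget(
    docs: Sequence[str],
    bucket: str,
    count_tokens: Callable[[str], int],
) -> List[Tuple[int, int]]:
    """Return (doc_index, token_count) for every doc that exceeds the bucket budget."""
    budget = BUCKET_BUDGETS[bucket]
    over = []
    for index, text in enumerate(docs):
        n_tokens = count_tokens(text)
        if n_tokens > budget:
            over.append((index, n_tokens))
    return over

def whitespace_count(text: str) -> int:
    """Stand-in tokenizer for the demo; swap in the tokenizer actually in use."""
    return len(text.split())

if __name__ == "__main__":
    docs = ["x " * 4000, "x " * 4600]
    print(find_over_budget(docs, "4k", whitespace_count))  # [(1, 4600)]
```

Running an audit like this against a legacy shard shows how many documents a loader would crop if it trusted the label.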
Stage 2 - dedup. Two passes, two scopes. First is within-corpus dedup: SHA-256 exact first, then MinHash-LSH near-dedup with 128 permutations, Jaccard 0.7, 5-token shingles. Second is cross-source dedup: provenance-aware grouping, 112-permutation LSH (14 bands of 8 rows, the MixMinMatch parameterization), Union-Find to cluster, plus optional chunk-level dedup restricted to self-contained semantic units. The whitelist matters (FUNC_BODY, CLASS_DECL, TYPEDEF, NAMESPACE); the blacklist matters more (OTHER, PREAMBLE, FUNC_SIGNATURE, CLASS_MEMBER, COMMENT). Deduping an #include block in isolation breaks surrounding code; deduping a forward declaration silently drops something downstream expects.
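To make the banding concrete, here is a small stdlib-only sketch of 112-permutation MinHash split into 14 bands of 8 rows, with a union-find merge of any documents that share a band. It is an illustration of the technique, not the production code. The permutation family, the fixed seed, and all function names are our assumptions, and it skips both the provenance-aware grouping and any exact Jaccard verification of candidate pairs. For 14 bands of 8 rows the usual rule-of-thumb threshold (1/b)^(1/r) works out to roughly 0.72.

```python
import hashlib
import random
from typing import Dict, List, Sequence, Set, Tuple

BANDS, ROWS = 14, 8
NUM_PERM = BANDS * ROWS            # 112, as in the cross-source pass
SHINGLE_K = 5
_PRIME = (1 << 61) - 1
_rng = random.Random(0)            # fixed seed keeps signatures reproducible
_COEFFS = [(_rng.randrange(1, _PRIME), _rng.randrange(_PRIME)) for _ in range(NUM_PERM)]

def shingles(tokens: Sequence[str], k: int = SHINGLE_K) -> Set[str]:
    """Overlapping k-token shingles; inputs shorter than k yield a single shingle."""
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def _hash64(text: str) -> int:
    return int.from_bytes(hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest(), "big")

def minhash(shingle_set: Set[str]) -> Tuple[int, ...]:
    """One minimum per permutation, using (a*h + b) mod p as the hash family."""
    base = [_hash64(s) for s in shingle_set]
    return tuple(min((a * h + b) % _PRIME for h in base) for a, b in _COEFFS)

def find(parent: List[int], x: int) -> int:
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def cluster(docs: Sequence[Sequence[str]]) -> List[int]:
    """Return one cluster root per document; docs sharing any LSH band are merged."""
    parent = list(range(len(docs)))
    buckets: Dict[Tuple[int, Tuple[int, ...]], int] = {}
    for doc_index, tokens in enumerate(docs):
        sig = minhash(shingles(tokens))
        for band in range(BANDS):
            key = (band, sig[band * ROWS:(band + 1) * ROWS])
            first = buckets.setdefault(key, doc_index)
            ra, rb = find(parent, first), find(parent, doc_index)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)
    return [find(parent, i) for i in range(len(docs))]
```

Keeping the smallest index as the cluster root makes the surviving document deterministic across runs, which matters when the dedup output is itself a reviewed artifact.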
Operational gotcha: the MinHash-LSH index is single-process and memory-bound. On the 27.6 M-document corpus we hit ~40 GB resident before tuning the shingle iterator to stream rather than materialize. The two-pass design is not optional - exact dedup removes 30-40% before the expensive near-dedup pass even starts. The operational reason for keeping those stages explicit is the same one described in dataset versions v2 to v6: producer evolution is manageable only when each stage has a stable, reviewable output contract. The concrete MinHash/LSH tradeoffs are broken out in Code Deduplication at Scale.
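The two ideas in that paragraph, a cheap exact pass first and a shingle iterator that streams, can be sketched in a few lines. This is a minimal illustration under our own assumptions: the function names are hypothetical, and the article does not say whether text is normalized before hashing, so the sketch hashes the raw UTF-8 bytes.

```python
import hashlib
from collections import deque
from typing import Iterable, Iterator, Tuple

def exact_dedup(docs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (doc_id, text) for the first occurrence of each SHA-256 digest."""
    seen = set()
    for doc_id, text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc_id, text

def stream_shingles(tokens: Iterable[str], k: int = 5) -> Iterator[Tuple[str, ...]]:
    """Yield k-token shingles from a sliding window instead of building a full list.

    Inputs shorter than k yield nothing in this sketch.
    """
    window = deque(maxlen=k)
    for token in tokens:
        window.append(token)
        if len(window) == k:
            yield tuple(window)
```

Because both functions are generators, the near-dedup pass can consume exact_dedup output directly and fold each shingle into a running minimum without ever holding a document's full shingle set in memory.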
Stage 3 - license and quality filter. ScanCode-style license scan per file, accepting the permissive set plus weak copyleft, with Linux GPL-2.0 tagged so downstream mixes can opt in or out. Heuristic quality filters: max 1 MB per file, max line length 1000, min size 100 B, unique-lines ratio > 30%, comment-to-code ratio < 80%, strict extension whitelist. Auto-generated markers (// Generated by, DO NOT EDIT) are cheap regex wins. An entropy check above 4.5 bits/byte catches binary-in-ASCII dumps.
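The heuristic thresholds above translate directly into a per-file predicate. The sketch below is one possible reading, with several assumptions of ours: 1 MB is taken as 1,000,000 bytes, comment lines are detected by a crude prefix test, and entropy is byte-level Shannon entropy over the whole file. The article does not specify how any of these are measured, so treat the entropy threshold in particular as something to calibrate on real data.

```python
import math
import re
from collections import Counter
from typing import Optional

MAX_BYTES = 1_000_000            # "max 1 MB"; decimal megabyte assumed
MIN_BYTES = 100
MAX_LINE_LEN = 1000
MIN_UNIQUE_LINE_RATIO = 0.30
MAX_COMMENT_RATIO = 0.80
MAX_ENTROPY_BITS = 4.5
GENERATED_RE = re.compile(r"//\s*Generated by|DO NOT EDIT")

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte."""
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in Counter(data).values())

def reject_reason(text: str) -> Optional[str]:
    """Return the first failed check, or None if the file passes every filter."""
    data = text.encode("utf-8")
    if len(data) > MAX_BYTES:
        return "too_large"
    if len(data) < MIN_BYTES:
        return "too_small"
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    if max(len(ln) for ln in lines) > MAX_LINE_LEN:
        return "long_line"
    if GENERATED_RE.search(text):
        return "generated"
    if len(set(lines)) / len(lines) <= MIN_UNIQUE_LINE_RATIO:
        return "repetitive"
    comment_lines = sum(1 for ln in lines if ln.lstrip().startswith(("//", "/*", "*")))
    if comment_lines / len(lines) >= MAX_COMMENT_RATIO:
        return "mostly_comments"
    if byte_entropy(data) > MAX_ENTROPY_BITS:
        return "high_entropy"
    return None
```

Returning a reason string rather than a boolean keeps the rejection counts auditable per rule, which is useful when a threshold turns out to be too aggressive.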
PII and secret scrubbing runs before tokenization, not after. Email addresses become the synthetic marker <redacted-email>, network addresses become <redacted-network-address>, high-entropy strings get replaced with API_KEY_REDACTED, and any user paths that survived in source comments are normalized to <redacted-path>/. The order matters: scrubbing after tokenization means you have to round-trip through detokenize, which is fragile, and you lose the ability to fail closed on an unredacted token leak.
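A text-level scrubber along those lines might look like the sketch below. The marker strings come from this article; everything else is an assumption for illustration, including the regular expressions, the 32-character minimum, and the 4.5 bits-per-character cutoff for high-entropy strings. The IPv4 pattern will also match dotted version numbers, so a real scrubber needs more care than this.

```python
import math
import re
from collections import Counter

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
USER_PATH_RE = re.compile(r"(?:/home|/Users)/[^/\s]+/")
CANDIDATE_RE = re.compile(r"[A-Za-z0-9_\-]{32,}")
MIN_SECRET_BITS = 4.5  # assumed cutoff, bits per character

def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in Counter(text).values())

def _redact_if_random(match: "re.Match[str]") -> str:
    token = match.group(0)
    return "API_KEY_REDACTED" if char_entropy(token) > MIN_SECRET_BITS else token

def scrub(text: str) -> str:
    """Redact secrets, emails, user paths, and IPv4 addresses, then fail closed."""
    out = CANDIDATE_RE.sub(_redact_if_random, text)
    out = EMAIL_RE.sub("<redacted-email>", out)
    out = USER_PATH_RE.sub("<redacted-path>/", out)
    out = IPV4_RE.sub("<redacted-network-address>", out)
    if EMAIL_RE.search(out) or IPV4_RE.search(out):
        raise ValueError("scrubber left a match behind; refusing to emit the document")
    return out
```

The final check is the fail-closed step: because it runs on plain text, a leftover match can be detected with the same patterns, which is much harder once the text has been turned into token IDs.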
The provenance and refusal-list side of that filter is the same policy described in License Hygiene and Provenance for a C++ Training Corpus and Reference corpus pinning notes. The checked-in normalization proof surfaces for that stage are Enriched record normalization example and Enriched JSONL record to parquet, which keep provenance-bearing fields explicit instead of burying them in prose.
Stage 4 - doc-mask preparation. Document masking is not a separate file format in our pipeline; it is an invariant the producer respects so the consumer can recover boundaries cheaply. Every document gets a leading BOS token. That is the contract. The training loader infers doc_ids at runtime via a cumulative sum over BOS positions, which is O(T) per batch and requires zero storage-format change. The reason this is a stage at all: producers that pre-pack documents into rows must guarantee that BOS-aligned best-fit packing never inserts a document without a BOS, or the inferred doc_ids will silently merge two documents into one. We have hit this. Fix: a producer-side assert that the number of BOS positions in every packed row equals its num_docs value.
The minimal public proof surfaces for that invariant are Document-mask segment IDs sample and Packed row builder example.
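The BOS invariant and the producer-side check fit in a few lines. This is a pure-Python sketch with a hypothetical BOS_ID; in a tensor framework the inference step would be a single cumulative-sum call over a boolean mask, but the logic is the same.

```python
from itertools import accumulate
from typing import List, Sequence

BOS_ID = 1  # hypothetical BOS token ID for the example

def infer_doc_ids(row: Sequence[int], bos_id: int = BOS_ID) -> List[int]:
    """Cumulative sum over BOS positions: token i belongs to document doc_ids[i] (1-based)."""
    return list(accumulate(1 if token == bos_id else 0 for token in row))

def check_packed_row(row: Sequence[int], num_docs: int, bos_id: int = BOS_ID) -> None:
    """Producer-side guard: the BOS count must equal the recorded num_docs."""
    bos_count = sum(1 for token in row if token == bos_id)
    if bos_count != num_docs:
        raise AssertionError(f"row has {bos_count} BOS tokens but num_docs={num_docs}")

if __name__ == "__main__":
    row = [1, 10, 11, 1, 20, 1, 30, 31]
    print(infer_doc_ids(row))  # [1, 1, 1, 2, 2, 3, 3, 3]
    check_packed_row(row, num_docs=3)
```

If a document is packed without its BOS, infer_doc_ids simply extends the previous document's ID over it, which is exactly the silent merge described above. The check turns that into a hard failure at write time.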
Stage 5 - tokenize. The tokenizer has its own writeup; pipeline-relevant facts: 131,072 entries, BOS-prepended per document, uint32 output. A pretokenized column here means the token IDs and token-aligned side arrays are already materialized offline into the shard, rather than reconstructed inside the training dataloader. The offline tokenization step emits those pretokenized columns and stores per-token character spans next to IDs. The spans bridge to enrichment columns (structure IDs, dep levels, AST features) that live at character level; without spans we fail closed rather than emit unaligned metadata. The narrow checked-in materialization surfaces for that handoff are Enriched JSONL record to parquet, Token-level enriched parquet materialization example, and Token chunk layout sample.
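As an illustration of how spans bridge the two levels, the sketch below projects a character-level label array onto tokens. The choices here are our assumptions, not the documented behavior: each token takes the label of its first character, and zero-width spans (such as a prepended BOS) get a default label. What it does share with the description above is the fail-closed stance on missing or inconsistent spans.

```python
from typing import List, Optional, Sequence, Tuple

def project_to_tokens(
    char_labels: Sequence[int],
    spans: Optional[Sequence[Tuple[int, int]]],
    default_label: int = 0,
) -> List[int]:
    """Map character-level labels to one label per token via [start, end) spans."""
    if spans is None:
        raise ValueError("missing token spans; refusing to emit unaligned metadata")
    labels = []
    prev_end = 0
    for start, end in spans:
        # Spans must be ordered, non-overlapping, and inside the text.
        if not (prev_end <= start <= end <= len(char_labels)):
            raise ValueError(f"inconsistent span {(start, end)}")
        labels.append(default_label if start == end else char_labels[start])
        prev_end = end
    return labels

if __name__ == "__main__":
    # Text "int x;" with hypothetical structure IDs per character.
    char_labels = [1, 1, 1, 0, 2, 3]
    spans = [(0, 0), (0, 3), (4, 5), (5, 6)]  # BOS, "int", "x", ";"
    print(project_to_tokens(char_labels, spans))  # [0, 1, 2, 3]
```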
Stage 6 - packed-row shard. Offline packing is the enriched-row packing stage, which takes per-document tokenized rows and repacks them into fixed-length training rows without truncation: best-fit decreasing, padded on the right, emitted as input_ids / target_ids / loss_mask plus document boundary metadata (pack_id, doc_ids, valid_token_count, num_docs, slack tokens, source provenance).
The packed-row contract is the same runtime boundary described in Packed rows as the real training contract, and the runtime loader reads exactly those columns. The smallest checked-in row-contract surfaces are Packed rows schema sample, Packed row builder example, and Packed row example. Shard size is 50,000 docs per parquet file, 1024-row row-groups for fast random access, plus a validation shard carved off as the last 1% and a completion sentinel written when the producer is done.
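Best-fit decreasing is simple enough to show in full. The sketch below packs tokenized documents into fixed-length rows, pads on the right, and records a subset of the boundary metadata named above. PAD_ID and the function name are hypothetical. Deriving target_ids and loss_mask is left out because the article does not spell out their exact semantics.

```python
from typing import Dict, List, Sequence

PAD_ID = 0  # hypothetical pad token ID

def pack_best_fit_decreasing(
    docs: Sequence[Sequence[int]],
    row_len: int,
    pad_id: int = PAD_ID,
) -> List[Dict[str, object]]:
    """Pack documents into rows of row_len tokens without truncating any document."""
    for doc in docs:
        if len(doc) > row_len:
            raise ValueError("document longer than row; packing must not truncate")
    rows: List[List[Sequence[int]]] = []
    free: List[int] = []  # remaining capacity per row
    for doc in sorted(docs, key=len, reverse=True):
        best = -1
        for i, capacity in enumerate(free):
            # Best fit: the tightest row that still has room for this document.
            if len(doc) <= capacity and (best < 0 or capacity < free[best]):
                best = i
        if best < 0:
            rows.append([doc])
            free.append(row_len - len(doc))
        else:
            rows[best].append(doc)
            free[best] -= len(doc)
    packed = []
    for pack_id, (members, slack) in enumerate(zip(rows, free)):
        ids = [token for doc in members for token in doc]
        packed.append({
            "pack_id": pack_id,
            "input_ids": ids + [pad_id] * slack,
            "valid_token_count": len(ids),
            "num_docs": len(members),
            "slack_tokens": slack,
        })
    return packed
```

The linear scan over open rows is quadratic in the worst case. That is fine for a sketch, but a production packer would likely keep the open rows in a structure ordered by remaining capacity.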
Stage 7 - format and verify. In production, the parquet shards are converted to Megatron's .bin/.idx pair through a deterministic formatter that prefers the standard indexed-dataset builder and falls back to a raw writer when that dependency is absent. uint32 token width is mandatory at 131K vocab. A verify gate here is the small set of explicit checks that must pass before a shard set is allowed to be promoted: artifact presence, parseable index, token range, and a narrow sample readback. Verify is prepare_verify, which checks that the .bin/.idx files exist and are non-empty, parses the index, asserts max(token_id) < vocab_size, prints the first 64 tokens of document zero, and returns non-zero on any failure. No silent fallbacks at verify time. The narrow checked-in bridge is visible in Parquet to Megatron indexed dataset sample, Prepare-format MegaCpp sample, and the storage-side explainer Converting parquet token shards into Megatron indexed datasets.
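The shape of such a gate can be sketched without reproducing prepare_verify itself. The function below covers only artifact presence and the token-range check, and it assumes the .bin file is a flat little-endian uint32 token stream. Index parsing and the document-zero readback are omitted because the index layout is not described in this article. The function name is hypothetical.

```python
import os
import struct
import sys

VOCAB_SIZE = 131_072

def verify_token_file(bin_path: str, idx_path: str, vocab_size: int = VOCAB_SIZE) -> int:
    """Return 0 if the checks pass, 1 otherwise (suitable as a process exit code)."""
    for path in (bin_path, idx_path):
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            print(f"FAIL: missing or empty artifact: {path}", file=sys.stderr)
            return 1
    if os.path.getsize(bin_path) % 4 != 0:
        print("FAIL: .bin size is not a multiple of 4 bytes", file=sys.stderr)
        return 1
    max_id = -1
    with open(bin_path, "rb") as handle:
        while True:
            chunk = handle.read(4 * 65536)  # bounded memory, whole tokens per chunk
            if not chunk:
                break
            for (token_id,) in struct.iter_unpack("<I", chunk):
                max_id = max(max_id, token_id)
    if max_id >= vocab_size:
        print(f"FAIL: token {max_id} is out of range for vocab {vocab_size}", file=sys.stderr)
        return 1
    return 0
```

Every failure path returns non-zero and says why, so a wrapper script can pass the return value straight to sys.exit and block promotion.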
How it lands in MegaCpp
The lift is small because the contract is small. MegaCpp owns the public data-preparation pipeline plus the five numbered stages plus the Megatron .bin/.idx writer. Everything below stage 2 - the semantic indexer, the tokenizer, the enrichment materializer - is handled by separately maintained tooling at pinned versions. Vendoring those pieces into MegaCpp would duplicate several thousand lines of actively maintained tokenizer and indexer code; the dependency is the smaller cost.
What is being lifted as-is: the parquet schema, the tolerant loader contract, the BOS-based doc-mask inference, the offline packer, and the verify gate. What is being rewritten: the legacy flat-text producer is sunset in MegaCpp; only the strict producer with exact-token budgeting and pretokenized columns ships. What is being dropped: the uint16 binary dataset path. What is moving to a kernel path: nothing in data prep is on the kernel critical path; the structure-aware consumer side is where accelerator-friendly kernels matter. What remains a feature flag: the chunk-level dedup whitelist, because some public corpora benefit from preserving more raw context. On the training side, packed rows as the real training contract is the article that explains why that loader boundary matters more than any single upstream producer implementation.
The old multi-environment split that historically lived in separate launch paths is collapsed in MegaCpp to a single configurable data root. As long as the public data-preparation pipeline and the launcher agree on that root, no script edits are required to move data between environments. The sibling checked-in public operator view is Data preparation notes plus Prepare-format MegaCpp sample, which together record the stage handoff, artifact naming, and output-family boundary without depending on a private training-tree checkout.
Ablations and what we kept
The ablations that survived contact with real GPUs are not the headline ones. They are the boring ones.
The pretokenized-vs-char-level choice. We keep the pretokenized path because moving char-to-token alignment out of the hot loop and into offline materialization scales more cleanly to long-context loaders. The char-level path remains useful as an offline materialization input, not as the runtime contract.
The lazy-vs-eager segment materialization choice. We kept eager precomputation for the fully enriched path because relation metadata is cheaper to validate once per document than to rebuild repeatedly inside the row-pack hot loop. Partial-enriched configurations can still justify lazier materialization when the extra metadata is absent.
The document-mask implementation choice. We keep the vectorized masking path and avoid Python-loop handling in the hot path. The lesson is simple: document-boundary logic belongs in fixed-shape tensor operations, not in per-batch Python control flow.
The bottleneck dimension on the structure embedding path. The public contract keeps a narrow bottleneck because structure features must stay cheap enough to justify carrying them through the loader boundary.
The shape of MinHash-LSH itself we did not ablate; we adopted the bigcode parameterization (numPerm=128, threshold=0.7, shingleK=5) for within-corpus and the MixMinMatch parameterization for cross-source. Both have published evidence behind them and our role here is data engineering, not novel similarity research.
Production checklist
- Pin all repository refs by tag, never by branch. Mirror raw clones to cold storage if absolute reproducibility matters.
- Treat bucket names (4k, 16k, 64k) as targets, not contracts. Re-measure with the tokenizer in use when in doubt.
- Run exact dedup before MinHash-LSH; the cheap pass removes 30-40% before the expensive pass starts.
- Restrict chunk-level dedup to the whitelisted self-contained kinds. Never deduplicate preambles, forward declarations, or class members in isolation.
- Scrub PII and secretsQuick term guideSecretsModal's credential-injection surface for environment variables and access tokens, kept separate from both the pinned image and writable Volumes.GroundingAbout: Modal training platform overview Reference: Modal debugging playbook Reference: Modal Secrets docs before tokenization, not after.
- Every document gets a leading BOS. Producer-side assert num_docs == count(BOS positions) on every packed row.
- uint32 token width at 131K vocab. uint16 is invalid and the verify gate must catch it.
- The producer writes a completion sentinel only after the last shard is closed. Consumers must refuse incomplete directories.
- Verify returns non-zero on any failure: missing index, parse error, out-of-vocab token, broken round-trip on document zero.
- The training loader fails closed on wrong-length doc_ids, malformed token-structure arrays, or invalid valid_token_count. Optional metadata may fall back to deterministic defaults; required metadata must not.
- A pipeline-level dashboard alerts on running-pod count, never on scheduled-pod count. We learned this the hard way during a Kubernetes ImagePullBackOff outage that produced zero data while reporting healthy.
- Keep producer-revision labels separate from schema-version labels in launcher configs and in human writeups. Conflating them costs onboarding hours.
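The completion-sentinel item in the checklist is easy to get subtly wrong, so here is a minimal sketch of both sides. The sentinel filename _COMPLETE and both function names are hypothetical; the article only says that a sentinel exists and that consumers must refuse directories without one.

```python
import os
from pathlib import Path
from typing import List

SENTINEL_NAME = "_COMPLETE"  # hypothetical filename

def mark_complete(shard_dir: Path) -> None:
    """Producer side: call only after the last shard file has been closed."""
    tmp = shard_dir / (SENTINEL_NAME + ".tmp")
    tmp.write_text("ok\n")
    os.replace(tmp, shard_dir / SENTINEL_NAME)  # rename so the sentinel appears atomically

def list_shards(shard_dir: Path) -> List[Path]:
    """Consumer side: refuse to read a directory the producer has not finished."""
    if not (shard_dir / SENTINEL_NAME).is_file():
        raise RuntimeError(f"{shard_dir} has no completion sentinel; refusing to read")
    return sorted(shard_dir.glob("*.parquet"))
```

Writing to a temporary name and renaming means a consumer never observes a half-written sentinel, at least on filesystems where rename within a directory is atomic.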
Pipeline snapshot
| Stage | Input | Output | Gate |
|---|---|---|---|
| Ingest | raw repos | normalized docs | license allow-list |
| Dedup | normalized docs | unique docs | minhash-LSH threshold |
| License filter | unique docs | permissive subset | SPDX match |
| Doc-mask | permissive subset | docs + loss mask | schema check |
| Tokenize | masked docs | token streams | vocab coverage check |
| Pack | token streams | packed shards | row-validity contract |
Single-stage rerun example:
- stage: pack
- slice: core_cpp
- input: tokenized shards
- output: packed shards
- row length: 8192
Frequently asked questions
Why keep the pipeline split into so many explicit stages?
Why is BOS-based document masking treated as a pipeline invariant?
[...] doc_ids, masks, and packed rows stop describing the same sample boundaries. That is how you get cross-document leakage that looks like a model bug but is really a producer bug.
Why does this article emphasize schema tolerance so much?
Why stage in parquet if the training run ultimately consumes .bin/.idx?
Why keep the AST-aware chunker around if bucket labels already exist?
[...] 4k shards from the heuristic path can drift into the mid-4K range after real tokenization. If the chunking question turns into "what structural context are we preserving?" rather than "how many tokens fit?", keep compile commands and semantic graphs and the Clang semantic indexer nearby.
Why not run chunk-level dedup on every semantic unit?
Where do tokenizer, provenance, and eval posts fit into this pipeline?
[...] uint32 tokens are required. License Hygiene and Provenance for a C++ Training Corpus explains the pinning and SPDX side of stage 0 through stage 3. Eval harness plumbing and Verifier-first C++ evals are downstream consumers of the same contract: they only make sense if the promoted dataset snapshot is pinned, typed, and auditable.
Which public examples are the fastest proof surfaces for this lane?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Compile commands: The compilation-command manifest that grounds semantic indexing, compiler receipts, and structure-aware data enrichment.
Semantic indexing: The structure-aware indexing lane that turns compile commands and parsed symbols into reusable training metadata.
Packed rows: Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…
doc_ids: The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.
Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
Data pipeline: An honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…
Tokenizer: A deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and…
The fixed-width segment labeling used to preserve document boundaries without changing the TPU kernel shape.
MinHash: A compact sketch that approximates set overlap so large corpora can be deduplicated without comparing every full token or shingle set directly.
loss_mask: The per-token training mask that decides which positions contribute to loss after packing, FIM rearrangement, or documentation-aware masking.
LSH: Locality-sensitive hashing: a bucketing scheme that groups similar sketches so near-duplicate candidates can be found without exhaustive pairwise search.
Shingling: The step that turns text or token streams into overlapping k-grams so similarity can be estimated from shared local fragments rather than exact string identity.
valid_token_count: The per-row prefix length of non-pad tokens; runtimes use it as the cheap validity receipt instead of rescanning variable-length packed payloads.
Jaccard threshold: The overlap threshold used to decide when two shingle sets are similar enough to treat as near-duplicates during corpus cleanup.
Secrets: Modal's credential-injection surface for environment variables and access tokens, kept separate from both the pinned image and writable Volumes.