Published · 12 min read · David Gornshtein
Tags: Data, Pipeline, C++, Operations, Tokenizer

The C/C++ Data Preparation Pipeline, End to End

Every stage of the MegaCpp data preparation pipeline: ingest, dedup, license filtering, document masking, tokenization, packed rows, and the checks that keep dataset snapshots trustworthy.


This is the implementation-focused view of how raw C/C++ source becomes packed training rows for MegaCpp. It is the sibling post to Building the C++ Training Data Pipeline: What Worked, What Broke; that one frames the design decisions, while this one walks through the stages and the checks that keep them stable. It also pairs naturally with compile commands and semantic graphs, because the later semantic-enrichment stages only make sense if the preparation contract stays deterministic. For the source-selection side of the same lane, read Building a C/C++ corpus for training: what we keep, what we throw away, and why.

Why MegaCpp cares about this

The model only ever sees what the pipeline emits. A duplicated repo doubles the training weight of someone's preferred coding style. A missed license header bakes copyleft into the weights. A broken document mask lets one file leak attention into the next, and at 64K context that is the difference between repository reasoning and confabulation. The pipeline is the gatekeeper, and its quality gates are the only thing standing between a clean training run and a model that has memorized a well-known systems library surface instead of learning to write it.

Two engineering facts shape every decision below. First, MegaCpp's hybrid C++ tokenizer has 131,072 entries, which means token IDs no longer fit in uint16 (maximum value 65,535) and the on-disk format must use uint32; Inside the MegaCpp C++ tokenizer and Tokenizer evolution for C++ code are the companion posts if you want the vocabulary story behind that number. Second, the columnar dataset contract is the stable interface, while producer implementations may evolve. In this lane a columnar dataset contract means the exported field names, types, fallback values, and verification gates that every later loader or formatter is allowed to trust, even if the producer code changes. Loaders should be tolerant and producers should be replaceable, which is the same schema-discipline argument made in C++ Data Versioning and Schema.
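The width argument can be checked with a few lines of standard-library Python. This is a minimal sketch, not the pipeline's writer: the helper name pack_token_ids and the little-endian layout are assumptions made for illustration.

```python
import struct

VOCAB_SIZE = 131_072        # tokenizer size stated in the article (2**17)
UINT16_MAX = 2**16 - 1      # 65_535

def pack_token_ids(token_ids, fmt_char="I"):
    """Serialize token IDs little-endian; 'H' is uint16, 'I' is uint32.

    Hypothetical helper: the real on-disk layout is not specified here.
    """
    return struct.pack(f"<{len(token_ids)}{fmt_char}", *token_ids)

highest_id = VOCAB_SIZE - 1  # 131_071, the largest valid token ID

# uint16 cannot represent the top half of the vocabulary.
try:
    pack_token_ids([highest_id], "H")
    fits_uint16 = True
except struct.error:
    fits_uint16 = False

# uint32 holds every ID at a cost of 4 bytes per token.
payload = pack_token_ids([0, highest_id], "I")
```

The practical consequence is a doubling of bytes per token on disk, which is the price of any vocabulary above 65,536 entries.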

Public pipeline contract

The public-facing data-preparation pipeline has five numbered stages: download, tokenize, format, cache, verify. Underneath, the actual work spans a semantic chunker, dedup passes, enrichment jobs, and packing stages. Stage by stage:

Stage 0 - acquisition. Start from a pinned set of public C and C++ repositories at explicit revisions. A broader public source list can be tracked separately for future evaluation, but keeping the working set small makes it easier to debug data quality instead of spending early effort on ingestion overhead. The public source-selection contract is the same narrower one described in Building a C/C++ corpus for training: what we keep, what we throw away, and why and Reference corpus pinning notes.

Stage 1 - ingest and chunking. Two coexisting producers are common during transitions. The mainline chunker can split at function boundaries with an AST-aware budget, while an older path may split at top-level brace boundaries and budget by approximate token count. Both write normalized text records. Bucket labels such as 4k, 8k, and 16k should be treated as planning labels unless the producer enforces exact token budgets.

There is one trap in this stage that will bite anyone who skips the docs. Bucket names like 4k, 8k, 16k, 64k, 128k are target buckets, not exact-token guarantees. The legacy chunker budgets by a chars-per-token heuristic, which is wrong by 5-15% under the current tokenizer. A nominal 4k shard often contains documents that tokenize to 4400-4800 tokens. The strict producer lanes are exact-token-budgeted; the older ones are not. The loader will silently crop if you trust the bucket name as a contract.
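One way to make that trap visible is to audit real token counts against the nominal budget before a shard reaches the loader. The sketch below assumes you already have per-document token counts; the function name audit_bucket and the sample counts are hypothetical.

```python
def audit_bucket(token_counts, nominal_budget):
    """Summarize how far a shard's documents exceed its bucket label.

    token_counts: real token counts per document under the current tokenizer.
    nominal_budget: the budget implied by the bucket name (e.g. 4096 for "4k").
    """
    over = [n for n in token_counts if n > nominal_budget]
    worst = max(token_counts, default=0)
    overshoot = max(worst - nominal_budget, 0)
    return {
        "docs": len(token_counts),
        "over_budget": len(over),
        "worst_overshoot_pct": round(100.0 * overshoot / nominal_budget, 1),
        # Tokens a loader would drop if it cropped every row to the label.
        "tokens_cropped_if_trusted": sum(n - nominal_budget for n in over),
    }

# Illustrative counts only; not measurements from the real corpus.
report = audit_bucket([3900, 4096, 4400, 4800], nominal_budget=4096)
```

A non-zero over_budget count on a lane that claims exact-token budgeting is a producer bug; on a legacy lane it is the expected drift and a reason to re-chunk rather than crop.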

Stage 2 - dedup. Two passes, two scopes. First is within-corpus dedup: SHA-256 exact first, then MinHash-LSH near-dedup with 128 permutations, Jaccard 0.7, 5-token shingles. Second is cross-source dedup: provenance-aware grouping, 112-permutation LSH (14 bands of 8 rows, the MixMinMatch parameterization), Union-Find to cluster, plus optional chunk-level dedup restricted to self-contained semantic units. The whitelist matters (FUNC_BODY, CLASS_DECL, TYPEDEF, NAMESPACE); the blacklist matters more (OTHER, PREAMBLE, FUNC_SIGNATURE, CLASS_MEMBER, COMMENT). Deduping an #include block in isolation breaks surrounding code; deduping a forward declaration silently drops something downstream expects.
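The banding and clustering step of the cross-source pass can be sketched as follows, assuming signatures of 112 integers per document have already been computed. The class and function names are hypothetical, and the sketch merges every LSH candidate directly; a production pass would normally verify candidates against a real similarity estimate before merging.

```python
from collections import defaultdict

BANDS, ROWS = 14, 8   # 14 * 8 = 112 signature slots, as in the article

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)

def cluster_candidates(signatures):
    """signatures: dict mapping doc_id -> sequence of BANDS * ROWS ints."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        if len(sig) != BANDS * ROWS:
            raise ValueError(f"{doc_id}: expected {BANDS * ROWS} slots")
        for band in range(BANDS):
            rows = tuple(sig[band * ROWS:(band + 1) * ROWS])
            buckets[(band, rows)].append(doc_id)

    uf = UnionFind()
    for doc_id in signatures:
        uf.find(doc_id)                   # register singletons too
    for members in buckets.values():
        for other in members[1:]:         # docs sharing any band are candidates
            uf.union(members[0], other)

    clusters = defaultdict(list)
    for doc_id in signatures:
        clusters[uf.find(doc_id)].append(doc_id)
    return sorted(sorted(group) for group in clusters.values())
```

With b bands of r rows, two documents with true Jaccard similarity s share at least one band with probability 1 - (1 - s^r)^b, which is the standard LSH banding result and the reason the band/row split is a tuning decision rather than a detail.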

Operational gotcha: the MinHash-LSH index is single-process and memory-bound. On the 27.6 M-document corpus we hit ~40 GB resident before tuning the shingle iterator to stream rather than materialize. The two-pass design is not optional - exact dedup removes 30-40% before the expensive near-dedup pass even starts. The operational reason for keeping those stages explicit is the same one described in dataset versions v2 to v6: producer evolution is manageable only when each stage has a stable, reviewable output contract. The concrete MinHash/LSH tradeoffs are broken out in Code Deduplication at Scale.
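A minimal sketch of the two-pass idea with a streaming shingle iterator is shown below. It is not the production index: seeded BLAKE2b hashes stand in for the permutations, whitespace splitting stands in for the real tokenizer, and all function names are assumptions.

```python
import hashlib
from collections import deque

NUM_PERM = 128     # within-corpus setting from the article
SHINGLE_K = 5      # 5-token shingles
_MAX64 = (1 << 64) - 1

def exact_dedup(docs):
    """Pass 1: drop byte-identical documents by SHA-256 digest."""
    seen, survivors = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            survivors.append(doc)
    return survivors

def iter_shingles(tokens, k=SHINGLE_K):
    """Yield k-token shingles one at a time; only k tokens are held in memory."""
    window = deque(maxlen=k)
    for tok in tokens:
        window.append(tok)
        if len(window) == k:
            yield " ".join(window)

def _hash64(data, seed):
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(tokens, num_perm=NUM_PERM):
    """Pass 2: fold each shingle into the running minima, never storing the set."""
    sig = [_MAX64] * num_perm
    for shingle in iter_shingles(tokens):
        data = shingle.encode("utf-8")
        for i in range(num_perm):
            h = _hash64(data, i)
            if h < sig[i]:
                sig[i] = h
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents shorter than the shingle width produce no shingles and therefore identical empty signatures, so a real pass needs a separate rule for very short inputs.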

Stage 3 - license and quality filter. ScanCode-style license scan per file, accepting the permissive set plus weak copyleft, with Linux GPL-2.0 tagged so downstream mixes can opt in or out. Heuristic quality filters: max 1 MB per file, max line length 1000, min size 100 B, unique-lines ratio > 30%, comment-to-code ratio < 80%, strict extension whitelist. Auto-generated markers (// Generated by, DO NOT EDIT) are cheap regex wins. An entropy check above 4.5 bits/byte catches binary-in-ASCII dumps.
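The heuristic thresholds above translate into a small rejection function. This is a sketch under stated assumptions: the extension list is hypothetical, 1 MB is read as 1,000,000 bytes, and the comment-to-code ratio check is left out because it needs a real lexer to be meaningful.

```python
import math
import os
import re
from collections import Counter

# Hypothetical allow-list; the article only says the whitelist is strict.
ALLOWED_EXT = {".c", ".h", ".cc", ".cpp", ".cxx", ".hpp", ".hh", ".hxx"}
GENERATED_RE = re.compile(r"//\s*Generated by|DO NOT EDIT")

def byte_entropy(data):
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def reject_reason(path, text):
    """Return a short reason string, or None if the file passes every check."""
    if os.path.splitext(path)[1].lower() not in ALLOWED_EXT:
        return "extension"
    data = text.encode("utf-8")
    if len(data) > 1_000_000:
        return "too_large"
    if len(data) < 100:
        return "too_small"
    lines = text.splitlines()
    if any(len(line) > 1000 for line in lines):
        return "long_line"
    nonblank = [line.strip() for line in lines if line.strip()]
    if not nonblank or len(set(nonblank)) / len(nonblank) <= 0.30:
        return "low_unique_lines"
    if GENERATED_RE.search(text):
        return "generated"
    if byte_entropy(data) > 4.5:
        return "high_entropy"
    return None
```

Returning a reason rather than a boolean keeps the filter auditable: per-reason rejection counts are what tell you which threshold is doing the work on a new source.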

PII and secret scrubbing runs before tokenization, not after. Email addresses become the synthetic marker <redacted-email>, network addresses become <redacted-network-address>, high-entropy strings get replaced with API_KEY_REDACTED, and any user paths that survived in source comments are normalized to <redacted-path>/. The order matters: scrubbing after tokenization means you have to round-trip through detokenize, which is fragile, and you lose the ability to fail closed on an unredacted token leak. The provenance and refusal-list side of that filter is the same policy described in License Hygiene and Provenance for a C++ Training Corpus and Reference corpus pinning notes. The checked-in normalization proof surfaces for that stage are Enriched record normalization example and Enriched JSONL record to parquet, which keep provenance-bearing fields explicit instead of burying them in prose.
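A text-level scrubber with a fail-closed check might look like the sketch below. The regular expressions are deliberately simple illustrations, not the production patterns: the IPv4 pattern will also match dotted version strings, only /home and /Users paths are covered, and high-entropy key detection is omitted.

```python
import re

# Illustrative patterns only; real scrubbing needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
HOME_PATH_RE = re.compile(r"(?:/home|/Users)/[^/\s]+/")

def scrub(text):
    """Replace matches with the synthetic markers named in the article."""
    text = EMAIL_RE.sub("<redacted-email>", text)
    text = IPV4_RE.sub("<redacted-network-address>", text)
    text = HOME_PATH_RE.sub("<redacted-path>/", text)
    return text

def assert_scrubbed(text):
    """Fail closed: refuse to pass text that still matches any pattern."""
    checks = (("email", EMAIL_RE), ("ipv4", IPV4_RE), ("path", HOME_PATH_RE))
    for name, pattern in checks:
        if pattern.search(text):
            raise ValueError(f"unredacted {name} survived scrubbing")
    return text

source = "// contact: dev@example.com, built on 10.0.0.7 in /home/alice/src/foo.cc"
cleaned = assert_scrubbed(scrub(source))
```

Running the check on plain text is what makes fail-closed cheap here; the same assertion after tokenization would first need a lossless detokenize step.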

Stage 4 - doc-mask preparation. Document masking is not a separate file format in our pipeline; it is an invariant the producer respects so the consumer can recover boundaries cheaply. Every document gets a leading BOS token. That is the contract. The training loader infers doc_ids at runtime via a cumulative sum over BOS positions, which is O(T) per batch and requires zero storage-format change. The reason this is a stage at all: producers that pre-pack documents into rows must guarantee that BOS-aligned best-fit packing never inserts a document without a BOS, or the inferred doc_ids will silently merge two documents into one. We have hit this. Fix: a producer-side assert that every packed row's BOS count equals its num_docs value. The minimal public proof surfaces for that invariant are Document-mask segment IDs sample and Packed row builder example.
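The inference and the producer-side assert fit in a few lines. This sketch uses plain Python lists rather than tensors, and BOS_ID = 1 is a placeholder, not the tokenizer's real BOS ID.

```python
from itertools import accumulate

BOS_ID = 1   # placeholder value for illustration

def infer_doc_ids(input_ids, bos_id=BOS_ID):
    """Running count of BOS tokens: every token gets its document's 1-based index."""
    return list(accumulate(1 if tok == bos_id else 0 for tok in input_ids))

def check_packed_row(input_ids, num_docs, bos_id=BOS_ID):
    """Producer-side guard: the BOS count must equal the declared num_docs."""
    bos_count = sum(1 for tok in input_ids if tok == bos_id)
    if bos_count != num_docs:
        raise AssertionError(
            f"row declares {num_docs} docs but contains {bos_count} BOS tokens"
        )

row = [1, 10, 11, 1, 20, 1, 30, 31]   # three documents, each starting with BOS
doc_ids = infer_doc_ids(row)
check_packed_row(row, num_docs=3)
```

If the second document had been inserted without its BOS, the cumulative sum would label its tokens as part of the first document, which is exactly the silent merge the assert exists to catch. Padding tokens inherit the last document's ID under this scheme, so a real loader still needs the valid-token prefix to exclude them.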

Stage 5 - tokenize. The tokenizer has its own writeup; pipeline-relevant facts: 131,072 entries, BOS-prepended per document, uint32 output. A pretokenized column here means the token IDs and token-aligned side arrays are already materialized offline into the shard, rather than reconstructed inside the training dataloader. The offline tokenization step emits those pretokenized columns and stores per-token character spans next to IDs. The spans bridge to enrichment columns (structure IDs, dep levels, AST features) that live at character level; without spans we fail closed rather than emit unaligned metadata. The narrow checked-in materialization surfaces for that handoff are Enriched JSONL record to parquet, Token-level enriched parquet materialization example, and Token chunk layout sample.
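How spans bridge character-level metadata to tokens can be illustrated with a small projection function. The convention used here (each token takes the label of its first character) and the function name are assumptions; special tokens with no source text, such as BOS, would need their own sentinel and are not handled.

```python
def project_char_labels(char_labels, spans):
    """Map character-level labels onto tokens via per-token character spans.

    char_labels: one label per character of the document text.
    spans: list of (start, end) character offsets per token, end exclusive,
           or None when the producer did not record spans.
    """
    if spans is None:
        # Fail closed: unaligned metadata is worse than no metadata.
        raise ValueError("no character spans: refusing to emit unaligned metadata")
    token_labels = []
    for start, end in spans:
        if not 0 <= start < end <= len(char_labels):
            raise ValueError(f"span ({start}, {end}) falls outside the document")
        token_labels.append(char_labels[start])
    return token_labels

# "int x;" with hypothetical structure IDs per character.
char_labels = [1, 1, 1, 0, 2, 3]
spans = [(0, 3), (4, 5), (5, 6)]      # tokens: "int", "x", ";"
token_labels = project_char_labels(char_labels, spans)
```

Doing this once offline is what lets the runtime loader read token-aligned columns directly instead of repeating the alignment per batch.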

Stage 6 - packed-row shard. Offline packing is the enriched-row packing stage, which takes per-document tokenized rows and repacks them into fixed-length training rows without truncation: best-fit decreasing, padded on the right, emitted as input_ids / target_ids / loss_mask plus document boundary metadata (pack_id, doc_ids, valid_token_count, num_docs, slack tokens, source provenance). The packed-row contract is the same runtime boundary described in Packed rows as the real training contract, and the runtime loader reads exactly those columns. The smallest checked-in row-contract surfaces are Packed rows schema sample, Packed row builder example, and Packed row example. Shard size is 50,000 docs per parquet file, 1024-row row-groups for fast random access, plus a validation shard carved off as the last 1% and a completion sentinel written when the producer is done.
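Best-fit decreasing packing without truncation can be sketched as below. It is a simplified illustration: the function name is hypothetical, the row search is a linear scan, target_ids is omitted because the shift convention is not specified here, and loss_mask only masks padding.

```python
def pack_best_fit_decreasing(docs, row_len, pad_id=0):
    """Pack token lists into fixed-length rows, longest documents first.

    Each document goes into the open row with the least remaining space that
    still fits it; a new row is opened only when none fits.
    """
    for doc in docs:
        if len(doc) > row_len:
            raise ValueError("document longer than row_len; packing never truncates")

    order = sorted(range(len(docs)), key=lambda i: len(docs[i]), reverse=True)
    rows = []   # each entry: {"doc_indices": [...], "used": int}
    for i in order:
        need = len(docs[i])
        best = None
        for row in rows:
            free = row_len - row["used"]
            if need <= free and (best is None or free < row_len - best["used"]):
                best = row
        if best is None:
            best = {"doc_indices": [], "used": 0}
            rows.append(best)
        best["doc_indices"].append(i)
        best["used"] += need

    packed = []
    for pack_id, row in enumerate(rows):
        ids = [tok for i in row["doc_indices"] for tok in docs[i]]
        valid = len(ids)
        slack = row_len - valid
        packed.append({
            "pack_id": pack_id,
            "input_ids": ids + [pad_id] * slack,      # right-padded
            "loss_mask": [1] * valid + [0] * slack,   # padding never trains
            "valid_token_count": valid,
            "num_docs": len(row["doc_indices"]),
        })
    return packed
```

Because whole documents are placed and never split, every packed row keeps the BOS-per-document invariant by construction, which is what lets the loader's boundary inference stay a cheap cumulative sum.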

Stage 7 - format and verify. In production, the parquet shards are converted to Megatron's .bin/.idx pair through a deterministic formatter that prefers the standard indexed-dataset builder and falls back to a raw writer when that dependency is absent. uint32 token width is mandatory at 131K vocab. A verify gate here is the small set of explicit checks that must pass before a shard set is allowed to be promoted: artifact presence, parseable index, token range, and a narrow sample readback. Verify is prepare_verify, which checks .bin/.idx existence and non-empty, parses the index, asserts max(token_id) < vocab_size, prints the first 64 tokens of document zero, and returns non-zero on any failure. No silent fallbacks at verify time. The narrow checked-in bridge is visible in Parquet to Megatron indexed dataset sample, Prepare-format MegaCpp sample, and the storage-side explainer Converting parquet token shards into Megatron indexed datasets.
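The token-range part of such a gate can be sketched against a raw uint32 token file. This is not prepare_verify itself: the function name is hypothetical, the file is assumed to be a flat little-endian uint32 stream, the .idx parsing step is left out because the index layout is not described here, and the preview prints the first tokens of the file rather than of document zero.

```python
import os
import struct

VOCAB_SIZE = 131_072   # tokenizer size stated in the article

def verify_token_file(bin_path, vocab_size=VOCAB_SIZE, preview=64):
    """Return 0 if the file passes, 1 otherwise (mirrors a process exit code)."""
    if not os.path.isfile(bin_path):
        print(f"FAIL: {bin_path} is missing")
        return 1
    size = os.path.getsize(bin_path)
    if size == 0 or size % 4 != 0:
        print(f"FAIL: {bin_path} has size {size}; expected a non-empty multiple of 4")
        return 1
    with open(bin_path, "rb") as fh:
        data = fh.read()   # fine for a sketch; a real gate would stream or sample
    token_ids = [value for (value,) in struct.iter_unpack("<I", data)]
    worst = max(token_ids)
    if worst >= vocab_size:
        print(f"FAIL: token id {worst} is outside vocab of size {vocab_size}")
        return 1
    print("first tokens:", token_ids[:preview])
    return 0
```

Every failure path returns non-zero and none of them repairs or skips anything, which matches the no-silent-fallback rule: a gate that quietly tolerates a bad shard is not a gate.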

How it lands in MegaCpp

The lift is small because the contract is small. MegaCpp owns the public data-preparation pipeline plus the five numbered stages plus the Megatron .bin/.idx writer. Everything below stage 2 - the semantic indexer, the tokenizer, the enrichment materializer - is handled by separately maintained tooling at pinned versions. Vendoring those pieces into MegaCpp would duplicate several thousand lines of actively maintained tokenizer and indexer code; the dependency is the smaller cost.

What is being lifted as-is: the parquet schema, the tolerant loader contract, the BOS-based doc-mask inference, the offline packer, and the verify gate. What is being rewritten: the legacy flat-text producer is sunset in MegaCpp; only the strict producer with exact-token budgeting and pretokenized columns ships. What is being dropped: the uint16 binary dataset path. What is moving to a kernel path: nothing in data prep is on the kernel critical path; the structure-aware consumer side is where accelerator-friendly kernels matter. What remains a feature flag: the chunk-level dedup whitelist, because some public corpora benefit from preserving more raw context. On the training side, packed rows as the real training contract is the article that explains why that loader boundary matters more than any single upstream producer implementation.

The old multi-environment split that historically lived in separate launch paths is collapsed in MegaCpp to a single configurable data root. As long as the public data-preparation pipeline and the launcher agree on that root, no script edits are required to move data between environments. The sibling checked-in public operator view is Data preparation notes plus Prepare-format MegaCpp sample, which together record the stage handoff, artifact naming, and output-family boundary without depending on a private training-tree checkout.

Ablations and what we kept

The ablations that survived contact with real GPUs are not the headline ones. They are the boring ones.

The pretokenized-vs-char-level choice. We keep the pretokenized path because moving char-to-token alignment out of the hot loop and into offline materialization scales more cleanly to long-context loaders. The char-level path remains useful as an offline materialization input, not as the runtime contract.

The lazy-vs-eager segment materialization choice. We kept eager precomputation for the fully enriched path because relation metadata is cheaper to validate once per document than to rebuild repeatedly inside the row-pack hot loop. Partial-enriched configurations can still justify lazier materialization when the extra metadata is absent.

The document-mask implementation choice. We keep the vectorized masking path and avoid Python-loop handling in the hot path. The lesson is simple: document-boundary logic belongs in fixed-shape tensor operations, not in per-batch Python control flow.

The bottleneck dimension on the structure embedding path. The public contract keeps a narrow bottleneck because structure features must stay cheap enough to justify carrying them through the loader boundary.

The shape of MinHash-LSH itself we did not ablate; we adopted the bigcode parameterization (numPerm=128, threshold=0.7, shingleK=5) for within-corpus and the MixMinMatch parameterization for cross-source. Both have published evidence behind them and our role here is data engineering, not novel similarity research.

Production checklist

Pipeline snapshot

| Stage | Input | Output | Gate |
| --- | --- | --- | --- |
| Ingest | raw repos | normalized docs | license allow-list |
| Dedup | normalized docs | unique docs | minhash-LSH threshold |
| License filter | unique docs | permissive subset | SPDX match |
| Doc-mask | permissive subset | docs + loss mask | schema check |
| Tokenize | masked docs | token streams | vocab coverage check |
| Pack | token streams | packed shards | row-validity contract |

Single-stage rerun example:
- stage: pack
- slice: core_cpp
- input: tokenized shards
- output: packed shards
- row length: 8192
FAQ

Frequently asked questions

Why keep the pipeline split into so many explicit stages?
Because the expensive failures happen at different boundaries. Near-dedup has different failure modes than license filtering, BOS-aware masking fails differently from token packing, and schema verification fails differently from semantic enrichment. Keeping those as explicit stages gives each one a reviewable output contract and lets the verify gate tell you whether the regression came from source selection, dedup, tokenization, or packing instead of giving you one opaque "data got worse" answer.
Why is BOS-based document masking treated as a pipeline invariant?
Because the runtime side is allowed to trust it. The loader, packer, and any later structure-aware consumer all assume BOS marks real document starts, so the moment that invariant drifts, doc_ids, masks, and packed rows stop describing the same sample boundaries. That is how you get cross-document leakage that looks like a model bug but is really a producer bug.
Why does this article emphasize schema tolerance so much?
Because producer code changes faster than long-lived consumers. The stable thing is the parquet contract, not whichever script emitted it this month. A tolerant loader plus a strict schema/verify gate lets you evolve non-essential fields, add enriched columns, or change stage internals without forcing every downstream reader to be rewritten or every older shard to be re-materialized.
Why stage in parquet if the training run ultimately consumes .bin/.idx?
Because the two formats solve different jobs. The parquet stage is the typed columnar staging format where enrichment columns, provenance, and verification receipts stay inspectable; the Megatron indexed dataset is the narrower final read format the trainer streams. If you want that bridge spelled out end to end, keep Parquet to Megatron indexed dataset sample and Converting parquet token shards into Megatron indexed datasets nearby.
Why keep the AST-aware chunker around if bucket labels already exist?
Because bucket labels are planning labels, not proof of a token-faithful split. The AST-aware lane gives you cleaner semantic boundaries and a tighter token budget than the old chars-per-token heuristic, which is exactly why nominal 4k shards from the heuristic path can drift into the mid-4K range after real tokenization. If the chunking question turns into "what structural context are we preserving?" rather than "how many tokens fit?", keep compile commands and semantic graphs and the Clang semantic indexer nearby.
Why not run chunk-level dedup on every semantic unit?
Because some chunks are only meaningful inside their surrounding file context. Restricting chunk-level dedup to self-contained kinds such as bodies, declarations, typedefs, and namespaces avoids deleting preambles, forward declarations, or class members that later stages still need to interpret correctly. The more general dedup tradeoff is spelled out in Code Deduplication at Scale.
Where do tokenizer, provenance, and eval posts fit into this pipeline?
They describe adjacent boundaries of the same data story. Inside the MegaCpp C++ tokenizer and Tokenizer evolution for C++ code explain the 131K vocabulary and why uint32 tokens are required. License Hygiene and Provenance for a C++ Training Corpus explains the pinning and SPDX side of stage 0 through stage 3. Eval harness plumbing and Verifier-first C++ evals are downstream consumers of the same contract: they only make sense if the promoted dataset snapshot is pinned, typed, and auditable.
Which public examples are the fastest proof surfaces for this lane?
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

compile command database

The compilation-command manifest that grounds semantic indexing, compiler receipts, and structure-aware data enrichment.

Semantic indexing

The structure-aware indexing lane that turns compile commands and parsed symbols into reusable training metadata.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

doc_ids

The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Data pipeline

An honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…

Tokenizer

A deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and…

segment_ids

The fixed-width segment labeling used to preserve document boundaries without changing the TPU kernel shape.

MinHash

A compact sketch that approximates set overlap so large corpora can be deduplicated without comparing every full token or shingle set directly.

loss_mask

The per-token training mask that decides which positions contribute to loss after packing, FIM rearrangement, or documentation-aware masking.

LSH

Locality-sensitive hashing: a bucketing scheme that groups similar sketches so near-duplicate candidates can be found without exhaustive pairwise search.

Shingling

The step that turns text or token streams into overlapping k-grams so similarity can be estimated from shared local fragments rather than exact string identity.

valid_token_count

The per-row prefix length of non-pad tokens; runtimes use it as the cheap validity receipt instead of rescanning variable-length packed payloads.

Jaccard threshold

The overlap threshold used to decide when two shingle sets are similar enough to treat as near-duplicates during corpus cleanup.

Secrets

Modal's credential-injection surface for environment variables and access tokens, kept separate from both the pinned image and writable Volumes.
