Building a C/C++ corpus for training: what we keep, what we throw away, and why
A detailed walkthrough of how MegaCpp builds a C/C++ corpus: source selection, pins, deduplication, compile-command metadata, chunking, structure-aware exports, and refusal rules.

A usable C/C++ training corpus is not just a dump of repositories. The work is in deciding which public inputs are eligible, how they are pinned, which metadata survives preprocessing, and which sources should stay out until they can be described cleanly. The public MegaCpp sample pack files are enough to outline that process without leaning on unpublished inventories. The adjacent Building the C++ Training Data Pipeline: What Worked, What Broke post covers the end-to-end lane, The C/C++ Data Preparation Pipeline, End to End walks the operational stages, and License Hygiene and Provenance for a C++ Training Corpus explains why promotion has to stay narrower than discovery.
What the corpus story should keep explicit
The data-prep and pinning notes define the important parts of the construction story.
- Every promoted input is pinned to a tag, commit, or dataset revision.
- License metadata is treated as structured data.
- Deduplication happens before chunking when possible.
- Build-aware metadata stays separate from plain lexical chunks.
- A snapshot is promoted only after schema checks and a smoke consumer pass.
That is already a stronger corpus story than most model cards provide. It says the corpus is a versioned training snapshot with entry criteria, not an ever-shifting collection of repositories. The narrow checked-in policy surfaces for that claim are Data preparation notes and Reference corpus pins.
Source selection is narrower than source discovery
A corpus builder should distinguish between three things: sources worth considering, sources worth pinning, and sources that are currently eligible for promotion. Public discussions often collapse those categories and make the resulting corpus sound more settled than it is.
The public notes support a cleaner rule. Discovery can be broad. Promotion should be narrow. A source becomes promotion-eligible only after it has a stable revision, acceptable license metadata, and a place in the schema and verification flow.
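The promotion rule can be sketched as a small gate over a ledger record. This is a minimal illustration, not the project's schema: the `SourceRecord` fields, the `ACCEPTED_LICENSES` set, and the `promotion_blockers` helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical ledger record; field names are illustrative only.
@dataclass(frozen=True)
class SourceRecord:
    name: str
    revision: Optional[str]        # tag, commit, or dataset revision; None if floating
    license_id: Optional[str]      # e.g. an SPDX identifier; None if unknown
    schema_version: Optional[int]  # schema the source's rows are validated against

# Assumption for the sketch: a fixed allow-list of license identifiers.
ACCEPTED_LICENSES = {"MIT", "BSD-3-Clause", "Apache-2.0"}

def promotion_blockers(rec: SourceRecord) -> List[str]:
    """Return the reasons a discovered source is not yet promotion-eligible."""
    blockers = []
    if not rec.revision:
        blockers.append("no pinned revision")
    if rec.license_id not in ACCEPTED_LICENSES:
        blockers.append("license metadata missing or not accepted")
    if rec.schema_version is None:
        blockers.append("no place in the schema and verification flow")
    return blockers
```

A source with an empty blocker list is eligible for promotion; anything else stays in discovery, with the reasons recorded instead of implied.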
That distinction matters most for awkward hosts, mirrored repositories, and access-gated inputs. A link is not the same as a reproducible source record.
Archival identifiers help, but only as evidence handles. A Software Heritage SWHID can name archived content, directories, revisions, releases, or snapshots, with optional qualifiers for origin, path, and lines. In a corpus ledger that is useful provenance metadata; it is not a substitute for the license, schema, and promotion checks that decide whether a source enters the training snapshot.
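To store a SWHID as structured provenance instead of an opaque string, a ledger can split it into its core identifier and qualifiers. The sketch below assumes the version-1 textual form (`swh:1:<type>:<40 hex digits>` followed by `;key=value` qualifiers); `parse_swhid` and the `Swhid` tuple are illustrative names, and the sketch does not check that a qualifier key is one the specification defines.

```python
import re
from typing import Dict, NamedTuple

# Core identifier: scheme, version 1, object type, 40 lowercase hex digits.
_CORE = re.compile(r"^swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})$")

class Swhid(NamedTuple):
    object_type: str            # snp, rel, rev, dir, or cnt
    object_id: str              # hex digest naming the archived object
    qualifiers: Dict[str, str]  # e.g. origin, path, lines

def parse_swhid(text: str) -> Swhid:
    """Split a SWHID string into typed parts, rejecting malformed input."""
    core, *raw_qualifiers = text.split(";")
    match = _CORE.match(core)
    if match is None:
        raise ValueError(f"not a version-1 SWHID core identifier: {core!r}")
    qualifiers: Dict[str, str] = {}
    for item in raw_qualifiers:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed qualifier: {item!r}")
        qualifiers[key] = value
    return Swhid(match.group(1), match.group(2), qualifiers)
```

Parsing succeeds or fails on syntax alone. Whether the named object is acceptable for the snapshot is still decided by the license, schema, and promotion checks.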
Build context belongs in the corpus pipeline
The compile-command sample shows why C/C++ corpora need more than raw files.
[
  {
    "directory": "/workspace/build",
    "file": "src/parser.cpp",
    "arguments": [
      "clang++",
      "-std=c++20",
      "-Iinclude",
      "-DMEGACPP_EXAMPLE=1",
      "-c",
      "src/parser.cpp"
    ]
  }
]
This kind of metadata matters because C and C++ meaning is partly build-defined. Include roots, language mode flags, generated directories, and compile units all shape what a parser or structure extractor can see. The public data-prep note reflects that by explicitly separating build-aware metadata from plain lexical chunks.
The boring but useful rule is to prefer typed arguments arrays over one
shell-escaped command string whenever the exporter can do it. That keeps
include paths and macro flags parseable across platforms instead of making
every downstream consumer re-implement shell parsing.
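A consumer that accepts both forms might look like the following sketch. It reads the `arguments` array when present and falls back to splitting a `command` string with Python's `shlex`. The helper names and the output field names (`std`, `include_dirs`, `defines`) are assumptions for illustration, and only the `-std=`, `-I`, and `-D` flags are handled.

```python
import shlex
from typing import Dict, List

def normalized_arguments(entry: dict) -> List[str]:
    """Return argv for one compilation database entry."""
    if "arguments" in entry:
        return list(entry["arguments"])
    # Fallback for the single-string form. shlex follows POSIX shell rules,
    # so commands recorded with Windows quoting may split differently.
    return shlex.split(entry["command"])

def build_context(entry: dict) -> Dict[str, object]:
    """Extract a few typed build fields from one entry."""
    args = normalized_arguments(entry)
    std, include_dirs, defines = None, [], []
    i = 1  # skip the compiler executable
    while i < len(args):
        arg = args[i]
        if arg.startswith("-std="):
            std = arg[len("-std="):]
        elif arg in ("-I", "-D") and i + 1 < len(args):
            # Separated form: "-I include" or "-D NAME=1".
            (include_dirs if arg == "-I" else defines).append(args[i + 1])
            i += 1
        elif arg.startswith("-I") and len(arg) > 2:
            include_dirs.append(arg[2:])
        elif arg.startswith("-D") and len(arg) > 2:
            defines.append(arg[2:])
        i += 1
    return {
        "directory": entry["directory"],
        "file": entry["file"],
        "std": std,
        "include_dirs": include_dirs,
        "defines": defines,
    }
```

The fallback branch is exactly the shell parsing the typed array avoids. Keeping it in one function at least confines that risk to a single place.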
That separation is one of the most important design choices in the corpus pipeline. If build metadata is flattened into prose or discarded too early, later structure-aware features lose their anchor. The checked-in proof surfaces are Compilation database sample and Compile commands context example, both of which keep build context as typed records. That is also the point of Compile Commands and Semantic Graphs: build context is part of the training record, not decoration.
The useful lesson is narrower than "add more metadata." Keep compile flags, include roots, and translation-unit context as optional structured fields that can evolve with the schema, then make promotion depend on schema validation plus a consumer smoke pass. That is the same contract carried by C++ Data Versioning and Schema: How to Keep Training Rows Stable While the Corpus Evolves and Compile Commands and Semantic Graphs.
Deduplication and normalization come before chunking
The public pipeline shape is explicit about order: normalize encodings and line endings, remove obviously generated noise, apply license and provenance tagging, deduplicate exact and near-duplicate content, then extract structure-aware metadata and write columnar artifacts.
That order is not cosmetic. Deduplicating after chunking is weaker because template boilerplate, vendored code, and near-clone headers have already been allowed to dominate chunk statistics. Doing it earlier keeps repeated infrastructure from overwhelming the rarer patterns a specialist model actually needs.
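One common way to score near-clone headers is Jaccard similarity over token shingles. The sketch below uses that technique as a stand-in: the public notes do not say which near-duplicate method the pipeline uses, and the tokenizer regex, the shingle width `k`, and the function names are assumptions.

```python
import re
from typing import Set, Tuple

def shingles(text: str, k: int = 5) -> Set[Tuple[str, ...]]:
    """Return the set of k-token windows over a crude C-like tokenization."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", text)
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a: str, b: str, k: int = 5) -> float:
    """Similarity in [0, 1]; 1.0 means identical shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

header = "#pragma once\nint add(int a, int b);\nint sub(int a, int b);\n"
near_clone = header + "int mul(int a, int b);\n"
unrelated = "void log(const char* msg) { puts(msg); }\n"

# The near-clone scores far higher than the unrelated file.
print(jaccard(header, near_clone), jaccard(header, unrelated))
```

Pairwise comparison like this does not scale to a whole corpus on its own. Large pipelines typically approximate it with hashing schemes, which this sketch does not attempt.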
There is another reason to prefer pre-chunk deduplication: post-chunk cleanup can damage the exact structure the later stages were trying to preserve. If duplicate chunks are removed after AST or lexical slicing, the pipeline can keep the shell of a file while dropping the repeated interior that made the structure interpretable in the first place. Deduplicating whole sources first is narrower and less glamorous, but it preserves coherent units for later structure-aware export.
Normalization also has to stay conservative. Line endings, encodings, and obviously generated noise are good normalization targets. Semantic rewrite of code style is not. The point is to remove accidental variation, not to erase meaningful formatting or build distinctions. Code Deduplication at Scale covers the same ordering from the duplicate-control side: boilerplate is easiest to suppress before chunk statistics are already polluted.
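The conservative normalize-then-deduplicate step can be sketched on whole files, before any chunking. This is a minimal illustration of exact deduplication only: `normalize` and `dedup_whole_files` are hypothetical helpers, and decoding with `errors="replace"` is an assumption (a real pipeline might prefer to reject undecodable files).

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

def normalize(raw: bytes) -> str:
    """Remove accidental variation only: encoding debris and line endings."""
    # Assumption: treat inputs as UTF-8 and replace undecodable bytes.
    text = raw.decode("utf-8", errors="replace")
    if text.startswith("\ufeff"):  # drop a leading byte-order mark
        text = text[1:]
    return text.replace("\r\n", "\n").replace("\r", "\n")

def dedup_whole_files(files: Iterable[Tuple[str, bytes]]) -> List[Tuple[str, str]]:
    """Keep the first file seen for each normalized-content digest."""
    first_path_by_digest: Dict[str, str] = {}
    kept: List[Tuple[str, str]] = []
    for path, raw in files:
        text = normalize(raw)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in first_path_by_digest:
            continue  # exact duplicate after normalization
        first_path_by_digest[digest] = path
        kept.append((path, text))
    return kept
```

Nothing here touches indentation, identifier style, or comment layout, so formatting that carries meaning survives while a CRLF copy of a vendored header collapses into its LF twin.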
Structure-aware exports should stay typed and separate
The semantic indexing note and the data-prep note both point in the same direction: structure-aware metadata is part of the export contract, not a vague aspiration.
That means chunk rows should keep their main lexical content separate from additional fields such as structure IDs, chunk boundaries, compile-command-derived context, or graph-style relations. Typed side fields are easier to validate, easier to evolve, and much easier to consume than one overloaded text field that tries to carry everything.
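As a rough picture of what "typed side fields" means in practice, the sketch below keeps the lexical text in one field and everything else in separate typed fields. The `ChunkRow` layout and its field names are hypothetical and are not the schema of the checked-in samples.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional

@dataclass
class ChunkRow:
    # Main lexical content, kept free of embedded metadata.
    text: str
    # Typed side fields (all names are illustrative).
    source_path: str
    start_line: int
    end_line: int
    structure_id: Optional[str] = None
    compile_args: List[str] = field(default_factory=list)
    relations: List[dict] = field(default_factory=list)

def to_json_line(row: ChunkRow) -> str:
    """Serialize one row as a JSONL line with stable key order."""
    return json.dumps(asdict(row), sort_keys=True)

def from_json_line(line: str) -> ChunkRow:
    """Rebuild a row; unknown keys raise TypeError instead of being ignored."""
    return ChunkRow(**json.loads(line))
```

Because each field has its own type, a round trip through `to_json_line` and `from_json_line` can be checked for equality, which is much harder when metadata is spliced into the text itself.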
The chunk budget should follow structural content, not line counts. Comments and neighboring code nodes should stay together when they describe the same unit, because splitting on arbitrary line windows is exactly how C++ chunks turn from structured records back into lossy text.
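A budget that follows structure can be sketched as a greedy packer over pre-parsed units. The sketch assumes a parser has already produced a flat list of comment and code units. The `Unit` type, the whitespace-based `cost` function (a stand-in for a real token count), and `pack_units` are all illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Unit:
    kind: str  # "comment" or "code"
    text: str

def cost(text: str) -> int:
    """Stand-in for a tokenizer: count whitespace-separated pieces."""
    return len(text.split())

def pack_units(units: List[Unit], budget: int) -> List[str]:
    """Pack units into chunks without separating a comment from its code."""
    # Step 1: glue each run of comments to the code unit that follows it.
    groups: List[str] = []
    pending: List[str] = []
    for unit in units:
        pending.append(unit.text)
        if unit.kind != "comment":
            groups.append("\n".join(pending))
            pending = []
    if pending:  # trailing comments with no following code
        groups.append("\n".join(pending))

    # Step 2: greedily fill chunks; an oversized group gets its own chunk
    # instead of being split on an arbitrary line.
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for group in groups:
        group_cost = cost(group)
        if current and used + group_cost > budget:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(group)
        used += group_cost
    if current:
        chunks.append("\n".join(current))
    return chunks
```

The budget is a soft limit here: a group larger than the budget is emitted whole, trading a size overrun for a unit that still makes sense on its own.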
This is also where many C/C++ corpus projects quietly fail. They gather rich parser output, then collapse it back into lossy text before the loader boundary. The public MegaCpp materials argue for the opposite choice: keep the richer metadata explicit and versioned. The narrow checked-in proof surfaces are Enriched record sample and Enriched JSONL record to parquet, which preserve typed chunk, relation, and provenance-bearing fields.
Versioning is part of corpus construction, not post-processing
The reference pinning note includes schema version as minimal metadata per input. That is important because schema versioning is not an afterthought once rows are already written. It is part of how the corpus is built.
If a chunk row gains a new metadata field, the pipeline should have a canonical way to represent older rows, newer rows, and missing fields. Otherwise every consumer becomes a schema detective. Public sample notes cannot prove every downstream implementation detail, but they clearly endorse the right discipline: explicit schemas, round-trip checks, and consumer smoke tests before promotion. The checked-in schema surfaces for that are Packed rows schema sample and Loader enriched columns sample.
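One way to give older rows a canonical representation is an explicit upgrade step that fills fields added in later versions with declared defaults. The version numbers, the `DEFAULTS_ADDED_IN` table, and `upgrade_row` below are hypothetical; they only illustrate the discipline of making missing fields explicit instead of leaving each consumer to guess.

```python
import copy
from typing import Dict

CURRENT_SCHEMA_VERSION = 2  # hypothetical version number for the sketch

# Fields introduced by each version, with the default older rows receive.
DEFAULTS_ADDED_IN: Dict[int, dict] = {
    2: {"structure_id": None, "compile_args": []},
}

def upgrade_row(row: dict) -> dict:
    """Return a copy of the row expressed in the current schema version."""
    version = row.get("schema_version", 1)  # assumption: unversioned rows are v1
    if version > CURRENT_SCHEMA_VERSION:
        raise ValueError(f"row is newer than this consumer: v{version}")
    upgraded = dict(row)
    for v in range(version + 1, CURRENT_SCHEMA_VERSION + 1):
        for key, default in DEFAULTS_ADDED_IN.get(v, {}).items():
            # deepcopy so rows never share one mutable default list
            upgraded.setdefault(key, copy.deepcopy(default))
    upgraded["schema_version"] = CURRENT_SCHEMA_VERSION
    return upgraded
```

Upgrading is idempotent, so a consumer can apply it to every row it reads without first checking which generation the row came from.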
The promotion rule follows from that: schema validation alone is not enough. A snapshot should also survive a consumer smoke pass that proves the typed columns still load and the expected metadata relationships still hold. That is the difference between "the rows parse" and "the snapshot is actually ready to feed the next stage without ad hoc repair."
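The gap between the two checks is easy to show on a toy snapshot. In the sketch below, `schema_errors` checks field presence and types per row, while `smoke_errors` checks relationships across rows. The required fields and the two relationship rules are assumptions for the example, not the project's actual checks.

```python
from typing import Dict, List

# Hypothetical required columns and their Python types.
REQUIRED: Dict[str, type] = {
    "chunk_id": str,
    "text": str,
    "start_line": int,
    "end_line": int,
    "relations": list,
}

def schema_errors(row: dict) -> List[str]:
    """Per-row check: every required field exists with the right type."""
    errors = []
    for key, expected in REQUIRED.items():
        if key not in row:
            errors.append(f"missing {key}")
        elif not isinstance(row[key], expected):
            errors.append(f"{key} is not {expected.__name__}")
    return errors

def smoke_errors(rows: List[dict]) -> List[str]:
    """Cross-row check: metadata relationships still hold after loading."""
    errors = []
    known_ids = {row["chunk_id"] for row in rows}
    for row in rows:
        if row["start_line"] > row["end_line"]:
            errors.append(f"{row['chunk_id']}: inverted line range")
        for relation in row["relations"]:
            if relation.get("target") not in known_ids:
                errors.append(f"{row['chunk_id']}: dangling relation target")
    return errors
```

A row whose relation points at a chunk that was deduplicated away passes `schema_errors` and fails `smoke_errors`, which is the case the second gate exists to catch.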
What we keep and what we throw away
The public materials imply a straightforward keep/discard policy.
Keep:
- pinned public source files
- structured license and provenance metadata
- build-aware metadata that affects parsing or chunk meaning
- typed structure-aware exports
- versioned columnar artifacts that pass schema and smoke checks
Throw away or keep out of the promoted snapshot:
- floating revisions
- sources that cannot be pinned or described cleanly
- obviously generated noise
- duplicate or near-duplicate content that would dominate training statistics
- ambiguous metadata that cannot be represented in the current schema without ad hoc interpretation
Practical checklist
- Start from a revision ledger, not a clone directory.
- Treat build metadata as corpus input, not incidental tooling output.
- Deduplicate before chunking whenever possible.
- Keep lexical, structural, and provenance fields separate.
- Promote only snapshots that pass schema and consumer checks.
- Do not describe review inventory as if it were already promoted training data.
That is the detailed corpus-construction story the public files support. It is narrow enough to defend, concrete enough to implement, and much more useful than a generic claim about “training on lots of open-source C++.”
Frequently asked questions
Why not just train on raw repository dumps?+
Why keep compile commands separate from lexical chunks?+
What if the compilation database covers only part of a repository?+
Why deduplicate before AST or lexical chunking instead of after?+
Why require a consumer smoke pass if the schema already validates?+
What counts as "obviously generated noise" here?+
Header markers such as "// Generated by" or "DO NOT EDIT", extreme line geometry that looks like emitted output instead of maintained code, and binary-in-ASCII dumps. The point is to strip machine-emitted boilerplate before dedup and chunking without pretending that this filter settles every harder provenance or quality question; those stay with the narrower rules in The C/C++ Data Preparation Pipeline, End to End and License Hygiene and Provenance for a C++ Training Corpus.
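Those three signals can be approximated with a small heuristic. Everything in the sketch below is an assumption made for illustration: the marker regex, the ten-line header window, the 2000-character line limit, and the 0.9 hex-density threshold are arbitrary values, not numbers taken from the public notes.

```python
import re

_MARKERS = re.compile(r"generated by|do not edit", re.IGNORECASE)
# Characters that dominate byte-array dumps such as "0x1f, 0x8b, ...".
_HEXISH = set("0123456789abcdefABCDEFxX, \t\n")

def looks_generated(
    text: str,
    header_lines: int = 10,
    max_line_len: int = 2000,
    min_dump_chars: int = 256,
    hex_density: float = 0.9,
) -> bool:
    """Heuristic filter for machine-emitted files; thresholds are illustrative."""
    lines = text.splitlines()
    # Signal 1: a generator marker near the top of the file.
    if any(_MARKERS.search(line) for line in lines[:header_lines]):
        return True
    # Signal 2: extreme line geometry, such as one enormous emitted line.
    if any(len(line) > max_line_len for line in lines):
        return True
    # Signal 3: binary-in-ASCII dumps, measured as hex-like character density.
    if len(text) >= min_dump_chars:
        density = sum(ch in _HEXISH for ch in text) / len(text)
        if density >= hex_density:
            return True
    return False
```

A filter like this only removes the obvious cases. Files it lets through still face deduplication, license tagging, and the promotion checks.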