MegaCpp Engineering · Applied C++ model systems
Published · 7 min read · David Gornshtein
Tags: C++, Corpus, Dataset, Training, Data, Tokenizer

Building a C/C++ corpus for training: what we keep, what we throw away, and why

A detailed walkthrough of how MegaCpp builds a C/C++ corpus: source selection, pins, deduplication, compile-command metadata, chunking, structure-aware exports, and refusal rules.

A usable C/C++ training corpus is not just a dump of repositories. The work is in deciding which public inputs are eligible, how they are pinned, which metadata survives preprocessing, and which sources should stay out until they can be described cleanly. The public MegaCpp sample pack files are enough to outline that process without leaning on unpublished inventories. The adjacent post "Building the C++ Training Data Pipeline: What Worked, What Broke" covers the end-to-end lane, "The C/C++ Data Preparation Pipeline, End to End" walks the operational stages, and "License Hygiene and Provenance for a C++ Training Corpus" explains why promotion has to stay narrower than discovery.

What the corpus story should keep explicit

The data-prep and pinning notes define the important parts of the construction story.

That is already a stronger corpus story than most model cards provide. It says the corpus is a versioned training snapshot with entry criteria, not an ever-shifting collection of repositories. The narrow checked-in policy surfaces for that claim are Data preparation notes and Reference corpus pins.

Source selection is narrower than source discovery

A corpus builder should distinguish between three things: sources worth considering, sources worth pinning, and sources that are currently eligible for promotion. Public discussions often collapse those categories and make the resulting corpus sound more settled than it is.

The public notes support a cleaner rule. Discovery can be broad. Promotion should be narrow. A source becomes promotion-eligible only after it has a stable revision, acceptable license metadata, and a place in the schema and verification flow.
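The promotion rule above can be sketched as a small gate over a ledger record. This is a minimal illustration, not the MegaCpp implementation: the `SourceRecord` fields, the 40-hex revision check, and the blocker messages are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


# Hypothetical ledger record; the field names are illustrative only.
@dataclass(frozen=True)
class SourceRecord:
    url: str
    revision: Optional[str] = None        # pinned commit hash, if any
    license_id: Optional[str] = None      # e.g. an SPDX-style identifier
    schema_version: Optional[int] = None  # schema the source is mapped to
    verified: bool = False                # passed the verification flow

# Assumption: a "stable revision" is a full 40-hex commit hash, not a branch name.
_FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def promotion_blockers(src: SourceRecord) -> list[str]:
    """Return the reasons a discovered source is not yet promotion-eligible."""
    blockers = []
    if src.revision is None or not _FULL_SHA.match(src.revision):
        blockers.append("no stable revision")
    if not src.license_id:
        blockers.append("missing license metadata")
    if src.schema_version is None or not src.verified:
        blockers.append("not placed in schema and verification flow")
    return blockers
```

A source pinned only to a branch name such as `main` is still discoverable, but `promotion_blockers` keeps it out of the snapshot until every reason is cleared.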

That distinction matters most for awkward hosts, mirrored repositories, and access-gated inputs. A link is not the same as a reproducible source record.

Archival identifiers help, but only as evidence handles. A Software Heritage SWHID can name archived content, directories, revisions, releases, or snapshots, with optional qualifiers for origin, path, and lines. In a corpus ledger that is useful provenance metadata; it is not a substitute for the license, schema, and promotion checks that decide whether a source enters the training snapshot.
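As a rough illustration of treating a SWHID as structured provenance rather than an opaque string, the sketch below splits an identifier into its object type, hash, and qualifiers. The `Swhid` tuple and `parse_swhid` helper are hypothetical names; the parser checks only the core `swh:1:<type>:<40 hex>` shape and does not validate qualifier values.

```python
import re
from typing import NamedTuple

# Core identifier: scheme, version 1, object type, 40 hex digits.
_CORE = re.compile(r"^swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})$")
_KINDS = {
    "snp": "snapshot",
    "rel": "release",
    "rev": "revision",
    "dir": "directory",
    "cnt": "content",
}


class Swhid(NamedTuple):
    kind: str
    object_id: str
    qualifiers: dict


def parse_swhid(text: str) -> Swhid:
    """Split a SWHID into its core parts and its ';key=value' qualifiers."""
    core, *quals = text.split(";")
    match = _CORE.match(core)
    if match is None:
        raise ValueError(f"not a SWHID core identifier: {core!r}")
    qualifiers = {}
    for qual in quals:
        key, sep, value = qual.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed qualifier: {qual!r}")
        qualifiers[key] = value
    return Swhid(_KINDS[match.group(1)], match.group(2), qualifiers)
```

A ledger can store the parsed fields next to the license and schema columns, which keeps the identifier in its role as evidence rather than as an eligibility decision.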

Build context belongs in the corpus pipeline

The compile-command sample shows why C/C++ corpora need more than raw files.

[
  {
    "directory": "/workspace/build",
    "file": "src/parser.cpp",
    "arguments": [
      "clang++",
      "-std=c++20",
      "-Iinclude",
      "-DMEGACPP_EXAMPLE=1",
      "-c",
      "src/parser.cpp"
    ]
  }
]

This kind of metadata matters because C and C++ meaning is partly build-defined. Include roots, language mode flags, generated directories, and compile units all shape what a parser or structure extractor can see. The public data-prep note reflects that by explicitly separating build-aware metadata from plain lexical chunks.

The boring but useful rule is to prefer typed arguments arrays over one shell-escaped command string whenever the exporter can do it. That keeps include paths and macro flags parseable across platforms instead of making every downstream consumer re-implement shell parsing.
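To make the difference concrete, here is a small sketch of a downstream consumer. With an `arguments` array the flags can be read directly; a `command` string has to go through shell-style splitting first. The helper names are hypothetical, and the `shlex` fallback assumes POSIX quoting, which will not match every platform's shell.

```python
import shlex


def entry_arguments(entry: dict) -> list[str]:
    """Return the argument vector for one compilation database entry."""
    if "arguments" in entry:
        return list(entry["arguments"])
    # Fallback for entries that only carry a single command string.
    # Assumption: POSIX-style quoting; other shells need other rules.
    return shlex.split(entry["command"])


def include_dirs_and_defines(args: list[str]) -> tuple[list[str], list[str]]:
    """Collect -I include roots and -D macro definitions from an argv list."""
    includes, defines = [], []
    it = iter(args)
    for arg in it:
        for flag, bucket in (("-I", includes), ("-D", defines)):
            if arg == flag:
                # Separated form: "-I", "path"
                bucket.append(next(it, ""))
            elif arg.startswith(flag):
                # Joined form: "-Ipath"
                bucket.append(arg[len(flag):])
    return includes, defines
```

For the sample entry above this yields `["include"]` and `["MEGACPP_EXAMPLE=1"]` without any quoting logic; the string fallback only matters for exporters that cannot emit arrays.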

That separation is one of the most important design choices in the corpus pipeline. If build metadata is flattened into prose or discarded too early, later structure-aware features lose their anchor. The checked-in proof surfaces are Compilation database sample and Compile commands context example, both of which keep build context as typed records. That is also the point of Compile Commands and Semantic Graphs: build context is part of the training record, not decoration.

The useful lesson is narrower than "add more metadata." Keep compile flags, include roots, and translation-unit context as optional structured fields that can evolve with the schema, then make promotion depend on schema validation plus a consumer smoke pass. That is the same contract carried by C++ Data Versioning and Schema: How to Keep Training Rows Stable While the Corpus Evolves and Compile Commands and Semantic Graphs.

Deduplication and normalization come before chunking

The public pipeline shape is explicit about order: normalize encodings and line endings, remove obviously generated noise, apply license and provenance tagging, deduplicate exact and near-duplicate content, then extract structure-aware metadata and write columnar artifacts.

That order is not cosmetic. Deduplicating after chunking is weaker because template boilerplate, vendored code, and near-clone headers have already been allowed to dominate chunk statistics. Doing it earlier keeps repeated infrastructure from overwhelming the rarer patterns a specialist model actually needs.

There is another reason to prefer pre-chunk deduplication: post-chunk cleanup can damage the exact structure the later stages were trying to preserve. If duplicate chunks are removed after AST or lexical slicing, the pipeline can keep the shell of a file while dropping the repeated interior that made the structure interpretable in the first place. Deduplicating whole sources first is narrower and less glamorous, but it preserves coherent units for later structure-aware export.

Normalization also has to stay conservative. Line endings, encodings, and obviously generated noise are good normalization targets. Semantic rewrite of code style is not. The point is to remove accidental variation, not to erase meaningful formatting or build distinctions. Code Deduplication at Scale covers the same ordering from the duplicate-control side: boilerplate is easiest to suppress before chunk statistics are already polluted.
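A minimal sketch of the conservative end of this stage: normalize only encodings and line endings, then drop exact whole-file duplicates before anything is chunked. The Latin-1 fallback, the SHA-256 key, and the function names are assumptions for illustration; near-duplicate detection is deliberately left out.

```python
import hashlib


def normalize_source(raw: bytes) -> str:
    """Remove accidental variation only: BOM, encoding, and line endings."""
    try:
        text = raw.decode("utf-8-sig")   # also strips a UTF-8 BOM
    except UnicodeDecodeError:
        text = raw.decode("latin-1")     # assumption: lossless byte fallback
    # Code style, whitespace inside lines, and comments are left untouched.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def dedup_exact(files: dict[str, bytes]) -> dict[str, str]:
    """Keep the first path (in sorted order) for each distinct normalized file."""
    seen = set()
    kept = {}
    for path in sorted(files):
        text = normalize_source(files[path])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # whole-source duplicate: dropped before chunking
        seen.add(digest)
        kept[path] = text
    return kept
```

Two copies of a vendored header that differ only in line endings or a BOM collapse to one file here, so the repeated boilerplate never reaches chunk statistics.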

Structure-aware exports should stay typed and separate

The semantic indexing note and the data-prep note both point in the same direction: structure-aware metadata is part of the export contract, not a vague aspiration.

That means chunk rows should keep their main lexical content separate from additional fields such as structure IDs, chunk boundaries, compile-command-derived context, or graph-style relations. Typed side fields are easier to validate, easier to evolve, and much easier to consume than one overloaded text field that tries to carry everything.
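One way to picture such a row is a set of small typed records in which the lexical text is just one field among several. The class and field names below are hypothetical and are not the schema of the MegaCpp samples; they only show the shape of the separation.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class CompileContext:
    std: Optional[str] = None
    include_dirs: tuple[str, ...] = ()
    defines: tuple[str, ...] = ()


@dataclass(frozen=True)
class Relation:
    kind: str        # e.g. "declares", "calls"
    target_id: str   # structure ID of the related unit


@dataclass(frozen=True)
class ChunkRow:
    text: str                                   # main lexical content
    structure_id: str                           # typed side field
    start_line: int                             # chunk boundaries
    end_line: int
    compile_context: Optional[CompileContext] = None
    relations: tuple[Relation, ...] = ()

    def __post_init__(self):
        # Boundaries are validated as data, not parsed back out of the text.
        if not (1 <= self.start_line <= self.end_line):
            raise ValueError("chunk boundaries must satisfy 1 <= start_line <= end_line")
```

Because each side field has its own type, `asdict(row)` gives a nested record that maps naturally onto columnar storage, and a missing compile context is an explicit `None` rather than an absent substring.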

The chunk budget should follow structural content, not line counts. Comments and neighboring code nodes should stay together when they describe the same unit, because splitting on arbitrary line windows is exactly how C++ chunks turn from structured records back into lossy text.
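The sketch below shows one simple reading of that rule: comments are attached to the code node they precede, and the budget is then applied to whole units rather than to line windows. The `(kind, text)` node shape, the character-count budget, and both function names are assumptions; a real pipeline would take nodes from a parser and count tokens.

```python
def attach_comments(nodes):
    """Merge each run of comments into the code node that follows it.

    nodes is a list of (kind, text) pairs, where kind is "comment" or
    any other label for a code node.
    """
    units, pending = [], []
    for kind, text in nodes:
        if kind == "comment":
            pending.append(text)
        else:
            units.append("\n".join(pending + [text]))
            pending = []
    if pending:  # trailing comments with no following node
        units.append("\n".join(pending))
    return units


def pack_units(units, budget):
    """Greedily pack whole units into chunks; a unit is never split."""
    chunks, current, size = [], [], 0
    for unit in units:
        if current and size + len(unit) > budget:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(unit)
        size += len(unit)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

A unit larger than the budget becomes its own oversized chunk instead of being cut in the middle, which is the trade this approach makes in favor of structure.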

This is also where many C/C++ corpus projects quietly fail. They gather rich parser output, then collapse it back into lossy text before the loader boundary. The public MegaCpp materials argue for the opposite choice: keep the richer metadata explicit and versioned. The narrow checked-in proof surfaces are Enriched record sample and Enriched JSONL record to parquet, which preserve typed chunk, relation, and provenance-bearing fields.

Versioning is part of corpus construction, not post-processing

The reference pinning note lists the schema version among the minimal metadata recorded per input. That matters because schema versioning is not an afterthought applied once rows are already written. It is part of how the corpus is built.

If a chunk row gains a new metadata field, the pipeline should have a canonical way to represent older rows, newer rows, and missing fields. Otherwise every consumer becomes a schema detective. Public sample notes cannot prove every downstream implementation detail, but they clearly endorse the right discipline: explicit schemas, round-trip checks, and consumer smoke tests before promotion. The checked-in schema surfaces for that are Packed rows schema sample and Loader enriched columns sample.
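A canonical representation for older rows can be as small as a table of fields added per version plus one upgrade function. The version numbers, the `compile_context` field, and the helper name below are hypothetical; the point is only that missing fields get one agreed default in one place.

```python
CURRENT_SCHEMA_VERSION = 2

# Hypothetical history: version 2 added an optional "compile_context" field.
_DEFAULTS_ADDED_IN = {
    2: {"compile_context": None},
}


def upgrade_row(row: dict) -> dict:
    """Return a copy of the row expressed in the current schema version."""
    version = row.get("schema_version")
    if not isinstance(version, int) or not 1 <= version <= CURRENT_SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version!r}")
    upgraded = dict(row)
    for v in range(version + 1, CURRENT_SCHEMA_VERSION + 1):
        for name, default in _DEFAULTS_ADDED_IN.get(v, {}).items():
            # Older rows get an explicit default instead of a missing key.
            upgraded.setdefault(name, default)
    upgraded["schema_version"] = CURRENT_SCHEMA_VERSION
    return upgraded
```

Upgrading an already current row returns it unchanged, which is the kind of round-trip property a check before promotion can assert.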

The promotion rule follows from that: schema validation alone is not enough. A snapshot should also survive a consumer smoke pass that proves the typed columns still load and the expected metadata relationships still hold. That is the difference between "the rows parse" and "the snapshot is actually ready to feed the next stage without ad hoc repair."
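The gap between the two checks is easy to show with a toy snapshot. In the sketch below, schema validation looks only at field types, while the smoke pass reloads the serialized rows the way a consumer would and checks one metadata relationship: every relation must point at a structure ID that exists. The field names and the JSONL round trip are assumptions for the example.

```python
import json

# Hypothetical minimal schema: field name -> required Python type.
REQUIRED_FIELDS = {"structure_id": str, "text": str, "relations": list}


def schema_errors(rows):
    """Type-level validation: are the rows internally well-typed?"""
    errors = []
    for i, row in enumerate(rows):
        for name, typ in REQUIRED_FIELDS.items():
            if not isinstance(row.get(name), typ):
                errors.append(f"row {i}: {name} is not {typ.__name__}")
    return errors


def smoke_errors(serialized):
    """Stand-in consumer: reload the rows and check relation targets resolve."""
    rows = [json.loads(line) for line in serialized.splitlines() if line]
    known = {row["structure_id"] for row in rows}
    return [
        f"{row['structure_id']}: dangling relation to {target}"
        for row in rows
        for target in row["relations"]
        if target not in known
    ]


def can_promote(rows):
    """Promotion requires both schema validation and the smoke pass."""
    if schema_errors(rows):
        return False
    serialized = "\n".join(json.dumps(row) for row in rows)
    return not smoke_errors(serialized)
```

A row whose relation points at a missing structure ID passes `schema_errors` cleanly and still fails `can_promote`, which is exactly the case schema validation alone would let through.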

What we keep and what we throw away

The public materials imply a straightforward keep/discard policy.

Keep:

  • pinned public source files
  • structured license and provenance metadata
  • build-aware metadata that affects parsing or chunk meaning
  • typed structure-aware exports
  • versioned columnar artifacts that pass schema and smoke checks

Throw away or keep out of the promoted snapshot:

Practical checklist

That is the detailed corpus-construction story the public files support. It is narrow enough to defend, concrete enough to implement, and much more useful than a generic claim about “training on lots of open-source C++.”

FAQ

Frequently asked questions

Why not just train on raw repository dumps?
Because raw dumps hide the hard parts: revision pinning, license state, deduplication, build context, and schema stability. Those are the parts that decide whether a corpus is reproducible.

Why keep compile commands separate from lexical chunks?
Because build context changes what C and C++ code means. The lexical text and the build metadata need to survive as separate typed surfaces.

What if the compilation database covers only part of a repository?
Then the corpus should keep two lanes legible instead of pretending one partial semantic pass is complete truth: a broad syntax-first lane for coverage, and a compile-aware lane for the files whose build context is actually known. Reader-facing outputs should say whether a gap came from missing build metadata or from a parser/tooling miss. Compile Commands and Semantic Graphs and The C/C++ Data Preparation Pipeline, End to End cover that split in more detail.

Why deduplicate before AST or lexical chunking instead of after?
Because whole-source dedup removes repeated boilerplate before it can dominate chunk statistics or fragment structure-aware exports.

Why require a consumer smoke pass if the schema already validates?
Because schema validation only proves that the rows are internally well-typed. Promotion should also prove that at least one downstream reader can load the snapshot without ad hoc repair.

What counts as "obviously generated noise" here?
Only the cheap, reviewable cases: explicit generated markers such as "// Generated by" or "DO NOT EDIT", extreme line geometry that looks like emitted output instead of maintained code, and binary-in-ASCII dumps. The point is to strip machine-emitted boilerplate before dedup and chunking without pretending that this filter settles every harder provenance or quality question; those stay with the narrower rules in The C/C++ Data Preparation Pipeline, End to End and License Hygiene and Provenance for a C++ Training Corpus.

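The three cheap cases in the last answer can be sketched as one reviewable predicate. Every threshold here (header window, maximum line length, hex-literal share) and the function name are assumptions chosen for illustration, not values taken from the MegaCpp pipeline.

```python
import re

# Explicit markers named in the answer above.
_MARKERS = re.compile(r"//\s*Generated by|DO NOT EDIT", re.IGNORECASE)
# Byte-array style hex literals such as "0x1f," used by embedded binary dumps.
_HEX_BYTE = re.compile(r"0x[0-9a-fA-F]{2}\s*,")


def looks_generated(text, header_lines=20, max_line_len=1000, max_hex_share=0.5):
    """Flag only the obvious cases; harder provenance questions stay elsewhere."""
    lines = text.splitlines()
    # 1. Explicit generated markers near the top of the file.
    if any(_MARKERS.search(line) for line in lines[:header_lines]):
        return True
    # 2. Extreme line geometry typical of emitted output.
    if any(len(line) > max_line_len for line in lines):
        return True
    # 3. Binary-in-ASCII dumps: most of the file is hex byte literals.
    hex_chars = sum(len(m.group(0)) for m in _HEX_BYTE.finditer(text))
    return bool(text) and hex_chars / len(text) > max_hex_share
```

Each branch corresponds to one sentence of the policy, so a reviewer can audit false positives per rule rather than against one opaque score.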
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

compile command database

The compilation-command manifest that grounds semantic indexing, compiler receipts, and structure-aware data enrichment.

Semantic indexing

The structure-aware indexing lane that turns compile commands and parsed symbols into reusable training metadata.

Data pipeline

An honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

Dataset versions

What changed between v2, v3, v4, v5, and v6 of the C++ training corpus, why each step happened, why we kept the older formats backwards-compatible,…
