License Hygiene and Provenance for a C++ Training Corpus
How MegaCpp describes source provenance, revision pinning, SPDX metadata, and refusal-list rules for a public C/C++ corpus narrative without overstating legal certainty.

A C++ training corpus should not be described with one blanket licensing sentence. Public source corpora mix Apache-2.0, BSL-1.0, MIT, GPL-family licenses, exception clauses, repository-level LICENSE files, and files that only carry provenance through version control history. If a corpus description ignores that mix, it stops being a provenance statement and turns into marketing.
If you want the narrowest checked-in proof surfaces before the rest of the prose, start with Reference corpus pins, Data prep notes, the data example index, Enriched record normalization example, and Enriched JSONL record to parquet.
In this article, SPDX means the machine-readable license expression recorded for a file or repo input, SWHID means a stable Software Heritage identifier for an archived source object or revision, and a refusal list is the explicit set of source families that remain out of scope under current policy. C++ data versioning and schema is the nearby consumer-contract view of the same record.
Why this matters
License hygiene is not only about compliance review. It is also a data-correctness property. If a model output or regression sends you back to the corpus, you need to know which repository, revision, and file family produced the relevant tokens.
1. A provenance-first corpus story
The public story does not need to be an exhaustive inventory of every candidate source. It does need a strong admission rule. The data-prep flow in C++ data preparation pipeline deep dive provides one: pin every upstream input to an explicit tag, commit hash, or dataset revision, and treat license metadata as structured data. The naming and ledger side in C++ data versioning and schema adds the minimum fields that make those inputs auditable.
That is enough to define a useful working rule: a source is not "in the corpus" in any meaningful reproducible sense until it has those fields.
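A minimal sketch of that working rule follows. The field names (`source_url`, `revision`, `license_expression`, `license_source`, `retrieved_on`, `schema_version`, `swhid`) are assumptions chosen for illustration; the actual ledger fields are whatever the schema article defines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LedgerEntry:
    # Hypothetical field names; not the real MegaCpp schema.
    source_url: str
    revision: str              # tag, commit hash, or dataset revision
    license_expression: str    # SPDX expression kept as metadata
    license_source: str        # e.g. "file-header", "repo-metadata", "scanner"
    retrieved_on: str          # ISO date of retrieval
    schema_version: str
    swhid: Optional[str] = None  # optional second anchor


REQUIRED = ("source_url", "revision", "license_expression",
            "license_source", "retrieved_on", "schema_version")


def missing_fields(entry: LedgerEntry) -> list[str]:
    """Return the required fields that are empty or whitespace-only."""
    return [name for name in REQUIRED if not getattr(entry, name).strip()]


def is_admitted(entry: LedgerEntry) -> bool:
    """A source counts as 'in the corpus' only when nothing is missing."""
    return not missing_fields(entry)
```

Returning the list of missing fields, rather than only a boolean, lets a review queue say why a source is still outside the corpus.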
2. Describe the license mix honestly
The public pinning and data-prep notes do not claim a single-corpus license. They imply the opposite: the corpus is a set of pinned inputs, each with its own metadata.
For a C/C++ corpus built from common public infrastructure projects, a realistic mix will often include categories like these:
| Source family | Common license patterns |
|---|---|
| toolchains and infra libraries | Apache-2.0, BSD-3-Clause, MIT |
| Boost-family inputs | BSL-1.0 |
| test and utility libraries | MIT, BSD-style |
| kernel-adjacent headers and systems code | GPL-family, LGPL-family, or exception-bearing variants |
The exact operational set should be reported from the ledger, not reconstructed in prose.
Repository metadata can seed the queue, but the ledger should preserve when a file header, subdirectory notice, or later audit overruled that first impression. Code Deduplication at Scale is the nearby explanation for why identity, license, and admission all have to survive repeated copies without turning one root notice into a blanket answer.
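One way to keep both ideas together, reporting from the ledger and preserving overrides, is sketched below. The detection-source labels and their precedence order are assumptions made for this example; the text above only says that a file header, subdirectory notice, or later audit can overrule the repository-level first impression.

```python
from collections import Counter

# Assumed precedence, lowest to highest. Only "repo metadata comes first"
# is implied by the article; the rest of the ordering is illustrative.
PRECEDENCE = ["repo-metadata", "subdirectory-notice", "file-header", "audit"]


def effective_license(detections):
    """Pick the highest-precedence detection and keep the full history.

    `detections` is a list of (source, spdx_expression) tuples.
    """
    best = max(detections, key=lambda d: PRECEDENCE.index(d[0]))
    return {
        "expression": best[1],
        "source": best[0],
        "history": list(detections),  # what was overruled stays visible
    }


def license_mix(ledger):
    """Count effective license expressions across ledger rows."""
    return Counter(effective_license(row)["expression"] for row in ledger)
```

Because the history is retained, a later reader can see that a row reported as BSL-1.0 started life as an MIT guess from the repository root.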
3. SPDX detection should stay in metadata even if training text is normalized
The data-prep note says to treat license metadata as structured data, not prose. That one line carries an important consequence: SPDX should stay a machine-readable license expression attached to the record even when other text normalization steps strip comments, boilerplate, or repeated headers from the training-facing representation.
A practical scan order looks like this:
- Check the file header for SPDX-License-Identifier markers and parse them as SPDX expressions.
- If that fails, fall back to repository-level license metadata or a dedicated license scanner.
- Record both the detected expression and the source of the detection.
- Keep that metadata outside the model-facing token stream.
The point is deliberately narrow: SPDX expressions and scanner outputs are useful metadata fields, not a claim that one scan settles every legal question.
Native SPDX-License-Identifier headers are therefore best treated as a fast path, not as the whole policy. Some mature systems projects do carry explicit per-file SPDX tags, but a public C/C++ corpus still has to assume that many admissible sources will fall back to repository metadata or slower file-level scanning. That is another reason to record the detection source alongside the expression itself: later audits need to know whether a license field came from a file header, a repo-level notice, or a scanner pass.
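The scan order above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the function name, the 30-line header window, the source labels, and the choice to try repository metadata before the scanner are all hypothetical, and the regex extracts the expression text without validating it against the SPDX grammar.

```python
import re

# Captures the text after "SPDX-License-Identifier:" up to end of line
# or a closing "*/". It does not validate the expression.
_SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*([^\n\r*]+)")


def detect_license(file_text, repo_license=None, scanner=None, header_lines=30):
    """Return the detected expression together with where it came from."""
    header = "\n".join(file_text.splitlines()[:header_lines])
    match = _SPDX_RE.search(header)
    if match:
        return {"expression": match.group(1).strip(), "source": "file-header"}
    if repo_license:
        return {"expression": repo_license, "source": "repo-metadata"}
    if scanner is not None:
        found = scanner(file_text)  # hypothetical callable returning str or None
        if found:
            return {"expression": found, "source": "scanner"}
    return {"expression": None, "source": "undetected"}
```

The result is a side record. Nothing here touches the text that goes to tokenization, which is what keeps the metadata outside the model-facing stream.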
4. Provenance means pinned revisions, not floating branches
The public pinning note is explicit on this point: do not publish training or evaluation claims against floating main, master, or tip. That rule matters as much for provenance as it does for benchmarking. A floating branch is not a reproducible input.
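A cheap guard against floating references might look like the sketch below. The name list extends the `main`, `master`, and `tip` examples with a few assumed extras, and the function is deliberately conservative: a string check cannot tell a tag from a branch, so anything that is not a full commit hash is sent to review rather than accepted.

```python
import re

# "main", "master", "tip" come from the pinning rule; the rest are assumed.
FLOATING_NAMES = {"main", "master", "tip", "head", "trunk", "latest"}

# Full SHA-1 (40 hex) or SHA-256 (64 hex) object names.
_FULL_HASH = re.compile(r"[0-9a-f]{40}(?:[0-9a-f]{24})?")


def classify_ref(ref: str) -> str:
    """Classify a revision string as 'commit', 'floating', or 'needs-review'."""
    value = ref.strip().lower()
    if _FULL_HASH.fullmatch(value):
        return "commit"
    if value == "" or value in FLOATING_NAMES:
        return "floating"
    # Could be a tag or an arbitrary branch; a human or a resolver must decide.
    return "needs-review"
```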
Optional SWHIDs are useful here as a second anchor. They do not replace repository commits, but they do provide a stable cross-host provenance pointer when one is available. The checked-in Reference corpus pins note already frames SWHIDs the right way: optional, audit-friendly, and never a substitute for the concrete revision pin. Because the core identifier is intrinsic to the object, attaching one does not require a live archive lookup during ingestion; it can be computed locally and checked later against the same object identity.
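For file contents, the local computation is short enough to show. The sketch assumes the core `swh:1:cnt:` form, whose hash is the same SHA-1 that Git assigns to a blob; the function name is illustrative, and directory, revision, and qualified identifiers are out of scope here.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute a core content SWHID locally, with no archive lookup."""
    # Git blob hashing: header "blob <size>\0" followed by the raw bytes.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


# The empty file has a well-known Git blob hash.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the identifier depends only on the bytes, two hosts that ingest the same file arrive at the same value, which is what makes it usable as a cross-host pointer.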
5. Build metadata needs provenance too
The checked-in Compile-commands fixture is small, but together with compile commands and semantic graphs it shows an important part of provenance work that is often missed: build context is data. Include paths, language mode flags, and the compilation unit path are all part of what later structure-aware stages may consume.
The normalization side of that contract is visible in Compile commands context example, which keeps the build metadata typed and public-safe instead of flattening it into free-form text.
That typed normalization matters because the Clang JSON Compilation Database allows either an arguments array or a shell-escaped command string, and the structured arguments form is the less ambiguous one to pin and replay. Provenance work should preserve that distinction instead of forcing later consumers to guess how a command line was split.
If build context is used during chunking or metadata extraction, it should be pinned and versioned like any other source input. Otherwise a corpus can drift even when the source repository revision stays fixed.
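A rough sketch of typed normalization plus pinning is shown below. The keys `directory`, `file`, `arguments`, and `command` come from the compilation database format; `arguments_form`, the function names, and the digest scheme are assumptions for this example. `shlex.split` approximates POSIX shell splitting and may not match every build system's quoting, which is one more reason to record which form was present.

```python
import hashlib
import json
import shlex


def normalize_entry(entry: dict) -> dict:
    """Turn one compile_commands.json entry into typed fields."""
    if "arguments" in entry:
        args, form = list(entry["arguments"]), "arguments"
    elif "command" in entry:
        # We had to split a shell string; remember that in the record.
        args, form = shlex.split(entry["command"]), "command"
    else:
        raise ValueError("entry has neither 'arguments' nor 'command'")
    return {
        "directory": entry["directory"],
        "file": entry["file"],
        "arguments": args,
        "arguments_form": form,
    }


def build_context_digest(entries: list[dict]) -> str:
    """Stable digest of the normalized build context, usable as a pin."""
    normalized = [normalize_entry(e) for e in entries]
    payload = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Storing the digest next to the source revision makes build-context drift visible: if the digest changes while the commit does not, the corpus changed for a reason that the repository pin alone would not reveal.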
6. A refusal list is part of the public contract
A refusal list is part of the public contract: the explicit set of source families we can describe but do not currently admit into the training snapshot. Some sources should remain outside the current operational corpus because they are gated, ambiguously licensed, dominated by generated code, or difficult to pin in a way that supports public claims.
That is not a weakness. It is a sign that the provenance story is honest enough to say "not yet" or "not under this policy."
It is also a good reason to keep the refusal list structured rather than flat. Some exclusions are unconditional, but others depend on what ingestion discovered: a contradictory file-level notice, missing provenance fields, or build-linked context that changed the admission decision.
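As an illustration of "structured rather than flat", the sketch below models each rule with an optional source-family scope and an optional condition on what ingestion found. All rule ids, family names, and record keys are hypothetical, and plain string inequality stands in for a real check of whether two license notices actually contradict each other.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RefusalRule:
    rule_id: str
    reason: str
    source_family: Optional[str] = None  # None: applies to every family
    condition: Optional[Callable[[dict], bool]] = None  # None: unconditional

    def applies(self, record: dict) -> bool:
        if self.source_family is not None:
            if record.get("source_family") != self.source_family:
                return False
        return True if self.condition is None else self.condition(record)


# Hypothetical rules: one unconditional, two driven by ingestion findings.
RULES = [
    RefusalRule("R1", "gated source", source_family="gated-example"),
    RefusalRule("R2", "contradictory file-level notice",
                condition=lambda r: r.get("header_license")
                not in (None, r.get("repo_license"))),
    RefusalRule("R3", "missing provenance fields",
                condition=lambda r: not r.get("revision")),
]


def refusals(record: dict) -> list[str]:
    """Return the ids of every rule that keeps this record out."""
    return [rule.rule_id for rule in RULES if rule.applies(record)]
```

Returning rule ids rather than a bare yes or no keeps the "not yet" and "not under this policy" cases distinguishable in later audits.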
Operational checklist
- Pin every source to an exact tag, commit, or dataset revision.
- Store license metadata as structured side data.
- Keep provenance fields alongside schema version and retrieval date.
- Preserve build metadata when it affects downstream extraction.
- Report the actual license mix from the ledger, not from memory.
- Use a refusal list for sources that do not satisfy the current provenance rules.
- Attach SWHIDs when available as an optional, stable public reference alongside the revision pin.
The useful public claim is therefore narrow and defensible: the corpus is described by pinned inputs plus structured provenance metadata, not by a vague sentence about “public C++ code.”
Frequently asked questions
Why insist on pinned revisions instead of repository names alone?
If some files already carry native SPDX headers, why keep detection-source fields?
Where does SPDX 3.0 fit in this ledger?
Do SWHIDs require a live Software Heritage lookup during ingestion?
Why not treat the repository root license as the final answer?
Why does compile-commands provenance care about arguments versus command?
arguments, which is why Compile commands context example normalizes compile metadata into typed fields instead of passing raw command text through unchanged.
Can license scanner results be cached by content identity?
What is the refusal list doing operationally?
Should refusal rules be only repository-level?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
The compilation-command manifest that grounds semantic indexing, compiler receipts, and structure-aware data enrichment.
An honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…
The structure-aware indexing lane that turns compile commands and parsed symbols into reusable training metadata.
The rule that source corpora and shard manifests are pinned before materialization so training rows can be reproduced exactly.