David Gornshtein · 6 min read

License Hygiene and Provenance for a C++ Training Corpus

How MegaCpp describes source provenance, revision pinning, SPDX metadata, and refusal-list rules for a public C/C++ corpus narrative without overstating legal certainty.

A C++ training corpus should not be described with one blanket licensing sentence. Public source corpora mix Apache-2.0, BSL-1.0, MIT, GPL-family licenses, exception clauses, repository-level LICENSE files, and files that only carry provenance through version control history. If a corpus description ignores that mix, it stops being a provenance statement and turns into marketing.

If you want the narrowest checked-in proof surfaces before the rest of the prose, start with Reference corpus pins, Data prep notes, the data example index, Enriched record normalization example, and Enriched JSONL record to parquet.

In this article, SPDX means the machine-readable license expression recorded for a file or repo input, SWHID means a stable Software Heritage identifier for an archived source object or revision, and a refusal list is the explicit set of source families that remain out of scope under current policy. C++ data versioning and schema is the nearby consumer-contract view of the same record.

Why this matters

License hygiene is not only about compliance review. It is also a data-correctness property. If a model output or regression sends you back to the corpus, you need to know which repository, revision, and file family produced the relevant tokens.

1. A provenance-first corpus story

The public story does not need to be an exhaustive inventory of every candidate source. It does need a strong admission rule. The data-prep flow in C++ data preparation pipeline deep dive provides one: pin every upstream input to an explicit tag, commit hash, or dataset revision, and treat license metadata as structured data. The naming and ledger side in C++ data versioning and schema adds the minimum fields that make those inputs auditable.

That is enough to define a useful working rule: a source is not "in the corpus" in any meaningful reproducible sense until it has those fields.
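
The working rule above can be sketched as a small admission check. This is a minimal illustration, not the MegaCpp ledger schema: the class name, the field names, and the set of required fields are all assumptions chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# Hypothetical required fields; the real ledger schema may differ.
REQUIRED_FIELDS = ("source_id", "revision", "license_expression", "license_source")


@dataclass(frozen=True)
class SourceRecord:
    source_id: str                      # repository or dataset handle
    revision: Optional[str]             # exact tag, commit hash, or dataset revision
    license_expression: Optional[str]   # SPDX expression kept as structured data
    license_source: Optional[str]       # where the expression was detected


def missing_fields(record: SourceRecord) -> list[str]:
    """Return the required fields that are empty or absent."""
    return [name for name in REQUIRED_FIELDS if not getattr(record, name)]


def is_in_corpus(record: SourceRecord) -> bool:
    """A source counts as 'in the corpus' only when every required field is set."""
    return not missing_fields(record)
```

A record with a repository name but no revision would fail this check, which is the point: the name alone is a discovery handle, not a reproducible input.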

2. Describe the license mix honestly

The public pinning and data-prep notes do not claim a single-corpus license. They imply the opposite: the corpus is a set of pinned inputs, each with its own metadata.

For a C/C++ corpus built from common public infrastructure projects, a realistic mix will often include categories like these:

| Source family | Common license patterns |
| --- | --- |
| toolchains and infra libraries | Apache-2.0, BSD-3-Clause, MIT |
| Boost-family inputs | BSL-1.0 |
| test and utility libraries | MIT, BSD-style |
| kernel-adjacent headers and systems code | GPL-family, LGPL-family, or exception-bearing variants |

The exact operational set should be reported from the ledger, not reconstructed in prose.
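
Reporting from the ledger can be as simple as counting recorded expressions. The sketch below assumes ledger rows are dictionaries with hypothetical `license_expression` and `admitted` keys; the actual row shape is not specified in this article.

```python
from collections import Counter


def license_mix(ledger_rows):
    """Count admitted rows per recorded SPDX expression.

    Rows without a truthy 'admitted' flag are skipped, so refused
    sources never inflate the reported mix.
    """
    return Counter(
        row["license_expression"] for row in ledger_rows if row.get("admitted")
    )
```

The resulting counts come from what was actually recorded at admission time, so the reported mix changes only when the ledger does.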

Repository metadata can seed the queue, but the ledger should preserve when a file header, subdirectory notice, or later audit overruled that first impression. Code Deduplication at Scale is the nearby explanation for why identity, license, and admission all have to survive repeated copies without turning one root notice into a blanket answer.

3. SPDX detection should stay in metadata even if training text is normalized

The data-prep note says to treat license metadata as structured data, not prose. That one line carries an important consequence: SPDX should stay a machine-readable license expression attached to the record even when other text normalization steps strip comments, boilerplate, or repeated headers from the training-facing representation.

A practical scan order looks like this:

  1. Check the file header for SPDX-License-Identifier markers and parse them as SPDX expressions.
  2. If that fails, fall back to repository-level license metadata or a dedicated license scanner.
  3. Record both the detected expression and the source of the detection.
  4. Keep that metadata outside the model-facing token stream.

The point is deliberately narrow: SPDX expressions and scanner outputs are useful metadata fields, not a claim that one scan settles every legal question.

Native SPDX-License-Identifier headers are therefore best treated as a fast path, not as the whole policy. Some mature systems projects do carry explicit per-file SPDX tags, but a public C/C++ corpus still has to assume that many admissible sources will fall back to repository metadata or slower file-level scanning. That is another reason to record the detection source alongside the expression itself: later audits need to know whether a license field came from a file header, a repo-level notice, or a scanner pass.
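
The scan order and the detection-source field can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, the `detection_source` labels, and the 20-line header window are hypothetical, and the code only extracts the raw tag text. Validating it as an SPDX expression, or running a dedicated scanner as a further fallback, is left out.

```python
import re
from typing import Optional

# Matches the standard tag; the rest of the line is taken as the expression.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def detect_license(
    file_text: str,
    repo_license: Optional[str] = None,
    header_lines: int = 20,  # assumed window for "file header"
) -> dict:
    """Return the detected expression together with where it came from."""
    # Fast path: a native SPDX tag near the top of the file.
    for line in file_text.splitlines()[:header_lines]:
        match = SPDX_TAG.search(line)
        if match:
            # Drop a trailing block-comment terminator such as "*/".
            expression = match.group(1).strip().rstrip("*/").strip()
            return {"expression": expression, "detection_source": "file-header"}
    # Fallback: repository-level metadata supplied by the caller.
    if repo_license:
        return {"expression": repo_license, "detection_source": "repo-metadata"}
    # Nothing found: leave the record unresolved rather than guessing.
    return {"expression": None, "detection_source": "unresolved"}
```

The returned dictionary stays in side metadata; nothing here writes into the model-facing text.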

4. Provenance means pinned revisions, not floating branches

The public pinning note is explicit on this point: do not publish training or evaluation claims against floating main, master, or tip. That rule matters as much for provenance as it does for benchmarking. A floating branch is not a reproducible input.
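
A pre-publication check for floating references might look like the sketch below. The function name and the three labels are hypothetical, and treating `HEAD` as floating is an added assumption beyond the names listed above. A name alone cannot prove that a tag is immutable, so anything that is not a full commit hash is reported as `named` and should be resolved to a commit before it is recorded.

```python
import re

# "main", "master", and "tip" come from the pinning rule; "head" is an added assumption.
FLOATING_NAMES = {"main", "master", "tip", "head"}

# Full SHA-1 (40 hex) or SHA-256 (64 hex) object names.
FULL_COMMIT = re.compile(r"^(?:[0-9a-f]{40}|[0-9a-f]{64})$")


def classify_ref(ref: str) -> str:
    """Classify a revision string as 'commit', 'floating', or 'named'."""
    ref = ref.strip()
    if FULL_COMMIT.match(ref.lower()):
        return "commit"      # reproducible as written
    if ref.lower() in FLOATING_NAMES:
        return "floating"    # must not back a published claim
    return "named"           # tag or branch name: resolve to a commit first
```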

Optional SWHIDs are useful here as a second anchor. They do not replace repository commits, but they do provide a stable cross-host provenance pointer when one is available. The checked-in Reference corpus pins note already frames SWHIDs the right way: optional, audit-friendly, and never a substitute for the concrete revision pin. Because the core identifier is intrinsic to the object, attaching one does not require a live archive lookup during ingestion; it can be computed locally and checked later against the same object identity.
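
As an illustration of the local computation, the sketch below derives the core SWHID for a single file's content. A content SWHID uses the same hash Git assigns to a blob: SHA-1 over the header `blob <length>\0` followed by the raw bytes. The function name is hypothetical, and only content objects are covered here; directory and revision identifiers are built from different manifests.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the core SWHID for file content without any network lookup."""
    # Same construction as a Git blob id: header, NUL byte, then the content.
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


# The empty file maps to the well-known empty-blob hash.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the identifier depends only on the bytes, the same value can be recomputed later against an archived copy to confirm that the object identity still matches.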

5. Build metadata needs provenance too

The checked-in Compile-commands fixture is small, but together with compile commands and semantic graphs it shows an important part of provenance work that is often missed: build context is data. Include paths, language mode flags, and the compilation unit path are all part of what later structure-aware stages may consume.

The normalization side of that contract is visible in Compile commands context example, which keeps the build metadata typed and public-safe instead of flattening it into free-form text.

That typed normalization matters because the Clang JSON Compilation Database allows either an arguments array or a shell-escaped command string, and the structured arguments form is the less ambiguous one to pin and replay. Provenance work should preserve that distinction instead of forcing later consumers to guess how a command line was split.
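
One way to preserve that distinction is to normalize every entry to an argument list while recording which form it arrived in. The sketch below is an assumption-laden illustration, not the checked-in normalizer: the output keys `argv` and `original_form` are hypothetical, and `shlex.split` applies POSIX quoting rules, which may not match databases generated on other platforms.

```python
import shlex


def normalize_entry(entry: dict) -> dict:
    """Normalize one compilation-database entry to a typed record."""
    if "arguments" in entry:
        # Already split by the producer: keep it exactly as given.
        argv, form = list(entry["arguments"]), "arguments"
    elif "command" in entry:
        # Shell-escaped string: split it once, here, and remember that we did.
        argv, form = shlex.split(entry["command"]), "command"
    else:
        raise ValueError("entry has neither 'arguments' nor 'command'")
    return {
        "directory": entry["directory"],
        "file": entry["file"],
        "argv": argv,
        "original_form": form,  # lets audits see whether splitting was inferred
    }
```

Recording `original_form` means a later consumer can tell a producer-supplied argument list from one that was reconstructed by splitting a string.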

If build context is used during chunking or metadata extraction, it should be pinned and versioned like any other source input. Otherwise a corpus can drift even when the source repository revision stays fixed.

6. A refusal list is part of the public contract

The refusal list is the explicit set of source families we can describe but do not currently admit into the training snapshot. Some sources should remain outside the current operational corpus because they are gated, ambiguously licensed, dominated by generated code, or difficult to pin in a way that supports public claims.

That is not a weakness. It is a sign that the provenance story is honest enough to say "not yet" or "not under this policy."

It is also a good reason to keep the refusal list structured rather than flat. Some exclusions are unconditional, but others depend on what ingestion discovered: a contradictory file-level notice, missing provenance fields, or build-linked context that changed the admission decision.
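
A structured refusal record could be sketched like this. The type names, the trigger labels, and the `scope` and `reason` fields are hypothetical; they mirror the distinctions described above rather than any published schema.

```python
from dataclasses import dataclass
from enum import Enum


class Trigger(Enum):
    POLICY = "policy"                          # unconditional exclusion
    FILE_NOTICE = "file-level-notice"          # contradictory notice found in a file
    MISSING_PROVENANCE = "missing-provenance"  # required fields could not be filled
    BUILD_CONTEXT = "build-linked-context"     # build metadata changed the decision


@dataclass(frozen=True)
class RefusalRule:
    scope: str        # source family, repository, or path pattern the rule covers
    reason: str       # short human-readable justification
    trigger: Trigger  # what caused the rule to apply

    @property
    def conditional(self) -> bool:
        """Rules discovered during ingestion are conditional; policy rules are not."""
        return self.trigger is not Trigger.POLICY


def split_rules(rules):
    """Separate unconditional exclusions from conditional ones for audit."""
    unconditional = [rule for rule in rules if not rule.conditional]
    conditional = [rule for rule in rules if rule.conditional]
    return unconditional, conditional
```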

Operational checklist

  • Pin every source to an exact tag, commit, or dataset revision.
  • Store license metadata as structured side data.
  • Keep provenance fields alongside schema version and retrieval date.
  • Preserve build metadata when it affects downstream extraction.
  • Report the actual license mix from the ledger, not from memory.
  • Use a refusal list for sources that do not satisfy the current provenance rules.
  • Attach SWHIDs when available as an optional, stable public reference.

The useful public claim is therefore narrow and defensible: the corpus is described by pinned inputs plus structured provenance metadata, not by a vague sentence about “public C++ code.”

FAQ

Frequently asked questions

Why insist on pinned revisions instead of repository names alone?
Because repository identity is not enough to reproduce a corpus. Without an exact revision, the input can drift while the prose still sounds the same, and any later eval or provenance audit becomes ambiguous. Reference corpus pins is the compact checked-in checklist.
If some files already carry native SPDX headers, why keep detection-source fields?
Because native SPDX coverage is helpful but uneven in public C/C++ code. A provenance record still needs to say whether the stored expression came from a file header, repository metadata, or a fallback scan.
Where does SPDX 3.0 fit in this ledger?
Treat SPDX 3.0 as an interchange shape for the same pinned facts, not as the admission rule itself. Dataset and AI profiles can help export the corpus story later, but the local ledger still has to carry the source pin, license expression, detection source, build context, and refusal reason first. That keeps C++ data versioning and schema responsible for consumer-facing meaning instead of hiding policy decisions inside a standards label.
Do SWHIDs require a live Software Heritage lookup during ingestion?
No. The core SWHID is intrinsic to the object itself, so a provenance pipeline can compute it locally and keep it as an optional audit anchor alongside the concrete repository pin. If an archive copy is consulted later, the same identifier is what lets the pipeline check that the object identity still matches.
Why not treat the repository root license as the final answer?
Because root metadata often describes only part of a C/C++ tree. Vendored subdirectories, generated fragments, and contradictory file-level notices are common enough that a permissive repository label is only a starting point. The final provenance record still needs the narrower detection source that justified admission for the file or path that actually entered the corpus.
Why does compile-commands provenance care about arguments versus command?
Because a structured argv list is easier to pin and replay faithfully than a shell-escaped string. The Clang compilation-database spec allows both, but the less ambiguous form is arguments, which is why Compile commands context example normalizes compile metadata into typed fields instead of passing raw command text through unchanged.
Can license scanner results be cached by content identity?
Yes, but only as scanner evidence. A content-identity cache can avoid repeating the same expensive license scan for identical files, just as Code Deduplication at Scale treats exact identity as a first pass before broader grouping. The admission ledger still has to keep the repository pin, path context, detection source, build context, and refusal decision, because two identical files can arrive through different source families and policy gates.
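
A minimal sketch of that separation, assuming a caller-supplied `scanner` callable and hypothetical ledger keys: the cache is keyed by a SHA-256 of the bytes, while the ledger row still carries its own pin and path.

```python
import hashlib


class ScanCache:
    """Cache scanner output by content hash; stores evidence, not decisions."""

    def __init__(self, scanner):
        self._scanner = scanner   # any callable taking bytes (hypothetical)
        self._results = {}
        self.scans_run = 0        # counts real scanner invocations

    def scan(self, data: bytes):
        key = hashlib.sha256(data).hexdigest()
        if key not in self._results:
            self._results[key] = self._scanner(data)
            self.scans_run += 1
        return self._results[key]


def ledger_row(cache: ScanCache, data: bytes, repo_pin: str, path: str) -> dict:
    """Build a per-source row: shared evidence, separate provenance context."""
    return {
        "repo_pin": repo_pin,
        "path": path,
        "scanner_evidence": cache.scan(data),
    }
```

Two identical files from different repositories trigger one scan but still produce two ledger rows, each with its own pin and path.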
What is the refusal list doing operationally?
It keeps the public claim honest by explicitly naming source families that are public but still out of scope under the current provenance rules. In practice it is part of the same admission ledger as the pin, schema version, and license fields described in Reference corpus pins and C++ data versioning and schema.
Should refusal rules be only repository-level?
No. Repository names are discovery handles, not admission decisions. A useful refusal record should keep the trigger scope, reason, and whether the rule came from a file-level notice, missing provenance, or build-linked context. That lets later audits separate unconditional exclusions from conditional ones without exposing private inventory details.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

compile command database

The compilation-command manifest that grounds semantic indexing, compiler receipts, and structure-aware data enrichment.

Data pipeline

An honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…

Semantic indexing

The structure-aware indexing lane that turns compile commands and parsed symbols into reusable training metadata.

Input pinning

The rule that source corpora and shard manifests are pinned before materialization so training rows can be reproduced exactly.
