Compile Commands and Semantic Graphs: Why C++ Training Needs Real Build Context
How compilation-database-driven semantic extraction improves C++ corpus quality, where clang indexers fail, and why build-aware graphs matter more than raw text proximity.

C++ is not just text. It is a collection of translation units compiled under concrete flags, include paths, defines, generated headers, and standard-library choices. If a training pipeline ignores that build context, it can still produce useful syntax-heavy examples, but it will routinely blur the cross-file relationships that matter most on real repositories.
That is why MegaCpp keeps two different extraction lanes. The broad lane is syntax-first and optimized for coverage. The narrow lane is build-aware and optimized for semantic trust. We do not treat those lanes as interchangeable, because they are not. The checked-in public proof surfaces for that split are Semantic indexing notes, Data and masking examples, and Compile commands context example.
On first use, compile command database means the Clang JSON Compilation Database: the checked-in or generated file that describes how each translation unit was compiled. A translation unit is the compiler's view of one source file plus the headers, macros, and flags that shape that compile. A semantic graph here is the exported set of higher-trust call, type, and dependency relations extracted under that build contract.
Here, the syntax-first lane means lightweight chunking and local structure without compiler-resolved symbols, while the build-aware lane means replaying real compile metadata before emitting graph facts.
Why compilation databases change the data story
Clang tooling has a standard way to describe how a repository was compiled: the compile command database. In the best case, that file gives each translation unit the exact arguments, include paths, and language mode that the real build used. Once that metadata is available, semantic extraction can move from "these two files look related" toward "this symbol actually resolves under the project's build."
That difference matters for training data:
- call edges become more trustworthy
- type references stop depending only on local syntax
- header and implementation relationships are less likely to be guessed wrong
- build-specific symbols become visible under the same flags the compiler used
For C++, that is not a cosmetic improvement. It is the difference between a model learning a plausible file neighborhood and a model learning a compiler-level view of the codebase. The more expensive continuation of that same idea is the full clang semantic indexing lane, where the compiler view is materialized directly instead of approximated.
What MegaCpp keeps separate
MegaCpp's public pipeline keeps a layered contract:
| Lane | Input | Strength | Typical limitation |
|---|---|---|---|
| syntax-first lane | file text and lightweight structure | broad coverage, fast chunking | weak on cross-file semantics |
| build-aware lane | build database plus Clang tooling | stronger symbol and type resolution | fragile when build metadata is incomplete |
That separation is deliberate. A successful syntax pass is not evidence that a repository had complete build context. A successful build-aware pass is not evidence that every file in the repository indexed cleanly.
What the build database is and is not
The build database is a valuable input, but it is not magic. A minimal sample looks like this:
    [
      {
        "directory": "/workspace/build",
        "file": "src/parser.cpp",
        "arguments": [
          "clang++",
          "-std=c++20",
          "-Iinclude",
          "-DMEGACPP_EXAMPLE=1",
          "-c",
          "src/parser.cpp"
        ]
      }
    ]
That record tells Clang tooling what the compiler saw for that translation unit. It does not guarantee that the surrounding environment is still complete. Generated headers can be missing. The database can be stale. Only a subset of targets may be present.
One detail from the Clang spec is worth making explicit: command objects can use either arguments or command, and arguments is the preferred form because it avoids shell-splitting ambiguity. That matters for any downstream indexer that needs compile flags to stay lossless.
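To make the shell-splitting point concrete, here is a minimal Python sketch. The helper name `entry_argv` and the `GREETING` define are hypothetical and exist only for illustration. The sketch also assumes POSIX shell quoting, which is what `shlex.split` implements; a database produced on Windows may need different handling.

```python
import shlex


def entry_argv(entry: dict) -> list[str]:
    """Return the argv list for one compilation-database entry.

    Hypothetical helper: prefers the structured 'arguments' form and
    falls back to shell-unescaping the 'command' string.
    """
    if "arguments" in entry:
        return list(entry["arguments"])
    if "command" in entry:
        # POSIX-style unescaping; this is the step that can go wrong.
        return shlex.split(entry["command"])
    raise ValueError("entry has neither 'arguments' nor 'command'")


# The same compile, once as a structured list and once as a shell string.
structured = {
    "directory": "/workspace/build",
    "file": "src/parser.cpp",
    "arguments": ["clang++", "-DGREETING=hello world", "-c", "src/parser.cpp"],
}
shell_form = {
    "directory": "/workspace/build",
    "file": "src/parser.cpp",
    "command": 'clang++ "-DGREETING=hello world" -c src/parser.cpp',
}

# A naive whitespace split tears the quoted define into two broken tokens.
naive = shell_form["command"].split()
```

With these inputs, `entry_argv` yields the same four-element argv for both entries, while `naive` contains five tokens because the quoted define was split at the space. That is the loss a structured `arguments` list avoids.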
MegaCpp therefore treats build-aware extraction as a high-trust lane, not a blindly trusted lane.
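One way to act on "high-trust, not blindly trusted" is a cheap sanity check before indexing. The sketch below is an assumption about what such a check could look like, not MegaCpp's actual code: `missing_inputs` is a hypothetical helper, and it only understands the joined `-Ipath` spelling in the `arguments` form (it ignores `-I path`, `-isystem`, and the `command` form).

```python
import tempfile
from pathlib import Path


def missing_inputs(entry: dict) -> list[str]:
    """List paths the entry refers to that do not exist on disk.

    Hypothetical helper: checks the source file and joined -I include
    directories, resolving relative paths against 'directory'.
    """
    base = Path(entry["directory"])
    missing = []
    source = base / entry["file"]  # an absolute 'file' replaces base
    if not source.is_file():
        missing.append(str(source))
    for arg in entry.get("arguments", []):
        if arg.startswith("-I") and len(arg) > 2:
            include_dir = base / arg[2:]
            if not include_dir.is_dir():
                missing.append(str(include_dir))
    return missing


# Demo: the source file exists, but the include directory was never generated.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / "src" / "parser.cpp").write_text("int main() { return 0; }\n")
    entry = {
        "directory": tmp,
        "file": "src/parser.cpp",
        "arguments": ["clang++", "-std=c++20", "-Iinclude", "-c", "src/parser.cpp"],
    }
    report = missing_inputs(entry)
```

Here `report` holds a single path, the absent `include` directory. A non-empty report does not prove the compile would fail, but it is a reasonable signal that the entry may be stale or that generated inputs are missing.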
The failure modes that matter in practice
Most semantic-indexing problems are not spectacular crashes. They are partial-truth problems.
The common ones are:
- missing build databases: some repositories simply do not publish one
- partial target coverage: the database exists, but only for part of the tree
- generated-header drift: the command is real, but generated inputs are absent
- macro drift: the build database resolves a different conditional world than the one you intended to index
- silent per-file degradation: some files index cleanly while others fall back or fail
If a pipeline hides those distinctions, the resulting graph looks cleaner than it really is. MegaCpp's safer approach is to record the confidence boundary: which outputs came from build-aware extraction, which came from syntax-only extraction, and where coverage was partial.
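A confidence boundary of this kind can be sketched as a per-file provenance record plus a coverage summary. Everything named below (`FileProvenance`, `classify`, `coverage_summary`, the lane labels, and the example paths) is hypothetical and only illustrates the idea.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FileProvenance:
    path: str
    lane: str  # "build-aware" or "syntax-only"
    note: str = ""


def classify(path: str, db_files: set[str], failed: set[str]) -> FileProvenance:
    """Record which lane produced a file's output, and why."""
    if path not in db_files:
        return FileProvenance(path, "syntax-only", "no compile command")
    if path in failed:
        return FileProvenance(path, "syntax-only", "build-aware indexing failed")
    return FileProvenance(path, "build-aware")


def coverage_summary(records: list[FileProvenance]) -> dict:
    """Summarize how much of the tree is backed by build-aware extraction."""
    counts = Counter(r.lane for r in records)
    total = len(records)
    return {
        "total": total,
        "build_aware": counts["build-aware"],
        "syntax_only": counts["syntax-only"],
        "build_aware_fraction": counts["build-aware"] / total if total else 0.0,
    }


# Example: the database covers only src/, and one covered file failed to index.
repo_files = ["src/parser.cpp", "src/lexer.cpp", "tools/gen.cpp", "tests/t.cpp"]
db_files = {"src/parser.cpp", "src/lexer.cpp"}
failed = {"src/lexer.cpp"}

records = [classify(p, db_files, failed) for p in repo_files]
summary = coverage_summary(records)
```

In this example only one of four files ends up in the build-aware lane, so `summary["build_aware_fraction"]` is 0.25. Keeping the `note` field separates "no compile command" from "indexing failed", which is the distinction between partial target coverage and silent per-file degradation.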
The checked-in public examples also make one implementation detail clearer than the high-level article usually does: the loader-facing contract is not "store raw compiler shell lines forever." Compile commands context example normalizes either arguments or command into a stable record with a preferred filepath, filtered compile args, and compact build metadata.
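As a rough illustration of that normalization, the sketch below turns either form into one stable record. The field names, the `CompileRecord` type, and the filtering rules (drop the compiler executable, `-c`, `-o` and its output, and the input file) are assumptions made for this example; the checked-in example may differ in its details.

```python
import shlex
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class CompileRecord:
    filepath: str                  # absolute path, resolved against 'directory'
    args: tuple[str, ...]          # filtered compile args
    std: Optional[str]             # compact build metadata
    defines: tuple[str, ...]
    include_dirs: tuple[str, ...]


def normalize(entry: dict) -> CompileRecord:
    """Normalize an 'arguments' or 'command' entry into one record shape."""
    if "arguments" in entry:
        argv = list(entry["arguments"])
    else:
        argv = shlex.split(entry["command"])  # assumes POSIX quoting
    filepath = str(PurePosixPath(entry["directory"]) / entry["file"])

    kept = []
    skip_next = False
    for arg in argv[1:]:  # argv[0] is the compiler executable
        if skip_next:
            skip_next = False
        elif arg == "-o":
            skip_next = True  # also drop the output path that follows
        elif arg != "-c" and arg != entry["file"]:
            kept.append(arg)

    std = next((a[len("-std="):] for a in kept if a.startswith("-std=")), None)
    defines = tuple(a[2:] for a in kept if a.startswith("-D") and len(a) > 2)
    include_dirs = tuple(a[2:] for a in kept if a.startswith("-I") and len(a) > 2)
    return CompileRecord(filepath, tuple(kept), std, defines, include_dirs)


from_arguments = normalize({
    "directory": "/workspace/build",
    "file": "src/parser.cpp",
    "arguments": ["clang++", "-std=c++20", "-Iinclude",
                  "-DMEGACPP_EXAMPLE=1", "-c", "src/parser.cpp"],
})
from_command = normalize({
    "directory": "/workspace/build",
    "file": "src/parser.cpp",
    "command": "clang++ -std=c++20 -Iinclude -DMEGACPP_EXAMPLE=1 "
               "-c src/parser.cpp -o parser.o",
})
```

Both inputs produce the same record, which is the point of the contract: downstream loaders see one shape regardless of how the build system wrote the entry.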
Why this affects training quality
The model benefits from build-aware slices even when they cover less raw text, because those slices are disproportionately valuable on the hardest C++ tasks:
- finding the right declaration across files
- connecting template use to the correct definition path
- understanding build-specific includes and generated surfaces
- keeping symbol neighborhoods honest instead of merely adjacent
That is why MegaCpp keeps semantic enrichment in the pipeline. Not because every repository has perfect build metadata, but because the repositories that do have it can supply higher-trust cross-file examples.
What the public contract should say
The public version of the claim is simple:
- MegaCpp uses syntax-first extraction for coverage.
- MegaCpp uses build-aware extraction when real compile metadata exists.
- Those outputs are not treated as equally trustworthy.
- Build-aware outputs are more valuable for cross-file training signals.
That wording is both honest and useful. It captures the value of semantic graphs without pretending that a compile command database alone solves the full indexing problem.
Frequently asked questions
Does MegaCpp require a compilation database for every repository?
Should a generated build database be treated the same as a checked-in one?
Why not treat build-aware extraction as fully trusted once the database exists?
Do generated-header drift or macro drift invalidate the whole repository?
What happens when enriched build-aware columns are malformed?
Why does arguments matter more than command in practice?
arguments avoids shell parsing ambiguity. If an indexer has to recover quoted defines and include paths later, a structured argument list is a much safer input contract.
Which checked-in files show this contract fastest?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Compile command database: The compilation-command manifest that grounds semantic indexing, compiler receipts, and structure-aware data enrichment.
Semantic indexing: The structure-aware indexing lane that turns compile commands and parsed symbols into reusable training metadata.
Data pipeline: An honest walkthrough of how the MegaCpp training data pipeline was built: source selection, filtering, dedup, tokenization, document masking, and…