Published · 7 min read · David Gornshtein

Dataset Versions v2 to v6: The Long-Form Ablation History

A detailed walk through every schema generation of the C++ training corpus: what each version added, the schema diff, the storage cost, the val_bpb delta we attribute to each step, and what we deprecated and why.


The high-level overview of v2 through v6 lives in "v2 to v6: Four Generations of the C++ Dataset, and Why We Kept Them All"; this article is the long-form ablation history for engineers about to add a v7. The discipline that has saved us the most pain is that no version replaced the previous one: they coexist and load through the same tolerant consumer.

On first read, keep three labels separate. A schema generation is the reader-facing contract change, such as v4 or v6. A producer revision is a newer emitter under an existing schema, such as a later clang-semantic wave that still speaks v5. A tolerant consumer is the loader contract that accepts optional fields only when they remain shape-valid and semantically well-defined. The shortest local proof surfaces are "C++ data versioning and schema", "Loader enriched columns sample", and "Packed rows schema sample".

Why MegaCpp cares about this

Migrating a corpus across schema boundaries is the most expensive thing you can do in a data stack. Re-tokenizing a large, already-approved corpus because you renamed a field burns both compute and operator attention. Writing the history down is how the next version avoids the same mistake.

What we built in the MegaCpp data pipeline

The shared substrate before any version: eight pinned C/C++ repositories, shallow-cloned at explicit refs and totaling about 15 GB on disk, plus a larger catalog tracked separately. One hybrid C++ tokenizer, later expanded to 131,072 entries. The differences between versions are how we chunk, order, enrich, and resolve cross-file references over those same bytes.

v2 - full files, pre and post commit. The first version that was more than a flat dump. Walks each repo's commit history and, for each commit touching a C++ file, emits the file before and after as two documents. The point is temporal signal: flat files have no notion of code as something that used to look different.

Schema: JSONL with {"text", "repo", "commit_hash", "filepath", "timestamp"}. No structure metadata. Tokenized to uint16 binaries because the early tokenizer fit in 16 bits.
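As a minimal sketch of what that envelope implies for a reader, the check below parses one JSONL line and verifies the five v2 fields. The field names come from the schema above; the field types, the `parse_v2_line` helper, and the sample record are assumptions made for illustration, not the pipeline's actual code.

```python
import json

# Field names are from the v2 schema; the accepted types are an assumption.
V2_FIELDS = {
    "text": str,
    "repo": str,
    "commit_hash": str,
    "filepath": str,
    "timestamp": (str, int, float),
}

def parse_v2_line(line):
    """Parse one JSONL line and check it against the v2 envelope."""
    record = json.loads(line)
    for name, expected_type in V2_FIELDS.items():
        if name not in record:
            raise ValueError(f"missing v2 field: {name}")
        if not isinstance(record[name], expected_type):
            raise ValueError(f"field {name} has an unexpected type")
    return record

# Hypothetical record, for illustration only.
sample = json.dumps({
    "text": "int main() { return 0; }\n",
    "repo": "example/repo",
    "commit_hash": "0" * 40,
    "filepath": "src/main.cpp",
    "timestamp": "2020-01-01T00:00:00Z",
})
record = parse_v2_line(sample)
```

A check this small is cheap enough to run on every line at ingest, which is one way to catch a renamed field before it reaches tokenization.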

Val_bpb attribution: v2 is the baseline against which every later version is measured operationally, not statistically. Phase 1 of the curriculum (4K context, syntax mastery) lives entirely on v2/v3 packed into 4096-token sequences. We never ran a clean v2-vs-flat-dump ablation at scale because the flat-dump baseline was already gone by the time we had the infrastructure to do it.

v3 - structured inline diffs. Same commit walk, different rendering. Each commit becomes a single document under a synthesized file header, with removed lines as C++ comments (// Removed: ...) and added lines as live code. Two header styles ship side by side: v3_doxygen (Javadoc) and v3_simple (plain // File: ...).
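The rendering described above can be sketched with the standard library's `difflib`. This is an illustration under assumptions, not the producer's code: the `render_v3_simple` helper is hypothetical, it handles a single file, and it assumes unchanged lines are kept as live code alongside the added ones.

```python
import difflib

def render_v3_simple(filepath, before, after):
    """Render one file change as a single document in the v3_simple style."""
    out = [f"// File: {filepath}"]
    for line in difflib.ndiff(before.splitlines(), after.splitlines()):
        tag, body = line[:2], line[2:]
        if tag == "- ":
            # Removed lines survive only as C++ comments.
            out.append(f"// Removed: {body.strip()}")
        elif tag in ("+ ", "  "):
            # Added and unchanged lines stay as live code.
            out.append(body)
        # Lines tagged "? " are difflib alignment hints, not content; skip them.
    return "\n".join(out) + "\n"

before = "int add(int a, int b) {\n    return a - b;\n}\n"
after = "int add(int a, int b) {\n    return a + b;\n}\n"
doc = render_v3_simple("math/add.cpp", before, after)
```

The resulting document still reads as C++ at the lexical level, because the removed line sits in a comment directly above its replacement.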

Schema diff vs v2: same JSONL envelope; text is now a synthesized commit document. New optional header_style field. Still uint16. Storage roughly doubles to carry both header styles.

Why both styles: we did not know which the model would learn faster from. Doxygen matches the conventions of LLVM and folly; simple is less noisy. Rather than guess, we built both, packed them at equal weight, and let evals speak. They did not differ enough to justify keeping only one, and once both exist the marginal cost is storage.

Val_bpb attribution: v3 teaches "what was changed and what it replaced" instead of v2's "what does this file look like before and after." Cleanly attributing a BPB delta to v3-vs-v2 alone is hard because the curriculum phases that consumed v3 also enabled FIM, document masking, and tokenizer changes simultaneously. Operational evidence: v3 unblocked diff-shape learning and we kept it.

v4 - tree-sitter context graph. The first version that emits a graph. For every modified function in a commit, build a strict 64K-token window containing the target plus its direct callers and direct callees, extracted with a tree-sitter AST walker.
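The window assembly can be sketched as a token-budgeted selection over a call graph. The tree-sitter extraction itself is not shown; the sketch assumes the direct callers, direct callees, and per-function token counts are already available as plain dictionaries. The `build_context_window` helper, the callee-before-caller priority, and the choice to skip the record when the target alone does not fit are all assumptions.

```python
def build_context_window(target, callers, callees, token_counts, budget=65536):
    """Select functions for one window: the target first, then direct
    neighbours, never exceeding the token budget."""
    if token_counts[target] > budget:
        return []  # the target alone does not fit; emit nothing rather than truncate
    window = [target]
    used = token_counts[target]
    # Direct callees first, then direct callers (this ordering is an assumption).
    neighbours = list(callees.get(target, [])) + list(callers.get(target, []))
    for name in neighbours:
        if name in window:
            continue
        cost = token_counts[name]
        if used + cost > budget:
            continue  # strict budget: skip anything that would overflow
        window.append(name)
        used += cost
    return window

# Hypothetical call graph and token counts, for illustration only.
callees = {"parse": ["lex", "emit"]}
callers = {"parse": ["main"]}
token_counts = {"parse": 40, "lex": 30, "emit": 50, "main": 20}
window = build_context_window("parse", callers, callees, token_counts, budget=100)
```

With the small budget used here, `emit` is skipped because it would overflow the window, while the cheaper `main` still fits after it.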

Schema diff vs v3: still text-shaped at the consumer surface, but text is now a graph-assembled window. New optional language_info, platform_info, and build_info fields start to separate lexical text from parser context and platform hints.

Storage cost: significant. Shards live in multiple context-length buckets that share schema lineage with v4.

Operational history: v4 had an embarrassing first month. A JSON-schema mismatch in the producer could emit empty outputs while the outer pipeline still looked healthy. The reason to record that here is not the bug itself; it is the lesson that round-trip gates belong in the version contract, not as optional debugging extras.

v4 is approximate by design. Tree-sitter does not resolve names across files, does not see overloads, does not know which foo() a call_expression resolves to under namespaces or templates. For 16K curriculum windows that is fine; the point of v4 is throughput.

Val_bpb attribution: v4 is what made Phase 2 (file-level reasoning, 16K context) work. Phase 1 ablation shows DSA on v3-shaped data at val_bpb ~1.562 (the largest single-feature improvement, -16.3% over the attn-only baseline of ~1.866). We do not have a clean "DSA on v4 vs DSA on v3" number at fixed model config because the context-length change confounds it.

v5 - libclang semantic graph. The answer to "what does v4 lie about." Where tree-sitter approximates, libclang resolves. v5 drives Clang with each project's compile command database, walks git history incrementally, and emits higher-trust semantic relationships under the repository's real build contract.

Schema diff vs v4: same envelope; graph content is now semantically resolved instead of AST-approximated. build_info becomes authoritative per record. language_info is preserved when uniform across constituent files. The checked-in public-safe bridge for this section is "Compile commands context example" plus "Compile commands and semantic graphs": the first shows the build database turning into a typed record, the second explains why that matters more than raw graph existence. One small but practical detail matters here: structured arguments-style compile records preserve flags more safely than shell-flattened command strings, which is exactly the kind of contract detail that keeps a semantic pass from silently drifting.
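The flag-preservation point can be shown with a short sketch. The JSON Compilation Database format allows either an `arguments` list or a single `command` string per entry; the entries, paths, and the `compile_args` helper below are hypothetical. A flag containing a space survives intact in the list form, but a naive whitespace split of the flattened string breaks it apart.

```python
import shlex

# Hypothetical entry in the structured "arguments" form.
entry_arguments = {
    "directory": "/build",
    "file": "src/a.cpp",
    "arguments": ["clang++", "-std=c++20", '-DGREETING="hello world"', "-c", "src/a.cpp"],
}

# The same invocation flattened into a shell-quoted "command" string.
entry_command = {
    "directory": "/build",
    "file": "src/a.cpp",
    "command": shlex.join(entry_arguments["arguments"]),
}

def compile_args(entry):
    """Return the argument vector, preferring the structured form."""
    if "arguments" in entry:
        return list(entry["arguments"])
    # The flattened form must be parsed with shell rules, not str.split().
    return shlex.split(entry["command"])

structured = compile_args(entry_arguments)
recovered = compile_args(entry_command)
naive = entry_command["command"].split()  # breaks the quoted define in two
```

Here `recovered` matches `structured` only because the string is parsed with shell quoting rules; `naive` ends up with one extra, malformed argument, which is the kind of silent drift the structured form avoids.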

Storage cost: materially higher than v4 in operator effort, even when the stored outputs stay manageable. The producer is far more expensive than v4 because real build reproduction is slower and build coverage is uneven by repository.

Operational history: deployment took longer to stabilize than the indexer itself. The public-safe lesson is narrower: build-aware extraction must account for partial compile databases, hung translation units, stale build context, and producer health checks that fail loudly instead of publishing an apparently valid empty lane. The detailed version lives in "The Clang semantic indexer".

Val_bpb attribution: v5 is the only producer in the stack whose call edges we trust at long context. Tree-sitter's are wrong often enough at 64K that we cannot use them as the sole source for repository-reasoning training. Phase 3 (64K, repository graph reasoning) does not exist without v5.

v6 - enriched parquet. The version where the dataset stops being just text. Same commit walk. Same v5-quality semantic edges where available, v4 tree-sitter edges as fallback. The change is the schema: each parquet record now carries dense structural metadata as additional columns, which is the same enriched surface consumed later in "Tokenized enriched packed rows on TPU".

Schema diff vs v5: parquet instead of JSONL. New columns include structure_ids, chunk_boundaries, call_edges/type_edges, optional AST metadata, and the preserved platform_info/language_info/build_info triple. A token-level extension materialized offline is what the production loader actually consumes. The checked-in proofs are "Enriched JSONL record to parquet", "Enriched record normalization example", "Loader enriched columns sample", and "Token-level enriched parquet materialization example".

Storage cost: enriched parquet is manageable precisely because the new columns are sparse and columnar. The token-level extension is still expensive enough that we materialize it selectively.

The tokenizer also jumped under v6: 131,072 entries, more than the 65,536 ids a uint16 can address. Pretokenized v6 shards switched to uint32. Older v2/v3 uint16 archives remain valid against the older tokenizer; the new uint32 shards are not readable with it.
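The width change follows directly from the arithmetic: a uint16 addresses ids 0 through 65,535, and a 131,072-entry vocabulary needs ids up to 131,071. A minimal sketch of treating the width as part of the contract, using only the standard `struct` module; the helper names and the little-endian layout are assumptions, not the shard format itself.

```python
import struct

UINT16_CAPACITY = 1 << 16  # ids 0..65535 fit in two bytes
V6_VOCAB = 131072          # vocabulary size stated for v6
SMALL_VOCAB = 65536        # hypothetical vocabulary that still fits uint16

def token_format(vocab_size):
    """Return a struct code wide enough for every id below vocab_size."""
    return "H" if vocab_size <= UINT16_CAPACITY else "I"

def pack_tokens(ids, vocab_size):
    """Pack ids as little-endian uint16 or uint32, chosen from the vocabulary."""
    fmt = token_format(vocab_size)
    return struct.pack(f"<{len(ids)}{fmt}", *ids)

def unpack_tokens(blob, vocab_size):
    """Inverse of pack_tokens; the caller must supply the same vocabulary size."""
    fmt = token_format(vocab_size)
    width = struct.calcsize(f"<{fmt}")
    return list(struct.unpack(f"<{len(blob) // width}{fmt}", blob))

blob = pack_tokens([0, 70000, 131071], V6_VOCAB)
roundtrip = unpack_tokens(blob, V6_VOCAB)
```

Because the reader derives the width from the vocabulary size, a shard cannot be decoded without knowing which tokenizer produced it, which is the sense in which the two archive families are not interchangeable.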

Val_bpb attribution: this is the phase where we keep the strongest local engineering evidence that richer structure can help while also costing real throughput. The checked-in Phase-5 ablation receipt sample is intentionally narrow: one real operating point, not a universal promise. The stronger public-safe statement is that v6 only stays because the structure-aware win is repeatable enough to justify engineering the loader and kernel path around it, which is the deployment-side story told in "Tokenized enriched packed rows on TPU" and "Dataloader throughput and stalls". The throughput recovery story is not "parquet is magically faster"; it is that offline token alignment and fixed-shape packed rows remove the char-to-token and shape-variation work from the hot path.

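# Tolerant loader contract: trust optional columns only when shape-valid.
# infer_from_bos is a loader helper that is not shown in this article.
def load_row(row, T):
    ids = row["input_ids"]
    # Packed rows are fixed-shape: reject anything that is not exactly T wide.
    assert ids.shape[-1] == T
    doc_ids = row.get("doc_ids")
    # Fall back to BOS-derived boundaries when doc_ids is missing or misshapen.
    if doc_ids is None or doc_ids.shape != ids.shape:
        doc_ids = infer_from_bos(ids)
    # loss_mask is optional and may be None.
    return ids, doc_ids, row.get("loss_mask")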
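For readers who want to run the contract end to end, here is a self-contained variant that uses plain Python lists, with `len()` standing in for the `.shape` checks. The `infer_from_bos` body, the `BOS_ID` value, and the sample rows are assumptions; the article does not show the real helper.

```python
BOS_ID = 1  # hypothetical BOS token id

def infer_from_bos(ids, bos_id=BOS_ID):
    """Assign a document index per token, starting a new document at each BOS."""
    doc_ids, current = [], -1
    for tok in ids:
        if tok == bos_id:
            current += 1
        # Tokens before the first BOS are attributed to document 0.
        doc_ids.append(max(current, 0))
    return doc_ids

def load_row(row, T):
    """List-based stand-in for the tolerant loader contract."""
    ids = row["input_ids"]
    if len(ids) != T:
        raise ValueError(f"expected {T} tokens, got {len(ids)}")
    doc_ids = row.get("doc_ids")
    # Trust the optional column only when it lines up with input_ids.
    if doc_ids is None or len(doc_ids) != len(ids):
        doc_ids = infer_from_bos(ids)
    return ids, doc_ids, row.get("loss_mask")

# An older row with no doc_ids column, and a newer row with a malformed one.
old_row = {"input_ids": [1, 5, 6, 1, 7, 8]}
bad_row = {"input_ids": [1, 5, 6, 1, 7, 8], "doc_ids": [0, 0]}
_, old_docs, old_mask = load_row(old_row, T=6)
_, bad_docs, _ = load_row(bad_row, T=6)
```

Both rows load through the same path: the older row simply lacks the column, and the malformed one is treated as if it did.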

Frequently asked questions

Why did TreeFFN stay while RelationBiasComputer was removed?
Because they failed on different axes. The local ablation story here is that TreeFFN still earns its keep once the loader preserves token-aligned structure and chunk metadata, while RelationBiasComputer added cost on top of that stack without a measurable BPB win at scale.
Did switching v6 token shards from uint16 to uint32 erase the pretokenized-loader win?
No. The wider token id doubles the bytes for the id column, so v6 treats tokenizer width as part of the schema contract instead of a silent implementation detail. The win that made pretokenized shards worth keeping came from moving char-to-token alignment and row-shape decisions offline: the runtime loader reads fixed packed rows and static-shape metadata instead of rebuilding those tensors per batch. That is why older uint16 archives stay readable for their original tokenizer, while v6 keeps the larger vocabulary and pays the uint32 cost inside the fixed-shape path described in "Tokenized enriched packed rows on TPU" and "Dataloader throughput and stalls".

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

compile command database

The compilation-command manifest that grounds semantic indexing, compiler receipts, and structure-aware data enrichment.

Semantic indexing

The structure-aware indexing lane that turns compile commands and parsed symbols into reusable training metadata.

doc_ids

The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.

loss_mask

The per-token training mask that decides which positions contribute to loss after packing, FIM rearrangement, or documentation-aware masking.
