Topic Hub

C++ Data Pipelines and Corpus Packaging

A curated archive for the C++ data path: corpus selection, semantic enrichment, packaging into training artifacts, and the file-level durability choices that keep the pipeline sane.

This cluster works best in sequence. Start with the broad pipeline and corpus articles, then move into the semantic graph and packaging layers that make the training rows useful.

data

C++

pipeline

dataset

tokenizer

Curated set

Articles in reading order

Why this hub

Best if you want to understand where the C++ training rows come from and why the pipeline is intentionally shard-heavy.

Corpus and Preparation

These define what goes into the corpus and how the base rows are formed.

01
April 18, 2026•7 min read•David Gornshtein
Building the C++ Training Data Pipeline: What Worked, What Broke
An honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and the quality gates that catch our own mistakes.
The broad retrospective: what worked, what broke, and how the pipeline settled.
Data
Pipeline
C++
Tokenizer
02
April 18, 2026•7 min read•David Gornshtein
Building a C/C++ corpus for training: what we keep, what we throw away, and why
A detailed walkthrough of how MegaCpp builds a C/C++ corpus: source selection, pins, deduplication, compile-command metadata, chunking, structure-aware exports, and refusal rules.
The keep-versus-drop policy for raw C/C++ source and why the filtering matters downstream.
C++
Corpus
Dataset
Training
03
April 18, 2026•12 min read•David Gornshtein
The C/C++ Data Preparation Pipeline, End to End
Every stage of the MegaCpp data preparation pipeline: ingest, dedup, license filtering, document masking, tokenization, packed rows, and the checks that keep dataset snapshots trustworthy.
The end-to-end article on how training rows are shaped and moved before packaging.
Data
Pipeline
C++
Operations
04
April 18, 2026•10 min read•David Gornshtein
Data Enhancements: Why structure IDs, AST features, and the Clang graph earn their cost
The structural metadata layered on top of raw C++ source: structure IDs, chunk boundaries, call edges, type edges, tree-sitter AST features, and the optional libclang semantic graph. What each one is for, what the ablations justified, and what we pay in storage and runtime.
The shortest explanation of which enrichment layers were kept in the pipeline and why they survived the cost review.
Data
Enrichment
tree-sitter
Clang
05
April 18, 2026•9 min read•David Gornshtein
Code Deduplication at Scale: MinHash, LSH, and What a 142-Repo C++ Catalog Actually Looks Like
How MegaCpp deduplicates C++ at scale: shingling choices, MinHash/LSH parameters, exact-dup SHA-256, and the tradeoffs behind near-duplicate removal.
The deduplication companion, for when the raw corpus is big enough that redundancy becomes a quality and cost problem.
Corpus
Dedup
MinHash
LSH

Semantic Enrichment and Packaging

These explain why the pipeline costs more than plain text ingestion and why that cost is accepted.

06
April 18, 2026•5 min read•David Gornshtein
Compile Commands and Semantic Graphs: Why C++ Training Needs Real Build Context
How compilation-database-driven semantic extraction improves C++ corpus quality, where clang indexers fail, and why build-aware graphs matter more than raw text proximity.
Why real build context is required for C++ training quality rather than optional metadata decoration.
Clang
Semantic Indexing
C++
Data
07
April 18, 2026•12 min read•David Gornshtein
The Clang semantic indexer: translation units, call graphs, and the perf wall
How the libclang-based semantic indexer feeds v6_enriched parquet: compilation-database handling, the per-file translation-unit graph, call and type edges, the failure modes we hit, and the wall-clock cost of ground-truth semantics.
The most direct explanation of the semantic indexer, its perf wall, and what it buys the stack.
Clang
Data
Indexer
C++
08
April 19, 2026•5 min read•David Gornshtein
Converting parquet token shards into Megatron indexed datasets
Why MegaCpp keeps a narrow data bridge from tokenized parquet shards to Megatron indexed datasets instead of tying data preparation to one runtime import surface.
The bridge from prepared training shards into the indexed dataset artifacts the training lane actually consumes.
Data
Megatron
Parquet
Dataset
09
April 19, 2026•5 min read•David Gornshtein
Megatron bin/idx pipeline from parquet token shards
Why a parquet-to-binidx bridge matters, what contract it has to preserve, and why a thin formatting wrapper is worth keeping separate from the low-level converter.
How prepared shards are turned into the Megatron-friendly artifact format used by the training lane.
Data
Megatron
Binidx
Parquet
10
April 19, 2026•7 min read•David Gornshtein
Packed rows as the real training contract
Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a storage detail.
The most direct description of why packed rows became the real training contract instead of a minor loader detail.
Data
Packing
Long Context
FIM

Durability, Versions, and Provenance

These complete the archive with the file-level contracts that keep the data lane reproducible.

11
April 19, 2026•6 min read•David Gornshtein
Protobuf, the 2 GB Wall, and Why MegaCpp Prefers Shards Over Giant Messages
Why large-message serialization becomes fragile near protobuf's practical limits, and how MegaCpp's checkpoint and data paths avoid single huge payloads by using sharded files, streaming conversion, and explicit completion markers.
A good durability companion piece: why the pipeline keeps choosing shards and markers over giant blobs.
Protobuf
Serialization
Streaming
Checkpoints
12
April 18, 2026•7 min read•David Gornshtein
C++ Data Versioning and Schema: How to Keep Training Rows Stable While the Corpus Evolves
Why schema discipline, canonical fallback values, and explicit versioning matter more than format churn when a C/C++ training corpus gains structure-aware metadata.
The schema and versioning companion, for when enriched rows have to stay compatible across pipeline passes.
C++
Data
Schema
Versioning
13
April 18, 2026•7 min read•David Gornshtein
Dataset Versions v2 to v6: The Long-Form Ablation History
A detailed walk through every schema generation of the C++ training corpus: what each version added, the schema diff, the storage cost, the val_bpb delta we attribute to each step, and what we deprecated and why.
A useful explanation of how dataset version drift shows up in practice instead of only in high-level release notes.
Data
Dataset
Ablation
Schema
14
April 18, 2026•6 min read•David Gornshtein
License Hygiene and Provenance for a C++ Training Corpus
How MegaCpp describes source provenance, revision pinning, SPDX metadata, and refusal-list rules for a public C/C++ corpus narrative without overstating legal certainty.
The provenance companion piece, explaining why the data lane keeps explicit records of source, license, and recoverability.
Corpus
License
Provenance
SPDX

Keep exploring

Adjacent topic hubs

These hubs cover nearby parts of the blog without turning the archive into a giant taxonomy.

GB10 and Blackwell Bring-Up

Modal Training and Benchmark Operations

Mamba3 Architecture, Kernels, and Runtime Tradeoffs

MLA Integration, Dispatch, and Weight Absorption

Evaluation, Benchmarks, and Verifier Loops

Megatron Parallelism and Layout Boundaries

TPU Sparse Attention and Pallas Kernels

H200 Training and Kernel Bring-Up

TPU v6e and XLA Runtime Surfaces

MoE, Routing, and Distributed Model Splits