Topic Hub

C++ Data Pipelines and Corpus Packaging

A curated archive for the C++ data path: corpus selection, semantic enrichment, packaging into training artifacts, and the file-level durability choices that keep the pipeline sane.

This cluster works best in sequence. Start with the broad pipeline and corpus articles, then move into the semantic graph and packaging layers that make the training rows useful.

data
C++
pipeline
dataset
tokenizer
Curated set: 14 articles in reading order
Why this hub

Best if you want to understand where the C++ training rows come from and why the pipeline is intentionally shard-heavy.

Corpus and Preparation

These define what goes into the corpus and how the base rows are formed.

  1. 01
    April 18, 2026 · 7 min read · David Gornshtein

    Building the C++ Training Data Pipeline: What Worked, What Broke

    An honest walkthrough of how the MegaCpp training data pipeline was built: source selection, filtering, dedup, tokenization, document masking, and the quality gates that catch our own mistakes.

    The broad retrospective: what worked, what broke, and how the pipeline settled.

    Data
    Pipeline
    C++
    Tokenizer
  2. 03
    April 18, 2026 · 12 min read · David Gornshtein

    The C/C++ Data Preparation Pipeline, End to End

    Every stage of the MegaCpp data preparation pipeline: ingest, dedup, license filtering, document masking, tokenization, packed rows, and the checks that keep dataset snapshots trustworthy.

    The end-to-end article for the shaping and movement of training rows before packaging.

    Data
    Pipeline
    C++
    Operations
  3. 04
    April 18, 2026 · 10 min read · David Gornshtein

    Data Enhancements: Why structure IDs, AST features, and the Clang graph earn their cost

    The structural metadata layered on top of raw C++ source: structure IDs, chunk boundaries, call edges, type edges, tree-sitter AST features, and the optional libclang semantic graph. What each one is for, what the ablations justified, and what we pay in storage and runtime.

    The shortest explanation of which enrichment layers were kept in the pipeline and why they survived the cost review.

    Data
    Enrichment
    Tree Sitter
    Clang

Semantic Enrichment and Packaging

These explain why the pipeline costs more than plain text ingestion and why that cost is accepted.

  1. 07
    April 18, 2026 · 12 min read · David Gornshtein

    The Clang semantic indexer: translation units, call graphs, and the perf wall

    How the libclang-based semantic indexer feeds v6_enriched parquet: compilation-database handling, the per-file translation-unit graph, call and type edges, the failure modes we hit, and the wall-clock cost of ground-truth semantics.

    The most direct explanation of the semantic indexer, its perf wall, and what it buys the stack.

    Clang
    Data
    Indexer
    C++
  2. 08
    April 19, 2026 · 5 min read · David Gornshtein

    Converting parquet token shards into Megatron indexed datasets

    Why MegaCpp keeps a narrow data bridge from tokenized parquet shards to Megatron indexed datasets instead of tying data preparation to one runtime import surface.

    The bridge from prepared training shards into the indexed dataset artifacts the training lane actually consumes.

    Data
    Megatron
    Parquet
    Dataset
  3. 09
    April 19, 2026 · 5 min read · David Gornshtein

    Megatron bin/idx pipeline from parquet token shards

    Why a parquet-to-binidx bridge matters, what contract it has to preserve, and why a thin formatting wrapper is worth keeping separate from the low-level converter.

    How prepared shards are turned into the Megatron-friendly artifact format used by the training lane.

    Data
    Megatron
    Binidx
    Parquet
  4. 10
    April 19, 2026 · 7 min read · David Gornshtein

    Packed rows as the real training contract

    Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a storage detail.

    The most direct description of why packed rows became the real training contract instead of a minor loader detail.

    Data
    Packing
    Long Context
    FIM

Durability, Versions, and Provenance

These complete the archive with the file-level contracts that keep the data lane reproducible.

  1. 11
    April 19, 2026 · 6 min read · David Gornshtein

    Protobuf, the 2 GB Wall, and Why MegaCpp Prefers Shards Over Giant Messages

    Why large-message serialization becomes fragile near protobuf's practical limits, and how MegaCpp's checkpoint and data paths avoid single huge payloads by using sharded files, streaming conversion, and explicit completion markers.

    A good durability companion piece: why the pipeline keeps choosing shards and markers over giant blobs.

    Protobuf
    Serialization
    Streaming
    Checkpoints
  2. 13
    April 18, 2026 · 7 min read · David Gornshtein

    Dataset Versions v2 to v6: The Long-Form Ablation History

    A detailed walk through every schema generation of the C++ training corpus: what each version added, the schema diff, the storage cost, the val_bpb delta we attribute to each step, and what we deprecated and why.

    A useful explanation of how dataset version drift shows up in practice instead of only in high-level release notes.

    Data
    Dataset
    Ablation
    Schema
  3. 14
    April 18, 2026 · 6 min read · David Gornshtein

    License Hygiene and Provenance for a C++ Training Corpus

    How MegaCpp describes source provenance, revision pinning, SPDX metadata, and refusal-list rules for a public C/C++ corpus narrative without overstating legal certainty.

    The provenance companion piece for why the data lane keeps explicit records around source, license, and recoverability.

    Corpus
    License
    Provenance
    SPDX
