MegaCpp Engineering: Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published | 7 min read | David Gornshtein
C++
Data
Schema
Versioning
Dataset
Training

C++ Data Versioning and Schema: How to Keep Training Rows Stable While the Corpus Evolves

Why schema discipline, canonical fallback values, and explicit versioning matter more than format churn when a C/C++ training corpus gains structure-aware metadata.

As soon as a C/C++ corpus carries more than plain text, schema versioning becomes part of model quality work. The public MegaCpp notes already describe the ingredients that make this true: pinned inputs, explicit columnar artifacts, build-aware metadata, structure-aware exports, and promotion gates based on schema and consumer checks. Once those pieces exist, the hard question is no longer "what file format should we use?" It is "how do we keep rows readable and semantically stable while the corpus evolves?"

On first read, the terms here are narrower than they sound. Semantic stability means a field keeps the same meaning, type, and fallback rule across dataset revisions. Field families are the separate groups of row-core, provenance, build-aware, and structure-aware columns. Fallback semantics are the typed defaults or null rules a consumer applies when an optional field is missing. A schema version is the explicit dataset-level marker that says which contract those meanings belong to, while a producer revision is a newer emitter that can still write the same consumer-visible schema. The quickest checked-in proof surfaces are Packed rows schema sample, Loader enriched columns sample, and Parquet to Megatron indexed dataset sample.

The important boundary is semantic stability

Format churn is easy to overstate. Parquet, Arrow-style tables, and JSONL sidecars can all work if the meaning of each exported field stays stable. What breaks consumers is not usually the container; it is field drift.

The public notes support a straightforward principle: write explicit columnar artifacts, keep schema version as first-class metadata, and require round-trip plus consumer smoke checks before promotion. That principle matters more than any particular serialization choice because it keeps versioning tied to meaning.
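A minimal sketch of that principle, assuming a JSONL sidecar whose first line is a metadata header. The header layout, the version number 3, and the function names are illustrative assumptions, not the MegaCpp format.

```python
import io
import json

SCHEMA_VERSION = 3  # hypothetical version number, for illustration only


def write_jsonl(rows, out):
    """Write a metadata header line, then one JSON object per row."""
    out.write(json.dumps({"schema_version": SCHEMA_VERSION}) + "\n")
    for row in rows:
        out.write(json.dumps(row, sort_keys=True) + "\n")


def read_jsonl(src, supported=frozenset({SCHEMA_VERSION})):
    """Read rows back, refusing artifacts whose schema version is unknown."""
    header = json.loads(src.readline())
    version = header.get("schema_version")
    if version not in supported:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return [json.loads(line) for line in src if line.strip()]


def round_trips(rows):
    """True when the rows survive a write followed by a read unchanged."""
    buf = io.StringIO()
    write_jsonl(rows, buf)
    buf.seek(0)
    return read_jsonl(buf) == rows
```

The check is deliberately strict: a row that carries a tuple, for example, comes back as a list and fails the comparison, which is the kind of silent type drift a round-trip gate is meant to surface before promotion.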

One practical refinement from the research lane is that cold storage and hot-consumer formats do not need to be identical. Parquet is a good additive storage surface, while a loader-facing Arrow IPC or similarly explicit in-memory hand-off can be the better choice once the schema is fixed and the consumer needs predictable typed access instead of repeated schema merging on every batch.

Why C/C++ corpora drift faster than plain text corpora

A structure-aware C/C++ corpus has more moving parts than a plain text dataset.

  • lexical content changes with source revisions
  • build-aware metadata changes with toolchain flags and generated include roots
  • structure-aware fields change when parsers, chunkers, or relation extractors evolve
  • provenance fields change when the pinning ledger changes

That is why the public notes insist on pinned inputs and explicit schema checks. Without that discipline, two snapshots can look similar at the file level while differing materially in model-facing fields.

Canonical field families help control drift

The easiest way to lose control of schema evolution is to mix unlike things in the same field family. The public notes point the other way: keep build-aware metadata separate from plain lexical chunks, and treat structure-aware metadata as its own export surface.

In practice that usually means keeping at least these families distinct:

| Field family | Examples | Why separation helps |
|---|---|---|
| row-core fields | text chunk, source id, revision | stable minimum contract |
| provenance fields | license metadata, retrieval date, schema version | makes the row auditable |
| build-aware fields | compile command, include roots, language mode | preserves parser context |
| structure-aware fields | structure ids, chunk boundaries, graph relations | isolates higher-churn semantic features |

Once those families are separated, additive schema evolution becomes much easier. A new relation field can be introduced without changing the meaning of row-core text fields. A new provenance field can be added without forcing model code to reinterpret build metadata. Packed rows schema sample and Reference corpus pins are the compact checked-in surfaces for that split.
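One way to picture the split is as four separate record types composed into a row. This is a sketch only: the class and attribute names below are hypothetical and are not taken from the checked-in samples.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class RowCore:
    """Stable minimum contract: every row must carry these."""
    text: str
    source_id: str
    revision: str


@dataclass(frozen=True)
class Provenance:
    """Audit fields; the schema version is required, the rest may be null."""
    schema_version: int
    license: Optional[str] = None
    retrieved_on: Optional[str] = None


@dataclass(frozen=True)
class BuildAware:
    """Parser context; empty tuples mean 'not recorded'."""
    compile_command: tuple[str, ...] = ()
    include_roots: tuple[str, ...] = ()
    language_mode: Optional[str] = None


@dataclass(frozen=True)
class StructureAware:
    """Higher-churn semantic features, isolated from the row core."""
    structure_ids: tuple[int, ...] = ()
    chunk_boundaries: tuple[int, ...] = ()
    relations: tuple[tuple[int, int, str], ...] = ()


@dataclass(frozen=True)
class Row:
    core: RowCore
    provenance: Provenance
    build: BuildAware = field(default_factory=BuildAware)
    structure: StructureAware = field(default_factory=StructureAware)
```

Adding a field to `StructureAware` in this layout cannot change what `RowCore.text` means, which is the property the separation is meant to buy.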

Versioning should be explicit and boring

The reference pinning note lists schema version as part of minimal metadata per input. That is the right habit. Schema versioning should be explicit, monotonically understandable, and close to the artifact itself.

A useful rule set is simple:

  • adding an optional field is a schema change
  • changing the meaning of an existing field is a breaking change
  • reusing an old field name for a new concept is worse than adding a new field
  • consumers should read one canonical representation, not a grab bag of historical variants

This sounds obvious, but many pipelines fail exactly here. They treat backward compatibility as "accept whatever old rows contain," then push parsing ambiguity into model-facing code.
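The mechanical part of those rules can be sketched as a schema diff. The sketch assumes a schema is a plain mapping from field name to a type string, which is an illustrative simplification. A change of meaning under an unchanged name and type cannot be detected this way and still needs human review.

```python
def classify_change(old: dict[str, str], new: dict[str, str],
                    optional: set[str]) -> str:
    """Classify the step from `old` to `new` as unchanged, additive, or breaking.

    `optional` names the fields in `new` that consumers may treat as absent.
    """
    removed = old.keys() - new.keys()
    added = new.keys() - old.keys()
    retyped = {name for name in old.keys() & new.keys() if old[name] != new[name]}

    if removed or retyped:
        return "breaking"   # existing consumers would misread or miss a field
    if added - optional:
        return "breaking"   # a new required field: older rows cannot satisfy it
    if added:
        return "additive"   # still a schema change, but old consumers can ignore it
    return "unchanged"
```

For example, adding an optional `relations` column classifies as additive, while changing `text` from a string type to a binary type classifies as breaking even though the field name is the same.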

Fallback values should be typed, not improvised

The public notes do not enumerate every fallback table, but they do imply the correct design rule: schema checks and consumer smoke tests happen before promotion, which means missing fields and optional metadata must have a deterministic interpretation.

That interpretation should be typed. Some fields can use zero as a canonical fill. Others need a sentinel that is not also a valid value. Provenance fields may need explicit nulls. Relation fields may need empty lists rather than absent columns. The important part is not the literal token chosen as the fill. The important part is that consumers do not need to invent the rule on the fly.
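A small sketch of what a typed fallback table could look like. The field names, the choice of -1 as a sentinel (which assumes real ids are non-negative), and the `canonicalize` helper are all hypothetical. Only the four kinds of fill come from the text above.

```python
from typing import Any, Callable

# Hypothetical fallback table: one factory per optional field.
FALLBACKS: dict[str, Callable[[], Any]] = {
    "line_count": lambda: 0,            # zero is a safe canonical fill here
    "parent_structure_id": lambda: -1,  # sentinel; assumes real ids are >= 0
    "license": lambda: None,            # provenance: explicit null
    "relations": list,                  # fresh typed-empty list for every row
}

REQUIRED = ("text", "source_id", "revision")


def canonicalize(raw: dict[str, Any]) -> dict[str, Any]:
    """Return a row with exactly the canonical fields, filling optional ones."""
    missing = [name for name in REQUIRED if name not in raw]
    if missing:
        raise KeyError(f"missing required row-core fields: {missing}")
    row = {name: raw[name] for name in REQUIRED}
    for name, make_default in FALLBACKS.items():
        value = raw.get(name)
        row[name] = value if value is not None else make_default()
    return row
```

Using factories rather than shared literal defaults means two rows never alias the same empty list, and fields that are not in the table are dropped, so model code only ever sees the canonical representation.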

This is also where explicit declared schemas matter. Repeated fields are the place where dynamic inference usually fails first: if the first shards contain only nulls or empty placeholders, a loader can lock the column into the wrong type before populated rows arrive. Declaring the repeated field up front and using typed empty lists where appropriate keeps the consumer from learning a different contract from whichever shard happened to land first.

That same seam shows up in Arrow-backed dataset libraries. Declared schemas are not only about column names; they also keep sparse dict or list fields from being widened opportunistically as early shards arrive. Packed rows schema sample and Loader enriched columns sample are the local proof surfaces for that contract.
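The failure mode can be shown with a toy inference routine. This does not model any particular Arrow library. `infer_type`, `DECLARED`, and `validate` are hypothetical stand-ins for "infer from whatever arrived first" versus "check against a declared contract".

```python
def infer_type(values):
    """Naive inference: the type of the first non-null value, else 'null'."""
    for value in values:
        if value is not None:
            return type(value).__name__
    return "null"


# Declared contract: the repeated field is a list, whatever the first shard holds.
DECLARED = {"relations": "list"}


def validate(column, values, declared=DECLARED):
    """Check values against the declared type; nulls are always acceptable."""
    expected = declared[column]
    for value in values:
        if value is not None and type(value).__name__ != expected:
            raise TypeError(f"{column}: expected {expected}, got {type(value).__name__}")
    return expected


first_shard = [None, None]   # early shard: the field is not populated yet
later_shard = [[1, 2], []]   # later shard: populated and typed-empty lists
```

With inference, the first shard yields `"null"` and the later shard yields `"list"`, so the consumer's view depends on arrival order. With the declared schema, both shards validate against the same type.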

The same split should hold in storage. Heavy structure-aware payloads such as AST blobs, relation graphs, or provenance side data do not need to sit on the hot read path for every training batch. Keeping them in additive Parquet column groups or adjacent sidecars lets cold storage stay rich while the loader-facing contract materializes only the fields the batcher actually needs.

Parquet projection and Arrow IPC zero-copy also solve different problems. Projection says "do not read the cold columns"; Arrow IPC says "the typed arrays are already in an Arrow-friendly buffer, so expose them directly if the source supports it." Converting Parquet token shards into Megatron indexed datasets is the adjacent handoff where that distinction matters most.

Build metadata and structure metadata should evolve independently

The compile-command sample is a good reminder that build-aware data has its own lifecycle.

{
  "directory": "/workspace/build",
  "file": "src/parser.cpp",
  "arguments": [
    "clang++",
    "-std=c++20",
    "-Iinclude",
    "-DMEGACPP_EXAMPLE=1",
    "-c",
    "src/parser.cpp"
  ]
}

A change in compile flags is not the same thing as a change in chunk schema. A parser upgrade is not the same thing as a provenance-field addition. Keeping those surfaces separated makes it possible to reason about which part of the pipeline changed and which consumers need to care. Compile commands context example is the checked-in surface for that separation.
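As a sketch of that separation, build-aware fields can be derived from the compile-command entry and fingerprinted on their own, so a flag change shows up as a build change rather than a schema change. The helper names are hypothetical, and the parsing only handles the joined `-Iinclude` and `-DNAME=1` forms used in the sample, not the space-separated variants.

```python
import hashlib
import json

ENTRY = json.loads("""
{
  "directory": "/workspace/build",
  "file": "src/parser.cpp",
  "arguments": ["clang++", "-std=c++20", "-Iinclude",
                "-DMEGACPP_EXAMPLE=1", "-c", "src/parser.cpp"]
}
""")


def build_fields(entry: dict) -> dict:
    """Extract language mode, include roots, and defines from the arguments."""
    args = entry["arguments"]
    return {
        "language_mode": next((a[5:] for a in args if a.startswith("-std=")), None),
        "include_roots": [a[2:] for a in args if a.startswith("-I") and len(a) > 2],
        "defines": [a[2:] for a in args if a.startswith("-D") and len(a) > 2],
    }


def build_fingerprint(entry: dict) -> str:
    """Stable hash of the build-aware fields only, independent of the row schema."""
    payload = json.dumps(build_fields(entry), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Switching `-std=c++20` to `-std=c++17` changes the fingerprint while the set of field names stays identical, which tells consumers that the build context moved but the contract did not.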

What a stable consumer contract should look like

A stable consumer contract for a corpus like this has three properties.

First, consumers read canonical field names with canonical meanings.

Second, older rows can still be loaded because missing fields have defined defaults or explicit null semantics.

Third, newer rows do not force older consumers to inspect raw producer variation. Additive fields should be ignorable when they are irrelevant.

That is the real goal of schema versioning: not just preserving bytes on disk, but keeping the model-facing interpretation narrow and predictable.

Promotion gates should test that contract at the real consumer seam too. For this corpus, that means not only row-level schema checks, but also a smoke read against the downstream indexed dataset boundary so .idx headers, offsets, and dtype expectations still line up after a tokenizer or metadata change. A dataset can look fine in Parquet and still be wrong for the actual training consumer.
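A consumer-side smoke read can be very small. The sketch below assumes a simplified handoff of one flat token list plus document offsets and a declared unsigned integer width. It is not the real .idx layout, and `smoke_read` is a hypothetical name.

```python
def smoke_read(offsets: list[int], tokens: list[int], dtype_bits: int) -> list[str]:
    """Return a list of failures; an empty list means the handoff looks readable."""
    failures = []
    if not offsets or offsets[0] != 0:
        failures.append("offsets must start at 0")
    if any(later < earlier for earlier, later in zip(offsets, offsets[1:])):
        failures.append("offsets must be non-decreasing")
    if offsets and offsets[-1] != len(tokens):
        failures.append("last offset must equal token count")
    if tokens and (min(tokens) < 0 or max(tokens) >= 1 << dtype_bits):
        failures.append("token id does not fit declared dtype")
    return failures
```

The dtype check is the one most likely to trip after a tokenizer change: a vocabulary that grows past 65,536 entries no longer fits a 16-bit token type, even though every row still passes a schema check on the Parquet side.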

Practical rules

  • Put schema version directly in the artifact metadata.
  • Separate row-core, provenance, build-aware, and structure-aware fields.
  • Prefer additive fields to overloaded meanings.
  • Define typed fallback behavior for every optional field family.
  • Run round-trip checks and at least one consumer smoke pass before promotion.
  • Keep model code on canonicalized rows rather than raw producer variants.

Stable training data is not mainly about picking a fashionable format. It is about making every field explicit enough that a new snapshot can evolve without forcing every consumer to relearn the dataset.

One practical read-path consequence is that Parquet and Arrow do different jobs. Parquet is the persistence envelope because additive nullable columns age well. The loader-facing boundary often wants Arrow IPC or another explicit typed hand-off so schema decoding does not become a per-batch CPU tax. That is why this article keeps storage format separate from consumer contract, and why Converting Parquet token shards into Megatron indexed datasets and Building the C++ Training Data Pipeline: What Worked, What Broke sit immediately next to it.

FAQ

Frequently asked questions

How is a schema version different from a producer revision?
A schema version names a consumer-visible contract change. A producer revision is an implementation change that can still emit the same contract.

What do typed fallback semantics mean in practice?
They mean the consumer never improvises when an optional field is absent. A row either gets a declared sentinel, an empty list, a canonical zero, or an explicit null depending on the field family.

Why should optional side metadata fall back to typed defaults instead of crashing the load?
Because compatibility failures usually show up in optional side columns first. If an older shard is missing a relation field or one row carries malformed optional metadata, the loader should still be able to materialize the canonical row contract and warn rather than teaching model code a second schema. Loader enriched columns sample shows that fallback seam, and Packed rows schema sample shows the required row fields that still have to stay stable.

Why is Parquet projection pushdown not the same as Arrow zero-copy reads?
Because they solve different parts of the read path. Parquet projection says "do not read the cold columns"; Arrow IPC zero-copy says "the typed arrays are already in an Arrow-friendly buffer, so expose them directly if the source supports it." The first is a storage-side skip, the second is an in-memory handoff property.

Why declare repeated fields before early shards actually populate them?
Because null-heavy early shards can teach a dynamic loader the wrong type. Declared list types and typed empty arrays keep later structure-aware rows compatible with the original consumer contract. That matters even more before a later Converting Parquet token shards into Megatron indexed datasets handoff, where the downstream consumer expects one stable contract.

Why keep heavy AST or provenance payloads out of the hot read path?
Because most training reads need row-core text plus a narrow typed context, not the full cold archive. Separate column groups or sidecars let the loader project only the hot contract while richer structure-aware metadata stays available for offline checks, rebuilds, or later promotion gates.

Why is schema validation alone not enough for promotion?
Because a shard can satisfy the declared types and still fail the first real consumer. Promotion should prove both row-level validity and at least one downstream smoke read, especially across the Parquet-to-indexed-dataset handoff where header, offset, and dtype expectations still have to line up.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Input pinning

The rule that source corpora and shard manifests are pinned before materialization so training rows can be reproduced exactly.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

compile command database

The compilation-command manifest that grounds semantic indexing, compiler receipts, and structure-aware data enrichment.

Dataset versions

What changed between v2, v3, v4, v5, and v6 of the C++ training corpus, why each step happened, why we kept the older formats backwards-compatible,…

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

Data pipeline

An honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…
