MegaCpp Engineering: Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published · 5 min read · David Gornshtein
Data
Megatron
Parquet
Dataset
Migration

Converting parquet token shards into Megatron indexed datasets

Why MegaCpp keeps a narrow data bridge from tokenized parquet shards to Megatron indexed datasets instead of tying data preparation to one runtime import surface.

The interesting data bridge is not tokenization by itself. It is the handoff between a tokenized columnar corpus and the indexed dataset format the training runtime actually expects.

MegaCpp keeps that bridge explicit. The converter and thinner format wrapper exist so the dataset contract stays readable even when the full Megatron import surface is not available in the environment doing the conversion. That same contract discipline is what keeps C++ data versioning and schema readable across tokenizer and shard migrations.

For first touch, a token shard here is a Parquet shard that already contains token IDs and the row-level metadata the formatter needs; an indexed dataset is the .bin/.idx pair the Megatron runtime memory-maps at training time. The bridge is narrow on purpose: it translates one stable row contract into another without re-deciding tokenization, provenance, or split policy. The token-ID side of that contract is upstream in Inside the MegaCpp C++ tokenizer and Tokenizer evolution for C++ code; conversion should move those IDs into the runtime container, not reinterpret them. That bridge is the storage-side counterpart to Packed rows as the real training contract: the conversion step matters because the runtime eventually consumes a fixed indexed-dataset shape.

If you want the higher-level loader view of the same handoff, SLM data explains why Parquet shards are only an intermediate producer surface and why the model-facing contract starts later.

Why this bridge deserves its own public example

It solves a very specific operational problem. Data preparation often happens on machines or in environments that are not identical to the final training lane. If the conversion step is too tightly coupled to one training runtime import surface, the pipeline becomes harder to port and harder to validate.

The public examples keep the bridge narrow instead:

That is enough to describe the contract without pretending the whole training tree has to be present on the data-prep machine.

The examples are deliberately layered. The bridge sample says the output is MMapIndexedDataset-style bin/idx and that a fallback writer is acceptable when runtime imports are missing. The wrapper sample says split policy and public-safe naming are explicit. The packed-row schema sample names the downstream required columns such as input_ids, target_ids, loss_mask, doc_ids, valid_token_count, and num_docs. Together they define what conversion is allowed to preserve and what it is not allowed to invent.

At the byte level the bridge is simpler than the surrounding pipeline. The .bin side is just a contiguous token buffer with no extra framing, while the .idx side carries a 48-byte header plus the dim_offsets, data_offsets, sizes, and doc_idx arrays that tell Megatron where sequences and documents begin. That is why header parity, data_offsets, and doc_idx checks matter more than any vague claim about a "converter": if those maps drift, the runtime will memory-map the wrong slices even though the files exist.
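The offset checks named above can be sketched in a few lines. This is a minimal illustration, not the Megatron reader: the function name check_offset_ledger is hypothetical, and it assumes the index arrays are already parsed, that data_offsets holds element offsets with one trailing entry, and that doc_idx holds sequence indices.

```python
from typing import List, Sequence


def check_offset_ledger(
    data_offsets: Sequence[int],
    doc_idx: Sequence[int],
    bin_num_bytes: int,
    token_itemsize: int,
) -> List[str]:
    """Return a list of problems; an empty list means the ledger is consistent.

    Assumed conventions (hypothetical, for illustration only):
    - data_offsets counts tokens, starts at 0, and has num_sequences + 1 entries
    - doc_idx holds sequence indices marking document boundaries
    """
    problems: List[str] = []
    if len(data_offsets) == 0 or data_offsets[0] != 0:
        problems.append("data_offsets must start at 0")
    if any(b < a for a, b in zip(data_offsets, data_offsets[1:])):
        problems.append("data_offsets is not monotone")
    # The last offset, scaled by the token width, must equal the .bin size.
    if len(data_offsets) > 0 and data_offsets[-1] * token_itemsize != bin_num_bytes:
        problems.append("last data offset does not match .bin byte length")
    num_sequences = max(len(data_offsets) - 1, 0)
    if any(not 0 <= d <= num_sequences for d in doc_idx):
        problems.append("doc_idx points outside the sequence range")
    if any(b < a for a, b in zip(doc_idx, doc_idx[1:])):
        problems.append("doc_idx is not monotone")
    return problems
```

A truncated .bin shows up here as a byte-length mismatch, even though both files exist and the arrays look plausible on their own.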

The "zero-copy" decoder needs the same precision. Parquet still has to be decoded and decompressed before this bridge can write anything useful. The win comes later, once Arrow has produced contiguous buffers and the converter can stream those token arrays into .bin and build the .idx maps without another object-heavy pass. That is also why the parity check should include the actual 48-byte .idx header and the offset arrays, not just row counts. The storage/runtime split is the same contract described in Megatron binidx pipeline and SLM data.
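The bookkeeping behind that streaming step can be sketched as follows. It assumes decoded rows arrive as plain iterables of token IDs and that tokens are written as little-endian uint32. The helper name stream_rows_to_bin is hypothetical, and a real converter would write Arrow's buffers directly rather than going through Python integers, so this shows the ledger logic, not the performance path.

```python
import io
import struct
from typing import BinaryIO, Iterable, List, Tuple


def stream_rows_to_bin(
    rows: Iterable[Iterable[int]], bin_file: BinaryIO
) -> Tuple[List[int], List[int]]:
    """Append each row's tokens to bin_file and return (sizes, data_offsets).

    Assumption: tokens are stored as little-endian uint32 with no framing,
    and data_offsets counts tokens rather than bytes.
    """
    sizes: List[int] = []
    data_offsets: List[int] = [0]
    for row in rows:
        tokens = list(row)
        bin_file.write(struct.pack(f"<{len(tokens)}I", *tokens))
        sizes.append(len(tokens))
        data_offsets.append(data_offsets[-1] + len(tokens))
    return sizes, data_offsets


# Usage with an in-memory buffer standing in for the .bin file.
buffer = io.BytesIO()
sizes, data_offsets = stream_rows_to_bin([[1, 2, 3], [4, 5]], buffer)
```

The offsets are produced in the same pass that writes the tokens, so the ledger and the payload cannot be built from two different views of the data.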

Megatron's own runtime surface makes that strictness concrete. The indexed dataset is the low-level on-disk interface that higher-level GPT and blended dataset builders sit on top of, and the default bin reader is a memory-mapped reader. In practice that means this bridge is not just writing two plausible artifacts. It is writing the exact offset ledger that later sample and shuffle indices will trust when they pull spans back out of .bin. If data_offsets or doc_idx drift, the failure shows up as wrong sampled token ranges rather than as a friendly formatting warning.

Megatron Core's dataset docs make the layering explicit: IndexedDataset is the low-level .bin/.idx surface, while GPTDataset is where split indices, sample indices, and shuffle policy get built on top. That division is why this bridge should stop at byte-level fidelity instead of trying to smuggle training policy into the container format.

The bridge is also a practical escape hatch from Python-side index merging costs. Once shard counts get large, object-heavy .idx assembly can spend more time in serialization, list growth, and garbage collection than in actual formatting work. Keeping the conversion path close to Arrow buffers and a contiguous .idx writer is not just an implementation preference; it is what stops a format handoff from turning into a CPU-RAM bottleneck on the prep host.
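One way to picture the contiguous alternative is to accumulate the global offsets in a typed buffer instead of a growing list of Python objects. The sketch below uses the standard-library array module with 64-bit entries. The function name merge_shard_sizes and the per-shard input shape are assumptions for illustration, not the converter's actual interface.

```python
from array import array
from typing import Iterable, Sequence


def merge_shard_sizes(shard_sizes: Iterable[Sequence[int]]) -> array:
    """Build one global token-offset array from per-shard sequence sizes.

    The result is a packed array of signed 64-bit integers, so it can be
    written to disk in one call (for example with array.tofile) instead of
    being serialized entry by entry.
    """
    offsets = array("q", [0])
    running = 0
    for sizes in shard_sizes:
        for size in sizes:
            running += size
            offsets.append(running)
    return offsets


# Two shards: the second shard's offsets are rebased onto the first.
merged = merge_shard_sizes([[3, 2], [4]])
```

Each shard's sizes are rebased onto the running total as they arrive, so no intermediate per-shard index objects need to be kept alive.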

The most conservative promotion check is a real read-back through the same indexed-dataset contract the trainer will use. Open the emitted pair, sample a few sequences by offset, and confirm that token spans and document resets match the source-shard receipts. Row counts alone will not catch an offset ledger that is monotone but wrong.
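A minimal sketch of that read-back, assuming little-endian uint32 tokens and token-count offsets as in the rest of this note. The helper names read_sequence and readback_mismatches are hypothetical, and the source rows stand in for whatever receipts the source shards provide.

```python
import struct
from typing import List, Sequence, Tuple

TOKEN_BYTES = 4  # assumption: uint32 tokens


def read_sequence(
    bin_bytes: bytes, data_offsets: Sequence[int], index: int
) -> Tuple[int, ...]:
    """Slice one sequence out of the .bin payload using the offset ledger."""
    start = data_offsets[index]
    stop = data_offsets[index + 1]
    raw = bin_bytes[start * TOKEN_BYTES : stop * TOKEN_BYTES]
    return struct.unpack(f"<{stop - start}I", raw)


def readback_mismatches(
    bin_bytes: bytes,
    data_offsets: Sequence[int],
    source_rows: Sequence[Sequence[int]],
    sample_indices: Sequence[int],
) -> List[int]:
    """Return the sampled indices whose read-back tokens differ from the source."""
    return [
        i
        for i in sample_indices
        if read_sequence(bin_bytes, data_offsets, i) != tuple(source_rows[i])
    ]


payload = struct.pack("<5I", 1, 2, 3, 4, 5)
source = [[1, 2, 3], [4, 5]]
good = readback_mismatches(payload, [0, 3, 5], source, [0, 1])
# Same row count, still monotone, but the boundary is in the wrong place.
bad = readback_mismatches(payload, [0, 2, 5], source, [0, 1])
```

The second ledger has the right number of rows and ends at the right place, so a count check passes. Only comparing actual token spans flags it.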

FAQ

Frequently asked questions

Why not do this conversion inside the training runtime?
Because data prep often runs on a different machine and only needs the dataset contract, not the full training import surface.
What should stay stable after conversion?
Indexed-dataset boundaries, token width, and the row semantics expected by Packed rows as the real training contract. The formatter may change the container, but it should not silently change what input_ids or document boundaries mean.
What is the minimum verification before I trust a converted .bin/.idx pair?
Header parity, offset parity, and document-boundary parity. The promoted dataset should prove that the index header and token width are sane, that every offset still lands on the expected byte boundary in the .bin, and that document boundaries survived the handoff. The extra cheap check is that the last data offset agrees with the byte length implied by the token dtype, so a truncated .bin or off-by-one sequence ledger fails before the trainer samples from it.
Does the converter still get to choose uint16 versus uint32?
Not on the current tokenizer lane. The production tokenizer is 131,072 entries, so token IDs no longer fit in uint16; pretokenized shards and emitted indexed datasets have to stay uint32, with older uint16 binaries limited to older archives. That is also why the promotion gate should parse the emitted index, check token range against vocab size, and do a small readback through the final .bin/.idx pair instead of treating file presence alone as success.
What is the practical difference between the bridge and the wrapper?
The bridge is the storage translator: tokenized Parquet in, indexed dataset out. The wrapper is the dataset-policy surface: which split becomes train or validation, how outputs are named, and how a public-safe dataset family is described.
Where do train, validation, and test splits actually live?
Not in the .bin/.idx payload. This bridge only emits the low-level indexed dataset and its offset tables; split ratios, sample indices, and shuffle policy get built later by Megatron's higher-level dataset layer. That is the same storage-versus-loader boundary described in Megatron binidx pipeline and SLM data.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

Data pipeline

An honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…

Megatron

Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…

Parquet

A grounded walkthrough of the MegaCpp data path: parquet shards, split logic, packed rows, metadata columns, and the interface choices documented in…

doc_ids

The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.

loss_mask

The per-token training mask that decides which positions contribute to loss after packing, FIM rearrangement, or documentation-aware masking.
