Converting parquet token shards into Megatron indexed datasets
Why MegaCpp keeps a narrow data bridge from tokenized parquet shards to Megatron indexed datasets instead of tying data preparation to one runtime import surface.

The interesting data bridge is not tokenization by itself. It is the handoff between a tokenized columnar corpus and the indexed dataset format the training runtime actually expects.
MegaCpp keeps that bridge explicit. The converter and thinner format wrapper exist so the dataset contract stays readable even when the full Megatron import surface is not available in the environment doing the conversion. That same contract discipline is what keeps C++ data versioning and schema readable across tokenizer and shard migrations.
For first touch, a token shard here is a Parquet shard that already
contains token IDs and the row-level metadata the formatter needs; an
indexed dataset is the .bin/.idx pair the Megatron runtime memory-maps
at training time. The bridge is narrow on purpose: it translates one stable row
contract into another without re-deciding tokenization, provenance, or split
policy. The token-ID side of that contract is upstream in Inside the MegaCpp
C++ tokenizer and Tokenizer evolution for C++
code; conversion should move those IDs into the runtime
container, not reinterpret them. That bridge is the storage-side counterpart to
Packed rows as the real training contract:
the conversion step matters because the runtime eventually consumes a fixed
indexed-dataset shape.
If you want the higher-level loader view of the same handoff, SLM data explains why Parquet shards are only an intermediate producer surface and why the model-facing contract starts later.
Why this bridge deserves its own public example
It solves a very specific operational problem. Data preparation often happens on machines or in environments that are not identical to the final training lane. If the conversion step is too tightly coupled to one training runtime import surface, the pipeline becomes harder to port and harder to validate.
The public examples keep the bridge narrow instead:
- Parquet to Megatron indexed dataset sample for the input/output bridge
- Prepare-format MegaCpp sample for naming and split formatting
- Packed rows schema sample for the row fields the formatter is allowed to trust
- Token-level enriched parquet materialization example for the upstream token-and-metadata shape before format conversion
That is enough to describe the contract without pretending the whole training tree has to be present on the data-prep machine.
The examples are deliberately layered. The bridge sample says the output is
MMapIndexedDataset-style bin/idx and that a fallback writer is acceptable
when runtime imports are missing. The wrapper sample says split policy and
public-safe naming are explicit. The packed-row schema sample names the
downstream required columns such as input_ids, target_ids, loss_mask,
doc_ids, valid_token_count, and num_docs. Together they define what
conversion is allowed to preserve and what it is not allowed to invent.
At the byte level the bridge is simpler than the surrounding pipeline. The
.bin side is just a contiguous token buffer with no extra framing, while the
.idx side carries a 48-byte header plus the dim_offsets, data_offsets,
sizes, and doc_idx arrays that tell Megatron where sequences and documents
begin. That is why header parity, data_offsets, and doc_idx checks matter
more than any vague claim about a "converter": if those maps drift, the runtime
will memory-map the wrong slices even though the files exist.
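The relationship between the contiguous token buffer and the offset arrays can be sketched in a few lines. This is a minimal illustration, not the MegaCpp writer: the name build_offset_ledger, the choice to express data_offsets in token elements with a trailing end offset, and the convention that doc_idx lists the sequence index where each document begins plus a final sentinel are all assumptions made for the example. The 48-byte header and dim_offsets are deliberately left out.

```python
from array import array
from itertools import accumulate


def build_offset_ledger(docs):
    """Flatten documents into one token buffer plus offset arrays.

    `docs` is a list of documents; each document is a list of token
    sequences (lists of ints). The layout conventions are assumptions
    for illustration, not the real on-disk format.
    """
    tokens = array("I")   # contiguous token buffer, no framing (the .bin side)
    sizes = array("q")    # token count of each sequence
    doc_idx = array("q")  # sequence index at which each document begins

    for doc in docs:
        doc_idx.append(len(sizes))
        for seq in doc:
            tokens.extend(seq)
            sizes.append(len(seq))
    doc_idx.append(len(sizes))  # sentinel: one past the last sequence

    # data_offsets[i] is where sequence i starts; the last entry is the total.
    data_offsets = array("q", accumulate(sizes, initial=0))
    return tokens, sizes, data_offsets, doc_idx


# Two documents: the first has two sequences, the second has one.
docs = [[[5, 6, 7], [8, 9]], [[10, 11, 12, 13]]]
tokens, sizes, data_offsets, doc_idx = build_offset_ledger(docs)

# Sequence 1 is recovered purely from the ledger; the buffer has no framing.
start, end = data_offsets[1], data_offsets[2]
second_sequence = list(tokens[start:end])
```

Because the token buffer carries no markers of its own, every boundary a reader sees comes from these arrays, which is why the checks focus on them.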
The "zero-copy" decoder needs the same precision. Parquet still has to be
decoded and decompressed before this bridge can write anything useful. The win
comes later, once Arrow has produced contiguous buffers and the converter can
stream those token arrays into .bin and build the .idx maps without
another object-heavy pass. That is also why the parity check should include the
actual 48-byte .idx header and the offset arrays, not just row counts. The
storage/runtime split is the same contract described in Megatron binidx pipeline
and SLM data.
Megatron's own runtime surface makes that strictness concrete. The indexed
dataset is the low-level on-disk interface that higher-level GPT and blended
dataset builders sit on top of, and the default bin reader is a memory-mapped
reader. In practice that means this bridge is not just writing two plausible
artifacts. It is writing the exact offset ledger later sample and shuffle
indices will trust when they pull spans back out of .bin. If data_offsets
or doc_idx drift, the failure shows up as wrong sampled token ranges rather
than as a friendly formatting warning.
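A small sketch of that failure mode, under stated assumptions: tokens are little-endian uint32, data_offsets counts token elements, and read_span is a hypothetical helper rather than Megatron's reader. It memory-maps a token file and slices it by offset, once with a correct ledger and once with a ledger whose second boundary has drifted by one token.

```python
import mmap
import os
import struct
import tempfile

ITEMSIZE = 4  # assumption: uint32 tokens, little-endian


def read_span(bin_path, data_offsets, i):
    """Return sequence i by memory-mapping the token file and slicing by offset."""
    with open(bin_path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = data_offsets[i] * ITEMSIZE
        end = data_offsets[i + 1] * ITEMSIZE
        count = data_offsets[i + 1] - data_offsets[i]
        return list(struct.unpack(f"<{count}I", mm[start:end]))


tokens = [5, 6, 7, 8, 9, 10, 11, 12, 13]
with tempfile.TemporaryDirectory() as tmp:
    bin_path = os.path.join(tmp, "shard.bin")
    with open(bin_path, "wb") as f:
        f.write(struct.pack(f"<{len(tokens)}I", *tokens))

    good_offsets = [0, 3, 5, 9]
    drifted_offsets = [0, 2, 5, 9]  # same length, still monotone, wrong boundary

    good = read_span(bin_path, good_offsets, 1)
    drifted = read_span(bin_path, drifted_offsets, 1)
```

Both reads succeed without any error. The drifted ledger simply returns a span that starts one token early, which is the quiet kind of failure the paragraph above describes.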
Megatron Core's dataset docs make the layering explicit: IndexedDataset is
the low-level .bin/.idx surface, while GPTDataset is where split indices,
sample indices, and shuffle policy get built on top. That division is why this
bridge should stop at byte-level fidelity instead of trying to smuggle training
policy into the container format.
The bridge is also a practical escape hatch from Python-side index merging
costs. Once shard counts get large, object-heavy .idx assembly can spend more
time in serialization, list growth, and garbage collection than in actual
formatting work. Keeping the conversion path close to Arrow buffers and a
contiguous .idx writer is not just an implementation preference; it is what
stops a format handoff from turning into a CPU-RAM bottleneck on the prep host.
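As a rough sketch of what a contiguous merge looks like, the example below combines per-shard ledgers into typed arrays instead of growing lists of Python objects. It uses only the standard library so it stays self-contained; a production path would more plausibly work on Arrow or NumPy buffers. The name merge_shards and the doc_idx convention (document start indices plus a trailing sentinel) are assumptions for illustration.

```python
from array import array
from itertools import accumulate


def merge_shards(shards):
    """Merge per-shard (sizes, doc_idx) pairs into one global ledger.

    Each shard supplies `sizes` (tokens per sequence) and a local `doc_idx`
    (sequence index where each document starts, plus a trailing sentinel).
    """
    sizes = array("q")
    doc_idx = array("q")
    for shard_sizes, shard_doc_idx in shards:
        base = len(sizes)  # sequences already contributed by earlier shards
        # Drop the shard's own sentinel and rebase starts to global indices.
        doc_idx.extend(base + d for d in shard_doc_idx[:-1])
        sizes.extend(shard_sizes)
    doc_idx.append(len(sizes))  # single global sentinel
    data_offsets = array("q", accumulate(sizes, initial=0))
    return sizes, data_offsets, doc_idx


# Shard A: one document made of two sequences.
shard_a = (array("q", [3, 2]), array("q", [0, 2]))
# Shard B: two documents, with one and two sequences respectively.
shard_b = (array("q", [4, 1, 2]), array("q", [0, 1, 3]))

sizes, data_offsets, doc_idx = merge_shards([shard_a, shard_b])
```

The detail worth noticing is the rebasing step: each shard's document indices are local, so a merge that concatenates them unchanged produces a ledger that looks plausible but points at the wrong sequences.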
The most conservative promotion check is a real read-back through the same indexed-dataset contract the trainer will use. Open the emitted pair, sample a few sequences by offset, and confirm that token spans and document resets match the source-shard receipts. Row counts alone will not catch an offset ledger that is monotone but wrong.
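One way such a promotion gate could be written is sketched here. The function name promotion_errors, the uint32 token width, the element-based data_offsets, and the receipts mapping (sequence index to the token list recorded from the source shard) are all hypothetical choices for the example, and the .idx header is assumed to have been parsed into plain sequences already.

```python
import struct

ITEMSIZE = 4  # assumption: uint32 tokens, little-endian


def promotion_errors(bin_bytes, data_offsets, doc_idx, vocab_size, receipts):
    """Return a list of problems found; an empty list means the pair passed."""
    errors = []
    if any(b < a for a, b in zip(data_offsets, data_offsets[1:])):
        errors.append("data_offsets not monotone")
    if data_offsets[-1] * ITEMSIZE != len(bin_bytes):
        errors.append("last offset disagrees with .bin byte length")
        return errors  # span checks below would be unreliable
    tokens = struct.unpack(f"<{len(bin_bytes) // ITEMSIZE}I", bin_bytes)
    if tokens and max(tokens) >= vocab_size:
        errors.append("token id outside vocab")
    if doc_idx[0] != 0 or doc_idx[-1] != len(data_offsets) - 1:
        errors.append("doc_idx does not span all sequences")
    # Read back sampled sequences and compare against source-shard receipts.
    for i, expected in receipts.items():
        got = list(tokens[data_offsets[i]:data_offsets[i + 1]])
        if got != expected:
            errors.append(f"sequence {i} differs from source receipt")
    return errors


tokens = [5, 6, 7, 8, 9, 10, 11, 12, 13]
bin_bytes = struct.pack(f"<{len(tokens)}I", *tokens)
receipts = {0: [5, 6, 7], 2: [10, 11, 12, 13]}

ok = promotion_errors(bin_bytes, [0, 3, 5, 9], [0, 2, 3], 32000, receipts)
drifted = promotion_errors(bin_bytes, [0, 2, 5, 9], [0, 2, 3], 32000, receipts)
truncated = promotion_errors(bin_bytes[:-4], [0, 3, 5, 9], [0, 2, 3], 32000, receipts)
```

The drifted ledger passes the count, monotonicity, and byte-length checks and is only caught by the read-back against receipts, while the truncated buffer is caught by the byte-length check before any span is sampled.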
Frequently asked questions
Why not do this conversion inside the training runtime?
What should stay stable after conversion?
… input_ids or document boundaries mean.
What is the minimum verification before I trust a converted .bin/.idx pair?
… .bin, and that document boundaries survived the handoff. The extra cheap check is that the last data offset agrees with the byte length implied by the token dtype, so a truncated .bin or off-by-one sequence ledger fails before the trainer samples from it.
Does the converter still get to choose uint16 versus uint32?
… uint16; pretokenized shards and emitted indexed datasets have to stay uint32, with older uint16 binaries limited to older archives. That is also why the promotion gate should parse the emitted index, check token range against vocab size, and do a small readback through the final .bin/.idx pair instead of treating file presence alone as success.
What is the practical difference between the bridge and the wrapper?
Where do train, validation, and test splits actually live?
… .bin/.idx payload. This bridge only emits the low-level indexed dataset and its offset tables; split ratios, sample indices, and shuffle policy get built later by Megatron's higher-level dataset layer. That is the same storage-versus-loader boundary described in Megatron binidx pipeline and SLM data.
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Megatron Core: The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.
Data pipeline: An honest walkthrough of how the MegaCpp training data pipeline was built: source selection, filtering, dedup, tokenization, document masking, and…
Megatron: Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…
Parquet: A grounded walkthrough of the MegaCpp data path: parquet shards, split logic, packed rows, metadata columns, and the interface choices documented in…
doc_ids: The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.
loss_mask: The per-token training mask that decides which positions contribute to loss after packing, FIM rearrangement, or documentation-aware masking.