Megatron bin/idx pipeline from parquet token shards
Why a parquet-to-binidx bridge matters, what contract it has to preserve, and why a thin formatting wrapper is worth keeping separate from the low-level converter.

The public examples for Megatron-ready data prep are intentionally split into two surfaces.
The first surface is the actual bridge from tokenized parquet shards to
Megatron-style .bin and .idx artifacts. The second surface is the thinner
formatting wrapper that encodes naming, split, and output policy for a repeatable
training dataset layout.
That split matters because the low-level converter and the dataset policy solve different problems.
What the bridge is really doing
The parquet-to-indexed-dataset bridge sample keeps the core contract visible:
- input: tokenized parquet shards
- output: Megatron-compatible .bin/.idx dataset pair
- fallback writer allowed when the full Megatron import surface is unavailable
That last point is operationally important. Data conversion often has to happen on machines that do not mirror the full training environment exactly. If the bridge depends on one exact runtime layout, the data pipeline becomes more fragile than it needs to be.
The paired formatting-wrapper sample keeps a separate concern visible: naming, split policy, and public-safe dataset layout. That should not be buried inside the binary writer.
That split also keeps dependency bloat out of the conversion lane. A
standalone writer that emits the same .bin/.idx pair is enough for prep
machines and validation jobs that should not need the full training stack
installed, while the wrapper stays focused on dataset naming and split policy.
That makes the bridge easier to rerun in narrow environments without quietly
changing the binary contract.
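The fallback idea can be sketched as a writer-selection step. This is a minimal sketch, not the sample's actual code: `pick_writer` and the two backend labels are hypothetical names, and the only library call is the standard `importlib.util.find_spec`, which reports whether a top-level package can be found without importing it.

```python
import importlib.util


def pick_writer(package: str = "megatron") -> str:
    """Return which writer backend to use on this machine.

    The package name and the backend labels are illustrative assumptions;
    a real bridge would map each label to a concrete writer implementation.
    """
    # find_spec returns None when a top-level package cannot be found,
    # so the check itself never imports the training stack.
    if importlib.util.find_spec(package) is not None:
        return "native"
    return "standalone"


backend = pick_writer()
print(f"using {backend} writer")
```

Either branch has to emit the same .bin/.idx pair; the selection only decides which code path writes it.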
Why this is not just file conversion
Megatron-style indexed datasets are not generic export artifacts. They are part
of the training contract. The .bin file stores the packed token stream, while
the .idx sidecar carries sequence lengths, offsets, and document boundaries
that the dataset reader needs for mmap-backed access and sample construction.
The research pack adds one helpful reader-level clarification: the .idx file
is not just a convenience index. It is the loader's structural contract about
where each sample starts, how long it is, and where document boundaries reset.
If those arrays drift from the .bin payload, the failure is not "just a bad
export"; it becomes a bad training dataset.
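A drift check of that kind might look like the following. It is a sketch under stated assumptions, not the bridge's real validator: `check_index` is a hypothetical helper, the arrays are passed in as plain Python lists, and `doc_idx` is assumed to hold sequence indices that start at 0 and end at the sequence count.

```python
from itertools import accumulate


def check_index(sizes, pointers, doc_idx, bin_nbytes, itemsize):
    """Return a list of problems; an empty list means the ledgers agree."""
    problems = []
    # Byte pointers must be the running total of earlier sequence lengths.
    expected = [0, *accumulate(s * itemsize for s in sizes)]
    if list(pointers) != expected[:-1]:
        problems.append("pointers do not match cumulative sizes")
    if expected[-1] != bin_nbytes:
        problems.append("sizes do not cover the .bin payload exactly")
    # Assumed convention: boundaries start at 0, never decrease,
    # and end at the number of sequences.
    if (not doc_idx or doc_idx[0] != 0 or doc_idx[-1] != len(sizes)
            or any(a > b for a, b in zip(doc_idx, doc_idx[1:]))):
        problems.append("doc_idx is not a valid boundary ledger")
    return problems


# Three sequences of uint16 tokens (2 bytes each) in two documents.
good = check_index([3, 2, 4], [0, 6, 10], [0, 2, 3], bin_nbytes=18, itemsize=2)
bad = check_index([3, 2, 4], [0, 6, 12], [0, 2, 3], bin_nbytes=18, itemsize=2)
print(good, bad)
```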
That is why a public sample is useful here. It lets the reader see that the bridge is preserving a concrete training interface, not just reshaping storage.
The same contract is what makes mmap-backed loading worthwhile. Once lengths, offsets, and document boundaries are explicit, a reader can jump directly to the right byte range without reparsing the corpus, while still knowing where packed context should stop or where a document boundary should reset attention policy.
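The direct jump can be illustrated with the standard `mmap` module. This is a toy sketch, assuming little-endian uint16 tokens and a hypothetical `read_sequence` helper; a real reader would take the token width from the index header rather than a constant.

```python
import mmap
import os
import struct
import tempfile

ITEMSIZE = 2  # bytes per token, assuming uint16 token IDs


def read_sequence(bin_path, pointer, size):
    """Read one sequence from the .bin payload by byte offset and length."""
    with open(bin_path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Slice only the bytes for this sequence; nothing else is parsed.
        raw = mm[pointer:pointer + size * ITEMSIZE]
    return list(struct.unpack(f"<{size}H", raw))


# Build a toy payload: two sequences packed back to back.
docs = [[11, 12, 13], [21, 22]]
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    for doc in docs:
        tmp.write(struct.pack(f"<{len(doc)}H", *doc))
    bin_path = tmp.name

pointers, sizes = [0, 6], [3, 2]  # what the index would record
second = read_sequence(bin_path, pointers[1], sizes[1])
os.remove(bin_path)
print(second)
```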
The research pack adds one more useful precision point here: the .idx file is
doing two jobs at once. It is a low-level byte map through the .bin payload,
but it is also the sampler's document-boundary ledger. That is why the sidecar
needs more than raw offsets. Sequence lengths, byte pointers, and doc_idx
belong together because the loader has to answer both "where does this sample
start?" and "where must context stop behaving like one continuous document?" A
converter that gets the bytes right but drifts on boundary bookkeeping can still
produce a bad training set.
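The second question, where context must stop, reduces to a lookup in the boundary ledger. The sketch below assumes `doc_idx` stores the sequence index at which each document begins, with a final entry equal to the sequence count; `document_of` is a hypothetical helper, not a Megatron function.

```python
from bisect import bisect_right


def document_of(seq_index, doc_idx):
    """Return the document number that owns a sequence.

    Assumes doc_idx holds the sequence indices where documents begin,
    starting at 0 and ending with the total sequence count.
    """
    if not 0 <= seq_index < doc_idx[-1]:
        raise IndexError("sequence index outside the ledger")
    return bisect_right(doc_idx, seq_index) - 1


doc_idx = [0, 2, 3, 6]  # three documents covering six sequences
same_doc = document_of(0, doc_idx) == document_of(1, doc_idx)
crosses = document_of(1, doc_idx) != document_of(2, doc_idx)
print(same_doc, crosses)
```

If the ledger drifts by even one entry, every later lookup attributes sequences to the wrong document while the byte offsets still look valid.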
The same distinction is why the fallback writer matters operationally, not just for portability. The converter should be allowed to run in a narrow prep environment that only knows token IDs and the binary contract. Semantic policy still belongs one layer up. That separation keeps the wrapper in charge of split/layout decisions while the writer stays accountable for deterministic offset accumulation, dtype choice, and index integrity.
The binary side is stricter than "emit some offsets somewhere." The .idx
sidecar starts with a compatibility header and dtype code, because the reader
has to know both that the file is really a Megatron mmap index and how many
bytes each token occupies in the paired .bin payload. That is why dtype
selection belongs to the converter rather than the wrapper: choosing uint16
when the token IDs fit is a storage-contract decision, not a dataset-policy
decision.
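As a rough illustration of why this is a storage decision, the sketch below picks the narrowest token width from the largest token ID alone. `choose_token_format` is a hypothetical helper and the thresholds are the plain integer limits of each type; a production converter may apply a more conservative cutoff.

```python
import struct


def choose_token_format(max_token_id):
    """Pick the narrowest struct format that holds every token ID."""
    if max_token_id < 0:
        raise ValueError("token IDs must be non-negative")
    if max_token_id <= 0xFFFF:
        return "H", 2  # unsigned 16-bit
    if max_token_id <= 0x7FFFFFFF:
        return "i", 4  # signed 32-bit
    raise ValueError("token ID does not fit in 32 bits")


tokens = [5, 40_000, 17]
fmt, width = choose_token_format(max(tokens))
payload = struct.pack(f"<{len(tokens)}{fmt}", *tokens)
print(fmt, width, len(payload))
```

Nothing about naming or splits enters the decision, which is the point: the same shard produces the same width no matter which dataset layout it lands in.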
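In practice the reader-facing .idx contract is a small fixed header followed
by a few ledgers: sizes, byte pointers, and doc_idx for the mmap index
discussed here (the older non-mmap index format instead carried dim_offsets
and data_offsets next to sizes and doc_idx). Only the pointers and doc_idx are
monotonic; sizes is not. Naming them matters because promotion bugs usually
show up as one of those ledgers drifting, not as a mysterious loader failure.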
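The fixed-header part can be sketched with `struct`. The magic bytes, version, and field widths below are assumptions modeled on Megatron's mmap index as commonly described; verify them against the reader you actually target. The dtype code `8` is a placeholder for this sketch, and `write_header` and `read_header` are hypothetical helpers.

```python
import io
import struct

MAGIC = b"MMIDIDX\x00\x00"     # assumed magic bytes; verify against your reader
VERSION = 1                     # assumed format version
HEADER = struct.Struct("<QB")   # version (uint64) + dtype code (uint8)


def write_header(stream, dtype_code):
    stream.write(MAGIC)
    stream.write(HEADER.pack(VERSION, dtype_code))


def read_header(stream):
    """Return the dtype code, or raise if the file is not a matching index."""
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not an mmap index: bad magic bytes")
    version, dtype_code = HEADER.unpack(stream.read(HEADER.size))
    if version != VERSION:
        raise ValueError(f"unsupported index version {version}")
    return dtype_code


buf = io.BytesIO()
write_header(buf, dtype_code=8)  # placeholder code for this sketch
buf.seek(0)
code = read_header(buf)
print(code)
```

Failing fast on the header keeps a mismatched file from being misread as a valid but drifting ledger.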
Example -> article -> upstream docs
- example: parquet-to-indexed-dataset bridge sample
- companion example: formatting-wrapper sample
- article: Converting parquet token shards into Megatron indexed datasets
- upstream docs: Megatron indexed-dataset readers and the broader Megatron/Nemotron recipe ecosystem
Why the wrapper deserves its own surface
The formatting wrapper looks small, but it prevents the converter from becoming an undocumented control plane.
It keeps a few policy choices explicit:
- where train and validation splits come from
- what output family is being produced
- how dataset names stay stable and public-safe
Those are operational knobs, not binary-format details. Keeping them separate makes the pipeline easier to review and easier to port.
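A wrapper of that shape can stay very small. The sketch below is illustrative only: `LayoutPolicy`, its fields, and the hash-based split rule are assumptions, not the formatting-wrapper sample's actual interface.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutPolicy:
    """Hypothetical wrapper policy: naming and split rules only."""
    dataset_name: str
    valid_percent: int = 5

    def split_for(self, shard_name: str) -> str:
        # Hash the shard name so the split is stable across reruns and machines.
        digest = hashlib.sha256(shard_name.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return "valid" if bucket < self.valid_percent else "train"

    def output_prefix(self, shard_name: str) -> str:
        # The prefix names the .bin/.idx pair; the writer never sees this logic.
        stem = shard_name.rsplit(".", 1)[0]
        return f"{self.dataset_name}/{self.split_for(shard_name)}/{stem}"


policy = LayoutPolicy(dataset_name="demo_corpus")
prefix = policy.output_prefix("shard_00001.parquet")
print(prefix)
```

Because the policy object carries no binary-format knowledge, a reviewer can audit split and naming rules without reading the writer.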
Scale makes that separation more important, not less. On very large shards, the
sizes, byte-pointer, and doc_idx arrays can become a meaningful memory
surface before the writer flushes them, so chunk sizing and flush policy belong
to the converter. Train/validation splitting, EOD policy, and naming still
belong to the wrapper because they change dataset semantics rather than the
serialization mechanics.
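One way to bound that memory surface is to spill a ledger to disk in chunks instead of holding it until the end. The sketch below shows the idea for the sizes ledger only; `ChunkedLedger` and its `flush_every` parameter are hypothetical, and `array("i")` is a C int, which is 4 bytes on common platforms.

```python
import io
from array import array


class ChunkedLedger:
    """Buffer sequence sizes and spill them to a stream in chunks."""

    def __init__(self, stream, flush_every=4):
        self.stream = stream
        self.flush_every = flush_every
        self.pending = array("i")
        self.flushed = 0

    def append(self, size):
        self.pending.append(size)
        # Memory held by the ledger never exceeds flush_every entries.
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self):
        self.stream.write(self.pending.tobytes())
        self.flushed += len(self.pending)
        self.pending = array("i")


sink = io.BytesIO()
ledger = ChunkedLedger(sink, flush_every=4)
for size in [3, 2, 4, 1, 5, 6]:
    ledger.append(size)
ledger.flush()  # write the final partial chunk

restored = array("i")
restored.frombytes(sink.getvalue())
print(restored.tolist())
```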
One more scale detail matters here. On multi-terabyte shards, those same
sizes, pointer, and doc_idx ledgers can grow into multi-GB
metadata surfaces before the writer ever flushes them, which is why "hold the
whole index state in memory until the end" stops being a harmless
implementation detail. The read-side twin shows up later in training: a
mathematically pure global shuffle over an mmap-backed .bin can thrash the
OS page cache badly enough to stall GPUs. The bridge still wants real
randomization, but at scale it also wants a shuffle policy that stays local
enough for storage and page cache to keep up. That is one reason this article
stays paired with
Converting parquet token shards into Megatron indexed datasets:
the binary contract and the operational shuffle story are connected, not
separate.
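A locality-aware shuffle can be sketched as shuffling block order and then shuffling within each block. This is an illustrative technique, not Megatron's own shuffle implementation; `block_local_order` and its parameters are hypothetical.

```python
import random


def block_local_order(num_samples, block_size, seed):
    """Shuffle block order and the samples inside each block.

    Every sample index appears exactly once, but consecutive reads stay
    within one block of the file until that block is exhausted.
    """
    rng = random.Random(seed)
    starts = list(range(0, num_samples, block_size))
    rng.shuffle(starts)
    order = []
    for start in starts:
        block = list(range(start, min(start + block_size, num_samples)))
        rng.shuffle(block)
        order.extend(block)
    return order


order = block_local_order(num_samples=10, block_size=4, seed=0)
print(order)
```

The block size is the tuning knob: larger blocks approach a global shuffle, smaller blocks keep reads closer together on disk.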
Frequently asked questions
Why is the .idx file more than a convenience index?
[…] .bin payload, the problem is not cosmetic; the training dataset itself is wrong.
Why preserve document boundaries in the index if the tokens are already packed?
Why keep the fallback writer separate from the formatting wrapper?
Why does flush policy belong to the converter instead of the wrapper?
[…] sizes, offset, and doc_idx arrays can become a meaningful memory surface before they are flushed. That is binary-writing mechanics, so it belongs next to the writer rather than the naming and split wrapper described by formatting-wrapper sample.
Why keep pretraining on .bin/.idx but move some supervised lanes to packed parquet?
[…] .bin/.idx is good at. Supervised or role-masked lanes often need extra aligned arrays such as loss_mask and explicit sequence starts on every row, which fit more naturally in packed parquet without teaching the binary writer a second semantic job. Packed rows schema sample, Packed row builder example, and Document masking and curriculum are the quickest local surfaces for that split.
Is the .idx sidecar the same thing as Megatron's sample cache?
[…] .idx sidecar is the on-disk dataset contract: header, dtype, sequence sizes, byte pointers, and document ranges. Loader construction can then build separate document, sample, and shuffle index arrays for deterministic sample lookup. Keep those layers separate: promotion checks validate the .bin/.idx payload, while startup tuning belongs to the training-time cache.
When should dataset mixtures stay virtual instead of being physically merged?
[…] .bin/.idx contract, keep those pairs independently verifiable and let the higher-level data config express the mixture. Megatron's blended-dataset layer can choose among multiple MegatronDataset instances with dataset and sample index maps, so corpus weights do not have to be baked into one giant binary file. Use physical merging only when storage lifecycle or deployment constraints make it simpler; otherwise the safer public seam is the bridge in Converting parquet token shards into Megatron indexed datasets plus the promotion checks in C++ data versioning and schema. The practical win is reversibility: changing corpus weights can stay in the blend config and sample maps instead of forcing a fresh physical export and another full .idx promotion cycle.
What should a promotion check validate in a generated .idx file?
[…] sizes and the .bin payload, and a sane doc_idx ledger. If those checks also pass under a real read-back, the bridge is probably preserving the runtime contract instead of only emitting files.
Why can a mathematically pure global shuffle still hurt huge .bin/.idx runs?
[…] .idx gives true random access, but the storage path still runs through mmap and the OS page cache. On very large shards, a shuffle that jumps too broadly across the file can turn that strength into page-cache churn and GPU stalls. The practical goal is not "less randomness"; it is keeping shuffle locality explicit enough that storage can keep up, which is the same operational seam described in Converting parquet token shards into Megatron indexed datasets.
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
- loss_mask: The per-token training mask that decides which positions contribute to loss after packing, FIM rearrangement, or documentation-aware masking.
- Megatron: Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…
- Parquet: Why MegaCpp keeps a narrow data bridge from tokenized parquet shards to Megatron indexed datasets instead of tying data preparation to one runtime…
- Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
- Packed rows: Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…
- Data pipeline: An honest walkthrough of how the MegaCpp training data pipeline was built: source selection, filtering, dedup, tokenization, document masking, and…