Published · 5 min read · David Gornshtein
Data
Megatron
Binidx
Parquet

Megatron bin/idx pipeline from parquet token shards

Why a parquet-to-binidx bridge matters, what contract it has to preserve, and why a thin formatting wrapper is worth keeping separate from the low-level converter.

The public examples for Megatron-ready data prep are intentionally split into two surfaces.

The first surface is the actual bridge from tokenized parquet shards to Megatron-style .bin and .idx artifacts. The second surface is the thinner formatting wrapper that encodes naming, split, and output policy for a repeatable training dataset layout.

That split matters because the low-level converter and the dataset policy solve different problems.

What the bridge is really doing

The example parquet-to-indexed-dataset bridge sample keeps the core contract visible:

That last point is operationally important. Data conversion often has to happen on machines that do not mirror the full training environment exactly. If the bridge depends on one exact runtime layout, the data pipeline becomes more fragile than it needs to be.

The paired formatting-wrapper sample keeps a separate concern visible: naming, split policy, and public-safe dataset layout. That should not be buried inside the binary writer.

That split also keeps dependency bloat out of the conversion lane. A standalone writer that emits the same .bin/.idx pair is enough for prep machines and validation jobs that should not need the full training stack installed, while the wrapper stays focused on dataset naming and split policy. That makes the bridge easier to rerun in narrow environments without quietly changing the binary contract.
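A minimal sketch of such a standalone writer, using only the Python standard library. The header layout (magic bytes, version, dtype code, sequence and document counts) and the dtype code for uint16 are assumptions about the Megatron mmap index format and should be verified against the Megatron version you target. `write_bin_idx` is a hypothetical name, not a function from the samples.

```python
import struct

# Assumed constants for the Megatron mmap index layout; verify before relying on them.
IDX_MAGIC = b"MMIDIDX\x00\x00"
IDX_VERSION = 1
DTYPE_CODE_UINT16 = 8  # assumed code for 2-byte unsigned tokens


def write_bin_idx(prefix, documents):
    """Write token-ID lists (all IDs < 65536) as prefix.bin and prefix.idx.

    Each document is stored as one sequence, so doc_idx is simply 0..N.
    Returns the number of bytes written to the .bin payload.
    """
    sizes, pointers, doc_idx = [], [], [0]
    offset = 0
    with open(prefix + ".bin", "wb") as bin_file:
        for tokens in documents:
            bin_file.write(struct.pack(f"<{len(tokens)}H", *tokens))
            sizes.append(len(tokens))      # token count of this sequence
            pointers.append(offset)        # byte offset where it starts
            offset += 2 * len(tokens)      # uint16 -> 2 bytes per token
            doc_idx.append(len(sizes))     # document boundary after this sequence
    with open(prefix + ".idx", "wb") as idx_file:
        idx_file.write(IDX_MAGIC)
        idx_file.write(struct.pack("<Q", IDX_VERSION))
        idx_file.write(struct.pack("<B", DTYPE_CODE_UINT16))
        idx_file.write(struct.pack("<Q", len(sizes)))
        idx_file.write(struct.pack("<Q", len(doc_idx)))
        idx_file.write(struct.pack(f"<{len(sizes)}i", *sizes))
        idx_file.write(struct.pack(f"<{len(pointers)}q", *pointers))
        idx_file.write(struct.pack(f"<{len(doc_idx)}q", *doc_idx))
    return offset
```

Nothing here imports the training stack, which is the point: the writer only needs token IDs and the binary contract.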

Why this is not just file conversion

Megatron-style indexed datasets are not generic export artifacts. They are part of the training contract. The .bin file stores the packed token stream, while the .idx sidecar carries sequence lengths, offsets, and document boundaries that the dataset reader needs for mmap-backed access and sample construction.

The research pack adds one helpful reader-level clarification: the .idx file is not just a convenience index. It is the loader's structural contract about where each sample starts, how long it is, and where document boundaries reset. If those arrays drift from the .bin payload, the failure is not "just a bad export"; it becomes a bad training dataset.

That is why a public sample is useful here. It lets the reader see that the bridge is preserving a concrete training interface, not just reshaping storage.

The same contract is what makes mmap-backed loading worthwhile. Once lengths, offsets, and document boundaries are explicit, a reader can jump directly to the right byte range without reparsing the corpus, while still knowing where packed context should stop or where a document boundary should reset attention policy.
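The direct jump can be sketched in a few lines. This assumes uint16 little-endian tokens in the .bin payload and that the pointer and length have already been read from the .idx sidecar; `read_sequence` and `read_sequence_from_file` are hypothetical helper names.

```python
import mmap
import struct


def read_sequence(buf, pointer, length):
    """Return `length` uint16 tokens starting at byte offset `pointer`.

    `buf` can be any bytes-like object, including an mmap of the .bin file.
    """
    return list(struct.unpack_from(f"<{length}H", buf, pointer))


def read_sequence_from_file(bin_path, pointer, length):
    """Map the .bin file read-only and decode a single sequence from it."""
    with open(bin_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages covering this byte range need to be faulted in.
            return read_sequence(mm, pointer, length)
```

No earlier sequence is parsed on the way to the requested one; the index supplies the byte offset and the reader goes straight there.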

The research pack adds one more useful precision point here: the .idx file is doing two jobs at once. It is a low-level byte map through the .bin payload, but it is also the sampler's document-boundary ledger. That is why the sidecar needs more than raw offsets. Sequence lengths, byte pointers, and doc_idx belong together because the loader has to answer both "where does this sample start?" and "where must context stop behaving like one continuous document?" A converter that gets the bytes right but drifts on boundary bookkeeping can still produce a bad training set.
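The second job, the boundary ledger, can be illustrated with a small lookup. This sketch assumes doc_idx holds the index of the first sequence of each document plus a final entry equal to the total sequence count; `document_of` and `same_document` are hypothetical names.

```python
from bisect import bisect_right


def document_of(doc_idx, seq_index):
    """Return the document number that contains sequence `seq_index`."""
    if not 0 <= seq_index < doc_idx[-1]:
        raise IndexError(f"sequence {seq_index} is outside the ledger")
    # The last ledger entry <= seq_index marks the start of its document.
    return bisect_right(doc_idx, seq_index) - 1


def same_document(doc_idx, seq_a, seq_b):
    """True when both sequences may be treated as one continuous document."""
    return document_of(doc_idx, seq_a) == document_of(doc_idx, seq_b)
```

With `doc_idx = [0, 2, 3, 6]`, sequences 0 and 1 share a document, while sequences 1 and 2 do not, so context would have to reset between them.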

The same distinction is why the fallback writer matters operationally, not just for portability. The converter should be allowed to run in a narrow prep environment that only knows token IDs and the binary contract. Semantic policy still belongs one layer up. That separation keeps the wrapper in charge of split/layout decisions while the writer stays accountable for deterministic offset accumulation, dtype choice, and index integrity.

The binary side is stricter than "emit some offsets somewhere." The .idx sidecar starts with a compatibility header and dtype code, because the reader has to know both that the file is really a Megatron mmap index and how many bytes each token occupies in the paired .bin payload. That is why dtype selection belongs to the converter rather than the wrapper: choosing uint16 when the token IDs fit is a storage-contract decision, not a dataset-policy decision.
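A minimal sketch of that storage decision, assuming token IDs run from 0 to `vocab_size - 1`. `pick_token_dtype` is a hypothetical helper, and upstream tooling may apply a more conservative cutoff than the exact 16-bit limit used here.

```python
def pick_token_dtype(vocab_size):
    """Return (dtype_name, bytes_per_token) for the narrowest type that fits."""
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    max_id = vocab_size - 1
    if max_id <= 0xFFFF:
        return ("uint16", 2)   # every ID fits in two bytes
    if max_id <= 0x7FFFFFFF:
        return ("int32", 4)    # fall back to four bytes per token
    raise ValueError("token IDs do not fit in int32")
```

The choice halves or doubles the .bin size, and the reader derives every byte offset from it, so it has to be recorded in the .idx header rather than decided by naming policy.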

In practice the reader-facing .idx contract is a small fixed header followed by four ledgers: dim_offsets, data_offsets, sizes, and doc_idx. The offset and doc_idx ledgers should be monotonically non-decreasing, while sizes only has to agree with the offsets. Naming them matters because promotion bugs usually show up as one of those ledgers drifting, not as a mysterious loader failure.

Why the wrapper deserves its own surface

The formatting wrapper looks small, but it prevents the converter from becoming an undocumented control plane.

It keeps a few policy choices explicit:

  • where train and validation splits come from
  • what output family is being produced
  • how dataset names stay stable and public-safe

Those are operational knobs, not binary-format details. Keeping them separate makes the pipeline easier to review and easier to port.
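One way to keep those knobs explicit is a small policy object that never touches bytes. The sketch below is hypothetical: `LayoutPolicy`, its field names, and the "last shards become validation" rule are illustrative assumptions, not the wrapper's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutPolicy:
    """Dataset policy only: naming, split, and output family."""
    dataset_name: str
    output_family: str = "binidx"
    validation_fraction: float = 0.01

    def split_of(self, shard_index: int, shard_count: int) -> str:
        """Assign the trailing shards to validation, deterministically."""
        val_shards = max(1, round(shard_count * self.validation_fraction))
        return "valid" if shard_index >= shard_count - val_shards else "train"

    def output_prefix(self, split: str) -> str:
        """Stable path prefix shared by the .bin and .idx files of one split."""
        return f"{self.output_family}/{self.dataset_name}_{split}"
```

For example, `LayoutPolicy("toy_corpus", validation_fraction=0.2)` sends shards 8 and 9 of 10 to validation and names the training pair `binidx/toy_corpus_train`. Because the object is frozen and has no I/O, it can be reviewed and diffed without reading any converter code.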

Scale makes that separation more important, not less. On very large shards, the sizes, byte-pointer, and doc_idx arrays can become a meaningful memory surface before the writer flushes them, so chunk sizing and flush policy belong to the converter. Train/validation splitting, EOD policy, and naming still belong to the wrapper because they change dataset semantics rather than the serialization mechanics.
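A rough sketch of what a bounded flush policy could look like on the writer side. `SpillingLedger` is a hypothetical class; it buffers a fixed number of values in memory and spills them to a temporary file as packed little-endian integers, so the full ledger never has to live in RAM at once.

```python
import struct
import tempfile


class SpillingLedger:
    """Append-only integer ledger with a bounded in-memory buffer."""

    def __init__(self, fmt_char, chunk_items=1_000_000):
        self._fmt = fmt_char          # struct code, e.g. "i" (int32) or "q" (int64)
        self._chunk = chunk_items     # flush threshold, in values
        self._buf = []
        self._spill = tempfile.TemporaryFile()
        self.count = 0
        self.flushes = 0

    def append(self, value):
        self._buf.append(value)
        self.count += 1
        if len(self._buf) >= self._chunk:
            self._flush()

    def _flush(self):
        if self._buf:
            packed = struct.pack(f"<{len(self._buf)}{self._fmt}", *self._buf)
            self._spill.write(packed)
            self._buf.clear()
            self.flushes += 1

    def copy_into(self, out_file, block_bytes=1 << 20):
        """Flush the remainder, then stream the spilled ledger into out_file."""
        self._flush()
        self._spill.seek(0)
        while True:
            block = self._spill.read(block_bytes)
            if not block:
                break
            out_file.write(block)
```

The chunk size is a pure serialization knob: changing it alters peak memory but not a single byte of the resulting ledger, which is why it sits with the converter.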

One more scale detail matters here. On multi-terabyte shards, those same sizes, pointer, and doc_idx ledgers can grow into multi-GB metadata surfaces before the writer ever flushes them, which is why "hold the whole index state in memory until the end" stops being a harmless implementation detail. The read-side twin shows up later in training: a mathematically pure global shuffle over an mmap-backed .bin can thrash the OS page cache badly enough to stall GPUs. The bridge still wants real randomization, but at scale it also wants a shuffle policy that stays local enough for storage and page cache to keep up. That is one reason this article stays paired with Converting parquet token shards into Megatron indexed datasets: the binary contract and the operational shuffle story are connected, not separate.
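One common way to keep a shuffle local is to shuffle blocks of neighbouring samples and then shuffle within each block. The sketch below illustrates that idea only; it is not Megatron's shuffle implementation, and `block_local_shuffle` and its parameters are hypothetical.

```python
import random


def block_local_shuffle(num_samples, block_size, seed):
    """Return a permutation of range(num_samples) that stays block-local.

    Consecutive reads land inside one block of `block_size` neighbouring
    samples, so the pages they touch stay close together in the .bin file.
    """
    rng = random.Random(seed)
    blocks = [
        list(range(start, min(start + block_size, num_samples)))
        for start in range(0, num_samples, block_size)
    ]
    rng.shuffle(blocks)           # randomize the order in which blocks are visited
    order = []
    for block in blocks:
        rng.shuffle(block)        # randomize sample order inside each block
        order.extend(block)
    return order
```

The block size is the explicit locality knob: larger blocks approach a global shuffle, smaller blocks keep reads closer together on disk.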

Frequently asked questions

Why is the .idx file more than a convenience index?
Because the loader depends on it as a structural contract. It needs lengths, byte offsets, and document-boundary resets to build samples correctly from the memory-mapped token stream. If those arrays drift from the .bin payload, the problem is not cosmetic; the training dataset itself is wrong.
Why preserve document boundaries in the index if the tokens are already packed?
Because packing changes storage shape, not the need for boundary-aware sampling. The reader still has to know where attention or sample construction should stop treating adjacent spans as one document. That is the same packed-row boundary discipline described in Packed rows as the real training contract.
Why keep the fallback writer separate from the formatting wrapper?
Because they solve different failure modes. The wrapper owns dataset policy, while the writer owns binary correctness. Keeping them apart lets the converter run in constrained prep environments without dragging the whole training stack into the serialization lane.
Why does flush policy belong to the converter instead of the wrapper?
Because on very large shards the sizes, offset, and doc_idx arrays can become a meaningful memory surface before they are flushed. That is binary-writing mechanics, so it belongs next to the writer rather than in the naming and split wrapper described by the formatting-wrapper sample.
Why keep pretraining on .bin/.idx but move some supervised lanes to packed parquet?
Because the contracts are different. Pretraining mostly wants fast token-stream access plus trustworthy sample and document boundaries, which is exactly what .bin/.idx is good at. Supervised or role-masked lanes often need extra aligned arrays such as loss_mask and explicit sequence starts on every row, which fit more naturally in packed parquet without teaching the binary writer a second semantic job. Packed rows schema sample, Packed row builder example, and Document masking and curriculum are the quickest local surfaces for that split.
Is the .idx sidecar the same thing as Megatron's sample cache?
No. The .idx sidecar is the on-disk dataset contract: header, dtype, sequence sizes, byte pointers, and document ranges. Loader construction can then build separate document, sample, and shuffle index arrays for deterministic sample lookup. Keep those layers separate: promotion checks validate the .bin/.idx payload, while startup tuning belongs to the training-time cache.
When should dataset mixtures stay virtual instead of being physically merged?
When each shard family already has a valid .bin/.idx contract, keep those pairs independently verifiable and let the higher-level data config express the mixture. Megatron's blended-dataset layer can choose among multiple MegatronDataset instances with dataset and sample index maps, so corpus weights do not have to be baked into one giant binary file. Use physical merging only when storage lifecycle or deployment constraints make it simpler; otherwise the safer public seam is the bridge in Converting parquet token shards into Megatron indexed datasets plus the promotion checks in C++ data versioning and schema. The practical win is reversibility: changing corpus weights can stay in the blend config and sample maps instead of forcing a fresh physical export and another full .idx promotion cycle.
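To make the "dataset and sample index maps" idea concrete, here is a small greedy sketch in the same spirit. It is an illustration under stated assumptions, not Megatron's blending code: `build_blend_maps` is a hypothetical function that, for each global sample, picks whichever dataset is furthest behind its target share.

```python
def build_blend_maps(weights, num_samples):
    """Return (dataset_index, dataset_sample_index) for a virtual mixture.

    dataset_index[i] says which dataset serves global sample i;
    dataset_sample_index[i] says which sample inside that dataset to use.
    """
    total = sum(weights)
    shares = [w / total for w in weights]
    counts = [0] * len(weights)
    dataset_index, dataset_sample_index = [], []
    for i in range(num_samples):
        # Deficit = how many samples each dataset should have had by now,
        # minus how many it has actually contributed.
        deficits = [shares[d] * (i + 1) - counts[d] for d in range(len(weights))]
        chosen = max(range(len(weights)), key=deficits.__getitem__)
        dataset_index.append(chosen)
        dataset_sample_index.append(counts[chosen])
        counts[chosen] += 1
    return dataset_index, dataset_sample_index
```

Changing the weights only changes these two small maps; the underlying .bin/.idx pairs stay untouched and independently verifiable.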
What should a promotion check validate in a generated .idx file?
At minimum: header compatibility, dtype code, monotonic offset arrays, agreement between sizes and the .bin payload, and a sane doc_idx ledger. If those checks also pass under a real read-back, the bridge is probably preserving the runtime contract instead of only emitting files.
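Those checks can be sketched as a standalone function. The layout it parses (magic bytes, version, dtype code, counts, then int32 sizes, int64 pointers, and int64 doc_idx) and the dtype codes are assumptions about the Megatron mmap index format, and it further assumes sequences are stored back to back in the .bin payload. `check_idx` is a hypothetical name.

```python
import struct

IDX_MAGIC = b"MMIDIDX\x00\x00"   # assumed magic header
TOKEN_BYTES = {4: 4, 8: 2}       # assumed dtype codes: 4 -> int32, 8 -> uint16


def check_idx(idx_bytes, bin_size):
    """Return a list of problems; an empty list means every check passed."""
    if idx_bytes[:9] != IDX_MAGIC:
        return ["bad magic header"]
    if len(idx_bytes) < 34:
        return ["truncated header"]
    version, code, n_seq, n_doc = struct.unpack_from("<QBQQ", idx_bytes, 9)
    problems = []
    if version != 1:
        problems.append(f"unexpected version {version}")
    if code not in TOKEN_BYTES:
        return problems + [f"unknown dtype code {code}"]
    if len(idx_bytes) != 34 + 4 * n_seq + 8 * n_seq + 8 * n_doc:
        return problems + ["index length disagrees with its own counts"]
    sizes = struct.unpack_from(f"<{n_seq}i", idx_bytes, 34)
    pointers = struct.unpack_from(f"<{n_seq}q", idx_bytes, 34 + 4 * n_seq)
    doc_idx = struct.unpack_from(f"<{n_doc}q", idx_bytes, 34 + 12 * n_seq)
    offset, drifted = 0, False
    for i, (size, pointer) in enumerate(zip(sizes, pointers)):
        if size < 0 or pointer != offset:
            problems.append(f"sequence {i}: pointer or size drift")
            drifted = True
            break
        offset += size * TOKEN_BYTES[code]
    if not drifted and offset != bin_size:
        problems.append("sizes do not cover the .bin payload exactly")
    ledger_ok = (
        n_doc > 0 and doc_idx[0] == 0 and doc_idx[-1] == n_seq
        and all(a <= b for a, b in zip(doc_idx, doc_idx[1:]))
    )
    if not ledger_ok:
        problems.append("doc_idx is not a non-decreasing 0..n_seq ledger")
    return problems
```

A static pass like this is cheap enough to run on every promotion; the read-back test mentioned above remains the stronger signal.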
Why can a mathematically pure global shuffle still hurt huge .bin/.idx runs?
Because .idx gives true random access, but the storage path still runs through mmap and the OS page cache. On very large shards, a shuffle that jumps too broadly across the file can turn that strength into page-cache churn and GPU stalls. The practical goal is not "less randomness"; it is keeping shuffle locality explicit enough that storage can keep up, which is the same operational seam described in Converting parquet token shards into Megatron indexed datasets.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

loss_mask

The per-token training mask that decides which positions contribute to loss after packing, FIM rearrangement, or documentation-aware masking.

Megatron

Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…

Parquet

Why MegaCpp keeps a narrow data bridge from tokenized parquet shards to Megatron indexed datasets instead of tying data preparation to one runtime…

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

Data pipeline

An honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…
