MegaCpp Engineering: Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published · 12 min read · David Gornshtein
Data
Reproducibility
Dataloader
Training

Data Shuffling and Seed Discipline

Deterministic shuffles, seed plumbing across rank and stage, the reshuffle-per-epoch rule, packed-sequence ordering effects on loss curves, and the reproducibility bar we actually hold.

Reproducibility is one of those words that sounds like a property of the code and is actually a property of the whole training setup. In MegaCpp we use a specific, finite bar: two runs of the same config, on the same data, with the same seed, on the same hardware family, should produce the same loss curve for the first few thousand steps. Beyond that, numerical drift in bf16 reductions makes strict bitwise equality pointless. The stricter policy note is "Determinism and bit-exact runs: what we guard and where we accept drift"; this post is the data-order part of that same contract.

The bar we do hold covers everything under our control: data order, document packing, batch composition, FIM splits, initial weights. This post is about what it took to get there.

Where seeds actually get set

The shared randomness utilities module sets the global seeds during device init:

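# we set the global seeds here, but most of the code uses explicit rng objects
torch.manual_seed(42)
if device_type == "cuda":  # torch device type strings are lowercase
    torch.cuda.manual_seed(42)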

That comment in the file is the one that matters: "we set the global seeds here, but most of the code uses explicit rng objects." The global seed is a fallback, not the primary contract. If you relied on the global state, a single torch.randn call anywhere upstream of the place you cared about would shift your run.
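
A minimal sketch of why the global seed is only a fallback. It uses the standard library's random module as a stand-in for torch's global generator; the function names are hypothetical and not taken from the MegaCpp tree.

```python
import random


def draw_with_global_state(stray_draws: int) -> float:
    """Seed the global RNG, consume some stray draws, then take 'our' draw."""
    random.seed(42)
    for _ in range(stray_draws):
        random.random()  # stands in for an unrelated upstream torch.randn call
    return random.random()


def draw_with_local_rng(stray_draws: int) -> float:
    """Same scenario, but 'our' draw comes from a private generator."""
    random.seed(42)
    rng = random.Random(42)  # isolated stream; nothing else holds a reference
    for _ in range(stray_draws):
        random.random()  # stray global draws cannot touch rng
    return rng.random()


# One stray upstream draw shifts the value read from global state...
global_shifted = draw_with_global_state(0) != draw_with_global_state(1)
# ...but leaves the value read from the private generator unchanged.
local_stable = draw_with_local_rng(0) == draw_with_local_rng(1)
```

The same reasoning should carry over to torch.Generator objects, which is the form the explicit RNGs take in the examples later in this post.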

The main training entrypoint adds an optional user-provided override:

if getattr(args, "seed", None) is not None:
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(args.seed)

Note the order. We set all four. If any one is skipped, reproducibility leaks through that channel. numpy in particular is easy to forget; several utility modules call into numpy RNGs for shuffles and masks, and a missing np.random.seed had us chasing a spurious difference between two "identical" runs for an afternoon.

For Megatron-style tensor parallel init, init_megatron_parallel_state takes an explicit seed argument: _megatron_seed = getattr(args, "seed", None) or 42. Same seed on every rank, by design; the init is a deterministic function of rank plus seed, so every chip computes its own shard identically. Note that the `or` fallback also maps an explicit seed of 0 to 42.

Explicit RNG objects over global state

The rule we follow in the code: any randomness that matters takes an RNG as an argument, or seeds its own local generator.

Examples in the tree:

  • The adapter-merge path takes seed: int | None; when non-None, it creates a torch.Generator and calls gen.manual_seed(seed).
  • The compact-activation path takes a seed and seeds its own generator per call site. Its reverse pass re-seeds the same generator from ctx.seed, which is why the compressed activation can be reconstructed without storing the projection matrix.
  • The KV-quantization path's orthogonal matrix helper seeds a local generator with the passed-in seed, defaulting to 42.
  • The sampling path in the main model runtime module uses a passed-in seed (default 42) for rng.manual_seed(seed) inside generate(...).
  • The best-of-n sampling path takes a base_seed and uses base_seed + i for sample i. Not torch.manual_seed; a per-call generator.
  • The fill-in-the-middle packing path takes an optional random.Random instance. When the caller wants reproducibility, they pass one in.
  • The GSPO training path uses seed=42 + i inside its loop, for the same reason as best-of-n.
  • The compact-activation path's seeded layer computes seed = self.seed_offset + self._step * 31337. The layer index contributes via the offset (count * 7919), so layer 0 and layer 1 do not share a seed.

We keep seeds plumbed through as arguments so that nothing further upstream can poison them. It is ugly. It works.
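
The base_seed + i pattern from the list above can be sketched as follows. This is an illustration with the standard library's random.Random rather than torch.Generator, and best_of_n here is a hypothetical toy, not the real sampling path.

```python
import random


def best_of_n(candidates: list[str], n: int, base_seed: int) -> list[str]:
    """Draw n samples, each from its own generator seeded with base_seed + i."""
    picks = []
    for i in range(n):
        rng = random.Random(base_seed + i)  # per-call generator, not global state
        picks.append(rng.choice(candidates))
    return picks


candidates = ["a", "b", "c", "d"]
first = best_of_n(candidates, n=4, base_seed=42)

random.seed(0)
random.random()  # unrelated global draw between the two calls

second = best_of_n(candidates, n=4, base_seed=42)
prefix = best_of_n(candidates, n=2, base_seed=42)
```

Because sample i depends only on base_seed + i, asking for fewer samples returns a prefix of the longer run, and nothing that happens to global state between calls changes the result.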

The same rule applies to loader-owned torch.Generator state. Saving a generator only helps if the sampler or loader actually consumes that isolated generator, and if worker children are reseeded from that same logical stream instead of inheriting ambient process state. Otherwise one stray random draw before the loader starts is enough to offset the rest of the run.
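
One way to reseed worker children from a single logical stream is to hash the base seed together with the epoch and worker id. The sketch below assumes that scheme; derive_seed and make_worker_rng are hypothetical helpers, not functions from the loader.

```python
import hashlib
import random


def derive_seed(base_seed: int, *parts: int) -> int:
    """Hash (base_seed, parts...) into a 64-bit seed that is stable across processes."""
    key = ":".join(str(p) for p in (base_seed, *parts)).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def make_worker_rng(base_seed: int, epoch: int, worker_id: int) -> random.Random:
    """Hypothetical worker-init helper: each worker gets its own derived stream."""
    return random.Random(derive_seed(base_seed, epoch, worker_id))


random.seed(123)
random.random()  # ambient draw before the loader starts
a = [make_worker_rng(42, epoch=1, worker_id=w).random() for w in range(4)]
random.random()  # another stray draw
b = [make_worker_rng(42, epoch=1, worker_id=w).random() for w in range(4)]
```

The sketch uses hashlib rather than Python's built-in hash() because string hashing is randomized per process by default, which would defeat the purpose.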

The multi-source dataloader and its RNG

The streaming dataloader over multiple parquet source families uses a single random.Random(42) to pick which source family a given batch comes from. That RNG lives on the main process.

rng = random.Random(42)
if resume_state_dict is not None and "multi_source_rng_state" in resume_state_dict:
    rng.setstate(resume_state_dict["multi_source_rng_state"])

Two design choices are worth calling out:

  1. The RNG state is serialized into the checkpoint on save and restored on resume. If you resume at step N, the source selection at step N+1 is the one you would have seen in a fresh run of identical config; we have verified this on restart smokes.
  2. The RNG is a random.Random instance, not random module global state. Nothing else can touch it.
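
A small sketch of the save-and-restore round trip, reusing the multi_source_rng_state key from the excerpt above. The source-family names and the use of pickle as the checkpoint format are assumptions for illustration.

```python
import pickle
import random

SOURCES = ["code", "docs", "commits"]  # hypothetical source-family names


def pick_sources(rng: random.Random, steps: int) -> list[str]:
    """Pick one source family per step from the given RNG."""
    return [rng.choice(SOURCES) for _ in range(steps)]


# Uninterrupted run: 10 source picks.
fresh = pick_sources(random.Random(42), 10)

# Interrupted run: 6 picks, then checkpoint the RNG state.
rng = random.Random(42)
before = pick_sources(rng, 6)
checkpoint = pickle.dumps({"multi_source_rng_state": rng.getstate()})

# Resume: rebuild the RNG as a fresh process would, then restore its state.
resume_state_dict = pickle.loads(checkpoint)
rng = random.Random(42)
if resume_state_dict is not None and "multi_source_rng_state" in resume_state_dict:
    rng.setstate(resume_state_dict["multi_source_rng_state"])
after = pick_sources(rng, 4)
```

The picks before and after the checkpoint concatenate to the uninterrupted sequence. If the checkpoint were stored as JSON instead, the nested tuples returned by getstate() would come back as lists and would likely need converting before setstate() accepts them.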

For the packed-row loader, the epoch counter itself is part of the resume state:

resume_epoch = (
    resume_state_dict.get("epoch", 1) if resume_state_dict is not None else 1
)

It starts at 1, increments when all shards of the parquet lineage have been consumed, and is written back into the state we persist for resume. The epoch field is part of the loader state tuple, next to the shard index and the per-shard offset.
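
The rollover behaviour described above can be sketched with a small cursor object. LoaderCursor and its field layout are hypothetical; only the pq_idx name, the epoch key, and the start-at-1 convention come from the text.

```python
from dataclasses import asdict, dataclass


@dataclass
class LoaderCursor:
    """Hypothetical stand-in for the loader state tuple."""

    pq_idx: int = 0  # index of the current shard
    offset: int = 0  # row offset inside that shard
    epoch: int = 1   # starts at 1, bumps on wrap-around

    def advance(self, rows_per_shard: list[int]) -> None:
        """Move one row forward; roll over shard and epoch when exhausted."""
        self.offset += 1
        if self.offset >= rows_per_shard[self.pq_idx]:
            self.offset = 0
            self.pq_idx += 1
            if self.pq_idx >= len(rows_per_shard):
                self.pq_idx = 0
                self.epoch += 1


rows_per_shard = [2, 3]  # two tiny shards, five rows per epoch
cursor = LoaderCursor()
for _ in range(5):
    cursor.advance(rows_per_shard)

state = asdict(cursor)  # what would be written into the checkpoint
resume_epoch = state.get("epoch", 1)
```

After exactly one pass over both shards the cursor is back at the first row of the first shard with the epoch bumped to 2, and that is the value a resume would read back.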

Shuffles happen where it is cheap and safe

The parquet write stage is where we shuffle at scale. In the JSONL-to-Parquet ingestion stage:

  1. Read JSONL lines with readline().
  2. Accumulate documents in batches of 50000.
  3. When a batch is full, shuffle it and write a parquet shard immediately.
  4. Wait on idle, write a final validation shard when the writer is done.

Shuffling at this stage is cheap because the data is already in memory, and it is safe because it happens exactly once per shard, before anyone reads it. The shuffle inside create_fim_example's corpus prep uses random.Random(42).shuffle(texts) for the same reason: local, seeded, one-shot.
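
The write-time shuffle can be sketched without any parquet dependency. In this illustration each yielded list stands in for one shard; seeding each shard with seed + shard index is an assumption, and the validation shard from step 4 is not modelled.

```python
import random
from typing import Iterable, Iterator


def shuffled_shards(
    docs: Iterable[str], batch_size: int = 50_000, seed: int = 42
) -> Iterator[list[str]]:
    """Yield shuffled batches; each yielded list would become one parquet shard."""
    batch: list[str] = []
    shard_idx = 0
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            # Seed per shard so a re-run of the writer reproduces every shard.
            random.Random(seed + shard_idx).shuffle(batch)
            yield batch
            batch, shard_idx = [], shard_idx + 1
    if batch:  # final partial shard
        random.Random(seed + shard_idx).shuffle(batch)
        yield batch


docs = [f"doc-{i}" for i in range(10)]
run_a = list(shuffled_shards(docs, batch_size=4))
run_b = list(shuffled_shards(docs, batch_size=4))
```

Two runs of the writer over the same input produce identical shards, and documents never cross a shard boundary, which is the property the packed-row loader relies on.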

Why not a streaming shuffle buffer at load time? Two reasons. The packed-row loader cannot reorder across shard boundaries without breaking the chronology collator that needs file_local_commit_index to be ascending within a packed batch (_validate_packed_row_commit_window_batch in the dataloader implementation raises if it is not). And shard-level entropy is already enough; across thousands of shards, per-shard shuffle plus multi-source RNG produces batch composition that does not benefit from another buffer pass.

The reshuffle-per-epoch rule

The data-pipeline design doc spells out the multi-epoch strategy. With roughly 80B unique tokens and 300B target tokens, we cycle:

  • Epoch 1-2: full corpus, standard order.
  • Epoch 3-4: re-shuffled, different FIM splits.
  • Never repeat the exact same FIM split of the same file.

The re-shuffle on epoch rollover is implemented at the shard-listing layer. When the loader wraps around len(parquet_paths), it rescans the directory (new shards may have landed), and the loader bumps the epoch counter. The epoch-bump is the signal to the upstream prep that FIM splits should be regenerated; the regeneration is seeded off the epoch number, so epoch 3 and epoch 4 get distinct splits and epoch 3 restarted from checkpoint gets the same splits as the original epoch 3.

The reason to not repeat the exact same FIM split is straightforward: cross-entropy on a token position you have already seen in the same framing is closer to memorization than generalization. Re-framing the same document with a different FIM split forces the model to solve a different task at that position.

We have not shipped aggressive data augmentation beyond FIM; the scope of "re-shuffled, different FIM splits" is the envelope.

The safest implementation detail is to derive the split from stable document identity plus epoch rather than from whatever point the process-global RNG had reached when the loader touched the sample. That keeps worker count, prefetch timing, and later reshuffle windows from silently changing which infill framing a document receives.
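
A minimal sketch of a split derived from document identity plus epoch. The function name, the key format, and the choice of two uniformly sampled cut points are all assumptions for illustration; the real FIM path may pick its spans differently.

```python
import hashlib
import random


def fim_split(doc_id: str, text: str, epoch: int, seed: int = 42) -> tuple[str, str, str]:
    """Split text into (prefix, middle, suffix) as a pure function of its inputs."""
    key = f"{seed}:{epoch}:{doc_id}".encode()
    rng = random.Random(int.from_bytes(hashlib.sha256(key).digest()[:8], "big"))
    # Two distinct cut points, ordered; assumes text is non-empty.
    lo, hi = sorted(rng.sample(range(len(text) + 1), 2))
    return text[:lo], text[lo:hi], text[hi:]


text = "int main() { return 0; }"
epoch3 = fim_split("repo/a.cpp", text, epoch=3)
epoch3_resumed = fim_split("repo/a.cpp", text, epoch=3)  # e.g. after a restart
epoch4 = fim_split("repo/a.cpp", text, epoch=4)
```

Nothing here reads process-global RNG state, so the split for a given document and epoch is the same regardless of worker count or the order in which samples are touched. A different epoch will usually, though not always, land on different cut points.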

The same discipline has to hold for token-aligned side data. If the FIM split is reproducible but the metadata remap is not, resume can replay the same text while still shifting chunk boundaries or structure labels underneath it. The checked-in masking pipeline excerpt and chunk boundary remap sample are the public-safe proof surfaces for that rule: offset rewrites, sentinel placement, and dropping cross-split chunks all need to be a pure function of the same split.

Packed-sequence ordering effects on loss curves

Packing is where ordering subtly matters. We have three policies: packed_rows, best_fit, and single_doc. The loader tracks utilization telemetry (optional, via --packing_telemetry) including valid tokens / total slots, docs_per_row, avg_doc_tokens, and cropped_doc_frac.

The effect on the loss curve:

  • single_doc gives the cleanest curve but the worst utilization. Padding dominates on short documents.
  • best_fit produces the best utilization but reorders documents within the packing window. That reordering is deterministic (bisect into a sorted list keyed by order key) given a fixed input document stream, so it does not break reproducibility, but it does change which documents land next to which. We have seen best_fit vs single_doc produce visibly different first-epoch loss traces on the same seed, with best_fit running ~0.03 bpb lower in the early phase. The safer interpretation is not "padding improved and nothing else changed." Sequence composition changed too.
  • packed_rows reads pre-packed shards from parquet. The ordering inside each row is fixed at write time. This is the fastest path and the one we use for the main training lane.
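
To make the best_fit reordering concrete, here is a toy packer. It is a sketch under assumptions: pending documents are kept in a sorted list of (length, order_key) pairs, over-long documents are cropped to the row length, and each row greedily takes the largest document that still fits. The real policy may differ in its keys and tie-breaking.

```python
import bisect


def best_fit_pack(doc_lens: list[int], row_len: int) -> list[list[int]]:
    """Pack document indices into rows of capacity row_len using best fit."""
    # Sorted by (length, order_key); ties resolve by input order.
    pending = sorted((min(n, row_len), i) for i, n in enumerate(doc_lens))
    rows: list[list[int]] = []
    while pending:
        row, free = [], row_len
        while pending:
            # Rightmost entry whose length is <= the remaining free space.
            pos = bisect.bisect_right(pending, (free, len(doc_lens))) - 1
            if pos < 0:
                break  # nothing left fits in this row
            length, idx = pending.pop(pos)
            row.append(idx)
            free -= length
        rows.append(row)
    return rows


doc_lens = [5, 3, 8, 2, 4]
rows = best_fit_pack(doc_lens, row_len=8)
rows_again = best_fit_pack(doc_lens, row_len=8)
```

The output is fully determined by the input stream, so two runs agree, yet document 4 ends up ahead of document 3 in the same row even though they arrived in the opposite order. That is the kind of composition change that can move a loss trace without any change in seed.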

One subtle interaction with the chronology collator: for temporal commit-window collation, packed rows in a batch are sorted by file_local_commit_index so commit chains appear in temporal order within the batch. That sort is stable and deterministic. We validated it explicitly because the collator rejects any batch where sorted(indices) != indices, so a bug in ordering would crash loudly rather than silently shift the loss.
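
The sort-then-validate pattern can be sketched as below. The two function names and the row dictionaries are hypothetical; only the file_local_commit_index field and the sorted(indices) != indices check come from the text.

```python
def validate_commit_window(indices: list[int]) -> None:
    """Raise if file_local_commit_index values are not ascending in the batch."""
    if sorted(indices) != indices:
        raise ValueError(f"commit indices out of order: {indices}")


def collate_by_commit_index(rows: list[dict]) -> list[dict]:
    """Stable sort of packed rows by commit index, then validate the result."""
    ordered = sorted(rows, key=lambda r: r["file_local_commit_index"])
    validate_commit_window([r["file_local_commit_index"] for r in ordered])
    return ordered


batch = [
    {"file_local_commit_index": 2, "row": "c"},
    {"file_local_commit_index": 0, "row": "a"},
    {"file_local_commit_index": 2, "row": "d"},
]
ordered = collate_by_commit_index(batch)

try:
    validate_commit_window([0, 2, 1])
    rejected = False
except ValueError:
    rejected = True
```

Python's sorted is stable, so the two rows that share commit index 2 keep their input order, and an out-of-order batch fails immediately instead of reaching the model.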

We also had one instance where a telemetry counter change, adding utilization tracking, briefly seemed to change loss. It did not; the utilization tracking is plain Python with zero device overhead. The apparent shift was a different run picking up a different shard set because completion sentinels were not yet in place. That bug fixed, reproducibility was back.

Resume determinism

A real reproducibility bar includes resume. If resume-from-step-N is not deterministic, long training runs cannot be compared to their restart-free counterparts after an inevitable interruption.

What we persist per step for the loader:

  • Shard index (pq_idx).
  • Row/byte offset inside the shard.
  • Epoch counter.
  • multi_source_rng_state for the multi-source selector.
  • Per-source state tuples.

What we persist for the model:

What we do not persist, and have chosen not to:

  • Per-rank dropout RNG state. Dropout in bf16 on TPU already introduces noise we do not try to make bitwise-exact across resume; the cost of persisting and restoring is not worth it. We accept that resume-from-N produces a loss curve that tracks but is not bitwise identical beyond the first few steps.
  • Python random module global state beyond the multi-source RNG. Anything that needs reproducibility uses an explicit RNG, so the global is irrelevant.

A second boundary matters on elastic restarts: logical row order cannot depend on the current GPU count. Placement may change on resume, but the next approved row and source decision cannot. If replay state is tied to physical ranks instead of logical position, the restart has become a different data run.

There are two sane ways to make that true. Either persist the exact RNG and loader cursor state and replay from there, or make the loader algebraic: store a logical sample position, epoch, shard identity, and shuffle seed, then recompute the same row from those invariants. The second path is harder to keep pure, because every augmentation has to be a function of stable document identity and epoch, but it avoids turning checkpoint size into a hidden tax on every resume.
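
The algebraic option can be sketched as a pure mapping from logical position to row. Everything named here is hypothetical, and the per-call recomputation of the permutation is kept only for brevity; a real loader would presumably cache it per epoch.

```python
import hashlib
import random


def epoch_order(num_rows: int, shuffle_seed: int, epoch: int) -> list[int]:
    """Row permutation for one epoch, recomputed from invariants only."""
    key = f"{shuffle_seed}:{epoch}".encode()
    rng = random.Random(int.from_bytes(hashlib.sha256(key).digest()[:8], "big"))
    order = list(range(num_rows))
    rng.shuffle(order)
    return order


def row_at(position: int, num_rows: int, shuffle_seed: int) -> tuple[int, int]:
    """Map a logical sample position to (epoch, row_id); epochs start at 1."""
    epoch, within = divmod(position, num_rows)
    return epoch + 1, epoch_order(num_rows, shuffle_seed, epoch + 1)[within]


def global_stream(world_size: int, steps: int, num_rows: int, seed: int) -> list[tuple[int, int]]:
    """Rows consumed per step across all ranks, flattened in logical order."""
    out = []
    for step in range(steps):
        for rank in range(world_size):
            out.append(row_at(step * world_size + rank, num_rows, seed))
    return out


# The same 12 logical positions, consumed by 4 ranks x 3 steps or 2 ranks x 6 steps.
on_4_gpus = global_stream(world_size=4, steps=3, num_rows=5, seed=42)
on_2_gpus = global_stream(world_size=2, steps=6, num_rows=5, seed=42)
```

Because the row depends only on the logical position, epoch, and shuffle seed, changing the rank count changes placement but not the sequence of rows, and the checkpoint only needs to carry the position.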

On resume smokes we check three things: the next (shard, offset, doc) triple matches what a fresh run would produce at the same step; the first few losses are within bf16 noise of the original; and the multi-source RNG has re-emitted the same source for the next batch.

The reproducibility bar

Stated explicitly:

  1. Data order is deterministic given the same seed and the same shard set. Tested by comparing the first 1000 (shard, doc_id) pairs across two runs.
  2. Document packing is deterministic given the same input document stream. Tested by hashing the packed-row token sequences for the first N rows across two runs.
  3. FIM splits are deterministic given the same seed and epoch number. Tested by comparing FIM markers in the first N splits.
  4. Weight init is deterministic given the same seed (torch.manual_seed, np.random.seed, random.seed, torch.cuda.manual_seed_all all set). Tested by comparing parameter hashes immediately after init.
  5. The first ~100 training losses match to within bf16 noise across two runs with identical config. We eyeball this on every new architecture change; a visible divergence in step 1 is always a bug somewhere.
  6. Resume-from-step-N matches the RNG state and data position of a fresh run at step N. Tested on every major dataloader change.
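
The hashing check in item 2 could look roughly like this. It is a sketch under assumptions: rows are plain lists of token ids that fit in 32 bits, and rows_fingerprint is a hypothetical helper rather than the test we actually run.

```python
import hashlib


def rows_fingerprint(rows: list[list[int]], first_n: int) -> str:
    """SHA-256 over the first first_n packed rows, including row boundaries."""
    h = hashlib.sha256()
    for row in rows[:first_n]:
        h.update(len(row).to_bytes(4, "little"))  # keep row boundaries in the hash
        for tok in row:
            h.update(tok.to_bytes(4, "little"))   # assumes token ids fit in 32 bits
    return h.hexdigest()


run_a = [[1, 2, 3], [4, 5], [6]]
run_b = [[1, 2, 3], [4, 5], [6]]
run_c = [[1, 2], [3, 4, 5], [6]]  # same tokens, different packing

same = rows_fingerprint(run_a, 3) == rows_fingerprint(run_b, 3)
differs = rows_fingerprint(run_a, 3) != rows_fingerprint(run_c, 3)
```

Hashing the row length along with the tokens means a run that emits the same token stream with different row boundaries is still flagged as different.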

What we do not promise:

  • Bitwise identical loss beyond the warmup window. bf16 reductions, XLA graph partitioning across recompiles, and nondeterministic collectives at scale make this an expensive illusion.
  • Cross-hardware reproducibility. A v6e run and an accelerator-backed run are not expected to match bitwise; they share tokenizer, shard set, and seed, and that is enough to reason about.
  • Reproducibility under torch.use_deterministic_algorithms(True). The line is commented out in the shared randomness utilities with the note "skipping full reproducibility for now, possibly investigate slowdown later." The slowdown is real, and we have not needed the extra guarantee.

What we threw away

  • The idea that a single torch.manual_seed up front would pin all randomness. It will not, and pretending it does only hides the first place it leaks.
  • Any attempt at a streaming shuffle buffer on top of the packed-row loader. Breaks chronology, adds latency, has never been worth it.
  • A "canonical" epoch ordering preserved across epochs. Re-shuffle per epoch is the rule; fighting it to preserve order across epochs makes the loss curve worse.
  • The habit of logging seeds only at startup. They are now logged into the per-run metadata snapshot alongside the model config so that a bisect on a loss regression can start by confirming the seed did not change.

Reproducibility at this level is a maintenance cost. It is also what lets a single failed check at step 100 turn into "which commit between these two introduced the divergence" rather than "something is wrong somewhere in the last month."

Seed-to-source map

| Source of randomness | Where it is set | Per-rank? |
| --- | --- | --- |
| torch.manual_seed (global) | the shared randomness utilities device init | yes (rank-offset) |
| dataloader RNG | explicit numpy.random.Generator | yes |
| FIM split | per-doc seeded random.Random | doc-deterministic |
| Best-of-N sampling | explicit RNG passed in | yes |
| GSPO group sampling | explicit RNG passed in | yes |
| init weights | torch.Generator per module | rank-shared |

Frequently asked questions

Is saving one loader generator enough for exact resume?
No. It is only enough if the sampler, loader, and worker-init path all derive from that same logical RNG stream. A saved generator plus one stray global draw is still a different batch order.
Why mention worker_init_fn separately if the loader already has a saved generator?
Because the main-process generator and the worker processes do different jobs. The saved loader generator can still define batch order, but multi-process workers may also touch Python or NumPy randomness while reading or augmenting samples. If those workers are not re-seeded from the same logical stream, resume can keep the same sample order while still changing FIM boundaries, masks, or other per-sample randomness inside the worker path.
Do deterministic FIM epochs require separate shard copies?
No. The safer pattern is to keep the canonical tokenized record stable and derive the FIM split from document identity plus epoch. The checkpoint then needs the logical position and epoch, not a second physical copy of every reshuffled span.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

TP

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

scheduler

The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.