Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 10 min read • David Gornshtein

Determinism

Reproducibility

Training Infra

Testing

Determinism and bit-exact runs: what we guard and where we accept drift

A grounded account of GPU and TPU determinism on our stack: the fast path we run in production, the bitwise path we keep for regression testing, and the tests that fire when silent nondeterminism creeps in.

Full numerical determinism on a modern accelerator is a consultant's promise, not an engineering target. We do not claim it. What we do claim, and what we have tests for, is a narrower and more useful property: for any code path we consider stable, turning a new feature flag off gets us byte-identical outputs to the version of the code before that feature existed. We call that "bit-exact default path" and it is the contract that lets us ship changes at the pace we do without losing the ability to bisect a training loss regression.

This post is about where we land on determinism in practice: what is bitwise, what is "deterministic up to reduction order", what we have just accepted as non-deterministic, and which tests catch the drift.

Three Tiers of "The Same"

We separate three notions that people regularly conflate.

Tier 1, bit-exact: the two runs produce the exact same tensors, byte for byte.
Tier 2, numerically deterministic: the same run, run again with the same inputs and the same seed, produces the same output. Two different runs are not promised to match.
Tier 3, statistically equivalent: loss and eval metrics land within a small tolerance; specific tensors do not match.

Tier 1 is what we guard for the default code path and for anything downstream of a checkpoint torch.load. Tier 2 is what we guard for kernel-level changes that are "the same algorithm, different implementation". Tier 3 is what we settle for on GPU training end-to-end, because nothing else survives contact with cuDNN, NCCL all-reduce reduction order, and Flash Attention backward.

The Fast Path Versus the Bitwise Path

The simplest statement of our stance appears in the shared runtime bootstrap:

torch.manual_seed(42)
if device_type == "cuda":
    torch.cuda.manual_seed(42)
# skipping full reproducibility for now, possibly investigate slowdown later
# torch.use_deterministic_algorithms(True)

That commented-out line is the honest summary. We seed everything that has a global RNG, we do not call torch.use_deterministic_algorithms(True) for production training runs, and we do not force CUBLAS_WORKSPACE_CONFIG. We tried. The slowdown on the FA3 / FA4 backward path was not tolerable at our scale, and the runs that require bitwise behavior are small, targeted, and already run in a separate configuration.

The production fast path therefore:

Uses torch.set_float32_matmul_precision("high") on CUDA, which lets cuBLAS pick TF32 tensor-core algorithms whose reduction order can differ across launches.
Runs NCCL all-reduces in whatever order the collective scheduler picks; bf16 addition is not associative, so a different order can change the last bit of an average.
Uses Flash Attention backward kernels that internally accumulate in fp32 but do so with block orderings that depend on occupancy and stream scheduling.
Accepts that two identical-seed runs on the same GPU will match loss to roughly five or six decimal places for a few hundred steps and diverge from there.
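
The reduction-order point can be made concrete without a GPU. The sketch below emulates bf16 by truncating the float32 encoding (an assumption made for simplicity; hardware normally rounds to nearest-even) and reduces the same three contributions in two arrival orders. The helper names are illustrative, not taken from our codebase.

```python
import struct


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    This rounds toward zero, which is a simplification of real hardware.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def bf16_add(a: float, b: float) -> float:
    """Add two values and round the result back to emulated bf16."""
    return to_bf16(to_bf16(a) + to_bf16(b))


def reduce_in_order(values):
    """Left-to-right reduction that rounds to bf16 after every add."""
    total = values[0]
    for v in values[1:]:
        total = bf16_add(total, v)
    return total


# Same three contributions, two arrival orders.
big_first = reduce_in_order([256.0, 1.0, 1.0])    # (256 + 1) + 1
small_first = reduce_in_order([1.0, 1.0, 256.0])  # (1 + 1) + 256

print(big_first, small_first)  # 256.0 258.0
```

At magnitude 256 the bf16 spacing is 2, so each lone +1 is rounded away, while the pre-summed +2 survives. Each individual add is still commutative; only the grouping differs.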

The bitwise path is a different configuration we keep paved for regression tests and for the cases where drift would hide a real bug. It is not a mode a user flips at runtime; it is a specific set of feature flags set to their pre-feature values plus a few environment knobs. In practice this means a known baseline configuration whose outputs can be compared byte-for-byte when a new feature needs isolation.
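
As a rough illustration of the "environment knobs" half of that configuration, the sketch below switches on the standard PyTorch reproducibility settings. The helper name enable_bitwise_lane and the exact set of knobs are assumptions for illustration; the feature-flag half is specific to each change and is not shown.

```python
import os

# cuBLAS reads this when the CUDA context is created, so it has to be set
# before any CUDA work. ":4096:8" is one of the values PyTorch documents.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch


def enable_bitwise_lane(seed: int = 42) -> None:
    """Hypothetical helper: trade speed for run-to-run reproducibility."""
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Use deterministic implementations where they exist; raise where they do not.
    torch.use_deterministic_algorithms(True)
    # Stop cuDNN from benchmarking and picking a different algorithm per run.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Keep float32 matmuls in full fp32 instead of TF32.
    torch.set_float32_matmul_precision("highest")


enable_bitwise_lane()
```

None of this makes a kernel deterministic if no deterministic variant exists; with use_deterministic_algorithms(True), PyTorch raises an error for the operations it knows to be nondeterministic, which is the useful behavior for a regression lane.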

The Default-Path Invariant

The rule we enforce on every merge: with all new flags off, the training forward pass is byte-identical to the pre-change baseline.

Concretely, for the Nemotron Nano 3 iteration:

reset_position_ids_at_doc_boundary=False degrades the new position-ID reset logic to a no-op, so the tensor passed into attention is the same tensor the old code passed in.
The per-block FP8 autocast context factory defaults to None. When it is None, _fp8_factory(i) degrades to contextlib.nullcontext(), which is a zero-overhead pass-through. The forward loop then compiles to the exact same bytecode as the pre-FP8-scope version.
The MoE TE permute fast path is gated on MEGACPP_MOE_TE_PERMUTE=1. Unset, the code flow enters the index_select branch it always used, and the argsort-inverse unpermute path on the combine side is bit-exact to the pre-change combine.
grad_reduce_in_fp32 is opt-in. With the flag off, the reducer keeps its bf16 flat grad buffers and its bf16 reduce_scatter_tensor.
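
The "unset factory degrades to nullcontext" shape from the list above can be sketched in a few lines. This is a minimal stand-in, not our module: block_context, forward, and the toy blocks are hypothetical names, and the real helper wraps FP8 autocast rather than arbitrary contexts.

```python
import contextlib
from typing import Callable, Optional

# Hypothetical alias: a factory maps a block index to a context manager.
ContextFactory = Callable[[int], contextlib.AbstractContextManager]


def block_context(factory: Optional[ContextFactory], block_idx: int):
    """Return the per-block context, or a plain nullcontext when no factory is set."""
    if factory is None:
        return contextlib.nullcontext()
    return factory(block_idx)


def forward(blocks, x, factory: Optional[ContextFactory] = None):
    """Run every block inside its (possibly no-op) context."""
    for i, block in enumerate(blocks):
        with block_context(factory, i):
            x = block(x)
    return x


blocks = [lambda v: v * 2, lambda v: v + 1]
baseline = blocks[1](blocks[0](3))  # pre-feature loop: no context at all
default_path = forward(blocks, 3)   # feature merged, factory left unset

assert default_path == baseline
assert type(block_context(None, 0)) is contextlib.nullcontext
```

The second assertion checks the exact type rather than "some context manager", so a wrapper that merely behaves like a no-op would fail it.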

This pattern is deliberate and it is tested. The validation sweep we run on a default-path PR is:

Save a reference forward-pass output tensor from main with torch.save.
Apply the PR, run the same forward pass with all new flags off.
Compare the two with torch.equal. Not allclose. Equal.

This is slow to run in full, so we run it against a pinned tiny model (depth=4, n_embd=128, vocab trimmed) on CPU in CI. The model is small enough that the full forward takes milliseconds; the guarantee is the same. Our test_mhc_group_fp8_ctx family of tests is one example: seven tests that pin the "no factory installed means nullcontext, means bit-exact fallthrough" property across the three autograd call-site branches we care about.
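
A minimal CPU sketch of that three-step sweep follows. The two-layer model is a stand-in for the pinned tiny model (an assumption, not the actual CI fixture), and an in-memory buffer stands in for the reference file saved from main.

```python
import io

import torch
import torch.nn as nn


def tiny_model() -> nn.Module:
    """Stand-in for the pinned tiny model; seeded so the weights are reproducible."""
    torch.manual_seed(0)
    return nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 16))


def forward_output() -> torch.Tensor:
    """One forward pass on a fixed, seeded input."""
    model = tiny_model()
    torch.manual_seed(1)
    x = torch.randn(4, 16)
    with torch.no_grad():
        return model(x)


# Step 1: save the reference output (here to memory instead of a file).
buffer = io.BytesIO()
torch.save(forward_output(), buffer)

# Step 2: rerun the forward pass; in CI this is the PR with all new flags off.
candidate = forward_output()

# Step 3: exact comparison.
buffer.seek(0)
reference = torch.load(buffer)
assert torch.equal(reference, candidate)

# Why not allclose: nudging every element by one ULP still passes it.
nudged = torch.nextafter(reference, torch.full_like(reference, float("inf")))
print(torch.allclose(reference, nudged), torch.equal(reference, nudged))  # True False
```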

What We Can and Cannot Make Deterministic

On GPU, the honest breakdown is roughly:

FP32 matmul forward in non-TF32 mode: deterministic within a run, and bit-exact across identical runs if you pin CUBLAS workspace and disable the TF32 path. In production we do neither, because the cost is real.
FP32 matmul backward: same as forward in principle, but cuBLAS can pick a different algorithm for the grad input than for the grad weight path, and the two paths can share a workspace. Deterministic within a run.
BF16 matmul: deterministic within a run; differences across runs come mostly from reduction order, which is bounded but nonzero.
Flash Attention forward: deterministic by construction for the softmax pass. The same kernel configuration produces the same numerics every time, but block size and num_warps changes (Triton autotune) can change the result because the reduction order within a block changes. We pin autotune choices for any test that wants bit-exact behavior.
Flash Attention backward: not deterministic in the general case without the explicit "deterministic backward" variant. The CuTe-backed FA4 lane we use in production does not offer a deterministic backward at the block sizes we want. We accept it.
cuDNN dSwiGLU atomicAdd backward: we hit this one head-on; it is the classic atomic-accumulate non-determinism. The workaround and the vendor escalation are documented in the associated validation notes.
NCCL collectives: reductions whose ring or tree order is fixed at init are deterministic for a given topology. The moment the topology changes (different node counts, different NIC ordering), the order changes and bf16 bits drift.
Dropout, any stochastic sampling: deterministic iff seeded, and iff the seed sees the same RNG consumption history. Our data pipeline consumes RNG in a fixed order per rank.
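
The "same RNG consumption history" condition in the last item is easy to violate by accident. The sketch below uses Python's standard generator as a stand-in for any seeded RNG; the per-rank seeding formula is a hypothetical scheme chosen only for illustration.

```python
import random


def rank_rng(base_seed: int, rank: int) -> random.Random:
    """Hypothetical per-rank seeding: one independent generator per rank."""
    return random.Random(base_seed * 1000 + rank)


def draw_batch(rng: random.Random, n: int = 3):
    """Consume exactly n values, in a fixed order."""
    return [rng.random() for _ in range(n)]


run_a = draw_batch(rank_rng(42, rank=0))
run_b = draw_batch(rank_rng(42, rank=0))

rng = rank_rng(42, rank=0)
rng.random()  # one extra, unplanned draw (for example a debug-only code path)
run_c = draw_batch(rng)

assert run_a == run_b             # same seed, same history: identical
assert run_a != run_c             # same seed, different history: shifted
assert run_a[1:] == run_c[:2]     # the stream itself is unchanged, only offset
```

The seed alone does not pin the output; the number and order of draws before each use is part of the contract.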

On TPU / XLA:

XLA compiled programs are deterministic for a given HLO. Different HLO from a recompile means different numerics; see our graph-recompilation experience for why this matters.
SPMD collective ordering is pinned once the mesh is built, so the bf16 reduction-order issue is less severe than on NCCL.
torch_xla's SPMDSavePlanner / SPMDLoadPlanner are part of the exact-restart contract; our checkpoint round-trip test uses torch.equal and not torch.allclose.
Mamba SSM scans have an xla_scan code path for XLA determinism; we use it when we care about bitwise reproduction of a Mamba layer on TPU, and the more general Triton scan in other cases.

Routing-statistics determinism deserves its own paragraph: gradient checkpointing around an MoE block can route tokens to different experts on the recompute pass than it did on the forward pass, because routing reads auxiliary state. Our code comments on this explicitly (GPT block guidance, "EBlock: MoE dispatch is deterministic with static shapes") and keeps recompute off the MoE dispatch by default; turning it on is a conscious decision to accept non-deterministic routing decisions for the sake of memory.
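
A toy model of the recompute hazard: the router below keeps a per-expert bias that is updated after every call, as a stand-in for whatever auxiliary state a real router reads. The update rule and names are assumptions made for illustration, not our router.

```python
def route(scores, bias):
    """Top-1 routing: each token picks the expert with the highest biased score."""
    return [max(range(len(bias)), key=lambda e: row[e] + bias[e]) for row in scores]


def forward_with_aux_update(scores, bias, step=0.1):
    """Route, then push the bias away from the experts that were just used."""
    assignment = route(scores, bias)
    for expert in assignment:
        bias[expert] -= step
    return assignment


scores = [[1.0, 0.9], [1.0, 0.9], [1.0, 0.9]]  # 3 tokens, 2 experts

bias = [0.0, 0.0]
first_pass = forward_with_aux_update(scores, bias)  # original forward
recompute = forward_with_aux_update(scores, bias)   # naive checkpoint recompute
print(first_pass, recompute)  # [0, 0, 0] [1, 1, 1]

# Replaying against a snapshot of the state taken before the forward pass
# reproduces the original routing.
bias = [0.0, 0.0]
snapshot = list(bias)
first_pass = forward_with_aux_update(scores, bias)
replay = route(scores, snapshot)
assert first_pass == replay
```

Snapshotting the state is one way to keep recompute consistent; keeping recompute off the dispatch, as described above, avoids the question entirely.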

Tests That Guard Against Silent Non-Determinism

The interesting tests are not the "this function is deterministic" unit tests. The interesting tests are the ones that pin an invariant we otherwise would have lost.

test_perturbation_determinism and test_perturbation_restore_exact in the RandOpt suite: the first asserts that the same (seed, sigma) pair produces the same perturbation tensor; the second asserts that perturb then restore yields bitwise-identical weights to the original, including for frozen LoRA params. These two tests gate every RandOpt change.
test_mhc_group_fp8_ctx: seven tests that prove each branch of the per-block FP8 context path is nullcontext-equivalent when the factory is unset. This is the specific shape of "default path bit-exact" enforcement we described above.
test_checkpoint_manager::test_resume_weights_exact_match: uses torch.equal, not allclose, on resume. Anything that quietly changes .pt serialization numerics fails here.
test_doc_relative_sinks, test_exact_token_dsa_packed_doc_isolation, and the Mamba document-boundary isolation tests: all of them compare a packed-doc batch to the equivalent single-doc batch and require exact equality on the payload rows. Any cross-document numerical leak shows up immediately.
fail_closed_decode invariants on H200 decode runs: the validation sweep explicitly verifies determinism under decode (same inputs to the same model produce the same tokens), KV cache consistency, and finite logits. This is Tier 2 determinism and we run it on every serving candidate.
FA3 backward parity tests, with an explicit disclaimer in the report: identical seed plus deterministic data ordering are the preconditions, and "numerical bitwise identity" is flagged as NOT expected for bf16 kernels because their rounding differs. This is exactly the Tier 3 line we live on for attention backward.
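
The two RandOpt properties at the top of this list can be sketched with the standard library. This is an illustration under assumptions: the actual RandOpt implementation is not shown in this post, and restoring from a snapshot is one way to get exactness, not necessarily the way the suite does it.

```python
import random
import struct


def perturbation(seed: int, sigma: float, n: int):
    """Same (seed, sigma, n) always yields the same noise vector."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]


def as_bytes(values):
    """Raw float64 bytes, so comparisons are bitwise rather than by value."""
    return struct.pack(f"{len(values)}d", *values)


def perturb(weights, seed: int, sigma: float):
    """Return (perturbed weights, snapshot of the originals)."""
    snapshot = list(weights)
    noise = perturbation(seed, sigma, len(weights))
    return [w + p for w, p in zip(weights, noise)], snapshot


weights = [0.1, -2.5, 3.0e-8]

# Property 1: the perturbation is a pure function of (seed, sigma).
assert as_bytes(perturbation(7, 0.01, 3)) == as_bytes(perturbation(7, 0.01, 3))

# Property 2: perturb then restore is bitwise identical to the original.
perturbed, snapshot = perturb(weights, seed=7, sigma=0.01)
restored = list(snapshot)
assert as_bytes(restored) == as_bytes(weights)

# Subtracting the noise back out is not a safe restore in floating point.
w, p = 0.1, 1.0e17
print((w + p) - p == w)  # False
```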

The pattern we try to enforce on every new determinism-adjacent test is: pick the tier you are asserting, put it in the name of the test, and use torch.equal for Tier 1 / 2 and a tight bounded tolerance for Tier 3. allclose with default tolerances hides real bugs.
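
One way to encode that convention is a single assertion helper that takes the tier as an argument. The helper name, the test names, and the Tier 3 tolerances below are assumptions for illustration, not values from our suite.

```python
import torch


def assert_same(actual, expected, tier: int, rtol: float = 1e-6, atol: float = 1e-7):
    """Tier 1 and 2 demand exact equality; Tier 3 gets an explicit, tight tolerance."""
    if tier in (1, 2):
        assert torch.equal(actual, expected), f"tier {tier} requires exact equality"
    elif tier == 3:
        torch.testing.assert_close(actual, expected, rtol=rtol, atol=atol)
    else:
        raise ValueError(f"unknown tier: {tier}")


def test_tier1_default_path_identity():
    x = torch.arange(8, dtype=torch.float32)
    assert_same(x.clone(), x, tier=1)


def test_tier3_loss_within_tolerance():
    loss_a = torch.tensor(2.3456789)
    loss_b = torch.tensor(2.3456791)
    assert_same(loss_a, loss_b, tier=3)


test_tier1_default_path_identity()
test_tier3_loss_within_tolerance()
```

Passing rtol and atol explicitly keeps the Tier 3 budget visible at the call site instead of inheriting a library default.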

Things That Bit Us

A short gallery of silent non-determinism that slipped through before we added the corresponding test.

An FP8 autocast factory that returned a decorated generator context manager instead of a plain nullcontext. The isinstance check in the group helper matched on the wrong branch, so the "default path" occasionally took the FP8 path, which produced different numerics than the baseline. Caught by pinning "factory unset returns nullcontext exactly".
A Triton fused RoPE kernel that assumed (1, T, 1, hd) cos/sin but silently tolerated (B, T, 1, hd) by reading the wrong stride. It was deterministic within a run, which is why it did not trip determinism tests; it was deterministically wrong. Defense added in the form of a cos.shape[0] == 1 guard plus a fallback to the safe apply_rotary_emb, plus a shape contract test.
zero_grad(set_to_none=True) causing different gradient-creation graphs on rank 0 vs the others, which caused XLA to recompile to a different HLO, which changed reduction order and therefore bit-exact behavior on resumed runs. We now pin grad accumulation to a single torch_xla.compile() call around all microsteps and log the number of compilations per step as a determinism canary.
MoE capacity factor with dynamic shapes: tokens-per-expert varied slightly across steps, which broke Tier 2 determinism. We moved to Expert Choice with a fixed capacity so per-step shapes are constant; the comment chain in the main MoE runtime module around the deterministic padded grouped GEMM documents the fix.
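
The RoPE defense above amounts to a shape contract checked before dispatch. A minimal sketch follows, working on shape tuples only; the function name choose_rope_path and the assumed (B, T, H, hd) query layout are hypothetical.

```python
def choose_rope_path(cos_shape, q_shape):
    """Pick the fused kernel only for the layout it was written for.

    Assumes q is laid out as (B, T, H, hd) and cos/sin as (1 or B, T, 1, hd).
    """
    batch, seq_len, _heads, head_dim = q_shape
    if tuple(cos_shape[1:]) != (seq_len, 1, head_dim):
        raise ValueError(f"cos/sin shape {cos_shape} does not match q shape {q_shape}")
    if cos_shape[0] == 1:
        return "fused"     # broadcast layout the fused kernel assumes
    if cos_shape[0] == batch:
        return "fallback"  # per-sample cos/sin: take the safe reference path
    raise ValueError(f"cos/sin batch dim {cos_shape[0]} is neither 1 nor {batch}")


assert choose_rope_path((1, 128, 1, 64), (4, 128, 8, 64)) == "fused"
assert choose_rope_path((4, 128, 1, 64), (4, 128, 8, 64)) == "fallback"
```

A determinism test would not have caught the original bug, since the wrong answer was reproducible; a contract check like this one fails on the first bad call instead.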

What We Gave Up, Knowingly

Cross-node bit-exactness on bf16 end-to-end training. Reduction orders change with topology; we do not try to force them.
Cross-run bit-exactness on the default FA4 backward path. We use FA4 because it is fast; we take the bf16 rounding differences.
Bit-exact MoE routing under gradient checkpointing. Memory wins.
Bit-exactness of long-horizon eval generation across GPU generations. Our eval uses greedy decoding (temperature 0) on purpose — that makes eval Tier 2 deterministic per device, which is enough to compare checkpoints of the same architecture. It does not make a T4 match an H200 bit for bit; we do not need it to.

The Rule of Thumb

The stance that has held up for us is: bit-exact the default path, numerically deterministic within a run for everything we ship, and explicit about every place we accept less. The test suite exists to make that stance cheap to enforce; the bitwise path exists so that when a loss regression lands, we can reach for a known-good baseline and bisect against it instead of arguing about whether a 1e-4 delta is "real". Everything else — full cross-run bitwise determinism, fully deterministic cuDNN on GPU — we have looked at, priced out, and decided the cost is not worth it for our workload. That is a trade-off, not a virtue.

What we promise vs what we do not

Property	Status	Backed by
same code path, same seed, same hardware family -> same first-N-step loss	promised	public distributed-CUDA sample tests
bitwise weight equality across runs	not promised	numerical drift in BF16 reductions
deterministic cuDNN	off in production	costs more than it pays
MoE token order under EP	deterministic per rank	dispatcher contract
randopt perturbation reproducibility	promised under explicit seed	public randopt sample tests
FA4 backward parity vs reference	guarded	a dedicated FA4 backward parity validation

Frequently asked questions

Why is a plain .pt checkpoint on XLA not enough for a Tier 1 restart?

Because XLA tensors are serialized differently from ordinary CPU and CUDA storages, and exact restart needs the shard and mesh contract as well as the host-visible values. For the exact restart lane we therefore treat SPMDSavePlanner and SPMDLoadPlanner as part of the contract rather than as optional convenience wrappers; the longer version lives in Checkpoint format and resume.

Why is torch.use_deterministic_algorithms(True) not enough to make a CUDA run bit-exact?

Because PyTorch treats that flag as only one part of reproducibility, not the whole contract. It switches known ops to deterministic implementations where available and errors on some others, but PyTorch's own docs say that this alone is not always enough to make an application reproducible. On CUDA, cuDNN benchmarking is a separate choice, so picking the same convolution algorithm across runs still requires its own setting, and some cuBLAS paths also depend on a pinned CUBLAS_WORKSPACE_CONFIG. That is why our exact-replay lane is a full configuration and not a single toggle.

Terms used in this article

mesh

The named device grid that defines which logical axis maps to which TPU or distributed-device axis before sharding annotations make sense.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

CuTe

CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.

eblock

The expert / MoE block family in MegaCpp's A/M/E/R notation.

EP

Expert parallelism partitions MoE experts across GPUs: 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.

RoPE

Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.

scheduler

The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
