MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
By David Gornshtein, 10 min read
Determinism
Reproducibility
Training Infra
Testing

Determinism and bit-exact runs: what we guard and where we accept drift

A grounded account of GPU and TPU determinism on our stack: the fast path we run in production, the bitwise path we keep for regression testing, and the tests that fire when silent nondeterminism creeps in.

Full numerical determinism on a modern accelerator is a consultant's promise, not an engineering target. We do not claim it. What we do claim, and what we have tests for, is a narrower and more useful property: for any code path we consider stable, turning a new feature flag off gets us byte-identical outputs to the version of the code before that feature existed. We call that "bit-exact default path" and it is the contract that lets us ship changes at the pace we do without losing the ability to bisect a training loss regression.

This post is about where we land on determinism in practice: what is bitwise, what is "deterministic up to reduction order", what we have just accepted as non-deterministic, and which tests catch the drift.

Three Tiers of "The Same"

We separate three notions that people regularly conflate.

  1. Bit-exact — the two runs produce the exact same tensors, byte for byte.
  2. Numerically deterministic — the same run, run again, with the same inputs and the same seed, produces the same output. Two different runs are not promised to match.
  3. Statistically equivalent — loss and eval metrics land within a small tolerance; specific tensors do not match.

Tier 1 is what we guard for the default code path and for anything downstream of a checkpoint torch.load. Tier 2 is what we guard for kernel-level changes that are "the same algorithm, different implementation". Tier 3 is what we settle for on GPU training end-to-end, because nothing else survives contact with cuDNN, NCCL all-reduce reduction order, and Flash Attention backward.

The Fast Path Versus the Bitwise Path

The simplest statement of our stance appears in the shared runtime bootstrap:

torch.manual_seed(42)
if device_type == "cuda":
    torch.cuda.manual_seed(42)
# skipping full reproducibility for now, possibly investigate slowdown later
# torch.use_deterministic_algorithms(True)

That commented-out line is the honest summary. We seed everything that has a global RNG, we do not call torch.use_deterministic_algorithms(True) for production training runs, and we do not force CUBLAS_WORKSPACE_CONFIG. We tried. The slowdown on the FA3 / FA4 backward path was not tolerable at our scale, and the runs that require bitwise behavior are small, targeted, and already run in a separate configuration.

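The production fast path therefore seeds the global RNGs and leaves torch.use_deterministic_algorithms and CUBLAS_WORKSPACE_CONFIG alone.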

The bitwise path is a different configuration we keep paved for regression tests and for the cases where drift would hide a real bug. It is not a mode a user flips at runtime; it is a specific set of feature flags set to their pre-feature values plus a few environment knobs. In practice this means a known baseline configuration whose outputs can be compared byte-for-byte when a new feature needs isolation.
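As a rough illustration of what the environment-knob half of such a baseline can look like in stock PyTorch, the sketch below groups the usual settings in one function. The helper name enable_bitwise_path and the ":4096:8" workspace value are assumptions made for this example, not the MegaCpp bootstrap; the PyTorch calls themselves are standard.

```python
import os
import torch

def enable_bitwise_path(seed: int = 42) -> None:
    """Hypothetical helper: pin the knobs a bit-exact regression run relies on."""
    # cuBLAS needs a fixed workspace size to be reproducible. This has to be
    # set before the first cuBLAS call in the process.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Raise an error on ops that have no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    # Stop cuDNN from benchmarking and choosing a different algorithm per run.
    torch.backends.cudnn.benchmark = False
    # Keep FP32 matmuls at full precision (no TF32 path).
    torch.set_float32_matmul_precision("highest")
```

Calling this once at process start, before any model code runs, keeps the configuration in one reviewable place instead of scattered across scripts.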

The Default-Path Invariant

The rule we enforce on every merge: with all new flags off, the training forward pass is byte-identical to the pre-change baseline.

Concretely, for the Nemotron Nano 3 iteration:

  • reset_position_ids_at_doc_boundary=False degrades the new position-ID reset logic to a no-op, so the tensor passed into attention is the same tensor the old code passed in.
  • The per-block FP8 autocast context factory defaults to None. When it is None, _fp8_factory(i) degrades to contextlib.nullcontext(), which is a zero-overhead pass-through. The forward loop then compiles to the exact same bytecode as the pre-FP8-scope version.
  • The MoE TE permute fast path is gated on MEGACPP_MOE_TE_PERMUTE=1. Unset, the code flow enters the index_select branch it always used, and the argsort-inverse unpermute path on the combine side is bit-exact to the pre-change combine.
  • grad_reduce_in_fp32 is opt-in. With the flag off, the reducer keeps its bf16 flat grad buffers and its bf16 reduce_scatter_tensor.
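The "unset factory falls through to a no-op" shape from the FP8 bullet can be sketched in a few lines of standard-library Python. The names block_context and run_blocks are hypothetical stand-ins for illustration, not the actual helpers in our runtime.

```python
import contextlib
from contextlib import AbstractContextManager
from typing import Callable, Optional

# Hypothetical per-block context factory type; None means "feature off".
CtxFactory = Optional[Callable[[int], AbstractContextManager]]

def block_context(factory: CtxFactory, block_idx: int) -> AbstractContextManager:
    """Return the factory's context for this block, or a no-op when unset."""
    if factory is None:
        return contextlib.nullcontext()
    return factory(block_idx)

def run_blocks(blocks, x, factory: CtxFactory = None):
    # With factory=None each block runs as if the `with` statement were absent.
    for i, block in enumerate(blocks):
        with block_context(factory, i):
            x = block(x)
    return x

result = run_blocks([lambda v: v + 1, lambda v: v * 2], 3)  # (3 + 1) * 2 == 8
```

A test can then pin the default directly by asserting that block_context(None, i) is a contextlib.nullcontext instance, which is the kind of check that catches a factory quietly returning some other context manager.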

This pattern is deliberate and it is tested. The validation sweep we run on a default-path PR is:

  1. torch.save a reference forward-pass output tensor from main.
  2. Apply the PR, run the same forward pass with all new flags off.
  3. torch.equal on the two. Not allclose. Equal.
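A minimal sketch of these three steps, assuming a toy torch.nn.Linear in place of the pinned tiny model and an in-memory buffer in place of a reference file on disk. The helper names save_reference and matches_reference are made up for this example.

```python
import io
import torch

def save_reference(output: torch.Tensor) -> bytes:
    """Serialize a reference forward-pass output (stands in for a file on disk)."""
    buf = io.BytesIO()
    torch.save(output.detach().cpu(), buf)
    return buf.getvalue()

def matches_reference(candidate: torch.Tensor, blob: bytes) -> bool:
    """True only if dtype, shape, and every element agree with the reference."""
    reference = torch.load(io.BytesIO(blob))
    candidate = candidate.detach().cpu()
    return candidate.dtype == reference.dtype and torch.equal(candidate, reference)

# Toy stand-in for the pinned tiny model and its fixed input.
torch.manual_seed(0)
model = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)

blob = save_reference(model(x))                     # step 1: reference from "main"
same = matches_reference(model(x), blob)            # steps 2-3: unchanged path
drifted = matches_reference(model(x) + 1e-6, blob)  # a tiny numeric change fails
```

The explicit dtype check is a design choice in this sketch: it makes a silent dtype change fail the comparison regardless of how the element-wise check treats mixed dtypes.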

This is slow to run in full, so we run it against a pinned tiny model (depth=4, n_embd=128, vocab trimmed) on CPU in CI. The model is small enough that the full forward takes milliseconds; the guarantee is the same. Our test_mhc_group_fp8_ctx family of tests is one example: seven tests that pin the "no factory installed means nullcontext, means bit-exact fallthrough" property across the three autograd call-site branches we care about.

What We Can and Cannot Make Deterministic

On GPU, the honest breakdown is roughly:

  • FP32 matmul forward in non-TF32 mode: deterministic within a run, and bit-exact across identical runs if you pin CUBLAS workspace and disable the TF32 path. In production we do neither, because the cost is real.
  • FP32 matmul backward: same as forward in principle, but cuBLAS can pick a different algorithm for the grad input than for the grad weight path, and the two paths can share a workspace. Deterministic within a run.
  • BF16 matmul: deterministic within a run; differences across runs come mostly from reduction order, which is bounded but nonzero.
  • Flash Attention forward: deterministic by construction for the softmax pass. The numerics are the same kernel every time; small block size and num_warps changes (Triton autotune) can change the result because reduction order within a block changes. We pin autotune choices for any test that wants bit-exact behavior.
  • Flash Attention backward: not deterministic in the general case without the explicit "deterministic backward" variant. The CuTe-backed FA4 lane we use in production does not offer a deterministic backward at the block sizes we want. We accept it.
  • cuDNN dSwiGLU atomicAdd backward: we hit this one head-on; it is the classic atomic-accumulate non-determinism. The workaround and the vendor escalation are documented in the associated validation notes.
  • NCCL collectives: tree reductions with ring order fixed at init are deterministic for a given topology. The moment the topology changes (different node counts, different NIC ordering), the order changes and bf16 bits drift.
  • Dropout, any stochastic sampling: deterministic iff seeded, and iff the seed sees the same RNG consumption history. Our data pipeline consumes RNG in a fixed order per rank.
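The reduction-order effect behind several of these bullets comes down to floating-point addition not being associative. The snippet below shows it with plain Python floats (64-bit); bf16 keeps only 8 significant bits, so the same regrouping moves far more visible digits there. The values are chosen only for illustration.

```python
# Floating-point addition is not associative, so the grouping chosen by a
# reduction (ring order, tree shape, block size) can change the last bits.
values = [0.1, 0.2, 0.3]

left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)      # False
print(abs(left_to_right - right_to_left))  # about 1.1e-16: bounded but nonzero
```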

On TPU / XLA:

  • XLA compiled programs are deterministic for a given HLO. Different HLO from a recompile means different numerics; see our graph-recompilation experience for why this matters.
  • SPMD collective ordering is pinned once the mesh is built, so the bf16 reduction-order issue is milder than on NCCL.
  • torch_xla's SPMDSavePlanner / SPMDLoadPlanner are part of the exact-restart contract; our checkpoint round-trip test uses torch.equal and not torch.allclose.
  • Mamba SSM scans have an xla_scan code path for XLA determinism; we use it when we care about bitwise reproduction of a Mamba layer on TPU, and the more general Triton scan in other cases.

Routing-statistics determinism deserves its own paragraph: gradient checkpointing around an MoE block can route tokens to different experts on the recompute pass than on the forward pass, because routing reads auxiliary state. Our code comments this explicitly (GPT block guidance, "EBlock: MoE dispatch is deterministic with static shapes") and keeps recompute off the MoE dispatch by default; turning it on is a conscious decision to accept non-deterministic routing for the sake of memory.

Tests That Guard Against Silent Non-Determinism

The interesting tests are not the "this function is deterministic" unit tests. The interesting tests are the ones that pin an invariant we otherwise would have lost.

  • test_perturbation_determinism and test_perturbation_restore_exact in the RandOpt suite: the first asserts that the same (seed, sigma) pair produces the same perturbation tensor; the second asserts that perturb then restore yields bitwise-identical weights to the original, including for frozen LoRA params. These two tests gate every RandOpt change.
  • test_mhc_group_fp8_ctx: seven tests that prove each branch of the per-block FP8 context path is nullcontext-equivalent when the factory is unset. This is the specific shape of "default path bit-exact" enforcement we described above.
  • test_checkpoint_manager::test_resume_weights_exact_match: uses torch.equal, not allclose, on resume. Anything that quietly changes .pt serialization numerics fails here.
  • test_doc_relative_sinks, test_exact_token_dsa_packed_doc_isolation, and the Mamba document-boundary isolation tests: all of them compare a packed-doc batch to the equivalent single-doc batch and require exact equality on the payload rows. Any cross-document numerical leak shows up immediately.
  • fail_closed_decode invariants on H200 decode runs: the validation sweep explicitly verifies determinism under decode (same inputs to the same model produce the same tokens), KV cache consistency, and finite logits. This is Tier 2 determinism and we run it on every serving candidate.
  • FA3 backward parity tests, with an explicit disclaimer in the report: identical seed plus deterministic data ordering are the preconditions, and "numerical bitwise identity" is flagged as NOT expected for bf16 kernels because their rounding differs. This is exactly the Tier 3 line we live on for attention backward.

The pattern we try to enforce on every new determinism-adjacent test is: pick the tier you are asserting, put it in the name of the test, and use torch.equal for Tier 1 / 2 and a tight bounded tolerance for Tier 3. allclose with default tolerances hides real bugs.
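One way to make that convention mechanical is a pair of small assertion helpers, sketched below. The helper and test names are illustrative assumptions, not names from our suite; torch.equal and torch.testing.assert_close are the stock PyTorch calls.

```python
import torch

def assert_tier1_bit_exact(a: torch.Tensor, b: torch.Tensor) -> None:
    """Tier 1 / 2: same dtype, same shape, same values, no tolerance."""
    assert a.dtype == b.dtype, f"dtype mismatch: {a.dtype} vs {b.dtype}"
    assert torch.equal(a, b), "tensors differ"

def assert_tier3_close(a: torch.Tensor, b: torch.Tensor, *, rtol: float, atol: float) -> None:
    """Tier 3: the caller must spell out the tolerance; no silent defaults."""
    torch.testing.assert_close(a, b, rtol=rtol, atol=atol)

def test_tier1_clone_is_bit_exact():
    x = torch.arange(4, dtype=torch.float32)
    assert_tier1_bit_exact(x, x.clone())

def test_tier3_small_drift_within_tolerance():
    x = torch.ones(4)
    assert_tier3_close(x, x + 1e-6, rtol=0.0, atol=1e-5)

# Why default tolerances are risky: this pair is "close" but not equal.
x = torch.ones(4)
y = x + 1e-6
default_allclose = torch.allclose(x, y)  # passes with default rtol/atol
bit_exact = torch.equal(x, y)            # False
```

Making rtol and atol keyword-only forces every Tier 3 test to state its bound in the test body, where a reviewer can see it.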

Things That Bit Us

A short gallery of silent non-determinism that slipped through before we added the corresponding test.

  • FP8 autocast factory that returned a decorated generator context manager instead of a plain nullcontext. The isinstance check in the group helper matched on the wrong branch, so the "default path" occasionally took the FP8 path, which produced different numerics than the baseline. Caught by pinning "factory unset returns nullcontext exactly".
  • Triton fused RoPE kernel that assumed (1, T, 1, hd) cos/sin but silently tolerated (B, T, 1, hd) by reading the wrong stride. It was deterministic within a run, which is why it did not trip determinism tests; it was deterministically wrong. Defense added in the form of a cos.shape[0] == 1 guard plus a fallback to the safe apply_rotary_emb, plus a shape contract test.
  • zero_grad(set_to_none=True) causing different gradient-creation graphs on rank 0 vs the others, which caused XLA to recompile to a different HLO, which changed reduction order and therefore bit-exact behavior on resumed runs. We now pin grad accumulation to a single torch_xla.compile() call around all microsteps and log the number of compilations per step as a determinism canary.
  • MoE capacity factor with dynamic shapes: tokens-per-expert varied slightly across steps, which broke Tier 2 determinism. We moved to Expert Choice with a fixed capacity so per-step shapes are constant; the comment chain in the main MoE runtime module around the deterministic padded grouped GEMM documents the fix.
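The RoPE defense from the list above (a layout guard with a fallback to the safe path) can be sketched as follows. This assumes a rotate-half style reference apply_rotary_emb with cos/sin of shape (1, T, 1, hd); the fused_kernel argument is a hypothetical placeholder for the Triton kernel, and rope_with_guard is an illustrative name.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rotary_emb(x, cos, sin):
    """Reference path: plain broadcasting, correct for any broadcastable cos/sin."""
    return x * cos + rotate_half(x) * sin

def rope_with_guard(x, cos, sin, fused_kernel=None):
    """Use the fused kernel only when cos/sin match its (1, T, 1, hd) contract."""
    layout_ok = cos.shape[0] == 1 and sin.shape == cos.shape
    if fused_kernel is not None and layout_ok:
        return fused_kernel(x, cos, sin)
    # Batched or otherwise unexpected layout: take the safe reference path.
    return apply_rotary_emb(x, cos, sin)
```

A shape contract test then only needs to pass a (B, T, 1, hd) cos tensor and assert that the fused kernel was never called.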

What We Gave Up, Knowingly

The Rule of Thumb

The stance that has held up for us is: bit-exact the default path, numerically deterministic within a run for everything we ship, and explicit about every place we accept less. The test suite exists to make that stance cheap to enforce; the bitwise path exists so that when a loss regression lands, we can reach for a known-good baseline and bisect against it instead of arguing about whether a 1e-4 delta is "real". Everything else — full cross-run bitwise determinism, fully deterministic cuDNN on GPU — we have looked at, priced out, and decided the cost is not worth it for our workload. That is a trade-off, not a virtue.

What we promise vs what we do not

Property | Status | Backed by
--- | --- | ---
same code path, same seed, same hardware family -> same first-N-step loss | promised | public distributed-CUDA sample tests
bitwise weight equality across runs | not promised | numerical drift in BF16 reductions
deterministic cuDNN | off in production | costs more than it pays
MoE token order under EP | deterministic per rank | dispatcher contract
randopt perturbation reproducibility | promised under explicit seed | public randopt sample tests
FA4 backward parity vs reference | guarded | a dedicated FA4 backward parity validation

Frequently asked questions

Why is a plain .pt checkpoint on XLA not enough for a Tier 1 restart?
Because XLA tensors are serialized differently from ordinary CPU and CUDA storages, and exact restart needs the shard and mesh contract as well as the host-visible values. For the exact restart lane we therefore treat SPMDSavePlanner and SPMDLoadPlanner as part of the contract rather than as optional convenience wrappers; the longer version lives in Checkpoint format and resume.
Why is torch.use_deterministic_algorithms(True) not enough to make a CUDA run bit-exact?
Because PyTorch treats that flag as only one part of reproducibility, not the whole contract. It switches known ops to deterministic implementations where available and errors on some others, but PyTorch's own docs say that this alone is not always enough to make an application reproducible. On CUDA, cuDNN benchmarking is a separate choice, so picking the same convolution algorithm across runs still requires its own setting, and some cuBLAS paths also depend on a pinned CUBLAS_WORKSPACE_CONFIG. That is why our exact-replay lane is a full configuration and not a single toggle.

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

mesh

The named device grid that defines which logical axis maps to which TPU or distributed-device axis before sharding annotations make sense.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

CuTe

CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.

eblock

The expert / MoE block family in MegaCpp's A/M/E/R notation.

EP

Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.

RoPE

Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.

scheduler

The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.