MegaCpp Engineering • Applied C++ model systems

Article

Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 13 min read • David Gornshtein

Training

Divergence

Loss Curves

Monitoring

Debugging

Loss Curves and the Divergence Playbook: How We Catch It at Epoch 0

The divergence playbook used on every training start: early-training spikes, NaN bisect, LR warmup shape, data-order suspects, and the monitors that catch it before step 100.

By David Gornshtein

MegaCpp

Focused on applied C++ model engineering

Most training runs do not fail at the model. They fail at step 1, or step 50, or step 24,850, and someone has to figure out why. Repeated bring-up work has produced enough specific failure modes that a fixed playbook is now used against the first hundred steps of every new configuration. This post is that playbook: what spikes look like, how we bisect a NaN to a single rank or kernel, what LR warmup shape we landed on, the data-order suspects we keep checking, and the monitors that catch each of these before they become an overnight loss.

Why this matters

An overnight run that diverges at hour six wastes more than the hour-six dollars. It wastes the next morning's investigation, invites a sloppy "we think it was the LR" postmortem, and quietly raises the bar for the next bug because nobody wants to reopen the same wound. A 100-step smoke that catches the problem before a long run starts is two orders of magnitude cheaper, and the playbook here exists to make that smoke load-bearing.

The other reason: a modern training stack can mix Muon plus AdamW, FP8 islands, MoE routing, Mamba SSM kernels, and intra-document masking. Each subsystem has its own divergence signature, and trying to debug them in aggregate is hopeless. The bisect order in this playbook reflects how often each subsystem has actually been the culprit in practice, not how interesting it is in the abstract.

The shortest companion set is Training speed anatomy on H200, Profiler and performance reports, and Training on H200 eight-GPU machines: the divergence signatures here only make sense if the lane, the receipt, and the operator surface stay explicit before you zoom into optimizer or routing details. In this post a receipt is the compact per-run record that preserves the lane definition, the step window, and the measured health checks in one schema-versioned artifact. The checked-in goodput tracker sample, compile/runtime receipt sample, and GPU profile receipt sample are the shortest local proof surfaces; Muon optimizer on NVIDIA, Precision recipe: FP16, BF16, FP8, NVFP4, and MoE routing we actually shipped come next once the lane itself is clear.

1. What divergence looks like in real run data

Three patterns dominate.

Immediate NaN at step 1 with a finite step 0

Almost always an optimizer or precision issue, not a model issue. The canonical case from a TPU v6e case study: step 0 loss 14.79 on a v6e-32 slice and 11.46 on v6e-16, then every subsequent step NaN. With --no_muon (AdamW only) the loss stayed finite (14.79 -> 14.70 -> 25.9 -> 33.6 -> 39.2). The bisect pointed at Muon's Polar Express Newton-Schulz iteration running in BF16. On CUDA the TensorCores accumulate in FP32 internally; on XLA they do not, and the Polar Express coefficients are large enough (a = 8.15, b = -22.48, c = 15.88 on the first iteration) that five iterations in BF16 compound into a NaN. The fix is one line: detect PJRT_DEVICE=TPU and run the iteration in FP32. CUDA stays BF16. Both TPU pods recovered. The optimizer side of that fix is easier to reason about next to Muon optimizer on NVIDIA, while the TPU runtime boundary questions usually sit next to FSDP2 on XLA TPU and libtpu, PJRT, JAX, and ownership boundaries.
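A minimal sketch of that device check, plus the scalar view of why the first-iteration coefficients are aggressive. The function names are hypothetical, the dtype is returned as a plain string rather than a framework dtype, and the quintic form (a*s + b*s^3 + c*s^5 acting on each singular value) is assumed here as the usual shape of a Newton-Schulz step, not taken from the MegaCpp source.

```python
import os


def newton_schulz_dtype(env=None):
    """Pick the compute dtype name for the Newton-Schulz iteration.

    Assumption: the device is signalled through PJRT_DEVICE, as in the text.
    """
    env = os.environ if env is None else env
    # TPU/XLA path runs the iteration in FP32; every other device stays BF16.
    return "float32" if env.get("PJRT_DEVICE", "").upper() == "TPU" else "bfloat16"


def quintic_step(s, a=8.15, b=-22.48, c=15.88):
    """Effect of one assumed quintic iteration on a single singular value s."""
    return a * s + b * s**3 + c * s**5


if __name__ == "__main__":
    print(newton_schulz_dtype({"PJRT_DEVICE": "TPU"}))
    # Small singular values are amplified by roughly the factor a per step.
    print(quintic_step(0.01))
```

Under these assumptions a small singular value grows by about 8x in the first step, which is the kind of dynamic range that leaves little headroom for low-precision intermediates.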

Early gradient spike that does not NaN

H200 4-step receipts on a representative Hopper recipe consistently show a gnorm 1160 -> 3648 -> 7200 pattern across the first three steps under one topology; an alternate activation_memory_budget=1.0 topology showed 1040 -> 4800 -> 7392; a dsa_query_chunk_size=16 lane showed 1072 -> 3904 -> 6976. Same severe pattern under otherwise different knobs. The spikes do not break training (clip catches them, the EMA settles by step 50), but they are a real signal: the model is taking large early steps under the current init and LR schedule, and any change to LR shape or initialization should be evaluated against this baseline.

Late-training transient

The d24 hybrid 877M run took a transient loss spike around step 24,850 (loss jumped from ~0.8 to ~3.4) and fully recovered by step ~25,700 (back to ~0.85). The failure mode here is not the spike, which Muon + AdamW handled, but that the step 25K checkpoint was saved during it. The compile eval score on that checkpoint dropped from 11.0% (step 20K) to 3.1% (step 25K). The lesson is operational, not algorithmic: do not save checkpoints during a gnorm excursion.
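One way to encode that rule is a gate in front of the checkpoint writer. This is a sketch under stated assumptions: the function name is hypothetical, and reusing the 10x grad-norm and 2x loss thresholds from the monitors as save gates is an illustration, not the documented behaviour.

```python
import math


def safe_to_checkpoint(gnorm, gnorm_ema, loss, loss_avg,
                       max_gnorm_ratio=10.0, max_loss_ratio=2.0):
    """Return False while the run looks like it is inside an excursion."""
    # Never persist a state that already contains non-finite statistics.
    if not all(math.isfinite(v) for v in (gnorm, gnorm_ema, loss, loss_avg)):
        return False
    # Grad norm far above its running average: defer the save.
    if gnorm_ema > 0 and gnorm / gnorm_ema > max_gnorm_ratio:
        return False
    # Loss far above its recent average: defer the save.
    if loss_avg > 0 and loss / loss_avg > max_loss_ratio:
        return False
    return True


if __name__ == "__main__":
    # Numbers shaped like the d24 incident: loss ~3.4 against a ~0.8 average.
    print(safe_to_checkpoint(gnorm=1.4, gnorm_ema=1.3, loss=3.4, loss_avg=0.8))
```

A deferred save can simply be retried on the next step, so the scheduled checkpoint lands just after recovery instead of in the middle of the spike.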

2. The NaN bisect

When step N produces a non-finite loss or gradient, we run a deterministic bisect. The order matters because it cuts the search space the fastest.

1. Disable Muon. If the run goes finite, the issue is in Muon or its interaction with the precision policy. This caught the TPU Polar Express NaN above.
2. Force BF16 across all paths (no FP16, no FP8). If the run goes finite, the issue is in a low-precision island. The most common case is FP8 e4m3 saturating on the first backward of a freshly-initialized model; the optimizer's finite-check-and-skip prevents weight pollution and subsequent steps stabilize as the FP8 amax history fills out. Making the strict check_for_nan_in_grad=True guard opt-in lets a run pass through the FP8 warmup transient without masking the root cause. That transient should stay brief: if non-finites continue after the amax history has had a few steps to populate, the issue is no longer "FP8 is still calibrating" but the precision recipe itself.
3. Bisect by rank. The pre-collective sanitization step applies nan_to_num to the flattened gradient before the reduce-scatter; we temporarily disable it (MEGACPP_SKIP_PRE_REDUCE_NAN_CHECK=1) and log per-rank is_finite(grad).all() before the collective. A single rank producing a NaN points at MoE routing overflow, a Mamba convolution edge case, or an experimental kernel firing on that rank's data.
4. Bisect by block. Disable EBlocks (--no_moe), then RBlocks, then MBlocks, in that order. The order reflects historical likelihood: MoE has been the most common source of late-training spikes; M2RNN is rarely the cause; Mamba kernels are occasional contributors at specific seq_lens.
5. Re-enable with the simplest config that reproduces. The reproducer is the asset; the fix is downstream.
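The flag-driven part of that order can be sketched as a small driver. Only --no_muon and --no_moe come from the text; --force_bf16, --no_rblocks, and --no_mblocks are hypothetical placeholders, and run_smoke stands in for whatever launches a four-step smoke and reports whether it stayed finite. The rank bisect is an observation step rather than a disable step, so it is left out of the stage list.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# (stage name, extra CLI flags), in the order the playbook tries them.
BISECT_STAGES: List[Tuple[str, List[str]]] = [
    ("muon", ["--no_muon"]),
    ("precision", ["--force_bf16"]),   # hypothetical flag name
    ("eblocks", ["--no_moe"]),
    ("rblocks", ["--no_rblocks"]),     # hypothetical flag name
    ("mblocks", ["--no_mblocks"]),     # hypothetical flag name
]


def first_finite_stage(run_smoke: Callable[[Sequence[str]], bool],
                       base_flags: Sequence[str],
                       stages: Sequence[Tuple[str, List[str]]] = BISECT_STAGES,
                       ) -> Optional[str]:
    """Return the first stage whose smoke run stays finite, else None.

    Assumption: each stage is applied on its own on top of the failing base
    config, not cumulatively.
    """
    for name, extra in stages:
        if run_smoke(list(base_flags) + extra):
            return name
    return None


if __name__ == "__main__":
    # Fake launcher: pretend the run only goes finite once MoE is disabled.
    print(first_finite_stage(lambda flags: "--no_moe" in flags, []))
```

The returned stage name is only a pointer to the suspect subsystem; the reproducer still has to be reduced to the simplest config before any fix is attempted.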

The FP8 branch in that bisect is not just "disable it and move on." The better steady-state fix is a short calibration warmup that seeds AMAX history in BF16 or FP32 before the ordinary delayed-scaling lane takes over. Skip-on-overflow is useful as a guardrail; if it is still doing real work after the calibration window, the scaling policy is wrong rather than merely uninitialized.
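To make the calibration idea concrete, here is a toy delayed-scaling history in plain Python. The class and function names are hypothetical and this is not the Transformer Engine API; the only hard fact used is that 448 is the largest finite E4M3 value.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


class AmaxHistory:
    """Rolling window of observed absolute maxima for one tensor."""

    def __init__(self, length=16):
        self.history = deque(maxlen=length)

    def observe(self, amax):
        self.history.append(float(amax))

    def scale(self):
        # Delayed scaling: the scale comes from past amax values only.
        # With an empty history there is no baseline, so the scale is 1.0
        # and any tensor with amax above 448 would saturate.
        if not self.history:
            return 1.0
        peak = max(self.history)
        return E4M3_MAX / peak if peak > 0 else 1.0


def calibrate(history, amax_stream, warmup_steps):
    """Seed the history from high-precision steps before FP8 takes over."""
    for _, amax in zip(range(warmup_steps), amax_stream):
        history.observe(amax)
    return history.scale()


if __name__ == "__main__":
    hist = AmaxHistory(length=4)
    print(calibrate(hist, [2.0, 8.0, 4.0, 1.0, 100.0], warmup_steps=3))
```

In this sketch the first FP8 step already sees a scale derived from real activations, so an overflow after the window points at the scaling policy rather than at an empty history.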

The same distinction matters on restart. An FP8 lane that restores weights and optimizer slots but comes back with cold scale history will behave like a fresh calibration window rather than a continuation, which is why an exact replay on that path means "same batch family, same optimizer state, same FP8 side-state" instead of only "same weights." The adjacent grounding is Checkpoint format and resume and Transformer Engine on H200 and Blackwell-class GPUs.

The bisect typically finishes in under an hour because every step is a four-step smoke (--save_every=0 --core_metric_every=0 --sample_every=0 --eval_every=0 --warmup_ratio=0.0), the same shape we use for H200 lane receipts and the same reporting discipline described in Profiler and performance reports: keep the schema version, effective lane, step window, bounded health checks, and any heavy trace links together instead of reconstructing them later from screenshots or chat.
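A small helper can turn any candidate config into that smoke shape. The five override flags are the ones quoted above; the --num_iterations flag name and the --depth example argument are assumptions made for illustration.

```python
SMOKE_FLAGS = [
    "--save_every=0",
    "--core_metric_every=0",
    "--sample_every=0",
    "--eval_every=0",
    "--warmup_ratio=0.0",
]


def smoke_args(base_args, num_iterations=4):
    """Return base_args rewritten into the four-step smoke shape."""
    # --num_iterations is an assumed flag name for the step count.
    overrides = SMOKE_FLAGS + [f"--num_iterations={num_iterations}"]
    override_keys = {flag.split("=", 1)[0] for flag in overrides}
    # Drop any base argument that the smoke shape overrides.
    kept = [arg for arg in base_args if arg.split("=", 1)[0] not in override_keys]
    return kept + overrides


if __name__ == "__main__":
    print(smoke_args(["--depth=24", "--warmup_ratio=0.03"]))
```

Keeping the overrides in one list means every bisect stage runs the same smoke shape, which is what makes the receipts comparable across stages.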

3. LR warmup shape

We tried several warmup shapes; the production default is linear warmup followed by Megatron-style full-range cosine decay to a final_lr_frac floor. The implementation in the main training entrypoint supports linear, cosine, WSD, inverse-square-root, and a few exotic options (exponential, minus_sqrt); we ship cosine because it is what our reference Nemotron-class stack uses and because it produced the cleanest curves on our hybrid configs. The performance side of that tradeoff is covered in Throughput vs quality knobs, but the acceptance test here is simpler: the early curve should stay boring.

The warmup ratio matters more than the shape. We default to warmup_ratio=0.03 (so 3% of num_iterations is warmup); the formula is warmup_iters = round(warmup_ratio * num_iterations), the multiplier during warmup is (it + 1) / max(1, warmup_iters), and after warmup the multiplier is the cosine decay to final_lr_frac.
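The schedule described above fits in a few lines. The warmup branch follows the stated formula exactly; the cosine branch is a sketch that assumes decay progress is measured over the post-warmup iterations, and the final_lr_frac default of 0.1 is an illustrative value, not the production setting.

```python
import math


def lr_multiplier(it, num_iterations, warmup_ratio=0.03, final_lr_frac=0.1):
    """LR multiplier for step `it`: linear warmup, then cosine decay."""
    warmup_iters = round(warmup_ratio * num_iterations)
    if it < warmup_iters:
        # Linear ramp; reaches 1.0 on the last warmup step.
        return (it + 1) / max(1, warmup_iters)
    # Assumption: progress runs from 0 to 1 across the remaining iterations.
    decay_iters = max(1, num_iterations - warmup_iters)
    progress = min(1.0, (it - warmup_iters) / decay_iters)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr_frac + (1.0 - final_lr_frac) * cosine


if __name__ == "__main__":
    for step in (0, 449, 899, 900, 15000, 30000):
        print(step, round(lr_multiplier(step, 30000), 4))
```

With warmup_ratio=0.0, as in the smoke shape, warmup_iters is zero and the multiplier starts at 1.0 on step 0.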

Shape	When we use it	Notes
linear-warmup + cosine	default training	cleanest curves on hybrids
linear-warmup + WSD	long runs with planned mid-rate plateau	works, slightly noisier final
linear only	smoke and 4-step receipts	warmup off, deterministic
inverse-sqrt	parked	does not beat cosine on our data

Initialization changes are warmup changes

An init change that is meant to be cosmetic — for example, zeroing the outgoing weights of "noop" experts — can act like a discontinuous schedule. In a 24-layer transformer with residual connections, zeroing outgoing weights creates a sudden void in the residual stream and spikes the loss by 0.01-0.05 with ~200 steps of recovery. We weaken to std * 0.1 instead, which preserves signal flow. Anything that discontinuously changes the residual stream is a divergence risk, and warmup will not save you from it.

4. Data-order suspects

Every "the loss spiked at step X" investigation eventually checks the data shard at step X. We log the current shard name every 100 steps specifically for this. The surrounding context usually lives in Data pipeline story and Doc masking and curriculum, so those two posts are part of the standard readback when the spike looks data-shaped. The suspects we check, in order:

A single bad shard. If the spike correlates with a particular parquet file across reruns, the file is the answer. Our pipeline shuffles within shards (random.Random(42).shuffle(texts) at ingest), so a bad document is bounded to one shard rather than smeared across many.
A shard-boundary effect. Our packed-doc pipeline can produce an unusually long contiguous run of similar-domain text at a shard boundary, behaving like a tiny domain shift. Shows up as a correlated gnorm and loss bump that resolves in <100 steps.
A curriculum transition. Some training corpora have explicit curriculum stages; transitions are obvious in the curve and are expected. We mark them in the run log so they are not mistaken for instability.
Token-distribution drift. We periodically compare the rolling token-frequency histogram to the pretraining baseline. A KL spike here usually points at a tokenizer edge case rather than a data issue.
Determinism check. Same seed, same shard order, same loss within numerical tolerance for the first ~20 steps. If determinism breaks, the issue is in the loader or the FSDP2 sharding, not the data.
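The determinism check in the last item can be sketched as a comparison of two loss traces from identically seeded runs. The function name and the relative tolerance are assumptions; the 20-step window comes from the text.

```python
import math


def first_divergent_step(losses_a, losses_b, steps=20, rel_tol=1e-5):
    """Return the first step where two same-seed runs disagree, else None."""
    for step, (a, b) in enumerate(zip(losses_a[:steps], losses_b[:steps])):
        # A non-finite loss in either run counts as a disagreement.
        if not (math.isfinite(a) and math.isfinite(b)):
            return step
        if not math.isclose(a, b, rel_tol=rel_tol):
            return step
    return None


if __name__ == "__main__":
    run_a = [10.0, 9.5, 9.1, 8.8]
    run_b = [10.0, 9.5, 9.3, 8.9]
    print(first_divergent_step(run_a, run_b))
```

The step index it returns is the place to start reading loader and sharding logs, since the data itself was identical by construction.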

The shuffle-by-rank pattern matters too. We seed per-rank with (global_seed, rank, epoch) so ranks see different data within a microbatch boundary; this is the same convention as Megatron's data sampler. Getting this wrong is silent: the model trains, but it trains on the same tokens N times per step.
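A sketch of the seeding convention, using only the standard library. Deriving the seed by hashing the (global_seed, rank, epoch) tuple and giving each rank a disjoint stride of the index space are both assumptions made for illustration; they are not a description of the MegaCpp or Megatron sampler internals.

```python
import hashlib
import random


def rank_seed(global_seed, rank, epoch):
    """Derive a stable 64-bit seed from (global_seed, rank, epoch)."""
    key = f"{global_seed}:{rank}:{epoch}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def rank_order(num_samples, world_size, global_seed, rank, epoch):
    """Shuffled sample indices for one rank in one epoch."""
    # Assumption: each rank owns a disjoint stride of the index space.
    order = list(range(rank, num_samples, world_size))
    random.Random(rank_seed(global_seed, rank, epoch)).shuffle(order)
    return order


if __name__ == "__main__":
    for rank in range(2):
        print(rank, rank_order(16, 2, global_seed=42, rank=rank, epoch=0))
```

The silent failure mode in the text corresponds to dropping rank from both the seed and the stride: every rank would then produce the same order and the step would see the same tokens world_size times.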

There is a second seam here that looks like "data quality" until you reread the loader contract: packing policy. A flattened or padding-free collator can soften shard boundaries by interleaving documents across the seam, while a more literal shard-to-batch handoff can turn the same corpus transition into a sharper short-lived domain shock. That is why a shard-boundary spike gets reread beside Packed rows as the real training contract and Converting Parquet token shards into Megatron indexed datasets: sometimes the issue is not the shard alone, but how the loader exposed that shard boundary to the optimizer.

5. The monitors that catch it at epoch 0

We run four monitors on every step. They are cheap; we keep them on in production.

Grad-norm EMA detector

Lives in the main training entrypoint. Maintains _gnorm_ema with alpha = 0.01, compares each step's grad_norm against the OLD EMA before updating (so the spike does not inflate the EMA and clip the ratio at 1/alpha = 100), and prints [GRAD SPIKE WARNING] at >10x EMA and [GRAD SPIKE CRITICAL] at >100x. The ratio-cap detail matters: an earlier version updated the EMA first and then compared, which made CRITICAL effectively unreachable.
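The compare-then-update ordering can be sketched as follows. The class name is hypothetical, and seeding the EMA with the first observation is an assumption; alpha and the 10x / 100x thresholds come from the text. The helper at the end shows why the update-first ordering caps the ratio just below 1/alpha.

```python
class GradNormEMA:
    """Grad-norm spike detector that compares against the old EMA."""

    def __init__(self, alpha=0.01, warn=10.0, critical=100.0):
        self.alpha = alpha
        self.warn = warn
        self.critical = critical
        self.ema = None

    def update(self, grad_norm):
        level = None
        if self.ema is not None and self.ema > 0:
            ratio = grad_norm / self.ema  # OLD EMA, before this step's update
            if ratio > self.critical:
                level = "CRITICAL"
            elif ratio > self.warn:
                level = "WARNING"
        # Update only after the comparison (first observation seeds the EMA).
        if self.ema is None:
            self.ema = grad_norm
        else:
            self.ema = (1 - self.alpha) * self.ema + self.alpha * grad_norm
        return level


def update_first_ratio(ema, grad_norm, alpha=0.01):
    """Ratio seen by the discarded ordering: update the EMA, then compare."""
    new_ema = (1 - alpha) * ema + alpha * grad_norm
    return grad_norm / new_ema


if __name__ == "__main__":
    det = GradNormEMA()
    for g in (1.0, 1.0, 150.0, 30.0):
        print(g, det.update(g))
    # Even a million-fold spike stays under 100x with the old ordering.
    print(update_first_ratio(1.0, 1e6))
```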

Non-finite optimizer guard

Detects non-finite gradients before sanitization. With --skip_nan_steps, the optimizer step is skipped entirely (Megatron pattern), _step_skipped = True is set on the wrapper, and the LR scheduler does not advance. We log the skip; a steady drip of skipped steps is a sign that the precision policy is wrong, not that the data is bad.
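The skip behaviour can be illustrated with a framework-free wrapper. GuardedStepper and its callback arguments are hypothetical, and gradients are modelled as a flat list of floats rather than tensors; only the _step_skipped attribute name and the "skip optimizer, do not advance scheduler" contract come from the text.

```python
import math


class GuardedStepper:
    """Skip the optimizer and scheduler when any gradient is non-finite."""

    def __init__(self, optimizer_step, scheduler_step):
        self.optimizer_step = optimizer_step
        self.scheduler_step = scheduler_step
        self._step_skipped = False
        self.skipped = 0  # running count, so a steady drip is visible

    def step(self, grads):
        # Check before any sanitization has a chance to hide the problem.
        finite = all(math.isfinite(g) for g in grads)
        self._step_skipped = not finite
        if not finite:
            self.skipped += 1
            return False
        self.optimizer_step()
        self.scheduler_step()  # LR schedule advances only on real steps
        return True


if __name__ == "__main__":
    stepper = GuardedStepper(lambda: print("optimizer step"),
                             lambda: print("scheduler step"))
    stepper.step([0.1, -0.2])
    stepper.step([0.1, float("nan")])
    print("skipped:", stepper.skipped)
```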

Loss-curve 2x rule

Rolling 2x-of-average rule from the design doc: if the smoothed loss exceeds 2x the recent average, save a checkpoint and pause for investigation. The d24 hybrid spike at step 24,850 should have triggered this; it did not, because the rule pre-dated the EMA implementation and was not yet wired into the training loop. We added it to the loop after that incident.
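A minimal sketch of the rule. The 2x factor is from the text; the class name, the 100-step window, the smoothing constant, and the choice to keep tripped steps out of the baseline are all assumptions.

```python
from collections import deque


class LossSpikeRule:
    """Trip when the smoothed loss exceeds factor x the recent average."""

    def __init__(self, window=100, factor=2.0, smooth=0.3):
        self.recent = deque(maxlen=window)
        self.factor = factor
        self.smooth = smooth
        self.smoothed = None

    def update(self, loss):
        if self.smoothed is None:
            self.smoothed = loss
        else:
            self.smoothed = (1 - self.smooth) * self.smoothed + self.smooth * loss
        # Baseline is the average of earlier steps only.
        baseline = sum(self.recent) / len(self.recent) if self.recent else None
        tripped = baseline is not None and self.smoothed > self.factor * baseline
        if not tripped:
            self.recent.append(loss)  # tripped steps do not raise the baseline
        return tripped


if __name__ == "__main__":
    rule = LossSpikeRule()
    for _ in range(50):
        rule.update(0.8)
    # Losses shaped like the d24 incident: ~0.8 jumping to ~3.4.
    print([rule.update(3.4) for _ in range(3)])
```

Because the comparison uses a smoothed loss, a single noisy step does not trip the rule, while a sustained jump of the d24 size does within a couple of steps under these settings.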

Throughput leading indicator

A sudden drop in tok/sec at step N often precedes a loss event by a few hundred steps; the cause is usually graph recompilation or a comm-pattern change, but occasionally it is the optimizer entering a region where per-step work changes. We use it as a leading indicator, not a diagnostic.

The reporting line is one we look at on every step, and it is the same compact receipt surface we try to preserve in Profiler and performance reports. If throughput is part of the incident, Training speed anatomy on H200 and the checked-in goodput tracker sample are the direct companions: goodput is the fraction of wall time spent doing useful step work, while badput is the wall time lost to compilation, checkpointing, eval, data loading, or idle gaps.

The research lane did point at richer anomaly detectors than this EMA rule, including Z-score and per-tensor-history variants. We have not made those the default because operator clarity matters too. A detector that is marginally better at spike scoring but much harder to explain during bring-up is not an obvious win. The production default stays simple on purpose: preserve the one-line receipt, stop the run early, and escalate to a heavier detector only if the same lane keeps producing ambiguous spikes.

step 00050/30000 (0.17%) | loss: 4.2371 | lrm: 0.55 | dt: 412.10ms |
  tok/sec: 254,128 | mfu: 17.8 | gnorm: 1.4231 | total time: 0.34m

lrm is the LR multiplier from the schedule, dt is the step time, gnorm is the post-clip grad norm. Reading this line, the warmup ramp, the early-step compile tax, and the gnorm settling are all visible in the first thirty steps. A new config that diverges in the first hundred steps shows it here, in a single line per step, before any eval ever runs.
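Since the line format is stable, it is easy to parse into a record for the monitors or for a receipt. The regular expression below is written against the sample line shown above and is an assumption about the format, not a published schema; the function name is hypothetical.

```python
import re

STEP_RE = re.compile(
    r"step\s+(?P<step>\d+)/(?P<total>\d+).*?"
    r"loss:\s*(?P<loss>[-+.\w]+).*?"
    r"lrm:\s*(?P<lrm>[\d.]+).*?"
    r"dt:\s*(?P<dt_ms>[\d.]+)ms.*?"
    r"tok/sec:\s*(?P<tok_per_sec>[\d,]+).*?"
    r"mfu:\s*(?P<mfu>[\d.]+).*?"
    r"gnorm:\s*(?P<gnorm>[-+.\w]+)",
    re.DOTALL,  # the sample wraps onto a second line
)


def parse_step_line(text):
    """Parse one (possibly wrapped) step line into a dict, or return None."""
    match = STEP_RE.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "step": int(fields["step"]),
        "total": int(fields["total"]),
        "loss": float(fields["loss"]),  # float() also accepts "nan" and "inf"
        "lrm": float(fields["lrm"]),
        "dt_ms": float(fields["dt_ms"]),
        "tok_per_sec": int(fields["tok_per_sec"].replace(",", "")),
        "mfu": float(fields["mfu"]),
        "gnorm": float(fields["gnorm"]),
    }


if __name__ == "__main__":
    sample = (
        "step 00050/30000 (0.17%) | loss: 4.2371 | lrm: 0.55 | dt: 412.10ms |\n"
        "  tok/sec: 254,128 | mfu: 17.8 | gnorm: 1.4231 | total time: 0.34m"
    )
    print(parse_step_line(sample))
```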

6. What we throw out at step 100

If the playbook says diverged, we kill the run. We keep one log artifact (the last 200 steps of training output, the optimizer state diagnostic, the per-rank is_finite snapshot), tag the directory, and start the next candidate. The cost of a failed 100-step smoke is small; the cost of an overnight run that diverges at hour six is large; the cost of a checkpoint saved during a spike is larger still.

We also keep enough restart metadata to decide whether a recovery is real. On plain single-rank smokes that mostly means seed and lane definition; on sharded FSDP2-style lanes it also means sampler state and dataset position. If a resumed run sees a different batch family, the replay is no longer a clean continuation and the postmortem has to say so explicitly.

On FP8 lanes the restart contract is wider than weights plus optimizer slots. The scale side-state has to come back too, or the resumed curve is no longer a clean continuation even when the numerical state looks close at load time.

What we kept and what we threw away

Kept: the bisect order (Muon -> precision -> rank -> block), the linear-warmup-plus-cosine default with warmup_ratio=0.03, the four step-level monitors, per-100-step shard logging, the four-step smoke shape, and the "do not save during a gnorm excursion" rule. The single-line training log with loss / lrm / dt / tok/sec / mfu / gnorm stays as the universal diagnostic surface.

Threw away: zero-init for noop experts (replaced with std * 0.1), the inverse-sqrt and exponential LR shapes (do not beat cosine on our data), strict-NaN-grad as a default (made it opt-in to survive the FP8 warmup transient), and the update-before-compare ordering in the grad-norm EMA detector (made CRITICAL unreachable). The 5-shape LR library is also slated to collapse to two shapes (linear-warmup-cosine and linear-warmup-WSD) at the next refactor.

What still hurts

Three honest gaps. The grad-norm EMA monitor is a heuristic; a real anomaly detector with a per-shard prior would catch the d24-style mid-training spike earlier. The data-order bisect is manual: we have shard logs and the seed manifest, but reproducing a single problematic batch from a multi-rank run is still a half-day exercise, and a "replay this exact step" tool is overdue. And the LR schedule library is more optionality than we use.

Frequently asked questions

When do we kill a run instead of waiting for recovery?

If the first 100-step smoke produces a non-finite loss, a non-finite gradient, or a sustained early-step spike that keeps tripping the monitors, we kill it and bisect immediately. The one thing we do not do is let a suspicious run continue just because clip or EMA might hide the problem for another hour.

Is divergence usually an optimizer bug or a data bug?

The early failures are usually optimizer or precision bugs; the later weird bumps are more often routing, masking, or data-order issues. That is why the bisect order starts with Muon and precision before it goes hunting through shards and curriculum boundaries.

What is the minimum useful receipt for a divergence report?

Keep the four-step smoke shape, the one-line step log, the effective lane from compile/runtime receipt sample, the per-rank finite snapshot if a rank goes bad, and the optimizer-state diagnostic. If the run also looked slow before it went unstable, add the goodput tracker sample categories so compile or checkpoint badput does not get mistaken for model instability. That is enough to compare lanes without dragging a full overnight artifact set into every incident.

What do we need for an exact restart after divergence?

The lane definition, seed state, sampler or dataloader position, and enough checkpoint metadata to prove the resumed run is seeing the same batch family rather than a merely similar one. On sharded distributed lanes that distinction matters, because a resume with drifted data order can create a new failure surface that looks like optimizer instability. On FP8 lanes, exact also means restoring the scale side-state rather than only the weights; Checkpoint format and resume and Determinism and bit-exact runs are the direct companion posts when the replay question stops being approximate.

Can a bad restart look like a collective hang instead of a loss spike?

Yes, on sharded distributed lanes. If ranks resume with drifted sampler state or dataset position, they can stop agreeing on which sequence fragments belong in the same step, and the first visible symptom is often a stalled collective rather than an informative loss curve. That is why restart receipts stay tied to Context parallel and sequence parallel, NCCL and collective hangs, and Determinism and bit-exact runs whenever a "resume bug" stops looking purely numerical.

Why do we still ship the simple EMA monitor instead of a more adaptive detector?

Because operator clarity is part of the playbook too. A more adaptive detector may catch some spikes earlier, but it also adds another moving threshold to explain during bring-up. The current EMA-plus-receipt rule stays cheap enough to run on every step, stable enough to compare across lanes, and simple enough that a human can reconstruct why the run was stopped. Any future replacement has to beat that clarity as well as the detection rate.

What should step 0 look like on an FP8 lane?

Like calibration, not like ordinary steady-state optimization. The first few steps should populate AMAX history in higher precision so delayed scaling has a real baseline; otherwise the run teaches you more about uninitialized FP8 state than about the model or the recipe.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Context parallelism

Context parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8, each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length, cheap on NVLink and expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory: that is TP or FSDP's job.

Sequence parallelism

Sequence parallelism is a TP-region activation saver, not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP, because the single all-reduce becomes an all-gather + reduce-scatter pair. Weights are identical to plain TP; only the activation tensors shrink. Turn it on whenever TP is on: the memory savings are near-free, which is what makes long contexts fit under TP.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.

libtpu

The TPU backend library that pairs with PJRT/XLA and owns device-side execution underneath the frontend.

PJRT

The TPU runtime interface between frontend code and the backend executor; it is the ownership seam between JAX/Torch-XLA frontends and libtpu.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.

scheduler

The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.

Debugging

A grounded guide for debugging Modal failures in MegaCpp: cold starts, multi-GPU hangs, image drift, detached collector issues, and volume or…

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

JAX

A separate frontend above PJRT/libtpu. In these TPU posts it mainly matters as the owner of NamedSharding, PartitionSpec, and the optional call_jax or Pallas-adjacent bridge lanes.

NVFP4

NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
