SOTA Ablation and Comparison: How MegaCpp Decides What to Keep
The ablation plan, the comparison methodology, and the honest numbers behind the MegaCpp SLM stack — what stacked, what didn't, and what we threw out even though the paper said it would help.

Most "we adopted SOTA" posts read like a shopping list. Ours is the opposite. The default state of any new module — sparse attention variant, dynamic-depth scheme, training-objective auxiliary, fancy positional encoding — is off. It earns its way into the production NAM-class run by clearing a bounded ablation, on the same data, at the same depth, against a baseline we trust. If it doesn't, we drop it, even when the paper looks great and the GitHub stars are flattering.
This post is the methodology behind that decision: how we structure the ablation plan, how we keep comparisons honest across phases, the bugs we caught in our own runs that invalidated entire experiment groups, and the final scorecard of what actually stacks for C++.
The fixed substrate
Every comparison number in this post is on the same substrate. That part is not negotiable.
- Hardware: spot v6e-4 TPU pods (europe-west4-a), six in parallel for one wave.
- Data: cpp_enriched_16k, our compiler-pretokenized C++ corpus, served from the workspace GCS area as parquet shards.
- Model geometry: depth=16, head_dim=64 (so model dim 1024, 16 heads), 4K context, total_batch=131072 tokens, 10K training steps.
- Topology: TP=2, DP=2 over the 4 chips of a v6e-4.
- Eval metric: val_bpb (validation bits per byte) on a held-out C++ split, measured at the 10K-step checkpoint.
If a number in this post is on a different substrate, it is labeled. If it is not labeled, it is on this one. We have spent enough time chasing apparent regressions that turned out to be a head-dim or context-length change that we now refuse to print a number without naming the substrate it came from.
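For concreteness, the substrate can be pinned down as a single config sketch. The field names here are ours, not the actual launcher flags, but the values are the ones above:

```python
# Sketch of the fixed ablation substrate. Field names are illustrative;
# the real flags live in base_train.py and may differ.
SUBSTRATE = {
    "hardware": "v6e-4 (spot, europe-west4-a)",
    "data": "cpp_enriched_16k",
    "depth": 16,
    "head_dim": 64,
    "n_heads": 16,
    "context": 4096,
    "total_batch_tokens": 131_072,
    "train_steps": 10_000,
    "tp": 2,
    "dp": 2,
    "eval_metric": "val_bpb",
}

# Derived geometry: model dim = head_dim * n_heads = 1024,
# and TP * DP = 4 covers the 4 chips of a v6e-4.
MODEL_DIM = SUBSTRATE["head_dim"] * SUBSTRATE["n_heads"]
```

Anything that deviates from this dict gets its own label in the tables below.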
The catalog
The catalog of modules we are willing to consider is split into three tiers, because the cost of evaluating them is not uniform.
Tier 1 is "already wired into base_train.py — flip a flag and run". This
is where Mamba-3 (AAM hybrid pattern), DSA (DeepSeek sparse attention),
Engram (n-gram hash embeddings on attention layers only), mHC (multi-head
collaboration), MTP (multi-token prediction), NCP (next concept prediction),
and MoD (mixture of depths) live.
Tier 2 is SOTA we believed was promising enough to integrate: TOP (Token Order Prediction), SRI (Search/Replace Infilling), IFIM (instruction-aware FIM), GateSkip (residual gating for token-wise layer skip), several MoD variants (modr, a_mod, gamma_mod, p_mod), FlexiDepth, continual backprop, shrink-and-perturb, Jacobi forcing, and YaRN RoPE extension for stages 3-4 context scaling.
Tier 3 is inference-only and explicitly out of scope for the training ablation: ADEPT early-exit, EAGLE-2 speculative decoding, ring attention. They get their own evaluation lane.
Phase 1: structure first
Phase 1 was the architectural foundation question: holding everything else
fixed, which structural change moves val_bpb the most? Six experiments,
all at depth=16, all on cpp_enriched_16k, all 10K steps.
| Exp | Config | Val BPB @ 10K | Delta vs baseline |
|---|---|---|---|
| EXP1 | Baseline (attention only) | ~1.866 | — |
| EXP2 | + Mamba-3 AAM | ~1.80 | -3.5% |
| EXP3 | + DSA (sparse attention) | 1.562 | -16.3% (winner) |
| EXP4 | Engram + mHC + MTP | INVALID | Engram-on-Mamba-layer bug |
| EXP5 | Full stack | INVALID | same bug + NaN loop |
| EXP6 | Full stack + NCP | ~1.7 | NCP marginal |
The headline is that DSA was the largest single-feature improvement we have
seen in any phase: -16.3% val_bpb from the attention-only baseline. From
Phase 2 onwards, DSA is the baseline.
The asterisk on Phase 1 is more important than the headline. EXP4 and EXP5
used --engram_layers=0,5,10. With our AAM Mamba pattern at depth=16,
layers 2, 5, 8, 11, 14 are Mamba layers. Engram is an embedding-side trick
designed for attention layers; applying it on top of a Mamba layer is not a
supported configuration and produces val_bpb north of 3.5 at init. Both
experiments were invalid for evaluating Engram, mHC, and MTP. The correct
layer set on depth=16 AAM is 0,1,3,4,6,7,9,10,12,13,15 — every
attention layer, no Mamba layers — and a model-init guard now refuses
launches that violate it.
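The guard itself is simple enough to sketch. This assumes the AAM hybrid places a Mamba layer at every third index (i % 3 == 2), which matches the layers 2, 5, 8, 11, 14 listed above at depth=16; the function and flag names are ours, not the real ones in base_train.py:

```python
# Sketch of the model-init guard: refuse any launch where
# --engram_layers lands on a Mamba layer. Assumes the A-A-M repeating
# pattern (Mamba at every index where i % 3 == 2).

def mamba_layers(depth: int) -> set[int]:
    """Mamba layer indices under the A-A-M repeating pattern."""
    return {i for i in range(depth) if i % 3 == 2}

def check_engram_layers(engram_layers: list[int], depth: int = 16) -> None:
    """Raise at model init if Engram is requested on any Mamba layer."""
    bad = sorted(set(engram_layers) & mamba_layers(depth))
    if bad:
        raise ValueError(
            f"--engram_layers includes Mamba layers {bad}; "
            "Engram is only supported on attention layers."
        )

# The corrected set for depth=16 AAM: every attention layer.
ATTENTION_LAYERS_D16 = [i for i in range(16) if i % 3 != 2]
```

Under this sketch, the buggy set 0,5,10 fails (layer 5 is a Mamba layer) and the corrected set passes.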
We are publishing the bug because pretending it didn't happen is how a number like "1.578 with Engram + mHC + MTP" survives into a marketing slide a year later. The honest version is: those experiments did not happen on a configuration that means what we thought it meant, and we re-ran the relevant cells in Phase 2.
Phase 2: stacking with corrected layers
With DSA fixed as the base, Phase 2 added one secondary feature at a time, with the corrected Engram layer set:
| Exp | Config | Val BPB | Status |
|---|---|---|---|
| p2_e01 | DSA only | 1.678 @ 3750 | reference |
| p2_e03 | DSA + Engram (TP=2) | 1.600 @ 1250 | best stable |
| p2_e04 | DSA + mHC (TP=2) | 1.577 @ 10K | complete |
| p2_e05 | DSA + MTP | 1.934 @ 2750 | converged early |
Two takeaways. First, both Engram and mHC help on top of DSA. Second, MTP at 4K context on this substrate slightly hurts — counter to the multi-token prediction hype — and we kept it out of the dense base going forward.
The TP=2 constraint on Engram and GateSkip is not a tuning preference; it is
a head-sharding limitation. At TP=4, both features land on the wrong shard
axis and produce val_bpb ~1.984 (vs 1.581 at TP=2 for the same
configuration). We learned this by running it both ways. The right answer
was not "Engram is bad", it was "Engram is TP=2 only on this geometry".
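The TP constraint is worth guarding the same way as the layer set. A minimal sketch, with the feature names and the guard itself being ours rather than actual base_train.py machinery:

```python
# Sketch of a launch-time guard for the head-sharding limitation:
# Engram and GateSkip only shard correctly at TP<=2 on this geometry.
TP2_ONLY_FEATURES = {"engram", "gateskip"}

def check_tp(features: set[str], tp: int) -> None:
    """Refuse launches that enable TP=2-only features at higher TP."""
    bad = sorted(TP2_ONLY_FEATURES & features)
    if bad and tp > 2:
        raise ValueError(
            f"{bad} require TP<=2 on this geometry (got TP={tp}); "
            "at TP=4 they land on the wrong shard axis."
        )
```

check_tp({"engram", "dsa"}, 4) raises; the same feature set at TP=2 launches cleanly.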
Phase 3: context scaling
Phase 3 asked the boring but required question: what is the maximum context
length we can train at on v6e-8 with TP=8? The answer was 128K with
gradient checkpointing — at 565K tok/sec. 256K OOMs even with GC, because
GC does not reduce XLA's pre-allocation. To go beyond 128K on this hardware
we would need sequence parallelism, larger HBM (v6e-16+), or both. That is
a stage-4 problem; for the ablation we capped at 128K and moved on.
MoE: the discontinuity
The MoE ablation is where the numbers stop looking incremental. With the
best dense configuration sitting around val_bpb 1.56–1.58, the MoE
experiments were on a different scale entirely:
| Exp | Config | Val BPB @ best | Notes |
|---|---|---|---|
| moe_e01 | Dense DSA | ~1.992 @ 5250 | dense baseline (this geometry) |
| moe_e02 | DSA + 8r+1s top-2 | — | very slow on shared VM |
| moe_e05 | DSA + Mamba + MoE + Engram (TP=2) | 2.221 @ 3250 | running |
| moe_e06 | DSA + Mamba + 2s+16r + Engram + mHC (TP=2) | 1.206 @ 3750 | best overall |
moe_e06 — DSA, Mamba-3 AAM, 2 shared + 16 routed experts top-2,
Engram on attention layers, mHC, TP=2 — is the strongest configuration
we have run in the ablation series. The gap from the best dense IFIM result
(1.565) to moe_e06 (1.206) is large enough that we are willing to call it
a real architectural shift, not a measurement artifact. The catch is that
moe_e06 was on a contended machine and only reached 3,750 steps cleanly,
so Phase 5 re-runs it on a fresh box for a 10K-step receipt.
Phase 4: training objectives
Phase 4 isolated the question of training objectives on the dense DSA base:
| Exp | Config | TP | Val BPB @ 10K | Notes |
|---|---|---|---|---|
| p2_e00 | DSA baseline | 4 | 1.958 | reference |
| p4_e06 | DSA + IFIM | 4 | 1.565 | winner — 0.393 gain |
| p4_e03 | DSA + GateSkip | 2 | 1.581 | TP=2 required |
| p4_e01 | DSA + TOP | 4 | 1.734 | T_top=2048 aligned |
| p4_e05 | DSA + SRI | 4 | 1.746 | |
| p4_e04b | DSA + Mamba-3 | 4 | 2.324 | hurts when stacked with DSA |
IFIM — instruction-aware fill-in-the-middle — was the clean winner: a
training-time data transformation that prepends docstrings and comments as
instruction prefixes to FIM examples. It costs nothing at inference time
and gave a 20% improvement over the dense DSA baseline. GateSkip was the
second-best, with the TP≤2 caveat noted above. SRI and TOP both helped
over baseline but less. DSA + Mamba-3 stacked badly on this geometry — a
result that contradicts what the AAM hybrid did under MoE in moe_e06,
which is exactly why we run things in groups instead of trusting any one
result.
The asterisk on Phase 4 is the TOP number. The original TOP run was at
~101 seconds per step (a 12-day ETA) because of a Python loop in the
auxiliary loss. The batched matmul fix replaces eight sequential
torch.mm calls with one torch.mm(h_2d, w.T) and brings the step time
down to ~12 seconds. The 1.734 number is from the slow run; we are
re-running TOP under the batched implementation in Phase 5 before letting
it into any final comparison.
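The shape of the fix is easy to show. Here is a sketch with NumPy standing in for torch and illustrative shapes (the real code concatenates the per-head projection weights once and replaces the per-head torch.mm loop with a single matmul):

```python
# Slow path vs batched path for a per-head projection, NumPy stand-in
# for torch. Shapes and the head count of 8 are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_tok, d_model, n_heads, d_out = 64, 32, 8, 16

h_2d = rng.standard_normal((n_tok, d_model))          # flattened hidden states
heads = [rng.standard_normal((d_out, d_model)) for _ in range(n_heads)]

# Slow path: one matmul per head inside a Python loop
# (the shape of the ~101 s/step version).
slow = [h_2d @ w_i.T for w_i in heads]

# Fast path: stack the head weights, do a single matmul, split the result.
w = np.concatenate(heads, axis=0)                     # (n_heads * d_out, d_model)
fast = np.split(h_2d @ w.T, n_heads, axis=1)

assert all(np.allclose(s, f) for s, f in zip(slow, fast))
```

The two paths are numerically identical; the win is purely in launching one large kernel instead of eight small ones per step.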
Phase 5: do they stack?
The Phase 5 question is the only one that matters for the production candidate: which Phase 4 objectives stack additively on top of the MoE base, and is there a single configuration we can commit to for the NAM-64 production run?
| ID | Config | TP | Expected BPB |
|---|---|---|---|
| p5_e01 | MoE base re-validation (moe_e06) | 2 | ~1.20 |
| p5_e02 | MoE + IFIM | 2 | 1.05–1.20 |
| p5_e03 | MoE + GateSkip + IFIM | 2 | 1.00–1.15 |
| p5_e04 | MoE + SRI + IFIM | 2 | 1.05–1.20 |
| p5_e05 | Dense TOP re-validation (batched) | 4 | ~1.72 |
| p5_e06 | MoE + TOP | 2 | 1.05–1.20 |
The hypothesis behind p5_e02 is that IFIM and MoE operate at orthogonal
levels — IFIM rewrites the data, MoE routes the activations — so they
should compose without interference. IFIM may even improve MoE routing by
giving the router a cleaner semantic signal in the prefix. If that holds,
p5_e02 is the production candidate.
The expected-BPB column is honest: it is a range, not a target. We do not pretend to predict the exact stacking gain ahead of the run. The decision rule is in the comparison methodology, not in the prediction.
Comparison methodology
The methodology has three rules that we follow without exception.
First, same substrate or labeled differently. Two numbers from different substrates do not appear in the same table. If they have to be discussed together, the substrate is part of the row.
Second, best-checkpoint reporting, with the step recorded. We report
the best val_bpb and the step it was achieved at, never the final-step
number alone. A model that hits 1.20 at step 3,750 and drifts up to 1.25
by 10K is a different signal than a model that hits 1.20 at 10K — the
first one is a stability problem, the second one is a converged result.
The step matters; the step is in the table.
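Rule two is mechanical enough to write down. A minimal sketch, assuming eval history comes as (step, val_bpb) pairs:

```python
# Sketch of best-checkpoint reporting: return the best val_bpb together
# with the step it occurred at, never the final-step number alone.
def best_checkpoint(history):
    """Return (best_bpb, step) over a list of (step, val_bpb) pairs."""
    step, bpb = min(history, key=lambda p: p[1])
    return bpb, step

# A run that hits 1.20 at 3,750 then drifts to 1.25 by 10K reports
# (1.20, 3750) -- and the drift stays visible because the step does.
history = [(1250, 1.60), (3750, 1.20), (10000, 1.25)]
```

best_checkpoint(history) returns (1.20, 3750) here; a converged run would return the same bpb with step 10000, which is exactly the distinction the rule exists to preserve.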
Third, invalidation is loud. If a configuration was run on a buggy layer set, a corrupted preset, or during a known loss spike, its numbers are removed from the comparison and replaced with the word INVALID and the reason. The d24 hybrid checkpoint at step 25K is the canonical example: a transient loss spike (loss jumped from ~0.8 to ~3.4 around step 24,850, recovered fully by ~25,700) coincided with the periodic save, and the saved weights were degraded. The eval at 25K (3.1% compile rate vs 11.0% at step 20K) was not a model regression; it was a snapshot during recovery. We say so in the report. The next save is the comparable one.
What we did not adopt
The half of the catalog that did not make it into the production candidate
is just as important as the half that did. MTP at 4K context: dropped on
this substrate (slight regression). Dense DSA + Mamba-3 stacking: dropped
(stacks badly without MoE). NCP: marginal gain at the cost of training
complexity, parked. MoD variants beyond the baseline: all parked pending a
head-to-head against GateSkip on the same MoE base; if GateSkip stacks
with MoE in p5_e03, MoD is unlikely to earn a slot. Several Phase 1
results are simply void due to the Engram-on-Mamba bug and were not
rerun, because Phase 2 covered the same hypotheses on a corrected setup.
The honest scorecard
The current architectural shape we plan to commit to looks like:
DSA (sparse attention, start-layer 8) as the base, Mamba-3 AAM hybrid
with qknorm + bias + trapezoidal defaults (complex RoPE opt-in) for
the SSM portion, MoE with 2 shared + 16 routed experts top-2 and
z_loss_weight=0.01, Engram on attention layers only with the explicit
layer set, mHC enabled, IFIM as the training-time objective.
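Written out as a single config sketch (key names are ours; the real flags live in base_train.py and may differ), the committed stack is:

```python
# The surviving stack from the scorecard above, as one hypothetical
# config dict. Key names are illustrative, values are from the post.
PRODUCTION_CANDIDATE = {
    "attention": {"kind": "dsa", "start_layer": 8},
    "ssm": {"kind": "mamba3_aam", "qknorm": True, "bias": True,
            "trapezoidal": True, "complex_rope": False},  # complex RoPE opt-in
    "moe": {"shared_experts": 2, "routed_experts": 16, "top_k": 2,
            "z_loss_weight": 0.01},
    # Engram on attention layers only (A-A-M pattern: Mamba at i % 3 == 2).
    "engram": {"layers": [i for i in range(16) if i % 3 != 2]},
    "mhc": True,
    "objective": "ifim",
    "tp": 2,  # Engram is TP=2-only on this geometry
}
```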
That stack is what survives our ablation methodology — not what survives a single best-of-N run. If the Phase 5 stacking experiments contradict the prediction, we will rebuild the candidate around the actual numbers, not the prior. That is the whole point of running ablations instead of adopting papers.
References
- SOTA_ABLATION_PLAN.md
- SOTA_COMPARISON.md
- PHASE5_ABLATION_PLAN.md
- TRAINING_EVAL_REPORT.md
- eval_doc.md
- checkpoint_eval_report.md
- training_review.md