Tags: ablation, sota, moe, dsa, ifim, evaluation, c++

SOTA Ablation and Comparison: How MegaCpp Decides What to Keep

The ablation plan, the comparison methodology, and the honest numbers behind the MegaCpp SLM stack — what stacked, what didn't, and what we threw out even though the paper said it would help.

10 min read · David Gornshtein

Most "we adopted SOTA" posts read like a shopping list. Ours is the opposite. The default state of any new module — sparse attention variant, dynamic-depth scheme, training-objective auxiliary, fancy positional encoding — is off. It earns its way into the production NAM-class run by clearing a bounded ablation, on the same data, at the same depth, against a baseline we trust. If it doesn't, we drop it, even when the paper looks great and the GitHub stars are flattering.

This post is the methodology behind that decision: how we structure the ablation plan, how we keep comparisons honest across phases, the bugs we caught in our own runs that invalidated entire experiment groups, and the final scorecard of what actually stacks for C++.

The fixed substrate

Every comparison number in this post is on the same substrate. That part is not negotiable.

  • Hardware: spot v6e-4 TPU pods (europe-west4-a), six in parallel for one wave.
  • Data: cpp_enriched_16k, our compiler-pretokenized C++ corpus, served from the workspace GCS area as parquet shards.
  • Model geometry: depth=16, head_dim=64 (so model dim 1024, 16 heads), 4K context, total_batch=131072 tokens, 10K training steps.
  • Topology: TP=2, dp=2 over the 4 chips of a v6e-4.
  • Eval metric: val_bpb (validation bits per byte) on a held-out C++ split, measured at the 10K-step checkpoint.

If a number in this post is on a different substrate, it is labeled. If it is not labeled, it is on this one. We have spent enough time chasing apparent regressions that turned out to be a head-dim change or a context-length change that we now refuse to print a number without naming the substrate it came from.
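
For concreteness, here is the substrate as a single config sketch. The field names are ours for illustration, not base_train.py's actual flags; the values are the ones above.

```python
# Illustrative config for the fixed substrate (field names are hypothetical).
SUBSTRATE = {
    "hardware": "spot v6e-4 TPU pods, europe-west4-a (6 in parallel per wave)",
    "data": "cpp_enriched_16k",      # compiler-pretokenized C++ parquet shards on GCS
    "depth": 16,
    "head_dim": 64,                  # 16 heads * 64 = model dim 1024
    "context_len": 4096,
    "total_batch_tokens": 131072,
    "train_steps": 10_000,
    "topology": {"tp": 2, "dp": 2},  # over the 4 chips of a v6e-4
    "eval_metric": "val_bpb",        # bits per byte, held-out C++ split, 10K-step checkpoint
}
```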

The catalog

The catalog of modules we are willing to consider is split into three tiers, because the cost of evaluating them is not uniform.

Tier 1 is "already wired into base_train.py — flip a flag and run". This is where Mamba-3 (AAM hybrid pattern), DSA (DeepSeek sparse attention), Engram (n-gram hash embeddings on attention layers only), mHC (multi-head collaboration), MTP (multi-token prediction), NCP (next concept prediction), and MoD (mixture of depths) live.

Tier 2 is SOTA we believed was promising enough to integrate: TOP (Token Order Prediction), SRI (Search/Replace Infilling), IFIM (instruction-aware FIM), GateSkip (residual gating for token-wise layer skip), several MoD variants (modr, a_mod, gamma_mod, p_mod), FlexiDepth, continual backprop, shrink-and-perturb, Jacobi forcing, and YaRN RoPE extension for stages 3-4 context scaling.

Tier 3 is inference-only and explicitly out of scope for the training ablation: ADEPT early-exit, EAGLE-2 speculative decoding, ring attention. They get their own evaluation lane.
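
The Tier 1 "default off" rule is mechanical enough to sketch. Apart from engram_layers, which does appear in our launch configs, every name below is hypothetical:

```python
# Illustrative only: Tier 1 modules default off; an ablation flips
# exactly the flags under test. Flag names other than engram_layers
# are hypothetical.
TIER1_DEFAULTS = {
    "use_mamba3_aam": False,  # AAM hybrid pattern
    "use_dsa": False,         # DeepSeek sparse attention
    "engram_layers": None,    # attention layers only, e.g. "0,1,3,4,..."
    "use_mhc": False,         # multi-head collaboration
    "use_mtp": False,         # multi-token prediction
    "use_ncp": False,         # next concept prediction
    "use_mod": False,         # mixture of depths
}

def experiment_config(**overrides):
    """Start from all-off defaults; reject unknown flags."""
    cfg = dict(TIER1_DEFAULTS)
    for key, value in overrides.items():
        if key not in cfg:
            raise KeyError(f"unknown Tier 1 flag: {key}")
        cfg[key] = value
    return cfg

# An EXP3-style run: baseline plus DSA, nothing else.
exp3 = experiment_config(use_dsa=True)
```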

Phase 1: structure first

Phase 1 was the architectural foundation question: holding everything else fixed, which structural change moves val_bpb the most? Six experiments, all at depth=16, all on cpp_enriched_16k, all 10K steps.

| Exp  | Config                    | Val BPB @ 10K | Delta vs baseline         |
|------|---------------------------|---------------|---------------------------|
| EXP1 | Baseline (attention only) | ~1.866        | reference                 |
| EXP2 | + Mamba-3 AAM             | ~1.80         | -3.5%                     |
| EXP3 | + DSA (sparse attention)  | 1.562         | -16.3% (winner)           |
| EXP4 | Engram + mHC + MTP        | INVALID       | Engram-on-Mamba-layer bug |
| EXP5 | Full stack                | INVALID       | same bug + NaN loop       |
| EXP6 | Full stack + NCP          | ~1.7          | NCP marginal              |

The headline is that DSA was the largest single-feature improvement we have seen in any phase: -16.3% val_bpb from the attention-only baseline. From Phase 2 onwards, DSA is the baseline.

The asterisk on Phase 1 is more important than the headline. EXP4 and EXP5 used --engram_layers=0,5,10. With our AAM Mamba pattern at depth=16, layers 2, 5, 8, 11, 14 are Mamba layers. Engram is an embedding-side trick designed for attention layers; applying it on top of a Mamba layer is not a supported configuration and produces val_bpb north of 3.5 at init. Both experiments were invalid for evaluating Engram, mHC, and MTP. The correct layer set on depth=16 AAM is 0,1,3,4,6,7,9,10,12,13,15 — every attention layer, no Mamba layers — and a model-init guard now refuses launches that violate it.
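
The guard itself is simple. Here is a sketch of the check, assuming the AAM pattern places a Mamba layer at every third index starting from layer 2 (which reproduces the 2, 5, 8, 11, 14 set at depth=16); the function names are ours:

```python
# Sketch of the model-init guard described above (function names are ours).
def mamba_layer_indices(depth: int) -> set[int]:
    """Layers occupied by Mamba blocks under the assumed AAM pattern."""
    return {i for i in range(depth) if i % 3 == 2}

def validate_engram_layers(engram_layers: list[int], depth: int) -> None:
    """Refuse launches that put Engram (an attention-side embedding trick)
    on top of a Mamba layer -- the bug that invalidated EXP4/EXP5."""
    bad = set(engram_layers) & mamba_layer_indices(depth)
    if bad:
        raise ValueError(
            f"engram_layers {sorted(bad)} are Mamba layers at depth={depth}; "
            "Engram is only supported on attention layers"
        )

# The buggy Phase 1 setting now fails fast:
# validate_engram_layers([0, 5, 10], depth=16)   # ValueError: layers 5 and 10 are Mamba
# The corrected set passes:
validate_engram_layers([0, 1, 3, 4, 6, 7, 9, 10, 12, 13, 15], depth=16)
```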

We are publishing the bug because pretending it didn't happen is how a number like "1.578 with Engram + mHC + MTP" survives into a marketing slide a year later. The honest version is: those experiments did not happen on a configuration that means what we thought it meant, and we re-ran the relevant cells in Phase 2.

Phase 2: stacking with corrected layers

With DSA fixed as the base, Phase 2 added one secondary feature at a time, with the corrected Engram layer set:

| Exp    | Config              | Val BPB      | Status          |
|--------|---------------------|--------------|-----------------|
| p2_e01 | DSA only            | 1.678 @ 3750 | reference       |
| p2_e03 | DSA + Engram (TP=2) | 1.600 @ 1250 | best stable     |
| p2_e04 | DSA + mHC (TP=2)    | 1.577 @ 10K  | complete        |
| p2_e05 | DSA + MTP           | 1.934 @ 2750 | converged early |

Two takeaways. First, both Engram and mHC help on top of DSA. Second, MTP at 4K context on this substrate slightly hurts — counter to the multi-token prediction hype — and we kept it out of the dense base going forward.

The TP=2 constraint on Engram and GateSkip is not a tuning preference; it is a head-sharding limitation. At TP=4, both features land on the wrong shard axis and produce val_bpb ~1.984 (vs 1.581 at TP=2 for the same configuration). We learned this by running it both ways. The right answer was not "Engram is bad", it was "Engram is TP=2 only on this geometry".
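
We encode that lesson the same way as the Engram layer set: as a launch-time check. An illustrative version (the rule comes from the runs above; the function itself is not production code):

```python
# Illustrative topology guard; feature list and TP limits are from the text.
TP_LIMITED_FEATURES = {"engram": 2, "gateskip": 2}  # max supported TP per feature

def check_tensor_parallel(features: list[str], tp: int) -> None:
    """Engram and GateSkip shard per-head state; at TP=4 on this geometry
    they land on the wrong shard axis (val_bpb ~1.98 vs ~1.58 at TP=2)."""
    for feature in features:
        limit = TP_LIMITED_FEATURES.get(feature.lower())
        if limit is not None and tp > limit:
            raise ValueError(
                f"{feature} requires TP<={limit} on this geometry, got TP={tp}")

check_tensor_parallel(["engram", "mhc"], tp=2)  # fine
# check_tensor_parallel(["engram"], tp=4)       # raises
```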

Phase 3: context scaling

Phase 3 asked the boring but required question: what is the maximum context length we can train at on v6e-8 with TP=8? The answer was 128K with gradient checkpointing — at 565K tok/sec. 256K OOMs even with GC, because GC does not reduce XLA's pre-allocation. To go beyond 128K on this hardware we would need sequence parallelism, larger HBM (v6e-16+), or both. That is a stage-4 problem; for the ablation we capped at 128K and moved on.

MoE: the discontinuity

The MoE ablation is where the numbers stop looking incremental. With the best dense configuration sitting around val_bpb 1.56–1.58, the MoE experiments were on a different scale entirely:

| Exp     | Config                                     | Val BPB @ best | Notes                          |
|---------|--------------------------------------------|----------------|--------------------------------|
| moe_e01 | Dense DSA                                  | ~1.992 @ 5250  | dense baseline (this geometry) |
| moe_e02 | DSA + 8r+1s top-2                          | —              | very slow on shared VM         |
| moe_e05 | DSA + Mamba + MoE + Engram (TP=2)          | 2.221 @ 3250   | running                        |
| moe_e06 | DSA + Mamba + 2s+16r + Engram + mHC (TP=2) | 1.206 @ 3750   | best overall                   |

moe_e06 — DSA, Mamba-3 AAM, 2 shared + 16 routed experts top-2, Engram on attention layers, mHC, TP=2 — is the strongest configuration we have run in the ablation series. The gap from the best dense IFIM result (1.565) to moe_e06 (1.206) is large enough that we are willing to call it a real architectural shift, not a measurement artifact. The catch is that moe_e06 was on a contended machine and only reached 3,750 steps cleanly, so Phase 5 re-runs it on a fresh box for a 10K-step receipt.
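
To make the moe_e06 routing shape concrete, here is a generic shared-plus-routed top-2 MoE layer with the stated expert counts. It is a minimal sketch, not the production kernel; everything beyond the 2 shared + 16 routed top-2 geometry and z_loss_weight=0.01 is our assumption:

```python
import torch
import torch.nn.functional as F

class SharedRoutedMoE(torch.nn.Module):
    """Sketch: 2 shared experts see every token; 16 routed experts, top-2 gating."""

    def __init__(self, dim=1024, n_shared=2, n_routed=16, top_k=2, z_loss_weight=0.01):
        super().__init__()
        make_expert = lambda: torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(), torch.nn.Linear(4 * dim, dim))
        self.shared = torch.nn.ModuleList(make_expert() for _ in range(n_shared))
        self.routed = torch.nn.ModuleList(make_expert() for _ in range(n_routed))
        self.router = torch.nn.Linear(dim, n_routed, bias=False)
        self.top_k, self.z_loss_weight = top_k, z_loss_weight

    def forward(self, x):                     # x: [tokens, dim]
        out = sum(e(x) for e in self.shared)  # shared experts: every token
        logits = self.router(x)               # [tokens, n_routed]
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        for slot in range(self.top_k):        # routed experts: top-2 per token
            for e_id, expert in enumerate(self.routed):
                mask = idx[:, slot] == e_id
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        # Router z-loss (log-sum-exp penalty) keeps routing logits bounded.
        z_loss = self.z_loss_weight * torch.logsumexp(logits, dim=-1).square().mean()
        return out, z_loss
```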

Phase 4: training objectives

Phase 4 isolated the question of training objectives on the dense DSA base:

| Exp     | Config         | TP | Val BPB @ 10K | Notes                       |
|---------|----------------|----|---------------|-----------------------------|
| p2_e00  | DSA baseline   | 4  | 1.958         | reference                   |
| p4_e06  | DSA + IFIM     | 4  | 1.565         | winner (0.393 BPB gain)     |
| p4_e03  | DSA + GateSkip | 2  | 1.581         | TP=2 required               |
| p4_e01  | DSA + TOP      | 4  | 1.734         | T_top=2048 aligned          |
| p4_e05  | DSA + SRI      | 4  | 1.746         |                             |
| p4_e04b | DSA + Mamba-3  | 4  | 2.324         | hurts when stacked with DSA |

IFIM — instruction-aware fill-in-the-middle — was the clean winner: a training-time data transformation that prepends docstrings and comments as instruction prefixes to FIM examples. It costs nothing at inference time and gave a 20% improvement over the dense DSA baseline. GateSkip was the second-best, with the TP≤2 caveat noted above. SRI and TOP both helped over baseline but less. DSA + Mamba-3 stacked badly on this geometry — a result that contradicts what the AAM hybrid did under MoE in moe_e06, which is exactly why we run things in groups instead of trusting any one result.
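
To make "instruction-aware FIM" concrete, here is a sketch of the transform as described above: hoist the doc comment into an instruction prefix ahead of a standard prefix-suffix-middle FIM example. The sentinel tokens and helper are illustrative, not our tokenizer's actual specials:

```python
# Hedged sketch of the IFIM data transform; sentinels are illustrative.
PRE, SUF, MID = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def ifim_example(prefix: str, middle: str, suffix: str, doc_comment: str) -> str:
    """PSM-order FIM string with the doc comment hoisted into an instruction
    prefix, so the model sees intent before filling the hole."""
    instruction = f"// Instruction: {doc_comment.strip()}\n" if doc_comment else ""
    return f"{instruction}{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

print(ifim_example(
    prefix="int clamp(int v, int lo, int hi) {\n    ",
    middle="return v < lo ? lo : (v > hi ? hi : v);",
    suffix="\n}",
    doc_comment="Clamp v into the inclusive range [lo, hi].",
))
```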

The asterisk on Phase 4 is the TOP number. The original TOP run was at ~101 seconds per step (a 12-day ETA) because of a Python loop in the auxiliary loss. The batched matmul fix replaces eight sequential torch.mm calls with one torch.mm(h_2d, w.T) and brings the step time down to ~12 seconds. The 1.734 number is from the slow run; we are re-running TOP under the batched implementation in Phase 5 before letting it into any final comparison.
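
The shape of that fix, reconstructed from the description above (the tensor dimensions are hypothetical):

```python
import torch

h = torch.randn(8, 512, 1024)  # [groups, tokens, dim] -- hypothetical shapes
w = torch.randn(2048, 1024)    # TOP projection, T_top=2048 rows

# Slow path: sequential torch.mm per group in a Python loop (the ~101 s/step run).
slow = torch.stack([torch.mm(h[g], w.T) for g in range(h.shape[0])])

# Fast path: flatten to one matmul (the ~12 s/step run).
h_2d = h.reshape(-1, h.shape[-1])  # [groups*tokens, dim]
fast = torch.mm(h_2d, w.T).reshape(h.shape[0], h.shape[1], -1)

assert torch.allclose(slow, fast, atol=1e-5)  # same result, one kernel launch
```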

Phase 5: do they stack?

The Phase 5 question is the only one that matters for the production candidate: which Phase 4 objectives stack additively on top of the MoE base, and is there a single configuration we can commit to for the NAM-64 production run?

| ID     | Config                            | TP | Expected BPB |
|--------|-----------------------------------|----|--------------|
| p5_e01 | MoE base re-validation (moe_e06)  | 2  | ~1.20        |
| p5_e02 | MoE + IFIM                        | 2  | 1.05–1.20    |
| p5_e03 | MoE + GateSkip + IFIM             | 2  | 1.00–1.15    |
| p5_e04 | MoE + SRI + IFIM                  | 2  | 1.05–1.20    |
| p5_e05 | Dense TOP re-validation (batched) | 4  | ~1.72        |
| p5_e06 | MoE + TOP                         | 2  | 1.05–1.20    |

The hypothesis behind p5_e02 is that IFIM and MoE operate at orthogonal levels — IFIM rewrites the data, MoE routes the activations — so they should compose without interference. IFIM may even improve MoE routing by giving the router a cleaner semantic signal in the prefix. If that holds, p5_e02 is the production candidate.

The expected-BPB column is honest: it is a range, not a target. We do not pretend to predict the exact stacking gain ahead of the run. The decision rule is in the comparison methodology, not in the prediction.

Comparison methodology

The methodology has three rules that we follow without exception.

First, same substrate or labeled differently. Two numbers from different substrates do not appear in the same table. If they have to be discussed together, the substrate is part of the row.

Second, best-checkpoint reporting, with the step recorded. We report the best val_bpb and the step it was achieved at, never the final-step number alone. A model that hits 1.20 at step 3,750 and drifts up to 1.25 by 10K is a different signal than a model that hits 1.20 at 10K — the first one is a stability problem, the second one is a converged result. The step matters; the step is in the table.

Third, invalidation is loud. If a configuration was run on a buggy layer set, a corrupted preset, or during a known loss spike, its numbers are removed from the comparison and replaced with the word INVALID and the reason. The d24 hybrid checkpoint at step 25K is the canonical example: a transient loss spike (loss jumped from ~0.8 to ~3.4 around step 24,850, recovered fully by ~25,700) coincided with the periodic save, and the saved weights were degraded. The eval at 25K (3.1% compile rate vs 11.0% at step 20K) was not a model regression; it was a snapshot during recovery. We say so in the report. The next save is the comparable one.
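
Rules two and three are easy to automate. An illustrative reporting helper (the structure is ours; the rules are the ones above):

```python
# Illustrative helper: best-checkpoint reporting with the step recorded,
# and loud invalidation instead of a number.
from dataclasses import dataclass, field

@dataclass
class RunReport:
    name: str
    substrate: str
    invalid_reason: str | None = None
    history: list[tuple[int, float]] = field(default_factory=list)  # (step, val_bpb)

    def record(self, step: int, val_bpb: float) -> None:
        self.history.append((step, val_bpb))

    def row(self) -> str:
        if self.invalid_reason:
            return f"{self.name} | INVALID -- {self.invalid_reason}"
        step, bpb = min(self.history, key=lambda t: t[1])  # best checkpoint, not last
        return f"{self.name} | {bpb:.3f} @ {step} | {self.substrate}"

r = RunReport("p2_e04", "v6e-4 / d16 / 4K")
r.record(3750, 1.601)
r.record(10000, 1.577)
print(r.row())  # p2_e04 | 1.577 @ 10000 | v6e-4 / d16 / 4K
```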

What we did not adopt

The half of the catalog that did not make it into the production candidate is just as important as the half that did.

  • MTP at 4K context: dropped on this substrate (slight regression).
  • Dense DSA + Mamba-3 stacking: dropped (stacks badly without MoE).
  • NCP: marginal gain at the cost of training complexity; parked.
  • MoD variants beyond the baseline: all parked pending a head-to-head against GateSkip on the same MoE base. If GateSkip stacks with MoE in p5_e03, MoD is unlikely to earn a slot.
  • Several Phase 1 results: simply void due to the Engram-on-Mamba bug and not rerun, because Phase 2 covered the same hypotheses on a corrected setup.

The honest scorecard

The current architectural shape we plan to commit to looks like: DSA (sparse attention, start-layer 8) as the base, Mamba-3 AAM hybrid with qknorm + bias + trapezoidal defaults (complex RoPE opt-in) for the SSM portion, MoE with 2 shared + 16 routed experts top-2 and z_loss_weight=0.01, Engram on attention layers only with the explicit layer set, mHC enabled, IFIM as the training-time objective.
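
Spelled out as a config sketch (the key names are illustrative; the values are the ones stated above):

```python
# Illustrative rendering of the production candidate; key names are ours.
PRODUCTION_CANDIDATE = {
    "attention": {"kind": "dsa", "start_layer": 8},
    "ssm": {"kind": "mamba3_aam", "qknorm": True, "bias": True,
            "trapezoidal": True, "complex_rope": False},  # complex RoPE opt-in
    "moe": {"shared_experts": 2, "routed_experts": 16,
            "top_k": 2, "z_loss_weight": 0.01},
    "engram": {"layers": [0, 1, 3, 4, 6, 7, 9, 10, 12, 13, 15]},  # attention only
    "mhc": True,
    "objective": "ifim",
}
```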

That stack is what survives our ablation methodology — not what survives a single best-of-N run. If the Phase 5 stacking experiments contradict the prediction, we will rebuild the candidate around the actual numbers, not the prior. That is the whole point of running ablations instead of adopting papers.

David Gornshtein • Datasunrise OÜ