Landing the Mamba 3 + Transformer Interleave Ratio: What the Ablations Told Us to Throw Away
How the hybrid layer pattern for our C++ specialist converged: AEMEAEDE versus dense versus GDN, what the NAM52 and NAM56R ablations settled, and the features we cut on the data.

The interesting question is not "should we use a hybrid" - that one was answered by every frontier lab independently the year before we got here. The interesting question is "which ratio, in which order, at our model size, for C++". This post walks through the ablation work that actually settled that question for our nanochat POC, names the patterns we kept and the ones we cut, and is honest about which decisions were made on loss curves and which were made on training stability.
All runs referenced below use our internal preset names (NAM52, NAM56R, variants of both). Shapes are concrete even when the names are internal.
The Candidate Space
For a decoder-only backbone with Mamba 3 token-mixers and Transformer attention blocks, the interleaving question has a handful of live axes:
- Attention-to-Mamba ratio: A:M somewhere between 1:2 and 1:9. Pure A and pure M are the trivial endpoints.
- Placement: which depths carry attention. Early, middle, or late third of the stack, or a spread.
- Block pattern: whether we run attention and Mamba as separate blocks, or inside a Nemotron-style A/M/E separated-block pattern where MLP lives in its own block.
- Block variants on top of that: Engram, mHC, DSA (exact-token sparse attention as a per-layer choice), MTP.
We built a 20-variant ablation matrix on TPU v6e-x4 at 4k context for rapid iteration, then narrowed to the handful that survived and reran the survivors on H200 at NAM52 and NAM56R shapes. The matrix lives in the v4 architecture note; the short version is that dense and MoE baselines, multiple attention-to-Mamba ratios, several routing variants, and the full feature-stack (Engram, mHC, DSA, MTP, MoE) were compared against each other at a uniform step budget. Evaluation is an LLM-as-a-judge C++ review pipeline over generated completions plus standard loss.
What the 100-Step Results Said
The operational win was the H200 100-step AdamW sweep, because it told us which patterns were even stable before we spent compute on loss curves. Nine patterns finished 100 steps; the ones with concrete numbers in our notes:
| Run | Preset | Final Loss | gnorm | Tok/sec | Status |
|---|---|---|---|---|---|
| r3_adamw (dense baseline) | nam52_h200_dense_no_mtp_v1 | 5.43 | 0.57 | 508 | BEST |
| r3_ref-aw | nam52_h200_dense_ref_v1 | 6.84 | 0.76 | 629 | OK |
| r4_mtp1 | nam52_h200_dense_mtp1_v1 | 6.79 | 0.79 | 644 | OK |
| r4_gdn6 (GDN6 AEMEDAEME) | nam52_gdn6_h200_dense_v1 | 6.67 | 0.91 | 618 | OK |
| r4_fullstack | nam52_fullstack_h200_dense_v1 | 6.84 | 0.91 | 518 | OK |
| r3_gdn-aw (GDN, no mamba) | nam52_gdn_nomamba_h200_dense_v1 | 6.88 | 1.11 | 511 | OK |
| r4_hybrid_adamw (AEMEAEDE) | nam52_hybrid_md_h200_dense_v1 | 7.06 | 0.72 | 512 | OK |
| r3_dyt-aw (Dynamic Tanh) | nam52_h200_dense_dyt_v1 | 8.02 | 241.0 | 647 | UNSTABLE |
| r3_attn-aw (AttnRes) | nam52_h200_dense_attnres_v1 | 25.91 | 18.9M | 652 | DIVERGED |
Several things are settled by this table.
First, at 100 steps on 4k context the dense Transformer baseline is ahead of every hybrid variant. That is expected - the hybrid's advantage is at long context, and at 4k the Mamba layers are spending their O(N) cheapness on sequence lengths that do not exercise it. The dense baseline's 5.43 at 508 tok/sec is the number every hybrid has to beat, and none of them did at this shape.
Second, the gap between variants is smaller than the gap to divergence. AttnRes diverged (loss 25.91, gnorm 18.9M) and DyT went unstable (gnorm 241) at the same config where AEMEAEDE, GDN6, fullstack, and AdamW dense all converged cleanly. Architecture experiments have to get past "does this converge at our LR schedule" before they get to compete on loss, and some of them didn't. We cut DyT and AttnRes from the tree right here.
Third, Muon was a universal instability source at NAM52 scale. Every --muon run NaN'd at step 4 or 5 (one made it to step 32) except r7_splitqkv_hybrid, which combined split-QKV with the AEMEAEDE hybrid and landed at loss 4.23 after 50 steps - the best Muon run we had, and the best overall number at 50 steps. The implication for the hybrid ratio question: the AEMEAEDE pattern is not just a loss number, it is the only pattern we have where Muon trains at all. If we want Muon's compute efficiency, the hybrid pattern is a prerequisite, not a preference.
The Pattern We Kept
After the ablation sweep and the Muon stability test, the pattern we carried into the longer runs is Mamba-majority with a minority of attention blocks biased toward the middle and later third of the network. Concrete shape, at our production NAM56R depth:
- Roughly 7 Mamba 3 layers per 1 attention layer. Attention is a minority everywhere.
- Attention placement biased to middle + late third. Early layers embed tokens and accumulate local state; attention there is wasted. By the middle of the network, representations are abstract enough that attention lookups hit meaningful keys, and the quadratic cost is paid against features that justify it.
- Nemotron-style separated blocks (A / M / E independent), which lets us apply module-specific optimizer routing and gradient checkpointing independently per block. Mamba blocks are skipped by our selective checkpoint policy; attention blocks are not.
- MIMO rank 4 on the Mamba side, one group (ngroups=1), chunk_size=16, RoPE fraction 1.0. Numbers come from the AuthorMamba3Config contract; see the parallel-performance post for the geometry.
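As a sketch of what "Mamba-majority, attention biased middle-and-late" means mechanically, here is a hypothetical helper that builds a per-layer pattern at roughly 7 M per 1 A. The function name and the exact placement rule are illustrative, not the production config:

```python
# Hypothetical sketch of the layer-pattern construction described above:
# Mamba-majority at roughly 7:1, attention slots spread over the middle
# and late two-thirds of the stack. Illustrative only.

def build_layer_pattern(depth: int, attn_ratio: int = 8) -> list:
    """Return a per-layer list of 'A' (attention) or 'M' (Mamba 3) labels."""
    n_attn = max(1, depth // attn_ratio)  # ~1 attention layer per attn_ratio
    start = depth // 3                    # skip the early third entirely
    span = depth - start
    # evenly space the attention slots across the remaining depth
    attn_idx = {start + (i * span) // n_attn + span // (2 * n_attn)
                for i in range(n_attn)}
    return ['A' if i in attn_idx else 'M' for i in range(depth)]

pattern = build_layer_pattern(depth=24, attn_ratio=8)
```

At depth 24 this yields 3 attention layers, none of them in the first third of the stack, which is the shape of the bias described above.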
This pattern gave us the g1k5 result - the 50-step Muon + split-QKV + hybrid run that landed at 4.23, our best number through step 50. It is also what we run in the long NAM56R training plan because it is the only variant that stays numerically sane under the optimizer we actually want to use.
What the Data Told Us to Throw Away
Being explicit about cuts matters as much as the pattern we kept, because the ablations eliminated more than they confirmed.
DyT (Dynamic Tanh) and AttnRes. Both diverged or went unstable on the same LR schedule where the base stack trained cleanly. The 2.6-point loss gap on DyT plus the 241 gnorm makes this an easy call. The b16 and ahel tracks are closed on our side.
GDN without Mamba. nam52_gdn_nomamba_h200_dense_v1 (r3_gdn-aw) converged cleanly but came in at 6.88, behind dense (5.43), behind GDN6 (6.67), and behind the AEMEAEDE hybrid (7.06). GDN as a Mamba replacement did not justify itself on this dataset; the 57go track closed.
Fullstack (Engram + A-MoD + ngram + mHC + MTP + structure + MoE, all on at once). nam52_fullstack_h200_dense_v1 trained at 6.84 loss, 518 tok/sec - no advantage over dense. "All features on" is not strictly better than a thoughtful subset. We carry A-MoD + ngram_hash + Engram forward; we drop MoDA, un-bottlenecked structure, and fullstack-as-default.
MoDA. 15,426 tok/sec against the baseline's ~20,800 is a 25.8 percent throughput tax for a routing variant that fullstack did not pay back in loss. The "MoDA detach fix" was still in-tree when the bench ran and has since landed, but the throughput gap at fixed architecture matters more than the bug note. Out.
Structure embeddings at full width. Adopted only after a separate optimization pass cut them from 5 separate nn.Embedding tables to one unified table with offsets, added a low-rank bottleneck (structure_bottleneck_dim=64), and replaced softmax+mask with learned-scale weighting - 3 kernel launches instead of ~15. The full-width variant was throwing FLOPs at a low-rank signal.
ngram_hash. Same pattern. 16 separate tables unified into one nn.Embedding with offsets, vectorized hash, tuned embed_dim=16 and table_size=200000. Throughput 20,838 versus baseline 20,781 - free at the tuned dimensions, 22x fewer kernel launches.
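The "one table + offsets" trick behind both of these cuts is worth making concrete. A minimal sketch, with illustrative sizes (the demo below uses a small table_size; the post's tuned values are 16 tables, table_size=200000, embed_dim=16):

```python
import torch
import torch.nn as nn

# Sketch of the unified-table trick: instead of N separate nn.Embedding
# lookups (N kernel launches), concatenate the tables into one and shift
# each table's indices by a per-table offset, so one lookup serves all N.

class UnifiedHashEmbedding(nn.Module):
    def __init__(self, n_tables=16, table_size=200_000, embed_dim=16):
        super().__init__()
        self.table = nn.Embedding(n_tables * table_size, embed_dim)
        # offsets[i] shifts table i's indices into its slice of the big table
        self.register_buffer("offsets", torch.arange(n_tables) * table_size)

    def forward(self, idx):
        # idx: (batch, seq, n_tables), values in [0, table_size)
        return self.table(idx + self.offsets)  # one lookup, not n_tables

emb = UnifiedHashEmbedding(table_size=1_000)       # small size for the demo
out = emb(torch.zeros(2, 4, 16, dtype=torch.long))  # (2, 4, 16, 16)
```

Index 0 of table 1 resolves to row 1000 of the unified table, which is what makes the single fused lookup equivalent to 16 separate ones.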
MoE placement. We keep MoE but ablations favored MoE on alternate layers over MoE on every layer at our FLOP budget, and a 64-expert top-4 + shared expert over both the 8-expert legacy config and 128-expert top-8. Ultra-fine expert parallelism did not pay off at our size. The shared expert paired well with Engram, matching the MoE+Engram synergy hypothesis.
The Optimizer Wiring That Actually Mattered
Two things we were not expecting to matter for "interleaving ratio" ended up being prerequisites for any hybrid result at all.
First, Muon does not train non-2D parameters. Mamba introduces 1D params (A_log, dt_bias, D, conv1d.bias) and 3D params (conv1d.weight at shape (conv_dim, 1, d_conv)). Newton-Schulz orthogonalization requires 2D matrices; pushing conv1d.weight through Muon crashes. The fix is an ndim != 2 filter that routes non-2D params to AdamW. Without it the hybrid lane is dead on arrival.
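A minimal sketch of that filter, assuming a plain partition by tensor rank (the real routing may carry more per-group metadata):

```python
import torch

# ndim != 2 routing: Newton-Schulz orthogonalization only makes sense for
# 2D weight matrices, so everything else (Mamba's 1D A_log, dt_bias, D,
# conv1d.bias, and the 3D depthwise conv1d.weight) falls back to AdamW.

def split_params_for_muon(model: torch.nn.Module):
    muon_params, adamw_params = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        (muon_params if p.ndim == 2 else adamw_params).append(p)
    return muon_params, adamw_params

# Toy stand-in: a 2D linear weight plus a 3D depthwise conv and two biases.
m = torch.nn.Sequential(torch.nn.Linear(8, 8),
                        torch.nn.Conv1d(8, 8, 4, groups=8))
muon, adamw = split_params_for_muon(m)
```

On this toy module, only the linear weight is Muon-eligible; the two biases and the (8, 1, 4) depthwise conv weight all route to AdamW.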
Second, LR separation is not optional. The repo's tuning has distinct LRs for lm_head (0.004 * scale), embeddings (0.2 * scale), resid (0.005), and x0 (0.5). Merging any of these into one group is a 50x LR tax on lm_head and a near-instant divergence. The Mamba-AdamW group has to be its own entry in adam_groups. The 4.23-at-50 Muon+split-QKV+hybrid run preserved those groups verbatim; merged runs died by step 5.
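A sketch of what that group separation looks like in code. The LRs mirror the numbers above (lm_head 0.004 * scale, embeddings 0.2 * scale, x0 0.5, everything else 0.005); the name-matching rules and the toy model are stand-ins for the real routing:

```python
import torch

# Illustrative AdamW group wiring: each named bucket gets its own LR entry,
# and anything unmatched (Mamba params, resid) lands in the 0.005 group.
def build_adam_groups(model: torch.nn.Module, scale: float = 1.0):
    buckets = {"lm_head": 0.004 * scale, "embed": 0.2 * scale, "x0": 0.5}
    assigned, groups = set(), []
    for key, lr in buckets.items():
        params = [p for n, p in model.named_parameters() if key in n]
        assigned.update(id(p) for p in params)
        groups.append({"params": params, "lr": lr})
    rest = [p for _, p in model.named_parameters() if id(p) not in assigned]
    groups.append({"params": rest, "lr": 0.005})  # its own entry, never merged
    return groups

class Toy(torch.nn.Module):  # stand-in model, not the real architecture
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(10, 4)
        self.mamba_mixer = torch.nn.Linear(4, 4)
        self.lm_head = torch.nn.Linear(4, 10)

groups = build_adam_groups(Toy())
```

Collapsing these groups into one is exactly the 50x lm_head LR tax described above: lm_head would inherit the 0.2 embedding LR instead of its own 0.004.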
Initialization, Because It Killed A Week
Two initialization cuts that came out of the review cycle and would have silently degraded every hybrid run:
A_log range. The official Mamba 2 Simple init draws A ~ Uniform(1, 16) and stores A_log = log(A) so that A = -exp(A_log) ends up in [-16, -1]. Our earlier proposals drew A_log ~ Uniform(-log(64), -log(1)), which produces A in [-64, -1] - four times the decay rate. The practical effect is that at the high-decay tail, the SSM state forgets within one token, and the model looks like an expensive causal pointwise operator. We now match the official init.
conv1d.weight init. An earlier attempt used a linear-style uniform_(-s, s) with s = sqrt(3)/sqrt(n_embd) ~ 0.048. The correct fan-in for a depthwise 1D conv with d_conv=4 gives s ~ 0.866 under Kaiming uniform. Linear init is 18x too small; conv1d starts as near-zero, and the convolutional path contributes nothing for the first thousand steps. We now leave conv1d at PyTorch's default, which is already Kaiming.
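The arithmetic behind the 18x claim, for the record. The width n_embd = 1280 is an assumption chosen to reproduce the post's s ~ 0.048; the exact width is not stated here:

```python
import math

# For a depthwise 1D conv each output channel sees only its own channel,
# so fan_in = d_conv, and the Kaiming-uniform bound is sqrt(3 / fan_in).
# The broken init used the linear fan-in (n_embd) instead.

d_conv, n_embd = 4, 1280                 # n_embd is an assumed width
s_linear = math.sqrt(3.0 / n_embd)       # linear-style bound: ~0.048
s_conv = math.sqrt(3.0 / d_conv)         # depthwise fan-in bound: ~0.866
ratio = s_conv / s_linear                # ~18x
```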
Neither of these is about "hybrid ratio" in the naive sense, but both are reasons the hybrid pattern works in our hands. Architecture is not just a layer order; it is also the initialization contract every block has to satisfy.
The PyTorch vs Official Mamba 3 Delta
To be concrete: we adopted twelve features from the official Mamba 3 release, all behind config flags, all backward-compatible, all covered by regression tests. Input-dependent A, MIMO at rank 4, optional removal of conv1d, learned angle rates, single-pass trapezoidal, per-head bias with init 1.0, learned RMSNorm for B/C, QK-dot skip connection, rope_fraction, output norm group size, Triton SISO kernel, and mod-2-pi angle wrapping.
The three critical gaps we closed were input-dependent A, MIMO, and optional conv1d removal. What remains different from upstream is kernel-level: MIMO in our implementation runs a PyTorch loop over ranks where upstream runs a fused TileLang kernel, and RMSNormGated runs through nn.RMSNorm where upstream uses a fused Triton path. Those are perf differences, not correctness differences, and they exist specifically because we wanted the XLA/TPU lane to work.
What we have that upstream does not: TPU/XLA support via torch_xla.experimental.scan, document-boundary masking inside the scan and the conv and the RoPE, Nemotron-style separated blocks, MoE + MoD + Engram + mHC + DSA + MTP integration, selective gradient checkpointing that skips Mamba, and a torch.compile wrapper for the Triton kernels under inductor.
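To make "all behind config flags" concrete, here is a hypothetical flag surface for the adopted features. Only ngroups, chunk_size, and rope_fraction are names taken from this post; the rest are illustrative stand-ins, with defaults matching the adopted settings where stated:

```python
from dataclasses import dataclass

# Hypothetical config sketch; field names other than ngroups, chunk_size,
# and rope_fraction are invented for illustration.
@dataclass
class Mamba3BlockConfig:
    input_dependent_A: bool = True   # critical gap closed: input-dependent A
    mimo_rank: int = 4               # critical gap closed: MIMO at rank 4
    use_conv1d: bool = True          # critical gap closed: conv1d is removable
    ngroups: int = 1
    chunk_size: int = 16
    rope_fraction: float = 1.0
    per_head_bias_init: float = 1.0

cfg = Mamba3BlockConfig()
```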
What Comes Next
Two things are still open on the interleaving question.
The first is long-context ablation. The 100-step H200 sweep ran at the 4k-context shape because it is cheap and parallelizable. The hybrid's actual win is at 16k - 64k context on the v4 context-graph packer's Callers -> Target -> Callees snippets. That sweep is queued; we expect the hybrid-to-dense loss gap to invert once the sequence length starts exercising the O(N) advantage.
The second is MIMO-scan asymmetry between training and decode. The official fused kernel uses a single shared state across the R ranks during training, while our PyTorch loop uses R independent states. Decode correctly uses a shared state. That train-decode asymmetry may be leaving loss on the table; closing it requires a custom scan kernel, which is on the roadmap behind the TileLang P1 and PsiV work from the parallel-performance post.
The honest summary is that the ratio, the placement, and the optimizer wiring are all settled for now: Mamba-majority, attention biased middle-and-late, Nemotron-separated blocks, MIMO rank 4, Muon on 2D params with AdamW on the rest, LR groups preserved. Everything else in the ablation matrix is either shipping (MoE 64+1, A-MoD, Engram, structure with bottleneck, ngram_hash unified) or cut (DyT, AttnRes, GDN-no-mamba, MoDA, structure at full width, 128-expert MoE, fullstack-as-default). The shape of the model is what the data said to keep after we threw away what the data said to throw away.
References
- mamba_integration_log.md
- mamba_review_followup_plan.md
- mamba3_adoption_report_2026-03-18.md
- v4_architecture.md
- nanochat_cpp_model.md
- architecture_and_eval_en.md
- docs/design/03-model-architecture.md
- docs/design/14-structure-aware-attention-and-feature-integration-plan.md
- docs/design/13-gated-attention-v1-spec.md
- CHANGELOG.md
- CURRENT_STATE.md