The SLM Ensemble Architecture
Eight specialist 4B-8B hybrid Mamba-3 + Transformer models, each activating 0.8B-1.6B parameters per token, engineered to beat a single 70B generalist on C++.

MegaCpp.com does not ship one large language model. It ships an ensemble of eight Small Language Models (SLMs), each between 4B and 8B total parameters with 0.8B-1.6B active per token, specialized to a distinct slice of the C++ problem space. This document describes what is inside each SLM, how the ensemble is composed, and why - for our workload - this configuration outperforms a single 70B dense generalist at a fraction of the inference cost.
Every claim below is grounded in the internal design notes: architecture_and_eval_en.md and v4_architecture.md from the nanochat training repo, nanochat_cpp_model.md (the model source dump), and the mamba3_mimo_p1_notes.md, mamba3_mimo_p2_psiv_cache_design.md, mamba3_mimo_p3_register_split_design.md design docs from cppmega/docs/, plus the cppmega/features/mamba3/ integration code.
1. Why an ensemble, not a monolith
A single 70B dense generalist has two properties we do not want. First, every token pays for every parameter: at 70B bf16 weights you are looking at ~140 GB of VRAM plus activations, which rules out anything below an H100/H200-class box and makes per-token cost dominated by memory bandwidth rather than useful compute. Second, a generalist is trained on a generalist distribution. C++ is a narrow, deeply structured language where most of the long tail is not "more internet text" but things like template instantiation rules, ABI boundaries, lock-free patterns, and build-system quirks. A generalist amortizes capacity across cooking recipes and legal boilerplate you will never ask about.
The architecture_and_eval_en.md plan is explicit about the target envelope: Phase 1 is a dense <1B model that "fits easily into edge devices with minimal memory (<1GB VRAM)", and Phase 2 expands to "~5B Total / ~800M Active" via fine-grained MoE. Eight such specialists, each shaped to a different subdomain, sum to roughly 43B total parameters on disk (the per-specialist sizes are listed in §2) but activate 0.8-1.6B per token per model, and at inference time we typically route to one or two specialists per request.
The ensemble bet is straightforward: eight 5B specialists, each trained past Chinchilla optimality on a curated corpus, collectively cover more of the C++ distribution with higher fidelity than one 70B generalist trained once on a broad mix. Evaluation is structured around that claim - the architecture_and_eval_en.md §3 pipeline uses GKE-hosted T4 pods to generate completions and Gemini 3.1 Pro as an LLM-as-a-judge grading correctness, context adherence, and hallucination rate on "complex, cross-file prompt graphs".
2. The eight specialists
Expert boundaries follow the v4 Context Graph algorithm described in v4_architecture.md: Callers -> Target -> Callees, extracted via Tree-sitter from historical commits, with a 64K-token budget per snippet. The expert split is chosen so that each specialist sees a coherent subgraph of C++ practice rather than a random slice.
- Core Language & Templates - the parts of C++ that are syntactically dense: template metaprogramming, SFINAE/concepts, constexpr, parameter packs, CTAD. 8B total / 1.6B active. This is the largest specialist because the grammar tail here is huge and often rewards deeper attention stacks.
- Concurrency & Atomics - std::atomic, memory orders, lock-free structures, coroutines, std::execution. 6B / 1.2B.
- Systems & OS Integration - syscalls, memory-mapping, POSIX/Win32, NUMA, io_uring, sockets. 5B / 1B.
- Build Systems & Toolchains - CMake, Bazel, Meson, compile_commands, linker scripts, ABI compatibility, cross-compilation. 4B / 0.8B.
- Standard Library & Ranges - containers, algorithms, <ranges>, <chrono>, allocators. 5B / 1B.
- Graphics, Math & SIMD - linear algebra, intrinsics, CUDA/HIP, shader interop, GPU kernels. 6B / 1.2B.
- Embedded & Real-Time - freestanding C++, MCU toolchains, bare-metal patterns, fixed-point, ISR-safe code. 4B / 0.8B.
- Legacy C++ & Interop - C-with-classes codebases, raw pointer idioms, FFI/C ABI, long-lived enterprise patterns. 5B / 1B.
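As a sanity check, the per-specialist figures above sum directly. The dictionary keys below are shorthand for the bullet names; all numbers come from the list itself:

```python
# Per-specialist (total_params_B, active_params_B), copied from the list above.
specialists = {
    "core_language_templates": (8, 1.6),
    "concurrency_atomics":     (6, 1.2),
    "systems_os":              (5, 1.0),
    "build_toolchains":        (4, 0.8),
    "stdlib_ranges":           (5, 1.0),
    "graphics_math_simd":      (6, 1.2),
    "embedded_realtime":       (4, 0.8),
    "legacy_interop":          (5, 1.0),
}

total_on_disk = sum(t for t, _ in specialists.values())  # parameters stored fleet-wide
sum_active    = sum(a for _, a in specialists.values())  # if every specialist ran at once

# Routing typically activates one or two specialists per request, so peak active
# parameters for a hard cross-domain prompt (e.g. templates + SIMD) is:
peak_active = specialists["core_language_templates"][1] + specialists["graphics_math_simd"][1]

print(total_on_disk, round(sum_active, 1), round(peak_active, 1))  # 43 8.6 2.8
```

43B on disk and at most ~2.8B active per request is the whole economic argument in one line.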
Each specialist shares the same backbone and tokenizer (cpp_tokenizer.py from nanochat_cpp_model.md), and each is trained on a corpus slice produced by the v4_context_graph extractor described in v4_architecture.md, but the expert mixture and Engram n-gram tables are tuned per domain. This is not "the same model with different fine-tunes bolted on"; the pretraining data mix and the expert routing are both domain-specific.
3. The hybrid layer stack
Inside a single specialist we use the hybrid Mamba-3 + Transformer design called out in architecture_and_eval_en.md §1 Phase 1: "Backbone: Hybrid Mamba-3 + Grouped Query Attention (GQA)". The model code lives in nanochat_cpp_model.md (the GPT/GPTConfig/CausalSelfAttention/Block classes) and the Mamba-3 integration seam is cppmega/features/mamba3/config.py (AuthorMamba3Config, build_author_mamba3_config).
3.1 Attention half
The attention path is grouped-query attention with rotary embeddings and QK-norm. From nanochat_cpp_model.md (class CausalSelfAttention):
- separate c_q, c_k, c_v projections, with n_kv_head <= n_head and n_head % n_kv_head == 0 enforced;
- rotary applied to Q and K, then RMSNorm ("QK norm") before attention;
- Flash Attention 3 on CUDA, which "handles GQA automatically when n_kv_heads < n_heads";
- a per-layer window_pattern string ("L" long, "S" short) that tiles full-context and half-context layers across depth, with the final layer always L.
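A minimal sketch of how such a pattern string might be tiled across the depth. The helper name and the exact tiling rule (beyond "final layer always L") are assumptions for illustration, not the nanochat implementation:

```python
def expand_window_pattern(pattern: str, n_layer: int) -> list[str]:
    """Tile a short 'L'/'S' pattern across n_layer layers, forcing the last to 'L'."""
    assert set(pattern) <= {"L", "S"}
    tiled = (pattern * n_layer)[:n_layer]   # repeat the pattern to cover every layer
    return list(tiled[:-1]) + ["L"]         # final layer always gets the full window

# GQA head-count constraint from CausalSelfAttention, as quoted above:
n_head, n_kv_head = 16, 4
assert n_kv_head <= n_head and n_head % n_kv_head == 0

print("".join(expand_window_pattern("SSL", 8)))  # SSLSSLSL
```

Half-context "S" layers keep most of the depth cheap; the guaranteed trailing "L" layer ensures the final representation has seen the whole window.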
On top of vanilla GQA we layer Differential / Clustered Sparse Attention for the long-context layers. The current DSA (nanochat/sparse_attention.py) is functional on GPU but "does not work on TPU - the indexer (top-k block selection) is completely disabled on XLA/TPU, falling back to local-window-only attention via Splash" (architecture_and_eval_en.md §1 Phase 1.5). The replacement is a custom Pallas kernel described in the same section: a three-phase pipeline of importance scoring (q @ K_compressed -> softmax -> top-k), query-tile union selection ("256 adjacent queries share >90% of selected blocks"), then sparse attention with online softmax. Tile sizes are hardware-aligned: Bq=256, l'=256, H=128, Bk=1024, all sized to the TPU v6e MXU 256x256. With top_n=8 this yields "~8-32 active tiles instead of 128 total -> theoretical 64x speedup". This is what lets a 6B specialist handle the 64K-token Context Graphs from v4_architecture.md at training time without quadratic blow-up.
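The first two phases of that pipeline can be sketched in a few lines of numpy. Shapes and names are illustrative stand-ins for the Pallas kernel, not its actual code:

```python
import numpy as np

def select_kv_tiles(q_tile, k_compressed, top_n=8):
    """Phase 1-2 sketch: score compressed KV blocks per query (q @ K_compressed
    -> softmax -> top-k), then take the union of per-query selections so the
    whole query tile shares one block list."""
    scores = q_tile @ k_compressed.T                      # (Bq, num_blocks)
    scores -= scores.max(axis=-1, keepdims=True)          # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    per_query = np.argpartition(-probs, top_n, axis=-1)[:, :top_n]
    return np.unique(per_query)                           # union over the query tile

rng = np.random.default_rng(0)
Bq, d, num_blocks = 256, 128, 128
tiles = select_kv_tiles(rng.standard_normal((Bq, d)), rng.standard_normal((num_blocks, d)))
# Union is bounded by num_blocks; with real (correlated) adjacent queries it
# stays near top_n, which is where the "~8-32 active tiles" figure comes from.
print(len(tiles), "of", num_blocks, "tiles active")
```

Phase 3 (sparse attention with online softmax over the selected tiles) is where the actual FLOP savings land; this sketch only shows why the selection cost is amortized across a 256-query tile.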
The GPTConfig dataclass (nanochat_cpp_model.md) also exposes the sparse-attention switches: dsa_enabled, dsa_start_layer=7, dsa_top_k_ratio=0.5, dsa_local_window=128, dsa_indexer_heads=16, dsa_indexer_dim=32. In the ensemble, DSA is turned on only above layer 7; the lower layers stay dense because early layers are where most of the local syntactic structure is resolved and sparsity tends to cost more than it saves there.
3.2 Mamba-3 half
Mamba-3 layers replace the attention block in roughly half the depth positions. The configuration surface is defined in cppmega/features/mamba3/config.py:
```python
@dataclass(frozen=True)
class AuthorMamba3Config:
    d_model: int
    d_state: int
    expand: int
    headdim: int
    ngroups: int
    rope_fraction: float = 0.5
    dt_min: float = 0.001
    dt_max: float = 0.1
    dt_init_floor: float = 1e-4
    A_floor: float = 1e-4
    is_outproj_norm: bool = False
    is_mimo: bool = False
    mimo_rank: int = 4
    chunk_size: int = 64
```
The Megatron bridge in the same file enforces hidden_size * expand being divisible by mamba_head_dim and rejects any custom mamba_num_heads override that would disagree with hidden_size * expand // mamba_head_dim. This is deliberately narrow: the wrapper is the only legitimate way to configure Mamba-3 inside cppmega, so we do not silently drift from the author-reference kernel.
The interesting flag is is_mimo. When set, the Mamba-3 layer uses the MIMO (Multi-Input Multi-Output) state-space formulation, with a learned psi tensor of shape (H, R, P) (for our NAM56R configuration: H=16, R=4, P=64, documented in mamba3_mimo_p2_psiv_cache_design.md §1). MIMO is what gives each specialist a compressed recurrent view of the entire 64K window at O(N) cost, complementing the sparse-but-exact attention layers. MIMO is enabled on the long-range specialists (Core Language, Systems, Legacy Interop) where multi-file dependency chains dominate; the short-range specialists (Build Systems, Embedded) use SISO Mamba-3 plus DSA and leave the MIMO capacity on the table.
3.3 Layer interleaving
The block stack alternates Mamba-3 and GQA + optional DSA layers. Each block also has two optional branches, both defined in nanochat_cpp_model.md (class Block, class GPTConfig):
- Engram (engram_enabled, engram_layers, engram_ngram_orders="2,3,4") - the static n-gram branch that, per architecture_and_eval_en.md §1, "offloads static C++ syntax/N-grams to DRAM, saving GPU FLOPs". Engram runs in parallel to the mixer and its output is summed into the residual. In the ensemble, Engram is enabled on every specialist but with per-domain n-gram tables - the Build Systems expert has a very different 4-gram prior than the Graphics + SIMD expert.
- mHC (Manifold-Constrained Hyper-Connections) (mhc_enabled, mhc_num_branches, mhc_sinkhorn_iters=5, mhc_temperature, mhc_epsilon, mhc_blend_alpha) - the residual-stream expansion described in architecture_and_eval_en.md §1 as "expands residual stream capacity without heavy parameter bloat" and, crucially for the MoE specialists, "naturally prevents routing collapse (all tokens going to one expert) without the need for complex auxiliary load-balancing losses". Per-layer scalars resid_lambdas and x0_lambdas are also learned and saved in the checkpoint (see _patch_missing_keys in nanochat_cpp_model.md).
The residual equation in a block is therefore roughly x' = resid_lambda * x + x0_lambda * x0 + mixer(x) + engram(x), where mixer is either GQA/DSA or Mamba-3 and engram is optional. mHC, when enabled, replaces the scalar-weighted sum with a Sinkhorn-normalized branch mixer across mhc_num_branches parallel paths.
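In the scalar-lambda (non-mHC) case, that equation reads as follows. This is a numpy sketch; mixer and engram are stand-in callables, not the real modules:

```python
import numpy as np

def block_residual(x, x0, mixer, engram, resid_lambda, x0_lambda):
    """x' = resid_lambda * x + x0_lambda * x0 + mixer(x) [+ engram(x)]"""
    out = resid_lambda * x + x0_lambda * x0 + mixer(x)
    if engram is not None:
        out = out + engram(x)   # Engram branch runs in parallel, summed into the residual
    return out

x0 = np.ones(4)        # embedding-stream input, reinjected via x0_lambda at every block
x  = 2.0 * np.ones(4)  # current residual stream
y = block_residual(x, x0, mixer=lambda t: 0.5 * t, engram=None,
                   resid_lambda=0.9, x0_lambda=0.1)
print(y)  # each element = 0.9*2 + 0.1*1 + 0.5*2 = 2.9
```

The x0 term is what makes the learned x0_lambdas scalars meaningful: every block can re-read the embedding stream directly rather than only through the accumulated residual.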
Multi-token prediction (mtp_enabled, mtp_lambda=0.3) is the DeepSeek-V3-style auxiliary head and is turned on during pretraining for all specialists; at inference it is used only for speculative decoding.
4. Expert specialization inside a specialist
Phase 2 of architecture_and_eval_en.md specifies the sparse expert layout: "Instead of 8 large experts, we use 64 tiny experts, activating the Top-4 or Top-6 per token. This exponentially increases the combinatorial knowledge capacity without increasing inference latency." A fine-grained expert is therefore small enough that eight specialists x 64 experts gives us 512 experts in the ensemble, each a few million parameters, each selectable in combinations.
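The "combinatorial knowledge capacity" claim is easy to quantify: with 64 experts and Top-4 routing, the number of distinct expert subsets available per token per MoE layer is C(64, 4):

```python
import math

experts_per_layer, top_k = 64, 4
combos = math.comb(experts_per_layer, top_k)
print(combos)                    # 635376 distinct Top-4 subsets per MoE layer

# Fleet-wide routed-expert count across the eight specialists:
print(8 * experts_per_layer)     # 512
```

Eight large experts with Top-2 routing would offer only C(8, 2) = 28 subsets; the fine-grained split buys four orders of magnitude more combinations at the same active-parameter budget.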
On top of that: "1 'always-on' expert per MoE layer to handle common C++ syntax and generic structural semantics, allowing routed experts to highly specialize (e.g., templates, multithreading, macros)." The shared expert soaks up generic syntax so the routed experts do not waste parameters on the same { } ; template<...> scaffolding every specialist sees.
Two synergies from the same document are worth being explicit about:
- Engram reduces the required size of the shared expert (ablation #11 in architecture_and_eval_en.md §4). Engram already soaks up the static n-gram structure; the shared expert only has to cover what n-grams cannot.
- mHC prevents routing collapse, allowing aux-free training (ablation #12). Standard MoE needs a load-balancing auxiliary loss to stop the router collapsing onto one expert; with mHC expanding the residual stream, collapse is suppressed structurally and aux_loss_weight stays at 0.
Routing is DeepSeek-style sigmoid gating (ablation #10 in the same doc) with Top-4 default and Top-6 available for harder queries.
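A toy version of sigmoid-gated Top-k routing with an always-on shared expert. This numpy sketch is one plausible reading of the DeepSeek-style scheme (in particular the renormalization over selected gates), not the cppmega implementation:

```python
import numpy as np

def moe_forward(x, routed_experts, shared_expert, router_w, top_k=4):
    """Sigmoid-gate every routed expert, keep the Top-k, renormalize their
    gates, and unconditionally add the shared expert's output."""
    logits = router_w @ x                          # (num_experts,)
    gates = 1.0 / (1.0 + np.exp(-logits))          # sigmoid gating: no softmax coupling
    top = np.argpartition(-gates, top_k)[:top_k]   # indices of the top_k gates
    weights = gates[top] / gates[top].sum()        # renormalize over selected experts
    y = shared_expert(x)                           # always-on: generic C++ scaffolding
    for w, i in zip(weights, top):
        y = y + w * routed_experts[i](x)
    return y

rng = np.random.default_rng(0)
d, n_experts = 16, 64
experts = [lambda t, W=rng.standard_normal((d, d)) / d: W @ t for _ in range(n_experts)]
shared = lambda t, W=rng.standard_normal((d, d)) / d: W @ t
out = moe_forward(rng.standard_normal(d), experts, shared, rng.standard_normal((n_experts, d)))
print(out.shape)  # (16,)
```

Because each gate is an independent sigmoid rather than one softmax over all experts, raising one expert's score does not suppress the others, which is part of why sigmoid gating pairs well with aux-free load balancing.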
5. Why this beats a 70B generalist on C++
Three arguments, each measurable.
Parameter efficiency. A 70B dense model activates 70B parameters per token. A specialist in the ensemble activates 0.8B-1.6B per token. Even if you sum two specialists on a hard cross-domain prompt (say, templates + SIMD), peak active parameters stay under 3B. On the same hardware budget, we can train each specialist well past Chinchilla optimality on its own slice - architecture_and_eval_en.md §1 calls for "100B+ tokens" on a sub-1B Phase-1 model. The effective tokens-per-parameter ratio on the target distribution is multiple orders of magnitude higher for the ensemble than for the generalist.
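The memory side of this argument is simple arithmetic (bf16 = 2 bytes per parameter; activations and KV cache excluded):

```python
BYTES_PER_PARAM = 2                       # bf16

dense_70b_weights_gb = 70e9 * BYTES_PER_PARAM / 1e9
print(dense_70b_weights_gb)               # 140.0 GB of weights alone

# Worst-case ensemble path: two specialists active on one cross-domain prompt,
# using the active-parameter counts from the list in §2 (templates + SIMD).
peak_active_b = 1.6e9 + 1.2e9
print(peak_active_b / 1e9)                # 2.8 billion active parameters
```

140 GB of weights before a single activation is allocated is what pins the 70B generalist to H100/H200-class hardware; 2.8B active parameters is what lets the ensemble's worst case stay on a single commodity card.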
Attention + state-space complementarity. A 70B generalist is almost always attention-only, so 64K-token C++ Context Graphs cost O(N^2). The hybrid layer stack replaces half the attention with Mamba-3 MIMO at O(N) and makes the remaining attention block-sparse via DSA, with "~8-32 active tiles instead of 128 total -> theoretical 64x speedup" (architecture_and_eval_en.md §1 Phase 1.5). The mamba3_mimo_p3_register_split_design.md profile on H200 captures the raw kernel budget that this buys us: mamba_mimo_fwd 1192 ms, mamba_mimo_bwd_fwd 1034 ms, mamba_mimo_bwd_bwd 2110 ms per step on a sizeable batch - slow enough that P3 proposes a two-kernel split to cut bwd_bwd register pressure from 255 to ~130 and double occupancy, and mamba3_mimo_p2_psiv_cache_design.md proposes an intra-step PsiV cache for another +1.5-2.3% TFLOP/s. These are the kinds of optimizations you can only do when you own the kernel, which is possible because the specialists are small enough to be kernel-tuned end-to-end.
Targeted training data. The ensemble's training data is domain-partitioned Context Graphs, each a "Callers -> Target -> Callees" subgraph with a 64K-token budget, extracted by v4_context_graph from 27.6M historical commits (v4_architecture.md). A generalist sees some of this data, diluted. A specialist sees only this data, for its domain, with cross-file structure preserved. Evaluation using Gemini 3.1 Pro as an expert C++ reviewer (architecture_and_eval_en.md §3) grades on "Context Adherence (Did it use the provided Callee functions?)" and "Hallucination rate (Did it invent non-existent APIs?)" - both metrics where training on bounded context graphs pays off directly and where a generalist's breadth actively hurts.
Operational wins. A single specialist fits on a consumer or edge GPU - the Phase-1 sub-1B baseline targets "<1GB VRAM" and the Phase-2 5B/0.8B-active design sits comfortably on a 16-24 GB card. The ensemble fans out horizontally; routing is essentially a cheap classifier in front of eight model servers. If one specialist regresses we retrain one specialist, not the entire 70B monolith. Ablation #14 in architecture_and_eval_en.md §4 ("MoE Layer Frequency: MoE on every layer vs. MoE on every alternate layer") is run per-specialist, so the Embedded expert can run MoE-every-other-layer for memory while the Core Language expert runs MoE-every-layer for capacity.
6. What we are not claiming
We are not claiming the ensemble beats a 70B generalist on arbitrary tasks. It will lose on open-domain trivia, natural-language chat, and anything outside C++ systems programming. That is the deal: we trade breadth for depth, and the ensemble is narrow by construction.
We are also not claiming the Mamba-3 optimizations are fully deployed. Per mamba3_mimo_p1_notes.md, P1 (TMA + warp-spec) is currently default OFF and the selective-fwd variant measured a 0.006% throughput delta on bench3 ("wash" - does not ship). The TMA layout fix on branch tma-layout-fix-3d-to-2d is GB10-correctness-verified but H200 perf is still pending. P2 PsiV cache and P3 register split are design-only. The architectural story above describes the shape of the system; the kernel-level performance is a moving target and the design docs are the ground truth for its current state.
7. Summary
Eight specialists, 4-8B total and 0.8-1.6B active each, hybrid Mamba-3 + GQA + DSA backbone, fine-grained MoE with one shared expert per layer, Engram for static n-grams, mHC for residual expansion and routing stability, trained on 64K-token v4 Context Graphs and evaluated with an LLM-as-a-judge over real cross-file C++ prompts. The ensemble is not a marketing configuration; it is what falls out of taking C++ seriously as a distribution and taking inference cost seriously as a constraint.