Training Recipe for the MegaCpp SLM Ensemble
How we pretrain, fine-tune and deploy a small-language-model ensemble for C++ codegen: Muon-AdamW hybrid, FP16 training, NVFP4 inference, 100-200B tokens per specialist, and a staged context-length curriculum.

The MegaCpp ensemble is not a single monolith. It is a small swarm of specialist SLMs (a 400M C++ generator, a 100M CMake specialist, the 270M FunctionGemma router, and a heavier 4.73B NAM56R hybrid backbone for deep reasoning) trained against a shared tokenizer and a shared evaluation harness. This post documents the production training recipe end to end: the optimizer split, the precision strategy from FP16 forward through NVFP4 inference, the per-specialist token budget, and the curriculum that takes a freshly initialised model from 4K file-level syntax to 64K repository-level reasoning. Every claim below is grounded in the engineering notes that drove the run.
Architecture Floor
Before any tokens are spent, every specialist inherits the same architectural floor. Attention is Grouped Query Attention with n_kv_heads = num_heads // 4, not Multi-Head Attention; the previous default of num_kv_heads = num_heads blew up KV-cache memory and made the long-context curriculum mathematically infeasible on TPU HBM ([training_plan_en.md], [training_review.md]). Input and output embeddings are tied, recovering 15-20% of parameter count on the small specialists at no measured quality cost ([training_review.md]). QK-Norm is removed - it destroys the dot-product magnitude information the model needs for long-context retrieval ([training_plan_en.md]) - and replaced with Gemma-style logit softcapping at 30.0 on the LM head and 50.0 on attention logits ([training_review.md]). Sequence packing is mandatory, but only with intra-document masking via cu_seqlens plumbed into Flash Attention; without it, cross-document contamination scales quadratically with context length and silently destroys the model's ability to isolate logic ([training_plan_en.md]).
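For concreteness, the softcap is just a scaled tanh; a minimal sketch in plain Python (standing in for the fused kernel), showing why it bounds logits smoothly where a hard clamp would kill gradients:

```python
import math

def softcap(x: float, cap: float) -> float:
    # Gemma-style soft cap: smoothly bounds a logit to (-cap, cap)
    # while keeping the gradient nonzero, unlike a hard clamp.
    return cap * math.tanh(x / cap)

# The recipe's two caps: 30.0 on LM-head logits, 50.0 on attention logits.
lm_logit = softcap(1000.0, 30.0)     # saturates just below 30.0
attn_logit = softcap(-1000.0, 50.0)  # saturates just above -50.0
small = softcap(1.0, 30.0)           # near-identity for in-range logits
```

For small logits the cap is a near-identity (tanh(x/cap) ≈ x/cap), so in-distribution training dynamics are unchanged; only outliers are squashed.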
The 4.73B NAM56R backbone builds its hybrid stack on top of that floor: 27 Mamba3 mixers, 13 Attention blocks (9 DSA + 4 MLA), and 12 MoE experts ([fp8_optimization_session_2026_04_13.md]). The smaller specialists use the AAM (Attention-Attention-Mamba) hybrid pattern with Mamba-3 carrying the linear-time long-context scan and Attention acting as a dense routing node ([training_plan_en.md]). On the SLMs we additionally enable Engram conditional memory (a DRAM-resident lookup that offloads factual C++ syntax from the compute stream) and mHC manifold-constrained hyper-connections with n=4 Sinkhorn-Knopp normalised branches to widen the residual stream without paying for dense scaling ([training_plan_en.md]).
Optimizer: Muon + AdamW Hybrid
Every specialist trains under a strict Muon + AdamW split. Muon owns the 2D weight matrices where its sample efficiency pays off; AdamW owns everything else - 1D tensors, biases, embeddings, RMSNorm gains, the LM head when untied, and any MTP-specific parameters ([training_review.md], [nam56r_mtp_optimization_plan_2026_04_11.md]). The split is non-negotiable: putting MTP parameters under Muon triggers a hard assertion in the optimizer (we hit it four times during FlashAdamW integration before adding the explicit Muon exclusion fix on bench3, [fp8_optimization_session_2026_04_13.md]).
Weight decay is a parameter-group property, not a global flag. The base trainer correctly assigns weight_decay=0.0 to all non-2D parameters via AdamW; the post-training scripts originally did not, and torch.optim.AdamW(model.parameters(), weight_decay=config.weight_decay) was decaying embeddings in sft_train.py, rl_train.py and gspo_train.py. Embedding decay shrinks embedding norms, and because the LayerNorm Jacobian scales gradients inversely with input norm, the shrunken embeddings spike gradients in the early layers. Every post-training script now constructs explicit parameter groups with weight_decay=0.0 for 1D tensors, biases, and embeddings ([training_review.md]). Gradient clipping defaults to --max_grad_norm=1.0 everywhere, including the TPU run scripts where it was previously disabled and silently allowed loss-spike recovery to fail ([training_review.md]).
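The optimizer split and the weight-decay grouping are the same predicate applied twice. A sketch of the combined routing rule, operating on (name, shape) metadata; the parameter names below are hypothetical, and the real trainer builds torch.optim parameter groups from the same logic:

```python
def route_params(named_shapes):
    # Returns optimizer groups keyed by destination.
    # Muon takes 2D weight matrices only; everything else (1D tensors,
    # biases, embeddings, and all mtp.* parameters) goes to AdamW,
    # with weight decay applied only to true 2D matmul weights.
    groups = {"muon": [], "adamw_decay": [], "adamw_no_decay": []}
    for name, shape in named_shapes:
        is_2d = len(shape) >= 2
        is_embedding = "embed" in name or "wte" in name  # hypothetical naming
        if name.startswith("mtp."):
            # Hard rule from the notes: MTP params never go under Muon.
            groups["adamw_decay" if is_2d else "adamw_no_decay"].append(name)
        elif is_2d and not is_embedding:
            groups["muon"].append(name)
        else:
            groups["adamw_no_decay"].append(name)
    return groups

groups = route_params([
    ("blocks.0.attn.q_proj.weight", (512, 512)),   # 2D matrix -> Muon
    ("blocks.0.attn.q_proj.bias", (512,)),         # 1D -> AdamW, no decay
    ("wte.weight", (65536, 512)),                  # embedding -> AdamW, no decay
    ("mtp.proj.weight", (512, 1024)),              # MTP -> AdamW, decayed
])
```

The mtp.* branch is exactly the assertion the optimizer enforces; encoding it in the grouping function means the hard failure can never be reached.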
Muon's collective layout matters as much as its math. On TPU v6e-8 the mesh is Mesh(device_ids, (8,), ("data",)) - pure 1D data parallelism, no tensor parallelism for sub-1B specialists - and the global token batch is held constant at total_batch_size = 524,288 (~0.5M tokens/step) so that Muon's all-to-all collectives stay within the topology's bandwidth envelope ([training_plan_en.md], [training_run_plan_en.md]). Micro-batch size shrinks as context length grows, and gradient_accumulation_steps is dynamically scaled to preserve the global batch across every context-extension phase ([training_plan_en.md]).
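The accumulation arithmetic can be sketched as follows; the 4K case matches the plan's numbers, and the second case is a hypothetical illustration of how accumulation scales inversely with micro-batch:

```python
def grad_accum_steps(global_batch_tokens, micro_batch, seq_len, n_chips):
    # Tokens processed per micro-step across the whole mesh; accumulation
    # makes up the difference to keep the global token batch constant.
    tokens_per_micro_step = micro_batch * seq_len * n_chips
    assert global_batch_tokens % tokens_per_micro_step == 0
    return global_batch_tokens // tokens_per_micro_step

# 4K phase on v6e-8: micro-batch 16 per chip already fills the 0.5M batch.
accum_4k = grad_accum_steps(524_288, 16, 4_096, 8)     # 1
# Hypothetical shape: halve tokens per micro-step, accumulation doubles.
accum_alt = grad_accum_steps(524_288, 2, 16_384, 8)    # 2
```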
Precision Strategy: FP16 Train, NVFP4 Inference
The training story is FP16-class precision (BF16 in practice for the H200 path, mixed FP16/BF16 with FP32 master weights for the smaller TPU runs); the inference story is NVFP4. We tried more aggressive options and most of them lost. TransformerEngine FP8 GEMMs were measured exhaustively on the NAM56R 4.73B model on H200x8: the BF16 baseline at MBS=6, GBS=48 is 158 TFLOP/s and FP8 at the same shape lands at 158 TFLOP/s. FP8 at MBS=4 GBS=32 is 154; at MBS=4 GBS=64 it is 156; with --fp8-param-gather it saves 5 GiB of weight memory but holds the same speed ([fp8_optimization_session_2026_04_13.md]). The root cause is that GEMMs are only 23.5% of compute on a hybrid Mamba model (SSM is 27.5%, elementwise is 14.7%), and FP8's amax quantize/dequantize plus history overhead applies on every GEMM. TE additionally keeps a BF16 master copy alongside the FP8 weights, costing +8 GiB; --fp8-param-gather removes the BF16 copy but --use-precision-aware-optimizer is incompatible with it (Int16 vs FP32 assertion). The conclusion is recorded as a hard rule in the optimization log: "TE FP8 GEMMs at --fp8-format hybrid = net zero or net loss for hybrid Mamba model. Do not use for NAM56R." ([fp8_optimization_session_2026_04_13.md]).
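The 23.5% GEMM fraction makes this an Amdahl's-law story; a quick bound, under the assumption that only the GEMM fraction accelerates:

```python
# Amdahl's-law bound on FP8 for a hybrid Mamba model, using the measured
# compute split: GEMMs 23.5% (SSM 27.5%, elementwise 14.7%, rest other).
GEMM_FRAC = 0.235

def end_to_end_speedup(gemm_speedup: float) -> float:
    # Only the GEMM fraction accelerates; everything else is unchanged.
    return 1.0 / ((1.0 - GEMM_FRAC) + GEMM_FRAC / gemm_speedup)

best_case = end_to_end_speedup(2.0)    # ideal 2x FP8 GEMMs -> ~1.13x overall
upper_bound = end_to_end_speedup(1e9)  # even free GEMMs -> ~1.31x ceiling
```

A ceiling of ~13% from realistic 2x GEMMs is exactly the headroom that per-GEMM amax quantize/dequantize and history bookkeeping then eats, which is why the measurement lands at net zero.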
The selective FP8 path - monkey-patching megatron.core.fp8_utils.get_fp8_context to apply FP8 only to the 22 MoE layers while keeping Mamba and Attention in BF16 - is written (cppmega/megatron/selective_fp8_moe_patch.py, 134 lines, gated by CPPMEGA_SELECTIVE_FP8_MOE=1, requires --fp8-recipe tensorwise) but not yet a production default; the first tests showed the same 158 TFLOP/s because the env var did not propagate, and CG conflicts consumed the remaining session budget ([fp8_optimization_session_2026_04_13.md]). Per-mixer correctness for FP8 is, however, established for every Mamba3 path we ship: smoke runs at seq_length=4096, PP=4, MBS=2, GBS=16 confirm Path A (mamba3_te_stack_spec), Path B (Author SISO 6/7), Path C (Author MIMO R=4, unblocked 2026-04-11 by patching the TileLang mamba_mimo_bwd_combined G<H reduction), and Path D (nam56r_noconv_spec) all converge cleanly under --fp8-format hybrid --fp8-amax-history-len 16 --fp8-amax-compute-algo max ([fp8_path_status.md]). The two precautionary guards - AuthorMamba3Mixer currently supports non-fp8, non-fp4 runs only and the sibling guard in m2rnn_spec.py - have been removed and TE's FP8 wrap of TELayerNormColumnParallelLinear / TERowParallelLinear handles fp32 bias/parameter tensors cleanly ([fp8_path_status.md]).
The golden production training config on H200x8 is therefore not FP8 at all. It is BF16, MBS=8, Liger fused cross-entropy, no CUDA graphs, which sustains 265 TFLOP/s and 84k tok/sec (27% MFU), with MBS=10 reaching 268 TFLOP/s under tight 128 GiB memory ([fp8_optimization_session_2026_04_13.md]). CUDA graph capture of the TE backward path measured a -15% throughput regression and is disabled. The 158→265 TFLOP/s recovery came from a single observation: dense_indexer_loss_fallback was a 37% overhead, eliminated by CPPMEGA_DSA_INDEXER_LOSS_COEFF=0 ([fp8_optimization_session_2026_04_13.md]).
Inference is a different precision regime. NVFP4 is the deployment target for the SLM ensemble - the 4-bit micro-scaled format gives us roughly 4x weight compression and the latency profile we need for the agent orchestrator to stay interactive across CodeGen, CMakeGen, FunctionGemma routing, and the GKE sandbox round-trip ([TRAINING_PLAN.md]). Training in BF16/FP16 with carefully bounded logit ranges (the 30/50 softcap pair) keeps post-training NVFP4 quantization in a regime where calibration converges quickly and the inference stack does not need quantization-aware training. Modal B200 benchmarking of the DSA indexer found FP8 11.4% slower than BF16 on the indexer shapes (too small for FP8 amortization), and FP4 was not testable on TE 2.1 because the FP4 API is not exposed there ([nam56r_mtp_optimization_plan_2026_04_11.md]) - so the precision split (BF16 train, NVFP4 inference) is also a pragmatic answer to the current toolchain.
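To illustrate what micro-scaled 4-bit quantization does to a weight block, here is a pedagogical sketch of snapping values to the FP4 (E2M1) grid with a per-block scale. This is not NVIDIA's exact NVFP4 format (real NVFP4 uses FP8 block scales plus a second-level tensor scale); it only shows why bounded, softcap-conditioned weight ranges make calibration converge quickly:

```python
# Magnitudes representable by FP4 E2M1 (sign handled separately).
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    # Per-block scale maps the block's absolute max onto E2M1's max (6.0),
    # then every value snaps to the nearest representable grid point.
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    def snap(x):
        mag = min(E2M1, key=lambda v: abs(abs(x) / scale - v))
        return (mag if x >= 0 else -mag) * scale
    return [snap(x) for x in block]

w = [0.9, -0.33, 0.05, 2.4]
w_q = quantize_block(w)  # each value lands on a scaled E2M1 grid point
```

With only 8 magnitudes per block, a single outlier stretches the scale and crushes everything else to zero; the 30/50 softcaps trained into the model are what keep per-block ranges tight enough for this grid to be usable without quantization-aware training.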
Token Budget: 100-200B per Specialist
Each specialist gets 100-200B training tokens, scaled to its parameter count and to the staged context curriculum below. The d24 production target (~877M parameters) is sized for a 50K-step Stage 2 base run at GBS 524,288 tokens/step, which is ~26B tokens at 4K context, followed by 16K and 64K context-extension phases that consume the remaining budget at lower micro-batch sizes ([training_run_plan_en.md]). The d16 400M generator runs the same staged budget at higher iteration count (50000 base iterations under the v3 plan, ~26B tokens at 4K) before being extended ([TRAINING_PLAN.md], [training_run_plan_en.md]). Cost is bounded: Stage 1 d16 5K-step validation at ~$30, Stage 2 d24 50K-step base at ~$1,800, Stage 3 16K+64K context extension at ~$1,100, total Phase 1-3 budget ~$2,930 at $15/hour on-demand v6e-8 ([training_run_plan_en.md]).
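The budget arithmetic checks out directly:

```python
# Stage 2 base run: token count and the Phase 1-3 cost total from the plan.
gbs_tokens = 524_288            # global batch, tokens per optimizer step
base_steps = 50_000             # Stage 2 base run at 4K context
base_tokens = gbs_tokens * base_steps
assert base_tokens == 26_214_400_000   # the plan's "~26B tokens"

stage_costs_usd = {
    "stage1_d16_validation": 30,
    "stage2_d24_base": 1_800,
    "stage3_context_extension": 1_100,
}
total_usd = sum(stage_costs_usd.values())
assert total_usd == 2_930              # the "~$2,930" Phase 1-3 total
```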
The dataset substrate for the small generators is treesitter_compilable_16k (34 files, 11.6 GiB, 18 shards of ~360 MB each plus a _COMPLETE sentinel), tree-sitter chunked with compilable ordering ([training_run_plan_en.md]). The base mix is 70% code (single_func, class_block from chunk_cpp_data.py), 20% text/docs, 10% math at the 4K syntax stage ([training_plan_en.md]). On top of that, base pretraining splits the code stream into 40% raw next-token prediction, 40% random FIM, and 20% structured FIM - the structured slice trains docstring-prefix to function-body-middle infilling, which is the primary surface our agent uses ([TRAINING_PLAN.md]).
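A toy sampler showing how the two mixes compose, a 70/20/10 domain split with a 40/40/20 objective split inside the code stream; the category labels are hypothetical, not the pipeline's real field names:

```python
import random

DOMAINS = {"code": 0.70, "text_docs": 0.20, "math": 0.10}
CODE_OBJECTIVES = {"ntp": 0.40, "random_fim": 0.40, "structured_fim": 0.20}

def sample_example(rng: random.Random):
    # First pick the domain, then (for code only) the training objective.
    domain = rng.choices(list(DOMAINS), weights=list(DOMAINS.values()))[0]
    if domain != "code":
        return domain, "ntp"  # non-code streams train plain next-token prediction
    obj = rng.choices(list(CODE_OBJECTIVES),
                      weights=list(CODE_OBJECTIVES.values()))[0]
    return domain, obj

rng = random.Random(0)
draws = [sample_example(rng) for _ in range(5000)]
code_frac = sum(d == "code" for d, _ in draws) / len(draws)  # ~0.70
```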
Curriculum: 4K → 16K → 64K
The progressive context curriculum is the spine of the recipe. Step 3.1 trains syntax mastery at 4K with rope_theta=10,000, micro-batch 16 per chip (128 global), accumulation 1, standard NTP plus FIM at rate 0.5 ([training_plan_en.md]). Anti-ossification runs in this phase only: ReDo recycles dead neurons under tau < 0.025 and DASH applies direction-aware shrinking to prevent feature overfitting ([training_plan_en.md]).

When loss plateaus on 4K we transition to Step 3.2: 16K context with a 70/30 mix of full_file/func_chain against single_func to prevent catastrophic forgetting of short-range syntax, micro-batch 4 per chip (32 global), accumulation 4, rope_theta rescaled to 500,000 ([training_plan_en.md]). Context extension is not a free hyperparameter swap: we trigger FIRE (Frobenius-Isometry REinitialization) on mode='context_extension', which orthogonalizes the Q/K projection matrices and resets optimizer momentum. This surgery is the difference between a model that adapts to new token distances and one that silently degrades ([training_plan_en.md]).

Step 3.3 lifts to 64K repo-level: 50% cpp_dep_aware (repository dependency graphs), 30% 16K files, 20% 4K chunks, micro-batch 1 per chip (8 global), accumulation 16, rope_theta rescaled to 1,000,000 with YaRN, and FIRE triggered one final time on Q/K ([training_plan_en.md]). Sparse attention via the custom Pallas kernel in experiments/sparse_pallas/ is the path to 128K - importance scoring, union selection over Bq=256 query blocks, online softmax over 8-32 selected tiles instead of the 64-128 total - measured at a theoretical 64x FLOP saving at 128K with top_n=8 ([training_plan_en.md]).
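Collected as a config table (an illustrative dataclass, not the trainer's actual schema), the three phases look like:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phase:
    seq_len: int
    rope_theta: float
    micro_batch_per_chip: int
    grad_accum: int
    fire_on_entry: bool  # FIRE Q/K re-orthogonalization + momentum reset

CURRICULUM = {
    "3.1_syntax_4k": Phase(4_096,  10_000,    16, 1,  False),
    "3.2_file_16k":  Phase(16_384, 500_000,   4,  4,  True),
    "3.3_repo_64k":  Phase(65_536, 1_000_000, 1,  16, True),
}
```

Two monotone trends fall out of the table: rope_theta grows with context so rotary phases stay resolvable at long distances, and per-chip micro-batch shrinks 16→4→1 as activation memory per sequence grows.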
Post-Training: SFT then GSPO
SFT runs at 64K with micro-batch 1 per chip, on a high-quality instruction-following mix in V3/V4 pre/post-diff format from prepare_mixed_sft.py, with a small percentage of repo-level FIM held in to preserve long-context generation ([training_plan_en.md]). The SFT scheduler is the costliest bug on our record: --warmup_ratio=0.1 against ~1.4M total SFT steps kept the LR climbing for 142,000 steps and produced the int& int& repetition-loop mode collapse observed at 100K steps. The fix is an absolute --warmup_steps=500, not a ratio ([training_plan_en.md]). ReDo, DASH, and FIRE are all strictly disabled during SFT; their weight surgery shreds the fragile instruction-following behavior that fine-tuning builds ([training_plan_en.md]). The SFT corpus combines diff_sft.jsonl (60k PR/MR repairs), docstring_sft.jsonl, and pass_k_sft.jsonl (multi-solution variations); pairs flagged with large reward differences and high code similarity provide the strongest signal, per Anchored Preference Optimization (ICLR 2025) ([TRAINING_PLAN.md]).
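The scheduler bug reduces to one line of arithmetic, here shown with the plan's approximate ~1.4M-step total (the notes report 142K warmup steps, presumably from the exact run length):

```python
# A ratio silently scales warmup with run length; an absolute count does not.
total_sft_steps = 1_400_000

warmup_from_ratio = int(0.1 * total_sft_steps)  # 140,000 steps of rising LR
warmup_absolute = 500                           # the fix: --warmup_steps=500

# The mode collapse surfaced at 100K steps, still deep inside the
# ratio-derived warmup window, so the LR was still climbing at that point.
collapse_step = 100_000
still_warming_up = collapse_step < warmup_from_ratio  # True
```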
RL alignment is GSPO (Group Sequence Policy Optimization), not GRPO. The reward stream is verifiable execution: prompts are sampled from humaneval_cpp.jsonl and custom_problems.jsonl, candidate solutions are generated under pass@k, and the GKE Agent Sandbox executes them in a hermetic environment to produce binary correctness rewards ([TRAINING_PLAN.md]). Stage targets are explicit: CodeBLEU > 0.3 / pass@1 > 0.05 after base, > 0.4 / > 0.15 after SFT, > 0.5 / > 0.25 after GSPO, and end-to-end agent success rate > 0.4 ([TRAINING_PLAN.md]).
MTP and the 19% Tax
The 4.73B backbone trains with Multi-Token Prediction at depth 1 because MTP improves both training signal and inference-time speculative decoding, but only when its overhead stays in the published 3-5% range. Our VPP PP=2 VPP=2 MBS=4 GBS=64 baseline measured 1963 ms/iter, 112,152 tok/sec, with MTP adding 374 ms/iter (19.1%) - 2-10x the published numbers from Meta (~0% with sequential detach) and DeepSeek-V3 (~2-5% on a 61-layer model) ([nam56r_mtp_optimization_plan_2026_04_11.md]). Six research agents converged on three causes. First, the launch script passed --untie-embeddings-and-output-weights, which gave the MTP head its own [3584 × 65536] ≈ 235M-parameter output projection plus the corresponding optimizer state; DeepSeek-V3 explicitly ties embedding and output head with the main model, and the fix is a one-line removal because Megatron defaults to tied ([nam56r_mtp_optimization_plan_2026_04_11.md]). Second, MTP was appended to the last main pipeline chunk, fattening it from 460 ms to 834 ms and stalling 1F1B for every micro-batch; promoting MTP to a standalone VPP chunk lets interleaved 1F1B overlap MTP forward with main backward across micro-batches and recovers ~300 ms/iter (projected ~157,700 tok/sec) ([nam56r_mtp_optimization_plan_2026_04_11.md]). Third, process_mtp_loss() materialises the full [B×S, V] logits tensor; Liger-Kernel's fused_linear_cross_entropy chunks the matmul and computes local log-sum-exp without storing logits, which is 30-50% of MTP cost per Meta's paper ([nam56r_mtp_optimization_plan_2026_04_11.md]). The DeepSeek λ schedule (0.3 for the first 10T tokens, 0.1 for the final 4.8T) is also adopted, matching the NeMo DeepSeek-V3 recipe and Megatron-Core defaults ([nam56r_mtp_optimization_plan_2026_04_11.md]). 
The reference implementation is in our sister project's nanochat/mtp.py: shared block (one Linear(2D→D) plus one Transformer block recursed K times, not K separate blocks), weight-tied wte/lm_head passed into forward, roll-and-mask static shapes to prevent K-way graph recompile, activation checkpointing inside the K-loop, fused linear+CE per depth, cadence scheduling so λ=0 skips the MTP forward entirely, and mtp.* params routed to AdamW (never Muon) ([nam56r_mtp_optimization_plan_2026_04_11.md]).
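A minimal sketch of that shape: one shared (Linear(2D→D) + block) recursed K times with a tied output head. The TinyBlock-style stand-in below replaces the real transformer block, and the masking of the wrap-around positions produced by the roll is omitted; nanochat/mtp.py additionally adds activation checkpointing and fused linear+CE:

```python
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    """Depth-K multi-token prediction with ONE shared block recursed K
    times (not K separate blocks) and a weight-tied output projection."""
    def __init__(self, d_model: int, depth: int):
        super().__init__()
        self.depth = depth
        self.proj = nn.Linear(2 * d_model, d_model)  # fuse [hidden; next-embed]
        self.block = nn.Sequential(                  # stand-in for a real block
            nn.LayerNorm(d_model), nn.Linear(d_model, d_model), nn.GELU()
        )

    def forward(self, hidden, tok_emb, lm_head_weight):
        # Roll-and-mask: depth k predicts token t+1+k, so shift embeddings
        # left by one per recursion; shapes stay static (no K-way recompile).
        logits_per_depth = []
        h, emb = hidden, tok_emb
        for _ in range(self.depth):
            emb = torch.roll(emb, shifts=-1, dims=1)       # next-token embeds
            h = self.block(self.proj(torch.cat([h, emb], dim=-1)))
            logits_per_depth.append(h @ lm_head_weight.T)  # tied head, no copy
        return logits_per_depth

wte = nn.Embedding(100, 32)
mtp = MTPHead(d_model=32, depth=2)
ids = torch.randint(0, 100, (2, 16))
outs = mtp(torch.randn(2, 16, 32), wte(ids), wte.weight)  # list of 2 logit tensors
```

Passing wte.weight into forward, rather than giving the head its own projection, is exactly the fix for the 235M-parameter untied-head mistake described above.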
Closing the Loop
The recipe is deliberately conservative where conservatism wins (BF16 GEMMs over FP8 on hybrid Mamba), aggressive where measurement supports it (Muon on 2D, NVFP4 at inference, sparse attention via Pallas at 64K+), and explicit about every guardrail (intra-document masking, GQA, gradient clipping at 1.0, weight decay only on 2D, Gemma-style softcaps, embedding tying, FIRE on context extension, surgery-off during SFT, absolute warmup steps). Each specialist - 100M CMake, 270M FunctionGemma, 400M C++ generator, 4.73B NAM56R backbone - moves through the same staged pipeline against the same evaluation harness, and every change to the recipe is bound to a number we have measured, not a number we have wished for ([fp8_optimization_session_2026_04_13.md], [fp8_path_status.md], [training_review.md]).