How We Evaluate the MegaCpp SLM Ensemble Against 70B Generalists
The benchmarks, harness, and cost-per-quality math behind our claim that a focused ensemble of small C++ specialists beats 70B+ generalist models on real C++ work.

Most "our model is better" posts hand-wave the evaluation. We have the opposite problem: our whole thesis — that an ensemble of small C++ specialists beats a 70B generalist on production C++ work — only holds if the measurement holds. So this post is the methodology, not the marketing. It describes what we benchmark, how the harness runs, how we price each quality point, and why, for C++ specifically, a narrow ensemble is a better product than a wide 70B.
It consolidates our internal eval design from the nanochat research stream
(architecture_and_eval_en.md, training_review.md, review_gcp_tpu.md,
speed_rep_xx.md) and the production-side numbers we actually run against on
H200 and GB10 (cppmega/docs/production_status.md,
cppmega/docs/reproducible_runs.md).
What We Are Actually Measuring
A 70B generalist is trained to be acceptable at everything: Python, TypeScript,
SQL, prose, image captions, light C++. "Acceptable at C++" in benchmark terms
usually means it can write a compiling std::vector example and explain what
const does. That is not the job. The job is: given a real translation unit,
with its headers, its macros, its call graph, and its template instantiations,
produce a patch that compiles, links, passes tests, and does not hallucinate
APIs.
Our evaluation is built around that job. Concretely, the nanochat eval design
(architecture_and_eval_en.md, §3) frames code-gen quality as four axes that
perplexity cannot see:
- Compilation probability of the generated diff against the original TU.
- Context adherence — did the model call the callee functions actually present in the provided call graph, or did it invent new ones?
- Hallucination rate — how often the output references non-existent symbols, headers, or overloads.
- Correctness, graded against a held-out test set of cross-file prompt graphs.
Perplexity still matters during pretraining because it is cheap and monotonically useful, but it is not a product metric. A 70B with broad web pretraining can look excellent on MBPP-style single-file prompts and still fail every one of the four axes above on a real C++ repo.
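The four axes, plus the "non-compiling diff caps its score" rule described in the harness section below, fit in a small per-diff record. This is an illustrative sketch, not harness code; the class name, field names, and the 0.25 cap value are ours, chosen only to make the shape concrete:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AxisScores:
    """Per-diff scores on the four axes (names and ranges illustrative)."""
    compile_ok: bool          # did the diff compile against the original TU?
    context_adherence: float  # fraction of ground-truth callees actually used, 0..1
    hallucination_rate: float # references to non-existent symbols, 0..1 (lower is better)
    correctness: float        # judge-graded correctness, 0..1

def capped_correctness(s: AxisScores, cap: float = 0.25) -> float:
    """A non-compiling diff caps its maximum correctness score (cap value hypothetical)."""
    return s.correctness if s.compile_ok else min(s.correctness, cap)
```

The point of keeping the axes separate in the record, rather than collapsing them into one scalar, is that aggregation happens per variant and per seed later, where distributions and Pareto comparisons are computed.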
The Harness: GKE + T4 Inference, Gemini-as-Judge
The harness mirrors what we ship, not a synthetic sandbox. From
architecture_and_eval_en.md §3:
- A GKE NodePool of nvidia-tesla-t4 GPUs is scaled on demand (scripts/infrastructure/gke_t4_eval_pool.sh). T4s are cheap, plentiful, and more than enough for SLM inference — using them instead of H100s collapses eval cost by roughly an order of magnitude per checkpoint.
- Each pod loads a candidate checkpoint (one of our specialists, or a 70B baseline via vLLM) and generates completions against a held-out set of cross-file C++ prompt graphs derived from our v4/v5 Bounded Context Graph dataset (see variant 20 in the ablation list).
- Generated diffs are piped to Gemini 3.1 Pro Preview acting as an expert C++ reviewer. Gemini grades each output on the four axes above. Auth is via GCP Application Default Credentials, so there are no raw API keys in the worker image (eval_worker/evaluate_checkpoint.py).
- Jobs are fanned out as Kubernetes Jobs (eval_worker/eval_job.yaml), one per variant × seed, so we get full per-variant distributions rather than single-point estimates.
LLM-as-a-judge has well-known failure modes — length bias, self-preference,
sycophancy. We mitigate these in three ways. First, every judged prompt is paired
with a ground-truth Callee list extracted by Tree-sitter from the original
repo, so "context adherence" and "hallucination rate" are scored against a
deterministic oracle, not Gemini's taste. Second, the compile axis is run
through an actual C++ frontend (clang with the repo's real build flags) before
the judge ever sees the diff; a non-compiling diff caps its maximum score.
Third, we rotate judges periodically and spot-check with a second model to
detect judge drift between eval runs.
Benchmarks We Actually Run
Three layers, from cheap to expensive:
Layer 1 — perplexity and loss curves on held-out C++. Computed every few
thousand steps during training. Used to catch regressions early, never to
claim product quality. The train_400M_10b.log and train_diff_sft.log
streams in the nanochat tree are the raw inputs here.
Layer 2 — 4K-context functional eval on the 20 ablation variants
(architecture_and_eval_en.md §4). Each variant — Dense 1B baselines, Hybrid
Mamba-3 + GQA, Engram, mHC, Fine-Grained MoE (64 experts, Top-4, 1 shared
expert), Ultra-Fine MoE (128 experts, Top-8), plus routing, capacity, and
curriculum ablations — is trained on TPU v6e-x4 slices in parallel and scored
by the T4/Gemini harness. 4K context is chosen deliberately: it is short
enough to get full ablation sweeps in hours, and it is where the 70B
generalists look their strongest, so any specialist win at 4K is not an
artifact of long context.
Layer 3 — 16K and 64K-context eval on v4/v5 Bounded Context Graphs, run on
TPU v6e-x8 slices. This is where cross-file reasoning, template
instantiation across headers, and repo-level refactors get tested. It is
also where 70B generalists tend to stumble: intra-document masking, YaRN /
RNoPE scaling, and our content-dependent sparse attention (the Pallas
prototype in experiments/sparse_pallas/, §1.5) are all specifically tuned
for the C++ long-context regime. Generalists trained on a web-heavy mixture
get no such tuning.
For each variant, we report not a single number but the joint distribution across the four axes, plus compile-rate and judge-agreement confidence intervals. A variant does not "win" by leading on one axis; it wins by Pareto-dominating across all four.
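The Pareto-domination criterion is easy to state precisely. A hedged sketch (function name ours; axes are assumed to be oriented so that higher is better, with hallucination rate inverted upstream):

```python
def pareto_dominates(a: dict, b: dict, axes: tuple) -> bool:
    """True if variant `a` scores at least as well as `b` on every axis
    and strictly better on at least one.

    Both `a` and `b` map axis name -> score, all oriented higher-is-better.
    """
    at_least_as_good = all(a[k] >= b[k] for k in axes)
    strictly_better = any(a[k] > b[k] for k in axes)
    return at_least_as_good and strictly_better
```

This is why single-axis wins do not count: a variant that trades away hallucination rate for compile rate neither dominates nor is dominated, and the report shows the full trade-off instead of hiding it behind one scalar.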
Training Methodology Feeds the Eval
The eval only tells the truth if the training pipeline is honest, and
training_review.md is explicit that earlier TPU runs were not. The
tpu_full_pipeline.py script trained on torch.randint noise with hardcoded
rewards — any "benchmark" off those checkpoints was measuring nothing. We
treat that review as the floor for what counts as a valid eval candidate.
Concretely, a checkpoint is only admitted into Layer 2/3 eval if it was
produced by a pipeline that:
- uses real distributed dataloaders with SPMD sharding, not random tensors;
- uses GQA (num_kv_heads = num_heads // 4 or // 8), not default MHA;
- ties input and output embeddings;
- disables weight decay on 1D tensors, biases, and embeddings in SFT / GSPO, not just base training;
- enables gradient clipping (--max_grad_norm=1.0) and Gemma-style logit softcapping (30.0 at the LM head, 50.0 on attention);
- uses intra-document masking on packed sequences and a stepped context curriculum (4K → 16K → 64K → 128K), not a flat max_seq_len=1024;
- uses RoPE + YaRN or RNoPE for long-context, not bare RoPE with rope_theta=10000.0.
These are the exact deltas called out in training_review.md and
review_gcp_tpu.md. Each one, on its own, is a small correctness fix; taken
together, they are the difference between a checkpoint whose eval numbers
mean something and one whose numbers are training-noise artifacts.
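Of the items in that checklist, logit softcapping is the easiest to state exactly: it is a one-line tanh transform that smoothly bounds logits instead of letting them grow without limit. A minimal sketch using the cap values from the checklist (the function name is ours):

```python
import math

def softcap(x: float, cap: float) -> float:
    """Gemma-style logit softcapping: maps any real logit smoothly
    into the open interval (-cap, cap), with near-identity behavior
    for |x| << cap."""
    return cap * math.tanh(x / cap)

# Values from the checklist above:
LM_HEAD_CAP = 30.0   # applied at the LM head
ATTN_CAP = 50.0      # applied to attention scores
```

Small logits pass through almost unchanged (tanh(x/cap) ≈ x/cap for small x), while a runaway logit of 60 is squashed to about 28.9 under the 30.0 cap, which is what keeps loss spikes from turning into divergence.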
Production-Side Reproducibility
Training throughput is not a quality metric, but it bounds how many eval
candidates we can actually produce per unit cost. Our H200 and GB10 stacks
are pinned to reproducible configurations in production_status.md and
reproducible_runs.md:
- europe-bf16 (LOCATION_2, 8x H200 SXM): NAM56R, TP=1 PP=1 EP=4 DP=2, MBS=8 GBS=64, seq=4096, MTP_DEPTHS=2, BF16. Gold record is 289 TFLOP/s per GPU at 29.2% MFU, peak ~127 GiB / 141 GiB per rank. FP8 regresses -34% on this fabric, so BF16 stays canonical.
- bench3-fp8 (LOCATION_1, 8x H200 SXM): same model, TP=1 PP=1 EP=8 DP=1, MBS=10 GBS=80, FP8 tensorwise (--fp8-format hybrid). Steady-state 268 TFLOP/s ± 0.5 at 27.1% MFU, peak ~115 GiB per rank. CG_FLAGS=NONE is mandatory at MBS=10 — the default TransformerEngine CUDA-Graph private pool holds 63.5 GiB and OOMs at iter 1.
- bench3-smoke: 7-iter smoke test; TFLOP/s must converge to 260–268 by iter 4–7 before any run is admitted to training, let alone to eval.
- gb10: single-GPU correctness check on NVIDIA GB10 (sm_121, 128 GB unified), BF16, MBS=1 seq=2048. This is not a throughput run; it exists so that the TileLang kernels we ship stay under the sm_121 99 KiB smem cap and produce finite gradients. FP8 Mamba SSM is a dead path on GB10 (0.73–0.91x), so we do not pretend otherwise.
Superseded measurements — the old bench3 269.4 TFLOP/s Liger
reduction="none" number (silent gradient corruption via Liger #968), the
PP=2 193 TFLOP/s europe baseline, the never-real "205 TFLOP/s DualPipeV
baseline" — are retired in production_status.md and explicitly not cited.
The active Liger workaround is reduction="mean" broadcast in
cppmega/megatron/apply_linear_ce_patch.py, and
CPPMEGA_MTP_NATIVE_HOPPER_CE stays OFF because it produces grad_norm=NaN.
That discipline is the point. Our eval numbers are only as good as the checkpoints feeding them, and our checkpoints are only as good as the pinned, smoke-tested, no-silent-corruption training stack they came from.
Cost Per Quality Point
This is where the ensemble argument actually lives. A 70B dense generalist at BF16 needs ~140 GB of weights alone; served at reasonable throughput it wants 2× H100/H200-class GPUs per replica, plus KV cache. Our Dense 1B baseline fits in <1 GB of VRAM and runs happily on a T4; our Fine-Grained MoE target is ~5B total / ~800M active, which still serves comfortably on a single mid-range GPU because only the active experts and shared expert are on the hot path per token.
So the comparison is not "small vs. large model" in the abstract — it is a cost-per-quality-point comparison across three regimes:
- Per eval run. The harness cost of scoring one checkpoint against the full Layer 2 suite on T4 pods is roughly 10–20× cheaper than scoring a 70B via H100 inference at the same batch and context. That lets us score every variant × every seed, which is how we get real distributions and real confidence intervals instead of a single vibes-based number.
- Per training run. A v6e-x4 slice sweeping 20 ablation variants in parallel at 4K context costs a small fraction of a single 70B pretrain epoch. The nanochat training logs (train_100M.log, train_400M.log, train_400M_10b.log) give us the wall-clock anchors; the production H200 stack gives us the scale anchors.
- Per deployed quality point. On the four-axis C++ eval, the ensemble — Dense 1B + Fine-Grained MoE + long-context variant, routed per request — wins on context adherence and hallucination rate against 70B generalists in our internal runs, and is within noise on compile-rate for single-file prompts. Because it runs on ~1 T4 per replica rather than ~2 H200s, the $/quality-point ratio is not close.
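The $/quality-point comparison above reduces to simple arithmetic. A sketch with loudly hypothetical inputs: the hourly prices and quality scores below are placeholders for illustration, not our measured numbers or any vendor's quotes:

```python
def cost_per_quality_point(gpu_hourly_usd: float, hours: float, quality: float) -> float:
    """Serving cost of one replica divided by its aggregate quality score.
    All inputs are illustrative; quality is a 0..1 aggregate of the four axes."""
    return (gpu_hourly_usd * hours) / quality

HOURS_PER_MONTH = 730

# Hypothetical numbers, for shape only:
#   ensemble replica  ~ 1 T4 at ~$0.35/h, aggregate quality 0.80
#   70B generalist    ~ 2 H200-class GPUs at ~$3.00/h each, quality 0.78
ensemble = cost_per_quality_point(0.35, HOURS_PER_MONTH, 0.80)
generalist = cost_per_quality_point(2 * 3.00, HOURS_PER_MONTH, 0.78)
ratio = generalist / ensemble
```

With placeholder prices in that range the ratio lands above 10x, which is the structural point: when quality is within noise, the hardware footprint per replica decides $/quality-point, and a T4-sized replica is a different cost class than a two-H200 replica.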
Why Ensemble > 70B Generalist for C++
Pulling the threads together:
C++ rewards specialization. The grammar is huge, the type system is Turing-complete at compile time, the idioms (templates, SFINAE / concepts, RAII, ODR, ABI stability) are unforgiving, and the "right answer" for a given TU depends on headers and build flags the model has to actually look at. A 70B generalist spends most of its parameters on things that are not C++. A fine-grained MoE with 64 tiny experts and a shared expert (variant 8 in the ablation list) lets us route templates, multithreading, and macro-heavy code to different experts without paying 70B worth of inference tax per token. Engram absorbs the most basic syntax into DRAM so the shared expert can stay small; mHC expands residual capacity without parameter bloat and, empirically, suppresses routing collapse without the aux-loss scaffolding.
Long C++ context rewards architecture, not just size. Our content-dependent
sparse attention (Pallas, experiments/sparse_pallas/) is tuned to TPU v6e
MXU tiles (Bq=256, l'=256, Bk=1024, H=128) and targets ~8–32 active tiles
out of 128 for 128K-context inputs. A 70B generalist with stock dense
attention at 128K is paying a quadratic cost it cannot amortize; a 5B
specialist with correct sparse attention and intra-document masking is
paying a near-linear cost on exactly the repo-level inputs our users care
about.
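The quadratic-vs-near-linear claim can be made concrete with the tile figures quoted above (~8–32 active tiles out of 128 at 128K context). A small sketch; the function name is ours and this counts only attention tile work, ignoring everything else in the forward pass:

```python
def active_tile_fraction(active_tiles: int, total_tiles: int = 128) -> float:
    """Fraction of dense-attention tile work that a content-dependent
    sparse pass actually executes per query block."""
    return active_tiles / total_tiles

# Figures from the Pallas prototype description above:
lo = active_tile_fraction(8)    # best case: 8 of 128 tiles active
hi = active_tile_fraction(32)   # worst case: 32 of 128 tiles active
```

So the sparse pass touches roughly 6–25% of the dense tile work at 128K, and because the active-tile count stays in that band as context grows, cost scales with the number of active tiles rather than with the square of sequence length.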
The eval rewards honesty, not scale. Because the harness scores compile, context adherence, hallucination, and correctness separately — with Tree-sitter ground truth for the adherence and hallucination axes and a real compiler for compile-rate — "bigger model" stops being a free win. A 70B that invents a header or calls a non-existent overload gets penalized the same way a 1B would, and it happens more often than the marketing suggests. Meanwhile a specialist that refuses to hallucinate and sticks to the provided call graph scores well on the axes that actually correlate with "the patch landed".
What We Publish
For every eval cycle, we publish: the exact checkpoint hash, the training
config (optimizer groups, softcap values, context curriculum, masking), the
production config it was trained under (one of europe-bf16, bench3-fp8,
or bench3-smoke), the harness commit, the judge model and prompt, and the
full per-axis distribution across seeds. Superseded numbers are marked as
such and kept for history. No single "headline TFLOP/s" or "headline
pass@1" is reported without the stack fingerprint behind it.
That is the bar. The ensemble claim only counts if the measurement counts, and the measurement only counts if the training stack and the harness are both honest end-to-end. Everything above is the scaffolding that makes "specialists beat a 70B on C++" a falsifiable statement rather than a slogan.
References
- nanochat/architecture_and_eval_en.md — four-axis eval, 20-variant ablation, GKE T4 + Gemini-as-judge harness, Pallas sparse-attention prototype.
- nanochat/training_review.md — GQA vs MHA, embedding tying, logit softcapping, weight-decay groups in SFT/GSPO, grad clipping, context curriculum.
- nanochat/review_gcp_tpu.md — rejection of torch.randint TPU pipelines, Muon + AdamW split, SPMD sharding, document masking, RNoPE / YaRN.
- nanochat/speed_rep_xx.md — throughput anchors, torch.compile + FSDP2 regressions, DSA backend benchmarks, MFU baselines.
- cppmega/docs/production_status.md — europe-bf16 289 TFLOP/s / 29.2% MFU, bench3-fp8 268 TFLOP/s / 27.1% MFU, deprecated-measurement list, Liger #968 workaround.
- cppmega/docs/reproducible_runs.md — pinned launch scripts (europe-bf16, bench3-fp8, bench3-smoke, gb10), fail-fast signals, smem preflight on GB10.