How We Evaluate the MegaCpp SLM Ensemble on Real C++ Work
The evaluation design, verifier stack, and release gates we use to measure C++ model quality without collapsing everything into a single leaderboard number.

The MegaCpp evaluation story only matters if the measurement is strict enough to reject plausible-looking but wrong code. This post is about that measurement design. It explains what we score, how the harness runs, why compile-and-test signals outrank generic text metrics, and how we compare release candidates across quality, cost, and operational complexity.
The first boundary to keep straight is vocabulary. In this series, eval is the whole measurement program, harness is the orchestration layer around tasks, seeds, and artifacts, verifier is the compile-and-test authority that assigns pass/fail labels, and judge is any softer model-based scorer used only where no deterministic oracle exists. If you want the component view first, "eval harness plumbing" covers orchestration, "verifier-first C++ evals" covers why the verifier owns the label, and "C++ eval suites and verifiers" covers the task families and failure buckets. The shortest checked-in path is "C++ eval suites and verifiers" for the verifier contract, "Compile/runtime receipt sample" for the compact receipt shape, then "Compile and runtime capture examples" for the surrounding compile/runtime guardrails.
Two first-touch labels help here. A repository-grounded task is a prompt plus the real translation-unit context, symbol set, and held-out checks that define success. A verifier-backed label is the compile/test outcome produced under the declared toolchain, timeout, and harness revision. Those are the units that make later scorecards comparable instead of anecdotal.
The central idea is simple: a C++ model should be judged on repository-grounded work, not on whether it can produce a nice-looking snippet in isolation. That means evaluating generated patches against real translation units, real build settings, real symbol sets, and held-out correctness checks.
Why this matters
A broad general-purpose model can look strong on short, self-contained coding prompts and still fail on the things that matter in a real C++ repository. The job is not to emit a plausible std::vector example. The job is to take a translation unit with its headers, macros, call graph, and template instantiations, then produce a change that compiles, links, passes tests, and avoids inventing APIs.
If the evaluation stack cannot tell the difference between "the patch landed" and "the snippet sounded reasonable," it will promote the wrong checkpoints.
1. What we are actually measuring
Our public evaluation design frames code-generation quality along four axes that text-only metrics do not capture well:
- Compilation probability of the generated diff against the original translation unit.
- Context adherence: did the model stay within the provided symbols, callees, and include structure?
- Hallucination rate: references to symbols, headers, or overloads that are not actually present.
- Correctness against held-out tests and repository-grounded task checks.
Perplexity still matters during pretraining because it is cheap and useful for early regression detection, but it is not the product metric. Release decisions lean on verifier-backed outcomes.
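To make the four axes concrete, here is a minimal Python sketch that aggregates per-sample outcomes into four separate rates. The field names, the split between "provided" and "repository" symbol sets, and the exact rate definitions are assumptions for illustration, not the harness's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleResult:
    """Outcome of one generated diff. All field names are illustrative."""
    compiled: bool                  # diff built against the original translation unit
    referenced_symbols: frozenset   # symbols the diff refers to
    provided_symbols: frozenset     # symbols supplied in the prompt context
    repo_symbols: frozenset         # symbols that really exist in the repository
    tests_passed: bool              # held-out tests and task checks


def four_axis_scores(results):
    """Aggregate samples into four separate rates instead of one number."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    compiled = sum(r.compiled for r in results)
    # Adherent: every referenced symbol was part of the provided context.
    adherent = sum(r.referenced_symbols <= r.provided_symbols for r in results)
    # Hallucinated: referenced symbols that exist nowhere in the repository.
    invented = sum(len(r.referenced_symbols - r.repo_symbols) for r in results)
    referenced = sum(len(r.referenced_symbols) for r in results)
    correct = sum(r.compiled and r.tests_passed for r in results)
    return {
        "compile_rate": compiled / n,
        "context_adherence": adherent / n,
        "hallucination_rate": invented / referenced if referenced else 0.0,
        "correctness": correct / n,
    }
```

Keeping the four values in one record, rather than folding them into a weighted sum, is what lets a later comparison show which axis moved.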
Why these four, not pass@k alone
pass@k is useful, but by itself it hides failure modes that matter in C++. A candidate can compile accidentally while still violating context, and a candidate can stay syntactically neat while inventing APIs that do not exist. Keeping the axes separated makes it easier to spot where a checkpoint is getting better and where it is only getting luckier.
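For reference, the widely used unbiased pass@k estimator draws n samples per task, counts the c that pass, and computes 1 - C(n-c, k) / C(n, k). The sketch below implements that formula in Python. Whether the MegaCpp harness uses this exact estimator is an assumption, since the article does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples generated, c: samples that passed the verifier, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The estimator only sees the pass count c. It cannot tell a verified pass that stayed inside the provided context from one that compiled by accident, which is why the other axes are reported next to it.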
2. The harness: cheap inference, compiler-grounded judging
The harness is designed to mirror the product problem more closely than a synthetic sandbox does.
- Candidate checkpoints generate completions against held-out cross-file C++ prompt graphs.
- Generated diffs are checked against compile and verifier rules before any summary score is computed.
- Runs are fanned out by variant and seed so the team can compare distributions rather than a single convenient point estimate.
- External judge models can help with structured review, but deterministic repository signals stay in charge for compile validity and symbol adherence.
LLM-as-a-judge systems have known biases, so the harness leans on deterministic oracles wherever possible. The compile axis is run through a declared C++ frontend and build contract before a reviewer model sees the result. Symbol and callee checks are matched against repository-derived context rather than taste-based grading. For the context side of that claim, the fastest local follow-up is "Compile commands and semantic graphs" plus "Semantic indexing notes".
That layered design is partly about cost discipline, not just rigor. Cheap parser and symbol passes can reject structurally broken or obviously repository-incompatible outputs before a full compile-and-test loop burns most of the evaluation budget on them. That keeps syntax failure separate from the later compile, link, and held-out-test buckets instead of collapsing every bad sample into one opaque "fail."
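The cheapest-first ordering described above can be sketched as a short pipeline in which the first failing stage names the bucket and later, more expensive stages never run. The bucket names and the stage callables are hypothetical stand-ins; a real harness would plug in its parser, symbol checker, compiler, linker, and test runner.

```python
from enum import Enum
from typing import Callable, Sequence, Tuple


class Bucket(str, Enum):
    """Failure buckets, ordered from cheapest check to most expensive."""
    SYNTAX = "syntax_fail"
    SYMBOL = "symbol_fail"
    COMPILE = "compile_fail"
    LINK = "link_fail"
    TEST = "test_fail"
    PASS = "pass"


# A stage pairs the bucket reported on failure with a check returning True on success.
Stage = Tuple[Bucket, Callable[[str], bool]]


def classify(diff: str, stages: Sequence[Stage]) -> Bucket:
    """Run stages in order and stop at the first failure."""
    for bucket, check in stages:
        if not check(diff):
            return bucket  # later (costlier) stages are skipped
    return Bucket.PASS
```

Because each sample ends in exactly one bucket, a shift from `compile_fail` to `test_fail` between checkpoints stays visible instead of disappearing into a single failure count.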
spec:
  parallelism: 8
  template:
    spec:
      containers:
        - name: eval-worker
          image: eval-worker:<release-tag>
          args: ["--variant", "$(VARIANT)", "--seed", "$(SEED)"]
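The variant-by-seed fan-out can be sketched as a plain cross product that yields one worker invocation per pair. The `--variant` and `--seed` flags mirror the worker args in the article. The variant names, the seed values, and the dictionary layout are hypothetical.

```python
from itertools import product


def build_jobs(variants, seeds):
    """Return one job description per (variant, seed) pair."""
    return [
        {
            "variant": variant,
            "seed": seed,
            # Same flag shape as the eval-worker container args.
            "args": ["--variant", variant, "--seed", str(seed)],
        }
        for variant, seed in product(variants, seeds)
    ]


# Hypothetical example: two variants and three seeds give six jobs.
jobs = build_jobs(["variant-a", "variant-b"], [0, 1, 2])
```

Enumerating the full grid up front makes it easy to check that every variant received the same seeds before any distributions are compared.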
3. Three benchmark layers
We run three layers, from cheapest to most release-like.
Layer 1: held-out loss and perplexity checks during training. These are early warning signals, not headline product metrics.
Layer 2: shorter-context functional evaluation for ablation sweeps and rapid checkpoint comparison.
Layer 3: longer-context evaluation on repository-grounded bounded-graph tasks, where cross-file reasoning and context discipline matter more.
| Layer | Context | Primary purpose | Cost class | Cadence |
|---|---|---|---|---|
| 1 | 1K-4K | regression detection during training | low | every few thousand steps |
| 2 | 4K | rapid functional comparison across variants | medium | every promoted checkpoint |
| 3 | 16K-64K | release-candidate validation on long-context tasks | higher | before release decisions |
The key point is not the exact hardware mix. The key point is that cheaper layers are used to filter candidates early, while the more expensive layers are reserved for decisions that justify the additional evaluation cost.
Layer 3 is where the repository task stops being a toy prompt. The context is large enough to carry neighboring translation units, include relationships, and symbol ownership, so the evaluation can ask whether the model still respects the repo when the easy "single-file snippet" escape hatch is gone. That release-grade lane needs pass-to-pass discipline as well as fail-to-pass wins: a patch that flips one held-out task by breaking unrelated already-green behavior is not actually better. "C++ eval suites and verifiers" is the local follow-up on how those buckets stay visible.
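The fail-to-pass and pass-to-pass distinction can be sketched as a comparison of test outcomes before and after a patch. The function name, the dictionary shapes, and the acceptance rule are assumptions for illustration. A test that was green before and is missing afterwards is treated as a regression.

```python
def compare_runs(before, after, targets):
    """Compare per-test outcomes around a patch.

    before/after: dict mapping test name -> passed (bool).
    targets: names of the tests the patch is supposed to flip to passing.
    """
    fail_to_pass = sorted(
        t for t in targets if not before.get(t, False) and after.get(t, False)
    )
    # Previously green tests that are still green.
    pass_to_pass = sorted(t for t, ok in before.items() if ok and after.get(t, False))
    # Previously green tests that are now failing or missing.
    regressions = sorted(t for t, ok in before.items() if ok and not after.get(t, False))
    return {
        "fail_to_pass": fail_to_pass,
        "pass_to_pass": pass_to_pass,
        "regressions": regressions,
        # Better only if every target flipped and nothing green broke.
        "accepted": len(fail_to_pass) == len(set(targets)) and not regressions,
    }
```

Reporting the two lists separately keeps a patch that fixes its target while breaking an unrelated test from being counted as a win.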
4. Why verifier-first scoring wins
The strongest lesson from the current stack is that verifier-backed evaluation is more reliable than pure aesthetic judging.
A compiler can reject a patch even when the prose around it sounds convincing. A symbol checker can catch invented APIs even when a judge model finds the explanation persuasive. A held-out test can fail even when the patch looks stylistically correct. Those are not edge cases in C++; they are the point of the evaluation.
That is why the harness treats compile-and-test outcomes as the authority and uses softer review signals only as supporting structure.
5. Release gating and reproducibility
Training throughput is not a user-facing quality metric, but the training stack still matters because unstable training receipts produce unstable evaluation results. Public reproducibility notes therefore focus on configuration discipline rather than on marketing-friendly peak numbers, starting with the dataset and packaging discipline described in "SLM data pipeline".
The release process keeps a few principles fixed:
- candidate checkpoints must come from pinned, reproducible training lanes
- smoke variants must converge into an expected steady-state band before they are admitted to larger runs
- superseded or invalidated measurements are retired instead of silently mixed into newer summaries
- evaluation artifacts should record checkpoint identity, harness revision, verifier settings, and seed distributions
Those artifacts also need enough evidence to explain a regression later: extracted source or diff identity, compile bucket, timeout class, verifier settings, and seed-level outcomes. Without that trail, a model regression and a harness drift can land on the same top-line score.
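One way to picture such an artifact is a small, serializable receipt. The sketch below is an assumption about shape only: the field names, the SHA-256 diff identity, and the `same_contract` helper are illustrative and do not describe the project's checked-in receipt format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalReceipt:
    """Evidence needed to explain a score later. Field names are illustrative."""
    checkpoint_id: str
    harness_revision: str
    verifier_settings: dict   # e.g. toolchain, language standard, timeout
    diff_sha256: str          # identity of the extracted source or diff
    compile_bucket: str
    timeout_class: str
    seed_outcomes: dict       # seed -> passed (bool)


def make_receipt(checkpoint_id, harness_revision, verifier_settings,
                 diff_text, compile_bucket, timeout_class, seed_outcomes):
    """Build a receipt, hashing the diff so its identity is recorded."""
    digest = hashlib.sha256(diff_text.encode("utf-8")).hexdigest()
    return EvalReceipt(checkpoint_id, harness_revision, dict(verifier_settings),
                       digest, compile_bucket, timeout_class, dict(seed_outcomes))


def to_json(receipt):
    """Serialize with sorted keys so identical receipts produce identical text."""
    return json.dumps(asdict(receipt), sort_keys=True)


def same_contract(a, b):
    """Scores are comparable only under the same harness and verifier settings."""
    return (a.harness_revision == b.harness_revision
            and a.verifier_settings == b.verifier_settings)
```

With `same_contract` as a precondition for any comparison, a score change that coincides with a harness or verifier change is flagged as drift rather than read as a model regression.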
6. Cost per quality point
The cost argument is easiest to reason about when quality and operational complexity are reported together.
A smaller or more specialized serving target can make it practical to evaluate more variants, run more seeds, and keep repository-grounded checks in the loop. A larger general-purpose baseline can still be a useful comparison point, but it often raises evaluation cost enough that teams are tempted to cut corners on repeatability or depth.
For MegaCpp, the practical comparison is therefore not just "small versus large model." It is:
- how much evidence can we afford to collect per candidate release
- how faithfully can we keep compile-and-verifier checks in the loop
- what serving and iteration costs follow from the chosen architecture
The inference lane has its own throughput knobs, but they are not quality metrics. Shorter functional sweeps can run more aggressive concurrency, while 16K-64K repository tasks usually need more headroom so the serving path does not silently shrink or distort the long-context task being measured.
Why ensemble-style evaluation is still useful
A specialist ensemble is interesting only if the evaluation can show where specialization helps and where it does not. The current design is intended to make that visible across context adherence, hallucination control, and long-context repository work rather than collapsing everything into one headline number.
What we kept and what we threw away
We kept the four-axis methodology, compiler-grounded checks on structure and validity, verifier-first release gates, and the practice of comparing distributions rather than cherry-picked single runs.
We threw away single-number reporting, pass@1-only storytelling, and benchmark summaries that are not traceable back to a declared verifier contract.
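As a small illustration of comparing distributions rather than single runs, the sketch below summarizes per-seed pass rates for one variant and applies a deliberately conservative comparison. Both the summary fields and the "worst seed beats best seed" rule are assumptions chosen for simplicity, not the project's actual promotion criterion.

```python
from statistics import mean, stdev


def summarize(pass_rate_by_seed):
    """Summarize one variant's pass rates across seeds."""
    values = list(pass_rate_by_seed.values())
    if not values:
        raise ValueError("need at least one seed")
    return {
        "n_seeds": len(values),
        "mean": mean(values),
        # Sample standard deviation needs at least two seeds.
        "stdev": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def clearly_better(candidate, baseline):
    """Conservative rule: the candidate's worst seed beats the baseline's best."""
    return candidate["min"] > baseline["max"]
```

A rule this strict will miss some real improvements, but it cannot be satisfied by one lucky seed, which is the failure mode single-number reporting invites.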