How We Evaluate the MegaCpp SLM Ensemble Against 70B Generalists
The benchmarks, harness, and cost-per-quality math behind our claim that a focused ensemble of small C++ specialists beats 70B+ generalist models on real C++ work.

Most "our model is better" posts hand-wave the evaluation. We have the opposite problem: our whole thesis — that an ensemble of small C++ specialists beats a 70B generalist on production C++ work — only holds if the measurement holds. So this post is the methodology, not the marketing. It describes what we benchmark, how the harness runs, how we price each quality point, and why, for C++ specifically, a narrow ensemble is a better product than a wide 70B.
It consolidates our internal eval design from the nanochat research stream
(architecture_and_eval_en.md, training_review.md, review_gcp_tpu.md,
speed_rep_xx.md) and the production-side numbers we actually run against on
H200 and GB10 (cppmega/docs/production_status.md,
cppmega/docs/reproducible_runs.md).
What We Are Actually Measuring
A 70B generalist is trained to be acceptable at everything: Python, TypeScript,
SQL, prose, image captions, light C++. "Acceptable at C++" in benchmark terms
usually means it can write a compiling std::vector example and explain what
const does. That is not the job. The job is: given a real translation unit,
with its headers, its macros, its call graph, and its template instantiations,
produce a patch that compiles, links, passes tests, and does not hallucinate
APIs.
Our evaluation is built around that job. Concretely, the nanochat eval design
(architecture_and_eval_en.md, §3) frames code-gen quality as four axes that
perplexity cannot see:
- Compilation probability of the generated diff against the original TU.
- Context adherence — did the model call the callee functions actually present in the provided call graph, or did it invent new ones?
- Hallucination rate — how often the output references non-existent symbols, headers, or overloads.
- Correctness, graded against a held-out test set of cross-file prompt graphs.
Perplexity still matters during pretraining because it is cheap and monotonically useful, but it is not a product metric. A 70B with broad web pretraining can look excellent on MBPP-style single-file prompts and still fail every one of the four axes above on a real C++ repo.
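The four axes, plus the "non-compiling diff caps its score" rule described in the harness section below, fit in a small per-diff record. This is an illustrative sketch, not harness code; the class name, field names, and the 0.25 cap value are ours, chosen only to make the shape concrete:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AxisScores:
    """Per-diff scores on the four axes (names and ranges illustrative)."""
    compile_ok: bool          # did the diff compile against the original TU?
    context_adherence: float  # fraction of ground-truth callees actually used, 0..1
    hallucination_rate: float # references to non-existent symbols, 0..1 (lower is better)
    correctness: float        # judge-graded correctness, 0..1

def capped_correctness(s: AxisScores, cap: float = 0.25) -> float:
    """A non-compiling diff caps its maximum correctness score (cap value hypothetical)."""
    return s.correctness if s.compile_ok else min(s.correctness, cap)
```

The point of keeping the axes separate in the record, rather than collapsing them into one scalar, is that aggregation happens per variant and per seed later, where distributions and Pareto comparisons are computed.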
The Harness: GKE + T4 Inference, Gemini-as-Judge
The harness mirrors what we ship, not a synthetic sandbox. From
architecture_and_eval_en.md §3:
- A GKE NodePool of nvidia-tesla-t4 GPUs is scaled on demand (scripts/infrastructure/gke_t4_eval_pool.sh). T4s are cheap, plentiful, and more than enough for SLM inference — using them instead of H100s collapses eval cost by roughly an order of magnitude per checkpoint.
- Each pod loads a candidate checkpoint (one of our specialists, or a 70B baseline via vLLM) and generates completions against a held-out set of cross-file C++ prompt graphs derived from our v4/v5 Bounded Context Graph dataset (see variant 20 in the ablation list).
- Generated diffs are piped to Gemini 3.1 Pro Preview acting as an expert C++ reviewer. Gemini grades each output on the four axes above. Auth is via GCP Application Default Credentials, so there are no raw API keys in the worker image (eval_worker/evaluate_checkpoint.py).
- Jobs are fanned out as Kubernetes Jobs (eval_worker/eval_job.yaml), one per variant × seed, so we get full per-variant distributions rather than single-point estimates.
LLM-as-a-judge has well-known failure modes — length bias, self-preference,
sycophancy. We mitigate these in three ways. First, every judged prompt is paired
with a ground-truth Callee list extracted by Tree-sitter from the original
repo, so "context adherence" and "hallucination rate" are scored against a
deterministic oracle, not Gemini's taste. Second, the compile axis is run
through an actual C++ frontend (clang with the repo's real build flags) before
the judge ever sees the diff; a non-compiling diff caps its maximum score.
Third, we rotate judges periodically and spot-check with a second model to
detect judge drift between eval runs.
Benchmarks We Actually Run
Three layers, from cheap to expensive:
Layer 1 — perplexity and loss curves on held-out C++. Computed every few
thousand steps during training. Used to catch regressions early, never to
claim product quality. The train_400M_10b.log and train_diff_sft.log
streams in the nanochat tree are the raw inputs here.
Layer 2 — 4K-context functional eval on the 20 ablation variants
(architecture_and_eval_en.md §4). Each variant — Dense 1B baselines, Hybrid
Mamba-3 + GQA, Engram, mHC, Fine-Grained MoE (64 experts, Top-4, 1 shared
expert), Ultra-Fine MoE (128 experts, Top-8), plus routing, capacity, and
curriculum ablations — is trained on TPU v6e-x4 slices in parallel and scored
by the T4/Gemini harness. 4K context is chosen deliberately: it is short
enough to get full ablation sweeps in hours, and it is where the 70B
generalists look their strongest, so any specialist win at 4K is not an
artifact of long context.
Layer 3 — 16K and 64K-context eval on v4/v5 Bounded Context Graphs, run on
TPU v6e-x8 slices. This is where cross-file reasoning, template
instantiation across headers, and repo-level refactors get tested. It is
also where 70B generalists tend to stumble: intra-document masking, YaRN /
RNoPE scaling, and our content-dependent sparse attention (the Pallas
prototype in experiments/sparse_pallas/, §1.5) are all specifically tuned
for the C++ long-context regime. Generalists trained on a web-heavy mixture
get no such tuning.
For each variant, we report not a single number but the joint distribution across the four axes, plus compile-rate and judge-agreement confidence intervals. A variant does not "win" by leading on one axis; it wins by Pareto-dominating across all four.
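The Pareto-domination criterion is easy to state precisely. A hedged sketch (function name ours; axes are assumed to be oriented so that higher is better, with hallucination rate inverted upstream):

```python
def pareto_dominates(a: dict, b: dict, axes: tuple) -> bool:
    """True if variant `a` scores at least as well as `b` on every axis
    and strictly better on at least one.

    Both `a` and `b` map axis name -> score, all oriented higher-is-better.
    """
    at_least_as_good = all(a[k] >= b[k] for k in axes)
    strictly_better = any(a[k] > b[k] for k in axes)
    return at_least_as_good and strictly_better
```

This is why single-axis wins do not count: a variant that trades away hallucination rate for compile rate neither dominates nor is dominated, and the report shows the full trade-off instead of hiding it behind one scalar.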
Training Methodology Feeds the Eval
The eval only tells the truth if the training pipeline is honest, and
training_review.md is explicit that earlier TPU runs were not. The
tpu_full_pipeline.py script trained on torch.randint noise with hardcoded
rewards — any "benchmark" off those checkpoints was measuring nothing. We
treat that review as the floor for what counts as a valid eval candidate.
Concretely, a checkpoint is only admitted into Layer 2/3 eval if it was
produced by a pipeline that:
- uses real distributed dataloaders with SPMD sharding, not random tensors;
- uses GQA (num_kv_heads = num_heads // 4 or // 8), not default MHA;
- ties input and output embeddings;
- disables weight decay on 1D tensors, biases, and embeddings in SFT / GSPO, not just base training;
- enables gradient clipping (--max_grad_norm=1.0) and Gemma-style logit softcapping (30.0 at the LM head, 50.0 on attention);
- uses intra-document masking on packed sequences and a stepped context curriculum (4K → 16K → 64K → 128K), not a flat max_seq_len=1024;
- uses RoPE + YaRN or RNoPE for long-context, not bare RoPE with rope_theta=10000.0.
These are the exact deltas called out in training_review.md and
review_gcp_tpu.md. Each one, on its own, is a small correctness fix; taken
together, they are the difference between a checkpoint whose eval numbers
mean something and one whose numbers are training-noise artifacts.
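Of the items in that checklist, logit softcapping is the easiest to state exactly: it is a one-line tanh transform that smoothly bounds logits instead of letting them grow without limit. A minimal sketch using the cap values from the checklist (the function name is ours):

```python
import math

def softcap(x: float, cap: float) -> float:
    """Gemma-style logit softcapping: maps any real logit smoothly
    into the open interval (-cap, cap), with near-identity behavior
    for |x| << cap."""
    return cap * math.tanh(x / cap)

# Values from the checklist above:
LM_HEAD_CAP = 30.0   # applied at the LM head
ATTN_CAP = 50.0      # applied to attention scores
```

Small logits pass through almost unchanged (tanh(x/cap) ≈ x/cap for small x), while a runaway logit of 60 is squashed to about 28.9 under the 30.0 cap, which is what keeps loss spikes from turning into divergence.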
Production-Side Reproducibility
Training throughput is not a quality metric, but it bounds how many eval
candidates we can actually produce per unit cost. Our H200 and GB10 stacks
are pinned to reproducible configurations in production_status.md and
reproducible_runs.md:
- europe-bf16 (LOCATION_2, 8x H200 SXM): NAM56R, TP=1 PP=1 EP=4 DP=2, MBS=8 GBS=64, seq=4096, MTP_DEPTHS=2, BF16. Gold record is 289 TFLOP/s per GPU at 29.2% MFU, peak ~127 GiB / 141 GiB per rank. FP8 regresses -34% on this fabric, so BF16 stays canonical.
- bench3-fp8 (LOCATION_1, 8x H200 SXM): same model, TP=1 PP=1 EP=8 DP=1, MBS=10 GBS=80, FP8 tensorwise (--fp8-format hybrid). Steady-state 268 TFLOP/s ± 0.5 at 27.1% MFU, peak ~115 GiB per rank. CG_FLAGS=NONE is mandatory at MBS=10 — the default TransformerEngine CUDA-Graph private pool holds 63.5 GiB and OOMs at iter 1.
- bench3-smoke: 7-iter smoke test; TFLOP/s must converge to 260–268 by iter 4–7 before any run is admitted to training, let alone to eval.
- gb10: single-GPU correctness check on NVIDIA GB10 (sm_121, 128 GB unified), BF16, MBS=1 seq=2048. This is not a throughput run; it exists so that the TileLang kernels we ship stay under the sm_121 99 KiB smem cap and produce finite gradients. FP8 Mamba SSM is a dead path on GB10 (0.73–0.91x), so we do not pretend otherwise.
Superseded measurements — the old bench3 269.4 TFLOP/s Liger
reduction="none" number (silent gradient corruption via Liger #968), the
PP=2 193 TFLOP/s europe baseline, the never-real "205 TFLOP/s DualPipeV
baseline" — are retired in production_status.md and explicitly not cited.
The active Liger workaround is reduction="mean" broadcast in
cppmega/megatron/apply_linear_ce_patch.py, and
CPPMEGA_MTP_NATIVE_HOPPER_CE stays OFF because it produces grad_norm=NaN.
That discipline is the point. Our eval numbers are only as good as the checkpoints feeding them, and our checkpoints are only as good as the pinned, smoke-tested, no-silent-corruption training stack they came from.
Cost Per Quality Point
This is where the ensemble argument actually lives. A 70B dense generalist at BF16 needs ~140 GB of weights alone; served at reasonable throughput it wants 2× H100/H200-class GPUs per replica, plus KV cache. Our Dense 1B baseline fits in <1 GB of VRAM and runs happily on a T4; our Fine-Grained MoE target is ~5B total / ~800M active, which still serves comfortably on a single mid-range GPU because only the active experts and shared expert are on the hot path per token.
So the comparison is not "small vs. large model" in the abstract — it is a cost-per-quality-point comparison across three regimes:
- Per eval run. The harness cost of scoring one checkpoint against the full Layer 2 suite on T4 pods is roughly 10–20× cheaper than scoring a 70B via H100 inference at the same batch and context. That lets us score every variant × every seed, which is how we get real distributions and real confidence intervals instead of a single vibes-based number.
- Per training run. A v6e-x4 slice sweeping 20 ablation variants in parallel at 4K context costs a small fraction of a single 70B pretrain epoch. The nanochat training logs (train_100M.log, train_400M.log, train_400M_10b.log) give us the wall-clock anchors; the production H200 stack gives us the scale anchors.
- Per deployed quality point. On the four-axis C++ eval, the ensemble — Dense 1B + Fine-Grained MoE + long-context variant, routed per request — wins on context adherence and hallucination rate against 70B generalists in our internal runs, and is within noise on compile-rate for single-file prompts. Because it runs on ~1 T4 per replica rather than ~2 H200s, the $/quality-point ratio is not close.
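The $/quality-point comparison above reduces to simple arithmetic. A sketch with loudly hypothetical inputs: the hourly prices and quality scores below are placeholders for illustration, not our measured numbers or any vendor's quotes:

```python
def cost_per_quality_point(gpu_hourly_usd: float, hours: float, quality: float) -> float:
    """Serving cost of one replica divided by its aggregate quality score.
    All inputs are illustrative; quality is a 0..1 aggregate of the four axes."""
    return (gpu_hourly_usd * hours) / quality

HOURS_PER_MONTH = 730

# Hypothetical numbers, for shape only:
#   ensemble replica  ~ 1 T4 at ~$0.35/h, aggregate quality 0.80
#   70B generalist    ~ 2 H200-class GPUs at ~$3.00/h each, quality 0.78
ensemble = cost_per_quality_point(0.35, HOURS_PER_MONTH, 0.80)
generalist = cost_per_quality_point(2 * 3.00, HOURS_PER_MONTH, 0.78)
ratio = generalist / ensemble
```

With placeholder prices in that range the ratio lands above 10x, which is the structural point: when quality is within noise, the hardware footprint per replica decides $/quality-point, and a T4-sized replica is a different cost class than a two-H200 replica.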
Why Ensemble > 70B Generalist for C++
Pulling the threads together:
C++ rewards specialization. The grammar is huge, the type system is Turing-complete at compile time, the idioms (templates, SFINAE / concepts, RAII, ODR, ABI stability) are unforgiving, and the "right answer" for a given TU depends on headers and build flags the model has to actually look at. A 70B generalist spends most of its parameters on things that are not C++. A fine-grained MoE with 64 tiny experts and a shared expert (variant 8 in the ablation list) lets us route templates, multithreading, and macro-heavy code to different experts without paying 70B worth of inference tax per token. Engram absorbs the most basic syntax into DRAM so the shared expert can stay small; mHC expands residual capacity without parameter bloat and, empirically, suppresses routing collapse without the aux-loss scaffolding.
Long C++ context rewards architecture, not just size. Our content-dependent
sparse attention (Pallas, experiments/sparse_pallas/) is tuned to TPU v6e
MXU tiles (Bq=256, l'=256, Bk=1024, H=128) and targets ~8–32 active tiles
out of 128 for 128K-context inputs. A 70B generalist with stock dense
attention at 128K is paying a quadratic cost it cannot amortize; a 5B
specialist with correct sparse attention and intra-document masking is
paying a near-linear cost on exactly the repo-level inputs our users care
about.
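The quadratic-vs-near-linear claim can be made concrete with the tile figures quoted above (~8–32 active tiles out of 128 at 128K context). A small sketch; the function name is ours and this counts only attention tile work, ignoring everything else in the forward pass:

```python
def active_tile_fraction(active_tiles: int, total_tiles: int = 128) -> float:
    """Fraction of dense-attention tile work that a content-dependent
    sparse pass actually executes per query block."""
    return active_tiles / total_tiles

# Figures from the Pallas prototype description above:
lo = active_tile_fraction(8)    # best case: 8 of 128 tiles active
hi = active_tile_fraction(32)   # worst case: 32 of 128 tiles active
```

So the sparse pass touches roughly 6–25% of the dense tile work at 128K, and because the active-tile count stays in that band as context grows, cost scales with the number of active tiles rather than with the square of sequence length.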
The eval rewards honesty, not scale. Because the harness scores compile, context adherence, hallucination, and correctness separately — with Tree-sitter ground truth for the adherence and hallucination axes and a real compiler for compile-rate — "bigger model" stops being a free win. A 70B that invents a header or calls a non-existent overload gets penalized the same way a 1B would, and it happens more often than the marketing suggests. Meanwhile a specialist that refuses to hallucinate and sticks to the provided call graph scores well on the axes that actually correlate with "the patch landed".
What We Publish
For every eval cycle, we publish: the exact checkpoint hash, the training
config (optimizer groups, softcap values, context curriculum, masking), the
production config it was trained under (one of europe-bf16, bench3-fp8,
or bench3-smoke), the harness commit, the judge model and prompt, and the
full per-axis distribution across seeds. Superseded numbers are marked as
such and kept for history. No single "headline TFLOP/s" or "headline
pass@1" is reported without the stack fingerprint behind it.
That is the bar. The ensemble claim only counts if the measurement counts, and the measurement only counts if the training stack and the harness are both honest end-to-end. Everything above is the scaffolding that makes "specialists beat a 70B on C++" a falsifiable statement rather than a slogan.
References
- nanochat/architecture_and_eval_en.md — four-axis eval, 20-variant ablation, GKE T4 + Gemini-as-judge harness, Pallas sparse-attention prototype.
- nanochat/training_review.md — GQA vs MHA, embedding tying, logit softcapping, weight-decay groups in SFT/GSPO, grad clipping, context curriculum.
- nanochat/review_gcp_tpu.md — rejection of torch.randint TPU pipelines, Muon + AdamW split, SPMD sharding, document masking, RNoPE / YaRN.
- nanochat/speed_rep_xx.md — throughput anchors, torch.compile + FSDP2 regressions, DSA backend benchmarks, MFU baselines.
- cppmega/docs/production_status.md — europe-bf16 289 TFLOP/s / 29.2% MFU, bench3-fp8 268 TFLOP/s / 27.1% MFU, deprecated-measurement list, Liger #968 workaround.
- cppmega/docs/reproducible_runs.md — pinned launch scripts (europe-bf16, bench3-fp8, bench3-smoke, gb10), fail-fast signals, smem preflight on GB10.