Verifier-first C++ evals: why compile-and-test owns the metric
What the C++ evaluation stack teaches about deterministic extraction, sandbox contracts, pass@k, and why benchmark tables only become trustworthy after the verifier owns the pass label.

Executable evaluation only becomes honest when the verifier is the authority. The current C++ stack is valuable because it does not treat compilation as a cleanup step after generation. It treats deterministic extraction, declared sandboxing, compile/test outcomes, and failure bucketing as the source of truth, then computes summary metrics on top. That order matters more than any single leaderboard number. If you want the quickest checked-in proof surface first, start with C++ eval suites and verifiers, then Compile/runtime receipt sample, then Compile and runtime capture examples.
Evaluation systems for natural-language tasks can get away with fuzzy matching, grader models, or human preference signals. C++ cannot. There is no meaningful notion of "almost correct" once the deliverable is code that must compile and satisfy tests. For that reason the verifier is not auxiliary tooling. It is the measurement device.
One first-touch distinction matters before the rest: an authoritative code region is the exact slice of model output the verifier decides to compile, and a failure bucket is the stable label the verifier emits after extraction, compilation, linking, or execution. If those are fuzzy, everything built on top of them is fuzzy too.
Why verifier-first matters specifically for C++
C++ evaluation fails in two distinct ways. The obvious one is that the model produces wrong code. The subtler one is that the harness misidentifies the intended code or evaluates it under an unstable contract. The second problem is more dangerous because it can fabricate research conclusions.
If a model emits a correct solution wrapped in extra markers, reasoning text, or multiple fenced blocks, a weak harness may choose the wrong region and report a compile failure. If the sandbox changes compiler flags, timeout windows, or filesystem assumptions between runs, the same candidate can pass on Tuesday and fail on Wednesday for reasons unrelated to model quality. Once that happens, benchmark tables stop describing capability and start describing evaluator drift.
That is why a verifier-first stack is not just "strict." It is epistemically cleaner. It makes the question precise: under this extraction policy and this compile-and-test contract, did the candidate solve the task?
| Layer | What it must decide | What breaks if it is sloppy |
|---|---|---|
| Extraction | Which code region is authoritative | Format noise becomes fake capability loss |
| Sanitization | Which wrappers are allowed or removed | Harmless scaffolding becomes a compile error |
| Compilation | Which toolchain and flags define validity | Results become non-comparable |
| Execution | Which tests and timeouts define success | Syntactic validity is mistaken for correctness |
| Aggregation | Which labels count toward pass@k | Summary metrics drift away from evidence |
That table is the philosophy of the stack in one place. Every interesting metric question sits downstream of those decisions.
The repo already encodes the right authority boundary
The verifier layer matters because it is where success stops being rhetorical. It defines the logic that converts a model response into a verifier outcome. It is the place where extraction policy, compile orchestration, and test execution are tied together. Even if different task families require different details, the design pressure is the same: the verifier owns the label.
The harness then sits one level up and does the work an evaluation harness should do: load tasks, call the verifier consistently, record outcomes, and summarize them. That separation is healthy. It prevents the benchmark loop from smuggling in ad hoc per-task decisions that would make historical comparisons noisy.
One practical benefit of this shape is that regressions become localizable. If scores move, you can ask whether the model changed, the extraction logic changed, the compile contract changed, or the aggregation changed. A weaker setup would collapse all of that into a single opaque number.
Deterministic extraction is not polish; it is part of the metric
C++ generations increasingly arrive with extra structure. Some models emit analysis before code. Others produce multiple code fences or tool-style wrappers. A verifier-first stack has to answer a very basic question consistently: what exact text becomes the candidate translation unit?
That is why deterministic extraction belongs inside the metric path rather than in a cleanup script somebody runs later. If the extraction rule is unstable, the metric is unstable. If the extraction rule is implicit, the metric is not reproducible.
Brace-depth truncation is a reasonable first pass, but once models start emitting helper functions, wrapper text, or multiple candidate code regions, syntax-aware extraction earns its keep. The important boundary does not change: smarter extraction still stays subordinate to the compile wall rather than replacing it.
1. identify the authoritative code region deterministically
2. normalize only known wrappers or boilerplate
3. preserve the candidate source exactly after normalization
4. compile under a declared contract
5. record extraction failure separately from compile failure
Those steps matter because they prevent evaluator folklore. Without them, two engineers can look at the same raw sample, pick different code blocks manually, and report different pass rates. Once that happens, there is no real benchmark anymore.
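The first, third, and fifth steps can be sketched in a few lines. This is a minimal illustration under an assumed policy, not the repo's extractor: the rule "the last fenced block tagged as C++ is authoritative", the `Extraction` type, and the bucket strings are all hypothetical names chosen for the example. The fence marker is built programmatically only so the sample can be embedded in a fenced block itself.

```python
import re
from dataclasses import dataclass
from typing import Optional

FENCE = "`" * 3  # three backticks, built indirectly so this sample nests cleanly

# Assumed policy: only fences tagged as C++ count, and the LAST one is authoritative.
CPP_BLOCK_RE = re.compile(
    re.escape(FENCE) + r"(?:cpp|c\+\+|cxx|cc)[ \t]*\r?\n(.*?)" + re.escape(FENCE),
    re.DOTALL | re.IGNORECASE,
)


@dataclass(frozen=True)
class Extraction:
    source: Optional[str]  # candidate translation unit, preserved verbatim
    bucket: str            # "ok" or "extraction_failure" (hypothetical labels)
    reason: str            # which rule fired, kept for the evidence trail


def extract_candidate(raw: str) -> Extraction:
    """Pick the authoritative code region with a fixed, order-independent rule."""
    blocks = CPP_BLOCK_RE.findall(raw)
    if not blocks:
        # Reported separately so it is never mistaken for a compile failure.
        return Extraction(None, "extraction_failure", "no_cpp_fence")
    source = blocks[-1]
    if not source.strip():
        return Extraction(None, "extraction_failure", "empty_block")
    # No repair and no reformatting: the compiler sees exactly this text.
    return Extraction(source, "ok", "last_cpp_fence")
```

The point of the sketch is that the same raw sample always yields the same `Extraction`, and that a missing or empty region produces its own label before any compiler is invoked.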
Once generations start mixing reasoning text, multiple fenced regions, or helper functions outside the first obvious block, brace-depth heuristics stop being enough on their own. The practical upgrade is syntax-aware extraction that can keep the authoritative code region stable even when the surface format drifts. The value is not elegance; it is that extraction failure stays its own diagnostic bucket instead of being misreported as a compile failure. That is the same boundary C++ eval suites and verifiers and Eval harness plumbing are trying to keep visible all the way through the receipts.
Pass@k only means something after the verifier owns the label
pass@k is a useful summary because executable tasks are inherently stochastic. A model may produce several candidate programs, and the engineering question is often whether at least one of them is correct within a limited sample budget. But pass@k becomes meaningless if the underlying pass label is noisy.
That is why the ordering matters so much:
- verifier determines pass, fail, timeout, extraction failure, or sandbox error
- harness aggregates those stable labels
- metrics summarize the already-grounded outcomes
If you reverse that order and let heuristic parsing or soft matching leak into the label itself, pass@k turns into polished ambiguity. You still get a number, but it no longer tracks what engineers care about.
In practice that usually lowers the absolute metric. That is not the verifier being harsh for its own sake; it is the benchmark finally charging extraction mistakes, compile failures, and sandbox reality to the sample instead of letting heuristics count them as almost-correct.
That is also why the original HumanEval paper matters here for more than the task list. It popularized an unbiased pass@k estimator over repeated samples, but that math only helps after the verifier has already produced stable binary labels. The estimator reduces sampling bias; it does not rescue extraction drift, a moving sandbox contract, or a harness that keeps changing what counts as the authoritative code region.
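For reference, the unbiased estimator from the HumanEval paper is 1 - C(n-c, k) / C(n, k) for n samples of which c pass. The sketch below computes it per task and averages over a suite. The function names and the label strings are hypothetical; the only assumption that matters is that every label was emitted by the verifier and that exactly one label value means success.

```python
from math import comb
from typing import Dict, List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them verified passes."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def suite_pass_at_k(labels_by_task: Dict[str, List[str]], k: int) -> float:
    """Average per-task pass@k over verifier labels; only 'pass' counts as success."""
    if not labels_by_task:
        raise ValueError("no tasks to aggregate")
    scores = []
    for labels in labels_by_task.values():
        passes = sum(1 for label in labels if label == "pass")
        scores.append(pass_at_k(len(labels), passes, k))
    return sum(scores) / len(scores)
```

Nothing in this code can tell a real compile failure from an extraction mistake that was mislabeled upstream, which is exactly why the labels have to be settled before the estimator runs.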
| Metric view | Good use | Bad use |
|---|---|---|
| compile rate | measure syntax and toolchain compatibility | treat it as complete task success |
| test pass rate | measure executable correctness | ignore extraction instability |
| pass@k | summarize verified multi-sample success | compute over heuristic or manually corrected labels |
| failure buckets | diagnose regressions | hide all failures inside one blended score |
The sandbox contract is part of the benchmark, not an implementation detail
Evaluation people often talk about models and datasets while quietly changing the environment underneath them. For executable tasks that is a mistake. The compiler, flags, timeout limits, filesystem view, and allowed includes are all part of the benchmark contract.
The stack is strongest when those assumptions are explicit. A verifier should not just say "compile failed." It should know what compiler was used, which phase timed out, whether the failure happened before tests began, and whether the task was single-file or multi-file. That kind of reporting is not bureaucracy. It is what makes results comparable across weeks of research.
Separate compile and execute timeouts are part of that contract, not a logging detail. A template-heavy candidate that stalls inside the compiler and a binary that compiles and then hangs are different failure surfaces, and the verifier should preserve that difference instead of collapsing both into one generic timeout.
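One way to keep the two timeouts distinct is to run each phase through the same helper and prefix the label with the phase name. The sketch below uses only the standard `subprocess` module; the helper names, the label strings, and the choice to map a missing executable to `sandbox_error` are assumptions for illustration, and the commands are passed in by the caller rather than hard-coded.

```python
import subprocess
from typing import List, Tuple


def run_phase(cmd: List[str], timeout_s: float, phase: str) -> Tuple[str, str]:
    """Run one phase under its own timeout and return (label, stderr)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The phase name stays in the label so the two timeouts never merge.
        return f"{phase}_timeout", ""
    except FileNotFoundError as exc:
        # The tool itself is missing: an environment problem, not a candidate problem.
        return "sandbox_error", str(exc)
    if proc.returncode != 0:
        return f"{phase}_error", proc.stderr
    return "ok", proc.stderr


def compile_then_run(
    compile_cmd: List[str],
    run_cmd: List[str],
    compile_timeout_s: float,
    execute_timeout_s: float,
) -> str:
    """Compile first, execute only if compilation succeeded, keep labels per phase."""
    label, _ = run_phase(compile_cmd, compile_timeout_s, "compile")
    if label != "ok":
        return label
    label, _ = run_phase(run_cmd, execute_timeout_s, "execute")
    return label
```

A caller would pass the declared compiler invocation as `compile_cmd` and the produced binary as `run_cmd`; a stalled compiler then surfaces as `compile_timeout` and a hanging binary as `execute_timeout`.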
The deployment checklist is short, but it does need to be explicit:
| Requirement | Why it matters |
|---|---|
| fixed toolchain and flags | prevents accidental benchmark drift |
| separate compile and execute timeouts | distinguishes language validity from runtime behavior |
| stable memory/filesystem limits | keeps results comparable |
| explicit single-file vs multi-file policy | avoids hidden task-shape bias |
| structured result schema | makes regressions diagnosable |
Dependency policy belongs on that same checklist. A lane that silently links an extra library, exposes helper headers, or shifts from single-file to multi-file tasks has changed the benchmark even if the prompts stayed the same. That is why environment mismatch deserves its own failure bucket instead of being folded into generic compile failure. Compile commands and semantic graphs is the adjacent surface for that boundary because it makes the build context reader-visible instead of burying it inside folklore.
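A declared contract is easier to hold stable when it is a value that can be hashed and compared. The sketch below is one possible shape, assuming a frozen dataclass whose fields mirror the checklist; the field names, the `fingerprint` helper, and the `environment_mismatch` label are hypothetical, and any compiler name or flag a caller puts in is the caller's own declaration, not a value taken from the repo.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class SandboxContract:
    compiler: str                      # declared toolchain identifier
    flags: Tuple[str, ...]             # exact flag list, order preserved
    compile_timeout_s: float
    execute_timeout_s: float
    memory_limit_mb: int
    multi_file: bool                   # explicit single-file vs multi-file policy
    linked_libraries: Tuple[str, ...] = ()

    def fingerprint(self) -> str:
        """Stable short hash over every declared field."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def check_contract(declared: SandboxContract, observed: SandboxContract) -> str:
    """Return 'ok' when the lane matches its declaration, else a dedicated bucket."""
    if declared.fingerprint() == observed.fingerprint():
        return "ok"
    return "environment_mismatch"
```

Storing the fingerprint next to each result makes a silently added library or a changed flag show up as a contract change rather than as an unexplained shift in compile failures.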
The harness should preserve evidence, not just publish scores
One of the best lessons from the current design is that final numbers are not enough. Engineers need to inspect the path from raw sample to extracted source to compile result to test verdict. The harness is useful precisely because it can sit above those details without erasing them.
That evidence chain is what lets a team tell the difference between four very different events:
- the model generated wrong logic
- the model generated correct logic wrapped in unsupported formatting
- the toolchain or sandbox changed
- the verifier itself regressed
Those are not edge cases. They are routine failure classes in real model-eval work. A verifier-first stack makes them visible enough to debug.
It also needs a failure taxonomy with real teeth. Extraction failure, dependency or environment mismatch, compile error, link-time error, runtime crash or timeout, and semantic test failure are different signals. If those are all merged into one generic fail label, pass-rate changes stop being diagnosable.
The evidence bundle should be cheap enough to persist on every sample too: extracted source, normalized compile command, compile stderr slice, timeout class, and final failure bucket. That is what makes later pass@k deltas interpretable. If a score drops after a toolchain bump, the team should be able to prove whether the candidates got worse or whether the verifier contract moved.
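A per-sample receipt along those lines could look like the following. This is a sketch of one possible schema, not the checked-in one: the `FailureBucket` members follow the taxonomy named above, while the class names, field names, and the stderr cap of 2000 characters are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class FailureBucket(str, Enum):
    PASS = "pass"
    EXTRACTION_FAILURE = "extraction_failure"
    ENVIRONMENT_MISMATCH = "environment_mismatch"
    COMPILE_ERROR = "compile_error"
    LINK_ERROR = "link_error"
    RUNTIME_CRASH = "runtime_crash"
    TIMEOUT = "timeout"
    TEST_FAILURE = "test_failure"


@dataclass(frozen=True)
class SampleReceipt:
    task_id: str
    extracted_source: str
    compile_command: str
    compile_stderr_slice: str
    timeout_class: str       # "none", "compile", or "execute"
    bucket: FailureBucket

    def to_json(self) -> str:
        record = asdict(self)
        record["bucket"] = self.bucket.value  # store the plain label string
        return json.dumps(record, sort_keys=True)


def make_receipt(task_id: str, source: str, command: str, stderr: str,
                 timeout_class: str, bucket: FailureBucket,
                 max_stderr_chars: int = 2000) -> SampleReceipt:
    """Build a receipt, capping stderr so it stays cheap to persist per sample."""
    return SampleReceipt(task_id, source, command, stderr[:max_stderr_chars],
                         timeout_class, bucket)
```

Because each receipt is one flat JSON object, two runs can be compared bucket by bucket after a toolchain bump instead of only by top-line score.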
What MegaCpp should institutionalize
The long-term rule should be simple: for executable tasks, the verifier owns the success label and every aggregate metric must be downstream of that fact.
That implies a concrete standard:
- deterministic extraction is mandatory
- compile and run phases are reported separately
- sandbox assumptions are declared and stable
- pass@k is computed only from verifier-backed labels
- reports include failure buckets, not only one top-line score
This is stricter than leaderboard culture, but it is also more useful. Training stacks need metrics that survive contact with debugging. If a score drops, the team should be able to tell whether the model regressed, the format changed, or the verifier moved. The only way to make that possible is to treat verification as the authority from the start.
Frequently asked questions
Why keep compile timeout separate from execute timeout?
A candidate that stalls inside g++ is a compile-contract problem; a binary that compiles and then hangs is a runtime or logic problem.
Why keep extraction failure separate from compile failure?
Why not let syntax-aware extraction repair the candidate?
What environment fields have to be frozen for a C++ benchmark lane?
Why treat environment mismatch as its own bucket instead of compile failure?
Why keep link-time error separate from compile error?
An undefined reference means the code shape compiled but the declared build contract was still incomplete. Keeping link-time failure separate helps distinguish broken candidate structure from missing translation units, moved symbols, or multi-file lane drift in C++ eval suites and verifiers and Compile commands and semantic graphs.
Why can pass@k fall after the verifier gets stricter?
Where is the fastest checked-in proof surface for this contract?