Eval Harness Plumbing: The Parts That Are Not the Benchmark
The four-axis eval harness plumbing under our C++ benchmarks: sandboxing, compile walls, timeouts, parallel runners, flake isolation, and the contract tests a new benchmark has to pass before it goes into CI.

This post is not about eval results. It is about the mechanics that sit under every eval number we publish: how a model completion gets from tokenizer output to an exit-coded pass/fail, what isolates that from everything else running on the box, and why adding a new benchmark is not "write a scorer and point it at the checkpoints."
In this series, the harness is the orchestration layer around datasets, scheduling, and artifacts; the verifier is the compile-and-test authority inside that harness; and a watcher is the long-lived poller that notices new checkpoints, runs the harness phases, and mirrors result receipts under namespaced state. The checked-in proof surfaces are split on purpose: C++ eval suites and verifiers shows the verifier vocabulary and failure buckets, Compile and runtime capture examples shows the generic compile/runtime guardrails, and Compile/runtime receipt sample is only a compact structured receipt rather than the whole eval harness.
If you want checked-in proof surfaces before prose, start with C++ eval suites and verifiers and Compile/runtime receipt sample, then Compile and runtime capture examples, the Compile examples catalog, Data and masking examples, Semantic indexing notes, and Reference corpus pinning notes.
That ordering matters. The verifier article explains what a pass/fail label means, the compile examples show the subprocess and timeout discipline the harness depends on, and this post is about how those pieces are scheduled, isolated, and preserved across runs.
The other useful boundary file next to this article is The Clang semantic indexer. That post explains why build-aware static context can move context-adherence and hallucination labels without changing the runtime harness at all. The eval harness only makes sense if the verifier lane, the compile lane, and the semantic-context lane stay distinct.
Why this matters
A four-axis eval whose harness is sloppy will produce a four-axis lie. We grade C++ generations on compile probability, context adherence, hallucination rate, and end-to-end correctness, and two of those, compile probability and correctness, require actually building or running model output as a program. Running adversarially-shaped C++ on shared infrastructure with naive subprocess plumbing leaks zombies, miscounts compiler errors as model failures, and silently drifts numbers across watcher restarts.
1. The four axes, briefly
We measure code generation on four axes because perplexity does not tell you whether the output is code:
- Compile probability against the original translation unit.
- Context adherence: does the model call functions that are actually in the provided call graph, or does it invent them?
- Hallucination rate: references to non-existent symbols, headers, or overloads.
- Correctness, graded against a held-out test set of cross-file prompts.
For first touch, the terms mean the following:
| Axis | First-touch definition | What can move it for the wrong reason |
|---|---|---|
| compile probability | the share of completions that reach a successful compile under the declared verifier toolchain and flags | missing compiler, timeout drift, or wrong error bucketing |
| context adherence | whether generated callees, includes, and symbol uses stay inside the provided translation unit and call graph | stale build context or broken symbol extraction |
| hallucination rate | how often the model references headers, symbols, or overloads that do not exist in the provided repo context | parser drift can look like model invention |
| correctness | whether the compiled candidate passes the held-out test authority | sandbox, timeout, and result-write bugs can invalidate comparisons |
Only compile and correctness require running generated code. Context adherence and hallucination are static parses against repository-derived context.
2. The sandbox
"Sandbox" overstates it. What we actually run is a disciplined use of TemporaryDirectory plus subprocess.run with hard timeouts and captured output. The closest checked-in matches are C++ eval suites and verifiers for compile/test labels and Compile and runtime capture examples for the subprocess-and-timeout discipline.
The underlying helper does five things:
- Writes prompt + model_completion + test_code to a fresh solution.cpp in a per-task temp directory.
- Calls the compiler with a fixed flag contract.
- If compilation fails, returns a compile-failure label with bounded stderr.
- If compilation succeeds, runs the binary under a shorter runtime timeout.
- Returns a structured (passed, reason) style result.
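The five steps above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual helper: the compiler name, the flags, both timeout values, the stderr bound, and the reason strings are all hypothetical stand-ins for the "fixed flag contract" and "declared limits" the article mentions.

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical values; the article only says these are fixed and declared.
COMPILE_TIMEOUT_S = 30
RUN_TIMEOUT_S = 5
CXX_FLAGS = ["-std=c++17", "-O0"]
STDERR_LIMIT = 2000


def compile_and_run(prompt, completion, test_code, compiler="g++"):
    """Return a (passed, reason) pair for one candidate program."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "solution.cpp"
        binary = Path(tmp) / "solution"
        # Step 1: one fresh source file per task.
        src.write_text(prompt + completion + test_code, encoding="utf-8")
        # Step 2: compile under the fixed flag contract.
        try:
            cc = subprocess.run(
                [compiler, *CXX_FLAGS, str(src), "-o", str(binary)],
                capture_output=True, encoding="utf-8", errors="replace",
                timeout=COMPILE_TIMEOUT_S,
            )
        except FileNotFoundError:
            # A missing toolchain is a harness problem, not a model failure.
            return False, "harness_error: compiler not found"
        except subprocess.TimeoutExpired:
            return False, "compile_timeout"
        # Step 3: compile failure carries only a bounded slice of stderr.
        if cc.returncode != 0:
            return False, "compile_failure: " + cc.stderr[:STDERR_LIMIT]
        # Step 4: run the binary under the shorter runtime timeout.
        try:
            run = subprocess.run(
                [str(binary)], capture_output=True, timeout=RUN_TIMEOUT_S
            )
        except subprocess.TimeoutExpired:
            return False, "runtime_timeout"
        # Step 5: structured result.
        if run.returncode != 0:
            return False, f"runtime_failure: exit {run.returncode}"
        return True, "pass"
```

Keeping the missing-compiler case in its own reason string is one way to avoid the "missing compiler" distortion listed in the axis table: the sample is marked as a harness error instead of counting against compile probability.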
The real isolation boundary is not perfect syscall sandboxing. It is that the harness runs on dedicated eval machines with fixed toolchains and declared limits.
That distinction matters because tempdir hygiene and host isolation solve different problems. A fresh directory keeps paths clean and cleanup understandable; it does not, by itself, cap memory, hide host processes, or disable network access. If the eval lane ever has to widen beyond fixed dedicated machines, the next safety boundary is kernel-enforced isolation such as namespaces, cgroups, and seccomp, or a container runner built on the same controls, rather than more naming discipline around temp paths.
The practical upgrade path is usually narrower than "copy everything into a sandbox image." Large checkpoints and benchmark payloads can stay mounted as read-only inputs while the verifier gets only a bounded writable scratch directory for compiler outputs and receipts.
3. The compile wall
Compilation, not inference, is often the most expensive part of a C++ eval. That asymmetry drives the plumbing:
- Generation pass: load the model, walk the benchmark, store completions.
- Compile pass: fan out across a CPU worker pool that compiles and runs those completions.
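A rough sketch of that two-phase split, under stated assumptions: `generate` and `verify` are hypothetical callables supplied by the caller, the dictionary keys are illustrative, and a thread pool is used here only because each worker would spend its time blocked on a compiler subprocess.

```python
from concurrent.futures import ThreadPoolExecutor


def generation_pass(tasks, generate):
    """Phase 1: walk the benchmark once and store completions."""
    return [{"task_id": t["task_id"], "completion": generate(t)} for t in tasks]


def compile_pass(stored, verify, max_workers=8):
    """Phase 2: fan stored completions out over a bounded worker pool.

    pool.map preserves input order, so verdicts line up with completions
    even though the work finishes out of order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        verdicts = list(pool.map(verify, stored))
    return [{**s, "verdict": v} for s, v in zip(stored, verdicts)]
```

Because the completions are stored between phases, the compile pass can be rerun or resized without loading the model again.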
In this stack, pass@k means the probability estimate that at least one of k verifier-scored samples is correct, not a hand-picked best case.
That wording matters because the accounting surface is a normalized verifier receipt, not a pile of compiler logs. The harness can keep bounded stderr for debugging, but pass@k only stays comparable across watcher restarts and toolchain drift if each sample first lands in the same structured bucket for extraction failure, compile failure, runtime failure, or pass. Compile/runtime receipt sample is the compact checked-in reminder of that receipt shape, and C++ eval suites and verifiers is the fuller explanation of why the score comes from verifier labels rather than free-form diagnostics.
Once you run multi-sample evals, there is another quiet CPU leak: many candidate programs are duplicates or near-duplicates. Exact-text compile reuse and conservative compilation caches belong below the benchmark layer because they change cost more than truth: the harness can still record multiple draws while avoiding redundant verifier work for one effectively identical program shape.
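One way to picture such a receipt is the sketch below. The four bucket names come from the paragraph above; the field names, the `make_receipt` helper, and the 2000-character stderr bound are assumptions for illustration and are not taken from the checked-in receipt sample.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Bucket(str, Enum):
    EXTRACTION_FAILURE = "extraction_failure"
    COMPILE_FAILURE = "compile_failure"
    RUNTIME_FAILURE = "runtime_failure"
    PASS = "pass"


@dataclass(frozen=True)
class Receipt:
    task_id: str
    sample_index: int
    bucket: Bucket
    stderr_excerpt: str = ""  # debugging aid only, never used for scoring

    def to_json(self):
        payload = {**asdict(self), "bucket": self.bucket.value}
        return json.dumps(payload, sort_keys=True)


def make_receipt(task_id, sample_index, bucket, stderr="", limit=2000):
    """Normalize one sample; an unknown bucket name raises ValueError."""
    return Receipt(task_id, sample_index, Bucket(bucket), stderr[:limit])


def count_correct(receipts):
    """The score reads only the bucket, never the stderr text."""
    return sum(r.bucket is Bucket.PASS for r in receipts)
```

Rejecting unknown bucket names at construction time keeps a renamed or misspelled label from silently creating a fifth category in the reports.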
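Exact-text reuse can be sketched in a few lines. The function name and the `verify` callable are hypothetical; the point is only that every draw still gets its own recorded result while identical program text is verified once.

```python
import hashlib


def verify_with_reuse(completions, verify):
    """Verify each distinct program text once; return one result per draw.

    Returns (results, distinct_programs). Results stay aligned with the
    input draws, so multi-sample accounting still sees every sample.
    """
    cache = {}
    results = []
    for text in completions:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = verify(text)  # the expensive compile-and-run
        results.append(cache[key])
    return results, len(cache)
```

This only matches byte-identical text. Collapsing near-duplicates would need a normalization step, which is where "conservative" matters: a normalization that merges two programs with different behavior would change truth, not just cost.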
4. Timeouts, layered
Timeouts nest, and the wrong nesting silently changes your numbers.
| Scope | Limit | On expiry | Counted as |
|---|---|---|---|
| per compile | fixed compile timeout | compile timeout | did not compile |
| per run | fixed runtime timeout | runtime timeout | runtime failure |
| per task end to end | bounded by generation + verifier wall | task failure | bounded |
| per benchmark | watcher-configured | watcher kill | degraded |
One thing the harness has to get right is cleanup. A timeout inside subprocess.run(..., capture_output=True) can leave subprocesses behind if process groups are not handled correctly.
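A minimal sketch of process-group cleanup on POSIX systems follows. The `run_in_group` name and its return shape are hypothetical. The idea is to start the child in its own session so that it leads a new process group, then signal the whole group on timeout instead of only the direct child.

```python
import os
import signal
import subprocess
import sys


def run_in_group(cmd, timeout_s):
    """Run cmd in its own process group; kill the whole group on timeout.

    Returns (returncode, stdout, stderr, timed_out). POSIX only.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,  # child calls setsid(), so pgid == pid
    )
    try:
        out, err = proc.communicate(timeout=timeout_s)
        return proc.returncode, out, err, False
    except subprocess.TimeoutExpired:
        try:
            # Signal every process in the group, including grandchildren.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group already exited between timeout and kill
        out, err = proc.communicate()  # reap the child and drain the pipes
        return proc.returncode, out, err, True


if __name__ == "__main__":
    print(run_in_group([sys.executable, "-c", "print('ok')"], 10))
```

The final `communicate()` call matters as much as the kill: without it the killed child is never reaped and lingers as a zombie.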
5. Parallel runners and watcher state
Parallelism happens on two layers.
Inside one checkpoint, compilation fan-out over a bounded CPU worker pool.
Across checkpoints, a watcher fleet that polls for new checkpoints, runs generation and compile phases, and mirrors the result JSON.
The namespaced watcher state is what keeps checkpoint isolation deterministic. A watcher should never confuse one checkpoint directory with another just because the last path component happens to match.
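One simple way to get that property is to derive the state key from the full path instead of the basename. The key format below is an assumption for illustration, not the watcher's actual naming scheme.

```python
import hashlib
from pathlib import PurePosixPath


def watcher_state_key(checkpoint_dir):
    """Build a state key from the whole path, not just the last component."""
    path = PurePosixPath(checkpoint_dir)  # drops any trailing slash
    digest = hashlib.sha256(str(path).encode("utf-8")).hexdigest()[:12]
    # Keep the readable name for humans, add the digest for uniqueness.
    return f"{path.name}-{digest}"
```

Two runs that both contain a `step_1000` directory then map to different keys, while the same directory written with or without a trailing slash maps to the same one.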
6. Flake isolation
Flakes in an execution-based eval come from three places:
Model-side flakes
A non-greedy decode at temperature > 0 is explicitly stochastic. That is intentional for pass@k, so the harness logs seeds and keeps task ordering pinned.
Generated-program flakes
Generated C++ can be non-deterministic on its own. The harness mitigates by preferring assert-based pass/fail over stdout matching and by keeping the runtime contract narrow.
Harness-side flakes
The ones worth planning for are tempdir cleanup races, compiler OOM on pathological templates, and partial result writes when uploads or mirrors happen before flushes complete.
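For the partial-write case, a common pattern is to write to a temporary file in the same directory, flush and fsync it, and only then rename it over the final path. The sketch below assumes JSON results and a hypothetical `write_result_atomic` name.

```python
import json
import os
import tempfile


def write_result_atomic(path, result):
    """Write result JSON so readers see the old file or the new one, never half."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(result, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

An uploader or mirror that only ever reads the final path cannot observe a partially written result with this layout.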
Compiler resource exits need their own accounting boundary. A pathological template that exhausts the compile lane did not produce a correct program, but it also should not look like an assert failure or a missing-symbol hallucination. The safe rule is to let the verifier mark it as a compile-side resource failure, keep the stderr or exit detail bounded, and keep the host-protection mechanism out of the published model score.
Compiler diagnostics are also a quiet flake surface. Even when pass/fail is stable, stderr wording and line-number context can drift with toolchain revision or parallel compiler behavior. That is why the harness keeps bounded stderr for debugging and regression triage but still treats the structured verifier receipt as the scoring authority.
7. Contract tests for a new benchmark
A new benchmark module has to pass a contract test set before it enters rotation.
Required surface:
- load_examples() returns the declared example schema.
- generate_completion(...) uses the shared stopping and extraction rules.
- compile_and_run(...) classifies compile, runtime, extraction, and harness failures in the same result vocabulary the reports expect.
- Problem-set schema tests reject empty or malformed benchmark inputs.
- Difficulty-distribution checks stop mislabeled benchmark mixes from silently entering rotation.
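A schema check of the kind described above might look like the sketch below. The field names (`task_id`, `prompt`, `test_code`) are hypothetical placeholders, since the article does not list the declared example schema.

```python
# Hypothetical required fields; the real declared schema is not shown here.
REQUIRED_FIELDS = ("task_id", "prompt", "test_code")


def validate_examples(examples):
    """Reject empty or malformed benchmark inputs; return the example count."""
    if not examples:
        raise ValueError("benchmark has no examples")
    seen_ids = set()
    for index, example in enumerate(examples):
        if not isinstance(example, dict):
            raise ValueError(f"example {index}: expected a mapping")
        for name in REQUIRED_FIELDS:
            value = example.get(name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"example {index}: missing or empty {name!r}")
        if example["task_id"] in seen_ids:
            raise ValueError(f"example {index}: duplicate task_id")
        seen_ids.add(example["task_id"])
    return len(examples)
```

Running this against the output of `load_examples()` in a contract test makes a malformed problem set fail before any checkpoint is scored on it.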
Required tests before the first checkpoint is allowed to use it:
- shared pass@k sanity cases
- brace-depth and extraction tests
- tool-wrapper and code-block extraction tests
- checkpoint-directory isolation tests
- end-to-end smoke bands on known-good and known-bad checkpoints
The pass@k block is not ceremonial. It has to lock the edge case where fewer than k samples are wrong and the estimator should return 1.0, plus the monotonic and receipt-stability checks that keep multi-sample accounting honest when duplicate draws collapse to one effective program shape.
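For reference, the standard unbiased estimator is 1 - C(n - c, k) / C(n, k) for n samples of which c are correct. A minimal sketch with the edge case made explicit (the function name and argument checks are illustrative, not the harness's own code):

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `math.comb` the guard is technically redundant, because `comb(n - c, k)` is already 0 when k exceeds n - c. Keeping it explicit still pays off, since product-form or log-space implementations of the same estimator do not get that behavior for free, and the sanity cases should pass regardless of which form is in use.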
Only after those pass does the benchmark get a slot in the watcher's rotation.
What we kept and what we threw away
We kept the split between static axes and execution axes, the simple subprocess-plus-tempdir sandbox, the layered timeouts, the namespaced watcher state, the contract tests, and atomic result writes.
We threw away retry-heavy timeout handling, judge-driven grading for static axes, and dynamic ignore lists for flaky tasks.
Frequently asked questions
Why separate generation and compilation into different phases?
What pass@k edge case has to be handled explicitly?
When fewer than k samples are wrong, the unbiased estimator should return 1.0 directly rather than trying to evaluate an impossible failing subset.
Why keep brace-depth tests if syntax-aware extraction is the safer direction?
Keep brace-depth regressions as an early guard while the real extraction contract stays syntax-aware, so extraction failures remain their own bucket instead of being misreported as compile failures. Verifier-first C++ evals is the adjacent read for that boundary.
Why is raw compiler stderr not the scoring authority?
What is the difference between the harness and the verifier here?