Published April 18, 2026 • 8 min read • David Gornshtein

Eval Harness Plumbing: The Parts That Are Not the Benchmark

The four-axis eval harness plumbing under our C++ benchmarks: sandboxing, compile walls, timeouts, parallel runners, flake isolation, and the contract tests a new benchmark has to pass before it goes into CI.

This post is not about eval results. It is about the mechanics that sit under every eval number we publish: how a model completion gets from tokenizer output to an exit-coded pass/fail, what isolates that from everything else running on the box, and why adding a new benchmark is not "write a scorer and point it at the checkpoints."

In this series, the harness is the orchestration layer around datasets, scheduling, and artifacts; the verifier is the compile-and-test authority inside that harness; and a watcher is the long-lived poller that notices new checkpoints, runs the harness phases, and mirrors result receipts under namespaced state. The checked-in proof surfaces are split on purpose: C++ eval suites and verifiers shows the verifier vocabulary and failure buckets, Compile and runtime capture examples shows the generic compile/runtime guardrails, and Compile/runtime receipt sample is only a compact structured receipt rather than the whole eval harness.

If you want checked-in proof surfaces before prose, start with C++ eval suites and verifiers and Compile/runtime receipt sample, then Compile and runtime capture examples, the Compile examples catalog, Data and masking examples, Semantic indexing notes, and Reference corpus pinning notes.

That ordering matters. The verifier article explains what a pass/fail label means, the compile examples show the subprocess and timeout discipline the harness depends on, and this post is about how those pieces are scheduled, isolated, and preserved across runs.

The other useful boundary file next to this article is The Clang semantic indexer. That post explains why build-aware static context can move context-adherence and hallucination labels without changing the runtime harness at all. The eval harness only makes sense if the verifier lane, the compile lane, and the semantic-context lane stay distinct.

Why this matters

A four-axis eval whose harness is sloppy will produce a four-axis lie. We grade C++ generations on compile probability, context adherence, hallucination rate, and end-to-end correctness, and two of those, compile probability and correctness, require actually compiling or running model output. Running adversarially-shaped C++ on shared infrastructure with naive subprocess plumbing leaks zombies, miscounts compiler errors as model failures, and silently drifts numbers across watcher restarts.

1. The four axes, briefly

We measure code generation on four axes because perplexity does not tell you whether the output is code:

Compile probability against the original translation unit.
Context adherence: does the model call functions that are actually in the provided call graph, or does it invent them?
Hallucination rate: references to non-existent symbols, headers, or overloads.
Correctness, graded against a held-out test set of cross-file prompts.

For first touch, the terms mean the following:

Axis	First-touch definition	What can move it for the wrong reason
compile probability	the share of completions that reach a successful compile under the declared verifier toolchain and flags	missing compiler, timeout drift, or wrong error bucketing
context adherence	whether generated callees, includes, and symbol uses stay inside the provided translation unit and call graph	stale build context or broken symbol extraction
hallucination rate	how often the model references headers, symbols, or overloads that do not exist in the provided repo context	parser drift can look like model invention
correctness	whether the compiled candidate passes the held-out test authority	sandbox, timeout, and result-write bugs can invalidate comparisons

Only compile and correctness require running generated code. Context adherence and hallucination are static parses against repository-derived context.

2. The sandbox

"Sandbox" overstates it. What we actually run is a disciplined use of TemporaryDirectory plus subprocess.run with hard timeouts and captured output. The closest checked-in matches are C++ eval suites and verifiers for compileQuick term guideCompileWhy MegaCpp treats regional compile as a runtime-boundary decision rather than a blanket switch, and how compile ordering stays tied to distributed…GroundingRegional compile without losing the plot Dynamo and torch.compile Breakage on a Mamba-3 Hybrid/test labels and Compile and runtime capture examples for the subprocess-and-timeout discipline.

The underlying helper does five things:

Writes prompt + model_completion + test_code to a fresh solution.cpp in a per-task temp directory.
Calls the compiler with a fixed flag contract.
If compilation fails, returns a compile-failure label with bounded stderr.
If compilation succeeds, runs the binary under a shorter runtime timeout.
Returns a structured (passed, reason) style result.
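
The five steps above can be sketched roughly as follows. This is a minimal illustration, not the checked-in helper: the function name `compile_and_run_sketch`, the `g++` flags, the timeout values, the stderr cap, and the reason strings are all assumptions made for the example.

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical defaults; the real flag contract and limits are not shown in this post.
DEFAULT_COMPILER = ("g++", "-std=c++17", "-O1")


def compile_and_run_sketch(prompt, completion, test_code,
                           compiler=DEFAULT_COMPILER,
                           compile_timeout=30.0, run_timeout=5.0,
                           max_stderr=2000):
    """Return a (passed, reason) tuple for one candidate program."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "solution.cpp"
        exe = Path(tmp) / "solution"
        # Step 1: prompt + completion + tests in one fresh translation unit.
        src.write_text(prompt + completion + test_code)
        try:
            # Step 2: fixed compiler command, hard timeout, captured output.
            built = subprocess.run([*compiler, str(src), "-o", str(exe)],
                                   capture_output=True, text=True,
                                   timeout=compile_timeout)
        except FileNotFoundError:
            # A missing compiler is a harness problem, not a model failure.
            return False, "harness_error: compiler not found"
        except subprocess.TimeoutExpired:
            return False, "compile_timeout"
        if built.returncode != 0:
            # Step 3: compile failure with bounded stderr.
            return False, "compile_failure: " + built.stderr[:max_stderr]
        try:
            # Step 4: run the binary under the shorter runtime timeout.
            ran = subprocess.run([str(exe)], capture_output=True, text=True,
                                 timeout=run_timeout)
        except subprocess.TimeoutExpired:
            return False, "runtime_timeout"
        except OSError:
            # The compiler exited 0 but left nothing runnable behind.
            return False, "harness_error: binary not runnable"
        if ran.returncode != 0:
            return False, f"runtime_failure: exit {ran.returncode}"
        # Step 5: structured result.
        return True, "pass"
```

Keeping "compiler not found" in its own label is the point of the sketch: a broken toolchain should never be counted as a model that failed to compile.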

The real isolation boundary is not perfect syscall sandboxing. It is that the harness runs on dedicated eval machines with fixed toolchains and declared limits.

That distinction matters because tempdir hygiene and host isolation solve different problems. A fresh directory keeps paths clean and cleanup understandable; it does not, by itself, cap memory, hide host processes, or disable network access. If the eval lane ever has to widen beyond fixed dedicated machines, the next safety boundary is kernel-enforced isolation such as namespaces, cgroups, and seccomp, or a container runner built on the same controls, rather than more naming discipline around temp paths.

The practical upgrade path is usually narrower than "copy everything into a sandbox image." Large checkpoints and benchmark payloads can stay mounted as read-only inputs while the verifier gets only a bounded writable scratch directory for compiler outputs and receipts.

3. The compile wall

Compilation, not inference, is often the most expensive part of a C++ eval. That asymmetry drives the plumbing:

Generation pass: load the model, walk the benchmark, store completions.
Compile pass: fan out across a CPU worker pool that compiles and runs those completions.
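
A minimal sketch of the compile-pass fan-out, assuming completions were already stored by the generation pass. The name `verify_all` and the `verify` callable are hypothetical; a thread pool is used here only because each worker spends its time blocked on a compiler subprocess, and the real pool type is not stated in this post.

```python
from concurrent.futures import ThreadPoolExecutor


def verify_all(completions, verify, max_workers=4):
    """Score every stored completion with `verify`, bounded by max_workers.

    Results come back in input order, so sample i stays attached to
    completion i no matter which worker finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(verify, completions))
```

Because `pool.map` preserves input order, per-sample accounting does not depend on scheduling.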

In this stack, pass@k means the probability estimate that at least one of k verifier-scored samples is correct, not a hand-picked best case.

That wording matters because the accounting surface is a normalized verifier receipt, not a pile of compiler logs. The harness can keep bounded stderr for debugging, but pass@k only stays comparable across watcher restarts and toolchain drift if each sample first lands in the same structured bucket for extraction failure, compile failure, runtime failure, or pass. Compile/runtime receipt sample is the compact checked-in reminder of that receipt shape, and C++ eval suites and verifiers is the fuller explanation of why the score comes from verifier labels rather than free-form diagnostics.
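
To make the bucket idea concrete, here is a hypothetical receipt type and the per-task counts that pass@k needs. The four bucket names come from the paragraph above; the class names, field names, and `correct_counts` are assumptions, and the checked-in receipt may be shaped differently.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    EXTRACTION_FAILURE = "extraction_failure"
    COMPILE_FAILURE = "compile_failure"
    RUNTIME_FAILURE = "runtime_failure"
    PASS = "pass"


@dataclass(frozen=True)
class Receipt:
    task_id: str
    sample_index: int
    bucket: Bucket
    stderr_excerpt: str = ""  # debugging aid only; scoring never reads it


def correct_counts(receipts):
    """Per task, return (n, c): total samples and samples in the PASS bucket."""
    totals, passes = Counter(), Counter()
    for receipt in receipts:
        totals[receipt.task_id] += 1
        if receipt.bucket is Bucket.PASS:
            passes[receipt.task_id] += 1
    return {task: (totals[task], passes[task]) for task in totals}
```

Scoring reads only `bucket`, so a toolchain upgrade that rewords diagnostics changes `stderr_excerpt` without moving any count.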

Once you run multi-sample evals, there is another quiet CPU leak: many candidate programs are duplicates or near-duplicates. Exact-text compile reuse and conservative compilation caches belong below the benchmark layer because they change cost more than truth: the harness can still record multiple draws while avoiding redundant verifier work for one effectively identical program shape.
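
One way to picture exact-text reuse is a cache keyed on the program text plus the compiler flags. The class below is an illustrative sketch, not the harness's cache: `VerdictCache`, its key layout, and the `verify` callable are assumptions.

```python
import hashlib


class VerdictCache:
    """Reuse verifier results for byte-identical programs under identical flags."""

    def __init__(self, verify):
        self._verify = verify      # hypothetical callable: source -> verdict
        self._seen = {}
        self.verifier_calls = 0

    def score(self, source, flags=()):
        # Flags are part of the key: the same text under different flags
        # is a different verifier question.
        material = "\0".join([*flags, source]).encode("utf-8")
        key = hashlib.sha256(material).hexdigest()
        if key not in self._seen:
            self.verifier_calls += 1
            self._seen[key] = self._verify(source)
        return self._seen[key]
```

Every draw still gets its own verdict in the caller's list; only the compile-and-run work is shared, which is why the cache changes cost and not the score.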

4. Timeouts, layered

Timeouts nest, and the wrong nesting silently changes your numbers.

Scope	Limit	On expiry	Counted as
per compile	fixed compile timeout	compile timeout	did not compile
per run	fixed runtime timeout	runtime timeout	runtime failure
per task end to end	bounded by generation + verifier wall	task failure	bounded
per benchmark	watcher-configured	watcher kill	degraded

One thing the harness has to get right is cleanup. A timeout inside subprocess.run(..., capture_output=True) can leave subprocesses behind if process groups are not handled correctly.
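
A hedged sketch of the process-group handling, for POSIX hosts only. The helper name `run_with_group_timeout` is hypothetical. It starts the child in its own session so that a timeout can kill the whole group, including any grandchildren the generated program forked.

```python
import os
import signal
import subprocess


def run_with_group_timeout(cmd, timeout):
    """Run cmd; on timeout, SIGKILL its entire process group (POSIX only).

    Returns (returncode, stdout, stderr, timed_out).
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            text=True, start_new_session=True)
    try:
        out, err = proc.communicate(timeout=timeout)
        return proc.returncode, out, err, False
    except subprocess.TimeoutExpired:
        try:
            # start_new_session made the child a group leader, so its pid
            # is also the process-group id.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        out, err = proc.communicate()  # reap the child and drain the pipes
        return proc.returncode, out, err, True
```

Killing only the direct child would leave any forked grandchild alive and still holding the captured pipes, which is the leak the paragraph above describes.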

5. Parallel runners and watcher state

Parallelism happens on two layers.

Inside one checkpoint, compilation fan-out over a bounded CPU worker pool.

Across checkpoints, a watcher fleet that polls for new checkpoints, runs generation and compile phases, and mirrors the result JSON.

The namespaced watcher state is what keeps checkpoint isolation deterministic. A watcher should never confuse one checkpoint directory with another just because the last path component happens to match.
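
As an illustration of that rule, a state key can be derived from the full resolved path instead of the last path component. The function name and key format below are assumptions, not the watcher's actual scheme.

```python
import hashlib
from pathlib import Path


def checkpoint_state_key(checkpoint_dir):
    """Build a watcher-state key from the full resolved checkpoint path."""
    resolved = Path(checkpoint_dir).resolve()
    digest = hashlib.sha256(str(resolved).encode("utf-8")).hexdigest()[:12]
    # Keep the basename for readability; the digest carries the identity.
    return f"{resolved.name}-{digest}"
```

Two runs that both contain a `step_1000` directory then get different keys, while the same directory always maps to the same key across watcher restarts.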

6. Flake isolation

Flakes in an execution-based eval come from three places:

Model-side flakes

A non-greedy decode at temperature > 0 is explicitly stochastic. That is intentional for pass@k, so the harness logs seeds and keeps task ordering pinned.
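
One hypothetical way to make logged seeds reproducible is to derive each sample's seed from a base seed, the task id, and the sample index, and to walk tasks in sorted order. The names `sample_seed` and `pinned_order` and the hashing scheme are assumptions for illustration only.

```python
import hashlib


def sample_seed(base_seed, task_id, sample_index):
    """Derive a stable 32-bit seed for one (task, sample) pair."""
    material = f"{base_seed}:{task_id}:{sample_index}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(material).digest()[:4], "big")


def pinned_order(task_ids):
    """Return task ids in a fixed order that does not depend on load order."""
    return sorted(task_ids)
```

With this shape, rerunning sample 3 of one task does not require replaying the random stream of every task before it.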

Generated-program flakes

Generated C++ can be non-deterministic on its own. The harness mitigates by preferring assert-based pass/fail over stdout matching and by keeping the runtime contract narrow.

Harness-side flakes

The ones worth planning for are tempdir cleanup races, compiler OOM on pathological templates, and partial result writes when uploads or mirrors happen before flushes complete.
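
The partial-write case is usually handled by writing to a temporary file in the same directory and renaming it into place. A minimal sketch, assuming results are JSON on a local filesystem; `write_result_atomically` is a hypothetical name.

```python
import json
import os
import tempfile


def write_result_atomically(path, result):
    """Write result JSON so readers see the old file or the new one, never half."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(result, handle)
            handle.flush()
            os.fsync(handle.fileno())  # data reaches disk before the rename
        os.replace(tmp_path, path)     # atomic replacement of the target
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A mirror or upload step that only ever reads the final path can then never pick up a truncated result.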

Compiler resource exits need their own accounting boundary. A pathological template that exhausts the compile lane did not produce a correct program, but it also should not look like an assert failure or a missing-symbol hallucination. The safe rule is to let the verifier mark it as a compile-side resource failure, keep the stderr or exit detail bounded, and keep the host-protection mechanism out of the published model score.

Compiler diagnostics are also a quiet flake surface. Even when pass/fail is stable, stderr wording and line-number context can drift with toolchain revision or parallel compiler behavior. That is why the harness keeps bounded stderr for debugging and regression triage but still treats the structured verifier receipt as the scoring authority.

7. Contract tests for a new benchmark

A new benchmark module has to pass a contract test set before it enters rotation.

Required surface:

load_examples() returns the declared example schema.
generate_completion(...) uses the shared stopping and extraction rules.
compile_and_run(...) classifies compile, runtime, extraction, and harness failures in the same result vocabulary the reports expect.
Problem-set schema tests reject empty or malformed benchmark inputs.
Difficulty-distribution checks stop mislabeled benchmark mixes from silently entering rotation.
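
A small sketch of what the first layer of such a contract check could look like. The three callable names come from the list above; `missing_surface`, `check_examples`, and the required field names are hypothetical and stand in for whatever schema the benchmark actually declares.

```python
REQUIRED_CALLABLES = ("load_examples", "generate_completion", "compile_and_run")


def missing_surface(module):
    """Return the required callables that a benchmark module does not expose."""
    return [name for name in REQUIRED_CALLABLES
            if not callable(getattr(module, name, None))]


def check_examples(examples, required_fields=("task_id", "prompt", "test_code")):
    """Reject empty or malformed inputs; the field names are illustrative."""
    if not examples:
        raise ValueError("benchmark has no examples")
    for index, example in enumerate(examples):
        missing = [field for field in required_fields if not example.get(field)]
        if missing:
            raise ValueError(f"example {index} is missing {missing}")
```

Checks like these are cheap enough to run on every import of a benchmark module, well before any checkpoint is scored.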

Required tests before the first checkpoint is allowed to use it:

shared pass@k sanity cases
brace-depth and extraction tests
tool-wrapper and code-block extraction tests
checkpoint-directory isolation tests
end-to-end smoke bands on known-good and known-bad checkpoints

The pass@k block is not ceremonial. It has to lock the edge case where fewer than k samples are wrong and the estimator should return 1.0, plus the monotonic and receipt-stability checks that keep multi-sample accounting honest when duplicate draws collapse to one effective program shape.
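
For reference, the standard unbiased pass@k estimator with n samples, of which c are correct, is 1 - C(n - c, k) / C(n, k). The sketch below assumes the harness uses that standard form; the function name and argument checks are illustrative. "Fewer than k samples are wrong" means n - c < k.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for n samples with c correct ones."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample,
        # so there is no all-failing subset to count.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `math.comb` the early return is a guard rather than a necessity, because `comb` already returns 0 when k exceeds n - c. It becomes essential in the product form of the estimator that is often used for numerical stability, where the same case would otherwise multiply through invalid terms.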

Only after those pass does the benchmark get a slot in the watcher's rotation.

What we kept and what we threw away

We kept the split between static axes and execution axes, the simple subprocess-plus-tempdir sandbox, the layered timeouts, the namespaced watcher state, the contract tests, and atomic result writes.

We threw away retry-heavy timeout handling, judge-driven grading for static axes, and dynamic ignore lists for flaky tasks.

Frequently asked questions

Why separate generation and compilation into different phases?

Because inference is GPU-bound while compilation is CPU-bound. Decoupling them keeps the expensive generation lane from idling while the compiler works through completions.

What pass@k edge case has to be handled explicitly?

If fewer than k samples are wrong, the unbiased estimator should return 1.0 directly rather than trying to evaluate an impossible failing subset.

Why keep brace-depth tests if syntax-aware extraction is the safer direction?

Because they catch the cheap failure class, not the full scoring authority. Once completions mix reasoning text, multiple fenced regions, or helper code outside the first obvious block, a pure brace counter is too fragile to decide what code should be scored. The harness therefore keeps brace-depth regressions as an early guard while the real extraction contract stays syntax-aware, so extraction failures remain their own bucket instead of being misreported as compile failures. Verifier-first C++ evals is the adjacent read for that boundary.
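
To show why a pure brace counter is only an early guard, here is a deliberately naive version. It is an illustration written for this answer, not the harness's extractor, and it ignores strings, comments, and character literals on purpose.

```python
def truncate_at_balanced_braces(text):
    """Return text up to the brace that closes the first opened block, or None.

    Naive on purpose: braces inside string literals and comments are counted
    like any other brace.
    """
    depth = 0
    for index, char in enumerate(text):
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth < 0:
                return None  # closing brace with nothing open
            if depth == 0:
                return text[: index + 1]
    return None  # never balanced
```

On `int f() { if (x) { return 1; } return 0; } trailing prose` it returns the function body and drops the prose, but on `void g() { puts("}"); }` it stops at the brace inside the string literal. That second case is the kind of cheap regression a brace-depth test pins down, and the reason the scoring path needs syntax-aware extraction.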

Why is raw compiler stderr not the scoring authority?

Because diagnostic text drifts more easily than the underlying verifier outcome. The harness can store bounded stderr for debugging, but the published buckets need to come from the structured verifier receipt.

What is the difference between the harness and the verifier here?

The harness owns task queues, watcher state, receipts, and aggregation. The verifier owns extraction, compilation, execution, and the pass label.
