Topic Hub

Evaluation, Benchmarks, and Verifier Loops

A curated evaluation reading path: verifier-first harnesses, ablation structure, benchmark receipts, and the evidence rules that keep comparisons from collapsing into anecdotes.

This hub starts with the evaluation contract itself, then moves into the ablation and comparison layer, and finishes with the benchmark receipts and profiler evidence that connect model claims to real runs.

evaluation

verifier

benchmarks

ablation

receipts

profiling

Curated set

Articles in reading order

Why this hub

Best if you want to separate trustworthy evidence from vague score reporting across the MegaCpp archive.

Evaluation Contract

Read these first to understand what the MegaCpp eval lane accepts as evidence.

01
April 18, 2026 • 7 min read • David Gornshtein
How We Evaluate the MegaCpp SLM Ensemble on Real C++ Work
The evaluation design, verifier stack, and release gates we use to measure C++ model quality without collapsing everything into a single leaderboard number.
The broad evaluation overview and the vocabulary that anchors the rest of the verifier lane.
Evaluation
Benchmarks
SLM
C++
02
April 18, 2026 • 8 min read • David Gornshtein
Eval Harness Plumbing: The Parts That Are Not the Benchmark
The four-axis eval harness plumbing under our C++ benchmarks: sandboxing, compile walls, timeouts, parallel runners, flake isolation, and the contract tests a new benchmark has to pass before it goes into CI.
How the harness surface is wired once evaluation has to stay reproducible instead of ad hoc.
Evaluation
Testing
Infra
C++
03
April 18, 2026 • 9 min read • David Gornshtein
Verifier-first C++ evals: why compile-and-test owns the metric
What the C++ evaluation stack teaches about deterministic extraction, sandbox contracts, pass@k, and why benchmark tables only become trustworthy after the verifier owns the pass label.
The shortest accurate read on why verifier-first evaluation stayed central for C++ tasks.
Evaluation
C++
Verifier
Benchmarking
04
April 18, 2026 • 10 min read • David Gornshtein
The C++ Eval Suites, Verifiers, and the Compile-Then-Test Wall
The C++-specific eval surface we actually run: problem sets, the compile-then-test verifier sandbox, header and include coverage, and how per-specialist scorecards fall out of the same harness.
The suite-level companion piece once the eval lane needs multiple tasks and result vocabularies.
Evaluation
C++
Verifier
Benchmarks

Ablation and Comparison

These articles explain how model comparisons stay legible once many knobs move at once.

05
April 18, 2026 • 10 min read • David Gornshtein
SOTA Ablation and Comparison: How MegaCpp Decides What to Keep
The ablation plan, the comparison methodology, and the honest numbers behind the MegaCpp SLM stack — what stacked, what didn't, and what we threw out even though the paper said it would help.
The cleanest overview of how MegaCpp framed comparisons against external baselines without flattening important caveats.
Ablation
SOTA
MoE
DSA
06
April 18, 2026 • 8 min read • David Gornshtein
What changed after the 10K-step gate: the ablations that stayed honest
A grounded reading of training changes after the configured 10K-step gate: STP activation, auxiliary-head timing, plasticity scheduling, and why later ablations are more trustworthy than warmup-era receipts.
A narrower look at what early-step ablations can and cannot really prove.
Ablation
Training
STP
Fire
07
April 18, 2026 • 10 min read • David Gornshtein
Distillation, best-of-N, and verifier-grounded RL in the post-training loop
How distillation, best-of-N, GRPO, GSPO, and verifier-grounded reward shaping compose the MegaCpp post-training pipeline: what we ship, what we still iterate, and the RL recipes behind the C++ specialist.
The post-training comparison surface once distillation, sampling, and RL-style loops start moving together.
Distillation
Best-of-N
GRPO
GSPO
08
April 18, 2026 • 11 min read • David Gornshtein
Data Poisoning Drills and Refusal Behavior for the MegaCpp Specialists
Adversarial data tests, poisoning drills against the C++ specialist ensemble, the refusal behaviors we enforce, and the safety regression layer that sits on top of HumanEval-style code evaluation.
A useful companion when evaluation has to include negative tests and refusal behavior instead of only score maximization.
Safety
Eval
Poisoning
Refusal

Benchmark Receipts and Runtime Evidence

These articles complete the picture with the receipts and profiler signals behind the topline numbers.

09
April 18, 2026 • 12 min read • MegaCpp Engineering
Modal Benchmark Receipts: What Counted as Evidence and What Did Not
A grounded guide to benchmark receipts using compile posture, backend identity, and narrow evidence records rather than headline throughput claims.
What counted as a benchmark receipt, what did not, and how the evidence surface stayed honest.
Modal
Benchmarks
Receipts
Throughput
10
April 18, 2026 • 5 min read • David Gornshtein
Benchmarking the MegaCpp stack on Modal: multi-GPU lessons from rented boxes
What we learned running the training stack on rented H100, H200, and B200 boxes through Modal: three benchmark lanes, an 8-GPU FSDP2 hang, and the bookkeeping that lets the numbers survive a week.
The benchmark readback for the rented-GPU lane once launch and warmup noise were controlled.
Modal
Benchmarks
Multi-GPU
FSDP
11
April 18, 2026 • 12 min read • David Gornshtein
Profiler and performance reports: making benchmark runs comparable months later
How MegaCpp samples training, what a structured performance report should contain, and how observability stays bounded so measurement does not become the regression.
The profiling and receipt discipline that ties runtime traces back to claims in the blog.
Observability
Profiler
Goodput
Performance Reports
12
April 18, 2026 • 9 min read • David Gornshtein
TPU v6e Performance Deep Dive: Real MFU, Sharding Topology, and the Things That Pretended to Help
How a TPU v6e lane actually spent time, why topology and compile amortization mattered so much, and which optimizations did not survive measurement.
A TPU-side performance deep dive that shows how the evidence rules change once compile and runtime interact.
TPU
v6e
Performance
MFU

Keep exploring

Adjacent topic hubs

These hubs cover nearby parts of the blog without turning the archive into a giant taxonomy.

GB10 and Blackwell Bring-Up

Modal Training and Benchmark Operations

Mamba3 Architecture, Kernels, and Runtime Tradeoffs

MLA Integration, Dispatch, and Weight Absorption

Megatron Parallelism and Layout Boundaries

TPU Sparse Attention and Pallas Kernels

H200 Training and Kernel Bring-Up

TPU v6e and XLA Runtime Surfaces

C++ Data Pipelines and Corpus Packaging

MoE, Routing, and Distributed Model Splits