Topic Hub

Evaluation, Benchmarks, and Verifier Loops

A curated evaluation reading path: verifier-first harnesses, ablation structure, benchmark receipts, and the evidence rules that keep comparisons from collapsing into anecdotes.

This hub starts with the evaluation contract itself, then moves into the ablation and comparison layer, and finishes with the benchmark receipts and profiler evidence that connect model claims to real runs.

evaluation
verifier
benchmarks
ablation
receipts
profiling
Curated set: 12 articles in reading order
Why this hub

Best if you want to separate trustworthy evidence from vague score reporting across the MegaCpp archive.

Evaluation Contract

Read these first to understand what the MegaCpp eval lane accepts as evidence.

  1. 01
    April 18, 2026 · 7 min read · David Gornshtein

    How We Evaluate the MegaCpp SLM Ensemble on Real C++ Work

    The evaluation design, verifier stack, and release gates we use to measure C++ model quality without collapsing everything into a single leaderboard number.

    The broad evaluation overview and the vocabulary that anchors the rest of the verifier lane.

    Evaluation
    Benchmarks
    SLM
    C++
  2. 02
    April 18, 2026 · 8 min read · David Gornshtein

    Eval Harness Plumbing: The Parts That Are Not the Benchmark

    The four-axis eval harness plumbing under our C++ benchmarks: sandboxing, compile walls, timeouts, parallel runners, flake isolation, and the contract tests a new benchmark has to pass before it goes into CI.

    How the harness surface is wired once evaluation has to stay reproducible instead of ad hoc.

    Evaluation
    Testing
    Infra
    C++
  3. 03
    April 18, 2026 · 9 min read · David Gornshtein

    Verifier-first C++ evals: why compile-and-test owns the metric

    What the C++ evaluation stack teaches about deterministic extraction, sandbox contracts, pass@k, and why benchmark tables only become trustworthy after the verifier owns the pass label.

    The shortest accurate read on why verifier-first evaluation stayed central for C++ tasks.

    Evaluation
    C++
    Verifier
    Benchmarking
  4. 04
    April 18, 2026 · 10 min read · David Gornshtein

    The C++ Eval Suites, Verifiers, and the Compile-Then-Test Wall

    The C++-specific eval surface we actually run: problem sets, the compile-then-test verifier sandbox, header and include coverage, and how per-specialist scorecards fall out of the same harness.

    The suite-level companion piece once the eval lane needs multiple tasks and result vocabularies.

    Evaluation
    C++
    Verifier
    Benchmarks

Ablation and Comparison

These articles explain how model comparisons stay legible once many knobs move at once.

  1. 05
    April 18, 2026 · 10 min read · David Gornshtein

    SOTA Ablation and Comparison: How MegaCpp Decides What to Keep

    The ablation plan, the comparison methodology, and the honest numbers behind the MegaCpp SLM stack — what stacked, what didn't, and what we threw out even though the paper said it would help.

    The cleanest overview of how MegaCpp framed comparisons against external baselines without flattening important caveats.

    Ablation
    SOTA
    MoE
    DSA
  2. 06
    April 18, 2026 · 8 min read · David Gornshtein

    What changed after the 10K-step gate: the ablations that stayed honest

    A grounded reading of training changes after the configured 10K-step gate: STP activation, auxiliary-head timing, plasticity scheduling, and why later ablations are more trustworthy than warmup-era receipts.

    A narrower look at what early-step ablations can and cannot really prove.

    Ablation
    Training
    STP
    Fire
  3. 07
    April 18, 2026 · 10 min read · David Gornshtein

    Distillation, best-of-N, and verifier-grounded RL in the post-training loop

    How distillation, best-of-N, GRPO, GSPO, and verifier-grounded reward shaping compose the MegaCpp post-training pipeline: what we ship, what we still iterate, and the RL recipes behind the C++ specialist.

    The post-training comparison surface once distillation, sampling, and RL-style loops start moving together.

    Distillation
    Best-of-N
    GRPO
    GSPO
  4. 08
    April 18, 2026 · 11 min read · David Gornshtein

    Data Poisoning Drills and Refusal Behavior for the MegaCpp Specialists

    Adversarial data tests, poisoning drills against the C++ specialist ensemble, the refusal behaviors we enforce, and the safety regression layer that sits on top of HumanEval-style code evaluation.

    A useful companion when evaluation has to include negative tests and refusal behavior instead of only score maximization.

    Safety
    Eval
    Poisoning
    Refusal

Benchmark Receipts and Runtime Evidence

These complete the picture with the receipts and profiler signals behind the topline numbers.

  1. 09
    April 18, 2026 · 12 min read · MegaCpp Engineering

    Modal Benchmark Receipts: What Counted as Evidence and What Did Not

    A grounded guide to benchmark receipts using compile posture, backend identity, and narrow evidence records rather than headline throughput claims.

    What counted as a benchmark receipt, what did not, and how the evidence surface stayed honest.

    Modal
    Benchmarks
    Receipts
    Throughput
  2. 10
    April 18, 2026 · 5 min read · David Gornshtein

    Benchmarking the MegaCpp stack on Modal: multi-GPU lessons from rented boxes

    What we learned running the training stack on rented H100, H200, and B200 boxes through Modal: three benchmark lanes, an 8-GPU FSDP2 hang, and the bookkeeping that lets the numbers survive a week.

    The benchmark readback for the rented-GPU lane once launch and warmup noise were controlled.

    Modal
    Benchmarks
    Multi-GPU
    FSDP

Keep exploring

Adjacent topic hubs

These hubs cover nearby parts of the blog without turning the archive into a giant taxonomy.