By David Gornshtein, 12 min read

Profiler and performance reports: making benchmark runs comparable months later

How MegaCpp samples training, what a structured performance report should contain, and how observability stays bounded so measurement does not become the regression.

Reproducibility in a fast-moving training stack is mostly a paperwork problem. The model code changes, the optimizer recipe drifts, the loader contract gets tightened, the torch wheel rolls forward, and six weeks later somebody tries to compare a new run to an old screenshot or chat note. The fix is not heroic profiling; it is the discipline of recording the same five things on every run, in the same format, with a known, bounded sampling cost. This post is about the reporting layer that makes those comparisons hold up over time, and it reads best next to Profiler-Guided Optimization and Training speed by feature, which depend on the same receipts staying comparable.

Here, a performance receipt means one structured run record: lane provenance, hardware and software inventory, steady-state summary metrics, bounded health checks, and links to any heavyweight traces stored out of band. A few terms recur throughout. The receipt is the durable record for one run. The dashboard is the rolling surface built from many receipts. The profiler trace is the heavyweight attachment you follow only when the receipt says a specific phase needs more detail. The throughput knobs are the compile, routing, fusion, and observability choices that can move the receipt in the first place. A receipt schema is the stable field layout plus an explicit schema-version string that lets old and new runs stay comparable. Goodput is the fraction of wall time spent doing useful step work, while badput is the wall time lost to compilation, checkpointing, evaluation, data loading, or idle gaps. The checked-in goodput tracker sample, compile/runtime receipt sample, GPU profile receipt sample, and FA4 receipt summary sample show that contract in small, inspectable pieces.

Why MegaCpp cares about this

Benchmarks rot. Specifically: a tok/sec number with no schema attached has a half-life of about two weeks before it stops being comparable to anything. The dataloader contract changes, the bucket alignment changes, the precision recipe changes, the kernel set changes. If we want the loss curve from week N to be honestly compared to the loss curve from week N+8, we need the run itself to carry the metadata that explains every line on the chart. That is the same phase-separation problem described in Compile-time vs runtime tradeoffs and Regional compile without losing the plot: once startup, steady state, and measurement costs get flattened together, the comparison stops meaning anything.

The second reason is failure attribution. Most long runs do not fail outright; they degrade. Throughput in tok/sec falls 8% one Tuesday and nobody can say whether it was a framework bump, a communication-library change, or a layout rewrite. Structured reports make those attributions reviewable instead of leaving them to vibes.

The third reason is honesty. MegaCpp publishes numbers. The numbers should be defensible. The performance report is the defense. That is also why this post pairs naturally with Observability and the three dashboards: the dashboards should read these receipts, not replace them, and Profiler-Guided Optimization should treat them as the before-and-after contract for any claimed win.

What we built in MegaCpp

The instrumentation layer is intentionally small. It stays on the host and uses the Python stdlib wherever possible, so it adds no device overhead and no new hard dependencies.

The public goodput tracker is the wall-clock accountant. Adapted from the MaxText GoodputRecorder model and simplified for a single-leader training loop, GoodputTracker records named milestones (job_start, tpu_init, training_preparation, ...) and accumulates duration per category via the span(category) context manager. Categories are step, checkpoint, eval, compilation, data_loading. compute_goodput() returns step_time / wall_time; compute_badput_breakdown() returns the per-category time plus an idle residual that catches the noise the categories miss. The implementation is thread-safe via a single Lock because checkpoint saves can run on a background thread. The cost model is trivial: a time.monotonic() call at span entry and exit, a dict update under a lock, and a periodic dict copy when somebody calls summary(). Sampling cost is therefore bounded by the number of spans per step, which we keep at one (with tracker.span("step"): forward(); backward(); step()).
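
A minimal sketch of that accountant, assuming only what the paragraph above describes. It is illustrative, not the MegaCpp source: the injectable clock parameter, the milestone() method name, and the rule that spans are never nested are assumptions made to keep the example short and testable.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager

CATEGORIES = ("step", "checkpoint", "eval", "compilation", "data_loading")


class GoodputTracker:
    """Illustrative wall-clock accountant (hypothetical, stdlib only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock              # injectable so tests can fake time
        self._lock = threading.Lock()    # checkpoint spans may end on another thread
        self._start = clock()
        self._durations = defaultdict(float)
        self._milestones = {}

    def milestone(self, name):
        """Record a named milestone as seconds since tracker creation."""
        with self._lock:
            self._milestones[name] = self._clock() - self._start

    @contextmanager
    def span(self, category):
        """Accumulate elapsed time for one category. Spans must not nest."""
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        t0 = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - t0
            with self._lock:             # one lock acquire per span exit
                self._durations[category] += elapsed

    def compute_goodput(self):
        """Fraction of wall time spent inside step spans."""
        wall = self._clock() - self._start
        with self._lock:
            step = self._durations.get("step", 0.0)
        return step / wall if wall > 0 else 0.0

    def compute_badput_breakdown(self):
        """Per-category non-step time plus an idle residual."""
        wall = self._clock() - self._start
        with self._lock:
            snapshot = dict(self._durations)
        out = {c: snapshot.get(c, 0.0) for c in CATEGORIES if c != "step"}
        out["idle"] = max(0.0, wall - sum(snapshot.values()))
        return out
```

The hot path here is two clock reads and one locked dict update per span, which is the cost model the text relies on. The idle residual is simply wall time minus everything the spans accounted for, so it only stays meaningful while spans do not overlap.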

The public report builder constructs the run header. It resolves commit, branch, dirty status, and a short commit message, with explicit environment-variable overrides for rollout environments that ship without a .git directory. It also records GPU inventory, platform, Python version, PyTorch version, CPU count, RAM, working-directory context, and an optional cost estimate. The header is written once at job start; it is what makes the performance report still searchable six months later. Without that header, the comparisons in Training speed by feature and Compile-time vs runtime tradeoffs quickly stop being honest. The compile/runtime receipt sample is the smallest checked-in example of that "header plus measured summary" split, while GPU profile receipt sample is the matched before/after comparison shape.
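
The header logic can be sketched with the stdlib alone. Everything named below is an assumption for illustration: the RUN_GIT_* environment variables, the schema-version string, and the field names are hypothetical, and GPU inventory, PyTorch version, RAM, and cost estimate are left out to keep the example dependency-free.

```python
import os
import platform
import subprocess

SCHEMA_VERSION = "run-header/v1"  # hypothetical version string


def _git(*args):
    """Return stripped git output, or None when git or .git is unavailable."""
    try:
        done = subprocess.run(
            ["git", *args], capture_output=True, text=True, timeout=5, check=True
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return done.stdout.strip()


def build_run_header(env=None):
    """Build the once-per-job header; env overrides win over git lookups."""
    env = os.environ if env is None else env

    dirty_override = env.get("RUN_GIT_DIRTY")  # hypothetical override name
    if dirty_override is not None:
        dirty = dirty_override not in ("", "0", "false", "False")
    else:
        status = _git("status", "--porcelain")
        dirty = None if status is None else bool(status)

    return {
        "schema_version": SCHEMA_VERSION,
        "git": {
            # "or" short-circuits, so git is only invoked without an override
            "commit": env.get("RUN_GIT_COMMIT") or _git("rev-parse", "HEAD"),
            "branch": env.get("RUN_GIT_BRANCH")
            or _git("rev-parse", "--abbrev-ref", "HEAD"),
            "dirty": dirty,
            "message": env.get("RUN_GIT_MESSAGE") or _git("log", "-1", "--pretty=%s"),
        },
        "platform": platform.platform(),
        "python_version": platform.python_version(),
        "cpu_count": os.cpu_count(),
        "cwd": os.getcwd(),
    }
```

When neither an override nor a git checkout is available, the fields come back as None rather than a guess, which keeps a missing value visibly missing in the stored header.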

The public temporal-performance tracker is the per-task performance surface for evaluation lanes. It exposes a start_run(config=...), record_step(step, metrics, commits=, tokens=), finish_run(), and to_json(path) lifecycle, with a stable schema-version string baked into the file so future readers can route by version. Peak memory is read from /proc/self/status:VmHWM with a resource.getrusage fallback for macOS development. Throughput is reported as commits_per_sec and tokens_per_sec. The summary block includes mean, median, min, and max for every metric the caller recorded, computed once at finish_run() time so per-step recording stays cheap.
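
Two pieces of that lifecycle are easy to show in isolation: the peak-memory read and the finish-time summary. This is a sketch under stated assumptions, not the tracker itself; the function names peak_rss_kib and summarize are hypothetical, and the resource module limits the example to Unix-like systems.

```python
import resource
import statistics
import sys


def peak_rss_kib():
    """Peak resident set size in KiB: VmHWM on Linux, getrusage elsewhere."""
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmHWM:"):
                    return int(line.split()[1])  # the kernel reports kB here
    except OSError:
        pass  # no /proc, e.g. macOS development machines
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is bytes on macOS and KiB on Linux
    return peak // 1024 if sys.platform == "darwin" else peak


def summarize(steps):
    """Compute mean, median, min, and max once for every recorded metric."""
    keys = sorted({key for step in steps for key in step})
    summary = {}
    for key in keys:
        values = [step[key] for step in steps if key in step]
        summary[key] = {
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
            "min": min(values),
            "max": max(values),
        }
    return summary
```

Doing the aggregation once at the end means per-step recording is only a list append, which is the property the paragraph above is protecting.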

That tracker is intentionally coarse. In a hot training loop, repeated /proc/*/smaps-style walks or nvidia-smi polling are worse tools: the first burns CPU and kernel time, the second mostly sees the allocator's reserved pool rather than the active tensor footprint. Per-step memory accounting should stay on cheap host-side counters or bounded summary reads.

When memory regressions are part of the question, the main receipt should still stay small. The durable shape is peak memory_reserved per rank, a short allocator summary such as inactive_split_bytes, num_alloc_retries, and num_ooms, plus one "largest growing bucket" note from the first and last steady-state snapshots. That is enough to compare two runs months later without turning the report into a stream of nvidia-smi polls or embedding raw snapshots. The heavier artifacts still live out of band, which is the same boundary used in Why a 4B-8B model fills an H200 and still OOMs and OOM Debugging Playbook for H200 Training Runs.

The comparison that matters there is allocated versus reserved, not a single peak-memory headline. If reserved bytes keep climbing while allocated bytes stay roughly flat, the receipt is usually seeing allocator fragmentation or cache growth rather than a sudden jump in live tensor demand. That is why we keep both counters plus allocator-side hints in the compact receipt and push the heavyweight allocator dump into the linked side artifact.
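
The allocated-versus-reserved comparison can be written as a small pure function over two snapshots. The sketch below assumes the snapshots are flat dicts shaped like the output of torch.cuda.memory_stats(); the function name, the 5% growth tolerance, and the hint labels are illustrative assumptions, not values from the MegaCpp receipts.

```python
def memory_receipt(first, last, growth_tol=0.05):
    """Compact memory block from the first and last steady-state snapshots.

    Both arguments are flat dicts using torch.cuda.memory_stats() key names.
    growth_tol is a hypothetical threshold for "roughly flat".
    """

    def growth(key):
        before, after = first[key], last[key]
        return (after - before) / before if before else 0.0

    alloc_growth = growth("allocated_bytes.all.current")
    reserved_growth = growth("reserved_bytes.all.current")

    if reserved_growth > growth_tol and alloc_growth <= growth_tol:
        hint = "fragmentation_or_cache_growth"   # pool grows, live tensors do not
    elif alloc_growth > growth_tol:
        hint = "live_tensor_growth"
    else:
        hint = "stable"

    return {
        "peak_reserved_bytes": last["reserved_bytes.all.peak"],
        "peak_allocated_bytes": last["allocated_bytes.all.peak"],
        "inactive_split_bytes": last["inactive_split_bytes.all.current"],
        "num_alloc_retries": last["num_alloc_retries"],
        "num_ooms": last["num_ooms"],
        "allocated_growth": round(alloc_growth, 4),
        "reserved_growth": round(reserved_growth, 4),
        "hint": hint,
    }
```

The hint is only a pointer for the reader of the receipt. Confirming fragmentation still needs the heavyweight allocator dump that lives in the linked side artifact.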

For cloud-run training, the reporting layer parses trainer stdout with a small set of compiled regexes and emits a typed result tagged with an explicit schema version. The summary block is the steady-state aggregate: throughput, loss, gradient norm, MFU, step count, training time, peak memory, and model size. The checks block is the boolean health gate: finite losses, zero exit code, presence of steps, and absence of OOM. Failure runs additionally carry a bounded reason and log tail. That is the minimum needed for a later reader to distinguish a healthy slow run from a broken fast one, and it lines up directly with the operational surface in loss curves and the divergence playbook.
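
A reduced version of that parse-then-check flow might look like the following. The log line format, the schema-version string, and the field names are all hypothetical; the real trainer output and report schema are not shown in this article, and gradient norm, MFU, training time, peak memory, and model size are omitted for brevity.

```python
import math
import re
import statistics

SCHEMA_VERSION = "train-report/v1"  # hypothetical version string

# Assumed line shape: "step 12 | loss 2.3451 | tok/s 18250.4"
STEP_RE = re.compile(
    r"step\s+(?P<step>\d+)\s*\|\s*"
    r"loss\s+(?P<loss>nan|inf|[-+]?\d+(?:\.\d+)?(?:e[-+]?\d+)?)\s*\|\s*"
    r"tok/s\s+(?P<tok>\d+(?:\.\d+)?)",
    re.IGNORECASE,
)
OOM_RE = re.compile(r"out of memory", re.IGNORECASE)


def build_report(stdout, exit_code, warmup_steps=1, tail_lines=5):
    """Parse trainer stdout into a summary block and a boolean checks block."""
    rows = [
        (int(m["step"]), float(m["loss"]), float(m["tok"]))
        for m in STEP_RE.finditer(stdout)
    ]
    steady = rows[warmup_steps:]  # drop warmup so compile time does not skew it

    checks = {
        "finite_losses": all(math.isfinite(loss) for _, loss, _ in rows),
        "exit_ok": exit_code == 0,
        "has_steps": bool(rows),
        "no_oom": OOM_RE.search(stdout) is None,
    }
    report = {
        "schema_version": SCHEMA_VERSION,
        "summary": {
            "steps": len(rows),
            "final_loss": rows[-1][1] if rows else None,
            "median_tok_per_sec": (
                statistics.median(tok for _, _, tok in steady) if steady else None
            ),
        },
        "checks": checks,
    }
    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        # Bounded failure context: one reason plus a short log tail.
        report["failure"] = {
            "reason": failed[0],
            "log_tail": stdout.splitlines()[-tail_lines:],
        }
    return report
```

Note that the checks block contains no throughput threshold. A slow run with finite losses and a zero exit code still passes, which is what lets a later reader separate "healthy but slow" from "fast but broken".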

The public report schema covers ablation experiments too. The requirement is simple: every per-layer curve should match the same steps length, layer keys should match the declared model depth, and the model configuration should identify the architecture under test. The point of this structured design is that two ablation runs from different weeks can be opened by the same reader and compared without any ad-hoc parsing.
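
Those three requirements translate into a short validator. The key names used here (steps, per_layer, model_config.n_layer, and layer_<i> curve keys) are assumptions chosen for illustration; the actual ablation schema is not reproduced in this article.

```python
def validate_ablation(report):
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    if not isinstance(report.get("schema_version"), str):
        errors.append("missing schema_version")

    steps = report.get("steps") or []
    curves = report.get("per_layer") or {}
    depth = (report.get("model_config") or {}).get("n_layer")

    if not isinstance(depth, int) or depth <= 0:
        errors.append("model_config.n_layer must be a positive int")
    else:
        expected = {f"layer_{i}" for i in range(depth)}
        if set(curves) != expected:
            missing = sorted(expected - set(curves))
            extra = sorted(set(curves) - expected)
            errors.append(f"layer keys mismatch: missing={missing} extra={extra}")

    # Every per-layer curve must have one point per recorded step.
    for key, curve in sorted(curves.items()):
        if len(curve) != len(steps):
            errors.append(f"{key}: {len(curve)} points for {len(steps)} steps")
    return errors
```

Returning every violation at once, rather than raising on the first, makes a rejected report easier to fix in a single pass before it is written.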

The public observability layer is the live-telemetry surface. Two things run by default with zero operator action: a metrics pusher sends loss, throughput, and MFU to a monitoring backend every 15 seconds, and OpenTelemetry spans wrap checkpoint saves, validation passes, and eval phases. The push interval is rate-limit-safe and leaves a generous margin. On-demand profiling should stay explicit and bounded: start it only for the window someone is debugging, stop it promptly, and store the resulting trace out of band rather than bloating the main report. The subtle failure mode is export, not span syntax: a coarse span around checkpoint or eval can be cheap, but synchronous export or per-step attribute spam is what turns telemetry into badput. That separation is visible in the goodput tracker sample and the FA4 receipt summary sample: the lightweight receipt stays small while the heavy trace remains a linked side artifact. It is also the bridge into Observability and the three dashboards, where receipts become the rolling data source instead of an afterthought.

The exact flush cadence is less important than keeping export asynchronous and bounded. Fifteen seconds is a reasonable default for an active training lane; minute-scale export can be perfectly fine for slower evaluation lanes or fleet dashboards. What does not scale is synchronous export on the hot path or attribute spam every step. That is the real boundary between telemetry and badput.

The same rule applies inside the tracing stack itself. If spans are part of the lane, the safe posture is a batched background exporter with a bounded queue and an explicit overflow policy, not a "send every span now" path. In practice the processor choice matters more than the trace library name: synchronous per-span export turns checkpoint and eval instrumentation into hidden step-time tax, while a batch-oriented path preserves the receipt boundary this article is trying to protect.
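
The batched, bounded posture can be illustrated without any tracing library. The class below is a generic stdlib sketch, not an OpenTelemetry processor: the BatchExporter name, its parameters, and the drop-newest overflow policy are assumptions, and the dropped counter is only exact with a single producer thread.

```python
import queue
import threading


class BatchExporter:
    """Background exporter with a bounded queue and drop-newest overflow."""

    _STOP = object()

    def __init__(self, export_fn, max_queue=1024, max_batch=64, flush_interval_s=5.0):
        self._export = export_fn
        self._q = queue.Queue(maxsize=max_queue)
        self._max_batch = max_batch
        self._interval = flush_interval_s
        self.dropped = 0          # items rejected because the queue was full
        self.failed_batches = 0   # export calls that raised
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def emit(self, item):
        """Hot-path call: never blocks, drops the new item on overflow."""
        try:
            self._q.put_nowait(item)
        except queue.Full:
            self.dropped += 1

    def _run(self):
        stop = False
        while not stop:
            try:
                item = self._q.get(timeout=self._interval)
            except queue.Empty:
                continue
            batch = []
            while True:
                if item is self._STOP:
                    stop = True
                    break
                batch.append(item)
                if len(batch) >= self._max_batch:
                    break
                try:
                    item = self._q.get_nowait()
                except queue.Empty:
                    break
            if batch:
                try:
                    self._export(batch)
                except Exception:
                    self.failed_batches += 1  # never let export kill the thread

    def close(self):
        """Flush remaining items and stop. Call start() before close()."""
        self._q.put(self._STOP)  # blocking put: the sentinel is never dropped
        self._thread.join()
```

The training loop only ever calls emit(), which cannot block. Export latency, retries, and backpressure all stay on the background thread, and the dropped counter makes the overflow policy visible instead of silent.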

The same separation should hold for retention. Sampling, queue limits, and trace eviction belong on the collector or exporter side, where they can bound cost without reaching back into the training step. The hot path should emit the small set of spans the lane can afford, then hand off to an asynchronous path that decides what to keep.

How it lands in MegaCpp

The schema contracts should lift as-is. The trainer should write the same dict shape, the same field names, and the same float precision, and the reader should not need to know which trainer produced the file. Schema bumps should be explicit version strings, not silent additions.

The public goodput accountant also lifts cleanly. It is stdlib-only, thread-safe, and the cost model is bounded by span count per step, which should stay close to one step span plus small category spans at phase boundaries.

The public report builder may need two kinds of adaptation in a production deployment. First, build provenance can be injected as one structured blob by the release pipeline instead of a loose set of environment variables. Second, cost estimates should come from a centralized pricing source rather than from hand-entered values. The report should not contain guesses.

Schema evolution wants the same restraint as the rest of the receipt layer. New fields should land additively and stay optional by default, and old field meanings should not be silently renamed out from under older readers. If a registry layer appears later, stable field identity matters more than pretty key names.

One practical way to think about that is "stable identity, flexible spelling." If the report store grows beyond flat JSON files, readers should still be able to follow one long-lived field even after the display name changes. Renaming a field should not force old runs to be rewritten or make historical comparisons guess which new label replaced which older one.
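
One way to picture "stable identity, flexible spelling" is a reader-side alias table keyed by a long-lived field identifier. The identifiers and spellings below are invented for illustration; no such registry is described in the checked-in samples.

```python
# Hypothetical registry: stable field id -> every spelling ever written,
# newest first. Old receipts are never rewritten; only this table grows.
FIELD_ALIASES = {
    "throughput.tokens_per_sec": ("tokens_per_sec", "tok_per_sec", "tok_s"),
    "memory.peak_mib": ("peak_memory_mib", "peak_mem_mb"),
}


def read_field(receipt, field_id, default=None):
    """Look up one long-lived field regardless of which spelling the run used."""
    for key in FIELD_ALIASES.get(field_id, ()):
        if key in receipt:
            return receipt[key]
    return default


old_run = {"schema_version": "v1", "tok_s": 18000.0}
new_run = {"schema_version": "v3", "tokens_per_sec": 19500.0}
speedup = read_field(new_run, "throughput.tokens_per_sec") / read_field(
    old_run, "throughput.tokens_per_sec"
)
```

Both receipts stay byte-for-byte as written, and the comparison never has to guess which label replaced which.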

Cloud-specific parsing helpers should not become the long-term center of the system. If multiple execution surfaces exist, the parsing rules should move into a shared log-parsing layer so the same report format can be produced across environments.

The observability layer may still need selective rewrites. The monitoring push path can stay, the OpenTelemetry tracer can stay, and the on-demand profiler hooks can stay because they are a cheap “trace what is happening right now” interface. What should change is any hardcoded rate limit or label set that really belongs to a recipe or deployment preset. That split matters even more once the live consumers are the dashboards in Observability and the three dashboards.

The temporal-performance tracker also lifts cleanly into evaluation harnesses. The /proc/self/status peak-memory path is the natural Linux default, and a lightweight fallback is enough for development machines.

Ablations and what we kept

The instrumentation surface itself has been ablated more than once. Three patterns survived; three did not.

Survived:

  • A single span("step") per training iteration plus separate spans for checkpoint, eval, compilation, data_loading. This gives us a clean goodput number and a defensible badput breakdown without per-microbatch instrumentation.
  • Stdout-parsing performance reports. The trainer prints structured step lines; the report builder parses them after the run. This means the trainer never has to know it is being measured, and an old report can be re-derived from an old log file as long as the format is stable.
  • Schema versioning on every performance report. Versioned dicts beat unversioned ones every time an older file needs to be read months later.

Dropped:

  • Per-microbatch tracing. The signal-to-noise ratio was poor and the sampling cost ate measurable step time on small models.
  • An attempt to embed the profiler trace directly into the main report JSON. Trace files are large enough that they should be referenced by URI rather than inlined.
  • A "rich report" path that recorded every CLI flag verbatim. The better approach is to record only the flags that affect numerics or performance.

The dashboards worth trusting are the ones that read the structured report rather than re-deriving from logs. The most useful panels are median tok/sec over time per preset, goodput fraction over time per lane, peak memory MiB over time per preset, and report-check pass rate over the last 100 runs. Anything else is supplementary. That dashboard discipline also belongs next to Throughput vs quality knobs: observability overhead is itself one of the knobs, so the measurement path has to be reported instead of hand-waved. If the question becomes ownership, SLOs, and rolling drill-down instead of one lane record, hand off to Observability and the three dashboards.

The sample-cost budget is the discipline that keeps observability from becoming the regression it is supposed to catch. The budget on a training lane should be: at most one Python lock acquire per step for goodput accounting, at most one monitoring write every 15 seconds, zero device-side instrumentation by default, and on-demand profiling only inside an explicit bounded window. Observability cost should be measured the same way model cost is measured: if a monitoring change moves step time, the change should be reverted. That is why the reporting layer here should be read together with Profiler-Guided Optimization, not after it.

Failure-mode honesty matters too. The report checks block should not hardcode a throughput threshold. Throughput thresholds are recipe-dependent and versioned; a precision-stress run will look slow against a performance-tuned run, and the health block should not misclassify that as failure. The checks block is only the boolean health surface; performance comparison belongs at dashboard time, not at write time.

Production checklist

  • Wrap the training step in goodput.span("step"), checkpoint saves in goodput.span("checkpoint"), eval in goodput.span("eval"), compile warmup in goodput.span("compilation"), and the dataloader pull in goodput.span("data_loading").
  • Write the run header at job start and include git provenance, GPU info, system info, and cost info.
  • Tag every performance report with an explicit schema version; do not silently add fields.
  • Keep MetricsPusher push interval >= 15 s and respect the per-time-series rate limit.
  • Keep on-demand profiling behind SIGUSR1 / SIGUSR2 and never enable it by default on a long run.
  • Reference profiler-trace files by URI in the report; do not inline them.
  • Validate every ablation report against its schema before writing.
  • Run the dashboard against the report store, not against raw logs.
  • Treat any change that moves step time as an observability cost regression and bisect it before merging.
  • Persist performance reports to durable storage near the checkpoint, so a recovered checkpoint always has its matching report.
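
The signal-gated profiling item in the checklist above can be sketched as a small state machine. The assignment of SIGUSR1 to "start" and SIGUSR2 to "stop", the ProfilingWindow name, and the max_steps cap are assumptions; start_fn and stop_fn stand in for whatever profiler hooks the lane actually uses.

```python
import signal


class ProfilingWindow:
    """Signal-gated profiling that always closes after a bounded step count."""

    def __init__(self, start_fn, stop_fn, max_steps=20):
        self._start_fn = start_fn
        self._stop_fn = stop_fn
        self._max_steps = max_steps
        self._requested = False
        self._active = False
        self._steps = 0

    # Handlers only flip a flag; the real work happens at a step boundary.
    def request_start(self, signum=None, frame=None):
        self._requested = True

    def request_stop(self, signum=None, frame=None):
        self._requested = False

    def install(self):
        """Register handlers (Unix only, must run on the main thread)."""
        signal.signal(signal.SIGUSR1, self.request_start)
        signal.signal(signal.SIGUSR2, self.request_stop)

    def on_step(self):
        """Call once per training step, outside the timed region."""
        if self._requested and not self._active:
            self._start_fn()
            self._active = True
            self._steps = 0
        elif self._active:
            self._steps += 1
            if not self._requested or self._steps >= self._max_steps:
                self._stop_fn()
                self._active = False
                self._requested = False  # a new signal is needed to re-arm
```

Because the window closes itself after max_steps, a forgotten stop signal cannot leave profiling enabled for the rest of a long run.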

Frequently asked questions

What must be in a performance receipt to make it reusable later?
At minimum: build provenance, hardware and software inventory, schema version, steady-state summary metrics, and a boolean health block with a bounded failure reason when the run went bad. In checked-in form, compile/runtime receipt sample is the compact lane header, while GPU profile receipt sample shows the matched throughput, step-count, and peak-memory side. If any of those are missing, the receipt stops being comparable surprisingly fast.
Should profiler traces live inside the main report JSON?
No. Store the trace out of band and keep the main receipt lightweight, closer to GPU profile receipt sample than to a raw trace dump. Inlining the trace makes the main report heavy, awkward to diff, and harder to archive near checkpoints.
How should a receipt schema evolve without breaking old runs?
Add fields additively, keep new fields optional by default, and bump the schema version when semantics change instead of silently renaming keys. A reader should be able to open last month's receipt without guessing which field name replaced which.
When is the reporting layer too expensive?
As soon as a monitoring or profiling change measurably moves step time on the target lane. Observability overhead is part of the system budget, so it should be measured and bisected like any other regression. Goodput tracker sample is the shortest local surface for the “bounded accounting cost” side of that rule.
Should receipts synchronize the GPU to get cleaner timing?
No. A receipt should record host-side counters, allocator counters, and bounded phase spans without adding a device-wide barrier. cudaDeviceSynchronize() is useful when a debugging run needs exact completion semantics, but in a normal training receipt it turns measurement into a pipeline bubble. Keep synchronization inside explicit profiler or debugger windows, not inside the always-on receipt path.
Why keep both allocated and reserved memory in a receipt?
Because they answer different questions. allocated tracks memory occupied by live tensors, while reserved tracks the larger pool the caching allocator is holding onto. If reserved keeps climbing while allocated stays roughly flat, the receipt is usually seeing allocator fragmentation or cache growth rather than a real jump in live tensor demand. That is why the compact receipt keeps both counters and pushes any heavyweight allocator dump into a linked side artifact. The local continuation is Why a 4B-8B model fills an H200 and still OOMs and OOM Debugging Playbook for H200 Training Runs.
What is the difference between a performance receipt and a live dashboard?
The receipt is the per-run source record: one lane, one schema version, one bounded health block, one set of links to heavy traces. The dashboard is the rolling consumer of many receipts plus live metrics. If a dashboard panel cannot be traced back to receipts, it stops being audit-friendly quickly. Compile/runtime receipt sample is the shortest checked-in proof that one lane should collapse into one readable record, while goodput tracker sample is the wall-time surface that later rolls up into trend panels. Observability and the three dashboards is the direct continuation when the question shifts from one receipt to a fleet-level rolling view.
Should one receipt try to explain every problem in a run?
No. Keep one narrow receipt per lane or failure family so two runs still compare on the same compact fields instead of turning one file into a mixed narrative. Distributed debugging notes makes the same point directly, and the checked-in compile/runtime receipt sample plus GPU profile receipt sample show the public-safe shape: one lane header, one matched throughput and memory summary, with heavier traces or sibling receipts linked out of band.

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

SLOs

The small set of per-specialist service-level objectives that the router and operator dashboards use to decide admission, shadowing, and rollback behavior.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
