Grounded engineering note from the MegaCpp stack
Published · 12 min read · MegaCpp Engineering
Tags: Modal, Benchmarks, Receipts, Throughput, Evidence

Modal Benchmark Receipts: What Counted as Evidence and What Did Not

A grounded guide to benchmark receipts using compile posture, backend identity, and narrow evidence records rather than headline throughput claims.

A benchmark number is only trustworthy when it comes with a receipt: the exact model lane, exact operator family, compile posture, and known exclusions. The same nominal model can produce very different throughput depending on whether it used padded MoE or jagged MoE, stable dense compile or graph breaks, sparse recovery or a still-regressing backend.

The best adjacent reading is the pair of posts on profiler and receipts and on profiler-guided optimization, because the whole point of this post is that benchmark notes should inherit the same narrow evidence standard as the profiling notes they summarize.

Benchmark arguments get sloppy fast when infrastructure is moving. One run is quoted after a backend fix, another after a config change, another after a compile improvement, and suddenly the comparison sounds cleaner than it is. The project avoided that trap in its better artifacts by treating benchmark claims as evidence records rather than wins to advertise.

That is what a receipt is here. A receipt is not just a log file. It is a bounded claim about one lane: what model mix it used, what backend family was active, what compile status was in play, what accelerator shape was requested, what shape actually ran, and what nearby caveats still existed.

The checked-in examples make that concrete. The GPU profile receipt sample keeps measured throughput, requested-vs-observed dispatch, and peak memory in one typed record. The FA4 receipt summary sample separates backend truth, kernel verification, compile time, wall time, and throughput so one fast run cannot hide a backend mismatch.
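As an illustration of what "one typed record" can mean, here is a minimal sketch of a receipt type. The class name, field names, and values are assumptions made for this example; they are not the schema of the checked-in samples, and the numbers are placeholders rather than measurements.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class BenchmarkReceipt:
    # Hypothetical field names; the repo's real receipt schema may differ.
    model_lane: str
    backend_family: str
    compile_posture: str
    requested_gpu: str
    observed_gpu: str
    tokens_per_second: float
    peak_memory_gib: float
    caveats: tuple[str, ...] = ()

    def to_json(self) -> str:
        # Sorted keys keep two receipts easy to diff line by line.
        return json.dumps(asdict(self), sort_keys=True)


receipt = BenchmarkReceipt(
    model_lane="example-hybrid-lane",
    backend_family="padded_moe",
    compile_posture="whole_model",
    requested_gpu="H100",
    observed_gpu="H200",
    tokens_per_second=1000.0,  # placeholder, not a measured result
    peak_memory_gib=40.0,      # placeholder, not a measured result
    caveats=("fused jagged path disabled under compile",),
)
print(receipt.to_json())
```

Keeping requested and observed hardware as separate fields is the design choice that matters: a single "gpu" field would silently erase the difference the receipt exists to record.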

Why Headline Numbers Are Dangerous

The most direct evidence is in the run history itself. It records several moments where the apparently faster path was not actually the best benchmark lane once compile behavior was included. The MoE examples are especially important. A jagged grouped path could look better in local operator terms, but if it forced graph breaks or required @torch.compiler.disable, then the end-to-end training lane could lose to a padded alternative that preserved a more coherent compiled graph.

That is not a small footnote. It changes what a throughput number means. If one lane is measuring arithmetic efficiency and another is measuring a compiled end-to-end system with fewer breaks, they are not substitutes. The receipt has to say which one it is.

| Receipt ingredient | Why it matters |
| --- | --- |
| Exact lane or recipe | Throughput only makes sense for one concrete topology and feature mix |
| Compile posture | Eager, partial compile, and whole-model compile are different systems |
| Operator family | Dense, sparse, padded MoE, and jagged MoE can behave very differently |
| Requested and observed accelerator shape | Hosted runtimes can auto-upgrade or otherwise drift from the nominal label you asked for |
| Known caveats | Graph breaks, fallbacks, and disabled paths explain why one number moved |
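The ingredients in the table above can be read as a comparability key: two throughput numbers are substitutes only when those fields agree. The sketch below is one hypothetical way to express that check; `LaneKey` and `mismatched_fields` are names invented for this example.

```python
from typing import NamedTuple


class LaneKey(NamedTuple):
    # Hypothetical comparability key built from the receipt ingredients.
    lane: str
    compile_posture: str
    operator_family: str
    observed_gpu: str


def mismatched_fields(a: LaneKey, b: LaneKey) -> list[str]:
    """Return the names of the fields on which two receipts disagree."""
    return [name for name in LaneKey._fields if getattr(a, name) != getattr(b, name)]


padded = LaneKey("hybrid", "whole_model", "padded_moe", "H200")
jagged = LaneKey("hybrid", "partial", "jagged_moe", "H200")

# An empty list means the two numbers can be compared directly.
print(mismatched_fields(padded, jagged))
```

A non-empty result does not make either number wrong. It only says the comparison needs the listed differences spelled out next to it.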

The point is not to make reporting harder for its own sake. The point is to stop lying accidentally.

That may sound severe, but the repo history earns the severity. Performance changed for multiple legitimate reasons at once: compile behavior, sparse backend recovery, MoE path selection, and validation narrowing. A benchmark note that hides those causes may still be numerically accurate for one run and yet be misleading for every comparison that follows it.

A Good Receipt Names the Structural Tradeoff

The better benchmark notes do this explicitly. For example, the MoE compile writeups explain that the padded path could be faster overall because it remained compilable while the jagged fused path introduced graph breaks. That is not merely a "backend detail." It is the central interpretation key for the benchmark.

A good benchmark receipt therefore has to identify the structural tradeoff in plain language. If a run used padded MoE, say it. If the fused path was disabled because compile could not tolerate it, say that too. If a sparse attention recovery restored correctness but did not fully recover the old throughput envelope, the receipt should separate those statements instead of compressing them into a single success narrative.

In practice the padded-versus-jagged tradeoff fails in two different ways. Padded paths burn memory and arithmetic on inactive tokens, while jagged paths save that padding but can fragment compile so badly that end-to-end utilization falls harder than the saved tokens help. That is exactly why a receipt has to carry both backend family and compile posture in the same record instead of treating "MoE" as one benchmark class.
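The padding half of that tradeoff is simple arithmetic, sketched below under one stated assumption: every expert is padded to the same fixed capacity and tokens beyond capacity are dropped. The function name and the token counts are illustrative only.

```python
def padding_waste(tokens_per_expert: list[int], capacity: int) -> float:
    """Fraction of padded expert slots that hold no real token.

    Assumes each expert is padded to the same fixed `capacity` and that
    tokens beyond capacity are dropped (an assumption of this sketch).
    """
    if capacity <= 0 or not tokens_per_expert:
        raise ValueError("need a positive capacity and at least one expert")
    total_slots = capacity * len(tokens_per_expert)
    used_slots = sum(min(n, capacity) for n in tokens_per_expert)
    return 1.0 - used_slots / total_slots


# Illustrative routing loads for four experts, not measured data.
loads = [100, 40, 10, 50]
print(padding_waste(loads, capacity=100))  # 0.5
```

This number captures only what the padded path wastes. It says nothing about what a jagged path loses to graph breaks, which is why the receipt still needs the compile posture beside it.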

Block-sparse MoE is a third receipt class rather than the winner of that two-way argument. It can avoid padding waste without inheriting the same graph-break pattern as jagged routing, but only by moving into a different kernel family with its own validation and fallback surface. If a run used block-sparse dispatch, the receipt should name that lane explicitly instead of presenting it as "better padded MoE" or "fixed jagged MoE." The adjacent architecture-side continuation when that lane matters is Fused MoE and DeepEP on NVIDIA.

Receipt pattern:
model family + feature mix + compile posture + backend family + known caveats

That structure may look verbose, but it is still cheaper than debugging a misleading benchmark comparison later.

The Best Receipts in the Repo Narrow the Frontier First

A public H200 training excerpt is not a Modal-only document, but it demonstrates the right habit. It distinguishes stable dense or mixed lanes from adjacent unstable lanes. It does not report one blended number as if all nearby variants were equally solved.

That style is exactly what benchmark reporting needs on hosted accelerators as well. The benchmark should inherit the same frontier language:

  1. this exact lane is alive,
  2. this adjacent lane is still unstable,
  3. this number belongs only to the first lane.
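The three statements above can be enforced mechanically. The following sketch keeps a small frontier table and refuses to hand out a number for any lane that was not itself measured as stable. The lane names, the table layout, and the throughput value are hypothetical.

```python
# Hypothetical frontier table; lane names and the number are placeholders.
FRONTIER = {
    "dense_compile": {"status": "stable", "tokens_per_second": 1000.0},
    "hybrid_moe_compile": {"status": "unstable", "tokens_per_second": None},
}


def number_for(lane: str) -> float:
    """Return the throughput that belongs to exactly this lane."""
    entry = FRONTIER.get(lane)
    if entry is None:
        raise KeyError(f"no receipt for lane {lane!r}")
    if entry["status"] != "stable" or entry["tokens_per_second"] is None:
        # Never borrow a neighbouring lane's number for an unstable lane.
        raise ValueError(f"lane {lane!r} is {entry['status']}; it has no reportable number")
    return entry["tokens_per_second"]


print(number_for("dense_compile"))
```

Raising instead of falling back to the nearest stable lane is deliberate: the failure is the honest answer for a lane that has not been measured.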

Without that separation, teams start quoting the best number from the easiest slice and mentally attaching it to the hardest configuration. The repo's better reports resist that temptation.

That resistance is what makes them reusable. A later engineer can come back, see which lane the number belonged to, and decide whether a new run is actually comparable or only superficially adjacent.

The same frontier discipline shows up in training on H200 eight-GPU machines, where a stable dense lane and an adjacent unstable lane are treated as different claims rather than as one blended progress story. The checked-in training on H200 eight-GPU machines, GPU profile receipt sample, and measured optimization receipts are the compact local proof surfaces for that frontier style.

Hosted benchmark environments add their own source of confusion. Startup overhead, artifact staging, compile cache state, and exact environment drift all influence observed speed. That means a hosted benchmark receipt needs more than model arguments. The environment-side details are exactly the surfaces described in Modal image and cold-start and Modal multi-GPU issues and fixes.

Modal's public docs make the stable platform surfaces easy to name: requested GPU type and count, writable Volumes, and background or detached execution are all explicit product primitives. Receipt hygiene starts by separating those stable platform fields from run-specific metadata such as cache warmth, fallback behavior, and environment drift. That distinction matters even more on Modal because some GPU requests can be auto-upgraded unless benchmarking pins the exact class, so a receipt that stores only "we ran on Hopper" is weaker than one that preserves both the request and the observed hardware label.
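A request-versus-observation check can be very small. The sketch below assumes both labels have already been normalized to short class names; in a real run the observed label would come from whatever the runtime reports (for example `torch.cuda.get_device_name()`), and mapping that string to a class name is left out here. The function name is hypothetical.

```python
def hardware_drift(requested: str, observed: str,
                   requested_count: int, observed_count: int) -> list[str]:
    """List the differences between what was asked for and what actually ran.

    Assumes both GPU labels are already normalized short class names.
    """
    notes = []
    if requested.strip().lower() != observed.strip().lower():
        notes.append(f"gpu class drift: requested {requested}, observed {observed}")
    if requested_count != observed_count:
        notes.append(f"gpu count drift: requested {requested_count}, observed {observed_count}")
    return notes


# Illustrative values only.
print(hardware_drift("H100", "H200", 8, 8))
```

An empty list is worth storing too, since it records that the check was made and nothing drifted.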

First-touch definition for this article: compile posture is the execution stance that gave you the number, not a vague performance note afterward. Eager, partial compile, regional compile, and warmed whole-model compile are different receipt classes because they change the meaning of the number.

The receipt is stronger when it separates three utilization layers that are easy to blur together: how long the GPUs were merely allocated, how long kernels were actually running, and how much model work those kernels delivered. Without that split, a cold boot, cache-staging hiccup, or heavy-padding regime can get laundered into "the model was slow" when the lane was actually infrastructure-bound or compile-bound.
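The three layers can be written as two ratios and their product. This is a sketch under simplifying assumptions: kernel time is a single aggregate figure, and "model work" is approximated as the share of processed tokens that were real rather than padding. All names and inputs are illustrative.

```python
def utilization_layers(allocated_s: float, kernel_s: float,
                       useful_tokens: int, processed_tokens: int) -> dict[str, float]:
    """Split utilization into allocation, kernel activity, and useful work."""
    if allocated_s <= 0 or processed_tokens <= 0:
        raise ValueError("allocated_s and processed_tokens must be positive")
    kernel_busy = kernel_s / allocated_s            # kernels running vs GPUs merely held
    useful_work = useful_tokens / processed_tokens  # real tokens vs padded tokens
    return {
        "kernel_busy": kernel_busy,
        "useful_work": useful_work,
        "end_to_end": kernel_busy * useful_work,
    }


# Placeholder inputs, not measurements.
layers = utilization_layers(allocated_s=100.0, kernel_s=60.0,
                            useful_tokens=800, processed_tokens=1000)
print(layers)
```

With these placeholder inputs, a low `kernel_busy` points at infrastructure or compile time, while a low `useful_work` points at padding, which is the distinction a single throughput line cannot make.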

Startup state deserves the same explicit treatment. A cold container boot, a warmed compile cache, and a full memory-snapshot restore are different benchmark starting lines even if the model arguments are identical. If the receipt does not say which one happened, later readers will over-attribute platform overhead or cache luck to the model itself.
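One way to keep compile posture and startup state from collapsing into free-text notes is to make both closed enumerations. The enum members below mirror the categories named in the text; the class names and the `same_starting_line` helper are assumptions of this sketch.

```python
from enum import Enum


class CompilePosture(Enum):
    EAGER = "eager"
    PARTIAL = "partial_compile"
    REGIONAL = "regional_compile"
    WARMED_WHOLE_MODEL = "warmed_whole_model_compile"


class StartupState(Enum):
    COLD_BOOT = "cold_boot"
    WARM_COMPILE_CACHE = "warm_compile_cache"
    SNAPSHOT_RESTORE = "snapshot_restore"


def same_starting_line(a: tuple[CompilePosture, StartupState],
                       b: tuple[CompilePosture, StartupState]) -> bool:
    """Two runs share a receipt class only if posture and startup state both match."""
    return a == b


run_a = (CompilePosture.REGIONAL, StartupState.COLD_BOOT)
run_b = (CompilePosture.REGIONAL, StartupState.WARM_COMPILE_CACHE)
print(same_starting_line(run_a, run_b))  # False
```

Identical model arguments would make these two runs look the same in a config diff, and the enum pair is what records that they did not start from the same place.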

That is also why receipts are better than screenshots or one-line summaries. A screenshot can show a number. A receipt explains whether the number came from a stable compiled lane, a fallback backend, or a warm-cache run that no adjacent lane can reproduce. When the question stops being "is this number comparable" and becomes "why did this lane fail", the follow-on is Modal debugging playbook, not another screenshot.

This is where hosted benchmarking usually becomes misleading. The same nominal recipe can move because of cache warmth, compile posture, fallback behavior, or environment drift. If those are not logged alongside the number, later comparisons start looking more stable than the code actually was.

Hybrid Patterns Make Benchmark Labels More Important

In a pure dense stack, benchmark labeling is still easy to get wrong, but at least the model family is simple. In a hybrid stack with A, M, E, and R blocks, labels like NAM52 and NAM56R carry real architectural meaning. They imply different operator mixes and therefore different benchmark expectations.

An AEMEAEMEAEMR-style pattern should never be benchmarked or discussed as if it were just "another transformer run." The E blocks introduce expert routing and dispatch behavior. The M blocks change the runtime profile. The R suffix or recurrent family changes state handling again. If a receipt does not mention that structure, it is too weak to compare against anything serious.
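Because the pattern string encodes the ordered block mix, a receipt can derive the family counts directly from it. The sketch below counts letters in the pattern string only; how the pattern maps onto actual layer depth is not assumed here, and the helper name is hypothetical.

```python
from collections import Counter

# Block-family shorthand as defined in the article's glossary.
BLOCK_FAMILIES = {
    "A": "attention",
    "M": "Mamba/state-space",
    "E": "expert/MoE",
    "R": "recurrent",
}


def block_mix(pattern: str) -> dict[str, int]:
    """Count how often each block family appears in a pattern string."""
    unknown = set(pattern) - set(BLOCK_FAMILIES)
    if unknown:
        raise ValueError(f"unknown block letters: {sorted(unknown)}")
    counts = Counter(pattern)
    return {letter: counts[letter] for letter in BLOCK_FAMILIES}


print(block_mix("AEMEAEMEAEMR"))  # {'A': 3, 'M': 3, 'E': 5, 'R': 1}
```

For this string the E blocks are the largest single family, which is a quick way to see why routing and dispatch behavior cannot be left out of the receipt.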

| Family cue | Benchmark implication |
| --- | --- |
| A-heavy dense lane | Throughput mostly reflects dense attention and projection economics |
| E-heavy lane | Routing, capacity, dispatch, and compile behavior become central |
| Mixed A/M/E/R pattern | End-to-end number reflects family interactions, not one kernel story |
| NAM52 vs NAM56R | Different receipts, different claims, different performance envelopes |

This is one reason the repo's use of notation is valuable. It forces performance claims back onto real structure. The checked-in NAM56R block taxonomy sample and NAM56R pattern composition sample are the fastest local cross-checks when the benchmark label itself is the thing under review.

It also improves review quality. Once a benchmark is tied to a named family mix rather than a generic model label, reviewers can challenge the comparison on architecture grounds instead of only arguing about logging hygiene.

It also prevents benchmark laundering across adjacent lanes. Once a pattern name carries real architectural meaning, a result from the easiest runnable subset cannot honestly be promoted into a claim about the hardest intended target.

The topology label gets stronger when it carries one more field: the active-versus-total parameter shape. Hybrid receipts can otherwise look artificially efficient because a sparse or recurrent lane may activate much less of the model per token than the headline parameter count suggests. Keeping the pattern string and the active share side by side makes a fast hybrid result harder to compare as if it were a dense run with the same nominal size.
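Keeping the pattern string and the active share side by side can be as simple as one formatted label. The parameter counts below are placeholders chosen for easy arithmetic, not the sizes of any real MegaCpp model, and both function names are invented for this sketch.

```python
def active_share(active_params: int, total_params: int) -> float:
    """Fraction of total parameters that are active per token."""
    if total_params <= 0 or not 0 <= active_params <= total_params:
        raise ValueError("need 0 <= active_params <= total_params and total_params > 0")
    return active_params / total_params


def topology_label(pattern: str, active_params: int, total_params: int) -> str:
    """Combine the pattern string with the active-versus-total shape."""
    share = active_share(active_params, total_params)
    return f"{pattern} active={share:.0%} of {total_params / 1e9:.1f}B"


# Placeholder sizes: 2B active out of 8B total.
print(topology_label("AEMEAEMEAEMR", 2_000_000_000, 8_000_000_000))
```

A reader who sees the active share in the label is less likely to line the result up against a dense run of the same nominal size.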

Sparse and Recovery Work Changed What Benchmarks Meant

The DSA SDPA recovery bundle is another reminder that correctness, backend choice, and throughput should be logged separately. One recovery can materially improve throughput without erasing all later regression analysis. The useful benchmark after a recovery is the one that says whether it reflects restored correctness, restored backend selection, partial throughput recovery, or all three.

The best validation notes do this with restraint. They leave the remaining gap open instead of pretending one recovered run restored the entire older envelope. That makes the benchmark more useful, not less useful, because later readers know exactly what still needs explanation.
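A recovery note that keeps the three statements apart might look like the sketch below. The class, its fields, the tolerance, and the 0.9 ratio are all illustrative assumptions, not values from the DSA recovery work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryNote:
    # Hypothetical fields; each claim is recorded on its own.
    correctness_restored: bool
    backend_restored: bool
    throughput_ratio: float  # recovered throughput divided by the old baseline

    def statements(self) -> list[str]:
        """Emit one line per claim instead of a single success sentence."""
        return [
            f"correctness restored: {self.correctness_restored}",
            f"backend selection restored: {self.backend_restored}",
            f"throughput vs old baseline: {self.throughput_ratio:.0%}",
        ]

    def fully_recovered(self, tolerance: float = 0.02) -> bool:
        # Only true when all three claims hold; the tolerance is an assumption.
        return (self.correctness_restored and self.backend_restored
                and self.throughput_ratio >= 1.0 - tolerance)


note = RecoveryNote(correctness_restored=True, backend_restored=True, throughput_ratio=0.9)
print("\n".join(note.statements()))
print(note.fully_recovered())  # False
```

In this example two of the three claims hold, and the note still reports the remaining gap rather than rounding the run up to a full recovery.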

What a Strong Modal Benchmark Receipt Looks Like

A strong receipt in this stack should include:

| Field | Example meaning |
| --- | --- |
| Model lane | NAM52 dense compile lane, or NAM56R hybrid lane with MoE enabled |
| Pattern or family | Dense A-heavy, or hybrid AEMEAEMEAEMR |
| Compile status | eager, partially compiled, or whole-model compiled |
| Backend detail | padded MoE, jagged grouped MoE, sparse SDPA, dense attention, and whether a backend or precision fallback fired |
| Startup and cache state | cold boot, warm compile cache, or snapshot restore; same flags do not imply the same starting line |
| Stability note | stable frontier or adjacent known blocker |
| Requested vs observed hardware | exact GPU class asked for, plus the class and count the run actually observed |
| Artifact pointer | report, receipt, or preserved validation note |
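The field list above lends itself to a completeness check before a number is quoted anywhere. The key names in this sketch are hypothetical translations of the table rows into dictionary keys, and the example values are placeholders.

```python
# Hypothetical key names derived from the table rows above.
REQUIRED_FIELDS = (
    "model_lane",
    "pattern",
    "compile_status",
    "backend_detail",
    "startup_state",
    "stability_note",
    "requested_hardware",
    "observed_hardware",
    "artifact_pointer",
)


def missing_fields(receipt: dict) -> list[str]:
    """Return required fields that are absent or empty in a receipt."""
    return [name for name in REQUIRED_FIELDS if not receipt.get(name)]


draft = {
    "model_lane": "NAM56R hybrid lane with MoE enabled",
    "pattern": "AEMEAEMEAEMR",
    "compile_status": "partially compiled",
    "backend_detail": "padded MoE, no fallback fired",
    "requested_hardware": "H200 x8",
}
print(missing_fields(draft))
```

Treating an empty string the same as a missing key is intentional here, so a placeholder field cannot pass as evidence.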

One more rule matters here: topology, operator lane, and environment state should not be collapsed into one opaque config label. A family pattern like AEMEAEMEAEMR, an execution lane like padded, jagged, or block-sparse MoE, and a hosted starting line like cold boot, warmed cache, or snapshot restore answer different questions. Keeping them separate is what lets later readers tell whether a number moved because the model structure changed, because the kernel family changed, or because the hosted environment started from a different place. That is why training on H200 eight-GPU machines and Modal image and cold-start belong next to receipt hygiene instead of far away from it.

That may feel like more ceremony than most benchmark dashboards want. But the alternative is worse: optimistic numbers nobody can align to code.

Without startup and cache state, hosted receipts quietly compare model changes to infrastructure changes. Without backend-fallback state, sparse recoveries and mixed-precision regressions get compressed into one misleading throughput line.

A receipt-heavy culture may feel slower in the short term, but it saves time later because fewer benchmark disputes have to be re-opened from scratch.

And code alignment is the only standard that survives fast-moving infrastructure. A benchmark with exact report anchors and caveats can still be interpreted after several later patches. A benchmark without that context ages into trivia almost immediately.

The Main Rule: Report the Lane You Actually Measured

The strongest through-line across the codebase is this: report the exact lane you measured, not the family you wish you had measured. If compile was only stable on the padded MoE path, the receipt belongs to the padded MoE path. If the dense TP+SP+FSDP compile lane passed but the real MoE layer remained the next blocker, the benchmark belongs to the dense lane. If a sparse recovery restored one backend path but left a broader performance gap unexplained, say that too.

Hosted benchmark arguments usually go wrong when people merge nearby truths into one sentence. The receipts in this project are useful because they refuse to do that. They tell you what worked, what did not, and why one number should not be generalized across the whole stack.

That is what made a benchmark believable here. Not the highest tokens-per-second line, but the narrowest honest claim around it.

That standard is stricter than typical benchmark culture, but it is the right one for a hybrid system where compile posture, sparse routing, and hosted-runtime state can all change independently. The narrower the receipt, the longer it remains useful.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Modal

A container-first GPU execution surface with explicit image, GPU, Volume, Secret, and detached-launch primitives that MegaCpp uses for isolated benchmark and validation lanes.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

AEMEAEMEAEMR

A concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.

MoE

Mixture of Experts. In these articles the term covers Token Choice vs Expert Choice routing, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

A / M / E / R

MegaCpp shorthand for the four main block families: attention, Mamba/state-space, expert/MoE, and recurrent tail layers.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

TP

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

SP

Sequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

detached

A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Volume

Modal's writable persistence surface for cache, checkpoints, copied shards, and other mutable state that must survive container turnover.

Memory Snapshots

Modal's snapshot restore surface for import-time or initialization-time startup state. In these articles it narrows cold-start tax but does not replace warm compile caches or host-local storage semantics.
