MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
Published · 6 min read · David Gornshtein
Tags: Compile, Torch Compile, CUDA Graphs, Runtime, Distributed

Regional compile without losing the plot

Why MegaCpp treats regional compile as a runtime-boundary decision rather than a blanket switch, and how compile ordering stays tied to distributed and CUDA-graph reality.


Regional compile is only useful if it reduces compile debt without making the runtime story less honest. In MegaCpp, that means compile regions are allowed to exist only when they preserve three things at the same time: the distributed ownership contract, the optimizer contract, and the graph-capture contract.

That is the part many compile writeups skip. They talk about compile as if it were a single switch. In a real hybrid stack it is not. It is a placement problem. Some surfaces want to stay compiled together, some need an explicit boundary, and some should remain opaque because the surrounding runtime is more valuable than one more fused region.

Why MegaCpp does not treat compile as a blanket mode

The compile examples in this repo already show the real constraint. A compile region is not judged only by whether it lowers. It is judged by whether it can coexist with distributed wrappers, dynamic batch policy, CUDA graph capture, and compiled optimizer policy without turning the step into a graph-break storm.

The checked-in proof surfaces stay intentionally small: Regional compile ordering sample for wrapper order, Compile/runtime receipt sample for the effective lane, Compile warmup policy sample for explicit warmup decisions, and Goodput tracker sample for keeping compilation, checkpoint, and step wall time separate.
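As an illustration of what keeping compilation, checkpoint, and step wall time separate can look like, here is a minimal sketch. It is not the checked-in Goodput tracker sample; the class name `GoodputTracker`, the category names, and the `goodput()` definition are assumptions made for this example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class GoodputTracker:
    """Accumulates wall time per category so compile cost is never hidden in step time.

    Hypothetical helper: names and categories are assumptions for illustration.
    """

    CATEGORIES = ("compile", "checkpoint", "step")

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the tracker can be tested deterministically
        self.seconds = defaultdict(float)

    @contextmanager
    def track(self, category):
        if category not in self.CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        start = self._clock()
        try:
            yield
        finally:
            # Time is charged to exactly one category, even if the body raises.
            self.seconds[category] += self._clock() - start

    def goodput(self):
        """Fraction of tracked wall time spent in productive steps."""
        total = sum(self.seconds.values())
        return self.seconds["step"] / total if total else 0.0
```

Wrapping the first compiled call in `tracker.track("compile")` and steady-state iterations in `tracker.track("step")` keeps the cold-start cost visible as its own number instead of inflating the first step's time.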

That is why the public compile pack is organized around runtime contracts rather than compiler internals:

  • The ordering sample keeps the ordering explicit instead of pretending regional compile can be inserted anywhere.
  • The warmup-policy sample makes warmup a policy choice instead of a superstition.
  • The compiled-optimizer sample keeps the optimizer step in the same runtime conversation rather than treating it as an afterthought.
  • The block-validation sample checks whether a block is even a valid capture target before forcing CUDA graphs around it.
  • The opaque-kernel wrapper sample shows the opposite move: when one fragile surface should stay opaque so the surrounding block can remain stable.

The general lesson is narrow. Regional compile is a scheduling tool, not a compiler victory lap.

The ordering rule matters more than the slogan

The strongest local lesson from the examples is that ordering is the real contract. If distributed wrapping, compile insertion, and CUDA-graph capture are applied in the wrong order, the system can still look "compiled" in logs while doing the wrong work operationally.

MegaCpp keeps that failure mode visible by treating compile as one stage in a lane definition rather than as a global launch toggle. The regional compile example is useful precisely because it is boring. It records the runtime order directly. That makes it auditable.
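One way to make a recorded order auditable is to check it against explicit "must come before" pairs. The sketch below is an assumption-laden illustration, not the checked-in ordering sample: the stage names and the `LaneDefinition` class are hypothetical, and the constraints encode only the ordering this article states in its FAQ (tensor and sequence-parallel setup before regional compile; outer FSDP2 wrapping, graph validation, and optimizer policy after it).

```python
from dataclasses import dataclass

# Pairwise "must come before" constraints. Stage names are hypothetical labels
# for the ordering the article describes, not identifiers from the MegaCpp repo.
MUST_PRECEDE = (
    ("tp_sp_setup", "regional_compile"),
    ("regional_compile", "fsdp2_wrap"),
    ("regional_compile", "cuda_graph_validation"),
    ("regional_compile", "optimizer_policy"),
)


@dataclass(frozen=True)
class LaneDefinition:
    name: str
    stages: tuple  # stages in the order they are applied at runtime

    def ordering_violations(self):
        """Return human-readable violations; an empty list means the order is valid."""
        position = {stage: i for i, stage in enumerate(self.stages)}
        violations = []
        for earlier, later in MUST_PRECEDE:
            # Constraints only apply when both stages are present in the lane.
            if earlier in position and later in position:
                if position[earlier] > position[later]:
                    violations.append(f"{earlier} must run before {later}")
        return violations
```

Because the check works on the recorded stage tuple rather than on log output, a lane that merely looks "compiled" still fails loudly when its stages were applied in the wrong order.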

The checked-in Regional compile runtime sample and Regional compile block-identity sample are the next useful proof surfaces because they keep wrapper order and visible module identity tied to the same block. If later checkpoint or hook placement stops talking about the same runtime unit, the compile seam is already in the wrong place.

The block-identity sample keeps one narrower attachment rule visible too: try the in-place compile path on the block first and fall back only when the runtime does not expose it. That preserves the same module identity for later distributed wrapping, hook installation, and checkpoint policy instead of silently moving those decisions onto a replacement wrapper.
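The attachment rule can be sketched without any framework dependency. The helper below is hypothetical (`compile_block` and `fallback_compile` are names invented for this example) and only illustrates the "in-place first, wrapper as fallback" pattern. In PyTorch terms this roughly corresponds to preferring `nn.Module.compile()`, which compiles the module in place, over `torch.compile(module)`, which returns a wrapper module; whether the checked-in sample is written exactly this way is an assumption.

```python
def compile_block(block, fallback_compile):
    """Compile `block`, preferring an in-place path that keeps the same object.

    Returns (module, used_in_place). When used_in_place is True, `module is block`,
    so later wrappers, hooks, and checkpoint policy still target the original object.
    `fallback_compile` is any callable that takes the block and returns a compiled
    replacement (hypothetical parameter for this sketch).
    """
    in_place = getattr(block, "compile", None)
    if callable(in_place):
        in_place()  # mutates the block; object identity is preserved
        return block, True
    # Fallback: the returned object may be a new wrapper around `block`,
    # so callers should re-point hooks and policies at the returned module.
    return fallback_compile(block), False
```

Returning the `used_in_place` flag lets the caller record which path was taken, which keeps the identity decision visible instead of implicit.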

That same boundary has to stay aligned with recompute policy. The block-identity sample and Activation recompute boundaries in hybrid stacks are the short local continuation when warmup, replay, and saved-state policy stop describing the same block.

This is the right level of claim for public documentation:

That is a much stronger story than "we use regional compile for speed."

Why CUDA-graph boundaries belong in the same article

Regional compile without graph-capture discipline is only half a system. Compile regions change what the runtime sees as a stable execution unit. CUDA-graph capture makes the same question even stricter: is this region really stable enough to capture repeatedly, or did we just move instability to a later phase?

The MegaCpp examples keep those questions together on purpose. A compile region that looks elegant in isolation can still be wrong if the block registry or shape policy says the region is not a legitimate graph-capture target. The block-validation example is therefore not a side note. It is part of the same runtime truth.

The public validation receipt also avoids treating a launch flag as proof. It reports the requested, found, and missing block types before enabling graph-tree runtime controls, so the dangerous failure mode changes from "graphs were requested" to "these compiled block classes were actually visible to the lane."
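A receipt of that shape can be sketched in a few lines. This is an illustration under stated assumptions, not the checked-in validation sample: `CaptureReceipt`, `validate_capture_targets`, and the choice to match blocks by class name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureReceipt:
    requested: tuple
    found: tuple
    missing: tuple

    @property
    def graphs_allowed(self):
        # Only enable capture when every requested block type was actually seen.
        return bool(self.requested) and not self.missing


def validate_capture_targets(requested_types, visible_modules):
    """Compare requested block type names against the class names of visible modules."""
    visible = {type(m).__name__ for m in visible_modules}
    requested = tuple(dict.fromkeys(requested_types))  # de-duplicate, keep order
    found = tuple(t for t in requested if t in visible)
    missing = tuple(t for t in requested if t not in visible)
    return CaptureReceipt(requested, found, missing)
```

The point of the design is that the launch flag alone never flips `graphs_allowed`; only a receipt with an empty `missing` field does.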

One practical extension is to keep warmup and replay tied to a small family of fixed-capacity launches instead of letting every nearby shape variation invent a fresh boundary. That is why CUDA graph block validation sample and Pipeline compile warmup sample belong in the same conversation.
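The fixed-capacity idea can be illustrated with a small bucketing helper. The capacity values and function names below are hypothetical; a real lane would take its capacity family from its own shape policy.

```python
from bisect import bisect_left

# Hypothetical capacity family, sorted ascending; real values would come
# from the lane's shape policy.
CAPACITIES = (512, 1024, 2048, 4096)


def pick_capacity(length, capacities=CAPACITIES):
    """Return the smallest fixed capacity that fits `length`, or None if nothing fits."""
    i = bisect_left(capacities, length)
    return capacities[i] if i < len(capacities) else None


def distinct_boundaries(lengths, capacities=CAPACITIES):
    """Count how many warmup/capture shapes a stream of lengths would need."""
    return len({pick_capacity(n, capacities) for n in lengths})
```

With this family, lengths 500, 510, 700, 900, and 1000 map to only two capacities (512 and 1024), so warmup and replay cover two shapes instead of five. A `None` result marks a length outside the family, which the lane has to handle explicitly rather than capture silently.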

What regional compile is actually buying here

The public examples support a narrower, more defensible benefit statement:

  • smaller compile domains can reduce cold-start compile overhead
  • deliberate boundaries can reduce the blast radius of graph breaks or dynamic shape churn
  • runtime-specific opaque wrappers can preserve stability when one kernel family remains fragile
  • compile policy becomes easier to compare when the lane shape is explicit

Notice what is missing: any claim that regional compile is universally faster. That would be the wrong public wording. The honest claim is that regional compile can be the cheaper operational choice when the stack contains repeated substructures plus a small number of unstable boundaries.

Where this lands in MegaCpp

In MegaCpp, regional compile belongs above the kernel layer and below the launcher profile. It is not the same thing as a model feature, and it is not the same thing as a vendor backend. It is a lane-shaping decision that keeps the compile story compatible with the actual runtime shape.

That is why the compile directory is so valuable as public evidence. It does not just say "compile happened." It shows where compile begins, where it should stop, and why the optimizer and graph-capture boundaries have to be part of the same decision.

Prior art and context

The general compiler ideas are not unique to MegaCpp. PyTorch's torch.compile docs, graph-break guidance, recompilation notes, and regional compilation recipe all describe the same high-level tradeoff: you often win by keeping repeated regions compiled while leaving unstable boundaries explicit. MegaCpp's contribution here is narrower. The examples in this repo show how that general idea is turned into a runtime contract for a hybrid, distributed training lane instead of a one-line benchmark trick.

FAQ

Frequently asked questions

Is regional compile a blanket speed switch?
No. In MegaCpp it is a lane-shaping choice used when repeated regions are stable enough to compile while the unstable boundaries remain explicit and auditable. The checked-in Regional compile ordering sample and Compile warmup policy sample are the quick local proof that this is an ordering and policy decision, not a universal speed toggle.

Why keep CUDA-graph validation in the same story?
Because compile regions and graph-capture boundaries define the same runtime unit. If they are chosen independently, logs can still say "compiled" while the operational lane is wrong; DSA and CUDA graph safety is the adjacent runtime proof.

When should a kernel stay opaque instead of joining the compiled region?
When the kernel is a stable runtime dependency but an unstable tracing surface. The checked-in Opaque kernel compile wrapper sample keeps that fused kernel behind a custom-op-shaped boundary so the surrounding block can stay compiled; Dynamo and compile breakage is the continuation when that boundary starts leaking graph breaks back into the lane.

What is the practical rule for using it?
Keep the ordering explicit: distributed wrapping first, then regional compile, then graph validation and optimizer policy. In checked-in form that means Regional compile ordering sample first, CUDA graph block validation sample for the capture boundary, and Pipeline compile warmup sample for the warmup lane. That same rule can still keep one unstable block family outside the compiled region when the lane needs an explicit exclusion.

Should the optimizer live inside the same regional-compile boundary as the model block?
No. The checked-in Compiled AdamW policy sample keeps optimizer compilation as its own policy surface: CUDA can use one dynamic optimizer graph across many parameter-shape families, while TPU/XLA-style lanes stay eager. That keeps optimizer policy explicit instead of silently redefining the model's regional-compile boundary.

Where does activation recompute sit relative to the compiled region?
Treat recompute policy as part of the block contract, not as an outer cleanup pass. The checked-in Regional compile ordering sample keeps the recompute boundary attached before the compiled leaf is finalized, and Activation recompute boundaries in hybrid stacks is the adjacent note for the saved-state and replay side of that decision. If recompute, warmup, and capture stop talking about the same block, the regional compile seam is already too late.

Why does block identity still matter if the region already compiles?
Because regional-compile boundaries are also ownership boundaries. The checked-in Regional compile block-identity sample tries the in-place compile path first so later distributed wrapping and checkpoint policy still point at the same runtime block, with wrapper fallback only when that path is unavailable.

Does "distributed wrapping first" mean every outer wrapper goes on before compile?
No. The local ordering receipt is narrower than that slogan: tensor and sequence-parallel setup need to happen before regional compile so the traced path still sees the right distributed block shape, but outer runtime wrappers such as FSDP2 still belong after the compiled leaves are chosen; the checked-in Regional compile ordering sample is the public-safe receipt for that split.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

Compile

Graph breaks, recompile storms, guard explosions, and cache-hygiene rules we landed while keeping torch.compile useful on MegaCpp's hybrid Mamba-3 +…

Runtime boundaries

Why MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,…