Regional compile without losing the plot
Why MegaCpp treats regional compile as a runtime-boundary decision rather than a blanket switch, and how compile ordering stays tied to distributed and CUDA-graph reality.

Regional compile is only useful if it reduces compile debt without making the runtime story less honest. In MegaCpp, that means compile regions are allowed to exist only when they preserve three things at the same time: the distributed ownership contract, the optimizer contract, and the graph-capture contract.
That is the part many compile writeups skip. They talk about compile as if it were a single switch. In a real hybrid stack it is not. It is a placement problem. Some surfaces want to stay compiled together, some need an explicit boundary, and some should remain opaque because the surrounding runtime is more valuable than one more fused region.
Why MegaCpp does not treat compile as a blanket mode
The compile examples in this repo already show the real constraint. A compile region is not judged only by whether it lowers. It is judged by whether it can coexist with distributed wrappers, dynamic batch policy, CUDA graph capture, and compiled optimizer policy without turning the step into a graph-break storm.
The checked-in proof surfaces stay intentionally small: Regional compile ordering sample for wrapper order, Compile/runtime receipt sample for the effective lane, Compile warmup policy sample for explicit warmup decisions, and Goodput tracker sample for keeping compilation, checkpoint, and step wall time separate.
That is why the public compile pack is organized around runtime contracts rather than compiler internals:
- The ordering sample keeps the ordering explicit instead of pretending regional compile can be inserted anywhere.
- The warmup-policy sample makes warmup a policy choice instead of a superstition.
- The compiled-optimizer sample keeps the optimizer step in the same runtime conversation rather than treating it as an afterthought.
- The block-validation sample checks whether a block is even a valid capture target before forcing CUDA graphs around it.
- The opaque-kernel wrapper sample shows the opposite move: when one fragile surface should stay opaque so the surrounding block can remain stable.
The general lesson is narrow. Regional compile is a scheduling tool, not a compiler victory lap.
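As one illustration of the accounting side of these contracts, the idea of keeping compilation, checkpoint, and step wall time separate could look roughly like the sketch below. It is not the checked-in Goodput tracker sample: the class name `GoodputTracker`, the three bucket names, and the injectable `clock` are all assumptions made for the example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class GoodputTracker:
    """Accumulates wall time per named phase so phases are never blended."""

    # Hypothetical bucket names; a real lane may track more phases.
    PHASES = ("compile", "checkpoint", "step")

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the tracker can be tested
        self.seconds = defaultdict(float)

    @contextmanager
    def phase(self, name):
        if name not in self.PHASES:
            raise ValueError(f"unknown phase: {name!r}")
        start = self._clock()
        try:
            yield
        finally:
            # Time is charged to exactly one bucket, even if the body raises.
            self.seconds[name] += self._clock() - start

    def goodput(self):
        """Fraction of tracked wall time spent in training steps."""
        total = sum(self.seconds.values())
        return self.seconds["step"] / total if total else 0.0
```

Wrapping the first compiled call in `tracker.phase("compile")` and later steps in `tracker.phase("step")` keeps a slow cold start from being reported as slow steady-state throughput.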
The ordering rule matters more than the slogan
The strongest local lesson from the examples is that ordering is the real contract. If distributed wrapping, compile insertion, and CUDA-graph capture are applied in the wrong order, the system can still look "compiled" in logs while doing the wrong work operationally.
MegaCpp keeps that failure mode visible by treating compile as one stage in a lane definition rather than as a global launch toggle. The regional compile example is useful precisely because it is boring. It records the runtime order directly. That makes it auditable.
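A minimal sketch of what "recording the runtime order directly" can mean is shown here. It is an assumption-laden illustration, not the repo's ordering sample: `LaneRecord` and the stage names are hypothetical, and the expected order used in the usage note is only an example, not a statement of MegaCpp's actual order.

```python
from dataclasses import dataclass, field


@dataclass
class LaneRecord:
    """Records the order in which runtime stages were actually applied."""

    expected: tuple  # declared stage order for this lane (hypothetical names)
    applied: list = field(default_factory=list)

    def apply(self, stage):
        if stage not in self.expected:
            raise ValueError(f"stage {stage!r} is not part of this lane")
        if stage in self.applied:
            raise ValueError(f"stage {stage!r} applied twice")
        self.applied.append(stage)

    def audit(self):
        """Return (ok, message); ok only if applied order equals expected."""
        if tuple(self.applied) == self.expected:
            return True, "order matches lane definition"
        return False, f"expected {self.expected}, got {tuple(self.applied)}"
```

A lane declared as `("distributed_wrap", "regional_compile", "cuda_graph_capture")` that applies the same three stages in a different order fails `audit()`, even though every stage "ran". That is the difference between looking compiled in logs and being auditable.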
The checked-in Regional compile runtime sample and Regional compile block-identity sample are the next useful proof surfaces because they keep wrapper order and visible module identity tied to the same block. If later checkpoint or hook placement stops talking about the same runtime unit, the compile seam is already in the wrong place.
The block-identity sample keeps one narrower attachment rule visible too: try the in-place compile path on the block first and fall back only when the runtime does not expose it. That preserves the same module identity for later distributed wrapping, hook installation, and checkpoint policy instead of silently moving those decisions onto a replacement wrapper.
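The attachment rule can be sketched in a duck-typed way that runs without PyTorch. The helper name `attach_compile` and the `fallback_compile` parameter are hypothetical. The sketch assumes the in-place path is a zero-argument `compile()` method on the block, as `torch.nn.Module.compile()` is in recent PyTorch releases, and that the fallback returns a wrapper object, as `torch.compile(module)` does.

```python
def attach_compile(block, fallback_compile):
    """Compile `block`, preferring an in-place path that keeps its identity.

    Returns (module, mode), where mode is "in_place" or "wrapper".
    """
    in_place = getattr(block, "compile", None)
    if callable(in_place):
        # Mutates the block; the same object stays in the module tree, so
        # later wrapping, hooks, and checkpoint policy still see this block.
        in_place()
        return block, "in_place"
    # Fallback: the returned object may be a different module. Reporting the
    # mode lets the caller decide where later policy should attach.
    return fallback_compile(block), "wrapper"
```

Returning the mode alongside the module keeps the identity change explicit instead of letting a replacement wrapper appear silently.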
That same boundary has to stay aligned with recompute policy. The block-identity sample and Activation recompute boundaries in hybrid stacks are the short local continuation when warmup, replay, and saved-state policy stop describing the same block.
This is the right level of claim for public documentation:
- compile regions are deliberate runtime boundaries
- optimizer compilation is a separate policy surface
- CUDA-graph capture must validate the requested block boundary first
- one opaque kernel can be left outside the region if that keeps the rest of the lane stable
That is a much stronger story than "we use regional compile for speed."
Why CUDA-graph boundaries belong in the same article
Regional compile without graph-capture discipline is only half a system. Compile regions change what the runtime sees as a stable execution unit. CUDA-graph capture makes the same question even stricter: is this region really stable enough to capture repeatedly, or did we just move instability to a later phase?
The MegaCpp examples keep those questions together on purpose. A compile region that looks elegant in isolation can still be wrong if the block registry or shape policy says the region is not a legitimate graph-capture target. The block-validation example is therefore not a side note. It is part of the same runtime truth.
The public validation receipt also avoids treating a launch flag as proof. It reports the requested, found, and missing block types before enabling graph-tree runtime controls, so the dangerous failure mode changes from "graphs were requested" to "these compiled block classes were actually visible to the lane."
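A receipt of that shape might be sketched as follows. This is an illustration under assumptions, not the checked-in sample: `BlockReceipt`, `validate_blocks`, and matching blocks by class name are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockReceipt:
    requested: tuple
    found: tuple
    missing: tuple

    @property
    def capture_allowed(self):
        # A flag alone proves nothing: something must have been requested
        # and every requested block type must actually be visible.
        return bool(self.requested) and not self.missing


def validate_blocks(requested, modules):
    """Compare requested block type names against the modules the lane sees."""
    visible = {type(m).__name__ for m in modules}
    wanted = tuple(dict.fromkeys(requested))  # de-duplicate, keep order
    found = tuple(name for name in wanted if name in visible)
    missing = tuple(name for name in wanted if name not in visible)
    return BlockReceipt(wanted, found, missing)
```

Graph controls would then be enabled only when `receipt.capture_allowed` is true, and the receipt itself is what gets logged.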
One practical extension is to keep warmup and replay tied to a small family of fixed-capacity launches instead of letting every nearby shape variation invent a fresh boundary. That is why CUDA graph block validation sample and Pipeline compile warmup sample belong in the same conversation.
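One way to picture "a small family of fixed-capacity launches" is to round each incoming length up to the nearest declared capacity, so nearby shapes share one warmed-up boundary. The function name `pick_capacity` and the capacity values in the usage note are assumptions for illustration only.

```python
from bisect import bisect_left


def pick_capacity(length, capacities):
    """Return the smallest fixed capacity that fits `length`, or None.

    None means the request falls outside the warmed-up family and should be
    handled by an explicit policy rather than a fresh capture.
    """
    caps = sorted(capacities)
    i = bisect_left(caps, length)
    return caps[i] if i < len(caps) else None
```

With capacities `(512, 1024, 2048)`, lengths 900, 1000, and 1024 all map to the 1024 launch, so warmup and replay describe three boundaries instead of one per observed shape.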
What regional compile is actually buying here
The public examples support a narrower, more defensible benefit statement:
- smaller compile domains can reduce cold-start compile overhead
- deliberate boundaries can reduce the blast radius of graph breaks or dynamic shape churn
- runtime-specific opaque wrappers can preserve stability when one kernel family remains fragile
- compile policy becomes easier to compare when the lane shape is explicit
Notice what is missing: any claim that regional compile is universally faster. That would be the wrong public wording. The honest claim is that regional compile can be the cheaper operational choice when the stack contains repeated substructures plus a small number of unstable boundaries.
Where this lands in MegaCpp
In MegaCpp, regional compile belongs above the kernel layer and below the launcher profile. It is not the same thing as a model feature, and it is not the same thing as a vendor backend. It is a lane-shaping decision that keeps the compile story compatible with the actual runtime shape.
That is why the compile directory is so valuable as public evidence. It does not just say "compile happened." It shows where compile begins, where it should stop, and why the optimizer and graph-capture boundaries have to be part of the same decision.
Prior art and context
The general compiler ideas are not unique to MegaCpp. PyTorch's torch.compile docs, graph-break guidance, recompilation notes, and regional compilation recipe all describe the same high-level tradeoff: you often win by keeping repeated regions compiled while leaving unstable boundaries explicit. MegaCpp's contribution here is narrower. The examples in this repo show how that general idea is turned into a runtime contract for a hybrid, distributed training lane instead of a one-line benchmark trick.
Frequently asked questions
Is regional compile a blanket speed switch?
Why keep CUDA-graph validation in the same story?
When should a kernel stay opaque instead of joining the compiled region?
What is the practical rule for using it?
Should the optimizer live inside the same regional-compile boundary as the model block?
Where does activation recompute sit relative to the compiled region?
Why does block identity still matter if the region already compiles?
Does "distributed wrapping first" mean every outer wrapper goes on before compile?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
CUDA Graphs: CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.
DSA (DeepSeek Sparse Attention): a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.
CUDA: NVIDIA's GPU programming stack (compiler, runtime, driver, libraries, and kernel toolchain) used by CUDA training and inference lanes.
FSDP2: PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.
Compile: Graph breaks, recompile storms, guard explosions, and cache-hygiene rules we landed while keeping torch.compile useful on MegaCpp's hybrid Mamba-3 +…
Runtime boundaries: Why MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,…