
Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 10 min read • David Gornshtein

Torch Compile

Torch XLA

Triton

Inductor

Training Infra

The Compile-Time Tax We Accept for Runtime Speed

Q: When is full-graph compile still worth trying?

When the model shape is small enough and stable enough that one compile unit stays warm after the first hit. On the depth-52 hybrid preset we keep regional_compile; full-graph is the comparison lane, not the default, and it only earns its keep when the compile blast radius stays smaller than the steady-state win. The shortest companion is Regional compile without losing the plot, because that is where we keep the local proof for why the narrower region survives contact with real runs.

Q: What does regional compile give up?

It gives up some whole-model visibility. A smaller region can reduce cold-start cost and confine recompile damage, but it also means cross-block fusion and graph-capture decisions need their own receipts. That is why regional_compile is not proof by itself: the region has to be the same runtime unit checked by Regional compile without losing the plot and the CUDA graph block validation sample.

Q: What must be configured before the first import?

The writable cache roots, not just the accelerator flags. On CUDA that means TORCHINDUCTOR_CACHE_DIR, TORCHINDUCTOR_FX_GRAPH_CACHE=1, and TORCHINDUCTOR_AUTOGRAD_CACHE=1; on TPU it also means JAX_COMPILATION_CACHE_DIR plus the separate Torch/XLA cache path before any JAX or torch_xla work happens. Compile/runtime receipt sample and XLA compile/runtime controls sample are the compact checked-in setup receipts.

Q: How do I know the compile tax has turned into compile debt?

When the lane stops paying the compile bill once and starts repaying it on shape changes, guard flips, or cache misses that should have been avoided. The shortest checked-in proof surfaces are Compile/runtime receipt sample and XLA compile/runtime controls sample, because they show whether warmup flattened into a stable steady state.

Q: Should we just raise the recompile limit?

Only for a bounded, understood set of shapes. PyTorch stops trying to compile a function after the recompile limit is exceeded and runs it eagerly instead, so raising torch._dynamo.config.recompile_limit can buy time during diagnosis but it is not the fix for a counter, flag, or requires_grad boundary that flips every step. In production we treat the limit as a tripwire: if the same region keeps recompiling, the region is too dynamic or the mutable state is in the wrong place.

Q: Why not rely on a linter to catch recompile-causing Python state?

Because the dangerous part is behavioral, not just syntactic. A linter can flag obvious smells, but it cannot prove whether a Python counter flips on the compiled hot path, whether a branch stayed outside the region that actually compiled, or whether the cache miss rate flattened after warmup. Dynamo and compile breakage is the bug catalogue; the compile/runtime receipts are still the proof.

Why MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile, torch_xla and Triton caches honest across runs.


Compilation is not a performance feature; it is a debt contract. We pay it once at session start (autotune, JIT, kernel selection), we pay it again every time a guard fires or a shape changes, and in exchange we are allowed to ship a steady-state step that is meaningfully cheaper than the eager one. This post is the trade-off rationale we use to decide where that contract is worth signing inside the MegaCpp training stack, what we instrument so the bill stays bounded, and the operational rule we ended up writing in blood after the famous "compile hangs and NaN" episode. It is intentionally separate from Dynamo and compile breakage (which is the bug catalogue) and Graph recompilation hell (which is the TPU side); this post is about why we accept the tax at all.

Why MegaCpp cares about this

A representative hybrid stack runs across two very different toolchains: PyTorch 2.x with Inductor and Triton on H200 and B200, and torch_xla plus a JAX path on TPU v6e. Both buy us steady-state throughput by burning wall clock at start and risking wall clock at every recompile. On a depth-52 hybrid preset with Mamba-3 SSM blocks, an MoE expert tail and an MLA/DSA attention minority, the compile surface is huge. Inductor wants to fuse, but a Triton SSM kernel is opaque to it. Dynamo wants to specialize, but MoE routing is shape-dynamic by construction. XLA wants a single HLO, but optional adapters and MTP heads add side branches. If we let the compilers run with defaults, first-step time blows up to tens of minutes, and a single mis-typed counter can trigger a recompile storm that is indistinguishable from a hang on a multi-rank job.

We accept the tax because the alternative is worse. Eager-mode steady state on the same model leaves a low-double-digit percentage of step time on the table, mostly in elementwise glue around the SSM and attention paths and in the per-microbatch overhead of small Python dispatches. We have measured this on multiple presets and the answer has not changed: compiled is faster as long as the cache is warm, the guards are stable, and the recompile budget is bounded. The whole job of the rules below is to make those three preconditions hold.

What we built in the MegaCpp training stack

MegaCpp is where we discovered all the failure modes; production then encodes the survivors. Five surfaces matter.

The first is TORCHINDUCTOR_CACHE_DIR plumbing. Every launch script we still ship sets it explicitly, alongside TORCHINDUCTOR_FX_GRAPH_CACHE=1 and TORCHINDUCTOR_AUTOGRAD_CACHE=1, before Python starts. The runtime should pin the cache to a writable persistent volume, never to a small root disk where Inductor can silently fill the partition mid-run. We learned this the hard way in an early ablation pass when nine consecutive runs failed with "DISK FULL (inductor cache)" and were initially misclassified as model bugs. Public change notes show that the fix was operational, not algorithmic. The practical rule is: export TORCHINDUCTOR_CACHE_DIR to a per-run subdirectory on a large persistent volume, and refuse to start training if the directory is missing or unwritable. Inductor's persistent FX-graph cache then survives across runs and across processes, which is the thing that actually moves first-step wall clock from "tens of minutes" to "tens of seconds" on a warm host.
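The "refuse to start" half of that rule is easy to get wrong if it only checks that the variable is set. Below is a minimal sketch of a preflight check, assuming a launcher written in Python; the function name check_cache_dir and the min_free_bytes threshold are hypothetical and not part of the MegaCpp launchers. It proves writability by actually creating a file, which also catches read-only mounts.

```python
import os
import shutil
import tempfile
from pathlib import Path


def check_cache_dir(env_var: str, min_free_bytes: int, env=None) -> Path:
    """Return the cache directory named by env_var, or raise RuntimeError.

    Hypothetical helper: the name and the free-space threshold are
    illustrative assumptions, not the checked-in launcher code.
    """
    env = os.environ if env is None else env
    raw = env.get(env_var)
    if not raw:
        raise RuntimeError(f"{env_var} is not set")
    path = Path(raw)
    if not path.is_dir():
        raise RuntimeError(f"{env_var}={path} is missing or not a directory")
    try:
        # Create and delete a real file: permission bits alone can lie
        # on read-only or quota-limited mounts.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        raise RuntimeError(f"{env_var}={path} is not writable: {exc}") from exc
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        raise RuntimeError(
            f"{env_var}={path} has {free} bytes free, need {min_free_bytes}"
        )
    return path
```

A launcher would call this for TORCHINDUCTOR_CACHE_DIR before importing torch and exit non-zero on RuntimeError, so a full disk shows up as a launch failure rather than as a mid-run crash.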

The second is the choice between regional_compile and a full-graph compile. A representative model runtime module exposes regional_compile: bool at the model config level, and the hot path is structured so that the block is the compile unit, not the whole model. Mamba blocks, attention blocks, MoE blocks and the hyper-connection plumbing all carry comments that explicitly call out "regional_compile region" boundaries; the variadic unpack at the block boundary is intentionally kept in eager Python so Dynamo does not have to trace through *args, **kwargs.

We tried full-graph compile on the depth-52 preset and rejected it: the guard surface across 52 mixed blocks is too large, autotuning on H200 SM90 with TORCHINDUCTOR_DISTRIBUTED_MAX_AUTOTUNE_GEMM=1 ran the per-rank Triton subprocess out of memory, and one shape mismatch anywhere in the graph forces a full recompile of all 52 blocks at once. Regional compile localises both the autotune cost and the recompile blast radius: a guard miss in MoE only invalidates the MoE region, not the SSM region next door.

The third is the Dynamo guards we actively avoid. The pattern we now treat as a code smell is any Python-level state that a compiled function can read on the hot path: an int counter, a bool flag, a getattr(self, ...) lookup, an environment variable read inside a method that is later compiled. Each of these becomes a guard, and a guard that flips even once can trigger recompilation. The MoE module carries explicit comments at the points where this matters: loss accumulators are stored as Tensors not as None-or-Tensor unions (so Dynamo does not synthesise a type-change guard), buffer caches are looked up via direct attribute access instead of getattr (no attribute-lookup guard), and routing branches are pre-baked into module-level booleans at construction time so Dynamo can specialize without a per-call guard. Where we genuinely need a Python branch we mark it with @torch.compiler.disable, accept the graph break, and document it in place.

PyTorch's observability tools make this easier to prove, but they do not change the contract. TORCH_LOGS=recompiles and tlparse show which guard failed, and PyTorch's dynamic-shape controls are useful for confirming when an nn.Module integer field is being over-specialized. We still treat those as debugging escape hatches rather than as the production fix: the durable fix is to move mutable Python state out of the compiled hot path so the guard surface stays stable and reviewable.
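When the logs are too noisy to read on a multi-rank job, a counting backend gives a cheap in-process signal. The sketch below is an assumption-laden illustration, not a MegaCpp tool: CompileCounter and scale are hypothetical names, and the backend runs the captured graph unoptimized, so it is only useful for counting how many graphs Dynamo hands over, not for measuring speed.

```python
import torch


class CompileCounter:
    """Custom torch.compile backend that counts captured graphs.

    Each call corresponds to one graph Dynamo traced, either on first
    compile or after a guard failure. The graph is returned as-is.
    """

    def __init__(self) -> None:
        self.graphs = 0

    def __call__(self, gm: torch.fx.GraphModule, example_inputs):
        self.graphs += 1
        return gm.forward  # run the captured graph without optimization


def scale(x: torch.Tensor) -> torch.Tensor:
    return x * 2.0 + 1.0


if __name__ == "__main__":
    counter = CompileCounter()
    compiled = torch.compile(scale, backend=counter)
    compiled(torch.ones(4))
    compiled(torch.zeros(4))                       # same shape and dtype
    compiled(torch.ones(4, dtype=torch.float64))   # dtype guard fails
    print("graphs captured:", counter.graphs)
```

The count should stay flat across calls that satisfy the existing guards and rise when a guard fails, as with the dtype change here. TORCH_LOGS=recompiles is still the tool that names the failing guard.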

The fourth is the Triton-kernel compile wrapper choice for Mamba. The Mamba-3 SISO kernel and the older Mamba-2 SSD kernel are both torch.autograd.Function implementations whose backward dereferences Triton-specific APIs. Dynamo's FakeTensor proxy crashes when it tries to trace those backward bodies. We wrap both forward and backward in torch.library.custom_op (the checked-in opaque-kernel compile wrapper sample), provide register_fake shape-only stubs, and wire autograd via setup_context plus register_autograd. The kernel becomes opaque to Inductor; the rest of the M-block (in_proj, conv1d, out_proj, RMSNorms) compiles around it. The trade-off is explicit: we forfeit any Inductor fusion across the SSM boundary in exchange for keeping the surrounding linear/norm work compiled. Repeated measurements show this is a win at the target shapes; a fully eager M-block lost more on the surrounding glue than it gained on the kernel.

The fifth is XLA persistent cache discipline. The TPU side keeps its own JAX persistent compilation cache (jax.config.update("jax_compilation_cache_dir", ...)) and its own torch_xla HLO cache. Both must be pinned to writable, persistent directories before any JAX or torch_xla operation runs. The JAX side is configured via JAX_COMPILATION_CACHE_DIR; the checked-in TPU control receipt keeps the Torch/XLA cache path separate as XLA_COMPILATION_CACHE_DIR, with runtime cache initialization still happening before first computation. Without those paths every run paid the full HLO compile cost over again, which on a hybrid preset is not a few seconds. The same precondition applies on the CUDA side: a missing or non-writable cache directory should abort launch.
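The ordering constraint (caches first, computation second) is the part worth making executable. The following is a sketch only: init_tpu_caches is a hypothetical helper, XLA_COMPILATION_CACHE_DIR is the name used by the receipt above rather than a variable torch_xla reads by itself, and the imports are guarded so the function also runs on hosts without JAX or torch_xla installed. It uses jax.config.update and torch_xla.runtime.initialize_cache.

```python
import os
from pathlib import Path


def init_tpu_caches(env=None) -> dict:
    """Validate both cache roots, then hand them to JAX and torch_xla.

    Must run before any JAX or torch_xla computation. Hypothetical
    helper; raises RuntimeError so the launcher can abort early.
    """
    env = os.environ if env is None else env
    paths = {}
    for var in ("JAX_COMPILATION_CACHE_DIR", "XLA_COMPILATION_CACHE_DIR"):
        raw = env.get(var)
        if not raw:
            raise RuntimeError(f"{var} must be set before first computation")
        path = Path(raw)
        if not path.is_dir() or not os.access(path, os.W_OK):
            raise RuntimeError(f"{var}={path} is missing or not writable")
        paths[var] = path

    try:
        import jax  # optional on non-TPU hosts
    except ImportError:
        jax = None
    if jax is not None:
        jax.config.update(
            "jax_compilation_cache_dir", str(paths["JAX_COMPILATION_CACHE_DIR"])
        )

    try:
        import torch_xla.runtime as xr  # optional on non-TPU hosts
    except ImportError:
        xr = None
    if xr is not None:
        # torch_xla requires this call before any computation runs.
        xr.initialize_cache(str(paths["XLA_COMPILATION_CACHE_DIR"]), readonly=False)

    return paths
```

Keeping the two directories separate matches the receipt: the JAX cache and the Torch/XLA cache use different key schemes and should not share a root.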

A warm shared cache is still not a portability promise. JAX's persistent-cache key includes the non-optimized HLO, jaxlib, relevant XLA flags, and device topology, and on cold multi-node runs all processes compile while only global rank 0 writes the cache. In practice that means a shared mount is only setup; the real receipt is whether cache misses flatten after warmup on the actual topology you plan to run.

How it lands in MegaCpp

In the deployed MegaCpp stack the rules become non-optional.

We lift the cache plumbing as-is. Every preset launcher exports TORCHINDUCTOR_CACHE_DIR, TORCHINDUCTOR_FX_GRAPH_CACHE, TORCHINDUCTOR_AUTOGRAD_CACHE, the JAX cache directory and the Torch/XLA cache directory before Python is allowed to start. The MegaCpp bring-up script verifies free space and write permission on the cache volume before allocating GPUs.

We rewrite the autotune surface. On H200 SM90 we keep TORCHINDUCTOR_DISTRIBUTED_MAX_AUTOTUNE_GEMM=0 because the distributed-autotune subprocess OOMs against a concurrent FSDP/EP layout; on B200 and on single-node H200 with TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 we widen the autotune backends to ATEN,TRITON and accept the longer first-compile in exchange for better steady-state kernels. The TORCHINDUCTOR_MAX_AUTOTUNE_SUBPROC_RESULT_TIMEOUT is pinned to a value that survives our slowest kernel without false-failing.
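Written as data, the autotune policy is small enough to review in one screen. This sketch is one reading of the paragraph above, with several assumptions: autotune_env is a hypothetical helper, the multi_rank split follows the checklist wording, TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS is assumed to be the knob behind "autotune backends", and the timeout is a caller-supplied parameter because the pinned value is not stated here.

```python
def autotune_env(gpu: str, multi_rank: bool, result_timeout_s: int) -> dict:
    """Return the Inductor autotune environment for one launch.

    Hypothetical helper. Variable names other than the GEMM-backends
    knob are taken from the text; the timeout value is an input, not
    a recommendation.
    """
    if gpu not in {"H200", "B200"}:
        raise ValueError(f"no autotune policy for {gpu!r}")
    if result_timeout_s <= 0:
        raise ValueError("result timeout must be positive")

    if gpu == "H200" and multi_rank:
        # Distributed autotune OOMs against a concurrent FSDP/EP layout.
        return {"TORCHINDUCTOR_DISTRIBUTED_MAX_AUTOTUNE_GEMM": "0"}

    # B200, or single-node H200: autotune in a subprocess with a wider
    # backend set and a bounded result timeout.
    return {
        "TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC": "1",
        "TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS": "ATEN,TRITON",
        "TORCHINDUCTOR_MAX_AUTOTUNE_SUBPROC_RESULT_TIMEOUT": str(result_timeout_s),
    }
```

A launcher would merge the returned mapping into the child environment before Python starts, so the policy is fixed per run rather than read lazily inside the training process.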

We drop the NO_COMPILE escape hatch. That switch existed only because of the regression that gave this post its operational rule; it is no longer needed and is not exposed in the deployed MegaCpp stack.

We move the Mamba kernel wrappers under a feature flag. The custom_op wrappers are the production default; the eager fallback exists only for the SM<80 development boxes (where bf16 Triton codegen is broken and we fall back to fp16) and is gated behind a single env switch.

We move the FIRE/DASH plasticity hooks out of the regional-compile region. The hyper-connections and FIRE-orthogonalisation passes have explicit regional_compile boundaries that stay eager-only; in the deployed MegaCpp stack those boundaries are enforced by startup checks instead of comments alone.

The whole set can be summarized as a compile contract enforced at startup; if any contract item is violated, the run should abort before touching an accelerator.

Ablations and what we kept

Ablation history tells the trade-off story more honestly than any micro-benchmark. The throughput investigation entry from this spring is the canonical reference: the 8-GPU DDP path with torch.compile enabled was hanging or going NaN, and the diagnosis bounced through three wrong root causes (Muon, bf16, network) before landing on the real one. The MoE module's _overflow_total counter was a Python int that incremented every forward(). Dynamo specialized on the value of that counter and recompiled each time the value changed. With the recompile limit at 64 it took only a handful of microbatches before we hit the limit on every rank, and the symptoms - a multi-minute stall followed by NaN - looked exactly like a numerical instability, not a compiler bug.

The fix is the operational rule we shipped after that episode and the single most important paragraph in this post:

No Python-level mutable state on the compiled hot path. Counters, accumulators and flags that change across forwards must be register_buffer Tensors mutated in-place under torch.no_grad(). Anything else is a recompile waiting to happen.

The fix in the main MoE runtime module was mechanical - convert _overflow_total and the related accumulators to Tensor buffers, mutate them with .add_() - but the rule generalises. We now grep for int and bool attributes on any nn.Module that lives inside a regional_compile region and treat each one as a code-review block. The same rule eliminated three other near-misses: a getattr lookup that Dynamo treated as opaque (rewritten as direct attribute access set in __init__), a None-or-Tensor loss accumulator (always-Tensor with a zero default), and an env-var read inside forward() (cached at construction time, with a comment that changing the env now requires recreating the module).
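The rule is easiest to review when the safe pattern is written out. This is a minimal sketch of the buffer-based counter, not the MoE runtime module itself: OverflowTracker, capacity and overflow_total are hypothetical stand-ins for the real router and its _overflow_total accumulator.

```python
import torch
from torch import nn


class OverflowTracker(nn.Module):
    """Counts over-capacity entries without any mutable Python state."""

    def __init__(self, capacity: float):
        super().__init__()
        # Fixed at construction and never mutated, so any guard on it
        # stays stable for the lifetime of the module.
        self.capacity = capacity
        # The counter is a tensor buffer, not a Python int: its value
        # can change every step without changing what Dynamo guards on.
        self.register_buffer(
            "overflow_total", torch.zeros((), dtype=torch.long), persistent=False
        )

    def forward(self, load: torch.Tensor) -> torch.Tensor:
        overflow = load > self.capacity
        with torch.no_grad():
            # In-place tensor update; no Python-level counter to specialize on.
            self.overflow_total.add_(overflow.sum())
        return load.clamp(max=self.capacity)
```

The unsafe version differs by one line, self.overflow_total += 1 on a Python int, which is why grep plus review is the enforcement mechanism rather than a type checker.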

The other ablation worth keeping is the regional vs full-graph comparison. We re-ran the depth-52 hybrid preset under both modes after the MoE counter fix landed: full-graph had a longer first-compile, hit autotune OOM under DDP unless we disabled distributed autotune, and recompiled the entire 52-block graph on every shape change. Regional kept the recompile blast radius local. Steady-state throughput was within noise; first-compile and recompile cost was strictly worse for full-graph. We kept regional.

Production checklist

Export TORCHINDUCTOR_CACHE_DIR, TORCHINDUCTOR_FX_GRAPH_CACHE=1, TORCHINDUCTOR_AUTOGRAD_CACHE=1 before Python starts; pin to a persistent volume; abort the run if the directory is missing or unwritable.
Pin JAX_COMPILATION_CACHE_DIR and the Torch/XLA cache path the same way; no JAX or torch_xla work is allowed before cache initialization.
Default to regional_compile on the depth-52 hybrid preset; full-graph is opt-in for small experiments only.
Wrap Triton autograd kernels in torch.library.custom_op with register_fake; never let Dynamo trace into a Triton backward.
No Python-level mutable state on the compiled hot path. All accumulators are register_buffer Tensors mutated under torch.no_grad().
Mark all Python branches that must stay eager with @torch.compiler.disable and document why in place.
Disable TORCHINDUCTOR_DISTRIBUTED_MAX_AUTOTUNE_GEMM on multi-rank H200 SM90; re-enable in subprocess mode on B200 with a bounded result timeout.
Track the per-run unique program count (CUDA: Inductor cache misses; XLA: distinct HLO keys). Recompile budget is bounded; exceeding it pages a human.
Doctor script verifies cache plumbing, dynamo recompile counters and Triton autotune timeouts before any rank claims a GPU.
The compile contract doc is the source of truth; launchers read it at start and refuse to run on violation.
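The "recompile budget is bounded" item in the checklist above can be tracked with very little code. Below is a sketch under stated assumptions: RecompileBudget is a hypothetical class, the region names and program keys are whatever the caller extracts from Inductor cache misses or HLO keys, and paging is left to the caller.

```python
from collections import defaultdict


class RecompileBudget:
    """Tracks distinct compiled programs per region against a fixed budget."""

    def __init__(self, max_programs_per_region: int):
        if max_programs_per_region < 1:
            raise ValueError("budget must allow at least one program")
        self.max_programs_per_region = max_programs_per_region
        self._seen = defaultdict(set)

    def record(self, region: str, program_key: str) -> bool:
        """Record one compiled program; return False once over budget."""
        self._seen[region].add(program_key)  # repeats of a key are free
        return len(self._seen[region]) <= self.max_programs_per_region

    def over_budget(self) -> dict:
        """Map each offending region to its distinct program count."""
        return {
            region: len(keys)
            for region, keys in self._seen.items()
            if len(keys) > self.max_programs_per_region
        }
```

Counting distinct keys rather than compile events keeps warm-cache hits from consuming budget, so the tripwire fires on genuinely new programs only.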

FAQ

Frequently asked questions

When is full-graph compile still worth trying?+

When the model shape is small enough and stable enough that one compileQuick term guideCompileWhy MegaCpp treats regional compile as a runtime-boundary decision rather than a blanket switch, and how compile ordering stays tied to distributed… unit stays warm after the first hit. On the depth-52 hybrid preset we keep regional_compile; full-graph is the comparison lane, not the default, and it only earns its keep when the compile blast radius stays smaller than the steady-state win. The shortest companion is Regional compile without losing the plot, because that is where we keep the local proof for why the narrower region survives contact with real runs.

What does regional compile give up?+

It gives up some whole-model visibility. A smaller region can reduce cold-start cost and confine recompile damage, but it also means cross-block fusion and graph-capture decisions need their own receipts. That is why regional_compile is not proof by itself: the region has to be the same runtime unit checked by Regional compile without losing the plot and the CUDA graph block validation sample.

What must be configured before the first import?+

The writable cache roots, not just the accelerator flags. On CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes. that means TORCHINDUCTOR_CACHE_DIR, TORCHINDUCTOR_FX_GRAPH_CACHE=1, and TORCHINDUCTOR_AUTOGRAD_CACHE=1; on TPU it also means JAX_COMPILATION_CACHE_DIR plus the separate Torch/XLA cache path before any JAXQuick term guideJAXA separate frontend above PJRT/libtpu. In these TPU posts it mainly matters as the owner of NamedSharding, PartitionSpec, and the optional call_jax or Pallas-adjacent bridge lanes. or torch_xla work happens. Compile/runtime receipt sample and XLA compile/runtime controls sample are the compact checked-in setup receipts.

How do I know the compile tax has turned into compile debt?

When the lane stops paying the compile bill once and starts repaying it on shape changes, guard flips, or cache misses that should have been avoided. The shortest checked-in proof surfaces are Compile/runtime receipt sample and XLA compile/runtime controls sample, because they show whether warmup flattened into a stable steady state.
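One way to read "flattened into a stable steady state" is as a check over per-step compile events. The sketch below assumes the receipts can be reduced to a list with one compile or cache-miss count per training step; that format and the function name `compile_debt_steps` are hypothetical, not the layout of the checked-in samples.

```python
def compile_debt_steps(compile_events_per_step, warmup_steps):
    """Return the step indices that still compiled after warmup.

    `compile_events_per_step[i]` is the number of compile or cache-miss
    events observed at step i (assumed input format). An empty result
    means the compile bill was paid once, during warmup.
    """
    return [
        step
        for step, events in enumerate(compile_events_per_step)
        if step >= warmup_steps and events > 0
    ]


healthy = compile_debt_steps([3, 1, 0, 0, 0, 0], warmup_steps=2)
in_debt = compile_debt_steps([3, 1, 0, 1, 0, 2], warmup_steps=2)
print(healthy)  # []
print(in_debt)  # [3, 5]
```

The second run is the debt pattern: compile events keep reappearing after the warmup window, so the cost is being repaid rather than paid once.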

Should we just raise the recompile limit?

Only for a bounded, understood set of shapes. PyTorch stops trying to compile a function after the recompile limit is exceeded and runs it eagerly instead, so raising torch._dynamo.config.recompile_limit can buy time during diagnosis but it is not the fix for a counter, flag, or requires_grad boundary that flips every step. In production we treat the limit as a tripwire: if the same region keeps recompiling, the region is too dynamic or the mutable state is in the wrong place.
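The tripwire idea can be sketched without touching Dynamo at all. The class below is a hypothetical illustration, not a PyTorch API: it assumes the caller can summarise whatever would trigger a recompile (shape, requires_grad, a flag value) as a hashable `guard_key`, and it fails loudly once a region has seen more distinct keys than its budget.

```python
from collections import defaultdict


class RecompileTripwire:
    """Fail loudly when a region sees more distinct guard keys than expected.

    Hypothetical helper for illustration; `region` and `guard_key` are
    caller-defined and are not Dynamo concepts exposed by this class.
    """

    def __init__(self, budget):
        self.budget = budget
        self._seen = defaultdict(set)

    def observe(self, region, guard_key):
        """Record one call; return the distinct-key count for the region."""
        seen = self._seen[region]
        seen.add(guard_key)
        if len(seen) > self.budget:
            raise RuntimeError(
                f"region {region!r} saw {len(seen)} distinct guard keys "
                f"(budget {self.budget}); it is too dynamic for this boundary"
            )
        return len(seen)


tripwire = RecompileTripwire(budget=2)
tripwire.observe("block", (8, 128, True))   # first shape
tripwire.observe("block", (8, 128, True))   # repeat, no new key
tripwire.observe("block", (8, 256, True))   # second shape, still in budget
```

A third distinct key for `"block"` would raise. A budget that matches the bounded set of shapes you actually expect turns a silent fall back to eager into an explicit failure you can investigate.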

Why not rely on a linter to catch recompile-causing Python state?

Because the dangerous part is behavioral, not just syntactic. A linter can flag obvious smells, but it cannot prove whether a Python counter flips on the compiled hot path, whether a branch stayed outside the region that actually compiled, or whether the cache miss rate flattened after warmup. Dynamo and compile breakage is the bug catalogue; the compile/runtime receipts are still the proof.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

JAX

A separate frontend above PJRT/libtpu. In these TPU posts it mainly matters as the owner of NamedSharding, PartitionSpec, and the optional call_jax or Pallas-adjacent bridge lanes.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Expert parallelism

Expert parallelism (EP) partitions MoE experts across GPUs: 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then a second all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.

Compile

Why MegaCpp treats regional compile as a runtime-boundary decision rather than a blanket switch, and how compile ordering stays tied to distributed…

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

David Gornshtein • MegaCpp
