MegaCpp Engineering: Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published · 2 min read · David Gornshtein
XLA
TPU
Recompilation
Graph
Performance

Graph recompilation hell: shape drift, graph contracts, and why TPU runs slow down without crashing

A walkthrough of the most common TPU recompilation failure mode: changing shapes, unstable graph contracts, and weak runtime discipline.


The most expensive TPU failures are often not crashes. They are recompiles: the run stays alive, step time stretches, the compile cache stops helping, and the team spends hours looking at the wrong layer. In practice this is usually a graph-contract problem, not a random TPU failure.

What public XLA guidance already says

PyTorch/XLA's public recompilation notes are direct: XLA prefers static shapes, and changing shapes can trigger recompilation. The bounded-dynamic-shape docs soften that story, but they do not erase it. Bounded dynamic shape reduces some classes of recompilation. It does not make graph drift disappear.

That matters because recompilation is often described as if it were random. It usually is not. It is a symptom that the runtime is seeing a materially different graph contract.

The most common causes

Cause | Why it recompiles
input shape drift | XLA sees a different graph signature
data-dependent graph changes | the traced program no longer matches the previous step
hidden startup inconsistency | runtime policy changed between launches
weak batching discipline | supposedly identical steps do not really match
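
One common way to tame the first and last rows of this table is to pad every batch to a small, fixed set of sequence lengths, so the compiler only ever sees a handful of shapes. The sketch below is a minimal, framework-free illustration of that idea. The bucket boundaries, `PAD_ID`, and the helper names `pick_bucket` and `pad_batch` are assumptions made for the example, not something the article or PyTorch/XLA prescribes.

```python
from bisect import bisect_left

# Hypothetical bucket boundaries; choose values that match your own data.
BUCKETS = (128, 256, 512, 1024)
PAD_ID = 0  # assumed padding token id


def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket that can hold `length` tokens."""
    i = bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def pad_batch(sequences, buckets=BUCKETS, pad_id=PAD_ID):
    """Pad every sequence in the batch to one shared bucket length."""
    target = pick_bucket(max(len(s) for s in sequences), buckets)
    return [list(s) + [pad_id] * (target - len(s)) for s in sequences]


batch = pad_batch([[5, 6, 7], [8, 9]])
print([len(row) for row in batch])  # [128, 128]
```

With four buckets the step function can see at most four sequence lengths, so any shape outside that set is a bug to find rather than a new graph to compile.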

Why dynamic shape is not a magic fix

Bounded dynamic shape is useful on TPU because it can absorb some variation while keeping memory allocation compatible with accelerator constraints. But the public docs are clear: it reduces some recompiles, not all of them.

The operational boundary is narrower than many first readings imply. A bounded shape only helps while the step stays inside one declared upper-bounded family and the hot path keeps using that symbolic contract. Once the code starts reading real dimensions back into Python or letting value-dependent operators decide downstream shapes, the run is back in ordinary graph-break territory rather than "dynamic shape fixed it" territory.

That is why the practical debugging rule is to change one runtime dimension at a time and keep a small deterministic smoke lane. If several things move together, recompilation stops being diagnosable.
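
The one-dimension-at-a-time rule can be checked mechanically before a run is launched. The following is a minimal sketch, assuming run settings are available as a flat dictionary. The setting names and the helpers `changed_keys` and `check_single_change` are hypothetical and exist only for this illustration.

```python
_MISSING = object()  # sentinel so a missing key never equals a real value


def changed_keys(baseline, candidate):
    """Names of settings whose values differ between two run configs."""
    keys = set(baseline) | set(candidate)
    return sorted(
        k for k in keys
        if baseline.get(k, _MISSING) != candidate.get(k, _MISSING)
    )


def check_single_change(baseline, candidate):
    """Raise if the candidate run moves more than one setting at once."""
    diff = changed_keys(baseline, candidate)
    if len(diff) > 1:
        raise ValueError(f"too many dimensions changed at once: {diff}")
    return diff


# Hypothetical smoke-lane settings, for illustration only.
smoke = {"batch_size": 8, "seq_len": 512, "mesh": "1x8", "dtype": "bf16"}
trial = dict(smoke, seq_len=1024)
print(check_single_change(smoke, trial))  # ['seq_len']
```

Rejecting a launch that moves two settings at once keeps every slowdown attributable to exactly one change relative to the smoke lane.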

Practical rule

If a TPU run slows down without crashing, ask three questions before touching model math:

  1. Did input shapes change?
  2. Did the runtime profile or SPMD policy change?
  3. Did a previously hidden graph branch become active?

Most of the time, one of those answers is yes.
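
The three questions can be turned into a small per-step record that is compared against the previous step. This is a sketch under stated assumptions: `StepContract`, `make_contract`, and `what_changed` are hypothetical names, and the fields are whatever your own launcher and model code can report. The XLA runtime does not expose this structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepContract:
    input_shapes: tuple         # e.g. (("tokens", (8, 512)),)
    runtime_profile: tuple      # e.g. (("mesh", "1x8"),)
    active_branches: frozenset  # names of optional code paths that ran


def make_contract(shapes, profile, branches):
    """Normalise plain dicts and sets into a comparable record."""
    return StepContract(
        input_shapes=tuple(sorted((k, tuple(v)) for k, v in shapes.items())),
        runtime_profile=tuple(sorted(profile.items())),
        active_branches=frozenset(branches),
    )


def what_changed(prev, curr):
    """Answer the three questions for two consecutive steps."""
    answers = []
    if prev.input_shapes != curr.input_shapes:
        answers.append("input shapes")
    if prev.runtime_profile != curr.runtime_profile:
        answers.append("runtime profile or SPMD policy")
    if prev.active_branches != curr.active_branches:
        answers.append("active graph branches")
    return answers


prev = make_contract({"tokens": (8, 512)}, {"mesh": "1x8"}, {"main"})
curr = make_contract({"tokens": (8, 512)}, {"mesh": "1x8"}, {"main", "aux_loss"})
print(what_changed(prev, curr))  # ['active graph branches']
```

Logging the result only when it is non-empty gives a cheap trail of contract changes that can be lined up against step-time spikes.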

FAQ

Frequently asked questions

Which debug switches belong in the first recompilation repro?
Start with PT_XLA_DEBUG_LEVEL=2 for the compile/execute cause summary, then add XLA_IR_DEBUG=1 and XLA_HLO_DEBUG=1 only when you need Python stack traces attached to IR and HLO. Keep those flags out of normal timing runs because the PyTorch/XLA docs describe them as debugging aids with performance cost. Once the noisy cause is visible, pair the log with torch_xla.debug.metrics instead of guessing from wall-clock time alone. The local-safe surfaces are the TPU compile/runtime control sample and the Canonical XLA flag profile.
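
A minimal sketch of keeping these switches confined to a repro launch rather than the normal timing environment. The flag names come from the answer above. The helper `repro_env` and the script name in the usage comment are hypothetical.

```python
import os


def repro_env(with_stack_traces=False, base=None):
    """Build an environment for a recompilation repro run.

    The debug flags are set on a copy, so the parent process and any
    normal timing runs launched from it stay unaffected.
    """
    env = dict(os.environ if base is None else base)
    env["PT_XLA_DEBUG_LEVEL"] = "2"
    if with_stack_traces:
        # Only needed when Python stack traces on IR and HLO are required.
        env["XLA_IR_DEBUG"] = "1"
        env["XLA_HLO_DEBUG"] = "1"
    return env


# Usage (train.py is a placeholder for your own entry point):
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"], env=repro_env())
```

Passing the environment explicitly to the child process makes it harder for a debug flag to leak into a later benchmark run from the same shell.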
Which counters prove the run is still recompiling after warmup?
After the warmup window, CompileCacheMiss should stop climbing for one stable bucket. If it keeps increasing, read it next to CompileTime, host/device transfer counters, and aten:: fallback counters such as aten::nonzero or aten::_local_scalar_dense. That combination usually means the graph contract is still changing, not merely that the step is slow. The next grep targets are value-dependent ops such as torch.nonzero, torch.unique, boolean indexing, tensor scalar reads, concrete dynamic-dimension queries, and rank-specific SPMD or logging branches. If those appear while input buckets are stable, debug the graph boundary before changing model math.
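
The check described above can be sketched without any accelerator attached, assuming one counter snapshot per step is already available as a plain dictionary. In a real run those snapshots might be collected through `torch_xla.debug.metrics` (for example with its `counter_names()` and `counter_value()` helpers), but that wiring is an assumption and is not shown here. The counter names are taken from the answer above, and `counters_still_climbing` is a hypothetical helper.

```python
def counters_still_climbing(snapshots, warmup_steps,
                            names=("CompileCacheMiss",),
                            prefixes=("aten::",)):
    """Return watched counters that grew after the warmup window.

    `snapshots` is a list of {counter_name: value} dicts, one per step.
    """
    after = snapshots[warmup_steps:]
    if len(after) < 2:
        return {}
    first, last = after[0], after[-1]
    growth = {}
    for name, value in last.items():
        watched = name in names or name.startswith(prefixes)
        delta = value - first.get(name, 0)
        if watched and delta > 0:
            growth[name] = delta
    return growth


# Made-up snapshot values, for illustration only.
steps = [
    {"CompileCacheMiss": 1},
    {"CompileCacheMiss": 2},
    {"CompileCacheMiss": 2, "aten::nonzero": 0},
    {"CompileCacheMiss": 3, "aten::nonzero": 1},
    {"CompileCacheMiss": 4, "aten::nonzero": 2},
]
print(counters_still_climbing(steps, warmup_steps=2))
# {'CompileCacheMiss': 2, 'aten::nonzero': 2}
```

An empty result for a stable bucket is the signal that warmup is really over. A non-empty one names the counters to read next to each other.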
Can stable input buckets still recompile when SPMD policy changes?
Yes. A bucket is only stable when the input shape, sharding annotations, runtime profile, and rank-local branches stay stable together. If CompileCacheMiss keeps climbing while sequence buckets are fixed, compare the SPMD profile and any rank-specific logging or control-flow path before blaming model math. The local handoff is XLA SPMD sharding annotations plus the TPU compile/runtime control sample.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

XLA SPMD

The explicit TPU sharding mode where one compiled program carries placement rules instead of rank-local imperative code.

XLA

The compiler/runtime layer that lowers frontend tensor programs into executable TPU or accelerator graphs, with shape stability and ownership boundaries as the main operational concerns here.

TPU

Google's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.
