Graph recompilation hell: shape drift, graph contracts, and why TPU runs slow down without crashing
A walkthrough of the most common TPU recompilation failure mode: changing shapes, unstable graph contracts, and weak runtime discipline.

The most expensive TPU failures are often not crashes. They are recompiles: the run stays alive, step time stretches, the compile cache stops helping, and the team spends hours looking at the wrong layer. In practice this is usually a graph-contract problem, not a random TPU failure.
What public XLA guidance already says
PyTorch/XLA's public recompilation notes are direct: XLA prefers static shapes, and changing shapes can trigger recompilation. The bounded-dynamic-shape docs soften that story, but they do not erase it. Bounded dynamic shape reduces some classes of recompilation. It does not make graph drift disappear.
That matters because recompilation is often described as if it were random. It usually is not. It is a symptom that the runtime is seeing a materially different graph contract.
The most common causes
| Cause | Why it recompiles |
|---|---|
| input shape drift | XLA sees a different graph signature |
| data-dependent graph changes | the traced program no longer matches the previous step |
| hidden startup inconsistency | runtime policy changed between launches |
| weak batching discipline | supposedly identical steps do not really match |
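
One common way to contain input shape drift and loose batching is to pad every sequence to one of a small, fixed set of bucket lengths, so the compiler only ever sees a handful of shapes. The sketch below is framework-free and purely illustrative: `BUCKETS`, `pick_bucket`, and `pad_to_bucket` are hypothetical names, and the bucket sizes are arbitrary assumptions, not recommendations.

```python
from bisect import bisect_left
from typing import List, Sequence

# Hypothetical bucket lengths (assumption); must be sorted ascending.
BUCKETS = (128, 256, 512)


def pick_bucket(length: int, buckets: Sequence[int] = BUCKETS) -> int:
    """Return the smallest bucket length that can hold `length` items."""
    idx = bisect_left(buckets, length)
    if idx == len(buckets):
        # Refuse to invent a new shape: an unseen shape means a new graph.
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[idx]


def pad_to_bucket(
    tokens: Sequence[int],
    buckets: Sequence[int] = BUCKETS,
    pad_id: int = 0,
) -> List[int]:
    """Right-pad `tokens` with `pad_id` up to its bucket length."""
    target = pick_bucket(len(tokens), buckets)
    return list(tokens) + [pad_id] * (target - len(tokens))
```

Raising on oversize input instead of silently growing is a deliberate choice in this sketch: it turns an unplanned new shape into a visible error rather than a quiet recompile.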
Why dynamic shape is not a magic fix
Bounded dynamic shape is useful on TPU because it can absorb some variation while keeping memory allocation compatible with accelerator constraints. But the public docs are clear: it reduces some recompiles, not all of them.
The operational boundary is narrower than many first readings imply. A bounded shape only helps while the step stays inside one declared upper-bounded family and the hot path keeps using that symbolic contract. Once the code starts reading real dimensions back into Python or letting value-dependent operators decide downstream shapes, the run is back in ordinary graph-break territory rather than "dynamic shape fixed it" territory.
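
The difference between a value-dependent shape and a fixed one can be shown without any framework. In the minimal sketch below, the filtered version produces an intermediate whose length depends on the data, while the masked version keeps the same length for every input. The function names are hypothetical; in tensor code the first pattern roughly corresponds to boolean indexing or `torch.nonzero`, and the second to multiplying by a mask or using `torch.where`, though how a given runtime handles either is an assumption to verify against your own logs.

```python
from typing import List


def positive_sum_filtered(values: List[float]) -> float:
    """Sum positives by filtering: intermediate length depends on the values."""
    kept = [v for v in values if v > 0]  # length varies with the data
    return sum(kept)


def positive_sum_masked(values: List[float]) -> float:
    """Sum positives by masking: every intermediate has len(values) items."""
    mask = [1.0 if v > 0 else 0.0 for v in values]  # fixed length
    return sum(v * m for v, m in zip(values, mask))
```

Both functions return the same number; only the masked one has a shape that can be stated before the data is seen.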
That is why the practical debugging rule is to change one runtime dimension at a time and keep a small deterministic smoke lane. If several things move together, recompilation stops being diagnosable.
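
One way to enforce the one-dimension-at-a-time rule is to diff the candidate run configuration against the smoke-lane baseline and refuse to launch if more than one key moved. This is a sketch under assumptions: the config is modeled as a flat dict, and the keys used in the usage note (`batch_size`, `seq_len_bucket`, `spmd_profile`) are hypothetical placeholders.

```python
from typing import Any, Dict, List

_MISSING = object()  # sentinel so an absent key differs from any real value


def changed_keys(baseline: Dict[str, Any], candidate: Dict[str, Any]) -> List[str]:
    """Return the sorted keys whose values differ between the two configs."""
    keys = sorted(set(baseline) | set(candidate))
    return [
        k for k in keys
        if baseline.get(k, _MISSING) != candidate.get(k, _MISSING)
    ]


def require_single_change(baseline: Dict[str, Any], candidate: Dict[str, Any]) -> str:
    """Return the one changed key, or raise if zero or several keys changed."""
    diff = changed_keys(baseline, candidate)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed key, got {diff}")
    return diff[0]
```

For example, with a baseline of `{"batch_size": 8, "seq_len_bucket": 256, "spmd_profile": "a"}`, a candidate that changes only `seq_len_bucket` passes, while one that also changes `spmd_profile` is rejected.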
Practical rule
If a TPU run slows down without crashing, ask three questions before touching model math:
- Did input shapes change?
- Did the runtime profile or SPMD policy change?
- Did a previously hidden graph branch become active?
Most of the time, one of those answers is yes.
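
The first two questions can be approximated on the host by recording a signature for each step and flagging any signature that first appears after warmup. The sketch below is a proxy only: it sees what the training loop reports, not what XLA actually compiles, and it cannot detect a hidden graph branch on its own. `step_signature` and `SignatureLog` are hypothetical names, and representing the runtime or SPMD policy as a single string is an assumption.

```python
from typing import Dict, Sequence, Tuple

Signature = Tuple[Tuple[Tuple[int, ...], ...], Tuple[str, ...], str]


def step_signature(
    shapes: Sequence[Sequence[int]],
    dtypes: Sequence[str],
    policy: str,
) -> Signature:
    """Build a hashable summary of one step's inputs and runtime policy."""
    return (tuple(tuple(s) for s in shapes), tuple(dtypes), policy)


class SignatureLog:
    """Remember the first step at which each signature was seen."""

    def __init__(self, warmup_steps: int) -> None:
        self.warmup_steps = warmup_steps
        self.first_seen: Dict[Signature, int] = {}

    def record(self, step: int, sig: Signature) -> bool:
        """Return True if `sig` is new and arrives at or after warmup."""
        if sig in self.first_seen:
            return False
        self.first_seen[sig] = step
        return step >= self.warmup_steps
```

A `True` from `record` after warmup is a hint to look at that step's inputs and policy first; a stable signature with continued slowdowns points toward the third question instead.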
Frequently asked questions
Which debug switches belong in the first recompilation repro?
Start with PT_XLA_DEBUG_LEVEL=2 for the compile/execute cause summary, then add XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 only when you need Python stack traces attached to IR and HLO. Keep those flags out of normal timing runs because the PyTorch/XLA docs describe them as debugging aids with performance cost. Once the noisy cause is visible, pair the log with torch_xla.debug.metrics instead of guessing from wall-clock time alone. The local-safe surfaces are TPU compile/runtime control sample and Canonical XLA flag profile.
Which counters prove the run is still recompiling after warmup?
CompileCacheMiss should stop climbing for one stable bucket. If it keeps increasing, read it next to CompileTime, host/device transfer counters, and aten:: fallback counters such as aten::nonzero or aten::_local_scalar_dense. That combination usually means the graph contract is still changing, not merely that the step is slow. The next grep targets are value-dependent ops such as torch.nonzero, torch.unique, boolean indexing, tensor scalar reads, concrete dynamic-dimension queries, and rank-specific SPMD or logging branches. If those appear while input buckets are stable, debug the graph boundary before changing model math.
Can stable input buckets still recompile when SPMD policy changes?
Yes. If CompileCacheMiss keeps climbing while sequence buckets are fixed, compare the SPMD profile and any rank-specific logging or control-flow path before blaming model math. The local handoff is XLA SPMD sharding annotations plus the TPU compile/runtime control sample.
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
- SPMD: The explicit TPU sharding mode where one compiled program carries placement rules instead of rank-local imperative code.
- XLA: The compiler/runtime layer that lowers frontend tensor programs into executable TPU or accelerator graphs, with shape stability and ownership boundaries as the main operational concerns here.
- TPU: Google's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.