Torch/XLA 2.11 expectations vs TPU reality
What MegaCpp expected from the Torch/XLA 2.11 line on TPU, what the shipped stack actually looked like in practice, and how that changed our bringup strategy.

When teams talk about a new Torch/XLA line, they often compress two very different questions into one. The first question is what the version number suggests: newer runtime, newer PJRT plumbing, maybe better cache behavior, maybe fewer bringup patches. The second question is the one operators actually have to answer: what exact wheel lineage, libtpu build, cache path, and compile policy can survive a real TPU training lane?
MegaCpp ended up learning that the second question mattered much more. The public repo history shows an initial expectation that the newer Torch/XLA 2.11-class stack might simplify TPU work. What actually shipped into our working TPU lane was more complicated: custom builds, mixed-stack experimentation, partial wins, and a bringup strategy that had to stay stricter than the version number suggested.
The narrow, research-grounded correction is simpler than a new historical rewrite. The checked-in TPU backend ownership note, TPU runtime probe sample, and XLA compile/runtime controls sample all point back to the same rule: a version label is weaker evidence than one validated bundle that has actually been probed, launched, and restarted.
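A minimal sketch of that rule, assuming a hypothetical `BundleReceipt` record (the class, its field names, and the flag values are illustrative and not taken from the repo). The point it shows is that the version strings alone never flip the bundle to validated; only the three receipts do.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BundleReceipt:
    """One candidate TPU stack plus the evidence collected for it (hypothetical schema)."""
    torch: str
    torch_xla: str
    libtpu: str
    probed: bool = False      # a runtime probe saw the expected devices
    launched: bool = False    # a real training launch reached its first step
    restarted: bool = False   # a restart of the same job came back up

    def is_validated(self) -> bool:
        # Version strings identify the bundle; only the three receipts validate it.
        return self.probed and self.launched and self.restarted


# Version strings are the ones quoted in this article; the flags are illustrative.
label_only = BundleReceipt("2.9.0a0+git21fec65", "2.9.0+gitc04e61c", "0.0.36")
proven = replace(label_only, probed=True, launched=True, restarted=True)

print(label_only.is_validated())
print(proven.is_validated())
```

Keeping the receipts on the same record as the versions means a log line can never carry a version label without also showing how much of the bundle was exercised.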
What we expected from the 2.11 line
The expectation was reasonable. A newer Torch/XLA line appeared to offer three things at once:
- a newer OpenXLA and PJRT contract
- a chance to use newer libtpu builds
- fewer local workarounds around TPU compile and startup behavior
That hope shows up indirectly in the repo history. There is a dedicated March 2026 fix titled "Guard FSDP nightly mesh fix on torch 2.11", which already tells you the team was actively validating 2.11-specific behavior rather than treating TPU as frozen on an older stack. The changelog also records a full TPU deployment wave where all eight workers on one TPU experiment were moved to torch 2.11.0a0+git7afdbae, torch_xla 2.11.0+gitc04e61c, libtpu 0.0.37, and jax 0.9.0.
On paper, that looks like the kind of stack refresh that should collapse some bringup complexity. A newer framework line plus a newer TPU runtime is exactly the combination people usually hope will reduce patch debt.
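Version strings like the ones above carry two separate pieces of information: the release line and a local build tag. A small sketch of splitting them apart, assuming the `+git<hash>` suffix convention seen in these strings (the `Lineage` type and `parse_lineage` helper are hypothetical names introduced here for illustration):

```python
import re
from typing import NamedTuple, Optional


class Lineage(NamedTuple):
    release_line: str           # e.g. "2.11"
    git_hash: Optional[str]     # local build tag; None when the string has no +git suffix


# major.minor.patch, an optional pre-release tail such as "a0", then an optional +git<hash>.
_VERSION = re.compile(r"^(\d+)\.(\d+)\.\d+[^+]*(?:\+git([0-9a-f]+))?")


def parse_lineage(version: str) -> Lineage:
    match = _VERSION.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, git_hash = match.groups()
    return Lineage(f"{major}.{minor}", git_hash)


custom = parse_lineage("2.11.0+gitc04e61c")
stock_style = parse_lineage("2.11.0")

# Same release line, different lineage: the label alone cannot tell these apart.
print(custom.release_line == stock_style.release_line)
print(custom.git_hash, stock_style.git_hash)
```

Recording the hash next to the release line is what lets a bringup log distinguish a custom wheel from a stock wheel that happens to share the same number.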
What Google actually shipped into the practical path
The operator docs and install scripts show a different reality. MegaCpp's checked-in TPU install path does not say "use the stock upstream 2.11 wheels and move on." It says the repo-preferred path is a custom wheel stack. The mainline TPU installer pins:
- custom torch 2.9.0a0+git21fec65
- custom torch_xla 2.9.0+gitc04e61c
- Python 3.13
- jax 0.9.0
- libtpu 0.0.24 minimum, with 0.0.36 as the preferred tested receipt
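The libtpu pin has two levels, a floor and a preferred receipt, and that distinction is easy to lose in a plain string comparison. A minimal sketch of the check, assuming purely numeric dotted versions (the helper names and the three result labels are hypothetical):

```python
LIBTPU_MINIMUM = "0.0.24"
LIBTPU_PREFERRED = "0.0.36"


def version_tuple(version: str) -> tuple:
    # Compare numerically: as strings, "0.0.9" would sort after "0.0.24".
    return tuple(int(part) for part in version.split("."))


def classify_libtpu(installed: str) -> str:
    have = version_tuple(installed)
    if have < version_tuple(LIBTPU_MINIMUM):
        return "reject"      # below the installer's floor
    if have == version_tuple(LIBTPU_PREFERRED):
        return "preferred"   # matches the preferred tested receipt
    return "allowed"         # at or above the floor, but not the preferred receipt


for candidate in ("0.0.9", "0.0.24", "0.0.36", "0.0.37"):
    print(candidate, classify_libtpu(candidate))
```

A newer build such as 0.0.37 lands in "allowed" rather than "preferred" on purpose: being newer than the tested receipt is not the same as being the tested receipt.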
The repo README says this explicitly: the validated TPU lane uses the custom 2.9-based stack, and the custom torch_xla line is required for the modern libtpu and PJRT contract because stock PyPI torch_xla==2.9.0 stayed constrained to an older libtpu lane.
That is the first important correction to the simple 2.11 story. The practical MegaCpp TPU path did not converge on "2.11 solved it." It converged on "we needed custom wheels to get the TPU runtime contract we actually wanted, and the stable operator story remained a custom stack."
The 2.11 experiments were real, but they did not become the simple default
The public changelog makes clear that 2.11 was not imaginary or incidental. It records a special validation lane with a newer OpenXLA snapshot, newer PJRT framework version, and custom torch_xla 2.11 wheels. It also records why that newer line did not instantly become the whole TPU story.
One operator note is especially revealing: the fleet became intentionally mixed-stack. Some TPU machines stayed on the custom 2.9 line, while a validation lane ran a newer 2.11 build. The changelog warns that comparisons had to remain explicit because the mixed stack could not be treated as one uniform environment. That is the opposite of a clean version-bump narrative.
Another note is even more operationally important: the validation lane's custom HLO cache patch still did not remove the expensive libtpu compilation step. The repo notes that cache files were written, but restart still re-entered libtpu Compile(). In other words, part of the expected 2.11 story was "newer stack, maybe persistent cache finally pays off." The observed story was "some cache plumbing improved, but the operator-facing compile pain remained."
The checked-in cache-control surfaces make the proof bar narrower than "a directory received files." In practice, persistent cache only counts when the same pre-import runtime policy delivers a measurably faster warm restart. That is why the XLA flag profile sample and the XLA compile/runtime controls sample are useful companions here: they keep PJRT_DEVICE, cache location, and startup policy explicit enough to compare restart behavior instead of inferring it from files on disk.
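That proof bar can be written down as a small check. The sketch below is an assumption-laden illustration: `StartupRun`, `warm_restart_verdict`, the 2x threshold, the cache path, and every timing are hypothetical and are not measurements from the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartupRun:
    policy: dict                   # pre-import runtime policy the process started with
    seconds_to_first_step: float   # wall time from launch to the first training step
    cache_files_at_start: int      # files already present in the cache directory


def warm_restart_verdict(cold: StartupRun, warm: StartupRun,
                         min_speedup: float = 2.0) -> str:
    if cold.policy != warm.policy:
        return "not comparable"    # a changed policy invalidates the comparison
    if warm.cache_files_at_start == 0:
        return "no cache to reuse"
    speedup = cold.seconds_to_first_step / warm.seconds_to_first_step
    if speedup >= min_speedup:
        return "cache pays off"
    return "files on disk, no measurable win"


policy = {"PJRT_DEVICE": "TPU", "XLA_COMPILATION_CACHE_DIR": "/tmp/xla_cache"}
cold = StartupRun(policy, seconds_to_first_step=600.0, cache_files_at_start=0)
warm = StartupRun(policy, seconds_to_first_step=580.0, cache_files_at_start=12)

print(warm_restart_verdict(cold, warm))
```

With these illustrative numbers the cache directory is populated, yet the verdict is still "files on disk, no measurable win", which is the situation the changelog describes.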
There is a second correction here. "Google shipped a newer TPU software line" and "the practical TPU bringup got much easier" are not the same statement.
Why the compile-cache story mattered so much
This detail shaped bringup more than any headline version number. MegaCpp's TPU notes repeatedly narrow the real runtime contract to a few stubborn truths:
- TPU launch must use the PJRT TPU runtime path explicitly
- SPMD must be enabled early and treated as part of startup correctness
- XLA_NO_SPECIAL_SCALARS=1 remains part of the core run contract
- the compilation cache needs an early, explicit local path such as XLA_COMPILATION_CACHE_DIR
- model-level torch.compile(...) is not the TPU strategy; the practical path is torch_xla.compile() around forward and backward micro-steps, with separate optimizer-step handling
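The environment half of that contract can be sketched as a pre-import guard. This is an illustration under assumptions: `apply_tpu_startup_env` is a hypothetical helper, the cache path is made up, and the variable names are the ones quoted in the list above. SPMD activation is deliberately left to the training code (PyTorch/XLA exposes `torch_xla.runtime.use_spmd()` for that) rather than to an environment variable.

```python
import os
import sys


def apply_tpu_startup_env(cache_dir: str, environ=None, modules=None) -> dict:
    """Set the runtime policy before torch_xla is imported (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    modules = sys.modules if modules is None else modules
    if "torch_xla" in modules:
        # Setting the policy after import is exactly the ordering this guard prevents.
        raise RuntimeError("torch_xla is already imported; set the runtime policy first")
    policy = {
        "PJRT_DEVICE": "TPU",                     # explicit PJRT TPU runtime path
        "XLA_NO_SPECIAL_SCALARS": "1",            # part of the core run contract
        "XLA_COMPILATION_CACHE_DIR": cache_dir,   # early, explicit local cache path
    }
    environ.update(policy)
    return policy


# Demonstrated against scratch mappings so the sketch does not touch the real process.
scratch_env, scratch_modules = {}, {}
applied = apply_tpu_startup_env("/tmp/xla_cache", scratch_env, scratch_modules)
print(sorted(applied))
```

Returning the applied policy makes it easy to log alongside a run, so two restarts can later be checked for having started from the same settings.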
That compile-policy split is not only local habit. Current PyTorch/XLA docs still recommend torch_xla.compile() for training while positioning torch.compile(..., backend="openxla") as the cleaner inference-side surface. For this lane the operational takeaway stays narrow: if a newer stack still needs cache and restart debugging, keep the training compile boundary explicit enough that warm-restart receipts and compile-policy changes remain comparable run to run.
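One way to keep that boundary explicit is to make the compile wrapper an argument. The sketch below is structural only: `build_train_step`, `identity_compile`, and the stand-in callables are hypothetical, and the default wrapper does nothing so the example runs without a TPU. On a TPU host the assumption is that `torch_xla.compile` would be passed as `compile_fn`.

```python
from typing import Callable, Iterable, List


def identity_compile(fn: Callable) -> Callable:
    # Stand-in wrapper so the sketch runs anywhere; it compiles nothing.
    return fn


def build_train_step(micro_step: Callable, optimizer_step: Callable,
                     compile_fn: Callable = identity_compile) -> Callable:
    compiled_micro_step = compile_fn(micro_step)   # the one explicit compiled region

    def train_step(micro_batches: Iterable) -> List:
        losses = [compiled_micro_step(batch) for batch in micro_batches]
        optimizer_step()                           # handled outside the compiled region
        return losses

    return train_step


# Stand-ins that only record call order; real code would run forward and backward here.
events = []


def fake_micro_step(batch):
    events.append(("micro", batch))
    return batch * 0.5


def fake_optimizer_step():
    events.append(("optimizer",))


def recording_compile(fn):
    events.append(("compile", fn.__name__))
    return fn


step = build_train_step(fake_micro_step, fake_optimizer_step, recording_compile)
losses = step([2, 4])
print(events)
```

Because the wrapper is applied once when the step is built, a change of compile policy shows up as a single, reviewable argument rather than as scattered decorators.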
That is a much more hands-on contract than the optimistic reading of a newer framework line. The TPU setup guide also warns against stale habits such as exporting XLA_USE_BF16=1 for the current Pallas attention path or assuming that XLA_USE_SPMD=1 is the real activation switch. The working lane is defined by current repo code and startup order, not by folklore about what XLA used to want.
This is why the cache story hurt so much. If persistent cache had fully delivered on restart, bringup would have become much less fragile operationally. Instead, the public notes show a more limited outcome: cache write paths existed, but restart behavior still did not eliminate the expensive compile wall in the way operators hoped.
How this changed our TPU bringup strategy
The version story stopped being the organizing principle. Bringup moved to a stricter playbook.
1. We stopped treating upstream version labels as the main unit of truth
The operator docs now privilege the checked-in install script, current runtime notes, and training code over generic assumptions about a release line. That is the right response when "2.11" can mean several materially different combinations of OpenXLA snapshot, libtpu version, wheel provenance, and startup flags.
2. We separated "experimental validation lane" from "operator-preferred lane"
This is one of the cleanest decisions in the repo. The public docs distinguish the validated day-to-day TPU stack from older or alternative reference paths. That separation matters because it stops the team from over-reading one successful experiment as a global convergence point.
3. We treated bringup as a runtime-contract problem, not a package-upgrade problem
The current TPU documentation is almost entirely about contract details: compile mode, retry ladder, flag profiles, cache location, sharding shape, and startup order. That is exactly what you would expect after learning that a newer framework line does not automatically erase TPU-specific operational constraints.
4. We kept mixed-stack evidence explicit instead of flattening it
The changelog explicitly warns that 2.9 and 2.11 comparisons must stay run-by-run. That is a healthy discipline. Once one lane has a newer OpenXLA snapshot and another does not, a good bringup log should not pretend they are directly interchangeable.
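A sketch of what run-by-run discipline can look like in a results script. Everything here is illustrative: the record layout, the stack labels, and the step times are made up for the example and are not measurements from either lane.

```python
from statistics import mean

# Hypothetical run records; the step times are invented for illustration.
runs = [
    {"run": "a", "stack": "2.9-custom", "step_ms": 410.0},
    {"run": "b", "stack": "2.9-custom", "step_ms": 430.0},
    {"run": "c", "stack": "2.11-validation", "step_ms": 400.0},
]


def pooled_mean(records: list) -> float:
    stacks = {r["stack"] for r in records}
    if len(stacks) != 1:
        # Refuse to flatten a mixed fleet into one number.
        raise ValueError(f"refusing to pool across stacks: {sorted(stacks)}")
    return mean(r["step_ms"] for r in records)


def compare(a: dict, b: dict) -> dict:
    same_stack = a["stack"] == b["stack"]
    kind = "same-stack" if same_stack else f"cross-stack ({a['stack']} vs {b['stack']})"
    # The stack labels travel with the delta, so the comparison stays explicit.
    return {"runs": (a["run"], b["run"]), "delta_ms": b["step_ms"] - a["step_ms"], "kind": kind}


print(pooled_mean(runs[:2]))
print(compare(runs[0], runs[2]))
```

Cross-stack deltas are still allowed, but they are always labeled as such, while pooled statistics across stacks fail loudly instead of producing a plausible-looking average.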
The deeper lesson
The main lesson is not that Torch/XLA 2.11 was bad. The lesson is that TPU bringup quality depends less on the abstract release name than on the exact runtime contract that the team can reproduce.
For MegaCpp, the public record shows that a newer 2.11 line was useful as an experimental and validation surface. It exposed version-specific fixes, newer OpenXLA behavior, and a place to test cache-related ideas. But the practical operator lane still centered on a custom, pinned stack and on repo-specific startup discipline. In that sense, the 2.11 expectation was "maybe the platform line itself becomes the simplifier." The shipped reality was "the simplifier is still a checked-in install path plus a narrow, reproducible runtime contract."
That is a more conservative story, but it is also the more useful one. It explains why TPU bringup eventually became less about chasing a magic wheel version and more about preserving a stable launch recipe, recording exact receipts, and separating hoped-for platform behavior from observed platform behavior.
What this means for future TPU stack upgrades
The best way to read a future Torch/XLA upgrade is now obvious.
- Ask which exact wheel lineage is operator-blessed.
- Ask which libtpu line is actually validated with it.
- Ask whether restart really reuses compilation work or only saves tracing overhead.
- Ask whether the sharding and compile policy stayed the same.
- Ask whether the new line replaced the old lane or merely created a second experimental lane.
Those questions are better than "are we on 2.11 yet?" because they are the ones that determine whether TPU bringup gets easier in practice.
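The five questions can be kept as a record that distinguishes "answered no" from "not answered yet". A minimal sketch, assuming a hypothetical `UpgradeReview` type whose field names simply paraphrase the list above:

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class UpgradeReview:
    """Answers to the upgrade questions; None means nobody has checked yet (hypothetical)."""
    wheel_lineage_blessed: Optional[bool] = None
    libtpu_validated_with_it: Optional[bool] = None
    restart_reuses_compilation: Optional[bool] = None
    sharding_and_compile_policy_unchanged: Optional[bool] = None
    replaces_old_lane: Optional[bool] = None

    def unanswered(self) -> List[str]:
        # Report the questions still open, in the order they were asked.
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


review = UpgradeReview(wheel_lineage_blessed=True, restart_reuses_compilation=False)
print(review.unanswered())
```

An explicit `False` is a finding worth recording, for example a restart that does not reuse compilation work, whereas `None` only marks a question that still has to be asked.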
Frequently asked questions
What should I verify before trusting a newer TPU/XLA lane?
Why keep torch_xla.compile() at the step boundary instead of defaulting to model-level torch.compile()?
torch.compile(..., backend="openxla") is the cleaner inference story, but the training docs still recommend torch_xla.compile() around a step-shaped function so forward, backward, and optimizer update stay inside one explicit compiled region. For bringup that makes cache and restart receipts easier to compare than a fuzzier whole-model compile boundary.
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
- libtpu: The TPU backend library that pairs with PJRT/XLA and owns device-side execution underneath the frontend.
- PJRT: The TPU runtime interface between frontend code and the backend executor; it is the ownership seam between JAX/Torch-XLA frontends and libtpu.
- XLA: The compiler/runtime layer that lowers frontend tensor programs into executable TPU or accelerator graphs, with shape stability and ownership boundaries as the main operational concerns here.
- Mesh: The named device grid that defines which logical axis maps to which TPU or distributed-device axis before sharding annotations make sense.
- TPU: Google's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.
- Pallas: JAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.