Grounded engineering note from the MegaCpp stack

Published April 19, 2026 • 7 min read • David Gornshtein

libtpu, PJRT, JAX, and ownership boundaries

Why a shared TPU substrate still leaves distinct ownership lines across PJRT, torch_xla, JAX, and libtpu, and where the main failure boundaries appear in practice.

The easy story is that TPU software is one stack, so mixed tooling should behave like one system. In practice, the boundaries are more distinct: libtpu may be a shared substrate, but frontend ownership, runtime policy, cache policy, and backend verification still need to stay separate. At first touch, the split is simple: PJRT is the framework-facing device/runtime API, libtpu is the TPU-specific implementation beneath it, and PJRT_DEVICE=TPU is launcher intent rather than runtime proof. If you need the sharding side of the same boundary, pair this with XLA SPMD sharding annotations; if you need the mixed-frontend lane, pair it with libtpu and JAX interaction and Torch XLA and PJRT reality.

That separation was not aesthetic. It was forced by failure boundaries that kept reappearing in code paths, documentation, and runtime outputs.

The quickest checked-in map for the same split is TPU backend ownership overview, TPU runtime probe sample, XLA compile/runtime controls sample, XLA backend fallback sample, JAX bridge call surface, and Pallas bridge receipt sample. They keep runtime proof, pre-import policy, and backend handoff separate in the same reader-first way this article does.

The stack is shared, but the owners are not

The layering is straightforward:

MegaCpp PyTorch code
  -> torch_xla
  -> PJRT runtime
  -> libtpu
  -> TPU hardware

One practical complication is that JAX is imported directly for some TPU kernel paths, while the main model path remains PyTorch/XLA-owned. In other words, one process can contain more than one frontend authority even when the lower TPU runtime substrate is shared.

Device ownership is stricter than the layered diagram makes it look. One TPU chip is not a shared process slot. After one process successfully attaches through PJRT and libtpu, a second process trying to claim the same chip usually fails with an "already in use by another process" signature rather than joining the lane. That is a runtime-ownership boundary before it is a framework-compatibility boundary.

The launch surfaces make this explicit. The TPU attention lane can stay on native trace_pallas or widen into a narrowly scoped call_jax bridge when needed, while a separate fallback lane stays pure PyTorch. That is not one interchangeable software stack. It is one shared substrate with multiple frontend-owned entry points, which is why Pallas kernels on TPU and Pallas FlashAttention with logit softcap on TPU v6e sit next to this article instead of inside a generic runtime note.

Why the ownership split became necessary

These TPU failures were usually not generic "TPU is broken" failures. They were boundary failures where one layer made an assumption that another layer did not honor.

One recurring issue captured the risk directly: TPU autodetect could still force XLA on a broken or non-runtime host, and the right fix was to fail fast on a real runtime probe instead of trusting fallback signals. That is an ownership lesson. Device detection belongs to runtime truth, not to optimistic frontend inference.

One runtime record shows the same lesson from a different angle. It records torch, torch_xla, and jax versions together, then logs which backend actually ran: xla_flash_pallas, xla_flash_pallas_softcap, or xla_splash_via_trace_pallas. That is the right model for a mixed TPU environment. The run has to record which frontend asked for work, which runtime contract was active, and which backend actually answered. Profiler and receipts is useful here because it treats those backend names as receipt fields instead of as isolated strings.
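A minimal sketch of such a record, assuming a plain dataclass is enough. The `BackendReceipt` type and its field names are hypothetical; only the three backend names come from the record described above.

```python
from dataclasses import dataclass, asdict
import json

# Backend names taken from the runtime record described in the text.
KNOWN_BACKENDS = frozenset({
    "xla_flash_pallas",
    "xla_flash_pallas_softcap",
    "xla_splash_via_trace_pallas",
})


@dataclass(frozen=True)
class BackendReceipt:
    """Hypothetical receipt: who asked, under which contract, who answered."""
    torch_version: str
    torch_xla_version: str
    jax_version: str
    requested_by: str      # frontend that asked for work, e.g. "torch_xla"
    runtime_contract: str  # e.g. the PJRT_DEVICE value chosen at launch
    backend: str           # backend that actually executed

    def __post_init__(self):
        # Refuse free-form strings so "Pallas ran" cannot pass as a receipt.
        if self.backend not in KNOWN_BACKENDS:
            raise ValueError(f"unknown backend name: {self.backend!r}")

    def to_json(self) -> str:
        # Sorted keys keep receipts diffable across runs.
        return json.dumps(asdict(self), sort_keys=True)
```

Validating the backend name at construction time is one way to keep the receipt a closed set of fields rather than a log line that drifts between runs.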

The failure boundaries that mattered in practice

1. Runtime intent is not runtime proof

PJRT_DEVICE=TPU expresses intent. It does not prove that the runtime path is healthy. Library presence, host startup order, and fallback logic can still push execution toward CPU or fail before a real TPU program exists. That failure belongs at the runtime-ownership layer.

The recognizable signatures are asymmetric. JAX often falls back to CPU when the runtime is missing or unusable. A strict PyTorch/XLA lane is more likely to abort or emit "libtpu.so already in use by another process" when the TPU cannot satisfy the declared contract. Those are different troubleshooting branches, not one generic TPU failure.
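The gap between intent and proof can be sketched as a fail-fast check. This is an illustration under stated assumptions, not the stack's actual probe: `require_runtime_proof`, `RuntimeProofError`, and the `probe` callable are hypothetical names, and the probe is injected so the check itself stays independent of either frontend.

```python
import os
from typing import Callable, Mapping, Sequence


class RuntimeProofError(RuntimeError):
    """Raised when launcher intent and the probed runtime disagree."""


def require_runtime_proof(
    probe: Callable[[], Sequence[str]],
    env: Mapping[str, str] = os.environ,
) -> Sequence[str]:
    """Fail fast unless the probe reports devices matching PJRT_DEVICE.

    `probe` is a hypothetical callable returning the platform names of the
    devices the runtime actually exposes, for example ["tpu", "tpu"].
    """
    intent = env.get("PJRT_DEVICE", "").lower()
    if not intent:
        raise RuntimeProofError("PJRT_DEVICE is unset: no declared intent")
    try:
        platforms = [p.lower() for p in probe()]
    except Exception as exc:  # a probe that cannot run is also a verdict
        raise RuntimeProofError(f"runtime probe failed: {exc}") from exc
    if not platforms:
        raise RuntimeProofError("runtime probe returned no devices")
    if any(p != intent for p in platforms):
        # Silent-fallback case: intent says TPU, the runtime answers cpu.
        raise RuntimeProofError(
            f"intent={intent!r} but runtime reported {sorted(set(platforms))}"
        )
    return platforms
```

On a JAX lane the probe could be as small as `lambda: [d.platform for d in jax.devices()]`; a PyTorch/XLA lane would need its own equivalent query. Either way the check runs before any model code, so a CPU fallback surfaces as a launch failure instead of a slow training run.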

2. JAX-backed kernel paths have different owners than pure torch_xla paths

The attention lane makes the split explicit. A native trace_pallas path stays inside the PyTorch/XLA-owned lane, while call_jax widens the handoff on purpose. Those are different ownership domains with different failure modes, even though they meet on the same TPU runtime underneath.

3. Cache ownership is separate from frontend ownership

The training stack carries both torch_xla.runtime.initialize_cache(...) and JAX compilation-cache configuration. In practice, cache configuration needs to happen before computation starts. That is another boundary line: one shared accelerator does not imply one shared cache contract. PyTorch/XLA and JAX each carry their own cache-entry policy even when both eventually target the same TPU backend. The operational lesson is close to graph recompilation hell: if cache setup is ambiguous, a runtime problem quickly masquerades as a model problem.
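One way to make "configure before compute" enforceable rather than conventional is a small launch-time guard. The sketch below is hypothetical (`CachePolicy` and its methods are not part of torch_xla or JAX); it only decides paths and ordering, and the returned path is what a launcher would then hand to each frontend's own cache setup.

```python
from pathlib import Path
from typing import Dict


class CachePolicy:
    """Hypothetical launch-time holder for per-frontend cache roots."""

    def __init__(self, base_dir: str):
        self._base = Path(base_dir)
        self._roots: Dict[str, Path] = {}
        self._compute_started = False

    def configure(self, frontend: str) -> Path:
        # Cache setup after first compute is the ambiguous state to avoid.
        if self._compute_started:
            raise RuntimeError(
                f"cache for {frontend!r} configured after compute started"
            )
        if frontend in self._roots:
            raise RuntimeError(f"cache for {frontend!r} configured twice")
        root = self._base / frontend  # one root per frontend, never shared
        self._roots[frontend] = root
        return root

    def mark_compute_started(self) -> None:
        # Call once, immediately before the first compiled step.
        self._compute_started = True

    def roots(self) -> Dict[str, Path]:
        return dict(self._roots)
```

A launcher would call `configure("torch_xla")` and `configure("jax")` first, pass each path to the matching frontend, and only then call `mark_compute_started()`. Any later attempt to reconfigure raises instead of silently changing cache behavior mid-run.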

The checked-in counterparts for that startup story are XLA compile/runtime controls sample, TPU runtime probe sample, and call_jax bridge runtime sample: pre-compute cache setup, runtime proof, and the narrower bridge lane are all separated on purpose.

4. Some failures terminate inside libtpu-owned behavior, not OpenXLA-owned behavior

One example around the 2 GB executable-proto limit is especially revealing. The split-proto fix landed for GPU in OpenXLA, but the TPU cache path still went through closed-source libtpu and therefore did not inherit the same fix. That is the clearest ownership boundary in the stack: even with PJRT and XLA above it, some TPU runtime behavior is still effectively owned by libtpu.

ABI drift creates the same kind of trap from the other side. When the frontend and libtpu disagree about what belongs inside execute options, the failure appears at the PJRT boundary long before model code has a chance to fail in a readable way. In practice, torch_xla, JAX, and libtpu updates have to be treated as one compatibility bundle, not as independent package refreshes.

Which responsibilities remain at the application layer

The code and runtime behavior converge on four ownership buckets.

Frontend ownership

The code keeps a PyTorch/XLA-owned main path and treats JAX-backed kernels as explicit opt-in surfaces. That prevents a JAX-dependent feature from masquerading as a generic TPU capability.

Runtime-policy ownership

The launch layer owns PJRT_DEVICE, XLA flag profiles, and startup cache configuration. Those settings are treated as launch policy, not shell noise, because import order and early runtime state change behavior materially.
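Because import order is part of the policy, a launcher can refuse to apply settings once a frontend is already loaded. The following is a minimal sketch, not the project's own import guard: `apply_launch_policy` is a hypothetical helper, and the guarded module names are an assumption about which imports matter.

```python
import os
import sys
from typing import Mapping, MutableMapping, Tuple


def apply_launch_policy(
    policy: Mapping[str, str],
    guarded_modules: Tuple[str, ...] = ("jax", "torch_xla"),
    env: MutableMapping[str, str] = os.environ,
    loaded: Mapping[str, object] = sys.modules,
) -> None:
    """Write launch policy into the environment before frontends import.

    Raises if any guarded frontend is already imported, because settings
    applied after import may not take effect.
    """
    too_late = [name for name in guarded_modules if name in loaded]
    if too_late:
        raise RuntimeError(
            f"launch policy applied after importing: {', '.join(too_late)}"
        )
    for key, value in policy.items():
        env[key] = value
```

Called at the very top of an entry point, for example `apply_launch_policy({"PJRT_DEVICE": "TPU"})`, this turns a subtle ordering bug into an immediate error. The `env` and `loaded` parameters exist only so the check can be exercised without touching the real process state.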

Backend verification

The provenance record exists because statements like "Pallas ran" or "Splash ran" are not specific enough. The run has to record which backend actually executed.

Failure-boundary ownership

When a fix belongs in launch policy, kernel routing, cache setup, or backend substitution, it is best handled there instead of pretending the entire problem is a generic TPU-runtime fault.

What "latest libtpu" really means

There is one more point that public TPU discussions often flatten away: "use the latest libtpu" is not the same statement as "all frontend/runtime combinations are compatible with the latest libtpu."

The useful public-safe version of the lesson is simpler and stronger: validated TPU lanes are bundle-specific. A newer libtpu build can be part of one proven frontend/runtime bundle and still sit outside another bundle's tested boundary. That is exactly the kind of ownership split that gets lost when people describe TPU software as one blob.

So the practical question is not "what is the newest libtpu package?" The practical question is "which torch_xla and JAX frontend pairings have been proven against which libtpu and PJRT boundary?"
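That question can be asked mechanically. The sketch below treats a validated lane as an exact triple and rejects anything else; the `Bundle` type, `is_validated`, and the version labels are all hypothetical placeholders, not real release numbers.

```python
from typing import Iterable, Mapping, NamedTuple


class Bundle(NamedTuple):
    torch_xla: str
    jax: str
    libtpu: str


def is_validated(installed: Mapping[str, str],
                 validated: Iterable[Bundle]) -> bool:
    """True only if the exact installed triple appears in the proven set."""
    current = Bundle(installed["torch_xla"], installed["jax"],
                     installed["libtpu"])
    return current in set(validated)


# Placeholder labels standing in for whatever pins a team has proven.
VALIDATED = [
    Bundle(torch_xla="txla-A", jax="jax-A", libtpu="libtpu-A"),
    Bundle(torch_xla="txla-B", jax="jax-B", libtpu="libtpu-B"),
]
```

With this table, pairing the "A" frontends with `libtpu-B` is rejected even though `libtpu-B` is itself part of a proven bundle, which is the point: membership is decided per triple, not per package.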

Practical debugging order

If a TPU setup involves libtpu, PJRT, torch_xla, and JAX in one environment, a practical debugging order is:

1. Which frontend owned the failing path: pure PyTorch/XLA, or a JAX-backed kernel surface?
2. Is the TPU already owned by another process?
3. Which runtime policy was selected before imports: PJRT_DEVICE, XLA flags, and cache path?
4. Which backend actually ran according to the execution record?
5. Does the failure terminate in application routing, in the frontend bridge, or in a libtpu-owned runtime or cache boundary?

That is the useful meaning of ownership here. The substrate is shared. The accountability is not.

In practice, keep troubleshooting inside the layer that owns the failure: already in use stays at device ownership, CPU fallback stays at runtime proof and backend selection, and execute-options crashes stay at version and ABI bundle compatibility.
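The mapping in the previous paragraph can be sketched as a small triage function. Only the "already in use by another process" marker is quoted from this article; the other two marker strings and the `OwnerLayer` and `owning_layer` names are illustrative assumptions that a real lane would replace with its own observed log text.

```python
from enum import Enum


class OwnerLayer(Enum):
    DEVICE_OWNERSHIP = "device ownership"
    RUNTIME_PROOF = "runtime proof and backend selection"
    ABI_BUNDLE = "version and ABI bundle compatibility"
    UNKNOWN = "unclassified"


# First marker is quoted from the text; the rest are placeholders.
_SIGNATURES = (
    ("already in use by another process", OwnerLayer.DEVICE_OWNERSHIP),
    ("falling back to cpu", OwnerLayer.RUNTIME_PROOF),
    ("execute options", OwnerLayer.ABI_BUNDLE),
)


def owning_layer(log_text: str) -> OwnerLayer:
    """Map a failure log to the layer that should own the investigation."""
    lowered = log_text.lower()
    for marker, layer in _SIGNATURES:
        if marker in lowered:
            return layer
    return OwnerLayer.UNKNOWN
```

Returning `UNKNOWN` instead of guessing keeps the triage honest: an unrecognized failure goes back to the debugging order above rather than being filed under the nearest familiar layer.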

FAQ

Frequently asked questions

Does PJRT_DEVICE=TPU prove the TPU lane is healthy?

No. It expresses intent, not proof. The runtime still needs to pass a real probe, and the execution record still needs to confirm which backend actually ran. If the logs say CPU fallback or already in use, stay at the runtime boundary instead of assuming the model path ever reached TPU execution.

What kind of PJRT or libtpu mismatch is this article warning about?

The practical seam is execute-options ABI drift. If the frontend and libtpu stop agreeing about which fields PJRT carries at execute time, failures surface before any model code runs. A useful reader-safe example is newer execute-options layouts carrying fields such as struct_size or non-donatable-input metadata that older runtimes may not expect. That is why this article treats TPU release updates as a validated bundle question, not as a piecemeal package-refresh question.

Can PyTorch/XLA and JAX treat persistent cache as one shared bucket?

Not safely by default. The ownership seam is the same one this article keeps drawing elsewhere: cache policy has to be chosen before imports and before first compute, and the PyTorch/XLA side should not silently inherit whatever JAX happened to do first. On shared mounts, the safer pattern is separate writable roots or one writer warming a shared cache while the other workers stay read-only. If every process and every frontend writes into the same path, the symptom can look like slow PJRT startup or random recompiles even though the real bug is cache ownership. Torch XLA and PJRT reality, Compile time vs runtime tradeoffs, and XLA compile/runtime controls sample are the shortest local companions for that debugging branch.
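The single-writer pattern can be sketched as a pure function of rank and frontend. `CacheRole` and `cache_role` are hypothetical names, and the idea that rank 0 is the writer is an assumption for illustration.

```python
from pathlib import Path
from typing import NamedTuple


class CacheRole(NamedTuple):
    path: Path
    readonly: bool


def cache_role(shared_root: str, frontend: str, rank: int,
               writer_rank: int = 0) -> CacheRole:
    """One writer warms each frontend's cache; other ranks only read it."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    # Each frontend gets its own subdirectory under the shared mount.
    return CacheRole(path=Path(shared_root) / frontend,
                     readonly=(rank != writer_rank))
```

On the PyTorch/XLA side the two fields line up with the path and readonly arguments of `torch_xla.runtime.initialize_cache`; the JAX side would apply its own cache setting to its own subdirectory. Deciding the role from rank alone means every worker reaches the same answer without coordination.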

Why does the JAX bridge need import-order discipline?

Because call_jax is not just another local helper. It crosses from a PyTorch/XLA-owned lane into a JAX/Pallas-owned kernel surface, so import order becomes part of runtime ownership. Keep PyTorch/XLA as the main frontend, run the Pallas/JAX import guard before importing JAX when that bridge is active, and keep a backend receipt for the handoff. The checked-in reader-safe surfaces are Torch XLA and PJRT reality, call_jax bridge runtime sample, and Pallas bridge receipt sample.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

libtpu

The TPU backend library that pairs with PJRT/XLA and owns device-side execution underneath the frontend.

PJRT

The TPU runtime interface between frontend code and the backend executor; it is the ownership seam between JAX/Torch-XLA frontends and libtpu.

JAX

A separate frontend above PJRT/libtpu. In these TPU posts it mainly matters as the owner of NamedSharding, PartitionSpec, and the optional call_jax or Pallas-adjacent bridge lanes.

XLA

The compiler/runtime layer that lowers frontend tensor programs into executable TPU or accelerator graphs, with shape stability and ownership boundaries as the main operational concerns here.

TPU

Google's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.

call_jax

The Torch/XLA bridge lane that hands one narrowed TPU operation to JAX instead of moving the whole program into a JAX-owned frontend path.

trace_pallas

The native PyTorch/XLA custom-kernel lane that traces a Pallas kernel into a payload the XLA side can keep without crossing into a generic JAX bridge call.

mark_sharding(...)

PyTorch/XLA's explicit tensor-placement annotation API: attach a mesh plus partition spec to a tensor so one TPU XLA program lowers with stable owned placement.

NamedSharding

JAX's frontend sharding object that pairs a mesh with a PartitionSpec; similar goal to PyTorch/XLA placement annotations, but not the same frontend API.

PartitionSpec

The tuple-style sharding layout that says which tensor axis maps to which mesh axis and which axes stay replicated.
