Topic Hub

TPU v6e and XLA Runtime Surfaces

A curated reading order for TPU work: bring-up, PJRT and Torch/XLA boundaries, SPMD sharding, and the kernel/runtime traps that made TPU performance non-obvious.

These articles fit together as one lane: first get the host and PJRT story right, then understand the XLA sharding contract, and only then move into attention kernels and breakage matrices.

tpu

xla

pjrt

spmd

pallas

Curated set

Articles in reading order

Why this hub

Best if you want the TPU lane as an engineering system rather than a benchmark screenshot.

Runtime Contract

These explain what owns the TPU runtime and where Torch/XLA starts to matter.

01
April 18, 2026•7 min read•MegaCpp Engineering
TPU v6e Host Bringup
What makes a TPU v6e host bringup credible: pinned setup, environment restore, validation ladders, and durable runtime notes.
The host-side bring-up sequence that makes the rest of the TPU material legible.
TPU
XLA
Bringup
Training
02
April 19, 2026•7 min read•David Gornshtein
libtpu, PJRT, JAX, and ownership boundaries
Why a shared TPU substrate still leaves distinct ownership lines across PJRT, torch_xla, JAX, and libtpu, and where the main failure boundaries appear in practice.
The ownership-boundary explanation for where libtpu, PJRT, and the surrounding stack actually divide responsibility.
libtpu
PJRT
JAX
Torch XLA
03
April 18, 2026•5 min read•David Gornshtein
libtpu and JAX interaction: shared runtime, separate ownership
How PyTorch/XLA, JAX, PJRT, and libtpu relate on TPU without collapsing distinct layers into one vague runtime claim.
The narrower companion piece once the ownership story turns into concrete JAX and runtime behavior.
libtpu
JAX
Torch XLA
PJRT
04
April 18, 2026•5 min read•David Gornshtein
Torch XLA and PJRT reality: what actually matters
A grounded look at the current TPU stack: PJRT contracts, SPMD setup order, reduction semantics, and the failure modes that still shape training and evaluation.
The shortest accurate explanation of what PJRT and Torch/XLA actually decide for the training lane.
Torch XLA
PJRT
XLA
TPU
05
April 19, 2026•8 min read•David Gornshtein
Torch/XLA 2.11 expectations vs TPU reality
What MegaCpp expected from the Torch/XLA 2.11 line on TPU, what the shipped stack actually looked like in practice, and how that changed our bringup strategy.
A useful historical companion when version drift and older TPU runtime assumptions keep showing up in newer debugging threads.
TPU
XLA
PyTorch
Torch XLA
06
April 19, 2026•7 min read•David Gornshtein
Torch 2.12 TPU/XLA breakage matrix: wheel pain, cache misses, and the workarounds that actually mattered
A repo-grounded account of where the TPU/XLA stack broke, which failures needed upstream-facing patches, and which ones were better handled as explicit MegaCpp runtime policy.
The practical compatibility map once versions, wheels, and caches start drifting.
PyTorch
Torch 2.12
XLA
TPU

Compile, Breakage, and Performance Surfaces

These explain why the TPU lane feels non-obvious even after the runtime ownership map is clear.

07
April 18, 2026•9 min read•David Gornshtein
TPU v6e Performance Deep Dive: Real MFU, Sharding Topology, and the Things That Pretended to Help
How a TPU v6e lane actually spent time, why topology and compile amortization mattered so much, and which optimizations did not survive measurement.
The best top-level performance readback once compile amortization and runtime throughput start fighting each other.
TPU
v6e
Performance
MFU
08
April 18, 2026•2 min read•David Gornshtein
Graph recompilation hell: shape drift, graph contracts, and why TPU runs slow down without crashing
A walkthrough of the most common TPU recompilation failure mode: changing shapes, unstable graph contracts, and weak runtime discipline.
The most direct explanation of how the TPU lane can lose performance without obviously changing the model itself.
XLA
TPU
Recompilation
Graph
09
April 18, 2026•10 min read•David Gornshtein
OOM on v6e: Why Memory Pressure Looked Different on TPU
What TPU v6e out-of-memory failures taught us, why the obvious fixes were often wrong, and how the lane eventually measured memory honestly.
The memory-side companion once TPU failures are capacity or compile-shape problems instead of simple launch bugs.
TPU
v6e
OOM
Memory
10
April 18, 2026•9 min read•David Gornshtein
XLA vs CUDA: The Decision Matrix For Our Two Training Stacks
Where we keep one model definition, where the kernels diverge, what determinism we can give on each, how comms differ between NCCL and XLA collectives, and the operator surface that has to stay portable.
A strategic comparison of the TPU/XLA lane against the CUDA path when stack ownership becomes a first-order decision.
XLA
CUDA
TPU
NVIDIA

Sharding and Kernel Follow-Through

Once the runtime is understood, these pieces show how the TPU path stays efficient and observable.

11
April 18, 2026•3 min read•David Gornshtein
XLA SPMD sharding annotations we actually rely on
Why explicit mark_sharding annotations matter on TPU XLA, what should be pinned explicitly, and why propagation is not a substitute for a stable sharding contract.
The concrete sharding annotations the stack actually relies on instead of generic XLA theory.
XLA
SPMD
TPU
Sharding
12
April 18, 2026•2 min read•David Gornshtein
ZeRO-3-shaped sharding on the XLA backend: what transfers from FSDP2 and what does not
How to think about TPU XLA sharding honestly: keep the ZeRO-3 memory goal, drop the assumption that TPU uses the same eager FSDP2 wrapper model as CUDA.
How the TPU backend maps the ZeRO-3-shaped sharding story and where the translation stops.
TPU
XLA
SPMD
FSDP2
13
April 18, 2026•3 min read•David Gornshtein
XLA-safe AdamW and TPU runtime flags on v6e
How to keep optimizer math graph-friendly on TPU, treat runtime flags as explicit launch policy, and recalibrate after stack changes.
The flag and optimizer-control surface that matters once TPU experiments start differing for reasons outside the model graph itself.
TPU
v6e
XLA
AdamW
14
April 18, 2026•3 min read•David Gornshtein
Block-sparse attention on TPU v6e: block masks, MXU-friendly tiles, and stable contracts
How to frame block-sparse attention on TPU honestly: explicit mask contracts, MXU-aligned tile choices, and a preference for stable sparse layouts over data-dependent retracing.
The kernel-side readback once the runtime and sharding contracts are in place.
TPU
XLA
Sparse Attention
Pallas
15
April 18, 2026•2 min read•David Gornshtein
Attention sinks and telemetry on TPU: measure without turning observability into the bug
Why TPU telemetry has to be gated carefully: scalar reads can become host-device syncs, so sink and outlier tracking must be designed as explicit low-cadence instrumentation.
The observability companion once TPU instrumentation starts distorting the very runtime you are trying to understand.
TPU
Telemetry
XLA
Attention

Keep exploring

Adjacent topic hubs

These hubs cover nearby parts of the blog without turning the archive into a giant taxonomy.

GB10 and Blackwell Bring-Up

Modal Training and Benchmark Operations

Mamba3 Architecture, Kernels, and Runtime Tradeoffs

MLA Integration, Dispatch, and Weight Absorption

Evaluation, Benchmarks, and Verifier Loops

Megatron Parallelism and Layout Boundaries

TPU Sparse Attention and Pallas Kernels

H200 Training and Kernel Bring-Up

C++ Data Pipelines and Corpus Packaging

MoE, Routing, and Distributed Model Splits