Topic Hub

TPU v6e and XLA Runtime Surfaces

A curated reading order for TPU work: bring-up, PJRT and Torch/XLA boundaries, SPMD sharding, and the kernel/runtime traps that made TPU performance non-obvious.

These articles fit together as one lane: first get the host and PJRT story right, then understand the XLA sharding contract, and only then move into attention kernels and breakage matrices.

tpu
xla
pjrt
spmd
pallas
Curated set: 15 articles in reading order
Why this hub

Best if you want the TPU lane as an engineering system rather than a benchmark screenshot.

Runtime Contract

These explain what owns the TPU runtime and where Torch/XLA starts to matter.

  1. 01
    April 18, 2026 · 7 min read · MegaCpp Engineering

    TPU v6e Host Bringup

    What makes a TPU v6e host bringup credible: pinned setup, environment restore, validation ladders, and durable runtime notes.

    The host-side bring-up sequence that makes the rest of the TPU material legible.

    TPU
    XLA
    Bringup
    Training
  2. 02
    April 19, 2026 · 7 min read · David Gornshtein

    libtpu, PJRT, JAX, and ownership boundaries

    Why a shared TPU substrate still leaves distinct ownership lines across PJRT, torch_xla, JAX, and libtpu, and where the main failure boundaries appear in practice.

    The ownership-boundary explanation for where libtpu, PJRT, and the surrounding stack actually divide responsibility.

    libtpu
    PJRT
    JAX
    Torch XLA
  3. 04
    April 18, 2026 · 5 min read · David Gornshtein

    Torch XLA and PJRT reality: what actually matters

    A grounded look at the current TPU stack: PJRT contracts, SPMD setup order, reduction semantics, and the failure modes that still shape training and evaluation.

    The shortest accurate explanation of what PJRT and Torch/XLA actually decide for the training lane.

    Torch XLA
    PJRT
    XLA
    TPU
  4. 05
    April 19, 2026 · 8 min read · David Gornshtein

    Torch/XLA 2.11 expectations vs TPU reality

    What MegaCpp expected from the Torch/XLA 2.11 line on TPU, what the shipped stack actually looked like in practice, and how that changed our bringup strategy.

    A useful historical companion when version drift and older TPU runtime assumptions keep showing up in newer debugging threads.

    TPU
    XLA
    PyTorch
    Torch XLA

Compile, Breakage, and Performance Surfaces

These explain why the TPU lane feels non-obvious even after the runtime ownership map is clear.

  1. 09
    April 18, 2026 · 10 min read · David Gornshtein

    OOM on v6e: Why Memory Pressure Looked Different on TPU

    What TPU v6e out-of-memory failures taught us, why the obvious fixes were often wrong, and how the lane eventually measured memory honestly.

    The memory-side companion once TPU failures are capacity or compile-shape problems instead of simple launch bugs.

    TPU
    v6e
    OOM
    Memory
  2. 10
    April 18, 2026 · 9 min read · David Gornshtein

    XLA vs CUDA: The Decision Matrix For Our Two Training Stacks

    Where we keep one model definition, where the kernels diverge, what determinism we can give on each, how comms differ between NCCL and XLA collectives, and the operator surface that has to stay portable.

    A strategic comparison of the TPU/XLA lane against the CUDA path when stack ownership becomes a first-order decision.

    XLA
    CUDA
    TPU
    NVIDIA

Sharding and Kernel Follow-Through

Once the runtime is understood, these pieces show how the TPU path stays efficient and observable.

  1. 11
    April 18, 2026 · 3 min read · David Gornshtein

    XLA SPMD sharding annotations we actually rely on

    Why explicit mark_sharding annotations matter on TPU XLA, what should be pinned explicitly, and why propagation is not a substitute for a stable sharding contract.

    The concrete sharding annotations the stack actually relies on instead of generic XLA theory.

    XLA
    SPMD
    TPU
    Sharding
  2. 13
    April 18, 2026 · 3 min read · David Gornshtein

    XLA-safe AdamW and TPU runtime flags on v6e

    How to keep optimizer math graph-friendly on TPU, treat runtime flags as explicit launch policy, and recalibrate after stack changes.

    The flag and optimizer-control surface that matters once TPU experiments start differing for reasons outside the model graph itself.

    TPU
    v6e
    XLA
    AdamW

Keep exploring

Adjacent topic hubs

These hubs cover nearby parts of the blog without turning the archive into a giant taxonomy.