TPU v6e Host Bringup
What makes a TPU v6e host bring-up credible: pinned setup, environment restore, validation ladders, and durable runtime notes.
The host-side bring-up sequence that makes the rest of the TPU material legible.
A curated reading order for TPU work: bring-up, PJRT and Torch/XLA boundaries, SPMD sharding, and the kernel/runtime traps that made TPU performance non-obvious.
These articles fit together as one lane: first get the host and PJRT story right, then understand the XLA sharding contract, and only then move into attention kernels and breakage matrices.
Best if you want the TPU lane as an engineering system rather than a benchmark screenshot.
These explain what owns the TPU runtime and where Torch/XLA starts to matter.
Why a shared TPU substrate still leaves distinct ownership lines across PJRT, torch_xla, JAX, and libtpu, and where the main failure boundaries appear in practice.
The ownership-boundary explanation for where libtpu, PJRT, and the surrounding stack actually divide responsibility.
How PyTorch/XLA, JAX, PJRT, and libtpu relate on TPU without collapsing distinct layers into one vague runtime claim.
The narrower companion piece once the ownership story turns into concrete JAX and runtime behavior.
A grounded look at the current TPU stack: PJRT contracts, SPMD setup order, reduction semantics, and the failure modes that still shape training and evaluation.
The shortest accurate explanation of what PJRT and Torch/XLA actually decide for the training lane.
What MegaCpp expected from the Torch/XLA 2.11 line on TPU, what the shipped stack actually looked like in practice, and how that changed our bring-up strategy.
A useful historical companion when version drift and older TPU runtime assumptions keep showing up in newer debugging threads.
A repo-grounded account of where the TPU/XLA stack broke, which failures needed upstream-facing patches, and which ones were better handled as explicit MegaCpp runtime policy.
The practical compatibility map once versions, wheels, and caches start drifting.
These explain why the TPU lane feels non-obvious even after the runtime ownership map is clear.
How a TPU v6e lane actually spent time, why topology and compile amortization mattered so much, and which optimizations did not survive measurement.
The best top-level performance readback once compile amortization and runtime throughput start fighting each other.
A walkthrough of the most common TPU recompilation failure mode: changing shapes, unstable graph contracts, and weak runtime discipline.
The most direct explanation of how the TPU lane can lose performance without obviously changing the model itself.
What TPU v6e out-of-memory failures taught us, why the obvious fixes were often wrong, and how the lane eventually measured memory honestly.
The memory-side companion once TPU failures are capacity or compile-shape problems instead of simple launch bugs.
Where we keep one model definition, where the kernels diverge, what determinism we can offer on each backend, how communication differs between NCCL and XLA collectives, and which operator surface has to stay portable.
A strategic comparison of the TPU/XLA lane against the CUDA path when stack ownership becomes a first-order decision.
Once the runtime is understood, these pieces show how the TPU path stays efficient and observable.
Why explicit mark_sharding annotations matter on TPU XLA, what should be pinned explicitly, and why propagation is not a substitute for a stable sharding contract.
The concrete sharding annotations the stack actually relies on instead of generic XLA theory.
How to think about TPU XLA sharding honestly: keep the ZeRO-3 memory goal, but drop the assumption that TPU uses the same eager FSDP2 wrapper model as CUDA.
How the TPU backend maps the ZeRO-3-shaped sharding story and where the translation stops.
How to keep optimizer math graph-friendly on TPU, treat runtime flags as explicit launch policy, and recalibrate after stack changes.
The flag and optimizer-control surface that matters once TPU experiments start differing for reasons outside the model graph itself.
How to frame block-sparse attention on TPU honestly: explicit mask contracts, MXU-aligned tile choices, and a preference for stable sparse layouts over data-dependent retracing.
The kernel-side readback once the runtime and sharding contracts are in place.
Why TPU telemetry has to be gated carefully: scalar reads can become host-device syncs, so sink and outlier tracking must be designed as explicit low-cadence instrumentation.
The observability companion once TPU instrumentation starts distorting the very runtime you are trying to understand.
Keep exploring
These hubs cover nearby parts of the blog without turning the archive into a giant taxonomy.
A curated GB10 and Blackwell reading path: consumer-versus-datacenter tensor paths, driver-visible false positives, arch-patch repros, and the serving or precision choices that survived contact with the hardware.
A curated Modal reading path: when the rented-GPU surface was useful, what broke on multi-GPU launches, how receipts were recorded, and how we kept the lane debuggable.
A curated Mamba3 reading path: why MegaCpp kept a hybrid stack, how the kernels evolved across CUDA, TileLang, and TPU, and where the runtime wins actually held.
A curated MLA reading path: the weight-absorption contract, Megatron-safe integration boundaries, dispatch and FP8 edges, and the adapter surfaces that keep MLA connected to the rest of the stack.
A curated evaluation reading path: verifier-first harnesses, ablation structure, benchmark receipts, and the evidence rules that keep comparisons from collapsing into anecdotes.
A curated Megatron reading path: the parallelism map, what actually splits, how NVIDIA and TPU wrappers differ, and the migration surfaces around NAM56R-style layouts.
A curated TPU sparse-attention reading path: block-sparse contracts, Pallas kernel choices, SPMD sharding, and the runtime surfaces that keep long-context TPU work stable.
A curated path through the H200 lane: operator bring-up, step-time anatomy, memory pressure, and the NVIDIA kernel surfaces that actually moved the stack.
A curated archive for the C++ data path: corpus selection, semantic enrichment, packaging into training artifacts, and the file-level durability choices that keep the pipeline sane.
A curated path through the expert stack: what the specialist path changed, how routing works, and how the parallelism map constrains the model layout.