Entity Hub

TPU Sparse Attention and Pallas Kernels

A curated TPU sparse-attention reading path: block-sparse contracts, Pallas kernel choices, SPMD sharding, and the runtime surfaces that keep long-context TPU work stable.

This hub is narrower than the general TPU/XLA archive. Start with the block-sparse and Pallas kernel notes, then move into the sharding and planner surfaces, and finish with the operational pieces that keep the TPU sparse-attention lane observable and stable.

sparse-attention

pallas

flash-attention

softcap

doc-masking

block-sparse

Curated set

Articles in reading order

Why this hub

Best if you care specifically about sparse attention, Pallas, and long-context TPU kernel work rather than the TPU stack as a whole.

Kernel Surfaces

These are the core sparse-attention and Pallas notes to read first.

01
April 18, 2026 • 3 min read • David Gornshtein
Block-sparse attention on TPU v6e: block masks, MXU-friendly tiles, and stable contracts
How to frame block-sparse attention on TPU honestly: explicit mask contracts, MXU-aligned tile choices, and a preference for stable sparse layouts over data-dependent retracing.
The block-mask and MXU-friendly kernel contract that anchors the rest of this cluster.
TPU
XLA
Sparse Attention
Pallas
02
April 18, 2026 • 6 min read • David Gornshtein
Pallas kernels on TPU v6e: what we ship and what we deleted
Where Pallas beats the XLA lowering on TPU v6e, where it loses, the debugging workflow that keeps us sane, and the kernel deltas we kept versus the ones we reverted.
What the Pallas lane kept, what it deleted, and why the TPU kernel surface narrowed over time.
Pallas
TPU
V6e
JAX
03
April 18, 2026 • 5 min read • David Gornshtein
Pallas FlashAttention with logit softcap on TPU v6e
Why softcap attention on TPU needs a dedicated kernel surface: fuse the nonlinearity, keep masking contract-friendly, and avoid turning a stability trick into a second full pass over the score matrix.
The flash-attention and softcap companion piece once Pallas is already familiar.
Pallas
TPU
V6e
Flash Attention

Planner and Sharding Follow-Through

Once the kernels are legible, these explain how the TPU layout stays stable.

04
April 19, 2026 • 4 min read • David Gornshtein
Clustered sparse on TPU: the planner stages
How MegaCpp decomposes clustered sparse TPU attention into planner stages, legality checks, and backend dispatch rather than treating sparse attention as one giant kernel.
The planner-stage explanation for the clustered sparse TPU path.
TPU
Pallas
Sparse Attention
Kernels
05
April 18, 2026 • 3 min read • David Gornshtein
XLA SPMD sharding annotations we actually rely on
Why explicit mark_sharding annotations matter on TPU XLA, what should be pinned explicitly, and why propagation is not a substitute for a stable sharding contract.
The concrete sharding annotations the TPU sparse-attention lane actually relies on.
XLA
SPMD
TPU
Sharding
06
April 18, 2026 • 7 min read • David Gornshtein
Vocab and Tokenizer Plumbing on TPU: What XLA SPMD Makes You Decide Up Front
Vocab-size constraints under XLA, the padding choices that keep the compile cache stable, sharded embedding init under SPMD, and the per-specialist platform vocab story.
The tokenizer and vocab decisions you have to lock down before the TPU sparse path stays compile-safe.
TPU
V6e
XLA
SPMD

Operational Companions

These are the nearby runtime notes that keep the sparse lane debuggable.

07
April 18, 2026 • 2 min read • David Gornshtein
Attention sinks and telemetry on TPU: measure without turning observability into the bug
Why TPU telemetry has to be gated carefully: scalar reads can become host-device syncs, so sink and outlier tracking must be designed as explicit low-cadence instrumentation.
How to keep observability useful without turning the TPU lane into a measurement artifact.
TPU
Telemetry
XLA
Attention
08
April 18, 2026 • 9 min read • David Gornshtein
Transformer Engine replacements on TPU: keeping one model definition across paths
Transformer Engine is an NVIDIA Hopper and Blackwell story. On TPU v6e it does not exist. This is the layer-spec abstraction and the XLA-friendly substitutes that let one model definition ship across both paths.
The TPU-side model-definition substitutions that keep the sparse or Pallas path compatible with the rest of the stack.
TPU
V6e
XLA
Transformer Engine
09
April 18, 2026 • 11 min read • David Gornshtein
Tokenized enriched packed rows on TPU: feeding structure to XLA without recompiles
How the v6_enriched packed-rows pipeline feeds per-token structure IDs, chunk boundaries, and call edges into the XLA dataloader on TPU v6e without triggering compile cache misses, and how that contract lifts into the main path.
The packed-row data companion once long-context TPU execution has to stay compile-stable end to end.
TPU
XLA
Data Pipeline
Structure Aware

Keep exploring

Adjacent topic hubs

These hubs cover nearby parts of the blog without turning the archive into a giant taxonomy.

GB10 and Blackwell Bring-Up

Modal Training and Benchmark Operations

Mamba3 Architecture, Kernels, and Runtime Tradeoffs

MLA Integration, Dispatch, and Weight Absorption

Evaluation, Benchmarks, and Verifier Loops

Megatron Parallelism and Layout Boundaries

H200 Training and Kernel Bring-Up

TPU v6e and XLA Runtime Surfaces

C++ Data Pipelines and Corpus Packaging

MoE, Routing, and Distributed Model Splits