Entity Hub

TPU Sparse Attention and Pallas Kernels

A curated TPU sparse-attention reading path: block-sparse contracts, Pallas kernel choices, SPMD sharding, and the runtime surfaces that keep long-context TPU work stable.

This hub is narrower than the general TPU/XLA archive. Start with the block-sparse and Pallas kernel notes, then move into the sharding and planner surfaces, and finish with the operational pieces that keep the TPU sparse-attention lane observable and stable.

sparse-attention
pallas
flash-attention
softcap
doc-masking
block-sparse
Curated set: 9 articles in reading order

Why this hub

Best if you care specifically about sparse attention, Pallas, and long-context TPU kernel work rather than the TPU stack as a whole.

Kernel Surfaces

These are the core sparse-attention and Pallas notes to read first.

  1. 02
    April 18, 2026 · 6 min read · David Gornshtein

    Pallas kernels on TPU v6e: what we ship and what we deleted

    Where Pallas beats the XLA lowering on TPU v6e, where it loses, the debugging workflow that keeps us sane, and the kernel deltas we kept versus the ones we reverted.

    What the Pallas lane kept, what it deleted, and why the TPU kernel surface narrowed over time.

    Pallas
    TPU
    v6e
    JAX
  2. 03
    April 18, 2026 · 5 min read · David Gornshtein

    Pallas FlashAttention with logit softcap on TPU v6e

    Why softcap attention on TPU needs a dedicated kernel surface: fuse the nonlinearity, keep masking contract-friendly, and avoid turning a stability trick into a second full pass over the score matrix.

    The flash-attention and softcap companion piece once Pallas is already familiar.

    Pallas
    TPU
    v6e
    Flash Attention
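
For readers who want the softcap idea in concrete form before opening the second note, here is a minimal sketch in plain Python. It is not the Pallas kernel from the article. It assumes the common softcap form `cap * tanh(x / cap)` and streams over keys one at a time with an online softmax, so the cap is applied as each score is produced instead of in a second pass over a stored score matrix. The names `softcap` and `softcap_attention_row`, and the boolean `mask` argument, are hypothetical and chosen only for illustration.

```python
import math


def softcap(x, cap):
    # Smoothly bound a logit to [-cap, cap]; near zero it is close to identity.
    return cap * math.tanh(x / cap)


def softcap_attention_row(q, keys, values, cap, mask=None):
    """Attention output for one query row, streamed over keys.

    Assumes at least one key is visible under the mask (illustrative only).
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf  # largest capped score seen so far
    denom = 0.0              # running softmax denominator
    acc = [0.0] * len(values[0])  # running weighted sum of values

    for j, key in enumerate(keys):
        if mask is not None and not mask[j]:
            continue  # masked keys contribute nothing
        raw = scale * sum(a * b for a, b in zip(q, key))
        s = softcap(raw, cap)  # cap applied as soon as the score exists

        # Online softmax update: rescale previous state to the new maximum.
        new_max = max(running_max, s)
        correction = math.exp(running_max - new_max)  # 0.0 on the first key
        w = math.exp(s - new_max)
        denom = denom * correction + w
        acc = [a * correction + w * v for a, v in zip(acc, values[j])]
        running_max = new_max

    return [a / denom for a in acc]
```

The point of the sketch is only the ordering: score, cap, mask, and softmax accumulation all happen in one streaming loop, which is the shape a fused kernel preserves at block granularity.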

Planner and Sharding Follow-Through

Once the kernels are legible, these explain how the TPU layout stays stable.

  1. 04
    April 19, 2026 · 4 min read · David Gornshtein

    Clustered sparse on TPU: the planner stages

    How MegaCpp decomposes clustered sparse TPU attention into planner stages, legality checks, and backend dispatch rather than treating sparse attention as one giant kernel.

    The planner-stage explanation for the clustered sparse TPU path.

    TPU
    Pallas
    Sparse Attention
    Kernels
  2. 05
    April 18, 2026 · 3 min read · David Gornshtein

    XLA SPMD sharding annotations we actually rely on

    Why explicit mark_sharding annotations matter on TPU XLA, what should be pinned explicitly, and why propagation is not a substitute for a stable sharding contract.

    The concrete sharding annotations the TPU sparse-attention lane actually relies on.

    XLA
    SPMD
    TPU
    Sharding

Operational Companions

These are the nearby runtime notes that keep the sparse lane debuggable.

  1. 08
    April 18, 2026 · 9 min read · David Gornshtein

    Transformer Engine replacements on TPU: keeping one model definition across paths

    Transformer Engine is an NVIDIA Hopper and Blackwell story. On TPU v6e it does not exist. This is the layer-spec abstraction and the XLA-friendly substitutes that let one model definition ship across both paths.

    The TPU-side model-definition substitutions that keep the sparse or Pallas path compatible with the rest of the stack.

    TPU
    v6e
    XLA
    Transformer Engine
  2. 09
    April 18, 2026 · 11 min read · David Gornshtein

    Tokenized enriched packed rows on TPU: feeding structure to XLA without recompiles

    How the v6_enriched packed-rows pipeline feeds per-token structure IDs, chunk boundaries, and call edges into the XLA dataloader on TPU v6e without triggering compile cache misses, and how that contract lifts into the main path.

    The packed-row data companion once long-context TPU execution has to stay compile-stable end to end.

    TPU
    XLA
    Data Pipeline
    Structure Aware

Keep exploring

Adjacent topic hubs

These hubs cover nearby parts of the blog without turning the archive into a giant taxonomy.