tpu
v6e
performance
mfu
sharding
spmd
fsdp
moe

TPU v6e Performance Deep Dive: Real MFU, Sharding Topology, and the Things That Pretended to Help

David Gornshtein • 10 min read

TPU v6e is, on paper, a delight. Each chip is rated at roughly 918 TFLOPS bf16 peak. A v6e-16 pod offers about 14.7 PFLOPS of bf16 compute and just shy of 500 GB of HBM. The interconnect is fast, SPMD is stable in the current torch_xla stack, and the price-performance is attractive against a comparable H200 cluster.

And yet the first NAM52 4 B-parameter MoE training run we landed on v6e-8 ran at 0.5 % MFU. The second one, after a week of work, ran at 8.6 % MFU. The eventual production-shaped configuration on v6e-32 ran at 24,100 tok/sec - a real number, but corresponding to roughly 0.46 % MFU. This post is about why that gap exists, what closed parts of it, and what we learned about where v6e is and is not the right tool.

All numbers below are from the nanochat POC: NAM52 with 52 blocks, fused QKV, GQA with 8 KV heads, MoE with 64 routed experts plus 1 shared, optional MTP heads, plus our usual auxiliary surfaces. Stack: Python 3.13, custom torch 2.9.0a0+git21fec65, custom torch_xla 2.9.0+gitc04e61c, libtpu 0.0.36, jax 0.9.0. SPMD on, PJRT_DEVICE=TPU, model torch.compile off.

Headline numbers

The clean numbers we have, by topology:

Topology                  Config                     Tokens/step   tok/sec   MFU
v6e-8, NAM12 bare AEME    TP=1 dp=8 dbs=8, 4K seq    262,144       638,000   ~50 %
v6e-8, NAM52 bare AEME    TP=4 dp=2 dbs=1, 4K seq    8,192         27,800    8.6 %
v6e-8, NAM52 +features    TP=4 dp=2 dbs=1, 4K seq    8,192         19,900    6.2 %
v6e-8, NAM52 +FSDP        TP=1 dp=8 dbs=2, 4K seq    65,536        48,900    3.8 %
v6e-16, NAM52 +EP=4       EP=4 TP=2 dp=2, 4K seq     —             13,500    ~0.84 %
v6e-32, NAM52 +EP=4       EP=4 TP=2 dp=4, 4K seq     —             24,100    0.46 %

The gap between the NAM12 row and every NAM52 row is the entire story. NAM12 at depth 12 fits comfortably under TP=1 dp=8 dbs=8 with 262 K tokens per step and saturates the chips. NAM52 at depth 52 does not fit in the same shape, so it gets pushed into shapes that are memory-feasible but compute-starved. The MFU collapse is structural, not a kernel-quality problem.
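The tokens-per-step column is just arithmetic over the config column, which makes the structural point easy to verify. A minimal sketch (the 4K sequence length is the one used throughout the post):

```python
def tokens_per_step(dp: int, dbs: int, seq_len: int = 4096) -> int:
    """Global tokens per optimizer step: data-parallel width x
    per-chip device batch size x sequence length."""
    return dp * dbs * seq_len

# NAM12 on v6e-8: TP=1 dp=8 dbs=8 -> the compute-saturating shape
print(tokens_per_step(dp=8, dbs=8))   # 262144
# NAM52 on v6e-8: TP=4 dp=2 dbs=1 -> memory-feasible but compute-starved
print(tokens_per_step(dp=2, dbs=1))   # 8192
# NAM52 + FSDP: TP=1 dp=8 dbs=2 -> 8x the tokens per step
print(tokens_per_step(dp=8, dbs=2))   # 65536
```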

Why MFU collapses on big NAM52

The arithmetic is sobering. A v6e chip is 918 TFLOPS bf16. NAM52 is about 5.6 B FLOPs per token (from estimate_flops()), so a single v6e-16 pod could in principle process about 2.6 M tokens per second just from compute. The number we actually achieved on v6e-16 EP=4 TP=2 dp=2 was about 13.5 K tok/sec. That is roughly 0.5 % of peak. Where does the other 99.5 % go?

Per chip, per step, the breakdown was:

  • Effective batch per chip: 1 sequence × 4096 tokens = 4096 tok
  • FLOPs per token: 5.6 B
  • Compute per chip per step: 2.87 TFLOP, 3.1 ms at peak
  • Actual step time: ~600 ms
  • Compute-bound fraction of step: ~0.5 %
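The breakdown above falls out of a few lines of arithmetic; the peak, FLOP, and step-time figures are the ones quoted in the text:

```python
PEAK_TFLOPS_BF16 = 918    # per v6e chip, from the spec sheet
FLOPS_PER_TOKEN = 5.6e9   # NAM52, from estimate_flops()

# Per-chip share of the step's compute; the post reports 2.87 TFLOP
# per chip per step under the EP=4 TP=2 dp=2 sharding.
compute_tflop = 2.87
ideal_ms = compute_tflop / PEAK_TFLOPS_BF16 * 1000   # ~3.1 ms at peak
step_ms = 600                                        # measured step time
fraction = ideal_ms / step_ms                        # ~0.5 % compute-bound
print(f"{ideal_ms:.1f} ms ideal, {fraction:.1%} compute-bound")
```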

The remaining 99.5 % is split, in roughly this order, between MoE all-to-all (EP=4 dispatch + combine), TP all-reduce on the model axis, gradient sync across the dp axis, the optimizer step (Muon Polar Express + AdamW on 4 B params), and HBM bandwidth shuffling parameters and activations through the chip. The tensor cores sit idle for almost the entire step.

The fundamental reason is that NAM52 cannot fit any larger per-chip batch. The optimizer state alone consumes about 22 GB of the 31.25 GB usable HBM at TP=2. That leaves ~9 GB for forward and backward live state, which is enough for exactly one 4 K sequence per chip. A bigger batch is the obvious lever for MFU; on v6e for a model this size, you do not get one.
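A back-of-the-envelope for that budget. The post only states the 22 GB total; the ~11 bytes/param figure below is an assumption (fp32 moments plus a master copy is a common layout for this kind of optimizer state, but the real breakdown is not given):

```python
def optimizer_state_gb(params_b: float, bytes_per_param: float, tp: int) -> float:
    """Per-chip optimizer-state HBM, with state sharded across the TP axis."""
    return params_b * bytes_per_param / tp

# Assumption: ~11 bytes/param of optimizer state lands on the
# reported 22 GB at TP=2 for a 4 B-parameter model.
state = optimizer_state_gb(params_b=4.0, bytes_per_param=11, tp=2)
usable = 31.25  # usable HBM per v6e chip, from the text
print(f"{state:.0f} GB state, {usable - state:.2f} GB left for live activations")
```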

Sharding topology - what actually worked

SPMD on v6e gives you several axes to play with: ("data",), ("data", "model"), ("data", "expert"), and ("data", "expert", "model"). The current scripts/base_train.py constructs whichever of these matches the requested TP and EP. The "right" topology depends entirely on whether your bottleneck is HBM, collective working set, or compute imbalance.
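To make the axis factorization concrete, here is a plain-numpy sketch of how a flat device list folds into a (data, expert, model) grid. This is illustrative only; the actual scripts/base_train.py goes through the torch_xla SPMD mesh machinery rather than raw numpy:

```python
import numpy as np

def build_mesh(num_devices: int, ep: int, tp: int) -> np.ndarray:
    """Arrange flat device ids into a (data, expert, model) grid."""
    dp = num_devices // (ep * tp)
    assert dp * ep * tp == num_devices, "axes must factor the device count"
    return np.arange(num_devices).reshape(dp, ep, tp)

# v6e-16 with EP=4 TP=2 leaves dp=2: the (2, 4, 2) shape from the post
mesh = build_mesh(16, ep=4, tp=2)
print(mesh.shape)  # (2, 4, 2)
```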

The sharding decisions that materially moved throughput on v6e:

FSDP over TP at depth 52. The single biggest jump on v6e-8 was enabling FSDP for the optimizer state. Without it, NAM52 only fit at TP=4 dp=2 dbs=1 (8,192 tokens per step, 27.8 K tok/sec, 8.6 % MFU). With it, NAM52 fit at TP=1 dp=8 dbs=2 (65,536 tokens per step, 48.9 K tok/sec). That is 8× the tokens per step, 1.76× the throughput, and a configuration where TP no longer steals the model dimension. The MFU number itself dropped to 3.8 % - because we are now amortising compile and collective overhead over a much larger batch - but wall-clock training time per token dropped by almost half. MFU is the wrong metric to optimise here; tokens per wall-clock second is the right one.

EP=4 over EP=8 on v6e-16. Expert parallelism is appealing because it shards the MoE bank across chips, but the cost is an all-to-all collective on every forward and backward pass. EP=8 on NAM52 produced a 22.11 GB compiled program graph that did not fit in 31.25 GB of HBM, failing with CompileTimeHbmOom before the first step. EP=4 with 16 experts per chip fit, using the 3D HybridMesh (2, 4, 2) for (data, expert, model). EP=2 with 32 experts per chip OOMed at runtime allocation (23.92 GB needed). The sweet spot was narrow.
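The sweep itself is simple arithmetic: experts per chip is the routed-expert count divided by EP, and only one value threaded the needle between the two OOM failure modes described above:

```python
ROUTED_EXPERTS = 64  # NAM52's routed-expert count (the 1 shared expert is replicated)

for ep in (8, 4, 2):
    per_chip = ROUTED_EXPERTS // ep
    # EP=8 ->  8 experts/chip: compile-time OOM (22.11 GB program graph)
    # EP=4 -> 16 experts/chip: fits under the (2, 4, 2) HybridMesh
    # EP=2 -> 32 experts/chip: runtime-allocation OOM (23.92 GB needed)
    print(f"EP={ep}: {per_chip} experts per chip")
```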

TP=2 stayed worth it on v6e-16. Going to TP=1 with EP=4 dp=4 on v6e-16 would have given more dp width but pushed every per-layer activation through dp instead of the cheaper TP all-reduce. We did not see a clean win on either side; TP=2 EP=4 dp=2 gave us the most stable baseline.

Scaling to v6e-32. Doubling chips from v6e-16 to v6e-32 took throughput from 13,500 tok/sec to 24,100 tok/sec - a 1.78× scaling factor for 2× the chips, or 89 % scaling efficiency. For a MoE with EP=4 spanning 32 chips, that is a respectable number. The dominant overhead at v6e-32 was the all-to-all working set rather than dp all-reduce; further scaling would likely require revisiting EP rather than just adding chips.
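Scaling efficiency as used here is the measured speedup divided by the chip-count ratio; a minimal sketch reproducing the 89 % figure:

```python
def scaling_efficiency(tok_sec_small: float, tok_sec_large: float,
                       chips_small: int, chips_large: int) -> float:
    """Measured speedup relative to ideal linear scaling."""
    speedup = tok_sec_large / tok_sec_small
    return speedup / (chips_large / chips_small)

# v6e-16 -> v6e-32: 13,500 -> 24,100 tok/sec on 2x the chips
eff = scaling_efficiency(13_500, 24_100, 16, 32)
print(f"{eff:.0%}")  # 89%
```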

The bottleneck stack, in order

Ranked list for NAM52-class models on v6e:

  1. Optimizer-state memory. Ceiling on per-chip batch. FSDP, bf16 Polar Express. Nothing else moves the budget shape.
  2. MoE all-to-all working set. Pick EP for the all-to-all program size, not just for expert sharding. EP=4 was almost always right.
  3. Compile-cache stability. A single recompile per N steps is a throughput tax invisible in averages but ugly in tail latency.
  4. TP collective volume. TP=2 vs TP=4 changes the all-reduce volume per layer. The cheaper option that still fits wins.
  5. Attention backend. The bounded XLA Pallas flash-attention path (xla_flash_pallas + softcap variant) is fast and stable; we never beat it on v6e for layouts inside its support region. The Splash variant via Pallas trace ran at 4-5 ms per forward at 512 seq, 4 layers, 1024 hidden in the smoke harness.
  6. Per-step Python overhead. Small once the cache is warm; before then, every .item() and Python if on tensor values shows up.

Bottleneck analysis

v6e does not give you nsys, but the libtpu profiler plus torch_xla.debug.metrics shows the shape of the problem. The pattern that recurred across every NAM52 v6e run: tensor-core utilisation in the low single digits, collectives 35-55 % depending on EP/TP split, optimizer step 25-40 %, fwd+bwd arithmetic ~10-15 %, the rest split between host, dataloader, and sync barriers.

The H200 lane on the same model showed the same shape with different absolute numbers - H200 nsys captures had 88 % of GPU time on elementwise ops and only 2.7 % on matmul. NAM52's 52 blocks generate millions of small elementwise ops (residual adds, gating, normalisation) that are the wrong workload for either H200 tensor cores or v6e MXU. The chip family is incidental; the model shape is the bottleneck.

MTP and the per-depth tax

MTP (Multi-Token Prediction) was higher cost on v6e than on H200. The depth-attribution sweep on v6e-8 showed MTP=1 at ~80 compilations (~72 min compile wall-clock), MTP=3 at ~196 compilations (~137 min).

The K=3 sweep produced the per-depth contribution we expected: depth 1 at 51.1 %, depth 2 at 30.5 %, depth 3 at 18.4 %, matching the H200 reference and the exponential decay weights (0.51, 0.31, 0.18). The quality story transferred cleanly: depth 1 captures roughly half the MTP benefit at well under half the compile cost.
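The quoted weights are consistent with a geometric decay of ratio roughly 0.6, normalized over the three depths. The 0.6 ratio is inferred from the numbers in the text, not stated explicitly, so treat it as an assumption:

```python
def mtp_depth_weights(k: int, decay: float = 0.6) -> list[float]:
    """Geometric per-depth loss weights, normalized to sum to 1.
    decay=0.6 is inferred from the reported (0.51, 0.31, 0.18)."""
    raw = [decay ** d for d in range(k)]
    total = sum(raw)
    return [w / total for w in raw]

weights = mtp_depth_weights(3)
print([round(w, 2) for w in weights])  # [0.51, 0.31, 0.18]
```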

The compile-cost story did not transfer. On H200 with a warm cache, MTP=3 is about 4 % slower than MTP=1 at steady state. On v6e the 2.4× compile blowup is the dominant cost difference, because each MTP head adds new attention and projection sub-graphs that XLA compiles separately. With a persistent disk cache this is a one-time cost that amortises; without one, it is paid every restart. Practical recommendation: MTP=1 in production on v6e, MTP=3 only in research configurations with a warm persistent cache.

Things that pretended to help and didn't

  • MaxText async-collective-fusion XLA flag bundles (--xla_tpu_enable_async_collective_fusion=true, --xla_tpu_enable_data_parallel_all_reduce_opt=true, --xla_tpu_use_minor_sharding_for_major_trivial_input=true) caused NaN from step 1 on every NAM12 and NAM52 configuration across five v6e-8 hosts, and drove subtle compile-cache key drift. Upstream throughput claims did not reproduce. We default to --xla_flag_profile=none and add narrow overrides only with documented reason.

  • Activation sharding (mark_sharding) on MLP intermediates and QKV post-projection. No measurable throughput change at our shapes, and a fresh surface for sharding bugs (we hit one with Muon producing NaN on zero-initialised RowwiseParallel c_proj weights through this path). Rolled back.

  • Ring-folding mesh layout for the physical (2, 4) v6e chip topology. Difference within noise, complicated mesh reasoning, not worth keeping.

  • reuse_batch micro-optimisation. Small win on H200, no-op on v6e, where the dataloader is not on the critical path.

  • barrier_every_n_layers=2 for throughput. Barriers are a memory tool, not a throughput tool: more compiled programs, more per-step round-trips, slightly less throughput. Use them for HBM headroom or not at all.

  • XLA_USE_BF16=1. Incompatible with current Pallas kernels; produces an extra set of fallback graphs and bloats the cache. Cast to bf16 explicitly in the model and leave the env var unset.

What v6e is and is not good for

v6e earns its keep on dense small-and-medium models where per-chip batch fits comfortably. NAM12 at TP=1 dp=8 dbs=8 hit 638 K tok/sec on v6e-8 at ~50 % MFU - a healthy number that competes well with comparably-priced GPU options. SPMD is stable, the persistent compile cache is fast once warm, and the bounded Pallas attention kernels are good.

v6e struggles on deep MoE models that cannot fit a meaningful per-chip batch. NAM52 4 B at depth 52 with our optimizer state landed in a topology where MFU is structurally bounded below 1 %. The chip is not the problem and the kernel is not the problem - the memory budget is. The clean answer is more chips (v6e-32 was a 1.78× scaling step from v6e-16) or a smaller model. The dirty answer is sharding tricks until you claw back HBM for a second sequence per chip.

v6e is not a drop-in replacement for an H200 cluster of similar nominal compute. H200's 141 GB of HBM per device changes the game for 4-10 B models. Do not train a 4 B+ MoE on v6e unless you are price-sensitive enough to accept the MFU gap, or you can shrink to NAM12-class shapes.

What we would do differently

  • Decide per-chip HBM budget before model architecture. At 31.25 GB usable, anything that pushes optimizer state past ~22 GB forces compute-starved topologies.
  • Enable FSDP from day 1, not from week 4.
  • Default to the bounded Pallas attention path; only reach for custom kernels when its support region does not cover the layout.
  • Pin a persistent compile cache on fast local disk before any benchmarking.
  • Keep --xla_flag_profile=none as the production default; treat any flag bundle as a per-config hypothesis.
  • Measure tokens per wall-clock second, not MFU, and only chase MFU after the topology is locked.

The 24,100 tok/sec NAM52 v6e-32 number is a real, reproducible production training rate, and a fraction of what the same chips could do for a model that actually fits the v6e budget shape. Match the model to the silicon before you fight the silicon to fit the model.

References

  • review_gcp_tpu.md
  • docs/CURRENT_STATE.md
  • docs/TPU_SETUP.md
  • docs/TENSOR_PARALLELISM.md
  • docs/BACKEND_STOPLIGHT_MATRIX.md
  • reports/dist_optimizer_stress_tpu_v6e_2026-03-22.md
  • reports/mtp_depth_attribution_tpu_v6e_2026-03-22.md
  • reports/tpu_backend_provenance_v6e8_2026-03-16.json
  • reports/fa4_fsdp2_scaling_2026-03-22.json
  • CHANGELOG.md
  • training_review.md
  • speed_rep_xx.md
David Gornshtein • Datasunrise OÜ