tpu
v6e
xla
hbm
oom
sharding
training

OOM Hunting on TPU v6e: HBM Fragmentation, the XLA Allocator, and What Actually Moved Memory

9 min read · David Gornshtein

TPU v6e gives you 32 GB of HBM per chip, of which roughly 31.25 GB is addressable from XLA after the runtime takes its cut. That number sets the frame for everything in this post. We spent several weeks of the nanochat POC trying to fit a 4.04 B-parameter NAM52 MoE model onto v6e-8 and v6e-16 pods, and the punchline is unflattering: on v6e the optimizer is the OOM, not the model.

This is what we learned along the way - what HBM fragmentation actually looks like under the XLA allocator, which activation-checkpointing reshapes moved memory and which only moved compile time, and the handful of knobs that gave us back enough headroom to train.

The shape of the problem

The model under load was NAM52: 52 blocks, fused QKV, GQA with 8 KV heads, a 64-routed-expert + 1-shared MoE bank, plus our usual auxiliary surfaces (Mamba, Engram, mHC, n-gram hash). Total parameter count: 4,042,836,277. In bf16 that is ~8 GB of weights. With Muon on the matrix params and AdamW on the embeddings, optimizer state lands somewhere between 16 and 22 GB depending on what you fold into Polar Express. On a v6e chip with 31.25 GB usable HBM, the math has no slack.
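The arithmetic is worth making explicit. A back-of-the-envelope sketch, in decimal GB to match the numbers quoted above; the fp32-momentum line treats every parameter as Muon-managed, which slightly overstates the real Muon share (AdamW only covers the embeddings), so read it as a bound:

```python
# Back-of-the-envelope HBM budget for the NAM52 numbers quoted above.
PARAMS = 4_042_836_277   # total NAM52 parameter count
GB = 1e9                 # decimal gigabytes, matching the text

weights_gb = PARAMS * 2 / GB        # bf16 weights: ~8.1 GB
momentum_fp32_gb = PARAMS * 4 / GB  # one fp32 momentum slot per param: ~16.2 GB (bound)

usable_hbm_gb = 31.25
# Worst case from the quoted 16-22 GB optimizer-state range:
headroom_gb = usable_hbm_gb - weights_gb - 22.0   # ~1.2 GB left for everything else
```

With the optimizer range at its top end, weights plus optimizer state leave barely a gigabyte before a single activation is allocated, which is the "no slack" claim in numbers.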

The first OOMs we saw were the obvious ones: bump batch size, watch HBM overflow. The interesting OOMs came later, after we had tuned everything the textbook tells you to tune and the allocator was still failing on graphs that, on paper, fit.

The XLA allocator is not your friend

On v6e, the XLA allocator does not behave like a CUDA caching allocator with a nice nvidia-smi-shaped picture of the world. There are two separate failure modes that both surface as "OOM" and need very different responses.

The first is CompileTimeHbmOom. This fires before a single tensor lands on device. XLA has finished tracing your graph, computed the program allocation, and decided that the worst-case live set across the schedule exceeds the per-chip limit. Typical message: compiled program 22.11G on 31.25G HBM. The interesting bit is that this is computed against the scheduled peak, not the steady-state working set, so a graph that would actually run fine if the scheduler made a different choice still gets rejected at compile time.

The second is RuntimeOom / RuntimeAllocationFailure. This is the allocator failing to satisfy a request after the graph is already running. On v6e this is almost always the optimizer graph allocating its temporaries on top of an already-resident parameter and momentum set, rather than the forward-backward graph asking for activations. The characteristic signature: forward+backward fits, the first optimizer step OOMs, free HBM at the moment of failure is around 19-22 GB, requested allocation is around 21-23 GB.

Both failures will sometimes also show up with cryptic MSA (Memory Space Assignment) errors in libtpu, which is what XLA reports when its scheduler-aware allocator gives up. We saw this most often when MoE all-to-all temporaries grew past about 6 GB per chip; libtpu's MSA pass would refuse to place them and the run died inside jellyfish_msa. The only reliable fix was to shrink the all-to-all working set, which usually meant smaller EP or smaller per-expert capacity, not a bigger box.
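Because the three failure modes need different responses, it helps to triage them mechanically. The sketch below is hypothetical: the substrings are paraphrased from the errors described above, not guaranteed libtpu message formats, so adapt them to your own logs.

```python
def triage_tpu_oom(message: str) -> str:
    """Rough triage of an XLA OOM message into the three failure modes
    described above. Substrings are paraphrased, not exact libtpu output."""
    if "CompileTimeHbmOom" in message or "compiled program" in message:
        # Scheduled-peak rejection before anything runs: shrink the
        # per-program live set (barriers, checkpointing) or shard more.
        return "compile-time: reduce scheduled peak"
    if "RuntimeOom" in message or "RuntimeAllocationFailure" in message:
        # Allocation on top of already-resident state: on v6e this is
        # almost always the optimizer step, not activations.
        return "runtime: check the optimizer graph first"
    if "jellyfish_msa" in message or "Memory Space Assignment" in message:
        # MSA gave up placing a large collective temporary: shrink the
        # all-to-all working set (smaller EP or expert capacity).
        return "msa: shrink collective temporaries"
    return "unknown"
```

The point of the third branch is the one from the MoE story above: a bigger box does not fix an MSA failure, a smaller all-to-all working set does.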

The optimizer is the OOM

The single most important insight from this work, and the one that took us longest to internalise: on v6e, the compiled optimizer graph size is approximately constant in batch and sequence. Forward and backward activations scale with dbs * seq_len. The optimizer step does not. Muon with Polar Express keeps a stacked momentum buffer for every matrix parameter; AdamW keeps m and v for every embedding parameter; both sets need a fresh temporary at step time for the Newton-Schulz iterations and the bf16/fp32 cast traffic.

On NAM52 that compiled optimizer graph lands around 20-22 GB on a v6e chip after TP=2 sharding. That leaves about 9-11 GB for the entire forward+backward live set, which means any plausible production batch pushes you over the limit.
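The batch-invariance claim can be stated as a two-line budget model. This is a sketch, not profiler output: gb_per_token is a measured, model-specific constant, and the function names are illustrative.

```python
def activation_budget_gb(optimizer_graph_gb: float,
                         usable_hbm_gb: float = 31.25) -> float:
    """HBM left for the fwd+bwd live set once the batch-invariant
    compiled optimizer graph is resident."""
    return usable_hbm_gb - optimizer_graph_gb

def activation_live_set_gb(dbs: int, seq_len: int,
                           gb_per_token: float) -> float:
    """Activations scale with dbs * seq_len; gb_per_token is a
    model-specific constant you measure, not derive (hypothetical here)."""
    return dbs * seq_len * gb_per_token
```

At a 22 GB optimizer graph the budget function returns 9.25 GB, and the second function says doubling dbs or seq_len doubles what has to fit inside it, which is why every plausible production batch overflowed.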

The diagnostic that sold us on this was a sweep of OOM modes at the SyncTensorsGraph.2980 boundary, all of which failed inside the optimizer step:

dbs=4 seq=4096 TP=2 EP=4: CompileTimeHbmOom 35.45G > 31.25G
dbs=2 seq=4096 TP=2 EP=4: RuntimeOom 22.31G > 19.70G free
dbs=1 seq=8192 TP=2 EP=4: RuntimeOom 21.40G > 19.69G free

The compiled optimizer graph is the floor. Everything else is what little space is left above it.

There were three things that actually moved this number on v6e:

  1. Keeping Polar Express in bf16 instead of fp32. Doing PE in fp32 doubled the stacked all-reduce buffer (360 MB → 720 MB per param group) and pushed the compiled optimizer graph straight into CompileTimeHbmOom. bf16 PE was a hard requirement for any NAM52 fit on v6e-16.
  2. Sharding the optimizer state via FSDP. This is obvious in retrospect. It was not obvious at the time because the --fsdp flag had been sitting unused for months: enabling it moved a NAM52 v6e-8 run from "no fit at TP=4 dp=2 dbs=1" to "fit at TP=1 dp=8 dbs=2, 65 K tokens per step, 48.9 K tok/sec".
  3. EP placement. EP=4 TP=2 with 16 experts per chip fit; EP=2 TP=2 with 32 experts per chip OOMed at 23.92 G needed; EP=8 TP=2 OOMed at the compile stage because the all-to-all program graph itself blew past 22 G. EP is not a memory-free knob - it trades parameter sharding for collective working sets, and on v6e the collective working sets win the budget fight more often than you would expect.
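The FSDP win in item 2 is pure division. A minimal sketch of the budget effect only, under the ZeRO/FSDP-style assumption that each data-parallel rank holds 1/dp of the optimizer state; a real FSDP run also shards gradients and gathers parameters, which this ignores:

```python
def per_chip_optimizer_gb(total_opt_state_gb: float, dp: int) -> float:
    """Optimizer-state sharding across dp data-parallel ranks:
    each chip holds 1/dp of the state (budget effect only)."""
    return total_opt_state_gb / dp

# The v6e-8 flip described above, in these terms (illustrative numbers):
unsharded = per_chip_optimizer_gb(22.0, dp=1)   # 22.0 GB per chip: no fit
sharded   = per_chip_optimizer_gb(22.0, dp=8)   # 2.75 GB per chip: fits
```

Nothing else on the list moves the optimizer floor by an order of magnitude; this does, which is why it goes first in the checklist at the end.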

Activation checkpointing: what reshaped, what didn't

We tried five distinct activation-checkpointing strategies on the forward+backward graph. Two paid for themselves on v6e, one was neutral, one was marginal, and one was a quiet regression once you account for compile cost.

xla_barrier_every_n_layers=N paid off. Inserting xm.mark_step() barriers inside the model splits fwd+bwd into smaller XLA programs whose per-program live sets are bounded by what crosses the barrier, not by what the global scheduler thinks could be live at peak. On NAM52 with barrier=2 we cut peak forward HBM by roughly 40 %, at the cost of disabling fused fwd+bwd compilation and accepting a per-barrier mark_step round-trip. This was the difference between fitting and not fitting at TP=4. It does not help the optimizer graph at all, which is what limits the eventual win.
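The control flow behind xla_barrier_every_n_layers is simple enough to sketch. In the real run the barrier is torch_xla's xm.mark_step(); here it is injectable so the loop can be checked without a TPU, and the function name is illustrative:

```python
def run_with_barriers(blocks, x, barrier_fn, every_n=2):
    """Sketch of the barrier-splitting idea: after every `every_n`
    blocks, call a barrier that cuts the traced graph, so each
    compiled program's live set is bounded by what crosses the cut
    rather than by the global scheduled peak."""
    for i, block in enumerate(blocks, start=1):
        x = block(x)
        if i % every_n == 0:
            barrier_fn()   # graph cut: materialize, free dead buffers
    return x
```

With 52 blocks and every_n=2 you get 26 cuts and 26 smaller programs, which is where the roughly 40 % peak reduction on the forward came from; the price is losing the fused fwd+bwd program.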

Selective Mamba conv+BC recompute paid off. The Mamba state-space blocks were carrying surprisingly heavy intermediates (the per-head projection of the conv state plus the BC delta). Switching the 13 MBlocks to recompute their conv and BC tensors during backward saved roughly 6 GB per chip. Cheap to implement, no compile cost.

viewless_output + recomputed norms were neutral on v6e. They saved about 7 GB on H200 and almost nothing here, because XLA's scheduler already aliases the output of an RMSNorm with its input under the same buffer in most cases. We left it on for cross-platform consistency, but do not credit it on the v6e budget.

Per-block torch.utils.checkpoint on the AEME sequence was a regression. The XLA tracer split the recomputed forward into its own sub-program with its own activation working set, and the scheduler did not always free the original activations before the recompute landed - net effect was more peak HBM, not less, and a meaningful jump in compile time. This is one of the places where CUDA intuition leads you wrong on TPU.

MoE expert recompute was complicated. On the H200 lane this saved ~44 GB across 22 EBlocks. On v6e it saved roughly 3-4 GB and added a second compile of the expert dispatch program. We kept it on because the absolute saving was still positive, but the H200 numbers do not transfer.

Fragmentation, and what to do about it

The XLA allocator on v6e is in practice a buddy-style allocator with coarse granularity for program-allocated buffers and a finer arena for temporaries. We saw two practical fragmentation patterns.

The first is capture-time pinning: any tensor that survives across a mark_step boundary gets pinned into the program-allocated arena and stays there for the lifetime of the compiled program. If you build up small auxiliary tensors during early steps (capacity counters, MoE overflow accumulators, debug stats), those get pinned too. In one case, a Python int counter inside a MoE block was incremented every forward; the dynamo trace therefore baked a fresh constant into the graph each step, and the resulting churn fragmented the optimizer arena enough to OOM by step ~50. The fix was the Megatron-Core pattern: hold counters in register_buffer tensors, not Python ints.
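The register_buffer pattern, sketched from the description above; the class and buffer names are illustrative, not the actual NAM52 code:

```python
import torch
import torch.nn as nn

class MoEOverflowStats(nn.Module):
    """Megatron-Core-style counter holding: keep per-step counters as
    registered buffers so the traced graph sees one stable tensor
    rather than a fresh Python constant every step.
    (Names are illustrative, not the actual project code.)"""
    def __init__(self):
        super().__init__()
        # Lives on device and survives mark_step without retracing.
        self.register_buffer("overflow_count",
                             torch.zeros((), dtype=torch.long))

    def record_overflow(self, n_dropped: torch.Tensor):
        # In-place update on the buffer; no Python-int round trip
        # through the tracer.
        self.overflow_count += n_dropped.to(torch.long)
```

The same holds for any running statistic you want visible across steps: if it lives in a buffer, the trace is stable; if it lives in a Python scalar, every step is a slightly different graph.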

The second is compile-cache contention. With XLA_COMPILATION_CACHE_DIR pointed at a local SSD, repeated startup of the same training script with slightly different shapes accumulates compile artifacts that the allocator sometimes pre-faults during program load. We did not chase this to ground; the workaround was to keep the cache on /data/.xla_cache and clear it between architecture changes.
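The workaround, as a config fragment. XLA_COMPILATION_CACHE_DIR is the variable named in this post; torch_xla versions differ in which cache env var they honor, so check yours:

```shell
# Keep the XLA compilation cache on persistent storage, and clear it
# whenever the architecture changes, per the workaround above.
export XLA_COMPILATION_CACHE_DIR=/data/.xla_cache
mkdir -p "$XLA_COMPILATION_CACHE_DIR"

# After an architecture change:
rm -rf "${XLA_COMPILATION_CACHE_DIR:?}"/*
```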

The XLA flag bundles, briefly

We will not relitigate --xla_flag_profile here in detail because it deserves its own post, but two things matter for OOM specifically:

  • The MaxText-derived async-collective-fusion bundle (--xla_tpu_enable_async_collective_fusion=true and friends) was a NaN source on every NAM12 and NAM52 config we tried. It also tended to inflate compile-time peak HBM estimates by a few GB. We turned the whole bundle off and accepted the throughput cost.
  • The offload profile is the only profile that meaningfully reduces steady-state HBM, by spilling optimizer state to host memory across the PCIe bus. On v6e, with optimizer state pinned at 22 GB, this is the only knob that actually changes the budget shape, but the host-offload scheduler trades that headroom for ~30 % of step time. Use it when you cannot fit; do not use it as a default.
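The decision rule implied by that trade-off fits in one comparison. A sketch using the numbers quoted in this post; the ~30 % penalty is workload-dependent and the function name is illustrative:

```python
def should_offload(optimizer_gb: float, fwd_bwd_gb: float,
                   usable_hbm_gb: float = 31.25) -> bool:
    """Host-offload of optimizer state costs ~30% of step time on v6e,
    so reach for it only when the on-device budget does not close."""
    return optimizer_gb + fwd_bwd_gb > usable_hbm_gb
```

In other words: offload is a fit tool, not a throughput tool, and if the sum already fits you are strictly better off without it.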

What actually moved memory

If you are landing here because your own v6e job is OOMing and you want the short list:

  • Measure the optimizer graph in isolation. If it is more than ~22 GB on a v6e chip, no amount of activation-checkpointing will save you.
  • Force Polar Express to bf16 unless you have a documented numerical reason for fp32, and do not let --matrix_lr schedules silently flip it back.
  • Turn on --fsdp before you turn on anything else. Optimizer-state sharding is the only thing that meaningfully changes the v6e budget shape for models above ~2 B parameters.
  • Pick EP to fit the all-to-all program, not just the expert bank. EP=4 was the right answer for NAM52 on v6e-16; EP=8 compiled larger graphs than the ones it sharded.
  • Use --xla_barrier_every_n_layers to bound per-program activation peaks. Accept the per-barrier round-trip; it is worth the headroom.
  • Keep dynamic Python state out of compiled blocks. Counters belong in register_buffer. MoE overflow accumulators belong in tensors.
  • Trust the per-platform numbers. A reshape that saves 44 GB on H200 may save 3 GB on v6e, and a checkpointing strategy that helps on CUDA may hurt on XLA.

The headline win after all of this was a NAM52 4.04 B-parameter MoE training step running stably on v6e-16 at EP=4 TP=2 dp=2, dbs=1, seq=4096, with bf16 PE and barrier-split fwd+bwd. The HBM budget had about 1.5 GB of headroom. Most of the engineering work above was spent earning that 1.5 GB.

David Gornshtein • Datasunrise OÜ