Topic Hub

MoE, Routing, and Distributed Model Splits

A curated path through the expert stack: what the specialist path changed, how routing works, and how the parallelism map constrains the model layout.

This hub starts with the expert path at a system level, then narrows into routing and parallel layout. It is meant to answer how MoE actually changed the MegaCpp stack, not just what MoE means in the abstract.

moe

routing

expert-parallel

distributed-training

megatron

Curated set

Articles in reading order

Why this hub

Best if you are trying to connect expert routing decisions to real distributed-training and Megatron boundaries.

System View

Read these first to understand what changed when the specialist path became real.

01
April 18, 2026 • 9 min read • MegaCpp Engineering
Specialists: What the Expert Path Actually Changed in the Stack
A grounded look at specialist or expert paths using the real routing flags, expert-parallel notes, and standalone MoE receipts from the codebase.
The best top-level explanation of what the expert path changed in the stack and why it was worth the complexity.
MoE
Experts
Specialist Models
Routing
02
April 18, 2026 • 10 min read • Boris Tamarkin
The MoE Routing We Actually Shipped
Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
The routing article to read before looking at sharding or parallel layouts.
MoE
Token Choice
Null Experts
Routing
03
April 18, 2026 • 10 min read • MegaCpp Engineering
Expert Parallel and MoE Sharding: Capacity Is Cheap, Routing Is Not
A grounded walkthrough of expert parallelism in the MegaCpp stack, based on the recipe files, layer definitions, schedule plans, and bug reports that shape how MoE runs actually behave.
The capacity-versus-routing tradeoff at the point where the distributed system starts to matter.
Expert Parallel
MoE
Distributed Training
Sharding

Parallel Layout and Implementation Edges

These complete the picture by showing how MoE lands in the real execution stack.

04
April 18, 2026 • 10 min read • Engineering Team
EP, PP, TP, CP, SP, DP: The Parallelism Map We Actually Use
What data, tensor, sequence, context, pipeline, and expert parallelism each own, how they compose, and where the real integration risks still live.
The parallelism map used by the stack; read this before chasing individual split names in code or launchers.
Distributed Training
Expert Parallel
Pipeline Parallel
Tensor Parallel
05
April 18, 2026 • 10 min read • Engineering Team
What Megatron Can and Cannot Split
A grounded look at split-friendly and split-hostile model surfaces: TP, SP, PP, EP, recurrent state, side embeddings, and why some boundaries remain architectural rather than automatic.
A useful boundary document for what the Megatron lane can express without custom seams.
Megatron
Tensor Parallel
Pipeline Parallel
MoE
06
April 18, 2026 • 12 min read • David Gornshtein
Fused MoE and DeepEP on NVIDIA: the dispatch layer we ship
How MegaCpp dispatches MoE tokens on H200 and GB10: DeepEP NVSHMEM all-to-all on NVLink and IB, fused expert GEMM, expert sharding, drop policies, and how the kernel layer interacts with our eight-specialist routing.
The concrete NVIDIA dispatch path once the expert design is no longer theoretical.
MoE
DeepEP
NVSHMEM
All-to-All
07
April 18, 2026 • 9 min read • MegaCpp Engineering
Kernel Catalog and Impact: Why the Runtime Needed a Real Map
A grounded tour of the kernel catalog across attention, sparse MLA, MoE, MTP, and dispatch/combine paths, with emphasis on why naming the kernel family and backend contract changed system-level decisions.
The cross-family kernel map, for when MoE, MLA, and dense paths need to be compared in a single catalog.
Kernels
H200
MoE
Attention

Keep exploring

Adjacent topic hubs

These hubs cover nearby parts of the blog without turning the archive into a giant taxonomy.

GB10 and Blackwell Bring-Up

Modal Training and Benchmark Operations

Mamba3 Architecture, Kernels, and Runtime Tradeoffs

MLA Integration, Dispatch, and Weight Absorption

Evaluation, Benchmarks, and Verifier Loops

Megatron Parallelism and Layout Boundaries

TPU Sparse Attention and Pallas Kernels

H200 Training and Kernel Bring-Up

TPU v6e and XLA Runtime Surfaces

C++ Data Pipelines and Corpus Packaging