Topic Hub

MoE, Routing, and Distributed Model Splits

A curated path through the expert stack: what the specialist path changed, how routing works, and how the parallelism map constrains the model layout.

This hub starts with the expert path at a system level, then narrows into routing and parallel layout. It is meant to answer how MoE actually changed the MegaCpp stack, not just what MoE means in the abstract.

moe
routing
expert-parallel
distributed-training
megatron
Curated set: 7 articles in reading order
Why this hub

Best if you are trying to connect expert routing decisions to real distributed-training and Megatron boundaries.

System View

Read these first to understand what changed when the specialist path became real.

  1. 01
    April 18, 2026 · 9 min read · MegaCpp Engineering

    Specialists: What the Expert Path Actually Changed in the Stack

    A grounded look at specialist or expert paths using the real routing flags, expert-parallel notes, and standalone MoE receipts from the codebase.

    The best top-level explanation of what the expert path changed in the stack and why it was worth the complexity.

    MoE
    Experts
    Specialist Models
    Routing
  2. 02
    April 18, 2026 · 10 min read · Boris Tamarkin

    The MoE Routing We Actually Shipped

    Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

    The routing article to read before looking at sharding or parallel layouts.

    MoE
    Token Choice
    Null Experts
    Routing
  3. 03
    April 18, 2026 · 10 min read · MegaCpp Engineering

    Expert Parallel and MoE Sharding: Capacity Is Cheap, Routing Is Not

    A grounded walkthrough of expert parallelism in the MegaCpp stack, based on the recipe files, layer definitions, schedule plans, and bug reports that shape how MoE runs actually behave.

    The capacity-versus-routing tradeoff at the point where the distributed system starts to matter.

    Expert Parallel
    MoE
    Distributed Training
    Sharding

Parallel Layout and Implementation Edges

These complete the picture by showing how MoE lands in the real execution stack.

  1. 04
    April 18, 2026 · 10 min read · Engineering Team

    EP, PP, TP, CP, SP, DP: The Parallelism Map We Actually Use

    What data, tensor, sequence, context, pipeline, and expert parallelism each own, how they compose, and where the real integration risks still live.

    The parallelism map used by the stack; read this before chasing individual split names in code or launchers.

    Distributed Training
    Expert Parallel
    Pipeline Parallel
    Tensor Parallel
  2. 05
    April 18, 2026 · 10 min read · Engineering Team

    What Megatron Can and Cannot Split

    A grounded look at split-friendly and split-hostile model surfaces: TP, SP, PP, EP, recurrent state, side embeddings, and why some boundaries remain architectural rather than automatic.

    A useful boundary document for what the Megatron lane can express without custom seams.

    Megatron
    Tensor Parallel
    Pipeline Parallel
    MoE
  3. 06
    April 18, 2026 · 12 min read · David Gornshtein

    Fused MoE and DeepEP on NVIDIA: the dispatch layer we ship

    How MegaCpp dispatches MoE tokens on H200 and GB10: DeepEP NVSHMEM all-to-all on NVLink and IB, fused expert GEMM, expert sharding, drop policies, and how the kernel layer interacts with our eight-specialist routing.

    The concrete NVIDIA dispatch path once the expert design is no longer theoretical.

    MoE
    DeepEP
    NVSHMEM
    All-to-All
  4. 07
    April 18, 2026 · 9 min read · MegaCpp Engineering

    Kernel Catalog and Impact: Why the Runtime Needed a Real Map

    A grounded tour of the kernel catalog across attention, sparse MLA, MoE, MTP, and dispatch/combine paths, with emphasis on why naming the kernel family and backend contract changed system-level decisions.

    The cross-family kernel map once MoE, MLA, and dense paths need to be compared in one grounded catalog.

    Kernels
    H200
    MoE
    Attention
