The MoE Routing We Actually Shipped
Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM ensemble.

Mixture-of-experts routing has a literature problem. A lot of recent papers describe routers that are elegant at batch size 1024 and broken at batch size 1, and a lot of production writeups bury which variant they actually serve. For the MegaCpp SLM ensemble, we had to pick one router, debug it under load, and keep it stable across training, eval, and autoregressive inference on a code model that sees long, boilerplate-heavy C++ tokens. This is the routing we shipped, the variants we ran against it, and the null-expert work that gave us real compute savings without breaking causality.
Token Choice vs Expert Choice, For Real
Every production MoE model we checked ships Token Choice. DeepSeek V3 and R1, Kimi K2, Qwen 3 and 3.5, Nemotron v3 Nano, MiniMax M1 and 2.5, Mixtral 8x7B and 8x22B, DBRX, Grok-1, Snowflake Arctic, OLMoE, Switch Transformer, GShard. The scoring function differs (sigmoid in newer models, softmax in older ones), the expert count ranges from 8 to 384, the top-k ranges from 1 to 8, the load-balance story differs across aux-loss, aux-free bias, and hierarchical group routing. What is common is the direction of selection: each token picks its top-k experts, not the other way around.
Expert Choice is the plausible alternative in training. Each expert picks its top-C tokens from the batch; load balance is perfect by construction; XLA is happy because C is static. It is also broken at autoregressive inference: with batch size 1 and sequence length 1, each expert's "top-C from the batch" is "top-C from a single token," which collapses to zero or one token per expert, and the train-inference mismatch is the whole ball game. OLMoE's ablation went in the same direction as the production tally: Token Choice wins on quality when you measure the serving distribution, not just training throughput.
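The directional difference is easy to see in a few lines of numpy. This is a sketch of selection shape only (no gating, no capacity dropping, invented toy sizes), not production code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, capacity = 8, 2, 4

def token_choice(scores, k):
    # each token (row) picks its own top-k experts (columns)
    return np.argsort(-scores, axis=1)[:, :k]

def expert_choice(scores, c):
    # each expert (column) picks its top-c tokens (rows) from the batch
    return np.argsort(-scores, axis=0)[:c, :]

# training-style batch: 16 tokens in flight, both directions look sane
batch = rng.standard_normal((16, n_experts))
assert token_choice(batch, top_k).shape == (16, top_k)
assert expert_choice(batch, capacity).shape == (capacity, n_experts)

# autoregressive decode: one token in the "batch". Token Choice still
# picks k experts for it; Expert Choice collapses, because every
# expert's "top-c from the batch" is just that single token.
decode = rng.standard_normal((1, n_experts))
assert token_choice(decode, top_k).shape == (1, top_k)
collapsed = expert_choice(decode, capacity)
assert collapsed.shape == (1, n_experts) and np.all(collapsed == 0)
```

The last two assertions are the train-inference mismatch in miniature: the Expert Choice selection rule is defined over the batch axis, and at decode time that axis has length one.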
We started in Expert Choice on an earlier NAM-24 3B research run because it was convenient. It is no longer convenient. The ensemble ships Token Choice.
Our Baseline Router
The shipped configuration for our 4B-8B specialists, per expert layer, sits close to the DeepSeek V3 / Kimi K2 family shape:
- 64 routed experts, 1 shared expert.
- Top-k = 6, selected per token.
- Sigmoid scoring (independent per-expert logits, no zero-sum softmax competition).
- Aux-free bias correction in the spirit of DeepSeek V3's noaux_tc: a per-expert bias is added to the router's selection score and dynamically pushed toward uniform load, but the bias does not enter the gating weights applied to expert outputs.
- One always-on shared SwiGLU expert, outside the routing pool, so every token gets a baseline compute path regardless of where the router sends it.
- Routed scaling factor ~2.5, to keep the dynamic range of routed outputs comparable to the shared expert after top-k summation.
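The selection path in that list can be sketched in numpy: sigmoid scores, bias added for selection only, gate weights taken from the raw affinities. Function name and the normalize-then-scale gate treatment are illustrative assumptions, not our shipped code:

```python
import numpy as np

def route(logits, bias, top_k=6, routed_scale=2.5):
    """Sigmoid Token Choice with aux-free selection bias (sketch).

    logits: [tokens, n_experts] raw router logits
    bias:   [n_experts] load-balance bias, selection only
    """
    affinity = 1.0 / (1.0 + np.exp(-logits))     # independent per-expert scores
    selection = affinity + bias                  # bias steers *which* experts win...
    topk_idx = np.argsort(-selection, axis=1)[:, :top_k]
    # ...but the gate weights see only the raw affinities, bias excluded
    gates = np.take_along_axis(affinity, topk_idx, axis=1)
    gates = gates / gates.sum(axis=1, keepdims=True)  # normalize over selected
    return topk_idx, gates * routed_scale

rng = np.random.default_rng(1)
idx, gates = route(rng.standard_normal((4, 64)), np.zeros(64))
assert idx.shape == (4, 6)
assert np.allclose(gates.sum(axis=1), 2.5)   # routed mass scaled vs shared expert
```

The split between `selection` and `gates` is the whole point: load-balance pressure can move which experts are chosen without distorting how their outputs are combined.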
Three routing modes exist in the code (soft, expert_choice,
token_choice). Only token_choice is live in the shipped presets.
soft is the default softmax Token Choice path we keep for comparability
and for regression tests. expert_choice stays as a research-only option
for training-only ablations; our serving configs reject it explicitly.
Why Sigmoid, Not Softmax
Softmax scoring is the older default and forces experts to compete for
a normalized probability mass per token. That sounds like a feature and
is usually a subtle bug for code: two experts that are both genuinely
well-suited to the same #include line end up dragging each other's
scores down, which makes the router more brittle to small routing noise
and more aggressive about concentrating tokens on a single winner.
Sigmoid scoring makes each expert's selection independent. Combined with
an aux-free bias that keeps aggregate load uniform, the signal the router
actually learns is "is this expert a good fit for this token," not "is
this expert the single best fit relative to 63 others."
The measurable consequence for a code model is smoother gating:
top-k membership changes less between consecutive tokens in the same
function, which is exactly what we want, because consecutive tokens in
the same function typically belong to the same structural context and
should route to overlapping expert sets.
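The competition effect shows up with plain numbers. Toy setup (invented logits, not measured ones): two experts equally well-suited to the same #include line, 62 weak experts:

```python
import numpy as np

# two equally good experts (logit 3.0) plus 62 weak ones (logit -2.0)
logits = np.full(64, -2.0)
logits[:2] = 3.0

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1.0 / (1.0 + np.exp(-logits))

# under softmax the two good experts split the probability mass and
# drag each other down; under sigmoid each keeps its full fit score
assert softmax[0] < 0.5                   # zero-sum competition
assert sigmoid[0] > 0.95                  # independent "good fit" signal
assert np.isclose(sigmoid[0], sigmoid[1]) # both experts stay equally strong
```

With softmax, a small perturbation to either good expert's logit reshuffles the shared mass; with sigmoid, each score moves independently, which is the smoother-gating behavior described above.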
Null Experts: How We Actually Used Them
The most useful routing paper we read during this design was the Meta
null-experts work (Token Choice only, by construction). The idea is
simple: expand the router's output from N logits to N+1, duplicate that
single "null" logit M times to reach a candidate pool of N+M slots, pick
top-k_max from that pool, and skip compute for any token-expert slot
that landed on a null. The parameter that matters is rho, the fraction
of real experts in the candidate pool; rho = 0.5 with M = N was the
paper's sweet spot and is where we started.
The key bookkeeping: if you want the model's expected number of real
experts per token to stay at your existing top-k, you size k_max to
ceil(top_k / rho). With our shipped top_k = 6 and rho = 0.5,
that means k_max = 12: each token selects 12 slots from the
N+M = 128 candidate pool, and roughly 6 of those are real experts.
When rho = 1.0, k_max = top_k and the behavior is backward-compatible
with a non-null Token Choice router. Our flag semantics match that:
--moe_top_k is the desired expected count of real experts per token,
and k_max is computed internally from rho.
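The pool construction and the k_max bookkeeping reduce to a few lines (numpy sketch under the rho = N/(N+M) convention; the function name is ours for illustration):

```python
import math
import numpy as np

def null_pool_topk(logits, top_k=6, rho=0.5):
    """Null-experts candidate pool and top-k_max selection (sketch).

    logits: [tokens, n_experts + 1], last column is the single null logit.
    """
    n = logits.shape[1] - 1
    m = round(n * (1.0 - rho) / rho)   # null duplicates so real fraction = rho
    k_max = math.ceil(top_k / rho)     # expected real experts per token stays top_k
    null = np.repeat(logits[:, -1:], m, axis=1)
    pool = np.concatenate([logits[:, :n], null], axis=1)  # N + M candidate slots
    slots = np.argsort(-pool, axis=1)[:, :k_max]
    is_real = slots < n                # null slots carry indices >= n
    return slots, is_real, k_max

rng = np.random.default_rng(2)
slots, is_real, k_max = null_pool_topk(rng.standard_normal((3, 65)))
assert k_max == 12 and slots.shape == (3, 12)   # N=64, rho=0.5 -> 128-slot pool
```

At rho = 1.0 the helper degenerates to m = 0 and k_max = top_k, which is the backward-compatible flag behavior described above.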
We rolled null experts out in three phases, matching the paper's implementation priority but on our schedule:
- Soft routing with masking. Expand the router to N+1, duplicate the null logit, mask null selections after top-k, renormalize the gate weights over surviving real experts. All experts still run; there is no compute saving at this phase. The point is to validate the training signal and confirm that the router can learn to choose null on tokens where extra compute is genuinely wasted.
- Gather/scatter Token Choice with variable per-expert token counts. This is where the compute saving actually lands. On GPU we used the grouped path; on TPU we stay with padded static shapes inside the scan to keep XLA happy.
- Ablation over rho in {0.5, 0.67, 0.75} on our C++ mix. The question was whether code tokens are heterogeneous enough for null routing to help. Hypothesis: yes, because roughly 40% of C++ tokens are low-entropy boilerplate - includes, closing braces, semicolons, namespace lines, forward declarations. Those are exactly the tokens where we want the router to pick all-null and let the shared SwiGLU expert carry the generation.
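The phase-one mask-and-renormalize step can be sketched directly; an all-null token keeps zero routed output and falls through to the shared expert. This is an illustration of the mechanics, not the shipped kernel:

```python
import numpy as np

def mask_and_renormalize(gates, is_real):
    """Phase-1 soft routing: zero out null slots, renormalize survivors.

    gates:   [tokens, k_max] gate weights over the selected slots
    is_real: [tokens, k_max] bool, False where a slot landed on a null
    """
    kept = np.where(is_real, gates, 0.0)
    denom = kept.sum(axis=1, keepdims=True)
    # all-null tokens get zero routed output; the shared expert carries them
    return np.divide(kept, denom, out=np.zeros_like(kept), where=denom > 0)

gates = np.array([[0.4, 0.3, 0.2, 0.1],
                  [0.4, 0.3, 0.2, 0.1]])
is_real = np.array([[True, False, True, False],
                    [False, False, False, False]])
out = mask_and_renormalize(gates, is_real)
assert np.isclose(out[0].sum(), 1.0)   # survivors renormalized to full weight
assert out[1].sum() == 0.0             # all-null token: shared expert only
```

All experts still run in this phase; the only thing being validated is that the gate arithmetic stays correct when null slots win.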
Concrete savings on our 24 expert-layer specialist, with N=64, rho=0.5,
E[K]=6: roughly 50% reduction in routed expert FLOPs per token, shared
expert unchanged, net MoE compute savings around 35-40%. We saw no
quality degradation at rho=0.5: the null-expert run stayed within noise
of the non-null baseline on our internal C++ eval suite.
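The arithmetic connecting the two figures, as a back-of-envelope script. The 20-30% shared-expert share of baseline MoE FLOPs is an assumed range used to reconcile the numbers, not a measurement from this post:

```python
# Routed FLOPs drop by ~50% (rho = 0.5); the shared expert is unchanged,
# so the net MoE saving depends on the shared expert's share of baseline
# MoE FLOPs. The 0.2-0.3 share below is an illustrative assumption.
routed_reduction = 0.5

for shared_share in (0.2, 0.3):
    net = (1.0 - shared_share) * routed_reduction
    print(f"shared expert at {shared_share:.0%} of MoE FLOPs -> net saving {net:.0%}")
# -> 40% and 35%: a 20-30% shared share lands in the quoted 35-40% band
```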
The Shared Expert Is the Safety Net
This is worth stating on its own. In the null-experts regime, a token that routes entirely to null still gets the shared expert's output and the residual stream. That changes what the router is actually learning. It is no longer "which six experts should process this token"; it is "are any of the routed experts useful here, or should the shared expert carry it alone?" For boilerplate C++, the second question has a clear answer, and the null-expert path lets the router express it.
Two corollaries. First, the shared expert must stay a first-class component and never be treated as an optimization to remove. Second, we do not combine null experts with ablations that disable the shared expert; that combination collapses the safety net and makes the null signal look worse than it is.
Null-Experts Debugging: The Non-Obvious Failures
Three failure modes were not in the paper but showed up in our implementation:
- Gate renormalization after masking. If you mask null slots after top-k and forget to renormalize the gate weights over the surviving real experts, a token that lands heavily on null slots has its effective contribution scaled down proportionally, which looks like random quality loss that correlates with rho. The fix is a per-token renormalization over surviving slots; the observable signature is that pre-fix eval gets worse as rho decreases, which is the wrong direction.
- Aux-loss interaction. A standard load-balance aux loss (alpha * N * sum(f_i * P_i)) computed over the expanded N+M pool rewards the router for sending everything to null, because null is perfectly balanced by construction. The aux loss must be computed over real experts only, and it must see the renormalized real-expert gate distribution, not the pre-mask N+M distribution.
- Router z-loss stability. We kept the OLMoE-style z-loss at ~1e-3. Without it, sigmoid scoring plus sparse null routing occasionally drove router logits to saturating magnitudes on tokens that landed mostly-null, which produced the gradient equivalent of a dead-expert spiral.
The practical guardrails we keep in training telemetry: log real-expert
usage entropy per layer, log null-fraction per layer per step, and log
per-expert bias trajectories. A healthy run shows entropy dropping
smoothly into a plateau, null-fraction settling near 1 - rho, and
biases oscillating in a narrow band. A sick run shows null-fraction
drifting toward 1 or collapsing toward 0 within a few thousand steps;
either direction is a bug, not a feature.
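The first two guardrails reduce to a few lines of metric computation per layer per step (sketch of the metrics only; logging plumbing and the per-expert bias trajectories are omitted):

```python
import numpy as np

def routing_telemetry(is_real, real_expert_ids, n_real):
    """Guardrail metrics for one layer, one step (sketch).

    is_real:         [tokens, k_max] bool, False where a slot chose null
    real_expert_ids: flat array of real-expert indices chosen this step
    """
    null_fraction = 1.0 - is_real.mean()       # healthy: settles near 1 - rho
    counts = np.bincount(real_expert_ids, minlength=n_real).astype(float)
    usage = counts / max(counts.sum(), 1.0)
    nz = usage[usage > 0]
    entropy = float(-np.sum(nz * np.log(nz)))  # healthy: smooth drop to a plateau
    return null_fraction, entropy

rng = np.random.default_rng(4)
is_real = rng.random((32, 12)) < 0.5
nf, ent = routing_telemetry(is_real, rng.integers(0, 64, 200), 64)
assert 0.0 <= nf <= 1.0
assert ent <= np.log(64) + 1e-9   # usage entropy is bounded by log(N)
```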
Gating Stability Without Aux-Loss Tuning
DeepSeek V3's aux-free bias correction (noaux_tc) is the production
mechanism we copied, with a small modification. The bias adjusts
selection scores only; gating weights used to combine expert outputs
see only the raw router logits after top-k. That separation matters for
two reasons. First, it decouples load-balance pressure from the learned
gate magnitudes, which keeps the router from drifting into a regime
where aggressive balance correction distorts gate weights and therefore
expert gradients. Second, it makes the bias safe to update with a much
simpler controller than a true auxiliary loss: our update is a clipped
EMA toward uniform load per layer per step, which is boring, stable, and
not entangled with the main optimizer's learning rate schedule.
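The controller shape, as a sketch. The constants (ema, lr, clip) are placeholders rather than shipped values; the structure is the point: smooth the load estimate, move the bias a bounded step toward uniform, never touch the gates:

```python
import numpy as np

def update_bias(bias, tokens_per_expert, load_ema, ema=0.9, lr=0.1, clip=0.01):
    """Clipped-EMA push of the selection bias toward uniform load (sketch)."""
    load = tokens_per_expert / max(tokens_per_expert.sum(), 1.0)
    load_ema = ema * load_ema + (1.0 - ema) * load          # smooth noisy counts
    target = 1.0 / len(bias)
    delta = np.clip(lr * (target - load_ema), -clip, clip)  # bounded per-step move
    return bias + delta, load_ema                           # bias is selection-only

bias, load_ema = np.zeros(4), np.full(4, 0.25)
counts = np.array([70.0, 10.0, 10.0, 10.0])   # expert 0 chronically hot
for _ in range(200):
    bias, load_ema = update_bias(bias, counts, load_ema)
assert bias[0] < 0 < bias[1]   # hot expert pushed down, starved ones pushed up
```

Because the update never sees a gradient, it is independent of the main optimizer's learning rate schedule, which is exactly the decoupling the prose above argues for.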
We kept Qwen 3's dual balance story (global expert balance plus local token balance) in reserve, behind a flag, and never needed it. The per-expert bias correction plus z-loss plus the null-experts safety net were enough.
Comparisons We Ran
The routers we compared on small-scale training runs before picking the shipped configuration:
- Softmax Token Choice with standard aux loss, top-k = 6, 64 experts. Stable but more concentration on a few experts over long training.
- Sigmoid Token Choice with aux-free bias, top-k = 6, 64 experts. Flatter expert usage, smaller gating variance between adjacent tokens. This is the baseline we shipped before null experts.
- Sigmoid Token Choice with aux-free bias, null experts at rho = 0.5, top-k = 6 expected real, k_max = 12. Same quality as the previous row within noise, ~35-40% less net MoE compute. Shipped.
- Expert Choice with static C per expert. Competitive training throughput, unacceptable at inference. Research-only, not shipped.
- Hierarchical group routing in the Nemotron style (top-1 over 8 groups of 16, then top-6 within group). Not worse, not meaningfully better on our C++ mix at this scale; the extra routing stage was not worth the code complexity for our specialists.
What We Did Not Ship
A few routing ideas looked interesting and did not make the cut.
Structure-aware routing - using offline AST or call-graph metadata to bias the router toward experts specialized on specific structural contexts - is on the roadmap but outside this post. It belongs with the structure-aware attention work and the enriched-parquet training contract.
Large top-k values (>= 8) at fine-grained expert counts (128-384).
DeepSeek V3 and Kimi K2 ship there; our specialists are smaller and the
marginal quality at top-k = 8 did not justify the extra all-to-all
communication for our serving layout.
Expert Choice on serving. The train-inference mismatch is not a hyperparameter; it is a semantics change. There is no serving-time rescue for Expert Choice on a single-token decode step, and nothing we tried on the inference side made it worth the training-time perfect load balance.
The Short Version
Token Choice, sigmoid scoring, aux-free bias correction, 64 routed
experts plus 1 shared, top-k = 6, null experts at rho = 0.5 for
k_max = 12, z-loss at 1e-3, per-token gate renormalization after null
masking, aux loss computed over real experts only, shared expert always
on, Expert Choice explicitly rejected at serving. The routing that
survives production pressure is the routing that stays causal, stays
cheap on boilerplate, and stays honest about what it is doing on the
token that is actually being decoded.
References (filenames only): MOE_ROUTING_COMPARISON.md,
null_experts_analysis.md.