ICML 2026 · Seoul, South Korea

DTop-p MoE

Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training
Can Jin*1   Hongwu Peng*2   Mingcan Xiang2,3   Qixin Zhang4   Xiangchi Yuan5   Amit Hasan2
Ohi Dibua2   Yifan Gong2   Yan Kang2†   Dimitris N. Metaxas1†
1Rutgers University · 2Adobe Research · 3UMass Amherst · 4Nanyang Technological University · 5Georgia Tech  |  *Equal Contribution · †Equal Advising
Presented by Can Jin  ·  Rutgers University  ·  can.jin@rutgers.edu
Background

Sparse Mixture-of-Experts (MoE)

Scaling foundation models with dense compute is prohibitively expensive. Sparse MoE activates only a small subset of experts per token, decoupling total parameter count from per-token compute cost.

Top-k routing
Select a fixed number of k experts per token. Predictable compute, but the same capacity for every token — ignores token difficulty and layer-specific needs.
Top-p routing
Select experts until cumulative probability exceeds threshold p (nucleus sampling). Adaptive per token — but how reliable is it?
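
To make the contrast concrete, here is a minimal sketch of the two selection rules on a single token's router probabilities. The function names and example logits are illustrative assumptions, not taken from the paper's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_experts(probs, k):
    """Indices of the k most probable experts (fixed count per token)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def top_p_experts(probs, p):
    """Smallest set of experts, taken in descending probability,
    whose cumulative probability exceeds the threshold p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative > p:
            break
    return chosen

confident = softmax([4.0, 1.0, 0.5, 0.0])  # one dominant expert
uncertain = softmax([1.0, 0.9, 0.8, 0.7])  # nearly flat distribution

# Top-k always returns k experts; Top-p returns more when the router is unsure.
print(len(top_k_experts(confident, 2)), len(top_k_experts(uncertain, 2)))
print(len(top_p_experts(confident, 0.6)), len(top_p_experts(uncertain, 0.6)))
```

With the same threshold, the peaked token needs one expert while the flat token needs three, which is exactly the per-token adaptivity (and the compute variability) discussed on the next slide.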

This talk: can Top-p routing be made both compute-controllable and more effective than Top-k?

DTop-p MoE · ICML 2026 · Presented by Can Jin · 2 / 16
Motivation

Fixed-Threshold Top-p is Unstable

Figure 2
❶ Uncontrolled compute
Activated-expert count fluctuates unpredictably during training — incompatible with strict pre-training FLOP budgets (OOM risk).
❷ Hypersensitive to p
p = 0.40 over-activates (>12 experts); p = 0.35 only matches Top-k. Gains are marginal and tuning is costly.
Key Idea

Make the Threshold a Control Setpoint

💡 Insight
Top-p is naturally adaptive, but the threshold gets no gradient (it only binarizes the expert mask). So treat target sparsity as a setpoint and make the threshold effectively learnable via feedback control.

Contributions:

  • Analysis: fixed-threshold Top-p gives only marginal gains over Top-k with uncontrolled cost.
  • DTop-p MoE: a PI controller + Dynamic Routing Normalization → adaptive routing under a strict global budget (plus a layer-wise variant).
  • Comprehensive study on NLP & CV — better performance and scaling at matched FLOPs.
Method

DTop-p MoE — Overview

Figure 1

A PI controller adjusts the global probability threshold to hit a target expert count; Dynamic Routing Normalization lets each layer choose its own sparsity under that budget.

Method ①

PI Controller — Learnable Sparsity

Track the average number of activated experts $a_t$ per batch; define the sparsity error $e_t = (T - a_t)/N$. A discrete Proportional-Integral law updates the threshold:

Equation 6
Proportional term
Reacts immediately to the current deviation from target T.
Integral term
Accumulates past errors → removes steady-state bias, so $a_t$ converges to T.

Relies on the monotonicity of nucleus sampling: raising p ⇒ more experts. No gradient needed for the threshold.
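
A minimal sketch of such a controller, assuming a textbook positional PI law. The exact form of Equation 6 is not reproduced here, so the gains `kp` and `ki`, the initial threshold `p_init`, the clipping to [0, 1], and the reading of N as the number of experts are all assumptions for illustration.

```python
class PIThresholdController:
    """Discrete PI controller for a Top-p threshold (illustrative sketch)."""

    def __init__(self, target, num_experts, p_init=0.5, kp=0.5, ki=0.05):
        self.target = target            # T: desired average activated experts
        self.num_experts = num_experts  # N: assumed to be the expert count
        self.p_init = p_init            # starting threshold (assumed value)
        self.p = p_init                 # current threshold p_t
        self.kp = kp                    # proportional gain (assumed value)
        self.ki = ki                    # integral gain (assumed value)
        self.integral = 0.0             # running sum of past errors

    def update(self, avg_active):
        """Take the batch-average activated-expert count a_t, return p_{t+1}."""
        error = (self.target - avg_active) / self.num_experts  # e_t
        self.integral += error
        p_next = self.p_init + self.kp * error + self.ki * self.integral
        self.p = min(max(p_next, 0.0), 1.0)  # keep the threshold in [0, 1]
        return self.p

ctrl = PIThresholdController(target=8, num_experts=64)
p_over = ctrl.update(avg_active=12.0)   # too many experts: threshold drops

ctrl = PIThresholdController(target=8, num_experts=64)
p_under = ctrl.update(avg_active=4.0)   # too few experts: threshold rises

print(p_over, p_under)
```

The sign convention follows the monotonicity argument: a positive error (fewer experts than T) pushes p up, a negative error pushes it down, and no gradient ever flows through the threshold.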

Method ②

Dynamic Routing Normalization

A single global threshold assumes uniform logit statistics across depth. Instead, normalize each layer's logits and apply a learnable per-layer scale $\theta_l$:

Equation 7
  • Large $\theta_l$ → sharper distribution → fewer experts; small $\theta_l$ → flatter distribution → more experts.
  • Each layer learns a distinct sparsity pattern while still respecting the single global budget.
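
The effect of the per-layer scale can be sketched as follows. This assumes the normalization is a z-score over the expert dimension followed by multiplication with the scale; Equation 7 may differ in detail, and `drn_probs`, `count_top_p`, and the example logits are hypothetical names and values.

```python
import math

def drn_probs(logits, theta, eps=1e-6):
    """Z-score the logits over experts, scale by theta, then softmax.
    The z-score form is an assumption made for this sketch."""
    mean = sum(logits) / len(logits)
    var = sum((x - mean) ** 2 for x in logits) / len(logits)
    std = math.sqrt(var + eps)
    scaled = [theta * (x - mean) / std for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def count_top_p(probs, p):
    """Number of experts needed for cumulative probability to exceed p."""
    cumulative = 0.0
    for n, q in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += q
        if cumulative > p:
            return n
    return len(probs)

logits = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5]

# Same logits, same global threshold, different per-layer scale.
sharp = count_top_p(drn_probs(logits, theta=3.0), p=0.7)
flat = count_top_p(drn_probs(logits, theta=0.5), p=0.7)
print(sharp, flat)
```

Under one shared threshold, the layer with the larger scale activates fewer experts than the layer with the smaller scale, which is how a single global budget can still yield layer-specific sparsity.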
Results · NLP

Training Dynamics (100B tokens)

Figure 3
DTop-p reaches the best train/val loss on MoE-1.3B-6.9B-64E8A and locks the activated-expert count to T = 8 within ~1B tokens — fixed Top-p overshoots and oscillates.
Results · NLP

Downstream Benchmarks (13 datasets)

Table 2
+1.9% average gain over Top-k MoE at matched average FLOPs
  • Best average across the 13 zero/few-shot tasks (50.9 vs 49.0 / 49.3).
  • Large gains on reasoning: SVAMP +5.7, COPA +3.0 over Top-k.
Results · Vision

Generalizes to Diffusion Transformers

Figure 4
On a 0.9B → 3.4B 64E8A MoE DiT trained on 2T pixel tokens, DTop-p again achieves the lowest validation loss while holding precise sparsity — the method is not NLP-specific.
Analysis

Precise & Adaptive Sparsity Control

Figure 6
Precise
Converges to T = 8 with low variance (σ ≈ 1); fixed Top-p drifts with σ ≈ 4.
Adaptive (hierarchical)
Learns to use fewer experts in shallow layers, more in deep layers — emergent depth-wise specialization.
Analysis

Ablation: Both Components Matter

Figure 7
PI controller — without it, sparsity is unregulated and drifts.
DRN — adaptively rescales layer logits; best loss only with both combined.
Analysis

Robust Scaling (0.4B → 2.4B)

Table 8

DTop-p wins at every model size — and the advantage over Top-k widens with scale. It also benefits more from finer expert granularity and more training data.

Recap

The Full Training Loop

Algorithm 1
  • Forward: per layer, normalize logits (DRN) → nucleus-sample experts at threshold $p_t$.
  • Feedback: measure $a_t$, compute error, update $p_{t+1}$ via PI.
  • Cost: only a lightweight scalar signal on top of standard MoE training.
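
The feedback half of this loop can be simulated in a few lines. This is a toy sketch, not Algorithm 1: it uses one fixed batch of Gaussian logits as a stand-in for a real router, omits DRN and all model training, and assumes a positional PI law with hand-picked gains `kp` and `ki`.

```python
import math
import random

def sorted_probs(logits):
    """Softmax probabilities sorted from most to least probable expert."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return sorted((e / total for e in exps), reverse=True)

def avg_active(batch, p):
    """Batch-average number of experts selected by Top-p at threshold p."""
    total = 0
    for probs in batch:
        cumulative, n = 0.0, 0
        for q in probs:
            n += 1
            cumulative += q
            if cumulative > p:
                break
        total += n
    return total / len(batch)

def run_loop(target=4, num_experts=16, tokens=256, steps=300,
             p_init=0.9, kp=0.1, ki=0.05, seed=0):
    rng = random.Random(seed)
    # Toy stand-in for router outputs: one fixed batch of Gaussian logits.
    batch = [sorted_probs([rng.gauss(0.0, 1.0) for _ in range(num_experts)])
             for _ in range(tokens)]
    p, integral, history = p_init, 0.0, []
    for _ in range(steps):
        a_t = avg_active(batch, p)                # forward: measure a_t
        history.append(a_t)
        error = (target - a_t) / num_experts      # e_t = (T - a_t) / N
        integral += error                         # accumulate past errors
        p = p_init + kp * error + ki * integral   # assumed positional PI law
        p = min(max(p, 0.0), 1.0)                 # keep threshold in [0, 1]
    return history, p

history, final_p = run_loop()
print(f"start: {history[0]:.2f} experts, end: {history[-1]:.2f} experts, "
      f"p = {final_p:.3f}")
```

Starting from a deliberately loose threshold, the measured expert count moves toward the target as the integral term accumulates. The only state carried between steps is the scalar threshold and the scalar error sum, which matches the "lightweight scalar signal" cost noted above.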
Conclusion

Takeaways

+1.9%: avg. over Top-k (NLP, matched FLOPs)
σ ≈ 1: stable activated-expert count
LLM + DiT: wins on both NLP & vision
  • DTop-p makes Top-p routing compute-controllable via PI control — no gradient on the threshold.
  • Dynamic Routing Normalization unlocks per-layer adaptive sparsity under one global budget.
  • Beats Top-k & fixed Top-p with robust scaling across granularity, model & data size.

Thank You!  Questions?

DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training
Presented by Can Jin · Rutgers University  |  can.jin@rutgers.edu
ICML 2026 · PMLR 306  |  Rutgers University & Adobe Research