ICML 2026, Seoul, Korea

DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training

Can Jin*1  Hongwu Peng*2  Mingcan Xiang23  Qixin Zhang4  Xiangchi Yuan5  Amit Hasan2  Ohi Dibua2  Yifan Gong2  Yan Kang2†  Dimitris N. Metaxas1†
1Rutgers University  ·  2Adobe Research  ·  3UMass Amherst  ·  4Nanyang Technological University  ·  5Georgia Institute of Technology  |  *Equal Contribution  ·  †Equal Advising
PMLR 306, 2026
can.jin@rutgers.edu
Work done during internship at Adobe Research
Motivation

Sparse Mixture-of-Experts (MoE) scales model capacity by activating only a few experts per token, decoupling parameters from compute cost. Two routing schemes dominate pre-training:

  • Top-k — fixed number of experts per token; ignores token difficulty & layer needs.
  • Fixed Top-p — selects experts until cumulative probability exceeds threshold p.
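The two schemes above can be contrasted with a minimal NumPy sketch. The function names, the toy probabilities, and the tie convention (the expert that crosses the threshold is kept) are assumptions made for illustration, not the poster's implementation.

```python
import numpy as np

def top_k_mask(probs, k):
    """Select the k highest-probability experts for every token."""
    idx = np.argsort(-probs, axis=-1)[:, :k]
    mask = np.zeros_like(probs, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def top_p_mask(probs, p):
    """Select experts in descending order until cumulative probability reaches p."""
    order = np.argsort(-probs, axis=-1)
    sorted_probs = np.take_along_axis(probs, order, axis=-1)
    cum = np.cumsum(sorted_probs, axis=-1)
    # Keep an expert while the mass accumulated before it is still below p,
    # so the expert that crosses the threshold is included (assumed convention).
    keep_sorted = (cum - sorted_probs) < p
    mask = np.zeros_like(probs, dtype=bool)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask

# Hypothetical routing probabilities: one confident token, one uncertain token.
probs = np.array([
    [0.85, 0.05, 0.05, 0.05],
    [0.25, 0.25, 0.25, 0.25],
])
k_counts = top_k_mask(probs, k=2).sum(axis=-1)
p_counts = top_p_mask(probs, p=0.6).sum(axis=-1)
print(k_counts, p_counts)  # [2 2] [1 3]
```

Top-k spends the same two experts on both tokens, while Top-p spends one on the confident token and three on the uncertain one, which is why its per-batch cost depends on the data.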
Problem: Fixed Top-p is Unstable
Figure 2

  • ❶ Uncontrolled compute: activated-expert count fluctuates unpredictably — incompatible with strict pre-training budgets.
  • ❷ Hypersensitive to p: p=0.40 over-activates (>12 experts); p=0.35 only matches Top-k — gains are marginal.
Key Insight

Top-p is inherently adaptive — confident tokens use fewer experts, uncertain tokens recruit more — but the threshold receives no gradient (it only binarizes the mask), so it cannot be learned by SGD.

💡Idea: treat target sparsity as a control setpoint and make the threshold effectively learnable, so average activated experts converge to a budget T.
Contributions
  • Analysis showing fixed-threshold Top-p gives only marginal gains over Top-k with uncontrolled cost.
  • DTop-p MoE: a PI-controlled, sparsity-controllable dynamic Top-p with Dynamic Routing Normalization (+ a layer-wise variant).
  • Comprehensive study on NLP & CV: better performance and scaling at matched FLOPs.
Method — DTop-p MoE
Figure 1: overview
① PI Controller — Learnable Sparsity

A Proportional-Integral controller adjusts the global threshold p between batches from the sparsity error e_t = (T − a_t)/N:

Equation 6

The proportional term reacts to deviation; the integral term removes steady-state bias, driving the activated-expert count a_t to the target T.
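As a rough illustration of the control loop, here is a textbook positional PI update written in Python. It is not a transcription of Equation 6: the gains `kp` and `ki`, the initial threshold, the clamping to [0, 1], the reading of N as the total number of experts, and the linear `toy_plant` standing in for a real router are all assumptions.

```python
class PIThresholdController:
    """Textbook PI controller for a Top-p threshold (illustrative sketch only)."""

    def __init__(self, target, num_experts, p_init=0.5, kp=0.2, ki=0.3):
        self.target = target            # T: desired average activated experts
        self.num_experts = num_experts  # N: assumed to be the total expert count
        self.p_init = p_init
        self.kp = kp                    # proportional gain (hypothetical value)
        self.ki = ki                    # integral gain (hypothetical value)
        self.integral = 0.0
        self.p = p_init

    def update(self, avg_active):
        """Consume the measured a_t of the last batch and return the next threshold."""
        err = (self.target - avg_active) / self.num_experts  # e_t = (T - a_t) / N
        self.integral += err
        p = self.p_init + self.kp * err + self.ki * self.integral
        self.p = min(max(p, 0.0), 1.0)  # keep the threshold a valid probability
        return self.p


def toy_plant(p, num_experts):
    """Hypothetical stand-in for a router: activated experts grow linearly with p."""
    return 1.0 + (num_experts - 1) * p


ctrl = PIThresholdController(target=8, num_experts=64)
history = []
for _ in range(200):
    a_t = toy_plant(ctrl.p, ctrl.num_experts)  # measure sparsity on this batch
    history.append(a_t)
    ctrl.update(a_t)                           # adjust p before the next batch

print(round(history[0], 2), round(history[-1], 2))
```

In this toy loop the activated-expert count starts far above the budget and settles at T, because the integral term keeps accumulating error until the steady-state offset vanishes. No gradient with respect to p is involved at any point.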

② Dynamic Routing Normalization
Equation 7

A learnable per-layer scale θ_l rescales normalized logits, letting each layer sharpen/flatten its routing, enabling distinct sparsity per depth under one global threshold.
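The effect of the per-layer scale can be sketched as follows. Equation 7 is not reproduced here, so the normalization is assumed to be a per-token standardization of the router logits across experts; the function names, the random logits, and the two θ values are hypothetical.

```python
import numpy as np

def drn_probs(logits, theta):
    """Standardize logits across experts (assumed), rescale by theta, then softmax."""
    mu = logits.mean(axis=-1, keepdims=True)
    sigma = logits.std(axis=-1, keepdims=True) + 1e-6
    scaled = theta * (logits - mu) / sigma
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scaled)
    return e / e.sum(axis=-1, keepdims=True)

def avg_active_experts(probs, p):
    """Average number of experts Top-p selects per token at threshold p."""
    sorted_probs = -np.sort(-probs, axis=-1)  # descending order
    cum = np.cumsum(sorted_probs, axis=-1)
    keep = (cum - sorted_probs) < p           # include the threshold-crossing expert
    return keep.sum(axis=-1).mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(256, 64))  # 256 tokens, 64 experts (toy data)
p = 0.5                              # one global threshold shared by both "layers"

flat = avg_active_experts(drn_probs(logits, theta=0.5), p)   # flatter routing
sharp = avg_active_experts(drn_probs(logits, theta=4.0), p)  # sharper routing
print(flat, sharp)
```

With the same threshold, the layer with the larger scale concentrates probability mass on fewer experts and therefore activates fewer of them, which is how a single global p can still yield different sparsity at different depths.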

Training Procedure
Algorithm 1
NLP — Training Dynamics (100B tokens)
Figure 3

MoE-1.3B-6.9B-64E8A (1.3B active / 6.9B total). DTop-p reaches the best loss and locks sparsity to T=8, whereas fixed Top-p overshoots the target.

NLP — Inference Benchmarks (13 datasets)
Table 2
CV — DiT Pre-training (2T pixel tokens)
Figure 4
Precise & Adaptive Sparsity Control
Figure 6

DTop-p converges to T=8 with low variance (σ≈1); fixed Top-p drifts with σ≈4. DTop-p activates fewer experts in shallow layers and more in deep layers.

Ablation — PI Controller & DRN
Figure 7

Both components are needed: PI enforces the budget; DRN adaptively rescales layer logits. Together they give the best loss and stable threshold.

Scaling — Model Size (0.4B → 2.4B)
Table 8
Conclusions
  • +1.9% avg. over Top-k (NLP, matched FLOPs)
  • σ≈1: stable activated-expert count
  • 2 modalities: LLM & DiT
  • DTop-p reconciles token-adaptive routing with strict compute control via PI control — no gradient needed for the threshold.
  • Beats Top-k & fixed Top-p on LLMs and DiTs; robust scaling across granularity, model & dataset size.