ICML 2026 · Seoul, South Korea · PMLR 306

DTop-p MoE

Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training
Can Jin*1  Hongwu Peng*2  Mingcan Xiang2,3  Qixin Zhang4  Xiangchi Yuan5  Amit Hasan2  Ohi Dibua2  Yifan Gong2  Yan Kang2,†  Dimitris N. Metaxas1,†
1Rutgers University · 2Adobe Research · 3UMass Amherst · 4Nanyang Technological University · 5Georgia Institute of Technology
*Equal Contribution  ·  †Equal Advising  ·  Work done during an internship at Adobe Research  ·  Presented by Can Jin
DTop-p MoE overview

A Proportional-Integral (PI) controller adjusts the global Top-p threshold so that the activated-expert count hits a target, while Dynamic Routing Normalization lets each layer choose its own sparsity under that single global budget.

Abstract

Sparse Mixture-of-Experts architectures are essential for scaling model capacity efficiently, yet the standard Top-k routing imposes a rigid sparsity pattern that ignores the intrinsic variance in token difficulty and layer-specific computational needs. Top-p routing is more adaptive because it selects experts until their cumulative routing probability reaches a threshold, allowing confident tokens to use fewer experts and ambiguous tokens to recruit more. However, we demonstrate that existing naive Top-p implementations with fixed global probability thresholds provide only marginal gains over Top-k, suffer from hyper-parameter sensitivity, and result in uncontrolled computational costs. In this paper, we propose DTop-p, a sparsity-controllable dynamic routing mechanism that learns the Top-p probability threshold with a Proportional-Integral controller and uses dynamic routing normalization to support layer-wise expert selection under a global sparsity constraint. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that DTop-p consistently outperforms both Top-k and fixed Top-p baselines while matching the average FLOPs of Top-k MoE. Our analysis confirms that DTop-p exhibits strong scaling properties across expert granularity, total expert capacity, model size, and dataset size, offering a robust and efficient MoE framework for foundation model pre-training.

TL;DR

Make Top-p routing compute-controllable and more effective than Top-k — with a control-theory twist.

+1.9%
over Top-k MoE
Higher average across 13 NLP benchmarks at matched average FLOPs.
Precise sparsity
A PI controller drives the activated-expert count to a target T with low variance (σ ≈ 1).
Adaptive by depth
Dynamic Routing Normalization learns fewer experts in shallow layers, more in deep layers.
LLM + DiT
Consistent wins on both language and diffusion pre-training, with robust scaling.

Method

The number of selected experts is a step function of the Top-p threshold, so the threshold receives no useful gradient and cannot be learned by SGD. DTop-p instead treats the target sparsity as a control setpoint.
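
To make the routing rule concrete, here is a minimal sketch of plain Top-p expert selection for a single token. It is not the authors' implementation: the helper names `softmax` and `top_p_select`, the example logits, and the threshold value 0.7 are all illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_select(probs, p):
    """Pick experts in descending probability order until their
    cumulative probability reaches the threshold p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cum = [], 0.0
    for i in order:
        chosen.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return chosen

# Hypothetical router logits: one peaked (confident) token, one flat (ambiguous) token.
confident = softmax([4.0, 1.0, 0.5, 0.0])
ambiguous = softmax([1.0, 0.9, 0.8, 0.7])

n_confident = len(top_p_select(confident, 0.7))
n_ambiguous = len(top_p_select(ambiguous, 0.7))

# The expert count only changes in integer jumps as p is swept,
# which is why p cannot be trained by backpropagation directly.
counts = [len(top_p_select(ambiguous, q / 10)) for q in range(1, 10)]
print(n_confident, n_ambiguous, counts)
```

With the same threshold, the peaked distribution needs fewer experts than the flat one, and the swept counts form a staircase rather than a smooth curve.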

① PI Controller — learnable threshold
Equation 6

From the sparsity error e_t = (T − a_t)/N, the proportional term reacts to deviation and the integral term removes steady-state bias, so the average number of activated experts converges to the budget T. No gradient flows to the threshold.
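
The control loop described above can be sketched as a textbook discrete PI update. This is an illustration under stated assumptions, not the paper's Equation 6: the class name `PIThresholdController`, the gains `kp` and `ki`, the clipping bounds, and especially the toy relation `a = N * p**2` standing in for a real router are all hypothetical. The sketch also assumes a_t is the measured average activated-expert count and N is the number of experts per layer.

```python
class PIThresholdController:
    """Textbook PI update for a Top-p threshold (illustrative only)."""

    def __init__(self, target, num_experts, p_init=0.5, kp=0.5, ki=0.2,
                 p_min=0.01, p_max=0.99):
        self.target = target            # T: desired average activated experts
        self.num_experts = num_experts  # N: assumed to be experts per layer
        self.p_init = p_init
        self.kp, self.ki = kp, ki       # assumed gains, not from the paper
        self.p_min, self.p_max = p_min, p_max
        self.integral = 0.0
        self.p = p_init

    def update(self, avg_active):
        # Positive error means too few experts are active, so p should rise.
        error = (self.target - avg_active) / self.num_experts
        self.integral += error
        p = self.p_init + self.kp * error + self.ki * self.integral
        self.p = min(self.p_max, max(self.p_min, p))
        return self.p

def toy_router(p, num_experts):
    """Hypothetical stand-in for a real router: any increasing map from
    threshold to average activated experts would do for this demo."""
    return num_experts * p ** 2

def simulate(steps=300, target=8, num_experts=64):
    ctrl = PIThresholdController(target, num_experts)
    avg_active = toy_router(ctrl.p, num_experts)
    for _ in range(steps):
        p = ctrl.update(avg_active)
        avg_active = toy_router(p, num_experts)
    return avg_active, ctrl.p

final_active, final_p = simulate()
print(round(final_active, 3), round(final_p, 3))
```

In this toy setting the proportional term alone would settle at a biased threshold, while the accumulated integral keeps pushing until the measured count matches T. In real training the measurement would come from the router statistics of each step rather than from a closed-form function.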

② Dynamic Routing Normalization
Equation 7

A learnable per-layer scale θ_l rescales normalized logits: a larger θ_l sharpens the routing distribution (fewer experts), a smaller θ_l flattens it (more experts). Each layer learns a distinct sparsity pattern under one global threshold.
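
The effect of the per-layer scale can be illustrated with a small sketch. It assumes, purely for illustration, that "normalized logits" means zero-mean, unit-variance logits across experts; the paper's Equation 7 may normalize differently. The function names, the example logits, and the values of θ_l and the global threshold are hypothetical.

```python
import math

def normalize(logits):
    """Assumed normalization: zero mean, unit variance across experts."""
    mean = sum(logits) / len(logits)
    var = sum((x - mean) ** 2 for x in logits) / len(logits)
    std = math.sqrt(var) + 1e-6  # small constant avoids division by zero
    return [(x - mean) / std for x in logits]

def routing_probs(logits, theta):
    """Softmax over normalized logits rescaled by the layer's theta."""
    scaled = [theta * z for z in normalize(logits)]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def num_active(probs, p):
    """Number of experts Top-p selects at threshold p."""
    cum = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cum += prob
        if cum >= p:
            return count
    return len(probs)

# Same token logits and same global threshold, two hypothetical layers.
logits = [2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5, -2.0]
p_global = 0.6

sharp_layer = num_active(routing_probs(logits, theta=4.0), p_global)
flat_layer = num_active(routing_probs(logits, theta=0.5), p_global)
print(sharp_layer, flat_layer)
```

Because the threshold is shared, the scale is the only knob that differs between the two layers here, and the layer with the larger scale activates fewer experts for the same token.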

Results

NLP — MoE-1.3B-6.9B-64E8A trained on 100B DCLM tokens.

Figure 3 — NLP training curves
Table 2 — NLP benchmarks
50.9
best average over 13 datasets
vs. 49.0 (Top-k) and 49.3 (fixed Top-p): a gain of 1.9 percentage points over Top-k at matched average FLOPs.

DTop-p reaches the best train/val loss and locks the activated-expert count to T = 8 within ~1B tokens, while fixed Top-p overshoots and oscillates.

Computer Vision — a 64E8A MoE Diffusion Transformer (0.9B active / 3.4B total) trained on 2T pixel tokens.

Figure 4 — CV training curves

Analysis

Precise control, adaptive allocation, and the contribution of each component.

Figure 6 — activation statistics

DTop-p converges to T = 8 with low variance (σ ≈ 1); fixed Top-p drifts with σ ≈ 4.

Figure 7 — ablation
Table 8 — model-size scaling

Both the PI controller and DRN are needed for the best result (left); the advantage over Top-k widens with model scale from 0.4B to 2.4B (right).

Resources

Everything for the ICML 2026 presentation.

BibTeX

@inproceedings{jin2026dtopp,
  title     = {{DTop-$p$ MoE}: Sparsity-Controlled Dynamic Top-$p$ MoE
               for Foundation Model Pre-training},
  author    = {Jin, Can and Peng, Hongwu and Xiang, Mingcan and Zhang, Qixin
               and Yuan, Xiangchi and Hasan, Amit and Dibua, Ohi and Gong, Yifan
               and Kang, Yan and Metaxas, Dimitris N.},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  series    = {Proceedings of Machine Learning Research},
  volume    = {306},
  year      = {2026},
  publisher = {PMLR},
  address   = {Seoul, South Korea}
}