DTop-p MoE — ICML 2026

A Proportional-Integral (PI) controller makes the global Top-p threshold effectively learnable to hit a target expert count, while Dynamic Routing Normalization lets each layer choose its own sparsity under that single global budget.

Abstract

Sparse Mixture-of-Experts architectures are essential for scaling model capacity efficiently, yet the standard Top-k routing imposes a rigid sparsity pattern that ignores the intrinsic variance in token difficulty and layer-specific computational needs. Top-p routing is more adaptive because it selects experts until their cumulative routing probability reaches a threshold, allowing confident tokens to use fewer experts and ambiguous tokens to recruit more. However, we demonstrate that existing naive Top-p implementations with fixed global probability thresholds provide only marginal gains over Top-k, suffer from hyper-parameter sensitivity, and result in uncontrolled computational costs. In this paper, we propose DTop-p, a sparsity-controllable dynamic routing mechanism that learns the Top-p probability threshold with a Proportional-Integral controller and uses dynamic routing normalization to support layer-wise expert selection under a global sparsity constraint. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that DTop-p consistently outperforms both Top-k and fixed Top-p baselines while matching the average FLOPs of Top-k MoE. Our analysis confirms that DTop-p exhibits strong scaling properties across expert granularity, total expert capacity, model size, and dataset size, offering a robust and efficient MoE framework for foundation model pre-training.

TL;DR

Make Top-p routing compute-controllable and more effective than Top-k — with a control-theory twist.

+1.9%

over Top-k MoE

Higher average across 13 NLP benchmarks at matched average FLOPs.

Precise sparsity

A PI controller drives the activated-expert count to a target T with low variance (σ ≈ 1).

Adaptive by depth

Dynamic Routing Normalization learns to activate fewer experts in shallow layers and more in deep layers.

LLM + DiT

Consistent wins on both language and diffusion pre-training, with robust scaling.

Method

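Top-p selection is a hard cutoff: the set of activated experts is a piecewise-constant function of the threshold, so the threshold receives no useful gradient and cannot be learned by SGD. DTop-p instead treats the target sparsity as a control setpoint.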
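As a rough illustration of why the threshold resists gradient-based learning, the sketch below implements plain Top-p selection for a single token. The function name `top_p_select` and the example probabilities are hypothetical and are not taken from the paper's code.

```python
def top_p_select(probs, p):
    """Return the indices of the smallest set of experts, taken in descending
    probability order, whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen

# Hypothetical routing distributions over four experts.
confident = [0.90, 0.05, 0.03, 0.02]   # one expert dominates
ambiguous = [0.30, 0.28, 0.22, 0.20]   # probability mass is spread out

n_confident = len(top_p_select(confident, 0.7))
n_ambiguous = len(top_p_select(ambiguous, 0.7))

# Nudging the threshold between two cumulative sums leaves the selection
# unchanged, so the expert count is flat almost everywhere in p.
same_count = len(top_p_select(ambiguous, 0.59)) == len(top_p_select(ambiguous, 0.79))
```

With these inputs the confident token activates one expert and the ambiguous token activates three, and the count for the ambiguous token stays the same anywhere between p = 0.59 and p = 0.79.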

① PI Controller — learnable threshold

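The controller acts on the sparsity error e_t = (T − a_t)/N, where a_t is the measured average number of activated experts at step t and N is the number of experts. The proportional term reacts to the current deviation and the integral term removes steady-state bias, so the average number of activated experts converges to the budget T. The threshold is set by the controller alone and receives no gradient.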
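A minimal sketch of such a control loop on toy data follows. It assumes a positional PI update of the form p = p_base + K_P·e_t + K_I·Σe, with hypothetical gains and a synthetic set of routing distributions. The paper's exact update rule, gains, and measurement window are not given here, so treat every constant below as an assumption.

```python
N_EXPERTS = 16      # N: number of experts (toy value)
TARGET = 4.0        # T: desired average number of activated experts
KP, KI = 0.2, 0.1   # hypothetical controller gains, not from the paper

def count_top_p(probs, p):
    """Number of experts needed, in descending order, to reach cumulative mass p."""
    cumulative = 0.0
    for k, q in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += q
        if cumulative >= p:
            return k
    return len(probs)

def make_token(ratio):
    """Toy routing distribution with geometrically decaying probabilities."""
    weights = [ratio ** i for i in range(N_EXPERTS)]
    total = sum(weights)
    return [w / total for w in weights]

# 64 synthetic tokens, from confident (ratio 0.3) to ambiguous (ratio 0.9).
tokens = [make_token(0.3 + 0.6 * i / 63) for i in range(64)]

def mean_active(p):
    """Average activated-expert count over the toy batch at threshold p."""
    return sum(count_top_p(t, p) for t in tokens) / len(tokens)

p_base = 0.5
p, integral, history = p_base, 0.0, []
for step in range(400):
    a_t = mean_active(p)                      # measured sparsity at this step
    error = (TARGET - a_t) / N_EXPERTS        # e_t = (T - a_t) / N
    integral += error                         # running sum of past errors
    p = p_base + KP * error + KI * integral   # PI update, no gradient involved
    p = min(max(p, 0.01), 0.999)              # keep the threshold inside (0, 1)
    history.append(a_t)

final_active = history[-1]
```

In this toy setting the threshold starts too low, so the measured count begins below T; the positive error raises p until the count settles near the budget. The integral term is what holds p away from `p_base` once the error has vanished.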

② Dynamic Routing Normalization

A learnable per-layer scale θ_l rescales normalized logits: larger θ_l sharpens (fewer experts), smaller flattens (more experts). Each layer learns a distinct sparsity pattern under one global threshold.
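The effect of the per-layer scale can be sketched as below. This assumes the normalization is a standardization of the router logits to zero mean and unit variance, followed by multiplication with θ_l and a softmax; the exact normalization used in the paper may differ, and the logits, threshold, and function names are hypothetical.

```python
import math

def normalize(logits):
    """Standardize logits to zero mean and unit variance (assumed form)."""
    mean = sum(logits) / len(logits)
    var = sum((x - mean) ** 2 for x in logits) / len(logits)
    std = math.sqrt(var) + 1e-6
    return [(x - mean) / std for x in logits]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def routed_count(logits, theta, p):
    """Experts activated by Top-p after scaling normalized logits by theta."""
    probs = softmax([theta * z for z in normalize(logits)])
    cumulative = 0.0
    for k, q in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += q
        if cumulative >= p:
            return k
    return len(probs)

# Hypothetical router logits for one token over eight experts.
logits = [2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5, -2.0]
GLOBAL_P = 0.75   # a single threshold shared by all layers (toy value)

# The same token and threshold, seen by three layers with different scales.
counts = [routed_count(logits, theta, GLOBAL_P) for theta in (0.5, 1.0, 2.0)]
```

Under these assumptions the activated-expert count falls as θ grows, which is how layers with different learned scales can end up with different sparsity while sharing one global threshold.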

Results

NLP — MoE-1.3B-6.9B-64E8A trained on 100B DCLM tokens.

50.9

best average over 13 datasets

vs. 49.0 (Top-k) and 49.3 (fixed Top-p): a gain of 1.9 points over Top-k at matched FLOPs.

DTop-p reaches the best train/val loss and locks the activated-expert count to T = 8 within ~1B tokens, while fixed Top-p overshoots and oscillates.

Computer Vision — a 64E8A MoE Diffusion Transformer (0.9B active / 3.4B total) trained on 2T pixel tokens.

Analysis

Precise control, adaptive allocation, and the contribution of each component.

DTop-p converges to T = 8 with low variance (σ ≈ 1); fixed Top-p drifts with σ ≈ 4.

Both the PI controller and DRN are needed for the best result (left); the advantage over Top-k widens with model scale from 0.4B to 2.4B (right).

BibTeX

@inproceedings{jin2026dtopp,
  title     = {{DTop-$p$ MoE}: Sparsity-Controlled Dynamic Top-$p$ MoE
               for Foundation Model Pre-training},
  author    = {Jin, Can and Peng, Hongwu and Xiang, Mingcan and Zhang, Qixin
               and Yuan, Xiangchi and Hasan, Amit and Dibua, Ohi and Gong, Yifan
               and Kang, Yan and Metaxas, Dimitris N.},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  series    = {Proceedings of Machine Learning Research},
  volume    = {306},
  year      = {2026},
  publisher = {PMLR},
  address   = {Seoul, South Korea}
}