DTop-p MoE – ICML 2026 Poster

Motivation

Sparse Mixture-of-Experts (MoE) scales model capacity by activating only a few experts per token, decoupling parameters from compute cost. Two routing schemes dominate pre-training:

Top-k — fixed number of experts per token; ignores token difficulty & layer needs.
Fixed Top-p — selects experts until cumulative probability exceeds threshold p.
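
To make the contrast between the two schemes concrete, here is a minimal NumPy sketch. It is illustrative only and not the authors' implementation: the function names are hypothetical, and the tie-handling rule (keep an expert while the probability mass before it is still at most p) is an assumption.

```python
import numpy as np

def top_k_mask(probs, k):
    """Select the k highest-probability experts for every token."""
    idx = np.argsort(-probs, axis=-1)[:, :k]
    mask = np.zeros_like(probs, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def top_p_mask(probs, p):
    """Add experts in descending probability order until the cumulative
    probability exceeds p. At least one expert is always selected."""
    order = np.argsort(-probs, axis=-1)
    sorted_probs = np.take_along_axis(probs, order, axis=-1)
    # Probability mass accumulated *before* each expert in sorted order.
    cum_before = np.cumsum(sorted_probs, axis=-1) - sorted_probs
    keep_sorted = cum_before <= p  # assumption: ties at exactly p keep going
    mask = np.zeros_like(probs, dtype=bool)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask

# Two toy tokens over four experts: one confident, one uncertain.
probs = np.array([[0.7, 0.2, 0.05, 0.05],
                  [0.3, 0.3, 0.2, 0.2]])
print(top_k_mask(probs, k=2).sum(axis=-1))   # same count for every token
print(top_p_mask(probs, p=0.5).sum(axis=-1)) # count varies with confidence
```

In this toy case Top-k spends two experts on both tokens, while Top-p spends one on the confident token and two on the uncertain one.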

Problem: Fixed Top-p is Unstable

❶ Uncontrolled compute: activated-expert count fluctuates unpredictably — incompatible with strict pre-training budgets.
❷ Hypersensitive to p: p=0.40 over-activates (>12 experts); p=0.35 only matches Top-k — gains are marginal.

Key Insight

Top-p is inherently adaptive — confident tokens use fewer experts, uncertain tokens recruit more — but the threshold receives no gradient (it only binarizes the mask), so it cannot be learned by SGD.

💡 Idea: treat target sparsity as a control setpoint and make the threshold effectively learnable, so that the average number of activated experts converges to a budget T.

Contributions

Analysis showing fixed-threshold Top-p gives only marginal gains over Top-k with uncontrolled cost.
DTop-p MoE: a PI-controlled, sparsity-controllable dynamic Top-p with Dynamic Routing Normalization (+ a layer-wise variant).
Comprehensive study on NLP & CV: better performance and scaling at matched FLOPs.

Method — DTop-p MoE

① PI Controller — Learnable Sparsity

A Proportional-Integral (PI) controller adjusts the global threshold p between batches, driven by the sparsity error e_t=(T−a_t)/N.

The proportional term reacts to deviation; the integral term removes steady-state bias, driving activated experts a_t → target T.
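
The control loop described above can be sketched as follows. This is a hedged illustration, not the poster's exact update rule. Assumptions: a positional PI form p = p_init + K_p·e_t + K_i·Σe; hypothetical gains `kp` and `ki`; N read as the number of experts; clipping of p to [0, 1]. The "router" in `simulate` is a toy stand-in in which the activated-expert count grows linearly with p.

```python
from dataclasses import dataclass

@dataclass
class PIThresholdController:
    target: float            # T, desired average activated experts
    num_experts: int         # N (assumed to be the expert count)
    p_init: float = 0.4      # hypothetical starting threshold
    kp: float = 0.2          # hypothetical proportional gain
    ki: float = 0.5          # hypothetical integral gain
    integral: float = 0.0    # running sum of past errors

    def update(self, avg_active: float) -> float:
        """Return the threshold for the next batch given the measured a_t."""
        error = (self.target - avg_active) / self.num_experts  # e_t
        self.integral += error
        p = self.p_init + self.kp * error + self.ki * self.integral
        return min(max(p, 0.0), 1.0)  # keep p a valid probability threshold

def simulate(steps: int = 200):
    """Run the controller against a toy plant where a_t = N * p."""
    ctrl = PIThresholdController(target=8, num_experts=64)
    p = ctrl.p_init
    for _ in range(steps):
        avg_active = ctrl.num_experts * p  # toy stand-in for a real router
        p = ctrl.update(avg_active)
    return p, ctrl.num_experts * p

p_final, active_final = simulate()
print(p_final, active_final)
```

When too many experts are active (a_t > T) the error is negative and p drops; the accumulated integral term keeps pushing until the average count sits on the target. No gradient with respect to p is involved at any point.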

② Dynamic Routing Normalization

A learnable per-layer scale θ_l rescales normalized logits, letting each layer sharpen/flatten its routing — enabling distinct sparsity per depth under one global threshold.
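
A small NumPy sketch of how a per-layer scale changes sparsity under one shared threshold. The poster does not state the exact normalization, so per-token standardization (zero mean, unit variance) is an assumption here. θ_l is learned in the method, but is passed as a fixed constant in this sketch, and the function names are hypothetical.

```python
import numpy as np

def routing_probs(logits, theta, eps=1e-6):
    """Standardize logits per token (assumed normalization), scale by theta,
    then apply a numerically stable softmax."""
    mean = logits.mean(axis=-1, keepdims=True)
    std = logits.std(axis=-1, keepdims=True)
    z = theta * (logits - mean) / (std + eps)
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def avg_active_experts(probs, p):
    """Average number of experts a Top-p rule with threshold p would select."""
    sorted_probs = -np.sort(-probs, axis=-1)  # descending order
    cum_before = np.cumsum(sorted_probs, axis=-1) - sorted_probs
    return (cum_before <= p).sum(axis=-1).mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(256, 64))  # 256 toy tokens, 64 experts

flat = avg_active_experts(routing_probs(logits, theta=0.5), p=0.4)
sharp = avg_active_experts(routing_probs(logits, theta=2.0), p=0.4)
print(flat, sharp)
```

A larger scale sharpens the routing distribution, so the same global threshold is reached with fewer experts; a smaller scale flattens it and recruits more. This is how layers can end up with different sparsity while sharing one p.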

Training Procedure

NLP — Training Dynamics (100B tokens)

Model: MoE-1.3B-6.9B-64E8A (1.3B active / 6.9B total). DTop-p reaches the best loss and locks sparsity to T=8, whereas fixed Top-p overshoots the target.

NLP — Inference Benchmarks (13 datasets)

CV — DiT Pre-training (2T pixel tokens)

Precise & Adaptive Sparsity Control

DTop-p converges to T=8 with low variance (σ≈1), whereas fixed Top-p drifts with σ≈4. DTop-p also activates fewer experts in shallow layers and more in deep layers.

Ablation — PI Controller & DRN

Both components are needed: PI enforces the budget, and DRN adaptively rescales each layer's logits. Together they give the best loss and a stable threshold.

Scaling — Model Size (0.4B → 2.4B)

Conclusions

+1.9% avg. over Top-k (NLP, matched FLOPs)
σ≈1 stable activated-expert count
modalities: LLM & DiT

DTop-p reconciles token-adaptive routing with strict compute control via PI control — no gradient needed for the threshold.
Beats Top-k & fixed Top-p on LLMs and DiTs; robust scaling across granularity, model & dataset size.