ICML 2026 · Seoul, South Korea

DTop-p MoE

Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training

Can Jin¹*, Hongwu Peng²*, Mingcan Xiang²,³, Qixin Zhang⁴, Xiangchi Yuan⁵, Amit Hasan²,
Ohi Dibua², Yifan Gong², Yan Kang²†, Dimitris N. Metaxas¹†

¹Rutgers University · ²Adobe Research · ³UMass Amherst · ⁴Nanyang Technological University · ⁵Georgia Tech | *Equal Contribution · †Equal Advising

Presented by Can Jin · Rutgers University · can.jin@rutgers.edu

Background

Sparse Mixture-of-Experts (MoE)

Scaling foundation models with dense compute is prohibitively expensive. Sparse MoE activates only a small subset of experts per token, decoupling total parameter count from per-token compute cost.

Top-k routing

Select a fixed number k of experts per token. Compute is predictable, but every token gets the same capacity, regardless of token difficulty or layer-specific needs.
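
A minimal NumPy sketch of Top-k routing, for illustration only. The function names, toy sizes (4 tokens, 8 experts), and random logits are assumptions, not taken from the talk.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the expert dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top_k_route(logits, k):
    """Return a boolean (tokens, experts) mask with exactly k experts per token."""
    probs = softmax(logits)
    # Indices of the k largest routing probabilities for each token.
    top_idx = np.argsort(-probs, axis=-1)[:, :k]
    mask = np.zeros_like(probs, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=-1)
    return mask, probs

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))   # toy router logits (hypothetical)
mask, probs = top_k_route(logits, k=2)
experts_per_token = mask.sum(axis=-1)
```

Every entry of `experts_per_token` equals k by construction, which is exactly the fixed-capacity behavior described above.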

Top-p routing

Select experts in descending order of routing probability until their cumulative probability exceeds a threshold p (as in nucleus sampling). Adaptive per token, but how reliable is it?
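
A small sketch of Top-p expert selection under stated assumptions: the helper names and the two hand-picked tokens are hypothetical, and the expert whose inclusion first reaches the threshold is kept, so each token always activates at least one expert.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top_p_route(logits, p):
    """Return a boolean (tokens, experts) mask chosen by cumulative probability."""
    probs = softmax(logits)
    order = np.argsort(-probs, axis=-1)                  # experts, most likely first
    sorted_probs = np.take_along_axis(probs, order, axis=-1)
    cum = np.cumsum(sorted_probs, axis=-1)
    # Keep an expert while the mass accumulated *before* it is still below p.
    keep_sorted = (cum - sorted_probs) < p
    mask = np.zeros_like(probs, dtype=bool)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask

logits = np.array([[4.0, 0.0, 0.0, 0.0],    # confident token: one dominant expert
                   [0.0, 0.0, 0.0, 0.0]])   # uncertain token: uniform over experts
mask = top_p_route(logits, p=0.6)
counts = mask.sum(axis=-1)
```

With the same threshold, the confident token activates a single expert while the uniform token needs three, which is the per-token adaptivity that Top-k lacks.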

This talk: can Top-p routing be made both compute-controllable and more effective than Top-k?

DTop-p MoE · ICML 2026 · Presented by Can Jin

Motivation

Fixed-Threshold Top-p is Unstable

❶ Uncontrolled compute

Activated-expert count fluctuates unpredictably during training, which is incompatible with strict pre-training FLOP budgets and risks out-of-memory (OOM) failures.

❷ Hypersensitive to p

p = 0.40 over-activates (>12 experts); p = 0.35 only matches Top-k. Gains are marginal and tuning is costly.

Key Idea

Make the Threshold a Control Setpoint

💡 Insight

Top-p is naturally adaptive, but the threshold gets no gradient (it only binarizes the expert mask). So treat target sparsity as a setpoint and make the threshold effectively learnable via feedback control.

Contributions:

Analysis: fixed-threshold Top-p gives only marginal gains over Top-k with uncontrolled cost.
DTop-p MoE: a PI controller + Dynamic Routing Normalization → adaptive routing under a strict global budget (plus a layer-wise variant).
Comprehensive study on NLP & CV — better performance and scaling at matched FLOPs.

Method

DTop-p MoE — Overview

A PI controller adjusts the global probability threshold to hit a target expert count; Dynamic Routing Normalization lets each layer choose its own sparsity under that budget.

Method ①

PI Controller — Learnable Sparsity

Track average activated experts a_t per batch; define sparsity error e_t = (T − a_t) / N. A discrete Proportional-Integral law updates the threshold:
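
One standard discrete PI form consistent with these definitions is sketched below. The gains K_P and K_I, the initial threshold p_0, the accumulator I_t, and the clipping to [0, 1] are assumptions for illustration; the slide text does not specify them.

```latex
\begin{aligned}
e_t &= \frac{T - a_t}{N}, \\
I_t &= I_{t-1} + e_t, \qquad I_0 = 0, \\
p_{t+1} &= \operatorname{clip}\!\left(p_0 + K_P\, e_t + K_I\, I_t,\; 0,\; 1\right).
\end{aligned}
```

When too few experts are active (a_t < T), the error e_t is positive and the threshold rises, which activates more experts.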

Proportional term

Reacts immediately to the current deviation from target T.

Integral term

Accumulates past errors → removes steady-state bias, so a_t converges to T.

Relies on the monotonicity of nucleus sampling: raising p ⇒ more experts. No gradient needed for the threshold.
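
A toy simulation of such a feedback loop, as a sketch rather than the authors' implementation. The frozen random router logits, the gains `K_P` and `K_I`, the initial threshold `p0`, and the step count are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 64, 8.0                        # experts per layer, target active experts
logits = rng.normal(size=(512, N))    # toy, frozen router logits (assumption)

# Sort routing probabilities once; the toy router does not change over time.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
sorted_probs = -np.sort(-probs, axis=-1)
mass_before = np.cumsum(sorted_probs, axis=-1) - sorted_probs

def avg_active(p):
    """Average number of experts kept by Top-p at threshold p (monotone in p)."""
    return (mass_before < p).sum(axis=-1).mean()

K_P, K_I = 0.1, 0.1     # hypothetical controller gains
p0 = 0.3                # hypothetical initial threshold
p, integral = p0, 0.0
history = []
for step in range(500):
    a_t = avg_active(p)                 # measured average activated experts
    e_t = (T - a_t) / N                 # sparsity error, as defined above
    integral += e_t                     # integral term accumulates past errors
    p = float(np.clip(p0 + K_P * e_t + K_I * integral, 0.0, 1.0))
    history.append(a_t)

final_active = history[-1]
```

The controller only reads the scalar `a_t` and never differentiates through the threshold; in this toy setting `final_active` settles close to the target T.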

Method ②

Dynamic Routing Normalization

A single global threshold assumes uniform logit statistics across depth. Instead, normalize each layer's logits and apply a learnable per-layer scale θ_l:
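
One form consistent with this description is given below, assuming z-score normalization over the expert dimension (the slide text does not state which normalization is used). Here z_l denotes the layer-l router logits for one token; μ, σ, and the small constant ε are illustrative symbols.

```latex
\hat{z}_l = \frac{z_l - \mu(z_l)}{\sigma(z_l) + \epsilon},
\qquad
\pi_l = \operatorname{softmax}\!\left(\theta_l \, \hat{z}_l\right)
```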

Large θ_l → sharper distribution → fewer experts; small θ_l → flatter → more experts.
Each layer learns a distinct sparsity pattern while still respecting the single global budget.
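
The effect of the scale can be checked numerically with a short sketch. It assumes z-score normalization per token and fixed (not learned) values of θ; the function names, logit statistics, and threshold are hypothetical.

```python
import numpy as np

def drn_probs(logits, theta, eps=1e-6):
    """Normalize logits per token, scale by theta, then apply softmax."""
    mu = logits.mean(axis=-1, keepdims=True)
    sd = logits.std(axis=-1, keepdims=True)
    z = theta * (logits - mu) / (sd + eps)
    z = z - z.max(axis=-1, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def avg_active(probs, p):
    """Average number of experts kept by Top-p at threshold p."""
    s = -np.sort(-probs, axis=-1)
    mass_before = np.cumsum(s, axis=-1) - s
    return (mass_before < p).sum(axis=-1).mean()

rng = np.random.default_rng(0)
# Arbitrary scale and offset stand in for layer-specific logit statistics.
logits = 5.0 * rng.normal(size=(256, 64)) + 3.0
p = 0.5
active_flat = avg_active(drn_probs(logits, theta=0.5), p)    # small theta, flatter
active_sharp = avg_active(drn_probs(logits, theta=3.0), p)   # large theta, sharper
```

At the same threshold, the larger scale activates fewer experts, and the normalization makes the result insensitive to the raw scale and offset of the logits.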

Results · NLP

Training Dynamics (100B tokens)

DTop-p reaches the best train/val loss on MoE-1.3B-6.9B-64E8A and locks the activated-expert count to T = 8 within ~1B tokens — fixed Top-p overshoots and oscillates.

Results · NLP

Downstream Benchmarks (13 datasets)

+1.9% average gain over Top-k MoE at matched average FLOPs

Best average across the 13 zero/few-shot tasks (50.9 vs 49.0 / 49.3).
Large gains on reasoning: SVAMP +5.7, COPA +3.0 over Top-k.

Results · Vision

Generalizes to Diffusion Transformers

On a 0.9B → 3.4B 64E8A MoE DiT trained on 2T pixel tokens, DTop-p again achieves the lowest validation loss while holding precise sparsity — the method is not NLP-specific.

Analysis

Precise & Adaptive Sparsity Control

Precise

Converges to T = 8 with low variability (σ ≈ 1); fixed Top-p drifts with σ ≈ 4.

Adaptive (hierarchical)

Learns to use fewer experts in shallow layers, more in deep layers — emergent depth-wise specialization.

Analysis

Ablation: Both Components Matter

PI controller — without it, sparsity is unregulated and drifts.

DRN — adaptively rescales layer logits; best loss only with both combined.

Analysis

Robust Scaling (0.4B → 2.4B)

DTop-p wins at every model size — and the advantage over Top-k widens with scale. It also benefits more from finer expert granularity and more training data.

Recap

The Full Training Loop

Forward: per layer, normalize logits (DRN) → nucleus-sample experts at threshold p_t.
Feedback: measure a_t, compute error e_t, update p_{t+1} via PI.
Cost: only a lightweight scalar signal on top of standard MoE training.
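
The three steps above can be put together in a toy loop. This is a sketch under several assumptions: random logits stand in for real batches, the per-layer scales `theta` are fixed hypothetical values rather than learned parameters, and the gains and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, LAYERS, TOKENS = 64, 8.0, 4, 256
theta = np.array([2.0, 1.5, 1.0, 0.7])   # hypothetical per-layer scales
K_P, K_I, p0 = 0.1, 0.1, 0.3             # hypothetical gains and initial threshold

def route_layer(logits, theta_l, p, eps=1e-6):
    """Normalize and scale logits, then select experts by Top-p; returns a bool mask."""
    mu = logits.mean(-1, keepdims=True)
    sd = logits.std(-1, keepdims=True)
    z = theta_l * (logits - mu) / (sd + eps)
    probs = np.exp(z - z.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    order = np.argsort(-probs, axis=-1)
    s = np.take_along_axis(probs, order, axis=-1)
    keep = (np.cumsum(s, axis=-1) - s) < p
    mask = np.zeros_like(keep)
    np.put_along_axis(mask, order, keep, axis=-1)
    return mask

p, integral = p0, 0.0                    # the controller state is two scalars
for step in range(500):
    # Forward: fresh toy logits per layer stand in for a real training batch.
    per_layer = np.array([
        route_layer(rng.normal(size=(TOKENS, N)), theta[l], p).sum(-1).mean()
        for l in range(LAYERS)
    ])
    # Feedback: one global measurement drives one global threshold.
    a_t = per_layer.mean()
    e_t = (T - a_t) / N
    integral += e_t
    p = float(np.clip(p0 + K_P * e_t + K_I * integral, 0.0, 1.0))
```

In this toy setting the global average `a_t` tracks T while `per_layer` differs across layers, since layers with a larger scale activate fewer experts under the shared threshold.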

Conclusion

Takeaways

+1.9% avg. over Top-k (NLP, matched FLOPs)

σ ≈ 1: stable activated-expert count

LLM + DiT: wins on both NLP & vision

DTop-p makes Top-p routing compute-controllable via PI control — no gradient on the threshold.
Dynamic Routing Normalization unlocks per-layer adaptive sparsity under one global budget.
Beats Top-k & fixed Top-p with robust scaling across granularity, model & data size.

Thank You! Questions?

DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training

Presented by Can Jin · Rutgers University | can.jin@rutgers.edu
ICML 2026 · PMLR 306 | Rutgers University & Adobe Research