Scaling foundation models with dense compute is prohibitively expensive. Sparse Mixture-of-Experts (MoE) activates only a small subset of experts per token, decoupling total parameter count from per-token compute cost.
This talk: can Top-p routing be made both compute-controllable and more effective than Top-k?

Contributions:

A PI controller adjusts the global probability threshold to hit a target expert count; Dynamic Routing Normalization lets each layer choose its own sparsity under that budget.
Track the average number of activated experts a_t per batch; define the sparsity error e_t = (T − a_t) / N, where T is the target expert count. A discrete proportional-integral (PI) law then updates the threshold from this error.

The controller relies on the monotonicity of nucleus (Top-p) selection: raising p never reduces the number of activated experts. No gradient is needed for the threshold.
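
The control loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the talk's implementation: the positional PI form p = p0 + Kp·e_t + Ki·Σ e_τ, the gains kp and ki, the initial threshold p0, the clipping to [0, 1], and the reading of N as the number of experts per layer are all assumptions, and the names PIThreshold, top_p_count, and run_controller are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's expert logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_count(probs, p):
    """Size of the smallest expert set whose cumulative probability reaches p."""
    cumulative = 0.0
    for k, q in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += q
        if cumulative >= p:
            return k
    return len(probs)  # p exceeds the (rounded) total mass: activate all experts

class PIThreshold:
    """Hypothetical PI controller for the global Top-p threshold (gains are assumptions)."""

    def __init__(self, target, num_experts, kp=0.5, ki=0.1, p0=0.5):
        self.target = target            # T: desired average experts per token
        self.num_experts = num_experts  # N: assumed to be the experts per layer
        self.kp, self.ki, self.p0 = kp, ki, p0
        self.integral = 0.0             # running sum of e_t
        self.p = p0

    def update(self, avg_active):
        """Consume a_t for the latest batch and return the next threshold."""
        error = (self.target - avg_active) / self.num_experts  # e_t = (T - a_t) / N
        self.integral += error
        raw = self.p0 + self.kp * error + self.ki * self.integral
        self.p = min(1.0, max(0.0, raw))  # clipping to [0, 1] is an added safeguard
        return self.p

def run_controller(batch_logits, controller, steps):
    """Closed loop on a fixed batch: measure a_t at the current p, then update p."""
    history = []
    for _ in range(steps):
        counts = [top_p_count(softmax(z), controller.p) for z in batch_logits]
        avg_active = sum(counts) / len(counts)
        history.append((controller.p, avg_active))
        controller.update(avg_active)
    return history
```

Because top_p_count is non-decreasing in p, a positive error (too few experts active) always moves the threshold in the direction that activates more experts, which is why the loop works without a gradient. Whether it settles on the target depends on the gains and on how steeply the expert count responds to p, so kp and ki would need tuning in practice.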
A single global threshold assumes uniform logit statistics across depth. Instead, normalize each layer's logits and apply a learnable per-layer scale θ_l.
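
A rough sketch of how a per-layer scale can change sparsity under one shared threshold follows. The exact normalization used in the talk is not recoverable from this text, so the z-score (zero mean, unit variance) below is an assumption, as are the function names and the eps constant; in training, theta_l would be a learned parameter rather than a fixed number.

```python
import math

def normalize_logits(logits, eps=1e-6):
    """Zero-mean, unit-variance logits (z-scoring is an assumed choice)."""
    n = len(logits)
    mean = sum(logits) / n
    var = sum((x - mean) ** 2 for x in logits) / n
    return [(x - mean) / math.sqrt(var + eps) for x in logits]

def layer_routing_probs(logits, theta_l):
    """Softmax over theta_l * normalized logits; theta_l is the layer's scale."""
    scaled = [theta_l * x for x in normalize_logits(logits)]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def experts_selected(probs, p):
    """Number of experts kept by Top-p selection at global threshold p."""
    cumulative = 0.0
    for k, q in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += q
        if cumulative >= p:
            return k
    return len(probs)

# Same raw logits, same global threshold, two different per-layer scales.
logits = [2.0, 1.0, 0.5, -1.0]
flat = layer_routing_probs(logits, theta_l=0.5)   # small scale: flatter distribution
sharp = layer_routing_probs(logits, theta_l=3.0)  # large scale: peakier distribution
```

Normalizing first removes layer-to-layer differences in logit magnitude, so the scale becomes the only knob that sets how peaked a layer's routing distribution is. Under a shared p, a larger theta_l concentrates mass on fewer experts and a smaller one spreads it across more, which is one way a layer could set its own sparsity within the global budget.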

DTop-p wins at every model size, and its advantage over Top-k widens with scale. It also benefits more from finer expert granularity and more training data.
