Sparse Mixture-of-Experts (MoE) scales model capacity by activating only a few experts per token, decoupling parameter count from per-token compute cost. Two routing schemes dominate pre-training:

Top-p is inherently adaptive: confident tokens use fewer experts, while uncertain tokens recruit more. However, the threshold receives no gradient, because it only binarizes the selection mask, so it cannot be learned by SGD.
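To make the adaptive behavior concrete, here is a minimal NumPy sketch of Top-p expert selection. It is an illustration under assumptions, not the authors' implementation: the function name `top_p_mask` and the toy probabilities are hypothetical, and the rule used (keep the highest-probability experts until their cumulative mass reaches p) is the usual nucleus-style definition.

```python
import numpy as np

def top_p_mask(probs: np.ndarray, p: float) -> np.ndarray:
    """Per token, select the smallest set of highest-probability experts
    whose cumulative router probability reaches the threshold p."""
    order = np.argsort(-probs, axis=-1)                   # experts by descending probability
    sorted_probs = np.take_along_axis(probs, order, axis=-1)
    # Probability mass accumulated *before* each expert in sorted order.
    cum_before = np.cumsum(sorted_probs, axis=-1) - sorted_probs
    # Hard comparison: a step function of p, so no gradient flows to p.
    keep_sorted = cum_before < p
    mask = np.zeros_like(keep_sorted)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)  # undo the sort
    return mask

# Toy router outputs (hypothetical values) for two tokens over four experts.
probs = np.array([
    [0.90, 0.05, 0.03, 0.02],   # confident token
    [0.30, 0.28, 0.22, 0.20],   # uncertain token
])
mask = top_p_mask(probs, p=0.7)
experts_per_token = mask.sum(axis=-1)
print(experts_per_token)  # [1 3]
```

The confident token is served by one expert and the uncertain token by three, under the same threshold. Because the threshold only enters through the comparison `cum_before < p`, it has to be set by something other than backpropagation.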

A Proportional-Integral (PI) controller adjusts the global threshold p between batches from the sparsity error e_t = (T − a_t)/N:

The proportional term reacts to the current deviation; the integral term removes steady-state bias, driving the number of activated experts a_t toward the target T.
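The control loop described above can be sketched as a textbook positional PI controller. This is a hedged illustration only: the class name, the gains `kp` and `ki`, the initial threshold, and the toy response "activated experts = N · p" are all assumptions made for the example, not values or formulas taken from the source. Only the error definition, the target divided by the same N, follows the text.

```python
from dataclasses import dataclass

@dataclass
class PIThresholdController:
    target: float            # T: desired number of activated experts per token
    num_experts: int         # N: normalizes the sparsity error
    kp: float = 0.2          # proportional gain (hypothetical value)
    ki: float = 0.3          # integral gain (hypothetical value)
    p_init: float = 0.5      # starting top-p threshold (hypothetical value)
    integral: float = 0.0    # running sum of past errors

    def update(self, activated: float) -> float:
        """Return the threshold for the next batch, given the average
        number of experts activated in the current batch."""
        # Sparsity error: negative when too many experts are active,
        # which lowers p and so activates fewer experts.
        error = (self.target - activated) / self.num_experts
        self.integral += error
        p = self.p_init + self.kp * error + self.ki * self.integral
        return min(max(p, 0.0), 1.0)   # keep the threshold in [0, 1]

# Toy closed loop: assume activation grows linearly with the threshold.
# Real routers respond nonlinearly; this stand-in is only for illustration.
controller = PIThresholdController(target=8, num_experts=64)
p = controller.p_init
for _ in range(200):
    activated = 64 * p
    p = controller.update(activated)
final_activated = 64 * p
```

In this toy loop the proportional term alone would settle at a nonzero offset; the accumulated integral keeps pushing the threshold until the error is zero, which is the steady-state property the text attributes to the integral term.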

A learnable per-layer scale θ_l rescales the normalized logits, letting each layer sharpen or flatten its routing distribution. This enables a distinct sparsity level at each depth under one global threshold.
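The effect of the per-layer scale can be illustrated with a small NumPy sketch. Several details here are assumptions for the sake of the example: the normalization is taken to be zero-mean, unit-variance standardization over experts, the scale is a plain float rather than a trained parameter, and the helper names are hypothetical. The sketch only shows that, under one fixed threshold, a larger scale yields fewer activated experts.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_router_probs(logits: np.ndarray, theta: float) -> np.ndarray:
    """Standardize logits over experts (assumed normalization), then rescale
    by the layer's scale theta before the softmax."""
    mean = logits.mean(axis=-1, keepdims=True)
    std = logits.std(axis=-1, keepdims=True)
    z = (logits - mean) / (std + 1e-6)
    return softmax(theta * z)

def experts_needed(probs: np.ndarray, p: float) -> np.ndarray:
    """Number of top experts per token required to reach cumulative mass p."""
    sorted_probs = -np.sort(-probs, axis=-1)   # descending order
    cum_before = np.cumsum(sorted_probs, axis=-1) - sorted_probs
    return (cum_before < p).sum(axis=-1)

rng = np.random.default_rng(0)
logits = rng.normal(size=(256, 64))            # 256 tokens, 64 experts (toy data)
p = 0.6                                        # one shared global threshold

probs_flat = scaled_router_probs(logits, theta=0.5)   # flatter routing
probs_sharp = scaled_router_probs(logits, theta=3.0)  # sharper routing
avg_flat = experts_needed(probs_flat, p).mean()
avg_sharp = experts_needed(probs_sharp, p).mean()
print(f"theta=0.5: {avg_flat:.1f} experts/token, theta=3.0: {avg_sharp:.1f} experts/token")
```

Because the logits are standardized first, the scale is the only knob that controls how peaked each layer's distribution is, so two layers with different scales can land on different expert counts while sharing the same p.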


MoE-1.3B-6.9B-64E8A (1.3B active / 6.9B total). DTop-p reaches the best loss and locks sparsity to T=8, unlike fixed Top-p, which overshoots the target.



DTop-p converges to T=8 with low variance (σ≈1), whereas fixed Top-p drifts with σ≈4. DTop-p also activates fewer experts in shallow layers and more in deep layers.

Both components are needed: PI enforces the budget, while DRN adaptively rescales each layer's logits. Together they give the best loss and a stable threshold.
