The Diminishing Returns of Early-Exit Decoding in Modern LLMs
Research shows that newer LLMs have diminishing effectiveness for early-exit decoding techniques due to improved architectures that reduce layer redundancy. The study finds that dense transformers outperform Mixture-of-Experts models for early-exit, with larger models (20B+ parameters) and base pretrained models showing the highest early-exit potential.