Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference
Researchers propose a sleep-like mechanism for transformer language models that periodically consolidates context into persistent fast weights, reducing the computational burden of long sequences. The method shifts heavy computation offline while maintaining fast inference speeds, showing significant improvements on reasoning tasks that standard transformers struggle with.