arXiv – CS AI
Budgeted Attention Allocation: Cost-Conditioned Compute Control for Efficient Transformers
Researchers present Budgeted Attention Allocation, a mechanism that lets a single transformer model operate at multiple efficiency-accuracy trade-off points by dynamically gating attention heads according to a specified computational budget. The approach achieves 1.2-1.28x speedups on CPU benchmarks while maintaining competitive accuracy across multiple datasets, enabling flexible deployment without retraining.
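The summary does not detail the gating mechanism, but the core idea of budget-conditioned head gating can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it assumes a per-head importance score (which the paper would presumably learn) and interprets the budget as the fraction of heads actually computed, skipping the rest entirely, which is where the compute saving would come from.

```python
import numpy as np

def budgeted_attention(x, Wq, Wk, Wv, head_scores, budget):
    """Hypothetical sketch of budget-conditioned attention-head gating.

    x: (seq, d_model); per-head projections Wq/Wk/Wv: (heads, d_model, d_head).
    head_scores: per-head importance (assumed learned; given here).
    budget in (0, 1]: fraction of heads to run. Inactive heads are
    skipped entirely rather than masked, so their compute is saved.
    """
    n_heads, _, d_head = Wq.shape
    k = max(1, int(round(budget * n_heads)))
    active = np.argsort(head_scores)[::-1][:k]   # keep top-k heads by score
    seq = x.shape[0]
    out = np.zeros((seq, n_heads * d_head))      # inactive heads stay zero
    for h in active:                              # only active heads computed
        q, kk, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        logits = (q @ kk.T) / np.sqrt(d_head)
        logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
        attn = np.exp(logits)
        attn /= attn.sum(axis=-1, keepdims=True)
        out[:, h * d_head:(h + 1) * d_head] = attn @ v
    return out, active
```

Calling the same function with `budget=0.5` versus `budget=1.0` changes only how many heads run, so one set of weights serves every operating point, matching the no-retraining claim in the summary.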