arXiv — CS AI · 14h ago
A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs
A-IO addresses memory-bound bottlenecks in LLM deployment on NPU platforms such as Ascend 910B, tackling the "Model Scaling Paradox" and the limitations of current speculative decoding techniques. The research shows that static single-model deployment strategies and kernel-synchronization overhead significantly constrain inference performance on heterogeneous accelerators.