y0news
#ai-inference2 articles
2 articles
AINeutralarXiv โ€“ CS AI ยท 4h ago3
๐Ÿง 

SLA-Aware Distributed LLM Inference Across Device-RAN-Cloud

Researchers tested distributed AI inference across device, edge, and cloud tiers in a 5G network, finding that sub-second AI response times required for embodied AI are challenging to achieve. On-device execution took multiple seconds, while RAN-edge deployment with quantized models could meet 0.5-second deadlines, and cloud deployment achieved 100% success for 1-second deadlines.

$NEAR
AIBullisharXiv โ€“ CS AI ยท 4h ago4
๐Ÿง 

Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling

Researchers propose Semantic Parallelism, a new framework called Sem-MoE that significantly improves efficiency of large language model inference by optimizing how AI models distribute computational tasks across multiple devices. The system reduces communication overhead between devices by 'collocating' frequently-used model components with their corresponding data, achieving superior throughput compared to existing solutions.