AI · Bullish · arXiv – CS AI · 15h ago · 7/10
Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection
Researchers introduce Qrita, an efficient algorithm for Top-k and Top-p sampling in large language models that uses pivot-based truncation instead of sorting. The method delivers a 1.4x throughput improvement and uses 50% less memory while producing output identical to that of sorting-based approaches, and it has been adopted as the default sampler in vLLM.
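
To make the idea of selecting by pivot rather than by sorting concrete, here is a minimal sketch in plain Python. It is not Qrita's implementation, which the summary does not describe in detail. It is a generic quickselect-style illustration, and the function names (`kth_largest`, `top_k_mask`, `top_p_threshold`) are hypothetical. Each function finds the cutoff value for Top-k or Top-p by repeatedly partitioning the candidates around a pivot, so the full vocabulary is never sorted.

```python
import math
import random


def kth_largest(values, k, rng=random):
    """Return the k-th largest value (1-indexed) by pivot partitioning, without sorting."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be in [1, len(values)]")
    candidates = list(values)
    while True:
        pivot = rng.choice(candidates)
        greater = [v for v in candidates if v > pivot]
        equal_count = sum(1 for v in candidates if v == pivot)
        if k <= len(greater):
            # The answer lies strictly above the pivot.
            candidates = greater
        elif k <= len(greater) + equal_count:
            # The pivot itself is the k-th largest value.
            return pivot
        else:
            # Discard everything >= pivot and keep searching below it.
            k -= len(greater) + equal_count
            candidates = [v for v in candidates if v < pivot]


def top_k_mask(logits, k):
    """Keep logits at or above the k-th largest; set the rest to -inf.

    Ties at the cutoff are all kept, so more than k entries may survive.
    """
    threshold = kth_largest(logits, k)
    return [x if x >= threshold else -math.inf for x in logits]


def top_p_threshold(probs, p, rng=random):
    """Return the smallest probability that belongs to the Top-p nucleus.

    The nucleus is the shortest run of highest-probability tokens whose
    total mass reaches p. Tokens with probability >= the returned value
    form that nucleus (ties at the cutoff are all included).
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    candidates = list(probs)
    needed = p  # probability mass still to be covered by the candidates
    while True:
        pivot = rng.choice(candidates)
        greater = [v for v in candidates if v > pivot]
        equal_mass = pivot * sum(1 for v in candidates if v == pivot)
        mass_greater = sum(greater)
        if greater and mass_greater >= needed:
            # The cutoff lies strictly above the pivot.
            candidates = greater
        elif mass_greater + equal_mass >= needed:
            return pivot
        else:
            less = [v for v in candidates if v < pivot]
            if not less:
                # Rounding left a sliver of mass uncovered: keep everything.
                return pivot
            needed -= mass_greater + equal_mass
            candidates = less


logits = [2.0, -1.0, 0.5, 3.0, 1.0]
probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_mask(logits, 2))
print(top_p_threshold(probs, 0.9))
```

Each pass discards the side of the partition that cannot contain the cutoff, which is why this style of selection can avoid the cost of a full sort. The list-based version above favors readability over speed; a GPU sampler would work on tensors and batches instead.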