Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection
Researchers introduce Qrita, an algorithm for Top-k and Top-p sampling in large language models that replaces sorting with pivot-based truncation and selection. The method reports up to 1.4x higher throughput and 50% lower memory usage while producing output identical to sorting-based approaches, and has been adopted as the default Top-k and Top-p sampler in vLLM's GPU execution path.
Qrita addresses a fundamental bottleneck in LLM inference: the computational overhead of sampling from probability distributions across massive vocabularies. Top-k and Top-p sampling are essential techniques for controlling LLM output diversity and quality, but implementing them efficiently on GPUs has proven challenging. Traditional sorting-based approaches consume substantial memory and computation, while stochastic alternatives compromise determinism, a critical requirement for reproducible AI systems.
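To make the baseline concrete, here is a minimal pure-Python sketch of the sorting-based approach that Qrita is compared against. It is illustrative only: the function names (`softmax`, `top_k_top_p_sorted`) are hypothetical, and applying Top-k first and then Top-p on the renormalized survivors is an assumed convention, not something the article specifies.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_sorted(logits, k, p):
    """Return the token indices kept by Top-k followed by Top-p.

    Assumption: Top-p is applied to probabilities renormalized over the
    Top-k survivors. The full sort below is the O(V log V) step that a
    pivot-based method tries to avoid.
    """
    # Sort all vocabulary indices by descending logit (index breaks ties).
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    order = order[:k]  # Top-k: keep the k highest-scoring tokens
    probs = softmax([logits[i] for i in order])

    kept, cumulative = [], 0.0
    for idx, prob in zip(order, probs):
        kept.append(idx)
        cumulative += prob
        if cumulative >= p:  # Top-p: stop once the nucleus mass is reached
            break
    return kept
```

For a vocabulary of five tokens with logits `[2.0, 1.0, 0.5, -1.0, 3.0]`, calling `top_k_top_p_sorted(logits, k=3, p=0.8)` first narrows the set to tokens 4, 0 and 1, then keeps only tokens 4 and 0 because together they already hold more than 80% of the renormalized probability mass.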
The research builds on decades of algorithm optimization work, applying pivot-based methods long used in general-purpose sorting and selection to the specific constraints of transformer inference. Gaussian-based sigma-truncation first narrows the search space to the elements likely to contain the target, so the subsequent quaternary pivot search has far fewer candidates to examine. Quaternary pivot search with duplication handling then halves the number of pivot-search iterations and keeps the output deterministic even when values are duplicated.
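The two ideas in this paragraph can be sketched in a few lines of pure Python. This is a simplified CPU illustration under stated assumptions, not the Qrita GPU kernel: the function names, the `slack` parameter, the use of a normal quantile to pick the cutoff, and the fallback to the full list are all hypothetical choices made for the example.

```python
import random
from statistics import NormalDist, fmean, pstdev


def sigma_truncate(logits, k, slack=4.0):
    """Keep only values above mean + z * sigma (Gaussian assumption).

    z is chosen so that roughly slack * k values are expected to survive.
    If fewer than k survive, fall back to the full list (hypothetical policy).
    """
    frac = slack * k / len(logits)
    if frac >= 1.0:
        return list(logits)
    cutoff = fmean(logits) + NormalDist().inv_cdf(1.0 - frac) * pstdev(logits)
    survivors = [x for x in logits if x >= cutoff]
    return survivors if len(survivors) >= k else list(logits)


def kth_largest_pivot(values, k):
    """Find the k-th largest value using three pivots (four buckets) per pass."""
    cands, need = list(values), k
    while True:
        lo, hi = min(cands), max(cands)
        if lo == hi:
            return lo  # only duplicates remain, so the answer is exact
        step = (hi - lo) / 4.0
        p1, p2, p3 = lo + step, lo + 2 * step, lo + 3 * step
        # Count how many candidates lie at or above each pivot.
        c1 = sum(x >= p1 for x in cands)
        c2 = sum(x >= p2 for x in cands)
        c3 = sum(x >= p3 for x in cands)
        # Descend into the one bucket that must contain the target rank.
        if need <= c3:
            nxt = [x for x in cands if x >= p3]
        elif need <= c2:
            nxt, need = [x for x in cands if p2 <= x < p3], need - c3
        elif need <= c1:
            nxt, need = [x for x in cands if p1 <= x < p2], need - c2
        else:
            nxt, need = [x for x in cands if x < p1], need - c1
        if len(nxt) == len(cands):
            # Safeguard for degenerate floating-point ranges: no progress made.
            return sorted(cands, reverse=True)[need - 1]
        cands = nxt


def top_k_threshold(logits, k):
    """Top-k cutoff value: truncate first, then pivot-search the survivors."""
    return kth_largest_pivot(sigma_truncate(logits, k), k)


# Small demo on synthetic, roughly Gaussian logits (made-up data).
random.seed(0)
demo_logits = [random.gauss(0.0, 1.0) for _ in range(2000)]
demo_threshold = top_k_threshold(demo_logits, 50)
```

Because every value at or above the cutoff is retained, the k-th largest survivor equals the k-th largest value of the full list whenever at least k values survive, which is why the truncation does not change the result. Splitting the range into four buckets per pass is what lets a quaternary search converge in about half as many passes as a binary one, and returning as soon as only duplicates remain is one simple way to keep the result exact and repeatable.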
For the AI infrastructure industry, this optimization carries significant implications. Because vLLM now uses Qrita as its default Top-k and Top-p sampler, the change reaches deployments built on this popular inference engine, with knock-on effects for inference latency and cost efficiency. A throughput improvement of up to 1.4x in the sampling step can reduce operating costs for cloud providers and shorten user-facing response times. Memory savings of 50% in that step leave room for larger batch sizes or deployment on more modest hardware.
The broader significance lies in optimization trends within AI infrastructure. As transformer models plateau in size, performance gains increasingly come from algorithmic improvements rather than architectural breakthroughs. Qrita exemplifies this shift: a focused engineering contribution that delivers measurable production benefits. Its open-source availability through vLLM should speed adoption across the ecosystem and set a new efficiency baseline for sampling.
- Qrita improves LLM sampling throughput by up to 1.4x while reducing memory consumption by 50% compared to existing approaches
- The algorithm uses pivot-based truncation with Gaussian-based sigma-truncation instead of sorting, eliminating stochastic approximation trade-offs
- Qrita has been adopted as the default Top-k and Top-p sampler in vLLM's GPU execution path, affecting production inference across the ecosystem
- The approach guarantees deterministic output identical to sorting-based algorithms while improving computational efficiency
- Open-source implementation available on GitHub enables rapid industry adoption and benchmarking