y0news

#inference-speed News & Analysis

9 articles tagged with #inference-speed. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠

MEMENTO: Teaching LLMs to Manage Their Own Context

Researchers introduce MEMENTO, a method enabling large language models to compress their reasoning into dense summaries (mementos) organized into blocks, reducing KV cache usage by 2.5x and improving throughput by 1.75x while maintaining accuracy. The technique is validated across multiple model families using OpenMementos, a new dataset of 228K annotated reasoning traces.
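The compression idea can be sketched in a few lines. This is a toy illustration, not the paper's method: the learned summarizer is replaced by a stand-in that keeps a block's first and last tokens, and the block and summary sizes are made up.

```python
# Toy sketch of MEMENTO-style context compression (illustrative only):
# after each completed reasoning block of `block_size` tokens, replace
# the block with a short `summary_size`-token "memento", shrinking the
# KV cache that later steps must attend over.

def compress_context(tokens, block_size=8, summary_size=2):
    """Replace each full block of tokens with a fixed-size summary stub."""
    compressed = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        if len(block) == block_size:
            # Stand-in for a learned summarizer: keep first/last tokens.
            compressed.extend([block[0], block[-1]][:summary_size])
        else:
            compressed.extend(block)  # incomplete tail block kept verbatim
    return compressed

tokens = list(range(20))            # 20-token reasoning trace
short = compress_context(tokens)    # two full blocks compressed + tail
print(len(tokens), "->", len(short))  # 20 -> 8, a 2.5x cache reduction
```

With these toy sizes the cache shrinks 2.5x, matching the ratio reported in the summary only by coincidence of the chosen parameters.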

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10
🧠

SWAA: Sliding Window Attention Adaptation for Efficient and Quality Preserving Long Context Processing

Researchers propose SWAA (Sliding Window Attention Adaptation), a toolkit that enables efficient long-context processing in large language models by adapting full attention models to sliding window attention without expensive retraining. The solution achieves 30-100% speedups for long context inference while maintaining acceptable performance quality through four core strategies that address training-inference mismatches.
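The core masking change is easy to picture. The sketch below is not the SWAA toolkit; it only shows the sliding-window attention pattern itself, where each query attends to at most the last `window` keys instead of the full causal prefix.

```python
import numpy as np

# Minimal sliding-window attention mask: query i may attend to key j
# only when j <= i (causal) and i - j < window (recency cutoff), so
# per-query work stays constant instead of growing with context length.

def sliding_window_mask(seq_len, window):
    """Boolean mask, True where attention is allowed."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(6, window=3)
print(mask.sum(axis=1))  # [1 2 3 3 3 3]: attended keys cap at the window
```

The adaptation problem SWAA targets is that a model pretrained with full attention never saw this truncated pattern, hence the training-inference mismatch the toolkit's four strategies address.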

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10
🧠

GPUTOK: GPU Accelerated Byte Level BPE Tokenization

Researchers developed GPUTOK, a GPU-accelerated tokenizer for large language models that processes text significantly faster than existing CPU-based solutions. The optimized version shows 1.7x speed improvement over tiktoken and 7.6x over HuggingFace's GPT-2 tokenizer while maintaining output quality.
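For reference, the merge rule being accelerated looks like this on CPU. This is a generic byte-level BPE sketch with a made-up merge table, not GPUTOK's implementation; GPUTOK's contribution is parallelizing this scan on the GPU.

```python
# CPU reference for byte-level BPE merging. The merge table maps a pair
# of byte sequences to its rank (lower rank = higher merge priority).

def bpe_encode(text, merges):
    """Greedily apply ranked pair merges to the UTF-8 byte sequence."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    while True:
        # Find the highest-priority mergeable adjacent pair.
        best, best_rank = None, None
        for i in range(len(seq) - 1):
            rank = merges.get((seq[i], seq[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best, best_rank = i, rank
        if best is None:
            return seq
        seq = seq[:best] + [seq[best] + seq[best + 1]] + seq[best + 2:]

merges = {(b"h", b"e"): 0, (b"he", b"y"): 1}  # toy merge table
print(bpe_encode("hey", merges))  # [b'hey']
```

The inner scan over adjacent pairs is the serial bottleneck that a GPU tokenizer can batch across many text segments at once.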

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠

MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

MiniCPM-SALA introduces a 9B-parameter hybrid language model architecture that combines sparse and linear attention mechanisms to handle ultra-long contexts up to 1M tokens. The model achieves 3.5x faster inference than full-attention models while reducing training costs by 75% through a continual training framework that transforms existing Transformer models.
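The linear-attention half of such a hybrid is what makes million-token contexts tractable. Below is a generic causal linear-attention sketch, not MiniCPM-SALA's architecture; the feature map `phi` (elu + 1) is a common stand-in choice.

```python
import numpy as np

# Causal linear (kernelized) attention: replacing softmax(QK^T)V with
# phi(Q) applied to running sums of phi(K) and V cuts cost from
# O(n^2 * d) to O(n * d^2), i.e. linear in sequence length n.

def linear_attention(Q, K, V):
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0
    Qf, Kf = phi(Q), phi(K)
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # running sum of outer(k_t, v_t)
    z = np.zeros(d)                 # running sum of k_t (normalizer)
    out = np.empty_like(V)
    for t in range(n):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = Qf[t] @ S / (Qf[t] @ z)
    return out

Q = K = np.eye(3)
V = np.array([[1., 0.], [0., 1.], [1., 1.]])
out = linear_attention(Q, K, V)
print(out.shape)  # (3, 2); the recurrent state, not the context, is stored
```

A hybrid like the one described keeps sparse attention on some layers for precision and linear attention elsewhere for throughput.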

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Researchers introduce VisionZip, a new method that reduces redundant visual tokens in vision-language models while maintaining performance. The technique improves inference speed by 8x and achieves 5% better performance than existing methods by selecting only informative tokens for processing.
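Token selection of this kind can be sketched simply. This is an illustrative stand-in, not VisionZip's exact rule: it keeps the k visual tokens with the highest attention weight from a [CLS]-style token, with made-up scores and dimensions.

```python
import numpy as np

# Attention-based visual token pruning: rank patch tokens by how much a
# summary ([CLS]) token attends to them, and keep only the top k before
# they enter the language model.

def prune_visual_tokens(tokens, cls_attention, keep=4):
    """Return the top-`keep` tokens ranked by CLS attention weight."""
    order = np.argsort(cls_attention)[::-1][:keep]
    return tokens[np.sort(order)]   # preserve original token order

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 visual tokens, dim 8
attn = rng.random(16)               # CLS -> patch attention weights
kept = prune_visual_tokens(tokens, attn, keep=4)
print(kept.shape)  # (4, 8): 4x fewer tokens reach the LLM
```

Since decoder cost scales with the number of visual tokens, pruning to a fraction of them is where the reported speedup comes from.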

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠

One-Token Verification for Reasoning Correctness Estimation

Researchers introduce One-Token Verification (OTV), a new method that estimates reasoning correctness in large language models during a single forward pass, reducing computational overhead. OTV reduces token usage by up to 90% through early termination while improving accuracy on mathematical reasoning tasks compared to existing verification methods.
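The early-termination mechanism can be illustrated with a hypothetical probe. Nothing below comes from the paper: the logistic probe, threshold, and hidden states are all invented to show the control flow of stopping once a verifier signals the answer is already correct.

```python
import numpy as np

# Hypothetical OTV-style early termination: after each reasoning step,
# a lightweight probe maps the hidden state to P(correct); generation
# stops as soon as confidence crosses a threshold, saving the tokens
# the remaining steps would have produced.

def generate_with_otv(step_states, probe_w, threshold=0.9):
    """Return the number of steps actually run before the probe fired."""
    for used, h in enumerate(step_states, start=1):
        p_correct = 1.0 / (1.0 + np.exp(-(h @ probe_w)))  # logistic probe
        if p_correct >= threshold:
            return used
    return len(step_states)

states = np.array([[-1.0, 0, 0, 0]] * 5 + [[5.0] * 4] * 5)  # toy states
probe = np.ones(4)                                          # toy weights
print(generate_with_otv(states, probe))  # 6: stops instead of running all 10
```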

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10
🧠

Toward General Semantic Chunking: A Discriminative Framework for Ultra-Long Documents

Researchers developed a new discriminative AI model based on Qwen3-0.6B that can efficiently segment ultra-long documents up to 13k tokens for better information retrieval. The model achieves superior performance compared to generative alternatives while delivering two orders of magnitude faster inference on the Wikipedia WIKI-727K dataset.
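A discriminative chunker of this shape reduces to boundary classification. The sketch below is a generic illustration, not the paper's model: the per-boundary scores stand in for the Qwen3-based classifier's outputs, and the threshold is arbitrary.

```python
# Discriminative semantic chunking: a classifier scores each sentence
# boundary, and the document splits wherever the score clears a
# threshold, no generation required.

def chunk_by_boundary_scores(sentences, scores, threshold=0.5):
    """Split `sentences` after each boundary whose score >= threshold."""
    chunks, current = [], []
    for sent, score in zip(sentences, scores + [1.0]):  # force final flush
        current.append(sent)
        if score >= threshold:
            chunks.append(current)
            current = []
    return chunks

sents = ["A.", "B.", "C.", "D."]
boundary = [0.1, 0.9, 0.2]   # classifier scores for the 3 inner boundaries
print(chunk_by_boundary_scores(sents, boundary))  # [['A.', 'B.'], ['C.', 'D.']]
```

Scoring boundaries in one pass, rather than generating chunk text, is what yields the two-orders-of-magnitude inference advantage over generative segmenters.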

AI · Bullish · Hugging Face Blog · Sep 10 · 6/10
🧠

Block Sparse Matrices for Smaller and Faster Language Models

The article discusses block sparse matrices as a technique to create smaller and faster language models. This approach could significantly reduce computational requirements and memory usage in AI systems while maintaining performance.
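The compute saving is visible in a minimal sketch. This is a generic block-sparse matmul, not the blog's implementation: only the nonzero blocks of the weight matrix are stored and multiplied, skipping work in proportion to the block sparsity.

```python
import numpy as np

# Block-sparse matmul: W is stored as a dict of dense blocks keyed by
# (row_block, col_block); absent blocks are zero and cost nothing.

def block_sparse_matmul(x, blocks, block_size, out_dim):
    """Compute y = x @ W for a block-sparse W."""
    y = np.zeros((x.shape[0], out_dim))
    b = block_size
    for (rb, cb), w in blocks.items():
        y[:, cb * b:(cb + 1) * b] += x[:, rb * b:(rb + 1) * b] @ w
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))
# 8x8 weight with only 2 of 4 blocks present (50% block sparsity).
blocks = {(0, 0): rng.normal(size=(4, 4)), (1, 1): rng.normal(size=(4, 4))}
y = block_sparse_matmul(x, blocks, block_size=4, out_dim=8)
print(y.shape)  # (2, 8), computed with half the multiply-adds
```

Blocks (rather than individual zeros) matter because hardware runs dense tiles efficiently, so the saving actually materializes as speed.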

AI · Bullish · Hugging Face Blog · Dec 20 · 4/10
🧠

Speculative Decoding for 2x Faster Whisper Inference

The article applies speculative decoding to Whisper inference, achieving roughly 2x faster processing. No article body was available, so the specific implementation and its implications could not be analyzed.
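The general technique, independent of Whisper's specifics, can be sketched as follows: a cheap draft model proposes several tokens, the large model verifies them, and every agreed-upon token is accepted without a separate full-model decoding step. The toy models below are invented for illustration.

```python
# Toy speculative decoding step: draft k tokens, then keep the longest
# prefix the target model agrees with; on the first disagreement, take
# the target model's token instead and stop.

def speculative_step(draft_model, target_model, prefix, k=4):
    drafted = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_model(ctx)
        drafted.append(tok)
        ctx.append(tok)
    accepted = []
    for tok in drafted:
        if target_model(list(prefix) + accepted) == tok:
            accepted.append(tok)   # target agrees: token accepted for free
        else:
            accepted.append(target_model(list(prefix) + accepted))
            break                  # first disagreement: take target's token
    return accepted

# Toy models: target always adds 1; draft matches it until the value 2.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] < 2 else 99
print(speculative_step(draft, target, [0], k=4))  # [1, 2, 3]
```

In practice the target model scores all drafted tokens in a single batched forward pass, which is where the ~2x wall-clock gain comes from when the draft model agrees often.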