AI · Bullish · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers demonstrate that tree-structured sparse feed-forward layers can replace dense MLPs in large transformer models while maintaining performance, activating less than 5% of parameters per token. The work reveals an emergent auto-pruning mechanism where hard routing progressively converts dynamic sparsity into static structure, offering a scalable approach to reducing computational costs in language models beyond 1 billion parameters.
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers propose an expert-wise mixed-precision quantization strategy for Mixture-of-Experts models that assigns bit-widths based on router gradient changes and neuron variance. The method achieves higher accuracy than existing approaches while reducing inference memory overhead on large-scale models like Switch Transformer and Mixtral with minimal computational overhead.
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers demonstrate that large speech language models contain significant redundancy in their token representations, particularly in deeper layers. By introducing Affinity Pooling, a training-free token merging technique, they achieve a 27.48% reduction in prefilling FLOPs and up to 1.7× memory savings while maintaining semantic accuracy, challenging the necessity of fully distinct tokens for acoustic processing.
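To make the idea of training-free token merging concrete, here is a minimal sketch that averages runs of adjacent token vectors when they are nearly parallel. It is a generic illustration, not the Affinity Pooling algorithm itself: the function names, the cosine-similarity criterion, and the `threshold` value are all assumptions.

```python
import math


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_adjacent(tokens, threshold=0.9):
    """Greedily merge runs of adjacent vectors (hypothetical criterion).

    A vector joins the current run when its cosine similarity to the
    run's mean is at least `threshold`; each run is replaced by its mean.
    """
    merged, group = [], []
    for vec in tokens:
        if group and cosine(mean(group), vec) >= threshold:
            group.append(vec)
        else:
            if group:
                merged.append(mean(group))
            group = [vec]
    if group:
        merged.append(mean(group))
    return merged


tokens = [[1, 0], [0.99, 0.05], [0, 1], [0, 1.0]]
shorter = merge_adjacent(tokens)
print(len(tokens), "->", len(shorter))
```

Fewer vectors entering later layers is where savings of this kind would come from; how the paper selects and merges tokens may differ substantially.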
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Q-Zoom is a new framework that improves the efficiency of multimodal large language models (MLLMs) when processing high-resolution visual inputs. Using adaptive query-aware perception, the system runs inference 2.5-4.4x faster on document and high-resolution tasks while matching or exceeding baseline accuracy across multiple MLLM architectures.
AI · Bullish · arXiv – CS AI · Apr 6 · 7/10
🧠Researchers introduce OSCAR, a training-free framework that reduces AI hallucinations in diffusion language models by using cross-chain entropy to detect uncertain token positions during generation. The system runs parallel denoising chains and performs targeted remasking with retrieved evidence to improve factual accuracy without requiring external hallucination classifiers.
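One way to picture a cross-chain uncertainty signal is to measure, position by position, how much several parallel generations disagree. The sketch below uses the empirical distribution of sampled tokens across chains as a stand-in; the function names and the `threshold` are hypothetical, and the actual method may compute entropy differently (for example, from model probabilities rather than sampled tokens).

```python
import math
from collections import Counter


def cross_chain_entropy(chains):
    """Per-position entropy (in nats) of tokens across equal-length chains."""
    if not chains:
        return []
    total = len(chains)
    entropies = []
    for position in range(len(chains[0])):
        counts = Counter(chain[position] for chain in chains)
        entropy = -sum(
            (c / total) * math.log(c / total) for c in counts.values()
        )
        entropies.append(entropy)
    return entropies


def uncertain_positions(chains, threshold=0.5):
    """Indices whose cross-chain entropy exceeds the (assumed) threshold."""
    return [
        i for i, h in enumerate(cross_chain_entropy(chains)) if h > threshold
    ]


chains = [
    ["Paris", "is", "in", "France"],
    ["Paris", "is", "in", "Spain"],
    ["Paris", "is", "in", "France"],
    ["Paris", "is", "in", "Italy"],
]
flagged = uncertain_positions(chains)
print(flagged)
```

Positions flagged this way would be the candidates for remasking; retrieving evidence and re-denoising are outside the scope of this sketch.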
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers demonstrate that large language models can perform reinforcement learning during inference through a new 'in-context RL' prompting framework. The method shows LLMs can optimize scalar reward signals to improve response quality across multiple rounds, achieving significant improvements on complex tasks like mathematical competitions and creative writing.
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers propose MTP-D, a self-distillation method that improves Multi-Token Prediction for Large Language Models, achieving 7.5% better acceptance rates and up to 220% inference speedup. The technique addresses key challenges in training multiple prediction heads while preserving main model performance.
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers introduce Bottlenecked Transformers, a new architecture that improves AI reasoning by up to 6.6 percentage points through periodic memory consolidation inspired by brain processes. The system uses a Cache Processor to rewrite key-value cache entries at reasoning step boundaries, achieving better performance on math reasoning benchmarks compared to standard Transformers.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce Orla, a new library that simplifies the development and deployment of LLM-based multi-agent systems by providing a serving layer that separates workflow execution from policy decisions. The library offers stage mapping, workflow orchestration, and memory management capabilities that improve performance and reduce costs compared to single-model baselines.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce RelayCaching, a training-free method that accelerates multi-agent LLM systems by reusing KV cache data from previous agents to eliminate redundant computation. The technique achieves over 80% cache reuse and reduces time-to-first-token by up to 4.7x while maintaining accuracy across mathematical reasoning, knowledge tasks, and code generation.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠ICaRus introduces a novel architecture enabling multiple AI models to share identical Key-Value (KV) caches, addressing memory explosion issues in multi-model inference systems. The solution achieves up to 11.1x lower latency and 3.8x higher throughput by allowing cross-model cache reuse while maintaining comparable accuracy to task-specific fine-tuned models.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce FlashHead, a training-free replacement for classification heads in language models that delivers up to 1.75x inference speedup while maintaining accuracy. The innovation addresses a critical bottleneck where classification heads consume up to 60% of model parameters and 50% of inference compute in modern language models.
🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers have developed Pyramid MoA, a new framework that optimizes large language model inference costs by using a hierarchical router system that escalates queries to more expensive models only when necessary. The system achieves up to 62.7% cost savings while maintaining Oracle-level accuracy on various benchmarks including coding and mathematical reasoning tasks.
🧠 Llama
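The escalate-only-when-necessary idea behind hierarchical routing can be sketched as a simple cascade: try the cheapest model first and move up a tier only if its confidence is too low. Everything here is hypothetical, including the `Model` signature, the confidence scores, the costs, and the `threshold`; Pyramid MoA's actual router is not described in the summary above.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(
    query: str,
    tiers: List[Tuple[Model, float]],
    threshold: float = 0.8,
) -> Tuple[str, float, int]:
    """Run tiers cheapest-first; stop at the first confident answer.

    `tiers` holds (model, cost) pairs. Returns the answer, the total cost
    of every model that was called, and the index of the tier that answered.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    total_cost = 0.0
    answer = ""
    for index, (model, cost) in enumerate(tiers):
        answer, confidence = model(query)
        total_cost += cost
        if confidence >= threshold:
            return answer, total_cost, index
    # No tier was confident: fall back to the last (most capable) answer.
    return answer, total_cost, len(tiers) - 1


def small_model(query: str) -> Tuple[str, float]:
    """Stand-in for a cheap model that is only sure about easy queries."""
    return ("4", 0.95) if query == "2+2" else ("?", 0.3)


def large_model(query: str) -> Tuple[str, float]:
    """Stand-in for an expensive model."""
    return ("42", 0.99)


tiers = [(small_model, 1.0), (large_model, 10.0)]
print(cascade("2+2", tiers))
print(cascade("a hard question", tiers))
```

Savings in such a scheme depend on how often the cheap tier is confident and correct, so the quality of the confidence signal matters as much as the models themselves.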
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers introduce MoE-SpAc, a new framework for efficient Mixture-of-Experts model inference on edge devices that achieves a 42% improvement over existing baselines. The system uses speculative decoding as a memory management tool and demonstrates a 4.04x average speedup across benchmarks.
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers developed Adaptive Activation Cancellation (AAC), a real-time framework that reduces hallucinations in large language models by identifying and suppressing problematic neural activations during inference. The method requires no fine-tuning or external knowledge and preserves model capabilities while improving factual accuracy across multiple model scales including LLaMA 3-8B.
🏢 Perplexity
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers analyze FP4 quantization sensitivity across the layers of large language models, using the NVFP4 and MXFP4 formats on Qwen2.5 models. The study finds MLP projection layers are most sensitive to quantization, while attention layers show substantial robustness to FP4 precision reduction.
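For a sense of how coarse FP4 is, the sketch below fake-quantizes a block of values onto the E2M1 magnitude grid (0, 0.5, 1, 1.5, 2, 3, 4, 6) with one shared scale per block. It is a simplification under stated assumptions: the scale is a plain Python float, ties round toward the smaller magnitude, and the block size is arbitrary, whereas NVFP4 and MXFP4 each constrain the block size and the scale's own number format. The `mse` helper is a hypothetical sensitivity proxy, not the study's metric.

```python
import math

# Representable magnitudes of a 4-bit E2M1 float.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Quantize then dequantize one block with a shared scale (assumed float)."""
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return [0.0] * len(block)
    scale = peak / 6.0  # map the largest magnitude onto the top grid value
    out = []
    for x in block:
        scaled = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(scaled - g))
        out.append(math.copysign(nearest * scale, x))
    return out


def mse(original, quantized):
    """Mean squared error between a block and its quantized version."""
    return sum((a - b) ** 2 for a, b in zip(original, quantized)) / len(original)


weights = [6.0, -3.0, 0.4, 2.4]
restored = quantize_block(weights)
print(restored, mse(weights, restored))
```

Comparing an error measure like this layer by layer is one simple way to see why some weight matrices tolerate 4-bit formats better than others.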
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers have developed ALADIN, a framework for analyzing accuracy-latency trade-offs in AI accelerators for embedded systems. The tool enables evaluation of quantized neural networks without requiring deployment on target hardware, potentially reducing development time and costs for AI chip designers.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce Efficient Draft Adaptation (EDA), a framework that significantly reduces the cost of adapting draft models for speculative decoding when target LLMs are fine-tuned. EDA achieves superior performance through decoupled architecture, data regeneration, and smart sample selection while requiring substantially less training resources than full retraining.
AI · Bullish · arXiv – CS AI · Mar 9 · 7/10
🧠Researchers use activation steering to reduce reasoning biases in large language models, particularly the tendency to confuse content plausibility with logical validity. Their K-CAST method achieved up to a 15% improvement in formal reasoning accuracy while remaining robust across different tasks and languages.
AI · Bullish · arXiv – CS AI · Mar 9 · 7/10
🧠Researchers introduce COLD-Steer, a training-free framework that enables efficient control of large language model behavior at inference time using just a few examples. The method approximates gradient descent effects without parameter updates, achieving 95% steering effectiveness while using 50 times fewer samples than existing approaches.
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers developed EvoPrune, a new method that prunes visual tokens during the encoding stage of Multimodal Large Language Models (MLLMs) rather than after encoding. The technique achieves 2x inference speedup with less than 1% performance loss on video datasets, addressing efficiency bottlenecks in AI models processing high-resolution images and videos.
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers introduce Multi-Sequence Verifier (MSV), a new technique that improves large language model performance by jointly processing multiple candidate solutions rather than scoring them individually. The system achieves better accuracy while reducing inference latency by approximately half through improved calibration and early-stopping strategies.
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers introduce OSCAR, a new query-dependent online soft compression method for Retrieval-Augmented Generation (RAG) systems that reduces computational overhead while maintaining performance. The method achieves 2-5x faster inference with minimal accuracy loss across LLMs from 1B to 24B parameters.
🏢 Hugging Face
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠Nightjar is a new adaptive speculative decoding framework for large language models that dynamically adjusts to system load conditions. It achieves 27.29% higher throughput and up to 20.18% lower latency by intelligently enabling or disabling speculation based on workload demands.
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 4
🧠Researchers propose a 'best-of-∞' approach for large language models: majority voting in the limit of infinitely many samples, which achieves superior performance but would require infinite computation. To make it practical, they develop an adaptive generation scheme that dynamically chooses how many samples to draw based on answer agreement, and they extend the framework to weighted ensembles of multiple LLMs.
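Agreement-based adaptive sampling can be illustrated with a short loop that keeps drawing answers until the leading answer holds a large enough share of the votes. This is a fixed-threshold heuristic chosen for illustration, not the paper's stopping rule; `sample`, `min_samples`, `max_samples`, and `agreement` are all hypothetical names and values.

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_majority_vote(
    sample: Callable[[], str],
    min_samples: int = 3,
    max_samples: int = 50,
    agreement: float = 0.7,
) -> Tuple[str, int]:
    """Draw answers until the leader's vote share reaches `agreement`.

    Returns the majority answer and the number of samples actually drawn.
    Stops at `max_samples` if agreement is never reached.
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    counts: Counter = Counter()
    answer = ""
    for drawn in range(1, max_samples + 1):
        counts[sample()] += 1
        answer, votes = counts.most_common(1)[0]
        if drawn >= min_samples and votes / drawn >= agreement:
            return answer, drawn
    return answer, max_samples


# Deterministic stand-in for an LLM call, so the run is reproducible.
replies = iter(["A", "B", "A", "A", "A"])
result = adaptive_majority_vote(lambda: next(replies))
print(result)
```

Easy questions, where samples agree quickly, stop early, while contested ones consume more of the budget; that is the trade-off adaptive schemes of this kind aim to exploit.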