AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce TARAC, a training-free framework that mitigates hallucinations in Large Vision-Language Models by dynamically preserving visual attention across generation steps. The method achieves significant improvements—reducing hallucinated content by 25.2% and boosting perception scores by 10.65—while adding only ~4% computational overhead, making it practical for real-world deployment.
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers developed a weak supervision framework to detect hallucinations in large language models by distilling grounding signals into transformer representations during training. Using substring matching, sentence embeddings, and LLM judges, they created a 15,000-sample dataset and trained five probing classifiers that achieve hallucination detection from internal activations alone at inference time, eliminating the need for external verification systems.
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers introduce MoBiE, a novel binarization framework designed specifically for Mixture-of-Experts large language models that achieves significant efficiency gains through weight compression while maintaining model performance. The method addresses unique challenges in quantizing MoE architectures and demonstrates over 2× inference speedup with substantial perplexity reductions on benchmark models.
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers developed HeteroServe, a system that optimizes multimodal large language model inference by partitioning vision encoding and language generation across different GPU tiers. The approach reduces data transfer requirements and achieves 31-40% cost savings while improving throughput by up to 54% compared to existing systems.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduce FreeKV, a training-free optimization framework that dramatically improves KV cache retrieval efficiency for large language models with long context windows. The system achieves up to 13x speedup compared to existing methods while maintaining near-lossless accuracy through speculative retrieval and hybrid memory layouts.
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers analyzed 20 Mixture-of-Experts (MoE) language models to study local routing consistency, finding a trade-off between routing consistency and local load balance. The study introduces new metrics to measure how well expert offloading strategies can optimize memory usage on resource-constrained devices while maintaining inference speed.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers developed a new scaling law for large language models that optimizes both accuracy and inference efficiency by examining architectural factors like hidden size, MLP-to-attention ratios, and grouped-query attention. Testing over 200 models from 80M to 3B parameters, they found optimized architectures achieve 2.1% higher accuracy and 42% greater inference throughput compared to LLaMA-3.2.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers introduce S2O, a new sparse attention method that uses online permutation and early stopping to dramatically improve AI model efficiency. The technique achieves 3.81x end-to-end speedup on Llama-3.1-8B with 128K context while maintaining accuracy.
AI · Bullish · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce OC-VTP, a lightweight vision token pruning method for Vision Language Models that reduces computational overhead by selectively retaining the most representative visual tokens without requiring model fine-tuning. The approach maintains inference accuracy across all pruning ratios while providing computational efficiency gains and interpretability benefits.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce ECHO, a novel test-time reinforcement learning algorithm that addresses rollout collapse and noisy pseudo-labels through entropy-confidence hybrid optimization. The method improves sampling efficiency and training robustness across mathematical and visual reasoning benchmarks while performing better under limited computational budgets.
AI · Bullish · arXiv – CS AI · 4d ago · 6/10
🧠Researchers present a new quantization method for large video diffusion models that achieves 59.3% memory reduction while maintaining near-baseline quality. The technique addresses challenges in compressing Wan2.2-I2V's mixture-of-experts architecture by using timestep-aware and expert-specific calibration strategies.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers propose a critique-and-routing controller for multi-agent LLM systems that iteratively refines outputs through sequential decision-making rather than one-shot routing. The method uses reinforcement learning with agent-utilization constraints to achieve performance approaching the strongest agent while reducing computational calls by over 75%, advancing coordination efficiency in heterogeneous AI systems.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers have released LLMSYS-HPOBench, the first comprehensive benchmark suite for hyperparameter optimization in real-world LLM systems, containing 364,450 configurations across 932 settings with multiple fidelity factors and cost metrics. The dataset addresses gaps in existing AutoML benchmarks by capturing the unprecedented complexity of optimizing both AI and non-AI components in production language model systems.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers present a communication-theoretic framework that unifies LLM reliability techniques (retry, majority voting, self-consistency) under classical information theory, introducing a cost-aware router that achieves 56% lower costs than fixed approaches while maintaining quality. The work demonstrates that no single reliability technique dominates across all tasks, supporting dynamic per-task allocation strategies.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce COAST, a novel pruning framework for vision-language models that reduces visual tokens by 77.8% while maintaining 98.64% performance and achieving 2.15x speedup. Unlike existing methods that discard low-attention tokens, COAST uses adaptive semantic routing to preserve contextually essential information, preventing 'Visual Aphasia'—a failure mode where models lose visual grounding.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers propose VecCISC, an optimization framework for weighted majority voting in large language models that reduces computational costs by 47% while maintaining accuracy. The method filters redundant or hallucinated reasoning traces using semantic similarity before evaluation, addressing the expensive overhead of confidence-scoring multiple candidate answers.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce Temporal Token Fusion (TTF), a training-free compression technique that reduces visual tokens in video-language models by 67% while maintaining 99.5% accuracy. The method addresses the critical bottleneck of LLM prefill costs in video understanding by identifying and fusing redundant tokens across video frames using local similarity matching.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Fluxion, a new hybrid CPU-GPU system, optimizes long-context inference by efficiently managing key-value caches split between host and GPU memory. The approach delivers 1.5x-3.7x speedup over existing baselines while maintaining near-baseline accuracy, addressing a critical bottleneck in modern large language model deployment.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce CA-SQL, an advanced Text-to-SQL pipeline that dynamically allocates computational resources based on task complexity to improve LLM reasoning. The method achieves state-of-the-art performance on the BIRD benchmark's challenging tier using only GPT-4o-mini, outperforming larger models and demonstrating the efficiency gains possible through intelligent inference-time optimization.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce MemSearcher, an AI agent framework that optimizes how large language models handle multi-turn interactions by maintaining compact memory instead of concatenating full conversation history. The approach uses a novel multi-context GRPO training method and demonstrates superior performance while maintaining stable token counts, reducing computational overhead.
AI · Bullish · arXiv – CS AI · May 1 · 6/10
🧠BoostLoRA introduces a gradient-boosting framework that enables parameter-efficient fine-tuning adapters to grow their effective rank iteratively, allowing ultra-low-parameter models to match or exceed full fine-tuning performance across mathematical reasoning, code generation, and protein classification tasks. The method merges adapters with zero inference overhead while maintaining minimal per-round parameter costs.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers demonstrate that reward-weighted classifier-free guidance (RCFG) can dynamically adjust autoregressive model outputs to optimize arbitrary reward functions at test time without retraining. Applied to molecular generation, this approach enables real-time optimization of competing objectives and accelerates reinforcement learning convergence when used as a teacher for policy distillation.
AI · Bullish · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers introduce HintMR, a hint-assisted reasoning framework that improves mathematical problem-solving in small language models by using a separate hint-generating model to provide contextual guidance through multi-step problems. This collaborative two-model system demonstrates significant accuracy improvements over standard prompting while maintaining computational efficiency.
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers analyzed how LLM verifiers assess solution correctness in test-time scaling scenarios, revealing that verification effectiveness varies significantly with problem difficulty, generator strength, and verifier capability. The study demonstrates that weak generators can nearly match stronger ones post-verification and that verifier scaling alone cannot solve fundamental verification challenges.
AI · Bullish · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers propose Tool-Internalized Reasoning (TInR), a framework that embeds tool knowledge directly into Large Language Models rather than relying on external tool documentation during reasoning. The TInR-U model uses a three-phase training pipeline combining knowledge alignment, supervised fine-tuning, and reinforcement learning to improve reasoning efficiency and performance across various tasks.