AI · Neutral · arXiv – CS AI · 3d ago · 7/10
🧠Researchers identify a critical failure mode in large reasoning models where they detect insufficient information but still produce unsupported answers instead of abstaining. The proposed Judge-Then-Solve (JTS) framework trains models to make explicit answerability commitments before reasoning, significantly improving safe abstention rates and inference efficiency.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠PrunePath is a new structured sparsification framework that optimizes feed-forward networks in language models by replacing traditional pruning methods with a softmax-normalized routing system. The approach converts model sparsity into practical hardware efficiency gains, demonstrated through memory savings and faster decoding speeds via custom Triton kernels.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers conducted an extensive empirical study evaluating FP8, INT8, and INT4 quantization formats across the Llama-3.1 model family, finding that FP8 is effectively lossless while INT4 weight-only quantization performs surprisingly well. The findings provide practical deployment guidelines for optimizing the accuracy-performance trade-off in large language model inference at scale.
🧠 Llama
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers develop a systematic approach to quantization-aware training for large language models using 8-bit floating-point formats, identifying and solving two critical failure modes—amax saturation and catastrophic forgetting—that don't surface in standard training metrics. Their solution achieves near-lossless performance with only 0.43% degradation on benchmark tasks, advancing practical LLM deployment efficiency.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers propose STARS, a training framework that stabilizes Looped Language Models (LoopLMs) to enable reliable test-time scaling through latent reasoning. The method uses Jacobian Spectral Radius Regularization to constrain neural states toward stable fixed points, addressing a critical problem where model performance peaks then collapses with increased recurrence depth.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce RuPLaR, a novel compression framework that enables Large Language Models to generate latent reasoning tokens in a single training stage, eliminating the inefficiencies of traditional multi-step Chain-of-Thought approaches. The method achieves an 11.1% accuracy improvement over existing latent CoT systems while using minimal tokens, demonstrating significant progress in efficient LLM reasoning.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers propose LEAD, a new method that makes large reasoning AI models more efficient by dynamically balancing accuracy and output length during training. Unlike existing approaches using static constraints, LEAD adapts per-problem length targets and reward calibration in real-time, achieving better accuracy and shorter outputs across mathematical reasoning benchmarks.
🏢 OpenAI · 🧠 o1
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Echo-LoRA introduces a parameter-efficient fine-tuning method that injects cross-layer representations from deeper neural network layers into shallow LoRA modules during training, achieving 3-5.7% performance improvements on reasoning tasks without adding inference costs. The technique discards its auxiliary training path once training is complete, so the deployed model keeps the efficiency of standard LoRA while delivering measurable capability gains.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers propose RDKV, a novel compression technique that jointly optimizes eviction and quantization of the Key-Value cache in large language models to reduce memory bottlenecks during inference. The method achieves a 4.5x decode speedup and a 1.9x peak memory reduction at 128K context length while maintaining 97.81% accuracy, addressing a critical performance constraint in LLM deployment.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers demonstrate that Mixture of Experts (MoE) models contain substantial underutilized sparsity within individual experts that can be exploited without modifying model parameters. By implementing intra-expert activation sparsity in vLLM, they achieve up to 2.5x speedup in MoE layer execution, offering a practical optimization path for efficient large language model deployment.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers propose Kaczmarz Linear Attention (KLA), an improved algorithm for long-context language modeling that replaces empirically-learned coefficients with mathematically-derived key-norm-normalized step sizes. KLA outperforms existing linear attention baselines like Gated DeltaNet while maintaining computational efficiency and enabling stable processing of up to 65K token contexts.
🏢 Perplexity
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Switchcraft is a new AI model router specifically designed for agentic tool calling that selects the lowest-cost model while maintaining correctness. The system achieves 82.9% accuracy matching top models while reducing inference costs by 84%, demonstrating that larger models don't consistently outperform smaller ones on function-calling tasks.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce CASPO, a framework that improves reasoning reliability in large language models by aligning token-level confidence with step-wise logical correctness through preference optimization. The method achieves better performance than tree-search approaches without requiring separate reward models, while introducing CaT inference that dynamically prunes uncertain reasoning branches with minimal computational overhead.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce LaProx, a novel KV cache eviction strategy for long-context LLM inference that reformulates the problem from head-wise weight averaging to output-aware layer-wise matrix multiplication. The method reduces accuracy loss by 2× under extreme compression and maintains performance with just 5% of the original KV cache.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers present Trajectory-Shaped Discrete Flow Matching (TS-DFM), a technique that improves text generation efficiency by using an energy-based guidance system during training to select better token transformation paths. The method enables a compact student model to achieve 32% lower perplexity than a 1,024-step teacher while running 128x faster at just 8 steps, setting new benchmarks for discrete generation tasks.
🏢 Perplexity
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce Adaptive Reparameterized Time (ART), a reinforcement learning approach that optimizes timestep scheduling for diffusion models to improve sample generation efficiency. The method reduces computational costs while maintaining image quality, with demonstrated improvements on benchmark datasets and cross-dataset transferability.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers introduce Recursive Agent Optimization (RAO), a reinforcement learning method enabling AI agents to spawn and delegate tasks to themselves recursively. This approach allows agents to handle longer contexts, solve harder problems through divide-and-conquer strategies, and achieve better training efficiency with reduced computational time.
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠TSCG is a deterministic compiler that converts JSON tool schemas into structured text optimized for language model interpretation, solving a critical failure point in agentic AI systems. The technology restores accuracy in smaller models (4B-14B) from near-zero to 84%+ on production-scale tool catalogs while reducing token consumption by 52-57%, shipping as a lightweight TypeScript package.
🏢 OpenAI · 🏢 Anthropic · 🧠 GPT-5
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠RetentiveKV introduces an entropy-driven optimization method for multimodal large language models that achieves 5x KV cache compression and 1.5x decoding acceleration by reformulating token eviction as continuous memory evolution rather than discrete pruning. The approach addresses limitations of existing compression methods by accounting for visual tokens that gain importance later in decoding and preserving spatial continuity of visual information.
AI · Bullish · arXiv – CS AI · May 4 · 7/10
🧠Researchers present a decision-making framework to optimize when large language models should call external tools like web search. The study reveals that models often misjudge their actual need for tool use, and proposes lightweight estimators trained on hidden states to improve tool-calling decisions, demonstrating performance gains across multiple tasks.
AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠Researchers propose a Compile-and-Execute architecture that reduces LLM-driven web automation costs from $150 to under $0.10 per workflow by decoupling reasoning from execution. Instead of continuous inference loops, a single LLM call generates a deterministic JSON blueprint that a lightweight runtime executes without additional model queries, achieving 80-94% zero-shot success rates.
AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠Researchers present a unified system for optimizing KV cache memory management in large-scale GPU inference, addressing three critical inefficiencies through architecture-aware sizing, multi-tier memory hierarchy spanning CPU to NVMe storage, and predictive eviction policies. The approach achieves 70-84% cache hit rates and projects 1.4-2.1x improvements in latency and 1.7-2.9x throughput gains while reducing costs by 47% compared to existing solutions.
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠CascadeDebate introduces a novel multi-agent deliberation system for large language model cascades that dynamically allocates computational resources based on query difficulty. By inserting lightweight agent ensembles at escalation boundaries to resolve ambiguous cases internally, the system achieves up to 26.75% performance improvement while reducing unnecessary escalations to expensive models.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce TARAC, a training-free framework that mitigates hallucinations in Large Vision-Language Models by dynamically preserving visual attention across generation steps. The method achieves significant improvements—reducing hallucinated content by 25.2% and boosting perception scores by 10.65—while adding only ~4% computational overhead, making it practical for real-world deployment.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠A new study demonstrates that quantization significantly outperforms rank reduction for compressing KV caches in transformer inference, with perplexity (PPL) improvements ranging from 4 to 364 across multiple models. The research shows that preserving all dimensions while reducing precision is structurally superior to discarding dimensions, with INT4 quantization matching FP16 accuracy while enabling 75% total KV reduction.
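
One plausible reading of the 75% figure in the entry above is the bit-width ratio alone: moving each cached key and value from 16 bits to 4 bits keeps every dimension but stores a quarter of the bytes. The sketch below works through that arithmetic. The function name `kv_cache_bytes` and the model shape are hypothetical, chosen only for illustration, and the calculation ignores the small overhead of quantization scales and zero-points, so it is not the study's own accounting.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    """Bytes needed to cache keys and values for one sequence.

    Ignores per-group quantization scales and zero-points (an assumption).
    """
    elements = 2 * layers * kv_heads * head_dim * seq_len  # 2 = keys + values
    return elements * bits // 8


# Hypothetical model shape, not taken from the study.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

fp16 = kv_cache_bytes(**shape, bits=16)
int4 = kv_cache_bytes(**shape, bits=4)
reduction = 1 - int4 / fp16

print(f"FP16: {fp16 / 2**20:.0f} MiB, INT4: {int4 / 2**20:.0f} MiB, saved: {reduction:.0%}")
# prints "FP16: 1024 MiB, INT4: 256 MiB, saved: 75%"
```

A rank-reduction scheme would instead shrink `head_dim`, which is the "discarding dimensions" alternative the summary contrasts against.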