AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce RuPLaR, a novel compression framework that enables Large Language Models to generate latent reasoning tokens in a single training stage, eliminating inefficiencies of traditional multi-step Chain-of-Thought approaches. The method achieves an 11.1% accuracy improvement over existing latent CoT systems while using minimal tokens, demonstrating significant progress in efficient LLM reasoning.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠LARAG introduces a link-aware retrieval strategy that improves RAG systems by leveraging hyperlink structures already present in technical documentation, rather than treating documents as flat text collections. The approach achieves better answer quality with fewer computational resources, demonstrating that implicit graph-like retrieval through existing metadata can enhance AI system performance.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce MARL-Rad, a multi-agent reinforcement learning framework that optimizes AI agents specifically for radiology report generation rather than using fixed LLMs in pre-designed workflows. The system decomposes chest X-ray interpretation into specialized regional agents coordinated by a global integrator, achieving state-of-the-art clinical performance on benchmark datasets with clinician validation.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce MatryoshkaLoRA, a novel training framework that improves upon Low-Rank Adaptation (LoRA) for efficient large language model fine-tuning by learning hierarchical low-rank representations through a strategically placed diagonal scaling matrix. The method enables dynamic rank selection with minimal accuracy loss and introduces AURAC, a new evaluation metric for hierarchical adapters, addressing a key limitation in current parameter-efficient fine-tuning approaches.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce WiCER, an iterative algorithm that solves the "compilation gap" in LLM Wiki systems—the problem of distilling raw documents into persistent knowledge artifacts without losing critical facts. The method recovers 80% of lost quality and reduces catastrophic failures by 55%, outperforming naive compilation approaches while maintaining sub-second latency advantages over traditional RAG systems.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce LaProx, a novel KV cache eviction strategy for long-context LLM inference that reformulates the problem from head-wise weight averaging to output-aware layer-wise matrix multiplication. The method reduces accuracy loss by 2× under extreme compression and maintains performance with just 5% of the original KV cache.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers present FinRAG-12B, a 12-billion-parameter language model optimized for banking applications that achieves GPT-4.1-level performance on citation grounding while maintaining safer refusal rates and operating at 20-50x lower cost. The model is already deployed across 40+ financial institutions, with a reported 7.1 percentage point improvement in query resolution.
🧠 GPT-4
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers propose Selective Eligibility Traces (S-trace), a new method for reinforcement learning that improves credit assignment in large language models by selectively identifying critical reasoning steps rather than uniformly crediting entire trajectories. The approach demonstrates performance gains of 0.49-3.16% across Qwen models while improving sample and token efficiency compared to existing critic-free algorithms.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers propose a novel reinforcement learning framework that automatically generates process-level supervision from outcome-only feedback, eliminating the need for costly external process supervision. This approach enables fine-grained credit assignment in reasoning tasks by having models identify and learn from their own failed trajectories.
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠Researchers introduce LAWS, a self-certifying caching architecture for neural inference that builds a library of expert functions with formal error bounds, enabling efficient deployment across LLMs, robotics, and edge devices. The system generalizes both Mixture-of-Experts and KV prefix caching while providing mathematically verifiable performance guarantees without requiring ground truth validation.
AI · Bullish · arXiv – CS AI · May 4 · 7/10
🧠Researchers introduce AdaMeZO, a new zeroth-order optimizer that combines the memory efficiency of MeZO with Adam-style moment estimation for fine-tuning large language models. The method achieves faster convergence than MeZO while reducing GPU memory requirements and requiring up to 70% fewer forward passes.
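The general idea of pairing a zeroth-order gradient estimate with Adam-style moments can be sketched on a toy problem. The following is a minimal illustration, not the AdaMeZO algorithm itself: the quadratic `loss`, the function name `zo_adam`, and all hyperparameters are assumptions, and this naive version stores both moment vectors explicitly, so it does not reproduce any memory savings the paper reports. Each step uses two forward passes (a central difference along a random direction) and no backpropagation.

```python
import math
import random


def loss(theta, target):
    """Toy quadratic objective standing in for a fine-tuning loss (assumption)."""
    return sum((p - t) ** 2 for p, t in zip(theta, target))


def zo_adam(theta, target, steps=500, lr=0.05, eps=1e-3,
            beta1=0.9, beta2=0.99, delta=1e-8, seed=0):
    """Zeroth-order descent with Adam-style moment estimates (hypothetical sketch)."""
    rng = random.Random(seed)
    theta = list(theta)
    m = [0.0] * len(theta)  # first-moment estimate
    v = [0.0] * len(theta)  # second-moment estimate
    for t in range(1, steps + 1):
        # Random perturbation direction, as in MeZO-style estimators.
        z = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = loss([p + eps * zi for p, zi in zip(theta, z)], target)
        minus = loss([p - eps * zi for p, zi in zip(theta, z)], target)
        # Projected gradient from two forward passes only.
        scale = (plus - minus) / (2 * eps)
        for i, zi in enumerate(z):
            g = scale * zi  # per-coordinate gradient estimate
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + delta)
    return theta


target = [1.0, -2.0, 3.0, 0.5, -1.0]
start = [0.0] * len(target)
initial_loss = loss(start, target)
final_loss = loss(zo_adam(start, target), target)
print(initial_loss, final_loss)
```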
AI · Bullish · arXiv – CS AI · May 4 · 7/10
🧠Researchers present a decision-making framework to optimize when large language models should call external tools like web search. The study reveals that models often misjudge their actual need for tool use, and proposes lightweight estimators trained on hidden states to improve tool-calling decisions, demonstrating performance gains across multiple tasks.
AI · Bearish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers have discovered that FP16 floating-point precision causes systematic numerical divergence between KV-cached and cache-free inference in transformer models, producing 100% token divergence across multiple architectures. This challenges the long-held assumption that KV caching is numerically equivalent to standard computation, with controlled FP32 experiments confirming FP16 non-associativity as the causal mechanism.
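Non-associativity in FP16 is easy to reproduce in isolation. The sketch below is not the paper's experiment; it only shows that half-precision addition can give different results depending on grouping, which is the kind of effect that could arise when two computation paths sum the same terms in different orders. The helper names `fp16` and `add16` are illustrative, and rounding is done through the standard library's `struct` half-precision format.

```python
import struct


def fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def add16(a: float, b: float) -> float:
    """Add two values and round the result to half precision."""
    return fp16(fp16(a) + fp16(b))


a, b, c = 2048.0, 0.75, 0.75

# Near 2048 the spacing between adjacent FP16 values is 2,
# so adding 0.75 twice in sequence is lost to rounding each time.
left = add16(add16(a, b), c)   # (a + b) + c

# Summing the small terms first gives 1.5, which is large enough to round up.
right = add16(a, add16(b, c))  # a + (b + c)

print(left, right)  # 2048.0 2050.0
```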
AI · Bullish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers introduced Ragged Paged Attention (RPA), a specialized inference kernel optimized for Google's TPUs that enables efficient large language model deployment. The innovation addresses the GPU-centric design of existing LLM serving systems by implementing fine-grained tiling and custom software pipelines, achieving up to 86% memory bandwidth utilization on TPU hardware.
🧠 Llama
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers present OSC, a hardware-efficient framework that addresses the challenge of deploying Large Language Models with 4-bit quantization by intelligently separating activation outliers into a high-precision processing path while maintaining low-precision computation for standard values. The technique achieves 1.78x speedup over standard 8-bit approaches while limiting accuracy degradation to under 2.2% on state-of-the-art models.
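The outlier-separation idea can be illustrated with a small dot product. This is a minimal sketch under stated assumptions, not OSC's implementation: the fixed `threshold`, the symmetric per-tensor 4-bit scheme, and the function names are all hypothetical, and weights are left in floating point for simplicity. Activations above the threshold go through a high-precision path, while the rest are quantized to 4-bit integer codes.

```python
def quantize_int4(values):
    """Symmetric 4-bit quantization: integer codes in [-7, 7] plus one scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dot_int4(x, w):
    """Dot product with every activation quantized to 4 bits."""
    codes, scale = quantize_int4(x)
    return scale * sum(c * wi for c, wi in zip(codes, w))


def dot_split(x, w, threshold=1.0):
    """Outliers take a high-precision path; inliers take the 4-bit path."""
    inlier_x, inlier_w, outlier_sum = [], [], 0.0
    for xi, wi in zip(x, w):
        if abs(xi) > threshold:
            outlier_sum += xi * wi  # kept in full precision
        else:
            inlier_x.append(xi)
            inlier_w.append(wi)
    return dot_int4(inlier_x, inlier_w) + outlier_sum


x = [0.1, -0.2, 0.3, 8.0]    # one large activation outlier
w = [0.5, -1.0, 2.0, 0.25]
exact = sum(xi * wi for xi, wi in zip(x, w))
naive = dot_int4(x, w)       # the outlier inflates the scale, zeroing small values
split = dot_split(x, w)
print(exact, naive, split)
```

In this example the single outlier forces the shared 4-bit scale so high that the three small activations all round to zero, while the split path keeps them at a usable resolution.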
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce AdaMCoT, a framework that improves multilingual reasoning in large language models by dynamically routing intermediate thoughts through optimal 'thinking languages' before generating target-language responses. The approach achieves significant performance gains in low-resource languages without requiring additional pretraining, addressing a key limitation in current multilingual AI systems.
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers propose a case-based learning framework enabling LLM-based autonomous agents to extract and reuse knowledge from past tasks, improving performance on complex real-world problems. The method outperforms traditional zero-shot, few-shot, and prompt-based baselines across six task categories, with gains increasing as task complexity rises.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce ExecTune, a training methodology for optimizing black-box LLM systems where a guide model generates strategies executed by a core model. The approach improves accuracy by up to 9.2% while reducing inference costs by 22.4%, enabling smaller models like Claude Haiku to match larger competitors at significantly lower computational expense.
🧠 Claude 🧠 Haiku 🧠 Sonnet
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce Disco-RAG, a discourse-aware framework that enhances Retrieval-Augmented Generation (RAG) systems by explicitly modeling discourse structures and rhetorical relationships between retrieved passages. The method achieves state-of-the-art results on question answering and summarization tasks without fine-tuning, demonstrating that structural understanding of text significantly improves LLM performance on knowledge-intensive tasks.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers demonstrate that inserting sentence boundary delimiters in LLM inputs significantly enhances model performance across reasoning tasks, with improvements of up to 12.5% on specific benchmarks. This technique leverages the natural sentence-level structure of human language to enable better processing during inference, tested across model scales from 7B to 600B parameters.
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠AgentOpt v0.1, a new Python framework, addresses client-side optimization for AI agents by intelligently allocating models, tools, and API budgets across pipeline stages. Using search algorithms like Arm Elimination and Bayesian Optimization, the tool reduces evaluation costs by 24-67% while achieving near-optimal accuracy, with cost differences between model combinations reaching up to 32x at matched performance levels.
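Arm elimination is a standard bandit technique, and a generic version is short enough to sketch. The code below is not AgentOpt's API: the function `arm_elimination`, the configuration names, and the simulated success rates are all hypothetical. It evaluates every surviving configuration once per round and drops any whose upper confidence bound falls below the best lower bound, so clearly worse options stop consuming evaluation budget early.

```python
import math
import random


def arm_elimination(arms, evaluate, max_rounds=200, delta=0.05):
    """Successive elimination over arms; evaluate(arm) returns a score in [0, 1]."""
    active = list(arms)
    totals = {a: 0.0 for a in arms}
    evals = 0
    n = 0
    for n in range(1, max_rounds + 1):
        for a in active:
            totals[a] += evaluate(a)
            evals += 1
        means = {a: totals[a] / n for a in active}
        # Hoeffding-style confidence radius shared by all surviving arms.
        radius = math.sqrt(math.log(2 * len(arms) * max_rounds / delta) / (2 * n))
        best_lower = max(means.values()) - radius
        active = [a for a in active if means[a] + radius >= best_lower]
        if len(active) == 1:
            break
    best = max(active, key=lambda a: totals[a] / n)
    return best, evals


# Hypothetical pipeline configurations with simulated success rates.
true_rates = {"small": 0.2, "medium": 0.4, "large+search": 0.9}
rng = random.Random(0)


def simulate(arm):
    """Simulated pass/fail evaluation of one configuration on one task."""
    return 1.0 if rng.random() < true_rates[arm] else 0.0


best_arm, total_evals = arm_elimination(list(true_rates), simulate)
print(best_arm, total_evals)
```

A uniform allocation would spend `max_rounds` evaluations on every configuration; elimination stops paying for a configuration as soon as the data rule it out.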
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers introduce MoBiE, a novel binarization framework designed specifically for Mixture-of-Experts large language models that achieves significant efficiency gains through weight compression while maintaining model performance. The method addresses unique challenges in quantizing MoE architectures and demonstrates over 2× inference speedup with substantial perplexity reductions on benchmark models.
🏢 Perplexity
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce StatePlane, a model-agnostic cognitive state management system that enables AI systems to maintain coherent reasoning over long interaction horizons without expanding context windows or retraining models. The system uses episodic, semantic, and procedural memory mechanisms inspired by cognitive psychology to overcome current limitations in large language models.
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠Research analyzes FP4 quantization sensitivity across different layers in large language models using NVFP4 and MXFP4 formats on Qwen2.5 models. The study finds MLP projection layers are most sensitive to quantization, while attention layers show substantial robustness to FP4 precision reduction.
AI · Bullish · arXiv – CS AI · Mar 9 · 7/10
🧠Researchers propose Traversal-as-Policy, a method that distills AI agent execution logs into Gated Behavior Trees (GBTs) to create safer, more efficient autonomous agents. The approach significantly improves success rates while reducing safety violations and computational costs across multiple benchmarks.