169 articles tagged with #reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠 Researchers introduce Slow-Fast Policy Optimization (SFPO), a new reinforcement learning framework that improves training stability and efficiency for large language model reasoning. SFPO outperforms existing methods like GRPO by up to 2.80 points on math benchmarks while requiring up to 4.93x fewer rollouts and 4.19x less training time.
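For context on the GRPO baseline named above: GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization, not SFPO itself. The function name, the epsilon term, and the use of the population standard deviation are assumptions made for illustration; real implementations differ in these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is the reward's distance from the group mean,
    measured in group standard deviations (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical rollouts for one prompt: two correct (1.0), two wrong (0.0).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)
```

Correct rollouts receive positive advantages and incorrect ones negative, so each rollout contributes a signal only in comparison with its group. That per-group sampling is why rollout count is a natural cost measure for methods in this family.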
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠 GlobalRAG is a new reinforcement learning framework that significantly improves multi-hop question answering by decomposing questions into subgoals and coordinating retrieval with reasoning. The system achieves a 14.2% average improvement in performance metrics while using only 42% of the training data required by baseline models.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠 Researchers developed plan conditioning, a training-free method that significantly improves diffusion language model reasoning by prepending short natural-language plans from autoregressive models. The technique improved performance by 11.6 percentage points on math problems and 12.8 points on coding tasks, making diffusion models competitive with autoregressive models.
🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠 Researchers propose EMBRAG, a new framework that combines large language models with knowledge graphs to improve reasoning accuracy and reduce hallucinations. The system generates multiple logical rules from queries and applies them in embedding space, achieving state-of-the-art performance on knowledge graph question-answering benchmarks.
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠 Researchers introduce a new knowledge distillation framework that improves training of smaller AI models by using intermediate representations from large language models rather than their final outputs. The method shows consistent improvements across reasoning benchmarks, particularly when training data is limited, by providing cleaner supervision signals.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠 Researchers introduce CLIPO (Contrastive Learning in Policy Optimization), a new method that improves upon Reinforcement Learning with Verifiable Rewards (RLVR) for training Large Language Models. CLIPO addresses hallucination and answer-copying issues by incorporating contrastive learning to better capture correct reasoning patterns across multiple solution paths.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠 Researchers propose Dynamics-Predictive Sampling (DPS), a new method that improves reinforcement learning fine-tuning of large language models by predicting which training prompts will be most informative without running expensive rollouts. The technique models each prompt's learning progress as a dynamical system and uses Bayesian inference to select better training data, reducing computational overhead while achieving superior reasoning performance.
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers introduce Latent-DARM, a framework that bridges discrete diffusion language models and autoregressive models to improve multi-agent AI reasoning capabilities. The system achieved significant improvements on reasoning benchmarks, increasing accuracy from 27% to 36% on DART-5 while using less than 2.2% of the token budget of state-of-the-art models.
AI · Bearish · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers have identified a critical flaw in Large Language Models (LLMs): they prioritize moral reasoning over commonsense understanding and struggle to detect logical contradictions within moral dilemmas. The study introduces the CoMoral benchmark and reveals a 'narrative focus bias', in which LLMs identify contradictions attributed to secondary characters better than those attributed to primary narrators.
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers introduce RECODE, a new framework that improves visual reasoning in AI models by converting images into executable code for verification. The system generates multiple candidate programs to reproduce visuals, then selects and refines the most accurate reconstruction, significantly outperforming existing methods on visual reasoning benchmarks.
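The generate-then-select step described above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not RECODE's implementation: the "images" are small numeric grids, the candidate "programs" are hypothetical Python callables, and the error metric (mean absolute difference) is chosen only for simplicity.

```python
def reconstruction_error(target, rendered):
    """Mean absolute difference between two equally sized 2D grids."""
    cells = [
        abs(t - r)
        for t_row, r_row in zip(target, rendered)
        for t, r in zip(t_row, r_row)
    ]
    return sum(cells) / len(cells)

def select_best_candidate(target, candidates):
    """Run every candidate program and return (name, error) of the closest one."""
    scored = [
        (name, reconstruction_error(target, render()))
        for name, render in candidates.items()
    ]
    return min(scored, key=lambda item: item[1])

# Hypothetical target image and three candidate programs that try to redraw it.
target = [[0, 1], [1, 0]]
candidates = {
    "all_zero": lambda: [[0, 0], [0, 0]],
    "diagonal": lambda: [[0, 1], [1, 0]],
    "all_one": lambda: [[1, 1], [1, 1]],
}
best = select_best_candidate(target, candidates)
print(best)
```

Because each candidate is executable, its output can be compared directly against the input image, which is what makes the selection step verifiable rather than a matter of model preference.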
AI · Bullish · arXiv – CS AI · Mar 9 · 6/10
🧠 Researchers introduce Answer-Then-Check, a novel safety alignment approach for large language models that enables them to evaluate response safety before outputting to users. The method uses a new 80K-sample dataset called Reasoned Safety Alignment (ReSA) and demonstrates improved jailbreak defense while maintaining general reasoning capabilities.
🏢 Hugging Face
AI · Neutral · arXiv – CS AI · Mar 9 · 6/10
🧠 This position paper argues against anthropomorphizing intermediate tokens generated by language models as 'reasoning traces' or 'thoughts'. The authors contend that treating these computational outputs as human-like thinking processes is misleading and potentially harmful to AI research and understanding.
AI · Neutral · arXiv – CS AI · Mar 6 · 6/10
🧠 Researchers introduce X-RAY, a new system for analyzing large language model reasoning capabilities through formally verified probes that isolate structural components of reasoning. The study reveals that LLMs handle constraint refinement well but struggle with solution-space restructuring, and it provides contamination-free evaluation methods.
AI · Neutral · arXiv – CS AI · Mar 4 · 5/10 · 4
🧠 Researchers introduce HSSBench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on Humanities and Social Sciences tasks across multiple languages. The benchmark contains over 13,000 samples and reveals significant challenges for current state-of-the-art models in cross-disciplinary reasoning.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 6
🧠 Researchers propose Draft-Thinking, a new approach to improve the efficiency of large language models' reasoning processes by reducing unnecessary computational overhead. The method achieves an 82.6% reduction in reasoning budget with only a 2.6% performance drop on mathematical problems, addressing the costly overthinking problem in current chain-of-thought reasoning.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 7
🧠 LiTS is a new modular Python framework that enables LLM reasoning through tree search algorithms like MCTS and BFS. The framework demonstrates reusable components across different domains and reveals that LLM policy diversity, not reward quality, is the key bottleneck for effective tree search in infinite action spaces.
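To make the tree-search idea concrete, here is a minimal breadth-first search over candidate reasoning steps. It is a generic sketch and does not use the LiTS API: `bfs_search`, `propose`, and `is_goal` are hypothetical names, and the toy `propose` function stands in for an LLM policy that would sample next steps.

```python
from collections import deque

def bfs_search(start, propose, is_goal, max_depth=10):
    """Breadth-first search; returns the first path that reaches a goal state.

    propose(state) yields candidate next states (the "policy");
    is_goal(state) says whether a state solves the task.
    """
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if is_goal(state):
            return path
        if len(path) > max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in propose(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None

# Toy task: reach 10 from 1 using the steps "+1" and "*2".
path = bfs_search(1, lambda s: [s + 1, s * 2], lambda s: s == 10)
print(path)
```

The search can only reach states that `propose` ever suggests. In this toy the step set is fixed, but with an LLM as the proposer the variety of sampled steps bounds what the search can find, which is consistent with the summary's point about policy diversity.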
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 8
🧠 Researchers have developed DIVA-GRPO, a new reinforcement learning method that improves multimodal large language model reasoning by adaptively adjusting problem difficulty distributions. The approach addresses key limitations in existing group relative policy optimization methods, showing superior performance across six reasoning benchmarks.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 8
🧠 Researchers released ASTRA-bench, a new benchmark for evaluating AI agents' ability to handle complex, multi-step reasoning with personal context and tool usage. Testing revealed that current state-of-the-art models like Claude-4.5-Opus and DeepSeek-V3.2 show significant performance degradation in high-complexity scenarios.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 7
🧠 Researchers propose Ctrl-R, a new framework that improves large language models' reasoning abilities by systematically discovering and reinforcing diverse reasoning patterns through structured trajectory control. The method enables better exploration of complex reasoning behaviors and shows consistent improvements across mathematical reasoning tasks in both language and vision-language models.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 7
🧠 Researchers introduced Pencil Puzzle Bench, a new framework for evaluating large language model reasoning capabilities using constraint-satisfaction problems. The benchmark tested 51 models across 300 puzzles and found that increased reasoning effort and iterative verification both yield significant performance improvements.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 7
🧠 Researchers propose ActMem, a novel memory framework for LLM agents that combines memory retrieval with active causal reasoning to handle complex decision-making scenarios. The framework transforms dialogue history into structured causal graphs and uses counterfactual reasoning to resolve conflicts between past states and current intentions, significantly outperforming existing baselines in memory-dependent tasks.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 8
🧠 Researchers introduce CHIMERA, a compact 9K-sample synthetic dataset that enables smaller AI models to achieve reasoning performance comparable to much larger models. The dataset addresses key challenges in training reasoning-capable LLMs through automated generation and cross-validation across 8 scientific disciplines.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 6
🧠 Researchers introduce One-Token Verification (OTV), a new method that estimates reasoning correctness in large language models during a single forward pass, reducing computational overhead. OTV reduces token usage by up to 90% through early termination while improving accuracy on mathematical reasoning tasks compared to existing verification methods.
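How early termination translates into token savings can be shown with a small simulation. This is a sketch of the general idea only and not the OTV method: the trace, the per-step correctness estimates, and the `min_confidence` threshold are all made-up values, and `run_with_early_termination` is a hypothetical helper.

```python
def run_with_early_termination(steps, min_confidence=0.2):
    """Consume reasoning steps until the correctness estimate drops too low.

    steps is a list of (tokens_in_step, estimated_correctness) pairs.
    Returns (tokens_used, finished), where finished is False if the
    trace was abandoned early.
    """
    used = 0
    for tokens, confidence in steps:
        used += tokens
        if confidence < min_confidence:
            return used, False  # stop spending tokens on an unpromising trace
    return used, True

# Hypothetical trace whose estimated correctness collapses at the second step.
trace = [(100, 0.6), (100, 0.1), (400, 0.05), (400, 0.02)]
used, finished = run_with_early_termination(trace)
total = sum(tokens for tokens, _ in trace)
saving = 1 - used / total
print(used, finished, round(saving, 2))
```

In this toy trace the run stops after 200 of 1,000 tokens. The saving depends entirely on how early an unpromising trace can be detected, which is why a cheap per-step correctness estimate matters.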
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 12
🧠 Researchers introduce Silo-Bench, a benchmark revealing that multi-agent LLM systems can exchange information effectively but fail to integrate distributed data for correct reasoning. The study shows coordination overhead increases with scale, challenging the assumption that adding more agents can solve context limitations.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 6
🧠 Researchers developed TARSE, a new AI system for clinical decision-making that retrieves relevant medical skills and experiences from curated libraries to improve reasoning accuracy. The system performs test-time adaptation to align language models with clinically valid logic, showing improvements over existing medical AI baselines in question-answering benchmarks.