169 articles tagged with #reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · 1d ago · 7/10
🧠Researchers introduce AdaMCoT, a framework that improves multilingual reasoning in large language models by dynamically routing intermediate thoughts through optimal 'thinking languages' before generating target-language responses. The approach achieves significant performance gains in low-resource languages without requiring additional pretraining, addressing a key limitation in current multilingual AI systems.
AI · Bullish · arXiv – CS AI · 1d ago · 7/10
🧠Researchers demonstrate that multi-token prediction (MTP) outperforms standard next-token prediction (NTP) for training language models on reasoning tasks like planning and pathfinding. Through theoretical analysis of simplified Transformers, they reveal that MTP enables a reverse reasoning process where models first identify end states then reconstruct paths backward, suggesting MTP induces more interpretable and robust reasoning circuits.
AI · Neutral · arXiv – CS AI · 6d ago · 7/10
🧠Researchers document 'blind refusal'—a phenomenon where safety-trained language models refuse to help users circumvent rules without evaluating whether those rules are legitimate, unjust, or have justified exceptions. The study shows models refuse 75.4% of requests to break rules even when the rules lack defensibility and pose no safety risk.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers introduce 'error verifiability' as a new metric to measure whether AI-generated justifications help users distinguish correct from incorrect answers. The study found that common AI improvement methods don't enhance verifiability, but two new domain-specific approaches successfully improved users' ability to assess answer correctness.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers developed LightThinker++, a new framework that enables large language models to compress intermediate reasoning thoughts and manage memory more efficiently. The system reduces peak token usage by up to 70% while improving accuracy by 2.42% and maintaining performance over extended reasoning tasks.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers introduce Cog-DRIFT, a new framework that improves AI language model reasoning by transforming difficult problems into easier formats like multiple-choice questions, then gradually training models on increasingly complex versions. The method shows significant performance gains of 8-10% on previously unsolvable problems across multiple reasoning benchmarks.
🧠 Llama
AI · Neutral · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers have identified two key mechanisms behind reasoning hallucinations in large language models: Path Reuse and Path Compression. The study models next-token prediction as graph search, showing how memorized knowledge can override contextual constraints and how frequently used reasoning paths become shortcuts that lead to unsupported conclusions.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers propose Continuous Softened Retracing reSampling (CSRS) to improve the self-evolution of Multimodal Large Language Models by addressing biases in feedback mechanisms. The method uses continuous reward signals instead of binary rewards and achieves state-of-the-art results on mathematical reasoning benchmarks like MathVision using Qwen2.5-VL-7B.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers propose Online Label Refinement (OLR) to improve AI reasoning models' robustness under noisy supervision in Reinforcement Learning with Verifiable Rewards. The method addresses the critical problem of training language models when expert-labeled data contains errors, achieving 3-4% performance gains across mathematical reasoning benchmarks.
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers demonstrate that PLDR-LLMs trained at self-organized criticality exhibit enhanced reasoning capabilities at inference time. The study shows that reasoning ability can be quantified using an order parameter derived from global model statistics, with models performing better when this parameter approaches zero at criticality.
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers introduce Bottlenecked Transformers, a new architecture that improves AI reasoning by up to 6.6 percentage points through periodic memory consolidation inspired by brain processes. The system uses a Cache Processor to rewrite key-value cache entries at reasoning step boundaries, achieving better performance on math reasoning benchmarks compared to standard Transformers.
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠A comprehensive survey of 82 AI approaches to the ARC-AGI benchmark reveals consistent 2-3x performance drops across all paradigms when moving from version 1 to 2, with human-level reasoning still far from reach. While costs have fallen dramatically (390x in one year), AI systems struggle with compositional generalization, achieving only 13% on ARC-AGI-3 compared to near-perfect human performance.
🧠 GPT-5 🧠 Opus
AI · Bearish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers found that test-time reinforcement learning (TTRL) methods for improving AI reasoning are vulnerable to harmful prompt injections, which amplify both safe and harmful behaviors. Specially designed 'HarmInject' prompts can exploit these methods and degrade reasoning, underscoring the need for safer AI training approaches.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers have developed a novel method to enhance large language model reasoning capabilities using supervision from weaker models, achieving 94% of expensive reinforcement learning gains at a fraction of the cost. This weak-to-strong supervision paradigm offers a promising alternative to costly traditional methods for improving LLM reasoning performance.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers propose BIGMAS (Brain-Inspired Graph Multi-Agent Systems), a new architecture that organizes specialized LLM agents in dynamic graphs with centralized coordination to improve complex reasoning tasks. The system outperformed existing approaches including ReAct and Tree of Thoughts across multiple reasoning benchmarks, demonstrating that multi-agent design provides gains complementary to model-level improvements.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers have developed rationale-enhanced decoding (RED), a new inference-time strategy that improves chain-of-thought reasoning in large vision-language models. The method addresses the problem where LVLMs ignore generated rationales by harmonizing visual and rationale information during decoding, showing consistent improvements across multiple benchmarks.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduced SAGE, a multi-agent framework that improves large language model reasoning through self-evolution using four specialized agents. The system achieved significant performance gains on coding and mathematics benchmarks without requiring large human-labeled datasets.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce AutoTool, a new reinforcement learning approach that enables AI agents to automatically scale their reasoning capabilities for tool use. The method uses entropy-based optimization and supervised fine-tuning to help models efficiently determine appropriate thinking lengths for simple versus complex problems, achieving 9.8% accuracy improvements while reducing computational overhead by 81%.
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers used mechanistic interpretability techniques to demonstrate that transformer language models have distinct but interacting neural circuits for recall (retrieving memorized facts) and reasoning (multi-step inference). Through controlled experiments on Qwen and LLaMA models, they showed that disabling specific circuits can selectively impair one ability while leaving the other intact.
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers developed a new reinforcement learning approach for training diffusion language models that uses entropy-guided step selection and stepwise advantages to overcome challenges with sequence-level likelihood calculations. The method achieves state-of-the-art results on coding and logical reasoning benchmarks while being more computationally efficient than existing approaches.
AI · Neutral · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers developed a testing framework to evaluate how reliably AI agents maintain consistent reasoning when inputs are semantically equivalent but differently phrased. Their study of seven foundation models across 19 reasoning problems found that larger models aren't necessarily more robust, with the smaller Qwen3-30B-A3B achieving the highest stability at 79.6% invariant responses.
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers introduce Targeted Reasoning Unlearning (TRU), a new method for removing specific knowledge from large language models while preserving general capabilities. The approach uses reasoning-based targets to guide the unlearning process, addressing issues with previous gradient ascent methods that caused unintended capability degradation.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce SATURN, a new reinforcement learning framework that uses Boolean Satisfiability (SAT) problems to improve large language models' reasoning capabilities. The framework addresses key limitations in existing RL approaches by enabling scalable task construction, automated verification, and precise difficulty control through curriculum learning.
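The "automated verification" mentioned above rests on a general property of SAT problems: a proposed solution can be checked mechanically. The following is a minimal sketch of that property only, not SATURN's actual code. The function name `satisfies`, the DIMACS-style integer encoding, and the example formula are all assumptions chosen for illustration.

```python
from typing import Dict, List

# Assumed encoding (DIMACS-style): 3 means x3, -3 means NOT x3.
Clause = List[int]


def satisfies(clauses: List[Clause], assignment: Dict[int, bool]) -> bool:
    """Return True if every clause has at least one literal made true.

    Variables missing from the assignment are treated as False.
    """
    return all(
        any(assignment.get(abs(lit), False) == (lit > 0) for lit in clause)
        for clause in clauses
    )


# Hypothetical formula: (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
formula = [[1, -2], [2, 3], [-1, -3]]

good = {1: True, 2: True, 3: False}   # satisfies all three clauses
bad = {1: True, 2: False, 3: True}    # violates the third clause

print(satisfies(formula, good))
print(satisfies(formula, bad))
```

Because the check is exact, a reward signal built on it needs no human labels, and difficulty can plausibly be varied through the number of variables and clauses.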
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce 'opaque serial depth' as a metric to measure how much reasoning large language models can perform without externalizing it through chain of thought processes. The study provides computational bounds for Gemma 3 models and releases open-source tools to calculate these bounds for any neural network architecture.
AI · Bullish · MarkTechPost · Mar 9 · 7/10
🧠Google researchers have developed a new 'Bayesian' teaching method to improve Large Language Models' probabilistic reasoning capabilities. Current LLMs struggle with updating beliefs based on new evidence, falling short in logical reasoning tasks that require maintaining and updating probability assessments.
🏢 Google
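For readers unfamiliar with the belief updating referred to in the Google item above, a single application of Bayes' rule can be sketched as follows. This illustrates the textbook formula only, not the researchers' teaching method. The function name `bayes_update` and the example probabilities are assumptions.

```python
def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Return P(H | E) given P(H), P(E | H), and P(E | not H)."""
    # Total probability of observing the evidence under both hypotheses.
    evidence = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return p_e_given_h * prior / evidence


# Hypothetical numbers: start undecided, then observe two independent pieces
# of evidence that are each four times likelier if the hypothesis is true.
belief = 0.5
for _ in range(2):
    belief = bayes_update(belief, p_e_given_h=0.8, p_e_given_not_h=0.2)

print(round(belief, 3))
```

Each observation moves the belief further toward the hypothesis, which is the kind of incremental revision the summary says current models handle poorly.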