#ai-reasoning News & Analysis

72 articles tagged with #ai-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · Google DeepMind Blog · Feb 12 · 7/10 · 8

Gemini 3 Deep Think: Advancing science, research and engineering

Gemini 3 Deep Think is an updated, specialized reasoning mode designed for complex problems in modern science, research, and engineering. The update focuses on stronger problem-solving for technical and scientific applications.

AI · Bullish · OpenAI News · Dec 11 · 7/10 · 8

Advancing science and math with GPT-5.2

OpenAI has released GPT-5.2, its most advanced model for mathematics and science applications, achieving state-of-the-art performance on benchmarks like GPQA Diamond and FrontierMath. The model demonstrates significant research capabilities, including solving open theoretical problems and generating reliable mathematical proofs.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

Researchers introduce ProjectionBench, a novel evaluation framework that tests large language models' scientific discovery capabilities by progressively revealing information about research problems. The benchmark assesses both innovative reasoning with minimal context and grounded hypothesis generation with full experimental details across 45 materials science papers, finding that GPT-5.4 and Gemini 3.1 Pro achieve strong alignment with ground-truth conclusions.

🧠 GPT-5 · 🧠 Gemini

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

Researchers propose an adaptive interview framework to improve how large language models simulate individual decision-making by gathering persona-relevant information through structured dialogue. The study finds that richer contextual information alone doesn't guarantee better accuracy; instead, LLMs only improve predictions (45.5% vs. 39.3%) when they actively ground decisions in user-specific evidence extracted during follow-up questions.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

ProvMind: Provenance-grounded reasoning for materials synthesis

Researchers introduce ProvMind, a framework for optimizing materials synthesis processes using provenance-grounded reasoning. The system combines process retrieval, compatibility scoring, and language models to achieve 52.84% accuracy on complex out-of-distribution benchmarks, outperforming standard AI approaches in materials science workflow optimization.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Multi-Agent Causal Discovery Using Large Language Models

Researchers introduce MAC, a multi-agent framework that combines statistical causal discovery with large language models to identify relationships between variables more accurately than existing methods. By using autonomous agent debate and adversarial reasoning, MAC outperforms both traditional statistical and single-agent LLM approaches across multiple benchmark datasets.

🧠 Gemini

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

Researchers introduce OmniToM, a new benchmark for evaluating Theory of Mind capabilities in large language models by requiring explicit modeling of belief structures rather than just final answers. The benchmark reveals that current LLMs struggle with tracking actor-specific beliefs and understanding knowledge access, exposing fundamental limitations in social reasoning despite high performance on traditional end-point question answering tasks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?

Researchers introduced a benchmark testing whether vision-language model (VLM) agents can recognize themselves in mirrors, a cognitive capability that emerges only in some animal species. Results show self-identification through reflection occurs mainly in stronger VLMs, while weaker models fail to extract self-relevant information despite viewing their reflections, revealing that language-based self-reference alone does not guarantee grounded self-understanding.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

ARMOR: An Agentic Framework for Reaction Feasibility Prediction via Adaptive Utility-aware Multi-tool Reasoning

Researchers introduce ARMOR, an agentic framework that improves chemical reaction feasibility prediction by intelligently combining multiple AI tools rather than relying on single models. The system uses hierarchical tool organization and memory-augmented reasoning to resolve conflicting predictions, demonstrating significant performance gains especially when different tools disagree on outcomes.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

SOM: Structured Opponent Modeling for LLM-based Agents via Structural Causal Model

Researchers propose Structured Opponent Modeling (SOM), a two-stage framework using Structural Causal Models to improve how LLM-based agents predict and adapt to opponent behavior in multi-agent environments. The approach separates opponent model construction from prediction, enabling more accurate strategic decision-making in game-theoretic scenarios.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Bridging the Last Mile of Circuit Design: PostEDA-Bench, a Hierarchical Benchmark for PPA Convergence and DRC Fixing

Researchers introduce PostEDA-Bench, a hierarchical benchmark for evaluating LLM-based agents in Electronic Design Automation tasks, specifically targeting Design Rule Check (DRC) fixing and Power-Performance-Area (PPA) optimization. Testing eight LLMs across 145 tasks reveals significant performance gaps, with best success rates of 36.66% for complex DRC reasoning and only 20% for multi-objective PPA optimization, indicating substantial room for improvement in AI-assisted chip design automation.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

Researchers introduce CA-SQL, an advanced Text-to-SQL pipeline that dynamically allocates computational resources based on task complexity to improve LLM reasoning. The method achieves state-of-the-art performance on the BIRD benchmark's challenging tier using only GPT-4o-mini, outperforming larger models and demonstrating the efficiency gains possible through intelligent inference-time optimization.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

Researchers challenge recent claims that Chain-of-Thought (CoT) reasoning in language models is unfaithful when it omits prompt-injected hints. The study argues the Biasing Features metric conflates incompleteness with unfaithfulness, and demonstrates through multiple evaluation approaches that non-verbalized hints can still causally influence predictions, suggesting token constraints rather than model deception explain missing hint mentions.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic–Actor Loop for Agentic Reasoning

Researchers introduce SCALAR, an Actor-Critic-Judge framework that systematically evaluates how AI agents improve through human feedback on theoretical physics problems. The study reveals that multi-turn dialogue consistently outperforms single attempts, but the effectiveness of different feedback strategies depends heavily on the specific pairing of AI models used, with asymmetric model pairs benefiting most from structured critique.

AI · Bullish · arXiv – CS AI · May 1 · 6/10

From Context to Skills: Can Language Models Learn from Context Skillfully?

Researchers introduce Ctx2Skill, a self-evolving framework that automatically discovers and refines natural-language skills for language models to better learn from complex contexts without manual annotation or external feedback. The system uses a multi-agent loop with a Challenger, Reasoner, and Judge to autonomously generate, test, and improve skills, showing consistent improvements across context learning benchmarks.

AI · Bullish · arXiv – CS AI · May 1 · 6/10

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

Researchers present LLM+ASP, a framework combining large language models with Answer Set Programming to enable nonmonotonic reasoning without task-specific engineering. The system uses automated self-correction loops where an ASP solver provides structured feedback, demonstrating significant performance improvements over monotonic logic approaches across diverse reasoning benchmarks.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

EMBER: Autonomous Cognitive Behaviour from Learned Spiking Neural Network Dynamics in a Hybrid LLM Architecture

Researchers present EMBER, a hybrid architecture combining spiking neural networks with large language models where the SNN acts as a persistent, biologically-inspired memory substrate that autonomously triggers LLM reasoning. The system demonstrates emergent autonomous behavior, initiating unprompted user contact after learning associations during idle periods, suggesting a fundamental shift in how AI systems could coordinate cognition and action.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

PrivacyReasoner: Can LLM Emulate a Human-like Privacy Mind?

Researchers introduce PrivacyReasoner, an LLM-based agent architecture that reconstructs individual privacy perspectives from online comment history to predict how specific people would perceive data practices. The system outperforms baseline models in predicting privacy concerns across AI, e-commerce, and healthcare domains by contextually activating relevant privacy beliefs.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

COMPOSITE-STEM

Researchers introduced COMPOSITE-STEM, a new benchmark containing 70 expert-written scientific tasks across physics, biology, chemistry, and mathematics to evaluate AI agents. The top-performing model achieved only 21% accuracy, indicating the benchmark effectively measures capabilities beyond current AI reach and addresses the saturation of existing evaluation frameworks.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

M³KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

Researchers introduce M³KG-RAG, a novel multimodal retrieval-augmented generation system that enhances large language models by integrating multi-hop knowledge graphs with audio-visual data. The approach improves reasoning depth and answer accuracy by filtering irrelevant information through a new grounding and pruning mechanism called GRASP.

AI · Neutral · arXiv – CS AI · Apr 13 · 6/10

Model Space Reasoning as Search in Feedback Space for Planning Domain Generation

Researchers present a novel approach using agentic language model feedback frameworks to generate planning domains from natural language descriptions augmented with symbolic information. The method employs heuristic search over model space optimized by various feedback mechanisms, including landmarks and plan validator outputs, to improve domain quality for practical deployment.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

SymptomWise: A Deterministic Reasoning Layer for Reliable and Efficient AI Systems

SymptomWise introduces a deterministic reasoning framework that separates language understanding from diagnostic inference in AI-driven medical systems, combining expert-curated knowledge with constrained LLM use to improve reliability and reduce hallucinations. The system achieved 88% accuracy in placing correct diagnoses in top-five differentials on challenging pediatric neurology cases, demonstrating how structured approaches can enhance AI safety in critical domains.

AI · Bearish · arXiv – CS AI · Apr 6 · 6/10

DeltaLogic: Minimal Premise Edits Reveal Belief-Revision Failures in Logical Reasoning Models

Researchers introduce DeltaLogic, a new benchmark that tests AI models' ability to revise their logical conclusions when presented with minimal changes to premises. The study reveals that language models like Qwen and Phi-4 struggle with belief revision even when they perform well on initial reasoning tasks, showing concerning inertia patterns where models fail to update conclusions when evidence changes.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Researchers introduce R-C2, a reinforcement learning framework that improves multimodal AI reasoning by enforcing consistency between visual and textual representations. The system uses cycle-consistent training to resolve internal conflicts between modalities, improving reasoning accuracy by up to 7.6 points without requiring additional labeled data.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

Researchers introduce VLA-Thinker, a new AI framework that enhances Vision-Language-Action models by enabling dynamic visual reasoning during robotic tasks. The system achieved a 97.5% success rate on LIBERO benchmarks through a two-stage training pipeline combining supervised fine-tuning and reinforcement learning.
