AI · Bearish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers demonstrate that enhancing LLM reasoning capabilities through reinforcement learning paradoxically increases tool hallucination—where models incorrectly invoke non-existent or inappropriate tools. The study reveals a fundamental trade-off where stronger reasoning correlates with higher hallucination rates, suggesting current AI agent development approaches may inherently compromise reliability for capability.
🏢 OpenAI
AI · Neutral · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce REL, a benchmark framework that evaluates relational reasoning in large language models by measuring Relational Complexity (RC)—the number of entities that must be simultaneously bound to apply a relation. The study reveals that frontier LLMs consistently degrade in performance as RC increases, exposing a fundamental limitation in higher-arity reasoning that persists even with increased compute and in-context learning.
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers developed the first real-world benchmark for evaluating whether large language models can infer causal relationships from complex academic texts. The study reveals that LLMs struggle significantly with this task, with the best models achieving an F1 score of only 0.535, highlighting a critical gap in AI reasoning capabilities needed for AGI advancement.
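For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The block below states the standard definition; the symbols $P$ and $R$ and the worked numbers are illustrative assumptions, since the summary reports only the final score and not the underlying precision and recall.

```latex
% Standard definition: P = precision, R = recall
F_1 = \frac{2PR}{P + R}

% Illustrative values only (not taken from the study):
% P = 0.60, R = 0.50 gives
F_1 = \frac{2 \cdot 0.60 \cdot 0.50}{0.60 + 0.50} = \frac{0.60}{1.10} \approx 0.545
```

Because the harmonic mean is pulled toward the smaller of the two values, a score near 0.535 means that precision, recall, or both are well short of reliable.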
AI · Bearish · arXiv – CS AI · Apr 14 · 7/10
🧠A new study reveals that large language models fail at counterfactual reasoning when policy findings contradict intuitive expectations, despite performing well on obvious cases. The research demonstrates that chain-of-thought prompting paradoxically worsens performance on counter-intuitive scenarios, suggesting current LLMs engage in 'slow talking' rather than genuine deliberative reasoning.
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers challenge the assumption that longer reasoning chains always improve LLM performance, discovering that extended test-time compute leads to diminishing returns and 'overthinking' where models abandon correct answers. The study demonstrates that optimal compute allocation varies by problem difficulty, enabling significant efficiency gains without sacrificing accuracy.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠FACT-E is a new evaluation framework that uses controlled perturbations to assess the faithfulness of Chain-of-Thought reasoning in large language models, addressing the problem of models generating seemingly coherent explanations with invalid intermediate steps. By measuring both internal chain consistency and answer alignment, FACT-E enables more reliable detection of flawed reasoning and selection of trustworthy reasoning trajectories for in-context learning.
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers demonstrate that a large language model's diversity profile—how probability mass spreads across different solution approaches—should determine whether reasoning strategies prioritize breadth or depth exploration. Testing on Qwen and Olmo model families reveals that lightweight refinement signals work well for low-diversity aligned models but offer limited value for high-diversity base models, suggesting optimal inference strategies must be model-specific rather than universal.
AI · Bearish · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers have developed a 14-technique perturbation pipeline to test the robustness of large language models' reasoning capabilities on mathematical problems. Testing reveals that while frontier models remain resilient, open-weight models suffer catastrophic accuracy collapses of up to 55%, and all tested models degrade when solving sequential problems in a single context window, suggesting fundamental architectural limitations in current reasoning systems.
🧠 Claude · 🧠 Opus
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers have developed Zipage, a new high-concurrency inference engine for large language models that uses Compressed PagedAttention to solve memory bottlenecks. The system achieves 95% of the performance of full-KV inference engines while delivering over 2.1x speedup on mathematical reasoning tasks.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers provide mathematical proof that implicit models can achieve greater expressive power through increased test-time computation, explaining how these memory-efficient architectures can match larger explicit networks. The study validates this scaling property across image reconstruction, scientific computing, operations research, and LLM reasoning domains.
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduced ReasonOps, an unsupervised method for analyzing chain-of-thought traces from large language models that identifies seven universal reasoning operators (backtracking, inferring, hypothesizing, etc.) appearing consistently across 12 different LLM families. The framework enables model identification, correctness prediction, and early quality estimation without manual annotation, revealing that each model family has a distinctive reasoning fingerprint.
AI · Bullish · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduce RePoT (Recoverable Program-of-Thought), an enhanced AI reasoning method that fixes failed code generation by replaying execution to identify the first error point, then using a single LLM call to recover rather than restarting. The technique improves accuracy by 3-11 percentage points across multiple models and benchmarks, with particularly strong gains on smaller models like GPT-4 mini.
🧠 GPT-5 · 🧠 Claude · 🧠 Gemini
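The "replay execution to identify the first error point" step in the RePoT summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `first_error_point` is hypothetical, the sketch assumes the error surfaces as a raised exception in a top-level statement, and the single LLM recovery call described in the summary is deliberately left out.

```python
import ast


def first_error_point(source):
    """Replay `source` one top-level statement at a time.

    Returns (index, statement_text, exception) for the first statement
    that raises, or None if every statement runs cleanly.
    Hypothetical helper for illustration only.
    """
    tree = ast.parse(source)
    namespace = {}  # shared state, so later statements see earlier results
    for index, node in enumerate(tree.body):
        # Wrap the single statement in its own module so it can be compiled alone.
        module = ast.Module(body=[node], type_ignores=[])
        code = compile(module, filename="<replay>", mode="exec")
        try:
            exec(code, namespace)
        except Exception as exc:
            return index, ast.get_source_segment(source, node), exc
    return None


# A generated program whose third statement fails.
program = (
    "total = 10\n"
    "count = 0\n"
    "average = total / count\n"
    "report = f'average={average}'\n"
)

result = first_error_point(program)
```

Here `result` identifies the statement `average = total / count` and the `ZeroDivisionError` it raised. In a recovery loop, that statement and the state in `namespace` up to the failure would be what gets handed to the model, rather than regenerating the whole program.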
AI · Bullish · arXiv – CS AI · May 29 · 6/10
🧠OptSkills, a new AI system, advances automated optimization problem-solving by clustering problems by underlying mathematical archetypes rather than surface narratives, achieving 68.27% accuracy on diverse benchmarks and outperforming DeepSeek-V3.2-Thinking on large-scale problems. The system uses skill distillation and trajectory learning to improve generalization across both known and novel problem types.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers present a systematic evaluation of large language models' reasoning capabilities on Boolean satisfiability problems. They introduce a paired-formula protocol with an Accurate Differentiation Rate (ADR) metric, which reveals that conventional accuracy metrics can be misleading because models often succeed through heuristics rather than genuine reasoning.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers found that large language models' chain-of-thought reasoning remains remarkably consistent even when reaching opposite conclusions about conflicting information, suggesting CoT explanations don't faithfully reflect the underlying decision mechanism. While model confidence shows a weak but genuine predictive signal for decisions, internal reasoning tokens proved more decision-sensitive than user-facing explanations, indicating models may not transparently report how they actually choose between document claims and training knowledge.
🧠 GPT-4 · 🧠 Claude · 🧠 Sonnet
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers examine how Large Language Models use anthropomorphic reflection markers like 'wait' and 'hmm' during reasoning tasks. The study finds these markers are not uniformly necessary for performance and can often be suppressed without degrading—or even while improving—task outcomes, suggesting they function as surface-level cues rather than indicators of genuine reflection mechanisms.
AI · Bullish · arXiv – CS AI · May 28 · 6/10
🧠Researchers propose Skill-Conditioned Gated Self-Distillation (SGSD), a novel method for improving large language model reasoning by leveraging an experience-derived skill bank rather than trusted reference answers. The approach validates skills through a multi-teacher framework and demonstrates consistent improvements over existing methods on mathematical reasoning benchmarks.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers present a novel framework analyzing how reinforcement learning (RL) and supervised fine-tuning (SFT) differently shape reasoning in large language models. The study reveals that RL compresses incorrect reasoning paths while SFT expands correct ones, explaining why the two-stage training approach produces superior reasoning capabilities across models of 1.5B to 14B parameters.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers propose a taxonomy of chain-of-thought (CoT) reasoning in LLM post-training, distinguishing between explicit, composed, and implicit reasoning formats. The study reveals that compressed reasoning data requires different training approaches, with composed CoT benefiting from data scaling while implicit CoT risks memorization, and that reinforcement learning can decompose compressed steps learned during supervised fine-tuning.
AI · Bullish · arXiv – CS AI · May 27 · 6/10
🧠Researchers introduce ReasonOps, a unified operational framework that treats AI reasoning as a continuously monitored and verifiable process rather than isolated inference. The paradigm integrates formal verification, symbolic reasoning, and runtime assurance to address critical reliability gaps in LLM-based reasoning systems, particularly for safety-critical applications.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers have developed a mechanistic interpretability framework that reverses information flow through Chain-of-Thought prompting to understand how AI models reason. The study reveals CoT functions as a decoding space pruner that uses answer templates to guide outputs, with task-dependent neuron modulation that reduces activation in open-domain tasks but increases it in closed-domain scenarios.
AI · Bullish · arXiv – CS AI · May 27 · 6/10
🧠Researchers propose PTA-GRPO, a two-stage framework that enhances LLM reasoning by combining high-level planning with reinforcement learning. The method first guides models to summarize reasoning into compact guidance, then uses this guidance to optimize both final outputs and reasoning quality, demonstrating consistent improvements across ten benchmarks.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers present Vital Trace, a protocol-constrained multi-agent AI framework designed to improve clinical risk prediction in intensive care units by tracking patient trajectories over extended periods. The system uses compact patient-state memory and structured reasoning agents rather than unbounded text histories, demonstrating better temporal consistency and interpretability on MIMIC-IV and eICU datasets.
AI · Bullish · arXiv – CS AI · May 27 · 6/10
🧠Researchers demonstrate that knowledge graphs significantly outperform traditional document stores for LLM-based industrial asset operations, achieving 100% accuracy on 467 maintenance scenarios compared to 65% with flat data structures. The study reveals that data architecture, not LLM orchestration design, is the primary performance bottleneck in structured operational domains.
🏢 Hugging Face · 🧠 GPT-4
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce AgentPSO, a framework that evolves multi-agent reasoning skills in large language models using particle swarm optimization principles. Rather than relying on static agents or inference-time debate, the system enables agents to iteratively improve their reasoning capabilities through self-reflection and collective learning, demonstrating improved performance and cross-benchmark transferability without modifying underlying model parameters.
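For context on the "particle swarm optimization principles" mentioned above, here is a minimal sketch of classic numeric PSO in Python. It is not AgentPSO itself, which applies the idea to agent reasoning skills rather than numeric vectors. The function name `pso_minimize`, the coefficient values, and the `sphere` test objective are all illustrative assumptions.

```python
import random


def pso_minimize(objective, dim, bounds, n_particles=20, n_iters=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Classic particle swarm optimization (illustrative parameter values)."""
    rng = random.Random(seed)
    lo, hi = bounds
    positions = [[rng.uniform(lo, hi) for _ in range(dim)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]

    # Each particle remembers its own best position; the swarm shares one global best.
    personal_best = [p[:] for p in positions]
    personal_score = [objective(p) for p in positions]
    best_idx = min(range(n_particles), key=personal_score.__getitem__)
    global_best = personal_best[best_idx][:]
    global_score = personal_score[best_idx]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull toward the particle's own best,
                # and pull toward the swarm's best.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c_personal * r1 * (personal_best[i][d] - positions[i][d])
                    + c_global * r2 * (global_best[d] - positions[i][d])
                )
                # Move, then clamp to the search bounds.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            score = objective(positions[i])
            if score < personal_score[i]:
                personal_score[i] = score
                personal_best[i] = positions[i][:]
                if score < global_score:
                    global_score = score
                    global_best = positions[i][:]
    return global_best, global_score


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_position, best_score = pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
```

The personal-best and global-best terms are the numeric counterparts of what the summary calls self-reflection and collective learning: each particle improves from its own history while also being pulled toward the best solution found by the group.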