AI · Bearish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers introduced FinVerBench, a benchmark for evaluating how well large language models verify financial statement accuracy using real SEC 10-K filings. Testing 14 contemporary LLMs revealed critical limitations: most models produced 95-100% false positives on clean statements, while performance varied dramatically based on how financial data was rendered, suggesting financial verification requires calibrated judgment beyond arithmetic detection.
🧠 Gemini
AI · Bearish · arXiv – CS AI · 3d ago · 7/10
🧠A research paper argues that AI labor substitution in software development and knowledge work creates a false efficiency illusion by masking dependence on human expertise rather than truly replacing it. While organizations appear to reduce costs and accelerate output through AI adoption, they risk eroding foundational human capabilities that are slow to rebuild, increasing long-term fragility despite short-term gains.
AI · Bearish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers reveal that LLM-based search agents often rely on intrinsic knowledge rather than genuinely searching the web, with up to 44.5% of answers generated without tool use. The new LiveBrowseComp benchmark, designed to test agents on facts no more than 90 days old, shows all evaluated agents dropping below 2% accuracy and exposes fundamental limitations in current search-augmented AI evaluation.
🏢 Hugging Face
AI · Bearish · Decrypt – AI · 4d ago · 7/10
🧠Huawei has introduced Claw-Anything, a benchmark that tests AI agents' ability to handle complex digital tasks over extended simulated timeframes. GPT-5.5, currently the best-performing model, achieved only 34.5% on the benchmark, highlighting significant limitations in current AI agents' capacity to maintain performance during long-horizon tasks.
🧠 GPT-5
AI · Bearish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce VisualNeedle, a benchmark that exposes limitations in multimodal large language models' ability to perform genuine fine-grained visual search in information-dense scenes. Despite frontier MLLMs reporting over 90% accuracy on existing benchmarks, VisualNeedle reveals that these models struggle significantly when critical evidence is spatially constrained to minute regions, with the best model achieving only 56% accuracy versus 63% human performance.
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers unveiled KnotBench, a comprehensive benchmark testing vision-language models' ability to reason about knot diagrams, revealing that current models like Claude Opus and GPT-5 struggle fundamentally with spatial reasoning and symbolic operations despite perceiving visual details. The benchmark demonstrates a critical gap between perception and reasoning capabilities, with most tasks scoring near or below random chance, suggesting VLMs lack mechanisms to simulate geometric transformations.
🧠 GPT-5 · 🧠 Claude · 🧠 Opus
AI · Neutral · arXiv – CS AI · May 11 · 7/10
🧠Researchers developed a method to extract and analyze search trees from LLM reasoning traces, revealing that large language models use shallower, more myopic planning strategies compared to humans. While LLMs generate extended chain-of-thought reasoning, their actual decision-making is driven primarily by shallow search rather than deep lookahead, contrasting sharply with human expert planning.
AI · Neutral · arXiv – CS AI · May 11 · 7/10
🧠Researchers prove a fundamental mathematical incompatibility between accuracy, trust, and human-level reasoning in AI systems, demonstrating that systems designed to never make false claims cannot solve certain problems that humans can easily solve. The findings parallel Gödel's incompleteness theorems and establish formal limitations on what AI systems can achieve regardless of computational power.
AI · Bearish · arXiv – CS AI · May 4 · 7/10
🧠Researchers introduced AutoMat, a benchmark testing whether AI coding agents can reproduce computational materials science findings from academic papers. Current LLM-based agents achieved a success rate of only 54.1%, revealing significant limitations in reconstructing complex scientific workflows, interpreting domain-specific procedures, and validating results against original claims.
AI · Neutral · arXiv – CS AI · May 4 · 7/10
🧠Researchers have identified fundamental limitations in how text-to-image diffusion models handle multi-object generation, finding that scene complexity rather than data imbalance is the primary culprit. Through a controlled framework called MOSAIC, they demonstrate that counting objects is particularly difficult in low-data regimes and that compositional generalization collapses when training combinations are systematically excluded.
AI · Bearish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers found that Chain-of-Thought prompting, a technique that improves logical reasoning in multimodal AI models, actually degrades performance on visual spatial tasks. The study evaluated seventeen models across thirteen benchmarks and discovered these systems suffer from shortcut learning, hallucinating visual details from text even when images are absent, indicating a fundamental limitation in current AI reasoning paradigms.
AI · Bearish · arXiv – CS AI · Apr 14 · 7/10
🧠A new study reveals that large language models fail at counterfactual reasoning when policy findings contradict intuitive expectations, despite performing well on obvious cases. The research demonstrates that chain-of-thought prompting paradoxically worsens performance on counter-intuitive scenarios, suggesting current LLMs engage in 'slow talking' rather than genuine deliberative reasoning.
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers developed the first real-world benchmark for evaluating whether large language models can infer causal relationships from complex academic texts. The study reveals that LLMs struggle significantly with this task, with the best models achieving an F1 score of only 0.535, highlighting a critical gap in AI reasoning capabilities needed for AGI advancement.
AI · Bearish · arXiv – CS AI · Mar 27 · 7/10
🧠Research reveals that open-source large language models (LLMs) lack hierarchical knowledge of visual taxonomies, creating a bottleneck for vision LLMs in hierarchical visual recognition tasks. The study used one million visual question answering tasks across six taxonomies to demonstrate this limitation, finding that even fine-tuning cannot overcome the underlying LLM knowledge gaps.
AI · Bearish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduced EnterpriseOps-Gym, a new benchmark for evaluating AI agents in enterprise environments, revealing that even top models like Claude Opus 4.5 achieve a success rate of only 37.4%. The study highlights critical limitations in current AI agents for autonomous enterprise deployment, particularly in strategic reasoning and task feasibility assessment.
🧠 Claude · 🧠 Opus
AI · Bearish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers introduced CoRE, a benchmark testing whether large language models can reason about human emotions through cognitive dimensions rather than just labels. The study found that while LLMs capture systematic relations between cognitive appraisals and emotions, they show misalignment with human judgments and instability across different contexts.
AI · Bearish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers identify a significant bias in Large Language Models when processing multiple updates to the same factual information within context. The study reveals that LLMs struggle to accurately retrieve the most recent version of updated facts, with performance degrading as the number of updates increases, similar to memory interference patterns observed in cognitive psychology.
AI · Bearish · arXiv – CS AI · Mar 5 · 6/10
🧠A research study tested 11 AI tools on their ability to classify the cognitive demand of mathematical tasks, finding they achieved only 63% accuracy on average with no tool exceeding 83%. The tools showed systematic bias toward middle-category classifications and struggled with reasoning about underlying cognitive processes versus surface textual features.
🏢 Perplexity · 🧠 ChatGPT · 🧠 Claude
AI · Bearish · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers introduced τ-Knowledge, a new benchmark for evaluating AI conversational agents in knowledge-intensive environments, specifically testing their ability to retrieve and apply unstructured domain knowledge. Even frontier AI models achieved a success rate of only 25.5% when navigating complex fintech customer support scenarios with 700 interconnected knowledge documents.
AI · Bearish · arXiv – CS AI · Mar 5 · 6/10
🧠Research comparing four state-of-the-art language models (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5, and Centaur) to humans in goal selection tasks reveals substantial divergence in behavior. While humans explore diverse approaches and learn gradually, the AI models tend to exploit single solutions or show poor performance, raising concerns about using current LLMs as proxies for human decision-making in critical applications.
🧠 Claude · 🧠 Gemini
AI · Neutral · arXiv – CS AI · Mar 4 · 6/10 · 3
🧠Researchers introduce CFE-Bench, a new multimodal benchmark for evaluating AI reasoning across 20+ STEM domains using authentic university exam problems. The best performing model, Gemini-3.1-pro-preview, achieved only 59.69% accuracy, highlighting significant gaps in AI reasoning capabilities, particularly in maintaining correct intermediate states through multi-step solutions.
AI · Bearish · arXiv – CS AI · Mar 4 · 6/10 · 3
🧠New research reveals that current large language models struggle with collaborative reasoning, showing that 'stronger' models are often more fragile when distracted by misleading information. The study of 15 LLMs found they fail to effectively leverage guidance from other models, with success rates below 9.2% on challenging problems.
AI · Bearish · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠Researchers introduced ZeroDayBench, a new benchmark testing LLM agents' ability to find and patch 22 critical vulnerabilities in open-source code. Testing on frontier models GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 revealed that current LLMs cannot yet autonomously solve cybersecurity tasks, highlighting limitations in AI-powered code security.
AI · Bearish · arXiv – CS AI · Mar 4 · 6/10 · 3
🧠Researchers introduce SpatialText, a diagnostic framework to test whether large language models can truly reason about spatial relationships or merely rely on linguistic patterns. The study reveals that current AI models fail at egocentric perspective reasoning despite proficiency in basic spatial fact retrieval.
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 3
🧠Researchers introduce Tool Decathlon (Toolathlon), a comprehensive benchmark for evaluating AI language agents across 32 software applications and 604 tools in realistic, multi-step scenarios. The benchmark reveals significant limitations in current AI models, with the best performer (Claude-4.5-Sonnet) achieving a success rate of only 38.6% on complex, real-world tasks.