y0news

#ai-limitations News & Analysis

52 articles tagged with #ai-limitations. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · 2d ago · 7/10

Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation

A new study reveals that large language models fail at counterfactual reasoning when policy findings contradict intuitive expectations, despite performing well on obvious cases. The research demonstrates that chain-of-thought prompting paradoxically worsens performance on counter-intuitive scenarios, suggesting current LLMs engage in 'slow talking' rather than genuine deliberative reasoning.

AI · Neutral · arXiv – CS AI · 2d ago · 7/10

Can Large Language Models Infer Causal Relationships from Real-World Text?

Researchers developed the first real-world benchmark for evaluating whether large language models can infer causal relationships from complex academic texts. The study reveals that LLMs struggle significantly with this task, with the best models achieving an F1 score of only 0.535, highlighting a critical gap in AI reasoning capabilities needed for AGI advancement.
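For readers unfamiliar with the metric, a minimal sketch of how an F1 score like 0.535 is computed from precision and recall; the counts below are purely illustrative, not the paper's data:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall over extracted causal relations."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Hypothetical counts: 53 correct relations, 46 spurious, 47 missed -> F1 of roughly 0.53
print(round(f1_score(53, 46, 47), 3))
```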

AI · Bearish · arXiv – CS AI · Mar 27 · 7/10

The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition

Research reveals that open-source large language models (LLMs) lack hierarchical knowledge of visual taxonomies, creating a bottleneck for vision LLMs in hierarchical visual recognition tasks. The study used one million visual question answering tasks across six taxonomies to demonstrate this limitation, finding that even fine-tuning cannot overcome the underlying LLM knowledge gaps.

AI · Bearish · arXiv – CS AI · Mar 17 · 7/10

EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

Researchers introduced EnterpriseOps-Gym, a new benchmark for evaluating AI agents in enterprise environments, revealing that even top models like Claude Opus 4.5 achieve only a 37.4% success rate. The study highlights critical limitations in current AI agents for autonomous enterprise deployment, particularly in strategic reasoning and task feasibility assessment.

Claude · Opus
AI · Bearish · arXiv – CS AI · Mar 16 · 7/10

Large language models show fragile cognitive reasoning about human emotions

Researchers introduced CoRE, a benchmark testing whether large language models can reason about human emotions through cognitive dimensions rather than just labels. The study found that while LLMs capture systematic relations between cognitive appraisals and emotions, they show misalignment with human judgments and instability across different contexts.

AI · Bearish · arXiv – CS AI · Mar 16 · 7/10

Diagnosing Retrieval Bias Under Multiple In-Context Knowledge Updates in Large Language Models

Researchers identify a significant bias in Large Language Models when processing multiple updates to the same factual information within context. The study reveals that LLMs struggle to accurately retrieve the most recent version of updated facts, with performance degrading as the number of updates increases, similar to memory interference patterns observed in cognitive psychology.
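A minimal sketch of the kind of probe such a setup implies, with a hypothetical `ask_model` callable standing in for any LLM API; the prompt wording and facts are illustrative, not taken from the paper:

```python
from typing import Callable

def build_update_prompt(entity: str, attribute: str, values: list[str]) -> str:
    """Stack several in-context updates to the same fact; only the last value is current."""
    updates = [f"Update {i + 1}: {entity}'s {attribute} is now {v}." for i, v in enumerate(values)]
    question = f"After all updates, what is {entity}'s {attribute}? Answer with the value only."
    return "\n".join(updates + [question])

def latest_fact_accuracy(ask_model: Callable[[str], str], trials: list[list[str]]) -> float:
    """Fraction of trials in which the model returns the most recent value."""
    hits = 0
    for values in trials:
        prompt = build_update_prompt("Acme Corp", "headquarters city", values)
        if values[-1].lower() in ask_model(prompt).lower():
            hits += 1
    return hits / len(trials)

# Accuracy would be compared across trials with 2, 4, 8, ... updates to expose the bias.
```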

AI · Bearish · arXiv – CS AI · Mar 5 · 6/10

Baseline Performance of AI Tools in Classifying Cognitive Demand of Mathematical Tasks

A research study tested 11 AI tools on their ability to classify the cognitive demand of mathematical tasks, finding they achieved only 63% accuracy on average, with no tool exceeding 83%. The tools showed a systematic bias toward middle-category classifications and struggled to reason about underlying cognitive processes rather than surface textual features.

๐Ÿข Perplexity๐Ÿง  ChatGPT๐Ÿง  Claude
AI · Bearish · arXiv – CS AI · Mar 5 · 6/10

Language Model Goal Selection Differs from Humans' in an Open-Ended Task

Research comparing four state-of-the-art language models (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5, and Centaur) to humans in goal selection tasks reveals substantial divergence in behavior. While humans explore diverse approaches and learn gradually, the AI models tend to exploit single solutions or show poor performance, raising concerns about using current LLMs as proxies for human decision-making in critical applications.

Claude · Gemini
AI · Bearish · arXiv – CS AI · Mar 5 · 6/10

$\tau$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

Researchers introduced τ-Knowledge, a new benchmark for evaluating AI conversational agents in knowledge-intensive environments, specifically testing their ability to retrieve and apply unstructured domain knowledge. Even frontier AI models achieved a success rate of only 25.5% when navigating complex fintech customer support scenarios with 700 interconnected knowledge documents.

AI · Neutral · arXiv – CS AI · Mar 4 · 6/10

Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Researchers introduce CFE-Bench, a new multimodal benchmark for evaluating AI reasoning across 20+ STEM domains using authentic university exam problems. The best performing model, Gemini-3.1-pro-preview, achieved only 59.69% accuracy, highlighting significant gaps in AI reasoning capabilities, particularly in maintaining correct intermediate states through multi-step solutions.

AI · Bearish · arXiv – CS AI · Mar 4 · 6/10

Off-Trajectory Reasoning: Can LLMs Collaborate on Reasoning Trajectory?

New research reveals that current large language models struggle with collaborative reasoning, showing that 'stronger' models are often more fragile when distracted by misleading information. The study of 15 LLMs found they fail to effectively leverage guidance from other models, with success rates below 9.2% on challenging problems.

AI · Bearish · arXiv – CS AI · Mar 4 · 7/10

ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense

Researchers introduced ZeroDayBench, a new benchmark testing LLM agents' ability to find and patch 22 critical vulnerabilities in open-source code. Testing on frontier models GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 revealed that current LLMs cannot yet autonomously solve cybersecurity tasks, highlighting limitations in AI-powered code security.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Researchers introduce Tool Decathlon (Toolathlon), a comprehensive benchmark for evaluating AI language agents across 32 software applications and 604 tools in realistic, multi-step scenarios. The benchmark reveals significant limitations in current AI models, with the best performer (Claude-4.5-Sonnet) achieving a success rate of only 38.6% on complex, real-world tasks.

AI · Bearish · Decrypt – AI · 1d ago · 6/10

Can AI Beat the Sports Betting Market? 8 of the Top Models Tried

KellyBench tested eight leading AI models including Claude, GPT-5, Gemini, and Grok on Premier League sports betting predictions over a full season, and none generated profits. The results highlight the persistent difficulty AI faces in beating efficient markets despite advances in language models and reasoning capabilities.
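Assuming the benchmark's name refers to Kelly-criterion bankroll staking (the article does not spell this out), a minimal sketch of the stake fraction a betting agent must estimate well to turn a profit:

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Kelly stake as a fraction of bankroll: f* = (b * p - q) / b, where b = odds - 1."""
    b = decimal_odds - 1.0   # net payout per unit staked
    q = 1.0 - p_win          # probability of losing
    return max(0.0, (b * p_win - q) / b)

# A model that believes a home win is 55% likely at decimal odds of 2.10 would stake
# about 14% of its bankroll; an overconfident probability estimate loses money over a season.
print(round(kelly_fraction(0.55, 2.10), 3))  # ~0.141
```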

GPT-5 · Claude · Gemini
AI · Neutral · arXiv – CS AI · 2d ago · 6/10

Human-like Working Memory Interference in Large Language Models

Researchers discovered that large language models exhibit working memory limitations similar to humans, encoding multiple memory items in entangled representations that require interference control rather than direct retrieval. This finding reveals a shared computational constraint between biological and artificial systems, suggesting that working memory capacity may be a fundamental bottleneck in intelligent systems rather than a limitation unique to biological brains.

AI · Bearish · arXiv – CS AI · 3d ago · 6/10

Overstating Attitudes, Ignoring Networks: LLM Biases in Simulating Misinformation Susceptibility

Researchers found that large language models fail to accurately simulate human susceptibility to misinformation, consistently overstating how attitudes drive belief and sharing while ignoring social network effects. The study reveals systematic biases in how LLMs represent misinformation concepts, suggesting they are better tools for identifying where AI diverges from human judgment rather than replacing human survey responses.

AI · Bearish · arXiv – CS AI · 6d ago · 6/10

Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios

Researchers introduce CLI-Tool-Bench, a new benchmark for evaluating large language models' ability to generate complete software from scratch. Testing seven state-of-the-art LLMs reveals that top models achieve under 43% success rates, exposing significant limitations in current AI-driven 0-to-1 software generation despite increased computational investment.

AI · Bearish · arXiv – CS AI · Apr 7 · 6/10

Metaphors We Compute By: A Computational Audit of Cultural Translation vs. Thinking in LLMs

New research reveals that Large Language Models (LLMs) exhibit cultural bias and Western defaultism when generating metaphors across different cultural contexts. The study found that LLMs act more as cultural translators using dominant Western frameworks rather than true culturally-aware reasoning systems, even when prompted with specific cultural identities.

AI · Bearish · arXiv – CS AI · Apr 7 · 6/10

Individual and Combined Effects of English as a Second Language and Typos on LLM Performance

Research reveals that Large Language Models (LLMs) experience greater performance degradation when facing English as a Second Language (ESL) inputs combined with typographical errors, compared to either factor alone. The study tested eight ESL variants with three levels of typos, finding that evaluations on clean English may overestimate real-world model performance.
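As an illustration of this kind of controlled perturbation (not the authors' exact procedure), a short sketch that injects character-level typos at a configurable rate:

```python
import random
import string

def inject_typos(text: str, rate: float, seed: int = 0) -> str:
    """Randomly drop, substitute, or duplicate letters at roughly `rate` per character."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(["drop", "sub", "dup"])
            if op == "drop":
                continue
            out.append(rng.choice(string.ascii_lowercase) if op == "sub" else ch + ch)
        else:
            out.append(ch)
    return "".join(out)

# Three severity levels might look like 2%, 5%, and 10% of characters perturbed.
print(inject_typos("Please summarise the quarterly revenue figures.", 0.10))
```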

AI · Bearish · arXiv – CS AI · Apr 7 · 6/10

The Ideation Bottleneck: Decomposing the Quality Gap Between AI-Generated and Human Economics Research

Research reveals that AI-generated economics papers significantly underperform human-authored publications, with idea quality representing the primary bottleneck (71% of the gap) rather than execution quality. Analysis of 953 papers shows human research has a 47.1% probability of being rated exceptional versus 16.5% for AI, with only 0.8% of AI papers surpassing median human quality on both dimensions.

Gemini
AI · Neutral · arXiv – CS AI · Apr 6 · 6/10

Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

Researchers introduce XpertBench, a new benchmark for evaluating Large Language Models on expert-level professional tasks across domains like finance, healthcare, and legal services. Even top-performing LLMs achieve only ~66% success rates, revealing a significant 'expert-gap' in current AI systems' ability to handle complex professional work.

AI · Bearish · arXiv – CS AI · Apr 6 · 6/10

DeltaLogic: Minimal Premise Edits Reveal Belief-Revision Failures in Logical Reasoning Models

Researchers introduce DeltaLogic, a new benchmark that tests AI models' ability to revise their logical conclusions when presented with minimal changes to premises. The study reveals that language models like Qwen and Phi-4 struggle with belief revision even when they perform well on initial reasoning tasks, showing concerning inertia patterns where models fail to update conclusions when evidence changes.
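A minimal, hypothetical example of the premise-edit pattern the benchmark targets; the item below is illustrative and not drawn from DeltaLogic itself:

```python
# One illustrative item: a single edited premise flips the entailed conclusion,
# and a model showing "inertia" keeps its original answer after the edit.
item = {
    "premises": ["All trains on line A stop at Elm Street.",
                 "The 9:15 service runs on line A."],
    "question": "Does the 9:15 service stop at Elm Street?",
    "answer": "yes",
    "edited_premises": ["All trains on line A except express services stop at Elm Street.",
                        "The 9:15 service is an express service on line A."],
    "edited_answer": "no",
}

def shows_inertia(answer_before: str, answer_after: str, item: dict) -> bool:
    """Flag the inertia failure: correct before the edit, but the same answer is repeated
    after the premise change even though the entailed conclusion has flipped."""
    before, after = answer_before.strip().lower(), answer_after.strip().lower()
    return before == item["answer"] and after == before and after != item["edited_answer"]
```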

AI · Bearish · arXiv – CS AI · Apr 6 · 6/10

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

Researchers introduced ChomskyBench, a new benchmark for evaluating large language models' formal reasoning capabilities using the Chomsky Hierarchy framework. The study reveals that while larger models show improvements, current LLMs face severe efficiency barriers and are significantly less efficient than traditional algorithmic programs for formal reasoning tasks.
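As a concrete illustration of a formal-language task in the Chomsky Hierarchy sense (a generic example, not necessarily a ChomskyBench item): deciding membership in the context-free language a^n b^n, which a short program handles exactly and efficiently:

```python
def is_anbn(s: str) -> bool:
    """Membership test for the context-free language { a^n b^n : n >= 1 }."""
    n = len(s) // 2
    return len(s) >= 2 and len(s) % 2 == 0 and s == "a" * n + "b" * n

# An ordinary algorithm decides this exactly in linear time; the paper's point is that
# LLMs handle such formal tasks far less efficiently, if at all, as inputs grow.
print(is_anbn("aaabbb"), is_anbn("aabbb"))  # True False
```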

Page 1 of 3