52 articles tagged with #ai-limitations. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI Bearish · arXiv – CS AI · 2d ago · 7/10
🧠 A new study reveals that large language models fail at counterfactual reasoning when policy findings contradict intuitive expectations, despite performing well on obvious cases. The research demonstrates that chain-of-thought prompting paradoxically worsens performance on counter-intuitive scenarios, suggesting current LLMs engage in 'slow talking' rather than genuine deliberative reasoning.
AI Neutral · arXiv – CS AI · 2d ago · 7/10
🧠 Researchers developed the first real-world benchmark for evaluating whether large language models can infer causal relationships from complex academic texts. The study reveals that LLMs struggle significantly with this task, with the best models achieving an F1 score of only 0.535, highlighting a critical gap in AI reasoning capabilities needed for AGI advancement.
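For reference, the F1 metric cited here is the harmonic mean of precision and recall over true positives, false positives, and false negatives; a minimal sketch:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall.

    An F1 of 0.535 means the model's balance of precision and
    recall on causal-relation extraction is barely above a coin
    flip's worth of useful signal for this kind of task.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because F1 punishes whichever of precision or recall is weaker, a model cannot reach 0.535 by simply over- or under-predicting causal links.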
AI Bearish · arXiv – CS AI · Mar 27 · 7/10
🧠 Research reveals that open-source large language models (LLMs) lack hierarchical knowledge of visual taxonomies, creating a bottleneck for vision LLMs in hierarchical visual recognition tasks. The study used one million visual question answering tasks across six taxonomies to demonstrate this limitation, finding that even fine-tuning cannot overcome the underlying LLM knowledge gaps.
AI Bearish · arXiv – CS AI · Mar 17 · 7/10
🧠 Researchers introduced EnterpriseOps-Gym, a new benchmark for evaluating AI agents in enterprise environments, revealing that even top models like Claude Opus 4.5 achieve only 37.4% success rates. The study highlights critical limitations in current AI agents for autonomous enterprise deployment, particularly in strategic reasoning and task feasibility assessment.
🧠 Claude 🧠 Opus
AI Bearish · arXiv – CS AI · Mar 16 · 7/10
🧠 Researchers introduced CoRE, a benchmark testing whether large language models can reason about human emotions through cognitive dimensions rather than just labels. The study found that while LLMs capture systematic relations between cognitive appraisals and emotions, they show misalignment with human judgments and instability across different contexts.
AI Bearish · arXiv – CS AI · Mar 16 · 7/10
🧠 Researchers identify a significant bias in Large Language Models when processing multiple updates to the same factual information within context. The study reveals that LLMs struggle to accurately retrieve the most recent version of updated facts, with performance degrading as the number of updates increases, similar to memory interference patterns observed in cognitive psychology.
AI Bearish · arXiv – CS AI · Mar 5 · 6/10
🧠 A research study tested 11 AI tools on their ability to classify the cognitive demand of mathematical tasks, finding they achieved only 63% accuracy on average, with no tool exceeding 83%. The tools showed systematic bias toward middle-category classifications and struggled to reason about underlying cognitive processes rather than surface textual features.
🟢 Perplexity 🧠 ChatGPT 🧠 Claude
AI Bearish · arXiv – CS AI · Mar 5 · 6/10
🧠 Research comparing four state-of-the-art language models (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5, and Centaur) to humans in goal selection tasks reveals substantial divergence in behavior. While humans explore diverse approaches and learn gradually, the AI models tend to exploit single solutions or show poor performance, raising concerns about using current LLMs as proxies for human decision-making in critical applications.
🧠 Claude 🧠 Gemini
AI Bearish · arXiv – CS AI · Mar 5 · 6/10
🧠 Researchers introduced ฯ-Knowledge, a new benchmark for evaluating AI conversational agents in knowledge-intensive environments, specifically testing their ability to retrieve and apply unstructured domain knowledge. Even frontier AI models achieved only 25.5% success rates when navigating complex fintech customer support scenarios with 700 interconnected knowledge documents.
AI Bearish · arXiv – CS AI · Mar 4 · 6/10
🧠 Researchers introduce SpatialText, a diagnostic framework to test whether large language models can truly reason about spatial relationships or merely rely on linguistic patterns. The study reveals that current AI models fail at egocentric perspective reasoning despite proficiency in basic spatial fact retrieval.
AI Neutral · arXiv – CS AI · Mar 4 · 6/10
🧠 Researchers introduce CFE-Bench, a new multimodal benchmark for evaluating AI reasoning across 20+ STEM domains using authentic university exam problems. The best performing model, Gemini-3.1-pro-preview, achieved only 59.69% accuracy, highlighting significant gaps in AI reasoning capabilities, particularly in maintaining correct intermediate states through multi-step solutions.
AI Bearish · arXiv – CS AI · Mar 4 · 6/10
🧠 New research reveals that current large language models struggle with collaborative reasoning, showing that 'stronger' models are often more fragile when distracted by misleading information. The study of 15 LLMs found they fail to effectively leverage guidance from other models, with success rates below 9.2% on challenging problems.
AI Bearish · arXiv – CS AI · Mar 4 · 7/10
🧠 Researchers introduced ZeroDayBench, a new benchmark testing LLM agents' ability to find and patch 22 critical vulnerabilities in open-source code. Testing on frontier models GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 revealed that current LLMs cannot yet autonomously solve cybersecurity tasks, highlighting limitations in AI-powered code security.
AI Neutral · arXiv – CS AI · Feb 27 · 7/10
🧠 Researchers introduce Tool Decathlon (Toolathlon), a comprehensive benchmark for evaluating AI language agents across 32 software applications and 604 tools in realistic, multi-step scenarios. The benchmark reveals significant limitations in current AI models, with the best performer (Claude-4.5-Sonnet) achieving only a 38.6% success rate on complex, real-world tasks.
AI Bearish · Decrypt – AI · 21h ago · 6/10
🧠 KellyBench tested eight leading AI models including Claude, GPT-5, Gemini, and Grok on Premier League sports betting predictions over a full season, and none generated profits. The results highlight the persistent difficulty AI faces in beating efficient markets despite advances in language models and reasoning capabilities.
🧠 GPT-5 🧠 Claude 🧠 Gemini
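The name KellyBench presumably nods to the Kelly criterion for bet sizing; whether the benchmark uses exactly this staking rule is an assumption. A minimal sketch of the classic formula:

```python
def kelly_fraction(p: float, odds: float) -> float:
    """Kelly criterion: optimal fraction of bankroll to stake.

    p     : model's estimated probability of the bet winning
    odds  : decimal odds offered (payout per unit staked, incl. stake)

    f* = (p * b - q) / b, where b = odds - 1 and q = 1 - p.
    A negative f* means no edge, so stake nothing.
    """
    b = odds - 1.0
    q = 1.0 - p
    f = (p * b - q) / b
    return max(f, 0.0)
```

The formula makes the benchmark unforgiving: long-run profit requires win-probability estimates that are better calibrated than the bookmaker's implied odds, which is exactly what an efficient market denies.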
AI Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers discovered that large language models exhibit working memory limitations similar to humans, encoding multiple memory items in entangled representations that require interference control rather than direct retrieval. This finding reveals a shared computational constraint between biological and artificial systems, suggesting that working memory capacity may be a fundamental bottleneck in intelligent systems rather than a limitation unique to biological brains.
AI Bearish · arXiv – CS AI · 3d ago · 6/10
🧠 Researchers found that large language models fail to accurately simulate human susceptibility to misinformation, consistently overstating how attitudes drive belief and sharing while ignoring social network effects. The study reveals systematic biases in how LLMs represent misinformation concepts, suggesting they are better tools for identifying where AI diverges from human judgment than for replacing human survey responses.
AI Bearish · arXiv – CS AI · 6d ago · 6/10
🧠 Researchers introduce CLI-Tool-Bench, a new benchmark for evaluating large language models' ability to generate complete software from scratch. Testing seven state-of-the-art LLMs reveals that top models achieve under 43% success rates, exposing significant limitations in current AI-driven 0-to-1 software generation despite increased computational investment.
AI Bearish · arXiv – CS AI · Apr 7 · 6/10
🧠 New research reveals that Large Language Models (LLMs) exhibit cultural bias and Western defaultism when generating metaphors across different cultural contexts. The study found that LLMs act more as cultural translators using dominant Western frameworks than as truly culturally-aware reasoning systems, even when prompted with specific cultural identities.
AI Bearish · arXiv – CS AI · Apr 7 · 6/10
🧠 Research reveals that Large Language Models (LLMs) experience greater performance degradation when facing English as a Second Language (ESL) inputs combined with typographical errors, compared to either factor alone. The study tested eight ESL variants with three levels of typos, finding that evaluations on clean English may overestimate real-world model performance.
AI Bearish · arXiv – CS AI · Apr 7 · 6/10
🧠 Research reveals that AI-generated economics papers significantly underperform human-authored publications, with idea quality representing the primary bottleneck (71% of the gap) rather than execution quality. Analysis of 953 papers shows human research has a 47.1% probability of being rated exceptional versus 16.5% for AI, with only 0.8% of AI papers surpassing median human quality on both dimensions.
🧠 Gemini
AI Neutral · arXiv – CS AI · Apr 6 · 6/10
🧠 Researchers introduce XpertBench, a new benchmark for evaluating Large Language Models on expert-level professional tasks across domains like finance, healthcare, and legal services. Even top-performing LLMs achieve only ~66% success rates, revealing a significant 'expert gap' in current AI systems' ability to handle complex professional work.
AI Bearish · arXiv – CS AI · Apr 6 · 6/10
🧠 Researchers introduce DeltaLogic, a new benchmark that tests AI models' ability to revise their logical conclusions when presented with minimal changes to premises. The study reveals that language models like Qwen and Phi-4 struggle with belief revision even when they perform well on initial reasoning tasks, showing concerning inertia patterns where models fail to update conclusions when evidence changes.
AI Bearish · arXiv – CS AI · Apr 6 · 6/10
🧠 A research study reveals that Large Language Models can reproduce behavioral patterns but fail to accurately predict intervention effects. The study tested three LLMs on climate psychology interventions across 59,508 participants from 62 countries, finding that descriptive accuracy doesn't translate to causal prediction accuracy.
AI Bearish · arXiv – CS AI · Apr 6 · 6/10
🧠 Researchers introduced ChomskyBench, a new benchmark for evaluating large language models' formal reasoning capabilities using the Chomsky Hierarchy framework. The study reveals that while larger models show improvements, current LLMs face severe efficiency barriers and remain significantly less efficient than traditional algorithmic programs for formal reasoning tasks.
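For readers unfamiliar with the framing: each level of the Chomsky Hierarchy demands strictly more computational machinery than the one below it. A toy illustration (not from the paper) of the jump from regular languages, decidable by a finite automaton or regex, to the context-free language a^n b^n, which requires counting:

```python
import re

def regular_check(s: str) -> bool:
    """Regular language: any run of a's followed by any run of b's.
    A regex (finite automaton) handles this with no memory of counts."""
    return re.fullmatch(r"a*b*", s) is not None

def context_free_check(s: str) -> bool:
    """Context-free language a^n b^n: equal counts of a's then b's.
    No regex can decide this; it needs a counter (pushdown automaton)."""
    n = len(s) // 2
    return s == "a" * n + "b" * n
```

The string "aaabb" passes the regular check but fails the context-free one, which is the kind of distinction a hierarchy-based benchmark can probe systematically.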