#ai-limitations News & Analysis

83 articles tagged with #ai-limitations. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bearish · Wired – AI · 2d ago · 6/10

Hands-On With Gemini Spark: I Gave It Access to My Life and It Friend-Zoned My Boyfriend

Google's Gemini Spark AI agent was given access to a user's emails, documents, and calendar to plan a birthday party, but failed to recognize the user's boyfriend as an important person despite having comprehensive personal data. The incident highlights significant limitations in current AI agents' contextual understanding and relationship inference capabilities, raising questions about how well these systems truly comprehend human priorities.

🧠 Gemini

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale

Researchers developed a triadic collaboration system integrating Large Language Models, teachers, and students for K-12 writing education, evaluated across 57,954 essays from 10,195 students over two years. The study demonstrates that LLMs effectively reduce teacher workload while teachers serve as quality gatekeepers, though excessive AI suggestions produce diminishing returns, indicating the need for adaptive collaboration strategies.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Understanding Automated Program Repair Agents Through the Lens of Traceability: An Empirical Study

Researchers conducted the first systematic analysis of five state-of-the-art Automated Program Repair agents across 500 real-world tasks, revealing that while LLM-based agents excel at simple fixes, they struggle with logic-intensive bugs and lack access to proper debugging tools. The study identifies critical limitations in current APR systems, including poor test generation capabilities and primitive tooling, proposing that next-generation systems require richer tool ecosystems and better benchmark metrics.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

The Cases LJP Never Sees: Prosecution Decision Prediction for More Complete Criminal Liability Assessment

Researchers introduce Prosecution Decision Prediction (PDP), a new legal AI benchmark that evaluates criminal liability assessment at the prosecutorial review stage rather than post-indictment. The study reveals that state-of-the-art large language models perform substantially worse on PDP tasks than traditional Legal Judgment Prediction, exposing significant gaps in AI's ability to evaluate evidence and apply legal discretion.

AI · Bearish · TechCrunch – AI · 3d ago · 6/10

Why Google’s AI can’t spell Google (or anything else)

Google's AI systems have demonstrated a surprising inability to accurately spell basic words, including Google itself, exposing fundamental limitations in current large language models despite their apparent sophistication. This incident highlights ongoing challenges in AI reliability and raises questions about the robustness of AI systems being deployed at scale.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

Researchers introduced OCR-Reasoning, a new benchmark with 1,069 annotated examples to evaluate how well multimodal AI models handle text-rich image reasoning tasks. The evaluation revealed that even the most advanced models fail to exceed 50% accuracy, indicating significant gaps in this critical capability area.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

EconCausal: A Context-Aware Economic Reasoning Benchmark for Large Language Models

Researchers introduced EconCausal, a benchmark dataset of 10,490 annotated economic causal relationships from peer-reviewed studies. It reveals that large language models struggle to condition predictions on changing contexts: they achieve 88% accuracy in fixed scenarios but drop to 41.3% when context shifts require reversing causal directions.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering and Reasoning

Researchers introduced EpiQAL, the first benchmark for evaluating large language models on epidemiological reasoning tasks. Testing 15 models reveals significant performance gaps in multi-step inference and evidence synthesis, indicating current LLMs struggle with population-level disease analysis despite their general capabilities.

AI · Bearish · Wired – AI · 5d ago · 6/10

I’m a Professional Fact-Checker. AI Is Wrong More Often Than You Think

A WIRED fact-checker examines AI's capability to perform fact-checking and finds that AI systems produce inaccurate results more frequently than commonly assumed. The article highlights a critical gap between AI's perceived reliability and its actual performance in verification tasks, raising concerns about deploying AI for critical information validation.

AI · Bearish · arXiv – CS AI · May 12 · 6/10

Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research

A new benchmarking framework reveals that AI tools in academic research excel at exploration and summaries but fail at precision tasks requiring exact information extraction. The study demonstrates that explainable AI features are inadequate, forcing researchers to manually verify outputs, and literature review tools lack reproducibility and transparency for systematic research.

🏢 xAI

AI · Neutral · arXiv – CS AI · May 12 · 6/10

The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans

Researchers studying 21 large language models found a significant 'grounding gap' in how LLMs understand abstract concepts compared to humans. While LLMs rely heavily on word associations, they systematically underreproduce emotional and internal-state properties, reaching a maximum correlation of r=0.37 against human-to-human baselines above r=0.9. The findings suggest current models can identify grounding dimensions when explicitly queried but fail to recruit them naturally during free generation.

AI · Neutral · arXiv – CS AI · May 12 · 5/10

Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke

Researchers evaluate semantic search as a tool for analyzing 18th-century intellectual history, specifically tracking how John Locke's ideas circulated through paraphrases and implicit references. While semantic search substantially outperforms traditional lexical methods at capturing meaning-level correspondences, linguistic analysis reveals that retrieval remains constrained by surface-level vocabulary overlap, suggesting both promise and limitations for historical corpus analysis.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

MathlibPR: Pull Request Merge-Readiness Benchmark for Formal Mathematical Libraries

Researchers introduced MathlibPR, a benchmark dataset derived from real Mathlib4 pull request histories, to evaluate whether large language models can assist in reviewing mathematical code contributions. Testing revealed that current LLMs struggle to distinguish merge-ready pull requests from those that passed builds but were revised or rejected, highlighting limitations in automated code review for formal mathematics.

🧠 Claude

AI · Bearish · arXiv – CS AI · May 11 · 6/10

Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation

Researchers found that Large Language Models lack behavioral coherence across different experimental settings, despite generating responses similar to humans. While LLMs can mimic human survey answers, they fail to maintain consistent behavioral profiles when tested conversationally, revealing a critical limitation for their use as substitutes in human-subject research.

AI · Neutral · arXiv – CS AI · May 4 · 6/10

Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues

Researchers introduce ArabCulture-Dialogue, a new dataset for evaluating large language models' cultural reasoning across 13 Arabic-speaking countries in both Modern Standard Arabic and regional dialects. Benchmarking reveals significant performance gaps, with LLMs consistently underperforming on dialectal Arabic compared to standardized variants, highlighting a critical blind spot in AI language model training.

AI · Bearish · Fortune Crypto · May 1 · 6/10

Hitting the ‘GenAI wall’: Where generative AI stops working, and what it means for your talent strategy

Companies implementing generative AI face a critical limitation where AI capabilities plateau without domain expertise, forcing organizations to reconsider workforce strategy. This phenomenon, termed the 'GenAI wall,' suggests that eliminating human expertise in favor of AI automation leads to stalled transformation initiatives and underperformance.

AI · Neutral · arXiv – CS AI · May 1 · 6/10

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

Researchers introduce TopBench, a benchmark dataset of 779 samples designed to evaluate how well Large Language Models handle implicit prediction tasks over tabular data—queries requiring inference from historical patterns rather than simple data retrieval. Testing reveals current LLMs struggle with intent recognition and default to lookup-based approaches, indicating that accurate intent disambiguation is critical before predictive reasoning can succeed.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams

Researchers introduce ReactBench, a benchmark that exposes critical limitations in multimodal large language models' ability to reason about complex topological structures in chemical reaction diagrams. Testing 17 MLLMs reveals a 30%+ performance gap between simple anchor-based tasks and sophisticated structural reasoning tasks, indicating that visual reasoning capabilities remain fundamentally constrained despite strong semantic recognition abilities.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

Researchers introduce SocialGrid, a benchmark environment for evaluating Large Language Models as autonomous agents in multi-agent social scenarios. The study reveals that even the most capable open-source LLMs achieve below 60% task completion and struggle significantly with social reasoning tasks like detecting deception, exposing critical limitations in current AI agent capabilities.

AI · Bearish · Decrypt – AI · Apr 15 · 6/10

Can AI Beat the Sports Betting Market? 8 of the Top Models Tried

KellyBench tested eight leading AI models including Claude, GPT-5, Gemini, and Grok on Premier League sports betting predictions over a full season, and none generated profits. The results highlight the persistent difficulty AI faces in beating efficient markets despite advances in language models and reasoning capabilities.

🧠 GPT-5 · 🧠 Claude · 🧠 Gemini

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Human-like Working Memory Interference in Large Language Models

Researchers discovered that large language models exhibit working memory limitations similar to humans, encoding multiple memory items in entangled representations that require interference control rather than direct retrieval. This finding reveals a shared computational constraint between biological and artificial systems, suggesting that working memory capacity may be a fundamental bottleneck in intelligent systems rather than a limitation unique to biological brains.

AI · Bearish · arXiv – CS AI · Apr 13 · 6/10

Overstating Attitudes, Ignoring Networks: LLM Biases in Simulating Misinformation Susceptibility

Researchers found that large language models fail to accurately simulate human susceptibility to misinformation, consistently overstating how attitudes drive belief and sharing while ignoring social network effects. The study reveals systematic biases in how LLMs represent misinformation concepts, suggesting they are better suited to identifying where AI diverges from human judgment than to replacing human survey responses.

AI · Bearish · arXiv – CS AI · Apr 10 · 6/10

Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios

Researchers introduce CLI-Tool-Bench, a new benchmark for evaluating large language models' ability to generate complete software from scratch. Testing seven state-of-the-art LLMs reveals that top models achieve under 43% success rates, exposing significant limitations in current AI-driven 0-to-1 software generation despite increased computational investment.

AI · Bearish · arXiv – CS AI · Apr 7 · 6/10

The Ideation Bottleneck: Decomposing the Quality Gap Between AI-Generated and Human Economics Research

Research reveals AI-generated economics papers significantly underperform human-authored publications, with idea quality representing the primary bottleneck (71% of the gap) rather than execution quality. Analysis of 953 papers shows human research achieves 47.1% exceptional probability versus 16.5% for AI, with only 0.8% of AI papers surpassing median human quality on both dimensions.

🧠 Gemini

AI · Bearish · arXiv – CS AI · Apr 7 · 6/10

Individual and Combined Effects of English as a Second Language and Typos on LLM Performance

Research reveals that Large Language Models (LLMs) experience greater performance degradation when facing English as a Second Language (ESL) inputs combined with typographical errors, compared to either factor alone. The study tested eight ESL variants with three levels of typos, finding that evaluations on clean English may overestimate real-world model performance.
