#llm-limitations News & Analysis

29 articles tagged with #llm-limitations. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · Decrypt · 2d ago · 7/10

AI Models Can’t Agree on Basic Facts Most of the Time, Study Shows

A new study found that five frontier AI models disagreed on how to fact-check 67% of 1,000 real-world claims, raising critical concerns about AI reliability and consistency. This inconsistency highlights fundamental limitations in current large language models that could impact their deployment in high-stakes applications requiring factual accuracy.

AI · Neutral · arXiv – CS AI · 3d ago · 7/10

Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

Researchers prove that large language models fundamentally cannot perform causal discovery through standard training methods, establishing this limitation as intrinsic to supervised learning rather than a model-specific flaw. They propose Agentic Causal Bayesian Optimization (A-CBO), which bypasses this constraint by using frozen language models as query oracles within an external optimization loop, achieving superior performance on causal inference benchmarks.

AI · Bearish · arXiv – CS AI · 4d ago · 7/10

RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations

Researchers introduce RepoMirage, an evaluation suite that tests whether code agents truly understand repository context by applying perturbations to challenge their reasoning abilities. The study reveals a significant gap in how agents handle complex, multi-file code tasks, with performance dropping from 66.8% to 25.3% when explicit structural understanding is required.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

MDGYM: Benchmarking AI Agents on Molecular Simulations

Researchers introduced MDGYM, a benchmark testing AI agents' ability to autonomously execute molecular dynamics simulations, finding that even the strongest systems solve only 21% of easy tasks. The poor performance reveals that advanced code generation does not translate to physical reasoning, exposing a critical gap between general software engineering competence and domain-specific scientific workflows.

🧠 Claude

AI · Bearish · arXiv – CS AI · May 12 · 7/10

In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification

Researchers demonstrate that large language models suffer from 'in-context fixation,' where homogeneous demonstration labels—even semantically valid ones—cause classification accuracy to collapse below 12%. The models treat label-slot tokens as an exhaustive vocabulary set rather than learning from semantic meaning, revealing that in-context learning operates as constrained vocabulary retrieval rather than genuine concept learning.

🧠 Llama

AI · Bearish · arXiv – CS AI · May 11 · 7/10

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

Research reveals that AI models, particularly few-shot large language models, struggle significantly with mid-range quality responses in automated short answer scoring, while fine-tuned models and human experts maintain consistent performance across all quality levels. This degradation raises fairness concerns for students with developing understanding, emphasizing the need for quality-conditioned evaluation metrics.

🧠 GPT-4 · 🧠 GPT-5 · 🧠 Claude

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks

Researchers tested whether large language models develop spatial world models through maze-solving tasks, finding that leading models like Gemini, GPT-4, and Claude struggle with spatial reasoning. Performance varies dramatically (16-86% accuracy) depending on input format, suggesting LLMs lack robust, format-invariant spatial understanding rather than building true internal world models.

🧠 GPT-5 · 🧠 Claude · 🧠 Gemini

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

PaperScope: A Multi-Modal Multi-Document Benchmark for Agentic Deep Research Across Massive Scientific Papers

Researchers introduce PaperScope, a comprehensive benchmark for evaluating multi-modal AI systems on complex scientific research tasks across multiple documents. The benchmark reveals that even advanced systems like OpenAI Deep Research and Tongyi Deep Research struggle with long-context retrieval and cross-document reasoning, exposing significant gaps in current AI capabilities for scientific workflows.

🏢 OpenAI

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

Researchers discovered that at least 27% of labels in MedCalc-Bench, a clinical benchmark partly created with LLM assistance, contain errors or are incomputable. On a physician-reviewed subset, the researchers' corrected labels matched physician ground truth 74% of the time, versus only 20% for the original labels. The result shows that LLM-assisted benchmarks can systematically distort AI model evaluation and training without active human oversight.

AI · Neutral · arXiv – CS AI · Apr 13 · 7/10

PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

Researchers introduce PilotBench, a benchmark evaluating large language models on safety-critical aviation tasks using 708 real-world flight trajectories. The study reveals a fundamental trade-off: traditional forecasters achieve superior numerical precision (7.01 MAE) while LLMs provide better instruction-following (86-89%) but with significantly degraded prediction accuracy (11-14 MAE), exposing brittleness in implicit physics reasoning for embodied AI applications.

AI · Bearish · Wired – AI · Apr 10 · 7/10

Meta’s New AI Asked for My Raw Health Data—and Gave Me Terrible Advice

Meta's Muse Spark AI model requests access to users' raw health data including lab results, raising significant privacy concerns while demonstrating poor medical judgment. The system exemplifies how large language models lack the expertise to provide reliable healthcare guidance despite their persuasive presentation.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development

Researchers propose a dual-helix governance framework to address AI agent reliability issues in WebGIS development, implementing a 3-track architecture that achieved a 51% reduction in code complexity. The framework uses knowledge graphs and self-learning cycles to overcome LLM limitations such as context constraints and instruction-following failures.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10

Compositional-ARC: Assessing Systematic Generalization in Abstract Spatial Reasoning

Researchers developed Compositional-ARC, a dataset to test AI models' ability to systematically generalize abstract spatial reasoning tasks. A small 5.7M parameter transformer model trained with meta-learning outperformed large language models like GPT-4o and Gemini 2.0 Flash on novel geometric transformation combinations.

AI · Bearish · arXiv – CS AI · Feb 27 · 7/10

GPT-4o Lacks Core Features of Theory of Mind

New research reveals that GPT-4o and other large language models lack true Theory of Mind capabilities, despite appearing socially proficient. While LLMs can approximate human judgments in simple social tasks, they fail at logically equivalent challenges and show inconsistent mental state reasoning.

AI · Bearish · Fortune Crypto · 3d ago · 6/10

Starbucks quietly retired its AI agent just months after deployment after it hallucinated coffee shop inventories and slowed down baristas

Starbucks decommissioned an AI agent deployed to manage inventory and operations after just months of use due to persistent hallucinations and performance degradation that ultimately slowed barista workflows. The failure highlights critical challenges in deploying large language models to real-world operational tasks where accuracy directly impacts business efficiency.

AI · Bearish · arXiv – CS AI · 3d ago · 6/10

DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

Researchers introduce DynaSchedBench, a calibrated framework for testing AI agents on dynamic job scheduling problems, revealing that large language models underperform expectations. The study uncovers an 'Observability Paradox' where providing agents with complete information actually degrades performance, and shows LLM-based schedulers fail to consistently outperform traditional heuristic baselines despite significant computational overhead.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

Researchers introduced MentalMap, a multilingual benchmark testing whether large language models can build spatial world models from text alone. The study found a universal performance cliff at reasoning level L3 across all tested models and languages, where models fail to maintain spatial reasoning accuracy despite strong baseline performance, suggesting fundamental text-only working memory constraints rather than architectural limitations.

AI · Neutral · arXiv – CS AI · 4d ago · 5/10

Managing Uncertainty in LLM-Generated Procedural Knowledge for Virtual Laboratory Planning

Researchers present a framework for managing uncertainty in language model-generated laboratory procedures for virtual educational environments. The system uses structured domain representations and LLM outputs to extract, validate, and repair procedural steps, addressing common LLM failures like missing actions, incorrect sequencing, and logical incompatibilities.

AI · Bullish · MIT Technology Review · May 21 · 6/10

Roundtables: Can AI Learn to Understand the World?

AI companies are advancing world models to help systems better understand the external environment and move beyond the limitations of large language models. A roundtable discussion featuring MIT Technology Review editors explores how this emerging capability could reshape AI development.

AI · Bearish · arXiv – CS AI · May 12 · 6/10

Agentic AI Scientists Are Not Built For Autonomous Scientific Discovery

A new position paper argues that despite functioning as useful co-scientists, agentic AI systems are fundamentally not designed for truly autonomous scientific discovery due to challenges in problem selection bias, insufficient tacit knowledge in training data, compressed output diversity, and lack of real-world experimental feedback loops.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Evaluating Prompting and Execution-Based Methods for Deterministic Computation in LLMs

Researchers systematically evaluated multiple prompting strategies for LLMs on deterministic computation tasks, finding that standard methods like Chain-of-Thought achieve only moderate accuracy while Program-of-Thought (PoT) and specialized models achieve perfect accuracy by delegating computation to external tools. The study demonstrates that LLMs simulate reasoning patterns rather than reliably performing exact symbolic computation, suggesting hybrid approaches combining LLMs with external executors provide more reliable solutions for deterministic tasks.
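As a rough illustration of the Program-of-Thought idea summarized above, the sketch below has the model's output treated as a small program that an external interpreter runs, so the arithmetic is done by the executor and not by the model. The program string, the `answer` variable convention, and the `run_generated_program` helper are all hypothetical stand-ins, not taken from the paper.

```python
# Hypothetical stand-in for an LLM response. In Program-of-Thought prompting,
# the model writes code for the calculation instead of stating the answer.
GENERATED_PROGRAM = """
principal = 1200
rate = 0.05
years = 3
answer = principal * (1 + rate) ** years
"""


def run_generated_program(source: str) -> float:
    """Execute model-written code and return the value it binds to `answer`.

    Assumed convention (not from the paper): the program stores its
    final result in a variable named `answer`.
    """
    # Builtins are emptied to limit what the code can call.
    # This is NOT a real sandbox; untrusted code needs proper isolation.
    namespace: dict = {"__builtins__": {}}
    exec(source, namespace)
    return namespace["answer"]


result = run_generated_program(GENERATED_PROGRAM)
print(round(result, 2))
```

In a real pipeline the program string would come from a model call, and the executor would run in an isolated process with time and memory limits.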

AI · Neutral · arXiv – CS AI · May 7 · 6/10

Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

Researchers present a neuro-symbolic framework that challenges the conventional belief that temporal reasoning failures in LLMs stem from inherent logical deduction deficits. By decoupling text-to-event representation from symbolic reasoning using a Probabilistic Inconsistency Signal, the framework achieves perfect accuracy on structured temporal tasks and identifies that representation quality—not reasoning capability—is the true bottleneck.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

A Dialogue-Based Framework for Correcting Multimodal Errors in AI-Assisted STEM Education

Researchers evaluated three major LLMs (Claude, Gemini, ChatGPT) on multimodal physics problems and found a significant performance drop compared to text-only tasks, identifying visual processing as the primary failure mode. A structured dialogue intervention corrected 82% of errors overall and achieved 100% correction on visual processing errors, offering immediate solutions for educators without requiring model retraining.

🧠 ChatGPT · 🧠 Claude · 🧠 Gemini

AI · Bearish · arXiv – CS AI · Apr 15 · 6/10

LLMs Struggle with Abstract Meaning Comprehension More Than Expected

Research shows that large language models like GPT-4o struggle significantly with abstract meaning comprehension across zero-shot, one-shot, and few-shot settings, while fine-tuned models like BERT and RoBERTa perform better. A bidirectional attention classifier inspired by human cognitive strategies improved accuracy by 3-4% on abstract reasoning tasks, revealing a critical gap in how modern LLMs handle non-concrete, high-level semantics.

🧠 GPT-4

AI · Bearish · arXiv – CS AI · Apr 13 · 6/10

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Researchers introduce OmniBehavior, a benchmark for evaluating large language models' ability to simulate real-world human behavior across complex, long-horizon scenarios. The study reveals that current LLMs struggle with authentic behavioral simulation and are systematically biased toward homogenized, overly positive personas, failing to capture individual differences and realistic long-tail behaviors.
