y0news

#llm-limitations News & Analysis

13 articles tagged with #llm-limitations. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · 2d ago · 7/10

Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks

Researchers tested whether large language models develop spatial world models through maze-solving tasks, finding that leading models like Gemini, GPT-4, and Claude struggle with spatial reasoning. Performance varies dramatically (16-86% accuracy) depending on input format, suggesting LLMs lack robust, format-invariant spatial understanding rather than building true internal world models.

🧠 GPT-5 · 🧠 Claude · 🧠 Gemini
AI · Neutral · arXiv – CS AI · 2d ago · 7/10

PaperScope: A Multi-Modal Multi-Document Benchmark for Agentic Deep Research Across Massive Scientific Papers

Researchers introduce PaperScope, a comprehensive benchmark for evaluating multi-modal AI systems on complex scientific research tasks across multiple documents. The benchmark reveals that even advanced systems like OpenAI Deep Research and Tongyi Deep Research struggle with long-context retrieval and cross-document reasoning, exposing significant gaps in current AI capabilities for scientific workflows.

🏢 OpenAI
AI · Bearish · arXiv – CS AI · 2d ago · 7/10

Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

Researchers discovered that at least 27% of labels in MedCalc-Bench, a clinical benchmark partly created with LLM assistance, contain errors or are incomputable. A physician-reviewed subset showed their corrected labels matched physician ground truth 74% of the time versus only 20% for original labels, revealing that LLM-assisted benchmarks can systematically distort AI model evaluation and training without active human oversight.

AI · Neutral · arXiv – CS AI · 3d ago · 7/10

PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

Researchers introduce PilotBench, a benchmark evaluating large language models on safety-critical aviation tasks using 708 real-world flight trajectories. The study reveals a fundamental trade-off: traditional forecasters achieve superior numerical precision (7.01 MAE) while LLMs provide better instruction-following (86-89%) but with significantly degraded prediction accuracy (11-14 MAE), exposing brittleness in implicit physics reasoning for embodied AI applications.

AI · Bearish · Wired – AI · 6d ago · 7/10

Meta’s New AI Asked for My Raw Health Data—and Gave Me Terrible Advice

Meta's Muse Spark AI model requests access to users' raw health data including lab results, raising significant privacy concerns while demonstrating poor medical judgment. The system exemplifies how large language models lack the expertise to provide reliable healthcare guidance despite their persuasive presentation.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development

Researchers propose a dual-helix governance framework to address AI agent reliability issues in WebGIS development, implementing a 3-track architecture that achieved a 51% reduction in code complexity. The framework uses knowledge graphs and self-learning cycles to overcome LLM limitations such as context-window constraints and instruction-following failures.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10

Compositional-ARC: Assessing Systematic Generalization in Abstract Spatial Reasoning

Researchers developed Compositional-ARC, a dataset that tests AI models' ability to systematically generalize abstract spatial reasoning. A small 5.7M-parameter transformer trained with meta-learning outperformed large language models such as GPT-4o and Gemini 2.0 Flash on novel combinations of geometric transformations.

AI · Bearish · arXiv – CS AI · Feb 27 · 7/10

GPT-4o Lacks Core Features of Theory of Mind

New research reveals that GPT-4o and other large language models lack true Theory of Mind capabilities, despite appearing socially proficient. While LLMs can approximate human judgments in simple social tasks, they fail at logically equivalent challenges and show inconsistent mental state reasoning.

AI · Bearish · arXiv – CS AI · 3d ago · 6/10

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Researchers introduce OmniBehavior, a benchmark for evaluating large language models' ability to simulate real-world human behavior across complex, long-horizon scenarios. The study reveals that current LLMs struggle with authentic behavioral simulation and exhibit systematic biases toward homogenized, overly positive personas rather than capturing individual differences and realistic long-tail behaviors.

AI · Neutral · Crypto Briefing · 5d ago · 6/10

Ranjan Roy: AI is shifting towards consumption-based models, public fear stems from rapid advancements, and large language models are often overhyped | Big Technology

Ranjan Roy discusses AI's transition toward consumption-based pricing models that could reshape digital service economics in the manner of utility billing. Roy addresses public concerns about the pace of AI advancement while cautioning that large language models are frequently overhyped relative to their practical capabilities.

AI · Bearish · arXiv – CS AI · Mar 2 · 6/10

Humans and LLMs Diverge on Probabilistic Inferences

Researchers created ProbCOPA, a dataset testing probabilistic reasoning in humans versus AI models, finding that state-of-the-art LLMs consistently fail to match human judgment patterns. The study reveals fundamental differences in how humans and AI systems process non-deterministic inferences, highlighting limitations in current AI reasoning capabilities.