13 articles tagged with #llm-limitations. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bearish · arXiv – CS AI · 2d ago · 7/10
🧠 Researchers tested whether large language models develop spatial world models through maze-solving tasks, finding that leading models like Gemini, GPT-4, and Claude struggle with spatial reasoning. Performance varies dramatically (16-86% accuracy) depending on input format, suggesting LLMs lack robust, format-invariant spatial understanding rather than building true internal world models.
🧠 GPT-5 · 🧠 Claude · 🧠 Gemini
AI · Neutral · arXiv – CS AI · 2d ago · 7/10
🧠 Researchers introduce PaperScope, a comprehensive benchmark for evaluating multi-modal AI systems on complex scientific research tasks across multiple documents. The benchmark reveals that even advanced systems like OpenAI Deep Research and Tongyi Deep Research struggle with long-context retrieval and cross-document reasoning, exposing significant gaps in current AI capabilities for scientific workflows.
🏢 OpenAI
AI · Bearish · arXiv – CS AI · 2d ago · 7/10
🧠 Researchers discovered that at least 27% of labels in MedCalc-Bench, a clinical benchmark partly created with LLM assistance, are erroneous or incomputable. On a physician-reviewed subset, the corrected labels matched physician ground truth 74% of the time, versus only 20% for the original labels, revealing that LLM-assisted benchmarks can systematically distort AI model evaluation and training without active human oversight.
AI · Neutral · arXiv – CS AI · 3d ago · 7/10
🧠 Researchers introduce PilotBench, a benchmark evaluating large language models on safety-critical aviation tasks using 708 real-world flight trajectories. The study reveals a fundamental trade-off: traditional forecasters achieve superior numerical precision (7.01 MAE), while LLMs provide better instruction-following (86-89%) but with significantly degraded prediction accuracy (11-14 MAE), exposing brittleness in implicit physics reasoning for embodied AI applications.
AI · Bearish · Wired – AI · 6d ago · 7/10
🧠 Meta's Muse Spark AI model requests access to users' raw health data, including lab results, raising significant privacy concerns while demonstrating poor medical judgment. The system exemplifies how large language models lack the expertise to provide reliable healthcare guidance despite their persuasive presentation.
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠 Researchers propose a dual-helix governance framework to address AI agent reliability issues in WebGIS development, implementing a three-track architecture that achieved a 51% reduction in code complexity. The framework uses knowledge graphs and self-learning cycles to overcome LLM limitations such as context constraints and instruction failures.
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10
🧠 Researchers developed Compositional-ARC, a dataset to test AI models' ability to systematically generalize abstract spatial reasoning tasks. A small 5.7M-parameter transformer trained with meta-learning outperformed large language models like GPT-4o and Gemini 2.0 Flash on novel combinations of geometric transformations.
AI · Bearish · arXiv – CS AI · Feb 27 · 7/10
🧠 New research reveals that GPT-4o and other large language models lack true Theory of Mind capabilities despite appearing socially proficient. While LLMs can approximate human judgments in simple social tasks, they fail at logically equivalent challenges and show inconsistent mental-state reasoning.
AI · Bearish · arXiv – CS AI · 3d ago · 6/10
🧠 Researchers introduce OmniBehavior, a benchmark for evaluating large language models' ability to simulate real-world human behavior across complex, long-horizon scenarios. The study reveals that current LLMs struggle with authentic behavioral simulation and exhibit systematic biases toward homogenized, overly positive personas rather than capturing individual differences and realistic long-tail behaviors.
AI · Neutral · Crypto Briefing · 5d ago · 6/10
🧠 Ranjan Roy discusses AI's transition toward consumption-based pricing models that could reshape digital-service economics much as utility billing did. Roy addresses public concerns about the pace of AI advancement while cautioning that large language models are frequently overvalued relative to their practical capabilities.
AI · Neutral · Crypto Briefing · 5d ago · 7/10
🧠 Vishal Misra discusses how transformers learn correlations rather than causal relationships, highlighting the importance of in-context learning and Bayesian updating for advancing AI capabilities beyond pattern matching toward genuine reasoning.
AI · Bearish · arXiv – CS AI · Mar 27 · 6/10
🧠 Researchers introduce MolQuest, a new benchmark for evaluating AI models' ability to perform complex chemical structure elucidation through multi-step reasoning. Even state-of-the-art AI models achieve only 50% accuracy on this real-world scientific task, revealing significant limitations in current AI systems' strategic reasoning capabilities.
AI · Bearish · arXiv – CS AI · Mar 2 · 6/10
🧠 Researchers created ProbCOPA, a dataset testing probabilistic reasoning in humans versus AI models, finding that state-of-the-art LLMs consistently fail to match human judgment patterns. The study reveals fundamental differences in how humans and AI systems process non-deterministic inferences, highlighting limitations in current AI reasoning capabilities.