#model-limitations News & Analysis

22 articles tagged with #model-limitations. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 25 · 7/10

Position: Reasoning After Perception Means Reasoning Without Vision

Researchers challenge the assumption that language reasoning can compensate for vision-language model weaknesses, arguing that deferring visual reasoning to text collapses spatial information and degrades perception to passive encoding. The study introduces the Turing Eye Test to demonstrate that tasks requiring visual reasoning in pixel space cannot be solved through text-only reasoning, suggesting AI architectures must shift toward reasoning within perception rather than about it.

AI · Neutral · arXiv – CS AI · Jun 23 · 7/10

A Verifiable Search Is Not a Learnable Chain-of-Thought

Researchers demonstrate that language models cannot reliably learn certain types of algorithmic reasoning—specifically backtracking search procedures—through chain-of-thought fine-tuning, regardless of model size or training method. While models perform individual computational steps correctly, they fail to chain those steps into valid forward derivations when the task requires combinatorial search over unstructured information.

AI · Bearish · arXiv – CS AI · Jun 3 · 7/10

Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

Researchers demonstrate that Large Reasoning Models (LRMs) frequently 'overthink' problems after reaching correct answers, with continued reasoning degrading accuracy by up to 21%. The study introduces a protocol to measure reasoning sufficiency and reveals that harmful overthinking—where additional reasoning destabilizes correct solutions—represents a broader reliability risk affecting both multimodal and language-only models.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

Researchers demonstrate that Large Language Models exhibit significant limitations in zero-shot annotation tasks, with only 34.8% of initial errors correctable through prompting. The study reveals that model-internalized priors and concept definitions influence LLM performance more strongly than text-level memorization, highlighting fundamental constraints in LLM adaptability for reliable AI-as-a-judge applications.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning

Researchers introduced InPhyRe, a new benchmark showing that large multimodal models (LMMs) struggle with inductive physical reasoning—their ability to apply learned physical laws to novel, unseen scenarios. Testing 13 LMMs revealed critical weaknesses: models fail to generalize parametric knowledge, perform poorly with unseen physical laws, and exhibit language bias that causes them to ignore visual inputs, raising concerns about their reliability for safety-critical applications.

AI · Bearish · arXiv – CS AI · May 29 · 7/10

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

A physicist supervised Claude AI models over 12 days to build CLAX-PT, a physics simulation module, documenting how AI agents struggle with architectural redesign and with distinguishing symptom fixes from root-cause solutions. The study reveals that supervision design and human domain expertise, rather than model capability alone, determine whether AI-generated scientific code produces trustworthy results.

🧠 Claude

AI · Bearish · Ars Technica – AI · May 28 · 7/10

LLMs believe false statements even after explicit warnings that they're false

Research demonstrates that large language models persistently represent false statements as true even after explicit corrections, exhibiting a systematic bias toward confident affirmation regardless of accuracy. This finding reveals a fundamental vulnerability in LLM reliability that has implications for applications requiring factual precision.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

Researchers identified epistemic overreach in LLM-generated explanations of personal sensing data, where AI systems produce coherent-sounding narratives about anomalous days without sufficient evidentiary support. Testing 14,922 explanations across three LLM families revealed that models routinely attribute causes without data justification, and this problem persists even when provided richer context or explicit instructions to constrain claims.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 12 · 7/10

MedMeta: A Benchmark for LLMs in Synthesizing Meta-Analysis Conclusion from Medical Studies

Researchers introduce MedMeta, a benchmark evaluating how well large language models can synthesize conclusions from medical meta-analyses using only study abstracts. The study reveals that retrieval-augmented generation (RAG) significantly outperforms parametric-only approaches, but all current models struggle with evidence synthesis and fail to properly reject contradictory findings, achieving only marginally above-average performance even under ideal conditions.

AI · Bearish · arXiv – CS AI · May 4 · 7/10

Language Models Struggle to Use Representations Learned In-Context

A new study finds that large language models struggle to use representations they learn from in-context information, even though they can encode this information internally. The findings suggest current LLMs have fundamental limitations in adapting to novel contexts, affecting their ability to generalize learned patterns to downstream tasks.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10

Characterizing Pattern Matching and Its Limits on Compositional Task Structures

New research formally defines and analyzes pattern matching in large language models, revealing predictable limits in their ability to generalize on compositional tasks. The study provides mathematical boundaries for when pattern matching succeeds or fails, with implications for AI model development and understanding.

AI · Bearish · MIT News – AI · Nov 26 · 7/10

Researchers discover a shortcoming that makes LLMs less reliable

Researchers have identified a significant reliability issue in large language models where they incorrectly associate certain sentence patterns with specific topics. This causes LLMs to repeat learned patterns rather than engage in proper reasoning, undermining their reliability for critical applications.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

When Does Intrinsic Self-Correction Help? A Task-Sensitive Analysis

Researchers find that intrinsic self-correction in large language models works inconsistently across tasks, succeeding only when task structure supports specific revision mechanisms like constraint verification or complex reasoning review. The study challenges the assumption that self-correction is universally reliable and instead positions it as a task-dependent inference strategy.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

Researchers introduce CombEval, a dynamic benchmark framework for evaluating how well large language models handle combinatorial counting problems. Testing 11 LLMs reveals significant brittleness in handling ordered objects, indistinguishable elements, and nested dependencies, with code-augmented approaches showing modest improvements over direct reasoning.
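As a rough illustration of why the distinctions named in this summary matter (this sketch is not taken from CombEval itself, and the function names are hypothetical), the same selection problem yields different counts depending on whether order matters and whether elements are distinguishable:

```python
from itertools import permutations
from math import comb, perm


def count_ordered(n: int, k: int) -> int:
    """Ways to pick k of n distinct items when order matters."""
    return perm(n, k)


def count_unordered(n: int, k: int) -> int:
    """Ways to pick k of n distinct items when order does not matter."""
    return comb(n, k)


def count_distinct_arrangements(items: str) -> int:
    """Distinct orderings of a sequence that may contain repeated
    (indistinguishable) elements, counted by brute force."""
    return len(set(permutations(items)))


if __name__ == "__main__":
    print(count_ordered(4, 2))                  # order matters
    print(count_unordered(4, 2))                # order ignored
    print(count_distinct_arrangements("AAB"))   # two identical 'A's
```

Treating the two copies of `A` as distinct would count 3! = 6 arrangements instead of 3, which is the kind of slip the summary describes under indistinguishable elements.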

AI · Bearish · arXiv – CS AI · Jun 19 · 6/10

How LLMs Fail and Generalize in RTL Coding for Hardware Design?

Researchers reveal that large language models hit a hard ceiling at 90.8% accuracy on hardware design tasks, with failures rooted in fundamental knowledge gaps rather than training alignment issues. The study introduces a new error taxonomy showing that while optimization eliminates syntax errors, it paradoxically worsens deeper functional failures, suggesting that improving LLM hardware generation requires architectural advances in reasoning rather than refinement techniques.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

TempoBench: Evaluating Temporal Causal Reasoning in Large Language Models

Researchers introduce TempoBench, a formally verified benchmark for evaluating temporal causal reasoning in large language models, revealing a significant gap between forward simulation performance (96% accuracy) and causal reasoning ability (below 25%). The study demonstrates that LLMs struggle with identifying minimal causal inputs, instead over-specifying by listing all possible inputs rather than reasoning about necessity.
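To make the difference between listing every input and reasoning about necessity concrete, here is a minimal sketch under a simplifying assumption: a "minimal causal input set" is taken to be a smallest subset of the observed Boolean inputs whose values alone force the observed output. This definition and the names `minimal_sufficient_inputs` and `system` are illustrative assumptions, not TempoBench's formal setup.

```python
from itertools import combinations, product


def minimal_sufficient_inputs(f, observed):
    """Return all smallest subsets of input names whose observed values
    guarantee the observed output, whatever the other inputs are."""
    names = sorted(observed)
    target = f(**observed)
    for size in range(len(names) + 1):
        found = []
        for subset in combinations(names, size):
            free = [n for n in names if n not in subset]
            # Vary every input outside the subset over both truth values.
            forced = all(
                f(**{**observed, **dict(zip(free, values))}) == target
                for values in product([False, True], repeat=len(free))
            )
            if forced:
                found.append(set(subset))
        if found:
            return found  # stop at the smallest size that works
    return []


def system(a, b, c):
    # Hypothetical toy system: output is true if a, or both b and c.
    return a or (b and c)


if __name__ == "__main__":
    observed = {"a": True, "b": True, "c": False}
    print(minimal_sufficient_inputs(system, observed))
```

For this toy system and observation, `a` alone forces the output, so answering with all three inputs would be the over-specification the summary refers to.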

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

Researchers introduce ChronoVision, a benchmark dataset to evaluate how Vision-Language Models reason about temporal information across images. The study reveals that VLMs often rely on superficial visual shortcuts like color filters rather than genuine chronological logic to make temporal judgments.

AI · Bearish · arXiv – CS AI · Jun 5 · 6/10

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

Researchers conducted the first systematic evaluation of Large Language Models' ability to generate correct TLA+ formal specifications from natural language, testing 30 LLMs across 2,730 runs. Results show LLMs achieve only 8.6% semantic correctness despite 26.6% syntactic correctness, indicating current models cannot reliably produce formal specifications without expert oversight.

AI · Bearish · arXiv – CS AI · May 27 · 6/10

PitchBench: Measuring Pitch Hearing in Audio-Language Models

Researchers introduce PitchBench, a comprehensive evaluation suite that reveals audio-language models struggle significantly with pitch hearing—a fundamental musical perception task. The benchmark's 28 experiments expose inconsistent performance across different acoustic conditions, instrument types, and response formats, indicating current ALMs lack reliable pitch perception despite their growing real-world deployment in music applications.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Replicating Human Motivated Reasoning Studies with LLMs

Researchers found that base large language models do not replicate human motivated reasoning patterns when tested across four political studies. Unlike humans who adjust their reasoning based on desired conclusions, LLMs show different behavioral patterns, raising concerns about using these models for opinion simulation and argument assessment tasks.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality

Researchers introduced WebDevJudge, a benchmark for evaluating how well AI models can judge web development quality compared to human experts. The study reveals significant gaps between AI judges and human evaluation, highlighting fundamental limitations in AI's ability to assess complex, interactive web development tasks.

AI · Neutral · arXiv – CS AI · Apr 7 · 5/10

When Models Know More Than They Say: Probing Analogical Reasoning in LLMs

Researchers found that large language models (LLMs) have an asymmetry between their internal knowledge and prompted responses when detecting analogies. While probing reveals models understand rhetorical analogies better than their prompted responses suggest, both methods perform poorly on narrative analogies requiring deeper abstraction.