#diagnostic-framework News & Analysis

9 articles tagged with #diagnostic-framework. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 12 · 7/10

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Researchers introduce ToolSense, a diagnostic framework that reveals significant gaps in how large language models understand tools despite strong retrieval performance. Testing on ~47k tools shows parametric models collapse by 50-64% on realistic queries compared to benchmark performance, suggesting current evaluation methods mask fundamental knowledge deficiencies.

AI · Neutral · arXiv – CS AI · Apr 15 · 7/10

Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities

Researchers propose a cognitive diagnostic framework that evaluates large language models across fine-grained ability dimensions rather than aggregate scores, enabling targeted model improvement and task-specific selection. The approach uses multidimensional Item Response Theory to estimate abilities across 35 dimensions for mathematics and generalizes to physics, chemistry, and computer science with strong predictive accuracy.

AI · Neutral · arXiv – CS AI · Mar 5 · 7/10

RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering

Researchers propose RAG-X, a diagnostic framework for evaluating retrieval-augmented generation systems in medical AI applications. The study reveals an 'Accuracy Fallacy' showing a 14% gap between perceived system success and actual evidence-based grounding in medical question-answering systems.

AI · Neutral · arXiv – CS AI · Jun 23 · 5/10

PsyBridge: A Hybrid Intelligent Framework for Multi-Dimensional Mental Health Assessment and Decision Support

PsyBridge is a hybrid AI framework that integrates validated mental health screening tools (PHQ-9, GAD-7) with cognitive and personality assessments to provide interpretable, multi-dimensional mental health risk classification. The framework achieved 84% accuracy on a 500-patient semi-synthetic dataset, outperforming isolated screening instruments and demonstrating potential for digital healthcare and telehealth applications.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs

Researchers have developed NICE, a theory-grounded diagnostic benchmark for evaluating the social intelligence of large language models, organizing social abilities into 4 categories and 11 dimensions. Testing across 5 frontier LLMs reveals that while models perform well in aggregate accuracy, they consistently struggle with communication tasks, particularly in multi-turn dialogue, nonverbal understanding, and synchrony.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

DIANOIA: Diagnostic Decomposition and Joint Optimization for Multi-Agent Reasoning

Researchers introduce DIANOIA, a diagnostic framework for multi-agent LLM systems that decomposes reasoning performance into three measurable channels: coverage, fidelity, and synthesis. The method enables practitioners to identify performance bottlenecks and allocate computational resources more efficiently, achieving significant improvements on multiple benchmarks.

🧠 Claude

AI · Neutral · arXiv – CS AI · May 12 · 6/10

When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression

Researchers present a diagnostic framework for evaluating KV cache eviction selectors in large language models, identifying three failure modes and demonstrating that value-aware ranking combined with evidence recovery achieves 72.6% accuracy on positive-margin test cases. The work addresses a critical bottleneck in long-context LLM inference by revealing why compression strategies succeed or fail.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

TRAJEVAL: Decomposing Code Agent Trajectories for Fine-Grained Diagnosis

Researchers introduce TRAJEVAL, a diagnostic framework that breaks down AI code agent performance into three stages (search, read, edit) to identify specific failure points rather than just binary pass/fail outcomes. The framework analyzed 16,758 trajectories and found that real-time feedback based on trajectory signals improved state-of-the-art models by 2.2-4.6 percentage points while reducing costs by 20-31%.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models

Researchers developed a framework using face pareidolia (seeing faces in non-face objects) to test how different AI vision models handle ambiguous visual information. The study found that vision-language models like CLIP and LLaVA tend to over-interpret ambiguous patterns, while pure vision models remain more uncertain and detection models are more conservative.