y0news

#diagnostic-framework News & Analysis

7 articles tagged with #diagnostic-framework. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Apr 15 · 7/10

Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities

Researchers propose a cognitive diagnostic framework that evaluates large language models across fine-grained ability dimensions rather than aggregate scores, enabling targeted model improvement and task-specific selection. The approach uses multidimensional Item Response Theory to estimate abilities across 35 dimensions for mathematics and generalizes to physics, chemistry, and computer science with strong predictive accuracy.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs

Researchers have developed NICE, a theory-grounded diagnostic benchmark for evaluating the social intelligence of large language models, organizing social abilities into 4 categories and 11 dimensions. Testing across 5 frontier LLMs reveals that while models perform well in aggregate accuracy, they consistently struggle with communication tasks, particularly in multi-turn dialogue, nonverbal understanding, and synchrony.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

DIANOIA: Diagnostic Decomposition and Joint Optimization for Multi-Agent Reasoning

Researchers introduce DIANOIA, a diagnostic framework for multi-agent LLM systems that decomposes reasoning performance into three measurable channels: coverage, fidelity, and synthesis. The method enables practitioners to identify performance bottlenecks and allocate computational resources more efficiently, achieving significant improvements on multiple benchmarks.

🧠 Claude
AI · Neutral · arXiv – CS AI · May 12 · 6/10

When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression

Researchers present a diagnostic framework for evaluating KV cache eviction selectors in large language models, identifying three failure modes and demonstrating that value-aware ranking combined with evidence recovery achieves 72.6% accuracy on positive-margin test cases. The work addresses a critical bottleneck in long-context LLM inference by revealing why compression strategies succeed or fail.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

TRAJEVAL: Decomposing Code Agent Trajectories for Fine-Grained Diagnosis

Researchers introduce TRAJEVAL, a diagnostic framework that breaks down AI code agent performance into three stages (search, read, edit) to identify specific failure points rather than just binary pass/fail outcomes. The framework analyzed 16,758 trajectories and found that real-time feedback based on trajectory signals improved state-of-the-art models by 2.2-4.6 percentage points while reducing costs by 20-31%.

🧠 GPT-5
AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models

Researchers developed a framework using face pareidolia (seeing faces in non-face objects) to test how different AI vision models handle ambiguous visual information. The study found that vision-language models like CLIP and LLaVA tend to over-interpret ambiguous patterns, while pure vision models remain more uncertain and detection models are more conservative.