AI
21,049 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.
Position: Science of AI Evaluation Requires Item-level Benchmark Data
Researchers argue that current AI evaluation methods have systemic validity failures and propose item-level benchmark data as essential for rigorous AI evaluation. They introduce OpenEval, a repository of item-level benchmark data to support evidence-centered AI evaluation and enable fine-grained diagnostic analysis.
Toward Full Autonomous Laboratory Instrumentation Control with Large Language Models
Researchers demonstrate how large language models such as ChatGPT can automate laboratory instrument control, lowering the programming barrier for scientists. The study shows that LLMs can write custom scripts and operate as autonomous AI agents for lab equipment management.
VERT: Reliable LLM Judges for Radiology Report Evaluation
Researchers introduced VERT, a new LLM-based metric for evaluating radiology reports that correlates up to 11.7% better with radiologist judgments than existing methods. The study demonstrates that fine-tuned smaller models can achieve significant performance gains while cutting inference time by a factor of up to 37.2.
When Adaptive Rewards Hurt: Causal Probing and the Switching-Stability Dilemma in LLM-Guided LEO Satellite Scheduling
Researchers find that adaptive reward mechanisms hurt performance in LLM-guided LEO satellite scheduling: static reward weights achieved 342.1 Mbps, while dynamic weights reached only 103.3 Mbps. Fine-tuned LLMs performed poorly because of weight oscillation, whereas a simpler MLP model achieved the best result at 357.9 Mbps.
Selective Forgetting for Large Reasoning Models
Researchers propose a new framework for 'selective forgetting' in Large Reasoning Models (LRMs) that can remove sensitive information from AI training data while preserving general reasoning capabilities. The method uses retrieval-augmented generation to identify and replace problematic reasoning segments with benign placeholders, addressing privacy and copyright concerns in AI systems.
Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory
Researchers propose Rashomon Memory, a new AI agent memory architecture where multiple goal-conditioned agents maintain parallel interpretations of the same events and negotiate through argumentation at query time. The system allows AI agents to handle conflicting perspectives on experiences rather than forcing a single interpretation, using Dung's argumentation semantics to determine which proposals survive retrieval.
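For readers unfamiliar with Dung's argumentation semantics, the sketch below illustrates one way "surviving" proposals could be computed. It is a minimal example, not the paper's implementation: the summary does not say which of Dung's semantics is used, so the grounded semantics is assumed here, and the proposal names and attack relation are hypothetical.

```python
def grounded_extension(arguments, attacks):
    """Return the grounded extension of an abstract argumentation framework.

    arguments: iterable of argument labels.
    attacks:   set of (attacker, target) pairs.
    Assumption: grounded semantics; the summarized paper may use another.
    """
    arguments = list(arguments)
    attackers = {a: set() for a in arguments}
    for src, dst in attacks:
        attackers[dst].add(src)

    accepted = set()
    while True:
        # An argument is defended when each of its attackers is itself
        # attacked by some already accepted argument.
        defended = {
            a for a in arguments
            if all(any((d, b) in attacks for d in accepted) for b in attackers[a])
        }
        if defended == accepted:
            return accepted
        accepted = defended


# Hypothetical proposals from three goal-conditioned perspectives.
proposals = ["safety_view", "speed_view", "cost_view"]
attacks = {("safety_view", "speed_view"), ("speed_view", "cost_view")}
surviving = grounded_extension(proposals, attacks)
```

Here `safety_view` is unattacked and so is accepted, it defeats `speed_view`, and that in turn reinstates `cost_view`. Two proposals that only attack each other would both be left out under this cautious semantics.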