AINeutralarXiv โ CS AI ยท 4h ago2
๐ง
From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning
Researchers propose a dynamic agent-centric benchmarking system for evaluating large language models that replaces static datasets with autonomous agents that generate, validate, and solve problems iteratively. The protocol uses teacher, orchestrator, and student agents to create progressively challenging text anomaly detection tasks that expose reasoning errors missed by conventional benchmarks.