AI · Neutral · arXiv CS AI · 8h ago · 7/10
What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design
Researchers have published guidelines for designing rigorous terminal-agent benchmarks to evaluate LLM coding and system-administration capabilities. The paper identifies over 15% of tasks in popular benchmarks as reward-hackable and catalogs six major failure modes caused by treating benchmark design like prompt engineering rather than adversarial testing.