Counsel: A Meta-Evaluation Dataset for Agentic Tasks
Researchers introduce Counsel, the first public meta-evaluation dataset for assessing how well LLM-based judges critique AI agent trajectories. The dataset addresses a critical bottleneck in agent evaluation by providing human-validated assessments of automated critique quality, enabling better calibration of evaluators at scale.
The emergence of autonomous AI agents capable of multi-step reasoning has created an acute evaluation problem. Human annotation of agent trajectories on existing benchmarks demands enormous time investment—single trajectories can require hours to evaluate manually. This constraint has pushed the industry toward automated evaluation using LLM-as-a-judge (LLMJ) approaches, which can assess agent performance at massive scale. However, the reliability of these automated critiques remains largely unvalidated, creating a meta-problem: who evaluates the evaluators?
Counsel addresses this gap with a public dataset in which humans have systematically evaluated the quality of LLM critiques. The dataset covers two domains, customer support and coding, and reports strong inter-annotator agreement (a Krippendorff's alpha of 0.78), lending credibility to the human judgments. By stratifying critiques into categories such as "spot on," "correct location but poor reasoning," and "should not have flagged," the dataset shows where and why automated judges fail.
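To make the agreement figure concrete, here is a minimal sketch of Krippendorff's alpha for nominal labels. The article does not say which variant of alpha the authors used or how they computed it, so the nominal form, the function name `krippendorff_alpha_nominal`, and the toy ratings below are all assumptions for illustration only. They are not drawn from the dataset.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: one inner list per rated item, holding the
    label each annotator assigned (annotators may differ per item).
    """
    # Build the coincidence matrix: every ordered pair of ratings within an
    # item contributes 1 / (m - 1), where m is the number of ratings on it.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating cannot be paired with anything
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the total number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more ratings")

    # Observed vs. chance-expected disagreement (off-diagonal mass).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so no disagreement is possible
    return 1.0 - observed / expected


# Hypothetical ratings: three critiques, two annotators each.
toy_ratings = [
    ["spot on", "spot on"],
    ["should not have flagged", "should not have flagged"],
    ["spot on", "should not have flagged"],
]
alpha = krippendorff_alpha_nominal(toy_ratings)
```

An alpha of 1 means perfect agreement and values near 0 mean agreement no better than chance, which is why 0.78 is usually read as solid but not perfect consensus.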
For the AI development community, this resource is strategically important. It provides empirical data to improve judge models themselves through training or calibration, rather than assuming automated evaluators are reliable proxies for human assessment. The finding that stronger judge models and increased reasoning effort improve alignment has immediate practical implications for practitioners building agent evaluation pipelines.
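One way a practitioner might compare judges against human annotations is sketched below. It assumes that "location accuracy" means the share of a judge's critiques whose flagged location humans accepted, that is, critiques rated "spot on" or "correct location but poor reasoning". That definition, the record layout, the judge names, and the sample rows are hypothetical and are not taken from the dataset or its paper.

```python
from collections import defaultdict

# Assumption: these two human labels both imply the judge flagged the right step.
LOCATION_CORRECT = {"spot on", "correct location but poor reasoning"}


def location_accuracy_by_judge(records):
    """Return {judge_name: fraction of critiques with a human-accepted location}.

    `records` is an iterable of (judge_name, human_label) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # judge -> [hits, total]
    for judge, human_label in records:
        counts[judge][1] += 1
        if human_label in LOCATION_CORRECT:
            counts[judge][0] += 1
    return {judge: hits / total for judge, (hits, total) in counts.items()}


# Hypothetical annotations for two made-up judge configurations.
sample_records = [
    ("judge_a", "spot on"),
    ("judge_a", "should not have flagged"),
    ("judge_b", "spot on"),
    ("judge_b", "correct location but poor reasoning"),
]
scores = location_accuracy_by_judge(sample_records)
```

Grouping by judge keeps the comparison across model strength or reasoning effort explicit, so a change in either shows up as a change in a single number per configuration.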
The open licensing and use of open-weight models make this resource broadly available, which is likely to accelerate the standardization of agent evaluation. As agentic systems move toward production deployment, validated evaluation methods become critical infrastructure. This dataset establishes baseline measurements of judge reliability, enabling the field to track improvements and build more trustworthy autonomous systems.
- Counsel is the first public meta-evaluation dataset validating the quality of LLM-based judge critiques for AI agents.
- Human annotators achieved a Krippendorff's alpha of 0.78 when rating LLMJ critiques across error detection and reasoning quality.
- Stronger judge models and increased reasoning effort correlate with improved human alignment, reaching ~88% on location accuracy.
- The dataset enables calibration and training of better evaluators for agent systems, addressing the bottleneck of manual trajectory annotation.
- Open licensing and the use of open-weight models broaden access to rigorous agent evaluation infrastructure across the research community.