GroundEval: A Deterministic Replacement for LLM-as-Judge in Stateful Agent Evaluation
GroundEval introduces a deterministic framework for evaluating AI agents by auditing their evidence retrieval and reasoning paths rather than relying on LLM judges. The tool detected a critical failure case where frontier LLM judges scored an agent response above 0.85, but the actual trace revealed the agent never retrieved the artifact it cited, yielding a GroundEval score of 0.000.
GroundEval addresses a fundamental blind spot in current AI evaluation methodologies. Traditional LLM-as-judge approaches assess only the final answer, so they miss hallucinations disguised in plausible language, a problem the research presents as endemic rather than marginal. The framework instruments agent behavior, recording exact tool calls, retrieval timing, and access permissions, then scores outputs and trajectories independently. This dual scoring exposes cases where an agent fabricates a reasoning chain that sounds coherent but has no evidentiary foundation.
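To make the idea of trajectory scoring concrete, here is a minimal sketch of a deterministic grounding check. It is not GroundEval's actual implementation: the `ToolCall` schema, the `grounding_score` function, and the rule "a citation counts only if a permitted `retrieve` call touched that artifact" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One recorded step in an agent trace (hypothetical schema)."""
    tool: str          # e.g. "search" or "retrieve"
    artifact_id: str   # identifier of the artifact the call touched
    step: int          # position of the call in the trace
    permitted: bool    # whether the agent was allowed to access the artifact


def grounding_score(cited: list[str], trace: list[ToolCall]) -> float:
    """Fraction of cited artifacts that the trace shows were retrieved with permission."""
    if not cited:
        # Assumption: an answer that cites nothing earns no grounding credit.
        return 0.0
    retrieved = {
        call.artifact_id
        for call in trace
        if call.tool == "retrieve" and call.permitted
    }
    hits = sum(1 for artifact in cited if artifact in retrieved)
    return hits / len(cited)


# The agent searched for "doc-17" but never retrieved it, then cited it anyway.
trace = [ToolCall(tool="search", artifact_id="doc-17", step=0, permitted=True)]
print(grounding_score(["doc-17"], trace))  # 0.0
```

Because the score depends only on the recorded trace and the list of citations, the same inputs always give the same result, regardless of how persuasive the answer text is.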
The three evaluation tracks—Silence (whether agents verify absence claims), Perspective (temporal consistency of evidence), and Counterfactual (correct causal mechanisms)—target failure modes that surface-level answer evaluation cannot catch. These represent critical safety concerns for deploying agents in domains like legal research, financial analysis, or healthcare, where incorrect evidence sourcing carries material consequences.
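Two of these tracks can be sketched as simple trace predicates. This is one possible reading of Silence and Perspective rather than the framework's published logic: the `Event` schema, the event kinds, and the integer timestamps are hypothetical. The Counterfactual track is left out because checking causal mechanisms needs domain-specific structure that the summary does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One trace event (hypothetical schema)."""
    step: int            # position in the trace
    kind: str            # "search", "retrieve", or "claim_absent"
    target: str          # scope searched, artifact retrieved, or thing claimed absent
    timestamp: int = 0   # creation time of retrieved evidence (arbitrary units)


def silence_verified(events: list[Event], target: str) -> bool:
    """True if every absence claim about `target` follows a search for it.

    Vacuously True when the trace contains no absence claim about `target`.
    """
    claims = [e.step for e in events if e.kind == "claim_absent" and e.target == target]
    searches = [e.step for e in events if e.kind == "search" and e.target == target]
    return all(any(s < c for s in searches) for c in claims)


def perspective_consistent(events: list[Event], as_of: int) -> bool:
    """True if no retrieved evidence postdates the question's `as_of` time."""
    return all(e.timestamp <= as_of for e in events if e.kind == "retrieve")


# The agent claims a clause is absent without ever searching for it.
unchecked = [Event(step=0, kind="claim_absent", target="indemnity-clause")]
print(silence_verified(unchecked, "indemnity-clause"))  # False
```

Both checks read only the recorded events, so a fluent answer cannot change the verdict.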
The industry implications are significant. AI teams that rely on LLM judges for agent evaluation may be running systems whose undetected hallucination rates are far higher than reported benchmarks suggest. Findings like these may accelerate adoption of deterministic evaluation frameworks in production deployments. For developers building agent systems, the work supports the view that audit trails of observable behavior are essential, not optional, for trustworthy systems.
The research suggests we're entering a necessary maturation phase where the AI industry transitions from opaque scoring to verifiable trace analysis. This shift mirrors quality assurance practices in regulated industries and establishes a new baseline expectation for what 'proven correctness' means for AI systems.
- LLM judges can score plausible-sounding hallucinations as high-quality answers despite missing foundational evidence
- GroundEval's deterministic trace-based scoring revealed a case where judges gave 0.85+ scores to a response with 0.000 grounding validity
- Three critical failure modes (unverified absence claims, temporal inconsistency, and invalid causal chains) remain invisible to answer-only evaluation
- Structured per-question diagnostics pairing tool activity with agent narration enable inspectable rather than opaque scoring
- The evaluation gap appears systemic rather than marginal, affecting production-deployed agent systems across multiple domains