Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation
Researchers introduce Cookie-Bench, a comprehensive 1,000-query web development benchmark, and Cookie-Frame, an autonomous evaluation framework that assesses LLM-generated web applications through static perception, agent-driven interaction, and dynamic scoring. The approach eliminates reliance on reference implementations while aligning closely with human expert ratings, and testing with it reveals significant performance gaps across 13 frontier LLMs.
The evaluation of LLM-generated web applications has become a critical bottleneck in AI development, as model developers increasingly treat front-end code generation as a core product differentiator. Traditional evaluation methods (human-judged leaderboards, reference-based testing, and rigid checklists) fail to capture the nuanced reasoning that human reviewers apply when assessing interactive applications in real sessions. Cookie-Bench and Cookie-Frame address this gap by establishing a scalable, reference-free evaluation regime grounded in cognitive science principles.
The benchmark spans 11 domains with 1,000 diverse queries across static and interactive tasks. The queries were deliberately rewritten to prevent models from relying on memorized prompt patterns, a design that reflects growing concerns about benchmark contamination as circulated prompts enter training data. Cookie-Frame evaluates in three stages: static perception, autonomous agent interaction with continuous video/screenshot capture, and holistic scoring with structured failure attribution. The result is a methodology that scales beyond human judgment while preserving the contextual reasoning humans provide.
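The three stages described above can be pictured as a small pipeline. The following is only an illustrative sketch, not Cookie-Frame's actual implementation: every name (`static_perception`, `interact`, `score`, `Evidence`, `Report`), the string-based checks, and the pass-fraction scoring rule are assumptions made for this example. A real system would drive a browser and capture screenshots or video where this sketch uses a stubbed `run_action` callback.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Evidence:
    """One captured step of the interaction trace (hypothetical structure)."""
    step: int
    action: str
    observation: str
    ok: bool


@dataclass
class Report:
    score: float
    failures: List[str]


def static_perception(html: str) -> List[str]:
    """Stage 1: inspect the page before interacting (toy string checks)."""
    notes = []
    if "<title" not in html:
        notes.append("missing <title>")
    if "<button" not in html:
        notes.append("no button element found")
    return notes


def interact(actions: List[str],
             run_action: Callable[[str], Tuple[str, bool]]) -> List[Evidence]:
    """Stage 2: execute each action and record an observation after it."""
    trace = []
    for i, action in enumerate(actions):
        observation, ok = run_action(action)
        trace.append(Evidence(i, action, observation, ok))
    return trace


def score(static_notes: List[str], trace: List[Evidence]) -> Report:
    """Stage 3: one overall score plus failures attributed to evidence."""
    failures = list(static_notes) + [
        f"step {e.step} ({e.action}): {e.observation}"
        for e in trace if not e.ok
    ]
    passed = sum(1 for e in trace if e.ok)
    total = len(trace) + len(static_notes)  # each static issue counts as a miss
    return Report(passed / total if total else 0.0, failures)


def fake_runner(action: str) -> Tuple[str, bool]:
    """Stand-in for a browser agent; a real one would capture screenshots."""
    if action.startswith("click"):
        return ("counter incremented", True)
    return ("input ignored", False)


def run_pipeline(html: str, actions: List[str]) -> Report:
    notes = static_perception(html)
    trace = interact(actions, fake_runner)
    return score(notes, trace)


page = "<html><title>Demo</title><button id='add'>+</button></html>"
report = run_pipeline(page, ["click #add", "type #name"])
print(report.score, report.failures)
```

Keeping the per-step evidence separate from the final score is what makes failure attribution possible in this sketch: each reported failure points back to the specific action and observation that triggered it.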
For the AI development community, this framework standardizes web application evaluation at a critical moment when front-end generation has become a key competitive metric. The discovery of substantial headroom across 13 frontier LLMs suggests the field remains early-stage, creating opportunities for model improvement. For investors and developers building on these models, standardized benchmarking reduces uncertainty around capability claims and enables more accurate comparative analysis.
The work signals that AI evaluation methodology is maturing alongside model capabilities. Future work will likely bring similar autonomous evaluation systems to other interactive domains, establishing a precedent for reference-free assessment at scale.
- Cookie-Bench provides a 1,000-query, 11-domain benchmark specifically designed for evaluating LLM-generated web applications without reference implementations.
- Cookie-Frame's three-stage evaluation methodology (static perception, agent interaction, dynamic scoring) aligns closely with expert human ratings while scaling beyond manual review.
- Testing reveals substantial performance headroom across 13 frontier LLMs, indicating the web generation field remains in early development stages.
- The benchmark uses rewritten prompts to resist contamination from circulated training data, addressing a growing concern in AI evaluation.
- Autonomous evaluation with continuous screen capture and video analysis provides evidence chains for structured failure attribution in interactive applications.