🧠 AI · ⚪ Neutral · Importance 6/10

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

arXiv – CS AI|Haoyue Yang, Zhangxiao Shen, Fan Ding, Hangting Lou, Yifeng Kou, Haoqing Yu, Jingyao Li, Zhengfan Wu, Siqi Bao, Jing Liu, Hua Wu|May 29, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce Cookie-Bench, a 1,000-query web development benchmark, and Cookie-Frame, an autonomous evaluation framework that assesses LLM-generated web applications through static perception, agent-driven interaction, and dynamic scoring. The approach eliminates reliance on reference implementations while aligning closely with human expert ratings. Applied to 13 frontier LLMs, it reveals significant performance gaps.

Analysis

The evaluation of LLM-generated web applications has become a critical bottleneck in AI development, as major language models increasingly target front-end code generation as a core product differentiator. Traditional evaluation methods—human-judged leaderboards, reference-based testing, and rigid checklists—fail to capture the nuanced reasoning that human reviewers apply when assessing interactive applications in real sessions. Cookie-Bench and Cookie-Frame address this gap by establishing a scalable, reference-free evaluation regime grounded in cognitive science principles.

The benchmark spans 11 domains with 1,000 diverse queries across static and interactive tasks, deliberately rewritten to prevent models from relying on memorized prompt patterns. This design reflects growing concerns about benchmark contamination as circulated prompts enter training data. Cookie-Frame's three-stage approach—static perception, autonomous agent interaction with continuous video/screenshot capture, and holistic scoring with structured failure attribution—creates a methodology that scales beyond human judgment while preserving the contextual reasoning humans provide.
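The three stages described above can be sketched roughly as follows. This is a toy illustration under heavy assumptions, not Cookie-Frame's actual implementation: every name here (`Frame`, `Report`, `static_perception`, `agent_interaction`, `dynamic_scoring`, `toy_page`) is hypothetical, the equal weighting is arbitrary, and substring checks plus a stub page stand in for what would presumably be a real browser session and model-based judging.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Frame:
    """One captured observation, recorded after an agent action (hypothetical)."""
    step: int
    action: str
    ok: bool        # did the page respond as the query required?
    note: str = ""


@dataclass
class Report:
    static_score: float       # stage 1 result, in [0, 1]
    interaction_score: float  # share of agent actions that succeeded, in [0, 1]
    failures: List[str] = field(default_factory=list)

    @property
    def total(self) -> float:
        # Equal weighting is an arbitrary choice made for this sketch.
        return 0.5 * self.static_score + 0.5 * self.interaction_score


def static_perception(html: str, required: List[str]) -> float:
    """Stage 1: fraction of required markers present in the markup (crude proxy)."""
    if not required:
        return 1.0
    return sum(1 for marker in required if marker in html) / len(required)


def agent_interaction(
    actions: List[str], page: Callable[[str], Tuple[bool, str]]
) -> List[Frame]:
    """Stage 2: perform each action and capture a frame after it."""
    frames = []
    for step, action in enumerate(actions, start=1):
        ok, note = page(action)
        frames.append(Frame(step, action, ok, note))
    return frames


def dynamic_scoring(static_score: float, frames: List[Frame]) -> Report:
    """Stage 3: score from captured evidence and attribute each failure to a step."""
    failures = [f"step {f.step} ({f.action}): {f.note}" for f in frames if not f.ok]
    interaction = 1.0 if not frames else sum(f.ok for f in frames) / len(frames)
    return Report(static_score, interaction, failures)


def toy_page(action: str) -> Tuple[bool, str]:
    """Stub for a rendered app: only the 'add' button is wired up."""
    if action == "click #add":
        return True, "counter incremented"
    return False, "no visible change"


html = '<button id="add">+</button><span id="count">0</span>'
report = dynamic_scoring(
    static_perception(html, ['id="add"', 'id="count"', 'id="reset"']),
    agent_interaction(["click #add", "click #reset"], toy_page),
)
print(report.failures)
```

The point of keeping per-step frames is that each failure in the final report can be traced back to a specific action and observation, which is the kind of evidence chain the summary attributes to structured failure attribution.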

For the AI development community, this framework standardizes web application evaluation at a critical moment when front-end generation has become a key competitive metric. The discovery of substantial headroom across 13 frontier LLMs suggests the field remains early-stage, creating opportunities for model improvement. For investors and developers building on these models, standardized benchmarking reduces uncertainty around capability claims and enables more accurate comparative analysis.

The work signals that AI evaluation methodology is maturing alongside model capabilities. Similar autonomous evaluation systems are likely to follow for other interactive domains, with this work serving as a precedent for reference-free assessment at scale.

Key Takeaways

→ Cookie-Bench provides a 1,000-query, 11-domain benchmark specifically designed for evaluating LLM-generated web applications without reference implementations.
→ Cookie-Frame's three-stage evaluation methodology (static perception, agent interaction, dynamic scoring) aligns closely with expert human ratings while scaling beyond manual review.
→ Testing reveals substantial performance headroom across 13 frontier LLMs, indicating the web generation field remains in early development stages.
→ The benchmark uses rewritten prompts to resist contamination from circulated training data, addressing a growing concern in AI evaluation.
→ Autonomous evaluation with continuous screen capture and video analysis provides evidence chains for structured failure attribution in interactive applications.

#llm-evaluation #web-generation #benchmark #autonomous-testing #ai-assessment #front-end-code #evaluation-framework

Read Original →via arXiv – CS AI
