AI · Neutral · arXiv – CS AI · 14h ago · 6/10
Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation
Researchers introduce Cookie-Bench, a 1,000-query web development benchmark, and Cookie-Frame, an autonomous evaluation framework that assesses LLM-generated web applications through static perception, agent-driven interaction, and dynamic scoring. The approach requires no reference implementations and aligns closely with human expert ratings. Applied to 13 frontier LLMs, it reveals significant performance gaps among them.