HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML

arXiv – CS AI | Jiajun Wu, Jian Yang, Tuney Zheng, Wei Zhang, Haowen Wang, Yihang Lou, Xianglong Liu
AI Summary

HTMLCure introduces a browser experience framework that improves how large language models generate functional HTML pages by testing them across multiple interactions and states rather than relying on static screenshots. The system automatically repairs broken pages through a closed-loop process, and a model trained on the repaired pages scores higher on HTML generation benchmarks.

Analysis

HTMLCure addresses a critical gap in LLM-generated HTML evaluation and refinement. While language models can produce syntactically correct HTML that renders in initial screenshots, many pages fail under real-world conditions like scrolling, clicking, resizing, or gameplay interactions. Traditional screenshot-based evaluation misses these functional failures, creating a misleading assessment of model capability. The framework simulates actual browser experiences across different viewports and user interactions, capturing deterministic failures that static evaluation methods overlook.
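
The difference between a first-render check and a multi-state check can be illustrated with a toy model. The sketch below is not HTMLCure's harness and drives no real browser: `probe`, `Failure`, `demo_page`, and the check names are all invented for illustration, and a page is reduced to a function from viewport width and action history to pass/fail observations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical toy model, not HTMLCure's harness: a "page" maps
# (viewport width, actions performed so far) to named pass/fail checks.
Observation = Dict[str, bool]
Page = Callable[[int, Tuple[str, ...]], Observation]


@dataclass(frozen=True)
class Failure:
    viewport: int
    actions: Tuple[str, ...]
    check: str


def probe(page: Page, viewports: List[int],
          scripts: List[Tuple[str, ...]]) -> List[Failure]:
    """Run every check after each step of each script at each viewport."""
    failures: List[Failure] = []
    seen = set()
    for width in viewports:
        for script in scripts:
            # A prefix of length 0 is the initial render, the only state
            # a static screenshot would observe.
            for step in range(len(script) + 1):
                state = (width, script[:step])
                if state in seen:
                    continue
                seen.add(state)
                for check, ok in sorted(page(*state).items()):
                    if not ok:
                        failures.append(Failure(width, script[:step], check))
    return failures


def demo_page(width: int, actions: Tuple[str, ...]) -> Observation:
    """Invented page: fine at first render on desktop, broken elsewhere."""
    return {
        "layout_fits": width >= 400,
        "counter_updates": "click_add" not in actions,
    }


static_ok = all(demo_page(1280, ()).values())
failures = probe(demo_page, [1280, 375], [("click_add",)])
```

Here the desktop first render passes every check, while probing a narrow viewport and a single click surfaces failures that the initial state never shows.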

The approach reflects a broader challenge in AI model development: the gap between laboratory performance and real-world utility. As LLMs increasingly generate code and interactive content, ensuring functional correctness becomes essential. HTMLCure's closed-loop repair mechanism diagnoses failures, applies targeted fixes, and validates the corrections, offering an automated path to better training data without manual intervention.
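
The diagnose, fix, and validate cycle can be sketched as a bounded loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `repair_loop`, `toy_diagnose`, and `toy_fix` are hypothetical names, the diagnosis is a pair of string checks rather than a browser run, and the fix is a string rewrite standing in for whatever repair step the real system uses.

```python
from typing import Callable, List, Optional


def repair_loop(html: str,
                diagnose: Callable[[str], List[str]],
                fix: Callable[[str, List[str]], str],
                max_rounds: int = 3) -> Optional[str]:
    """Return a page with no diagnosed issues, or None if the budget runs out."""
    for _ in range(max_rounds):
        issues = diagnose(html)
        if not issues:
            return html  # validated: nothing left to repair
        html = fix(html, issues)
    # Validate the result of the final fix before accepting it.
    return html if not diagnose(html) else None


def toy_diagnose(html: str) -> List[str]:
    """Hypothetical stand-in for browser-based diagnosis."""
    issues = []
    if "<button" in html and "onclick=" not in html:
        issues.append("button_without_handler")
    if 'name="viewport"' not in html:
        issues.append("missing_viewport_meta")
    return issues


def toy_fix(html: str, issues: List[str]) -> str:
    """Hypothetical targeted fix: address only the first reported issue."""
    issue = issues[0]
    if issue == "button_without_handler":
        return html.replace("<button", '<button onclick="increment()"', 1)
    if issue == "missing_viewport_meta":
        meta = '<head><meta name="viewport" content="width=device-width">'
        return html.replace("<head>", meta, 1)
    return html


broken = "<html><head></head><body><button>Add</button></body></html>"
repaired = repair_loop(broken, toy_diagnose, toy_fix)
gave_up = repair_loop(broken, toy_diagnose, toy_fix, max_rounds=1)
```

Re-running the diagnosis after every fix is what closes the loop: a page is accepted only when the same checks that flagged it now pass, and pages that exhaust the round budget are rejected rather than kept.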

The results show substantial practical impact. From a 97K-prompt corpus, the framework produced 63,703 quality-cleared pages, which were distilled into a refined 40K training set. The resulting HTMLCure-27B-Refined model scored 50.6 on HTMLBench-400 with a 45.2% pass rate on deterministic test cases, competitive with reference systems like Kimi-K2.6 and GPT-5.4. On MiniAppBench validation, it averaged 81.2, a 15.3-point improvement over baseline.
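
As a quick check on the figures reported in this paragraph, the short calculation below derives the share of prompts that yielded a quality-cleared page and the baseline score implied by the stated gain. It assumes "97K" means roughly 97,000 prompts, so the share is approximate; the variable names are illustrative.

```python
# Figures as reported in the summary; 97K is treated as ~97,000 (assumption).
prompts = 97_000
cleared_pages = 63_703
refined_score = 81.2   # MiniAppBench validation average
gain = 15.3            # reported improvement over baseline

# Share of the prompt corpus that ended up with a quality-cleared page.
cleared_share_pct = round(100 * cleared_pages / prompts, 1)

# Baseline score implied by the reported gain.
implied_baseline = round(refined_score - gain, 1)

print(f"cleared share: {cleared_share_pct}%")
print(f"implied baseline: {implied_baseline}")
```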

This work matters for the AI development ecosystem because it suggests that the quality of synthetic data, not just its quantity, drives model performance. The approach may also extend beyond HTML to other kinds of interactive content generation, offering a possible template for improving the reliability of LLM-generated code in other domains.

Key Takeaways
  • HTMLCure evaluates HTML pages through simulated browser interactions rather than static screenshots, catching failures missed by conventional methods
  • The framework's closed-loop repair system automatically diagnoses and fixes broken pages, generating high-quality training data without manual curation
  • A 15.3-point gain on MiniAppBench validation indicates that training on pages repaired under interactive state evaluation improves results on functional benchmarks
  • From 97K prompts, the system produced 63,703 quality-cleared candidate pages, roughly 66% of the prompt corpus
  • Results position smaller models like HTMLCure-27B competitively with reference systems, suggesting that interaction-based evaluation and repair of training data matter alongside model scale