AI · Neutral · arXiv – CS AI · 7h ago · 6/10
Benchmarking Multimodal LLMs on Code Generation for Complex Interactive Webpages
Researchers introduced WebIGBench, presented as the first benchmark for evaluating multimodal LLMs on code generation for interactive webpages. It addresses a gap in existing evaluation frameworks, which assess only static pages. The benchmark comprises 103 real-world webpages with 871 distinct interactive actions, and it proposes automated assessment methods that measure interaction consistency, not just visual fidelity.