AI · Neutral · arXiv CS AI · 3d ago · 7/10
MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants
Researchers introduce MiniAppBench, a new benchmark for evaluating Large Language Models' ability to generate interactive HTML applications rather than static text responses. The benchmark includes 500 real-world tasks and an agentic evaluation framework called MiniAppEval that uses browser automation for testing.