StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs
Researchers introduce StructEval, a comprehensive benchmark for evaluating Large Language Models' ability to generate structured outputs across 18 formats, including JSON, HTML, and React. Even state-of-the-art models such as o1-mini achieve only a 75.58% average score, with open-source models scoring roughly 10 points lower.
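To make the task concrete, below is a minimal sketch of the kind of format-validity check such a benchmark implies for two of the listed formats. This is not StructEval's actual scoring code (those details are not given here); the function names are illustrative, and the HTML check is a deliberately weak syntactic test.

```python
import json
from html.parser import HTMLParser


class _TagBalanceChecker(HTMLParser):
    """Tracks open tags to flag obviously malformed HTML."""

    VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # An end tag must match the most recently opened tag.
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


def is_valid_json(output: str) -> bool:
    """Pass iff the model output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def is_plausible_html(output: str) -> bool:
    """Pass iff start/end tags nest consistently."""
    checker = _TagBalanceChecker()
    checker.feed(output)
    return checker.balanced and not checker.stack


if __name__ == "__main__":
    print(is_valid_json('{"name": "StructEval", "formats": 18}'))  # True
    print(is_plausible_html("<ul><li>item</li></ul>"))             # True
    print(is_plausible_html("<div><p>unclosed</div>"))             # False
```

Binary parse/no-parse checks like these are only a floor; a benchmark reporting fractional scores such as 75.58% presumably also grades how well the generated structure matches the task specification, not just whether it parses.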