AI · Neutral · arXiv – CS AI · 10h ago · 6/10
NL2Scratch: An Executable Benchmark and Evaluation for Block-Based Programming
Researchers introduce NL2Scratch, a benchmark dataset of 311,648 natural-language-to-Scratch program pairs designed to evaluate AI models' ability to generate block-based code. The study reveals a significant gap between traditional metrics and semantic accuracy: models excel at token-level matching but fail to produce functionally correct programs.