NL2Scratch: An Executable Benchmark and Evaluation for Block-Based Programming
Researchers introduce NL2Scratch, a benchmark dataset of 311,648 natural-language-to-Scratch program pairs designed to evaluate AI models' ability to generate block-based code. The study reveals significant gaps between traditional metrics and semantic accuracy, with models excelling at token-level matching but failing to produce functionally correct programs.
NL2Scratch addresses a critical blind spot in AI code generation research. While transformer-based models have achieved impressive results on text-based programming benchmarks, block-based environments like Scratch remain largely unexplored despite their dominance in K-12 education. The benchmark's 311,648 examples drawn from real Scratch projects provide genuine, complex program structures that text-based benchmarks cannot replicate, offering researchers authentic training material.
The Semantic Alignment Consistency (SAC) metric is the study's main methodological contribution. Conventional evaluation metrics such as BLEU and token-level F1 measure surface-level similarity, which can mask fundamental failures in program correctness. SAC operates at the slot level: it evaluates whether generated programs correctly implement the intended actions, conditions, and numeric arguments, capturing the semantic fidelity that matters to actual users and educators.
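To make the idea of slot-level evaluation concrete, here is a minimal sketch in Python. It is not the paper's SAC formula, which this summary does not spell out; the `Slots` structure, the `slot_consistency` function, and the equal weighting of the three slot types are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Slots:
    """Hypothetical slot view of one Scratch script (not the paper's schema)."""
    actions: Tuple[str, ...]      # e.g. motion or looks blocks, in order
    conditions: Tuple[str, ...]   # e.g. the predicate inside an "if" block
    numbers: Tuple[float, ...]    # numeric arguments, in order


def slot_consistency(ref: Slots, gen: Slots) -> float:
    """Fraction of slot types on which the generated script matches exactly.

    Assumption: each slot type counts equally and must match in full.
    """
    checks = [
        ref.actions == gen.actions,
        ref.conditions == gen.conditions,
        ref.numbers == gen.numbers,
    ]
    return sum(checks) / len(checks)


# The generated script has the right blocks but the wrong turn angle.
reference = Slots(("move", "turn"), ("touching edge",), (10.0, 15.0))
generated = Slots(("move", "turn"), ("touching edge",), (10.0, 90.0))
score = slot_consistency(reference, generated)  # 2 of 3 slot types match
```

Under these assumptions, a single wrong numeric argument costs a full slot type, which is the kind of error a surface-similarity score tends to underweight.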
The findings expose a troubling disconnect in LLM capabilities. Models achieving 0.93+ token F1 scores often generate semantically invalid programs, particularly failing on longer examples. This pattern suggests that current sequence-to-sequence approaches struggle with the compositional and event-driven nature of Scratch programs, where multiple concurrent scripts must interact coherently.
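The gap between token overlap and program behavior can be illustrated with a toy example. The sketch below computes a bag-of-words token F1 between two scripts written in an invented textual form of Scratch blocks; the serialization, the whitespace tokenization, and the `token_f1` helper are assumptions for illustration and do not reproduce the paper's evaluation setup or its numbers.

```python
from collections import Counter


def token_f1(reference: str, generated: str) -> float:
    """Bag-of-words F1 over whitespace tokens (a common simple variant)."""
    ref_tokens = Counter(reference.split())
    gen_tokens = Counter(generated.split())
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((ref_tokens & gen_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical text form of two scripts that differ only in the turn angle.
reference = "when flag clicked forever move 10 steps if touching edge then turn 15 degrees"
generated = "when flag clicked forever move 10 steps if touching edge then turn 90 degrees"

f1 = token_f1(reference, generated)  # 13 of 14 tokens overlap
```

Here 13 of the 14 tokens match, so the score is 13/14, yet the sprite would turn 90 degrees instead of 15 and behave quite differently. As scripts get longer, one wrong argument moves the token score even less.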
For AI education and programming tool developers, NL2Scratch signals both opportunity and challenge. The dataset enables building better AI-assisted coding systems for millions of student programmers, but the persistent semantic errors indicate that existing architectures require fundamental improvements to handle visual, compositional programming paradigms. The research underscores that benchmark sophistication must match task complexity.
- NL2Scratch provides 311,648 real-world block-based programming examples, filling a gap in NL-to-code research dominated by text-based languages
- The Semantic Alignment Consistency metric reveals that high token-level scores mask critical semantic failures in generated programs
- Current LLMs struggle with event-driven, concurrent program structures inherent to visual block-based environments
- Errors concentrate on operational elements like actions, conditions, and numeric arguments, exposing failure modes invisible to traditional metrics
- The benchmark enables development of AI tools for K-12 programming education, a sector with millions of potential users