Researchers improved Text2DSL, a system that automatically generates domain-specific language code from natural language, by replacing prompt-based generation with context-aware distillation using structured inputs like BNF grammars and API specifications. The enhanced approach scaled verified training data from 4,204 to 10,073 examples while maintaining 99.7% runtime accuracy, and ablation studies confirmed that vocabulary context provides the strongest semantic improvements.
This research moves automated code generation beyond simple prompt engineering toward structured knowledge integration, and it backs that shift with an empirical evaluation. Its central finding is that the way context is supplied matters most on harder problem instances. Under increased difficulty, the baseline's syntax validity fell from 97.6% to 58.5%, a drop of 39.1 percentage points, while the context-enhanced model held at 97.4%. That gap indicates structured context is a load-bearing mechanism rather than a marginal optimization.
The work builds on increasing recognition that language models benefit from explicit schema and constraint information. By incorporating BNF grammar specifications, API documentation, and closed vocabularies, the teacher model generates more coherent and verifiable code samples. The two-tier validation pipeline combining abstract syntax tree checks with runtime testing against actual system daemons ensures practical applicability beyond theoretical correctness.
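The two-tier idea can be sketched as a filter that runs a cheap syntax check first and a costlier runtime check only on samples that parse. This is a minimal illustration, not the authors' pipeline: Python's `ast.parse` stands in for a parser of the target DSL, and `stub_runtime_ok` is a hypothetical stand-in for testing a rule against a live daemon.

```python
import ast
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ValidationResult:
    sample: str
    syntax_ok: bool
    runtime_ok: bool

    @property
    def verified(self) -> bool:
        # A sample counts as verified only if it passes both tiers.
        return self.syntax_ok and self.runtime_ok


def python_syntax_ok(code: str) -> bool:
    """Tier 1 stand-in: succeed if the text parses to an AST.

    Assumption: a real pipeline would parse the target DSL, not Python.
    """
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def stub_runtime_ok(code: str) -> bool:
    """Tier 2 stand-in (hypothetical): run the sample in an empty namespace.

    Assumption: a real pipeline would load the rule into the actual
    daemon and observe its behaviour instead of calling exec().
    """
    try:
        exec(compile(code, "<sample>", "exec"), {})
        return True
    except Exception:
        return False


def validate(
    samples: Iterable[str],
    syntax_check: Callable[[str], bool],
    runtime_check: Callable[[str], bool],
) -> List[ValidationResult]:
    results = []
    for sample in samples:
        syntax_ok = syntax_check(sample)
        # Skip the expensive tier when the cheap tier already failed.
        runtime_ok = runtime_check(sample) if syntax_ok else False
        results.append(ValidationResult(sample, syntax_ok, runtime_ok))
    return results


samples = ["x = 1 + 2", "x = = 1", "x = 1 / 0"]
results = validate(samples, python_syntax_ok, stub_runtime_ok)
verified = [r.sample for r in results if r.verified]
```

Ordering the tiers this way keeps malformed samples away from the runtime environment, so the slow check is spent only on candidates that could plausibly pass.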
The Shapley-style decomposition gives practitioners actionable guidance: vocabulary contributes the most to semantic quality (an improvement of 0.198), while API specifications and BNF grammars improve structural validity by 24.7 and 22.3 percentage points, respectively. Because the context components serve distinct functions, vocabulary completeness is the priority for semantic coherence and formal specifications are the priority for syntactic reliability.
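For readers unfamiliar with Shapley-style attribution, the sketch below computes exact Shapley values for three context components by averaging each component's marginal contribution over every ordering. The scores in `PLACEHOLDER_SCORES` are invented purely for illustration and are not figures from the study. The component names and the choice of a single scalar metric are also assumptions.

```python
from itertools import permutations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values: mean marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Hypothetical metric values for every subset of context components.
# These numbers are placeholders, NOT results reported by the study.
PLACEHOLDER_SCORES = {
    frozenset(): 0.40,
    frozenset({"bnf"}): 0.55,
    frozenset({"api"}): 0.60,
    frozenset({"vocab"}): 0.50,
    frozenset({"bnf", "api"}): 0.80,
    frozenset({"bnf", "vocab"}): 0.65,
    frozenset({"api", "vocab"}): 0.70,
    frozenset({"bnf", "api", "vocab"}): 0.90,
}

attribution = shapley_values(
    ["bnf", "api", "vocab"],
    lambda subset: PLACEHOLDER_SCORES[subset],
)
```

By construction the attributions sum to the gap between the full-context score and the no-context score, which is what makes per-component figures comparable. With three components this requires evaluating all eight subsets, so exact computation stays cheap.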
The expansion to 10,073 verified examples in PolkitBench establishes a benchmark for future work. As large language models increasingly handle code generation in security-critical domains, this pattern of combining distillation, structured context, and rigorous validation offers a template for improving reliability wherever correctness matters.
- Context-aware distillation with structured inputs (BNF, API specs, vocabularies) outperforms prompt-only generation, especially on harder problems.
- The enhanced approach scaled verified training data to 10,073 examples with 99.7% runtime accuracy on Polkit security rules.
- Vocabulary context drives semantic quality improvements while formal grammar and API specifications optimize structural validity.
- The baseline approach degraded sharply under increased difficulty while structured context remained robust, confirming context is not cosmetic.
- Granular ablation studies reveal different context components serve distinct functions, enabling targeted optimization strategies.