Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation
Researchers conducted the first systematic evaluation of Large Language Models' ability to generate correct TLA+ formal specifications from natural language, testing 30 LLMs across 2,730 runs. Results show LLMs achieve only 8.6% semantic correctness despite 26.6% syntactic correctness, indicating current models cannot reliably produce formal specifications without expert oversight.
This research addresses a critical gap in understanding LLM capabilities for formal verification tasks, which have significant real-world applications in systems used by major technology companies like Amazon and Microsoft. The study's comprehensive evaluation across model families, sizes, and prompting strategies reveals fundamental limitations in how LLMs handle formal languages—a domain requiring precise semantic understanding rather than pattern matching.
The findings challenge several prevailing assumptions about LLM development. The discovery that model size does not correlate with performance, with DeepSeek's 8B variant outperforming its 70B counterpart, suggests that architectural choices and training approaches matter more than parameter count for formal language tasks. The consistent underperformance of code-specialized models points to negative transfer effects, where models optimized for conventional programming struggle with formal specification syntax and semantics. The identification of five recurring hallucination categories traceable to training data biases provides actionable insights for future model development.
For the formal verification and software engineering communities, these results temper expectations around LLM-assisted specification writing while highlighting which prompting strategies show promise. The progressive prompting approach achieving highest success rates suggests that intermediate reasoning steps improve semantic correctness. The research indicates that widespread adoption of LLM-based formal specification generation remains premature without substantial advances in model training or verification techniques.
Future work should focus on whether specialized fine-tuning on formal language datasets or novel prompting paradigms can meaningfully improve semantic correctness rates, and whether hybrid human-AI approaches could accelerate specification development while maintaining correctness guarantees.
- →LLMs achieve only 8.6% semantic correctness in generating TLA+ specifications, far below practical utility despite 26.6% syntactic correctness.
- →Model size does not predict TLA+ generation quality; smaller models with better reasoning alignment outperform larger variants.
- →Code-specialized LLMs consistently underperform due to negative transfer from mainstream programming language training.
- →Progressive prompting strategies show the highest success rates, suggesting multi-step reasoning improves formal specification correctness.
- →Hallucination patterns are traceable to specific training data biases, offering pathways for targeted model improvements.