y0news
← Feed
Back to feed
🧠 AI🔴 BearishImportance 6/10

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

arXiv – CS AI|Arslan Bisharat, Brian Ortiz, Eric Spencer, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad|
🤖AI Summary

Researchers conducted the first systematic evaluation of Large Language Models' ability to generate correct TLA+ formal specifications from natural language, testing 30 LLMs across 2,730 runs. Results show LLMs achieve only 8.6% semantic correctness despite 26.6% syntactic correctness, indicating current models cannot reliably produce formal specifications without expert oversight.

Analysis

This research addresses a critical gap in understanding LLM capabilities for formal verification tasks, which have significant real-world applications in systems used by major technology companies like Amazon and Microsoft. The study's comprehensive evaluation across model families, sizes, and prompting strategies reveals fundamental limitations in how LLMs handle formal languages—a domain requiring precise semantic understanding rather than pattern matching.

The findings challenge several prevailing assumptions about LLM development. The discovery that model size does not correlate with performance, with DeepSeek's 8B variant outperforming its 70B counterpart, suggests that architectural choices and training approaches matter more than parameter count for formal language tasks. The consistent underperformance of code-specialized models points to negative transfer effects, where models optimized for conventional programming struggle with formal specification syntax and semantics. The identification of five recurring hallucination categories traceable to training data biases provides actionable insights for future model development.

For the formal verification and software engineering communities, these results temper expectations around LLM-assisted specification writing while highlighting which prompting strategies show promise. The progressive prompting approach achieving highest success rates suggests that intermediate reasoning steps improve semantic correctness. The research indicates that widespread adoption of LLM-based formal specification generation remains premature without substantial advances in model training or verification techniques.

Future work should focus on whether specialized fine-tuning on formal language datasets or novel prompting paradigms can meaningfully improve semantic correctness rates, and whether hybrid human-AI approaches could accelerate specification development while maintaining correctness guarantees.

Key Takeaways
  • LLMs achieve only 8.6% semantic correctness in generating TLA+ specifications, far below practical utility despite 26.6% syntactic correctness.
  • Model size does not predict TLA+ generation quality; smaller models with better reasoning alignment outperform larger variants.
  • Code-specialized LLMs consistently underperform due to negative transfer from mainstream programming language training.
  • Progressive prompting strategies show the highest success rates, suggesting multi-step reasoning improves formal specification correctness.
  • Hallucination patterns are traceable to specific training data biases, offering pathways for targeted model improvements.
Read Original →via arXiv – CS AI
Act on this with AI
Stay ahead of the market.
Connect your wallet to an AI agent. It reads balances, proposes swaps and bridges across 15 chains — you keep full control of your keys.
Connect Wallet to AI →How it works
Related Articles