AI · Neutral · arXiv – CS AI · 8h ago · 6/10
PBT-Bench: Benchmarking AI Agents on Property-Based Testing
Researchers introduce PBT-Bench, a benchmark that tests whether AI agents can derive semantic invariants from documentation and build property-based testing strategies, across 100 problems drawn from Python libraries. With structured prompting, current LLMs reach 42% to 83% bug recall. The gap between models is significant, and different models fail on different problems.
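For readers unfamiliar with the technique, the following is a minimal sketch of what a property-based test looks like. It is not taken from PBT-Bench and does not reproduce its problems or harness. The run-length codec, the injected bug, and every function name are hypothetical, and a hand-rolled generator built on the standard-library `random` module stands in for a dedicated library such as Hypothesis. The invariant being checked is the round trip: decoding an encoded string should return the original string.

```python
import random


def rle_encode(text):
    """Hypothetical function under test: run-length encode into (char, count) pairs."""
    pairs = []
    for ch in text:
        if pairs and pairs[-1][0] == ch:
            pairs[-1] = (ch, pairs[-1][1] + 1)
        else:
            pairs.append((ch, 1))
    return pairs


def rle_encode_buggy(text):
    """Hypothetical injected bug: runs longer than 3 are silently capped at 3."""
    return [(ch, min(n, 3)) for ch, n in rle_encode(text)]


def rle_decode(pairs):
    """Expand (char, count) pairs back into a string."""
    return "".join(ch * n for ch, n in pairs)


def random_text(rng, max_len=20):
    # A two-letter alphabet makes long runs common, which is where the bug hides.
    return "".join(rng.choice("ab") for _ in range(rng.randint(0, max_len)))


def find_counterexample(encode, trials=500, seed=0):
    """Return an input that violates decode(encode(x)) == x, or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        text = random_text(rng)
        if rle_decode(encode(text)) != text:
            return text
    return None


if __name__ == "__main__":
    print("correct implementation:", find_counterexample(rle_encode))
    print("buggy implementation:  ", repr(find_counterexample(rle_encode_buggy)))
```

In this toy setting, "bug recall" would correspond to the fraction of injected bugs for which the chosen properties and generators surface at least one counterexample. Note how the input generator matters as much as the property: with a large alphabet, runs of four identical characters would be rare and the same property could miss the bug.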