
PBT-Bench: Benchmarking AI Agents on Property-Based Testing

arXiv – CS AI | Lucas Jing, Xinqi Wang, Liao Zhang, Simon S. Du | June 2, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce PBT-Bench, a benchmark that tests whether AI agents can derive semantic invariants from documentation and construct property-based testing strategies across 100 problems in Python libraries. With structured prompting, current LLMs achieve bug recall between 42.1% and 83.4%, and different models fail on different problems.

Analysis

PBT-Bench addresses a gap in AI evaluation methodology by isolating property-based testing, a specialized skill that requires an agent to understand semantic documentation, identify invariants, and generate targeted input strategies. Unlike existing benchmarks that measure basic bug reproduction or patching, this work tests whether AI systems can perform the reasoning of an expert software tester, where finding edge cases demands deep comprehension of the library. The benchmark's 365 bugs are stratified across difficulty levels, and current LLMs struggle consistently with stateful, cross-function violations, which suggests limits in how they reason about protocol semantics.

Hypothesis scaffolding, structured prompting that guides agents toward framework-specific syntax, yields gains of more than 20 percentage points for mid-tier models but inconsistent results for stronger models, sometimes degrading performance. This counterintuitive finding suggests that rigid prompting templates may constrain model reasoning rather than enhance it.

The most important implication is that different models fail on different problems and no model is universally superior, which suggests that property-based testing is a genuinely difficult reasoning task that current models handle unevenly. For developers and AI researchers, the work offers a more rigorous evaluation framework aimed at real-world testing expertise. The release of the full evaluation corpus enables downstream optimization of prompting strategies and model selection for automated software verification tasks.
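To make the skill being tested concrete, here is a minimal sketch of a property-based test. It is not taken from the benchmark: the run-length codec, the injected bug, and every function name are hypothetical, and the standard library's `random` module stands in for a Hypothesis strategy so the example has no dependencies. The documented invariant is a round trip (decoding an encoding returns the original text), and the input strategy is targeted at long runs, where the bug hides.

```python
import random


def rle_encode(text):
    """Hypothetical reference encoder: 'aaab' -> [('a', 3), ('b', 1)]."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def rle_encode_buggy(text):
    """Hypothetical injected bug: run lengths are silently capped at 3."""
    return [(ch, min(n, 3)) for ch, n in rle_encode(text)]


def rle_decode(runs):
    """Expand (character, count) pairs back into a string."""
    return "".join(ch * n for ch, n in runs)


def gen_text(rng):
    """Targeted input strategy: build text from runs of length 1 to 6
    over a two-letter alphabet, so long runs appear often."""
    return "".join(
        rng.choice("ab") * rng.randint(1, 6) for _ in range(rng.randint(0, 5))
    )


def find_counterexample(encode, trials=200, seed=0):
    """Check the round-trip invariant on generated inputs.

    Returns the first input that violates decode(encode(x)) == x,
    or None if every trial passes.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        text = gen_text(rng)
        if rle_decode(encode(text)) != text:
            return text
    return None


if __name__ == "__main__":
    print("reference encoder:", find_counterexample(rle_encode))
    print("buggy encoder:", find_counterexample(rle_encode_buggy))
```

A strategy that only produced short or non-repeating strings would never trigger the capped-length bug, which is why choosing the input distribution is part of the task and not just writing the assertion. A real Hypothesis test would express the same idea with `@given` and a strategy from `hypothesis.strategies`, and would also shrink the failing input.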

Key Takeaways

→ PBT-Bench evaluates AI agents on deriving semantic invariants from documentation and building property-based test strategies across 100 curated problems with 365 injected bugs.
→ Structured Hypothesis prompting improves mid-capability model performance by over 20 percentage points but has inconsistent effects on stronger models, sometimes degrading results.
→ Bug recall ranges from 42.1% to 83.4% across eight contemporary LLMs, and the hardest bugs are model-specific, so no single model closes all performance gaps.
→ The benchmark reveals that property-based testing is a complex reasoning task requiring deep understanding of library semantics and protocol correctness, not just syntax.
→ The released benchmark and evaluation corpus enable future research on documentation-grounded semantic reasoning and automated software verification optimization.
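The recall figures above can be read through a small sketch. It assumes bug recall means the fraction of injected bugs that a model's tests detect, which the summary does not spell out. The model names, bug IDs, and detection sets are invented for illustration and are not results from the paper.

```python
def bug_recall(detected, total_bugs):
    """Fraction of injected bugs detected (assumed definition of recall)."""
    return len(detected) / total_bugs


# Hypothetical data: 10 injected bugs, identified by the integers 1..10.
TOTAL_BUGS = 10
detected_by_model = {
    "model_a": {1, 2, 3, 4, 5, 6},
    "model_b": {1, 2, 3, 7, 8},
}

# Per-model recall.
recall = {
    name: bug_recall(found, TOTAL_BUGS)
    for name, found in detected_by_model.items()
}

# Bugs found by at least one model, and bugs missed by every model.
found_by_any = set().union(*detected_by_model.values())
missed_by_all = set(range(1, TOTAL_BUGS + 1)) - found_by_any

# Bugs only one model finds: the "model-specific" cases.
only_a = detected_by_model["model_a"] - detected_by_model["model_b"]
only_b = detected_by_model["model_b"] - detected_by_model["model_a"]

if __name__ == "__main__":
    print("per-model recall:", recall)
    print("combined recall:", bug_recall(found_by_any, TOTAL_BUGS))
    print("found only by model_a:", sorted(only_a))
    print("found only by model_b:", sorted(only_b))
    print("missed by all:", sorted(missed_by_all))
```

In this toy data the combined recall exceeds either model's own recall because each model catches bugs the other misses, which is the pattern the takeaways describe when they say the hardest bugs are model-specific.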