AI · Neutral · arXiv – CS AI · 6h ago · 6/10
QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation
Researchers introduce QMFOL, an automated framework that generates logical reasoning benchmarks of controlled complexity for evaluating large language models. The resulting QMFOLBench dataset of 2,880 instances shows that LLM reasoning performance degrades significantly as logical complexity increases, and that models show a consistent bias toward true-labeled tasks over false or unknown ones.