AI · Bearish · arXiv – CS AI · 8h ago · 7/10
HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs
Researchers introduce HOLMES, a new benchmark for evaluating higher-order logical reasoning in large language models, revealing that current LLMs struggle significantly with complex symbolic reasoning tasks that go beyond simple first-order logic. The benchmark demonstrates critical gaps in AI reliability, with the best-performing models achieving only 59.54% accuracy on tasks involving reasoning over rules, predicates, and constraints across legal and financial domains.