How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
A new empirical study evaluates how Large Language Models (LLMs) perform on the Equivalence Class Problem, a conceptually simple yet computationally demanding long-chain reasoning task. The study finds that non-reasoning LLMs fail the task entirely, while reasoning-capable models perform substantially better yet still fall short of full accuracy, with performance patterns that vary according to the problem's complexity metrics.
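To make the task concrete: the summary does not spell out the paper's exact formulation, but a standard instance of the Equivalence Class Problem is, given a set of pairwise equivalences over n items, to decide whether two items fall in the same class. A program solves this in near-linear time with a union-find (disjoint-set) structure; an LLM answering in text must instead chain the merges step by step, which is what makes the task a long-chain reasoning benchmark. The sketch below (function names are illustrative, not from the paper) shows the classic algorithm:

```python
def find(parent, x):
    # Follow parent pointers to the class representative,
    # compressing the path along the way.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def same_class(n, pairs, a, b):
    """Return True if items a and b are equivalent under the given pairs."""
    parent = list(range(n))  # each item starts in its own class
    for u, v in pairs:
        parent[find(parent, u)] = find(parent, v)  # merge the two classes
    return find(parent, a) == find(parent, b)

# The chain (0~1), (1~2), (2~3) links 0 through 3; item 4 stays isolated.
print(same_class(5, [(0, 1), (1, 2), (2, 3)], 0, 3))  # True
print(same_class(5, [(0, 1), (1, 2), (2, 3)], 0, 4))  # False
```

The point of the benchmark is that while this is trivial for a machine, tracking the transitive closure of many such merges purely in natural-language reasoning requires a long, error-intolerant chain of steps.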