AIBearish · arXiv · CS AI · 7h ago · 6/10
🧠
BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models
Researchers introduced BrainBench, a benchmark of 100 brainteaser questions that reveals significant gaps in commonsense reasoning among leading LLMs. Even the best-performing model, Claude Opus 4.6, reached only 80.3% accuracy, while GPT-4o scored just 39.7%, pointing to reasoning deficits that persist across frontier AI models.
🧠 GPT-4 🧠 Claude 🧠 Opus