AI · Neutral · arXiv – CS AI · 14h ago · 7/10
General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks
Researchers introduce General365, a benchmark showing that leading LLMs achieve only 62.8% accuracy on general reasoning tasks despite excelling at domain-specific ones. The findings highlight a critical gap: current AI models rely heavily on specialized knowledge rather than on robust, transferable reasoning that applies to real-world scenarios.