y0news

#gpt-evaluation News & Analysis

2 articles tagged with #gpt-evaluation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Researchers introduce General365, a benchmark revealing that leading LLMs achieve only 62.8% accuracy on general reasoning tasks despite excelling at domain-specific tasks. The findings highlight a critical gap: current AI models rely heavily on specialized knowledge rather than developing robust, transferable reasoning capabilities applicable to real-world scenarios.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

Researchers introduce SocialGrid, a benchmark environment for evaluating Large Language Models as autonomous agents in multi-agent social scenarios. The study finds that even the most capable open-source LLMs achieve below 60% task completion and struggle significantly with social reasoning tasks such as detecting deception, exposing critical limitations in current AI agent capabilities.