#gpt-evaluation News & Analysis

3 articles tagged with #gpt-evaluation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Researchers introduce General365, a benchmark revealing that leading LLMs achieve only 62.8% accuracy on general reasoning tasks despite excelling at domain-specific ones. The findings highlight a critical gap: current AI models rely heavily on specialized knowledge rather than developing robust, transferable reasoning capabilities applicable to real-world scenarios.

AI · Bearish · arXiv – CS AI · Jun 9 · 6/10

Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses

Researchers evaluated how large language models (GPT and Grok) perform at grading graduate-level research reports, finding significant inconsistencies both within individual models and between different models. The study reveals that interaction history causes models to systematically drift from human grading standards, raising concerns about fairness in automated academic assessment.

🧠 Grok

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

Researchers introduce SocialGrid, a benchmark environment for evaluating large language models (LLMs) as autonomous agents in multi-agent social scenarios. The study reveals that even the most capable open-source LLMs achieve below 60% task completion and struggle with social reasoning tasks such as detecting deception, exposing critical limitations in current AI agent capabilities.