AINeutralarXiv – CS AI · 7h ago6/10
🧠
GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models
Researchers introduced GeoNatureAgent Benchmark, the first evaluation framework for AI agents performing environmental geospatial analysis through real API interactions. Testing seven major LLMs across 93 tasks, Claude Sonnet 4 achieved 60.8% accuracy while DeepSeek V3.2 delivered 93% of Claude's capability at 11x lower cost, revealing significant performance gaps in structured reasoning tasks.
🧠 Claude🧠 Sonnet🧠 Gemini