Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction
Researchers introduce Oracle, a benchmark that evaluates LLM reasoning through black-box environment interaction: models must deduce hidden functions by actively probing an unknown system. An evaluation of 19 models shows OpenAI's o3 leading overall yet still struggling on complex tasks, exposing a weakness shared across the tested models: LLMs lack the strategic planning needed to refine hypotheses efficiently.
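The core interaction can be pictured as a probe-hypothesize-eliminate loop: query the black box, observe the output, and discard every hypothesis the observation contradicts. The sketch below is a minimal illustration of that idea under assumed details, not the paper's actual harness; the hidden rule, the candidate set, and the probe budget are all hypothetical stand-ins.

```python
import random

def hidden_function(x: int) -> int:
    """Black-box rule the agent must deduce (hypothetical example)."""
    return 3 * x + 1

# Candidate hypotheses the agent might entertain (hypothetical set).
CANDIDATES = {
    "x + 1": lambda x: x + 1,
    "2x": lambda x: 2 * x,
    "3x + 1": lambda x: 3 * x + 1,
    "x^2": lambda x: x * x,
}

def deduce(budget: int = 5) -> str:
    """Probe the black box up to `budget` times, keep only hypotheses
    consistent with every observation, and return a survivor."""
    surviving = dict(CANDIDATES)
    for _ in range(budget):
        x = random.randint(-10, 10)   # choose a probe input
        y = hidden_function(x)        # observe the black-box output
        # Eliminate hypotheses contradicted by the new observation.
        surviving = {name: f for name, f in surviving.items() if f(x) == y}
        if len(surviving) == 1:
            break
    return next(iter(surviving), "no consistent hypothesis")

if __name__ == "__main__":
    print(deduce())  # typically prints "3x + 1"
```

The benchmark's finding concerns the probe-selection step: rather than picking inputs at random as above, an efficient solver would choose the query that best discriminates among the surviving hypotheses, and it is this kind of strategic planning that the evaluated models reportedly lack.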