GroundAct: Can LLM Agents Ground Actions in Environmental States?
Researchers introduce GroundAct, a benchmark revealing that LLM agents fail dramatically when task feasibility depends on environmental context rather than explicit instructions, dropping from 85-96% to 29-53% success rates. The study identifies action grounding—inferring feasibility from environmental state—as a fundamental capability gap that scaling alone cannot solve.
The GroundAct research exposes a critical weakness in current LLM agent architectures: the inability to dynamically assess whether an action is viable given real-world constraints. While large language models excel at following explicit instructions, they struggle when required to reason about environmental prerequisites, resource limitations, and the feasibility of coordinated actions. This distinction matters because real-world applications, from robotics to autonomous systems, routinely operate in settings where instructions cannot enumerate every contingency.
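To make the distinction above concrete, here is a minimal Python sketch of what checking an action against environmental state could look like. It is an illustration only, not the benchmark's implementation: the `Action` class, the fact names, and the kettle scenario are all hypothetical and do not come from GroundAct.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Action:
    """A hypothetical action plus the state facts it needs before it can run."""
    name: str
    preconditions: set[str] = field(default_factory=set)


def missing_preconditions(action: Action, state: set[str]) -> set[str]:
    """Return the preconditions that the current environment state does not satisfy."""
    return action.preconditions - state


def is_feasible(action: Action, state: set[str]) -> bool:
    """An action is feasible only if every one of its preconditions holds in the state."""
    return not missing_preconditions(action, state)


# Hypothetical scenario: the instruction says "boil water" and never mentions
# that the kettle is unplugged. That fact exists only in the environment state.
state = {"kettle_present", "water_available"}
boil = Action(
    "boil_water",
    {"kettle_present", "water_available", "kettle_plugged_in"},
)

print(is_feasible(boil, state))
print(missing_preconditions(boil, state))
```

An agent that only follows the instruction would attempt `boil_water` anyway; an agent that grounds the action would notice the missing `kettle_plugged_in` fact and either fix it first or report the task as infeasible.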
Spanning 11 domains and 16,592 task instances, the benchmark shows that action grounding is a multidimensional problem. The research identifies three distinct reasoning dimensions: attribute reasoning (understanding object properties), tool reasoning (knowing what tools accomplish), and coordination reasoning (recognizing dependencies between agents). Notably, these dimensions are only weakly correlated, meaning a model's strength in one does not predict its performance in the others, which complicates any simple scaling solution.
The most striking finding involves fine-tuning outcomes: with supervised fine-tuning, Qwen2.5-3B jumped from 0.6% to 76.3% on direct commands, yet improved only from 1.5% to 5.5% on implicit collaboration tasks. This asymmetry suggests that learning to ground actions requires training approaches fundamentally different from standard supervised fine-tuning or parameter scaling. The research also demonstrates that providing complete environmental graphs eliminates up to 27.6% of errors in tool-use tasks, indicating that the bottleneck involves constraint filtering rather than search capabilities.
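The size of that asymmetry is easier to see in percentage points. The short Python sketch below uses only the four Qwen2.5-3B figures quoted above; the helper name `gain` is introduced here for illustration.

```python
def gain(before: float, after: float) -> float:
    """Absolute improvement in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


# Success rates (%) before and after supervised fine-tuning, as reported above.
direct = gain(0.6, 76.3)     # direct commands
implicit = gain(1.5, 5.5)    # implicit collaboration tasks

print(direct)                       # 75.7
print(implicit)                     # 4.0
print(round(direct / implicit, 1))  # 18.9
```

The same fine-tuning procedure therefore yields a gain on direct commands roughly 19 times larger than the gain on implicit collaboration tasks.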
These findings challenge the dominant assumption that bigger models solve harder reasoning problems. Instead, action grounding emerges as a tractable but distinct capability requiring architectural or training innovations beyond current approaches.
- LLM agents drop from 85-96% to 29-53% success when action feasibility depends on unstated environmental context rather than explicit instructions
- Action grounding comprises three weakly correlated dimensions (attribute, tool, and coordination reasoning), each requiring different optimization strategies
- Supervised fine-tuning dramatically improves explicit command performance but yields minimal gains on implicit coordination tasks, suggesting a limitation that fine-tuning alone does not overcome
- Complete environmental state representation can eliminate up to 27.6% of errors in tool-use tasks, separating search-bound problems from constraint-filtering bottlenecks
- Current scaling approaches alone cannot solve action grounding; new training paradigms or architectural innovations are required