TeamBench: Evaluating Agent Coordination under Enforced Role Separation
TeamBench is a new benchmark that evaluates multi-agent AI coordination under enforced role separation, revealing that prompt-only instructions fail to prevent role violations and that agent teams often underperform single agents on tasks the models already solve well. The study demonstrates that pass rates alone can mask coordination failures and misaligned team dynamics.
TeamBench addresses a critical gap in evaluating AI agent systems: whether multiple agents actually coordinate effectively, or whether role boundaries hold only because prompts ask agents to respect them. Traditional benchmarks measure task completion without verifying that role separation is maintained, potentially inflating success metrics. This research uses operating-system-enforced access controls to separate planning, execution, and verification phases, preventing agents from circumventing their assigned boundaries.
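As a rough illustration of that enforcement idea, the sketch below uses filesystem permissions to make only the active role's directory writable during each phase. The directory layout, the chmod-based mechanism, and the single-process setup are assumptions for illustration, not the benchmark's actual implementation.

```python
import os
import stat
from pathlib import Path

# Hypothetical workspace layout; names are illustrative, not TeamBench's.
WORKSPACE = Path("/tmp/teambench_task")
PLAN_DIR = WORKSPACE / "plan"      # planner output
CODE_DIR = WORKSPACE / "code"      # executor output
REVIEW_DIR = WORKSPACE / "review"  # verifier output

def set_phase(writable: Path, frozen: list[Path]) -> None:
    """Grant write access to the active role's directory and freeze the rest.

    A production setup would run each agent as a separate OS user or container
    so the agent process cannot simply chmod the directory back.
    """
    for d in [writable, *frozen]:
        d.mkdir(parents=True, exist_ok=True)
    os.chmod(writable, stat.S_IRWXU)              # read/write/traverse
    for d in frozen:
        os.chmod(d, stat.S_IRUSR | stat.S_IXUSR)  # read/traverse only

# Execution phase: the executor may write code, but the plan is frozen.
set_phase(CODE_DIR, [PLAN_DIR, REVIEW_DIR])

# Verification phase: the verifier writes its report but cannot touch executor
# code, so "verifier edits executor files" fails at the filesystem level.
set_phase(REVIEW_DIR, [PLAN_DIR, CODE_DIR])
```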
The findings challenge assumptions about multi-agent superiority. While prompt-only and sandbox-enforced teams achieved similar pass rates, the former produced 3.6 times more role violations, cases in which verifiers attempted to edit executor code directly. Most strikingly, verifiers approved 49% of submissions that failed objective grading, suggesting rubber-stamping rather than genuine verification. Ablation studies indicate that removing the verifier role improved performance on partial-credit tasks, implying that unnecessary coordination overhead can degrade outcomes.
For the AI development community, these results highlight that agent team architecture requires careful design beyond role specification. Teams only outperformed single agents when individual models struggled; in domains where baseline performance was strong, collaboration became friction. The human study provided crucial validation: interaction patterns invisible to pass-rate metrics become apparent under enforced separation, and humans paired with agents often collapsed into quick approval rather than meaningful collaboration.
The implications extend to deploying multi-agent systems in production. Organizations adopting agent teams should rely on technical enforcement mechanisms rather than instructions alone, establish role-specific evaluation metrics beyond task completion, and recognize that teams are conditional tools, not universal improvements. Future work should investigate optimal team sizes, role granularity, and communication protocols that balance coordination benefits against interaction costs.
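As one way to make "role-specific evaluation metrics" concrete, the sketch below computes a verifier false-approval rate and a role-violation rate alongside the ordinary pass rate. The Episode fields and the team_metrics helper are hypothetical, not TeamBench's logging schema.

```python
from dataclasses import dataclass

# Illustrative episode record; field names and metric definitions are
# assumptions for this sketch, not the benchmark's actual schema.
@dataclass
class Episode:
    passed: bool              # objective grading outcome
    verifier_approved: bool   # did the verifier sign off on the submission?
    role_violations: int      # e.g. verifier write attempts on executor files

def team_metrics(episodes: list[Episode]) -> dict[str, float]:
    """Report role-specific signals alongside the usual pass rate."""
    n = len(episodes)
    failing = [e for e in episodes if not e.passed]
    return {
        "pass_rate": sum(e.passed for e in episodes) / n,
        # Share of objectively failing submissions the verifier still approved;
        # a high value suggests rubber-stamping rather than real verification.
        "false_approval_rate": (
            sum(e.verifier_approved for e in failing) / len(failing)
            if failing else 0.0
        ),
        "violations_per_episode": sum(e.role_violations for e in episodes) / n,
    }

# Toy usage with made-up episodes.
log = [
    Episode(passed=True, verifier_approved=True, role_violations=0),
    Episode(passed=False, verifier_approved=True, role_violations=2),
    Episode(passed=False, verifier_approved=False, role_violations=1),
]
print(team_metrics(log))
```

Tracking the false-approval rate separately is what surfaces the rubber-stamping pattern that an aggregate pass rate hides.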
- Prompt-only role specification fails to prevent agents from violating assigned roles; keeping roles separate requires technical enforcement through access controls.
- Multi-agent teams showed no statistically significant improvement over single agents in pass rates despite their added complexity, masking coordination failures.
- Verifiers approved nearly half of objectively failing submissions, indicating approval bias rather than genuine quality assurance.
- Agent teams only improve performance when single agents struggle; they degrade outcomes when baseline performance is already strong.
- Human studies revealed that role separation exposes interaction patterns invisible to traditional metrics, such as premature approval and information silos.