CollabSkill: Evaluating Human-Agent Collaboration On Real-World Tasks
Researchers introduce CollabSkill, a framework for evaluating how AI agents perform when collaborating with real human workers on occupational tasks. Using data from 93 workers across 386 sessions, the study finds that Claude Code outperforms Codex in practical collaboration, a result that diverges from autonomous benchmark rankings, and identifies hands-on experience as the primary driver of effective human-AI teamwork.
CollabSkill addresses a critical gap in AI evaluation methodology by moving beyond isolated agent performance metrics to assess real-world human-agent collaboration. Traditional benchmarks measure autonomous AI capability, but workplace deployment requires agents that augment human workers while preserving their agency and delivering economic value. This study shifts the focus to practical collaboration dynamics, where capabilities other than pure task completion matter.
The research framework employs a Bayesian skill rating system to separate the individual contributions of humans and agents, accounting for worker variability that previous studies ignored. Drawing on 1,500+ prompts from workers across occupational backgrounds, CollabSkill captures authentic usage patterns rather than idealized scenarios. The finding that Claude Code ranks first in collaborative settings while Codex leads autonomous benchmarks suggests that agent design involves trade-offs between standalone performance and human-compatible interfaces.
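To make the idea of separating human and agent contributions concrete, here is a minimal sketch of one possible Bayesian rating scheme. It is not the study's actual model, whose details are not given here. It assumes each session score is the sum of a worker skill, an agent skill, and Gaussian noise, and it keeps an independent Gaussian belief for every worker and agent. The names `Belief`, `update`, and `rate`, the noise variance, and the session data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Belief:
    """Gaussian belief over one latent skill (assumed prior: mean 0, variance 1)."""
    mean: float = 0.0
    var: float = 1.0


def update(human: Belief, agent: Belief, score: float, noise_var: float = 0.25) -> None:
    """Update both beliefs in place from a single session score.

    Assumed model: score = human_skill + agent_skill + Gaussian noise.
    Each belief receives its exact Gaussian marginal update; the
    correlation between the two skills is dropped (a common approximation).
    """
    total_var = human.var + agent.var + noise_var
    residual = score - (human.mean + agent.mean)
    for belief in (human, agent):
        gain = belief.var / total_var      # share of the surprise credited to this party
        belief.mean += gain * residual
        belief.var -= gain * belief.var    # uncertainty shrinks after every observation


def rate(sessions, noise_var: float = 0.25):
    """Process (worker, agent, score) tuples in order and return the beliefs."""
    humans, agents = {}, {}
    for worker, agent, score in sessions:
        h = humans.setdefault(worker, Belief())
        a = agents.setdefault(agent, Belief())
        update(h, a, score, noise_var)
    return humans, agents


# Made-up sessions: two hypothetical workers, two hypothetical agents.
sessions = [
    ("w1", "agent_a", 1.2),
    ("w1", "agent_b", 0.4),
    ("w2", "agent_a", 0.3),
    ("w2", "agent_b", -0.5),
]
humans, agents = rate(sessions)
for name, belief in sorted(agents.items()):
    print(f"{name}: mean={belief.mean:.3f}, var={belief.var:.3f}")
```

Because every worker in this toy data uses both agents, the differences in worker skill are absorbed by the worker beliefs instead of being credited to whichever agent the stronger worker happened to use. That is the kind of worker variability the rating system is described as accounting for.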
For the AI industry, CollabSkill provides evidence that collaboration skill differs fundamentally from raw capability. This has immediate implications for enterprise AI adoption decisions, where worker productivity and skill development matter as much as benchmark performance. The observation that practical collaboration experience significantly improves worker AI literacy suggests that economic value extends beyond task completion to workforce capability building.
The framework establishes a new evaluation standard that should influence how developers optimize agents for workplace integration. Organizations evaluating AI tools now have methodological grounding to assess true productivity gains rather than relying on benchmark scores. Future development will likely emphasize collaborative UX design and interpretability features that support human workers, rather than pure autonomous performance maximization.
- Claude Code outperforms Codex in human-AI collaboration scenarios even though Codex leads autonomous benchmarks, indicating that collaboration requires distinct capabilities.
- Practical hands-on experience emerges as the strongest predictor of worker collaboration skill, not pre-existing technical expertise.
- CollabSkill's Bayesian rating system successfully disentangles human and agent contributions, enabling fair evaluation across variable worker populations.
- Workers demonstrate measurable improvements in AI literacy through collaboration, suggesting augmentation creates secondary productivity gains.
- Real-world collaboration benchmarks diverge substantially from autonomous agent metrics, challenging how the industry prioritizes AI capability development.