TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework
Researchers introduced TensorBench, a 199-task benchmark for evaluating coding agents on a PyTorch-based tensor framework. It addresses the trade-off between task difficulty and evaluation reliability in repository-level coding benchmarks. Tests of seven frontier AI models showed wide performance variation, with pass rates ranging from 22.1% to 64.8%, suggesting distinct strengths across different coding agent architectures.
TensorBench represents a meaningful advancement in how the AI research community evaluates coding capabilities at scale. Rather than relying on small, isolated coding tasks that fail to capture real-world complexity, the benchmark leverages an existing compiler-based tensor framework with comprehensive test coverage, enabling automated grading through the framework's regression test suite. This approach sidesteps the human review bottleneck that traditionally limits benchmarking scope while maintaining task difficulty.
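The idea of grading through a regression test suite can be illustrated with a minimal sketch. It assumes each task is graded by running one test command and treating a zero exit code as a pass. The names `grade_task` and `pass_rate` and the stand-in commands are hypothetical and are not taken from TensorBench itself.

```python
import subprocess
import sys
from typing import Sequence


def grade_task(test_command: Sequence[str], timeout_s: float = 600.0) -> bool:
    """Run a test command and report whether it passed.

    Assumption: exit code 0 means the regression tests passed, and a
    timeout counts as a failure. A real harness would also isolate the
    run (container, fresh checkout), which this sketch does not do.
    """
    try:
        result = subprocess.run(
            list(test_command),
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of graded tasks that passed."""
    if not outcomes:
        raise ValueError("no outcomes to aggregate")
    return sum(outcomes) / len(outcomes)


# Stand-in commands: one "test suite" that succeeds and one that fails.
passing_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
failing_cmd = [sys.executable, "-c", "raise SystemExit(1)"]

outcomes = [grade_task(passing_cmd), grade_task(failing_cmd)]
print(pass_rate(outcomes))
```

In practice the stand-in commands would be replaced by whatever invokes the framework's own test runner on the agent's patched checkout.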
The benchmark's design reflects growing recognition that frontier models exhibit uneven performance across seemingly similar tasks. Cohen's kappa measures agreement between two raters beyond what chance would produce: 1 means perfect agreement and values near 0 mean agreement no better than chance. The low pairwise kappa values reported here (ranging from -0.07 to 0.43) indicate that different agents excel at different problem types, suggesting no single architecture has achieved general coding competency. Even the strongest performer achieves only a 64.8% pass rate, revealing substantial gaps in reasoning about compiler internals, optimization passes, and IR transformations.
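To make the agreement statistic concrete, here is a minimal sketch of pairwise Cohen's kappa for two agents' pass/fail outcomes on the same tasks, using the standard definition kappa = (p_o - p_e) / (1 - p_e). The outcome vectors are invented toy data for illustration only and are not TensorBench results.

```python
import math
from typing import Sequence


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two binary (pass/fail) outcome vectors.

    p_o is the observed agreement rate; p_e is the agreement expected
    by chance if the two agents passed tasks independently at their
    own pass rates.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("outcome vectors must be non-empty and equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    rate_a = sum(a) / n
    rate_b = sum(b) / n
    p_e = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_e == 1.0:
        # Both agents are constant and identical; kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1 - p_e)


# Toy data (hypothetical): True means the agent passed that task.
agent_a = [True, True, True, False, False, False, True, False]
agent_b = [True, False, True, False, True, False, False, False]

print(cohens_kappa(agent_a, agent_b))  # 0.25
```

In this toy case the agents agree on 5 of 8 tasks (p_o = 0.625), while chance alone would give p_e = 0.5, so kappa is 0.25. A negative value, such as the -0.07 at the low end of the reported range, means two agents agree slightly less often than chance would predict.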
For the broader AI industry, TensorBench provides a replicable framework for rigorously evaluating coding agents, in contrast to marketing-oriented benchmarks. Its 199 tasks span sparse formats, optimization passes, and runtime components, which are specialized domains where humans struggle to review solutions quickly. This infrastructure matters because it enables researchers to identify which architectural choices, training data, or reasoning patterns improve performance on complex, real-world codebases.
The divergent performance across agents suggests continued opportunities for architectural improvements. As coding agents become more critical for software development, benchmarks like TensorBench establish objective performance baselines that guide model development priorities and help organizations assess which agents to trust with production systems.
- TensorBench introduces automated grading for complex coding tasks using an open-source tensor framework's existing test suite, enabling scalable evaluation without human review bottlenecks.
- The strongest coding agent achieved only a 64.8% pass rate, indicating substantial gaps in reasoning about compiler internals and optimization transformations across all evaluated models.
- Low inter-agent agreement (Cohen's kappa up to 0.43) reveals agents succeed on different task subsets, suggesting no architecture has achieved general coding competency yet.
- The benchmark spans specialized domains including sparse tensor formats, IR transformations, and scheduler changes that represent realistic compiler engineering challenges.
- Automated regression testing combined with agent-added test cases provides more reliable evaluation than human review while maintaining task difficulty and real-world applicability.