TML-Bench: Benchmark for Data Science Agents on Tabular ML Tasks
🤖AI Summary
Researchers introduced TML-Bench, a new benchmark for evaluating AI coding agents on tabular machine learning tasks similar to Kaggle competitions. The study tested 10 open-source language models across four competitions with different time budgets, finding that MiniMax-M2.1 achieved the best overall performance.
Key Takeaways
- TML-Bench provides a standardized way to evaluate AI agents on data science tasks with real-world time constraints.
- MiniMax-M2.1 outperformed other open-source language models across all four Kaggle-style competitions tested.
- Performance generally improved with longer time budgets (240s, 600s, 1200s), though scaling varied by model.
- Success rates and run-to-run variability were measured alongside median performance for comprehensive evaluation.
- The benchmark focuses on end-to-end correctness and practical reliability rather than just code generation quality.
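The aggregation described above — reporting median performance alongside success rate and run-to-run variability — can be sketched as a small helper. This is an illustrative reconstruction, not TML-Bench's actual code; the function name and the convention of marking failed runs with `None` are assumptions.

```python
from statistics import median, stdev

def aggregate_runs(scores):
    """Summarize repeated agent runs on one task.

    scores: one competition score per run; None marks a failed run
    (e.g. the agent produced no valid submission in the time budget).
    Returns (median score, success rate, run-to-run std dev), where
    median and std dev are computed over successful runs only.
    """
    ok = [s for s in scores if s is not None]
    success_rate = len(ok) / len(scores)
    med = median(ok) if ok else None
    spread = stdev(ok) if len(ok) > 1 else 0.0
    return med, success_rate, spread
```

Reporting the median rather than the mean keeps one lucky or unlucky run from dominating the headline number, while the success rate and spread capture the practical reliability the benchmark emphasizes.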
#artificial-intelligence #machine-learning #benchmark #data-science #automation #llm #tabular-data #kaggle #research #open-source
Read Original via arXiv – CS AI