arXiv – CS AI · 6h ago
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
Researchers introduce DARE-bench, a new benchmark of 6,300 Kaggle-derived tasks for evaluating large language models on data science and machine learning work. The benchmark reveals that even advanced models like GPT-4-mini struggle with ML modeling tasks, while fine-tuning on DARE-bench data can improve model accuracy by up to 8x.