🧠 AI | 🟢 Bullish | Importance 6/10

HARP: Efficient Data Selection for Finetuning Large Language Models

arXiv – CS AI | Ning Wang, Zhengxin Zhang, Maosen Tang, Yitang Gao, Claire Cardie, Sainyam Galhotra | June 9, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce HARP (Hierarchical Active Region Pruning), a training-efficient method for choosing which data to use when finetuning large language models. Rather than exhaustively training on all candidate data, HARP organizes the training pool hierarchically, evaluates only representative subsets, and uses Bayesian inference to estimate the utility of the rest. The reported result is a roughly 7x reduction in computational cost with equal or better model performance.

Analysis

HARP addresses a fundamental challenge in modern machine learning: finetuning large language models requires careful data selection, but measuring which training examples matter most typically demands expensive repeated training cycles. The paper presents a middle ground between train-free selectors, which are fast but potentially suboptimal, and train-based approaches, which are more accurate but slower. By organizing the training pool hierarchically and evaluating only representative leaves, HARP dramatically reduces the number of train-evaluate iterations needed while staying aligned with downstream objectives. This matters because finetuning costs are a significant operational burden for organizations deploying customized LLMs.

The method uses empirical Bayes posteriors to infer the utility of unevaluated data points, and it offers two complementary selection strategies, conservative redundancy control and additive region rewards, that together optimize data quality. Gains of up to 8.9 points with roughly 7x fewer training examples suggest that the approach prioritizes high-value examples efficiently. Theoretical guarantees under local smoothness conditions add credibility to the empirical findings.

For AI practitioners and organizations running multiple finetuning operations, this research bears directly on operational efficiency and cost: model developers can achieve better performance with smaller, more strategically selected datasets. Because the work connects academic advances with practical deployment concerns, it is relevant to both research and production teams scaling LLM customization.
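
To make the utility-inference idea in the analysis concrete, here is a minimal sketch of an empirical Bayes estimate, assuming a simple normal-normal model. It is an illustration only, not the paper's implementation: the function name `empirical_bayes_posteriors`, the leaf labels, the utility scores, and the shared `noise_var` are all hypothetical.

```python
from statistics import mean, variance


def empirical_bayes_posteriors(observed, noise_var, all_leaves):
    """Return {leaf: (posterior_mean, posterior_var)} under a normal-normal model.

    observed   -- hypothetical measured utilities for the evaluated leaves
    noise_var  -- assumed variance of the measurement noise (shared by all leaves)
    all_leaves -- every leaf in the pool, evaluated or not
    """
    scores = list(observed.values())
    prior_mean = mean(scores)
    # Method of moments: observed spread = prior variance + noise variance.
    prior_var = max(variance(scores) - noise_var, 0.0)

    total = prior_var + noise_var
    # Weight placed on a leaf's own measurement versus the pooled prior mean.
    weight = prior_var / total if total > 0 else 0.0

    posteriors = {}
    for leaf in all_leaves:
        if leaf in observed:
            post_mean = weight * observed[leaf] + (1 - weight) * prior_mean
            posteriors[leaf] = (post_mean, weight * noise_var)
        else:
            # No measurement: fall back to the prior fitted on evaluated leaves.
            posteriors[leaf] = (prior_mean, prior_var)
    return posteriors


# Hypothetical example: three evaluated leaves, one unevaluated leaf.
observed = {"leaf_a": 0.2, "leaf_b": 0.6, "leaf_c": 0.4}
posteriors = empirical_bayes_posteriors(
    observed, noise_var=0.01, all_leaves=["leaf_a", "leaf_b", "leaf_c", "leaf_d"]
)
```

In this sketch, evaluated leaves are pulled toward the pooled mean in proportion to how noisy their measurements are, while the unevaluated `leaf_d` receives the fitted prior itself, with a larger variance reflecting that it was never measured.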

Key Takeaways

→HARP reduces finetuning training iterations by approximately 7x while maintaining or exceeding baseline performance
→Hierarchical organization combined with Bayesian inference enables accurate utility estimation without exhaustive training
→Two complementary selection envelopes (HARP-C and HARP-E) balance conservative redundancy control against additive region rewards
→Theoretical analysis shows HARP controls selection error under local smoothness and bounded estimation conditions
→HARP achieves improvements of up to 8.9 points over the strongest baselines on downstream objectives
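
The trade-off between redundancy control and additive rewards mentioned in the takeaways can be illustrated with a generic greedy selector. This is a hedged sketch, not HARP-C or HARP-E: the function `select_leaves`, the `redundancy_penalty` parameter, and the example utilities and regions are all hypothetical, chosen only to show how a redundancy discount changes which leaves are picked.

```python
def select_leaves(utility, region_of, budget, redundancy_penalty=0.0):
    """Greedily pick up to `budget` leaves by estimated utility.

    Each leaf's score is discounted by `redundancy_penalty` for every leaf
    already chosen from the same region. With a penalty of 0.0 the scores
    are purely additive and the top-utility leaves win.
    """
    chosen = []
    picks_per_region = {}
    remaining = set(utility)

    while remaining and len(chosen) < budget:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(
            sorted(remaining),
            key=lambda leaf: utility[leaf]
            - redundancy_penalty * picks_per_region.get(region_of[leaf], 0),
        )
        chosen.append(best)
        remaining.remove(best)
        region = region_of[best]
        picks_per_region[region] = picks_per_region.get(region, 0) + 1
    return chosen


# Hypothetical utilities: region A holds the two highest-scoring leaves.
utility = {"a1": 0.9, "a2": 0.85, "b1": 0.7, "c1": 0.5}
region_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}

additive_pick = select_leaves(utility, region_of, budget=2)
diverse_pick = select_leaves(utility, region_of, budget=2, redundancy_penalty=0.3)
```

Without a penalty the selector takes both leaves from region A. With the penalty, the second pick shifts to region B, because the remaining region A leaf is discounted below it.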