HARP: Efficient Data Selection for Finetuning Large Language Models
Researchers introduce HARP (Hierarchical Active Region Pruning), a training-efficient method for selecting data when finetuning large language models. Rather than exhaustively training on all candidate data, HARP organizes the data hierarchically and uses Bayesian inference to evaluate representative subsets. The approach reduces computational cost by 7x while maintaining or improving model performance.
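The summary gives no algorithmic detail, so the following is only a toy sketch of the general idea of combining a data hierarchy with Bayesian inference over representative subsets. It is not HARP's actual procedure. Everything here is an assumption made for illustration: the `Region` class, the Gaussian prior and known noise variance, the credible-bound pruning rule, and the `evaluate` callback (a stand-in for whatever cheap utility estimate a real system would compute for an example).

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Region:
    """A node in a hypothetical data hierarchy (e.g. a cluster of examples)."""
    name: str
    examples: list = field(default_factory=list)  # data held directly by a leaf
    children: list = field(default_factory=list)  # sub-regions, if any
    # Gaussian posterior over the region's mean utility; assumed prior N(0, 1).
    mean: float = 0.0
    var: float = 1.0

    def leaves(self):
        """Return all examples stored at or below this region."""
        if not self.children:
            return list(self.examples)
        return [x for child in self.children for x in child.leaves()]


def update_posterior(region, observations, noise_var=0.25):
    """Conjugate normal update, assuming a known observation noise variance."""
    for y in observations:
        precision = 1.0 / region.var + 1.0 / noise_var
        region.mean = (region.mean / region.var + y / noise_var) / precision
        region.var = 1.0 / precision


def select_regions(root, evaluate, probes=8, z=2.0, rng=None):
    """Walk the hierarchy top-down, probing a few examples per region.

    A region is pruned when its upper credible bound falls below the best
    lower credible bound among its peers at the same level. Surviving leaf
    regions are returned as the selected data.
    """
    rng = rng or random.Random(0)
    frontier, kept = [root], []
    while frontier:
        # Evaluate only a small sample from each region, not all of its data.
        for region in frontier:
            pool = region.leaves()
            sample = rng.sample(pool, min(probes, len(pool)))
            update_posterior(region, [evaluate(x) for x in sample])
        best_lower = max(r.mean - z * math.sqrt(r.var) for r in frontier)
        survivors = [r for r in frontier
                     if r.mean + z * math.sqrt(r.var) >= best_lower]
        # Expand surviving internal regions; keep surviving leaves.
        frontier = []
        for region in survivors:
            if region.children:
                frontier.extend(region.children)
            else:
                kept.append(region)
    return kept
```

In this sketch, a hierarchy with one clearly useful branch and one clearly poor branch would have the poor branch discarded after a handful of probes, so none of its remaining examples are ever evaluated. That is the sense in which sampling representative subsets can save compute relative to exhaustive evaluation; the 7x figure quoted above is the source's claim about HARP and is not something this toy demonstrates.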