Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning
Researchers propose DualSelect, a framework for fine-tuning large language models that jointly selects relevant safety references and compatible task samples, preserving safety alignment while improving task performance. The method reports Safety Average gains of 5.10+ points over the strongest baselines across models from 1B to 8B parameters, without sacrificing utility.
DualSelect addresses a critical challenge in LLM development: the tension between adapting models to downstream tasks and maintaining their safety guardrails. When language models are fine-tuned on new data, they often lose previously learned safety behaviors. Existing solutions, which rely on fixed safety examples or crude task filtering, have struggled to escape this trade-off.
The researchers tackle this problem with a coupled selection approach that treats safety reference selection and task sample filtering as interdependent processes. Rather than applying static safety constraints globally, DualSelect dynamically identifies task-specific safety references, using preservation loss and conflict detection to decide which references matter for the task at hand, and then keeps the task samples that are compatible with the safety directions those references induce. The selection is framed as a minimax problem and optimized with entropy-regularized scoring, lazy reference refresh, and gradient correction.
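The coupled selection described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names (`select_references`, `safety_direction`, `filter_task_samples`, `correct_gradient`), the parameters `k` and `tau`, the scoring formula (preservation loss plus a cosine-based conflict term), and the toy 2-D gradients are all assumptions made for illustration. Entropy-regularized scoring is rendered here as a temperature softmax, which is the maximizer of expected score plus `tau` times entropy over the simplex. Gradient correction is rendered as a simple projection that removes the conflicting component, a stand-in the source does not confirm. Lazy reference refresh is omitted.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0 and nv > 0 else 0.0

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def softmax(scores, tau):
    # Numerically stable temperature softmax.
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def select_references(ref_grads, ref_losses, task_grads, k, tau=0.5):
    """Score references by preservation loss plus conflict with the task (assumed formula)."""
    task_mean = mean_vector(task_grads)
    scores = [loss + max(0.0, -cosine(g, task_mean))
              for g, loss in zip(ref_grads, ref_losses)]
    weights = softmax(scores, tau)
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    return sorted(top), weights

def safety_direction(ref_grads, indices, weights):
    """Weighted average of the selected reference gradients."""
    total = sum(weights[i] for i in indices)
    dim = len(ref_grads[0])
    return [sum(weights[i] * ref_grads[i][d] for i in indices) / total
            for d in range(dim)]

def filter_task_samples(task_grads, direction):
    """Keep task samples whose gradient does not oppose the safety direction."""
    return [j for j, g in enumerate(task_grads) if dot(g, direction) >= 0.0]

def correct_gradient(grad, direction):
    """Project out the component of grad that opposes the safety direction."""
    overlap = dot(grad, direction)
    if overlap >= 0.0:
        return list(grad)
    scale = overlap / dot(direction, direction)
    return [g - scale * d for g, d in zip(grad, direction)]

# Toy data (hypothetical): three safety references, four task samples.
ref_grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref_losses = [0.9, 0.3, 0.4]
task_grads = [[1.0, 0.5], [0.5, -1.0], [-0.5, -1.0], [-0.2, 0.4]]

selected, weights = select_references(ref_grads, ref_losses, task_grads, k=2)
direction = safety_direction(ref_grads, selected, weights)
kept = filter_task_samples(task_grads, direction)
corrected = correct_gradient(task_grads[1], direction)

print("selected references:", selected)
print("kept task samples:", kept)
```

In this toy setup the conflict term changes which references are chosen: reference 1 has a lower preservation loss than reference 2, but it is selected because its gradient opposes the average task gradient. The task filter then depends on that choice, which is the sense in which the two selections are coupled.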
For the AI industry, this advancement carries substantial implications. As enterprises deploy fine-tuned LLMs in production environments, maintaining safety alignment during customization directly affects liability, regulatory compliance, and user trust. The Safety Average improvement of 5.10+ points holds across multiple evaluation judges and across model sizes from 1B to 8B parameters, which suggests the method is robust enough to be practical for current commercial deployments.
The broader significance extends beyond individual model fine-tuning. As the field moves toward continual learning paradigms where models adapt to evolving tasks and domains, safety-preserving adaptation mechanisms become critical infrastructure. This work indicates that coupled selection strategies outperform one-sided approaches, a finding that could inform future research on preserving safety alignment throughout the model development cycle.
- DualSelect framework jointly optimizes task and safety reference selection rather than treating them independently
- Method preserves safety alignment while maintaining task utility across 1B-8B parameter models
- Safety Average improvements of 5.10+ points achieved against strongest baselines using REDORCA judge
- Coupled selection approach scales effectively across multiple evaluation frameworks and model sizes
- Framework extends to retention-focused continual learning scenarios beyond traditional fine-tuning