Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders
Researchers propose Feature Activation Coverage (FAC), a new metric for measuring data diversity in large language models using sparse autoencoders instead of traditional text-based metrics. The FAC Synthesis framework generates synthetic training data to fill feature gaps, demonstrating consistent improvements across multiple tasks and revealing transferable feature spaces across different model families.
This research addresses a fundamental challenge in large language model optimization: constructing post-training datasets that meaningfully improve downstream performance. Traditional approaches rely on linguistic diversity metrics that fail to capture task-relevant features, creating a gap between what practitioners measure and what actually drives model capability. The introduction of Feature Activation Coverage bridges this gap by operating in the interpretable feature space extracted by sparse autoencoders, providing a more precise signal for data quality.
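To make the idea of measuring diversity in feature space concrete, here is a minimal sketch of a coverage-style score. It is an illustration under stated assumptions, not the paper's definition: the summary does not give the exact FAC formula, so the function names, the activation threshold, and the "fraction of target features activated by at least one sample" rule are all hypothetical. The per-sample activation vectors are assumed to come from a sparse autoencoder applied elsewhere.

```python
from typing import Iterable, Sequence, Set


def active_features(activations: Sequence[float], threshold: float = 0.0) -> Set[int]:
    """Return indices of SAE features whose activation exceeds the threshold.

    The threshold is an assumed hyperparameter, not a value from the source.
    """
    return {i for i, a in enumerate(activations) if a > threshold}


def feature_activation_coverage(
    dataset_activations: Iterable[Sequence[float]],
    target_features: Set[int],
    threshold: float = 0.0,
) -> float:
    """Fraction of target features activated by at least one sample.

    Assumed definition for illustration only.
    """
    if not target_features:
        return 1.0  # nothing to cover
    covered: Set[int] = set()
    for acts in dataset_activations:
        covered |= active_features(acts, threshold)
    return len(covered & target_features) / len(target_features)


# Toy example: two samples, four task-relevant features (made-up numbers).
toy_dataset = [
    [0.0, 1.2, 0.0, 0.0],  # activates feature 1
    [0.5, 0.0, 0.0, 0.0],  # activates feature 0
]
toy_targets = {0, 1, 2, 3}
coverage = feature_activation_coverage(toy_dataset, toy_targets)
print(coverage)  # 0.5, since features 2 and 3 are never activated
```

Two datasets with very different surface text can receive the same score here if they activate the same features, which is the contrast with text-based diversity metrics that the article draws.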
The work builds on growing recognition that data-centric AI approaches can rival or exceed model-scaling benefits. By identifying missing features in seed datasets and synthetically generating samples to address gaps, FAC Synthesis represents a practical advancement in efficient post-training. The framework's validation across instruction following, toxicity detection, reward modeling, and behavior steering demonstrates broad applicability rather than narrow task optimization.
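The gap-filling step can be sketched in the same spirit. The example below is a simplification, not the FAC Synthesis procedure itself: it assumes each sample has already been mapped to a set of active feature indices, and it swaps in a plain greedy set-cover heuristic that picks from a pool of pre-generated candidates, whereas the framework described here generates new samples. All names are hypothetical.

```python
from typing import List, Set, Tuple


def missing_features(seed_features: List[Set[int]], target_features: Set[int]) -> Set[int]:
    """Target features that no seed sample activates."""
    covered = set().union(*seed_features)
    return target_features - covered


def greedy_fill(candidates: List[Set[int]], missing: Set[int]) -> Tuple[List[int], Set[int]]:
    """Greedily pick candidate indices that cover the most still-missing features.

    Returns the chosen indices and any features left uncovered.
    This greedy heuristic is an assumption for illustration.
    """
    remaining = set(missing)
    chosen: List[int] = []
    while remaining and candidates:
        best = max(range(len(candidates)), key=lambda i: len(candidates[i] & remaining))
        gain = candidates[best] & remaining
        if not gain:
            break  # no candidate covers anything that is still missing
        chosen.append(best)
        remaining -= gain
    return chosen, remaining


# Toy example with made-up feature indices.
seed = [{0, 1}, {1, 2}]
targets = {0, 1, 2, 3, 4, 5}
gaps = missing_features(seed, targets)  # features 3, 4, 5
pool = [{3}, {3, 4}, {5, 0}, {1}]       # features activated by each candidate sample
picked, uncovered = greedy_fill(pool, gaps)
print(picked, uncovered)                # [1, 2] set()
```

In this toy case two candidates are enough to close all three gaps, which mirrors the article's point that a small number of targeted samples can matter more than a large number of redundant ones.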
A particularly significant finding is the discovery of shared, interpretable feature spaces across model families including LLaMA, Mistral, and Qwen. This cross-model consistency suggests fundamental architectural similarities in how these models organize learned representations, enabling knowledge transfer and reducing the need for model-specific optimization approaches. For developers and organizations managing multiple model variants, this opens pathways for consolidated data synthesis strategies.
The methodology's emphasis on interpretability distinguishes it from black-box optimization techniques. Knowing which features drive performance enables targeted interventions and makes model behavior more transparent, which is increasingly important for safety and alignment applications. Future work will likely involve scaling these techniques to larger models and exploring feature-space optimization for specialized domains.
- Feature Activation Coverage provides a more precise diversity metric than text-based alternatives by measuring gaps in learned feature space.
- FAC Synthesis generates synthetic data targeting missing features, consistently improving performance on multiple downstream tasks.
- Shared interpretable feature spaces exist across LLaMA, Mistral, and Qwen models, enabling cross-model knowledge transfer.
- The approach prioritizes data efficiency and interpretability over pure model scaling, aligning with data-centric AI trends.
- Results span diverse applications from instruction-following to toxicity detection, demonstrating broad practical applicability.