FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction
FineGen is a VLM-based multi-agent framework that automatically constructs vision-language datasets by generating hard negative samples through a Generation-Verification-Correction pipeline. The resulting FineGen-100K dataset contains 147,000+ attribute-specific hard negatives, and fine-tuning on it yields a 14.4% accuracy improvement on the hard samples of the FG-OVD fine-grained detection benchmark.
FineGen addresses a fundamental limitation in current vision-language datasets: the scarcity of hard negative samples that are semantically valid yet visually contradictory. This gap directly impairs fine-grained perception, a critical requirement for advanced computer vision applications. The framework uses vision-language models (VLMs) in a multi-agent architecture with closed-loop feedback, so that challenging training examples can be generated and validated autonomously at scale.
The core innovation lies in the Generation-Verification-Correction pipeline, which ensures quality through automated checks rather than manual curation. By achieving 96.7% attribute validity while maintaining a strict 1:10 positive-to-negative ratio, FineGen suggests that synthetic hard negatives can reach a high level of quality without human annotators in the loop. This approach mirrors broader trends in AI where automation and self-improvement mechanisms reduce reliance on expensive human annotation.
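A closed loop of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not FineGen's actual implementation: the names `Candidate`, `build_hard_negatives`, and the `toy_*` functions are hypothetical, and the three agents are plain callables standing in for VLM calls. The only detail taken from the text is the generate, verify, correct ordering and the cap of ten negatives per positive.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    positive: str   # caption that matches the image
    negative: str   # caption with one attribute altered
    attribute: str  # which attribute was changed, e.g. "color"


def build_hard_negatives(
    positive: str,
    generate: Callable[[str], List[Candidate]],
    verify: Callable[[Candidate], bool],
    correct: Callable[[Candidate], Optional[Candidate]],
    negatives_per_positive: int = 10,  # the 1:10 ratio from the text
    max_rounds: int = 3,               # assumed retry budget per candidate
) -> List[Candidate]:
    """Keep candidates that pass verification, repairing failures when possible."""
    accepted: List[Candidate] = []
    for cand in generate(positive):
        if len(accepted) >= negatives_per_positive:
            break
        for _ in range(max_rounds):
            if verify(cand):
                accepted.append(cand)
                break
            fixed = correct(cand)
            if fixed is None:  # the corrector gave up, so drop the candidate
                break
            cand = fixed       # re-verify the corrected candidate
    return accepted


# Toy stand-ins for the three agents (hypothetical, rule-based instead of VLMs).
COLORS = ["red", "blue", "green"]


def toy_generate(positive: str) -> List[Candidate]:
    # Swap the colour word for every known colour, including itself,
    # so that one deliberately invalid candidate reaches the verifier.
    out = []
    for word in COLORS:
        if word in positive.split():
            for other in COLORS:
                out.append(Candidate(positive, positive.replace(word, other), "color"))
    return out


def toy_verify(cand: Candidate) -> bool:
    # A negative is only useful if it actually contradicts the positive.
    return cand.negative != cand.positive


def toy_correct(cand: Candidate) -> Optional[Candidate]:
    # Repair an invalid candidate by forcing a colour not used by the generator.
    for word in COLORS:
        if word in cand.negative.split():
            return Candidate(cand.positive, cand.negative.replace(word, "yellow"), cand.attribute)
    return None


if __name__ == "__main__":
    for c in build_hard_negatives("a red mug", toy_generate, toy_verify, toy_correct):
        print(c.attribute, "|", c.positive, "->", c.negative)
```

Passing the agents in as callables keeps the loop independent of any particular model API: in a real system each callable would wrap a VLM prompt, while the control flow and the per-positive cap stay the same.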
The results support the methodology's practical impact. Fine-tuning on FineGen-100K yielded 14.4% accuracy gains on hard samples in the FG-OVD benchmark, a substantial improvement that suggests downstream applications in retail, autonomous systems, and medical imaging could benefit. The hierarchical structure of FineGen-100K also allows flexible scaling and domain-specific adaptation.
Looking forward, this work establishes a template for dataset construction in other domains requiring fine-grained discrimination. The success of VLM-based multi-agent frameworks for data generation may encourage similar approaches in specialized dataset creation, potentially reducing bottlenecks in training advanced vision models. Industry players should monitor whether this methodology becomes standardized for benchmark dataset construction.
- FineGen's Generation-Verification-Correction pipeline generates hard negatives with 96.7% validity, addressing scarcity in vision-language datasets.
- FineGen-100K contains 147,000+ attribute-specific hard negatives at a strict 1:10 positive-to-negative ratio, enabling 14.4% accuracy gains on hard samples in the FG-OVD benchmark.
- Multi-agent VLM frameworks demonstrate viability for automating high-quality dataset construction at scale, reducing manual annotation costs.
- The approach shows particular strength for hard sample detection, directly improving model robustness in challenging real-world scenarios.
- The framework's success as a template suggests that automated dataset generation could see broader adoption across AI research and commercial applications.