Page image classifier fine-tuned on century-spanning archives of scanned documents for further content-specific processing
Researchers developed an automated image classification system using fine-tuned deep learning models to categorize scanned historical documents by content type (text, tables, graphics), achieving 99.16% accuracy on Czech archaeological archives. The system successfully processed over 649,000 unlabeled pages, with RegNetY-16GF emerging as the most reliable model for production deployment due to consistent inter-model agreement.
This research addresses a critical bottleneck in digital humanities infrastructure: the manual labor required to process vast historical document archives. The team's achievement of near-perfect classification accuracy (99.16%) using RegNetY-16GF demonstrates how modern computer vision can automate previously intractable sorting tasks at scale, transforming humanities research workflows.
The work builds on decades of progress in image classification, from traditional machine learning baselines (75% accuracy) to transformer-based architectures. What distinguishes this effort is the rigorous methodological approach: four-stage expert annotation, collaborative label design, and systematic model comparison across CNNs, Vision Transformers, and multimodal systems. The authors' decision to prioritize inter-model agreement over raw test-set accuracy reflects practical deployment experience: fine-tuned CLIP reached 99.14% test accuracy but proved unreliable on unlabeled data, reaching only 65% agreement with the image-only models.
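To make the agreement criterion concrete, here is a minimal sketch of how pairwise inter-model agreement could be computed on unlabeled pages. It is an illustration only, not the authors' evaluation code: the function names, model names, and toy predictions are all assumptions made up for this example.

```python
from itertools import combinations


def agreement_rate(preds_a, preds_b):
    """Fraction of pages on which two models predict the same label."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        raise ValueError("prediction lists must not be empty")
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)


def pairwise_agreement(predictions):
    """Agreement rate for every pair of models in a {model_name: labels} dict."""
    return {
        (m1, m2): agreement_rate(predictions[m1], predictions[m2])
        for m1, m2 in combinations(sorted(predictions), 2)
    }


def mean_agreement(predictions, model):
    """Average agreement of one model with all of the other models."""
    rates = [
        rate
        for pair, rate in pairwise_agreement(predictions).items()
        if model in pair
    ]
    return sum(rates) / len(rates)


# Toy, made-up predictions for five unlabeled pages (not data from the study).
toy_predictions = {
    "model_a": ["text", "table", "graphic", "text", "text"],
    "model_b": ["text", "table", "graphic", "text", "table"],
    "model_c": ["text", "graphic", "table", "text", "graphic"],
}

for pair, rate in pairwise_agreement(toy_predictions).items():
    print(pair, f"{rate:.0%}")
```

In this toy setup, a model whose mean agreement with the others is low (here `model_c`) would be the one to treat with suspicion on unlabeled data, even if its test-set accuracy looked competitive.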
For the broader AI industry, this work validates fine-tuned image classifiers as production-ready systems for domain-specific document understanding; notably, the model selected for deployment, RegNetY-16GF, is a convolutional network rather than a transformer. The public release of annotated datasets and open-source models creates positive externalities, enabling other institutions to deploy similar systems for their archives. The research suggests that large fine-tuned models, despite requiring substantial computational resources, justify their overhead through consistency and reliability rather than marginal accuracy gains.
Institutions managing historical archives now have validated, open-source baselines for automated document classification. The next frontier involves extending these systems to handle multilingual text recognition and to extract semantic relationships between classified document types. Both are opportunities that position vision-language models as increasingly valuable infrastructure for knowledge preservation.
- →RegNetY-16GF achieved 99.16% accuracy on 48,000 annotated historical page images, narrowly ahead of fine-tuned CLIP (99.14%) and far above the traditional machine learning baseline (75%).
- →Fine-tuned CLIP models showed high test accuracy but poor generalization on unlabeled data, making image-only models preferable for production deployment.
- →The system successfully processed 649,508 unlabeled archival pages with over 90% inter-model agreement, demonstrating scalable automation.
- →Open-source release of annotated dataset and trained models enables other institutions to deploy document classification for their archives.
- →Document classification enables downstream content-specific processing like OCR and structured data extraction, streamlining humanities digitization workflows.
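
As a rough illustration of the last point, predicted labels can be used to route each page to a content-specific step. The sketch below is hypothetical: the label names follow the content types mentioned above (text, tables, graphics), while the step names, the `route_pages` helper, and the page identifiers are placeholders rather than parts of the published system.

```python
from collections import defaultdict

# Hypothetical mapping from predicted content type to a downstream step.
# The step names are placeholders, not components of the published system.
ROUTES = {
    "text": "ocr",
    "table": "table_extraction",
    "graphic": "image_archive",
}


def route_pages(page_labels, routes=ROUTES, fallback="manual_review"):
    """Group page identifiers by the downstream step their label maps to."""
    queues = defaultdict(list)
    for page_id, label in page_labels:
        # Labels without a configured route go to a fallback queue.
        queues[routes.get(label, fallback)].append(page_id)
    return dict(queues)


# Made-up (page, predicted label) pairs for illustration.
pages = [
    ("p001.png", "text"),
    ("p002.png", "table"),
    ("p003.png", "text"),
    ("p004.png", "map"),
]
queues = route_pages(pages)
print(queues)
```

Sending unrecognized labels to a fallback queue keeps unexpected classes visible to a human reviewer instead of silently dropping them.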