#dataset-benchmark News & Analysis

6 articles tagged with #dataset-benchmark. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 23 · 7/10

HALAS: A Human-Annotated Dataset of Hallucinations of Modern ASR Systems

Researchers introduce HALAS, the first human-annotated dataset documenting naturally occurring hallucinations from seven state-of-the-art ASR systems on real earnings call recordings. The benchmark shows that hallucinations persist even in nearly correct transcriptions and establishes rigorous evaluation methods: current detection techniques reach an F1 score of only 53.1%, even though character-level metrics reach 81% ROC-AUC.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

Researchers introduce DragOn, a large-scale benchmark dataset with 286K training screenshots and 3.5M tasks designed to improve GUI agents' ability to perform drag-based interactions like highlighting, resizing, and swiping. The dataset addresses a critical gap where drag-grounding capabilities lag significantly behind click-grounding in AI models controlling desktops and mobile devices.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

Researchers introduce FineVLA, a framework that enhances Vision-Language-Action models for robotics by incorporating fine-grained instruction supervision beyond simple goal-level commands. The system combines 972,247 trajectories into a curated dataset of 47,159 fine-grained trajectories and demonstrates that mixing fine-grained and coarse instructions raises real-world robot manipulation success rates to 62.7%, compared with 49.9% for goal-level instructions alone.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset

Researchers introduce ArtiFact, a large-scale multi-modal dataset containing 651,045 museum records from three major art institutions combined with images, text, and structured data. The dataset benchmarks AI systems on cross-modal error detection and semantic query processing tasks, revealing significant challenges in detecting domain-specific errors and handling culturally nuanced information retrieval.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

MAviS: A Multimodal Conversational Assistant For Avian Species

Researchers introduce MAviS, a specialized multimodal AI system combining image, audio, and text data for avian species identification and ecological monitoring. The system includes a large dataset covering 1,000+ bird species, a fine-tuned language model, and a comprehensive benchmark, demonstrating state-of-the-art performance in domain-specific biodiversity conservation applications.

AI · Neutral · arXiv – CS AI · May 4 · 6/10

ViLegalNLI: Natural Language Inference for Vietnamese Legal Texts

Researchers introduce ViLegalNLI, the first large-scale Vietnamese Natural Language Inference dataset for legal texts, containing 42,012 premise-hypothesis pairs from statutory documents. The dataset enables AI systems to understand legal reasoning patterns and supports the development of reliable AI tools for Vietnamese legal analysis and decision-making.