16 articles tagged with #benchmark-dataset. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Neutral · arXiv → CS AI · 3d ago · 7/10
🧠 Researchers developed the first real-world benchmark for evaluating whether large language models can infer causal relationships from complex academic texts. The study reveals that LLMs struggle significantly with this task: the best model achieves an F1 score of only 0.535, highlighting a critical gap in the causal-reasoning capabilities needed for AGI advancement.
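For context on the reported metric, F1 is the harmonic mean of precision and recall over predicted causal links. A minimal sketch (the counts below are illustrative values chosen to reproduce a 0.535 score, not figures from the paper):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp)  # fraction of predicted links that are correct
    recall = tp / (tp + fn)     # fraction of true links that were recovered
    return 2 * precision * recall / (precision + recall)

# Hypothetical: 107 correct links, 93 spurious, 93 missed
print(round(f1_score(107, 93, 93), 3))  # 0.535
```

At 0.535, a model is recovering barely half of the true causal structure even when precision and recall are balanced.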
AI · Neutral · arXiv → CS AI · Feb 27 · 7/10
🧠 Researchers introduce MM-NeuroOnco, a large-scale multimodal dataset containing 24,726 MRI slices and 200,000 instructions for training AI models in brain tumor diagnosis. The benchmark reveals significant challenges in medical AI, with even advanced models like Gemini 3 Flash achieving only 41.88% accuracy on diagnostic questions.
AI · Bullish · arXiv → CS AI · 3d ago · 6/10
🧠 Researchers introduced MMR-AD, a large-scale multimodal dataset designed to benchmark general anomaly detection using Multimodal Large Language Models (MLLMs). The study reveals that current state-of-the-art MLLMs fall short of industrial requirements for anomaly detection, though a proposed baseline model called Anomaly-R1 demonstrates significant improvements through reasoning-based approaches enhanced by reinforcement learning.
AI · Neutral · arXiv → CS AI · 3d ago · 6/10
🧠 Researchers have introduced C-ReD, a Chinese benchmark dataset for detecting AI-generated text that addresses gaps in model diversity and data homogeneity. The dataset, derived from real-world prompts, demonstrates reliable in-domain detection and strong generalization to unseen language models, with resources publicly available on GitHub.
AI · Neutral · arXiv → CS AI · 4d ago · 6/10
🧠 Researchers propose Noise-Aware In-Context Learning (NAICL), a plug-and-play method to reduce hallucinations in auditory large language models without expensive fine-tuning. The approach uses a noise prior library to guide models toward more conservative outputs, achieving a 37% reduction in hallucination rates while establishing a new benchmark for evaluating audio understanding systems.
AI · Neutral · arXiv → CS AI · Apr 10 · 6/10
🧠 Researchers introduce A-MBER, a benchmark dataset designed to evaluate AI assistants' ability to recognize emotions based on long-term interaction history rather than immediate context. The benchmark tests whether models can retrieve relevant past interactions, infer current emotional states, and provide grounded explanations, revealing that memory's value lies in selective, context-aware interpretation rather than simple historical volume.
AI · Neutral · arXiv → CS AI · Apr 10 · 6/10
🧠 Researchers evaluated whether large language models understand long-form narratives similarly to humans by comparing summaries of 150 novels written by humans and nine state-of-the-art LLMs. The study found that LLMs focus disproportionately on story endings rather than distributing attention like human readers, revealing gaps in narrative comprehension despite expanded context windows.
AI · Neutral · arXiv → CS AI · Apr 10 · 6/10
🧠 Researchers introduced a new benchmark dataset for evaluating world models' ability to maintain spatial consistency across long sequences, addressing a critical gap in AI evaluation. The dataset, collected from Minecraft environments with 20 million frames across 150 locations, enables development of memory-augmented models that can reliably simulate physical spaces for downstream tasks like planning and simulation.
AI · Neutral · arXiv → CS AI · Apr 10 · 6/10
🧠 Q-Probe introduces a novel agentic framework for scaling image quality assessment to high-resolution images by addressing limitations in existing reinforcement learning approaches. The research presents Vista-Bench, a new benchmark for fine-grained degradation analysis, and demonstrates state-of-the-art performance across multiple resolution scales through context-aware probing mechanisms.
AI · Neutral · arXiv → CS AI · Mar 16 · 6/10
🧠 Researchers discovered that large language models exhibit gender bias at the individual question level, creating different amounts of information for men versus women despite appearing unbiased at category levels. A new benchmark dataset called RealWorldQuestioning was developed, and a simple prompt-based debiasing approach was shown to improve response quality in 78% of cases.
AI · Bullish · arXiv → CS AI · Mar 3 · 6/10
🧠 Researchers introduce TripleSumm, a novel AI architecture that adaptively fuses visual, text, and audio modalities for improved video summarization. The team also releases MoSu, the first large-scale benchmark dataset providing all three modalities for multimodal video summarization research.
AI · Bullish · arXiv → CS AI · Mar 3 · 6/10
🧠 Researchers propose MOON, the first generative multimodal large language model designed specifically for e-commerce product understanding. The model addresses key challenges in product representation learning through guided Mixture-of-Experts modules and semantic region detection, while introducing a new benchmark dataset for evaluation.
AI · Neutral · arXiv → CS AI · Feb 27 · 5/10
🧠 Researchers introduce FIRE, a comprehensive benchmark for evaluating Large Language Models' financial intelligence and reasoning capabilities. The benchmark includes theoretical financial knowledge tests from qualification exams and 3,000 practical financial scenario questions covering complex business domains.
AI · Neutral · arXiv → CS AI · Feb 27 · 6/10
🧠 Researchers introduce PoSh, a new evaluation metric for detailed image descriptions that uses scene graphs to guide LLMs-as-a-Judge, achieving better correlation with human judgments than existing methods. They also present DOCENT, a challenging benchmark dataset featuring artwork with expert-written descriptions to evaluate vision-language models' performance on complex image analysis.
AI · Neutral · arXiv → CS AI · Mar 4 · 4/10
🧠 Researchers introduce Whisper-RIR-Mega, a new benchmark dataset for testing automatic speech recognition robustness in reverberant acoustic environments. The study evaluates five Whisper models and finds that reverberation consistently degrades performance across all model sizes, with word error rates increasing by 0.12 to 1.07 percentage points.
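The word error rate cited above is the standard ASR metric: word-level edit distance between reference and hypothesis transcripts, divided by the reference length. A minimal sketch with an illustrative transcript pair (not data from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # cost of deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # cost of inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six: 1/6 ≈ 0.167, i.e. 16.7% WER
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

On this scale, the reported degradations of 0.12 to 1.07 percentage points are small absolute shifts in WER, but consistent across all five model sizes.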
AI · Neutral · arXiv → CS AI · Mar 4 · 4/10
🧠 Researchers developed new prompting-based approaches using multimodal large language models to generate real-time video commentary that considers both content relevance and timing. The study introduces dynamic interval-based decoding that adjusts prediction timing based on utterance duration, showing improved alignment with human commentary patterns without requiring model fine-tuning.