253 articles tagged with #benchmark. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI Bearish · arXiv – CS AI · Mar 36/104
Researchers introduced SciTrek, a new benchmark for testing large language models' ability to perform numerical reasoning across long scientific documents. The benchmark reveals significant challenges for current LLMs, with the best model achieving only 46.5% accuracy at 128K tokens, and performance declining as context length increases.
AI Bearish · arXiv – CS AI · Mar 36/104
Researchers introduced HardcoreLogic, a benchmark of over 5,000 logic puzzles across 10 games to test Large Reasoning Models (LRMs) on non-standard puzzle variants. The study reveals significant performance drops in current LRMs when faced with complex or uncommon puzzle variations, indicating heavy reliance on memorized patterns rather than genuine logical reasoning.
AI Neutral · arXiv – CS AI · Mar 36/103
Researchers introduced OVERTONBENCH, a framework for measuring viewpoint diversity in large language models through the OVERTONSCORE metric. In a study of 8 LLMs with 1,208 participants, models scored 0.35-0.41 out of 1.0, with DeepSeek V3 performing best, showing significant room for improvement in pluralistic representation.
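The summary does not give the OVERTONSCORE definition, so as loose intuition only, a 0-to-1 viewpoint-diversity score could be a coverage fraction like the sketch below; the function name and inputs are our invention, not the paper's metric:

```python
def viewpoint_coverage(response_viewpoints, reference_viewpoints):
    # Hypothetical illustration, NOT the actual OVERTONSCORE: the share of
    # reference viewpoints that a model's response manages to cover.
    covered = set(response_viewpoints) & set(reference_viewpoints)
    return len(covered) / max(len(set(reference_viewpoints)), 1)
```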
AI Bearish · arXiv – CS AI · Mar 36/104
Researchers introduced SimpleToM, a benchmark revealing that state-of-the-art language models can infer mental states but struggle to apply that knowledge for behavior prediction and judgment. The study exposes a critical gap between explicit Theory of Mind inference and implicit application in real-world scenarios.
AI Bullish · arXiv – CS AI · Mar 36/104
Researchers introduce LLaVE, a new multimodal embedding model that uses hardness-weighted contrastive learning to better distinguish between positive and negative pairs in image-text tasks. The model achieves state-of-the-art performance on the MMEB benchmark, with LLaVE-2B outperforming previous 7B models and demonstrating strong zero-shot transfer capabilities to video retrieval tasks.
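To make "hardness-weighted contrastive learning" concrete, here is a minimal sketch of an InfoNCE loss whose negatives are up-weighted by their similarity to the query; the specific weighting form (`beta`, `tau`) is our assumption, not LLaVE's published loss:

```python
import torch
import torch.nn.functional as F

def hardness_weighted_infonce(img_emb, txt_emb, tau=0.07, beta=2.0):
    # Cosine similarities between every image-text pair in the batch;
    # the diagonal holds the matched (positive) pairs.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.T

    # Hardness weights (assumed form): negatives that score close to the
    # positive are up-weighted inside the log-sum-exp denominator.
    with torch.no_grad():
        w = torch.exp(beta * sim)
        w.fill_diagonal_(1.0)  # log(1) = 0, so the positive term is untouched

    logits = sim / tau + torch.log(w)
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(logits, labels)
```

Calling it on random embeddings, e.g. `hardness_weighted_infonce(torch.randn(8, 512), torch.randn(8, 512))`, returns a scalar loss; hard negatives contribute more to the denominator, which is the positive/negative separation effect the summary describes.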
AI Bearish · arXiv – CS AI · Mar 36/103
Researchers introduced JALMBench, a comprehensive benchmark to evaluate jailbreak vulnerabilities in Large Audio Language Models (LALMs), comprising over 245,000 audio samples and 11,000 text samples. The study reveals that LALMs face significant safety risks from jailbreak attacks, with text-based safety measures only partially transferring to audio inputs, highlighting the need for specialized defense mechanisms.
AI Bullish · arXiv – CS AI · Mar 36/103
Researchers introduce SHINE, a training-free framework that enables FLUX and other diffusion models to perform high-quality image composition without retraining. The framework addresses complex lighting scenarios such as shadows and reflections, achieving state-of-the-art performance on the new ComplexCompo benchmark.
AI Neutral · arXiv – CS AI · Mar 36/104
Researchers introduced SpinBench, a new benchmark for evaluating spatial reasoning abilities in vision language models (VLMs), focusing on perspective taking and viewpoint transformations. Testing 43 state-of-the-art VLMs revealed systematic weaknesses, including strong egocentric bias and poor rotational understanding; humans, at 91.2% accuracy, significantly outperform the models.
AI Neutral · arXiv – CS AI · Mar 35/103
Researchers introduce C³B (Comics Cross-Cultural Benchmark), a new benchmark to test cultural awareness capabilities in Multimodal Large Language Models using over 2,000 comic images and 18,000 QA pairs. Testing revealed significant performance gaps between current MLLMs and human performance, highlighting the need for improved cultural understanding in AI systems.
AI Bullish · arXiv – CS AI · Mar 36/104
DragFlow introduces the first framework to leverage FLUX's DiT priors for drag-based image editing, addressing distortion issues that plagued earlier Stable Diffusion-based approaches. The system uses region-based editing with affine transformations instead of point-based supervision, achieving state-of-the-art results on benchmarks.
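As a rough illustration of region-based editing with an affine transformation (as opposed to supervising a single dragged point), the sketch below warps a masked feature region with a 2×3 affine matrix; the function and tensor layout are our assumptions, not DragFlow's code:

```python
import torch
import torch.nn.functional as F

def warp_region_affine(features, mask, theta):
    # features: (1, C, H, W) diffusion features; mask: (1, 1, H, W) float in [0, 1];
    # theta: (1, 2, 3) affine matrix in normalized coordinates.
    grid = F.affine_grid(theta, list(features.shape), align_corners=False)
    warped = F.grid_sample(features, grid, align_corners=False)
    warped_mask = F.grid_sample(mask, grid, align_corners=False)
    # Composite: transformed content inside the moved region, original elsewhere.
    return warped_mask * warped + (1.0 - warped_mask) * features
```

Supervising a whole region this way constrains the edit more than a single point correspondence, which is one plausible reading of why it reduces distortion.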
AI Neutral · arXiv – CS AI · Mar 35/104
Researchers introduced SimuHome, a high-fidelity smart home simulator and benchmark with 600 episodes for testing LLM-based smart home agents. The system uses the Matter protocol standard and enables time-accelerated simulation to evaluate how AI agents handle device control, environmental monitoring, and workflow scheduling in smart homes.
AI Neutral · arXiv – CS AI · Mar 36/104
Researchers introduce EgoNight, the first comprehensive benchmark for nighttime egocentric vision understanding, featuring day-night aligned videos and visual question answering tasks. The benchmark reveals significant performance drops in state-of-the-art multimodal large language models when operating under low-light conditions.
AI Neutral · arXiv – CS AI · Mar 36/103
Researchers introduced WebDevJudge, a benchmark for evaluating how well AI models can judge web development quality compared to human experts. The study reveals significant gaps between AI judges and human evaluation, highlighting fundamental limitations in AI's ability to assess complex, interactive web development tasks.
AI Neutral · arXiv – CS AI · Mar 36/104
Researchers introduce Vision-DeepResearch Benchmark (VDR-Bench) with 2,000 VQA instances to better evaluate multimodal AI systems' visual and textual search capabilities. The benchmark addresses limitations in existing evaluations where answers could be inferred without proper visual search, and proposes a multi-round cropped-search workflow to improve model performance.
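The "multi-round cropped-search workflow" presumably alternates between proposing an image region and searching with the crop; a hedged sketch of such a loop, where the `ask_model` and `web_search` hooks are hypothetical and not VDR-Bench's API:

```python
from PIL import Image

def cropped_search(image_path, ask_model, web_search, max_rounds=3):
    # Each round: the model proposes a crop box and query (or a final answer),
    # we crop the image, search with the crop, and feed the results back.
    image = Image.open(image_path)
    context = []
    for _ in range(max_rounds):
        step = ask_model(image=image, context=context)
        if step.get("answer"):           # model is confident; stop searching
            return step["answer"]
        crop = image.crop(step["bbox"])  # bbox = (left, top, right, bottom)
        context.append(web_search(step["query"], crop))
    return None
```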
AI Bearish · arXiv – CS AI · Mar 27/1014
Researchers have developed ForesightSafety Bench, a comprehensive AI safety evaluation framework covering 94 risk dimensions across 7 fundamental safety pillars. The benchmark evaluation of over 20 advanced large language models revealed widespread safety vulnerabilities, particularly in autonomous AI agents, AI4Science, and catastrophic risk scenarios.
AI Neutral · arXiv – CS AI · Mar 26/1014
Researchers introduce Jailbreak Foundry (JBF), a system that automatically converts AI jailbreak research papers into executable code modules for standardized testing. The system successfully reproduced 30 attacks with high accuracy and reduces implementation code by nearly half while enabling consistent evaluation across multiple AI models.
AI Bullish · arXiv – CS AI · Mar 26/1011
Researchers developed AMBER-AFNO, a new lightweight architecture for 3D medical image segmentation that replaces traditional attention mechanisms with Adaptive Fourier Neural Operators. The model achieves state-of-the-art results on medical datasets while maintaining linear memory scaling and quasi-linear computational complexity.
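For intuition on replacing attention with an Adaptive Fourier Neural Operator, here is a minimal 1D-token AFNO-style mixer after Guibas et al.; AMBER-AFNO operates on 3D volumes, so treat this as a simplified sketch, not the paper's architecture:

```python
import torch
import torch.nn as nn

class AFNOMixer(nn.Module):
    """Mix tokens with a per-frequency MLP in the Fourier domain
    instead of self-attention (simplified sketch)."""
    def __init__(self, dim, hidden=None):
        super().__init__()
        hidden = hidden or dim
        self.w1 = nn.Linear(2 * dim, hidden)   # real and imaginary parts stacked
        self.w2 = nn.Linear(hidden, 2 * dim)
        self.act = nn.GELU()

    def forward(self, x):                       # x: (B, N, C) token sequence
        n = x.size(1)
        xf = torch.fft.rfft(x, dim=1)           # (B, N//2+1, C), complex
        z = torch.cat([xf.real, xf.imag], dim=-1)
        z = self.w2(self.act(self.w1(z)))       # channel mixing per frequency
        re, im = z.chunk(2, dim=-1)
        xf = torch.complex(re, im)
        return torch.fft.irfft(xf, n=n, dim=1)  # back to token space
```

Because mixing happens per frequency after an FFT, the cost scales as O(N log N) in the number of tokens rather than the O(N²) of self-attention, which is where the memory and compute savings come from.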
AI Neutral · arXiv – CS AI · Mar 26/1012
Researchers introduce Ref-Adv, a new benchmark for testing multimodal large language models' visual reasoning capabilities in referring expression tasks. The benchmark reveals that current MLLMs, despite performing well on standard datasets like RefCOCO, rely heavily on shortcuts and show significant gaps in genuine visual reasoning and grounding abilities.
AI Bullish · arXiv – CS AI · Mar 26/1018
Researchers introduce TTE-v2, a new multimodal retrieval framework that achieves state-of-the-art performance by incorporating reasoning steps during retrieval and reranking. The approach demonstrates that scaling based on reasoning tokens rather than model size can significantly improve performance, with TTE-v2-7B reaching 75.7% accuracy on the MMEB-V2 benchmark.
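A sketch of what "incorporating reasoning steps during reranking" can look like in practice; the prompt format, score convention, and `llm` completion hook are our assumptions, not TTE-v2's implementation:

```python
def rerank_with_reasoning(llm, query, candidates, n_reason_tokens=256):
    # Spend a reasoning-token budget per candidate before scoring it,
    # rather than relying on a larger model.
    scored = []
    for cand in candidates:
        prompt = (f"Query: {query}\nCandidate: {cand}\n"
                  "Reason step by step about the match, then end with "
                  "'Score: <0-10>'.")
        out = llm(prompt, max_tokens=n_reason_tokens)   # hypothetical hook
        score = float(out.rsplit("Score:", 1)[-1].strip().split()[0])
        scored.append((score, cand))
    return [c for _, c in sorted(scored, reverse=True)]
```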
AI Bullish · arXiv – CS AI · Mar 27/1022
Researchers introduce DataMind, a new training framework for building open-source data-analytic AI agents that can handle complex, multi-step data analysis tasks. The DataMind-14B model achieves state-of-the-art performance with a 71.16% average score, outperforming proprietary models like DeepSeek-V3.1 and GPT-5 on data analysis benchmarks.
AI Neutral · arXiv – CS AI · Mar 26/1017
Researchers conducted a systematic benchmark study on multimodal fusion between Electronic Health Records (EHR) and chest X-rays for clinical decision support, revealing when and how combining data modalities improves healthcare AI performance. Fusion helps when data is complete, but the benefit degrades under realistic missing-data scenarios; the authors also released an open-source benchmarking toolkit for reproducible evaluation.
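A minimal late-fusion baseline of the kind such a benchmark compares, with an explicit presence bit so the classifier head can adapt when the chest X-ray is missing; this is our construction, not the released toolkit's model:

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Separate encoders per modality, concatenated with a presence mask
    so the head can learn to cope with a missing X-ray (sketch)."""
    def __init__(self, ehr_dim, img_dim, hidden=128, n_classes=2):
        super().__init__()
        self.ehr_enc = nn.Sequential(nn.Linear(ehr_dim, hidden), nn.ReLU())
        self.img_enc = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden + 1, n_classes)  # +1 for the mask bit

    def forward(self, ehr, img, img_present):
        # img_present: (B, 1) float; 0 zeroes out a missing X-ray embedding.
        h_ehr = self.ehr_enc(ehr)
        h_img = self.img_enc(img) * img_present
        return self.head(torch.cat([h_ehr, h_img, img_present], dim=-1))
```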
AI Neutral · arXiv – CS AI · Mar 27/1020
Researchers have developed LemmaBench, a new benchmark for evaluating Large Language Models on research-level mathematics by automatically extracting and rewriting lemmas from arXiv papers. Current state-of-the-art LLMs achieve only 10-15% accuracy on these mathematical theorem proving tasks, revealing a significant gap between AI capabilities and human-level mathematical research.
AI Neutral · arXiv – CS AI · Mar 27/1020
Researchers have released HumanMCP, the first large-scale dataset designed to evaluate tool retrieval performance in Model Context Protocol (MCP) servers. The dataset addresses a critical gap by providing realistic, human-like queries paired with 2,800 tools across 308 MCP servers, improving upon existing benchmarks that lack authentic user interaction patterns.
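Tool retrieval over MCP servers is typically scored by embedding similarity plus a metric such as recall@k; a generic sketch of that evaluation (array shapes and the metric choice are our assumptions about how a dataset like HumanMCP would be used):

```python
import numpy as np

def recall_at_k(query_emb, tool_emb, gold_ids, k=5):
    # query_emb: (Q, D) query embeddings, tool_emb: (T, D) tool embeddings,
    # gold_ids[i]: index of the correct tool for query i.
    sims = query_emb @ tool_emb.T                # cosine sims if pre-normalized
    topk = np.argsort(-sims, axis=1)[:, :k]      # k highest-scoring tools per query
    hits = [gold in row for gold, row in zip(gold_ids, topk)]
    return float(np.mean(hits))
```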
AI Bullish · arXiv – CS AI · Mar 26/1014
Researchers introduce MMKG-RDS, a framework that uses multimodal knowledge graphs to synthesize high-quality training data for improving AI model reasoning abilities. Testing on Qwen3 models showed a 9.2% improvement in reasoning accuracy, with applications for constructing complex benchmarks involving tables and formulas.
AI Bullish · arXiv – CS AI · Mar 26/1018
Researchers developed RD-MLDG, a new framework that uses multimodal large language models with reasoning chains to improve domain generalization in deep learning. The approach addresses challenges in cross-domain visual recognition by leveraging reasoning capabilities rather than just visual feature invariance, achieving state-of-the-art performance on standard benchmarks.