AI · Neutral · arXiv – CS AI · 4h ago · 7
🧠 Researchers have released HumanMCP, the first large-scale dataset designed to evaluate tool retrieval performance in Model Context Protocol (MCP) servers. The dataset addresses a critical gap by providing realistic, human-like queries paired with 2,800 tools across 308 MCP servers, improving upon existing benchmarks that lack authentic user interaction patterns.
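Tool retrieval benchmarks like this are typically scored with recall@k: did the gold tool appear in the top-k retrieved candidates? A minimal sketch of that metric, with a toy token-overlap retriever and made-up tools and queries (none of this is from the HumanMCP paper):

```python
# Hypothetical sketch of recall@k for tool retrieval; the retriever, tool
# names, and queries are illustrative stand-ins, not HumanMCP's actual setup.
def tokenize(text):
    return set(text.lower().split())

def retrieve(query, tools, k=3):
    # Rank tools by token overlap between the query and each tool description.
    scored = sorted(tools, key=lambda t: -len(tokenize(query) & tokenize(t["description"])))
    return [t["name"] for t in scored[:k]]

def recall_at_k(examples, tools, k=3):
    hits = sum(ex["gold_tool"] in retrieve(ex["query"], tools, k) for ex in examples)
    return hits / len(examples)

tools = [
    {"name": "weather.get", "description": "get current weather for a city"},
    {"name": "calendar.add", "description": "add an event to the calendar"},
    {"name": "files.search", "description": "search files by keyword"},
]
examples = [
    {"query": "what's the weather in Paris", "gold_tool": "weather.get"},
    {"query": "add a meeting to my calendar", "gold_tool": "calendar.add"},
]
print(recall_at_k(examples, tools, k=1))  # → 1.0
```

Realistic human-like queries matter precisely because simple lexical retrievers like this one degrade when users do not echo the tool description's vocabulary.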
AI · Bullish · arXiv – CS AI · 4h ago · 3
🧠 Researchers have developed SleepLM, a family of AI foundation models that combine natural language processing with sleep analysis using polysomnography data. The system can interpret and describe sleep patterns in natural language, trained on over 100K hours of sleep data from 10,000+ individuals, enabling new capabilities like language-guided sleep event detection and zero-shot generalization to novel sleep analysis tasks.
AI · Bullish · arXiv – CS AI · 4h ago · 3
🧠 Researchers propose a minimal baseline architecture for AI-based theorem proving that achieves competitive performance with state-of-the-art systems while using a significantly simpler design. The open-source implementation demonstrates that iterative proof refinement approaches are more sample-efficient and cost-effective than single-shot generation methods.
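The iterative-refinement idea can be sketched as a verify-then-refine loop. Everything here is a toy stand-in (the verifier, refiner, and "proof" are placeholders, not the paper's system); it only illustrates the control flow that single-shot generation lacks:

```python
# Toy sketch of iterative proof refinement: attempt, check, revise, repeat.
def verify(proof, goal):
    # Toy checker: a proof "succeeds" once it accumulates enough steps.
    return len(proof) >= goal

def refine(proof, feedback):
    # Toy refiner: append one more step per round of feedback.
    return proof + ["step"]

def iterative_prove(goal, max_rounds=10):
    proof = []
    for attempts in range(1, max_rounds + 1):
        if verify(proof, goal):
            return proof, attempts
        proof = refine(proof, feedback="missing steps")
    return None, max_rounds

proof, attempts = iterative_prove(goal=3)
print(len(proof), attempts)  # → 3 4
```

The sample-efficiency claim follows from this shape: each round reuses the partial proof instead of discarding it, so failed attempts still contribute progress.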
AI · Bullish · arXiv – CS AI · 4h ago · 3
🧠 Researchers from PKU-SEC-Lab have developed KEEP, a new memory management system that significantly improves the efficiency of AI-powered embodied planning by optimizing KV cache usage. The system achieves a 2.68x speedup compared to text-based memory methods while maintaining accuracy, addressing a key bottleneck in memory-augmented Large Language Models for complex planning tasks.
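The intuition behind KV-cache speedups can be shown with a call counter: without a cache, every decoding step re-encodes the whole growing prefix; with a cache, each token is encoded exactly once. This is a toy accounting sketch, not KEEP's actual memory manager:

```python
# Toy illustration of why KV caching helps: count "encode" calls with and
# without a cache over a 16-token prefix. Not KEEP's actual design.
class KVCache:
    def __init__(self):
        self.cache = []
        self.encode_calls = 0

    def encode(self, token):
        self.encode_calls += 1
        return ("kv", token)

    def step(self, token):
        self.cache.append(self.encode(token))  # only the new token is encoded
        return list(self.cache)

def no_cache_calls(n):
    # Without a cache, step t re-encodes all t prefix tokens: 1 + 2 + ... + n.
    return n * (n + 1) // 2

cache = KVCache()
for tok in range(16):
    cache.step(tok)
print(cache.encode_calls, no_cache_calls(16))  # → 16 136
```

The gap grows quadratically with plan length, which is why cache management becomes the bottleneck in long-horizon planning loops.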
AI · Bullish · arXiv – CS AI · 4h ago · 4
🧠 Researchers have developed MPU, a privacy-preserving framework that enables machine unlearning for large language models without requiring servers to share parameters or clients to share data. The framework uses perturbed model copies and harmonic denoising to achieve comparable performance to non-private methods, with most algorithms showing less than 1% performance degradation.
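The perturb-then-denoise idea can be sketched in miniature: release noisy copies of a parameter vector so no single copy reveals the true parameters, then combine the copies to recover an accurate estimate. Plain averaging below is a stand-in for the paper's harmonic denoising, and all numbers are made up:

```python
import random

# Rough sketch, assuming Gaussian perturbations: averaging many noisy copies
# of a parameter vector drives the reconstruction error toward zero.
random.seed(0)

params = [0.5, -1.2, 3.0]

def perturbed_copy(params, sigma=0.1):
    return [p + random.gauss(0, sigma) for p in params]

copies = [perturbed_copy(params) for _ in range(200)]
denoised = [sum(c[i] for c in copies) / len(copies) for i in range(len(params))]
error = max(abs(d - p) for d, p in zip(denoised, params))
print(error)
```

With 200 copies at sigma 0.1, the standard error of the mean is about 0.007, so the denoised estimate tracks the true parameters closely while any single copy stays noisy.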
AI · Neutral · arXiv – CS AI · 4h ago · 2
🧠 Researchers introduce RewardUQ, a unified framework for evaluating uncertainty quantification in reward models used to align large language models with human preferences. The study finds that model size and initialization have the most significant impact on performance, while providing an open-source Python package to advance the field.
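One common uncertainty signal for reward models is ensemble disagreement: score a response with several reward heads and treat their spread as uncertainty. A minimal sketch with made-up scores (RewardUQ's actual estimators may differ):

```python
import statistics

# Toy sketch: disagreement (population std. dev.) across an ensemble of
# reward heads as an uncertainty estimate. Scores below are illustrative.
def ensemble_uncertainty(scores):
    return statistics.pstdev(scores)

confident = [0.81, 0.79, 0.80, 0.82]   # heads agree -> low uncertainty
uncertain = [0.10, 0.90, 0.35, 0.70]   # heads disagree -> high uncertainty
print(ensemble_uncertainty(confident) < ensemble_uncertainty(uncertain))  # → True
```

A pipeline can then route high-disagreement responses to a human labeler instead of trusting the reward signal blindly.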
AI · Bullish · arXiv – CS AI · 4h ago · 5
🧠 Researchers developed Whisper-LLaDA, a diffusion-based large language model for automatic speech recognition that achieves a 12.3% relative improvement over baseline models. The study demonstrates that audio-conditioned embeddings are crucial to the accuracy gains, while refinement on plain text alone, without acoustic features, does not improve performance.
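"Relative improvement" in ASR is usually computed on word error rate (WER). The baseline WER below is a made-up example to show the arithmetic behind a 12.3% relative gain, not a number from the paper:

```python
# Relative WER improvement: (baseline - new) / baseline, as a percentage.
def relative_improvement(baseline_wer, new_wer):
    return (baseline_wer - new_wer) / baseline_wer * 100

baseline = 10.0                     # hypothetical baseline WER (%)
improved = baseline * (1 - 0.123)   # WER after a 12.3% relative reduction
print(round(relative_improvement(baseline, improved), 1))  # → 12.3
```

Note the distinction from an absolute improvement: a 12.3% relative reduction from a 10% WER lands at 8.77%, not at -2.3 points from any starting WER.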
AI · Bullish · arXiv – CS AI · 4h ago · 10
🧠 Researchers developed MobileLLM-R1, a sub-billion-parameter AI model that demonstrates strong reasoning capabilities using only 2T tokens of high-quality data instead of massive 10T+ token datasets. The 950M-parameter model outperforms larger competitors on reasoning benchmarks while using only 11.7% of the training data of models like Qwen3.
AI · Bullish · arXiv – CS AI · 4h ago · 5
🧠 Researchers introduce DataMind, a new training framework for building open-source data-analytic AI agents that can handle complex, multi-step data analysis tasks. The DataMind-14B model achieves state-of-the-art performance with a 71.16% average score, outperforming models such as DeepSeek-V3.1 and GPT-5 on data analysis benchmarks.
AI · Bullish · arXiv – CS AI · 4h ago · 6
🧠 Researchers developed LIA, a supervised fine-tuning approach built on DeepSeek-R1-Distill-Llama-8B that automatically assigns software issues to developers. The system improved developer-recommendation accuracy by up to 187.8% over the base model and 211.2% over existing methods.
AI · Bullish · arXiv – CS AI · 4h ago · 0
🧠 Researchers have developed SDMixer, a new AI framework for multivariate time series forecasting that uses dual-stream sparse processing to analyze data in both frequency and time domains. The method employs sparsity mechanisms to filter noise and improve cross-variable dependency modeling, achieving leading performance on real-world datasets in transportation, energy, and finance applications.
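The frequency-domain half of a dual-stream, sparsity-based design can be sketched with a top-k spectral filter: keep only the largest FFT coefficients of a series and zero the rest, which suppresses broadband noise while preserving dominant periodic structure. The top-k rule here is an illustrative sparsity mechanism, not SDMixer's exact architecture:

```python
import numpy as np

# Sketch: sparse frequency filtering of a noisy periodic series.
def sparse_frequency_filter(x, k=2):
    coeffs = np.fft.rfft(x)
    keep = np.argsort(np.abs(coeffs))[-k:]   # indices of the k largest coefficients
    mask = np.zeros_like(coeffs)
    mask[keep] = coeffs[keep]
    return np.fft.irfft(mask, n=len(x))

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 128, endpoint=False)
clean = np.sin(2 * np.pi * 3 * t)            # one dominant frequency
noisy = clean + 0.3 * rng.standard_normal(128)
denoised = sparse_frequency_filter(noisy, k=2)
print(np.abs(denoised - clean).mean() < np.abs(noisy - clean).mean())  # → True
```

In a full forecasting model, a parallel time-domain stream would capture local, aperiodic dynamics that a sparse spectrum misses, and the two streams would be mixed before the forecast head.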
AI · Neutral · arXiv – CS AI · 4h ago · 0
🧠 NuBench is a new open benchmark for deep learning-based event reconstruction in neutrino telescopes, comprising seven large-scale simulated datasets with nearly 130 million neutrino interactions. The benchmark enables comparison of machine learning reconstruction methods across different detector geometries and evaluates four algorithms, including ParticleNet and DynEdge, on core reconstruction tasks.