20,618 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.
AIBearisharXiv – CS AI · Apr 136/10
🧠Researchers introduce OmniBehavior, a benchmark for evaluating large language models' ability to simulate real-world human behavior across complex, long-horizon scenarios. The study reveals that current LLMs struggle with authentic behavioral simulation and exhibit systematic biases toward homogenized, overly-positive personas rather than capturing individual differences and realistic long-tail behaviors.
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers introduce Spatial-Gym, a benchmarking environment that evaluates AI models on spatial reasoning tasks through step-by-step pathfinding in 2D grids rather than one-shot generation. Testing eight models reveals a significant performance gap, with the best model achieving only 16% solve rate versus 98% for humans, exposing critical limitations in how AI systems scale reasoning effort and process spatial information.
AIBullisharXiv – CS AI · Apr 136/10
🧠Researchers introduce E3-TIR, a new training paradigm for Large Language Models that improves tool-use reasoning by combining expert guidance with self-exploration. The method achieves 6% performance gains while using less than 10% of typical synthetic data, addressing key limitations in current reinforcement learning approaches for AI agents.
AIBullisharXiv – CS AI · Apr 136/10
🧠Researchers propose improved divergence measures for training Generative Flow Networks (GFlowNets), comparing Renyi-α, Tsallis-α, and KL divergences to enhance statistical efficiency. The work introduces control variates that reduce gradient variance and achieve faster convergence than existing methods, bridging GFlowNets training with generalized variational inference frameworks.
AIBullisharXiv – CS AI · Apr 136/10
🧠Researchers introduce WAND, a framework that reduces computational and memory costs of autoregressive text-to-speech models by replacing full self-attention with windowed attention combined with knowledge distillation. The approach achieves up to 66.2% KV cache memory reduction while maintaining speech quality, addressing a critical scalability bottleneck in modern AR-TTS systems.
AIBullisharXiv – CS AI · Apr 136/10
🧠Researchers present PETITE, a tutor-student multi-agent framework that enhances LLM problem-solving by assigning complementary roles to agents from the same model. Evaluated on coding benchmarks, the approach achieves comparable or superior accuracy to existing methods while consuming significantly fewer tokens, demonstrating that structured role-differentiated interactions can improve LLM performance more efficiently than larger models or heterogeneous ensembles.
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers propose StaRPO, a reinforcement learning framework that improves large language model reasoning by incorporating stability metrics alongside task rewards. The method uses Autocorrelation Function and Path Efficiency measurements to evaluate logical coherence and goal-directedness, demonstrating improved accuracy and reasoning consistency across four benchmarks.
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers introduce SEA-Eval, a new benchmark for evaluating self-evolving AI agents that go beyond single-task execution by measuring how agents improve across sequential tasks and accumulate experience over time. The benchmark reveals significant inefficiencies in current state-of-the-art frameworks, exposing up to 31.2x differences in token consumption despite identical success rates, highlighting a critical bottleneck in agent development.
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers introduce Soft Silhouette Loss, a novel machine learning objective that improves deep neural network representations by enforcing intra-class compactness and inter-class separation. The lightweight differentiable loss outperforms cross-entropy and supervised contrastive learning when combined, achieving 39.08% top-1 accuracy compared to 37.85% for existing methods while reducing computational overhead.
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers introduce ASPECT, a novel reinforcement learning framework that uses large language models as semantic operators to enable zero-shot transfer learning across novel tasks. By conditioning a text-based VAE on LLM-generated task descriptions, the approach allows agents to reuse policies on structurally similar but previously unseen tasks without discrete category constraints.
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers present a novel approach using agentic language model feedback frameworks to generate planning domains from natural language descriptions augmented with symbolic information. The method employs heuristic search over model space optimized by various feedback mechanisms, including landmarks and plan validator outputs, to improve domain quality for practical deployment.
AIBearisharXiv – CS AI · Apr 136/10
🧠Researchers introduce GRM, a frequency-selective jailbreak framework that exploits vulnerabilities in audio large language models while maintaining utility preservation. By strategically perturbing specific frequency bands rather than entire spectrums, GRM achieves 88.46% jailbreak success rates with better trade-offs between attack effectiveness and transcription quality compared to existing methods.
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers have developed RandSymKL, a debiasing technique for Bangla language models that mitigates gender bias in classification tasks like sentiment analysis and hate speech detection. The study introduces four manually annotated benchmark datasets with gender-perturbation testing and demonstrates that the approach effectively reduces bias while maintaining competitive accuracy compared to existing methods.
AIBearisharXiv – CS AI · Apr 136/10
🧠Researchers found that large language models fail to accurately simulate human susceptibility to misinformation, consistently overstating how attitudes drive belief and sharing while ignoring social network effects. The study reveals systematic biases in how LLMs represent misinformation concepts, suggesting they are better tools for identifying where AI diverges from human judgment rather than replacing human survey responses.
AIBearisharXiv – CS AI · Apr 136/10
🧠Researchers conducted a large-scale computational analysis comparing 17,790 articles from Grokipedia, Elon Musk's AI-generated encyclopedia, against Wikipedia. The study found that Grokipedia articles are longer but contain fewer citations, with some entries showing systematic rightward political bias in media sources, particularly in history, religion, and arts sections.
🏢 xAI🧠 Grok
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers introduce Dejavu, a post-deployment learning framework that enables frozen Vision-Language-Action policies to improve through experience retrieval and feedback networks. The system allows embodied AI agents to continuously learn from past trajectories without retraining, improving task performance across diverse robotic applications.
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers introduce AV-SpeakerBench, a new 3,212-question benchmark designed to evaluate how well multimodal large language models understand audiovisual speech by correlating speakers with their dialogue and timing. Testing reveals Gemini 2.5 Pro significantly outperforms open-source competitors, with the gap primarily attributable to inferior audiovisual fusion capabilities rather than visual perception limitations.
🧠 Gemini
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers investigate how multimodal large language models (MLLMs) can assist with usability evaluation of user interfaces by analyzing text and visual context together. The study compares MLLM-generated assessments against expert evaluations, finding that these models can effectively prioritize usability issues by severity and offer complementary insights to traditional resource-intensive evaluation methods.
AIBearisharXiv – CS AI · Apr 136/10
🧠Researchers demonstrate a white-box adversarial attack on computer vision models using SHAP values to identify and exploit critical input features, showing superior robustness compared to the Fast Gradient Sign Method, particularly when gradient information is obscured or hidden.
AIBullisharXiv – CS AI · Apr 136/10
🧠Researchers propose AR-KAN, a neural network combining autoregressive models with Kolmogorov-Arnold Networks for improved time series forecasting. The model addresses limitations of traditional deep learning approaches by integrating temporal memory preservation with nonlinear function approximation, demonstrating superior performance on both synthetic and real-world datasets.
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers provide the first rigorous theoretical analysis of OPTQ (GPTQ), a widely-used post-training quantization algorithm for neural networks and LLMs, establishing quantitative error bounds and validating practical design choices. The study extends theoretical guarantees to both deterministic and stochastic variants of OPTQ and the Qronos algorithm, offering guidance for regularization parameter selection and quantization alphabet sizing.
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers introduce CoA-LoRA, a method that dynamically adapts LoRA fine-tuning to different quantization configurations without requiring separate retraining for each setting. The approach uses a configuration-aware model and Pareto-based search to optimize low-rank adjustments across heterogeneous edge devices, achieving comparable performance to traditional methods with zero additional computational cost.
AIBullisharXiv – CS AI · Apr 136/10
🧠Researchers propose Editing Anchor Compression (EAC), a framework that addresses degradation of large language models' general abilities during sequential knowledge editing. By constraining parameter matrix deviations through selective anchor compression, EAC preserves over 70% of model performance while maintaining edited knowledge, advancing the practical viability of model editing as an alternative to expensive retraining.
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers introduce ReplicatorBench, a comprehensive benchmark for evaluating AI agents' ability to replicate scientific research claims in social and behavioral sciences. The study reveals that current LLM agents excel at designing and executing experiments but struggle significantly with data retrieval, highlighting critical gaps in autonomous research validation capabilities.
AINeutralarXiv – CS AI · Apr 136/10
🧠OmniPrism introduces a new visual concept disentanglement approach for AI image generation that separates multiple visual aspects (content, style, composition) to enable more controlled and creative outputs. The method uses a contrastive training pipeline and a new 200K paired dataset to train diffusion models that can incorporate disentangled concepts while maintaining fidelity to text prompts.