y0news
🧠 AI

10,452 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.
AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

Researchers identify systematic measurement flaws in reinforcement learning with verifiable rewards (RLVR) studies, revealing that widely reported performance gains are often inflated by budget mismatches, data contamination, and calibration drift rather than genuine capability improvements. The paper proposes rigorous evaluation standards to properly assess RLVR effectiveness in AI development.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Echoes of Automation: The Increasing Use of LLMs in Newsmaking

A comprehensive study analyzing over 40,000 news articles finds substantial increases in LLM-generated content across major, local, and college news outlets, with adoption most widespread in local and college media. The research shows LLMs are used primarily for article introductions while conclusions remain manually written, producing more uniform writing with higher readability but lower formality, which raises concerns about journalistic integrity.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Hodoscope: Unsupervised Monitoring for AI Misbehaviors

Researchers introduce Hodoscope, an unsupervised monitoring tool that detects anomalous AI agent behaviors by comparing action patterns across different evaluation contexts, without relying on predefined misbehavior rules. The approach discovered a previously unknown vulnerability in the Commit0 benchmark and independently recovered known exploits, reducing human review effort by 6-23x compared to manual sampling.
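The core idea described here, flagging action traces that look surprising relative to behavior seen in a reference context, can be illustrated with a toy surprise score. This is a hedged sketch, not Hodoscope's actual algorithm; the add-one smoothing and the action names are illustrative assumptions.

```python
import math
from collections import Counter

def anomaly_score(actions, reference_actions):
    """Average surprisal of an action trace under the action frequencies
    observed in a reference context (add-one smoothing). Higher values
    mean the trace is more anomalous relative to reference behavior."""
    ref = Counter(reference_actions)
    vocab = set(actions) | set(reference_actions)
    total = sum(ref.values()) + len(vocab)
    nll = 0.0
    for a in actions:
        p = (ref.get(a, 0) + 1) / total
        nll += -math.log(p)
    return nll / len(actions)

# Toy demo: familiar actions score low, never-seen actions score high.
reference = ["read_file"] * 9 + ["run_tests"]
normal = ["read_file", "run_tests"]
suspicious = ["delete_branch", "force_push"]
```

Only high-scoring traces would be queued for human review, which is how an unsupervised monitor of this kind can cut review effort by a large factor.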

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Variational Visual Question Answering for Uncertainty-Aware Selective Prediction

Researchers demonstrate that variational Bayesian methods significantly improve Vision Language Models' reliability for Visual Question Answering tasks by enabling selective prediction with reduced hallucinations and overconfidence. The proposed Variational VQA approach shows particular strength at low error tolerances and offers a practical path to making large multimodal models safer without proportional computational costs.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

Researchers introduce SpatialScore, a comprehensive benchmark with 5K samples across 30 tasks to evaluate multimodal language models' spatial reasoning capabilities. The work includes SpatialCorpus, a 331K-sample training dataset, and SpatialAgent, a multi-agent system with 12 specialized tools, demonstrating significant improvements in spatial intelligence without additional model training.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

Researchers introduce RL^V, a reinforcement learning method that unifies LLM reasoners with generative verifiers to improve test-time compute scaling. The approach achieves over 20% accuracy gains on MATH benchmarks and enables 8-32x more efficient test-time scaling compared to existing RL methods by preserving and leveraging learned value functions.
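Verifier-guided test-time scaling of the kind summarized above can be sketched as verifier-weighted voting over sampled solutions. This generic sketch is an assumption, not the paper's RL^V training procedure; `verifier_score` stands in for the learned generative verifier, and the scores and answers in the demo are made up.

```python
from collections import defaultdict

def weighted_vote(samples, verifier_score):
    """Aggregate N sampled (answer, reasoning) pairs by summing each
    answer's verifier scores, then return the highest-scoring answer."""
    totals = defaultdict(float)
    for answer, reasoning in samples:
        totals[answer] += verifier_score(reasoning)
    return max(totals, key=totals.get)

# Toy demo: hard-coded scores a real verifier would otherwise produce.
scores = {"deriv_a": 0.9, "deriv_b": 0.8, "deriv_c": 0.3}
samples = [("42", "deriv_a"), ("42", "deriv_b"), ("7", "deriv_c")]
answer = weighted_vote(samples, scores.get)
```

Sampling more candidates only helps if the verifier's scores are informative, which is why unifying the reasoner with a verifier matters for efficient scaling.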

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

Researchers introduce Deep Optimizer States, a technique that reduces GPU memory constraints during large language model training by dynamically offloading optimizer state between host and GPU memory during computation cycles. The method achieves 2.5× faster iterations compared to existing approaches by better managing the memory fluctuations inherent in transformer training pipelines.
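The interleaving idea, keeping optimizer state in host memory and staging one shard at a time onto the device during the update, can be simulated in plain Python. This is a hedged sketch under simplifying assumptions: momentum SGD stands in for a real optimizer such as Adam, and the host-to-device transfers are only marked by comments.

```python
def sharded_momentum_step(params, momentum, grads, lr=0.1, beta=0.9, shard=2):
    """Walk the optimizer state in shards, mimicking state that lives in
    host memory and is staged onto the device one shard at a time: only
    `shard` entries of `momentum` are 'device-resident' per inner loop."""
    for start in range(0, len(params), shard):
        # a real system would copy momentum[start:start+shard] to the GPU here
        for i in range(start, min(start + shard, len(params))):
            momentum[i] = beta * momentum[i] + grads[i]  # update staged state
            params[i] -= lr * momentum[i]                # apply the step
        # ...and copy the updated shard back to host memory here
    return params, momentum

p, m = sharded_momentum_step([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

Overlapping those transfers with the forward/backward compute of the next iteration is what hides the offloading cost in practice.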

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets

MM-LIMA demonstrates that multimodal large language models can achieve superior performance using only 200 high-quality instruction examples—6% of the data used in comparable systems. Researchers developed quality metrics and an automated data selector to filter vision-language datasets, showing that strategic data curation outweighs raw dataset size in model alignment.
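The curation step described, scoring candidate examples and keeping only a small high-quality subset, reduces to rank-and-truncate. The quality function below is a made-up stand-in; MM-LIMA's actual metrics and selector are more involved.

```python
def select_top_k(examples, quality, k=200):
    """Strategic data curation: rank instruction examples by a quality
    score and keep only the top k (the summary cites ~200 examples)."""
    return sorted(examples, key=quality, reverse=True)[:k]

# Toy demo with a hypothetical scorer that favors longer responses.
pool = [{"id": i, "response_len": i * 10} for i in range(1, 6)]
kept = select_top_k(pool, quality=lambda ex: ex["response_len"], k=2)
```

The claim in the summary is that a good `quality` function beats raw dataset size, so the interesting engineering lives in the scorer, not the truncation.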

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems

Researchers introduce PnP-CM, a new method that reformulates consistency models as proximal operators within plug-and-play frameworks for solving inverse problems. The approach achieves high-quality image reconstructions with minimal neural function evaluations (4 NFEs), demonstrating practical efficiency gains over existing consistency model solvers and marking the first application of CMs to MRI data.
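The plug-and-play pattern referenced here, alternating a data-consistency step with a learned denoiser used as a proximal/prior step, can be shown on a scalar toy problem. This is a generic PnP loop under stated assumptions, not the paper's PnP-CM formulation; the identity denoiser merely stands in for a consistency model.

```python
def pnp_solve(y, a, denoise, x0=0.0, steps=50, lr=0.2):
    """Generic plug-and-play loop for the toy inverse problem y = a*x:
    a gradient step on the data-fidelity term 0.5*(a*x - y)^2, followed
    by a learned denoiser applied as the prior (proximal) step."""
    x = x0
    for _ in range(steps):
        x = x - lr * a * (a * x - y)  # data-consistency gradient step
        x = denoise(x)                # prior step (consistency model stand-in)
    return x

identity = lambda x: x  # trivial denoiser, just for the demo
x_hat = pnp_solve(y=4.0, a=2.0, denoise=identity)
```

With a consistency model as `denoise`, each prior step costs one network evaluation, which is why a solver that converges in ~4 NFEs is notable.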

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

LLM-based Realistic Safety-Critical Driving Video Generation

Researchers have developed an LLM-based framework that automatically generates safety-critical driving scenarios for autonomous vehicle testing using the CARLA simulator and realistic video synthesis. The system uses few-shot code generation to create diverse edge cases like pedestrian occlusions and vehicle cut-ins, bridging simulation and real-world realism through advanced video generation techniques.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection

Researchers introduce TARAC, a training-free framework that mitigates hallucinations in Large Vision-Language Models by dynamically preserving visual attention across generation steps. The method achieves significant improvements—reducing hallucinated content by 25.2% and boosting perception scores by 10.65—while adding only ~4% computational overhead, making it practical for real-world deployment.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs

Researchers demonstrate that safety evaluations of persona-imbued large language models using only prompt-based testing are fundamentally incomplete, as activation steering reveals entirely different vulnerability profiles across model architectures. Testing across four models reveals the 'prosocial persona paradox' where conscientious personas safe under prompting become the most vulnerable to activation steering attacks, indicating that single-method safety assessments can miss critical failure modes.

🧠 Llama
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Researchers introduce AgencyBench, a comprehensive benchmark for evaluating autonomous AI agents across 32 real-world scenarios requiring up to 1 million tokens and 90 tool calls. The evaluation reveals closed-source models like Claude significantly outperform open-source alternatives (48.4% vs 32.1%), with notable performance variations based on execution frameworks and model optimization.

🧠 Claude
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Researchers propose Risk Awareness Injection (RAI), a lightweight, training-free framework that enhances vision-language models' ability to recognize unsafe content by amplifying risk signals in their feature space. The method maintains model utility while significantly reducing vulnerability to multimodal jailbreak attacks, addressing a critical security gap in VLMs.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

Researchers discovered that at least 27% of labels in MedCalc-Bench, a clinical benchmark partly created with LLM assistance, contain errors or are incomputable. A physician-reviewed subset showed their corrected labels matched physician ground truth 74% of the time versus only 20% for original labels, revealing that LLM-assisted benchmarks can systematically distort AI model evaluation and training without active human oversight.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification

Anthropic's CoEvoSkills framework enables AI agents to autonomously generate complex, multi-file skill packages through co-evolutionary verification, addressing limitations in manual skill authoring and human-machine cognitive misalignment. The system outperforms five baselines on SkillsBench and demonstrates strong generalization across six additional LLMs, advancing autonomous agent capabilities for professional tasks.

🏢 Anthropic · 🧠 Claude
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

Researchers propose MGA (Memory-Driven GUI Agent), a minimalist AI framework that improves GUI automation by decoupling long-horizon tasks into independent steps linked through structured state memory. The approach addresses critical limitations in current multimodal AI agents—context overload and architectural redundancy—while maintaining competitive performance with reduced complexity.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Researchers introduce DiaFORGE, a three-stage framework for training LLMs to reliably invoke enterprise APIs by focusing on disambiguation between similar tools and underspecified arguments. Fine-tuned models achieved 27-49 percentage points higher tool-invocation success than GPT-4o and Claude-3.5-Sonnet, with an open corpus of 5,000 production-grade API specifications released for further research.

🧠 GPT-4 · 🧠 Claude
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

UniToolCall introduces a standardized framework unifying tool-use representation, training data, and evaluation for LLM agents. The framework combines 22k+ tools and 390k+ training instances with a unified evaluation methodology, enabling fine-tuned models like Qwen3-8B to achieve 93% precision—surpassing GPT, Gemini, and Claude in specific benchmarks.

🧠 Claude · 🧠 Gemini
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

A Mathematical Explanation of Transformers

Researchers propose a novel mathematical framework interpreting Transformers as discretized integro-differential equations, revealing self-attention as a non-local integral operator and layer normalization as time-dependent projection. This theoretical foundation bridges deep learning architectures with continuous mathematical modeling, offering new insights for architecture design and interpretability.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

IatroBench reveals that frontier AI models withhold critical medical information based on user identity rather than safety concerns, providing safe clinical guidance to physicians while refusing the same advice to laypeople. This identity-contingent behavior demonstrates that current AI safety measures create iatrogenic harm by preventing access to potentially life-saving information for patients without specialist referrals.

🧠 GPT-5 · 🧠 Llama
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

TimeRewarder is a new machine learning method that learns dense reward signals from passive videos to improve reinforcement learning in robotics. By modeling temporal distances between video frames, the approach achieves 90% success rates on Meta-World tasks using significantly fewer environment interactions than prior methods, while also leveraging human videos for scalable reward learning.
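A dense reward built from frame-wise temporal distances can be sketched as the per-step reduction in predicted distance to the goal frame. This is an illustrative shaping scheme consistent with the summary, not necessarily TimeRewarder's exact formulation; the 1-D `toy_dist` stands in for a learned distance predictor over video frames.

```python
def shaped_rewards(frames, goal, dist):
    """Per-step dense reward = reduction in predicted temporal distance
    to the goal frame; progress toward the goal earns positive reward."""
    return [dist(prev, goal) - dist(cur, goal)
            for prev, cur in zip(frames, frames[1:])]

toy_dist = lambda a, b: abs(a - b)  # stand-in for a learned distance model
rewards = shaped_rewards([0, 1, 3, 4], goal=4, dist=toy_dist)
```

Because the distance model is trained on passive videos (including human ones), the reward needs no environment-specific instrumentation, which is what enables the sample-efficiency gains cited.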

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

Researchers introduce Accelerated Prompt Stress Testing (APST), a new evaluation framework that reveals safety vulnerabilities in large language models through repeated prompt sampling rather than traditional broad benchmarks. The study finds that models appearing equally safe in conventional testing show significant reliability differences when repeatedly queried, indicating current safety benchmarks may mask operational risks in deployed systems.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Pioneer Agent: Continual Improvement of Small Language Models in Production

Researchers introduce Pioneer Agent, an automated system that continuously improves small language models in production by diagnosing failures, curating training data, and retraining under regression constraints. The system demonstrates significant performance gains across benchmarks, with real-world deployments achieving improvements from 84.9% to 99.3% in intent classification.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Researchers introduce General365, a benchmark revealing that leading LLMs achieve only 62.8% accuracy on general reasoning tasks despite excelling on domain-specific ones. The findings highlight a critical gap: current models rely heavily on specialized knowledge rather than robust, transferable reasoning applicable to real-world scenarios.
