AI

12,737 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

Empirical Characterization of Rationale Stability Under Controlled Perturbations for Explainable Pattern Recognition

Researchers propose a new metric to assess consistency of AI model explanations across similar inputs, implementing it on BERT models for sentiment analysis. The framework uses cosine similarity of SHAP values to detect inconsistent reasoning patterns and biased feature reliance, providing more robust evaluation of model behavior.
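
As a rough illustration of the idea in this summary, the sketch below scores rationale stability as the mean cosine similarity between the attribution vector for an input and the attribution vectors for perturbed variants of it. The attribution values are made-up numbers standing in for SHAP outputs, and `rationale_stability` is a hypothetical helper name; the paper's exact metric definition is not given in the summary.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero attribution counts as dissimilar
    return dot / (norm_a * norm_b)

def rationale_stability(base_attr, perturbed_attrs):
    """Mean cosine similarity between a base attribution and its perturbed variants."""
    if not perturbed_attrs:
        raise ValueError("need at least one perturbed attribution")
    sims = [cosine_similarity(base_attr, p) for p in perturbed_attrs]
    return sum(sims) / len(sims)

# Hypothetical per-token attributions (stand-ins for SHAP values).
base = [0.6, 0.1, -0.3, 0.0]
perturbed = [
    [0.6, 0.1, -0.3, 0.0],   # same rationale as the base input
    [-0.6, -0.1, 0.3, 0.0],  # rationale flipped in sign
]
score = rationale_stability(base, perturbed)
```

A score near 1 would suggest the model relies on the same features across similar inputs, while a low or negative score would flag inconsistent reasoning worth inspecting.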

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis

Researchers have developed a new automated pipeline that generates challenging math problems by first identifying specific mathematical concepts where LLMs struggle, then creating targeted problems to test these weaknesses. The method successfully reduced a leading LLM's accuracy from 77% to 45%, demonstrating its effectiveness at creating more rigorous benchmarks.

🧠 Llama

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

Position: Science of AI Evaluation Requires Item-level Benchmark Data

Researchers argue that current AI evaluation methods have systemic validity failures and propose item-level benchmark data as essential for rigorous AI evaluation. They introduce OpenEval, a repository of item-level benchmark data to support evidence-centered AI evaluation and enable fine-grained diagnostic analysis.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

Toward Full Autonomous Laboratory Instrumentation Control with Large Language Models

Researchers demonstrate how large language models like ChatGPT can automate laboratory instrument control, reducing programming barriers for scientists. The study shows LLMs can create custom scripts and operate as autonomous AI agents for lab equipment management.

🧠 ChatGPT

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems

Researchers have released SuperLocalMemory V3.3, an open-source AI agent memory system that operates entirely locally without cloud LLMs, implementing biologically-inspired forgetting mechanisms and multi-channel retrieval. The system scores 70.4% on the LoCoMo benchmark while running on CPU only, addressing the paradox of AI agents having vast knowledge but poor conversational memory.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

Memory Intelligence Agent

Researchers have developed Memory Intelligence Agent (MIA), a new AI framework that improves deep research agents through a Manager-Planner-Executor architecture with advanced memory systems. The framework enables continuous learning during inference and demonstrates superior performance across eleven benchmarks through enhanced cooperation between parametric and non-parametric memory systems.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

Optimizing Service Operations via LLM-Powered Multi-Agent Simulation

Researchers introduce an LLM-powered multi-agent simulation framework for optimizing service operations by modeling human behavior through AI agents. The method uses prompts to embed design choices and extracts outcomes from LLM responses to create a controlled Markov chain model, showing superior performance in supply chain and contest design applications.

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

Implementing surrogate goals for safer bargaining in LLM-based agents

Researchers developed methods to implement 'surrogate goals' in LLM-based agents to reduce bargaining risks by deflecting threats away from what principals care about. The study tested four approaches spanning prompting, fine-tuning, and scaffolding, and found that the scaffolding and fine-tuning methods outperformed simple prompting at producing the desired threat-response behaviors.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

REAM: Merging Improves Pruning of Experts in LLMs

Researchers propose REAM (Router-weighted Expert Activation Merging), a new method for compressing large language models that groups and merges expert weights instead of pruning them. The technique preserves model performance better than existing pruning methods while reducing memory requirements for deployment.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus

Research reveals that multi-agent LLM committees suffer from 'representational collapse' where agents produce highly similar outputs despite different role prompts, with mean cosine similarity of 0.888. A new diversity-aware consensus protocol (DALC) improves accuracy to 87% while reducing token costs by 26% compared to traditional self-consistency methods.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

Context Engineering: A Practitioner Methodology for Structured Human-AI Collaboration

Researchers introduce Context Engineering, a structured methodology for improving AI output quality through better context assembly rather than just prompting techniques. The study of 200 AI interactions showed that structured context reduced iteration cycles from 3.8 to 2.0 and improved first-pass acceptance rates from 32% to 55%.

🧠 ChatGPT 🧠 Claude

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

InferenceEvolve: Towards Automated Causal Effect Estimators through Self-Evolving AI

Researchers introduce InferenceEvolve, an AI framework using large language models to automatically discover and refine causal inference methods. The system outperformed 58 human submissions in a recent competition and demonstrates how AI can optimize complex scientific programs through evolutionary approaches.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

Schema-Aware Planning and Hybrid Knowledge Toolset for Reliable Knowledge Graph Triple Verification

Researchers have developed SHARP, a new AI agent that significantly improves knowledge graph verification by combining internal structural data with external evidence. The system achieved 4.2% and 12.9% accuracy improvements over existing methods on major datasets, offering better interpretability for complex fact verification tasks.

AI · Bearish · arXiv – CS AI · Apr 7 · 6/10

Don't Blink: Evidence Collapse during Multimodal Reasoning

Research reveals that Vision Language Models (VLMs) progressively lose visual grounding during reasoning tasks, creating dangerous low-entropy predictions that appear confident but lack visual evidence. The study found attention to visual evidence drops by over 50% during reasoning across multiple benchmarks, requiring task-aware monitoring for safe AI deployment.

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

TimeSeek: Temporal Reliability of Agentic Forecasters

TimeSeek introduces a benchmark showing that AI language models perform best at predicting binary market outcomes early in a market's lifecycle and on high-uncertainty markets, but struggle near resolution and on consensus markets. Web search generally improves forecasting accuracy across models, though not uniformly, while simple ensembles reduce errors without beating market performance overall.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

VERT: Reliable LLM Judges for Radiology Report Evaluation

Researchers introduced VERT, a new LLM-based metric for evaluating radiology reports that shows up to 11.7% better correlation with radiologist judgments compared to existing methods. The study demonstrates that fine-tuned smaller models can achieve significant performance gains while running inference up to 37.2 times faster.

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

Pedagogical Safety in Educational Reinforcement Learning: Formalizing and Detecting Reward Hacking in AI Tutoring Systems

Researchers developed a four-layer pedagogical safety framework for AI tutoring systems and introduced the Reward Hacking Severity Index (RHSI) to measure misalignment between proxy rewards and genuine learning. Their study of 18,000 simulated interactions found that engagement-optimized AI agents systematically selected high-engagement actions with no learning benefits, requiring constrained architectures to reduce reward hacking.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

Decocted Experience Improves Test-Time Inference in LLM Agents

Researchers present a new approach to improve Large Language Model performance without updating model parameters by using 'decocted experience' - extracting and organizing key insights from previous interactions to guide better reasoning. The method shows effectiveness across reasoning tasks including math, web browsing, and software engineering by constructing better contextual inputs rather than simply scaling computational resources.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge

Researchers developed DualJudge, a new framework for evaluating large language models that combines structured Fuzzy Analytic Hierarchy Process (FAHP) with traditional direct scoring methods. The approach addresses inconsistent LLM evaluation by incorporating uncertainty-aware reasoning and achieved state-of-the-art performance on JudgeBench testing.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models

Researchers developed a new method to reduce hallucinations in Large Vision-Language Models (LVLMs) by identifying a three-phase attention structure in vision processing and selectively suppressing low-attention tokens during the focus phase. The training-free approach significantly reduces object hallucinations while maintaining caption quality with minimal inference latency impact.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

Generative AI for material design: A mechanics perspective from burgers to matter

Researchers demonstrate that generative AI and computational mechanics share fundamental principles by using diffusion models to design burger recipes and materials. The study trained models on 2,260 recipes to generate new combinations, with three AI-designed burgers outperforming McDonald's Big Mac in taste tests with 100 participants.

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

Selective Forgetting for Large Reasoning Models

Researchers propose a new framework for 'selective forgetting' in Large Reasoning Models (LRMs) that can remove sensitive information from AI training data while preserving general reasoning capabilities. The method uses retrieval-augmented generation to identify and replace problematic reasoning segments with benign placeholders, addressing privacy and copyright concerns in AI systems.

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

When Adaptive Rewards Hurt: Causal Probing and the Switching-Stability Dilemma in LLM-Guided LEO Satellite Scheduling

Research reveals that adaptive reward mechanisms in AI-guided satellite scheduling systems actually hurt performance, with static reward weights achieving 342.1 Mbps versus dynamic weights at only 103.3 Mbps. The study found that fine-tuned LLMs performed poorly due to weight oscillation issues, while simpler MLP models achieved superior results of 357.9 Mbps.

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory

Researchers propose Rashomon Memory, a new AI agent memory architecture where multiple goal-conditioned agents maintain parallel interpretations of the same events and negotiate through argumentation at query time. The system allows AI agents to handle conflicting perspectives on experiences rather than forcing a single interpretation, using Dung's argumentation semantics to determine which proposals survive retrieval.
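
For readers unfamiliar with Dung's argumentation semantics, the sketch below computes the grounded extension of an abstract argumentation framework, which is one of the standard Dung semantics. The summary does not say which semantics the paper uses, so this choice is an assumption, and the proposal names in the example are hypothetical.

```python
def grounded_extension(arguments, attacks):
    """Return the grounded extension of an abstract argumentation framework.

    arguments: iterable of hashable argument labels
    attacks:   iterable of (attacker, target) pairs over those labels
    """
    arguments = set(arguments)
    attackers = {a: set() for a in arguments}
    for src, dst in attacks:
        attackers[dst].add(src)

    accepted = set()
    while True:
        # An argument is defended when every one of its attackers is itself
        # attacked by some already accepted argument. Unattacked arguments
        # are defended trivially.
        defended = {
            a for a in arguments
            if all(attackers[b] & accepted for b in attackers[a])
        }
        if defended == accepted:
            return accepted
        accepted = defended

# Hypothetical retrieval proposals from three goal-conditioned agents:
# "safety" attacks "speed", and "speed" attacks "cost".
proposals = {"safety", "speed", "cost"}
attacks = {("safety", "speed"), ("speed", "cost")}
surviving = grounded_extension(proposals, attacks)
```

Here `surviving` contains `safety` and `cost`: `safety` is unattacked, and it defends `cost` by defeating `speed`. When two proposals attack each other and nothing else intervenes, the grounded extension accepts neither.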

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

PRAISE: Prefix-Based Rollout Reuse in Agentic Search Training

Researchers introduce PRAISE, a new framework that improves training efficiency for AI agents performing complex search tasks like multi-hop question answering. The method addresses key limitations in current reinforcement learning approaches by reusing partial search trajectories and providing intermediate rewards rather than only final answer feedback.
