🧠

AI

12,717 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.

12717 articles

AIBullisharXiv – CS AI · Apr 136/10

🧠

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Researchers introduce Sequence-Level PPO (SPPO), a new algorithm that improves how large language models are trained for reasoning tasks by addressing stability and computational efficiency issues in standard reinforcement learning approaches. SPPO matches the performance of resource-heavy methods while significantly reducing memory and computational costs, potentially accelerating LLM alignment for complex problem-solving.

AINeutralarXiv – CS AI · Apr 136/10

🧠

StaRPO: Stability-Augmented Reinforcement Policy Optimization

Researchers propose StaRPO, a reinforcement learning framework that improves large language model reasoning by incorporating stability metrics alongside task rewards. The method uses Autocorrelation Function and Path Efficiency measurements to evaluate logical coherence and goal-directedness, demonstrating improved accuracy and reasoning consistency across four benchmarks.

AIBullisharXiv – CS AI · Apr 136/10

🧠

Enhancing LLM Problem Solving via Tutor-Student Multi-Agent Interaction

Researchers present PETITE, a tutor-student multi-agent framework that enhances LLM problem-solving by assigning complementary roles to agents from the same model. Evaluated on coding benchmarks, the approach achieves comparable or superior accuracy to existing methods while consuming significantly fewer tokens, demonstrating that structured role-differentiated interactions can improve LLM performance more efficiently than larger models or heterogeneous ensembles.

AINeutralarXiv – CS AI · Apr 136/10

🧠

SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

Researchers introduce SEA-Eval, a new benchmark for evaluating self-evolving AI agents that go beyond single-task execution by measuring how agents improve across sequential tasks and accumulate experience over time. The benchmark reveals significant inefficiencies in current state-of-the-art frameworks, exposing up to 31.2x differences in token consumption despite identical success rates, highlighting a critical bottleneck in agent development.

AINeutralarXiv – CS AI · Apr 136/10

🧠

Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym

Researchers introduce Spatial-Gym, a benchmarking environment that evaluates AI models on spatial reasoning tasks through step-by-step pathfinding in 2D grids rather than one-shot generation. Testing eight models reveals a significant performance gap, with the best model achieving only 16% solve rate versus 98% for humans, exposing critical limitations in how AI systems scale reasoning effort and process spatial information.

AIBullisharXiv – CS AI · Apr 136/10

🧠

E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

Researchers introduce E3-TIR, a new training paradigm for Large Language Models that improves tool-use reasoning by combining expert guidance with self-exploration. The method achieves 6% performance gains while using less than 10% of typical synthetic data, addressing key limitations in current reinforcement learning approaches for AI agents.

AIBullisharXiv – CS AI · Apr 136/10

🧠

On Divergence Measures for Training GFlowNets

Researchers propose improved divergence measures for training Generative Flow Networks (GFlowNets), comparing Renyi-α, Tsallis-α, and KL divergences to enhance statistical efficiency. The work introduces control variates that reduce gradient variance and achieve faster convergence than existing methods, bridging GFlowNets training with generalized variational inference frameworks.

AIBearisharXiv – CS AI · Apr 136/10

🧠

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Researchers introduce OmniBehavior, a benchmark for evaluating large language models' ability to simulate real-world human behavior across complex, long-horizon scenarios. The study reveals that current LLMs struggle with authentic behavioral simulation and exhibit systematic biases toward homogenized, overly-positive personas rather than capturing individual differences and realistic long-tail behaviors.

AINeutralarXiv – CS AI · Apr 136/10

🧠

GNN-as-Judge: Unleashing the Power of LLMs for Graph Learning with GNN Feedback

Researchers propose GNN-as-Judge, a framework combining Large Language Models with Graph Neural Networks to improve learning on text-attributed graphs in low-resource settings. The approach uses collaborative pseudo-labeling and weakly-supervised fine-tuning to generate reliable labels while reducing noise, demonstrating significant performance gains when labeled data is scarce.

AIBullisharXiv – CS AI · Apr 136/10

🧠

WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

Researchers introduce WAND, a framework that reduces computational and memory costs of autoregressive text-to-speech models by replacing full self-attention with windowed attention combined with knowledge distillation. The approach achieves up to 66.2% KV cache memory reduction while maintaining speech quality, addressing a critical scalability bottleneck in modern AR-TTS systems.

AINeutralarXiv – CS AI · Apr 136/10

🧠

Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models

Researchers systematically evaluated how sampling temperature and prompting strategies affect extended reasoning performance in large language models, finding that zero-shot prompting peaks at moderate temperatures (T=0.4-0.7) while chain-of-thought performs better at extremes. The study reveals that extended reasoning benefits grow substantially with higher temperatures, suggesting that T=0 is suboptimal for reasoning tasks.

🧠 Grok

AINeutralarXiv – CS AI · Apr 136/10

🧠

Silhouette Loss: Differentiable Global Structure Learning for Deep Representations

Researchers introduce Soft Silhouette Loss, a novel machine learning objective that improves deep neural network representations by enforcing intra-class compactness and inter-class separation. The lightweight differentiable loss outperforms cross-entropy and supervised contrastive learning when combined, achieving 39.08% top-1 accuracy compared to 37.85% for existing methods while reducing computational overhead.

AINeutralarXiv – CS AI · Apr 136/10

🧠

Structured Exploration and Exploitation of Label Functions for Automated Data Annotation

Researchers introduce EXPONA, an automated framework for generating label functions that improve weak label quality in machine learning datasets. The system balances exploration across surface, structural, and semantic levels with reliability filtering, achieving up to 98.9% label coverage and 46% downstream performance improvements across diverse classification tasks.

AINeutralarXiv – CS AI · Apr 136/10

🧠

Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models

Researchers analyzed how large language models decide whether to act on predictions or escalate to humans, finding that models use inconsistent and miscalibrated thresholds across five real-world domains. Supervised fine-tuning on chain-of-thought reasoning proved most effective at establishing robust escalation policies that generalize across contexts, suggesting escalation behavior requires explicit characterization before AI system deployment.

AIBullisharXiv – CS AI · Apr 136/10

🧠

Adaptive Rigor in AI System Evaluation using Temperature-Controlled Verdict Aggregation via Generalized Power Mean

Researchers introduce Temperature-Controlled Verdict Aggregation (TCVA), a novel evaluation method that adapts AI system assessment rigor based on application domain requirements. By combining verdict scoring with generalized power-mean aggregation and a tunable temperature parameter, TCVA achieves human-aligned evaluation comparable to existing benchmarks while offering computational efficiency.

AIBullisharXiv – CS AI · Apr 136/10

🧠

TiAb Review Plugin: A Browser-Based Tool for AI-Assisted Title and Abstract Screening

Researchers developed TiAb Review Plugin, an open-source Chrome extension that enables AI-assisted screening of academic titles and abstracts without requiring server subscriptions or coding skills. The tool combines Google Sheets for collaboration, Google's Gemini API for LLM-based screening, and an in-browser machine learning algorithm achieving 94-100% recall, demonstrating practical viability for systematic literature reviews.

🧠 Gemini

AINeutralarXiv – CS AI · Apr 136/10

🧠

Detection of Hate and Threat in Digital Forensics: A Case-Driven Multimodal Approach

Researchers present a forensic-focused multimodal framework for detecting hate speech and threats across images, documents, and text. The approach intelligently determines what evidence is present before applying appropriate AI models, improving accuracy and evidentiary traceability in digital investigations.

AINeutralarXiv – CS AI · Apr 136/10

🧠

From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity

Researchers propose FEAT, a federated learning method that improves continual learning by addressing class imbalance and representation collapse across distributed clients. The approach combines geometric alignment and energy-based correction to better utilize exemplar samples while maintaining performance under dynamic heterogeneity.

AINeutralarXiv – CS AI · Apr 136/10

🧠

StructRL: Recovering Dynamic Programming Structure from Learning Dynamics in Distributional Reinforcement Learning

StructRL is a new reinforcement learning framework that recovers dynamic programming structure from distributional learning dynamics without requiring explicit models. The research demonstrates that temporal patterns in return distribution evolution reveal inherent structure in how information propagates through state spaces, enabling more efficient and stable learning.

AINeutralarXiv – CS AI · Apr 136/10

🧠

Practical Bayesian Inference for Speech SNNs: Uncertainty and Loss-Landscape Smoothing

Researchers demonstrate that applying Bayesian inference to Spiking Neural Networks (SNNs) for speech processing smooths the irregular loss landscape caused by threshold-based spike generation. Testing on speech datasets shows improved performance metrics and more regular predictive landscapes compared to deterministic approaches.

AINeutralarXiv – CS AI · Apr 136/10

🧠

VOLTA: The Surprising Ineffectiveness of Auxiliary Losses for Calibrated Deep Learning

Researchers introduce VOLTA, a simplified deep learning approach for uncertainty quantification that outperforms ten established baselines including ensemble methods and MC Dropout. The method achieves superior calibration with expected calibration error of 0.010 and competitive accuracy across multiple datasets, suggesting that complex auxiliary losses may be unnecessary for reliable uncertainty estimation in safety-critical applications.

AINeutralarXiv – CS AI · Apr 136/10

🧠

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

Researchers introduce 3D-VCD, an inference-time framework that reduces hallucinations in 3D-LLM embodied agents by contrasting predictions against distorted scene graphs. The method addresses failures specific to 3D spatial reasoning without requiring model retraining, advancing reliability in embodied AI systems.

AINeutralarXiv – CS AI · Apr 136/10

🧠

Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition

Researchers introduce MATU, a novel uncertainty quantification framework using tensor decomposition to address reliability challenges in Large Language Model-based Multi-Agent Systems. The method analyzes entire reasoning trajectories rather than single outputs, effectively measuring uncertainty across different agent structures and communication topologies.

AINeutralarXiv – CS AI · Apr 136/10

🧠

LLMs Underperform Graph-Based Parsers on Supervised Relation Extraction for Complex Graphs

A new study comparing large language models against graph-based parsers for relation extraction demonstrates that smaller, specialized architectures significantly outperform LLMs when processing complex linguistic graphs with multiple relations. This finding challenges the prevailing assumption that larger language models are universally superior for natural language processing tasks.

AINeutralarXiv – CS AI · Apr 136/10

🧠

Cards Against LLMs: Benchmarking Humor Alignment in Large Language Models

Researchers benchmarked five frontier LLMs against human players in Cards Against Humanity games, finding that while models exceed random baseline performance, their humor preferences align poorly with humans but strongly with each other. The findings suggest LLM humor judgment may reflect systematic biases and structural artifacts rather than genuine preference understanding.

← PrevPage 152 of 509Next →