y0news

#llm News & Analysis

956 articles tagged with #llm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 26 · 6/10

Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval

Researchers propose a new four-phase architecture to reduce AI hallucinations using domain-specific retrieval and verification systems. The framework achieved win rates up to 83.7% across multiple benchmarks, demonstrating significant improvements in factual accuracy for large language models.

AI · Bearish · arXiv – CS AI · Mar 17 · 6/10

On the Adversarial Transferability of Generalized "Skip Connections"

Researchers discovered that skip connections in deep neural networks make adversarial attacks more transferable across different AI models. They developed the Skip Gradient Method (SGM) which exploits this vulnerability in ResNets, Vision Transformers, and even Large Language Models to create more effective adversarial examples.
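A minimal sketch of the gradient-rescaling idea behind the Skip Gradient Method: during backpropagation through a residual block f(x) = x + g(x), the gradient flowing through the residual branch g is damped by a factor gamma < 1, biasing the attack gradient toward the skip path. The toy network, weights, and FGSM-style step below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Toy residual block: f(x) = x + g(x), with g(x) = W2 @ relu(W1 @ x),
# followed by a linear head. SGM scales the residual-branch gradient by
# gamma < 1 so the adversarial direction favors the skip connection.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 4)) * 0.5
W2 = rng.standard_normal((4, 8)) * 0.5
w_out = rng.standard_normal(4)          # stand-in "classifier" head

def forward(x):
    h = W1 @ x
    a = np.maximum(h, 0.0)              # relu
    y = x + W2 @ a                      # skip path + residual branch
    return y @ w_out, h                 # scalar logit, pre-activation cache

def sgm_input_grad(x, gamma):
    """Gradient of the logit w.r.t. x, with the branch path scaled by gamma."""
    _, h = forward(x)
    relu_mask = (h > 0).astype(float)
    # skip path passes w_out through unchanged; branch path is damped.
    branch = W1.T @ (relu_mask * (W2.T @ w_out))
    return w_out + gamma * branch

x = rng.standard_normal(4)
eps = 0.1
# FGSM-style adversarial step using the SGM-scaled gradient.
x_adv = x + eps * np.sign(sgm_input_grad(x, gamma=0.5))
```

With gamma = 1 this reduces to the ordinary input gradient; smaller gamma progressively ignores the residual branch, which is the lever the summary describes.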

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Estimating Causal Effects of Text Interventions Leveraging LLMs

Researchers propose CausalDANN, a novel method using large language models to estimate causal effects of textual interventions in social systems. The approach addresses limitations of traditional causal inference methods when dealing with complex, high-dimensional textual data and can handle arbitrary text interventions even when only observational data are available.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models

A comprehensive research study examines the relationship between Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) methods for improving Large Language Models after pre-training. The research identifies emerging trends toward hybrid post-training approaches that combine both methods, analyzing applications from 2023-2025 to establish when each method is most effective.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Relationship-Aware Safety Unlearning for Multimodal LLMs

Researchers propose a new framework for improving safety in multimodal AI models by targeting unsafe relationships between objects rather than removing entire concepts. The approach uses parameter-efficient edits to suppress dangerous combinations while preserving benign uses of the same objects and relations.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

Researchers propose a hierarchical planning framework to analyze why LLM-based web agents fail at complex navigation tasks. The study reveals that while structured PDDL plans outperform natural language plans, low-level execution and perceptual grounding remain the primary bottlenecks rather than high-level reasoning.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Contests with Spillovers: Incentivizing Content Creation with GenAI

Researchers propose the Content Creation with Spillovers (CCS) model to address how GenAI and LLMs create positive spillovers where creators' content can be reused by others, potentially undermining individual incentives. They introduce Provisional Allocation mechanisms to guarantee equilibrium existence and develop approximation algorithms to maximize social welfare in content creation ecosystems.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Researchers introduce AgentProcessBench, the first benchmark for evaluating step-level effectiveness in AI tool-using agents, comprising 1,000 trajectories and 8,509 human-labeled annotations. The benchmark reveals that current AI models struggle to distinguish neutral from erroneous actions during tool execution, and that process-level signals can significantly enhance test-time performance.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Argumentation for Explainable and Globally Contestable Decision Support with LLMs

Researchers introduce ArgEval, a new framework that enhances Large Language Model decision-making through structured argumentation and global contestability. Unlike previous approaches limited to binary choices and local corrections, ArgEval maps entire decision spaces and builds reusable argumentation frameworks that can be globally modified to prevent repeated mistakes.

AI · Bearish · arXiv – CS AI · Mar 17 · 6/10

BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models

Researchers introduced BrainBench, a new benchmark revealing significant gaps in commonsense reasoning among leading LLMs. Even the best model (Claude Opus 4.6) achieved only 80.3% accuracy on 100 brainteaser questions, while GPT-4o scored just 39.7%, exposing fundamental reasoning deficits across frontier AI models.

🧠 GPT-4 · 🧠 Claude · 🧠 Opus
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence

Researchers introduce OpenHospital, a new interactive arena designed to develop and benchmark Large Language Model-based Collective Intelligence through physician-patient agent interactions. The platform uses a data-in-agent-self paradigm to rapidly enhance AI agent capabilities while providing evaluation metrics for medical proficiency and system efficiency.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

PMAx: An Agentic Framework for AI-Driven Process Mining

Researchers have developed PMAx, an autonomous AI framework that democratizes process mining by allowing business users to analyze organizational workflows through natural language queries. The system uses a multi-agent architecture with local execution to ensure data privacy and mathematical accuracy while eliminating the need for specialized technical expertise.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty

Researchers developed an information-theoretic framework to explain 'Aha moments' in large language models during reasoning tasks. The study reveals that strong reasoning performance stems from uncertainty externalization rather than specific tokens, decomposing LLM reasoning into procedural information and epistemic verbalization.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

PREBA: Surgical Duration Prediction via PCA-Weighted Retrieval-Augmented LLMs and Bayesian Averaging Aggregation

Researchers developed PREBA, a retrieval-augmented framework that uses PCA-weighted retrieval and Bayesian averaging to improve surgical duration prediction accuracy by up to 40% using large language models. The system grounds LLM predictions in institution-specific clinical data without requiring computationally intensive training, achieving performance competitive with supervised machine learning methods.
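The retrieval half of that pipeline can be sketched as follows: project past cases into PCA space, weight each component's distance by its explained variance, retrieve the k nearest cases, and combine their durations with a softmax-style posterior weighting (a stand-in for the Bayesian averaging step). The synthetic data, kernel, and weighting scheme are assumptions for exposition; the actual system feeds the retrieved cases to an LLM rather than averaging directly.

```python
import numpy as np

# Synthetic stand-in for institutional records: 50 past surgeries,
# 6 case features, durations in minutes.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 6))
durations = 60 + 20 * X[:, 0] + rng.standard_normal(50)

# PCA via SVD on centered data; explained variance weights each component.
mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
var = S**2 / (len(X) - 1)
w = var / var.sum()                          # per-component weights

def project(x):
    return (x - mu) @ Vt.T                   # coordinates in PCA space

Z = project(X)

def predict(query, k=5, tau=1.0):
    zq = project(query)
    # PCA-weighted squared distance to every stored case
    d2 = ((Z - zq) ** 2 * w).sum(axis=1)
    idx = np.argsort(d2)[:k]                 # k nearest retrieved cases
    # softmax-style weights over the retrieved neighbors
    p = np.exp(-d2[idx] / tau)
    p /= p.sum()
    return float(p @ durations[idx])         # weighted-average estimate

est = predict(rng.standard_normal(6))
```

Because the weights sum to one, the estimate always lies within the range of the retrieved durations, which keeps predictions grounded in institution-specific history.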

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Learning Retrieval Models with Sparse Autoencoders

Researchers introduce SPLARE, a new method that uses sparse autoencoders (SAEs) to improve learned sparse retrieval in language models. The technique outperforms existing vocabulary-based approaches in multilingual and out-of-domain settings, with SPLARE-7B achieving top results on multilingual retrieval benchmarks.
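The core mechanic of SAE-based sparse retrieval can be illustrated in a few lines: encode dense embeddings into high-dimensional non-negative sparse codes (here ReLU plus top-k), then score query against documents with a sparse dot product. The random encoder and top-k rule are assumptions for illustration; SPLARE's training objective and architecture are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
D_DENSE, D_SPARSE, TOPK = 16, 128, 8

# Untrained stand-in for a sparse autoencoder's encoder weights.
W_enc = rng.standard_normal((D_SPARSE, D_DENSE)) * 0.3

def sae_encode(v):
    """Map a dense embedding to a non-negative sparse code (ReLU + top-k)."""
    z = np.maximum(W_enc @ v, 0.0)
    if np.count_nonzero(z) > TOPK:           # keep only the TOPK largest units
        cutoff = np.partition(z, -TOPK)[-TOPK]
        z[z < cutoff] = 0.0
    return z

docs = rng.standard_normal((5, D_DENSE))     # toy dense document embeddings
doc_codes = np.stack([sae_encode(d) for d in docs])

def search(query_vec):
    q = sae_encode(query_vec)
    scores = doc_codes @ q                   # sparse dot-product scoring
    return int(np.argmax(scores))

best = search(docs[3] + 0.01 * rng.standard_normal(D_DENSE))
```

The sparse codes act like a learned vocabulary: each active unit is an inverted-index term, which is what lets this style of model compete with vocabulary-based sparse retrieval.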

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

FedTreeLoRA: Reconciling Statistical and Functional Heterogeneity in Federated LoRA Fine-Tuning

Researchers propose FedTreeLoRA, a new framework for privacy-preserving fine-tuning of large language models that addresses both statistical and functional heterogeneity across federated learning clients. The method uses tree-structured aggregation to allow layer-wise specialization while maintaining shared consensus on foundational layers, significantly outperforming existing personalized federated learning approaches.
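The layer-wise consensus-versus-specialization split can be sketched with a two-level aggregation tree: foundational layers are averaged across all clients, while upper layers are averaged only within smaller client groups. The group structure, split point, and plain parameter vectors below are assumptions; the paper's LoRA-specific machinery is omitted.

```python
import numpy as np

rng = np.random.default_rng(3)
N_CLIENTS, N_LAYERS, DIM = 4, 3, 5
SHARED_LAYERS = 1                           # layer 0 aggregated globally
groups = [[0, 1], [2, 3]]                   # leaf groups of the tree

# Each client holds per-layer parameter vectors (stand-ins for LoRA updates).
clients = rng.standard_normal((N_CLIENTS, N_LAYERS, DIM))

def aggregate(params):
    out = params.copy()
    # Global consensus on foundational layers: one average for everyone.
    out[:, :SHARED_LAYERS] = params[:, :SHARED_LAYERS].mean(axis=0)
    # Group-level averaging on upper layers: specialization per branch.
    for g in groups:
        out[g, SHARED_LAYERS:] = params[g, SHARED_LAYERS:].mean(axis=0)
    return out

agg = aggregate(clients)
```

After aggregation, every client shares identical foundational layers, while upper layers agree only within a group, which is how the method reconciles shared structure with per-client heterogeneity.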

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

From Stochastic Answers to Verifiable Reasoning: Interpretable Decision-Making with LLM-Generated Code

Researchers propose a new framework that uses LLMs as code generators rather than per-instance evaluators for high-stakes decision-making, creating interpretable and reproducible AI systems. The approach generates executable decision logic once instead of querying LLMs for each prediction, demonstrated through venture capital founder screening with competitive performance while maintaining full transparency.

🧠 GPT-4
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Not All Queries Need Rewriting: When Prompt-Only LLM Refinement Helps and Hurts Dense Retrieval

Research reveals that LLM query rewriting in RAG systems shows highly domain-dependent performance, degrading retrieval effectiveness by 9% in financial domains while improving it by 5.1% in scientific contexts. The study identifies that effectiveness depends on whether rewriting improves or worsens lexical alignment between queries and domain-specific terminology.
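One way to operationalize that lexical-alignment finding is a simple gate: accept an LLM rewrite only when it increases token overlap with domain terminology. The Jaccard measure and gating rule below are assumptions for exposition, not the paper's actual criterion.

```python
# Hypothetical rewrite gate based on lexical overlap with a domain glossary.

def jaccard(text_a, text_b):
    """Token-set Jaccard similarity between two whitespace-tokenized strings."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def choose_query(original, rewritten, domain_vocab):
    """Keep the rewrite only if it aligns better with domain terminology."""
    if jaccard(rewritten, domain_vocab) > jaccard(original, domain_vocab):
        return rewritten
    return original

vocab = "ebitda accrual amortization deferred revenue"
q1 = "what is deferred revenue accrual"
q2 = "explain money received before it is earned"   # rewrite, less aligned
# choose_query(q1, q2, vocab) keeps q1: the rewrite loses domain-term overlap,
# mirroring the degradation the study observed in financial domains.
```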

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Evidence-based Distributional Alignment for Large Language Models

Researchers propose Evi-DA, an evidence-based technique that improves how large language models predict population response distributions across different cultures and domains. The method uses World Values Survey data and reinforcement learning to achieve up to 44% improvement in accuracy compared to existing approaches.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

GPrune-LLM: Generalization-Aware Structured Pruning for Large Language Models

Researchers introduce GPrune-LLM, a new structured pruning framework that improves compression of large language models by addressing calibration bias and cross-task generalization issues. The method partitions neurons into behavior-consistent modules and uses adaptive metrics based on distribution sensitivity, showing consistent improvements in post-compression performance.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models

Researchers introduced QuarkMedBench, a new benchmark for evaluating large language models on real-world medical queries using over 20,000 queries across clinical care scenarios. The benchmark addresses limitations of current medical AI evaluations that rely on multiple-choice questions by using an automated scoring framework that achieves 91.8% concordance with clinical expert assessments.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

IGU-LoRA: Adaptive Rank Allocation via Integrated Gradients and Uncertainty-Aware Scoring

Researchers introduce IGU-LoRA, a new parameter-efficient fine-tuning method for large language models that adaptively allocates ranks across layers using integrated gradients and uncertainty-aware scoring. The approach addresses limitations of existing methods like AdaLoRA by providing more stable and accurate layer importance estimates, consistently outperforming baselines across diverse tasks.
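The integrated-gradients ingredient named above attributes f(x) − f(x0) across input dimensions by integrating the gradient along the straight path from a baseline x0 to x. IGU-LoRA's actual scoring is layer-level and uncertainty-aware; this sketch only illustrates integrated gradients itself on a toy differentiable function.

```python
import numpy as np

def f(x):
    return np.sum(x**2) + x[0] * x[1]        # toy scalar objective

def grad_f(x):
    g = 2 * x
    g[0] += x[1]
    g[1] += x[0]
    return g

def integrated_gradients(x, x0, steps=200):
    """Riemann approximation of (x - x0) * integral of grad f along the path."""
    alphas = (np.arange(steps) + 0.5) / steps    # midpoint rule
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(x0 + a * (x - x0))
    return (x - x0) * total / steps

x = np.array([1.0, -2.0, 0.5])
x0 = np.zeros(3)
ig = integrated_gradients(x, x0)
# Completeness axiom: attributions sum to f(x) - f(x0) = 3.25 here.
```

The completeness property (attributions summing exactly to the output change) is what makes IG attractive as an importance signal: the per-layer scores are directly comparable on a common scale.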