956 articles tagged with #llm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AIBullisharXiv โ CS AI ยท Mar 266/10
๐ง Researchers propose a new four-phase architecture to reduce AI hallucinations using domain-specific retrieval and verification systems. The framework achieved win rates up to 83.7% across multiple benchmarks, demonstrating significant improvements in factual accuracy for large language models.
AIBearisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers discovered that skip connections in deep neural networks make adversarial attacks more transferable across different AI models. They developed the Skip Gradient Method (SGM) which exploits this vulnerability in ResNets, Vision Transformers, and even Large Language Models to create more effective adversarial examples.
AINeutralarXiv โ CS AI ยท Mar 176/10
๐ง Researchers propose CausalDANN, a novel method using large language models to estimate causal effects of textual interventions in social systems. The approach addresses limitations of traditional causal inference methods when dealing with complex, high-dimensional textual data and can handle arbitrary text interventions even with observational data only.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers propose EMBRAG, a new framework that combines large language models with knowledge graphs to improve reasoning accuracy and reduce hallucinations. The system generates multiple logical rules from queries and applies them in embedding space, achieving state-of-the-art performance on knowledge graph question-answering benchmarks.
AINeutralarXiv โ CS AI ยท Mar 176/10
๐ง A comprehensive research study examines the relationship between Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) methods for improving Large Language Models after pre-training. The research identifies emerging trends toward hybrid post-training approaches that combine both methods, analyzing applications from 2023-2025 to establish when each method is most effective.
AINeutralarXiv โ CS AI ยท Mar 176/10
๐ง Researchers propose a new framework for improving safety in multimodal AI models by targeting unsafe relationships between objects rather than removing entire concepts. The approach uses parameter-efficient edits to suppress dangerous combinations while preserving benign uses of the same objects and relations.
AINeutralarXiv โ CS AI ยท Mar 176/10
๐ง Researchers propose a hierarchical planning framework to analyze why LLM-based web agents fail at complex navigation tasks. The study reveals that while structured PDDL plans outperform natural language plans, low-level execution and perceptual grounding remain the primary bottlenecks rather than high-level reasoning.
AINeutralarXiv โ CS AI ยท Mar 176/10
๐ง Researchers propose the Content Creation with Spillovers (CCS) model to address how GenAI and LLMs create positive spillovers where creators' content can be reused by others, potentially undermining individual incentives. They introduce Provisional Allocation mechanisms to guarantee equilibrium existence and develop approximation algorithms to maximize social welfare in content creation ecosystems.
AINeutralarXiv โ CS AI ยท Mar 176/10
๐ง Researchers introduce AgentProcessBench, the first benchmark for evaluating step-level effectiveness in AI tool-using agents, comprising 1,000 trajectories and 8,509 human-labeled annotations. The benchmark reveals that current AI models struggle with distinguishing neutral and erroneous actions in tool execution, and that process-level signals can significantly enhance test-time performance.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers introduce ArgEval, a new framework that enhances Large Language Model decision-making through structured argumentation and global contestability. Unlike previous approaches limited to binary choices and local corrections, ArgEval maps entire decision spaces and builds reusable argumentation frameworks that can be globally modified to prevent repeated mistakes.
AIBearisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers introduced BrainBench, a new benchmark revealing significant gaps in commonsense reasoning among leading LLMs. Even the best model (Claude Opus 4.6) achieved only 80.3% accuracy on 100 brainteaser questions, while GPT-4o scored just 39.7%, exposing fundamental reasoning deficits across frontier AI models.
๐ง GPT-4๐ง Claude๐ง Opus
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers introduce OpenHospital, a new interactive arena designed to develop and benchmark Large Language Model-based Collective Intelligence through physician-patient agent interactions. The platform uses a data-in-agent-self paradigm to rapidly enhance AI agent capabilities while providing evaluation metrics for medical proficiency and system efficiency.
AINeutralarXiv โ CS AI ยท Mar 176/10
๐ง Researchers have developed PMAx, an autonomous AI framework that democratizes process mining by allowing business users to analyze organizational workflows through natural language queries. The system uses a multi-agent architecture with local execution to ensure data privacy and mathematical accuracy while eliminating the need for specialized technical expertise.
AINeutralarXiv โ CS AI ยท Mar 176/10
๐ง Researchers developed an information-theoretic framework to explain 'Aha moments' in large language models during reasoning tasks. The study reveals that strong reasoning performance stems from uncertainty externalization rather than specific tokens, decomposing LLM reasoning into procedural information and epistemic verbalization.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers developed PREBA, a retrieval-augmented framework that uses PCA-weighted retrieval and Bayesian averaging to improve surgical duration prediction accuracy by up to 40% using large language models. The system grounds LLM predictions in institution-specific clinical data without requiring computationally intensive training, achieving performance competitive with supervised machine learning methods.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers introduce SPLARE, a new method that uses sparse autoencoders (SAEs) to improve learned sparse retrieval in language models. The technique outperforms existing vocabulary-based approaches in multilingual and out-of-domain settings, with SPLARE-7B achieving top results on multilingual retrieval benchmarks.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers propose FedTreeLoRA, a new framework for privacy-preserving fine-tuning of large language models that addresses both statistical and functional heterogeneity across federated learning clients. The method uses tree-structured aggregation to allow layer-wise specialization while maintaining shared consensus on foundational layers, significantly outperforming existing personalized federated learning approaches.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers propose a new framework that uses LLMs as code generators rather than per-instance evaluators for high-stakes decision-making, creating interpretable and reproducible AI systems. The approach generates executable decision logic once instead of querying LLMs for each prediction, demonstrated through venture capital founder screening with competitive performance while maintaining full transparency.
๐ง GPT-4
AINeutralarXiv โ CS AI ยท Mar 176/10
๐ง Research reveals that LLM query rewriting in RAG systems shows highly domain-dependent performance, degrading retrieval effectiveness by 9% in financial domains while improving it by 5.1% in scientific contexts. The study identifies that effectiveness depends on whether rewriting improves or worsens lexical alignment between queries and domain-specific terminology.
AINeutralarXiv โ CS AI ยท Mar 176/10
๐ง Researchers propose Evi-DA, an evidence-based technique that improves how large language models predict population response distributions across different cultures and domains. The method uses World Values Survey data and reinforcement learning to achieve up to 44% improvement in accuracy compared to existing approaches.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers introduce GPrune-LLM, a new structured pruning framework that improves compression of large language models by addressing calibration bias and cross-task generalization issues. The method partitions neurons into behavior-consistent modules and uses adaptive metrics based on distribution sensitivity, showing consistent improvements in post-compression performance.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง NormCode Canvas v1.1.3 introduces a case-based reasoning system for LLM agentic workflows using a semi-formal planning language called NormCode. The deployed system demonstrates multi-step AI task automation across presentation generation, code assistance, and plan compilation with self-sustaining capabilities.
AINeutralarXiv โ CS AI ยท Mar 176/10
๐ง Researchers introduced QuarkMedBench, a new benchmark for evaluating large language models on real-world medical queries using over 20,000 queries across clinical care scenarios. The benchmark addresses limitations of current medical AI evaluations that rely on multiple-choice questions by using an automated scoring framework that achieves 91.8% concordance with clinical expert assessments.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers developed a framework to make large language model-based query expansion more efficient by distilling knowledge from powerful teacher models into compact student models. The approach uses retrieval feedback and preference alignment to maintain 97% of the original performance while dramatically reducing inference costs.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers introduce IGU-LoRA, a new parameter-efficient fine-tuning method for large language models that adaptively allocates ranks across layers using integrated gradients and uncertainty-aware scoring. The approach addresses limitations of existing methods like AdaLoRA by providing more stable and accurate layer importance estimates, consistently outperforming baselines across diverse tasks.