#llm-research News & Analysis

52 articles tagged with #llm-research. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A

MM-BizRAG introduces a structured approach to multimodal retrieval-augmented generation for enterprise document analysis, dynamically routing documents through layout-specific processing pipelines and outperforming existing vision-centric baselines by up to 32% on heterogeneous enterprise datasets. The system decouples retrieval from generation contexts and introduces FastRAGEval, a cost-efficient evaluation metric for RAG system quality assessment.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

Self-Evolving Deep Research via Joint Generation and Evaluation

Researchers introduce SCORE, a self-evolving framework that jointly trains evaluation and generation models for deep research report generation. The approach addresses limitations in LLM-based research agents by letting evaluators adapt their standards as solver performance improves, and it shows consistent quality improvements over static evaluation methods.

AI · Neutral · arXiv – CS AI · Jun 3 · 6/10

Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

Researchers introduce TBS (Think-Before-Speak), a multi-agent simulation framework that separates LLM agents' internal reasoning from public dialogue in social interactions. The framework tracks internal states like cognitive dissonance and speaking willingness, then orchestrates public utterances, enabling detailed analysis of how private evaluation drives public expression in collective deliberation scenarios.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Evolutionary Discovery of Bivariate Bicycle Codes with LLM-Guided Search

Researchers developed an LLM-guided evolutionary algorithm to discover quantum LDPC codes, a critical component for scaling quantum computers. The system identified 465 new candidate codes including several with improved parameters, demonstrating that AI-assisted program synthesis can accelerate quantum code discovery at relatively low computational cost.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Are LLMs Ready for Neural-integrated Mechanistic Modeling? A Benchmark and Agentic Framework

Researchers introduce NIMM, a benchmark for evaluating large language models' ability to construct neural-integrated mechanistic models that combine traditional scientific equations with neural networks. They propose NIMMGen, an agentic framework using tree-guided search that significantly outperforms existing LLM approaches on this complex modeling task across three scientific domains.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Shared Doubt: Zero-shot Cross-Lingual Confidence Estimation for Language Models

Researchers demonstrate that multilingual large language models encode shared confidence features that transfer across languages without retraining. A lightweight linear probe trained on English can predict answer correctness in unseen languages with zero-shot generalization, suggesting confidence estimation mechanisms are language-universal in LLMs.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Adopt ≠ Adapt: Longitudinal Analyses of LLM Conversations in the Wild

Researchers analyzed 12,000 Microsoft Bing Copilot users over time and found that individual user behavior with LLMs remains remarkably consistent despite broader population-level trends, with significant variation between active and casual users. The study reveals that existing datasets like WildChat-4.8M predominantly represent power users and fail to capture typical user-AI interactions.

🏢 Microsoft

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

Researchers propose a unified framework for long-form egocentric video understanding that separates reasoning into semantic and visual evidence streams, achieving competitive results on the HD-EPIC-VQA benchmark. The approach addresses fundamental limitations in how multimodal language models process extended video content by combining procedural structure extraction with fine-grained object grounding.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

Researchers propose an adaptive interview framework to improve how large language models simulate individual decision-making by gathering persona-relevant information through structured dialogue. The study finds that richer contextual information alone does not guarantee better accuracy; LLMs improve their predictions (45.5% vs. 39.3%) only when they actively ground decisions in user-specific evidence extracted during follow-up questions.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery

MOOSE-Copilot introduces a unified framework for scientific hypothesis discovery that combines exploratory ideation with fine-grained refinement through structured human-AI interaction. The web-based system enables scientists to guide LLM-powered discovery processes via initial blueprints, routing decisions, and feedback mechanisms, outperforming autonomous baselines while lowering accessibility barriers through an intuitive visual interface.

🏢 Microsoft

AI · Neutral · arXiv – CS AI · May 28 · 5/10

LLM-assisted sentiment analysis for integrated computational and qualitative mixed methods education research: A case study of students' written reflection assignments

Researchers demonstrate how large language models can assist in analyzing student written reflections for mixed-methods education research, combining computational sentiment analysis with qualitative thematic analysis. The study of 151 study-abroad students reveals that prior international living experience significantly impacts sentiment toward language learning, suggesting LLM-assisted workflows enable efficient multi-variable demographic comparisons in qualitative research.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

AI evaluation may bias perceptions: The importance of context in interpreting academic writing

A new study demonstrates that pooled benchmarks for detecting AI-generated academic text systematically misrepresent AI adoption across countries and research fields by ignoring contextual stylistic variations. Using country-field-specific benchmarks instead provides more accurate measurements and reveals that previous estimates substantially over- or underestimated AI use depending on geographic and disciplinary context.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

Researchers introduce MUSE, a framework that disentangles two distinct mechanisms driving LLM conformity: sycophancy learned through reinforcement learning and uncertainty-driven conformity based on epistemic uncertainty at inference time. The findings suggest that LLMs yield to user pushback not only because of training but also because they genuinely lack confidence in their initial responses, with both factors amplified when users appear knowledgeable or suggestions seem plausible.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

SDG-MoE: Signed Debate Graph Mixture-of-Experts

Researchers introduce SDG-MoE, a novel mixture-of-experts architecture that enables deliberation among routed experts through signed graph communication before output aggregation. The model demonstrates 19.8% perplexity improvement over vanilla MoE and achieves state-of-the-art results on multiple language modeling benchmarks while maintaining computational efficiency.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases

Researchers have developed a geometric framework for understanding how large language models process information across their layers, identifying three distinct phases in next-token prediction: Seeding Multiplexing, Hoisting Overriding, and Focal Convergence. The study reveals that model depth primarily increases capacity for candidate disambiguation rather than adding fundamentally new computational stages.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study

Researchers conducted a controlled empirical study evaluating three LLMs (Claude Haiku, DeepSeek-Chat, Gemini 2.5 Flash) for qualitative coding of psychological safety in software engineering communities. Multi-shot prompting improved Claude Haiku's performance but not the others, while all models exhibited systematic biases in coding predictions, providing evidence-based guidelines for LLM-assisted qualitative research.

🧠 Claude · 🧠 Gemini

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Patch-Effect Graph Kernels for LLM Interpretability

Researchers propose a novel framework for understanding transformer neural networks by converting activation patching data into graph structures analyzable through machine learning techniques. The approach demonstrates that localized graph features can effectively preserve and classify circuit-level computational patterns in language models like GPT-2, providing a systematic method for mechanistic interpretability research.

AI · Neutral · arXiv – CS AI · May 1 · 6/10

The TEA Nets framework combines AI and cognitive network science to model targets, events and actors in text

Researchers introduce TEA Nets (Target-Event-Agent Networks), an open-source AI framework that extracts subjects, verbs, and objects from text to analyze emotional and semantic patterns. Testing across conspiracy narratives and psychotherapy transcripts reveals that highly conspiratorial texts link personal pronouns to actions twice as frequently as low-conspiracy texts, while LLMs express emotions with measurably lower intensity than humans.

🧠 Claude

AI · Bullish · arXiv – CS AI · Apr 15 · 6/10

Long-Horizon Plan Execution in Large Tool Spaces through Entropy-Guided Branching

Researchers introduce SLATE, a large-scale benchmark for evaluating AI agents using APIs, and propose Entropy-Guided Branching (EGB), a search algorithm that improves task success rates and computational efficiency. The work addresses critical limitations in deploying language models within complex tool environments by establishing rigorous evaluation frameworks and reducing the computational burden of exploring massive decision spaces.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Why Did Apple Fall: Evaluating Curiosity in Large Language Models

Researchers have developed a comprehensive evaluation framework based on human curiosity scales to assess whether large language models exhibit curiosity-driven learning. The study finds that LLMs demonstrate stronger knowledge-seeking than humans but remain conservative in uncertain situations, with curiosity correlating positively to improved reasoning and active learning capabilities.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities

Researchers demonstrate that inducing specific personas in Large Language Models produces measurable shifts in cognitive task performance, with effects showing 73.68% alignment to human personality-cognition relationships. The study introduces Dynamic Persona Routing, a lightweight strategy that optimizes LLM performance by dynamically selecting personas based on query type, outperforming static persona approaches without additional training.

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay

Researchers developed PoliticsBench, a new framework to evaluate political bias in large language models through multi-turn roleplay scenarios. The study found that 7 of the 8 major LLMs tested (including Claude, DeepSeek, Gemini, GPT, Llama, and Qwen) showed left-leaning political bias, while only Grok exhibited right-leaning tendencies.

🧠 Claude · 🧠 Gemini · 🧠 Llama

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

MALicious INTent Dataset and Inoculating LLMs for Enhanced Disinformation Detection

Researchers released MALINT, the first human-annotated English dataset for detecting disinformation and its malicious intent, developed with expert fact-checkers. The study benchmarked 12 language models and introduced intent-based inoculation techniques that improved zero-shot disinformation detection across six datasets, five LLMs, and seven languages.

🧠 Llama

AI · Neutral · arXiv – CS AI · Mar 11 · 6/10

CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models

Researchers introduce CRANE, a new framework for analyzing how multilingual large language models organize language capabilities at the neuron level. The method uses targeted interventions to identify language-specific neurons based on functional necessity rather than activation patterns, revealing asymmetric specialization where neurons contribute selectively to specific languages while maintaining broader functionality.

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

ContextBench: Modifying Contexts for Targeted Latent Activation

Researchers have developed ContextBench, a new benchmark for evaluating methods that generate targeted inputs to trigger specific behaviors in language models. The study introduces enhanced Evolutionary Prompt Optimization techniques that better balance effectiveness in activating AI model features while maintaining linguistic fluency.
