956 articles tagged with #llm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠 Researchers introduce EvoKernel, a self-evolving AI framework that addresses the 'Data Wall' problem in deploying Large Language Models for kernel synthesis on data-scarce hardware platforms like NPUs. The system uses memory-based reinforcement learning with iterative refinement, improving correctness from 11% to 83% and achieving a 3.60x speedup.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠 Researchers have developed LookaheadKV, a new framework that significantly improves memory efficiency in large language models by intelligently evicting less important cached data. The method achieves superior accuracy while reducing computational costs by up to 14.5x compared to existing approaches, making long-context AI tasks more practical.
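The eviction idea can be illustrated with a toy sketch. Everything here is an assumption for illustration, not the paper's actual LookaheadKV algorithm: the `ToyKVCache` class, the accumulated-attention scoring, and the protect-newest rule are all invented. The generic pattern it shows is scoring each cached entry by how much attention it receives and dropping the lowest-scoring entry once a budget is exceeded.

```python
class ToyKVCache:
    """Toy importance-based KV cache (illustrative only, not LookaheadKV)."""

    def __init__(self, budget):
        self.budget = budget
        self.entries = {}  # position -> (key, value, accumulated score)

    def add(self, pos, key, value):
        self.entries[pos] = (key, value, 0.0)
        if len(self.entries) > self.budget:
            self._evict(protect=pos)

    def observe_attention(self, weights):
        # weights: {position: attention weight from the newest query}
        for pos, w in weights.items():
            if pos in self.entries:
                k, v, s = self.entries[pos]
                self.entries[pos] = (k, v, s + w)

    def _evict(self, protect):
        # Drop the lowest-scoring entry, never the one just added.
        victim = min((p for p in self.entries if p != protect),
                     key=lambda p: self.entries[p][2])
        del self.entries[victim]

cache = ToyKVCache(budget=3)
for pos in range(3):
    cache.add(pos, f"k{pos}", f"v{pos}")
cache.observe_attention({0: 0.7, 1: 0.1, 2: 0.2})
cache.add(3, "k3", "v3")          # over budget: evicts position 1
print(sorted(cache.entries))      # [0, 2, 3]
```

Real eviction policies score entries with model-internal signals rather than a single query's weights, but the budget-and-evict loop has the same shape.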
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠 Research demonstrates that LoRA fine-tuning of large language models significantly improves text-to-speech systems, achieving up to 0.42 DNS-MOS gains and 34% SNR improvements when training data has sufficient acoustic diversity. The study establishes LoRA as an effective mechanism for speaker adaptation in compact LLM-based TTS systems, outperforming frozen base models across perceptual quality, speaker fidelity, and signal quality metrics.
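LoRA's mechanism itself is standard: freeze the base weight W and learn a low-rank update (alpha/r)·BA on top of it. A minimal numpy sketch of the generic adapter follows; the shapes, rank, and scaling constant are illustrative choices, and this is not the paper's TTS model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 8, 16, 4, 8

W = rng.normal(size=(d_out, d_in))             # frozen base weight
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection

def lora_forward(x):
    # Frozen path plus scaled low-rank adapter path.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B initialized to zero, the adapter starts as an exact no-op,
# so fine-tuning begins from the frozen model's behavior.
assert np.allclose(lora_forward(x), W @ x)
```

Only A and B are updated during fine-tuning, which is why LoRA adaptation stays cheap even for compact LLM-based TTS backbones.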
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers present LLM Delegate Protocol (LDP), a new AI-native communication protocol for multi-agent LLM systems that introduces identity awareness, progressive payloads, and governance mechanisms. The protocol achieves 12x lower latency on simple tasks and 37% token reduction compared to existing protocols like A2A, though quality improvements remain limited in small delegate pools.
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers developed BD-FDG, a framework for adapting large language models to complex engineering domains like space situational awareness. The method creates high-quality training datasets using structured knowledge organization and cognitive layering, resulting in SSA-LLM-8B, which shows 144-176% BLEU-1 improvements while maintaining general performance.
AI · Neutral · arXiv – CS AI · Mar 11 · 6/10
🧠 Research reveals that LLMs heavily concentrate their confidence scores on just three round numbers when using standard 0-100 scales, with over 78% of responses showing this pattern. The study demonstrates that using a 0-20 confidence scale significantly improves metacognitive efficiency compared to the conventional 0-100 format.
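The round-number clustering is straightforward to quantify. The sketch below uses invented confidence reports (not the study's data) to show the measurement idea: count what share of 0-100 reports fall on the three most common values, then coarsen to the 0-20 scale the study recommends.

```python
from collections import Counter

# Made-up 0-100 confidence reports (illustrative, not the study's data).
reports = [80, 90, 90, 95, 80, 70, 90, 80, 95, 90, 85, 80]

counts = Counter(reports)
top3 = sum(c for _, c in counts.most_common(3))
concentration = top3 / len(reports)
print(f"top-3 round values cover {concentration:.0%} of reports")  # 83%

# Coarsening to a 0-20 scale: a 21-point grid leaves fewer distinct
# "round" targets for the model to collapse onto.
rescaled = [round(r / 5) for r in reports]   # 0-100 -> 0-20
print(sorted(set(rescaled)))                 # [14, 16, 17, 18, 19]
```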
AI · Neutral · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers developed a method using Large Language Models to create personalized fake news debunking messages tailored to individuals' Big Five personality traits. The study found that personalized debunking messages are more persuasive than generic ones, with traits like Openness increasing persuadability while Neuroticism decreases it.
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers introduce PRECEPT, a new framework for AI language model agents that improves knowledge retrieval and adaptation through structured rule learning and conflict-aware memory systems. The framework shows significant performance improvements over existing methods, with 41% better first-try accuracy and enhanced compositional reasoning capabilities.
AI · Neutral · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers propose a framework using policy-parameterized prompts to influence multi-agent LLM dialogue behavior without training. The approach treats prompts as actions and dynamically constructs them through five components to control conversation flow based on metrics like responsiveness and stance shift.
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers have introduced Turn, a new compiled programming language specifically designed for building autonomous AI agents that use large language models. The language includes built-in features like cognitive type safety, confidence operators, and actor-based process models to address common challenges in agentic software development.
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers introduce Test-Driven AI Agent Definition (TDAD), a methodology that compiles AI agent prompts from behavioral specifications using automated testing. The approach addresses production deployment challenges by ensuring measurable behavioral compliance and preventing silent regressions in tool-using LLM agents.
AI · Neutral · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers developed Arbiter, a framework to detect interference patterns in system prompts for LLM-based coding agents. Testing on major platforms (Claude, Codex, Gemini) revealed 152 findings and 21 interference patterns, with one discovery leading to a Google patch for Gemini CLI's memory system.
🏢 OpenAI · 🏢 Anthropic · 🧠 Claude
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠 DuplexCascade introduces a VAD-free cascaded streaming pipeline that enables full-duplex speech-to-speech dialogue while maintaining LLM intelligence. The system converts traditional long utterance turns into micro-turn interactions using special control tokens to coordinate turn-taking and response timing.
AI · Neutral · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers introduce a new framework showing that emotional tone in text systematically affects how large language models process and reason over information. They developed AURA-QA, an emotionally balanced dataset, and proposed emotional regularization techniques that improve reading comprehension performance across multiple benchmarks.
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers propose TaSR-RAG, a new framework that improves Retrieval-Augmented Generation systems by using taxonomy-guided structured reasoning for better evidence selection. The system decomposes complex questions into triple sub-queries and performs step-wise evidence matching, achieving up to 14% performance improvements over existing RAG baselines on multi-hop question answering benchmarks.
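The decomposition idea can be sketched in miniature. Here the hand-written knowledge base, the example question, and the hop-chaining function are all stand-ins for what TaSR-RAG would produce with a taxonomy and an LLM: a multi-hop question becomes a chain of (subject, relation, object)-style sub-queries, each resolved against retrieved evidence before the next.

```python
def answer_multihop(hops, kb):
    """Resolve (subject, relation) hops in order, feeding each answer
    forward as the next hop's subject."""
    entity = hops[0][0]
    for _, relation in hops:
        entity = kb[(entity, relation)]
    return entity

# Tiny hand-written knowledge base standing in for retrieved evidence.
kb = {
    ("Inception", "directed_by"): "Christopher Nolan",
    ("Christopher Nolan", "born_in"): "London",
}

# "Where was the director of Inception born?" decomposed into two
# triple-style sub-queries; the second subject is filled at run time.
hops = [("Inception", "directed_by"), (None, "born_in")]
print(answer_multihop(hops, kb))  # London
```

Step-wise matching of this kind is what lets a RAG system verify each intermediate hop instead of retrieving once for the whole question.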
AI · Bearish · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers have identified a critical flaw in Large Language Models (LLMs) where they prioritize moral reasoning over commonsense understanding, struggling to detect logical contradictions within moral dilemmas. The study introduces the CoMoral benchmark and reveals a 'narrative focus bias' where LLMs better identify contradictions attributed to secondary characters rather than primary narrators.
AI · Neutral · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers propose MM-tau-p², a new benchmark for evaluating multi-modal AI agents that adapt to user personas in customer service settings. The framework introduces 12 novel metrics to assess robustness and performance of LLM-based agents using voice and visual inputs, showing limitations even in advanced models like GPT-4 and GPT-5.
🧠 GPT-4 · 🧠 GPT-5
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers propose MSSR (Memory-Inspired Sampler and Scheduler Replay), a new framework for continual fine-tuning of large language models that mitigates catastrophic forgetting while maintaining adaptability. The method estimates sample-level memory strength and schedules rehearsal at adaptive intervals, showing superior performance across three backbone models and 11 sequential tasks compared to existing replay-based strategies.
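The scheduling idea resembles spaced repetition, and that generic logic can be sketched in a few lines. The decay constant, threshold, and exponential decay model below are illustrative assumptions, not MSSR's actual memory-strength estimator: each sample's strength decays over training steps, and samples that fall below a threshold are queued for replay.

```python
import math

DECAY, THRESHOLD = 0.3, 0.5

def rehearsal_step(strengths):
    """Decay every sample's memory strength one step; samples that fall
    below the threshold are queued for replay and restored."""
    due = []
    for sample in strengths:
        strengths[sample] *= math.exp(-DECAY)
        if strengths[sample] < THRESHOLD:
            due.append(sample)
            strengths[sample] = 1.0  # rehearsed -> strength restored
    return due

strengths = {"task_a": 1.0, "task_b": 0.6}
print(rehearsal_step(strengths))  # ['task_b'] decays below 0.5 first
```

Scheduling replay by estimated forgetting, rather than at fixed intervals, is what lets a rehearsal buffer spend its budget on the samples most at risk.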
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers have developed neural debuggers: AI models that can emulate traditional Python debuggers by stepping through code execution, setting breakpoints, and predicting both forward and backward program states. This breakthrough enables more interactive control over neural code interpretation compared to existing approaches that only execute programs linearly.
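The classical behavior such models emulate can be reproduced with Python's real tracing hook, `sys.settrace`, which is how conventional debuggers like `pdb` observe execution. The sketch below (the `traced_run` helper and `demo` function are illustrative, not from the paper) records the local state at each executed line; a step-through trace like this is what a neural debugger would have to predict.

```python
import sys

def traced_run(fn):
    """Run fn while recording (line number, locals) at each executed
    line, the way a conventional debugger steps through code."""
    states = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            states.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer
    sys.settrace(tracer)
    try:
        fn()
    finally:
        sys.settrace(None)
    return states

def demo():
    x = 1
    x = x + 2

for lineno, local_vars in traced_run(demo):
    print(lineno, local_vars)
```

Each recorded state shows the locals *before* that line runs, so the trace doubles as the "forward program state" sequence a neural emulator is asked to predict; backward prediction has no such built-in hook.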
🏢 Meta
AI · Neutral · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers introduced OPENXRD, a comprehensive benchmarking framework for evaluating large language models and multimodal LLMs in crystallography question answering. The study tested 74 state-of-the-art models and found that mid-sized models (7B-70B parameters) benefit most from contextual materials, while very large models often show saturation or interference.
🧠 GPT-4 · 🧠 GPT-4.5 · 🧠 GPT-5
AI · Neutral · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers developed an LLM-agent framework to model how media influenced US-China attitudes from 2005 to 2025, testing three debiasing mechanisms to reduce AI model prejudices. The study found that devil's advocate agents were most effective at producing human-like opinion formation, while revealing geographic biases tied to AI models' origins.
🧠 GPT-4
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠 Researchers developed an automated system using LLM-powered web research agents to generate and resolve forecasting questions at scale, creating 1,499 diverse real-world questions with a 96% quality rate. The system demonstrates that more advanced AI models perform significantly better at forecasting tasks, with potential applications for improving AI evaluation benchmarks.
🧠 GPT-5 · 🧠 Gemini
AI · Bullish · arXiv – CS AI · Mar 9 · 6/10
🧠 Researchers introduce ProEvolve, a graph-based framework that enables programmable evolution of AI agent environments for more realistic benchmarking. The system addresses current benchmark limitations by creating dynamic environments that can adapt and change, better reflecting real-world conditions where AI agents must operate.
AI · Bullish · arXiv – CS AI · Mar 9 · 6/10
🧠 Researchers developed 'Companion,' an AI system that combines drawing robots with Large Language Models to create a collaborative artistic partner. The system engages in real-time bidirectional interaction through speech and sketching, with art experts validating its ability to produce works with distinct aesthetic identity and exhibition merit.
AI · Neutral · arXiv – CS AI · Mar 9 · 6/10
🧠 Researchers introduce NGDBench, a comprehensive benchmark for evaluating neural networks' ability to work with graph databases across five domains including finance and medicine. The benchmark supports full Cypher query language capabilities and reveals significant limitations in current AI models when handling structured graph data, noise, and complex analytical tasks.