y0news

#llm News & Analysis

956 articles tagged with #llm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI Neutral · arXiv – CS AI · Mar 2 · 7/10 · 20

HumanMCP: A Human-Like Query Dataset for Evaluating MCP Tool Retrieval Performance

Researchers have released HumanMCP, the first large-scale dataset designed to evaluate tool retrieval performance in Model Context Protocol (MCP) servers. The dataset addresses a critical gap by providing realistic, human-like queries paired with 2,800 tools across 308 MCP servers, improving upon existing benchmarks that lack authentic user interaction patterns.
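Tool-retrieval benchmarks like this are typically scored with metrics such as recall@k. A minimal sketch of that metric (illustrative only, not HumanMCP's actual evaluation code):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant tools that appear in the top-k retrieved list."""
    hits = sum(1 for tool in retrieved[:k] if tool in relevant)
    return hits / len(relevant) if relevant else 0.0
```

With a query whose relevant tools are {"search", "calendar"}, a retriever returning ["search", "weather", "calendar"] scores 0.5 at k=2 and 1.0 at k=3.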

AI Neutral · arXiv – CS AI · Mar 2 · 7/10 · 14

Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis

A comprehensive study of 504 AI model configurations reveals that reasoning capabilities in large language models are highly task-dependent: accuracy on simple tasks like binary classification degrades by up to 19.9 percentage points when reasoning is enabled, while complex 27-class emotion recognition improves by up to 16.0 points. The research challenges the assumption that reasoning universally improves AI performance across all language tasks.

AI Bullish · arXiv – CS AI · Mar 2 · 6/10 · 9

Preference Packing: Efficient Preference Optimization for Large Language Models

Researchers propose 'preference packing,' a new optimization technique for training large language models that reduces training time by at least 37% through more efficient handling of duplicate input prompts. The method optimizes attention operations and KV cache memory usage in preference-based training methods like Direct Preference Optimization.
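The core idea, avoiding repeated work on duplicate prompts, can be illustrated with a toy grouping step. This is only a sketch of the deduplication, not the authors' implementation; the real method also fuses attention operations and shares KV-cache memory across the packed pairs:

```python
from collections import defaultdict

def pack_preference_pairs(pairs):
    """Group (prompt, chosen, rejected) triples by prompt so that a shared
    prompt's forward pass / KV cache can be computed once per group
    instead of once per preference pair."""
    groups = defaultdict(list)
    for prompt, chosen, rejected in pairs:
        groups[prompt].append((chosen, rejected))
    return dict(groups)
```

For a DPO-style batch where the same prompt appears with several chosen/rejected completions, the packed form encodes each unique prompt once.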

AI Bullish · arXiv – CS AI · Mar 2 · 7/10 · 16

SafeGen-LLM: Enhancing Safety Generalization in Task Planning for Robotic Systems

Researchers propose SafeGen-LLM, a new approach to enhance safety in robotic task planning by combining supervised fine-tuning with policy optimization guided by formal verification. The system demonstrates superior safety generalization across multiple domains compared to existing classical planners, reinforcement learning methods, and base large language models.

AI Bullish · arXiv – CS AI · Mar 2 · 7/10 · 17

CoMind: Towards Community-Driven Agents for Machine Learning Engineering

Researchers introduce CoMind, a multi-agent AI system that leverages community knowledge to automate machine learning engineering tasks. The system achieved a 36% medal rate on 75 past Kaggle competitions and outperformed 92.6% of human competitors in eight live competitions, establishing new state-of-the-art performance.

AI Bullish · arXiv – CS AI · Mar 2 · 7/10 · 24

DUET: Distilled LLM Unlearning from an Efficiently Contextualized Teacher

Researchers propose DUET, a new distillation-based method for LLM unlearning that removes undesirable knowledge from AI models without full retraining. The technique combines computational efficiency with security advantages, achieving better performance in both knowledge removal and utility preservation while being significantly more data-efficient than existing methods.

AI Bullish · arXiv – CS AI · Mar 2 · 6/10 · 14

Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Researchers propose Trust Region Masking (TRM) to address off-policy mismatch problems in Large Language Model reinforcement learning pipelines. The method provides the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL tasks by masking entire sequences that violate trust region constraints.
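A rough sketch of sequence-level trust-region masking, assuming per-token log-probabilities under the new and old policies are available. This is a hedged simplification: the paper's exact masking criterion and bounds may differ:

```python
import math

def trust_region_mask(logp_new, logp_old, eps=0.2):
    """Return per-sequence loss weights: 1.0 if the sequence-level
    importance ratio exp(sum(logp_new) - sum(logp_old)) stays inside
    [1 - eps, 1 + eps], else 0.0, masking the entire sequence rather
    than clipping individual tokens."""
    weights = []
    for new_seq, old_seq in zip(logp_new, logp_old):
        ratio = math.exp(sum(new_seq) - sum(old_seq))
        weights.append(1.0 if (1 - eps) <= ratio <= (1 + eps) else 0.0)
    return weights
```

Unlike PPO-style per-token clipping, a sequence whose overall ratio drifts out of the trust region contributes nothing to the update.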

AI Bullish · arXiv – CS AI · Mar 2 · 6/10 · 14

Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning

Researchers introduce Latent Self-Consistency (LSC), a new method for improving Large Language Model output reliability across both short and long-form reasoning tasks. LSC uses learnable token embeddings to select semantically consistent responses with only 0.9% computational overhead, outperforming existing consistency methods like Self-Consistency and Universal Self-Consistency.
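The selection step can be approximated with a simple embedding-centroid vote. This is a hedged sketch: LSC itself uses learnable summary-token embeddings, not the off-the-shelf vectors assumed here:

```python
def select_most_consistent(responses, embeddings):
    """Pick the response whose embedding is closest (by dot product)
    to the centroid of all response embeddings, i.e. the member of the
    semantic majority set."""
    dim = len(embeddings[0])
    centroid = [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(dim)]
    def score(vec):
        return sum(a * b for a, b in zip(vec, centroid))
    best = max(range(len(responses)), key=lambda i: score(embeddings[i]))
    return responses[best]
```

Two paraphrases of the same answer pull the centroid toward their shared meaning, so an outlier response loses the vote even when all strings differ.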

AI Neutral · arXiv – CS AI · Mar 2 · 7/10 · 19

Once4All: Skeleton-Guided SMT Solver Fuzzing with LLM-Synthesized Generators

Researchers developed Once4All, an LLM-assisted fuzzing framework for testing SMT solvers that addresses syntax validity issues and computational overhead. The system found 43 confirmed bugs in leading solvers Z3 and cvc5, with 40 already fixed by developers.

AI Bullish · arXiv – CS AI · Feb 27 · 6/10 · 5

Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue

Researchers introduce InteractCS-RL, a new reinforcement learning framework that helps AI agents balance empathetic communication with cost-effective decision-making in task-oriented dialogue. The system uses a multi-granularity approach with persona-driven user interactions and cost-aware policy optimization to achieve better performance across business scenarios.

AI Bullish · arXiv – CS AI · Feb 27 · 6/10 · 5

dLLM: Simple Diffusion Language Modeling

Researchers introduce dLLM, an open-source framework that unifies core components of diffusion language modeling including training, inference, and evaluation. The framework enables users to reproduce, finetune, and deploy large diffusion language models like LLaDA and Dream while providing tools to build smaller models from scratch with accessible compute resources.

AI Bullish · arXiv – CS AI · Feb 27 · 5/10 · 7

Addressing Climate Action Misperceptions with Generative AI

A study of 1,201 climate-concerned individuals found that personalized AI conversations using climate-equipped large language models significantly improved understanding of climate action impacts and increased intentions to adopt high-impact behaviors. The personalized climate LLM outperformed web searches, unspecialized LLMs, and control groups in motivating behavior change through tailored guidance.

AI Neutral · arXiv – CS AI · Feb 27 · 5/10 · 5

CWM: Contrastive World Models for Action Feasibility Learning in Embodied Agent Pipelines

Researchers propose Contrastive World Models (CWM), a new approach for training AI agents to better distinguish between physically feasible and infeasible actions in embodied environments. The method uses contrastive learning with hard negative examples to outperform traditional supervised fine-tuning, achieving 6.76 percentage point improvement in precision and better safety margins under stress conditions.

AI Neutral · arXiv – CS AI · Feb 27 · 6/10 · 6

Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs

Researchers created a 4.5k text corpus analyzing how different AI personas, including Microsoft's controversial Sydney chatbot, express views on human-AI relationships across 12 major language models. The study examines how the Sydney persona has spread memetically through training data, allowing newer models to simulate its distinctive characteristics and perspectives.

AI Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6

Reinforcement-aware Knowledge Distillation for LLM Reasoning

Researchers propose RL-aware distillation (RLAD), a new method to efficiently transfer knowledge from large language models to smaller ones during reinforcement learning training. The approach uses Trust Region Ratio Distillation (TRRD) to selectively guide student models only when it improves policy updates, outperforming existing distillation methods across reasoning benchmarks.

AI Neutral · arXiv – CS AI · Feb 27 · 6/10 · 6

Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs

Researchers propose KGT, a novel framework that bridges the gap between Large Language Models and Knowledge Graph Completion by using dedicated entity tokens for full-space prediction. The approach addresses fundamental granularity mismatches through specialized tokenization, feature fusion, and decoupled prediction mechanisms.

AI Bearish · arXiv – CS AI · Feb 27 · 6/10 · 6

ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization

Researchers introduced ConstraintBench, a new benchmark testing whether large language models can directly solve constrained optimization problems without external solvers. The study found that even the best frontier models only achieve 65% constraint satisfaction, with feasibility being a bigger challenge than optimality.
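Checking feasibility separately from optimality, the distinction the benchmark highlights, can be sketched with a hypothetical helper (not ConstraintBench's actual harness):

```python
def check_solution(solution, constraints, objective=None):
    """Evaluate a candidate solution: it is feasible only if every named
    constraint predicate holds; the objective is scored only for
    feasible solutions, mirroring the feasibility-vs-optimality split."""
    violations = [name for name, predicate in constraints if not predicate(solution)]
    feasible = not violations
    value = objective(solution) if (objective and feasible) else None
    return feasible, violations, value
```

An LLM-proposed solution can thus fail in two distinct ways: violating a constraint outright, or satisfying all constraints with a suboptimal objective value.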

AI Bullish · arXiv – CS AI · Feb 27 · 5/10 · 7

EyeLayer: Integrating Human Attention Patterns into LLM-Based Code Summarization

Researchers developed EyeLayer, a module that integrates human eye-tracking patterns into large language models to improve code summarization. The system achieved up to 13.17% improvement on BLEU-4 metrics by using human gaze data to guide AI attention mechanisms.

AI Bullish · arXiv – CS AI · Feb 27 · 6/10 · 5

Comparative Analysis of Neural Retriever-Reranker Pipelines for Retrieval-Augmented Generation over Knowledge Graphs in E-commerce Applications

Researchers developed improved neural retriever-reranker pipelines for Retrieval-Augmented Generation (RAG) systems over knowledge graphs in e-commerce applications. The study achieved 20.4% higher Hit@1 and 14.5% higher Mean Reciprocal Rank compared to existing benchmarks, providing a framework for production-ready RAG systems.
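The retriever-reranker pattern itself is simple to sketch: a cheap scorer narrows the corpus to k candidates, then a more expensive scorer reorders just those survivors. The scoring functions below are toy stand-ins, not the paper's neural models:

```python
def retrieve_then_rerank(query, corpus, retrieve_score, rerank_score, k=10, top=3):
    """Two-stage pipeline: the retriever scores the whole corpus and
    keeps the top-k documents; the reranker rescores only those k and
    returns the final top results."""
    candidates = sorted(corpus, key=lambda doc: retrieve_score(query, doc), reverse=True)[:k]
    return sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)[:top]
```

The design point is cost: the reranker, typically a cross-encoder, sees only k documents per query, so the expensive model never touches the full knowledge graph or corpus.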

AI Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6

UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

Researchers introduce UpSkill, a new training method that uses Mutual Information Skill Learning to improve large language models' ability to generate diverse correct responses across multiple attempts. The technique shows ~3% improvements in pass@k metrics on mathematical reasoning tasks using models like Llama 3.1-8B and Qwen 2.5-7B without degrading single-attempt accuracy.
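For context, pass@k is usually computed with the unbiased estimator popularized by the HumanEval benchmark: given n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c of them correct) is correct."""
    if n - c < k:  # too few incorrect samples to fill k slots: guaranteed hit
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Improving pass@k without hurting pass@1 is exactly the diversity-vs-accuracy balance the summary describes: more distinct correct solutions raise c across varied attempts.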

AI Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6

Integrating Machine Learning Ensembles and Large Language Models for Heart Disease Prediction Using Voting Fusion

Researchers developed a hybrid system combining machine learning ensembles with large language models for heart disease prediction, achieving 96.62% accuracy. The study found that traditional ML models (95.78% accuracy) outperformed standalone LLMs (78.9% accuracy), but combining both approaches yielded the best results for clinical decision-support tools.
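Voting fusion reduces to a weighted majority vote over the component models' predictions. A minimal sketch with hypothetical label names (the paper's exact weighting scheme is not specified here):

```python
def voting_fusion(predictions, weights=None):
    """Weighted majority vote: each model's predicted class label is
    tallied with its weight; the label with the largest total wins."""
    weights = weights or [1.0] * len(predictions)
    tally = {}
    for pred, weight in zip(predictions, weights):
        tally[pred] = tally.get(pred, 0.0) + weight
    return max(tally, key=tally.get)
```

Weighting lets the more accurate ML ensemble outvote the weaker standalone LLM on most cases while still letting the LLM tip borderline ones.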

AI Bullish · arXiv – CS AI · Feb 27 · 6/10 · 8

Graph Your Way to Inspiration: Integrating Co-Author Graphs with Retrieval-Augmented Generation for Large Language Model Based Scientific Idea Generation

Researchers developed GYWI, a scientific idea generation system that combines author knowledge graphs with retrieval-augmented generation to help Large Language Models generate more controllable and traceable scientific ideas. The system significantly outperforms mainstream LLMs including GPT-4o, DeepSeek-V3, Qwen3-8B, and Gemini 2.5 in metrics like novelty, reliability, and relevance.

AI Bullish · arXiv – CS AI · Feb 27 · 5/10 · 7

Decoder-based Sense Knowledge Distillation

Researchers have developed Decoder-based Sense Knowledge Distillation (DSKD), a new framework that integrates lexical resources into decoder-style large language models during training. The method enhances knowledge distillation performance while enabling generative models to inherit structured semantics without requiring dictionary lookup during inference.

AI Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6

LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

Researchers have developed LLM4Cov, an offline learning framework that enables AI agents to generate high-coverage hardware verification testbenches without expensive online reinforcement learning. A compact 4B-parameter model achieved 69.2% coverage pass rate, outperforming larger models by demonstrating efficient learning from execution feedback in hardware verification tasks.

AI Bullish · arXiv – CS AI · Feb 27 · 6/10 · 7

Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Asta Interaction Dataset

Researchers released the Asta Interaction Dataset containing over 200,000 user queries from AI-powered scientific research tools, revealing how scientists interact with LLM-based research assistants. The study shows users treat these systems as collaborative research partners, submitting longer queries and using outputs as persistent artifacts for non-linear exploration.