236 articles tagged with #large-language-models. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · 1d ago · 6/10
🧠 Researchers introduce KnowRL, a reinforcement learning framework that improves large language model reasoning by using minimal, strategically-selected knowledge points rather than verbose hints. The approach achieves state-of-the-art results on reasoning benchmarks at the 1.5B parameter scale, with the trained model and code made publicly available.
AI · Bullish · arXiv – CS AI · 1d ago · 6/10
🧠 TimeSAF introduces a hierarchical asynchronous fusion framework that improves how large language models guide time series forecasting by decoupling semantic understanding from numerical dynamics. This addresses a fundamental architectural limitation in existing methods and demonstrates superior performance on standard benchmarks with strong generalization capabilities.
AI · Bullish · arXiv – CS AI · 1d ago · 6/10
🧠 Researchers introduce RALP, a novel method that uses chain-of-thought prompts with large language models to improve knowledge graph predictions, outperforming traditional embedding models by over 5% on standard benchmarks while better handling unseen entities, relations, and numerical data.
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠 Researchers propose CoDe-R, a two-stage framework using Large Language Models to improve binary decompilation by reducing logical errors and semantic misalignment. A 1.3B model using this approach achieves state-of-the-art performance on the HumanEval-Decompile benchmark, becoming the first lightweight model to exceed a 50% re-executability rate.
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠 Researchers investigate on-policy distillation (OPD) dynamics in large language model training, identifying two critical success conditions: compatible thinking patterns between student and teacher models, and genuine new capabilities from the teacher. The study reveals that successful OPD relies on token-level alignment and proposes recovery strategies for failing distillation scenarios.
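The token-level alignment condition described above can be sketched in plain Python: treat the distillation signal as a per-position KL divergence between the student's and teacher's next-token distributions on a student-sampled rollout. The function names, toy vocabulary, and distributions below are illustrative assumptions, not the paper's actual formulation.

```python
import math

def token_kl(student_probs, teacher_probs):
    """Reverse KL D(student || teacher) at one token position.
    Measures how far the student's next-token distribution is
    from the teacher's on a student-sampled trajectory."""
    return sum(p * math.log(p / teacher_probs[tok])
               for tok, p in student_probs.items() if p > 0)

def opd_loss(student_dists, teacher_dists):
    """Token-level on-policy distillation loss: average the
    per-position KL over the student's own rollout."""
    kls = [token_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(kls) / len(kls)

# Toy 3-token vocabulary: the student already matches the teacher
# at position 0 but diverges at position 1.
student = [{"a": 0.5, "b": 0.3, "c": 0.2}, {"a": 0.9, "b": 0.05, "c": 0.05}]
teacher = [{"a": 0.5, "b": 0.3, "c": 0.2}, {"a": 0.3, "b": 0.4, "c": 0.3}]
loss = opd_loss(student, teacher)
```

When the distributions agree everywhere, the loss is exactly zero; positions where the student's "thinking pattern" diverges from the teacher dominate the signal, which is one way to read the paper's compatibility condition.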
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠 Researchers introduce Text2Model and Text2Zinc, frameworks that use large language models to translate natural language descriptions into formal optimization and satisfaction models. The work represents the first unified approach combining both problem types with a solver-agnostic architecture, though experiments show that, despite competitive performance, LLMs remain imperfect at the task.
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠 Researchers propose Opinion-Aware Retrieval-Augmented Generation (RAG) to address a critical bias in current LLM systems that treat subjective content as noise rather than valuable information. By formalizing the distinction between factual queries (epistemic uncertainty) and opinion queries (aleatoric uncertainty), the team develops an architecture that preserves diverse perspectives in knowledge retrieval, demonstrating 26.8% improved sentiment diversity and 42.7% better entity matching on real-world e-commerce data.
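One way to read the "preserve diverse perspectives" idea is as a greedy, MMR-style retrieval step that trades relevance against sentiment novelty when a query is classified as an opinion query. This is a minimal sketch under that assumption; the document IDs, scores, and weighting are invented, not the paper's architecture.

```python
def diverse_retrieve(query_scores, sentiments, k, lam=0.5):
    """Greedy MMR-style selection: trade off relevance against
    sentiment diversity so opinion queries surface multiple views.
    query_scores: {doc_id: relevance}; sentiments: {doc_id: label}."""
    selected = []
    while len(selected) < k:
        best, best_val = None, float("-inf")
        seen = {sentiments[d] for d in selected}
        for doc, rel in query_scores.items():
            if doc in selected:
                continue
            # A document earns a novelty bonus if its sentiment
            # label is not yet represented in the result set.
            novelty = 0.0 if sentiments[doc] in seen else 1.0
            val = lam * rel + (1 - lam) * novelty
            if val > best_val:
                best, best_val = doc, val
        selected.append(best)
    return selected

scores = {"r1": 0.9, "r2": 0.85, "r3": 0.6}
sents = {"r1": "pos", "r2": "pos", "r3": "neg"}
picked = diverse_retrieve(scores, sents, k=2)  # → ["r1", "r3"]
```

A purely relevance-ranked retriever would return the two positive reviews; the diversity term surfaces the lower-scored negative one instead, which is the behavior the reported sentiment-diversity gain measures.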
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠 A comprehensive scoping review of 52 studies examines Large Language Model-based pedagogical agents across educational contexts from November 2022 to January 2025. The research identifies four key design dimensions (interaction approach, domain scope, role complexity, system integration) and emerging trends including multi-agent systems, virtual student simulation, and integration with immersive technologies, while flagging critical research gaps around privacy, accuracy, and student autonomy.
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠 LLM-HYPER is a new framework that uses large language models as hypernetworks to generate click-through rate prediction models for cold-start ads without traditional training. The system achieved a 55.9% improvement over baseline methods in offline tests and has been successfully deployed in production on a major U.S. e-commerce platform.
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠 Researchers propose GRACE, a dynamic coreset selection framework that reduces LLM training costs by intelligently selecting representative dataset subsets. The method combines representation diversity with gradient-based metrics and uses k-NN graph propagation to adapt to evolving training dynamics, demonstrating improved efficiency across multiple benchmarks.
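The selection idea can be sketched in a few lines: score each example by its own gradient signal blended with its k-NN neighbours' (a single propagation step over the representation graph), then keep the top-scoring subset as the coreset. The blending weight, toy 2-D "representations", and gradient norms below are invented for illustration and are not GRACE's actual scoring rule.

```python
import math

def knn(points, k):
    """Indices of each point's k nearest neighbours (Euclidean)."""
    nbrs = []
    for i, p in enumerate(points):
        order = sorted(range(len(points)),
                       key=lambda j: math.dist(p, points[j]))
        nbrs.append([j for j in order if j != i][:k])
    return nbrs

def coreset(points, grad_norms, k, budget, alpha=0.5):
    """Score each example by a mix of its own gradient signal and
    its neighbours' (one step of k-NN propagation), then keep the
    top `budget` examples as the coreset."""
    nbrs = knn(points, k)
    scores = [alpha * grad_norms[i]
              + (1 - alpha) * sum(grad_norms[j] for j in nbrs[i]) / k
              for i in range(len(points))]
    ranked = sorted(range(len(points)), key=lambda i: -scores[i])
    return sorted(ranked[:budget])

# Two clusters: the first carries high gradient signal, the second low.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
grads = [1.0, 0.9, 0.2, 0.1]
core = coreset(pts, grads, k=1, budget=2)  # → [0, 1]
```

Recomputing the scores periodically as gradient norms evolve during training is what would make such a selection "dynamic" in the sense the summary describes.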
AI · Bullish · arXiv – CS AI · 1d ago · 6/10
🧠 Researchers propose Joint Flashback Adaptation, a novel method to address catastrophic forgetting in large language models during incremental task learning. The approach uses limited prompts from previous tasks combined with latent task interpolation, demonstrating improved performance across 1000+ instruction-following and reasoning tasks without requiring full replay data.
AI · Bullish · arXiv – CS AI · 1d ago · 6/10
🧠 Researchers introduce Mโ, a method that automatically evolves task-specific memory systems for large language model agents by treating memory architecture as executable Python code. The approach outperforms fixed memory designs across conversation, planning, and reasoning benchmarks, suggesting that specialized memory mechanisms significantly outperform one-size-fits-all solutions.
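The core idea above — treating the memory system as executable code that a search procedure can mutate, run, and score — can be illustrated with two interchangeable Python memory classes. The class names, interfaces, and data are my own illustrative assumptions, not the paper's implementation.

```python
class ListMemory:
    """Baseline memory: append everything, retrieve by recency."""
    def __init__(self):
        self.items = []

    def write(self, item):
        self.items.append(item)

    def read(self, k):
        return self.items[-k:]

class KeyedMemory:
    """A task-specialised variant: index entries by topic so a
    planning agent can recall by key instead of recency."""
    def __init__(self):
        self.items = {}

    def write(self, item):
        self.items.setdefault(item["topic"], []).append(item["text"])

    def read(self, topic):
        return self.items.get(topic, [])

# Because both variants are ordinary code behind the same
# write/read surface, an evolutionary loop can propose new
# implementations, run them on a benchmark, and keep the winner.
mem = KeyedMemory()
mem.write({"topic": "plan", "text": "book flight"})
mem.write({"topic": "chat", "text": "hello"})
recalled = mem.read("plan")  # → ["book flight"]
```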
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers introduce TimeSeriesExamAgent, a scalable framework for automatically generating time series reasoning benchmarks using LLM agents and templates. The study reveals that while large language models show promise in time series tasks, they significantly underperform in abstract reasoning and domain-specific applications across healthcare, finance, and weather domains.
AI · Bullish · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers introduce CARO, a two-stage training framework that enhances large language models' ability to perform robust content moderation through analogical reasoning. By combining retrieval-augmented generation with direct preference optimization, CARO achieves 24.9% F1 score improvement over state-of-the-art models including DeepSeek R1 and LLaMA Guard on ambiguous moderation cases.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers introduce NovBench, the first large-scale benchmark for evaluating how well large language models can assess research novelty in academic papers. The benchmark comprises 1,684 paper-review pairs from a leading NLP conference and reveals that current LLMs struggle with scientific novelty comprehension despite their promise for peer-review support.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers demonstrate that large language models exhibit excessive repetition of discourse tactics in multi-turn empathic conversations, reusing communication strategies at nearly double the human rate. They introduce MINT, a reinforcement learning framework that optimizes for both empathy quality and discourse move diversity, achieving 25.3% improvements in empathy while reducing repetitive tactics by 26.3%.
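A joint objective of the kind described — reward empathy but penalise reusing the same discourse tactic turn after turn — can be sketched as a simple combined reward. The diversity measure (distinct-move ratio) and the weighting are assumptions for illustration, not MINT's actual reward design.

```python
def move_diversity(moves):
    """Fraction of distinct discourse moves in a conversation:
    1.0 means every turn uses a different tactic."""
    return len(set(moves)) / len(moves)

def combined_reward(empathy, moves, beta=0.5):
    """Joint objective: empathy quality plus a bonus that grows
    with discourse-move diversity, so repetitive tactics score
    lower even at equal empathy."""
    return empathy + beta * move_diversity(moves)

repetitive = ["validate", "validate", "validate", "validate"]
varied = ["validate", "reflect", "question", "suggest"]
r_rep = combined_reward(0.8, repetitive)
r_var = combined_reward(0.8, varied)
```

At equal empathy (0.8), the varied conversation earns the higher reward, which is the gradient a policy-optimization loop would follow to cut tactic repetition.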
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers introduce Agent Mentor, an open-source analytics pipeline that monitors and automatically improves AI agent behavior by analyzing execution logs and iteratively refining system prompts with corrective instructions. The framework addresses variability in large language model-based agent performance caused by ambiguous prompt formulations, demonstrating consistent accuracy improvements across multiple configurations.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers propose Dramaturge, a multi-agent LLM system that uses hierarchical divide-and-conquer methodology to iteratively refine narrative scripts. The approach addresses limitations in single-pass LLM generation by coordinating global structural reviews with scene-level refinements across multiple iterations, demonstrating superior output quality compared to baseline methods.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers introduce SciPredict, a benchmark testing whether large language models can predict scientific experiment outcomes across physics, biology, and chemistry. The study reveals that while some frontier models marginally exceed human experts (~20% accuracy), they fundamentally fail to assess prediction reliability, suggesting superhuman performance in experimental science requires not just better predictions but better calibration awareness.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers introduce SciTune, a framework for fine-tuning large language models with human-curated scientific multimodal instructions from academic publications. The resulting LLaMA-SciTune model demonstrates superior performance on scientific benchmarks compared to state-of-the-art alternatives, with results suggesting that high-quality human-generated data outweighs the volume advantage of synthetic training data for specialized scientific tasks.
AI · Bullish · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers developed a multi-agent LLM system that automates structural analysis workflows across multiple finite element analysis (FEA) platforms including ETABS, SAP2000, and OpenSees. Using a two-stage architecture that interprets engineering specifications and translates them into platform-specific code, the system achieved over 90% accuracy on 20 representative frame problems, addressing a critical gap in practical AI-assisted engineering deployment.
AI · Bullish · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers introduce PoTable, a novel AI framework that enhances Large Language Models' ability to reason about tabular data through systematic, stage-oriented planning before execution. The approach mimics professional data analyst workflows by breaking complex table reasoning into distinct analytical stages with clear objectives, demonstrating improved accuracy and explainability across benchmark datasets.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers introduce LIFESTATE-BENCH, a benchmark for evaluating lifelong learning capabilities in large language models through multi-turn interactions using narrative datasets like Hamlet. Testing shows nonparametric approaches significantly outperform parametric methods, but all models struggle with catastrophic forgetting over extended interactions, revealing fundamental limitations in LLM memory and consistency.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers introduce TagCC, a novel deep clustering framework that combines Large Language Models with contrastive learning to enhance tabular data analysis by incorporating semantic knowledge from feature names and values. The approach bridges the gap between statistical co-occurrence patterns and intrinsic semantic understanding, demonstrating significant performance improvements over existing methods in finance and healthcare applications.
AI · Bullish · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers introduce PODS (Policy Optimization with Down-Sampling), a technique that accelerates reinforcement learning training for large language models by selectively training on high-variance rollouts rather than all generated data. The method achieves equivalent performance to standard approaches at 1.7x faster speeds, addressing computational bottlenecks in LLM reasoning optimization.
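The down-sampling intuition can be sketched directly: under a group-relative baseline, a prompt whose rollouts all succeed or all fail yields zero advantage and thus no gradient signal, so training only on the highest-variance groups discards mostly uninformative data. The reward values and selection rule below are a toy sketch, not the paper's exact criterion.

```python
def reward_variance(rewards):
    """Population variance of a group's rollout rewards."""
    m = sum(rewards) / len(rewards)
    return sum((r - m) ** 2 for r in rewards) / len(rewards)

def downsample(rollout_groups, keep):
    """Keep only the prompt groups whose rollout rewards vary most;
    zero-variance groups carry no policy-gradient signal under a
    group-relative baseline, so dropping them saves compute."""
    ranked = sorted(rollout_groups.items(),
                    key=lambda kv: -reward_variance(kv[1]))
    return dict(ranked[:keep])

groups = {"p1": [1.0, 0.0, 1.0, 0.0],   # mixed outcomes: informative
          "p2": [1.0, 1.0, 1.0, 1.0],   # all correct: no signal
          "p3": [0.0, 0.0, 0.0, 0.0]}   # all wrong: no signal
kept = downsample(groups, keep=1)  # → only "p1" survives
```

In this toy batch, two of the three prompt groups contribute nothing to the update, which is where a ~1.7x wall-clock speedup could plausibly come from.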