236 articles tagged with #large-language-models. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI Bullish · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers present Data Mixing Agent, an AI framework that uses reinforcement learning to automatically optimize how large language models balance training data from source and target domains during continual pre-training. The approach outperforms manual reweighting strategies while generalizing across different models, domains, and fields without requiring retraining.
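The paper's exact agent design isn't given in the summary, but the core idea of RL-driven data reweighting can be sketched: sampling weights over data domains come from a softmax over learned logits, and a reward signal (here a stand-in array, not real values) nudges the logits via a policy-gradient-style update. All names and numbers below are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative only: logits over data domains define sampling weights;
# a reward signal (e.g., held-out loss improvement) nudges the logits,
# loosely in the spirit of RL-driven data reweighting.
logits = np.zeros(3)                       # domains: [source, target, general]
rewards = np.array([0.1, 0.5, 0.2])        # stand-in reward per domain
for step in range(100):
    weights = softmax(logits)
    # REINFORCE-style gradient for a softmax policy with fixed rewards
    logits += 0.1 * weights * (rewards - rewards @ weights)

weights = softmax(logits)
print(np.round(weights, 3))  # the high-reward target domain gets upweighted
```

The update rule is the exact gradient of expected reward for a softmax policy, which is why mass drifts toward the domain with the highest stand-in reward.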
AI Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers present the first comprehensive survey of inductive reasoning in large language models, categorizing improvement methods into post-training, test-time scaling, and data augmentation approaches. The survey establishes unified benchmarks and evaluation metrics for assessing how LLMs perform particular-to-general reasoning tasks that better align with human cognition.
AI Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 A study evaluating the consistency of exercise prescriptions generated by Gemini 2.5 Flash found high semantic consistency but significant variability in quantitative components like exercise intensity. The research highlights that while LLMs produce semantically similar outputs, structural constraints and expert validation are necessary before clinical deployment.
🧠 Gemini
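The quantitative-variability finding can be illustrated with a coefficient of variation over repeated generations. The intensity values below are hypothetical, not from the paper; the point is the metric, not the data.

```python
import statistics

# Hypothetical repeated model outputs for the same prompt: the surrounding
# text is near-identical, but the prescribed intensity (% of max heart rate)
# varies from run to run.
intensities = [60, 75, 65, 80, 70]  # illustrative values, not from the paper

mean = statistics.mean(intensities)
cv = statistics.stdev(intensities) / mean  # coefficient of variation
print(f"mean intensity {mean:.0f}%, CV {cv:.2f}")
# A high CV flags the kind of quantitative inconsistency the study reports,
# even when the surrounding prose is semantically stable.
```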
AI Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 SRBench introduces a comprehensive evaluation framework for Sequential Recommendation models that combines Large Language Models with traditional neural network approaches. The benchmark addresses critical gaps in existing evaluation methodologies by incorporating fairness, stability, and efficiency metrics alongside accuracy, while establishing fair comparison mechanisms between LLM-based and neural network-based recommendation systems.
🏢 Meta
AI Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 A comprehensive study evaluates four state-of-the-art LLMs (GPT-4o, Claude Sonnet 4, Qwen3-235B, Kimi K2) for use as AI tutors in Nepal's K-10 curriculum, revealing significant pedagogical gaps despite high technical accuracy. The research identifies critical failure modes including inability to simplify complex concepts for young learners and poor cultural contextualization, concluding that current LLMs require human oversight and curriculum-specific fine-tuning before classroom deployment in low-resource regions.
🧠 GPT-4 · 🧠 Claude · 🧠 Sonnet
AI Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers discovered that large language models exhibit working memory limitations similar to humans, encoding multiple memory items in entangled representations that require interference control rather than direct retrieval. This finding reveals a shared computational constraint between biological and artificial systems, suggesting that working memory capacity may be a fundamental bottleneck in intelligent systems rather than a limitation unique to biological brains.
AI Bullish · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers developed a multi-agent LLM system that automates structural analysis workflows across multiple finite element analysis (FEA) platforms including ETABS, SAP2000, and OpenSees. Using a two-stage architecture that interprets engineering specifications and translates them into platform-specific code, the system achieved over 90% accuracy on 20 representative frame problems, addressing a critical gap in practical AI-assisted engineering deployment.
AI Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers argue that Large Language Models lack explicit empathy mechanisms, systematically failing to preserve human perspectives, affect, and context despite strong benchmark performance. The paper identifies four recurring empathic failures—sentiment attenuation, granularity mismatch, conflict avoidance, and linguistic distancing—and proposes empathy-aware objectives as essential components of LLM development.
AI Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers introduce Object-Oriented World Modeling (OOWM), a framework that structures LLM reasoning for robotic planning by replacing linear text with explicit symbolic representations using UML diagrams and object hierarchies. The approach combines supervised fine-tuning with group relative policy optimization to achieve superior planning performance on embodied tasks, demonstrating that formal software engineering principles can enhance AI reasoning capabilities.
AI Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers evaluated whether large language models can function as text-only controllers for navigation and exploration in unknown environments under partial observability. Testing nine contemporary LLMs on ASCII gridworld tasks, they found reasoning-tuned models reliably complete navigation goals but remain inefficient compared to optimal paths, with few-shot prompting reducing invalid moves and improving path efficiency.
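A minimal sketch of the kind of ASCII-gridworld harness such evaluations use (the exact format in the paper is not specified): the grid is rendered as text for the LLM, and each proposed move is validated before it is applied. Grid layout and move names below are assumptions.

```python
# Minimal ASCII gridworld: '#' wall, '.' floor, 'A' agent start, 'G' goal.
GRID = ["#####",
        "#A..#",
        "#.#.#",
        "#..G#",
        "#####"]

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(grid, pos, move):
    """Return the new position, or the old one if the move hits a wall."""
    dr, dc = MOVES[move]
    r, c = pos[0] + dr, pos[1] + dc
    return (r, c) if grid[r][c] != "#" else pos

pos = (1, 1)                                     # agent start ('A')
for move in ["down", "down", "right", "right"]:  # a valid path to 'G'
    pos = step(GRID, pos, move)
print(pos)  # (3, 3), the goal cell
```

Counting how often a model proposes a wall-blocked (invalid) move, and comparing its path length to the shortest path, gives exactly the efficiency metrics the study reports.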
AI Bullish · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers propose Tool-Internalized Reasoning (TInR), a framework that embeds tool knowledge directly into Large Language Models rather than relying on external tool documentation during reasoning. The TInR-U model uses a three-phase training pipeline combining knowledge alignment, supervised fine-tuning, and reinforcement learning to improve reasoning efficiency and performance across various tasks.
AI Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers have developed a framework to detect and eliminate ambiguities in natural-language specifications converted to executable BPMN process models by large language models. The method identifies behavioral inconsistencies through KPI analysis, diagnoses gateway logic problems, and repairs source text through evidence-based refinement, reducing variability in regenerated model behavior.
AI Neutral · arXiv – CS AI · 3d ago · 6/10
🧠 Researchers systematically evaluated how sampling temperature and prompting strategies affect extended reasoning performance in large language models, finding that zero-shot prompting peaks at moderate temperatures (T=0.4–0.7) while chain-of-thought performs better at extremes. The study reveals that extended reasoning benefits grow substantially with higher temperatures, suggesting that T=0 is suboptimal for reasoning tasks.
🧠 Grok
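For readers unfamiliar with what the temperature parameter actually does, here is the underlying math: next-token logits are divided by T before the softmax, so low T concentrates probability on the top token (greedy decoding at T→0) while higher T flattens the distribution. The logits below are made up for illustration.

```python
import numpy as np

def temperature_softmax(logits, T):
    """Temperature-scaled softmax over next-token logits."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                 # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.5]         # illustrative next-token logits
for T in (0.01, 0.4, 0.7, 1.5):
    p = temperature_softmax(logits, T)
    print(f"T={T}: {np.round(p, 3)}")
# As T -> 0 the distribution collapses onto the argmax (greedy decoding);
# the study's claim is that this T=0 regime is suboptimal for reasoning.
```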
AI Neutral · arXiv – CS AI · 3d ago · 6/10
🧠 Researchers introduce ASTRA, a new architecture designed to improve how large language models process and reason about complex tables through adaptive semantic tree structures. The method combines tree-based navigation with symbolic code execution to achieve state-of-the-art performance on table question-answering benchmarks, addressing fundamental limitations in how tables are currently serialized for LLMs.
AI Bullish · TechCrunch – AI · 4d ago · 6/10
🧠 Anthropic's Claude AI dominated conversations at San Francisco's HumanX conference, positioning the company as a leading force in the AI industry. The prominence signals growing market interest in advanced language models and their commercial applications across enterprise and developer ecosystems.
🏢 Anthropic · 🧠 Claude
AI Neutral · arXiv – CS AI · 6d ago · 6/10
🧠 Researchers have developed a method to control how detectable hallucinations in multimodal language models are, distinguishing between obvious hallucinations (easily spotted by humans) and elusive ones (harder to spot). Using a dataset of 4,470 human responses, they created targeted interventions that can fine-tune which types of hallucinations occur, enabling flexible control suited to different security and usability requirements.
AI Bullish · arXiv – CS AI · 6d ago · 6/10
🧠 Researchers introduce EmoMAS, a Bayesian multi-agent framework that enables small language models to perform sophisticated negotiation by treating emotional intelligence as a strategic variable. The system coordinates game-theoretic, reinforcement learning, and psychological agents to optimize negotiation outcomes while maintaining privacy through edge deployment, demonstrating performance comparable to larger models across high-stakes domains.
AI Neutral · arXiv – CS AI · 6d ago · 6/10
🧠 A research study analyzes six leading large language models to identify shared cultural patterns revealed in their training data, finding consensus around themes like narrative meaning-making, status competition, and moral rationalization. The findings suggest LLMs function as 'cultural condensates' that compress how humans describe and contest their social lives across massive text datasets.
AI Neutral · arXiv – CS AI · 6d ago · 6/10
🧠 Researchers conducted a comparative analysis of demonstration selection strategies for using large language models to predict users' next point-of-interest (POI) based on historical location data. The study found that simple heuristic methods like geographical proximity and temporal ordering outperform complex embedding-based approaches in both computational efficiency and prediction accuracy, with LLMs using these heuristics sometimes matching fine-tuned model performance without additional training.
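The geographical-proximity heuristic is simple enough to sketch in full: rank past check-ins by great-circle distance from the current location and use the nearest k as few-shot demonstrations in the prompt. The check-in history below is hypothetical.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

# Hypothetical check-in history: (POI name, (lat, lon)). The heuristic simply
# picks the k visits closest to the user's current location as demonstrations.
history = [("cafe",   (40.7421, -73.9890)),
           ("museum", (40.7794, -73.9632)),
           ("park",   (40.7829, -73.9654)),
           ("gym",    (40.7128, -74.0060))]
current = (40.7800, -73.9660)

demos = sorted(history, key=lambda v: haversine_km(current, v[1]))[:2]
print([name for name, _ in demos])  # the nearest POIs become few-shot examples
```

No embeddings, no model calls: the heuristic's cheapness is exactly what the study contrasts against embedding-based selection.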
AI Bullish · arXiv – CS AI · 6d ago · 6/10
🧠 Researchers propose FLeX, a parameter-efficient fine-tuning approach combining LoRA, advanced optimizers, and Fourier-based regularization to enable cross-lingual code generation across programming languages. The method achieves 42.1% pass@1 on Java tasks compared to a 34.2% baseline, demonstrating significant improvements in multilingual transfer without full model retraining.
🧠 Llama
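The LoRA component FLeX builds on is standard and can be sketched directly: the pretrained weight W stays frozen, and only a low-rank update B·A is trained, cutting trainable parameters from d² to 2·d·r. This shows plain LoRA, not FLeX's added optimizers or Fourier regularization; all dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                              # hidden size, LoRA rank (r << d)

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init
alpha = 16                               # LoRA scaling hyperparameter

def lora_forward(x):
    # Base path is frozen; only the low-rank update B @ A is trained.
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.standard_normal(d)
# With B zero-initialized, the adapted model starts identical to the base.
assert np.allclose(lora_forward(x), x @ W.T)
print(f"trainable params: {2 * d * r} instead of {d * d}")
```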
AI Neutral · arXiv – CS AI · 6d ago · 6/10
🧠 Researchers propose an attribution-driven approach to make encoder-based Large Language Models more transparent and trustworthy for network intrusion detection in Software-Defined Networks. By analyzing which traffic features drive model decisions, the study demonstrates that LLMs learn legitimate attack behavior patterns, addressing a critical barrier to deploying AI security tools in sensitive environments.
AI Neutral · arXiv – CS AI · 6d ago · 6/10
🧠 Researchers propose G-Defense, a graph-enhanced framework that uses large language models and retrieval-augmented generation to detect fake news while providing explainable, fine-grained reasoning. The system decomposes news claims into sub-claims, retrieves competing evidence, and generates transparent explanations without requiring verified fact-checking databases.
AI Neutral · arXiv – CS AI · 6d ago · 6/10
🧠 Researchers introduce TeamLLM, a multi-LLM collaboration framework that emulates human team structures with distinct roles to improve performance on complex, multi-step tasks. The team proposes a new CGPST benchmark for evaluating LLM performance on contextualized procedural tasks, demonstrating substantial improvements over single-perspective approaches.
AI Neutral · arXiv – CS AI · 6d ago · 6/10
🧠 Researchers have developed a comprehensive evaluation framework for Large Language Models applied to outpatient referral systems in healthcare, revealing that LLMs offer limited advantages over simpler BERT-like models in static referral tasks but demonstrate potential in interactive dialogue scenarios. The study addresses the absence of standardized evaluation criteria for assessing LLM effectiveness in dynamic healthcare settings.
AI Bearish · arXiv – CS AI · 6d ago · 6/10
🧠 A new empirical study reveals that eight major LLMs exhibit systematic biases in code generation, overusing popular libraries like NumPy in 45% of cases and defaulting to Python even when unsuitable, prioritizing familiarity over task-specific optimality. The findings highlight gaps in current LLM evaluation methodologies and underscore the need for targeted improvements in training data diversity and benchmarking standards.
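Library-overuse statistics of the kind this study reports reduce to parsing generated code and counting imports. A minimal sketch with hypothetical generated samples (the study's actual corpus and methodology are not given in the summary):

```python
import ast
from collections import Counter

# Hypothetical generated solutions; the bias analysis boils down to counting
# which libraries model-written code reaches for.
generated = [
    "import numpy as np\nxs = np.arange(10)\nprint(xs.sum())",
    "import numpy as np\nprint(np.mean([1, 2, 3]))",
    "total = sum(range(10))\nprint(total)",   # plain Python, no library
]

counts = Counter()
for src in generated:
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.Import):
            counts.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            counts[node.module.split(".")[0]] += 1

share = counts["numpy"] / len(generated)
print(f"numpy appears in {share:.0%} of samples")  # 67% here; 45% in the study
```

Using the `ast` module rather than regexes correctly handles aliased (`import numpy as np`) and dotted (`from numpy.linalg import ...`) imports.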