#large-language-models News & Analysis

Coverage of #large-language-models has grown significantly over the past month: 100 of the 273 indexed articles were published in the last 30 days. Sentiment is predominantly neutral (59%), with bullish coverage at 37%. The bullish share is down 14.2 percentage points from the prior 90 days. arXiv's CS AI feed dominates source coverage, and Llama, Gemini, and GPT-4 are the most frequently discussed models. Scan the articles below for recent developments and perspectives on the topic.

sentiment · last 30d (100 articles) · -14.2pp bullish vs prior 90d

Top sources: arXiv – CS AI · 254 | Crypto Briefing · 2 | TechCrunch – AI · 2 | IEEE Spectrum – AI · 1 | Decrypt · 1

Often co-tagged with: #machine-learning #ai-research #reinforcement-learning #research #artificial-intelligence #multimodal-ai

Most-discussed entities: Llama · 7 | Gemini · 6 | GPT-4 · 6 | Claude · 4 | Anthropic · 4

416 articles

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

Innovation: An Almost Characterization of Hallucination

Researchers have introduced the concept of 'innovation' as a fundamental property that characterizes hallucination in large language models, showing it serves as an almost-complete mathematical characterization of when LLMs produce false information. The work extends prior research by Kalai and Vempala, establishing that innovation—the tendency to generate outputs outside training data—inevitably leads to hallucination with high probability, providing new theoretical bounds on hallucination rates.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

ContextGuard: Structured Self-Auditing for Context Learning in Language Models

Researchers introduce ContextGuard, a self-auditing framework that addresses a critical gap in large language model performance: the inability to faithfully apply complex contextual knowledge despite strong reasoning capabilities. The system identifies and corrects failures where models miss peripheral, persistent, or format-sensitive requirements while following main reasoning paths.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

The Kalman Evolve: Closing the Gap in Kalman Filtering via Interpretable Algorithm Discovery

Researchers introduce Kalman Evolve, a framework that uses large language models to discover improved filtering algorithms for state estimation by optimizing both noise parameters and the update structure of classical Kalman filters. The approach addresses performance gaps in nonlinear sensing scenarios like Doppler radar and LiDAR, achieving up to 12% RMSE improvement over standard methods.
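For readers unfamiliar with the baseline, here is a minimal sketch of one predict/update cycle of a textbook scalar Kalman filter. It is not the Kalman Evolve method; it only shows where the noise parameters (`q`, `r`) and the update structure mentioned in the summary sit in the classical filter. The function name, the identity dynamics, and the sample measurements are illustrative assumptions.

```python
def kalman_step(x, p, z, q, r):
    """One predict/update cycle of a scalar Kalman filter.

    Assumes identity dynamics and a direct measurement of the state
    (both are simplifications made for this sketch).
    x, p: prior state estimate and its variance
    z:    new measurement
    q, r: process and measurement noise variances
    """
    # Predict: the state is carried over, uncertainty grows by q.
    x_pred = x
    p_pred = p + q
    # Update: blend prediction and measurement using the Kalman gain.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1 - k) * p_pred
    return x_new, p_new


# Hypothetical usage: three identical measurements of a constant signal.
x, p = 0.0, 1.0
for z in [1.0, 1.0, 1.0]:
    x, p = kalman_step(x, p, z, q=0.0, r=1.0)
```

With `q = 0` the filter behaves like a running average of the prior and the measurements, so the estimate moves toward the measured value while its variance shrinks.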

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

Researchers demonstrate that scale vectors in large language models, despite comprising negligible model parameters, significantly impact training performance and optimization. Through theoretical analysis and empirical validation across models from 0.12B to 2B parameters, the study proposes three complementary improvements to scale vector design that enhance training efficiency without adding computational overhead.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

DEI: Diversity in Evolutionary Inference for Quality-Diversity Search

Researchers present DEI, a distributed Quality-Diversity search framework that uses heterogeneous large language models as mutation operators to solve competitive programming tasks. A four-model ensemble achieved 124% higher performance than single-model baselines, demonstrating that model diversity—not just computational parallelism—drives superior outcomes in evolutionary AI search.

Entities: GPT-5 · Claude · Haiku

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

Multi-Agent Causal Discovery Using Large Language Models

Researchers introduce MAC, a multi-agent framework that combines statistical causal discovery with large language models to identify relationships between variables more accurately than existing methods. By using autonomous agent debate and adversarial reasoning, MAC outperforms both traditional statistical and single-agent LLM approaches across multiple benchmark datasets.

Entities: Gemini

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

Researchers introduced FrontierOR, a benchmark that tests whether leading LLMs can design efficient optimization algorithms for real-world large-scale problems. The evaluation of seven models reveals significant limitations: even frontier models outperform Gurobi (a standard solver) in only 31% of cases, highlighting a substantial gap between LLM capabilities in formulation and practical algorithmic optimization.

AI · Bullish · arXiv – CS AI · 5d ago · 6/10

Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks

Researchers propose Robustness of Prompting (RoP), a novel prompting strategy that enhances Large Language Models' resilience against adversarial perturbations like typos and character errors. The two-stage approach combines error correction with guided inference, demonstrating significant improvements in robustness across arithmetic, commonsense, and logical reasoning tasks while maintaining accuracy on clean inputs.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

How Reliable are LLMs for Reasoning on the Re-ranking task?

Researchers investigate whether Large Language Models reliably perform re-ranking tasks by analyzing how different training methods affect semantic understanding and reasoning transparency. The study reveals that some training approaches produce better explainability than others, suggesting LLMs may optimize for evaluation metrics rather than genuine semantic comprehension, raising concerns about their actual reliability in ranking applications.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Researchers introduced Persona2Web, the first benchmark for evaluating personalized web agents that can infer user preferences from historical behavior rather than explicit instructions. The framework tests how large language models handle ambiguous queries by leveraging user context, addressing a critical gap in current web agent capabilities.

AI · Neutral · Simon Willison Blog · May 19 · 6/10

Gemini 3.5 Flash: more expensive, but Google plan to use it for everything

Google has released Gemini 3.5 Flash with improved capabilities but at a higher cost per token, signaling the company's strategy to deploy the model across diverse applications despite pricing pressures. This move reflects Google's commitment to scaling AI infrastructure across products, even as it increases operational expenses for users and developers relying on the API.

Entities: Gemini

AI · Neutral · arXiv – CS AI · May 12 · 6/10

OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control

Researchers introduce OracleTSC, an LLM-based traffic signal control system that combines reward hurdle mechanisms and uncertainty regularization to stabilize reinforcement learning training. The approach achieves a 75% reduction in travel time while maintaining interpretability through natural language explanations, and it generalizes well across intersections.

AI · Neutral · arXiv – CS AI · May 12 · 5/10

What Will Happen Next: Large Models-Driven Deduction for Emergency Instances

Researchers propose WLDS, a Large Language Model-driven system for simulating and deducing emergency scenarios across multiple domains. The system addresses limitations of traditional simulation methods by using LMs to generate diverse, realistic emergency instance variations with calibration mechanisms to ensure factual accuracy and logical consistency.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Internalizing Safety Understanding in Large Reasoning Models via Verification

Researchers propose Safety Internal (SInternal), a framework that trains large reasoning models to verify the safety of their own outputs rather than relying on external compliance mechanisms. The approach demonstrates that models can internalize safety understanding through verification tasks, significantly improving robustness against adversarial jailbreaks and out-of-domain attacks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Re²Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

Researchers introduce Re²Math, a new benchmark for evaluating large language models' ability to retrieve relevant mathematical theorems and lemmas from academic literature during proof construction. The benchmark reveals significant gaps in current AI systems, with the best model achieving only 7.0% accuracy despite retrieving valid statements, indicating AI struggles to verify applicability to specific proof contexts.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay

Researchers introduce Neuro-Symbolic Experience Replay (NSER), a framework that enhances reinforcement learning by combining Large Language Models with symbolic logic to transform passive memory buffers into active knowledge construction systems. The approach grounds LLM-generated behavioral rules into differentiable logic representations, enabling more efficient policy optimization across multiple benchmark environments.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Active Testing of Large Language Models via Approximate Neyman Allocation

Researchers introduce a novel active testing algorithm that reduces evaluation costs for large language models by intelligently sampling from evaluation pools using semantic entropy and approximate Neyman allocation. The method achieves up to 28% MSE reduction over uniform sampling while saving an average of 22.9% of evaluation budget across multiple benchmarks.
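As background, here is a minimal sketch of classical Neyman allocation, which splits a sampling budget across strata in proportion to stratum size times stratum standard deviation. This is the textbook rule only; the paper's approximate variant and its use of semantic entropy are not reproduced here. The function name, strata sizes, and standard deviations are hypothetical.

```python
def neyman_allocation(sizes, stds, budget):
    """Split `budget` samples across strata in proportion to N_h * sigma_h.

    sizes:  number of items in each stratum (N_h)
    stds:   estimated standard deviation of the metric in each stratum (sigma_h)
    budget: total number of samples to allocate
    Returns fractional allocations; rounding to integers is left to the caller.
    """
    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        # No estimated variance anywhere: fall back to proportional allocation.
        weights = list(sizes)
        total = sum(weights)
    return [budget * w / total for w in weights]


# Hypothetical evaluation pool with three strata of test items.
alloc = neyman_allocation(sizes=[500, 300, 200], stds=[0.1, 0.4, 0.2], budget=100)
```

In this example the second stratum receives the largest share even though it is not the largest, because its outcomes are the most variable. That is the intuition behind spending evaluation budget where the model's behavior is least certain.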

AI · Neutral · arXiv – CS AI · May 12 · 6/10

LLM4Branch: Large Language Model for Discovering Efficient Branching Policies of Integer Programs

LLM4Branch is a framework that uses large language models to automatically discover efficient branching policies for Mixed Integer Linear Programming (MILP) solvers. The approach generates executable programs via LLMs and optimizes their parameters through performance feedback, achieving results competitive with state-of-the-art GPU-based methods on standard benchmarks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

ASIA: an Autonomous System Identification Agent

ASIA is an autonomous AI agent framework that automates system identification tasks by delegating model selection, training algorithms, and hyperparameter tuning to a large language model. The framework eliminates manual trial-and-error processes in dynamical systems modeling, though empirical testing reveals concerns around test leakage and reproducibility.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Budget-Efficient Automatic Algorithm Design via Code Graph

Researchers propose a budget-efficient automatic algorithm design framework using large language models that operates on code graphs rather than full algorithms. The approach uses LLMs to generate compact corrections—code modifications that add, replace, or remove blocks—which compose into new algorithms, reducing computational waste and improving fitness outcomes on combinatorial optimization problems.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

Researchers introduce Evolving-RL, a framework that optimizes how AI agents learn from past experiences to adapt to new tasks. The method jointly improves both experience extraction and utilization through reinforcement learning, achieving significant performance gains on out-of-distribution tasks without requiring test-time experience accumulation.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

PathISE: Learning Informative Path Supervision for Knowledge Graph Question Answering

PathISE is a framework that enables knowledge graph question-answering systems to learn effective supervision signals from answer-level labels alone, eliminating the need for expensive intermediate annotations. By using a transformer-based estimator to identify informative relation paths and distilling them into LLM path generators, the approach achieves performance competitive with the state of the art while reducing the resources required for training.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

MaD Physics: Evaluating information seeking under constraints in physical environments

Researchers introduce MaD Physics, a benchmark for evaluating AI agents' ability to conduct scientific discovery under realistic resource constraints. The benchmark tests agents' capacity to make informative measurements within budget limits and infer underlying physical laws, using altered physics environments to prevent reliance on training data.

Entities: Gemini

AI · Neutral · arXiv – CS AI · May 12 · 6/10

The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning

Researchers reveal that large language models suffer from a nonlinear performance degradation when exposed to misleading information in long-context scenarios, with the majority of decline occurring when hard distractors comprise just a small fraction of the total context. This finding, termed 'The First Drop of Ink' effect, demonstrates that attention mechanisms disproportionately focus on misleading content, suggesting that upstream retrieval quality is more critical than previously understood for RAG and agentic systems.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Artificial Intelligence in Number Theory: LLMs for Algorithm Generation and Ensemble Methods for Conjecture Verification

Researchers demonstrate that large language models like Qwen2.5-Math achieve 95%+ accuracy on algorithmic number theory problems with optimal hints, and empirically verify a folklore conjecture that Dirichlet character moduli are uniquely determined by L-function zeros using machine learning ensemble methods.
