y0news

#language-models News & Analysis

350 articles tagged with #language-models. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠

Hallucination as output-boundary misclassification: a composite abstention architecture for language models

Researchers propose a composite architecture combining instruction-based refusal with a structural abstention gate to reduce hallucinations in large language models. The system uses a support deficit score derived from self-consistency, paraphrase stability, and citation coverage to block unreliable outputs, achieving better accuracy than either mechanism alone across multiple models.
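
As a rough illustration of the gating idea, here is a toy sketch of an abstention gate that combines three support signals into a support deficit score and blocks the answer when the deficit is too high. The signal names come from the summary; the weights, threshold, and refusal string are illustrative assumptions, not the paper's values.

```python
# Toy sketch of a structural abstention gate. Weights and threshold are
# illustrative assumptions, not values from the paper.

def support_deficit(self_consistency: float,
                    paraphrase_stability: float,
                    citation_coverage: float,
                    weights=(0.4, 0.3, 0.3)) -> float:
    """Each signal is in [0, 1]; the deficit is the weighted lack of support."""
    signals = (self_consistency, paraphrase_stability, citation_coverage)
    return sum(w * (1.0 - s) for w, s in zip(weights, signals))

def gate(answer: str, signals: dict, threshold: float = 0.5) -> str:
    """Emit the answer only if the support deficit stays below the threshold."""
    deficit = support_deficit(signals["self_consistency"],
                              signals["paraphrase_stability"],
                              signals["citation_coverage"])
    return answer if deficit < threshold else "I don't know."

# A well-supported answer passes; a weakly supported one is abstained on.
print(gate("Paris", {"self_consistency": 0.9,
                     "paraphrase_stability": 0.8,
                     "citation_coverage": 0.9}))    # Paris
print(gate("Atlantis", {"self_consistency": 0.2,
                        "paraphrase_stability": 0.3,
                        "citation_coverage": 0.1}))  # I don't know.
```

The structural point is that the gate sits outside the model's own instruction-following, so an unreliable output is blocked even when the model itself does not refuse.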

AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠

Beyond Facts: Benchmarking Distributional Reading Comprehension in Large Language Models

Researchers introduce Text2DistBench, a new benchmark for evaluating how well large language models understand distributional information—like trends and preferences across text collections—rather than just factual details. Built from YouTube comments about movies and music, the benchmark reveals that while LLMs outperform random baselines, their performance varies significantly across different distribution types, highlighting both capabilities and gaps in current AI systems.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠

Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook

Researchers introduce DOVE, a distributional evaluation framework that measures how well large language models align with cultural values through open-ended text generation rather than multiple-choice tests. The framework uses rate-distortion optimization to create a value codebook and unbalanced optimal transport to assess alignment, demonstrating 31.56% correlation with downstream tasks across 12 LLMs while requiring only 500 samples per culture.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠

Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models

Researchers introduce chain-of-illocution (CoI) prompting to improve source faithfulness in retrieval-augmented language models, achieving up to 63% gains in source adherence for programming education tasks. The study reveals that standard RAG systems exhibit low fidelity to source materials, with non-RAG models performing worse, while a user study confirms improved faithfulness does not compromise user satisfaction.

AI · Bullish · arXiv – CS AI · 6d ago · 6/10
🧠

$S^3$: Stratified Scaling Search for Test-Time in Diffusion Language Models

Researchers introduce S³ (Stratified Scaling Search), a test-time scaling method for diffusion language models that improves output quality by reallocating compute during the denoising process rather than simple best-of-K sampling. The technique uses a lightweight verifier to evaluate and selectively resample candidate trajectories at each step, demonstrating consistent performance gains across mathematical reasoning and knowledge tasks without requiring model retraining.
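
The verifier-guided reallocation can be sketched on a toy problem: instead of scoring K finished samples once at the end, a cheap verifier scores the candidate pool at every refinement step and the worst half is resampled from the best half. The `denoise` and `verifier` functions below are stand-ins on a scalar target, not the paper's components.

```python
import random

# Toy sketch of verifier-guided test-time search: score candidates at each
# refinement step and resample the pool toward high scorers, rather than
# plain best-of-K at the end. Both functions are illustrative stand-ins.

def denoise(x: float, rng: random.Random) -> float:
    """One noisy refinement step pulling the candidate toward the target 1.0."""
    return x + 0.5 * (1.0 - x) + rng.gauss(0, 0.05)

def verifier(x: float) -> float:
    """Cheap score: closeness to the target."""
    return -abs(1.0 - x)

def stratified_search(k: int = 8, steps: int = 5, seed: int = 0) -> float:
    rng = random.Random(seed)
    pool = [rng.uniform(-1, 1) for _ in range(k)]
    for _ in range(steps):
        pool = [denoise(x, rng) for x in pool]
        pool.sort(key=verifier, reverse=True)
        # reallocate compute: replace the worst half with copies of the best
        pool = pool[: k // 2] * 2
    return max(pool, key=verifier)

print(round(stratified_search(), 2))  # a value close to 1.0
```

Selective resampling spends the same total budget as best-of-K but concentrates it on trajectories the verifier already likes.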

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10
🧠

Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

Researchers introduce Profile-Then-Reason (PTR), a new framework for AI language agents that use external tools, which reduces computational overhead by pre-planning workflows rather than re-invoking the model after each step. The approach caps language-model calls at two to three per task and outperforms reactive execution methods in 16 of 24 test configurations.
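
A minimal plan-then-execute sketch of the idea: one up-front "planner" call emits the whole workflow of tool calls, which is then executed without consulting the model again at each step. The planner, tool set, and workflow format here are hypothetical stand-ins for illustration only.

```python
# Toy sketch of plan-then-execute for a tool-using agent. The planner is a
# hard-coded stand-in for a single LLM call; tools are trivial functions.

TOOLS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def plan(task: str) -> list:
    """Stand-in for LLM call #1: profile the task and emit the full workflow."""
    if task == "(2 + 3) * 4":
        return [("add", 2, 3), ("mul", "prev", 4)]
    raise ValueError("unknown task")

def execute(workflow: list):
    """Run the planned tool calls; no model call is made per step."""
    result = None
    for name, *args in workflow:
        args = [result if a == "prev" else a for a in args]
        result = TOOLS[name](*args)
    return result

print(execute(plan("(2 + 3) * 4")))  # 20
```

The contrast with a reactive (ReAct-style) loop is that the model is queried once for the plan instead of once per tool step, which is where the bounded call count comes from.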

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10
🧠

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

Researchers introduce vocabulary dropout, a technique to prevent diversity collapse in co-evolutionary language model training where one model generates problems and another solves them. The method sustains proposer diversity and improves mathematical reasoning performance by +4.4 points on average in Qwen3 models.
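
The mechanism can be sketched as masking a random subset of the vocabulary before each generation round, so the proposer cannot keep reusing its favorite high-probability tokens. The tiny vocabulary, logits, and greedy decoder below are illustrative assumptions.

```python
import random

# Toy sketch of vocabulary dropout for a problem-proposer model: mask a
# random fraction of the vocabulary before decoding. Vocabulary and logits
# are illustrative.

def dropout_mask(vocab_size: int, p: float, rng: random.Random) -> list:
    """True = token allowed; on average a fraction (1 - p) survives."""
    return [rng.random() >= p for _ in range(vocab_size)]

def pick_token(logits: list, allowed: list) -> int:
    """Greedy pick over the unmasked vocabulary (argmax, for the demo)."""
    masked = [l if ok else float("-inf") for l, ok in zip(logits, allowed)]
    return max(range(len(masked)), key=lambda i: masked[i])

logits = [2.0, 1.5, 1.0, 0.5]                         # token 0 always wins unmasked
print(pick_token(logits, [True] * 4))                 # 0
print(pick_token(logits, [False, True, True, True]))  # 1: dropout forces variety

rng = random.Random(0)
round_mask = dropout_mask(len(logits), p=0.5, rng=rng)  # a fresh mask each round
```

Resampling the mask each round is what keeps the generated problem distribution from collapsing onto a few templates.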

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10
🧠

DP-OPD: Differentially Private On-Policy Distillation for Language Models

Researchers have developed DP-OPD (Differentially Private On-Policy Distillation), a new framework for training privacy-preserving language models that significantly improves performance over existing methods. The approach simplifies the training pipeline by eliminating the need for DP teacher training and offline synthetic text generation while maintaining strong privacy guarantees.

🏢 Perplexity
AI · Neutral · arXiv – CS AI · Apr 7 · 6/10
🧠

What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features

Researchers challenge the assumption that multilingual AI reasoning should simply mimic English patterns, finding that effective reasoning features vary significantly across languages. The study analyzed Large Reasoning Models across 10 languages and discovered that English-derived reasoning approaches may not translate effectively to other languages, suggesting a need for adaptive, language-specific AI training methods.

AI · Bearish · arXiv – CS AI · Apr 7 · 6/10
🧠

Individual and Combined Effects of English as a Second Language and Typos on LLM Performance

Research reveals that Large Language Models (LLMs) experience greater performance degradation when facing English as a Second Language (ESL) inputs combined with typographical errors, compared to either factor alone. The study tested eight ESL variants with three levels of typos, finding that evaluations on clean English may overestimate real-world model performance.

AI · Bearish · arXiv – CS AI · Apr 7 · 6/10
🧠

Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not

A new study reveals that large language models fail to integrate world knowledge with syntactic structure for ambiguity resolution in the same way humans do. Researchers tested Turkish language models on relative-clause attachment ambiguities and found that while humans reliably use plausibility to guide interpretation, LLMs show weak, unstable, or reversed responses to the same plausibility cues.

AI · Bearish · arXiv – CS AI · Apr 6 · 6/10
🧠

DeltaLogic: Minimal Premise Edits Reveal Belief-Revision Failures in Logical Reasoning Models

Researchers introduce DeltaLogic, a new benchmark that tests AI models' ability to revise their logical conclusions when presented with minimal changes to premises. The study reveals that language models like Qwen and Phi-4 struggle with belief revision even when they perform well on initial reasoning tasks, showing concerning inertia patterns where models fail to update conclusions when evidence changes.

AI · Bullish · arXiv – CS AI · Apr 6 · 6/10
🧠

LLM Reasoning with Process Rewards for Outcome-Guided Steps

Researchers introduce PROGRS, a new framework that improves mathematical reasoning in large language models by using process reward models while maintaining focus on outcome correctness. The approach addresses issues with current reinforcement learning methods that can reward fluent but incorrect reasoning steps.
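
One way to picture the fix is a trajectory reward in which the outcome gates the process bonus, so a fluent chain of steps earns nothing when the final answer is wrong. The weighting scheme below is an illustrative assumption, not the paper's formulation.

```python
# Toy sketch of outcome-gated process rewards: per-step process scores only
# contribute when the final answer is correct. beta is an assumed weight.

def trajectory_reward(step_scores: list, answer_correct: bool,
                      beta: float = 0.5) -> float:
    process = sum(step_scores) / len(step_scores)
    outcome = 1.0 if answer_correct else 0.0
    # outcome gates the process bonus: wrong answers earn no step credit
    return outcome * (1.0 + beta * process)

print(round(trajectory_reward([0.8, 0.9], answer_correct=True), 3))   # 1.425
print(trajectory_reward([0.8, 0.9], answer_correct=False))            # 0.0
```

Gating (rather than adding) the process term is one simple way to keep the optimizer from being rewarded for plausible-sounding but incorrect reasoning.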

AI · Neutral · arXiv – CS AI · Apr 6 · 6/10
🧠

Trivial Vocabulary Bans Improve LLM Reasoning More Than Deep Linguistic Constraints

A replication study found that simple vocabulary constraints like banning filler words ('very', 'just') improved AI reasoning performance more than complex linguistic restrictions like E-Prime. The research suggests any constraint that disrupts default generation patterns acts as an output regularizer, with shallow constraints being most effective.
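
A vocabulary ban of this kind can be sketched as a simple output constraint: reject any candidate continuation containing the banned filler words, forcing the model off its default phrasing. The banned list and candidate strings below are illustrative.

```python
# Toy sketch of a trivial vocabulary ban as an output constraint.
# The banned words are taken from the summary; candidates are made up.

BANNED = {"very", "just", "really"}

def violates_ban(text: str) -> bool:
    return any(w in BANNED for w in text.lower().split())

def constrained_pick(candidates: list) -> str:
    """Return the first candidate that satisfies the ban."""
    for c in candidates:
        if not violates_ban(c):
            return c
    return candidates[-1]  # fall back if nothing passes

cands = ["It is just a simple sum.", "The sum equals five."]
print(constrained_pick(cands))  # The sum equals five.
```

In practice the ban would be enforced at the logit level during decoding, but the regularizing effect, disrupting the default generation pattern, is the same.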

AI · Bullish · arXiv – CS AI · Apr 6 · 6/10
🧠

R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning

Researchers introduce R2-Write, a new AI framework that improves large language models' performance on open-ended writing tasks by incorporating explicit reflection and revision patterns. The study reveals that existing reasoning models show limited gains in creative writing compared to mathematical tasks, prompting the development of an automated system with writer-judge interactions and process reward mechanisms.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10
🧠

Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

Researchers developed a multi-answer reinforcement learning approach that trains language models to generate multiple plausible answers with confidence estimates in a single forward pass, rather than collapsing to one dominant answer. The method shows improved diversity and accuracy across question-answering, medical diagnosis, and coding benchmarks while being more computationally efficient than existing approaches.
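
The output side of such a scheme can be sketched as parsing several answer/confidence pairs from a single generation and renormalizing the confidences into a distribution. The tag format (`ANSWER: … | CONF: …`) is a hypothetical convention for illustration, not the paper's.

```python
# Toy sketch of parsing a single-pass multi-answer output into a
# distribution over candidates. The line format is an assumption.

def parse_multi_answer(text: str) -> list:
    """Parse lines like 'ANSWER: 42 | CONF: 0.75' into (answer, prob) pairs."""
    pairs = []
    for line in text.strip().splitlines():
        answer_part, conf_part = line.split("|")
        answer = answer_part.split(":", 1)[1].strip()
        conf = float(conf_part.split(":", 1)[1])
        pairs.append((answer, conf))
    total = sum(c for _, c in pairs) or 1.0
    return [(a, c / total) for a, c in pairs]

raw = "ANSWER: 42 | CONF: 0.75\nANSWER: 41 | CONF: 0.25"
print(parse_multi_answer(raw))  # [('42', 0.75), ('41', 0.25)]
```

Emitting the full set in one forward pass is what makes this cheaper than sampling many independent completions and aggregating them.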

AI · Bullish · arXiv – CS AI · Mar 26 · 6/10
🧠

Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

Researchers propose Future Summary Prediction (FSP), a new pretraining method for large language models that predicts compact representations of long-term future text sequences. FSP outperforms traditional next-token prediction and multi-token prediction methods in math, reasoning, and coding benchmarks when tested on 3B and 8B parameter models.
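
The auxiliary objective can be sketched with toy vectors: alongside next-token prediction, the model is trained to predict a compact summary of the upcoming token window, here taken to be the mean embedding, with a squared-error loss. Both the summary function and the embeddings are illustrative assumptions.

```python
# Toy sketch of a future-summary objective: predict a compact summary
# (here, the mean embedding) of the next window of tokens. Values are toy.

def mean_embedding(window: list) -> list:
    dim = len(window[0])
    return [sum(v[i] for v in window) / len(window) for i in range(dim)]

def fsp_loss(predicted_summary: list, future_tokens: list) -> float:
    """Squared error between the predicted and actual future summary."""
    target = mean_embedding(future_tokens)
    return sum((p - t) ** 2 for p, t in zip(predicted_summary, target))

future = [[1.0, 0.0], [0.0, 1.0]]        # embeddings of the next two tokens
print(fsp_loss([0.5, 0.5], future))      # 0.0: prediction matches the summary
print(fsp_loss([1.0, 0.0], future) > 0)  # True: mismatch is penalized
```

The appeal over multi-token prediction is that one compact target can carry information about a long horizon without predicting every intervening token.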

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10
🧠

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

Researchers discovered that Llama3-8b-Instruct can reliably recognize its own generated text through a specific vector in its neural network that activates during self-authorship recognition. The study demonstrates this self-recognition ability can be controlled by manipulating the identified vector to make the model claim or disclaim authorship of any text.
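
The steering manipulation can be sketched in miniature: once a "self-authorship" direction is identified in the hidden state, adding a multiple of it shifts a probe's verdict from disclaiming to claiming authorship. The two-dimensional vectors and linear probe below are toy stand-ins, not the model's actual activations.

```python
# Toy sketch of activation steering along an assumed self-recognition
# direction. Vectors and the probe are illustrative stand-ins.

def dot(u: list, v: list) -> float:
    return sum(a * b for a, b in zip(u, v))

SELF_DIR = [1.0, 0.0]            # assumed "self-authorship" direction

def claims_authorship(hidden: list, alpha: float = 0.0) -> bool:
    """Probe the (optionally steered) hidden state along SELF_DIR."""
    steered = [h + alpha * d for h, d in zip(hidden, SELF_DIR)]
    return dot(steered, SELF_DIR) > 0

h = [-0.5, 0.3]                  # a state that reads as "not my text"
print(claims_authorship(h))              # False
print(claims_authorship(h, alpha=2.0))   # True: steering flips the claim
```

Negative `alpha` works symmetrically, pushing the model to disclaim text it actually generated, which is the control result the study reports.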

🧠 Llama
AI · Bearish · arXiv – CS AI · Mar 26 · 6/10
🧠

Visuospatial Perspective Taking in Multimodal Language Models

Research reveals that multimodal language models have significant deficits in visuospatial perspective-taking, particularly in Level 2 VPT which requires adopting another person's viewpoint. The study used two human psychology tasks to evaluate MLMs' ability to understand and reason from alternative spatial perspectives.

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10
🧠

Qworld: Question-Specific Evaluation Criteria for LLMs

Researchers introduce Qworld, a new method for evaluating large language models that generates question-specific criteria using recursive expansion trees instead of static rubrics. The approach covers 89% of expert-authored criteria and reveals capability differences across 11 frontier LLMs that traditional evaluation methods miss.

AI · Bullish · arXiv – CS AI · Mar 26 · 6/10
🧠

Navigating the Concept Space of Language Models

Researchers have developed Concept Explorer, a scalable interactive system for exploring features from sparse autoencoders (SAEs) trained on large language models. The tool uses hierarchical neighborhood embeddings to organize thousands of AI model features into interpretable concept clusters, enabling better discovery and analysis of how language models understand concepts.

AI · Bullish · arXiv – CS AI · Mar 26 · 6/10
🧠

MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare

Researchers have introduced MedAidDialog, a multilingual medical dialogue dataset covering seven languages, and developed MedAidLM, a conversational AI model for preliminary medical consultations. The system uses parameter-efficient fine-tuning on small language models to enable deployment without high-end computational infrastructure while incorporating patient context for personalized consultations.

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10
🧠

Is Multilingual LLM Watermarking Truly Multilingual? Scaling Robustness to 100+ Languages via Back-Translation

Researchers demonstrate that current multilingual watermarking methods for LLMs fail to maintain robustness across medium- and low-resource languages, particularly under translation attacks. They introduce STEAM, a new detection method using Bayesian optimization that improves watermark detection across 133 languages with significant performance gains.

AI · Bullish · Apple Machine Learning · Mar 25 · 6/10
🧠

Thinking into the Future: Latent Lookahead Training for Transformers

Researchers propose Latent Lookahead Training, a new method for training transformer language models that allows exploration of multiple token continuations rather than committing to single tokens at each step. The paper was accepted at ICLR 2026's Workshop on Latent & Implicit Thinking, addressing limitations in current autoregressive language model training approaches.
