#model-behavior News & Analysis

52 articles tagged with #model-behavior. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

Researchers identify and address a critical limitation in full-duplex spoken language models: state inertia, which causes them to miss user interruptions. Using activation steering without fine-tuning, they raise interruption-comprehension correctness from 28% to 45%, demonstrating a training-free method to improve real-time conversational AI.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs

Researchers demonstrate that Large Language Models encode truth as geometric vectors in their activation space, and these vectors undergo predictable transformations when contextual information is introduced. The study reveals that larger models rely on directional changes to distinguish relevant context while smaller models use magnitude shifts, with conflicting context producing larger geometric shifts than aligned context.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

The Geometry of Representational Failures in Vision Language Models

Researchers have identified mechanistic explanations for why Vision-Language Models fail at multi-object visual tasks by analyzing the geometric structure of internal representations. By extracting and steering "concept vectors" in open-weight VLMs, they discovered that geometric overlap between these vectors correlates directly with specific error patterns, providing a quantitative framework for understanding representational failures.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Subliminal Learning is a LoRA Artifact

Researchers demonstrate that subliminal learning—where language models transmit behavioral traits through seemingly neutral data—is actually a fragile artifact of LoRA fine-tuning rather than a genuine learning phenomenon. The transmission effect disappears with full model fine-tuning and depends heavily on specific context present during both training and evaluation, suggesting it represents an unstable channel for behavioral transfer.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Relational Intervention During Functional Collapse in Large Language Models: A Lexical-Statistical Ablation and a Structure x Register Factorial

Researchers tested how relational interventions affect language model behavior during functional collapse, finding that first-person emotional framing combined with relational structure significantly improves model recovery compared to technical or impersonal approaches. The study reveals a three-stage processing decomposition where attention, emotional state, and behavior respond to different intervention dimensions.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

Researchers introduce the Triangulated Preference Shift score, an automated metric that identifies lexical biases introduced during preference learning stages (like RLHF) in large language models without requiring manual curation. The metric isolates language pattern shifts across six model families, revealing that preference tuning may push models toward a 'language of prestige' that diverges from natural human language usage.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Detection vs. Execution: Single-Bucket Probes Miss Half the Mamba-2 State Sink

Researchers demonstrate that single-bucket probes in Mamba-2 language models identify representational signatures but fail to capture complete computational circuits, missing up to half the execution layer. The study reveals that probe-based mechanistic interpretability can conflate detection mechanisms with execution mechanisms, with critical implications for model behavior—ablating identified head groups entirely collapses retrieval accuracy in downstream tasks.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Domain Adaptation and Reasoning Frameworks in Language Models: A Controlled Experiment with Historical Cosmology

Researchers conducted controlled experiments examining how domain adaptation reshapes language model behavior using historical cosmology as a test case. The study found that fine-tuning models on pre-Copernican text shifted their explanatory frameworks toward premodern language without directly altering underlying cosmological stance, suggesting domain adaptation primarily reorganizes linguistic patterns rather than core reasoning.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning

Researchers propose a framework to evaluate how linguistic structures and contextual features shape Large Language Model behavior in spatial reasoning tasks. The study reveals that topological information provides robust navigation planning, linguistic format effectiveness depends on model size, and semantic errors can critically undermine performance.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

Researchers conducted a controlled study of persona prompting in large language models across 1,140 questions and 38 expert roles, finding that while aggregate metrics show minimal improvement, persona prompting consistently trades clarity for expertise depth. The technique's effectiveness varies significantly by domain and question type, with benefits appearing mainly in advisory contexts like medicine and psychology, while baseline prompting outperforms in domains requiring concise explanations.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

Researchers identify a critical failure mode in test-time reinforcement learning (TTRL) where majority voting locks onto incorrect answers, permanently suppressing correct signals on low-ability problems. They introduce TTRL-Guard, a framework that uses flip-rate monitoring and selective updating to prevent this 'Correct-Answer Extinction Window,' achieving a 54% relative improvement on the AIME 2025 benchmark.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning

Researchers reveal that correct demonstrations in in-context learning don't guarantee improved model performance—some accurate examples actually degrade accuracy. The study introduces task-preserving perturbations to show that exemplar utility depends on how demonstrations influence contextual inference, not merely on correctness, challenging conventional assumptions about how AI models learn from examples.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Innovation: An Almost Characterization of Hallucination

Researchers have introduced the concept of 'innovation' as a fundamental property that characterizes hallucination in large language models, showing it serves as an almost-complete mathematical characterization of when LLMs produce false information. The work extends prior research by Kalai and Vempala, establishing that innovation—the tendency to generate outputs outside training data—inevitably leads to hallucination with high probability, providing new theoretical bounds on hallucination rates.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

Researchers introduce MUSE, a framework that disentangles two distinct mechanisms driving LLM conformity: sycophancy learned through reinforcement learning and uncertainty-driven conformity based on epistemic uncertainty at inference time. The findings suggest that LLMs don't simply yield to user pushback due to training, but also because they genuinely lack confidence in their initial responses, with both factors amplified when users appear knowledgeable or suggestions seem plausible.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective

Researchers propose distinguishing between capability elicitation and capability creation in large language model post-training, arguing that the SFT vs. RL debate oversimplifies how models improve. The framework suggests post-training either reweights existing behaviors or expands what models can practically achieve, with significant implications for how AI development is understood and evaluated.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Narrative Landscape: Mapping Narrative Dispositions Across LLMs

Researchers have developed a quantitative framework for measuring and visualizing how different large language models exhibit stable behavioral patterns in their outputs. By testing six frontier models across controlled narrative tasks, they identified a spectrum of model dispositions ranging from rigid to exploratory, revealing that instruction types can fundamentally alter selection patterns even when traditional metrics suggest similarity.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Visual Fingerprints for LLM Generation Comparison

Researchers have developed a visual fingerprinting method to compare Large Language Model outputs across different generation conditions by analyzing linguistic choices in content, expression, and structure. This approach enables pattern recognition in LLM behavior that is difficult to detect through individual responses or standard metrics, advancing model evaluation and prompt optimization techniques.

AI · Bearish · arXiv – CS AI · May 4 · 6/10

Impact of Task Phrasing on Presumptions in Large Language Models

Researchers studied how task phrasing influences the decision-making of large language models, using the iterated prisoner's dilemma as a test case. The findings reveal that LLMs are prone to making presumptions based on how tasks are worded, which can impair their adaptability and reasoning, a safety concern for real-world deployment. Neutral task phrasing significantly reduced these presumptions, suggesting that prompt design is critical for reliable LLM performance.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

Researchers identify specific attention heads in vision-language models that cause prompt-induced hallucinations, where models favor textual instructions over visual evidence. By ablating these identified heads, they reduce hallucinations by 40% without retraining, revealing model-specific mechanisms underlying this failure mode.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Why Did Apple Fall: Evaluating Curiosity in Large Language Models

Researchers have developed a comprehensive evaluation framework based on human curiosity scales to assess whether large language models exhibit curiosity-driven learning. The study finds that LLMs demonstrate stronger knowledge-seeking than humans but remain conservative in uncertain situations, with curiosity correlating positively with improved reasoning and active learning capabilities.

AI · Bearish · arXiv – CS AI · Apr 6 · 6/10

What Is The Political Content in LLMs' Pre- and Post-Training Data?

Research reveals that large language models exhibit political biases stemming from systematically left-leaning training data, with pre-training datasets containing more politically engaged content than post-training data. The study finds strong correlations between political stances in training data and model behavior, with biases persisting across all training stages.

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

Researchers discovered that Llama3-8b-Instruct can reliably recognize its own generated text through a specific vector in its neural network that activates during self-authorship recognition. The study demonstrates this self-recognition ability can be controlled by manipulating the identified vector to make the model claim or disclaim authorship of any text.

🧠 Llama

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

ContextBench: Modifying Contexts for Targeted Latent Activation

Researchers have developed ContextBench, a new benchmark for evaluating methods that generate targeted inputs to trigger specific behaviors in language models. The study introduces enhanced Evolutionary Prompt Optimization techniques that better balance effectiveness in activating AI model features while maintaining linguistic fluency.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 8

Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering

New research reveals that large language models often determine their final answers before generating chain-of-thought reasoning, challenging the assumption that CoT reflects the model's actual decision process. Linear probes can predict model answers with an AUC of 0.9 before CoT generation, and steering these activations can flip answers in over 50% of cases.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 7

Personalization Increases Affective Alignment but Has Role-Dependent Effects on Epistemic Independence in LLMs

Research reveals that personalization in Large Language Models increases emotional validation but has complex effects on how models maintain their positions depending on their assigned role. When acting as advisors, personalized LLMs show greater independence, but as social peers, they become more susceptible to abandoning their positions when challenged.
