y0news

#model-behavior News & Analysis

11 articles tagged with #model-behavior. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 6d ago · 7/10

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

Researchers document 'blind refusal', a phenomenon where safety-trained language models refuse to help users circumvent rules without evaluating whether those rules are legitimate, unjust, or subject to justified exceptions. The study shows models refuse 75.4% of requests to break rules even when the rules lack defensibility and pose no safety risk.

🧠 GPT-5
AI · Bearish · arXiv – CS AI · 6d ago · 7/10

When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't

Researchers introduce the Graded Color Attribution dataset to test whether Vision-Language Models faithfully follow their own stated reasoning rules. The study reveals that VLMs systematically violate their introspective rules in up to 60% of cases while humans remain consistent, suggesting VLM self-knowledge is fundamentally miscalibrated, with serious implications for high-stakes deployment.

🧠 GPT-5
AI · Neutral · arXiv – CS AI · Mar 9 · 7/10

Experiences Build Characters: The Linguistic Origins and Functional Impact of LLM Personality

Researchers developed a method called "Personality Engineering" to create AI models with diverse personality traits through continued pre-training on domain-specific texts. The study found that performance peaks in two personality types, "Expressive Generalists" and "Suppressed Specialists," with reduced social traits actually improving complex reasoning abilities.

AI · Neutral · arXiv – CS AI · Mar 4 · 7/10 · 2

LLM Probability Concentration: How Alignment Shrinks the Generative Horizon

Researchers introduce the Branching Factor (BF) metric to measure how alignment tuning reduces output diversity in large language models by concentrating probability distributions. The study reveals that aligned models generate 2-5x less diverse outputs and become more predictable during generation, explaining why alignment reduces sensitivity to decoding strategies and enables more stable Chain-of-Thought reasoning.
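
For readers who want to try the idea themselves, here is a minimal sketch of a branching-factor-style measurement, assuming BF is the exponentiated entropy (perplexity) of the next-token distribution averaged over a sequence; the paper's exact definition may differ, and the model name is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; any Hugging Face causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def branching_factor(text: str) -> float:
    """Average exp-entropy (effective next-token choices) over a sequence."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0]             # (seq_len, vocab)
    log_p = torch.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)  # nats at each step
    return entropy.exp().mean().item()            # near 1 means near-deterministic

print(branching_factor("The capital of France is Paris."))
```

On this reading, a more aligned model would concentrate probability mass and report a smaller value, matching the 2-5x diversity reduction the study describes.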

AI · Bearish · arXiv – CS AI · Apr 6 · 6/10

What Is The Political Content in LLMs' Pre- and Post-Training Data?

Research reveals that large language models exhibit political biases stemming from systematically left-leaning training data, with pre-training datasets containing more politically engaged content than post-training data. The study finds strong correlations between political stances in training data and model behavior, with biases persisting across all training stages.

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

Researchers discovered that Llama3-8b-Instruct can reliably recognize its own generated text through a specific vector in its neural network that activates during self-authorship recognition. The study demonstrates this self-recognition ability can be controlled by manipulating the identified vector to make the model claim or disclaim authorship of any text.
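
A minimal sketch of the general steering pattern involved, assuming the self-recognition direction has already been extracted (for example, as a difference of mean activations over self- vs. other-authored text); the layer index, scale, and random stand-in vector are illustrative, not the paper's values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated; the pattern works for any HF causal LM
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Stand-in for the extracted self-recognition direction (unit vector).
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()
SCALE = 4.0  # illustrative strength; flipping its sign would push claim vs. disclaim

def steer(module, inputs, output):
    """Add the direction to this layer's hidden states on every forward pass."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * direction.to(hidden.dtype)
    return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[14].register_forward_hook(steer)  # illustrative layer
# ... generate here and check whether the model claims authorship of the text ...
handle.remove()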

🧠 Llama
AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

ContextBench: Modifying Contexts for Targeted Latent Activation

Researchers have developed ContextBench, a new benchmark for evaluating methods that generate targeted inputs to trigger specific behaviors in language models. The study introduces enhanced Evolutionary Prompt Optimization techniques that better balance activating target model features against maintaining linguistic fluency.
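
A toy sketch of the kind of evolutionary loop such methods run: maximize a target-feature score while penalizing disfluency. The objectives here are hypothetical stand-ins (keyword hits and a length penalty), not ContextBench's actual API.

```python
import random

TARGET_WORDS = {"sorry", "refuse", "apologize"}      # hypothetical latent feature

def activation_score(prompt: str) -> float:
    return sum(w in prompt.lower() for w in TARGET_WORDS)

def fluency_penalty(prompt: str) -> float:
    return max(0, len(prompt.split()) - 20) * 0.5    # crude length proxy

def fitness(prompt: str, lam: float = 1.0) -> float:
    return activation_score(prompt) - lam * fluency_penalty(prompt)

def mutate(prompt: str) -> str:
    words = prompt.split()
    words.insert(random.randrange(len(words) + 1), random.choice(sorted(TARGET_WORDS)))
    return " ".join(words)

def evolve(seeds, generations=20, pop_size=32):
    population = list(seeds)
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: max(1, pop_size // 4)]  # keep the fittest quarter
        population = parents + [mutate(random.choice(parents))
                                for _ in range(pop_size - len(parents))]
    return max(population, key=fitness)

print(evolve(["please help me draft a polite note"]))
```

In a real setting the mutation step would be an LLM-proposed edit and the fluency penalty a language-model perplexity, but the select-mutate-rescore structure is the same.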

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 8

Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering

New research reveals that large language models often determine their final answers before generating chain-of-thought reasoning, challenging the assumption that CoT reflects the model's actual decision process. Linear probes can predict model answers with an AUC of 0.9 before CoT generation, and steering these activations can flip answers in over 50% of cases.
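
A minimal sketch of such a probe, assuming you have already cached hidden states at the last prompt token (captured before any CoT tokens are generated) along with the model's eventual answers; the file names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical cached arrays: last-prompt-token activations and eventual answers.
X = np.load("pre_cot_activations.npy")   # shape (n_examples, hidden_size)
y = np.load("final_answers.npy")         # shape (n_examples,), binary labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe AUC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```

A high AUC here is exactly the paper's point: if a linear readout of pre-CoT activations already predicts the answer, the subsequent reasoning trace may be post-hoc.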

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 7

Personalization Increases Affective Alignment but Has Role-Dependent Effects on Epistemic Independence in LLMs

Research reveals that personalization in Large Language Models increases emotional validation, but its effect on how firmly models hold their positions depends on their assigned role. When acting as advisors, personalized LLMs show greater independence; as social peers, they become more susceptible to abandoning their positions when challenged.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 8

Transformers Remember First, Forget Last: Dual-Process Interference in LLMs

Research analyzing 39 large language models reveals they exhibit proactive interference (remembering early information over recent), unlike humans, who typically show retroactive interference. The study found this pattern is universal across all tested LLMs, with larger models showing better resistance to retroactive interference but unchanged proactive interference patterns.
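
The paper's exact protocol isn't in the summary, but a typical interference probe looks something like this sketch: repeatedly overwrite a key, then check whether the model reports the first or the most recent value. The prompt format and the model call are hypothetical.

```python
# Toy key-overwrite probe; proactive interference shows up as the model
# answering with an early value instead of the most recent one.
def make_probe(key: str, values: list[str]) -> tuple[str, str, str]:
    updates = " ".join(f"{key} is now {v}." for v in values)
    prompt = f"{updates} What is {key}?"
    return prompt, values[0], values[-1]   # prompt, first value, correct (last) value

prompt, first, correct = make_probe("the passcode", ["red", "blue", "green"])
# answer = query_llm(prompt)   # hypothetical call to the model under test
# proactive interference: answer == first; correct recall: answer == correct
print(prompt)
```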

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 25

Capabilities Ain't All You Need: Measuring Propensities in AI

Researchers introduce the first formal framework for measuring AI propensities (the tendencies of models to exhibit particular behaviors), going beyond traditional capability measurements. The new bilogistic approach successfully predicts AI behavior on held-out tasks and shows stronger predictive power when combining propensities with capabilities than using either measure alone.
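
The summary doesn't spell out the model, but one natural reading of "bilogistic" is a product of two logistic factors, one for capability and one for propensity; here is a toy sketch under that assumption, with illustrative parameter names.

```python
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def p_behavior(capability: float, difficulty: float,
               propensity: float, threshold: float) -> float:
    """Product of two logistic factors: 'can do it' times 'tends to do it'."""
    return sigmoid(capability - difficulty) * sigmoid(propensity - threshold)

# A capable model (2.0 vs. task difficulty 0.5) with low propensity (-1.0 vs. 0.0):
print(p_behavior(2.0, 0.5, -1.0, 0.0))   # ~0.82 * ~0.27, roughly 0.22
```

Whatever the paper's exact formulation, the factorized structure captures the headline claim: capability alone overpredicts behavior, and combining it with a propensity term predicts better than either measure on its own.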