11 articles tagged with #model-behavior. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Neutral · arXiv – CS AI · 6d ago · 7/10
🧠 Researchers document 'blind refusal': a phenomenon where safety-trained language models refuse to help users circumvent rules without evaluating whether those rules are legitimate, unjust, or have justified exceptions. The study shows models refuse 75.4% of requests to break rules even when the rules lack defensibility and pose no safety risk.
🧠 GPT-5
AI · Bearish · arXiv – CS AI · 6d ago · 7/10
🧠 Researchers introduce the Graded Color Attribution dataset to test whether Vision-Language Models faithfully follow their own stated reasoning rules. The study reveals that VLMs systematically violate their introspective rules in up to 60% of cases, while humans remain consistent, suggesting VLM self-knowledge is fundamentally miscalibrated with serious implications for high-stakes deployment.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · Mar 9 · 7/10
🧠 Researchers developed a method called "Personality Engineering" to create AI models with diverse personality traits through continued pre-training on domain-specific texts. The study found that performance peaks for two personality types, "Expressive Generalists" and "Suppressed Specialists," with reduced social traits actually improving complex reasoning abilities.
AI · Neutral · arXiv – CS AI · Mar 4 · 7/10
🧠 Researchers introduce the Branching Factor (BF) metric to measure how alignment tuning reduces output diversity in large language models by concentrating probability distributions. The study reveals that aligned models generate 2-5x less diverse outputs and become more predictable during generation, explaining why alignment reduces sensitivity to decoding strategies and enables more stable Chain-of-Thought reasoning.
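A diversity metric like BF can be illustrated with a small sketch: one plausible reading (the paper's exact definition may differ) is the exponentiated mean per-step entropy of the next-token distributions, i.e. the effective number of branches the model keeps live at each step.

```python
import numpy as np

def branching_factor(step_probs):
    """Exponentiated mean per-step entropy of next-token distributions:
    one plausible reading of a BF-style metric, not the paper's exact
    definition."""
    entropies = []
    for p in step_probs:
        p = np.asarray(p, dtype=float)
        p = p / p.sum()          # normalize defensively
        nz = p[p > 0]            # ignore zero-probability tokens
        entropies.append(-(nz * np.log(nz)).sum())
    return float(np.exp(np.mean(entropies)))

# A flat (base-model-like) distribution vs. a peaked (aligned-like) one:
flat = [[0.25, 0.25, 0.25, 0.25]] * 3    # BF = 4.0: four live branches
peaked = [[0.97, 0.01, 0.01, 0.01]] * 3  # BF ~ 1.2: nearly deterministic
```

Under this reading, alignment's concentration of probability mass shows up directly as a lower effective branching factor, which also explains the reduced sensitivity to decoding strategies.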
AI · Bearish · arXiv – CS AI · Apr 6 · 6/10
🧠 Research reveals that large language models exhibit political biases stemming from systematically left-leaning training data, with pre-training datasets containing more politically engaged content than post-training data. The study finds strong correlations between political stances in training data and model behavior, with biases persisting across all training stages.
AI · Neutral · arXiv – CS AI · Mar 26 · 6/10
🧠 Researchers discovered that Llama3-8b-Instruct can reliably recognize its own generated text through a specific vector in its neural network that activates during self-authorship recognition. The study demonstrates this self-recognition ability can be controlled by manipulating the identified vector to make the model claim or disclaim authorship of any text.
🧠 Llama
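The steering result can be sketched abstractly: adding (or subtracting) a scaled direction to a hidden state pushes a downstream readout toward claiming or disclaiming authorship. The vector and the readout below are illustrative stand-ins, not the paper's extracted Llama3 artifacts.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32

# Illustrative stand-ins: a unit "self-authorship" direction and a toy
# linear readout that says whether the model would claim authorship.
self_vec = rng.normal(size=d)
self_vec /= np.linalg.norm(self_vec)

def claims_authorship(hidden):
    return float(hidden @ self_vec) > 0

def steer(hidden, vector, scale):
    """Shift a hidden state along a steering direction."""
    return hidden + scale * vector

hidden = rng.normal(size=d)
claimed = claims_authorship(steer(hidden, self_vec, 10.0))        # pushed to claim
disclaimed = not claims_authorship(steer(hidden, self_vec, -10.0))  # pushed to disclaim
```

The sign of the scale decides which way the readout flips, mirroring the claim/disclaim control described in the summary.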
AI · Neutral · arXiv – CS AI · Mar 9 · 6/10
🧠 Researchers have developed ContextBench, a new benchmark for evaluating methods that generate targeted inputs to trigger specific behaviors in language models. The study introduces enhanced Evolutionary Prompt Optimization techniques that better balance effectiveness in activating AI model features against linguistic fluency.
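The evolutionary search can be illustrated with a toy loop: mutate a candidate prompt and keep it whenever a fitness score does not drop. Real Evolutionary Prompt Optimization scores feature activation and fluency with a model; the vocabulary and overlap objective here are stand-ins that only show the loop structure.

```python
import random

random.seed(0)
VOCAB = ["safe", "rule", "color", "model", "probe", "steer", "claim"]
TARGET = {"rule", "probe", "steer"}  # hypothetical feature-activating tokens

def fitness(prompt):
    # Stand-in objective: how many target tokens the prompt covers.
    return len(set(prompt) & TARGET)

def mutate(prompt):
    # Replace one random position with a random vocabulary token.
    p = list(prompt)
    p[random.randrange(len(p))] = random.choice(VOCAB)
    return p

prompt = [random.choice(VOCAB) for _ in range(5)]
for _ in range(200):
    candidate = mutate(prompt)
    if fitness(candidate) >= fitness(prompt):  # keep non-worse candidates
        prompt = candidate
```

Because candidates are only accepted when fitness does not decrease, the loop climbs toward prompts that cover the target token set; a fluency term would be added to the objective in the real method.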
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10
🧠 New research reveals that large language models often determine their final answers before generating chain-of-thought reasoning, challenging the assumption that CoT reflects the model's actual decision process. Linear probes can predict model answers with 0.9 AUC before CoT generation, and steering these activations can flip answers in over 50% of cases.
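The probing setup can be sketched with synthetic data: plant an "answer direction" in fake pre-CoT activations, fit a mean-difference linear probe on a train split, and measure AUC on held-out examples. The planted direction and signal scale are assumptions for illustration, not the paper's measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 64
answers = rng.integers(0, 2, size=n)   # eventual yes/no answers
direction = rng.normal(size=d)         # planted "answer direction" (assumption)
acts = rng.normal(size=(n, d)) + np.outer(answers - 0.5, direction)

# Fit a mean-difference linear probe on the first 800 examples.
tr, te = slice(0, 800), slice(800, None)
w = (acts[tr][answers[tr] == 1].mean(axis=0)
     - acts[tr][answers[tr] == 0].mean(axis=0))
scores = acts[te] @ w

# Held-out AUC via the Mann-Whitney rank statistic.
pos, neg = scores[answers[te] == 1], scores[answers[te] == 0]
auc = float((pos[:, None] > neg[None, :]).mean())
```

When the answer is linearly decodable from pre-CoT activations, a probe this simple already separates the classes, which is the kind of evidence behind the 0.9 AUC figure.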
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10
🧠 Research reveals that personalization in Large Language Models increases emotional validation but has complex effects on how models maintain their positions depending on their assigned role. When acting as advisors, personalized LLMs show greater independence, but as social peers, they become more susceptible to abandoning their positions when challenged.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10
🧠 Research analyzing 39 large language models reveals they exhibit proactive interference (early information interfering with recall of recent information), unlike humans, who typically show retroactive interference. The study found this pattern is universal across all tested LLMs, with larger models showing better resistance to retroactive interference but unchanged proactive interference patterns.
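The interference setup implies a key-update recall probe: a key receives several values in sequence and the model is asked for the current one. Answering with the first value is a proactive-interference error; failing to retain the latest value is retroactive. The prompt wording below is an assumption, not the paper's template.

```python
# Build a minimal key-update probe (illustrative prompt format).
updates = [("box", "apple"), ("box", "pen"), ("box", "coin")]
prompt = "\n".join(f"The {k} now contains a {v}." for k, v in updates)
prompt += "\nQ: What does the box contain now?"

correct = updates[-1][1]          # "coin": the most recent value
proactive_error = updates[0][1]   # "apple": early info overriding recent
```

Scoring a model's answer against `correct` versus `proactive_error` is enough to classify which interference pattern it exhibits on each probe.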
AI · Bullish · arXiv – CS AI · Mar 2 · 7/10
🧠 Researchers introduce the first formal framework for measuring AI propensities, the tendencies of models to exhibit particular behaviors, going beyond traditional capability measurements. The new bilogistic approach successfully predicts AI behavior on held-out tasks and shows stronger predictive power when combining propensities with capabilities than when using either measure alone.
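One plausible reading of a "bilogistic" model, assumed here purely for illustration: the probability that a model exhibits a behavior on a task factors into a capability term (can it do the task?) and a propensity term (does it tend to?), each logistic. The functional form and parameters below are not the paper's exact formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_behavior(capability, difficulty, propensity, threshold):
    """Product of a capability factor and a propensity factor
    (illustrative bilogistic form, an assumption)."""
    return sigmoid(capability - difficulty) * sigmoid(propensity - threshold)

# High capability but low propensity: the behavior stays rare.
p = p_behavior(3.0, 0.0, -2.0, 0.0)  # ~0.11
```

A product form like this captures why combining both measures predicts behavior better than either alone: a capable model with low propensity, or an eager model without the capability, both exhibit the behavior rarely.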