AI · Bearish · arXiv – CS AI · 1d ago · 7/10
🧠Researchers identify a critical failure mode called Cherry-pick Override (CCO) where large language model judges make unsafe directional commitments when evaluating mixed evidence containing both supporting and refuting claims. The study demonstrates that LLM judges incorrectly return definitive verdicts on over 84% of conflicting-evidence cases instead of acknowledging ambiguity, with panel voting amplifying rather than mitigating this bias.
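A minimal sketch of why panel voting can amplify rather than dampen a shared tendency to commit. It is not the study's method: it assumes each judge independently returns a definitive verdict with the same probability, and it uses 0.84 (the summary reports "over 84%") purely as an illustrative input. The function name `majority_commit_rate` is hypothetical.

```python
from math import comb


def majority_commit_rate(p: float, n: int) -> float:
    """Probability that a strict majority of n judges returns a definitive
    verdict, assuming each judge does so independently with probability p
    (the independence is an assumption, not a finding of the study)."""
    need = n // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )


if __name__ == "__main__":
    # Illustrative per-judge rate taken from the "over 84%" figure above.
    for panel_size in (1, 3, 5):
        print(panel_size, majority_commit_rate(0.84, panel_size))
```

Under these assumptions, any per-judge rate above 0.5 grows as the panel gets larger, so majority voting pushes the panel further toward a definitive verdict instead of toward acknowledging ambiguity.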
AI · Bearish · arXiv – CS AI · 2d ago · 7/10
🧠A study compares how human annotators and large language models (GPT-4o-mini, Llama-3.3-70B) assign political ideology labels to news articles. It finds that fine-tuned GPT-4o-mini models learn spurious correlations between sentiment and ideology that are absent from human judgments, which exposes a critical vulnerability in using LLM annotations as training data for downstream tasks.
🧠 GPT-4 · 🧠 Llama
AI · Neutral · arXiv – CS AI · May 29 · 7/10
🧠Researchers introduce PRAIB, a benchmark framework that evaluates how large language models perform peer review relative to human reviewers. Analysis of 11,000 LLM-generated reviews across major AI conferences reveals significant behavioral divergences: LLM ratings show less variability, a positive bias, and overconfidence, and LLM reviews frequently miss atomic weaknesses that human reviewers catch.
AI · Bearish · arXiv – CS AI · May 27 · 7/10
🧠Researchers demonstrate BITE, a black-box adversarial attack framework that exploits stylistic biases in LLM judges to artificially inflate evaluation scores while preserving semantic meaning. The attack achieves over 65% success rates across diverse LLM judges and tasks, exposing fundamental vulnerabilities in using language models for objective evaluation.
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠A comprehensive empirical study reveals that weight pruning, a technique for compressing large language models for edge devices, paradoxically amplifies bias while preserving performance metrics. The research shows that activation-aware pruning methods maintain perplexity but increase stereotype reliance by up to 84%, suggesting that current evaluation methods fail to detect fairness degradation in compressed models.
🏢 Perplexity
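A minimal sketch of why an unchanged perplexity need not mean unchanged behavior. Perplexity is the exponential of the mean negative log-probability per token, so it is an average that does not record which tokens received the probability mass. The per-token probabilities below are hypothetical values chosen for illustration, not data from the study, and the names `dense` and `pruned` are placeholders.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token probabilities assigned to the same four-token
# sequence by two models (illustrative only, not measured values).
dense = [math.log(p) for p in (0.5, 0.25, 0.5, 0.25)]
pruned = [math.log(p) for p in (0.25, 0.5, 0.25, 0.5)]

if __name__ == "__main__":
    # Both lists have the same mean log-probability, so the perplexities
    # match even though the models disagree on every individual token.
    print(perplexity(dense), perplexity(pruned))
```

In this toy case the two models disagree on every token yet score identically, which is the kind of shift an aggregate metric cannot surface; a separate fairness-oriented evaluation would be needed to detect it.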
AI · Neutral · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers discover that the 'Lost in the Middle' phenomenon in transformer models, where a model performs poorly on content in the middle of its context but well on content at the beginning and end, is an inherent architectural property present even before training begins. The U-shaped performance bias stems from the mathematical structure of causal decoders with residual connections, which creates a 'factorial dead zone' in middle positions.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers developed a framework separating language proficiency from cultural knowledge access in large language models across 13 locales and 80 models. The study reveals that while English outperforms local languages on culture-agnostic questions, local languages consistently show advantages for accessing culture-specific knowledge once proficiency gaps are controlled for. This finding challenges the assumption that weaker local-language LLM performance indicates weaker cultural knowledge.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Research comparing 120 base and aligned language model pairs reveals that alignment training makes models more normative but less descriptive of actual human behavior. Base models predict real human choices in multi-round strategic games 10 times better, while aligned models excel only in single-shot, textbook scenarios where human behavior follows rational expectations.
AI · Bearish · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce OmniBehavior, a benchmark for evaluating large language models' ability to simulate real-world human behavior across complex, long-horizon scenarios. The study reveals that current LLMs struggle with authentic behavioral simulation and exhibit systematic biases toward homogenized, overly positive personas instead of capturing individual differences and realistic long-tail behaviors.
AI · Bearish · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers studied how persona vectors, AI steering techniques that inject personality traits into large language models, affect educational applications such as essay generation and automated grading. The study found that persona steering significantly degrades answer quality, with substantially larger negative impacts on open-ended humanities tasks than on factual science questions, and that AI scorers exhibit predictable bias patterns based on their assigned personality traits.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 3
🧠Researchers identified 'internal bias' as a key cause of overthinking in AI reasoning models, where models form preliminary guesses that conflict with systematic reasoning. The study found that excessive attention to input questions triggers redundant reasoning steps, and current mitigation methods have proven ineffective.