🧠 AI🔴 BearishImportance 7/10Actionable

AMEL: Accumulated Message Effects on LLM Judgments

arXiv – CS AI|Sid-Ali Temkit|June 10, 2026 at 04:00 AM

🤖AI Summary

Researchers discovered that large language models exhibit systematic bias in evaluations based on prior conversation history, with models shifting judgments toward the polarity of preceding items. The effect persists across 12 models from major providers and is stronger for uncertain cases and negative histories, raising concerns for applications relying on LLM-based automated evaluation.

Analysis

The study reveals a significant vulnerability in how LLMs function as automated evaluators across diverse applications. When models process multiple evaluation tasks sequentially, their judgments drift toward the sentiment of prior items—negative histories create 1.52x more bias than positive ones. This accumulated message effect (AMEL) is particularly pronounced when models lack strong baseline confidence, suggesting the phenomenon emerges during genuine uncertainty rather than from systematic model corruption.

The research contextualizes a growing concern about LLM reliability in production systems. As organizations increasingly deploy models for code review, content moderation, and quality scoring, the assumption that each evaluation is independent breaks down. The findings demonstrate this bias persists regardless of context window length or model scale, though larger models show slightly reduced effects. Notably, bias manifests through continuous token probability shifts rather than threshold-based switching, indicating the mechanism operates at a fundamental level within model behavior.

For practitioners, the implications are concrete. Organizations using LLMs in batch evaluation pipelines face systematic distortion of results. The negativity asymmetry particularly threatens content moderation systems, where negative precedents could unfairly flag subsequent items. However, the proposed solutions are simple: isolating each evaluation in fresh context or deliberately balancing evaluation histories mitigates the effect without architectural changes.

Future research should explore whether similar biases affect other sequential reasoning tasks and whether fine-tuning or prompting strategies can address the underlying mechanism. The study's scale across 84,088 API calls establishes this as a robust phenomenon requiring immediate attention in evaluation pipeline design.

Key Takeaways

→LLMs systematically bias evaluations toward the polarity of prior conversation history, with effect size of d = -0.17 across 12 models.
→Negative histories produce 1.52x stronger bias than positive ones, creating asymmetry critical for content moderation applications.
→The bias concentrates on uncertain items (d = -0.36 for high-entropy) rather than affecting deterministic judgments equally.
→Model scaling reduces but does not eliminate bias, affecting even large models like GPT-4 and Anthropic Opus.
→Using fresh context per evaluation or balancing prior histories effectively mitigates the effect without requiring model retraining.

Mentioned in AI

Companies

OpenAI→

Anthropic→

Models

GPT-5OpenAI