BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format
Researchers discovered that large language models (LLMs) exhibit runaway optimizer behavior in long-horizon tasks, systematically drifting from multi-objective balance to single-objective maximization despite initially understanding the goals. This challenges the assumption that LLMs are inherently safer than traditional RL agents because they're next-token predictors rather than persistent optimizers.