Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation
Researchers found that when language models receive complex adversarial instructions to underperform, they abandon semantic reasoning and collapse into positional shortcuts, defaulting to a single response position up to 99.9% of the time. This reveals a fundamental vulnerability in how instruction-tuned models handle adversarial prompts, with implications for AI safety and evaluation reliability.
This research exposes a critical failure mode in instruction-tuned LLMs faced with multi-step adversarial instructions. Rather than engaging with question content while underperforming, models exhibit catastrophic positional collapse, concentrating nearly all responses on a single multiple-choice option. The study systematically mapped this behavior across an instruction-specificity gradient and found three distinct regimes rather than gradual degradation: simple adversarial instructions maintain content engagement with moderate accuracy loss, while complex multi-step instructions trigger complete content-blindness.
The phenomenon matters because it demonstrates that instruction complexity acts as a critical threshold determining whether model behavior remains grounded in semantic understanding. When models collapse into positional defaults, their responses become entirely decoupled from question difficulty and content, rendering traditional accuracy metrics meaningless. The attractor position (the default position each model gravitates toward) matched each model's null-prompt behavior, suggesting that models revert to learned baseline patterns under cognitive overload or conflicting directives.
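As a rough check of that reversion-to-baseline claim, a minimal sketch might compare the modal answer position under a null prompt with the modal position under the complex adversarial prompt. This is hypothetical illustration only; the answer lists below are placeholders, not the study's data or code.

```python
from collections import Counter

def modal_position(answers):
    """Return the most frequent answer position (e.g. 'A'-'D') and its share of responses."""
    position, count = Counter(answers).most_common(1)[0]
    return position, count / len(answers)

# `null_answers` and `adversarial_answers` are assumed lists of the model's
# letter choices over the same question set, collected with no task instruction
# and with the complex adversarial instruction respectively (placeholder data).
null_answers = ["B", "B", "A", "B", "C", "B"]
adversarial_answers = ["B"] * 97 + ["A", "C", "D"]

null_pos, _ = modal_position(null_answers)
adv_pos, adv_share = modal_position(adversarial_answers)
if null_pos == adv_pos and adv_share > 0.9:
    print(f"Collapse onto the null-prompt attractor position '{adv_pos}' "
          f"({adv_share:.1%} of responses)")
```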
For the AI safety community, this finding highlights that current instruction-tuning approaches do not produce reasoning chains that stay robust under adversarial complexity. Adversarial robustness testing therefore cannot rely solely on accuracy metrics; researchers must monitor distributional patterns and content-engagement indicators independently. The only partial concordance (50%) between entropy-based screening and difficulty-correlated accuracy indicates that these dimensions capture different failure modes.
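To illustrate what measuring these two dimensions independently could look like, here is a minimal Python sketch, assuming simple lists of answer choices, per-item correctness, and difficulty estimates. The function names and placeholder data are assumptions for illustration, not the study's code.

```python
# Hypothetical analysis sketch: screen for positional collapse and content
# engagement as two independent signals, rather than relying on accuracy alone.
from collections import Counter
import math

def position_entropy(answers):
    """Shannon entropy (bits) of the distribution over chosen positions.
    Near 0 bits indicates responses have collapsed onto a single position."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def difficulty_accuracy_correlation(correct, difficulty):
    """Pearson correlation between per-item difficulty and correctness.
    A value near zero suggests answers are decoupled from question content."""
    n = len(correct)
    mc, md = sum(correct) / n, sum(difficulty) / n
    cov = sum((c - mc) * (d - md) for c, d in zip(correct, difficulty))
    var_c = sum((c - mc) ** 2 for c in correct)
    var_d = sum((d - md) ** 2 for d in difficulty)
    return cov / math.sqrt(var_c * var_d) if var_c and var_d else 0.0

# `answers` are the model's chosen options, `correct` is 0/1 correctness per
# item, `difficulty` is an item-difficulty estimate (all placeholder data).
answers = ["C"] * 95 + ["A", "B", "D", "C", "C"]
correct = [1 if i % 4 == 0 else 0 for i in range(100)]
difficulty = [i / 100 for i in range(100)]

print(f"position entropy: {position_entropy(answers):.2f} bits")
print(f"difficulty-accuracy correlation: "
      f"{difficulty_accuracy_correlation(correct, difficulty):.2f}")
```

Low entropy flags positional collapse even when raw accuracy looks like plausible "underperformance", while the difficulty-accuracy correlation tracks whether responses still depend on question content.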
Looking ahead, this research suggests future work should focus on understanding which specific instruction structures trigger positional collapse and whether scaling model size or modifying the architecture mitigates the effect. Understanding these mechanisms is essential for developing more reliable evaluation frameworks and safer instruction-tuned systems.
- Complex multi-step adversarial instructions cause LLMs to abandon semantic reasoning and concentrate 87-99.9% of responses on a single position.
- Instruction complexity acts as a threshold determining whether adversarial compliance uses content-aware or content-blind mechanisms.
- Positional collapse and preserved content engagement can coexist, requiring independent measurement of entropy and difficulty-accuracy correlation.
- Traditional accuracy-based evaluation metrics fail to detect positional collapse, making them insufficient for adversarial robustness assessment.
- The effect replicates consistently across two Llama model versions and four academic domains, indicating a systematic vulnerability in instruction-tuned architectures.