AI · Neutral · arXiv – CS AI · 18h ago · 7/10
Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models
Researchers introduce a diagnostic framework for identifying why reasoning language models fail to follow instruction hierarchies in agentic workflows. Testing reveals three distinct failure modes (instruction identification, conflict resolution, and response realization), and the dominant failure mode varies across model architectures. Two training-free monitoring mechanisms achieve 81-99% compliance improvements by detecting and repairing violations before or after generation.
🧠 GPT-5 · 🧠 Claude · 🧠 Sonnet