Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models
Researchers systematically investigated whether Large Language Models can decouple fundamental reasoning patterns from specific problem instances by introducing reasoning conflicts between parametric knowledge and contextual instructions. The study reveals that LLMs prioritize task-appropriate reasoning over compliance with conflicting instructions, though mechanistic interventions at the activation level can steer models toward better instruction following by up to 29%.
This research addresses a fundamental challenge in AI controllability: whether LLMs' reasoning can be independently controlled or remains bound to patterns learned from training data. The study introduces the concept of reasoning conflicts—deliberate misalignments between instructed logical schemas and task-appropriate patterns—to probe this question systematically. The findings suggest LLMs exhibit a sensibility bias, preferring task-coherent reasoning even when explicitly instructed otherwise, which raises important questions about instruction-following reliability in real-world deployments.
The research builds on growing concerns about LLM controllability and alignment. As these models become more influential in critical applications, understanding whether their reasoning can be reliably steered becomes essential. Previous work focused on prompt engineering and fine-tuning, but this study takes a mechanistic approach, examining how reasoning patterns are encoded in neural activations across model layers.
The practical implications are significant for both AI safety and commercial deployment. For developers building LLM-based systems, the finding that confidence scores drop during reasoning conflicts offers an early warning signal for detecting problematic model behavior. The discovery that reasoning types are linearly encoded in middle-to-late layers suggests activation-level steering could become a powerful tool for alignment without retraining. However, the observation that larger models rely more heavily on parametric memory points to increasing controllability challenges as model scale grows, potentially complicating safety efforts in frontier models.
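The activation-level steering mentioned above can be sketched with a common difference-of-means construction. Everything below is illustrative: the activations are synthetic stand-ins for hidden states collected from a real model, and the steering recipe (mean of instruction-following activations minus mean of task-default activations, added back with a scaling coefficient) is one standard approach, not necessarily the paper's exact intervention.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden size (illustrative)

# Synthetic mid-layer activations for two behavior classes
# (stand-ins for activations recorded from a real model).
follow = rng.normal(0.0, 1.0, (200, d)) + 0.8   # instruction-following runs
default = rng.normal(0.0, 1.0, (200, d)) - 0.8  # task-default runs

# Steering vector: difference of class means, normalized to unit length.
v = follow.mean(axis=0) - default.mean(axis=0)
v /= np.linalg.norm(v)

def steer(h: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Shift a hidden state toward the instruction-following direction."""
    return h + alpha * v

# A task-default activation moves toward the "follow" cluster after steering:
h = default[0]
before = float(np.dot(h, v))
after = float(np.dot(steer(h), v))
print(before < after)  # projection onto v increases by exactly alpha
```

In practice the shifted hidden state would be written back into the model's residual stream at a chosen layer during the forward pass; the coefficient `alpha` trades off compliance gains against degradation of other capabilities.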
- LLMs consistently prioritize task-appropriate reasoning over explicit conflicting instructions, revealing a fundamental sensibility bias.
- Confidence scores measurably drop during reasoning conflicts, enabling detection of misaligned model behavior.
- Reasoning patterns are linearly encoded in middle-to-late transformer layers, enabling potential activation-level interventions.
- Mechanistic interventions can improve instruction-following compliance by up to 29% without architectural changes.
- Larger models show greater reliance on internalized parametric memory, potentially complicating controllability at scale.
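The claim that reasoning types are linearly encoded is typically verified with a linear probe: a simple linear classifier trained on layer activations that recovers the reasoning type well above chance. The sketch below uses synthetic activations and a closed-form ridge-regression probe purely for illustration; the paper's probing setup and data are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 64, 400

# Synthetic "layer activations": Gaussian noise plus a class-dependent
# shift along one hidden direction, mimicking a linearly encoded feature.
X = rng.normal(0.0, 1.0, (n, d))
y = (rng.random(n) < 0.5).astype(float)          # reasoning type 0 or 1
direction = rng.normal(0.0, 1.0, d) / np.sqrt(d)  # ~unit-norm direction
X += np.outer(2.0 * y - 1.0, direction) * 2.0

# Closed-form ridge probe: w = (X^T X + lam*I)^{-1} X^T t, with t in {-1,+1}.
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ (2.0 * y - 1.0))

pred = (X @ w > 0).astype(float)
acc = (pred == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

If the feature were not linearly encoded, no choice of `w` could separate the classes and accuracy would stay near 0.5; high probe accuracy at a given layer is the evidence behind the middle-to-late-layer finding.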