Answer Engineering: Local Trajectory Editing for Protocol-Constrained Decision Making in Large Language Models
Researchers present Answer Engineering, a runtime technique that improves large language model compliance with procedural protocols by editing reasoning trajectories during generation. Testing on clinical decision-making shows the method increased protocol adherence from 25-54% to 78-84% without retraining models, addressing a critical safety gap in high-stakes domains.
Answer Engineering addresses a fundamental challenge in deploying large language models to regulated domains: models often generate confident but procedurally incorrect outputs even when capable of sound reasoning. The technique represents a pragmatic middle ground between full retraining and unguided generation, using deterministic runtime interventions to steer model outputs toward protocol compliance.
The research emerges from growing recognition that LLMs excel at reasoning but struggle with systematic rule adherence in specialized fields. Clinical decision-making serves as an ideal test case because protocols are explicit, outcomes are measurable, and errors carry direct consequences. The benchmark results reveal a critical insight: step-by-step reasoning alone actually worsened performance on some tasks, shifting rather than eliminating errors. This finding challenges assumptions that chain-of-thought prompting universally improves reliability.
The 80.7% balanced accuracy achieved through local trajectory editing represents meaningful progress for high-stakes applications. The approach's appeal lies in its deployment efficiency: it requires no model retraining, so it is immediately applicable to existing systems. However, the paper identifies significant limitations: the method depends on comprehensive rule coverage and reliable trigger mechanisms, and it does not remove the underlying diagnosis-first generation bias, which persists despite interventions.
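For readers unfamiliar with the metric, balanced accuracy is conventionally the mean of per-class recall, which stops a dominant class from inflating the score. The sketch below assumes that standard definition; the summary does not say exactly how the paper computes its figure, and the labels and data here are invented for illustration.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall, so every class counts equally."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Invented data: 8 compliant cases, 2 violations, one violation missed.
y_true = ["comply"] * 8 + ["violate"] * 2
y_pred = ["comply"] * 8 + ["comply", "violate"]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("plain accuracy:", plain)
print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
```

In this toy case plain accuracy looks strong because the majority class is easy, while balanced accuracy exposes the missed violation.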
For the AI industry, this work validates runtime control as a practical safety mechanism while exposing the gap between reasoning capability and protocol adherence. The findings suggest that production LLM deployments in regulated sectors may require layered approaches combining multiple intervention points rather than relying solely on instruction-tuning or prompting. Future development will likely focus on generalizing these techniques across domains and automatically deriving rule sets from protocol documentation.
- Answer Engineering improves clinical protocol compliance from 25-54% to 78-84% without retraining models through runtime trajectory editing
- Step-by-step reasoning shifted errors rather than eliminating them, suggesting chain-of-thought alone is insufficient for procedural compliance
- The deterministic approach provides auditable runtime control, addressing transparency requirements in regulated industries
- Method effectiveness depends on comprehensive rule coverage and trigger reliability, revealing scalability limitations
- Results support layered safety architectures combining multiple intervention mechanisms for high-stakes LLM deployment