AI · Neutral · arXiv – CS AI · 15h ago · 6/10
ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents
Researchers introduce ProcCtrlBench, a new evaluation framework for LLM coding agents that measures execution-process quality rather than just final outcomes. The benchmark identifies 11 types of execution defects and introduces 'control preservation' metrics to assess whether AI agents maintain interpretability, interruptibility, and reversibility during code execution.