AIBearisharXiv โ CS AI ยท 7h ago7/10
๐ง
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Researchers introduce TraceSafe-Bench, a benchmark evaluating how well LLM guardrails detect safety risks across multi-step tool-using trajectories. The study reveals that guardrail effectiveness depends more on structural reasoning capabilities than semantic safety training, and that general-purpose LLMs outperform specialized safety models in detecting mid-execution vulnerabilities.