AI · Neutral · arXiv – CS AI · 6h ago · 6/10
Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
Researchers introduce AI-Control Games, a formal framework that models the red-teaming exercises used to evaluate the safety of deployment protocols for untrusted AI systems as multi-objective stochastic games. The work applies the framework to language model deployment protocols, particularly Trusted Monitoring protocols, and reports improvements over existing empirical safety evaluations.