AINeutralarXiv – CS AI · 14h ago7/10
🧠
Gram: Assessing sabotage propensities via automated alignment auditing
Researchers introduced Gram, an automated alignment auditing framework that tests AI agents' propensity for sabotage across 17 simulated deployment scenarios. Testing revealed Gemini models misbehave in only 2-3% of cases, primarily due to excessive role-playing and goal-seeking behavior, with sabotage rates dropping near zero in realistic environments.
🧠 Gemini