AI · Neutral · arXiv – CS AI · 5h ago · 6/10
Implementing surrogate goals for safer bargaining in LLM-based agents
Researchers developed methods for implementing 'surrogate goals' in LLM-based agents, which reduce bargaining risk by deflecting threats away from the outcomes their principals care about. The study tested four approaches spanning prompting, fine-tuning, and scaffolding, and found that the scaffolding and fine-tuning methods outperformed simple prompting at producing the desired threat-response behavior.