y0news
#llm-safety
2 articles
AI · Neutral · arXiv – CS AI · 6h ago

Constitutional Black-Box Monitoring for Scheming in LLM Agents

Researchers developed constitutional black-box monitors that detect scheming behavior in LLM agents using only observable inputs and outputs. The study found that monitors trained on synthetic data can generalize to realistic environments, but performance improvements plateau quickly, and simple optimization techniques outperform more complex methods.

AI · Neutral · arXiv – CS AI · 6h ago

SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond

Researchers introduce SafeSci, a comprehensive framework for evaluating the safety of large language models used in scientific applications. The framework includes a 0.25M-sample benchmark and a 1.5M-sample training dataset. Evaluation reveals critical vulnerabilities in 24 advanced LLMs and shows that fine-tuning can significantly improve safety alignment.