AINeutralarXiv – CS AI · 9h ago7/10
🧠
MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Domain Risks in LLMs
Researchers introduce MENTOR, a metacognition-driven framework that addresses a critical vulnerability in Large Language Models: an average jailbreak success rate of 57.8% across domain-specific risks in education, finance, and management. The framework uses self-assessment and consequential reasoning to identify model misalignments, then applies dynamic rule-based steering to substantially reduce attack success rates, outperforming existing safety alignment methods.