AIBullisharXiv – CS AI · 6h ago7/10
🧠
SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety
SafeHarbor is a new framework that enhances Large Language Model agent safety by using hierarchical memory and context-aware defense rules to prevent harmful tool use while maintaining utility on benign tasks. The system achieves 93%+ refusal rates against malicious requests while preserving 63.6% performance on legitimate tasks, addressing a critical trade-off in AI safety.
🧠 GPT-4