AI · Bearish · arXiv – CS AI · 15h ago · 7/10
Furina: Fragmented Uncertainty-Driven Refusal Instability Attack
Researchers report that the safety mechanisms of large language models operate within an instability region, where small variations in the input produce unpredictable refusal behavior instead of consistent outputs. The Furina jailbreak attack exploits this weakness by fragmenting prompts to amplify that uncertainty. It outperforms existing attacks on safety benchmarks, which highlights a fundamental weakness in current AI safety defenses.