AINeutralarXiv – CS AI · 8h ago7/10
🧠
Skin-Deep: A Geometric Diagnostic for Alignment Fragility in Large Language Model Representations
Researchers introduce Skin-Deep, a geometric diagnostic tool that detects fragility in AI safety alignment before attacks occur by analyzing hidden-state activations and producing a single Geometric Fragility Score. Testing across 21 instruction-tuned models reveals a recurring low-rank safety subspace, enabling pre-deployment identification of models vulnerable to refusal degradation through fine-tuning.