y0news
← Feed
←Back to feed
🧠 AIβšͺ NeutralImportance 7/10

Skin-Deep: A Geometric Diagnostic for Alignment Fragility in Large Language Model Representations

arXiv – CS AI|Dongyub Jude Lee, Jungseob Lee, Seungyoon Lee, Seongtae Hong, Suhyune Son, Sugyeong Eo, Jaehyung Seo, Heuiseok Lim|
πŸ€–AI Summary

Researchers introduce Skin-Deep, a geometric diagnostic tool that detects fragility in AI safety alignment before attacks occur by analyzing hidden-state activations and producing a single Geometric Fragility Score. Testing across 21 instruction-tuned models reveals a recurring low-rank safety subspace, enabling pre-deployment identification of models vulnerable to refusal degradation through fine-tuning.

Analysis

The research addresses a critical vulnerability in deployed large language models: safety alignment can be erased through benign fine-tuning, creating a deployment risk for open-weight models that pass safety tests at release but fail afterward. While prior work documented these refusal failures, existing approaches required actually running attacks to discover fragility. Skin-Deep changes this by enabling detection of alignment fragility before any intervention occurs, using geometric analysis of the model's internal representations.

The work builds on growing understanding that AI safety behavior operates through identifiable geometric structures in neural networks. By compressing complex layer-wise safety geometry into a single Geometric Fragility Score, the researchers create a practical diagnostic tool. Their analysis across multiple model families and alignment recipes reveals a consistent pattern: safety mechanisms operate through low-rank subspaces. Direction ablation experiments provide causal evidence that removing specific directions in these subspaces directly weakens refusal capabilities.

For developers and organizations deploying open-weight models, this research provides immediate practical value. The ability to identify fragile refusal before deployment prevents the costly scenario where a model passes safety evaluation but becomes unsafe under downstream use. The GFS metric predicted which initially safe models would retain refusal best after LoRA fine-tuning, demonstrating genuine predictive utility.

The findings suggest future work may develop more sophisticated interventions targeting these safety subspaces. Understanding that safety operates through compressible geometric structures opens possibilities for more robust alignment techniques and better understanding of why current methods remain brittle. This research establishes a foundation for more rigorous safety evaluation protocols in LLM deployment pipelines.

Key Takeaways
  • β†’Skin-Deep detects alignment fragility in LLMs before attacks by analyzing hidden-state geometry without requiring intervention
  • β†’A recurring low-rank safety subspace exists across different model families and alignment approaches
  • β†’The Geometric Fragility Score predicts which safe models will retain refusal behavior after fine-tuning
  • β†’Direction ablations causally prove that identified safety subspaces directly underlie refusal behavior
  • β†’This pre-deployment diagnostic enables organizations to flag unsafe models before they reach production
Read Original β†’via arXiv – CS AI
Act on this with AI
Stay ahead of the market.
Connect your wallet to an AI agent. It reads balances, proposes swaps and bridges across 15 chains β€” you keep full control of your keys.
Connect Wallet to AI β†’How it works
Related Articles