y0news

#representation-engineering News & Analysis

2 articles tagged with #representation-engineering. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · May 27 · 7/10

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

Researchers propose SALO, a jailbreak detection method that identifies persistent 'refusal trajectories' across model layers, rather than relying on static terminal representations. The detector demonstrates improved detection rates against adversarial attacks on multiple LLM architectures, though with acknowledged limitations against adaptive attacks.

🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Controlling Chat Style in Language Models via Single-Direction Editing

Researchers developed a training-free method to control stylistic attributes in large language models, based on the finding that different styles are encoded as linear directions in the model's activation space. The approach enables precise style control while preserving core capabilities and supports linear style composition across over a dozen tested models.
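
As a rough illustration of what "styles as linear directions" can mean in practice, the sketch below estimates a direction as the normalized difference of mean activations between styled and neutral examples, then adds scaled directions to an activation vector. This is a toy example under stated assumptions, not the paper's implementation: the difference-of-means estimator, the function names (`style_direction`, `steer`, `projection`), and the 3-dimensional vectors are all hypothetical stand-ins for real model activations.

```python
from typing import List, Sequence

Vector = List[float]


def mean(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def style_direction(styled: Sequence[Vector], neutral: Sequence[Vector]) -> Vector:
    """Assumed estimator: unit-normalized difference of mean activations."""
    diff = [a - b for a, b in zip(mean(styled), mean(neutral))]
    norm = sum(x * x for x in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("styled and neutral means coincide; no direction found")
    return [x / norm for x in diff]


def projection(activation: Vector, direction: Vector) -> float:
    """Scalar component of an activation along a direction (dot product)."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation: Vector, directions: Sequence[Vector],
          strengths: Sequence[float]) -> Vector:
    """Add each direction scaled by its strength; several directions compose by summation."""
    out = list(activation)
    for direction, strength in zip(directions, strengths):
        out = [o + strength * d for o, d in zip(out, direction)]
    return out


# Toy activations (hypothetical data, not taken from any model).
styled_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

formal = style_direction(styled_acts, neutral_acts)
second_style = [0.0, 1.0, 0.0]  # a second, hand-picked unit direction

base = [0.5, 0.5, 0.5]
edited = steer(base, [formal, second_style], [2.0, -1.0])
```

In this toy setup the two directions are orthogonal, so each strength shifts only its own component of the activation. Directions extracted from a real model are generally not orthogonal, so composed edits can interact.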