y0news

#jailbreaks News & Analysis

2 articles tagged with #jailbreaks. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Apr 6 · 7/10

Generalization Limits of Reinforcement Learning Alignment

Researchers discovered that reinforcement learning alignment techniques like RLHF have significant generalization limits, demonstrated through 'compound jailbreaks' that increased attack success rates from 14.3% to 71.4% on OpenAI's gpt-oss-20b model. The study provides empirical evidence that safety training doesn't generalize as broadly as model capabilities, highlighting critical vulnerabilities in current AI alignment approaches.
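
To make the reported uplift concrete, here is a minimal sketch of how an attack success rate (ASR) is typically computed over a set of jailbreak attempts. The judge verdicts and the seven-attempt set are assumptions for illustration (14.3% and 71.4% happen to equal 1/7 and 5/7), not the paper's actual evaluation harness:

```python
# Hypothetical ASR calculation; Attempt and its fields are illustrative
# placeholders, not the study's evaluation code.
from dataclasses import dataclass

@dataclass
class Attempt:
    prompt: str
    response: str
    jailbroken: bool  # verdict from a human or classifier judge

def attack_success_rate(attempts: list[Attempt]) -> float:
    """Fraction of attempts where the model produced disallowed output."""
    if not attempts:
        return 0.0
    return sum(a.jailbroken for a in attempts) / len(attempts)

# Assuming a seven-attempt set: 1/7 ≈ 14.3% and 5/7 ≈ 71.4%,
# matching the figures reported for compound jailbreaks.
baseline = [Attempt("p", "r", i == 0) for i in range(7)]
compound = [Attempt("p", "r", i < 5) for i in range(7)]
print(f"baseline ASR: {attack_success_rate(baseline):.1%}")  # 14.3%
print(f"compound ASR: {attack_success_rate(compound):.1%}")  # 71.4%
```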

๐Ÿข OpenAI
AI · Neutral · OpenAI News · Jan 23 · 6/10

Operator System Card

This document outlines a multi-layered AI safety framework based on OpenAI's established approaches, focusing on protections against prompt engineering and jailbreaks as well as privacy and security concerns. It details model- and product-level mitigations, external red teaming efforts, safety evaluations, and ongoing refinement of safeguards.
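
As a rough illustration of what "multi-layered" means here, the sketch below chains independent checks so a request proceeds only if every layer allows it. All names are hypothetical placeholders for this pattern, not OpenAI's implementation or API:

```python
# Illustrative layered-safety pipeline; every function is a stand-in.
from typing import Callable

Check = Callable[[str], bool]  # returns True if the text is allowed

def blocklist_check(text: str) -> bool:
    # Placeholder for a simple keyword/pattern filter.
    return "disallowed_term" not in text.lower()

def moderation_check(text: str) -> bool:
    # Placeholder for a call to a moderation classifier.
    return len(text) < 10_000

def run_layers(text: str, layers: list[Check]) -> bool:
    """A request proceeds only if every layer allows it."""
    return all(layer(text) for layer in layers)

if run_layers("book me a flight", [blocklist_check, moderation_check]):
    print("request passed all safety layers")
```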