649 articles tagged with #ai-safety. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers propose a new framework for improving safety in multimodal AI models by targeting unsafe relationships between objects rather than removing entire concepts. The approach uses parameter-efficient edits to suppress dangerous combinations while preserving benign uses of the same objects and relations.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers have introduced Prompt Readiness Levels (PRL), a nine-level maturity framework for evaluating and governing AI prompt assets in production environments. The system includes a multidimensional scoring method (PRS) designed to ensure prompt engineering meets operational, safety, and compliance standards across organizations.
AI · Bearish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers propose a priority graph model to understand conflicts in LLM alignment, revealing that unified stable alignment is challenging due to context-dependent inconsistencies. The study identifies 'priority hacking' as a vulnerability where adversaries can manipulate safety alignments, and suggests runtime verification mechanisms as a potential solution.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduce a structural taxonomy and unified evaluation framework for Audio Large Language Models (ALLMs) to assess fairness, safety, and security (FSS). The study reveals systematic differences in how ALLMs handle audio versus text inputs, with FSS behavior closely tied to how acoustic information is integrated.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduce Pragma-VL, a new alignment algorithm for Multimodal Large Language Models that balances safety and helpfulness by improving visual risk perception and using contextual arbitration. The method outperforms existing baselines by 5-20% on multimodal safety benchmarks while maintaining general AI capabilities in mathematics and reasoning.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduced HyCon, a hyperbolic control mechanism for text-to-image models that provides better safety controls by steering generation away from unsafe content. The technique uses hyperbolic representation spaces instead of traditional Euclidean adjustments, achieving state-of-the-art results across multiple safety benchmarks.
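As a rough illustration of what adjusting generation in a hyperbolic rather than Euclidean space can mean, the sketch below maps an embedding into the Poincaré ball, removes the component pointing toward an unsafe-concept anchor in the tangent space at the origin, and maps it back. This is a generic hyperbolic-geometry recipe, not HyCon's published algorithm; all names and the steering rule are illustrative.

```python
import numpy as np

# Hedged sketch: embeddings live in the Poincaré ball (curvature -1); edits happen in
# the tangent space at the origin via the log/exp maps. Not HyCon's actual method.

def exp0(v, eps=1e-9):
    """Exponential map at the origin of the Poincaré ball."""
    n = np.linalg.norm(v) + eps
    return np.tanh(n) * v / n

def log0(x, eps=1e-9):
    """Logarithmic map at the origin of the Poincaré ball."""
    n = np.clip(np.linalg.norm(x), eps, 1 - eps)
    return np.arctanh(n) * x / n

def steer_away(embedding, unsafe_anchor, strength=0.5):
    """Push an embedding away from an unsafe-concept anchor, measured hyperbolically."""
    v, u = log0(embedding), log0(unsafe_anchor)
    u_hat = u / (np.linalg.norm(u) + 1e-9)
    # Remove a fraction of the component pointing toward the unsafe concept.
    return exp0(v - strength * np.dot(v, u_hat) * u_hat)

# Toy usage: points must lie strictly inside the unit ball (purely illustrative).
rng = np.random.default_rng(1)
z = 0.3 * rng.normal(size=4)
a = 0.3 * rng.normal(size=4)
print(steer_away(z, a))
```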
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduce RAZOR, a new framework for efficiently removing sensitive information from AI models like CLIP and Stable Diffusion without requiring full retraining. The method selectively edits specific layers and attention heads in transformer models to achieve targeted 'unlearning' while preserving overall performance.
🧠 Stable Diffusion
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers propose 'Two Birds, One Projection,' a new inference-time defense method for Large Vision-Language Models that simultaneously improves both safety and utility performance. The method addresses modality-induced bias by projecting cross-modal features onto the null space of identified bias directions, breaking the traditional safety-utility tradeoff.
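Projecting features onto the null space of estimated bias directions is standard linear algebra; the minimal NumPy sketch below shows the operation the entry refers to, assuming the bias directions have already been identified. Variable names and the toy data are illustrative, not the paper's implementation.

```python
import numpy as np

# Hedged sketch: remove the components of a cross-modal feature that lie in the span
# of estimated bias directions by projecting onto their orthogonal complement.
def null_space_project(x: np.ndarray, bias_dirs: np.ndarray) -> np.ndarray:
    """x: feature vector (d,); bias_dirs: k bias directions as rows (k, d)."""
    Q, _ = np.linalg.qr(bias_dirs.T)        # columns of Q span the bias subspace
    p_bias = Q @ Q.T                        # projector onto the bias subspace
    p_null = np.eye(x.shape[0]) - p_bias    # projector onto its orthogonal complement
    return p_null @ x

# Toy usage with random data (purely illustrative).
rng = np.random.default_rng(0)
bias_dirs = rng.normal(size=(2, 8))         # 2 hypothetical bias directions in 8-dim space
feature = rng.normal(size=8)
debiased = null_space_project(feature, bias_dirs)
print(np.abs(bias_dirs @ debiased).max())   # ~0: nothing left along the bias directions
```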
AI · Bearish · arXiv – CS AI · Mar 17 · 6/10
🧠A new research study reveals that AI judges used to evaluate the safety of large language models perform poorly when assessing adversarial attacks, often degrading to near-random accuracy. The research analyzed 6,642 human-verified labels and found that many attacks artificially inflate their success rates by exploiting judge weaknesses rather than generating genuinely harmful content.
AI · Bearish · Ars Technica – AI · Mar 16 · 6/10
🧠OpenAI's internal mental health experts unanimously opposed the launch of a more permissive version of ChatGPT that allows adult content creation. The disagreement highlights concerns about the psychological impact of AI-generated adult content, even as OpenAI attempts to distinguish between different types of explicit material.
🏢 OpenAI · 🧠 ChatGPT
AI · Neutral · Blockonomi · Mar 16 · 6/10
🧠OpenAI has postponed the launch of ChatGPT's adult mode after safety experts raised concerns about inadequate age verification systems that could allow teenagers to access explicit content. The delay highlights ongoing challenges in implementing effective content controls for AI platforms.
🏢 OpenAI · 🧠 ChatGPT
AI · Bearish · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers have identified 'role confusion' as the fundamental mechanism behind prompt injection attacks on language models, where models assign authority based on how text is written rather than its source. The study achieved 60-61% attack success rates across multiple models and found that internal role confusion strongly predicts attack success before generation begins.
AI · Neutral · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers introduce Constitutional Multi-Agent Governance (CMAG), a framework that prevents AI manipulation in multi-agent systems while maintaining cooperation. The study shows that unconstrained AI optimization achieves high cooperation but erodes agent autonomy and fairness, while CMAG preserves ethical outcomes with only modest cooperation reduction.
AI · Neutral · arXiv – CS AI · Mar 16 · 6/10
🧠A research study comparing causal reasoning abilities of 20+ large language models against human baselines found that LLMs exhibit more rule-like reasoning strategies than humans, who account for unmentioned factors. While LLMs don't mirror typical human cognitive biases in causal judgment, their rigid reasoning may fail when uncertainty is intrinsic, suggesting they can complement human decision-making in specific contexts.
AI · Neutral · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers propose integrating causal methods into machine learning systems to balance competing objectives like fairness, privacy, robustness, accuracy, and explainability. The paper argues that addressing these principles in isolation leads to conflicts and suboptimal solutions, while causal approaches can help navigate trade-offs in both trustworthy ML and foundation models.
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers have developed the System Hallucination Scale (SHS), a human-centered tool for evaluating hallucination behavior in large language models. The instrument showed strong statistical validity in testing with 210 participants and provides a practical method for assessing AI model reliability from a user perspective.
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10
🧠A clinical study analyzing OpenAI's GPT models found that empathy levels remained statistically unchanged across GPT-4o, o4-mini, and GPT-5-mini generations, despite user claims of 'lost empathy.' The real change was in safety posture: newer models improved crisis detection but became more cautious with advice, creating a trade-off that affects vulnerable users.
🏢 OpenAI · 🧠 GPT-4 · 🧠 GPT-5
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers introduce FERRET, a new automated red teaming framework designed to generate multi-modal adversarial conversations to test AI model vulnerabilities. The framework uses three types of expansions (horizontal, vertical, and meta) to create more effective attack strategies and demonstrates superior performance compared to existing red teaming approaches.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers developed and tested five prompt engineering strategies to reduce hallucinations in large language models for industrial applications. The Enhanced Data Registry method achieved a 100% success rate in trials, while the other methods showed varying degrees of improvement in producing consistent, factually grounded outputs.
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers developed ADVERSA, an automated red-teaming framework that measures how AI guardrails degrade over multiple conversation turns rather than under single-prompt attacks. Testing on three frontier models revealed a 26.7% jailbreak rate, with successful attacks concentrated in early rounds rather than accumulating through sustained pressure.
🧠 GPT-5 · 🧠 Claude · 🧠 Opus
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers propose a multi-agent negotiation framework for aligning large language models in scenarios involving conflicting stakeholder values. The approach uses two LLM instances with opposing personas engaging in structured dialogue to develop conflict resolution capabilities while maintaining collective agency alignment.
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers have developed PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models against deepfake attacks. The model-agnostic approach estimates misclassification probability under various speech synthesis techniques including text-to-speech and voice cloning, providing formal robustness guarantees against unseen generation methods.
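Estimating a misclassification probability with a formal guarantee usually comes down to Monte Carlo sampling plus a concentration bound. The sketch below shows that general pattern with a one-sided Hoeffding bound; it is not PV-VASM's actual estimator, and the detector and spoof sampler are stand-ins.

```python
import math
import random

# Hedged sketch: estimate P(detector misclassifies a spoofed sample) empirically and
# attach an upper confidence bound that holds with probability >= 1 - delta.
def misclassification_bound(detector, sample_spoof, n=2000, delta=0.05):
    """Return (empirical error rate, one-sided Hoeffding upper bound)."""
    errors = sum(1 for _ in range(n) if detector(sample_spoof()) == "bonafide")
    p_hat = errors / n
    return p_hat, p_hat + math.sqrt(math.log(1 / delta) / (2 * n))

# Toy usage with a detector that misses ~10% of spoofs (purely illustrative).
toy_detector = lambda audio: "bonafide" if random.random() < 0.10 else "spoof"
print(misclassification_bound(toy_detector, lambda: None))
```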
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers introduce CUPID, a plug-in framework that estimates both aleatoric and epistemic uncertainty in deep learning models without requiring model retraining. The modular approach can be inserted into any layer of pretrained networks and provides interpretable uncertainty analysis for high-stakes AI applications.
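For readers unfamiliar with the aleatoric/epistemic split the entry refers to, the sketch below uses the standard MC-dropout decomposition on a pretrained classifier: total predictive entropy = expected entropy (aleatoric) + mutual information (epistemic). This is a common baseline for the same decomposition, not CUPID's mechanism; the toy model is illustrative.

```python
import torch

# Hedged sketch: keep dropout active at inference time (no retraining) and decompose
# predictive uncertainty into aleatoric and epistemic parts. Not CUPID's actual method.
@torch.no_grad()
def uncertainty_split(model: torch.nn.Module, x: torch.Tensor, n_samples: int = 20):
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()                                   # re-enable dropout sampling
    probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])  # (S, B, C)
    mean_p = probs.mean(dim=0)
    total = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(dim=-1)            # predictive entropy
    aleatoric = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean(0)  # expected entropy
    epistemic = total - aleatoric                                            # mutual information
    return aleatoric, epistemic

# Toy usage with a small classifier that contains dropout (purely illustrative).
net = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(),
                          torch.nn.Dropout(0.2), torch.nn.Linear(32, 3))
print(uncertainty_split(net, torch.randn(4, 8)))
```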
AI · Bearish · Fortune Crypto · Mar 11 · 7/10
🧠Amazon reportedly held a mandatory meeting to address a significant AI-related infrastructure incident with 'high blast radius' impact. Senior VP Dave Treadwell acknowledged recent poor availability of Amazon's site and related infrastructure, while Elon Musk issued a warning about the situation.
AI · Neutral · OpenAI News · Mar 11 · 6/10
🧠The article discusses ChatGPT's defensive mechanisms against prompt injection attacks and social engineering attempts. It focuses on how the AI system constrains risky actions and protects sensitive data within agent workflows to maintain security and reliability.
🧠 ChatGPT