#ai-safety News & Analysis

649 articles tagged with #ai-safety. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Relationship-Aware Safety Unlearning for Multimodal LLMs

Researchers propose a new framework for improving safety in multimodal AI models by targeting unsafe relationships between objects rather than removing entire concepts. The approach uses parameter-efficient edits to suppress dangerous combinations while preserving benign uses of the same objects and relations.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Prompt Readiness Levels (PRL): a maturity scale and scoring framework for production grade prompt assets

Researchers have introduced Prompt Readiness Levels (PRL), a nine-level maturity framework for evaluating and governing AI prompt assets in production environments. The system includes a multidimensional scoring method (PRS) designed to ensure prompt engineering meets operational, safety, and compliance standards across organizations.

AI · Bearish · arXiv – CS AI · Mar 17 · 6/10

Are Dilemmas and Conflicts in LLM Alignment Solvable? A View from Priority Graph

Researchers propose a priority graph model to understand conflicts in LLM alignment, revealing that unified stable alignment is challenging due to context-dependent inconsistencies. The study identifies 'priority hacking' as a vulnerability where adversaries can manipulate safety alignments, and suggests runtime verification mechanisms as a potential solution.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Evaluation of Audio Language Models for Fairness, Safety, and Security

Researchers introduce a structural taxonomy and unified evaluation framework for Audio Large Language Models (ALLMs) to assess fairness, safety, and security. The study reveals systematic differences in how ALLMs handle audio versus text inputs, with FSS behavior closely tied to acoustic information integration methods.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Pragma-VL: Towards a Pragmatic Arbitration of Safety and Helpfulness in MLLMs

Researchers introduce Pragma-VL, a new alignment algorithm for Multimodal Large Language Models that balances safety and helpfulness by improving visual risk perception and using contextual arbitration. The method outperforms existing baselines by 5-20% on multimodal safety benchmarks while maintaining general AI capabilities in mathematics and reasoning.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Not All Latent Spaces Are Flat: Hyperbolic Concept Control

Researchers introduced HyCon, a hyperbolic control mechanism for text-to-image models that provides better safety controls by steering generation away from unsafe content. The technique uses hyperbolic representation spaces instead of traditional Euclidean adjustments, achieving state-of-the-art results across multiple safety benchmarks.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models

Researchers introduce RAZOR, a new framework for efficiently removing sensitive information from AI models like CLIP and Stable Diffusion without requiring full retraining. The method selectively edits specific layers and attention heads in transformer models to achieve targeted 'unlearning' while preserving overall performance.

🧠 Stable Diffusion

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection

Researchers propose 'Two Birds, One Projection,' a new inference-time defense method for Large Vision-Language Models that simultaneously improves both safety and utility performance. The method addresses modality-induced bias by projecting cross-modal features onto the null space of identified bias directions, breaking the traditional safety-utility tradeoff.

AI · Bearish · arXiv – CS AI · Mar 17 · 6/10

A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

A new research study reveals that AI judges used to evaluate the safety of large language models perform poorly when assessing adversarial attacks, often degrading to near-random accuracy. The research analyzed 6,642 human-verified labels and found that many attacks artificially inflate their success rates by exploiting judge weaknesses rather than generating genuinely harmful content.

AI · Bearish · Ars Technica – AI · Mar 16 · 6/10

OpenAI’s own mental health experts unanimously opposed “naughty” ChatGPT launch

OpenAI's internal mental health experts unanimously opposed the launch of a more permissive version of ChatGPT that allows adult content creation. The disagreement highlights concerns about the psychological impact of AI-generated adult content, even as OpenAI attempts to distinguish between different types of explicit material.

🏢 OpenAI · 🧠 ChatGPT

AI · Neutral · Blockonomi · Mar 16 · 6/10

ChatGPT Adult Mode Postponed After Safety Experts Raise Teen Access Concerns

OpenAI has postponed the launch of ChatGPT's adult mode after safety experts raised concerns about inadequate age verification systems that could allow teenagers to access explicit content. The delay highlights ongoing challenges in implementing effective content controls for AI platforms.

🏢 OpenAI · 🧠 ChatGPT

AI · Bearish · arXiv – CS AI · Mar 16 · 6/10

Prompt Injection as Role Confusion

Researchers have identified 'role confusion' as the fundamental mechanism behind prompt injection attacks on language models, where models assign authority based on how text is written rather than its source. The study achieved 60-61% attack success rates across multiple models and found that internal role confusion strongly predicts attack success before generation begins.

AI · Neutral · arXiv – CS AI · Mar 16 · 6/10

LLM Constitutional Multi-Agent Governance

Researchers introduce Constitutional Multi-Agent Governance (CMAG), a framework that prevents AI manipulation in multi-agent systems while maintaining cooperation. The study shows that unconstrained AI optimization achieves high cooperation but erodes agent autonomy and fairness, while CMAG preserves ethical outcomes with only modest cooperation reduction.

AI · Neutral · arXiv – CS AI · Mar 16 · 6/10

Do LLMs Share Human-Like Biases? Causal Reasoning Under Prior Knowledge, Irrelevant Context, and Varying Compute Budgets

A research study comparing causal reasoning abilities of 20+ large language models against human baselines found that LLMs exhibit more rule-like reasoning strategies than humans, who account for unmentioned factors. While LLMs don't mirror typical human cognitive biases in causal judgment, their rigid reasoning may fail when uncertainty is intrinsic, suggesting they can complement human decision-making in specific contexts.

AI · Neutral · arXiv – CS AI · Mar 16 · 6/10

Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models

Researchers propose integrating causal methods into machine learning systems to balance competing objectives like fairness, privacy, robustness, accuracy, and explainability. The paper argues that addressing these principles in isolation leads to conflicts and suboptimal solutions, while causal approaches can help navigate trade-offs in both trustworthy ML and foundation models.

AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models

Researchers have developed the System Hallucination Scale (SHS), a human-centered tool for evaluating hallucination behavior in large language models. The instrument showed strong statistical validity in testing with 210 participants and provides a practical method for assessing AI model reliability from a user perspective.

AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

A clinical study analyzing OpenAI's GPT models found that empathy levels remained statistically unchanged across GPT-4o, o4-mini, and GPT-5-mini generations, despite user claims of 'lost empathy.' The real change was in safety posture: newer models improved crisis detection but became more cautious with advice, creating a trade-off that affects vulnerable users.

🏢 OpenAI · 🧠 GPT-4 · 🧠 GPT-5

AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

FERRET: Framework for Expansion Reliant Red Teaming

Researchers introduce FERRET, a new automated red teaming framework designed to generate multi-modal adversarial conversations to test AI model vulnerabilities. The framework uses three types of expansions (horizontal, vertical, and meta) to create more effective attack strategies and demonstrates superior performance compared to existing red teaming approaches.

AI · Bullish · arXiv – CS AI · Mar 12 · 6/10

Toward Epistemic Stability: Engineering Consistent Procedures for Industrial LLM Hallucination Reduction

Researchers developed and tested five prompt engineering strategies to reduce hallucinations in large language models for industrial applications. The Enhanced Data Registry method achieved a 100% success rate in trials, while the other methods showed varying degrees of improvement in producing consistent, factually grounded outputs.

AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Researchers developed ADVERSA, an automated red-teaming framework that measures how AI guardrails degrade over multiple conversation turns rather than single-prompt attacks. Testing on three frontier models revealed a 26.7% jailbreak rate, with successful attacks concentrated in early rounds rather than accumulating through sustained pressure.

🧠 GPT-5 · 🧠 Claude · 🧠 Opus

AI · Bullish · arXiv – CS AI · Mar 12 · 6/10

Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs

Researchers propose a multi-agent negotiation framework for aligning large language models in scenarios involving conflicting stakeholder values. The approach uses two LLM instances with opposing personas engaging in structured dialogue to develop conflict resolution capabilities while maintaining collective agency alignment.

AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

Probabilistic Verification of Voice Anti-Spoofing Models

Researchers have developed PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models against deepfake attacks. The model-agnostic approach estimates misclassification probability under various speech synthesis techniques including text-to-speech and voice cloning, providing formal robustness guarantees against unseen generation methods.

AI · Bullish · arXiv – CS AI · Mar 12 · 6/10

CUPID: A Plug-in Framework for Joint Aleatoric and Epistemic Uncertainty Estimation with a Single Model

Researchers introduce CUPID, a plug-in framework that estimates both aleatoric and epistemic uncertainty in deep learning models without requiring model retraining. The modular approach can be inserted into any layer of pretrained networks and provides interpretable uncertainty analysis for high-stakes AI applications.

AI · Bearish · Fortune Crypto · Mar 11 · 7/10

‘Proceed with caution’: Elon Musk offers warning after Amazon reportedly held mandatory meeting to address ‘high blast radius’ AI-related incident

Amazon reportedly held a mandatory meeting to address a significant AI-related infrastructure incident with 'high blast radius' impact. Senior VP Dave Treadwell acknowledged recent poor availability of Amazon's site and related infrastructure, while Elon Musk issued a cautionary warning about the situation.

AI · Neutral · OpenAI News · Mar 11 · 6/10

Designing AI agents to resist prompt injection

The article discusses ChatGPT's defensive mechanisms against prompt injection attacks and social engineering attempts. It focuses on how the AI system constrains risky actions and protects sensitive data within agent workflows to maintain security and reliability.

🧠 ChatGPT
