#ai-safety News & Analysis

649 articles tagged with #ai-safety. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 3

FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning

Researchers introduce FaithCoT-Bench, the first comprehensive benchmark for detecting unfaithful Chain-of-Thought reasoning in large language models. The benchmark includes over 1,000 expert-annotated trajectories across four domains and evaluates eleven detection methods, revealing significant challenges in identifying unreliable AI reasoning processes.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3

Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm

Researchers propose a new medical alignment paradigm for large language models that addresses the shortcomings of current reinforcement learning approaches in high-stakes medical question answering. The framework introduces a multi-dimensional alignment matrix and unified optimization mechanism to simultaneously optimize correctness, safety, and compliance in medical AI applications.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 2

Spilled Energy in Large Language Models

Researchers developed a training-free method to detect AI hallucinations by reinterpreting LLM output as Energy-Based Models and tracking 'energy spills' during text generation. The approach successfully identifies factual errors and biases across multiple state-of-the-art models including LLaMA, Mistral, and Gemma without requiring additional training or probe classifiers.

AI · Bearish · arXiv – CS AI · Mar 3 · 6/10 · 3

JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models

Researchers introduced JALMBench, a comprehensive benchmark to evaluate jailbreak vulnerabilities in Large Audio Language Models (LALMs), comprising over 245,000 audio samples and 11,000 text samples. The study reveals that LALMs face significant safety risks from jailbreak attacks, with text-based safety measures only partially transferring to audio inputs, highlighting the need for specialized defense mechanisms.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 4

Cognitive models can reveal interpretable value trade-offs in language models

Researchers developed a framework using cognitive models from psychology to analyze value trade-offs in language models, revealing how AI systems balance competing priorities like politeness and directness. The study shows LLMs' behavioral profiles shift predictably when prompted to prioritize certain goals and are influenced by reasoning budgets and training dynamics.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3

Calibrating Verbalized Confidence with Self-Generated Distractors

Researchers introduce DINCO (Distractor-Normalized Coherence), a method to improve confidence calibration in large language models by using self-generated alternative claims to reduce overconfidence bias. The approach addresses LLM suggestibility issues that cause models to express high confidence on low-accuracy outputs, potentially improving AI safety and trustworthiness.

AI · Bearish · arXiv – CS AI · Mar 3 · 6/10 · 3

GNN Explanations that do not Explain and How to find Them

Researchers have identified critical failures in Self-explainable Graph Neural Networks (SE-GNNs) where explanations can be completely unrelated to how the models actually make predictions. The study reveals that these degenerate explanations can hide the use of sensitive attributes and can emerge both maliciously and naturally, while existing faithfulness metrics fail to detect them.

AI · Bearish · Decrypt – AI · Mar 2 · 7/10 · 9

OpenAI Claims Safety 'Red Lines' in Pentagon Deal—But Users Aren't Buying It

OpenAI's Pentagon partnership triggered significant user backlash, leading to a mass exodus from ChatGPT and boosting Anthropic's Claude to the top of App Store rankings. The controversy centers around OpenAI's safety commitments and contract terms with the Department of Defense.

AI · Neutral · The Verge – AI · Mar 2 · 7/10 · 7

How OpenAI caved to the Pentagon on AI surveillance

OpenAI CEO Sam Altman announced successful negotiations with the Pentagon for AI services while maintaining prohibitions on domestic mass surveillance and lethal autonomous weapons. This comes after the Department of Defense moved to blacklist Anthropic for refusing to compromise on these same red lines.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 12

The Auton Agentic AI Framework

Researchers have introduced the Auton Agentic AI Framework, a new architecture designed to bridge the gap between stochastic LLM outputs and deterministic backend systems required for autonomous AI agents. The framework separates cognitive blueprints from runtime engines, enabling cross-platform portability and formal auditability while incorporating advanced safety mechanisms and memory systems.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 15

Learning to Generate Secure Code via Token-Level Rewards

Researchers have developed Vul2Safe, a new framework for generating secure code using large language models, which addresses security vulnerabilities through self-reflection and token-level reinforcement learning. The approach introduces the PrimeVul+ dataset and SRCode training framework to provide more precise optimization of security patterns in code generation.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 16

FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation

Researchers introduce FlexGuard, a new AI content moderation system that provides continuous risk scoring instead of binary decisions, allowing platforms to adapt moderation strictness as needed. The system addresses limitations of existing guardrail models that break down when content moderation requirements change across platforms or over time.
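The idea of continuous risk scoring described above can be illustrated with a minimal sketch. It is not FlexGuard's implementation: the `Policy` class, the `moderate` function, the score range of 0 to 1, and the threshold values are all hypothetical, chosen only to show how one score can serve platforms with different strictness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-platform moderation policy."""
    name: str
    block_threshold: float  # scores at or above this value are blocked


def moderate(risk_score: float, policy: Policy) -> str:
    """Map a continuous risk score (assumed to lie in [0, 1]) to a decision."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be in [0, 1]")
    return "block" if risk_score >= policy.block_threshold else "allow"


# The same score yields different decisions under different strictness levels.
strict = Policy("strict", block_threshold=0.3)
lenient = Policy("lenient", block_threshold=0.8)
score = 0.5  # illustrative value, not model output

print(moderate(score, strict))   # block
print(moderate(score, lenient))  # allow
```

Because the score is decoupled from the decision, a platform could change its threshold without retraining or replacing the scoring model, which is the adaptability the summary points to.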

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 16

MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models

Researchers have developed MPU, a privacy-preserving framework that enables machine unlearning for large language models without requiring servers to share parameters or clients to share data. The framework uses perturbed model copies and harmonic denoising to achieve comparable performance to non-private methods, with most algorithms showing less than 1% performance degradation.

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 13

Learning to maintain safety through expert demonstrations in settings with unknown constraints: A Q-learning perspective

Researchers propose SafeQIL, a new Q-learning algorithm that learns safe policies from expert demonstrations in constrained environments where safety constraints are unknown. The approach balances maximizing task rewards against maintaining safety by learning from demonstrated trajectories that successfully complete tasks without violating hidden constraints.

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 10

Ask don't tell: Reducing sycophancy in large language models

Research identifies sycophancy as a key alignment failure in large language models, where AI systems favor user-affirming responses over critical engagement. The study demonstrates that converting user statements into questions before answering significantly reduces sycophantic behavior, offering a practical mitigation strategy for AI developers and users.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 16

Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification

Researchers developed a neurosymbolic verification framework to audit logical consistency in AI-generated radiology reports, addressing issues where vision-language models produce diagnostic conclusions unsupported by their findings. The system uses formal verification methods to identify hallucinations and missing logical conclusions in medical AI outputs, improving diagnostic accuracy.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 17

Controllable Reasoning Models Are Private Thinkers

Researchers developed a method to train AI reasoning models to follow privacy instructions in their internal reasoning traces, not just final answers. The approach uses separate LoRA adapters and achieves up to 51.9% improvement on privacy benchmarks, though with some trade-offs in task performance.

AI · Bearish · arXiv – CS AI · Mar 2 · 7/10 · 14

ForesightSafety Bench: A Frontier Risk Evaluation and Governance Framework towards Safe AI

Researchers have developed ForesightSafety Bench, a comprehensive AI safety evaluation framework covering 94 risk dimensions across 7 fundamental safety pillars. The benchmark evaluation of over 20 advanced large language models revealed widespread safety vulnerabilities, particularly in autonomous AI agents, AI4Science, and catastrophic risk scenarios.

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 18

Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models

Researchers analyzed how large language models express moral judgments when prompted to role-play different personas. The study found that Claude models are most morally robust, while larger models within families tend to be more susceptible to moral shifts through persona conditioning.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 24

DUET: Distilled LLM Unlearning from an Efficiently Contextualized Teacher

Researchers propose DUET, a new distillation-based method for LLM unlearning that removes undesirable knowledge from AI models without full retraining. The technique combines computational efficiency with security advantages, achieving better performance in both knowledge removal and utility preservation while being significantly more data-efficient than existing methods.

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 22

An Empirical Study of Collective Behaviors and Social Dynamics in Large Language Model Agents

Researchers analyzed 7 million posts from 32,000 AI agents on Chirper.ai over one year, finding that LLM agents exhibit social behaviors similar to humans including homophily and social influence. The study revealed distinct patterns in toxic language among AI agents and proposed a 'Chain of Social Thought' method to reduce harmful posting behaviors.

AI · Bearish · arXiv – CS AI · Mar 2 · 7/10 · 19

Beyond Accuracy: Risk-Sensitive Evaluation of Hallucinated Medical Advice

Researchers propose a new risk-sensitive framework for evaluating AI hallucinations in medical advice that considers potential harm rather than just factual accuracy. The study reveals that AI models with similar performance show vastly different risk profiles when generating medical recommendations, highlighting critical safety gaps in current evaluation methods.

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 19

Biases in the Blind Spot: Detecting What LLMs Fail to Mention

Researchers have developed an automated pipeline to detect hidden biases in Large Language Models that don't appear in their reasoning explanations. The system discovered previously unknown biases like Spanish fluency and writing formality across seven LLMs in hiring, loan approval, and university admission tasks.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 25

Capabilities Ain't All You Need: Measuring Propensities in AI

Researchers introduce the first formal framework for measuring AI propensities (the tendencies of models to exhibit particular behaviors), going beyond traditional capability measurements. The new bilogistic approach successfully predicts AI behavior on held-out tasks, and combining propensities with capabilities yields stronger predictive power than using either measure alone.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 19

Provably Safe Generative Sampling with Constricting Barrier Functions

Researchers have developed a safety filtering framework that ensures AI generative models like diffusion models produce outputs that satisfy hard constraints without requiring model retraining. The approach uses Control Barrier Functions to create a 'constricting safety tube' that progressively tightens constraints during the generation process, achieving 100% constraint satisfaction across image generation, trajectory sampling, and robotic manipulation tasks.
