649 articles tagged with #ai-safety. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce CARO, a two-stage training framework that enhances large language models' ability to perform robust content moderation through analogical reasoning. By combining retrieval-augmented generation with direct preference optimization, CARO achieves 24.9% F1 score improvement over state-of-the-art models including DeepSeek R1 and LLaMA Guard on ambiguous moderation cases.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce AI Integrity, a new governance framework that verifies the reasoning processes of AI systems rather than just evaluating outcomes. The approach defines an Authority Stack—a four-layer model of values, epistemological standards, source preferences, and data criteria—and proposes the PRISM framework to measure integrity through six core metrics, addressing a critical gap in existing AI Ethics, Safety, and Alignment paradigms.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce PRISM, a framework that detects AI behavioral risks by analyzing underlying reasoning hierarchies rather than individual harmful outputs. The system identifies 27 risk signals across value prioritization, evidence weighting, and information source trust, using forced-choice data from 7 AI models to distinguish between structurally dangerous, context-dependent, and balanced AI reasoning patterns.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers propose a reactor-model-of-computation approach using the Lingua Franca framework to address nondeterminism challenges in AI-powered human-in-the-loop cyber-physical systems. The study uses an agentic driving coach as a case study to demonstrate how foundation models like LLMs can be deployed in safety-critical applications while maintaining deterministic behavior despite unpredictable human and environmental variables.
AI · Bullish · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce CoSToM, a framework that uses causal tracing and activation steering to improve Theory of Mind alignment in large language models. The work addresses a critical gap between LLMs' internal knowledge and external behavior, demonstrating that targeted interventions in specific neural layers can enhance social reasoning capabilities and dialogue quality.
AI · Bullish · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers present a novel closed-form method for concept erasure in generative AI models that removes unwanted concepts without iterative training. The technique uses linear transformations and two sequential projection steps to safely edit pretrained models like Stable Diffusion and FLUX while preserving unrelated concepts, completing the process in seconds.
🧠 Stable Diffusion
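The closed-form edit described above can be sketched as a constrained least-squares update of a single linear layer. Everything below (the function name, shapes, and the exact constraint formulation) is an illustrative assumption, not the paper's method: it nulls one unwanted concept direction while keeping responses to preserved concept inputs exactly, with no iterative training.

```python
import numpy as np

def erase_concept(W, c_erase, C_preserve):
    """Closed-form concept erasure for one linear layer W (out_dim x in_dim).

    Solves: min ||W_new - W||_F  subject to
        W_new @ c_erase    = 0                 (erase the concept)
        W_new @ C_preserve = W @ C_preserve    (keep preserved concepts)
    """
    # Stack constraint inputs (in_dim x (1+k)) and their target outputs
    K = np.concatenate([c_erase[:, None], C_preserve], axis=1)
    V = np.concatenate([np.zeros((W.shape[0], 1)), W @ C_preserve], axis=1)
    # Minimum-Frobenius-norm update satisfying W_new @ K = V
    delta = (V - W @ K) @ np.linalg.solve(K.T @ K, K.T)
    return W + delta
```

Because the solve is a small (1+k)×(1+k) system, the edit completes in seconds even for large layers, matching the speed claim in the summary.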
AI · Bearish · Fortune Crypto · Apr 13 · 7/10
🧠A 20-year-old individual was arrested and accused of throwing a Molotov cocktail at OpenAI CEO Sam Altman, with authorities discovering documents expressing concerns about AI existential risks and humanity's impending extinction. The incident highlights escalating tensions between AI safety advocates and prominent tech leaders, raising questions about how ideological extremism intersects with legitimate concerns about artificial intelligence development.
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers analyzed how large language models decide whether to act on predictions or escalate to humans, finding that models use inconsistent and miscalibrated thresholds across five real-world domains. Supervised fine-tuning on chain-of-thought reasoning proved most effective at establishing robust escalation policies that generalize across contexts, suggesting escalation behavior requires explicit characterization before AI system deployment.
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Workshop participants from academia, industry, and government convened in November 2025 to establish best practices for designing reinforcement learning environments in autonomous cyber defence. The resulting framework and guidelines address a critical gap in documented knowledge about RL environment development for network security applications, including critical infrastructure protection.
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce PerMix-RLVR, a training method that enables large language models to maintain persona flexibility while preserving task robustness. The approach addresses a fundamental trade-off in reinforcement learning with verifiable rewards, where models become less responsive to persona prompts but gain improved performance on objective tasks.
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce CONDESION-BENCH, a new benchmark for evaluating how large language models make decisions in complex, real-world scenarios with compositional actions and conditional constraints. The benchmark addresses limitations in existing decision-making frameworks by incorporating variable-level, contextual, and allocation-level restrictions that better reflect actual decision-making environments.
AI · Bearish · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce GRM, a frequency-selective jailbreak framework that exploits vulnerabilities in audio large language models while preserving utility. By strategically perturbing specific frequency bands rather than the entire spectrum, GRM achieves an 88.46% jailbreak success rate with a better trade-off between attack effectiveness and transcription quality than existing methods.
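The frequency-selective idea can be illustrated with a minimal sketch: inject a perturbation into one chosen band of the spectrum while leaving the rest untouched. The band limits, noise scale, and random (rather than optimized) perturbation below are all assumptions for illustration; the actual attack optimizes the perturbation against the target model.

```python
import numpy as np

def band_perturb(audio, sr, band=(2000.0, 4000.0), eps=0.01, seed=0):
    """Perturb only the chosen frequency band of a mono signal.

    Out-of-band spectral content is untouched, which is what keeps
    transcription quality largely intact in the frequency-selective setup.
    """
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    # Complex noise scaled relative to the signal's peak spectral magnitude
    noise = rng.normal(size=mask.sum()) + 1j * rng.normal(size=mask.sum())
    spec[mask] += eps * np.abs(spec).max() * noise
    return np.fft.irfft(spec, n=len(audio))
```

Perturbing in the frequency domain and inverting back guarantees the change is confined to the selected band, by construction.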
AI · Neutral · Fortune Crypto · Apr 10 · 6/10
🧠Anthropic has restricted access to its latest AI model, Mythos, but the article suggests similar capabilities may already be publicly available through other channels. This highlights the ongoing tension between AI safety measures and the reality that advanced capabilities cannot be contained once developed.
🏢 Anthropic
AI · Neutral · AI News · Apr 10 · 6/10
🧠Apple, Qualcomm, and other tech companies are developing next-generation AI agents intentionally designed with built-in limitations rather than unrestricted capabilities. These agents can perform tasks like app navigation, bookings, and service management, but operate within controlled parameters that prioritize safety and user privacy over maximum autonomy.
AI · Neutral · Fortune Crypto · Apr 10 · 6/10
🧠Anthropic has developed an advanced AI model deemed too risky to publicly release, raising questions about responsible AI deployment and corporate liability as the company prepares for its IPO. This decision highlights the tension between innovation capabilities and safety concerns that will likely influence investor perception and regulatory scrutiny.
🏢 Anthropic
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce a framework for studying how emotional states affect decision-making in small language models (SLMs) used as autonomous agents. Using activation steering techniques grounded in real-world emotion-eliciting texts, they benchmark SLMs across game-theoretic scenarios and find that emotional perturbations systematically influence strategic choices, though behaviors often remain unstable and misaligned with human patterns.
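Activation steering of the kind described above typically adds a direction vector to a layer's hidden states at inference time. A minimal sketch, with the caveat that in the study the direction would be derived from emotion-eliciting texts (e.g. a mean activation difference between emotional and neutral prompts) and the strength `alpha` tuned empirically; here both are stand-ins:

```python
import numpy as np

def steering_hook(hidden, direction, alpha=4.0):
    """Add a scaled, normalized steering vector to a layer's hidden states.

    hidden:    activation vector (or batch of vectors) at the hooked layer
    direction: steering direction, e.g. mean(emotional) - mean(neutral)
    alpha:     steering strength (illustrative value)
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit
```

In practice this function would be registered as a forward hook on a chosen transformer layer, so every generation step is nudged along the emotion direction.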
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers have developed a method to control how detectable hallucinations are in multimodal language models, distinguishing between obvious hallucinations (easily spotted by humans) and elusive ones (harder to spot). Using a dataset of 4,470 human responses, they created targeted interventions that fine-tune which types of hallucinations occur, enabling flexible control suited to different security and usability requirements.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce A-MBER, a benchmark dataset designed to evaluate AI assistants' ability to recognize emotions based on long-term interaction history rather than immediate context. The benchmark tests whether models can retrieve relevant past interactions, infer current emotional states, and provide grounded explanations—revealing that memory's value lies in selective, context-aware interpretation rather than simple historical volume.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers propose a composite architecture combining instruction-based refusal with a structural abstention gate to reduce hallucinations in large language models. The system uses a support deficit score derived from self-consistency, paraphrase stability, and citation coverage to block unreliable outputs, achieving better accuracy than either mechanism alone across multiple models.
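The abstention gate can be sketched as a weighted score over the three signals named in the summary. The weights, threshold, and function names below are illustrative assumptions, not the paper's calibration:

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    self_consistency: float      # agreement rate across resampled answers, in [0, 1]
    paraphrase_stability: float  # answer stability under query paraphrases, in [0, 1]
    citation_coverage: float     # fraction of claims backed by retrieved sources

def support_deficit(e: Evidence, weights=(0.4, 0.3, 0.3)) -> float:
    """Support deficit score: higher means less supported. Weights are illustrative."""
    support = (weights[0] * e.self_consistency
               + weights[1] * e.paraphrase_stability
               + weights[2] * e.citation_coverage)
    return 1.0 - support

def abstention_gate(answer: str, e: Evidence, threshold=0.5):
    """Block the answer when its support deficit exceeds the threshold."""
    if support_deficit(e) > threshold:
        return None  # abstain; the caller may refuse or escalate instead
    return answer
```

The composite system would run this structural gate alongside instruction-based refusal, so an output is blocked if either mechanism fires.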
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce DOVE, a distributional evaluation framework that measures how well large language models align with cultural values through open-ended text generation rather than multiple-choice tests. The framework uses rate-distortion optimization to create a value codebook and unbalanced optimal transport to assess alignment, demonstrating 31.56% correlation with downstream tasks across 12 LLMs while requiring only 500 samples per culture.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers formalize privacy-preserving communication for LLM agents by introducing Information Sufficiency (IS) as a framework and proposing free-text pseudonymization as a third privacy strategy alongside suppression and generalization. Evaluation across 792 scenarios reveals that pseudonymization offers superior privacy-utility tradeoffs, and that multi-turn conversational testing exposes significant privacy leakage missed by single-message assessments.
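A toy stand-in for the pseudonymization strategy: each real name maps to a stable alias, so coreference (and thus utility) survives across a multi-turn conversation while identity is hidden. The regex-based entity detection below is a deliberate simplification; a real system would use NER.

```python
import re

def pseudonymize(text, mapping=None):
    """Free-text pseudonymization: swap each name for a consistent alias.

    Returns the rewritten text plus the (growing) name->alias mapping, so
    later turns of a conversation reuse the same pseudonyms.
    """
    mapping = mapping if mapping is not None else {}
    counter = len(mapping)

    def repl(match):
        nonlocal counter
        name = match.group(0)
        if name not in mapping:
            counter += 1
            mapping[name] = f"Person_{counter}"
        return mapping[name]

    # Illustrative pattern: capitalized first + last name pairs
    out = re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", repl, text)
    return out, mapping
```

Passing the returned `mapping` back in on the next turn is what makes the strategy consistent across messages, the property the multi-turn evaluation stresses.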
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers discovered that large language models have a fundamental limitation in latent reasoning: they can discover multi-step planning strategies without explicit supervision, but only up to depths of 3-7 steps depending on model size and training method. This finding suggests that complex reasoning tasks may require explicit chain-of-thought monitoring rather than relying on hidden internal computations.
🧠 GPT-4 · 🧠 GPT-5
AI · Bullish · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers propose a masked regularization technique to improve the robustness and interpretability of Sparse Autoencoders (SAEs) used in large language model analysis. The method addresses feature absorption and out-of-distribution performance failures by randomly replacing tokens during training to disrupt co-occurrence patterns, offering a practical path toward more reliable mechanistic interpretability tools.
AI · Bearish · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce MedDialBench, a comprehensive benchmark testing how large language models maintain diagnostic accuracy when patients exhibit adversarial behaviors across five dimensions. The study reveals that fabricating symptoms causes 1.7-3.4x larger accuracy drops than withholding information, with worst-case performance degradation ranging from 38.8 to 54.1 percentage points across tested models.
AI · Bearish · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers studied how persona vectors—AI steering techniques that inject personality traits into large language models—affect educational applications like essay generation and automated grading. The study found that persona steering significantly degrades answer quality, with substantially larger negative impacts on open-ended humanities tasks compared to factual science questions, and reveals that AI scorers exhibit predictable bias patterns based on assigned personality traits.