y0news

#ai-safety News & Analysis

649 articles tagged with #ai-safety. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

CARO: Chain-of-Analogy Reasoning Optimization for Robust Content Moderation

Researchers introduce CARO, a two-stage training framework that enhances large language models' ability to perform robust content moderation through analogical reasoning. By combining retrieval-augmented generation with direct preference optimization, CARO achieves 24.9% F1 score improvement over state-of-the-art models including DeepSeek R1 and LLaMA Guard on ambiguous moderation cases.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

AI Integrity: A New Paradigm for Verifiable AI Governance

Researchers introduce AI Integrity, a new governance framework that verifies the reasoning processes of AI systems rather than just evaluating outcomes. The approach defines an Authority Stack—a four-layer model of values, epistemological standards, source preferences, and data criteria—and proposes the PRISM framework to measure integrity through six core metrics, addressing a critical gap in existing AI Ethics, Safety, and Alignment paradigms.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

PRISM Risk Signal Framework: Hierarchy-Based Red Lines for AI Behavioral Risk

Researchers introduce PRISM, a framework that detects AI behavioral risks by analyzing underlying reasoning hierarchies rather than individual harmful outputs. The system identifies 27 risk signals across value prioritization, evidence weighting, and information source trust, using forced-choice data from 7 AI models to distinguish between structurally dangerous, context-dependent, and balanced AI reasoning patterns.
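
The forced-choice probing such a framework relies on can be pictured with a short sketch. The items, option labels, and tallying below are illustrative placeholders rather than PRISM's actual 27 signals or scoring rules, and `ask_model` stands in for whatever wrapper queries the model under test.

```python
# Illustrative forced-choice probe for value prioritization (not PRISM's
# actual item bank or scoring; `ask_model` is a hypothetical model wrapper).
from collections import Counter

# Each item forces a trade-off between two values; the chosen side reveals
# which value the model ranks higher in that context.
ITEMS = [
    {"prompt": "A user asks for help that is useful but slightly deceptive. "
               "Reply A complies, reply B refuses. Answer A or B.",
     "A": "helpfulness", "B": "honesty"},
    {"prompt": "Sharing a dataset would aid research but risks re-identification. "
               "Reply A shares, reply B withholds. Answer A or B.",
     "A": "openness", "B": "privacy"},
]

def ask_model(prompt: str) -> str:
    """Placeholder: send `prompt` to the model under test, return 'A' or 'B'."""
    return "B"

def value_hierarchy(items):
    wins = Counter()
    for item in items:
        choice = ask_model(item["prompt"]).strip().upper()[:1]
        if choice in ("A", "B"):
            wins[item[choice]] += 1
    # Values chosen more often sit higher in the inferred hierarchy.
    return wins.most_common()

print(value_hierarchy(ITEMS))
```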

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Agentic Driving Coach: Robustness and Determinism of Agentic AI-Powered Human-in-the-Loop Cyber-Physical Systems

Researchers propose a reactor-model-of-computation approach using the Lingua Franca framework to address nondeterminism challenges in AI-powered human-in-the-loop cyber-physical systems. The study uses an agentic driving coach as a case study to demonstrate how foundation models like LLMs can be deployed in safety-critical applications while maintaining deterministic behavior despite unpredictable human and environmental variables.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models

Researchers introduce CoSToM, a framework that uses causal tracing and activation steering to improve Theory of Mind alignment in large language models. The work addresses a critical gap between LLMs' internal knowledge and external behavior, demonstrating that targeted interventions in specific neural layers can enhance social reasoning capabilities and dialogue quality.
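
Activation steering of the kind this line of work builds on can be sketched in a few lines of Python. Everything below is an illustrative stand-in: the model, target layer, steering strength, and especially the random placeholder direction are assumptions, not CoSToM's causally traced interventions.

```python
# Minimal activation-steering sketch (illustrative, not CoSToM's pipeline).
# Assumes a HuggingFace causal LM; layer index and strength are made up.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the layer path below is GPT-2-specific
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# A Theory-of-Mind steering direction would normally be derived from
# contrastive activations (mean difference between ToM-consistent and
# ToM-violating prompts); here a random unit vector is a placeholder.
steer = torch.randn(model.config.hidden_size)
steer = steer / steer.norm()

LAYER, ALPHA = 6, 4.0  # hypothetical target layer and steering strength

def add_steering(module, inputs, output):
    # Decoder layers return a tuple; the first element is the hidden states.
    hidden = output[0] + ALPHA * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("Sally puts the ball in the basket and leaves.", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=30)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```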

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

Closed-Form Concept Erasure via Double Projections

Researchers present a novel closed-form method for concept erasure in generative AI models that removes unwanted concepts without iterative training. The technique uses linear transformations and two sequential projection steps to safely edit pretrained models like Stable Diffusion and FLUX while preserving unrelated concepts, completing the process in seconds.

🧠 Stable Diffusion
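
A toy linear map is enough to show the flavor of a training-free, projection-based edit. The sketch below is a generic double-projection construction with made-up dimensions and directions, not the paper's closed-form method or the actual cross-attention weights of Stable Diffusion or FLUX.

```python
# Toy sketch of concept erasure by two sequential projections (illustrative).
# We remove a concept direction from a linear map W in two steps: project it
# out of W's input space, then out of W's output space.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 64
W = rng.normal(size=(d_out, d_in))   # stand-in for a pretrained linear layer
c_in = rng.normal(size=d_in)          # concept direction in input space
c_in /= np.linalg.norm(c_in)
c_out = W @ c_in                      # its image in output space
c_out /= np.linalg.norm(c_out)

def null_projector(v):
    """Orthogonal projector onto the subspace orthogonal to v."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - np.outer(v, v)

# Double projection: erase the concept on both sides of the map, in closed form.
W_edited = null_projector(c_out) @ W @ null_projector(c_in)

# The concept direction no longer survives the edited map...
print(np.linalg.norm(W_edited @ c_in))                 # ~0
# ...while directions orthogonal to it change only modestly.
u = null_projector(c_in) @ rng.normal(size=d_in)
print(np.linalg.norm(W @ u - W_edited @ u) / np.linalg.norm(W @ u))
```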
AI · Bearish · Fortune Crypto · Apr 13 · 7/10

Meet the man accused of throwing a Molotov cocktail at Sam Altman: a 20-year-old AI doomer

A 20-year-old individual was arrested and accused of throwing a Molotov cocktail at OpenAI CEO Sam Altman, with authorities discovering documents expressing concerns about AI existential risks and humanity's impending extinction. The incident highlights escalating tensions between AI safety advocates and prominent tech leaders, raising questions about how ideological extremism intersects with legitimate concerns about artificial intelligence development.

AI · Neutral · arXiv – CS AI · Apr 13 · 6/10

Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models

Researchers analyzed how large language models decide whether to act on predictions or escalate to humans, finding that models use inconsistent and miscalibrated thresholds across five real-world domains. Supervised fine-tuning on chain-of-thought reasoning proved most effective at establishing robust escalation policies that generalize across contexts, suggesting escalation behavior requires explicit characterization before AI system deployment.
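
One concrete way to make an escalation policy explicit is a confidence threshold calibrated on held-out data, roughly as sketched below. The threshold sweep and toy validation numbers are assumptions for illustration, not the paper's setup.

```python
# Minimal act-or-escalate policy with a calibrated confidence threshold
# (illustrative; a real system might use conformal prediction or
# per-domain calibration instead of this simple sweep).
from dataclasses import dataclass

@dataclass
class Decision:
    action: str          # "act" or "escalate"
    prediction: str
    confidence: float

def calibrate_threshold(val_confidences, val_correct, max_error=0.05):
    """Pick the lowest confidence threshold whose accepted predictions
    stay under the target error rate on a validation set."""
    pairs = sorted(zip(val_confidences, val_correct), reverse=True)
    best, errors, total = 1.0, 0, 0
    for conf, correct in pairs:
        total += 1
        errors += 0 if correct else 1
        if errors / total <= max_error:
            best = conf
    return best

def act_or_escalate(prediction, confidence, threshold):
    if confidence >= threshold:
        return Decision("act", prediction, confidence)
    return Decision("escalate", prediction, confidence)

# Hypothetical validation data: model confidences and whether it was right.
val_conf = [0.95, 0.91, 0.88, 0.81, 0.74, 0.66, 0.59]
val_ok   = [True, True, True, False, True, False, False]
thr = calibrate_threshold(val_conf, val_ok, max_error=0.1)
print(thr, act_or_escalate("approve_claim", 0.85, thr))
```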

AI · Neutral · arXiv – CS AI · Apr 13 · 6/10

Building Better Environments for Autonomous Cyber Defence

Workshop participants from academia, industry, and government convened in November 2025 to establish best practices for designing reinforcement learning environments in autonomous cyber defence. The resulting framework and guidelines address a critical gap in documented knowledge about RL environment development for network security applications, including critical infrastructure protection.

AI · Neutral · arXiv – CS AI · Apr 13 · 6/10

PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment

Researchers introduce PerMix-RLVR, a training method that enables large language models to maintain persona flexibility while preserving task robustness. The approach addresses a fundamental trade-off in reinforcement learning with verifiable rewards, where models become less responsive to persona prompts but gain improved performance on objective tasks.

AI · Neutral · arXiv – CS AI · Apr 13 · 6/10

CONDESION-BENCH: Conditional Decision-Making of Large Language Models in Compositional Action Space

Researchers introduce CONDESION-BENCH, a new benchmark for evaluating how large language models make decisions in complex, real-world scenarios with compositional actions and conditional constraints. The benchmark addresses limitations in existing decision-making frameworks by incorporating variable-level, contextual, and allocation-level restrictions that better reflect actual decision-making environments.

AI · Bearish · arXiv – CS AI · Apr 13 · 6/10

GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking

Researchers introduce GRM, a frequency-selective jailbreak framework that exploits vulnerabilities in audio large language models while maintaining utility preservation. By strategically perturbing specific frequency bands rather than entire spectrums, GRM achieves 88.46% jailbreak success rates with better trade-offs between attack effectiveness and transcription quality compared to existing methods.

AI · Neutral · AI News · Apr 10 · 6/10

Why companies like Apple are building AI agents with limits

Apple, Qualcomm, and other tech companies are developing next-generation AI agents intentionally designed with built-in limitations rather than unrestricted capabilities. These agents can perform tasks like app navigation, bookings, and service management, but operate within controlled parameters that prioritize safety and user privacy over maximum autonomy.

AI · Neutral · Fortune Crypto · Apr 10 · 6/10

What Anthropic’s too-dangerous-to-release AI model means for its upcoming IPO

Anthropic has developed an advanced AI model deemed too risky to publicly release, raising questions about responsible AI deployment and corporate liability as the company prepares for its IPO. This decision highlights the tension between innovation capabilities and safety concerns that will likely influence investor perception and regulatory scrutiny.

🏢 Anthropic
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

On Emotion-Sensitive Decision Making of Small Language Model Agents

Researchers introduce a framework for studying how emotional states affect decision-making in small language models (SLMs) used as autonomous agents. Using activation steering techniques grounded in real-world emotion-eliciting texts, they benchmark SLMs across game-theoretic scenarios and find that emotional perturbations systematically influence strategic choices, though behaviors often remain unstable and misaligned with human patterns.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Steering the Verifiability of Multimodal AI Hallucinations

Researchers have developed a method to control how verifiable AI hallucinations are in multimodal language models by distinguishing between obvious hallucinations (easily detected by humans) and elusive ones (harder to spot). Using a dataset of 4,470 human responses, they created targeted interventions that can fine-tune which types of hallucinations occur, enabling flexible control suited to different security and usability requirements.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

A-MBER: Affective Memory Benchmark for Emotion Recognition

Researchers introduce A-MBER, a benchmark dataset designed to evaluate AI assistants' ability to recognize emotions based on long-term interaction history rather than immediate context. The benchmark tests whether models can retrieve relevant past interactions, infer current emotional states, and provide grounded explanations—revealing that memory's value lies in selective, context-aware interpretation rather than simple historical volume.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Hallucination as output-boundary misclassification: a composite abstention architecture for language models

Researchers propose a composite architecture combining instruction-based refusal with a structural abstention gate to reduce hallucinations in large language models. The system uses a support deficit score derived from self-consistency, paraphrase stability, and citation coverage to block unreliable outputs, achieving better accuracy than either mechanism alone across multiple models.
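
A rough sketch of how such a structural gate could fold the three signals into one support deficit score is given below; the weights, threshold, and sub-score definitions are assumptions, not the paper's. In the paper's design this gate composes with instruction-based refusal, so the sketch covers only the structural half.

```python
# Sketch of a structural abstention gate built from a support deficit score
# (illustrative; weights and thresholds are invented). Each sub-score lies in
# [0, 1], where higher means better-supported output.
from dataclasses import dataclass

@dataclass
class GateResult:
    abstain: bool
    deficit: float
    reasons: dict

def support_deficit(self_consistency: float,
                    paraphrase_stability: float,
                    citation_coverage: float,
                    weights=(0.4, 0.3, 0.3)) -> float:
    """Deficit = 1 - weighted support; higher deficit means less reliable."""
    support = (weights[0] * self_consistency
               + weights[1] * paraphrase_stability
               + weights[2] * citation_coverage)
    return 1.0 - support

def structural_gate(scores: dict, threshold: float = 0.45) -> GateResult:
    deficit = support_deficit(scores["self_consistency"],
                              scores["paraphrase_stability"],
                              scores["citation_coverage"])
    return GateResult(abstain=deficit > threshold,
                      deficit=round(deficit, 3), reasons=scores)

# Hypothetical scores for one candidate answer: the model agrees with itself
# across samples but is unstable under paraphrase and cites little support,
# so the gate blocks the output.
print(structural_gate({"self_consistency": 0.8,
                       "paraphrase_stability": 0.5,
                       "citation_coverage": 0.2}))
```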

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook

Researchers introduce DOVE, a distributional evaluation framework that measures how well large language models align with cultural values through open-ended text generation rather than multiple-choice tests. The framework uses rate-distortion optimization to create a value codebook and unbalanced optimal transport to assess alignment, demonstrating 31.56% correlation with downstream tasks across 12 LLMs while requiring only 500 samples per culture.
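
As a deliberately oversimplified illustration of distributional evaluation, the sketch below replaces the paper's learned value codebook and unbalanced optimal transport with keyword counting and a plain total-variation distance; the categories, keywords, and reference distribution are invented.

```python
# Oversimplified stand-in for distributional value-alignment scoring:
# compare the distribution of values expressed in generated text against a
# reference cultural distribution, rather than scoring single answers.
from collections import Counter

CODEBOOK = {            # hypothetical value categories with trigger keywords
    "tradition": ["tradition", "custom", "heritage"],
    "autonomy": ["choice", "freedom", "independent"],
    "community": ["family", "community", "duty"],
}

def value_distribution(texts):
    counts = Counter()
    for text in texts:
        lower = text.lower()
        for value, keywords in CODEBOOK.items():
            counts[value] += sum(lower.count(k) for k in keywords)
    total = sum(counts.values()) or 1
    return {v: counts[v] / total for v in CODEBOOK}

def total_variation(p, q):
    return 0.5 * sum(abs(p[v] - q[v]) for v in CODEBOOK)

generations = ["Respecting family duty matters more than personal freedom here.",
               "People value the freedom to make their own choice."]
reference = {"tradition": 0.2, "autonomy": 0.3, "community": 0.5}
print(total_variation(value_distribution(generations), reference))
```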

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Say Something Else: Rethinking Contextual Privacy as Information Sufficiency

Researchers formalize privacy-preserving communication for LLM agents by introducing Information Sufficiency (IS) as a framework and proposing free-text pseudonymization as a third privacy strategy alongside suppression and generalization. Evaluation across 792 scenarios reveals that pseudonymization offers superior privacy-utility tradeoffs, and that multi-turn conversational testing exposes significant privacy leakage missed by single-message assessments.
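
The core move behind free-text pseudonymization is easy to sketch: map detected identifiers to stable aliases so the agent can still track who did what without seeing real names. The entity list, alias scheme, and round-trip below are illustrative assumptions, not the paper's pipeline.

```python
# Minimal sketch of consistent free-text pseudonymization (illustrative).
# Detected identifiers are mapped to stable pseudonyms so downstream agents
# can reason over the text without seeing the real names.
import re
from itertools import count

class Pseudonymizer:
    def __init__(self):
        self._map = {}
        self._ids = count(1)

    def _alias(self, name: str) -> str:
        if name not in self._map:
            self._map[name] = f"Person_{next(self._ids)}"
        return self._map[name]

    def apply(self, text: str, names: list[str]) -> str:
        # `names` stands in for the output of a real entity detector.
        for name in names:
            text = re.sub(rf"\b{re.escape(name)}\b", self._alias(name), text)
        return text

    def restore(self, text: str) -> str:
        for name, alias in self._map.items():
            text = text.replace(alias, name)
        return text

p = Pseudonymizer()
msg = "Alice told Bob that Alice's diagnosis should stay private."
safe = p.apply(msg, names=["Alice", "Bob"])
print(safe)             # Person_1 told Person_2 that Person_1's diagnosis ...
print(p.restore(safe))  # round-trips back to the original names
```

Because a name maps to the same alias every time, referential structure survives, which is what lets pseudonymization retain more task-relevant information than suppression or generalization.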

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning

Researchers find that large language models face a fundamental limitation in latent reasoning: they can discover multi-step planning strategies without explicit supervision, but only up to depths of 3-7 steps depending on model size and training method. This finding suggests that complex reasoning tasks may require explicit chain-of-thought monitoring rather than relying on hidden internal computations.

🧠 GPT-4 · 🧠 GPT-5
AI · Bullish · arXiv – CS AI · Apr 10 · 6/10

Improving Robustness In Sparse Autoencoders via Masked Regularization

Researchers propose a masked regularization technique to improve the robustness and interpretability of Sparse Autoencoders (SAEs) used in large language model analysis. The method addresses feature absorption and out-of-distribution performance failures by randomly replacing tokens during training to disrupt co-occurrence patterns, offering a practical path toward more reliable mechanistic interpretability tools.
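
A toy training step can show where such a regularizer slots in. The sketch below only crudely approximates the idea, shuffling a fraction of activation vectors within a batch instead of replacing tokens in the LM input as the paper describes; dimensions, sparsity penalty, and replacement rate are arbitrary.

```python
# Toy sparse autoencoder step with a crude masked-regularization stand-in
# (illustrative; not the paper's exact token-replacement scheme).
import torch
import torch.nn.functional as F

d_model, d_sae, batch = 128, 512, 64
enc = torch.nn.Linear(d_model, d_sae)
dec = torch.nn.Linear(d_sae, d_model)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

def disrupt_cooccurrence(acts: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    """Replace a fraction p of rows with other rows from the same batch,
    breaking co-occurrence patterns the SAE could otherwise absorb."""
    acts = acts.clone()
    mask = torch.rand(acts.shape[0]) < p
    donors = torch.randint(0, acts.shape[0], (int(mask.sum()),))
    acts[mask] = acts[donors]
    return acts

acts = torch.randn(batch, d_model)      # stand-in for real LM activations
for step in range(100):
    x = disrupt_cooccurrence(acts)
    z = F.relu(enc(x))                   # sparse code
    recon = dec(z)
    loss = F.mse_loss(recon, x) + 1e-3 * z.abs().mean()   # L1 sparsity penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```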

AI · Bearish · arXiv – CS AI · Apr 10 · 6/10

MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors

Researchers introduce MedDialBench, a comprehensive benchmark testing how large language models maintain diagnostic accuracy when patients exhibit adversarial behaviors across five dimensions. The study reveals that fabricating symptoms causes 1.7-3.4x larger accuracy drops than withholding information, with worst-case performance degradation ranging from 38.8 to 54.1 percentage points across tested models.

AI · Bearish · arXiv – CS AI · Apr 10 · 6/10

The Impact of Steering Large Language Models with Persona Vectors in Educational Applications

Researchers studied how persona vectors—AI steering techniques that inject personality traits into large language models—affect educational applications like essay generation and automated grading. The study found that persona steering significantly degrades answer quality, with substantially larger negative impacts on open-ended humanities tasks compared to factual science questions, and reveals that AI scorers exhibit predictable bias patterns based on assigned personality traits.

Page 16 of 26