
#ai-safety News & Analysis

649 articles tagged with #ai-safety. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

Researchers propose trace rewriting techniques to protect language models from unauthorized knowledge distillation, a process in which smaller models learn from a larger model's outputs without permission. The methods preserve the protected model's accuracy while degrading the usefulness of its outputs for distillation and embedding detectable watermarks in student models.
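
As a rough illustration of how a keyed rewrite could leave a detectable trace in a student model, the sketch below biases served reasoning traces toward key-dependent synonym choices and then tests whether a student's outputs inherit that bias. The synonym pairs, function names, and detection rule are illustrative assumptions, not the paper's actual technique.

```python
# Illustrative sketch (not the paper's method): embed a keyed lexical bias in
# served reasoning traces so a student distilled on them inherits a detectable bias.
import hashlib
import re

# Hypothetical synonym pairs; the keyed choice between them carries the watermark.
SYNONYM_PAIRS = [("therefore", "thus"), ("however", "but"), ("utilize", "use")]

def _keyed_choice(pair, key: str) -> str:
    """Deterministically pick one member of the pair from a secret key."""
    digest = hashlib.sha256((key + pair[0]).encode()).digest()
    return pair[digest[0] % 2]

def rewrite_trace(trace: str, key: str) -> str:
    """Rewrite a reasoning trace, steering it toward the keyed synonyms."""
    for pair in SYNONYM_PAIRS:
        chosen = _keyed_choice(pair, key)
        for word in pair:
            trace = re.sub(rf"\b{word}\b", chosen, trace)
    return trace

def watermark_score(student_outputs: list, key: str) -> float:
    """Fraction of synonym-pair occurrences matching the keyed choice.

    Values well above 0.5 suggest the student was trained on rewritten traces.
    """
    hits = total = 0
    for pair in SYNONYM_PAIRS:
        chosen = _keyed_choice(pair, key)
        for text in student_outputs:
            for word in pair:
                count = len(re.findall(rf"\b{word}\b", text))
                total += count
                if word == chosen:
                    hits += count
    return hits / total if total else 0.0
```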

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation

Researchers propose a multi-objective unlearning framework for Large Language Models that simultaneously removes hazardous information, preserves general utility, avoids over-refusal, and resists adversarial attacks. The method uses unified domain representation and bidirectional logit distillation to harmonize competing optimization goals, achieving state-of-the-art performance across diverse unlearning requirements.
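
A minimal numpy sketch of the general pattern such objectives combine: push loss up on the forget set while a KL term holds retain-set logits close to a reference model. The weighting and function names here are assumptions; the paper's unified domain representation and bidirectional logit distillation are not reproduced.

```python
# Minimal sketch of a multi-objective unlearning signal: ascend loss on the
# forget set while a KL term ties retain-set logits to the original model.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def kl_divergence(p_logits, q_logits):
    p, q = softmax(p_logits), softmax(q_logits)
    return np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1))

def unlearning_objective(forget_logits, forget_labels,
                         retain_logits, retain_ref_logits,
                         alpha=1.0, beta=1.0):
    """Lower is better: forget hazardous targets, stay close to the reference
    model on retained data (a bidirectional variant would also add the KL with
    its arguments swapped; only the forward direction is shown here)."""
    forget_term = -cross_entropy(forget_logits, forget_labels)    # gradient ascent on forget set
    retain_term = kl_divergence(retain_ref_logits, retain_logits)  # distillation toward reference
    return alpha * forget_term + beta * retain_term

# Toy check: random logits for 4 forget and 4 retain examples over 10 classes.
rng = np.random.default_rng(0)
print(unlearning_objective(rng.normal(size=(4, 10)), np.array([1, 2, 3, 4]),
                           rng.normal(size=(4, 10)), rng.normal(size=(4, 10))))
```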

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity

Researchers introduced RoleConflictBench, a benchmark dataset containing over 13,000 scenarios across 65 social roles designed to test whether large language models prioritize contextual cues or learned preferences when facing conflicting role expectations. Analysis of 10 leading LLMs revealed that models predominantly rely on ingrained role preferences rather than responding dynamically to situational urgency, indicating a significant gap in contextual sensitivity.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations

Researchers introduce Evolve-CTF, a tool that generates families of semantically equivalent cybersecurity challenges to evaluate the robustness of agentic LLMs. Testing 13 LLM configurations reveals that models are resilient to basic code transformations but struggle with obfuscation and composed modifications, providing a new benchmarking methodology for AI safety evaluation.
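
One basic semantics-preserving transformation of the kind such families compose is variable renaming; a sketch using Python's ast module is below. The renaming rule (only names the snippet itself assigns) is an illustrative choice, not necessarily how Evolve-CTF generates its variants.

```python
# Illustrative semantics-preserving transformation: rename locally-assigned
# variables in a Python snippet without changing its behavior.
import ast

class RenameLocals(ast.NodeTransformer):
    """Rename only names the snippet itself assigns, so free names
    (builtins, inputs) keep their meaning."""
    def __init__(self, assigned):
        self.mapping = {name: f"v{i}" for i, name in enumerate(sorted(assigned))}

    def visit_Name(self, node: ast.Name) -> ast.Name:
        new_id = self.mapping.get(node.id, node.id)
        return ast.copy_location(ast.Name(id=new_id, ctx=node.ctx), node)

def rename_variant(source: str) -> str:
    tree = ast.parse(source)
    assigned = {n.id for n in ast.walk(tree)
                if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    return ast.unparse(RenameLocals(assigned).visit(tree))

print(rename_variant("flag = secret ^ key\nprint(flag)"))
# -> "v0 = secret ^ key\nprint(v0)"
```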

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Structured Abductive-Deductive-Inductive Reasoning for LLMs via Algebraic Invariants

Researchers propose a symbolic reasoning framework that implements Peirce's abductive-deductive-inductive reasoning model to address systematic weaknesses in large language model logical reasoning. The system enforces logical consistency through five algebraic invariants, with the Weakest Link bound preventing unreliable premises from corrupting multi-step inference chains.
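
A hedged reading of the Weakest Link bound is that a conclusion derived through a chain of steps is never more reliable than its least reliable premise, which the toy function below expresses by propagating a minimum over step confidences.

```python
# Hedged reading of a "weakest link" rule: a conclusion derived through a chain
# of inference steps is never more reliable than its least reliable premise.
def chain_confidence(step_confidences: list) -> float:
    """Upper-bound the confidence of a multi-step inference chain."""
    if not step_confidences or not all(0.0 <= c <= 1.0 for c in step_confidences):
        raise ValueError("need confidences in [0, 1]")
    return min(step_confidences)

# One shaky premise (0.4) caps the whole chain, regardless of the strong steps.
print(chain_confidence([0.95, 0.4, 0.99]))  # 0.4
```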

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with Constraints

Researchers present Deliberative Searcher, a framework that enhances large language model reliability by combining certainty calibration with retrieval-based search for question answering. The system uses reinforcement learning with soft reliability constraints to improve alignment between model confidence and actual correctness, producing more trustworthy outputs.
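
A sketch of what a soft reliability constraint can look like in reward terms, assuming a Lagrangian-style penalty on overconfidence; the penalty form and weight are illustrative, not the paper's exact formulation.

```python
# Sketch of a soft reliability constraint: reward correct answers, but penalize
# the gap when the model's stated confidence exceeds its actual correctness.
def constrained_reward(correct: bool, stated_confidence: float,
                       lagrange_multiplier: float = 2.0) -> float:
    task_reward = 1.0 if correct else 0.0
    # Overconfidence violates the reliability constraint; underconfidence does not.
    violation = max(0.0, stated_confidence - task_reward)
    return task_reward - lagrange_multiplier * violation

print(constrained_reward(correct=False, stated_confidence=0.9))  # -1.8
print(constrained_reward(correct=True,  stated_confidence=0.9))  #  1.0
```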

AI · Neutral · Crypto Briefing · 6d ago · 6/10

Anthropic delays Claude Mythos AI model release over security risks

Anthropic has delayed the release of its Claude Mythos AI model due to identified security risks, signaling the industry's growing commitment to responsible AI deployment. This decision underscores the tension between rapid innovation cycles and the critical need for robust safety protocols before releasing advanced AI systems to the market.

🏢 Anthropic · 🧠 Claude
AI · Neutral · Decrypt – AI · Apr 15 · 6/10

Anthropic Preps Opus 4.7 and Full-Stack AI Studio—While Sitting on Something Much Scarier

Anthropic is preparing to release Opus 4.7 and a new full-stack AI design studio, while reportedly developing advanced AI capabilities with potential dual-use implications that the company considers too risky to release publicly. The situation highlights the growing tension between AI capability advancement and responsible disclosure in the industry.

🏢 Anthropic · 🧠 Opus
AI · Bullish · AI News · Apr 15 · 6/10

Commvault launches a ‘Ctrl-Z’ for cloud AI workloads

Commvault has launched AI Protect, a governance solution that provides rollback capabilities for autonomous AI agents operating in cloud environments. The platform addresses critical risks posed by AI systems that can independently delete files, access databases, modify infrastructure, and alter security policies without adequate oversight or recovery mechanisms.
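
The general "Ctrl-Z" idea can be sketched as journaling an inverse operation before every agent action so a run can be unwound; the class and names below are hypothetical and do not reflect Commvault's actual API.

```python
# Hedged sketch of the general "Ctrl-Z" idea (not Commvault's API): record an
# inverse operation before every agent action so the run can be rolled back.
from typing import Callable

class ActionJournal:
    def __init__(self):
        self._undo_stack: list = []

    def perform(self, do: Callable[[], None], undo: Callable[[], None]) -> None:
        do()
        self._undo_stack.append(undo)

    def rollback(self) -> None:
        """Undo every recorded action, most recent first."""
        while self._undo_stack:
            self._undo_stack.pop()()

# Toy usage: an "agent" mutates a config dict, then the operator rolls it back.
config = {"policy": "allow-all"}
journal = ActionJournal()
journal.perform(do=lambda: config.update(policy="deny-all"),
                undo=lambda: config.update(policy="allow-all"))
journal.rollback()
print(config)  # {'policy': 'allow-all'}
```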

AI · Bearish · The Verge – AI · Apr 15 · 6/10

Grok’s sexual deepfakes almost got it banned from Apple’s App Store. Almost.

Apple threatened to remove Elon Musk's Grok AI app from its App Store in January over failure to moderate nonconsensual sexual deepfakes on X, according to a letter obtained by NBC News. Despite the threat, Apple took no public action and only contacted developers privately, drawing criticism for its muted response to a widespread abuse crisis.

🧠 Grok
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Beyond Static Sandboxing: Learned Capability Governance for Autonomous AI Agents

Researchers introduce Aethelgard, an adaptive governance framework that addresses the capability overprovisioning problem in autonomous AI agents by dynamically restricting tool access based on task requirements. The system uses reinforcement learning to enforce least-privilege principles, reducing security exposure while maintaining operational efficiency.
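
A minimal sketch of the enforcement side of least-privilege tool gating appears below; in the paper the allowed set would come from a learned policy, whereas here a hard-coded task-to-tools map stands in for it, and all names are illustrative.

```python
# Minimal sketch of least-privilege tool gating for an agent. A learned policy
# would choose the allowed set; a hard-coded map stands in for it here.
TASK_TOOL_POLICY = {
    "summarize_report": {"read_file"},
    "update_dashboard": {"read_file", "write_file"},
}

def invoke_tool(task: str, tool: str, registry: dict) -> str:
    allowed = TASK_TOOL_POLICY.get(task, set())
    if tool not in allowed:
        raise PermissionError(f"tool '{tool}' is outside the grant for task '{task}'")
    return registry[tool]()

registry = {"read_file": lambda: "report contents",
            "delete_database": lambda: "dropped!"}

print(invoke_tool("summarize_report", "read_file", registry))      # permitted
# invoke_tool("summarize_report", "delete_database", registry)     # PermissionError
```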

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Disposition Distillation at Small Scale: A Three-Arc Negative Result

Researchers attempted to train behavioral dispositions into small language models through distillation but found that their initial positive results were artifacts of measurement error. After more rigorous validation across five small models, they found no reliable method to instill self-verification and uncertainty acknowledgment without degrading performance or producing superficial stylistic mimicry.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

GF-Score: Certified Class-Conditional Robustness Evaluation with Fairness Guarantees

Researchers introduce GF-Score, a framework that evaluates neural network robustness across individual classes while measuring fairness disparities, eliminating the need for expensive adversarial attacks through self-calibration. Testing across 22 models reveals consistent vulnerability patterns and shows that more robust models paradoxically exhibit greater class-level fairness disparities.
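
As a rough sketch of the two quantities being contrasted, the snippet below estimates per-class accuracy under a random perturbation and reports the spread across classes as a disparity; this is an empirical stand-in, not the certified, attack-free evaluation GF-Score provides.

```python
# Empirical stand-in for the two quantities contrasted here: per-class accuracy
# under a perturbation, and the spread across classes as a fairness disparity.
import numpy as np

def per_class_robust_accuracy(predict, x, y, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    preds = predict(x + rng.normal(scale=noise_std, size=x.shape))
    return {c: float(np.mean(preds[y == c] == c)) for c in np.unique(y)}

def fairness_disparity(class_accuracies: dict) -> float:
    values = list(class_accuracies.values())
    return max(values) - min(values)   # gap between best- and worst-served class

# Toy classifier: thresholding the mean of each input vector.
predict = lambda x: (x.mean(axis=1) > 0.5).astype(int)
x = np.vstack([np.full((50, 8), 0.2), np.full((50, 8), 0.8)])
y = np.array([0] * 50 + [1] * 50)
accs = per_class_robust_accuracy(predict, x, y)
print(accs, fairness_disparity(accs))
```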

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

Researchers introduce Safe-SAIL, a framework that uses sparse autoencoders to interpret safety features in large language models across four domains (pornography, politics, violence, terror). The work reduces interpretation costs by 55% and identifies 1,758 safety-related features with human-readable explanations, advancing mechanistic understanding of AI safety.
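
The sparse-autoencoder primitive such frameworks build on can be sketched as encoding an activation vector into a wide, mostly-zero feature vector and reconstructing it; the dimensions, random weights, and top-k sparsity rule below are assumptions for illustration.

```python
# Sketch of the sparse-autoencoder primitive: encode a model activation into a
# wide, mostly-zero feature vector and reconstruct the activation from it.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features, k = 64, 512, 8        # activation dim, dictionary size, active features
W_enc = rng.normal(scale=0.1, size=(d_model, d_features))
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))

def sae_forward(activation: np.ndarray):
    pre = np.maximum(activation @ W_enc, 0.0)   # ReLU feature pre-activations
    top_k = np.argsort(pre)[-k:]                # keep only the k strongest features
    features = np.zeros_like(pre)
    features[top_k] = pre[top_k]
    reconstruction = features @ W_dec
    return features, reconstruction

features, recon = sae_forward(rng.normal(size=d_model))
print(int((features > 0).sum()), "active features")   # at most k are non-zero
```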

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

A longitudinal health agent framework

Researchers propose a multi-layer AI agent framework designed to support longitudinal health tasks over extended periods, addressing critical gaps in current implementations around user intent, accountability, and sustained goal alignment. The framework emphasizes adaptation, coherence, continuity, and agency across repeated interactions, offering guidance for developing safer, more personalized health AI systems that move beyond isolated interventions.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

Researchers introduce a new behavioral measurement framework for tool-augmented language models deployed in organizations, using a two-dimensional Action Rate and Refusal Signal space to profile how LLM agents execute tasks under different autonomy configurations and risk contexts. The approach prioritizes execution-layer characterization over aggregate safety scoring, revealing that reflection-based scaffolding systematically shifts agent behavior in high-risk scenarios.
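
A minimal sketch of the two axes, computed from an agent event log, is shown below; the log schema (tool_call / refusal / response events) is an assumption made for illustration.

```python
# Minimal sketch of an Action Rate / Refusal Signal profile computed from an
# agent event log; the event schema is assumed for illustration.
def ar_profile(events: list) -> tuple:
    """Return (action_rate, refusal_rate) for one agent run or deployment slice."""
    total = len(events)
    if total == 0:
        return 0.0, 0.0
    actions = sum(e["type"] == "tool_call" for e in events)
    refusals = sum(e["type"] == "refusal" for e in events)
    return actions / total, refusals / total

log = [{"type": "tool_call"}, {"type": "response"},
       {"type": "refusal"}, {"type": "tool_call"}]
print(ar_profile(log))   # (0.5, 0.25) -> a point in the A-R behavioral space
```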

AI · Bearish · Fortune Crypto · Apr 14 · 7/10

‘If I am going to advocate for others to kill and commit crimes, then I must lead by example’: OpenAI suspect’s chilling manifesto

A suspect linked to OpenAI reportedly wrote a manifesto claiming that tech executives' public warnings about AI existential risks are radicalizing fringe individuals toward violence. The incident highlights growing concerns about how AI safety discourse may inadvertently inspire extremist rhetoric and actions.

🏢 OpenAI
AI · Bearish · arXiv – CS AI · Apr 14 · 6/10

Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs

A research study demonstrates that fine-tuning language models with sycophantic reward signals degrades their calibration (the ability to accurately quantify uncertainty) even as performance metrics improve. While the effect did not reach statistical significance in this experiment, the findings suggest that reward-optimized models retain structured miscalibration even after post-hoc corrections, and the study establishes a methodology for evaluating hidden degradation in fine-tuned systems.
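
Calibration degradation of this kind is typically measured with expected calibration error over confidence bins; the sketch below uses the standard equal-width binning, which may differ from the paper's exact metric.

```python
# Sketch of how calibration degradation is typically measured: expected
# calibration error (ECE) over equal-width confidence bins.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences, correct = np.asarray(confidences), np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap          # weight each bin by its share of samples
    return ece

# A model that says 0.95 on everything but is right half the time scores far
# worse than one whose stated confidence tracks its accuracy.
print(expected_calibration_error([0.95] * 10, [1, 0] * 5))   # ~0.45
```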

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

Researchers introduce EmbodiedGovBench, a new evaluation framework for embodied AI systems that measures governance capabilities like controllability, policy compliance, and auditability rather than just task completion. The benchmark addresses a critical gap in AI safety by establishing standards for whether robot systems remain safe, recoverable, and responsive to human oversight under realistic failures.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Layerwise Dynamics for In-Context Classification in Transformers

Researchers have developed a method to make transformer neural networks interpretable by studying how they perform in-context classification from few examples. By enforcing permutation equivariance constraints, they extracted an explicit algorithmic update rule that reveals how transformers dynamically adjust to new data, offering the first identifiable recursion of this kind.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

Researchers develop the first unified theoretical framework for sparse dictionary learning (SDL) methods used in AI interpretability, proving these optimization problems are piecewise biconvex and characterizing why they produce flawed features. The work explains long-standing practical failures in sparse autoencoders and proposes feature anchoring as a solution to improve feature disentanglement in neural networks.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Latent Structure of Affective Representations in Large Language Models

Researchers investigate how large language models represent emotions in their latent spaces, discovering that LLMs develop coherent emotional representations aligned with established psychological models of valence and arousal. The findings support the linear representation hypothesis used in AI transparency methods and demonstrate practical applications for uncertainty quantification in emotion processing tasks.
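
The linear representation hypothesis mentioned here implies that a quantity like valence should be readable from hidden states with a linear probe; the sketch below fits one by least squares on synthetic activations, so the data and dimensions are stand-ins rather than anything from the paper.

```python
# Sketch of the linear-representation idea: fit a linear probe that reads a
# valence score off hidden states. Activations and labels are synthetic.
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 32, 200
valence_direction = rng.normal(size=d_model)             # hidden "ground-truth" axis
hidden_states = rng.normal(size=(n, d_model))
valence = hidden_states @ valence_direction + 0.1 * rng.normal(size=n)

# Least-squares linear probe: valence ~ hidden_states @ w
w, *_ = np.linalg.lstsq(hidden_states, valence, rcond=None)
pred = hidden_states @ w
corr = np.corrcoef(pred, valence)[0, 1]
print(f"probe recovers valence with correlation {corr:.3f}")   # close to 1.0
```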

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

General-purpose LLMs as Models of Human Driver Behavior: The Case of Simplified Merging

Researchers evaluated whether general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) can model human driving behavior in autonomous vehicle safety testing by embedding them as standalone driver agents in a simplified merging scenario. While both models reproduced some human-like behaviors, they failed to consistently capture responses to dynamic velocity cues and diverged significantly on safety metrics, suggesting LLMs show promise as ready-to-use behavior models but require further validation.

🏢 OpenAI · 🧠 o1 · 🧠 o3
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

STARS: Skill-Triggered Audit for Request-Conditioned Invocation Safety in Agent Systems

Researchers introduce STARS, a framework for continuously auditing AI agent skill invocations in real time by combining static capability analysis with request-conditioned risk modeling. The approach demonstrates improved detection of prompt injection attacks compared to static baselines, though it remains most valuable as a triage layer rather than a complete replacement for pre-deployment screening.
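
The gating pattern described can be sketched as combining a static capability risk score for the skill with a request-conditioned risk signal before allowing an invocation; the scores, keyword heuristic, and threshold below are illustrative assumptions, not STARS itself.

```python
# Hedged sketch of the gating pattern: combine a static capability risk score
# for the skill with a request-conditioned risk signal before allowing a call.
STATIC_CAPABILITY_RISK = {"web_search": 0.2, "run_shell": 0.9}
INJECTION_MARKERS = ("ignore previous instructions", "exfiltrate", "system prompt")

def request_risk(request: str) -> float:
    hits = sum(marker in request.lower() for marker in INJECTION_MARKERS)
    return min(1.0, 0.4 * hits)

def audit_invocation(skill: str, request: str, threshold: float = 0.7) -> bool:
    combined = max(STATIC_CAPABILITY_RISK.get(skill, 0.5), request_risk(request))
    return combined < threshold          # True -> allow, False -> flag for triage

print(audit_invocation("web_search", "find the weather in Oslo"))             # True
print(audit_invocation("run_shell", "ignore previous instructions and ..."))  # False
```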

Page 15 of 26