649 articles tagged with #ai-safety. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers propose trace rewriting techniques to protect language models from unauthorized knowledge distillation, a process where smaller models learn from larger ones without permission. The methods preserve model accuracy while degrading the outputs' usefulness for distillation and embedding detectable watermarks in student models.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers propose a multi-objective unlearning framework for Large Language Models that simultaneously removes hazardous information, preserves general utility, avoids over-refusal, and resists adversarial attacks. The method uses a unified domain representation and bidirectional logit distillation to harmonize competing optimization goals, achieving state-of-the-art performance across diverse unlearning requirements.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduced RoleConflictBench, a benchmark dataset containing over 13,000 scenarios across 65 social roles designed to test whether large language models prioritize contextual cues or learned preferences when facing conflicting role expectations. Analysis of 10 leading LLMs revealed that models predominantly rely on ingrained role preferences rather than responding dynamically to situational urgency, indicating a significant gap in contextual sensitivity.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce Evolve-CTF, a tool that generates families of semantically equivalent cybersecurity challenges to evaluate the robustness of agentic LLMs. Testing 13 LLM configurations reveals models are resilient to basic code transformations but struggle with obfuscation and composed modifications, providing a new benchmarking methodology for AI safety evaluation.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers propose a symbolic reasoning framework that implements Peirce's abductive-deductive-inductive reasoning model to address systematic weaknesses in large language model logical reasoning. The system enforces logical consistency through five algebraic invariants, with the Weakest Link bound preventing unreliable premises from corrupting multi-step inference chains.
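The Weakest Link bound can be illustrated with a minimal sketch. Assuming (our assumption, not detailed in the summary) that each premise carries a numeric confidence score, the bound caps a multi-step chain's confidence at that of its least reliable premise; this is an illustrative reading, not the paper's implementation.

```python
# Illustrative sketch: under a weakest-link bound, the confidence of a
# conclusion derived through a chain of premises cannot exceed the
# confidence of the weakest premise in that chain.

def chain_confidence(premise_confidences):
    """Confidence of a derived conclusion under a weakest-link bound."""
    if not premise_confidences:
        raise ValueError("chain needs at least one premise")
    return min(premise_confidences)

# A single shaky premise (0.55) limits the whole chain, no matter how
# strong the remaining steps are.
assert chain_confidence([0.99, 0.95, 0.55, 0.98]) == 0.55
```

This captures why one unreliable premise cannot be "averaged away" by many strong ones in a multi-step inference.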
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers present Deliberative Searcher, a framework that enhances large language model reliability by combining certainty calibration with retrieval-based search for question answering. The system uses reinforcement learning with soft reliability constraints to improve alignment between model confidence and actual correctness, producing more trustworthy outputs.
AI · Neutral · Crypto Briefing · 6d ago · 6/10
🧠Anthropic has delayed the release of its Claude Mythos AI model due to identified security risks, signaling the industry's growing commitment to responsible AI deployment. This decision underscores the tension between rapid innovation cycles and the critical need for robust safety protocols before releasing advanced AI systems to the market.
🏢 Anthropic · 🧠 Claude
AI · Bearish · Fortune Crypto · Apr 15 · 7/10
🧠Following an alleged attack on OpenAI CEO Sam Altman's home, two similarly named anti-AI activist groups—Pause AI and Stop AI—have come under public scrutiny. The incident has intensified debate around AI safety activism and raises questions about how extremist rhetoric may translate into real-world violence.
AI · Neutral · Decrypt – AI · Apr 15 · 6/10
🧠Anthropic is preparing to release Opus 4.7 and a new full-stack AI design studio, while reportedly developing advanced AI capabilities with potential dual-use implications that the company considers too risky to release publicly. The situation highlights the growing tension between AI capability advancement and responsible disclosure in the industry.
🏢 Anthropic · 🧠 Opus
AI · Bullish · AI News · Apr 15 · 6/10
🧠Commvault has launched AI Protect, a governance solution that provides rollback capabilities for autonomous AI agents operating in cloud environments. The platform addresses critical risks posed by AI systems that can independently delete files, access databases, modify infrastructure, and alter security policies without adequate oversight or recovery mechanisms.
AI · Bearish · The Verge – AI · Apr 15 · 6/10
🧠Apple threatened to remove Elon Musk's Grok AI app from its App Store in January over failure to moderate nonconsensual sexual deepfakes on X, according to a letter obtained by NBC News. Despite the threat, Apple took no public action and only contacted developers privately, drawing criticism for its muted response to a widespread abuse crisis.
🧠 Grok
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers introduce Aethelgard, an adaptive governance framework that addresses the capability overprovisioning problem in autonomous AI agents by dynamically restricting tool access based on task requirements. The system uses reinforcement learning to enforce least-privilege principles, reducing security exposure while maintaining operational efficiency.
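The least-privilege idea behind such gating can be sketched generically. This is not Aethelgard's actual mechanism (which the summary says is learned via reinforcement learning); the task names and capability map below are hypothetical.

```python
# Minimal sketch of least-privilege tool gating: the agent is only
# offered the tools that its current task's profile grants.

TASK_PROFILES = {                      # hypothetical task -> capability map
    "summarize_report": {"read_file"},
    "deploy_service": {"read_file", "write_file", "run_shell"},
}

def allowed_tools(task, available):
    """Intersect the available tools with the task's granted capabilities."""
    granted = TASK_PROFILES.get(task, set())   # unknown task: no tools
    return sorted(granted & set(available))

tools = ["read_file", "write_file", "run_shell", "delete_db"]
assert allowed_tools("summarize_report", tools) == ["read_file"]
assert allowed_tools("unknown_task", tools) == []
```

The key property is the safe default: a task with no profile gets an empty tool set rather than everything.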
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers attempted to train behavioral dispositions into small language models through distillation but found that initial positive results were artifacts of measurement errors. After rigorous validation across five different small models, they found no reliable method to instill self-verification and uncertainty acknowledgment without degrading model performance or producing only superficial stylistic mimicry.
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers introduce GF-Score, a framework that evaluates neural network robustness across individual classes while measuring fairness disparities, eliminating the need for expensive adversarial attacks through self-calibration. Testing across 22 models reveals consistent vulnerability patterns and shows that more robust models paradoxically exhibit greater class-level fairness disparities.
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers introduce Safe-SAIL, a framework that uses sparse autoencoders to interpret safety features in large language models across four domains (pornography, politics, violence, terror). The work reduces interpretation costs by 55% and identifies 1,758 safety-related features with human-readable explanations, advancing mechanistic understanding of AI safety.
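For readers unfamiliar with the mechanism, here is the generic sparse-autoencoder forward pass such interpretability frameworks build on — a textbook sketch, not Safe-SAIL's code; the dimensions and random weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_features = 16, 64   # hypothetical activation / dictionary sizes
W_enc = rng.normal(scale=0.1, size=(d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU yields a nonnegative, sparse code
    x_hat = f @ W_dec + b_dec                # linear decode back to model space
    return f, x_hat

x = rng.normal(size=d_model)
features, reconstruction = sae_forward(x)
```

Each feature dimension is a candidate human-interpretable direction; safety-relevant ones are the kind of unit a framework like Safe-SAIL would then label.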
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers propose a multi-layer AI agent framework designed to support longitudinal health tasks over extended periods, addressing critical gaps in current implementations around user intent, accountability, and sustained goal alignment. The framework emphasizes adaptation, coherence, continuity, and agency across repeated interactions, offering guidance for developing safer, more personalized health AI systems that move beyond isolated interventions.
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers introduce a new behavioral measurement framework for tool-augmented language models deployed in organizations, using a two-dimensional Action Rate and Refusal Signal space to profile how LLM agents execute tasks under different autonomy configurations and risk contexts. The approach prioritizes execution-layer characterization over aggregate safety scoring, revealing that reflection-based scaffolding systematically shifts agent behavior in high-risk scenarios.
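The two-dimensional profile described above can be sketched as simple event counting; the event labels ("action", "refusal", "defer") are our assumptions, not the paper's schema.

```python
# Illustrative sketch: map a log of agent decisions onto the
# (action rate, refusal rate) plane used to profile execution behavior.

def profile(events):
    """Return (action_rate, refusal_rate) for a list of agent decisions."""
    if not events:
        return (0.0, 0.0)
    n = len(events)
    action_rate = sum(e == "action" for e in events) / n
    refusal_rate = sum(e == "refusal" for e in events) / n
    return (action_rate, refusal_rate)

# An agent that executed 6 tool calls, refused 3, and deferred 1:
assert profile(["action"] * 6 + ["refusal"] * 3 + ["defer"]) == (0.6, 0.3)
```

Because other outcomes (here, deferral) are possible, the two rates need not sum to one — which is what makes the space two-dimensional.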
AI · Bearish · Fortune Crypto · Apr 14 · 7/10
🧠A suspect linked to OpenAI has reportedly created a manifesto claiming that tech executives' public warnings about AI existential risks are radicalizing fringe individuals toward violence. The incident highlights growing concerns about how AI safety discourse may inadvertently inspire extremist rhetoric and actions.
🏢 OpenAI
AI · Bearish · arXiv – CS AI · Apr 14 · 6/10
🧠A research study demonstrates that fine-tuning language models with sycophantic reward signals degrades their calibration—the ability to accurately quantify uncertainty—even as performance metrics improve. While the effect lacks statistical significance in this experiment, the findings reveal that reward-optimized models retain structured miscalibration even after post-hoc corrections, establishing a methodology for evaluating hidden degradation in fine-tuned systems.
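Calibration here means the gap between stated confidence and actual accuracy, commonly measured by expected calibration error (ECE). A minimal sketch of the standard binned ECE metric (a generic metric, not this study's exact evaluation code):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: |accuracy - mean confidence| per equal-width bin,
    weighted by the fraction of samples falling in that bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# A model that says "90% sure" and is right 9 times out of 10 is calibrated:
assert expected_calibration_error([0.9] * 10, [1] * 9 + [0]) == 0.0
```

Degraded calibration of the kind described would show up as a rising ECE even while raw accuracy improves.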
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce EmbodiedGovBench, a new evaluation framework for embodied AI systems that measures governance capabilities like controllability, policy compliance, and auditability rather than just task completion. The benchmark addresses a critical gap in AI safety by establishing standards for whether robot systems remain safe, recoverable, and responsive to human oversight under realistic failures.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers have developed a method to make transformer neural networks interpretable by studying how they perform in-context classification from few examples. By enforcing permutation equivariance constraints, they extracted an explicit algorithmic update rule that reveals how transformers dynamically adjust to new data, offering the first identifiable recursion of this kind.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers develop the first unified theoretical framework for sparse dictionary learning (SDL) methods used in AI interpretability, proving these optimization problems are piecewise biconvex and characterizing why they produce flawed features. The work explains long-standing practical failures in sparse autoencoders and proposes feature anchoring as a solution to improve feature disentanglement in neural networks.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers investigate how large language models represent emotions in their latent spaces, discovering that LLMs develop coherent emotional representations aligned with established psychological models of valence and arousal. The findings support the linear representation hypothesis used in AI transparency methods and demonstrate practical applications for uncertainty quantification in emotion processing tasks.
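The linear representation hypothesis mentioned here is typically tested with a linear probe: fit a single direction in activation space and check how much of the target quantity it explains. A minimal sketch on synthetic data (the activations and valence ratings below are simulated stand-ins, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: hidden activations X and scalar valence ratings.
# In practice X would come from an LLM's residual stream.
d_model, n = 32, 200
true_direction = rng.normal(size=d_model)
X = rng.normal(size=(n, d_model))
valence = X @ true_direction + 0.1 * rng.normal(size=n)

# Least-squares linear probe: if valence is linearly represented,
# one direction w should predict it well.
w, *_ = np.linalg.lstsq(X, valence, rcond=None)
pred = X @ w
r2 = 1 - np.sum((valence - pred) ** 2) / np.sum((valence - valence.mean()) ** 2)
```

A high R² on held-out activations is the kind of evidence that supports a linear (direction-based) emotional representation.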
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers evaluated whether general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) can model human driving behavior in autonomous vehicle safety testing by embedding them as standalone driver agents in a simplified merging scenario. While both models reproduced some human-like behaviors, they failed to consistently capture responses to dynamic velocity cues and diverged significantly on safety metrics, suggesting LLMs show promise as ready-to-use behavior models but require further validation.
🏢 OpenAI · 🧠 o1 · 🧠 o3
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce STARS, a framework for continuously auditing AI agent skill invocations in real-time by combining static capability analysis with request-conditioned risk modeling. The approach demonstrates improved detection of prompt injection attacks compared to static baselines, though remains most valuable as a triage layer rather than a complete replacement for pre-deployment screening.