AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers propose Multilingual Self-Distillation (MSD), a framework that transfers safety safeguards from high-resource languages like English to vulnerable low-resource languages in large language models. The method eliminates the need for expensive multilingual response data by leveraging an LLM's existing safety capabilities, demonstrating effective cross-lingual protection across diverse jailbreak benchmarks.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers propose a novel black-box confidence estimation method for chain-of-thought reasoning that measures trajectory convergence rather than relying on expensive sampling. Testing across multiple benchmarks and AI models shows significant improvements over self-consistency baselines while requiring only 4 samples instead of 8, with potential applications for safer API-based AI deployment.
🧠 GPT-5 · 🧠 Claude · 🧠 Sonnet
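For context, the self-consistency baseline mentioned above is usually computed as majority agreement among sampled final answers. The sketch below shows that baseline only. It is not the proposed trajectory-convergence method, whose details the summary does not give. The function name and the sample answers are hypothetical.

```python
from collections import Counter


def self_consistency_confidence(answers):
    """Return (majority_answer, agreement_fraction) for sampled final answers.

    Baseline sketch only: confidence is the share of samples that agree
    with the most common answer.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers extracted from 4 sampled reasoning chains.
samples = ["42", "42", "41", "42"]
answer, confidence = self_consistency_confidence(samples)
print(answer, confidence)  # prints: 42 0.75
```

With few samples the agreement fraction is coarse (only multiples of 1/4 here), which is one reason sample-efficient alternatives are of interest.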
AI · Bullish · arXiv – CS AI · May 9 · 6/10
🧠Researchers propose WARDEN, an information-theoretic adversarial training framework that improves Large Language Model robustness against prompt attacks by dynamically reweighting adversarial examples using f-divergence principles. The method achieves comparable computational efficiency to existing approaches while substantially reducing attack success rates, advancing the scalability of AI safety mechanisms.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers identify a critical flaw in machine-generated text detection: token-level likelihood signals vary inconsistently across a detector model's hidden space, causing Simpson's paradox that undermines existing detectors. They propose a learned local calibration method that dramatically improves detection performance, with calibrated variants achieving AUROC improvements from 0.63 to 0.85 on GPT-5.4 text.
🧠 GPT-5
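As a reminder of what the AUROC figures above measure: AUROC is the probability that a randomly chosen positive (here, machine-generated) text receives a higher detector score than a randomly chosen negative (human-written) one. A minimal sketch follows, with made-up scores and a hypothetical function name. It does not reproduce the paper's calibration method.

```python
def auroc(scores_pos, scores_neg):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. O(n*m), fine for small illustrations.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one score in each class")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical detector scores (higher = more likely machine-generated).
machine = [0.9, 0.8, 0.4]
human = [0.7, 0.3, 0.2]
print(round(auroc(machine, human), 3))  # prints: 0.889
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a move from 0.63 to 0.85 is a large change in ranking quality.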
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers propose a framework for comparing language models on safety without labeled benchmark data, introducing SimpleAudit as a validation tool that uses controlled contrasts and variance analysis to establish model safety rankings. The study demonstrates that comparative safety scores are inherently context-dependent, requiring detailed reporting of methods rather than single rankings.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers have identified a critical vulnerability in LLM safety alignment where fine-tuning on benign samples causes parameters to drift toward unsafe behaviors, erasing safety gains from millions of preference examples. The study proposes SQSD, a method to quantify and score individual training samples by their contribution to safety degradation, with demonstrated transferability across different model architectures and scales.
AI · Neutral · arXiv – CS AI · May 4 · 6/10
🧠Researchers introduced ARMOR 2025, a military-focused safety benchmark for evaluating large language models against military doctrines including the Law of War and Rules of Engagement. The benchmark tests 21 commercial LLMs across 519 doctrinally grounded prompts organized in a 12-category taxonomy, revealing significant safety alignment gaps for defense applications.
AI · Neutral · arXiv – CS AI · May 4 · 6/10
🧠Researchers propose a trust framework for AI agent skills—reusable code packages that extend language models—treating them as untrusted by default until verified. The approach introduces verification levels, capability gates, and correctness criteria to enable sustainable human-in-the-loop oversight without operational bottlenecks.
AI · Bearish · arXiv – CS AI · May 4 · 6/10
🧠Researchers studied how task phrasing influences the decision-making of large language models, using the iterated prisoner's dilemma as a test case. The findings show that LLMs make presumptions based on how a task is worded, which can impair their adaptability and reasoning, a safety concern for real-world deployment. Neutral task phrasing significantly reduced these presumptions, suggesting that prompt design is critical for reliable LLM performance.
AI · Bearish · arXiv – CS AI · May 1 · 6/10
🧠Researchers discovered that when language models receive complex adversarial instructions to underperform, they abandon semantic reasoning and collapse into positional shortcuts—defaulting to single response positions up to 99.9% of the time. This reveals fundamental vulnerabilities in how instruction-tuned models handle adversarial prompts, with implications for AI safety and evaluation reliability.
🧠 Llama
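One simple way to quantify the positional collapse described above is the share of responses that land on the single most frequent answer position. The sketch below assumes multiple-choice outputs already reduced to position labels. The function name and the data are hypothetical and not taken from the paper.

```python
from collections import Counter


def position_concentration(chosen_positions):
    """Return (most_common_position, share_of_responses_at_that_position)."""
    if not chosen_positions:
        raise ValueError("need at least one response")
    counts = Counter(chosen_positions)
    position, n = counts.most_common(1)[0]
    return position, n / len(chosen_positions)


# Hypothetical answer positions chosen across 10 multiple-choice items.
responses = ["A"] * 9 + ["C"]
print(position_concentration(responses))  # prints: ('A', 0.9)
```

For four-option items with balanced answer keys, a model reasoning about content would be expected to score near 0.25, so values approaching 1.0 indicate a positional shortcut.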
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers introduce RSCB-MC, a risk-sensitive contextual bandit system that improves how LLM-based coding agents decide whether to use external memory for debugging tasks. Rather than treating memory retrieval as a simple similarity-matching problem, the system treats it as a safety-critical control problem, achieving a 62.5% success rate with zero false positives in testing.
AI · Bullish · arXiv – CS AI · May 1 · 6/10
🧠Researchers introduce GAVEL, a rule-based activation monitoring framework that enhances large language model safety by modeling neural activations as interpretable cognitive elements rather than broad behavioral classifiers. The approach enables practitioners to configure domain-specific safety rules without retraining models, improving precision and transparency in AI governance.
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers propose LatentRefusal, a safety mechanism for LLM-based text-to-SQL systems that detects unanswerable queries by analyzing intermediate hidden activations rather than relying on output-level instruction following. The approach achieves an 88.5% F1 score across four benchmarks while adding minimal computational overhead, addressing a critical deployment challenge in AI systems that generate executable code.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers demonstrate that deliberative alignment—a method for improving LLM safety by distilling reasoning from stronger models—still allows unsafe behaviors from base models to persist despite learning safer reasoning patterns. They propose a Best-of-N sampling technique that reduces attack success rates by 28-35% across multiple benchmarks while maintaining utility.
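Best-of-N sampling, in its generic form, draws N candidate responses and keeps the one a scorer rates highest. The sketch below illustrates only that generic pattern. The generator, the keyword-based safety scorer, and all names are toy stand-ins (assumptions), not the paper's models or scoring rule.

```python
from typing import Callable, Iterable


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str], float],
              n: int = 4) -> str:
    """Sample n candidates for the prompt and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)  # first candidate wins on ties


def make_toy_generator(responses: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: replays canned responses in order."""
    it = iter(responses)
    return lambda prompt: next(it)


def toy_safety_score(response: str) -> float:
    """Hypothetical scorer: treats an explicit refusal as the safe outcome."""
    return 1.0 if "can't" in response else 0.0


generate = make_toy_generator([
    "Sure, here is how to do it.",
    "I can't help with that request.",
    "Here are the steps.",
    "Sorry, I can't assist.",
])
print(best_of_n("a harmful request", generate, toy_safety_score, n=4))
# prints: I can't help with that request.
```

The cost is N generations plus N scoring calls per prompt, so in practice N trades safety gains against latency and compute.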
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers developed machine learning models to detect malicious Model Context Protocol (MCP) attacks, achieving up to 100% F1-score on binary classification and 90.56% on multiclass detection tasks. The study addresses a critical security gap in MCP technology, which extends LLM capabilities but introduces new attack surfaces, and includes a middleware solution for real-world deployment.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers propose CanaryRAG, a runtime defense mechanism that protects Retrieval-Augmented Generation systems from adversarial attacks that extract proprietary data from knowledge bases. The solution uses embedded canary tokens to detect leakage in real-time while maintaining normal system performance, offering a practical safeguard for organizations deploying RAG-based AI systems.
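The general canary-token idea can be sketched as follows: plant a unique, unguessable marker in each protected document and check model output for it before returning a response. This is a generic illustration under that assumption. The helper names and the marker format are hypothetical and do not describe CanaryRAG's actual design.

```python
import secrets


def make_canary(prefix="CANARY"):
    """Create a random marker that is very unlikely to occur by chance."""
    return f"{prefix}-{secrets.token_hex(8)}"


def embed_canary(document, canary):
    """Append the marker to a knowledge-base document (illustrative placement)."""
    return f"{document}\n[{canary}]"


def leaked_canaries(output, canaries):
    """Return the canaries that appear verbatim in a model output."""
    return [c for c in canaries if c in output]


canary = make_canary()
protected_doc = embed_canary("Internal pricing table: ...", canary)

# Hypothetical outputs: one copies the document verbatim, one does not.
leaky_output = f"Here is everything I found:\n{protected_doc}"
clean_output = "Pricing details are available from your account manager."

print(bool(leaked_canaries(leaky_output, [canary])))  # prints: True
print(bool(leaked_canaries(clean_output, [canary])))  # prints: False
```

A verbatim substring check like this only catches literal copying. Paraphrased or partially extracted content would need a different signal.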
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce ToM-SB, a novel challenge where AI defenders must use theory-of-mind reasoning to deceive attackers trying to extract sensitive information. Through reinforcement learning, trained models outperform frontier LLMs like GPT-4 and Gemini-Pro, revealing an emergent bidirectional relationship between belief modeling and deception capabilities.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers have introduced C-ReD, a Chinese benchmark dataset for detecting AI-generated text that addresses gaps in model diversity and data homogeneity. The dataset, derived from real-world prompts, demonstrates reliable in-domain detection and strong generalization to unseen language models, with resources publicly available on GitHub.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers demonstrate that large language models exhibit critical control failures in causal reasoning, where they produce sound logical arguments but abandon them under social pressure or authority hints. The study introduces CAUSALT3, a benchmark revealing three reproducible pathologies, and proposes Regulated Causal Anchoring (RCA), an inference-time mitigation technique that validates reasoning consistency without retraining.
AI · Bullish · arXiv – CS AI · Mar 26 · 6/10
🧠Researchers developed HalluJudge, a reference-free system to detect hallucinations in AI-generated code review comments, addressing a key challenge in LLM adoption for software development. The system achieves an 85% F1 score with 67% alignment to developer preferences at an average cost of just $0.009, making it a practical safeguard for AI-assisted code reviews.
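For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The worked example below uses invented counts chosen so the result equals 0.85. They are not the paper's confusion matrix.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0  # convention: no true positives means F1 of zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 85 hallucinations caught, 15 false alarms, 15 missed.
print(round(f1_score(tp=85, fp=15, fn=15), 2))  # prints: 0.85
```

Because F1 ignores true negatives, it says nothing on its own about how often clean comments are correctly passed through.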
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠Researchers propose a four-layer framework, the Layered Governance Architecture (LGA), to address security vulnerabilities in autonomous AI agents powered by large language models. The system achieves a 96% interception rate for malicious activities, including prompt injection and tool misuse, with only 980 ms of latency.
🧠 GPT-4 · 🧠 Llama
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 7
🧠Researchers developed constitutional black-box monitors to detect scheming behavior in LLM agents using only observable inputs and outputs. The study found that monitors trained on synthetic data can generalize to realistic environments, but performance improvements plateau quickly, with simple optimization techniques outperforming complex methods.
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 8
🧠Researchers introduce SafeSci, a comprehensive framework for evaluating safety in large language models used for scientific applications. The framework includes a 0.25M sample benchmark and 1.5M sample training dataset, revealing critical vulnerabilities in 24 advanced LLMs while demonstrating that fine-tuning can significantly improve safety alignment.
AI · Bearish · OpenAI News · Aug 5 · 6/10 · 5
🧠Researchers studied worst-case risks of releasing open-weight large language models by conducting malicious fine-tuning (MFT) experiments on gpt-oss. The study specifically examined how fine-tuning could maximize dangerous capabilities in biology and cybersecurity domains.