AI · Neutral · arXiv – CS AI · 1d ago · 7/10
🧠Researchers demonstrate that safety-aligned large language models remain vulnerable to token injections at any point during generation, not just early in the output sequence. By training models directly on generation trajectories with mid-sequence perturbations, they achieve improved robustness that generalizes across different attack vectors, revealing that robust AI safety requires alignment of the entire generation process rather than just output supervision.
AI · Neutral · arXiv – CS AI · 1d ago · 7/10
🧠Researchers introduce MENTOR, a metacognition-driven framework that addresses a critical vulnerability in Large Language Models: an average jailbreak success rate of 57.8% across domain-specific risks in education, finance, and management. The framework uses self-assessment and consequential reasoning to identify model misalignments, then applies dynamic rule-based steering to substantially reduce attack success rates, outperforming existing safety alignment methods.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers propose MESA, a new safety alignment framework for Mixture-of-Experts language models that addresses a critical vulnerability in which safety capabilities concentrate in only a few experts. The method uses Optimal Transport theory to strategically distribute safety responsibilities across multiple experts while maintaining model performance and computational efficiency.
AI · Bearish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers demonstrate that LLM agents' decisions can be systematically manipulated through adversarial feed curation—the ordering and composition of information sources agents consume before acting. Testing on 2,785 decision rollouts across four open-source LLMs, they found feeds can shift genuinely uncertain decisions from 5% to 100% in one direction, though they cannot override firmly held model defaults, revealing a critical safety vulnerability in the upstream ranker layer rather than the model itself.
AI · Bearish · arXiv – CS AI · 3d ago · 7/10
🧠A research study reveals that large language models are significantly more likely to be misled by peer consensus than to correct their own errors, posing critical risks for multi-agent AI systems. The findings show that authority labels and social pressure drive harmful revisions, and that reasoning interventions like chain-of-thought prompting do not improve the outcome.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce SHIELD, a novel machine learning framework that combines Interval Bound Propagation with hypernetwork architecture to achieve certifiably robust continual learning without replay buffers. The method uses task-specific embeddings and a new Interval MixUp training strategy to maintain security across sequential tasks while outperforming existing approaches on adversarial benchmarks.
AI · Neutral · arXiv – CS AI · 4d ago · 7/10
🧠Researchers demonstrate that large language models express values through two distinct but partially overlapping mechanisms: intrinsic values learned during training and prompted values elicited by explicit instructions. Using mechanistic analysis of value vectors and neurons, the study reveals that while both mechanisms share common components, they serve different functions—intrinsic values promote response diversity while prompted values enforce instruction compliance.
AI · Neutral · arXiv – CS AI · May 29 · 7/10
🧠Researchers propose a novel framework using zeroth-order optimization to enhance the robustness of safety alignment in large language models against perturbations like parameter noise and quantization. The hybrid approach combines standard first-order safety alignment with zeroth-order refinement steps, demonstrating that weak safety mechanisms can be significantly strengthened while maintaining model utility with minimal computational overhead.
AI · Bearish · arXiv – CS AI · May 29 · 7/10
🧠Researchers discover a critical failure mode in reasoning models where chain-of-thought reasoning remains factually correct but final answers flip to incorrect ones under sustained adversarial pressure in multi-turn dialogue. This 'unfaithful capitulation' represents a gap between internal reasoning validity and behavioral output that existing evaluation metrics fail to detect.
🧠 GPT-4
AI · Neutral · arXiv – CS AI · May 29 · 7/10
🧠Researchers establish a mathematical framework connecting neural network training to Hamilton-Jacobi partial differential equations, showing that gradient descent searches through solutions to viscous PDEs. This theoretical unification applies across major architectures including residual networks and transformers, with implications for understanding generalization, adversarial robustness, and interpretability.
AI · Bullish · arXiv – CS AI · May 27 · 7/10
🧠Researchers introduce PaTAS (Parallel Trust Assessment System), a framework that uses Subjective Logic to measure and propagate trust through neural networks alongside standard computation. The system identifies reliability gaps and adversarial vulnerabilities that traditional metrics like accuracy fail to detect, offering a foundation for deploying AI safely in critical applications.
AI · Bullish · arXiv – CS AI · May 27 · 7/10
🧠Researchers have developed a framework using behavioral geometry to predict which AI models are vulnerable to jailbreak attacks and efficiently transfer defensive measures across model populations. The approach achieves 94% detection accuracy while reducing evaluation probes by 98%, enabling practical security assessment across thousands of model configurations.
AI · Bearish · arXiv – CS AI · May 27 · 7/10
🧠Researchers have discovered that safety mechanisms in large language models operate within an instability region where small input variations cause unpredictable refusal behaviors rather than consistent outputs. The Furina jailbreak attack exploits this vulnerability by using fragmented prompts to amplify uncertainty, outperforming existing attacks on safety benchmarks and highlighting a fundamental weakness in current AI safety defenses.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers demonstrate that many-shot jailbreak attacks on language models work by inducing progressive activation drift through implicit fine-tuning, and propose a simple defense using a single safety demonstration at inference time that counteracts this drift without requiring parameter modifications or white-box access.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce Self-ReSET, a reinforcement learning framework that enables large reasoning models to recover from unsafe reasoning trajectories and adversarial attacks. The method addresses limitations in existing alignment approaches by using dynamic, on-policy data rather than static training sets, significantly improving model robustness against jailbreak attempts while maintaining utility.
AI × Crypto · Bullish · arXiv – CS AI · May 12 · 7/10
🤖Researchers propose Self-Anchored Consensus (SAC), a decentralized protocol enabling LLM agents to collaborate reliably over peer-to-peer networks while resisting Byzantine attacks. The method allows agents to iteratively filter unreliable messages and refine outputs without centralized coordination, addressing a critical vulnerability in distributed AI systems.
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers identify Refusal-Escape Directions (RED) as mathematical perturbation vectors that explain why aligned LLMs remain vulnerable to jailbreaks. The study reveals structural vulnerabilities arise from fundamental trade-offs between safety mechanisms and model utility, with normalization and residual connections as key exploitable components.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers propose Latent Personality Alignment (LPA), a novel defense mechanism for large language models that achieves adversarial robustness by training on abstract personality traits rather than harmful examples. The method requires fewer than 100 training examples while matching the performance of traditional approaches using 150,000+ harmful prompts, and demonstrates superior generalization to unseen attack vectors.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠VISTA is a novel decentralized machine learning algorithm designed to operate securely when adversaries control the majority of worker nodes. By implementing an incentive-based framework that rewards mutually consistent reports, the system converts adversarial nodes from pure saboteurs into rational agents, enabling convergence comparable to standard SGD without requiring an honest majority.
AI · Neutral · arXiv – CS AI · May 11 · 7/10
🧠Researchers present MORPH-U, a simulation-based autonomous driving system that integrates Vehicle-to-Everything (V2X) communication with LiDAR/radar/camera sensors while implementing Byzantine-inspired safeguards against forged or delayed messages. The framework uses multi-objective optimization to balance safety, comfort, and responsiveness in high-uncertainty environments, demonstrating resilience against coordinated false-message attacks.
AI · Bearish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce the Adversarial Empathy Benchmark (AEB) to test whether RL-trained empathetic language models remain robust against adversarial user tactics like gaslighting and emotional manipulation. While RLVER-trained models significantly outperform baselines in empathetic responsiveness, a new metric (ECS) reveals they excel at behavioral responsiveness without demonstrating genuine emotional state tracking, raising questions about the depth of empathetic AI capabilities.
AI · Neutral · arXiv – CS AI · May 9 · 7/10
🧠Researchers propose Safety Bottleneck Regularization (SBR), a defense mechanism against harmful fine-tuning attacks on large language models. The approach anchors a model's unsafe responses to safe outputs via the unembedding layer, reducing harmful capabilities while maintaining performance on legitimate tasks.
AI · Neutral · arXiv – CS AI · May 9 · 7/10
🧠Researchers propose BehaviorGuard, an online defense framework against backdoor attacks in deep reinforcement learning that detects malicious behavior by analyzing action distribution shifts rather than relying on reward anomalies or model fine-tuning. The approach works in both single and multi-agent DRL environments and demonstrates superior efficacy and efficiency compared to existing defense methods.
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce ASGuard, a mechanistically-informed framework that identifies and mitigates vulnerabilities in large language models' safety mechanisms, particularly those exploited by targeted jailbreaking attacks like tense-changing prompts. By using circuit analysis to locate vulnerable attention heads and applying channel-wise scaling vectors, ASGuard reduces attack success rates while maintaining model utility and general capabilities.
AI · Bearish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers have identified critical vulnerabilities in mobile GUI agents powered by large language models, revealing that third-party content in real-world apps causes these agents to fail significantly more often than benchmark tests suggest. Testing on 122 dynamic tasks and over 3,000 static scenarios shows misleading rates of 36-42%, raising serious concerns about deploying these agents in commercial settings.