AI · Neutral · arXiv – CS AI · 5d ago · 7/10
🧠Researchers investigated how prompt tone affects Large Language Model accuracy across multiple models and datasets, finding that tonal variations produce systematic yet model-dependent performance shifts. Testing ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite on 50-620 multiple-choice questions, they discovered some models show statistically significant accuracy changes while others experience large swings, with sensitivity varying by subject domain. The findings highlight that LLM reliability cannot be assumed tone-robust in production deployments.
🧠 ChatGPT · 🧠 Gemini
AI · Neutral · arXiv – CS AI · 5d ago · 7/10
🧠Researchers introduce DistractionIF, a benchmark revealing that larger language models are paradoxically less robust to instruction-like noise in reference text, with performance degrading by up to 30 points as scale increases. The study demonstrates that reinforcement learning via Group Relative Policy Optimization can restore robustness by 15.5% while maintaining instruction-following capability.
🏢 Perplexity
AI · Bearish · arXiv – CS AI · May 9 · 7/10
🧠Researchers found that large language models frequently arrive at correct code predictions through flawed reasoning, with performance dropping by up to 70% when code undergoes semantics-preserving mutations. The study reveals substantial gaps between apparent accuracy and genuine semantic understanding, questioning the reliability of LLMs for critical programming tasks.
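For readers unfamiliar with the term, a semantics-preserving mutation changes how code looks without changing what it computes. The sketch below illustrates one such mutation, consistent variable renaming, using Python's standard `ast` module. It is an illustration only: the summary does not say which mutations the study applied, and the `RenameVariables`, `mutate`, and `run` helpers and the sample `total` function are hypothetical names introduced here.

```python
import ast


class RenameVariables(ast.NodeTransformer):
    """Rename identifiers using a fixed mapping (a hypothetical mutation)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes inside the function body.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def mutate(source, mapping):
    """Return source code with identifiers renamed; behavior is unchanged."""
    tree = RenameVariables(mapping).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


ORIGINAL = """
def total(values):
    acc = 0
    for v in values:
        acc += v
    return acc
"""

# Assumed mapping: meaningful names become opaque ones.
MUTATED = mutate(ORIGINAL, {"values": "x1", "acc": "x2", "v": "x3"})


def run(source, data):
    """Execute a snippet that defines `total` and call it on `data`."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](data)
```

Both versions return the same result for any input, so a model that predicts the output of `ORIGINAL` correctly but fails on `MUTATED` is plausibly relying on surface cues such as variable names rather than on program semantics.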
AI · Bearish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers demonstrate that instruction-tuned large language models suffer severe performance degradation when subjected to simple lexical constraints like banning a single punctuation mark or common word, losing 14-48% of response quality. This fragility stems from a planning failure where models couple task competence to narrow surface-form templates, affecting both open-weight and commercially deployed closed-weight models like GPT-4o-mini.
🧠 GPT-4
AI · Bullish · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers propose Distributionally Robust Token Optimization (DRTO), a method combining reinforcement learning from human feedback with robust optimization to improve large language model consistency across distribution shifts. The approach demonstrates a 9.17% improvement on GSM8K and a 2.49% improvement on MathQA, addressing LLM vulnerabilities to minor input variations.
AI · Bearish · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers introduce ObfusQAte, a new framework to test Large Language Model robustness when faced with obfuscated or disguised factual questions. The study reveals that LLMs tend to fail or generate hallucinated responses when confronted with increasingly complex variations of questions across three dimensions of obfuscation.
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠Researchers introduce AdvCL, a novel framework that repurposes adversarial perturbations to improve continual learning in large language models by addressing forgetting, limited transfer, and adversarial vulnerability. The approach combines three modules—Intra-Smooth, Proto-Clip, and Inter-Align—to provide geometric control signals that stabilize model adaptation across sequential tasks while maintaining robustness.
AI · Bearish · arXiv – CS AI · 2d ago · 6/10
🧠Researchers demonstrate that toxic language in prompts significantly degrades the factual accuracy of large language models, even when semantic content remains identical. By analyzing internal model activations, they identify that toxicity amplifies perturbation-sensitive nodes while leaving core reasoning pathways relatively stable, revealing a critical vulnerability in LLM reliability.
AI · Bullish · arXiv – CS AI · 5d ago · 6/10
🧠Researchers propose a debiasing fine-tuning method to improve Large Language Model robustness against semantically-neutral prompt variations without expensive full retraining. The approach identifies perturbation-induced bias in neural network outputs and demonstrates theoretical and experimental evidence that targeted debiasing can enhance model resilience to prompt alterations.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers introduce Context-Driven Decomposition (CDD), a diagnostic tool that reveals how retrieval-augmented generation (RAG) systems blindly follow retrieved context even when it contradicts their underlying knowledge. Testing across multiple AI models shows CDD can improve accuracy to 64% on adversarial scenarios, though improvements don't consistently transfer across different model families, suggesting RAG systems resolve conflicts through fundamentally different mechanisms.
🧠 Claude · 🧠 Gemini
AI · Bearish · arXiv – CS AI · May 11 · 6/10
🧠Researchers discovered that Large Language Models exhibit a U-shaped performance degradation curve when processing text with word-boundary corruption, termed the 'Text Uncanny Valley.' This reveals a critical vulnerability in LLM robustness: performance worsens at moderate corruption levels before improving again at extreme corruption, suggesting models struggle during transitions between word-level and character-level processing modes.
🧠 Gemini
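To make "word-boundary corruption" concrete, here is a minimal sketch that removes each space in a text with a given probability. This is an assumed corruption scheme chosen for illustration; the summary does not specify how the study corrupted boundaries, and `corrupt_word_boundaries`, `rate`, and `seed` are hypothetical names.

```python
import random


def corrupt_word_boundaries(text, rate, seed=0):
    """Drop each space with probability `rate` (assumed corruption scheme).

    rate=0.0 leaves the text untouched; rate=1.0 removes every space,
    leaving a single unbroken character stream.
    """
    rng = random.Random(seed)  # seeded so a given corruption is reproducible
    return "".join(
        ch for ch in text if not (ch == " " and rng.random() < rate)
    )


SAMPLE = "models struggle with partly merged words"

# Sweep from clean text to fully merged text.
LEVELS = {rate: corrupt_word_boundaries(SAMPLE, rate) for rate in (0.0, 0.5, 1.0)}
```

Sweeping `rate` from 0 to 1 and scoring a model at each level is one way to look for the U-shaped curve the summary describes, with the moderate levels being the ones of interest.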
AI · Bearish · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce MedDialBench, a comprehensive benchmark testing how large language models maintain diagnostic accuracy when patients exhibit adversarial behaviors across five dimensions. The study reveals that fabricating symptoms causes 1.7-3.4x larger accuracy drops than withholding information, with worst-case performance degradation ranging from 38.8 to 54.1 percentage points across tested models.