#llm-robustness News & Analysis

12 articles tagged with #llm-robustness. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

12 articles

AINeutralarXiv – CS AI · 5d ago7/10

🧠

Mind Your Tone: Does Tone Alter LLM Performance?

Researchers investigated how prompt tone affects Large Language Model accuracy across multiple models and datasets, finding that tonal variations produce systematic yet model-dependent performance shifts. Testing ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite on 50-620 multiple-choice questions, they discovered some models show statistically significant accuracy changes while others experience large swings, with sensitivity varying by subject domain. The findings highlight that LLM reliability cannot be assumed tone-robust in production deployments.

🧠 ChatGPT🧠 Gemini

AINeutralarXiv – CS AI · 5d ago7/10

🧠

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

Researchers introduce DistractionIF, a benchmark revealing that larger language models are paradoxically less robust to instruction-like noise in reference text, with performance degrading up to 30 points as scale increases. The study demonstrates that reinforcement learning via Group Relative Policy Optimization can restore robustness by 15.5% while maintaining instruction-following capability.

🏢 Perplexity

AIBearisharXiv – CS AI · May 97/10

🧠

Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?

Researchers found that large language models frequently arrive at correct code predictions through flawed reasoning, with performance dropping up to 70% when code undergoes semantics-preserving mutations. The study reveals substantial gaps between apparent accuracy and genuine semantic understanding, questioning the reliability of LLMs for critical programming tasks.

AIBearisharXiv – CS AI · Apr 157/10

🧠

One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness

Researchers demonstrate that instruction-tuned large language models suffer severe performance degradation when subject to simple lexical constraints like banning a single punctuation mark or common word, losing 14-48% of response quality. This fragility stems from a planning failure where models couple task competence to narrow surface-form templates, affecting both open-weight and commercially deployed closed-weight models like GPT-4o-mini.

🧠 GPT-4

AIBullisharXiv – CS AI · Apr 137/10

🧠

Distributionally Robust Token Optimization in RLHF

Researchers propose Distributionally Robust Token Optimization (DRTO), a method combining reinforcement learning from human feedback with robust optimization to improve large language model consistency across distribution shifts. The approach demonstrates 9.17% improvement on GSM8K and 2.49% on MathQA benchmarks, addressing LLM vulnerabilities to minor input variations.

AIBearisharXiv – CS AI · Mar 56/10

🧠

ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering

Researchers introduce ObfusQAte, a new framework to test Large Language Model robustness when faced with obfuscated or disguised factual questions. The study reveals that LLMs tend to fail or generate hallucinated responses when confronted with increasingly complex variations of questions across three dimensions of obfuscation.

AINeutralarXiv – CS AI · 1d ago6/10

🧠

Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment

Researchers introduce AdvCL, a novel framework that repurposes adversarial perturbations to improve continual learning in large language models by addressing forgetting, limited transfer, and adversarial vulnerability. The approach combines three modules—Intra-Smooth, Proto-Clip, and Inter-Align—to provide geometric control signals that stabilize model adaptation across sequential tasks while maintaining robustness.

AIBearisharXiv – CS AI · 2d ago6/10

🧠

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

Researchers demonstrate that toxic language in prompts significantly degrades the factual accuracy of large language models, even when semantic content remains identical. By analyzing internal model activations, they identify that toxicity amplifies perturbation-sensitive nodes while leaving core reasoning pathways relatively stable, revealing a critical vulnerability in LLM reliability.

AIBullisharXiv – CS AI · 5d ago6/10

🧠

Harnessing non-adversarial robustness in large language models

Researchers propose a debiasing fine-tuning method to improve Large Language Model robustness against semantically-neutral prompt variations without expensive full retraining. The approach identifies perturbation-induced bias in neural network outputs and demonstrates theoretical and experimental evidence that targeted debiasing can enhance model resilience to prompt alterations.

AINeutralarXiv – CS AI · May 276/10

🧠

Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

Researchers introduce Context-Driven Decomposition (CDD), a diagnostic tool that reveals how retrieval-augmented generation (RAG) systems blindly follow retrieved context even when it contradicts their underlying knowledge. Testing across multiple AI models shows CDD can improve accuracy to 64% on adversarial scenarios, though improvements don't consistently transfer across different model families, suggesting RAG systems resolve conflicts through fundamentally different mechanisms.

🧠 Claude🧠 Gemini

AIBearisharXiv – CS AI · May 116/10

🧠

The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval

Researchers discovered that Large Language Models exhibit a U-shaped performance degradation curve when processing text with word-boundary corruption, termed the 'Text Uncanny Valley.' This reveals a critical vulnerability in LLM robustness: performance worsens at moderate corruption levels before improving again at extreme corruption, suggesting models struggle during transitions between word-level and character-level processing modes.

🧠 Gemini

AIBearisharXiv – CS AI · Apr 106/10

🧠

MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors

Researchers introduce MedDialBench, a comprehensive benchmark testing how large language models maintain diagnostic accuracy when patients exhibit adversarial behaviors across five dimensions. The study reveals that fabricating symptoms causes 1.7-3.4x larger accuracy drops than withholding information, with worst-case performance degradation ranging from 38.8 to 54.1 percentage points across tested models.