AI | Bearish | arXiv - CS AI · 7h ago | 7/10
Brittlebench: Quantifying LLM robustness via prompt sensitivity
Researchers introduce Brittlebench, a new evaluation framework showing that frontier AI models lose up to 12% in performance when faced with minor prompt variations such as typos or rephrasing. The study finds that semantics-preserving input perturbations can account for up to half of a model's performance variance, pointing to significant robustness problems in current language models.
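To make the two headline figures concrete, here is a minimal sketch of how a "performance degradation" number and a "share of variance due to perturbations" number could be computed. It is not Brittlebench's actual method, which this summary does not describe. The function names, the variant labels, and the scores are all hypothetical, and the variance share uses a plain between-group versus total decomposition chosen only for illustration.

```python
from statistics import fmean


def max_relative_degradation(scores_by_variant, baseline="original"):
    """Largest relative drop in mean score versus the unperturbed prompt.

    scores_by_variant maps a variant label (hypothetical names) to a list
    of scores from repeated runs with that prompt variant.
    """
    base = fmean(scores_by_variant[baseline])
    if base == 0:
        return 0.0  # a relative drop is undefined for a zero baseline
    worst = min(fmean(runs) for runs in scores_by_variant.values())
    return (base - worst) / base


def perturbation_variance_share(scores_by_variant):
    """Fraction of total score variance explained by the prompt variant.

    Assumed decomposition: total variance = between-variant variance
    + within-variant variance (law of total variance).
    """
    all_scores = [s for runs in scores_by_variant.values() for s in runs]
    n = len(all_scores)
    grand_mean = fmean(all_scores)
    total = sum((s - grand_mean) ** 2 for s in all_scores) / n
    if total == 0:
        return 0.0  # every run scored the same, nothing to explain
    between = sum(
        len(runs) * (fmean(runs) - grand_mean) ** 2
        for runs in scores_by_variant.values()
    ) / n
    return between / total


# Made-up accuracy scores: three runs per prompt variant.
example_scores = {
    "original": [0.80, 0.82, 0.78],
    "typos": [0.72, 0.70, 0.74],
    "rephrased": [0.76, 0.78, 0.74],
}

degradation = max_relative_degradation(example_scores)
share = perturbation_variance_share(example_scores)
print(f"max relative degradation: {degradation:.1%}")
print(f"variance share from perturbations: {share:.1%}")
```

In this toy data the remaining variance comes from run-to-run noise within each variant. A real evaluation would also need to separate other sources of variance, such as sampling temperature or task choice, before attributing a share to prompt wording.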