AIBearisharXiv – CS AI · 3h ago7/10
🧠
Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG
Researchers identify a critical failure mode in Retrieval-Augmented Generation (RAG) evaluation called 'citation laundering,' where topically relevant sources are presented as evidence for claims they don't actually support. The team introduces FORCEBENCH, a diagnostic benchmark that tests whether AI evaluators can distinguish between evidence-calibrated claims and over-generalized ones, revealing that current evaluation methods fail to detect warrant mismatches in 24-47% of cases.