Evidence Graph Consistency in Retrieval-Augmented Generation: A Model-Dependent Analysis of Hallucination Detection
Researchers propose Evidence Graph Consistency (EGC), a framework to detect hallucinations in Retrieval-Augmented Generation systems by analyzing structural relationships among evidence pieces. Testing across six LLMs reveals a critical finding: the method works as expected for Llama-2 but shows reversed diagnostic signals for GPT-4, GPT-3.5, and Mistral-7B, suggesting hallucination patterns differ fundamentally across model families.
The study addresses a fundamental problem in AI systems: hallucinations persist even when language models have access to retrieved evidence. Traditional approaches measure similarity between generated answers and source passages, treating evidence as isolated data points rather than interconnected claims. EGC instead constructs local evidence graphs that capture structural relationships among evidence pieces, then computes five consistency metrics over each graph to serve as hallucination indicators.
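To make the idea of a local evidence graph concrete, the following is a minimal sketch, not the authors' implementation. It assumes the answer and each retrieved passage are already represented as embedding vectors, links two nodes when their cosine similarity clears a threshold, and derives a few graph-level features. The function names, the threshold value, and the four features shown are illustrative assumptions; the summary does not name the five metrics EGC actually uses.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_evidence_graph(embeddings, threshold=0.5):
    """Return weighted edges {(i, j): similarity} for node pairs at or above the threshold.

    The threshold of 0.5 is an arbitrary illustrative choice.
    """
    edges = {}
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        if sim >= threshold:
            edges[(i, j)] = sim
    return edges


def consistency_features(embeddings, answer_index=0, threshold=0.5):
    """Compute hypothetical consistency features (not the paper's five metrics)."""
    n = len(embeddings)
    edges = build_evidence_graph(embeddings, threshold)
    possible = n * (n - 1) / 2
    # Edges that touch the generated answer node.
    answer_edges = [w for (i, j), w in edges.items() if answer_index in (i, j)]
    return {
        "edge_density": len(edges) / possible if possible else 0.0,
        "mean_edge_weight": sum(edges.values()) / len(edges) if edges else 0.0,
        "answer_degree": len(answer_edges),
        "answer_mean_support": (
            sum(answer_edges) / len(answer_edges) if answer_edges else 0.0
        ),
    }


# Toy example: node 0 is the generated answer, nodes 1-3 are retrieved passages.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = consistency_features(toy_embeddings)
print(features)
```

In this toy graph the answer node connects to two of the three passages. A real system would obtain the vectors from an embedding model and feed the resulting features to a classifier.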
This research emerges from growing recognition that RAG systems, while improving factuality, remain imperfect. The evaluation on RAGTruth's question-answering dataset, which covers 5,767 responses across six models, provides substantial empirical evidence. However, the most significant finding is troubling: graph consistency features that correctly identify hallucinations in Llama-2 systematically reverse their diagnostic value for GPT-4, GPT-3.5, and Mistral-7B. In other words, a feature value that signals a hallucination for one model signals a faithful answer for another. This suggests these model families encode and generate hallucinations through fundamentally different mechanisms.
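One way to check whether a feature's diagnostic direction flips between models is to compute its AUROC separately per model: a value above 0.5 means higher feature values go with hallucinations, while a value below 0.5 means the signal is reversed. The sketch below is a hedged illustration of that check, not the study's evaluation code. The record layout, the function names, and the placeholder model names `model_a` and `model_b` are assumptions, and the scores are invented toy values rather than results from the paper.

```python
from collections import defaultdict


def auroc(scores, labels):
    """Chance that a random hallucinated response (label 1) outscores a random
    faithful one (label 0); ties count as half. Returns None if a class is missing."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def direction_by_model(records):
    """Group (model, feature_score, label) records and report each model's AUROC
    together with the direction in which the feature points."""
    grouped = defaultdict(lambda: ([], []))
    for model, score, label in records:
        grouped[model][0].append(score)
        grouped[model][1].append(label)

    result = {}
    for model, (scores, labels) in grouped.items():
        auc = auroc(scores, labels)
        if auc is None or auc == 0.5:
            direction = "undetermined"
        elif auc > 0.5:
            direction = "higher_means_hallucination"
        else:
            direction = "lower_means_hallucination"
        result[model] = (auc, direction)
    return result


# Invented toy data: the same feature points in opposite directions for two models.
toy_records = [
    ("model_a", 0.9, 1), ("model_a", 0.8, 1), ("model_a", 0.2, 0), ("model_a", 0.3, 0),
    ("model_b", 0.1, 1), ("model_b", 0.2, 1), ("model_b", 0.7, 0), ("model_b", 0.9, 0),
]
directions = direction_by_model(toy_records)
print(directions)
```

Running a check like this on held-out labeled data for each model before deployment would reveal a reversed signal early, rather than assuming a threshold tuned on one model transfers to another.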
For practitioners developing AI applications that rely on RAG, this reveals a critical constraint: hallucination detection methods validated on one model family cannot be assumed to transfer to another. Organizations that deploy a single detection framework across their LLM infrastructure risk both false positives and undetected hallucinations. The findings indicate that embedding-based consistency measures are unreliable as a model-agnostic hallucination signal, so detection requires either model-specific calibration or an entirely different detection architecture. This complexity increases development costs and maintenance burden for production AI systems. Future work must either develop model-family-specific detection strategies or discover deeper, truly universal indicators of hallucination that transcend architectural differences.
- EGC framework detects hallucinations by analyzing structural relationships in evidence graphs rather than flat similarity metrics
- Graph consistency features work correctly for Llama-2 but show reversed diagnostic signals in GPT-4, GPT-3.5, and Mistral-7B
- Hallucination patterns differ fundamentally across model families, making universal detection methods unreliable
- Embedding-based graph consistency cannot serve as a model-independent hallucination detection signal
- Production AI systems require model-specific calibration for effective hallucination detection