QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples
Researchers introduce QO-Bench, a diagnostic benchmark for evaluating retrieval-augmented generation (RAG) systems on structured database-style queries over text. The benchmark reveals that current RAG systems excel at finding relevant passages but fail to preserve typed values needed for query operators like joins and counting, identifying operator execution rather than retrieval as the core bottleneck.
QO-Bench addresses a critical gap in how retrieval-augmented generation systems are evaluated and deployed. While existing RAG benchmarks prioritize semantic relevance scoring, they overlook a fundamental challenge: extracting structured information from unstructured text requires preserving specific data types and values that operators depend on. The benchmark's design—using deterministic gold answers from typed event tuples rather than LLM judges—enables precise diagnosis of where systems fail at the operator level.
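To make the idea of deterministic gold answers concrete, here is a minimal sketch in Python. It assumes a hypothetical event schema (`actor`, `action`, `amount`, `day`) and hypothetical helpers (`gold_count`, `gold_join_actors`, `exact_match`); none of these names or values come from QO-Bench itself. The point is only that when answers are computed from typed tuples, scoring reduces to an exact comparison and needs no LLM judge.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """Hypothetical typed event tuple; the field names are illustrative only."""
    actor: str
    action: str
    amount: int
    day: date


# Toy data invented for this sketch, not taken from the benchmark.
EVENTS = [
    Event("acme", "acquire", 120, date(2021, 3, 1)),
    Event("acme", "divest", 40, date(2022, 7, 15)),
    Event("bolt", "acquire", 75, date(2021, 11, 30)),
    Event("bolt", "acquire", 300, date(2023, 1, 9)),
]


def gold_count(events, action, min_amount):
    """Count events with the given action whose amount is at least min_amount."""
    return sum(1 for e in events if e.action == action and e.amount >= min_amount)


def gold_join_actors(events, first_action, second_action):
    """Return, in sorted order, the actors that appear with both actions."""
    first = {e.actor for e in events if e.action == first_action}
    second = {e.actor for e in events if e.action == second_action}
    return sorted(first & second)


def exact_match(predicted, gold):
    """Deterministic scoring: the prediction must equal the gold value exactly."""
    return predicted == gold
```

In this sketch a system that answers with the string `"2"` instead of the integer `2` would fail `exact_match`, which is a small-scale version of the typed-value loss the benchmark is designed to expose.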
The research reveals a significant architectural problem in current RAG paradigms. Systems optimized for semantic similarity successfully retrieve contextually relevant passages but strip away the precise typed values that database-style queries require. This creates a mismatch between what retrieval systems optimize for and what downstream tasks actually need. The two-axis framework—distinguishing index-time preservation versus query-time execution—provides clarity on why different paradigms succeed or fail across operator types.
The findings have implications for developers building knowledge systems over structured domains like finance, legal compliance, and scientific research. Similarity-based retrieval performs best on simple filters and projections, while extraction-to-SQL approaches perform best on intersection and aggregation queries. However, the long-context oracle results remain far from saturation, which indicates that stronger language models alone will not solve the problem; the system architecture itself must be redesigned to preserve operator-relevant information throughout the pipeline.
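The extraction-to-SQL side of that comparison can be sketched with Python's standard `sqlite3` module. This is an illustration under stated assumptions, not the pipeline evaluated in the paper: the `events` table, its columns, the sample rows, and the helper names are all hypothetical, and the extraction step that would turn text into rows is assumed to have already run.

```python
import sqlite3

# Rows as an extraction step might emit them; the values are invented for this sketch.
ROWS = [
    ("acme", "acquire", 120, "2021-03-01"),
    ("acme", "divest", 40, "2022-07-15"),
    ("bolt", "acquire", 75, "2021-11-30"),
    ("bolt", "acquire", 300, "2023-01-09"),
]


def build_index(rows):
    """Index time: store typed columns so that values survive for later operators."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (actor TEXT, action TEXT, amount INTEGER, day TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    return conn


def total_by_actor(conn, action):
    """Query time, aggregation: sum amounts per actor for one action."""
    cur = conn.execute(
        "SELECT actor, SUM(amount) FROM events "
        "WHERE action = ? GROUP BY actor ORDER BY actor",
        (action,),
    )
    return cur.fetchall()


def actors_with_both(conn, first_action, second_action):
    """Query time, intersection: actors that appear with both actions."""
    cur = conn.execute(
        "SELECT actor FROM events WHERE action = ? "
        "INTERSECT "
        "SELECT actor FROM events WHERE action = ? "
        "ORDER BY actor",
        (first_action, second_action),
    )
    return [row[0] for row in cur.fetchall()]
```

Because `amount` is stored as an integer column, the aggregation is executed by the database engine rather than inferred by a language model from retrieved passages. The sketch is meant to separate what is preserved at index time from what is executed at query time.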
Future RAG systems will likely need hybrid approaches that maintain type information through retrieval and integrate query-operator semantics into both indexing and execution phases. This work redirects focus from retrieval quality metrics toward end-to-end query correctness.
- RAG systems retrieve relevant passages but discard typed values required for database-style query operators
- Operator execution is a core bottleneck that stronger language models cannot eliminate alone
- Similarity retrieval excels at filters and projections while extraction-to-SQL performs better on joins and aggregations
- Current evaluation metrics based on passage relevance fail to measure query-operator correctness
- System architecture must preserve type information from indexing through execution phases