Beyond Questions: Evaluating What Large Language Models (Actually) Know
Researchers introduce BeQu, a benchmark that evaluates LLM knowledge through open-ended prompts rather than predefined questions, addressing the availability bias of existing benchmarks. Moving from narrow question answering to characterizing the knowledge models express on their own gives a fuller picture of parametric knowledge across 10,000 entities and multiple language models.
This research addresses a fundamental measurement problem in AI development: existing knowledge benchmarks suffer from availability bias because they test only the knowledge that researchers explicitly choose to query. BeQu introduces open knowledge evaluation, which measures what LLMs actually know rather than what benchmark designers decide to test. Rather than asking "What is the birth date of M.L. King?" and checking for that specific answer, the methodology prompts models with "Tell me everything you know about M.L. King," capturing the broader range of knowledge that models surface unprompted.
The benchmark's design, which pairs 10,000 entities with reference corpora for statement verification, enables systematic evaluation across reasoning effort, model scale, prompt formats, and knowledge domains. This moves beyond single-metric evaluations toward a comprehensive characterization of how models express knowledge. The research arrives at a critical juncture in AI development, where scaling has plateaued for some applications and understanding what models genuinely know becomes essential for identifying capability gaps and biases in how knowledge is distributed.
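The verification step described above can be pictured with a small sketch. This is an illustration under stated assumptions, not BeQu's actual pipeline: it assumes the model's open-ended response has already been split into individual statements, and it substitutes a naive normalized exact match for whatever verification against reference corpora the benchmark really performs. The names `normalize` and `score_response` and the toy data are hypothetical.

```python
import re


def normalize(statement: str) -> str:
    """Lowercase and drop punctuation so trivially different strings compare equal."""
    return re.sub(r"[^a-z0-9 ]+", "", statement.lower()).strip()


def score_response(statements: list[str], reference: list[str]) -> dict:
    """Count how many distinct statements are supported by the reference set.

    Assumption: exact match after normalization stands in for real
    statement verification, which would need retrieval or entailment.
    """
    reference_keys = {normalize(s) for s in reference}
    seen = set()
    supported = 0
    for statement in statements:
        key = normalize(statement)
        if key in seen:
            continue  # ignore repeated statements
        seen.add(key)
        if key in reference_keys:
            supported += 1
    stated = len(seen)
    precision = supported / stated if stated else 0.0
    return {"stated": stated, "supported": supported, "precision": precision}


# Toy example: statements a model might surface for an open-ended prompt.
model_statements = [
    "Born in Atlanta.",
    "born in atlanta",
    "Won the Nobel Peace Prize in 1964",
    "Was an astronaut",
]
reference_statements = [
    "Born in Atlanta",
    "Won the Nobel Peace Prize in 1964.",
]
print(score_response(model_statements, reference_statements))
```

Reporting the count of supported statements alongside precision reflects the open-ended framing: a model can be accurate yet surface little, or surface a lot with many unsupported claims, and a single number would hide that difference.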
For the AI research community, BeQu provides an evaluation framework for longstanding questions about how transformer-based models retain and recall knowledge. Because the methodology transfers across model architectures and scales, it could establish new evaluation standards. Developers building knowledge-intensive applications, from question-answering systems to retrieval-augmented generation, gain diagnostic tools for identifying knowledge gaps. The publicly available leaderboard creates competitive pressure for models to demonstrate genuine knowledge rather than benchmark-specific optimization, potentially accelerating progress toward more reliable and transparent AI systems.
- Open knowledge evaluation shifts LLM assessment from predefined questions to open-ended prompts, revealing naturally expressed knowledge rather than benchmark-designer bias.
- BeQu benchmark covers 10,000 entities with reference corpora for verification, enabling systematic analysis across model scales, reasoning effort, and knowledge domains.
- The methodology addresses availability bias in existing benchmarks by characterizing what models choose to surface rather than only what researchers ask.
- Results provide diagnostic insights into knowledge distribution, retention patterns, and domain-specific gaps across language models of varying scales.
- Publicly available leaderboard and data encourage research reproducibility and competitive development of more knowledge-aware AI systems.