When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
Researchers propose a framework for comparing language models on safety without labeled benchmark data, introducing SimpleAudit, a validation tool that uses controlled contrasts and variance analysis to support comparative safety scoring. The study demonstrates that comparative safety scores are inherently context-dependent, requiring detailed reporting of methods rather than a single collapsed ranking.
This research addresses a critical gap in AI deployment workflows: evaluating model safety when no established benchmarks exist for a specific language, industry, or regulatory context. The authors formalize 'benchmarkless comparative safety scoring' and establish validation criteria that replace traditional ground-truth labels with an instrumental-validity chain measuring responsiveness to safe-versus-ablated contrasts, variance dominance, and stability across reruns. The approach recognizes that safety evaluation is not a static property but a relational measurement dependent on scenario selection, evaluation rubrics, and auditor choices.
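To make the validity chain concrete, here is a minimal sketch of how each link might be computed. This is an illustrative reconstruction, not the authors' implementation: the function names, the choice of AUROC as the responsiveness statistic, the between/within variance ratio, and Spearman correlation for rerun stability are all assumptions layered on the paper's description.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

def responsiveness_auroc(safe_scores, ablated_scores):
    """How well judge scores separate a safety-intact system from a
    safety-ablated one (1.0 = perfect separation, 0.5 = chance)."""
    y_true = np.concatenate([np.ones(len(safe_scores)),
                             np.zeros(len(ablated_scores))])
    y_score = np.concatenate([safe_scores, ablated_scores])
    return roc_auc_score(y_true, y_score)

def variance_dominance(rerun_scores_per_model):
    """Ratio of between-model variance to mean within-model (rerun)
    variance; values well above 1 suggest model identity, not judge
    noise, drives the observed score differences."""
    means = np.array([np.mean(runs) for runs in rerun_scores_per_model])
    within = np.mean([np.var(runs, ddof=1) for runs in rerun_scores_per_model])
    return np.var(means, ddof=1) / within

def rerun_stability(run_a, run_b):
    """Rank correlation of per-scenario scores across two independent
    reruns of the same audit configuration."""
    rho, _ = spearmanr(run_a, run_b)
    return rho
```

Under this framing, an audit configuration counts as valid when all three statistics clear pre-registered thresholds, standing in for the agreement with ground truth that a labeled benchmark would normally supply.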
The framework emerged from practical deployment pressures: organizations need safety comparisons before comprehensive labeled datasets become available. This reflects a broader challenge in AI governance, since regulators and enterprises operating in new domains cannot wait for perfect benchmarks before making deployment decisions. SimpleAudit's validation results, with AUROC values between 0.89 and 1.00 on Norwegian-language data, demonstrate technical feasibility, though the finding that 'the safer model depends on scenario category and risk measure' exposes a fundamental limit of any single comparative score.
For practitioners and deployers, this work establishes methodological rigor for safety assessments while cautioning against oversimplification. The Norwegian public-sector case comparing Borealis and Gemma 3 illustrates how safety conclusions shift with measurement choices. The research bears directly on procurement decisions, regulatory compliance, and risk management in AI deployment. Organizations running model-selection processes must account for the transparency requirements outlined here: scores reported together with methodology, uncertainty bounds, auditor identity, and judge identity, rather than as simplified rankings.
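One minimal way to operationalize those reporting requirements is to make a score inseparable from its provenance. The record below is a hypothetical schema, not the paper's format; every field name is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SafetyScoreReport:
    """Everything a comparative safety score should travel with,
    per the transparency requirements above (illustrative only)."""
    model_id: str        # model under audit
    score: float         # comparative safety score
    ci_low: float        # lower uncertainty bound (e.g., bootstrap CI)
    ci_high: float       # upper uncertainty bound
    scenario_set: str    # scenario categories covered by the audit
    risk_measure: str    # e.g., mean severity vs. worst-case rate
    rubric_version: str  # evaluation rubric applied by the judge
    judge_model: str     # identity of the judge model
    auditor: str         # who ran the audit
    n_reruns: int        # reruns backing the stability check
```

Comparing two models then means publishing two such records side by side, keeping the ranking judgment, and its dependence on scenario_set and risk_measure, visible to the reader.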
- Safety comparisons without labeled benchmarks require transparent reporting of methodology, auditor choice, judge identity, and uncertainty rather than collapsed rankings.
- SimpleAudit demonstrates that controlled safe-versus-ablated contrasts can separate models' safety performance with high statistical reliability.
- Model safety claims are context-dependent, varying significantly by scenario category and risk measure rather than representing absolute model properties (illustrated in the sketch after this list).
- Instrumental-validity chains combining responsiveness testing, variance analysis, and rerun stability can replace ground-truth labels as deployment evidence.
- Procurement and regulatory decisions using comparative safety scores must account for substantial methodological variation across audit configurations.
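The sketch below illustrates the context dependence flagged above: aggregating judge scores per scenario category rather than collapsing them into one number. Model names and scores are synthetic and imply nothing about the actual Borealis versus Gemma 3 results.

```python
import statistics
from collections import defaultdict

def scores_by_category(records):
    """Aggregate judge scores per (scenario category, model) instead of
    collapsing to a single ranking; records are (model, category, score)."""
    grouped = defaultdict(lambda: defaultdict(list))
    for model, category, score in records:
        grouped[category][model].append(score)
    return {cat: {m: statistics.mean(v) for m, v in models.items()}
            for cat, models in grouped.items()}

# Synthetic scores: which model looks "safer" flips by category.
records = [
    ("model_A", "privacy", 0.92), ("model_B", "privacy", 0.85),
    ("model_A", "self-harm", 0.78), ("model_B", "self-harm", 0.88),
]
print(scores_by_category(records))
# {'privacy': {'model_A': 0.92, 'model_B': 0.85},
#  'self-harm': {'model_A': 0.78, 'model_B': 0.88}}
```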