DEMM-Bench: A Cross-Regime Benchmark for Agent-Runtime Governance-Evidence Sufficiency
DEMM-Bench introduces a benchmark framework for evaluating whether evidence records in agent-runtime systems sufficiently answer governance questions about specific decisions. Using the Decision Evidence Maturity Model, researchers tested 64 cases across eight evidence regimes and found that existing baselines overclaim sufficiency in 50-75% of cases, while a property-level scorer achieved 56.25% accuracy with zero overclaims.
DEMM-Bench addresses a gap in AI system accountability by defining measurable standards for decision-evidence sufficiency in agent-runtime environments. Agent systems continuously generate several kinds of evidence (traces, ledgers, provenance graphs, and policy logs), yet practitioners lack a systematic way to verify that these records actually answer governance questions about a given decision rather than merely existing in abundance. The benchmark replaces what has largely been an informal assessment with an explicit one: it uses the Decision Evidence Maturity Model to evaluate whether evidence from eight distinct regimes can reconstruct decision-level properties.
The findings show substantial overclaiming by presence-based baselines: the trace-present and schema-present baselines claim sufficiency that the evidence does not support in three-quarters of the test cases, and the ledger-present baseline does so in half. The redacted property-level candidate scorer, by contrast, produces zero overclaims while reaching 56.25% accuracy (36 of the 64 cases), which illustrates the value of targeted, property-level evaluation.
This matters because AI governance and regulatory compliance increasingly depend on audit trails and decision traceability. Organizations deploying autonomous agents face mounting pressure to demonstrate transparent, auditable decision-making, particularly in financial services and healthcare. The benchmark provides a reproducible evaluation framework with publicly deposited datasets and adapters, so heterogeneous systems can be assessed in a standardized way. As regulators impose stricter AI transparency requirements, tools that quantify evidence sufficiency become increasingly important. The work sets baseline expectations for decision accountability and identifies where existing record-keeping falls short, which can guide improvements in agent-runtime instrumentation and evidence collection across deployment contexts.
- DEMM-Bench benchmarks whether agent-runtime evidence records sufficiently reconstruct governance decisions across eight evidence regimes
- Industry baseline methods overclaim sufficiency in 50-75% of cases, revealing gaps between claimed and actual decision traceability
- Property-level evaluation achieved zero overclaims at 56.25% accuracy, demonstrating the value of targeted evidence assessment
- Reproducible benchmark datasets and adapters enable standardized evaluation of decision-evidence maturity across heterogeneous systems
- Framework supports regulatory compliance and AI governance by establishing measurable standards for audit trail sufficiency
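The overclaim and accuracy figures above can be made concrete with a small sketch. It is not the benchmark's own scoring code. It assumes that each case carries an ordinal sufficiency level, that an overclaim is a predicted level above the ground-truth level, and that accuracy means an exact match. The names `ScoreSummary` and `summarize`, the 0-3 level scale, and the eight toy cases are all hypothetical and are not drawn from the DEMM-Bench dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreSummary:
    accuracy: float        # share of cases where predicted level == true level
    overclaim_rate: float  # share of cases where predicted level > true level
    underclaim_rate: float # share of cases where predicted level < true level


def summarize(predicted: list[int], actual: list[int]) -> ScoreSummary:
    """Compare predicted sufficiency levels against ground-truth levels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    n = len(actual)
    pairs = list(zip(predicted, actual))
    exact = sum(p == a for p, a in pairs)
    over = sum(p > a for p, a in pairs)
    under = sum(p < a for p, a in pairs)
    return ScoreSummary(exact / n, over / n, under / n)


# Toy ground truth on an assumed 0-3 scale (illustrative only).
actual = [0, 1, 2, 3, 1, 0, 3, 3]

# A presence-style baseline: a record exists, so it claims the top level every time.
presence_baseline = [3] * len(actual)

# A conservative scorer: never claims more than the evidence supports,
# at the cost of sometimes claiming less.
conservative = [0, 1, 1, 3, 0, 0, 3, 2]

baseline_summary = summarize(presence_baseline, actual)
conservative_summary = summarize(conservative, actual)

print(baseline_summary)
print(conservative_summary)
```

In this toy data the presence-style baseline overclaims whenever the true level is below the maximum. The conservative scorer gives up some exact matches but never overclaims, which is the same kind of trade-off the summary reports for the property-level scorer.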