🧠 AI | ⚪ Neutral | Importance 6/10

DEMM-Bench: A Cross-Regime Benchmark for Agent-Runtime Governance-Evidence Sufficiency

arXiv – CS AI | Oleg Solozobov | June 23, 2026 at 04:00 AM

🤖AI Summary

DEMM-Bench introduces a benchmark framework for evaluating whether evidence records in agent-runtime systems sufficiently answer governance questions about specific decisions. Using the Decision Evidence Maturity Model, researchers tested 64 cases across eight evidence regimes and found that existing baselines overclaim sufficiency in 50-75% of cases, while a property-level scorer achieved 56.25% accuracy with zero overclaims.

Analysis

DEMM-Bench addresses a gap in AI system accountability by proposing a measurable standard for decision-evidence sufficiency in agent-runtime environments. Agent systems continuously generate several kinds of evidence (traces, ledgers, provenance graphs, and policy logs), yet practitioners lack systematic methods to verify that these records actually answer governance questions rather than merely existing in abundance. The benchmark replaces what has largely been an informal assessment with a structured one, using the Decision Evidence Maturity Model (DEMM) to evaluate whether evidence across eight distinct regimes can reconstruct decision-level properties.

The findings show substantial overclaiming by simple presence-based baselines: the trace-present and schema-present baselines falsely claim sufficiency in three-quarters of test cases, while the ledger-present baseline overclaims in half. The redacted property-level candidate scorer, by contrast, makes zero overclaims while reaching 56.25% accuracy on the 64-case test set, which suggests that checking specific decision-level properties is a more conservative test than checking whether a record exists.

This matters because AI governance and regulatory compliance increasingly depend on audit trails and decision traceability. Organizations deploying autonomous agents face mounting pressure to demonstrate transparent, auditable decision-making, particularly in financial services and healthcare. The benchmark provides a reproducible evaluation framework with publicly deposited datasets and adapters, so heterogeneous systems can be assessed against a common standard. As regulators impose stricter AI transparency requirements, tools that quantify evidence sufficiency become more important. The work sets baseline expectations for decision accountability and identifies where existing record-keeping falls short, which can guide improvements in agent-runtime instrumentation and evidence collection across deployment contexts.
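To make the distinction between a presence baseline and a property-level check concrete, here is a minimal Python sketch. It is not the paper's scorer or data: the `Case` fields, the two scorer functions, and the five toy cases are all hypothetical, chosen only to show how an "overclaim" (claiming sufficiency when the ground truth says the evidence is insufficient) can be counted separately from overall accuracy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One hypothetical benchmark case (toy data, not taken from DEMM-Bench)."""
    has_trace: bool                  # some trace record exists for the decision
    properties_needed: frozenset     # properties the governance question asks about
    properties_verified: frozenset   # properties a scorer could verify from the evidence
    truly_sufficient: bool           # ground-truth sufficiency label


def trace_present_baseline(case: Case) -> bool:
    # Assumed baseline: claims sufficiency whenever any trace exists.
    return case.has_trace


def property_level_scorer(case: Case) -> bool:
    # Assumed scorer: claims sufficiency only if every needed property is verified.
    return case.properties_needed <= case.properties_verified


def evaluate(scorer, cases):
    """Count overclaims (false sufficiency claims) and correct verdicts."""
    overclaims = sum(1 for c in cases if scorer(c) and not c.truly_sufficient)
    correct = sum(1 for c in cases if scorer(c) == c.truly_sufficient)
    return {"n": len(cases), "overclaims": overclaims, "correct": correct}


cases = [
    Case(True, frozenset({"actor", "policy"}), frozenset({"actor", "policy"}), True),
    Case(True, frozenset({"actor", "policy"}), frozenset({"actor"}), False),
    Case(True, frozenset({"input"}), frozenset(), False),
    Case(False, frozenset({"actor"}), frozenset(), False),
    # Sufficient in truth, but the scorer cannot verify "approval": an underclaim.
    Case(True, frozenset({"actor", "approval"}), frozenset({"actor"}), True),
]

baseline_result = evaluate(trace_present_baseline, cases)
property_result = evaluate(property_level_scorer, cases)
print("trace-present baseline:", baseline_result)
print("property-level scorer: ", property_result)
```

In this toy set the presence baseline overclaims whenever a trace exists but does not cover the needed properties, while the property-level check never overclaims and instead loses accuracy through an underclaim. That is the same qualitative trade-off the summary reports, though the numbers here are illustrative only.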

Key Takeaways

→ DEMM-Bench benchmarks whether agent-runtime evidence records suffice to reconstruct decision-level properties across eight evidence regimes
→ Presence-based baselines overclaim sufficiency in 50% (ledger-present) to 75% (trace-present, schema-present) of cases, revealing gaps between claimed and actual decision traceability
→ The property-level scorer made zero overclaims at 56.25% accuracy, demonstrating the value of targeted evidence assessment
→ Reproducible benchmark datasets and adapters enable standardized evaluation of decision-evidence maturity across heterogeneous systems
→ The framework supports regulatory compliance and AI governance by establishing measurable standards for audit trail sufficiency

#agent-governance #benchmarking #decision-evidence #ai-accountability #audit-trails #demm #reproducibility

