Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection
Researchers introduce OpAI-Bench, a comprehensive benchmark for detecting AI-generated text in progressive human-AI co-edited documents across multiple granularities. The study reveals that AI-text detectability follows non-monotonic patterns, with mixed-authorship intermediate versions often harder to detect than purely human or heavily AI-edited documents, challenging assumptions in existing detection methods.
OpAI-Bench addresses a gap in AI-text detection research: it studies how AI authorship signals behave during realistic collaborative editing workflows rather than in isolated final outputs. As AI writing assistants become embedded in professional and academic document creation, detecting hybrid human-AI content becomes increasingly important for authenticity verification, plagiarism detection, and content provenance tracking. The benchmark evaluates detection at four granularities (document, sentence, token, and span), showing how AI contributions manifest at each analytical scale.
The research reveals counterintuitive detection dynamics that have significant implications for content verification systems. The non-monotonic detection patterns discovered suggest that intermediate versions with mixed authorship create detection blind spots where current algorithms struggle, while both endpoints (purely human or heavily AI-edited) remain more identifiable. This finding contradicts the intuitive assumption that more AI content automatically means easier detection, exposing fundamental limitations in existing detector designs.
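As a rough illustration of what "non-monotonic" means here, the sketch below checks whether a detector's score moves in a single direction across a sample's revision sequence and locates the version where detection is weakest. The function names, the version indexing (v0 as the human original through v8 as the most heavily AI-edited), and the accuracy values are all assumptions made up for illustration. They are not results or APIs from OpAI-Bench.

```python
from typing import Sequence


def is_monotonic(values: Sequence[float]) -> bool:
    """True if values never decrease, or never increase, across versions."""
    pairs = list(zip(values, values[1:]))
    non_decreasing = all(a <= b for a, b in pairs)
    non_increasing = all(a >= b for a, b in pairs)
    return non_decreasing or non_increasing


def hardest_version(values: Sequence[float]) -> int:
    """Index of the revision version with the lowest detection score."""
    return min(range(len(values)), key=lambda i: values[i])


# Made-up detector accuracy per revision version (hypothetical values,
# NOT taken from the benchmark): high at both endpoints, lower in between.
illustrative_accuracy = [0.95, 0.88, 0.79, 0.71, 0.68, 0.74, 0.83, 0.90, 0.94]

print(is_monotonic(illustrative_accuracy))     # False
print(hardest_version(illustrative_accuracy))  # 4
```

A U-shaped curve like this one is the pattern the article describes: the dip sits at an intermediate, mixed-authorship version rather than at either endpoint.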
For stakeholders in content verification, academic integrity, and professional writing platforms, these findings highlight the need for detection systems that account for editing context and authorship mixing patterns rather than relying solely on final output analysis. The benchmark enables developers to stress-test detectors against realistic revision scenarios and identify failure modes. Moving forward, the field requires detectors that model cumulative revision history and operation-specific signals, rather than treating documents as static artifacts. This work establishes a methodological foundation for building more robust AI-text detection systems aligned with actual human-AI collaboration practices.
- Mixed-authorship documents in intermediate revision stages are often harder to detect than purely human or heavily AI-edited endpoints, creating non-monotonic detection patterns.
- AI-text detectability depends on multiple factors including edit operation type, domain context, and cumulative revision history, not just the proportion of AI content.
- OpAI-Bench provides a controlled testbed with nine sequential revision versions per sample across four domains with complete authorship provenance tracking.
- Current AI-text detectors show significant performance gaps when analyzing progressive human-AI co-editing workflows compared to static final outputs.
- The benchmark supports multi-level evaluation from document-wide to token-level detection, enabling comprehensive analysis of how AI signals manifest at different granularities.
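To make the provenance and multi-granularity points above more concrete, here is a minimal sketch of how one revision version might be represented, assuming token-level authorship labels are available. The `RevisionVersion` schema, its field names, and the `"paraphrase"` operation label are hypothetical and are not the benchmark's actual data format. The sketch derives span-level and document-level views from a single token mask.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Tuple


@dataclass
class RevisionVersion:
    """One version in a sample's revision sequence (hypothetical schema)."""
    index: int            # position in the revision sequence
    operation: str        # name of the edit operation (assumed label)
    tokens: List[str]
    ai_mask: List[bool]   # True where the token is AI-authored


def ai_spans(mask: List[bool]) -> List[Tuple[int, int]]:
    """Half-open (start, end) index ranges of contiguous AI-authored tokens."""
    spans = []
    pos = 0
    for is_ai, run in groupby(mask):
        length = len(list(run))
        if is_ai:
            spans.append((pos, pos + length))
        pos += length
    return spans


def ai_fraction(mask: List[bool]) -> float:
    """Document-level share of AI-authored tokens."""
    return sum(mask) / len(mask) if mask else 0.0


version = RevisionVersion(
    index=3,
    operation="paraphrase",  # hypothetical operation name
    tokens=["The", "results", "clearly", "demonstrate", "a", "trend", "."],
    ai_mask=[False, False, True, True, False, False, False],
)

print(ai_spans(version.ai_mask))               # [(2, 4)]
print(round(ai_fraction(version.ai_mask), 3))  # 0.286
```

Keeping the token mask as the single source of truth means coarser labels can be recomputed from it, which is one plausible way to keep evaluations at different granularities consistent with each other.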