READER: Robust Evidence-based Authorship Decoding via Extracted Representations
Researchers introduce READER, a framework for identifying which large language model generated a given output by analyzing hidden activation patterns. The method identifies the source model with 70-84% accuracy when given responses to 50 diverse prompts, suggesting that model-specific authorship signals exist in frozen LLM representations and can be extracted once enough evidence is aggregated.
READER addresses an operational challenge that grows as AI systems become more modular: determining which model produced a given response when multiple LLMs are accessed through APIs. Traditional fingerprinting approaches fail because they rely on surface-level linguistic patterns that vary widely with prompt semantics. READER instead treats a frozen proxy LLM as a reader that maps black-box outputs into its own internal activation space, effectively creating a forensic signature of authorship.
The research represents a meaningful advance in LLM interpretability and security. Averaging representations across prompts is a brittle approach that loses temporal and query-specific information, so READER instead accumulates Bayesian evidence across multiple independent samples. The design choice matters: single-response accuracy ranges from 31-42%, but with 50 responses it reaches 70-84%, demonstrating that authorship signals compound predictably when aggregated properly.
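To make the aggregation idea concrete, here is a minimal sketch of Bayesian evidence accumulation over several responses. It is not the authors' implementation. It assumes that some per-response classifier (for example, a probe on the proxy model's activations) already yields a likelihood-style score for each candidate model, and that responses are conditionally independent given the source model. The function name `accumulate_evidence` and the example scores are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Optional, Sequence


def accumulate_evidence(
    per_response_scores: Sequence[Sequence[float]],
    prior: Optional[Sequence[float]] = None,
    eps: float = 1e-12,
) -> List[float]:
    """Combine per-response scores into one posterior over candidate models.

    Assumption (not from the source): each row holds likelihood-style scores,
    one per candidate model, and rows are independent given the true source.
    """
    n_models = len(per_response_scores[0])
    if prior is None:
        # Uniform prior over candidate models when none is supplied.
        prior = [1.0 / n_models] * n_models

    # Work in log space so that many small factors do not underflow.
    log_post = [math.log(max(p, eps)) for p in prior]
    for scores in per_response_scores:
        for k, score in enumerate(scores):
            log_post[k] += math.log(max(score, eps))

    # Normalize with the log-sum-exp trick.
    peak = max(log_post)
    weights = [math.exp(v - peak) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative numbers only: each response weakly favors candidate 0.
weak_evidence = [0.4, 0.3, 0.3]
posterior_3 = accumulate_evidence([weak_evidence] * 3)
posterior_50 = accumulate_evidence([weak_evidence] * 50)
```

In this toy setting each response on its own is only weakly informative, yet the posterior for candidate 0 grows as more responses are multiplied in. That is the general mechanism by which weak per-response signals can add up to a confident attribution. Mean-pooling the activations before classifying would instead collapse the responses into a single vector and discard the per-response evidence.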
For the AI industry, this work has dual implications. On one hand, it strengthens the ability to audit and verify model provenance, which carries important security and compliance benefits for enterprises deploying AI systems. On the other hand, it reveals that authorship traces are not merely surface artifacts but appear encoded in the fundamental representational structure of capable language models. The finding that stronger LLMs expose more linearly decodable authorship structure suggests this property scales with model capability itself.
Looking forward, organizations deploying multi-model AI systems may need to assume that source identification is technically feasible. This could influence decisions around API monitoring, response attribution, and system architecture. The research also opens questions about whether authorship signals can be deliberately obscured and what that means for model differentiation in competitive markets.
- READER achieves 70-84% accuracy in identifying source LLMs from responses to 50 diverse prompts by extracting authorship signals from frozen model activations.
- The framework uses Bayesian evidence accumulation across multiple prompts rather than fragile mean-pooling, preserving query-specific attribution signals.
- Model authorship traces appear to be structurally encoded in LLM representations rather than in surface-level linguistic patterns, and they become more decodable as model capability grows.
- The technique works on agent-style prompts without predefined benchmarks, addressing real-world provenance challenges in multi-LLM systems.
- Results suggest enterprises should assume that model source identification is technically feasible when deploying multiple LLM APIs.