ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?
Researchers introduce ArcANE, a benchmark for evaluating whether role-playing language agents maintain character consistency across narrative arcs rather than fixed personas. The benchmark spans 17 novels and 80 characters, revealing that conditioning on character arc information significantly improves model performance, especially for scenarios outside source texts.
ArcANE addresses a critical gap in how language models are evaluated for narrative tasks. Traditional benchmarks focus on factual recall within predefined story segments, but character-driven narratives require agents to understand psychological progression and extrapolate behavior into unseen scenarios. This research demonstrates that static character representations fail when stories demand dynamic, evolving responses aligned with narrative momentum.
The benchmark's results point to a consistent pattern. Across all six tested models and six context modes, character arc conditioning outperformed alternatives such as simple retrieval and fixed persona prompting. The gap is largest for scenarios outside the source texts, which suggests that retrieval-augmented approaches struggle when the source material contains no relevant examples. The implication is that models need a representation of how a character develops, not just surface-level facts about the character.
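As a rough illustration of the difference between fixed persona prompting and arc conditioning, the sketch below builds both kinds of prompt for the same scene. It is a minimal example under stated assumptions: the `Character` and `ArcPhase` types, the `build_prompt` helper, and the prompt wording are all hypothetical and are not taken from the ArcANE paper or its code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArcPhase:
    """One phase of a character arc (hypothetical structure)."""
    start_chapter: int  # first chapter of the phase, inclusive
    end_chapter: int    # last chapter of the phase, inclusive
    state: str          # the character's psychological state in this phase


@dataclass
class Character:
    name: str
    persona: str          # static description, the same at every story point
    arc: List[ArcPhase]   # phases ordered by chapter


def phase_at(character: Character, chapter: int) -> ArcPhase:
    """Return the arc phase that covers the given chapter."""
    for phase in character.arc:
        if phase.start_chapter <= chapter <= phase.end_chapter:
            return phase
    raise ValueError(f"no arc phase covers chapter {chapter}")


def build_prompt(character: Character, chapter: int, scene: str,
                 mode: str = "arc") -> str:
    """Build a role-play prompt.

    mode="persona": fixed persona only (the static baseline).
    mode="arc":     persona plus the arc phase active at `chapter`.
    """
    if mode not in ("persona", "arc"):
        raise ValueError(f"unknown mode: {mode}")
    lines = [f"You are {character.name}. {character.persona}"]
    if mode == "arc":
        phase = phase_at(character, chapter)
        lines.append(f"At this point in the story (chapter {chapter}): {phase.state}")
    lines.append(f"Scene: {scene}")
    lines.append("Respond in character.")
    return "\n".join(lines)
```

With a made-up character whose arc has two phases, `build_prompt(..., mode="persona")` produces the same character description at every chapter, while `mode="arc"` adds the state that applies at the requested story point. Because the scene is passed in separately, it can describe a situation that never occurs in the source text, and the arc phase still tells the model where the character stands at that moment.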
For AI development, this work reflects growing sophistication in how language agents are evaluated, beyond general-purpose benchmarks. Publishers, game developers, and interactive narrative creators need models that understand character psychology. The fine-tuned ArcANE-8B and ArcANE-32B variants show that targeted training on arc-aware data substantially widens the advantage of arc conditioning, offering a reproducible methodology for building character-consistent systems.
The research also bears on broader questions about how language models encode narrative understanding. Many current approaches treat character knowledge as a static information retrieval problem, but character-driven systems require temporal reasoning about psychological states. Future work may involve specialized architectures that explicitly model narrative phases and character transformation rather than relying on general-purpose attention mechanisms alone.
- Character arc conditioning outperforms all other context strategies across six models and six context modes
- Performance advantages are largest for scenarios outside source texts where retrieval-based methods fail
- Fine-tuned ArcANE models widen arc-based advantages, suggesting targeted training improves narrative understanding
- Traditional factual recall benchmarks miss critical aspects of character consistency in evolving narratives
- This benchmark methodology establishes standards for evaluating narrative intelligence in language agents