🧠 AI | ⚪ Neutral | Importance 6/10

Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

arXiv – CS AI | Junjie Li, Xinrui Guo, Yuhao Wu, Roy Ka-Wei Lee, Hongzhi Li, Yutao Xie | March 9, 2026 at 04:00 AM

🤖 AI Summary

Researchers have developed ConStory-Bench, a benchmark for evaluating consistency errors in long-form story generation by large language models (LLMs). The study finds that LLMs frequently contradict facts and character traits they established earlier in a long narrative. Errors are most common in the factual and temporal dimensions and tend to appear around the middle of a story.

Key Takeaways

→ ConStory-Bench is described as the first comprehensive benchmark for narrative consistency in long-form LLM-generated stories, with 2,000 prompts across four scenarios.
→ The research defines five error categories, subdivided into 19 fine-grained subtypes of consistency problems in AI-generated narratives.
→ Consistency errors are most prevalent in the factual and temporal dimensions and typically appear in the middle sections of long stories.
→ An automated pipeline, ConStory-Checker, detects contradictions and supports each judgment with textual evidence.
→ Certain error types tend to co-occur, and errors tend to appear in text segments with higher token-level entropy.
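
For readers unfamiliar with the term, token-level entropy measures how uncertain a model is about the next token at a given position. The sketch below computes the standard Shannon entropy of a next-token distribution and averages it over a segment. It is only an illustration: the summary does not say how the authors compute or aggregate entropy, and the function names and toy distributions here are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    # Zero-probability tokens contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_segment_entropy(step_distributions):
    """Average token-level entropy over all positions in a text segment."""
    if not step_distributions:
        raise ValueError("segment must contain at least one token position")
    total = sum(token_entropy(dist) for dist in step_distributions)
    return total / len(step_distributions)


# Hypothetical toy distributions over a 4-token vocabulary (not data from the paper).
confident_segment = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]
uncertain_segment = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

print(mean_segment_entropy(confident_segment))
print(mean_segment_entropy(uncertain_segment))
```

A segment where the model spreads probability across many candidate tokens scores higher than one where a single token dominates, which is the sense in which the takeaway links errors to higher-entropy text.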