
StoryAlign: Evaluating and Training Reward Models for Story Generation

arXiv – CS AI | Haotian Xia, Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
🤖 AI Summary

Researchers introduce StoryRMB, the first benchmark for evaluating reward models on story generation preferences, and develop StoryReward, a specialized reward model that reaches 66.3% accuracy on the benchmark, where existing models struggle. The work addresses the challenge of modeling subjective human preferences in narrative generation, enabling better alignment between LLM-generated stories and human expectations.

Analysis

This research tackles a fundamental limitation in current large language model applications: the inability to reliably capture subjective human preferences in creative writing tasks. While LLMs have advanced text generation capabilities, their outputs often diverge from what humans consider well-structured, engaging narratives. The absence of effective preference modeling has hindered progress in story generation, making this benchmark and reward model a meaningful contribution to the field.

The work emerges from broader industry recognition that scaling model size alone is insufficient for preference alignment in creative domains. Traditional reward models trained on general text fail to capture the nuanced narrative structure, coherence, and engagement factors that distinguish compelling stories. StoryRMB's 1,133 verified instances and StoryReward's training set of approximately 100,000 preference pairs represent a significant data-curation effort, establishing a foundation for future research in narrative-specific preference learning.
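The summary does not specify the paper's training objective; a common recipe for reward models trained on preference pairs is the Bradley-Terry pairwise loss, which pushes the score of the human-preferred story above that of the rejected one. A minimal PyTorch sketch under that assumption, where `reward_model` is a hypothetical module mapping a tokenized story to a scalar score:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry loss over (chosen, rejected) story pairs.

    Assumes `reward_model` returns one scalar score per sequence,
    shape (batch,). This is a standard recipe for preference-pair
    training, not necessarily the paper's exact objective.
    """
    r_chosen = reward_model(chosen_ids)      # scores for preferred stories
    r_rejected = reward_model(rejected_ids)  # scores for rejected stories
    # Maximize P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```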

For practitioners developing AI writing assistants and creative tools, this advancement enables more sophisticated story filtering mechanisms. Best-of-n selection strategies powered by StoryReward could substantially improve user experience in story generation applications without requiring larger base models. The public release of datasets, models, and code accelerates adoption across the research community and commercial applications.
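Best-of-n selection itself is simple: sample several candidate stories and keep the one the reward model scores highest. A minimal sketch, where `generate` and `score` are hypothetical stand-ins for an LLM sampler and a StoryReward-style scorer:

```python
def best_of_n_story(prompt, generate, score, n=8):
    """Sample n candidate stories and return the highest-scoring one.

    `generate(prompt)` should return one sampled story string;
    `score(story)` should return a scalar reward. Both are assumed
    interfaces for illustration.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

Larger n trades inference cost for quality: the base model is called n times, but no retraining or larger model is required.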

Future development will likely focus on expanding StoryReward's domain coverage, improving its performance beyond current benchmarks, and integrating it into larger creative AI systems. The methodology also provides a template for building preference models in other subjective creative domains (poetry, dialogue, worldbuilding), potentially influencing how AI systems handle preference learning more broadly.

Key Takeaways
  • The StoryRMB benchmark reveals that existing reward models struggle to select human-preferred stories, indicating significant unmet challenges in narrative preference modeling (a minimal accuracy sketch follows this list).
  • StoryReward outperforms larger models on story preference tasks through specialized training on approximately 100,000 curated preference pairs across diverse narrative domains.
  • The research demonstrates that domain-specific reward models can exceed the performance of general-purpose models in creative text evaluation.
  • Open-source release of datasets and models democratizes access to narrative preference learning tools for researchers and commercial developers.
  • Best-of-n story selection using StoryReward improves alignment with human preferences without requiring larger or more computationally expensive base models.
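For reference, pairwise benchmarks of this kind are typically scored as the fraction of (chosen, rejected) pairs on which the reward model ranks the human-preferred story higher; the paper's exact protocol may differ. A minimal sketch with a hypothetical `score` function:

```python
def pairwise_accuracy(score, pairs):
    """Fraction of (chosen, rejected) pairs ranked correctly.

    `pairs` is an iterable of (chosen_text, rejected_text) tuples and
    `score(text)` returns a scalar reward. This mirrors the usual way
    pairwise preference benchmarks report accuracy; it is an assumed
    protocol, not taken from the paper.
    """
    pairs = list(pairs)
    correct = sum(score(chosen) > score(rejected) for chosen, rejected in pairs)
    return correct / len(pairs)
```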