LitSeg: Narrative-Aware Document Segmentation for Literary RAG
Researchers introduce LitSeg, a narrative-theory-guided framework for segmenting literary documents in Retrieval-Augmented Generation (RAG) systems. The method uses multi-stage prompting to identify plot events and narrative structures. A lightweight variant, LitSeg-Lite, distills that process into a single inference pass. The authors report improved retrieval accuracy for literary RAG applications.
LitSeg addresses a fundamental but overlooked challenge in RAG systems: how to segment documents in ways that preserve semantic and narrative coherence. Traditional chunking strategies rely on generic semantic similarity or fixed-length windowing, both of which break down on literary texts, where plot continuity and character arcs span variable distances. The paper argues that narrative structure matters: turning points, event boundaries, and narrative threads form natural segmentation boundaries that generic embedding similarity does not capture.
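To make the contrast concrete, here is a minimal Python sketch, not the paper's implementation, that compares fixed-length windowing with cutting at event boundaries. The function names, the toy story, and the boundary indices are all hypothetical; in practice the boundaries would come from a narrative analysis step rather than being written by hand.

```python
from typing import List


def fixed_window_chunks(sentences: List[str], size: int) -> List[List[str]]:
    """Baseline: start a new chunk every `size` sentences, ignoring content."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def boundary_chunks(sentences: List[str], boundaries: List[int]) -> List[List[str]]:
    """Cut at the given sentence indices; each index starts a new chunk."""
    # Drop out-of-range or duplicate indices so no chunk is empty.
    cuts = sorted({b for b in boundaries if 0 < b < len(sentences)})
    starts = [0] + cuts
    ends = cuts + [len(sentences)]
    return [sentences[s:e] for s, e in zip(starts, ends)]


# Toy story (hypothetical data for illustration only).
story = [
    "Anna arrives at the estate.",
    "She meets the steward.",
    "That night, a letter goes missing.",
    "Anna suspects the steward.",
    "The steward confesses.",
]

# Assumed event boundaries: a new event starts at sentence 2 and at sentence 4.
event_boundaries = [2, 4]

fixed = fixed_window_chunks(story, 3)
narrative = boundary_chunks(story, event_boundaries)
```

With a window of three sentences, the baseline places the missing letter in the same chunk as the arrival scene and separates it from the suspicion it triggers. The boundary-based version keeps each event in its own chunk.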
This research builds on growing recognition that RAG performance depends critically on retrieval quality, which in turn hinges on document representation. While recent work has focused on reranking and query expansion, LitSeg tackles the upstream problem of chunk construction itself. For literary domains and narrative-heavy content (including academic papers with complex arguments), semantically aware chunking directly improves downstream QA performance.
The introduction of LitSeg-Lite represents a pragmatic engineering contribution. Multi-stage prompting with large models is computationally expensive at scale; distilling the narrative analysis process into a lightweight fine-tuned model reduces inference latency while maintaining performance gains. That makes narrative-aware chunking practical for literary RAG in production environments.
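The cost difference can be illustrated with a rough Python sketch. Everything here is an assumption made for illustration: the three stages, the prompt wording, the comma-separated output format, and the names `multi_stage_boundaries`, `single_pass_boundaries`, and `CountingStub` are hypothetical and are not taken from the paper. The model is passed in as a plain callable, so no real LLM API is involved.

```python
from typing import Callable, List

# Any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def parse_indices(raw: str) -> List[int]:
    """Parse a comma-separated list of sentence indices, skipping bad tokens."""
    return [int(tok.strip()) for tok in raw.split(",") if tok.strip().isdigit()]


def multi_stage_boundaries(text: str, llm: LLM) -> List[int]:
    """Assumed three-stage pipeline: events, then threads, then boundaries."""
    events = llm("List the plot events in this text:\n" + text)
    threads = llm("Group these events into narrative threads:\n" + events)
    raw = llm(
        "Return segment boundaries as comma-separated sentence indices.\n"
        "Events:\n" + events + "\nThreads:\n" + threads
    )
    return parse_indices(raw)


def single_pass_boundaries(text: str, distilled: LLM) -> List[int]:
    """Assumed distilled variant: one call maps text directly to boundaries."""
    return parse_indices(distilled(text))


class CountingStub:
    """Stand-in model that returns a fixed reply and counts its calls."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return self.reply


teacher = CountingStub("2, 4")
student = CountingStub("2, 4")

teacher_cuts = multi_stage_boundaries("some chapter text", teacher)
student_cuts = single_pass_boundaries("some chapter text", student)
```

In this sketch the staged pipeline makes three model calls per document and the single-pass variant makes one. That is the kind of saving distillation aims for, though the actual number of stages in LitSeg may differ.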
Market implications center on improved RAG capabilities for specialized domains. Publishers, educational platforms, and digital libraries could deploy LitSeg to enhance search and question-answering over literary corpora. The methodology could also extend to other narrative-heavy material such as legal cases, historical documents, and case studies.
- Narrative-aware segmentation significantly outperforms semantic-blind chunking strategies for literary document RAG systems
- Multi-stage prompting explicitly extracts narrative events, threads, and turning points to inform intelligent segmentation boundaries
- LitSeg-Lite achieves comparable performance with single-pass inference through knowledge distillation, enabling production-scale deployment
- The approach improves both retrieval accuracy and downstream QA performance on literary corpora
- Framework principles could extend beyond literature to other narrative-heavy domains like legal and historical documents