SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering
Researchers introduce SPADER, a reinforcement learning framework that enables large language models to discover multiple valid answers to complex questions through tool-augmented search. The system combines step-wise credit assignment with diversity-aware rewards to improve recall and F1 scores across multiple QA benchmarks.
SPADER addresses a fundamental limitation in current AI reasoning systems: while recent advances have improved long-horizon tool use, most focus on single-answer tasks that don't reflect real-world information needs. Multi-Answer QA requires agents to exhaustively search for all valid solutions, a significantly harder problem requiring both precise credit assignment and sustained exploration incentives.
The framework tackles two specific challenges that have hindered prior approaches. First, assigning credit across long search trajectories traditionally requires critic models that add computational overhead and training complexity. SPADER's Step-wise Peer Advantage mechanism eliminates this by comparing parallel trajectories at each decision point, using peer returns to estimate advantages without additional neural networks. Second, exploration naturally clusters toward high-frequency entities, making rare but valid answers difficult to discover. SPADER's diversity-aware rewards actively counterbalance this by upweighting novel findings and downweighting redundant ones, effectively reshaping the exploration landscape.
The experimental results spanning QAMPARI, Mintaka, WebQSP, and QUEST datasets demonstrate consistent improvements over both prompting-based baselines and supervised RL approaches.
This work carries implications for AI agents deployed in information retrieval, research assistance, and knowledge discovery applications where comprehensive answers matter more than single-shot responses. The open-source release enables reproducibility and adoption, potentially influencing how future AI systems approach multi-solution search problems.
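To make the critic-free idea concrete, the following is a minimal sketch of a peer-based advantage estimate. It is not SPADER's actual implementation: the function names, the use of return-to-go, the leave-one-out peer mean, and the rule that a step with no active peers gets zero advantage are all assumptions chosen for illustration, since this summary does not specify how the paper aligns or normalizes steps.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of rewards from each step to the end of one trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def stepwise_peer_advantages(
    group_rewards: List[List[float]], gamma: float = 1.0
) -> List[List[float]]:
    """Hypothetical peer baseline: at step t, compare each trajectory's
    return-to-go with the mean return-to-go of the other trajectories
    that are still running at step t. No learned critic is involved."""
    returns = [returns_to_go(r, gamma) for r in group_rewards]
    advantages = [[0.0] * len(r) for r in returns]
    max_len = max((len(r) for r in returns), default=0)
    for t in range(max_len):
        active = [i for i, r in enumerate(returns) if len(r) > t]
        if len(active) < 2:
            continue  # assumption: no peers at this step, advantage stays 0
        total = sum(returns[i][t] for i in active)
        for i in active:
            peer_mean = (total - returns[i][t]) / (len(active) - 1)
            advantages[i][t] = returns[i][t] - peer_mean
    return advantages


# Three parallel search trajectories with per-step rewards (made-up numbers).
group = [[1.0, 0.0], [0.0, 0.0], [0.0, 2.0, 1.0]]
advantages = stepwise_peer_advantages(group)
```

In this toy group, the third trajectory collects more reward later in its search, so it receives positive advantages at the steps it shares with its peers, while the other two receive negative ones.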
- SPADER eliminates critic networks through peer-based advantage estimation, reducing computational overhead while improving credit assignment.
- Diversity-aware exploration rewards systematically promote discovery of rare, valid answers rather than clustering around common entities.
- The framework shows consistent F1 and recall improvements across four benchmark datasets compared to prompting and supervised baselines.
- Step-wise peer alignment enables parallel trajectory comparison without centralized critic models, improving training efficiency.
- Open-source release supports broader adoption in information retrieval and multi-answer knowledge discovery applications.
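
One way to picture a reward that upweights novel findings and downweights redundant ones is the sketch below. It is an illustrative assumption, not the reward defined in the paper: here each correct answer is worth 1 divided by the number of trajectories in the group that found it, and repeats within a single trajectory earn nothing extra. The function name `diversity_aware_rewards` and the example data are hypothetical.

```python
from collections import Counter
from typing import List, Set


def diversity_aware_rewards(
    group_answers: List[List[str]], gold: Set[str]
) -> List[float]:
    """Hypothetical inverse-frequency reward over a group of trajectories."""
    # Count how many trajectories found each correct answer at least once.
    found_by: Counter = Counter()
    for answers in group_answers:
        for answer in set(answers) & gold:
            found_by[answer] += 1

    rewards = []
    for answers in group_answers:
        # Duplicates inside one trajectory collapse to a single finding;
        # incorrect answers are ignored.
        unique_correct = set(answers) & gold
        rewards.append(sum(1.0 / found_by[a] for a in unique_correct))
    return rewards


# Made-up example: every trajectory finds the common answer, two find a rare one.
gold_answers = {"Paris", "Lyon", "Nice"}
group_answers = [
    ["Paris", "Lyon", "Paris"],
    ["Paris"],
    ["Paris", "Nice", "Rome"],
]
rewards = diversity_aware_rewards(group_answers, gold_answers)
```

Under this weighting, the trajectory that only rediscovers the common answer earns a third of a point, while each trajectory that also surfaces a rare answer earns a full extra point for it.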