MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop
Researchers introduce MulFeRL, a reinforcement learning framework that uses multi-turn verbal feedback to improve AI reasoning on failed tasks. By converting qualitative feedback into trainable signals and assigning credit for incremental progress, the approach outperforms traditional reward-based methods on math problems and generalizes well to unseen domains.
MulFeRL addresses a fundamental limitation in current reinforcement learning systems: scalar rewards provide minimal guidance when models fail. Outcome-only feedback tells a system that an attempt failed but not where or why, creating a sparse learning signal that slows improvement. This research tackles that problem by incorporating verbal feedback that explains where the reasoning broke down, turning qualitative critiques into trainable learning signals.
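To make the sparse-signal problem concrete, here is a minimal sketch assuming a group-normalized advantage, a common choice in policy-gradient training of language models. The source does not say which RL objective MulFeRL builds on, so the function name `group_advantages` and the normalization itself are illustrative assumptions. The point is only that when every sampled attempt fails, a scalar outcome reward gives the optimizer nothing to distinguish one attempt from another.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Hypothetical helper for illustration; not taken from the MulFeRL paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every attempt failed: all rewards are identical, so every advantage is zero
# and the policy gradient for this prompt vanishes.
all_failed = group_advantages([0.0, 0.0, 0.0, 0.0])

# One attempt succeeded: the signal now separates good from bad samples.
one_success = group_advantages([1.0, 0.0, 0.0, 0.0])
```

In the all-failed case the binary reward says nothing about which attempt came closest, which is the gap that verbal feedback is meant to fill.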
The technical contribution centers on three mechanisms working in concert. Progress induction identifies partial advances within failed attempts, triggering regeneration loops that leverage feedback. Progress credit assignment ensures the model learns from verifier-confirmed improvements rather than binary success/failure signals. Structured feedback injection integrates explanations directly into the reasoning process, making feedback part of the model's decision-making rather than external correction.
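The interplay of the three mechanisms can be sketched as a short loop. This is a hedged illustration, not the authors' implementation: the callables `generate`, `verify`, and `critique`, the prompt template in `build_prompt`, and the rule of crediting each turn with its improvement over the best verifier score so far are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    prompt: str
    answer: str
    score: float   # verifier score in [0, 1] (assumed scale)
    credit: float  # training signal assigned to this turn


def build_prompt(question: str, prev_answer: str, feedback: str) -> str:
    """Structured feedback injection: put the critique into the next prompt.

    The template is hypothetical; the source does not give one.
    """
    return (
        f"Question: {question}\n"
        f"Previous attempt: {prev_answer}\n"
        f"Feedback: {feedback}\n"
        "Revised attempt:"
    )


def multi_turn_rollout(
    question: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], float],
    critique: Callable[[str, str], str],
    max_turns: int = 3,
) -> List[Turn]:
    turns: List[Turn] = []
    prompt = question
    best = 0.0
    for _ in range(max_turns):
        answer = generate(prompt)
        score = verify(question, answer)
        # Progress credit (assumed rule): reward only verifier-confirmed
        # improvement over the best score seen so far.
        credit = max(0.0, score - best)
        turns.append(Turn(prompt, answer, score, credit))
        best = max(best, score)
        if score >= 1.0:
            break  # solved, no regeneration needed
        # Failed attempt: request verbal feedback and regenerate.
        prompt = build_prompt(question, answer, critique(question, answer))
    return turns
```

Under this rule, a turn that moves a failed attempt partway toward a correct answer still receives positive credit, whereas a purely binary reward would score it the same as no progress at all.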
For the AI development community, this represents meaningful progress toward more sample-efficient learning. Current large language models require massive datasets and computational resources; systems that extract more learning signal from each failed attempt could reduce these requirements. The framework's strong out-of-domain generalization suggests the approach develops more robust reasoning capabilities rather than dataset-specific memorization.
The work has immediate implications for AI companies developing reasoning models for mathematics, science, and code generation. Methods that extract more value from each training example reduce development costs and accelerate capability improvements. As AI systems tackle increasingly complex problems, learning from feedback on existing examples may become more practical than collecting new training data. The research trajectory suggests future systems will combine verbal explanations with learning algorithms more systematically, moving away from pure supervised learning and toward interactive improvement loops.
- MulFeRL converts verbal feedback from failed attempts into trainable signals, addressing sparse reward limitations in reinforcement learning
- The framework combines progress detection, credit assignment, and structured feedback injection for multi-turn improvement loops
- Performance exceeds supervised learning and traditional RL baselines on mathematical reasoning tasks, with strong out-of-domain transfer
- The approach reduces reliance on massive datasets by extracting more learning value from each example through richer feedback
- The research demonstrates a practical pathway toward more sample-efficient AI reasoning systems for complex problem domains