MAVRL: Learning Reward Functions from Multiple Feedback Types with Amortized Variational Inference
Researchers introduce MAVRL, a method that learns a single reward function from several heterogeneous feedback types (demonstrations, comparisons, ratings, and stops) at once, using amortized variational inference to approximate the Bayesian posterior over rewards. The method removes the need for manual loss balancing and reportedly outperforms single-feedback approaches on discrete and continuous control benchmarks.
MAVRL addresses a fundamental challenge in reinforcement learning: the fragmentation of reward learning across different feedback modalities. Traditional approaches either learn from a single feedback type or combine several through manually tuned weighted losses, which limits scalability and adds hyperparameters to tune. By formulating the problem as Bayesian inference over a shared latent reward function, the researchers enable principled integration of qualitatively different signals within a unified probabilistic framework.
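To make the Bayesian framing concrete, the generic form of such a posterior can be written out. This is a sketch in assumed notation, not the paper's own: $r$ is the shared latent reward function, $\mathcal{D}_k$ is the dataset for feedback type $k$ (demonstrations, comparisons, ratings, stops), and the feedback types are assumed to be conditionally independent given $r$.

```latex
p(r \mid \mathcal{D}_1, \dots, \mathcal{D}_K)
  \;\propto\; p(r) \prod_{k=1}^{K} p(\mathcal{D}_k \mid r)
\qquad\Longleftrightarrow\qquad
\log p(r \mid \mathcal{D}_{1:K})
  = \log p(r) + \sum_{k=1}^{K} \log p(\mathcal{D}_k \mid r) + \text{const}
```

Under this assumption, each feedback type contributes its own log-likelihood term, and the relative scale of each term comes from its likelihood model rather than from a hand-chosen loss weight.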
This work builds on decades of reward learning research but represents a methodological shift toward treating heterogeneous feedback as complementary information sources rather than competing objectives. The amortized variational inference architecture uses a shared encoder and feedback-specific decoders, all optimized through a single evidence lower bound (ELBO), which removes the need for manual loss balancing. This approach mirrors broader trends in machine learning toward end-to-end differentiable systems that learn data-driven solutions to traditionally hand-crafted problems.
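The shared-encoder, per-feedback-decoder structure can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not MAVRL's actual implementation: the reward is assumed linear in hypothetical trajectory features, the encoder is a fixed linear map, and the three likelihoods (Bradley-Terry for comparisons, Gaussian for ratings, Boltzmann-rational for demonstrations) are common textbook choices that the source does not confirm. Stop feedback is omitted; it would enter as one more log-likelihood term. All function and variable names are illustrative.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def encode(feedback, W_mu, W_logstd):
    """Shared encoder (assumed linear): pool every feedback type into one
    vector and map it to the mean and log-std of a diagonal Gaussian q(w)."""
    pooled = (
        mean_vec([[a - b for a, b in zip(win, lose)]
                  for win, lose in feedback["comparisons"]])
        + mean_vec([[x * rating for x in feats]
                    for feats, rating in feedback["ratings"]])
        + mean_vec(feedback["demos"])
    )
    return [dot(r, pooled) for r in W_mu], [dot(r, pooled) for r in W_logstd]

# Feedback-specific "decoders": one log-likelihood per feedback type.
def loglik_comparisons(w, comparisons):
    # Bradley-Terry: P(first preferred) = sigmoid(R(first) - R(second)).
    return sum(log_sigmoid(dot(w, a) - dot(w, b)) for a, b in comparisons)

def loglik_ratings(w, ratings, noise_std=1.0):
    # Rating modelled as trajectory return plus Gaussian noise.
    const = -math.log(noise_std) - 0.5 * math.log(2 * math.pi)
    return sum(const - 0.5 * ((r - dot(w, f)) / noise_std) ** 2
               for f, r in ratings)

def loglik_demos(w, demos, candidates, beta=1.0):
    # Boltzmann-rational choice over a candidate set that contains the demos.
    logits = [beta * dot(w, c) for c in candidates]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return sum(beta * dot(w, d) - log_z for d in demos)

def elbo(feedback, W_mu, W_logstd, rng, n_samples=32):
    """Monte Carlo ELBO: E_q[sum of log-likelihoods] - KL(q || N(0, I))."""
    mu, logstd = encode(feedback, W_mu, W_logstd)
    total = 0.0
    for _ in range(n_samples):
        # Reparameterised sample w = mu + sigma * eps.
        w = [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, logstd)]
        total += (loglik_comparisons(w, feedback["comparisons"])
                  + loglik_ratings(w, feedback["ratings"])
                  + loglik_demos(w, feedback["demos"], feedback["candidates"]))
    # Closed-form KL between a diagonal Gaussian and a standard normal prior.
    kl = 0.5 * sum(math.exp(2 * s) + m * m - 1.0 - 2.0 * s
                   for m, s in zip(mu, logstd))
    return total / n_samples - kl, kl

# Hand-written toy feedback over 2-dimensional trajectory features.
feedback = {
    "comparisons": [([1.0, 0.0], [0.0, 1.0]), ([0.8, 0.1], [0.2, 0.9])],
    "ratings": [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)],
    "demos": [[0.9, 0.1]],
    "candidates": [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]],
}
# Untrained, hand-picked encoder weights (pooled vector has 3 * 2 = 6 entries).
W_mu = [[0.5, 0.0, 0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5, 0.0, 0.5]]
W_logstd = [[0.0] * 6, [0.0] * 6]

elbo_value, kl_value = elbo(feedback, W_mu, W_logstd, random.Random(0))
print(f"ELBO estimate: {elbo_value:.3f}, KL term: {kl_value:.3f}")
```

In this sketch the three log-likelihoods are simply summed inside one objective, with no per-term weights. A real system would train the encoder weights by gradient ascent on the ELBO with an automatic differentiation library, and would use neural networks in place of the linear maps.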
The implications extend beyond academic interest. In robotics, autonomous systems, and AI alignment, practitioners must often integrate diverse human feedback: expert demonstrations, pairwise preferences, scalar ratings, and termination signals. MAVRL's demonstrated robustness to environment perturbations and interpretable uncertainty estimates address practical concerns about deploying learned reward functions in safety-critical domains. The uncertainty quantification particularly matters for identifying model confidence gaps and detecting feedback inconsistencies.
Looking ahead, researchers should investigate how this framework scales to larger, more complex environments and whether the inferred reward uncertainty can guide active learning strategies. Integration with recent foundation models and investigation of feedback distribution shifts will determine MAVRL's practical viability in real-world deployment scenarios.
- MAVRL unifies multiple feedback types through Bayesian inference without manual loss weighting
- Joint learning exploits complementary information across diverse feedback modalities
- Inferred reward uncertainty provides interpretable confidence metrics for model analysis
- Policies trained on jointly inferred rewards show improved robustness to environmental perturbations
- The approach eliminates the need to reduce heterogeneous feedback to a common intermediate representation