A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs
Researchers present a novel harmonic mean formulation for average reward reinforcement learning in semi-Markov decision processes (SMDPs), addressing a critical gap: existing algorithms fail when reward and duration distributions are non-stationary. The new approach yields more robust model-free learning algorithms for infinite-horizon tasks where the traditional optimization of a reward-to-duration ratio becomes mathematically incorrect.
This paper addresses a fundamental mathematical problem in reinforcement learning with practical implications for AI systems operating in real-world, non-episodic environments. Traditional approaches to average reward optimization assume stationary reward and duration distributions, an assumption that breaks down in many real applications. The authors demonstrate that a simple ratio of rewards to durations produces incorrect results when these distributions shift over time, a situation common in continual learning settings.
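To make the failure concrete, here is a minimal numerical sketch (our illustration, not the paper's experiment): within a single regime, the arithmetic mean of per-step reward/duration ratios is Jensen-biased above the true rate, and after a distribution shift the all-time ratio of sums lags the current rate. The distributions below are hypothetical choices, not the authors' benchmarks.

```python
# Minimal sketch (not from the paper): two ways naive rate estimation
# from (reward, duration) pairs goes wrong in an SMDP-like stream.
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Regime 1: rewards ~ N(1, 0.1), durations ~ Uniform(1, 3), true rate = 0.5.
r1, d1 = rng.normal(1.0, 0.1, n), rng.uniform(1.0, 3.0, n)
# Regime 2 (after a shift): durations lengthen, true rate drops to 0.25.
r2, d2 = rng.normal(1.0, 0.1, n), rng.uniform(2.0, 6.0, n)

r, d = np.concatenate([r1, r2]), np.concatenate([d1, d2])

print("arithmetic mean of per-step ratios:", (r / d).mean())     # ~0.41, Jensen-biased
print("all-time ratio of sums            :", r.sum() / d.sum())  # ~0.33, blends regimes
print("true current rate E[R]/E[D]       :", r2.mean() / d2.mean())  # ~0.25
```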
The introduction of a modified harmonic mean operator is an elegant solution grounded in mathematical rigor rather than heuristic fixes. Semi-Markov decision processes are particularly relevant for modeling systems where action durations vary stochastically, as is common in robotics, network optimization, and resource allocation problems. The harmonic mean is the natural operator for averaging rates (as when averaging speeds over the legs of a journey), making it mathematically appropriate for reward-per-unit-time quantities.
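To see why, note a standard identity, given here as motivation; the paper's modified operator is presumably a refinement of this idea and is not reproduced here. The empirical reward rate is a reward-weighted harmonic mean of the per-transition rates $r_i/d_i$:

```latex
% Weighted harmonic mean with weights w_i applied to rates x_i = r_i / d_i;
% choosing w_i = r_i recovers the correct ratio-of-sums reward rate.
\[
  \mathrm{HM}_w(x_1,\dots,x_n) = \frac{\sum_i w_i}{\sum_i w_i / x_i},
  \qquad
  \mathrm{HM}_r\!\Big(\frac{r_1}{d_1},\dots,\frac{r_n}{d_n}\Big)
    = \frac{\sum_i r_i}{\sum_i r_i \,\frac{d_i}{r_i}}
    = \frac{\sum_i r_i}{\sum_i d_i}.
\]
```

Equivalently, the correct rate is the duration-weighted arithmetic mean of the per-transition rates; an unweighted arithmetic mean overweights short transitions.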
From a technical standpoint, this work strengthens the theoretical foundations of reinforcement learning in continuing (non-episodic) environments. The empirical validation shows the derived algorithms outperforming existing methods in non-stationary settings. This matters for AI practitioners building systems that must operate indefinitely without episodic resets, such as autonomous systems, trading algorithms, and industrial control applications.
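For orientation, here is a minimal model-free sketch in the classical tabular template for SMDP average reward Q-learning (in the spirit of SMART-style algorithms); this is an assumed baseline pattern, not the authors' algorithm. The `env` interface with `reset()` and `step(a) -> (s', r, tau)` and all hyperparameters are hypothetical placeholders.

```python
# Hedged sketch: classical average reward Q-learning for SMDPs,
# NOT the paper's method. tau is the stochastic transition duration.
import numpy as np

def smdp_avg_reward_q(env, n_states, n_actions, steps=100_000,
                      alpha=0.1, eps=0.1, decay=0.999, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    r_sum, d_sum = 0.0, 1e-8   # exponentially decayed reward and duration sums
    s = env.reset()
    for _ in range(steps):
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s_next, r, tau = env.step(a)
        # Rate estimate: a decayed ratio of sums (never an arithmetic mean
        # of per-step r/tau ratios); the decay lets rho track drifting
        # reward and duration distributions.
        r_sum = decay * r_sum + r
        d_sum = decay * d_sum + tau
        rho = r_sum / d_sum
        # SMDP Bellman update: charge the opportunity cost rho * tau
        # for the time the action consumed.
        Q[s, a] += alpha * (r - rho * tau + Q[s_next].max() - Q[s, a])
        s = s_next
    return Q, rho
```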
The significance extends beyond academic interest. As reinforcement learning deployment increases in real-world applications, correctness under non-stationary conditions becomes critical. This research provides both theoretical guarantees and practical algorithms that developers can implement, reducing the risk of subtle errors in average reward optimization that could compound over extended operation.
- Existing average reward RL algorithms fail mathematically when reward and duration distributions are non-stationary in infinite-horizon tasks
- A modified harmonic mean operator correctly computes reward rates under non-stationary conditions, closing this fundamental gap
- The approach applies specifically to semi-Markov decision processes, where actions produce variable durations and rewards
- Model-free algorithms derived from this formulation retain theoretical robustness while empirically outperforming current methods
- This advancement improves RL reliability for real-world continuing systems that cannot assume stationary distributions