Variance-penalized Markov Decision Processes: Dynamic Programming and Reinforcement Learning Techniques
Abstract
In control systems theory, the Markov decision process (MDP) is a widely used optimization model for selecting the optimal action in each state visited by a discrete-event system driven by Markov chains. The classical MDP model suits an agent/decision-maker interested in maximizing expected revenues, but it does not account for minimizing variability in those revenues. This paper proposes an MDP model in which the agent maximizes revenues while simultaneously controlling their variance. The work is rooted in machine learning/neural network concepts, in which updates are driven by system feedback and step sizes. First, a Bellman equation for the problem is proposed. Thereafter, convergent dynamic programming and reinforcement learning techniques for solving the MDP are presented, along with encouraging numerical results on a small MDP and a preventive maintenance problem.
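The abstract refers to a Bellman equation for the variance-penalized objective and to feedback-driven, step-size-based updates. The sketch below illustrates one plausible shape such a reinforcement learning scheme can take: an average-reward Q-learning-style update in which each observed reward r is replaced by a penalized reward r - theta*(r - rho)^2, where rho is a running estimate of the average reward and theta weights the variance penalty. All specifics are illustrative assumptions, not the paper's algorithm: the toy transition/reward data, the penalty weight THETA, the step-size schedules, and the relative Q-update are made up for demonstration.

# Illustrative sketch of a variance-penalized, average-reward
# Q-learning-style scheme on a toy two-state MDP. Assumptions (not from
# the paper): penalized reward r - THETA*(r - rho)^2, running averages
# rho (raw reward) and rho_pen (penalized reward), and the toy model data.

import random

STATES = (0, 1)
ACTIONS = (0, 1)
THETA = 0.2          # variance-penalty weight (illustrative choice)
EPSILON = 0.1        # exploration rate
STEPS = 200_000

# Toy model: TRANS[s][a] = list of (probability, next_state, reward)
TRANS = {
    0: {0: [(0.7, 0, 6.0), (0.3, 1, -5.0)],
        1: [(0.9, 0, 4.0), (0.1, 1, 1.0)]},
    1: {0: [(0.4, 0, 7.0), (0.6, 1, -2.0)],
        1: [(0.2, 0, 10.0), (0.8, 1, -8.0)]},
}

def step(s, a):
    # Sample a transition (next state, reward) from the toy model.
    u, acc = random.random(), 0.0
    for p, s2, r in TRANS[s][a]:
        acc += p
        if u <= acc:
            return s2, r
    return TRANS[s][a][-1][1:]   # floating-point fallback

def run():
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    rho = 0.0          # running estimate of the average raw reward
    rho_pen = 0.0      # running estimate of the average penalized reward
    s = 0
    for k in range(1, STEPS + 1):
        # Epsilon-greedy action selection.
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r = step(s, a)
        r_pen = r - THETA * (r - rho) ** 2     # penalize deviation from mean
        alpha = 100.0 / (1000.0 + k)           # step size for Q-values
        beta = 10.0 / (1000.0 + k)             # slower step size for averages
        # Relative (average-reward) Q-update driven by the penalized reward.
        Q[(s, a)] += alpha * (r_pen - rho_pen +
                              max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        rho += beta * (r - rho)
        rho_pen += beta * (r_pen - rho_pen)
        s = s2
    policy = {s: max(ACTIONS, key=lambda b: Q[(s, b)]) for s in STATES}
    return policy, rho, rho_pen

if __name__ == "__main__":
    random.seed(0)
    policy, rho, rho_pen = run()
    print("greedy policy:", policy)
    print("avg reward ~", round(rho, 3), "| avg penalized reward ~", round(rho_pen, 3))

Intuitively, a larger THETA steers the greedy policy toward actions with less reward spread at some cost in mean revenue; the paper itself should be consulted for the actual Bellman equation and the convergence conditions on the step sizes.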
Recommended Citation
A. Gosavi, "Variance-penalized Markov Decision Processes: Dynamic Programming and Reinforcement Learning Techniques," International Journal of General Systems, vol. 43, no. 6, pp. 649-669, Taylor and Francis, Aug 2014.
The definitive version is available at https://doi.org/10.1080/03081079.2014.883387
Department(s)
Engineering Management and Systems Engineering
Keywords and Phrases
Bellman equation; dynamic programming; reinforcement learning; risk penalties; variance-penalized MDPs
International Standard Serial Number (ISSN)
1563-5104 (electronic); 0308-1079 (print)
Document Type
Article - Journal
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2014 Taylor and Francis Group. All rights reserved.
Publication Date
18 Aug 2014
Comments
National Science Foundation, Grant ECS 0841055