Variance-penalized Markov Decision Processes: Dynamic Programming and Reinforcement Learning Techniques

Abstract

In control systems theory, the Markov decision process (MDP) is a widely used optimization model in which a decision-maker selects an optimal action in each state visited by a discrete-event system driven by Markov chains. The classical MDP model suits an agent/decision-maker interested in maximizing expected revenues, but does not account for minimizing variability in those revenues. An MDP model is proposed in which the agent maximizes revenues while simultaneously controlling their variance. This work is rooted in machine learning/neural network concepts, in which updates are driven by system feedback and step sizes. First, a Bellman equation for the problem is proposed. Thereafter, convergent dynamic programming and reinforcement learning techniques for solving the MDP are provided, along with encouraging numerical results on a small MDP and a preventive maintenance problem. © 2014 Taylor and Francis.
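The abstract does not reproduce the paper's Bellman equation or its DP/RL algorithms, but the variance-penalized objective it describes can be made concrete. The Python sketch below is a brute-force illustration, not the paper's method: it enumerates every stationary deterministic policy of a made-up two-state, two-action MDP and scores each by rho - theta * sigma^2, where rho is the long-run average reward and sigma^2 is the long-run variance of the immediate reward. The transition probabilities, rewards, and penalty theta are all assumed values chosen for illustration.

```python
# Illustrative sketch only: brute-force search over stationary deterministic
# policies of a toy two-state, two-action MDP, scored by the variance-penalized
# objective rho - theta * sigma^2. The numbers below are assumptions, not data
# from the paper.
import itertools
import numpy as np

# p[a][i][j]: probability of jumping from state i to state j under action a
p = np.array([[[0.7, 0.3],
               [0.4, 0.6]],     # action 0
              [[0.9, 0.1],
               [0.2, 0.8]]])    # action 1

# r[a][i][j]: immediate reward earned on the transition i -> j under action a
r = np.array([[[6.0, -5.0],
               [7.0, 12.0]],    # action 0
              [[10.0, 17.0],
               [-14.0, 13.0]]]) # action 1

theta = 0.5  # variance-penalty coefficient (assumed)

def evaluate(policy):
    """Return (rho, sigma2) for a stationary deterministic policy.

    rho is the long-run average reward; sigma2 is the long-run variance of
    the immediate reward under the policy's stationary distribution.
    """
    P = np.array([p[policy[i], i] for i in range(2)])  # 2x2 transition matrix
    R = np.array([r[policy[i], i] for i in range(2)])  # 2x2 reward matrix
    # Stationary distribution = left eigenvector of P for eigenvalue 1
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    pi = pi / pi.sum()
    rho = float(np.sum(pi[:, None] * P * R))
    sigma2 = float(np.sum(pi[:, None] * P * (R - rho) ** 2))
    return rho, sigma2

def objective(policy):
    rho, sigma2 = evaluate(policy)
    return rho - theta * sigma2

# Enumerate all four deterministic policies (one action choice per state)
best = max(itertools.product([0, 1], repeat=2), key=objective)
rho, sigma2 = evaluate(best)
print(f"best policy {best}: rho={rho:.3f}, sigma^2={sigma2:.3f}, "
      f"penalized objective={rho - theta * sigma2:.3f}")
```

Exhaustive enumeration is only feasible here because the toy problem has four policies; the paper itself develops a Bellman equation and convergent dynamic programming and reinforcement learning updates that scale to realistic state spaces such as the preventive maintenance problem it reports on.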

Department(s)

Engineering Management and Systems Engineering

Comments

National Science Foundation, Grant ECS 0841055

Keywords and Phrases

Bellman equation; dynamic programming; reinforcement learning; risk penalties; variance-penalized MDPs

International Standard Serial Number (ISSN)

1563-5104; 0308-1079

Document Type

Article - Journal

Document Version

Citation

File Type

text

Language(s)

English

Rights

© 2014 Taylor and Francis Group. All rights reserved.

Publication Date

18 Aug 2014
