Abstract
Variance-penalized Markov decision processes (MDPs) for an infinite time horizon have been studied in the literature for asymptotic and one-step variance; in these models, the objective function is generally the expected long-run reward minus a constant times the variance, where variance is used as a measure of risk. For the finite time horizon, asymptotic variance has been considered in Collins [1], but that model accounts for only a terminal reward, i.e., reward is earned at the end of the time horizon. In this paper, we seek to develop a framework for one-step variance in the finite time horizon in which rewards can be non-zero in every state. We develop a solution algorithm based on the stochastic shortest path algorithm of Bertsekas and Tsitsiklis [2]. We also present a Q-Learning algorithm for a simulation-based scenario, which applies in the absence of a transition probability model, along with some preliminary convergence results. ©2010 IEEE.
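To make the variance-penalized objective concrete, the following is a minimal illustrative sketch (not the paper's algorithm) of a tabular, finite-horizon Q-Learning update in which each immediate reward is penalized by a running one-step variance estimate, reward - lam * (reward - mean_reward)^2. The penalty weight lam, the number of stages n_stages, and the environment interface (env.reset(), env.step() returning next state, reward, and a done flag) are assumptions introduced only for illustration.

import numpy as np

def variance_penalized_q_learning(env, n_states, n_actions, n_stages,
                                  episodes=5000, lam=0.1, alpha=0.1, epsilon=0.1):
    # Sketch only; lam, n_stages, and the env interface are hypothetical.
    # One Q-table per stage, since the horizon is finite.
    Q = np.zeros((n_stages, n_states, n_actions))
    # Running mean of the one-step reward per (stage, state, action),
    # used to form the one-step variance penalty (r - mean)^2.
    mean_r = np.zeros((n_stages, n_states, n_actions))
    counts = np.zeros((n_stages, n_states, n_actions))

    for _ in range(episodes):
        s = env.reset()
        for t in range(n_stages):
            # Epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[t, s]))

            s_next, r, done = env.step(a)

            # Update the running mean reward used in the variance estimate.
            counts[t, s, a] += 1
            mean_r[t, s, a] += (r - mean_r[t, s, a]) / counts[t, s, a]

            # Penalize the immediate reward by its one-step variance.
            penalized_r = r - lam * (r - mean_r[t, s, a]) ** 2

            # Finite-horizon target: no continuation value at the last stage.
            target = penalized_r
            if t + 1 < n_stages and not done:
                target += np.max(Q[t + 1, s_next])

            Q[t, s, a] += alpha * (target - Q[t, s, a])

            if done:
                break
            s = s_next
    return Q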
Recommended Citation
A. Gosavi, "Finite Horizon Markov Control with One-step Variance Penalties," 2010 48th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2010, pp. 1355 - 1359, article no. 5707071, Institute of Electrical and Electronics Engineers, Dec 2010.
The definitive version is available at https://doi.org/10.1109/ALLERTON.2010.5707071
Department(s)
Engineering Management and Systems Engineering
International Standard Book Number (ISBN)
978-1-4244-8214-6
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2024 Institute of Electrical and Electronics Engineers, All rights reserved.
Publication Date
01 Dec 2010