Abstract

Variance-penalized Markov decision processes (MDPs) for an infinite time horizon have been studied in the literature for asymptotic and one-step variance; in these models, the objective function is generally the expected long-run reward minus a constant times the variance, where variance is used as a measure of risk. For the finite time horizon, asymptotic variance has been considered in Collins [1], but this model accounts for only a terminal reward, i.e., reward is earned at the end of the time horizon. In this paper, we seek to develop a framework for one-step variance in the finite time horizon in which rewards can be non-zero in every state. We develop a solution algorithm based on the stochastic shortest path algorithm of Bertsekas and Tsitsiklis [2]. We also present a Q-Learning algorithm for a simulation-based scenario, which applies in the absence of the transition probability model, along with some preliminary convergence results. ©2010 IEEE.
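The abstract does not spell out the update rule, so the following is only a minimal sketch of what a simulation-based, variance-penalized Q-Learning step for a finite horizon could look like, assuming a one-step variance penalty of the form r - theta*(r - rho)^2 folded into a tabular finite-horizon Q-Learning update; all names (theta, rho, horizon, the toy transition and reward samples) are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: assumes a one-step variance penalty
# r - theta * (r - rho)**2 plugged into finite-horizon tabular Q-Learning.
# It is not the paper's algorithm; parameters and the toy model are made up.
import numpy as np

def penalized_reward(r, rho, theta):
    """Immediate reward minus theta times its squared deviation from rho."""
    return r - theta * (r - rho) ** 2

def q_update(Q, t, s, a, r, s_next, rho, theta, alpha, horizon):
    """One finite-horizon Q-Learning step on the variance-penalized reward."""
    target = penalized_reward(r, rho, theta)
    if t + 1 < horizon:                      # no continuation value at the horizon
        target += np.max(Q[t + 1, s_next])
    Q[t, s, a] += alpha * (target - Q[t, s, a])
    return Q

# Toy usage on a 2-state, 2-action problem with random transitions and rewards,
# standing in for a simulator when no transition probability model is available.
horizon, n_states, n_actions = 5, 2, 2
Q = np.zeros((horizon, n_states, n_actions))
rng = np.random.default_rng(0)
for episode in range(100):
    s = 0
    for t in range(horizon):
        a = int(rng.integers(n_actions))
        s_next = int(rng.integers(n_states))   # stand-in for the unknown transition model
        r = float(rng.normal(1.0, 0.5))        # stand-in reward sample
        Q = q_update(Q, t, s, a, r, s_next,
                     rho=1.0, theta=0.1, alpha=0.1, horizon=horizon)
        s = s_next
```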

Department(s)

Engineering Management and Systems Engineering

International Standard Book Number (ISBN)

978-142448214-6

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Rights

© 2024 Institute of Electrical and Electronics Engineers, All rights reserved.

Publication Date

01 Dec 2010
