Abstract
Q-Learning is based on value iteration and remains the most popular choice for solving Markov Decision Problems (MDPs) via reinforcement learning (RL), where the goal is to bypass the transition probabilities of the MDP. Approximate policy iteration (API) is another RL technique, not as widely used as Q-Learning, based on modified policy iteration. In this paper, we present and analyze an API algorithm for discounted reward based on (i) the classical temporal differences update for policy evaluation and (ii) simulation-based mean estimation for policy improvement. Further, we analyze the convergence of API algorithms based on Q-factors for (i) discounted reward and (ii) average reward MDPs. The average reward algorithm is based on relative value iteration; we also present results from some numerical experiments with it. © 2012 Published by Elsevier B.V.
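To illustrate the style of algorithm the abstract describes, below is a minimal Python sketch of approximate policy iteration for a discounted-reward MDP, with (i) a classical TD(0) update for policy evaluation and (ii) simulation-based mean estimation of one-step Q-values for policy improvement. It is not the paper's algorithm or experiments: the two-state MDP, transition/reward numbers, step counts, and learning rate are illustrative assumptions.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used only for illustration;
# P[a][s, s'] are transition probabilities, R[a][s, s'] immediate rewards.
P = {0: np.array([[0.7, 0.3], [0.4, 0.6]]),
     1: np.array([[0.9, 0.1], [0.2, 0.8]])}
R = {0: np.array([[6.0, -5.0], [7.0, 12.0]]),
     1: np.array([[10.0, 17.0], [-14.0, 13.0]])}
GAMMA = 0.8                      # discount factor (assumed)
STATES, ACTIONS = [0, 1], [0, 1]
rng = np.random.default_rng(0)

def step(s, a):
    """Simulate one transition from state s under action a."""
    s_next = rng.choice(STATES, p=P[a][s])
    return s_next, R[a][s, s_next]

def td0_evaluate(policy, episodes=200, horizon=200, alpha=0.01):
    """Policy evaluation with the classical TD(0) update."""
    V = np.zeros(len(STATES))
    for _ in range(episodes):
        s = rng.choice(STATES)
        for _ in range(horizon):
            s_next, r = step(s, policy[s])
            V[s] += alpha * (r + GAMMA * V[s_next] - V[s])  # TD(0) update
            s = s_next
    return V

def improve(V, samples=500):
    """Policy improvement via simulation-based mean estimation of
    Q(s, a) = E[r + gamma * V(s')]; pick the greedy action in each state."""
    policy = {}
    for s in STATES:
        q_estimates = []
        for a in ACTIONS:
            returns = []
            for _ in range(samples):
                s_next, r = step(s, a)
                returns.append(r + GAMMA * V[s_next])
            q_estimates.append(np.mean(returns))
        policy[s] = int(np.argmax(q_estimates))
    return policy

# Approximate policy iteration loop (discounted reward).
policy = {s: 0 for s in STATES}
for k in range(5):
    V = td0_evaluate(policy)
    policy = improve(V)
    print(f"iteration {k}: V = {np.round(V, 2)}, policy = {policy}")
```

The average-reward variant mentioned in the abstract (Q-P-Learning based on relative value iteration) would replace the discounted TD target with a relative-value update; that version is not sketched here.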
Recommended Citation
A. Gosavi, "Approximate Policy Iteration for Markov Control Revisited," Procedia Computer Science, vol. 12, pp. 90 - 95, Elsevier, Jan 2012.
The definitive version is available at https://doi.org/10.1016/j.procs.2012.09.036
Department(s)
Engineering Management and Systems Engineering
Publication Status
Open Access
Keywords and Phrases
Approximate policy iteration; Average reward; Q-P-Learning; Relative value iteration
International Standard Serial Number (ISSN)
1877-0509
Document Type
Article - Conference proceedings
Document Version
Final Version
File Type
text
Language(s)
English
Rights
© 2024 Elsevier, All rights reserved.
Creative Commons Licensing
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Publication Date
01 Jan 2012