Abstract

Q-Learning is based on value iteration and remains the most popular choice for solving Markov Decision Problems (MDPs) via reinforcement learning (RL), where the goal is to bypass the transition probabilities of the MDP. Approximate policy iteration (API), based on modified policy iteration, is another RL technique, though not as widely used as Q-Learning. In this paper, we present and analyze an API algorithm for discounted reward based on (i) the classical temporal differences update for policy evaluation and (ii) simulation-based mean estimation for policy improvement. Further, we analyze the convergence of API algorithms based on Q-factors for (i) discounted reward and (ii) average reward MDPs. The average reward algorithm is based on relative value iteration; we also present results from some numerical experiments with it. © 2012 Published by Elsevier B.V.
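
For context, the abstract contrasts API with classical Q-Learning, whose update is a sample-based form of value iteration that requires no transition probabilities. The sketch below is a minimal tabular illustration of that classical update for a discounted-reward MDP, not a reproduction of the paper's algorithms; the `env` object with `reset()` and `step()` methods, and all parameter values, are assumptions made for illustration.

```python
import random

def q_learning(env, n_states, n_actions, gamma=0.95, alpha=0.1,
               epsilon=0.1, episodes=5000):
    """Tabular Q-Learning for a discounted-reward MDP (illustrative sketch).

    Assumes a hypothetical environment exposing reset() -> state and
    step(state, action) -> (next_state, reward, done).
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s_next, r, done = env.step(s, a)
            # Sample-based value iteration update: no transition
            # probabilities of the MDP are needed.
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```

In the spirit of relative value iteration for average reward, a variant of this update would replace discounting with the subtraction of the Q-factor of a fixed reference state-action pair from the target; the paper's Q-factor-based API (Q-P-Learning) algorithms themselves are not reproduced here.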

Department(s)

Engineering Management and Systems Engineering

Publication Status

Open Access

Keywords and Phrases

Approximate policy iteration; Average reward; Q-P-Learning; Relative value iteration

International Standard Serial Number (ISSN)

1877-0509

Document Type

Article - Conference proceedings

Document Version

Final Version

File Type

text

Language(s)

English

Rights

© 2024 Elsevier. All rights reserved.

Creative Commons Licensing

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Publication Date

01 Jan 2012
