Estimation and control in finite Markov decision processes with the average reward criterion
Applicationes Mathematicae 31 (2004), 127-154
MSC: Primary 90C40, 93E20; Secondary 60J05.
DOI: 10.4064/am31-2-1
Abstract
This work concerns Markov decision chains with finite state and action sets. The transition law satisfies the simultaneous Doeblin condition but is unknown to the controller, and the problem of determining an optimal adaptive policy with respect to the average reward criterion is addressed. A class of policies is identified such that, when the system evolves under a policy in this class, the frequency estimators of the transition law are consistent on an essential set of admissible state-action pairs; the non-stationary value iteration scheme is then used to select an optimal adaptive policy within this class.
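To illustrate the two ingredients named above, the following minimal Python sketch shows, under simplifying assumptions, how an unknown transition law can be estimated by empirical frequencies while actions are chosen from a value-iteration update driven by the current estimate. The 3-state, 2-action model, the exploration rate epsilon, and all identifiers (P_true, counts, estimated_law) are hypothetical, and the update shown is ordinary relative value iteration rather than the authors' non-stationary scheme.

```python
# A minimal sketch, under simplifying assumptions: the controller observes
# transitions of a small hypothetical MDP, estimates the transition law by
# empirical frequencies, and acts (mostly) greedily with respect to a
# relative value-iteration update computed from the current estimate.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# True transition law P_true[a, s, s'] and rewards r[s, a]; unknown to the controller.
P_true = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.uniform(size=(n_states, n_actions))

# Frequency estimator: counts of observed (s, a, s') transitions,
# initialized at one so that every estimated row is a probability vector.
counts = np.ones((n_actions, n_states, n_states))

def estimated_law():
    return counts / counts.sum(axis=2, keepdims=True)

V = np.zeros(n_states)   # relative value function
s = 0
epsilon = 0.2            # occasional exploration keeps state-action pairs visited

for t in range(20_000):
    P_hat = estimated_law()
    # Q[s, a] = r[s, a] + sum_{s'} P_hat[a, s, s'] V[s']
    Q = r + P_hat.transpose(1, 0, 2) @ V
    V = Q.max(axis=1)
    V -= V[0]            # normalize at a reference state (relative value iteration)

    # Mostly greedy action, with epsilon-exploration.
    a = int(Q[s].argmax()) if rng.random() > epsilon else int(rng.integers(n_actions))
    s_next = int(rng.choice(n_states, p=P_true[a, s]))
    counts[a, s, s_next] += 1   # update the frequency estimator
    s = s_next

print("max estimation error:", np.abs(estimated_law() - P_true).max())
```

In this toy setting, consistency of the frequency estimator rests on every state-action pair being visited infinitely often, which is roughly the role played in the paper by the essential set of admissible state-action pairs and the restricted class of policies.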