




Reinforcement Learning with Gaussian Processes

Yaakov Engel yaki@cs.ualberta.ca
Dept. of Computing Science, University of Alberta, Edmonton, Canada
Shie Mannor shie@ece.mcgill.ca
Dept. of Electrical and Computer Engineering, McGill University, Montreal, Canada
Ron Meir rmeir@ee.technion.ac.il
Dept. of Electrical Engineering, Technion Institute of Technology, Haifa 32000, Israel
Gaussian Process Temporal Difference (GPTD) learning offers a Bayesian solution to the policy evaluation problem of reinforcement learning. In this paper we extend the GPTD framework by addressing two pressing issues, which were not adequately treated in the original GPTD paper (Engel et al., 2003). The first is the issue of stochasticity in the state transitions, and the second is concerned with action selection and policy improvement. We present a new generative model for the value function, deduced from its relation with the discounted return. We derive a corresponding on-line algorithm for learning the posterior moments of the value Gaussian process. We also present a SARSA based extension of GPTD, termed GPSARSA, that allows the selection of actions and the gradual improvement of policies without requiring a world-model.
In Engel et al. (2003) the use of Gaussian Processes (GPs) for solving the Reinforcement Learning (RL) problem of value estimation was introduced. Since GPs belong to the family of kernel machines, they bring into RL the high, and quickly growing, representational flexibility of kernel based representations, allowing them to deal with almost any conceivable object of interest, from text documents and DNA sequence data to probability distributions, trees and graphs, to mention just a few (see Schölkopf & Smola, 2002, and references therein). Moreover, the use of Bayesian
reasoning with GPs allows one to obtain not only value estimates, but also estimates of the uncertainty in the value, and this in large and even infinite MDPs. However, both the probabilistic generative model and the corresponding Gaussian Process Temporal Differences (GPTD) algorithm proposed in Engel et al. (2003) had two major shortcomings. First, the original model is strictly correct only if the state transitions of the underlying Markov Decision Process (MDP) are deterministic (or if the discount factor is zero, in which case the model degenerates to simple GP regression), and if the rewards are corrupted by white Gaussian noise. While the second assumption is relatively innocuous, the first is a serious handicap to the applicability of the GPTD model to general MDPs. Secondly, in RL what we are really after is an optimal, or at least a good suboptimal, action selection policy. Many algorithms for solving this problem are based on the Policy Iteration method, in which the value function must be estimated for a sequence of fixed policies, making value estimation, or policy evaluation, a crucial algorithmic component. Since the GPTD algorithm only addresses the value estimation problem, we need to modify it somehow if we wish to solve the complete RL problem.

One possible heuristic modification, demonstrated in Engel et al. (2003), is the use of Optimistic Policy Iteration (OPI) (Bertsekas & Tsitsiklis, 1996). In OPI the learning agent utilizes a model of its environment and its current value estimate to guess the expected payoff for each of the actions available to it at each time step. It then greedily (or ε-greedily) chooses the highest ranking action. Clearly, OPI may be used only when a good model of the MDP is available to the agent. However, assuming that such a model is available as prior knowledge is a rather strong assumption inapplicable in many domains, while estimating such a model on-the-fly, especially when the state transitions
are stochastic, may be prohibitively expensive. In either case, computing the expectations involved in ranking the actions may itself be prohibitively costly. Another possible modification, one that does not require a model, is to estimate state-action values, or Q-values, using an algorithm such as Sutton's SARSA (Sutton & Barto, 1998).
The first contribution of this paper is a modification of the original GPTD model that allows it to learn value and value-uncertainty estimates in general MDPs, allowing for stochasticity in both transitions and rewards. Drawing inspiration from Sutton's SARSA algorithm, our second contribution is GPSARSA, an extension of the GPTD algorithm for learning a Gaussian distribution over state-action values, thus allowing us to perform model-free policy improvement.
Let us introduce some definitions to be used in the sequel. A Markov Decision Process (MDP) is a tuple (X, U, R, p), where X and U are the state and action spaces, respectively; R : X → ℝ is the immediate reward, which may be random, in which case q(·|x) denotes the distribution of rewards at the state x; and p : X × U × X → [0, 1] is the transition distribution, which we assume is stationary. A stationary policy μ : X × U → [0, 1] is a mapping from states to action selection probabilities. Given a fixed policy μ, the transition probabilities of the MDP are given by the policy-dependent state transition probability distribution p^μ(x′|x) = ∫_U du p(x′|u, x) μ(u|x). The discounted return D(x) for a state x is a random process defined by

D(x) = ∑_{i=0}^∞ γ^i R(x_i) | x_0 = x,  with x_{i+1} ∼ p^μ(·|x_i).    (2.1)

Here, γ ∈ [0, 1] is a discount factor that determines the exponential devaluation rate of delayed rewards (when γ = 1 the policy must be proper, i.e., guaranteed to terminate; see Bertsekas and Tsitsiklis, 1996). Note that the randomness in D(x_0) for any given state x_0 is due both to the stochasticity of the sequence of states that follow x_0, and to the randomness in the rewards R(x_0), R(x_1), R(x_2), . . . . We refer to this as the intrinsic randomness of the MDP. Using the stationarity of the MDP we may write
D(x) = R(x) + γD(x′),  with x′ ∼ p^μ(·|x).    (2.2)
The equality here marks an equality in the distributions of the two sides of the equation. Let us define the expectation operator E^μ as the expectation over all possible trajectories and all possible rewards collected in them. This allows us to define the value function V(x) as the result of applying this expectation operator to the discounted return D(x). Thus, applying E^μ to both sides of Eq. (2.2), and using the conditional expectation formula (Scharf, 1991), we get
V(x) = r̄(x) + γ E_{x′|x} V(x′)  ∀x ∈ X,    (2.3)
which is recognizable as the fixed-policy version of the Bellman equation (Bertsekas & Tsitsiklis, 1996).
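To make Eqs. (2.1)-(2.3) concrete, the following minimal Python sketch draws Monte-Carlo samples of the discounted return under a fixed policy; averaging such samples approximates V(x). The `step` interface (returning the sampled reward at the current state and a sampled next state) is a hypothetical stand-in for the MDP and policy, not something defined in the paper.

```python
import numpy as np

def sample_return(x, step, gamma, horizon=1000):
    """Draw one sample of the discounted return D(x) of Eq. (2.1) by rolling out
    the fixed policy.  `step(x)` is an assumed interface returning a sampled
    reward R(x) and a next state x' ~ p_mu(.|x).  The rollout is truncated at
    `horizon` steps (exact for episodic tasks that terminate earlier)."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        r, x = step(x)
        total += discount * r
        discount *= gamma
    return total

# V(x) = E[D(x)] (Eq. 2.3) may then be approximated by a sample mean, e.g.
# v_hat = np.mean([sample_return(x0, step, gamma=0.95) for _ in range(10_000)])
```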
2.1. The Value Model

The recursive definition of the discounted return (2.2) is the basis for our statistical generative model connecting values and rewards. Let us decompose the discounted return D into its mean V and a random, zero-mean residual ΔV,
D(x) = V(x) + ΔV(x),    (2.4)
where V(x) = E^μ D(x). In the classic frequentist approach V(·) is no longer random, since it is the true value function induced by the policy μ. Adopting the Bayesian methodology, we may still view the value V(·) as a random entity by assigning it additional randomness that is due to our subjective uncertainty regarding the MDP's model (p, q). We do not know what the true functions p and q are, which means that we are also uncertain about the true value function. We choose to model this additional extrinsic uncertainty by defining V(x) as a random process indexed by the state variable x. This decomposition is useful, since it separates the two sources of uncertainty inherent in the discounted return process D: For a known MDP model, V becomes deterministic and the randomness in D is fully attributed to the intrinsic randomness in the state-reward trajectory, modeled by ΔV. On the other hand, in an MDP in which both transitions and rewards are deterministic but otherwise unknown, ΔV becomes deterministic (i.e., identically zero), and the randomness in D is due solely to the extrinsic uncertainty, modeled by V. For a more thorough discussion of intrinsic and extrinsic uncertainties see Mannor et al. (2004). Substituting Eq. (2.4) into Eq. (2.2) and rearranging we get
R(x) = V(x) − γV(x′) + N(x, x′),  x′ ∼ p^μ(·|x),    (2.5)
where N(x, x′) ≜ ΔV(x) − γΔV(x′). Suppose we are provided with a trajectory x_0, x_1, . . . , x_t, sampled from the MDP under a policy μ, i.e., from the policy-dependent transition distribution p^μ.
The posterior mean and variance of the value at some point x are given, respectively, by

v̂_t(x) = k_t(x)^⊤ α_t,   p_t(x) = k(x, x) − k_t(x)^⊤ C_t k_t(x),    (2.9)

where k_t(x) = (k(x_0, x), . . . , k(x_t, x))^⊤, and

α_t = H_t^⊤ (H_t K_t H_t^⊤ + Σ_t)^{-1} r_{t−1},   C_t = H_t^⊤ (H_t K_t H_t^⊤ + Σ_t)^{-1} H_t.    (2.10)
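As a rough illustration of Eqs. (2.9)-(2.10), the sketch below computes the posterior moments in batch form with numpy. It assumes the MC-GPTD model discussed in this paper: H_t is taken to be the t × (t + 1) differencing matrix whose i-th row is (0, . . . , 1, −γ, . . . , 0), as implied by Eq. (2.5), and Σ_t = σ² H_t H_t^⊤, the correlated noise covariance noted in the concluding section; the kernel is supplied by the caller.

```python
import numpy as np

def gptd_posterior(states, rewards, kernel, gamma, sigma, x_query):
    """Batch GPTD posterior moments, Eq. (2.9)-(2.10).  `states` is the trajectory
    x_0,...,x_t (length t+1) and `rewards` is r_0,...,r_{t-1} (length t).
    Assumes H_t with rows (0,...,1,-gamma,...,0) and Sigma_t = sigma^2 H_t H_t^T."""
    t = len(rewards)
    K = np.array([[kernel(a, b) for b in states] for a in states])  # (t+1) x (t+1)
    H = np.zeros((t, t + 1))
    for i in range(t):
        H[i, i], H[i, i + 1] = 1.0, -gamma
    Sigma = sigma ** 2 * (H @ H.T)
    Q = np.linalg.inv(H @ K @ H.T + Sigma)
    alpha = H.T @ Q @ np.asarray(rewards, dtype=float)              # Eq. (2.10)
    C = H.T @ Q @ H
    k_q = np.array([kernel(x, x_query) for x in states])
    mean = float(k_q @ alpha)                                       # Eq. (2.9)
    var = float(kernel(x_query, x_query) - k_q @ C @ k_q)
    return mean, var
```

For example, with a Gaussian kernel k(x, x′) = exp(−‖x − x′‖²/2ℓ²), `gptd_posterior(states, rewards, kernel, 0.95, 1.0, x)` returns the pair (v̂_t(x), p_t(x)).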
2.2. Relation to Monte-Carlo Simulation
Consider an episode in which a terminal state is reached at time step t + 1. In this case, the last equation in our generative model should read R(x_t) = V(x_t) + N(x_t), since V(x_{t+1}) = 0. Our complete set of equations is now
R_t = H_{t+1} V_t + N_t,    (2.11)
with H_{t+1} a square (t + 1) × (t + 1) matrix, given by H_{t+1} as defined in (2.7) with its last column removed. Note that H_{t+1} is also invertible, since its determinant equals 1.
Our model's validity may be substantiated by performing a whitening transformation on Eq. (2.11). Since the noise covariance matrix Σ_t is positive definite, there exists a square matrix Z_t satisfying Z_t^⊤ Z_t = Σ_t^{-1}. Multiplying Eq. (2.11) by Z_t we then get Z_t R_t = Z_t H_{t+1} V_t + Z_t N_t. The transformed noise term Z_t N_t has a covariance matrix given by Z_t Σ_t Z_t^⊤ = Z_t (Z_t^⊤ Z_t)^{-1} Z_t^⊤ = I. Thus the transformation Z_t whitens the noise. In our case, a whitening matrix is given by
Z_t = H_{t+1}^{-1} =
  [ 1   γ   γ²   · · ·   γ^t     ]
  [ 0   1   γ    · · ·   γ^{t−1} ]
  [ ⋮                    ⋮       ]
  [ 0   0   0    · · ·   1       ]

i.e., an upper-triangular matrix whose (i, j) entry equals γ^{j−i} for j ≥ i and 0 otherwise.
The transformed model is Z_t R_t = V_t + N′_t, with white Gaussian noise N′_t = Z_t N_t ∼ N(0, σ² I). Let us look at the i-th equation (i.e., row) of this transformed model:
R(x_i) + γR(x_{i+1}) + . . . + γ^{t−i} R(x_t) = V(x_i) + N′_i,
with N′_i ∼ N(0, σ²). This is exactly the generative model we would have used had we wanted to learn the value function by performing GP regression using Monte-Carlo samples of the discounted return as our targets. The major benefit in using the GPTD formulation is that it allows us to perform exact updates of the parameters of the posterior value mean and covariance on-line.
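The claim that Z_t = H_{t+1}^{-1} turns the observed rewards into Monte-Carlo return targets is easy to check numerically. The snippet below is a small sanity check under the assumed episodic form of H_{t+1} (rows (0, . . . , 1, −γ, . . . , 0) with a final row (0, . . . , 0, 1)); it is an illustration, not part of the paper.

```python
import numpy as np

gamma, t = 0.9, 5
# Square H_{t+1} of Eq. (2.11): rows e_i - gamma * e_{i+1}, last row e_t.
H = np.eye(t + 1) - gamma * np.eye(t + 1, k=1)
Z = np.linalg.inv(H)                      # upper-triangular, entries gamma^(j-i)
rewards = np.random.randn(t + 1)
mc_returns = np.array([sum(gamma ** (j - i) * rewards[j] for j in range(i, t + 1))
                       for i in range(t + 1)])
assert np.allclose(Z @ rewards, mc_returns)   # whitened targets = discounted returns
```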
Computing the parameters α_t and C_t of the posterior moments (2.10) is computationally expensive for large samples, due to the need to store and invert a matrix of size t × t. Even when this has been performed, computing the posterior moments for every new query point requires that we multiply two t × 1 vectors for the mean, and compute a t × t quadratic form for the variance. These computational requirements are prohibitive if we are to compute value estimates on-line, as is usually required of RL algorithms. Engel et al. (2003) used an on-line kernel sparsification algorithm that is based on a view of the kernel as an inner-product in some high dimensional feature space to which raw state vectors are mapped (for completeness, we repeat here some of the details concerning this sparsification method). This sparsification method incrementally constructs a dictionary D = {x̃_1, . . . , x̃_{|D|}} of representative states. Upon observing x_t, the distance between the feature-space image of x_t and the span of the images of current dictionary members is computed. If the squared distance exceeds some positive threshold ν, x_t is added to the dictionary; otherwise, it is left out. Determining this squared distance, δ_t, involves solving a simple least-squares problem, whose solution is a |D| × 1 vector a_t of optimal approximation coefficients, satisfying

a_t = K̃_{t−1}^{-1} k̃_{t−1}(x_t),   δ_t = k(x_t, x_t) − a_t^⊤ k̃_{t−1}(x_t),    (3.12)

where k̃_t(x) = (k(x̃_1, x), . . . , k(x̃_{|D_t|}, x))^⊤ is a |D_t| × 1 vector, and K̃_t = [k̃_t(x̃_1), . . . , k̃_t(x̃_{|D_t|})] is a square |D_t| × |D_t|, symmetric, positive-definite matrix. By construction, the dictionary has the property that the feature-space images of all states encountered during learning may be approximated to within a squared error ν by the images of the dictionary members. The threshold ν may be tuned to control the sparsity of the solution. Sparsification allows kernel expansions, such as those appearing in Eq. 2.10, to be approximated by kernel expansions involving only dictionary members, by using
k_t(x) ≈ A_t k̃_t(x),   K_t ≈ A_t K̃_t A_t^⊤.    (3.13)
The t × |D_t| matrix A_t contains in its rows the approximation coefficients computed by the sparsification algorithm, i.e., A_t = [a_1, . . . , a_t]^⊤, with padding zeros placed where necessary; see Engel et al. (2003). The end result of the sparsification procedure is that the posterior value mean v̂_t and variance p_t may be compactly approximated as follows (compare to Eq. 2.9, 2.10):
v̂_t(x) = k̃_t(x)^⊤ α̃_t,   p_t(x) = k(x, x) − k̃_t(x)^⊤ C̃_t k̃_t(x),    (3.14)

where

α̃_t = H̃_t^⊤ (H̃_t K̃_t H̃_t^⊤ + Σ_t)^{-1} r_{t−1},   C̃_t = H̃_t^⊤ (H̃_t K̃_t H̃_t^⊤ + Σ_t)^{-1} H̃_t,    (3.15)

and H̃_t = H_t A_t.
The parameters that the GPTD algorithm is required to store and update in order to evaluate the posterior mean and variance are now α̃_t and C̃_t, whose dimensions are |D_t| × 1 and |D_t| × |D_t|, respectively. In many cases this results in significant computational savings, both in terms of memory and time, when compared with the exact non-sparse solution.
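For concreteness, here is a minimal sketch of the sparsification test of Eq. (3.12), building the dictionary and the coefficient matrix A_t of Eq. (3.13) for a given trajectory. The kernel and the threshold ν are supplied by the caller; for clarity K̃ is solved from scratch at every step, whereas Table 1 maintains K̃^{-1} incrementally.

```python
import numpy as np

def ald_dictionary(states, kernel, nu):
    """Approximate linear dependence test of Eq. (3.12): a state is admitted to
    the dictionary whenever its squared feature-space distance delta from the
    span of the current dictionary exceeds nu."""
    D = [states[0]]
    coeffs = [np.array([1.0])]                    # rows of A_t before zero-padding
    for x in states[1:]:
        K_tilde = np.array([[kernel(a, b) for b in D] for a in D])
        k_tilde = np.array([kernel(xd, x) for xd in D])
        a = np.linalg.solve(K_tilde, k_tilde)     # a_t of Eq. (3.12)
        delta = kernel(x, x) - a @ k_tilde
        if delta > nu:
            D.append(x)
            a = np.zeros(len(D))
            a[-1] = 1.0                           # dictionary members are represented exactly
        coeffs.append(a)
    # Zero-pad the rows to form the t x |D_t| matrix A_t of Eq. (3.13).
    A = np.zeros((len(states), len(D)))
    for i, a in enumerate(coeffs):
        A[i, :len(a)] = a
    return D, A
```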
The derivation of the recursive update formulas for the mean and covariance parameters α̃_t and C̃_t, for a new sample x_t, is rather long and tedious due to the added complication arising from the non-diagonality of the noise covariance matrix Σ_t. We therefore refer the interested reader to (Engel, 2005, Appendix A.2.3) for the complete derivation (with state dependent noise). In Table 1 we present the resulting algorithm in pseudocode.
Some insight may be gained by noticing that the term r_{t−1} − Δk̃_t^⊤ α̃_{t−1} appearing in the update for d_t is a temporal difference term. From Eq. (3.14) and the definition of Δk̃_t (see Table 1) we have r_{t−1} − Δk̃_t^⊤ α̃_{t−1} = r_{t−1} + γ v̂_{t−1}(x_t) − v̂_{t−1}(x_{t−1}). Consequently, d_t may be viewed as a linear filter driven by the temporal differences. The update of α̃_t is simply the output of this filter, multiplied by the gain vector c̃_t / s_t. The resemblance to the Kalman Filter updates is evident. It should be noted that it is indeed fortunate that the noise covariance matrix vanishes except for its three central diagonals. This relative simplicity of the noise model is the reason we were able to derive simple and efficient recursive updates, such as the ones described above.
As mentioned above, SARSA is a fairly straightforward extension of the TD algorithm (Sutton & Barto, 1998), in which state-action values are estimated, thus allowing policy improvement steps to be performed without requiring any additional knowledge on the MDP model. The idea is to use the stationary policy μ being followed in order to define a new, augmented process, the state space of which is X′ = X × U (i.e., the original state space augmented by the action space), maintaining the same reward model.
Table 1. The On-Line Monte-Carlo GPTD Algorithm

Parameters: ν, σ
Initialize: D_0 = {x_0}, K̃_0^{-1} = 1/k(x_0, x_0), a_0 = (1), α̃_0 = 0, C̃_0 = 0, c̃_0 = 0, d_0 = 0, 1/s_0 = 0
for t = 1, 2, . . .
  observe x_{t−1}, r_{t−1}, x_t
  a_t = K̃_{t−1}^{-1} k̃_{t−1}(x_t)
  δ_t = k(x_t, x_t) − k̃_{t−1}(x_t)^⊤ a_t
  Δk̃_t = k̃_{t−1}(x_{t−1}) − γ k̃_{t−1}(x_t)
  d_t = (γσ²/s_{t−1}) d_{t−1} + r_{t−1} − Δk̃_t^⊤ α̃_{t−1}
  if δ_t > ν
    K̃_t^{-1} = (1/δ_t) [ δ_t K̃_{t−1}^{-1} + a_t a_t^⊤ , −a_t ; −a_t^⊤ , 1 ]
    a_t = (0, . . . , 1)^⊤
    h̃_t = (a_{t−1}^⊤, −γ)^⊤
    Δk_tt = a_{t−1}^⊤ (k̃_{t−1}(x_{t−1}) − 2γ k̃_{t−1}(x_t)) + γ² k(x_t, x_t)
    c̃_t = (γσ²/s_{t−1}) ( c̃_{t−1} ; 0 ) + h̃_t − ( C̃_{t−1} Δk̃_t ; 0 )
    s_t = (1 + γ²)σ² + Δk_tt − Δk̃_t^⊤ C̃_{t−1} Δk̃_t + (2γσ²/s_{t−1}) c̃_{t−1}^⊤ Δk̃_t − γ²σ⁴/s_{t−1}
    α̃_{t−1} = ( α̃_{t−1} ; 0 )
    C̃_{t−1} = [ C̃_{t−1} , 0 ; 0^⊤ , 0 ]
  else
    h̃_t = a_{t−1} − γ a_t
    Δk_tt = h̃_t^⊤ Δk̃_t
    c̃_t = (γσ²/s_{t−1}) c̃_{t−1} + h̃_t − C̃_{t−1} Δk̃_t
    s_t = (1 + γ²)σ² + Δk̃_t^⊤ ( c̃_t + (γσ²/s_{t−1}) c̃_{t−1} ) − γ²σ⁴/s_{t−1}
  end if
  α̃_t = α̃_{t−1} + (c̃_t / s_t) d_t
  C̃_t = C̃_{t−1} + (1/s_t) c̃_t c̃_t^⊤
end for
return D_t, α̃_t, C̃_t
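The following is a rough numpy transcription of the updates of Table 1, as reconstructed above. It is a sketch, not the authors' implementation: the RBF kernel, the class and method names are illustrative assumptions, and the episodic terminal-state correction of Section 2.2 is omitted.

```python
import numpy as np

def rbf_kernel(x, y, ell=1.0):
    # Placeholder kernel; any positive-definite kernel k(x, x') may be substituted.
    x, y = np.atleast_1d(x).astype(float), np.atleast_1d(y).astype(float)
    return float(np.exp(-0.5 * np.sum((x - y) ** 2) / ell ** 2))

class OnlineMCGPTD:
    """Sketch of the on-line Monte-Carlo GPTD updates of Table 1.  Names follow
    the paper: D (dictionary), Kinv (K~^-1), alpha (alpha~), C (C~), c (c~),
    d, 1/s, and a (coefficients of the previous state on the dictionary)."""

    def __init__(self, x0, gamma, sigma, nu, kernel=rbf_kernel):
        self.gamma, self.s2, self.nu, self.k = gamma, sigma ** 2, nu, kernel
        self.D = [x0]
        self.Kinv = np.array([[1.0 / kernel(x0, x0)]])
        self.a = np.array([1.0])
        self.alpha = np.zeros(1)
        self.C = np.zeros((1, 1))
        self.c = np.zeros(1)
        self.d = 0.0
        self.sinv = 0.0                       # 1/s_0 = 0

    def _kvec(self, x):
        return np.array([self.k(xd, x) for xd in self.D])

    def observe(self, x_prev, r_prev, x):
        g, s2 = self.gamma, self.s2
        kp, kx = self._kvec(x_prev), self._kvec(x)
        a_new = self.Kinv @ kx
        delta = self.k(x, x) - kx @ a_new
        dk = kp - g * kx                      # Delta k~_t
        self.d = g * s2 * self.sinv * self.d + r_prev - dk @ self.alpha

        if delta > self.nu:                   # dictionary expansion branch
            n = len(self.D)
            Kinv = np.zeros((n + 1, n + 1))
            Kinv[:n, :n] = self.Kinv + np.outer(a_new, a_new) / delta
            Kinv[:n, n] = -a_new / delta
            Kinv[n, :n] = -a_new / delta
            Kinv[n, n] = 1.0 / delta
            self.Kinv = Kinv
            h = np.append(self.a, -g)
            dktt = self.a @ (kp - 2 * g * kx) + g ** 2 * self.k(x, x)
            c_new = (g * s2 * self.sinv * np.append(self.c, 0.0)
                     + h - np.append(self.C @ dk, 0.0))
            s = ((1 + g ** 2) * s2 + dktt - dk @ self.C @ dk
                 + 2 * g * s2 * self.sinv * (self.c @ dk)
                 - g ** 2 * s2 ** 2 * self.sinv)
            self.alpha = np.append(self.alpha, 0.0)
            C = np.zeros((n + 1, n + 1))
            C[:n, :n] = self.C
            self.C = C
            self.D.append(x)
            a_new = np.zeros(n + 1)
            a_new[n] = 1.0                    # x_t is now a dictionary member
        else:                                 # dictionary unchanged
            h = self.a - g * a_new
            c_new = g * s2 * self.sinv * self.c + h - self.C @ dk
            s = ((1 + g ** 2) * s2
                 + dk @ (c_new + g * s2 * self.sinv * self.c)
                 - g ** 2 * s2 ** 2 * self.sinv)

        self.alpha = self.alpha + (c_new / s) * self.d
        self.C = self.C + np.outer(c_new, c_new) / s
        self.c, self.sinv, self.a = c_new, 1.0 / s, a_new

    def value(self, x):                       # posterior mean and variance, Eq. (3.14)
        kt = self._kvec(x)
        return kt @ self.alpha, self.k(x, x) - kt @ self.C @ kt
```

A typical usage pattern would be to construct `OnlineMCGPTD(x0, gamma=0.95, sigma=1.0, nu=1e-3)`, call `observe(x_prev, r_prev, x)` for every observed transition, and query `value(x)` for the posterior mean and variance at any state.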
This augmented process is Markovian with transition probabilities p′(x′, u′|x, u) = p^μ(x′|x) μ(u′|x′). SARSA is simply the TD algorithm applied to this new process. The same reasoning may be applied to derive a GPSARSA algorithm from the GPTD algorithm. All we need is to define a covariance kernel function over state-action pairs, i.e., k : (X × U) × (X × U) → ℝ. Since states and actions are different entities it makes sense to decompose k into a state-kernel k_x and an action-kernel k_u: k(x, u, x′, u′) = k_x(x, x′) k_u(u, u′). If both k_x and k_u are kernels we know that k is also a legitimate kernel (Schölkopf & Smola, 2002), and just as the state-kernel codes our prior beliefs concerning correlations between the values of different states, so should the action-kernel code our prior beliefs on value correlations between different actions. All that remains now is to run GPTD on the augmented state-reward sequence, using the new state-action kernel.
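A minimal sketch of the factored state-action kernel, together with greedy action selection from the posterior Q-mean over a finite candidate set, is given below. The helper names and the finite action set are illustrative assumptions (GPSARSA itself does not require a finite action set).

```python
import numpy as np

def state_action_kernel(kx, ku):
    """Product kernel k((x,u),(x',u')) = kx(x,x') * ku(u,u') over state-action pairs."""
    return lambda su, tv: kx(su[0], tv[0]) * ku(su[1], tv[1])

def greedy_action(q_posterior, x, candidate_actions):
    """Choose the action maximizing the posterior Q-mean at state x.
    `q_posterior((x, u))` is assumed to return (mean, variance), e.g. the value()
    method of the OnlineMCGPTD sketch above run on state-action pairs."""
    means = [q_posterior((x, u))[0] for u in candidate_actions]
    return candidate_actions[int(np.argmax(means))]
```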
Figure 1. The posterior value mean (left), posterior value variance (center), and the corresponding greedy policy (right), for the maze shown here, after 200 learning episodes. The goal is at the bottom left.
Figure 2. MC-GPTD compared with the original GPTD on the random-walk MDP. The mean error after 400 episodes for each algorithm is plotted against Pr(right).
We then computed the root-mean-squared error (RMSE) between the resulting value estimates and the true value function (which is easily computed). In all of the experiments (except when Pr(right) = 1, as explained below), the error of the original GPTD converged well before the 100 episode mark. The results of this experiment are shown in Fig. 2. Both algorithms converge to the same low error for Pr(right) = 1 (i.e. a deterministic policy), with almost identical learning curves (not shown). However, as Pr(right) is lowered, making the transitions stochastic, the original GPTD algorithm converges to inconsistent value estimates, whereas MC-GPTD produces consistent results. This bias was already observed and reported in Engel et al. (2003) as the "dampening" of the value estimates.
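The full experimental setup is not reproduced in this excerpt, but the "easily computed" true value function of a simple random-walk chain can be obtained by solving the fixed-policy Bellman equation (2.3) directly, as in the sketch below. The chain parameters (number of states, boundary behaviour, reward vector) are assumptions for illustration, not the paper's exact MDP.

```python
import numpy as np

def chain_true_values(n_states, p_right, gamma, mean_rewards):
    """Exact values of an assumed n-state random-walk chain: the agent moves right
    with probability p_right and left otherwise, with moves off the ends treated
    as self-transitions.  Solves V = r_bar + gamma * P V (Eq. 2.3)."""
    P = np.zeros((n_states, n_states))
    for i in range(n_states):
        P[i, min(i + 1, n_states - 1)] += p_right
        P[i, max(i - 1, 0)] += 1.0 - p_right
    r_bar = np.asarray(mean_rewards, dtype=float)
    return np.linalg.solve(np.eye(n_states) - gamma * P, r_bar)

def rmse(v_hat, v_true):
    return float(np.sqrt(np.mean((np.asarray(v_hat) - np.asarray(v_true)) ** 2)))
```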
In Engel et al. (2003) GPTD was presented as an algorithm for learning a posterior distribution over value functions, for MDPs with stochastic rewards, but deterministic transitions. This was done by connecting the hidden values and the observed rewards by a generative model of the form of Eq. 2.8, in which the covariance of the noise term was diagonal: Σ_t = σ² I. In the present paper we derived a second GPTD algorithm that overcomes the limitation of the first algorithm to deterministic transitions. We did this by invoking a useful decomposition of the discounted return random process into the value process, modeling our uncertainty concerning the MDP's model, and a zero-mean residual (2.4), modeling the MDP's intrinsic stochasticity; and by additionally assuming independence of the residuals. Surprisingly, all that this amounts to is the replacement of the diagonal noise covariance, employed in Engel et al. (2003), with a tridiagonal, correlated noise covariance: Σ_t = σ² H_t H_t^⊤. This change induces a model that we have shown to be effectively equivalent to GP regression of Monte-Carlo samples of the discounted return. This should help uncover the implicit assumption underlying some of the most prevalent MC value estimation methods (e.g., TD(1) and LSTD(1)), namely, that the samples of the discounted return used are IID. Although in most realistic problems this assumption is clearly wrong, we nevertheless know that MC estimates, although not necessarily optimal, are asymptotically consistent.

We are therefore inclined to adopt a broader view of GPTD as a general GP-based framework for Bayesian modeling of value functions, encompassing all generative models of the form R = H_t V + N, with H_t given by (2.7), a Gaussian prior placed on V, and an arbitrary zero-mean Gaussian noise process N. No doubt, most such models will be meaningless from a value estimation point of view, while others will not admit efficient recursive algorithms for computing the posterior value moments. However, if the noise covariance Σ_t is suitably chosen, and if it is additionally simple
in some way, we may be able to derive such a recursive algorithm to compute complete posterior value distributions on-line. For instance, it turns out that by employing alternative forms of noise covariance, we are able to obtain GP-based variants of LSTD(λ) (Engel, 2005).
The second contribution of this paper is the extension of GPTD to the estimation of state-action values, or Q-values, leading to the GPSARSA algorithm. Learning Q-values makes the task of policy improvement in the absence of a transition model tenable, even when the action space is continuous, as demonstrated by the example in Section 5. The availability of confidence intervals for Q-values significantly expands the repertoire of possible exploration strategies. In finite MDPs, strategies employing such confidence intervals have been experimentally shown to perform more efficiently than conventional ε-greedy or Boltzmann sampling strategies, e.g., Kaelbling (1993); Dearden et al. (1998); Even-Dar et al. (2003). GPSARSA allows such methods to be applied to infinite MDPs, and it remains to be seen whether significant improvements can be so achieved for realistic problems with continuous state and action spaces.
In Rasmussen and Kuss (2004) an alternative approach to employing GPs in RL is proposed. The approach in that paper is fundamentally different from the generative approach of the GPTD framework. In Rasmussen and Kuss (2004) one GP is used to learn the MDP's transition model, while another is used to estimate the value. This leads to an inherently off-line algorithm, which is not capable of interacting with the controlled system directly and updating its estimates as additional data arrive. There are several other shortcomings that limit the usefulness of that framework. First, the state dynamics is assumed to be factored, in the sense that each state coordinate is assumed to evolve in time independently of all others. This is a rather strong assumption that is not likely to be satisfied in many real problems. Moreover, it is also assumed that the reward function is completely known in advance, and is of a very special form, either polynomial or Gaussian. Finally, the covariance kernels used are also restricted to be either polynomial or Gaussian or a mixture of the two, due to the need to integrate over products of GPs. This considerably diminishes the appeal of employing GPs, since one of the main reasons for using them, and kernel methods in general, is the richness of expression inherent in the ability to construct arbitrary kernels, reflecting domain and problem-specific knowledge, and defined over sets of diverse objects, such as text documents and DNA sequences (to name only two), and not just
points in metric space. Preliminary results on a high dimensional control task, in which GPTD is used in learning to control a simulated robotic "Octopus" arm, with 88 state variables, suggest that the kernel-based GPTD framework is not limited to low dimensional domains such as those experimented with here (Engel et al., 2005). Standing challenges for future work include balancing exploration and exploitation in RL using the value confidence intervals provided by GPTD methods; further exploring the space of GPTD models by considering additional noise covariance structures; application of the GPTD methodology to POMDPs; creating a GP-Actor-Critic architecture; GPQ-Learning for off-policy learning of the optimal policy; and analyzing the convergence properties of GPTD.
Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.
Dearden, R., Friedman, N., & Russell, S. (1998). Bayesian Q-learning. Proc. of the Fifteenth National Conference on Artificial Intelligence.
Engel, Y. (2005). Algorithms and Representations for Reinforcement Learning. Doctoral dissertation, The Hebrew University of Jerusalem. www.cs.ualberta.ca/∼yaki.
Engel, Y., Mannor, S., & Meir, R. (2003). Bayes meets Bellman: The Gaussian process approach to temporal difference learning. Proc. of the 20th International Conference on Machine Learning.
Engel, Y., Szabo, P., & Volkinstein, D. (2005). Learning to control an Octopus arm using Gaussian process temporal difference learning. www.cs.ualberta.ca/∼yaki/reports/octopus.pdf.
Even-Dar, E., Mannor, S., & Mansour, Y. (2003). Action elimination and stopping conditions for reinforcement learning. Proc. of the 20th International Conference on Machine Learning.
Kaelbling, L. P. (1993). Learning in Embedded Systems. MIT Press.
Mannor, S., Simester, D., Sun, P., & Tsitsiklis, J. (2004). Bias and variance in value function estimation. Proc. of the 21st International Conference on Machine Learning.
Rasmussen, C., & Kuss, M. (2004). Gaussian processes in reinforcement learning. Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press.
Scharf, L. (1991). Statistical Signal Processing. Addison-Wesley.
Schölkopf, B., & Smola, A. (2002). Learning with Kernels. Cambridge, MA: MIT Press.
Sutton, R., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.