A Convergent on Algorithm for Off Policy Temporal Difference Learning with Linear Function Approximation | ECE 692 | Lab Reports Electrical and Electronics Engineering

A Convergent O(n) Algorithm

for Off-policy Temporal-difference Learning

with Linear Function Approximation

Richard S. Sutton, Csaba Szepesv´

ari∗

, Hamid Reza Maei

Reinforcement Learning and Artificial Intelligence Laboratory

Department of Computing Science

University of Alberta

Edmonton, Alberta, Canada T6G 2E8

Abstract

We introduce the first temporal-difference learning algorithm that is stable with

linear function approximation and off-policy training, for any finite Markov de-

cision process, behavior policy, and target policy, and whose complexity scales

linearly in the number of parameters. We consider an i.i.d. policy-evaluation set-

ting in which the data need not come from on-policy experience. The gradient

temporal-difference (GTD) algorithm estimates the expected update vector of the

TD(0) algorithm and performs stochastic gradient descent on its L2norm. We

prove that this algorithm is stable and convergent under the usual stochastic ap-

proximation conditions to the same least-squares solution as found by the LSTD,

but without LSTD’s quadratic computational complexity. GTD is online and in-

cremental, and does not involve multiplying by products of likelihood ratios as in

importance-sampling methods.

1 Off-policy learning methods

Off-policy methods have an important role to play in the larger ambitions of modern reinforcement

learning. In general, updates to a statistic of a dynamical process are said to be “off-policy” if

their distribution does not match the dynamics of the process, particularly if the mismatch is due

to the way actions are chosen. The prototypical example in reinforcement learning is the learning

of the value function for one policy, the target policy, using data obtained while following another

policy, the behavior policy. For example, the popular Q-learning algorithm (Watkins 1989) is an off-

policy temporal-difference algorithm in which the target policy is greedy with respect to estimated

action values, and the behavior policy is something more exploratory, such as a corresponding -

greedy policy. Off-policy methods are also critical to reinforcement-learning-based efforts to model

human-level world knowledge and state representations as predictions of option outcomes (e.g.,

Sutton, Precup & Singh 1999; Sutton, Rafols & Koop 2006).

Unfortunately, off-policy methods such as Q-learning are not sound when used with approximations

that are linear in the learned parameters—the most popular form of function approximation in rein-

forcement learning. Counterexamples have been known for many years (e.g., Baird 1995) in which

Q-learning’s parameters diverge to infinity for any positive step size. This is a severe problem in

so far as function approximation is widely viewed as necessary for large-scale applications of rein-

forcement learning. The need is so great that practitioners have often simply ignored the problem

and continued to use Q-learning with linear function approximation anyway. Although no instances

∗Csaba Szepesv´

ari is on leave from MTA SZTAKI.

A Convergent on Algorithm for Off Policy Temporal Difference Learning with Linear Function Approximation | ECE 692, Lab Reports of Electrical and Electronics Engineering

Related documents

Partial preview of the text

Download A Convergent on Algorithm for Off Policy Temporal Difference Learning with Linear Function Approximation | ECE 692 and more Lab Reports Electrical and Electronics Engineering in PDF only on Docsity!