




Richard S. Sutton, Csaba Szepesvári∗, Hamid Reza Maei
Reinforcement Learning and Artificial Intelligence Laboratory
Department of Computing Science
University of Alberta
Edmonton, Alberta, Canada T6G 2E

∗Csaba Szepesvári is on leave from MTA SZTAKI.
We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without LSTD's quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.
Off-policy methods have an important role to play in the larger ambitions of modern reinforcement learning. In general, updates to a statistic of a dynamical process are said to be "off-policy" if their distribution does not match the dynamics of the process, particularly if the mismatch is due to the way actions are chosen. The prototypical example in reinforcement learning is the learning of the value function for one policy, the target policy, using data obtained while following another policy, the behavior policy. For example, the popular Q-learning algorithm (Watkins 1989) is an off-policy temporal-difference algorithm in which the target policy is greedy with respect to estimated action values, and the behavior policy is something more exploratory, such as a corresponding ε-greedy policy. Off-policy methods are also critical to reinforcement-learning-based efforts to model human-level world knowledge and state representations as predictions of option outcomes (e.g., Sutton, Precup & Singh 1999; Sutton, Rafols & Koop 2006).
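For concreteness, here is a minimal tabular sketch (Python with NumPy; names and shapes are ours, not the paper's) of the off-policy pairing just described: an ε-greedy behavior policy generates the actions, while the max in the update plays the role of the greedy target policy. The divergence issue discussed below arises only when such updates are combined with function approximation, not in this tabular form.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Behavior policy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # exploratory action
    return int(np.argmax(q_row))               # greedy action

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular Q-learning: the max over next actions is the (greedy) target policy,
    regardless of how the behavior action a was chosen -- hence off-policy."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```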
Unfortunately, off-policy methods such as Q-learning are not sound when used with approximations that are linear in the learned parameters—the most popular form of function approximation in reinforcement learning. Counterexamples have been known for many years (e.g., Baird 1995) in which Q-learning's parameters diverge to infinity for any positive step size. This is a severe problem in so far as function approximation is widely viewed as necessary for large-scale applications of reinforcement learning. The need is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway. Although no instances of absolute divergence in applications have been reported in the literature, the potential for instability is disturbing and probably belies real but less obvious problems.
The stability problem is not specific to reinforcement learning. Classical dynamic programming methods such as value and policy iteration are also off-policy methods and also diverge on some problems when used with linear function approximation. Reinforcement learning methods are actually an improvement over conventional dynamic programming methods in that at least they can be used stably with linear function approximation in their on-policy form. The stability problem is also not due to the interaction of control and prediction, or to stochastic approximation effects; the simplest counterexamples are for deterministic, expected-value-style, synchronous policy evaluation (see Baird 1995; Sutton & Barto 1998).
Prior to the current work, the possibility of instability could not be avoided whenever four individually desirable algorithmic features were combined: 1) off-policy updates, 2) temporal-difference learning, 3) linear function approximation, and 4) linear complexity in memory and per-time-step computation. If any one of these four is abandoned, then stable methods can be obtained relatively easily. But each feature brings value and practitioners are loath to give any of them up, as we discuss later in a penultimate related-work section. In this paper we present the first algorithm to achieve all four desirable features and be stable and convergent for all finite Markov decision processes, all target and behavior policies, and all feature representations for the linear approximator. Moreover, our algorithm does not use importance sampling and can be expected to be much better conditioned and of lower variance than importance sampling methods. Our algorithm can be viewed as performing stochastic gradient-descent in a novel objective function whose optimum is the least-squares TD solution. Our algorithm is also incremental and suitable for online use just as are simple temporal-difference learning algorithms such as Q-learning and TD(λ) (Sutton 1988). Our algorithm can be broadly characterized as a gradient-descent version of TD(0), and accordingly we call it GTD(0).
In this section we formulate the off-policy policy-evaluation problem for one-step temporal-difference learning such that the data consists of independent, identically-distributed (i.i.d.) samples. We start by considering the standard reinforcement learning framework, in which a learning agent interacts with an environment consisting of a finite Markov decision process (MDP). At each of a sequence of discrete time steps, t = 1, 2, ..., the environment is in a state s_t ∈ S, the agent chooses an action a_t ∈ A, and then the environment emits a reward r_t ∈ R, and transitions to its next state s_{t+1} ∈ S. The state and action sets are finite. State transitions are stochastic and dependent on the immediately preceding state and action. Rewards are stochastic and dependent on the preceding state and action, and on the next state. The agent process generating the actions is termed the behavior policy. To start, we assume a deterministic target policy π : S → A. The objective is to learn an approximation to its state-value function:
V^π(s) = E_π[ ∑_{t=1}^∞ γ^{t−1} r_t | s_1 = s ],   (1)
where γ ∈ [0, 1) is the discount rate. The learning is to be done without knowledge of the process dynamics and from observations of a single continuous trajectory with no resets.
In many problems of interest the state set is too large for it to be practical to approximate the value of each state individually. Here we consider linear function approximation, in which states are mapped to feature vectors with fewer components than the number of states. That is, for each state s ∈ S there is a corresponding feature vector φ(s) ∈ R^n, with n ≪ |S|. The approximation to the value function is then required to be linear in the feature vectors and a corresponding parameter vector θ ∈ R^n:

V^π(s) ≈ θ⊤φ(s).   (2)
Further, we assume that the states s_t are not visible to the learning agent in any way other than through the feature vectors. Thus this function approximation formulation can include partial-observability formulations such as POMDPs as a special case.
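As a small illustration of the linear form (2) (a sketch only; the feature vector and dimension below are invented for the example):

```python
import numpy as np

def approx_value(theta, phi_s):
    """Linear value estimate: V(s) ≈ theta^T phi(s), as in (2)."""
    return float(theta @ phi_s)

n = 4                                    # number of features, n << |S|
theta = np.zeros(n)                      # parameter vector to be learned
phi_s = np.array([1.0, 0.0, 0.5, 0.0])   # feature vector phi(s) for some state s
print(approx_value(theta, phi_s))        # 0.0 before any learning
```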
The environment and the behavior policy together generate a stream of states, actions and rewards, s_1, a_1, r_1, s_2, a_2, r_2, ..., which we can break into causally related 4-tuples, (s_1, a_1, r_1, s′_1), (s_2, a_2, r_2, s′_2), and so on, where s′_t = s_{t+1}.
is the direction opposite to the gradient. This is straightforward if the gradient can be written as a single expected value, but here we have a product of two expected values. One cannot sample both of them because the sample product will be biased by their correlation. However, one could store a long-term, quasi-stationary estimate of either of the expectations and then sample the other. The question is, which expectation should be estimated and stored, and which should be sampled? Both ways seem to lead to interesting learning algorithms.
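The objective and the gradient (the latter referred to as (6) below) are not reproduced in this excerpt; the following reconstruction is consistent with the updates (7)–(10) that follow. The expected TD(0) update vector is E[δφ], the objective is its squared norm, and, since ∇_θ δ = γφ′ − φ, half the negative gradient is a product of two expectations:

J(θ) = E[δφ]⊤ E[δφ],   −½ ∇_θ J(θ) = E[(φ − γφ′)φ⊤] E[δφ],

the first factor of which is the transpose of the matrix A = E[φ(φ − γφ′)⊤] introduced next.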
First let us consider the algorithm obtained by forming and storing a separate estimate of the first expectation, that is, of the matrix A = E[φ(φ − γφ′)⊤]. This matrix is straightforward to estimate from experience as a simple arithmetic average of all previously observed sample outer products φ(φ − γφ′)⊤. Note that A is a stationary statistic in any fixed-policy policy-evaluation problem; it does not depend on θ and would not need to be re-estimated if θ were to change. Let A_k be the estimate of A after observing the first k samples, (φ_1, r_1, φ′_1), ..., (φ_k, r_k, φ′_k). Then this algorithm is defined by
A_k = (1/k) ∑_{i=1}^{k} φ_i(φ_i − γφ′_i)⊤,   (7)
along with the gradient descent rule:
θ_{k+1} = θ_k + α_k A_k⊤ δ_k φ_k,   k ≥ 1,   (8)
where θ_1 is arbitrary, δ_k = r_k + γθ_k⊤φ′_k − θ_k⊤φ_k, and α_k > 0 is a sequence of step-size parameters, possibly decreasing over time. We call this algorithm A⊤TD(0) because it is essentially conventional TD(0) prefixed by an estimate of the matrix A⊤. Although we find this algorithm interesting, we do not consider it further here because it requires O(n²) memory and computation per time step.
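A minimal sketch of the A⊤TD(0) updates (7)–(8) (Python with NumPy; class and variable names are ours, not the paper's):

```python
import numpy as np

class ATransposeTD0:
    """Sketch of A^T TD(0): conventional TD(0) prefixed by a running estimate
    of A^T; O(n^2) memory and per-step computation."""
    def __init__(self, n, alpha=0.01, gamma=0.99):
        self.A = np.zeros((n, n))        # running mean of phi (phi - gamma*phi')^T
        self.theta = np.zeros(n)
        self.k = 0
        self.alpha, self.gamma = alpha, gamma

    def update(self, phi, r, phi_next):
        self.k += 1
        outer = np.outer(phi, phi - self.gamma * phi_next)
        self.A += (outer - self.A) / self.k                     # eq. (7), incremental mean
        delta = r + self.gamma * (self.theta @ phi_next) - self.theta @ phi
        self.theta += self.alpha * delta * (self.A.T @ phi)     # eq. (8)
```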
The second path to a stochastic-approximation algorithm for estimating the gradient (6) is to form and store an estimate of the second expectation, the vector E[δφ], and to sample the first expectation, E[φ(φ − γφ′)⊤]. Let u_k denote the estimate of E[δφ] after observing the first k − 1 samples, with u_1 = 0. The GTD(0) algorithm is defined by
u_{k+1} = u_k + β_k(δ_k φ_k − u_k)   (9)
and   θ_{k+1} = θ_k + α_k(φ_k − γφ′_k)φ_k⊤ u_k,   (10)
where θ_1 is arbitrary, δ_k is as in (3) using θ_k, and α_k > 0 and β_k > 0 are step-size parameters, possibly decreasing over time. Notice that if the product is formed right-to-left, then the entire computation is O(n) per time step.
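And a corresponding sketch of the GTD(0) updates (9)–(10) (again Python with NumPy and invented names); forming the scalar φ⊤u before scaling the vector keeps the cost O(n):

```python
import numpy as np

class GTD0:
    """Sketch of GTD(0): O(n) memory and per-step computation."""
    def __init__(self, n, alpha=0.01, beta=0.05, gamma=0.99):
        self.u = np.zeros(n)             # running estimate of E[delta * phi]
        self.theta = np.zeros(n)
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

    def update(self, phi, r, phi_next):
        delta = r + self.gamma * (self.theta @ phi_next) - self.theta @ phi
        self.u += self.beta * (delta * phi - self.u)                        # eq. (9)
        scale = phi @ self.u                                                # scalar phi^T u first
        self.theta += self.alpha * (phi - self.gamma * phi_next) * scale    # eq. (10)
```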
The purpose of this section is to establish that GTD(0) converges with probability one to the TD solution in the i.i.d. problem formulation under standard assumptions. In particular, we have the following result:
Theorem 4.1 (Convergence of GTD(0)). Consider the GTD(0) iteration (9, 10) with step-size sequences α_k and β_k satisfying β_k = ηα_k, η > 0, α_k, β_k ∈ (0, 1], ∑_{k=0}^∞ α_k = ∞, ∑_{k=0}^∞ α_k² < ∞. Further assume that (φ_k, r_k, φ′_k) is an i.i.d. sequence with uniformly bounded second moments. Let A = E[φ_k(φ_k − γφ′_k)⊤] and b = E[r_k φ_k] (note that A and b are well-defined because the distribution of (φ_k, r_k, φ′_k) does not depend on the sequence index k). Assume that A is non-singular. Then the parameter vector θ_k converges with probability one to the TD solution (4).
Proof. We use the ordinary-differential-equation (ODE) approach (Borkar & Meyn 2000). First, we rewrite the algorithm's two iterations as a single iteration in a combined parameter vector with 2n components, ρ_k⊤ = (v_k⊤, θ_k⊤), where v_k = u_k/√η, and a new reward-related vector with 2n components, g_{k+1}⊤ = (r_k φ_k⊤, 0⊤):

ρ_{k+1} = ρ_k + α_k√η (G_{k+1} ρ_k + g_{k+1}),

where

G_{k+1} = [ −√η I    φ_k(γφ′_k − φ_k)⊤ ;  (φ_k − γφ′_k)φ_k⊤    0 ].

Let G = E[G_k] and g = E[g_k]. Note that G and g are well-defined because, by assumption, the process {φ_k, r_k, φ′_k}_k is i.i.d. In particular,

G = [ −√η I    −A ;  A⊤    0 ],    g = [ b ;  0 ].

Further, note that (4) follows from

Gρ + g = 0,   (11)

where ρ⊤ = (v⊤, θ⊤).
Now we apply Theorem 2.2 of Borkar & Meyn (2000). For this purpose we write ρ_{k+1} = ρ_k + α_k√η (Gρ_k + g + (G_{k+1} − G)ρ_k + (g_{k+1} − g)) = ρ_k + α′_k (h(ρ_k) + M_{k+1}), where α′_k = α_k√η, h(ρ) = g + Gρ and M_{k+1} = (G_{k+1} − G)ρ_k + g_{k+1} − g. Let F_k = σ(ρ_1, M_1, ..., ρ_{k−1}, M_k). Theorem 2.2 requires the verification of the following conditions: (i) The function h is Lipschitz and h_∞(ρ) = lim_{r→∞} h(rρ)/r is well-defined for every ρ ∈ R^{2n}; (ii-a) The sequence (M_k, F_k) is a martingale difference sequence, and (ii-b) for some C_0 > 0, E[‖M_{k+1}‖² | F_k] ≤ C_0(1 + ‖ρ_k‖²) holds for any initial parameter vector ρ_1; (iii) The sequence α′_k satisfies 0 < α′_k ≤ 1, ∑_{k=1}^∞ α′_k = ∞, ∑_{k=1}^∞ (α′_k)² < +∞; and (iv) The ODE ρ̇ = h(ρ) has a globally asymptotically stable equilibrium.
Clearly, h(ρ) is Lipschitz with coefficient ‖G‖ and h_∞(ρ) = Gρ. By construction, (M_k, F_k) satisfies E[M_{k+1} | F_k] = 0 and M_k ∈ F_k, i.e., it is a martingale difference sequence. Condition (ii-b) can be shown to hold by a simple application of the triangle inequality and the boundedness of the second moments of (φ_k, r_k, φ′_k). Condition (iii) is satisfied by our conditions on the step-size sequences α_k, β_k. Finally, the last condition (iv) will follow from the elementary theory of linear differential equations if we can show that the real parts of all the eigenvalues of G are negative.
First, let us show that G is non-singular. Using the determinant rule for partitioned matrices¹ we get det(G) = det(A⊤A) ≠ 0. This indicates that all the eigenvalues of G are non-zero. Now, let λ ∈ C, λ ≠ 0, be an eigenvalue of G with corresponding normalized eigenvector x ∈ C^{2n}; that is, ‖x‖² = x*x = 1, where x* is the conjugate transpose of x. Hence x*Gx = λ. Let x⊤ = (x_1⊤, x_2⊤), where x_1, x_2 ∈ C^n. Using the definition of G, λ = x*Gx = −√η‖x_1‖² − x_1*Ax_2 + x_2*A⊤x_1. Because A is real, A* = A⊤, and it follows that (x_1*Ax_2)* = x_2*A⊤x_1, so the last two terms are purely imaginary. Thus, Re(λ) = Re(x*Gx) = −√η‖x_1‖² ≤ 0. We are now done if we show that x_1 cannot be zero. If x_1 = 0, then from λ = x*Gx we get that λ = 0, which contradicts λ ≠ 0.
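The crux above is that every eigenvalue of G has non-positive (in fact negative) real part. A quick numerical sanity check of this property on a randomly generated non-singular A (illustrative only, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta = 5, 0.5
A = rng.standard_normal((n, n))                    # almost surely non-singular
G = np.block([[-np.sqrt(eta) * np.eye(n), -A],
              [A.T, np.zeros((n, n))]])
eigs = np.linalg.eigvals(G)
print(eigs.real.max())                             # strictly negative for this G
assert np.all(eigs.real < 0)
```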
The next result concerns the convergence of GTD(0) when (φ_k, r_k, φ′_k) is obtained by the off-policy sub-sampling process described originally in Section 2. We make the following assumption:
Assumption A1. The behavior policy π_b (generator of the actions a_t) selects all actions of the target policy π with positive probability in every state, and the target policy is deterministic.
This assumption is needed to ensure that the sub-sampled process s_k is well-defined and that the obtained sample is of “high quality”. Under this assumption it holds that s_k is again a Markov chain, by the strong Markov property of Markov processes (as the selected times, at which the action taken corresponds to that of the target policy, form Markov times with respect to the filtration defined by the original process s_t). The following theorem shows that the conclusion of the previous result continues to hold in this case:
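As a rough sketch of this sub-sampling construction (the trajectory format, function names, and the use of the deterministic target policy as a callable are our assumptions, not the paper's code): transitions generated under the behavior policy are kept only at the time steps where the action taken coincides with the target policy's action.

```python
def subsample_off_policy(trajectory, target_action, feature_fn):
    """Build (phi_k, r_k, phi'_k) triples from a behavior-policy trajectory.

    trajectory:    iterable of (s, a, r, s_next) tuples generated under pi_b.
    target_action: s -> action of the deterministic target policy pi.
    feature_fn:    s -> feature vector phi(s).
    """
    samples = []
    for s, a, r, s_next in trajectory:
        if a == target_action(s):        # keep only the matched (Markov) times
            samples.append((feature_fn(s), r, feature_fn(s_next)))
    return samples
```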
Theorem 4.2 (Convergence of GTD(0) with a sub-sampled process). Assume A1. Let the parameters θ_k, u_k be updated by (9, 10). Further assume that (φ_k, r_k, φ′_k) is such that E[‖φ_k‖² | s_{k−1}], E[r_k² | s_{k−1}], and E[‖φ′_k‖² | s_{k−1}] are uniformly bounded. Assume that the Markov chain (s_k) is aperiodic and irreducible, so that lim_{k→∞} P(s_k = s′ | s_0 = s) = μ(s′) exists and is unique. Let s be a state randomly drawn from μ, and let s′ be a state obtained by following π for one time step in the MDP from s. Further, let r(s, s′) be the reward incurred. Let A = E[φ(s)(φ(s) − γφ(s′))⊤] and b = E[r(s, s′)φ(s)]. Assume that A is non-singular. Then the parameter vector θ_k converges with probability one to the TD solution (4), provided that s_1 ∼ μ.
Proof. The proof of Theorem 4.1 goes through without any changes once we observe that G = E[G_{k+1} | F_k] and g = E[g_{k+1} | F_k].
¹According to this rule, if A ∈ R^{n×n}, B ∈ R^{n×m}, C ∈ R^{m×n}, D ∈ R^{m×m}, then for F = [A B; C D] ∈ R^{(n+m)×(n+m)}, det(F) = det(A) det(D − CA⁻¹B).
There have been several prior attempts to attain the four desirable algorithmic features mentioned at the beginning of this paper (off-policy stability, temporal-difference learning, linear function approximation, and O(n) complexity), but none has been completely successful.
One idea for retaining all four desirable features is to use importance sampling techniques to re-weight off-policy updates so that they are in the same direction as on-policy updates in expected value (Precup, Sutton & Dasgupta 2001; Precup, Sutton & Singh 2000). Convergence can sometimes then be assured by existing results on the convergence of on-policy methods (Tsitsiklis & Van Roy 1997; Tadic 2001). However, the importance sampling weights are cumulative products of (possibly many) target-to-behavior-policy likelihood ratios, and consequently they and the corresponding updates may be of very high variance. The use of “recognizers” to construct the target policy directly from the behavior policy (Precup, Sutton, Paduraru, Koop & Singh 2006) is one strategy for limiting the variance; another is careful choice of the target policies (see Precup, Sutton & Dasgupta 2001). However, it remains the case that for all such methods to date there are choices of problem, behavior policy, and target policy for which the variance is infinite, and thus for which there is no guarantee of convergence.
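To make the variance problem concrete, here is a hedged sketch (ours, not from the cited papers) of how cumulative importance weights are typically formed; each factor is a target-to-behavior likelihood ratio, so a trajectory containing even one action that is rare under the behavior policy can blow the product, and hence the corresponding update, up dramatically.

```python
def cumulative_is_weights(trajectory, target_prob, behavior_prob):
    """Cumulative products of likelihood ratios rho_t = pi(a_t|s_t) / pi_b(a_t|s_t).

    trajectory:    list of (s_t, a_t) pairs generated by the behavior policy.
    target_prob:   (s, a) -> probability of a under the target policy pi.
    behavior_prob: (s, a) -> probability of a under the behavior policy pi_b.
    """
    weights, w = [], 1.0
    for s, a in trajectory:
        w *= target_prob(s, a) / behavior_prob(s, a)
        weights.append(w)                # the variance of w grows with trajectory length
    return weights
```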
Residual gradient algorithms (Baird 1995) have also been proposed as a way of obtaining all four desirable features. These methods can be viewed as gradient descent in the expected squared TD error, E[δ²]; thus they converge stably to the solution that minimizes this objective for arbitrary differentiable function approximators. However, this solution has always been found to be much inferior to the TD solution (exemplified by (4) for the one-step linear case). In the literature (Baird 1995; Sutton & Barto 1998), it is often claimed that residual-gradient methods are guaranteed to find the TD solution in two special cases: 1) systems with deterministic transitions and 2) systems in which two samples can be drawn for each next state (e.g., for which a simulation model is available). Our own analysis indicates that even these two special requirements are insufficient to guarantee convergence to the TD solution.²
Gordon (1995) and others have questioned the need for linear function approximation. He has proposed replacing linear function approximation with a more restricted class of approximators, known as averagers, that never extrapolate outside the range of the observed data and thus cannot diverge. Rightly or wrongly, averagers have been seen as being too constraining and have not been used on large applications involving online learning. Linear methods, on the other hand, have been widely used (e.g., Baxter, Tridgell & Weaver 1998; Sturtevant & White 2006; Schaeffer, Hlynka & Jussila 2001).
The need for linear complexity has also been questioned. Second-order methods for linear approximators, such as LSTD (Bradtke & Barto 1996; Boyan 2002) and LSPI (Lagoudakis & Parr 2003; see also Peters, Vijayakumar & Schaal 2005), can be effective on moderately sized problems. If the number of features in the linear approximator is n, then these methods require memory and per-time-step computation that is O(n²). Newer incremental methods such as iLSTD (Geramifard, Bowling & Sutton 2006) have reduced the per-time-step complexity to O(n), but are still O(n²) in memory. Sparsification methods may reduce the complexity further, but they do not help in the general case, and they may apply to O(n) methods as well to further reduce their complexity. Linear function approximation is most powerful when very large numbers of features are used, perhaps millions of features (e.g., as in Silver, Sutton & Müller 2007). In such cases, O(n²) methods are not feasible.
GTD(0) is the first off-policy TD algorithm to converge under general conditions with linear function approximation and linear complexity. As such, it breaks new ground in terms of important, absolute abilities not previously available in existing algorithms. We have conducted empirical studies with the GTD(0) algorithm and have confirmed that it converges reliably on standard off-policy counterexamples such as Baird's (1995) “star” problem. On on-policy problems such as the n-state random walk (Sutton 1988; Sutton & Barto 1998), GTD(0) does not seem to learn as efficiently as classic TD(0), although we are still exploring different ways of setting the step-size parameters, and other variations on the algorithm. It is not clear that the GTD(0) algorithm in its current form will be a fully satisfactory solution to the off-policy learning problem, but it is clear that it breaks new ground and achieves important abilities that were previously unattainable.

²For a counterexample, consider that given in Dayan's (1992) Figure 2, except now consider that state A is actually two states, A and A′, which share the same feature vector. The two states occur with 50-50 probability, and when one occurs the transition is always deterministically to B followed by the outcome 1, whereas when the other occurs the transition is always deterministically to the outcome 0. In this case V(A) and V(B) will converge under the residual-gradient algorithm to the wrong answers, 1/3 and 2/3, even though the system is deterministic, and even if multiple samples are drawn from each state (they will all be the same).
Acknowledgments
The authors gratefully acknowledge insights and assistance they have received from David Silver, Eric Wiewiora, Mark Ring, Michael Bowling, and Alborz Geramifard. This research was supported by iCORE, NSERC and the Alberta Ingenuity Fund.
References
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann.
Baxter, J., Tridgell, A., Weaver, L. (1998). Experiments in parameter learning using temporal differences. International Computer Chess Association Journal, 21:84–99.
Bertsekas, D. P., Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.
Borkar, V. S., Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469.
Boyan, J. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49:233–
Bradtke, S., Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8:341–362.
Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental least-squares temporal difference learning. Proceedings of the National Conference on Artificial Intelligence, pp. 356–361.
Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268. Morgan Kaufmann.
Lagoudakis, M., Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149.
Peters, J., Vijayakumar, S., Schaal, S. (2005). Natural Actor-Critic. Proceedings of the 16th European Conference on Machine Learning, pp. 280–291.
Precup, D., Sutton, R. S., Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. Proceedings of the 18th International Conference on Machine Learning, pp. 417–424.
Precup, D., Sutton, R. S., Paduraru, C., Koop, A., Singh, S. (2006). Off-policy learning with recognizers. Advances in Neural Information Processing Systems 18.
Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann.
Schaeffer, J., Hlynka, M., Jussila, V. (2001). Temporal difference learning applied to a high-performance game-playing program. Proceedings of the International Joint Conference on Artificial Intelligence, pp. 529–534.
Silver, D., Sutton, R. S., Müller, M. (2007). Reinforcement learning of local shape in the game of Go. Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1053–1058.
Sturtevant, N. R., White, A. M. (2006). Feature construction for reinforcement learning in hearts. Proceedings of the 5th International Conference on Computers and Games.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9–44.
Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211.
Sutton, R. S., Rafols, E. J., Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems 18.
Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42:241–
Tsitsiklis, J. N., Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.