




Reinforcement Learning with Gaussian Processes

Yaakov Engel yaki@cs.ualberta.ca
Dept. of Computing Science, University of Alberta, Edmonton, Canada
Shie Mannor shie@ece.mcgill.ca
Dept. of Electrical and Computer Engineering, McGill University, Montreal, Canada
Ron Meir rmeir@ee.technion.ac.il
Dept. of Electrical Engineering, Technion Institute of Technology, Haifa 32000, Israel
Gaussian Process Temporal Difference (GPTD) learning offers a Bayesian solution to the policy evaluation problem of reinforcement learning. In this paper we extend the GPTD framework by addressing two pressing issues, which were not adequately treated in the original GPTD paper (Engel et al., 2003). The first is the issue of stochasticity in the state transitions, and the second is concerned with action selection and policy improvement. We present a new generative model for the value function, deduced from its relation with the discounted return. We derive a corresponding on-line algorithm for learning the posterior moments of the value Gaussian process. We also present a SARSA based extension of GPTD, termed GPSARSA, that allows the selection of actions and the gradual improvement of policies without requiring a world-model.
In Engel et al. (2003) the use of Gaussian Processes (GPs) for solving the Reinforcement Learning (RL) problem of value estimation was introduced. Since GPs belong to the family of kernel machines, they bring into RL the high, and quickly growing, representational flexibility of kernel based representations, allowing them to deal with almost any conceivable object of interest, from text documents and DNA sequence data to probability distributions, trees and graphs, to mention just a few (see Schölkopf & Smola, 2002, and references therein). Moreover, the use of Bayesian
reasoning with GPs allows one to obtain not only value estimates, but also estimates of the uncertainty in the value, and this in large and even infinite MDPs. However, both the probabilistic generative model and the corresponding Gaussian Process Temporal Differences (GPTD) algorithm proposed in Engel et al. (2003) had two major shortcomings. First, the original model is strictly correct only if the state transitions of the underlying Markov Decision Process (MDP) are deterministic (or if the discount factor is zero, in which case the model degenerates to simple GP regression), and if the rewards are corrupted by white Gaussian noise. While the second assumption is relatively innocuous, the first is a serious handicap to the applicability of the GPTD model to general MDPs. Secondly, in RL what we are really after is an optimal, or at least a good suboptimal, action selection policy. Many algorithms for solving this problem are based on the Policy Iteration method, in which the value function must be estimated for a sequence of fixed policies, making value estimation, or policy evaluation, a crucial algorithmic component. Since the GPTD algorithm only addresses the value estimation problem, we need to modify it somehow if we wish to solve the complete RL problem.

One possible heuristic modification, demonstrated in Engel et al. (2003), is the use of Optimistic Policy Iteration (OPI) (Bertsekas & Tsitsiklis, 1996). In OPI the learning agent utilizes a model of its environment and its current value estimate to guess the expected payoff for each of the actions available to it at each time step. It then greedily (or ε-greedily) chooses the highest ranking action. Clearly, OPI may be used only when a good model of the MDP is available to the agent. However, assuming that such a model is available as prior knowledge is a rather strong assumption inapplicable in many domains, while estimating such a model on-the-fly, especially when the state transitions
are stochastic, may be prohibitively expensive. In either case, computing the expectations involved in ranking the actions may itself be prohibitively costly. Another possible modification, one that does not require a model, is to estimate state-action values, or Q-values, using an algorithm such as Sutton's SARSA (Sutton & Barto, 1998).
The first contribution of this paper is a modification of the original GPTD model that allows it to learn value and value-uncertainty estimates in general MDPs, allowing for stochasticity in both transitions and rewards. Drawing inspiration from Sutton's SARSA algorithm, our second contribution is GPSARSA, an extension of the GPTD algorithm for learning a Gaussian distribution over state-action values, thus allowing us to perform model-free policy improvement.
Let us introduce some definitions to be used in the sequel. A Markov Decision Process (MDP) is a tuple (X, U, R, p), where X and U are the state and action spaces, respectively; R : X → ℝ is the immediate reward, which may be random, in which case q(·|x) denotes the distribution of rewards at the state x; and p : X × U × X → [0, 1] is the transition distribution, which we assume is stationary. A stationary policy μ : X × U → [0, 1] is a mapping from states to action selection probabilities. Given a fixed policy μ, the transition probabilities of the MDP are given by the policy-dependent state transition probability distribution p^μ(x′|x) = ∫_U du p(x′|u, x) μ(u|x). The discounted return D(x) for a state x is a random process defined by

D(x) = ∑_{i=0}^∞ γ^i R(x_i) | x_0 = x,  with x_{i+1} ∼ p^μ(·|x_i).    (2.1)

Here, γ ∈ [0, 1] is a discount factor that determines the exponential devaluation rate of delayed rewards (when γ = 1 the policy must be proper, i.e., guaranteed to terminate; see Bertsekas and Tsitsiklis, 1996). Note that the randomness in D(x_0) for any given state x_0 is due both to the stochasticity of the sequence of states that follow x_0, and to the randomness in the rewards R(x_0), R(x_1), R(x_2), . . . . We refer to this as the intrinsic randomness of the MDP. Using the stationarity of the MDP we may write
D(x) = R(x) + γD(x′),  with x′ ∼ p^μ(·|x).    (2.2)
The equality here marks an equality in the distributions of the two sides of the equation. Let us define the expectation operator E^μ as the expectation over all possible trajectories and all possible rewards collected in them. This allows us to define the value function V(x) as the result of applying this expectation operator to the discounted return D(x). Thus, applying E^μ to both sides of Eq. (2.2), and using the conditional expectation formula (Scharf, 1991), we get
V(x) = r̄(x) + γ E_{x′|x} V(x′)  ∀x ∈ X,    (2.3)
which is recognizable as the fixed-policy version of the Bellman equation (Bertsekas & Tsitsiklis, 1996).
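To make Eqs. (2.1)-(2.3) concrete, the following minimal Python sketch draws Monte-Carlo samples of the discounted return under a fixed policy; averaging such samples approximates V(x). The `step` interface (returning the sampled reward at the current state and a sampled next state) is a hypothetical stand-in for the MDP and policy, not something defined in the paper.

```python
import numpy as np

def sample_return(x, step, gamma, horizon=1000):
    """Draw one sample of the discounted return D(x) of Eq. (2.1) by rolling out
    the fixed policy.  `step(x)` is an assumed interface returning a sampled
    reward R(x) and a next state x' ~ p_mu(.|x).  The rollout is truncated at
    `horizon` steps (exact for episodic tasks that terminate earlier)."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        r, x = step(x)
        total += discount * r
        discount *= gamma
    return total

# V(x) = E[D(x)] (Eq. 2.3) may then be approximated by a sample mean, e.g.
# v_hat = np.mean([sample_return(x0, step, gamma=0.95) for _ in range(10_000)])
```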
2.1. The Value Model

The recursive definition of the discounted return (2.2) is the basis for our statistical generative model connecting values and rewards. Let us decompose the discounted return D into its mean V and a random, zero-mean residual ΔV,
D(x) = V(x) + ΔV(x),    (2.4)
where V(x) = E^μ D(x). In the classic frequentist approach V(·) is no longer random, since it is the true value function induced by the policy μ. Adopting the Bayesian methodology, we may still view the value V(·) as a random entity by assigning it additional randomness that is due to our subjective uncertainty regarding the MDP's model (p, q). We do not know what the true functions p and q are, which means that we are also uncertain about the true value function. We choose to model this additional extrinsic uncertainty by defining V(x) as a random process indexed by the state variable x. This decomposition is useful, since it separates the two sources of uncertainty inherent in the discounted return process D: For a known MDP model, V becomes deterministic and the randomness in D is fully attributed to the intrinsic randomness in the state-reward trajectory, modeled by ΔV. On the other hand, in an MDP in which both transitions and rewards are deterministic but otherwise unknown, ΔV becomes deterministic (i.e., identically zero), and the randomness in D is due solely to the extrinsic uncertainty, modeled by V. For a more thorough discussion of intrinsic and extrinsic uncertainties see Mannor et al. (2004). Substituting Eq. (2.4) into Eq. (2.2) and rearranging we get
R(x) = V(x) − γV(x′) + N(x, x′),  x′ ∼ p^μ(·|x),    (2.5)
where N(x, x′) ≜ ΔV(x) − γΔV(x′). Suppose we are provided with a trajectory x_0, x_1, . . . , x_t, sampled from the MDP under a policy μ, i.e., from the policy-dependent transition distribution p^μ.
The posterior mean and variance of the value at some point x are given, respectively, by

v̂_t(x) = k_t(x)^⊤ α_t,   p_t(x) = k(x, x) − k_t(x)^⊤ C_t k_t(x),    (2.9)

where k_t(x) = (k(x_0, x), . . . , k(x_t, x))^⊤, and

α_t = H_t^⊤ (H_t K_t H_t^⊤ + Σ_t)^{-1} r_{t−1},   C_t = H_t^⊤ (H_t K_t H_t^⊤ + Σ_t)^{-1} H_t.    (2.10)
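As a rough illustration of Eqs. (2.9)-(2.10), the sketch below computes the posterior moments in batch form with numpy. It assumes the MC-GPTD model discussed in this paper: H_t is taken to be the t × (t + 1) differencing matrix whose i-th row is (0, . . . , 1, −γ, . . . , 0), as implied by Eq. (2.5), and Σ_t = σ² H_t H_t^⊤, the correlated noise covariance noted in the concluding section; the kernel is supplied by the caller.

```python
import numpy as np

def gptd_posterior(states, rewards, kernel, gamma, sigma, x_query):
    """Batch GPTD posterior moments, Eq. (2.9)-(2.10).  `states` is the trajectory
    x_0,...,x_t (length t+1) and `rewards` is r_0,...,r_{t-1} (length t).
    Assumes H_t with rows (0,...,1,-gamma,...,0) and Sigma_t = sigma^2 H_t H_t^T."""
    t = len(rewards)
    K = np.array([[kernel(a, b) for b in states] for a in states])  # (t+1) x (t+1)
    H = np.zeros((t, t + 1))
    for i in range(t):
        H[i, i], H[i, i + 1] = 1.0, -gamma
    Sigma = sigma ** 2 * (H @ H.T)
    Q = np.linalg.inv(H @ K @ H.T + Sigma)
    alpha = H.T @ Q @ np.asarray(rewards, dtype=float)              # Eq. (2.10)
    C = H.T @ Q @ H
    k_q = np.array([kernel(x, x_query) for x in states])
    mean = float(k_q @ alpha)                                       # Eq. (2.9)
    var = float(kernel(x_query, x_query) - k_q @ C @ k_q)
    return mean, var
```

For example, with a Gaussian kernel k(x, x′) = exp(−‖x − x′‖²/2ℓ²), `gptd_posterior(states, rewards, kernel, 0.95, 1.0, x)` returns the pair (v̂_t(x), p_t(x)).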
2.2. Relation to Monte-Carlo Simulation
Consider an episode in which a terminal state is reached at time step t + 1. In this case, the last equation in our generative model should read R(x_t) = V(x_t) + N(x_t), since V(x_{t+1}) = 0. Our complete set of equations is now
R_t = H_{t+1} V_t + N_t,    (2.11)
with H_{t+1} a square (t + 1) × (t + 1) matrix, given by H_{t+1} as defined in (2.7) with its last column removed. Note that H_{t+1} is also invertible, since its determinant equals 1.
Our model's validity may be substantiated by performing a whitening transformation on Eq. (2.11). Since the noise covariance matrix Σ_t is positive definite, there exists a square matrix Z_t satisfying Z_t^⊤ Z_t = Σ_t^{-1}. Multiplying Eq. (2.11) by Z_t we then get Z_t R_t = Z_t H_{t+1} V_t + Z_t N_t. The transformed noise term Z_t N_t has a covariance matrix given by Z_t Σ_t Z_t^⊤ = Z_t (Z_t^⊤ Z_t)^{-1} Z_t^⊤ = I. Thus the transformation Z_t whitens the noise. In our case, a whitening matrix is given by
Z_t = H_{t+1}^{-1} =
  [ 1   γ   γ²   · · ·   γ^t     ]
  [ 0   1   γ    · · ·   γ^{t−1} ]
  [ ⋮                    ⋮       ]
  [ 0   0   0    · · ·   1       ]

i.e., an upper-triangular matrix whose (i, j) entry equals γ^{j−i} for j ≥ i and 0 otherwise.
The transformed model is Z_t R_t = V_t + N′_t, with white Gaussian noise N′_t = Z_t N_t ∼ N(0, σ² I). Let us look at the i-th equation (i.e., row) of this transformed model:
R(x_i) + γR(x_{i+1}) + . . . + γ^{t−i} R(x_t) = V(x_i) + N′_i,
with N′_i ∼ N(0, σ²). This is exactly the generative model we would have used had we wanted to learn the value function by performing GP regression using Monte-Carlo samples of the discounted return as our targets. The major benefit in using the GPTD formulation is that it allows us to perform exact updates of the parameters of the posterior value mean and covariance on-line.
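The claim that Z_t = H_{t+1}^{-1} turns the observed rewards into Monte-Carlo return targets is easy to check numerically. The snippet below is a small sanity check under the assumed episodic form of H_{t+1} (rows (0, . . . , 1, −γ, . . . , 0) with a final row (0, . . . , 0, 1)); it is an illustration, not part of the paper.

```python
import numpy as np

gamma, t = 0.9, 5
# Square H_{t+1} of Eq. (2.11): rows e_i - gamma * e_{i+1}, last row e_t.
H = np.eye(t + 1) - gamma * np.eye(t + 1, k=1)
Z = np.linalg.inv(H)                      # upper-triangular, entries gamma^(j-i)
rewards = np.random.randn(t + 1)
mc_returns = np.array([sum(gamma ** (j - i) * rewards[j] for j in range(i, t + 1))
                       for i in range(t + 1)])
assert np.allclose(Z @ rewards, mc_returns)   # whitened targets = discounted returns
```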
Computing the parameters α_t and C_t of the posterior moments (2.10) is computationally expensive for large samples, due to the need to store and invert a matrix of size t × t. Even when this has been performed, computing the posterior moments for every new query point requires that we multiply two t × 1 vectors for the mean, and compute a t × t quadratic form for the variance. These computational requirements are prohibitive if we are to compute value estimates on-line, as is usually required of RL algorithms. Engel et al. (2003) used an on-line kernel sparsification algorithm that is based on a view of the kernel as an inner-product in some high dimensional feature space to which raw state vectors are mapped (for completeness, we repeat here some of the details concerning this sparsification method). This sparsification method incrementally constructs a dictionary D = {x̃_1, . . . , x̃_{|D|}} of representative states. Upon observing x_t, the distance between the feature-space image of x_t and the span of the images of current dictionary members is computed. If the squared distance exceeds some positive threshold ν, x_t is added to the dictionary; otherwise, it is left out. Determining this squared distance, δ_t, involves solving a simple least-squares problem, whose solution is a |D| × 1 vector a_t of optimal approximation coefficients, satisfying

a_t = K̃_{t−1}^{-1} k̃_{t−1}(x_t),   δ_t = k(x_t, x_t) − a_t^⊤ k̃_{t−1}(x_t),    (3.12)

where k̃_t(x) = (k(x̃_1, x), . . . , k(x̃_{|D_t|}, x))^⊤ is a |D_t| × 1 vector, and K̃_t = [k̃_t(x̃_1), . . . , k̃_t(x̃_{|D_t|})] is a square |D_t| × |D_t|, symmetric, positive-definite matrix. By construction, the dictionary has the property that the feature-space images of all states encountered during learning may be approximated to within a squared error ν by the images of the dictionary members. The threshold ν may be tuned to control the sparsity of the solution. Sparsification allows kernel expansions, such as those appearing in Eq. 2.10, to be approximated by kernel expansions involving only dictionary members, by using
k_t(x) ≈ A_t k̃_t(x),   K_t ≈ A_t K̃_t A_t^⊤.    (3.13)
The t × |D_t| matrix A_t contains in its rows the approximation coefficients computed by the sparsification algorithm, i.e., A_t = [a_1, . . . , a_t]^⊤, with padding zeros placed where necessary; see Engel et al. (2003). The end result of the sparsification procedure is that the posterior value mean v̂_t and variance p_t may be compactly approximated as follows (compare to Eq. 2.9, 2.10):
v̂_t(x) = k̃_t(x)^⊤ α̃_t,   p_t(x) = k(x, x) − k̃_t(x)^⊤ C̃_t k̃_t(x),    (3.14)

where

α̃_t = H̃_t^⊤ (H̃_t K̃_t H̃_t^⊤ + Σ_t)^{-1} r_{t−1},   C̃_t = H̃_t^⊤ (H̃_t K̃_t H̃_t^⊤ + Σ_t)^{-1} H̃_t,    (3.15)

and H̃_t = H_t A_t.
The parameters that the GPTD algorithm is required to store and update in order to evaluate the posterior mean and variance are now α̃_t and C̃_t, whose dimensions are |D_t| × 1 and |D_t| × |D_t|, respectively. In many cases this results in significant computational savings, both in terms of memory and time, when compared with the exact non-sparse solution.
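For concreteness, here is a minimal sketch of the sparsification test of Eq. (3.12), building the dictionary and the coefficient matrix A_t of Eq. (3.13) for a given trajectory. The kernel and the threshold ν are supplied by the caller; for clarity K̃ is solved from scratch at every step, whereas Table 1 maintains K̃^{-1} incrementally.

```python
import numpy as np

def ald_dictionary(states, kernel, nu):
    """Approximate linear dependence test of Eq. (3.12): a state is admitted to
    the dictionary whenever its squared feature-space distance delta from the
    span of the current dictionary exceeds nu."""
    D = [states[0]]
    coeffs = [np.array([1.0])]                    # rows of A_t before zero-padding
    for x in states[1:]:
        K_tilde = np.array([[kernel(a, b) for b in D] for a in D])
        k_tilde = np.array([kernel(xd, x) for xd in D])
        a = np.linalg.solve(K_tilde, k_tilde)     # a_t of Eq. (3.12)
        delta = kernel(x, x) - a @ k_tilde
        if delta > nu:
            D.append(x)
            a = np.zeros(len(D))
            a[-1] = 1.0                           # dictionary members are represented exactly
        coeffs.append(a)
    # Zero-pad the rows to form the t x |D_t| matrix A_t of Eq. (3.13).
    A = np.zeros((len(states), len(D)))
    for i, a in enumerate(coeffs):
        A[i, :len(a)] = a
    return D, A
```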
The derivation of the recursive update formulas for the mean and covariance parameters α̃_t and C̃_t, for a new sample x_t, is rather long and tedious due to the added complication arising from the non-diagonality of the noise covariance matrix Σ_t. We therefore refer the interested reader to (Engel, 2005, Appendix A.2.3) for the complete derivation (with state dependent noise). In Table 1 we present the resulting algorithm in pseudocode.
Some insight may be gained by noticing that the term r_{t−1} − Δk̃_t^⊤ α̃_{t−1} appearing in the update for d_t is a temporal difference term. From Eq. (3.14) and the definition of Δk̃_t (see Table 1) we have r_{t−1} − Δk̃_t^⊤ α̃_{t−1} = r_{t−1} + γ v̂_{t−1}(x_t) − v̂_{t−1}(x_{t−1}). Consequently, d_t may be viewed as a linear filter driven by the temporal differences. The update of α̃_t is simply the output of this filter, multiplied by the gain vector c̃_t / s_t. The resemblance to the Kalman Filter updates is evident. It should be noted that it is indeed fortunate that the noise covariance matrix vanishes except for its three central diagonals. This relative simplicity of the noise model is the reason we were able to derive simple and efficient recursive updates, such as the ones described above.
As mentioned above, SARSA is a fairly straightforward extension of the TD algorithm (Sutton & Barto, 1998), in which state-action values are estimated, thus allowing policy improvement steps to be performed without requiring any additional knowledge on the MDP model. The idea is to use the stationary policy μ being followed in order to define a new, augmented process, the state space of which is X′ = X × U (i.e., the original state space augmented by the action space), maintaining the same reward model.
Table 1. The On-Line Monte-Carlo GPTD Algorithm

Parameters: ν, σ
Initialize: D_0 = {x_0}, K̃_0^{-1} = 1/k(x_0, x_0), a_0 = (1), α̃_0 = 0, C̃_0 = 0, c̃_0 = 0, d_0 = 0, 1/s_0 = 0
for t = 1, 2, . . .
  observe x_{t−1}, r_{t−1}, x_t
  a_t = K̃_{t−1}^{-1} k̃_{t−1}(x_t)
  δ_t = k(x_t, x_t) − k̃_{t−1}(x_t)^⊤ a_t
  Δk̃_t = k̃_{t−1}(x_{t−1}) − γ k̃_{t−1}(x_t)
  d_t = (γσ²/s_{t−1}) d_{t−1} + r_{t−1} − Δk̃_t^⊤ α̃_{t−1}
  if δ_t > ν
    K̃_t^{-1} = (1/δ_t) [ δ_t K̃_{t−1}^{-1} + a_t a_t^⊤ , −a_t ; −a_t^⊤ , 1 ]
    a_t = (0, . . . , 1)^⊤
    h̃_t = (a_{t−1}^⊤, −γ)^⊤
    Δk_tt = a_{t−1}^⊤ (k̃_{t−1}(x_{t−1}) − 2γ k̃_{t−1}(x_t)) + γ² k(x_t, x_t)
    c̃_t = (γσ²/s_{t−1}) ( c̃_{t−1} ; 0 ) + h̃_t − ( C̃_{t−1} Δk̃_t ; 0 )
    s_t = (1 + γ²)σ² + Δk_tt − Δk̃_t^⊤ C̃_{t−1} Δk̃_t + (2γσ²/s_{t−1}) c̃_{t−1}^⊤ Δk̃_t − γ²σ⁴/s_{t−1}
    α̃_{t−1} = ( α̃_{t−1} ; 0 )
    C̃_{t−1} = [ C̃_{t−1} , 0 ; 0^⊤ , 0 ]
  else
    h̃_t = a_{t−1} − γ a_t
    Δk_tt = h̃_t^⊤ Δk̃_t
    c̃_t = (γσ²/s_{t−1}) c̃_{t−1} + h̃_t − C̃_{t−1} Δk̃_t
    s_t = (1 + γ²)σ² + Δk̃_t^⊤ ( c̃_t + (γσ²/s_{t−1}) c̃_{t−1} ) − γ²σ⁴/s_{t−1}
  end if
  α̃_t = α̃_{t−1} + (c̃_t / s_t) d_t
  C̃_t = C̃_{t−1} + (1/s_t) c̃_t c̃_t^⊤
end for
return D_t, α̃_t, C̃_t
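The following is a rough numpy transcription of the updates of Table 1, as reconstructed above. It is a sketch, not the authors' implementation: the RBF kernel, the class and method names are illustrative assumptions, and the episodic terminal-state correction of Section 2.2 is omitted.

```python
import numpy as np

def rbf_kernel(x, y, ell=1.0):
    # Placeholder kernel; any positive-definite kernel k(x, x') may be substituted.
    x, y = np.atleast_1d(x).astype(float), np.atleast_1d(y).astype(float)
    return float(np.exp(-0.5 * np.sum((x - y) ** 2) / ell ** 2))

class OnlineMCGPTD:
    """Sketch of the on-line Monte-Carlo GPTD updates of Table 1.  Names follow
    the paper: D (dictionary), Kinv (K~^-1), alpha (alpha~), C (C~), c (c~),
    d, 1/s, and a (coefficients of the previous state on the dictionary)."""

    def __init__(self, x0, gamma, sigma, nu, kernel=rbf_kernel):
        self.gamma, self.s2, self.nu, self.k = gamma, sigma ** 2, nu, kernel
        self.D = [x0]
        self.Kinv = np.array([[1.0 / kernel(x0, x0)]])
        self.a = np.array([1.0])
        self.alpha = np.zeros(1)
        self.C = np.zeros((1, 1))
        self.c = np.zeros(1)
        self.d = 0.0
        self.sinv = 0.0                       # 1/s_0 = 0

    def _kvec(self, x):
        return np.array([self.k(xd, x) for xd in self.D])

    def observe(self, x_prev, r_prev, x):
        g, s2 = self.gamma, self.s2
        kp, kx = self._kvec(x_prev), self._kvec(x)
        a_new = self.Kinv @ kx
        delta = self.k(x, x) - kx @ a_new
        dk = kp - g * kx                      # Delta k~_t
        self.d = g * s2 * self.sinv * self.d + r_prev - dk @ self.alpha

        if delta > self.nu:                   # dictionary expansion branch
            n = len(self.D)
            Kinv = np.zeros((n + 1, n + 1))
            Kinv[:n, :n] = self.Kinv + np.outer(a_new, a_new) / delta
            Kinv[:n, n] = -a_new / delta
            Kinv[n, :n] = -a_new / delta
            Kinv[n, n] = 1.0 / delta
            self.Kinv = Kinv
            h = np.append(self.a, -g)
            dktt = self.a @ (kp - 2 * g * kx) + g ** 2 * self.k(x, x)
            c_new = (g * s2 * self.sinv * np.append(self.c, 0.0)
                     + h - np.append(self.C @ dk, 0.0))
            s = ((1 + g ** 2) * s2 + dktt - dk @ self.C @ dk
                 + 2 * g * s2 * self.sinv * (self.c @ dk)
                 - g ** 2 * s2 ** 2 * self.sinv)
            self.alpha = np.append(self.alpha, 0.0)
            C = np.zeros((n + 1, n + 1))
            C[:n, :n] = self.C
            self.C = C
            self.D.append(x)
            a_new = np.zeros(n + 1)
            a_new[n] = 1.0                    # x_t is now a dictionary member
        else:                                 # dictionary unchanged
            h = self.a - g * a_new
            c_new = g * s2 * self.sinv * self.c + h - self.C @ dk
            s = ((1 + g ** 2) * s2
                 + dk @ (c_new + g * s2 * self.sinv * self.c)
                 - g ** 2 * s2 ** 2 * self.sinv)

        self.alpha = self.alpha + (c_new / s) * self.d
        self.C = self.C + np.outer(c_new, c_new) / s
        self.c, self.sinv, self.a = c_new, 1.0 / s, a_new

    def value(self, x):                       # posterior mean and variance, Eq. (3.14)
        kt = self._kvec(x)
        return kt @ self.alpha, self.k(x, x) - kt @ self.C @ kt
```

A typical usage pattern would be to construct `OnlineMCGPTD(x0, gamma=0.95, sigma=1.0, nu=1e-3)`, call `observe(x_prev, r_prev, x)` for every observed transition, and query `value(x)` for the posterior mean and variance at any state.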
This augmented process is Markovian with transition probabilities p′(x′, u′|x, u) = p^μ(x′|x) μ(u′|x′). SARSA is simply the TD algorithm applied to this new process. The same reasoning may be applied to derive a GPSARSA algorithm from the GPTD algorithm. All we need is to define a covariance kernel function over state-action pairs, i.e., k : (X × U) × (X × U) → ℝ. Since states and actions are different entities it makes sense to decompose k into a state-kernel k_x and an action-kernel k_u: k(x, u, x′, u′) = k_x(x, x′) k_u(u, u′). If both k_x and k_u are kernels we know that k is also a legitimate kernel (Schölkopf & Smola, 2002), and just as the state-kernel codes our prior beliefs concerning correlations between the values of different states, so should the action-kernel code our prior beliefs on value correlations between different actions. All that remains now is to run GPTD on the augmented state-reward sequence, using the new state-action kernel.
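A minimal sketch of the factored state-action kernel, together with greedy action selection from the posterior Q-mean over a finite candidate set, is given below. The helper names and the finite action set are illustrative assumptions (GPSARSA itself does not require a finite action set).

```python
import numpy as np

def state_action_kernel(kx, ku):
    """Product kernel k((x,u),(x',u')) = kx(x,x') * ku(u,u') over state-action pairs."""
    return lambda su, tv: kx(su[0], tv[0]) * ku(su[1], tv[1])

def greedy_action(q_posterior, x, candidate_actions):
    """Choose the action maximizing the posterior Q-mean at state x.
    `q_posterior((x, u))` is assumed to return (mean, variance), e.g. the value()
    method of the OnlineMCGPTD sketch above run on state-action pairs."""
    means = [q_posterior((x, u))[0] for u in candidate_actions]
    return candidate_actions[int(np.argmax(means))]
```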
Figure 1. The posterior value mean (left), posterior value variance (center), and the corresponding greedy policy (right), for the maze shown here, after 200 learning episodes. The goal is at the bottom left.
Figure 2. MC-GPTD compared with the original GPTD on the random-walk MDP. The mean error after 400 episodes for each algorithm is plotted against Pr(right).
We then computed the root-mean-squared error (RMSE) between the resulting value estimates and the true value function (which is easily computed). In all of the experiments (except when Pr(right) = 1, as explained below), the error of the original GPTD converged well before the 100 episode mark. The results of this experiment are shown in Fig. 2. Both algorithms converge to the same low error for Pr(right) = 1 (i.e. a deterministic policy), with almost identical learning curves (not shown). However, as Pr(right) is lowered, making the transitions stochastic, the original GPTD algorithm converges to inconsistent value estimates, whereas MC-GPTD produces consistent results. This bias was already observed and reported in Engel et al. (2003) as the "dampening" of the value estimates.
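The full experimental setup is not reproduced in this excerpt, but the "easily computed" true value function of a simple random-walk chain can be obtained by solving the fixed-policy Bellman equation (2.3) directly, as in the sketch below. The chain parameters (number of states, boundary behaviour, reward vector) are assumptions for illustration, not the paper's exact MDP.

```python
import numpy as np

def chain_true_values(n_states, p_right, gamma, mean_rewards):
    """Exact values of an assumed n-state random-walk chain: the agent moves right
    with probability p_right and left otherwise, with moves off the ends treated
    as self-transitions.  Solves V = r_bar + gamma * P V (Eq. 2.3)."""
    P = np.zeros((n_states, n_states))
    for i in range(n_states):
        P[i, min(i + 1, n_states - 1)] += p_right
        P[i, max(i - 1, 0)] += 1.0 - p_right
    r_bar = np.asarray(mean_rewards, dtype=float)
    return np.linalg.solve(np.eye(n_states) - gamma * P, r_bar)

def rmse(v_hat, v_true):
    return float(np.sqrt(np.mean((np.asarray(v_hat) - np.asarray(v_true)) ** 2)))
```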
In Engel et al. (2003) GPTD was presented as an algorithm for learning a posterior distribution over value functions, for MDPs with stochastic rewards, but deterministic transitions. This was done by connecting the hidden values and the observed rewards by a generative model of the form of Eq. 2.8, in which the covariance of the noise term was diagonal: Σ_t = σ² I. In the present paper we derived a second GPTD algorithm that overcomes the limitation of the first algorithm to deterministic transitions. We did this by invoking a useful decomposition of the discounted return random process into the value process, modeling our uncertainty concerning the MDP's model, and a zero-mean residual (2.4), modeling the MDP's intrinsic stochasticity; and by additionally assuming independence of the residuals. Surprisingly, all that this amounts to is the replacement of the diagonal noise covariance, employed in Engel et al. (2003), with a tridiagonal, correlated noise covariance: Σ_t = σ² H_t H_t^⊤. This change induces a model that we have shown to be effectively equivalent to GP regression of Monte-Carlo samples of the discounted return. This should help uncover the implicit assumption underlying some of the most prevalent MC value estimation methods (e.g., TD(1) and LSTD(1)), namely, that the samples of the discounted return used are IID. Although in most realistic problems this assumption is clearly wrong, we nevertheless know that MC estimates, although not necessarily optimal, are asymptotically consistent.

We are therefore inclined to adopt a broader view of GPTD as a general GP-based framework for Bayesian modeling of value functions, encompassing all generative models of the form R = H_t V + N, with H_t given by (2.7), a Gaussian prior placed on V, and an arbitrary zero-mean Gaussian noise process N. No doubt, most such models will be meaningless from a value estimation point of view, while others will not admit efficient recursive algorithms for computing the posterior value moments. However, if the noise covariance Σ_t is suitably chosen, and if it is additionally simple
in some way, we may be able to derive such a recursive algorithm to compute complete posterior value distributions on-line. For instance, it turns out that by employing alternative forms of noise covariance, we are able to obtain GP-based variants of LSTD(λ) (Engel, 2005).
The second contribution of this paper is the extension of GPTD to the estimation of state-action values, or Q-values, leading to the GPSARSA algorithm. Learning Q-values makes the task of policy improvement in the absence of a transition model tenable, even when the action space is continuous, as demonstrated by the example in Section 5. The availability of confidence intervals for Q-values significantly expands the repertoire of possible exploration strategies. In finite MDPs, strategies employing such confidence intervals have been experimentally shown to perform more efficiently than conventional ε-greedy or Boltzmann sampling strategies, e.g., Kaelbling (1993); Dearden et al. (1998); Even-Dar et al. (2003). GPSARSA allows such methods to be applied to infinite MDPs, and it remains to be seen whether significant improvements can be so achieved for realistic problems with continuous state and action spaces.
In Rasmussen and Kuss (2004) an alternative approach to employing GPs in RL is proposed. The approach in that paper is fundamentally different from the generative approach of the GPTD framework. In Rasmussen and Kuss (2004) one GP is used to learn the MDP's transition model, while another is used to estimate the value. This leads to an inherently off-line algorithm, which is not capable of interacting with the controlled system directly and updating its estimates as additional data arrive. There are several other shortcomings that limit the usefulness of that framework. First, the state dynamics is assumed to be factored, in the sense that each state coordinate is assumed to evolve in time independently of all others. This is a rather strong assumption that is not likely to be satisfied in many real problems. Moreover, it is also assumed that the reward function is completely known in advance, and is of a very special form, either polynomial or Gaussian. Finally, the covariance kernels used are also restricted to be either polynomial or Gaussian or a mixture of the two, due to the need to integrate over products of GPs. This considerably diminishes the appeal of employing GPs, since one of the main reasons for using them, and kernel methods in general, is the richness of expression inherent in the ability to construct arbitrary kernels, reflecting domain and problem-specific knowledge, and defined over sets of diverse objects, such as text documents and DNA sequences (to name only two), and not just
points in metric space. Preliminary results on a high dimensional control task, in which GPTD is used in learning to control a simulated robotic "Octopus" arm, with 88 state variables, suggest that the kernel-based GPTD framework is not limited to low dimensional domains such as those experimented with here (Engel et al., 2005). Standing challenges for future work include balancing exploration and exploitation in RL using the value confidence intervals provided by GPTD methods; further exploring the space of GPTD models by considering additional noise covariance structures; application of the GPTD methodology to POMDPs; creating a GP-Actor-Critic architecture; GPQ-Learning for off-policy learning of the optimal policy; and analyzing the convergence properties of GPTD.
Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.
Dearden, R., Friedman, N., & Russell, S. (1998). Bayesian Q-learning. Proc. of the Fifteenth National Conference on Artificial Intelligence.
Engel, Y. (2005). Algorithms and Representations for Reinforcement Learning. Doctoral dissertation, The Hebrew University of Jerusalem. www.cs.ualberta.ca/∼yaki.
Engel, Y., Mannor, S., & Meir, R. (2003). Bayes meets Bellman: The Gaussian process approach to temporal difference learning. Proc. of the 20th International Conference on Machine Learning.
Engel, Y., Szabo, P., & Volkinstein, D. (2005). Learning to control an Octopus arm using Gaussian process temporal difference learning. www.cs.ualberta.ca/∼yaki/reports/octopus.pdf.
Even-Dar, E., Mannor, S., & Mansour, Y. (2003). Action elimination and stopping conditions for reinforcement learning. Proc. of the 20th International Conference on Machine Learning.
Kaelbling, L. P. (1993). Learning in Embedded Systems. MIT Press.
Mannor, S., Simester, D., Sun, P., & Tsitsiklis, J. (2004). Bias and variance in value function estimation. Proc. of the 21st International Conference on Machine Learning.
Rasmussen, C., & Kuss, M. (2004). Gaussian processes in reinforcement learning. Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press.
Scharf, L. (1991). Statistical Signal Processing. Addison-Wesley.
Schölkopf, B., & Smola, A. (2002). Learning with Kernels. Cambridge, MA: MIT Press.
Sutton, R., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.