The Uncertainty Bellman Equation and Exploration

Brendan O’Donoghue, Ian Osband, Remi Munos, Volodymyr Mnih

DeepMind

{bodonoghue, iosband, munos, vmnih}@google.com

September 19, 2017

Abstract. We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the estimated value of any fixed policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for ε-greedy improves DQN performance on 51 out of 57 games in the Atari suite.

1 Introduction

We consider the reinforcement learning (RL) problem of an agent interacting with its environment to maximize cumulative rewards through time [35]. We model the environment as a Markov decision process (MDP), but where the agent is initially uncertain of the true dynamics of the MDP [4, 5]. At each time-step, the agent performs an action, receives a reward, and moves to the next state; from these data it can learn which actions lead to higher payoffs. This leads to the exploration versus exploitation trade-off: Should the agent investigate poorly understood states and actions to improve future performance or instead take actions that maximize rewards given its current knowledge?

Separating estimation and control in RL via ‘greedy’ algorithms can lead to premature and suboptimal exploitation. To offset this, the majority of practical implementations introduce some random noise or dithering into their action selection (such as ε-greedy). These algorithms will eventually explore every reachable state and action infinitely often, but can take exponentially long to learn the optimal policy [12]. By contrast, for any set of prior beliefs the optimal exploration policy can be computed directly by dynamic programming in the Bayesian belief space. However, this approach can be computationally intractable for even very small problems [9], while direct computational approximations can fail spectacularly badly [22].

For this reason, most provably-efficient approaches to reinforcement learning rely upon the optimism in the face of uncertainty (OFU) heuristic [14, 13, 7]. These algorithms give a bonus to poorly-understood states and actions and subsequently follow the policy that is optimal for this augmented optimistic MDP. This optimism incentivises exploration but, as the agent learns more about the environment, the scale of the bonus should decrease and the agent’s performance should approach optimality. At a high level these approaches to OFU-RL build up confidence sets that contain the true MDP with high probability [33, 16, 11]. These techniques can provide performance guarantees that are ‘near-optimal’ in terms of the problem parameters. However, apart from the simple ‘multi-armed bandit’ setting with only one state, there is still a significant gap between the upper and lower bounds for these algorithms [15, 11, 27]. One inefficiency in these algorithms is that, although the concentration may be tight at each state and action independently, the combination of simultaneously optimistic estimates may result in an extremely over-optimistic estimate for the MDP as a whole [28]. Other works have suggested that a Bayesian posterior sampling approach may not suffer from these inefficiencies and can lead to performance improvements over OFU methods [34, 25].

In this paper we explore an alternative approach that harnesses the simple relationship of the uncertainty Bellman equation (UBE), where we define uncertainty to be the variance of the value estimator the agent is learning, in a sense similar to the parametric variance of Mannor et al. [17]. Intuitively speaking, if the agent has high uncertainty (as measured by high estimator variance) in a region of the state-space then it should explore there, in order to get a better estimate of those Q-values. We show that, just as the Bellman equation relates the value of a policy beyond a single time-step, so too does the uncertainty Bellman equation propagate uncertainty values over multiple time-steps, thereby facilitating ‘deep exploration’ [26]. The benefit of our approach (which learns the solution to the UBE and uses this to guide exploration) is that we can harness the existing machinery for deep reinforcement learning with minimal change to existing network architectures. The resulting algorithm shares an intimate connection to the existing literature on both OFU and intrinsic motivation [31, 30]. Recent work has further connected these approaches through the notion of ‘pseudo-count’ [2, 29], or some generalization of the number of visits to a state and action. Rather than pseudo-count, our work builds upon the idea that the more fundamental quantity is the uncertainty of the estimated value function, and that naively compounding count-based bonuses may lead to inefficient confidence sets [28]. The key difference is that the UBE compounds the sum of the variances at each step, rather than standard deviations. The observation that the higher moments of a value function also satisfy a form of Bellman equation is not new and has been observed by some of the early papers on the subject [32]. Unlike most prior work, we focus upon the epistemic uncertainty over the mean of the value function, rather than the higher moments of the reward-to-go [16, 1, 18]. For application to rich environments with complex generalization we will use a deep learning architecture to learn a solution to the UBE according to our observed data, in the style of [37].

2 Problem formulation

We consider an infinite horizon, discounted, finite state and action space MDP, with state space S, action space A and rewards at each time period denoted by r_t ∈ R. A policy π : S × A → R+ is a mapping from a state-action pair to the probability of taking that action at that state. At each time-step t the agent receives a state s_t and a reward r_t and selects an action a_t from the policy π_t, and the agent moves to the next state s_{t+1} ∼ P(·, s_t, a_t). Here P(s′, s, a) is the probability of transitioning from state s to state s′ after taking action a. The goal of the agent is to maximize the expected total discounted return J under its policy π, where

J(π) = E[ ∑_{t=0}^∞ γ^t r_t | π ].

Here the expectation is with respect to the initial state distribution, the state-transition probabilities, and the policy π. The discount factor γ ∈ (0, 1) controls how much the agent prioritizes long-term versus short-term rewards. The action-value, or Q-value, of a particular state and action under policy π is the expected total discounted return from taking that action at that state and following π thereafter,

Q^π(s, a) = E[ ∑_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a, π ].

The value of state s under policy π, V^π(s) = E[Q^π(s, a) | a ∼ π], is the expected total discounted return starting from state s under π.
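For concreteness, Q^π for a small tabular MDP can be computed by iterating the Bellman backup Q ← R + γ E_{s′∼P, a′∼π} Q(s′, a′). The following NumPy sketch is illustrative only; the toy transition probabilities, rewards, and policy are made up and are not from the paper:

import numpy as np

# Tabular policy evaluation: iterate the Bellman backup on a toy 2-state, 2-action MDP
# until the gamma-contraction converges to the fixed point Q^pi.
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                    # R[s, a] expected rewards
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],                   # pi[s, a] fixed stochastic policy
               [0.2, 0.8]])

Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    V = (pi * Q).sum(axis=1)                 # V(s') = E_{a' ~ pi} Q(s', a')
    Q_new = R + gamma * (P @ V)              # Bellman backup for every (s, a)
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        break
    Q = Q_new
print(Q)                                     # approximates Q^pi for the toy MDP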

Therefore, if we know, or can bound, the variance and bias of an estimator then we can use them to construct intervals that contain the true Q-values with high probability. An agent can then apply the OFU principle to prioritize its exploration towards potentially rewarding policies [13, 22]. We argue that, for many settings of interest, this error is dominated by the variance term and that, in this case, several simplifying relationships emerge.

Lemma 1. For any policy π and any state-action pair (s, a), the biases satisfy a Bellman equation

bias Q̂^π(s, a) = E_d ε(s, a) + γ E_{s′,a′} bias Q̂^π(s′, a′). (6)

Proof. Take the expectation of (3) with respect to d and note that E_d δ(s, a) = bias Q̂^π(s, a).

For the purposes of our analysis we will assume that the Bellman residuals at any state-action pair are uncorrelated. This property will certainly not hold in all settings, but may be a reasonable approximation in many settings of interest.

Assumption 1. For any s, s′ ∈ S and a, a′ ∈ A,

cov(ε(s, a), ε(s′, a′)) ≤ min(var ε(s, a), var ε(s′, a′)),

where cov denotes the covariance.

Assumption 1 implies that the variance of the Q-value estimate satisfies a Bellman inequality.

Lemma 2. For any policy π and any estimator that satisfies Assumption 1, the variance satisfies a Bellman inequality at all (s, a),

var Q̂^π(s, a) ≤ β var ε(s, a) + γ^2 E_{s′,a′} var Q̂^π(s′, a′), (7)

for some β ∈ [1, (1 + γ)/(1 − γ)]; we write β* for the minimum such β that satisfies this relationship.

Proof. Let var ε(s, a) = E_d[(ε(s, a) − E_d ε(s, a))^2] be the variance of the Bellman residual at (s, a). We will refer to this quantity as the local (or shallow) uncertainty from finite data. We now consider the variance of the estimator:

var Q̂^π(s, a) = E_d[ (Q̂^π(s, a) − E_d Q̂^π(s, a))^2 ]
             = E_d[ (Q̂^π(s, a) − Q^π(s, a) − bias Q̂^π(s, a))^2 ]
             = E_d[ (Q̂^π(s, a) − Err(s, a) − γ E_{s′,a′} Q^π(s′, a′) − bias Q̂^π(s, a))^2 ]
             = E_d[ (ε(s, a) − E_d ε(s, a) + γ E_{s′,a′}[Q̂^π(s′, a′) − E_d Q̂^π(s′, a′)])^2 ],

where in the last line we used Lemma 1. Expanding the square we obtain

var Q̂^π(s, a) = var ε(s, a) + γ^2 E_d[ (E_{s′,a′}[Q̂^π(s′, a′) − E_d Q̂^π(s′, a′)])^2 ] + 2γ c(s, a),

where c(s, a) is the cross-term. In the appendix we prove that under Assumption 1 this can be bounded as

c(s, a) ≤ α var ε(s, a), (8)

where 0 ≤ α ≤ 1/(1 − γ) is a constant that might depend on the MDP, the policy, and the estimator. By Jensen's inequality we have that

E_d[ (E_{s′,a′}[Q̂^π(s′, a′) − E_d Q̂^π(s′, a′)])^2 ] ≤ E_{d,s′,a′}[ (Q̂^π(s′, a′) − E_d Q̂^π(s′, a′))^2 ] = E_{s′,a′} var Q̂^π(s′, a′).

Combining this result with (8) we can then say

var Q̂^π(s, a) ≤ β var ε(s, a) + γ^2 E_{s′,a′} var Q̂^π(s′, a′), (9)

for some 1 ≤ β ≤ (1 + γ)/(1 − γ).

With these lemmas we are ready to prove our main theorem.

Theorem 1 (Solution of the uncertainty Bellman equation). Under Assumption 1, for any policy π, let β* ∈ [1, (1 + γ)/(1 − γ)] be the smallest β that satisfies the conditions of Lemma 2. Then there exists a unique u* that satisfies the uncertainty Bellman equation

u*(s, a) = (T_u^π u*)(s, a) := β* var ε(s, a) + γ^2 E_{s′,a′} u*(s′, a′) (10)

for each (s, a), and u* ≥ var Q̂^π pointwise.

Proof. To show this we use three essential properties of the Bellman operator for a fixed policy [5]. First, the Bellman operator is a γ-contraction in the ℓ∞ norm, so the fixed point u* exists and is unique. Second, value iteration converges, in that (T_u^π)^k x → u* for any starting point x. Finally, the Bellman operator is monotonically increasing in its arguments, i.e., if x ≥ y pointwise then T_u^π x ≥ T_u^π y pointwise. Since the variance satisfies the Bellman inequality (9), we have

var Q̂^π ≤ T_u^π var Q̂^π ≤ lim_{k→∞} (T_u^π)^k var Q̂^π = u*. (11)
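Because T_u^π is a contraction, u* can be computed with exactly the same value-iteration machinery used for ordinary policy evaluation. The NumPy sketch below is illustrative only; the toy MDP, policy, and local uncertainties var ε are randomly generated stand-ins:

import numpy as np

# Uncertainty value iteration: u <- beta * var_eps + gamma^2 * E_{s'~P, a'~pi}[u(s', a')],
# converging to the unique UBE fixed point u*.
n_states, n_actions, gamma, beta = 3, 2, 0.95, 1.0
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s']
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # pi[s, a]
var_eps = rng.uniform(0.0, 1.0, size=(n_states, n_actions))        # local uncertainties

u = np.zeros((n_states, n_actions))
for _ in range(5000):
    u_next = (pi * u).sum(axis=1)            # E_{a' ~ pi} u(s', a')
    u_new = beta * var_eps + gamma**2 * (P @ u_next)
    if np.max(np.abs(u_new - u)) < 1e-12:
        break
    u = u_new
print(u)   # approximates u*, a pointwise upper bound on var Q_hat^pi under Assumption 1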

3.2 Comparison to traditional exploration bonus

Consider a simple decision problem with known deterministic transitions, unknown rewards, and two actions. We imagine an agent has gathered some data d and produced some unbiased value estimates. According to these estimates, the first action leads to a single reward with expectation zero and variance σ^2. The second action leads to an infinite chain of independent states, each yielding a reward with expectation zero and variance σ^2(1 − γ^2). These numbers are chosen so that the estimated value of each action has mean zero and variance σ^2. An optimistic agent motivated by (5) has no reason to value one action over the other. Nonetheless, most existing approaches to optimism that work via an exploration bonus would lead to an inconsistent decision rule in this setting [28]. Rather than consider the variance of the value as a whole, the majority of existing approaches to OFU provide exploration bonuses at each state and action independently and then combine these estimates via a union bound. In this context, even a state of the art algorithm such as UCRL2 [11] would afford each state a bonus proportional to the standard deviation of its estimate. For action one this would be proportional to σ, but for action two this would be proportional to

ExplorationBonus(a_2) ∝ ∑_{t=0}^T γ^t σ √(1 − γ^2) = σ √((1 + γ)/(1 − γ))
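To make the gap concrete, a quick numeric check of this example (illustrative only, with σ = 1 and γ = 0.99) shows the compounded standard-deviation bonus exceeding the value-level standard deviation σ by more than an order of magnitude:

import numpy as np

sigma, gamma = 1.0, 0.99
per_state_std = sigma * np.sqrt(1.0 - gamma**2)           # std of each estimate in the chain

# Bonus that compounds standard deviations along the chain: ~ sigma * sqrt((1+gamma)/(1-gamma))
compounded_bonus = sum(gamma**t * per_state_std for t in range(100000))
print(compounded_bonus, sigma * np.sqrt((1 + gamma) / (1 - gamma)))   # both ~ 14.1

# Compounding variances instead gives the value-level std of the chain action: just sigma
chain_value_std = np.sqrt(sum(gamma**(2*t) * per_state_std**2 for t in range(100000)))
print(chain_value_std)                                    # ~ 1.0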

by the Sherman-Morrison-Woodbury formula [8], the cost of this update is one matrix multiply and one matrix-matrix subtraction per step.

Neural network value estimate. If we are approximating our Q-value function using a neural network then the above analysis does not hold. However, if the last layer of the network is linear, then the Q-values are approximated as Q^π(s, a) = φ(s)^T w_a, where w_a are the weights of the last layer associated with action a and φ(s) is the output of the network up to the last layer for state s. In other words, we can think of a neural network as learning a useful set of basis functions such that a linear combination of them approximates the Q-values. Then, if we ignore the uncertainty in the φ mapping, we can reuse the analysis for the purely linear case to derive an approximate measure of local uncertainty that might be useful in practice. This scheme has some advantages. As the agent progresses it is learning a state representation that helps it achieve the goal of maximizing the return. The agent will learn to pay attention to small but important details (e.g., the ball in Atari ‘breakout’) and learn to ignore large but irrelevant changes (e.g., if the background suddenly changes). This is a desirable property from the point of view of using these features to drive exploration, because states that differ only in irrelevant ways will be aliased to (roughly) the same state representation, and states that differ in small but important ways will be mapped to quite different state vectors, permitting a more task-relevant measure of local uncertainty.
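A minimal sketch of this last-layer scheme is given below. It is illustrative only: it assumes eq. (14) is the standard Sherman-Morrison rank-one update of a per-action matrix Σ_a (the posterior covariance of a Bayesian linear regression on the features φ(s)), and the feature dimension and prior scale μ are made-up values:

import numpy as np

feature_dim, n_actions, mu = 512, 4, 1.0
Sigma = [mu * np.eye(feature_dim) for _ in range(n_actions)]   # one Sigma_a per action

def local_uncertainty(phi, a):
    # Local (shallow) uncertainty estimate phi(s)^T Sigma_a phi(s)
    return float(phi @ Sigma[a] @ phi)

def update_sigma(phi, a):
    # Sigma_a <- Sigma_a - (Sigma_a phi)(Sigma_a phi)^T / (1 + phi^T Sigma_a phi):
    # one matrix-vector product and one rank-one subtraction per step.
    Sphi = Sigma[a] @ phi
    Sigma[a] -= np.outer(Sphi, Sphi) / (1.0 + phi @ Sphi)

# Usage with a random vector standing in for the last-layer activations phi(s):
phi = np.random.randn(feature_dim)
print(local_uncertainty(phi, a=0))   # large before any data for action 0
update_sigma(phi, a=0)
print(local_uncertainty(phi, a=0))   # shrinks after observing (s, a=0)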

5 Algorithm for Deep Reinforcement Learning

In this section we describe an exploration heuristic for deep reinforcement learning whereby we attempt to learn both the Q-values and the uncertainties associated with them simultaneously (we assume the biases are small enough to ignore). The goal is for the agent to explore areas where it learns that it has higher uncertainty. This is in contrast to the commonly used ε-greedy [20] and Boltzmann exploration strategies [19, 23], which simply inject noise into the agent's actions. Our policy uses Thompson sampling [38], where the action is selected as

a = argmax_b ( Q̂^π(s, b) + α ζ(b) u(s, b)^{1/2} ), (15)

where ζ(b) ∼ N(0, 1) and α > 0 is a hyper-parameter. In this case the probability of selecting action a is the probability that a has the maximum value if each action b were distributed normally with mean Q̂^π(s, b) and variance α^2 u(s, b). The technique is described in pseudo-code in Algorithm 1. We refer to the technique as one-step since the uncertainty values are updated using a one-step SARSA Bellman backup, but it is easily extendable to the n-step case.

The algorithm takes as input a neural network with two output heads: one attempting to learn the optimal Q-values as normal, the other attempting to learn the uncertainty values of the current policy (which is constantly changing). We do not allow the gradients from the uncertainty head to flow into the trunk; this ensures the Q-value estimates are not perturbed by the changing uncertainty signal. For the local uncertainty measure we use the linear basis approximation described previously. We have dropped the constant β from the uncertainty Bellman equation (10) and the unknown σ^2 term from the local uncertainty in equation (13), because they are both absorbed by the hyper-parameter α in the policy. Most of the assumptions that allowed us to bound the true Q-values in equation (5) are violated by this scheme; in particular we have ignored the bias term, and the policy is changing as the Q-values change. However, we might expect this strategy to provide a useful signal of novelty to the agent and therefore perform well in practice.
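In code, the action selection in (15) amounts to drawing one standard normal sample per action and maximizing the perturbed Q-values; the sketch below uses made-up per-action estimates for illustration:

import numpy as np

def select_action(q_values, u_values, alpha=0.01, rng=np.random.default_rng()):
    # a = argmax_b ( Q(s, b) + alpha * zeta(b) * u(s, b)^(1/2) ),  zeta(b) ~ N(0, 1)
    zeta = rng.standard_normal(len(q_values))
    return int(np.argmax(q_values + alpha * zeta * np.sqrt(u_values)))

# Example with made-up Q-value and uncertainty estimates for a 4-action state:
print(select_action(np.array([1.0, 1.1, 0.9, 1.05]), np.array([0.1, 2.0, 0.5, 0.0])))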

Algorithm 1 One-step UBE exploration with linear uncertainty estimates.

// Input: Neural network outputting Q and u estimates
// Input: Q-value learning subroutine qlearn
// Input: Thompson sampling hyper-parameter α > 0
Initialize Σ_a = μI for each a, where μ > 0
Get initial state s, take initial action a
repeat
    Retrieve feature mapping φ(s) from the input to the last layer of the Q-head
    Receive new state s′ and reward r
    Calculate Q-value estimates Q(s′, ·) and uncertainty estimates u(s′, ·)
    Calculate action a′ = argmax_b ( Q(s′, b) + α ζ(b) u(s′, b)^{1/2} ), where ζ(b) ∼ N(0, 1)
    Calculate target
        y = φ(s)^T Σ_a φ(s)                      for terminal s′
        y = φ(s)^T Σ_a φ(s) + γ^2 u(s′, a′)      otherwise
    Take gradient step on the u subnetwork with respect to the error (y − u(s, a))^2
    Update Q-values using qlearn(s, a, r, s′, a′)
    Update Σ_{a′} according to eq. (14)
    Take action a′
until T > Tmax
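The sketch below isolates the uncertainty-head target and update from this loop. It is illustrative rather than a faithful implementation: the features φ(s) are random stand-ins for last-layer activations, the uncertainty head is a linear function of φ trained by plain SGD on the squared error, and eq. (14) is assumed to be the rank-one Σ_a update shown earlier:

import numpy as np

feature_dim, n_actions, gamma, mu, lr = 32, 3, 0.99, 1.0, 1e-3
w_u = np.zeros((n_actions, feature_dim))                      # uncertainty-head weights
Sigma = [mu * np.eye(feature_dim) for _ in range(n_actions)]

def u_estimate(phi, a):
    return float(phi @ w_u[a])

def ube_update(phi_s, a, phi_next, a_next, terminal):
    # One-step SARSA-style backup for the uncertainty head.
    local = float(phi_s @ Sigma[a] @ phi_s)                   # local uncertainty phi^T Sigma_a phi
    y = local if terminal else local + gamma**2 * u_estimate(phi_next, a_next)
    # Gradient step on (y - u(s, a))^2 with respect to the head weights only (trunk frozen).
    w_u[a] += lr * (y - u_estimate(phi_s, a)) * phi_s
    # Assumed form of eq. (14): rank-one update of Sigma_a.
    Sphi = Sigma[a] @ phi_s
    Sigma[a] -= np.outer(Sphi, Sphi) / (1.0 + phi_s @ Sphi)

# Toy usage on random transitions standing in for replayed experience:
rng = np.random.default_rng(0)
for _ in range(100):
    ube_update(rng.standard_normal(feature_dim), int(rng.integers(n_actions)),
               rng.standard_normal(feature_dim), int(rng.integers(n_actions)), terminal=False)
print(u_estimate(rng.standard_normal(feature_dim), 0))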

5.1 Experimental results

Here we present results of Algorithm 1 on the Atari suite of games [3], where the network is attempting to learn the Q-values as in DQN [20, 21] and the uncertainties simultaneously. The only change to vanilla DQN we made was to replace the ε-greedy policy with Thompson sampling over the learned uncertainty values, where the α constant in (15) was chosen to be 0.01 for all games by a parameter sweep. We used the exact same network architecture, learning rate, optimizer, pre-processing and replay scheme as [21]. For the uncertainty head we used a single fully connected hidden layer with 512 hidden units followed by the output layer. We trained the uncertainty head using a separate RMSProp optimizer [39] with learning rate 10^{-3}. The addition of the uncertainty head and the computation associated with it only reduced the frame-rate compared to vanilla DQN by about 10% on a GPU, so the speed cost of the approach is negligible.

We compare two versions of our approach: a 1-step method and an n-step method where we set n to 1. The n-step method accumulates the uncertainty signal over n time-steps before performing an update, which should lead to the uncertainty signal propagating to earlier encountered states faster, at the expense of increased variance of the signal. Note that in all cases the Q-learning update is always 1-step; our n-step implementation only affects the uncertainty update. We compare our approaches to vanilla DQN, and also to an exploration-bonus intrinsic motivation approach, where the agent receives an augmented reward consisting of the extrinsic reward and the square root of the linear uncertainty (13), scaled by a hyper-parameter chosen to be 0.1 by a sweep. In this case a stochastic policy was still required for good performance and so we used ε-greedy with the DQN annealing schedule. In the recent work by Bellemare et al. [2], and the follow-up work [29], the authors add an intrinsic motivation signal to a DQN-style agent that has been modified to use the full Monte Carlo return of the episode when learning the Q-values. Using Monte Carlo returns dramatically improves the performance of DQN in a way unrelated to exploration, and due to that change we cannot compare the numerical results directly. In order to have a point of comparison we implemented our own intrinsic motivation exploration signal, as

Figure 1: Number of games at super-human performance. (Plot of the number of games at 100% human performance against millions of frames, for DQN, DQN Intrinsic Motivation, DQN UBE 1-step, and DQN UBE n-step.)

Figure 2: Median performance across all games. (Plot of median performance against millions of frames, same four agents.)

Figure 3: UBE on Montezuma's Revenge for 500M frames. (Plot of average episode return against million frames, same four agents.)

programming recursion. This uncertainty can be used by the agent to make decisions about which states and actions to explore, in order to gather more data about the environment and learn a better policy. Since the uncertainty satisfies a Bellman recursion, the agent can learn it using the same reinforcement learning machinery that has been developed for value functions. We showed that an algorithm based on this learned uncertainty can boost the performance of standard deep-RL techniques. Our technique was able to improve the average performance of DQN across the Atari suite of games when compared against DQN using ε-greedy.

7 Acknowledgments

We thank Marc Bellemare, David Silver, Koray Kavukcuoglu, Tejas Kulkarni, Mohammad Gheshlaghi Azar, and Bilal Piot for discussion and suggestions on the paper.

References

[1] M. G. Azar, R. Munos, and B. Kappen, On the sample complexity of reinforcement learning with a generative model, in Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

[2] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, Unifying count-based exploration and intrinsic motivation, in Advances in Neural Information Processing Systems, 2016, pp. 1471–1479.

[3] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, The arcade learning environment: An evaluation platform for general agents, Journal of Artificial Intelligence Research, (2012).

[4] R. Bellman, Dynamic programming, Princeton University Press, 1957.

[5] D. P. Bertsekas, Dynamic programming and optimal control, vol. 1, Athena Scientific, 2005.

[6] J. A. Boyan, Least-squares temporal difference learning, in ICML, 1999, pp. 49–56.

[7] R. I. Brafman and M. Tennenholtz, R-max: A general polynomial time algorithm for near-optimal reinforcement learning, Journal of Machine Learning Research, 3 (2002), pp. 213–231.

[8] G. H. Golub and C. F. Van Loan, Matrix computations, vol. 3, JHU Press, 2012.

[9] A. Guez, D. Silver, and P. Dayan, Efficient Bayes-adaptive reinforcement learning using sample-based search, in Advances in Neural Information Processing Systems, 2012, pp. 1025–1033.

[10] G. H. Hardy, J. E. Littlewood, and G. Pólya, Inequalities, Cambridge University Press, 1952.

[11] T. Jaksch, R. Ortner, and P. Auer, Near-optimal regret bounds for reinforcement learning, Journal of Machine Learning Research, 11 (2010), pp. 1563–1600.

[12] S. M. Kakade, On the sample complexity of reinforcement learning, PhD thesis, University of London, London, England, 2003.

[29] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos, Count-based exploration with neural density models, arXiv preprint arXiv:1703.01310, (2017).

[30] J. Schmidhuber, Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes, in Anticipatory Behavior in Adaptive Learning Systems, Springer, 2009, pp. 48–76.

[31] S. P. Singh, A. G. Barto, and N. Chentanez, Intrinsically motivated reinforcement learning, in NIPS, vol. 17, 2004, pp. 1281–1288.

[32] M. J. Sobel, The variance of discounted Markov decision processes, Journal of Applied Probability, 19 (1982), pp. 794–802.

[33] A. Strehl and M. Littman, Exploration via model-based interval estimation, 2004.

[34] M. Strens, A Bayesian framework for reinforcement learning, in ICML, 2000, pp. 943–950.

[35] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

[36] R. S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning, 3 (1988), pp. 9–44.

[37] A. Tamar, D. Di Castro, and S. Mannor, Learning the variance of the reward-to-go, Journal of Machine Learning Research, 17 (2016), pp. 1–36.

[38] W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, 25 (1933), pp. 285–294.

[39] T. Tieleman and G. Hinton, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning, 4 (2012).

[40] C. J. C. H. Watkins, Learning from delayed rewards, PhD thesis, University of Cambridge, England, 1989.

8 Appendix

8.1 Bound on the cross-term

Here we prove the bound on the cross-term given in equation (8):

c(s, a) = E_d[ (ε(s, a) − E_d ε(s, a)) E_{s′,a′}[Q̂^π(s′, a′) − Q^π(s′, a′) − bias Q̂^π(s′, a′)] ]
        = E_{s′,a′} E_d[ (ε(s, a) − E_d ε(s, a)) (ε(s′, a′) − E_d ε(s′, a′) + γ E_{s′′,a′′}[Q̂^π(s′′, a′′) − Q^π(s′′, a′′) − bias Q̂^π(s′′, a′′)]) ]
        = E_{s′,a′,s′′,a′′,...} E_d[ (ε(s, a) − E_d ε(s, a)) × (ε(s′, a′) − E_d ε(s′, a′) + γ(ε(s′′, a′′) − E_d ε(s′′, a′′)) + γ^2(...)) ]
        ≤ max_{s′,a′,s′′,a′′,...} E_d[ (ε(s, a) − E_d ε(s, a)) × (ε(s′, a′) − E_d ε(s′, a′) + γ(ε(s′′, a′′) − E_d ε(s′′, a′′)) + γ^2(...)) ]
        = var ε(s, a)(1 + γ + γ^2 + ...) = var ε(s, a)/(1 − γ),

where we have repeatedly used the fact that Q̂^π(t, b) − Q^π(t, b) = γ E_{t′,b′}(Q̂^π(t′, b′) − Q^π(t′, b′)) + ε(t, b) for any (t, b), the fact that the biases satisfy a Bellman equation, and Assumption 1, which implies that the max over each (s′, a′) is attained when (s′, a′) = (s, a).

If we have more knowledge about the policy and the MDP then we can provide tighter bounds. For example, if we know that under the policy the agent cannot visit the same state before T time-steps, then the cross-term can be bounded by c(s, a) ≤ γ^{T−1} var ε(s, a)/(1 − γ^T) and the multiplicative factor becomes (1 + γ^T)/(1 − γ^T). In particular, a policy which never visits the same state twice (e.g., where the MDP is acyclic) has c(s, a) = 0, and the multiplicative factor in the uncertainty Bellman equation is one.

8.2 Learning curves

Per-game learning curves (average episode return against million frames, 0–200M) comparing DQN, DQN Intrinsic Motivation, DQN UBE 1-step, and DQN UBE n-step. Games shown: alien, amidar, assault, asterix, asteroids, atlantis, bank_heist, battle_zone, beam_rider, berzerk, bowling, boxing, fishing_derby, freeway, frostbite, gopher, gravitar, hero, ice_hockey, jamesbond, kangaroo, krull, kung_fu_master, montezuma_revenge, ms_pacman, name_this_game, phoenix, pitfall.