Abstract

We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the estimated value of any fixed policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for ε-greedy improves DQN performance on 51 out of 57 games in the Atari suite.
1 Introduction

We consider the reinforcement learning (RL) problem of an agent interacting with its environment to maximize cumulative rewards through time [35]. We model the environment as a Markov decision process (MDP), but where the agent is initially uncertain of the true dynamics of the MDP [4, 5]. At each time-step, the agent performs an action, receives a reward, and moves to the next state; from these data it can learn which actions lead to higher payoffs. This leads to the exploration versus exploitation trade-off: should the agent investigate poorly understood states and actions to improve future performance, or instead take actions that maximize rewards given its current knowledge?

Separating estimation and control in RL via ‘greedy’ algorithms can lead to premature and suboptimal exploitation. To offset this, the majority of practical implementations introduce some random noise or dithering into their action selection (such as ε-greedy). These algorithms will eventually explore every reachable state and action infinitely often, but can take exponentially long to learn the optimal policy [12]. By contrast, for any set of prior beliefs the optimal exploration policy can be computed directly by dynamic programming in the Bayesian belief space. However, this approach can be computationally intractable for even very small problems [9], while direct computational approximations can fail spectacularly badly [22].

For this reason, most provably-efficient approaches to reinforcement learning rely upon the optimism in the face of uncertainty (OFU) heuristic [14, 13, 7]. These algorithms give a bonus to poorly-understood states and actions and subsequently follow the policy that is optimal for this augmented optimistic MDP. This optimism incentivises exploration but, as the agent learns more about the environment, the scale of the bonus should decrease and the agent's performance should approach optimality. At a high level these approaches to OFU-RL build up confidence sets that contain the true MDP with high probability [33, 16, 11].
These techniques can provide performance guarantees that are ‘near-optimal’ in terms of the problem parameters. However, apart from the simple ‘multi-armed bandit’ setting with only one state, there is still a significant gap between the upper and lower bounds for these algorithms [15, 11, 27]. One inefficiency in these algorithms is that, although the concentration may be tight at each state and action independently, the combination of simultaneously optimistic estimates may result in an extremely over-optimistic estimate for the MDP as a whole [28]. Other works have suggested that a Bayesian posterior sampling approach may not suffer from these inefficiencies and can lead to performance improvements over OFU methods [34, 25].

In this paper we explore an alternative approach that harnesses the simple relationship of the uncertainty Bellman equation (UBE), where we define uncertainty to be the variance of the value estimator the agent is learning, in a sense similar to the parametric variance of Mannor et al. [17]. Intuitively speaking, if the agent has high uncertainty (as measured by high estimator variance) in a region of the state-space then it should explore there, in order to get a better estimate of those Q-values. We show that, just as the Bellman equation relates the value of a policy beyond a single time-step, so too does the uncertainty Bellman equation propagate uncertainty values over multiple time-steps, thereby facilitating ‘deep exploration’ [26]. The benefit of our approach (which learns the solution to the UBE and uses this to guide exploration) is that we can harness the existing machinery for deep reinforcement learning with minimal change to existing network architectures.

The resulting algorithm shares an intimate connection to the existing literature of both OFU and intrinsic motivation [31, 30]. Recent work has further connected these approaches through the notion of ‘pseudo-count’ [2, 29], or some generalization of the number of visits to a state and action. Rather than pseudo-count, our work builds upon the idea that the more fundamental quantity relates to the uncertainty of the estimated value function, and that naively compounding count-based bonuses may lead to inefficient confidence sets [28]. The key difference is that the UBE compounds the sum of the variances at each step, rather than standard deviation. The observation that the higher moments of a value function also satisfy a form of Bellman equation is not new and has been observed by some of the early papers on the subject [32]. Unlike most prior work, we focus upon the epistemic uncertainty over the mean of the value function, rather than the higher moments of the reward-to-go [16, 1, 18]. For application to rich environments with complex generalization we will use a deep learning architecture to learn a solution to the UBE according to our observed data, in the style of [37].
2 Problem formulation
We consider an infinite horizon, discounted, finite state and action space MDP, with state space S, action space A and rewards at each time period denoted by r_t ∈ R. A policy π : S × A → R+ is a mapping from a state-action pair to the probability of taking that action at that state. At each time-step t the agent receives a state s_t and a reward r_t and selects an action a_t from the policy π_t, and the agent moves to the next state s_{t+1} ∼ P(·, s_t, a_t). Here P(s′, s, a) is the probability of transitioning from state s to state s′ after taking action a. The goal of the agent is to maximize the expected total discounted return J under its policy π, where

J(π) = E[ Σ_{t=0}^∞ γ^t r_t | π ].

Here the expectation is with respect to the initial state distribution, the state-transition probabilities, and the policy π. The discount factor γ ∈ (0, 1) controls how much the agent prioritizes long-term versus short-term rewards. The action-value, or Q-value, of a particular state-action pair under policy π is the expected total discounted return from taking that action at that state and following π thereafter,

Q^π(s, a) = E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a, π ].

The value of state s under policy π, V^π(s) = E[ Q^π(s, a) | a ∼ π ], is the expected total discounted return from state s when actions are drawn from π.
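To make these objects concrete, here is a minimal tabular policy-evaluation sketch (my own illustration, not part of the paper); the arrays P, R and pi are assumed, hypothetical inputs describing the MDP and the policy:

```python
import numpy as np

def policy_q_values(P, R, pi, gamma=0.99, iters=1000):
    """Evaluate Q^pi for a small tabular MDP by iterating the Bellman backup.

    P  : (S, A, S) transition probabilities P[s, a, s']
    R  : (S, A) expected immediate rewards
    pi : (S, A) policy probabilities pi[s, a]
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = (pi * Q).sum(axis=1)   # V^pi(s) = E[Q^pi(s, a) | a ~ pi]
        Q = R + gamma * (P @ V)    # Bellman backup for the fixed policy pi
    return Q
```

The uncertainty Bellman equation developed below propagates uncertainties with exactly this kind of backup.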
Therefore if we know, or can bound, the variance and bias of an estimator, then we can use them to construct intervals that contain the true Q-values with high probability. An agent can then apply the OFU principle to prioritize its exploration towards potentially rewarding policies [13, 22]. We argue that, for many settings of interest, this error is dominated by the variance term and that, in this case, several simplifying relationships emerge.
Lemma 1. For any policy π and any state-action pair (s, a), the biases satisfy a Bellman equation
bias Q̂^π(s, a) = E_d ε(s, a) + γ E_{s′,a′} bias Q̂^π(s′, a′). (6)

Proof. Take the expectation of (3) with respect to d and note that E_d δ(s, a) = bias Q̂^π(s, a).
For the purposes of our analysis we will assume that the Bellman residuals at any state-action pair are uncorrelated. This property will certainly not hold in all settings, but may be a reasonable approximation in many settings of interest.
Assumption 1. For any s, s′ ∈ S and a, a′ ∈ A,

cov(ε(s, a), ε(s′, a′)) ≤ min(var ε(s, a), var ε(s′, a′)),

where cov denotes the covariance.
Assumption 1 implies that the variance of the Q-value estimate satisfies a Bellman inequality.
Lemma 2. For any policy π and any estimator that satisfies assumption 1, the variance satisfies a Bellman inequality at all (s, a),
var Q̂^π(s, a) ≤ β var ε(s, a) + γ^2 E_{s′,a′} var Q̂^π(s′, a′), (7)

for some β ∈ [1, (1 + γ)/(1 − γ)], and we write β* for the minimum such β that satisfies this relationship.
Proof. Let var ε(s, a) = E_d[(ε(s, a) − E_d ε(s, a))^2] be the variance of the Bellman residual at (s, a). We will refer to this quantity as the local (or shallow) uncertainty from finite data. We now consider the variance of the estimator:

var Q̂^π(s, a) = E_d[ (Q̂^π(s, a) − E_d Q̂^π(s, a))^2 ]
= E_d[ (Q̂^π(s, a) − Q^π(s, a) − bias Q̂^π(s, a))^2 ]
= E_d[ (Q̂^π(s, a) − E r(s, a) − γ E_{s′,a′} Q^π(s′, a′) − bias Q̂^π(s, a))^2 ]
= E_d[ ( ε(s, a) − E_d ε(s, a) + γ E_{s′,a′}[ Q̂^π(s′, a′) − E_d Q̂^π(s′, a′) ] )^2 ],

where in the last line we used Lemma 1. Expanding the square we obtain

var Q̂^π(s, a) = var ε(s, a) + γ^2 E_d[ ( E_{s′,a′}[ Q̂^π(s′, a′) − E_d Q̂^π(s′, a′) ] )^2 ] + 2γ c(s, a),

where c(s, a) is the cross-term. In the appendix, we prove that under assumption 1 this can be bounded as

c(s, a) ≤ α var ε(s, a), (8)

where 0 ≤ α ≤ 1/(1 − γ) is a constant that might depend on the MDP, the policy, and the estimator. By Jensen's inequality we have that

E_d[ ( E_{s′,a′}[ Q̂^π(s′, a′) − E_d Q̂^π(s′, a′) ] )^2 ] ≤ E_{d,s′,a′}[ (Q̂^π(s′, a′) − E_d Q̂^π(s′, a′))^2 ] = E_{s′,a′} var Q̂^π(s′, a′).

Combining this result with (8) we can then say

var Q̂^π(s, a) ≤ β var ε(s, a) + γ^2 E_{s′,a′} var Q̂^π(s′, a′), (9)

for some 1 ≤ β ≤ (1 + γ)/(1 − γ).
With these lemmas we are ready to prove our main theorem.
Theorem 1 (Solution of the uncertainty Bellman equation). Under Assumption 1, for any policy π, let β* ∈ [1, (1 + γ)/(1 − γ)] be the smallest β that satisfies the conditions of Lemma 2. Then there exists a unique u* that satisfies the uncertainty Bellman equation

u*(s, a) = (T^π_u u*)(s, a) := β* var ε(s, a) + γ^2 E_{s′,a′} u*(s′, a′) (10)

for each (s, a), and u* ≥ var Q̂^π pointwise.
Proof. To show this we use three essential properties of the Bellman operator for a fixed policy [5]. First, the Bellman operator is a γ-contraction in the ℓ∞ norm, so the fixed point u* exists and is unique. Second, value iteration converges, in that (T^π_u)^k x → u* for any starting point x. Finally, the Bellman operator is monotonically increasing in its arguments, i.e., if x ≥ y pointwise then T^π_u x ≥ T^π_u y pointwise. As the variance satisfies a Bellman inequality (9), we have

var Q̂^π ≤ T^π_u var Q̂^π ≤ lim_{k→∞} (T^π_u)^k var Q̂^π = u*. (11)
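As a concrete illustration of Theorem 1 (a toy sketch, not from the paper), u* can be computed for a small tabular problem by iterating T^π_u, exactly as value iteration does for Q^π; here local_var stands for var ε(s, a), and P and pi are assumed known:

```python
import numpy as np

def solve_ube(local_var, P, pi, gamma=0.9, beta=1.0, iters=1000):
    """Iterate the uncertainty Bellman operator T_u^pi to its unique fixed point u*.

    local_var : (S, A) local uncertainties var eps(s, a)
    P         : (S, A, S) transition probabilities
    pi        : (S, A) policy probabilities
    """
    u = np.zeros_like(local_var)
    for _ in range(iters):
        next_u = P @ (pi * u).sum(axis=1)            # E_{s',a'} u(s', a')
        u = beta * local_var + gamma ** 2 * next_u   # (T_u^pi u)(s, a)
    return u  # upper-bounds var Q-hat^pi pointwise (Theorem 1)
```

Because the operator contracts with modulus γ^2, the iteration converges from any starting point.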
Consider a simple decision problem with known deterministic transitions, unknown rewards, and two actions. We imagine an agent has gathered some data d and produced some unbiased value estimates. According to these estimates, the first action leads to a single reward whose estimate has mean zero and variance σ^2. The second action leads to an infinite chain of independent states, each with a reward estimate of mean zero and variance σ^2(1 − γ^2). These numbers are chosen so that the estimated value of each action has mean zero and variance σ^2. An optimistic agent motivated by (5) has no reason to value one action over the other. Nonetheless, most existing approaches to optimism that work via an exploration bonus would lead to an inconsistent decision rule in this setting [28]. Rather than consider the variance of the value as a whole, the majority of existing approaches to OFU provide exploration bonuses at each state and action independently and then combine these estimates via a union bound. In this context, even a state-of-the-art algorithm such as UCRL2 [11] would afford each state a bonus proportional to the standard deviation of its estimate. For action one this would be proportional to σ, but for action two it would be proportional to
ExplorationBonus(a_2) ∝ Σ_{t=0}^∞ γ^t σ √(1 − γ^2) = σ √((1 + γ)/(1 − γ)),

which is strictly larger than the bonus of σ afforded to action one, even though both value estimates have the same variance.
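As a quick check of the claim that both estimated action-values have variance σ^2 (a step left implicit above), assume the per-state reward estimates r̂_t along the chain are independent, each with variance σ^2(1 − γ^2); then

\[
\operatorname{Var}\Big(\sum_{t=0}^{\infty}\gamma^{t}\hat r_{t}\Big)
=\sum_{t=0}^{\infty}\gamma^{2t}\,\sigma^{2}(1-\gamma^{2})
=\frac{\sigma^{2}(1-\gamma^{2})}{1-\gamma^{2}}
=\sigma^{2},
\]

matching the variance σ^2 of the single-reward estimate for the first action.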
by the Sherman-Morrison-Woodbury formula [8], the cost of this update is one matrix multiply and one matrix-matrix subtraction per step.
Neural network value estimate. If we are approximating our Q-value function using a neural network then the above analysis does not hold. However, if the last layer of the network is linear, then the Q-values are approximated as Q^π(s, a) = φ(s)^T w_a, where w_a are the weights of the last layer associated with action a and φ(s) is the output of the network up to the last layer for state s. In other words, we can think of a neural network as learning a useful set of basis functions such that a linear combination of them approximates the Q-values. Then, if we ignore the uncertainty in the φ mapping, we can reuse the analysis for the purely linear case to derive an approximate measure of local uncertainty that might be useful in practice.

This scheme has some advantages. As the agent progresses, it is learning a state representation that helps it achieve the goal of maximizing the return. The agent will learn to pay attention to small but important details (e.g., the ball in Atari ‘breakout’) and learn to ignore large but irrelevant changes (e.g., if the background suddenly changes). This is a desirable property from the point of view of using these features to drive exploration, because states that differ only in irrelevant ways will be aliased to (roughly) the same state representation, and states that differ in small but important ways will be mapped to quite different state vectors, permitting a more task-relevant measure of local uncertainty.
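A minimal sketch of this last-layer scheme under the assumptions above (the class name, the shapes, and the choice to treat φ(s) as fixed are mine, not the paper's): the local uncertainty for action a is taken to be φ(s)^T Σ_a φ(s), and Σ_a receives a rank-one Sherman-Morrison update each time data for action a arrives.

```python
import numpy as np

class LinearUncertainty:
    """Per-action local uncertainty phi(s)^T Sigma_a phi(s) over last-layer features."""

    def __init__(self, num_actions, feature_dim, mu=1.0):
        # Sigma_a is initialized to mu * I for every action (mu > 0).
        self.sigma = [mu * np.eye(feature_dim) for _ in range(num_actions)]

    def local_uncertainty(self, phi):
        # One scalar per action: phi^T Sigma_a phi.
        return np.array([phi @ S @ phi for S in self.sigma])

    def update(self, a, phi):
        # Rank-one update of Sigma_a via Sherman-Morrison:
        # Sigma_a <- Sigma_a - (Sigma_a phi)(Sigma_a phi)^T / (1 + phi^T Sigma_a phi)
        S = self.sigma[a]
        S_phi = S @ phi
        self.sigma[a] = S - np.outer(S_phi, S_phi) / (1.0 + phi @ S_phi)
```

Each update costs one matrix-vector multiply and one outer-product subtraction, in line with the per-step cost noted above.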
5 Algorithm for Deep Reinforcement Learning
In this section we describe an exploration heuristic for deep reinforcement learning whereby we attempt to learn both the Q-values and the uncertainties associated with them simultaneously (we assume the biases are small enough to ignore). The goal is for the agent to explore areas where it learns that it has higher uncertainty. This is in contrast to the commonly used ε-greedy [20] and Boltzmann exploration strategies [19, 23], which simply inject noise into the agent's actions. Our policy uses Thompson sampling [38], where the action is selected as

a = argmax_b ( Q̂^π(s, b) + α ζ(b) u(s, b)^{1/2} ), (15)

where ζ(b) ∼ N(0, 1) and α > 0 is a hyper-parameter. In this case the probability of selecting action a is the probability that a has the maximum value if each action b were distributed normally with mean Q̂^π(s, b) and variance α^2 u(s, b). The technique is described in pseudo-code in Algorithm 1. We refer to the technique as one-step since the uncertainty values are updated using a one-step SARSA Bellman backup, but it is easily extendable to the n-step case.

The algorithm takes as input a neural network which has two output heads: one attempting to learn the optimal Q-values as normal, the other attempting to learn the uncertainty values of the current policy (which is constantly changing). We do not allow the gradients from the uncertainty head to flow into the trunk; this ensures the Q-value estimates are not perturbed by the changing uncertainty signal. For the local uncertainty measure we use the linear basis approximation described previously. We have dropped the constant β from the uncertainty Bellman equation (10) and the unknown σ^2 term from the local uncertainty in equation (13) because they are both absorbed by the hyper-parameter α in the policy. Most of the assumptions that allowed us to bound the true Q-values in equation (5) are violated by this scheme; in particular we have ignored the bias term, and the policy is changing as the Q-values change. However, we might expect this strategy to provide a useful signal of novelty to the agent and therefore perform well in practice.
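As an illustration only (not the authors' code), the sampled argmax in (15) can be written in a few lines of numpy; q and u stand for the per-action outputs Q̂^π(s, ·) and u(s, ·) of the two heads:

```python
import numpy as np

def ube_thompson_action(q, u, alpha=0.01, rng=np.random):
    """Select a = argmax_b ( Q(s, b) + alpha * zeta_b * sqrt(u(s, b)) ), zeta_b ~ N(0, 1)."""
    zeta = rng.randn(len(q))
    noisy_q = np.asarray(q) + alpha * zeta * np.sqrt(np.maximum(u, 0.0))
    return int(np.argmax(noisy_q))
```

Clipping u at zero guards against small negative values produced by the function approximator.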
Algorithm 1 One-step UBE exploration with linear uncertainty estimates.
// Input: neural network outputting Q and u estimates
// Input: Q-value learning subroutine qlearn
// Input: Thompson sampling hyper-parameter α > 0
Initialize Σ_a = μI for each a, where μ > 0
Get initial state s, take initial action a
repeat
  Retrieve feature mapping φ(s) from the input to the last layer of the Q-head
  Receive new state s′ and reward r
  Calculate Q-value estimates Q(s′, ·) and uncertainty estimates u(s′, ·)
  Calculate action a′ = argmax_b (Q(s′, b) + α ζ(b) u(s′, b)^{1/2}), where ζ(b) ∼ N(0, 1)
  Calculate y = φ(s)^T Σ_a φ(s) if s′ is terminal, and y = φ(s)^T Σ_a φ(s) + γ^2 u(s′, a′) otherwise
  Take a gradient step in the u subnetwork with respect to the error (y − u(s, a))^2
  Update the Q-values using qlearn(s, a, r, s′, a′)
  Update Σ_{a′} according to eq. (14)
  Take action a′
until T > T_max
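To spell out how the uncertainty target y in Algorithm 1 is formed, here is a rough sketch reusing the hypothetical LinearUncertainty helper from the previous section (names and interfaces are mine, for illustration only):

```python
def ube_target(unc, phi_s, a, u_next, a_next, terminal, gamma=0.99):
    """One-step SARSA-style target for the uncertainty head, as in Algorithm 1."""
    y = unc.local_uncertainty(phi_s)[a]   # local uncertainty phi(s)^T Sigma_a phi(s)
    if not terminal:
        y += gamma ** 2 * u_next[a_next]  # propagate uncertainty from (s', a')
    return y  # the u head takes a gradient step on the error (y - u(s, a))^2
```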
Here we present results of Algorithm 1 on the Atari suite of games [3], where the network attempts to learn the Q-values, as in DQN [20, 21], and the uncertainties simultaneously. The only change we made to vanilla DQN was to replace the ε-greedy policy with Thompson sampling over the learned uncertainty values, where the α constant in (15) was chosen by a parameter sweep to be 0.01 for all games. We used the exact same network architecture, learning rate, optimizer, pre-processing and replay scheme as [21]. For the uncertainty head we used a single fully connected hidden layer with 512 hidden units followed by the output layer. We trained the uncertainty head using a separate RMSProp optimizer [39] with learning rate 10^−3. The addition of the uncertainty head and the computation associated with it only reduced the frame-rate compared to vanilla DQN by about 10% on a GPU, so the speed cost of the approach is negligible. We compare two versions of our approach: a 1-step method and an n-step method where we set n to
Figure 1: Number of games at super-human performance (games at 100% human performance vs. millions of frames; DQN, DQN Intrinsic Motivation, DQN UBE 1-step, DQN UBE n-step).
Figure 2: Median performance across all games vs. millions of frames (DQN, DQN Intrinsic Motivation, DQN UBE 1-step, DQN UBE n-step).
Figure 3: UBE on Montezuma's Revenge for 500M frames (average episode return vs. millions of frames; DQN, DQN Intrinsic Motivation, DQN UBE 1-step, DQN UBE n-step).
programming recursion. This uncertainty can be used by the agent to make decisions about which states and actions to explore, in order to gather more data about the environment and learn a better policy. Since the uncertainty satisfies a Bellman recursion, the agent can learn it using the same reinforcement learning machinery that has been developed for value functions. We showed that an algorithm based on this learned uncertainty can boost the performance of standard deep-RL techniques. Our technique was able to improve the average performance of DQN across the Atari suite of games, when compared against DQN using ε-greedy.
7 Acknowledgments
We thank Marc Bellemare, David Silver, Koray Kavukcuoglu, Tejas Kulkarni, Mohammad Gheshlaghi Azar, and Bilal Piot for discussion and suggestions on the paper.
References
[1] M. G. AZAR, R. MUNOS, AND B. KAPPEN, On the sample complexity of reinforcement learning with a generative model, in Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
[2] M. BELLEMARE, S. SRINIVASAN, G. OSTROVSKI, T. SCHAUL, D. SAXTON, AND R. MUNOS, Unifying count-based exploration and intrinsic motivation, in Advances in Neural Information Processing Systems, 2016, pp. 1471–1479.
[3] M. G. BELLEMARE, Y. NADDAF, J. VENESS, AND M. BOWLING, The arcade learning environment: An evaluation platform for general agents, Journal of Artificial Intelligence Research, (2012).
[4] R. BELLMAN, Dynamic programming, Princeton University Press, 1957.
[5] D. P. BERTSEKAS, Dynamic programming and optimal control, vol. 1, Athena Scientific, 2005.
[6] J. A. BOYAN, Least-squares temporal difference learning, in ICML, 1999, pp. 49–56.
[7] R. I. BRAFMAN AND M. TENNENHOLTZ, R-max: A general polynomial time algorithm for near- optimal reinforcement learning, Journal of Machine Learning Research, 3 (2002), pp. 213–231.
[8] G. H. GOLUB AND C. F. VAN LOAN, Matrix computations, vol. 3, JHU Press, 2012.
[9] A. GUEZ, D. SILVER, AND P. DAYAN, Efficient Bayes-adaptive reinforcement learning using sample-based search, in Advances in Neural Information Processing Systems, 2012, pp. 1025–1033.
[10] G. H. HARDY, J. E. LITTLEWOOD, AND G. PÓLYA, Inequalities, Cambridge University Press, 1952.
[11] T. JAKSCH, R. ORTNER, AND P. AUER, Near-optimal regret bounds for reinforcement learning, Journal of Machine Learning Research, 11 (2010), pp. 1563–1600.
[12] S. M. KAKADE, On the sample complexity of reinforcement learning, PhD thesis, University of London, London, England, 2003.
[29] G. OSTROVSKI, M. G. BELLEMARE, A. V. D. OORD, AND R. MUNOS, Count-based exploration with neural density models, arXiv preprint arXiv:1703.01310, (2017).
[30] J. SCHMIDHUBER, Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes, in Anticipatory Behavior in Adaptive Learning Systems, Springer, 2009, pp. 48–76.
[31] S. P. SINGH, A. G. BARTO, AND N. CHENTANEZ, Intrinsically motivated reinforcement learning, in NIPS, vol. 17, 2004, pp. 1281–1288.
[32] M. J. SOBEL, The variance of discounted Markov decision processes, Journal of Applied Probability, 19 (1982), pp. 794–802.
[33] A. STREHL AND M. LITTMAN, Exploration via model-based interval estimation, 2004.
[34] M. STRENS, A Bayesian framework for reinforcement learning, in ICML, 2000, pp. 943–950.
[35] R. SUTTON AND A. BARTO, Reinforcement Learning: an Introduction, MIT Press, 1998.
[36] R. S. SUTTON, Learning to predict by the methods of temporal differences, Machine learning, 3 (1988), pp. 9–44.
[37] A. TAMAR, D. DI CASTRO, AND S. MANNOR, Learning the variance of the reward-to-go, Journal of Machine Learning Research, 17 (2016), pp. 1–36.
[38] W. R. THOMPSON, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, 25 (1933), pp. 285–294.
[39] T. TIELEMAN AND G. HINTON, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning, 4 (2012).
[40] C. J. C. H. WATKINS, Learning from delayed rewards, PhD thesis, University of Cambridge, England, 1989.
8 Appendix
Here we prove the bound on the cross-term given in equation (8):

c(s, a) = E_d[ (ε(s, a) − E_d ε(s, a)) E_{s′,a′}[ Q̂^π(s′, a′) − Q^π(s′, a′) − bias Q̂^π(s′, a′) ] ]
= E_{s′,a′} E_d[ (ε(s, a) − E_d ε(s, a)) ( ε(s′, a′) − E_d ε(s′, a′) + γ E_{s′′,a′′}[ Q̂^π(s′′, a′′) − Q^π(s′′, a′′) − bias Q̂^π(s′′, a′′) ] ) ]
= E_{s′,a′,s′′,a′′,...} E_d[ (ε(s, a) − E_d ε(s, a)) ( (ε(s′, a′) − E_d ε(s′, a′)) + γ (ε(s′′, a′′) − E_d ε(s′′, a′′)) + γ^2 ( ... ) ) ]
≤ max_{s′,a′,s′′,a′′,...} E_d[ (ε(s, a) − E_d ε(s, a)) ( (ε(s′, a′) − E_d ε(s′, a′)) + γ (ε(s′′, a′′) − E_d ε(s′′, a′′)) + γ^2 ( ... ) ) ]
= var ε(s, a) (1 + γ + γ^2 + ...) = var ε(s, a)/(1 − γ),

where we have repeatedly used the fact that Q̂^π(t, b) − Q^π(t, b) = γ E_{t′,b′}[ Q̂^π(t′, b′) − Q^π(t′, b′) ] + ε(t, b) for any (t, b), the fact that the biases satisfy a Bellman equation, and Assumption 1, which implies that the max over each (s′, a′) is attained when (s′, a′) = (s, a). If we have more knowledge about the policy and the MDP then we can provide tighter bounds. For example, if we know that under the policy the agent cannot visit the same state before T time-steps, then the cross-term can be bounded by c(s, a) ≤ γ^{T−1} var ε(s, a)/(1 − γ^T) and the multiplicative factor becomes (1 + γ^T)/(1 − γ^T). In particular, a policy which never visits the same state twice (e.g., where the MDP is acyclic) has c(s, a) = 0 and the multiplicative factor in the uncertainty Bellman equation is one.
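As a quick check of the arithmetic behind this tighter bound (my own verification, assuming the cross-term enters the variance expansion with the factor 2γ, as in the proof of Lemma 2): if the residual at (s, a) can only recur every T steps, the geometric series above keeps only the terms γ^{T−1}, γ^{2T−1}, ..., so

\[
c(s,a)\;\le\;\operatorname{var}\epsilon(s,a)\,\big(\gamma^{T-1}+\gamma^{2T-1}+\cdots\big)
=\frac{\gamma^{T-1}}{1-\gamma^{T}}\operatorname{var}\epsilon(s,a),
\qquad
\beta \;\le\; 1+2\gamma\cdot\frac{\gamma^{T-1}}{1-\gamma^{T}}
\;=\;\frac{1+\gamma^{T}}{1-\gamma^{T}},
\]

which recovers (1 + γ)/(1 − γ) when T = 1 and tends to one as T → ∞.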
Per-game learning curves (average episode return vs. millions of frames) for DQN, DQN Intrinsic Motivation, DQN UBE 1-step and DQN UBE n-step on: alien, amidar, assault, asterix, asteroids, atlantis, bank_heist, battle_zone, beam_rider, berzerk, bowling, boxing, fishing_derby, freeway, frostbite, gopher, gravitar, hero, ice_hockey, jamesbond, kangaroo, krull, kung_fu_master, montezuma_revenge, ms_pacman, name_this_game, phoenix, pitfall.