The Uncertainty Bellman Equation and Exploration

Brendan O’Donoghue, Ian Osband, Remi Munos, Volodymyr Mnih

DeepMind

{bodonoghue, iosband, munos, vmnih}@google.com

September 19, 2017

Abstract. We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the estimated value of any fixed policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for ε-greedy improves DQN performance on 51 out of 57 games in the Atari suite.

1 Introduction

We consider the reinforcement learning (RL) problem of an agent interacting with its environment to maximize cumulative rewards through time [35]. We model the environment as a Markov decision process (MDP), but where the agent is initially uncertain of the true dynamics of the MDP [4, 5]. At each time-step, the agent performs an action, receives a reward, and moves to the next state; from these data it can learn which actions lead to higher payoffs. This leads to the exploration versus exploitation trade-off: Should the agent investigate poorly understood states and actions to improve future performance or instead take actions that maximize rewards given its current knowledge?

Separating estimation and control in RL via ‘greedy’ algorithms can lead to premature and suboptimal exploitation. To offset this, the majority of practical implementations introduce some random noise or dithering into their action selection (such as ε-greedy). These algorithms will eventually explore every reachable state and action infinitely often, but can take exponentially long to learn the optimal policy [12]. By contrast, for any set of prior beliefs the optimal exploration policy can be computed directly by dynamic programming in the Bayesian belief space. However, this approach can be computationally intractable for even very small problems [9], while direct computational approximations can fail spectacularly badly [22].

For this reason, most provably-efficient approaches to reinforcement learning rely upon the optimism in the face of uncertainty (OFU) heuristic [14, 13, 7]. These algorithms give a bonus to poorly-understood states and actions and subsequently follow the policy that is optimal for this augmented optimistic MDP. This optimism incentivises exploration but, as the agent learns more about the environment, the scale of the bonus should decrease and the agent’s performance should approach optimality. At a high level these approaches to OFU-RL build up confidence sets that contain the true MDP with high probability [33, 16, 11]. These techniques can provide performance guarantees that are ‘near-optimal’ in terms of the problem parameters. However, apart from the simple ‘multi-armed bandit’ setting with only one state, there is still a significant gap between the upper and lower bounds for these algorithms [15, 11, 27]. One inefficiency in these algorithms is that, although the concentration may be tight at each state and action independently, the combination of simultaneously optimistic estimates may result in an extremely over-optimistic estimate for the MDP as a whole [28]. Other works have suggested that a Bayesian posterior sampling approach may not suffer from these inefficiencies and can lead to performance improvements over OFU methods [34, 25].

In this paper we explore an alternative approach that harnesses the simple relationship of the uncertainty Bellman equation (UBE), where we define uncertainty to be the variance of the value estimator the agent is learning, in a sense similar to the parametric variance of Mannor et al. [17]. Intuitively speaking, if the agent has high uncertainty (as measured by high estimator variance) in a region of the state-space then it should explore there, in order to get a better estimate of those Q-values. We show that, just as the Bellman equation relates the value of a policy beyond a single time-step, so too does the uncertainty Bellman equation propagate uncertainty values over multiple time-steps, thereby facilitating ‘deep exploration’ [26]. The benefit of our approach (which learns the solution to the UBE and uses this to guide exploration) is that we can harness the existing machinery for deep reinforcement learning with minimal change to existing network architectures. The resulting algorithm shares an intimate connection to the existing literature on both OFU and intrinsic motivation [31, 30]. Recent work has further connected these approaches through the notion of ‘pseudo-count’ [2, 29], or some generalization of the number of visits to a state and action. Rather than pseudo-count, our work builds upon the idea that the more fundamental quantity is the uncertainty of the estimated value function, and that naively compounding count-based bonuses may lead to inefficient confidence sets [28]. The key difference is that the UBE compounds the sum of the variances at each step, rather than standard deviations. The observation that the higher moments of a value function also satisfy a form of Bellman equation is not new and has been observed by some of the early papers on the subject [32]. Unlike most prior work, we focus upon the epistemic uncertainty over the mean of the value function, rather than the higher moments of the reward-to-go [16, 1, 18]. For application to rich environments with complex generalization we will use a deep learning architecture to learn a solution to the UBE according to our observed data, in the style of [37].

2 Problem formulation

We consider an infinite horizon, discounted, finite state and action space MDP, with state space S, action space A and rewards at each time period denoted by r_t ∈ R. A policy π : S × A → R+ is a mapping from a state-action pair to the probability of taking that action at that state. At each time-step t the agent receives a state s_t and a reward r_t and selects an action a_t from the policy π_t, and the agent moves to the next state s_{t+1} ∼ P(·, s_t, a_t). Here P(s′, s, a) is the probability of transitioning from state s to state s′ after taking action a. The goal of the agent is to maximize the expected total discounted return J under its policy π, where

J(π) = E[ ∑_{t=0}^∞ γ^t r_t | π ].

Here the expectation is with respect to the initial state distribution, the state-transition probabilities, and the policy π. The discount factor γ ∈ (0, 1) controls how much the agent prioritizes long-term versus short-term rewards. The action-value, or Q-value, of a particular state and action under policy π is the expected total discounted return from taking that action at that state and following π thereafter,

Q^π(s, a) = E[ ∑_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a, π ].

The value of state s under policy π, V^π(s) = E[Q^π(s, a) | a ∼ π], is the expected total discounted return starting from state s under π.
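For concreteness, Q^π for a small tabular MDP can be computed by iterating the Bellman backup Q ← R + γ E_{s′∼P, a′∼π} Q(s′, a′). The following NumPy sketch is illustrative only; the toy transition probabilities, rewards, and policy are made up and are not from the paper:

import numpy as np

# Tabular policy evaluation: iterate the Bellman backup on a toy 2-state, 2-action MDP
# until the gamma-contraction converges to the fixed point Q^pi.
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                    # R[s, a] expected rewards
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],                   # pi[s, a] fixed stochastic policy
               [0.2, 0.8]])

Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    V = (pi * Q).sum(axis=1)                 # V(s') = E_{a' ~ pi} Q(s', a')
    Q_new = R + gamma * (P @ V)              # Bellman backup for every (s, a)
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        break
    Q = Q_new
print(Q)                                     # approximates Q^pi for the toy MDP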

Therefore, if we know, or can bound, the variance and bias of an estimator then we can use them to construct intervals that contain the true Q-values with high probability. An agent can then apply the OFU principle to prioritize its exploration towards potentially rewarding policies [13, 22]. We argue that, for many settings of interest, this error is dominated by the variance term and that, in this case, several simplifying relationships emerge.

Lemma 1. For any policy π and any state-action pair (s, a), the biases satisfy a Bellman equation

bias Q̂^π(s, a) = E_d ε(s, a) + γ E_{s′,a′} bias Q̂^π(s′, a′). (6)

Proof. Take the expectation of (3) with respect to d and note that E_d δ(s, a) = bias Q̂^π(s, a).

For the purposes of our analysis we will assume that the Bellman residuals at any state-action pair are uncorrelated. This property will certainly not hold in all settings, but may be a reasonable approximation in many settings of interest.

Assumption 1. For any s, s′ ∈ S and a, a′ ∈ A,

cov(ε(s, a), ε(s′, a′)) ≤ min(var ε(s, a), var ε(s′, a′)),

where cov denotes the covariance.

Assumption 1 implies that the variance of the Q-value estimate satisfies a Bellman inequality.

Lemma 2. For any policy π and any estimator that satisfies Assumption 1, the variance satisfies a Bellman inequality at all (s, a),

var Q̂^π(s, a) ≤ β var ε(s, a) + γ^2 E_{s′,a′} var Q̂^π(s′, a′), (7)

for some β ∈ [1, (1 + γ)/(1 − γ)]; we write β* for the minimum such β that satisfies this relationship.

Proof. Let var ε(s, a) = E_d[(ε(s, a) − E_d ε(s, a))^2] be the variance of the Bellman residual at (s, a). We will refer to this quantity as the local (or shallow) uncertainty from finite data. We now consider the variance of the estimator:

var Q̂^π(s, a) = E_d[ (Q̂^π(s, a) − E_d Q̂^π(s, a))^2 ]
             = E_d[ (Q̂^π(s, a) − Q^π(s, a) − bias Q̂^π(s, a))^2 ]
             = E_d[ (Q̂^π(s, a) − Err(s, a) − γ E_{s′,a′} Q^π(s′, a′) − bias Q̂^π(s, a))^2 ]
             = E_d[ (ε(s, a) − E_d ε(s, a) + γ E_{s′,a′}[Q̂^π(s′, a′) − E_d Q̂^π(s′, a′)])^2 ],

where in the last line we used Lemma 1. Expanding the square we obtain

var Q̂^π(s, a) = var ε(s, a) + γ^2 E_d[ (E_{s′,a′}[Q̂^π(s′, a′) − E_d Q̂^π(s′, a′)])^2 ] + 2γ c(s, a),

where c(s, a) is the cross-term. In the appendix we prove that under Assumption 1 this can be bounded as

c(s, a) ≤ α var ε(s, a), (8)

where 0 ≤ α ≤ 1/(1 − γ) is a constant that might depend on the MDP, the policy, and the estimator. By Jensen's inequality we have that

E_d[ (E_{s′,a′}[Q̂^π(s′, a′) − E_d Q̂^π(s′, a′)])^2 ] ≤ E_{d,s′,a′}[ (Q̂^π(s′, a′) − E_d Q̂^π(s′, a′))^2 ] = E_{s′,a′} var Q̂^π(s′, a′).

Combining this result with (8) we can then say

var Q̂^π(s, a) ≤ β var ε(s, a) + γ^2 E_{s′,a′} var Q̂^π(s′, a′), (9)

for some 1 ≤ β ≤ (1 + γ)/(1 − γ).

With these lemmas we are ready to prove our main theorem.

Theorem 1 (Solution of the uncertainty Bellman equation). Under Assumption 1, for any policy π, let β* ∈ [1, (1 + γ)/(1 − γ)] be the smallest β that satisfies the conditions of Lemma 2. Then there exists a unique u* that satisfies the uncertainty Bellman equation

u*(s, a) = (T_u^π u*)(s, a) := β* var ε(s, a) + γ^2 E_{s′,a′} u*(s′, a′) (10)

for each (s, a), and u* ≥ var Q̂^π pointwise.

Proof. To show this we use three essential properties of the Bellman operator for a fixed policy [5]. First, the Bellman operator is a γ-contraction in the ℓ∞ norm, so the fixed point u* exists and is unique. Second, value iteration converges, in that (T_u^π)^k x → u* for any starting point x. Finally, the Bellman operator is monotonically increasing in its arguments, i.e., if x ≥ y pointwise then T_u^π x ≥ T_u^π y pointwise. Since the variance satisfies the Bellman inequality (9), we have

var Q̂^π ≤ T_u^π var Q̂^π ≤ lim_{k→∞} (T_u^π)^k var Q̂^π = u*. (11)
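Because T_u^π is a contraction, u* can be computed with exactly the same value-iteration machinery used for ordinary policy evaluation. The NumPy sketch below is illustrative only; the toy MDP, policy, and local uncertainties var ε are randomly generated stand-ins:

import numpy as np

# Uncertainty value iteration: u <- beta * var_eps + gamma^2 * E_{s'~P, a'~pi}[u(s', a')],
# converging to the unique UBE fixed point u*.
n_states, n_actions, gamma, beta = 3, 2, 0.95, 1.0
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s']
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # pi[s, a]
var_eps = rng.uniform(0.0, 1.0, size=(n_states, n_actions))        # local uncertainties

u = np.zeros((n_states, n_actions))
for _ in range(5000):
    u_next = (pi * u).sum(axis=1)            # E_{a' ~ pi} u(s', a')
    u_new = beta * var_eps + gamma**2 * (P @ u_next)
    if np.max(np.abs(u_new - u)) < 1e-12:
        break
    u = u_new
print(u)   # approximates u*, a pointwise upper bound on var Q_hat^pi under Assumption 1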

3.2 Comparison to traditional exploration bonus

Consider a simple decision problem with known deterministic transitions, unknown rewards, and two actions. We imagine an agent has gathered some data d and produced some unbiased value estimates. According to these estimates, the first action leads to a single reward with expectation zero and variance σ^2. The second action leads to an infinite chain of independent states, each yielding a reward with expectation zero and variance σ^2(1 − γ^2). These numbers are chosen so that the estimated value of each action has mean zero and variance σ^2. An optimistic agent motivated by (5) has no reason to value one action over the other. Nonetheless, most existing approaches to optimism that work via an exploration bonus would lead to an inconsistent decision rule in this setting [28]. Rather than consider the variance of the value as a whole, the majority of existing approaches to OFU provide exploration bonuses at each state and action independently and then combine these estimates via a union bound. In this context, even a state of the art algorithm such as UCRL2 [11] would afford each state a bonus proportional to the standard deviation of its estimate. For action one this would be proportional to σ, but for action two this would be proportional to

ExplorationBonus(a_2) ∝ ∑_{t=0}^T γ^t σ √(1 − γ^2) = σ √((1 + γ)/(1 − γ))
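To make the gap concrete, a quick numeric check of this example (illustrative only, with σ = 1 and γ = 0.99) shows the compounded standard-deviation bonus exceeding the value-level standard deviation σ by more than an order of magnitude:

import numpy as np

sigma, gamma = 1.0, 0.99
per_state_std = sigma * np.sqrt(1.0 - gamma**2)           # std of each estimate in the chain

# Bonus that compounds standard deviations along the chain: ~ sigma * sqrt((1+gamma)/(1-gamma))
compounded_bonus = sum(gamma**t * per_state_std for t in range(100000))
print(compounded_bonus, sigma * np.sqrt((1 + gamma) / (1 - gamma)))   # both ~ 14.1

# Compounding variances instead gives the value-level std of the chain action: just sigma
chain_value_std = np.sqrt(sum(gamma**(2*t) * per_state_std**2 for t in range(100000)))
print(chain_value_std)                                    # ~ 1.0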

by the Sherman-Morrison-Woodbury formula [8], the cost of this update is one matrix multiply and one matrix-matrix subtraction per step.

Neural network value estimate. If we are approximating our Q-value function using a neural network then the above analysis does not hold. However, if the last layer of the network is linear, then the Q-values are approximated as Q^π(s, a) = φ(s)^T w_a, where w_a are the weights of the last layer associated with action a and φ(s) is the output of the network up to the last layer for state s. In other words, we can think of a neural network as learning a useful set of basis functions such that a linear combination of them approximates the Q-values. Then, if we ignore the uncertainty in the φ mapping, we can reuse the analysis for the purely linear case to derive an approximate measure of local uncertainty that might be useful in practice. This scheme has some advantages. As the agent progresses it is learning a state representation that helps it achieve the goal of maximizing the return. The agent will learn to pay attention to small but important details (e.g., the ball in Atari ‘breakout’) and learn to ignore large but irrelevant changes (e.g., if the background suddenly changes). This is a desirable property from the point of view of using these features to drive exploration, because states that differ only in irrelevant ways will be aliased to (roughly) the same state representation, and states that differ in small but important ways will be mapped to quite different state vectors, permitting a more task-relevant measure of local uncertainty.
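A minimal sketch of this last-layer scheme is given below. It is illustrative only: it assumes eq. (14) is the standard Sherman-Morrison rank-one update of a per-action matrix Σ_a (the posterior covariance of a Bayesian linear regression on the features φ(s)), and the feature dimension and prior scale μ are made-up values:

import numpy as np

feature_dim, n_actions, mu = 512, 4, 1.0
Sigma = [mu * np.eye(feature_dim) for _ in range(n_actions)]   # one Sigma_a per action

def local_uncertainty(phi, a):
    # Local (shallow) uncertainty estimate phi(s)^T Sigma_a phi(s)
    return float(phi @ Sigma[a] @ phi)

def update_sigma(phi, a):
    # Sigma_a <- Sigma_a - (Sigma_a phi)(Sigma_a phi)^T / (1 + phi^T Sigma_a phi):
    # one matrix-vector product and one rank-one subtraction per step.
    Sphi = Sigma[a] @ phi
    Sigma[a] -= np.outer(Sphi, Sphi) / (1.0 + phi @ Sphi)

# Usage with a random vector standing in for the last-layer activations phi(s):
phi = np.random.randn(feature_dim)
print(local_uncertainty(phi, a=0))   # large before any data for action 0
update_sigma(phi, a=0)
print(local_uncertainty(phi, a=0))   # shrinks after observing (s, a=0)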

5 Algorithm for Deep Reinforcement Learning

In this section we describe an exploration heuristic for deep reinforcement learning whereby we attempt to learn both the Q-values and the uncertainties associated with them simultaneously (we assume the biases are small enough to ignore). The goal is for the agent to explore areas where it learns that it has higher uncertainty. This is in contrast to the commonly used ε-greedy [20] and Boltzmann exploration strategies [19, 23], which simply inject noise into the agent's actions. Our policy uses Thompson sampling [38], where the action is selected as

a = argmax_b ( Q̂^π(s, b) + α ζ(b) u(s, b)^{1/2} ), (15)

where ζ(b) ∼ N(0, 1) and α > 0 is a hyper-parameter. In this case the probability of selecting action a is the probability that a has the maximum value if each action b were distributed normally with mean Q̂^π(s, b) and variance α^2 u(s, b). The technique is described in pseudo-code in Algorithm 1. We refer to the technique as one-step since the uncertainty values are updated using a one-step SARSA Bellman backup, but it is easily extendable to the n-step case.

The algorithm takes as input a neural network with two output heads: one attempting to learn the optimal Q-values as normal, the other attempting to learn the uncertainty values of the current policy (which is constantly changing). We do not allow the gradients from the uncertainty head to flow into the trunk; this ensures the Q-value estimates are not perturbed by the changing uncertainty signal. For the local uncertainty measure we use the linear basis approximation described previously. We have dropped the constant β from the uncertainty Bellman equation (10) and the unknown σ^2 term from the local uncertainty in equation (13), because they are both absorbed by the hyper-parameter α in the policy. Most of the assumptions that allowed us to bound the true Q-values in equation (5) are violated by this scheme; in particular we have ignored the bias term, and the policy is changing as the Q-values change. However, we might expect this strategy to provide a useful signal of novelty to the agent and therefore perform well in practice.
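In code, the action selection in (15) amounts to drawing one standard normal sample per action and maximizing the perturbed Q-values; the sketch below uses made-up per-action estimates for illustration:

import numpy as np

def select_action(q_values, u_values, alpha=0.01, rng=np.random.default_rng()):
    # a = argmax_b ( Q(s, b) + alpha * zeta(b) * u(s, b)^(1/2) ),  zeta(b) ~ N(0, 1)
    zeta = rng.standard_normal(len(q_values))
    return int(np.argmax(q_values + alpha * zeta * np.sqrt(u_values)))

# Example with made-up Q-value and uncertainty estimates for a 4-action state:
print(select_action(np.array([1.0, 1.1, 0.9, 1.05]), np.array([0.1, 2.0, 0.5, 0.0])))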

Algorithm 1 One-step UBE exploration with linear uncertainty estimates.

// Input: Neural network outputting Q and u estimates
// Input: Q-value learning subroutine qlearn
// Input: Thompson sampling hyper-parameter α > 0
Initialize Σ_a = μI for each a, where μ > 0
Get initial state s, take initial action a
repeat
    Retrieve feature mapping φ(s) from the input to the last layer of the Q-head
    Receive new state s′ and reward r
    Calculate Q-value estimates Q(s′, ·) and uncertainty estimates u(s′, ·)
    Calculate action a′ = argmax_b ( Q(s′, b) + α ζ(b) u(s′, b)^{1/2} ), where ζ(b) ∼ N(0, 1)
    Calculate target
        y = φ(s)^T Σ_a φ(s)                      for terminal s′
        y = φ(s)^T Σ_a φ(s) + γ^2 u(s′, a′)      otherwise
    Take gradient step on the u subnetwork with respect to the error (y − u(s, a))^2
    Update Q-values using qlearn(s, a, r, s′, a′)
    Update Σ_{a′} according to eq. (14)
    Take action a′
until T > Tmax
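The sketch below isolates the uncertainty-head target and update from this loop. It is illustrative rather than a faithful implementation: the features φ(s) are random stand-ins for last-layer activations, the uncertainty head is a linear function of φ trained by plain SGD on the squared error, and eq. (14) is assumed to be the rank-one Σ_a update shown earlier:

import numpy as np

feature_dim, n_actions, gamma, mu, lr = 32, 3, 0.99, 1.0, 1e-3
w_u = np.zeros((n_actions, feature_dim))                      # uncertainty-head weights
Sigma = [mu * np.eye(feature_dim) for _ in range(n_actions)]

def u_estimate(phi, a):
    return float(phi @ w_u[a])

def ube_update(phi_s, a, phi_next, a_next, terminal):
    # One-step SARSA-style backup for the uncertainty head.
    local = float(phi_s @ Sigma[a] @ phi_s)                   # local uncertainty phi^T Sigma_a phi
    y = local if terminal else local + gamma**2 * u_estimate(phi_next, a_next)
    # Gradient step on (y - u(s, a))^2 with respect to the head weights only (trunk frozen).
    w_u[a] += lr * (y - u_estimate(phi_s, a)) * phi_s
    # Assumed form of eq. (14): rank-one update of Sigma_a.
    Sphi = Sigma[a] @ phi_s
    Sigma[a] -= np.outer(Sphi, Sphi) / (1.0 + phi_s @ Sphi)

# Toy usage on random transitions standing in for replayed experience:
rng = np.random.default_rng(0)
for _ in range(100):
    ube_update(rng.standard_normal(feature_dim), int(rng.integers(n_actions)),
               rng.standard_normal(feature_dim), int(rng.integers(n_actions)), terminal=False)
print(u_estimate(rng.standard_normal(feature_dim), 0))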

5.1 Experimental results

Here we present results of Algorithm 1 on the Atari suite of games [3], where the network is attempting to learn the Q-values as in DQN [20, 21] and the uncertainties simultaneously. The only change to vanilla DQN we made was to replace the ε-greedy policy with Thompson sampling over the learned uncertainty values, where the α constant in (15) was chosen to be 0.01 for all games by a parameter sweep. We used the exact same network architecture, learning rate, optimizer, pre-processing and replay scheme as [21]. For the uncertainty head we used a single fully connected hidden layer with 512 hidden units followed by the output layer. We trained the uncertainty head using a separate RMSProp optimizer [39] with learning rate 10^{-3}. The addition of the uncertainty head and the computation associated with it only reduced the frame-rate compared to vanilla DQN by about 10% on a GPU, so the speed cost of the approach is negligible.

We compare two versions of our approach: a 1-step method and an n-step method where we set n to 1. The n-step method accumulates the uncertainty signal over n time-steps before performing an update, which should lead to the uncertainty signal propagating to earlier encountered states faster, at the expense of increased variance of the signal. Note that in all cases the Q-learning update is always 1-step; our n-step implementation only affects the uncertainty update. We compare our approaches to vanilla DQN, and also to an exploration-bonus intrinsic motivation approach, where the agent receives an augmented reward consisting of the extrinsic reward and the square root of the linear uncertainty (13), scaled by a hyper-parameter chosen to be 0.1 by a sweep. In this case a stochastic policy was still required for good performance and so we used ε-greedy with the DQN annealing schedule. In the recent work by Bellemare et al. [2], and the follow-up work [29], the authors add an intrinsic motivation signal to a DQN-style agent that has been modified to use the full Monte Carlo return of the episode when learning the Q-values. Using Monte Carlo returns dramatically improves the performance of DQN in a way unrelated to exploration, and due to that change we cannot compare the numerical results directly. In order to have a point of comparison we implemented our own intrinsic motivation exploration signal, as

Figure 1: Number of games at super-human performance. (Plot of the number of games at 100% human performance against millions of frames, for DQN, DQN Intrinsic Motivation, DQN UBE 1-step, and DQN UBE n-step.)

Figure 2: Median performance across all games. (Plot of median performance against millions of frames, same four agents.)

Figure 3: UBE on Montezuma's Revenge for 500M frames. (Plot of average episode return against million frames, same four agents.)

programming recursion. This uncertainty can be used by the agent to make decisions about which states and actions to explore, in order to gather more data about the environment and learn a better policy. Since the uncertainty satisfies a Bellman recursion, the agent can learn it using the same reinforcement learning machinery that has been developed for value functions. We showed that an algorithm based on this learned uncertainty can boost the performance of standard deep-RL techniques. Our technique was able to improve the average performance of DQN across the Atari suite of games when compared against DQN using ε-greedy.

7 Acknowledgments

We thank Marc Bellemare, David Silver, Koray Kavukcuoglu, Tejas Kulkarni, Mohammad Gheshlaghi Azar, and Bilal Piot for discussion and suggestions on the paper.

References

[1] M. G. Azar, R. Munos, and B. Kappen, On the sample complexity of reinforcement learning with a generative model, in Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

[2] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, Unifying count-based exploration and intrinsic motivation, in Advances in Neural Information Processing Systems, 2016, pp. 1471–1479.

[3] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, The arcade learning environment: An evaluation platform for general agents, Journal of Artificial Intelligence Research, (2012).

[4] R. Bellman, Dynamic programming, Princeton University Press, 1957.

[5] D. P. Bertsekas, Dynamic programming and optimal control, vol. 1, Athena Scientific, 2005.

[6] J. A. Boyan, Least-squares temporal difference learning, in ICML, 1999, pp. 49–56.

[7] R. I. Brafman and M. Tennenholtz, R-max: A general polynomial time algorithm for near-optimal reinforcement learning, Journal of Machine Learning Research, 3 (2002), pp. 213–231.

[8] G. H. Golub and C. F. Van Loan, Matrix computations, vol. 3, JHU Press, 2012.

[9] A. Guez, D. Silver, and P. Dayan, Efficient Bayes-adaptive reinforcement learning using sample-based search, in Advances in Neural Information Processing Systems, 2012, pp. 1025–1033.

[10] G. H. Hardy, J. E. Littlewood, and G. Pólya, Inequalities, Cambridge University Press, 1952.

[11] T. Jaksch, R. Ortner, and P. Auer, Near-optimal regret bounds for reinforcement learning, Journal of Machine Learning Research, 11 (2010), pp. 1563–1600.

[12] S. M. Kakade, On the sample complexity of reinforcement learning, PhD thesis, University of London, London, England, 2003.

[29] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos, Count-based exploration with neural density models, arXiv preprint arXiv:1703.01310, (2017).

[30] J. Schmidhuber, Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes, in Anticipatory Behavior in Adaptive Learning Systems, Springer, 2009, pp. 48–76.

[31] S. P. Singh, A. G. Barto, and N. Chentanez, Intrinsically motivated reinforcement learning, in NIPS, vol. 17, 2004, pp. 1281–1288.

[32] M. J. Sobel, The variance of discounted Markov decision processes, Journal of Applied Probability, 19 (1982), pp. 794–802.

[33] A. Strehl and M. Littman, Exploration via model-based interval estimation, 2004.

[34] M. Strens, A Bayesian framework for reinforcement learning, in ICML, 2000, pp. 943–950.

[35] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

[36] R. S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning, 3 (1988), pp. 9–44.

[37] A. Tamar, D. Di Castro, and S. Mannor, Learning the variance of the reward-to-go, Journal of Machine Learning Research, 17 (2016), pp. 1–36.

[38] W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, 25 (1933), pp. 285–294.

[39] T. Tieleman and G. Hinton, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning, 4 (2012).

[40] C. J. C. H. Watkins, Learning from delayed rewards, PhD thesis, University of Cambridge, England, 1989.

8 Appendix

8.1 Bound on the cross-term

Here we prove the bound on the cross-term given in equation (8):

c(s, a) = E_d[ (ε(s, a) − E_d ε(s, a)) E_{s′,a′}[Q̂^π(s′, a′) − Q^π(s′, a′) − bias Q̂^π(s′, a′)] ]
        = E_{s′,a′} E_d[ (ε(s, a) − E_d ε(s, a)) (ε(s′, a′) − E_d ε(s′, a′) + γ E_{s′′,a′′}[Q̂^π(s′′, a′′) − Q^π(s′′, a′′) − bias Q̂^π(s′′, a′′)]) ]
        = E_{s′,a′,s′′,a′′,...} E_d[ (ε(s, a) − E_d ε(s, a)) × (ε(s′, a′) − E_d ε(s′, a′) + γ(ε(s′′, a′′) − E_d ε(s′′, a′′)) + γ^2(...)) ]
        ≤ max_{s′,a′,s′′,a′′,...} E_d[ (ε(s, a) − E_d ε(s, a)) × (ε(s′, a′) − E_d ε(s′, a′) + γ(ε(s′′, a′′) − E_d ε(s′′, a′′)) + γ^2(...)) ]
        = var ε(s, a)(1 + γ + γ^2 + ...) = var ε(s, a)/(1 − γ),

where we have repeatedly used the fact that Q̂^π(t, b) − Q^π(t, b) = γ E_{t′,b′}(Q̂^π(t′, b′) − Q^π(t′, b′)) + ε(t, b) for any (t, b), the fact that the biases satisfy a Bellman equation, and Assumption 1, which implies that the max over each (s′, a′) is attained when (s′, a′) = (s, a).

If we have more knowledge about the policy and the MDP then we can provide tighter bounds. For example, if we know that under the policy the agent cannot visit the same state before T time-steps, then the cross-term can be bounded by c(s, a) ≤ γ^{T−1} var ε(s, a)/(1 − γ^T) and the multiplicative factor becomes (1 + γ^T)/(1 − γ^T). In particular, a policy which never visits the same state twice (e.g., where the MDP is acyclic) has c(s, a) = 0, and the multiplicative factor in the uncertainty Bellman equation is one.

8.2 Learning curves

Per-game learning curves (average episode return against million frames, 0–200M) comparing DQN, DQN Intrinsic Motivation, DQN UBE 1-step, and DQN UBE n-step. Games shown: alien, amidar, assault, asterix, asteroids, atlantis, bank_heist, battle_zone, beam_rider, berzerk, bowling, boxing, fishing_derby, freeway, frostbite, gopher, gravitar, hero, ice_hockey, jamesbond, kangaroo, krull, kung_fu_master, montezuma_revenge, ms_pacman, name_this_game, phoenix, pitfall.