DQN


Deep Q-Learning algorithm

  • Represents the optimal action-value function $q_*$ as a neural network (instead of a table)

  • Reinforcement learning is unstable when neural networks are used to represent the action values.

  • In this lesson, you’ll learn all about the Deep Q-Learning algorithm, which addresses these instabilities by using two key features:

    • Experience Replay
    • Fixed Q-Targets

Atari DQN

For each Atari game, the DQN was trained from scratch on that game.

  • Input: images of the game screen
    • Single images: capture spatial information
    • Stacked images: capture temporal information
  • NN:
    • CNN
    • Fully Connected Layers
  • Output:
    • The predicted action values for each possible game action
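
As a concrete reference, here is a minimal PyTorch sketch of such a network (an illustrative assumption, not the lesson’s code). The layer sizes follow the original Atari DQN: a stack of four 84×84 grayscale frames goes in, and one Q-value per possible game action comes out.

```python
import torch
import torch.nn as nn

class AtariDQN(nn.Module):
    """Minimal DQN sketch: convolutional feature extractor + fully connected head."""
    def __init__(self, num_actions):
        super().__init__()
        self.conv = nn.Sequential(                                  # spatial features
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4 stacked frames in
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),    # predicted action value for each game action
        )

    def forward(self, x):
        # x: (batch, 4, 84, 84) stack of frames -> (batch, num_actions) Q-values
        return self.fc(self.conv(x))
```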

Experience Replay

  • Based on the idea that the agent can learn better if it makes multiple passes over the same sampled experience
  • Generates uncorrelated experience data for online training by sampling stored transitions at random
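
A minimal sketch of a uniform replay buffer (class and method names are illustrative assumptions): transitions are stored as tuples and later sampled at random, which both reuses experience and breaks the correlation between consecutive transitions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples for uniform sampling."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)   # oldest experiences are dropped first

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform random sampling decorrelates the minibatch and lets each
        # experience be replayed in multiple updates.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```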

Fixed Q-Targets

  • In Q-Learning, we update a guess with a guess, which can potentially lead to harmful correlations
  • To avoid this, we can update the parameters $w$ in the network $\hat{q}$ to better approximate the action value corresponding to state $S$ and action $A$ with the following update rule:

$\Delta w = \alpha \Bigl(\color{#fc8d59} {R + \gamma \max_a \hat{q} (S',a, w^-)} - \color{#4575b4} {\hat{q}(S,A,w)} \color{black} {\Bigr)\nabla_w \hat{q}(S,A,w)}$

  • TD target:
    • $\color{#fc8d59} {R + \gamma \max_a \hat{q} (S',a, w^-)}$
    • $w^-$: the weights of a separate target network, held fixed during the learning step
  • Current value: $\color{#4575b4} {\hat{q}(S,A,w)}$
  • TD error: $\Bigl(\color{#fc8d59} {R + \gamma \max_a \hat{q} (S',a, w^-)} - \color{#4575b4} {\hat{q}(S,A,w)} \color{black} {\Bigr)}$
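
Putting the pieces together, here is a hedged PyTorch sketch of one learning step with fixed Q-targets. The names `q_net` (weights $w$), `target_net` (weights $w^-$), `optimizer`, and `batch` are assumptions about the surrounding training loop; minimizing the squared TD error yields the gradient update shown above.

```python
import torch
import torch.nn.functional as F

def learn_step(q_net, target_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch   # tensors from the replay buffer

    # TD target: R + gamma * max_a q_hat(S', a, w-); w- is held fixed during this step.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * next_q * (1.0 - dones)

    # Current value: q_hat(S, A, w) for the actions actually taken.
    current_q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Squared TD error; backpropagate only through the primary network's weights w.
    loss = F.mse_loss(current_q, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```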

Why?

  • Decoupling the target’s position from the agent’s actions (parameters)
  • Giving the agent a more stable learning environment
  • Making the learning algorithm more stable and less likely to diverge or fall into oscillations.

Non-fixed target
[Non-fixed target, image from Udacity nd839]


Fixed Target
[Fixed target, image from Udacity nd839]

Deep Q-Learning

  • Uses two separate networks with identical architectures
  • The target Q-Network’s weights are updated less often (or more slowly) than the primary Q-Network
  • Without fixed Q-targets, we could encounter a harmful form of correlation, whereby we shift the parameters of the network based on a constantly moving target
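
For illustration, here are two common ways to refresh the target network’s weights $w^-$ from the primary network’s weights $w$ (the helper names are hypothetical): a hard copy every C steps, as in the original DQN, or a gradual soft (Polyak) update after every learning step.

```python
def hard_update(target_net, q_net):
    # w- <- w, performed only every C learning steps
    target_net.load_state_dict(q_net.state_dict())

def soft_update(target_net, q_net, tau=1e-3):
    # w- <- tau * w + (1 - tau) * w-, performed after every learning step
    for t_param, param in zip(target_net.parameters(), q_net.parameters()):
        t_param.data.copy_(tau * param.data + (1.0 - tau) * t_param.data)
```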

[Illustration of DQN Architecture (Source)]


Improvements

  • Double DQN
  • Prioritized Experience Replay
    • Deep Q-Learning samples experience transitions uniformly from a replay memory.
    • Prioritized experience replay is based on the idea that the agent can learn more effectively from some transitions than from others, and that the more important transitions should be sampled with higher probability.
  • Dueling DQN
    • Currently, in order to determine which states are (or are not) valuable, we have to estimate the corresponding action values for every action.
    • However, by replacing the traditional Deep Q-Network (DQN) architecture with a dueling architecture, we can assess the value of each state without having to learn the effect of each action (see the sketch below).
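
A sketch of the dueling idea (assumed PyTorch, not the lesson’s code): the network splits into a state-value stream $V(s)$ and an advantage stream $A(s,a)$, which are recombined into Q-values with the mean advantage subtracted for identifiability.

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Replaces the final fully connected layers of a DQN with two streams."""
    def __init__(self, feature_dim, num_actions):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(feature_dim, 512), nn.ReLU(), nn.Linear(512, 1))
        self.advantage = nn.Sequential(
            nn.Linear(feature_dim, 512), nn.ReLU(), nn.Linear(512, num_actions))

    def forward(self, features):
        v = self.value(features)        # (batch, 1): how valuable the state is
        a = self.advantage(features)    # (batch, num_actions): per-action advantages
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)
```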

Rainbow
(Source)

In practice

  • Try different environments in OpenAI Gym and evaluate the performance of Q-Learning
  • Assess how well trained RL agents generalize to new tasks
    • In mid-2018, OpenAI held a contest where participants created an algorithm that could learn to play the Sonic the Hedgehog game. Participants trained their RL algorithms on provided game levels; the trained agents were then ranked according to their performance on previously unseen levels.

    • One of the provided baseline algorithms was Rainbow DQN. If you’d like to play with this dataset and run the baseline algorithms, you’re encouraged to follow the setup instructions.


Reading Papers

Questions

  • What kind of tasks are the authors using deep reinforcement learning (RL) to solve? What are the states, actions, and rewards?
  • What neural network architecture is used to approximate the action-value function?
  • How are experience replay and fixed Q-targets used to stabilize the learning algorithm?
  • What are the results?

Papers