DQN


Deep Q-Learning algorithm

  • Represents the optimal action-value function $q_*$ as a neural network (instead of a table)

  • Reinforcement learning is unstable when neural networks are used to represent the action values.

  • In this lesson, you’ll learn all about the Deep Q-Learning algorithm, which addresses these instabilities by using two key features:

    • Experience Replay
    • Fixed Q-Targets

Atari DQN

For each Atari game, the DQN was trained from scratch on that game.

  • Input: images of the game screen
    • Single images: capture spatial information
    • Stacked images: capture temporal information
  • NN:
    • CNN
    • Fully Connected Layers
  • Output:
    • The predicted action values for each possible game action
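
As a concrete reference, here is a minimal PyTorch sketch of such a network (an illustrative assumption, not the lesson’s code). The layer sizes follow the original Atari DQN: a stack of four 84×84 grayscale frames goes in, and one Q-value per possible game action comes out.

```python
import torch
import torch.nn as nn

class AtariDQN(nn.Module):
    """Minimal DQN sketch: convolutional feature extractor + fully connected head."""
    def __init__(self, num_actions):
        super().__init__()
        self.conv = nn.Sequential(                                  # spatial features
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4 stacked frames in
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),    # predicted action value for each game action
        )

    def forward(self, x):
        # x: (batch, 4, 84, 84) stack of frames -> (batch, num_actions) Q-values
        return self.fc(self.conv(x))
```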

Experience Replay

  • Based on the idea that the agent can learn better if it makes multiple passes over the same sampled experience
  • Generates uncorrelated experience data for online training by sampling stored transitions at random
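
A minimal sketch of a uniform replay buffer (class and method names are illustrative assumptions): transitions are stored as tuples and later sampled at random, which both reuses experience and breaks the correlation between consecutive transitions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples for uniform sampling."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)   # oldest experiences are dropped first

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform random sampling decorrelates the minibatch and lets each
        # experience be replayed in multiple updates.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```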

Fixed Q-Targets

  • In Q-Learning, we update a guess with a guess, which can potentially lead to harmful correlations
  • To avoid this, we can update the parameters $w$ in the network $\hat{q}$ to better approximate the action value corresponding to state $S$ and action $A$ with the following update rule:

$\Delta w = \alpha \Bigl(\color{#fc8d59} {R + \gamma \max_a \hat{q} (S',a, w^-)} - \color{#4575b4} {\hat{q}(S,A,w)} \color{black} {\Bigr)\nabla_w \hat{q}(S,A,w)}$

  • TD target:
    • $\color{#fc8d59} {R + \gamma \max_a \hat{q} (S',a, w^-)}$
    • $w^-$: the weights of a separate target network, held fixed during the learning step
  • Current value: $\color{#4575b4} {\hat{q}(S,A,w)}$
  • TD error: $\Bigl(\color{#fc8d59} {R + \gamma \max_a \hat{q} (S',a, w^-)} - \color{#4575b4} {\hat{q}(S,A,w)} \color{black} {\Bigr)}$
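
Putting the pieces together, here is a hedged PyTorch sketch of one learning step with fixed Q-targets. The names `q_net` (weights $w$), `target_net` (weights $w^-$), `optimizer`, and `batch` are assumptions about the surrounding training loop; minimizing the squared TD error yields the gradient update shown above.

```python
import torch
import torch.nn.functional as F

def learn_step(q_net, target_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch   # tensors from the replay buffer

    # TD target: R + gamma * max_a q_hat(S', a, w-); w- is held fixed during this step.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * next_q * (1.0 - dones)

    # Current value: q_hat(S, A, w) for the actions actually taken.
    current_q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Squared TD error; backpropagate only through the primary network's weights w.
    loss = F.mse_loss(current_q, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```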

Why?

  • Decoupling the target’s position from the agent’s actions (parameters)
  • Giving the agent a more stable learning environment
  • Making the learning algorithm more stable and less likely to diverge or fall into oscillations.

Non-fixed target
[Non-fixed target, image from Udacity nd839]


Fixed Target
[Fixed target, image from Udacity nd839]

Deep Q-Learning

  • Uses two separate networks with identical architectures
  • The target Q-Network’s weights are updated less often (or more slowly) than the primary Q-Network
  • Without fixed Q-targets, we could encounter a harmful form of correlation, whereby we shift the parameters of the network based on a constantly moving target
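
For illustration, here are two common ways to refresh the target network’s weights $w^-$ from the primary network’s weights $w$ (the helper names are hypothetical): a hard copy every C steps, as in the original DQN, or a gradual soft (Polyak) update after every learning step.

```python
def hard_update(target_net, q_net):
    # w- <- w, performed only every C learning steps
    target_net.load_state_dict(q_net.state_dict())

def soft_update(target_net, q_net, tau=1e-3):
    # w- <- tau * w + (1 - tau) * w-, performed after every learning step
    for t_param, param in zip(target_net.parameters(), q_net.parameters()):
        t_param.data.copy_(tau * param.data + (1.0 - tau) * t_param.data)
```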

[Illustration of DQN Architecture (Source)]


Improvements

  • Double DQN
  • Prioritized Experience Replay
    • Deep Q-Learning samples experience transitions uniformly from a replay memory.
    • Prioritized experience replay is based on the idea that the agent can learn more effectively from some transitions than from others, and that the more important transitions should be sampled with higher probability.
  • Dueling DQN
    • Currently, in order to determine which states are (or are not) valuable, we have to estimate the corresponding action values for every action.
    • However, by replacing the traditional Deep Q-Network (DQN) architecture with a dueling architecture, we can assess the value of each state without having to learn the effect of each action (see the sketch below).
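
A sketch of the dueling idea (assumed PyTorch, not the lesson’s code): the network splits into a state-value stream $V(s)$ and an advantage stream $A(s,a)$, which are recombined into Q-values with the mean advantage subtracted for identifiability.

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Replaces the final fully connected layers of a DQN with two streams."""
    def __init__(self, feature_dim, num_actions):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(feature_dim, 512), nn.ReLU(), nn.Linear(512, 1))
        self.advantage = nn.Sequential(
            nn.Linear(feature_dim, 512), nn.ReLU(), nn.Linear(512, num_actions))

    def forward(self, features):
        v = self.value(features)        # (batch, 1): how valuable the state is
        a = self.advantage(features)    # (batch, num_actions): per-action advantages
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)
```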

Rainbow
(Source)

In practice

  • Try different environments in OpenAI Gym and evaluate the performance of Q-Learning
  • Assess how well trained RL agents generalize to new tasks
    • In mid-2018, OpenAI held a contest where participants created an algorithm that could learn to play the Sonic the Hedgehog game. Participants trained their RL algorithms on provided game levels; the trained agents were then ranked according to their performance on previously unseen levels.

    • One of the provided baseline algorithms was Rainbow DQN. If you’d like to play with this dataset and run the baseline algorithms, you’re encouraged to follow the setup instructions.


Reading Papers

Questions

  • What kind of tasks are the authors using deep reinforcement learning (RL) to solve? What are the states, actions, and rewards?
  • What neural network architecture is used to approximate the action-value function?
  • How are experience replay and fixed Q-targets used to stabilize the learning algorithm?
  • What are the results?

Papers