DQN
Deep Q-Learning algorithm
- Represents the optimal action-value function $q_*$ as a neural network (instead of a table)
- Reinforcement learning is unstable when neural networks are used to represent the action values.
- In this lesson, you’ll learn all about the Deep Q-Learning algorithm, which addressed these instabilities by using two key features:
  - Experience Replay
  - Fixed Q-Targets
Atari DQN
For each Atari game, the DQN was trained from scratch on that game.
- Input: images of the game screen
  - Images capture spatial information
  - Stacked consecutive images capture temporal information
- Neural network:
  - Convolutional layers (CNN)
  - Fully connected layers
- Output:
  - The predicted action value for each possible game action
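A minimal PyTorch sketch of such a network (PyTorch is an assumption here; the lesson does not prescribe a framework, and the layer sizes follow the Mnih et al. 2015 paper listed under Papers):

```python
import torch
import torch.nn as nn

class AtariDQN(nn.Module):
    """CNN mapping a stack of 4 grayscale 84x84 frames to one Q-value per action
    (layer sizes as in Mnih et al., 2015)."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one predicted action value per game action
        )

    def forward(self, x):
        # x: (batch, 4, 84, 84) stacked frames; scale raw pixels to [0, 1]
        return self.head(self.features(x / 255.0))
```

For example, `AtariDQN(n_actions=6)(torch.zeros(1, 4, 84, 84))` returns six action values for a single stacked observation.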
Experience Replay
- Based on the idea that the agent can learn more by making multiple passes over the same sampled experience
- Sampling past experience at random generates uncorrelated experience data for online training
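A minimal replay-buffer sketch in Python (the class name, capacity, and batch size are illustrative, not from the lesson):

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Stores experience tuples and returns uniformly sampled minibatches."""
    def __init__(self, capacity: int = 100_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size: int = 64):
        # Uniform random sampling breaks the correlation between consecutive steps
        # and lets each experience be reused in multiple updates.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```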
Fixed Q-Targets
- In Q-Learning, we update a guess with a guess, which can potentially lead to harmful correlations
- To avoid this, we can update the parameters $w$ in the network $\hat{q}$ to better approximate the action value corresponding to state $S$ and action $A$ with the following update rule
$\Delta{w} = \alpha \Bigl(\color{#fc8d59} {R + \gamma \max_a \hat{q} (S',a, w^-)} - \color{#4575b4} {\hat{q}(S,A,w)} \color{black} {\Bigr)\nabla_w \hat{q}(S,A,w)}$
- TD target: $\color{#fc8d59} {R + \gamma \max_a \hat{q} (S',a, w^-)}$
- $w^-$: the weights of a separate target network, held fixed during the learning step
- Current value: $\color{#4575b4} {\hat{q}(S,A,w)}$
- TD error: $\Bigl(\color{#fc8d59} {R + \gamma \max_a \hat{q} (S',a, w^-)} - \color{#4575b4} {\hat{q}(S,A,w)} \color{black} {\Bigr)}$
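The same update, written as one gradient step on the squared TD error. This is a sketch assuming PyTorch; `q_online` (weights $w$) and `q_target` (weights $w^-$) are hypothetical network objects, and the batch layout is illustrative:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_online, q_target, optimizer, batch, gamma=0.99):
    """One gradient step toward the fixed target R + gamma * max_a q_hat(S', a, w^-)."""
    states, actions, rewards, next_states, dones = batch
    # TD target uses the frozen target-network weights w^-; no gradient flows through it.
    with torch.no_grad():
        max_next_q = q_target(next_states).max(dim=1).values
        td_target = rewards + gamma * max_next_q * (1.0 - dones)
    # Current value q_hat(S, A, w) from the online network.
    current_q = q_online(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(current_q, td_target)  # mean squared TD error
    optimizer.zero_grad()
    loss.backward()   # gradient is taken w.r.t. w only
    optimizer.step()  # applies the learning-rate-scaled update to w
    return loss.item()
```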
Why?
- Decoupling the target’s position from the parameters we are updating (so the target no longer shifts with every learning step)
- Giving the agent a more stable learning environment
- Making the learning algorithm more stable and less likely to diverge or fall into oscillations.
[Non-fixed target, image from Udacity nd839]
[Fixed target, image from Udacity nd839]
Deep Q-Learning
- Uses two separate networks with identical architectures
- The target Q-Network’s weights are updated less often (or more slowly) than those of the primary Q-Network (see the sketch below)
- Without fixed Q-targets, we could encounter a harmful form of correlation, whereby we shift the parameters of the network based on a constantly moving target
Illustration of DQN Architecture (Source)
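Two common ways to keep the target network behind the online network, sketched here assuming PyTorch modules (the update interval and the value of tau are illustrative; periodic copying is the scheme used in the original DQN paper, and the soft update is a widely used alternative):

```python
def hard_update(q_target, q_online, step, sync_every=10_000):
    """Copy the online weights into the target network every `sync_every` steps."""
    if step % sync_every == 0:
        q_target.load_state_dict(q_online.state_dict())

def soft_update(q_target, q_online, tau=1e-3):
    """Blend a small fraction of the online weights into the target network
    on every step, so the target drifts slowly toward the online network."""
    for t_param, o_param in zip(q_target.parameters(), q_online.parameters()):
        t_param.data.copy_(tau * o_param.data + (1.0 - tau) * t_param.data)
```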
Improvements
- Double DQN
  - Deep Q-Learning tends to overestimate action values.
  - Double Q-Learning has been shown to work well in practice to help with this (see the sketch after this list).
- Prioritized Experience Replay
  - Deep Q-Learning samples experience transitions uniformly from a replay memory.
  - Prioritized experience replay is based on the idea that the agent can learn more effectively from some transitions than from others, so the more important transitions should be sampled with higher probability.
- Dueling DQN
  - Currently, in order to determine which states are (or are not) valuable, we have to estimate the corresponding action values for each action.
  - By replacing the traditional Deep Q-Network (DQN) architecture with a dueling architecture, we can assess the value of each state without having to learn the effect of each action (see the sketch after this list).
- Multi-step bootstrap targets
  - Learning from multi-step bootstrap targets instead of the usual one-step TD target.
- Rainbow
  - An agent that combines six DQN extensions (including those above, plus Distributional DQN and Noisy DQN).
  - It outperforms each of the individual modifications and achieves state-of-the-art performance on Atari 2600 games!
(Source)
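Sketches of two of the ideas above, assuming PyTorch (the layer sizes and function names are illustrative): the Double DQN target lets the online network select the next action while the target network evaluates it, and the dueling head splits the Q-value into a state value plus per-action advantages.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling architecture: separate value and advantage streams whose combination gives Q(s, a)."""
    def __init__(self, in_features: int, n_actions: int):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_features, 128), nn.ReLU(), nn.Linear(128, 1))
        self.advantage = nn.Sequential(nn.Linear(in_features, 128), nn.ReLU(), nn.Linear(128, n_actions))

    def forward(self, x):
        v = self.value(x)      # state value V(s), shape (batch, 1)
        a = self.advantage(x)  # advantages A(s, a), shape (batch, n_actions)
        # Subtracting the mean advantage keeps the V/A decomposition identifiable.
        return v + a - a.mean(dim=1, keepdim=True)

def double_dqn_target(rewards, next_states, dones, gamma, q_online, q_target):
    """Double DQN target: the online network selects the best next action and the
    target network evaluates it, which reduces the overestimation of the plain max target."""
    with torch.no_grad():
        best_actions = q_online(next_states).argmax(dim=1, keepdim=True)   # action selection
        next_q = q_target(next_states).gather(1, best_actions).squeeze(1)  # action evaluation
    return rewards + gamma * next_q * (1.0 - dones)
```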
In practice
- Try different environments in OpenAI Gym and evaluate the performance of Q-Learning
- Assess how well trained RL agents generalize to new tasks
- In mid-2018, OpenAI held a contest in which participants were tasked with creating an algorithm that could learn to play the Sonic the Hedgehog game. Participants trained their RL algorithms on provided game levels; the trained agents were then ranked according to their performance on previously unseen levels.
- One of the provided baseline algorithms was Rainbow DQN. If you’d like to play with this dataset and run the baseline algorithms, you’re encouraged to follow the setup instructions.
Reading Papers
Questions
- What kind of tasks are the authors using deep reinforcement learning (RL) to solve? What are the states, actions, and rewards?
- What neural network architecture is used to approximate the action-value function?
- How are experience replay and fixed Q-targets used to stabilize the learning algorithm?
- What are the results?
Papers
- Riedmiller, Martin. “Neural Fitted Q Iteration – First Experiences with a Data Efficient Neural Reinforcement Learning Method.” European Conference on Machine Learning. Springer, Berlin, Heidelberg, 2005.
- Mnih, Volodymyr, et al. “Human-level control through deep reinforcement learning.” Nature 518.7540 (2015): 529–533.
- van Hasselt, Guez, and Silver. “Deep Reinforcement Learning with Double Q-learning.” arXiv (2015).
- Thrun and Schwartz. “Issues in Using Function Approximation for Reinforcement Learning.” (1993).
- Schaul, Quan, Antonoglou, and Silver. “Prioritized Experience Replay.” arXiv (2016).
- Wang, Schaul, et al. “Dueling Network Architectures for Deep Reinforcement Learning.” arXiv (2015).
- Hessel, Modayil, et al. “Rainbow: Combining Improvements in Deep Reinforcement Learning.” arXiv (2017).