Proximal Policy Optimization
Problems with REINFORCE
There are three issues:
- The update process is very inefficient! We run the policy once, update once, and then throw away the trajectory.
- The gradient estimate $\hat{g}$ is very noisy. By chance, the collected trajectory may not be representative of the policy.
- There is no clear credit assignment! A trajectory may contain many good/bad actions, and whether these actions are reinforced depends only on the final total reward.
Solutions
Noise Reduction
- Problem: too many trajectories
    - There could easily be well over a million trajectories for simple problems, and infinitely many for continuous problems.
    - So a lot of the time, the result of a sampled trajectory comes down to chance and doesn't contain that much information about our policy.
    - The hope is that after training for a long time, the tiny signal accumulates.
- Solution: simply sample more trajectories 😬
    - Use distributed computing to collect many trajectories in parallel.
    - Estimate the policy gradient by averaging across all the different trajectories (see the sketch below).
    - We can also collect all the total rewards and get a sense of how they are distributed.
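As a rough illustration of that averaging step, here is a minimal numpy sketch. It assumes the per-trajectory quantities (summed grad-log-probabilities and total rewards) have already been collected; the function and variable names are mine, not from the notes.

```python
import numpy as np

def averaged_policy_gradient(sum_grad_log_probs, total_rewards):
    """Average the REINFORCE gradient estimate over N sampled trajectories.

    sum_grad_log_probs: list of arrays, one per trajectory, each holding
        sum_t grad_theta log pi_theta(a_t | s_t) (same shape as theta).
    total_rewards: list of total rewards R_i, one per trajectory.
    """
    per_trajectory = [R * g for R, g in zip(total_rewards, sum_grad_log_probs)]
    return np.mean(per_trajectory, axis=0)
```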
Rewards Normalization
- In many cases, the distribution of rewards shifts as learning happens. A reward of 1 might be really good in the beginning, but really bad after 1000 training episodes.
- Learning can be improved if we normalize the rewards, where $\mu$ is the mean and $\sigma$ the standard deviation:
- $R_i \leftarrow \frac{R_i -\mu}{\sigma}$
- $\mu = \frac{1}{N}\sum_i^N R_i$
- $\sigma = \sqrt{\frac{1}{N}\sum_i^N (R_i - \mu)^2}$
- When all the $R_i$ are the same, $\sigma = 0$; in that case we can set all the normalized rewards to 0 to avoid numerical problems (handled in the sketch after this list).
- This batch-normalization technique is also used in many other problems in AI (e.g. image classification), where normalizing the input can improve learning.
- Intuitively, normalizing the rewards roughly corresponds to picking half the actions to encourage/discourage, while also making sure the steps for gradient ascent are not too large/small. 🎲
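A minimal numpy sketch of this normalization, including the $\sigma = 0$ guard mentioned above; the function name is illustrative.

```python
import numpy as np

def normalize_rewards(rewards):
    """R_i <- (R_i - mu) / sigma, with all-equal rewards mapped to 0."""
    rewards = np.asarray(rewards, dtype=np.float64)
    mu, sigma = rewards.mean(), rewards.std()
    if sigma == 0:                     # all R_i identical
        return np.zeros_like(rewards)  # avoid dividing by zero
    return (rewards - mu) / sigma

print(normalize_rewards([2.0, 4.0, 6.0]))  # roughly [-1.22, 0., 1.22]
print(normalize_rewards([3.0, 3.0, 3.0]))  # [0. 0. 0.]
```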
Credit Assignment
At time-step $t$
- Even before an action is decided, the agent has already received all the rewards up until step $t-1$. So we can think of that part of the total reward as the reward from the past. The rest is denoted as the future reward.
- $(\overbrace{…+r_{t-1}}^{\require{cancel}\bcancel{R^{\rm past}_t}} + \overbrace{r_{t}+…}^{R^{\rm future}_t})$
- Because we have a Markov process, the action at time-step $t$ can only affect the future reward, so the past reward shouldn't be contributing to the policy gradient.
- To properly assign credit to the action $a_t$, we should ignore the past reward.
- A better policy gradient would simply have the future reward as the coefficient:
- $g=\sum_t \color{blue}{R_t^{\rm future}} \color{black}{\nabla_{\theta}\log \pi_\theta(a_t|s_t)}$
- Mathematically, ignoring past rewards might change the gradient for each specific trajectory, but it doesn't change the averaged gradient.
- So even though the gradient is different during training, on average we are still maximizing the average reward.
- In fact, the resultant gradient is less noisy, so training using the future reward should speed things up! 🚀 (See the sketch below for computing it.)
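A small sketch of computing $R_t^{\rm future}$ for every time-step of a trajectory, with no discounting (matching the notes); the reversed cumulative sum is one common way to do it.

```python
import numpy as np

def future_rewards(rewards):
    """R_t^future = r_t + r_{t+1} + ... for each time-step t (no discounting)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return np.cumsum(rewards[::-1])[::-1]

print(future_rewards([1, 0, 1]))  # [2. 1. 1.]
```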
Importance Sampling
- We generated the trajectories using the policy $\pi_\theta$; each trajectory $\tau$ had a probability $P(\tau;\theta)$ of being sampled.
- Now, just by chance, the same trajectory can be sampled under the new policy, with a different probability $\color{blue}{P(\tau;\theta')}$.
- Imagine we want to compute the average of some quantity, say $f(\tau)$. We could simply generate trajectories from the new policy, compute $f(\tau)$, and average them.
- Mathematically, this is equivalent to adding up all the $f(\tau)$, weighted by the probability of sampling each trajectory under the new policy:
- $\sum_\tau \color{blue} {P(\tau;\theta')} f(\tau)$
- We can modify this equation by multiplying and dividing by the same number, $P(\tau;\theta)$, and rearranging the terms:
- $\sum_\tau \overbrace{P(\tau;\theta)}^{ \begin{matrix} \scriptsize \textrm{sampling under} \\ \scriptsize \textrm{old policy } \pi_\theta \end{matrix} } \overbrace{\frac{ \color{blue} {P(\tau;\theta')}}{P(\tau;\theta)}}^{ \begin{matrix} \scriptsize \textrm{re-weighting} \\ \scriptsize \textrm{factor} \end{matrix} } f(\tau)$
- We can reinterpret the first part as sampling under the old policy $\pi_\theta$, with an extra re-weighting factor included, in addition to just averaging.
- We can use old trajectories to compute averages for the new policy, as long as we include this extra re-weighting factor, which takes into account how under- or over-represented each trajectory is under the new policy compared to the old one. (A toy example is sketched below.)
- The same trick is frequently used in statistics, where a re-weighting factor is included to unbias surveys and voting predictions.
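A toy numpy example of this re-weighting trick (not from the notes): the "old" and "new" policies are replaced by two Gaussians, and $f$ is just $x^2$, but the re-weighting factor plays exactly the role of $P(\tau;\theta')/P(\tau;\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2

# Samples drawn under the "old" distribution N(0, 1)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Re-weighting factor: p_new(x) / p_old(x), analogous to P(tau;theta') / P(tau;theta)
weights = gaussian_pdf(samples, mu=0.5) / gaussian_pdf(samples, mu=0.0)

print(np.mean(weights * f(samples)))              # importance-sampled estimate, ~1.25
print(np.mean(f(rng.normal(0.5, 1.0, 100_000))))  # direct estimate under the "new" dist
```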
The re-weighting factor
$\frac{\color{blue} {P(\tau;\theta')}}{P(\tau;\theta)} =\frac {\pi_{\theta'}(a_1|s_1)\, \pi_{\theta'}(a_2|s_2)\, \pi_{\theta'}(a_3|s_3)\cdots} {\pi_\theta(a_1|s_1)\, \pi_\theta(a_2|s_2)\, \pi_\theta(a_3|s_3)\cdots}$
- Because each trajectory contains many steps, the probability is a chain of products of the policy at each time-step.
- A problem: when some of the policy probabilities get close to zero, the re-weighting factor can become close to zero, or worse, close to 1 over 0, which diverges to infinity.
- When this happens, the re-weighting trick becomes unreliable.
- A solution: in practice, we want to make sure the re-weighting factor is not too far from 1 when we utilize importance sampling.
Proximal Policy Optimization (PPO)
- closely related to Trust Region Policy Optimization (TRPO)
- developed by OpenAI
Algorithm
- First, collect some trajectories based on some policy $\pi_\theta$, and initialize $\theta'=\theta$
- Next, compute the gradient of the clipped surrogate function using the trajectories
- Update $\theta'$ using gradient ascent
- $\theta'\leftarrow\theta' +\alpha \nabla_{\theta'}L_{\rm sur}^{\rm clip}(\theta', \theta)$
- Then we repeat steps 2-3 without generating new trajectories. Typically, steps 2-3 are only repeated a few times (a rough sketch of this update is shown after this list)
- Set $\theta=\theta'$, go back to step 1, and repeat.
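A minimal PyTorch sketch of steps 2-3 under some assumptions: `policy` is a network mapping states to action logits, `old_log_probs` are the log-probabilities $\log\pi_\theta(a_t|s_t)$ recorded when the trajectories were collected, and the normalized future rewards are used as coefficients. The surrogate here follows the standard PPO clipped objective; all names are illustrative, not from the notes.

```python
import torch

def clipped_surrogate(policy, states, actions, old_log_probs,
                      future_rewards, epsilon=0.2):
    """One possible form of L_sur^clip(theta', theta)."""
    logits = policy(states)                            # new policy pi_theta'
    new_log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    ratio = torch.exp(new_log_probs - old_log_probs)   # re-weighting factor
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Take the smaller of the raw and clipped objectives, then average.
    return torch.min(ratio * future_rewards, clipped * future_rewards).mean()

# Steps 2-3: gradient ascent on the surrogate (via minimizing its negative),
# repeated a few times before collecting new trajectories:
# for _ in range(4):
#     loss = -clipped_surrogate(policy, states, actions, old_log_probs, future_rewards)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```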
Test
Policy Gradient Quiz
Suppose we are training an agent to play a computer game.
- There are only two possible actions: 0 = Do nothing, 1 = Move
- There are three time-steps in each game, and our policy is completely determined by one parameter $\theta$, such that the probability of “moving” is $\theta$, and the probability of doing nothing is $1-\theta$
- Initially $\theta=0.5$. Three games are played; the results are:
- Game 1: actions: (1,0,1), rewards: (1,0,1)
- Game 2: actions: (1,0,0), rewards: (0,0,1)
- Game 3: actions: (0,1,0), rewards: (1,0,1)
- Question 1: What are the future rewards for the first game?
    - Answer: (1+0+1, 1+0, 1) = (2, 1, 1)
- Question 2: What is the policy gradient computed from the second game, using future rewards?
    - Answer: $-2$ (verified numerically in the sketch below)
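A small script (not part of the original quiz) that checks both answers numerically, using the fact that $\frac{d}{d\theta}\log\pi_\theta(1|s) = 1/\theta$ and $\frac{d}{d\theta}\log\pi_\theta(0|s) = -1/(1-\theta)$.

```python
import numpy as np

theta = 0.5
# d/dtheta of log pi_theta(a|s): "move" (a=1) has prob theta, "do nothing" has prob 1 - theta
grad_log_prob = lambda a: 1 / theta if a == 1 else -1 / (1 - theta)

def future_rewards(rewards):
    return np.cumsum(rewards[::-1])[::-1]

# Question 1: Game 1
print(future_rewards([1, 0, 1]))   # [2 1 1]

# Question 2: Game 2, using future rewards as coefficients
actions, rewards = [1, 0, 0], [0, 0, 1]
g = sum(R * grad_log_prob(a) for R, a in zip(future_rewards(rewards), actions))
print(g)                           # 1*2 + 1*(-2) + 1*(-2) = -2.0
```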