Proximal Policy Optimization
Problems with REINFORCE
There are three issues:
- The update process is very inefficient! We run the policy once, update once, and then throw away the trajectory.
- The gradient estimate $\hat{g}$ is very noisy. By chance, the collected trajectory may not be representative of the policy.
- There is no clear credit assignment! A trajectory may contain many good/bad actions, and whether these actions are reinforced depends only on the final total reward.
Solutions
Noise Reduction
- Problem: too many trajectories
    - There could easily be well over a million trajectories for simple problems, and infinitely many for continuous problems.
    - So a lot of the time, the result of a sampled trajectory comes down to chance and doesn't contain that much information about our policy.
    - The hope is that after training for a long time, the tiny signal accumulates.
- Solution: simply sample more trajectories 😬
    - Use distributed computing to collect many trajectories in parallel.
    - Estimate the policy gradient by averaging across all the different trajectories (see the sketch below).
    - We can also collect all the total rewards and get a sense of how they are distributed.
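As a rough illustration of that averaging step, here is a minimal numpy sketch. It assumes the per-trajectory quantities (summed grad-log-probabilities and total rewards) have already been collected; the function and variable names are mine, not from the notes.

```python
import numpy as np

def averaged_policy_gradient(sum_grad_log_probs, total_rewards):
    """Average the REINFORCE gradient estimate over N sampled trajectories.

    sum_grad_log_probs: list of arrays, one per trajectory, each holding
        sum_t grad_theta log pi_theta(a_t | s_t) (same shape as theta).
    total_rewards: list of total rewards R_i, one per trajectory.
    """
    per_trajectory = [R * g for R, g in zip(total_rewards, sum_grad_log_probs)]
    return np.mean(per_trajectory, axis=0)
```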
Rewards Normalization
- In many cases, the distribution of rewards shifts as learning happens. A reward of 1 might be really good in the beginning, but really bad after 1000 training episodes.
- Learning can be improved if we normalize the rewards, where $\mu$ is the mean and $\sigma$ the standard deviation:
- $R_i \leftarrow \frac{R_i -\mu}{\sigma}$
- $\mu = \frac{1}{N}\sum_i^N R_i$
- $\sigma = \sqrt{\frac{1}{N}\sum_i^N (R_i - \mu)^2}$
- When all the $R_i$ are the same, $\sigma = 0$; in that case we can set all the normalized rewards to 0 to avoid numerical problems (handled in the sketch after this list).
- This batch-normalization technique is also used in many other problems in AI (e.g. image classification), where normalizing the input can improve learning.
- Intuitively, normalizing the rewards roughly corresponds to picking half the actions to encourage/discourage, while also making sure the steps for gradient ascent are not too large/small. 🎲
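A minimal numpy sketch of this normalization, including the $\sigma = 0$ guard mentioned above; the function name is illustrative.

```python
import numpy as np

def normalize_rewards(rewards):
    """R_i <- (R_i - mu) / sigma, with all-equal rewards mapped to 0."""
    rewards = np.asarray(rewards, dtype=np.float64)
    mu, sigma = rewards.mean(), rewards.std()
    if sigma == 0:                     # all R_i identical
        return np.zeros_like(rewards)  # avoid dividing by zero
    return (rewards - mu) / sigma

print(normalize_rewards([2.0, 4.0, 6.0]))  # roughly [-1.22, 0., 1.22]
print(normalize_rewards([3.0, 3.0, 3.0]))  # [0. 0. 0.]
```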
Credit Assignment
At time-step $t$
- Even before an action is decided, the agent has already received all the rewards up until step $t-1$. So we can think of that part of the total reward as the reward from the past. The rest is denoted as the future reward.
- $(\overbrace{…+r_{t-1}}^{\require{cancel}\bcancel{R^{\rm past}_t}} + \overbrace{r_{t}+…}^{R^{\rm future}_t})$
- Because we have a Markov process, the action at time-step $t$ can only affect the future reward, so the past reward shouldn't be contributing to the policy gradient.
- To properly assign credit to the action $a_t$, we should ignore the past reward.
- A better policy gradient would simply have the future reward as the coefficient:
- $g=\sum_t \color{blue}{R_t^{\rm future}} \color{black}{\nabla_{\theta}\log \pi_\theta(a_t|s_t)}$
- Mathematically, ignoring past rewards might change the gradient for each specific trajectory, but it doesn't change the averaged gradient.
- So even though the gradient is different during training, on average we are still maximizing the average reward.
- In fact, the resultant gradient is less noisy, so training using the future reward should speed things up! 🚀 (See the sketch below for computing it.)
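A small sketch of computing $R_t^{\rm future}$ for every time-step of a trajectory, with no discounting (matching the notes); the reversed cumulative sum is one common way to do it.

```python
import numpy as np

def future_rewards(rewards):
    """R_t^future = r_t + r_{t+1} + ... for each time-step t (no discounting)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return np.cumsum(rewards[::-1])[::-1]

print(future_rewards([1, 0, 1]))  # [2. 1. 1.]
```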
Importance Sampling
- We generated the trajectories using the policy $\pi_\theta$; each trajectory $\tau$ had a probability $P(\tau;\theta)$ of being sampled.
- Now, just by chance, the same trajectory can be sampled under the new policy, with a different probability $\color{blue}{P(\tau;\theta')}$.
- Imagine we want to compute the average of some quantity, say $f(\tau)$. We could simply generate trajectories from the new policy, compute $f(\tau)$, and average them.
- Mathematically, this is equivalent to adding up all the $f(\tau)$, weighted by the probability of sampling each trajectory under the new policy:
- $\sum_\tau \color{blue} {P(\tau;\theta')} f(\tau)$
- We can modify this equation by multiplying and dividing by the same number, $P(\tau;\theta)$, and rearranging the terms:
- $\sum_\tau \overbrace{P(\tau;\theta)}^{ \begin{matrix} \scriptsize \textrm{sampling under} \\ \scriptsize \textrm{old policy } \pi_\theta \end{matrix} } \overbrace{\frac{ \color{blue} {P(\tau;\theta')}}{P(\tau;\theta)}}^{ \begin{matrix} \scriptsize \textrm{re-weighting} \\ \scriptsize \textrm{factor} \end{matrix} } f(\tau)$
- We can reinterpret the first part as sampling under the old policy $\pi_\theta$, with an extra re-weighting factor included, in addition to just averaging.
- We can use old trajectories to compute averages for the new policy, as long as we include this extra re-weighting factor, which takes into account how under- or over-represented each trajectory is under the new policy compared to the old one. (A toy example is sketched below.)
- The same trick is frequently used in statistics, where a re-weighting factor is included to unbias surveys and voting predictions.
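A toy numpy example of this re-weighting trick (not from the notes): the "old" and "new" policies are replaced by two Gaussians, and $f$ is just $x^2$, but the re-weighting factor plays exactly the role of $P(\tau;\theta')/P(\tau;\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2

# Samples drawn under the "old" distribution N(0, 1)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Re-weighting factor: p_new(x) / p_old(x), analogous to P(tau;theta') / P(tau;theta)
weights = gaussian_pdf(samples, mu=0.5) / gaussian_pdf(samples, mu=0.0)

print(np.mean(weights * f(samples)))              # importance-sampled estimate, ~1.25
print(np.mean(f(rng.normal(0.5, 1.0, 100_000))))  # direct estimate under the "new" dist
```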
The re-weighting factor
$\frac{\color{blue} {P(\tau;\theta')}}{P(\tau;\theta)} =\frac {\pi_{\theta'}(a_1|s_1)\, \pi_{\theta'}(a_2|s_2)\, \pi_{\theta'}(a_3|s_3)\cdots} {\pi_\theta(a_1|s_1)\, \pi_\theta(a_2|s_2)\, \pi_\theta(a_3|s_3)\cdots}$
- Because each trajectory contains many steps, the probability is a chain of products of the policy at each time-step.
- A problem: when some of the policy probabilities get close to zero, the re-weighting factor can become close to zero, or worse, close to 1 over 0, which diverges to infinity.
- When this happens, the re-weighting trick becomes unreliable.
- A solution: in practice, we want to make sure the re-weighting factor is not too far from 1 when we utilize importance sampling.
Proximal Policy Optimization (PPO)
- closely related to Trust Region Policy Optimization (TRPO)
- developed by OpenAI
Algorithm
- First, collect some trajectories based on some policy $\pi_\theta$, and initialize $\theta'=\theta$
- Next, compute the gradient of the clipped surrogate function using the trajectories
- Update $\theta'$ using gradient ascent
- $\theta'\leftarrow\theta' +\alpha \nabla_{\theta'}L_{\rm sur}^{\rm clip}(\theta', \theta)$
- Then we repeat steps 2-3 without generating new trajectories. Typically, steps 2-3 are only repeated a few times (a rough sketch of this update is shown after this list)
- Set $\theta=\theta'$, go back to step 1, and repeat.
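A minimal PyTorch sketch of steps 2-3 under some assumptions: `policy` is a network mapping states to action logits, `old_log_probs` are the log-probabilities $\log\pi_\theta(a_t|s_t)$ recorded when the trajectories were collected, and the normalized future rewards are used as coefficients. The surrogate here follows the standard PPO clipped objective; all names are illustrative, not from the notes.

```python
import torch

def clipped_surrogate(policy, states, actions, old_log_probs,
                      future_rewards, epsilon=0.2):
    """One possible form of L_sur^clip(theta', theta)."""
    logits = policy(states)                            # new policy pi_theta'
    new_log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    ratio = torch.exp(new_log_probs - old_log_probs)   # re-weighting factor
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Take the smaller of the raw and clipped objectives, then average.
    return torch.min(ratio * future_rewards, clipped * future_rewards).mean()

# Steps 2-3: gradient ascent on the surrogate (via minimizing its negative),
# repeated a few times before collecting new trajectories:
# for _ in range(4):
#     loss = -clipped_surrogate(policy, states, actions, old_log_probs, future_rewards)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```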
Test
Policy Gradient Quiz
Suppose we are training an agent to play a computer game.
- There are only two possible actions: 0 = Do nothing, 1 = Move
- There are three time-steps in each game, and our policy is completely determined by one parameter $\theta$, such that the probability of “moving” is $\theta$, and the probability of doing nothing is $1-\theta$
- Initially $\theta=0.5$. Three games are played; the results are:
- Game 1: actions: (1,0,1), rewards: (1,0,1)
- Game 2: actions: (1,0,0), rewards: (0,0,1)
- Game 3: actions: (0,1,0), rewards: (1,0,1)
- Question 1: What are the future rewards for the first game?
    - Answer: (1+0+1, 1+0, 1) = (2, 1, 1)
- Question 2: What is the policy gradient computed from the second game, using future rewards?
    - Answer: $-2$ (verified numerically in the sketch below)
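A small script (not part of the original quiz) that checks both answers numerically, using the fact that $\frac{d}{d\theta}\log\pi_\theta(1|s) = 1/\theta$ and $\frac{d}{d\theta}\log\pi_\theta(0|s) = -1/(1-\theta)$.

```python
import numpy as np

theta = 0.5
# d/dtheta of log pi_theta(a|s): "move" (a=1) has prob theta, "do nothing" has prob 1 - theta
grad_log_prob = lambda a: 1 / theta if a == 1 else -1 / (1 - theta)

def future_rewards(rewards):
    return np.cumsum(rewards[::-1])[::-1]

# Question 1: Game 1
print(future_rewards([1, 0, 1]))   # [2 1 1]

# Question 2: Game 2, using future rewards as coefficients
actions, rewards = [1, 0, 0], [0, 0, 1]
g = sum(R * grad_log_prob(a) for R, a in zip(future_rewards(rewards), actions))
print(g)                           # 1*2 + 1*(-2) + 1*(-2) = -2.0
```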