Proximal Policy Optimization


Problems of REINFORCE

There are three issues:

  • The update process is very inefficient! We run the policy once, update once, and then throw away the trajectory.

  • The gradient estimate $\hat{g}$ is very noisy. By chance the collected trajectory may not be representative of the policy.

  • There is no clear credit assignment! A trajectory may contain many good/bad actions, and whether these actions are reinforced depends only on the final total reward.

Solutions

Noise Reduction

  • Problem
    • Too many trajectories
      • There can easily be millions of trajectories even for simple problems, and infinitely many for continuous problems.
      • So a lot of times, the result of a sampled trajectory comes down to chance, and doesn’t contain that much information about our policy.
      • The hope is that after training for a long time, the tiny signal accumulates.
  • Solution
    • Simply sample more trajectories 😬
      • using distributed computing
      • estimate the policy gradient by averaging across all the different trajectories (see the averaging sketch after this list)
      • we can collect all the total rewards and get a sense of how they are distributed
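
A minimal NumPy sketch (with made-up numbers, not from the notes) of why averaging helps: each per-trajectory gradient estimate is noisy, but the mean of many of them concentrates around the true value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up per-trajectory gradient estimates g_hat for a single parameter.
# Individually they are dominated by noise; averaged, the signal emerges.
true_gradient = 0.3
g_hats = rng.normal(loc=true_gradient, scale=2.0, size=1000)  # 1000 trajectories

print("one trajectory :", g_hats[0])      # noisy, could even have the wrong sign
print("average of 1000:", g_hats.mean())  # close to the true value 0.3
```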


Rewards Normalization

  • In many cases, the distribution of rewards shifts as learning happens. A reward of 1 might be really good in the beginning, but really bad after 1000 training episodes.

  • Learning can be improved if we normalize the rewards, where $\mu$ is the mean, and $\sigma$ the standard deviation.

    • $R_i \leftarrow \frac{R_i -\mu}{\sigma} \qquad$

      • $ \mu = \frac{1}{N}\sum_i^N R_i \qquad$
      • $\sigma = \sqrt{\frac{1}{N}\sum_i (R_i - \mu)^2}$
    • when all the $R_i$ are the same, $\sigma = 0$; in that case we can set all the normalized rewards to 0 to avoid numerical problems (handled in the sketch after this list)

    • This batch-normalization technique is also used in many other problems in AI (e.g. image classification), where normalizing the input can improve learning.

    • Intuitively, normalizing the rewards roughly corresponds to picking half the actions to encourage/discourage, while also making sure the steps for gradient ascents are not too large/small. 🎲
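
A minimal NumPy sketch of the normalization above (the function name is just illustrative), including the $\sigma = 0$ edge case:

```python
import numpy as np

def normalize_rewards(R, eps=1e-10):
    """Batch-normalize total rewards: R_i <- (R_i - mu) / sigma.

    When all rewards are identical (sigma == 0), return zeros to avoid
    dividing by zero, as noted above.
    """
    R = np.asarray(R, dtype=np.float64)
    mu, sigma = R.mean(), R.std()
    if sigma < eps:
        return np.zeros_like(R)
    return (R - mu) / sigma

print(normalize_rewards([1.0, 2.0, 3.0]))  # approx [-1.22, 0., 1.22]
print(normalize_rewards([5.0, 5.0, 5.0]))  # [0. 0. 0.]
```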


Credit Assignment

At time-step $t$

  • Even before an action is decided, the agent has already received all the rewards up until step $t-1$. So we can think of that part of the total reward as the reward from the past. The rest is denoted as the future reward.

    • $(\overbrace{…+r_{t-1}}^{\require{cancel}\bcancel{R^{\rm past}_t}} + \overbrace{r_{t}+…}^{R^{\rm future}_t})$
  • Because we have a Markov process, the action at time-step $t$ can only affect the future reward, so the past reward shouldn’t be contributing to the policy gradient.

    • So to properly assign credit to the action $a_t$, we should ignore the past reward.
    • So a better policy gradient simply uses the future reward as the coefficient (see the reward-to-go sketch after this list)
    • $g=\sum_t \color{blue}{R_t^{\rm future}} \color{black}{\nabla_{\theta}\log \pi_\theta(a_t|s_t)}$
  • Mathematically, ignoring past rewards might change the gradient for each specific trajectory, but it doesn’t change the averaged gradient.

    • So even though the gradient is different during training, on average we are still maximizing the average reward.
    • In fact, the resultant gradient is less noisy, so training using future reward should speed things up! 🚀
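
A minimal NumPy sketch (function name illustrative) of computing $R^{\rm future}_t$ for every time-step of a trajectory:

```python
import numpy as np

def future_rewards(rewards):
    """R_t^future = r_t + r_{t+1} + ... (the reward-to-go at each step)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # Reverse cumulative sum: flip, accumulate, flip back.
    return np.flip(np.cumsum(np.flip(rewards)))

# Game 1 of the quiz at the end of these notes has rewards (1, 0, 1):
print(future_rewards([1, 0, 1]))  # [2. 1. 1.]
```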

Importance Sampling

  • The trajectories we have were generated using the policy $\pi_\theta$. Each trajectory $\tau$ had a probability $P(\tau;\theta)$ of being sampled.

  • Just by chance, the same trajectory can be sampled under a new policy $\pi_{\theta'}$, with a different probability $\color{blue} {P(\tau;\theta')}$.

  • Imagine we want to compute the average of some quantity, say $f(\tau)$. We could simply generate trajectories from the new policy, compute $f(\tau)$ and average them.

  • Mathematically, this is equivalent to adding up all the $f(\tau)$, weighted by a probability of sampling each trajectory under the new policy.

    • $\sum_\tau \color{blue} {P(\tau;\theta')} f(\tau)$
  • we can modify this equation by multiplying and dividing by the same number, $P(\tau;\theta)$, and rearranging the terms:

    • $\sum_\tau \overbrace{P(\tau;\theta)}^{ \begin{matrix} \scriptsize \textrm{sampling under}\ \scriptsize \textrm{old policy } \pi_\theta \end{matrix} } \overbrace{\frac{ \color{blue} {P(\tau;\theta')}}{P(\tau;\theta)}}^{ \begin{matrix} \scriptsize \textrm{re-weighting}\ \scriptsize \textrm{factor} \end{matrix} } f(\tau)$
    • we can reinterpret the first part as the coefficient for sampling under the old policy, with an extra re-weighting factor, in addition to just averaging.
  • we can use old trajectories to compute averages for the new policy, as long as we add this extra re-weighting factor, which takes into account how under- or over-represented each trajectory is under the new policy compared to the old one (a toy sketch follows this list).

    • The same trick is used frequently in statistics, where a re-weighting factor is included to unbias surveys and voting predictions.
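
A toy sketch of the same trick outside RL (the Gaussian example is made up for illustration): estimate an average under a "new" distribution using samples drawn from an "old" one, re-weighted by the probability ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

f = lambda x: x ** 2

# Samples come from the "old" distribution N(0, 1) ...
x_old = rng.normal(loc=0.0, scale=1.0, size=100_000)

# ... but we want the average of f under the "new" distribution N(0.5, 1).
weights = gauss_pdf(x_old, 0.5, 1.0) / gauss_pdf(x_old, 0.0, 1.0)  # re-weighting factor
estimate = np.mean(weights * f(x_old))

print(estimate)  # close to the exact value 0.5**2 + 1 = 1.25
```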

The re-weighting factor

$\frac{\color{blue} {P(\tau;\theta')}}{P(\tau;\theta)} =\frac {\pi_{\theta'}(a_1|s_1)\, \pi_{\theta'}(a_2|s_2)\, \pi_{\theta'}(a_3|s_3)\cdots} {\pi_\theta(a_1|s_1)\, \pi_\theta(a_2|s_2)\, \pi_\theta(a_3|s_3)\cdots}$

  • Because each trajectory contains many steps, each trajectory probability is a product of the policy probabilities at every time-step.

  • A Problem: when some of the policy probabilities get close to zero, the re-weighting factor can become close to zero, or worse, close to 1 over 0, which diverges to infinity.

    • When this happens, the re-weighting trick becomes unreliable.
  • A Solution: In practice, we want to make sure the re-weighting factor is not too far from 1 when we utilize importance sampling (see the sketch below).
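
A minimal sketch of the factor as a product of per-step probability ratios (the numbers are made up), showing both the well-behaved case and the divergent one:

```python
import numpy as np

def reweighting_factor(new_probs, old_probs):
    """P(tau; theta') / P(tau; theta) as a product of per-step ratios
    pi_theta'(a_t|s_t) / pi_theta(a_t|s_t)."""
    return float(np.prod(np.asarray(new_probs) / np.asarray(old_probs)))

# New policy close to the old one -> factor stays near 1 (reliable).
print(reweighting_factor([0.52, 0.48, 0.51], [0.50, 0.50, 0.50]))  # ~1.02

# One old-policy probability near zero -> factor blows up (unreliable).
print(reweighting_factor([0.50, 0.50, 0.50], [0.50, 0.50, 1e-6]))  # ~5e5
```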


Proximal Policy Optimization (PPO)


Algorithm

  • First, collect some trajectories based on some policy $\pi_\theta$, and initialize $\theta'=\theta$
  • Next, compute the gradient of the clipped surrogate function using the trajectories
  • Update $\theta'$ using gradient ascent
    • $\theta'\leftarrow\theta' +\alpha \nabla_{\theta'}L_{\rm sur}^{\rm clip}(\theta', \theta)$
  • Then repeat steps 2-3 without generating new trajectories; typically, steps 2-3 are only repeated a few times (see the sketch of the clipped surrogate below)
  • Set $\theta=\theta'$, go back to step 1, and repeat.
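
A minimal PyTorch sketch of steps 2-3 (not the notes' full implementation): a clipped surrogate weighted by future rewards, applied to the tiny one-parameter policy from the quiz below. The learning rate, $\epsilon = 0.2$, and the 4 inner updates are illustrative choices.

```python
import torch

def clipped_surrogate(new_log_probs, old_log_probs, future_R, eps=0.2):
    """L_sur^clip: mean of min(ratio * R^future, clip(ratio, 1-eps, 1+eps) * R^future)."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_theta'(a_t|s_t) / pi_theta(a_t|s_t)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return torch.min(ratio * future_R, clipped * future_R).mean()

# One-parameter policy: P(move) = theta. The data mirrors game 2 of the quiz below.
theta_new = torch.tensor(0.5, requires_grad=True)          # theta', initialized to theta
actions   = torch.tensor([1.0, 0.0, 0.0])                  # 1 = move, 0 = do nothing
future_R  = torch.tensor([1.0, 1.0, 1.0])                  # future rewards of game 2
old_log_probs = torch.log(torch.tensor([0.5, 0.5, 0.5]))   # fixed, from the old policy

for _ in range(4):                                          # repeat steps 2-3 a few times
    new_log_probs = actions * torch.log(theta_new) + (1 - actions) * torch.log(1 - theta_new)
    L = clipped_surrogate(new_log_probs, old_log_probs, future_R)
    grad, = torch.autograd.grad(L, theta_new)
    with torch.no_grad():
        theta_new += 0.05 * grad                            # gradient ascent on the surrogate

print(theta_new.item())  # drops below 0.5: "move" is discouraged, matching the quiz's -2 gradient
```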

Test

Policy Gradient Quiz

Suppose we are training an agent to play a computer game.

  • There are only two possible actions: 0 = Do nothing, 1 = Move
    • There are three time-steps in each game, and our policy is completely determined by one parameter $\theta$, such that the probability of “moving” is $\theta$, and the probability of doing nothing is $1-\theta$
    • Initially $\theta=0.5$. Three games are played, the results are:
      • Game 1: actions: (1,0,1) rewards: (1,0,1)

      • Game 2: actions: (1,0,0) rewards: (0,0,1)

      • Game 3: actions: (0,1,0) rewards: (1,0,1)

  • Question 1: What are the future rewards for the first game?

    (1+0+1, 0+1, 1) = (2, 1, 1)

  • Question 2: What is the policy gradient computed from the second game, using future rewards?

    -2
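
    • Worked check for Question 2 (game 2 has future rewards $(0+0+1,\ 0+1,\ 1) = (1, 1, 1)$ and $\theta = 0.5$):

      • $\nabla_\theta \log \pi_\theta(1|s) = \frac{1}{\theta} = 2$, $\qquad \nabla_\theta \log \pi_\theta(0|s) = \frac{-1}{1-\theta} = -2$
      • $g = \sum_t R^{\rm future}_t\, \nabla_\theta \log \pi_\theta(a_t|s_t) = 1 \cdot 2 + 1 \cdot (-2) + 1 \cdot (-2) = -2$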

