Temporal Difference


Temporal Difference (TD) Control

  • A class of model-free reinforcement learning methods which learn by bootstrapping from the current estimate of the value function.
  • These methods sample from the environment, like Monte Carlo methods, and perform updates based on current estimates, like dynamic programming methods.

TD vs MC Control

  • TD: update the Q-table after every time step
  • MC: update the Q-table only after completing an entire episode of interaction (the update rules are sketched below)
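
The difference is easiest to see in the update rules. A sketch using the standard one-step Sarsa and constant-$\alpha$ MC forms (here $\alpha$ is the step size, $\gamma$ the discount rate, and $G_t$ the return observed from time step $t$; these symbols are not defined in the notes above):

  • TD (Sarsa(0)), applied after every time step: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)$
  • MC, applied only after the episode ends: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( G_t - Q(S_t, A_t) \right)$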

Sarsa(0) Algorithm

  • Naming: State–action–reward–state–action
  • In the algorithm, the number of episodes the agent collects is equal to num_episodes. For every time step $t\ge 0$, the agent:
    • takes the action $A_t$ (from the current state $S_t$) that is $\epsilon$-greedy with respect to the Q-table
    • receives the reward $R_{t+1}$ and next state $S_{t+1}$
    • chooses the next action $A_{t+1}$ (from the next state $S_{t+1}$) that is $\epsilon$-greedy with respect to the Q-table
    • uses the information in the tuple ($S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}$) to update the entry $Q(S_t, A_t)$ in the Q-table corresponding to the current state $S_t$ and the action $A_t$ (see the Python sketch after this list).
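
A minimal sketch of this update step in Python, assuming the Q-table is a defaultdict of NumPy arrays; the epsilon_greedy helper and the names alpha (step size) and gamma (discount rate) are illustrative, not part of the original notes:

```python
import numpy as np
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, eps):
    """Return the greedy action with probability 1 - eps, otherwise a random action."""
    if np.random.rand() > eps:
        return int(np.argmax(Q[state]))
    return np.random.randint(n_actions)

def sarsa_update(Q, state, action, reward, next_state, next_action, alpha, gamma):
    """Sarsa(0): move Q(S_t, A_t) toward the TD target R_{t+1} + gamma * Q(S_{t+1}, A_{t+1})."""
    td_target = reward + gamma * Q[next_state][next_action]
    Q[state][action] += alpha * (td_target - Q[state][action])

# Example Q-table for an environment with 4 discrete actions.
Q = defaultdict(lambda: np.zeros(4))
```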

Sarsa algorithm, image from Udacity nd893


Sarsamax (Q-Learning)

Sarsamax algorithm, image from Udacity nd893
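
Sarsamax differs from Sarsa(0) only in the TD target: it uses the maximum action value in the next state rather than the value of the next action actually taken, so the update can be applied before the next action is chosen. A minimal sketch, reusing the same hypothetical Q-table layout and parameter names as the Sarsa sketch above:

```python
import numpy as np

def sarsamax_update(Q, state, action, reward, next_state, alpha, gamma):
    """Q-learning (Sarsamax): the TD target uses max_a Q(S_{t+1}, a)."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state][action] += alpha * (td_target - Q[state][action])
```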


Optimism 🔭

  • For any TD control method, you must begin by initializing the values in the Q-table. It has been shown that initializing the estimates to large values can improve performance.
  • If all of the possible rewards that can be received by the agent are negative, then initializing every estimate in the Q-table to zero is already a good technique.
  • In this case, we refer to the initialized Q-table as optimistic, since the action-value estimates are guaranteed to be larger than the true action values (see the sketch after this list).
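
A sketch of the two initializations for a Q-table stored as a defaultdict of NumPy arrays (the layout and the value 10.0 are illustrative assumptions, not from the notes):

```python
import numpy as np
from collections import defaultdict

n_actions = 4

# Zero initialization: already optimistic when every possible reward is
# negative, because the true action values must then be negative too.
Q_zero = defaultdict(lambda: np.zeros(n_actions))

# Explicitly large initial estimates: optimistic regardless of the reward
# signs, which encourages the agent to try every action early on.
Q_optimistic = defaultdict(lambda: np.full(n_actions, 10.0))
```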

Exercise

OpenAI Gym: CliffWalkingEnv
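
A minimal sketch of loading the environment and running a short random rollout, written against the classic Gym API (newer gym/gymnasium releases return additional values from reset() and step()):

```python
import gym

# CliffWalking-v0: 4x12 grid world with reward -1 per step and -100 for
# stepping into the cliff, which also sends the agent back to the start.
env = gym.make('CliffWalking-v0')
print(env.observation_space)  # Discrete(48)
print(env.action_space)       # Discrete(4)

state = env.reset()
total_reward = 0
for _ in range(100):  # cap the rollout; a random policy rarely reaches the goal quickly
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
    total_reward += reward
    if done:
        break
print('return:', total_reward)
```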


Test

Sarsa algorithm, image from Udacity nd893

Answer: 6.16