Temporal Difference
Temporal Difference (TD) Control
- A class of model-free reinforcement learning methods which learn by bootstrapping from the current estimate of the value function.
- These methods sample from the environment, like Monte Carlo methods, and perform updates based on current estimates, like dynamic programming methods.
TD vs MC Control
- TD: update the Q-table after every time step
- MC: update the Q-table only after completing an entire episode of interaction
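- Concretely, the two update rules share the same form and differ only in their target (with step size $\alpha$, discount factor $\gamma$, and return $G_t$):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right) \qquad \text{(TD, after every step; Sarsa form, introduced below)}$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( G_t - Q(S_t, A_t) \right) \qquad \text{(MC, after the episode ends)}$$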
Sarsa(0) Algorithm
- Naming: State–action–reward–state–action
- In the algorithm, the number of episodes the agent collects is given by `num_episodes`. For every time step $t\ge 0$, the agent:
- takes the action $A_t$ (from the current state $S_t$) that is $\epsilon$-greedy with respect to the Q-table
- receives the reward $R_{t+1}$ and next state $S_{t+1}$
- chooses the next action $A_{t+1}$ (from the next state $S_{t+1}$) that is $\epsilon$-greedy with respect to the Q-table
- uses the information in the tuple ($S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}$) to update the entry $Q(S_t, A_t)$ in the Q-table corresponding to the current state $S_t$ and the action $A_t$ (a code sketch of this loop follows the figure below).
Sarsa Algorithm, image from Udacity nd893
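A minimal Python sketch of the loop described above, assuming the classic Gym step API `(state, reward, done, info)`; the helper name `epsilon_greedy` and the hyperparameter defaults are illustrative, not taken from the course notebook:

```python
import numpy as np
from collections import defaultdict

def epsilon_greedy(Q, state, nA, eps):
    """Choose the greedy action with probability 1 - eps, otherwise a random action."""
    if np.random.random() > eps:
        return int(np.argmax(Q[state]))
    return np.random.choice(nA)

def sarsa(env, num_episodes, alpha=0.01, gamma=1.0, eps=0.1):
    """Sarsa(0): update Q(S_t, A_t) after every time step using (S, A, R, S', A')."""
    nA = env.action_space.n
    Q = defaultdict(lambda: np.zeros(nA))
    for _ in range(num_episodes):
        state = env.reset()                      # classic Gym API: reset() returns the state
        action = epsilon_greedy(Q, state, nA, eps)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            if done:
                # Terminal transition: no bootstrapping beyond the final reward
                Q[state][action] += alpha * (reward - Q[state][action])
            else:
                next_action = epsilon_greedy(Q, next_state, nA, eps)
                # Sarsa target bootstraps from the action actually chosen next
                td_target = reward + gamma * Q[next_state][next_action]
                Q[state][action] += alpha * (td_target - Q[state][action])
                state, action = next_state, next_action
    return Q
```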
Sarsamax (Q-Learning)
Sarsamax Algorithm, image from Udacity nd893
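Sarsamax differs from Sarsa only in the update target: it bootstraps from the greedy action value $\max_a Q(S_{t+1}, a)$ rather than from the action actually selected next, which makes it an off-policy method. A minimal sketch under the same assumptions as above (it reuses `numpy`, `defaultdict`, and the illustrative `epsilon_greedy` helper from the Sarsa sketch):

```python
def sarsamax(env, num_episodes, alpha=0.01, gamma=1.0, eps=0.1):
    """Sarsamax (Q-Learning): the target uses max_a Q(S', a), not the next action taken."""
    nA = env.action_space.n
    Q = defaultdict(lambda: np.zeros(nA))
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(Q, state, nA, eps)
            next_state, reward, done, _ = env.step(action)
            # Greedy bootstrap; the terminal state's value is taken to be 0
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```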
Optimism 🔭
- For any TD control method, you must begin by initializing the values in the Q-table. It has been shown that initializing the estimates to large values can improve performance.
- If all of the possible rewards that can be received by the agent are negative, then initializing every estimate in the Q-table to zero is a good technique.
- In this case, we refer to the initialized Q-table as optimistic, since the action-value estimates are guaranteed to be larger than the true action values.
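A minimal sketch of the idea, reusing the table layout from the sketches above; the constant `100.0` is an arbitrary illustrative choice:

```python
import numpy as np
from collections import defaultdict

nA = 4  # number of actions; 4 matches grid-world envs such as CliffWalking (illustrative)

# If every possible reward is negative, the true action values are negative,
# so a zero-initialized table already overestimates them ("optimistic").
Q_zeros = defaultdict(lambda: np.zeros(nA))

# More generally, a large positive constant drives early exploration, because
# untried actions look better than anything the agent has experienced so far.
Q_optimistic = defaultdict(lambda: np.full(nA, 100.0))
```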
Exercise
OpenAI Gym: CliffWalkingEnv
Test
Sarsa
Sarsa Algorithm, image from Udacity nd893
Answer: 6.16
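A hedged usage sketch for the exercise, assuming a Gym version that registers `CliffWalking-v0` and exposes the classic step API, and reusing the `sarsa` sketch defined earlier (not the notebook's own implementation):

```python
import gym

env = gym.make('CliffWalking-v0')   # assumes this env id is registered in your Gym install
Q = sarsa(env, num_episodes=5000, alpha=0.01)

# Derive the greedy policy from the learned action-value estimates
policy = {state: int(np.argmax(values)) for state, values in Q.items()}
```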