RL Solutions
Policies
- A deterministic policy: a mapping π:S→A
- For each state s∈S, it yields the action a∈A that the agent will choose while in state s.
- input: state
- output: action
- Example
- π(low)=recharge
- π(high)=search
- A stochastic policy: a mapping π:S×A→[0,1]
- For each state s∈S and action a∈A, it yields the probability π(a|s) that the agent chooses action a while in state s.
- input: state and action
- output: the probability that the agent takes action a while in state s
- π(a|s)=P(At=a|St=s)
- Example
- low
- π(recharge|low)=0.5
- π(wait|low)=0.4
- π(search|low)=0.1
- high
- π(search|high)=0.6
- π(wait|high)=0.4
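The two representations above can be written out directly in code. The sketch below is a minimal illustration (not from the course material): the state and action names follow the recharge/search/wait example, and `act` is a hypothetical helper that samples an action from either kind of policy.

```python
import random

# Deterministic policy: a plain mapping state -> action.
deterministic_policy = {
    "low": "recharge",
    "high": "search",
}

# Stochastic policy: each state maps to a distribution pi(a|s) over actions.
stochastic_policy = {
    "low":  {"recharge": 0.5, "wait": 0.4, "search": 0.1},
    "high": {"search": 0.6, "wait": 0.4},
}

def act(policy, state):
    """Return an action for `state`, handling both policy representations."""
    rule = policy[state]
    if isinstance(rule, str):                # deterministic: the entry is the action itself
        return rule
    actions, probs = zip(*rule.items())      # stochastic: sample according to pi(.|s)
    return random.choices(actions, weights=probs, k=1)[0]

print(act(deterministic_policy, "high"))     # always 'search'
print(act(stochastic_policy, "low"))         # 'recharge' w.p. 0.5, 'wait' 0.4, 'search' 0.1
```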
Gridworld Example
In this gridworld example, once the agent selects an action:
- it always moves in the chosen direction (in contrast to general MDPs, where the agent doesn’t always have complete control over what the next state will be),
- the reward can be predicted with complete certainty (in contrast to general MDPs, where the reward is a random draw from a probability distribution).
- As a result, the value of any state can be calculated as the sum of the immediate reward and the (discounted) value of the next state (see the sketch below).
Gridworld images from Udacity nd893
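Because both the transition and the reward are deterministic here, the value of each state along a path can be computed with a single backward sweep of v(s)=r+γ·v(s′). The sketch below uses made-up rewards, since the actual gridworld layout and reward values live in the course images.

```python
# Hypothetical rewards collected along one deterministic path to the terminal
# state (placeholder numbers, not the course gridworld).
rewards_along_path = [-1, -1, 10]
gamma = 1.0

# Work backwards: v(s) = r + gamma * v(next state), with v(terminal) = 0.
value = 0.0
values = []
for r in reversed(rewards_along_path):
    value = r + gamma * value
    values.append(value)
values.reverse()

print(values)   # [8.0, 9.0, 10.0] -> value of each state along the path
```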
State-Value Functions
The state-value function for a policy is denoted vπ. For each state s∈S, it yields the expected return if the agent starts in state s and then uses the policy to choose its actions for all time steps.
- The value of state s under policy π: vπ(s)=Eπ[Gt|St=s]
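To make the return Gt concrete: Gt is the (discounted) sum of the rewards collected from time t onward, and vπ(s) is its expectation when the agent starts in s and follows π. The sketch below just computes one such return from an illustrative reward sequence; the numbers are placeholders, not from the course.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma**2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards from one sampled episode starting in state s (illustrative numbers):
print(discounted_return([1, 0, 2], gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62
```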
Bellman Equations
- The Bellman equations express the fact that value functions satisfy recursive relationships.
- Read more in Chapters 3.5 and 3.6.
- The Bellman Expectation Equation expresses the value of any state s in terms of the expected immediate reward and the (discounted) expected value of the next state:
- vπ(s)=Eπ[Rt+1+γvπ(St+1)|St=s]
- In the event that the agent’s policy π is deterministic, the agent selects action π(s) when in state s, and the Bellman Expectation Equation can be rewritten as a sum over two variables (s′ and r):
- vπ(s)=∑s′∈S+,r∈R p(s′,r|s,π(s))(r+γvπ(s′))
- In this case, we multiply the sum of the reward and discounted value of the next state (r+γvπ(s′)) by its corresponding probability p(s′,r|s,π(s)) and sum over all possibilities to yield the expected value.
- If the agent’s policy π is stochastic, the agent selects action a with probability π(a|s) when in state s, and the Bellman Expectation Equation can be rewritten as a sum over three variables (s′, r, and a):
- vπ(s)=∑s′∈S+,r∈R,a∈A(s) π(a|s)p(s′,r|s,a)(r+γvπ(s′))
- In this case, we multiply the sum of the reward and discounted value of the next state (r+γvπ(s′)) by its corresponding probability π(a|s)p(s′,r|s,a) and sum over all possibilities to yield the expected value.
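The stochastic-policy sum can be evaluated directly when the dynamics are known. The sketch below performs one such backup on a tiny hypothetical MDP; the policy π, the dynamics p(s′,r|s,a), and the successor values vπ(s′) are all made-up placeholders.

```python
gamma = 0.9

# pi(a|s): action probabilities in state 's0' (hypothetical policy).
pi = {"s0": {"left": 0.5, "right": 0.5}}

# p[(s, a)] -> list of (probability, next_state, reward) triples (hypothetical dynamics).
p = {
    ("s0", "left"):  [(1.0, "s1", 0.0)],
    ("s0", "right"): [(0.8, "s2", 1.0), (0.2, "s1", 0.0)],
}

# Assume the values of the successor states are already known.
v = {"s1": 2.0, "s2": 5.0}

# One backup of the Bellman Expectation Equation:
# v_pi(s) = sum_a pi(a|s) * sum_{s', r} p(s', r | s, a) * (r + gamma * v_pi(s'))
v_s0 = sum(
    pi["s0"][a] * prob * (r + gamma * v[s_next])
    for a in pi["s0"]
    for prob, s_next, r in p[("s0", a)]
)
print(v_s0)   # 0.5*1.8 + 0.5*(0.8*5.5 + 0.2*1.8) = 3.28
```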
Action-value Functions
- The action-value function for a policy is denoted qπ.
- The value qπ(s,a) is the value of taking action a in state s under policy π.
- For each state s∈S and action a∈A, it yields the expected return if the agent starts in state s, takes action a, and then follows the policy for all future time steps.
- qπ(s,a)=Eπ[Gt|St=s,At=a]
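Given vπ and the one-step dynamics, the same kind of lookahead yields the action value: qπ(s,a)=∑s′,r p(s′,r|s,a)(r+γvπ(s′)). A minimal sketch, reusing the hypothetical dynamics from the Bellman example above:

```python
# q_pi(s, a) = sum over s', r of p(s', r | s, a) * (r + gamma * v_pi(s')),
# with the hypothetical dynamics and successor values from the sketch above.
def q_value(s, a, p, v, gamma):
    return sum(prob * (r + gamma * v[s_next]) for prob, s_next, r in p[(s, a)])

gamma = 0.9
p = {("s0", "right"): [(0.8, "s2", 1.0), (0.2, "s1", 0.0)]}
v = {"s1": 2.0, "s2": 5.0}
print(q_value("s0", "right", p, v, gamma))   # 0.8*(1 + 4.5) + 0.2*(0 + 1.8) = 4.76
```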
Optimality
- A policy π′ is defined to be better than or equal to a policy π if and only if vπ′(s)≥vπ(s) for all s∈S.
- An optimal policy π∗, which satisfies π∗≥π for all policies π, is guaranteed to exist but may not be unique.
- How do you define a better policy?
- A policy is better if it yields an expected return at least as high from every state (the comparison on vπ above).
Optimal State-value Function
All optimal policies have the same state-value function v∗
Optimal Action-value Function
All optimal policies have the same action-value function q∗
Optimal Policies
- Interaction → q∗ → π∗
- Once the agent determines the optimal action-value function q∗, it can quickly obtain an optimal policy π∗ by setting
- π∗(s)=argmaxa∈A(s) q∗(s,a): select the action with the maximal action value q∗(s,a) in each state
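A minimal sketch of this argmax step, using a hypothetical q∗ table keyed by state and then by action:

```python
# pi*(s) = argmax over a in A(s) of q*(s, a); the q* values here are hypothetical.
q_star = {
    "high": {"search": 4.3, "wait": 3.1},
    "low":  {"search": 1.2, "wait": 2.0, "recharge": 3.5},
}

pi_star = {s: max(q_s, key=q_s.get) for s, q_s in q_star.items()}
print(pi_star)   # {'high': 'search', 'low': 'recharge'}
```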
Quiz
Calculate State-value Function vπ
- Assuming γ=1, calculate vπ(s4) and vπ(s1)
- Solve the problem using the Bellman Equations
Image from Udacity nd893
- Answer: vπ(s4)=1 and vπ(s1)=2
About Deterministic Policy
True or False?:
For a deterministic policy π:
vπ(s)=qπ(s,π(s)) holds for all s∈S.
Answer: True.
- It follows from how we have defined the action-value function.
- The value of the state-action pair s, π(s) is the expected return if the agent starts in state s, takes action π(s), and henceforth follows the policy π.
- In other words, it is the expected return if the agent starts in state s, and then follows the policy π, which is exactly equal to the value of state s.
Optimal Policies
Consider an MDP (shown in the table below) with a corresponding optimal action-value function. Which of the following describes a potential optimal policy that corresponds to this optimal action-value function?
|    | a1 | a2 | a3 |
|----|----|----|----|
| s1 | 1  | 3  | 4  |
| s2 | 2  | 2  | 1  |
| s3 | 3  | 1  | 1  |
Answer
- State s1: the agent always selects action a3
- State s2: the agent is free to select either a1 or a2
- State s3: the agent must select a1
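As a sanity check, taking the argmax of each row of the q∗ table reproduces this answer (a throwaway sketch, not part of the quiz):

```python
# q* from the table above; the optimal actions in each state are the argmax of each row.
q_star = {
    "s1": {"a1": 1, "a2": 3, "a3": 4},
    "s2": {"a1": 2, "a2": 2, "a3": 1},
    "s3": {"a1": 3, "a2": 1, "a3": 1},
}

for s, q_s in q_star.items():
    best = max(q_s.values())
    print(s, [a for a, q in q_s.items() if q == best])

# s1 ['a3']        -> always selects a3
# s2 ['a1', 'a2']  -> free to select either a1 or a2
# s3 ['a1']        -> must select a1
```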