Model Free Control
Fourth in a series on understanding Reinforcement Learning.
Previously, we learned about Model Free Prediction, which lets us estimate the value function of an unknown MDP using sampling (Monte Carlo and Temporal Difference learning algorithms).
In this notebook, we will talk about model-free control, where we try to optimize these value functions.
To improve a policy (policy improvement), as we saw in Dynamic Programming, we can make it greedy. However, greedy policy improvement over the state value function requires knowledge of the MDP model. $$ \pi'(s) = \arg\max_{a} \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t=s, A_t=a] $$
Because computing this expectation needs the transition dynamics, which we do not have in the model-free setting, policy improvement becomes cumbersome without a model.
Fortunately, instead of estimating state values, we can estimate state-action values; then we can find the greedy policy much more easily by picking the highest-valued action in every state.
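As a minimal sketch, assuming a tabular setting where the estimates live in a NumPy array `q` of shape `[num_states, num_actions]` (the array and function names are my own illustration):

```python
import numpy as np

def greedy_policy(q):
    """For every state, pick the action with the highest estimated value.

    q: array of shape [num_states, num_actions] of action-value estimates.
    Returns an array of shape [num_states] with the greedy action per state.
    Note that no model of the MDP is needed; we only read the q-table.
    """
    return np.argmax(q, axis=1)
```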
Then, we can apply Generalised Policy Iteration with the action value function: starting from a random action value function and policy, we iteratively estimate the action values (e.g. using Monte Carlo policy evaluation) and perform policy improvement.
However, if we choose fully greedy policy improvement, then we do not explore, which means we cannot sample every state-action pair while learning by interaction. This makes policy evaluation unreliable: if we evaluate a fully greedy policy with the MC algorithm, it might never select certain actions in certain states, so we get no reliable estimate of the action values for those actions, and the improvement step could be wrong.
To deal with that problem, we can use an $\epsilon$-greedy policy, which keeps a small probability of picking any action.
By doing so, we obtain a fully model-free algorithm that can be used for policy iteration.
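A possible sketch of such an action selection, reusing the hypothetical `q` table from above and a NumPy random generator `rng`:

```python
import numpy as np

def epsilon_greedy_action(q, state, epsilon, rng):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy action for the given state."""
    num_actions = q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(np.argmax(q[state]))
```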
Basically, we would like to apply TD learning to action value functions so that we keep the same convenience of being able to pick a greedy or $\epsilon$-greedy policy. In addition, we can update at every time step, because TD can learn from individual transitions without completing the full episode.
Then, we can use this in place of MC learning for the policy evaluation step, while the policy improvement step remains $\epsilon$-greedy improvement. This on-policy TD control algorithm is known as SARSA.
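A sketch of the resulting SARSA update on a tabular `q`, with illustrative variable names:

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA update: bootstrap on the action actually taken next (a_next),
    which is sampled from the (epsilon-greedy) behaviour policy."""
    td_target = r + gamma * q[s_next, a_next]
    q[s, a] += alpha * (td_target - q[s, a])
```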
In the previous cases, we considered the MC and SARSA algorithms, where we interleave policy evaluation and improvement for the policy we are following. Now, we turn to off-policy learning, which means learning about a policy different from the one we are following; a specific, very popular algorithm of this kind is Q-learning.
Q-learning corresponds to sampling the following value iteration: $$ q_{k+1}(s,a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_k(S_{t+1},a') \mid S_t=s, A_t=a] $$
Here, the iteration has a maximization over the actions one step ahead: we use the action that maximizes our action values.
We can sample this with the following update: $$ q_{t+1}(S_t,A_t) = q_t(S_t,A_t) + \alpha_t \left(R_{t+1} + \gamma \max_{a'} q_t(S_{t+1},a') - q_t(S_t,A_t)\right) $$
Inside the parentheses, we bootstrap differently from SARSA: instead of the action actually taken in the next state, we consider the best action we could possibly take according to our current estimates, and use its value as the greedy value under our current value function.
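A corresponding sketch of one Q-learning update on a tabular `q` (names are illustrative):

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """One Q-learning update: bootstrap on the best action in the next state
    according to the current estimates, regardless of which action the
    behaviour policy will actually take there."""
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (td_target - q[s, a])
```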
On-policy learning means learning about a behaviour policy from experience sampled from that same policy. Here, we always consider just one policy (Monte Carlo, SARSA), and that policy generates the behaviour because we want to evaluate/predict the value functions of that policy. We call this on-policy learning because we are studying the policy that we are following.
On the other hand, off-policy learning learns about a target policy $\pi$ while the experience is sampled from a different policy $\mu$. This essentially means learning counterfactually about the other things we could do: "what if...?"
In general, in off-policy learning, we would like to evaluate a target policy $\pi(a|s)$ and compute its estimated value $v_\pi(s)$ or $q_\pi(s,a)$, while using a behaviour policy $\mu(a|s)$ to generate the actions.
In practice, this is important for several reasons:
- We want to learn from a stored database of experience (observational data)
- Reuse experience from old policies (past experience)
- Learn about multiple policies while following one policy
- Learn about greedy policy while following exploratory policy.
Q-learning estimates the value of the greedy policy: $$ q_{t+1}(S_t,A_t) = q_t(S_t,A_t) + \alpha_t \left(R_{t+1} + \gamma \max_{a'} q_t(S_{t+1},a') - q_t(S_t,A_t)\right) $$
This implies that we are learning the value of taking this action and then behaving greedily from the next step onwards. It is a valid target, and it learns about the greedy policy, but we do not need to act according to that policy all the time, because the greedy policy does not explore efficiently.
In this case, the action value function directly approximates $q_*$, independently of the policy being followed.
Theorem: Q-learning control converges to the optimal action-value function, $q \rightarrow q^* $ as long as we take each action in each state infinitely often.
It has been proved that this works for any policy that eventually selects all actions sufficiently often (it requires an appropriately decaying step size: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$).
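For example, a decaying step size such as $\alpha_t = 1/t$ satisfies both conditions: $$ \sum_{t=1}^{\infty} \frac{1}{t} = \infty, \qquad \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty $$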
The issue with Q-learning, looking at this equation, is that we use the same value function both to select the next action and to evaluate it. However, the action values are only approximate, so the estimates can be wrong, and taking the maximum tends to pick overestimated values rather than underestimated ones.
To deal with this overestimation in Q-learning, we need to decouple the action values used for selecting the action from those used for evaluating it. This gives rise to an algorithm called Double Q-learning, which maintains two action value functions, $q$ and $q'$, and uses the two targets $$ R_{t+1} + \gamma\, q'(S_{t+1}, \arg\max_a q(S_{t+1},a)) \qquad (1) $$ $$ R_{t+1} + \gamma\, q(S_{t+1}, \arg\max_a q'(S_{t+1},a)) \qquad (2) $$
The first target is the standard one-step reward plus the discounted value of the next state, but the value of the next state is obtained by picking an action according to our normal action value function $q$ and then evaluating that action with $q'$.
Then, we can use each of these as the target for the corresponding action value function. At every time step, we pick one of the action value functions to update ($q$ with target (1), $q'$ with target (2)). We never update both at the same time, because that would correlate them. This way we get an unbiased estimate of the value of the action that is greedy according to one of them.
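A sketch of one such step, assuming two tabular arrays `q1` and `q2` and a random generator `rng` for the coin flip (names are illustrative):

```python
import numpy as np

def double_q_update(q1, q2, s, a, r, s_next, alpha, gamma, rng):
    """One Double Q-learning step: a coin flip decides which table to update.
    The updated table selects the greedy next action; the other table evaluates it."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(q1[s_next]))        # select with q1 ...
        target = r + gamma * q2[s_next, a_star]    # ... evaluate with q2 (target (1))
        q1[s, a] += alpha * (target - q1[s, a])
    else:
        a_star = int(np.argmax(q2[s_next]))        # select with q2 ...
        target = r + gamma * q1[s_next, a_star]    # ... evaluate with q1 (target (2))
        q2[s, a] += alpha * (target - q2[s, a])
```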
It has been proved that Double Q-learning also converges to the optimal policy under the same conditions as Q-learning.
Off Policy Learning: Importance Sampling Corrections
The goal: given some function $f$ of a random input $X$ sampled from a distribution $d'$, we would like to estimate the expectation of $f(X)$ under a different (target) distribution $d$.
The solution is to re-weight the function: $$ \mathbb{E}_{x \sim d} [f(x)] = \sum_x d(x) f(x) = \sum_x d'(x) \frac{d(x)}{d'(x)} f(x) = \mathbb{E}_{x \sim d'} \left[\frac{d(x)}{d'(x)} f(x)\right] $$
This means that if we sample the final expression, we get an unbiased estimate of the quantity we were looking for on the left-hand side.
There is an assumption that $d'(x)$ must be non-zero wherever $d(x)$ is; intuitively, if we never sample $x$ under $d'$, we cannot expect to learn about it. The intuition is that we scale up events that are rare under $d'$ but common under $d$ (since $d'(x)$ is small and $d(x)$ is large), and scale down events in the opposite situation.
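A small numerical check of this identity, with made-up distributions and function values:

```python
import numpy as np

# d is the target distribution, d_prime the sampling (behaviour) distribution.
rng = np.random.default_rng(0)
d       = np.array([0.7, 0.2, 0.1])   # target distribution over 3 outcomes
d_prime = np.array([0.2, 0.3, 0.5])   # sampling distribution (non-zero everywhere)
f       = np.array([1.0, 5.0, 10.0])  # arbitrary function values f(x)

samples = rng.choice(3, size=100_000, p=d_prime)
is_estimate = np.mean((d[samples] / d_prime[samples]) * f[samples])

print(is_estimate)    # close to the true expectation under d ...
print(np.sum(d * f))  # ... which is 0.7*1 + 0.2*5 + 0.1*10 = 2.7
```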
Importance Sampling for Off-Policy TD Updates
To learn about a target policy from Monte Carlo returns, the behaviour would have to match the target policy's action selections over the whole episode, which rarely happens and gives a very high variance estimate. Thus, we turn to temporal difference learning and construct TD targets that weight the target $r + \gamma v(s')$ by importance sampling.
Here, we only need a single ratio: the probability of the action under the target policy divided by its probability under the behaviour policy. This gives much lower variance than Monte Carlo importance sampling.
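A sketch of the resulting one-step update for state values, assuming hypothetical arrays `pi` and `mu` of shape `[num_states, num_actions]` holding the action probabilities of the target and behaviour policies:

```python
def off_policy_td_update(v, s, a, r, s_next, alpha, gamma, pi, mu):
    """One-step off-policy TD update for state values: the TD target is
    weighted by a single importance-sampling ratio pi(a|s) / mu(a|s)."""
    rho = pi[s, a] / mu[s, a]
    td_target = rho * (r + gamma * v[s_next])
    v[s] += alpha * (td_target - v[s])
```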
Now, we consider off-policy learning for action value functions without requiring importance sampling corrections. Instead of the maximum over next state-action pairs, we use the expected value, taking into account how likely each action is under the target policy we are considering.
For the next action, we do not have to bootstrap under the behaviour policy; we can bootstrap under the target policy that we are considering. The actual action we pick may come from our behaviour policy, while we bootstrap using the action probabilities of the target policy. $$ q(S_t,A_t) \leftarrow q(S_t,A_t) + \alpha \left( R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, q(S_{t+1},a) - q(S_t,A_t) \right) $$
We update the action value function $q(S_t,A_t)$ using a temporal difference error that takes the first reward plus the discounted value of the next state under our target policy $\pi$. It turns out that it does not matter how we select $A_t$: we can learn about the target policy under whatever behaviour policy we have.
This algorithm is called Expected SARSA, a name that refers to its on-policy version. The expectation means that we do not necessarily have to bootstrap under the behaviour policy: we can pick any target policy. Thus, Q-learning is a special case with a greedy target policy $\pi$: if $\pi$ is greedy with respect to $q$, the expectation collapses to the maximum, $\sum_a \pi(a \mid S_{t+1})\, q(S_{t+1},a) = \max_{a'} q(S_{t+1},a')$.
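A sketch of this update on a tabular `q`, again assuming a hypothetical array `pi` of target-policy probabilities with shape `[num_states, num_actions]`:

```python
import numpy as np

def expected_sarsa_update(q, s, a, r, s_next, alpha, gamma, pi):
    """Expected SARSA: bootstrap on the expected value of q over the next
    action under the target policy pi, so no importance-sampling
    correction is needed."""
    expected_q = np.dot(pi[s_next], q[s_next])
    td_target = r + gamma * expected_q
    q[s, a] += alpha * (td_target - q[s, a])
```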