Preliminaries

  • Markov property: a process is Markov if the future is independent of the past given the present.
  • State transition $$ p(s'|s,a)= \sum _r p(s',r|s,a) $$
  • Expected reward $$ r(s,a) = \mathbb{E} [R|s,a] = \sum _r r \sum _{s'} p(r,s'|s,a) $$
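
As a minimal sketch of these two marginalizations, assuming the joint dynamics $p(s',r|s,a)$ are stored as a dictionary mapping $(s,a)$ to (next state, reward, probability) triples (a representation invented here for illustration):

```python
from collections import defaultdict

# Hypothetical representation of the joint dynamics p(s', r | s, a):
# dynamics[(s, a)] is a list of (next_state, reward, probability) triples.
dynamics = {
    ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "a1"): [("s1", 0.0, 1.0)],
}

def state_transition(dynamics, s, a):
    """p(s'|s,a) = sum_r p(s', r | s, a)."""
    probs = defaultdict(float)
    for s_next, _, prob in dynamics[(s, a)]:
        probs[s_next] += prob
    return dict(probs)

def expected_reward(dynamics, s, a):
    """r(s,a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(r * prob for _, r, prob in dynamics[(s, a)])

print(state_transition(dynamics, "s0", "a0"))  # {'s0': 0.5, 's1': 0.5}
print(expected_reward(dynamics, "s0", "a0"))   # 0.5
```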

Markov Decision Processes (MDP)

MDPs are a classical formulation of sequential decision making, where each action influences not only the immediate reward but also subsequent situations, or states, and through those, future rewards. Thus MDPs involve delayed reward and the need to trade off immediate reward against delayed reward.

In bandit problems, we estimated the value $q_*(a)$ of each action a. In an MDP, we estimate the value $q_*(s,a)$ of each action a in each state s, or we estimate the value $v_*(s)$ of each state given optimal action selections. These state-dependent quantities are essential to accurately assign credit for long-term consequences to individual action selections.

MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually: the agent selects actions and the environment responds to these actions, presenting new situations to the agent. The environment also gives rise to rewards in response to the agent's choices of actions.

A state s has the Markov property when, for all states $s' \in \mathcal{S}$ and all rewards $r \in \mathcal{R}$, $$ p(R_{t+1}=r, S_{t+1}=s' | S_t=s) = p(R_{t+1}=r, S_{t+1}=s' | S_1,...,S_{t-1},S_t=s) $$

Then, the distribution of the reward and the next state when we condition only on the current state is the same as when we condition on all previous states. So, knowing all the previous states gives us no additional information about what the next state and reward will be; it suffices to look at the current state.

$p(r,s'|s,a)$ is the joint probability of reward r and next state s', given that the current state is s and action a has been taken.

In a finite MDP, the sets of states, actions and rewards ($\mathcal{S}, \mathcal{A},\mathcal{R}$) all have a finite number of elements. The function p gives the dynamics of the MDP, and $\gamma \in [0,1]$ is a discount factor that trades off later rewards against earlier ones. An MDP is then the tuple $(\mathcal{S}, \mathcal{A},\mathcal{R}, p, \gamma)$.
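
As a toy illustration (the states, actions and rewards below are invented), such a finite MDP can be written down directly as this tuple:

```python
# A toy finite MDP (S, A, R, p, gamma); everything here is made up for illustration.
S = ["s0", "s1"]
A = ["stay", "go"]
R = [0.0, 1.0]
gamma = 0.9

# p[(s, a)] lists (next_state, reward, probability), with probabilities summing to 1.
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 1.0, 1.0)],
}

mdp = (S, A, R, p, gamma)
```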

Acting in an MDP results in returns $G_t$: the total discounted reward from time step t. $$ G_t = R_{t+1} + \gamma R_{t+2} + ... = \sum _{k=0} ^\infty \gamma^k R_{t+k+1} $$
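
As a quick sketch, assuming we are given a finite list of sampled rewards $R_{t+1}, R_{t+2}, \dots$ (in an ongoing task $G_t$ sums over the whole, possibly infinite, trajectory):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite list of sampled rewards."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```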

In principle, $G_t$ is a random variable that depends on both the MDP and the policy.

Policies and Value Functions

Almost all reinforcement learning algorithms involve estimating value functions - functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The rewards that the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular ways of acting, called policies.

Formally, a policy maps states to probabilities of selecting each possible action. If the agent follows policy $\pi$ at time t, then $\pi(a|s)$ is the probability that $A_t=a$ if $S_t=s$.

Recall that the value function of a state s under policy $\pi$, $v_\pi(s)$, is the expected return when starting in state s and following $\pi$ thereafter. $$ v_\pi(s)= \mathbb{E}_\pi [G_t | S_t=s] = \mathbb{E}_\pi [\sum_{k=0}^\infty \gamma ^k R_{t+k+1} | S_t=s] \\\\ =\mathbb{E}_\pi [ R_{t+1} + \gamma G_{t+1} | S_t=s] \\\\ = \mathbb{E}_\pi [ R_{t+1} + \gamma v_\pi (S_{t+1}) | S_t=s] \\\\ = \sum _a \pi(a|s) \sum _r \sum _{s'} p(r,s'|s,a)(r+\gamma v_\pi(s')) $$

This recursive equation is the Bellman equation for $v_\pi$. It expresses a relationship between the value of a state and the values of its successor states. It states that the value of the start state must equal the discounted value of the expected next state, plus the reward expected along the way.

The same principle applies to the state-action value. $$ q_\pi(s,a) = \mathbb{E} [G_t | S_t=s, A_t=a,\pi] $$

It implies that: $$ q_\pi(s,a) = \mathbb{E}_\pi [ R_{t+1} + \gamma G_{t+1} | S_t=s, A_t=a] \\\\ = \mathbb{E}_\pi [ R_{t+1} + \gamma q_\pi (S_{t+1},A_{t+1}) | S_t=s, A_t=a] \\\\ = \sum _r \sum _{s'} p(r,s'|s,a)(r+\gamma \sum_{a'}\pi(a'|s')q_\pi(s',a')) $$

Then, the relationship between the state value and the action value can be written as:

$$ v_\pi(s) = \sum _a \pi(a|s) q_\pi (s,a) = \mathbb{E}_\pi [q_\pi (s,A_t) | S_t=s] $$

The value of a state is equal to the probability-weighted sum of the values of the actions available in that state.
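
For example, with made-up numbers: if in some state s the policy selects two actions with probabilities $\pi(a_1|s)=0.7$ and $\pi(a_2|s)=0.3$, and their action values are $q_\pi(s,a_1)=2$ and $q_\pi(s,a_2)=-1$, then $$ v_\pi(s) = 0.7 \cdot 2 + 0.3 \cdot (-1) = 1.1 $$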

If we represent the Bellman equation in matrix form, we can see that it is a linear system that can be solved directly. $$ v= r+ \gamma P^\pi v \\\\ v = (I-\gamma P^ \pi)^{-1}r $$

So, if we know the transition matrix, the discount factor and the reward vector, we can compute the value function directly as the solution of the Bellman equation. However, this is impractical in reality for huge systems with millions of states.
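
A minimal numpy sketch of the closed-form solution, assuming we already have the policy-conditioned transition matrix $P^\pi$ and expected-reward vector $r$ (the numbers below are invented for illustration):

```python
import numpy as np

gamma = 0.9
# Policy-conditioned dynamics for a toy 2-state problem (invented numbers):
# P_pi[i, j] = p(s_j | s_i) under pi, r_pi[i] = expected immediate reward in s_i.
P_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])
r_pi = np.array([1.0, 0.0])

# v = (I - gamma * P^pi)^(-1) r; solving the linear system avoids an explicit inverse.
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v)
```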

Iterative methods such as Dynamic Programming, Monte Carlo evaluation and Temporal Difference learning give us alternative ways to solve the Bellman equation.

Optimal Value Function

Solving a reinforcement learning task means finding a policy that achieves a lot of reward over the long run, i.e., an optimal policy.

The optimal state-value function $v_*(s)$ is the maximum state-value function over all policies: $$ v_*(s) = \max _ \pi v_\pi(s) $$

The optimal action value function $q_*(s,a)$ is the maximum action-value function over all policies.

$$ q_*(s,a) = \max _\pi q_\pi(s,a) $$

These optimal values tell us the best possible performance we can achieve in a problem. An MDP can be considered solved once we have found these values.

Estimating the value of a given policy, using either state values or action values, is called policy evaluation or prediction, because we are making a prediction about what happens when we follow that policy.

Estimating the optimal values is called control, because they can be used for policy optimization.

Once we have $v_*$, it is relatively easy to determine an optimal policy. For each state s, there will be one or more actions at which the maximum is attained in the Bellman optimality equation. Any policy that assigns nonzero probability only to these actions is an optimal policy. This amounts to a one-step search: the actions that appear best after a one-step search are the optimal actions. The beauty of $v_*$ is that if one uses it to evaluate the short-term consequences of actions, then a greedy policy is actually optimal in the long-term sense, because $v_*$ already takes into account the rewards of all possible future behaviour.

Having $q_*$ makes choosing optimal actions even easier, because the agent does not have to do a one-step-ahead search: for any state s, it can simply pick any action that maximizes $q_*(s,a)$. Hence, at the cost of representing a function of state-action pairs instead of just states, the optimal action-value function allows optimal actions to be selected without having to know anything about possible successor states and their values (the environment's dynamics).

There are equivalences between state and action values: $$ v_\pi(s) = \sum _a \pi(a|s) q_\pi(s,a) $$

$$ v_*(s) = \max _a q_*(s,a) $$
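
As a small sketch of acting greedily with respect to an action-value function (the table of q-values and the state/action names below are hypothetical):

```python
# Hypothetical optimal action values q_*(s, a), for illustration only.
q_star = {
    "s0": {"stay": 0.5, "go": 1.2},
    "s1": {"stay": 0.9, "go": 0.3},
}

def greedy_action(q, s):
    """pi_*(s) = argmax_a q_*(s, a): no one-step lookahead or model needed."""
    return max(q[s], key=q[s].get)

def state_value_from_q(q, s):
    """v_*(s) = max_a q_*(s, a)."""
    return max(q[s].values())

print(greedy_action(q_star, "s0"))       # 'go'
print(state_value_from_q(q_star, "s0"))  # 1.2
```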

The Bellman optimality equation is non-linear (it contains the max operator), so we cannot use the same matrix solution as for policy evaluation; instead we use iterative methods. If we use a model, this is called Dynamic Programming and leads to the optimal policy. If it is based on samples, we can use Monte Carlo, Q-learning or SARSA.

Dynamic Programming

By definition, Dynamic Programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP.

Policy evaluation (Prediction)

Policy evaluation refers to computing the state-value function $v_\pi$ for a given policy $\pi$. $$ v_\pi(s) = \mathbb{E} [R_{t+1} + \gamma v_ \pi(S_{t+1}) | S_t=s, \pi] $$

The basic idea is to initialize the first guess to zero for all states and then iterate this equation. The values are subscripted by k to indicate the estimate at iteration k. Each assignment is a one-step-ahead expectation that relies on the true model, and we bootstrap on the value at iteration k at the next state to obtain the new values at iteration k+1.

Here, bootstrapping means that we use the estimate at iteration k to compute the estimate at iteration k+1.

$$ v_{k+1} (s) = \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) | S_t=s, \pi] $$

Whenever $v_{k+1}(s) = v_k(s)$ for all states s, we must have found $v_\pi$.
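
A minimal sketch of iterative policy evaluation, assuming the dynamics are stored as a dictionary mapping $(s,a)$ to (next state, reward, probability) triples and the policy is a table of action probabilities; all names and numbers below are invented for illustration:

```python
def policy_evaluation(states, actions, p, pi, gamma, tol=1e-8):
    """Iterate v_{k+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) (r + gamma*v_k(s'))."""
    v = {s: 0.0 for s in states}          # initial guess: zero everywhere
    while True:
        delta, v_new = 0.0, {}
        for s in states:
            total = 0.0
            for a in actions:
                for s_next, r, prob in p[(s, a)]:
                    total += pi[s][a] * prob * (r + gamma * v[s_next])
            v_new[s] = total
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < tol:                   # values stopped changing: v ~= v_pi
            return v

# Toy dynamics and a uniform random policy (made up for illustration).
states, actions, gamma = ["s0", "s1"], ["stay", "go"], 0.9
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 1.0, 1.0)],
}
pi = {s: {a: 0.5 for a in actions} for s in states}
print(policy_evaluation(states, actions, p, pi, gamma))
```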

Policy improvement

By doing policy evaluation, we can evaluate the state values under the policy $\pi$; this tells us how good it is to follow the current policy. However, we do not know whether it would be better or worse to switch to a different policy, because another policy might have a better value function.

Let's consider the action value: $$ q_\pi(s,a) = \mathbb{E} [R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t=s, A_t=a] \\\\ = \sum _{s',r} p(s',r|s,a)[r+\gamma v_\pi(s')] $$

By picking the greedy action according to the action values, we can check whether it is better to change the policy. $$ \pi_{new}(s) = \arg\max_{a} q_\pi(s,a) \\\\ = \arg\max _a \mathbb{E} [R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t=s, A_t=a] $$

The decision to adopt a new policy depends on the greedy action values under the current policy. In principle, we can repeat this process: change the policy to a better one, re-evaluate its value function, perform policy improvement again, and so on.

By doing so, we continually evaluate the value function and improve the policy until the value of the new policy equals that of the old one, at which point the policy is optimal.

This process is called Policy Iteration. Policy Iteration comprises two alternating processes: policy evaluation and policy improvement.
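
A sketch of policy iteration under the same assumed dictionary representation of the dynamics (toy states, actions and rewards invented for illustration), alternating evaluation of the current deterministic policy with greedy improvement until the policy stops changing:

```python
def evaluate(states, p, pi, gamma, tol=1e-8):
    """Iterative policy evaluation for a deterministic policy pi[s] -> action."""
    v = {s: 0.0 for s in states}
    while True:
        delta, v_new = 0.0, {}
        for s in states:
            a = pi[s]
            v_new[s] = sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[(s, a)])
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < tol:
            return v

def q_from_v(p, v, gamma, s, a):
    """q_pi(s, a) = sum_{s', r} p(s', r | s, a) (r + gamma * v_pi(s'))."""
    return sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[(s, a)])

def policy_iteration(states, actions, p, gamma):
    pi = {s: actions[0] for s in states}          # arbitrary initial policy
    while True:
        v = evaluate(states, p, pi, gamma)        # policy evaluation
        new_pi = {s: max(actions, key=lambda a: q_from_v(p, v, gamma, s, a))
                  for s in states}                # greedy policy improvement
        if new_pi == pi:                          # policy is stable: it is optimal
            return pi, v
        pi = new_pi

# Toy dynamics (invented numbers).
states, actions, gamma = ["s0", "s1"], ["stay", "go"], 0.9
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 1.0, 1.0)],
}
print(policy_iteration(states, actions, p, gamma))
```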

Value Iteration

One drawback of policy iteration is that its policy evaluation step requires multiple sweeps through the state set, since we need to compute the state values of the policy under consideration. Essentially, there is an inner loop that computes the value function of the current policy and an outer loop for policy improvement.

In principle, we can truncate the policy evaluation step in several ways without losing the convergence guarantees of policy iteration. One special case is when policy evaluation is stopped after just one sweep. This algorithm is called value iteration.

We can turn the Bellman optimality equation into an update. $$ v_{k+1}(s) \leftarrow \max _a \mathbb{E} [R_{t+1} + \gamma v_k(S_{t+1}) | S_t=s, A_t=a] $$

This is equivalent to policy iteration, with k=1 step of policy evaluation between each two policy improvement steps.
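
A sketch of value iteration under the same assumed representation (toy dynamics invented for illustration), turning the Bellman optimality backup into a repeated sweep and then reading off a greedy policy from the converged values:

```python
def value_iteration(states, actions, p, gamma, tol=1e-8):
    """v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) (r + gamma * v_k(s'))."""
    v = {s: 0.0 for s in states}
    while True:
        delta, v_new = 0.0, {}
        for s in states:
            v_new[s] = max(
                sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < tol:
            # Recover a greedy (optimal) policy from the converged values.
            pi = {s: max(actions, key=lambda a: sum(
                      prob * (r + gamma * v[s2]) for s2, r, prob in p[(s, a)]))
                  for s in states}
            return v, pi

# Toy dynamics (invented numbers).
states, actions, gamma = ["s0", "s1"], ["stay", "go"], 0.9
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 1.0, 1.0)],
}
print(value_iteration(states, actions, p, gamma))
```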

Bootstrapping in Dynamic Programming

Dynamic Programming improves the estimate of the value of a state using estimates of the values of the next states. This is sometimes called learning a guess from a guess, and it is a core idea in Reinforcement Learning (bootstrapping).