Preliminaries

A policy defines the agent's behaviours
- Deterministic policy A = $\pi(S)$
- Stochastic policy: $\pi(A|S) = p(A|S)$
The actual value function is the expected return $$ v_\pi(s) = \mathbb{E} [G_t | S_t = s, \pi] = \mathbb{E} [R_{t+1} + \gamma R_{t+2 + ...} | S_t = s, \pi] \\\\ =\mathbb{E} [R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t ~= \pi(s)] $$ where $\gamma$ is a discount factor
Optimal value is the highest possible value for any policy

$$ v_*(s) = \max_{a} \mathbb{E} [R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t=a] $$

The true action value for action a is the expected reward $$ q(a) = \mathbb{E}\left\{R_t | A_t=a \right\} $$

A simple estimate of the action value is the average of the sampled rewards: $$ Q_t(a)=\frac{\sum_{n}^{}R_n\mathbb{I}(A_n=a)}{\sum_{n}^{}\mathbb{I}(A_n=a)} $$

where $R_n$ is the reward at time n.

The update of the action values at tume step n+1 $$ Q_{n+1} = Q_n + \frac{1}{n} (R_n - Q_n) $$

where $\alpha=\frac{1}{n}$ is a step size.

The optimal value is $$ v_* = \max_{a \in A} q(a) $$
Regret is the opportunity loss for one step

$$ v_* - q(A_t) $$

Action Regret $\Delta_a$ for a fiven action is the difference between optimal value and true value of a $$ \Delta_a = v_* -q(a) $$

The trade off between exploration and exploitation will be done by minimizing the total regret

$$ L_t = \sum_{i=1}^t (v_* - q(a_i)) \\\\ = \sum _{a \in \mathcal{A}} N_t(a)(v_* - q(a_i)) = \sum _{a \in \mathcal{A}} N_t(a) \Delta _a $$

Categorizing Agents
- Value Based (Value Function)
- Policy Based (Policy)
- Actor Critic (Policy and Value Function)
Prediction and Control
- Prediction: evaluate the future for a given policy
- Control: optimize the future (find the best policy) $$ \pi_*(s) = \argmax_{\pi} v_\pi (s) $$

Introduction

The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the action taken rather than instructs by giving correct actions. It creates the need for active exploration, for an explicit search of good behaviour.

Evaluative feedback indicates how good the action taken was, but not whether it was the best of worst action possible. On the other hand, instructive feedback indicates the correct action to take which is the basis of supervised learning.

The well known trade-off between exploitation and exploration is essentially the compromise between maximizing performance (exploitation) and increasing the knowledge (exploration). It is the typical problem in online decision making because we are actively collecting our information to make the best overall decisions.

Multi-Armed Bandit

Consider the following learning problem: we are amongs the choice of k different options, after each choice, we receiver a numerical reward which are sampled from a stationary probability distribution that depends on what we selected. The objective is to maximize the expected reward $\sum _i R_i$ over some time period. This is the original form of the k-armed bandit problem.

We do the maximization objective by learning a policy: a distribution on $\mathcal{A}$

When we look into the total regrets, we see the differences between the optimal value and the actual action value which are accumulated as time evolves. The objective is to minimise that regrets because the faster it grows, the worst it is.

In principle, the regret can grow unbounded, so the interesting part is to study how fast it grows. For example, greedy policy has linear regret as it grows in the number of step we have taken.

$\epsilon$-greedy algorithm

To explore the new information, the most basic strategy is to apply the common solution $\epsilon$-greedy where a randomness is added to the policy for picking a random action uniformly across all of the actions on each time step with a certain probability. It is considered as the most common exploration strategy in current Reinforcement Learning. By doing this, we can avoid the sticking at the sub optimal action forever as the greedy policy does and it continues to explore forever.

$$ \pi_t(a)=\left\{\begin{matrix} (1-\varepsilon ) + \frac{\varepsilon }{\mathcal{A}} & \text{if $Q_t(a)=\max_{b}Q_t(b)$} \\ \frac{\varepsilon }{\mathcal{A}}& \text{otherwise} \\ \end{matrix}\right. $$

Typically, the performance of $\epsilon$-greedy is better than greedy because we do learn event and we select the optimal action with probability 1-$\epsilon$ which is not guaranteed in greedy case. However, the regret still grows linearly so the question is whether ncessary to explore something that are already bad?

Lower Bound

A decade ago, there was an investigation about the exploration problem to see what is the best possible thing we could hope to get. It is related to the similarities between the optimal action and all the others actions (similar distribution but different means).

Lai and Robbins Theorem: Asymtotic total regret is at least logarithmic in number of steps $$ \lim_{t \to \infty} L_t \geq \log t \sum ... $$

It shows out that the total expected regrets that we will incur for any algorithm will grow on the logarithm of time. So, it means that the regrets grows unbounded as expected.

However, the grow speed of $\epsilon$-greedy is much slower than the greedy.

In the optimism in the face of uncertainty, practically, if we are more uncertain about the value of an action, maybe it is more important to explore that action but at first we need to be more greedy to try the action with the highest mean.

Upper Confidence Bound

Exploration is needed because there is always uncertainty about the accuracy of the action-value estimates. So it will be better to select amongs the non-greedy actions according to their potential for actually being optimal , taking into account how close their estimates are maximal and the uncertainties in those estimates.

One effective method to do so is upper confidence bound $U_t(a)$ for each action value $q(a)$. We"ll do such as the true value with a high probability is smaller than the current estimate plus the bonus $q(a)\leq Q_t(a)+U_t(a)$ .

It signifies that although we do not knwo the true value is but we are pretty certain that it is less than something. We will then design an algorithm that just greedily selects among the actions whose estimates are added a bonus. Normally the bonus will be high if the uncertainties are high.

$$ A_t = \argmax_{a} [Q_t(a) + c\sqrt{\frac{ln t}{N_t(a)}}] $$

Where $N_t(a)$ represents the number of times action $a$ has been selected prior to time t. Then, the uncertainty about the true value will typically decrease as the square root of the number of times you selected an action.

Recall that we want to minimize $\sum_a N_t(a)\Delta_a$, so if the gap $\Delta_a$ is big, we want $N_t(a)$ to be small and vice versa. As $N_t(a)$ grows over the time, so we hope to often select the actions with low gap than the actions with high gap.

Hoeffding's Inequality: Let $X_1, .., X_n$ be i.i.d random variables in [0,1], and let $\overline{X_t}$ be the mean. Then $$ p(\mathbb{E}[X] \geq \overline{X_n}+u) \leq \exp{(-2nu^2)} $$

It means that if we selected a number of points and averaged these, it is an added bonus that we can bound the probability that it is still an underestimate. The bigger u us, the smaller this probability will be.

Now, let's say that we want to pick certain probability $p$ that the true value exceeds the upper confidence bound.

$$ \exp{(-2N_t(a)U_t(a)^2)} = p $$

Then,

$$ U_t(a) = \sqrt{(\frac{-\log p}{2N_t(a)})} = \sqrt{(\frac{\log t}{2N_t(a)})} $$

Then, it leads to our UCB algorithm:

$$ a_t = \argmax_{a \in \mathcal{A}} Q_t(a) + c\sqrt{\frac{\log t}{N_t(a)})} $$

Intuitively speaking, if we consider an action that we have not selected in a long long while, it means that the bonus keep growing ultil it is higher than all of the other estimate plus bonuses for all the other actions. At that point, we will select it.

Bayesian Bandits

Besides value-based algorithm, we can come up with model-based approach which predict the reward from reward model for each action. However, we are based on the model, so we can model the distribution of rewards as well and it gives us Bayesian bandit.

Bayesian bandits model parameterized distributions over rewards, $p(q(a) | \theta_t)$ which is a Bayesian probability. This is interpreted as our belief that $q(a)=x$ $ \forall x \in \mathcal{R}$

We are going to update this using probability of the reward under the model. $$ p_t(\theta |a) \propto p(R_t | \theta,a)p_{t-1}(\theta|a) $$

$\theta$ represents the parameters of our parametric model. Then, it allows us to inject rich prior knowledge $p_0(\theta|a)$, it means that we could update the parameters each time we see our reward from taken action.

Here, we are not trying to match the distribution of the reward, we are trying to learn a belief distribtion over where the true value is. Afterwards, we are going to use these distributions to do the exploration.

Bayesian Bandits with Upper Confidence Bounds

Let's imagine that we have multiple distributions for different actions. With UCB, we pick the bound which basically defines our confidence interval. Now, we could look at the distribution and define our confidence interval as might be the standand deviation of our distribution.

Probability matching selects action a according to probability that a is the optimal action. $$ \pi_t(a) = p(q(a)=\max_{b}q(b) | H_{t-1}) $$

It means that we are defining a policy and making this policy equal to the probability that the true value is the maxmimum true value under the history. So we can calculate the probability of an action is optimal one under the distribution (belief distribution) that we updated so far.

Probability matching is optimistic in the face of uncertainty because if we have a large uncertainty about where our action will be, then the probability of it being optimal also goes up.

So actions with larger probabilities are either high-valued actions or we have a lot of uncertainty about them.

However, it can be difficult to compute $\pi(a)$ analytically from posterior.

Thompson Sampling

Thompson Sampling is an algorithm to solve bandits and it is a Baysian approach and related to probability matching. The idea is to keep track of the posterior distributions so we are going to update these via Bayes's rule, and then, we sample each from belief distributions an actual action value. ($Q_t(a) \sim p_t(q(a)) $)

We have the belief distribution at time step t about where we believe the mean value for that action to be, then we are going to sample the distribution which gives us an action value and we keep doing that for each of these actions.

Then, we are going to pick the greedy action according to the sample action values. $$ A_t = \argmax_{a\in \mathcal{A}} Q_t(a) $$

It turns out that Thompson Sampling will select actions according to the same probability as probability matching would do.

$$ \pi_t(a) = \mathbb{E}[I(Q_t(a)=\max_{b}Q_t(b))] = p(q(a)=\max_{b}q(b))) $$

The event of action a is picked means that action a had the highest sampled action value, but the expectation of this instance is equal to the probability of that happening. So it turns our that if we first sample the action values and then we pick greedily, this is the same as trying to compute these whole probabilities and selecting our action according to that.

So in some sense, TS is simply a technique that allow us to go from Bayesian distributions to a policy. Interestingly, TS achieves Lai and Robbins lower bounds on regret, there for it is optimal in the same way UCB is and it has logarithmic regrets.