Deep Reinforcement Learning
Fifth in a series on understanding Reinforcement Learning.
In some sense, the policy, the value function, and the model can all be viewed as functions, and we want to learn these functions from experience.
When we use neural networks to represent these functions, this is often called deep reinforcement learning.
So far, we have talked about the tabular case, where we use a lookup table with an entry for every state $s$ (for $v(s)$) or every state-action pair $(s,a)$ (for $q(s,a)$). However, problems arise when there are too many states and/or actions to fit into memory. Besides, it is too slow to learn the value of each state individually. Additionally, individual environment states are often not fully observable.
The solution to all of these problems is to estimate the value function with function approximation: $$ v_w(s) \approx v_\pi(s) \quad \quad \quad \text{or $v_*(s)$} \\ q_w(s,a) \approx q_\pi(s,a) \quad \quad \quad \text{or $q_*(s,a)$} $$
Now the value function is parameterized with a parameter vector $w$, and we update $w$ using a Monte Carlo or TD learning algorithm. If we choose our function class well, we will be able to generalize to unseen states.
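Concretely, with step size $\alpha$, the standard (semi-)gradient forms of these updates are (stated here for reference, following the usual textbook treatment rather than taken verbatim from earlier posts): $$ \Delta w = \alpha \, (G_t - v_w(S_t)) \, \nabla_w v_w(S_t) \quad \quad \text{(Monte Carlo)} \\ \Delta w = \alpha \, (R_{t+1} + \gamma \, v_w(S_{t+1}) - v_w(S_t)) \, \nabla_w v_w(S_t) \quad \quad \text{(TD(0))} $$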
Because environment states are often not fully observable, we rely on the agent state instead, and we use an agent state update function, which is a parametric function of its inputs (the previous agent state, the action, and the observation).
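In symbols (the function name $u$ and parameter symbol $\omega$ are illustrative notation, not fixed terminology from this series): $$ s_t = u_\omega(s_{t-1}, a_{t-1}, o_t) $$ where $s_t$ is the agent state, $a_{t-1}$ the previous action, $o_t$ the current observation, and $u_\omega$ a parametric (possibly learned) update function.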
Function classes
We have several classes of function approximation:
- Tabular: a lookup table with an entry for each MDP state.
- State aggregation: entries that merge the values of several states (grouping the state space into a small set of clusters).
- Linear function approximation: we consider a fixed agent state update function and a fixed feature map $x(s)$ on top of it. The feature map itself has no learnable parameters; instead, we learn the parameters of a value function that is linear in those features: $v_w(s) = w^T x(s)$. State aggregation and the tabular case are then special cases of linear function approximation (see the sketch after this list).
- Differentiable function approximation: the value function $v_w(s)$ is a differentiable function of the parameters $w$, which could be non-linear (e.g., a neural network).
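As a minimal sketch of the linear case (the feature maps below are hypothetical choices for illustration, not part of the original series), one-hot features recover the tabular case, and one-hot-over-groups features recover state aggregation:

```python
import numpy as np

def tabular_features(s, num_states):
    """One-hot feature vector: one entry per state (the lookup-table case)."""
    x = np.zeros(num_states)
    x[s] = 1.0
    return x

def aggregated_features(s, num_groups, group_of):
    """One-hot over state groups: states in the same group share a value."""
    x = np.zeros(num_groups)
    x[group_of(s)] = 1.0
    return x

def linear_value(w, x):
    """Linear value function v_w(s) = w^T x(s)."""
    return w @ x

# Example: 10 states aggregated into 2 groups (first half vs. second half).
num_states = 10
w_tabular = np.zeros(num_states)   # one learnable weight per state
w_grouped = np.zeros(2)            # one learnable weight per group
group_of = lambda s: 0 if s < 5 else 1

s = 7
v_tab = linear_value(w_tabular, tabular_features(s, num_states))
v_agg = linear_value(w_grouped, aggregated_features(s, 2, group_of))
```

With these features, learning the weight vector $w$ amounts to learning the table (or group) values directly, which is why both are special cases of linear function approximation.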
In principle, any function approximator can be used, but RL has several special properties:
- Experience is not i.i.d.: successive time steps are correlated.
- The agent's policy affects the data it receives, which means we can actively sample data in ways that are useful for learning our function.
- Regression targets can be non-stationary: the policy changes as we learn (which changes both the target and the data distribution), and bootstrapping means the target itself depends on our current estimates (see the sketch below).
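To make the last two points concrete, here is a minimal sketch of a single semi-gradient TD(0) step with the linear features from the earlier sketch (function names and step sizes are illustrative assumptions): the bootstrapped target is recomputed from the current weights, so it is a moving regression target, and the transitions fed into it come from whatever policy the agent is currently following.

```python
import numpy as np

def semi_gradient_td0(w, x_s, reward, x_next, gamma=0.99, alpha=0.1, done=False):
    """One semi-gradient TD(0) update for a linear value function v_w(s) = w^T x(s).

    The target reward + gamma * v_w(s') is computed from the *current* weights
    (bootstrapping), so the regression target shifts as w changes.
    """
    v_s = w @ x_s
    v_next = 0.0 if done else w @ x_next
    td_error = reward + gamma * v_next - v_s
    # Semi-gradient: we differentiate only v_w(s), not the bootstrapped target.
    return w + alpha * td_error * x_s
```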