MDP and reinforcement learning

In this article, I continue my exploration of MDPs (cf. my previous article) and learn how to use reinforcement learning to obtain the optimal policy.

We've seen that if we know an MDP's \(A\), \(S\), \(P\) and \(R\), we can easily calculate its optimal policy. However, in practice we may not have all this information. In particular, we may not have \(P\), which defines how we transition from state to state given actions. Instead we may have a model of the environment, i.e. a separate algorithm which somehow calculates the resulting state given a current state and an action. By using this model and reinforcement learning techniques it is also possible to calculate the optimal policy.

The Monte Carlo method (MC) consists of sampling trajectories (sequences of states from an initial one to a final one) and estimating the value of each state by averaging over these trajectories. The sampling step can actually be replaced by a dataset of trajectories, in which case MC does not even need a model of the environment (which is why it is called a "model-free" method). But if we are sampling ourselves, we need to choose an action at each step of the trajectory. In MC they are chosen purely at random. As such, MC does not learn a policy (a choice of action for each state); it learns the value of each state.

The core equation of MC is: \(V(s_t)=(1-\alpha)V(s_t)+\alpha g_t\) where \(V(s_t)\) is the value of the state visited at step \(t\) of the trajectory, \(\alpha\) is a learning rate (in ]0,1], but usually a very small value), \(g_t=\sum_{i=1}^N\gamma^{i-1}r_{t+i}\) is the sum of discounted rewards, \(N\) is the number of steps remaining in the trajectory after step \(t\), \(\gamma\) is the discounting coefficient (in ]0,1]), and \(r_t\) is the reward at step \(t\). For each trajectory, state values are updated using this equation, usually backward to factorise the calculation of \(g_t\).
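To make the backward update concrete, here is a minimal Python sketch, assuming states are hashable, \(V\) is a plain dict, and a trajectory is given as a list of (state, reward) pairs where the reward is the one received after leaving that state (these conventions are mine, not from any particular library):

```python
def mc_update_values(V, trajectory, alpha=0.01, gamma=0.99):
    # trajectory: list of (state, reward) pairs, the reward being the one
    # received after leaving that state (r_{t+1} for the state at step t).
    # Walking backward lets us accumulate g_t as r_{t+1} + gamma * g_{t+1}
    # instead of re-summing the discounted rewards at every step.
    g = 0.0
    for state, reward in reversed(trajectory):
        g = reward + gamma * g
        V[state] = (1 - alpha) * V.get(state, 0.0) + alpha * g
    return V
```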

Now, if you remember how the optimal policy is calculated in my previous article, you should understand that the equation above solves the problem of not knowing the MDP's \(P\) for the value update. Wouldn't it be nice if we could modify the action update the same way? Well indeed we can, and that's really easy: we just reuse the same equation for action values instead: \(Q(s_t,a_t)=(1-\alpha)Q(s_t,a_t)+\alpha g_t\). Then we can estimate the optimal policy with \(\pi(s)=\arg\max_a\left\lbrace Q(s,a)\right\rbrace\). Nice!
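The sketch above carries over to action values almost unchanged, assuming \(Q\) is stored as a dict keyed by (state, action) pairs and the list of available actions is known (again, my own conventions):

```python
def mc_update_action_values(Q, trajectory, alpha=0.01, gamma=0.99):
    # trajectory: list of (state, action, reward) triples, the reward being
    # the one received after taking that action in that state.
    g = 0.0
    for state, action, reward in reversed(trajectory):
        g = reward + gamma * g
        Q[(state, action)] = (1 - alpha) * Q.get((state, action), 0.0) + alpha * g
    return Q

def greedy_action(Q, state, actions):
    # pi(s) = argmax_a Q(s, a) over the available actions
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```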

Here, one might think: instead of using random actions to sample trajectories, wouldn't it be better to use \(\pi\) (updated after each trajectory)? Unfortunately it doesn't work. As soon as the policy has been updated to something better than the initial arbitrary one, it will never make any more progress because there is no incentive to explore other actions. This is known as the "exploration-exploitation trade-off" problem. If we explore (try random actions) too much, we lose time trying actions which have no chance of improving our policy. If we exploit (stick to the current best actions) too much, we never take the chance to find a better policy.

A way to solve that problem is to use an \(\epsilon\)-greedy version of the current best policy to sample new trajectories. That is, we use the current best policy (updated after each trajectory), but with a given probability \(\epsilon\) (the exploration coefficient) we choose a random action instead of the current best one. That coefficient lets us control the exploration-exploitation trade-off.
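In Python this is a short function, reusing the same dict-based \(Q\) as above (a sketch under my conventions, not a library call):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    # With probability epsilon explore with a random action, otherwise
    # exploit the current best action according to Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```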

Another approach consists of using a separate policy (defined by the user) to choose actions for exploration, and exploiting the result to update the estimate of the optimal policy. Using the estimate of the optimal policy to explore is called an "on-policy" method, while using another policy is called an "off-policy" method. This however requires modifying a bit the calculation of \(g\) to compensate for the fact that the exploration and exploitation policies have different distributions. This is called importance sampling. The equation becomes: \(g_t=\rho\sum_{i=1}^N\gamma^{i-1}r_{t+i}\) where \(\rho=\prod_{i=1}^N\pi(a_{t+i}|s_{t+i})/b(a_{t+i}|s_{t+i})\), \(b(a|s)\) is the probability of choosing action \(a\) when in state \(s\) with the exploration policy \(b\), and \(\pi(a|s)\) is the probability of choosing action \(a\) when in state \(s\) with the estimate of the optimal policy. But wait, we don't know \(\pi(a|s)\). Yep, that's a complete mystery to me...
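For what it's worth, here is a minimal sketch of the weighted return, assuming the two policies are provided as probability functions pi_prob(s, a) and b_prob(s, a) (my own naming). One common resolution of the question above, not taken from this article, is to take \(\pi\) as the greedy policy over the current \(Q\), so that \(\pi(a|s)\) is 1 for the greedy action and 0 otherwise:

```python
def importance_sampled_return(trajectory, pi_prob, b_prob, gamma=0.99):
    # trajectory: list of (state, action, reward) triples, starting at the
    # step whose return we want to estimate.
    # pi_prob(s, a): probability of a in s under the target policy pi.
    # b_prob(s, a): probability of a in s under the exploration policy b.
    rho = 1.0
    g = 0.0
    for i, (state, action, reward) in enumerate(trajectory):
        rho *= pi_prob(state, action) / b_prob(state, action)
        g += (gamma ** i) * reward
    return rho * g
```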

In the equation of the sum of rewards we are summing over the whole remainder of the trajectory from a given state. If we sum over only a few steps and bootstrap on the current value estimate of the state we reach, it becomes the method called Temporal Difference learning (TD). We then use: \(g_t=\sum_{i=1}^D\gamma^{i-1}r_{t+i}+\gamma^DV(s_{t+D})\), where \(D\) is the number of steps used. TD(1) would use only the next reward (plus the bootstrap term). TD(\(\infty\)) uses the whole remaining trajectory, hence is identical to MC. For proper choices of \(\alpha\) and \(D\), TD actually performs better than MC. Which \(\alpha\) and \(D\)? Well, as usual no one knows, it's different for each problem and you'll have to find them by yourself!
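Here is a sketch of this \(D\)-step update for state values, reusing the (state, reward) trajectory convention from the MC sketch above (my own convention):

```python
def td_update_value(V, trajectory, t, D=1, alpha=0.1, gamma=0.99):
    # trajectory: list of (state, reward) pairs as before; t is the index of
    # the state being updated, D the number of real rewards used.
    g = 0.0
    last = min(t + D, len(trajectory))
    for i in range(t, last):
        g += (gamma ** (i - t)) * trajectory[i][1]
    if t + D < len(trajectory):
        # bootstrap on the current estimate of the state reached after D steps
        g += (gamma ** D) * V.get(trajectory[t + D][0], 0.0)
    s_t = trajectory[t][0]
    V[s_t] = (1 - alpha) * V.get(s_t, 0.0) + alpha * g
    return V
```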

So, TD is like MC: it's a method to estimate state values. And as with MC, it's straightforward to apply the same idea to action values instead of state values to obtain an estimate of the optimal policy. This is called the SARSA method and its core equation is \(g_t=\sum_{i=1}^D(\gamma^{i-1}r_{t+i})+\gamma^DQ(s_{t+D},a_{t+D})\). \(Q\) values are updated exactly the same way as before: \(Q(s_t,a_t)=(1-\alpha)Q(s_t,a_t)+\alpha g_t\).
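Restricted to \(D=1\) for brevity, the SARSA update could look like this (same dict-based \(Q\) as above):

```python
def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.99):
    # One-step SARSA (D=1): bootstrap on the action actually taken in s_next.
    g = reward + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * g
    return Q
```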

In case you're not yet tired of all these variations, here is a final one: Q-Learning. This version has a small difference compared to SARSA: the gain is calculated as follows: \(g_t=\sum_{i=1}^D(\gamma^{i-1}r_{t+i})+\gamma^D\max_aQ(s_{t+D},a)\). Note the usage of the best action value for the last component, instead of the action value for the actual action in the trajectory. This modification provides a potential improvement over the other versions, but it depends on the problem and the values of the parameters you're using (\(\gamma\), \(\epsilon\), \(\alpha\), \(D\)). In the end, it's a matter of experimenting and finding what works best for you.
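The only change with respect to the SARSA sketch above is the bootstrap term (again a \(D=1\) sketch under the same conventions):

```python
def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.99):
    # One-step Q-Learning (D=1): bootstrap on the best action available in
    # s_next, regardless of which action the exploration policy takes there.
    g = reward + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * g
    return Q
```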

To end this article, here comes my pseudocode for Q-Learning (limited to \(D=1\), but handling both stochastic and deterministic transitions, and optional \(P\) and \(Q\) prior estimates).
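As a simplified illustration of what such a training loop looks like (not the full pseudocode: fixed \(\epsilon\), no \(P\) or \(Q\) priors, and a hypothetical env_reset()/env_step(state, action) pair standing in for whatever model of the environment is available), here is a minimal tabular version in Python:

```python
import random

def train_q_learning(states, actions, env_reset, env_step, n_episodes=10000,
                     alpha=0.1, gamma=0.99, epsilon=0.1):
    # Tabular one-step Q-Learning with an epsilon-greedy exploration policy.
    # env_reset() returns an initial state; env_step(s, a) returns
    # (next_state, reward, done). Both are hypothetical placeholders.
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_episodes):
        s = env_reset()
        done = False
        while not done:
            # epsilon-greedy choice of the action
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, reward, done = env_step(s, a)
            # Q-Learning target: bootstrap on the best action of the next state
            target = reward
            if not done:
                target += gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    # greedy policy extracted from the learnt action values
    policy = {s: max(actions, key=lambda act: Q[(s, act)]) for s in states}
    return Q, policy
```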

Implementing the pseudocode above, I could reproduce the results on the "frozen lake" example of this article, and the "blackjack" example in this video.

2025-01-21
Copyright 2021-2025 Baillehache Pascal