The Reinforce and PPO algorithms

This article is the last one in a series of three about MDPs and reinforcement learning. In the previous articles, I've studied how to calculate the optimal policy for a known Markov Decision Process, and how to use the Monte Carlo method to achieve the same goal when the MDP transition probabilities are not known. This article is about upgrading from the discrete state space of the previous articles to a continuous one.

Let's consider a new example: the mountain car problem. States in this problem encode the position and speed of the car. Actions are: push in one or the other direction, or do nothing. The transitions follow the equations describing the physics of that problem's world (they are deterministic). We then have everything we need to describe the problem as an MDP, and we would like to use the previously studied methods to get the optimal policy, i.e. which action to apply in a given state to bring the car to the top of the mountain.

The action space is still discrete (there are only 3 actions), but the state space is now continuous (position and speed are real values), hence there are infinitely many states. In the previous methods we were memorizing values, actions and other intermediate variables into arrays, one entry per state or per state/action combination. With a continuous state space we would need an array of infinite size, so we need another way to encode these values: approximation functions. They are functions which map states to values and probabilities given a finite number of parameters. These parameters are the variables we are going to calculate: we've brought the problem back to a finite space, hurrah!
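To make this concrete, here is a minimal sketch of what such an approximator could look like for the mountain car policy. The linear-plus-softmax form, the parameter shapes and the example state are illustrative choices for this sketch, not the exact approximator used later in this article.

```python
# A parameterised approximator mapping the 2D state (position, speed)
# to probabilities over the 3 actions, using only 9 parameters.
import numpy as np

def policy(theta, state):
    # theta packs a (3, 2) weight matrix and a (3,) bias
    w, b = theta[:6].reshape(3, 2), theta[6:]
    logits = w @ state + b
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()                # probability of each action

theta = np.random.uniform(-0.5, 0.5, 9)   # 9 numbers encode the whole policy
print(policy(theta, np.array([-0.5, 0.0])))
```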

What kind of approximation function to use really depends on your problem, but artificial neural networks are quite popular these days (in case you haven't noticed) and I'll stick to them in what follows. Just keep in mind that it doesn't have to be an ANN: (almost) any approximator with a finite number of parameters will do (as long as it can approximate the target functions sufficiently well, of course).

So, we're back to a finite space and we're happy, but that's still not enough to apply the algorithms described in the previous articles. Previously we were updating the values directly; how do we translate this into parameter updates? If you're familiar with ANN training you surely already know the answer: define a loss function and apply gradient descent (more accurately, stochastic gradient descent, as we are using Monte Carlo). Different choices of loss function give a family of algorithms (the policy gradient methods), which all share the same core idea as the algorithms studied so far: record a trajectory, calculate the sum of rewards along the trajectory to evaluate the states, calculate a gradient from the state evaluations, and update the values and policy according to the gradient.
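For instance, the evaluation step shared by all these methods, the sum of discounted rewards along the trajectory, can be sketched as follows (gamma is the discount factor):

```python
# Discounted return G_t for each step of a recorded trajectory,
# computed backward from the end of the trajectory.
def discounted_returns(rewards, gamma=0.999):
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```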

The first policy gradient method to be invented was the Reinforce algorithm. The loss function used in that case for the action probability approximator is \(L(\theta)=-\log(\pi_{\theta}(s,a))A\) where \(\theta\) are the approximator parameters, \(\pi_{\theta}\) is the policy given parameters \(\theta\), \(s\) is the state, \(a\) is the action, and \(A\) is the 'advantage' (the sum of discounted rewards for that state in the trajectory minus the current estimated value of the state).
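As a minimal sketch, the Reinforce loss for one (state, action) pair of the trajectory can be written as below; here prob_action stands for \(\pi_{\theta}(s,a)\), g for the discounted return, and the loss is meant to be minimised by gradient descent.

```python
import numpy as np

def reinforce_loss(prob_action, value_estimate, g):
    advantage = g - value_estimate          # the 'advantage' A as defined above
    return -np.log(prob_action) * advantage
```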

A modern variation of Reinforce is the Proximal Policy Optimisation algorithm. The loss function becomes: \(L(\theta)=-\min(r(\theta)A,\text{clip}(r(\theta),1-\epsilon,1+\epsilon)A)\) where \(r(\theta)=\frac{\pi_{\theta}(s,a)}{\pi_{\theta^{old}}(s,a)}\), \(\pi_{\theta^{old}}\) is the policy before the update step, and \(\epsilon\) is a coefficient usually set to \(0.2\). Compared to Reinforce, PPO is a more 'careful' version, avoiding being too optimistic during the update step (\(\epsilon\) controls how much optimism is allowed). It may look like this would slow down the training process, but it actually does the opposite, by avoiding moving the parameters too far in a seemingly good but actually wrong direction.
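The corresponding sketch for the PPO loss, with the same conventions as the Reinforce snippet above (prob_action_old being the probability given to the action by the policy before the update step), is:

```python
import numpy as np

def ppo_loss(prob_action, prob_action_old, value_estimate, g, eps=0.2):
    advantage = g - value_estimate
    ratio = prob_action / prob_action_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return -min(ratio * advantage, clipped * advantage)
```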

The loss function for the state value approximator is the usual squared difference between the target value and the current value: \(L(\theta)=(V_{\theta}(s)-G)^2\), where \(V_{\theta}(s)\) is the estimated state value, and \(G\) is the sum of discounted rewards for the state in the trajectory.
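A plain gradient descent step on this loss could be sketched as follows, assuming for simplicity a table-like approximator with one value per (discretised) state, so that the gradient of \(V_{\theta}\) with respect to its own entry is 1 and the loss gradient is simply \(2(V_{\theta}-G)\):

```python
def value_update(values, state_index, g, lr=0.001):
    # One gradient descent step on (V - G)^2 for the visited state
    values[state_index] -= lr * 2.0 * (values[state_index] - g)
```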

For gradient descent, one can experiment with various optimisers. To keep the pseudocode below short and limited to the essential, it uses standard gradient descent. However, for the results below I've actually used Adam for the action probability update (decay rates \(0.9\) and \(0.999\)) and standard gradient descent for the state value update. Also, if using large neural networks, backpropagation would speed up the training. Check also my previous articles about optimisers and backpropagation.
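For reference, a generic Adam step with these decay rates looks like this (a textbook version of the update, not the exact code used for the results below):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # m, v: running averages of the gradient and squared gradient
    # t: current step count (starting at 1), used for bias correction
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * grad ** 2
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```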

The pseudocode then becomes (giving both Reinforce and PPO together, as the only difference is the calculation of the loss function):
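As an illustration of that structure, here is a minimal Python sketch of the combined training loop. It assumes a generic environment exposing reset() and step() functions, a policy pi(theta, s) returning action probabilities, a value function V(w, s), flat numpy parameter vectors, and numerical gradients to keep it short; it is a sketch of the idea, not the exact implementation used for the results below.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # Finite difference gradient of f at x (x is a flat numpy array)
    grad = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = h
        grad[i] = (f(x + d) - f(x - d)) / (2.0 * h)
    return grad

def train(env, pi, V, theta, w, use_ppo=False, episodes=1000,
          gamma=0.999, lr=0.001, eps=0.2, update_steps=1):
    for _ in range(episodes):
        # 1. Record a trajectory with the current policy
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            p = pi(theta, s)
            a = np.random.choice(len(p), p=p)
            states.append(s)
            actions.append(a)
            s, r, done = env.step(a)
            rewards.append(r)

        # 2. Evaluate the states: discounted return G for each of them
        returns, g = [0.0] * len(rewards), 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            returns[t] = g

        # 3. Update the policy parameters by gradient descent on the loss
        #    (Reinforce and PPO only differ here; PPO usually performs
        #    several update steps per trajectory, which is when the
        #    probability ratio and the clipping matter)
        old_probs = [pi(theta, s)[a] for s, a in zip(states, actions)]

        def policy_loss(th):
            loss = 0.0
            for s, a, g, p_old in zip(states, actions, returns, old_probs):
                adv = g - V(w, s)
                p = pi(th, s)[a]
                if use_ppo:
                    ratio = p / p_old
                    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
                    loss += -min(ratio * adv, clipped * adv)
                else:
                    loss += -np.log(p) * adv
            return loss

        for _ in range(update_steps):
            theta = theta - lr * numerical_gradient(policy_loss, theta)

        # 4. Update the value parameters by gradient descent on the
        #    squared difference between estimated values and returns
        def value_loss(w_):
            return sum((V(w_, s) - g) ** 2 for s, g in zip(states, returns))

        w = w - lr * numerical_gradient(value_loss, w)
    return theta, w
```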

Implementing the pseudocode above, I was able to reproduce the results on the mountain car example, as seen in this video, and also on the cartpole problem.

For the mountain car example, I've used the equations given by Barto and Sutton in "Reinforcement Learning: An Introduction", and an approximator made of a discretisation of the state space into 20 intervals (just to show it works with approximators other than ANNs).
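Such a discretised approximator could be sketched as follows; the position and speed bounds are the classic mountain car ones, and the per-cell parameterisation (one value plus 3 action preferences turned into probabilities by a softmax) is an assumption of this sketch.

```python
import numpy as np

N = 20                                   # number of intervals per dimension
POS_MIN, POS_MAX = -1.2, 0.6
VEL_MIN, VEL_MAX = -0.07, 0.07

def cell_index(state):
    # Map a continuous (position, speed) state to a discrete cell
    pos, vel = state
    i = min(N - 1, int((pos - POS_MIN) / (POS_MAX - POS_MIN) * N))
    j = min(N - 1, int((vel - VEL_MIN) / (VEL_MAX - VEL_MIN) * N))
    return i, j

# One state value and 3 action preferences per cell, initialised in [-0.5, 0.5]
values = np.random.uniform(-0.5, 0.5, (N, N))
prefs = np.random.uniform(-0.5, 0.5, (N, N, 3))

def policy(state):
    logits = prefs[cell_index(state)]
    e = np.exp(logits - logits.max())    # softmax over the 3 actions
    return e / e.sum()
```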

For the cartpole example, I've used the equations given in "Cart-pole system: Equations of motion", solved using Runge-Kutta (RK4) with a time step of \(0.01\). The approximator was an ANN with two layers, 8 nodes per layer, ReLU activations, and a softmax on the output. I've used 3 actions instead of 2, the extra one being 'don't push'.
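As a sketch, such a policy network could look like the following; the 4-dimensional input (cart position, cart speed, pole angle, pole angular speed) is the usual cartpole state and is an assumption of this sketch, as is reading "two layers" as two hidden layers before the softmax output.

```python
import numpy as np

rng = np.random.default_rng()
params = {
    "W1": rng.uniform(-0.5, 0.5, (8, 4)), "b1": rng.uniform(-0.5, 0.5, 8),
    "W2": rng.uniform(-0.5, 0.5, (8, 8)), "b2": rng.uniform(-0.5, 0.5, 8),
    "W3": rng.uniform(-0.5, 0.5, (3, 8)), "b3": rng.uniform(-0.5, 0.5, 3),
}

def policy(params, state):
    h1 = np.maximum(0.0, params["W1"] @ state + params["b1"])   # ReLU layer 1
    h2 = np.maximum(0.0, params["W2"] @ h1 + params["b2"])      # ReLU layer 2
    logits = params["W3"] @ h2 + params["b3"]
    e = np.exp(logits - logits.max())                           # softmax output
    return e / e.sum()

# Example: probabilities of the 3 actions for a slightly tilted pole
print(policy(params, np.array([0.0, 0.0, 0.05, 0.0])))
```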

In both examples, the learning rate for the action probability and state value updates was \(0.001\), the discount was \(0.999\), and all the approximators' parameters were initialised randomly in \([-0.5,0.5]\). I could successfully get the optimal policy with both Reinforce and PPO in all cases, and PPO was indeed faster to converge in terms of number of training episodes. The videos below are for (mountain car, PPO) and (cartpole, Reinforce).

And this ends my study of reinforcement learning techniques, for now at least, because there are of course a huge number of variations on the algorithms I've studied here and so many other ones to learn. If you want to learn more, "Reinforcement Learning: An Introduction" by Barto and Sutton really seems to be a must. I hope I will one day have the occasion to go through the whole paper edition. So many books, so little time!

2025-03-04
Copyright 2021-2025 Baillehache Pascal