The Reinforce and PPO algorithms

This article is the last one in a series of three about MDPs and reinforcement learning. In the previous articles, I've studied how to calculate the optimal policy for a known Markov Decision Process, and how to use the Monte Carlo method to achieve the same goal when the MDP transition probabilities are not known. This article is about upgrading from the discrete state space of the previous articles to a continuous one.

Let's consider a new example: the mountain car problem. States in this problem encode the position and speed of the car. Actions are: push in one or the other direction, or do nothing. The transitions follow the equations describing the physics of that problem's world (they are deterministic). We then have everything we need to describe the problem as an MDP, and we would like to use the previously studied methods to get the optimal policy, i.e. which action to apply in a given state to bring the car to the top of the mountain.

The action space is still discrete (there are only 3 actions), but the state space is now continuous (position and speed are real values), hence there are infinitely many states. In the previous methods we were memorizing values, actions and other intermediate variables into arrays, one entry per state or per state/action combination. With a continuous state space we would need an array of infinite size, so we need another way to encode these values: approximation functions. They are functions which map states to values and probabilities given a finite number of parameters. These parameters are the variables we are going to calculate: we've brought the problem back to a finite space, hurrah!
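To make this concrete, here is a minimal sketch of what such an approximator could look like for the mountain car policy. The linear-plus-softmax form, the parameter shapes and the example state are illustrative choices for this sketch, not the exact approximator used later in this article.

```python
# A parameterised approximator mapping the 2D state (position, speed)
# to probabilities over the 3 actions, using only 9 parameters.
import numpy as np

def policy(theta, state):
    # theta packs a (3, 2) weight matrix and a (3,) bias
    w, b = theta[:6].reshape(3, 2), theta[6:]
    logits = w @ state + b
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()                # probability of each action

theta = np.random.uniform(-0.5, 0.5, 9)   # 9 numbers encode the whole policy
print(policy(theta, np.array([-0.5, 0.0])))
```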

What kind of approximation function to use really depends on your problem, but artificial neural networks are quite popular these days (in case you haven't noticed) and I'll stick to them in what follows. Just keep in mind that it doesn't have to be an ANN: (almost) any approximator with a finite number of parameters will do (as long as it can approximate the target functions sufficiently well, of course).

So, we're back to a finite space and we're happy, but that's still not enough to apply the algorithms described in the previous articles. Previously we were updating the values directly; how do we translate this into parameter updates? If you're familiar with ANN training you surely already know the answer: define a loss function and apply gradient descent (more accurately, stochastic gradient descent, as we are using Monte Carlo). Different choices of loss function give a family of algorithms (the policy gradient methods), which all share the same core idea as the algorithms studied so far: record a trajectory, calculate the sum of rewards along the trajectory to evaluate the states, calculate a gradient from the state evaluations, and update the values and policy according to the gradient.
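For instance, the evaluation step shared by all these methods, the sum of discounted rewards along the trajectory, can be sketched as follows (gamma is the discount factor):

```python
# Discounted return G_t for each step of a recorded trajectory,
# computed backward from the end of the trajectory.
def discounted_returns(rewards, gamma=0.999):
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```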

The first policy gradient method to be invented was the Reinforce algorithm. The loss function used in that case for the action probability approximator is \(L(\theta)=-\log(\pi_{\theta}(s,a))A\) where \(\theta\) are the approximator parameters, \(\pi_{\theta}\) is the policy given parameters \(\theta\), \(s\) is the state, \(a\) is the action, and \(A\) is the 'advantage' (the sum of discounted rewards for that state in the trajectory minus the current estimated value of the state).
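As a minimal sketch, the Reinforce loss for one (state, action) pair of the trajectory can be written as below; here prob_action stands for \(\pi_{\theta}(s,a)\), g for the discounted return, and the loss is meant to be minimised by gradient descent.

```python
import numpy as np

def reinforce_loss(prob_action, value_estimate, g):
    advantage = g - value_estimate          # the 'advantage' A as defined above
    return -np.log(prob_action) * advantage
```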

A modern variation of Reinforce is the Proximal Policy Optimisation algorithm. The loss function becomes: \(L(\theta)=-\min(r(\theta)A,\text{clip}(r(\theta),1-\epsilon,1+\epsilon)A)\) where \(r(\theta)=\frac{\pi_{\theta}(s,a)}{\pi_{\theta^{old}}(s,a)}\), \(\pi_{\theta^{old}}\) is the policy before the update step, and \(\epsilon\) is a coefficient usually set to \(0.2\). Compared to Reinforce, PPO is a more 'careful' version, avoiding being too optimistic during the update step (\(\epsilon\) controls how much optimism is allowed). It may look like this would slow down the training process, but it actually does the opposite, by avoiding moving the parameters too far in a seemingly good but actually wrong direction.
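The corresponding sketch for the PPO loss, with the same conventions as the Reinforce snippet above (prob_action_old being the probability given to the action by the policy before the update step), is:

```python
import numpy as np

def ppo_loss(prob_action, prob_action_old, value_estimate, g, eps=0.2):
    advantage = g - value_estimate
    ratio = prob_action / prob_action_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return -min(ratio * advantage, clipped * advantage)
```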

The loss function for the state value approximator is the usual squared difference between the target value and the current value: \(L(\theta)=(V_{\theta}(s)-G)^2\), where \(V_{\theta}(s)\) is the estimated state value, and \(G\) is the sum of discounted rewards for the state in the trajectory.
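A plain gradient descent step on this loss could be sketched as follows, assuming for simplicity a table-like approximator with one value per (discretised) state, so that the gradient of \(V_{\theta}\) with respect to its own entry is 1 and the loss gradient is simply \(2(V_{\theta}-G)\):

```python
def value_update(values, state_index, g, lr=0.001):
    # One gradient descent step on (V - G)^2 for the visited state
    values[state_index] -= lr * 2.0 * (values[state_index] - g)
```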

For gradient descent, one can experiment with various optimisers. To keep the pseudocode below short and limited to the essential, it uses standard gradient descent. However, for the results below I've actually used Adam for the action probability update (decay rates \(0.9\) and \(0.999\)) and standard gradient descent for the state value update. Also, if using large neural networks, backpropagation would speed up the training. Check also my previous articles about optimisers and backpropagation.
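For reference, a generic Adam step with these decay rates looks like this (a textbook version of the update, not the exact code used for the results below):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # m, v: running averages of the gradient and squared gradient
    # t: current step count (starting at 1), used for bias correction
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * grad ** 2
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```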

The pseudocode then becomes (giving both Reinforce and PPO together, as the only difference is the calculation of the loss function):
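As an illustration of that structure, here is a minimal Python sketch of the combined training loop. It assumes a generic environment exposing reset() and step() functions, a policy pi(theta, s) returning action probabilities, a value function V(w, s), flat numpy parameter vectors, and numerical gradients to keep it short; it is a sketch of the idea, not the exact implementation used for the results below.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # Finite difference gradient of f at x (x is a flat numpy array)
    grad = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = h
        grad[i] = (f(x + d) - f(x - d)) / (2.0 * h)
    return grad

def train(env, pi, V, theta, w, use_ppo=False, episodes=1000,
          gamma=0.999, lr=0.001, eps=0.2, update_steps=1):
    for _ in range(episodes):
        # 1. Record a trajectory with the current policy
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            p = pi(theta, s)
            a = np.random.choice(len(p), p=p)
            states.append(s)
            actions.append(a)
            s, r, done = env.step(a)
            rewards.append(r)

        # 2. Evaluate the states: discounted return G for each of them
        returns, g = [0.0] * len(rewards), 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            returns[t] = g

        # 3. Update the policy parameters by gradient descent on the loss
        #    (Reinforce and PPO only differ here; PPO usually performs
        #    several update steps per trajectory, which is when the
        #    probability ratio and the clipping matter)
        old_probs = [pi(theta, s)[a] for s, a in zip(states, actions)]

        def policy_loss(th):
            loss = 0.0
            for s, a, g, p_old in zip(states, actions, returns, old_probs):
                adv = g - V(w, s)
                p = pi(th, s)[a]
                if use_ppo:
                    ratio = p / p_old
                    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
                    loss += -min(ratio * adv, clipped * adv)
                else:
                    loss += -np.log(p) * adv
            return loss

        for _ in range(update_steps):
            theta = theta - lr * numerical_gradient(policy_loss, theta)

        # 4. Update the value parameters by gradient descent on the
        #    squared difference between estimated values and returns
        def value_loss(w_):
            return sum((V(w_, s) - g) ** 2 for s, g in zip(states, returns))

        w = w - lr * numerical_gradient(value_loss, w)
    return theta, w
```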

Implementing the pseudocode above, I was able to reproduce the results on the mountain car example, as seen in this video, and also on the cartpole problem.

For the mountain car example, I've used the equations given by Barto and Sutton in "Reinforcement Learning: An Introduction", and an approximator made of a discretisation of the state space into 20 intervals (just to show it works with approximators other than ANNs).
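Such a discretised approximator could be sketched as follows; the position and speed bounds are the classic mountain car ones, and the per-cell parameterisation (one value plus 3 action preferences turned into probabilities by a softmax) is an assumption of this sketch.

```python
import numpy as np

N = 20                                   # number of intervals per dimension
POS_MIN, POS_MAX = -1.2, 0.6
VEL_MIN, VEL_MAX = -0.07, 0.07

def cell_index(state):
    # Map a continuous (position, speed) state to a discrete cell
    pos, vel = state
    i = min(N - 1, int((pos - POS_MIN) / (POS_MAX - POS_MIN) * N))
    j = min(N - 1, int((vel - VEL_MIN) / (VEL_MAX - VEL_MIN) * N))
    return i, j

# One state value and 3 action preferences per cell, initialised in [-0.5, 0.5]
values = np.random.uniform(-0.5, 0.5, (N, N))
prefs = np.random.uniform(-0.5, 0.5, (N, N, 3))

def policy(state):
    logits = prefs[cell_index(state)]
    e = np.exp(logits - logits.max())    # softmax over the 3 actions
    return e / e.sum()
```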

For the cartpole example, I've used the equations given in "Cart-pole system: Equations of motion", solved using Runge-Kutta (RK4) with a time step of \(0.01\). The approximator was an ANN with two layers, 8 nodes per layer, ReLU activations, and a softmax on the output. I've used 3 actions instead of 2, the extra one being 'don't push'.
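As a sketch, such a policy network could look like the following; the 4-dimensional input (cart position, cart speed, pole angle, pole angular speed) is the usual cartpole state and is an assumption of this sketch, as is reading "two layers" as two hidden layers before the softmax output.

```python
import numpy as np

rng = np.random.default_rng()
params = {
    "W1": rng.uniform(-0.5, 0.5, (8, 4)), "b1": rng.uniform(-0.5, 0.5, 8),
    "W2": rng.uniform(-0.5, 0.5, (8, 8)), "b2": rng.uniform(-0.5, 0.5, 8),
    "W3": rng.uniform(-0.5, 0.5, (3, 8)), "b3": rng.uniform(-0.5, 0.5, 3),
}

def policy(params, state):
    h1 = np.maximum(0.0, params["W1"] @ state + params["b1"])   # ReLU layer 1
    h2 = np.maximum(0.0, params["W2"] @ h1 + params["b2"])      # ReLU layer 2
    logits = params["W3"] @ h2 + params["b3"]
    e = np.exp(logits - logits.max())                           # softmax output
    return e / e.sum()

# Example: probabilities of the 3 actions for a slightly tilted pole
print(policy(params, np.array([0.0, 0.0, 0.05, 0.0])))
```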

In both examples, the learning rate for the action probability and state value updates was \(0.001\), the discount was \(0.999\), and all the approximators' parameters were initialised randomly in \([-0.5,0.5]\). I could successfully get the optimal policy with both Reinforce and PPO in all cases, and PPO was indeed faster to converge in terms of number of training episodes. The videos below are for (mountain car, PPO) and (cartpole, Reinforce).

And this ends my study of reinforcement learning techniques, for now at least, because there are of course a huge number of variations on the algorithms I've studied here and so many other ones to learn. If you want to learn more, "Reinforcement Learning: An Introduction" by Barto and Sutton really seems to be a must. I hope I will one day have the occasion to go through the whole paper edition. So many books, so little time!

2025-03-04
Copyright 2021-2025 Baillehache Pascal