MDP and reinforcement learning

In this article, I continue my exploration of MDPs (cf. my previous article) and learn how to use reinforcement learning to obtain the optimal policy.

We've seen that if we know an MDP's \(A\), \(S\), \(P\) and \(R\), we can easily calculate its optimal policy. However, in practice we may not have all this information. In particular, we may not have \(P\), which defines how we transition from state to state given actions. Instead we may have a model of the environment, i.e. a separate algorithm which somehow calculates the resulting state given a current state and an action. By using this model and reinforcement learning techniques it is also possible to calculate the optimal policy.

The Monte Carlo method (MC) consists of sampling trajectories (sequences of states from an initial one to a final one) and estimating the value of each state by averaging over these trajectories. The sampling step can actually be replaced by a dataset of trajectories, in which case MC does not even need a model of the environment (which is why it is called a "model-free" method). But if we are sampling ourselves, we need to choose an action at each step of the trajectory. In MC they are chosen purely at random. As such, MC does not learn a policy (a choice of action for each state); it learns the value of each state.

The core equation of MC is: \(V(s_t)=(1-\alpha)V(s_t)+\alpha g_t\) where \(V(s_t)\) is the value of the state visited at step \(t\) of the trajectory, \(\alpha\) is a learning rate (in ]0,1], but usually a very small value), \(g_t=\sum_{i=1}^N\gamma^{i-1}r_{t+i}\) is the sum of discounted rewards, \(N\) is the number of steps remaining in the trajectory after step \(t\), \(\gamma\) is the discounting coefficient (in ]0,1]), and \(r_t\) is the reward at step \(t\). For each trajectory, state values are updated using this equation, usually backward to factorise the calculation of \(g_t\).
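To make the backward update concrete, here is a minimal Python sketch, assuming states are hashable, \(V\) is a plain dict, and a trajectory is given as a list of (state, reward) pairs where the reward is the one received after leaving that state (these conventions are mine, not from any particular library):

```python
def mc_update_values(V, trajectory, alpha=0.01, gamma=0.99):
    # trajectory: list of (state, reward) pairs, the reward being the one
    # received after leaving that state (r_{t+1} for the state at step t).
    # Walking backward lets us accumulate g_t as r_{t+1} + gamma * g_{t+1}
    # instead of re-summing the discounted rewards at every step.
    g = 0.0
    for state, reward in reversed(trajectory):
        g = reward + gamma * g
        V[state] = (1 - alpha) * V.get(state, 0.0) + alpha * g
    return V
```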

Now, if you remember how the optimal policy is calculated in my previous article, you should understand that the equation above solves the problem of not knowing the MDP's \(P\) for the value update. Wouldn't it be nice if we could modify the action update the same way? Well indeed we can, and that's really easy: we just reuse the same equation for action values instead: \(Q(s_t,a_t)=(1-\alpha)Q(s_t,a_t)+\alpha g_t\). Then we can estimate the optimal policy with \(\pi(s)=\arg\max_a\left\lbrace Q(s,a)\right\rbrace\). Nice!
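The sketch above carries over to action values almost unchanged, assuming \(Q\) is stored as a dict keyed by (state, action) pairs and the list of available actions is known (again, my own conventions):

```python
def mc_update_action_values(Q, trajectory, alpha=0.01, gamma=0.99):
    # trajectory: list of (state, action, reward) triples, the reward being
    # the one received after taking that action in that state.
    g = 0.0
    for state, action, reward in reversed(trajectory):
        g = reward + gamma * g
        Q[(state, action)] = (1 - alpha) * Q.get((state, action), 0.0) + alpha * g
    return Q

def greedy_action(Q, state, actions):
    # pi(s) = argmax_a Q(s, a) over the available actions
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```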

Here, one might think: instead of using random actions to sample trajectories, wouldn't it be better to use \(\pi\) (updated after each trajectory)? Unfortunately it doesn't work. As soon as the policy has been updated to something better than the initial arbitrary one, it will never make any more progress because there is no incentive to explore other actions. This is known as the "exploration-exploitation trade-off" problem. If we explore (try random actions) too much, we lose time trying actions which have no chance of improving our policy. If we exploit (stick to the current best actions) too much, we never take the chance to find a better policy.

A way to solve that problem is to use an \(\epsilon\)-greedy version of the current best policy to sample new trajectories. That is, we use the current best policy (updated after each trajectory), but with a given probability \(\epsilon\) (the exploration coefficient) we choose a random action instead of the current best one. That coefficient lets us control the exploration-exploitation trade-off.
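In Python this is a short function, reusing the same dict-based \(Q\) as above (a sketch under my conventions, not a library call):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    # With probability epsilon explore with a random action, otherwise
    # exploit the current best action according to Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```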

Another approach consists of using a separate policy (defined by the user) to choose actions for exploration, and exploiting the result to update the estimate of the optimal policy. Using the estimate of the optimal policy to explore is called an "on-policy" method, while using another policy is called an "off-policy" method. This however requires modifying a bit the calculation of \(g\) to compensate for the fact that the exploration and exploitation policies have different distributions. This is called importance sampling. The equation becomes: \(g_t=\rho\sum_{i=1}^N\gamma^{i-1}r_{t+i}\) where \(\rho=\prod_{i=1}^N\pi(a_{t+i}|s_{t+i})/b(a_{t+i}|s_{t+i})\), \(b(a|s)\) is the probability of choosing action \(a\) when in state \(s\) with the exploration policy \(b\), and \(\pi(a|s)\) is the probability of choosing action \(a\) when in state \(s\) with the estimate of the optimal policy. But wait, we don't know \(\pi(a|s)\). Yep, that's a complete mystery to me...
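For what it's worth, here is a minimal sketch of the weighted return, assuming the two policies are provided as probability functions pi_prob(s, a) and b_prob(s, a) (my own naming). One common resolution of the question above, not taken from this article, is to take \(\pi\) as the greedy policy over the current \(Q\), so that \(\pi(a|s)\) is 1 for the greedy action and 0 otherwise:

```python
def importance_sampled_return(trajectory, pi_prob, b_prob, gamma=0.99):
    # trajectory: list of (state, action, reward) triples, starting at the
    # step whose return we want to estimate.
    # pi_prob(s, a): probability of a in s under the target policy pi.
    # b_prob(s, a): probability of a in s under the exploration policy b.
    rho = 1.0
    g = 0.0
    for i, (state, action, reward) in enumerate(trajectory):
        rho *= pi_prob(state, action) / b_prob(state, action)
        g += (gamma ** i) * reward
    return rho * g
```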

In the equation of the sum of rewards we are summing over the whole remainder of the trajectory from a given state. If we sum over only a few steps and bootstrap on the current value estimate of the state we reach, it becomes the method called Temporal Difference learning (TD). We then use: \(g_t=\sum_{i=1}^D\gamma^{i-1}r_{t+i}+\gamma^DV(s_{t+D})\), where \(D\) is the number of steps used. TD(1) would use only the next reward (plus the bootstrap term). TD(\(\infty\)) uses the whole remaining trajectory, hence is identical to MC. For proper choices of \(\alpha\) and \(D\), TD actually performs better than MC. Which \(\alpha\) and \(D\)? Well, as usual no one knows, it's different for each problem and you'll have to find them by yourself!
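Here is a sketch of this \(D\)-step update for state values, reusing the (state, reward) trajectory convention from the MC sketch above (my own convention):

```python
def td_update_value(V, trajectory, t, D=1, alpha=0.1, gamma=0.99):
    # trajectory: list of (state, reward) pairs as before; t is the index of
    # the state being updated, D the number of real rewards used.
    g = 0.0
    last = min(t + D, len(trajectory))
    for i in range(t, last):
        g += (gamma ** (i - t)) * trajectory[i][1]
    if t + D < len(trajectory):
        # bootstrap on the current estimate of the state reached after D steps
        g += (gamma ** D) * V.get(trajectory[t + D][0], 0.0)
    s_t = trajectory[t][0]
    V[s_t] = (1 - alpha) * V.get(s_t, 0.0) + alpha * g
    return V
```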

So, TD is like MC: it's a method to estimate state values. And as with MC, it's straightforward to apply the same idea to action values instead of state values to obtain an estimate of the optimal policy. This is called the SARSA method and its core equation is \(g_t=\sum_{i=1}^D(\gamma^{i-1}r_{t+i})+\gamma^DQ(s_{t+D},a_{t+D})\). \(Q\) values are updated exactly the same way as before: \(Q(s_t,a_t)=(1-\alpha)Q(s_t,a_t)+\alpha g_t\).
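Restricted to \(D=1\) for brevity, the SARSA update could look like this (same dict-based \(Q\) as above):

```python
def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.99):
    # One-step SARSA (D=1): bootstrap on the action actually taken in s_next.
    g = reward + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * g
    return Q
```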

In case you're not yet tired of all these variations, here is a final one: Q-Learning. This version has a small difference compared to SARSA: the gain is calculated as follows: \(g_t=\sum_{i=1}^D(\gamma^{i-1}r_{t+i})+\gamma^D\max_aQ(s_{t+D},a)\). Note the usage of the best action value for the last component, instead of the action value for the actual action in the trajectory. This modification provides a potential improvement over the other versions, but it depends on the problem and the values of the parameters you're using (\(\gamma\), \(\epsilon\), \(\alpha\), \(D\)). In the end, it's a matter of experimenting and finding what works best for you.
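The only change with respect to the SARSA sketch above is the bootstrap term (again a \(D=1\) sketch under the same conventions):

```python
def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.99):
    # One-step Q-Learning (D=1): bootstrap on the best action available in
    # s_next, regardless of which action the exploration policy takes there.
    g = reward + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * g
    return Q
```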

To end this article, here comes my pseudocode for Q-Learning (limited to \(D=1\), but handling both stochastic and deterministic transitions, and optional \(P\) and \(Q\) prior estimates).
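As a simplified illustration of what such a training loop looks like (not the full pseudocode: fixed \(\epsilon\), no \(P\) or \(Q\) priors, and a hypothetical env_reset()/env_step(state, action) pair standing in for whatever model of the environment is available), here is a minimal tabular version in Python:

```python
import random

def train_q_learning(states, actions, env_reset, env_step, n_episodes=10000,
                     alpha=0.1, gamma=0.99, epsilon=0.1):
    # Tabular one-step Q-Learning with an epsilon-greedy exploration policy.
    # env_reset() returns an initial state; env_step(s, a) returns
    # (next_state, reward, done). Both are hypothetical placeholders.
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_episodes):
        s = env_reset()
        done = False
        while not done:
            # epsilon-greedy choice of the action
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, reward, done = env_step(s, a)
            # Q-Learning target: bootstrap on the best action of the next state
            target = reward
            if not done:
                target += gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    # greedy policy extracted from the learnt action values
    policy = {s: max(actions, key=lambda act: Q[(s, act)]) for s in states}
    return Q, policy
```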

Implementing the pseudocode above, I could reproduce the results on the "frozen lake" example of this article, and the "blackjack" example in this video.

2025-01-21
Copyright 2021-2025 Baillehache Pascal