Markov decision process and optimal policy

This article is a memo to myself about Markov decision processes (MDP for short).

MDPs are defined by 2 sets and 2 functions on these sets:

  * a set of states \(S\),
  * a set of actions \(A\),
  * a transition probability function \(P(s,a,s')\), the probability of reaching state \(s'\) when taking action \(a\) in state \(s\),
  * a reward function \(R(s,a,s')\), the reward received when that transition occurs.

If only one state is accessible from any state through a given action, the MDP is said to be deterministic; otherwise it is said to be stochastic.
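
As a concrete way to hold these four ingredients in code, the snippets below assume one possible Python representation (an assumption made for this article, nothing standard) where the two functions \(P\) and \(R\) are stored together as a transition table:

```python
from typing import Dict, List, Tuple

# Assumed representation for the snippets in this article: states and actions
# are plain strings, and P and R are stored together as a map from
# (state, action) to a list of (next_state, probability, reward) outcomes.
State = str
Action = str
Transitions = Dict[Tuple[State, Action], List[Tuple[State, float, float]]]
```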

A super simplistic example: two states (\(s_0\), \(s_1\)) and one action (\(a_0\)). The process starts in state \(s_0\), then transitions to state \(s_1\) with probability 1/3, receiving 5 points, or back to state \(s_0\) with probability 2/3, receiving 3 points. The process ends when state \(s_1\) is reached. This MDP can be summarised as in the table below.

| to \ from | \(s_0\) | \(s_1\) |
| --- | --- | --- |
| \(s_0\) | \(a_0\): (P: 2/3, R: 3.0) | \(a_0\): (P: 0, R: 0.0) |
| \(s_1\) | \(a_0\): (P: 1/3, R: 5.0) | \(a_0\): (P: 1, R: 0.0) |
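
Using the representation sketched above, this toy MDP could be written as:

```python
# The toy MDP of the table above: a single action a0, and s1 absorbing.
simple_mdp: Transitions = {
    ("s0", "a0"): [("s0", 2/3, 3.0), ("s1", 1/3, 5.0)],
    ("s1", "a0"): [("s1", 1.0, 0.0)],
}
```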

A point of interest about MDPs is the expected sum of rewards: \(ESR=\mathbb{E}\left[\sum_{n=0}^{\infty}R(s(n),a(n),s'(n))\right]\), where \(n\) is the step index in the sequence of states visited through transitions. It is how much total reward we can expect to obtain if we transition from state to state randomly according to the probabilities \(P\) (from an initial state to an end state, potentially forever). On this simple example we can calculate it explicitly: a run that ends at step \(n\) has made \(n-1\) self-loops worth 3 points each, then one final transition worth 5 points, with probability \(\frac{2^{n-1}}{3^{n}}\), so \(ESR=\sum_{n=1}^{\infty}\frac{2^{n-1}}{3^{n}}\bigl(3(n-1)+5\bigr)=11\). Measuring the average total reward over 100,000 runs on this MDP confirms the explicit calculation: \(ESR=11.01692\).
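
A Monte Carlo estimate like the one quoted above can be obtained with a short simulation. This is only a sketch reusing simple_mdp from the previous snippet; the helper run_episode and its parameters are assumptions, not code taken from elsewhere:

```python
import random

def run_episode(mdp, policy, start="s0", terminal="s1", max_steps=10_000):
    """Sample one run of the MDP under a fixed policy and return its total reward."""
    state, total = start, 0.0
    for _ in range(max_steps):
        if state == terminal:
            break
        outcomes = mdp[(state, policy[state])]
        nxt, _, reward = random.choices(outcomes, weights=[p for _, p, _ in outcomes])[0]
        total += reward
        state = nxt
    return total

# Average over 100,000 runs; the result should be close to the explicit value of 11.
policy = {"s0": "a0", "s1": "a0"}
print(sum(run_episode(simple_mdp, policy) for _ in range(100_000)) / 100_000)
```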

Let's make this example a little more interesting by adding another action \(a_1\). Transitioning from \(s_0\) through \(a_1\) leads to \(s_1\) with a reward of 5, but now transitioning from \(s_0\) through \(a_0\) leads to \(s_1\) with a reward of -10. It's like a gambling game: should we continue playing in the hope of gaining more, at the risk of losing some, or should we stop?

| to \ from | \(s_0\) | \(s_1\) |
| --- | --- | --- |
| \(s_0\) | \(a_0\): (P: 2/3, R: 3.0), \(a_1\): (P: 0, R: 0.0) | \(a_0\): (P: 0, R: 0.0), \(a_1\): (P: 0, R: 0.0) |
| \(s_1\) | \(a_0\): (P: 1/3, R: -10.0), \(a_1\): (P: 1, R: 5.0) | \(a_0\): (P: 1, R: 0.0), \(a_1\): (P: 1, R: 0.0) |
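
In the same assumed representation, the gambling variant becomes:

```python
# The gambling variant: a second action a1 is available from s0, and ending
# the game through a0 now costs 10 points.
gamble_mdp: Transitions = {
    ("s0", "a0"): [("s0", 2/3, 3.0), ("s1", 1/3, -10.0)],
    ("s0", "a1"): [("s1", 1.0, 5.0)],
    ("s1", "a0"): [("s1", 1.0, 0.0)],
    ("s1", "a1"): [("s1", 1.0, 0.0)],
}
```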

MDPs are memoryless, thus they can't encode a process where the player chooses his/her action based on how many points have been received so far. The player must choose once and for all which action is taken in each state. This choice is called a policy and is denoted \(\pi\). \(\pi(s)=a\) means that when in state \(s\), the action \(a\) is chosen under the policy \(\pi\). Note that we could resolve that memorylessness problem by creating one state for every possible sum of rewards. There would be an infinite number of states, which isn't forbidden by the theory, but is impossible in practice.

The example is still simple enough that we can work it out by hand. First, \(\pi(s_1)\) doesn't matter because \(\forall (a,s),\ R(s_1,a,s)=0\) and \(\forall (a,s\ne s_1),\ P(s_1,a,s)=0\). So we only need to consider \(\pi(s_0)\), for which we have two choices. If \(\pi(s_0)=a_1\), each run simply transitions from \(s_0\) to \(s_1\), receives 5 points, and the game is over; the ESR is then equal to 5.0. If \(\pi(s_0)=a_0\), we can calculate the same way as in the previous example: \(ESR=\sum_{n=1}^{\infty}\frac{2^{n-1}}{3^{n}}\bigl(3(n-1)-10\bigr)=-4\).
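
These two hand-computed values can be checked by simulation, reusing the run_episode sketch from above (the estimates should land near 5.0 and -4 respectively):

```python
# Compare the two candidate policies for s0 by simulation.
for action in ("a1", "a0"):
    policy = {"s0": action, "s1": "a0"}
    estimate = sum(run_episode(gamble_mdp, policy) for _ in range(100_000)) / 100_000
    print(f"pi(s0)={action}: ESR ~ {estimate:.3f}")
```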

So, if our objective is to maximise the ESR, \(\pi(s_0)=a_1\) is clearly the optimal policy. Now, we are interested in finding those optimal policies for any MDP, and it turns out there are efficient ways to do it. MDPs may have some limitations, but within the limits of the processes they can represent, they rock!

We first need to introduce two new concepts: the discounted reward, and the state value. The discounted reward isn't strictly necessary, but it allows us to handle the infinite sum in the ESR formula. The discount factor \(\lambda\in]0,1]\) represents how much the weight of a reward decreases for farther states compared to nearer states in the sequence of states from a given one. It is used to calculate the expected sum of discounted rewards: \(ESDR=\mathbb{E}\left[\sum_{n=0}^{\infty}\lambda^nR(s(n),a(n),s'(n))\right]\). In practice one can then stop the summation loop when \(\lambda^n<\epsilon\). The smaller \(\lambda\), the more short-term rewards matter compared to long-term ones. The state value \(V(s)\) is simply equal to the ESDR given \(s\) as the initial state. For a given policy it can be expressed recursively as follows: \(V(s)=\sum_{s'}P(s,\pi(s),s')(R(s,\pi(s),s')+\lambda V(s'))\).
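
The recursive definition of \(V\) translates directly into an iterative update. A minimal sketch, under the same representation assumptions as the earlier snippets (the function name and tolerance are arbitrary choices):

```python
def evaluate_policy(mdp, policy, states, lam=0.9, eps=1e-9):
    """Iterate the recursive definition of V for a fixed policy until it stabilises."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(p * (r + lam * V[s2]) for s2, p, r in mdp[(s, policy[s])])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < eps:
            return V
```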

The algorithm to find the optimal policy then consists of two simple steps:

  1. update \(V(s)\) for all \(s\)
  2. update \(\pi(s)=\operatorname{argmax}_a\left\lbrace\sum_{s'}P(s,a,s')(R(s,a,s')+\lambda V(s'))\right\rbrace\) for all \(s\)

and repeat until convergence (which the theory guarantees as long as there is no inaccessible state in the MDP). You don't even need to worry about the initial values for \(V\) and \(\pi\): any random ones will do. Super easy!

The pseudocode goes as follows:
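
A minimal Python sketch of that two-step loop, assuming the dictionary representation used in the earlier snippets and that every action is available in every state (a reconstruction rather than the exact original pseudocode):

```python
def optimal_policy(mdp, states, actions, lam=0.9, eps=1e-9):
    """Alternate value updates and greedy policy updates until both stabilise."""
    V = {s: 0.0 for s in states}            # arbitrary initial values
    pi = {s: actions[0] for s in states}    # arbitrary initial policy

    def action_value(s, a):
        # Expected discounted return of taking action a in state s.
        return sum(p * (r + lam * V[s2]) for s2, p, r in mdp[(s, a)])

    while True:
        # 1. update V(s) for all s under the current policy
        delta = 0.0
        for s in states:
            v = action_value(s, pi[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        # 2. update pi(s) = argmax_a of the expected discounted return
        stable = True
        for s in states:
            best = max(actions, key=lambda a: action_value(s, a))
            if best != pi[s]:
                pi[s], stable = best, False
        # repeat until neither the values nor the policy change
        if stable and delta < eps:
            return pi, V
```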

Edit on 2024/12/10: correction of the policy update loop and convergence condition

Variations on this algorithm exist, where the updates are made only on the values or only on the actions, or with a different ordering of the updates. I'll let the reader refer to the Wikipedia page.

Applying this algorithm to our example, the update of the policy goes as follows (\(\lambda=0.9\)):
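
With the sketch above, this run could look like the following (for this MDP and \(\lambda=0.9\), the greedy update settles on \(a_1\) for \(s_0\)):

```python
# Run the optimal_policy sketch on the gambling MDP with a discount factor of 0.9.
states, actions = ["s0", "s1"], ["a0", "a1"]
pi, V = optimal_policy(gamble_mdp, states, actions, lam=0.9)
print(pi)  # expected: pi(s0) = a1
```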

As expected, the returned optimal policy is \(\pi(s_0)=a_1\). If we modify the MDP so that \(P(s_0,a_0,s_0)=6/7\) and \(P(s_0,a_0,s_1)=1/7\), then for \(\pi(s_0)=a_0\), \(ESR=\sum_{n=1}^{\infty}\frac{6^{n-1}}{7^{n}}\bigl(3(n-1)-10\bigr)=8\), and the optimal policy converges as follows:

As expected, the returned optimal policy becomes \(\pi(s_0)=a_0\).

Now, scale up your MDP to hundreds or thousands of states/actions and you get a nice tool to solve a large variety of real-world problems. Well, in theory at least, because as usual it's not going to be that easy in practice, but that's the subject of plenty of other super interesting stuff which I hope to come back to very soon...

Edit on 2024/12/14: implementing the pseudocode above, I could successfully reproduce the results on the "Gambler problem" introduced in this video.

2024-12-03
Copyright 2021-2024 Baillehache Pascal