Markov decision process and optimal policy

This article is a memo to myself about Markov decision processes (MDP for short).

MDPs are defined by 2 sets and 2 functions on these sets:

  * a set of states \(S\),
  * a set of actions \(A\),
  * a transition probability function \(P(s,a,s')\), the probability of reaching state \(s'\) when taking action \(a\) in state \(s\),
  * a reward function \(R(s,a,s')\), the reward received when that transition occurs.

If only one state is accessible from any state through a given action, the MDP is said to be deterministic; otherwise it is said to be stochastic.
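
As a concrete way to hold these four ingredients in code, the snippets below assume one possible Python representation (an assumption made for this article, nothing standard) where the two functions \(P\) and \(R\) are stored together as a transition table:

```python
from typing import Dict, List, Tuple

# Assumed representation for the snippets in this article: states and actions
# are plain strings, and P and R are stored together as a map from
# (state, action) to a list of (next_state, probability, reward) outcomes.
State = str
Action = str
Transitions = Dict[Tuple[State, Action], List[Tuple[State, float, float]]]
```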

A super simplistic example: two states (\(s_0\), \(s_1\)) and one action (\(a_0\)). The process starts in state \(s_0\), then transitions to state \(s_1\) with probability 1/3, receiving 5 points, or back to state \(s_0\) with probability 2/3, receiving 3 points. The process ends when state \(s_1\) is reached. This MDP can be summarised as in the table below.

| to \ from | \(s_0\) | \(s_1\) |
| --- | --- | --- |
| \(s_0\) | \(a_0\): (P: 2/3, R: 3.0) | \(a_0\): (P: 0, R: 0.0) |
| \(s_1\) | \(a_0\): (P: 1/3, R: 5.0) | \(a_0\): (P: 1, R: 0.0) |
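
Using the representation sketched above, this toy MDP could be written as:

```python
# The toy MDP of the table above: a single action a0, and s1 absorbing.
simple_mdp: Transitions = {
    ("s0", "a0"): [("s0", 2/3, 3.0), ("s1", 1/3, 5.0)],
    ("s1", "a0"): [("s1", 1.0, 0.0)],
}
```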

A point of interest about MDPs is the expected sum of rewards: \(ESR=\mathbb{E}\left[\sum_{n=0}^{\infty}R(s(n),a(n),s'(n))\right]\), where \(n\) is the step index in the sequence of states visited through transitions. It is how much total reward we can expect to obtain if we transition from state to state randomly according to the probabilities \(P\) (from an initial state to an end state, potentially forever). On this simple example we can calculate it explicitly: a run that ends at step \(n\) has made \(n-1\) self-loops worth 3 points each, then one final transition worth 5 points, with probability \(\frac{2^{n-1}}{3^{n}}\), so \(ESR=\sum_{n=1}^{\infty}\frac{2^{n-1}}{3^{n}}\bigl(3(n-1)+5\bigr)=11\). Measuring the average total reward over 100,000 runs on this MDP confirms the explicit calculation: \(ESR=11.01692\).
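
A Monte Carlo estimate like the one quoted above can be obtained with a short simulation. This is only a sketch reusing simple_mdp from the previous snippet; the helper run_episode and its parameters are assumptions, not code taken from elsewhere:

```python
import random

def run_episode(mdp, policy, start="s0", terminal="s1", max_steps=10_000):
    """Sample one run of the MDP under a fixed policy and return its total reward."""
    state, total = start, 0.0
    for _ in range(max_steps):
        if state == terminal:
            break
        outcomes = mdp[(state, policy[state])]
        nxt, _, reward = random.choices(outcomes, weights=[p for _, p, _ in outcomes])[0]
        total += reward
        state = nxt
    return total

# Average over 100,000 runs; the result should be close to the explicit value of 11.
policy = {"s0": "a0", "s1": "a0"}
print(sum(run_episode(simple_mdp, policy) for _ in range(100_000)) / 100_000)
```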

Let's make this example a little more interesting by adding another action \(a_1\). Transitioning from \(s_0\) through \(a_1\) leads to \(s_1\) with a reward of 5, but now transitioning from \(s_0\) through \(a_0\) leads to \(s_1\) with a reward of -10. It's like a gambling game: should we continue playing in the hope of gaining more, at the risk of losing some, or should we stop?

| to \ from | \(s_0\) | \(s_1\) |
| --- | --- | --- |
| \(s_0\) | \(a_0\): (P: 2/3, R: 3.0), \(a_1\): (P: 0, R: 0.0) | \(a_0\): (P: 0, R: 0.0), \(a_1\): (P: 0, R: 0.0) |
| \(s_1\) | \(a_0\): (P: 1/3, R: -10.0), \(a_1\): (P: 1, R: 5.0) | \(a_0\): (P: 1, R: 0.0), \(a_1\): (P: 1, R: 0.0) |
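
In the same assumed representation, the gambling variant becomes:

```python
# The gambling variant: a second action a1 is available from s0, and ending
# the game through a0 now costs 10 points.
gamble_mdp: Transitions = {
    ("s0", "a0"): [("s0", 2/3, 3.0), ("s1", 1/3, -10.0)],
    ("s0", "a1"): [("s1", 1.0, 5.0)],
    ("s1", "a0"): [("s1", 1.0, 0.0)],
    ("s1", "a1"): [("s1", 1.0, 0.0)],
}
```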

MDPs are memoryless, thus they can't encode a process where the player chooses his/her action based on how many points have been received so far. The player must choose once and for all which action is taken in each state. This choice is called a policy and is denoted \(\pi\). \(\pi(s)=a\) means that when in state \(s\), the action \(a\) is chosen under the policy \(\pi\). Note that we could resolve that memorylessness problem by creating one state for every possible sum of rewards. There would be an infinite number of states, which isn't forbidden by the theory, but is impossible in practice.

The example is still simple enough that we can work it out by hand. First, \(\pi(s_1)\) doesn't matter because \(\forall (a,s),\ R(s_1,a,s)=0\) and \(\forall (a,s\ne s_1),\ P(s_1,a,s)=0\). So we only need to consider \(\pi(s_0)\), for which we have two choices. If \(\pi(s_0)=a_1\), each run simply transitions from \(s_0\) to \(s_1\), receives 5 points, and the game is over; the ESR is then equal to 5.0. If \(\pi(s_0)=a_0\), we can calculate the same way as in the previous example: \(ESR=\sum_{n=1}^{\infty}\frac{2^{n-1}}{3^{n}}\bigl(3(n-1)-10\bigr)=-4\).
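
These two hand-computed values can be checked by simulation, reusing the run_episode sketch from above (the estimates should land near 5.0 and -4 respectively):

```python
# Compare the two candidate policies for s0 by simulation.
for action in ("a1", "a0"):
    policy = {"s0": action, "s1": "a0"}
    estimate = sum(run_episode(gamble_mdp, policy) for _ in range(100_000)) / 100_000
    print(f"pi(s0)={action}: ESR ~ {estimate:.3f}")
```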

So, if our objective is to maximise the ESR, \(\pi(s_0)=a_1\) is clearly the optimal policy. Now, we are interested in finding those optimal policies for any MDP, and it turns out there are efficient ways to do it. MDPs may have some limitations, but within the limits of the processes they can represent, they rock!

We first need to introduce two new concepts: the discounted reward, and the state value. The discounted reward isn't strictly necessary, but it allows us to handle the infinite sum in the ESR formula. The discount factor \(\lambda\in]0,1]\) represents how much the weight of a reward decreases for farther states compared to nearer states in the sequence of states from a given one. It is used to calculate the expected sum of discounted rewards: \(ESDR=\mathbb{E}\left[\sum_{n=0}^{\infty}\lambda^nR(s(n),a(n),s'(n))\right]\). In practice one can then stop the summation loop when \(\lambda^n<\epsilon\). The smaller \(\lambda\), the more short-term rewards matter compared to long-term ones. The state value \(V(s)\) is simply equal to the ESDR given \(s\) as the initial state. For a given policy it can be expressed recursively as follows: \(V(s)=\sum_{s'}P(s,\pi(s),s')(R(s,\pi(s),s')+\lambda V(s'))\).
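
The recursive definition of \(V\) translates directly into an iterative update. A minimal sketch, under the same representation assumptions as the earlier snippets (the function name and tolerance are arbitrary choices):

```python
def evaluate_policy(mdp, policy, states, lam=0.9, eps=1e-9):
    """Iterate the recursive definition of V for a fixed policy until it stabilises."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(p * (r + lam * V[s2]) for s2, p, r in mdp[(s, policy[s])])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < eps:
            return V
```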

The algorithm to find the optimal policy then consists of two simple steps:

  1. update \(V(s)\) for all \(s\)
  2. update \(\pi(s)=\operatorname{argmax}_a\left\lbrace\sum_{s'}P(s,a,s')(R(s,a,s')+\lambda V(s'))\right\rbrace\) for all \(s\)

and repeat until convergence (which the theory guarantees as long as there is no inaccessible state in the MDP). You don't even need to worry about the initial values for \(V\) and \(\pi\): any random ones will do. Super easy!

The pseudocode goes as follows:
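
A minimal Python sketch of that two-step loop, assuming the dictionary representation used in the earlier snippets and that every action is available in every state (a reconstruction rather than the exact original pseudocode):

```python
def optimal_policy(mdp, states, actions, lam=0.9, eps=1e-9):
    """Alternate value updates and greedy policy updates until both stabilise."""
    V = {s: 0.0 for s in states}            # arbitrary initial values
    pi = {s: actions[0] for s in states}    # arbitrary initial policy

    def action_value(s, a):
        # Expected discounted return of taking action a in state s.
        return sum(p * (r + lam * V[s2]) for s2, p, r in mdp[(s, a)])

    while True:
        # 1. update V(s) for all s under the current policy
        delta = 0.0
        for s in states:
            v = action_value(s, pi[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        # 2. update pi(s) = argmax_a of the expected discounted return
        stable = True
        for s in states:
            best = max(actions, key=lambda a: action_value(s, a))
            if best != pi[s]:
                pi[s], stable = best, False
        # repeat until neither the values nor the policy change
        if stable and delta < eps:
            return pi, V
```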

Edit on 2024/12/10: correction of the policy update loop and convergence condition

Variations on this algorithm exist, where the updates are made only on the values or only on the actions, or with a different ordering of the updates. I'll let the reader refer to the Wikipedia page.

Applying this algorithm to our example, the update of the policy goes as follows (\(\lambda=0.9\)):
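
With the sketch above, this run could look like the following (for this MDP and \(\lambda=0.9\), the greedy update settles on \(a_1\) for \(s_0\)):

```python
# Run the optimal_policy sketch on the gambling MDP with a discount factor of 0.9.
states, actions = ["s0", "s1"], ["a0", "a1"]
pi, V = optimal_policy(gamble_mdp, states, actions, lam=0.9)
print(pi)  # expected: pi(s0) = a1
```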

As expected, the returned optimal policy is \(\pi(s_0)=a_1\). If we modify the MDP so that \(P(s_0,a_0,s_0)=6/7\) and \(P(s_0,a_0,s_1)=1/7\), then for \(\pi(s_0)=a_0\), \(ESR=\sum_{n=1}^{\infty}\frac{6^{n-1}}{7^{n}}\bigl(3(n-1)-10\bigr)=8\), and the optimal policy converges as follows:

As expected, the returned optimal policy becomes \(\pi(s_0)=a_0\).

Now, scale up your MDP to hundreds or thousands of states/actions and you get a nice tool to solve a large variety of real-world problems. Well, in theory at least, because as usual it's not going to be that easy in practice, but that's the subject of plenty of other super interesting stuff which I hope to come back to very soon...

Edit on 2024/12/14: implementing the pseudocode above, I could successfully reproduce the results on the "Gambler problem" introduced in this video.

2024-12-03
Copyright 2021-2024 Baillehache Pascal