Baillehache Pascal's personal website

MiniFrame, a framework for solving combinatorial optimization problems

Around 10 years ago I used to participate to programming contests on the CodinGame platform. These were contests where the goal was to code a bot playing some kind of game against other participants, in other words solving problems related to combinatorial optimization in real time. I had develop a framework based on the Minimax algorithm which I reused on each contest and allowed me to reach a high ranking. Revisiting this framework and making it publicly available was on my todo list since then. I finally took time to do it and will introduce it in this article.

Combinatorial optimization

In combinatorial optimization problems we are looking for a subset of elements, inside a discrete and finite set, which maximise an objective function. This can be interpreted as, choosing one element among all possible, then another one, and so on according to some constraints until an end condition, and finally evaluating the result subset according to the objective function.

The problem is then, how to choose an element at each step to obtain the subset with maximum value. If the problem is simple, the solution can be obtained by exploring all possible combinations and choosing the best one (brute force method). In real life problems, it is however impractical most of the time, as the number of combination is too large and exceeds the computation time/memory constraints.

For example, in the context of a two players game, the subset is made of the moves constituting one game instance, choosen among all the possible sequences of move allowed by the game rules. The objective function is the win of the game for a given player. The rules are generally well defined, but the opponent moves are not known in advance, and even when the winning condition is well defined, the evaluation of a position during the game can be difficult.

As another example, in the context of resource attribution, the problem becomes the selection of pairs resource/user, choosen among all possible pairs. The objective function could then be the maximisation of user satisfaction. Here, the set of possible pairs is known with certainty from the beginning, but the definition of the objective function (what satisfies the user) can be difficult.

Greedy algorithm

The simplest algorithm for this kind of problems is the Greedy algorithm. At each step, the best choice is determined with a heuristic based only on local information available at that step. This method is extremely easy to implement and fast to execute, but rarely produces a good solution when applied to real life complex problems.

MiniMax

The MiniMax algorithm (MM) was designed to solve two players sequential games. The tree of all possible sequences of move is progressively constructed in a breadth first order, and each time a node is computed, it is evaluated with the objective function and its score is backpropagated toward the root of the tree. The backpropagation stops when, for a given node, there exists a sibling having a more advantageous value from the point of view of the active player at that node's parent node (see the Wikipedia page for more details). Solving the game using MM consists then of choosing at each node the next move with best value according to the current player.

The name MiniMax comes from the fact that if the value of a node represents the advantage of a position for player one, then the optimal path in the tree is found by choosing the next node maximising the value when it's player one's turn, and the next node minimising the value when it's player two's turn. I always find that way of thinking confusing, and rather than maximising/minimising relative to one player I personally prefer to see it as maximising from the point of view of the current player. Both ways of thinking are strictly equivalent, but when adapting Minimax to games with more than two players, or even one player, I found the "always maximising" view more natural.

Monte Carlo Tree Search

While Minimax explores the tree in a breadth first manner, another approach is to explore it in a depth first one. This is known as the Monte Carlo Tree Search (MCTS) algorithm. In this algorithm, instead of exploring all subnodes of a node, we choose only one at random. The exploration then goes from the root of the tree down to an end node along one single random path, and starts again from the root along a different random path. The evaluation of a node is then equal to the average of the evaluation of its currently explored subtree's end nodes.

Compare to MM, it is advantageous when there is no well defined way to evaluate incomplete subsets (mid-game positions in the context of a game) and only complete subsets (end-game positions) have a well defined evaluation (win/loss for a game). In the latter case, MM can't evaluate a node value until it has reached end nodes, which generally doesn't happen until nearing the end of the game due to the breadth first exploration and limited time/memory.

In the other hand, as MCTS explores only a random subset of all possible paths, its evaluation of a node is only an approximation of its real value, and by choosing randomly MCTS takes the risk of missing an optimal path. Completely randomly choosing the path to explore then appears suboptimal, and the UCB1 formula (\(max\left(\frac{w_i}{n_i}+c\sqrt{\frac{ln(N_i)}{n_i}}\right)\)) was developped to improve on that point by recommending which next node to explore (MCTS+UCB1 is referred to as UCT). Unfortunately, it's only a recommendation and still doesn't guarantee it won't miss the optimal path.

Also, UCB1 can only be used for evaluation functions with a binary output (win/loss), making UCT less general than MCTS or MM. MCTS doesn't suffer from that limitation, but may not lead to the optimal answer. Even if it has explored the whole tree, due to the averaging of leaves value the optimal solution may be hidden within a subtree full of low value leaves. In that case MCTS would prefer in the early nodes a branch leading to a subtree with lower maximal value but higher average one.

The two problems above make me think that, except if there is no evalutation function on intermediate nodes and so no choice but to avoid MM, MCTS or UCT are much less appealing than MM.

Alpha-Beta Pruning

The above methods all give instructions on how to explore the tree and how to choose a branch from a node to maximise the outcome. However they are all eventually equivalent to a brute force method: given enough resources they will explore the whole tree. UCT tries to improve on that point by spending more time on subtrees that look the more promising, but it would be better if it was possible to completely discard subtrees which are certain to contain only suboptimal solutions. This is the idea behind the Alpha-Beta pruning algorithm (ABP).

In the MiniMax algorithm, one can observe that, in the context of a multi-players sequential game, as each player chooses the branch leading to the best node from its point of view, exploration of branches to other nodes can be discarded as soon as we know they won't lead to a better solution. To ensure the value of a node, MM is modified from breadth first to depth first, and the pruning of not-yet computed branches from \(N_i\) occurs as soon as \(V^*_{P_{i-1}}(N_{i-1})>V^*_{P_{i-1}}(N_i)\) where \(N_{i-1}\) is the parent node of \(N_i\), \(P_i\) is the active player at node \(N_i\), and \(V^*_P(N)\) is the best branch value at node \(N\) from the point of view of player \(P\).

In theory, ABP indeed prunes a large part of the tree, ensuring to return the same optimal path as MM in a much faster time. However, the depth first exploration is generally incompatible with resource constraints: only a part of the tree will ever be explored. In the worst case only one branch of nodes near the root of the tree are explored as the exploration ends deep in their subtree before coming back to them. In the context of a game that would be equivalent of always considering only one single next move. Surely not the right strategy to win. For that reason it is suggested to shuffle the branches duing exploration.

So in practice we must limit the exploration to a carefully choosen depth ensuring it completes within the resources limit. This requires to be able to evaluate all nodes (not only end ones), and then comes the risk of pruning accidentally optimal branches due to imperfect mid-game evaluation.

Choosing a maximum depth also has a big disadvantage: some part of the tree may have higher branching factor than others. Without a maximum depth, MM always goes as deep as it can within the resources limit. Setting a static maximum depth would cause early stop of the exploration even if it could have explored further.

Here again, I personally find difficult to believe that ABP could outperform MM, except in situations where resources constraint is not a problem.

MiniMax siblings pruning

Of course, pruning to improve MM is extremely appealing. Unsatisfied by ABP I've decided to make my own, which I'll call 'siblings' pruning. When a node is visited during exploration its childs are added to the list of nodes to explore only under the condition their value is within a threshold from the value of the best one among them. The threshold is controlled by the user.

When a node is first visited, the pruning occurs based on local information. As the exploration goes deeper into the tree and the values backpropagate, when a node is visited again the pruning occurs based more and more on deep, reliable, values. For a very high threshold value, it behaves as standard MM. For a very low threshold value, it explores only the paths which are the best according to the information available so far. In between, the lower the threshold value the more pruning, hence the deeper the exploration can go in the same amount of time, but the higher the chance to ignore an optimal path hidden behind a seemingly "bad" position in middle game. The higher the threshold value the less pruning, hence less deep exploration in the same amount of time, but lower chance of missing an optimal path. Automatic search for the optimal threshold value can easily be implemented.

Results below will show it is indeed an efficient way of pruning.

MiniFrame

All the methods introduced above can be used to solve combinatorial optimisation problems as long as they can be represented as trees. More precisely, in term of implementation, they can be used if the user can provide:

a definition of data representing one node in the tree
a definition of data representing one transition from one node to one of its childs
a function creating the root node
a function computing the list of possible transitions from a node
a function computing the child node resulting from a transition from its parent node
a function checking if a node is a leaf
a function evaluating a node

It is then possible to create a reusable framework implementing the algorithms introduced above, ready to use by a user who only need to implement the two data structures and five functions above. MiniFrame is a framework written in C doing just that. In addition, it also provides tools useful to implement the user functions, debug them and search for the best parameters. You can download it here (see at the end of the article for old versions).

I describe below how to use it and how it performs on concrete examples.

Tic-tac-toe

Tic-tac-toe is the most simple example possible, so let's start with that. I guess everyone knows this game so I won't explain the rules here (if you need a refresher the Wikipedia article is here).

Some vocabulary: in MiniFrame a node is called a "world", a transition between two nodes is called an "action", and entities at the origin of the actions (in a game, the players) are called the "actors". We first need to tell to MiniFrame how many actors there are, and how many actions there can be from any given world.

#define MF_NB_MAX_POSSIBLE_ACTION 9
#define MF_NB_ACTOR 2

For performance optimisation, MiniFrame uses its own memory management functions and a memory pool to avoid dynamic allocation during execution. Hence the necessity to indicate beforehand the maximum number of actions per world.

The data structures are super simple: a grid of 9 by 9 cells. The cell value encodes an empty cell (-1), a cross (0) or a circle (1). An action is represented with the row and column, and a value (same as cell).

struct MiniFrameAction {
  MiniFrameActionFields;
  int8_t row;
  int8_t col;
  int8_t val;
};

struct MiniFrameWorld {
  MiniFrameWorldFields;
  int8_t vals[9];
};

MiniFrameActionFields and MiniFrameWorldFields are macros to automatically add internal fields used by MiniFrame.

The five functions (plus one to commonalise code) implementing the problem to solve (here, the game) are as follow.

void MiniFrameWorldSetToInitialWorld(
  MiniFrameWorld* const world,
            void* const userData) {
  if(userData) loop(i, 9) world->vals[i] = ((int8_t*)userData)[i];
  else loop(i, 9) world->vals[i] = -1;
  loop(i, 2) world->aliveActors[i] = true;
}

void MiniFrameWorldCreateActions(MiniFrameWorld* const world) {
  int nbActions = 0;
  loop(row, 3) loop(col, 3) {
    if(world->vals[row * 3 + col] == -1) {
      world->actions[nbActions].row = row;
      world->actions[nbActions].col = col;
      world->actions[nbActions].val = world->iSente;
      ++nbActions;
    }
  }
  world->nbActions = nbActions;
}

void MiniFrameWorldApplyAction(
         MiniFrameWorld* const world,
  MiniFrameAction const* const action) {
  world->vals[action->row * 3 + action->col] = action->val;
}

bool IsWinner(
  MiniFrameWorld const* const world,
                    int const iActor) {
  if(
    world->vals[0] == iActor &&
    world->vals[0] == world->vals[1] &&
    world->vals[0] == world->vals[2]
  ) return true;
  if(
    world->vals[3] == iActor &&
    world->vals[3] == world->vals[4] &&
    world->vals[3] == world->vals[5]
  ) return true;
  if(
    world->vals[6] == iActor &&
    world->vals[6] == world->vals[7] &&
    world->vals[6] == world->vals[8]
  ) return true;
  if(
    world->vals[0] == iActor &&
    world->vals[0] == world->vals[3] &&
    world->vals[0] == world->vals[6]
  ) return true;
  if(
    world->vals[1] == iActor &&
    world->vals[1] == world->vals[4] &&
    world->vals[1] == world->vals[7]
  ) return true;
  if(
    world->vals[2] == iActor &&
    world->vals[2] == world->vals[5] &&
    world->vals[2] == world->vals[8]
  ) return true;
  if(
    world->vals[0] == iActor &&
    world->vals[0] == world->vals[4] &&
    world->vals[0] == world->vals[8]
  ) return true;
  if(
    world->vals[2] == iActor &&
    world->vals[2] == world->vals[4] &&
    world->vals[2] == world->vals[6]
  ) return true;
  return false;
}

bool MiniFrameWorldIsEnd(MiniFrameWorld* const world) {
  if(IsWinner(world, 0) || IsWinner(world, 1)) return true;
  bool flagEmptyCell = false;
  loop(i, 9) if(world->vals[i] == -1) flagEmptyCell = true;
  if(flagEmptyCell == false) return true;
  return false;
}

void MiniFrameWorldEvaluateScores(MiniFrameWorld* const world) {
  loop(i, 2) world->scores[i] = 0.0;
  loop(iActor, 2) if(IsWinner(world, iActor)) world->scores[iActor] = 1.0;
}

In MiniFrameWorldSetToInitialWorld, the second argument userData is used to pass optional data for initialisation (cf MiniFrameWorldCreateInitData below). If null, default values are used. The aliveActors array indicates which actors are 'alive'. Here both players are playing until the end of the game, but on 3+ players games, one player may be eliminated in mid-game and only the remaining players keep playing. That array allows to handle such cases. Also, expert readers will surely see many things that could be optimised. The point here is just to give a simple example.

In addition to these structures and functions, MiniFrame actually needs five more functions:

a function comparing if two worlds are to be considered the same (used when looking an instance of world into the tree)
a function copying the user defined properties of a world into another
a function printing a world on a stream in a human readable format
a function printing an action on a stream in a human readable format
a function to create initialisation data used by MiniFrameWorldSetToInitialWorld. In this simple example, it's always an empty grid, but this function becomes useful in more complex games/problems to automatically create random intialisation state used by the other tools (cf below).

bool MiniFrameWorldIsSame(
  MiniFrameWorld const* const worldA,
  MiniFrameWorld const* const worldB) {
  loop(i, 9) if(worldA->vals[i] != worldB->vals[i]) return false;
  return true;
}

void MiniFrameWorldCopy(
        MiniFrameWorld* const worldA,
  MiniFrameWorld const* const worldB) {
  loop(i, 9) worldA->vals[i] = worldB->vals[i];
}

void MiniFrameWorldPrint(
  MiniFrameWorld const* const world,
  FILE* const stream) {
  loop(row, 3) {
    loop(col, 3) {
      if(world->vals[row * 3 + col] == -1) fprintf(stream, ".");
      if(world->vals[row * 3 + col] == 0) fprintf(stream, "X");
      if(world->vals[row * 3 + col] == 1) fprintf(stream, "O");
    }
    if(row < 2) fprintf(stream, "\n");
  }
}

void MiniFrameActionPrint(
  MiniFrameAction const* const action,
                   FILE* const stream) {
  fprintf(stream, "%d,%d %c",
    action->col, action->row, action->val == 0 ? 'X' : 'O');
}

void MiniFrameWorldCreateInitData(
   void** const initData,
  size_t* const sizeData,
    void* const userData) {
  (void)userData;
  *sizeData = sizeof(int8_t) * 9;
  *initData = malloc(*sizeData);
  assert(*initData != NULL);
  loop(i, 9) ((int8_t*)(*initData))[i] = -1;
}

MiniFrame usage

Then, one can use MiniFrame in several different ways. They almost all have in common the creation of a MiniFrame instance, setting its parameter, using it, and finally freeing it.

int main(){
  MiniFrame* mf = MiniFrameAlloc(MiniFrameGetNbWorldForMemSize(1024 * 1024));
  MiniFrameSetInitialWorld(mf, NULL);
  MiniFrameSetExplorationType(mf, MFExplore_minimax);
  MiniFrameSetExplorationDepth(mf, 10);

  // ... cf code in sections below ...

  MiniFrameFree(&mf);
  return EXIT_SUCCESS;
}

As explained above, MiniFrame uses a memory pool for better performance, so one must provide the number of world instances MiniFrame will have to work with. Of course the more the better, it just depends on the resource constraints of the environment MiniFrame is running in. The helper function MiniFrameGetNbWorldForMemSize calculates the number of world instances fitting in a given amount of memory, such as the user can specifiy either the available number of worlds either the available amount of memory in bytes.

Next, the initial world and the algorithm used (MM, MCTS, ABP) are set. Here as we used MM the depth of exploration is also set. For this simple example, we know there can't be more than 9 moves, so setting the max depth to 10 ensures MiniFrame will explore down to the end (in the limit of available time and memory). Note that the maximum exploration depth is relative to the current world of course (at current step 0 it would explore up to step 9, at current step 1 it would explore up to step 10, and so on...).

An executable binary can be compiled as follow. The source file of MiniFrame uses a macro (MINIFRAME_WORLD_IMPL) defined in the compilation options, which points to the source file implementing the game/problem. That source file is included in MiniFrame and the whole code is compiled as one unit. A Makefile rule would look as follow.

tictactoe: /path/to/tictactoe.c
  gcc -o tictactoe /path/to/miniframe.c \
  -DMINIFRAME_WORLD_IMPL=/relative/path/to/tictactoe.c/from/miniframe.c

It's also possible to have several problems solved using MiniFrame at once in the same project. Instead of compiling an executable, compile an object file (-c titactoe.o instead of -o tictactoe) and define in tictactoe.c some exposed functions instead of the main() for the project to interact with the tic-tac-toe solver. Then use the exposed functions and link the object file in the project as needed.

One game simulation

Coming back to the part left undefined in the main(), let see how to run one game simulation using MiniFrame. The framework has two modes of exploration: in the main process, or in background. MiniFrame is indeed multithreaded. The main thread interacts with the user, another thread performs the exploration, and another thread manages the memory. The goal is of course to improve performance in term of execution speed.

If you're wondering why not more threads, well I'd like to, but if you're used to work with multithreading you surely know how much of a headache it quickly becomes. So, moving slowly but surely, I improved from my old CodinGame framework's single thread to just 3 for now. Another motivation (or de-motivation ?) is that using more threads doesn't help on the CodinGame platform.

Back to the code, one can run one game simulation as follow (in background):

  MiniFrameWorld actualWorld = MiniFrameGetCurrentWorld(mf);
  MiniFrameRunBackground(mf);
  while(MiniFrameWorldIsEnd(&actualWorld) == false) {
    MiniFrameSleepMs(10);
    MiniFrameAction action = {0};
    MiniFrameGetBestAction(mf, &action);
    printf("Selected move:\n");
    MiniFrameActionPrint(&action, stdout);
    printf("\n");
    MiniFrameWorldApplyAction(&actualWorld, &action);
    printf("Current world:\n");
    MiniFrameWorldPrint(&actualWorld, stdout);
    printf("\n");
    actualWorld.step += 1;
    actualWorld.iSente = MiniFrameWorldGetNextSente(&actualWorld);
    MiniFrameSetActualWorld(mf, &actualWorld);
  }

One local copy of the current world is kept in memory. On more complex example it would be updated by the main process with external input (like the opponent actual moves). MiniFrameRunBackground(mf) starts the exploration thread in background. No need to worry about the memory management thread, MiniFrame does it for you. Then we loop until the end of the simulated game, wait some time for the exploration thread to do its job, get the computed best action, apply to the local copy of the world and display it, step the local world and inform MiniFrame of the new current actual world. An example output looks like this:

Selected move:
0,0 X
Current world:
X..
...
...
Selected move:
1,1 O
Current world:
X..
.O.
...
Selected move:
0,1 X
Current world:
X..
XO.
...
Selected move:
0,2 O
Current world:
X..
XO.
O..
Selected move:
2,0 X
Current world:
X.X
XO.
O..
Selected move:
1,0 O
Current world:
XOX
XO.
O..
Selected move:
1,2 X
Current world:
XOX
XO.
OX.
Selected move:
2,1 O
Current world:
XOX
XOO
OX.
Selected move:
2,2 X
Current world:
XOX
XOO
OXX

As one would expect, playing perfecty against itself leads to a tie !

Running the same simulation but in the main thread would look like this:

  MiniFrameWorld actualWorld = MiniFrameGetCurrentWorld(mf);
  while(MiniFrameWorldIsEnd(&actualWorld) == false) {
    MiniFrameRunOneStep(mf);
    MiniFrameAction action = {0};
    MiniFrameGetBestAction(mf, &action);
    printf("Selected move:\n");
    MiniFrameActionPrint(&action, stdout);
    printf("\n");
    MiniFrameWorldApplyAction(&actualWorld, &action);
    printf("Current world:\n");
    MiniFrameWorldPrint(&actualWorld, stdout);
    printf("\n");
    actualWorld.step += 1;
    world.iSente = MiniFrameWorldGetNextSente(&world);
    MiniFrameSetActualWorld(mf, &actualWorld);
  }

MiniFrameRunOneStep(mf) is blocking and computes as many worlds as it can in the limit of the available memory. A time limit can also be set with MiniFrameSetMaxExploreTimeMs(mf, timeMs). Note that if no time limit is set MiniFrameRunOneStep(mf) performs only one pass on the tree. That's probably not what you want, in particular when using MCTS.

Debug mode

Even on an example as simple as tic-tac-toe I've been able to mess up the world implementation and not get it right from the first compilation, genius! Once the world implementation is burried into the MiniFrame framework, debugging what's wrong is really challenging. A debugging tool is a must, and of course I provide one. It allows the user to manually run the exploration process and check the exploration tree.

The undefined portion of the main file becomes in that case just one line:

  MiniFrameConsole(mf);

This starts an interactive console which can be used as follow. Available commands are:

'm' or 'mem': print informations related to memory
'w' or 'world': print informations related to the current world
'r' or 'run': execute one step of exploration. Display the average time used to explore one world.
's' or 'step n': step the current world through its n-th action, or the best action if n=='b'
'b' or 'back': step back the current world to its parent (if any)
'a' or 'actualise': set the current world as the actual current world
You can chain commands like 'r s 0 w'

An example of use on tic-tac-toe would look like this.

-- MiniFrame console --
'h' or 'help' to list commands
'q' or 'quit' to terminate the console
> mem
1048400 bytes, 5242 worlds, 200 bytes per world
5241 free worlds, 1 used worlds, 0 explored worlds, +00 explored steps
> world
step: 0, sente: 0, nb visits: 0
scores: [+000.000, +000.000]
...
...
...
> run mem
Explored 5241 worlds in 1ms
Average time per world: 0.000191ms
1048400 bytes, 5242 worlds, 200 bytes per world
0 free worlds, 5242 used worlds, 5241 explored worlds, +05 explored steps
> world
step: 0, sente: 0, nb visits: 0
scores: [-000.000, -000.000]
...
...
...
-- action #0 (0 visits) best! :
0,0 X
scores: [-000.000, -000.000]
-- action #1 (0 visits) :
1,0 X
scores: [-000.000, -000.000]
-- action #2 (0 visits) :
2,0 X
scores: [-000.000, -000.000]
-- action #3 (0 visits) :
0,1 X
scores: [-000.000, -000.000]
-- action #4 (0 visits) :
1,1 X
scores: [-000.000, -000.000]
-- action #5 (0 visits) :
2,1 X
scores: [-000.000, -000.000]
-- action #6 (0 visits) :
0,2 X
scores: [-000.000, -000.000]
-- action #7 (0 visits) :
1,2 X
scores: [-000.000, -000.000]
-- action #8 (0 visits) :
2,2 X
scores: [-000.000, -000.000]
> step b world
step through:
0,0 X
step: 1, sente: 1, nb visits: 0
scores: [-000.000, -000.000]
X..
...
...
-- action #0 (0 visits) :
1,0 O
scores: [-000.000, -000.000]
-- action #1 (0 visits) best! :
2,0 O
scores: [-000.000, -000.000]
-- action #2 (0 visits) :
0,1 O
scores: [-000.000, -000.000]
-- action #3 (0 visits) :
1,1 O
scores: [-000.000, -000.000]
-- action #4 (0 visits) :
2,1 O
scores: [-000.000, -000.000]
-- action #5 (0 visits) :
0,2 O
scores: [-000.000, -000.000]
-- action #6 (0 visits) :
1,2 O
scores: [-000.000, -000.000]
-- action #7 (0 visits) :
2,2 O
scores: [-000.000, -000.000]
> back world
step: 0, sente: 0, nb visits: 0
scores: [-000.000, -000.000]
...
...
...
-- action #0 (0 visits) best! :
0,0 X
scores: [-000.000, -000.000]
-- action #1 (0 visits) :
1,0 X
scores: [-000.000, -000.000]
-- action #2 (0 visits) :
2,0 X
scores: [-000.000, -000.000]
-- action #3 (0 visits) :
0,1 X
scores: [-000.000, -000.000]
-- action #4 (0 visits) :
1,1 X
scores: [-000.000, -000.000]
-- action #5 (0 visits) :
2,1 X
scores: [-000.000, -000.000]
-- action #6 (0 visits) :
0,2 X
scores: [-000.000, -000.000]
-- action #7 (0 visits) :
1,2 X
scores: [-000.000, -000.000]
-- action #8 (0 visits) :
2,2 X
scores: [-000.000, -000.000]
> step 1 world
step through:
1,0 X
step: 1, sente: 1, nb visits: 0
scores: [-000.000, -000.000]
.X.
...
...
-- action #0 (0 visits) :
0,0 O
scores: [-000.000, -000.000]
-- action #1 (0 visits) :
2,0 O
scores: [-000.000, -000.000]
-- action #2 (0 visits) :
0,1 O
scores: [-000.000, -000.000]
-- action #3 (0 visits) :
1,1 O
scores: [-000.000, -000.000]
-- action #4 (0 visits) best! :
2,1 O
scores: [-000.000, -000.000]
-- action #5 (0 visits) :
0,2 O
scores: [-000.000, -000.000]
-- action #6 (0 visits) :
1,2 O
scores: [-000.000, -000.000]
-- action #7 (0 visits) :
2,2 O
scores: [-000.000, -000.000]
> actualise run mem
Explored 4841 worlds in 3ms
Average time per world: 0.000620ms
1048400 bytes, 5242 worlds, 200 bytes per world
0 free worlds, 5242 used worlds, 10082 explored worlds, +06 explored steps
> quit

The user first checks the memory information, then the initial world. He/she runs one step of exploration and checks the updated memory information. Next he/she checks that actions have been created on the current initial world, and moves to the next world according to the current best action. Next he/she comes back to the previous world, move to the next world through another action, and set that new world as the actual current world before running one more step of exploration. Finally he/she displays again the information about memory.

Thanks to this debugging tool, one can quickly check that the initial world, calculated actions and result of actions, etc... are as expected. It is also helpful when designing the evaluation function: one can see how the exploration tree varies according to it, and explore the tree to better understand why an action ends up being the best.

Qualification and server mode

From my experience, implementing the world is the easy part of the process. Even when it isn't straight forward or well defined, tree exploration is robust enough to provide 'ok' solution even with a clumsy implementation. For example, in a tournament of CodinGame, the referee will send you the current actual game at each turn, hence the eventual mistakes get corrected. The exploration will be incorrect, but if the implementation is not completely meaningless, it will in general lead to an acceptable next action.

Of course, the more exact your implementation is, the more accurate the exploration will be, and the better next action you can expect. But anyway, you'll never have enough time and memory to calculate 'the' best action. It's a good example of "don't care if it's the best, care if it's good enough", which is a general truth on a real world battleground.

Rather, the design of a good evaluation function is where the headaches and sleepless nights await you. Unclear and contradicting goals are common, in a game for example you often need to balance aggressive and defensive moves. And in practice, tweaking 'just a little' your evaluation function generally leads to completely different behaviour for your AI, which is very confusing and need careful examination with the debugging tool. At the same time you generally want the evaluation function to be as fast as possible, while using as few memory as possible. It's really the place where one can shine.

How to evaluate wether one evaluation function leads to better results than another is also a big matter of concern. The only reasonable way is to run many times your different implementations against each other, and have a way to evaluate the results. The first problem here is time. At beginning, there will be clear improvement on the evaluation function and few runs should give clear results. But as you approach a good solution, improvements will be more and more subtle, and you'll need more and more runs to get an idea of which version is the best. More runs equal more time, which you generally do not have.

The second problem is how to evaluate the results. Here, my go-to has always been the ELO ranking system. It's easy to implement and use, supports any number of players (including one: run two single player games and choose the winner as the one with higher score) and give a clear and reliable ranking. To make the evaluation function design task easier, MiniFrame provides a qualification mode which automatically evaluates several implementations using ELO ranking.

The qualification mode uses another mode of execution: the server mode. In that mode, MiniFrame runs as a TCP/IP server, with the exploration process in background, and communicates with an external client to update the actual current world and get the best action. In the context of a game with two players using that server mode, there would be two servers (one for each player) and one client (acting as the referee between the players).

One can simply starts MiniFrame in server mode with the following code (still as the undefined portion of the main() introduced earlier):

  MiniFrameSetServerVerbose(mf, false);
  int ret = MiniFrameRunServer(mf, filenameServerAddr);
  if(ret != 0) {
    printf("The server returned with error code %d \n", ret);
  }

The server automatically chooses an ip and port, which are written into the file located at filenameServerAddr for a client application to retrieve. In the context of MiniFrame it's fine to use the same file for all server, there is no risk of conflicts by overwriting. MiniFrameSetServerVerbose(mf, false) allows to turn on/off the verbose mode for the server. A client interacts with the server as follow:

int main() {
  ip = ...
  port = ...
  MiniFrameServerConnection* conn = MiniFrameServerConnectionAlloc(ip, port);
  void* initData = NULL;
  size_t sizeInitData = 0;
  userData = ...
  MiniFrameWorldCreateInitData(&initData, &sizeInitData, userData);
  int ret = MiniFrameServerInit(conn, initData, sizeInitData);
  if(ret < 0) { printf("failed to initialise\n"); exit(1); }
  MiniFrameWorld world = {0};
  MiniFrameWorldSetToInitialWorld(&world, initData);
  while(MiniFrameWorldIsEnd(&world) == false) {
    ret = MiniFrameServerActualiseWorld(conn, &world);
    if(ret < 0) { printf("failed to actualise\n"); exit(1); }
    MiniFrameAction action = {0};
    ret = MiniFrameServerGetAction(conn, &action);
    if(ret < 0) { printf("failed to get the next action\n"); exit(1); }
    MiniFrameWorldApplyAction(&world, &action);
    world.step += 1;
    world.iSente = MiniFrameWorldGetNextSente(&world);
  }
  ret = MiniFrameServerTerminate(conn);
  if(ret < 0) { printf("failed to terminate\n"); exit(1); }
  MiniFrameServerConnectionFree(conn);
  return 0;
}

The ip and port are retrieved from the file the server has updated, and used to start a connection between the client and server(s) (only one server in that example but there would be one per player in a multiplayer game). Initialisation data are created using MiniFrameWorldCreateInitData(...) and sent to the server(s) with MiniFrameServerInit(...).

Then the simulation runs in a loop until the current world becomes an end world. At each step (game's turn) the current world is actualised on the server (the one of the current turn's player if multiple players) using MiniFrameServerActualiseWorld(conn, &world). The next action is requested with MiniFrameServerGetAction(conn, &action) and the current world is updated accordingly. Here the referee may actually update the current world in a differet manner than what the server thought. It would be the case for example for incomplete information games.

Running MiniFrame in server mode also allows to interface it with any code able to communicate through TCP/IP. It opens the door to many various uses of MiniFrame by other softwares.

Back to the qualification mode. First we need several instances to be qualified. I like to do it using compilation arguments and macros and keeping everything in one single source file for the world implementation. For example, in tic-tac-toe two different evaluation functions would be defined like this (the second one being wrong on purpose):

void MiniFrameWorldEvaluateScores(MiniFrameWorld* const world) {
  loop(i, 2) world->scores[i] = 0.0;
#if VER==1
  loop(iActor, 2) if(IsWinner(world, iActor)) world->scores[iActor] = 1.0;
#elif VER==2
  loop(iActor, 2) if(IsWinner(world, iActor)) world->scores[iActor] = -1.0;
#endif
}

Obviously, the user is not limited to the evaluation function. The type of search, or MiniFrame's parameters, or even world implementation could also vary according to VER. Only the two data structures defining the world and action must be identical between all versions under qualification (for the referee to have a common communication pattern with all servers).

Then the two binaries, one per version could be compiled like this:

tictactoe_version_1: /path/to/tictactoe.c
  gcc -o tictactoe /path/to/miniframe.c \
  -DMINIFRAME_WORLD_IMPL=/relative/path/to/tictactoe.c/from/miniframe.c \
  -DVER=1

tictactoe_version_2: /path/to/tictactoe.c
  gcc -o tictactoe /path/to/miniframe.c \
  -DMINIFRAME_WORLD_IMPL=/relative/path/to/tictactoe.c/from/miniframe.c \
  -DVER=2

To be used with the qualification mode, each version's binary should be running in server mode (cf details above). The binary for the qualification itself looks like this:

int main() {
  char const* pathInstances[2] =
    {"tictactoe_version_1", "tictactoe_version_2"};
  MiniFrameQualifier qualif = {
    .nbRound = 3,
    .nbInstance = 2,
    .pathInstances = pathInstances,
    .filenameServerAddr = filenameServerAddr,
    .timeStepMs = 150,
    .userData = NULL,
    .thresholdScore = -1.0,
  };
  MiniFrameQualificationResult* res = MiniFrameQualification(&qualif);
  if(res) MiniFrameQualificationResultFree(&res);
  return 0;
}

A MiniFrameQualifier is created to set the qualification parameters, and the qualification itself is done with MiniFrameQualification(). The qualification output its result on stdout, but also returns it as a MiniFrameQualificationResult structure for the user to use it in anyway he/she likes. The parameters of the qualification are: the number of rounds (one round consists of one game for each combination of actors), the number of qualified instances and the path to the binay of each instance, the path to the file used to exchange the server ip and port, the 'thinking' time between each step, optional user data (argument of MiniFrameWorldCreateInitData()).

The thresholdScore parameter is used for early termination of the qualification. If negative, the qualification runs for the given number of rounds. If positive, the qualification early quits if the best instance's score is larger than the second best's score plus the threshold. This allows to save time when one version clearly outperfoms the others and no more rounds are necessary to know that it is the best one.

Running the qualification gives the following output, where we clearly see the version with the correct evaluation function winning over the one with the wrong evaluation function.

Qualifying pool:
  #00: tictactoe_version_1
  #01: tictactoe_version_2
  ==== ROUND 001/003 ====
#01 vs #00, result: 0.000000 / 1.000000
#00 vs #01, result: 1.000000 / 0.000000
  ==== RANKING ====
n.1: +007.9 #00
n.2: -007.9 #01
  ==== ROUND 002/003 ====
#01 vs #00, result: 0.000000 / 1.000000
#00 vs #01, result: 1.000000 / 0.000000
  ==== RANKING ====
n.1: +015.5 #00
n.2: -015.5 #01
  ==== ROUND 003/003 ====
#01 vs #00, result: 0.000000 / 1.000000
#00 vs #01, result: 1.000000 / 0.000000
  ==== RANKING ====
n.1: +022.7 #00
n.2: -022.7 #01

Parameters search mode

The qualification mode introduced above is already a precious tool to design a good evaluation function, but MiniFrame offers even more: a parameter automatic search mode. The evaluation formula is almost always subject to some parameters (weight, heat map, threshold, ...). Looking for their optimal value is a tedious task, which can be automated. In MiniFrame I reuse the qualification mode and differential evolution algorithm to provide a parameters search mode. A binary running in server mode and accepting the parameters value in argument is prepared as follow:

float evalParam = 1.0;
...
void MiniFrameWorldEvaluateScores(MiniFrameWorld* const world) {
  loop(i, 2) world->scores[i] = 0.0;
  loop(iActor, 2) if(IsWinner(world, iActor)) world->scores[iActor] = evalParam;
}
...
int main(int argc, char** argv){
  ...
  evalParam = atof(argv[1]);
  MiniFrameSetServerVerbose(mf, false);
  int ret = MiniFrameRunServer(mf, filenameServerAddr);
  if(ret != 0) {
    printf("The server returned with error code %d \n", ret);
  }
  ...
}

Then the source for the binary for the search is as follow:

int main(){
  srandom(time(NULL));
  #define NbHyperParam 1
  MiniFrameHyperParamSearcher search = {
    .nbEpoch = 3,
    .nbHyperParam = NbHyperParam,
    .seedHyperParams = (float[NbHyperParam]){1.0},
    .hyperParamMins = (float[NbHyperParam]){-1.0},
    .hyperParamMaxs = (float[NbHyperParam]){1.0},
    .pathInstance = "./tictactoe_search",
    .nbRound = 3,
    .nbInstance = 3,
    .timeStepMs = 150,
    .userData = NULL,
    .thresholdScore = 40.0,
    .filenameServerAddr = filenameServerAddr,
  };
  float bestHyperParams[NbHyperParam];
  MiniFrameHyperParamSearch(&search, bestHyperParams);
  return 0;
}

nbEpoch is the number of epochs in the differential evolution algorithm. nbHyperParam is the number of searched parameters. hyperParamMins and hyperParamMaxs define the range of possible value for each searched parameters. seedHyperParams lets the user provides a value for each parameters for the first instance of the first epoch (useful to force a starting point in the differential evolution algorithm). pathInstance is the path to the binary accepting the parameters in argument (cf above). nbInstance is the number of instance in the differential evoluation algorithm. Other parameters are the same as for qualification mode. In this example settings are pretty low, in reality you would use much larger number of instances, rounds and epochs. But be aware that it very quickly explodes in term of execution time.

Running the automatic search on the simple tic-tac-toe example gives the expected results, any positive value is the best value:

 ==== EPOCH 001 / 003 ====
Qualifying pool:
  #00: ./tictactoe_search 1.000000 
  #01: ./tictactoe_search 0.717583 
  #02: ./tictactoe_search -0.832308 
  ==== ROUND 001/003 ====
#01 vs #00, result: 0.000000 / 0.000000
#02 vs #00, result: 0.000000 / 1.000000
#00 vs #01, result: 0.000000 / 0.000000
#02 vs #01, result: 0.000000 / 1.000000
#00 vs #02, result: 1.000000 / 0.000000
#01 vs #02, result: 1.000000 / 0.000000
  ==== RANKING ====
n.1: +007.8 #01
n.2: +007.8 #00
n.3: -015.6 #02
  ==== ROUND 002/003 ====
#00 vs #01, result: 0.000000 / 0.000000
#02 vs #01, result: 0.000000 / 1.000000
#01 vs #00, result: 0.000000 / 0.000000
#02 vs #00, result: 0.000000 / 1.000000
#01 vs #02, result: 1.000000 / 0.000000
#00 vs #02, result: 1.000000 / 0.000000
  ==== RANKING ====
n.1: +015.1 #00
n.2: +015.1 #01
n.3: -030.2 #02
  ==== ROUND 003/003 ====
#01 vs #00, result: 0.000000 / 0.000000
#02 vs #00, result: 0.000000 / 1.000000
#00 vs #01, result: 0.000000 / 0.000000
#02 vs #01, result: 0.000000 / 1.000000
#00 vs #02, result: 1.000000 / 0.000000
#01 vs #02, result: 1.000000 / 0.000000
  ==== RANKING ====
n.1: +021.9 #01
n.2: +021.9 #00
n.3: -043.8 #02
Best hyper parameters:
0.717583
  ==== EPOCH 002 / 003 ====
Qualifying pool:
  #00: ./tictactoe_search 1.000000 
  #01: ./tictactoe_search 0.717583 
  #02: ./tictactoe_search -0.995994 
  ==== ROUND 001/003 ====
#01 vs #00, result: 0.000000 / 0.000000
#02 vs #00, result: 0.000000 / 1.000000
#00 vs #01, result: 0.000000 / 0.000000
#02 vs #01, result: 0.000000 / 1.000000
#00 vs #02, result: 1.000000 / 0.000000
#01 vs #02, result: 1.000000 / 0.000000
  ==== RANKING ====
n.1: +007.8 #01
n.2: +007.8 #00
n.3: -015.6 #02
  ==== ROUND 002/003 ====
#00 vs #01, result: 0.000000 / 0.000000
#02 vs #01, result: 0.000000 / 1.000000
#01 vs #00, result: 0.000000 / 0.000000
#02 vs #00, result: 0.000000 / 1.000000
#01 vs #02, result: 1.000000 / 0.000000
#00 vs #02, result: 1.000000 / 0.000000
  ==== RANKING ====
n.1: +015.1 #00
n.2: +015.1 #01
n.3: -030.2 #02
  ==== ROUND 003/003 ====
#01 vs #00, result: 0.000000 / 0.000000
#02 vs #00, result: 0.000000 / 1.000000
#00 vs #01, result: 0.000000 / 0.000000
#02 vs #01, result: 0.000000 / 1.000000
#00 vs #02, result: 1.000000 / 0.000000
#01 vs #02, result: 1.000000 / 0.000000
  ==== RANKING ====
n.1: +021.9 #01
n.2: +021.9 #00
n.3: -043.8 #02
Best hyper parameters:
0.717583
  ==== EPOCH 003 / 003 ====
Qualifying pool:
  #00: ./tictactoe_search 1.000000 
  #01: ./tictactoe_search 0.717583 
  #02: ./tictactoe_search -0.859990 
  ==== ROUND 001/003 ====
#01 vs #00, result: 0.000000 / 0.000000
#02 vs #00, result: 0.000000 / 1.000000
#00 vs #01, result: 0.000000 / 0.000000
#02 vs #01, result: 0.000000 / 1.000000
#00 vs #02, result: 1.000000 / 0.000000
#01 vs #02, result: 1.000000 / 0.000000
  ==== RANKING ====
n.1: +007.8 #01
n.2: +007.8 #00
n.3: -015.6 #02
  ==== ROUND 002/003 ====
#00 vs #01, result: 0.000000 / 0.000000
#02 vs #01, result: 0.000000 / 1.000000
#01 vs #00, result: 0.000000 / 0.000000
#02 vs #00, result: 0.000000 / 1.000000
#01 vs #02, result: 1.000000 / 0.000000
#00 vs #02, result: 1.000000 / 0.000000
  ==== RANKING ====
n.1: +015.1 #00
n.2: +015.1 #01
n.3: -030.2 #02
  ==== ROUND 003/003 ====
#01 vs #00, result: 0.000000 / 0.000000
#02 vs #00, result: 0.000000 / 1.000000
#00 vs #01, result: 0.000000 / 0.000000
#02 vs #01, result: 0.000000 / 1.000000
#00 vs #02, result: 1.000000 / 0.000000
#01 vs #02, result: 1.000000 / 0.000000
  ==== RANKING ====
n.1: +021.9 #01
n.2: +021.9 #00
n.3: -043.8 #02
Best hyper parameters:
0.717583

Mancala and Reversi

Tic-tac-toe is a good test case thanks to its simplicity, but is not really useful or representative of the performance of MiniFrame. Lets see two more interesting examples: Mancala and Reversi. If you don't know these games, I let you refer to, respectively, this article and this article.

Mancala has a very clear evaluation function: simply the current number of captured stones. It's a very good candidate for a concrete use of MiniMax. Indeed I've already implemented a web app a long time ago allowing a human to play against a bot using MM. It predates MiniFrame, but still a good representative of what can be done. It's available here. A minimalist version using MiniFrame is also available here.

Implementing Mancala in MiniFrame can be done as follow:

#define MF_NB_MAX_POSSIBLE_ACTION 6
#define MF_NB_ACTOR 2

struct MiniFrameAction {
  MiniFrameActionFields;
  int idx;
};

struct MiniFrameWorld {
  MiniFrameWorldFields;
  uint8_t vals[12];
  uint8_t captures[2];
};

void MiniFrameWorldSetToInitialWorld(
  MiniFrameWorld* const that,
            void* const userData) {
  int* vals = userData;
  loop(i, 12) that->vals[i] = vals[i];
  loop(i, 2) that->aliveActors[i] = true;
}

bool MiniFrameWorldIsEnd(MiniFrameWorld* const that) {
  loop(i, 2) if(that->captures[i] > 24) return true;
  if(that->captures[0] == 24 && that->captures[1] == 24) return true;
  return false;
}

void MiniFrameWorldCreateActions(MiniFrameWorld* const that) {
  int nbActions = 0;
  loop(idx, 6) if(nbActions < MF_NB_MAX_POSSIBLE_ACTION) {
    if(that->vals[idx + that->iSente * 6] > 0) {
      that->actions[nbActions].idx = idx + that->iSente * 6;
      ++nbActions;
    }
  }
  if(nbActions == 0) {
    that->actions[nbActions].idx = -1;
    ++nbActions;
  }
  that->nbActions = nbActions;
}

void MiniFrameWorldApplyAction(
         MiniFrameWorld* const world,
  MiniFrameAction const* const action) {
  if(action->idx != -1) {
    int n = world->vals[action->idx];
    world->vals[action->idx] = 0;
    int j = 0;
    loop(i, n) {
      j = (action->idx + i) % 12;
      if(action->idx == j) n++;
      else world->vals[j] += 1;
    }
    while(
      j >= 0 && (j / 6) == world->iSente &&
      (world->vals[j] == 2 || world->vals[j] == 3)
    ) {
      world->captures[1 - world->iSente] += world->vals[j];
      world->vals[j] = 0;
      j--;
    }
    n = 0;
    loop(i, 6) n += world->vals[i];
    if(n == 0) {
      if(world->iSente == 1) {
        world->captures[0] = 48;
        world->captures[1] = 0;
      } else {
        world->captures[0] = 0;
        world->captures[1] = 48;
      }
    }
    n = 0;
    loop(i, 6) n += world->vals[6 + i];
    if(n == 0) {
      if(world->iSente == 1) {
        world->captures[0] = 48;
        world->captures[1] = 0;
      } else {
        world->captures[0] = 0;
        world->captures[1] = 48;
      }
    }
  }
}

void MiniFrameWorldCopy(
        MiniFrameWorld* const worldA,
  MiniFrameWorld const* const worldB) {
  loop(i, 12) worldA->vals[i] = worldB->vals[i];
  loop(i, 2) worldA->scores[i] = worldB->scores[i];
  loop(i, 2) worldA->captures[i] = worldB->captures[i];
}

void MiniFrameWorldEvaluateScores(MiniFrameWorld* const that) {
  loop(i, 2) that->scores[i] = that->captures[i];
}

bool MiniFrameWorldIsSame(
  MiniFrameWorld const* const worldA,
  MiniFrameWorld const* const worldB) {
  loop(i, 12) if(worldA->vals[i] != worldB->vals[i]) return false;
  loop(i, 2) if(worldA->scores[i] != worldB->scores[i]) return false;
  return true;
}

void MiniFrameWorldPrint(
  MiniFrameWorld const* const that,
  FILE* const stream) {
  loop(i, 6) fprintf(stream, "%d ", that->vals[5 - i]);
  fprintf(stream, "\n");
  loop(i, 6) fprintf(stream, "%d ", that->vals[6 + i]);
  fprintf(stream, "\n");
  fprintf(stream, "captures: (%d, %d)", that->captures[0], that->captures[1]);
}

void MiniFrameActionPrint(
  MiniFrameAction const* const that,
  FILE* const stream) {
  fprintf(stream, "idx %d", that->idx);
}

void MiniFrameWorldCreateInitData(
   void** const initData,
  size_t* const sizeData,
    void* const userData) {
  (void)userData;
  *sizeData = sizeof(int) * 12;
  *initData = malloc(*sizeData);
  assert(*initData != NULL);
  loop(i, 12) ((int*)(*initData))[i] = 4;
}

int main() {
  int initialGrid[12] = {4,4,4,4,4,4,4,4,4,4,4,4};
  MiniFrame* mf =
    MiniFrameAlloc(MiniFrameGetNbWorldForMemSize(1024 * 1024 * 10));
  MiniFrameSetInitialWorld(mf, initialGrid);
  MiniFrameSetExplorationType(mf, MFExplore_minimax);
  MiniFrameSetExplorationDepth(mf, 100);

  MiniFrameWorld actualWorld = MiniFrameGetCurrentWorld(mf);
  MiniFrameRunBackground(mf);
  while(MiniFrameWorldIsEnd(&actualWorld) == false) {
    MiniFrameSleepMs(10);
    MiniFrameAction action = {0};
    MiniFrameGetBestAction(mf, &action);
    printf("Selected move:\n");
    MiniFrameActionPrint(&action, stdout);
    printf("\n");
    MiniFrameWorldApplyAction(&actualWorld, &action);
    printf("Current world:\n");
    MiniFrameWorldPrint(&actualWorld, stdout);
    printf("\n");
    actualWorld.step += 1;
    actualWorld.iSente = MiniFrameWorldGetNextSente(&actualWorld);
    MiniFrameSetActualWorld(mf, &actualWorld);
  }
  MiniFrameFree(&mf);
  return 0;
}

Reversi is an even more interesting candidate because it's still easy to implement, but a good evaluation function is extremely difficult to find. As the coins can flip until the last move, there is no obvious way to know which player is leading during the game. A version developped with MiniFrame is available to play online here (excuse the extreme lack of effort put into the interface, the point here is just to prove it's working).

As a tournament of Reversi is available on CodinGame (here), it also allows me to evaluate the performance of a solution based on MiniFrame. At the time of writing my solution ranks 51st among 474 entries. Not at the top of the world but good enough for me to believe MiniFrame runs as expected and is performant. The difference with stronger solution lies probably in the evaluation function and optimisation of the world implementation. To avoid spoiling the tournament I don't share the code of my implementation here.

Sudoku

As I've written at the beginning of this article, this framework can be applied to solve any combinatorial optimisation problem that can be represented as a tree. To illustrate this, I've used MiniFrame to create a Sudoku solver. At the same time it also shows that it can be used when there is only one 'player'.

In that case, one node (or 'world') represents a grid, totally or partially filled, with any value in [1,9] in the non-empty cells. One transition (or 'action') consists of filling one empty cell with a value. The initial world is the initial sudoku grid. A grid with no empty cells is an end world. The evaluation function is simply the number of non-empty cell.

The creation of possible actions from a given world is where the magic happens. A dummy solution would be to consider all combinations (empty cell, value in [1,9]). This would lead to an enormous amount of actions for starting grids. The trick here is to do what I call 'pruning by design'. Instead of all combinations, search one cell where there is only one value possible. If such (cell, value) exists, set it as the only possible action from that world. If no such (cell, value) exists, consider pairs (cell, value) which are valids. This dramatically reduces the branching factor, with a single action per world most of the time. MiniFrame in MiniMax mode (without pruning) and 10Mb of memory was enough to solve all the grids I have tried (well under one second).

This solver is also available as a web app here, and the code looks like this:

#define MF_NB_MAX_POSSIBLE_ACTION 10
#define MF_NB_ACTOR 1

struct MiniFrameAction {
  MiniFrameActionFields;
  int row;
  int col;
  int val;
};

struct MiniFrameWorld {
  MiniFrameWorldFields;
  unsigned char vals[81];
};

void MiniFrameWorldSetToInitialWorld(
  MiniFrameWorld* const that,
            void* const userData) {
  int* vals = userData;
  loop(i, 81) that->vals[i] = vals[i];
  loop(i, MF_NB_ACTOR) that->aliveActors[i] = true;
}

int CountEmptyCell(MiniFrameWorld const* const that) {
  int nb = 0;
  loop(i, 81) if(that->vals[i] == 0) ++nb;
  return nb;
}

bool MiniFrameWorldIsEnd(MiniFrameWorld* const that) {
  return CountEmptyCell(that) == 0;
}

bool IsValid(
  MiniFrameWorld* const that,
              int const row,
              int const col,
              int const val) {
  if(that->vals[row * 9 + col] != 0) return false;
  loop(i, 9) if(that->vals[i * 9 + col] == val) return false;
  loop(i, 9) if(that->vals[row * 9 + i] == val) return false;
  int rowCell = (row / 3) * 3;
  int colCell = (col / 3) * 3;
  loop(i, 3) loop(j, 3) {
    if(that->vals[(rowCell + i) * 9 + (colCell + j)] == val) return false;
  }
  return true;
}

bool IsSure(
  MiniFrameWorld* const that,
              int const row,
              int const col,
              int const val) {
  bool isTheOnlyValid = true;
  loop(i, 9) if(i + 1 != val && IsValid(that, row, col, i + 1)) {
    isTheOnlyValid = false;
  }
  if(isTheOnlyValid) return true;
  bool isTheOnlyPossiblePlace = true;
  loop(i, 9) if(i != row && IsValid(that, i, col, val)) {
    isTheOnlyPossiblePlace = false;
  }
  if(isTheOnlyPossiblePlace) return true;
  isTheOnlyPossiblePlace = true;
  loop(i, 9) if(i != col && IsValid(that, row, i, val)) {
    isTheOnlyPossiblePlace = false;
  }
  if(isTheOnlyPossiblePlace) return true;
  return false;
}

void MiniFrameWorldCreateActions(MiniFrameWorld* const that) {
  int nbActions = 0;
  // Search one sure move
  int val = 1;
  while(nbActions == 0 && val < 10) {
    loop(row, 9) loop(col, 9) {
      if(
        nbActions == 0 &&
        IsValid(that, row, col, val) &&
        IsSure(that, row, col, val)) {
        that->actions[nbActions].row = row;
        that->actions[nbActions].col = col;
        that->actions[nbActions].val = val;
        ++nbActions;
      }
    }
    ++val;
  }
  // If there is no sure move, explore possible moves
  val = 1;
  while(nbActions == 0 && val < 10) {
    loop(row, 9) loop(col, 9) {
      if(nbActions < MF_NB_MAX_POSSIBLE_ACTION && IsValid(that, row, col, val)) {
        that->actions[nbActions].row = row;
        that->actions[nbActions].col = col;
        that->actions[nbActions].val = val;
        ++nbActions;
      }
    }
    ++val;
  }
  if(nbActions == 0) {
    that->actions[nbActions].row = -1;
    ++nbActions;
  }
  that->nbActions = nbActions;
}

void MiniFrameWorldEvaluateScores(MiniFrameWorld* const that) {
  int nbEmptyCell = CountEmptyCell(that);
  if(nbEmptyCell == 0) {
    that->scores[0] = 100.0;
  } else if(that->nbActions == 0) {
    that->scores[0] = -100.0;
  } else {
    that->scores[0] = 0.0;
  }
}

bool MiniFrameWorldIsSame(
  MiniFrameWorld const* const worldA,
  MiniFrameWorld const* const worldB) {
  loop(i, 81) if(worldA->vals[i] != worldB->vals[i]) return false;
  return true;
}

void MiniFrameWorldPrint(
  MiniFrameWorld const* const that,
  FILE* const stream) {
  loop(i, 9) {
    loop(j, 9) {
      int v = that->vals[i * 9 + j];
      if(v == 0) fprintf(stream, "."); else fprintf(stream, "%u", v);
    }
    if(i < 8) fprintf(stream, "\n");
  }
}

void MiniFrameActionPrint(
  MiniFrameAction const* const that,
  FILE* const stream) {
  fprintf(stream, "(%d,%d) %d", that->col + 1, that->row + 1, that->val);
}

void MiniFrameWorldApplyAction(
         MiniFrameWorld* const world,
  MiniFrameAction const* const action) {
  if(action->row != -1) {
    world->vals[action->row * 9 + action->col] = action->val;
  }
}

void MiniFrameWorldCopy(
        MiniFrameWorld* const worldA,
  MiniFrameWorld const* const worldB) {
  loop(i, 81) worldA->vals[i] = worldB->vals[i];
}

void MiniFrameWorldCreateInitData(
   void** const initData,
  size_t* const sizeData,
    void* const userData) {
  (void)userData;
  *sizeData = sizeof(int) * 81;
  *initData = malloc(*sizeData);
  loop(i, 81) ((int*)(*initData))[i] = 0;
}

int main(int argc, char** argv) {
  int initialGrid[81];
  loop(i, 81) initialGrid[i] = argv[1][i] - '0';
  MiniFrame* mf =
    MiniFrameAlloc(MiniFrameGetNbWorldForMemSize(1024 * 1024 * 10));
  MiniFrameSetInitialWorld(mf, initialGrid);
  MiniFrameSetExplorationDepth(mf, 100);
  MiniFrameSetExplorationType(mf, MFExplore_minimax);
  MiniFrameAction action;
  int step = 0;
  while(MiniFrameIsRunning(mf) && step < 100) {
    MiniFrameRunOneStep(mf);
    MiniFrameStepBestAction(mf, &action);
    if(action.row == -1) step = 100;
  }
  if(step < 100) {
    MiniFrameWorld world = MiniFrameGetCurrentWorld(mf);
    loop(i, 81) printf("%d", world.vals[i]);
  } else {
    printf("failed");
  }
  printf("\n");
  MiniFrameFree(&mf);
  return 0;
}

Note that I'm using here an helper function, MiniFrameStepBestAction(), which is equivalent to getting the best action and actualising the current world at once, convenient when using MiniFrame to solve that kind of problems.

Mars lander

Another claim I've made is that MiniFrame can be used under real time constraints. The CodinGame solution for Reversi already proves it: it must output its action within 150ms, which it does successfully. As one more example, I've choosen another problem available on CodinGame: the Mars lander. Here the goal is a simulated landing on the surface of the planet Mars. The problem is limited to 2D, but follows the physical reality of such an operation. The lander is subject to gravity, controls its attitude through change in orientation and thrust, and must satisfies constraints of position, speed and orientation for a successful landing, while using a limited amount of fuel.

Here, the 'world' is the status of the lander (position, speed, attitude, landscape), an 'action' is a command to activate the thruster with a given power, or to change orientation. To limit the number of possible actions, I limit an action to be either thrust either orientation (on CodinGame both are allowed simultaneously), the change of orientationto be \(\pm1\) degree, the thrust to be one of {0,4} when the lander is not above the landing area, and one of {0,2,4} when it is above it.

The evaluation function is divided into three, depending on where the lander is located relative to the landing area. It is crafted in a way to control the behaviour of the lander. Away from the landing area, it gets near to it as fast as possible while avoiding the landscape. High above the landing area it stabilises its horizontal speed and start the descent. Low above the landing area, it focus on a smooth landing within the allowed constraints.

Withing 100ms time constraint between each action and 10Mb of available memory, using MiniMax with siblings pruning allows the lander to plan its control several seconds ahead. It achieves 100% successful landings on 100 tests with random initial conditions (attitude and landscape). The animation below gives an illustration of the landing maneuvers. The red dot is the lander, the grey areay is the landscape profile (highest mountain is 484m for scale), the landing area is the flat area at the center of the screen.

Comparison MM, MCTS, ABP

Finally, here is a comparison of performances of MM (with/without siblings pruning), MCTS, and ABP on the Mancala and Reversi games. Using MiniFrame's qualification tool of course.

Comparison of MM without siblings pruning, and MM with siblings pruning for several threshold values on Reversi (150ms per step, 100Mb of memory, 10 rounds):

Results show that 1) there is an optimal value for the pruning threshold, 2) the siblings pruning with an appropriate threshold value outperforms MM without pruning.

I think the A shape of the results can be explained as follow: a too small threshold discards moves that are at short term bad but prove to be optimal choice at long term; a too large threshold is equivalent to no pruning at all, hence shallower exploration; intermediate thresholds are a good balance between keeping actions which are not optimal locally but may prove optimal later and pruning actions which looks locally really too bad and save exploration resources to go deeper in the tree. As the evaluation function is more accurate and the exploration resources increase, I expect the optimal threshold to move to the left (stronger pruning).

Comparison of ABP with various maximum depth of exploration, and MCTS on Reversi (150ms per step, 100Mb of memory, 10 rounds):

Results show that 1) the maximum depth has an influence on the performance, 2) better performances are obtained for a small maximum depth, 3) MCTS performs similarly to ABP with a smal maximum depth.

I think the results can be explained as follow: ABP depth 64 should be equivalent to MCTS, however my implementation of MCTS explores around 15% more worlds than ABP in the same time, hence the better rank of MCTS; as the maximum depth of ABP increases the rank decreases due to the increasing number of missed branches; the evaluation function is probably a good one, because evaluating everything at depth 4 is clearly better than evaluating depper and missing branches; MCTS as good as ABP depth 4 probably means that there is not a big difference overall between all these version.

Comparison of MM without siblings pruning, MM with siblings pruning (threshold 5), MCTS, and ABP (depth 4) on Reversi (150ms per step, 100Mb of memory):

Results show that 1) the type of algorithm significantly impacts the performance, 2) MM with pruning > MM without pruning > ABP > MCTS.

I think the results can be explained as follow. In the game of Othello the situation changes hugely until the last move of the game, it makes it very difficult to evaluate moves at long term and optimal moves are hidden in subtrees of bad moves. This makes a partial exploration of the tree particularly inefficient and explains MM > (MCTS and ABP). At the same time, we have to deal with the resources constraint. Pruning is necessary, and siblings pruning proves to be a very good way to do it.

Comparison of MM without siblings pruning, and MM with siblings pruning for several threshold values on Mancala (150ms per step, 100Mb of memory, 20 rounds):

Results show that 1) a strong pruning has a very negative impact on performance, 2) other pruning thresholds have no significant impact relative to each other and relative to no pruning.

In my implementation of Mancala, the evaluation function is simply the current score. It may not vary for several moves, and when it does it's in a much more quiet way than the one for Reversi. The exploration should go must deeper to notice that a locally bad move ends up being a good one. This explains why pruning here doesn't make a big difference, except if pruning so strongly than it shies away from risk, always waiting for killer moves that never come.

Comparison of ABP with various maximum depth of exploration, and MCTS on Mancala (150ms per step, 100Mb of memory, 20 rounds):

Results show that 1) there is an optimal maximum depth, 2) ABP at the appropriate depth outperform MCTS.

Again, MCTS being faster than ABP explains why it performs better than ABP depth 64. But contrarily to Reversi, the evaluation function is worst at predicting the outcome of a game, hence increasing a bit the maximum depth performs better than a minimal one, up to a point where the quantity of missed branches overcomes the advantage of a deeper exploration.

Comparison of MM without siblings pruning, MM with siblings pruning (threshold 20), MCTS, and ABP (depth 12) on Mancala (150ms per step, 100Mb of memory):

Results show that 1) the type of algorithm significantly impacts the performance, 2) MM with pruning > MM without pruning > ABP > MCTS.

Overall, results are similar to those obtained on Reversi.

Conclusion

In this article, I've introduced the MiniFrame framework implementing MiniMax, MonteCarlo Tree Search, and AlpaBeta Pruning. This framework allows to implement solvers in C programming language for combinatorial optimisation problems. The framework provides tools (debugger, evaluation function qualification, hyper parameters search, TCP/IP server) useful for the implementation of the problem, the design of the evaluation function, and the usage of the solver in various ways and context. It also provides an original pruning method for the MiniMax algorithm, which I've shown to perform better than other methods on two examples. The framework has been validated and qualified on several examples introduced here, and an online tournament on the CodinGame platform.

The code source of the current version of MiniFrame is available under GPL here. Any questions or comments are welcome by email.

List of versions:

v1.1.0: miniframe.1.1.0.tar.gz.
Refactor private MFGetBestAction() as public MiniFrameWorldGetBestAction(). Modify MiniFrameGetCurrentWorld() to include the non-user-defined properties in the returned copy of the current world.
Released on 2023/11/30.
v1.0.0: miniframe.1.0.0.tar.gz.
Initial version available publicly.
Released on 2023/11/12.

2023-11-12
in AI/ML, All, C programming,
42 views