An algorithm contest was recently posted by Sacha Greif in the Osaka/Kyoto Web Designers and Developers Meetup community. The goal of the contest was to calculate possible grids for a word game idea Sacha had. The details are available here (in what follows I assume you've read them). In this article I present the solution I designed and implemented in C.
Dictionaries
The pool of words is provided as 3 dictionaries of the 2k, 5k and 10k most common English words. They are simple text files, formatted as follows:
Reading and parsing these files at run time is cumbersome. Instead, by changing just their first and last lines, we can bake them into the code with a simple #include.
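For illustration, assuming the lists have one quoted, comma-terminated word per line (as a JSON-style array does), the result can look like this; the file and variable names are mine:

```c
/* words_10k.txt: the contest's word list, with only its first and last
   lines edited by hand so that what remains is one quoted word per
   line, i.e. a valid C array initializer. */
static const char *Words[] = {
#include "words_10k.txt"
};
static const int NbWords = (int)(sizeof(Words) / sizeof(Words[0]));
```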
Tokens
The first step is to calculate the tokens from the dictionary. A first filter ignores "plural form" words and words whose length is incompatible with the prefix and suffix length requirements.
Tokens from valid words are obtained by calculating the range of possible cutting positions inside the word, looping over that range, and creating the prefix/suffix tokens from the parts before and after the cut.
To store the tokens I use two hash tables: one hashing on the prefix, the other hashing on the suffix. I also keep, for each token in one table, a reference to its sibling in the other table; the search algorithm will explain why. Pair and CreatePairs are completed accordingly.
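A hedged sketch of what this can look like, where the length constants are placeholders and NewPair, HashAdd, PrefixTable and SuffixTable are assumed helpers, not the actual code:

```c
#include <string.h>

#define PREFIX_LEN_MIN 2   /* placeholder: from the game's length rules */
#define SUFFIX_LEN_MIN 2   /* placeholder */
#define MAX_TOKEN_LEN 16

typedef struct Pair {
  char prefix[MAX_TOKEN_LEN + 1];
  char suffix[MAX_TOKEN_LEN + 1];
  struct Pair *sibling;    /* the same pair as stored in the other table */
} Pair;

/* Create all the prefix/suffix pairs of one valid word and add them to
   the two hash tables. */
static void CreatePairs(const char *word) {
  int len = (int)strlen(word);
  for (int cut = PREFIX_LEN_MIN; cut <= len - SUFFIX_LEN_MIN; ++cut) {
    Pair *p = NewPair();
    memcpy(p->prefix, word, (size_t)cut);
    p->prefix[cut] = '\0';
    strcpy(p->suffix, word + cut);
    Pair *s = NewPair();
    *s = *p;
    p->sibling = s;                      /* link the two copies */
    s->sibling = p;
    HashAdd(PrefixTable, p->prefix, p);  /* keyed on the prefix */
    HashAdd(SuffixTable, s->suffix, s);  /* keyed on the suffix */
  }
}
```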
Hashing
For hashing, I use the FNV-1a function. No particular reason: I just came across it recently and it's the one I had in mind at the time.
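For reference, the 32-bit variant with its standard constants:

```c
#include <stdint.h>

/* FNV-1a, 32-bit variant. */
static uint32_t Fnv1a(const char *key) {
  uint32_t h = 2166136261u;   /* offset basis */
  for (; *key != '\0'; ++key) {
    h ^= (uint8_t)*key;
    h *= 16777619u;           /* FNV prime */
  }
  return h;
}
```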
The hash table's size is calculated as a multiple of the number of threads (because, of course, the search will be multithreaded). There is no real need to do it that way, except that it simplifies my life when creating the threads. The actual numbers are chosen by feel.
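Concretely, something of this shape, where the actual values are placeholders of my choosing:

```c
#define NB_THREADS 8                      /* placeholder value */
#define HASH_SIZE (NB_THREADS * 1024)     /* a multiple of NB_THREADS */

/* Bucket index of a key. */
#define HASH_INDEX(key) (Fnv1a(key) % HASH_SIZE)
```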
While adding entries to the hash table, I also take care of keeping them sorted (on the key, within the same hash index). Again, the search algorithm will explain why.
Iteration on hash tables
I'll need to iterate over the entries of the hash tables, and I'll do it in three ways, which I'll call HashIterNext, HashIterSubNext and HashIterStep. The first loops over all entries. The second loops over the entries of the current key only. The third loops over all entries but jumps over entries sharing the same key. The last two take advantage of the entries having been sorted at token-creation time. The search algorithm will explain why I need these three ways of iterating.
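Here is a hedged sketch of the three iterators on a chained hash table whose buckets keep entries sorted by key; the types are a minimal reconstruction, not the actual code:

```c
#include <string.h>

typedef struct Entry {
  const char *key;        /* the token this entry hashes on */
  struct Entry *next;     /* next in the same bucket, sorted by key */
} Entry;

typedef struct { Entry **buckets; int size; } HashTable;
typedef struct { HashTable *table; int bucket; Entry *cur; } HashIter;
/* a fresh iterator starts with bucket = -1 and cur = NULL */

/* Loop over all entries. */
static Entry *HashIterNext(HashIter *it) {
  if (it->cur != NULL && it->cur->next != NULL)
    return (it->cur = it->cur->next);
  while (++it->bucket < it->table->size)
    if (it->table->buckets[it->bucket] != NULL)
      return (it->cur = it->table->buckets[it->bucket]);
  return (it->cur = NULL);
}

/* Loop over the entries of the current key only; thanks to the
   sorting, equal keys are adjacent within a bucket. */
static Entry *HashIterSubNext(HashIter *it) {
  Entry *next = (it->cur != NULL) ? it->cur->next : NULL;
  if (next != NULL && strcmp(next->key, it->cur->key) == 0)
    return (it->cur = next);
  return NULL;
}

/* Loop over all entries but jump over those sharing the current key. */
static Entry *HashIterStep(HashIter *it) {
  const char *key = (it->cur != NULL) ? it->cur->key : NULL;
  Entry *e;
  while ((e = HashIterNext(it)) != NULL)
    if (key == NULL || strcmp(e->key, key) != 0)
      return e;
  return NULL;
}
```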
Token pruning
We know from the game specification that a prefix token will only be used if it appears in at least two pairs, and a suffix token only if it appears in at least six. So we can prune from the dictionary all the tokens that appear in fewer pairs than that. We could actually prune at higher thresholds too: the fewer pairs a token appears in, the more "difficult" it is to use, in other words the less likely it is to appear in a solution. Since pruning more tokens speeds up the search (fewer combinations to check), this could be a way to optimize for speed: sacrifice the tokens which will probably not lead to a solution to save time. In the end I didn't prune beyond the obvious thresholds (2 and 6), as my solution was fast enough.
Here I've been lazy: instead of actually removing the tokens from the hash table, I just flag them as "pruned" and ignore them during the search. Really removing them would speed up the search a little more.
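A sketch of the flagging, reusing the iterators above and assuming Entry gained a bool pruned field that the search loops check and skip:

```c
#include <stdbool.h>

/* Flag every token appearing in fewer than minPairs pairs: 2 for the
   prefix table, 6 for the suffix table, per the game specification. */
static void PruneTokens(HashTable *table, int minPairs) {
  HashIter it = { table, -1, NULL };
  while (HashIterStep(&it) != NULL) {
    int nb = 1;
    HashIter sub = it;                    /* clone to count this key's pairs */
    while (HashIterSubNext(&sub) != NULL)
      ++nb;
    if (nb < minPairs) {
      sub = it;                           /* rewind and flag them all */
      do { sub.cur->pruned = true; } while (HashIterSubNext(&sub) != NULL);
    }
  }
}
```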
Search for the solutions
Now I need a representation of the board's edge graph and node types (prefix/suffix), which implies I also need to assign an index to each node. The choice of indices is also made so as to simplify things later.
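To give an idea, here is a sketch of such a representation. The node counts follow from the formulas below (nine prefix nodes, four suffix nodes), and the indices follow the "levels" described further down; the edge list itself depends on Sacha's board layout, so it is illustrative only:

```c
/* Indices: 0 for the center node, 1..4 for the suffix nodes, 5..12 for
   the outer prefix nodes. */
#define NB_NODES 13

typedef enum { NODE_PREFIX, NODE_SUFFIX } NodeType;

static const NodeType NodeTypes[NB_NODES] = {
  NODE_PREFIX,                                         /* level 0 */
  NODE_SUFFIX, NODE_SUFFIX, NODE_SUFFIX, NODE_SUFFIX,  /* level 1 */
  NODE_PREFIX, NODE_PREFIX, NODE_PREFIX, NODE_PREFIX,  /* level 2 */
  NODE_PREFIX, NODE_PREFIX, NODE_PREFIX, NODE_PREFIX,
};

/* Each edge links a prefix node to a suffix node (their tokens must
   form a valid word). Only the center's edges are shown; the rest
   mirror the board given in the contest. */
static const int Edges[][2] = {
  {0, 1}, {0, 2}, {0, 3}, {0, 4},
  /* ... edges between level-1 and level-2 nodes ... */
};
```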
I'm now ready to search for the solutions. The brute-force method would be to loop over all token combinations and check each for validity. That's \(\prod_{i=0}^{8}(n-i)\cdot\prod_{i=0}^{3}(m-i)\) combinations, where \(n\) is the number of prefix tokens and \(m\) the number of suffix tokens. With, say, a thousand prefix tokens, the first product alone is already on the order of \(10^{27}\): obviously way too much.
We can do much better: once a node is set, the possible values for all its neighbouring nodes reduce to the existing pairs containing that node's token. So, except for an initial node which must be checked against all possible tokens, we only need to loop over very small subsets for its neighbours, and by propagation we can fill in the whole board very quickly (or give up as soon as we find a node with no candidate). Hence the choice of two hash tables: they give quick access to any given token, and make it easy to switch between looping over its paired prefixes or suffixes.
For the initial node, we want one that filters out as many neighbours as possible, i.e. one with many edges. That would be one of the suffix nodes. However, given the graph of the board, choosing the center node instead makes the implementation much easier, so that's the one I chose as the initial node.
Propagation then occurs from the central node outward, hence the choice of node indices. To make things easier to follow, I split the nodes into "levels": the center node is "level 0", the surrounding suffix nodes are "level 1", and the remaining prefix nodes are "level 2".
In pseudo code, the search algorithm becomes roughly the following.
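This is a simplified sketch showing only the overall structure (the symmetry handling described below is elided; SolveLvl1 and SolveLvl2 are the function names used later):

```
Search:
    for each prefix token t0:                     # level 0: the center node
        set the center node to t0
        SolveLvl1(first level-1 node)
        unset the center node

SolveLvl1(node):
    for each unused pair whose prefix is the center token:
        set node to the pair's suffix, mark the pair used
        if node is the last level-1 node:
            SolveLvl2(first level-2 node)
        else:
            SolveLvl1(next level-1 node)
        mark the pair unused, unset node

SolveLvl2(node):
    for each unused pair whose suffix belongs to one of node's neighbours:
        if the pair's prefix also pairs with node's other neighbours:
            set node to the pair's prefix, mark the pairs used
            if node is the last level-2 node:
                add the board to the results
            else:
                SolveLvl2(next level-2 node)
            mark the pairs unused, unset node
```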
Used pairs
During the search I need to check whether a pair is already used in the board. To make this efficient I add one last flag to Pair, which is also the only property that needs to be managed per thread.
As I'm juggling between the two hash tables, I must not forget to update that flag in both of them. That's where the sibling property of Pair comes in handy.
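A sketch of what this amounts to, the per-thread array being my reading of "managed per thread":

```c
#include <stdbool.h>

typedef struct Pair {
  /* ...prefix, suffix and sibling as sketched earlier... */
  struct Pair *sibling;
  bool used[NB_THREADS];   /* one flag per thread, the only per-thread state */
} Pair;

/* Update the flag in both tables through the sibling link. */
static void SetPairUsed(Pair *pair, int iThread, bool used) {
  pair->used[iThread] = used;
  pair->sibling->used[iThread] = used;
}
```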
Symmetry
Besides pairs being used only once, another constraint needs checking: symmetry. Rotated or mirrored boards are obviously the same from the game's point of view and should be counted once in the result. They can easily be avoided by imposing \(t_2\gt t_1\), \(t_3\gt t_1\) and \(t_4\gt t_2\), where \(t_i\) is the iteration index of the token used by the \(i\)-th node. Note that we must still allow \(t_4\lt t_3\): it leads to different solutions for the level-2 nodes with 3 edges. In the implementation, instead of iterating over the whole range and checking indices, a more efficient way is to clone iterators appropriately so the search starts immediately from the right index.
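In code this can look like the following sketch, it1 being the iterator that produced \(t_1\):

```c
/* Start the search for t2 from a clone of t1's iterator instead of
   rescanning from the beginning and comparing indices. */
HashIter it2 = it1;                   /* copy t1's current position */
while (HashIterStep(&it2) != NULL) {
  /* every token visited here has an iteration index strictly greater
     than t1's, so t2 > t1 holds by construction */
}
```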
Multithreading
The algorithm above brings the computation time into a reachable range, on the order of \(ne^9(e-1)(e-2)^2\), where \(e\) is the average number of pairs per token. Multithreading will help a little more.
(Edit on 2023/11/27: correction of the formula for the computation time above).
There is an immediate way to multithread: in the algorithm above, once the initial node is set, the execution branches are completely independent. So I just need to split the set of possible prefixes between the threads and I'm good to go. Given that this set is stored as a hash table, splitting can be done by assigning ranges of hash indices. And since I took care to choose a hash table size divisible by the number of threads, the range calculation is straightforward.
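A sketch of the thread setup with pthreads; SearchThread stands for the search loop over the given bucket range and is not shown:

```c
#include <pthread.h>

typedef struct { int firstBucket; int lastBucket; int iThread; } ThreadArgs;

/* Runs the SolveLvl* recursion for the initial-node tokens found in
   buckets firstBucket..lastBucket (body not shown). */
static void *SearchThread(void *args);

static void RunSearch(void) {
  pthread_t threads[NB_THREADS];
  ThreadArgs args[NB_THREADS];
  int span = HASH_SIZE / NB_THREADS;   /* exact, by construction */
  for (int i = 0; i < NB_THREADS; ++i) {
    args[i] = (ThreadArgs){ i * span, (i + 1) * span - 1, i };
    pthread_create(threads + i, NULL, SearchThread, args + i);
  }
  for (int i = 0; i < NB_THREADS; ++i)
    pthread_join(threads[i], NULL);
}
```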
In terms of conflicting data access, there is only one point to take care of: updating the list of results. A mutex will manage that.
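That is, something like this, where Board and AppendToResults are illustrative names:

```c
#include <pthread.h>

static pthread_mutex_t ResultsMutex = PTHREAD_MUTEX_INITIALIZER;

/* Appending a solved board to the shared result list is the only
   write that needs protection. */
static void AddResult(const Board *board) {
  pthread_mutex_lock(&ResultsMutex);
  AppendToResults(board);
  pthread_mutex_unlock(&ResultsMutex);
}
```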
Early quit
What's above is sufficient for an implementation; however, the first results I obtained show that for a given quadruple of suffixes there exist, in general, many possible solutions. The game as described by Sacha uses these quadruples as starting states, and it's up to the player to find the prefixes. So what we are actually interested in is finding those quadruples, which means we can quit the search for a given quadruple early, as soon as we have found one solution for it.
In my algorithm it's easy to quit early in SolveLvl2 and come back to SolveLvl1, but at that point the center node is still set in the board. The algorithm is convenient enough as it is and I don't want to break everything, so instead I loop over the list of current results to see if the quadruple already exists, and only add the result if it doesn't. It's a pain to have to do this, especially because the probability for a quadruple to have several possible center nodes seems very low (in other words, that check is probably useless). But I'm too lazy to look for a better solution...
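The duplicate check boils down to something like this sketch (Result and its suffixes field are illustrative names):

```c
#include <stdbool.h>
#include <string.h>

/* True if a result with the same suffix quadruple was already found. */
static bool QuadrupleExists(const Result *results, int nbResults,
                            const char *const quad[4]) {
  for (int i = 0; i < nbResults; ++i) {
    int j = 0;
    while (j < 4 && strcmp(results[i].suffixes[j], quad[j]) == 0)
      ++j;
    if (j == 4)
      return true;
  }
  return false;
}
```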
Implementation
Finally, I have all I need and can implement the search.
Results
I've run the search for several suffix lengths on all dictionaries. The 2k dictionary doesn't produce any results, the 5k one very few, and the 10k one quite a lot. Increasing the suffix length significantly increases the search time without significantly increasing the number of solutions. And for a given quadruple of suffixes there are, in general, many solutions.
I've picked several solutions at random and checked them manually. I've also written a Python script to independently check that there are no duplicates in the results and that the returned boards are actual solutions. I've spent enough time on it to be sufficiently confident in the results to share my solution here, but I wouldn't bet my life on it. Use it at your own risk!
The C implementation and results are available for download here. Extract with tar xvf okwddm.tar.gz, and then refer to README.md.