The Burrows-Wheeler transform

Last week, after looking another interesting video from Computerphile about the Burrows-Wheeler Transform I decided to add it to my personal library. The Wikipedia page explains it in details and provides a straightforward implementation. Unfortunately that implementation is completely useless due to the amount of time and memory it requires on any practical amount of data.

The article mentions the existence of clever ways to implement it, without giving any details about them. If the trick to get an efficient transform is actually quite obvious, the one for the untransform was a complete mystery to me. Links toward the original paper are all dead links, and searching the web for info wasn't quick and easy as most results simply reproduce the wikipedia implementation and call it a day. This video by Benjamin Langmead finally put me on track, kudos to him. Given that finding info wasn't immediate I thought it would be worth the time to write an article myself too, hoping it may help someone.

The Burrows-Wheeler Transform (BWT) is a reordering of a sequence of characters, with interesting properties. The algo was initially designed for string of characters but it really doesn't matter, it works exactly the same whatever the data, so I'll speak of bytes in general. However, to be efficient these data must have some kind of sequential correlation (a byte is more or less predictable from its previous byte). It's easy then to understand that texts are a well suited type of data. An example of use is to apply BWT as a preprocessing step of RLE to improve its compression rate.

The transform consists of taking all the rotations of the sequence to transform, sorting them, and returning the last element of each sorted rotation. Note that the initial order of rotations doesn't matter as they are later sorted. Note also that if some introduction of the transform imposed the sequence start and end to be marked in a way or the other, it isn't actually necessary. Example:

In the straightforward implementation you have to create a matrix of dimensions \(n\)x\(n\), where \(n\) is the size of the input. For a very large input it becomes quickly impossible. The efficient way to do it is to avoid the creation of that matrix and use instead an indirection to the first element in the sequence for each given rotation. What kind of indirection, it depends on what offers the language your using, address in low level languages or indices in higher languages. I'll use indices below as it's more general and doesn't affect performance. The pseudocode becomes as follow:

If the unoptimised version of the transform is a memory black hole, the one for the untransform is a CPU cycle Gargantua. It needs to sort \(n\) times the \(n\) bytes long array! Fortunately, the optimised version magically does the same thing in just O(n) time. All it takes to understand is one thing: after sorting the rotations, as the last element is the element before the first element the order of same value elements is the same for the first and last element. See in the table below, the three 'A' in the original data appears in the same order in both the first and last column.

As we also know that the first column is sorted, we can calculate the index of the row containing the previous element of the last element of any row with nothing but the last column. It is the number of elements with a value less than the last element in that row plus the index of that last element within the last column. For example, using the 2nd row, the last element is the second 'N' in the last column. Before 'N' there are 3 'A' and 1 'B', so the row whose last element is the element before that 'N' in the original data is the (3+1+2)=6th row, hence the 3rd 'A'.

We can thus jump between rows to move from one element to its predecessor in the original data, then the predecessor of the predecessor, and so on. All it takes to complete the untransform is to know where to start from, that's why the transform returns the index of the row corresponding to the original data. This will give us the last element, from which we can reconstruct the whole original data from end to beginning. In our little example:

The pseudo code is as follow:

That's all folks. Comments or questions are always welcome by email.