An interesting little algorithm occasionally used in computer science is called the Burrows–Wheeler transform, an algorithm that reversibly permutes text strings. At first glance it might not seem to have much use, but it enabled a clever way of doing lossless data compression on text.
We can see how the algorithm works by performing it on the example string
The cat jumped from the mat into the hat. We start by placing a
$ at the beginning and end of the string respectively to help it be reconstructed later, replacing the spaces with
_ just to help with markup and display (not part of the algorithm), and then expanding it into a 2D array with each row being circularly shifted over from one row to the next:
^The_cat_jumped_from_the_mat_into_the_hat.$ $^The_cat_jumped_from_the_mat_into_the_hat. .$^The_cat_jumped_from_the_mat_into_the_hat t.$^The_cat_jumped_from_the_mat_into_the_ha at.$^The_cat_jumped_from_the_mat_into_the_h hat.$^The_cat_jumped_from_the_mat_into_the_ _hat.$^The_cat_jumped_from_the_mat_into_the e_hat.$^The_cat_jumped_from_the_mat_into_th he_hat.$^The_cat_jumped_from_the_mat_into_t the_hat.$^The_cat_jumped_from_the_mat_into_ _the_hat.$^The_cat_jumped_from_the_mat_into o_the_hat.$^The_cat_jumped_from_the_mat_int to_the_hat.$^The_cat_jumped_from_the_mat_in nto_the_hat.$^The_cat_jumped_from_the_mat_i into_the_hat.$^The_cat_jumped_from_the_mat_ _into_the_hat.$^The_cat_jumped_from_the_mat t_into_the_hat.$^The_cat_jumped_from_the_ma at_into_the_hat.$^The_cat_jumped_from_the_m mat_into_the_hat.$^The_cat_jumped_from_the_ _mat_into_the_hat.$^The_cat_jumped_from_the e_mat_into_the_hat.$^The_cat_jumped_from_th he_mat_into_the_hat.$^The_cat_jumped_from_t the_mat_into_the_hat.$^The_cat_jumped_from_ _the_mat_into_the_hat.$^The_cat_jumped_from m_the_mat_into_the_hat.$^The_cat_jumped_fro om_the_mat_into_the_hat.$^The_cat_jumped_fr rom_the_mat_into_the_hat.$^The_cat_jumped_f from_the_mat_into_the_hat.$^The_cat_jumped_ _from_the_mat_into_the_hat.$^The_cat_jumped d_from_the_mat_into_the_hat.$^The_cat_jumpe ed_from_the_mat_into_the_hat.$^The_cat_jump ped_from_the_mat_into_the_hat.$^The_cat_jum mped_from_the_mat_into_the_hat.$^The_cat_ju umped_from_the_mat_into_the_hat.$^The_cat_j jumped_from_the_mat_into_the_hat.$^The_cat_ _jumped_from_the_mat_into_the_hat.$^The_cat t_jumped_from_the_mat_into_the_hat.$^The_ca at_jumped_from_the_mat_into_the_hat.$^The_c cat_jumped_from_the_mat_into_the_hat.$^The_ _cat_jumped_from_the_mat_into_the_hat.$^The e_cat_jumped_from_the_mat_into_the_hat.$^Th he_cat_jumped_from_the_mat_into_the_hat.$^T The_cat_jumped_from_the_mat_into_the_hat.$^
Next we sort the array:
$^The_cat_jumped_from_the_mat_into_the_hat. .$^The_cat_jumped_from_the_mat_into_the_hat The_cat_jumped_from_the_mat_into_the_hat.$^ ^The_cat_jumped_from_the_mat_into_the_hat.$ _cat_jumped_from_the_mat_into_the_hat.$^The _from_the_mat_into_the_hat.$^The_cat_jumped _hat.$^The_cat_jumped_from_the_mat_into_the _into_the_hat.$^The_cat_jumped_from_the_mat _jumped_from_the_mat_into_the_hat.$^The_cat _mat_into_the_hat.$^The_cat_jumped_from_the _the_hat.$^The_cat_jumped_from_the_mat_into _the_mat_into_the_hat.$^The_cat_jumped_from at_into_the_hat.$^The_cat_jumped_from_the_m at_jumped_from_the_mat_into_the_hat.$^The_c at.$^The_cat_jumped_from_the_mat_into_the_h cat_jumped_from_the_mat_into_the_hat.$^The_ d_from_the_mat_into_the_hat.$^The_cat_jumpe e_cat_jumped_from_the_mat_into_the_hat.$^Th e_hat.$^The_cat_jumped_from_the_mat_into_th e_mat_into_the_hat.$^The_cat_jumped_from_th ed_from_the_mat_into_the_hat.$^The_cat_jump from_the_mat_into_the_hat.$^The_cat_jumped_ hat.$^The_cat_jumped_from_the_mat_into_the_ he_cat_jumped_from_the_mat_into_the_hat.$^T he_hat.$^The_cat_jumped_from_the_mat_into_t he_mat_into_the_hat.$^The_cat_jumped_from_t into_the_hat.$^The_cat_jumped_from_the_mat_ jumped_from_the_mat_into_the_hat.$^The_cat_ m_the_mat_into_the_hat.$^The_cat_jumped_fro mat_into_the_hat.$^The_cat_jumped_from_the_ mped_from_the_mat_into_the_hat.$^The_cat_ju nto_the_hat.$^The_cat_jumped_from_the_mat_i o_the_hat.$^The_cat_jumped_from_the_mat_int om_the_mat_into_the_hat.$^The_cat_jumped_fr ped_from_the_mat_into_the_hat.$^The_cat_jum rom_the_mat_into_the_hat.$^The_cat_jumped_f t_into_the_hat.$^The_cat_jumped_from_the_ma t_jumped_from_the_mat_into_the_hat.$^The_ca t.$^The_cat_jumped_from_the_mat_into_the_ha the_hat.$^The_cat_jumped_from_the_mat_into_ the_mat_into_the_hat.$^The_cat_jumped_from_ to_the_hat.$^The_cat_jumped_from_the_mat_in umped_from_the_mat_into_the_hat.$^The_cat_j
Finally we take the last column of each row, giving the string
In order for the algorithm to be useful we need to be able to reverse the transformation, and that proceeds by effectively reconstructing the tables from above. Since the final string we got had all the letters of the original – just permuted – sorting it gives the first column of the table:
$ . T ^ _ _ _ _ _ _ _ _ a a a c d e e e e f h h h h i j m m m n o o p r t t t t t t u
And combining the two strings gives every pair of letters in the original string:
.$ t. ^T $^ e_ d_ e_ t_ t_ e_ o_ m_ ma ca ha _c ed he he he pe _f _h Th th th _i _j om _m um in to ro mp fr at at at _t _t nt ju
The sorting and addition process is repeated until the entire string is reconstructed.
But what does this have to do with data compression? The transform itself doesn’t change the length of the string at all. You can start to see how by looking at the example string I gave. The three instances of the letter
h are right next to each other in the output string. To see why, look back at the first column. The first letter corresponding to the produced
h are all
h always came before
e in the string. The transform takes advantage of commonly used patterns within the text – like
he – and turns it into runs of repeated characters. It effectively makes the string into one that further compression algorithms could take better advantage of, as things like repeated characters are very easily compressed.
The algorithm is probably most famously used by
bzip2, the file format that was formerly widely used on Linux for data compression, before recently being replaced with the more advanced
xz. While it isn’t the current state-of-the-art in compression, it’s a superb example of an interesting little algorithm being used to great effect in software.