An interesting little algorithm occasionally used in computer science is called the Burrows–Wheeler transform, an algorithm that reversibly permutes text strings. At first glance it might not seem to have much use, but it enabled a clever way of doing lossless data compression on text.

We can see how the algorithm works by performing it on the example string The cat jumped from the mat into the hat. We start by placing a ^ and $ at the beginning and end of the string respectively to help it be reconstructed later, replacing the spaces with _ just to help with markup and display (not part of the algorithm), and then expanding it into a 2D array with each row being circularly shifted over from one row to the next:

^The_cat_jumped_from_the_mat_into_the_hat.$
$^The_cat_jumped_from_the_mat_into_the_hat.
.$^The_cat_jumped_from_the_mat_into_the_hat
t.$^The_cat_jumped_from_the_mat_into_the_ha
at.$^The_cat_jumped_from_the_mat_into_the_h
hat.$^The_cat_jumped_from_the_mat_into_the_
_hat.$^The_cat_jumped_from_the_mat_into_the
e_hat.$^The_cat_jumped_from_the_mat_into_th
he_hat.$^The_cat_jumped_from_the_mat_into_t
the_hat.$^The_cat_jumped_from_the_mat_into_
_the_hat.$^The_cat_jumped_from_the_mat_into
o_the_hat.$^The_cat_jumped_from_the_mat_int
to_the_hat.$^The_cat_jumped_from_the_mat_in
nto_the_hat.$^The_cat_jumped_from_the_mat_i
into_the_hat.$^The_cat_jumped_from_the_mat_
_into_the_hat.$^The_cat_jumped_from_the_mat
t_into_the_hat.$^The_cat_jumped_from_the_ma
at_into_the_hat.$^The_cat_jumped_from_the_m
mat_into_the_hat.$^The_cat_jumped_from_the_
_mat_into_the_hat.$^The_cat_jumped_from_the
e_mat_into_the_hat.$^The_cat_jumped_from_th
he_mat_into_the_hat.$^The_cat_jumped_from_t
the_mat_into_the_hat.$^The_cat_jumped_from_
_the_mat_into_the_hat.$^The_cat_jumped_from
m_the_mat_into_the_hat.$^The_cat_jumped_fro
om_the_mat_into_the_hat.$^The_cat_jumped_fr
rom_the_mat_into_the_hat.$^The_cat_jumped_f
from_the_mat_into_the_hat.$^The_cat_jumped_
_from_the_mat_into_the_hat.$^The_cat_jumped
d_from_the_mat_into_the_hat.$^The_cat_jumpe
ed_from_the_mat_into_the_hat.$^The_cat_jump
ped_from_the_mat_into_the_hat.$^The_cat_jum
mped_from_the_mat_into_the_hat.$^The_cat_ju
umped_from_the_mat_into_the_hat.$^The_cat_j
jumped_from_the_mat_into_the_hat.$^The_cat_
_jumped_from_the_mat_into_the_hat.$^The_cat
t_jumped_from_the_mat_into_the_hat.$^The_ca
at_jumped_from_the_mat_into_the_hat.$^The_c
cat_jumped_from_the_mat_into_the_hat.$^The_
_cat_jumped_from_the_mat_into_the_hat.$^The
e_cat_jumped_from_the_mat_into_the_hat.$^Th
he_cat_jumped_from_the_mat_into_the_hat.$^T
The_cat_jumped_from_the_mat_into_the_hat.$^

Next we sort the array:

$^The_cat_jumped_from_the_mat_into_the_hat.
.$^The_cat_jumped_from_the_mat_into_the_hat
The_cat_jumped_from_the_mat_into_the_hat.$^
^The_cat_jumped_from_the_mat_into_the_hat.$
_cat_jumped_from_the_mat_into_the_hat.$^The
_from_the_mat_into_the_hat.$^The_cat_jumped
_hat.$^The_cat_jumped_from_the_mat_into_the
_into_the_hat.$^The_cat_jumped_from_the_mat
_jumped_from_the_mat_into_the_hat.$^The_cat
_mat_into_the_hat.$^The_cat_jumped_from_the
_the_hat.$^The_cat_jumped_from_the_mat_into
_the_mat_into_the_hat.$^The_cat_jumped_from
at_into_the_hat.$^The_cat_jumped_from_the_m
at_jumped_from_the_mat_into_the_hat.$^The_c
at.$^The_cat_jumped_from_the_mat_into_the_h
cat_jumped_from_the_mat_into_the_hat.$^The_
d_from_the_mat_into_the_hat.$^The_cat_jumpe
e_cat_jumped_from_the_mat_into_the_hat.$^Th
e_hat.$^The_cat_jumped_from_the_mat_into_th
e_mat_into_the_hat.$^The_cat_jumped_from_th
ed_from_the_mat_into_the_hat.$^The_cat_jump
from_the_mat_into_the_hat.$^The_cat_jumped_
hat.$^The_cat_jumped_from_the_mat_into_the_
he_cat_jumped_from_the_mat_into_the_hat.$^T
he_hat.$^The_cat_jumped_from_the_mat_into_t
he_mat_into_the_hat.$^The_cat_jumped_from_t
into_the_hat.$^The_cat_jumped_from_the_mat_
jumped_from_the_mat_into_the_hat.$^The_cat_
m_the_mat_into_the_hat.$^The_cat_jumped_fro
mat_into_the_hat.$^The_cat_jumped_from_the_
mped_from_the_mat_into_the_hat.$^The_cat_ju
nto_the_hat.$^The_cat_jumped_from_the_mat_i
o_the_hat.$^The_cat_jumped_from_the_mat_int
om_the_mat_into_the_hat.$^The_cat_jumped_fr
ped_from_the_mat_into_the_hat.$^The_cat_jum
rom_the_mat_into_the_hat.$^The_cat_jumped_f
t_into_the_hat.$^The_cat_jumped_from_the_ma
t_jumped_from_the_mat_into_the_hat.$^The_ca
t.$^The_cat_jumped_from_the_mat_into_the_ha
the_hat.$^The_cat_jumped_from_the_mat_into_
the_mat_into_the_hat.$^The_cat_jumped_from_
to_the_hat.$^The_cat_jumped_from_the_mat_in
umped_from_the_mat_into_the_hat.$^The_cat_j

Finally we take the last column of each row, giving the string .t^$edetteommch_ehhhp__Ttt__o_uitrmfaaa__nj

In order for the algorithm to be useful we need to be able to reverse the transformation, and that proceeds by effectively reconstructing the tables from above. Since the final string we got had all the letters of the original – just permuted – sorting it gives the first column of the table:

$
.
T
^
_
_
_
_
_
_
_
_
a
a
a
c
d
e
e
e
e
f
h
h
h
h
i
j
m
m
m
n
o
o
p
r
t
t
t
t
t
t
u

And combining the two strings gives every pair of letters in the original string:

.$
t.
^T
$^
e_
d_
e_
t_
t_
e_
o_
m_
ma
ca
ha
_c
ed
he
he
he
pe
_f
_h
Th
th
th
_i
_j
om
_m
um
in
to
ro
mp
fr
at
at
at
_t
_t
nt
ju

The sorting and addition process is repeated until the entire string is reconstructed.

But what does this have to do with data compression? The transform itself doesn’t change the length of the string at all. You can start to see how by looking at the example string I gave. The three instances of the letter h are right next to each other in the output string. To see why, look back at the first column. The first letter corresponding to the produced h are all e, because h always came before e in the string. The transform takes advantage of commonly used patterns within the text – like he – and turns it into runs of repeated characters. It effectively makes the string into one that further compression algorithms could take better advantage of, as things like repeated characters are very easily compressed.

The algorithm is probably most famously used by bzip2, the file format that was formerly widely used on Linux for data compression, before recently being replaced with the more advanced xz. While it isn’t the current state-of-the-art in compression, it’s a superb example of an interesting little algorithm being used to great effect in software.