Burrows–Wheeler transform¶
The Burrows-Wheeler Transform (BWT) is a revolutionary algorithm in the field of data transformation and compression. Developed by Michael Burrows and David Wheeler in 1994, this algorithm reorganizes a string of characters in such a way that similar characters are grouped together, forming runs of the same character. This unique characteristic makes the BWT an invaluable tool in data compression, as it enhances the efficiency of subsequent compression algorithms by making the data more uniform and, thus, easier to compress.
Applications¶
The primary application of the BWT is in data compression. By transforming data into runs of similar characters, the BWT prepares the data for more efficient compression by other algorithms. It's often used as a pre-processing step in compression algorithms to increase their effectiveness. The transformation makes the data more uniform, which in turn makes it easier for compression algorithms to reduce the size of the data without losing any information.
In the realm of bioinformatics, the BWT has found significant applications, particularly in sequence alignment and genome assembly. The ability of the BWT to organize similar characters (or nucleotides, in the case of DNA/RNA) together makes it an excellent tool for identifying regions of similarity within long sequences of genomic data. This capability is crucial for aligning sequences to reference genomes, identifying genetic variations, and assembling short DNA sequences into longer ones.
Methodology¶
Rotation¶
The first step in the BWT is to construct a matrix that includes all possible rotations of the input string. To do this, we take the input string and rotate it one character at a time, prepending each rotation to a matrix.
Code
Here is an example for each stage of processing the input banana$
using the Burrows-Wheeler Transform (BWT).
Note
It's common to prepend a special character (like $
) to the end of the input string to signify the end of the string.
This character should be unique and lexicographically smaller than any other character in the string to ensure it sorts properly.
The matrix consisting of all possible rotations of the input string banana$
.
We have bolded all of the suffixes in the BWT.
banana$
anana$b
nana$ba
ana$ban
na$bana
a$banan
$banana
Each cyclical permutation shifts the characters of the string, effectively moving the last character to the front and sliding the rest one position to the right. This process preserves the "neighborhood" of characters, meaning that characters adjacent before the permutation remain close to each other in the permutations.
Sorting¶
After creating the matrix of all possible rotations, the next step is to sort these rotations lexicographically (i.e., in dictionary order). This step reorganizes the matrix into a more structured form that is essential for the next step of the transform.
The sorted rotations of the input string, lexicographically:
$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba
When these cyclical permutations are sorted lexicographically, patterns emerge. Characters frequently occurring in the original text tend to group in the matrix of sorted permutations. This grouping is especially pronounced for repeated patterns or sequences in the text, making them more apparent and compressible. For example, we see an and na patterns are present and sorted near each other.
Extraction¶
The final step in the Burrows-Wheeler Transform is to extract the last column of the sorted matrix. This column contains the transformed string, which tends to have runs of similar characters, making it more amenable to compression.
Code
The last column of this sorted matrix, which is the transformed string: annb$aa
Inversion¶
Inversion of the BWT is a crucial feature that distinguishes it from other data transformation techniques. It allows for the original document to be regenerated from its BWT representation, which is essentially the last column of a sorted list of all cyclic rotations of the document. This process is reversible due to the unique properties of the BWT and does not require the original document or any additional information beyond the BWT output and the position of the original string in the sorted list.
Start with the Last Column¶
Given the last column from the BWT output, the first task is to reconstruct the first column of the sorted rotations matrix. Since the last column contains all the characters of the original string, sorting these characters gives us the first column. The sorting needs to account for multiple occurrences of the same character by ensuring their relative order is preserved as in the last column.
-
(1) Last column
a
n
n
b
$
a
a
Reconstruct the First Column¶
Sort the characters in the last column alphabetically to obtain the first column of the matrix. This is possible because both columns contain the same set of characters, and sorting the last column's characters gives you the original order of characters as they appeared before the rotations were sorted.
-
(1) Last column—BWT output
a
n
n
b
$
a
a
-
(2) Last column sorted—first column
$
a
a
a
b
n
n
Pair Successive Characters¶
The next step involves pairing each character in the last column with the character in the same row of the first column. These pairs represent successive characters in the document, taking cyclically so that the last and first character form a pair. This cyclical pairing is a key aspect of the BWT's ability to preserve character sequences from the original document.
-
(2)
$
a
a
a
b
n
n
-
(3) Pair BWT output to the front
a$
na
na
ba
$b
an
an
Sort and Reconstruct Columns¶
By sorting these pairs, you start to reconstruct the document one column at a time. Each iteration adds another character to the reconstructed sequence, progressively building up the sorted rotations matrix. This iterative sorting and pairing process continues until the entire document is reconstructed.
-
(3)
a$
na
na
ba
$b
an
an
-
(4) Sort
$b
a$
an
an
ba
na
na
-
(5) prepend
a$b
na$
nan
ban
$ba
ana
ana
-
(6) Sort
$ba
a$b
ana
ana
ban
na$
nan
-
(7) prepend
a$ba
na$b
nana
bana
$ban
ana$
anan
-
(8) Sort
$ban
a$ba
ana$
anan
bana
na$b
nana
-
(9) prepend
a$ban
na$ba
nana$
banan
$bana
ana$b
anana
-
(10) Sort
$bana
a$ban
ana$b
anana
banan
na$ba
nana$
-
(11) prepend
a$bana
na$ban
nana$b
banana
$banan
ana$ba
anana$
-
(12) Sort
$banan
a$bana
ana$ba
anana$
banana
na$ban
nana$b
-
(13) prepend
a$banan
na$bana
nana$ba
banana$
$banana
ana$ban
anana$b
Identify the Original Text¶
The row that ends with the special "end of file" character (e.g., $
in our case) indicates the original document: banana$
.
Code
def invert_burrows_wheeler(last_column):
# Initialize a list to hold tuples of (character, index) for sorting
char_tuples = [(char, i) for i, char in enumerate(last_column)]
# Sort the tuples by character to simulate the first column
first_column_tuples = sorted(char_tuples)
# Reconstruct the document using a table of indices
text_length = len(last_column)
# Initialize the index for the row that starts with the EOF character (assuming it's at the end)
current_index = last_column.index('$')
original_text = [''] * text_length
for i in range(text_length):
char, next_index = first_column_tuples[current_index]
original_text[i] = char
current_index = next_index
# Return the reconstructed text as a string
return ''.join(original_text)