FM-index¶
In the realm of computer science, the ability to efficiently search through vast amounts of text is a cornerstone of numerous applications, from the intricate analysis of genetic sequences to the retrieval of information in large databases. The FM-index stands as a pivotal innovation in this domain, offering a sophisticated yet practical solution to the challenge of text searching. Developed by Paolo Ferragina and Giovanni Manzini in 2000, this data structure has revolutionized the way we approach text indexing and searching by combining compression with search efficiency.
At its core, the FM-index is a compressed full-text substring index. It facilitates the searching of substrings within a larger text corpus with remarkable efficiency. The innovation of the FM-index lies not just in its search capabilities but also in its ability to compress the indexed text, thus conserving valuable storage space.
Last-to-first mapping¶
BWT is useful for compression since these runs are easier to compress.
The "Last to First" (LF) mapping is a crucial part of the BWT, especially when it comes to the inverse transformation, where it helps in reconstructing the original string from the transformed string.
For example, the BWT of abracadabra$
is shown below.
I F L
1 $ abracadabr a
2 a $abracadab r
3 a bra$abraca d
4 a bracadabra $
5 a cadabra$ab r
6 a dabra$abra c
7 b ra$abracad a
8 b racadabra$ a
9 c adabra$abr a
10 d abra$abrac a
11 r a$abracada b
12 r acadabra$a b
- I column: Shows the index of the sorted rotations of the original string.
- F column: The first column in the sorted list of all rotations of the original string. This column is important because it contains the characters of the original string sorted alphabetically.
- L column: The last column in the sorted list of all rotations of the original string. This is the actual output of the BWT.
The LF mapping is a way to navigate from a character in the last column (L) back to its corresponding character in the first column (F). This mapping is possible because the sorting step ensures that the cyclic permutations are in a lexicographically sorted order, which preserves the original order of characters that are identical. Thus, if you know the position of a character in the L column, you can find its original position in the F column.
Given the sorted rotations and the BWT result, we proceed as follows.
1. Identify the L (last) and F (first) columns from your sorted rotations.
Based on our BWT above, we have
- F column (first characters of each row, already sorted):
$aaaaabbcdrr
- L column (last characters of each row, the BWT of the string):
ard$rcaaaabb
2. Count occurrences in L column up to each character to compute the mapping to F.
To compute the LF mapping, we need to count how many times each character appears in L up to a given point. This count tells us the rank of each character in L, which corresponds directly to its original position in F because both columns are essentially different permutations of the same string with the same character frequencies.
First, let's number our F letters starting from zero and increasing by one each time a repeat of a previous letter occurs. For example, the first a will be a0, then the second time an a appears we will label it a1, etc.
F L
$ abracadabr a
a0 $abracadab r
a1 bra$abraca d
a2 bracadabra $
a3 cadabra$ab r
a4 dabra$abra c
b0 ra$abracad a
b1 racadabra$ a
c0 adabra$abr a
d0 abra$abrac a
r0 a$abracada b
r1 acadabra$a b
Now we will do our L letters.
F L
$ abracadabr a0
a0 $abracadab r0
a1 bra$abraca d0
a2 bracadabra $
a3 cadabra$ab r1
a4 dabra$abra c0
b0 ra$abracad a1
b1 racadabra$ a2
c0 adabra$abr a3
d0 abra$abrac a4
r0 a$abracada b0
r1 acadabra$a b1
The number attached to each letter is actually called the rank; the higher the number the higher the rank. Ranks communicate how many times that letter occurs in the original string. "a" in our case has the highest rank of 4, which means there are a total of five a's in our string.
Note
In the context of the BWT, the rank of a character within the L column does not simply indicate how many times the character appears, but rather its sequential position among identical characters within that column. This is a crucial point because the BWT, by design, groups similar characters together due to the sorting of all cyclic permutations of the original string. The rank tells us not just about the quantity of occurrences but about the order of each occurrence within the transformed string.
The power of this rank is that the F letter with the same rank as a L letter is the same letter from the original string. Don't believe me? Look at the r0 in the L column; if we wrap around and continue the cyclical permutation until we hit $ we see that it would be r0a$. Now, find r0 in the F column and continue until we hit $. We also get r0a$! Go ahead and try this for other letters—it will work every time.
How?
Well, this specifically has to do with the fact we add the $
to the end of our string and made sure it is lexicographically lower than any possible letter.
The right-context of a character is essentially the substring that follows it in a particular cyclic permutation of the original string.
Since the BWT sorts these permutations lexicographically, characters are effectively grouped by their right-context in the L column.
When we map a character from L back to F using its rank, we're leveraging the inherent organization of the BWT, where each character's position is intimately tied to its right-context.
This mapping allows us to trace each character's journey through the sorted permutations, from its position in the L column (where it ends a particular permutation) back to its position in the F column (where it starts another, lexicographically earlier, permutation).
Reversing the BWT¶
You can also use the LF mapping to reverse the BWT and get the original string.
F L
$ a0
a0 r0
a1 d0
a2 $
a3 r1
a4 c0
b0 a1
b1 a2
c0 a3
d0 a4
r0 b0
r1 b1
You start from the first row with $ and then move to the end and append the letter.
Original: $a0
Move to the row that starts with a0 and then add the L-column letter.
Original: $a0r0
Move to the row that starts with r0 and then add the L-column letter.
Original: $a0r0b0
Repeat this process until you reach $ in the L column.
Original: $a0r0b0a1d0 a4c0a3r1b1a2
Now, reverse this string
Reversed: a2b1r1a3c0a4d0a1b0r0a0$
and drop the ranks.
Reversed: abracadabra$
Searching¶
This LF mapping is also super helpful in quickly searching for patterns. For example, let's search our BWT for the string "bra" and copy our BWT below.
F L
$ abracadabr a0
a0 $abracadab r0
a1 bra$abraca d0
a2 bracadabra $
a3 cadabra$ab r1
a4 dabra$abra c0
b0 ra$abracad a1
b1 racadabra$ a2
c0 adabra$abr a3
d0 abra$abrac a4
r0 a$abracada b0
r1 acadabra$a b1
Similar to our reversing the BWT, we perform successive LF mapping but we first reverse our search string (i.e., arb).
Note
We actually did this in the previous section by starting from the first row that begins with $.
First, we find all rows that start with "a".
F L
$ abracadabr a0
a0 $abracadab r0
a1 bra$abraca d0
a2 bracadabra $
a3 cadabra$ab r1
a4 dabra$abra c0
b0 ra$abracad a1
b1 racadabra$ a2
c0 adabra$abr a3
d0 abra$abrac a4
r0 a$abracada b0
r1 acadabra$a b1
Now we eliminate all rows that do not end in "r" because remember the letter in F is the letter immediately preceding the L letter in the same row. Thus, any row with "a" in F and "r" in L represents "ra" in the original string.
F L
$ abracadabr a0
a0 $abracadab r0
a1 bra$abraca d0
a2 bracadabra $
a3 cadabra$ab r1
a4 dabra$abra c0
b0 ra$abracad a1
b1 racadabra$ a2
c0 adabra$abr a3
d0 abra$abrac a4
r0 a$abracada b0
r1 acadabra$a b1
Now we find the rows with r0 and r1 in the F column.
F L
$ abracadabr a0
a0 $abracadab r0
a1 bra$abraca d0
a2 bracadabra $
a3 cadabra$ab r1
a4 dabra$abra c0
b0 ra$abracad a1
b1 racadabra$ a2
c0 adabra$abr a3
d0 abra$abrac a4
r0 a$abracada b0
r1 acadabra$a b1
We normally will need to eliminate all rows that do not have "b" in the L column, but we don't need to in this example. We go to the L column of our valid rows and see that our matches start with b0 and b1 in the F column.
F L
$ abracadabr a0
a0 $abracadab r0
a1 bra$abraca d0
a2 bracadabra $
a3 cadabra$ab r1
a4 dabra$abra c0
b0 ra$abracad a1
b1 racadabra$ a2
c0 adabra$abr a3
d0 abra$abrac a4
r0 a$abracada b0
r1 acadabra$a b1
We have found the two rows that match our string. This may seem redundant since we can easily see the correct rows from the beginning, but this quickly becomes intractable when we have thousands and thousands of rows.
Rank array¶
TODO: Introduce rank arrays for L and F and explain navigating them.
Checkpoints¶
TODO: Introduce rank checkpoints and offsets
-
Cheng, H., Wu, M., & Xu, Y. (2018). FMtree: a fast locating algorithm of FM-indexes for genomic data. Bioinformatics, 34(3), 416-424. doi: 10.1093/bioinformatics/btx596 ↩
-
Simpson, J. T., & Durbin, R. (2010). Efficient construction of an assembly string graph using the FM-index. Bioinformatics, 26(12), i367-i373. doi: 10.1093/bioinformatics/btq217 ↩