FM-index

“Preliminaries”

We assume you are familiar with the following material:

Burrows–Wheeler transform

Efficient search over large amounts of text underpins many applications, from the analysis of genetic sequences to information retrieval in large databases. The FM-index, developed by Paolo Ferragina and Giovanni Manzini in 2000, addresses this problem with a single data structure that combines compression with efficient search.

At its core, the FM-index is a compressed full-text substring index. It supports searching for substrings within a larger text corpus while also storing the indexed text in compressed form, which conserves storage space.

Last-to-first mapping

The BWT is useful for compression because it tends to group identical characters into runs, and these runs are easier to compress. The “last-to-first” (LF) mapping is a crucial part of the BWT, especially for the inverse transformation, where it helps reconstruct the original string from the transformed string. For example, the sorted rotations of abracadabra$ are shown below; the L column is the BWT.

I    F            L
1    $ abracadabr a
2    a $abracadab r
3    a bra$abraca d
4    a bracadabra $
5    a cadabra$ab r
6    a dabra$abra c
7    b ra$abracad a
8    b racadabra$ a
9    c adabra$abr a
10   d abra$abrac a
11   r a$abracada b
12   r acadabra$a b

I column: Shows the index of the sorted rotations of the original string.
F column: The first column in the sorted list of all rotations of the original string. This column is important because it contains the characters of the original string sorted alphabetically.
L column: The last column in the sorted list of all rotations of the original string. This is the actual output of the BWT.
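The table above can be reproduced with a few lines of Python. This is a minimal sketch for short strings only: it materializes every rotation, which is far too slow and memory-hungry for large texts. The helper name `bwt_columns` is hypothetical and introduced here purely for illustration.

```python
def bwt_columns(text, sentinel="$"):
    """Return the F and L columns of the sorted rotations of text + sentinel."""
    s = text + sentinel
    # Build every cyclic rotation, then sort them lexicographically.
    # In ASCII, "$" sorts before all lowercase letters.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    first = "".join(row[0] for row in rotations)   # F column
    last = "".join(row[-1] for row in rotations)   # L column (the BWT)
    return first, last


first, last = bwt_columns("abracadabra")
print(first)  # $aaaaabbcdrr
print(last)   # ard$rcaaaabb
```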

The LF mapping is a way to navigate from a character in the last column (L) to the same character occurrence in the first column (F). This mapping is possible because the rotations are sorted lexicographically, which keeps identical characters in the same relative order in both columns. Thus, if you know the position of a character in the L column, you can find its corresponding position in the F column.

Given the sorted rotations and the BWT result, we proceed as follows.

1. Identify the L (last) and F (first) columns from your sorted rotations.

Based on our BWT above, we have

F column (first characters of each row, already sorted): $aaaaabbcdrr
L column (last characters of each row, the BWT of the string): ard$rcaaaabb

2. Count occurrences in L column up to each character to compute the mapping to F.

To compute the LF mapping, we need to count how many times each character appears in L up to a given point. This count gives the rank of each character in L, and the occurrence with a given rank in L corresponds to the occurrence with the same rank in F. Both columns are permutations of the same string with the same character frequencies, so every ranked letter in L has exactly one partner in F.

First, let’s number our F letters starting from zero and increasing by one each time a repeat of a previous letter occurs. For example, the first a will be a₀, then the second time an a appears we will label it a₁, etc.

F             L
$  abracadabr a
a₀ $abracadab r
a₁ bra$abraca d
a₂ bracadabra $
a₃ cadabra$ab r
a₄ dabra$abra c
b₀ ra$abracad a
b₁ racadabra$ a
c₀ adabra$abr a
d₀ abra$abrac a
r₀ a$abracada b
r₁ acadabra$a b

Now we will do our L letters.

F             L
$  abracadabr a₀
a₀ $abracadab r₀
a₁ bra$abraca d₀
a₂ bracadabra $
a₃ cadabra$ab r₁
a₄ dabra$abra c₀
b₀ ra$abracad a₁
b₁ racadabra$ a₂
c₀ adabra$abr a₃
d₀ abra$abrac a₄
r₀ a$abracada b₀
r₁ acadabra$a b₁

The number attached to each letter is called its rank: the number of times the same letter has already appeared earlier in that column. The highest rank for a letter is therefore one less than the number of times it occurs. In our case, “a” has the highest rank of 4, which means there are a total of five a’s in our string.

Note

In the context of the BWT, the rank of a character within the L column does not simply indicate how many times the character appears, but rather its sequential position among identical characters within that column. This is a crucial point because the BWT, by design, groups similar characters together due to the sorting of all cyclic permutations of the original string. The rank tells us not just about the quantity of occurrences but about the order of each occurrence within the transformed string.
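The ranks in the tables above can be computed with a single left-to-right pass over a column. The following Python sketch assumes the F and L columns from our example are given as plain strings; the function name `ranks` is a hypothetical helper.

```python
from collections import defaultdict


def ranks(column):
    """For each position, count earlier occurrences of the same character."""
    seen = defaultdict(int)
    result = []
    for ch in column:
        result.append(seen[ch])  # rank = how many times ch appeared before
        seen[ch] += 1
    return result


F = "$aaaaabbcdrr"
L = "ard$rcaaaabb"
print(ranks(F))  # [0, 0, 1, 2, 3, 4, 0, 1, 0, 0, 0, 1]
print(ranks(L))  # [0, 0, 0, 0, 1, 0, 1, 2, 3, 4, 0, 1]
```

Pairing each character with its rank gives the labels used in the tables, for example r₀ in row 2 of L and r₁ in row 5.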

The power of this rank is that the F letter with the same rank as an L letter is the same occurrence of that letter in the original string. Don’t believe me? Look at the r₀ in the L column; if we wrap around and continue the cyclical permutation until we hit $ we see that it would be r₀a$. Now, find r₀ in the F column and continue until we hit $. We also get r₀a$! Go ahead and try this for other letters; it will work every time.

How? This follows from sorting the rotations, together with the fact that we add the $ to the end of our string and make sure it is lexicographically lower than any other letter, so every rotation is distinct. The right-context of a character is the substring that follows it in a particular cyclic permutation of the original string. Rows that begin with the same letter are sorted by the right-context of that letter. Rows that end with the same letter are sorted by that same right-context, because the rest of the row is exactly what follows the letter once we wrap around. Both columns therefore list the occurrences of a given letter in the same relative order, so equal ranks identify the same occurrence. This mapping allows us to trace each character through the sorted permutations, from its position in the L column (where it ends one permutation) to its position in the F column (where it starts another).
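As a rough illustration, the LF mapping can be computed from the L column alone, because F is just L sorted. The sketch below is one possible way to do it in Python, not an optimized implementation. The names `lf_mapping` and `block_start` are hypothetical, and rows are numbered from 0 in the code, whereas the I column above starts at 1.

```python
def lf_mapping(last):
    """lf[i] is the row whose F letter is the same occurrence as last[i]."""
    first = sorted(last)  # the F column
    # Row at which each character's block begins in F.
    block_start = {}
    for row, ch in enumerate(first):
        block_start.setdefault(ch, row)

    seen = {}  # occurrences of each character seen so far in L
    lf = []
    for ch in last:
        rank = seen.get(ch, 0)
        lf.append(block_start[ch] + rank)  # same rank inside the F block
        seen[ch] = rank + 1
    return lf


lf = lf_mapping("ard$rcaaaabb")
print(lf)  # [1, 10, 9, 0, 11, 8, 2, 3, 4, 5, 6, 7]
```

For example, `lf[1] == 10` says that the r₀ at the end of the second row is the r₀ that starts the eleventh row.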

Reversing the BWT

You can also use the LF mapping to reverse the BWT and get the original string.

F    L
$    a₀
a₀   r₀
a₁   d₀
a₂   $
a₃   r₁
a₄   c₀
b₀   a₁
b₁   a₂
c₀   a₃
d₀   a₄
r₀   b₀
r₁   b₁

Start with $ and the first row, whose F letter is $. Move to the end of that row and append its L-column letter.

Original: $a₀

Move to the row that starts with a₀ and then add the L-column letter.

Original: $a₀r₀

Move to the row that starts with r₀ and then add the L-column letter.

Original: $a₀r₀b₀

Repeat this process until you reach $ in the L column.

Original: $a₀r₀b₀a₁d₀a₄c₀a₃r₁b₁a₂

Now, reverse this string

Reversed: a₂b₁r₁a₃c₀a₄d₀a₁b₀r₀a₀$

and drop the ranks.

Original: abracadabra$
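The walk above can be sketched in Python as follows. This is an illustrative version under two assumptions: the sentinel occurs exactly once, and it sorts before every other character, so the row that starts with it is row 0. The function name `inverse_bwt` is hypothetical.

```python
def inverse_bwt(last, sentinel="$"):
    """Rebuild the original string (including the sentinel) from the L column."""
    first = sorted(last)
    block_start = {}
    for row, ch in enumerate(first):
        block_start.setdefault(ch, row)

    # Compute the LF mapping: same rank in L and in the F block.
    seen = {}
    lf = []
    for ch in last:
        rank = seen.get(ch, 0)
        lf.append(block_start[ch] + rank)
        seen[ch] = rank + 1

    row = 0                   # assumed: the row whose F letter is the sentinel
    collected = [sentinel]    # letters are collected back to front
    while last[row] != sentinel:
        collected.append(last[row])
        row = lf[row]         # jump to the row that starts with this letter
    return "".join(reversed(collected))


print(inverse_bwt("ard$rcaaaabb"))  # abracadabra$
```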

Searching

The LF mapping is also very helpful for quickly searching for patterns. For example, let’s search for the string “bra”, using our table, which is copied below.

F             L
$  abracadabr a₀
a₀ $abracadab r₀
a₁ bra$abraca d₀
a₂ bracadabra $
a₃ cadabra$ab r₁
a₄ dabra$abra c₀
b₀ ra$abracad a₁
b₁ racadabra$ a₂
c₀ adabra$abr a₃
d₀ abra$abrac a₄
r₀ a$abracada b₀
r₁ acadabra$a b₁

As when reversing the BWT, we apply the LF mapping repeatedly, but we first reverse our search string (i.e., arb) so that we process its letters from last to first.

Note

We actually did this in the previous section by starting from the first row that begins with $.

First, we find all rows that start with “a”.

F             L
$  abracadabr a₀
a₀ $abracadab r₀  ←
a₁ bra$abraca d₀  ←
a₂ bracadabra $   ←
a₃ cadabra$ab r₁  ←
a₄ dabra$abra c₀  ←
b₀ ra$abracad a₁
b₁ racadabra$ a₂
c₀ adabra$abr a₃
d₀ abra$abrac a₄
r₀ a$abracada b₀
r₁ acadabra$a b₁

Now we eliminate all rows that do not end in “r”. Remember that each row is a cyclic rotation, so the letter in L is the letter that immediately precedes the F letter of the same row in the original string. Thus, any row with “a” in F and “r” in L represents “ra” in the original string.

F             L
$  abracadabr a₀
a₀ $abracadab r₀  ←
a₁ bra$abraca d₀
a₂ bracadabra $
a₃ cadabra$ab r₁  ←
a₄ dabra$abra c₀
b₀ ra$abracad a₁
b₁ racadabra$ a₂
c₀ adabra$abr a₃
d₀ abra$abrac a₄
r₀ a$abracada b₀
r₁ acadabra$a b₁

Now we find the rows with r₀ and r₁ in the F column.

F             L
$  abracadabr a₀
a₀ $abracadab r₀
a₁ bra$abraca d₀
a₂ bracadabra $
a₃ cadabra$ab r₁
a₄ dabra$abra c₀
b₀ ra$abracad a₁
b₁ racadabra$ a₂
c₀ adabra$abr a₃
d₀ abra$abrac a₄
r₀ a$abracada b₀  ←
r₁ acadabra$a b₁  ←

Normally we would need to eliminate all rows that do not have “b” in the L column, but in this example both remaining rows already end in “b”. We take the L letters of our valid rows, b₀ and b₁, and find them in the F column; the rows that start with b₀ and b₁ are our matches.

F             L
$  abracadabr a₀
a₀ $abracadab r₀
a₁ bra$abraca d₀
a₂ bracadabra $
a₃ cadabra$ab r₁
a₄ dabra$abra c₀
b₀ ra$abracad a₁  ←
b₁ racadabra$ a₂  ←
c₀ adabra$abr a₃
d₀ abra$abrac a₄
r₀ a$abracada b₀
r₁ acadabra$a b₁

We have found the two rows that match our string. This may seem redundant since we can easily see the correct rows from the beginning, but finding them by eye quickly becomes impractical when we have thousands and thousands of rows.
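The search above can be sketched as a loop that keeps a range of candidate rows and narrows it once per pattern letter, reading the pattern from right to left. The Python version below is a simplified illustration, not a full FM-index. The name `backward_search` is hypothetical, rows are numbered from 0, and the rank queries use `str.count` on a prefix of L, which takes time linear in the text length. A practical index would answer these queries from precomputed rank information instead.

```python
def backward_search(last, pattern):
    """Return the half-open range (lo, hi) of rows that start with pattern."""
    first = sorted(last)
    block_start = {}
    for row, ch in enumerate(first):
        block_start.setdefault(ch, row)

    lo, hi = 0, len(last)         # every row matches the empty pattern
    for ch in reversed(pattern):  # consume the pattern right to left
        if ch not in block_start:
            return (0, 0)         # letter never occurs in the text
        # Naive rank queries: occurrences of ch in L before rows lo and hi.
        lo = block_start[ch] + last[:lo].count(ch)
        hi = block_start[ch] + last[:hi].count(ch)
        if lo >= hi:
            return (0, 0)         # no row survives this step
    return (lo, hi)


print(backward_search("ard$rcaaaabb", "bra"))  # (6, 8)
```

The result `(6, 8)` covers 0-indexed rows 6 and 7, which are the rows starting with b₀ and b₁ in the table, and `hi - lo` gives the number of occurrences.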

Rank array

TODO: Introduce rank arrays for L and F and explain navigating them.

Checkpoints

TODO: Introduce rank checkpoints and offsets

Last updated on February 12, 2025