Title: Full-Text Indexing via Burrows-Wheeler Transform
1 Full-Text Indexing via Burrows-Wheeler Transform
- Wing-Kai Hon
- Oct 18, 2006
2 Outline
- The Text Searching Problem
- What is Full-Text Indexing?
- Burrows-Wheeler Transform (BWT)
- BWT as a Full-Text Index
- Related work
3 Text Searching
Text: acacaaccagtcacactagac
Pattern: acac
Where does the pattern occur in the text?
4 How fast can we search?
- Let n be the length of the text and m be the length of the pattern
- We can find all positions where the pattern appears in O(n + m) time
- Knuth-Morris-Pratt, Boyer-Moore
- Is O(n + m) time good?
- Yes, because it is optimal!
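A minimal sketch of the Knuth-Morris-Pratt search named above, to make the O(n + m) bound concrete. The function name `kmp_search` and the 1-based reporting of positions (chosen to match the slides) are assumptions of this sketch, not part of the original talk.

```python
def kmp_search(text, pattern):
    """Return all 1-based start positions of pattern in text in O(n + m) time."""
    if not pattern:
        return []
    # failure[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # Scan the text once; k is the number of pattern chars currently matched.
    hits = []
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 2)   # convert 0-based end index to 1-based start
            k = failure[k - 1]
    return hits
```

On the text and pattern of the previous slide, `kmp_search("acacaaccagtcacactagac", "acac")` reports positions 1 and 13.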
5 Text Searching (take 2)
We know the text in advance and can preprocess it.
Text: acacaaccagtcacactagac
Pattern: acac
Where does the pattern occur in the text?
6 Can we do better?
- Yes, there is a data structure for the text; once it is built, a pattern search takes only O(m + occ) time, where occ is the number of times the pattern appears in the text
- Such a data structure is called an index
- Is O(m + occ) time useful?
- Yes, if the text is very long and it is searched many times for different patterns
7 Full-Text Index
- Full-Text Index
- Deals with creating an index for a text
- Also, each position in the text corresponds to an appearance of at least one pattern (hence "full")
- Word-Level Index
- Text is a sequence of words
- Positions within a word do not correspond to an appearance of any pattern
- E.g., Text: "Was it a cat I saw?" (the pattern "at" does not have an appearance)
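The word-level limitation can be seen with a toy index. This is only an illustration: the helper `word_index` and its use of a regular expression to split words are assumptions, not something the slides prescribe.

```python
import re

def word_index(text):
    """Map each lowercased word to the 1-based positions where it starts."""
    index = {}
    for match in re.finditer(r"\w+", text):
        index.setdefault(match.group().lower(), []).append(match.start() + 1)
    return index

text = "Was it a cat I saw?"
index = word_index(text)
found_as_word = "at" in index        # False: "at" is not a word of the text
found_as_substring = "at" in text    # True: "at" occurs inside "cat"
```

A full-text index would have to report the occurrence of "at" inside "cat"; the word-level index only knows the positions where whole words begin.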
8 Suffix Tree: An Optimal Full-Text Index
- As mentioned, we can create an index for the text such that pattern searching can be done in O(m + occ) time (occ = number of appearances of the pattern)
- This time is optimal
- One such index is the Suffix Tree
- Introduced by P. Weiner in 1973, with a more space-economical construction by E. McCreight in 1976
9 Suffix and Suffix Tree
- Given a string S, a substring of S that ends at the last position is called a suffix of S
- If S consists of n chars, S has exactly n non-empty suffixes
- Theorem: If a pattern P appears at position j in S, then P appears at the beginning of the suffix of S that starts at position j
10
- E.g., S = acacaac
- Suffixes of S:
- acacaac (start at pos 1)
- cacaac (start at pos 2)
- acaac (start at pos 3)
- caac (start at pos 4)
- aac (start at pos 5)
- ac (start at pos 6)
- c (start at pos 7)
- (empty suffix) (start at pos 8)
Suppose P = ac is a pattern. Then, P appears at pos 1, pos 3 and pos 6 in S.
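The theorem can be checked directly on this example. The sketch below simply enumerates the suffixes and tests each one; the helper names are hypothetical, and this brute-force check is not an index, only a restatement of the theorem in code.

```python
def suffixes(s):
    """Return (1-based start position, suffix) pairs, including the empty suffix."""
    return [(j + 1, s[j:]) for j in range(len(s) + 1)]

def occurrences_via_suffixes(s, p):
    """Positions j such that the suffix of s starting at j begins with p."""
    return [j for j, suffix in suffixes(s) if suffix.startswith(p)]

positions = occurrences_via_suffixes("acacaac", "ac")   # [1, 3, 6]
```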
11 Suffix and Suffix Tree (2)
- The suffix tree is an edge-labeled compact tree (no degree-1 nodes) with n leaves such that
- each leaf corresponds to a suffix
- Concatenating edge labels along the path from root to leaf gives the corresponding suffix
- The edge labels to the children of a node start with different characters
- Example (next slide)
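12 The Suffix Tree of acacaac
[Figure: the suffix tree of acacaac, with edges labeled by strings over a and c, and leaves numbered 1 to 8 by the start position of the corresponding suffix]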
13 Searching with Suffix Tree
- To search P, we match P starting from the root
- If we can match P successfully in the tree, the leaves under the stop point are exactly the suffixes that correspond to an appearance of P in the text
- Then, we traverse the tree under the stop point to report where P appears
- So, searching is done in O(m + occ) time for occ appearances
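The search procedure can be sketched on a simplified structure. Note the assumptions: this builds an uncompacted suffix trie (one node per character, so quadratic space in the worst case) rather than the compact suffix tree of the slides, it appends an end marker "$" assumed not to occur in the text, and the names `Node`, `build_suffix_trie` and `search` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> child node
        self.start = None    # 1-based suffix start position, set on leaves only

def build_suffix_trie(text, end="$"):
    """Insert every suffix of text + end, one character per edge."""
    root = Node()
    s = text + end
    for j in range(len(s)):
        node = root
        for ch in s[j:]:
            node = node.children.setdefault(ch, Node())
        node.start = j + 1
    return root

def search(root, pattern):
    """Match pattern from the root, then collect all leaves under the stop point."""
    node = root
    for ch in pattern:
        if ch not in node.children:
            return []            # pattern does not appear in the text
        node = node.children[ch]
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.start is not None:
            positions.append(current.start)
        stack.extend(current.children.values())
    return sorted(positions)

root = build_suffix_trie("acacaac")
```

Here `search(root, "ac")` returns positions 1, 3 and 6. In the trie, the traversal below the stop point may walk long single-child paths; compacting those paths, as the suffix tree does, is what bounds the reporting step by the number of appearances.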
14 Is Suffix Tree good?
- Yes, because of its optimal search time
- No, because of its space requirement
- The space can be much larger than the text
- E.g., Text: DNA of Human
- To store the text, we need 0.8 Gbyte
- To store the suffix tree, we need 64 Gbyte!
15 Something Wrong??
- Both the suffix tree and the text have n things, so they both need O(n) space
- How come there is a big difference??
- Let us have a better analysis
- Let A be the alphabet (i.e., the set of distinct characters) of a text T
- E.g., in DNA, A = {a, c, g, t}
16 Something Wrong?? (2)
- To store T, we need only n log |A| bits
- But to store the suffix tree, we will need n log n bits
- When n is very large compared to |A|, there is a huge difference
- Question: Is there an index that supports fast searching, but occupies only O(n log |A|) bits??
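A rough calculation makes the gap concrete. The value of n below (3.2 billion characters) is an assumption, chosen because it is consistent with the 0.8 Gbyte figure quoted earlier at 2 bits per DNA character; the function names are hypothetical.

```python
import math

def text_bits(n, alphabet_size):
    """Bits to store a text of n characters: n * ceil(log2 |A|)."""
    return n * math.ceil(math.log2(alphabet_size))

def pointer_array_bits(n):
    """Bits for one array of n positions into the text: n * ceil(log2 n)."""
    return n * math.ceil(math.log2(n))

n = 3_200_000_000                               # assumed length of the human DNA text
text_gbytes = text_bits(n, 4) / 8 / 1e9         # 0.8
pointer_gbytes = pointer_array_bits(n) / 8 / 1e9   # 12.8
```

Under this assumption a single array of n text positions already takes 12.8 Gbyte, sixteen times the text itself, and a suffix tree stores several such words per character.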
17 Burrows-Wheeler Transform
- Arrange the suffixes in sorted order; the Burrows-Wheeler Transform is the array storing the character that precedes each suffix
- Example (next slide)
18 Text: acacaac
[Table: the BWT column alongside the suffixes in sorted order]
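The definition can be computed directly for this text. The sketch assumes an end marker "$" that is appended to the text and sorts before every text character (standing in for the empty suffix), and it sorts the suffixes naively; the function name `bwt` is hypothetical.

```python
def bwt(text, end="$"):
    """Return (BWT string, 1-based start positions of the sorted suffixes)."""
    s = text + end
    # Sort suffix start indices by the suffix they begin (naive construction).
    order = sorted(range(len(s)), key=lambda i: s[i:])
    # The char preceding suffix i is s[i - 1]; for i = 0 this wraps to the end marker.
    transformed = "".join(s[i - 1] for i in order)
    return transformed, [i + 1 for i in order]

transformed, starts = bwt("acacaac")
# Sorted suffixes: $, aac$, ac$, acaac$, acacaac$, c$, caac$, cacaac$
```

For acacaac this gives the BWT string `ccac$aaa`, with the sorted suffixes starting at positions 8, 5, 6, 3, 1, 7, 4, 2.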
19 BWT is useful
- The BWT has been shown to compress more easily than the original text
- Also, given the position in the BWT array where the last character appears, we can get back the original text
- How?
20 Text: acacaac
[Table: the Sorted BWT and BWT columns alongside the suffixes in sorted order]
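One way to recover the text is to pair the BWT with its sorted copy: the k-th occurrence of a character in the BWT corresponds to the k-th occurrence of that character in the sorted BWT. The sketch below assumes the BWT was built with an end marker "$" that sorts before every text character, so the first row belongs to the empty suffix and its BWT entry is the last character of the text; the function name is hypothetical.

```python
def inverse_bwt(b, end="$"):
    """Recover the original text (without the end marker) from its BWT."""
    # A stable sort of the BWT positions by character gives the sorted BWT;
    # position i of the BWT lands in row lf[i] of the sorted column.
    order = sorted(range(len(b)), key=lambda i: b[i])
    lf = [0] * len(b)
    for row, i in enumerate(order):
        lf[i] = row
    # Row 0 holds the end marker in the sorted column, so b[0] is the last
    # text character. Follow lf to read the text from right to left.
    chars = []
    row = 0
    for _ in range(len(b) - 1):
        chars.append(b[row])
        row = lf[row]
    return "".join(reversed(chars))

recovered = inverse_bwt("ccac$aaa")
```

Applied to `ccac$aaa`, the BWT of acacaac with the end marker, this returns acacaac.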
21 BWT as a Full-Text Index
- Ferragina and Manzini (2000) observed that we can use the BWT to support pattern searching by storing some additional O(n)-bit arrays
- Precisely, let B[1..n] be the BWT. With the additional arrays, for any x, we can count the number of occurrences of any char in B[1..x] in constant time
- Then, we can count the number of times that a pattern appears in the text in O(m) time (How?)
22 Text: acacaac, Pattern: aca
[Table: the Sorted BWT and BWT columns alongside the suffixes in sorted order]
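Counting can be sketched as a backward search: process the pattern from its last character to its first, keeping the range of sorted suffixes that begin with the part read so far. Assumptions of this sketch: the BWT includes an end marker "$", and the constant-time counting structure is replaced by a plain prefix-count table, which uses far more than O(n) bits. The names `C` and `rank` are illustrative.

```python
def count_occurrences(b, pattern):
    """Count appearances of pattern in the text whose BWT (with end marker) is b."""
    alphabet = sorted(set(b))
    # C[c] = number of characters in b that are smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += b.count(c)
    # rank[c][x] = number of occurrences of c in b[:x] (plain table, not succinct).
    rank = {c: [0] for c in alphabet}
    for ch in b:
        for c in alphabet:
            rank[c].append(rank[c][-1] + (ch == c))
    # [lo, hi) is the range of sorted suffixes starting with the suffix of
    # the pattern processed so far; initially it covers all suffixes.
    lo, hi = 0, len(b)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo, hi = C[c] + rank[c][lo], C[c] + rank[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

count = count_occurrences("ccac$aaa", "aca")
```

For the pattern aca the range narrows from the four suffixes starting with a, to the two starting with ca, to the two starting with aca, so the count is 2. Each pattern character costs two table lookups, giving O(m) steps in total.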
23 BWT as a Full-Text Index
- They also show that, by storing another O(n)-bit array, we can report where the pattern appears in O(occ log n) time, where occ is the number of appearances
- So, searching is done in O(m + occ log n) time
- What is the space? O(n log |A|) bits
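Reporting positions can be sketched by storing the text position for only some rows and walking backwards through the text from an unsampled row until a sampled one is reached. This is a simplified illustration, not the structure from the paper: the sampling step of 4 stands in for a value around log n, the suffixes are sorted naively, the counts use a plain table, and all names are hypothetical.

```python
def build_index(text, step=4, end="$"):
    """Build the BWT, counting tables, and a sparse sample of suffix positions."""
    s = text + end
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # naive suffix sorting
    b = "".join(s[i - 1] for i in sa)
    alphabet = sorted(set(b))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += b.count(c)
    rank = {c: [0] for c in alphabet}
    for ch in b:
        for c in alphabet:
            rank[c].append(rank[c][-1] + (ch == c))
    # Keep the 0-based text position only for rows whose position is a multiple of step.
    samples = {row: pos for row, pos in enumerate(sa) if pos % step == 0}
    return b, C, rank, samples

def locate(index, pattern):
    """Return the sorted 1-based positions where pattern appears."""
    b, C, rank, samples = index
    lo, hi = 0, len(b)
    for c in reversed(pattern):            # backward search for the row range
        if c not in C:
            return []
        lo, hi = C[c] + rank[c][lo], C[c] + rank[c][hi]
        if lo >= hi:
            return []
    positions = []
    for row in range(lo, hi):
        steps = 0
        while row not in samples:          # move to the suffix one position earlier
            c = b[row]
            row = C[c] + rank[c][row]
            steps += 1
        positions.append(samples[row] + steps + 1)
    return sorted(positions)

index = build_index("acacaac")
```

Here `locate(index, "aca")` returns positions 1 and 3. Because position 0 of the text is always sampled, each walk ends after fewer than `step` moves, which is where the per-appearance cost comes from.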
24 Related Work
- Further compress the index
- Space is now measured in terms of the entropy (or the randomness) of a text
- Support text with large alphabet
- Efficient Construction
- Challenge is in minimizing working space
- More complex queries and operations
- Library problem, Dictionary problem
25 Pointers for Further Study
- The Pizza&Chili website
- http://pizzachili.di.unipi.it
- The FM-index paper by P. Ferragina and G. Manzini, FOCS 2000
- The CSA paper by R. Grossi and J.S. Vitter, STOC 2000
- Discuss with me (email wkhon_at_)