Full-Text Indexing via Burrows-Wheeler Transform

1
Full-Text Indexing via Burrows-Wheeler Transform
  • Wing-Kai Hon
  • Oct 18, 2006

2
Outline
  • The Text Searching Problem
  • What is Full-Text Indexing?
  • Burrows-Wheeler Transform (BWT)
  • BWT as a Full-Text Index
  • Related work

3
Text Searching
Text: acacaaccagtcacactagac
Pattern: acac
Where does the pattern occur in the text?

4
How fast can we search?
  • Let n be the length of the text and
    m be the length of the pattern
  • We can find all positions where the pattern
    appears in O(n + m) time
  • Knuth-Morris-Pratt, Boyer-Moore
  • Is O(n + m) time good?
  • Yes, because it is optimal! (In the worst case, every
    character of the text and the pattern must be read)
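
To make the linear-time bound concrete, here is a minimal sketch of Knuth-Morris-Pratt in Python. The function name `kmp_search` and the 1-based reporting of positions are choices made for this illustration, not part of the slides.

```python
def kmp_search(text: str, pattern: str) -> list[int]:
    """Return all 1-based start positions of pattern in text in O(n + m) time."""
    if not pattern:
        return []
    # failure[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    positions = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - len(pattern) + 2)  # convert to 1-based start
            k = failure[k - 1]  # keep going to find overlapping matches
    return positions


print(kmp_search("acacaaccagtcacactagac", "acac"))  # [1, 13]
```

The text is scanned once and never re-read, which is why the total work is proportional to n + m.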

5
Text Searching (take 2)
Suppose we know the text in advance and can preprocess it.
Text: acacaaccagtcacactagac
Pattern: acac
Where does the pattern occur in the text?

6
Can we do better?
  • Yes, there is a data structure for the text, and
    once it is built, pattern search takes only
    O(m + occ) time, where occ is the number of times
    the pattern appears in the text
  • Such a data structure is called an index
  • Is O(m + occ) time useful?
  • Yes, if the text is very long and it is searched
    many times for different patterns

7
Full-Text Index
  • Full-Text Index
  • An index built for a text
  • Every position in the text can be the start of
    an appearance of some pattern (hence "full")
  • Word-Level Index
  • Text is treated as a sequence of words
  • Positions within a word do not correspond
    to the appearance of any pattern
  • E.g., Text: "Was it a cat I saw?" (the pattern "at"
    does not have an appearance)

8
Suffix Tree: An Optimal Full-Text Index
  • As mentioned, we can create an index for the text
    such that pattern searching can be done in
    O(m + occ) time, where occ is the number of
    appearances of the pattern
  • This time is optimal
  • One such index is the Suffix Tree
  • Introduced independently by E. McCreight in 1976
    and P. Weiner in 1973

9
Suffix and Suffix Tree
  • Given a string S, a substring of S that ends at
    the last position is called a suffix of S
  • If S consists of n chars, S has exactly n
    suffixes
  • Theorem: If a pattern P appears at position j in
    S, P appears at the beginning of the suffix of S
    that starts at position j

10
  • E.g., S = acacaac
  • Suffixes of S:
  • acacaac (start at pos 1)
  • cacaac (start at pos 2)
  • acaac (start at pos 3)
  • caac (start at pos 4)
  • aac (start at pos 5)
  • ac (start at pos 6)
  • c (start at pos 7)
  • (empty) (start at pos 8)

Suppose P = ac is a pattern. Then, P appears at
pos 1, pos 3 and pos 6 in S.
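
The theorem can be checked directly with a few lines of Python. This is only an illustrative sketch: the helper names `suffixes` and `occurrences` are made up here, and the empty suffix is left out.

```python
def suffixes(s: str) -> list[tuple[int, str]]:
    """List the non-empty suffixes of s with their 1-based start positions."""
    return [(j + 1, s[j:]) for j in range(len(s))]


def occurrences(s: str, p: str) -> list[int]:
    """P appears at position j exactly when the suffix starting at j begins with P."""
    return [pos for pos, suffix in suffixes(s) if suffix.startswith(p)]


print(occurrences("acacaac", "ac"))  # [1, 3, 6]
```

This brute-force check takes O(nm) time per query; the point of an index is to organize the suffixes so that the same answer is found much faster.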
11
Suffix and Suffix Tree (2)
  • The suffix tree is an edge-labeled compact tree
    (no degree-1 nodes) with n leaves such that
  • each leaf corresponds to a suffix
  • Concatenating edge labels along the path from
    root to leaf gives the corresponding suffix
  • Edge-label to each child starts with different
    character
  • Example (next slide)

12
[Figure: The suffix tree of acacaac, with leaves numbered 1 to 8 by the start position of the corresponding suffix]

13
Searching with Suffix Tree
  • To search for P, we match P starting from the root
  • If we can match P successfully in the tree, the
    leaves under the stop point are exactly the suffixes
    that correspond to an appearance of P in the text
  • Then, we traverse the tree under the stop point
    to report where P appears
  • So, searching is done in O(m + occ) time
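
The search procedure above can be sketched as follows. Note the simplifications: this is an uncompacted suffix trie (one node per character, so it keeps degree-1 nodes and uses quadratic space), not the compact suffix tree of the slides, and the `$` end marker, the `Node` class and the function names are assumptions made for this illustration.

```python
class Node:
    def __init__(self):
        self.children = {}  # next character -> child Node
        self.leaf = None    # 1-based start of the suffix that ends here, if any


def build_suffix_trie(text: str, end: str = "$") -> Node:
    """Insert every suffix of text + end; the end marker gives each suffix its own leaf."""
    root = Node()
    t = text + end
    for start in range(len(t)):
        node = root
        for ch in t[start:]:
            node = node.children.setdefault(ch, Node())
        node.leaf = start + 1
    return root


def search(root: Node, pattern: str) -> list[int]:
    """Match pattern from the root, then report all leaves under the stop point."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []  # pattern cannot be matched in the tree
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        if cur.leaf is not None:
            found.append(cur.leaf)
        stack.extend(cur.children.values())
    return sorted(found)


trie = build_suffix_trie("acacaac")
print(search(trie, "ac"))  # [1, 3, 6]
```

In a compact suffix tree the subtree under the stop point has size proportional to the number of leaves, which is what gives the O(m + occ) bound; the trie here does not have that guarantee.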

14
Is Suffix Tree good?
  • Yes, because its search time is optimal
  • No, because of its space requirement
  • The space can be much larger than the text
  • E.g., Text: human DNA
  • To store the text, we need 0.8 Gbyte
  • To store the suffix tree, we need 64 Gbyte!

15
Something Wrong??
  • Both the suffix tree and the text have n things,
    so they both need O(n) space
  • How come there is a big difference??
  • Let us have a better analysis
  • Let A be the alphabet (i.e., the set of distinct
    characters) of a text T
  • E.g., in DNA, A = {a, c, g, t}

16
Something Wrong?? (2)
  • To store T, we need only n log |A| bits
  • But to store the suffix tree, we will need n
    log n bits
  • When n is very large compared to |A|, there is a
    huge difference
  • Question: Is there an index that supports fast
    searching, but occupies O(n log |A|) bits only??
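
A quick calculation shows the size of the gap. The figure n = 3,000,000,000 is an assumed round number for the length of human DNA, and the function names are made up for this sketch; it only compares the text against a single array of n text positions.

```python
def bits_per_symbol(count: int) -> int:
    """Bits needed to distinguish `count` different values."""
    return max(1, (count - 1).bit_length())


def text_bits(n: int, alphabet_size: int) -> int:
    """n log |A| bits: one fixed-width code per character."""
    return n * bits_per_symbol(alphabet_size)


def position_array_bits(n: int) -> int:
    """n log n bits: one text position (pointer) per suffix."""
    return n * bits_per_symbol(n)


n = 3_000_000_000  # assumed length of human DNA
print(text_bits(n, 4) / 8e9)         # 0.75 (Gbyte)
print(position_array_bits(n) / 8e9)  # 12.0 (Gbyte)
```

So one array of positions already costs 16 times the text. A suffix tree keeps several such fields per node, which is consistent with it being far larger than the text.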

17
Burrows-Wheeler Transform
  • Arrange the suffixes of the text in sorted order;
    the Burrows-Wheeler Transform is the array storing
    the char that precedes each suffix
  • Example (next slide)

18
[Figure: Text acacaac; columns: BWT, suffixes in sorted order]
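
The definition can be turned into a short Python sketch. Assumptions made here: an end marker `$` that sorts before every other character is appended to the text (so the whole text, which has no preceding character, is assigned `$`), and suffixes are sorted naively. The resulting array may therefore look slightly different from the one on the slide.

```python
def bwt(text: str, end: str = "$") -> str:
    """Sort the suffixes of text + end and collect the char preceding each one."""
    t = text + end
    # order[k] = 0-based start of the k-th smallest suffix (a naive suffix array)
    order = sorted(range(len(t)), key=lambda i: t[i:])
    # t[i - 1] is the preceding char; for i == 0 it wraps around to the end marker
    return "".join(t[i - 1] for i in order)


print(bwt("acacaac"))  # ccac$aaa
```

The sorted suffixes of acacaac$ are $, aac$, ac$, acaac$, acacaac$, c$, caac$, cacaac$, and reading off the preceding characters gives ccac$aaa.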
19
BWT is useful
  • The BWT has been shown to compress more easily than
    the original text
  • Also, given the position in the BWT array where
    the last character of the text appears, we can get
    back the original text
  • How?

20
[Figure: Text acacaac; columns: sorted BWT, BWT, suffixes in sorted order]
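
One way to recover the text is to pair the BWT with its sorted version: the k-th occurrence of a character in the BWT is the same text character as its k-th occurrence in the sorted column. The sketch below assumes the BWT was built with a `$` end marker that sorts first, so row 0 is the suffix `$` and its BWT entry is the last character of the text; the function name is made up.

```python
def inverse_bwt(b: str, end: str = "$") -> str:
    """Rebuild the text (without the end marker) from its BWT."""
    # Stable sort: order[j] is the BWT row holding the j-th char of the sorted column.
    order = sorted(range(len(b)), key=lambda i: b[i])
    # lf[i] = row of the suffix that starts with the character b[i]
    lf = [0] * len(b)
    for sorted_row, bwt_row in enumerate(order):
        lf[bwt_row] = sorted_row

    out = []
    row = 0  # row 0 is the suffix "$"; b[0] is the last char of the text
    while b[row] != end:
        out.append(b[row])  # text is recovered from last char to first
        row = lf[row]
    return "".join(reversed(out))


print(inverse_bwt("ccac$aaa"))  # acacaac
```

Each step moves from a suffix to the suffix that starts one position earlier in the text, so the text comes out backwards and is reversed at the end.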
21
BWT → Index
  • Ferragina and Manzini (2000) observed that we can
    use the BWT to support pattern searching by storing
    some additional O(n)-bit arrays
  • Precisely, let B[1..n] be the BWT. With the
    additional arrays, for any x, we can count the
    number of occurrences of any char in B[1..x] in
    constant time
  • Then, we can count the number of times that a
    pattern appears in the text in O(m) time (How?)

22
[Figure: Text acacaac, Pattern aca; columns: sorted BWT, BWT, suffixes in sorted order]
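
The counting procedure is usually called backward search: the pattern is processed from its last character to its first, while maintaining the range of sorted suffixes that start with the part processed so far. The sketch below is a simplification under stated assumptions: a `$` end marker is appended, the per-character counts are stored in plain tables (far more than the O(n) bits of the real index), and all names are illustrative.

```python
def build_counts(text: str, end: str = "$"):
    """Return the BWT b, the table C and prefix counts occ for text + end."""
    t = text + end
    order = sorted(range(len(t)), key=lambda i: t[i:])
    b = "".join(t[i - 1] for i in order)
    alphabet = sorted(set(t))
    # C[c] = number of characters in t that are smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += t.count(c)
    # occ[c][x] = number of occurrences of c in b[:x]
    occ = {c: [0] for c in alphabet}
    for ch in b:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return b, C, occ


def count(pattern: str, b: str, C: dict, occ: dict) -> int:
    """Count appearances of pattern with one O(1) range update per character."""
    lo, hi = 0, len(b)  # half-open range of rows in sorted suffix order
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


b, C, occ = build_counts("acacaac")
print(count("aca", b, C, occ))  # 2
```

For the pattern aca on acacaac$, the row range goes from [1, 5) for a, to [6, 8) for ca, to [3, 5) for aca, so the pattern appears twice.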
23
BWT → Index
  • They also show that, by storing another O(n)-bit
    array, we can report where the pattern appears in
    O(occ log n) time, where occ is the number of
    appearances
  • So, searching is done in O(m + occ log n) time
  • What is the space? O(n log |A|) bits
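
A common way to report positions is to keep the text position only for a sample of the sorted suffixes and to walk backwards through the BWT until a sampled row is reached. The sketch below illustrates that idea and is not necessarily the exact scheme of the paper: the `step` parameter, the dictionary of samples, the `$` end marker and all names are assumptions, and the tables are stored uncompressed.

```python
def build_index(text: str, step: int = 4, end: str = "$"):
    t = text + end
    sa = sorted(range(len(t)), key=lambda i: t[i:])  # naive suffix array
    b = "".join(t[i - 1] for i in sa)
    alphabet = sorted(set(t))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += t.count(c)
    occ = {c: [0] for c in alphabet}
    for ch in b:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    # Keep the text position only for rows whose position is a multiple of step.
    samples = {row: pos for row, pos in enumerate(sa) if pos % step == 0}
    return b, C, occ, samples


def locate(pattern: str, index) -> list[int]:
    """Return the 1-based positions where pattern appears."""
    b, C, occ, samples = index
    lo, hi = 0, len(b)
    for ch in reversed(pattern):  # backward search for the row range
        if ch not in C:
            return []
        lo, hi = C[ch] + occ[ch][lo], C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    positions = []
    for start_row in range(lo, hi):
        row, steps = start_row, 0
        while row not in samples:  # at most step - 1 iterations
            ch = b[row]
            row = C[ch] + occ[ch][row]  # move to the suffix one position earlier
            steps += 1
        positions.append(samples[row] + steps + 1)
    return sorted(positions)


index = build_index("acacaac")
print(locate("aca", index))  # [1, 3]
```

Sampling one position in every log n gives the O(occ log n) reporting time quoted above; the walk always stops because text position 0 is sampled.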

24
Related Work
  • Further compress the index
  • Space is now measured in terms of the entropy (or
    the randomness) of a text
  • Support text with large alphabet
  • Efficient Construction
  • Challenge is in minimizing working space
  • More complex queries and operations
  • Library problem, Dictionary problem

25
Pointers for Further Study
  • The Pizza&Chili website
  • http://pizzachili.di.unipi.it
  • The FM-index paper by P. Ferragina and G.
    Manzini, FOCS 2000
  • The CSA paper by R. Grossi and J.S. Vitter, STOC
    2000
  • Discuss with me _ (email wkhon_at_)