Optimal exact string matching based on suffix arrays

Provided by: dblabCs
Transcript and Presenter's Notes

1
Optimal exact string matching based on suffix
arrays
  • Mohamed Ibrahim Abouelhoda, Enno Ohlebusch, and
    Stefan Kurtz

2
1. Introduction
  • While suffix trees play a prominent role in
    algorithmics, they are not as widespread in
    actual implementations of software tools as one
    should expect.
  • On the other hand, the suffix array is a more
    space efficient data structure than the suffix
    tree.

3
  • However, the suffix array seems to have two
    disadvantages compared with the suffix tree:
    (1) The direct construction of the suffix array
    takes O(n log n) time.
    (2) It is not clear that (and how) every
    algorithm using a suffix tree can be replaced
    with an algorithm based on a suffix array that
    solves the same problem in the same time
    complexity. For example, using only the basic
    suffix array, it takes O(m log n) time in the
    worst case to answer decision queries.
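
As a rough illustration of where the O(m log n) bound comes from, a decision query ("is P a substring of S?") on the basic suffix array can be sketched as a binary search in which every step compares up to m characters. The function names `build_suffix_array` and `contains` are illustrative and do not come from the presentation, and the array is built by naive sorting only to keep the sketch short.

```python
def build_suffix_array(s):
    """Suffix start positions in lexicographic order (naive sort, for illustration only)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def contains(s, suftab, p):
    """Decision query: O(log n) binary-search steps, each comparing at most m characters."""
    m = len(p)
    lo, hi = 0, len(suftab)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare P with the first m characters of the suffix at rank mid.
        if s[suftab[mid]:suftab[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    # lo is now the first rank whose m-character prefix is >= P.
    return lo < len(suftab) and s[suftab[lo]:suftab[lo] + m] == p


if __name__ == "__main__":
    text = "acaaacatat"
    sa = build_suffix_array(text)
    print(contains(text, sa, "cat"), contains(text, sa, "tata"))
```

The binary search works because truncating lexicographically sorted suffixes to their first m characters keeps them in non-decreasing order.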

4
  • (1) The suffix array of S can be constructed in
    O(n) time in the worst case by first
    constructing the suffix tree of S.
  • (2) We will show how decision queries can be
    answered in optimal O(m) time and how to find
    all z occurrences of a pattern P in optimal
    O(m + z) time by using the basic suffix array
    enhanced with two additional tables.

5
2. Basic notions
  • The suffix array suftab is an array of integers
    in the range 0 to n, specifying the lexicographic
    ordering of the n + 1 suffixes of the string S$
    (S followed by a sentinel character $).
  • S_i = S[i..n-1]$ denotes the i-th nonempty suffix
    of the string S$, 0 ≤ i ≤ n.
  • The lcp-table lcptab is an array of integers with
    lcptab[0] = 0, where lcptab[i] is the length of
    the longest common prefix of S_suftab[i-1] and
    S_suftab[i], for 1 ≤ i ≤ n.
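
The two tables can be made concrete with a small sketch. It assumes a sentinel that sorts after every character of S (here the largest Unicode code point stands in for $), builds suftab by naive sorting rather than in linear time, and uses the illustrative names `build_suftab` and `build_lcptab`, which are not from the presentation.

```python
SENTINEL = chr(0x10FFFF)  # assumption: stands in for $ and sorts after every character of S


def build_suftab(s):
    """Ranks 0..n of the n + 1 suffixes of s + sentinel (naive sort, not linear time)."""
    t = s + SENTINEL
    return sorted(range(len(t)), key=lambda i: t[i:])


def build_lcptab(s, suftab):
    """lcptab[0] = 0; lcptab[i] = longest common prefix of the suffixes at ranks i-1 and i."""
    t = s + SENTINEL
    lcptab = [0] * len(suftab)
    for i in range(1, len(suftab)):
        a, b = t[suftab[i - 1]:], t[suftab[i]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcptab[i] = k
    return lcptab


if __name__ == "__main__":
    suftab = build_suftab("acaaacatat")
    print(suftab)
    print(build_lcptab("acaaacatat", suftab))
```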

6
3. The lcp-intervals of a suffix array
  • Definition 1. An interval [i..j], 0 ≤ i < j ≤ n,
    is an lcp-interval of lcp-value l if
    1. lcptab[i] < l,
    2. lcptab[k] ≥ l for all k with i + 1 ≤ k ≤ j,
    3. lcptab[k] = l for at least one k with
       i + 1 ≤ k ≤ j,
    4. lcptab[j + 1] < l.
  • Every index k, i + 1 ≤ k ≤ j, with lcptab[k] = l
    is called an l-index.
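
Definition 1 can be checked directly with a brute-force sketch. The helper names `is_lcp_interval` and `lcp_intervals` are illustrative, the position just past the table (j + 1 = n + 1) is treated as holding -1 (an assumption made here to handle the right boundary), and the quadratic enumeration is meant only to illustrate the definition, not to be efficient.

```python
def is_lcp_interval(lcptab, i, j, l):
    """Check the four conditions of the lcp-interval definition for [i..j] and value l."""
    n = len(lcptab) - 1
    if not (0 <= i < j <= n):
        return False
    inner = lcptab[i + 1:j + 1]                # entries at positions i+1 .. j
    right = lcptab[j + 1] if j < n else -1     # assumption: lcptab[n + 1] counts as -1
    return (lcptab[i] < l                      # condition 1
            and min(inner) >= l                # condition 2
            and l in inner                     # condition 3
            and right < l)                     # condition 4


def lcp_intervals(lcptab):
    """Enumerate all (l, i, j) satisfying the definition (quadratic, for illustration)."""
    n = len(lcptab) - 1
    found = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            # Conditions 2 and 3 together force l to be the minimum inner value.
            l = min(lcptab[i + 1:j + 1])
            if is_lcp_interval(lcptab, i, j, l):
                found.append((l, i, j))
    return found


if __name__ == "__main__":
    # lcp-table of "acaaacatat" followed by a sentinel that sorts last
    print(lcp_intervals([0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]))
```

Because lcptab[0] = 0, the whole range [0..n] does not pass condition 1 for l = 0, so this sketch does not report it.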

9
4. The enhanced suffix array
  • The enhanced suffix array consists of the suffix
    array, the lcp-table, and an additional table:
    the child-table cldtab.
  • The child-table is a table of size n + 1 indexed
    from 0 to n, and each entry contains three
    values: up, down, and nextlIndex.
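
One way to fill the three fields is with two left-to-right passes over lcptab using a stack. The sketch below is a reading of the usual stack-based construction, not code taken from the slides. It assumes lcptab[0] = 0 and non-negative entries, uses None for an empty field, and the name `build_child_table` is illustrative.

```python
def build_child_table(lcptab):
    """Return (up, down, next_l_index) as lists; None marks an empty field.

    Assumes lcptab[0] == 0 and all entries >= 0, so the bottom of the stack
    always has lcp-value 0 and is never removed by the while loops.
    """
    size = len(lcptab)
    up = [None] * size
    down = [None] * size
    next_l_index = [None] * size

    # Pass 1: up and down values.
    stack = [0]
    last = None
    for i in range(1, size):
        while lcptab[i] < lcptab[stack[-1]]:
            last = stack.pop()
            top = stack[-1]
            if lcptab[i] <= lcptab[top] and lcptab[top] != lcptab[last]:
                down[top] = last
        if last is not None:
            up[i] = last
            last = None
        stack.append(i)

    # Pass 2: nextlIndex values (the next index with the same lcp-value
    # inside the same lcp-interval).
    stack = [0]
    for i in range(1, size):
        while lcptab[i] < lcptab[stack[-1]]:
            stack.pop()
        if lcptab[i] == lcptab[stack[-1]]:
            next_l_index[stack.pop()] = i
        stack.append(i)

    return up, down, next_l_index


if __name__ == "__main__":
    # lcp-table of "acaaacatat" followed by a sentinel that sorts last
    for name, column in zip(("up", "down", "nextlIndex"),
                            build_child_table([0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0])):
        print(name, column)
```

Each index is pushed and popped at most once per pass, so both passes run in time linear in the table size.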

11
5. Construction of the child-table
13
  • cldtab[i].nextlIndex contains the next l-index if
    lcptab[cldtab[i].nextlIndex] = lcptab[i].
  • It stores the cldtab[i].down value if
    lcptab[cldtab[i].nextlIndex] > lcptab[i].
  • cldtab[i].nextlIndex contains the value
    cldtab[i+1].up if lcptab[i] > lcptab[i+1].

14
6. Determining child intervals in constant time
16
7. Answering queries in optimal time
  • By using an additional table (similar to the
    lcp-table), decision queries can be answered in
    O(m + log n) time.
  • The logarithmic term is due to the binary
    searches that locate P in the suffix array of S.
  • Enumerative queries (finding all z occurrences
    of P) can be answered in optimal O(m + z) time.

18
8. Implementation details
  • We store most of the values of table lcptab in a
    table lcptab1 using n bytes. That is, for any
    i ∈ [1, n], lcptab1[i] = min{255, lcptab[i]}.
  • To access the remaining values (those of 255 or
    more) efficiently, we store them in an extra
    table llvtab. This contains all pairs
    (i, lcptab[i]) such that lcptab[i] ≥ 255,
    ordered by the first component.
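
The byte table plus overflow table can be sketched as follows. The function names `compress_lcp` and `lcp_value` are illustrative, index 0 is stored alongside 1..n for simplicity, and the binary-search lookup in llvtab is one plausible reading of "ordered by the first component", not a detail confirmed by the slides.

```python
from bisect import bisect_left


def compress_lcp(lcptab):
    """Split lcptab into a one-byte-per-entry table and a sorted overflow table."""
    lcptab1 = bytes(min(255, v) for v in lcptab)                 # values capped at 255
    llvtab = [(i, v) for i, v in enumerate(lcptab) if v >= 255]  # sorted by index i
    return lcptab1, llvtab


def lcp_value(lcptab1, llvtab, i):
    """Recover lcptab[i]: read the byte, fall back to llvtab when it is 255."""
    small = lcptab1[i]
    if small < 255:
        return small
    # (i,) sorts immediately before (i, v), so bisect_left lands on the pair for i.
    k = bisect_left(llvtab, (i,))
    return llvtab[k][1]


if __name__ == "__main__":
    table = [0, 3, 255, 1000, 254, 70000]
    small, large = compress_lcp(table)
    print(list(small), large)
    print([lcp_value(small, large, i) for i in range(len(table))])
```

The value 255 itself goes to llvtab as well, since a byte of 255 is used as the marker for "look in the overflow table".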

19
9. Experimental results
22
Conclusion
  • The suffix array is space efficient but slower
    than the suffix tree at answering decision
    queries.
  • The enhanced suffix array can answer decision
    queries in optimal O(m) time and find all z
    occurrences of a pattern P in optimal O(m + z)
    time by adding lcptab and cldtab.