Suffix Arrays - PowerPoint PPT Presentation

Provided by: nathanjoh

1
Suffix Arrays
  • Lecture 13 October 13, 2005
  • Algorithms in Biosequence Analysis
  • Nathan Edwards - Fall, 2005

2
Suffix Tree Memory Footprint
  • The space requirements of suffix trees can become
    prohibitive
  • Kurtz seems to be the king of small suffix trees
  • 20n bytes worst case,
  • 10n bytes in practice
  • Others claim various factors
  • 20-30 range, worst case
  • Assumes 32-bit integers (and pointers)

3
Suffix Tree Memory Footprint
  • If we use arrays at internal nodes, need O(m|Σ|)
    space
  • If we use lists at internal nodes, need O(n|Σ|)
    time to search for P
  • If the alphabet Σ is large, neither solution is
    satisfactory.
  • Suffix arrays provide one solution.

4
Suffix Arrays
  • Very space efficient (m integers)
  • Pattern lookup is nearly O(n) in practice
  • O(n + log2 m) worst case with 2m additional
    integers
  • Independent of alphabet size!
  • Easiest to describe (and construct) using suffix
    trees
  • Other (slower) methods exist

5
Suffix Arrays
  • Definition: Given a string T of length m.
  • The suffix array of T is an array Pos of
    integers from 1 to m specifying the lexicographic
    order of the m suffixes of T.
  • Suffix Pos(1) is lexicographically smallest.
  • Suffix Pos(i) sorts before suffix Pos(j) if i <
    j.
  • NB: Append $ as before; $ is the first lex. char.
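
The definition can be checked directly by sorting the suffixes. The sketch below is a naive illustration, not the construction method from the lecture; the function name suffix_array is an assumption made here, and positions are 1-based to match Pos.

```python
def suffix_array(t):
    """Return Pos: the 1-based start positions of the suffixes of t,
    listed in lexicographic order (naive sort, for illustration only)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


if __name__ == "__main__":
    print(suffix_array("mississippi"))  # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
    # '$' sorts before the lowercase letters in ASCII, so it is the first lex. char.
    print(suffix_array("xabxa$"))       # [6, 5, 2, 3, 4, 1]
```

Sorting whole suffixes this way can take quadratic time or worse on long repetitive strings, so it only serves to make the definition concrete.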

6
Example suffix array
  • T = mississippi
  • Pos = 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3

  Pos(i)  Suffix
  11      i
  8       ippi
  5       issippi
  2       ississippi
  1       mississippi
  10      pi
  9       ppi
  7       sippi
  4       sissippi
  6       ssippi
  3       ssissippi

7
Suffix Array Construction
  • Build suffix tree for T
  • Perform lexical depth-first search of suffix
    tree
  • output the suffix label of each leaf encountered
  • Therefore suffix array can be constructed in O(m)
    time.
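
The lexical depth-first traversal can be sketched as follows. To keep the example short it uses an uncompressed suffix trie built in O(m^2) time rather than a linear-time suffix tree, so it illustrates the traversal only, not the O(m) bound. The names Node and suffix_array_via_trie are hypothetical.

```python
class Node:
    """Trie node: children keyed by character, plus the 1-based start
    position of the suffix that ends here (None if no suffix ends here)."""
    def __init__(self):
        self.children = {}
        self.leaf = None


def suffix_array_via_trie(t):
    # Insert every suffix of t into the trie (quadratic; a real suffix
    # tree would compress unary paths and be built in linear time).
    root = Node()
    for start in range(len(t)):
        node = root
        for ch in t[start:]:
            node = node.children.setdefault(ch, Node())
        node.leaf = start + 1

    # Lexical depth-first search: visit children in alphabetical order
    # and output the suffix label of each leaf encountered.
    pos = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.leaf is not None:
            pos.append(node.leaf)
        # Push in reverse order so the smallest character is popped first.
        for ch in sorted(node.children, reverse=True):
            stack.append(node.children[ch])
    return pos


if __name__ == "__main__":
    print(suffix_array_via_trie("xabxa$"))  # [6, 5, 2, 3, 4, 1]
```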

8
Example
  • Suffix tree of T = xabxa$
  • Suffix array Pos = 6, 5, 2, 3, 4, 1

[Figure: suffix tree of T, with leaves labeled by suffix start positions 1 to 6]

9
Suffix array pattern search
  • If P is in T, then all the locations of P are
    consecutive suffixes in Pos.
  • Do binary search in Pos to find P!
  • Compare P with suffix Pos(m/2)
  • If lexicographically less, P is in first half of
    Pos
  • If lexicographically more, P is in second half of
    Pos
  • Iterate!
  • How long to compare P with suffix of T?
  • O(n) worst case!
  • Binary search on Pos takes O(n log m) time
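
A minimal sketch of this plain binary search, assuming a naively built Pos and 1-based positions. The names build_pos and find_occurrences are illustrative. Two searches bracket the consecutive block of suffixes that start with P; each comparison looks at up to n characters, giving the O(n log m) bound.

```python
def build_pos(t):
    """Naive suffix array with 1-based positions (illustration only)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def find_occurrences(t, pos, p):
    """Return all 1-based positions where p occurs in t, using binary
    search over pos. Each step compares p with the first len(p)
    characters of one suffix."""
    n = len(p)

    def prefix(k):
        # First n characters of the suffix Pos(k) (k is a 0-based index).
        s = pos[k] - 1
        return t[s:s + n]

    # First index whose suffix prefix is >= p.
    lo, hi = 0, len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First index whose suffix prefix is > p.
    hi = len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid

    # The occurrences are the consecutive entries pos[first:lo].
    return sorted(pos[first:lo])


if __name__ == "__main__":
    t = "mississippi"
    pos = build_pos(t)
    print(find_occurrences(t, pos, "issi"))  # [2, 5]
    print(find_occurrences(t, pos, "xyz"))   # []
```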

10
Suffix array binary search
  • Worst case will be rare
  • occurs only if many long prefixes of P appear in T
  • In random or large alphabet strings
  • expect to do fewer than log m comparisons
  • O(n + log m) running time
  • What happens at long prefixes of P?

11
Avoid redundant comparisons
  • Let L and R be the left and right indices of
    current search interval
  • Initialization: L = 1, R = m
  • In each iteration, check suffix Pos(M) where M =
    (L + R) / 2
  • Track longest prefix of suffix Pos(L) and suffix
    Pos(R) that match a prefix of P

12
Avoid redundant comparisons
  • l = length of longest common prefix of P and
    suffix Pos(L)
  • r = length of longest common prefix of P and
    suffix Pos(R)
  • mlr = min(l, r)
  • First mlr characters of all suffixes Pos(i), for
    L ≤ i ≤ R, must match P.
  • So start matching suffix Pos(M) at character mlr
    + 1
  • Straightforward to keep l, r, and mlr updated
  • Still O(n log m) worst case running time
  • But O(n + log m) in practice
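
The mlr bookkeeping can be sketched as below, assuming a naively built Pos. The code uses 0-based indices into Pos while the stored positions stay 1-based; the helper names match_from, lower_bound_mlr and occurs are hypothetical. The search returns the index of the first suffix that is lexicographically at least P, so P occurs in T exactly when that suffix starts with P.

```python
def suffix_array(t):
    """Naive suffix array with 1-based positions (illustration only)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def match_from(t, start, p, k):
    """Compare p with the suffix at 1-based `start`, given that their
    first k characters are already known to match. Returns (k, p_le):
    the full match length, and whether p <= suffix (a prefix counts)."""
    s = start - 1
    while k < len(p) and s + k < len(t) and t[s + k] == p[k]:
        k += 1
    p_le = k == len(p) or (s + k < len(t) and p[k] < t[s + k])
    return k, p_le


def lower_bound_mlr(t, pos, p):
    """Index of the first suffix in pos that is >= p (len(pos) if none)."""
    m = len(pos)
    l, p_le = match_from(t, pos[0], p, 0)
    if p_le:
        return 0
    r, p_le = match_from(t, pos[m - 1], p, 0)
    if not p_le:
        return m
    L, R = 0, m - 1          # invariant: suffix L < p <= suffix R
    while R - L > 1:
        M = (L + R) // 2
        # Every suffix between L and R shares its first min(l, r)
        # characters with p, so matching resumes at character mlr + 1.
        k, p_le = match_from(t, pos[M], p, min(l, r))
        if p_le:
            R, r = M, k
        else:
            L, l = M, k
    return R


def occurs(t, pos, p):
    i = lower_bound_mlr(t, pos, p)
    return i < len(pos) and t.startswith(p, pos[i] - 1)


if __name__ == "__main__":
    t = "mississippi"
    pos = suffix_array(t)
    print(occurs(t, pos, "ssip"))  # True
    print(occurs(t, pos, "ssix"))  # False
```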

13
Eschew redundant comparisons
  • Must limit redundant comparisons to one per
    binary search iteration
  • Implies O(n + log m) time bound
  • Need to get rid of comparisons between mlr + 1 and
    max(l, r).
  • Lcp(i,j) is the length of the longest common
    prefix of suffixes Pos(i) and Pos(j).
  • Search will use Lcp(L,M) and Lcp(M,R)

14
Using the Lcp
  • Simple case: if l = r, then compare P with
    suffix Pos(M) from mlr + 1
  • General case (l > r):
  • If Lcp(L,M) > l, P is between M and R
  • If Lcp(L,M) < l, P is between L and M
  • If Lcp(L,M) = l, compare P with suffix Pos(M)
    from l + 1
  • The case r > l is symmetric, using Lcp(M,R)
  • Search w/ Lcp takes O(n + log m) time
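
The case analysis can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: Lcp(L,M) and Lcp(M,R) are computed here by direct character comparison (lcp_suffixes), whereas the O(n + log m) bound requires them to be looked up from a precomputed table. Indices into Pos are 0-based, and all function names are hypothetical.

```python
def suffix_array(t):
    """Naive suffix array with 1-based positions (illustration only)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def lcp_suffixes(t, a, b):
    """Stand-in for a precomputed Lcp table: longest common prefix of
    the suffixes starting at 1-based positions a and b."""
    k = 0
    while a - 1 + k < len(t) and b - 1 + k < len(t) and t[a - 1 + k] == t[b - 1 + k]:
        k += 1
    return k


def match_from(t, start, p, k):
    """Resume comparing p with suffix `start` after k matched characters.
    Returns (match length, whether p <= suffix; a prefix counts)."""
    s = start - 1
    while k < len(p) and s + k < len(t) and t[s + k] == p[k]:
        k += 1
    return k, (k == len(p) or (s + k < len(t) and p[k] < t[s + k]))


def lower_bound_lcp(t, pos, p):
    """Index of the first suffix in pos that is >= p (len(pos) if none)."""
    m = len(pos)
    l, p_le = match_from(t, pos[0], p, 0)
    if p_le:
        return 0
    r, p_le = match_from(t, pos[m - 1], p, 0)
    if not p_le:
        return m
    L, R = 0, m - 1                      # invariant: suffix L < p <= suffix R
    while R - L > 1:
        M = (L + R) // 2
        if l == r:                       # simple case
            k, p_le = match_from(t, pos[M], p, l)
        elif l > r:
            x = lcp_suffixes(t, pos[L], pos[M])
            if x > l:                    # P is between M and R
                k, p_le = l, False
            elif x < l:                  # P is between L and M
                k, p_le = x, True
            else:                        # compare from character l + 1
                k, p_le = match_from(t, pos[M], p, l)
        else:                            # symmetric case, r > l
            x = lcp_suffixes(t, pos[M], pos[R])
            if x > r:
                k, p_le = r, True
            elif x < r:
                k, p_le = x, False
            else:
                k, p_le = match_from(t, pos[M], p, r)
        if p_le:
            R, r = M, k
        else:
            L, l = M, k
    return R


if __name__ == "__main__":
    t = "mississippi"
    pos = suffix_array(t)
    i = lower_bound_lcp(t, pos, "sis")
    print(pos[i])  # 4
```

In the branches that skip the comparison entirely, the new value of l or r is read off the Lcp value, which is what removes the redundant character comparisons.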

15
Computing the Lcp
  • Don't need all O(m^2) Lcp values
  • need Lcp(L,M) and Lcp(M,R) for binary search
  • Only about 2m Lcp values need be computed
  • Suffix tree contains Lcp(i,j) for all i,j
  • Can't keep suffix tree around!
  • Lcp(i,i+1) are determined during lex. depth-first
    search

16
Computing the Lcp
  • For all i, j, with j > i+1, Lcp(i,j) is the
    smallest Lcp(k,k+1) for k = i, ..., j-1.
  • So all necessary Lcp values can be determined
    during lex. depth-first search
  • Lcp values encode just enough structure of the
    suffix tree.
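
The min rule can be sketched as below. The adjacent values Lcp(k,k+1) are computed here by direct comparison rather than during a suffix tree traversal, which is an assumption made to keep the example self-contained. The hypothetical helper interval_lcps then fills in Lcp(L,R) for every interval a binary search over Pos can visit, using only minima of adjacent values. Indices into Pos are 0-based.

```python
def suffix_array(t):
    """Naive suffix array with 1-based positions (illustration only)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def lcp_suffixes(t, a, b):
    """Longest common prefix of the suffixes at 1-based positions a, b."""
    k = 0
    while a - 1 + k < len(t) and b - 1 + k < len(t) and t[a - 1 + k] == t[b - 1 + k]:
        k += 1
    return k


def adjacent_lcps(t, pos):
    """adj[k] = Lcp(k, k+1) for consecutive entries of pos."""
    return [lcp_suffixes(t, pos[k], pos[k + 1]) for k in range(len(pos) - 1)]


def interval_lcps(adj, L, R, table):
    """Store Lcp(L, R) in table for every binary-search interval inside
    [L, R], as the minimum of the adjacent values it spans."""
    if R - L == 1:
        table[(L, R)] = adj[L]
    else:
        M = (L + R) // 2
        table[(L, R)] = min(interval_lcps(adj, L, M, table),
                            interval_lcps(adj, M, R, table))
    return table[(L, R)]


if __name__ == "__main__":
    t = "mississippi"
    pos = suffix_array(t)
    adj = adjacent_lcps(t, pos)
    print(adj)  # [1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
    table = {}
    interval_lcps(adj, 0, len(pos) - 1, table)
    print(len(table))  # 19 intervals for m = 11, i.e. linear in m
```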