Suffix Trees - PowerPoint PPT Presentation

Suffix Trees. Lecture 11: October 6, 2005. Algorithms in Biosequence Analysis ... Can amortize cost of text preprocessing over many patterns. 3. O(m) build algorithms ... – PowerPoint PPT presentation


Title: Suffix Trees


1
Suffix Trees
  • Lecture 11 October 6, 2005
  • Algorithms in Biosequence Analysis
  • Nathan Edwards - Fall, 2005

2
Suffix Trees
  • Powerful preprocessing for the text T, |T| = m
  • Indexes all substrings of T
  • Build in O(m) time
  • Answer substring queries for a pattern P, |P| = n,
    in O(n) time
  • independent of m!
  • no preprocessing of pattern
  • Can amortize cost of text preprocessing over many
    patterns

3
O(m) build algorithms
  • Weiner 1973
  • McCreight 1976: better space efficiency
  • Ukkonen 1995: simpler exposition
  • Many bioinformatics algorithms start with "First
    build a suffix tree on S"

4
Definition
  • Suffix tree T of a string S, |S| = m
  • Rooted, directed tree
  • Exactly m leaves numbered 1, ..., m
  • Each edge is labeled with a substring of S
  • The edge-labels of the root-to-leaf-i path spell out
    S[i..m]
  • Internal nodes have at least two children
  • The labels of the edges out of a node have distinct
    first characters

5
Example
  • Suffix tree of S = xabxac

[Figure: suffix tree of xabxac, leaves numbered 1 to 6]
6
Example
  • Non-suffix tree of S = xabxa

[Figure: tree for xabxa with only three leaves, numbered 1 to 3]
7
Example
  • Suffix tree of S = xabxa$

[Figure: suffix tree of xabxa$, leaves numbered 1 to 6]
8
Definitions
  • Path-label (or node-label) of node v is the
    concatenation of the edge-labels of the root-to-v path.
  • String-depth of node v is the length of its
    node-label
  • A path from the root that ends inside an edge out
    of node u has a label formed by concatenating
    u's node-label and the partial edge-label

9
Naïve construction
  • Looks just like a keyword tree for
    P = {S[1..m], S[2..m], ..., S[m..m]}
  • Build in O(m^2) time
  • Merge non-branching nodes into a single edge

10
Exact string matching
  • Find all occurrences of P in T in O(n + m) time
  • Build the suffix tree of T in O(m) time
  • Match characters of P along the path from the root of
    the tree until either P is exhausted or there is a
    mismatch
  • If mismatch, no occurrence of P in T
  • If P is exhausted, every leaf below the end of P is a
    start position of P in T

11
Exact string matching
  • O(n) time to find the end of P
  • P occurs in T at j iff P is a prefix of T[j..m]
  • Use depth-first search from the end of P to collect
    start positions
  • # edges ≤ 2 × # leaves
  • O(k) time to enumerate the k occurrences
  • Total time O(m + n + k)
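
The matching procedure above can be sketched in Python. To stay short, this sketch swaps the suffix tree for an uncompressed suffix trie of nested dicts (O(m^2) space) and appends a terminator "$" that is assumed not to occur in T. The names build_suffix_trie and find_all are illustrative and do not come from the lecture.

```python
LEAF = "leaf"  # marker key; cannot collide with single-character edge keys


def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a trie of nested dicts."""
    text += "$"  # assumption: '$' does not occur in text
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1  # 1-based start position, as on the slides
    return root


def find_all(root, pattern):
    """Return the sorted 1-based start positions of pattern in the text."""
    node = root
    for ch in pattern:  # O(n): follow P from the root
        if ch not in node:
            return []  # mismatch: P does not occur in T
        node = node[ch]
    starts, stack = [], [node]
    while stack:  # depth-first search below the end of P
        current = stack.pop()
        for key, child in current.items():
            if key == LEAF:
                starts.append(child)
            else:
                stack.append(child)
    return sorted(starts)


trie = build_suffix_trie("xabxac")
occurrences = find_all(trie, "xa")
```

In the trie the search below P visits one node per character, so the O(k) enumeration bound only holds once non-branching paths are merged into single edges.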

12
Naïve construction 2
  • Insert suffixes one at a time, à la keyword tree
    construction: O(m^2)
  • Initial tree
  • single edge to leaf 1 with label S.
  • Insert each suffix S[2..m] to S[m..m] in turn
  • Follow path from root until mismatch
  • If at a node, add leaf edge for the remainder
  • If inside an edge, split the edge with a new node and
    add leaf edge for the remainder.
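
A minimal Python sketch of this insertion procedure follows. It assumes S ends in a character used nowhere else in S, so every suffix hits a mismatch. Edge labels are stored as string copies for readability rather than as index pairs, and the names Node, build_naive and leaf_labels are illustrative.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first character -> [edge label, child Node]
        self.leaf = leaf    # 1-based suffix start for leaves, else None


def build_naive(s):
    """Naive O(m^2) suffix tree of s; the last character must be unique."""
    if s.count(s[-1]) != 1:
        raise ValueError("last character of s must occur nowhere else")
    root = Node()
    root.children[s[0]] = [s, Node(leaf=1)]  # initial tree: one leaf edge
    for j in range(1, len(s)):               # insert suffix S[j+1..m]
        node, rest = root, s[j:]
        while True:
            edge = node.children.get(rest[0])
            if edge is None:                 # mismatch at a node: new leaf edge
                node.children[rest[0]] = [rest, Node(leaf=j + 1)]
                break
            label, child = edge
            k = 0                            # length of the common prefix
            while k < len(label) and label[k] == rest[k]:
                k += 1
            if k == len(label):              # whole edge matched: descend
                node, rest = child, rest[k:]
                continue
            mid = Node()                     # mismatch inside edge: split it
            mid.children[label[k]] = [label[k:], child]
            mid.children[rest[k]] = [rest[k:], Node(leaf=j + 1)]
            edge[0], edge[1] = label[:k], mid
            break
    return root


def leaf_labels(node, prefix=""):
    """Map each leaf number to its root-to-leaf path label."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    out = {}
    for label, child in node.children.values():
        out.update(leaf_labels(child, prefix + label))
    return out


tree = build_naive("xabxac")
```

Calling leaf_labels(tree) should return each suffix S[i..m] under leaf number i, which is the defining property of the suffix tree.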

13
Implicit Suffix Tree
  • Non-suffix tree built on S
  • Some suffixes might end inside the tree
  • Implicit suffix tree of S is a suffix tree iff the
    last character of S appears nowhere else in S
14
Example
  • Suffix tree of S = xabxa$

[Figure: suffix tree of xabxa$, leaves numbered 1 to 6]
15
Example
  • Implicit suffix tree of S = xabxa

[Figure: implicit suffix tree of xabxa, leaves numbered 1 to 3]
16
Ukkonen's O(m) construction
  • Construct implicit suffix tree T_i of prefix
    S[1..i], from T_1 to T_m.
  • Phase i+1 constructs T_{i+1} from T_i
  • Phase i+1 consists of i+1 extensions, which add
    S(i+1) to suffixes S[1..i], S[2..i], ... in T_i to
    make T_{i+1}

17
Ukkonen's O(m) construction: high-level
  • Construct T_1
  • For i = 1 to m - 1:  (Phase i+1: make T_{i+1} from T_i)
      For j = 1 to i + 1:  (Extension j)
        Find path from root labeled S[j..i]
        Extend path with S(i+1)

18
Suffix extension rules
  • Extension of S[j..i] = β with S(i+1)
  • Rule 1: β ends at a leaf ⇒ add S(i+1) to the leaf
    edge label
  • Rule 2: no path at end of β starts with S(i+1) ⇒ add
    a new leaf edge at end of β with label S(i+1);
    create a new node at end of β if necessary
  • Rule 3: a path at end of β starts with S(i+1) ⇒ do
    nothing
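
The phases, extensions and three rules can be put together as a direct Python sketch. This is the slow version that walks from the root in every extension, with no suffix links. Edge labels are stored as string copies for readability, and Node, locate, build_implicit and leaf_labels are illustrative names.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first character -> [edge label, child Node]
        self.leaf = leaf    # 1-based suffix start for leaves, else None


def locate(root, beta):
    """Walk beta from the root (beta is known to be in the tree).

    Returns (node, in_edge, edge, off): beta ends at `node` (entered via
    `in_edge`) when edge is None, otherwise `off` characters into `edge`.
    """
    node, in_edge, k = root, None, 0
    while k < len(beta):
        edge = node.children[beta[k]]
        step = min(len(edge[0]), len(beta) - k)
        k += step
        if step < len(edge[0]):
            return node, in_edge, edge, step
        node, in_edge = edge[1], edge
    return node, in_edge, None, 0


def build_implicit(s):
    """Return (root of T_m, list of rules applied in each phase)."""
    root = Node()
    root.children[s[0]] = [s[0], Node(leaf=1)]            # T_1
    log = []
    for i in range(1, len(s)):                            # phase i+1 adds S(i+1)
        c, rules = s[i], []
        for j in range(i + 1):                            # extension j+1
            node, in_edge, edge, off = locate(root, s[j:i])
            if edge is None and node.leaf is not None:    # Rule 1
                in_edge[0] += c
                rules.append(1)
            elif edge is None and c in node.children:     # Rule 3 at a node
                rules.append(3)
            elif edge is not None and edge[0][off] == c:  # Rule 3 in an edge
                rules.append(3)
            else:                                         # Rule 2
                if edge is not None:                      # split the edge first
                    mid = Node()
                    mid.children[edge[0][off]] = [edge[0][off:], edge[1]]
                    edge[0], edge[1] = edge[0][:off], mid
                    node = mid
                node.children[c] = [c, Node(leaf=j + 1)]
                rules.append(2)
        log.append(rules)
    return root, log


def leaf_labels(node, prefix=""):
    """Map each leaf number to its root-to-leaf path label."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    out = {}
    for label, child in node.children.values():
        out.update(leaf_labels(child, prefix + label))
    return out


root, log = build_implicit("axabxb")
```

For S = axabxb the last entry of log records the final phase: Rule 1 in extensions 1 to 4, Rule 2 in extension 5 (new leaf 5), and Rule 3 in extension 6.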

19
Example
  • Implicit suffix tree of S = axabx

[Figure: implicit suffix tree of axabx, leaves numbered 1 to 4]
20
Example
  • Implicit suffix tree of S = axabxb: Rule 1

[Figure: Rule 1 appends b to the leaf edges of leaves 1 to 4]
21
Example
  • Implicit suffix tree of S = axabxb: Rule 2

[Figure: Rule 2 adds a new leaf edge labeled b for leaf 5]
22
Example
  • Implicit suffix tree of S = axabxb: Rule 3

[Figure: Rule 3 leaves the tree unchanged, leaves numbered 1 to 5]
23
Extension j
  • Constant time, once the end of suffix β = S[j..i] of
    S[1..i] is found.
  • Extension j of phase i+1 takes O(i + 1 - j) =
    O(|β|) time to walk from the root
  • O(m^3) construction algorithm!
  • Suffix links find the end of β = S[j..i] in extension j
    of phase i+1 in amortized constant time.
  • O(m^2) algorithm.
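
The cubic bound for the version without suffix links follows by summing the root-walk costs. The derivation below assumes extension j of phase i+1 costs i + 1 - j character comparisons, with phases i + 1 = 2, ..., m:

```latex
\sum_{i=1}^{m-1} \sum_{j=1}^{i+1} (i + 1 - j)
  = \sum_{i=1}^{m-1} \frac{i(i+1)}{2}
  = \frac{(m-1)\,m\,(m+1)}{6}
  = O(m^3)
```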

24
Suffix links
  • Given
  • String xα: character x, string α
  • Internal node v with node-label xα
  • If there is a node s(v) with node-label α, then
    v → s(v) is called a suffix link.
  • If α is empty, s(v) is the root.

25
Suffix links
  • Lemma (in extension j of phase i+1): If a new
    internal node v with path-label xα is created,
    then either
  • the path labeled α ends at an internal node, or
  • a new internal node will be created at the end
    of the path labeled α, or
  • α is empty and s(v) is the root

26
Suffix links
  • Corollary (in extension j of phase i+1): If a
    new internal node v with path-label xα is
    created, then it has a suffix link by the end of
    extension j+1.
  • Corollary: If implicit suffix tree T_i has an
    internal node v with path-label xα, T_i must have a
    node s(v) with path-label α.
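
The second corollary can be checked on small strings without building the tree. The Python sketch below relies on the standard characterisation that a non-empty string labels an internal node exactly when it occurs in S followed by at least two different characters. The function names are illustrative.

```python
def internal_labels(s):
    """Path-labels of the internal nodes of the implicit suffix tree of s."""
    followers = {}
    for i in range(len(s)):
        for k in range(i + 1, len(s)):  # s[i:k] is followed by s[k]
            followers.setdefault(s[i:k], set()).add(s[k])
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map each internal path-label x+alpha to alpha ('' means the root)."""
    labels = internal_labels(s)
    links = {}
    for w in labels:
        alpha = w[1:]
        if alpha and alpha not in labels:
            raise AssertionError(f"no node s(v) for {w!r}")
        links[w] = alpha
    return links


links = suffix_links("xabxac")
```

For S = xabxac the internal nodes are xa and a, so the links are xa → a and a → root. Pass a string that already ends in a terminator to inspect a true suffix tree.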

27
Using suffix links
  • S[1..i] always ends at a leaf: it is the longest
    string in T_i
  • Finding path β = S[j..i] given β′ = S[j-1..i]:
    v = node at or above end of β′, γ = (partial)
    edge-label from v to the end of β′
  • If v is the root, find β by walking from the root
  • Otherwise, jump to s(v) and find γ below s(v)
  • If Rule 2 created a new internal node w at end of
    β′, then after β is found and extension j
    completed, s(w) must exist. Create suffix link
    w → s(w).

28
Using suffix links
[Figure: suffix link from node v to node s(v), with the end of β′ (extension j-1) below v and the end of β below s(v)]
29
Skip/Count Edge Traversal
  • No need to compare all characters of γ with
    edge-labels below s(v), we know it is there!
  • Avoid extension in O(# chars below s(v)) time
  • Just check the first character of each edge label
  • skip the other comparisons
  • need the length of each edge label
  • extension in O(# nodes below s(v)) time
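
A minimal Python sketch of the skip/count walk is shown here on a small hand-built path. The tree shape, the labels and the name skip_count are made up for illustration; a real implementation would store edge labels as index pairs so that the length is available in constant time.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character -> (edge label, child Node)


def skip_count(node, gamma):
    """Follow gamma below node, assuming gamma is known to be there.

    Only the first character of each edge is inspected; the rest of the
    edge is skipped using its length. Returns (node, edge, offset, hops):
    gamma ends at `node` if edge is None, else `offset` characters into
    `edge`, the edge leaving `node`.
    """
    k = hops = 0
    while k < len(gamma):
        edge = node.children[gamma[k]]    # pick the edge by first character
        label, child = edge
        if len(gamma) - k < len(label):   # gamma ends inside this edge
            return node, edge, len(gamma) - k, hops
        k += len(label)                   # skip the whole edge
        node = child
        hops += 1
    return node, None, 0, hops


# Hypothetical path below s(v): edges "xy", "zab", "cdefg".
s_v, a, b, c = Node(), Node(), Node(), Node()
s_v.children["x"] = ("xy", a)
a.children["z"] = ("zab", b)
b.children["c"] = ("cdefg", c)

end_node, end_edge, offset, hops = skip_count(s_v, "xyzabcd")
```

Here γ has seven characters but the walk does only three first-character lookups and crosses two whole edges before stopping two characters into the edge labeled cdefg.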

30
Skip/Count Edge Traversal
31
Skip/Count
  • Node-depth of v is the number of nodes on the
    root-to-v path.
  • Lemma: When any suffix link v → s(v) is used,
    node-depth(v) ≤ node-depth(s(v)) + 1

32
Node-depth lemma
33
O(m^2) Running Time
  • Lemma: Each phase takes O(m) time.
  • Corollary: Ukkonen's algorithm with suffix links
    runs in O(m^2) time.