Suffix Trees Come of Age in Bioinformatics

1
Suffix Trees Come of Age in Bioinformatics
  • Algorithms, Applications and Implementations
  • Dan Gusfield, U.C. Davis

2
S = AABAAB (positions 1 2 3 4 5 6)
[Figure: suffix tree for S; the six leaves are numbered 1 to 6 by the starting position of their suffix]
Every suffix of S is encoded by a root-to-leaf
walk. Every substring is encoded by a walk from the
root.

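A minimal sketch of the two walks in Python. It assumes a naive suffix trie (one character per edge, O(n^2) construction) rather than the compacted, linear-time suffix tree of the slides; the names `build_suffix_trie`, `is_substring`, `suffix_number` and the `$` terminator are illustrative assumptions.

```python
def build_suffix_trie(s):
    """Naive suffix trie: insert every suffix character by character."""
    text = s + "$"  # assumed terminator, so every suffix ends at its own leaf
    root = {}
    for start in range(len(text) - 1):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based suffix number, as on the slide
    return root


def is_substring(root, pattern):
    """A substring is spelled by some walk that starts at the root."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def suffix_number(root, suffix):
    """A suffix is spelled by a root-to-leaf walk; return its leaf number."""
    node = root
    for ch in suffix + "$":
        if ch not in node:
            return None
        node = node[ch]
    return node.get("leaf")
```

For S = AABAAB, `is_substring(trie, "BAA")` succeeds because BAA is spelled by a walk from the root, while `suffix_number(trie, "AAB")` follows a full root-to-leaf walk and reports leaf 4.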
3
S = AABAAB (positions 1 2 3 4 5 6)
SUFFIX ARRAY: 4 1 5 2 6 3
[Figure: the suffix tree for S, with the leaves read off in lexicographic order]
The suffixes listed in lexicographic order. Found
by a lexicographic DFS of the suffix tree.

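As a quick check of the definition, the suffix array can be sketched by sorting the suffixes directly. This is a naive construction (it compares whole suffixes), not the lexicographic DFS named on the slide; the function name `suffix_array` is an assumption.

```python
def suffix_array(s):
    """Return the 1-based starting positions of the suffixes of s in sorted order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


print(suffix_array("AABAAB"))  # [4, 1, 5, 2, 6, 3]
```

The suffix B (position 6) sorts before BAAB (position 3) because a proper prefix precedes any of its extensions.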
4
Basic Facts about Suffix Trees
  • Suffix tree for a string of length n can be built
    in O(n) time.
  • Basic implementations need about 25 bytes per
    character.
  • Numerous applications in string algorithms,
    achieving significant (sometimes amazing)
    speedups over naïve algorithms, e.g. O(m)-time
    substring search for a string of length m in a
    string of length n, regardless of n.
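To make the search claim concrete, here is a hedged sketch that uses a suffix array with binary search instead of a suffix tree. That swap costs O(m log n) per query rather than the O(m) of the suffix-tree walk, and the helper names `build_sa` and `find_occurrences` are assumptions for illustration.

```python
def build_sa(s):
    """Naive 0-based suffix array (sorting whole suffixes)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(s, sa, pattern):
    """Return all 0-based start positions of pattern in s, in increasing order."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])
```

Each comparison inspects at most m characters, so the query cost depends on n only through the logarithmic number of probes.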

5
Suffix Tree/Array Construction
  • Weiner 1973
  • McCreight 1974
  • Ukkonen 1996
  • Farach-Colton 1999
  • Manber-Myers 1991 (Suffix-Arrays)
  • O(n log n) time with small space. O(n) time is
    possible, as on the prior slide.
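A small sketch of suffix-array construction by prefix doubling, in the spirit of the Manber-Myers approach. It is an assumption-laden simplification: it relies on Python's comparison sort in every round, so it runs in O(n log^2 n) rather than O(n log n), and the name `suffix_array_doubling` is illustrative.

```python
def suffix_array_doubling(s):
    """0-based suffix array: sort suffixes by their first k, 2k, 4k, ... characters."""
    n = len(s)
    if n == 0:
        return []
    rank = [ord(c) for c in s]  # ranks by first character
    sa = list(range(n))
    k = 1
    while True:
        # Sort key: (rank of first k chars, rank of the next k chars or -1 if none).
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for a, b in zip(sa, sa[1:]):
            new_rank[b] = new_rank[a] + (key(a) != key(b))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            return sa
        k *= 2


print([i + 1 for i in suffix_array_doubling("AABAAB")])  # [4, 1, 5, 2, 6, 3]
```

Only integer rank pairs are compared in each round, which is what keeps the working space small.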

6
aabbaa
[Figure: suffix tree for the string aabbaa, with leaves numbered 1 to 6]

7
Bioinformatics Applications pre-1997
  • A small number of applications
  • Limited by the space requirements of the tree, small
    computer memories, the complexity of the
    algorithms, and poor locality of reference

8
Post-1997
  • Much has changed, and suffix trees and their
    relatives (particularly suffix arrays) are more
    widely applied in bioinformatics
  • 20-40 new publications with substantial
    connection to bioinformatics, post-1997

9
New Results since 1997
  • Fundamental algorithms: Farach-Colton
    construction; simpler algorithm for
    least common ancestor.
  • Implementation improvements (space) for suffix
    trees and arrays
  • New variants of suffix trees: affix trees,
    virtual suffix trees
  • New algorithms for tandem repeats

10
New Results since 1997
  • Approximate repeats and large-scale repeat
    structure
  • Fast lookups in databases
  • Substring frequencies
  • Motif and pattern discovery
  • Hybrid dynamic programming
  • Oligo and probe construction

11
New Results since 1997
  • Genome fragment assembly and resequencing
  • (multiple) whole genome comparison
  • Large-scale sequence comparison in resequencing
    projects
  • Misc. clever applications
  • Related less-used data structures

12
Suffix trees collect together repeated substring
prefixes
Example: speeding up the use of Position Specific
Scoring Matrices (PSSMs). "Accelerating protein
classification", ISMB 2000

PSSM:
position   1  2  3  4  5
A          3  1  4  2  1
T          2  4  4  3  2
C          1  3  5  6  2
G          4  2  4  6  2

Text: AACTGAACTG.AACTG
[Figure: walk around the tree to depth 5; the path spelling AACTG leads to all starting locations of AACTG]
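The saving can be sketched as follows: every distinct length-5 prefix is scored once, and the score is then shared by all of its starting locations. The sketch assumes a depth-limited trie instead of a full suffix tree; `score_windows` and the short test text are made up for illustration, while the matrix values are the ones on the slide.

```python
PSSM = {  # scores per position 1..5, copied from the slide
    "A": [3, 1, 4, 2, 1],
    "T": [2, 4, 4, 3, 2],
    "C": [1, 3, 5, 6, 2],
    "G": [4, 2, 4, 6, 2],
}


def score_windows(text, pssm, depth):
    """Score every length-`depth` window; shared prefixes are scored only once."""
    root = {}
    for start in range(len(text) - depth + 1):
        node = root
        for ch in text[start:start + depth]:
            node = node.setdefault(ch, {})
        node.setdefault("starts", []).append(start)  # leaf: starting locations

    scores = {}

    def walk(node, level, partial):
        if level == depth:
            for start in node["starts"]:
                scores[start] = partial
            return
        for ch, child in node.items():
            walk(child, level + 1, partial + pssm[ch][level])

    walk(root, 0, 0)
    return scores


scores = score_windows("AACTGAACTG", PSSM, 5)
print(scores[0], scores[5])  # 14 14: AACTG is scored once for both locations
```

The walk visits each trie edge once, so a window that occurs many times in the text costs no more to score than a window that occurs once.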
13
Whole Genome Alignment with MUMs
  • Delcher, Salzberg, NAR 1999
  • A MUM is a Maximal Unique Matching substring in
    two strings S and S'
  • MUMMER finds all MUMs, selects and aligns
    non-overlapping pairs of MUMs, and then recurses
    in the regions between adjacent selected MUMs.
  • Key issue: finding MUMs efficiently

14
Spotting MUMs with a suffix tree
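Example: S = ...AATCCGTG..., with ATCCGT starting at position 7,
and S' = ...GATCCGTA..., with ATCCGT starting at position 20.
So ATCCGT is a MUM if it only occurs in
these two places.
[Figure: the path from the root spelling ATCCGT ends at a node with exactly two sibling leaves, S,7 and S',20]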
15
Finding MUMs in a suffix tree
  • Build a suffix tree for the concatenation of S
    and S', noting at each leaf whether the suffix is
    from the S part or the S' part. Also note the
    prior character in S or S' for that position.
  • Do a DFS traversal of the S.T. to note at each
    node how many leaves below it are from S and how
    many from S'.
  • A node corresponds to a MUM if and only if the
    count there is 1 and 1, and the two prior
    characters are different.
  • For total string length n, the time used is O(n).
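The node test above can be sketched without an explicit tree. The sketch below assumes a naively sorted suffix array of the concatenation in place of the suffix tree (so it is not O(n)), and it assumes the separator characters `#` and `$` do not occur in the inputs; `find_mums` is an illustrative name. Two suffixes that are adjacent in sorted order, come from different strings, and share a longer prefix with each other than with either neighbor correspond to a node with leaf counts 1 and 1.

```python
def find_mums(s1, s2):
    """Return (substring, 0-based start in s1, 0-based start in s2) for every MUM."""
    text = s1 + "#" + s2 + "$"
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])

    def lcp(a, b):
        length = 0
        while a + length < n and b + length < n and text[a + length] == text[b + length]:
            length += 1
        return length

    lcps = [lcp(sa[i], sa[i + 1]) for i in range(n - 1)]
    mums = []
    for i in range(n - 1):
        a, b = sa[i], sa[i + 1]
        if (a < len(s1)) == (b < len(s1)):
            continue  # both suffixes come from the same string
        length = lcps[i]
        if length == 0:
            continue
        # Unique in both strings: neither neighboring suffix shares the prefix.
        if i > 0 and lcps[i - 1] >= length:
            continue
        if i + 1 < n - 1 and lcps[i + 1] >= length:
            continue
        p1, p2 = (a, b) if a < len(s1) else (b, a)
        # Left-maximal: the two prior characters must differ.
        if p1 > 0 and text[p1 - 1] == text[p2 - 1]:
            continue
        mums.append((text[p1:p1 + length], p1, p2 - len(s1) - 1))
    return sorted(mums, key=lambda mum: mum[1])


print(find_mums("AATCCGTG", "GATCCGTA"))  # [('ATCCGT', 1, 1)]
```

Right-maximality needs no separate test, because the shared prefix length of the two suffixes already stops at the first mismatch.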

16
Reducing the space needed for MUM finding: MUMMER
II - NAR spring 2002
[Figure: from the root, a path to a node v spelling xB, where B is a string, and a path to a node v' spelling B; a suffix link leads from v to v']
17
Space reduction when finding MUMs
  • Given strings S and S', build a suffix tree T,
    with suffix links, but for only the smaller of S
    and S', say S.
  • Using T, find for each position i in S', the
    longest substring starting at i which occurs in
    S exactly once. Mark the end location in T of
    that match. Each such match is a candidate MUM.

18
Finding MUMs in less space
  • To find the matches, walk S' through T, using
    suffix links to reduce search time.
  • After S' has been completely processed, every
    marked position in T that is visited only once
    and has no marked position below it corresponds
    to a MUM.
  • So all MUMs can be found in O(n) time, but in
    much reduced space.

19
Finding Tandem Repeats
Stoye and Gusfield, Theoretical Computer Science,
2001
TATAACTAACTAAGATT..
For a string of length n, a naïve algorithm might
use (on the order of) n^3 operations to find all
T.R.s
20
Example: Finding Tandem Repeats in Strings
TATAACTAACTAAGATT.
Every Tandem Repeat is part of a chain of tandem
repeats ending with a Branching Tandem
Repeat. To extend a chain, examine the first
character of the T.R. and the character after the
end of the T.R. If they are the same, a new T.R.
exists one place to the right.
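A naive sketch of both ideas on the example string: enumerating tandem repeats by direct comparison (the cubic-time baseline) and testing whether a repeat can be extended one place to the right. Positions are 0-based here, and the names `tandem_repeats` and `can_shift_right` are assumptions.

```python
def tandem_repeats(s):
    """All (start, half_length) pairs where s[start:start+2*half_length] is a square."""
    n = len(s)
    return [
        (i, w)
        for w in range(1, n // 2 + 1)
        for i in range(n - 2 * w + 1)
        if s[i:i + w] == s[i + w:i + 2 * w]
    ]


def can_shift_right(s, i, w):
    """True if the first character of the repeat equals the character after its end."""
    return i + 2 * w < len(s) and s[i] == s[i + 2 * w]


s = "TATAACTAACTAAGATT"
chain = [i for i, w in tandem_repeats(s) if w == 4]
print(chain)  # [2, 3, 4, 5]: TAACTAAC, AACTAACT, ACTAACTA, CTAACTAA
```

In this chain, `can_shift_right` holds at positions 2, 3 and 4 and fails at position 5, so the repeat at position 5 is the one that ends the chain.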
21
Example: Finding Tandem Repeats in Strings
TATAACTAACTAAGATT.
Branching Tandem Repeats
Every Tandem Repeat is part of a chain of tandem
repeats ending with a Branching Tandem Repeat.
22
Example: Finding Tandem Repeats in Strings
TATAACTAACTAAGATT.
Branching Tandem Repeats
Every Tandem Repeat is part of a chain of tandem
repeats ending with a Branching Tandem Repeat.
23
Example: Finding Tandem Repeats in Strings
TATAACTAACTAAGATT.
Branching Tandem Repeat
Every Tandem Repeat is part of a chain of tandem
repeats ending with a Branching Tandem
Repeat. So to find all T.R.s, find all
Branching T.R.s. Use a suffix tree to efficiently
find Branching T.R.s.
24
Finding Branching T.R.s
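  • Every Branching T.R. ends at a node of the suffix
    tree, so examine each node.

[Figure: a node v below the root; the string from the root to v has length 8, and leaves 5, 13 and 21 are below v. Since 13 = 5 + 8, the string to v occurs at positions 5 and 13, back to back.]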
25
Basic Algorithm
  • Build the S.T. for string S; process it to find the
    depth of each node, and do a DFS numbering of the
    nodes so that ancestry of two nodes can be
    determined in constant time. (Classic property of
    DFS)
  • Check each internal node to see if it is at the
    first or second part of a tandem repeat.

26
How to check
  • From a node v of depth d(v), walk the subtree to
    collect the leaf numbers below v.
  • If leaf k is collected, check if v is an ancestor
    of leaf k+d(v), and whether the characters that
    follow the two copies, S[k+d(v)] and S[k+2d(v)],
    differ. If so, then the string to v is
    the first part of a Branching T.R.
  • Alternatively, focus on k-d(v) to see if the
    string to v is the second part of a Branching
    T.R.
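The branching condition and the chain argument can be checked with a short sketch. It assumes direct string comparison in place of the suffix-tree ancestry test, so it shows the logic rather than the running time; positions are 0-based, d plays the role of d(v), and the function names are illustrative.

```python
def branching_tandem_repeats(s):
    """(k, d) pairs where s[k:k+2d] is a square and s[k+d] != s[k+2d] (or s ends)."""
    n = len(s)
    found = []
    for d in range(1, n // 2 + 1):
        for k in range(n - 2 * d + 1):
            is_square = s[k:k + d] == s[k + d:k + 2 * d]
            if is_square and (k + 2 * d == n or s[k + d] != s[k + 2 * d]):
                found.append((k, d))
    return found


def all_tandem_repeats(s):
    """Recover every tandem repeat by walking each chain leftward from its branching end."""
    repeats = set()
    for k, d in branching_tandem_repeats(s):
        repeats.add((k, d))
        # A repeat also starts one place to the left if s[k-1] == s[k-1+d].
        while k > 0 and s[k - 1] == s[k - 1 + d]:
            k -= 1
            repeats.add((k, d))
    return repeats


s = "TATAACTAACTAAGATT"
print(sorted(r for r in all_tandem_repeats(s) if r[1] == 4))
# [(2, 4), (3, 4), (4, 4), (5, 4)], recovered from the single branching repeat (5, 4)
```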

27
S = AABAAB (positions 1 2 3 4 5 6)
[Figure: suffix tree for S, with leaves numbered 1 to 6]

28
Finding Tandem Repeats
  • This approach gives O(n^2) time to find all
    branching tandem repeats. The time can be reduced
    to O(n log n) with a classic trick.
  • Related, more complex result: in O(n) time, one
    can mark in the suffix tree the endpoints of all
    tandem repeat species. Gusfield and Stoye 2000
    UCD techreport
  • Uses the fact (Frankel et al 1999) that there
    are at most 2n tandem repeat species; the proof
    is complex.
  • Example: aaaaaaaaaa has only five tandem repeat
    species, but 25 tandem repeat occurrences
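The counts in the example can be reproduced with a brute-force sketch. It takes a species to be a distinct repeated string and an occurrence to be a (position, length) pair; the function name `tandem_repeat_stats` is an assumption.

```python
def tandem_repeat_stats(s):
    """Return (number of species, number of occurrences) of tandem repeats in s."""
    n = len(s)
    species = set()
    occurrences = 0
    for w in range(1, n // 2 + 1):
        for i in range(n - 2 * w + 1):
            if s[i:i + w] == s[i + w:i + 2 * w]:
                species.add(s[i:i + 2 * w])
                occurrences += 1
    return len(species), occurrences


print(tandem_repeat_stats("a" * 10))  # (5, 25)
```

The 25 occurrences come from 9 + 7 + 5 + 3 + 1 starting positions for half-lengths 1 through 5, while the five species are the strings of 2, 4, 6, 8 and 10 a's.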

29
Space issues again
  • Use a virtual suffix tree or extended suffix
    array to reduce space
  • Def.: for positions i and j in string S, LCP(i,j)
    is the length of the longest common prefix of the
    suffixes starting at positions i and j in S.

30
S = AABAAB (positions 1 2 3 4 5 6)
SUFFIX ARRAY With LCP or string depth information
Suffix array:     4 1 5 2 6 3
LCP of neighbors: 3 1 2 0 1
[Figure: the suffix tree for S; the LCP values are the node depths at the time of descent after a backup in the lexicographic DFS]

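The LCP values of neighboring suffixes can be computed from the suffix array alone. The sketch below uses Kasai's linear-time method, which is a standard technique and not necessarily the one behind the slides; the suffix array itself is built naively, and both function names are assumptions.

```python
def build_sa(s):
    """Naive 0-based suffix array (sorting whole suffixes)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_of_neighbors(s, sa):
    """lcp[r] = length of the common prefix of the suffixes at sa[r] and sa[r + 1]."""
    n = len(s)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    lcp = [0] * max(n - 1, 0)
    h = 0
    for i in range(n):  # process suffixes in text order, reusing h - 1 matches
        r = rank[i]
        if r == 0:
            h = 0
            continue
        j = sa[r - 1]
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[r - 1] = h
        if h > 0:
            h -= 1
    return lcp


s = "AABAAB"
sa = build_sa(s)
print([p + 1 for p in sa], lcp_of_neighbors(s, sa))  # [4, 1, 5, 2, 6, 3] [3, 1, 2, 0, 1]
```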
31
Extended Suffix Array
  • String S, the suffix array A for S, and the LCP
    array for A completely determine the suffix tree
    T for S. Together they form the extended suffix
    array for S.
  • The suffix array and the LCP array can be found
    in O(n log n) time and small space, without an
    explicit suffix tree.
  • For many efficient algorithms using an explicit
    suffix tree, it is possible to simulate the
    algorithm using only the extended suffix array,
    greatly reducing space usage without increasing
    time.
  • S. Kurtz WABI 2002
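One way to see that the array pair determines the tree: the strings spelled by the internal (branching) nodes are exactly the common prefixes of neighboring suffixes. The sketch below recovers those labels from the suffix array and neighbor LCPs only; it assumes the suffix tree is built with a terminator, so that a suffix such as B that is a prefix of another suffix still ends at a node, and `internal_node_labels` is an illustrative name.

```python
def internal_node_labels(s):
    """Labels of the non-root internal suffix-tree nodes, read off SA plus LCP."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # naive suffix array

    def lcp(a, b):
        length = 0
        while a + length < n and b + length < n and s[a + length] == s[b + length]:
            length += 1
        return length

    labels = set()
    for r in range(n - 1):
        length = lcp(sa[r], sa[r + 1])
        if length > 0:  # depth 0 is the root
            labels.add(s[sa[r]:sa[r] + length])
    return labels


print(sorted(internal_node_labels("AABAAB")))  # ['A', 'AAB', 'AB', 'B']
```

The longest such label is also the longest repeated substring, which is one example of a suffix-tree query answered without the explicit tree.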

32
String Barcoding Oligo Construction
  • Uncovering Optimal Virus Signatures

Sam Rash and Dan Gusfield, RECOMB, April 2002
33
Motivation
  • Need for rapid virus detection
  • Given
      • unknown virus
      • database of known viruses
  • Problem
      • identify unknown virus quickly
  • Ideal solution
      • have sequence of
          • viruses in database
          • unknown virus
  • Solution
      • use BLAST (or any sequence similarity
        program/algorithm)

34
Motivation
  • Real World
      • only have sequence for pathogens in database
      • not possible to quickly sequence an unknown virus
      • can test for presence of small (< 50 bp) strings in
        unknown virus
      • substring tests
  • Another Idea
      • String Barcoding
      • use substring tests to uniquely identify each
        virus in the database
      • acquire unique barcode for each virus in database

35
Implementation
  • Basic Idea: formulate problem as an Integer
    Linear Program (ILP)
  • Enumerate some useful set of substrings from S
      • variable in ILP for each substring
  • Constraint for each pair of strings in S
      • means that at least one substring will be chosen
        to distinguish each pair
  • Objective Function
      • minimize sum of variables in ILP

36
Problem Definition
  • Formal Definition
      • given: a set of strings S
      • goal: find a set of strings S', the testing set,
        such that for each pair s1, s2 in S there exists at
        least one u in S' where u is a substring of exactly
        one of the two (wlog, only s1)
      • u is a signature substring
      • minimize |S'|
      • result: a barcode for each element of S
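For very small inputs the definition can be explored by brute force. The sketch below is an illustrative stand-in for the ILP and CPLEX, not the method of the talk: it enumerates all substrings, keeps one representative per distinct set of containing strings, and tries testing sets of increasing size. All function names are assumptions.

```python
from itertools import combinations


def occurrence_sets(strings):
    """Map each substring to the set of indices of the strings that contain it."""
    occ = {}
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                occ.setdefault(s[i:j], set()).add(idx)
    return occ


def minimum_testing_set(strings):
    """Smallest set of substrings such that every pair of strings is told apart."""
    n = len(strings)
    reps = {}  # one (shortest) substring per distinct occurrence set
    for sub, ids in sorted(occurrence_sets(strings).items(),
                           key=lambda kv: (len(kv[0]), kv[0])):
        if len(ids) < n:  # a substring of every string distinguishes nothing
            reps.setdefault(frozenset(ids), sub)
    candidates = list(reps.items())
    pairs = list(combinations(range(n), 2))
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            if all(any((a in ids) != (b in ids) for ids, _ in combo)
                   for a, b in pairs):
                return [sub for _, sub in combo]
    return None


def barcodes(strings, tests):
    """Barcode of a string: one 0/1 entry per test substring."""
    return {s: tuple(int(t in s) for t in tests) for s in strings}


viruses = ["cagtgc", "cagttc", "catgga"]
tests = minimum_testing_set(viruses)
print(tests, barcodes(viruses, tests))
```

For these three strings two tests are necessary, since a single test yields at most two distinct barcodes, and the search confirms that two suffice.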

37
Idea
  • strings
  • 1. cagtgc
  • 2. cagttc
  • 3. catgga
  • Each node in the suffix tree has a corresponding
    set of string IDs below it

Figure 1.1 - suffix tree for set of strings
cagtgc, cagttc, and catgga
38
Idea
  • strings
  • 1. cagtgc
  • 2. cagttc
  • 3. catgga
  • Each node in the suffix tree has a corresponding
    set of string IDs below it


Figure 1.1 - suffix tree for set of strings
cagtgc, cagttc, and catgga
39
Implementation: Suffix Trees
  • A root-edge walk
      • creates a string
      • that appears in exactly the strings that label the
        node at which it ends
  • Two root-edge walks ending on the same edge
      • both strings created by the walks
        occur in exactly the same set of original
        strings
      • can use either string

[Figure: example of a root-edge walk]
40
Practical Implementation
  • If two substrings occur in exactly the same set
    of original strings, only one need be considered
  • Use strings from suffix tree for each uniquely
    labeled node
  • Build ILP as discussed
  • Solve ILP using CPLEX
  • Acquire barcode and signatures for each original
    string
  • signature is the set of substring tests occurring
    in a string

41
Implementation Example

minimize
    V18 + V22 + V11 + V17 + V8          (objective function)
st
    V18 + V22 + V11 + V17 + V8 >= 2     (this is the theoretical minimum)
    V18 + V17 + V8 >= 1                 (constraint to cover pair 1,2)
    V22 + V11 + V8 >= 1                 (constraint to cover pair 1,3)
    V18 + V22 + V11 + V17 >= 1          (constraint to cover pair 2,3)
binaries                                (all variables are 0/1)
    V18 V22 V11 V17 V8
end

          tg (V18)   atgga (V22)
cagtgc    1          0
cagttc    0          0
catgga    1          1
Figure 1.4 - barcodes

cagtgc    tg
cagttc    (empty)
catgga    tg, atgga
Figure 1.5 - signatures
42
Implementation Extensions
  • minimum and maximum lengths on signature
    substrings
  • acquire barcodes/signatures for only a subset of
    input strings (with respect to the whole set)
  • minimum string edit distance between chosen
    signature substrings
  • redundancy
      • require r signature substrings to differentiate
        each pair
      • adds a higher level of confidence that signatures
        remain valid even with mutations

43
Results
  • Works quickly on most moderately sized datasets
    (especially when redundancy > 2)
  • dataset properties
      • 50k virus genomes taken from NCBI (Genbank)
      • 50-150 virus genomes
      • average length of each sequence 1000 characters
      • total input size ranged from approximately 50,000
        to 150,000 characters
  • increasing dataset size scaled approximately
    linearly
  • reach 25% gap (at most 1/3 more than optimum) in
    just a few minutes
  • reach small gap (often < 1%) in 4 hours

44
Summary
  • Numerous applications of Suffix Trees and
    relatives in Bioinformatics
  • More yet to be found
  • Suffix tree and array software at my UCD website.