Title: Suffix Trees Come of Age in Bioinformatics
1. Suffix Trees Come of Age in Bioinformatics
- Algorithms, Applications and Implementations
- Dan Gusfield, U.C. Davis
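2. Suffix Tree for S = AABAAB
S = AABAAB, positions 1 2 3 4 5 6
[Figure: suffix tree for S. Each leaf is numbered 1 to 6 by the starting position of the suffix it represents.]
Every suffix of S is encoded by a root-to-leaf walk. Every substring is encoded by a walk from the root.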
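3. Suffix Array for S = AABAAB
S = AABAAB, positions 1 2 3 4 5 6
[Figure: the suffix tree for S, with its leaves read off in lexicographic order.]
Suffix array: 4 1 5 2 6 3
The suffixes listed in lexicographic order. Found by a lexicographic DFS of the suffix tree.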
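A minimal sketch of what the suffix array contains, assuming plain Python and a naive sort of the suffixes rather than a lexicographic DFS of a suffix tree. The name `suffix_array` is illustrative and does not come from the slides.

```python
def suffix_array(s):
    """Return the 1-based start positions of the suffixes of s in lexicographic order.

    Naive illustration: the suffixes are sorted directly, which is far slower
    than reading them off a suffix tree. A suffix that is a prefix of another
    suffix sorts before it.
    """
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


if __name__ == "__main__":
    text = "AABAAB"
    for position in suffix_array(text):
        print(position, text[position - 1:])
```

Printing each position next to its suffix makes it easy to check the ordering by eye.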
4. Basic Facts about Suffix Trees
- A suffix tree for a string of length n can be built in O(n) time.
- Basic implementations need about 25 bytes per character.
- Numerous applications in string algorithms, achieving significant (sometimes amazing) speedups over naïve algorithms. Example: once the tree is built, O(m) time substring search for a string of length m in a string of length n, regardless of n.
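The O(m) search can be sketched with an uncompressed suffix trie standing in for the suffix tree. This is an assumption made for brevity: the trie below takes quadratic time and space to build, unlike a real suffix tree, but the query walks one character per step in the same way. The function names are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (quadratic, illustration only)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def occurs(trie, pattern):
    """Walk from the root, one step per pattern character: O(m) steps, independent of n."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


if __name__ == "__main__":
    trie = build_suffix_trie("AABAAB")
    for pattern in ("BAA", "ABB"):
        print(pattern, occurs(trie, pattern))
```

The query never looks at the text itself, only at the tree, which is why its cost depends on m alone.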
5. Suffix Tree/Array Construction
- Weiner 1973
- McCreight 1974
- Ukkonen 1996
- Farach-Colton 1999
- Manber-Myers 1991 (suffix arrays)
- O(n log n) time with small space. O(n) time is possible, as on the prior slide.
6. aabbaa
[Figure: suffix tree for the string aabbaa, with leaves numbered 1 to 6.]
7. Bioinformatics Applications pre-1997
- A small number of applications
- Limited by the space requirements of the tree, small computer memories, the complexity of the algorithms, and poor locality of reference
8. Post 1997
- Much has changed, and suffix trees and their relatives (particularly suffix arrays) are more widely applied in bioinformatics
- 20-40 new publications with substantial connection to bioinformatics, post 1997
9. New Results since 1997
- Fundamental algorithms: Farach-Colton construction; simpler algorithm for least-common-ancestor.
- Implementation improvements (space) for suffix trees and arrays
- New variants of suffix trees: affix trees, virtual suffix trees
- New algorithms for tandem repeats
10. New Results since 1997
- Approximate repeats and large-scale repeat structure
- Fast lookups in databases
- Substring frequencies
- Motif and pattern discovery
- Hybrid dynamic programming
- Oligo and probe construction
11. New Results since 1997
- Genome fragment assembly and resequencing
- (multiple) whole genome comparison
- Large-scale sequence comparison in resequencing projects
- Misc. clever applications
- Related less-used data structures
12. Suffix trees collect together repeated substring prefixes
Example: speeding up the use of Position Specific Scoring Matrices (PSSMs). Accelerating protein classification, ISMB 2000

PSSM:
position  1  2  3  4  5
A         3  1  4  2  1
T         2  4  4  3  2
C         1  3  5  6  2
G         4  2  4  6  2

Text: AACTGAACTG.AACTG
[Figure: walk around the suffix tree of the text to depth 5. The leaves below the path spelling AACTG give the starting locations of AACTG.]
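The idea of scoring each shared prefix only once can be sketched as follows. This is a hedged illustration, not the method of the cited paper: a depth-limited trie stands in for the suffix tree, the text is assumed to be AACTGAACTGAACTG (the slide's text with the period dropped), and the function names are hypothetical. The PSSM values are the ones on the slide.

```python
PSSM = {
    "A": [3, 1, 4, 2, 1],
    "T": [2, 4, 4, 3, 2],
    "C": [1, 3, 5, 6, 2],
    "G": [4, 2, 4, 6, 2],
}


def build_depth_trie(text, depth):
    """Trie of all length-`depth` windows; depth-`depth` nodes keep 1-based start positions."""
    root = {"children": {}, "starts": []}
    for i in range(len(text) - depth + 1):
        node = root
        for ch in text[i:i + depth]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
        node["starts"].append(i + 1)
    return root


def score_by_walk(node, pssm, pos=0, partial=0, prefix="", out=None):
    """Walk the trie once; a prefix shared by many windows is scored a single time."""
    if out is None:
        out = {}
    if not node["children"]:
        out[prefix] = (partial, node["starts"])
        return out
    for ch, child in node["children"].items():
        score_by_walk(child, pssm, pos + 1, partial + pssm[ch][pos], prefix + ch, out)
    return out


if __name__ == "__main__":
    scores = score_by_walk(build_depth_trie("AACTGAACTGAACTG", 5), PSSM)
    for window, (score, starts) in sorted(scores.items()):
        print(window, score, starts)
```

Each distinct window is scored once and reported together with all of its starting locations, instead of being rescored at every occurrence.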
13. Whole Genome Alignment with MUMs
- Delcher Salzberg NAR 1999
- A MUM is a Maximal Unique Matching substring in two strings S and S'.
- MUMMER finds all MUMs, selects and aligns non-overlapping pairs of MUMs, and then recurses in the regions between adjacent selected MUMs.
- Key issue: finding MUMs efficiently
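14. Spotting MUMs with a suffix tree
Example: S contains AATCCGTG and S' contains GATCCGTA; the figure marks the two occurrences of ATCCGT with positions 7 and 20. So ATCCGT is a MUM if it only occurs in these two places.
[Figure: the path from the root spelling ATCCGT ends at a node with exactly two sibling leaves, one for the occurrence in S and one for the occurrence in S'.]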
15. Finding MUMs in a suffix tree
- Build a suffix tree for the concatenation of S and S', noting at each leaf whether the suffix is from the S part or the S' part. Also note the prior character in S or S' for that position.
- Do a DFS traversal of the S.T. to note at each node how many leaves below it are from S and how many from S'.
- A node corresponds to a MUM if and only if the count there is 1 and 1, and the two prior characters are different.
- For total string length n, the time used is O(n).
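The node test above can be sketched without an explicit tree. As an assumption for brevity, the code below sorts the suffixes of the concatenation naively instead of building a suffix tree in O(n) time: a node with exactly two leaves below it then shows up as two neighboring suffixes whose common prefix is longer than the one they share with either outer neighbor. The separators `#` and `$` are assumed not to occur in the inputs, and the function names are hypothetical.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def find_mums(s, t):
    """Return sorted (start in s, start in t, substring) triples, 1-based, for every MUM."""
    text = s + "#" + t + "$"  # assumption: '#' and '$' appear in neither input
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    lcp = [0] * (len(sa) + 1)  # lcp[k] compares sa[k-1] with sa[k]; both ends stay 0
    for k in range(1, len(sa)):
        lcp[k] = common_prefix_len(text[sa[k - 1]:], text[sa[k]:])
    mums = []
    for k in range(1, len(sa)):
        length = lcp[k]
        # Exactly two leaves below the node: both outer neighbors share less.
        if length == 0 or lcp[k - 1] >= length or lcp[k + 1] >= length:
            continue
        i, j = sorted((sa[k - 1], sa[k]))
        if not (i < len(s) < j):  # count must be 1 and 1: one leaf per input string
            continue
        if i > 0 and text[i - 1] == text[j - 1]:  # prior characters must differ
            continue
        mums.append((i + 1, j - len(s), text[i:i + length]))
    return sorted(mums)


if __name__ == "__main__":
    print(find_mums("AATCCGTG", "GATCCGTA"))
```

Here the two short strings from the example slide are treated as the complete inputs, which is an assumption; shorter shared pieces such as CCGT are rejected because their prior characters agree.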
16. Reducing the space needed for MUM finding: MUMMER II, NAR spring 2002
[Figure: a path from the root to a node v, spelling xB, where x is a single character and B is a string, and a path from the root to a node v', spelling B. The suffix link points from v to v'.]
17. Space reduction when finding MUMs
- Given strings S and S', build a suffix tree T, with suffix links, but for only the smaller of S and S', say S.
- Using T, find for each position i in S' the longest substring starting at i which occurs in S exactly once. Mark the end location in T of that match. Each such match is a candidate MUM.
18. Finding MUMs in less space
- To find the matches, walk S' through T, using suffix links to reduce search time.
- After S' has been completely processed, every marked position in T that is visited only once, and has no marked position below it, corresponds to a MUM.
- So all MUMs can be found in O(n) time, but in much reduced space.
19. Finding Tandem Repeats
Stoye and Gusfield, Theoretical Computer Science, 2001
TATAACTAACTAAGATT..
For a string of length n, a naïve algorithm might use (on the order of) n^3 operations to find all tandem repeats (T.R.s).
20. Example: Finding Tandem Repeats in Strings
TATAACTAACTAAGATT.
Every Tandem Repeat is part of a chain of tandem repeats ending with a Branching Tandem Repeat. To extend a chain, examine the first character of the T.R. and the character after the end of the T.R. If they are the same, a new T.R. exists one place to the right.
21. Example: Finding Tandem Repeats in Strings
TATAACTAACTAAGATT.
Branching Tandem Repeats
Every Tandem Repeat is part of a chain of tandem
repeats ending with a Branching Tandem Repeat.
23. Example: Finding Tandem Repeats in Strings
TATAACTAACTAAGATT.
Branching Tandem Repeat
Every Tandem Repeat is part of a chain of tandem repeats ending with a Branching Tandem Repeat. So to find all T.R.s, find all Branching T.R.s. Use a suffix tree to efficiently find Branching T.R.s.
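The chain property can be checked with a short sketch. It is only an illustration under stated assumptions: the branching T.R.s are found here by brute force rather than with a suffix tree, positions are 0-based, a T.R. is written as (start, period), and all function names are hypothetical.

```python
def all_tandem_repeats(s):
    """Every (start, period) such that s[start:start+period] is immediately repeated (naive)."""
    n = len(s)
    return {
        (i, p)
        for p in range(1, n // 2 + 1)
        for i in range(n - 2 * p + 1)
        if s[i:i + p] == s[i + p:i + 2 * p]
    }


def is_branching(s, i, p):
    """Branching: the first character differs from the character after the end of the T.R."""
    end = i + 2 * p
    return end == len(s) or s[i] != s[end]


def expand_chains(s, branching):
    """Recover every T.R. by walking each chain leftwards from its branching T.R."""
    found = set()
    for i, p in branching:
        found.add((i, p))
        # Shifting one place left keeps a T.R. exactly when the new first
        # character equals the character one period later.
        while i > 0 and s[i - 1] == s[i - 1 + p]:
            i -= 1
            found.add((i, p))
    return found


if __name__ == "__main__":
    s = "TATAACTAACTAAGATT"
    everything = all_tandem_repeats(s)
    branching = {tr for tr in everything if is_branching(s, *tr)}
    print(sorted(branching))
    print(expand_chains(s, branching) == everything)
```

For the example string, the chain TAACTAAC, AACTAACT, ACTAACTA, CTAACTAA ends with the branching repeat CTAACTAA, because the next character G differs from C.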
24. Finding Branching T.R.s
- Every Branching T.R. ends at a node of the suffix tree, so examine each node.
[Figure: a node v below the root. The string from the root to v has length 8, and leaves 5, 13 and 21 are below v. Note that 13 = 5 + 8.]
25. Basic Algorithm
- Build the S.T. for string S; process it to find the depth of each node, and do a DFS numbering of the nodes so that ancestry of two nodes can be determined in constant time. (Classic property of DFS)
- Check each internal node to see if it is at the first or second part of a tandem repeat.
26. How to check
- From a node v of depth d(v), walk the subtree to collect the leaf numbers below v.
- If leaf k is collected, check if v is an ancestor of leaf k + d(v), and whether the next characters after the two copies, at positions k + d(v) and k + 2d(v), differ. If so, then the string to v is the first part of a Branching T.R.
- Alternatively, focus on k - d(v) to see if the string to v is the second part of a Branching T.R.
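A minimal sketch of this check, assuming the leaf sets are simulated rather than read from a suffix tree: for each length d, start positions are grouped by the substring of length d that begins there, which plays the role of "the leaves below v" for a node v of depth d. The terminator `$` is assumed not to occur in the input, positions in the result are 1-based, and the function name is hypothetical.

```python
from collections import defaultdict


def branching_tandem_repeats(s):
    """Return {(start, d)}: a branching T.R. of half-length d starts at 1-based `start`."""
    text = s + "$"  # assumption: '$' does not occur in s; it ends the last chain
    n = len(s)
    result = set()
    for d in range(1, n // 2 + 1):
        # Group start positions by substring: the "leaves below v" at depth d.
        leaves_below = defaultdict(set)
        for k in range(n - d + 1):
            leaves_below[s[k:k + d]].add(k)
        for leaves in leaves_below.values():
            for k in leaves:
                # Leaf k + d below the same node means the string repeats at k;
                # differing next characters make the repeat branching.
                if k + d in leaves and text[k + d] != text[k + 2 * d]:
                    result.add((k + 1, d))
    return result


if __name__ == "__main__":
    print(sorted(branching_tandem_repeats("AABAAB")))
```

For AABAAB this reports the two occurrences of AA and the whole string AABAAB, each of which cannot be shifted one place to the right.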
27. AABAAB
[Figure: the suffix tree for S = AABAAB (positions 1 to 6), as shown on slide 2.]
28. Finding Tandem Repeats
- This approach gives O(n^2) time to find all branching tandem repeats. The time can be reduced to O(n log n) with a classic trick.
- Related, more complex result: in O(n) time, one can mark in the suffix tree the endpoints of all tandem repeat species. Gusfield and Stoye 2000, UCD tech report
- Uses the fact (Frankel et al 1999) that there are at most 2n tandem repeat species (complex proof).
- Example: aaaaaaaaaa has only five tandem repeat species, but 25 tandem repeat occurrences
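The difference between species and occurrences can be confirmed with a brute-force count. This is a naive sketch for small inputs only, and the function name is hypothetical.

```python
def tandem_repeat_counts(s):
    """Return (number of distinct tandem repeat strings, number of occurrences)."""
    n = len(s)
    species = set()
    occurrences = 0
    for p in range(1, n // 2 + 1):          # p is the length of the repeated half
        for i in range(n - 2 * p + 1):
            if s[i:i + p] == s[i + p:i + 2 * p]:
                occurrences += 1
                species.add(s[i:i + 2 * p])
    return len(species), occurrences


if __name__ == "__main__":
    print(tandem_repeat_counts("a" * 10))  # (5, 25)
```

The five species are aa, aaaa, aaaaaa, aaaaaaaa and aaaaaaaaaa, occurring 9, 7, 5, 3 and 1 times respectively.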
29. Space issues again
- Use a virtual suffix tree or extended suffix array to reduce space
- Def: for positions i and j in string S, LCP(i,j) is the length of the longest matching substring starting at positions i and j in S.
30. AABAAB: Suffix Array with LCP or String Depth Information
S = AABAAB, positions 1 2 3 4 5 6
Suffix array: 4 1 5 2 6 3
LCP of neighbors: 3 1 2 0 1
[Figure: the suffix tree for S. The LCP values are the node depths at the time of descent after a backup in the lexicographic DFS.]
31. Extended Suffix Array
- String S, the suffix array A for S, and the LCP array for A completely determine the suffix tree T for S. This is the extended suffix array for S.
- The suffix array and the LCP array can be found in O(n log n) time and small space, without an explicit suffix tree.
- For many efficient algorithms using an explicit suffix tree, it is possible to simulate the algorithm using only the extended suffix array, greatly reducing space usage without increasing time.
- S. Kurtz WABI 2002
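A hedged sketch of an extended suffix array and of one tree algorithm simulated on it. Both arrays are built naively here (sorting whole suffixes and comparing neighbors character by character), which is an assumption for brevity and not the O(n log n) construction mentioned above. On a suffix tree, the longest repeated substring is the deepest internal node; on the arrays it is simply the largest LCP value. The function names are hypothetical.

```python
def extended_suffix_array(s):
    """Return (suffix array with 1-based positions, LCP values of neighboring suffixes)."""
    order = sorted(range(len(s)), key=lambda i: s[i:])
    lcp = []
    for a, b in zip(order, order[1:]):
        length = 0
        while a + length < len(s) and b + length < len(s) and s[a + length] == s[b + length]:
            length += 1
        lcp.append(length)
    return [i + 1 for i in order], lcp


def longest_repeated_substring(s):
    """Deepest internal node of the suffix tree, read off as the maximum LCP value."""
    sa, lcp = extended_suffix_array(s)
    if not lcp or max(lcp) == 0:
        return ""
    k = lcp.index(max(lcp))
    start = sa[k] - 1
    return s[start:start + lcp[k]]


if __name__ == "__main__":
    print(extended_suffix_array("AABAAB"))
    print(longest_repeated_substring("AABAAB"))
```

For AABAAB the LCP values of neighbors come out as 3 1 2 0 1, and the longest repeated substring is AAB.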
32. String Barcoding: Oligo Construction
- Uncovering Optimal Virus Signatures
Sam Rash and Dan Gusfield, RECOMB, April 2002
33. Motivation
- Need for rapid virus detection
- Given
  - unknown virus
  - database of known viruses
- Problem
  - identify unknown virus quickly
- Ideal solution
  - have sequence of
    - viruses in database
    - unknown virus
- Solution
  - use BLAST (or any sequence similarity program/algorithm)
34. Motivation
- Real World
  - only have sequence for pathogens in database
  - not possible to quickly sequence an unknown virus
  - can test for presence of small (< 50 bp) strings in unknown virus
  - substring tests
- Another Idea
  - String Barcoding
  - use substring tests to uniquely identify each virus in the database
  - acquire unique barcode for each virus in database
35. Implementation
- Basic Idea: Formulate problem as an Integer Linear Program (ILP)
- Enumerate some useful set of substrings from S
  - variable in ILP for each substring
- Constraint for each pair of strings in S
  - means that at least one substring will be chosen to distinguish each pair
- Objective Function
  - Minimize sum of variables in ILP
36. Problem Definition
- Formal Definition
  - given
    - set of strings S
  - goal
    - find set of strings S', the testing set
    - for each pair s1, s2 in S, there exists at least one u in S' where u is a substring of (wlog) only s1
    - u is a signature substring
    - minimize |S'|
  - result
    - barcode for each element of S
37. Idea
- strings
- 1. cagtgc
- 2. cagttc
- 3. catgga
- Each node in the suffix tree has a corresponding
set of string IDs below it
Figure 1.1 - suffix tree for set of strings
cagtgc, cagttc, and catgga
39. Implementation: Suffix Trees
- root-edge walk
  - Creates a string
  - The string appears in exactly the strings that label the node at which it ends
- 2 root-edge walks ending on the same edge
  - Both strings created by the walks occur in exactly the same set of original strings
  - Can use either string
[Figure: example of a root-edge walk]
40. Practical Implementation
- If two substrings occur in exactly the same set of original strings, only one need be considered
- Use strings from suffix tree for each uniquely labeled node
- Build ILP as discussed
- Solve ILP using CPLEX
- Acquire barcode and signatures for each original string
  - signature is the set of substring tests occurring in a string
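The pipeline can be sketched on the three example strings. Two substitutions are made here and should be read as assumptions: substrings are enumerated by brute force rather than from a suffix tree, and an exhaustive search over small subsets stands in for the ILP solved with CPLEX, so this only works for tiny inputs. All function names are hypothetical.

```python
from itertools import combinations


def substrings(s):
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def minimum_testing_set(strings):
    """Smallest list of substrings that distinguishes every pair of input strings."""
    # Keep one (shortest) substring per distinct set of containing strings.
    by_profile = {}
    candidates = sorted(set().union(*(substrings(s) for s in strings)),
                        key=lambda u: (len(u), u))
    for u in candidates:
        profile = tuple(u in s for s in strings)
        if all(profile):
            continue  # occurs everywhere, so it distinguishes no pair
        by_profile.setdefault(profile, u)
    options = list(by_profile.items())
    pairs = list(combinations(range(len(strings)), 2))
    for size in range(len(options) + 1):  # brute-force stand-in for the ILP
        for chosen in combinations(options, size):
            if all(any(p[a] != p[b] for p, _ in chosen) for a, b in pairs):
                return [u for _, u in chosen]
    return None  # some pair cannot be distinguished (for example duplicate inputs)


def barcodes(strings, tests):
    """Barcode of each string: one 0/1 entry per chosen substring test."""
    return {s: tuple(int(u in s) for u in tests) for s in strings}


if __name__ == "__main__":
    viruses = ["cagtgc", "cagttc", "catgga"]
    tests = minimum_testing_set(viruses)
    print(tests)
    print(barcodes(viruses, tests))
```

The search may return a different pair of substrings than the one on the example slide, since several testing sets of size two exist; any of them gives three distinct barcodes.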
41. Implementation Example
    minimize
      V18 + V22 + V11 + V17 + V8         (objective function)
    st
      V18 + V22 + V11 + V17 + V8 >= 2    (this is the theoretical minimum)
      V18 + V17 + V8 >= 1                (constraint to cover pair 1,2)
      V22 + V11 + V8 >= 1                (constraint to cover pair 1,3)
      V18 + V22 + V11 + V17 >= 1         (constraint to cover pair 2,3)
    binaries                             (all variables are 0/1)
      V18 V22 V11 V17 V8
    end

            tg (V18)   atgga (V22)
    cagtgc  1          0
    cagttc  0          0
    catgga  1          1
Figure 1.4 - barcodes

    cagtgc  tg
    cagttc  (empty set)
    catgga  tg, atgga
Figure 1.5 - signatures
42. Implementation Extensions
- minimum and maximum lengths on signature substrings
- acquire barcodes/signatures for only a subset of input strings (with respect to the whole set)
- minimum string edit distance between chosen signature substrings
- redundancy
  - require r signature substrings to differentiate each pair
  - adds a higher level of confidence that signatures remain valid even with mutations
43. Results
- Works quickly on most moderately sized datasets (especially when redundancy > 2)
- dataset properties
  - 50k virus genomes taken from NCBI (Genbank)
  - 50-150 virus genomes
  - average length of each sequence 1000 characters
  - total input size ranged from approximately 50,000 to 150,000 characters
- increasing dataset size scaled approximately linearly
- reach 25% gap (at most 1/3 more than optimum) in just a few minutes
- reach small gap (often < 1%) in 4 hours
44. Summary
- Numerous applications of Suffix Trees and relatives in Bioinformatics
- More yet to be found
- Suffix tree and array software at my UCD website.