Title: Designing Multiple Simultaneous Seeds for DNA Similarity Search
1Designing Multiple Simultaneous Seeds for DNA Similarity Search
Yanni Sun, Jeremy Buhler
Washington University in Saint Louis
2Outline
- Problem of multi-seed design
- Methods
- Greedy covering algorithm
- Compute conditional match probabilities
- Experiments and results
- Conclusion and future work
3Sequence Alignment
- Functional regions are conserved despite DNA mutations over time.
- Conserved regions can be aligned with a high score.
- Exact solution: dynamic programming (DP), time complexity O(MN)
- Fast but heuristic solution: seeded alignment algorithm
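As an illustration of the exact O(MN) approach, the following is a minimal Python sketch of local alignment scoring by dynamic programming (Smith-Waterman with a linear gap penalty). The function name and the scoring values (match = 1, mismatch = -1, gap = -2) are assumptions chosen for illustration; they are not parameters from the talk.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b.

    Fills an (M+1) x (N+1) table row by row, so the running time is O(MN).
    Only two rows are kept in memory at a time.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score of aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of the table is visited once, which is why this exact method becomes too slow for genome-scale inputs and motivates the seeded heuristic.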
4Seeded Alignment Algorithm
- BLAST is the most popular tool.
- Step 1: find a word match. Step 2: extend the match to find the high-similarity pair.
- Example sequences:
  TAGGACCTAACC
  GACCACCTTTT
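The two steps can be sketched in Python as follows. This is a simplified illustration, not BLAST's actual implementation: the function names, the word length, and the match/mismatch scores are assumptions, and the extension scans to the ends of both sequences rather than stopping early as BLAST does.

```python
from collections import defaultdict


def word_matches(a, b, w):
    """Step 1: return all (i, j) with a[i:i+w] == b[j:j+w]."""
    index = defaultdict(list)
    for j in range(len(b) - w + 1):
        index[b[j:j + w]].append(j)
    hits = []
    for i in range(len(a) - w + 1):
        for j in index.get(a[i:i + w], []):
            hits.append((i, j))
    return hits


def extend_ungapped(a, b, i, j, w, match=1, mismatch=-1):
    """Step 2: extend a word hit in both directions without gaps.

    Returns (score, start_in_a, start_in_b, length) of the best-scoring
    ungapped segment that contains the hit.
    """
    best = score = w * match
    right = 0
    k = 0
    while i + w + k < len(a) and j + w + k < len(b):
        score += match if a[i + w + k] == b[j + w + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
    score = best  # continue leftward from the best rightward extension
    left = 0
    k = 0
    while i - 1 - k >= 0 and j - 1 - k >= 0:
        score += match if a[i - 1 - k] == b[j - 1 - k] else mismatch
        k += 1
        if score > best:
            best, left = score, k
    return best, i - left, j - left, w + left + right
```

For the two example sequences above and a word length of 4, `word_matches` reports the shared words GACC and ACCT as hits at (3, 0) and (4, 4).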
5Seed and Similarity
- Example of a similarity and a single seed:
  tgcagaaatgcagaggca
  tacacaggcaccgaggag
- Similarity: 101101000010111100
- Seed: 111, weight 3, span 4
- The seed detects (matches) this similarity.
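A small Python sketch of what "detects" means for a single seed. The `*` symbol for a don't-care position and the concrete pattern `11*1` (one possible weight-3, span-4 seed) are assumptions made for illustration; the function names are hypothetical.

```python
def similarity_string(a, b):
    """Encode an ungapped alignment as 1 (match) / 0 (mismatch) per column."""
    return "".join("1" if x == y else "0" for x, y in zip(a, b))


def seed_detects(seed, similarity):
    """True if the seed matches the similarity at some offset.

    A '1' in the seed requires a match at that position; '*' matches anything.
    The weight of a seed is its number of '1's, the span is its length.
    """
    required = [k for k, c in enumerate(seed) if c == "1"]
    for start in range(len(similarity) - len(seed) + 1):
        if all(similarity[start + k] == "1" for k in required):
            return True
    return False


sim = similarity_string("tgcagaaatgcagaggca", "tacacaggcaccgaggag")
print(sim)                          # 101101000010111100
print(seed_detects("11*1", sim))    # True
print(seed_detects("11111", sim))   # False: no run of five matches
```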
6Seed Choice is Important
[Figure: two seeds shown as rows of 11 and 16 ones; labels: significant alignment, seed match]
7Seed Design Previous Work
- Traditional seed: word (e.g. 11111111111)
- Discontiguous patterns of matching bases [CR1993, MTL02], e.g. 111010010100110111
- Our work on a single discontiguous seed [BKS03]
8Multiple Simultaneous Seeds
- Multiple simultaneous seeds are defined as a set of seeds: Π = {seed1, seed2, ..., seedi, ..., seedn}
- Π detects a similarity if at least one of the component seeds detects the similarity.
- Example: simultaneous seeds {111, 111} detect similarities 100110100001, 1000010110001, 1101001011001
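The "at least one component seed" rule can be sketched in Python with regular expressions. The two seed patterns `11*1` and `1*11` are hypothetical stand-ins (with `*` as an assumed don't-care symbol), and the function names are illustrative only.

```python
import re


def seed_regex(seed):
    """Turn a seed such as '11*1' into a regex over 0/1 similarity strings."""
    return re.compile(seed.replace("*", "[01]"))


def set_detects(seeds, similarity):
    """A seed set detects a similarity if any component seed matches it."""
    return any(seed_regex(s).search(similarity) for s in seeds)


def empirical_sensitivity(seeds, similarities):
    """Fraction of the given similarities detected by the seed set."""
    detected = sum(1 for sim in similarities if set_detects(seeds, sim))
    return detected / len(similarities)


sims = ["100110100001", "1000010110001", "1101001011001"]
print(empirical_sensitivity(["11*1"], sims))          # 2 of 3 detected
print(empirical_sensitivity(["11*1", "1*11"], sims))  # all 3 detected
```

With these assumed patterns, each seed alone misses one of the three example similarities, while the pair together detects all of them.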
9Multi-seed Design: Balance Sensitivity with Specificity
- Sensitivity = A / biologically meaningful alignments
- Specificity = A / seed matches
- Increase sensitivity:
  - Decrease weight of single seed
  - Use multiple seeds
- Both methods hurt specificity.
- Hypothesis: a set of multiple seeds has a better tradeoff of sensitivity vs. specificity compared to a single seed.
[Figure: sets of biologically meaningful alignments and seed matches, with region A]
10Our Work: Design Multiple Simultaneous Seeds Efficiently
- Use a new local search method to optimize the seed set
- Design an efficient algorithm to calculate conditional match probability
- Empirical verification that multiple simultaneous seeds have a better tradeoff of sensitivity vs. specificity
11Multi-seed Design Problem
- Input:
  - Ungapped alignments sampled from two genomic DNA sequences
  - Resource constraints of seeds: weight, span, number
- Goal: find a set of seeds Π to maximize the detection probability Pr(Π detects S).
- Pr(Π detects S) = Pr((seed1 detects S) or (seed2 detects S) or ... or (seedn detects S))
12Outline
- Problem of multi-seed Design
- Methods
- Greedy covering algorithm
- Compute conditional match probabilities
- Experiments and results
- Conclusion and future work
13Computing Match Probability for Specified Seeds [BKS03]
- Learn a kth-order Markov model M from similarities.
- Build a DFA that accepts only strings that contain a match to the given seeds.
- Compute, by DP, the probability that the DFA accepts a string chosen randomly from model M.
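The DP idea can be sketched in Python under simplifying assumptions: the model here is a 0th-order one (each position is a match independently with probability p) instead of a learned kth-order Markov model, and the automaton state is just the most recent span-1 bits rather than a minimized DFA. The `*` don't-care symbol and the function name are also assumptions.

```python
def detection_probability(seeds, length, p):
    """Probability that a random 0/1 similarity of the given length
    contains a match to at least one seed, with Pr(1) = p per position."""
    span = max(len(s) for s in seeds)

    def hit_ends_here(window):
        # window holds the most recent bits, newest last.
        for s in seeds:
            if len(window) >= len(s):
                tail = window[-len(s):]
                if all(t == "1" for t, c in zip(tail, s) if c == "1"):
                    return True
        return False

    # states maps "recent bits, no hit so far" -> probability mass.
    states = {"": 1.0}
    for _ in range(length):
        nxt = {}
        for hist, prob in states.items():
            for bit, q in (("1", p), ("0", 1.0 - p)):
                window = hist + bit
                if hit_ends_here(window):
                    continue  # mass moves to the absorbing accept state
                key = window[-(span - 1):] if span > 1 else ""
                nxt[key] = nxt.get(key, 0.0) + prob * q
        states = nxt
    # Everything not left in a non-accepting state has been accepted.
    return 1.0 - sum(states.values())
```

The number of states is at most 2^(span-1), so the cost grows quickly with span, which is consistent with reserving exact computation for seeds with small span.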
14Seek the Locally Optimal Set of Seeds
- Original local search
- Greedy covering algorithm: a faster local search strategy
- Efficient computation of conditional match probability
15Find Optimal Set of Seeds by Original Local Search
[Figure: seed space with span < 8, weight 3; seed set {111, 111} with Pr = 0.70]
16Greedy Covering Algorithm
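[Figure: similarity space]
- Design 3 simultaneous seeds s1, s2, s3:
- $s_1 = \arg\max_x \Pr(x)$
- $s_2 = \arg\max_x \Pr(x \mid \bar{s}_1)$
- $s_3 = \arg\max_x \Pr(x \mid \bar{s}_1, \bar{s}_2)$
- Here $\bar{s}_i$ is the event that seed $s_i$ does not detect the similarity.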
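A minimal Python sketch of the greedy covering idea, under the assumption that probabilities are estimated by counting over a sample of similarities rather than computed from a model as in the talk. Each round picks the candidate that detects the most similarities not yet detected by the seeds chosen so far, which maximizes the empirical conditional probability. The `*` don't-care symbol and all function names are hypothetical.

```python
from itertools import combinations


def enumerate_seeds(weight, max_span):
    """All seeds with the given weight (>= 2) and span <= max_span.
    The first and last positions of a seed are always '1'."""
    seeds = []
    for span in range(weight, max_span + 1):
        for inner in combinations(range(1, span - 1), weight - 2):
            ones = {0, span - 1, *inner}
            seeds.append("".join("1" if k in ones else "*" for k in range(span)))
    return seeds


def detects(seed, similarity):
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(
        all(similarity[start + k] == "1" for k in required)
        for start in range(len(similarity) - len(seed) + 1)
    )


def greedy_cover(candidates, similarities, n):
    """Choose up to n seeds, each maximizing coverage of what is still missed."""
    uncovered = list(similarities)
    chosen = []
    for _ in range(n):
        remaining = [c for c in candidates if c not in chosen]
        if not uncovered or not remaining:
            break
        best = max(remaining, key=lambda s: sum(detects(s, sim) for sim in uncovered))
        chosen.append(best)
        # Condition on the chosen seeds failing: keep only undetected samples.
        uncovered = [sim for sim in uncovered if not detects(best, sim)]
    return chosen


sims = ["100110100001", "1000010110001", "1101001011001"]
print(greedy_cover(enumerate_seeds(3, 4), sims, 2))  # ['11*1', '1*11']
```

Because each round searches for one seed only, the search space per round is that of a single seed rather than of all n-seed combinations.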
17Calculate Conditional Match Probabilities
- Challenge: how to calculate the conditional probability efficiently?
- Seeds with small span: exact computation via DFAs
- Seeds with large span: Monte Carlo
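For the large-span case, a Monte Carlo estimate can be sketched in Python as follows. This is an illustration under stated assumptions: similarities are drawn from a simple independent-positions model with match probability p (not the learned Markov model), `*` is an assumed don't-care symbol, and the function and parameter names are hypothetical.

```python
import random


def detects(seed, similarity):
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(
        all(similarity[start + k] == "1" for k in required)
        for start in range(len(similarity) - len(seed) + 1)
    )


def monte_carlo_conditional(x, chosen, length, p, trials=20000, rng_seed=0):
    """Estimate Pr(x detects S | no seed in `chosen` detects S) by sampling.

    Returns None if no sampled similarity was missed by the chosen seeds,
    since the conditional probability cannot be estimated in that case.
    """
    rng = random.Random(rng_seed)
    missed = 0        # samples that the chosen seeds fail to detect
    rescued = 0       # of those, samples that x detects
    for _ in range(trials):
        sim = "".join("1" if rng.random() < p else "0" for _ in range(length))
        if any(detects(s, sim) for s in chosen):
            continue
        missed += 1
        if detects(x, sim):
            rescued += 1
    return rescued / missed if missed else None
```

The accuracy of the estimate depends on how many samples survive the conditioning, so more trials are needed when the chosen seeds already detect most similarities.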
18Calculate Conditional Match Probability via DFA
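- $\Pr(x \mid \bar{s}_1) = \Pr(x \wedge \bar{s}_1) / \Pr(\bar{s}_1)$, where $\bar{s}_1$ is the event that the previously chosen seed s1 does not detect the similarity
- Build the DFA corresponding to $x \wedge \bar{s}_1$ by using the cross product and complementation of DFAs
- Efficiency: in the process of local search to find the optimal single seed x, $\Pr(\bar{s}_1)$ can be precomputed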
19Outline
- Problem of multi-seed design
- Methods
- Greedy covering algorithm
- Compute conditional match probabilities
- Experiments and results
- Conclusion and future work
20Greedy Covering vs. Original Local Search
[Figure: detection probability, greedy covering vs. original local search]
21Greedy Covering is Much Faster
- When n = 5, on the same hardware platform (P4):
- Greedy covering needs 20 minutes
- The original local search needs 2.4 hours
22Experimental Setup
- The ungapped alignments are sampled uniformly from human and mouse syntenies.
- For a specified seed set:
  - Sensitivity: the number of significant gapped alignments found by our BLAST-like alignment tool
  - False positive rate: approximated by the number of seed matches
23Results Verify the Hypothesis on Noncoding Sequences
seed weight | number of seeds | gapped alignments found (sensitivity) | improvement of sensitivity (%) | total seed matches (approximation of false positives)
11          | 1               | 251941                                | ----                           | 1.57 x 10^9
10          | 1               | 273831                                | 8.7                            | 5.88 x 10^9
11          | 3               | 292093                                | 15.9                           | 4.56 x 10^9
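The improvement column can be recomputed from the raw counts, which also makes the cost in seed matches easier to compare. The short Python sketch below uses only the numbers in the table; the variable and function names are illustrative.

```python
rows = [
    # (seed weight, number of seeds, gapped alignments found, total seed matches)
    (11, 1, 251941, 1.57e9),
    (10, 1, 273831, 5.88e9),
    (11, 3, 292093, 4.56e9),
]

base_alignments = rows[0][2]
base_matches = rows[0][3]


def relative_to_baseline(row):
    """Return (sensitivity improvement in %, seed-match ratio) vs. the first row."""
    improvement = 100.0 * (row[2] - base_alignments) / base_alignments
    match_ratio = row[3] / base_matches
    return improvement, match_ratio


for row in rows[1:]:
    improvement, match_ratio = relative_to_baseline(row)
    print(f"weight {row[0]}, {row[1]} seed(s): "
          f"+{improvement:.1f}% alignments, {match_ratio:.2f}x seed matches")
```

Relative to the single weight-11 seed, lowering the weight to 10 gives 8.7% more alignments at about 3.75 times the seed matches, while three weight-11 seeds give 15.9% more alignments at about 2.90 times the seed matches.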
24Summary of Contributions
- Efficient algorithms to design multiple simultaneous seeds at reasonable cost
- Empirical verification: multiple simultaneous seeds have a better tradeoff between sensitivity and specificity
25Future Work
- Design a better evaluation platform for different seeds
- Investigate the utility of seeds in multiple sequence alignment
26Acknowledgements
- Dr. Jeremy Buhler (advisor), Ben Westover, Rachel Nordgren, Joseph Lancaster and Christopher Swope
- Laboratory for computational genomics at Washington University in Saint Louis
- http://www.cse.wustl.edu/jbuhler/mandala