Designing Multiple Simultaneous Seeds for DNA Similarity Search



Slides: 27

Provided by: cse6

Learn more at: http://www.cse.msu.edu


Transcript and Presenter's Notes


1
Designing Multiple Simultaneous Seeds for DNA
Similarity Search
Yanni Sun, Jeremy Buhler
Washington University in Saint Louis
2
Outline

Problem of multi-seed design
Methods
Greedy covering algorithm
Compute conditional match probabilities
Experiments and results
Conclusion and future work

3
Sequence Alignment

Functional regions are conserved despite DNA
mutations over time
A conserved region can be aligned with a high score
Exact solution: dynamic programming (DP), time complexity O(MN)
Fast but heuristic solution: seeded alignment
algorithm
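
The O(MN) dynamic program mentioned above can be sketched as follows. This is a minimal local-alignment score computation in the Smith-Waterman style; the function name and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions and do not come from the slides.

```python
def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score of a and b.

    Visits every cell of the (M+1) x (N+1) table once, which is the
    O(MN) cost quoted on the slide. Only two rows are kept in memory.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTACGTT", "GGACGGG"))
```

The nested loop is what makes exact alignment too slow for whole genomes, which is the motivation for the seeded heuristic.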

4
Seeded Alignment Algorithm

BLAST is the most popular tool.
Step 1: word match
Step 2: extend the match to find the
high-similarity pair
TAGGACCTAACC
GACCACCTTTT

5
Seed and Similarity

Example of a similarity and a single seed
tgcagaaatgcagaggca
tacacaggcaccgaggag
Similarity: 101101000010111100
Seed: 111, weight 3, span 4
The seed detects/matches this similarity.
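
The detection rule on this slide can be sketched in Python. A seed is written here as a string in which `1` is a required match and `*` is a don't-care position. That notation, the function names, and the two candidate patterns `11*1` and `1*11` (the only weight-3, span-4 seeds that begin and end with a required match) are assumptions, since the transcript shows the pattern only as 111.

```python
def similarity(seq1: str, seq2: str) -> str:
    """Encode an ungapped alignment as a 0/1 string (1 = matching bases)."""
    return "".join("1" if x == y else "0" for x, y in zip(seq1, seq2))


def match_offsets(seed: str, sim: str) -> list[int]:
    """Offsets where every '1' of the seed lines up with a '1' of sim."""
    span = len(seed)
    return [
        i
        for i in range(len(sim) - span + 1)
        if all(s != "1" or sim[i + k] == "1" for k, s in enumerate(seed))
    ]


def detects(seed: str, sim: str) -> bool:
    """A seed detects a similarity if it matches at one or more offsets."""
    return bool(match_offsets(seed, sim))


if __name__ == "__main__":
    sim = similarity("tgcagaaatgcagaggca", "tacacaggcaccgaggag")
    print(sim)
    for seed in ("11*1", "1*11"):  # assumed candidate patterns
        print(seed, match_offsets(seed, sim))
```

Under this reading, either candidate pattern detects the similarity shown on the slide.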

6
Seed Choice is Important
(Figure: two rows of 1s, 11 and 16 positions long, with the labels "Significant alignment" and "Seed match")
7
Seed Design: Previous Work

Traditional seed: a contiguous word (e.g. 11111111111)
Discontiguous patterns of matching bases
[CR1993], [MTL02], e.g. 111010010100110111
Our work on a single discontiguous seed [BKS03]

8
Multiple Simultaneous Seeds

Multiple simultaneous seeds are defined as a set
of seeds.
Π = {seed1, seed2, ..., seedi, ..., seedn}
Π detects a similarity if at least one of the
component seeds detects the similarity
Example:
Simultaneous seeds 11*1, 1*11 (1 = required match, * = don't care) detect
similarities 100110100001, 1000010110001,
1101001011001
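
The "at least one component seed" rule can be checked directly on the three example similarities. The seed patterns `11*1` and `1*11` used here are an assumed reading of the example, with `*` marking a don't-care position; the function names are hypothetical.

```python
def detects(seed: str, sim: str) -> bool:
    """True if every '1' of the seed aligns with a '1' of sim at some offset."""
    span = len(seed)
    return any(
        all(s != "1" or sim[i + k] == "1" for k, s in enumerate(seed))
        for i in range(len(sim) - span + 1)
    )


def seed_set_detects(seeds: list[str], sim: str) -> bool:
    """A seed set detects a similarity if any one of its seeds does."""
    return any(detects(seed, sim) for seed in seeds)


if __name__ == "__main__":
    seeds = ["11*1", "1*11"]  # assumed patterns
    sims = ["100110100001", "1000010110001", "1101001011001"]
    for sim in sims:
        per_seed = {seed: detects(seed, sim) for seed in seeds}
        print(sim, per_seed, seed_set_detects(seeds, sim))
```

With these patterns neither seed alone detects all three similarities, but the pair does, which is the point of using seeds simultaneously.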

9
Multi-seed Design: Balance Sensitivity with
Specificity

Sensitivity = |A| / |biologically meaningful alignments|
Specificity = |A| / |seed matches|
A: the overlap between biologically meaningful alignments and seed matches
Increase sensitivity:
Decrease weight of single seed
Use multiple seeds
Both methods hurt specificity
Hypothesis: a set of multiple seeds
has a better tradeoff of sensitivity vs.
specificity than a single seed

(Figure: Venn diagram of biologically meaningful alignments and seed matches, with overlap A)
10
Our Work: Design Multiple Simultaneous Seeds
Efficiently

Use a new local search method to optimize the seed
set
Design an efficient algorithm to calculate
conditional match probability
Empirical verification that multiple simultaneous
seeds have a better tradeoff of sensitivity vs.
specificity

11
Multi-seed Design Problem

Input
Ungapped alignments sampled from two genomic DNA
sequences
Resource constraints on seeds: weight, span,
number
Goal: find a set of seeds Π to maximize the
detection probability Pr(Π detects S).
Pr(Π detects S) = Pr((seed1 detects S) or (seed2
detects S) or ... or (seedn detects S))

12
Outline

Problem of multi-seed Design
Methods
Greedy covering algorithm
Compute conditional match probabilities
Experiments and results
Conclusion and future work

13
Computing Match Probability for Specified Seeds
[BKS03]

Learn a kth-order Markov model M from similarities.
Build a DFA that accepts only strings containing
a match to one of the given seeds
Use dynamic programming (DP) to compute the probability
that the DFA accepts a string chosen randomly from model M.
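
A simplified version of this computation is sketched below. It departs from the slide in two labeled ways: it assumes a 0th-order model in which each position matches independently with probability `p` (the slide uses a kth-order Markov model), and instead of an explicit minimized DFA it uses the last `span - 1` bits as the automaton state. The function names and the `*` don't-care notation are assumptions.

```python
def hits_suffix(seed: str, window: str) -> bool:
    """True if the seed matches the last len(seed) bits of window."""
    if len(seed) > len(window):
        return False
    tail = window[-len(seed):]
    return all(s != "1" or t == "1" for s, t in zip(seed, tail))


def match_probability(seeds: list[str], length: int, p: float) -> float:
    """Pr(at least one seed matches a random 0/1 string of given length)."""
    keep = max(len(s) for s in seeds) - 1  # bits of history the state needs
    alive = {"": 1.0}  # state -> probability that no seed has matched yet
    for _ in range(length):
        nxt: dict[str, float] = {}
        for state, prob in alive.items():
            for bit, q in (("1", p), ("0", 1.0 - p)):
                window = state + bit
                if any(hits_suffix(seed, window) for seed in seeds):
                    continue  # accepting is absorbing: mass leaves `alive`
                new_state = window[-keep:] if keep else ""
                nxt[new_state] = nxt.get(new_state, 0.0) + prob * q
        alive = nxt
    return 1.0 - sum(alive.values())


if __name__ == "__main__":
    print(match_probability(["11*1", "1*11"], length=64, p=0.7))
```

The table has at most 2^(span-1) states per position, so the cost grows linearly with the similarity length but exponentially with seed span, consistent with the later remark that exact computation suits seeds with small span.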

14
Seek the Locally Optimal Set of Seeds

Original local search
Greedy covering algorithm: a faster local search
strategy
Efficient computation of conditional match
probability

15
Find Optimal Set of Seeds by Original Local
Search
Seed space with span < 8, weight 3
111, 111: Pr = 0.70
16
Greedy Covering Algorithm
Similarity space
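Design 3 simultaneous seeds: s1, s2, s3
s1 = argmax_x Pr(x)
s2 = argmax_x Pr(x | not s1)
s3 = argmax_x Pr(x | not s1, not s2)
(Pr(x | not s1) is the probability that seed x detects a similarity given that s1 does not)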
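
The greedy covering idea can be illustrated on a finite sample of similarities. This sketch makes two assumptions that differ from the slides: probabilities are replaced by counts over the sample, and each argmax is taken by exhaustive enumeration of a small candidate space rather than by local search. Seeds are assumed to begin and end with a required match and to have weight at least 2; all function names are hypothetical.

```python
from itertools import combinations


def detects(seed: str, sim: str) -> bool:
    """True if every '1' of the seed aligns with a '1' of sim at some offset."""
    span = len(seed)
    return any(
        all(s != "1" or sim[i + k] == "1" for k, s in enumerate(seed))
        for i in range(len(sim) - span + 1)
    )


def candidate_seeds(weight: int, max_span: int) -> list[str]:
    """All seeds of the given weight (>= 2) with span <= max_span."""
    seeds = []
    for span in range(weight, max_span + 1):
        for inner in combinations(range(1, span - 1), weight - 2):
            chars = ["*"] * span
            for pos in (0, span - 1, *inner):
                chars[pos] = "1"
            seeds.append("".join(chars))
    return seeds


def greedy_cover(sims: list[str], candidates: list[str], n: int) -> list[str]:
    """Pick up to n seeds, each maximizing detections among uncovered sims."""
    chosen: list[str] = []
    uncovered = list(sims)
    for _ in range(n):
        if not uncovered:
            break
        best = max(candidates,
                   key=lambda x: sum(detects(x, s) for s in uncovered))
        chosen.append(best)
        # Conditioning on "no chosen seed detects" = dropping covered sims.
        uncovered = [s for s in uncovered if not detects(best, s)]
    return chosen


if __name__ == "__main__":
    sims = ["100110100001", "1000010110001", "1101001011001"]
    print(greedy_cover(sims, candidate_seeds(3, 4), n=2))
```

Because every candidate is scored against the same uncovered set, maximizing the count is equivalent to maximizing the empirical conditional detection frequency.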
17
Calculate Conditional Match Probabilities

Challenge: how to calculate the conditional
probability efficiently?
Seeds with small span: exact computation via DFAs
Seeds with large span: Monte Carlo
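
The Monte Carlo option can be sketched as rejection sampling: draw random similarities, keep only those that no already-chosen seed detects, and count how often the new seed detects the survivors. The sampling model is an assumption (independent positions with match probability `p`, not the Markov model learned from data), as are the function names and the choice to return 0.0 when no sample survives.

```python
import random


def detects(seed: str, sim: str) -> bool:
    """True if every '1' of the seed aligns with a '1' of sim at some offset."""
    span = len(seed)
    return any(
        all(s != "1" or sim[i + k] == "1" for k, s in enumerate(seed))
        for i in range(len(sim) - span + 1)
    )


def mc_conditional(x: str, chosen: list[str], length: int, p: float,
                   trials: int, rng: random.Random) -> float:
    """Estimate Pr(x detects S | no seed in `chosen` detects S)."""
    missed = 0  # samples that no chosen seed detects
    hit = 0     # of those, samples that x detects
    for _ in range(trials):
        sim = "".join("1" if rng.random() < p else "0" for _ in range(length))
        if any(detects(seed, sim) for seed in chosen):
            continue
        missed += 1
        hit += detects(x, sim)
    # Assumption: report 0.0 when the conditioning event was never sampled.
    return hit / missed if missed else 0.0


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(mc_conditional("1*11", ["11*1"], length=64, p=0.7,
                         trials=2000, rng=rng))
```

The estimate becomes noisy when the chosen seeds already detect almost everything, because few samples survive the rejection step.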

18
Calculate Conditional Match Probability via DFA

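Pr(x | not Π) = Pr(x and not Π) / Pr(not Π), where Π is the set of seeds already chosen and "not Π" means that no seed in Π detects the similarity
Build the DFA corresponding to (x and not Π) by using
the cross product and complementation of DFAs
Efficiency: in the process of local search to
find the optimal single seed x, Pr(not Π) can be
precomputed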
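
As a worked restatement of the quotient above (not taken from the slide), the conditional probability can also be written in terms of two ordinary detection probabilities. The notation is assumed: Π is the set of seeds already chosen, an overline marks the event that none of them detects the similarity, and Pr(Π) is shorthand for Pr(Π detects S).

```latex
\Pr(x \mid \overline{\Pi})
  = \frac{\Pr(x \wedge \overline{\Pi})}{\Pr(\overline{\Pi})}
  = \frac{\Pr(\Pi \cup \{x\}) - \Pr(\Pi)}{1 - \Pr(\Pi)}
```

The second step holds because the similarities detected by Π ∪ {x} split into those detected by Π and those detected by x but by no seed in Π. The denominator does not depend on x, which is why it can be computed once per search.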

19
Outline

Problem of multi-seed design
Methods
Greedy covering algorithm
Compute conditional match probabilities
Experiments and results
Conclusion and future work

20
Greedy Covering vs. Original Local Search
(Figure: detection probability, greedy covering vs. original local search)
21
Greedy Covering is Much Faster

When n = 5, on the same hardware platform (P4):
Greedy covering needs 20 minutes
The original local search needs 2.4 hours

22
Experimental Setup

The ungapped alignments are sampled uniformly
from human and mouse syntenies
For a specified seed set
Sensitivity: the number of significant gapped
alignments found by our BLAST-like alignment tool
False positive rate: approximated by the number
of seed matches

23
Results: Verify the Hypothesis on Noncoding
Sequences
Seed weight | Number of seeds | Gapped alignments found (sensitivity) | Improvement in sensitivity | Total seed matches (approximation of false positives)
11 | 1 | 251941 | ---- | 1.57 x 10^9
10 | 1 | 273831 | 8.7% | 5.88 x 10^9
11 | 3 | 292093 | 15.9% | 4.56 x 10^9
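
The tradeoff in this table can be made explicit with a little arithmetic. The sketch below uses only the figures from the table and treats the first row (one seed of weight 11) as the baseline; the function and field names are hypothetical.

```python
# (seed weight, number of seeds, gapped alignments found, total seed matches)
ROWS = [
    (11, 1, 251941, 1.57e9),
    (10, 1, 273831, 5.88e9),
    (11, 3, 292093, 4.56e9),
]


def compare_to_baseline(rows):
    """Sensitivity gain (%) and seed-match ratio relative to the first row."""
    _, _, base_found, base_matches = rows[0]
    return [
        {
            "weight": weight,
            "seeds": count,
            "sensitivity_gain_pct": 100.0 * (found / base_found - 1.0),
            "seed_match_ratio": matches / base_matches,
        }
        for weight, count, found, matches in rows[1:]
    ]


if __name__ == "__main__":
    for row in compare_to_baseline(ROWS):
        print(row)
```

Three seeds of weight 11 find more alignments than one seed of weight 10 (about 15.9% vs. 8.7% over the baseline) while producing fewer seed matches (about 2.9 times vs. 3.7 times the baseline), which is the sense in which the seed set has the better tradeoff.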
24
Summary of Contributions

Efficient algorithms to design multiple
simultaneous seeds at reasonable cost
Empirical verification: multiple simultaneous
seeds have a better tradeoff between sensitivity
and specificity

25
Future Work

Design a better evaluation platform for different
seeds
Investigate utility of seeds in multiple sequence
alignment

26
Acknowledgements

Dr. Jeremy Buhler (advisor), Ben Westover, Rachel
Nordgren, Joseph Lancaster and Christopher Swope
Laboratory for Computational Genomics at
Washington University in Saint Louis
http://www.cse.wustl.edu/jbuhler/mandala
