Designing Multiple Simultaneous Seeds for DNA Similarity Search

1
Designing Multiple Simultaneous Seeds for DNA
Similarity Search
Yanni Sun, Jeremy Buhler
Washington University in Saint Louis
2
Outline
  • Problem of multi-seed design
  • Methods
  • Greedy covering algorithm
  • Compute conditional match probabilities
  • Experiments and results
  • Conclusion and future work

3
Sequence Alignment
  • Functional regions conserved despite DNA
    mutations over time
  • Conserved region can be aligned with high score
  • Exact solution: dynamic programming (DP), time
    complexity O(MN) for sequences of lengths M and N
  • Fast but heuristic solution: seeded alignment
    algorithm
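To make the O(MN) cost of the exact solution concrete, here is a minimal sketch of a local-alignment DP (Smith-Waterman with linear gap costs). The function name and the scoring values (match=1, mismatch=-1, gap=-2) are assumptions chosen for illustration; the slides do not specify a scoring scheme.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gaps).

    Fills an (M+1) x (N+1) table one row at a time, so the running time
    is O(MN) and the memory is O(N). Scoring values are illustrative.
    """
    prev = [0] * (len(b) + 1)          # DP row for the previous prefix of a
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # 0 restarts the alignment; the other terms extend it
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of the table is visited once, which is why this approach becomes impractical for genome-scale M and N and motivates the seeded heuristic.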

4
Seeded Alignment Algorithm
  • BLAST is the most popular tool.
  • Step 1: find word matches; step 2: extend
    each match to find a high-similarity
    pair
  • TAGGACCTAACC
  • GACCACCTTTT
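The two steps can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: the function names, the word length k, the scoring values, and the X-drop threshold are all assumptions.

```python
from collections import defaultdict

def word_matches(a, b, k):
    """Step 1: all (i, j) with a[i:i+k] == b[j:j+k], via a k-mer index of b."""
    index = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(a) - k + 1)
            for j in index.get(a[i:i + k], [])]

def extend_ungapped(a, b, i, j, k, match=1, mismatch=-1, x_drop=2):
    """Step 2: extend a word match along its diagonal in both directions.

    Each side stops once the running score falls x_drop below the best
    score seen on that side. Returns the score of the extended segment.
    """
    def one_side(pairs):
        best = score = 0
        for x, y in pairs:
            score += match if x == y else mismatch
            if score > best:
                best = score
            elif best - score >= x_drop:
                break
        return best

    right = one_side(zip(a[i + k:], b[j + k:]))
    left = one_side(zip(reversed(a[:i]), reversed(b[:j])))
    return k * match + left + right
```

With k = 4, the two sequences on this slide share the words GACC and ACCT, so `word_matches("TAGGACCTAACC", "GACCACCTTTT", 4)` returns `[(3, 0), (4, 4)]`.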

5
Seed and Similarity
  • Example of a similarity and a single seed
  • tgcagaaatgcagaggca
  • tacacaggcaccgaggag
  • Similarity: 101101000010111100 (1 = match,
    0 = mismatch)
  • Seed: 111, weight 3, span 4
  • The seed detects/matches this similarity.
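A minimal sketch of how a similarity string is derived and how a single seed is tested against it. The seed pattern in this transcript appears to have lost its don't-care position (a weight-3 seed with span 4 needs one), so the pattern `11*1` used in the usage note is a hypothetical stand-in, with `*` marking a position that may be either a match or a mismatch. The function names are also assumptions.

```python
def similarity_string(a, b):
    """Encode an ungapped alignment as 1 (match) / 0 (mismatch) per column."""
    return "".join("1" if x == y else "0" for x, y in zip(a, b))

def seed_weight(seed):
    """Weight = number of positions that must match."""
    return seed.count("1")

def seed_span(seed):
    """Span = total length of the pattern, including don't-care positions."""
    return len(seed)

def detects(seed, similarity):
    """True if every '1' of the seed lines up with a '1' at some offset."""
    span = len(seed)
    return any(
        all(s != "1" or similarity[i + k] == "1" for k, s in enumerate(seed))
        for i in range(len(similarity) - span + 1)
    )
```

For the two sequences above, `similarity_string("tgcagaaatgcagaggca", "tacacaggcaccgaggag")` gives `101101000010111100`, and the hypothetical seed `11*1` (weight 3, span 4) detects it.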


6
Seed Choice is Important
[Figure: seed matches (runs of 1s) within a significant alignment]
7
Seed Design Previous Work
  • Traditional seed: a contiguous word (e.g. 11111111111)
  • Discontiguous patterns of matching bases
    [CR1993, MTL02], e.g. 111010010100110111
  • Our work on a single discontiguous seed [BKS03]

8
Multiple Simultaneous Seeds
  • Multiple simultaneous seeds are defined as a set
    of seeds:
  • Π = {seed1, seed2, ..., seedi, ..., seedn}
  • Π detects a similarity if at least one of its
    component seeds detects the similarity
  • Example:
  • Simultaneous seeds 111, 111 detect
    similarities 100110100001, 1000010110001,
    1101001011001
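The "at least one component seed" rule can be sketched as below. The two seed patterns in this transcript appear to have lost their don't-care positions, so the pair `11*1` and `1*11` in the usage note is an assumption: it is one weight-3 pair that happens to detect all three example similarities, not necessarily the pair on the original slide.

```python
def detects(seed, similarity):
    """True if every '1' of the seed lines up with a '1' at some offset."""
    span = len(seed)
    return any(
        all(s != "1" or similarity[i + k] == "1" for k, s in enumerate(seed))
        for i in range(len(similarity) - span + 1)
    )

def seed_set_detects(seeds, similarity):
    """A seed set detects a similarity if any one of its seeds does."""
    return any(detects(seed, similarity) for seed in seeds)

def detecting_seeds(seeds, similarity):
    """Which seeds of the set are responsible for the detection."""
    return [seed for seed in seeds if detects(seed, similarity)]
```

With the hypothetical set `["11*1", "1*11"]`, the similarity `100110100001` is detected only by `11*1`, `1000010110001` only by `1*11`, and `1101001011001` by both, which shows how the seeds of a set complement each other.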

9
Multi-seed Design: Balance Sensitivity with
Specificity
  • Sensitivity = A / biologically meaningful
    alignments
  • Specificity = A / seed matches
  • Two ways to increase sensitivity:
  • Decrease the weight of a single seed
  • Use multiple seeds
  • Both methods hurt specificity
  • Hypothesis: a set of multiple seeds has a
    better tradeoff of sensitivity vs. specificity
    than a single seed

[Figure: Venn diagram of biologically meaningful alignments and seed matches; A is their overlap]
10
Our Work: Design Multiple Simultaneous Seeds
Efficiently
  • Use a new local search method to optimize the seed
    set
  • Design an efficient algorithm to calculate
    conditional match probabilities
  • Empirical verification that multiple simultaneous
    seeds have a better tradeoff of sensitivity vs.
    specificity

11
Multi-seed Design Problem
  • Input:
  • Ungapped alignments sampled from two genomic DNA
    sequences
  • Resource constraints on seeds: weight, span,
    number
  • Goal: find a set of seeds Π that maximizes the
    detection probability Pr(Π detects S)
  • Pr(Π detects S) = Pr((seed1 detects S) or (seed2
    detects S) or ... or (seedn detects S))

12
Outline
  • Problem of multi-seed Design
  • Methods
  • Greedy covering algorithm
  • Compute conditional match probabilities
  • Experiments and results
  • Conclusion and future work

13
Computing Match Probability for Specified Seeds
[BKS03]
  • Learn a kth-order Markov model M from similarities.
  • Build a DFA that accepts exactly the strings
    containing a match to the given seeds.
  • Compute, by DP, the probability that the DFA accepts
    a string chosen randomly from model M.
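The DP idea can be sketched under two simplifying assumptions that are not in the slides: the model is a 0th-order one (each position matches independently with probability `p_match`) instead of a learned kth-order Markov model, and the automaton is kept implicit, with each state being the last span-1 bits seen, instead of an explicitly built DFA. The function names are hypothetical.

```python
def ends_with_match(seed, window):
    """True if the seed matches the last len(seed) positions of window."""
    if len(window) < len(seed):
        return False
    tail = window[-len(seed):]
    return all(s != "1" or t == "1" for s, t in zip(seed, tail))

def detection_probability(seeds, length, p_match):
    """Pr[a random 0/1 string of the given length is detected by any seed].

    Tracks the probability mass of strings not yet detected, grouped by
    their most recent span-1 bits; detected strings leave the table.
    """
    span = max(len(seed) for seed in seeds)
    undetected = {"": 1.0}
    for _ in range(length):
        nxt = {}
        for history, pr in undetected.items():
            for bit, pr_bit in (("1", p_match), ("0", 1.0 - p_match)):
                window = history + bit
                if any(ends_with_match(seed, window) for seed in seeds):
                    continue            # reached the accepting (absorbing) state
                key = window[-(span - 1):] if span > 1 else ""
                nxt[key] = nxt.get(key, 0.0) + pr * pr_bit
        undetected = nxt
    return 1.0 - sum(undetected.values())
```

The table never holds more than 2^(span-1) states, so the cost grows linearly with the similarity length but exponentially with the span, which is consistent with exact computation being reserved for seeds with small span.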

14
Seek the Locally Optimal Set of Seeds
  • Original local search
  • Greedy covering algorithm: a faster local search
    strategy
  • Efficient computation of conditional match
    probability

15
Find Optimal Set of Seeds by Original Local
Search
[Figure: seed space with span < 8, weight = 3; example seed set 111, 111 with Pr = 0.70]
16
Greedy Covering Algorithm
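[Figure: similarity space]
Design 3 simultaneous seeds s1, s2, s3:
s1 = argmax_x Pr(x)
s2 = argmax_x Pr(x | ¬s1)
s3 = argmax_x Pr(x | ¬s1, ¬s2)
(Pr(x): probability that seed x detects a similarity; ¬s: seed s does not detect it)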
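The greedy selection can be sketched on a finite sample of similarities. This is an empirical analogue under stated assumptions, not the authors' implementation: probabilities are replaced by counts over the sample, the candidate space is enumerated exhaustively instead of being explored by local search, and the don't-care symbol `*` and all function names are hypothetical. Maximizing the number of still-undetected samples that a seed detects is equivalent to maximizing the conditional probability on that sample, because the denominator is the same for every candidate.

```python
from itertools import combinations

def detects(seed, similarity):
    """True if every '1' of the seed lines up with a '1' at some offset."""
    span = len(seed)
    return any(
        all(s != "1" or similarity[i + k] == "1" for k, s in enumerate(seed))
        for i in range(len(similarity) - span + 1)
    )

def candidate_seeds(weight, max_span):
    """All seeds of the given weight (>= 2) with span <= max_span.

    The first and last positions are always '1'.
    """
    seeds = []
    for span in range(weight, max_span + 1):
        for inner in combinations(range(1, span - 1), weight - 2):
            ones = {0, span - 1, *inner}
            seeds.append("".join("1" if k in ones else "*" for k in range(span)))
    return seeds

def greedy_covering(similarities, candidates, n_seeds):
    """Pick seeds one at a time, each covering the most still-undetected samples."""
    uncovered = list(similarities)
    chosen = []
    for _ in range(n_seeds):
        if not uncovered:
            break
        best = max(candidates, key=lambda x: sum(detects(x, s) for s in uncovered))
        chosen.append(best)
        uncovered = [s for s in uncovered if not detects(best, s)]
    return chosen
```

On the three similarities `100110100001`, `1000010110001`, `1101001011001` with `candidate_seeds(3, 4)`, two greedy rounds return `["11*1", "1*11"]`: the first seed covers two samples and the second is chosen only for the one that remains.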
17
Calculate Conditional Match Probabilities
  • Challenge: how to calculate the conditional
    probability efficiently?
  • Seeds with small span: exact computation via DFAs
  • Seeds with large span: Monte Carlo
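For the Monte Carlo case, a minimal sketch is to sample random similarities, keep only those that the already chosen seeds miss, and count how often the candidate detects them. The sampling model here (independent positions with match probability `p_match`), the trial count, the fixed random seed, and the function names are all assumptions for illustration; the slides sample from a learned Markov model.

```python
import random

def detects(seed, similarity):
    """True if every '1' of the seed lines up with a '1' at some offset."""
    span = len(seed)
    return any(
        all(s != "1" or similarity[i + k] == "1" for k, s in enumerate(seed))
        for i in range(len(similarity) - span + 1)
    )

def monte_carlo_conditional(x, chosen, length, p_match, trials=20000, rng=None):
    """Estimate Pr(x detects S | no seed in `chosen` detects S) by sampling."""
    rng = rng or random.Random(0)      # fixed seed keeps the sketch reproducible
    missed = hits = 0
    for _ in range(trials):
        sim = "".join("1" if rng.random() < p_match else "0" for _ in range(length))
        if any(detects(s, sim) for s in chosen):
            continue                   # condition not met: discard the sample
        missed += 1
        hits += detects(x, sim)
    return hits / missed if missed else float("nan")
```

The estimate only uses samples that the chosen seeds miss, so its accuracy degrades as the chosen set becomes more sensitive and fewer samples survive the filter.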

18
Calculate Conditional Match Probability via DFA
  • Pr(x | ¬s1) = Pr(x ∧ ¬s1) / Pr(¬s1), where ¬s1
    means that seed s1 does not detect the similarity
  • Build the DFA corresponding to x ∧ ¬s1 by using
    the cross product and complementation of DFAs
  • Efficiency: during the local search for the
    optimal single seed x, Pr(¬s1) can be
    precomputed
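The ratio and the precomputation can be illustrated with brute-force enumeration in place of the DFA cross product and complementation, which is only feasible for short similarities. The independent-position model with match probability `p_match` and the function names are assumptions. The point of the sketch is that the denominator depends only on the seeds already chosen, so it is computed once and reused for every candidate x.

```python
from itertools import product

def detects(seed, similarity):
    """True if every '1' of the seed lines up with a '1' at some offset."""
    span = len(seed)
    return any(
        all(s != "1" or similarity[i + k] == "1" for k, s in enumerate(seed))
        for i in range(len(similarity) - span + 1)
    )

def conditional_match_probabilities(candidates, chosen, length, p_match):
    """Exact Pr(x detects S | no seed in `chosen` detects S) for each candidate x.

    Enumerates all 2**length similarities, so use small lengths only.
    Assumes the chosen seeds miss at least one similarity of nonzero probability.
    """
    undetected = []                    # (similarity, probability) missed by `chosen`
    for bits in product("01", repeat=length):
        sim = "".join(bits)
        if not any(detects(s, sim) for s in chosen):
            ones = sim.count("1")
            undetected.append((sim, p_match ** ones * (1 - p_match) ** (length - ones)))
    pr_not_chosen = sum(pr for _, pr in undetected)   # denominator, computed once
    return {
        x: sum(pr for sim, pr in undetected if detects(x, sim)) / pr_not_chosen
        for x in candidates
    }
```

For length 2, `p_match = 0.5`, and chosen seed `11`, the candidate `1` has conditional probability 2/3: of the three equally likely similarities that `11` misses (00, 01, 10), two contain a 1.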

19
Outline
  • Problem of multi-seed design
  • Methods
  • Greedy covering algorithm
  • Compute conditional match probabilities
  • Experiments and results
  • Conclusion and future work

20
Greedy Covering vs. Original Local Search
[Figure: detection probability, greedy covering vs. original local search]
21
Greedy Covering is Much Faster
  • When n = 5, on the same hardware platform (P4):
  • Greedy covering needs 20 minutes
  • The original local search needs 2.4 hours

22
Experimental Setup
  • The ungapped alignments are sampled uniformly
    from human and mouse syntenies
  • For a specified seed set:
  • Sensitivity: the number of significant gapped
    alignments found by our BLAST-like alignment tool
  • False positive rate: approximated by the number
    of seed matches

23
Results Verify the Hypothesis on Noncoding
Sequences
seed weight | number of seeds | gapped alignments found (sensitivity) | improvement of sensitivity (%) | total seed matches (approximation of f.p.)
11          | 1               | 251941                                | ----                           | 1.57 x 10^9
10          | 1               | 273831                                | 8.7                            | 5.88 x 10^9
11          | 3               | 292093                                | 15.9                           | 4.56 x 10^9
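The tradeoff in the table can be checked with a few lines of arithmetic. The sketch below recomputes the sensitivity improvement relative to the single weight-11 seed and adds the ratio of seed matches to that baseline; the ratio column is derived here from the table's numbers and does not appear on the slide. The names are hypothetical.

```python
# (seed weight, number of seeds, gapped alignments found, total seed matches)
ROWS = [
    (11, 1, 251941, 1.57e9),   # baseline: single seed of weight 11
    (10, 1, 273831, 5.88e9),   # single seed with lower weight
    (11, 3, 292093, 4.56e9),   # three simultaneous seeds of weight 11
]

def tradeoff(row, baseline):
    """Return (sensitivity gain in percent, seed-match ratio) vs. the baseline."""
    gain = 100.0 * (row[2] / baseline[2] - 1.0)
    cost = row[3] / baseline[3]
    return gain, cost

for row in ROWS[1:]:
    gain, cost = tradeoff(row, ROWS[0])
    print(f"weight {row[0]}, {row[1]} seed(s): +{gain:.1f}% alignments, "
          f"{cost:.2f}x seed matches")
```

Lowering the weight to 10 gains about 8.7% sensitivity at roughly 3.75 times the seed matches, while three weight-11 seeds gain about 15.9% at roughly 2.90 times the seed matches, which is the better tradeoff the hypothesis predicts.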
24
Summary of Contributions
  • Efficient algorithms to design multiple
    simultaneous seeds at reasonable cost
  • Empirical verification: multiple simultaneous
    seeds have a better tradeoff between sensitivity
    and specificity

25
Future Work
  • Design a better evaluation platform for different
    seeds
  • Investigate utility of seeds in multiple sequence
    alignment

26
Acknowledgements
  • Dr. Jeremy Buhler (advisor), Ben Westover, Rachel
    Nordgren, Joseph Lancaster and Christopher Swope
  • Laboratory for Computational Genomics at
    Washington University in Saint Louis
  • http://www.cse.wustl.edu/jbuhler/mandala