Designing Spaced Seeds - PowerPoint PPT Presentation

Slides: 37
Provided by: vineet50
Learn more at: https://cseweb.ucsd.edu

1
Designing Spaced Seeds
2
Project/Exam deadlines
  • May 2
  • Send email to me with a title of your project
  • May 9
  • Each student/group gives a 10 min. presentation
    on their proposed project.
  • Show preliminary computations. What is the test
    plan? What is the data like, and how much is
    there.
  • Last week of classes
  • A 20 min. presentation from each group
  • A written report on the project
  • A take home exam, due electronically on the date
    of the final exam

3
Accuracy
  • Consider a 64 bp sequence that is 70% similar to
    the query.
  • Pr(an 11-mer matches) = 0.3
  • Pr(a spaced seed 11101001.. matches) = 0.466
  • This non-intuitive result leads to selection of
    spaced words that are an order of magnitude
    faster for identical specificity and sensitivity
  • Implemented in PATTERNHUNTER

4
How to compute a spaced seed
  • No good algorithm is known.
  • Iterate over all (M choose W) seeds.
  • Use a computation to decide Pr(match)
  • Choose the seed that maximizes probability.
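
The exhaustive search described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the method of any particular tool: `hit_probability` is a hypothetical helper that uses a plain dynamic program over the last M-1 bits of the match string (simpler and slower than the algorithms discussed later in these slides), and `best_seed` and the small parameters in the usage line are likewise assumptions chosen to keep the run short.

```python
from itertools import combinations


def hit_probability(seed, length, p):
    """Probability that `seed` (a 0/1 string) hits a random match string.

    Each bit of the match string is 1 independently with probability p.
    """
    m = len(seed)
    care = int(seed, 2)          # bit mask of the positions that must be 1
    keep = (1 << (m - 1)) - 1    # mask that keeps the last m-1 bits
    dist = {0: 1.0}              # dist[state] = Pr(last bits == state, no hit yet)
    hit = 0.0
    for t in range(1, length + 1):
        nxt = {}
        for state, pr in dist.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit          # the last m bits read
                if t >= m and (window & care) == care:
                    hit += pr * q                    # first hit ends at position t
                else:
                    key = window & keep
                    nxt[key] = nxt.get(key, 0.0) + pr * q
        dist = nxt
    return hit


def best_seed(span, weight, length, p):
    """Try every placement of `weight` ones in `span` positions; keep the best."""
    best = None
    for ones in combinations(range(span), weight):
        seed = "".join("1" if i in ones else "0" for i in range(span))
        prob = hit_probability(seed, length, p)
        if best is None or prob > best[1]:
            best = (seed, prob)
    return best


if __name__ == "__main__":
    # Hypothetical small instance: span 6, weight 4, region length 20, p = 0.7.
    print(best_seed(6, 4, 20, 0.7))
```

The search is exponential in the span, so this sketch is only practical for small M; it is meant to make the "iterate, score, keep the maximum" loop concrete.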

5
Prob. Computation for Spaced Seeds
  • Given a specific seed Q(M,W), compute the
    probability of a hit in a sequence of length L.
  • We can assume that there is a probability p of
    match.
  • The match mismatch string is a binary string with
    probability p of 1

[Figure: a match/mismatch string over positions 1..L, e.g. 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0]
6
Prob. Computation for Spaced Seeds
  • Given a specific seed Q(M,W), compute the
    probability of a hit in a sequence of length L.
  • Q is a binary string of length M, with W 1s
  • We try to match the binary match string S which
    is a random binary string with probability p of
    success.

[Figure: seed Q of length M (1100.11..0) aligned against the match string S over positions 1..L]
  • P_Q = Prob(Q matches random S at some location)
  • How can we compute P_Q?

7
Computing f(i,b)
  • For a specific string b, define
  • f(i,b) = Prob(Q matches a random string S of
    length i, given that S ends in b)

[Figure: string S over positions 1..i with suffix b]
  • Why is it sufficient to compute f(i,b) for all i,
    b?
  • P_Q = f(L, ε), where ε is the empty string

8
Computing f(i,b)
  • Define B1 as the set of all strings b that match
    a suffix of Q
  • We have two possibilities
  • b ∈ B1: b is consistent with a suffix of Q.
  • b ∈ B0 = B − B1

[Figure: string b aligned against a suffix of Q; string shown: 110001]
9
Computing f(i,b)
  • Case b ∈ B0
  • f(i,b) = f(i-1, b>>1)

[Figure: b aligned against Q]
  • Case b ∈ B1 and |b| = M
  • f(i,b) = 1

10
Computing f(i,b)
  • Case b ∈ B1, |b| < M
  • f(i,b) = p·f(i,1b) + (1-p)·f(i,0b)
    (i is unchanged: only the known suffix grows by
    one bit)
  • Note that if b ∈ B1, then 1b ∈ B1
  • However, it is possible that 0b ∉ B1
  • We want to iterate only over b ∈ B1
  • Find smallest j s.t. (0b)>>j ∈ B1
  • f(i,b) = p·f(i,1b) + (1-p)·f(i-j, (0b)>>j)

[Figure: candidate suffixes aligned against Q]
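
The case analysis for f(i,b) can be written as a short memoized recursion. The sketch below is an illustration under stated assumptions rather than the original implementation. Strings b are Python strings with the most recent bit last, so b>>1 is `b[:-1]` and prepending a bit is `"1" + b`. f(i,b) is taken to be 0 whenever i < M. Instead of searching for the smallest j explicitly, the B0 case is applied repeatedly, which arrives at the same state. All function names are hypothetical.

```python
from functools import lru_cache


def seed_hit_probability(Q, L, p):
    """P_Q = f(L, empty string) for seed Q (0/1 string) and match probability p."""
    M = len(Q)

    def in_B1(b):
        # b is consistent with the suffix of Q of the same length:
        # wherever that suffix has a 1, b must have a 1 as well.
        suffix = Q[M - len(b):]
        return all(s == "1" for s, q in zip(b, suffix) if q == "1")

    @lru_cache(maxsize=None)
    def f(i, b):
        if i < M:                 # no room for a hit in fewer than M positions
            return 0.0
        if not in_B1(b):          # case b in B0: no hit can end at position i
            return f(i - 1, b[:-1])
        if len(b) == M:           # case b in B1 and |b| = M: Q hits at the end
            return 1.0
        # case b in B1, |b| < M: condition on the bit just before b
        return p * f(i, "1" + b) + (1 - p) * f(i, "0" + b)

    return f(L, "")


if __name__ == "__main__":
    # Hypothetical example: a short seed on a region of length 20.
    print(seed_hit_probability("1101", 20, 0.7))
```

Memoization keeps each (i, b) pair from being evaluated more than once, which is what makes the recursion match the iteration over all i and all b described in the slides.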
11
Efficiency
  • |B1| ≤ M·2^(M-W)
  • The iteration proceeds for all i, and all b ∈ B1,
    and each comparison needs O(M) steps
  • O(M·2^(M-W) · L · M) = O(M^2 · 2^(M-W) · L)

12
More efficient algorithm for spaced seed design
  • Due to Buhler, Keich, and Sun
  • Consider seed π (weight w, span s).
  • Let Q_π be the set of all 2^(s-w) strings
    matching π.

[Figure: example seed 1 1 0 0 0 1 0 1 aligned against a match string]
13
Trie construction
  • Our goal is to make an automaton that accepts all
    strings which contain a string from Q_π.
  • Make a trie T_π from Q_π.
  • T_π is a DFA that accepts precisely Q_π
  • Can we convert T_π to a DFA that accepts all
    strings that have a string from Q_π as a
    suffix?

14
Failure links
  • Use of failure links allows us to traverse any
    string until a string from Q_π is reached.
  • Note the DFA has special structure. Does it
    help? No failure links when outgoing edge is 0.
    Therefore, we fail only when we see a 1.

15
Substring automaton
  • We started with a trie that only accepts Q_π
  • Next, we use failure links to accept any string
    with a suffix from Q_π.
  • Finally, make every accepting state an absorbing
    one, to accept all strings containing a string
    from Q_π as a substring.

[Figure: substring automaton for the example π = 1001]
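
The three steps above (trie, failure links, absorbing accepting states) follow the pattern of the Aho-Corasick construction. A minimal sketch for the binary alphabet is given below. It is an illustration, not the authors' code: `seed_strings`, `build_automaton` and `accepts` are hypothetical names, and the seed is written as a 0/1 string in which 0 means "don't care". Because every string matching the seed has the same length, no accepting state is reachable through a failure link alone, so the accepting set is just the leaves of the trie.

```python
from collections import deque
from itertools import product


def seed_strings(seed):
    """All 0/1 strings matching the seed: 1s are fixed, 0s are wildcards."""
    free = [i for i, c in enumerate(seed) if c == "0"]
    out = []
    for bits in product("01", repeat=len(free)):
        s = list(seed)
        for i, bit in zip(free, bits):
            s[i] = bit
        out.append("".join(s))
    return out


def build_automaton(seed):
    """Return (delta, accepting); delta[state][bit] -> state, state 0 is the start."""
    # Step 1: trie of all strings matching the seed.
    children = [{}]
    accepting = set()
    for word in seed_strings(seed):
        node = 0
        for c in word:
            if c not in children[node]:
                children.append({})
                children[node][c] = len(children) - 1
            node = children[node][c]
        accepting.add(node)
    # Step 2: breadth-first pass that fills missing edges using failure links.
    n = len(children)
    fail = [0] * n
    delta = [dict() for _ in range(n)]
    queue = deque()
    for c in "01":
        if c in children[0]:
            delta[0][c] = children[0][c]
            queue.append(children[0][c])
        else:
            delta[0][c] = 0
    while queue:
        u = queue.popleft()
        for c in "01":
            if c in children[u]:
                v = children[u][c]
                fail[v] = delta[fail[u]][c]
                delta[u][c] = v
                queue.append(v)
            else:
                delta[u][c] = delta[fail[u]][c]
    # Step 3: make every accepting state absorbing.
    for q in accepting:
        delta[q] = {"0": q, "1": q}
    return delta, accepting


def accepts(delta, accepting, s):
    """Run the DFA on the 0/1 string s."""
    q = 0
    for c in s:
        q = delta[q][c]
    return q in accepting
```

For example, `build_automaton("1001")` builds the automaton for the seed used in the slide's example, and `accepts(delta, accepting, s)` then reports whether s contains a window matching that seed.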
16
Computing sensitivity of π
  • Compute the probability that a random string of
    length l will match π.
  • Equivalently: what is the probability that a
    random string of length l that starts at the
    begin node will end in an accepting state of A_π?
  • Case 1: Each bit of S is 1 with probability p
  • P(q,t) = probability that we reach q after reading
    the first t bits.
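
The quantity P(q,t) can be propagated forward one bit at a time: all the mass starts at the begin node, and each step sends a fraction p along the 1-edge and 1-p along the 0-edge. The sketch below assumes the automaton is given as a transition table `delta[state][bit]` with state 0 as the begin node. The function name and the tiny hand-built automaton for the seed 11 are hypothetical and serve only as an illustration.

```python
def acceptance_probability(delta, accepting, length, p):
    """Probability that a random 0/1 string of `length` ends in an accepting state."""
    prob = [0.0] * len(delta)
    prob[0] = 1.0                       # P(begin, 0) = 1
    for _ in range(length):
        nxt = [0.0] * len(delta)
        for q, mass in enumerate(prob):
            if mass:
                nxt[delta[q]["1"]] += p * mass          # next bit is a match
                nxt[delta[q]["0"]] += (1 - p) * mass    # next bit is a mismatch
        prob = nxt
    return sum(prob[q] for q in accepting)


# Hand-built automaton for the seed 11 (assumed example):
# state 0 = nothing matched, state 1 = one 1 seen, state 2 = absorbing accept.
delta_11 = [
    {"0": 0, "1": 1},
    {"0": 0, "1": 2},
    {"0": 2, "1": 2},
]

if __name__ == "__main__":
    print(acceptance_probability(delta_11, {2}, 3, 0.5))
```

Each step touches every state once and follows two outgoing edges, so the running time is proportional to the automaton size times the string length.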

17
Complexity
  • Size of the automaton: W·2^(M-W)
  • What is the in-degree?
  • Claimed complexity: O(W·2^(M-W)·L)
  • O(M^2/W) times faster than the previous algorithm

18
Generalizing the match string
  • The match string may have a different
    distribution
  • Errors do not fall independently at random
  • Instead of independent Bernoulli trials, we can
    have a higher-order Markov process generating
    the match string.
  • The algorithm of Keich et al. cannot deal with
    this extension, but it is natural in Mandala

19
Experimental Results with Mandala
  • 428 human/mouse genomic aligned regions.
  • Repeat-mask the alignments and separate into
    coding/non-coding regions.
  • A total of 1,136,000 similarities (alignments) were
    pulled. These are used to check for sensitivity
    (accuracy) of filters.

20
Effect of Span
  • Solid line: 0th-order model
  • Dashed line: 5th-order model
  • W = 11 throughout; larger span implies more gaps,
    span = 11 implies ungapped (BLASTN) seed

21
Accuracy of different seeds
  • Non-coding
  • Coding

22
Model order
  • Non-coding: solid line
  • Coding: dashed line

23
What about multiple keywords
  • All of the analysis is for ungapped alignments.
  • With indels, multiple words might be more
    sensitive.
  • Mandala works for multiple keywords also.
  • Can we make the algorithm more efficient?
  • In particular, there is an explosion of states in
    making a deterministic automaton. Can we match
    with a non-deterministic automaton?

24
Regular Expressions
  • Concise representation of a set of strings over
    alphabet Σ.
  • Described by a string over
  • R is a r.e. if and only if

25
Regular Expression
  • Q: Let Σ = {A, C, E}
  • Is (A+C)*EEC* a regular expression?
  • (AC)?
  • AC..E?
  • Q: When is a string s in a regular expression?
  • R = (A+C)*EEC*
  • Is CEEC in R?
  • AEC?
  • ACEE?

26
Regular Expression Automata
  • Every R.E. can be expressed by an automaton (a
    directed graph) with the following properties
  • The automaton has a start and end node
  • Each edge is labeled with a symbol from Σ, or ε
  • Suppose R is described by automaton A
  • s ∈ R if and only if there is a path from start
    to end in A, labeled with s.

27
Examples Regular Expression Automata
  • (A+C)*EEC*

[Figure: automaton from start to end with edges labeled A, C, E, E, C]
28
Constructing automata from R.E
  • R = ε
  • R = σ, σ ∈ Σ
  • R = R1 + R2
  • R = R1 R2
  • R = R1*

[Figure: automaton fragments for each case, joined by ε edges]
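
One common way to realize these cases (empty string, single symbol, union, concatenation, star) is a Thompson-style construction, sketched below. The slide's figures are not recoverable, so the exact wiring here is an assumption and not necessarily what the slides drew. An automaton is a tuple `(start, end, edges)`, an edge is `(u, label, v)`, and the label `None` stands for ε. All names are hypothetical.

```python
from itertools import count

_fresh = count()  # source of unused node ids


def symbol(a=None):
    """R = a for one symbol a of the alphabet; a=None gives R = epsilon."""
    s, e = next(_fresh), next(_fresh)
    return s, e, [(s, a, e)]


def union(r1, r2):
    """R = R1 + R2: new start/end nodes branch into and out of both parts."""
    (s1, e1, x1), (s2, e2, x2) = r1, r2
    s, e = next(_fresh), next(_fresh)
    eps = [(s, None, s1), (s, None, s2), (e1, None, e), (e2, None, e)]
    return s, e, x1 + x2 + eps


def concat(r1, r2):
    """R = R1 R2: an epsilon edge joins the end of R1 to the start of R2."""
    (s1, e1, x1), (s2, e2, x2) = r1, r2
    return s1, e2, x1 + x2 + [(e1, None, s2)]


def star(r):
    """R = R1*: allow skipping R1 entirely or looping back through it."""
    s1, e1, x1 = r
    s, e = next(_fresh), next(_fresh)
    eps = [(s, None, s1), (e1, None, e), (s, None, e), (e1, None, s1)]
    return s, e, x1 + eps


# Assumed reading of the example expression: (A+C)*EEC*
a_or_c_star = star(union(symbol("A"), symbol("C")))
nfa = concat(concat(concat(a_or_c_star, symbol("E")), symbol("E")), star(symbol("C")))
start, end, edges = nfa
```

Each sub-expression must be built fresh for every occurrence, because reusing one tuple twice would share its nodes. The number of nodes and edges grows linearly with the length of the expression.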
29
Regular Expression Matching
  • Given a database D, and a regular expression R,
    is a substring of D in R?
  • Is there a string D[l..c] that is accepted by the
    automaton of R?
  • Simpler Q: Is D[1..c] accepted by the automaton
    of R?

30
Alg. For matching R.E.
  • If D[1..c] is accepted by the automaton R_A
  • There is a path labeled D1..Dc that goes from
    START to END in R_A

[Figure: path from START to END with edges labeled D1, D2, .., Dc and ε]
31
Alg. For matching R.E.
  • If D[1..c] is accepted by the automaton R_A
  • There is a path labeled D1..Dc that goes from
    START to END in R_A
  • There is a path labeled D1..D(c-1) from START
    to node u, and a path labeled Dc from u to the
    END

32
D.P. to match regular expression
  • Define
  • A[u,σ] = automaton node reached from u after
    reading σ
  • Eps(u) = set of all nodes reachable from node u
    using epsilon transitions.
  • N_c = subset of nodes reachable from START node
    after reading D[1..c]
  • Q: when is v ∈ N_c?

[Figure: edge labeled σ from u to v; node u with its set Eps(u) reached by ε edges]
33
D.P. to match regular expression
  • Q: when is v ∈ N_c?
  • A: If for some u ∈ N_(c-1), w = A[u,Dc],
  • v ∈ {w} ∪ Eps(w)

[Figure: path labeled D1 .. D(c-1) from START to u, edge labeled Dc from u to w, then ε edges from w]
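
The rule for membership in N_c translates directly into a loop over the characters of D. The sketch below is a minimal illustration with several assumptions: `step[(u, σ)]` plays the role of A[u,σ] but holds a set of nodes so that a node may have several edges with the same label, `eps[u]` lists the direct ε edges out of u, and N_0 is taken to be START together with Eps(START). The small hand-built automaton is a hypothetical encoding of (A+C)*EEC*, which is itself an assumed reading of the example expression.

```python
def eps_closure(eps, nodes):
    """The given nodes plus everything reachable from them via epsilon edges."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def reachable_sets(step, eps, start, D):
    """Return [N_0, N_1, ..., N_c], where N_c is the node set after reading D[1..c]."""
    N = [eps_closure(eps, {start})]
    for symbol in D:
        # w = A[u, Dc] for every u in N_(c-1) ...
        moved = {w for u in N[-1] for w in step.get((u, symbol), ())}
        # ... and v ranges over {w} together with Eps(w)
        N.append(eps_closure(eps, moved))
    return N


# Hypothetical automaton for (A+C)*EEC*: node 0 is START, node 3 is END.
START, END = 0, 3
step = {
    (0, "A"): {0},
    (0, "C"): {0},
    (0, "E"): {1},
    (1, "E"): {2},
    (2, "C"): {2},
}
eps = {2: {3}}

if __name__ == "__main__":
    for text in ("CEEC", "AEC", "ACEE"):
        print(text, END in reachable_sets(step, eps, START, text)[-1])
```

Only the previous set N_(c-1) is needed to compute N_c, so the full list is kept here purely for inspection; a version that keeps one set at a time uses memory proportional to the automaton size.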
34
Algorithm
35
The final step
  • We have answered the question
  • Is D[1..c] accepted by R?
  • Yes, if END ∈ N_c
  • We need to answer
  • Is D[l..c] (for some l, and some c) accepted by R?
