Title: Designing Spaced Seeds
1Designing Spaced Seeds
2Project/Exam deadlines
- May 2
- Send email to me with a title of your project
- May 9
- Each student/group gives a 10 min. presentation
on their proposed project.
- Show preliminary computations. What is the test
plan? What is the data like, and how much is
there?
- Last week of classes
- A 20 min. presentation from each group
- A written report on the project
- A take home exam, due electronically on the date
of the final exam
3Accuracy
- Consider a 64 bp sequence that is 70% similar to
the query.
- Pr(an 11-mer matches) = 0.3
- Pr(a spaced seed 11101001.. matches) = 0.466
- This non-intuitive result leads to the selection of
spaced words that are an order of magnitude
faster for identical specificity and sensitivity.
- Implemented in PATTERNHUNTER
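The figure for the contiguous 11-mer can be checked with a small dynamic program over the length of the current run of matches. This is a minimal sketch, assuming every position matches independently with probability 0.7; the function name `contiguous_hit_probability` is ours and is not taken from PATTERNHUNTER.

```python
def contiguous_hit_probability(k, length, p):
    """Pr(a random 0/1 string of `length` bits contains k consecutive 1s),
    assuming independent bits that are 1 with probability p."""
    # run[r] = Pr(no hit so far and the current run of 1s has length r)
    run = [0.0] * k
    run[0] = 1.0
    hit = 0.0
    for _ in range(length):
        nxt = [0.0] * k
        for r, prob in enumerate(run):
            if prob == 0.0:
                continue
            nxt[0] += prob * (1 - p)      # a mismatch resets the run
            if r + 1 == k:
                hit += prob * p           # the run reaches k: the seed hits
            else:
                nxt[r + 1] += prob * p    # a match extends the run
        run = nxt
    return hit


print(round(contiguous_hit_probability(11, 64, 0.7), 3))
```

The same run-length trick does not work for a spaced seed, because a hit then depends on the actual pattern of recent bits and not only on the run length.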
4How to compute a spaced seed
- No good algorithm is known.
- Iterate over all (M choose W) seeds.
- Use a computation to decide Pr(match)
- Choose the seed that maximizes probability.
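The exhaustive search above can be sketched for toy parameters as follows. This is an illustration under stated assumptions, not the lecture's implementation: the helper names `hit_probability` and `best_seed` are ours, bits are independent with match probability p, and the state is simply the last M-1 bits, so it is only practical for small M.

```python
from itertools import combinations


def hit_probability(seed, length, p):
    """Exact Pr(seed hits a random 0/1 string), tracking the last M-1 bits."""
    M = len(seed)
    need = int(seed, 2)                 # positions the seed requires to be 1
    keep = (1 << (M - 1)) - 1           # mask for the last M-1 bits
    no_hit = {0: 1.0}                   # state -> Pr(state and no hit so far)
    hit = 0.0
    for t in range(1, length + 1):
        nxt = {}
        for state, prob in no_hit.items():
            for bit, w in ((1, p), (0, 1 - p)):
                window = (state << 1) | bit          # last M bits read so far
                if t >= M and (window & need) == need:
                    hit += prob * w                  # seed matches, ending at t
                else:
                    s = window & keep
                    nxt[s] = nxt.get(s, 0.0) + prob * w
        no_hit = nxt
    return hit


def best_seed(M, W, length, p):
    """Try every placement of W ones in M positions and keep the best seed."""
    best, seen = None, set()
    for ones in combinations(range(M), W):
        seed = "".join("1" if i in ones else "0" for i in range(M)).strip("0")
        if seed in seen:
            continue                    # same seed after trimming end zeros
        seen.add(seed)
        prob = hit_probability(seed, length, p)
        if best is None or prob > best[1]:
            best = (seed, prob)
    return best


seed, prob = best_seed(M=7, W=4, length=30, p=0.7)
print(seed, round(prob, 4))
```

Because the contiguous seed is one of the candidates, the seed returned is never worse than it under this model.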
5Prob. Computation for Spaced Seeds
- Given a specific seed Q(M,W), compute the
probability of a hit in a sequence of length L.
- We can assume that there is a probability p of
match.
- The match/mismatch string is a binary string with
probability p of 1
[Figure: a match/mismatch string over positions 1 to L:
1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0]
6Prob. Computation for Spaced Seeds
- Given a specific seed Q(M,W), compute the
probability of a hit in a sequence of length L.
- Q is a binary string of length M, with W 1s
- We try to match the binary match string S, which
is a random binary string with probability p of
success.
[Figure: the seed Q (length M) placed against the match string S (positions 1 to L), e.g. 1100.11..0]
- $P_Q$ = Prob(Q matches random S at some location)
- How can we compute $P_Q$?
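Before any exact method, the quantity can be estimated by simulation, which is a useful cross-check for the exact algorithms. This is a rough sketch, assuming independent bits; the names `seed_hits` and `estimate_p_q`, the trial count and the fixed random seed are choices made here for illustration.

```python
import random


def seed_hits(seed, s):
    """True if the seed matches the 0/1 list s at some offset (0 = don't care)."""
    M = len(seed)
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(all(s[start + k] for k in required)
               for start in range(len(s) - M + 1))


def estimate_p_q(seed, length, p, trials=20000, rng=None):
    """Monte Carlo estimate of Pr(seed hits a random match string)."""
    rng = rng or random.Random(0)       # fixed seed so runs are repeatable
    hits = 0
    for _ in range(trials):
        s = [rng.random() < p for _ in range(length)]
        hits += seed_hits(seed, s)
    return hits / trials


print(estimate_p_q("1101", 20, 0.7))
```

The estimate carries sampling error of order 1/sqrt(trials), so it cannot rank seeds whose sensitivities are very close.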
7Computing F(i,b)
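- For a specific string b, define
- f(i,b) = Prob(Q matches a random string S of
length i, given that S ends in b)
[Figure: the string S over positions 1 to i, with suffix b]
- Why is it sufficient to compute f(i,b) for all i,
b?
- $P_Q = f(L, \epsilon)$, where $\epsilon$ is the empty string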
8Computing f(i,b)
- Define $B_1$ as the set of all strings b that match
a suffix of Q
- We have two possibilities
- $b \in B_1$: b is consistent with a suffix of Q.
- $b \in B_0 = B - B_1$
[Figure: b aligned against a suffix of Q, with the example string 110001]
9Computing f(i,b)
- Case $b \in B_0$
- $f(i,b) = f(i-1,\, b \gg 1)$
[Figure: b aligned against Q]
- Case $b \in B_1$ and $|b| = M$
- $f(i,b) = 1$
10Computing f(i,b)
- Case $b \in B_1$, $|b| < M$
- $f(i,b) = p\,f(i,1b) + (1-p)\,f(i,0b)$ (prepending a bit to b leaves the length i of S unchanged)
- Note that if $b \in B_1$, then $1b \in B_1$
- However, it is possible that $0b \notin B_1$
- We want to iterate only over $b \in B_1$
- Find the smallest j s.t. $0b \gg j \in B_1$
- $f(i,b) = p\,f(i,1b) + (1-p)\,f(i-j,\, 0b \gg j)$
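The three cases can be turned into a short memoized recursion. This is a sketch under assumptions made here: strings are written left to right so that `b[:-1]` plays the role of `b >> 1`, the empty string is treated as a member of B1, bits are independent with match probability p, and the names `sensitivity` and `in_B1` are ours.

```python
from functools import lru_cache


def sensitivity(Q, L, p):
    """P_Q = f(L, empty string) for a seed Q given as a 0/1 string."""
    M = len(Q)

    def in_B1(b):
        # b is compatible with the length-|b| suffix of Q:
        # wherever that suffix has a 1, b has a 1 too.
        return all(b[-j] == "1" for j in range(1, len(b) + 1) if Q[-j] == "1")

    @lru_cache(maxsize=None)
    def f(i, b):                      # only ever called with b in B1
        if i < M:
            return 0.0                # S is too short to contain a hit
        if len(b) == M:
            return 1.0                # b itself is a hit ending at position i
        zero_b, j = "0" + b, 0
        while not in_B1(zero_b):      # smallest j with (0b >> j) in B1
            zero_b, j = zero_b[:-1], j + 1
        return p * f(i, "1" + b) + (1 - p) * f(i - j, zero_b)

    return f(L, "")


print(sensitivity("11", 3, 0.5))      # 0.375
```

For the seed 11 and L = 3 the three hitting strings are 110, 111 and 011, which gives 3/8 at p = 0.5. Only strings in B1 are ever memoized, which is what the bound on the size of B1 relies on.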
11Efficiency
- $|B_1| \le M\,2^{M-W}$
- The iteration proceeds for all i and all $b \in B_1$,
and each comparison needs $O(M)$ steps
- $O(M\,2^{M-W} \cdot L \cdot M) = O(M^2\,2^{M-W} L)$
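The size bound can be checked by brute force on a small seed. The sketch below counts the non-empty strings b of length at most M that are compatible with the corresponding suffix of Q; the example seed 1101 and the name `count_B1` are chosen here for illustration only.

```python
from itertools import product


def count_B1(Q):
    """Number of non-empty strings b, |b| <= M, compatible with a suffix of Q."""
    M = len(Q)
    total = 0
    for length in range(1, M + 1):
        suffix = Q[M - length:]
        for bits in product("01", repeat=length):
            # b needs a 1 wherever the suffix of Q has a 1
            if all(b == "1" for b, q in zip(bits, suffix) if q == "1"):
                total += 1
    return total


Q = "1101"
M, W = len(Q), Q.count("1")
print(count_B1(Q), "<=", M * 2 ** (M - W))   # 7 <= 8
```

Each length contributes at most 2^(M-W) strings, since a suffix of Q has at most M-W zeros, and there are M lengths.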
12More efficient algorithm for spaced seed design
- Due to Buhler, Keich, and Sun
- Consider seed $\pi$ (weight w, span s).
- Let $Q_\pi$ be the set of all $2^{s-w}$ possible strings
matching $\pi$.
1 1 0 0 0 1 0 1
1 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1 1 0 0 1 1 0 1 1
1 0 0 1 1 1 1
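The set of strings matching a seed is obtained by filling every don't-care position with both bits. A minimal sketch (the function name `seed_strings` and the example seed 1001 are ours):

```python
from itertools import product


def seed_strings(seed):
    """All 0/1 strings of the seed's span that have a 1 wherever the seed does."""
    free = [i for i, c in enumerate(seed) if c == "0"]   # don't-care positions
    out = []
    for fill in product("01", repeat=len(free)):
        s = list(seed)
        for pos, bit in zip(free, fill):
            s[pos] = bit
        out.append("".join(s))
    return out


print(seed_strings("1001"))   # ['1001', '1011', '1101', '1111']
```

A seed of span 4 and weight 2 gives 2^(4-2) = 4 strings, as expected.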
13Trie construction
- Our goal is to make an automaton that accepts all
strings which contain a string from $Q_\pi$.
- Make a trie $T_\pi$ from $Q_\pi$.
- $T_\pi$ is a DFA that accepts precisely $Q_\pi$
- Can we convert $T_\pi$ to a DFA that accepts all
strings that have a string from $Q_\pi$ as a
suffix?
14Failure links
- Use of failure links allows us to traverse any
string until a string from $Q_\pi$ is reached.
- Note the DFA has special structure. Does it
help? No failure links are needed where the seed has a 0,
since both outgoing edges exist. Therefore, we fail only
where the seed has a 1.
15Substring automaton
- We started with a trie that accepts only $Q_\pi$
- Next, we use failure links to accept any string
with a suffix from $Q_\pi$.
- Finally, make every accepting state an absorbing
one, to accept all strings containing a string
from $Q_\pi$ as a substring.
[Figure: substring automaton for the example $\pi = 1001$; edges are labeled 0 or 1, and the absorbing accepting state loops on 0,1]
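The three steps (trie, failure links, absorbing accepting states) can be sketched with the usual Aho-Corasick style construction. This is an illustrative sketch rather than the implementation of Buhler, Keich, and Sun; the function names and the example seed 1001 are assumptions made here.

```python
from collections import deque
from itertools import product


def seed_strings(seed):
    free = [i for i, c in enumerate(seed) if c == "0"]
    for fill in product("01", repeat=len(free)):
        s = list(seed)
        for pos, bit in zip(free, fill):
            s[pos] = bit
        yield "".join(s)


def build_automaton(seed):
    """Return (goto, accepting) for strings containing a match to the seed."""
    goto, accepting = [{}], set()
    for word in seed_strings(seed):              # step 1: the trie
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        accepting.add(state)
    fail = [0] * len(goto)                       # step 2: failure links
    queue = deque()
    for ch in "01":
        if ch in goto[0]:
            queue.append(goto[0][ch])
        else:
            goto[0][ch] = 0                      # stay at the root
    while queue:                                 # breadth-first, by depth
        state = queue.popleft()
        for ch in "01":
            if ch in goto[state]:
                child = goto[state][ch]
                fail[child] = goto[fail[state]][ch]
                queue.append(child)
            else:                                # missing edge: follow failure
                goto[state][ch] = goto[fail[state]][ch]
    for state in accepting:                      # step 3: absorbing states
        goto[state] = {"0": state, "1": state}
    return goto, accepting


def accepts(goto, accepting, s):
    state = 0
    for ch in s:
        state = goto[state][ch]
    return state in accepting


goto, accepting = build_automaton("1001")
print(len(goto), accepts(goto, accepting, "110010"))   # 12 True
```

All strings in the set have the same length, so the accepting states are exactly the leaves of the trie, which keeps the last step simple.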
16Computing sensitivity of ?
- Compute the probability that a random string of
length l will match $\pi$.
- Equivalently: what is the probability that a
random string of length l, read starting at the
begin node, will end in an accepting state of $A_\pi$?
- Case 1: Each bit of S is 1 with probability p
- $P(q,t)$ = probability that we reach q after reading
the first t bits.
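The table P(q,t) can be filled one bit at a time by pushing probability mass along the automaton's edges. The sketch below assumes independent bits and uses a tiny hand-built automaton for the contiguous seed 11 so that it stays self-contained; the automaton, the dictionary layout `delta[q][bit]` and the function name are illustrative assumptions.

```python
def sensitivity(delta, start, accepting, length, p):
    """Pr(the automaton is in an accepting state after `length` random bits)."""
    # P[q] holds P(q, t) for the current t; P(start, 0) = 1.
    P = {q: 0.0 for q in delta}
    P[start] = 1.0
    for _ in range(length):
        nxt = {q: 0.0 for q in delta}
        for q, prob in P.items():
            nxt[delta[q][1]] += prob * p          # next bit is a match
            nxt[delta[q][0]] += prob * (1 - p)    # next bit is a mismatch
        P = nxt
    return sum(P[q] for q in accepting)


# Hand-built automaton for the contiguous seed 11 (illustrative):
# the state counts trailing 1s, and state 2 is accepting and absorbing.
delta = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 2}, 2: {0: 2, 1: 2}}
print(sensitivity(delta, start=0, accepting={2}, length=3, p=0.5))   # 0.375
```

Each step touches every state once and each state has two outgoing edges, so the work is proportional to the number of states times the string length.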
17Complexity
- Size of the automaton: $W\,2^{M-W}$
- What is the in-degree?
- Claimed complexity: $O(W\,2^{M-W} L)$
- A factor of $O(M^2/W)$ faster than the previous algorithm
18Generalizing the match string
- The match string may have a different
distribution
- Errors do not fall independently at random
- Instead of independent Bernoulli trials, we can
have a higher-order Markov process generating
the match string.
- The algorithm of Keich et al. cannot deal with
this extension, but it is natural in Mandala
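With an automaton, a Markov model only requires pairing each automaton state with the recent history of the match string. The sketch below shows the first-order case as an illustration; it is not Mandala's code, and the parameter names `first` and `trans` and the hand-built automaton for the seed 11 are assumptions made here. Higher orders would keep the last k bits in place of `prev`.

```python
def markov_sensitivity(delta, start, accepting, length, first, trans):
    """Pr(accept) when each bit depends on the previous bit.

    first       -- Pr(first bit is 1)
    trans[prev] -- Pr(next bit is 1 | previous bit is prev)
    Assumes length >= 1.
    """
    # P[(q, prev)] = Pr(automaton is in q and the last bit read was prev)
    P = {}
    for bit, w in ((1, first), (0, 1 - first)):
        key = (delta[start][bit], bit)
        P[key] = P.get(key, 0.0) + w
    for _ in range(length - 1):
        nxt = {}
        for (q, prev), prob in P.items():
            for bit, w in ((1, trans[prev]), (0, 1 - trans[prev])):
                key = (delta[q][bit], bit)
                nxt[key] = nxt.get(key, 0.0) + prob * w
        P = nxt
    return sum(prob for (q, _), prob in P.items() if q in accepting)


# Hand-built automaton for the contiguous seed 11 (illustrative).
delta = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 2}, 2: {0: 2, 1: 2}}
# Matches that always follow matches make hits more likely: 0.75 vs 0.375.
print(markov_sensitivity(delta, 0, {2}, 3, 0.5, {0: 0.5, 1: 1.0}))   # 0.75
```

When `trans[0] == trans[1] == first`, the model reduces to independent Bernoulli trials and gives the same answer as the 0-th order computation.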
19Experimental Results with Mandala
- 428 human/mouse genomic aligned regions.
- Repeat-mask the alignments and separate into
coding/non-coding regions.
- A total of 1,136,000 similarities (alignments) were
pulled. These are used to check the sensitivity
(accuracy) of filters.
20Effect of Span
- Solid line: 0-th order model
- Dashed line: 5-th order model.
- W = 11 throughout; a larger span implies more gaps,
and span = 11 implies an ungapped (BLASTN) seed
21Accuracy of different seeds
22Model order
- Solid line: non-coding
- Dashed line: coding
23What about multiple keywords
- All of the analysis is for ungapped alignments.
- With indels, multiple words might be more
sensitive.
- Mandala works for multiple keywords also.
- Can we make the algorithm more efficient?
- In particular, there is an explosion of states in
making a deterministic automaton. Can we match with a
non-deterministic automaton?
24Regular Expressions
- Concise representation of a set of strings over
alphabet $\Sigma$.
- Described by a string over
- R is a r.e. if and only if
25Regular Expression
- Q: Let $\Sigma = \{A, C, E\}$
- Is (A+C)*EEC a regular expression?
- (AC)?
- AC..E?
- Q: When is a string s in a regular expression?
- R = (A+C)*EEC
- Is CEEC in R?
- AEC?
- ACEE?
26Regular Expression Automata
- Every R.E. can be expressed by an automaton (a
directed graph) with the following properties
- The automaton has a start and end node
- Each edge is labeled with a symbol from $\Sigma$, or $\epsilon$
- Suppose R is described by automaton A
- $s \in R$ if and only if there is a path from start
to end in A, labeled with s.
27Examples Regular Expression Automata
[Figure: example automaton from start to end, with edges labeled C, A, E, E, C]
28Constructing automata from R.E
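- $R = \epsilon$
- $R = \sigma$, $\sigma \in \Sigma$
- $R = R_1 + R_2$
- $R = R_1 \cdot R_2$
- $R = R_1^*$
[Figure: the automaton fragment for each case, joined by ε-edges]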
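One common way to realize such a case-by-case construction is Thompson's construction, where each case becomes a small function that glues fragments together with ε-edges. The sketch below is an assumption about the intended construction, not a transcription of the slide's figures; the fragment layout `(start, end, edges)`, the use of `None` as the ε label, and the example expression (A+C)*EEC are choices made here.

```python
from itertools import count

EPS = None                      # label for epsilon edges (our convention)
_fresh = count()                # supplies new node ids


def symbol(ch):
    """Automaton for a single symbol: start --ch--> end."""
    s, e = next(_fresh), next(_fresh)
    return s, e, [(s, ch, e)]


def union(a, b):
    """Automaton for R1 + R2: branch into either fragment."""
    (s1, e1, ed1), (s2, e2, ed2) = a, b
    s, e = next(_fresh), next(_fresh)
    return s, e, ed1 + ed2 + [(s, EPS, s1), (s, EPS, s2),
                              (e1, EPS, e), (e2, EPS, e)]


def concat(a, b):
    """Automaton for R1 followed by R2."""
    (s1, e1, ed1), (s2, e2, ed2) = a, b
    return s1, e2, ed1 + ed2 + [(e1, EPS, s2)]


def star(a):
    """Automaton for zero or more repetitions of R1."""
    s1, e1, ed1 = a
    s, e = next(_fresh), next(_fresh)
    return s, e, ed1 + [(s, EPS, s1), (e1, EPS, e),
                        (s, EPS, e), (e1, EPS, s1)]


# Illustrative expression (A+C)*EEC over the alphabet {A, C, E}.
start, end, edges = concat(star(union(symbol("A"), symbol("C"))),
                           concat(symbol("E"), concat(symbol("E"), symbol("C"))))
print(len(edges), "edges")   # 16 edges
```

Every node built this way has at most one outgoing edge labeled with a symbol, so the number of nodes and edges grows linearly with the length of the expression.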
29Regular Expression Matching
- Given a database D and a regular expression R,
is a substring of D in R?
- Is there a string $D[l..c]$ that is accepted by the
automaton of R?
- Simpler Q: Is $D[1..c]$ accepted by the automaton
of R?
30Alg. For matching R.E.
- If $D[1..c]$ is accepted by the automaton $R_A$
- There is a path labeled $D_1 \ldots D_c$ that goes from
START to END in $R_A$
[Figure: a path from START to END with edges labeled ε, $D_1$, $D_2$, ..., $D_c$]
31Alg. For matching R.E.
- If $D[1..c]$ is accepted by the automaton $R_A$
- There is a path labeled $D_1 \ldots D_c$ that goes from
START to END in $R_A$
- There is a path labeled $D_1 \ldots D_{c-1}$ from START
to node u, and a path labeled $D_c$ from u to the
END
32D.P. to match regular expression
- Define
- $A[u,\sigma]$ = automaton node reached from u after
reading $\sigma$
- Eps(u) = set of all nodes reachable from node u
using epsilon transitions.
- $N_c$ = subset of nodes reachable from the START node
after reading $D[1..c]$
- Q: when is $v \in N_c$?
[Figure: an edge labeled σ from u to v, and ε-edges from u to the nodes in Eps(u)]
33D.P. to match regular expression
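- Q: when is $v \in N_c$?
- A: If for some $u \in N_{c-1}$, $w = A[u, D_c]$,
- $v \in \{w\} \cup \mathrm{Eps}(w)$
[Figure: a path labeled $D_1 \ldots D_{c-1}$ from START to u, an edge labeled $D_c$ from u to w, and ε-edges leaving w]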
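The rule for building each node set from the previous one can be sketched as a loop over the database string. The automaton below is hand-built for the illustrative expression (A+C)*EEC, which is an assumption of this example, as are the names `eps_closure` and `accepts` and the dictionary encodings `A[(u, symbol)]` and `eps[u]`.

```python
def eps_closure(nodes, eps):
    """The given nodes plus all nodes reachable from them by epsilon edges."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def accepts(D, start, end, A, eps):
    """True if the whole string D labels a path from start to end."""
    N = eps_closure({start}, eps)                       # N_0
    for ch in D:
        moved = {A[(u, ch)] for u in N if (u, ch) in A}
        N = eps_closure(moved, eps)                     # N_c from N_{c-1}
    return end in N


# Hand-built automaton for (A+C)*EEC: node 0 is START, node 8 is END.
A = {(1, "A"): 3, (2, "C"): 4, (5, "E"): 6, (6, "E"): 7, (7, "C"): 8}
eps = {0: {1, 2, 5}, 3: {0}, 4: {0}}
print([accepts(s, 0, 8, A, eps) for s in ("CEEC", "AEC", "ACEE")])
# [True, False, False]
```

Only the current node set is kept, so the memory used is bounded by the size of the automaton regardless of the length of D.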
34Algorithm
35The final step
- We have answered the question
- Is $D[1..c]$ accepted by R?
- Yes, if END $\in N_c$
- We need to answer
- Is $D[l..c]$ (for some l, and some c) accepted by R?
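One standard way to handle an arbitrary starting position l, offered here as a possible answer and not necessarily the one given in the lecture, is to add the START node back into the node set at every position so that a match may begin anywhere. The sketch reuses the hand-built automaton for the illustrative expression (A+C)*EEC; all names are assumptions of this example.

```python
def eps_closure(nodes, eps):
    """The given nodes plus all nodes reachable from them by epsilon edges."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def match_ends(D, start, end, A, eps):
    """1-based positions c such that some substring D[l..c] is accepted."""
    restart = eps_closure({start}, eps)
    N, ends = set(restart), []
    for c, ch in enumerate(D, start=1):
        moved = {A[(u, ch)] for u in N if (u, ch) in A}
        N = eps_closure(moved, eps) | restart   # a match may also begin at c+1
        if end in N:
            ends.append(c)
    return ends


# Hand-built automaton for (A+C)*EEC: node 0 is START, node 8 is END.
A = {(1, "A"): 3, (2, "C"): 4, (5, "E"): 6, (6, "E"): 7, (7, "C"): 8}
eps = {0: {1, 2, 5}, 3: {0}, 4: {0}}
print(match_ends("AAEECAEEC", 0, 8, A, eps))   # [5, 9]
```

This reports where a match ends but not where it starts; recovering l would need extra bookkeeping per node.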