Designing Spaced Seeds

Slides: 37

Provided by: vineet50

Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes

1
Designing Spaced Seeds
2
Project/Exam deadlines

May 2
Send email to me with a title of your project
May 9
Each student/group gives a 10 min. presentation
on their proposed project.
Show preliminary computations. What is the test
plan? What is the data like, and how much is
there.
Last week of classes
A 20 min. presentation from each group
A written report on the project
A take home exam, due electronically on the date
of the final exam

3
Accuracy

Consider a 64bp sequence that is 70% similar to
the query.
Pr(an 11-mer matches) = 0.3
Pr(a spaced seed 11101001.. matches) = 0.466
This non-intuitive result leads to selection of
spaced words that are an order of magnitude
faster for identical specificity and sensitivity
Implemented in PATTERNHUNTER

4
How to compute a spaced seed

No good algorithm is known.
Iterate over all (M choose W) seeds.
Use a computation to decide Pr(match)
Choose the seed that maximizes probability.
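
The search loop above can be sketched as follows, for small M only. The names `hit_probability` and `best_seed` are illustrative. The probability routine is a plain sliding-window dynamic program, assumed here for simplicity, not one of the faster algorithms discussed later in the slides.

```python
from itertools import combinations

def hit_probability(seed: str, L: int, p: float) -> float:
    """Exact Pr(seed hits a random match string of length L), bits i.i.d. with Pr(1) = p."""
    M = len(seed)
    need = int(seed, 2)              # bitmask of the positions that must be 1
    mask = (1 << (M - 1)) - 1        # keeps only the last M-1 bits of the window
    no_hit = {0: 1.0}                # window of recent bits -> Pr(no hit so far)
    for t in range(1, L + 1):
        nxt = {}
        for w, pr in no_hit.items():
            for bit, pb in ((1, p), (0, 1.0 - p)):
                full = (w << 1) | bit                    # the last M bits read
                if t >= M and (full & need) == need:
                    continue                             # a hit: leaves the no-hit mass
                key = full & mask
                nxt[key] = nxt.get(key, 0.0) + pr * pb
        no_hit = nxt
    return 1.0 - sum(no_hit.values())

def best_seed(M: int, W: int, L: int, p: float):
    """Try all (M choose W) seeds and return (seed, probability) with the highest Pr(hit)."""
    best = None
    for ones in combinations(range(M), W):
        seed = "".join("1" if i in ones else "0" for i in range(M))
        pr = hit_probability(seed, L, p)
        if best is None or pr > best[1]:
            best = (seed, pr)
    return best

print(best_seed(M=8, W=5, L=32, p=0.7))
```

The window state has 2^(M-1) values, so this sketch is only practical for short spans.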

5
Prob. Computation for Spaced Seeds

Given a specific seed Q(M,W), compute the
probability of a hit in a sequence of length L.
We can assume that each position matches independently with
probability p.
The match/mismatch string is a binary string in which each bit
is 1 with probability p.

[Figure: example match string over positions 1..L: 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0]
6
Prob. Computation for Spaced Seeds

Given a specific seed Q(M,W), compute the
probability of a hit in a sequence of length L.
Q is a binary string of length M, with W 1s.
We try to match Q against the binary match string S, which
is a random binary string of length L with probability p of
success at each position.

[Figure: seed Q (length M) aligned against the match string S (positions 1..L)]

P_Q = Prob(Q matches random S at some location)
How can we compute P_Q?
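
As a reference point for the definition of P_Q, here is a minimal brute-force sketch that enumerates every match string of length L. It is exponential in L and only meant for checking faster methods on tiny inputs. The function names are illustrative.

```python
from itertools import product

def seed_hits(Q: str, S: str) -> bool:
    """True if, at some offset, S has a 1 wherever the seed Q has a 1."""
    M = len(Q)
    return any(
        all(S[k + j] == "1" for j in range(M) if Q[j] == "1")
        for k in range(len(S) - M + 1)
    )

def hit_probability_bruteforce(Q: str, L: int, p: float) -> float:
    """Sum the probability of every length-L binary string that Q hits."""
    total = 0.0
    for bits in product("01", repeat=L):
        S = "".join(bits)
        if seed_hits(Q, S):
            ones = S.count("1")
            total += p ** ones * (1 - p) ** (L - ones)
    return total

print(hit_probability_bruteforce("101", 4, 0.5))
```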

7
Computing f(i,b)

For a specific string b, define
f(i,b) = Prob(Q matches a random string S of
length i, given that S ends in b)

[Figure: string S over positions 1..i, ending in the suffix b]

Why is it sufficient to compute f(i,b) for all i,
b?
P_Q = f(L, ε), where ε is the empty string

8
Computing f(i,b)

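Define B1 as the set of all strings b that match
a suffix of Q: aligned with the end of Q, b has a 1
wherever Q has a 1.

We have two possibilities
b ∈ B1: b is consistent with a suffix of Q.
b ∈ B0 = B - B1

[Figure: a string b aligned with the end of Q; the example shown is 110001]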
9
Computing f(i,b)

Case b ∈ B0
f(i,b) = f(i-1, b>>1), where b>>1 drops the last bit of b

[Figure: b aligned with the end of Q]

Case b ∈ B1 and |b| = M
f(i,b) = 1

10
Computing f(i,b)

Case b ∈ B1, |b| < M
f(i,b) = p f(i,1b) + (1-p) f(i,0b)
Note that if b ∈ B1, then 1b ∈ B1
However, it is possible that 0b ∉ B1
We want to iterate only over b ∈ B1
Find smallest j s.t. 0b>>j ∈ B1
f(i,b) = p f(i,1b) + (1-p) f(i-j, 0b>>j)
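
The three cases can be put together in a short memoized sketch. It assumes f(i,b) is conditioned on S (of length i) ending in b, so prepending a bit to b leaves i unchanged. It also assumes the base case f(i,b) = 0 for i < M, which the slides do not state. The helper names are illustrative.

```python
from functools import lru_cache

def hit_probability(Q: str, L: int, p: float) -> float:
    """P_Q = f(L, empty string), evaluated only over suffixes b in B1."""
    M = len(Q)

    def compatible(b: str) -> bool:
        # b is in B1 if, right-aligned with Q, it has a 1 wherever Q has a 1.
        return len(b) <= M and all(
            b[-j] == "1" for j in range(1, len(b) + 1) if Q[-j] == "1"
        )

    @lru_cache(maxsize=None)
    def f(i: int, b: str) -> float:
        if i < M:                 # assumed base case: string too short for any hit
            return 0.0
        if len(b) == M:           # b in B1 with |b| = M: Q hits at the last position
            return 1.0
        zb = "0" + b
        j = 0                     # smallest j such that (0b >> j) is back in B1
        while not compatible(zb[: len(zb) - j]):
            j += 1
        return p * f(i, "1" + b) + (1 - p) * f(i - j, zb[: len(zb) - j])

    return f(L, "")

print(hit_probability("101", 4, 0.5))
```

The recursion depth grows with L, so a long sequence would call for an iterative table in place of the recursive calls.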
11
Efficiency

|B1| ≤ M 2^(M-W)
The iteration proceeds for all i, and all b ∈ B1,
and each comparison needs O(M) steps
O(M 2^(M-W) L M) = O(M^2 2^(M-W) L)
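
A small arithmetic check of the bound on |B1|, as a sketch. For each length l from 1 to M, a compatible b is free exactly where the last l positions of Q hold a 0, so there are at most 2^(M-W) choices per length. The function names are illustrative, and only nonempty strings b are counted.

```python
def count_compatible(Q: str) -> int:
    """Number of nonempty strings b (|b| <= M) that are compatible with a suffix of Q."""
    M = len(Q)
    return sum(2 ** Q[M - l:].count("0") for l in range(1, M + 1))

def size_bound(Q: str) -> int:
    """The bound M * 2^(M-W), with M = span and W = number of 1s in Q."""
    M, W = len(Q), Q.count("1")
    return M * 2 ** (M - W)

for Q in ("101", "1001", "1101011"):
    print(Q, count_compatible(Q), size_bound(Q))
```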

12
More efficient algorithm for spaced seed design

Due to Buhler, Keich, and Sun
Consider seed π (weight w, span s).
Let Q_π be the set of all possible 2^(s-w) strings
matching π.

[Figure: seed π = 11000101 with example strings from Q_π: 11000101, 11000111, 11001101, 11001111]
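
A minimal sketch of how the set of strings matching a seed can be enumerated: every 0 (don't-care) position of the seed is filled with both bit values. The function name is illustrative.

```python
from itertools import product

def strings_matching(seed: str) -> list:
    """All 2^(s-w) binary strings that have a 1 wherever the seed has a 1."""
    free = [i for i, c in enumerate(seed) if c == "0"]   # don't-care positions
    out = []
    for fill in product("01", repeat=len(free)):
        s = list(seed)
        for i, bit in zip(free, fill):
            s[i] = bit
        out.append("".join(s))
    return out

print(strings_matching("1001"))   # ['1001', '1011', '1101', '1111']
```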
13
Trie construction

Our goal is to make an automaton that accepts all
strings which contain a string from Q_π.
Make a trie T_π from Q_π.
T_π is a DFA that accepts precisely Q_π.
Can we convert T_π to a DFA that accepts all
strings that have a string from Q_π as a
suffix?

14
Failure links

Use of failure links allows us to traverse any
string until a string from Q_π is reached.
Note the DFA has special structure. Does it
help? No failure links are needed where the seed has a 0,
since both outgoing edges exist there.
Therefore, we can fail only where the seed requires a 1.

15
Substring automaton

We started with a trie that accepts only Q_π.
Next, we use failure links to accept any string
with a suffix from Q_π.
Finally, make every accepting state an absorbing
one, to accept all strings containing a string
from Q_π as a substring.

[Figure: example automaton for π = 1001, with edges labeled 0 and 1 and a 0,1 self-loop on the absorbing accepting state]
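
The three steps (trie, failure links, absorbing accepting states) can be sketched with a standard Aho-Corasick style construction. This is an illustration under that assumption, not the authors' implementation. `build_seed_dfa` and `accepts` are hypothetical names.

```python
from collections import deque
from itertools import product

def build_seed_dfa(seed: str):
    """Return (delta, accepting), where delta[q] = [state on 0, state on 1]."""
    # The patterns: every string with a 1 wherever the seed has a 1.
    free = [i for i, c in enumerate(seed) if c == "0"]
    patterns = []
    for fill in product("01", repeat=len(free)):
        s = list(seed)
        for i, bit in zip(free, fill):
            s[i] = bit
        patterns.append("".join(s))

    # Step 1: trie. goto[q] maps a bit to a child state; state 0 is the root.
    goto, accepting = [{}], set()
    for pat in patterns:
        q = 0
        for c in pat:
            if c not in goto[q]:
                goto.append({})
                goto[q][c] = len(goto) - 1
            q = goto[q][c]
        accepting.add(q)

    # Step 2: failure links in BFS order; missing edges follow the failure state.
    # All patterns share one length, so no failure state is itself accepting.
    fail = [0] * len(goto)
    delta = [[0, 0] for _ in goto]
    queue = deque([0])
    while queue:
        q = queue.popleft()
        for bit, c in enumerate("01"):
            if c in goto[q]:
                child = goto[q][c]
                fail[child] = delta[fail[q]][bit] if q else 0
                delta[q][bit] = child
                queue.append(child)
            else:
                delta[q][bit] = delta[fail[q]][bit] if q else 0

    # Step 3: make every accepting state absorbing.
    for q in accepting:
        delta[q] = [q, q]
    return delta, accepting

def accepts(delta, accepting, S: str) -> bool:
    """Run S through the DFA from the root and report whether it ends accepting."""
    q = 0
    for c in S:
        q = delta[q][int(c)]
    return q in accepting

delta, accepting = build_seed_dfa("1001")
print(len(delta), accepts(delta, accepting, "00101100"))
```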
16
Computing sensitivity of ?

Compute the probability that a random string of
length l will match π.
Equivalently: what is the probability that a
random string of length l that starts at the
begin node will end in an accepting state of A_π?
Case 1: Each bit of S is 1 with probability p
P(q,t) = probability that we reach state q after reading
the first t bits.
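
The table P(q,t) can be filled one bit at a time by pushing probability mass along the automaton's edges. The sketch below uses a small hand-built automaton for the seed 101 (states named by the longest useful suffix read so far). Both the table `DELTA_101` and the function name are assumptions made for illustration.

```python
def sensitivity(delta, accepting, L: int, p: float) -> float:
    """Pr(a random length-L string ends in an accepting state), bits i.i.d. with Pr(1) = p."""
    P = [0.0] * len(delta)
    P[0] = 1.0                              # P(q, 0): all mass on the start state
    for _ in range(L):
        nxt = [0.0] * len(delta)
        for q, mass in enumerate(P):
            if mass:
                nxt[delta[q][0]] += mass * (1 - p)   # read a 0
                nxt[delta[q][1]] += mass * p         # read a 1
        P = nxt
    return sum(P[q] for q in accepting)

# Hand-built automaton for seed 101, i.e. patterns {101, 111}.
DELTA_101 = [
    (0, 1),  # state 0: no useful suffix
    (2, 3),  # state 1: suffix "1"
    (0, 4),  # state 2: suffix "10"
    (2, 4),  # state 3: suffix "11"
    (4, 4),  # state 4: accepting and absorbing
]

print(sensitivity(DELTA_101, {4}, 4, 0.5))
```

Each step touches every state once with two outgoing edges, so the cost is proportional to (number of states) x L.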

17
Complexity

Size of the automaton: W 2^(M-W)
What is the in-degree?
Claimed complexity: O(W 2^(M-W) L)
O(M^2/W) faster than the previous algorithm

18
Generalizing the match string

The match string may have a different
distribution
Errors do not fall independently at random
Instead of independent Bernoulli trials, we can
have a higher-order Markov process generating
the match string.
The algorithm of Keich et al. cannot deal with
this extension, but it is natural in Mandala.
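
One way to see why the automaton approach extends to Markov models: the dynamic-programming state can carry the recent bits alongside the automaton state. The sketch below does this for a first-order model only. It is not Mandala's code. The hand-built table `DELTA_101` (seed 101) and the parameter names `p_first` and `p_after` are assumptions for illustration.

```python
def sensitivity_markov1(delta, accepting, L, p_first, p_after):
    """Pr(hit) when Pr(bit = 1) depends on the previous bit.

    p_first is Pr(first bit = 1); p_after[prev] is Pr(next bit = 1 | previous bit = prev).
    Requires L >= 1.
    """
    P = {}                                   # (automaton state, previous bit) -> mass
    for bit, pb in ((0, 1 - p_first), (1, p_first)):
        key = (delta[0][bit], bit)
        P[key] = P.get(key, 0.0) + pb
    for _ in range(L - 1):
        nxt = {}
        for (q, prev), mass in P.items():
            p1 = p_after[prev]
            for bit, pb in ((0, 1 - p1), (1, p1)):
                key = (delta[q][bit], bit)
                nxt[key] = nxt.get(key, 0.0) + mass * pb
        P = nxt
    return sum(m for (q, _), m in P.items() if q in accepting)

# Hand-built automaton for seed 101; state 4 is accepting and absorbing.
DELTA_101 = [(0, 1), (2, 3), (0, 4), (2, 4), (4, 4)]

print(sensitivity_markov1(DELTA_101, {4}, 4, 0.5, (0.5, 0.5)))
```

With `p_after = (p, p)` the model reduces to independent bits. A k-th order model would carry the last k bits in the state in place of one.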

19
Experimental Results with Mandala

428 human/mouse genomic aligned regions.
Repeat mask the alignments and separate into
coding/non-coding regions.
A total of 1136000 similarities (alignments) were
pulled. These are used to check for sensitivity
(accuracy) of filters.

20
Effect of Span

Solid line: 0-th order model
Dashed line: 5-th order model
W = 11 throughout; larger span implies more gaps,
span = 11 implies an ungapped (BLASTN) seed

21
Accuracy of different seeds

Non-coding
Coding

22
Model order

Non-coding: solid line
Coding: dashed line

23
What about multiple keywords

All of the analysis is for ungapped alignments.
With indels, multiple words might be more
sensitive.
Mandala works for multiple keywords also.
Can we make the algorithm more efficient?
In particular, there is an explosion of states in
making a deterministic automaton. Can we match with a
non-deterministic automaton?

24
Regular Expressions

Concise representation of a set of strings over
alphabet Σ.
Described by a string over
R is a r.e. if and only if

25
Regular Expression

Q: Let Σ = {A, C, E}
Is (AC)EEC a regular expression?
(AC)?
AC..E?

Q: When is a string s in a regular expression?
R = (AC)EEC
Is CEEC in R?
AEC?
ACEE?

26
Regular Expression Automata

Every R.E. can be expressed by an automaton (a
directed graph) with the following properties
The automaton has a start and end node
Each edge is labeled with a symbol from Σ, or ε

Suppose R is described by automaton A
s ∈ R if and only if there is a path from start
to end in A, labeled with s.

27
Examples Regular Expression Automata

(AC)EEC

[Figure: automaton from start to end; edges labeled C, A, E, E, C]
28
Constructing automata from R.E

R = ε
R = σ, σ ∈ Σ
R = R1 + R2 (union)
R = R1 R2 (concatenation)
R = R1* (closure)

[Figure: automaton fragment for each case, joined by ε-edges]
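
The inductive cases can be sketched as small combinators that each return a (start, end) pair and add edges to a shared list, in the style of the standard Thompson construction (assumed here; the slide's own figures did not survive). The class and function names are illustrative, and the example uses a hypothetical expression (A|C)*EEC over {A, C, E}.

```python
from itertools import count

EPS = None  # label used for ε-edges

class Builder:
    """Collects edges (u, label, v); every fragment is a (start, end) pair."""
    def __init__(self):
        self.ids = count()
        self.edges = []

    def _pair(self):
        return next(self.ids), next(self.ids)

    def symbol(self, a):                 # R = σ (pass EPS for R = ε)
        s, e = self._pair()
        self.edges.append((s, a, e))
        return s, e

    def union(self, f, g):               # R = R1 + R2
        s, e = self._pair()
        for fs, fe in (f, g):
            self.edges += [(s, EPS, fs), (fe, EPS, e)]
        return s, e

    def concat(self, f, g):              # R = R1 R2
        self.edges.append((f[1], EPS, g[0]))
        return f[0], g[1]

    def star(self, f):                   # R = R1*
        s, e = self._pair()
        self.edges += [(s, EPS, f[0]), (f[1], EPS, e), (s, EPS, e), (f[1], EPS, f[0])]
        return s, e

def accepts(edges, start, end, s: str) -> bool:
    """True if some start-to-end path is labeled s (ε-edges are free)."""
    def closure(nodes):
        seen, stack = set(nodes), list(nodes)
        while stack:
            u = stack.pop()
            for a, lab, b in edges:
                if a == u and lab is EPS and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen
    cur = closure({start})
    for ch in s:
        cur = closure({b for a, lab, b in edges if a in cur and lab == ch})
    return end in cur

nb = Builder()
frag = nb.star(nb.union(nb.symbol("A"), nb.symbol("C")))
for ch in "EEC":
    frag = nb.concat(frag, nb.symbol(ch))
START, END = frag
print(accepts(nb.edges, START, END, "CEEC"), accepts(nb.edges, START, END, "AEC"))
```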
29
Regular Expression Matching

Given a database D, and a regular expression R,
is a substring of D in R?

Is there a string D[l..c] that is accepted by the
automaton of R?

Simpler Q: Is D[1..c] accepted by the automaton
of R?

30
Alg. For matching R.E.

If D[1..c] is accepted by the automaton R_A
There is a path labeled D1...Dc that goes from
START to END in R_A

[Figure: path from START to END with edges labeled ε, D1, D2, ..., Dc]
31
Alg. For matching R.E.

If D[1..c] is accepted by the automaton R_A
There is a path labeled D1...Dc that goes from
START to END in R_A
There is a path labeled D1...D(c-1) from START
to node u, and a path labeled Dc from u to the
END
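
The decomposition above suggests a table indexed by (node, c): node v is reachable after c characters if some node u was reachable after c-1 characters and an edge labeled Dc leads from u to v. A minimal sketch follows, keeping one set of reachable nodes per prefix. The edge list and the function name are hypothetical, and the example automaton is hand-built for an assumed expression (A|C)*EEC.

```python
EPS = None  # label used for ε-edges

def accepted_prefixes(edges, start, end, D: str) -> list:
    """All c such that D[1..c] labels a START-to-END path (c = 0 is the empty prefix)."""
    def closure(nodes):
        # Add everything reachable through ε-edges alone.
        seen, stack = set(nodes), list(nodes)
        while stack:
            u = stack.pop()
            for a, lab, b in edges:
                if a == u and lab is EPS and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen

    reach = closure({start})                 # nodes reachable after 0 characters
    result = [0] if end in reach else []
    for c, ch in enumerate(D, start=1):
        # Nodes reachable after c characters: one edge labeled D_c from the previous set.
        reach = closure({v for u, lab, v in edges if u in reach and lab == ch})
        if end in reach:
            result.append(c)
    return result

# Hand-built automaton: node 0 is START (with A and C self-loops), node 3 is END.
EDGES = [(0, "A", 0), (0, "C", 0), (0, "E", 1), (1, "E", 2), (2, "C", 3)]

print(accepted_prefixes(EDGES, 0, 3, "ACEECEEC"))
```

Each character costs one pass over the edge list, so the total work is proportional to (number of edges) x (length of D), ignoring the ε-closure.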

32
D.P. to match regular expression