Modern Homology Search

About This Presentation

Title:

Modern Homology Search

Description:

Modern Homology Search Ming Li Canada Research Chair in Bioinformatics Computer Science. U. Waterloo I want to show you how one simple theoretical idea can make big ... – PowerPoint PPT presentation

Number of Views:193

Avg rating:3.0/5.0

Slides: 83

Provided by: Ming74

Category:

more less

Transcript and Presenter's Notes

Title: Modern Homology Search

1
Modern Homology Search
Ming Li Canada Research Chair in
Bioinformatics Computer Science. U. Waterloo
2

I want to show you how one simple theoretical
idea can make big impact to a practical field.

3
Outline

Motivation, market, biology, homology search
Review dynamic programming, BLAST heuristics
Optimized spaced seeds
The idea
How to compute them
Data models, coding regions, HMM, vector seeds
Multiple optimized spaced seeds
PatternHunter program
Mathematical theory of spaced seeds
Why they are better
Complexity of finding optimal spaced seeds
Applications beyond bioinformatics Open problems

4
1. Introduction

Biology
Motivation and market
Definition of homology search problem

5
Biology
Phenotype
6
Human 3 billion bases, 30k genes. E. coli 5
million bases, 4k genes
T
A
A
T
C
G
cDNA
T
A
reverse transcription
T
A
translation
transcription
C
G
Protein
mRNA
G
C
(20 amino acids)
(A,C,G,U)
G
C
Codon three nucleotides encode an
amino acid. 64 codons 20 amino acids, some w/more
codes
A
T
C
G
T
A
A
T
7
A gigantic gold mine

The trend of genetic data growth
400 Eukaryote genome projects underway
GenBank doubles every 18 months
Comparative genomics ? all-against-all search
Software must be scalable to large datasets.

30 billion in year 2005
8
Homology search

Given two DNA sequences, find all local similar
regions, using edit distance (match1,
mismatch-1, gapopen-5, gapext-1).
Example. Input
E. coli genome 5 million base pairs
H. influenza genome 1.8 million base pairs
Output all local alignments (PH demo)
java jar phn.jar i ecoli.fna j
hinf.fna o out.txt b

9
Homology search vs Google search

Internet search
Size limit 5 billion people x homepage size
Supercomputing power used ½ million
CPU-hours/day
Query frequency Google --- 112 million/day
Query type exact keyword search --- easy to do
Homology search
Size limit 5 billion people x 3 billion
basepairs millions of species x billion bases
10 (?) of worlds supercomputing power
Query frequency NCBI BLAST -- 150,000/day, 15
increase/month
Query type approximate search

10
Bioinformatics Companies living on Blast

Paracel (Celera)
TimeLogic (recently liquidated)
TurboGenomics (TurboWorx)
NSF, NIH, pharmaceuticals proudly support many
supercomputing centers mainly for homology search
Hardware high cost become obsolete in 2 years.

11
History

Dynamic programming (1970-1980)
Full sensitivity, but slow.
Human vs mouse genomes 104 CPU-years
BLAST, FASTA heuristics (1980-1990)
Low sensitivity, still not fast enough.
Human vs mouse genomes 19 CPU-years
BLAST paper was referenced 100000 times
Modern technology full sensitivity greater
speed!

12
2. Old Technology

Dynamic programming full sensitivity homology
search
BLAST heuristics --- trading sensitivity with
time.

13
Dynamic Programming

Longest Common Subsequence (LCS).
Vv1v2 vn
Ww1w2 wm
s(i,j) length of LCS
of V1..i and W1..j
Dynamic Programming
s(i-1,j)
s(i,j)max s(i,j-1)
s(i-1,j-1)1,viwj

Sequence Alignment (Needleman-Wunsch, 1970)
s(i,j)max
s(i-1,j) d(vi,-)
s(i,j-1)d(-,wj)
s(i-1,j-1)d(vi,wj)
0 (Smith-Waterman, 1981)
No affine gap penalties,
where d, for proteins, is either PAM or BLOSUM
matrix. For DNA, it can be match 1, mismatch -1,
gap -3 (In fact open gap -5, extension -1.)
d(-,x)d(x,-) -a, d(x,y)-u. When a0,
uinfinity, it is LCS.

14
Misc. Concerns

Local sequence alignment, add si,j0.
Gap penalties. For good reasons, we charge first
gap cost a, and then subsequent continuous
insertions blta.
Space efficient sequence alignment. Hirschberg
alg. in O(n2) time, O(n) space.
Multiple alignment of k sequences O(nk)

15
Blast

Popular software, using heuristics. By Altschul,
Gish, Miller, Myers, Lipman, 1990.
E(xpected)-value e dmne -lS, here S is score, m
is database length and n is query length.
Meaning e is number of hits one can expect to
see just by chance when searching a database of
size m.
Basic Strategy For all 3 a.a. (and closely
related) triples, remember their locations in
database. Given a query sequence S. For all
triples in S, find their location in database,
then extend as long as e-value significant.
Similar strategy for DNA (7-11 nucleotides). Too
slow in genome level.

16
Blast Algorithm

Find seeded (11 bases) matches
Extent to HSPs (High Scoring Pairs)
Gapped Extension, dynamic programming
Report all local alignments

17
BLAST Algorithm Example

Find seeded matches of 11 base pairs
Extend each match to right and left, until the
scores drop too much, to form an alignment
Report all local alignments

Example AGCGATGTCACGCGCCCGTATTTCCGTA
TCGGATCTCACGCGCCCGGCTTACCGTG
0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0
1 1 0 1 1 1 1 0
G

x
18
BLAST Dilemma

If you want to speed up, have to use a longer
seed. However, we now face a dilemma
increasing seed size speeds up, but loses
sensitivity
decreasing seed size gains sensitivity, but loses
speed.
How do we increase sensitivity speed
simultaneously? For 20 years, many tried suffix
tree, better programming ..

19
3. Optimized Spaced Seeds

Why are they better
How to compute them
Data models, coding regions, HMM
PatternHunter
Multiple spaced seeds
Vector seeds

20
New thinking

Three lines of existing approaches
Smith-Waterman exhaustive dynamic programming.
Blast family find seed matches (11 for Blastn,
28 for MegaBlast), then extend. Dilemma
increasing seed size speeds up but loses
sensitivity descreasing seed size gains
sensitivity but loses speed.
Suffix tree MUMmer, Quasar, REPuter. Only good
for precise matches, highly similar sequences.
Blast approach is the only way to deal with real
and large problems, but we must to solve Blast
dilemma We need a way to improve sensitivity and
speed simultaneously.
Goals Improve (a) Blast, (b) Smith-Waterman.

21
Optimal Spaced Seed(Ma, Tromp, Li
Bioinformatics, 183, 2002, 440-445)

Spaced Seed nonconsecutive matches and optimized
match positions.
Represent BLAST seed by 11111111111
Spaced seed 11111111111
1 means a required match
means dont care position
This seemingly simple change makes a huge
difference significantly increases hit prob. to
homologous region while reducing bad hits.

22
Formalization

Given i.i.d. sequence (homology region) with
Pr(1)p and Pr(0)1-p for each bit
1100111011101101011101101011111011101
Which seed is more likely to hit this region
BLAST seed 11111111111
Spaced seed 11111111111

11111111111
23
Sensitivity PH weight 11 seed vs Blast 11 10
24
PH 2-hit sensitivity vs Blastn 11, 12 1-hit
25
Expect Less, Get More

Lemma The expected number of hits of a weight W
length M seed model within a length L region with
similarity p is
(L-M1)pW
Proof The expected number of hits is the sum,
over the L-M1 possible positions of fitting the
seed within the region, of the probability of W
specific matches, the latter being pW.
Example In a region of length 64 with 0.7
similarity, PH has probability of 0.466 to hit vs
Blast 0.3, 50 increase. On the other hand, by
above lemma, Blast expects 1.07 hits, while PH
0.93, 14 less.

26
Why Is Spaced Seed Better?

A wrong, but intuitive, proof seed s, interval
I, similarity p
E(hits) Pr(s hits) E(hits s hits)
Thus
Pr(s hits) Lpw / E(hits s hits)
For optimized spaced seed, E(hits s hits)
11111111111 Non overlap
Prob
11111111111 6
p6
11111111111 6
p6
11111111111 6
p6
11111111111 7
p7
..
For spaced seed the divisor is 1p6p6p6p7
For BLAST seed the divisor is bigger 1 p p2
p3

27
Finding Optimal Spaced Seeds(Keich, Li, Ma,
Tromp, Discrete Appl. Math. 138(2004), 253-263 )

Let f(i,b) be the probability that seed s hits
the length i prefix of R that ends with b.
Thus, if s matches b, then
f(i,b) 1,
otherwise we have the recursive relationship
f(i,b) (1-p)f(i-1,0b') pf(i-1,1b')
where b' is b deleting the last bit.
Then the probability of s hitting R is,
SbM Prob(b) f(R,b).
Choose s with the highest probability.

28
Improvements

Brejova-Brown-Vinar (HMM) and Buhler-Keich-Sun
(Markov) The input sequence can be modeled by a
(hidden) Markov process, instead of iid.
Multiple seeds
Brejova-Brown-Vinar Vector seeds
Csuros Variable length seeds e.g. shorter
seeds for rare query words.

29
PatternHunter(Ma, Tromp, Li Bioinformatics,
183, 2002, 440-445)

PH used optimal spaced seeds, novel usage of data
structures red-black tree, queues, stacks,
hashtables, new gapped alignment algorithm.
Written in Java. Yet many times faster than
BLAST.
Used in Mouse/RAT Genome Consortia (Nature, Dec.
5, 2002), as well as in hundreds of institutions
and industry.

30
Comparison with BLAST

On Pentium III 700MH, 1GB
BLAST PatternHunter
E.coli vs H.inf 716s
14s/68M
Arabidopsis 2 vs 4 --
498s/280M
Human 21 vs 22 --
5250s/417M
Human(3G) vs Mouse(x39G) 19 years 20 days
All with filter off and identical parameters
16M reads of Mouse genome against Human genome
for MIT Whitehead. Best BLAST program takes 19
years at the same sensitivity

31
Quality Comparisonx-axis alignment
ranky-axis alignment scoreboth axes in
logarithmic scale
A. thaliana chr 2 vs 4
E. Coli vs H. influenza
32
(No Transcript)
33
(No Transcript)
34
(No Transcript)
35
Genome Alignment by PatternHunter (4 seconds)
36
Prior Literature

Over the years, it turns out, that the
mathematicians know about certain patterns appear
more likely than others for example, in a random
English text, ABC is more likely to appear than
AAA, in their study of renewal theory.
Random or multiple spaced q-grams were used in
the following work
FLASH by Califano Rigoutsos
Multiple filtration by Pevzner Waterman
LSH of Buhler
Praparata et al

37
Coding Region HMM

The Hidden Markov Model is a finite set
of states, each of which is associated with a
probability distribution. Transitions among the
states are governed by a set of probabilities
called transition probabilities. In a particular
state an outcome or observation can be generated,
according to the associated probability
distribution. It is only the outcome, not the
state, which is visible to an external observer
and therefore states are hidden'' from the
outside hence the name Hidden Markov Model. An
HMM has the following components satisfying usual
probability laws
The number of states of the model, N.
The number of observation symbols in the
alphabet, M.
A set of state transition probabilities,
depending on current state.
A probability distribution of the observable
symbol at each of the states.

38
Modeling coding region with HMM

PatternHunter original assumption I is a
sequence of N independent Bernoulli variables X0,
XN-1 with P(Xi)p.
Coding region third position usually is less
conserved. Skipping it should be a good idea
(BLAT 110110110). With spaced seeds, we can model
it using HMM.
Brejova, Brown, Vinar M(3) HMM I is a sequence
of N independent Bernoulli random variables
X0,X1, XN-1 where Pr(Xi1)pi mod 3. In
particular
p0
p1 p2
human/mouse 0.82 0.87 0.61
human/fruit fly 0.67 0.77 0.40
DP algorithm naturally extends to M(3),
Opt seed (weight 8) for coding region
11011011000011

39
Picture of M(3)
Extending M(3), Brejova-Brown-Vinar proposed
M(8), above, to model dependencies among
positions within codons Spijk 1, each codon
has conservation pattern ijk with probability
pijk.
Figures from Brejova-Brown-Vinar
40
Sensitivity of all seeds Under 4 models
From Brejova-Brown-Vinar
41
Running PHDownload at www.BioinformaticsSolution
s.com

java jar ph.jar i query.fna j subject.fna o
out.txt b multi 4
-Xmx512m --- for large files
-j missing query.fna self-comparison
-db multiple sequence input, 0,1,2,3 (no, query,
subject, both)
-W seed weight
-G open gap penalty (default 5)
-E gap extension (default 1)
-q mismatch penalty (default 1)
-r reward for match (default 1)
-model specify model in binary
-H hits before extension
-P show progress
-b Blast output format
-multi number of seeds to be used (default 1
max 16)

42
Multiple Spaced Seeds Approaching Smith-Waterman
Sensitivity(Li, Ma, Kisman, Tromp, PatternHunter
II, J. Bioinfo Comput. Biol. 2004)

The biggest problem for Blast was low sensitivity
(and low speed). Massive parallel machines are
built to do Smith-Waterman exhaustive dynamic
programming.
Spaced seeds give PH a unique opportunity of
using several optimal seeds to achieve optimal
sensitivity, this was not possible for Blast
technology.
Economy --- 2 seeds (½ time slow down) achieve
sensitivity of 1 seed of shorter length (4 times
speed).
With multiple optimal seeds. PH II approaches
Smith-Waterman sensitivity, and 3000 times
faster.
Finding optimal seeds, even one, is NP-hard.
Experiment 29715 mouse EST, 4407 human EST.
Sensitivity and speed comparison next two slides.

43
Finding Multiple Seeds (PH II)

Computing the hit probability of given k seeds on
a uniformly distributed random region is NP-hard.
Finding a set of k optimal seeds to maximize the
hit probability cannot be approximated within
1-1/e unless NPP
Given k seeds, algorithm DP can be adapted to
compute their sensitivity (probability one of the
seeds has a hit).
Greedy Strategy
Compute one optimal seed a. Sa
Given S, find b maximizing hit probability of S U
b.
Coding region, use M(3), 0.8, 0.8, 0.5 (for mod 3
positions)
We took 12 CPU-days on 3GHz PC to compute 16
weight 11 seeds. General seeds (first 4)
111010010100110111, 111100110010100001011,
11010000110001010111, 1110111010001111.

44
Sensitivity Comparison with Smith-Waterman (at
100) The thick dashed curve is the sensitivity
of Blastn, seed weight 11. From low to high,
the solid curves are the sensitivity of PH II
using 1, 2, 4, 8 weight 11 coding region seeds,
and the thin dashed curves are the sensitivity
1, 2, 4, 8 weight 11 general purpose seeds,
respectively
45
Speed Comparison with Smith-Waterman

Smith-Waterman (SSearch) 20 CPU-days.
PatternHunter II with 4 seeds 475 CPU-seconds.
3638 times faster than Smith-Waterman dynamic
programming at the same sensitivity.

46
Vector Seeds

Definition. A vector seed is an ordered pair
Q(w,T), where w is a weight vector (w1, wM)
and T is a threshold value. An alignment sequence
x1,,xn contains a hit of the seed Q at position
k if
?i1..M(wixki-1) T
The number of nonzero positions in w is called
the support of the seed.
Comments. Other seeds are special cases of vector
seeds.

47
Experiment (Brejova PhD. thesis, 2005)

Predicted sensitivity and false positive rate
(prob. hit 0.25-homology region of seed length)
comparison of various types of seeds in a
Bernoulli region of 0.7 match probability, length
64.
--------------------------------------------------
---------------------------------------
Name Weight vector T
Support Sensitivity False positive rate
BLAST 1111111111 10
10 41 9.5x10-7
PH-10 1110101100011011 10 10
60 9.5x10-7
BLAST-13-15 111111111111111 13 15
73 9.2x10-7
VS-13-15 111111011011101111 13 15
85 9.2x10-7
VS-12-13 111101100110101111 12 13
74 6.0x10-7
VS-11-12 111011001011010111 11 12
84 2.2x10-6

48
Variable weight spaced seeds

M. Csuros proposal use variable weighted spaced
seeds depending on the composition of the strings
to be hashed. Rarer strings are expected to
produce fewer false positives, use seeds with
smaller weight --- this increases the sensitivity.

49
Open Question

Can we use different patterns (of same weight) at
different positions. So this can be called
variable position-spaced seeds. At the same
weight, we do not increase expected number of
false positive hits. Can we increase sensitivity?
Find these patterns? How much increase will there
be? Practical enough?
If the above is not the case, then prove single
optimal seed is always better than variable
positions.

50
Old field, new trend

Research trend
Dozens of papers on spaced seeds have appeared
since the original PH paper, in 3 years.
Many more have used PH in their work.
Most modern alignment programs (including BLAST)
have now adopted spaced seeds
Spaced seeds are serving thousands of users/day
PatternHunter direct users
Pharmaceutical/biotech firms.
Mouse Genome Consortium, Nature, Dec. 5, 2002.
Hundreds of academic institutions.

51
4. Mathematical Theory of spaced seeds

Why they are better Spaced seeds hits first
Asymptotic sensitivity is there a best seed in
absolute sense?
NP hardness of computing sensitivity
Approximation algorithm for sensitivity

52
Why are the spaced seeds better

Theorem 1. Let I be a homologous region, homology
level p. For any sequence 0i0 lt i1 lt in-1
I-s, let Aj be the event a spaced seed s hits
I at ij, Bj be event the consecutive seed hits I
at j. (Keich, Li, Ma, Tromp, Discrete Appl. Math.
138(2004), 253-263 )
(1) P(Ujltn Aj) P(Ujltn Bj) When
ijj, gt holds.
Proof. Induction on n. For n1, P(A0)pWP(B0).
Assume the theorem holds for nN, we prove it
holds for N1. Let Ek denote the event that, when
putting a seed at I0, 0th, 1st k-1st 1s in
the seed all match the k-th 1 mismatches.
Clearly, Ek is a partition of sample space for
all k0, .. , W, and P(Ek)pk(1-p) for any seed.
(Note, Ek for Ai and Ek for Bi have different
indices they are really different events.) It
is sufficient to show, for each kW
(2) P(Uj0..N AjEk)P(Uj0..NBjEk)
When kW, both sides of (2) equal to 1. For
kltW since (UjkBj)nEkF and Bk1,Bk2, BN are
mutually independent of Ek, we have
(3) P(Uj0..NBjEk)P(Ujk1..NBj)
Now consider the first term in (2). For each
k in 0, W-1, at most k1 of the events Aj
satisfy AjnEkF because AjnEkF iff when aligned
at ij seed s has a 1 bit at overlapping k-th bit
when the seed was at 0 there are at most k1
choices. Thus, there exist indices 0ltmk1ltmk2
ltmN N such that AmjnEk?F. Since Ek means all
previous bits matched (except for k-th), it is
clear that Ek is non-negatively correlated with
Ujk1..N Amj, thus
(4) P(Uj0..N
AjEk)P(Ujk1..N AmjEk)P(Ujk1..NAmj)
By the inductive hypothesis, we have
P(Ujk1..N Amj)
P(Ujk1..NBj)
Combined with (3),(4), this proves (2),
hence the part of (1) of the theorem.

53
To prove when ijj, P(Uj0..n-1Aj)gtP(Uj0..n-1Bj)
--- (5)

We prove (5) by induction on n. For n2, we have
P(Uj0,1Aj)2pw p2w-Shift1gt2pw-pw1P
(Uj0,1Bj) ()
where shift1number of overlapped bits of the
spaced seed s when shifted to the right by 1 bit.
For inductive step, note that the proof of (1)
shows that for all k0,1, ,W,
P(Uj0..n-1AjEk)P(Uj0..n-1BjEk). Thus to
prove (5) we only need to prove that
P(Uj0..n-1AjE0)gtP(Uj0..n-1BjE0).
The above follows from the inductive hypothesis
as follows
P(Uj0..n-1AjE0)P(Uj1..n-1Aj)gtP(Uj1..n-1Bj)P
(Uj0..n-1BjE0).

54
Corollary

Corollary Assume I is infinite. Let ps and pc be
the first positions a spaced seed and the
consecutive seed hit I, respectively. Then Eps
lt Epc.
Proof. EpsSk0..8 kP(psk)
Sk0..8 kP(psgtk-1)-P(psgtk)
Sk0..8P(psgtk)
Sk0..8(1-P(Uj0..kAj))
ltSk0..8(1-P(Uj0..kBj))
Epc.

55
Asymptotic sensitivity(Buhler-Keich-Sun, JCSS
2004, Li-Ma-Zhang, SODA2006)

Theorem. Given a spaced seed s, the asymptotic
hit probability on a homologous region R can be
computed, up to a constant factor, in time
exponentially proportional to s, but
independent of R.
Proof. Let d be one plus the length of the
maximum run of 0's in s. Let M'sd. Consider
an infinite iid sequence RR1R2 ... ,
Prob(Ri1)p.
Let T be set of all length s strings that do
not dominate s (not matched by s).
Let Zi,j denote the event that s does not hit
at any of the positions Ri, Ri1, , Rj.
For any ti ? T, let
xi(n) PZ1,n
Rn..ns-1 ti .
That is, xi(n) is the no-hit probability in the
first n positions when they end with ti. We now
establish a relationship between xi(n) and
xi(n-M) . Let
Ci,j PZn-M'1,n ti at n,
tj at n-M' x Ptj at n-M' ti at n.
Thus Ci,j is the probability of generating tj at
position n-M' (from ti at n) times the
probability not having a hit in a region of
length sM' beginning with tj and ending with
ti. Especially Ci,j ? 0 for any nontrivial seed
of length greater than 0 (because of M with d
spacing).

56
Proof Continued

Then,
xi(n) ?j1K PZ1,n-M' tj at n-M'
x Ptj at n-M' ti at n x
PZn-M'1,n ti at n, tj at n-M'
?j1K PZ1,n-M' tj at n-M' x
Ci,j
?j1K Ci,j xj(n-M)
Define the transition matrix C(Ci,j). It is a K
x K matrix. Let x(n) (x1(n), x2(n), , xK(n)).
Then, assuming that M' divides n,
x(n) x(1) ? (CT)n/M.
Because Ci,j gt 0 for all i,j, C is a positive
matrix. The row sum of the i-th row is the
probability that a length-(sM') region ending
with ti does not have a hit. The row sum is lower
bounded by (1-p)d (1-pW ), where (1-p)d is the
probability of generating d 0's and 1-pW is the
probability of generating a string in T. It is
known that the largest eigenvalue of C is
positive and unique, and is lower bounded by the
smallest row sum and upper bounded the largest
row sum of C. Let ?1gt0 be the largest eigenvalue,
and ?2, 0 lt ?2 lt ?1 be the second largest.
There is a unique eigenvector corresponding to
?1.
x(n) / x(n) x(1) ? (CT
)n/M / x(1) ? (CT)n/M
converges to the eigenvector corresponding
to ?1. As ?1n tends to zero, we can use standard
techniques to normalize xn as
y(n) x(n) / x(n)2
and then x(nM) C y(n) .

Proof Continued
Then (by power method) the Rayleigh quotient of
? (y(n))T C y(n) / (y(n))T y(n)
(y(n))T x(nM)
converges to ?1. The convergence speed depends on
the ratio ?1/ ?2, it is known that we can upper
bound the second largest eigenvalue ?2 in a
positive matrix by
?2 ?1 (K-1)/(K1)
where K maxi,j,k,l ? (Ci,j Ck,l / Ci,l Ck,j).
For any i,j, we have
pa (1-p)b (1-p)d
Ci,j pa (1-p)b
where a is the number of 1's in tj, b the number
of 0's in tj. K O(1/(1-p)d ).
As (1 - 1/K)K goes to 1/e, the convergence can be
achieved in O(K) O(1/(1-p)d ) steps. The time
complexity is therefore upper bounded by an
exponential function in s.
QED

58
Theorem 2. It is NP hard to find the optimal
seed(Li, Ma, Kisman, Tromp, J. Bioinfo Comput.
Biol. 2004)

Proof. Reduction from 3-SAT. Given a 3-SAT
instance C1 Cn, on Boolean variables x1, ,
xm, where Ci l1 l2 l3, each l is some xi or
its complement. We reduce this to seed selection
problem. The required seed has weight Wm2 and
length L2m2, and is intended to be of the form
1(0110)m1, a straightforward variable assignment
encoding. The i-th pair of bits after the initial
1 is called the code for variable xi. A 01 code
corresponds to setting xi true, while a 10 means
false. To prohibit code 11, we introduce, for
every 1 i m, a region
Ai1 (11)i-101(11)m-i 10m1
1(11)i-110(11)m-i 1.
Since the seed has m 0s, it cannot bridge the
middle 0m1 part, and is thus forced to hit at
the start or at the end. This obviously rules out
code 11 for variable xi. Code 00 automatically
becomes ruled out as well, since the seed must
distribute m 1s among all m codes. Finally, for
each clause Ci, we introduce
Ri 1(11)a-1 c1(11)m-a 1 000
1(11)b-1c2(11)m-b 1 000 1(11)c-1c3(11)m-c 1.
The total length of Ri is (2m2)3(2m2)3(2m2)
6m12 it consists of three parts each
corresponding to a literal in the constraint,
with ci being the code for making that literal
true.
Thus a seed hits Ri iff the assignment encoded by
the seed satisfies Ci .

59
The complexity of computing spaced seed
sensitivity

Theorem. It is also NP hard to compute
sensitivity of a given seed. It remains to be NPC
even in a uniform region, at any homology level
p.
Proof.
Very complicated. Omitted.

60
PTAS for computing sensitivity of a seed

Trivial alg. Sample R poly times, compute the
frequency s hits. Use that to approximate the
real probability.
When p is large, then using Chernoff bound, we
have a PTAS.
But when p is very small, or seed weight is very
large, hitting probability is actually very
small, then we do not have a PTAS.

61
Efficient approximation of the hit probability of
a seed

Theorem. Let the hit probability of s be x. For
any egt0, let N 6R2log R / e2. Then with high
probability, in polynomial time and N samples, we
can output a value y such that y-x ex.

Proof. We give a transformation to transform the
low probability event s hits R'' dependent on p
and weight of s to several events with
probability sum 1, independent of p and seed
weight W. Let sr be the reverse string of s.
Ps hits R Psr hits R
?i0R-s Psr hits at i, but misses at
0, , i-1
?i0R-s Psr hits at i x Psr misses
0, , i-1 sr hits at i
?i0R-s pW x Ps misses 1 i s hits
0 (1)
Because Ps hits R pW, (1) induces that
?i0R-s Ps misses 1 i s hits 0 1
Now, we can do efficient sampling, and using
Chernoff bounds, the rest of the proof is
standard. QED

63
Expected hit distance

Corollary. E(second hit position s hits at 0)
p-W.
Proof.
?i0R-s Ps misses 1 i s hits 0
?i0R-s ?ji1R-s1 Ps second hits j
s hits 0
?j0R-s1 ?i0j-1 Ps second hits j s
hits 0
?i0R-s1 (j-1)Ps second hits j s hits
0
Combining with (1), we have
Ps hits RxpW?i0R-s1 (j-1)Ps second hits
js hits 0
Letting R ? 8, the left hand is p-W, the right
is
E(second hit position s hits at 0).
QED

64
Chernoff bound

Consider a sequence of n independent tosses of a
coin with heads prob. p. The expected number of
heads is mnp. What is the probability that the
number of heads deviates a lot from this? Let X
be the number of successes, then
PrX-mem 2e-eem/3

65
Complexity of finding the optimal spaced seeds

Theorem 1 Li-Ma. Given a seed and it is NP-hard
to find its sensitivity, even in a uniform
region.
Theorem 2 Li-Ma. The sensitivity of a given
seed can be efficiently approximated by a PTAS.
Theorem 3. Given a set of seeds, choose k best
can be approximated with ratio 1- 1/e.
Theorem 4 Buhler-Keich-Sun, Li-Ma The
asymptotic hit probability is computable in
exponential time in seed length, independent of
homologous region length.
Theorem 5 L. Zhang If the length of a spaced
seed is not too long, then it strictly
outperforms consecutive seed, in asymptotic hit
probability.

5. Applications beyond bioinformatics

67
Stock Market PredictionZou-Deng-Li Detecting
Market Trends by Ignoring It, Some Days, March,
2005

A real gold mine 4.6 billion dollars are traded
at NYSE daily.
Buy low, sell high.
Best thing God tells us market trend
Next best thing A good market indicator.
Essentially, a buy indicator must be
Sensitive when the market rises
Insensitive otherwise.

68
My goal

Provide a sensitive, but not aggressive market
trend indicator.
Learning/inference methods or complete and
comprehensive systems are beyond this study. They
can be used in conjunction with our proposal.

69
Background

Hundreds of market indicators are used (esp. in
automated systems). Typical
Common sense if the past k days are going up,
then the market is moving up.
Moving average over the last k days. When the
average curve and the price curve intersect,
buy/sell.
Special patterns a wedge, triangle, etc.
Volume
Hundreds used in automated trading systems.
Note any method will make money in the right
market, lose money in the wrong market

70
Problem Formalization

The market movement is modeled as a 0-1 sequence,
one bit per day, with 0 meaning market going
down, and 1 up.
S(n,p) is an n day iid sequence where each bit
has probability p being 1 and 1-p being 0. If
pgt0.5, it is an up market
Ik1k is an indicator that the past k days are
1s.
I8 has sensitivity 0.397 in S(30,0.7), too
conservative
I8 has false positive rate 0.0043 in S(100, 0.3).
Good
Iij is an indicator that there are i 1s in last
j days.
I811 has high sensitivity 0.96 in S(30,0.7)
But it is too aggressive at 0.139 false positive
rate in S(100, 0.3).
Spaced seeds 111111111 and 111111111 combine
to
have sensitivity 0.49 in S(30,0.7)
False positive rate 0.0032 in S(100, 0.3).
Consider a betting game A player bets a number
k. He wins k dollars for a correct prediction and
o.w. loses k dollars. We say an indicator A is
better than B, AgtB, if A always wins more and
loses less.

71
Sleeping on Tuesdays and Fridays

Spaced seeds are beautiful indicators they are
sensitive when we need them to be sensitive and
not sensitive when we do not want them to be.

11111111 always beats I811 if it bets 4
dollars for each dollar I811 bets. It is gtI8
too.
72
Two spaced seeds
Observe two spaced Seeds curve vs I8, the spaced
seeds are always more sensitive in pgt0.5 region,
and less sensitive when plt0.5
73
Two experiments

We performed two trading experiments
One artificial
One on real data (SP 500, Nasdaq indices)

74
Experiment 1 Artificial data

This simple HMM generates a very artificial
simple model
5000 days (bits)
Indicators I7, I711, 5 spaced seeds.
Trading strategy if there is a hit, buy, and
sell 5 days later.
Reward is (1)-(0) in that 5 days times the
betting ratio

75
Results of Experiment 1.

R Hits Final MTM
Bankrupcies
I71111111 30 12 679
16
I711 15 47 916
14
5 Spaced seeds 25 26 984
13

76
Experiment 2.

Historical data of SP 500, from Oct 20, 1982 to
Feb. 14, 2005 and NASDAQ, from Jan 2, 85 to Jan
3, 2005 were downloaded from Yahoo.com.
Each strategy starts with 10,000 USD. If an
indicator matches, use all the money to buy/sell.
While in no ways this experiment says anything
affirmatively, it does suggest that spaced
patterns can serve a promising basis for a useful
indicator, together with other parameters such as
trade volume, natural events, politics,
psychology.

77
(No Transcript)
78
Open Questions

I have presented a simple idea of optimized
spaced seeds and its applications in homology
search and time series prediction.
Open questions current research
Complexity of finding (near) optimal seed, in a
uniform region. Note that this is not an NP-hard
question.
Tighter bounds on why spaced seeds are better.
Applications to other areas. Apparently, the same
idea works for any time series.
Model financial data
Can we use varying patterns instead of one seed?
When patterns vary at the same weight, we still
have same number of expected hits. However, can
this increase sensitivity? One seed is just a
special case of this.

79
Acknowledgement

PH is joint work with B. Ma and J. Tromp
PH II is joint work with Ma, Kisman, and Tromp
Some joint theoretical work with Ma, Keich,
Tromp, Xu, Brown, Zhang.
Financial market prediction J. Zou, X. Deng
Financial support Bioinformatics Solutions Inc,
NSERC, Killam Fellowship, CRC chair program, City
University of Hong Kong.

80
Here is an example of a BLASTP match (E-value 0)
between gene 0189 in C. pneumoniae and gene 131
in C. trachomatis. Query CPn0189
Score
(bits) E-value Aligned with CT131 hypothetical
protein 1240
0.0 Query 1 MKRRSWLKILGICLGSSIVLGFLIFLPQLLSTE
SRKYLVFSL I HKESGLSCSAEELKISW 60
MKR W KI G L L L LP SES KYL
SKEGL EL SW Sbjct 1
MKRSPWYKIFGYYLLVGVPLALLALLPKFFSSESGKYLFLSVLNKETGLQ
F EIEQLHLSW 60 Query 61 FGRQTARKIKLTG-EAKDEVFS
AEKFELDGSLLRLL I YKKPKGITLSGWSLKINEPASID 119
FG QTAKI G EFAEK GSL
RLLY PK TLGWSLIE S Sbjct 61
FGSQTAKKIRIRGIDSDSEIFAAEKI IVKGSLPRLLL
YRFPKALTLTGWSLQIDESLSMN 120 Etc. Note Because of
powerpoint character alignment problems, I
inserted some white space to make it look more
aligned.
81
Summary and Open Questions

Simple ideas 0 created SW, 1 created Blast. 1
0 created PatternHunter.
Good spaced seeds help to improve sensitivity and
reduce irrelevant hit.
Multiple spaced seeds allow us to improve
sensitivity to approach Smith-Waterman, not
possible with BLAST.
Computing spaced seeds by DP, while it is NP
hard.
Proper data models help to improve design of
spaced seeds (HMM)
Open Problems (a) A mathematical theory of
spaced seeds. I.e. quantitative statements for
Theorem 1.
(b) Applications to other fields.

82
Smaller (more solvable) questions (perhaps good
for course projects)

Multiple protein seeds for Smith-Waterman
sensitivity?
Applications in other fields, like internet
document search?
Prove Theorem 1 for other models for example,
M(3) model 0.8,0.8,0.5 together with specialized
seeds.
More extensive study of Problem 3.

Write a Comment

User Comments (0)

About PowerShow.com

Modern Homology Search - PowerPoint PPT Presentation

Modern Homology Search

Modern Homology Search Ming Li Canada Research Chair in Bioinformatics Computer Science. U. Waterloo I want to show you how one simple theoretical idea can make big ... – PowerPoint PPT presentation