Title: Modern Homology Search
1Modern Homology Search
Ming Li Canada Research Chair in
Bioinformatics Computer Science. U. Waterloo
2- I want to show you how one simple theoretical
idea can make big impact to a practical field.
3Outline
- Motivation, market, biology, homology search
- Review dynamic programming, BLAST heuristics
- Optimized spaced seeds
- The idea
- How to compute them
- Data models, coding regions, HMM, vector seeds
- Multiple optimized spaced seeds
- PatternHunter program
- Mathematical theory of spaced seeds
- Why they are better
- Complexity of finding optimal spaced seeds
- Applications beyond bioinformatics Open problems
41. Introduction
- Biology
- Motivation and market
- Definition of homology search problem
5Biology
Phenotype
6Human 3 billion bases, 30k genes. E. coli 5
million bases, 4k genes
T
A
A
T
C
G
cDNA
T
A
reverse transcription
T
A
translation
transcription
C
G
Protein
mRNA
G
C
(20 amino acids)
(A,C,G,U)
G
C
Codon three nucleotides encode an
amino acid. 64 codons 20 amino acids, some w/more
codes
A
T
C
G
T
A
A
T
7 A gigantic gold mine
- The trend of genetic data growth
- 400 Eukaryote genome projects underway
- GenBank doubles every 18 months
- Comparative genomics ? all-against-all search
- Software must be scalable to large datasets.
30 billion in year 2005
8Homology search
- Given two DNA sequences, find all local similar
regions, using edit distance (match1,
mismatch-1, gapopen-5, gapext-1). - Example. Input
- E. coli genome 5 million base pairs
- H. influenza genome 1.8 million base pairs
- Output all local alignments (PH demo)
- java jar phn.jar i ecoli.fna j
hinf.fna o out.txt b
9Homology search vs Google search
- Internet search
- Size limit 5 billion people x homepage size
- Supercomputing power used ½ million
CPU-hours/day - Query frequency Google --- 112 million/day
- Query type exact keyword search --- easy to do
- Homology search
- Size limit 5 billion people x 3 billion
basepairs millions of species x billion bases - 10 (?) of worlds supercomputing power
- Query frequency NCBI BLAST -- 150,000/day, 15
increase/month - Query type approximate search
10Bioinformatics Companies living on Blast
- Paracel (Celera)
- TimeLogic (recently liquidated)
- TurboGenomics (TurboWorx)
- NSF, NIH, pharmaceuticals proudly support many
supercomputing centers mainly for homology search - Hardware high cost become obsolete in 2 years.
11History
- Dynamic programming (1970-1980)
- Full sensitivity, but slow.
- Human vs mouse genomes 104 CPU-years
- BLAST, FASTA heuristics (1980-1990)
- Low sensitivity, still not fast enough.
- Human vs mouse genomes 19 CPU-years
- BLAST paper was referenced 100000 times
- Modern technology full sensitivity greater
speed!
122. Old Technology
- Dynamic programming full sensitivity homology
search - BLAST heuristics --- trading sensitivity with
time.
13Dynamic Programming
- Longest Common Subsequence (LCS).
- Vv1v2 vn
- Ww1w2 wm
- s(i,j) length of LCS
- of V1..i and W1..j
- Dynamic Programming
- s(i-1,j)
- s(i,j)max s(i,j-1)
- s(i-1,j-1)1,viwj
- Sequence Alignment (Needleman-Wunsch, 1970)
- s(i,j)max
- s(i-1,j) d(vi,-)
- s(i,j-1)d(-,wj)
- s(i-1,j-1)d(vi,wj)
- 0 (Smith-Waterman, 1981)
- No affine gap penalties,
- where d, for proteins, is either PAM or BLOSUM
matrix. For DNA, it can be match 1, mismatch -1,
gap -3 (In fact open gap -5, extension -1.) - d(-,x)d(x,-) -a, d(x,y)-u. When a0,
uinfinity, it is LCS.
14Misc. Concerns
- Local sequence alignment, add si,j0.
- Gap penalties. For good reasons, we charge first
gap cost a, and then subsequent continuous
insertions blta. - Space efficient sequence alignment. Hirschberg
alg. in O(n2) time, O(n) space. - Multiple alignment of k sequences O(nk)
15Blast
- Popular software, using heuristics. By Altschul,
Gish, Miller, Myers, Lipman, 1990. - E(xpected)-value e dmne -lS, here S is score, m
is database length and n is query length. - Meaning e is number of hits one can expect to
see just by chance when searching a database of
size m. - Basic Strategy For all 3 a.a. (and closely
related) triples, remember their locations in
database. Given a query sequence S. For all
triples in S, find their location in database,
then extend as long as e-value significant.
Similar strategy for DNA (7-11 nucleotides). Too
slow in genome level.
16Blast Algorithm
- Find seeded (11 bases) matches
- Extent to HSPs (High Scoring Pairs)
- Gapped Extension, dynamic programming
- Report all local alignments
17BLAST Algorithm Example
- Find seeded matches of 11 base pairs
- Extend each match to right and left, until the
scores drop too much, to form an alignment - Report all local alignments
Example AGCGATGTCACGCGCCCGTATTTCCGTA
TCGGATCTCACGCGCCCGGCTTACCGTG
0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0
1 1 0 1 1 1 1 0
G
x
18BLAST Dilemma
- If you want to speed up, have to use a longer
seed. However, we now face a dilemma - increasing seed size speeds up, but loses
sensitivity - decreasing seed size gains sensitivity, but loses
speed. - How do we increase sensitivity speed
simultaneously? For 20 years, many tried suffix
tree, better programming ..
193. Optimized Spaced Seeds
- Why are they better
- How to compute them
- Data models, coding regions, HMM
- PatternHunter
- Multiple spaced seeds
- Vector seeds
20New thinking
- Three lines of existing approaches
- Smith-Waterman exhaustive dynamic programming.
- Blast family find seed matches (11 for Blastn,
28 for MegaBlast), then extend. Dilemma
increasing seed size speeds up but loses
sensitivity descreasing seed size gains
sensitivity but loses speed. - Suffix tree MUMmer, Quasar, REPuter. Only good
for precise matches, highly similar sequences. - Blast approach is the only way to deal with real
and large problems, but we must to solve Blast
dilemma We need a way to improve sensitivity and
speed simultaneously. - Goals Improve (a) Blast, (b) Smith-Waterman.
21Optimal Spaced Seed(Ma, Tromp, Li
Bioinformatics, 183, 2002, 440-445)
- Spaced Seed nonconsecutive matches and optimized
match positions. - Represent BLAST seed by 11111111111
- Spaced seed 11111111111
- 1 means a required match
- means dont care position
- This seemingly simple change makes a huge
difference significantly increases hit prob. to
homologous region while reducing bad hits.
22Formalization
- Given i.i.d. sequence (homology region) with
Pr(1)p and Pr(0)1-p for each bit - 1100111011101101011101101011111011101
- Which seed is more likely to hit this region
- BLAST seed 11111111111
- Spaced seed 11111111111
11111111111
23Sensitivity PH weight 11 seed vs Blast 11 10
24PH 2-hit sensitivity vs Blastn 11, 12 1-hit
25Expect Less, Get More
- Lemma The expected number of hits of a weight W
length M seed model within a length L region with
similarity p is - (L-M1)pW
- Proof The expected number of hits is the sum,
over the L-M1 possible positions of fitting the
seed within the region, of the probability of W
specific matches, the latter being pW. - Example In a region of length 64 with 0.7
similarity, PH has probability of 0.466 to hit vs
Blast 0.3, 50 increase. On the other hand, by
above lemma, Blast expects 1.07 hits, while PH
0.93, 14 less.
26Why Is Spaced Seed Better?
- A wrong, but intuitive, proof seed s, interval
I, similarity p - E(hits) Pr(s hits) E(hits s hits)
- Thus
- Pr(s hits) Lpw / E(hits s hits)
- For optimized spaced seed, E(hits s hits)
- 11111111111 Non overlap
Prob - 11111111111 6
p6 - 11111111111 6
p6 - 11111111111 6
p6 - 11111111111 7
p7 - ..
- For spaced seed the divisor is 1p6p6p6p7
- For BLAST seed the divisor is bigger 1 p p2
p3
27Finding Optimal Spaced Seeds(Keich, Li, Ma,
Tromp, Discrete Appl. Math. 138(2004), 253-263 )
- Let f(i,b) be the probability that seed s hits
the length i prefix of R that ends with b. - Thus, if s matches b, then
- f(i,b) 1,
- otherwise we have the recursive relationship
- f(i,b) (1-p)f(i-1,0b') pf(i-1,1b')
- where b' is b deleting the last bit.
- Then the probability of s hitting R is,
- SbM Prob(b) f(R,b).
- Choose s with the highest probability.
28Improvements
- Brejova-Brown-Vinar (HMM) and Buhler-Keich-Sun
(Markov) The input sequence can be modeled by a
(hidden) Markov process, instead of iid. - Multiple seeds
- Brejova-Brown-Vinar Vector seeds
- Csuros Variable length seeds e.g. shorter
seeds for rare query words.
29PatternHunter(Ma, Tromp, Li Bioinformatics,
183, 2002, 440-445)
- PH used optimal spaced seeds, novel usage of data
structures red-black tree, queues, stacks,
hashtables, new gapped alignment algorithm. - Written in Java. Yet many times faster than
BLAST. - Used in Mouse/RAT Genome Consortia (Nature, Dec.
5, 2002), as well as in hundreds of institutions
and industry.
30Comparison with BLAST
- On Pentium III 700MH, 1GB
-
BLAST PatternHunter - E.coli vs H.inf 716s
14s/68M - Arabidopsis 2 vs 4 --
498s/280M - Human 21 vs 22 --
5250s/417M - Human(3G) vs Mouse(x39G) 19 years 20 days
- All with filter off and identical parameters
- 16M reads of Mouse genome against Human genome
for MIT Whitehead. Best BLAST program takes 19
years at the same sensitivity
31Quality Comparisonx-axis alignment
ranky-axis alignment scoreboth axes in
logarithmic scale
A. thaliana chr 2 vs 4
E. Coli vs H. influenza
32(No Transcript)
33(No Transcript)
34(No Transcript)
35Genome Alignment by PatternHunter (4 seconds)
36Prior Literature
- Over the years, it turns out, that the
mathematicians know about certain patterns appear
more likely than others for example, in a random
English text, ABC is more likely to appear than
AAA, in their study of renewal theory. - Random or multiple spaced q-grams were used in
the following work - FLASH by Califano Rigoutsos
- Multiple filtration by Pevzner Waterman
- LSH of Buhler
- Praparata et al
37Coding Region HMM
- The Hidden Markov Model is a finite set
of states, each of which is associated with a
probability distribution. Transitions among the
states are governed by a set of probabilities
called transition probabilities. In a particular
state an outcome or observation can be generated,
according to the associated probability
distribution. It is only the outcome, not the
state, which is visible to an external observer
and therefore states are hidden'' from the
outside hence the name Hidden Markov Model. An
HMM has the following components satisfying usual
probability laws - The number of states of the model, N.
- The number of observation symbols in the
alphabet, M. - A set of state transition probabilities,
depending on current state. - A probability distribution of the observable
symbol at each of the states.
38Modeling coding region with HMM
- PatternHunter original assumption I is a
sequence of N independent Bernoulli variables X0,
XN-1 with P(Xi)p. - Coding region third position usually is less
conserved. Skipping it should be a good idea
(BLAT 110110110). With spaced seeds, we can model
it using HMM. - Brejova, Brown, Vinar M(3) HMM I is a sequence
of N independent Bernoulli random variables
X0,X1, XN-1 where Pr(Xi1)pi mod 3. In
particular - p0
p1 p2 - human/mouse 0.82 0.87 0.61
- human/fruit fly 0.67 0.77 0.40
- DP algorithm naturally extends to M(3),
- Opt seed (weight 8) for coding region
11011011000011
39Picture of M(3)
Extending M(3), Brejova-Brown-Vinar proposed
M(8), above, to model dependencies among
positions within codons Spijk 1, each codon
has conservation pattern ijk with probability
pijk.
Figures from Brejova-Brown-Vinar
40Sensitivity of all seeds Under 4 models
From Brejova-Brown-Vinar
41Running PHDownload at www.BioinformaticsSolution
s.com
- java jar ph.jar i query.fna j subject.fna o
out.txt b multi 4 - -Xmx512m --- for large files
- -j missing query.fna self-comparison
- -db multiple sequence input, 0,1,2,3 (no, query,
subject, both) - -W seed weight
- -G open gap penalty (default 5)
- -E gap extension (default 1)
- -q mismatch penalty (default 1)
- -r reward for match (default 1)
- -model specify model in binary
- -H hits before extension
- -P show progress
- -b Blast output format
- -multi number of seeds to be used (default 1
max 16)
42Multiple Spaced Seeds Approaching Smith-Waterman
Sensitivity(Li, Ma, Kisman, Tromp, PatternHunter
II, J. Bioinfo Comput. Biol. 2004)
- The biggest problem for Blast was low sensitivity
(and low speed). Massive parallel machines are
built to do Smith-Waterman exhaustive dynamic
programming. - Spaced seeds give PH a unique opportunity of
using several optimal seeds to achieve optimal
sensitivity, this was not possible for Blast
technology. - Economy --- 2 seeds (½ time slow down) achieve
sensitivity of 1 seed of shorter length (4 times
speed). - With multiple optimal seeds. PH II approaches
Smith-Waterman sensitivity, and 3000 times
faster. - Finding optimal seeds, even one, is NP-hard.
- Experiment 29715 mouse EST, 4407 human EST.
Sensitivity and speed comparison next two slides.
43Finding Multiple Seeds (PH II)
- Computing the hit probability of given k seeds on
a uniformly distributed random region is NP-hard. - Finding a set of k optimal seeds to maximize the
hit probability cannot be approximated within
1-1/e unless NPP - Given k seeds, algorithm DP can be adapted to
compute their sensitivity (probability one of the
seeds has a hit). - Greedy Strategy
- Compute one optimal seed a. Sa
- Given S, find b maximizing hit probability of S U
b. - Coding region, use M(3), 0.8, 0.8, 0.5 (for mod 3
positions) - We took 12 CPU-days on 3GHz PC to compute 16
weight 11 seeds. General seeds (first 4)
111010010100110111, 111100110010100001011,
11010000110001010111, 1110111010001111.
44Sensitivity Comparison with Smith-Waterman (at
100) The thick dashed curve is the sensitivity
of Blastn, seed weight 11. From low to high,
the solid curves are the sensitivity of PH II
using 1, 2, 4, 8 weight 11 coding region seeds,
and the thin dashed curves are the sensitivity
1, 2, 4, 8 weight 11 general purpose seeds,
respectively
45Speed Comparison with Smith-Waterman
- Smith-Waterman (SSearch) 20 CPU-days.
- PatternHunter II with 4 seeds 475 CPU-seconds.
3638 times faster than Smith-Waterman dynamic
programming at the same sensitivity.
46Vector Seeds
- Definition. A vector seed is an ordered pair
Q(w,T), where w is a weight vector (w1, wM)
and T is a threshold value. An alignment sequence
x1,,xn contains a hit of the seed Q at position
k if - ?i1..M(wixki-1) T
- The number of nonzero positions in w is called
the support of the seed. - Comments. Other seeds are special cases of vector
seeds.
47Experiment (Brejova PhD. thesis, 2005)
- Predicted sensitivity and false positive rate
(prob. hit 0.25-homology region of seed length)
comparison of various types of seeds in a
Bernoulli region of 0.7 match probability, length
64. - --------------------------------------------------
--------------------------------------- - Name Weight vector T
Support Sensitivity False positive rate
- BLAST 1111111111 10
10 41 9.5x10-7 - PH-10 1110101100011011 10 10
60 9.5x10-7 - BLAST-13-15 111111111111111 13 15
73 9.2x10-7 - VS-13-15 111111011011101111 13 15
85 9.2x10-7 - VS-12-13 111101100110101111 12 13
74 6.0x10-7 - VS-11-12 111011001011010111 11 12
84 2.2x10-6
48Variable weight spaced seeds
- M. Csuros proposal use variable weighted spaced
seeds depending on the composition of the strings
to be hashed. Rarer strings are expected to
produce fewer false positives, use seeds with
smaller weight --- this increases the sensitivity.
49Open Question
- Can we use different patterns (of same weight) at
different positions. So this can be called
variable position-spaced seeds. At the same
weight, we do not increase expected number of
false positive hits. Can we increase sensitivity?
Find these patterns? How much increase will there
be? Practical enough? - If the above is not the case, then prove single
optimal seed is always better than variable
positions.
50Old field, new trend
- Research trend
- Dozens of papers on spaced seeds have appeared
since the original PH paper, in 3 years. - Many more have used PH in their work.
- Most modern alignment programs (including BLAST)
have now adopted spaced seeds - Spaced seeds are serving thousands of users/day
- PatternHunter direct users
- Pharmaceutical/biotech firms.
- Mouse Genome Consortium, Nature, Dec. 5, 2002.
- Hundreds of academic institutions.
514. Mathematical Theory of spaced seeds
- Why they are better Spaced seeds hits first
- Asymptotic sensitivity is there a best seed in
absolute sense? - NP hardness of computing sensitivity
- Approximation algorithm for sensitivity
52Why are the spaced seeds better
- Theorem 1. Let I be a homologous region, homology
level p. For any sequence 0i0 lt i1 lt in-1
I-s, let Aj be the event a spaced seed s hits
I at ij, Bj be event the consecutive seed hits I
at j. (Keich, Li, Ma, Tromp, Discrete Appl. Math.
138(2004), 253-263 ) - (1) P(Ujltn Aj) P(Ujltn Bj) When
ijj, gt holds. - Proof. Induction on n. For n1, P(A0)pWP(B0).
Assume the theorem holds for nN, we prove it
holds for N1. Let Ek denote the event that, when
putting a seed at I0, 0th, 1st k-1st 1s in
the seed all match the k-th 1 mismatches.
Clearly, Ek is a partition of sample space for
all k0, .. , W, and P(Ek)pk(1-p) for any seed.
(Note, Ek for Ai and Ek for Bi have different
indices they are really different events.) It
is sufficient to show, for each kW - (2) P(Uj0..N AjEk)P(Uj0..NBjEk)
- When kW, both sides of (2) equal to 1. For
kltW since (UjkBj)nEkF and Bk1,Bk2, BN are
mutually independent of Ek, we have - (3) P(Uj0..NBjEk)P(Ujk1..NBj)
- Now consider the first term in (2). For each
k in 0, W-1, at most k1 of the events Aj
satisfy AjnEkF because AjnEkF iff when aligned
at ij seed s has a 1 bit at overlapping k-th bit
when the seed was at 0 there are at most k1
choices. Thus, there exist indices 0ltmk1ltmk2
ltmN N such that AmjnEk?F. Since Ek means all
previous bits matched (except for k-th), it is
clear that Ek is non-negatively correlated with
Ujk1..N Amj, thus - (4) P(Uj0..N
AjEk)P(Ujk1..N AmjEk)P(Ujk1..NAmj) - By the inductive hypothesis, we have
- P(Ujk1..N Amj)
P(Ujk1..NBj) - Combined with (3),(4), this proves (2),
hence the part of (1) of the theorem.
53To prove when ijj, P(Uj0..n-1Aj)gtP(Uj0..n-1Bj)
--- (5)
- We prove (5) by induction on n. For n2, we have
- P(Uj0,1Aj)2pw p2w-Shift1gt2pw-pw1P
(Uj0,1Bj) () - where shift1number of overlapped bits of the
spaced seed s when shifted to the right by 1 bit.
- For inductive step, note that the proof of (1)
shows that for all k0,1, ,W,
P(Uj0..n-1AjEk)P(Uj0..n-1BjEk). Thus to
prove (5) we only need to prove that - P(Uj0..n-1AjE0)gtP(Uj0..n-1BjE0).
- The above follows from the inductive hypothesis
as follows - P(Uj0..n-1AjE0)P(Uj1..n-1Aj)gtP(Uj1..n-1Bj)P
(Uj0..n-1BjE0).
54Corollary
- Corollary Assume I is infinite. Let ps and pc be
the first positions a spaced seed and the
consecutive seed hit I, respectively. Then Eps
lt Epc. - Proof. EpsSk0..8 kP(psk)
- Sk0..8 kP(psgtk-1)-P(psgtk)
- Sk0..8P(psgtk)
- Sk0..8(1-P(Uj0..kAj))
- ltSk0..8(1-P(Uj0..kBj))
- Epc.
55Asymptotic sensitivity(Buhler-Keich-Sun, JCSS
2004, Li-Ma-Zhang, SODA2006)
- Theorem. Given a spaced seed s, the asymptotic
hit probability on a homologous region R can be
computed, up to a constant factor, in time
exponentially proportional to s, but
independent of R. - Proof. Let d be one plus the length of the
maximum run of 0's in s. Let M'sd. Consider
an infinite iid sequence RR1R2 ... ,
Prob(Ri1)p. - Let T be set of all length s strings that do
not dominate s (not matched by s). - Let Zi,j denote the event that s does not hit
at any of the positions Ri, Ri1, , Rj.
For any ti ? T, let - xi(n) PZ1,n
Rn..ns-1 ti . - That is, xi(n) is the no-hit probability in the
first n positions when they end with ti. We now
establish a relationship between xi(n) and
xi(n-M) . Let - Ci,j PZn-M'1,n ti at n,
tj at n-M' x Ptj at n-M' ti at n. - Thus Ci,j is the probability of generating tj at
position n-M' (from ti at n) times the
probability not having a hit in a region of
length sM' beginning with tj and ending with
ti. Especially Ci,j ? 0 for any nontrivial seed
of length greater than 0 (because of M with d
spacing).
56Proof Continued
- Then,
- xi(n) ?j1K PZ1,n-M' tj at n-M'
- x Ptj at n-M' ti at n x
PZn-M'1,n ti at n, tj at n-M' - ?j1K PZ1,n-M' tj at n-M' x
Ci,j - ?j1K Ci,j xj(n-M)
- Define the transition matrix C(Ci,j). It is a K
x K matrix. Let x(n) (x1(n), x2(n), , xK(n)).
Then, assuming that M' divides n, - x(n) x(1) ? (CT)n/M.
- Because Ci,j gt 0 for all i,j, C is a positive
matrix. The row sum of the i-th row is the
probability that a length-(sM') region ending
with ti does not have a hit. The row sum is lower
bounded by (1-p)d (1-pW ), where (1-p)d is the
probability of generating d 0's and 1-pW is the
probability of generating a string in T. It is
known that the largest eigenvalue of C is
positive and unique, and is lower bounded by the
smallest row sum and upper bounded the largest
row sum of C. Let ?1gt0 be the largest eigenvalue,
and ?2, 0 lt ?2 lt ?1 be the second largest.
There is a unique eigenvector corresponding to
?1. - x(n) / x(n) x(1) ? (CT
)n/M / x(1) ? (CT)n/M - converges to the eigenvector corresponding
to ?1. As ?1n tends to zero, we can use standard
techniques to normalize xn as - y(n) x(n) / x(n)2
- and then x(nM) C y(n) .
57- Proof Continued
- Then (by power method) the Rayleigh quotient of
- ? (y(n))T C y(n) / (y(n))T y(n)
(y(n))T x(nM) - converges to ?1. The convergence speed depends on
the ratio ?1/ ?2, it is known that we can upper
bound the second largest eigenvalue ?2 in a
positive matrix by - ?2 ?1 (K-1)/(K1)
- where K maxi,j,k,l ? (Ci,j Ck,l / Ci,l Ck,j).
For any i,j, we have -
- pa (1-p)b (1-p)d
Ci,j pa (1-p)b - where a is the number of 1's in tj, b the number
of 0's in tj. K O(1/(1-p)d ). - As (1 - 1/K)K goes to 1/e, the convergence can be
achieved in O(K) O(1/(1-p)d ) steps. The time
complexity is therefore upper bounded by an
exponential function in s.
QED
58Theorem 2. It is NP hard to find the optimal
seed(Li, Ma, Kisman, Tromp, J. Bioinfo Comput.
Biol. 2004)
- Proof. Reduction from 3-SAT. Given a 3-SAT
instance C1 Cn, on Boolean variables x1, ,
xm, where Ci l1 l2 l3, each l is some xi or
its complement. We reduce this to seed selection
problem. The required seed has weight Wm2 and
length L2m2, and is intended to be of the form
1(0110)m1, a straightforward variable assignment
encoding. The i-th pair of bits after the initial
1 is called the code for variable xi. A 01 code
corresponds to setting xi true, while a 10 means
false. To prohibit code 11, we introduce, for
every 1 i m, a region -
- Ai1 (11)i-101(11)m-i 10m1
1(11)i-110(11)m-i 1. - Since the seed has m 0s, it cannot bridge the
middle 0m1 part, and is thus forced to hit at
the start or at the end. This obviously rules out
code 11 for variable xi. Code 00 automatically
becomes ruled out as well, since the seed must
distribute m 1s among all m codes. Finally, for
each clause Ci, we introduce - Ri 1(11)a-1 c1(11)m-a 1 000
1(11)b-1c2(11)m-b 1 000 1(11)c-1c3(11)m-c 1. - The total length of Ri is (2m2)3(2m2)3(2m2)
6m12 it consists of three parts each
corresponding to a literal in the constraint,
with ci being the code for making that literal
true. - Thus a seed hits Ri iff the assignment encoded by
the seed satisfies Ci .
59The complexity of computing spaced seed
sensitivity
- Theorem. It is also NP hard to compute
sensitivity of a given seed. It remains to be NPC
even in a uniform region, at any homology level
p. - Proof.
- Very complicated. Omitted.
60PTAS for computing sensitivity of a seed
- Trivial alg. Sample R poly times, compute the
frequency s hits. Use that to approximate the
real probability. - When p is large, then using Chernoff bound, we
have a PTAS. - But when p is very small, or seed weight is very
large, hitting probability is actually very
small, then we do not have a PTAS.
61Efficient approximation of the hit probability of
a seed
- Theorem. Let the hit probability of s be x. For
any egt0, let N 6R2log R / e2. Then with high
probability, in polynomial time and N samples, we
can output a value y such that y-x ex.
62- Proof. We give a transformation to transform the
low probability event s hits R'' dependent on p
and weight of s to several events with
probability sum 1, independent of p and seed
weight W. Let sr be the reverse string of s. - Ps hits R Psr hits R
- ?i0R-s Psr hits at i, but misses at
0, , i-1 - ?i0R-s Psr hits at i x Psr misses
0, , i-1 sr hits at i - ?i0R-s pW x Ps misses 1 i s hits
0 (1) - Because Ps hits R pW, (1) induces that
-
- ?i0R-s Ps misses 1 i s hits 0 1
- Now, we can do efficient sampling, and using
Chernoff bounds, the rest of the proof is
standard. QED
63Expected hit distance
- Corollary. E(second hit position s hits at 0)
p-W. - Proof.
- ?i0R-s Ps misses 1 i s hits 0
- ?i0R-s ?ji1R-s1 Ps second hits j
s hits 0 - ?j0R-s1 ?i0j-1 Ps second hits j s
hits 0 - ?i0R-s1 (j-1)Ps second hits j s hits
0 - Combining with (1), we have
- Ps hits RxpW?i0R-s1 (j-1)Ps second hits
js hits 0 - Letting R ? 8, the left hand is p-W, the right
is - E(second hit position s hits at 0).
QED
64Chernoff bound
- Consider a sequence of n independent tosses of a
coin with heads prob. p. The expected number of
heads is mnp. What is the probability that the
number of heads deviates a lot from this? Let X
be the number of successes, then - PrX-mem 2e-eem/3
65Complexity of finding the optimal spaced seeds
- Theorem 1 Li-Ma. Given a seed and it is NP-hard
to find its sensitivity, even in a uniform
region. - Theorem 2 Li-Ma. The sensitivity of a given
seed can be efficiently approximated by a PTAS. - Theorem 3. Given a set of seeds, choose k best
can be approximated with ratio 1- 1/e. - Theorem 4 Buhler-Keich-Sun, Li-Ma The
asymptotic hit probability is computable in
exponential time in seed length, independent of
homologous region length. - Theorem 5 L. Zhang If the length of a spaced
seed is not too long, then it strictly
outperforms consecutive seed, in asymptotic hit
probability.
66- 5. Applications beyond bioinformatics
67Stock Market PredictionZou-Deng-Li Detecting
Market Trends by Ignoring It, Some Days, March,
2005
- A real gold mine 4.6 billion dollars are traded
at NYSE daily. - Buy low, sell high.
- Best thing God tells us market trend
- Next best thing A good market indicator.
- Essentially, a buy indicator must be
- Sensitive when the market rises
- Insensitive otherwise.
68My goal
- Provide a sensitive, but not aggressive market
trend indicator. - Learning/inference methods or complete and
comprehensive systems are beyond this study. They
can be used in conjunction with our proposal.
69Background
- Hundreds of market indicators are used (esp. in
automated systems). Typical - Common sense if the past k days are going up,
then the market is moving up. - Moving average over the last k days. When the
average curve and the price curve intersect,
buy/sell. - Special patterns a wedge, triangle, etc.
- Volume
- Hundreds used in automated trading systems.
- Note any method will make money in the right
market, lose money in the wrong market
70Problem Formalization
- The market movement is modeled as a 0-1 sequence,
one bit per day, with 0 meaning market going
down, and 1 up. - S(n,p) is an n day iid sequence where each bit
has probability p being 1 and 1-p being 0. If
pgt0.5, it is an up market - Ik1k is an indicator that the past k days are
1s. - I8 has sensitivity 0.397 in S(30,0.7), too
conservative - I8 has false positive rate 0.0043 in S(100, 0.3).
Good - Iij is an indicator that there are i 1s in last
j days. - I811 has high sensitivity 0.96 in S(30,0.7)
- But it is too aggressive at 0.139 false positive
rate in S(100, 0.3). - Spaced seeds 111111111 and 111111111 combine
to - have sensitivity 0.49 in S(30,0.7)
- False positive rate 0.0032 in S(100, 0.3).
- Consider a betting game A player bets a number
k. He wins k dollars for a correct prediction and
o.w. loses k dollars. We say an indicator A is
better than B, AgtB, if A always wins more and
loses less.
71Sleeping on Tuesdays and Fridays
- Spaced seeds are beautiful indicators they are
sensitive when we need them to be sensitive and
not sensitive when we do not want them to be.
11111111 always beats I811 if it bets 4
dollars for each dollar I811 bets. It is gtI8
too.
72Two spaced seeds
Observe two spaced Seeds curve vs I8, the spaced
seeds are always more sensitive in pgt0.5 region,
and less sensitive when plt0.5
73Two experiments
- We performed two trading experiments
- One artificial
- One on real data (SP 500, Nasdaq indices)
74Experiment 1 Artificial data
- This simple HMM generates a very artificial
simple model - 5000 days (bits)
- Indicators I7, I711, 5 spaced seeds.
- Trading strategy if there is a hit, buy, and
sell 5 days later. - Reward is (1)-(0) in that 5 days times the
betting ratio
75Results of Experiment 1.
- R Hits Final MTM
Bankrupcies - I71111111 30 12 679
16 - I711 15 47 916
14 - 5 Spaced seeds 25 26 984
13
76Experiment 2.
- Historical data of SP 500, from Oct 20, 1982 to
Feb. 14, 2005 and NASDAQ, from Jan 2, 85 to Jan
3, 2005 were downloaded from Yahoo.com. - Each strategy starts with 10,000 USD. If an
indicator matches, use all the money to buy/sell. - While in no ways this experiment says anything
affirmatively, it does suggest that spaced
patterns can serve a promising basis for a useful
indicator, together with other parameters such as
trade volume, natural events, politics,
psychology.
77(No Transcript)
78Open Questions
- I have presented a simple idea of optimized
spaced seeds and its applications in homology
search and time series prediction. - Open questions current research
- Complexity of finding (near) optimal seed, in a
uniform region. Note that this is not an NP-hard
question. - Tighter bounds on why spaced seeds are better.
- Applications to other areas. Apparently, the same
idea works for any time series. - Model financial data
- Can we use varying patterns instead of one seed?
When patterns vary at the same weight, we still
have same number of expected hits. However, can
this increase sensitivity? One seed is just a
special case of this.
79Acknowledgement
- PH is joint work with B. Ma and J. Tromp
- PH II is joint work with Ma, Kisman, and Tromp
- Some joint theoretical work with Ma, Keich,
Tromp, Xu, Brown, Zhang. - Financial market prediction J. Zou, X. Deng
- Financial support Bioinformatics Solutions Inc,
NSERC, Killam Fellowship, CRC chair program, City
University of Hong Kong.
80Here is an example of a BLASTP match (E-value 0)
between gene 0189 in C. pneumoniae and gene 131
in C. trachomatis. Query CPn0189
Score
(bits) E-value Aligned with CT131 hypothetical
protein 1240
0.0 Query 1 MKRRSWLKILGICLGSSIVLGFLIFLPQLLSTE
SRKYLVFSL I HKESGLSCSAEELKISW 60
MKR W KI G L L L LP SES KYL
SKEGL EL SW Sbjct 1
MKRSPWYKIFGYYLLVGVPLALLALLPKFFSSESGKYLFLSVLNKETGLQ
F EIEQLHLSW 60 Query 61 FGRQTARKIKLTG-EAKDEVFS
AEKFELDGSLLRLL I YKKPKGITLSGWSLKINEPASID 119
FG QTAKI G EFAEK GSL
RLLY PK TLGWSLIE S Sbjct 61
FGSQTAKKIRIRGIDSDSEIFAAEKI IVKGSLPRLLL
YRFPKALTLTGWSLQIDESLSMN 120 Etc. Note Because of
powerpoint character alignment problems, I
inserted some white space to make it look more
aligned.
81Summary and Open Questions
- Simple ideas 0 created SW, 1 created Blast. 1
0 created PatternHunter. - Good spaced seeds help to improve sensitivity and
reduce irrelevant hit. - Multiple spaced seeds allow us to improve
sensitivity to approach Smith-Waterman, not
possible with BLAST. - Computing spaced seeds by DP, while it is NP
hard. - Proper data models help to improve design of
spaced seeds (HMM) - Open Problems (a) A mathematical theory of
spaced seeds. I.e. quantitative statements for
Theorem 1. - (b) Applications to other fields.
82Smaller (more solvable) questions (perhaps good
for course projects)
- Multiple protein seeds for Smith-Waterman
sensitivity? - Applications in other fields, like internet
document search? - Prove Theorem 1 for other models for example,
M(3) model 0.8,0.8,0.5 together with specialized
seeds. - More extensive study of Problem 3.