Modern Homology Search - PowerPoint PPT Presentation

About This Presentation
Title:

Modern Homology Search

Description:

Modern Homology Search Ming Li Canada Research Chair in Bioinformatics Computer Science. U. Waterloo I want to show you how one simple theoretical idea can make big ... – PowerPoint PPT presentation

Number of Views:193
Avg rating:3.0/5.0
Slides: 83
Provided by: Ming74
Category:

less

Transcript and Presenter's Notes

Title: Modern Homology Search


1
Modern Homology Search
Ming Li Canada Research Chair in
Bioinformatics Computer Science. U. Waterloo
2
  • I want to show you how one simple theoretical
    idea can make big impact to a practical field.

3
Outline
  • Motivation, market, biology, homology search
  • Review dynamic programming, BLAST heuristics
  • Optimized spaced seeds
  • The idea
  • How to compute them
  • Data models, coding regions, HMM, vector seeds
  • Multiple optimized spaced seeds
  • PatternHunter program
  • Mathematical theory of spaced seeds
  • Why they are better
  • Complexity of finding optimal spaced seeds
  • Applications beyond bioinformatics Open problems

4
1. Introduction
  • Biology
  • Motivation and market
  • Definition of homology search problem

5
Biology
Phenotype
6
Human 3 billion bases, 30k genes. E. coli 5
million bases, 4k genes
T
A
A
T
C
G
cDNA
T
A
reverse transcription
T
A
translation
transcription
C
G
Protein
mRNA
G
C
(20 amino acids)
(A,C,G,U)
G
C
Codon three nucleotides encode an
amino acid. 64 codons 20 amino acids, some w/more
codes
A
T
C
G
T
A
A
T
7
A gigantic gold mine
  • The trend of genetic data growth
  • 400 Eukaryote genome projects underway
  • GenBank doubles every 18 months
  • Comparative genomics ? all-against-all search
  • Software must be scalable to large datasets.

30 billion in year 2005
8
Homology search
  • Given two DNA sequences, find all local similar
    regions, using edit distance (match1,
    mismatch-1, gapopen-5, gapext-1).
  • Example. Input
  • E. coli genome 5 million base pairs
  • H. influenza genome 1.8 million base pairs
  • Output all local alignments (PH demo)
  • java jar phn.jar i ecoli.fna j
    hinf.fna o out.txt b

9
Homology search vs Google search
  • Internet search
  • Size limit 5 billion people x homepage size
  • Supercomputing power used ½ million
    CPU-hours/day
  • Query frequency Google --- 112 million/day
  • Query type exact keyword search --- easy to do
  • Homology search
  • Size limit 5 billion people x 3 billion
    basepairs millions of species x billion bases
  • 10 (?) of worlds supercomputing power
  • Query frequency NCBI BLAST -- 150,000/day, 15
    increase/month
  • Query type approximate search

10
Bioinformatics Companies living on Blast
  • Paracel (Celera)
  • TimeLogic (recently liquidated)
  • TurboGenomics (TurboWorx)
  • NSF, NIH, pharmaceuticals proudly support many
    supercomputing centers mainly for homology search
  • Hardware high cost become obsolete in 2 years.

11
History
  • Dynamic programming (1970-1980)
  • Full sensitivity, but slow.
  • Human vs mouse genomes 104 CPU-years
  • BLAST, FASTA heuristics (1980-1990)
  • Low sensitivity, still not fast enough.
  • Human vs mouse genomes 19 CPU-years
  • BLAST paper was referenced 100000 times
  • Modern technology full sensitivity greater
    speed!

12
2. Old Technology
  • Dynamic programming full sensitivity homology
    search
  • BLAST heuristics --- trading sensitivity with
    time.

13
Dynamic Programming
  • Longest Common Subsequence (LCS).
  • Vv1v2 vn
  • Ww1w2 wm
  • s(i,j) length of LCS
  • of V1..i and W1..j
  • Dynamic Programming
  • s(i-1,j)
  • s(i,j)max s(i,j-1)
  • s(i-1,j-1)1,viwj
  • Sequence Alignment (Needleman-Wunsch, 1970)
  • s(i,j)max
  • s(i-1,j) d(vi,-)
  • s(i,j-1)d(-,wj)
  • s(i-1,j-1)d(vi,wj)
  • 0 (Smith-Waterman, 1981)
  • No affine gap penalties,
  • where d, for proteins, is either PAM or BLOSUM
    matrix. For DNA, it can be match 1, mismatch -1,
    gap -3 (In fact open gap -5, extension -1.)
  • d(-,x)d(x,-) -a, d(x,y)-u. When a0,
    uinfinity, it is LCS.

14
Misc. Concerns
  • Local sequence alignment, add si,j0.
  • Gap penalties. For good reasons, we charge first
    gap cost a, and then subsequent continuous
    insertions blta.
  • Space efficient sequence alignment. Hirschberg
    alg. in O(n2) time, O(n) space.
  • Multiple alignment of k sequences O(nk)

15
Blast
  • Popular software, using heuristics. By Altschul,
    Gish, Miller, Myers, Lipman, 1990.
  • E(xpected)-value e dmne -lS, here S is score, m
    is database length and n is query length.
  • Meaning e is number of hits one can expect to
    see just by chance when searching a database of
    size m.
  • Basic Strategy For all 3 a.a. (and closely
    related) triples, remember their locations in
    database. Given a query sequence S. For all
    triples in S, find their location in database,
    then extend as long as e-value significant.
    Similar strategy for DNA (7-11 nucleotides). Too
    slow in genome level.

16
Blast Algorithm
  • Find seeded (11 bases) matches
  • Extent to HSPs (High Scoring Pairs)
  • Gapped Extension, dynamic programming
  • Report all local alignments

17
BLAST Algorithm Example
  • Find seeded matches of 11 base pairs
  • Extend each match to right and left, until the
    scores drop too much, to form an alignment
  • Report all local alignments

Example AGCGATGTCACGCGCCCGTATTTCCGTA
TCGGATCTCACGCGCCCGGCTTACCGTG
0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0
1 1 0 1 1 1 1 0
G



x
18
BLAST Dilemma
  • If you want to speed up, have to use a longer
    seed. However, we now face a dilemma
  • increasing seed size speeds up, but loses
    sensitivity
  • decreasing seed size gains sensitivity, but loses
    speed.
  • How do we increase sensitivity speed
    simultaneously? For 20 years, many tried suffix
    tree, better programming ..

19
3. Optimized Spaced Seeds
  • Why are they better
  • How to compute them
  • Data models, coding regions, HMM
  • PatternHunter
  • Multiple spaced seeds
  • Vector seeds

20
New thinking
  • Three lines of existing approaches
  • Smith-Waterman exhaustive dynamic programming.
  • Blast family find seed matches (11 for Blastn,
    28 for MegaBlast), then extend. Dilemma
    increasing seed size speeds up but loses
    sensitivity descreasing seed size gains
    sensitivity but loses speed.
  • Suffix tree MUMmer, Quasar, REPuter. Only good
    for precise matches, highly similar sequences.
  • Blast approach is the only way to deal with real
    and large problems, but we must to solve Blast
    dilemma We need a way to improve sensitivity and
    speed simultaneously.
  • Goals Improve (a) Blast, (b) Smith-Waterman.

21
Optimal Spaced Seed(Ma, Tromp, Li
Bioinformatics, 183, 2002, 440-445)
  • Spaced Seed nonconsecutive matches and optimized
    match positions.
  • Represent BLAST seed by 11111111111
  • Spaced seed 11111111111
  • 1 means a required match
  • means dont care position
  • This seemingly simple change makes a huge
    difference significantly increases hit prob. to
    homologous region while reducing bad hits.

22
Formalization
  • Given i.i.d. sequence (homology region) with
    Pr(1)p and Pr(0)1-p for each bit
  • 1100111011101101011101101011111011101
  • Which seed is more likely to hit this region
  • BLAST seed 11111111111
  • Spaced seed 11111111111

11111111111
23
Sensitivity PH weight 11 seed vs Blast 11 10
24
PH 2-hit sensitivity vs Blastn 11, 12 1-hit
25
Expect Less, Get More
  • Lemma The expected number of hits of a weight W
    length M seed model within a length L region with
    similarity p is
  • (L-M1)pW
  • Proof The expected number of hits is the sum,
    over the L-M1 possible positions of fitting the
    seed within the region, of the probability of W
    specific matches, the latter being pW.
  • Example In a region of length 64 with 0.7
    similarity, PH has probability of 0.466 to hit vs
    Blast 0.3, 50 increase. On the other hand, by
    above lemma, Blast expects 1.07 hits, while PH
    0.93, 14 less.

26
Why Is Spaced Seed Better?
  • A wrong, but intuitive, proof seed s, interval
    I, similarity p
  • E(hits) Pr(s hits) E(hits s hits)
  • Thus
  • Pr(s hits) Lpw / E(hits s hits)
  • For optimized spaced seed, E(hits s hits)
  • 11111111111 Non overlap
    Prob
  • 11111111111 6
    p6
  • 11111111111 6
    p6
  • 11111111111 6
    p6
  • 11111111111 7
    p7
  • ..
  • For spaced seed the divisor is 1p6p6p6p7
  • For BLAST seed the divisor is bigger 1 p p2
    p3

27
Finding Optimal Spaced Seeds(Keich, Li, Ma,
Tromp, Discrete Appl. Math. 138(2004), 253-263 )
  • Let f(i,b) be the probability that seed s hits
    the length i prefix of R that ends with b.
  • Thus, if s matches b, then
  • f(i,b) 1,
  • otherwise we have the recursive relationship
  • f(i,b) (1-p)f(i-1,0b') pf(i-1,1b')
  • where b' is b deleting the last bit.
  • Then the probability of s hitting R is,
  • SbM Prob(b) f(R,b).
  • Choose s with the highest probability.

28
Improvements
  • Brejova-Brown-Vinar (HMM) and Buhler-Keich-Sun
    (Markov) The input sequence can be modeled by a
    (hidden) Markov process, instead of iid.
  • Multiple seeds
  • Brejova-Brown-Vinar Vector seeds
  • Csuros Variable length seeds e.g. shorter
    seeds for rare query words.

29
PatternHunter(Ma, Tromp, Li Bioinformatics,
183, 2002, 440-445)
  • PH used optimal spaced seeds, novel usage of data
    structures red-black tree, queues, stacks,
    hashtables, new gapped alignment algorithm.
  • Written in Java. Yet many times faster than
    BLAST.
  • Used in Mouse/RAT Genome Consortia (Nature, Dec.
    5, 2002), as well as in hundreds of institutions
    and industry.

30
Comparison with BLAST
  • On Pentium III 700MH, 1GB

  • BLAST PatternHunter
  • E.coli vs H.inf 716s
    14s/68M
  • Arabidopsis 2 vs 4 --
    498s/280M
  • Human 21 vs 22 --
    5250s/417M
  • Human(3G) vs Mouse(x39G) 19 years 20 days
  • All with filter off and identical parameters
  • 16M reads of Mouse genome against Human genome
    for MIT Whitehead. Best BLAST program takes 19
    years at the same sensitivity

31
Quality Comparisonx-axis alignment
ranky-axis alignment scoreboth axes in
logarithmic scale
A. thaliana chr 2 vs 4
E. Coli vs H. influenza
32
(No Transcript)
33
(No Transcript)
34
(No Transcript)
35
Genome Alignment by PatternHunter (4 seconds)
36
Prior Literature
  • Over the years, it turns out, that the
    mathematicians know about certain patterns appear
    more likely than others for example, in a random
    English text, ABC is more likely to appear than
    AAA, in their study of renewal theory.
  • Random or multiple spaced q-grams were used in
    the following work
  • FLASH by Califano Rigoutsos
  • Multiple filtration by Pevzner Waterman
  • LSH of Buhler
  • Praparata et al

37
Coding Region HMM
  • The Hidden Markov Model is a finite set
    of states, each of which is associated with a
    probability distribution. Transitions among the
    states are governed by a set of probabilities
    called transition probabilities. In a particular
    state an outcome or observation can be generated,
    according to the associated probability
    distribution. It is only the outcome, not the
    state, which is visible to an external observer
    and therefore states are hidden'' from the
    outside hence the name Hidden Markov Model. An
    HMM has the following components satisfying usual
    probability laws
  • The number of states of the model, N.
  • The number of observation symbols in the
    alphabet, M.
  • A set of state transition probabilities,
    depending on current state.
  • A probability distribution of the observable
    symbol at each of the states.

38
Modeling coding region with HMM
  • PatternHunter original assumption I is a
    sequence of N independent Bernoulli variables X0,
    XN-1 with P(Xi)p.
  • Coding region third position usually is less
    conserved. Skipping it should be a good idea
    (BLAT 110110110). With spaced seeds, we can model
    it using HMM.
  • Brejova, Brown, Vinar M(3) HMM I is a sequence
    of N independent Bernoulli random variables
    X0,X1, XN-1 where Pr(Xi1)pi mod 3. In
    particular
  • p0
    p1 p2
  • human/mouse 0.82 0.87 0.61
  • human/fruit fly 0.67 0.77 0.40
  • DP algorithm naturally extends to M(3),
  • Opt seed (weight 8) for coding region
    11011011000011

39
Picture of M(3)
Extending M(3), Brejova-Brown-Vinar proposed
M(8), above, to model dependencies among
positions within codons Spijk 1, each codon
has conservation pattern ijk with probability
pijk.
Figures from Brejova-Brown-Vinar
40
Sensitivity of all seeds Under 4 models
From Brejova-Brown-Vinar
41
Running PHDownload at www.BioinformaticsSolution
s.com
  • java jar ph.jar i query.fna j subject.fna o
    out.txt b multi 4
  • -Xmx512m --- for large files
  • -j missing query.fna self-comparison
  • -db multiple sequence input, 0,1,2,3 (no, query,
    subject, both)
  • -W seed weight
  • -G open gap penalty (default 5)
  • -E gap extension (default 1)
  • -q mismatch penalty (default 1)
  • -r reward for match (default 1)
  • -model specify model in binary
  • -H hits before extension
  • -P show progress
  • -b Blast output format
  • -multi number of seeds to be used (default 1
    max 16)

42
Multiple Spaced Seeds Approaching Smith-Waterman
Sensitivity(Li, Ma, Kisman, Tromp, PatternHunter
II, J. Bioinfo Comput. Biol. 2004)
  • The biggest problem for Blast was low sensitivity
    (and low speed). Massive parallel machines are
    built to do Smith-Waterman exhaustive dynamic
    programming.
  • Spaced seeds give PH a unique opportunity of
    using several optimal seeds to achieve optimal
    sensitivity, this was not possible for Blast
    technology.
  • Economy --- 2 seeds (½ time slow down) achieve
    sensitivity of 1 seed of shorter length (4 times
    speed).
  • With multiple optimal seeds. PH II approaches
    Smith-Waterman sensitivity, and 3000 times
    faster.
  • Finding optimal seeds, even one, is NP-hard.
  • Experiment 29715 mouse EST, 4407 human EST.
    Sensitivity and speed comparison next two slides.

43
Finding Multiple Seeds (PH II)
  • Computing the hit probability of given k seeds on
    a uniformly distributed random region is NP-hard.
  • Finding a set of k optimal seeds to maximize the
    hit probability cannot be approximated within
    1-1/e unless NPP
  • Given k seeds, algorithm DP can be adapted to
    compute their sensitivity (probability one of the
    seeds has a hit).
  • Greedy Strategy
  • Compute one optimal seed a. Sa
  • Given S, find b maximizing hit probability of S U
    b.
  • Coding region, use M(3), 0.8, 0.8, 0.5 (for mod 3
    positions)
  • We took 12 CPU-days on 3GHz PC to compute 16
    weight 11 seeds. General seeds (first 4)
    111010010100110111, 111100110010100001011,
    11010000110001010111, 1110111010001111.

44
Sensitivity Comparison with Smith-Waterman (at
100) The thick dashed curve is the sensitivity
of Blastn, seed weight 11. From low to high,
the solid curves are the sensitivity of PH II
using 1, 2, 4, 8 weight 11 coding region seeds,
and the thin dashed curves are the sensitivity
1, 2, 4, 8 weight 11 general purpose seeds,
respectively
45
Speed Comparison with Smith-Waterman
  • Smith-Waterman (SSearch) 20 CPU-days.
  • PatternHunter II with 4 seeds 475 CPU-seconds.
    3638 times faster than Smith-Waterman dynamic
    programming at the same sensitivity.

46
Vector Seeds
  • Definition. A vector seed is an ordered pair
    Q(w,T), where w is a weight vector (w1, wM)
    and T is a threshold value. An alignment sequence
    x1,,xn contains a hit of the seed Q at position
    k if
  • ?i1..M(wixki-1) T
  • The number of nonzero positions in w is called
    the support of the seed.
  • Comments. Other seeds are special cases of vector
    seeds.

47
Experiment (Brejova PhD. thesis, 2005)
  • Predicted sensitivity and false positive rate
    (prob. hit 0.25-homology region of seed length)
    comparison of various types of seeds in a
    Bernoulli region of 0.7 match probability, length
    64.
  • --------------------------------------------------
    ---------------------------------------
  • Name Weight vector T
    Support Sensitivity False positive rate

  • BLAST 1111111111 10
    10 41 9.5x10-7
  • PH-10 1110101100011011 10 10
    60 9.5x10-7
  • BLAST-13-15 111111111111111 13 15
    73 9.2x10-7
  • VS-13-15 111111011011101111 13 15
    85 9.2x10-7
  • VS-12-13 111101100110101111 12 13
    74 6.0x10-7
  • VS-11-12 111011001011010111 11 12
    84 2.2x10-6

48
Variable weight spaced seeds
  • M. Csuros proposal use variable weighted spaced
    seeds depending on the composition of the strings
    to be hashed. Rarer strings are expected to
    produce fewer false positives, use seeds with
    smaller weight --- this increases the sensitivity.

49
Open Question
  • Can we use different patterns (of same weight) at
    different positions. So this can be called
    variable position-spaced seeds. At the same
    weight, we do not increase expected number of
    false positive hits. Can we increase sensitivity?
    Find these patterns? How much increase will there
    be? Practical enough?
  • If the above is not the case, then prove single
    optimal seed is always better than variable
    positions.

50
Old field, new trend
  • Research trend
  • Dozens of papers on spaced seeds have appeared
    since the original PH paper, in 3 years.
  • Many more have used PH in their work.
  • Most modern alignment programs (including BLAST)
    have now adopted spaced seeds
  • Spaced seeds are serving thousands of users/day
  • PatternHunter direct users
  • Pharmaceutical/biotech firms.
  • Mouse Genome Consortium, Nature, Dec. 5, 2002.
  • Hundreds of academic institutions.

51
4. Mathematical Theory of spaced seeds
  • Why they are better Spaced seeds hits first
  • Asymptotic sensitivity is there a best seed in
    absolute sense?
  • NP hardness of computing sensitivity
  • Approximation algorithm for sensitivity

52
Why are the spaced seeds better
  • Theorem 1. Let I be a homologous region, homology
    level p. For any sequence 0i0 lt i1 lt in-1
    I-s, let Aj be the event a spaced seed s hits
    I at ij, Bj be event the consecutive seed hits I
    at j. (Keich, Li, Ma, Tromp, Discrete Appl. Math.
    138(2004), 253-263 )
  • (1) P(Ujltn Aj) P(Ujltn Bj) When
    ijj, gt holds.
  • Proof. Induction on n. For n1, P(A0)pWP(B0).
    Assume the theorem holds for nN, we prove it
    holds for N1. Let Ek denote the event that, when
    putting a seed at I0, 0th, 1st k-1st 1s in
    the seed all match the k-th 1 mismatches.
    Clearly, Ek is a partition of sample space for
    all k0, .. , W, and P(Ek)pk(1-p) for any seed.
    (Note, Ek for Ai and Ek for Bi have different
    indices they are really different events.) It
    is sufficient to show, for each kW
  • (2) P(Uj0..N AjEk)P(Uj0..NBjEk)
  • When kW, both sides of (2) equal to 1. For
    kltW since (UjkBj)nEkF and Bk1,Bk2, BN are
    mutually independent of Ek, we have
  • (3) P(Uj0..NBjEk)P(Ujk1..NBj)
  • Now consider the first term in (2). For each
    k in 0, W-1, at most k1 of the events Aj
    satisfy AjnEkF because AjnEkF iff when aligned
    at ij seed s has a 1 bit at overlapping k-th bit
    when the seed was at 0 there are at most k1
    choices. Thus, there exist indices 0ltmk1ltmk2
    ltmN N such that AmjnEk?F. Since Ek means all
    previous bits matched (except for k-th), it is
    clear that Ek is non-negatively correlated with
    Ujk1..N Amj, thus
  • (4) P(Uj0..N
    AjEk)P(Ujk1..N AmjEk)P(Ujk1..NAmj)
  • By the inductive hypothesis, we have
  • P(Ujk1..N Amj)
    P(Ujk1..NBj)
  • Combined with (3),(4), this proves (2),
    hence the part of (1) of the theorem.

53
To prove when ijj, P(Uj0..n-1Aj)gtP(Uj0..n-1Bj)
--- (5)
  • We prove (5) by induction on n. For n2, we have
  • P(Uj0,1Aj)2pw p2w-Shift1gt2pw-pw1P
    (Uj0,1Bj) ()
  • where shift1number of overlapped bits of the
    spaced seed s when shifted to the right by 1 bit.
  • For inductive step, note that the proof of (1)
    shows that for all k0,1, ,W,
    P(Uj0..n-1AjEk)P(Uj0..n-1BjEk). Thus to
    prove (5) we only need to prove that
  • P(Uj0..n-1AjE0)gtP(Uj0..n-1BjE0).
  • The above follows from the inductive hypothesis
    as follows
  • P(Uj0..n-1AjE0)P(Uj1..n-1Aj)gtP(Uj1..n-1Bj)P
    (Uj0..n-1BjE0).

54
Corollary
  • Corollary Assume I is infinite. Let ps and pc be
    the first positions a spaced seed and the
    consecutive seed hit I, respectively. Then Eps
    lt Epc.
  • Proof. EpsSk0..8 kP(psk)
  • Sk0..8 kP(psgtk-1)-P(psgtk)
  • Sk0..8P(psgtk)
  • Sk0..8(1-P(Uj0..kAj))
  • ltSk0..8(1-P(Uj0..kBj))
  • Epc.

55
Asymptotic sensitivity(Buhler-Keich-Sun, JCSS
2004, Li-Ma-Zhang, SODA2006)
  • Theorem. Given a spaced seed s, the asymptotic
    hit probability on a homologous region R can be
    computed, up to a constant factor, in time
    exponentially proportional to s, but
    independent of R.
  • Proof. Let d be one plus the length of the
    maximum run of 0's in s. Let M'sd. Consider
    an infinite iid sequence RR1R2 ... ,
    Prob(Ri1)p.
  • Let T be set of all length s strings that do
    not dominate s (not matched by s).
  • Let Zi,j denote the event that s does not hit
    at any of the positions Ri, Ri1, , Rj.
    For any ti ? T, let
  • xi(n) PZ1,n
    Rn..ns-1 ti .
  • That is, xi(n) is the no-hit probability in the
    first n positions when they end with ti. We now
    establish a relationship between xi(n) and
    xi(n-M) . Let
  • Ci,j PZn-M'1,n ti at n,
    tj at n-M' x Ptj at n-M' ti at n.
  • Thus Ci,j is the probability of generating tj at
    position n-M' (from ti at n) times the
    probability not having a hit in a region of
    length sM' beginning with tj and ending with
    ti. Especially Ci,j ? 0 for any nontrivial seed
    of length greater than 0 (because of M with d
    spacing).

56
Proof Continued
  • Then,
  • xi(n) ?j1K PZ1,n-M' tj at n-M'
  • x Ptj at n-M' ti at n x
    PZn-M'1,n ti at n, tj at n-M'
  • ?j1K PZ1,n-M' tj at n-M' x
    Ci,j
  • ?j1K Ci,j xj(n-M)
  • Define the transition matrix C(Ci,j). It is a K
    x K matrix. Let x(n) (x1(n), x2(n), , xK(n)).
    Then, assuming that M' divides n,
  • x(n) x(1) ? (CT)n/M.
  • Because Ci,j gt 0 for all i,j, C is a positive
    matrix. The row sum of the i-th row is the
    probability that a length-(sM') region ending
    with ti does not have a hit. The row sum is lower
    bounded by (1-p)d (1-pW ), where (1-p)d is the
    probability of generating d 0's and 1-pW is the
    probability of generating a string in T. It is
    known that the largest eigenvalue of C is
    positive and unique, and is lower bounded by the
    smallest row sum and upper bounded the largest
    row sum of C. Let ?1gt0 be the largest eigenvalue,
    and ?2, 0 lt ?2 lt ?1 be the second largest.
    There is a unique eigenvector corresponding to
    ?1.
  • x(n) / x(n) x(1) ? (CT
    )n/M / x(1) ? (CT)n/M
  • converges to the eigenvector corresponding
    to ?1. As ?1n tends to zero, we can use standard
    techniques to normalize xn as
  • y(n) x(n) / x(n)2
  • and then x(nM) C y(n) .

57
  • Proof Continued
  • Then (by power method) the Rayleigh quotient of
  • ? (y(n))T C y(n) / (y(n))T y(n)
    (y(n))T x(nM)
  • converges to ?1. The convergence speed depends on
    the ratio ?1/ ?2, it is known that we can upper
    bound the second largest eigenvalue ?2 in a
    positive matrix by
  • ?2 ?1 (K-1)/(K1)
  • where K maxi,j,k,l ? (Ci,j Ck,l / Ci,l Ck,j).
    For any i,j, we have
  • pa (1-p)b (1-p)d
    Ci,j pa (1-p)b
  • where a is the number of 1's in tj, b the number
    of 0's in tj. K O(1/(1-p)d ).
  • As (1 - 1/K)K goes to 1/e, the convergence can be
    achieved in O(K) O(1/(1-p)d ) steps. The time
    complexity is therefore upper bounded by an
    exponential function in s.
    QED

58
Theorem 2. It is NP hard to find the optimal
seed(Li, Ma, Kisman, Tromp, J. Bioinfo Comput.
Biol. 2004)
  • Proof. Reduction from 3-SAT. Given a 3-SAT
    instance C1 Cn, on Boolean variables x1, ,
    xm, where Ci l1 l2 l3, each l is some xi or
    its complement. We reduce this to seed selection
    problem. The required seed has weight Wm2 and
    length L2m2, and is intended to be of the form
    1(0110)m1, a straightforward variable assignment
    encoding. The i-th pair of bits after the initial
    1 is called the code for variable xi. A 01 code
    corresponds to setting xi true, while a 10 means
    false. To prohibit code 11, we introduce, for
    every 1 i m, a region
  • Ai1 (11)i-101(11)m-i 10m1
    1(11)i-110(11)m-i 1.
  • Since the seed has m 0s, it cannot bridge the
    middle 0m1 part, and is thus forced to hit at
    the start or at the end. This obviously rules out
    code 11 for variable xi. Code 00 automatically
    becomes ruled out as well, since the seed must
    distribute m 1s among all m codes. Finally, for
    each clause Ci, we introduce
  • Ri 1(11)a-1 c1(11)m-a 1 000
    1(11)b-1c2(11)m-b 1 000 1(11)c-1c3(11)m-c 1.
  • The total length of Ri is (2m2)3(2m2)3(2m2)
    6m12 it consists of three parts each
    corresponding to a literal in the constraint,
    with ci being the code for making that literal
    true.
  • Thus a seed hits Ri iff the assignment encoded by
    the seed satisfies Ci .

59
The complexity of computing spaced seed
sensitivity
  • Theorem. It is also NP hard to compute
    sensitivity of a given seed. It remains to be NPC
    even in a uniform region, at any homology level
    p.
  • Proof.
  • Very complicated. Omitted.

60
PTAS for computing sensitivity of a seed
  • Trivial alg. Sample R poly times, compute the
    frequency s hits. Use that to approximate the
    real probability.
  • When p is large, then using Chernoff bound, we
    have a PTAS.
  • But when p is very small, or seed weight is very
    large, hitting probability is actually very
    small, then we do not have a PTAS.

61
Efficient approximation of the hit probability of
a seed
  • Theorem. Let the hit probability of s be x. For
    any egt0, let N 6R2log R / e2. Then with high
    probability, in polynomial time and N samples, we
    can output a value y such that y-x ex.

62
  • Proof. We give a transformation to transform the
    low probability event s hits R'' dependent on p
    and weight of s to several events with
    probability sum 1, independent of p and seed
    weight W. Let sr be the reverse string of s.
  • Ps hits R Psr hits R
  • ?i0R-s Psr hits at i, but misses at
    0, , i-1
  • ?i0R-s Psr hits at i x Psr misses
    0, , i-1 sr hits at i
  • ?i0R-s pW x Ps misses 1 i s hits
    0 (1)
  • Because Ps hits R pW, (1) induces that
  • ?i0R-s Ps misses 1 i s hits 0 1
  • Now, we can do efficient sampling, and using
    Chernoff bounds, the rest of the proof is
    standard. QED

63
Expected hit distance
  • Corollary. E(second hit position s hits at 0)
    p-W.
  • Proof.
  • ?i0R-s Ps misses 1 i s hits 0
  • ?i0R-s ?ji1R-s1 Ps second hits j
    s hits 0
  • ?j0R-s1 ?i0j-1 Ps second hits j s
    hits 0
  • ?i0R-s1 (j-1)Ps second hits j s hits
    0
  • Combining with (1), we have
  • Ps hits RxpW?i0R-s1 (j-1)Ps second hits
    js hits 0
  • Letting R ? 8, the left hand is p-W, the right
    is
  • E(second hit position s hits at 0).
    QED

64
Chernoff bound
  • Consider a sequence of n independent tosses of a
    coin with heads prob. p. The expected number of
    heads is mnp. What is the probability that the
    number of heads deviates a lot from this? Let X
    be the number of successes, then
  • PrX-mem 2e-eem/3

65
Complexity of finding the optimal spaced seeds
  • Theorem 1 Li-Ma. Given a seed and it is NP-hard
    to find its sensitivity, even in a uniform
    region.
  • Theorem 2 Li-Ma. The sensitivity of a given
    seed can be efficiently approximated by a PTAS.
  • Theorem 3. Given a set of seeds, choose k best
    can be approximated with ratio 1- 1/e.
  • Theorem 4 Buhler-Keich-Sun, Li-Ma The
    asymptotic hit probability is computable in
    exponential time in seed length, independent of
    homologous region length.
  • Theorem 5 L. Zhang If the length of a spaced
    seed is not too long, then it strictly
    outperforms consecutive seed, in asymptotic hit
    probability.

66
  • 5. Applications beyond bioinformatics

67
Stock Market PredictionZou-Deng-Li Detecting
Market Trends by Ignoring It, Some Days, March,
2005
  • A real gold mine 4.6 billion dollars are traded
    at NYSE daily.
  • Buy low, sell high.
  • Best thing God tells us market trend
  • Next best thing A good market indicator.
  • Essentially, a buy indicator must be
  • Sensitive when the market rises
  • Insensitive otherwise.

68
My goal
  • Provide a sensitive, but not aggressive market
    trend indicator.
  • Learning/inference methods or complete and
    comprehensive systems are beyond this study. They
    can be used in conjunction with our proposal.

69
Background
  • Hundreds of market indicators are used (esp. in
    automated systems). Typical
  • Common sense if the past k days are going up,
    then the market is moving up.
  • Moving average over the last k days. When the
    average curve and the price curve intersect,
    buy/sell.
  • Special patterns a wedge, triangle, etc.
  • Volume
  • Hundreds used in automated trading systems.
  • Note any method will make money in the right
    market, lose money in the wrong market

70
Problem Formalization
  • The market movement is modeled as a 0-1 sequence,
    one bit per day, with 0 meaning market going
    down, and 1 up.
  • S(n,p) is an n day iid sequence where each bit
    has probability p being 1 and 1-p being 0. If
    pgt0.5, it is an up market
  • Ik1k is an indicator that the past k days are
    1s.
  • I8 has sensitivity 0.397 in S(30,0.7), too
    conservative
  • I8 has false positive rate 0.0043 in S(100, 0.3).
    Good
  • Iij is an indicator that there are i 1s in last
    j days.
  • I811 has high sensitivity 0.96 in S(30,0.7)
  • But it is too aggressive at 0.139 false positive
    rate in S(100, 0.3).
  • Spaced seeds 111111111 and 111111111 combine
    to
  • have sensitivity 0.49 in S(30,0.7)
  • False positive rate 0.0032 in S(100, 0.3).
  • Consider a betting game A player bets a number
    k. He wins k dollars for a correct prediction and
    o.w. loses k dollars. We say an indicator A is
    better than B, AgtB, if A always wins more and
    loses less.

71
Sleeping on Tuesdays and Fridays
  • Spaced seeds are beautiful indicators they are
    sensitive when we need them to be sensitive and
    not sensitive when we do not want them to be.

11111111 always beats I811 if it bets 4
dollars for each dollar I811 bets. It is gtI8
too.
72
Two spaced seeds
Observe two spaced Seeds curve vs I8, the spaced
seeds are always more sensitive in pgt0.5 region,
and less sensitive when plt0.5
73
Two experiments
  • We performed two trading experiments
  • One artificial
  • One on real data (SP 500, Nasdaq indices)

74
Experiment 1 Artificial data
  • This simple HMM generates a very artificial
    simple model
  • 5000 days (bits)
  • Indicators I7, I711, 5 spaced seeds.
  • Trading strategy if there is a hit, buy, and
    sell 5 days later.
  • Reward is (1)-(0) in that 5 days times the
    betting ratio

75
Results of Experiment 1.
  • R Hits Final MTM
    Bankrupcies
  • I71111111 30 12 679
    16
  • I711 15 47 916
    14
  • 5 Spaced seeds 25 26 984
    13

76
Experiment 2.
  • Historical data of SP 500, from Oct 20, 1982 to
    Feb. 14, 2005 and NASDAQ, from Jan 2, 85 to Jan
    3, 2005 were downloaded from Yahoo.com.
  • Each strategy starts with 10,000 USD. If an
    indicator matches, use all the money to buy/sell.
  • While in no ways this experiment says anything
    affirmatively, it does suggest that spaced
    patterns can serve a promising basis for a useful
    indicator, together with other parameters such as
    trade volume, natural events, politics,
    psychology.

77
(No Transcript)
78
Open Questions
  • I have presented a simple idea of optimized
    spaced seeds and its applications in homology
    search and time series prediction.
  • Open questions current research
  • Complexity of finding (near) optimal seed, in a
    uniform region. Note that this is not an NP-hard
    question.
  • Tighter bounds on why spaced seeds are better.
  • Applications to other areas. Apparently, the same
    idea works for any time series.
  • Model financial data
  • Can we use varying patterns instead of one seed?
    When patterns vary at the same weight, we still
    have same number of expected hits. However, can
    this increase sensitivity? One seed is just a
    special case of this.

79
Acknowledgement
  • PH is joint work with B. Ma and J. Tromp
  • PH II is joint work with Ma, Kisman, and Tromp
  • Some joint theoretical work with Ma, Keich,
    Tromp, Xu, Brown, Zhang.
  • Financial market prediction J. Zou, X. Deng
  • Financial support Bioinformatics Solutions Inc,
    NSERC, Killam Fellowship, CRC chair program, City
    University of Hong Kong.

80
Here is an example of a BLASTP match (E-value 0)
between gene 0189 in C. pneumoniae and gene 131
in C. trachomatis. Query CPn0189
Score
(bits) E-value Aligned with CT131 hypothetical
protein 1240
0.0 Query 1 MKRRSWLKILGICLGSSIVLGFLIFLPQLLSTE
SRKYLVFSL I HKESGLSCSAEELKISW 60
MKR W KI G L L L LP SES KYL
SKEGL EL SW Sbjct 1
MKRSPWYKIFGYYLLVGVPLALLALLPKFFSSESGKYLFLSVLNKETGLQ
F EIEQLHLSW 60 Query 61 FGRQTARKIKLTG-EAKDEVFS
AEKFELDGSLLRLL I YKKPKGITLSGWSLKINEPASID 119
FG QTAKI G EFAEK GSL
RLLY PK TLGWSLIE S Sbjct 61
FGSQTAKKIRIRGIDSDSEIFAAEKI IVKGSLPRLLL
YRFPKALTLTGWSLQIDESLSMN 120 Etc. Note Because of
powerpoint character alignment problems, I
inserted some white space to make it look more
aligned.
81
Summary and Open Questions
  • Simple ideas 0 created SW, 1 created Blast. 1
    0 created PatternHunter.
  • Good spaced seeds help to improve sensitivity and
    reduce irrelevant hit.
  • Multiple spaced seeds allow us to improve
    sensitivity to approach Smith-Waterman, not
    possible with BLAST.
  • Computing spaced seeds by DP, while it is NP
    hard.
  • Proper data models help to improve design of
    spaced seeds (HMM)
  • Open Problems (a) A mathematical theory of
    spaced seeds. I.e. quantitative statements for
    Theorem 1.
  • (b) Applications to other fields.

82
Smaller (more solvable) questions (perhaps good
for course projects)
  • Multiple protein seeds for Smith-Waterman
    sensitivity?
  • Applications in other fields, like internet
    document search?
  • Prove Theorem 1 for other models for example,
    M(3) model 0.8,0.8,0.5 together with specialized
    seeds.
  • More extensive study of Problem 3.
Write a Comment
User Comments (0)
About PowerShow.com