L3: Blast: Keyword match basics

Slides: 36
Provided by: vineet50
Learn more at: https://cseweb.ucsd.edu

1
L3: Blast: Keyword match basics
2
Silly Quiz
  • TRUE or FALSE
  • In New York City at any moment, there are 2
    people (not bald) with exactly the same number
    of hairs!

3
Assignment 1 is online
  • Due 10/6 (Thursday) in class.

4
An O(nm) algorithm for score computation
For i = 1 to n, for j = 1 to m
  • The iteration ensures that all values on the
    right are computed in earlier steps.
  • How much space do you need?
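
The recurrence itself is not preserved in this transcript. As a minimal sketch, assuming the standard global-alignment recurrence and a hypothetical scoring scheme (match +1, mismatch -1, gap -1; the function name `alignment_score` is also an assumption), the O(nm) table fill could look like this:

```python
def alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Fill the full (n+1) x (m+1) table; S[i][j] scores s[:i] against t[:j]."""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap          # s[:i] aligned against nothing
    for j in range(1, m + 1):
        S[0][j] = j * gap          # t[:j] aligned against nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # Every value on the right was computed in an earlier step.
            S[i][j] = max(S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    return S[n][m]
```

Storing the whole table takes O(nm) space, which is what the next slides set out to reduce.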

5
Alignment?
  • Is O(nm) space too much?
  • What if the query and database are each 1Mbp?

6
Alignment (Linear Space)
  • Score computation

For i = 1 to n, for j = 1 to m
7
Linear Space
  • In Linear Space, we can do each row of the D.P.
  • We need to compute the optimum path from the
    origin (0,0) to (n,m)
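
As a sketch of the row-by-row idea, assuming the same hypothetical scoring scheme as a standard global alignment (match +1, mismatch -1, gap -1), only the previous row has to be kept to obtain the score. The name `score_linear_space` is illustrative:

```python
def score_linear_space(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score using O(len(t)) space: keep one previous row."""
    prev = [j * gap for j in range(len(t) + 1)]      # row 0 of the table
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)              # column 0 of row i
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr                                  # discard the older row
    return prev[-1]
```

This gives the optimal score but not the path, since the rows needed for a traceback have been thrown away.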

8
Linear Space (contd)
  • At i = n/2, we know the scores of all the optimal paths
    ending at that row.
  • Define F_j = S_{n/2,j}
  • One of these j is on the true path. Which one?

9
Backward alignment
  • Let S^b_{i,j} be the optimal score of aligning
    s_i..s_n with t_j..t_m
  • Define B_j = S^b_{n/2,j}
  • One of these j is on the true path. Which one?

10
Forward, Backward computation
  • At the optimal coordinate j*,
  • F_{j*} + B_{j*} = S_{n,m}
  • In O(nm) time and O(m) space, we can compute one
    of the coordinates on the optimum path.

11
Linear Space Alignment
  • Align(1..n, 1..m)
  • For all 1 <= j <= m
  • Compute F_j = S(n/2, j)
  • For all 1 <= j <= m
  • Compute B_j = S^b(n/2, j)
  • j* = argmax_j (F_j + B_j)
  • X = Align(1..n/2, 1..j*)
  • Y = Align(n/2..n, j*..m)
  • Return X, j*, Y
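
The divide-and-conquer scheme above can be sketched as follows. This is an illustration under assumptions, not the lecture's own code: the scoring constants (match +1, mismatch -1, gap -1) and the names `last_row` and `align` are hypothetical, and the base case relies on a mismatch costing no more than two gaps.

```python
MATCH, MISMATCH, GAP = 1, -1, -1   # assumed scoring scheme


def last_row(s, t):
    """Scores of aligning all of s against every prefix of t, in O(len(t)) space."""
    prev = [j * GAP for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * GAP] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = MATCH if s[i - 1] == t[j - 1] else MISMATCH
            curr[j] = max(prev[j - 1] + sub, prev[j] + GAP, curr[j - 1] + GAP)
        prev = curr
    return prev


def align(s, t):
    """Return (aligned_s, aligned_t), using '-' for gaps."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        # Pair the single character with a matching position if one exists,
        # otherwise with position 0 (optimal while MISMATCH >= 2 * GAP).
        k = max(t.find(s), 0)
        return "-" * k + s + "-" * (len(t) - k - 1), t
    mid, m = len(s) // 2, len(t)
    F = last_row(s[:mid], t)                  # F[j]: s[:mid] vs t[:j]
    B = last_row(s[mid:][::-1], t[::-1])      # B[k]: s[mid:] vs the last k letters of t
    j_star = max(range(m + 1), key=lambda j: F[j] + B[m - j])
    left = align(s[:mid], t[:j_star])         # X in the slide's notation
    right = align(s[mid:], t[j_star:])        # Y in the slide's notation
    return left[0] + right[0], left[1] + right[1]
```

For example, `align("ACGT", "AGT")` returns two gapped strings of equal length whose column-by-column score equals `last_row("ACGT", "AGT")[-1]`.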

12
Linear Space complexity
  • T(nm) = c·nm + T(nm/2) = O(nm)
  • Space O(m)

13
Why is Blast Fast?
14
Large database search
Database size n = 10M, query size m = 300. O(nm) = 3 × 10^9 computations
15
Observations
  • Much of the database is random from the query's
    perspective
  • Consider a random DNA string of length n.
  • Pr[A] = Pr[C] = Pr[G] = Pr[T] = 0.25
  • Assume for the moment that the query is all 1s
    (length m).
  • What is the probability that an exact match to
    the query can be found?

16
Basic probability
  • Probability that there is a match starting at a
    fixed position i = 0.25^m
  • What is the probability that some position i has
    a match?
  • Dependencies confound probability estimates.

17
Basic Probability: Expectation
  • Q: Toss a coin; each time it comes up heads, you
    get a dollar.
  • What is the money you expect to get after n
    tosses?
  • Let X_i be the amount earned in the i-th toss
  • For a fair coin, E[X_1 + ... + X_n] = E[X_1] + ... + E[X_n] = n/2

18
Expected number of matches
  • Expected number of matches can still be computed.

  • Let X_i = 1 if there is a match starting at
    position i, X_i = 0 otherwise
  • Expected number of matches = E[Σ_i X_i] = Σ_i E[X_i] = n × 0.25^m

19
Expected number of exact Matches is small!
  • Expected number of matches = n × 0.25^m
  • If n = 10^7, m = 10,
  • Then, expected number of matches = 9.537
  • If n = 10^7, m = 11,
  • expected number of hits = 2.38
  • n = 10^7, m = 12,
  • Expected number of hits ≈ 0.6 < 1
  • Bottom Line An exact match to a substring of the
    query is unlikely just by chance.
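
The figures on this slide follow directly from n × 0.25^m. A short check, with `expected_matches` as an illustrative name and a uniform four-letter alphabet assumed:

```python
def expected_matches(n, m, alphabet_size=4):
    """Expected number of exact matches of a length-m query in a random string of length n."""
    # Linearity of expectation: n start positions, each matching with prob (1/4)^m.
    return n * (1 / alphabet_size) ** m


for m in (10, 11, 12):
    print(m, round(expected_matches(10**7, m), 3))
```

The count uses n start positions rather than n - m + 1, which makes no practical difference when n is much larger than m.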

20
Observation 2
  • What is the pigeonhole principle?

21
Why is this important?
  • Suppose we are looking for sequences that are 80%
    identical to the query sequence of length 100.
  • Assume that the mismatches are randomly
    distributed.
  • What is the probability that there is no stretch
    of 10 bp, where the query and the subject match
    exactly?
  • Rough calculations show that it is very low.
    The probability of an exact match of a short query
    substring to a truly similar subject is very high.
  • The above equation does not take dependencies
    into account
  • Reality is better because the matches are not
    randomly distributed
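
The equation this slide refers to is not preserved in the transcript. As an illustration only, the sketch below compares two quantities under a simplified model in which each of the 100 positions matches independently with probability 0.8: a naive estimate that treats the 91 overlapping 10 bp windows as independent (an assumption about what a rough calculation might look like), and an exact dynamic program over the current run length. The name `prob_no_exact_run` is hypothetical.

```python
def prob_no_exact_run(length, run, p_match):
    """P(no stretch of `run` consecutive matches among `length` i.i.d. positions)."""
    # state[r] = probability that no full run has occurred yet and the
    # current stretch of consecutive matches has length r (0 <= r < run).
    state = [0.0] * run
    state[0] = 1.0
    for _ in range(length):
        nxt = [0.0] * run
        for r, pr in enumerate(state):
            nxt[0] += pr * (1.0 - p_match)      # a mismatch resets the stretch
            if r + 1 < run:
                nxt[r + 1] += pr * p_match      # a match extends the stretch
            # If r + 1 == run the stretch is complete; that mass is dropped.
        state = nxt
    return sum(state)


exact = prob_no_exact_run(100, 10, 0.8)
naive = (1 - 0.8 ** 10) ** 91                   # windows treated as independent
print(f"naive: {naive:.2e}, exact under the i.i.d. model: {exact:.2e}")
```

Under this model the naive product comes out much smaller than the exact value, because overlapping windows are strongly correlated.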

22
Just the Facts
  • Consider the set of all substrings of the query
    string of fixed length W.
  • Prob. of exact match to a random database string
    is very low.
  • Prob. of exact match to a true homolog is very
    high.
  • Keyword Search (exact matches) is MUCH faster
    than sequence alignment

23
BLAST
Database (n)
  • Consider all (m - W + 1) query words of size W (Default
    11)
  • Scan the database for exact match to all such
    words
  • For all regions that hit, extend using a dynamic
    programming alignment.
  • Can be many orders of magnitude faster than
    Smith-Waterman (SW) over the entire string

24
Why is BLAST fast?
  • Assume that keyword searching does not consume
    any time and that alignment computation is the
    expensive step.
  • Query m = 1000, random Db n = 10^7, no TP (true positives)
  • SW: O(nm) = 1000 × 10^7 = 10^10 computations
  • BLAST, W = 11
  • E(11-mer hits) = 1000 × (1/4)^11 × 10^7 ≈ 2384
  • Number of computations = 2384 × 100 × 100 = 2.384 × 10^7
  • Ratio = 10^10 / (2.384 × 10^7) ≈ 420
  • Further speed improvements are possible
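
The arithmetic on this slide can be reproduced in a few lines. The 100 × 100 alignment window per hit is the slide's own figure; the function name `blast_vs_sw` is illustrative:

```python
def blast_vs_sw(n=10**7, m=1000, W=11, window=100):
    """Rough operation counts for full Smith-Waterman vs. seed-and-extend."""
    sw_ops = n * m                               # full O(nm) dynamic program
    expected_hits = m * n * 0.25 ** W            # chance W-mer hits in a random database
    blast_ops = expected_hits * window * window  # one small DP per hit
    return sw_ops, expected_hits, blast_ops, sw_ops / blast_ops


sw_ops, hits, blast_ops, ratio = blast_vs_sw()
print(f"expected hits: {hits:.0f}, speed-up: {ratio:.0f}x")
```

The unrounded ratio is about 419, which the slide rounds to 420.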

25
Keyword Matching
  • How fast can we match keywords?
  • Hash table/Db index? What is the size of the hash
    table, for m = 11?
  • Suffix trees? What is the size of the suffix
    trees?
  • Trie based search. We will do this in class.
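
As a minimal sketch of the hash-table option, assuming the query words are indexed and the database is scanned once (the names `build_word_index` and `scan_database` are hypothetical), a Python dict can stand in for the table. A table indexed directly by every possible DNA word of length 11 would need 4^11 = 4,194,304 slots; the dict below stores only the words that occur in the query.

```python
from collections import defaultdict


def build_word_index(query, W):
    """Map each length-W word of the query to the list of its start positions."""
    index = defaultdict(list)
    for q in range(len(query) - W + 1):
        index[query[q:q + W]].append(q)
    return index


def scan_database(db, index, W):
    """Return (query_pos, db_pos) pairs where a query word matches the database exactly."""
    hits = []
    for d in range(len(db) - W + 1):
        # .get avoids creating empty entries in the defaultdict.
        for q in index.get(db[d:d + W], ()):
            hits.append((q, d))
    return hits


index = build_word_index("ACGTAC", 3)
print(scan_database("TTACGTT", index, 3))
```

Each database position costs one dictionary lookup, so the scan is linear in the database length for a fixed W.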

[Figure: keyword index example (AATCA, 567)]
26
Related notes
  • How to choose the alignment region?
  • Extend greedily until the score falls below a
    certain threshold
  • What about protein sequences?
  • Default word size 3, and mismatches are
    allowed.
  • Like sequences, BLAST has been evolving
    continuously
  • Banded alignment
  • Seed selection
  • Scanning for exact matches, keyword search versus
    database indexing
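
One way to read "extend greedily until the score falls below a certain threshold" is a drop-off rule: keep extending while the running score stays within some distance of the best score seen so far. The sketch below does this for an ungapped extension to the right of a seed. The scores (match +1, mismatch -3), the `x_drop` parameter and the function name are assumptions for illustration, not BLAST's actual defaults.

```python
def extend_right(query, db, q, d, match=1, mismatch=-3, x_drop=5):
    """Ungapped extension starting at query[q], db[d].

    Returns (best_score, best_length); stops once the running score has
    fallen more than x_drop below the best score seen so far.
    """
    score = best = 0
    best_len = 0
    k = 0
    while q + k < len(query) and d + k < len(db):
        score += match if query[q + k] == db[d + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break
    return best, best_len
```

A full seed-and-extend step would run the same loop to the left of the seed and combine the two results.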

27
P-value computation
  • How significant is a score? What happens to
    significance when you change the score function?
  • A simple empirical method
  • Compute a distribution of scores against a random
    database.
  • Use an estimate of the area under the curve to
    get the probability.
  • OR, fit the distribution to one of the standard
    distributions.
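
A minimal sketch of the empirical method, assuming the scores against the random database have already been collected in a list. The add-one correction is a common convention rather than something stated on the slide, and `empirical_p_value` is an illustrative name:

```python
def empirical_p_value(observed, null_scores):
    """Fraction of random-database scores at least as large as the observed score."""
    exceed = sum(1 for s in null_scores if s >= observed)
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (len(null_scores) + 1)
```

With only N random scores the smallest reportable value is 1/(N + 1), which is one reason to fit a standard distribution instead.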

28
Z-scores for alignment
  • Initial assumption was that the scores followed a
    normal distribution.
  • Z-score computation
  • For any alignment, score S, shuffle one of the
    sequences many times, and recompute alignment.
    Get mean and standard deviation
  • Look up a table to get a P-value
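
The shuffling procedure can be sketched as follows. The local alignment scorer, its scoring values (match +1, mismatch -1, gap -1), the number of shuffles and the fixed seed are all assumptions made so that the example is self-contained and repeatable:

```python
import random
import statistics


def local_score(s, t, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score, keeping one previous row."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def z_score(s, t, shuffles=200, seed=0):
    """(observed - mean) / sd, with mean and sd taken over shuffled copies of t."""
    rng = random.Random(seed)
    observed = local_score(s, t)
    letters = list(t)
    null = []
    for _ in range(shuffles):
        rng.shuffle(letters)                      # preserves composition, destroys order
        null.append(local_score(s, "".join(letters)))
    mean = statistics.mean(null)
    sd = statistics.pstdev(null)
    return (observed - mean) / sd if sd > 0 else float("inf")
```

Converting the Z-score to a P-value through a normal table is only as good as the normality assumption, which the next slide revisits.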

29
Blast E-value
  • Initial (and natural) assumption was that scores
    followed a Normal distribution
  • 1990, Karlin and Altschul showed that ungapped
    local alignment scores follow an exponential
    distribution
  • Practical consequence
  • Longer tail.
  • Previously significant hits now not so
    significant

30
Exponential distribution
  • Random Database, Pr(1) = p
  • What is the expected number of hits to a sequence
    of k 1s?
  • Instead, consider a random binary matrix.
    What is the expected number of diagonals of k 1s?

31
  • As you increase k, the number decreases
    exponentially.
  • The number of diagonal runs of k 1s can be
    approximated by a Poisson process
  • In ungapped alignments, we replace the coin
    tosses by column scores, but the behaviour does
    not change (Karlin & Altschul).
  • As the score increases, the number of alignments
    that achieve the score decreases exponentially
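
The exponential decrease is easy to see in the binary-matrix model. As a sketch, assuming an n × m matrix whose entries are 1 independently with probability p, the expected number of diagonal stretches of k consecutive 1s (counting every starting cell, not only maximal runs) is (n - k + 1)(m - k + 1)p^k. The function name is illustrative:

```python
def expected_diagonal_runs(n, m, k, p):
    """Expected number of length-k diagonal windows that are all 1s in a random n x m matrix."""
    if k > min(n, m):
        return 0.0
    # (n - k + 1)(m - k + 1) possible starting cells, each all-ones with prob p^k.
    return (n - k + 1) * (m - k + 1) * p ** k


for k in (5, 10, 15, 20):
    print(k, expected_diagonal_runs(1000, 1000, k, 0.25))
```

Each extra unit of k multiplies the expectation by roughly p, which is the exponential behaviour described above.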

32
Blast E-value
  • Choose a scoring function such that the expected score
    between a pair of residues is < 0
  • Expected number of alignments with a particular
    score
  • For small values, E-value and P-value are the
    same
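
The formula on this slide is not preserved in the transcript. The Karlin-Altschul statistics give the expected number of ungapped local alignments with score at least S as E = K·m·n·e^(-λS), and the corresponding P-value as 1 - e^(-E). The sketch below evaluates both. K and λ depend on the scoring system and the residue frequencies, so the values in the example call are placeholders, not real BLAST parameters.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring >= score (Karlin-Altschul form)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such chance alignment, assuming a Poisson count."""
    return -math.expm1(-e)          # 1 - exp(-e), computed accurately for small e


# Placeholder parameters, chosen only to illustrate the call.
e = e_value(score=40, m=300, n=10**7, K=0.1, lam=1.0)
print(e, p_value(e))
```

Since 1 - e^(-E) ≈ E when E is small, the two values agree for significant hits, as the last bullet says.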

33
Blast Variants
  1. What is mega-blast? Longer seeds.
  2. What is discontiguous mega-blast? Seeds with don't-care values.
  3. Phi-Blast/Psi-Blast? Later.
  4. BLAT? Database pre-processing.
  5. PatternHunter? Seeds with don't-care values.

34
Keyword Matching
[Figure: trie-based keyword search over the text P O T A S T P O T A T O]
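
The keywords in the original figure cannot be recovered from the transcript. As a sketch of trie-based matching with hypothetical keywords, the code below builds a trie from nested dicts and walks it from every text position. This is the simple quadratic scan, not the Aho-Corasick automaton, and it assumes "$" never appears in the text because it is used as the end-of-word marker.

```python
def build_trie(keywords):
    """Nested-dict trie; the key '$' marks the end of a keyword."""
    root = {}
    for word in keywords:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root


def find_keywords(text, trie):
    """Return (start, keyword) for every keyword occurrence in text."""
    hits = []
    for start in range(len(text)):
        node = trie
        for i in range(start, len(text)):
            if text[i] not in node:
                break                       # no keyword continues with this letter
            node = node[text[i]]
            if "$" in node:
                hits.append((start, node["$"]))
    return hits


trie = build_trie(["POTATO", "TATTOO", "POT"])   # hypothetical keyword set
print(find_keywords("POTASTPOTATO", trie))
```

All keywords are matched in a single pass over the text, which is the property BLAST needs when scanning a database for every query word at once.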