L3: Blast: Keyword match basics - PowerPoint PPT Presentation

Slides: 36

Provided by: vineet50

Learn more at: https://cseweb.ucsd.edu

1
L3: Blast: Keyword match basics
2
Silly Quiz

TRUE or FALSE
In New York City at any moment, there are 2
people (not bald) with exactly the same number
of hairs!

3
Assignment 1 is online

Due 10/6 (Thursday) in class.

4
An O(nm) algorithm for score computation
For i = 1 to n
  For j = 1 to m
    compute S[i,j] from S[i-1,j-1], S[i-1,j] and S[i,j-1]

The iteration order ensures that all values on the
right-hand side are computed in earlier steps.
How much space do you need?
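
The full-table computation can be sketched as follows. This is a minimal illustration, assuming a global alignment with a simple scoring scheme (match +1, mismatch -1, gap -1); the function name and the scoring values are assumptions, not taken from the slides.

```python
def alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score using a full (n+1) x (m+1) table."""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary cells: aligning a prefix against the empty string costs gaps only.
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    # Row-by-row fill: every cell read here was written in an earlier step.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S[n][m]


print(alignment_score("ACGT", "AGT"))
```

The table holds (n+1)(m+1) cells, which is the O(nm) space the slide asks about.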

5
Alignment?

Is O(nm) space too much?
What if the query and database are each 1Mbp?

6
Alignment (Linear Space)

Score computation

For i = 1 to n, for j = 1 to m
7
Linear Space

In linear space, we can compute each row of the D.P.
table, keeping only the previous row.
But we need to compute the optimum path from the
origin (0,0) to (n,m)
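
A minimal sketch of the row-by-row score computation, assuming the same illustrative scoring scheme (match +1, mismatch -1, gap -1; these values and the function name are assumptions). Only two rows are alive at any time, so the space is O(m), but the path itself is not recoverable from this alone.

```python
def last_row_scores(s, t, match=1, mismatch=-1, gap=-1):
    """Return the final D.P. row: scores of aligning all of s with each prefix of t."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr  # the older row is discarded here
    return prev


print(last_row_scores("ACGT", "ACGT")[-1])
```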

8
Linear Space (contd)

At $i = n/2$, we know the scores of all the optimal
paths ending at that row.
Define $F_j = S_{n/2,j}$
One of these j is on the true path. Which one?

9
Backward alignment

Let $S^b_{i,j}$ be the optimal score of aligning
$s_{i..n}$ with $t_{j..m}$
Define $B_j = S^b_{n/2,j}$
One of these j is on the true path. Which one?

10
Forward, Backward computation

At the optimal coordinate $j$,
$F_j + B_j = S_{n,m}$
In O(nm) time and O(m) space, we can compute one
of the coordinates on the optimum path.

11
Linear Space Alignment

Align(1..n, 1..m)
  For all 1 <= j <= m
    Compute F_j = S(n/2, j)
  For all 1 <= j <= m
    Compute B_j = S^b(n/2, j)
  j = argmax_j (F_j + B_j)
  X = Align(1..n/2, 1..j)
  Y = Align(n/2..n, j..m)
  Return X, j, Y
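
The Align procedure can be sketched in runnable form as below. This is one possible rendering (commonly known as Hirschberg's algorithm), assuming global alignment with linear gap costs and an illustrative scoring scheme; the function names, the base cases, and the choice to let j range over 0..m are assumptions made to keep the sketch self-contained.

```python
def last_row(s, t, match=1, mismatch=-1, gap=-1):
    """Scores of aligning all of s with each prefix of t, in O(len(t)) space."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev


def align(s, t, match=1, mismatch=-1, gap=-1):
    """Return (aligned_s, aligned_t) with '-' marking gaps."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        # Base case: pair the single character with its best partner in t,
        # unless gapping both sides is cheaper.
        k = max(range(len(t)), key=lambda x: match if t[x] == s else mismatch)
        sub = match if t[k] == s else mismatch
        if sub >= 2 * gap:
            return "-" * k + s + "-" * (len(t) - k - 1), t
        return s + "-" * len(t), "-" + t
    mid, m = len(s) // 2, len(t)
    F = last_row(s[:mid], t, match, mismatch, gap)                 # forward scores
    B = last_row(s[mid:][::-1], t[::-1], match, mismatch, gap)     # backward scores
    # B[m - j] is the score of aligning s[mid:] with t[j:].
    j = max(range(m + 1), key=lambda x: F[x] + B[m - x])
    x1, y1 = align(s[:mid], t[:j], match, mismatch, gap)
    x2, y2 = align(s[mid:], t[j:], match, mismatch, gap)
    return x1 + x2, y1 + y2


a, b = align("AGTACGCA", "TATGC")
print(a)
print(b)
```

The string slices make copies, so this version favors readability over the tightest constant factors; the D.P. rows themselves never exceed O(m).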

12
Linear Space complexity

$T(nm) = c \cdot nm + T(nm/2) = O(nm)$
Space: O(m)
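
The O(nm) bound follows from unrolling the recurrence. The two recursive calls cover sub-rectangles of sizes (n/2) x j and (n/2) x (m - j), whose areas sum to nm/2, so the work halves at each level. A short derivation, using the constant c from the slide:

```latex
\[
T(nm) = c\,nm + T\!\left(\tfrac{nm}{2}\right)
      \le c\,nm\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots\right)
      = 2c\,nm = O(nm)
\]
```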

13
Why is Blast Fast?
14
Large database search
Database size n = 10M, query size m = 300.
O(nm) = $3 \times 10^9$ computations
15
Observations

Much of the database is random from the query's
perspective
Consider a random DNA string of length n.
Pr[A] = Pr[C] = Pr[G] = Pr[T] = 0.25
Assume for the moment that the query is all 1s
(length m).
What is the probability that an exact match to
the query can be found?

16
Basic probability

Probability that there is a match starting at a
fixed position i = $0.25^m$
What is the probability that some position i has
a match?
Dependencies confound probability estimates.

17
Basic Probability: Expectation

Q: Toss a coin; each time it comes up heads, you
get a dollar.
What is the money you expect to get after n
tosses?
Let $X_i$ be the amount earned in the i-th toss

18
Expected number of matches

Expected number of matches can still be computed.

Let $X_i = 1$ if there is a match starting at
position i, $X_i = 0$ otherwise

Expected number of matches
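
Linearity of expectation is what makes this computable: it holds even when the $X_i$ are dependent. A sketch of the step, assuming a fair coin for the toss example and counting the n - m + 1 possible start positions for the match example:

```latex
\[
E\Big[\sum_{i=1}^{n} X_i\Big] = \sum_{i=1}^{n} E[X_i] = \frac{n}{2}
\qquad \text{(coin tosses, fair coin)}
\]
\[
E\Big[\sum_{i=1}^{n-m+1} X_i\Big] = \sum_{i=1}^{n-m+1} E[X_i]
 = (n-m+1)\cdot 0.25^{m} \approx n \cdot 0.25^{m}
\qquad (m \ll n)
\]
```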

19
Expected number of exact Matches is small!

Expected number of matches = $n \cdot 0.25^m$
If $n = 10^7$, $m = 10$,
then expected number of matches = 9.537
If $n = 10^7$, $m = 11$,
expected number of hits = 2.38
If $n = 10^7$, $m = 12$,
expected number of hits = 0.6 < 1
Bottom line: an exact match to a substring of the
query is unlikely just by chance.
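
These figures can be reproduced with a few lines. The sketch below assumes a uniform random database (each base with probability 0.25) and uses the approximation n * 0.25^m; the function name is hypothetical.

```python
def expected_exact_matches(n, m, p=0.25):
    """Expected number of positions in a random database of length n
    where a fixed word of length m matches exactly (approximation n * p^m)."""
    return n * p ** m


for m in (10, 11, 12):
    print(m, round(expected_exact_matches(10**7, m), 3))
# 10 9.537
# 11 2.384
# 12 0.596
```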

20
Observation 2

What is the pigeonhole principle?

21
Why is this important?

Suppose we are looking for sequences that are 80%
identical to the query sequence of length 100.
Assume that the mismatches are randomly
distributed.
What is the probability that there is no stretch
of 10 bp, where the query and the subject match
exactly?
Rough calculations show that it is very low.
The probability of an exact match of a short query
substring to a truly similar subject is very high.
The above equation does not take dependencies
into account
Reality is better because the matches are not
randomly distributed
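
One way to put a number on the "no exact stretch" probability is a small dynamic program over the current run length. This is a sketch under a simplifying assumption that is not in the slides: each position matches independently with probability 0.8, rather than exactly 20 mismatches being placed at random. The function name is hypothetical.

```python
def prob_no_exact_run(length, run, p_match):
    """Probability that `length` independent positions, each matching with
    probability p_match, contain no stretch of `run` consecutive matches."""
    # state[r]: probability that no full run has occurred so far and the
    # sequence currently ends in exactly r consecutive matches.
    state = [0.0] * run
    state[0] = 1.0
    for _ in range(length):
        nxt = [0.0] * run
        nxt[0] = (1 - p_match) * sum(state)   # a mismatch resets the run
        for r in range(run - 1):
            nxt[r + 1] = p_match * state[r]   # a match extends the run
        # A match from state[run - 1] would complete a run, so that mass is dropped.
        state = nxt
    return sum(state)


print(prob_no_exact_run(100, 10, 0.8))
```

Changing `run` shows how quickly the chance of missing a true homolog grows as the word gets longer. For comparison, the pigeonhole principle alone only guarantees a matching stretch of at least 4 here, since 20 mismatches split the 80 matching positions into at most 21 segments.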

22
Just the Facts

Consider the set of all substrings of the query
string of fixed length W.
Prob. of exact match to a random database string
is very low.
Prob. of exact match to a true homolog is very
high.
Keyword Search (exact matches) is MUCH faster
than sequence alignment

23
BLAST
Database (n)

Consider all (m-W+1) query words of size W (Default
11)
Scan the database for exact match to all such
words
For all regions that hit, extend using a dynamic
programming alignment.
Can be many orders of magnitude faster than SW
over the entire string
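
The word-scanning step can be sketched as below. This is an illustrative outline only, not BLAST's actual implementation: it indexes the query words in a dictionary and reports (query position, database position) pairs, leaving the extension step out. The function name and return format are assumptions.

```python
from collections import defaultdict


def find_seed_hits(query, database, w=11):
    """Return (query_pos, db_pos) pairs where a length-w query word
    matches the database exactly."""
    # Index every word of size w in the query: word -> list of query positions.
    index = defaultdict(list)
    for q in range(len(query) - w + 1):
        index[query[q:q + w]].append(q)
    # Scan the database once, looking each window up in the index.
    hits = []
    for d in range(len(database) - w + 1):
        for q in index.get(database[d:d + w], ()):
            hits.append((q, d))
    return hits


print(find_seed_hits("ACGTAC", "TTACGTTT", w=4))  # [(0, 2)]
```

Each hit would then be handed to an alignment routine restricted to the surrounding region, which is where the savings over aligning against the entire database come from.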

24
Why is BLAST fast?

Assume that keyword searching does not consume
any time and that alignment computation is the
expensive step.
Query m = 1000, random Db n = $10^7$, no TP
SW: O(nm) = $1000 \times 10^7 = 10^{10}$ computations
BLAST, W = 11
E(11-mer hits) = $1000 \times (1/4)^{11} \times 10^7 \approx 2384$
Number of computations = $2384 \times 100 \times 100 = 2.384 \times 10^7$
Ratio = $10^{10} / (2.384 \times 10^7) \approx 420$
Further speed improvements are possible

25
Keyword Matching

How fast can we match keywords?
Hash table/Db index? What is the size of the hash
table, for m = 11?
Suffix trees? What is the size of the suffix
trees?
Trie based search. We will do this in class.

[Figure: example word AATCA with index entry 567]
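
A trie-based keyword search can be sketched as follows. This is a minimal version that restarts from the trie root at every text position; it is not the Aho-Corasick automaton, which avoids the restarts with failure links. The nested-dictionary representation, the "$" end marker, and the example keywords are all assumptions for illustration.

```python
def build_trie(words):
    """Build a trie as nested dicts; the key "$" marks the end of a keyword."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root


def trie_search(text, root):
    """Return (start, keyword) for every keyword occurrence in text."""
    matches = []
    for start in range(len(text)):
        node = root
        for pos in range(start, len(text)):
            node = node.get(text[pos])
            if node is None:
                break  # no keyword continues with this character
            if "$" in node:
                matches.append((start, node["$"]))
    return matches


trie = build_trie(["POTATO", "TATO", "POTAS"])
print(trie_search("POTASTPOTATO", trie))
# [(0, 'POTAS'), (6, 'POTATO'), (8, 'TATO')]
```

For the hash-table alternative, a direct-addressed table over DNA words of length 11 needs $4^{11} = 4{,}194{,}304$ entries.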
26
Related notes

How to choose the alignment region?
Extend greedily until the score falls below a
certain threshold
What about protein sequences?
Default word size 3, and mismatches are
allowed.
Like sequences, BLAST has been evolving
continuously
Banded alignment
Seed selection
Scanning for exact matches, keyword search versus
database indexing

27
P-value computation

How significant is a score? What happens to
significance when you change the score function?
A simple empirical method:
Compute a distribution of scores against a random
database.
Use an estimate of the area under the curve to
get the probability.
OR, fit the distribution to one of the standard
distributions.
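
The empirical method can be sketched as below. Everything specific here is an assumption for illustration: the ungapped local score function, the uniform random database, the trial count, and the add-one correction that keeps the estimate away from zero.

```python
import random


def best_ungapped_score(a, b, match=1, mismatch=-1):
    """Best-scoring ungapped local alignment, scanning every diagonal."""
    best = 0
    for offset in range(-(len(a) - 1), len(b)):
        run = 0
        for i in range(len(a)):
            j = i + offset
            if 0 <= j < len(b):
                run = max(0, run + (match if a[i] == b[j] else mismatch))
                best = max(best, run)
    return best


def empirical_p_value(query, observed, db_length=200, trials=200, seed=0):
    """Fraction of random databases whose best score reaches `observed`."""
    rng = random.Random(seed)
    null_scores = []
    for _ in range(trials):
        db = "".join(rng.choice("ACGT") for _ in range(db_length))
        null_scores.append(best_ungapped_score(query, db))
    exceed = sum(score >= observed for score in null_scores)
    return (exceed + 1) / (trials + 1)  # add-one correction (an assumption)


print(empirical_p_value("ACGTACGTAC", observed=10))
```

The resolution of the estimate is limited by the number of trials, which is one reason for fitting a standard distribution instead.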

28
Z-scores for alignment

Initial assumption was that the scores followed a
normal distribution.
Z-score computation
For any alignment, score S, shuffle one of the
sequences many times, and recompute alignment.
Get mean and standard deviation
Look up a table to get a P-value
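
The shuffling procedure can be sketched as follows. The local alignment scoring parameters, the number of shuffles, and the function names are assumptions; the sketch stops at the Z-score and leaves the table lookup out.

```python
import random
import statistics


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman), keeping two rows."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def z_score(a, b, shuffles=100, seed=0):
    """Z-score of the observed score against scores for shuffled copies of b."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)  # preserves composition, destroys order
        null_scores.append(local_score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mean) / sd


print(z_score("ACGTTGCAACGTAGCTAGCT", "ACGTTGCAACGTAGCTAGCT"))
```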

29
Blast E-value

Initial (and natural) assumption was that scores
followed a Normal distribution
In 1990, Karlin and Altschul showed that ungapped
local alignment scores follow an exponential
distribution
Practical consequence
Longer tail.
Previously significant hits now not so
significant

30
Exponential distribution

Random Database, Pr(1) = p
What is the expected number of hits to a sequence
of k 1s?
Instead, consider a random binary matrix.
Expected # of diagonals of k 1s

As you increase k, the number decreases
exponentially.
The number of diagonal runs of k 1s can be
approximated by a Poisson process
In ungapped alignments, we replace the coin
tosses by column scores, but the behaviour does
not change (Karlin & Altschul).
As the score increases, the number of alignments
that achieve the score decreases exponentially
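
The exponential decrease can be written out directly. A sketch, assuming an n x m random binary matrix with independent entries and ignoring edge effects (n and m as matrix dimensions are assumptions here):

```latex
\[
E[\#\,\text{diagonal runs of } k \text{ 1s}] \approx n\,m\,p^{k}
  = n\,m\,e^{-k \ln(1/p)}
\]
```

Each additional unit of run length multiplies the expected count by p, which is the exponential decay in k.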

32
Blast E-value

Choose a score such that the expected score
between a pair of residues < 0
Expected number of alignments with a particular
score
For small values, E-value and P-value are the
same
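
For reference, the standard Karlin-Altschul form of these quantities is given below. The symbols K and $\lambda$ (constants that depend on the scoring system and residue frequencies), and m and n as query and database lengths, are taken from that standard form and are assumptions with respect to this transcript.

```latex
\[
E = K\,m\,n\,e^{-\lambda S},
\qquad
P = 1 - e^{-E} \approx E \quad \text{for small } E
\]
```

The approximation follows from $e^{-E} \approx 1 - E$ when E is close to zero.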

33
Blast Variants

What is mega-blast? Longer seeds.
What is discontiguous mega-blast? Seeds with
don't-care values.
Phi-Blast/Psi-Blast? Later.
BLAT? Database pre-processing.
PatternHunter? Seeds with don't-care values.

34
Keyword Matching
[Figure: keyword trie matched against the text POTASTPOTATO]