Title: Lecture outline
1Lecture outline
- Database searches
- BLAST
- FASTA
- Statistical Significance of Sequence Comparison
Results - Probability of matching runs
- Karin-Altschul statistics
- Extreme value distribution
2DP Alignment Complexity
- O(mn) time
- O(mn) space
- O(max(m,n)) if only similarity score is needed
- More complicated divide-and-conquer algorithm
that doubles time complexity and uses O(min(m,n))
space Hirschberg, JACM 1977
3Time and space bottlenecks
- Comparing two one-megabase genomes.
- Space
- An entry 4 bytes
- Table 4 106 106 4 T bytes memory.
- Time
- 1000 MHz CPU 1M entries/second
- 1012 entries 1M seconds 10 days.
4BLAST
- Basic Local Alignment Search Tool
- Altschul et al. 1990,1994,1997
- Heuristic method for local alignment
- Designed specifically for database searches
- Idea good alignments contain short lengths of
exact matches
5Steps of BLAST
- Query words of length 4 (for proteins) or 11 (for
DNA) are created from query sequence using a
sliding window - Scan each database sequence for an exact match to
query words. Each match is a seed for an ungapped
alignment.
MEFPGLGSLGTSEPLPQFVDPALVSS
MEFP EFPG FPGL PGLG GLGS
6Steps of BLAST
- 3. (Original BLAST) extend matching words to
the left and right using ungapped alignments.
Extension continues as long as score does not
fall below a given threshold. This is an HSP
(high scoring pair). - (BLAST2) Extend the HSPs using gapped alignment.
7Steps of BLAST
- 4. Using a cutoff score S, keep only the
extended matches that have a score at
least S. - 5. Determine statistical significance of each
remaining match.
8Example BLAST run
- BLAST website
- http//www.ncbi.nlm.nih.gov/BLAST/
9 FASTA
- Derived from logic of the dot plot
- compute best diagonals from all frames of
alignment - Word method looks for exact matches between words
in query and test sequence - construct word position tables
- DNA words are usually 6 bases
- protein words are 1 or 2 amino acids
- only searches for diagonals in region of word
matches faster searching
10Steps of FASTA
- Find k-tups in the two sequences (k1-2 for
proteins, 4-6 for DNA sequences) - Create a table of positions for those k-tups
11The offset table
position 1 2 3 4 5 6 7 8 9 10 11 proteinA n c s p
t a . . . . . proteinB . . . . . a c s p r k
position in offset amino
acid protein A protein B pos A -
posB ---------------------------------------------
-------- a 6 6
0 c 2 7
-5 k - 11 n
1 - p 4 9
-5 r - 10 s
3 8 -5 t
5 - -------------------------
---------------------------- Note the common
offset for the 3 amino acids c,s and p A possible
alignment is thus quickly found - protein 1 n c s
p t a protein 2 a c s p r k
12FASTA
- 3. Select top 10 scoring local diagonals
with matches and mismatches but no gaps. - 4. Rescan top 10 diagonals (representing
alignments), score with PAM250 (proteins) or DNA
scoring matrix. Trim off the ends of the regions
to achieve highest scores.
13FASTA Algorithm
14FASTA
- 5. After finding the best initial region, FASTA
performs a DP global alignment centered on the
best initial region.
15FASTA Alignments
16History of sequence searching
- 1970 NW
- 1981 SW
- 1985 FASTA
- 1990 BLAST
- 1997 BLAST2
17The purpose of sequence alignment
- Homology
- Function identification
- about 70 of the genes of M. jannaschii were
assigned a function using sequence similarity
(1997)
18Similarity
- How much similar do the sequences have to be to
infer homology? - Two possibilities when similarity is detected
- The similarity is by chance
- They evolved from a common ancestor hence, have
similar functions
19Measures of similarity
- Percent identity
- 40 similar, 70 similar
- problems with percent identity?
- Scoring matrices
- matching of some amino acids may be more
significant than matching of other amino acids - PAM matrix in 1970, BLOSUM in 1992
- problems?
20Statistical Significance
- Goal to provide a universal measure for
inferring homology - How different is the result from a random match,
or a match between unrelated requences? - Given a set of sequences not related to the query
(or a set of random sequences), what is the
probability of finding a match with the same
alignment score by chance? - Different statistical measures
- p-value
- E-value
- z-score
21Statistical significance measures
- p-value the probability that at least one
sequence will produce the same score by chance - E-value expected number of sequences that will
produce same or better score by chance - z-score measures how much standard deviations
above the mean of the score distribution
22How to compute statistical significance?
- Significance of a match-run
- Erdös-Renyí
- Significance of local alignments without gaps
- Karlin-Altschul statistics
- Scoring matrices revisited
- Significance of local alignments with gaps
- Significance of global alignments
23Analysis of coin tosses
- Let black circles indicate heads
- Let p be the probability of a head
- For a fair coin, p 0.5
- Probability of 5 heads in a row is (1/2)50.031
- The expected number of times that 5H occurs in
above 14 coin tosses is 100.031 0.31
24Analysis of coin tosses
- The expected number of a length l run of heads in
n tosses. - What is the expected length R of the longest
match in n tosses?
25Analysis of coin tosses
- (Erdös-Rényi) If there are n throws, then the
expected length R of the longest run of heads is - R log1/p (n)
26Example
- Example Suppose n 20 for a fair coin
- Rlog2(20)4.32
- In other words in 20 coin tosses we expect a run
of heads of length 4.32, once. - Trick is how to model DNA (or amino acid)
sequence alignments as coin tosses.
27Analysis of an alignment
- Probability of an individual match p 0.05
- Expected number of matches 10x8x0.05 4
- Expected number of two successive matches
- 10x8x0.05x0.05 0.2
28Matching runsin sequence alignments
- Consider two sequences a1..m and b1..n
- If the probability of occurrence for every symbol
is p, then a match of a residue ai with bj is p,
and a match of length l from ai,bj to
ail-1,bjl-1 is pl. - The head-run problem of coin tosses corresponds
to the longest run of matches along the diagonals
29Matching runsin sequence alignments
- There are m-l1 x n-l1 places where the match
could start - The expected length of the longest match can be
approximated as - Rlog1/p(mn)
-
- where m and n are the lengths of the two
sequences.
30Matching runsin sequence alignments
- So suppose m n 10 and were looking at DNA
sequences - Rlog4(100)3.32
- This analysis makes assumptions about the base
composition (uniform) and no gaps, but its a
good estimate.
31Statistics for matching runs
- Statistics of matching runs
- Length versus score?
- Consider all mismatches receive a negative score
of -8 and aibj match receives a positive score of
si,j. - What is the expected number of matching runs with
a score x or higher? - Using this theory of matching runs, Karlin and
Altschul developed a theory for statistics of
local alignments without gaps (extended this
theory to allow for mismatches).
32Statistics of local alignments without gaps
- A scoring matrix which satisfy the following
constraint - The expected score of a single match obtained by
a scoring matrix should be negative. - Otherwise?
- Arbitrarily long random sequences will get higher
scores just because they are long, not because
theres a significant match. - If this requirement is met then the expected
number of alignments with score x or higher is
given by
33Statistics of local alignments without gaps
- K lt 1 is a proportionality constant that corrects
the mn space factor for the fact that there are
not really mn independent places that could have
produced score S x. - K has little effect on the statistical
significance of a similarity score - ? is closely related to the scoring matrix used
and it takes into account that the scoring
matrices do not contain actual probabilities of
co-occurence, but instead a scaled version of
those values. To understand how ? is computed, we
have to look at the construction of scoring
matrices.
34Scoring Matrices
- In 1970s there were few protein sequences
available. Dayhoff used a limited set of families
of protein sequences multiply aligned to infer
mutation likelihoods.
35Scoring Matrices
- Dayhoff represented the similarity of amino acids
as a log odds ratio - where qij is the observed frequency of
co-occurrence, and pi, pj are the individual
frequencies.
36Example
- If M occurs in the sequences with 0.01 frequency
and L occurs with 0.1 frequency. By random
pairing, you expect 0.001 amino acid pairs to be
M-L. If the observed frequency of M-L is actually
0.003, score of matching M-L will be - log2(3)1.585 bits or loge(3) ln(3) 1.1 nats
- Since, scoring matrices are usually provided as
integer matrices, these values are scaled by a
constant factor. ? is approximately the inverse
of the original scaling factor.
37How to compute ?
Sum of observed frequencies is 1.
Given the frequencies of individual amino acids
and the scores in the matrix, ? can be estimated.
38Extreme value distribution
- Consider an experiment that obtains the maximum
value of locally aligning a random string with
query string (without gaps). Repeat with another
random string and so on. Plot the distribution
of these maximum values. - The resulting distribution is an extreme value
distribution, called a Gumbel distribution.
39Normal vs. Extreme Value Distribution
Normal distribution y (1/v2p)e-x2/2
Normal
Extreme value distribution y e-x e-x
Extreme Value
40Local alignments with gaps
- The EVD distribution
- is not always observed.
- Theory of local alignments
- with gaps is not well studied
- as in without gaps.
- Mostly empirical results.
- For example, BLAST allows
- only a certain range of
- gap penalties.
41BLAST statistics
- Pre-computed ? and K values for different scoring
matrices and gap penalties are used for faster
computation. - Raw score is converted to bit score
- E-value is computed using
- m is query size, n is database size and L is the
typical length of maximal scoring alignment.
42FASTA Statistics
- FASTA tries to estimate the probability
distribution of alignments for every query. - For any query sequence, a large collection of
scores is gathered during the search of the
database. - They estimate the parameters of the EVD
distribution based on the histogram of scores. - Advantages
- reliable statistics for different parameters
- different databases, different gap penalties,
different scoring matrices, queries with
different amino acid compositions.
43Statistical significanceanother example
- Suppose, we have a huge graph with weighted edges
and we want to find strongly connected clusters
of nodes. - Suppose, an algorithm for this task is given.
- The algorithms gives you the best hundred
clusters in this graph. - How do you define best?
- Cluster size?
- Total weight of edges?
44Statistical significance
- How different is a found cluster of size N from a
random cluster of the same size? - This measure will enable comparison of clusters
of different sizes.
45Statistical significance of a cluster
- Use maximum spanning tree weight of a cluster as
a quantitative representation of that cluster. - And see what
- values random
- clusters get.
- (sample many
- random
- clusters)
46Statistical significance of a cluster
Looks like an exponential decay. We may fit an
exponential distribution on this histogram.
47Fitting an exponential
48Statistical significance of a cluster
After we fit an exponential distribution, we
compute the probability that another random
cluster gets a higher score than the score of
found cluster.
49Examples
- ?5 1.7 for clusters of size 5 and ?20 0.36
for clusters of size 20. - Suppose you have found a cluster of size 5 with
weights of its edges sum up to 15 and you have
found a cluster of size 20 with weight 45 which
one would you prefer?