Lecture outline

About This Presentation

Title:

Lecture outline

Description:

O(max(m,n)) if only similarity score is needed. More complicated 'divide-and-conquer' algorithm that ... log2(3)=1.585 bits or loge(3) = ln(3) = 1.1 nats ... – PowerPoint PPT presentation

Number of Views:22

Avg rating:3.0/5.0

Slides: 50

Provided by: ambuj4

Category:

more less

Transcript and Presenter's Notes

Title: Lecture outline

1
Lecture outline

Database searches
BLAST
FASTA
Statistical Significance of Sequence Comparison
Results
Probability of matching runs
Karin-Altschul statistics
Extreme value distribution

2
DP Alignment Complexity

O(mn) time
O(mn) space
O(max(m,n)) if only similarity score is needed
More complicated divide-and-conquer algorithm
that doubles time complexity and uses O(min(m,n))
space Hirschberg, JACM 1977

3
Time and space bottlenecks

Comparing two one-megabase genomes.
Space
An entry 4 bytes
Table 4 106 106 4 T bytes memory.
Time
1000 MHz CPU 1M entries/second
1012 entries 1M seconds 10 days.

4
BLAST

Basic Local Alignment Search Tool
Altschul et al. 1990,1994,1997
Heuristic method for local alignment
Designed specifically for database searches
Idea good alignments contain short lengths of
exact matches

5
Steps of BLAST

Query words of length 4 (for proteins) or 11 (for
DNA) are created from query sequence using a
sliding window
Scan each database sequence for an exact match to
query words. Each match is a seed for an ungapped
alignment.

MEFPGLGSLGTSEPLPQFVDPALVSS
MEFP EFPG FPGL PGLG GLGS
6
Steps of BLAST

3. (Original BLAST) extend matching words to
the left and right using ungapped alignments.
Extension continues as long as score does not
fall below a given threshold. This is an HSP
(high scoring pair).
(BLAST2) Extend the HSPs using gapped alignment.

7
Steps of BLAST

4. Using a cutoff score S, keep only the
extended matches that have a score at
least S.
5. Determine statistical significance of each
remaining match.

8
Example BLAST run

BLAST website
http//www.ncbi.nlm.nih.gov/BLAST/

9
FASTA

Derived from logic of the dot plot
compute best diagonals from all frames of
alignment
Word method looks for exact matches between words
in query and test sequence
construct word position tables
DNA words are usually 6 bases
protein words are 1 or 2 amino acids
only searches for diagonals in region of word
matches faster searching

10
Steps of FASTA

Find k-tups in the two sequences (k1-2 for
proteins, 4-6 for DNA sequences)
Create a table of positions for those k-tups

11
The offset table
position 1 2 3 4 5 6 7 8 9 10 11 proteinA n c s p
t a . . . . . proteinB . . . . . a c s p r k
position in offset amino
acid protein A protein B pos A -
posB ---------------------------------------------
-------- a 6 6
0 c 2 7
-5 k - 11 n
1 - p 4 9
-5 r - 10 s
3 8 -5 t
5 - -------------------------
---------------------------- Note the common
offset for the 3 amino acids c,s and p A possible
alignment is thus quickly found - protein 1 n c s
p t a protein 2 a c s p r k
12
FASTA

3. Select top 10 scoring local diagonals
with matches and mismatches but no gaps.
4. Rescan top 10 diagonals (representing
alignments), score with PAM250 (proteins) or DNA
scoring matrix. Trim off the ends of the regions
to achieve highest scores.

13
FASTA Algorithm
14
FASTA

5. After finding the best initial region, FASTA
performs a DP global alignment centered on the
best initial region.

15
FASTA Alignments
16
History of sequence searching

1970 NW
1981 SW
1985 FASTA
1990 BLAST
1997 BLAST2

17
The purpose of sequence alignment

Homology
Function identification
about 70 of the genes of M. jannaschii were
assigned a function using sequence similarity
(1997)

18
Similarity

How much similar do the sequences have to be to
infer homology?
Two possibilities when similarity is detected
The similarity is by chance
They evolved from a common ancestor hence, have
similar functions

19
Measures of similarity

Percent identity
40 similar, 70 similar
problems with percent identity?
Scoring matrices
matching of some amino acids may be more
significant than matching of other amino acids
PAM matrix in 1970, BLOSUM in 1992
problems?

20
Statistical Significance

Goal to provide a universal measure for
inferring homology
How different is the result from a random match,
or a match between unrelated requences?
Given a set of sequences not related to the query
(or a set of random sequences), what is the
probability of finding a match with the same
alignment score by chance?
Different statistical measures
p-value
E-value
z-score

21
Statistical significance measures

p-value the probability that at least one
sequence will produce the same score by chance
E-value expected number of sequences that will
produce same or better score by chance
z-score measures how much standard deviations
above the mean of the score distribution

22
How to compute statistical significance?

Significance of a match-run
Erdös-Renyí
Significance of local alignments without gaps
Karlin-Altschul statistics
Scoring matrices revisited
Significance of local alignments with gaps
Significance of global alignments

23
Analysis of coin tosses

Let black circles indicate heads
Let p be the probability of a head
For a fair coin, p 0.5
Probability of 5 heads in a row is (1/2)50.031
The expected number of times that 5H occurs in
above 14 coin tosses is 100.031 0.31

24
Analysis of coin tosses

The expected number of a length l run of heads in
n tosses.
What is the expected length R of the longest
match in n tosses?

25
Analysis of coin tosses

(Erdös-Rényi) If there are n throws, then the
expected length R of the longest run of heads is
R log1/p (n)

26
Example

Example Suppose n 20 for a fair coin
Rlog2(20)4.32
In other words in 20 coin tosses we expect a run
of heads of length 4.32, once.
Trick is how to model DNA (or amino acid)
sequence alignments as coin tosses.

27
Analysis of an alignment

Probability of an individual match p 0.05
Expected number of matches 10x8x0.05 4
Expected number of two successive matches
10x8x0.05x0.05 0.2

28
Matching runsin sequence alignments

Consider two sequences a1..m and b1..n
If the probability of occurrence for every symbol
is p, then a match of a residue ai with bj is p,
and a match of length l from ai,bj to
ail-1,bjl-1 is pl.
The head-run problem of coin tosses corresponds
to the longest run of matches along the diagonals

29
Matching runsin sequence alignments

There are m-l1 x n-l1 places where the match
could start
The expected length of the longest match can be
approximated as
Rlog1/p(mn)
where m and n are the lengths of the two
sequences.

30
Matching runsin sequence alignments

So suppose m n 10 and were looking at DNA
sequences
Rlog4(100)3.32
This analysis makes assumptions about the base
composition (uniform) and no gaps, but its a
good estimate.

31
Statistics for matching runs

Statistics of matching runs
Length versus score?
Consider all mismatches receive a negative score
of -8 and aibj match receives a positive score of
si,j.
What is the expected number of matching runs with
a score x or higher?
Using this theory of matching runs, Karlin and
Altschul developed a theory for statistics of
local alignments without gaps (extended this
theory to allow for mismatches).

32
Statistics of local alignments without gaps

A scoring matrix which satisfy the following
constraint
The expected score of a single match obtained by
a scoring matrix should be negative.
Otherwise?
Arbitrarily long random sequences will get higher
scores just because they are long, not because
theres a significant match.
If this requirement is met then the expected
number of alignments with score x or higher is
given by

33
Statistics of local alignments without gaps

K lt 1 is a proportionality constant that corrects
the mn space factor for the fact that there are
not really mn independent places that could have
produced score S x.
K has little effect on the statistical
significance of a similarity score
? is closely related to the scoring matrix used
and it takes into account that the scoring
matrices do not contain actual probabilities of
co-occurence, but instead a scaled version of
those values. To understand how ? is computed, we
have to look at the construction of scoring
matrices.

34
Scoring Matrices

In 1970s there were few protein sequences
available. Dayhoff used a limited set of families
of protein sequences multiply aligned to infer
mutation likelihoods.

35
Scoring Matrices

Dayhoff represented the similarity of amino acids
as a log odds ratio
where qij is the observed frequency of
co-occurrence, and pi, pj are the individual
frequencies.

36
Example

If M occurs in the sequences with 0.01 frequency
and L occurs with 0.1 frequency. By random
pairing, you expect 0.001 amino acid pairs to be
M-L. If the observed frequency of M-L is actually
0.003, score of matching M-L will be
log2(3)1.585 bits or loge(3) ln(3) 1.1 nats
Since, scoring matrices are usually provided as
integer matrices, these values are scaled by a
constant factor. ? is approximately the inverse
of the original scaling factor.

37
How to compute ?

Recall that
and

Sum of observed frequencies is 1.
Given the frequencies of individual amino acids
and the scores in the matrix, ? can be estimated.
38
Extreme value distribution

Consider an experiment that obtains the maximum
value of locally aligning a random string with
query string (without gaps). Repeat with another
random string and so on. Plot the distribution
of these maximum values.
The resulting distribution is an extreme value
distribution, called a Gumbel distribution.

39
Normal vs. Extreme Value Distribution
Normal distribution y (1/v2p)e-x2/2
Normal
Extreme value distribution y e-x e-x
Extreme Value
40
Local alignments with gaps

The EVD distribution
is not always observed.
Theory of local alignments
with gaps is not well studied
as in without gaps.
Mostly empirical results.
For example, BLAST allows
only a certain range of
gap penalties.

41
BLAST statistics

Pre-computed ? and K values for different scoring
matrices and gap penalties are used for faster
computation.
Raw score is converted to bit score
E-value is computed using
m is query size, n is database size and L is the
typical length of maximal scoring alignment.

42
FASTA Statistics

FASTA tries to estimate the probability
distribution of alignments for every query.
For any query sequence, a large collection of
scores is gathered during the search of the
database.
They estimate the parameters of the EVD
distribution based on the histogram of scores.
Advantages
reliable statistics for different parameters
different databases, different gap penalties,
different scoring matrices, queries with
different amino acid compositions.

43
Statistical significanceanother example

Suppose, we have a huge graph with weighted edges
and we want to find strongly connected clusters
of nodes.
Suppose, an algorithm for this task is given.
The algorithms gives you the best hundred
clusters in this graph.
How do you define best?
Cluster size?
Total weight of edges?

44
Statistical significance

How different is a found cluster of size N from a
random cluster of the same size?
This measure will enable comparison of clusters
of different sizes.

45
Statistical significance of a cluster

Use maximum spanning tree weight of a cluster as
a quantitative representation of that cluster.
And see what
values random
clusters get.
(sample many
random
clusters)

46
Statistical significance of a cluster
Looks like an exponential decay. We may fit an
exponential distribution on this histogram.
47
Fitting an exponential
48
Statistical significance of a cluster
After we fit an exponential distribution, we
compute the probability that another random
cluster gets a higher score than the score of
found cluster.
49
Examples

?5 1.7 for clusters of size 5 and ?20 0.36
for clusters of size 20.
Suppose you have found a cluster of size 5 with
weights of its edges sum up to 15 and you have
found a cluster of size 20 with weight 45 which
one would you prefer?

Write a Comment

User Comments (0)

About PowerShow.com

Lecture outline - PowerPoint PPT Presentation

Lecture outline

O(max(m,n)) if only similarity score is needed. More complicated 'divide-and-conquer' algorithm that ... log2(3)=1.585 bits or loge(3) = ln(3) = 1.1 nats ... – PowerPoint PPT presentation