Title: Swapan Shop Mallick
1Significance in protein analysis
Swapan Shop Mallick
2Overview
- The need for statistics
- Example BLOSUM
- What do the scores mean?
- How can you compare two scores?
- Example BLAST
- Problems with BLAST
- Review of Distributions
- Distribution of random BLAST results
- P-values and e-values
- Statistics of BLAST
- Summary and Conclusion
- Exercise
3The need for statistics
- Statistics is very important for bioinformatics.
- It is very easy to have a computer analyze the data and give you back a result.
- The problem is to decide whether the answer the computer gives you is any good at all.
- Questions
- How statistically significant is the answer?
- What is the probability that this answer could have been obtained at random? What does this depend on?
4Basics
[Figure: a sample of size n drawn from a population of size N]
5Basics
[Figure: a sample of size n and a population of size N, with the labels "Descriptive statistics" and "Probability"]
6Example BLOSUM
- The BLOSUM matrix assigns a probability score for each residue pair in an alignment, based on the frequency with which that pairing is known to occur within conserved blocks of related proteins.
- Simple, since size of population = size of sample
- BLOSUM matrices are constructed from observations, which lead to observed probabilities
7BLOSUM substitution matrices
- BLOSUM matrices are used in log-odds form based on actually observed substitutions.
- This is because
- Ease of use: scores can just be added (the raw probabilities would have to be multiplied)
- Ease of interpretation
- S = 0: substitution is just as likely to occur as random
- S < 0: substitution is more likely to occur randomly than observed
- S > 0: substitution is less likely to occur randomly than observed
8Substitution matrices
Score of amino acid a with amino acid b: $S_{ab} = \frac{1}{\lambda} \ln \frac{P_{ab}}{f_a f_b}$
$P_{ab}$ is the observed frequency that residues a and b are correlated because of homology
$\lambda$ is a scaling factor equal to 0.347, set so that the scores can be rounded off to sensible integers
$f_a f_b$ is the expected frequency of seeing residues a and b paired together, which is just the product of the frequency of residue a multiplied by the frequency of residue b
Source: Eddy S., "Where did the BLOSUM62 alignment score matrix come from?", Nat. Biotech. 22, Aug 2004
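As a rough illustration of the log-odds definition, the sketch below computes a rounded score from an observed pair frequency and two background frequencies. The function name `log_odds_score` and the frequencies in the example are hypothetical, chosen only for illustration; only the scaling factor 0.347 comes from the slide.

```python
import math

LAMBDA = 0.347  # scaling factor quoted on the slide


def log_odds_score(p_ab, f_a, f_b, lam=LAMBDA):
    """Rounded log-odds score: (1 / lambda) * ln(P_ab / (f_a * f_b))."""
    return round(math.log(p_ab / (f_a * f_b)) / lam)


# Hypothetical frequencies (not taken from any real BLOSUM table):
# the pair is observed 8 times more often than expected by chance.
example_score = log_odds_score(p_ab=0.02, f_a=0.05, f_b=0.05)
print(example_score)
```

A pair observed exactly as often as expected by chance scores 0, and pairs observed less often than expected score below 0.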
10 11
i) S = 0: O/E ratio = 1
ii) Compare S = 5 and S = 10. The ratio is based on an exponential function: O/E = 5.7 for S = 5 and O/E = 32.1 for S = 10.
iii) S = -10: O/E ratio = 0.031, about 1/32.
iv) Ratio of scores S1, S2 in terms of probabilities of observed/random
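The observed/expected (O/E) ratios quoted above follow from inverting the log-odds score, O/E = exp(λS), with λ = 0.347. A minimal sketch, in which the function name `odds_ratio` is an assumption made for illustration:

```python
import math

LAMBDA = 0.347  # scaling factor quoted on the slides


def odds_ratio(score, lam=LAMBDA):
    """Observed/expected ratio implied by a log-odds score."""
    return math.exp(lam * score)


for s in (0, 5, 10, -10):
    print(s, round(odds_ratio(s), 3))

# Two scores compare through the exponential of their difference:
# odds_ratio(s1) / odds_ratio(s2) == exp(lam * (s1 - s2))
```

Because the relationship is exponential, a score of 10 is not twice as strong as a score of 5: its odds ratio is the square of the ratio for 5.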
15Example BLAST
- Motivations
- Exact algorithms are exhaustive but computationally expensive.
- Exact algorithms are impractical for comparing a query sequence to millions of other sequences in a database (database scanning),
- and so database scanning requires a heuristic alignment algorithm (at the cost of optimality).
16Interpret BLAST results - Description
17Problems with BLAST
- Why do results change?
- How can you compare results from different BLAST tools, which may report different types of values?
- How are results (e.g. E-value) affected by the query?
- There are _many_ values reported in the output. What do they mean?
18Example Importance of Blast statistics
19Review
- What is a distribution?
- A plot showing the frequency of a given variable
or observation.
21Features of a Normal Distribution
- Symmetric Distribution
- Has an average or mean value at the centre
- Has a characteristic width called the standard deviation (S.D., σ)
- Most common type of distribution known
- μ = mean
22Standard Deviations (Z-score)
23Mean, Median, Mode
[Figure: a distribution with the mode, median, and mean marked]
24Mean, Median, Mode
- In a Normal Distribution the mean, mode and median are all equal
- In skewed distributions they are unequal
- Mean - average value, affected by extreme values in the distribution
- Median - the middlemost value, usually half way between the mode and the mean
- Mode - most common value
25Different Distributions
[Figure: a unimodal and a bimodal distribution]
26Other Distributions
- Binomial Distribution
- Poisson Distribution
- Extreme Value Distribution
27Binomial Distribution
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
P(x) is given by the terms of the expansion of $(p + q)^n$
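The triangle of coefficients and the expansion of (p + q)^n can be tied together in a few lines. This is a sketch only; the function names `pascal_row` and `binomial_pmf` are made up for illustration, and q is taken to be 1 - p.

```python
from math import comb


def pascal_row(n):
    """Coefficients of (p + q)**n, i.e. row n of Pascal's triangle."""
    return [comb(n, k) for k in range(n + 1)]


def binomial_pmf(x, n, p):
    """Probability of x successes in n trials: one term of (p + q)**n."""
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)


for n in range(6):
    print(pascal_row(n))
```

Summing `binomial_pmf(x, n, p)` over x from 0 to n gives (p + q)^n = 1, which is why the terms form a probability distribution.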
28Poisson Distribution
[Figure: Poisson distribution, P(x) plotted against x]
29Review
- What is a distribution?
- A plot showing the frequency of a given variable or observation.
- What is a null hypothesis?
- A statistician's way of characterizing chance.
- Generally, a mathematical model of randomness with respect to a particular set of observations.
- The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.
31Review
- Examples of null hypotheses
- Sequence comparison using shuffled sequences.
- A normal distribution of log ratios from a microarray experiment.
- LOD scores from genetic linkage analysis when the relevant loci are randomly sprinkled throughout the genome.
32Empirical score distribution
- The picture shows a distribution of scores from a real database search using BLAST.
- This distribution contains scores from non-homologous and homologous pairs.
- High scores come from homology.
33Empirical null score distribution
- This distribution is similar to the previous one,
but generated using a randomized sequence
database.
35Review
- What is a p-value?
- The probability of observing an effect as strong or stronger than you observed, given the null hypothesis. I.e., how likely is this effect to occur by chance?
- $\Pr(x > S \mid \text{null})$
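One simple way to estimate such a p-value is to count how often scores drawn under the null hypothesis reach the observed score. The sketch below assumes a list of null scores is already available (for example, from a search against shuffled sequences); the function name `empirical_p_value` and the numbers are hypothetical.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores that are as large as, or larger than, the observed score."""
    hits = sum(1 for s in null_scores if s >= observed)
    return hits / len(null_scores)


# Made-up null scores, for illustration only
null_scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(empirical_p_value(95, null_scores))
```

This counts ties as "as strong", matching the wording of the definition. With few null samples the estimate is coarse, which is one reason a fitted distribution is preferred for very small p-values.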
36Review
- What is the name of the distribution created by sequence similarity scores, and what does it look like?
- Extreme value distribution, or Gumbel distribution.
- It looks similar to a normal distribution, but it has a larger tail on the right.
38Statistics
- BLAST scores (and also local alignment scores, i.e. Smith-Waterman and BLAT) between random, unrelated sequences follow the Gumbel Extreme Value Distribution (EVD)
- $\Pr(s > S) = 1 - \exp(-K m n \, e^{-\lambda S})$
- This is the probability of randomly encountering a score greater than S.
- S: alignment score
- m, n: query sequence length and length of database, respectively
- K, $\lambda$: parameters depending on scoring scheme and sequence composition
- Bit score: $S' = (\lambda S - \log K) / \log 2$
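The two formulas on this slide can be evaluated directly. In the sketch below the function names are made up, and the values of K, λ, m and n are illustrative assumptions rather than parameters taken from the slides; real values depend on the scoring scheme and the database.

```python
import math


def evd_p_value(S, m, n, K, lam):
    """Pr(s > S) = 1 - exp(-K * m * n * exp(-lam * S))."""
    expected = K * m * n * math.exp(-lam * S)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny
    return -math.expm1(-expected)


def bit_score(S, K, lam):
    """Normalised score S' = (lam * S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)


# Illustrative parameters (assumptions, not values from the slides)
K, lam = 0.041, 0.267
m, n = 250, 100_000_000  # query length, database length

for raw in (50, 100, 150):
    print(raw, bit_score(raw, K, lam), evd_p_value(raw, m, n, K, lam))
```

For small values the probability is almost equal to the exponent term K m n e^(-λS), which is the quantity BLAST reports as the E-value.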
39BLAST output revisited
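[Figure: BLAST output with annotations marking the raw score S, the bit score S', the E-value E, the lengths n and m, and the parameters λ and K. From Expasy BLAST]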
40Review
- EVD for random BLAST
- Upper tail behaviour: $\Pr(s > S) \approx K m n \, e^{-\lambda S}$
- This is the EXPECT value (E-value)
41How to Calculate E-values
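- Think of the databank as one very long random sequence, length G
- Alignments with s > S occur randomly across the genome, with a Poisson distribution
- $\Pr(\text{highest-scoring alignment } s > S) \approx K m G \, e^{-\lambda S}$
- $\Pr(\text{no alignment } s > S) \approx 1 - K m G \, e^{-\lambda S}$
- Expected number $\mu$ of alignments with s > S is given by
- $1 - e^{-\mu} \approx K m G \, e^{-\lambda S}$ (Poisson property), so $\mu \approx K m G \, e^{-\lambda S}$ when $\mu$ is small
- $\log \mu = \log(K m G) - \lambda S$
- Threshold $S = (\log(K m G) - \log \mu) / \lambda$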
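A short sketch of this calculation, assuming the expected number of chance alignments is μ = K m G e^(-λS) and solving it for the score threshold. The function names and all parameter values are assumptions made for illustration.

```python
import math


def e_value(S, m, G, K, lam):
    """Expected number of chance alignments scoring above S."""
    return K * m * G * math.exp(-lam * S)


def score_threshold(mu, m, G, K, lam):
    """Raw score at which the expected number of chance alignments equals mu."""
    return (math.log(K * m * G) - math.log(mu)) / lam


# Illustrative parameters (assumptions, not values from the slides)
K, lam = 0.041, 0.267
m, G = 250, 100_000_000  # query length, databank length

S_min = score_threshold(0.01, m, G, K, lam)
print(S_min, e_value(S_min, m, G, K, lam))
```

Feeding the threshold back into `e_value` returns the requested expected count, which is a quick check that the two formulas are inverses of each other.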
48Summary
- Want to be able to compare scores in sequences of different compositions or different scoring schemes
- Score: $S = \text{sum(match)} - \text{sum(gap costs)}$
- Bit score: $S' = (\lambda S - \log K) / \log 2$
- Score and bit score grow linearly with the length of the alignment
- E-value of bit score: $E = m n \, 2^{-S'}$
- E-value shrinks really fast as bit score grows
- E-value grows linearly with the product of target and query sizes.
- Doubling target set size and doubling query length have the same effect on E-value
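The scaling behaviour described in this summary can be checked numerically from E = m n 2^(-S'). The function name `e_value_from_bits` and the lengths used are assumptions for illustration.

```python
import math


def e_value_from_bits(bits, m, n):
    """E-value for a bit score, given query length m and target size n."""
    return m * n * 2.0 ** (-bits)


# Hypothetical query length and database size
m, n = 250, 100_000_000
base = e_value_from_bits(40.0, m, n)

print(base)
print(e_value_from_bits(41.0, m, n))      # one extra bit halves the E-value
print(e_value_from_bits(40.0, 2 * m, n))  # doubling the query length
print(e_value_from_bits(40.0, m, 2 * n))  # doubling the target size
```

Doubling either length doubles the E-value, while each additional bit of score halves it, which is why the E-value falls off so quickly as the bit score grows.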
49Conclusion
- You should now be able to compare BLAST results from different databases, converting values if they are reported differently (which happens frequently)
- You should now know why BLAST results might change from one day to the next, even on the same server
- You should also understand the dependence of E-value on query length.
- Statistical rankings are reported for (almost) every database search tool. When making comparisons between databases or between sequences, it is useful to know how the statistics are derived, to know if comparisons are meaningful.
50 51 52
- Look through Patterns in sequences (Searching for information within sequences): some common problems and their solutions
- http://lepo.it.da.ut.ee./mremm/kurs/pattern.htm
- What is the structure of my sequence?
- http://speedy.embl-heidelberg.de/gtsp/flowchart2.html (clickable!)