Swapan Shop Mallick - PowerPoint PPT Presentation

About This Presentation

Title:

Swapan Shop Mallick


Slides: 52

Provided by: ekhidnaBi


Transcript and Presenter's Notes

1
Significance in protein analysis
Swapan Shop Mallick
2
Overview

The need for statistics
Example: BLOSUM
What do the scores mean?
How can you compare two scores?
Example: BLAST
Problems with BLAST
Review of Distributions
Distribution of random BLAST results
P-values and e-values
Statistics of BLAST
Summary and Conclusion
Exercise

3
The need for statistics

Statistics is very important for bioinformatics.
It is very easy to have a computer analyze the
data and give you back a result.
The problem is to decide whether the answer the
computer gives you is any good at all.
Questions:
How statistically significant is the answer?
What is the probability that this answer could
have been obtained by chance? What does this
depend on?

5
Basics
(Figure: a population of size N and a sample of size n, labelled "Descriptive statistics" and "Probability")
6
Example: BLOSUM

The BLOSUM matrix assigns a probability score for
each residue pair in an alignment based on
the frequency with which that pairing is known to
occur within conserved blocks of related
proteins.
Simple, since size of population = size of sample.
BLOSUM matrices are constructed from observations,
which lead to observed probabilities.

7
BLOSUM substitution matrices

BLOSUM matrices are used in log-odds form, based
on actually observed substitutions.
This is because:
Ease of use: scores can simply be added (the raw
probabilities would have to be multiplied)
Ease of interpretation:
S = 0: substitution is just as likely to occur as
at random
S < 0: substitution is more likely to occur
randomly than observed
S > 0: substitution is less likely to occur
randomly than observed
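
The "ease of use" point can be checked numerically. The sketch below uses made-up observed/expected ratios (they are assumptions for illustration, not entries of any BLOSUM matrix) to show that adding log-odds scores gives the same result as multiplying the underlying ratios.

```python
import math

# Hypothetical observed/expected ratios for three aligned residue pairs
# (illustrative numbers only, not taken from a real BLOSUM matrix).
odds_ratios = [4.0, 0.5, 2.0]

# Raw ratios: the ratio for the whole alignment is a product.
product_of_ratios = math.prod(odds_ratios)

# Log-odds form: the alignment score is a plain sum.
log_odds_scores = [math.log(r) for r in odds_ratios]
summed_score = sum(log_odds_scores)

# Exponentiating the summed score recovers the product.
recovered_product = math.exp(summed_score)
print(product_of_ratios, recovered_product)
```

Sums are also numerically safer than long products of small probabilities, which is a further practical reason to work in log space.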

8
Substitution matrices
Score of amino acid a with amino acid b:
$s(a,b) = \dfrac{1}{\lambda} \ln \dfrac{P_{ab}}{f_a f_b}$
$P_{ab}$ is the observed frequency with which residues a and
b are correlated because of homology
$\lambda$ is a scaling factor equal to 0.347, set so
that the scores can be rounded off to sensible
integers
$f_a f_b$ is the expected frequency of seeing residues
a and b paired together, which is just the
product of the frequency of residue a multiplied
by the frequency of residue b
Source: Eddy S., "Where did the BLOSUM62 alignment score
matrix come from?", Nat. Biotech. 22,
Aug 2004
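
As a rough illustration of the definitions on this slide, the sketch below computes a log-odds score as (1/lambda) * ln(P_ab / (f_a * f_b)), with lambda = 0.347 as quoted. The natural logarithm is assumed, the function name is hypothetical, and the frequencies are invented for the example.

```python
import math

LAMBDA = 0.347  # scaling factor quoted on the slide

def log_odds_score(p_ab, f_a, f_b, lam=LAMBDA):
    """Return the unrounded score (1/lam) * ln(p_ab / (f_a * f_b))."""
    return math.log(p_ab / (f_a * f_b)) / lam

# Hypothetical frequencies, chosen only for illustration.
f_a, f_b = 0.05, 0.08   # background frequencies of residues a and b
p_ab = 0.008            # pair observed twice as often as chance (f_a * f_b = 0.004)

raw = log_odds_score(p_ab, f_a, f_b)
score = round(raw)      # matrices store rounded integer scores
print(raw, score)
```

With this value of lambda, a pair seen twice as often as expected by chance scores close to 2 before rounding.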
10

11
i) S = 0: O/E ratio = 1.
ii) Compare S = 5 and S = 10. The ratio is based on an
exponential function: O/E is about 5.7 for S = 5 and
about 32.1 for S = 10.
iii) S = -10: O/E ratio = 0.031, about 1/32.
iv) Ratio of scores S1, S2 in terms of probabilities of
observed/random.
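
The observed/expected (O/E) ratios quoted here can be reproduced by inverting the log-odds definition, O/E = exp(lambda * S). This is a minimal sketch assuming lambda = 0.347 and natural logarithms; the function name is hypothetical.

```python
import math

LAMBDA = 0.347  # scaling factor quoted for BLOSUM62 on the earlier slide

def odds_ratio(score, lam=LAMBDA):
    """Observed/expected ratio implied by a log-odds score."""
    return math.exp(lam * score)

for s in (0, 5, 10, -10):
    print(s, round(odds_ratio(s), 3))

# For point iv): two scores differ in odds by exp(lam * (S1 - S2)),
# so the ratio depends only on the score difference.
ratio_between = odds_ratio(10) / odds_ratio(5)
print(ratio_between)
```

Because the relationship is exponential, doubling the score from 5 to 10 squares the O/E ratio rather than doubling it.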
15
Example: BLAST

Motivations
Exact algorithms are exhaustive but
computationally expensive.
Exact algorithms are impractical for comparing a
query sequence to millions of other sequences in
a database (database scanning),
so database scanning requires a heuristic
alignment algorithm (at the cost of optimality).

16
Interpret BLAST results - Description
17
Problems with BLAST

Why do results change?
How can you compare results from different BLAST
tools which may report different types of values?
How are results (e.g. E-value) affected by the query?
There are _many_ values reported in the output:
what do they mean?

18
Example: Importance of BLAST statistics

But, first a review.

19
Review

What is a distribution?
A plot showing the frequency of a given variable
or observation.

21
Features of a Normal Distribution

Symmetric Distribution
Has an average or mean value at the centre
Has a characteristic width called the standard
deviation (S.D., σ)
Most common type of distribution known

μ = mean
22
Standard Deviations (Z-score)
23
Mean, Median, Mode
(Figure labels: mode, median, mean)
24
Mean, Median, Mode

In a Normal Distribution the mean, mode and
median are all equal
In skewed distributions they are unequal
Mean - average value, affected by extreme values
in the distribution
Median - the middlemost value; in moderately skewed
distributions it usually lies between the mode and the mean
Mode - most common value
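
A small example makes the difference concrete. The sample below is invented and deliberately right-skewed, with one extreme value, so the three measures separate.

```python
import statistics

# A small right-skewed sample (made-up values with one extreme observation).
sample = [2, 3, 3, 3, 4, 4, 5, 6, 9, 21]

mean = statistics.mean(sample)      # pulled towards the extreme value 21
median = statistics.median(sample)  # middle of the sorted values
mode = statistics.mode(sample)      # most frequent value

print(mode, median, mean)
```

Here the mode is smallest and the mean largest, the ordering typical of a distribution with a long right tail.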

25
Different Distributions
Unimodal Bimodal
26
Other Distributions

Binomial Distribution
Poisson Distribution
Extreme Value Distribution

27
Binomial Distribution
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
$P(x) = (p + q)^n$
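
The triangle of coefficients and the expansion of (p + q)^n are linked: each term of the expansion is one binomial probability. A minimal sketch, with hypothetical function names and an arbitrary choice of n and p:

```python
import math

def pascal_row(n):
    """Binomial coefficients C(n, 0) .. C(n, n), i.e. row n of Pascal's triangle."""
    return [math.comb(n, k) for k in range(n + 1)]

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    q = 1.0 - p
    return math.comb(n, k) * p**k * q**(n - k)

n, p = 5, 0.3   # arbitrary illustrative values
row = pascal_row(n)
probabilities = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(row)
print(sum(probabilities))  # the terms of (p + q)^n sum to 1 because p + q = 1
```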
28
Poisson Distribution
(Plot of P(x) against x)
29
Review

What is a distribution?
A plot showing the frequency of a given variable
or observation.
What is a null hypothesis?
A statistician's way of characterizing chance.
Generally, a mathematical model of randomness
with respect to a particular set of observations.
The purpose of most statistical tests is to
determine whether the observed data can be
explained by the null hypothesis.

31
Review

Examples of null hypotheses
Sequence comparison using shuffled sequences.
A normal distribution of log ratios from a
microarray experiment.
LOD scores from genetic linkage analysis when the
relevant loci are randomly sprinkled throughout
the genome.

32
Empirical score distribution

The picture shows a distribution of scores from a
real database search using BLAST.
This distribution contains scores from
non-homologous and homologous pairs.

High scores from homology.
33
Empirical null score distribution

This distribution is similar to the previous one,
but generated using a randomized sequence
database.
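
One way to picture how such a null distribution could be produced is to score a query against many shuffled copies of a target. The sketch below is only an illustration: the score is a toy count of identical positions, not a BLAST score, and the sequences and function names are made up.

```python
import random

def identity_score(a, b):
    """Toy stand-in for an alignment score: number of identical positions."""
    return sum(x == y for x, y in zip(a, b))

def shuffled_null_scores(query, target, n_shuffles=1000, seed=0):
    """Score the query against shuffled copies of the target."""
    rng = random.Random(seed)
    residues = list(target)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # destroys residue order, keeps composition
        scores.append(identity_score(query, residues))
    return scores

# Hypothetical sequences; the target equals the query to mimic a clear homolog.
query = "MKTAYIAKQRQISFVKSHFSRQ"
target = "MKTAYIAKQRQISFVKSHFSRQ"

null_scores = shuffled_null_scores(query, target)
observed = identity_score(query, target)
print(observed, max(null_scores))
```

Shuffling keeps the amino acid composition fixed, so the null scores reflect what composition alone can produce.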

34
Review

What is a p-value?

35
Review

What is a p-value?
The probability of observing an effect as strong
or stronger than you observed, given the null
hypothesis. I.e., How likely is this effect to
occur by chance?
$\Pr(x > S \mid \text{null})$
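
When the null distribution is only available as a list of scores (for example from randomized sequences), the p-value can be estimated by counting. This is a sketch with invented scores and a hypothetical function name. The "+1" in numerator and denominator is a common convention assumed here so that the estimate is never exactly zero; it is not something stated on the slides.

```python
def empirical_p_value(observed, null_scores):
    """Estimate Pr(score >= observed | null) from a sample of null scores."""
    hits = sum(s >= observed for s in null_scores)
    # Add-one correction: a finite sample cannot justify a p-value of 0.
    return (hits + 1) / (len(null_scores) + 1)

# Made-up null scores for illustration.
null_scores = [12, 15, 9, 20, 14, 11, 17, 13, 10, 16]

print(empirical_p_value(18, null_scores))
print(empirical_p_value(25, null_scores))
```

The smallest p-value this estimate can report is 1/(N + 1) for N null scores, which is why analytic formulas are needed for very significant hits.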

36
Review

What is the name of the distribution created by
sequence similarity scores, and what does it
look like?
Extreme value distribution, or Gumbel
distribution.
It looks similar to a normal distribution, but it
has a larger tail on the right.
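
The heavier right tail can be seen by sampling. The sketch below draws from a standard Gumbel distribution by inverting its cumulative distribution function; the location and scale (0 and 1), the seed and the sample size are arbitrary assumptions.

```python
import math
import random
import statistics

def sample_gumbel(rng, mu=0.0, beta=1.0):
    """Draw one value by inverting the Gumbel CDF F(x) = exp(-exp(-(x - mu) / beta))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return mu - beta * math.log(-math.log(u))

rng = random.Random(42)
draws = [sample_gumbel(rng) for _ in range(20000)]

mean = statistics.mean(draws)
median = statistics.median(draws)
print(mean, median)
```

The mean sits above the median, which in turn sits above the peak at 0: the signature of a long right tail that a normal distribution lacks.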

38
Statistics

BLAST scores (and also other local alignment scores,
e.g. Smith-Waterman and BLAT) between random, unrelated
sequences follow the Gumbel Extreme Value Distribution
(EVD):
$\Pr(s > S) = 1 - \exp(-Kmn\,e^{-\lambda S})$
This is the probability of randomly encountering
a score greater than S.
S: alignment score
m, n: query sequence length and length of
database, respectively
K, λ: parameters depending on scoring scheme and
sequence composition
Bit score: $S' = \dfrac{\lambda S - \ln K}{\ln 2}$
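
The two formulas on this slide translate directly into code. In the sketch below the function names are hypothetical, and the values of K, lambda, m and n are illustrative assumptions (of the kind reported for gapped protein searches), not numbers from the slides.

```python
import math

def tail_probability(score, m, n, K, lam):
    """Pr(s > S) = 1 - exp(-K*m*n*exp(-lam*S)) under the Gumbel EVD."""
    x = K * m * n * math.exp(-lam * score)
    # -expm1(-x) evaluates 1 - exp(-x) accurately even when x is tiny.
    return -math.expm1(-x)

def bit_score(score, K, lam):
    """Normalised score S' = (lam*S - ln K) / ln 2."""
    return (lam * score - math.log(K)) / math.log(2)

# Illustrative, assumed parameters.
K, LAM = 0.041, 0.267
m, n = 250, 50_000_000   # query length, database length

raw_score = 100
print(tail_probability(raw_score, m, n, K, LAM))
print(bit_score(raw_score, K, LAM))
```

The bit score absorbs K and lambda, which is what allows scores from different scoring schemes to be compared.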

39
BLAST output revisited
(Figure: BLAST output from ExPASy BLAST, annotated with S', S, E, n, m, λ and K)
40
Review

EVD for random BLAST
Upper tail behaviour: $\Pr(s > S) \approx Kmn\,e^{-\lambda S}$

This is the EXPECT value (E-value).
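
The upper-tail approximation amounts to saying that the probability 1 - exp(-E) is close to E itself when E is small. A short sketch (hypothetical function name, arbitrary E-values) shows where the two agree and where they part:

```python
import math

def p_from_e(e_value):
    """Probability of at least one chance hit, given the expected number E."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E)

for e in (10.0, 1.0, 0.1, 1e-3, 1e-10):
    print(e, p_from_e(e))
```

For E well below 1 the two numbers are practically interchangeable; for E of 1 or more only the E-value keeps growing, since a probability cannot exceed 1.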
41
How to Calculate E-values

Think of the databank as one very long random
sequence, length G
Alignments with s > S occur randomly across the
genome, with a Poisson distribution
$\Pr(\text{highest-scoring alignment } s > S) \approx KmG\,e^{-\lambda S}$
$\Pr(\text{no alignment } s > S) = 1 - KmG\,e^{-\lambda S}$
Expected number μ of alignments with s > S given by
$1 - e^{-\mu} = 1 - KmG\,e^{-\lambda S}$ (Poisson property)
$\mu = -\log(KmG) + \lambda S$
Threshold: $S = (\log(KmG) + \mu)/\lambda$
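
In practice the threshold relation is used in reverse: choose an acceptable E-value and ask what raw score is needed. The sketch below solves E = K*m*G*exp(-lambda*S) for S. The function names are hypothetical and the parameter values are illustrative assumptions.

```python
import math

def e_value(score, m, G, K, lam):
    """Expected number of chance alignments scoring above `score`."""
    return K * m * G * math.exp(-lam * score)

def score_threshold(e_cutoff, m, G, K, lam):
    """Raw score at which the E-value equals e_cutoff."""
    return (math.log(K * m * G) - math.log(e_cutoff)) / lam

# Illustrative, assumed parameters.
K, LAM = 0.041, 0.267
m, G = 250, 50_000_000   # query length, total databank length

threshold = score_threshold(1e-3, m, G, K, LAM)
print(threshold, e_value(threshold, m, G, K, LAM))
```

Because the databank length enters through a logarithm, a much larger databank raises the required score only modestly.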

48
Summary
Want to be able to compare scores in sequences of
different compositions or different scoring
schemes
Score: $S = \text{sum(match)} - \text{sum(gap costs)}$
Bit score: $S' = \dfrac{\lambda S - \ln K}{\ln 2}$
E-value of bit score: $E = mn\,2^{-S'}$

Score and bit score grow linearly with the length
of the alignment
The E-value shrinks exponentially as the bit score
grows (it halves with every additional bit)
The E-value grows linearly with the product of target
and query sizes.
Doubling target set size and doubling query
length have the same effect on the E-value
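
The scaling statements in this summary follow from the bit-score form of the E-value, E = m * n * 2^(-S'), and can be checked directly. The function name and the numbers below are assumptions for illustration.

```python
def e_value_from_bits(bits, m, n):
    """E-value for a bit score S', query length m and target (database) length n."""
    return m * n * 2.0 ** (-bits)

# Illustrative, assumed values.
m, n, bits = 250, 50_000_000, 45.0

base = e_value_from_bits(bits, m, n)
longer_query = e_value_from_bits(bits, 2 * m, n)   # doubling the query length
bigger_target = e_value_from_bits(bits, m, 2 * n)  # doubling the target size
one_more_bit = e_value_from_bits(bits + 1, m, n)   # one extra bit of score

print(base, longer_query, bigger_target, one_more_bit)
```

Doubling either length doubles the E-value, while each additional bit of score halves it.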
49
Conclusion

You should now be able to compare BLAST results
from different databases, converting values if
they are reported differently (which happens
frequently)
You should now know why BLAST results might
change from one day to the next, even on the same
server
You should also understand the dependence of the
E-value on query length.
Statistical rankings are reported for (almost)
every database search tool. When making
comparisons between databases, between sequences
it is useful to know how the statistics are
derived to know if comparisons are meaningful.
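
Since the E-value is proportional to the database length, a hit with the same query and the same score can be put on a common footing across databases of different sizes. This is a sketch under the assumption that the scoring parameters (K and lambda) are the same in both searches. The function name and numbers are hypothetical.

```python
def rescale_e_value(e_value, old_db_length, new_db_length):
    """Rescale an E-value for the same query and score to another database size."""
    return e_value * new_db_length / old_db_length

# Hypothetical example: the same hit reported against a database that has doubled.
e_small_db = 0.01
e_large_db = rescale_e_value(e_small_db, 100_000_000, 200_000_000)
print(e_large_db)
```

The same proportionality suggests one reason an E-value can drift between searches on the same server: if the database has grown in the meantime, an unchanged alignment receives a larger E-value.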

THE END

Supplemental Section

Look through Patterns in sequences (Searching
for information within sequences) - Some common
problems and their solutions
http://lepo.it.da.ut.ee./mremm/kurs/pattern.htm
What is the structure of my sequence?
http://speedy.embl-heidelberg.de/gtsp/flowchart2.html (clickable!)
