Swapan Shop Mallick - PowerPoint PPT Presentation

1
Significance in protein analysis
Swapan Shop Mallick
2
Overview
  • The need for statistics
  • Example: BLOSUM
  • What do the scores mean?
  • How can you compare two scores?
  • Example: BLAST
  • Problems with BLAST
  • Review of Distributions
  • Distribution of random BLAST results
  • P-values and e-values
  • Statistics of BLAST
  • Summary and Conclusion
  • Exercise

3
The need for statistics
  • Statistics is very important for bioinformatics.
  • It is very easy to have a computer analyze the
    data and give you back a result.
  • The problem is to decide whether the answer the
    computer gives you is any good at all.
  • Questions:
  • How statistically significant is the answer?
  • What is the probability that this answer could
    have been obtained by chance? What does this
    depend on?

4
Basics
[Figure: a population (size N) and a sample (size n);
labels: Descriptive statistics, Probability]
6
Example: BLOSUM
  • The BLOSUM matrix assigns a probability-based
    score to each residue pair in an alignment, based
    on the frequency with which that pairing is known
    to occur within conserved blocks of related
    proteins.
  • Simple, since size of population = size of sample
  • BLOSUM matrices are constructed from observations,
    which lead to observed probabilities

7
BLOSUM substitution matrices
  • BLOSUM matrices are used in log-odds form based
    on actually observed substitutions.
  • This is because:
  • Ease of use: scores can simply be added (the raw
    probabilities would have to be multiplied)
  • Ease of interpretation:
  • S = 0: substitution is just as likely to occur as
    at random
  • S < 0: substitution is more likely to occur
    randomly than observed
  • S > 0: substitution is less likely to occur
    randomly than observed

8
Substitution matrices
Score of amino acid a with amino acid b:
$S_{ab} = \frac{1}{\lambda} \ln \frac{P_{ab}}{f_a f_b}$
$P_{ab}$ is the observed frequency that residues a and
b are correlated because of homology
$\lambda$ is a scaling factor equal to 0.347, set so
that the scores can be rounded off to sensible
integers
$f_a f_b$ is the expected frequency of seeing residues
a and b paired together, which is just the
product of the frequency of residue a multiplied
by the frequency of residue b
Source: Eddy S., "Where did the BLOSUM62 alignment
score matrix come from?", Nat. Biotech. 22,
Aug 2004
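As a minimal sketch of the definitions above, the following Python computes a log-odds score of the form (1/λ) ln(P_ab / (f_a f_b)). The function name `log_odds_score` and the example frequencies are hypothetical, chosen only to show the arithmetic; only λ = 0.347 comes from the slide.

```python
import math

LAMBDA = 0.347  # scaling factor quoted on the slide


def log_odds_score(p_ab, f_a, f_b, lam=LAMBDA):
    """Return (unrounded, rounded) score (1/lam) * ln(p_ab / (f_a * f_b))."""
    raw = math.log(p_ab / (f_a * f_b)) / lam
    return raw, round(raw)


# Hypothetical frequencies (not taken from any real matrix):
# the pair is observed 8 times more often than expected by chance.
raw, rounded = log_odds_score(p_ab=0.02, f_a=0.05, f_b=0.05)
print(f"unrounded score = {raw:.3f}, integer score = {rounded}")
```

Because the scores are logarithms, adding the scores of several aligned pairs corresponds to multiplying their observed/expected ratios.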
10

11
i) S = 0: O/E ratio = 1
ii) Compare S = 5 and S = 10. The ratio is based on
an exponential function, $O/E = e^{\lambda S}$: about
5.7 for S = 5 and about 32.1 for S = 10
iii) S = -10: O/E ratio = 0.031, about 1/32
iv) Ratio of scores S1, S2 in terms of
probabilities of observed/random
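The observed/expected (O/E) ratios in this exercise can be checked by exponentiating the score. This is a minimal sketch that assumes the scaling factor λ = 0.347 quoted earlier; the function name `odds_ratio` is made up for illustration.

```python
import math

LAMBDA = 0.347  # scaling factor quoted for the matrix


def odds_ratio(score, lam=LAMBDA):
    """Observed/expected ratio implied by a log-odds score: exp(lam * score)."""
    return math.exp(lam * score)


for s in (0, 5, 10, -10):
    print(f"S = {s:>3}: O/E = {odds_ratio(s):.3f}")

# Part iv: a difference in scores becomes a ratio of O/E values.
print(odds_ratio(10) / odds_ratio(5))
```

Because the relationship is exponential, doubling the score from 5 to 10 squares the O/E ratio rather than doubling it.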
15
Example: BLAST
  • Motivations
  • Exact algorithms are exhaustive but
    computationally expensive.
  • Exact algorithms are impractical for comparing a
    query sequence to millions of other sequences in
    a database (database scanning).
  • Database scanning therefore requires a heuristic
    alignment algorithm (at the cost of optimality).

16
Interpret BLAST results - Description
17
Problems with BLAST
  • Why do results change?
  • How can you compare results from different BLAST
    tools which may report different types of values?
  • How are results (e.g. the E-value) affected by the
    query?
  • There are _many_ values reported in the output:
    what do they mean?

18
Example: Importance of BLAST statistics
  • But, first a review.

19
Review
  • What is a distribution?
  • A plot showing the frequency of a given variable
    or observation.

21
Features of a Normal Distribution
  • Symmetric Distribution
  • Has an average or mean value at the centre
  • Has a characteristic width called the standard
    deviation (S.D., $\sigma$)
  • Most common type of distribution known

$\mu$ = mean
22
Standard Deviations (Z-score)
23
Mean, Median, Mode
[Figure: distribution with the mode, median and mean
marked]
24
Mean, Median, Mode
  • In a Normal Distribution the mean, mode and
    median are all equal
  • In skewed distributions they are unequal
  • Mean - average value, affected by extreme values
    in the distribution
  • Median - the middlemost value, usually lying
    between the mode and the mean
  • Mode - most common value
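A small illustration of how the three measures separate in a skewed distribution, using Python's standard `statistics` module. The sample values are made up for this sketch and do not come from the presentation.

```python
import statistics

# Hypothetical right-skewed sample (a few large values pull the mean up).
data = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

mode = statistics.mode(data)      # most common value
median = statistics.median(data)  # middlemost value
mean = statistics.mean(data)      # average, sensitive to extreme values

print(f"mode = {mode}, median = {median}, mean = {mean}")
```

In this sample the mode is smallest and the mean is largest, with the median in between, which is the usual pattern for a distribution with a long right tail.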

25
Different Distributions
Unimodal Bimodal
26
Other Distributions
  • Binomial Distribution
  • Poisson Distribution
  • Extreme Value Distribution

27
Binomial Distribution
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
$P(x) = (p + q)^n$
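The triangle of coefficients above comes from expanding (p + q)^n. As a minimal sketch, the Python below rebuilds those rows and evaluates the individual binomial terms; the function names `pascal_row` and `binomial_pmf` are hypothetical helpers introduced for this example.

```python
from math import comb


def pascal_row(n):
    """Coefficients of the expansion of (p + q)**n."""
    return [comb(n, k) for k in range(n + 1)]


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n trials with success probability p."""
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)


for n in range(6):
    print(pascal_row(n))

# The terms of (p + q)**n sum to 1 because p + q = 1.
print(sum(binomial_pmf(x, 5, 0.3) for x in range(6)))
```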
28
Poisson Distribution
[Figure: Poisson distribution, P(x) plotted against x]
29
Review
  • What is a distribution?
  • A plot showing the frequency of a given variable
    or observation.
  • What is a null hypothesis?
  • A statistician's way of characterizing chance.
  • Generally, a mathematical model of randomness
    with respect to a particular set of observations.
  • The purpose of most statistical tests is to
    determine whether the observed data can be
    explained by the null hypothesis.

31
Review
  • Examples of null hypotheses
  • Sequence comparison using shuffled sequences.
  • A normal distribution of log ratios from a
    microarray experiment.
  • LOD scores from genetic linkage analysis when the
    relevant loci are randomly sprinkled throughout
    the genome.

32
Empirical score distribution
  • The picture shows a distribution of scores from a
    real database search using BLAST.
  • This distribution contains scores from
    non-homologous and homologous pairs.

High scores from homology.
33
Empirical null score distribution
  • This distribution is similar to the previous one,
    but generated using a randomized sequence
    database.

34
Review
  • What is a p-value?

35
Review
  • What is a p-value?
  • The probability of observing an effect as strong
    or stronger than you observed, given the null
    hypothesis. I.e., How likely is this effect to
    occur by chance?
  • $\Pr(x > S \mid \text{null})$
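When a null distribution is available only as a list of scores (for example from a search against randomized sequences), a p-value can be estimated by counting. This is a minimal sketch: the function name `empirical_p_value` and the null scores are made up for illustration, and the comparison uses "at least as large" to match the "as strong or stronger" wording.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores that are at least as large as the observed score."""
    hits = sum(1 for s in null_scores if s >= observed)
    return hits / len(null_scores)


# Hypothetical scores from comparisons against shuffled sequences.
null_scores = [12, 15, 9, 20, 14, 11, 17, 13, 10, 16]

print(empirical_p_value(18, null_scores))
```

With only ten null scores the estimate cannot go below 0.1 without reaching zero, which is one reason an analytical form of the null distribution is preferred for database searches.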

36
Review
  • What is the name of the distribution created by
    sequence similarity scores, and what does it
    look like?
  • Extreme value distribution, or Gumbel
    distribution.
  • It looks similar to a normal distribution, but it
    has a larger tail on the right.

38
Statistics
  • BLAST scores (and also other local alignment
    scores, e.g. Smith-Waterman and BLAT scores)
    between random, unrelated sequences follow the
    Gumbel Extreme Value Distribution (EVD)
  • $\Pr(s > S) = 1 - \exp(-K m n\, e^{-\lambda S})$
  • This is the probability of randomly encountering
    a score greater than S.
  • S: alignment score
  • m, n: query sequence length and length of the
    database, respectively
  • K, $\lambda$: parameters depending on the scoring
    scheme and sequence composition
  • Bit score: $S' = (\lambda S - \ln K) / \ln 2$
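The two formulas on this slide can be sketched in a few lines of Python, taking the p-value as 1 - exp(-Kmn e^(-λS)) and the bit score as (λS - ln K) / ln 2. The function names and the values of K, λ, m and n below are illustrative assumptions, not values taken from the presentation or from any particular BLAST run.

```python
import math


def evd_p_value(S, m, n, K, lam):
    """Pr(s > S) = 1 - exp(-K*m*n*exp(-lam*S)) under the Gumbel EVD."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-K * m * n * math.exp(-lam * S))


def bit_score(S, K, lam):
    """Normalized score S' = (lam*S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)


# Assumed, illustrative parameters.
K, lam = 0.13, 0.32
m, n = 250, 5_000_000  # query length, database length

for S in (40, 60, 80, 100):
    print(S, round(bit_score(S, K, lam), 1), evd_p_value(S, m, n, K, lam))
```

The bit score folds K and λ into the score itself, which is what allows results from different scoring schemes to be compared.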

39
BLAST output revisited
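[Figure: annotated BLAST output with the bit score S',
raw score S, E-value E, lengths n and m, and the
parameters $\lambda$ and K marked. From ExPASy BLAST]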
40
Review
  • EVD for random BLAST scores
  • Upper tail behaviour:
    $\Pr(s > S) \approx K m n\, e^{-\lambda S}$

This is the EXPECT value (E-value)
41
How to Calculate E-values
  • Think of the databank as one very long random
    sequence, length G
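  • Alignments with $s > S$ occur randomly across the
    genome, with a Poisson distribution
  • $\Pr(\text{highest-scoring alignment } s > S) \approx K m G\, e^{-\lambda S}$
  • $\Pr(\text{no alignment with } s > S) \approx 1 - K m G\, e^{-\lambda S}$
  • Expected number $\mu$ of alignments with $s > S$ is
    given by
  • $e^{-\mu} = 1 - K m G\, e^{-\lambda S}$ (Poisson
    property), so $\mu \approx K m G\, e^{-\lambda S}$ when
    $\mu$ is small
  • $\ln \mu = \ln(K m G) - \lambda S$
  • Threshold: $S = (\ln(K m G) - \ln \mu) / \lambda$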
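A minimal sketch of the threshold calculation, assuming the expected number of chance alignments is μ = KmG e^(-λS), so that the score needed for a chosen μ is S = (ln(KmG) - ln μ) / λ. The function names and all parameter values are hypothetical and chosen only for illustration.

```python
import math


def expected_hits(S, m, G, K, lam):
    """Expected number of chance alignments scoring above S."""
    return K * m * G * math.exp(-lam * S)


def score_threshold(mu, m, G, K, lam):
    """Score at which the expected number of chance alignments equals mu."""
    return (math.log(K * m * G) - math.log(mu)) / lam


def prob_at_least_one(mu):
    """Poisson probability of seeing one or more events when mu are expected."""
    return -math.expm1(-mu)


# Assumed, illustrative parameters.
K, lam = 0.13, 0.32
m, G = 250, 5_000_000  # query length, databank length
target_mu = 0.01

S_min = score_threshold(target_mu, m, G, K, lam)
print(S_min, expected_hits(S_min, m, G, K, lam), prob_at_least_one(target_mu))
```

For small μ the probability of at least one chance hit is almost equal to μ itself, which is why small E-values and p-values are nearly interchangeable.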

48
Summary
  • Want to be able to compare scores in sequences of
    different compositions or different scoring
    schemes
  • Score: $S = \sum(\text{match}) - \sum(\text{gap costs})$
  • Bit score: $S' = (\lambda S - \ln K) / \ln 2$
  • E-value of bit score: $E = m n\, 2^{-S'}$

Score and bit score grow linearly with the length
of the alignment.
E-value shrinks really fast as bit score grows.
E-value grows linearly with the product of target
and query sizes.
Doubling target set size and doubling query
length have the same effect on E-value.
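The scaling statements in this summary can be checked numerically from E = mn 2^(-S'), where S' is the bit score. This is a minimal sketch; the function name `e_value` and the lengths used are hypothetical.

```python
def e_value(bit_score, m, n):
    """E-value for a bit score, query length m and target (database) length n."""
    return m * n * 2.0 ** (-bit_score)


m, n = 250, 5_000_000  # assumed query and database lengths
base = e_value(40, m, n)

print(base)
print(e_value(40, 2 * m, n))  # doubled query length
print(e_value(40, m, 2 * n))  # doubled database size
print(e_value(41, m, n))      # one extra bit of score
```

Each additional bit of score halves the E-value, while doubling either the query length or the database size doubles it, so a hit reported against a larger database needs a higher bit score to reach the same E-value.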
49
Conclusion
  • You should now be able to compare BLAST results
    from different databases, converting values if
    they are reported differently (which happens
    frequently)
  • You should now know why BLAST results might
    change from one day to the next, even on the same
    server
  • You should also understand the dependence of
    the E-value on query length.
  • Statistical rankings are reported for (almost)
    every database search tool. When making
    comparisons between databases, between sequences
    it is useful to know how the statistics are
    derived to know if comparisons are meaningful.

50
  • THE END

51
  • Supplemental
  • Section

52
  • Look through Patterns in sequences (Searching
    for information within sequences) - Some common
    problems and their solutions
  • http://lepo.it.da.ut.ee./mremm/kurs/pattern.htm
  • What is the structure of my sequence?
  • http://speedy.embl-heidelberg.de/gtsp/flowchart2.html
    (clickable!)