Title: Boris Steipe
1ACGC The Applied Computational Genomics Course - Calgary, May 2005
Day 4 - Statistics and Simulation Tests
Department of Biochemistry, Department of Medical Genetics and Microbiology, Program in Proteomics and Bioinformatics
2ACGC Day 4
- A process-centric approach
- Source data
- Process design
- Statistics
- Biological context
The role of statistics
Significance and confidence
Simulation testing
Lab: Programming a simulation test
3Why Statistics ?
Statistics is quantitative reasoning.
4Statistics is quantitative reasoning.
Reasoning:
  What is the true value behind a series of measurements? -> Descriptive statistics
  Is my observation interesting? -> Inferential statistics
Quantifying:
  If we don't know, what is our best guess? -> Probability theory
5Statistics Concepts
Descriptive Statistics: Descriptors, Samples
Probability Concepts: Frequentist and Bayesian approach
Inferential Statistics: Modeling, Testing
6Two main branches of statistics
Descriptive statistics / Inferential statistics
Mean, Variance, Parameters of distribution, Regression
Descriptive statistics summarise, describe and
display data, classify and categorize.
7Two main branches of statistics
Descriptive statistics / Inferential statistics
Hypothesis testing, Inverse problems, Predicting outcomes
Inferential statistics predict properties of
populations when we have only observed samples.
8Descriptors of datasets
Frequency distribution: e.g. histogram
Measures of central tendency:
  Arithmetic mean (average): m = (1/N) * sum_n(x_n)
  Median
  Mode: the most common value in a set of observations
Measures of variability:
  Variance
  Standard deviation (s): square root of the variance
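For reference, the variance and standard deviation written out (standard sample estimators, with the usual N-1 denominator):

s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - m)^2, \qquad s = \sqrt{s^2}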
9Statistics Concepts
Descriptive Statistics: Descriptors, Samples
Probability Concepts: Frequentist and Bayesian approach
Inferential Statistics: Modeling, Testing
10Sampling
We rarely get the chance to observe everything.
Thus we need to base conclusions on samples.
The (almost) inevitable consequence is sampling
error.
The Standard Error of the Mean describes the
expected error of the mean of a population sample
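For a sample of size n with standard deviation s, the standard error of the mean is estimated as:

\mathrm{SEM} = \frac{s}{\sqrt{n}}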
11Statistics Concepts
Descriptive Statistics: Descriptors, Samples
Probability Concepts: Frequentist and Bayesian approach
Inferential Statistics: Modeling, Testing
12Simple probability concepts
Simple probability: p(A) ∈ [0, 1]
Conditional probability: p(A|B)
Joint probability (for independent events): p(A,B) = p(A) × p(B)
Inverse probability: p(not A) = 1 - p(A)
N.b.: if you think about what p(A|B) really means and how it is related to p(A,B), then Bayes' theorem, a fundamental tool in modern statistics, is trivial to derive. Try it.
Exercise: Synthesize an oligonucleotide 80-mer at 99.9% efficiency. What is the probability that a randomly picked clone of a gene built with this oligonucleotide has the correct sequence?
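One way to work the exercise, assuming the 80 coupling steps succeed independently, each with probability 0.999; a one-line Perl check of the arithmetic:

#!/usr/bin/perl
use strict; use warnings;
# probability that all 80 couplings are correct: 0.999^80
printf("p(correct) = %.3f\n", 0.999 ** 80);   # ~ 0.923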
13The probability table ...
Example: Three doors in a game show. Behind two doors there is a goat. Behind one door is a car. You choose one. Then the host says "Wait. I will show you something ..." and he opens a door with a goat behind it. "Now, do you want to keep your choice or switch?" What should you do? What are your chances of choosing the door with the car?
Three ways to view this:
  Every door has p = 1/3. That probability doesn't change if I change my mind: p = 1/3
  The choices are independent. I can make a new choice from two doors: p = 1/2
  The host did not make a random choice. That makes the other door more likely: p = 2/3
And only one is right ...
http://math.ucsd.edu/crypto/Monty/monty.html
14The probability table
Completely enumerate a set of possible outcomes in a table; then you can easily evaluate the probability of a specific outcome.
For example: Is the probability that the numbers on two dice sum to "7" the same as the probability that they sum to "2"?
Sum of Die 1 + Die 2 (all 36 equally likely outcomes):

               Die 2
Die 1      1    2    3    4    5    6
   1       2    3    4    5    6    7
   2       3    4    5    6    7    8
   3       4    5    6    7    8    9
   4       5    6    7    8    9   10
   5       6    7    8    9   10   11
   6       7    8    9   10   11   12

1 of 36 possible outcomes sums to "2":  p = 1/36
6 of 36 possible outcomes sum to "7":   p = 6/36
15The probability table applied ...
Probability table - stay!
Enumerating every combination of car position and first choice, the door you stay with wins in 3 of 9 cases:
p(win) = 1/3

Probability table - switch!
Choose ... the host reveals one goat and removes that door from the choices ... then switch.
Enumerating the same combinations, the door you switch to wins in 6 of 9 cases:
p(win) = 2/3

http://math.ucsd.edu/crypto/Monty/monty.html
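The enumeration can also be cross-checked by simulating the game directly; a minimal Perl sketch (the trial count and door numbering are arbitrary choices, not from the slide):

#!/usr/bin/perl
use strict;
use warnings;

my $trials = 100000;
my ($stay_wins, $switch_wins) = (0, 0);

for (1 .. $trials) {
    my $car    = int(rand(3));                      # door hiding the car
    my $choice = int(rand(3));                      # contestant's first pick
    # the host opens a goat door that is neither the car nor the pick
    my ($open)   = grep { $_ != $car && $_ != $choice } 0 .. 2;
    # switching means taking the remaining closed door
    my ($switch) = grep { $_ != $choice && $_ != $open } 0 .. 2;
    $stay_wins++   if $choice == $car;
    $switch_wins++ if $switch == $car;
}
printf("stay: %.3f   switch: %.3f\n", $stay_wins / $trials, $switch_wins / $trials);

The two printed frequencies converge to 1/3 and 2/3, matching the enumeration above.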
16Statistics Concepts
Descriptive Statistics: Descriptors, Samples
Probability Concepts: Frequentist and Bayesian approach
Inferential Statistics: Modeling, Testing
17Bayesian and frequentist statistics
The Bayesian framework can be proven to be the
only consistent framework for statistical
reasoning in the presence of uncertainty.
Bayesian models must be probabilistic. In
contrast, deterministic models assign only one
single model a probability of 1.
18Bayesian Statistics
- Clearly state hypotheses (models)
- Assign prior probabilities to hypotheses (models)
- Evaluate posterior probabilities (degrees of
belief) for hypotheses in the light of the
available data
Thomas Bayes 1702-1761
According to the "frequentist" view there is
only one true hypothesis, all others are false.
19Bayes' theorem - Inverse Probability
Simple probability: P(D)
Conditional probability: P(M|D)
Given that an event (D, Data) that may have been
the result of any of two or more causes has
occurred, what is the probability that the event
was the result of a particular cause (M, model)?
Bayes' Theorem
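In this notation, Bayes' theorem reads:

P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)}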
20Bayesian statistics Coin toss example
Frequentist: "Fair coin!" -> P(Heads) = 0.5
Bayesian: "Fair coin?"
21Bayesian statistics Coin toss example
Note: P(Head) is to be read as "the probability that the observed outcome is Head" or "the probability that the observed outcome is Tail", depending on what the outcome actually was.
Bayesian: "Fair coin? Janus coin?" (a "Janus" coin shows Heads on both sides). Assume as prior: P(F) = P(J) = 0.5

Toss:     0
Outcome:  -
P(J):     0.5

In the case of two exhaustive and mutually exclusive hypotheses, the simplest a priori assumption is that both hypotheses are equiprobable.
22Bayesian statistics Coin toss example
Bayesian: "Fair coin? Janus coin?" Assume as prior: P(F) = P(J) = 0.5

Toss:     0      1
Outcome:  -      H
P(J):     0.5    0.667
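The step from P(J) = 0.5 to 0.667 is just Bayes' theorem, written out here for clarity:

P(J \mid H) = \frac{P(H \mid J)\, P(J)}{P(H \mid J)\, P(J) + P(H \mid F)\, P(F)}
            = \frac{1.0 \times 0.5}{1.0 \times 0.5 + 0.5 \times 0.5} = \frac{2}{3} \approx 0.667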
23Bayesian statistics Coin toss example
After the observation, we update the prior: P(J) = 0.667

Toss:     0      1      2
Outcome:  -      H      H
P(J):     0.5    0.667  0.8
24Bayesian statistics Coin toss example

Toss:     0      1      2      3
Outcome:  -      H      H      H
P(J):     0.5    0.667  0.8    0.889
25Bayesian statistics Coin toss example

Toss:     0      1      2      3      4
Outcome:  -      H      H      H      H
P(J):     0.5    0.667  0.8    0.889  0.941
26Bayesian statistics Coin toss example

Toss:     0      1      2      3      4      5
Outcome:  -      H      H      H      H      H
P(J):     0.5    0.667  0.8    0.889  0.941  0.970
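A minimal Perl sketch of this sequential updating (variable names and output format are my own, not from the slides):

#!/usr/bin/perl
use strict;
use warnings;

my $pJ   = 0.5;              # prior P(Janus coin)
my $pH_J = 1.0;              # P(Head | Janus)
my $pH_F = 0.5;              # P(Head | fair)

printf("toss 0: P(J) = %.3f\n", $pJ);
my $toss = 0;
for my $outcome ( ('H') x 5 ) {
    my $lJ = $outcome eq 'H' ? $pH_J : 1 - $pH_J;   # likelihood under Janus
    my $lF = $outcome eq 'H' ? $pH_F : 1 - $pH_F;   # likelihood under fair
    $pJ = $lJ * $pJ / ( $lJ * $pJ + $lF * (1 - $pJ) );
    printf("toss %d: %s  P(J) = %.3f\n", ++$toss, $outcome, $pJ);
}

Running this reproduces the sequence 0.667, 0.800, 0.889, 0.941, 0.970 shown in the tables above.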
27Bayesian statistics Coin toss example
What is our best guess for the chance of "Head" in the next throw?
Frequentist: Either 1.0 (if Janus) or 0.5 (if fair).
Bayesian: We have an estimate of P(J) = 0.970; accordingly,
P(Head) = P(H|J) P(J) + P(H|F) P(F) = 1.0 × 0.970 + 0.5 × 0.030 ≈ 0.985
28Marginalization
In Bayes' theorem, many alternative models may need to be considered.
"The total probability to observe some data is the probability to observe it under one model, times the probability of that model, summed over all possible models."
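Written as a formula (standard form of the marginalization over models M):

P(D) = \sum_{M} P(D \mid M)\, P(M)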
29Statistics Concepts
Descriptive Statistics: Descriptors, Samples
Probability Concepts: Frequentist and Bayesian approach
Inferential Statistics: Modeling, Testing
30Modeling in the statistician's sense
Model: A parametrized function that describes a probability distribution which is compatible with the observed data.
Example: Normal (Gaussian) Distribution
Task: Observe data, estimate parameters.
31Probability distribution
For a discrete random variable, the probability distribution gives the probability of each value that the variable can take. The sum of these probabilities is 1.
For a continuous random variable we use the
related concept of a probability density function
(PDF). Its integral is 1.
32Examples of probability distributions: Uniform Distribution
Defined as: Frequency of outcomes of one type in an infinite series.
True if: (1) Independent trials (2) Probabilities do not change (3) Probabilities are equal (4) Series is infinite
[Bar chart: p = 0.25 for each of A, C, G, T]
Example: Number of Cytosines in a random sequence. (pC = 0.25)
p_i = k (constant for all i);  observed frequency f_i = n_i / N
Important because: "Equiprobable" is our best guess in the absence of other information.
http://mathworld.wolfram.com/UniformDistribution.html
33Examples of probability distributions: Geometric Distribution
Defined as: Probability of n consecutive outcomes of one type in a series of trials.
True if: (1) Independent trials (2) Probabilities do not change
Example: Length of reading frames, n, in a random sequence. (p = 61/64)
Important because: (1) P(n) is easy to calculate (2) p is often unknown and can be estimated from the distribution
http://mathworld.wolfram.com/GeometricDistribution.html
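For the reading-frame example, the probability of exactly n consecutive non-stop codons (each with probability p = 61/64) followed by a stop codon is:

P(n) = p^{\,n}\,(1 - p)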
34Examples of probability distributions: Binomial Distribution
Defined as: Distribution of the number of outcomes of one type in a repeated series of trials.
True if: (1) Counting one of two possible outcomes (2) Independent trials in the series (3) Probabilities do not change (4) The number of trials in each series is defined in advance, not determined by the outcome.
Example: Number of mutations, n, in a series of N mutation events. (p = 0.75)
Important because: (1) P(n) is easy to calculate. (2) p is often unknown and can be estimated from the distribution. (3) The binomial distribution is the discrete analogue of the normal distribution.
http://mathworld.wolfram.com/BinomialDistribution.html
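The distribution itself, for n outcomes of the counted type in N trials:

P(n) = \binom{N}{n}\, p^{\,n} (1 - p)^{\,N - n}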
35Examples of probability distributions: Poisson Distribution
Defined as: For a large number of trials N, the probability of obtaining n outcomes of one type if the expected total number is Np. (See function.)
True if: (1) Independent trials (2) Probabilities do not change (3) Large N (i.e. > 10)
Example: Runs of CpG pairs in a random DNA sequence. (p = 1/16)
Important because: The binomial distribution is unwieldy for large N, because the binomial coefficient becomes hard to calculate; the Poisson distribution is a good approximation.
http://mathworld.wolfram.com/PoissonDistribution.html
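With the expected number of events λ = Np, the Poisson probability of n events is:

P(n) = \frac{\lambda^{n} e^{-\lambda}}{n!}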
36Examples of probability distributions: Normal (Gaussian) Distribution
CDF: Cumulative Distribution Function
Defined as: Limit case of the binomial distribution as N -> ∞ (see formula), with population mean m and variance s².
True if: (1) Independent trials (2) Unbounded possibilities
s defines confidence intervals: 2s ≈ 95.4%
Example: Size distributions.
Determine percentile from s.
Important because: "Central Limit Theorem" - the Normal Distribution is the consequence of many small, random fluctuations. Very often found in nature.
http://mathworld.wolfram.com/NormalDistribution.html
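The probability density, in the slide's notation (mean m, standard deviation s):

f(x) = \frac{1}{s\sqrt{2\pi}} \exp\!\left( -\frac{(x - m)^2}{2 s^2} \right)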
37Examples of probability distributions: Extreme Value Distribution
Defined as: Probability of outcomes when the best of N in a series is chosen.
Example: Scores of a sequence alignment.
Determine percentile from the CDF D(x).
Important because: Occurs frequently in nature.
http://mathworld.wolfram.com/ExtremeValueDistribution.html
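The type I (Gumbel) form, commonly used for alignment scores, has the cumulative distribution function (location μ, scale β):

D(x) = \exp\!\left( -e^{-(x - \mu)/\beta} \right)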
38Statistics Concepts
Descriptive Statistics: Descriptors, Samples
Probability Concepts: Frequentist and Bayesian approach
Inferential Statistics: Modeling, Testing
39Error types
Four possible cases for a conjecture
                         In reality something is:
                         True          False
We believe it is:  True
                   False
This is really important, because, to reason
about probabilities, we have to take the entire
range of possibilities into account.
40Error types
Null Hypothesis: A hypothesis that an observation is not due to the effect that is being investigated. Usually we postulate the null hypothesis that an observation is due to random chance.
It is Something and we think it is Something: we correctly reject the Null Hypothesis. (Governed by the sensitivity of the test.)
It is Nothing and we think it is Something: Type I error, false positive - we falsely reject the Null Hypothesis.
It is Something and we think it is Nothing: Type II error, false negative - we falsely accept the Null Hypothesis.
It is Nothing and we think it is Nothing: we correctly accept the Null Hypothesis. (Governed by the specificity of the test.)
41Estimators and confidence intervals
Estimator: A rule for calculating an estimate from a set of observations. (E.g. the average is an estimator for the mean.)
Confidence interval: The probability that a measurement will fall within an interval. For a normally distributed random variable, this can be described as a function of the standard deviation s:
  ±1s: CI = 0.682
  ±2s: CI = 0.954
  ±3s: CI = 0.997
42Significance
P-value: The probability P(d) that a variable would assume a value greater than or equal to the observed one by random chance.
Significance: By convention (!)
  P(d) >= 0.05: not significant
  0.01 <= P(d) < 0.05: significant (*)
  P(d) < 0.01: highly significant (**)
43Z-score
Z-score: Deviation from the mean, normalized by the standard deviation. The Z-score can be directly related to a p-value if the distribution is normal.
Good: Normalization by s allows us to compare different series of results.
Evil: s may be misleading for skewed distributions.
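In the slide's notation (mean m, standard deviation s):

Z = \frac{x - m}{s}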
44Statistics Concepts
Descriptive Statistics: Descriptors, Samples
Probability Concepts: Frequentist and Bayesian approach
Inferential Statistics: Modeling, Testing
45Significance
P-value: The probability P(d) that a variable would assume a value greater than or equal to the observed one by random chance.
Significance: By convention (!)
  P(d) >= 0.05: not significant
  0.01 <= P(d) < 0.05: significant (*)
  P(d) < 0.01: highly significant (**)
A significance level is a purely arbitrary convention. There is nothing that justifies using 0.01 rather than, say, 0.017 or 0.00932.
46Significance Beware !
"Significance" does not make a statement about the truth or falsehood of our model; it makes a statement about the likelihood of the null hypothesis! A significance level is no more and no less than a boundary at which we cease to accept the null hypothesis as a reasonable explanation.
47Simple testing
Given a specific model, we can easily reason
about whether a certain outcome is
significant. However ... Is the model
appropriate? What about multiple tests ? How to
test?
48Multiple testing
Problem: When we repeat an experiment many times, we inflate the probability of observing a positive outcome by chance.
Example: Searching for homology of a gene in a genome.
Bonferroni correction: Scale the confidence level by the number of tests. E.g. a 90% confidence level needs to be Bonferroni-corrected to 99% if we perform ten tests. Simple, and conservative.
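In terms of the per-test significance threshold, the correction is simply:

\alpha_{\text{per test}} = \frac{\alpha}{m}, \qquad \text{e.g. } \frac{0.10}{10} = 0.01 \text{ (i.e. a 99\% confidence level for ten tests)}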
49Statistical tests
One variable:
  One group:         normal? yes -> mean, SD        no -> binomial
  Two groups:        normal? yes -> t-test          no -> Chi-square
  3 or more groups:  normal? yes -> ANOVA           no -> non-parametric
Most statistical tests have been analytically derived. They either assume normal distributions, or at least PDFs that can be described in functional form, or require large datasets. Choosing the right test can be intimidating.
Alternatively ...
50Simulation tests
Simulation tests explicitly simulate a model (null hypothesis) and compare its properties against a set of observations. Typically this will be done through:
- ab initio generation of data from known or assumed distributions
- randomized resampling, permutation and shuffling
- counting trials and events
The assumptions we need to put into a simulation test lie in the biologically reasonable construction of the model, not in the subtleties of the mathematics that underlie statistical tests.
51Bootstrap test
Bootstrap: A type of randomization test that is typically used for the estimation of confidence intervals.
Goal: Evaluate confidence in a statistic (e.g. the mean) of samples from an unknown distribution.
Procedure: Take n observations. Randomly choose n samples with replacement. Calculate the mean m. Repeat R times. Calculate mean and variance from the R means.
Confidence interval: If 2.5% of values are < A, and 2.5% of values are > B, the 95% confidence interval is given by [A, B]. (Quantile method)
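A minimal Perl sketch of this procedure; the example data and the number of resamples R are placeholders, and the confidence interval is taken by the quantile method described above:

#!/usr/bin/perl
use strict;
use warnings;

my @obs = (4.1, 3.8, 5.2, 4.9, 4.4, 3.7, 5.0, 4.6);   # example observations
my $R   = 10000;                                       # number of resamples

my @means;
for (1 .. $R) {
    my $sum = 0;
    # resample n observations with replacement
    $sum += $obs[ int(rand(@obs)) ] for 1 .. @obs;
    push @means, $sum / @obs;
}

@means = sort { $a <=> $b } @means;
my $lo = $means[ int(0.025 * $R) ];    # 2.5% quantile
my $hi = $means[ int(0.975 * $R) ];    # 97.5% quantile
printf("95%% CI for the mean: [%.2f, %.2f]\n", $lo, $hi);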
52Simulation test for significant deviation from
background
Scenario In four nonhomologous transcription
factors, we find the following residues adjacent
to the residue which we know contacts DNA
C, M, C, S
Is this a significant observation ?
53Simulation test for significant deviation from
background
First guess: the probability of observing the residues C, M, C, S
p(CMCS) ≈ 7 × 10^-7
Hm. That seems too large. What are we missing here?
[f_exp: expected background amino-acid frequencies, from a sequence database]
54Simulation test for significant deviation from
background
First guess: the probability of observing C, M, C, S:
p(CMCS) ≈ 7 × 10^-7
p(SVWK) ≈ 4 × 10^-6
p(ACDE) ≈ 5 × 10^-6
p(RRRR) ≈ 7 × 10^-6
Obviously, the probability of a single observation is not what we are looking for ...
[f_exp: expected background amino-acid frequencies, from a sequence database]
55Simulation test for significant deviation from
background
... We actually want to know how "special" our observation is, with respect to any other observation of four residues we could have made on the background distribution. To do this, we have to define a metric - some measure that quantifies how much the observation deviates from the expectation. There are many ways to define a metric for our case. Let's compare the observation and the background distribution by calculating the vector distance of their frequencies: the distance of two vectors is the square root of the sum of squared differences of their components.

d_obs = sqrt( sum( (f_exp - f_obs)^2 ) )
56Simulation test for significant deviation from
background
Observations: C M C S

f_obs:
L 0.00  A 0.00  G 0.00  S 0.25  V 0.00  E 0.00  T 0.00  K 0.00  I 0.00  D 0.00
R 0.00  P 0.00  N 0.00  Q 0.00  F 0.00  Y 0.00  M 0.25  H 0.00  C 0.50  W 0.00

d_obs = sqrt( sum( (f_exp - f_obs)^2 ) ) = 0.6054

What does this number mean?
Not much, if taken alone, but we can compare it with many random trials to see how different it is from what we should expect!

[f_exp: expected background amino-acid frequencies, from a sequence database]
57Simulation test for significant deviation from
background
Observations: C M C S

f_obs:
L 0.00  A 0.00  G 0.00  S 0.25  V 0.00  E 0.00  T 0.00  K 0.00  I 0.00  D 0.00
R 0.00  P 0.00  N 0.00  Q 0.00  F 0.00  Y 0.00  M 0.25  H 0.00  C 0.50  W 0.00

d_obs = sqrt( sum( (f_exp - f_obs)^2 ) )   ( d_obs = 0.6054 )

Simulation: Repeat 10,000 times:
  pick 4 simulated observations with p = f_exp
  calculate f_sim
  calculate d_sim = sqrt( sum( (f_exp - f_sim)^2 ) )
  if d_obs > d_sim, increment counter
report counter

In a simulation test, the observed value exceeded the simulated value 97,485 times out of 100,000 trials.

[f_exp: expected background amino-acid frequencies, from a sequence database]
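Since the lab asks for exactly this kind of test, here is a minimal Perl sketch of the procedure. The background frequencies f_exp are uniform placeholders (my assumption), not the database frequencies used on the slide:

#!/usr/bin/perl
use strict;
use warnings;

my @aa = qw(L A G S V E T K I D R P N Q F Y M H C W);
# placeholder background frequencies (uniform); replace with database values
my %f_exp = map { $_ => 1 / @aa } @aa;

my @obs    = qw(C M C S);                 # observed residues
my $d_obs  = distance( freq(@obs) );      # distance of observation from background
my $trials = 100000;
my $count  = 0;

for (1 .. $trials) {
    my @sim   = map { pick() } 1 .. @obs; # 4 residues drawn with p = f_exp
    my $d_sim = distance( freq(@sim) );
    $count++ if $d_obs > $d_sim;
}
printf("observed distance exceeded simulated distance in %d of %d trials\n",
       $count, $trials);

# frequencies of residues in a small sample
sub freq {
    my %f = map { $_ => 0 } @aa;
    $f{$_} += 1 / @_ for @_;
    return \%f;
}

# Euclidean distance between sample frequencies and f_exp
sub distance {
    my ($f) = @_;
    my $sum = 0;
    $sum += ( $f_exp{$_} - $f->{$_} ) ** 2 for @aa;
    return sqrt($sum);
}

# draw one residue according to the background distribution
sub pick {
    my $r = rand();
    for my $a (@aa) {
        return $a if ( $r -= $f_exp{$a} ) < 0;
    }
    return $aa[-1];    # guard against rounding
}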
58Simulation test for significant deviation from background
"Classical" statistics:
  Assume properties of probability distributions.
  Integrate over all possibilities.
  Calculate how deviant the observation is compared to the integral.

Simulation:
  Assume a model.
  Randomly generate simulations.
  Count how different the observation is among the simulations.

Simulation: Repeat 10,000 times:
  pick 4 simulated observations with p = f_exp
  calculate f_sim
  calculate d_sim = sqrt( sum( (f_exp - f_sim)^2 ) )
  if d_obs > d_sim, increment counter
report counter
59Simulation test for significant deviation from background
"Classical" statistics:
  Assume properties of probability distributions.
  Integrate over all possibilities.
  Calculate how deviant the observation is compared to the integral.
The assumptions are in the mathematical properties of the distributions. Integration requires significant(!) technical skill.

Simulation:
  Assume a model.
  Randomly generate simulations.
  Count how different the observation is among the simulations.
The assumptions are in the description of the model. Simulation requires almost no technical skill.
60Arbitrary event probabilities
The principle: Divide the random interval into subintervals. Pick a random number. See which subinterval it falls in.
[Diagram: the interval [0, 1) divided into subintervals for A, C, G, T, here with 80% GC]
... 0.119468807332801   0.715035324721736   0.975161340914457 ...
GGGAACGCAGGGCGAACAGTGGGGCCGGCTCGTCGGCGGGCTCGGCGACG
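A minimal Perl sketch of this principle, assuming 80% GC means p(G) = p(C) = 0.4 and p(A) = p(T) = 0.1 (my assumption; the slide does not give the exact split):

#!/usr/bin/perl
use strict;
use warnings;

# cumulative upper bounds of the subintervals of [0, 1):
# A -> [0, 0.1), C -> [0.1, 0.5), G -> [0.5, 0.9), T -> [0.9, 1.0)
my @cut = ( [ 'A', 0.1 ], [ 'C', 0.5 ], [ 'G', 0.9 ], [ 'T', 1.0 ] );

for (1 .. 50) {
    my $r = rand();
    for my $c (@cut) {
        if ($r < $c->[1]) { print $c->[0]; last; }
    }
}
print "\n";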
61Arbitrary probability distributions
The principle:
1. Pick a random number x from a bounding interval.
2. Evaluate the target distribution function at that number: f(x).
3. Pick a second random number y.
4. Declare an event if y <= f(x).
[Diagram: x and y plotted against the target density f(x); events are declared for points that fall under the curve]
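A minimal Perl sketch of this rejection procedure, using a Gaussian-shaped target density as an example (the target function and bounding interval are my own choices for illustration):

#!/usr/bin/perl
use strict;
use warnings;

# target density (unnormalized): a Gaussian bump centred at 0, values in (0, 1]
sub f { my ($x) = @_; return exp(-$x * $x / 2); }

my @events;
while (@events < 1000) {
    my $x  = -4 + rand(8);           # 1. x from the bounding interval [-4, 4)
    my $fx = f($x);                  # 2. evaluate the target at x
    my $y  = rand();                 # 3. second random number in [0, 1)
    push @events, $x if $y <= $fx;   # 4. accept (declare an event) if y <= f(x)
}
printf("kept %d samples, e.g. %.3f %.3f %.3f\n", scalar @events, @events[0 .. 2]);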
62Simulation test for significant deviation from
background
Simulation:
  Assume a model.
  Randomly generate simulations.
  Count how different the observation is among the simulations.
Generating random models in Perl
63rand()
#!/usr/bin/perl
use warnings;
use strict;

for (my $i = 0; $i < 10; $i++) {
    print( rand(), "\n" );
}
exit();

The function rand() returns independent and identically distributed (iid) random numbers in the interval [0, 1) (i.e. including 0 but excluding 1).

test.pl
0.744121346230717
0.796214121251495
0.715035324721736
0.119468807332801
0.975161340914457
0.730926711097695
0.697189269025639
0.444949975045425
0.641941750413398
0.984630194085867
64rand(<val>)
#!/usr/bin/perl
use warnings;
use strict;

for (my $i = 0; $i < 10; $i++) {
    print( int(rand(6)) + 1, " " );
}
print("\n");
exit();

Passing an argument to rand multiplies the interval by that argument.

test.pl
2 4 4 2 4 2 2 6 4 2
test.pl
5 6 6 5 6 2 5 2 6 3
test.pl
6 5 5 4 5 4 1 5 5 4
test.pl
1 5 3 1 6 2 3 1 5 6
65srand(), srand(<val>)
#!/usr/bin/perl
use warnings;
use strict;

srand(4169467741);
for (my $i = 0; $i < 10; $i++) {
    print( int(rand(6)) + 1, " " );
}
print("\n");
exit();

srand() "seeds" rand() with an unpredictable number. You don't need to call srand(); it is called by default. Once. Once is enough!
srand(<val>) seeds it with the argument you pass it. Then all "random" sequences are the same!

test.pl
6 1 4 2 2 3 1 6 3 3
test.pl
6 1 4 2 2 3 1 6 3 3
test.pl
6 1 4 2 2 3 1 6 3 3
test.pl
6 1 4 2 2 3 1 6 3 3
Very useful for debugging !
66A random string
#!/usr/bin/perl
use warnings;
use strict;

my @nuc = ('A', 'C', 'G', 'T');
for (my $i = 0; $i < 50; $i++) {
    print( $nuc[ int( rand(4) ) ] );
}
print("\n");
exit();

Assign characters to an array. Randomly choose an element from the array. Print the element.

test.pl
CATAGGGCTGGGAATGCGAGATACATGCAACGATCCCTGTAGCTGAATGG
test.pl
GTTGATGGTCATGTACCCGAAGATAACTGCTCACCCGAACTCATGTTACG
test.pl
TGCTCACAACTCACATCACTAGCGCTATTTGGGCAACTATTCCTTGAACG