1
ACGC - The Applied Computational Genomics Course,
Winnipeg, June 2004
Day 4 - Statistics and Simulation Tests
  • Boris Steipe

Department of Biochemistry, Department of Medical
Genetics and Microbiology, Program in Proteomics
and Bioinformatics
2
Bioinformatics: Computable Models of Biology
Model
Formalism
Algorithm
Conclusions
3
Modeling in the statistician's sense
Model: a parametrized function that describes a
probability distribution which is compatible with
the observed data.
Task: observe data, estimate parameters.
4
Two main branches of statistics
The two branches: descriptive statistics and
inferential statistics.
Descriptive statistics: mean, variance, parameters
of a distribution, regression.
Descriptive statistics summarise, describe and
display data, classify and categorize.
5
Two main branches of statistics
The two branches: descriptive statistics and
inferential statistics.
Inferential statistics: hypothesis testing,
inverse problems, predicting outcomes.
Inferential statistics predict properties of
distributions when we have only investigated
samples.
6
Statistics are implicit in all kinds of reasoning.
What is the true value behind a series of
measurements? Descriptive statistics, probability
theory.
Is my observation interesting? Inference,
hypothesis testing, inverse problems.
7
Descriptive Statistics
Arithmetic mean (average): m = (1/N) Σ_n x_n
Median
Mode: the most common value in a set of
observations (determined empirically)
Variance
Standard deviation: the square root of the variance
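A minimal Perl sketch that computes these quantities for a small
set of made-up example observations:

#!/usr/bin/perl
use warnings;
use strict;

my @x = (4, 8, 15, 16, 23, 42);   # example observations

# arithmetic mean
my $mean = 0;
$mean += $_ for @x;
$mean /= scalar(@x);

# median: middle value of the sorted observations
my @sorted = sort { $a <=> $b } @x;
my $mid = int(@sorted / 2);
my $median = (@sorted % 2) ? $sorted[$mid]
                           : ($sorted[$mid - 1] + $sorted[$mid]) / 2;

# sample variance and standard deviation
my $var = 0;
$var += ($_ - $mean) ** 2 for @x;
$var /= (scalar(@x) - 1);
my $sd = sqrt($var);

printf("mean %.3f  median %.3f  variance %.3f  SD %.3f\n",
       $mean, $median, $var, $sd);
exit();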
8
Frequency and Probability
Frequency is simply defined as the number of
outcomes of a certain type in a series of events.
Probability is a number between 0 and 1 that can
either be seen as the fraction of outcomes of that
type among all events, or as an inherent property
of the event-generating mechanism that dictates
the likelihood of an outcome of a certain type for
each individual event.
9
Estimators
Estimator: a rule for calculating an estimate
from a set of observations. (E.g. the average is
an estimator for the mean.)
10
Probability distribution
For a discrete random variable, the probability
distribution is the set of values that the
variable can take, together with their
probabilities.
For a continuous random variable we use the
related concept of a probability density function.
11
Examples of probability distributions: Uniform
Distribution
Defined as: frequency of outcomes of one type in
an infinite series.
True if: (1) independent trials; (2) probabilities
do not change; (3) probabilities are equal;
(4) the series is infinite.
(Figure: bar chart of A, C, G, T, each with
probability 0.25.)
Example: number of cytosines in a random
sequence (p(C) = 0.25).
p_i = 1/k ;  estimated from counts as f_i = n_i / N
Important because: it is our best guess in the
absence of information.
http://mathworld.wolfram.com/UniformDistribution.html
12
Examples of probability distributions: Geometric
Distribution
Defined as: probability of consecutive outcomes
of one type in a series of trials.
True if: (1) independent trials; (2) probabilities
do not change.
Example: length of reading frames, n, in a
random sequence (p = 61/64).
Important because: (1) P(Y) is easy to calculate;
(2) p is often unknown and can be estimated from
the distribution.
http://mathworld.wolfram.com/GeometricDistribution.html
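For reference, one common parametrization of the geometric
distribution (the probability of a run of exactly n consecutive
outcomes, each with probability p, before the first outcome of the
other type) is:

    P(Y = n) = p^{n} (1 - p), \qquad n = 0, 1, 2, \ldots

Note that some references instead count the number of trials up to
and including the first "failure"; check which convention a given
source uses.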
13
Examples of probability distributions: Binomial
Distribution
Defined as: number of outcomes of one type in a
series of trials.
True if: (1) one of two possible outcomes;
(2) independent trials; (3) probabilities do not
change; (4) the number of trials is defined in
advance, not determined by the outcome.
Example: number of mutations, n, in a series of
N mutation events (p = 0.75).
Important because: (1) P(Y) is easy to calculate;
(2) p is often unknown and can be estimated from
the distribution; (3) it is the discrete analogue
of the normal distribution.
http://mathworld.wolfram.com/BinomialDistribution.html
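For reference, the standard binomial probability of n outcomes of
one type in N independent trials, each with probability p, is:

    P(n) = \binom{N}{n} \, p^{n} (1 - p)^{N - n}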
14
Examples of probability distributions: Poisson
Distribution
Defined as: for a large number of trials N, the
probability of obtaining n outcomes of one type
when the expected number of such outcomes is Np
(see the formula below).
True if: (1) independent trials; (2) probabilities
do not change; (3) N is large (i.e. > 10).
Example: runs of CpG pairs in a random DNA
sequence (p = 1/16).
Important because: the binomial distribution is
unwieldy for large N.
http://mathworld.wolfram.com/PoissonDistribution.html
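For reference, the standard Poisson probability of observing n
events when the expected number is \lambda = Np is:

    P(n) = \frac{\lambda^{n} e^{-\lambda}}{n!}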
15
Examples of probability distributions: Normal
(Gaussian) Distribution
Defined as: the limit case of the binomial
distribution as N → ∞ (see the formula below),
with population mean μ and variance σ².
True if: (1) independent trials; (2) unbounded
possibilities. σ defines confidence intervals:
± 2σ covers about 95.4%.
Example: size distributions.
Determine the percentile from σ.
CDF: Cumulative Distribution Function.
Important because of the "Central Limit Theorem":
the normal distribution is the consequence of many
small, random fluctuations. Very often found in
nature.
http://mathworld.wolfram.com/NormalDistribution.html
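For reference, the standard normal (Gaussian) probability density
with mean \mu and variance \sigma^{2} is:

    f(x) = \frac{1}{\sigma \sqrt{2\pi}}
           \exp\!\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right)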
16
Examples of probability distributions: Extreme
Value Distribution
Defined as: probability of outcomes when the best
of N in a series is chosen.
Example: scores of a sequence alignment.
Determine the percentile from the CDF D(x).
Important because: it occurs frequently in nature.
http://mathworld.wolfram.com/ExtremeValueDistribution.html
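For reference, the CDF D(x) of the (Gumbel-type) extreme value
distribution, with location parameter a and scale parameter b,
takes the standard form:

    D(x) = \exp\!\left( -e^{-(x - a)/b} \right)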
17
Confidence Intervals: Normal Distribution
Confidence interval
An estimated range of values which is likely to
include an unknown population parameter. For
example, we can estimate the mean of a parameter
that describes a population from the mean of
samples taken from that population. The
confidence interval around the sample mean is
set to represent a given probability that the
population mean lies within the interval. For a
normally distributed random variable, this can be
described as a function of the standard deviation σ:
value ± σ :  CI = 0.682
value ± 2σ:  CI = 0.954
value ± 3σ:  CI = 0.997
18
Hypothesis testing
Arguably the most important task of inferential
statistics, hypothesis testing deals with the
ubiquitous question of whether two estimators are
different.
19
Confidence Intervals: Arbitrary Probability
Distribution
In an arbitrary distribution, confidence
intervals can be simply expressed in terms of the
sample variance.
20
Error types
Null Hypothesis: a hypothesis that an observation
is not due to the effect that is being
investigated. Usually we postulate the null
hypothesis that an observation is due to random
chance.
It is Something and we think it is Something: we
correctly reject the Null Hypothesis. (Governed
by the sensitivity of the test.)
It is Nothing and we think it is Something: Type
I error, false positive - we falsely reject the
Null Hypothesis.
It is Something and we think it is Nothing: Type
II error, false negative - we falsely accept the
Null Hypothesis.
It is Nothing and we think it is Nothing: we
correctly accept the Null Hypothesis. (Governed
by the specificity of the test.)
21
Significance
P-value: the probability P(d) that a variable
would assume a value greater than or equal to the
observed one by random chance.
Significance, by convention (!):
  P(d) >= 0.05          not significant
  0.01 <= P(d) < 0.05   significant (*)
  P(d) < 0.01           highly significant (**)
A significance level is a purely arbitrary
convention. There is nothing that justifies using
0.01 rather than, say, 0.017 or 0.00932.
22
Significance: Beware!
Significance does not make a statement about the
truth or falsehood of our model, nor about the
probability of the null hypothesis! It is a
statement about the likelihood of the
observations, given that the null hypothesis is
true.
23
Z-score
Z-score: deviation from the mean, normalized by
the standard deviation. A Z-score can be directly
related to a p-value if the distribution is
normal.
Good: normalization with σ allows us to compare
different series of results.
Evil: σ may be misleading for skewed populations.
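For reference, the Z-score of an observation x from a distribution
with mean \mu and standard deviation \sigma is:

    z = \frac{x - \mu}{\sigma}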
24
Statistical tests
One variable:
  One group:   normally distributed?  Yes: mean, SD.  No: binomial test.
  Two groups:  normally distributed?  Yes: t-test.    No: Chi-square test.
  3+ groups:   normally distributed?  Yes: ANOVA.     No: non-parametric tests.
But most statistical tests have been
analytically derived. They either assume normal
distributions or at least PDFs that can be
described in functional form, or require large
datasets. Choosing the right test can be
intimidating. Alternatively ...
25
Simulation test for significant deviation from
background
Scenario: in four nonhomologous transcription
factors, we find the following residues adjacent
to the residue which we know contacts DNA:
C, M, C, S
Is this a significant observation?
26
Simulation test for significant deviation from
background
First guess: the probability of the observation
C, M, C, S is p(CMCS) ≈ 7 × 10^-7.
Hm. That seems too large. What are we missing
here?
(Figure: f_exp, the background amino-acid
frequencies, from a sequence database.)
27
Simulation test for significant deviation from
background
First guess: the probability of the observation
C, M, C, S.
p(CMCS) ≈ 7 × 10^-7
p(SVWK) ≈ 4 × 10^-6
p(ACDE) ≈ 5 × 10^-6
p(RRRR) ≈ 7 × 10^-6
Obviously, the probability of a single
observation is not what we are looking for ...
(Figure: f_exp, the background amino-acid
frequencies, from a sequence database.)
28
Simulation test for significant deviation from
background
.... We actually want to know how "special" our
observation is, with respect to any other
observation of four residues we could have made
on the background distribution. Let's compare
the observation and the background distribution
by calculating the vector distance of their
frequencies. The distance of two vectors is the
square root of the sum of the squared differences
of their components:
d_obs = sqrt( sum( (f_exp - f_obs)^2 ) )
29
Simulation test for significant deviation from
background
f_obs (from the observations C M C S):
C 0.50, M 0.25, S 0.25, all 17 other residues 0.00
d_obs = sqrt( sum( (f_exp - f_obs)^2 ) ) = 0.6054
What does this number mean?
Not much, if taken alone, but we can compare it
with many random trials to see how different it
is from what we should expect!
(Figure: f_obs from the observations, f_exp from a
sequence database.)
30
Simulation test for significant deviation from
background
f_obs (from the observations C M C S):
C 0.50, M 0.25, S 0.25, all 17 other residues 0.00
d_obs = sqrt( sum( (f_exp - f_obs)^2 ) )
Simulation: repeat 10,000 times:
  pick 4 simulated observations with p = f_exp
  calculate f_sim
  calculate d_sim = sqrt( sum( (f_exp - f_sim)^2 ) )
  if d_obs > d_sim, increment a counter
Report the counter.
(Figure: f_obs from the observations, f_exp from a
sequence database.)
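A minimal Perl sketch of this simulation test, following the
pseudocode above. The background frequencies in %fexp are
placeholder values only (the real numbers would come from a
sequence database), and the subroutine names are illustrative:

#!/usr/bin/perl
use warnings;
use strict;

# observed residues adjacent to the DNA-contacting position
my @obs = ('C', 'M', 'C', 'S');

# background amino-acid frequencies - PLACEHOLDER values that sum to 1
my %fexp = (
    A => 0.08, C => 0.02, D => 0.05, E => 0.06, F => 0.04,
    G => 0.07, H => 0.02, I => 0.06, K => 0.06, L => 0.10,
    M => 0.02, N => 0.04, P => 0.05, Q => 0.04, R => 0.05,
    S => 0.07, T => 0.06, V => 0.07, W => 0.01, Y => 0.03,
);

my $dObs = distance( frequencies(@obs) );

my $nTrials = 10000;
my $counter = 0;
for (my $i = 0; $i < $nTrials; $i++) {
    my @sim  = map { sample() } 1 .. scalar(@obs);  # 4 simulated residues
    my $dSim = distance( frequencies(@sim) );
    $counter++ if ($dObs > $dSim);
}
printf("d_obs = %.4f; larger than d_sim in %d of %d trials\n",
       $dObs, $counter, $nTrials);
exit();

# draw one residue with probability given by %fexp
sub sample {
    my $r = rand();
    foreach my $aa (sort keys %fexp) {
        $r -= $fexp{$aa};
        return $aa if ($r < 0);
    }
    return (sort keys %fexp)[-1];   # guard against rounding
}

# relative frequencies of a list of residues, as a hash reference
sub frequencies {
    my %f = map { $_ => 0 } keys %fexp;
    $f{$_} += 1.0 / scalar(@_) foreach @_;
    return \%f;
}

# Euclidean distance between a frequency vector and the background
sub distance {
    my ($f) = @_;
    my $sum = 0;
    $sum += ($fexp{$_} - $f->{$_}) ** 2 foreach keys %fexp;
    return sqrt($sum);
}

As on the slide, the counter counts simulations that are less
deviant than the observation; (10,000 - counter) / 10,000 then
estimates how often a random draw is at least as deviant as the
observed one.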
31
Multiple testing
Problem: when we repeat an experiment many times,
we inflate the probability of observing a
positive outcome by chance.
Example: searching for homology of a gene in a
genome.
Bonferroni correction: scale the confidence level
by the number of tests. E.g. a 90% confidence
level needs to be Bonferroni-corrected to 99% if
we perform ten tests. Simple, and conservative.
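In formula form: to keep a family-wise significance level \alpha
across m tests, each individual test is performed at

    \alpha_{\text{per test}} = \frac{\alpha}{m}

which reproduces the example above: \alpha = 0.10 (90% confidence)
over ten tests gives 0.01 (99% confidence) per test.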
32
Simulation test for significant deviation from
background
"Classical" Statistics Assume properties of
probability distributions Integrate over all
possibilities Calculate how deviant the
observation is compared to the integral
Simulation Assume a model Randomly generate
simulations Count how different the observation
is among the simulations
Simulation Repeat 10,000 times pick 4
simulated obs with p fexp calculate fsim
calculate dsim sqrt(sum((fexp-fsim)2)) if
dobs gt dsim increment counter report counter
33
Simulation test for significant deviation from
background
"Classical" Statistics Assume properties of
probability distributions Integrate over all
possibilities Calculate how deviant the
observation is compared to the integral
The assumptions are in the mathematical
properties of the distributions. Integration
requires significant(!) technical skill.
Simulation Assume a model Randomly generate
simulations Count how different the observation
is among the simulations
The assumptions are in the description of the
model. Simulation requires almost no technical
skill.
34
Simulation tests
  • Simulation tests explicitly simulate a model
    (null hypothesis) and compare its properties
    against a set of observations. Typically this
    will be done through
  • ab initio generation of data from known or
    assumed distributions
  • randomized resampling
  • permutation and shuffling

The assumptions of simulation tests are in the
construction of the model, no longer in the
properties of the underlying distribution
functions.
35
Simulation test for significant deviation from
background
Simulation: assume a model; randomly generate
simulations; count how different the observation
is among the simulations.
Generating random models in Perl
36
rand()
#!/usr/bin/perl
use warnings;
use strict;
for (my $i = 0; $i < 10; $i++) {
    print( rand(), "\n" );
}
exit();

The function rand() returns independent and
identically distributed (iid) random numbers in
the interval [0, 1) (i.e. including 0 but
excluding 1).

test.pl
0.744121346230717
0.796214121251495
0.715035324721736
0.119468807332801
0.975161340914457
0.730926711097695
0.697189269025639
0.444949975045425
0.641941750413398
0.984630194085867
37
rand(<val>)
#!/usr/bin/perl
use warnings;
use strict;
for (my $i = 0; $i < 10; $i++) {
    print( int(rand(6)) + 1, " " );
}
print("\n");
exit();

Passing an argument to rand() multiplies the
interval by that argument.

test.pl  2 4 4 2 4 2 2 6 4 2
test.pl  5 6 6 5 6 2 5 2 6 3
test.pl  6 5 5 4 5 4 1 5 5 4
test.pl  1 5 3 1 6 2 3 1 5 6
38
srand(), srand(<val>)
#!/usr/bin/perl
use warnings;
use strict;
srand(4169467741);
for (my $i = 0; $i < 10; $i++) {
    print( int(rand(6)) + 1, " " );
}
print("\n");
exit();

srand() "seeds" rand() with an unpredictable
number. You don't need to call srand(); it is
called by default. Once. Once is enough!
srand(<val>) seeds it with the argument you pass
it. Then all "random" sequences are the same!

test.pl  6 1 4 2 2 3 1 6 3 3
test.pl  6 1 4 2 2 3 1 6 3 3
test.pl  6 1 4 2 2 3 1 6 3 3
test.pl  6 1 4 2 2 3 1 6 3 3

Very useful for debugging!
39
A random string
#!/usr/bin/perl
use warnings;
use strict;
my @nuc = ('A', 'C', 'G', 'T');
for (my $i = 0; $i < 50; $i++) {
    print( $nuc[ int( rand(4) ) ] );
}
print("\n");
exit();

Assign characters to an array. Randomly choose an
element from the array. Print the element.

test.pl  CATAGGGCTGGGAATGCGAGATACATGCAACGATCCCTGTAGCTGAATGG
test.pl  GTTGATGGTCATGTACCCGAAGATAACTGCTCACCCGAACTCATGTTACG
test.pl  TGCTCACAACTCACATCACTAGCGCTATTTGGGCAACTATTCCTTGAACG
40
Evolution
(Output: a page of random lower-case letters,
spaces and periods, i.e. random "text" apparently
generated in the same way as the random DNA
string above.)
41
Arbitrary event probabilities
The principle: divide the random interval into
subintervals. Pick a random number. See which
subinterval it falls into.
(Figure: the interval [0, 1) divided into
subintervals for A, C, G and T, sized for 80% GC.)
... 0.119468807332801 0.715035324721736
0.975161340914457 ...
GGGAACGCAGGGCGAACAGTGGGGCCGGCTCGTCGGCGGGCTCGGCGACG
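A minimal Perl sketch of the interval trick for an 80% GC random
sequence; the particular cut-offs used here (p(A) = p(T) = 0.1,
p(C) = p(G) = 0.4) are one assumed way to realize 80% GC:

#!/usr/bin/perl
use warnings;
use strict;

# cumulative upper bounds of the subintervals in [0, 1)
my @cutoff = (0.1, 0.5, 0.9, 1.0);     # A, C, G, T with 80% GC
my @nuc    = ('A', 'C', 'G', 'T');

for (my $i = 0; $i < 50; $i++) {
    my $r = rand();
    my $j = 0;
    $j++ while ($r >= $cutoff[$j]);    # find the subinterval $r falls into
    print($nuc[$j]);
}
print("\n");
exit();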
42
Arbitrary probability distributions
  • The principle:
  • Pick a random number x from a bounding interval.
  • Evaluate the target distribution function at
    that number: f(x).
  • Pick a second random number y.
  • Declare an event (accept x) if y ≤ f(x).

(Figure: steps 1 to 4 illustrated on a plot of
f(x), with x on the horizontal axis and y on the
vertical axis.)
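A minimal Perl sketch of this accept/reject principle, using an
assumed triangular target density f(x) = 1 - |x| on the bounding
interval [-1, 1]:

#!/usr/bin/perl
use warnings;
use strict;

# target density on [-1, 1]; its maximum value is 1
sub f {
    my ($x) = @_;
    return 1 - abs($x);
}

my @events;
while (scalar(@events) < 1000) {
    my $x  = rand(2) - 1;     # 1. pick x from the bounding interval
    my $fx = f($x);           # 2. evaluate the target density at x
    my $y  = rand(1);         # 3. pick a second random number y
    push(@events, $x) if ($y <= $fx);   # 4. accept x if y <= f(x)
}

my $sum = 0;
$sum += $_ for @events;
printf("kept %d samples, mean %.3f\n",
       scalar(@events), $sum / scalar(@events));
exit();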
43
Bootstrap test
Bootstrap: a type of randomization test that is
typically used for the estimation of confidence
intervals.
Goal: evaluate confidence in a statistic (e.g. the
mean) of samples from an unknown distribution.
Procedure: take n observations. Randomly choose n
samples with replacement and calculate the mean m.
Repeat R times. Calculate the mean and variance
from the R means.
Confidence interval: if 2.5% of the values are < A
and 2.5% of the values are > B, the 95% confidence
interval is given by [A, B]. (Quantile method)
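A minimal Perl sketch of this bootstrap procedure; the observations
in @x are made-up example values:

#!/usr/bin/perl
use warnings;
use strict;

my @x = (2.1, 3.4, 2.8, 5.0, 4.2, 3.9, 2.6, 4.8, 3.1, 3.7);  # n observations
my $R = 10000;                                               # replicates

my @means;
for (my $r = 0; $r < $R; $r++) {
    # resample n observations with replacement and record their mean
    my $sum = 0;
    $sum += $x[ int(rand(scalar(@x))) ] for (1 .. scalar(@x));
    push(@means, $sum / scalar(@x));
}

# quantile method: the central 95% of the resampled means
my @sorted = sort { $a <=> $b } @means;
my $lo = $sorted[ int(0.025 * $R) ];
my $hi = $sorted[ int(0.975 * $R) - 1 ];
printf("95%% bootstrap CI for the mean: [%.3f, %.3f]\n", $lo, $hi);
exit();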