Title: Theoretical distributions (Lorenz Wernisch)
1. Theoretical distributions (Lorenz Wernisch)
2. Distributions as compact data descriptions
- 66
- 2463
- 933
- 1287
- 297
- 777
- 1431
- 954
- 588
- 567
- 714
- 591
- ...
A list of numbers becomes organized in a histogram; the distribution captures its statistical essentials.
[Fitted curve: Gamma distribution with parameters λ = 2.660301 and s = 358.4419]
3. A scoring scheme
- Compare a new target (keyword, DNA sequence) with objects (HTML pages, known DNA sequences) in a database to find matching objects
- Each comparison results in a score
[Diagram: the target is compared against each database object, every comparison yielding a score, e.g. -46, -42, -59, -45, -53, and one score of 330]
4. Compare the target score with the rest of the scores
[Histogram: number of unrelated (random) database hits by score; the score 330 lies far outside the heap of random scores]
5. Fit a Normal (Gaussian) distribution
[Histogram of the random scores with a fitted normal curve: μ = -47.1, σ = 20.8; the target score 330 is marked]
6. Probability for high scores
The red area is the probability that a random N(-47.1, 20.8) distributed variable has a score > 0: Pr(s > 0) = 0.0117.
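A minimal R sketch to reproduce this value (assuming the fitted parameters μ = -47.1 and σ = 20.8 from the previous slide):
- # probability that a N(-47.1, 20.8) variable exceeds 0
- pnorm(0, mean = -47.1, sd = 20.8, lower.tail = FALSE)
- # about 0.0118, matching the value quoted above up to rounding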
7. Objectives
- At the end of the day you should be able to
- do some combinatorial counting
- mention one typical example for each of the following distributions: Binomial, Poisson, Normal, Chi-square, t, Gamma, Extreme Value
- plot these distributions in R
- fit one of the above distributions to a set of discrete or continuous data using maximum likelihood estimates in R
8. How many ways are there to order numbers?
- 3 numbers? 123, 213, ...
- 4 numbers, using the result for 3 numbers?
- n numbers, using the result for (n-1) numbers?
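For reference, the counting these questions lead to: any of the n numbers can be placed first, and the remaining n-1 can then be ordered in $(n-1)!$ ways, so $n! = n \cdot (n-1)!$, giving $3! = 6$, $4! = 4 \cdot 3! = 24$, and so on.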
9. The subsets of size k in a set of size n
- How many subsets of k = 3 in a set of n = 10 elements?
- Idea:
- enumerate all orderings
- take the first three
- ...
- 3 7 5 1 10 9 4 2 3 8
- 3 5 7 1 10 9 4 2 3 8
- 5 7 3 9 4 3 1 10 8 2
- ...
- How many orderings give the same result?
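Working this idea out: the $3!$ orderings of the first three elements and the $7!$ orderings of the remaining seven all give the same subset, so the number of subsets is $\frac{10!}{3!\,7!} = 120$, the binomial coefficient of the next slide.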
10. Binomial coefficient
- Number of ways of choosing a subset of size k from a set of size n: $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$
- Alternative form: $\binom{n}{k} = \frac{n (n-1) \cdots (n-k+1)}{k!}$
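Both forms can be checked directly in R, where the binomial coefficient is built in as choose:
- choose(10, 3)                                   # 120
- factorial(10) / (factorial(3) * factorial(7))   # also 120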
11. Very simple: the Bernoulli distribution
- A coin with two sides, side 0 (tail) and side 1 (head), has been tested with thousands of throws
- 60% give 0, 40% give 1
- P(0) = 0.6, P(1) = 0.4
- The coin is a Bernoulli variable X with p = 0.4
- expectation: $E[X] = p = 0.4$
- variance: $V[X] = p(1-p) = 0.24$
12. Flipping a coin
- What is the probability of the following runs?
- P(0 1 0 1 1 0 0 0 1 0)
- P(1 1 0 0 0 0 1 0 0 1)
- P(0 0 0 1 1 1 0 1 0 0)
- How many such runs with four 1s exist?
- What is the total probability of four 1s in 10 throws? (see the sketch below)
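A sketch of the calculation in R, assuming p = 0.4 as for the Bernoulli coin above: each particular run with four 1s and six 0s has probability $p^4 (1-p)^6$, there are choose(10, 4) such runs, and dbinom sums them up.
- p <- 0.4
- p^4 * (1 - p)^6                  # probability of one particular run with four 1s
- choose(10, 4)                    # number of such runs: 210
- choose(10, 4) * p^4 * (1 - p)^6  # total probability, about 0.251
- dbinom(4, 10, p)                 # the same value from the binomial pdf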
13. Binomial distribution
- 1 experiment: 100 coin tosses with p = Pr(side 1) = 0.4, expecting about 40 1s (but not exactly 40)
- 1000 experiments (1000 x 100 coin tosses)
[Histogram: number of 1s out of 100 (x-axis) against number of experiments out of 1000 (y-axis)]
14. R program for a binomial plot
- # generate 1000 simulations
- x <- rbinom(1000, 100, 0.4)
- # generate vector from 20 to 60
- z <- 20:60
- # plot the histogram from 20 to 60 with no labels
- hist(x, breaks = z, xlab = "", ylab = "", main = "", xlim = c(20, 60))
- # draw points of the binomial distribution for 1000 experiments, line width 2
- lines(z, 1000 * dbinom(z, 100, 0.4), type = "p", lwd = 2)
15. Binomial distribution
[Plots of the cumulative distribution function (cdf) and the probability density function (pdf) of the Binomial with n = 100, p = 0.4; the marked point at 44 has cdf value p = 0.82]
16. R functions for distributions
- Each distribution comes with a set of R functions
- Binomial distribution with parameters n = size, p = prob
- probability density function, pdf: dbinom(x, size, prob)
- cumulative distribution function, cdf: pbinom(x, size, prob)
- generate random examples: rbinom(number, size, prob)
- More information: help(dbinom)
17. R program for two binomial plots
- # make space for two plots
- par(mfrow = c(2, 1))
- z <- 20:60
- # cdf first, type "s" for a step plot
- plot(z, pbinom(z, 100, 0.4), type = "s")
- # pdf next, type "h" for vertical lines
- plot(z, dbinom(z, 100, 0.4), type = "h")
- # add some points, type "p" for points
- lines(z, dbinom(z, 100, 0.4), type = "p")
- # one plot only in future
- par(mfrow = c(1, 1))
18. Binomial for large n and small p
- Number n of throws very large
- Tiny probability p for throwing a 1
- Expected number of 1s constant: λ = np (compared numerically in the sketch below)
[Diagrams of the unit interval [0, 1] with n = 50, p = 0.06, λ = 3 and n = 100, p = 0.03, λ = 3]
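A small R sketch comparing the two binomials above with the Poisson they approach (both have λ = 3):
- k <- 0:10
- round(dbinom(k, 50, 0.06), 4)    # binomial with n = 50, p = 0.06
- round(dbinom(k, 100, 0.03), 4)   # binomial with n = 100, p = 0.03
- round(dpois(k, 3), 4)            # Poisson with lambda = 3; the values are already close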
19. Poisson distribution
- If, on average, one sees λ heads between 0 and 1, what is the probability of seeing k heads?
- Make the grid between 0 and 1 finer and finer, keeping λ = np constant
- In the limit, $P(X = k) = \lim_{n \to \infty} \binom{n}{k} \Big(\frac{\lambda}{n}\Big)^{k} \Big(1 - \frac{\lambda}{n}\Big)^{n-k} = \frac{\lambda^k}{k!} e^{-\lambda}$, which requires some calculation with the limit
20. Poisson distribution
- pdf: $P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}$, dpois(x, lambda)
- cdf: ppois(x, lambda)
- expectation: $E[X] = \lambda$
- variance: $V[X] = \lambda$
- random samples: rpois(n, lambda)
[Plot: Poisson distribution with λ = 2.7]
21. Application of Poisson: database search
- Query sequence against a database of random (that is, nonrelated) sequences
- p = Pr(a database sequence looks significant, a hit)
- n = number of sequences in the database
- n is large and p is small, E-value λ = np
- Chance to get at least one wrong (random) hit: $P(X \ge 1) = 1 - e^{-\lambda}$
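A minimal sketch of this calculation in R, with hypothetical values for n and p:
- n <- 1e6                              # hypothetical database size
- p <- 1e-7                             # hypothetical chance that a random sequence scores as a hit
- lambda <- n * p                       # E-value, here 0.1
- 1 - exp(-lambda)                      # chance of at least one random hit, about 0.095
- ppois(0, lambda, lower.tail = FALSE)  # the same value via the Poisson cdf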
22. Normal distribution N(μ, σ)
- probability density function: $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- expectation: $E[X] = \mu$; variance: $V[X] = \sigma^2$
[Plots of the cumulative distribution function and the probability density function of the Normal with μ = 5, σ = 2; P(X < 7) = 0.841]
23. R program for two normal plots of N(5, 2)
- par(mfrow = c(2, 1))
- z <- seq(-2, 12, 0.01)
- plot(z, pnorm(z, 5, 2), type = "l", frame.plot = F, xlab = "", ylab = "", lwd = 2)
- plot(z, dnorm(z, 5, 2), type = "l", frame.plot = F, xlab = "", ylab = "", lwd = 2)
- # plot the hashing of the pdf
- z <- seq(-2, 7, 0.1)
- lines(z, dnorm(z, 5, 2), type = "h")
- par(mfrow = c(1, 1))
24. Normal distribution as a limit of the Binomial
- increase n, keep p constant
- the Binomial approaches the Normal (becomes less and less skewed)
25. Central limit theorem
- Independent random variables $X_i$, all with the same distribution as X, with $E[X] = \mu$ and $V[X] = \sigma^2$
- The standardized sum $\frac{X_1 + \cdots + X_n - n\mu}{\sigma \sqrt{n}}$ is normal N(0, 1) in the limit
- or equivalently, the density of the standardized sum approaches the N(0, 1) density (illustrated below)
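A short simulation of the theorem, using uniform variables purely as an example (any distribution with finite variance works):
- n <- 30
- mu <- 0.5; sigma <- sqrt(1/12)    # mean and sd of a single U(0,1) variable
- s <- replicate(1000, (sum(runif(n)) - n * mu) / (sigma * sqrt(n)))
- hist(s, breaks = 30, freq = FALSE)
- z <- seq(-4, 4, 0.01)
- lines(z, dnorm(z), lwd = 2)       # histogram of standardized sums is close to N(0, 1)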
26. Fitting a distribution
- 200 data points in a histogram seem to come from a normal distribution
- To fit a normal distribution N(μ, σ), we need to know two parameters, μ and σ
- How to find them?
27. Maximum likelihood estimators (MLE)
- Density function $f(x \mid \mu, \sigma)$
- Sample $x = x_1, x_2, \ldots, x_n$ from the pdf $f(x \mid \mu, \sigma)$
- Likelihood for the fixed sample depends on the parameters μ and σ: $L(\mu, \sigma) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma)$
- Log-likelihood: $\log L(\mu, \sigma) = \sum_{i=1}^{n} \log f(x_i \mid \mu, \sigma)$
- We now try to find the μ and σ that maximize the likelihood or log-likelihood
28. MLE for normal distribution
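A sketch of the usual derivation: write the log-likelihood of a normal sample and set its derivatives to zero.
$\log L(\mu, \sigma) = -n \log(\sigma \sqrt{2\pi}) - \sum_{i} \frac{(x_i - \mu)^2}{2\sigma^2}$
$\frac{\partial \log L}{\partial \mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_i x_i, \qquad \frac{\partial \log L}{\partial \sigma} = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n} \sum_i (x_i - \hat{\mu})^2$
(R's sd(x) divides by n-1 rather than n, a negligible difference for large samples.)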
29. MLE for normal distribution is simple
- For the normal distribution the MLE of μ and σ is straightforward: it is just the mean and standard deviation of the data
- # stepsize stsz of the histogram, data vector x
- z <- seq(min(x) - stsz, max(x) + stsz, 0.01)
- lines(z, length(x) * stsz * dnorm(z, mean(x), sd(x)))
30. Fitting a distribution by MLE in R
- Data points xi: x <- scan("numbers.dat")
- For each data point xi there is a density value f(xi | μ, σ)
- We are looking for μ and σ that make all these density values as large as possible together
- The parameters μ and σ should maximize sum(log(dens(x, mu, sigma))), or equivalently minimize -sum(log(dens(x, mu, sigma)))
31. MLE for normal distribution in R
- # define the function to minimize;
- # it takes a vector p of 2 parameters
- f.neglog <- function(p)
-   -sum(log(dnorm(x, p[1], p[2])))
- # starting values for the 2 parameters
- start.params <- c(0, 1)
- # minimize the target function
- mle.fit <- nlm(f.neglog, start.params)
- # get the estimates for the parameters
- mle.fit$estimate
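A self-contained trial run of this recipe on simulated data (here a hypothetical sample from N(5, 2); for real data x would come from scan as on the previous slide):
- x <- rnorm(200, 5, 2)                        # simulated data with known parameters
- f.neglog <- function(p) -sum(log(dnorm(x, p[1], p[2])))
- mle.fit <- nlm(f.neglog, c(mean(x), sd(x)))  # start at the obvious guess
- mle.fit$estimate                             # close to mean(x) and sd(x), i.e. near 5 and 2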
32. Points to notice about MLE in R
- The nlm algorithm is an iterative heuristic minimization algorithm and can fail!
- Choice of the right starting values for the parameters is crucial; an initial guess from the histogram can help
- For location and scale parameters, mean(x) and sd(x) are often a good guess
- Check the result by fitting the density function with the calculated parameters to the histogram
33. Exponential distribution Exp(λ)
- pdf: $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, dexp(x, lambda)
- cdf: $F(x) = 1 - e^{-\lambda x}$, pexp(x, lambda)
- Expectation: $E[X] = 1/\lambda$
- Variance: $V[X] = 1/\lambda^2$
- random samples: rexp(n, lambda)
[Plot: Exponential distribution with λ = 4]
34. Exponential distribution and the Poisson process
- The exponential distribution as a limiting process: tiny probability p, many trials n for throwing a 1 in the unit interval [0, 1], λ = np
- How long does it take until the first 1?
[Diagram of the unit interval [0, 1]: an 18.1% chance of seeing a 1 in the indicated region]
35. MLE estimate for the exponential distribution
- Maximum likelihood estimate for λ in $f(x \mid \lambda) = \lambda e^{-\lambda x}$
- 1) Log-likelihood: $\log L(\lambda) = n \log \lambda - \lambda \sum_i x_i$
- 2) Set the derivative to zero: $\frac{n}{\lambda} - \sum_i x_i = 0$
- 3) $\hat{\lambda} = \frac{n}{\sum_i x_i} = \frac{1}{\bar{x}}$
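A quick numerical check of step 3 on simulated data (with a hypothetical rate of 4):
- x <- rexp(1000, 4)
- 1/mean(x)        # the MLE of lambda, close to 4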
36. The Gamma function
- $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt$ is the continuation of the factorial n! to real numbers, with $\Gamma(n) = (n-1)!$
- and is used in many distribution functions
- Moreover, $\Gamma(x+1) = x\, \Gamma(x)$
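Both properties can be checked directly in R:
- gamma(5)                           # 24, the same as factorial(4)
- all.equal(gamma(7), 6 * gamma(6))  # TRUE: Gamma(x+1) = x * Gamma(x) with x = 6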
37. Gamma distribution
- density function, pdf: $f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}$ for $x \ge 0$, dgamma(x, alpha, lambda)
- cdf: pgamma(x, alpha, lambda)
- expectation: $E[X] = \alpha / \lambda$
- variance: $V[X] = \alpha / \lambda^2$
- random samples: rgamma(n, alpha, lambda)
[Plot: Gamma distribution with α = 3, λ = 4]
38. Shape parameter of the Gamma distribution
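The effect of the shape parameter can be reproduced with a few lines of R (the shape values here are chosen only for illustration):
- z <- seq(0, 5, 0.01)
- plot(z, dgamma(z, 1, 2), type = "l", ylab = "density")  # shape 1 is the exponential
- lines(z, dgamma(z, 2, 2), lty = 2)                      # shape 2
- lines(z, dgamma(z, 5, 2), lty = 3)                      # shape 5, more symmetric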
39. Gamma distribution and the Poisson process
- The gamma distribution as a limiting process: tiny probability p, many trials n for throwing a 1 in the unit interval [0, 1], λ = np
- How long does it take until the third 1?
[Diagram of the unit interval [0, 1]: a 0.11% chance of seeing three 1s in the indicated region]
40. Example: waiting times in a queue
- Time to service a person at the desk is exponential with λe = 1/(average waiting time)
- If there are n persons in front of you, how is the time you have to wait distributed?
- It turns out the sum of n exponentially distributed variables (the waiting time for each person) is Gamma distributed with α = n and λ = λe
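A simulation sketch of this result, with a hypothetical average waiting time of 3 and n = 5 people ahead:
- lambda.e <- 1/3                                 # rate = 1 / (average waiting time)
- n <- 5
- w <- replicate(2000, sum(rexp(n, lambda.e)))    # total time until it is your turn
- hist(w, breaks = 40, freq = FALSE)
- z <- seq(0, max(w), 0.1)
- lines(z, dgamma(z, shape = n, rate = lambda.e), lwd = 2)   # Gamma(n, lambda.e) fits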
41. Extreme value distribution
- cumulative distribution, cdf: $F(x) = e^{-e^{-(x - \mu)/\sigma}}$
- probability density, pdf: $f(x) = \frac{1}{\sigma} e^{-(x-\mu)/\sigma}\, e^{-e^{-(x-\mu)/\sigma}}$
- No simple form for expectation and variance
[Plot: Extreme value distribution with μ = 3, σ = 4]
42. Examples for the Extreme Value Distribution (EVD)
- 1) An example from meteorology:
- Wind speed is measured daily at noon
- Normal distribution around the average wind speed
- The monthly maximum wind speed is recorded as well
- The monthly maximum wind speed does not follow a normal distribution, it follows an EVD
- 2) Scores of sequence alignments (local alignments) often follow an EVD
43. R code for the Extreme Value distribution
- No predefined procedures for the EVD in R
- Define them, for example, as
- pexval <- function(x, mu = 0, sigma = 1)
-   exp(-exp(-(x - mu)/sigma))
- dexval <- function(x, mu = 0, sigma = 1)
-   1/sigma * exp(-(x - mu)/sigma) * exp(-exp(-(x - mu)/sigma))
- # general form pexval(x, mu, sigma), e.g.
- pexval(x, 3, 4)
- dexval(x, 3, 4)
44. χ² distribution
- Standard normal random variables $X_i$, $X_i \sim N(0, 1)$
- The variable $\chi^2 = X_1^2 + X_2^2 + \cdots + X_n^2$ (squared!) has a $\chi^2_n$ distribution with n degrees of freedom
- density: $f(x) = \frac{1}{2^{n/2} \Gamma(n/2)} x^{n/2 - 1} e^{-x/2}$
- expectation: $E[X] = n$
- variance: $V[X] = 2n$
- pchisq(x, n), dchisq(x, n), rchisq(num, n)
45. Shape of the χ² distribution
- The χ² distribution with n degrees of freedom is actually a Gamma distribution with α = n/2 and λ = 1/2
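This identity is easy to confirm numerically in R, for example with n = 5 degrees of freedom:
- z <- seq(0.1, 20, 0.1)
- all.equal(dchisq(z, 5), dgamma(z, shape = 5/2, rate = 1/2))   # TRUE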
46. t distribution
- $Z \sim N(0, 1)$, independent of $U \sim \chi^2_n$
- then $T = \frac{Z}{\sqrt{U/n}}$ has a t distribution with n degrees of freedom
- density: $f(t) = \frac{\Gamma(\frac{n+1}{2})}{\sqrt{n\pi}\, \Gamma(\frac{n}{2})} \left(1 + \frac{t^2}{n}\right)^{-\frac{n+1}{2}}$
- pt(x, n), dt(x, n), rt(num, n)
47. Shape of the t distribution
- Approaches the normal N(0, 1) distribution for large n (n > 20 or 30)
[Plot: t densities for n = 1, 3, 5, 10 together with the N(0, 1) density]
48. Fitting a chi-squared or t distribution
- Need a location parameter μ and a scale parameter σ
- the new density is $\frac{1}{\sigma} f\!\left(\frac{x - \mu}{\sigma}\right)$
- mu <- 3
- sig <- 2
- 1/sig * dt((x - mu)/sig, 3)
[Plot: t distribution with n = 3, shifted and scaled with μ = 3, σ = 2]
49. MLE for the t distribution
- # function to minimize with 3 parameters:
- # p[1] is μ, p[2] is σ, p[3] is n
- f.neglog <- function(p)
-   -sum(log(1/p[2] * dt((x - p[1])/p[2], p[3])))
- # starting values for the 3 parameters
- start.params <- c(0, 1, 1)
- # minimize the target function
- mle.fit <- nlm(f.neglog, start.params)
- # get the estimates for the parameters
- mle.fit$estimate
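As with the normal fit, the recipe can be tried on simulated data first (hypothetical parameters μ = 3, σ = 2 and n = 3 degrees of freedom); median and mad are robust starting guesses for location and scale with such heavy-tailed data:
- x <- 3 + 2 * rt(500, 3)          # simulated shifted and scaled t data
- f.neglog <- function(p) -sum(log(1/p[2] * dt((x - p[1])/p[2], p[3])))
- mle.fit <- nlm(f.neglog, c(median(x), mad(x), 5))
- mle.fit$estimate                 # roughly 3, 2 and 3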