Title: Theoretical distributions (Lorenz Wernisch)
1. Theoretical distributions (Lorenz Wernisch)
2. Distributions as compact data descriptions
- 66
- 2463
- 933
- 1287
- 297
- 777
- 1431
- 954
- 588
- 567
- 714
- 591
- ...
A list of numbers becomes organized in a histogram; the distribution captures its statistical essentials.
[Fitted curve: Gamma distribution with parameters λ = 2.660301 and s = 358.4419]
3. A scoring scheme
- Compare a new target (keyword, DNA sequence) with objects (HTML pages, known DNA sequences) in a database to find matching objects
- Each comparison results in a score
[Diagram: the target is compared against each database object, every comparison yielding a score, e.g. -46, -42, -59, -45, -53, and one score of 330]
4. Compare the target score with the rest of the scores
[Histogram: number of unrelated (random) database hits by score; the score 330 lies far outside the heap of random scores]
5. Fit a Normal (Gaussian) distribution
[Histogram of the random scores with a fitted normal curve: μ = -47.1, σ = 20.8; the target score 330 is marked]
6. Probability for high scores
The red area is the probability that a random N(-47.1, 20.8) distributed variable has a score > 0: Pr(s > 0) = 0.0117.
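A minimal R sketch to reproduce this value (assuming the fitted parameters μ = -47.1 and σ = 20.8 from the previous slide):
- # probability that a N(-47.1, 20.8) variable exceeds 0
- pnorm(0, mean = -47.1, sd = 20.8, lower.tail = FALSE)
- # about 0.0118, matching the value quoted above up to rounding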
7. Objectives
- At the end of the day you should be able to
- do some combinatorial counting
- mention one typical example for each of the following distributions: Binomial, Poisson, Normal, Chi-square, t, Gamma, Extreme Value
- plot these distributions in R
- fit one of the above distributions to a set of discrete or continuous data using maximum likelihood estimates in R
8. How many ways are there to order numbers?
- 3 numbers? 123, 213, ...
- 4 numbers, using the result for 3 numbers?
- n numbers, using the result for (n-1) numbers?
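For reference, the counting these questions lead to: any of the n numbers can be placed first, and the remaining n-1 can then be ordered in $(n-1)!$ ways, so $n! = n \cdot (n-1)!$, giving $3! = 6$, $4! = 4 \cdot 3! = 24$, and so on.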
9. The subsets of size k in a set of size n
- How many subsets of k = 3 in a set of n = 10 elements?
- Idea:
- enumerate all orderings
- take the first three
- ...
- 3 7 5 1 10 9 4 2 3 8
- 3 5 7 1 10 9 4 2 3 8
- 5 7 3 9 4 3 1 10 8 2
- ...
- How many orderings give the same result?
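Working this idea out: the $3!$ orderings of the first three elements and the $7!$ orderings of the remaining seven all give the same subset, so the number of subsets is $\frac{10!}{3!\,7!} = 120$, the binomial coefficient of the next slide.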
10. Binomial coefficient
- Number of ways of choosing a subset of size k from a set of size n: $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$
- Alternative form: $\binom{n}{k} = \frac{n (n-1) \cdots (n-k+1)}{k!}$
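Both forms can be checked directly in R, where the binomial coefficient is built in as choose:
- choose(10, 3)                                   # 120
- factorial(10) / (factorial(3) * factorial(7))   # also 120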
11. Very simple: the Bernoulli distribution
- A coin with two sides, side 0 (tail) and side 1 (head), has been tested with thousands of throws
- 60% give 0, 40% give 1
- P(0) = 0.6, P(1) = 0.4
- The coin is a Bernoulli variable X with p = 0.4
- expectation: $E[X] = p = 0.4$
- variance: $V[X] = p(1-p) = 0.24$
12. Flipping a coin
- What is the probability of the following runs?
- P(0 1 0 1 1 0 0 0 1 0)
- P(1 1 0 0 0 0 1 0 0 1)
- P(0 0 0 1 1 1 0 1 0 0)
- How many such runs with four 1s exist?
- What is the total probability of four 1s in 10 throws? (see the sketch below)
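A sketch of the calculation in R, assuming p = 0.4 as for the Bernoulli coin above: each particular run with four 1s and six 0s has probability $p^4 (1-p)^6$, there are choose(10, 4) such runs, and dbinom sums them up.
- p <- 0.4
- p^4 * (1 - p)^6                  # probability of one particular run with four 1s
- choose(10, 4)                    # number of such runs: 210
- choose(10, 4) * p^4 * (1 - p)^6  # total probability, about 0.251
- dbinom(4, 10, p)                 # the same value from the binomial pdf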
13. Binomial distribution
- 1 experiment: 100 coin tosses with p = Pr(side 1) = 0.4, expecting about 40 1s (but not exactly 40)
- 1000 experiments (1000 x 100 coin tosses)
[Histogram: number of 1s out of 100 (x-axis) against number of experiments out of 1000 (y-axis)]
14. R program for a binomial plot
- # generate 1000 simulations
- x <- rbinom(1000, 100, 0.4)
- # generate vector from 20 to 60
- z <- 20:60
- # plot the histogram from 20 to 60 with no labels
- hist(x, breaks = z, xlab = "", ylab = "", main = "", xlim = c(20, 60))
- # draw points of the binomial distribution for 1000 experiments, line width 2
- lines(z, 1000 * dbinom(z, 100, 0.4), type = "p", lwd = 2)
15. Binomial distribution
[Plots of the cumulative distribution function (cdf) and the probability density function (pdf) of the Binomial with n = 100, p = 0.4; the marked point at 44 has cdf value p = 0.82]
16. R functions for distributions
- Each distribution comes with a set of R functions
- Binomial distribution with parameters n = size, p = prob
- probability density function, pdf: dbinom(x, size, prob)
- cumulative distribution function, cdf: pbinom(x, size, prob)
- generate random examples: rbinom(number, size, prob)
- More information: help(dbinom)
17. R program for two binomial plots
- # make space for two plots
- par(mfrow = c(2, 1))
- z <- 20:60
- # cdf first, type "s" for a step plot
- plot(z, pbinom(z, 100, 0.4), type = "s")
- # pdf next, type "h" for vertical lines
- plot(z, dbinom(z, 100, 0.4), type = "h")
- # add some points, type "p" for points
- lines(z, dbinom(z, 100, 0.4), type = "p")
- # one plot only in future
- par(mfrow = c(1, 1))
18. Binomial for large n and small p
- Number n of throws very large
- Tiny probability p for throwing a 1
- Expected number of 1s constant: λ = np (compared numerically in the sketch below)
[Diagrams of the unit interval [0, 1] with n = 50, p = 0.06, λ = 3 and n = 100, p = 0.03, λ = 3]
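A small R sketch comparing the two binomials above with the Poisson they approach (both have λ = 3):
- k <- 0:10
- round(dbinom(k, 50, 0.06), 4)    # binomial with n = 50, p = 0.06
- round(dbinom(k, 100, 0.03), 4)   # binomial with n = 100, p = 0.03
- round(dpois(k, 3), 4)            # Poisson with lambda = 3; the values are already close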
19. Poisson distribution
- If, on average, one sees λ heads between 0 and 1, what is the probability of seeing k heads?
- Make the grid between 0 and 1 finer and finer, keeping λ = np constant
- In the limit, $P(X = k) = \lim_{n \to \infty} \binom{n}{k} \Big(\frac{\lambda}{n}\Big)^{k} \Big(1 - \frac{\lambda}{n}\Big)^{n-k} = \frac{\lambda^k}{k!} e^{-\lambda}$, which requires some calculation with the limit
20. Poisson distribution
- pdf: $P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}$, dpois(x, lambda)
- cdf: ppois(x, lambda)
- expectation: $E[X] = \lambda$
- variance: $V[X] = \lambda$
- random samples: rpois(n, lambda)
[Plot: Poisson distribution with λ = 2.7]
21. Application of Poisson: database search
- Query sequence against a database of random (that is, nonrelated) sequences
- p = Pr(a database sequence looks significant, a hit)
- n = number of sequences in the database
- n is large and p is small, E-value λ = np
- Chance to get at least one wrong (random) hit: $P(X \ge 1) = 1 - e^{-\lambda}$
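A minimal sketch of this calculation in R, with hypothetical values for n and p:
- n <- 1e6                              # hypothetical database size
- p <- 1e-7                             # hypothetical chance that a random sequence scores as a hit
- lambda <- n * p                       # E-value, here 0.1
- 1 - exp(-lambda)                      # chance of at least one random hit, about 0.095
- ppois(0, lambda, lower.tail = FALSE)  # the same value via the Poisson cdf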
22. Normal distribution N(μ, σ)
- probability density function: $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- expectation: $E[X] = \mu$; variance: $V[X] = \sigma^2$
[Plots of the cumulative distribution function and the probability density function of the Normal with μ = 5, σ = 2; P(X < 7) = 0.841]
23. R program for two normal plots of N(5, 2)
- par(mfrow = c(2, 1))
- z <- seq(-2, 12, 0.01)
- plot(z, pnorm(z, 5, 2), type = "l", frame.plot = F, xlab = "", ylab = "", lwd = 2)
- plot(z, dnorm(z, 5, 2), type = "l", frame.plot = F, xlab = "", ylab = "", lwd = 2)
- # plot the hashing of the pdf
- z <- seq(-2, 7, 0.1)
- lines(z, dnorm(z, 5, 2), type = "h")
- par(mfrow = c(1, 1))
24. Normal distribution as a limit of the Binomial
- increase n, keep p constant
- the Binomial approaches the Normal (becomes less and less skewed)
25. Central limit theorem
- Independent random variables $X_i$, all with the same distribution as X, with $E[X] = \mu$ and $V[X] = \sigma^2$
- The standardized sum $\frac{X_1 + \cdots + X_n - n\mu}{\sigma \sqrt{n}}$ is normal N(0, 1) in the limit
- or equivalently, the density of the standardized sum approaches the N(0, 1) density (illustrated below)
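A short simulation of the theorem, using uniform variables purely as an example (any distribution with finite variance works):
- n <- 30
- mu <- 0.5; sigma <- sqrt(1/12)    # mean and sd of a single U(0,1) variable
- s <- replicate(1000, (sum(runif(n)) - n * mu) / (sigma * sqrt(n)))
- hist(s, breaks = 30, freq = FALSE)
- z <- seq(-4, 4, 0.01)
- lines(z, dnorm(z), lwd = 2)       # histogram of standardized sums is close to N(0, 1)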
26. Fitting a distribution
- 200 data points in a histogram seem to come from a normal distribution
- To fit a normal distribution N(μ, σ), we need to know two parameters, μ and σ
- How to find them?
27. Maximum likelihood estimators (MLE)
- Density function $f(x \mid \mu, \sigma)$
- Sample $x = x_1, x_2, \ldots, x_n$ from the pdf $f(x \mid \mu, \sigma)$
- Likelihood for the fixed sample depends on the parameters μ and σ: $L(\mu, \sigma) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma)$
- Log-likelihood: $\log L(\mu, \sigma) = \sum_{i=1}^{n} \log f(x_i \mid \mu, \sigma)$
- We now try to find the μ and σ that maximize the likelihood or log-likelihood
28. MLE for normal distribution
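A sketch of the usual derivation: write the log-likelihood of a normal sample and set its derivatives to zero.
$\log L(\mu, \sigma) = -n \log(\sigma \sqrt{2\pi}) - \sum_{i} \frac{(x_i - \mu)^2}{2\sigma^2}$
$\frac{\partial \log L}{\partial \mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_i x_i, \qquad \frac{\partial \log L}{\partial \sigma} = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n} \sum_i (x_i - \hat{\mu})^2$
(R's sd(x) divides by n-1 rather than n, a negligible difference for large samples.)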
29. MLE for normal distribution is simple
- For the normal distribution the MLE of μ and σ is straightforward: it is just the mean and standard deviation of the data
- # stepsize stsz of the histogram, data vector x
- z <- seq(min(x) - stsz, max(x) + stsz, 0.01)
- lines(z, length(x) * stsz * dnorm(z, mean(x), sd(x)))
30. Fitting a distribution by MLE in R
- Data points xi: x <- scan("numbers.dat")
- For each data point xi there is a density value f(xi | μ, σ)
- We are looking for μ and σ that make all these density values as large as possible together
- The parameters μ and σ should maximize sum(log(dens(x, mu, sigma))), or equivalently minimize -sum(log(dens(x, mu, sigma)))
31. MLE for normal distribution in R
- # define the function to minimize;
- # it takes a vector p of 2 parameters
- f.neglog <- function(p)
-   -sum(log(dnorm(x, p[1], p[2])))
- # starting values for the 2 parameters
- start.params <- c(0, 1)
- # minimize the target function
- mle.fit <- nlm(f.neglog, start.params)
- # get the estimates for the parameters
- mle.fit$estimate
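A self-contained trial run of this recipe on simulated data (here a hypothetical sample from N(5, 2); for real data x would come from scan as on the previous slide):
- x <- rnorm(200, 5, 2)                        # simulated data with known parameters
- f.neglog <- function(p) -sum(log(dnorm(x, p[1], p[2])))
- mle.fit <- nlm(f.neglog, c(mean(x), sd(x)))  # start at the obvious guess
- mle.fit$estimate                             # close to mean(x) and sd(x), i.e. near 5 and 2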
32. Points to notice about MLE in R
- The nlm algorithm is an iterative heuristic minimization algorithm and can fail!
- Choice of the right starting values for the parameters is crucial; an initial guess from the histogram can help
- For location and scale parameters, mean(x) and sd(x) are often a good guess
- Check the result by fitting the density function with the calculated parameters to the histogram
33. Exponential distribution Exp(λ)
- pdf: $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, dexp(x, lambda)
- cdf: $F(x) = 1 - e^{-\lambda x}$, pexp(x, lambda)
- Expectation: $E[X] = 1/\lambda$
- Variance: $V[X] = 1/\lambda^2$
- random samples: rexp(n, lambda)
[Plot: Exponential distribution with λ = 4]
34. Exponential distribution and the Poisson process
- The exponential distribution as a limiting process: tiny probability p, many trials n for throwing a 1 in the unit interval [0, 1], λ = np
- How long does it take until the first 1?
[Diagram of the unit interval [0, 1]: an 18.1% chance of seeing a 1 in the indicated region]
35. MLE estimate for the exponential distribution
- Maximum likelihood estimate for λ in $f(x \mid \lambda) = \lambda e^{-\lambda x}$
- 1) Log-likelihood: $\log L(\lambda) = n \log \lambda - \lambda \sum_i x_i$
- 2) Set the derivative to zero: $\frac{n}{\lambda} - \sum_i x_i = 0$
- 3) $\hat{\lambda} = \frac{n}{\sum_i x_i} = \frac{1}{\bar{x}}$
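A quick numerical check of step 3 on simulated data (with a hypothetical rate of 4):
- x <- rexp(1000, 4)
- 1/mean(x)        # the MLE of lambda, close to 4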
36. The Gamma function
- $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt$ is the continuation of the factorial n! to real numbers, with $\Gamma(n) = (n-1)!$
- and is used in many distribution functions
- Moreover, $\Gamma(x+1) = x\, \Gamma(x)$
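Both properties can be checked directly in R:
- gamma(5)                           # 24, the same as factorial(4)
- all.equal(gamma(7), 6 * gamma(6))  # TRUE: Gamma(x+1) = x * Gamma(x) with x = 6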
37. Gamma distribution
- density function, pdf: $f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}$ for $x \ge 0$, dgamma(x, alpha, lambda)
- cdf: pgamma(x, alpha, lambda)
- expectation: $E[X] = \alpha / \lambda$
- variance: $V[X] = \alpha / \lambda^2$
- random samples: rgamma(n, alpha, lambda)
[Plot: Gamma distribution with α = 3, λ = 4]
38. Shape parameter of the Gamma distribution
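The effect of the shape parameter can be reproduced with a few lines of R (the shape values here are chosen only for illustration):
- z <- seq(0, 5, 0.01)
- plot(z, dgamma(z, 1, 2), type = "l", ylab = "density")  # shape 1 is the exponential
- lines(z, dgamma(z, 2, 2), lty = 2)                      # shape 2
- lines(z, dgamma(z, 5, 2), lty = 3)                      # shape 5, more symmetric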
39. Gamma distribution and the Poisson process
- The gamma distribution as a limiting process: tiny probability p, many trials n for throwing a 1 in the unit interval [0, 1], λ = np
- How long does it take until the third 1?
[Diagram of the unit interval [0, 1]: a 0.11% chance of seeing three 1s in the indicated region]
40. Example: waiting times in a queue
- Time to service a person at the desk is exponential with λe = 1/(average waiting time)
- If there are n persons in front of you, how is the time you have to wait distributed?
- It turns out the sum of n exponentially distributed variables (the waiting time for each person) is Gamma distributed with α = n and λ = λe
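A simulation sketch of this result, with a hypothetical average waiting time of 3 and n = 5 people ahead:
- lambda.e <- 1/3                                 # rate = 1 / (average waiting time)
- n <- 5
- w <- replicate(2000, sum(rexp(n, lambda.e)))    # total time until it is your turn
- hist(w, breaks = 40, freq = FALSE)
- z <- seq(0, max(w), 0.1)
- lines(z, dgamma(z, shape = n, rate = lambda.e), lwd = 2)   # Gamma(n, lambda.e) fits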
41. Extreme value distribution
- cumulative distribution, cdf: $F(x) = e^{-e^{-(x - \mu)/\sigma}}$
- probability density, pdf: $f(x) = \frac{1}{\sigma} e^{-(x-\mu)/\sigma}\, e^{-e^{-(x-\mu)/\sigma}}$
- No simple form for expectation and variance
[Plot: Extreme value distribution with μ = 3, σ = 4]
42. Examples for the Extreme Value Distribution (EVD)
- 1) An example from meteorology:
- Wind speed is measured daily at noon
- Normal distribution around the average wind speed
- The monthly maximum wind speed is recorded as well
- The monthly maximum wind speed does not follow a normal distribution, it follows an EVD
- 2) Scores of sequence alignments (local alignments) often follow an EVD
43. R code for the Extreme Value distribution
- No predefined procedures for the EVD in R
- Define them, for example, as
- pexval <- function(x, mu = 0, sigma = 1)
-   exp(-exp(-(x - mu)/sigma))
- dexval <- function(x, mu = 0, sigma = 1)
-   1/sigma * exp(-(x - mu)/sigma) * exp(-exp(-(x - mu)/sigma))
- # general form pexval(x, mu, sigma), e.g.
- pexval(x, 3, 4)
- dexval(x, 3, 4)
44. χ² distribution
- Standard normal random variables $X_i$, $X_i \sim N(0, 1)$
- The variable $\chi^2 = X_1^2 + X_2^2 + \cdots + X_n^2$ (squared!) has a $\chi^2_n$ distribution with n degrees of freedom
- density: $f(x) = \frac{1}{2^{n/2} \Gamma(n/2)} x^{n/2 - 1} e^{-x/2}$
- expectation: $E[X] = n$
- variance: $V[X] = 2n$
- pchisq(x, n), dchisq(x, n), rchisq(num, n)
45. Shape of the χ² distribution
- The χ² distribution with n degrees of freedom is actually a Gamma distribution with α = n/2 and λ = 1/2
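This identity is easy to confirm numerically in R, for example with n = 5 degrees of freedom:
- z <- seq(0.1, 20, 0.1)
- all.equal(dchisq(z, 5), dgamma(z, shape = 5/2, rate = 1/2))   # TRUE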
46. t distribution
- $Z \sim N(0, 1)$, independent of $U \sim \chi^2_n$
- then $T = \frac{Z}{\sqrt{U/n}}$ has a t distribution with n degrees of freedom
- density: $f(t) = \frac{\Gamma(\frac{n+1}{2})}{\sqrt{n\pi}\, \Gamma(\frac{n}{2})} \left(1 + \frac{t^2}{n}\right)^{-\frac{n+1}{2}}$
- pt(x, n), dt(x, n), rt(num, n)
47. Shape of the t distribution
- Approaches the normal N(0, 1) distribution for large n (n > 20 or 30)
[Plot: t densities for n = 1, 3, 5, 10 together with the N(0, 1) density]
48. Fitting a chi-squared or t distribution
- Need a location parameter μ and a scale parameter σ
- the new density is $\frac{1}{\sigma} f\!\left(\frac{x - \mu}{\sigma}\right)$
- mu <- 3
- sig <- 2
- 1/sig * dt((x - mu)/sig, 3)
[Plot: t distribution with n = 3, shifted and scaled with μ = 3, σ = 2]
49. MLE for the t distribution
- # function to minimize with 3 parameters:
- # p[1] is μ, p[2] is σ, p[3] is n
- f.neglog <- function(p)
-   -sum(log(1/p[2] * dt((x - p[1])/p[2], p[3])))
- # starting values for the 3 parameters
- start.params <- c(0, 1, 1)
- # minimize the target function
- mle.fit <- nlm(f.neglog, start.params)
- # get the estimates for the parameters
- mle.fit$estimate
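As with the normal fit, the recipe can be tried on simulated data first (hypothetical parameters μ = 3, σ = 2 and n = 3 degrees of freedom); median and mad are robust starting guesses for location and scale with such heavy-tailed data:
- x <- 3 + 2 * rt(500, 3)          # simulated shifted and scaled t data
- f.neglog <- function(p) -sum(log(1/p[2] * dt((x - p[1])/p[2], p[3])))
- mle.fit <- nlm(f.neglog, c(median(x), mad(x), 5))
- mle.fit$estimate                 # roughly 3, 2 and 3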