Statistics - PowerPoint PPT Presentation

About This Presentation

Title:

Statistics

Slides: 103

Provided by: DEPARTEM3

Transcript and Presenter's Notes

1
Statistics

Descriptive statistics

Books
Using R for Introductory Statistics (Verzani, J.)
Applied Statistics and Probability for Engineers (Montgomery, Runger)
Exam exercises

3
Descriptive statistics

Data
Categorical versus numeric
Univariate, bivariate, multivariate
Numerical summaries
Frequency tables
Central value
Variability
Skewness and kurtosis
Graphical summaries
Categorical: pie charts and bar charts
Numerical: histograms and boxplots

4
Tables

Represent the frequency of each value in your data
Numerical data
> table(scores[, 1])
 8 10 11 12 13 14 15 16 18 19
 1  1  5  3  2  1  1  2  1  1
Categorical data
> res <- c("Y", "Y", "N", "N", "N")
> table(res)
res
N Y
3 2

5
measures of central value

Location of the center of the distribution
The mean
mean
The Mode
Sort the values in ascending order, select the
value occurring most frequently.
The median
median
Percentiles
quantile
Midrange
mean(range(x))
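
A minimal R sketch of these measures, using a made-up sample (the vector `x` is an assumption for illustration, not course data). R's `mode()` returns the storage mode of an object, not the statistical mode, so the most frequent value is read from `table` instead.

```r
# Hypothetical sample; any numeric vector works
x <- c(8, 10, 11, 11, 11, 12, 12, 13, 15, 19)

x_mean   <- mean(x)                       # arithmetic mean
x_median <- median(x)                     # middle value of the sorted data
x_q      <- quantile(x, c(0.25, 0.75))    # 25th and 75th percentiles
x_mid    <- mean(range(x))                # midrange: (min + max) / 2

# Mode: tabulate the values and take the most frequent one
# (if several values tie, which.max returns the first)
counts <- table(x)
x_mode <- as.numeric(names(counts)[which.max(counts)])
```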

6
measures of dispersion

Indicates how the data are spread about the central value
The range
max - min
The variance
var
The standard deviation
sd (R), STDEV (Excel)
The coefficient of variation
The standard error
Interquartile range
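
The measures above can be sketched in R as follows. The sample `x` is hypothetical; the coefficient of variation and the standard error of the mean have no dedicated base R function, so they are computed from `sd` and `mean`.

```r
# Hypothetical sample
x <- c(8, 10, 11, 11, 11, 12, 12, 13, 15, 19)

x_range <- max(x) - min(x)           # range
x_var   <- var(x)                    # sample variance (divides by n - 1)
x_sd    <- sd(x)                     # sample standard deviation
x_cv    <- sd(x) / mean(x)           # coefficient of variation
x_se    <- sd(x) / sqrt(length(x))   # standard error of the mean
x_iqr   <- IQR(x)                    # interquartile range: Q3 - Q1
```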

7
Skewness and kurtosis

Skewness measures the degree of asymmetry of a distribution
Kurtosis characterises the relative peakedness or flatness of a distribution compared to the normal distribution
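
Base R has no built-in skewness or kurtosis function. The sketch below assumes the simple moment-based definitions (other software may apply small-sample corrections, so values can differ slightly); the helper names `skewness` and `kurtosis` and the two example vectors are illustrative.

```r
# Moment-based skewness: third central moment / (second central moment)^1.5
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment / (second central moment)^2, minus 3
# (the normal distribution then has kurtosis 0)
kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

sym   <- c(-2, -1, 0, 1, 2)   # symmetric: skewness 0
right <- c(1, 1, 1, 2, 10)    # long right tail: positive skewness

sk_sym   <- skewness(sym)
sk_right <- skewness(right)
ku_sym   <- kurtosis(sym)     # negative: flatter than the normal
```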

8
Descriptive statistics

Calculate the maximum value
(using max function in excel/R)
Calculate the minimum value
(using min function in excel/R)
Calculate the range
Calculate the sum of all values
Using sum

9
Graphical summaries

Bar chart
Pie chart
Dot chart
Histogram
boxplot

12
Exercises

For R
oefeningenR_Chapter12_opdrachten.doc
For Excel
Session1_DescriptiveStatistics/Descriptive_Statistics.doc
Course Session1_AlloySpecimens_unsolved.xls
Course microarray_nonnormalised_unsolved.xls
Home microarray_norm_solved.xls

14
Statistics

Probabilities

15
probabilities

Random experiment
Sample space
Event
Definition
New events as combination of events
Mutually exclusive
Likelihood
Definition of probability: Bayesian or frequentist interpretation
Events have probabilities
Conditional probability
Independence
Multiplicative law of probability
Law of total probability
Bayes rule

16
Probability rules
independence
Bayes rule
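
A small worked example of the law of total probability and Bayes rule in R. All numbers (a diagnostic test with 1% prevalence) are hypothetical and chosen only to illustrate the rules.

```r
# Hypothetical probabilities for a diagnostic test (assumptions)
p_d      <- 0.01   # P(D): prior probability of disease
p_pos_d  <- 0.95   # P(+ | D)
p_pos_nd <- 0.05   # P(+ | not D)

# Law of total probability: P(+) = P(+|D) P(D) + P(+|not D) P(not D)
p_pos <- p_pos_d * p_d + p_pos_nd * (1 - p_d)

# Bayes rule: P(D | +) = P(+|D) P(D) / P(+)
p_d_pos <- p_pos_d * p_d / p_pos

# Independence: D and + would be independent only if P(D | +) = P(D)
independent <- isTRUE(all.equal(p_d_pos, p_d))
```

Here P(D | +) is about 0.16, far below the 0.95 sensitivity, because the disease is rare.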
17
Statistics

Probability distributions

18
Probability distributions

1. Mean and variance of a discrete distribution
2. Binomial distribution
3. Hypergeometric distribution
4. Poisson distribution

19
Mean and variance of a discrete distribution

Mean: weight the possible values by their probabilities, $\mu = E(X) = \sum_x x f(x)$
Variance: weight the squared deviations from the mean by the probabilities, $\sigma^2 = V(X) = \sum_x (x - \mu)^2 f(x)$
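
As an illustration, the weighted sums can be evaluated in R for a fair six-sided die (an example chosen here, not taken from the slides).

```r
x <- 1:6              # possible values
f <- rep(1 / 6, 6)    # probability of each value

mu     <- sum(x * f)            # mean: values weighted by probability
sigma2 <- sum((x - mu)^2 * f)   # variance: squared deviations weighted by probability
```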

20
Binomial distribution

Binomial experiment
Trials are independent
Each trial can be summarized as resulting in either a success or a failure (Bernoulli trial)
The probability of a success on each trial, denoted as p, remains constant
The random variable X that equals the number of trials that result in a success has a binomial distribution with parameters p and n = 1, 2, ...
The probability mass function of X is
$f(x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, \ldots, n$

21
In excel

Probability distribution
BINOMDIST(x,n,p,FALSE)
Cumulative
BINOMDIST(x,n,p,TRUE)
P(X <= x)
n number of trials
x number of successes
22
In matlab

Probability distribution
BINOPDF(X,N,P) returns the binomial probability density function with parameters N and P at the values in X
BINOCDF(X,N,P) returns the binomial cumulative distribution function with parameters N and P at the values in X, P(X <= x)
BINOINV([0.05 0.95],162,0.5)
N number of trials
X number of successes

23
In R

dbinom(x, size, prob)
pbinom(q, size, prob)
qbinom(p, size, prob)
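
A short usage sketch for the binomial functions in R, with hypothetical parameters n = 10 and p = 0.5. Note that R takes the number of trials (`size`) before the success probability (`prob`).

```r
n <- 10    # number of trials (assumed for illustration)
p <- 0.5   # success probability (assumed for illustration)

p_eq5 <- dbinom(5, size = n, prob = p)     # P(X = 5)
p_le5 <- pbinom(5, size = n, prob = p)     # P(X <= 5)
x_med <- qbinom(0.5, size = n, prob = p)   # smallest x with P(X <= x) >= 0.5

# The same P(X = 5) from the probability mass function
p_eq5_hand <- choose(n, 5) * p^5 * (1 - p)^(n - 5)
```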

24
Hypergeometric distribution

A set of N items contains
K items classified as successes and
N-K items classified as failures
A sample of size n items is selected, at random and without replacement, from the N items, where K <= N and n <= N
Let the random variable X denote the number of successes in the sample. Then X has a hypergeometric distribution

25
In Excel

In Excel
HYPGEOMDIST(x,n,K,N)
In matlab
hygepdf(x,N,K,n)
hygecdf(x,N,K,n)

26
In R

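dhyper(x, m, n, k)
phyper(q, m, n, k, lower.tail = TRUE)
qhyper(p, m, n, k, lower.tail = TRUE)
rhyper(nn, m, n, k)

m number of white balls (successes in the population, K)
n number of black balls (failures in the population, N - K)
k number of trials (balls drawn, the sample size)
x number of white balls drawn (successes)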
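
The sketch below maps the N, K, n notation of the definition onto R's arguments. The lot size and counts are hypothetical, and the variable name `n_draw` is used only to avoid a clash with R's argument `n`.

```r
# Hypothetical lot: N = 20 items, K = 5 successes, sample of 4 without replacement
N      <- 20
K      <- 5
n_draw <- 4

# R's arguments: m = K (white), n = N - K (black), k = sample size
p_eq2 <- dhyper(2, m = K, n = N - K, k = n_draw)                       # P(X = 2)
p_le1 <- phyper(1, m = K, n = N - K, k = n_draw)                       # P(X <= 1)
p_gt1 <- phyper(1, m = K, n = N - K, k = n_draw, lower.tail = FALSE)   # P(X > 1)

# The same P(X = 2) from the counting formula
p_eq2_hand <- choose(K, 2) * choose(N - K, n_draw - 2) / choose(N, n_draw)
```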
27
Poisson distribution

Given an interval of real numbers, assume counts occur at random throughout the interval. If the interval can be partitioned into subintervals of small enough length such that
The probability of more than one count in a subinterval is 0
The probability of one count in a subinterval is the same for all subintervals and proportional to the length of the subinterval
The count in each subinterval is independent of other subintervals
Then the random experiment is a Poisson process
If the mean number of counts in the interval is λ > 0, the random variable X that equals the number of counts in the interval has a Poisson distribution with parameter λ and the probability mass function is
$f(x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots$

28
In Excel and Matlab

Non cumulative
POISSON(x, λ, FALSE) (Excel)
POISSPDF(X, λ) (Matlab)
Cumulative
POISSON(x, λ, TRUE) (Excel)
POISSCDF(X, λ) (Matlab)
P(X <= x)

29
In R

dpois(x, lambda)
ppois(q, lambda, lower.tail = TRUE)
qpois(p, lambda, lower.tail = TRUE)
rpois(n, lambda)
lower.tail: if TRUE (default), probabilities are P[X <= x], otherwise P[X > x].
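
A brief usage sketch for the Poisson functions in R, with a hypothetical mean of λ = 2 counts per interval.

```r
lambda <- 2   # mean number of counts per interval (assumed for illustration)

p_eq0 <- dpois(0, lambda)                       # P(X = 0) = exp(-lambda)
p_le1 <- ppois(1, lambda)                       # P(X <= 1)
p_gt1 <- ppois(1, lambda, lower.tail = FALSE)   # P(X > 1)
x_95  <- qpois(0.95, lambda)                    # smallest x with P(X <= x) >= 0.95
```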

30
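Discrete distributions
Mean of a discrete distribution: $\mu = E(X) = \sum_x x f(x)$
Variance of a discrete distribution: $\sigma^2 = V(X) = \sum_x (x - \mu)^2 f(x)$
Mean of a uniform distribution on the integers a, a+1, ..., b: $\mu = (a + b)/2$
Mean of a binomial distribution: $\mu = np$
Variance of a binomial distribution: $\sigma^2 = np(1 - p)$
Mean of a Poisson distribution: $\mu = \lambda$
Variance of a Poisson distribution: $\sigma^2 = \lambda$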
31
Probability distributions

5. Mean and variances of continuous distributions
6. Normal distribution
7. Exponential distribution
8. Central limit theorem
9. The Chi-Square distribution
10. t distribution
11. F distribution

32
Continuous random variable

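A function f_X(x) is a probability density function of a continuous random variable X if for any interval of real numbers [x1, x2]
$P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} f_X(u)\,du$

The cumulative distribution is
$F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(u)\,du$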
33
Continuous random variable

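The mean: $\mu = E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx$
The variance: $\sigma^2 = V(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\,dx$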

34
Continuous uniform distribution

A continuous random variable X with probability density function
$f_X(x) = \frac{1}{b - a}, \quad a \le x \le b$
has a uniform distribution

Mean: $\mu = (a + b)/2$
Variance: $\sigma^2 = (b - a)^2/12$
35
In R

dunif(x, min, max)
punif(q, min, max)
qunif(p, min, max)

36
Normal distribution

A random variable X with probability density function
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2/(2\sigma^2)}, \quad -\infty < x < \infty$

has a normal distribution with parameters μ and σ

Mean E(X) = μ
Variance V(X) = σ²
37
Standard Normal distribution

If X is a normal random variable with E(X) = μ and Var(X) = σ², then the random variable
Z = (X - μ)/σ
is a normal random variable with E(Z) = 0 and Var(Z) = 1.
That is, Z is a standard normal random variable

38
Central limit theorem

If X1, X2, ..., Xn is a random sample of size n taken from a population with mean μ and finite variance σ², and $\bar{X}$ is the sample mean, then the limiting form of the distribution of
$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
is the standard normal distribution as n -> infinity
With the standard error $\sigma/\sqrt{n}$
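
The theorem can be checked by simulation. The sketch below is an illustration under assumed settings (an exponential population with rate 1, samples of size 50, 5000 replicates); none of these choices come from the slides.

```r
set.seed(1)    # for reproducibility

n      <- 50   # sample size
lambda <- 1    # exponential population: mean 1/lambda, standard deviation 1/lambda
mu     <- 1 / lambda
sigma  <- 1 / lambda

# 5000 sample means, each computed from a fresh sample of size n
xbar <- replicate(5000, mean(rexp(n, rate = lambda)))

se <- sigma / sqrt(n)    # standard error of the mean
z  <- (xbar - mu) / se   # standardized sample means

z_mean <- mean(z)        # should be close to 0
z_sd   <- sd(z)          # should be close to 1
```

Although the population is strongly skewed, a histogram of `z` (for example `hist(z)`) should look close to the standard normal curve.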

39
Standard Normal distribution
NORMSDIST(2.52)
P(X < 2.52) = 0.994132
P(X > 2.52) = 1 - 0.994132 = 0.005868
40
In Excel and Matlab, Normal distribution

Normal distribution
Non cumulative
NORMDIST(x, μ, σ, FALSE) (Excel)
NORMPDF(X, μ, σ) (Matlab)
Cumulative
NORMDIST(x, μ, σ, TRUE) (Excel)
NORMCDF(X, μ, σ) (Matlab)
P(X < x)
Standard normal distribution
Cumulative
NORMSDIST(z)
P(Z < z)
41
Matlab
Normal distribution
Non cumulative: NORMPDF(X, μ, σ)
Cumulative: NORMCDF(X, μ, σ), P(X < x)
Inverse: X = NORMINV(P, MU, SIGMA)
X = NORMINV(P) for the standard normal, with P = 1 - α/2
42
In R

dnorm(x, mean, sd)
pnorm(q, mean, sd)
qnorm(p, mean, sd)
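
A usage sketch for the normal functions in R. The value 2.52 matches the NORMSDIST example in these slides; the non-standard normal with mean 100 and standard deviation 15 is a hypothetical example.

```r
p_left  <- pnorm(2.52)                       # P(Z <= 2.52), as NORMSDIST(2.52)
p_right <- pnorm(2.52, lower.tail = FALSE)   # P(Z > 2.52)
z_975   <- qnorm(0.975)                      # quantile with 97.5% of the area to its left

# Non-standard normal (hypothetical parameters): two equivalent ways
mu    <- 100
sigma <- 15
p_direct <- pnorm(130, mean = mu, sd = sigma)   # P(X <= 130)
p_std    <- pnorm((130 - mu) / sigma)           # same, after standardizing
```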

43
Test for normality

Qqplot
Kstest (Kolmogorov-Smirnov)
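
In R, a QQ plot and the Kolmogorov-Smirnov test could be run as sketched below. The data are simulated (an assumption); `shapiro.test` is added as a common alternative. When the mean and standard deviation are estimated from the same data, the Kolmogorov-Smirnov p-value is only approximate.

```r
set.seed(42)
x <- rnorm(100, mean = 10, sd = 2)   # simulated data, for illustration only

# QQ plot: points close to the reference line suggest normality
qqnorm(x)
qqline(x)

# Kolmogorov-Smirnov test against a normal with the estimated parameters
ks <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

# Shapiro-Wilk test of normality
sw <- shapiro.test(x)

p_values <- c(ks = ks$p.value, shapiro = sw$p.value)
```

A small p-value is evidence against normality; a large one does not prove normality.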

44
Exponential distribution

The random variable X that equals the distance between successive counts of a Poisson process with mean λ > 0 has an exponential distribution with parameter λ.

45
In excel

Non cumulative
EXPONDIST(x, λ, FALSE)
Cumulative
EXPONDIST(x, λ, TRUE)
P(X < x)
46
In matlab
Non cumulative: EXPPDF(X, MU), where MU = 1/λ is the mean
Cumulative: EXPCDF(X, MU), P(X < x)
Inverse: X = EXPINV(P, MU), with P = 1 - α/2
47
In R

dexp(x, rate)
pexp(q, rate)
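
A usage sketch for the exponential functions in R, with a hypothetical rate λ = 2. R and Excel are parameterized by the rate λ, while Matlab's functions take the mean 1/λ.

```r
lambda <- 2   # rate of the Poisson process (assumed for illustration)

d_0   <- dexp(0, rate = lambda)     # density at 0, equal to lambda
p_le1 <- pexp(1, rate = lambda)     # P(X <= 1) = 1 - exp(-lambda)
x_med <- qexp(0.5, rate = lambda)   # median = log(2) / lambda
mu    <- 1 / lambda                 # mean of the distribution
```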

48
Chi-Square distribution
Let Z1, Z2, ..., Zk be normally and independently distributed random variables, with mean μ = 0 and variance σ² = 1.
Then the random variable $X = Z_1^2 + Z_2^2 + \cdots + Z_k^2$ is said to follow a chi-square distribution $\chi^2_k$ with k degrees of freedom.

As such it can be proved that $(n-1)S^2/\sigma^2$ is distributed as $\chi^2_{n-1}$
A chi-square distribution is also used for the goodness of fit test

The degrees of freedom are k-p-1, with p the number of parameters of the hypothesized distribution and k the number of sample intervals
49
In excel

χ²(α, ν) = CHIINV(α, ν)
α = CHIDIST(x, df)

[Figure: right-tail area P = α beyond x]
50
matlab

Probability density function
chi2pdf(X,V)
Cumulative distribution
chi2cdf(X,V)
Inverse of the cumulative distribution
chi2inv(P,V)

[Figure: left-tail area P = 1 - α up to x]
51
In R

dchisq(x, df)
pchisq(q, df)
qchisq(p, df)
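
A usage sketch relating R's chi-square functions to the right-tail convention of Excel's CHIINV and CHIDIST. The choice of 10 degrees of freedom and α = 0.05 is hypothetical.

```r
df    <- 10     # degrees of freedom (assumed for illustration)
alpha <- 0.05

# Critical value with right-tail area alpha, as CHIINV(alpha, df)
crit       <- qchisq(1 - alpha, df)
crit_upper <- qchisq(alpha, df, lower.tail = FALSE)   # same value

# Right-tail probability beyond crit, as CHIDIST(crit, df)
p_right <- pchisq(crit, df, lower.tail = FALSE)
```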

52
T-distribution

In Excel
1 tail
TDIST(x,df,1)
2 tail
TDIST(x,df,2)

TDIST(2.52,10000000,1)
P(X > 2.52) = 0.005868
TDIST(2.52,10000000,2)
P(X > 2.52) + P(X < -2.52) = 0.011735498
53
In matlab

Probability density function
TPDF(X,V)
Cumulative distribution
P = TCDF(X,V)
Inverse of the cumulative distribution
X = TINV(P,V), always one-sided

[Figure: left-tail area P = 1 - α up to x]
54
In R

dt(x, df)
pt(q, df)
qt(p, df)
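
A sketch of R equivalents for the Excel TDIST and TINV calls shown in these slides. The degrees of freedom for the critical value (10) are a hypothetical choice.

```r
# One-tailed probability, as TDIST(2.52, 10000000, 1)
p_one <- pt(2.52, df = 1e7, lower.tail = FALSE)   # P(X > 2.52)

# Two-tailed probability, as TDIST(2.52, 10000000, 2)
p_two <- 2 * p_one                                # P(X > 2.52) + P(X < -2.52)

# Two-sided critical value for alpha = 0.05 and 10 df, as TINV(0.05, 10)
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df = 10)
```

With 10,000,000 degrees of freedom the t distribution is practically the standard normal, which is why `p_one` agrees with the normal tail probability for 2.52.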

55
T-distribution

In Excel
TINV(α,df)
Always two-sided: returns t such that P(|X| > t) = α

[Figure: tails beyond -t and t with total area α]
56
F-distribution
Let W and Y be independent chi-square random variables with u and v degrees of freedom respectively. Then the ratio F = (W/u)/(Y/v) follows an F distribution with u degrees of freedom in the numerator and v degrees of freedom in the denominator. An example of a random variable that follows the F distribution is the ratio
$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$
with n1-1 numerator degrees of freedom and n2-1 denominator degrees of freedom, where σ1² and σ2² are the variances of two normal populations from which two independent random samples of size n1 and n2, with sample variances S1² and S2², were taken.
57
F-distribution

f = FINV(α,df1,df2)
α = FDIST(f,df1,df2)
P(X > f) = α

[Figure: right-tail area α beyond f]
58
In Matlab

Probability density function
fpdf(X,V1,V2)
Cumulative distribution
P = fcdf(X,V1,V2)
Inverse of the F cumulative distribution
X = finv(P,V1,V2)

[Figure: left-tail area P = 1 - α up to x]
59
In R

df(x, df1, df2)
pf(q, df1, df2)
qf(p, df1, df2)
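
A usage sketch for the F functions in R, matched to the right-tail convention of Excel's FINV and FDIST. The degrees of freedom (5 and 10) and α = 0.05 are hypothetical.

```r
alpha <- 0.05

# Critical value with right-tail area alpha, as FINV(alpha, 5, 10)
f_crit <- qf(1 - alpha, df1 = 5, df2 = 10)

# Right-tail probability beyond f_crit, as FDIST(f_crit, 5, 10)
p_right <- pf(f_crit, df1 = 5, df2 = 10, lower.tail = FALSE)

# Link with the t distribution: the square of a t variable with v df is F(1, v)
f_1_10 <- qf(0.95, df1 = 1, df2 = 10)
t_sq   <- qt(0.975, df = 10)^2
```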

60
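Continuous distributions
Mean of a continuous distribution: $\mu = \int_{-\infty}^{\infty} x f_X(x)\,dx$
Variance of a continuous distribution: $\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\,dx$
Standard normal random variable: $Z = (X - \mu)/\sigma$
Mean of an exponential distribution: $\mu = 1/\lambda$
Variance of an exponential distribution: $\sigma^2 = 1/\lambda^2$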
61
Statistics

Confidence intervals

62
Confidence intervals

Formularium
Cases (during course)
Additional exercises

63
1.
2. Difference in two means, μ1 and μ2, variances σ1² and σ2² known
3. Mean μ of a normal distribution, variance σ² unknown
4. Difference in means, μ1 and μ2, of normal distributions, variances σ1² = σ2² unknown
5. Difference in means, μ1 and μ2, of normal distributions, variances σ1² ≠ σ2² unknown
6. Difference in means, μ1 and μ2, of normal distributions for paired samples
Variance σ² of a normal distribution
64
Statistics

Hypothesis testing

65
Hypothesis testing

Formularium
Calculations in excel
Cases exercises (in course)
Integrated exercise (in course)
Additional exercises (at home)

66
1.
2.
3.
4.
Remark! Formula of the pooled variance different
from the one in the course notes
5.
67
6.
7.
8.
9.
Goodness of fit test
P-value: the smallest level of significance that would lead to rejection of the null hypothesis H0
68
Power and sample size

Power = 1 - β

69
Hypothesis testing

Compare means of 2 independent samples
Check normality assumption in both groups
QQ plot
Goodness of fit
Use F test to check the assumption on the
equality of the variances
Perform a t-test to detect differences in means
See integrated exercise
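
The steps above could be sketched in R as follows. The two groups are simulated (an assumption), and the 0.05 cut-off used to choose between the pooled and the Welch t-test is a common convention rather than something stated in the slides.

```r
set.seed(7)
a <- rnorm(20, mean = 10, sd = 2)   # simulated group A, for illustration only
b <- rnorm(20, mean = 12, sd = 2)   # simulated group B, for illustration only

# 1. Check normality in both groups (QQ plot and Shapiro-Wilk test)
qqnorm(a); qqline(a)
qqnorm(b); qqline(b)
p_norm <- c(a = shapiro.test(a)$p.value, b = shapiro.test(b)$p.value)

# 2. F test on the equality of the two variances
ft <- var.test(a, b)

# 3. t-test on the means: pooled variance if the F test does not reject,
#    Welch's unequal-variance version otherwise
equal_var <- ft$p.value > 0.05
tt <- t.test(a, b, var.equal = equal_var)
```

`tt$p.value` and `tt$conf.int` then give the p-value and the confidence interval for the difference in means.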

70
In Excel and R
Left tail: NORMSDIST(3.25) = pnorm(3.25)
P(X < 3.25) = 0.9994
Right tail: TDIST(3.25,10000000,1) = 1-pnorm(3.25) = pnorm(3.25, lower.tail = FALSE)
P(X > 3.25) = 1 - 0.9994 = 0.0006
For comparison, NORMSDIST(2.52) = 0.994132, so P(X > 2.52) = 0.005868
71
In Excel: TINV
α = 0.05 => 1 - α = 0.95
[Figure: central area 1 - α = 0.95, critical value marked at 1.9]
72
In Excel: NORMSINV
α = 0.05, two-sided test => α/2 = 0.025 in each tail
P(X < t1) = 0.025
P(X < t2) = 0.975
[Figure: critical values marked at -1.9 and 1.9]
73
In excel

CHIINV(α,df)
FINV(α,df_numerator,df_denominator)

74
Simple Linear regression

Y: response variable, X: regressor or predictor
Linear: linear in the parameters
Simple: single regressor or predictor

75
Linear regression
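The least squares estimator $\hat{\beta}_1$ is an unbiased estimator of the true slope $\beta_1$
The least squares estimator $\hat{\beta}_0$ is an unbiased estimator of the intercept $\beta_0$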
76
Linear regression
77
Linear regression
$\hat{\sigma}^2 = SS_E/(n - p)$ is an unbiased estimator of the variance $\sigma^2$
p the number of parameters
78
Linear regression
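In simple linear regression the estimated standard errors of the slope and the intercept are
$se(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2 / S_{xx}}$ and $se(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2 \left[ 1/n + \bar{x}^2 / S_{xx} \right]}$, with $S_{xx} = \sum_i (x_i - \bar{x})^2$
Hypothesis tests: t test
t distribution with n-p df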
79
Linear regression
Analysis of variance approach: F test
SSE: error sum of squares (n-p df)
SSR: regression sum of squares (p-1 df)
Syy: total corrected sum of squares (n-1 df)
If F < F(1-α; 1, n-2), conclude H0
80
Linear regression
Coefficient of determination judge the adequacy
of the regression model
Standardized residuals
81

Prediction of the expected response
T distribution n-2 df
Prediction of the response of a new observation

82
Linear regression

Linear regression
Estimate parameters of the model
Test whether β1 ≠ 0
T test
F test
Test the adequacy of the regression model
Residual plots (residuals vs regressors,
predicted values)
(check constant variance of residuals)
studentised residuals (random, between -2, 2)
Checking for outliers
QQ plot (check normality of residuals)
Coefficient of determination
Check for constant variance (Levene test)
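
The checklist above could be followed in R roughly as sketched below. The data are simulated from a known line (an assumption); `rstandard` returns standardized (internally studentized) residuals, and `rstudent` would give the externally studentized version.

```r
set.seed(3)
x <- 1:20
y <- 2 + 3 * x + rnorm(20, sd = 0.5)   # simulated data, for illustration only

# Estimate the parameters of the model
fit <- lm(y ~ x)
b   <- coef(fit)                       # intercept and slope

# Test whether the slope differs from 0
t_table <- summary(fit)$coefficients   # t test for each coefficient
f_table <- anova(fit)                  # F test (analysis of variance approach)

# Adequacy of the model
r2  <- summary(fit)$r.squared          # coefficient of determination
res <- rstandard(fit)                  # standardized residuals
plot(fitted(fit), res)                 # look for patterns or changing spread
abline(h = c(-2, 2), lty = 2)          # most points should fall between -2 and 2
qqnorm(res); qqline(res)               # normality of the residuals
```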

83
Multiple linear regression

F-test H0: β1 = β2 = 0
Adjusted R²
H0: βj = 0 vs H1: βj ≠ 0

If F < F(1-α; p-1, n-p), conclude H0
t distribution with n-p df
84
Polynomial model

Polynomial with one regressor
Linear in the coefficients => multiple linear regression
Matlab: polytool(X, Y, N, Alfa)
Polynomial with several independent variables

85
Different models

Analysis of variance models
Regression models with only qualitative predictor
variables
Covariance models
Models with both quantitative and qualitative
variables
Quantitative variables reduce the variance of the
error terms

In matlab
Total
[r, p] = corrcoef(total) (p-values based on a Student t distribution)
[b, bint, r, rint, stats] = regress(var1, var2)
regstats
polytool(var1, var2)
rcoplot(r, rint)

87
ANOVA (one way)

Linear statistical model
Overall mean
Treatment effect (with different levels)

88
ANOVA
SST: total sum of squares
SSE: error sum of squares
SSA: treatment sum of squares
k groups, n samples per group, N total number of samples
F test with k-1 and N-k df
89
ANOVA

Assumptions
Values of the dependent variable are normally distributed within each group (QQ plot per group, Shapiro-Wilk test)
The population variance is the same in each group (box plot, Levene test per pair)
The observations are a random sample and they are independent (according to design?)
Residual analysis and model checking
Contain information on the unexplained
variability
Plot QQ plot of the residuals
Plot residuals against factor levels, compare
spread in residuals
Plot residuals against fitted values and see
whether the plot is without pattern
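
A sketch of a one-way ANOVA with these checks in R. The data are hypothetical (three factor levels, five observations each). The Levene test is not part of base R, so `bartlett.test` is used here as a stand-in for checking equal variances; it is a different test and more sensitive to non-normality.

```r
# Hypothetical data: 3 factor levels, 5 observations each
y <- c(18, 20, 19, 22, 21,
       24, 25, 23, 27, 26,
       30, 29, 31, 28, 32)
g <- factor(rep(c("A", "B", "C"), each = 5))

fit <- aov(y ~ g)
tab <- summary(fit)[[1]]     # ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)

# Residual analysis and model checking
res <- residuals(fit)
qqnorm(res); qqline(res)     # normality of the residuals
sw <- shapiro.test(res)
bt <- bartlett.test(y ~ g)   # equal variances (stand-in for Levene, see above)
boxplot(y ~ g)               # compare spread across factor levels
plot(fitted(fit), res)       # should show no pattern
```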

90
ANOVA

POST HOC
Is there a pair of factor levels with the same
mean
Test all pairs of factor levels on the same mean
Are there homogeneous groups, factor levels with
the same mean

91
ANOVA

Multiple comparison procedures

Planned comparison (LSD)
df n-k, equal sample sizes
A posteriori comparison
Tukey, pairwise, equal sample sizes, n sample
size in each group
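
In R, Tukey's pairwise comparisons are available through `TukeyHSD`. The sketch below uses hypothetical data; unadjusted pairwise t tests with a pooled standard deviation are shown as an approximation in the spirit of LSD, not as the exact procedure of the course notes.

```r
# Hypothetical data: 3 factor levels, 5 observations each
y <- c(18, 20, 19, 22, 21,
       24, 25, 23, 27, 26,
       30, 29, 31, 28, 32)
g <- factor(rep(c("A", "B", "C"), each = 5))

fit <- aov(y ~ g)

# Tukey: all pairwise differences with simultaneous 95% confidence intervals
tk     <- TukeyHSD(fit, conf.level = 0.95)
tk_tab <- tk$g   # columns: diff, lwr, upr, p adj

# Unadjusted pairwise t tests with pooled SD (LSD-like comparison)
lsd <- pairwise.t.test(y, g, p.adjust.method = "none")
```

Pairs whose Tukey interval excludes 0 have significantly different means; the remaining pairs form homogeneous groups.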
92
ANOVA
Contrast: comparison involving 2 or more factor level means
Scheffé: all types of comparisons
r number of groups, nj sample size of the contrast of interest
93
Detect homogeneous groups (equal sample mean)
Duncan/Newman-Keuls multiple range test
Stepwise, used for pairwise comparisons
Multiplier: df of MSE, number of steps
Arrange the treatment means in ascending order
Determine the standard error
Obtain the value Rp
Determine the least significant range Ra
94
ANOVA
Dunnett
Compare treatments with a single control mean
Multiplier: df of MSE, number of groups
95
Non parametric one way ANOVA

Single factor of variance model
Error terms have the same continuous distribution
for all factor levels

Non parametric rank F test
FR test with k-1 and N-k df
Kruskal-Wallis
The Kruskal-Wallis approximation of the rank F test can be used if ni > 6
If χ²KW < χ²(1-α; k-1), conclude H0
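
A sketch of the Kruskal-Wallis decision rule in R, using hypothetical data with three groups of seven observations and no ties.

```r
# Hypothetical data: k = 3 groups, 7 observations each
y <- c(1:7, 8:14, 15:21)
g <- factor(rep(c("A", "B", "C"), each = 7))

kw <- kruskal.test(y ~ g)

k         <- nlevels(g)
chi2_kw   <- unname(kw$statistic)   # Kruskal-Wallis statistic
chi2_crit <- qchisq(0.95, k - 1)    # critical value for alpha = 0.05

# Conclude H0 if the statistic is below the critical value
conclude_h0 <- chi2_kw < chi2_crit
```

Here the groups do not overlap at all, so the statistic is far above the critical value and H0 is rejected.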
96
Logistic regression

Generalized linear model
Regression with a binary response variable
Constraint on the response curve: it must lie in [0, 1]
Normal error model is not appropriate
The error variance is not constant

Mean of the response is modeled as a monotonic
non linear transformation of a linear function of
the predictors
Can be linearized (LOGIT transformation)
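
A minimal logistic regression sketch in R with `glm`. The binary data are hypothetical; `family = binomial` uses the logit link by default.

```r
# Hypothetical data: binary response y against one predictor x
x <- 1:10
y <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)

fit <- glm(y ~ x, family = binomial)   # logit link by default

b     <- coef(fit)      # coefficients on the logit (linear) scale
eta   <- predict(fit)   # linear predictor: b0 + b1 * x
p_hat <- fitted(fit)    # estimated mean response, always in (0, 1)

# The mean response is the inverse logit of the linear predictor
p_check <- plogis(eta)
```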
