Title: Statistics
1 Statistics
2
- Books
  - Using R for Introductory Statistics (J. Verzani)
  - Applied Statistics and Probability for Engineers (Montgomery, Runger)
- Exam exercises
3 Descriptive statistics
- Data
  - Categorical versus numeric
  - Univariate, bivariate, multivariate
- Numerical summaries
  - Frequency tables
  - Central value
  - Variability
  - Skewness and kurtosis
- Graphical summaries
  - Categorical: pie charts and bar charts
  - Numerical: histograms and boxplots
4 Tables
- Represent the frequency of each value in your data
- Numerical data
  > table(scores[, 1])
   8 10 11 12 13 14 15 16 18 19
   1  1  5  3  2  1  1  2  1  1
- Categorical data
  > res = c("Y", "Y", "N", "N", "N")
  > table(res)
  res
  N Y
  3 2
5 Measures of central value
- Location of the center of the distribution
- The mean
  - mean
- The mode
  - Sort the values in ascending order, select the value occurring most frequently.
- The median
  - median
- Percentiles
  - quantile
- Midrange
  - mean(range(x))
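A minimal R sketch of these measures. The vector `x` is rebuilt from the frequency table on the Tables slide, and the helper names (`mode_x`, `midrange_x`, ...) are illustrative only. Base R has no function for the statistical mode, so it is taken here as the most frequent value of `table(x)`.

```r
# Scores rebuilt from the frequency table on the Tables slide
x <- rep(c(8, 10, 11, 12, 13, 14, 15, 16, 18, 19),
         times = c(1, 1, 5, 3, 2, 1, 1, 2, 1, 1))

mean_x     <- mean(x)                     # arithmetic mean
median_x   <- median(x)                   # middle value of the sorted data
quartiles  <- quantile(x, c(0.25, 0.75))  # 25th and 75th percentiles
midrange_x <- mean(range(x))              # (min + max) / 2

# Mode: the value with the highest frequency (first one if there are ties)
freq   <- table(x)
mode_x <- as.numeric(names(freq)[which.max(freq)])

print(c(mean = mean_x, median = median_x, mode = mode_x, midrange = midrange_x))
```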
6 Measures of dispersion
- Indicate how the data are spread about the central value
- The range
  - Max - min
- The variance
  - var
- The standard deviation
  - sd (R), STDEV (Excel)
- The coefficient of variation
- The standard error
- Interquartile range
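The dispersion measures listed above can be sketched in R as follows. The data are again the scores from the frequency table on the Tables slide. The coefficient of variation and the standard error of the mean have no dedicated base R function, so they are computed from `sd()`; the variable names are illustrative.

```r
# Scores rebuilt from the frequency table on the Tables slide
x <- rep(c(8, 10, 11, 12, 13, 14, 15, 16, 18, 19),
         times = c(1, 1, 5, 3, 2, 1, 1, 2, 1, 1))

range_x <- diff(range(x))          # max - min
var_x   <- var(x)                  # sample variance (divides by n - 1)
sd_x    <- sd(x)                   # sample standard deviation
cv_x    <- sd_x / mean(x)          # coefficient of variation
se_x    <- sd_x / sqrt(length(x))  # standard error of the mean
iqr_x   <- IQR(x)                  # third quartile minus first quartile

print(c(range = range_x, var = var_x, sd = sd_x, cv = cv_x, se = se_x, iqr = iqr_x))
```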
7 Skewness and kurtosis
- Skewness measures the degree of asymmetry of a distribution
- Kurtosis characterises the relative peakedness or flatness of a distribution compared to the normal distribution
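Base R has no skewness or kurtosis function, so the sketch below uses the simple moment-based definitions (third and fourth central moments scaled by powers of the second). This is one of several conventions and is an assumption here, as are the function names; Excel's SKEW and KURT apply small-sample corrections and will give slightly different numbers.

```r
# Moment-based skewness: 0 for a symmetric sample, > 0 for a long right tail
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: 0 for the normal distribution,
# > 0 for more peaked / heavier-tailed, < 0 for flatter distributions
moment_excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Scores rebuilt from the frequency table on the Tables slide
x <- rep(c(8, 10, 11, 12, 13, 14, 15, 16, 18, 19),
         times = c(1, 1, 5, 3, 2, 1, 1, 2, 1, 1))
skew_x <- moment_skewness(x)
kurt_x <- moment_excess_kurtosis(x)
print(c(skewness = skew_x, excess_kurtosis = kurt_x))
```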
8 Descriptive statistics
- Calculate the maximum value
  - (using the max function in Excel/R)
- Calculate the minimum value
  - (using the min function in Excel/R)
- Calculate the range
- Calculate the sum of all values
  - Using sum
9 Graphical summaries
- Bar chart
- Pie chart
- Dot chart
- Histogram
- Boxplot
12 Exercises
- For R
  - oefeningenR_Chapter12_opdrachten.doc
- For Excel
  - Session1_DescriptiveStatistics/Descriptive_Statistics.doc
  - Course: Session1_AlloySpecimens_unsolved.xls
  - Course: microarray_nonnormalised_unsolved.xls
  - Home: microarray_norm_solved.xls
14 Statistics
15 Probabilities
- Random experiment
- Sample space
- Event
  - Definition
  - New events as combinations of events
  - Mutually exclusive events
- Likelihood
  - Definition of probability: Bayesian or frequency interpretation
  - Events have probabilities
  - Conditional probability
  - Independence
  - Multiplicative law of probability
  - Law of total probability
  - Bayes' rule
16 Probability rules
- Independence
- Bayes' rule
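The formulas on this slide were not transcribed. As a reminder, the standard textbook forms of the rules named above are given below; the event names A, B and the partition E_1, ..., E_k are notation chosen here, not taken from the slide.

```latex
\begin{align*}
P(A \mid B) &= \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0
  && \text{(conditional probability)} \\
P(A \cap B) &= P(A \mid B)\,P(B) = P(B \mid A)\,P(A)
  && \text{(multiplicative law)} \\
P(A \cap B) &= P(A)\,P(B)
  && \text{(A and B independent)} \\
P(B) &= \sum_{i=1}^{k} P(B \mid E_i)\,P(E_i)
  && \text{(total probability, } E_i \text{ mutually exclusive and exhaustive)} \\
P(E_j \mid B) &= \frac{P(B \mid E_j)\,P(E_j)}{\sum_{i=1}^{k} P(B \mid E_i)\,P(E_i)}
  && \text{(Bayes' rule)}
\end{align*}
```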
17 Statistics
- Probability distributions
18 Probability distributions
- 1. Mean and variance of a discrete distribution
- 2. Binomial distribution
- 3. Hypergeometric distribution
- 4. Poisson distribution
19 Mean and variance of a discrete distribution
- Mean: weight the possible values by their probabilities
- Variance: weight the squared deviations from the mean by the probabilities
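As a small worked example of these two weighted sums, the sketch below uses a fair six-sided die. The die is an illustrative choice, not an example from the course.

```r
# Fair die: possible values and their probabilities
x <- 1:6
p <- rep(1 / 6, 6)
stopifnot(isTRUE(all.equal(sum(p), 1)))  # probabilities must sum to 1

mu     <- sum(x * p)           # mean: values weighted by probability
sigma2 <- sum((x - mu)^2 * p)  # variance: squared deviations weighted by probability

print(c(mean = mu, variance = sigma2))
```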
20 Binomial distribution
- Binomial experiment
  - Trials are independent
  - Each trial can be summarized as resulting in either a success or a failure (Bernoulli trial)
  - The probability of a success on each trial, denoted p, remains constant
- The random variable X that equals the number of trials that result in a success has a binomial distribution with parameters p and n = 1, 2, ...
- The probability mass function of X is $f(x) = \binom{n}{x} p^x (1-p)^{n-x}$, $x = 0, 1, \ldots, n$
21 In Excel
- Probability distribution
  - BINOMDIST(x, n, p, FALSE)
- Cumulative
  - BINOMDIST(x, n, p, TRUE)
  - P(X <= x)
- n: number of trials
- x: number of successes
22 In Matlab
- Probability distribution
  - BINOPDF(X,N,P) returns the binomial probability density function with parameters N and P at the values in X
  - BINOCDF(X,N,P) returns the binomial cumulative distribution function with parameters N and P at the values in X, P(X <= x)
  - BINOINV([0.05 0.95],162,0.5)
- n: number of trials
- x: number of successes
23 In R
- dbinom(x, size, prob)
- pbinom(q, size, prob)
- qbinom(p, size, prob)
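A short sketch of the three R calls with named arguments, which avoids mixing up the order of `size` (number of trials n) and `prob` (success probability p). The values n = 10 and p = 0.5 are illustrative; the `qbinom` call mirrors the BINOINV example from the Matlab slide.

```r
n <- 10   # number of trials (illustrative)
p <- 0.5  # probability of success on each trial (illustrative)

p_exact <- dbinom(2, size = n, prob = p)  # P(X = 2)
p_cum   <- pbinom(2, size = n, prob = p)  # P(X <= 2)

# Same quantity from the probability mass function written out by hand
p_manual <- choose(n, 2) * p^2 * (1 - p)^(n - 2)

# Quantiles, same inputs as the BINOINV example on the Matlab slide
q <- qbinom(c(0.05, 0.95), size = 162, prob = 0.5)

print(c(p_exact = p_exact, p_manual = p_manual, p_cum = p_cum))
print(q)
```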
24 Hypergeometric distribution
- A set of N items contains
  - K items classified as successes and
  - N - K items classified as failures
- A sample of n items is selected at random, without replacement, from the N items, where K <= N and n <= N
- Let the random variable X denote the number of successes in the sample. Then X has a hypergeometric distribution
25 In Excel and Matlab
- In Excel
  - HYPGEOMDIST(x,n,K,N)
- In Matlab
  - hygepdf(x,N,K,n)
  - hygecdf(x,N,K,n)
26 In R
- dhyper(x, m, n, k)
- phyper(q, m, n, k, lower.tail = TRUE)
- qhyper(p, m, n, k, lower.tail = TRUE)
- rhyper(nn, m, n, k)
- m: number of white balls (successes in the set)
- n: number of black balls (failures in the set)
- k: number of trials (balls drawn)
- x: number of white balls drawn (successes)
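R's argument names differ from the N, K, n notation of the definition slide, so a sketch of the mapping may help: `m = K`, `n = N - K`, `k = n` (sample size). The numbers below (N = 50, K = 10, sample of 5) are purely illustrative.

```r
N_items  <- 50  # total number of items (illustrative)
K_succ   <- 10  # items classified as successes (illustrative)
n_sample <- 5   # sample size, drawn without replacement (illustrative)

x <- 0:n_sample
# R notation: m = successes in the set, n = failures in the set, k = sample size
pmf <- dhyper(x, m = K_succ, n = N_items - K_succ, k = n_sample)

# Same probabilities from the counting formula
pmf_manual <- choose(K_succ, x) * choose(N_items - K_succ, n_sample - x) /
  choose(N_items, n_sample)

# P(X <= 1): at most one success in the sample
p_at_most_1 <- phyper(1, m = K_succ, n = N_items - K_succ, k = n_sample)

print(round(pmf, 4))
```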
27 Poisson distribution
- Given an interval of real numbers, assume counts occur at random throughout the interval. If the interval can be partitioned into subintervals of small enough length such that
  - The probability of more than one count in a subinterval is 0
  - The probability of one count in a subinterval is the same for all subintervals and proportional to the length of the subinterval
  - The count in each subinterval is independent of other subintervals
- Then the random experiment is a Poisson process
- If the mean number of counts in the interval is λ > 0, the random variable X that equals the number of counts in the interval has a Poisson distribution with parameter λ, and the probability mass function is $f(x) = \frac{e^{-\lambda} \lambda^x}{x!}$, $x = 0, 1, 2, \ldots$
28 In Excel and Matlab
- Non-cumulative
  - Excel: POISSON(x, λ, FALSE)
  - Matlab: POISSPDF(X, λ)
- Cumulative
  - Excel: POISSON(x, λ, TRUE)
  - Matlab: POISSCDF(X, λ)
  - P(X <= x)
29 In R
- dpois(x, lambda)
- ppois(q, lambda, lower.tail = TRUE)
- qpois(p, lambda, lower.tail = TRUE)
- rpois(n, lambda)
- lower.tail: if TRUE (default), probabilities are P[X <= x], otherwise P[X > x]
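A brief sketch of the effect of `lower.tail`, using an illustrative mean of λ = 2 counts per interval (not a value from the course).

```r
lambda <- 2  # mean number of counts in the interval (illustrative)

p0     <- dpois(0, lambda)                      # P(X = 0) = exp(-lambda)
p_le_2 <- ppois(2, lambda)                      # P(X <= 2), lower.tail = TRUE by default
p_gt_2 <- ppois(2, lambda, lower.tail = FALSE)  # P(X > 2)

# The two tails are complementary
print(c(p0 = p0, p_le_2 = p_le_2, p_gt_2 = p_gt_2, total = p_le_2 + p_gt_2))
```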
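30 Discrete distributions
- Mean of a discrete distribution: $\mu = \sum_x x\, f(x)$
- Variance of a discrete distribution: $\sigma^2 = \sum_x (x - \mu)^2 f(x)$
- Mean of a uniform distribution (on the integers a, ..., b): $\mu = (a + b)/2$
- Mean of a binomial distribution: $\mu = np$
- Variance of a binomial distribution: $\sigma^2 = np(1 - p)$
- Mean of a Poisson distribution: $\mu = \lambda$
- Variance of a Poisson distribution: $\sigma^2 = \lambda$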
31 Probability distributions
- 5. Mean and variance of continuous distributions
- 6. Normal distribution
- 7. Exponential distribution
- 8. Central limit theorem
- 9. The chi-square distribution
- 10. t distribution
- 11. F distribution
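32 Continuous random variable
- A function $f_X(x)$ is a probability density function of a continuous random variable X if for any interval of real numbers $[x_1, x_2]$: $P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} f_X(u)\,du$
- The cumulative distribution is $F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(u)\,du$
33 Continuous random variable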
34 Continuous uniform distribution
- A continuous random variable X with probability density function $f(x) = 1/(b - a)$, $a \le x \le b$, has a uniform distribution
- Mean: $\mu = (a + b)/2$
- Variance: $\sigma^2 = (b - a)^2/12$
35 In R
- dunif(x, min, max)
- punif(q, min, max)
- qunif(p, min, max)
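36 Normal distribution
- A random variable X with probability density function $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / (2\sigma^2)}$, $-\infty < x < \infty$
- Has a normal distribution with parameters μ and σ
- Mean: E(X) = μ
- Variance: V(X) = σ²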
37 Standard normal distribution
- If X is a normal random variable with E(X) = μ and Var(X) = σ², then the random variable
  - Z = (X - μ)/σ
- Is a normal random variable with E(Z) = 0 and Var(Z) = 1.
- That is, Z is a standard normal random variable
38 Central limit theorem
- If X1, X2, ..., Xn is a random sample of size n taken from a population with mean μ and finite variance σ², and $\bar{X}$ is the sample mean, then the limiting form of the distribution of $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ as n -> infinity is the standard normal distribution
- With the standard error $\sigma/\sqrt{n}$
39 Standard normal distribution
- NORMSDIST(2.52)
- P(X < 2.52) = 0.994132
- P(X > 2.52) = 1 - 0.994132 = 0.005868
40 In Excel: normal distribution
- Normal distribution
  - Non-cumulative
    - Excel: NORMDIST(x, μ, σ, FALSE)
    - Matlab: NORMPDF(X, μ, σ)
  - Cumulative
    - Excel: NORMDIST(x, μ, σ, TRUE)
    - Matlab: NORMCDF(X, μ, σ)
    - P(X < x)
- Standard normal distribution
  - Cumulative
    - NORMSDIST(z)
    - P(Z < z)
    - Example: NORMSDIST(2.52)
41 Matlab
- Normal distribution
  - Non-cumulative: NORMPDF(X, μ, σ)
  - Cumulative: NORMCDF(X, μ, σ), P(X < x)
  - Inverse: X = NORMINV(P, MU, SIGMA)
  - Inverse, standard normal: X = NORMINV(P), with P = 1 - α/2
42 In R
- dnorm(x, mean, sd)
- pnorm(q, mean, sd)
- qnorm(p, mean, sd)
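The sketch below reproduces the NORMSDIST(2.52) example from the standard normal slide with `pnorm`, and shows the inverse with `qnorm`. The non-standard example (mean 10, sd 2) uses made-up parameters to illustrate standardisation.

```r
# Standard normal: defaults are mean = 0, sd = 1
p_lower <- pnorm(2.52)                      # P(Z < 2.52), Excel: NORMSDIST(2.52)
p_upper <- pnorm(2.52, lower.tail = FALSE)  # P(Z > 2.52)
z_975   <- qnorm(0.975)                     # quantile with P(Z < z) = 0.975

# Non-standard normal (illustrative parameters): P(X < 14) for mean 10, sd 2
p_x <- pnorm(14, mean = 10, sd = 2)
p_z <- pnorm((14 - 10) / 2)  # same probability after standardising to Z

print(round(c(p_lower = p_lower, p_upper = p_upper, z_975 = z_975), 6))
```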
43 Test for normality
- QQ plot
- kstest (Kolmogorov-Smirnov)
44 Exponential distribution
- The random variable X that equals the distance between successive counts of a Poisson process with mean λ > 0 has an exponential distribution with parameter λ.
45 In Excel
- Non-cumulative
  - EXPONDIST(x, λ, FALSE)
- Cumulative
  - EXPONDIST(x, λ, TRUE)
  - P(X < x)
46 In Matlab
- Non-cumulative: EXPPDF(X, MU), where MU = 1/λ is the mean
- Cumulative: EXPCDF(X, MU), P(X < x)
- Inverse: X = EXPINV(P, MU), with P = 1 - α/2
47 In R
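The content of this slide was not transcribed. As a hedged sketch, base R provides `dexp`, `pexp` and `qexp`, parameterised by the rate λ (unlike Matlab, which uses the mean 1/λ). The rate 0.5 below is illustrative.

```r
lambda <- 0.5  # rate of the Poisson process (illustrative)

d <- dexp(1, rate = lambda)      # density at x = 1: lambda * exp(-lambda * x)
p <- pexp(1, rate = lambda)      # P(X <= 1) = 1 - exp(-lambda * x)
q <- qexp(0.975, rate = lambda)  # x such that P(X <= x) = 0.975

print(c(density = d, cumulative = p, quantile = q))
```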
48 Chi-square distribution
- Let Z1, Z2, ..., Zk be normally and independently distributed random variables, with mean µ = 0 and variance σ² = 1.
- Then the random variable X = Z1² + Z2² + ... + Zk² is said to follow a chi-square distribution χ²(k) with k degrees of freedom.
- As such it can be proved that (n - 1)S²/σ² is distributed as χ²(n - 1), with S² the sample variance
- A chi-square distribution is also used for the goodness of fit test
  - df is k - p - 1, with p the number of parameters of the hypothesized distribution and k the number of sample intervals
49 In Excel
- χ²(α, ν) = CHIINV(α, ν)
- α = CHIDIST(x, df)
- P = α is the right-tail probability P(X > x)
50 Matlab
- Probability density function
  - chi2pdf(X, V)
- Cumulative distribution
  - chi2cdf(X, V)
- Inverse of the cumulative distribution
  - chi2inv(P, V)
- P = 1 - α is the left-tail probability P(X < x)
51 In R
- dchisq(x, df)
- pchisq(q, df)
- qchisq(p, df)
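Excel's CHIINV and CHIDIST work with the right tail, while R's functions default to the left tail. A minimal sketch of the correspondence, with an illustrative α = 0.05 and 10 degrees of freedom:

```r
alpha <- 0.05  # right-tail probability (illustrative)
df    <- 10    # degrees of freedom (illustrative)

# Upper critical value: same role as CHIINV(alpha, df) in Excel
crit <- qchisq(1 - alpha, df)

# Right-tail probability: same role as CHIDIST(crit, df) in Excel
tail_prob <- pchisq(crit, df, lower.tail = FALSE)

print(c(critical_value = crit, right_tail = tail_prob))
```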
52 t distribution
- In Excel
  - 1 tail
    - TDIST(x, df, 1)
    - Example: TDIST(2.52, 10000000, 1) = P(X > 2.52) = 0.005868
  - 2 tails
    - TDIST(x, df, 2)
    - Example: TDIST(2.52, 10000000, 2) = P(X > 2.52) + P(X < -2.52) = 0.011735498
53 In Matlab
- Arguments (X, V)
- Probability density function
  - TPDF(X, V)
- Cumulative distribution
  - P = TCDF(X, V)
- Inverse of the cumulative distribution
  - X = TINV(P, V), always one-sided
  - P = 1 - α
54 In R
- dt(x, df)
- pt(q, df)
- qt(p, df)
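The Excel TDIST examples (x = 2.52 with 10000000 degrees of freedom) can be reproduced with `pt`, as sketched below. The `qt` call uses an illustrative 20 degrees of freedom and α = 0.05.

```r
# One-tailed: P(T > 2.52), Excel: TDIST(2.52, 10000000, 1)
one_tail <- pt(2.52, df = 1e7, lower.tail = FALSE)

# Two-tailed: P(T > 2.52) + P(T < -2.52), Excel: TDIST(2.52, 10000000, 2)
two_tail <- 2 * one_tail

# Two-sided critical value for alpha = 0.05 and 20 df (illustrative values)
t_crit <- qt(1 - 0.05 / 2, df = 20)

print(round(c(one_tail = one_tail, two_tail = two_tail, t_crit = t_crit), 6))
```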
55 t distribution
- In Excel
  - t = TINV(α, df)
  - Always two-sided: P(|X| > t) = α
56 F distribution
- Let W and Y be independent chi-square random variables with u and v degrees of freedom respectively. Then the ratio F = (W/u)/(Y/v) follows an F distribution with u degrees of freedom in the numerator and v degrees of freedom in the denominator.
- An example of a random variable that follows the F distribution is the ratio $F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$, with n1 - 1 numerator degrees of freedom and n2 - 1 denominator degrees of freedom. Here σ1² and σ2² are the variances of two normal populations from which two independent random samples of size n1 and n2 were taken, and S1² and S2² are the corresponding sample variances.
57 F distribution
- In Excel
  - f = FINV(α, df1, df2)
  - α = FDIST(x, df1, df2)
  - P(X > f) = α
58 In Matlab
- Probability density function
  - fpdf(X, V1, V2)
- Cumulative distribution
  - P = fcdf(X, V1, V2)
- Inverse of the F cumulative distribution
  - X = finv(P, V1, V2)
  - P = 1 - α
59 In R
- df(x, df1, df2)
- pf(q, df1, df2)
- qf(p, df1, df2)
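As with the chi-square functions, Excel's FINV and FDIST use the right tail while R defaults to the left tail. A sketch with illustrative values α = 0.05, df1 = 3 and df2 = 20:

```r
alpha <- 0.05  # right-tail probability (illustrative)
df1   <- 3     # numerator degrees of freedom (illustrative)
df2   <- 20    # denominator degrees of freedom (illustrative)

# Upper critical value: same role as FINV(alpha, df1, df2) in Excel
f_crit <- qf(1 - alpha, df1, df2)

# Right-tail probability: same role as FDIST(f_crit, df1, df2) in Excel
tail_prob <- pf(f_crit, df1, df2, lower.tail = FALSE)

print(c(critical_value = f_crit, right_tail = tail_prob))
```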
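60 Continuous distributions
- Mean of a continuous distribution: $\mu = \int_{-\infty}^{\infty} x\, f(x)\,dx$
- Variance of a continuous distribution: $\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$
- Standard normal random variable: $Z = (X - \mu)/\sigma$
- Mean of an exponential distribution: $\mu = 1/\lambda$
- Variance of an exponential distribution: $\sigma^2 = 1/\lambda^2$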
61 Statistics
62 Confidence intervals
- Formularium
- Cases (during course)
- Additional exercises
63
1.
2. Difference in two means, μ1 and μ2, variances σ1² and σ2² known
3. Mean μ of a normal distribution, variance σ² unknown
4. Difference in means, μ1 and μ2, of normal distributions, variances σ1² = σ2² unknown
5. Difference in means, μ1 and μ2, of normal distributions, variances σ1² <> σ2² unknown
6. Difference in means, μ1 and μ2, of normal distributions for paired samples
- Variance σ² of a normal distribution
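The interval formulas of this slide were not transcribed. As an illustration of one case only (mean μ of a normal distribution, variance σ² unknown), the sketch below computes the usual t-based interval by hand and compares it with `t.test`. The data vector is made up for the example.

```r
x     <- c(9.8, 10.2, 10.4, 9.9, 10.1, 10.3, 9.7, 10.0)  # made-up sample
n     <- length(x)
alpha <- 0.05

# mean +/- t(1 - alpha/2, n - 1) * s / sqrt(n)
half_width <- qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)
ci_manual  <- c(mean(x) - half_width, mean(x) + half_width)

# Same interval from the built-in one-sample t-test
ci_ttest <- as.numeric(t.test(x, conf.level = 1 - alpha)$conf.int)

print(rbind(manual = ci_manual, t.test = ci_ttest))
```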
64 Statistics
65 Hypothesis testing
- Formularium
- Calculations in Excel
- Cases and exercises (in course)
- Integrated exercise (in course)
- Additional exercises (at home)
66
1.
2.
3.
4. Remark: the formula of the pooled variance differs from the one in the course notes
5.
67
6.
7.
8.
9. Goodness of fit test
- P-value: smallest level of significance that would lead to rejection of the null hypothesis H0
68 Power and sample size
69 Hypothesis testing
- Compare means of 2 independent samples
  - Check the normality assumption in both groups
    - QQ plot
    - Goodness of fit
  - Use an F test to check the assumption of equal variances
  - Perform a t-test to detect differences in means
- See integrated exercise
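The workflow above could look as follows in R. This is a sketch on simulated data, not the integrated exercise itself. The Shapiro-Wilk test is used here as a numerical complement to the QQ plot (an assumption of this example), and the 0.05 threshold for choosing between the pooled and the Welch t-test is likewise illustrative.

```r
set.seed(1)
a <- rnorm(15, mean = 10, sd = 2)  # simulated group 1
b <- rnorm(15, mean = 12, sd = 2)  # simulated group 2

# 1. Normality in both groups (visual check: qqnorm(a); qqline(a))
norm_a <- shapiro.test(a)
norm_b <- shapiro.test(b)

# 2. F test for equality of the two variances
f_res <- var.test(a, b)

# 3. t-test: pooled variance if the F test does not reject equal variances
equal_var <- f_res$p.value > 0.05
t_res <- t.test(a, b, var.equal = equal_var)

print(c(shapiro_a = norm_a$p.value, shapiro_b = norm_b$p.value,
        f_test = f_res$p.value, t_test = t_res$p.value))
```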
70 In Excel and R
- NORMSDIST(3.25) corresponds to pnorm(3.25)
- TDIST(3.25, 10000000, 1) corresponds to 1 - pnorm(3.25) = pnorm(3.25, lower.tail = FALSE)
- P(X < 3.25) = 0.9994
- P(X > 3.25) = 1 - 0.9994 = 0.0006
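71 In Excel: TINV
- α = 0.05 => 1 - α = 0.95
- Critical value marked on the figure: about 1.9
72 In Excel: NORMSINV
- α = 0.05, two-sided test => α/2 = 0.025
- P(X < t1) = 0.025, with t1 about -1.96
- P(X < t2) = 0.975, with t2 about 1.96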
73 In Excel
- CHIINV(α, df)
- FINV(α, df_numerator, df_denominator)
74 Simple linear regression
- Y: response variable; X: regressor or predictor
- Linear: linear in the parameters
- Simple: single regressor or predictor
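75 Linear regression
- $\hat{\beta}_1 = S_{xy}/S_{xx}$ is an unbiased estimator of the true slope β1
- $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ is an unbiased estimator of the intercept β0
76 Linear regression
77 Linear regression
- $\hat{\sigma}^2 = SS_E/(n - p)$ is an unbiased estimator of the variance σ²
- p: the number of parameters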
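78 Linear regression
- In simple linear regression the estimated standard errors of the slope and the intercept are $se(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2/S_{xx}}$ and $se(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2 \left[ 1/n + \bar{x}^2/S_{xx} \right]}$
- Hypothesis tests: t test
  - t distribution with n - p df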
79 Linear regression
- Analysis of variance approach: F test
  - SSE: error sum of squares (n - p df)
  - SSR: regression sum of squares (p - 1 df)
  - Syy: total corrected sum of squares (n - 1 df)
- If F < F(1 - α; 1, n - 2), conclude H0
80 Linear regression
- Coefficient of determination, $R^2 = SS_R/S_{yy} = 1 - SS_E/S_{yy}$: judges the adequacy of the regression model
- Standardized residuals: $d_i = e_i/\sqrt{\hat{\sigma}^2}$
81
- Prediction of the expected response
  - t distribution with n - 2 df
- Prediction of the response of a new observation
82 Linear regression
- Linear regression
  - Estimate the parameters of the model
  - Test whether β1 <> 0
    - t test
    - F test
  - Test the adequacy of the regression model
    - Residual plots (residuals vs regressors, predicted values): check constant variance of the residuals
    - Studentised residuals (random, between -2 and 2)
    - Checking for outliers
    - QQ plot (check normality of the residuals)
    - Coefficient of determination
    - Check for constant variance (Levene test)
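A sketch of these steps with `lm` on simulated data. The true intercept 3, slope 2 and noise level are arbitrary choices for the simulation, and all object names are illustrative.

```r
set.seed(42)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 1.5)  # simulated response

fit <- lm(y ~ x)                          # estimate intercept and slope
coef_table  <- summary(fit)$coefficients  # estimates, standard errors, t tests
anova_table <- anova(fit)                 # F test for the slope
r2          <- summary(fit)$r.squared     # coefficient of determination
std_res     <- rstandard(fit)             # standardized residuals

# Model checking (plots): plot(fitted(fit), resid(fit)); qqnorm(resid(fit))

# Interval for the expected response and for a new observation at x = 10
new_x   <- data.frame(x = 10)
ci_mean <- predict(fit, new_x, interval = "confidence")
pi_new  <- predict(fit, new_x, interval = "prediction")

print(coef_table)
print(r2)
```

In simple linear regression the F statistic of the ANOVA table equals the square of the t statistic of the slope, so both tests lead to the same conclusion.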
83 Multiple linear regression
- F-test: H0: β1 = β2 = 0
  - If F < F(1 - α; p - 1, n - p), conclude H0
- Adjusted R²
- H0: βj = 0 vs H1: βj <> 0
  - t distribution with n - p df
84 Polynomial model
- Polynomial with one regressor
  - Linear in the coefficients => multiple linear regression
  - Matlab: polytool(X, Y, N, Alfa)
- Polynomial with several independent variables
85 Different models
- Analysis of variance models
  - Regression models with only qualitative predictor variables
- Covariance models
  - Models with both quantitative and qualitative variables
  - Quantitative variables reduce the variance of the error terms
86
- In Matlab
- Total
- [r, p] = corrcoef(total) (p-values based on a Student t distribution)
- [b, bint, r, rint, stats] = regress(var1, var2)
- regstats
- polytool(var1, var2)
- rcoplot(r, rint)
87 ANOVA (one way)
- Linear statistical model
  - Overall mean
  - Treatment effect (with different levels)
88 ANOVA
- SST: total sum of squares
- SSE: error sum of squares
- SSA: treatment sum of squares
- k: number of groups
- n: number of samples per group
- N: total number of samples
- F test with k - 1 and N - k df
89 ANOVA
- Assumptions
  - Values of the dependent variable are normally distributed within each group (QQ plot per group, Shapiro test)
  - The population variance is the same in each group (box plot, Levene test per pair)
  - The observations are a random sample and they are independent (according to design?)
- Residual analysis and model checking
  - Residuals contain information on the unexplained variability
  - Plot a QQ plot of the residuals
  - Plot residuals against factor levels, compare the spread in residuals
  - Plot residuals against fitted values and see whether the plot is without pattern
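A sketch of a one-way ANOVA with these checks, on simulated data for three groups. The Levene test is not part of base R, so Bartlett's test (`bartlett.test`) is swapped in here as the equal-variance check; that substitution, the group means and the sample sizes are all assumptions of this example.

```r
set.seed(7)
group <- factor(rep(c("A", "B", "C"), each = 8))               # k = 3 groups, n = 8
y <- rnorm(24, mean = rep(c(10, 12, 15), each = 8), sd = 1.5)  # simulated response

fit <- aov(y ~ group)
tab <- summary(fit)[[1]]  # ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)

# Assumption checks
res          <- residuals(fit)
normal_check <- shapiro.test(res)          # normality of the residuals
var_check    <- bartlett.test(y ~ group)   # equal variances (stand-in for Levene)

# Residual plots: qqnorm(res); plot(fitted(fit), res); boxplot(res ~ group)

print(tab)
print(c(shapiro = normal_check$p.value, bartlett = var_check$p.value))
```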
90 ANOVA
- Post hoc
  - Is there a pair of factor levels with the same mean?
  - Test all pairs of factor levels for equal means
  - Are there homogeneous groups, i.e. factor levels with the same mean?
91 ANOVA
- Multiple comparison procedures
  - Planned comparison (LSD)
    - df n - k, equal sample sizes
  - A posteriori comparison
    - Tukey: pairwise, equal sample sizes, n: sample size in each group
92 ANOVA
- Contrast: comparison involving 2 or more factor level means
- Scheffé: all types of comparisons
  - r: number of groups
  - nj: sample size of the contrast of interest
93
- Detect homogeneous groups (equal sample mean)
- Duncan/Newman-Keuls multiple range test
  - Stepwise, used for pairwise comparisons
  - Multiplier: df of MSE, number of steps
  - Arrange the treatment means in ascending order
  - Determine the standard error
  - Obtain the value Rp
  - Determine the least significant range Ra
94 ANOVA
- Dunnett: compare treatments with a single control mean
  - Multiplier: df of MSE, number of groups
95 Nonparametric one-way ANOVA
- Single-factor analysis of variance model
- Error terms have the same continuous distribution for all factor levels
- Nonparametric rank F test
  - FR test with k - 1 and N - k df
- Kruskal-Wallis
  - The Kruskal-Wallis approximation of the rank F test can be used if ni > 6
  - If χ²(KW) < χ²(1 - α; k - 1), conclude H0
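The Kruskal-Wallis decision rule can be sketched in R with `kruskal.test`. The data below are simulated from exponential distributions with different rates, chosen only to give non-normal groups with 7 observations each.

```r
set.seed(3)
group <- factor(rep(c("A", "B", "C"), each = 7))     # k = 3 groups, 7 observations each
y <- c(rexp(7, 1), rexp(7, 0.5), rexp(7, 0.25))      # simulated skewed data

kw <- kruskal.test(y ~ group)

alpha <- 0.05
k     <- nlevels(group)
crit  <- qchisq(1 - alpha, df = k - 1)  # chi-square critical value with k - 1 df

# Conclude H0 (equal distributions) if the statistic is below the critical value
conclude_h0 <- unname(kw$statistic) < crit

print(c(statistic = unname(kw$statistic), critical = crit, p_value = kw$p.value))
```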
96 Logistic regression
- Generalized linear model
- Regression with a binary response variable
  - Constraint on the response curve: it lies in [0, 1]
  - Normal error model is not appropriate
  - The error variance is not constant
97
- The mean of the response is modeled as a monotonic nonlinear transformation of a linear function of the predictors
- Can be linearized (LOGIT transformation)
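A sketch of a logistic regression with `glm` on simulated binary data. The true coefficients 0.5 and 1.2 are arbitrary. The final lines illustrate the logit linearisation: applying `qlogis` (the logit) to the fitted probabilities recovers the linear predictor.

```r
set.seed(11)
x      <- runif(100, -3, 3)
p_true <- plogis(0.5 + 1.2 * x)              # true success probabilities (simulation)
y      <- rbinom(100, size = 1, prob = p_true)  # binary response

# Generalized linear model with binomial family; logit is the default link
fit <- glm(y ~ x, family = binomial)

eta   <- predict(fit, type = "link")      # linear predictor b0 + b1 * x
p_hat <- predict(fit, type = "response")  # fitted probabilities, inside (0, 1)

# Logit transformation: log(p / (1 - p)) is linear in x
logit_p <- qlogis(p_hat)

print(coef(fit))
print(range(p_hat))
```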
101
> anova(res.lm)
Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq  F value    Pr(>F)
X1         1 5865.5  5865.5 1121.647 < 2.2e-16 ***
X2         1  103.4   103.4   19.782 0.0002021 ***
Residuals 22  115.0     5.2
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- MSE: the Mean Sq of the Residuals row (5.2)