Statistics

Slides: 103
Provided by: DEPARTEM3

Transcript and Presenter's Notes

1
Statistics
  • Descriptive statistics

2
  • Books
  • Using R for introductory statistics (verzani J)
  • Applied statistical and probability for engineers
    (Montgomery, Runger)
  • Exam exercises

3
Descriptive statistics
  • Data
  • Categorical versus numeric
  • Univariate, bivariate, multivariate
  • Numerical summaries
  • Frequency tables
  • Central value
  • Variability
  • Skewness and kurtosis
  • Graphical summaries
  • Categorical: pie charts and bar charts
  • Numerical: histograms and boxplots

4
Tables
  • Represent the frequency of each value in your
    data
  • Numerical data
  • > table(scores[,1])
  •  8 10 11 12 13 14 15 16 18 19
  •  1  1  5  3  2  1  1  2  1  1
  • Categorical data
  • > res = c("Y","Y","N","N","N")
  • > table(res)
  • res
  • N Y
  • 3 2
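
The frequency tables on this slide can be reproduced with a short sketch. The `scores` vector here is hypothetical: it was chosen only so that its counts match the table printed on the slide, since the original data set is not part of the transcript.

```r
# Hypothetical scores, constructed to reproduce the counts shown on the slide
scores <- c(8, 10, rep(11, 5), rep(12, 3), rep(13, 2), 14, 15, rep(16, 2), 18, 19)
score_counts <- table(scores)      # frequency of each numeric value

# Categorical data, as on the slide
res <- c("Y", "Y", "N", "N", "N")
res_counts <- table(res)           # frequency of each category

print(score_counts)
print(res_counts)
```

`table()` sorts numeric values in ascending order and categories alphabetically, which is why N is listed before Y.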

5
Measures of central value
  • Location of the center of the distribution
  • The mean
  • mean
  • The Mode
  • Sort the values in ascending order, select the
    value occurring most frequently.
  • The median
  • median
  • Percentiles
  • quantile
  • Midrange
  • mean(range(x))
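
As a minimal sketch of these measures in R, assuming a small hypothetical vector `x`: base R has no function for the statistical mode (`mode()` returns the storage type), so the most frequent value is taken from a frequency table.

```r
x <- c(8, 10, 11, 11, 11, 12, 13, 16, 19)   # hypothetical data

x_mean   <- mean(x)
x_median <- median(x)

# Mode: tabulate the values and pick the most frequent one
counts <- table(x)
x_mode <- as.numeric(names(counts)[which.max(counts)])

# Percentiles: here the three quartiles
x_quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Midrange: the mean of the minimum and the maximum
x_midrange <- mean(range(x))
```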

6
Measures of dispersion
  • Indicate how the data are spread about the
    central value
  • The range
  • max - min
  • The variance
  • var
  • The standard deviation
  • sd (STDEV in Excel)
  • The coefficient of variation
  • The standard error
  • Interquartile range
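
A short sketch of these dispersion measures in R, on hypothetical data. Note that `var` and `sd` use the sample (n − 1) denominator, and that the coefficient of variation and the standard error of the mean have no dedicated base R function, so they are computed by hand here.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)    # hypothetical data

x_range <- diff(range(x))         # max - min
x_var   <- var(x)                 # sample variance (denominator n - 1)
x_sd    <- sd(x)                  # sample standard deviation
x_cv    <- x_sd / mean(x)         # coefficient of variation
x_se    <- x_sd / sqrt(length(x)) # standard error of the mean
x_iqr   <- IQR(x)                 # third quartile minus first quartile
```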

7
Skewness and kurtosis
  • Skewness measures the degree of asymmetry of a
    distribution
  • Kurtosis characterises the relative peakedness or
    flatness of a distribution compared to the normal
    distribution
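
Base R has no built-in skewness or kurtosis function, so the following is a sketch using the plain moment definitions. The helper names `skewness` and `excess_kurtosis` are hypothetical, and spreadsheet or package functions may apply small-sample corrections that give slightly different numbers.

```r
# Moment-based skewness: third central moment over (second central moment)^1.5
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Excess kurtosis: fourth central moment over squared variance, minus 3,
# so that a normal distribution scores 0
excess_kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)    # hypothetical, symmetric
right_skewed <- c(1, 1, 1, 2, 10)   # hypothetical, long right tail

skew_sym   <- skewness(symmetric)
skew_right <- skewness(right_skewed)
kurt_sym   <- excess_kurtosis(symmetric)
```

A positive skewness indicates a longer right tail; a negative excess kurtosis indicates a flatter shape than the normal distribution.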

8
Descriptive statistics
  • Calculate the maximum value
  • (using max function in excel/R)
  • Calculate the minimum value
  • (using min function in excel/R)
  • Calculate the range
  • Calculate the sum of all values
  • Using sum

9
Graphical summaries
  • Bar chart
  • Pie chart
  • Dot chart
  • Histogram
  • boxplot

12
Exercises
  • For R
  • oefeningenR_Chapter12_opdrachten.doc
  • For Excel
  • Session1_DescriptiveStatistics/Descriptive_Statistics.doc
  • Course: Session1_AlloySpecimens_unsolved.xls
  • Course: microarray_nonnormalised_unsolved.xls
  • Home: microarray_norm_solved.xls

14
Statistics
  • Probabilities

15
Probabilities
  • Random experiment
  • Sample space
  • Event
  • Definition
  • New events as combination of events
  • Mutually exclusive
  • Likelihood
  • Definition of probability: Bayesian or frequency
    interpretation
  • Events have probabilities
  • Conditional probability
  • Independence
  • Multiplicative law of probability
  • Law of total probability
  • Bayes rule

16
Probability rules
independence
Bayes rule
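
The formulas on this slide were not captured in the transcript. As a worked sketch of the multiplicative law, the law of total probability and Bayes rule, consider a diagnostic test. All numbers are hypothetical and chosen only for illustration.

```r
# Hypothetical inputs, not taken from the slides
p_d             <- 0.01   # P(D): prior probability of the condition
p_pos_given_d   <- 0.95   # P(+ | D)
p_pos_given_not <- 0.05   # P(+ | not D)

# Multiplicative law: P(D and +) = P(+ | D) P(D)
p_joint <- p_pos_given_d * p_d

# Law of total probability: P(+) = P(+ | D) P(D) + P(+ | not D) P(not D)
p_pos <- p_pos_given_d * p_d + p_pos_given_not * (1 - p_d)

# Bayes rule: P(D | +) = P(+ | D) P(D) / P(+)
p_d_given_pos <- p_joint / p_pos

# Independence would require P(D and +) = P(D) P(+), which fails here
independent <- isTRUE(all.equal(p_joint, p_d * p_pos))
```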
17
Statistics
  • Probability distributions

18
Probability distributions
  • 1. Mean and variance of a discrete distribution
  • 2. Binomial distribution
  • 3. Hypergeometric distribution
  • 4. Poisson distribution

19
Mean and variance of a discrete distribution
  • Mean: weighting the possible values with their
    probability, μ = E(X) = Σ x f(x)
  • Variance: weighting the squared deviation from
    the mean with the probability,
    σ² = V(X) = Σ (x − μ)² f(x)
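
A small numerical sketch of these two weighted sums, using a hypothetical probability mass function on the values 0 to 3:

```r
x <- 0:3                      # possible values (hypothetical)
f <- c(0.1, 0.3, 0.4, 0.2)    # their probabilities, summing to 1

mu     <- sum(x * f)              # mean: values weighted by probability
sigma2 <- sum((x - mu)^2 * f)     # variance: squared deviations weighted by probability
```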

20
Binomial distribution
  • Binomial experiment
  • Trials are independent
  • Each trial can be summarized as resulting in
    either a success or a failure (Bernoulli trial)
  • The probability of a success on each trial,
    denoted as p, remains constant
  • The random variable X that equals the number of
    trials that result in a success has a binomial
    distribution with parameters p and n = 1, 2, ...
  • The probability mass function of X is
    f(x) = C(n, x) p^x (1 − p)^(n − x), x = 0, 1, ..., n

21
In Excel
  • Probability distribution
  • BINOMDIST(x,n,p,FALSE)
  • Cumulative
  • BINOMDIST(x,n,p,TRUE)
  • P(X ≤ x)
  • n: number of trials
  • x: number of successes

22
In Matlab
  • Probability distribution
  • BINOPDF(X,N,P) returns the binomial probability
    density function with parameters N and P at the
    values in X
  • BINOCDF(X,N,P) returns the binomial cumulative
    distribution function with parameters N and P at
    the values in X, P(X ≤ x)
  • BINOINV([0.05 0.95],162,0.5)
  • N: number of trials
  • X: number of successes

23
In R
  • dbinom(x, size, prob)
  • pbinom(q, size, prob)
  • qbinom(p, size, prob)
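
A brief usage sketch of the three R functions, for a hypothetical experiment with n = 10 trials and success probability p = 0.5. Naming the arguments avoids mixing up the order of `size` and `prob`.

```r
n <- 10    # number of trials (hypothetical)
p <- 0.5   # probability of success (hypothetical)

# P(X = 5): probability mass function
p_equal_5 <- dbinom(5, size = n, prob = p)

# Same value from the formula C(n, x) p^x (1 - p)^(n - x)
p_manual <- choose(n, 5) * p^5 * (1 - p)^(n - 5)

# P(X <= 5): cumulative distribution function
p_at_most_5 <- pbinom(5, size = n, prob = p)

# Smallest x with P(X <= x) >= 0.5: quantile function
x_median <- qbinom(0.5, size = n, prob = p)
```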

24
Hypergeometric distribution
  • A set of N items contains
  • K items classified as successes and
  • N-K items classified as failures
  • A sample of size n items is selected, at random
    without replacement from the N items where K ≤ N
    and n ≤ N
  • Let the random variable X denote the number of
    successes in the sample. Then X has a
    hypergeometric distribution

25
In Excel and Matlab
  • In Excel
  • HYPGEOMDIST(x,n,K,N)
  • In matlab
  • hygepdf(x,N,K,n)
  • hygecdf(x,N,K,n)

26
In R
  • dhyper(x, m, n, k)
  • phyper(q, m, n, k, lower.tail = TRUE)
  • qhyper(p, m, n, k, lower.tail = TRUE)
  • rhyper(nn, m, n, k)

m: number of white balls (successes, K)
n: number of black balls (failures, N − K)
k: number of trials (balls drawn, the sample size)
x: number of white balls drawn (successes)
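
R parameterizes the hypergeometric distribution by the number of white and black balls rather than by N and K, so a translation step is needed. The sketch below assumes a hypothetical lot of N = 20 items with K = 5 successes and a sample of 4.

```r
N_total   <- 20   # items in the set (hypothetical)
K_success <- 5    # items classified as successes (hypothetical)
n_sample  <- 4    # sample size, drawn without replacement (hypothetical)

# R's arguments: m = K, n = N - K, k = sample size
p_one <- dhyper(1, m = K_success, n = N_total - K_success, k = n_sample)

# Same probability from counting: C(K, x) C(N - K, n - x) / C(N, n)
p_manual <- choose(K_success, 1) * choose(N_total - K_success, n_sample - 1) /
  choose(N_total, n_sample)

# P(X <= 1) and a check that the probabilities over 0..4 sum to 1
p_at_most_one <- phyper(1, m = K_success, n = N_total - K_success, k = n_sample)
total <- sum(dhyper(0:n_sample, K_success, N_total - K_success, n_sample))
```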
27
Poisson distribution
  • Given an interval of real numbers, assume counts
    occur at random throughout the interval. If the
    interval can be partitioned into subintervals of
    small enough length such that
  • The probability of more than one count in a
    subinterval is 0
  • The probability of one count in a subinterval is
    the same for all subintervals and proportional to
    the length of the subinterval
  • The count in each subinterval is independent of
    other subintervals
  • Then the random experiment is a Poisson process
  • If the mean number of counts in the interval is
    λ > 0, the random variable X that equals the number
    of counts in the interval has a Poisson
    distribution with parameter λ and the probability
    mass function is
    f(x) = e^(−λ) λ^x / x!, x = 0, 1, 2, ...

28
In Excel and Matlab
  • Non cumulative
  • POISSON(x,λ,FALSE)
  • POISSPDF(X,λ)
  • Cumulative
  • POISSON(x,λ,TRUE)
  • POISSCDF(X,λ)
  • P(X ≤ x)

29
In R
  • dpois(x, lambda)
  • ppois(q, lambda, lower.tail = TRUE)
  • qpois(p, lambda, lower.tail = TRUE)
  • rpois(n, lambda)
  • lower.tail: if TRUE (default), probabilities are
    P[X ≤ x], otherwise P[X > x].
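
A short sketch of these calls for a hypothetical mean of λ = 3 counts per interval, including the effect of `lower.tail`:

```r
lambda <- 3   # mean number of counts per interval (hypothetical)

# P(X = 2), and the same value from e^(-lambda) lambda^x / x!
p_two    <- dpois(2, lambda)
p_manual <- exp(-lambda) * lambda^2 / factorial(2)

# lower.tail = TRUE gives P(X <= 2); lower.tail = FALSE gives P(X > 2)
p_lower <- ppois(2, lambda)
p_upper <- ppois(2, lambda, lower.tail = FALSE)

# Smallest x with P(X <= x) >= 0.5
x_median <- qpois(0.5, lambda)
```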

30
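Discrete distributions
Mean of a discrete distribution: μ = E(X) = Σ x f(x)
Variance of a discrete distribution: σ² = V(X) = Σ (x − μ)² f(x)
Mean of a uniform distribution on the integers a, ..., b: μ = (a + b)/2
Mean of a binomial distribution: μ = np
Variance of a binomial distribution: σ² = np(1 − p)
Mean of a Poisson distribution: μ = λ
Variance of a Poisson distribution: σ² = λ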
31
Probability distributions
  • 5. Mean and variances of continuous distributions
  • 6. Normal distribution
  • 7. Exponential distribution
  • 8. Central limit theorem
  • 9. The Chi-Square distribution
  • 10. t distribution
  • 11. F distribution

32
Continuous random variable
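  • A function f(x) is a probability density
    function of a continuous random variable X if for
    any interval of real numbers [x1, x2]
    P(x1 ≤ X ≤ x2) = ∫ f(x) dx, integrated from x1 to x2

The cumulative distribution is
F(x) = P(X ≤ x) = ∫ f(u) du, integrated from −∞ to x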
33
Continuous random variable
  • The mean: μ = E(X) = ∫ x f(x) dx
  • The variance: σ² = V(X) = ∫ (x − μ)² f(x) dx

34
Continuous uniform distribution
  • A continuous random variable X with probability
    density function
    f(x) = 1/(b − a), a ≤ x ≤ b
  • Has a uniform distribution

Mean: E(X) = (a + b)/2
Variance: V(X) = (b − a)²/12
35
In R
  • dunif(x, min, max)
  • punif(q, min, max)
  • qunif(p, min, max)
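
As a sketch, the uniform functions can be checked against the closed-form mean (a + b)/2 and variance (b − a)²/12 by numerical integration. The interval [2, 6] is a hypothetical choice.

```r
a <- 2; b <- 6   # hypothetical interval limits

density_at_3 <- dunif(3, min = a, max = b)    # 1 / (b - a)
cdf_at_3     <- punif(3, min = a, max = b)    # P(X <= 3)
median_x     <- qunif(0.5, min = a, max = b)  # value with half the mass below it

# Mean and variance by numerical integration of the density
mu     <- integrate(function(x) x * dunif(x, a, b), a, b)$value
sigma2 <- integrate(function(x) (x - mu)^2 * dunif(x, a, b), a, b)$value
```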

36
Normal distribution
  • A random variable X with probability density
    function
    f(x) = 1/(σ√(2π)) e^(−(x − μ)²/(2σ²)), −∞ < x < ∞
  • Has a normal distribution with parameters μ and σ

Mean E(X) = μ
Variance V(X) = σ²
37
Standard Normal distribution
  • If X is a normal random variable with E(X) = μ and
    Var(X) = σ² then the random variable
  • Z = (X − μ)/σ
  • Is a normal random variable with E(Z) = 0 and
    Var(Z) = 1.
  • That is, Z is a standard normal random variable

38
Central limit theorem
  • If X1, X2, ..., Xn is a random sample of size n taken
    from a population with mean μ and finite variance
    σ², and
  • X̄ is the sample mean, then the limiting form of the
    distribution of Z = (X̄ − μ)/(σ/√n) is the standard
    normal distribution as n → ∞
  • With the standard error σ/√n
39
Standard Normal distribution
NORMSDIST(2.52)
P(X < 2.52) = 0.994132
P(X > 2.52) = 1 − 0.994132 = 0.005868
40
In Excel, Normal distribution
  • Normal distribution
  • Non cumulative
  • NORMDIST(x,μ,σ,FALSE)
  • NORMPDF(X,μ,σ) (Matlab)
  • Cumulative
  • NORMDIST(x,μ,σ,TRUE)
  • NORMCDF(X,μ,σ) (Matlab)
  • P(X ≤ x)
  • Standard normal distribution
  • Cumulative
  • NORMSDIST(z)
  • P(Z ≤ z)

41
Matlab
  • Normal distribution
  • Non cumulative
  • NORMPDF(X,μ,σ)
  • Cumulative
  • NORMCDF(X,μ,σ)
  • P(X ≤ x)
  • Inverse
  • X = NORMINV(P,MU,SIGMA)
  • X = NORMINV(P) for the standard normal
  • P = 1 − α/2
42
In R
  • dnorm(x, mean, sd)
  • pnorm(q, mean, sd)
  • qnorm(p, mean, sd)
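
A usage sketch of the normal functions in R. The value 2.52 is the one used on the Excel slide; the mean of 100 and standard deviation of 15 in the standardization step are hypothetical.

```r
# Cumulative probability and upper tail at 2.52 (standard normal)
p_below <- pnorm(2.52)                       # P(Z < 2.52), like NORMSDIST(2.52)
p_above <- pnorm(2.52, lower.tail = FALSE)   # P(Z > 2.52)

# Standardization: P(X < x) for X ~ N(mu, sigma) equals P(Z < (x - mu) / sigma)
mu <- 100; sigma <- 15; x <- 130             # hypothetical values
p_direct       <- pnorm(x, mean = mu, sd = sigma)
p_standardized <- pnorm((x - mu) / sigma)

# Inverse: the quantile for a two sided test at alpha = 0.05
z_crit <- qnorm(1 - 0.05 / 2)
```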

43
Test for normality
  • Qqplot
  • Kstest (Kolmogorov-Smirnov)

44
Exponential distribution
  • The random variable X that equals the distance
    between successive counts of a Poisson process
    with mean λ > 0 has an exponential distribution
    with parameter λ.

45
In Excel
  • Non cumulative
  • EXPONDIST(x,λ,FALSE)
  • Cumulative
  • EXPONDIST(x,λ,TRUE)
  • P(X ≤ x)

46
In Matlab
  • Non cumulative
  • EXPPDF(X,MU), where MU = 1/λ is the mean
  • Cumulative
  • EXPCDF(X,MU)
  • P(X ≤ x)
  • Inverse
  • X = EXPINV(P,MU)
  • P = 1 − α/2
47
In R
  • dexp(x, rate)
  • pexp(q, rate)
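
The link between the exponential distribution and the Poisson process can be checked numerically: the distance to the next count exceeds t exactly when there are zero counts in a stretch of length t. The rate and length below are hypothetical.

```r
lambda <- 2     # mean number of counts per unit length (hypothetical)
t_len  <- 1.5   # length of the stretch considered (hypothetical)

# P(X <= t) for the exponential distance, and the closed form 1 - e^(-lambda t)
p_within <- pexp(t_len, rate = lambda)
p_closed <- 1 - exp(-lambda * t_len)

# P(no counts in a stretch of length t) from the Poisson distribution
p_no_count <- dpois(0, lambda * t_len)

# Density at 0 equals the rate
dens_at_zero <- dexp(0, rate = lambda)
```

R uses the rate λ directly, so no conversion to a mean is needed.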

48
Chi-Square distribution
Let Z1, Z2, ..., Zk be normally and independently
distributed random variables, with mean μ = 0 and
variance σ² = 1.
Then the random variable X = Z1² + Z2² + Z3² + ... + Zk²
is said to follow a chi-square distribution χ²(k) with k
degrees of freedom.
  • As such it can be proved that (n − 1)S²/σ² is
    distributed as χ²(n − 1)
  • A chi-square distribution is also used for the
    goodness of fit test

Df is k − p − 1, with p the number of parameters of
the hypothesized distribution and k the number of
sample intervals
49
In Excel
  • χ²(α,ν) = CHIINV(α,ν)
  • α = CHIDIST(x,df)
  • Both functions use the upper tail: P(X > x) = α
50
Matlab
  • Probability density function
  • Chi2pdf(X,V)
  • Cumulative distribution
  • Chi2cdf(X,V)
  • Inverse of the cumulative distribution
  • Chi2inv(P,V)
  • Both use the lower tail: P(X ≤ x) = 1 − α
51
In R
  • dchisq(x, df)
  • pchisq(q, df)
  • qchisq(p, df)
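
A sketch of the chi-square functions in R. Like Matlab, R works with the lower tail by default, so Excel-style upper-tail values need `lower.tail = FALSE` or 1 − α. The choice of α = 0.05 and 4 degrees of freedom is hypothetical.

```r
alpha <- 0.05
df    <- 4      # degrees of freedom (hypothetical)

# Critical value with upper-tail probability alpha
crit <- qchisq(1 - alpha, df)

# Upper-tail probability at the critical value recovers alpha
upper_p <- pchisq(crit, df, lower.tail = FALSE)

# With 1 degree of freedom, a chi-square variable is a squared standard normal,
# so its 95% quantile equals the squared 97.5% normal quantile
crit_df1    <- qchisq(0.95, df = 1)
z_squared   <- qnorm(0.975)^2
```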

52
T-distribution
  • In Excel
  • 1 tail
  • TDIST(x,df,1)
  • 2 tail
  • TDIST(x,df,2)

TDIST(2.52,10000000,1): P(X > 2.52) = 0.005868
TDIST(2.52,10000000,2): P(X > 2.52) and P(X < −2.52),
together 0.011735498
53
In Matlab
  • Probability density function
  • TPDF(X,V)
  • Cumulative distribution
  • P = TCDF(X,V)
  • Inverse of the cumulative distribution
  • X = TINV(P,V), always one sided
  • P = 1 − α
54
In R
  • dt(x, df)
  • pt(q, df)
  • qt(p, df)
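
The Excel results for the t distribution can be reproduced in R as a sketch. With 10,000,000 degrees of freedom the t distribution is practically standard normal, which is why the tail probability matches the normal one. The 10 degrees of freedom in the last step are a hypothetical choice.

```r
# One tailed: P(X > 2.52), like TDIST(2.52, 10000000, 1)
p_one_tail <- pt(2.52, df = 1e7, lower.tail = FALSE)

# Two tailed: P(X > 2.52) + P(X < -2.52), like TDIST(2.52, 10000000, 2)
p_two_tail <- 2 * p_one_tail

# Inverse: qt is one sided, so a two sided critical value at alpha uses 1 - alpha / 2
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df = 10)
```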

55
T-distribution
TDIST(2.52,10000000,1)
  • In Excel
  • TINV(α,df)
  • Always 2 sided: returns t such that
    P(X > t) + P(X < −t) = α
  • 1 tail: use TINV(2α,df)
56
F-distribution
Let W and Y be independent chi-square random
variables with u and v degrees of freedom
respectively. Then the ratio F = (W/u)/(Y/v)
follows an F distribution with u degrees of
freedom in the numerator and v degrees of freedom
in the denominator. An example of a random
variable that follows the F distribution is the
ratio F = (S1²/σ1²)/(S2²/σ2²), with
n1 − 1 numerator degrees of freedom and n2 − 1
denominator degrees of freedom, where σ1² and σ2²
are the variances of two normal populations from
which respectively 2 independent random samples
of size n1 and n2 were taken, and S1² and S2² are
the corresponding sample variances.
57
F-distribution
  • f = FINV(α,df1,df2)
  • α = FDIST(x,df1,df2)
  • P(X > f) = α
58
In Matlab
  • Probability density function
  • Fpdf(X,V1,V2)
  • Cumulative distribution
  • P = Fcdf(X,V1,V2)
  • Inverse of the F-cumulative distribution
  • X = Finv(P,V1,V2)
  • P = 1 − α
59
In R
  • df(x, df1, df2)
  • pf(q, df1, df2)
  • qf(p, df1, df2)
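
A sketch of the F functions in R, using hypothetical degrees of freedom. The last check uses the known relation that an F variable with 1 numerator degree of freedom is a squared t variable.

```r
alpha <- 0.05
df1 <- 4; df2 <- 12    # hypothetical numerator and denominator df

# Critical value with upper-tail probability alpha (Excel's FINV(alpha, df1, df2))
f_crit <- qf(1 - alpha, df1, df2)

# Upper-tail probability at that value (Excel's FDIST) recovers alpha
upper_p <- pf(f_crit, df1, df2, lower.tail = FALSE)

# Relation to the t distribution: F(1, v) = t(v)^2
f_one_df  <- qf(0.95, 1, 10)
t_squared <- qt(0.975, 10)^2
```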

60
Continuous distributions
Mean of a continuous distribution: μ = E(X) = ∫ x f(x) dx
Variance of a continuous distribution: σ² = V(X) = ∫ (x − μ)² f(x) dx
Standard normal random variable: Z = (X − μ)/σ
Mean of an exponential distribution: μ = 1/λ
Variance of an exponential distribution: σ² = 1/λ²
61
Statistics
  • Confidence intervals

62
Confidence intervals
  • Formularium
  • Cases (during course)
  • Additional exercises

63
1.
2.
Difference in two means, μ1 and μ2, variances σ1²
and σ2² known
3.
Mean μ of a normal distribution, variance σ²
unknown
Difference in means, μ1 and μ2, of normal
distributions, variances σ1² = σ2² unknown
4.
Difference in means, μ1 and μ2, of normal
distributions, variances σ1² ≠ σ2² unknown
5.
6.
Difference in means, μ1 and μ2, of normal
distributions for paired samples
Variance σ² of a normal distribution
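
The formulas of this formularium were not captured in the transcript. As a sketch of one case, the mean of a normal distribution with unknown variance, the interval x̄ ± t(α/2, n − 1) · s/√n can be computed by hand and compared with `t.test`. The data are hypothetical.

```r
x     <- c(9.8, 10.2, 10.4, 9.9, 10.1, 10.3, 9.7, 10.0)   # hypothetical sample
n     <- length(x)
alpha <- 0.05

# Half width: t quantile with n - 1 df times the estimated standard error
half_width <- qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)
ci_manual  <- c(mean(x) - half_width, mean(x) + half_width)

# The same interval from the built-in one sample t procedure
ci_builtin <- as.numeric(t.test(x, conf.level = 1 - alpha)$conf.int)
```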
64
Statistics
  • Hypothesis testing

65
Hypothesis testing
  • Formularium
  • Calculations in excel
  • Cases exercises (in course)
  • Integrated exercise (in course)
  • Additional exercises (at home)

66
1.
2.
3.
4.
Remark! Formula of the pooled variance different
from the one in the course notes
5.
67
6.
7.
8.
9.
Goodness of fit test
P-value: smallest level of significance that
would lead to rejection of the null hypothesis H0
68
Power, sample size
  • Power = 1 − β

69
Hypothesis testing
  • Compare means of 2 independent samples
  • Check normality assumption in both groups
  • QQ plot
  • Goodness of fit
  • Use F test to check the assumption on the
    equality of the variances
  • Perform a t-test to detect differences in means
  • See integrated exercise
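
The workflow on this slide can be sketched in R as follows. The data are simulated and hypothetical, and the Shapiro-Wilk test is used here as a numerical stand-in for the QQ plot and goodness of fit check named on the slide; `qqnorm` would give the plot itself.

```r
set.seed(42)                                 # arbitrary seed
group_a <- rnorm(30, mean = 10, sd = 1)      # hypothetical sample 1
group_b <- rnorm(30, mean = 13, sd = 1)      # hypothetical sample 2

# Step 1: check the normality assumption in both groups
normality_a <- shapiro.test(group_a)
normality_b <- shapiro.test(group_b)

# Step 2: F test on the equality of the two variances
variance_check <- var.test(group_a, group_b)
equal_var      <- variance_check$p.value > 0.05

# Step 3: t test on the means, pooled if the variances look equal
means_test <- t.test(group_a, group_b, var.equal = equal_var)
means_test$p.value
```

When `var.equal` is FALSE, `t.test` falls back to the Welch version with unequal variances.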

70
In Excel and R
NORMSDIST(3.25) corresponds to pnorm(3.25)
TDIST(3.25,10000000,1) corresponds to 1-pnorm(3.25)
or pnorm(3.25, lower.tail=FALSE)
P(X < 3.25) = 0.9994
P(X > 3.25) = 1 − 0.9994 = 0.0006
71
In Excel: TINV
α = 0.05 → 1 − α = 0.95
72
In Excel: NORMSINV
α = 0.05, 2 sided test → α/2 = 0.025
P(X < t1) = 0.025, with t1 = NORMSINV(0.025) ≈ −1.96
P(X < t2) = 0.975, with t2 = NORMSINV(0.975) ≈ 1.96
73
In Excel
  • CHIINV(α,df)
  • FINV(α,df_numerator,df_denominator)

74
Simple Linear regression
  • Y: response variable, X: regressor or predictor
  • Linear: linear in the parameters
  • Simple: single regressor or predictor

75
Linear regression
β̂1 is an unbiased estimator of the true slope β1
β̂0 is an unbiased estimator of the intercept β0
76
Linear regression
77
Linear regression
σ̂² = SSE/(n − p) is an unbiased estimator of the variance σ²
p: the number of parameters
78
Linear regression
In simple linear regression the estimated
standard errors of the slope and the intercept are
se(β̂1) = √(σ̂²/Sxx) and se(β̂0) = √(σ̂²(1/n + x̄²/Sxx)),
with Sxx = Σ (xi − x̄)²
Hypothesis tests: t test
t distribution with n − p df
79
Linear regression
Analysis of Variance approach: F test
SSE: error sum of squares (n − p df)
SSR: regression sum of squares (p − 1 df)
Syy: total corrected sum of squares (n − 1 df)
If F < F(1 − α, 1, n − 2) conclude H0
80
Linear regression
Coefficient of determination: judge the adequacy
of the regression model, R² = SSR/Syy = 1 − SSE/Syy
Standardized residuals
81
  • Prediction of the expected response
  • t distribution, n − 2 df
  • Prediction of the response of a new observation

82
Linear regression
  • Linear regression
  • Estimate parameters of the model
  • Test whether β1 ≠ 0
  • t test
  • F test
  • Test the adequacy of the regression model
  • Residual plots (residuals vs regressors,
    predicted values)
  • (check constant variance of residuals)
  • Studentised residuals (random, between −2 and 2)
  • Checking for outliers
  • QQ plot (check normality of residuals)
  • Coefficient of determination
  • Check for the constant variance (Levene)
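
The checklist on this slide can be sketched in R with `lm`. The data are simulated under a hypothetical true model (intercept 2, slope 3), and the residual plots are only indicated in comments, since the numerical summaries are what the sketch checks.

```r
set.seed(7)                                    # arbitrary seed
x <- 1:20
y <- 2 + 3 * x + rnorm(length(x), sd = 2)      # hypothetical data

# Estimate the parameters of the model
fit   <- lm(y ~ x)
coefs <- summary(fit)$coefficients

# t test and F test on the slope; in simple regression F equals t squared
t_slope <- coefs["x", "t value"]
f_value <- anova(fit)["x", "F value"]

# Coefficient of determination
r_squared <- summary(fit)$r.squared

# Studentised residuals; values outside (-2, 2) deserve a closer look
stud_res   <- rstudent(fit)
suspicious <- which(abs(stud_res) > 2)

# plot(fitted(fit), stud_res) and qqnorm(stud_res) give the residual and QQ plots
```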

83
Multiple linear regression
  • F-test, H0: β1 = β2 = 0
  • Adjusted R²
  • H0: βj = 0 vs H1: βj ≠ 0

If F < F(1 − α, p − 1, n − p) conclude H0
t distribution with n − p df
84
Polynomial model
  • Polynomial with one regressor
  • Linear in the coefficients → multiple linear
    regression
  • Matlab: polytool(X, Y, N, Alpha)
  • Polynomial with several independent variables

85
Different models
  • Analysis of variance models
  • Regression models with only qualitative predictor
    variables
  • Covariance models
  • Models with both quantitative and qualitative
    variables
  • Quantitative variables reduce the variance of the
    error terms

86
  • In matlab
  • Total
  • [r,p] = corrcoef(total) (p-values based on a
    Student t distribution)
  • [b, bint, r, rint, stats] = regress(var1, var2)
  • regstats
  • polytool(var1, var2)
  • rcoplot(r, rint)

87
ANOVA (one way)
  • Linear statistical model
  • Overall mean
  • Treatment effect (with different levels)

88
ANOVA
SST: total sum of squares
SSE: error sum of squares
SSA: treatment sum of squares
k groups, n samples per group, N total number of samples
F test, df: k − 1, N − k
89
ANOVA
  • Assumptions
  • Values of the dependent variable are normally
    distributed within each group (QQ plot per group,
    Shapiro test)
  • The population variance is the same in each group
    (box plot, Levene test per pairs)
  • The observations are a random sample and they are
    independent (according to design?)
  • Residual analysis and model checking
  • Contain information on the unexplained
    variability
  • Plot QQ plot of the residuals
  • Plot residuals against factor levels, compare
    spread in residuals
  • Plot residuals against fitted values and see
    whether the plot is without pattern
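
A sketch of a one way ANOVA with residual checking in R. The three groups are simulated and hypothetical; the Levene test is not part of base R, so only the normality check on the residuals is shown numerically.

```r
set.seed(3)                                   # arbitrary seed
k <- 3                                        # number of groups
n <- 10                                       # samples per group
group <- factor(rep(c("A", "B", "C"), each = n))
y <- rnorm(k * n, mean = rep(c(10, 12, 15), each = n), sd = 1)   # hypothetical data

fit <- aov(y ~ group)
tab <- summary(fit)[[1]]     # rows: treatment (group), then residuals

ssa <- tab[1, "Sum Sq"]              # treatment sum of squares
sse <- tab[2, "Sum Sq"]              # error sum of squares
sst <- sum((y - mean(y))^2)          # total sum of squares

p_value <- tab[1, "Pr(>F)"]          # F test with k - 1 and N - k df

# Residual analysis: normality of the residuals
res_normality <- shapiro.test(residuals(fit))

# plot(fitted(fit), residuals(fit)) and boxplot(residuals(fit) ~ group)
# give the residual plots against fitted values and factor levels
```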

90
ANOVA
  • POST HOC
  • Is there a pair of factor levels with the same
    mean
  • Test all pairs of factor levels on the same mean
  • Are there homogeneous groups, factor levels with
    the same mean

91
ANOVA
  • Multiple comparison procedures

Planned comparison (LSD)
df: n − k, equal sample sizes
A posteriori comparison
Tukey: pairwise, equal sample sizes, n: sample
size in each group
92
ANOVA
Contrast: comparison involving 2 or more factor
level means
Scheffé: all types of comparisons
r: number of groups, nj: sample size of the contrast
of interest
93
Detect homogeneous groups (equal sample mean)
Duncan/Newman-Keuls multiple range test
Stepwise, used for pairwise comparisons
Multiplier: df of MSE, number of steps
Arrange the treatment means in ascending order
Determine the standard error
Obtain the value Rp
Determine the least significant range Ra
94
ANOVA
Dunnett: compare treatments with a single control mean
Multiplier: df of MSE, number of groups
95
Non parametric one way ANOVA
  • Single factor analysis of variance model
  • Error terms have the same continuous distribution
    for all factor levels

Non parametric rank F test
FR test, df: k − 1, N − k
Kruskal-Wallis
The Kruskal-Wallis approximation of the F rank test
can be used if ni > 6
If χ²KW < χ²(1 − α, k − 1) conclude H0
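
The decision rule can be sketched with `kruskal.test`. The data are hypothetical: three groups of 7 observations whose ranks do not overlap, so the test statistic is large and H0 is rejected.

```r
y     <- 1:21                                     # hypothetical observations
group <- factor(rep(c("A", "B", "C"), each = 7))  # k = 3 groups, 7 per group

kw <- kruskal.test(y ~ group)

# Compare the statistic with the chi-square critical value, k - 1 df
alpha     <- 0.05
critical  <- qchisq(1 - alpha, df = nlevels(group) - 1)
statistic <- unname(kw$statistic)
reject_h0 <- statistic > critical
```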
96
Logistic regression
  • Generalized linear model
  • Regression with a binary response variable
  • Constraint on the response curve: between 0 and 1
  • Normal error model is not appropriate
  • The error variance is not constant

97
  • Mean of the response is modeled as a monotonic
    non linear transformation of a linear function of
    the predictors
  • Can be linearized (LOGIT transformation)
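
A minimal sketch of a logistic regression in R with `glm`, on a small hypothetical data set. It shows the two points made on the slides: the fitted mean response stays between 0 and 1, and the logit transformation of that mean is a linear function of the predictor.

```r
x <- 1:10                                   # hypothetical predictor
y <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)        # hypothetical binary response

fit <- glm(y ~ x, family = binomial)        # logit link is the default

# Fitted mean response: probabilities constrained to (0, 1)
p_hat <- fitted(fit)

# Linear predictor b0 + b1 * x
eta <- coef(fit)[["(Intercept)"]] + coef(fit)[["x"]] * x

# LOGIT transformation log(p / (1 - p)) linearizes the fitted curve
logit_p <- qlogis(p_hat)
```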

101
  • > anova(res.lm)
  • Analysis of Variance Table
  • Response: Y
  •            Df Sum Sq Mean Sq  F value    Pr(>F)
  • X1          1 5865.5  5865.5 1121.647 < 2.2e-16
  • X2          1  103.4   103.4   19.782 0.0002021
  • Residuals  22  115.0     5.2
  • ---
  • Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
    '.' 0.1 ' ' 1

MSE: the Mean Sq of the Residuals row (5.2)