Title: AOV Assumption Checking and Transformations
1. AOV Assumption Checking and Transformations
- How do we check the normality of residuals assumption in AOV?
- How do we check the homogeneity of variances assumption in AOV?
- What do we do if these assumptions are not met?
2. Model Assumptions
- Homoscedasticity (common group variances)
- Normality of residuals
- Effect additivity
- Independence of residuals
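3. Checking the Equal Variance Assumption
$H_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_t^2$
$H_A$: some of the variances are different from each other.
Little work but little power.
Hartley's Test: a logical extension of the F test for t = 2.
Requires equal replication, n, among groups.
T.S. $F_{max} = s_{max}^2 / s_{min}^2$
R.R. Reject $H_0$ if $F_{max}$ exceeds $F_{\alpha,t,n-1}$ in the Fmax table (Table 12).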
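As a rough illustration of the Fmax computation, the Python sketch below takes the ratio of the largest to the smallest sample variance. The helper name hartley_fmax and the data are hypothetical (invented for this example), and the critical value still has to be read from the Fmax table.

```python
from statistics import variance

def hartley_fmax(groups):
    """Return Hartley's Fmax: largest sample variance / smallest sample variance."""
    sizes = {len(g) for g in groups}
    if len(sizes) != 1:
        raise ValueError("Hartley's test requires equal replication in every group")
    variances = [variance(g) for g in groups]  # sample variances (divisor n - 1)
    return max(variances) / min(variances)

# Hypothetical data: t = 3 groups, n = 4 observations each.
groups = [
    [10.0, 12.0, 11.0, 13.0],
    [20.0, 24.0, 22.0, 26.0],
    [15.0, 15.5, 14.5, 15.0],
]
fmax = hartley_fmax(groups)
print(f"Fmax = {fmax:.3f}")  # compare with the tabled F(alpha, t, n-1)
```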
4. Bartlett's Test
More work but better power.
Bartlett's Test: allows unequal replication.
T.S. C, with correction factor CF.
R.R. If $C > \chi^2_{(k-1),\alpha}$, then apply the correction term: reject if $C/CF > \chi^2_{(k-1),\alpha}$.
k: average replicates per group.
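A minimal sketch of the computation, assuming the commonly quoted form of Bartlett's statistic: $C = (N - t)\ln s_p^2 - \sum_i (n_i - 1)\ln s_i^2$ and $CF = 1 + \frac{1}{3(t-1)}\left[\sum_i \frac{1}{n_i - 1} - \frac{1}{N - t}\right]$, where t is the number of groups, N the total sample size and $s_p^2$ the pooled variance. The function name and the data are hypothetical; the chi-square reference value (degrees of freedom: number of groups minus one) must still come from a table.

```python
import math
from statistics import variance

def bartlett_statistic(groups):
    """Return (C, CF) for Bartlett's test; unequal group sizes are allowed."""
    t = len(groups)                      # number of groups
    n = [len(g) for g in groups]         # replicates per group
    N = sum(n)                           # total number of observations
    s2 = [variance(g) for g in groups]   # sample variance of each group
    pooled = sum((ni - 1) * vi for ni, vi in zip(n, s2)) / (N - t)
    C = (N - t) * math.log(pooled) - sum(
        (ni - 1) * math.log(vi) for ni, vi in zip(n, s2)
    )
    CF = 1 + (sum(1 / (ni - 1) for ni in n) - 1 / (N - t)) / (3 * (t - 1))
    return C, CF

# Hypothetical data with unequal replication.
groups = [[4.1, 5.0, 5.9, 4.6], [7.2, 9.8, 6.1], [3.3, 3.9, 3.6, 3.1, 3.8]]
C, CF = bartlett_statistic(groups)
print(f"C = {C:.4f}, CF = {CF:.4f}, C/CF = {C / CF:.4f}")
```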
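5. Levene's Test
More work but powerful result.
Let $z_{ij} = |y_{ij} - \tilde{y}_i|$, where $\tilde{y}_i$ = sample median of the i-th group.
T.S. $L = \dfrac{\sum_{i=1}^{t} n_i (\bar{z}_{i.} - \bar{z}_{..})^2 / (t-1)}{\sum_{i=1}^{t}\sum_{j=1}^{n_i} (z_{ij} - \bar{z}_{i.})^2 / (N-t)}$
R.R. Reject $H_0$ if $L \ge F_{\alpha, df_1, df_2}$, with $df_1 = t-1$ and $df_2 = N-t$. Use Table 8.
Essentially an AOV on the $z_{ij}$.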
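The "AOV on the $z_{ij}$" idea can be sketched in Python as follows. This is only an illustration under the median-based definition of $z_{ij}$; the function name and the data are hypothetical, and the F critical value with (t - 1, N - t) degrees of freedom still comes from a table.

```python
from statistics import mean, median

def levene_median_f(groups):
    """One-way AOV F statistic computed on absolute deviations from group medians."""
    z = [[abs(y - median(g)) for y in g] for g in groups]
    t = len(z)                              # number of groups
    N = sum(len(zi) for zi in z)            # total number of observations
    grand = mean(v for zi in z for v in zi)
    ss_between = sum(len(zi) * (mean(zi) - grand) ** 2 for zi in z)
    ss_within = sum((v - mean(zi)) ** 2 for zi in z for v in zi)
    return (ss_between / (t - 1)) / (ss_within / (N - t))

# Hypothetical data: two groups of three observations.
groups = [[0.0, 1.0, 2.0], [0.0, 2.0, 4.0]]
print(f"L = {levene_median_f(groups):.3f} on (1, 4) df")
```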
6. SAS Program
proc glm data=stress;
  class sand;
  model resistance = sand / solution;
  means sand / hovtest=bartlett;
  means sand / hovtest=levene(type=abs);
  means sand / hovtest=levene(type=square);
  means sand / hovtest=bf;  /* Brown and Forsythe modification of Levene */
  title1 'Compression resistance in concrete beams as';
  title2 ' a function of percent sand in the mix';
run;
HOVTEST only works when there is a single factor on the right-hand side of the model.
7. SAS

hovtest=bartlett
Bartlett's Test for Homogeneity of resistance Variance
Source   DF   Chi-Square   Pr > ChiSq
sand      4       1.8901       0.7560

hovtest=levene(type=abs)
Levene's Test for Homogeneity of resistance Variance
ANOVA of Absolute Deviations from Group Means
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
sand      4           8.8320        2.2080      0.95   0.4573
Error    20          46.6080        2.3304

hovtest=levene(type=square)
Levene's Test for Homogeneity of resistance Variance
ANOVA of Squared Deviations from Group Means
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
sand      4            202.2       50.5504      0.85   0.5076
Error    20           1182.8       59.1400

hovtest=bf
Brown and Forsythe's Test for Homogeneity of resistance Variance
ANOVA of Absolute Deviations from Group Medians
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
sand      4           7.4400        1.8600      0.46   0.7623
Error    20          80.4000        4.0200
8. Checking for Normality
Reminder: Normality of the RESIDUALS is assumed.
The original data are assumed normal also, but
each group may have a different mean if $H_A$ is
true. Practice is to first fit the model, THEN
output the residuals, then test for normality of
the residuals. This APPROACH is always correct.
TOOLS
- Histogram of all residuals (eij).
- Normal probability (Q-Q) plot.
- Formal test for normality.
9. Histogram of Residuals
proc glm data=stress;
  class sand;
  model resistance = sand / solution;
  output out=resid r=r_resis p=p_resis;
  title1 'Compression resistance in concrete beams as';
  title2 ' a function of percent sand in the mix';
run;
proc capability data=resid;
  histogram r_resis / normal;
  ppplot r_resis / normal square;
run;
10. Probability Plots
A scatter plot of the percentiles of the
residuals versus the percentiles of a standard
normal distribution. The basic idea is that if
the residuals are truly normally distributed,
the values of these percentiles should lie on a
straight line.
- Compute and sort the residuals e(1), e(2), ..., e(n).
- Associate with each residual a standard normal percentile: z(i) = NORMSINV((i - 0.5)/n).
- Plot z(i) versus e(i). Compare to a straight line.
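The three steps above can be sketched in Python, with statistics.NormalDist().inv_cdf standing in for the spreadsheet function NORMSINV. The function name and the residual values are hypothetical and serve only to show the mechanics.

```python
from statistics import NormalDist

def normal_scores(residuals):
    """Pair each sorted residual e(i) with the normal percentile z(i)."""
    e = sorted(residuals)
    n = len(e)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(z, e))

# Hypothetical residuals from a fitted model.
residuals = [0.3, -1.2, 0.9, -0.4, 0.1]
for z_i, e_i in normal_scores(residuals):
    print(f"z = {z_i:7.3f}   e = {e_i:5.2f}")
```

Plotting the pairs (z(i), e(i)) and comparing them with a straight line gives the probability plot.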
11. Spreadsheet
Percentile: p_i = (i - 0.5)/n
Normal percentile: NORMSINV(p_i)
Use EXCEL for a scatterplot of percentile
versus Normal percentile. Use the Add Line option.
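12. Excel Probability Plot
[Figure: Excel probability plot.]
13. Excel Probability Plot
[Figure: Excel probability plot.]
14. Probability Plot
[Figures: probability plots from Minitab and from SAS (note axes changed).]
These look normal!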
15. Non-Normal Residuals
Examples of non-normal looking residuals.
Note the strong deviations from a straight line.
16. Formal Normality Tests
Many, many tests (a favorite pastime of
statisticians is developing new tests for
normality).
- Kolmogorov-Smirnov test.
- Shapiro-Wilk test (n < 50).
- D'Agostino's test (n > 50).
All are quite conservative in the sense that they
reject the hypothesis of normality more often than
they should.
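17. Shapiro-Wilk W Test
$e_{(1)}, e_{(2)}, \ldots, e_{(n)}$ represent the data ranked from smallest to largest.
$H_0$: The population has a normal distribution.
$H_A$: The population does not have a normal distribution.
T.S. $W = \dfrac{1}{d}\left[\sum_{i=1}^{k} a_i \left(e_{(n-i+1)} - e_{(i)}\right)\right]^2$, where $d = \sum_{i=1}^{n} (e_i - \bar{e})^2$, $k = n/2$ if n is even, and $k = (n-1)/2$ if n is odd.
Coefficients $a_i$ come from a table.
R.R. Reject $H_0$ if $W < W_{0.05}$.
Critical values of $W_\alpha$ come from a table.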
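A small Python sketch of the W computation, assuming the usual form $W = b^2/d$ with $b = \sum_{i=1}^{k} a_i (e_{(n-i+1)} - e_{(i)})$ and $d = \sum (e_i - \bar{e})^2$. The tabled coefficients must be supplied by the caller; the function name and data are hypothetical, and the only coefficient used here is the tabled value 0.7071 for n = 3.

```python
def shapiro_wilk_w(data, coeffs):
    """Compute W from the data and the tabled coefficients a_1, ..., a_k."""
    e = sorted(data)
    n = len(e)
    k = n // 2  # n/2 if n is even, (n - 1)/2 if n is odd
    if len(coeffs) != k:
        raise ValueError(f"expected {k} coefficients for n = {n}")
    e_bar = sum(e) / n
    d = sum((x - e_bar) ** 2 for x in e)
    # Pair the i-th largest with the i-th smallest observation.
    b = sum(a * (e[n - 1 - i] - e[i]) for i, a in enumerate(coeffs))
    return b * b / d

# Hypothetical residuals, n = 3, with the tabled coefficient a_1 = 0.7071.
w = shapiro_wilk_w([0.4, -1.1, 0.2], [0.7071])
print(f"W = {w:.4f}")  # reject normality if W falls below the tabled critical value
```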
18. Shapiro-Wilk Coefficients
19. Shapiro-Wilk Coefficients
20. Shapiro-Wilk W Table
21. D'Agostino's Test
$e_{(1)}, e_{(2)}, \ldots, e_{(n)}$ represent the data ranked from smallest to largest.
$H_0$: The population has a normal distribution.
$H_A$: The population does not have a normal distribution.
T.S. Y
R.R. (two-sided test) Reject $H_0$ if $Y < Y_{0.025}$ or $Y > Y_{0.975}$.
$Y_{0.025}$ and $Y_{0.975}$ come from a table of percentiles of the Y statistic.
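For illustration only, the sketch below computes the Y statistic under the assumption that it follows the commonly quoted form of D'Agostino's test: $D = \sum_i \left(i - \frac{n+1}{2}\right) e_{(i)} \big/ \sqrt{n^3 \sum_i (e_i - \bar{e})^2}$ and $Y = \sqrt{n}\,(D - 0.28209479)/0.02998598$. The function name and the data are hypothetical, and the percentiles of Y still come from a table.

```python
import math

def dagostino_y(data):
    """Return (D, Y) for D'Agostino's test, using the form assumed above."""
    e = sorted(data)
    n = len(e)
    e_bar = sum(e) / n
    ss = sum((x - e_bar) ** 2 for x in e)
    # Linear combination of the order statistics, centred at (n + 1)/2.
    t_stat = sum((i - (n + 1) / 2) * x for i, x in enumerate(e, start=1))
    d = t_stat / math.sqrt(n ** 3 * ss)
    y = math.sqrt(n) * (d - 0.28209479) / 0.02998598
    return d, y

# Hypothetical residuals (a real application would have n > 50).
d, y = dagostino_y([0.3, -1.2, 0.9, -0.4, 0.1, 1.5, -0.7, 0.2])
print(f"D = {d:.5f}, Y = {y:.3f}")
```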
23. K-S Test
- Too difficult to explain.
- Not as powerful as the Shapiro-Wilk or D'Agostino tests.
What do we do if the residuals are not normal or
the variances are not equal?
24. Handling Heterogeneity
[Flowchart]
- Regression? No: ANOVA, fit effect model, then test for homoscedasticity (accept: OK; reject: transform).
- Regression? Yes: fit linear model, then plot residuals (OK: done; not OK: transform).
- Transform: traditional, power family, or Box/Cox family, giving transformed data.
25. Transformations to Achieve Normality
[Flowchart]
- Regression? No: ANOVA, estimate group means. Yes: fit linear model.
- Check the residuals with a probability plot and formal tests.
- Residuals normal? Yes: OK. No: transform, or use a different model.
26. Square Root Transformation
$Z = \sqrt{Y}$
Response is positive and continuous.
This transformation works when we notice the
variance changes as a linear function of the mean: $\sigma^2 = k\mu$, $k > 0$.
- Useful for count data (Poisson distributed).
- For small values of Y, use Y + 0.5 in place of Y.
Typical use: counts of items when counts are
between 0 and 10.
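A brief Python sketch of the transformation, including the 0.5 offset for small counts. The function name and the count data are hypothetical; the printout simply lets one compare group variances before and after transforming.

```python
import math
from statistics import mean, variance

def sqrt_transform(counts, offset=0.0):
    """Return sqrt(y + offset) for each count; offset=0.5 is meant for small counts."""
    return [math.sqrt(y + offset) for y in counts]

# Hypothetical count data for two groups with different means.
low = [1, 0, 3, 2, 4]
high = [8, 12, 5, 10, 15]
for name, g in (("low", low), ("high", high)):
    z = sqrt_transform(g, offset=0.5)
    print(f"{name}: mean={mean(g):.1f} var={variance(g):.2f} var(sqrt)={variance(z):.3f}")
```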
27. Logarithmic Transformation
$Z = \log(Y)$
Response is positive and continuous.
This transformation tends to work when the
variance is a linear function of the square of
the mean: $\sigma^2 = k\mu^2$, $k > 0$.
- Replace Y by Y + 1 if zero occurs.
- Useful if effects are multiplicative, or
- if there is considerable heterogeneity in the data.
Typical use: growth over time; concentrations;
counts greater than 10.
28. ARCSINE SQUARE ROOT
$Z = \arcsin(\sqrt{Y})$
Response is a proportion.
With proportions, the variance is a linear
function of the mean times (1 - mean), where the
sample mean is the expected proportion.
- Y is a proportion (decimal between 0 and 1).
- Zero counts should be replaced by 1/4, and
- counts of N by N - 1/4, before converting to percentages.
Typical use: proportion of seeds
germinating; proportion responding.
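A short Python sketch of the transformation with the 1/4 adjustment for extreme counts. It assumes the adjustment applies to counts of 0 and of N out of N trials; the function name and the example counts are hypothetical.

```python
import math

def arcsine_sqrt(count, n):
    """Arcsine square root of the proportion count/n, adjusting counts of 0 and n."""
    if count == 0:
        count = 0.25          # zero counts replaced by 1/4
    elif count == n:
        count = n - 0.25      # counts of N replaced by N - 1/4
    return math.asin(math.sqrt(count / n))

# Hypothetical germination counts out of n = 20 seeds.
for germinated in (0, 5, 10, 20):
    print(germinated, round(arcsine_sqrt(germinated, 20), 4))
```

The result is in radians, ranging between 0 and pi/2.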
29. Reciprocal Transformation
$Z = 1/Y$
Response is positive and continuous.
This transformation works when the variance is a
linear function of the fourth power of the mean
(standard deviation proportional to the square of the mean).
- Use Y + 1 if zero occurs.
- Useful if the reciprocal of the original scale has meaning.
Typical use: survival time.
30. Power Family of Transformations (1)
Suppose we apply the power transformation $Z = Y^p$.
Suppose the true situation is that the standard deviation
is proportional to a power of the mean: $\sigma_Y \propto \mu^k$.
In the transformed variable we will have $\sigma_Z \propto \mu^{k+p-1}$.
If p is taken as 1 - k, then the variance of Z will
not depend on the mean.
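The rule p = 1 - k can be motivated with a first-order Taylor (delta method) argument, sketched here under the assumption that the standard deviation of Y is proportional to the k-th power of its mean, $\sigma_Y \propto \mu^k$:

```latex
\begin{align*}
Z = Y^p &\approx \mu^p + p\,\mu^{p-1}(Y - \mu) \\
\sigma_Z &\approx |p|\,\mu^{p-1}\,\sigma_Y \;\propto\; \mu^{p-1}\mu^{k} = \mu^{k+p-1} \\
p = 1 - k \;&\Rightarrow\; \sigma_Z \propto \mu^{0} \quad \text{(free of the mean)}
\end{align*}
```

For example, variance proportional to the mean corresponds to k = 1/2 and gives p = 1/2 (square root), while variance proportional to the squared mean corresponds to k = 1 and gives p = 0 (logarithm).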
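31. Power Family of Transformations (2)
With replicated data, k can sometimes be found
empirically by fitting
$\log s_i = a + k \log \bar{y}_i$,
where $s_i$ and $\bar{y}_i$ are the standard deviation and mean of the i-th group.
Estimate $\hat{p} = 1 - \hat{k}$.
k can be estimated by least squares (regression:
next unit).
If $\hat{p}$ is zero, use the logarithmic
transformation.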
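One possible way to carry out this empirical fit is sketched below in Python: regress the log of each group's standard deviation on the log of its mean and take the least squares slope as the estimate of k. The function name and the data are hypothetical; the data were constructed so that the standard deviation is proportional to the mean.

```python
import math
from statistics import mean, stdev

def estimate_power(groups):
    """Return (k_hat, p_hat): slope of log(s_i) on log(ybar_i), and 1 - slope."""
    x = [math.log(mean(g)) for g in groups]   # log group means
    y = [math.log(stdev(g)) for g in groups]  # log group standard deviations
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    k_hat = sxy / sxx                         # least squares slope
    return k_hat, 1 - k_hat

# Hypothetical replicated data: standard deviation grows in proportion to the mean.
groups = [[9.0, 10.0, 11.0], [18.0, 20.0, 22.0], [36.0, 40.0, 44.0]]
k_hat, p_hat = estimate_power(groups)
print(f"k_hat = {k_hat:.3f}, p_hat = {p_hat:.3f}")
```

With these constructed data the estimated power is essentially zero, which points to the logarithmic transformation.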
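32. Box and Cox Transformations
Suggested transformation:
$y^{(\lambda)} = \dfrac{y^{\lambda} - 1}{\lambda\,\dot{y}^{\lambda-1}}$ for $\lambda \ne 0$, and $y^{(\lambda)} = \dot{y}\,\ln y$ for $\lambda = 0$,
where $\dot{y}$ = geometric mean of the original data.
The exponent, $\lambda$, is unknown. Hence the model can be
viewed as having an additional parameter which
must be estimated.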
33. Box and Cox Transformations
Find the value of $\lambda$ that minimizes the residual
sum of squares.
If $SSE_0$ denotes the minimum sum of squares,
then the values of $\lambda$ corresponding to the critical
sum of squares
$SS^* = SSE_0\left(1 + \dfrac{t^2_{\alpha/2,\nu}}{\nu}\right)$,
where $\nu$ is the residual degrees of freedom,
provide an approximate $100(1-\alpha)\%$ CI on the power.
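A minimal Python sketch of the search for a one-way layout, assuming the geometric-mean-normalized form of the Box and Cox transformation: $(y^{\lambda} - 1)/(\lambda\,\dot{y}^{\lambda-1})$ for nonzero $\lambda$ and $\dot{y}\ln y$ for $\lambda = 0$. The function names, the grid and the data are hypothetical; the residual sum of squares is the within-group sum of squares of the transformed values.

```python
import math

def boxcox_normalized(y, lam, gm):
    """Normalized Box-Cox transform of one positive value; gm is the geometric mean."""
    if abs(lam) < 1e-12:
        return gm * math.log(y)
    return (y ** lam - 1) / (lam * gm ** (lam - 1))

def sse_for_lambda(groups, lam):
    """Residual (within-group) sum of squares after transforming with exponent lam."""
    values = [y for g in groups for y in g]
    gm = math.exp(sum(math.log(y) for y in values) / len(values))
    sse = 0.0
    for g in groups:
        z = [boxcox_normalized(y, lam, gm) for y in g]
        z_bar = sum(z) / len(z)
        sse += sum((v - z_bar) ** 2 for v in z)
    return sse

def best_lambda(groups, grid):
    """Grid value of lambda with the smallest residual sum of squares."""
    return min(grid, key=lambda lam: sse_for_lambda(groups, lam))

# Hypothetical positive responses in two groups; grid from -2 to 2 in steps of 0.1.
groups = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
grid = [i / 10 for i in range(-20, 21)]
lam_hat = best_lambda(groups, grid)
print(f"lambda_hat = {lam_hat}, SSE0 = {sse_for_lambda(groups, lam_hat):.4f}")
```

The grid values whose sum of squares falls below the critical sum of squares would then form the approximate confidence interval; the t quantile needed for that step has to be looked up separately.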
34. Conclusions
- What have we learned?
- How to check ANOVA model assumptions?
- What are typical transformations that might be used to correct for violations of these assumptions?
- Other models? Next unit.