Title: Topic 19: Single Factor Analysis of Variance
1Topic 19 Single Factor Analysis of Variance
2Outline
- Analysis of Variance
- One set of treatments (i.e., single factor)
- Cell means model
- Factor effects model
- Link to linear regression using indicator
explanatory variables
3One-Way ANOVA
- The response variable Y is continuous
- The explanatory variable is categorical
- We call it a factor
- The possible values are called levels
- This approach is a generalization of the
independent two-sample t-test - In other words, it can be used when there are
more than two treatments
4Data for One-Way ANOVA
- Y is the response variable
- X is the factor (it is qualitative/discrete)
- r is the number of levels
- often refer to these levels as groups or
treatments - Yi,j is the jth observation in the ith group
5Notation
- For Yi,j we use
- i to denote the level of the factor
- j to denote the jth observation at factor level i
- i 1, . . . , r levels of factor X
- j 1, . . . , ni observations for level i of
factor X
6NKNW Example (p 676)
- Y is the number of cases of cereal sold
- X is the design of the cereal package
- there are 4 levels for X because there are 4
different package designs - i 1 to 4 levels
- j 1 to ni stores with design i (ni5,5,4,5)
- Will use n if ni is constant
7Data for one-way ANOVA
data a1 infile 'c../data/ch16ta01.dat'
input cases design store proc print dataa1
run
8The data
Obs cases design store 1 11 1
1 2 17 1 2 3 16
1 3 4 14 1 4 5
15 1 5 6 12 2 1
7 10 2 2
9The Model
- We assume that the response variable is
- Normally distributed with a
- mean that may depend on the level of the factor
- constant variance
- All observations assumed independent
- NOTE Same assumptions as linear regression
except there is no assumed linear relationship
between X and Y
10Cell Means Model
- Yij µi eij
- where µi is the theoretical mean or expected
value of all observations at level i (or cell i) - the eij are iid N(0, s2) which means
- Yij N(µi, s2) and independent
- This is called the cell means model
11Parameters
- The parameters of the model are
- µ1, µ2, , µr
- s2
- Question Does our explanatory variable help
explain Y? - Question - Does µi depend on i?
- H0 µ1 µ2 µr µ (a constant)
- Ha not all µs are the same
12Estimates
- Estimate µi by the mean of the observations at
level i, - ûi SYi,j/ni
- For each level i, get an estimate of the variance
- S(Yij- )2/(ni-1)
- We combine these to get an estimate of s2
- Same summaries computed for t-test
13Pooled estimate of s2
- If the ni are all the same we would average the
- Do not average the si
- In general we pool the , giving weights
proportional to the df, ni -1 - The pooled estimate is
- s2 (S(ni-1)si2) / S(ni-1)
- (S(ni-1)si2)/(nT-r)
14Run proc glm
proc glm dataa1 class design model
casesdesign means design run
15Output
The GLM Procedure Class Level Information Class
Levels Values design 4 1 2 3
4 Number of observations 19
16Means statement output
Level of design N Mean Std Dev 1 5
14.6 2.3 2 5 13.4 3.6 3 4
19.5 2.6 4 5 27.2 3.9
17Plot the data
symbol1 vcircle inone proc gplot dataa1
plot casesdesign run
18The plot
19Plot the means
proc means dataa1 var cases by design
output outa2 meanavcases proc print
dataa2 symbol1 vcircle ijoin proc gplot
dataa2 plot avcasesdesign run
20New Data Set
Obs design _FREQ_ avcases 1 1 5
14.6 2 2 5 13.4 3 3 4
19.5 4 4 5 27.2
21Plot of the means
22Notation
- SjYij / ni (trt sample mean)
- SiSjYij / nT (grand sample mean)
- nT is the total number of observations
- nT Si ni
23ANOVA Table
- Source df SS MS
- Model r-1 Sij( - )2 SSR/dfR
- Error nT-r Sij(Yij - )2 SSE/dfE
- Total nT-1 Sij(Yij - )2
SST/dfT
24ANOVA SAS Output
Source DF SS MS F P Model 3 588 196
18.59 lt.0001 Error 15 158 10 Total 18 746
25Expected Mean Squares
- E(MSE) s2
- E(MSR) s2 (Sini (µi - µ.)2)/(r-1))
- where µ. (Siniµi )/nT
- E(MSR) gt E(MSE) when the group means are
different - See NKNW p 685 689 for more details
- In more complicated models, these tell us how to
construct the F test
26F test
- F MSR/MSE
- H0 µ1 µ2 µr
- Ha not all of the µi are equal
- Under H0, F F(r-1, nT-r)
- Reject H0 when F is large
- Report the P-value
27More output
R-Square Coeff Var 0.788055
17.43042 Root MSE cases Mean 3.247563
18.63158
28Factor Effects Model
- A reparameterization of the cell means model
- Useful way at looking at more complicated models
- Null hypotheses are easier to state
- Yij µ ?i eij
- the eij are iid N(0, s2)
29Parameters
- The parameters of the model are
- µ, ?1, ?2, , ?r
- s2
- The cell means model had r 1 parameters
- r µs and s2
- The factor effects model has r 2 parameters
- µ, the r ?s, and s2
30An example
- Suppose r3 µ1 10, µ2 20, µ3 30
- What is an equivalent set of parameters for the
factor effects model? - We need to have µ ?i µi
- µ 0, ?1 10, ?2 20, ?3 30
- µ 20, ?1 -10, ?2 0, ?3 10
- µ 5000, ?1 -4990, ?2 -4980, ?3 -4970
31Problem with factor effects?
- These parameters are not estimable or not well
defined (i.e., unique) - There are many solutions to the least squares
problem - There is an X?X matrix for this parameterization
that does not have an inverse (perfect
multicollinearity) - The parameter estimators here are biased (SAS
proc glm)
32Factor effects solution
- Put a constraint on the ?i
- Common to assume Si ?i 0
- This effectively reduces the number of parameters
by 1 - The details are a bit more complicated when the
ni are not all equal see NKNW p 693-696
33Consequences
- We always have µi µ ?i
- The constraint Si ?i 0 implies
- µ (Si µi)/r (grand mean)
- ?i µi µ (group effect)
34Hypotheses
- H0 µ1 µ2 µr
- H1 not all of the µi are equal
-
- are translated into
- H0 ?1 ?2 ?r 0
- H1 at least one ?i is not 0
35Estimates of parameters
- With the constraint Si ?i 0
36Solution used by SAS
- Recall, X?X does not have an inverse
- We can use a generalized inverse in its place
- (X?X)- is the standard notation
- There are many generalized inverses, each
corresponding to a different constraint
37Solution used by SAS
- (X?X)- used in proc glm corresponds to the
constraint ?r 0 - Recall that µ and the ?i are not estimable
- But the linear combinations µ ?i are estimable
- These are estimated by the cell means
38NKNW Example p 676
- Y is the number of cases of cereal sold
- X is the design of the cereal package
- i 1 to 4 levels
- j 1 to ni stores with design i
39SAS coding for X
- Class statement generates r explanatory variables
- The ith explanatory variable is equal to 1 if the
observation is from the ith group - In other words, the rows are
- 1 1 0 0 0 for X1
- 1 0 1 0 0 for X2
- 1 0 0 1 0 for X3
- 1 0 0 0 1 for X4
40Some options
proc glm dataa1 class design model
casesdesign /xpx inverse solution run
41Output
The X'X Matrix Int d1 d2 d3 d4
cases Int 19 5 5 4 5 354 d1 5 5
0 0 0 73 d2 5 0 5 0 0 67 d3
4 0 0 4 0 78 d4 5 0 0 0 5
136 cases 354 73 67 78 136 7342
42Output
X'X Generalized Inverse (g2) Int d1
d2 d3 d4 cases Int 0.2 -0.2 -0.2 -0.2
0 27.2 d1 -0.2 0.4 0.2 0.2 0 -12.6 d2
-0.2 0.2 0.4 0.2 0 -13.8 d3 -0.2 0.2
0.2 0.45 0 -7.7 d4 0 0 0 0 0
0 cases 27.2 -12.6 -13.8 -7.7 0 158.2
43Output matrix
- Actually, this matrix is
- (X?X)- (X?X)- X?Y
- Y?X(X?X)- Y?Y-Y?X(X?X)- X?Y
- Parameter estimates are in upper right corner,
SSE is lower right corner
44Parameter estimates
St Par Est Err t
P Int 27.2 B 1.45 18.73 lt.0001 d1 -12.6 B
2.05 -6.13 lt.0001 d2 -13.8 B 2.05 -6.72
lt.0001 d3 -7.7 B 2.17 -3.53 0.0030 d4
0.0 B . . .
45Caution Message
NOTE The X'X matrix has been found to be
singular, and a generalized inverse was used to
solve the normal equations. Terms whose estimates
are followed by the letter 'B' are not uniquely
estimable.
46Interpretation
- If ?r 0 (in our case, ?4 0), then the
corresponding estimate should be zero - the intercept µ is estimated by the mean of the
observations in group 4 - since µ ?i is the mean of group i, the ?i are
the differences between the mean of group i and
the mean of group 4
47Means statement output
Level of design N Mean Std Dev 1 5
14.6 2.3 2 5 13.4 3.6 3 4
19.5 2.6 4 5 27.2 3.9
48Parameter estimates from means
Level of design Mean 27.2
27.2 1 14.6 14.6-27.2
-12.6 2 13.4 13.4-27.2 -13.8 3
19.5 19.5-27.2 -7.7 4 27.2
27.2-27.2 0
49Last slide
- Read NKNW Chapter 16 up to 16.10
- We used programs NKNW677.sas and NKNW677b.sas to
generate the output for today