Topic 19: Single Factor Analysis of Variance - PowerPoint PPT Presentation

1 / 49

About This Presentation

Title:

Topic 19: Single Factor Analysis of Variance

Description:

ANOVA Table. Source df SS MS. Model r-1 Sij( - )2 SSR/dfR. Error nT-r Sij(Yij - )2 SSE/dfE ... Root MSE cases Mean. 3.247563 18.63158. Factor Effects Model ... – PowerPoint PPT presentation

Number of Views:58

Avg rating:3.0/5.0

Slides: 50

Provided by: georgep56

Category:

more less

Transcript and Presenter's Notes

Title: Topic 19: Single Factor Analysis of Variance

1
Topic 19 Single Factor Analysis of Variance
2
Outline

Analysis of Variance
One set of treatments (i.e., single factor)
Cell means model
Factor effects model
Link to linear regression using indicator
explanatory variables

3
One-Way ANOVA

The response variable Y is continuous
The explanatory variable is categorical
We call it a factor
The possible values are called levels
This approach is a generalization of the
independent two-sample t-test
In other words, it can be used when there are
more than two treatments

4
Data for One-Way ANOVA

Y is the response variable
X is the factor (it is qualitative/discrete)
r is the number of levels
often refer to these levels as groups or
treatments
Yi,j is the jth observation in the ith group

5
Notation

For Yi,j we use
i to denote the level of the factor
j to denote the jth observation at factor level i
i 1, . . . , r levels of factor X
j 1, . . . , ni observations for level i of
factor X

6
NKNW Example (p 676)

Y is the number of cases of cereal sold
X is the design of the cereal package
there are 4 levels for X because there are 4
different package designs
i 1 to 4 levels
j 1 to ni stores with design i (ni5,5,4,5)
Will use n if ni is constant

7
Data for one-way ANOVA
data a1 infile 'c../data/ch16ta01.dat'
input cases design store proc print dataa1
run
8
The data
Obs cases design store 1 11 1
1 2 17 1 2 3 16
1 3 4 14 1 4 5
15 1 5 6 12 2 1
7 10 2 2
9
The Model

We assume that the response variable is
Normally distributed with a
mean that may depend on the level of the factor
constant variance
All observations assumed independent
NOTE Same assumptions as linear regression
except there is no assumed linear relationship
between X and Y

10
Cell Means Model

Yij µi eij
where µi is the theoretical mean or expected
value of all observations at level i (or cell i)
the eij are iid N(0, s2) which means
Yij N(µi, s2) and independent
This is called the cell means model

11
Parameters

The parameters of the model are
µ1, µ2, , µr
s2
Question Does our explanatory variable help
explain Y?
Question - Does µi depend on i?
H0 µ1 µ2 µr µ (a constant)
Ha not all µs are the same

12
Estimates

Estimate µi by the mean of the observations at
level i,
ûi SYi,j/ni
For each level i, get an estimate of the variance
S(Yij- )2/(ni-1)
We combine these to get an estimate of s2
Same summaries computed for t-test

13
Pooled estimate of s2

If the ni are all the same we would average the
Do not average the si
In general we pool the , giving weights
proportional to the df, ni -1
The pooled estimate is
s2 (S(ni-1)si2) / S(ni-1)
(S(ni-1)si2)/(nT-r)

14
Run proc glm
proc glm dataa1 class design model
casesdesign means design run
15
Output
The GLM Procedure Class Level Information Class
Levels Values design 4 1 2 3
4 Number of observations 19
16
Means statement output
Level of design N Mean Std Dev 1 5
14.6 2.3 2 5 13.4 3.6 3 4
19.5 2.6 4 5 27.2 3.9
17
Plot the data
symbol1 vcircle inone proc gplot dataa1
plot casesdesign run
18
The plot
19
Plot the means
proc means dataa1 var cases by design
output outa2 meanavcases proc print
dataa2 symbol1 vcircle ijoin proc gplot
dataa2 plot avcasesdesign run
20
New Data Set
Obs design _FREQ_ avcases 1 1 5
14.6 2 2 5 13.4 3 3 4
19.5 4 4 5 27.2
21
Plot of the means
22
Notation

SjYij / ni (trt sample mean)
SiSjYij / nT (grand sample mean)
nT is the total number of observations
nT Si ni

23
ANOVA Table

Source df SS MS
Model r-1 Sij( - )2 SSR/dfR
Error nT-r Sij(Yij - )2 SSE/dfE
Total nT-1 Sij(Yij - )2
SST/dfT

24
ANOVA SAS Output
Source DF SS MS F P Model 3 588 196
18.59 lt.0001 Error 15 158 10 Total 18 746
25
Expected Mean Squares

E(MSE) s2
E(MSR) s2 (Sini (µi - µ.)2)/(r-1))
where µ. (Siniµi )/nT
E(MSR) gt E(MSE) when the group means are
different
See NKNW p 685 689 for more details
In more complicated models, these tell us how to
construct the F test

26
F test

F MSR/MSE
H0 µ1 µ2 µr
Ha not all of the µi are equal
Under H0, F F(r-1, nT-r)
Reject H0 when F is large
Report the P-value

27
More output
R-Square Coeff Var 0.788055
17.43042 Root MSE cases Mean 3.247563
18.63158
28
Factor Effects Model

A reparameterization of the cell means model
Useful way at looking at more complicated models
Null hypotheses are easier to state
Yij µ ?i eij
the eij are iid N(0, s2)

29
Parameters

The parameters of the model are
µ, ?1, ?2, , ?r
s2
The cell means model had r 1 parameters
r µs and s2
The factor effects model has r 2 parameters
µ, the r ?s, and s2

30
An example

Suppose r3 µ1 10, µ2 20, µ3 30
What is an equivalent set of parameters for the
factor effects model?
We need to have µ ?i µi
µ 0, ?1 10, ?2 20, ?3 30
µ 20, ?1 -10, ?2 0, ?3 10
µ 5000, ?1 -4990, ?2 -4980, ?3 -4970

31
Problem with factor effects?

These parameters are not estimable or not well
defined (i.e., unique)
There are many solutions to the least squares
problem
There is an X?X matrix for this parameterization
that does not have an inverse (perfect
multicollinearity)
The parameter estimators here are biased (SAS
proc glm)

32
Factor effects solution

Put a constraint on the ?i
Common to assume Si ?i 0
This effectively reduces the number of parameters
by 1
The details are a bit more complicated when the
ni are not all equal see NKNW p 693-696

33
Consequences

We always have µi µ ?i
The constraint Si ?i 0 implies
µ (Si µi)/r (grand mean)
?i µi µ (group effect)

34
Hypotheses

H0 µ1 µ2 µr
H1 not all of the µi are equal
are translated into
H0 ?1 ?2 ?r 0
H1 at least one ?i is not 0

35
Estimates of parameters

With the constraint Si ?i 0

36
Solution used by SAS

Recall, X?X does not have an inverse
We can use a generalized inverse in its place
(X?X)- is the standard notation
There are many generalized inverses, each
corresponding to a different constraint

37
Solution used by SAS

(X?X)- used in proc glm corresponds to the
constraint ?r 0
Recall that µ and the ?i are not estimable
But the linear combinations µ ?i are estimable
These are estimated by the cell means

38
NKNW Example p 676

Y is the number of cases of cereal sold
X is the design of the cereal package
i 1 to 4 levels
j 1 to ni stores with design i

39
SAS coding for X

Class statement generates r explanatory variables
The ith explanatory variable is equal to 1 if the
observation is from the ith group
In other words, the rows are
1 1 0 0 0 for X1
1 0 1 0 0 for X2
1 0 0 1 0 for X3
1 0 0 0 1 for X4

40
Some options
proc glm dataa1 class design model
casesdesign /xpx inverse solution run
41
Output
The X'X Matrix Int d1 d2 d3 d4
cases Int 19 5 5 4 5 354 d1 5 5
0 0 0 73 d2 5 0 5 0 0 67 d3
4 0 0 4 0 78 d4 5 0 0 0 5
136 cases 354 73 67 78 136 7342
42
Output
X'X Generalized Inverse (g2) Int d1
d2 d3 d4 cases Int 0.2 -0.2 -0.2 -0.2
0 27.2 d1 -0.2 0.4 0.2 0.2 0 -12.6 d2
-0.2 0.2 0.4 0.2 0 -13.8 d3 -0.2 0.2
0.2 0.45 0 -7.7 d4 0 0 0 0 0
0 cases 27.2 -12.6 -13.8 -7.7 0 158.2
43
Output matrix

Actually, this matrix is
(X?X)- (X?X)- X?Y
Y?X(X?X)- Y?Y-Y?X(X?X)- X?Y
Parameter estimates are in upper right corner,
SSE is lower right corner

44
Parameter estimates
St Par Est Err t
P Int 27.2 B 1.45 18.73 lt.0001 d1 -12.6 B
2.05 -6.13 lt.0001 d2 -13.8 B 2.05 -6.72
lt.0001 d3 -7.7 B 2.17 -3.53 0.0030 d4
0.0 B . . .
45
Caution Message
NOTE The X'X matrix has been found to be
singular, and a generalized inverse was used to
solve the normal equations. Terms whose estimates
are followed by the letter 'B' are not uniquely
estimable.
46
Interpretation

If ?r 0 (in our case, ?4 0), then the
corresponding estimate should be zero
the intercept µ is estimated by the mean of the
observations in group 4
since µ ?i is the mean of group i, the ?i are
the differences between the mean of group i and
the mean of group 4

47
Means statement output
Level of design N Mean Std Dev 1 5
14.6 2.3 2 5 13.4 3.6 3 4
19.5 2.6 4 5 27.2 3.9
48
Parameter estimates from means
Level of design Mean 27.2
27.2 1 14.6 14.6-27.2
-12.6 2 13.4 13.4-27.2 -13.8 3
19.5 19.5-27.2 -7.7 4 27.2
27.2-27.2 0
49
Last slide