Topic 19: Single Factor Analysis of Variance - PowerPoint PPT Presentation

1 / 49
About This Presentation
Title:

Topic 19: Single Factor Analysis of Variance

Description:

ANOVA Table. Source df SS MS. Model r-1 Sij( - )2 SSR/dfR. Error nT-r Sij(Yij - )2 SSE/dfE ... Root MSE cases Mean. 3.247563 18.63158. Factor Effects Model ... – PowerPoint PPT presentation

Number of Views:58
Avg rating:3.0/5.0
Slides: 50
Provided by: georgep56
Category:

less

Transcript and Presenter's Notes

Title: Topic 19: Single Factor Analysis of Variance


1
Topic 19 Single Factor Analysis of Variance
2
Outline
  • Analysis of Variance
  • One set of treatments (i.e., single factor)
  • Cell means model
  • Factor effects model
  • Link to linear regression using indicator
    explanatory variables

3
One-Way ANOVA
  • The response variable Y is continuous
  • The explanatory variable is categorical
  • We call it a factor
  • The possible values are called levels
  • This approach is a generalization of the
    independent two-sample t-test
  • In other words, it can be used when there are
    more than two treatments

4
Data for One-Way ANOVA
  • Y is the response variable
  • X is the factor (it is qualitative/discrete)
  • r is the number of levels
  • often refer to these levels as groups or
    treatments
  • Yi,j is the jth observation in the ith group

5
Notation
  • For Yi,j we use
  • i to denote the level of the factor
  • j to denote the jth observation at factor level i
  • i 1, . . . , r levels of factor X
  • j 1, . . . , ni observations for level i of
    factor X

6
NKNW Example (p 676)
  • Y is the number of cases of cereal sold
  • X is the design of the cereal package
  • there are 4 levels for X because there are 4
    different package designs
  • i 1 to 4 levels
  • j 1 to ni stores with design i (ni5,5,4,5)
  • Will use n if ni is constant

7
Data for one-way ANOVA
data a1 infile 'c../data/ch16ta01.dat'
input cases design store proc print dataa1
run
8
The data
Obs cases design store 1 11 1
1 2 17 1 2 3 16
1 3 4 14 1 4 5
15 1 5 6 12 2 1
7 10 2 2
9
The Model
  • We assume that the response variable is
  • Normally distributed with a
  • mean that may depend on the level of the factor
  • constant variance
  • All observations assumed independent
  • NOTE Same assumptions as linear regression
    except there is no assumed linear relationship
    between X and Y

10
Cell Means Model
  • Yij µi eij
  • where µi is the theoretical mean or expected
    value of all observations at level i (or cell i)
  • the eij are iid N(0, s2) which means
  • Yij N(µi, s2) and independent
  • This is called the cell means model

11
Parameters
  • The parameters of the model are
  • µ1, µ2, , µr
  • s2
  • Question Does our explanatory variable help
    explain Y?
  • Question - Does µi depend on i?
  • H0 µ1 µ2 µr µ (a constant)
  • Ha not all µs are the same

12
Estimates
  • Estimate µi by the mean of the observations at
    level i,
  • ûi SYi,j/ni
  • For each level i, get an estimate of the variance
  • S(Yij- )2/(ni-1)
  • We combine these to get an estimate of s2
  • Same summaries computed for t-test

13
Pooled estimate of s2
  • If the ni are all the same we would average the
  • Do not average the si
  • In general we pool the , giving weights
    proportional to the df, ni -1
  • The pooled estimate is
  • s2 (S(ni-1)si2) / S(ni-1)
  • (S(ni-1)si2)/(nT-r)

14
Run proc glm
proc glm dataa1 class design model
casesdesign means design run
15
Output
The GLM Procedure Class Level Information Class
Levels Values design 4 1 2 3
4 Number of observations 19
16
Means statement output
Level of design N Mean Std Dev 1 5
14.6 2.3 2 5 13.4 3.6 3 4
19.5 2.6 4 5 27.2 3.9
17
Plot the data
symbol1 vcircle inone proc gplot dataa1
plot casesdesign run
18
The plot
19
Plot the means
proc means dataa1 var cases by design
output outa2 meanavcases proc print
dataa2 symbol1 vcircle ijoin proc gplot
dataa2 plot avcasesdesign run
20
New Data Set
Obs design _FREQ_ avcases 1 1 5
14.6 2 2 5 13.4 3 3 4
19.5 4 4 5 27.2
21
Plot of the means
22
Notation
  • SjYij / ni (trt sample mean)
  • SiSjYij / nT (grand sample mean)
  • nT is the total number of observations
  • nT Si ni

23
ANOVA Table
  • Source df SS MS
  • Model r-1 Sij( - )2 SSR/dfR
  • Error nT-r Sij(Yij - )2 SSE/dfE
  • Total nT-1 Sij(Yij - )2
    SST/dfT

24
ANOVA SAS Output
Source DF SS MS F P Model 3 588 196
18.59 lt.0001 Error 15 158 10 Total 18 746
25
Expected Mean Squares
  • E(MSE) s2
  • E(MSR) s2 (Sini (µi - µ.)2)/(r-1))
  • where µ. (Siniµi )/nT
  • E(MSR) gt E(MSE) when the group means are
    different
  • See NKNW p 685 689 for more details
  • In more complicated models, these tell us how to
    construct the F test

26
F test
  • F MSR/MSE
  • H0 µ1 µ2 µr
  • Ha not all of the µi are equal
  • Under H0, F F(r-1, nT-r)
  • Reject H0 when F is large
  • Report the P-value

27
More output
R-Square Coeff Var 0.788055
17.43042 Root MSE cases Mean 3.247563
18.63158
28
Factor Effects Model
  • A reparameterization of the cell means model
  • Useful way at looking at more complicated models
  • Null hypotheses are easier to state
  • Yij µ ?i eij
  • the eij are iid N(0, s2)

29
Parameters
  • The parameters of the model are
  • µ, ?1, ?2, , ?r
  • s2
  • The cell means model had r 1 parameters
  • r µs and s2
  • The factor effects model has r 2 parameters
  • µ, the r ?s, and s2

30
An example
  • Suppose r3 µ1 10, µ2 20, µ3 30
  • What is an equivalent set of parameters for the
    factor effects model?
  • We need to have µ ?i µi
  • µ 0, ?1 10, ?2 20, ?3 30
  • µ 20, ?1 -10, ?2 0, ?3 10
  • µ 5000, ?1 -4990, ?2 -4980, ?3 -4970

31
Problem with factor effects?
  • These parameters are not estimable or not well
    defined (i.e., unique)
  • There are many solutions to the least squares
    problem
  • There is an X?X matrix for this parameterization
    that does not have an inverse (perfect
    multicollinearity)
  • The parameter estimators here are biased (SAS
    proc glm)

32
Factor effects solution
  • Put a constraint on the ?i
  • Common to assume Si ?i 0
  • This effectively reduces the number of parameters
    by 1
  • The details are a bit more complicated when the
    ni are not all equal see NKNW p 693-696

33
Consequences
  • We always have µi µ ?i
  • The constraint Si ?i 0 implies
  • µ (Si µi)/r (grand mean)
  • ?i µi µ (group effect)

34
Hypotheses
  • H0 µ1 µ2 µr
  • H1 not all of the µi are equal
  • are translated into
  • H0 ?1 ?2 ?r 0
  • H1 at least one ?i is not 0

35
Estimates of parameters
  • With the constraint Si ?i 0

36
Solution used by SAS
  • Recall, X?X does not have an inverse
  • We can use a generalized inverse in its place
  • (X?X)- is the standard notation
  • There are many generalized inverses, each
    corresponding to a different constraint

37
Solution used by SAS
  • (X?X)- used in proc glm corresponds to the
    constraint ?r 0
  • Recall that µ and the ?i are not estimable
  • But the linear combinations µ ?i are estimable
  • These are estimated by the cell means

38
NKNW Example p 676
  • Y is the number of cases of cereal sold
  • X is the design of the cereal package
  • i 1 to 4 levels
  • j 1 to ni stores with design i

39
SAS coding for X
  • Class statement generates r explanatory variables
  • The ith explanatory variable is equal to 1 if the
    observation is from the ith group
  • In other words, the rows are
  • 1 1 0 0 0 for X1
  • 1 0 1 0 0 for X2
  • 1 0 0 1 0 for X3
  • 1 0 0 0 1 for X4

40
Some options
proc glm dataa1 class design model
casesdesign /xpx inverse solution run
41
Output
The X'X Matrix Int d1 d2 d3 d4
cases Int 19 5 5 4 5 354 d1 5 5
0 0 0 73 d2 5 0 5 0 0 67 d3
4 0 0 4 0 78 d4 5 0 0 0 5
136 cases 354 73 67 78 136 7342
42
Output
X'X Generalized Inverse (g2) Int d1
d2 d3 d4 cases Int 0.2 -0.2 -0.2 -0.2
0 27.2 d1 -0.2 0.4 0.2 0.2 0 -12.6 d2
-0.2 0.2 0.4 0.2 0 -13.8 d3 -0.2 0.2
0.2 0.45 0 -7.7 d4 0 0 0 0 0
0 cases 27.2 -12.6 -13.8 -7.7 0 158.2
43
Output matrix
  • Actually, this matrix is
  • (X?X)- (X?X)- X?Y
  • Y?X(X?X)- Y?Y-Y?X(X?X)- X?Y
  • Parameter estimates are in upper right corner,
    SSE is lower right corner

44
Parameter estimates
St Par Est Err t
P Int 27.2 B 1.45 18.73 lt.0001 d1 -12.6 B
2.05 -6.13 lt.0001 d2 -13.8 B 2.05 -6.72
lt.0001 d3 -7.7 B 2.17 -3.53 0.0030 d4
0.0 B . . .
45
Caution Message
NOTE The X'X matrix has been found to be
singular, and a generalized inverse was used to
solve the normal equations. Terms whose estimates
are followed by the letter 'B' are not uniquely
estimable.
46
Interpretation
  • If ?r 0 (in our case, ?4 0), then the
    corresponding estimate should be zero
  • the intercept µ is estimated by the mean of the
    observations in group 4
  • since µ ?i is the mean of group i, the ?i are
    the differences between the mean of group i and
    the mean of group 4

47
Means statement output
Level of design N Mean Std Dev 1 5
14.6 2.3 2 5 13.4 3.6 3 4
19.5 2.6 4 5 27.2 3.9
48
Parameter estimates from means
Level of design Mean 27.2
27.2 1 14.6 14.6-27.2
-12.6 2 13.4 13.4-27.2 -13.8 3
19.5 19.5-27.2 -7.7 4 27.2
27.2-27.2 0
49
Last slide
  • Read NKNW Chapter 16 up to 16.10
  • We used programs NKNW677.sas and NKNW677b.sas to
    generate the output for today
Write a Comment
User Comments (0)
About PowerShow.com