ANOVA - PowerPoint PPT Presentation

Slides: 48
Provided by: johnt1

Transcript and Presenter's Notes

Title: ANOVA


1
ANOVA
  • Analysis of Variance
  • Method to compare more than 2 means
  • As with comparing more than 2 p's, we can't use
    subtraction
  • Will use sum of squares

2
ANOVA
  • Think of an extension to 2 sample t problem
  • Need an estimate of SD
  • In 2 sample t, we always assumed the two distns
    had the same SD
  • Had to combine our 2 estimates (one from each
    sample) to get a single est

3
ANOVA
  • Review 2 sample t test
  • Given N, Avg, SD for 2 groups
  • SP = pool(s1,n1,s2,n2)
  • PV = tprob(0, SP√(1/n1+1/n2), av1-av2, 999)
  • (double for two sided)

4
ANOVA
  • Pool(s1,n1,s2,n2)
  • = √( ((n1-1)s1² + (n2-1)s2²) / (n1-1 + n2-1) )
  • Concentrate on the numerator and ignore sqrt
  • Σ (n-1)s²
  • SSE, Sum of Squares Error
  • (Error not in sense of mistake)
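
The pooling step above can be sketched in Python (the slides use a MATLAB-style `pool` helper whose source is not shown, so this version is an assumption about what it computes, based on the formula on this slide). The inputs are the two-group numbers that appear later in the deck.

```python
import math

def pool(s1, n1, s2, n2):
    """Pooled SD from two sample SDs and their sample sizes."""
    sse = (n1 - 1) * s1**2 + (n2 - 1) * s2**2   # numerator: sum of (n-1)*s^2
    df = (n1 - 1) + (n2 - 1)                    # pooled degrees of freedom
    return math.sqrt(sse / df)

sp = pool(5.3, 8, 6.6, 12)
print(round(sp, 4))   # 6.1273
```

The numerator inside the square root is the quantity the slides go on to call SSE.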

5
ANOVA
  • If we ignored the fact that our data came from
    (potentially) different groups, we might have
    started with
  • Σ (x-avg)²
  • SSTotal
  • Note that √(SSTotal/(N-1)) would be our estimator
    of SD using all the data (and ignoring grouping)

6
ANOVA
  • The Sum of Squares Identity
  • SSTotal = SSE + SSTreatment
  • Where
  • SSTr = Σ nj (avgj-AVG)²
  • Except for nj this is like the sum of squares of
    the averages
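
A short derivation of the identity, written with x_ij for observation i in group j, x̄_j for the group average (avgj) and x̄ for the overall average (AVG). This notation is introduced here for illustration and is not from the slides.

```latex
\begin{aligned}
\text{SSTotal} &= \sum_j \sum_i (x_{ij} - \bar{x})^2
  = \sum_j \sum_i \bigl[(x_{ij} - \bar{x}_j) + (\bar{x}_j - \bar{x})\bigr]^2 \\
 &= \sum_j \sum_i (x_{ij} - \bar{x}_j)^2
  + \sum_j n_j (\bar{x}_j - \bar{x})^2
  + 2 \sum_j (\bar{x}_j - \bar{x}) \sum_i (x_{ij} - \bar{x}_j) \\
 &= \text{SSE} + \text{SSTr}
\end{aligned}
```

The cross term drops out because the deviations from a group's own average sum to zero within each group.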

7
ANOVA
  • See Ex 62, p. 488
  • 3 levels of blockage
  • Measure flow rate where collapse occurs
  • In book as Fig 11.4
  • Not a good form for Excel
  • See ANOVA1.xls
  • First col tells which blockage level
  • Second col is the flowrate

8
ANOVA
  • To extract the rows that correspond to blockage
    level 1, use
  • Dt(:,1)==1
  • : means all rows
  • Equality is ==
  • Single = is assignment
  • Use above expression to select rows
  • Dt( Dt(:,1)==1, :)

9
ANOVA
  • To count how many, use length()
  • length(Dt( Dt(:,1)==1, :))
  • Average is mean()
  • SD is std()
  • Most convenient to store these in lists

10
ANOVA
  • for i=1:3,
  • Ns(i) = length(Dt( Dt(:,1)==i, :))
  • Avs(i) = ...
  • Sds(i) = ...
  • end

11
ANOVA
  • These are rows
  • We can display conveniently by making them cols
    of a matrix
  •      ns       avs      sds
      11.0000  11.2091   1.8987
      14.0000  15.0857   2.1504
      10.0000  17.3300   2.1680

12
ANOVA
  • Alternate plan is to compute Ns, Avs, SDs as a
    matrix in the first place
  • Write a function SUMSTATS(X, grp) which returns
    such a matrix
  • The number of rows of the answer will be the
    number of groups
  • Assume grp contains 1, ..., #groups
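
One possible shape for such a function, sketched in Python rather than the slides' MATLAB-style notation. The name `sumstats` follows the slide; the row layout [n, mean, sd] and the toy data are assumptions made only for illustration.

```python
import statistics

def sumstats(x, grp):
    """Return one row [n, mean, sd] per group; grp holds labels 1..G."""
    rows = []
    for g in range(1, max(grp) + 1):
        vals = [xi for xi, gi in zip(x, grp) if gi == g]   # select this group's values
        rows.append([len(vals), statistics.mean(vals), statistics.stdev(vals)])
    return rows

# Made-up toy data, only to show the call shape (not the book's data).
x   = [1.0, 2.0, 3.0, 10.0, 12.0]
grp = [1,   1,   1,   2,    2]
rows = sumstats(x, grp)
for row in rows:
    print(row)
```

`statistics.stdev` uses the n-1 divisor, which is the sample SD that the later SSE calculation expects.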

13
ANOVA
  • Notice that SDs are quite similar in our example
  • Good if we are going to assume the underlying SDs
    are all equal
  • Easy to get SSE from here
  • >> sse = sum((ns-1).*sds.^2)
  • 138.4672
  • (Close to p. 497 in book)
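
The same SSE calculation as a Python sketch, using the rounded group sizes and SDs from the summary table in the slides. Because the SDs are rounded to four decimals, the result agrees with the slide's 138.4672 only to about three decimals.

```python
ns  = [11, 14, 10]
sds = [1.8987, 2.1504, 2.1680]

# SSE = sum over groups of (n - 1) * s^2
sse = sum((n - 1) * s**2 for n, s in zip(ns, sds))
print(round(sse, 2))   # 138.47
```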

14
ANOVA
  • Exercises
  • 1. Fig 11.5, p. 489, has data on roadway
    aggregate. Do sumstats.
  • 2. Fig 11.6, p. 491, has data on carpet fibers.
    Do sumstats.

15
ANOVA
  • To get SSTr, we need overall average
  • Generally NOT the avg of the avgs
  • Have to weight by the Ns
  • >> grand = sum(ns.*avs)/sum(ns)
  • 14.5086
  • >> sstr = sum(ns.*(avs-grand).^2)
  • 204.0202

16
ANOVA
  • Don't actually need SST, but it would be SSTr + SSE =
    342.4874
  • >> (length(d114)-1)*std(dt(:,2)).^2
  • 342.4874
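
A Python sketch of the weighted grand average and SSTr, again from the rounded summary values in the slides. It also prints the unweighted average of the averages, to show why the weighting by the Ns matters.

```python
ns  = [11, 14, 10]
avs = [11.2091, 15.0857, 17.3300]
sse = 138.4672                     # value reported on the SSE slide

grand = sum(n * a for n, a in zip(ns, avs)) / sum(ns)    # weighted by group size
naive = sum(avs) / len(avs)                              # unweighted, for contrast
sstr  = sum(n * (a - grand) ** 2 for n, a in zip(ns, avs))
sst   = sstr + sse                                       # sum of squares identity

print(round(grand, 4), round(naive, 4))
print(sstr, sst)
```

With rounded inputs, `sstr` and `sst` land within about 0.001 of the slide values 204.0202 and 342.4874.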

17
ANOVA
  • H0: all means equal
  • Ha: not all means equal
  • (Not the same as all means different)
  • It can be shown that when H0 is true, then each
    of the SSs is a (multiple of) χ²
18
ANOVA
  • Degrees of Freedom
  • Usual est of SD has df = N-1
  • So, if H0 true, Total df = N-1
  • Each term of SSE has df = ni-1
  • Fact: Sum of (ind) χ² has df given by sum of
    df
  • So df for SSE = Σ (ni-1)
  • Df for SSTr is difference
  • = #groups-1 = G-1

19
ANOVA
20
ANOVA
  • Recall that the mean of a χ² = df
  • For comparison, we should divide SS/df
  • Called Mean Square
  • MSTr, MSE (don't usually do MST)

21
ANOVA
22
ANOVA
23
ANOVA
  • If H0 is true, then MSTr ≈ MSE
  • Consider their ratio (to get rid of the common
    multiple)
  • F = MSTr / MSE
  • Near 1 -> H0
  • F large -> avgs are more spread out than expected
  • -> Ha

24
ANOVA
  • F distn
  • Depends on df for numerator and df for
    denominator
  • FPROB(df1, df2, lo, hi)
  • Can calculate p-value

25
ANOVA
  • For our problem, F = 102/4.33 = 23.57
  • >> fprob(2,32,23.57,999)
  • 5.1058e-007
  • So if H0 were true, it would be nearly impossible
    to get F this large.
  • Can be quite sure that H0 is not true.
  • Means are not all exactly the same.
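
A Python sketch of the F calculation for the flow-rate example. The slides' `fprob` is a course helper that is not reproduced here. Instead, the p-value uses a closed form for the upper tail of the F distribution that holds only when the numerator df is 2, which happens to be the case for three groups.

```python
ns = [11, 14, 10]
sse, sstr = 138.4672, 204.0202      # values reported on the earlier slides

df_tr  = len(ns) - 1                # #groups - 1 = 2
df_err = sum(ns) - len(ns)          # sum of (n_i - 1) = 32
f = (sstr / df_tr) / (sse / df_err) # MSTr / MSE

# Upper-tail probability of F(2, df_err); valid ONLY for numerator df = 2.
pv = (1 + 2 * f / df_err) ** (-df_err / 2)

print(df_tr, df_err, round(f, 2))   # 2 32 23.57
print(pv)                           # roughly 5.1e-07
```

For any other numerator df, a general F-distribution routine is needed in place of the closed form.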

26
ANOVA
  • Write a function that takes SUMSTATS and returns
    ANOVA
  • Since ANOVA table is not a full matrix, suggest
    we return a matrix for SS, df, MS and then return
    F and PV as scalars
  • [F,pv,SS] = ANOVA(sumstats)
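
A minimal Python sketch of such a function, assuming the summary matrix has one row [n, mean, sd] per group. The name `anova_from_sumstats` is hypothetical, and the p-value is deliberately left out because it needs an F-distribution routine like the slides' `fprob`. The example call uses the two-group numbers that appear later in the deck.

```python
def anova_from_sumstats(sm):
    """sm: rows of [n, mean, sd]. Returns (F, table), where table rows are
    [SS, df, MS] for Treatment, Error and Total. No p-value is computed."""
    ns  = [row[0] for row in sm]
    avs = [row[1] for row in sm]
    sds = [row[2] for row in sm]
    grand = sum(n * a for n, a in zip(ns, avs)) / sum(ns)     # weighted grand mean
    sstr = sum(n * (a - grand) ** 2 for n, a in zip(ns, avs))
    sse  = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))
    df_tr, df_err = len(ns) - 1, sum(ns) - len(ns)
    table = [
        [sstr, df_tr, sstr / df_tr],
        [sse, df_err, sse / df_err],
        [sstr + sse, df_tr + df_err, (sstr + sse) / (df_tr + df_err)],
    ]
    return table[0][2] / table[1][2], table

f, table = anova_from_sumstats([[8, 12.3, 5.3], [12, 16.4, 6.6]])
print(round(f, 4))   # 2.1492
for row in table:
    print([round(v, 4) for v in row])
```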

27
ANOVA
  • Exercises
  • 1. Fig 11.5, p. 489, has data on roadway
    aggregate. Do ANOVA.
  • 2. Fig 11.6, p. 491, has data on carpet fibers.
    Do ANOVA.

28
ANOVA
  • If we only have 2 groups, should we do ANOVA or 2
    sample t test?
  • Consider the following
  • N1=8, Av1=12.3, SD1=5.3
  • N2=12, Av2=16.4, SD2=6.6

29
ANOVA
  • SP = 6.1273
  • For the two sided Ha
  • >> pv = tprob(0,sp*sqrt(1/N1+1/N2),N1-1+N2-1,Av2-Av1,
    999)*2
  • 0.1599

30
ANOVA
  • For ANOVA
  • sm =
  • 8.0000 12.3000 5.3000
  • 12.0000 16.4000 6.6000
  • >> [f,pv,ssa] = anova1(sm)
  • f =
  • 2.1492
  • pv =
  • 0.1599
  • So the (two-sided) p-value is exactly the same by
    either method
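
The agreement is not a coincidence: with two groups, the ANOVA F statistic is the square of the pooled two-sample t statistic. A small Python check with the numbers from this example:

```python
import math

n1, av1, sd1 = 8, 12.3, 5.3
n2, av2, sd2 = 12, 16.4, 6.6

# Pooled SD, then the usual two-sample t statistic
sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
t = (av2 - av1) / (sp * math.sqrt(1 / n1 + 1 / n2))

print(round(t, 4), round(t**2, 4))   # t squared equals the F of 2.1492
```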

31
ANOVA
  • But there's more!!
  • ssa =
  • 80.6880 1.0000 80.6880
  • 675.7900 18.0000 37.5439
  • 756.4780 19.0000 39.8146
  • and SP² = 37.5439
  • So √MSE = RMSE is the analog of SP

32
ANOVA
  • If we reject H0, then some means are different.
    Which ones?
  • Problem of multiple comparisons
  • If H0 were true (all means equal) and we did 20
    tests with alpha = 0.05, then we should expect that
    1 of the comparisons would appear to be different
  • If we are going to do a number of 2 sample t
    tests, we will have to use a lower α than usual

33
ANOVA
  • The book uses the Studentized range method
  • There are a number of different methods for
    addressing this problem
  • We will use the Bonferroni method

34
ANOVA
  • The probability of a union of sets could be as
    high as the sum of the probabilities (if the sets
    are mutually disjoint)
  • In our case, the sets are "making a Type I
    error" for each test we are doing
  • If we are doing N tests and each test has a
    Prob(Type I) = 0.05/N, then the overall Prob(Type
    I) ≤ 0.05

35
ANOVA
  • Therefore, if we want to end up at an overall
    level of 5%, say, then we need to conduct each
    test at a level of 0.05/(# tests)
  • For G groups, there are G(G-1)/2 pairs
  • For flowrate example, G = 3 and there are 3*2/2 = 3
    pairs to compare
  • Do 2 sample t tests with α/3 = 0.05/3
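
The pair count and the per-test level can be sketched in Python as follows. The variable names are illustrative only.

```python
from itertools import combinations

groups = [1, 2, 3]
pairs = list(combinations(groups, 2))   # every unordered pair of groups
alpha_overall = 0.05
alpha_each = alpha_overall / len(pairs) # Bonferroni: split the level across tests

print(pairs)        # [(1, 2), (1, 3), (2, 3)]
print(alpha_each)
```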

36
ANOVA
  • If we use p-values, then we should multiply the
    (apparent) p-value by the # of comparisons
  • If we do confidence intervals, we should get
    wider intervals
  • For 95% intervals, we usually solve for 0.025
  • Now we solve for 0.025 / (# comparisons)

37
ANOVA
  • Recall that in the 2 sample t test, we use
    SD√(1/n1+1/n2) for the diff of avgs
  • What do we use for the pooled SD?
  • This was the basis for SSE (or MSE)
  • For SD, use √MSE
  • For df, use df for MSE

38
ANOVA
  • The p-value for comparing 1-2 is 0.0082
  • For 1-3, PV = 0.0003
  • For 2-3, PV = 0.0872
  • Using Bonferroni, we cannot conclude that 2 and 3
    are different, but 1 IS diff from both 2 and 3.
  • Note that on p. 504, he gets a slightly diff
    result using conf intervals and Studentized range
  • The Studentized range is a bit more efficient
    than Bonferroni, but note that his last CI comes
    quite close to 0
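
The Bonferroni decision can be sketched in Python by taking the three p-values quoted on this slide as given (they are not recomputed here) and multiplying each by the number of comparisons, capped at 1.

```python
# Apparent p-values as quoted on the slide
raw = {"1-2": 0.0082, "1-3": 0.0003, "2-3": 0.0872}

m = len(raw)                                              # number of comparisons
adjusted = {k: min(1.0, p * m) for k, p in raw.items()}   # Bonferroni adjustment
significant = {k: adj < 0.05 for k, adj in adjusted.items()}

print(adjusted)
print(significant)   # {'1-2': True, '1-3': True, '2-3': False}
```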

39
Rank procedures
  • The underlying assumption of ANOVA is that the
    data has a normal distn
  • When we are not comfortable with that assumption
    there are other methods available
  • Nonparametric methods

40
Rank procedures
  • A large class of NP methods are based on ranks
  • Replace the data with ranks and then do
    calculations
  • We should feel comfortable in knowing which
    values are larger than which other values, even
    if we don't know the probability of such a
    difference

41
Rank procedures
  • Function y = ranks(x)
  • Midranks to handle ties in the x's
  • For comparing multiple groups, we use
    Kruskal-Wallis
  • Based on the sum of ranks within each group
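
A minimal Python sketch of a midrank function. The slides' own `ranks` is not shown, so this is an assumption about its behavior: tied values share the average of the ranks they would occupy.

```python
def ranks(x):
    """Midranks: tied values share the average of the ranks they occupy."""
    order = sorted(range(len(x)), key=lambda i: x[i])   # indices in sorted order
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        # extend j to the end of the run of values tied with position i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        mid = (i + 1 + j + 1) / 2        # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = mid
        i = j + 1
    return r

print(ranks([10, 20, 20, 30]))   # [1.0, 2.5, 2.5, 4.0]
```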

42
Rank procedures
  • Text has a formula on p. 713, but not very
    meaningful
  • Better formula is based on Σ(obs-exp)²/exp
  • If we have N total obs and n1 obs in group 1,
    then the avg rank of all the data is (N+1)/2
  • We would expect the rank sum for group 1 to be
    n1(N+1)/2, etc
  • Can then compute the diff between obs and exp
  • TRICK: There is a fudge factor. Have to multiply
    by 6/N to get χ²
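
The fudge factor can be checked against the usual textbook form of the Kruskal-Wallis statistic (without a tie correction). Here R_j is the rank sum for group j, E_j = n_j(N+1)/2 its expected value, and R̄_j = R_j/n_j the average rank. These symbols are introduced for illustration and do not appear in the slides.

```latex
\begin{aligned}
\frac{6}{N} \sum_j \frac{(R_j - E_j)^2}{E_j}
 &= \frac{6}{N} \sum_j \frac{n_j^2 \left(\bar{R}_j - \tfrac{N+1}{2}\right)^2}{n_j (N+1)/2} \\
 &= \frac{12}{N(N+1)} \sum_j n_j \left(\bar{R}_j - \tfrac{N+1}{2}\right)^2
\end{aligned}
```

The last line is the standard Kruskal-Wallis H, so the (obs-exp)²/exp form with the 6/N factor gives the same number.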

43
Rank procedures
  • For artery flow
  • r = ranks(dt(:,2))
  • rs(1) = sum(r(dt(:,1)==1)) etc
  • ns(1) = sum(dt(:,1)==1) etc
  • expt = ns*(n+1)/2
  • >> c2 = 6/n*sum((rs-expt).^2./expt)
  • 20.4380
  • >> pv = chiprob(2,c2,99)
  • 3.6471e-005
  • NOTE: df = #groups-1 = df for Treatment
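
The p-value above can be checked in Python without a chi-square routine, because the upper tail of a χ² with 2 df has the closed form exp(-x/2). This shortcut applies only to df = 2; the slides' `chiprob` handles the general case.

```python
import math

c2 = 20.4380                 # statistic reported on the slide
pv = math.exp(-c2 / 2)       # upper tail of chi-square, valid ONLY for df = 2
print(pv)                    # about 3.647e-05
```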

44
Rank procedures
  • Can use pooling as we did for Chi2 to determine
    which differences are significant
  • But the pooling can be done on the rank sums (and
    expecteds)
  • As before, df do not change

45
Rank procedures
  • Carpet fiber example
  • Assume we've run sumstats
  • >> ns = smst(:,1)'
  • ns =
  • 16 16 13 16 14 15
  • >> r = ranksums(x(:,2),x(:,1))
  • r =
  • 495 1177.5 353
    854.5 288 927
  • >> n = sum(ns)
  • n =
  • 90

46
Rank procedures
  • >> expt = ns*(n+1)/2
  • expt =
  • 728 728 591.5
    728 637 682.5
  • >> c2 = 6/n*sum((r-expt).^2 ./ max(1,expt))
  • c2 =
  • 49.937
  • Chi2 w/ df = 5, so extremely large

47
Rank procedures
  • >> (r-expt)./expt
  • ans =
  • -0.32005 0.61745 -0.40321
    0.17376 -0.54788 0.35824
  • Note that 1 and 3 look similar
  • >> r = cpool(r,1,3), expt = cpool(expt,1,3),
  • r =
  • 848 1177.5 0
    854.5 288 927
  • expt =
  • 1319.5 728 0
    728 637 682.5
  • >> c2 = 6/n*sum((r-expt).^2 ./ max(1,expt))
  • c2 =
  • 49.787
  • Still large (essentially unchanged), so diff is not
    due to a diff between 1 & 3
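
A Python sketch of the pooling step. The slides do not show `cpool`, so the version here is a guess at its behavior, inferred from the printed output: entry j is added into entry i and then zeroed.

```python
def cpool(v, i, j):
    """Hypothetical reading of the slides' cpool: add entry j into entry i
    (1-based positions) and set entry j to zero."""
    out = list(v)
    out[i - 1] += out[j - 1]
    out[j - 1] = 0
    return out

n = 90
r    = cpool([495, 1177.5, 353, 854.5, 288, 927], 1, 3)
expt = cpool([728, 728, 591.5, 728, 637, 682.5], 1, 3)

# max(1, e) keeps the emptied group from causing a division by zero
c2 = 6 / n * sum((o - e) ** 2 / max(1, e) for o, e in zip(r, expt))
print(r, expt)
print(c2)   # close to the slide's 49.787
```

The statistic barely moves from 49.937 because groups 1 and 3 deviate from expectation in the same direction and by similar relative amounts.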