Goodness of Fit Tests - PowerPoint PPT Presentation

About This Presentation
Title:

Goodness of Fit Tests

Description:

Title: Relationship between variables Author: Harpal Singh Pooni Last modified by: lcq Created Date: 11/6/2002 9:28:53 AM Document presentation format – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Goodness of Fit Tests


1
Chapter 11
  • Goodness of Fit Tests

2
Categorical
  • Observations fall into one of a number of
    mutually exclusive categories
  • Binomial distribution (two categories, A and B)
  • Multinomial distribution (three or more categories, e.g. A, B, C)
  • Chi-square distribution

3
Goodness of Fit Tests
  • To determine whether a sample could have been
    drawn from a population with a specified
    distribution
  • Based on comparison of observed frequencies and
    expected frequencies under the specified
    condition.

4
Concepts
  • The Binomial Test
  • The Chi-Square Test for Goodness of Fit
  • Kolmogorov-Smirnov Test
  • The Chi-Square Test for r x k Contingency Tables

5
Binomial Test
  • For data that can be grouped into exactly two
    categories
  • e.g. male versus female
  • diseased versus healthy
  • To determine whether the sample proportions of
    the two categories are what would be expected
    with a given binomial distribution

6
Binomial Test
  • Assumptions
  • Independent random sample of size n
  • Two mutually exclusive categories
  • Actual proportion p, with corresponding hypothesized
    value p0
  • Hypotheses
  • H0: p = p0 versus Ha: p ≠ p0 (two-sided), or the
    one-sided alternatives p > p0 or p < p0

7
Binomial Test
  • Test statistic: X, the number of observations falling
    into the first category (successes); under H0, X is a
    binomial random variable, X ~ B(n, p0)
  • x: the observed value of X

8
Binomial Test
  • Cumulative binomial distribution
  • Table C1 (p374)
  • Note that

9
Binomial Test
  • 15 of 20 trees have a 1987 growth ring that is
    less than half the size of their other growth rings.
    Do these data support the claim that the severe
    drought of 1987 in the U.S. affected the growth rate
    of the majority of the established trees?
  • Hypotheses: H0: p ≤ 0.5 versus Ha: p > 0.5
  • Test statistic: X = 15, given α = 0.05, p0 = 0.5
  • Inference: the majority of trees have growth
    rings for 1987 less than half their usual size.
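
The tree-ring inference can be checked by summing the binomial probabilities directly instead of reading Table C1. This is a minimal sketch, assuming a one-sided alternative (p > 0.5, "the majority"); the function name `binomial_upper_tail` is illustrative and not from the slides.

```python
from math import comb

def binomial_upper_tail(x: int, n: int, p0: float) -> float:
    """P(X >= x) for X ~ B(n, p0), summed exactly from the binomial pmf."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(x, n + 1))

# Tree-ring example: 15 of 20 trees, H0: p <= 0.5 versus Ha: p > 0.5
p_value = binomial_upper_tail(15, 20, 0.5)
reject_h0 = p_value < 0.05          # alpha = 0.05 as on the slide

print(f"P(X >= 15) = {p_value:.4f}, reject H0: {reject_h0}")
```

The tail probability is about 0.021, below α = 0.05, which agrees with the slide's inference.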

10
One sample p test
  • In southeastern Queensland, 70% of pardalotes are race A
  • Sample of 18 pardalotes: race A 10 vs race B 8
  • Hypotheses: H0: p = 0.7 versus Ha: p ≠ 0.7
  • Test statistic: X = 10, given α = 0.05, p0 = 0.7
  • Inference: no change in the population
    proportions of the two races.

11
Normal approximation to Binomial distribution
A r.v. X ~ B(n, p) has mean μ = np and variance
σ² = np(1 − p). If np(1 − p) > 3, then X is approximately N(μ, σ²).
But it should be noted that
12
One sample p test
  • 70% of pardalotes are race A
  • Sample of 180 pardalotes: race A 100 vs race B 80
  • Hypotheses: H0: p = 0.7 versus Ha: p ≠ 0.7
  • Test statistic: X = 100, given α = 0.05, p0 = 0.7
  • Inference: a significant change in the population
    proportions of the two races.
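
With n = 180 the normal approximation applies, since np(1 − p) = 37.8 > 3. The sketch below assumes a two-sided test without a continuity correction; the slides do not show which variant was used, and the helper names are illustrative.

```python
from math import erf, sqrt

def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def binomial_z_test(x: int, n: int, p0: float):
    """Normal approximation to the binomial test (no continuity correction)."""
    mean = n * p0
    sd = sqrt(n * p0 * (1 - p0))
    z = (x - mean) / sd
    p_two_sided = 2 * (1 - normal_cdf(abs(z)))
    return z, p_two_sided

# Pardalote example: 100 race A birds out of 180, hypothesized p0 = 0.7
z, p_value = binomial_z_test(100, 180, 0.7)
print(f"z = {z:.3f}, two-sided p = {p_value:.2e}")
```

The observed count lies more than four standard deviations below the expected 126, so the change in proportions is significant at α = 0.05.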

13
Concepts
  • The Binomial Test
  • The Chi-Square Test for Goodness of Fit
  • Kolmogorov-Smirnov Test
  • The Chi-Square Test for r x k Contingency Tables

14
Chi-Square test
  • It is used when there are several categories.
  • It compares the observed frequencies of a
    discrete, ordinal, or categorical data set with
    those of some theoretically expected distribution
    (e.g. binomial, multinomial).
  • It tests whether an observed set of data agrees
    with the expected values based on some hypothesis,
    H0.
  • The test gives the probability of getting a test
    statistic at least as large as the observed one if
    H0 applies to our data.

15
Test for goodness of fit
  • Assumptions
  • Independent random sample of size n
  • A set of k mutually exclusive categories
  • A specified expected frequency for each category
  • Hypotheses
  • H0: the observed frequency distribution is the
    same as the hypothesized frequency distribution
  • Ha: the observed and hypothesized frequency
    distributions are different.

16
Test Statistic and Theory
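  • Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$, where
  • O_i: observed frequency
  • E_i: expected frequency
  • O_i − E_i: the difference between the observed and
    expected frequencies
  • Observed and expected frequencies nearly equal ⇒ χ² small
  • Right-tailed test; χ² has an approximate chi-square
    distribution when H0 is true, with k − 1 degrees of
    freedom when no parameters are estimated from the data
  • Table C.5 (p385)
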
17
Why chi-square?
  • Y ~ B(n, p): Y successes (probability p),
  • n − Y failures (probability q = 1 − p)
  • For n large enough, Y is approximately N(np, npq)

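
The step from the normal approximation to a chi-square statistic can be written out for the two-category case. The derivation below is a standard identity rather than something shown on the slide; the labels O_1 = Y, O_2 = n − Y, E_1 = np and E_2 = nq are notation assumed here.

```latex
\begin{aligned}
Z^2 &= \frac{(Y - np)^2}{npq}
     = \frac{(Y - np)^2 (p + q)}{npq}
     = \frac{(Y - np)^2}{np} + \frac{(Y - np)^2}{nq} \\
    &= \frac{(Y - np)^2}{np} + \frac{\bigl((n - Y) - nq\bigr)^2}{nq}
     = \sum_{i=1}^{2} \frac{(O_i - E_i)^2}{E_i}
\end{aligned}
```

Because Z is approximately standard normal for large n, Z² is approximately chi-square with one degree of freedom, which is k − 1 for k = 2 categories.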
18
Why chi-square?
  • Multinomial
  • Outcome      A1  A2  ...  Ak
  • Probability  p1  p2  ...  pk
  • Observed     Y1  Y2  ...  Yk

19
Example: an F2 population
  • Mirabilis jalapa, a self-pollinating plant
  • Consider an F2 population in which a single
    incompletely dominant gene is segregating.
  • The numbers of the 3 genotypes AA, Aa, aa are
    counted and we want to know if they are
    segregating according to Mendel's law.
  • i.e. we are testing the null hypothesis (H0) that
    we have a 1:2:1 ratio.

20
Analysis
Genotype            AA    Aa    aa     Total
Phenotype           red   pink  white
Expected freqs.     ¼     ½     ¼      1.0
Observed nos. (O)   55    132   53     240
Expected nos. (E)   60    120   60     240
(O − E)             −5    12    −7     0

21
The extrinsic model
  • Example: are the data on 240 progeny of
    self-pollinated four-o'clocks reasonably
    consistent with the Mendelian model?
  • Hypotheses
  • H0: the data are consistent with a Mendelian
    model.
  • Ha: the data are inconsistent with a Mendelian
    model.
  • Calculate expected frequencies
  • Test statistic

Category  Expected proportion  Observed (O)  Expected (E)  (O − E)²/E
red       ¼                    55            60            0.42
pink      ½                    132           120           1.2
white     ¼                    53            60            0.82
Total                          240           240           2.44

  • Conclusion: the data support the Mendelian
    genetic model.
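
The four-o'clock calculation can be reproduced in a few lines. This is a sketch under the extrinsic model (proportions 1:2:1 fixed in advance, so df = k − 1 = 2); the function name `chi_square_gof` is illustrative. For df = 2 the chi-square upper tail has the closed form exp(−x/2), so no table lookup is needed.

```python
from math import exp

def chi_square_gof(observed, proportions):
    """Chi-square goodness-of-fit statistic for fully specified proportions."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, expected

observed = [55, 132, 53]                      # red, pink, white
stat, expected = chi_square_gof(observed, [0.25, 0.5, 0.25])

# k = 3 categories, no estimated parameters -> df = 2.
# For df = 2 only, P(chi-square >= x) = exp(-x / 2).
p_value = exp(-stat / 2)
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
```

The unrounded statistic is about 2.43 (the slide's 2.44 comes from summing rounded terms), with a p-value of roughly 0.30, so H0 is not rejected.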

22
The intrinsic model
  • Example: does the number of landfalling
    hurricanes per year in the U.S. in 1900-1997 follow
    a Poisson distribution?

Hurricanes/year  0   1   2   3   4  5  6
Frequency        18  34  24  16  3  1  2
  • Hypotheses
  • H0: the annual number of U.S. landfalling
    hurricanes follows a Poisson distribution.
  • Ha: the annual number of U.S. landfalling
    hurricanes is inconsistent with a Poisson
    distribution.
  • Estimate parameters: the Poisson mean is estimated
    by the sample mean, 159/98 ≈ 1.62 hurricanes per year

23
Count  Probability  Expected  Observed
0      0.198        19.40     18
1      0.320        31.36     34
2      0.260        25.48     24
3      0.140        13.72     16
4      0.057        5.59      3
5      0.018        1.76      1
≥6     0.007        0.69      2

Counts 4, 5 and ≥6 are pooled so that no expected frequency is < 5:

Count  Observed  Expected  (O − E)²/E
0      18        19.40     0.101
1      34        31.36     0.222
2      24        25.48     0.086
3      16        13.72     0.379
≥4     6         8.04      0.518
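
The intrinsic-model calculation for the hurricane data can be sketched as follows. The pooling of counts 4 and above into one class follows the table; the degrees of freedom (classes − 1 − number of estimated parameters) are the usual rule and an assumption here, since the slide's formula did not survive.

```python
from math import exp, factorial

freq = {0: 18, 1: 34, 2: 24, 3: 16, 4: 3, 5: 1, 6: 2}   # hurricanes/year -> years
n_years = sum(freq.values())
lam = sum(k * f for k, f in freq.items()) / n_years      # estimated Poisson mean

def poisson_pmf(k: int, lam: float) -> float:
    return exp(-lam) * lam**k / factorial(k)

# Classes 0, 1, 2, 3 plus a pooled ">= 4" class, so every expected count is >= 5.
probs = [poisson_pmf(k, lam) for k in range(4)]
probs.append(1.0 - sum(probs))
observed = [freq[0], freq[1], freq[2], freq[3], freq[4] + freq[5] + freq[6]]
expected = [n_years * p for p in probs]

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 1    # one parameter (the mean) was estimated from the data
print(f"lambda = {lam:.3f}, chi-square = {stat:.2f}, df = {df}")
```

The statistic comes out near 1.3 on 3 degrees of freedom, far below the 5% critical value of 7.815, so the Poisson model is not rejected. Small differences from the table arise because the table rounds the probabilities to three decimals.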
24
Concepts
  • The Binomial Test
  • The Chi-Square Test for Goodness of Fit
  • Kolmogorov-Smirnov Test
  • The Chi-Square Test for r x k Contingency Tables

25
Kolmogorov-Smirnov Test
  • To determine whether a sample could come from a
    population with a particular specified
    distribution
  • Chi-square test for discrete or categorical data
  • Kolmogorov-Smirnov test for random samples from
    continuous (Normal) or discrete (Binomial)
    populations.

26
  • The arm lengths (radii) of 67 Edmonds sea stars
    at Polka Point

2.5 5.0 6.0 6.5 7.0 8.0 9.5
4.0 5.0 6.0 6.5 7.0 8.0 10.0
4.0 5.0 6.0 7.0 7.5 8.0 10.0
4.5 5.5 6.5 7.0 7.5 8.5 10.5
4.5 5.5 6.5 7.0 7.5 8.5 11.0
4.5 5.5 6.5 7.0 7.5 8.5 13.0
4.5 5.5 6.5 7.0 7.5 8.5 13.5
4.5 6.0 6.5 7.0 7.5 8.5
5.0 6.0 6.5 7.0 8.0 9.0
5.0 6.0 6.5 7.0 8.0 9.0
Sufficiently close to normal distribution
27
Kolmogorov-Smirnov Test
  • Assumptions
  • Random sample of size n with some unknown
    distribution function G(x)
  • A specified hypothesized distribution function F(x)
  • Hypotheses
  • H0: G(x) = F(x) for all x
  • Ha: G(x) ≠ F(x) for at least one value of x

28
Kolmogorov-Smirnov Test
  • Statistic K
  • The largest absolute value of the differences
    between the cumulative distribution of the sample
    and the expected distribution.
  • K close to 0: accept H0
  • Table C14

X: interval boundaries of the data range; S(x): observed CDF;
F(z_x): expected CDF

X      Cum. freq.  S(x)   z_x    F(z_x)  |S(x) − F(z_x)|
2.75   1           0.015  −2.12  0.017   0.002
3.75   1           0.015  −1.62  0.053   0.038
4.75   8           0.119  −1.12  0.131   0.012
5.75   17          0.254  −0.62  0.268   0.014
6.75   32          0.478  −0.12  0.452   0.026
7.75   48          0.716  0.39   0.652   0.064
8.75   58          0.866  0.89   0.813   0.053
9.75   61          0.910  1.39   0.918   0.008
10.75  64          0.955  1.89   0.971   0.016
11.75  65          0.970  2.39   0.992   0.022
∞      67          1.000  ∞      1.000   0.000
The sea star radii follow N(6.98, 3.988)
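
The K statistic in the table can be recomputed from the cumulative frequencies. This sketch follows the slide's binned approach (comparing the two CDFs only at the interval boundaries) and reads 3.988 as the variance, which matches the tabulated z values; a textbook KS test on the raw radii would compare at every data point instead.

```python
from math import erf, sqrt

MEAN, VARIANCE, N = 6.98, 3.988, 67    # values quoted on the slide

# (interval boundary x, cumulative frequency up to x), copied from the table
cum_freq = [(2.75, 1), (3.75, 1), (4.75, 8), (5.75, 17), (6.75, 32),
            (7.75, 48), (8.75, 58), (9.75, 61), (10.75, 64), (11.75, 65)]

def normal_cdf(x: float, mean: float, variance: float) -> float:
    """CDF of N(mean, variance) via the error function."""
    return 0.5 * (1.0 + erf((x - mean) / sqrt(2.0 * variance)))

# Pair each absolute difference with its x so max() also reports where it occurs.
diffs = [(abs(c / N - normal_cdf(x, MEAN, VARIANCE)), x) for x, c in cum_freq]
k_stat, x_at_max = max(diffs)
print(f"K = {k_stat:.3f} at x = {x_at_max}")
```

The largest gap occurs at x = 7.75, as in the table; the value differs slightly from the tabulated 0.064 because the table rounds z to two decimals.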
29
  • Do the following data support that the number of
    males in a litter is a binomial random variable
    with p = 0.5?
  • H0: the number of males in each litter is a
    binomial random variable with p = 0.5 and n = 6.

No. of males  0  1   2   3   4   5   6
Frequency     3  16  53  78  53  18  0

X  Cum. freq.  S(x)   F(x)   |S(x) − F(x)|
0  3           0.014  0.016  0.002
1  19          0.086  0.109  0.023
2  72          0.326  0.344  0.018
3  150         0.679  0.656  0.022
4  203         0.919  0.891  0.028
5  221         1.000  0.984  0.016
6  221         1.000  1.000  0.000
The numbers of males and females are described by a
binomial distribution with p = 0.5.
Advantage of the KS test over the chi-square test:
no need to calculate the density function for the
binomial.
30
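Count  Probability  Expected  Observed
0      0.016        3.45      3
1      0.094        20.72     16
2      0.234        51.80     53
3      0.313        69.06     78
4      0.234        51.80     53
5      0.094        20.72     18
6      0.016        3.45      0

Counts 0 and 6 have expected frequencies < 5 and are pooled with their neighbours:

Count  Observed  Expected  (O − E)²/E
≤1     19        24.17     1.107
2      53        51.80     0.028
3      78        69.06     1.157
4      53        51.80     0.028
≥5     18        24.17     1.576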
31
Concepts
  • The Binomial Test
  • The Chi-Square Test for Goodness of Fit
  • Kolmogorov-Smirnov Test
  • The Chi-Square Test for r x k Contingency Tables

32
r x k Contingency tables
  • r: the number of categories
  • k: the number of populations or treatments
  • Oij: the number of observations of category i in
    population j
  • To test whether the distribution of a categorical
    variable is the same in two or more populations
  • To test whether there is a relationship or
    dependency between the row and column variables

33
  • Is the shell species independent of whether the
    shell is occupied?

Species        Occupied  Empty  Total
Austrocochlea  47        42     89
Bembicium      10        41     51
Cirithid       125       49     174
Total          182       132    314
  • The expected number of observations for each
    category is based on the assumption that the row
    and column variables are independent:
    $E_{ij} = n_{\cdot\cdot} \cdot \frac{n_{i\cdot}}{n_{\cdot\cdot}} \cdot \frac{n_{\cdot j}}{n_{\cdot\cdot}} = \frac{n_{i\cdot}\, n_{\cdot j}}{n_{\cdot\cdot}}$
    where $n_{i\cdot}/n_{\cdot\cdot}$ is the fraction of row i in the entire
    sample and $n_{\cdot j}/n_{\cdot\cdot}$ is the fraction of column j in
    the entire sample

Species        Occupied  Empty  Total
Austrocochlea  51.59     37.41  89
Bembicium      29.56     21.44  51
Cirithid       100.85    73.15  174
Total          182       132    314
34
Test for contingency table
  • Assumptions (two different sampling methods)
  • Method 1: a random sample, categorized in two ways
  • Method 2: k independent random samples, classified by
    one categorical variable
  • Hypotheses
  • Method 1, H0: the row and column variables are
    independent
  • Ha: the row and column variables are not
    independent
  • Method 2, H0: the distribution of the row categories
    is the same in all k populations
  • Ha: the distribution of the row categories is not
    the same in all k populations

35
Test statistic and Theory
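  • Test statistic: $\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$,
    with (r − 1)(k − 1) degrees of freedom
  • For large samples: n ≥ 40, Eij ≥ 5
  • Observed and expected frequencies nearly equal ⇒ χ² small
  • For small samples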

36
Solution to the example
  • H0: The status (occupied or not) is independent
    of the shell species
  • Ha: The status is not independent of the shell
    species

Observed
Species        Occupied  Empty  Total
Austrocochlea  47        42     89
Bembicium      10        41     51
Cirithid       125       49     174
Total          182       132    314
Expected
Species        Occupied  Empty  Total
Austrocochlea  51.59     37.41  89
Bembicium      29.56     21.44  51
Cirithid       100.85    73.15  174
Total          182       132    314
Reject H0. There is an association between the
species of shell and whether hermit crabs occupy
it.
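
The expected counts and the test statistic for the shell data can be computed from the observed table alone. This is a minimal sketch using the usual (row total × column total) / n rule for expected counts; the variable names are illustrative.

```python
observed = {
    "Austrocochlea": (47, 42),    # (occupied, empty)
    "Bembicium": (10, 41),
    "Cirithid": (125, 49),
}

row_totals = {sp: sum(counts) for sp, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
n = sum(row_totals.values())

stat = 0.0
expected = {}
for sp, counts in observed.items():
    # Expected count under independence: row total * column total / n
    expected[sp] = tuple(row_totals[sp] * c / n for c in col_totals)
    stat += sum((o - e) ** 2 / e for o, e in zip(counts, expected[sp]))

df = (len(observed) - 1) * (len(col_totals) - 1)
print(f"chi-square = {stat:.2f}, df = {df}")
```

The statistic is about 45.5 on 2 degrees of freedom, far above the 5% critical value of 5.991, consistent with rejecting H0.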
37
2x2 Contingency Tables
  • A special case of contingency table
  • Employ a correction factor for discontinuity

                   Category B
                   1     2     Total
Category A  1      n11   n12   n1.
            2      n21   n22   n2.
            Total  n.1   n.2   n..
Correction for discontinuity
38
Two sample proportion
                group
                1     2     Total
success  1      n11   n12   n1.
         0      n21   n22   n2.
         Total  n.1   n.2   n..
39
(No Transcript)
40
Exact Fisher test
                   Category B
                   1      2      Total
Category A  1      a      b      a + b
            2      c      d      c + d
            Total  a + c  b + d  n
Expected value of one cell < 5
Example tables with the same marginal totals:
1  5      0  6      2  4
5  7      6  6      4  8
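
The probability of each 2x2 table with fixed margins follows the hypergeometric distribution, which is the basis of the exact test. The sketch below assumes that the table with cells 1, 5, 5, 7 is the observed one and that the table with cells 0, 6, 6, 6 is the only more extreme table in the same direction; the slide does not label them.

```python
from math import comb

def table_probability(a: int, b: int, c: int, d: int) -> float:
    """Hypergeometric probability of the table [[a, b], [c, d]] given its margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

p_observed = table_probability(1, 5, 5, 7)        # assumed observed table
p_more_extreme = table_probability(0, 6, 6, 6)    # same margins, smaller first cell
p_one_sided = p_observed + p_more_extreme
print(f"one-sided p = {p_one_sided:.4f}")
```

The one-sided p-value is the sum of the probabilities of the observed table and every more extreme table, here about 0.31.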
41
Enrichment analysis
  • Gene list
  • batch annotation
  • gene-GO term enrichment analysis
  • highlight the most relevant GO terms associated
    with a given gene list.

42
(No Transcript)
43
EASE Score, a modified Fisher Exact P-Value
  • In the human genome background (30,000 genes in total),
    40 genes are involved in the p53 signalling pathway.
  • Fisher Exact P-Value = 0.008
  • The EASE Score is a more conservative way to examine
    the situation. EASE Score = 0.06 (using 3 − 1 instead
    of 3).

                User genes  Genome
In pathway      3 − 1       40
Not in pathway  297         29960
Total           300         30000

Other tables with the same marginal totals as the unmodified table:
1    42         0    43         2    41
299  29958      300  29957      298  29959
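
Both p-values can be approximated with an exact hypergeometric sum. This sketch assumes the EASE score is simply the one-sided Fisher exact p-value of the table after removing one gene from the user's pathway count, as the slide describes; the function name is illustrative and the result is only expected to match the quoted values to rounding.

```python
from math import comb

def fisher_upper_tail(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact P(first cell >= a) for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, col1)
    upper = min(row1, col1)
    # Sum the hypergeometric numerators as exact integers, divide once at the end.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / total

fisher_p = fisher_upper_tail(3, 40, 297, 29960)
ease_score = fisher_upper_tail(3 - 1, 40, 297, 29960)   # one user gene removed
print(f"Fisher p = {fisher_p:.3f}, EASE score = {ease_score:.2f}")
```

Removing one gene from the list's pathway count makes the score larger than the plain Fisher p-value, which is what makes it the more conservative of the two.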
44
2x2 Contingency Tables
45
2x2 Contingency Tables
46
(No Transcript)
47
Extreme Hypothetical Example of Population
Stratification
  • Interested in the LD between allele A and
    disease.

Combined sample
     case  control
A    101   20
a    20    101
OR = 25.5
Sub-population 1
     case  control
A    100   10
a    10    1
OR = 1
Sub-population 2
     case  control
A    1     10
a    10    100
OR = 1
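
The three odds ratios can be verified directly. This is a small sketch assuming the first table is the cell-by-cell sum of the other two (which the counts bear out); `odds_ratio` is an illustrative helper.

```python
def odds_ratio(table):
    """Odds ratio ad / bc for a 2x2 table ((a, b), (c, d))."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

sub1 = ((100, 10), (10, 1))     # rows: allele A, a; columns: case, control
sub2 = ((1, 10), (10, 100))

# Pool the two sub-populations cell by cell.
pooled = tuple(tuple(x + y for x, y in zip(r1, r2)) for r1, r2 in zip(sub1, sub2))

print(odds_ratio(sub1), odds_ratio(sub2), round(odds_ratio(pooled), 1))
```

Within each sub-population there is no association (OR = 1), yet pooling them produces an odds ratio of about 25.5, which is the stratification artifact the slide illustrates.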
48
Population Stratification
  • Confounding bias that may occur if one's sample
    is composed of sub-populations with different
  • allele frequencies (?) and
  • disease rates (RpR)
  • Cases are more likely than controls to arise from
    the sub-population with the higher baseline
    disease rate.
  • Further, cases and controls will have different
    allele frequencies regardless of whether the
    locus is causal.

49
Example of Population Stratification
Cardon & Palmer, 2003
50
(No Transcript)
51
Partitioning the Chi-Square Test
  • Repeat Mendel's classic experiments with garden
    peas, Pisum sativum.

Cross: tall, green pods (AaBb) × tall, green pods (AaBb)
Offspring: 180 tall, green pods (A_B_); 30 tall, yellow
pods (A_bb); 60 short, green pods (aaB_); 10
short, yellow pods (aabb)
  • H0: The results are in a 9:3:3:1 phenotypic ratio.
  • Ha: The results deviate significantly from a
    9:3:3:1 ratio.

Category  Oi   Ei     (Oi − Ei)²/Ei
A_B_      180  157.5  3.214
A_bb      30   52.5   9.643
aaB_      60   52.5   1.071
aabb      10   17.5   3.214
Total     280  280    17.143
Reject H0
52
Test whether the gene locus for plant height produced
offspring in a 3 A_ : 1 aa ratio
Category                Oi   Ei   (Oi − Ei)²/Ei
Tall (A_B_ and A_bb)    210  210  0
Short (aaB_ and aabb)   70   70   0
Total                   280  280  0
Accept H0
Test whether the gene locus for seed pod color produced
offspring in a 3 B_ : 1 bb ratio
Category                Oi   Ei   (Oi − Ei)²/Ei
Green (A_B_ and aaB_)   240  210  4.286
Yellow (A_bb and aabb)  40   70   12.857
Total                   280  280  17.143
Reject H0
Test whether the two gene loci, A and B, are independent
of each other
       B_   bb  Total
A_     180  30  210
aa     60   10  70
Total  240  40  280
Accept H0
unnecessary
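
The overall test and its three single-degree-of-freedom components can be recomputed from the four offspring counts. This is a sketch with an illustrative helper `chi_square`; the independence test uses the plain 2x2 chi-square without a continuity correction, which is an assumption.

```python
def chi_square(observed, ratio):
    """Goodness-of-fit statistic against an expected ratio such as 9:3:3:1."""
    n, parts = sum(observed), sum(ratio)
    expected = [n * r / parts for r in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

ab, a_bb, aa_b, aabb = 180, 30, 60, 10      # A_B_, A_bb, aaB_, aabb

overall = chi_square([ab, a_bb, aa_b, aabb], [9, 3, 3, 1])     # df = 3
height = chi_square([ab + a_bb, aa_b + aabb], [3, 1])          # df = 1
pod_color = chi_square([ab + aa_b, a_bb + aabb], [3, 1])       # df = 1

# Independence of the two loci in the 2x2 table (df = 1).
n = ab + a_bb + aa_b + aabb
rows, cols = [ab + a_bb, aa_b + aabb], [ab + aa_b, a_bb + aabb]
cells = [[ab, a_bb], [aa_b, aabb]]
independence = sum(
    (cells[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
    for i in range(2) for j in range(2)
)
print(round(overall, 3), height, round(pod_color, 3), independence)
```

For these data the three components add up to the overall statistic, and the whole deviation comes from the pod color locus.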
53
Conclusion
  • From the overall chi-square test
  • The data deviate significantly from a 9:3:3:1
    ratio.
  • From the three single-degree-of-freedom chi-square
    tests
  • Plant heights in the offspring follow Mendel's law.
  • The seed pod color does not follow Mendel's law.
  • The two loci are behaving independently.
  • The discrepancy in the overall chi-square test is
    due to a distortion in the green to yellow seed
    pod ratio.
  • Reason: differential survival of the two
    phenotypes

54
Summary
  • The Binomial Test
  • The Chi-Square Test for Goodness of Fit
  • The Extrinsic Model
  • The Intrinsic Model
  • Kolmogorov-Smirnov Test
  • Normal
  • Binomial

55
Nonparametric statistics
  • Sign test
  • Rank test

56
Sign test
  • One sample
  • Two paired samples

Let n be the number of Xi ≠ M0; then S− (or S+) ~ B(n, 1/2)
57
McNemar's test
           control
           0    1
case  0    a    b
      1    c    d
  • Two paired samples (0,1)

Let n = b + c; then S− (or S+) ~ B(n, 1/2).
If b + c > 20, T ~ χ²(df = 1).
58
Kendall Correlation Coefficient τ
  • Directly compare the n observations with each other
  • Concordant (C) → positive
    correlation
  • Discordant (D) → negative
    correlation
  • Tie (E)

$\tau = \frac{C - D}{C + D + E}$
Numerator: difference between the number of concordant and
discordant pairs
Denominator: the total number of comparisons, C + D + E
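
The pairwise counting behind the coefficient can be sketched as below. The data are hypothetical (not from the slides), and the formula (C − D) / (C + D + E), with tied pairs counted only in the denominator, is the reading assumed from the two labels on this slide.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau as (C - D) / (C + D + E) over all pairs of observations."""
    concordant = discordant = ties = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # both variables move the same way
        elif direction < 0:
            discordant += 1      # the variables move in opposite ways
        else:
            ties += 1            # tied in x, in y, or in both
    return (concordant - discordant) / (concordant + discordant + ties)

# Hypothetical paired observations for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
tau = kendall_tau(x, y)
print(tau)
```

Of the 10 pairs here, 8 are concordant and 2 discordant, giving a coefficient of 0.6.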
59
Publication bias
60
Rank test
  • One sample or two paired samples
  • Wilcoxon signed-rank test
  • Two independent samples
  • Wilcoxon rank-sum test (Mann-Whitney U test)
  • Multiple samples (one-way ANOVA)
  • Kruskal-Wallis test
  • Multiple paired samples (two-way ANOVA)
  • Friedman k-sample test
  • Correlation and regression
  • Spearman correlation

61
Data
One Variable
Three Variables
Two Variables
62
Data
One Variable
Three Variables
Two Variables
Normal
Binomial
Poisson
Bernoulli
Ranked
Chi-Square Test: infer σ²
Infer p
t Test: infer µ
Sign Test
Infer µ
Wilcoxon Signed-Rank
Large sample: Normal approximation
63
Data
One Variable
Three Variables
Two Variables
X Variable
01
Categories
Ranked
Normal
01
2×2 Table Chi-Square
Two-Sample p Test
2×K Table Chi-Square
Categories
R×C Table Chi-Square
Y Variable
Ranked
Wilcoxon Rank Sum
Kruskal-Wallis Test
Spearman Correlation
Normal
Two-Sample t Test
One-Way ANOVA
Spearman Regression
Pearson Regression
64
2×2 Table Chi-Square, Two-Sample p Test
65
Two-sample t test
F test: infer σ²
Welch's approximate t test
Unpaired t test
66
Data
One variable
Three variables (block, paired)
Two variables
X variable
01
Categories
Ranked
Normal
01
McNemar Test
Categories
Y variable
Ranked
Wilcoxon Signed-Rank
Friedman k-Sample Test
Normal
Two-Sample Paired t Test
Two-Way ANOVA
67
Data
One variable
Three variables (quantity)
Two variables
X variable
01
Categories
Ranked
Normal
01
Categories
Y variable
Ranked
Partial Spearman
Normal
Partial Pearson, Partial Regression
Covariance Analysis
68
Test for differences among A factor treatments
H0
DONE
H0
Mean separation techniques
Ha
Test for differences among B factor treatments
Test for interaction between factors
Ha
Look for the best combinations of A factor and B
factor
69
The End