Title: Chapter 26: Comparing Counts
1. Chapter 26: Comparing Counts
2. Goodness-of-Fit Test
- Involves testing a hypothesis.
- There is no single parameter to estimate.
- Considers all categories to give an overall idea
of whether the observed distribution differs from
the hypothesized one.
- All creatures have their determined time for
giving birth and carrying fetus, only a man is
born all year long, not in determined time, one
in the seventh month, the other in the eighth,
and so on till the beginning of the eleventh
month. - Aristotle
3. Assumptions and Conditions
- Counted Data Condition
  - Check that the data are counts for the categories
    of a categorical variable.
- Independence Assumption
  - Check that the individuals counted in the cells
    are sampled independently from some population.
  - If independence cannot be verified directly, check
    the Randomization Condition: the individuals who
    have been counted should be a random sample from
    some population.
- Sample Size Assumption
  - Expected Cell Frequency Condition: the expected
    count in each cell should be at least 5.
4. A Chi-Square Test for Goodness-of-Fit
- Compare the observed counts in each cell with the
  expected counts.
- Look at the differences between the observed and
  expected counts.
- The test is always one-sided.
  - There is no direction to the rejection of the
    null model; we know only that it doesn't fit.
- "Chi-square" refers to a family of sampling
  distribution models, indexed by degrees of freedom.
- The number of degrees of freedom is n - 1, where n is
  the number of categories.
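The comparison of observed and expected counts can be written compactly. A minimal sketch in standard notation, assuming Obs_i and Exp_i denote the observed and expected counts in category i (symbols introduced here for illustration) and n is the number of categories:

```latex
\chi^2 = \sum_{i=1}^{n} \frac{(\mathrm{Obs}_i - \mathrm{Exp}_i)^2}{\mathrm{Exp}_i},
\qquad \mathrm{df} = n - 1
```

Because every term is squared, deviations in either direction make the statistic larger, so only large values count as evidence against the null model.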
5. What's Your Sign?
- Check Conditions
  - Counted data condition: there are counts of the
    number of executives in categories.
  - Randomization condition: this is a convenience
    sample, but there is no reason to expect bias.
  - Expected cell frequency condition: the null
    hypothesis expects that 1/12 of the 256 executives
    (about 21.3) should occur in each sign.
- Hypotheses
  - H0: Births are uniformly distributed over zodiac
    signs (p_Aries = p_Taurus = ...).
  - HA: Births are not uniformly distributed over
    zodiac signs.
- The sampling distribution of the test statistic
  is χ² with 12 - 1 = 11 degrees of freedom.
- Use a chi-square goodness-of-fit test.
6. What's Your Sign?
- The chi-square procedure
  - Find the expected values.
    - Values come from the null hypothesis.
    - Multiply the total number of observations by the
      hypothesized proportion.
  - Compute the residuals, Observed - Expected.
  - Square the residuals.
  - Compute the component for each cell,
    (Observed - Expected)² / Expected.
  - Find the sum of the components.
  - Find the degrees of freedom, the number of cells
    minus 1.
  - Test the hypothesis: find the P-value.
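The steps of the procedure can be sketched in Python. This is a minimal illustration rather than the slides' own method of computation: the function name `goodness_of_fit` and the four-category counts are hypothetical, chosen only so that the arithmetic is easy to follow by hand.

```python
def goodness_of_fit(observed, proportions):
    """Return (expected, components, chi_square, df) for a goodness-of-fit test."""
    total = sum(observed)
    # Expected counts come from the null hypothesis:
    # total observations times each hypothesized proportion.
    expected = [total * p for p in proportions]
    # Residuals: Observed - Expected.
    residuals = [o - e for o, e in zip(observed, expected)]
    # Component for each cell: squared residual divided by the expected count.
    components = [r ** 2 / e for r, e in zip(residuals, expected)]
    chi_square = sum(components)
    df = len(observed) - 1  # number of cells minus 1
    return expected, components, chi_square, df


# Hypothetical counts for four equally likely categories (not the zodiac data).
observed = [18, 22, 20, 20]
proportions = [0.25, 0.25, 0.25, 0.25]
expected, components, chi_square, df = goodness_of_fit(observed, proportions)
print(expected)    # expected count per cell
print(chi_square)  # sum of the components
print(df)          # degrees of freedom
```

For the zodiac example, the same function would take the 12 observed counts and twelve proportions of 1/12 each.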
7. What's Your Sign? TI-84 Calculator for Chi-Square Goodness-of-Fit Test
- Enter counts in L1 and expected percentages in
  L2.
- Convert expected percentages to expected counts.
- Calculate the chi-square components in L3.
8. What's Your Sign? TI-84 Calculator for Chi-Square Goodness-of-Fit Test
- Find the sum of L3.
- Find the P-value.
  - The probability of finding a χ² value at least as
    high as the one calculated from the data.
  - DISTR menu, χ²cdf
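The upper-tail probability that the calculator returns can also be computed directly. The following is a minimal sketch using only Python's standard library. `chi2_sf` is a hypothetical helper name (not a TI-84 or library function), and it assumes the degrees of freedom are a positive integer, for which the upper-tail area has a closed form.

```python
import math


def chi2_sf(x, df):
    """Upper-tail area P(X >= x) for a chi-square model with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^i / i! for i = 0 .. df/2 - 1.
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided normal tail at sqrt(x) plus a finite correction sum.
    tail = math.erfc(math.sqrt(half))
    density = math.exp(-half) / math.sqrt(2 * math.pi)
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + 2 * density * total


# Chi-square value and degrees of freedom from the zodiac example.
p_value = chi2_sf(5.08, 11)
print(round(p_value, 3))
```

With the chi-square value of 5.08 and 11 degrees of freedom from the zodiac example, the result should land close to the P-value of 0.926 reported there.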
9. What's Your Sign?
- P-value
  - The test is one-sided; consider only the right tail.
  - Large χ² values correspond to small P-values,
    leading to rejection of the null hypothesis.
  - The P-value is the area in the upper tail of the
    χ² model for 11 degrees of freedom above the
    computed χ² value.
- Conclusion
  - The P-value of 0.926 means that an observed
    chi-square value of 5.08 or higher would occur
    about 93% of the time if the null hypothesis were
    true.
  - There is virtually no evidence that the
    distribution of zodiac signs among executives is
    not uniform.
10. Comparing Observed Distributions
- Chi-square test for homogeneity
- Assumptions and Conditions
  - Counted data condition
    - Check that the data are counts for the categories
      of a categorical variable.
  - Independence Assumption (Randomization condition)
    - When we test for homogeneity, we often are not
      interested in some larger population, so we don't
      need to check the randomization condition.
  - Sample Size Assumption
    - Expected cell frequency condition: the expected
      count in each cell must be at least 5.
11. Post-Graduation Plans
- Who: High school graduates
- What: Post-graduation activities
- When: 1980, 1990, 2000
- Why: Regular survey for general information
12. Post-Graduation Plans
- Hypotheses
  - Have the choices made by high school graduates in
    what they do after graduation changed?
  - H0: The post-high school choices made by the
    classes of 1980, 1990, and 2000 have the same
    distribution (homogeneous).
  - HA: The post-high school choices made by the
    classes of 1980, 1990, and 2000 do not have the
    same distribution.
- Check the conditions
  - Counted data condition: there are counts of the
    number of students in categories.
  - Randomization condition: no inference will be
    drawn to other high schools or other classes, so
    there is no need to check for a random sample.
  - Expected cell frequency condition: the expected
    values are all at least 5 (see table, later).
- Under these conditions, the sampling distribution
  of the test statistic is χ² with (4 - 1) × (3 - 1) = 6
  degrees of freedom.
- Perform a chi-square test of homogeneity.
13. Post-Graduation Plans
- TI-84 Steps
  - Enter data in a matrix.
  - Do the chi-square test of homogeneity.
  - MATRIX, EDIT, [B] to view the expected counts.
  - Note that all expected counts are at least 5.
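The expected counts that the calculator reports for a two-way table follow the standard rule of row total times column total divided by the grand total. A minimal Python sketch of that computation, together with the chi-square statistic and degrees of freedom, is shown here. The function names and the 4 x 3 table of counts are hypothetical and invented for illustration; they are not the survey data.

```python
def expected_counts(table):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_two_way(table):
    """Return (chi_square, df, expected) for a two-way table of counts."""
    expected = expected_counts(table)
    chi_square = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)  # (rows - 1) x (columns - 1)
    return chi_square, df, expected


# Hypothetical 4 x 3 table (rows: activities, columns: class years).
observed = [
    [30, 40, 50],
    [20, 20, 20],
    [10, 10, 10],
    [40, 30, 20],
]
chi_square, df, expected = chi_square_two_way(observed)
print(df)  # 6 for a 4 x 3 table
print(all(e >= 5 for row in expected for e in row))  # expected cell frequency check
```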
14. Post-Graduation Plans
- Conclusion
  - The P-value is very small.
  - The observed pattern is very unlikely to occur by
    chance if the distributions were really the same.
  - Reject the null hypothesis.
  - The choices made by high school graduates have
    changed over the two decades examined.
15. Post-Graduation Plans
- Examine the Residuals
- Standardized Residuals
  - Divide each cell's residual by the square root of
    its expected value.
  - Values are the square roots of the components
    calculated for each cell, with a + or - sign to show
    whether we observed more or fewer cases than
    expected.
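The standardized residuals described above can be sketched in a few lines of Python. The function name and the 2 x 2 counts are hypothetical, invented for illustration; the expected counts shown are the ones implied by the row and column totals of the invented table.

```python
import math


def standardized_residuals(observed, expected):
    """(Observed - Expected) / sqrt(Expected) for each cell of a two-way table."""
    return [
        [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]


# Hypothetical observed counts and their expected counts under the null model.
observed = [[30, 50], [40, 20]]
expected = [[40.0, 40.0], [30.0, 30.0]]
residuals = standardized_residuals(observed, expected)
for row in residuals:
    print([round(r, 2) for r in row])

# Squaring the standardized residuals and summing recovers the chi-square statistic.
chi_square = sum(r ** 2 for row in residuals for r in row)
print(round(chi_square, 2))
```

A negative value marks a cell with fewer cases than expected, and a positive value marks a cell with more.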
16. Independence
- Chi-Square Test for Independence
  - Data categorize subjects from a single group on
    two categorical variables.
- Contingency Tables
  - Categorize counts on two or more variables.
  - Decide whether the distribution of counts on one
    variable is contingent on the other.
- Assumptions and Conditions
  - Counted data condition
    - Check that the data are counts for the categories
      of a categorical variable.
  - Independence Assumption (Randomization condition)
    - When we test for independence, we are interested
      in generalizing to some larger population, so the
      randomization condition should be checked.
  - Sample Size Assumption
    - Expected cell frequency condition: the expected
      count in each cell must be at least 5.
17. Hepatitis C Related to Tattoos?
- Who: Patients being treated for non-blood-related
  disorders
- What: Tattoo status and hepatitis C status
- When: 1991, 1992
- Where: Texas
18. Hepatitis C Related to Tattoos?
- Hypotheses
  - Are the categorical variables tattoo status and
    hepatitis C status statistically independent?
  - H0: Tattoo status and hepatitis C status are
    independent.
  - HA: Tattoo status and hepatitis C status are not
    independent.
- Check the conditions
  - Counted data condition: there are counts of
    individuals in categories of two categorical
    variables.
  - Randomization condition: although not an SRS, the
    data were selected to avoid biases and should be
    representative of the general population.
  - Expected cell frequency condition: the expected
    values do not all meet the condition of being at
    least 5. Continue with caution; be sure to check
    the residuals.
- Under these conditions, the sampling distribution
  of the test statistic is χ² with (3 - 1) × (2 - 1) = 2
  df.
- Perform a chi-square test for independence.
19. Hepatitis C Related to Tattoos?
- TI-84 Steps
  - Enter data in a matrix.
  - Do the chi-square test of independence.
  - MATRIX, EDIT, [B] to view the expected counts.
  - Note that not all expected counts are at least 5.
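Checking the expected cell frequency condition amounts to computing each expected count and flagging those below 5. A minimal Python sketch follows. The function name `small_expected_cells` and the 3 x 2 table are hypothetical; the counts are invented for illustration and are not the study data.

```python
def small_expected_cells(table, minimum=5):
    """List (row, column, expected) for cells whose expected count is below `minimum`."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    flagged = []
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            expected = r * c / grand_total
            if expected < minimum:
                flagged.append((i, j, expected))
    return flagged


# Hypothetical 3 x 2 table (rows: tattoo status, columns: hepatitis C yes/no).
observed = [
    [8, 32],
    [6, 54],
    [16, 484],
]
for i, j, expected in small_expected_cells(observed):
    print(f"row {i}, column {j}: expected count {expected:.2f}")
```

Cells flagged this way are the ones whose residuals deserve a closer look before the result is reported.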
20. Hepatitis C Related to Tattoos?
- Conclusion
  - The P-value is very small, indicating that if
    these variables were independent, the pattern
    seen would be very unlikely to occur by chance.
  - Hepatitis C status is not independent of
    tattoo status.
  - HOWEVER, check the two cells with the small
    expected counts to determine whether they
    influenced the result too greatly.
  - Remember: a complete solution must include
    additional analysis, recalculation, and a final
    conclusion.
21. Hepatitis C Related to Tattoos?
- Analysis of Residuals
  - Too small an expected frequency can arbitrarily
    inflate the residual, leading to an inflated
    chi-square statistic.
  - In this case, the standardized residual for the
    "Hepatitis C and Tattoo, Parlor" cell is large,
    which raises the question of whether the
    chi-square statistic is inflated.
22. Hepatitis C Related to Tattoos?
- Options
  - Based upon concerns, choose not to report the
    results.
  - Include a warning when reporting the results.
  - Combine the appropriate categories to get larger
    sample sizes and expected frequencies.
- Recalculation
- Conclusion
  - Tattoo status and hepatitis C status are not
    independent. The data suggest that tattoo parlors
    may be a particular problem, but we do not have
    enough data to draw that conclusion.
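The option of combining categories and recalculating can be sketched as follows. This is an illustration under stated assumptions, not the recalculation from the slides: the helper names are hypothetical, and the counts are invented so that merging the two small rows raises the smallest expected count to 5.

```python
def combine_rows(table, first, second):
    """Merge two rows of a table of counts by adding them cell by cell."""
    merged = [a + b for a, b in zip(table[first], table[second])]
    rest = [row for k, row in enumerate(table) if k not in (first, second)]
    return [merged] + rest


def chi_square_two_way(table):
    """Return (chi_square, df, smallest expected count) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi_square, smallest = 0.0, float("inf")
    for row, r in zip(table, row_totals):
        for count, c in zip(row, col_totals):
            expected = r * c / grand_total
            smallest = min(smallest, expected)
            chi_square += (count - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square, df, smallest


# Hypothetical counts (invented): two tattoo categories and a no-tattoo category.
observed = [
    [8, 32],
    [6, 54],
    [16, 484],
]
combined = combine_rows(observed, 0, 1)  # a single "any tattoo" row
print(chi_square_two_way(observed))   # statistic, df, smallest expected count
print(chi_square_two_way(combined))   # same, after combining the two rows
```

Combining rows costs a degree of freedom and hides any difference between the merged categories, which is why the final conclusion cannot single out one of them.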
23. What Can Go Wrong?
- A failure of independence between two categorical
  variables does not show a cause-and-effect
  relationship between them.
  - There is no way to differentiate the direction of
    any possible causation from one variable to
    another.
  - Lurking variables could be responsible for the
    observed lack of independence.
- Don't use chi-square methods unless the data are
  counts.
  - Data reported as proportions or percentages can
    be used if they are converted to counts.
  - Just because data are reported in a two-way table
    does not mean they are suitable for chi-square
    procedures.
- Beware large samples.
  - The degrees of freedom for the chi-square tests
    do not grow with sample size.
  - With a sufficiently large sample size, a
    chi-square test can almost always reject the null
    hypothesis, because even tiny departures from it
    become statistically significant.
  - There are no confidence intervals to help in
    determining the effect size.
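The warning about large samples can be illustrated numerically. In this minimal Python sketch, the same hypothetical 2 x 2 pattern (52% versus 48%, counts invented for illustration) is scaled up by factors of 10 and 100. The function name is an assumption made for this example.

```python
def chi_square_two_way(table):
    """Return (chi_square, df) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi_square = 0.0
    for row, r in zip(table, row_totals):
        for count, c in zip(row, col_totals):
            expected = r * c / grand_total
            chi_square += (count - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square, df


# Hypothetical 2 x 2 table with a small difference between the rows.
observed = [[52, 48], [48, 52]]
for scale in (1, 10, 100):
    scaled = [[count * scale for count in row] for row in observed]
    chi_square, df = chi_square_two_way(scaled)
    print(scale, round(chi_square, 2), df)
```

The proportions never change, yet the statistic grows in step with the sample size (0.32, 3.2, 32) while the degrees of freedom stay at 1, so the same small difference eventually looks highly significant.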