Title: Categorical Data
1Categorical Data
2Categorical Data Analysis
- To identify any association between two
categorical variables.
3Chi-Square Test
- Commonly denoted as χ²
- Useful in testing for independence between
categorical variables (e.g. genetic association
between cases and controls)
- Assumptions
- Sufficiently large data in each cell in the
cross-tabulation table.
4Small Cell Counts
- In general, require
(a) Smallest expected count is 1 or more
(b) At least 80% of the cells have an expected
count of 5 or more
- Yates Continuity Correction: provides a better
approximation of the test statistic when the data
is dichotomous (2 × 2)
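The two rules of thumb above can be checked mechanically once the expected counts are known. The following is a minimal sketch; the function name `expected_counts_ok` and the two example tables are hypothetical and are not taken from the slides.

```python
def expected_counts_ok(expected):
    """Check the small-cell rules of thumb on a table of expected counts.

    (a) the smallest expected count is 1 or more, and
    (b) at least 80% of the cells have an expected count of 5 or more.
    """
    cells = [e for row in expected for e in row]
    smallest_ok = min(cells) >= 1
    share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
    return smallest_ok and share_at_least_5 >= 0.8


# Hypothetical expected counts, for illustration only.
comfortable = [[12.0, 8.0], [18.0, 12.0]]
sparse = [[0.5, 9.5], [4.5, 85.5]]

print(expected_counts_ok(comfortable))
print(expected_counts_ok(sparse))
```

When the check fails, the chi-square approximation may be unreliable for that table.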
5Goodness-of-fit Test
- Null hypothesis of a hypothesized distribution
for the data.
- Expected frequencies calculated under the
hypothesized distribution.
- For example: The number of outbreaks of flu
epidemics is charted over the period 1500 to
1931, and the number of outbreaks each year is
tabulated. The variable of interest counts the
number of outbreaks occurring in each year of
that 432-year period. E.g. there were 223 years
with no flu outbreaks.
6Goodness-of-fit Test
- Hypotheses
H0: Data follows a Poisson distribution with
mean 0.692
H1: Data does not follow a Poisson distribution
with mean 0.692
- Note: Mean 0.692 is obtained from the sample
mean.
- Expected frequency for X = 0:
432 × P(X = 0), where X ~ Poisson(0.692)
- Test statistic: χ² = Σ (O − E)² / E, summed over
the categories (O observed, E expected
frequency), with df = (6 − 1).
- This yields a p-value of 0.99, so there is no
evidence against the null hypothesis: the data
are consistent with the hypothesized Poisson
distribution.
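The expected-frequency step can be sketched as below, using the 432 years and mean 0.692 from the slide. The grouping into six categories (0, 1, 2, 3, 4, and 5 or more outbreaks) is an assumption inferred from df = (6 − 1); the slides do not list the categories or the observed counts beyond the 223 years with no outbreaks, so the statistic itself is not computed here.

```python
import math

N_YEARS = 432   # years 1500 to 1931, from the slide
MEAN = 0.692    # sample mean, from the slide


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def expected_frequencies(n, mean, max_k):
    """Expected counts for X = 0, ..., max_k - 1 and a pooled tail X >= max_k."""
    probs = [poisson_pmf(k, mean) for k in range(max_k)]
    probs.append(1.0 - sum(probs))  # tail probability P(X >= max_k)
    return [n * p for p in probs]


def chi_square_statistic(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Assumed categories: 0, 1, 2, 3, 4 and "5 or more" outbreaks per year.
expected = expected_frequencies(N_YEARS, MEAN, max_k=5)
for label, e in zip(["0", "1", "2", "3", "4", "5+"], expected):
    print(f"{label:>2} outbreaks: expected {e:.1f} years")
```

The expected count for X = 0 comes out at roughly 216 years, which can be compared with the 223 years observed. Given a full list of observed counts, `chi_square_statistic(observed, expected)` would give the test statistic.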
7Test for Independence
- Most common usage for Pearson's Chi-square
statistic.
- Expected frequencies calculated by
(row total × column total) / grand total
- Degrees of freedom = (r − 1) × (c − 1)
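As an illustration of the expected-frequency and degrees-of-freedom rules, the sketch below applies them to the four counts quoted in the lung cancer case study later in these slides (1301, 56, 1205, 152). The arrangement of those counts into rows and columns is an assumption, and the helper names are hypothetical.

```python
def expected_under_independence(table):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def degrees_of_freedom(table):
    """(r - 1) * (c - 1) for an r-by-c table."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Counts from the case study; the row/column layout is assumed.
table = [[1301, 56],
         [1205, 152]]

expected = expected_under_independence(table)
df = degrees_of_freedom(table)
print(expected, df)
```

Because both rows have the same total here, each column's expected count is simply split evenly between the two rows.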
8Chi-Square Test
9Chi-Square Test
10Quantification of Effect
- χ²-test identifies whether there is significant
association between the two categorical
variables.
- But does not quantify the strength and direction
of the association.
- Need odds ratio to do this.
- Odds ratio compares the odds of being in one
category between the two groups, i.e. how many
times higher the odds are in one group than in
the other.
- Example: For the previous example on severe chest
pain, the odds of experiencing severe chest pains
are about 1.4 times higher for males than for
females.
11Odds Ratio
12Exegesis on Epidemiology
- Case-Control Study
- Compare affected and unaffected individuals
- Usually retrospective in nature
- Temporal sequence cannot be established (timing
for the onset of the disease)
- No information on population incidence of the
disease
- Cohort Study
- Usually random sampling of subjects within the
population
- Prospective, retrospective or both
- Long follow-up, with loss to follow-up
- Costly to conduct
- Temporal sequence can be established
- Provides information on population incidence of
the disease
13Confidence Intervals of Odds Ratio
- Not straightforward to obtain confidence
intervals of odds ratio (due to complexity in
obtaining the variance)
- Straightforward to obtain the variance of the
logarithm of odds ratio.
- Odds ratio is always reported together with the
p-value (obtained from Pearson's Chi-square
test), and the corresponding confidence
interval.
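For reference, the usual large-sample approximation works on the log scale and then transforms back. The cell labels a, b, c, d for the 2 × 2 table are notation introduced here for illustration; the slides do not define them.

```latex
\[
\mathrm{OR} = \frac{a/b}{c/d} = \frac{ad}{bc},
\qquad
\operatorname{Var}\bigl[\log(\mathrm{OR})\bigr] \approx
\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d},
\]
\[
95\%\ \text{CI} =
\exp\!\Bigl(\log(\mathrm{OR}) \pm 1.96
\sqrt{\operatorname{Var}\bigl[\log(\mathrm{OR})\bigr]}\Bigr).
\]
```

The interval is symmetric around log(OR), so it is not symmetric around OR itself.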
14Case Study on Lung Cancer and Smoking
Odds and Odds Ratio
- Odds Ratio (OR) = (1301/56)/(1205/152) = 2.93
- Pearson's Chi-square = 47.985, on df = 1,
giving p-value ≈ 0
- Var[log(OR)] = 0.026
- 95% Confidence interval = (2.14, 4.02)
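The figures on this slide can be reproduced with a short calculation. This is a sketch under the standard large-sample assumption that Var[log(OR)] is the sum of the reciprocals of the four cell counts; the function name `odds_ratio_ci` is hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a/b)/(c/d), Var[log(OR)], and a CI built on the log scale."""
    odds_ratio = (a / b) / (c / d)
    var_log_or = 1 / a + 1 / b + 1 / c + 1 / d
    half_width = z * math.sqrt(var_log_or)
    log_or = math.log(odds_ratio)
    ci = (math.exp(log_or - half_width), math.exp(log_or + half_width))
    return odds_ratio, var_log_or, ci


# Counts as they appear in the slide's odds ratio: (1301/56)/(1205/152).
odds_ratio, var_log_or, (ci_low, ci_high) = odds_ratio_ci(1301, 56, 1205, 152)
print(f"OR = {odds_ratio:.2f}")
print(f"Var[log(OR)] = {var_log_or:.3f}")
print(f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The default z = 1.96 corresponds to a 95% interval; a different multiplier would be needed for other confidence levels.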
15More Examples
16χ² TEST FOR TREND
OR(smoker) = 1.52 (0.88, 2.63), p = 0.180
OR(ex-smoker) = 2.11 (1.00, 4.51), p = 0.081
with non-smoker as reference category.
17Procedure for Categorical Data Analysis
- Summarise data using cross-tabulation tables,
with percentages
- Perform a chi-square test of independence to test
for association between the two categorical
variables
- Quantify any significant association using odds
ratios
- Always report odds ratios with corresponding 95%
confidence interval
18Case Study on Lung Cancer and Smoking
- Chi-square test statistic = 46.991 (with Yates
continuity correction)
- p-value = 7.13 × 10⁻¹²
- Odds ratio = 2.93
- 95% CI = (2.14, 4.02)
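The slides quote two chi-square values for the same table, 47.985 and 46.991. The sketch below computes the 2 × 2 Pearson statistic with and without the Yates continuity correction, which suggests that the first is the uncorrected statistic and the second the corrected one. The function names are hypothetical, and the p-value shortcut via `math.erfc` is valid only for df = 1.

```python
import math


def chi_square_2x2(a, b, c, d, yates=False):
    """Pearson chi-square for a 2x2 table, optionally with Yates correction."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Shrink |ad - bc| by n/2, never below zero.
        diff = max(0.0, diff - n / 2)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * diff ** 2 / denom


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


chi2_plain = chi_square_2x2(1301, 56, 1205, 152)
chi2_yates = chi_square_2x2(1301, 56, 1205, 152, yates=True)
p_yates = p_value_df1(chi2_yates)

print(f"uncorrected: {chi2_plain:.3f}")
print(f"Yates:       {chi2_yates:.3f}, p = {p_yates:.2e}")
```

Either version leads to the same conclusion here, since both p-values are far below any conventional significance level.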