Chapter 10 Analyzing the Association Between Categorical Variables

Slides: 91

Provided by: katemcla

1
Chapter 10 Analyzing the Association Between
Categorical Variables

Learn ...
How to detect and describe associations between
categorical variables

2
Section 10.1

What Is Independence and What is Association?

3
Example Is There an Association Between
Happiness and Family Income?
4
Example Is There an Association Between
Happiness and Family Income?
5
Example Is There an Association Between
Happiness and Family Income?

The percentages in a particular row of a table
are called conditional percentages
They form the conditional distribution for
happiness, given a particular income level

6
Example Is There an Association Between
Happiness and Family Income?
7
Example Is There an Association Between
Happiness and Family Income?

Guidelines when constructing tables with
conditional distributions
Make the response variable the column variable
Compute conditional proportions for the response
variable within each row
Include the total sample sizes
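
The guidelines above can be sketched in a few lines of Python. This is a minimal illustration, not the table from the slides: the counts and the function name conditional_distributions are hypothetical. Rows hold the explanatory variable (income), columns hold the response (happiness), and each row is reported with its total sample size.

```python
# Hypothetical counts for illustration only; NOT the table shown on the slides.
# Rows: explanatory variable (income level). Columns: response (happiness).
counts = {
    "Above average": {"Not too happy": 20, "Pretty happy": 150, "Very happy": 130},
    "Average": {"Not too happy": 60, "Pretty happy": 360, "Very happy": 180},
    "Below average": {"Not too happy": 100, "Pretty happy": 240, "Very happy": 60},
}


def conditional_distributions(table):
    """For each row, return (row total, {category: proportion within that row})."""
    result = {}
    for row_label, row in table.items():
        n = sum(row.values())
        result[row_label] = (n, {cat: count / n for cat, count in row.items()})
    return result


for row_label, (n, dist) in conditional_distributions(counts).items():
    cells = ", ".join(f"{cat}: {p:.2f}" for cat, p in dist.items())
    print(f"{row_label} (n = {n}): {cells}")
```

Each printed row is one conditional distribution of the response, so the proportions in a row sum to 1.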

8
Independence vs Dependence

For two variables to be independent, the
population percentage in any category of one
variable is the same for all categories of the
other variable
For two variables to be dependent (or
associated), the population percentages in the
categories are not all the same

9
Example Happiness and Gender
10
Example Happiness and Gender
11
Example Belief in Life After Death
12
Example Belief in Life After Death

Are race and belief in life after death
independent or dependent?
The conditional distributions in the table are
similar but not exactly identical
It is tempting to conclude that the variables are
dependent

13
Example Belief in Life After Death

Are race and belief in life after death
independent or dependent?
The definition of independence between variables
refers to a population
The table is a sample, not a population

14
Independence vs Dependence

Even if variables are independent, we would not
expect the sample conditional distributions to be
identical
Because of sampling variability, each sample
percentage typically differs somewhat from the
true population percentage

15
Section 10.2

How Can We Test whether Categorical Variables are
Independent?

16
A Significance Test for Categorical Variables

The hypotheses for the test are
H0: The two variables are independent
Ha: The two variables are dependent
(associated)
The test assumes random sampling and a large
sample size

17
What Do We Expect for Cell Counts if the
Variables Are Independent?

The count in any particular cell is a random
variable
Different samples have different values for the
count
The mean of its distribution is called an
expected cell count
This is found under the presumption that H0 is
true

18
How Do We Find the Expected Cell Counts?

Expected Cell Count
For a particular cell, the expected cell count
equals (row total × column total) / total sample size
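
A minimal Python sketch of this calculation, using the standard formula (row total × column total) / total sample size. The 2×3 table and the function name expected_counts are hypothetical and serve only as an illustration.

```python
def expected_counts(observed):
    """Expected count per cell under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical 2x3 table of observed counts (not from the slides).
observed = [[30, 50, 20],
            [20, 50, 30]]
expected = expected_counts(observed)
for row in expected:
    print(row)
```

The expected counts keep the same row and column totals as the observed table, which is a quick sanity check on any hand calculation.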

19
Example Happiness by Family Income
20
The Chi-Squared Test Statistic

The chi-squared statistic summarizes how far the
observed cell counts in a contingency table fall
from the expected cell counts for a null
hypothesis

21
Example Happiness and Family Income
22
Example Happiness and Family Income

State the null and alternative hypotheses for
this test
H0: Happiness and family income are independent
Ha: Happiness and family income are dependent
(associated)

23
Example Happiness and Family Income

Report the χ² statistic and explain how it was
calculated
To calculate the χ² statistic, for each cell,
calculate (observed count − expected count)² / expected count
Sum the values for all the cells
The χ² value is 73.4
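
The calculation described here can be sketched as follows. The table is hypothetical (it is not the happiness and income table, so it does not reproduce 73.4), and the function name chi_squared_statistic is an assumption made for the illustration.

```python
def chi_squared_statistic(observed):
    """Sum over all cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent.
            exp = row_totals[i] * col_totals[j] / n
            statistic += (obs - exp) ** 2 / exp
    return statistic


# Hypothetical 2x3 table of observed counts (not from the slides).
observed = [[30, 50, 20],
            [20, 50, 30]]
print(chi_squared_statistic(observed))
```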

24
Example Happiness and Family Income

The larger the χ² value, the greater the
evidence against the null hypothesis of
independence and in support of the alternative
hypothesis that happiness and income are
associated

25
The Chi-Squared Distribution

To convert the χ² test statistic to a
P-value, we use the sampling distribution of the
χ² statistic
For large sample sizes, this sampling
distribution is well approximated by the
chi-squared probability distribution

26
The Chi-Squared Distribution
27
The Chi-Squared Distribution

Main properties of the chi-squared distribution
It falls on the positive part of the real number
line
The precise shape of the distribution depends on
the degrees of freedom
df = (r − 1)(c − 1)

28
The Chi-Squared Distribution

Main properties of the chi-squared distribution
The mean of the distribution equals the df value
It is skewed to the right
The larger the χ² value, the greater the
evidence against H0: independence

29
The Chi-Squared Distribution
30
The Five Steps of the Chi-Squared Test of
Independence

1. Assumptions
Two categorical variables
Randomization
Expected counts ≥ 5 in all cells

31
The Five Steps of the Chi-Squared Test of
Independence

2. Hypotheses
H0: The two variables are independent
Ha: The two variables are dependent (associated)

32
The Five Steps of the Chi-Squared Test of
Independence

3. Test Statistic
χ² = Σ (observed count − expected count)² / expected count

33
The Five Steps of the Chi-Squared Test of
Independence

4. P-value: Right-tail probability above the
observed χ² value, for the chi-squared
distribution with df = (r − 1)(c − 1)
5. Conclusion: Report P-value and interpret in
context
If a decision is needed, reject H0 when P-value ≤
significance level
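
Steps 4 and 5 can be sketched in Python without extra libraries. The right-tail probability below uses a standard recurrence for the chi-squared distribution with integer df. The function names, the significance level of 0.05, and the choice of example are assumptions for illustration: the statistic 25.01 is the value reported in the aspirin example later in the deck, treated here as a 2×2 table. In practice, statistical software reports this P-value directly.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    """df = (r - 1)(c - 1) for an r x c contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def chi2_right_tail(x, df):
    """P(chi-squared variable with integer df >= 1 exceeds x), for x >= 0."""
    z = x / 2.0
    # Base cases: df = 2 gives exp(-z); df = 1 gives erfc(sqrt(z)).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence Q(a + 1, z) = Q(a, z) + z^a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


df = degrees_of_freedom(2, 2)         # assumed 2x2 table
p_value = chi2_right_tail(25.01, df)  # statistic from the aspirin example
alpha = 0.05                          # hypothetical significance level
reject_h0 = p_value <= alpha
print(f"df = {df}, P-value = {p_value:.2e}, reject H0: {reject_h0}")
```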

34
Chi-Squared is Also Used as a Test of
Homogeneity

The chi-squared test does not depend on which is
the response variable and which is the
explanatory variable
When a response variable is identified and the
population conditional distributions are
identical, they are said to be homogeneous
The test is then referred to as a test of
homogeneity

35
Example Aspirin and Heart Attacks Revisited
36
Example Aspirin and Heart Attacks Revisited

What are the hypotheses for the chi-squared test
for these data?
The null hypothesis is that whether a doctor has
a heart attack is independent of whether he takes
placebo or aspirin
The alternative hypothesis is that there's an
association

37
Example Aspirin and Heart Attacks Revisited

Report the test statistic and P-value for the
chi-squared test
The test statistic is 25.01 with a P-value of
0.000
This is very strong evidence that the population
proportion of heart attacks differed for those
taking aspirin and for those taking placebo

38
Example Aspirin and Heart Attacks Revisited

The sample proportions indicate that the aspirin
group had a lower rate of heart attacks than the
placebo group

39
Limitations of the Chi-Squared Test

If the P-value is very small, strong evidence
exists against the null hypothesis of
independence
But
The chi-squared statistic and the P-value tell us
nothing about the nature or the strength of the
association

40
Limitations of the Chi-Squared Test

We know that there is statistical significance,
but the test alone does not indicate whether
there is practical significance as well

41
Section 10.3

How Strong is the Association?

In a study of the two variables (Gender and
Happiness), which one is the response variable?
Gender
Happiness

What is the Expected Cell Count for Females who
are Pretty Happy?
898
801.5
902
521

Calculate the χ² statistic
1.75
0.27
0.98
10.34

At a significance level of 0.05, what is the
correct decision?
Gender and Happiness are independent
There is an association between Gender and
Happiness

46
Analyzing Contingency Tables

Is there an association?
The chi-squared test of independence addresses
this
When the P-value is small, we infer that the
variables are associated

47
Analyzing Contingency Tables

How do the cell counts differ from what
independence predicts?
To answer this question, we compare each observed
cell count to the corresponding expected cell
count

48
Analyzing Contingency Tables

How strong is the association?
Analyzing the strength of the association reveals
whether the association is an important one, or
if it is statistically significant but weak and
unimportant in practical terms

49
Measures of Association

A measure of association is a statistic or a
parameter that summarizes the strength of the
dependence between two variables

50
Difference of Proportions

An easily interpretable measure of association is
the difference between the proportions making a
particular response

51
Difference of Proportions
52
Difference of Proportions

Case (a) exhibits the weakest possible
association: no association
Accept Credit Card
The difference of proportions is 0%

53
Difference of Proportions

Case (b) exhibits the strongest possible
association
Accept Credit Card
The difference of proportions is 100%

54
Difference of Proportions

In practice, we don't expect data to follow
either extreme (0% difference or 100%
difference), but the stronger the association,
the larger the absolute value of the difference of
proportions

55
Example Do Student Stress and Depression Depend
on Gender?
56
Example Do Student Stress and Depression Depend
on Gender?

Which response variable, stress or depression,
has the stronger sample association with gender?

57
Example Do Student Stress and Depression Depend
on Gender?

Stress
The difference of proportions between females and
males was 0.35 − 0.16 = 0.19

58
Example Do Student Stress and Depression Depend
on Gender?

Depression
The difference of proportions between females and
males was 0.08 − 0.06 = 0.02

59
Example Do Student Stress and Depression Depend
on Gender?

In the sample, stress (with a difference of
proportions = 0.19) has a stronger association
with gender than depression has (with a
difference of proportions = 0.02)
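
The comparison can be checked with a short Python sketch that uses the four sample proportions quoted in this example. The dictionary layout and the function name difference_of_proportions are hypothetical.

```python
# Sample proportions reporting each condition, as quoted in the example.
stress = {"female": 0.35, "male": 0.16}
depression = {"female": 0.08, "male": 0.06}


def difference_of_proportions(p1, p2):
    """Difference between two group proportions making the same response."""
    return p1 - p2


diff_stress = difference_of_proportions(stress["female"], stress["male"])
diff_depression = difference_of_proportions(depression["female"], depression["male"])

# The larger absolute difference indicates the stronger sample association.
stronger = "stress" if abs(diff_stress) > abs(diff_depression) else "depression"
print(round(diff_stress, 2), round(diff_depression, 2), stronger)
```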

60
The Ratio of Proportions Relative Risk

Another measure of association is the ratio of
two proportions, p1/p2
In medical applications in which the proportion
refers to an adverse outcome, it is called the
relative risk

61
Example Relative Risk for Seat Belt Use and
Outcome of Auto Accidents
62
Example Relative Risk for Seat Belt Use and
Outcome of Auto Accidents

Treating the auto accident outcome as the
response variable, find and interpret the
relative risk

63
Example Relative Risk for Seat Belt Use and
Outcome of Auto Accidents

The adverse outcome is death
The relative risk is formed for that outcome
For those who wore a seat belt, the proportion
who died equaled 510/412,878 = 0.00124
For those who did not wear a seat belt, the
proportion who died equaled 1601/164,128 =
0.00975

64
Example Relative Risk for Seat Belt Use and
Outcome of Auto Accidents

The relative risk is the ratio
0.00124/0.00975 = 0.127
The proportion of subjects wearing a seat belt
who died was 0.127 times the proportion of
subjects not wearing a seat belt who died

65
Example Relative Risk for Seat Belt Use and
Outcome of Auto Accidents

Many find it easier to interpret the relative
risk by reordering the rows of data so that the
relative risk has a value above 1.0

66
Example Relative Risk for Seat Belt Use and
Outcome of Auto Accidents

Reversing the order of the rows, we calculate the
ratio
0.00975/0.00124 = 7.9
The proportion of subjects not wearing a seat
belt who died was 7.9 times the proportion of
subjects wearing a seat belt who died
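
Both ratios can be reproduced from the counts quoted in this example. The sketch below is illustrative and its variable and function names are hypothetical. It works from the unrounded proportions, so the results can differ slightly in the third decimal from the slide values, which divide rounded proportions.

```python
# Counts quoted in the example: deaths and group sizes.
deaths_belt, n_belt = 510, 412_878
deaths_no_belt, n_no_belt = 1601, 164_128


def relative_risk(events_1, n_1, events_2, n_2):
    """Ratio of the proportion with the outcome in group 1 to that in group 2."""
    return (events_1 / n_1) / (events_2 / n_2)


rr_belt_vs_none = relative_risk(deaths_belt, n_belt, deaths_no_belt, n_no_belt)
rr_none_vs_belt = relative_risk(deaths_no_belt, n_no_belt, deaths_belt, n_belt)
print(f"{rr_belt_vs_none:.3f}  {rr_none_vs_belt:.1f}")
```

Reversing the groups simply inverts the ratio, which is why the two versions carry the same information.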

67
Example Relative Risk for Seat Belt Use and
Outcome of Auto Accidents

A relative risk of 7.9 represents a strong
association
This is far from the value of 1.0 that would
occur if the proportion of deaths were the same
for each group
Wearing a seat belt has a practically significant
effect in enhancing the chance of surviving an
auto accident

68
Properties of the Relative Risk

The relative risk can equal any nonnegative
number
When p1 = p2, the variables are independent and
relative risk = 1.0
Values farther from 1.0 (in either direction)
represent stronger associations

69
Large χ² Does Not Mean There's a Strong
Association

A large chi-squared value provides strong
evidence that the variables are associated
It does not imply that the variables have a
strong association
This statistic merely indicates (through its
P-value) how certain we can be that the variables
are associated, not how strong that association is

70
Section 10.4

How Can Residuals Reveal the Pattern of
Association?

71
Association Between Categorical Variables

The chi-squared test and measures of association
such as (p1 − p2) and p1/p2 are fundamental
methods for analyzing contingency tables
The P-value for χ² summarizes the strength of
evidence against H0: independence

72
Association Between Categorical Variables

If the P-value is small, then we conclude that
somewhere in the contingency table the population
cell proportions differ from independence
The chi-squared test does not indicate whether
all cells deviate greatly from independence or
perhaps only some of them do so

73
Residual Analysis

A cell-by-cell comparison of the observed counts
with the counts that are expected when H0 is true
reveals the nature of the evidence against H0
The difference between an observed and expected
count in a particular cell is called a residual

74
Residual Analysis

The residual is negative when fewer subjects are
in the cell than expected under H0
The residual is positive when more subjects are
in the cell than expected under H0

75
Residual Analysis

To determine whether a residual is large enough
to indicate strong evidence of a deviation from
independence in that cell, we use an adjusted form
of the residual: the standardized residual

76
Residual Analysis

The standardized residual for a cell =
(observed count − expected count)/se
A standardized residual reports the number of
standard errors that an observed count falls from
its expected count
Its formula is complex
Software can be used to find its value
A large value provides evidence against
independence in that cell
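
The slides leave the formula to software. As a hedged sketch, the version below assumes the usual textbook standard error, se = sqrt(expected × (1 − row proportion) × (1 − column proportion)), which is not stated in the slides. The table and the function name standardized_residuals are hypothetical.

```python
import math


def standardized_residuals(observed):
    """(observed - expected) / se per cell, with the assumed standard error
    se = sqrt(expected * (1 - row proportion) * (1 - column proportion))."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(observed):
        out_row = []
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            se = math.sqrt(exp * (1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            out_row.append((obs - exp) / se)
        residuals.append(out_row)
    return residuals


# Hypothetical 2x3 table of observed counts (not from the slides).
observed = [[30, 50, 20],
            [20, 50, 30]]
for row in standardized_residuals(observed):
    print([round(z, 2) for z in row])
```

Under independence a standardized residual behaves roughly like a standard normal value, so magnitudes well above about 3 are the ones usually read as strong evidence against independence in that cell.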

77
Example Standardized Residuals for Religiosity
and Gender

To what extent do you consider yourself a
religious person?

78
Example Standardized Residuals for Religiosity
and Gender
79
Example Standardized Residuals for Religiosity
and Gender

Interpret the standardized residuals in the table

80
Example Standardized Residuals for Religiosity
and Gender

The table exhibits large positive residuals for
the cells for females who are very religious and
for males who are not at all religious.
In these cells, the observed count is much larger
than the expected count
There is strong evidence that the population has
more subjects in these cells than if the
variables were independent

81
Example Standardized Residuals for Religiosity
and Gender

The table exhibits large negative residuals for
the cells for females who are not at all
religious and for males who are very religious
In these cells, the observed count is much
smaller than the expected count
There is strong evidence that the population has
fewer subjects in these cells than if the
variables were independent

82
Section 10.5

What if the Sample Size is Small? Fisher's Exact
Test

83
Fisher's Exact Test

The chi-squared test of independence is a
large-sample test
When the expected frequencies are small, any of
them being less than about 5, small-sample tests
are more appropriate
Fisher's exact test is a small-sample test of
independence

84
Fisher's Exact Test

The calculations for Fisher's exact test are
complex
Statistical software can be used to obtain the
P-value for the test that the two variables are
independent
The smaller the P-value, the stronger is the
evidence that the variables are associated

85
Example Tea Tastes Better with Milk Poured
First?

This is an experiment conducted by Sir Ronald
Fisher
His colleague, Dr. Muriel Bristol, claimed that
when drinking tea she could tell whether the milk
or the tea had been added to the cup first

86
Example Tea Tastes Better with Milk Poured
First?

Experiment
Fisher asked her to taste eight cups of tea
Four had the milk added first
Four had the tea added first
She was asked to indicate which four had the milk
added first
The order of presenting the cups was randomized

87
Example Tea Tastes Better with Milk Poured
First?

Results

88
Example Tea Tastes Better with Milk Poured
First?

Analysis

89
Example Tea Tastes Better with Milk Poured
First?

The one-sided version of the test pertains to the
alternative that her predictions are better than
random guessing
Does the P-value suggest that she had the ability
to predict better than random guessing?

90
Example Tea Tastes Better with Milk Poured
First?

The P-value of 0.243 does not give much evidence
against the null hypothesis
The data did not support Dr. Bristol's claim that
she could tell whether the milk or the tea had
been added to the cup first
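
The one-sided P-value can be reproduced with a short Python sketch. The results table is not preserved in this transcript, so the sketch assumes she correctly identified 3 of the 4 milk-first cups, which is the outcome consistent with the reported P-value of 0.243. The function names are hypothetical.

```python
from math import comb


def hypergeometric_pmf(k, n_success, n_failure, n_draws):
    """P(exactly k successes among n_draws picks made without replacement)."""
    return (comb(n_success, k) * comb(n_failure, n_draws - k)
            / comb(n_success + n_failure, n_draws))


def one_sided_p_value(k_observed, n_success, n_failure, n_draws):
    """P(k_observed or more successes) under random guessing."""
    upper = min(n_success, n_draws)
    return sum(hypergeometric_pmf(k, n_success, n_failure, n_draws)
               for k in range(k_observed, upper + 1))


# 8 cups: 4 milk-first, 4 tea-first; she names 4 cups as milk-first.
# Assumption: 3 of her 4 picks were correct.
p = one_sided_p_value(3, n_success=4, n_failure=4, n_draws=4)
print(round(p, 3))
```

With all 4 picks correct the same calculation gives 1/70, the smallest P-value this design can produce.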
