Analysis of frequency counts with Chi square - PowerPoint PPT Presentation

1 / 34

About This Presentation

Title:

Analysis of frequency counts with Chi square

Description:

Analysis of frequency counts with Chi square 2 Dr David Field Summary Categorical data Frequency counts One variable chi-square testing the null hypothesis that ... – PowerPoint PPT presentation

Number of Views:86

Avg rating:3.0/5.0

Slides: 35

Provided by: DavidT109

Category:

more less

Transcript and Presenter's Notes

Title: Analysis of frequency counts with Chi square

1
Analysis of frequency counts with Chi square

Dr David Field

2
Summary

Categorical data
Frequency counts
One variable chi-square
testing the null hypothesis that frequencies in
the sample are equally divided among the
catgegories
varying the null hypothesis
Two variable chi-square
testing the null hypothesis that status on one
categorical variable is independent from status
on another categorical variable
Limitations and assumptions of chi-square
Andy Field chapter 18 covers chi-square
There is also a guide online at
http//davidmlane.com/hyperstat/
Chi-square is topic 16 in the list

3
Categorical data
18.2

Each participant is a member of a single
category, and the categories cannot be
meaningfully placed in order
e.g., nationality French, German, Italian
Sometimes chi-square is used with ordered
categories, e.g. age bands
To perform statistical tests with categorical
data each participant must be a member of only
one category
Category membership must be mutually exclusive
You cant be a smoker and a non-smoker
This allows frequency counts in each category to
be calculated

4
Chi square

If you can express the data as frequency counts
in several categories, then chi square can be
used to test for differences between the
categories
You will also see chi square written as a Greek
letter accompanied by the mathematical symbol
indicating that a number should be squared

5
Chi square with a single categorical variable

Suppose we are interested in which drink is most
popular
We ask a sample of 100 people if they prefer to
drink coffee, tea, or water
each respondent is only allowed to select one
answer
this is important if each person can have
membership of more than one category you cant
use Chi square
By default, the null hypothesis for chi-square is
that each of the categories is equally frequent
in the underlying population
it is possible to modify this (see later)

6
One variable chi-square example

Lets say that the preferences expressed by the
sample of 100 people result in the following
observed frequency counts
tea 39
coffee 30
Water 31
SUM 100
The null hypothesis assumes that each category is
equally frequent, and thus provides a model that
the data can be used to test
Based on the null hypothesis, the expected
frequency counts would 100 / 3 33.3 per
category
The Chi square statistic works out the
probability that the observed frequencies could
be obtained by random sampling from a population
where the null hyp is true

7
One variable chi-square example
Observed Expected Difference Difference squared Divide by expected
39 33.3
30 33.3
31 33.3
100 100

8
One variable chi-square example
Observed Expected Difference Difference squared Divide by expected
39 33.3 5.7
30 33.3 -3.3
31 33.3 -2.3
100 100

9
One variable chi-square example
Observed Expected Difference Difference squared Divide by expected
39 33.3 5.7 32.49
30 33.3 -3.3 10.89
31 33.3 -2.3 5.29
100 100

10
One variable chi-square example
Observed Expected Difference Difference squared Divide by expected
39 33.3 5.7 32.49 0.98
30 33.3 -3.3 10.89 0.33
31 33.3 -2.3 5.29 0.16
100 100

11
One variable chi-square example
Observed Expected Difference Difference squared Divide by expected
39 33.3 5.7 32.49 0.98
30 33.3 -3.3 10.89 0.33
31 33.3 -2.3 5.29 0.16
100 100 SUM 1.47

12
Converting Chi square to a p value

SPSS will do this for you
Chi square has degrees of freedom equal to the
number of categories minus 1
2 in the example this is because if you knew the
frequencies of preference for tea and coffee and
the sample size, the frequency of preference for
water would not be free to vary
The chi square value of 1.47, df 2 had an
associated p value of 0.48, so the null
hypothesis that preferences for drinking tea,
coffee and water in the population are equal
cannot be rejected.

13
One variable chi square with unequal expected
frequencies

By default, the expected frequencies are just the
sample size divided equally among the number of
categories.
But, sometimes this is inappropriate
For example, we know that the of the population
of the UK that smokes is less than 50
Lets assume for purposes of illustration that
25 of the UK population are smokers
We might hypothesise that the smoking rate is
higher in Glasgow than the UK average rate
The null hypothesis is that it is the same

14
One variable chi square with unequal expected
frequencies

We ask 200 adults in Glasgow if they smoke.
80 say yes
120 say no
We know that the UK average rate is 25, and 80
is rather more than 25 of 200
Chi square can be used to assess the probability
of the above frequencies being obtained by random
sampling if the real smoking rate in Glasgow was
actually 25

15
One variable chi-square example with unequal
expected frequencies
Observed Expected Difference Difference squared Divide by expected
120 150 -30 900 6
80 50 30 900 18
200 200 SUM 24
16
One variable chi square with unequal expected
frequencies

80 of the sample of 200 people from Glasgow
classified themselves as smokers. This resulted
in a chi square value of 24.0, df 1 with an
associated p value of lt 0.001, so the null
hypothesis that smoking rates in Glasgow are
equal to the UK average of 25 can be rejected.

17
Chi square with two variables
18.3

Usually, it is more interesting to use Chi square
to ask about the relationship between 2
categorical variables.
For example, what is the relationship between
gender and smoking?
gender can be male or female
smoking can be smoker or non-smoker
If you have smoking data from just men, you can
only use chi-square to ask if the proportion of
smokers and non-smokers is different
If you have smoking data from men and women you
can use chi-square to ask if the proportion of
men who smoke differs from the proportion of
women who smoke

18
What 22 chi square does not do

It is important to realise that in the 22 chi
square, having a big imbalance between the number
of men and the number of women will not increase
the value of the chi-square statistic
Also, having a big imbalance between the number
of smokers and non-smokers will not increase the
value of the chi-square statistic
This contrasts with the one variable chi-square,
where an imbalance in the numbers of men vs
women, or smokers vs. non-smokers does increase
the value of chi-square.
The value of chi-square for two variables is high
if smoking frequency is contingent on gender, and
low if smoking frequency is independent of gender

The key to understanding 22 chi square is how
the expected frequencies are calculated
The expected frequencies provide the null
hypothesis, or null model, that the chi square
statistic tests
If there are 200 participants, the simplest null
model would be to expect 50 female smokers, 50
male smokers, 50 female non smokers, and 50 male
non smokers
but we already know that it is implausible to
expect an equal split of smokers and non-smokers
the expected frequencies will have to allow for
the imbalance of smokers vs non smokers and a
possible imbalance of men vs women in the sample
A sample with 20 male smokers, 10 female smokers,
80 male non-smokers and 40 female non-smokers has
an imbalance of gender and smoking status, but
smoking status does not depend on gender and
there is no deviation from the null model

20
The contingency table of observed frequencies
Men Women Row totals
Smoke 13 31 44

Dont smoke 29 86 115

Column totals 42 117 159
21
Calculating the expected frequencies

The key step in the calculation of chi-square is
to estimate the frequency counts that would occur
in each cell if the null hypothesis that the row
frequencies and column frequencies do not depend
upon each other were true
To calculate the expected frequency of the male
smokers cell, we first need to calculate the
proportion of participants that are male, without
considering if they smoke or not
This proportion is 42 males out of 159 (the total
number of participants)
42 / 159 0.26

22
Calculating the expected frequencies

If the null hyp is true, and the proportion of
female smokers and male smokers is equal, then
the proportion of the smokers in the sample that
are male should be equal to the overall
proportion of the sample that is male
Total number of smokers in sample (44)
proportion of sample that is male (0.26)
44 0.26 11.62

23
Calculating the expected frequencies
Men Women Row totals
Smoke 13 31 44
Expected smokers 11.62
Dont smoke 29 86 115
Expected non smoke
Column totals 42 117 159
24
Calculating the expected frequencies
Men Women Row totals
Smoke 13 31 44
Expected smokers 11.62 32.37
Dont smoke 29 86 115
Expected non smoke
Column totals 42 117 159
0.74
25
Calculating the expected frequencies
Men Women Row totals
Smoke 13 31 44
Expected smokers 11.62 32.37
Dont smoke 29 86 115
Expected non smoke 30.37
Column totals 42 117 159
26
Calculating the expected frequencies
Men Women Row totals
Smoke 13 31 44
Expected smokers 11.62 32.37
Dont smoke 29 86 115
Expected non smoke 30.37 84.62
Column totals 42 117 159
27
Calculating the value of chi square

Each cell in the contingency table makes a
contribution to the total chi-square
For each cell you calculate
(Observed Expected) and square it
You then divide by the Expected
Do this for each cell individually and add up the
results

28
Calculating chi square
Men Women Row totals
Smoke 13 31 44
Expected smokers 11.62 32.37
Dont smoke 29 86 115
Expected non smoke 30.37 84.62
Column totals 42 117 159
(13-11.62)2 1.90 1.90 / 11.62 0.16
29
Converting chi-square to a p value

The degrees of freedom for a two way Chi square
depends upon the number of categories in the
contingency table
(num columns -1) (num rows -1)
SPSS will calculate the DF and p value for you
The chi square value of 0.31, df 1 had an
associated p value of 0.58, so the null
hypothesis that the proportion of men and women
that smoke is equal cannot be rejected.
Also see

18.5.7
30
Larger contingency tables

You can perform chi-square on larger contingency
tables
For example, we might be interested in whether
the proportion of smokers vs. non smokers differs
according to age, where age is a 3 level
categorical variable
20-29 years old
30-39 years old
40-49 years old
This results in a 2 3 contingency table
However, there is some uncertainty as to what a
significant chi-square means in this case

31
Partitioning chi-square

A statistically significant 2 3 chi-square
might have occurred for one of these 3 reasons
The proportion of 20-29 year olds who smoke
differs from the proportion of 30-39 year olds
that smoke
The proportion of 20-29 year olds that smoke
differs from the proportion of 40-49 year olds
that smoke
The proportion of 30-39 year olds that smoke
differs from the proportion of 40-49 year olds
that smoke
Or all 3 of the above might be true
Or 2 of the above might be true
As a researcher, you will want to distinguish
between these possibilities

32
Partitioning chi-square

The solution is to break the 2 3 contingency
table into smaller 2 2 contingency tables to
test each of the comparisons in the list
The proportion of 20-29 year olds who smoke
differs from the proportion of 30-39 year olds
that smoke
The proportion of 20-29 year olds that smoke
differs from the proportion of 40-49 year olds
that smoke
The proportion of 30-39 year olds that smoke
differs from the proportion of 40-49 year olds
that smoke
Run 3 separate 2 2 chi-square tests

33
Partitioning chi-square

However, running 3 tests results in 3 chances of
a type 1 error occurring
To maintain the probability of a type 1 error at
the conventional level of 5 you divide the alpha
level by the number of chi-square tests you run
Effectively, you share the 5 risk of rejecting
the null hypothesis due to sampling error equally
among the tests you perform
For a single chi-square, it is significant if
SPSS reports that p is less than 0.05
For two chi-square tests, they are significant at
the 0.05 level individually if SPSS reports that
p is less than 0.025
For three chi-square tests, they are significant
at the 0.05 level individually if SPSS reports
that p is less than 0.0166

34
Warnings about chi-square

The expected frequency count in any cell must not
be less than 5
If this occurs then chi-square is not reliable
If the contingency table is 2 2 or 2 3 you
can use the Fisher exact probability test instead
SPSS will report this
For bigger contingency tables the only solution
is to collapse across categories, but only
where this is meaningful
If you began with age categories 0-4, 5-10,
11-15, 16-20 you could collapse to 0-10 and
11-20, which would increase the expected
frequencies in each cell
Finally, remember that the total of frequencies
is equal to the number of participants you have
each person must only be a member of one cell in
the table