Title: Cross Tabs and Chi-Squared
1Cross Tabs and Chi-Squared
- Testing for a Relationship Between Nominal (or
Ordinal) Variables
2Cross Tabs and Chi-Squared
- The test you choose depends on level of measurement:
- Independent                        Dependent                          Test
- Dichotomous                        Interval-ratio or Dichotomous      Independent Samples t-test
- Nominal or Dichotomous             Interval-ratio or Dichotomous      ANOVA
- Nominal (Ordinal) or Dichotomous   Nominal (Ordinal) or Dichotomous   Cross Tabs
3Cross Tabs and Chi-Squared
- We are asking whether there is a relationship between two nominal (or ordinal) variables; this includes dichotomous variables.
- One may use cross tabs for ordinal variables, but it is generally better to use more powerful statistical techniques if you can treat them as interval-ratio variables.
4Cross Tabs and Chi-Squared
- Cross tabs and Chi-Squared will tell you whether classification on one nominal variable is related to classification on a second nominal variable.
- For example:
- Are rural Americans more likely to vote Republican in presidential races than urban Americans?
- Classification on Region and Party Vote
- Are white people more likely to drive SUVs than Black or Latino people?
- Classification on Race and Type of Vehicle
5Cross Tabs and Chi-Squared
- The statistical focus will be on the number or count of people in a sample who are classified in patterned ways on two variables.
- Or: the number or count of people classified in each category created when considering both variables at the same time, such as:
- Vehicle Type   Race: White   Race: Black
- SUV            White SUV     Black SUV
- Car            White Car     Black Car
6Cross Tabs and Chi-Squared
- Number in Each Joint Group?
- Why?
- Means and standard deviations are meaningless for nominal variables.
- So we need statistics that allow us to work categorically.
7Cross Tabs and Chi-Squared
- The procedure starts with a cross classification of the cases in categories of each variable.
- Example: data on male and female support for keeping SJSU football from 650 students, put into a matrix:
-          Yes   No    Maybe   Total
- Female   185   200   65      450
- Male     80    65    55      200
- Total    265   265   120     650
8Cross Tabs and Chi-Squared
- In the example, you can see that the campus is divided on the issue. But is there an association between sex and attitudes?
9Cross Tabs and Chi-Squared
- But is there an association between sex and attitudes?
- An easy way to get more information is to convert the frequencies (or counts in each cell) to percentages.
- Data on male and female support for SJSU football from 650 students, with row percentages in parentheses:
-          Yes         No          Maybe       Total
- Female   185 (41%)   200 (44%)   65 (14%)    450 (99%*)
- Male     80 (40%)    65 (33%)    55 (28%)    200 (101%*)
- Total    265 (41%)   265 (41%)   120 (18%)   650 (100%)
- * Percentages do not add to 100 due to rounding.
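The conversion from counts to row percentages can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function name `row_percentages` are assumptions made here, not part of the lecture.

```python
# Observed counts from the SJSU football example (rows: sex, columns: answer).
observed = {
    "Female": {"Yes": 185, "No": 200, "Maybe": 65},
    "Male": {"Yes": 80, "No": 65, "Maybe": 55},
}


def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for row_label, counts in table.items():
        row_total = sum(counts.values())
        result[row_label] = {
            answer: 100 * count / row_total for answer, count in counts.items()
        }
    return result


percentages = row_percentages(observed)
for row_label, row in percentages.items():
    cells = ", ".join(f"{answer} {pct:.1f}%" for answer, pct in row.items())
    print(f"{row_label}: {cells}")
```

Printing one decimal place shows where the whole-number percentages on the slide come from, and why a rounded row can add to 99 or 101.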
10Cross Tabs and Chi-Squared
- We can see that in the sample men are less likely to oppose football, but no more likely to say yes than women; men are more likely to say maybe.
11Cross Tabs and Chi-Squared
- Using percentages to describe relationships is valid statistical analysis. These are descriptive statistics! However, they are not inferential statistics.
- What can we say about the population using this sample (inferential statistics)?
- Thinking about random variations in who would be selected from random sample to random sample:
- Could we have gotten sample statistics like these from a population where there is no association between sex and attitudes about keeping football?
- The Chi-Squared Test of Independence allows us to answer questions like those above.
12Cross Tabs and Chi-Squared
- The whole idea behind the Chi-Squared test of independence is to determine whether the patterns of frequencies (or counts) in your cross classification table could have occurred by chance, or whether they represent systematic assignment to particular cells.
- For example, were women more likely to answer no than men, or could the deviation in responses by sex have occurred because of random sampling or chance alone?
13Cross Tabs and Chi-Squared
- A number called Chi-Squared, χ², tells us whether the numbers in each cross classification cell in our sample deviate by more than the kind of random fluctuations you would get if our two variables were not associated with each other (independent of each other).
- Its formula: χ² = Σ ((fo - fe)² / fe)
- fo = observed frequency in each cell; fe = expected frequency in each cell
- The crux of χ² is that it gets larger as observed data deviate more from the data we would expect if our variables were unrelated.
- From sample to sample, one would expect deviations from what is expected even when variables are unrelated. But when χ² gets really big, it grows beyond the numbers that random variation in samples would produce.
- A big χ² will imply that there is a relationship between our two nominal variables.
14Cross Tabs and Chi-Squared
- Calculating χ² begins with the concept of a deviation of observed data from what is expected by unrelated variables.
- Deviation in χ² = Observed frequency - Expected frequency
- Observed frequency is just the number of cases in each cell of the cross classification table for your sample. For example, 185 women said yes, they support football at SJSU. 185 is the observed frequency.
- Expected frequency is the number of cases that would be in a cell of the cross classification table if people in each group of one variable were classified in the second variable's groups in the same ways.
15Cross Tabs and Chi-Squared
- Data on male and female support for SJSU football from 650 students:
-          Yes   No    Maybe   Total   % of total
- Female   ?     ?     ?       450     69.2%
- Male     ?     ?     ?       200     30.8%
- Total    265   265   120     650     100%
- Expected frequency (if our variables were unrelated):
- Females comprise 69.2% of the sample, so we'd expect 69.2% of the Yes answers to come from females, and 69.2% of the No and Maybe answers to come from females.
- On the other hand, 30.8% of the Yes, No, and Maybe answers should come from men.
- Therefore, to calculate the expected frequency for each cell you do this:
- fe = (cell's row total / table total) × cell's column total, or
- fe = (cell's column total / table total) × cell's row total
- The idea: 1. Find the percent of persons in one category on the first variable; then 2. Expect to find that percent of those people in each of the other variable's categories.
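As a rough sketch of the rule just stated, the expected frequency of every cell can be computed from the row totals, column totals, and table total alone. The list-of-lists layout and the name `expected_frequencies` are assumptions chosen for this illustration.

```python
# Observed counts from the SJSU football example.
observed = [
    [185, 200, 65],  # Female: Yes, No, Maybe
    [80, 65, 55],    # Male:   Yes, No, Maybe
]


def expected_frequencies(table):
    """Return fe = (row total / table total) * column total for every cell."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    table_total = sum(row_totals)
    return [
        [row_total / table_total * column_total for column_total in column_totals]
        for row_total in row_totals
    ]


expected = expected_frequencies(observed)
for row in expected:
    print([round(fe, 1) for fe in row])
```

Because each row of expected counts is a fixed share of the column totals, the expected counts keep the same row and column totals as the observed table.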
16Cross Tabs and Chi-Squared
- Data on male and female support for SJSU football from 650 students, with expected frequencies:
-          Yes           No            Maybe        Total   % of total
- Female   fe1 = 183.5   fe2 = 183.5   fe3 = 83.1   450     69.2%
- Male     fe4 = 81.5    fe5 = 81.5    fe6 = 36.9   200     30.8%
- Total    265           265           120          650     100%
- Now you know how to calculate the expected frequencies:
- fe1 = (450/650) × 265 = 183.5    fe4 = (200/650) × 265 = 81.5
- fe2 = (450/650) × 265 = 183.5    fe5 = (200/650) × 265 = 81.5
- fe3 = (450/650) × 120 = 83.1     fe6 = (200/650) × 120 = 36.9
- and the observed frequencies are obvious.
17Cross Tabs and Chi-Squared
- Data on male and female support for SJSU football from 650 students, shown as fo (fe):
-          Yes           No            Maybe       Total   % of total
- Female   185 (183.5)   200 (183.5)   65 (83.1)   450     69.2%
- Male     80 (81.5)     65 (81.5)     55 (36.9)   200     30.8%
- Total    265           265           120         650     100%
- You already know how to calculate the deviations too.
- Dc = fo - fe
- D1 = 185 - 183.5 = 1.5     D4 = 80 - 81.5 = -1.5
- D2 = 200 - 183.5 = 16.5    D5 = 65 - 81.5 = -16.5
- D3 = 65 - 83.1 = -18.1     D6 = 55 - 36.9 = 18.1
18Cross Tabs and Chi-Squared
- Now, we want to add up the deviations.
- What would happen if we added these deviations together? (They would sum to zero, because the positive and negative deviations cancel.)
- To get rid of negative deviations, we square each one (like in computing variance and standard deviation).
19Cross Tabs and Chi-Squared
- To get rid of negative deviations, we square each one (like for variance and standard deviation).
- (D1)² = (1.5)² = 2.25        (D4)² = (-1.5)² = 2.25
- (D2)² = (16.5)² = 272.25     (D5)² = (-16.5)² = 272.25
- (D3)² = (-18.1)² = 327.61    (D6)² = (18.1)² = 327.61
20Cross Tabs and Chi-Squared
- Just how large is each of these squared deviations?
- What do these numbers really mean?
21Cross Tabs and Chi-Squared
- The next step is to give the deviations a metric. The deviations are compared relative to what was expected. In other words, we divide each squared deviation by what was expected.
- You've already calculated what was expected in each cell (fe1 through fe6).
22Cross Tabs and Chi-Squared
- Relative deviations-squared: small values indicate little deviation from what was expected, while larger values indicate much deviation from what was expected.
- (D1)² / fe1 = 2.25 / 183.5 = 0.012      (D4)² / fe4 = 2.25 / 81.5 = 0.028
- (D2)² / fe2 = 272.25 / 183.5 = 1.484    (D5)² / fe5 = 272.25 / 81.5 = 3.340
- (D3)² / fe3 = 327.61 / 83.1 = 3.942     (D6)² / fe6 = 327.61 / 36.9 = 8.878
23Cross Tabs and Chi-Squared
- The next step will be to see what the total relative deviations-squared are.
- Sum of relative deviations-squared = 0.012 + 1.484 + 3.942 + 0.028 + 3.340 + 8.878 = 17.684
- This number is also what we call Chi-Squared or χ².
- So: χ² = Σ ((fo - fe)² / fe) = 17.684
- Of what good is knowing this number?
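The whole calculation (expected frequency, deviation, squaring, dividing by what was expected, summing) can be sketched as one short function. The name `chi_squared` and the list-of-lists layout are assumptions for this illustration. Because the code keeps unrounded expected frequencies, it gives roughly 17.67 rather than the hand-calculated 17.68.

```python
# Observed counts from the SJSU football example.
observed = [
    [185, 200, 65],  # Female: Yes, No, Maybe
    [80, 65, 55],    # Male:   Yes, No, Maybe
]


def chi_squared(table):
    """Sum (fo - fe)^2 / fe over all cells, with fe taken from the table margins."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_totals[i] * column_totals[j] / n  # expected if unrelated
            total += (fo - fe) ** 2 / fe               # relative deviation-squared
    return total


chi2 = chi_squared(observed)
print(round(chi2, 2))
```

A table whose rows have identical percentage distributions would give a value of zero, since every observed count would equal its expected count.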
24Cross Tabs and Chi-Squared
- This value, χ², would form an identifiable shape in repeated sampling if the two variables were unrelated to each other: the chance variation that we should expect among samples.
- That shape depends only on the number of rows and columns (or the nature of your variables). We technically refer to this as the degrees of freedom.
- For χ², df = (rows - 1)(columns - 1)
25Cross Tabs and Chi-Squared
- For χ², df = (rows - 1)(columns - 1)
- [Figure: χ² distributions for df = 1, df = 5, df = 10, and df = 20; horizontal axis marked at 1, 5, 10, 20]
- FYI: This should remind you of the normal distribution, except that it changes shape depending on the nature of your variables.
26Cross Tabs and Chi-Squared
Think of the Power!!!!
- We can use the known properties of the χ² distribution to identify the probability that we would get our sample's χ² if our variables were not related to each other!
- This is exciting!
27Cross Tabs and Chi-Squared
- If my χ² in a particular analysis were under the shaded area or beyond, what could we say about the population given our sample, using a null hypothesis that our variables are unrelated?
- [Figure: χ² distribution with the upper tail holding 5% of χ² values shaded; "My Chi-squared" marked in the shaded tail]
28Cross Tabs and Chi-Squared
- Answer: We'd reject the null, saying that it is highly unlikely that we could get such a large chi-squared value from a population where the two variables are unrelated.
29Cross Tabs and Chi-Squared
- So, what does the critical χ² value equal?
30Cross Tabs and Chi-Squared
- That depends on the particular problem because
the distribution changes depending on the number
of rows and columns in your cross classification
table.
- [Figure: χ² distributions for df = 1, df = 5, df = 10, and df = 20, each with its own critical χ² value; horizontal axis marked at 1, 5, 10, 20]
31Cross Tabs and Chi-Squared
- According to Appendix D in Warner, with α-level = .05:
- if df = 1, critical χ² = 3.84
- if df = 5, critical χ² = 11.07
- if df = 10, critical χ² = 18.31
- if df = 20, critical χ² = 31.41
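Table values like these can be checked numerically. The sketch below is not how Appendix D was produced; it is one possible approach that uses only Python's standard library, with the helper names `chi2_sf` and `critical_value` invented here. It computes the upper-tail probability of the χ² distribution for whole-number df and then searches by bisection for the value that leaves α = .05 in the tail.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-squared >= x) for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!.
        return math.exp(-half) * sum(
            half ** k / math.factorial(k) for k in range(df // 2)
        )
    # Odd df: start from the df = 1 tail and add one term per extra pair of df.
    tail = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        tail += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return tail


def critical_value(df, alpha=0.05):
    """Find x such that chi2_sf(x, df) equals alpha, by bisection."""
    low, high = 0.0, 1.0
    while chi2_sf(high, df) > alpha:  # widen the bracket until the tail is small
        high *= 2
    for _ in range(100):
        mid = (low + high) / 2
        if chi2_sf(mid, df) > alpha:
            low = mid
        else:
            high = mid
    return (low + high) / 2


for df in (1, 5, 10, 20):
    print(df, round(critical_value(df), 2))
```

The same `critical_value` function accepts other α-levels, for example `critical_value(1, alpha=0.01)`.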
32Cross Tabs and Chi-Squared
- In our football problem above, we had a chi-squared of 17.68 in a cross classification table with 2 rows and 3 columns.
- Our chi-squared distribution for that table would have df = (2 - 1)(3 - 1) = 2. According to Appendix D, with α-level = .05, critical chi-squared is 5.99.
- Since 17.68 > 5.99, we reject the null.
- We reject that our sample could have come from a population where sex was not related to attitudes toward football.
- [Figure: χ² distribution for df = 2 with the upper 5% of χ² values shaded beyond 5.99; "My Chi-squared" marked at 17.68]
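The decision for the football problem can be sketched in code. This example leans on a special case: for df = 2 only, the upper-tail probability of the χ² distribution is exactly exp(-x/2), so both the critical value and the p-value come from the standard library. Other df values need a table or a statistics package; the variable names are illustrative.

```python
import math

chi2 = 17.68           # chi-squared from the football example
rows, columns = 2, 3   # size of the cross classification table
alpha = 0.05

df = (rows - 1) * (columns - 1)
if df != 2:
    raise ValueError("the closed form used below only holds for df = 2")

# For df = 2, P(chi-squared >= x) = exp(-x / 2).
p_value = math.exp(-chi2 / 2)
critical = -2 * math.log(alpha)  # the x whose upper tail is alpha; about 5.99

reject_null = chi2 > critical
print(f"df = {df}, critical = {critical:.2f}, reject null: {reject_null}")
print(f"p-value = {p_value:.5f}")
```

The p-value comes out well below .001, which is another way of saying that 17.68 sits far out in the tail beyond 5.99.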
33Cross Tabs and Chi-Squared
- Now let's get formal
- 7 steps to Chi-squared test of independence:
- 1. Set α-level (e.g., .05)
- 2. Find Critical χ² (depends on df and α-level)
- 3. The null and alternative hypotheses:
- Ho: The two nominal variables are independent
- Ha: The two variables are dependent on each other
- 4. Collect Data
- 5. Calculate χ²: χ² = Σ ((fo - fe)² / fe)
- 6. Make decision about the null hypothesis
- 7. Report the p-value
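Steps 2, 5, and 6 can be strung together in a short sketch. The critical values are simply the α = .05 entries quoted on the slides, so the lookup fails for any other df, and step 7 (the p-value) is left out. The function name `chi_squared_test` and its return layout are assumptions for this illustration.

```python
# Critical chi-squared values at alpha = .05, as quoted on the slides.
CRITICAL_05 = {1: 3.84, 2: 5.99, 5: 11.07, 10: 18.31, 20: 31.41}


def chi_squared_test(table, critical_values=CRITICAL_05):
    """Return (chi2, df, critical, reject_null) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    df = (len(row_totals) - 1) * (len(column_totals) - 1)
    critical = critical_values[df]  # KeyError if df is not in the lookup table
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_totals[i] * column_totals[j] / n
            chi2 += (fo - fe) ** 2 / fe
    return chi2, df, critical, chi2 > critical


chi2, df, critical, reject_null = chi_squared_test([[185, 200, 65], [80, 65, 55]])
print(f"chi2 = {chi2:.2f}, df = {df}, critical = {critical}, reject Ho: {reject_null}")
```

In practice a statistics package would also report the p-value; the lookup table here only stands in for Appendix D.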
34Cross Tabs and Chi-Squared
- Afterwards, what have you found?
- If Chi-Squared is not significant, you cannot reject that your variables are unrelated.
- If Chi-Squared is significant, you conclude that your variables are related.
- That's All!
- Chi-Squared cannot tell you anything like the strength or direction of association. For purely nominal variables, there is no direction of association.
- Chi-Squared is a large-sample test. If dealing with small samples, look up appropriate tests. (A condition of the test: no expected frequency lower than 5 in any cell.)
- The larger the sample size, the easier it is for Chi-Squared to be significant.
- For a 2 x 2 table, Chi-Squared gives the same result as an Independent Samples t-test for proportions and ANOVA.
35Cross Tabs and Chi-Squared
- If you want to know how you depart from independence, you may:
- Check percentages (conditional distributions) in your cross classification table.
- Do a residual analysis:
- The difference between observed and expected counts in a cell behaves like a significance test when divided by a standard error for the difference.
- That s.e. = √(fe × (1 - cell's row proportion) × (1 - cell's column proportion))
- Z = (fo - fe) / s.e.
36Cross Tabs and Chi-Squared
- Residual Analysis
- Let's do cell 5 (Male, No), using s.e. = √(fe × (1 - cell's row proportion) × (1 - cell's column proportion)) and Z = (fo - fe) / s.e.
- For cell 5, fo = 65 and fe5 = (200/650) × 265 = 81.5
- Cell 5's row proportion = 200/650 = .308; column proportion = 265/650 = .408
- s.e. = √(81.5 × .692 × .592) = 5.78
- Z = (65 - 81.5) / 5.78 = -2.85. Since |-2.85| = 2.85 > 1.96, there is a significant difference in cell 5.
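The same residual can be computed for every cell at once. The sketch below follows the s.e. formula given above; the function name `adjusted_residuals` is a label chosen here. With unrounded intermediate values, cell 5 comes out near -2.86 rather than the hand-calculated -2.85.

```python
import math

# Observed counts from the SJSU football example.
observed = [
    [185, 200, 65],  # Female: Yes, No, Maybe
    [80, 65, 55],    # Male:   Yes, No, Maybe
]


def adjusted_residuals(table):
    """Z = (fo - fe) / sqrt(fe * (1 - row proportion) * (1 - column proportion))."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        z_row = []
        for j, fo in enumerate(row):
            fe = row_totals[i] * column_totals[j] / n
            se = math.sqrt(
                fe * (1 - row_totals[i] / n) * (1 - column_totals[j] / n)
            )
            z_row.append((fo - fe) / se)
        residuals.append(z_row)
    return residuals


z = adjusted_residuals(observed)
for row in z:
    print([round(value, 2) for value in row])
```

Cells whose Z is beyond ±1.96 are the ones departing from independence at the .05 level; in a table with only two rows, each column's two residuals are equal in size and opposite in sign.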
37Cross Tabs and Chi-Squared
- Further topics you could explore:
- Strength of Association:
- Discussing outcomes in terms of difference of proportions
- Reporting Odds Ratios (likelihood of a group giving one answer versus other answers, or the group giving an answer relative to other groups giving that answer)
- Yule's Q and Phi for 2x2 tables, ranging from -1 to 1, with 0 indicating no relationship and values near -1 or 1 a strong relationship
- Strength and Direction of Association for Ordinal (not nominal) Variables:
- Gamma (an inferential statistic, so check for significance)
- Ranges from -1 to 1
- Valence indicates direction of relationship
- Magnitude indicates strength of relationship
- Chi-squared and Gamma can disagree when there is a nonrandom pattern that has no direction. Chi-squared will catch it, gamma won't.
- Tau-c
- Kendall's tau-b
- Somers' d
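As a small illustration of two of the 2x2 measures listed above, the sketch below computes Yule's Q and Phi from their standard textbook formulas. The slides do not work an example, so the football table is collapsed here to Yes versus (No or Maybe) purely for demonstration; that collapse and the function names are assumptions.

```python
import math


def yules_q(a, b, c, d):
    """Yule's Q = (ad - bc) / (ad + bc) for the 2x2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / (a * d + b * c)


def phi(a, b, c, d):
    """Phi = (ad - bc) / sqrt(product of the two row and two column totals)."""
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    return (a * d - b * c) / math.sqrt(margins)


# Illustrative collapse of the football table: Yes versus (No or Maybe).
female_yes, female_other = 185, 200 + 65
male_yes, male_other = 80, 65 + 55

q = yules_q(female_yes, female_other, male_yes, male_other)
phi_value = phi(female_yes, female_other, male_yes, male_other)
print(round(q, 3), round(phi_value, 3))
```

Both values land close to 0 for this collapsed table, consistent with the earlier observation that men and women were about equally likely to say yes.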
38Cross Tabs and Chi-Squared
- Controlling for a third variable.
- One can see the relationship between two variables for each level of a third variable.
- E.g., Sex and Football by Lower or Upper Division:
- Division   Sex   Yes   No   Maybe
- Upper      F
- Upper      M
- Lower      F
- Lower      M
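Building one cross tab per level of the control variable can be sketched as follows. The records below are made up purely to show the mechanics (the slides give no division-level data), and the name `layered_crosstab` is hypothetical.

```python
from collections import Counter

# Hypothetical records (division, sex, answer), invented for illustration only.
records = [
    ("Upper", "F", "Yes"),
    ("Upper", "F", "No"),
    ("Upper", "M", "Maybe"),
    ("Lower", "F", "Yes"),
    ("Lower", "M", "Yes"),
    ("Lower", "M", "No"),
]


def layered_crosstab(rows):
    """Count (sex, answer) pairs separately within each level of the third variable."""
    tables = {}
    for division, sex, answer in rows:
        tables.setdefault(division, Counter())[(sex, answer)] += 1
    return tables


tables = layered_crosstab(records)
for division, counts in tables.items():
    print(division, dict(counts))
```

Each per-division table could then be given its own chi-squared test, provided its expected frequencies are large enough.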
39Cross Tabs and Chi-Squared
40Cross Tabs and Chi-Squared
- Example: Sex and Pornlaw by Sex Education