Title: Crosstabulation and Measures of Association
1. Crosstabulation and Measures of Association

2.
- Investigating the relationship between two variables.
- Generally, a statistical relationship exists if the values of the observations for one variable are associated with the values of the observations for another variable.
- Knowing that two variables are related allows us to make predictions: if we know the value of one, we can predict the value of the other.

3.
- Determining how the values of one variable are related to the values of another is one of the foundations of empirical science.
- In making such determinations we must consider the following features of the relationship.

4.
- 1.) The level of measurement of the variables. Different variables necessitate different procedures.
- 2.) The form of the relationship. We can ask whether changes in X move in lockstep with changes in Y or whether a more sophisticated relationship exists.
- 3.) The strength of the relationship. Is it possible that some levels of X will always be associated with certain levels of Y?

5.
- 4.) Numerical summaries of the relationship. Social scientists strive to boil down the different aspects of a relationship to a single number that reveals the type and strength of the association.
- 5.) Conditional relationships. The variables X and Y may seem to be related in some fashion, but appearances can be deceiving (spuriousness, for example), so we need to know whether introducing other variables into the analysis changes the relationship.

6. Types of Association
- 1.) General association: the variables are simply associated in some way.
- 2.) Positive monotonic correlation: when the variables have order (ordinal or continuous), high values of one variable are associated with high values of the other, and low values with low values.
- 3.) Negative monotonic correlation: low values of one variable are associated with high values of the other.

7. Types of Association (cont.)
- 4.) Positive linear association: a particular type of positive monotonic relationship in which the plotted X-Y values fall on a straight line that slopes upward.
- 5.) Negative linear association: the plotted X-Y values fall on a straight line that slopes downward.

8. Strength of Relationships
- Virtually no relationships between variables in social science (and largely in natural science as well) have a perfect form.
- As a result, it makes sense to talk about the strength of relationships.

9. Strength (cont.)
- The strength of a relationship between variables can be assessed by simply looking at a graph of the data.
- If the values of X and Y are tied together tightly, then the relationship is strong.
- If the X-Y points are spread out, then the relationship is weak.

10. Direction of Relationship
- We can also infer direction from a graph by observing how the values of our variables move across the graph.
- This is only true, however, when our variables are ordinal or continuous.

11. Types of Bivariate Relationships and Associated Statistics
- Nominal/Ordinal (including dichotomous)
  - Crosstabulation (Lambda, Chi-Square, Gamma, etc.)
- Interval and Dichotomous
  - Difference-of-means test
- Interval and Nominal/Ordinal
  - Analysis of variance
- Interval and Ratio
  - Regression and correlation

12. Assessing Relationships between Variables
- 1. Calculate the appropriate statistic to measure the magnitude of the relationship in the sample.
- 2. Calculate additional statistics to determine whether the relationship holds for the population of interest (statistical significance).
- Keep the distinction between substantive significance and statistical significance in mind.

13. What is a Crosstabulation?
- Crosstabulations are appropriate for examining relationships between variables that are nominal, ordinal, or dichotomous.
- Crosstabs show the values of one variable categorized by another variable.
- They display the joint distribution of values of the variables by listing the categories of one variable along the x-axis and those of the other along the y-axis.

14.
- Each case is then placed in the cell of the table that represents the combination of values corresponding to its scores on the two variables.

15. What is a Crosstabulation?
- Example: We would like to know whether presidential vote choice in 2000 was related to race.
- Vote choice: Gore or Bush
- Race: White, Hispanic, Black

16. Are Race and Vote Choice Related? Why?
(crosstab of race and 2000 vote choice shown on the original slide)

17. Are Race and Vote Choice Related? Why?
(crosstab shown on the original slide)

18. Measures of Association for Crosstabulations
- Purpose: to determine whether nominal/ordinal variables are related in a crosstabulation.
- At least one nominal variable:
  - Lambda
  - Chi-Square
  - Cramer's V
- Two ordinal variables:
  - Tau
  - Gamma

19. Measures of Association for Crosstabulations
- These measures of association provide correlation coefficients that summarize the data in a table in a single number.
- This is extremely useful when dealing with several tables or very complex tables.
- These coefficients measure both the strength and the direction of an association.

20. Coefficients for Nominal Data
- When one or both of the variables are nominal, ordinal coefficients cannot be used because there is no underlying ordering.
- Instead we use PRE measures.

21. Lambda (a PRE coefficient)
- PRE: Proportional Reduction in Error.
- Two rules:
  - 1.) Make a prediction of the value of an observation in the absence of any prior information.
  - 2.) Given information on a second variable, take it into account in making the prediction.

22. Lambda (PRE)
- If the two variables are associated, then using rule two should lead to fewer errors in your predictions than rule one.
- How many fewer errors depends on how closely the variables are associated.
- PRE = (E1 - E2) / E1, where E1 and E2 are the prediction errors under rules one and two (see the sketch below).
- The scale runs from 0 to 1.
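
A minimal sketch of the PRE computation (the helper name `pre` is mine; the error counts 60 and 50 come from the voting example worked out on the following slides):

```python
def pre(errors_rule1, errors_rule2):
    """Proportional reduction in error: (E1 - E2) / E1."""
    return (errors_rule1 - errors_rule2) / errors_rule1

# Error counts from the voting example later in these slides:
# 60 errors guessing the overall modal category (rule one),
# 50 errors once region is taken into account (rule two).
print(pre(60, 50))  # 0.1666... -> roughly a 17% reduction in error
```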

23. Lambda
- Lambda is a PRE coefficient; it relies on rules 1 and 2 above.
- When applying rule one, all we have to go on is what proportion of the population fits into one category as opposed to another.
- So, without any other information, guessing that every observation is in the modal category gives you the best chance of getting the most predictions correct.

24. Why?
- Think of it like this: if you knew that I tended to write exams on which the most frequent answer was B, then, without any other information, you would be best served to pick B every time.

25.
- But if you know each case's value on another variable, rule two directs you to look only at the members of each category of that new variable and to find the modal category of the dependent variable within it.

26. Example
- Suppose we have a sample of 100 voters and need to predict how each will vote in the general election.
- Assume we know that, overall, 30 voted Democrat, 30 voted Republican, and 40 voted Independent.
- Now suppose we take one person out of the group (John Smith); our best guess would be that he voted Independent.

27.
- Now suppose we take another person (Larry Mendez); again we would guess that he voted Independent.
- As a result, our best strategy is to predict that all 100 voters voted Independent.
- We are sure to get some wrong, but it's the best we can do over the long run.

28.
- How many do we get wrong? 60.
- Suppose now that we know something about the voters' regions (where they are from) and how each region voted in the election.
- Region totals: NE = 30, MW = 20, SO = 30, WE = 20.

29. Lambda
(vote-by-region crosstab shown on the original slide)

30. Lambda, Rule 1 (prediction based solely on knowledge of the marginal distribution of the dependent variable, partisanship)

31. Lambda, Rule 2 (prediction based on knowledge provided by the independent variable)

32. Lambda: Calculation of Errors
- Errors with Rule 1: 18 + 12 + 14 + 16 = 60
- Errors with Rule 2: 16 + 10 + 14 + 10 = 50
- Lambda = (Errors R1 - Errors R2) / Errors R1
- Lambda = (60 - 50)/60 = 10/60 = .17 (see the sketch below)
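
The same lambda calculation, scripted. The cell counts below are hypothetical, since the slides' table image is not reproduced in this text; they were chosen only to be consistent with the totals shown above (regions NE = 30, MW = 20, SO = 30, WE = 20; rule-1 errors 18 + 12 + 14 + 16 = 60; rule-2 errors 16 + 10 + 14 + 10 = 50):

```python
# Hypothetical region-by-party table consistent with the slides' totals.
table = {
    "NE": {"Dem": 14, "Rep": 4, "Ind": 12},
    "MW": {"Dem": 2, "Rep": 10, "Ind": 8},
    "SO": {"Dem": 8, "Rep": 6, "Ind": 16},
    "WE": {"Dem": 6, "Rep": 10, "Ind": 4},
}

# Rule 1: guess the modal category of the dependent variable overall.
totals = {}
for row in table.values():
    for party, n in row.items():
        totals[party] = totals.get(party, 0) + n
n_cases = sum(totals.values())
errors_rule1 = n_cases - max(totals.values())   # 100 - 40 = 60

# Rule 2: within each region, guess that region's modal category.
errors_rule2 = sum(sum(row.values()) - max(row.values())
                   for row in table.values())   # 50

lam = (errors_rule1 - errors_rule2) / errors_rule1  # 'lambda' is reserved
print(lam)  # 0.1666... ~ .17, matching the slide
```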

33. Lambda
- A PRE measure.
- Ranges from 0 to 1.
- Potential problems with lambda:
  - It underestimates the relationship when one or both variables are highly skewed.
  - It is always 0 when the modal category of Y is the same across all categories of X.

34. Chi-Square (χ²)
- Also appropriate for any crosstabulation with at least one nominal variable (and another nominal/ordinal variable).
- Based on the difference between the empirically observed crosstab and what we would expect to observe if the two variables were statistically independent.

35. Background for χ²
- Statistical independence: a property of two variables in which the probability that an observation is in a particular category of one variable and also in a particular category of the other variable equals the product of the marginal probabilities of being in those categories.
- Plays a large role in data analysis.
- Is another way to view the strength of a relationship.

36. Example
- Suppose we have two nominal or categorical variables, X and Y. Label the categories of the first (a, b, c) and those of the second (r, s, t).
- Let P(X = a) stand for the probability that a randomly selected case has property a on variable X, and P(Y = r) stand for the probability that a randomly selected case has property r on variable Y.

37.
- These two probabilities are called marginal probabilities; each simply refers to the chance that an observation has a particular value on a particular variable, irrespective of its value on the other variable.

38.
- Finally, let P(X = a, Y = r) stand for the joint probability that a randomly selected observation has both property a and property r simultaneously.
- Statistical independence: the two variables are statistically independent only if the chance of observing a combination of categories equals the marginal probability of one category times the marginal probability of the other.

39. Background for χ²
- P(X = a, Y = r) = P(X = a) × P(Y = r)
- For example, if men are as likely to vote as women, then the two variables (gender and voter turnout) are statistically independent: the probability of observing a male nonvoter in the sample equals the probability of observing a male times the probability of observing a nonvoter.

40. Example
- If 100 of 300 respondents are men and 210 of 300 voted, then the marginal probabilities are:
- P(X = m) = 100/300 = .33 and P(Y = v) = 210/300 = .70
- .33 × .70 = .23, the joint probability we would expect under independence.

41.
- If we know that 70 of the voters are male, taking that count and dividing by the total sample size (70/300) also gives .23.
- We can therefore say that the two variables are independent (see the sketch below).
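
A quick check of that independence claim, using only the counts given on the slides (100 men and 210 voters out of 300, of whom 70 are male voters):

```python
n = 300
p_male = 100 / n           # marginal probability of being male, ~ .33
p_voted = 210 / n          # marginal probability of voting, .70
joint_observed = 70 / n    # observed share of male voters, ~ .23

# Under statistical independence the joint probability equals the
# product of the marginals.
print(round(p_male * p_voted, 4))  # 0.2333
print(round(joint_observed, 4))    # 0.2333 -> consistent with independence
```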

42.
- The chi-squared statistic essentially compares an observed result (the table produced by the sample) with a hypothetical table that would occur if, in the population, the variables were statistically independent.
- A value of 0 implies statistical independence, which means no association.

43.
- Chi-squared increases as the departure of the observed values from the expected values grows. There is no upper limit to how big the difference can become, but if it exceeds a critical value then there is reason to reject the null hypothesis that the two variables are independent.

44. How Do We Calculate χ²?
- The observed frequencies are already in the crosstab.
- The expected frequency in each table cell is found by multiplying the row and column marginal totals and dividing by the sample size.

45. Chi-Square (χ²)
(observed crosstab shown on the original slide)

46. Calculating Expected Frequencies
- To calculate the expected cell frequency for NE Republicans:
- E/30 = 30/100, therefore E = (30 × 30)/100 = 9 (see the sketch below).

47. Calculating the Chi-Square Statistic
- The chi-square statistic is calculated as:
- χ² = Σ (O_ik - E_ik)² / E_ik, where O_ik and E_ik are the observed and expected frequencies in cell ik.
- Here: χ² = 25/9 + 16/6 + 9/9 + 16/6 + 0 + 0 + 16/12 + 16/8 + 25/9 + 16/6 + 1/9 + 0 = 18 (see the sketch below).
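
In practice the whole computation can be handed to scipy.stats.chi2_contingency, which returns the statistic, p-value, degrees of freedom, and expected table. The observed table below is hypothetical (the slides' own table is not reproduced in this text) but matches the marginals used above and reproduces χ² = 18:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 4x3 observed table (regions NE/MW/SO/WE by Dem/Rep/Ind),
# consistent with the slides' marginals (rows 30/20/30/20, columns 30/30/40).
observed = np.array([
    [14,  4, 12],
    [ 2, 10,  8],
    [ 8,  6, 16],
    [ 6, 10,  4],
])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(chi2_stat, dof)      # 18.0, df = (4 - 1) * (3 - 1) = 6
print(round(p_value, 4))   # ~0.0062
```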

48.
- The value 9 is the expected frequency in the first cell of the table: what we would expect in a sample of 100 (with 30 Republicans and 30 north-easterners) if there were statistical independence in the population.
- This is more than the observed count in our sample, so there is a difference.

49. Just Like the Hypothesis Test
- Null: statistical independence between X and Y.
- Alternative: X and Y are not independent.

50. Interpreting the Chi-Square Statistic
- The chi-square statistic ranges from 0 to infinity.
- 0 = perfect statistical independence.
- Even if two variables are statistically independent in the population, in a sample the chi-square statistic may be > 0.
- Therefore it is necessary to determine the statistical significance of a chi-square statistic, given a certain level of confidence (see the sketch below).
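
As a sketch of that significance check, SciPy's chi-squared distribution gives the p-value and the α = .05 critical value for the chi-square example above (χ² = 18, df = 6):

```python
from scipy.stats import chi2

statistic, df = 18.0, 6            # df = (rows - 1) * (columns - 1)
p_value = chi2.sf(statistic, df)   # survival function = 1 - CDF
critical = chi2.ppf(0.95, df)      # critical value at alpha = .05
print(round(p_value, 4))   # 0.0062 -> reject independence at .05
print(round(critical, 2))  # 12.59
```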

51. Cramer's V
- Problem with chi-square: it is not comparable across different sample sizes (and their associated crosstabs).
- Cramer's V is a standardization of the chi-square statistic.

52. Calculating Cramer's V
- V = √(χ² / (N × min(R - 1, C - 1)))
- where R = number of rows and C = number of columns.
- V ranges from 0 to 1.
- Example (region and partisanship):
- V = √(18 / (100 × 2)) = √.09 = .30 (see the sketch below).
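
A sketch of the computation (the helper name is mine), using the slides' numbers: χ² = 18, N = 100, and a 4 × 3 table:

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V: sqrt(chi2 / (N * min(R - 1, C - 1)))."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

print(cramers_v(18, 100, 4, 3))  # 0.3
```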

53. Relationships between Ordinal Variables
- There are several measures of association appropriate for relationships between ordinal variables:
- Gamma, Tau-b, Tau-c, Somers' d.
- All are based on identifying concordant, discordant, and tied pairs of observations.

54. Concordant Pairs: Ideology and Voting
- Ideology: conservative (1), moderate (2), liberal (3)
- Voting: never (1), sometimes (2), often (3)
- Consider two hypothetical individuals in the sample with these scores:
- Individual A: Ideology = 1, Voting = 1
- Individual B: Ideology = 2, Voting = 2
- Pair AB is a concordant pair because B's ideology score is greater than A's and B's voting score is greater than A's.

55. Concordant Pairs (cont'd)
- All of the following are concordant pairs:
- A(1,1) B(2,2)
- A(1,1) B(2,3)
- A(1,1) B(3,2)
- A(1,2) B(2,3)
- A(2,2) B(3,3)
- Concordant pairs are consistent with a positive relationship between the IV and the DV (ideology and voting).

56. Discordant Pairs
- All of the following are discordant pairs:
- A(1,2) B(2,1)
- A(1,3) B(2,2)
- A(2,2) B(3,1)
- A(1,2) B(3,1)
- A(3,1) B(1,2)
- Discordant pairs are consistent with a negative relationship between the IV and the DV (ideology and voting).

57. Identifying Concordant Pairs
- Concordant pairs for Never-Conservative (1,1):
- Concordant = 80×70 + 80×10 + 80×20 + 80×80 = 14,400

58. Identifying Concordant Pairs
- Concordant pairs for Never-Moderate (1,2):
- Concordant = 10×10 + 10×80 = 900

59. Identifying Discordant Pairs
- Discordant pairs for Often-Conservative (1,3):
- Discordant = 0×10 + 0×10 + 0×70 + 0×10 = 0

60. Identifying Discordant Pairs
- Discordant pairs for Often-Moderate (2,3):
- Discordant = 20×10 + 20×10 = 400

61. Gamma
- Gamma is calculated by identifying all possible pairs of individuals in the sample and determining whether each pair is concordant or discordant.
- Gamma = (C - D) / (C + D)

62. Interpreting Gamma
- Gamma = 21,400/24,400 = .88
- Gamma ranges from -1 to 1.
- Gamma does not account for tied pairs.
- Tau (b and c) and Somers' d account for tied pairs in different ways (see the sketch below).
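
A brute-force sketch of the gamma computation. The 3 × 3 table below (rows = ideology, columns = voting) is reconstructed from the partial pair counts on the preceding slides; the helper function is mine:

```python
# Ideology (conserv, moderate, liberal) by voting (never, sometimes, often),
# reconstructed from the slides' concordant/discordant counts.
table = [
    [80, 20,  0],   # conservative
    [10, 70, 20],   # moderate
    [10, 10, 80],   # liberal
]

def gamma(t):
    """Gamma = (C - D) / (C + D) over all pairs of observations."""
    rows, cols = len(t), len(t[0])
    concordant = discordant = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):
                for l in range(cols):
                    if l > j:    # lower-right cells: concordant pairs
                        concordant += t[i][j] * t[k][l]
                    elif l < j:  # lower-left cells: discordant pairs
                        discordant += t[i][j] * t[k][l]
    return (concordant - discordant) / (concordant + discordant)

print(gamma(table))  # (22900 - 1500) / 24400 = 0.877... ~ .88
```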

63. Square Tables / Non-Square Tables
(example tables shown on the original slide)

64. Example
- NES 2004: What explains variation in one's political ideology?
- Income?
- Education?
- Religion?
- Race?

65. Bivariate Relationships and Hypothesis Testing (Significance Testing)
- 1. Determine the null and alternative hypotheses.
- Null: There is no relationship between X and Y (X and Y are statistically independent and the test statistic = 0).
- Alternative: There IS a relationship between X and Y (the test statistic does not equal 0).

66. Bivariate Relationships and Hypothesis Testing
- 2. Determine the appropriate test statistic (based on the measurement levels of X and Y).
- 3. Identify the type of sampling distribution for the test statistic, and what it would look like if the null hypothesis were true.

67. Bivariate Relationships and Hypothesis Testing
- 4. Calculate the test statistic from the sample data and determine the probability of observing a test statistic this large (in absolute terms) if the null hypothesis is true.
- P-value (significance level): the probability of observing a test statistic at least as large as our observed test statistic if, in fact, the null hypothesis is true.

68. Bivariate Relationships and Hypothesis Testing
- 5. Choose an alpha level: a decision rule to guide us in determining which values of the p-value lead us to reject or not reject the null hypothesis.
- When the p-value is extremely small, we reject the null hypothesis (why?). The relationship is deemed statistically significant.
- When the p-value is not small, we do not reject the null hypothesis (why?). The relationship is deemed statistically insignificant.
- Most common alpha level: .05.

69. Bottom Line
- Assuming we will always use an alpha level of .05:
- Reject the null hypothesis if the p-value < .05.
- Do not reject the null hypothesis if the p-value > .05 (see the sketch below).
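
The decision rule written out in code (the function name is mine):

```python
def decide(p_value, alpha=0.05):
    """Reject the null hypothesis when the p-value falls below alpha."""
    return "reject null" if p_value < alpha else "fail to reject null"

print(decide(0.0062))  # 'reject null' (e.g., the chi-square example above)
print(decide(0.20))    # 'fail to reject null'
```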

70. An Example
- Dependent variable: vote choice in 2000 (Gore, Bush, Nader).
- Independent variable: ideology (liberal, moderate, conservative).

71. An Example
- 1. Determine the null and alternative hypotheses.

72. An Example
- Null hypothesis: There is no relationship between ideology and vote choice in 2000.
- Alternative (research) hypothesis: There is a relationship between ideology and vote choice (liberals were more likely to vote for Gore, while conservatives were more likely to vote for Bush).

73. An Example
- 2. Determine the appropriate test statistic (based on the measurement levels of X and Y).
- 3. Identify the type of sampling distribution for the test statistic, and what it would look like if the null hypothesis were true.

74. Sampling Distributions for the Chi-Squared Statistic (under the assumption of perfect independence)
- df = (rows - 1) × (columns - 1)

75. Bivariate Relationships and Hypothesis Testing
- 4. Calculate the test statistic from the sample data and determine the probability of observing a test statistic this large (in absolute terms) if the null hypothesis is true.
- P-value (significance level): the probability of observing a test statistic at least as large as our observed test statistic if, in fact, the null hypothesis is true.

76. Bivariate Relationships and Hypothesis Testing
- 5. Choose an alpha level: a decision rule to guide us in determining which values of the p-value lead us to reject or not reject the null hypothesis.
- When the p-value is extremely small, we reject the null hypothesis (why?). The relationship is deemed statistically significant.
- When the p-value is not small, we do not reject the null hypothesis (why?). The relationship is deemed statistically insignificant.
- Most common alpha level: .05.

77. In-Class Exercise
- For some years now, political commentators have cited the importance of a gender gap in explaining election outcomes. What is the source of the gender gap?
- Develop a simple theory and a corresponding hypothesis (with gender as the independent variable) that seeks to explain the source of the gender gap.
- Specifically, determine:
  - Theory
  - Null and research hypothesis
  - Test statistic for a cross-tabulation to test your hypothesis