1
Crosstabulation and Measures of Association
2
  • Investigating the relationship between two
    variables
  • Generally a statistical relationship exists if
    the values of the observations for one variable
    are associated with the values of the
    observations for another variable
  • Knowing that two variables are related allows us
    to make predictions.
  • If we know the value of one, we can predict the
    value of the other.

3
  • Determining how the values of one variable are
    related to the values of another is one of the
    foundations of empirical science.
  • In making such determinations we must consider
    the following features of the relationship.

4
  • 1.) The level of measurement of the variables. Different variables necessitate different procedures.
  • 2.) The form of the relationship. We can ask if
    changes in X move in lockstep with changes in Y
    or if a more sophisticated relationship exists.
  • 3.) The strength of the relationship. Is it
    possible that some levels of X will always be
    associated with certain levels of Y?

5
  • 4.) Numerical Summaries of the relationship.
    Social scientists strive to boil down the
    different aspects of a relationship to a single
    number that reveals the type and strength of the
    association.
  • 5.) Conditional relationships. The variables X and Y may seem to be related in some fashion, but appearances can be deceiving (spuriousness, for example), so we need to know whether introducing other variables into the analysis changes the relationship.

6
Types of Association
  • 1.) General association: the variables are simply associated in some way.
  • 2.) Positive monotonic correlation: when the variables have order (ordinal or continuous), high values of one variable are associated with high values of the other, and low values with low values.
  • 3.) Negative monotonic correlation: low values of one variable are associated with high values of the other.

7
Types of Association Cont.
  • 4.) Positive linear association: a particular type of positive monotonic relationship where the plotted X-Y values fall on a straight line that slopes upward.
  • 5.) Negative linear relationship: the plotted values fall on a straight line that slopes downward.

8
Strength of Relationships
  • Virtually no relationships between variables in
    Social Science (and largely in natural science as
    well) have a perfect form.
  • As a result it makes sense to talk about the
    strength of relationships.

9
Strength Cont.
  • The strength of a relationship between variables
    can be found by simply looking at a graph of the
    data.
  • If the values of X and Y are tied together
    tightly then the relationship is strong.
  • If the X-Y points are spread out then the
    relationship is weak.

10
Direction of Relationship
  • We can also infer direction from a graph by
    simply observing how the values for our variables
    move across the graph.
  • This is only true, however, when our variables
    are ordinal or continuous.

11
Types of Bivariate Relationships and Associated
Statistics
  • Nominal/Ordinal (including dichotomous)
  • Crosstabulation (Lambda, Chi-Square, Gamma, etc.)
  • Interval and Dichotomous
  • Difference of means test
  • Interval and Nominal/Ordinal
  • Analysis of Variance
  • Interval and Ratio
  • Regression and correlation

12
Assessing Relationships between Variables
  • 1. Calculate the appropriate statistic to measure the magnitude of the relationship in the sample
  • 2. Calculate additional statistics to determine
    if the relationship holds for the population of
    interest (statistical significance)
  • Substantive significance vs. Statistical
    significance

13
What is a Crosstabulation?
  • Crosstabulations are appropriate for examining
    relationships between variables that are nominal,
    ordinal, or dichotomous.
  • Crosstabs show values for variables categorized
    by another variable.
  • They display the joint distribution of values of
    the variables by listing the categories for one
    along the x-axis and the other along the y-axis

14
  • Each case is then placed in a cell of the table
    that represents the combination of values that
    corresponds to its scores on the variables.

15
What is a Crosstabulation?
  • Example: We would like to know if presidential vote choice in 2000 was related to race.
  • Vote choice: Gore or Bush
  • Race: White, Hispanic, Black

16
Are Race and Vote Choice Related? Why?
17
Are Race and Vote Choice Related? Why?
18
Measures of Association for Crosstabulations
  • Purpose: to determine if nominal/ordinal variables are related in a crosstabulation
  • At least one nominal variable
  • Lambda
  • Chi-Square
  • Cramer's V
  • Two ordinal variables
  • Tau
  • Gamma

19
Measures of Association for Crosstabulations
  • These measures of association provide us with correlation coefficients that summarize data from a table into one number.
  • This is extremely useful when dealing with
    several tables or very complex tables.
  • These coefficients measure both the strength and
    direction of an association.

20
Coefficients for Nominal Data
  • When one or both of the variables are nominal,
    ordinal coefficients cannot be used because there
    is no underlying ordering.
  • Instead we use PRE tests

21
Lambda (PRE coefficient)
  • PRE = Proportional Reduction in Error
  • Two rules:
  • 1.) Make a prediction of the value of an observation in the absence of any prior information.
  • 2.) Given information on a second variable, take it into account in making the prediction.

22
Lambda PRE
  • If the two variables are associated then the use
    of rule two should lead to fewer errors in your
    predictions than rule one.
  • How many fewer errors depends upon how closely
    the variables are associated.
  • PRE = (E1 - E2) / E1
  • The scale goes from 0 to 1 (a quick sketch of the calculation follows below).
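As a quick illustration of the PRE formula, here is a minimal Python sketch; the error counts plugged in (60 and 50) come from the voting example worked a few slides later.

```python
def pre(errors_rule1, errors_rule2):
    """Proportional reduction in error: (E1 - E2) / E1."""
    return (errors_rule1 - errors_rule2) / errors_rule1

# 60 errors knowing only the marginal distribution of the dependent variable,
# 50 errors once the second variable (region) is taken into account.
print(round(pre(60, 50), 2))  # 0.17
```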

23
Lambda
  • Lambda is a PRE coefficient and it relies on rules 1 and 2 above.
  • When applying rule one all we have to go on is
    what proportion of the population fit into one
    category as opposed to another.
  • So, without any other information, guessing that
    every observation is in the modal category would
    give you the best chance of getting the most
    correct.

24
Why?
  • Think of it like this. If you knew that I tended
    to make exams where the most often used answer
    was B, then, without any other information, you
    would be best served to pick B every time.

25
  • But, if you know each case's value on a second variable, rule two directs you to look only at the cases in each category of that new variable and find the modal category of the dependent variable within it.

26
Example
  • Suppose we have a sample of 100 voters and need to predict how they will vote in the general election.
  • Assume we know that overall 30 voted Democrat, 30 voted Republican, and 40 were independent.
  • Now suppose we take one person out of the group
    (John Smith), our best guess would be that he
    would vote independent.

27
  • Now suppose we take another person (Larry Mendez)
    and again we would assume he voted independent.
  • As a result our best guess is to predict that all
    of the voters (all 100) were independent.
  • We are sure to get some wrong, but it's the best we can do over the long run.

28
  • How many do we get wrong? 60.
  • Suppose now that we know something about the voters' regions (where they are from) and how the vote broke down within each region.
  • NE = 30, MW = 20, SO = 30, WE = 20

29
Lambda
30
Lambda Rule 1 (prediction based solely on knowledge of the marginal distribution of the dependent variable, partisanship)
31
Lambda Rule 2 (prediction based on knowledge provided by the independent variable)
32
Lambda: Calculation of Errors
  • Errors w/Rule 1: 18 + 12 + 14 + 16 = 60
  • Errors w/Rule 2: 16 + 10 + 14 + 10 = 50
  • Lambda = (Errors R1 - Errors R2) / Errors R1
  • Lambda = (60 - 50)/60 = 10/60 = .17 (see the code sketch below)
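A generic way to compute lambda directly from a crosstab is sketched below. The cell counts used are one arrangement consistent with the marginals and error counts reported on these slides (the actual table appears only as a slide image), so treat them as illustrative.

```python
import numpy as np

def goodman_kruskal_lambda(table):
    """Lambda with the column variable treated as the dependent variable.

    table: 2D array of counts, rows = independent variable categories,
    columns = dependent variable categories.
    E1 = errors from predicting the overall modal column for every case (Rule 1).
    E2 = errors from predicting the modal column within each row (Rule 2).
    """
    t = np.asarray(table, dtype=float)
    e1 = t.sum() - t.sum(axis=0).max()
    e2 = (t.sum(axis=1) - t.max(axis=1)).sum()
    return (e1 - e2) / e1

# Region (rows: NE, MW, SO, WE) by partisanship (columns: Dem, Rep, Ind).
counts = [[14,  4, 12],
          [ 2, 10,  8],
          [ 8,  6, 16],
          [ 6, 10,  4]]
print(round(goodman_kruskal_lambda(counts), 2))  # (60 - 50) / 60 = 0.17
```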

33
Lambda
  • PRE measure
  • Ranges from 0-1
  • Potential problems with Lambda
  • Underestimates the relationship when one or both variables are highly skewed
  • Always 0 when modal category of Y is the same
    across all categories of X

34
Chi-Square (χ²)
  • Also appropriate for any crosstabulation with at
    least one nominal variable (and another
    nominal/ordinal variable)
  • Based on the difference between the empirically
    observed crosstab and what we would expect to
    observe if the two variables are statistically
    independent

35
Background for χ²
  • Statistical Independence: A property of two variables in which the probability that an observation is in a particular category of one variable and also in a particular category of the other variable equals the simple or marginal probability of being in those categories.
  • Plays a large role in data analysis
  • Is another way to view the strength of a relationship

36
Example
  • Suppose we have two nominal or categorical variables, X and Y. We label the categories of the first variable (a, b, c) and those of the second (r, s, t).
  • Let P(X = a) stand for the probability that a randomly selected case has property a on variable X, and P(Y = r) stand for the probability that a randomly selected case has property r on variable Y.

37
  • These two probabilities are called marginal distributions and simply refer to the chance that an observation has a particular value on a particular variable, irrespective of its value on another variable.

38
  • Finally, let us assume that P(X = a, Y = r) stands for the joint probability that a randomly selected observation has both property a and property r simultaneously.
  • Statistical Independence: The two variables are therefore statistically independent only if the chance of observing a combination of categories is equal to the marginal probability of one category times the marginal probability of the other.

39
Background for χ²
  • P(X = a, Y = r) = P(X = a) × P(Y = r)
  • For example, if men are as likely to vote as women, then the two variables (gender and voter turnout) are statistically independent because the probability of observing a male nonvoter in the sample is equal to the probability of observing a male times the probability of observing a nonvoter.

40
Example
  • If 100 of 300 cases are men and 210 of 300 voted, then
  • The marginal probabilities are
  • P(X = m) = 100/300 = .33 and P(Y = v) = 210/300 = .7
  • .33 × .7 = .23, the joint probability we would expect under independence

41
  • If we know that 70 of the cases are male voters, dividing that count by the total number of cases (70/300) also gives .23.
  • We can therefore say that the two variables are independent (this arithmetic is checked in the sketch below).
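A minimal check of this arithmetic in Python, using the counts from the example:

```python
# 300 cases in total: 100 men, 210 voters, and 70 cases that are both.
p_male  = 100 / 300   # ~.33
p_voter = 210 / 300   # .70
p_joint =  70 / 300   # observed joint probability, ~.23

# Under statistical independence the joint probability equals the
# product of the marginal probabilities.
print(round(p_male * p_voter, 2), round(p_joint, 2))  # 0.23 0.23
```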

42
  • The chi-squared statistic essentially compares an
    observed result (the table produced by the
    sample) with a hypothetical table that would
    occur if (in the population) the variables were
    statistically independent.
  • A value of 0 implies statistical independence, which means no association.

43
  • Chi-squared increases as the departure of the observed values from the expected values grows. There is no upper limit to how big the difference can become, but if it exceeds a critical value then there is reason to reject the null hypothesis that the two variables are independent.

44
How Do We Calculate χ²?
  • The observed frequencies are already in the
    crosstab.
  • The expected frequencies in each table cell are
    found by multiplying the row and the column
    marginal totals and dividing by the sample size.

45
Chi-Square (χ²)
46
Calculating Expected Frequencies
  • To calculate the expected cell frequency for NE Republicans:
  • E/30 = 30/100, therefore E = (30 × 30)/100 = 9

47
Calculating the Chi-Square Statistic
  • The chi-square statistic is calculated as
  • χ² = Σ (Observed frequency - Expected frequency)² / Expected frequency, summed over all cells
  • χ² = (25/9) + (16/6) + (9/9) + (16/6) + (0) + (0) + (16/12) + (16/8) + (25/9) + (16/6) + (1/9) + (0) = 18 (see the sketch below)
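The whole calculation can be sketched in a few lines of Python. The observed table below is one arrangement consistent with the marginals and cell-by-cell terms reported in the deck (the actual table appears only as a slide image), so it is illustrative rather than the authoritative data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Region (rows: NE, MW, SO, WE) by partisanship (columns: Dem, Rep, Ind).
observed = np.array([[14,  4, 12],
                     [ 2, 10,  8],
                     [ 8,  6, 16],
                     [ 6, 10,  4]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()  # row total x column total / N

chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(expected[0])   # [ 9.  9. 12.] -- NE row; E = 9 for NE Republicans, as on the slide
print(chi2_stat)     # 18.0

# scipy reproduces the same statistic and degrees of freedom.
stat, p, dof, _ = chi2_contingency(observed, correction=False)
print(stat, dof)     # 18.0 6
```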

48
  • The value 9 is the expected frequency in the first cell of the table and is what we would expect in a sample of 100 (with 30 Republicans and 30 Northeasterners) if there is statistical independence in the population.
  • This is more than we observe in our sample, so there is a difference.

49
Just Like a Hypothesis Test
  • Null: statistical independence between X and Y
  • Alt: X and Y are not independent.

50
Interpreting the Chi-Square Statistic
  • The Chi-Square statistic ranges from 0 to infinity
  • 0 = perfect statistical independence
  • Even though two variables may be statistically independent in the population, in a sample the Chi-Square statistic may be > 0
  • Therefore it is necessary to determine statistical significance for a Chi-Square statistic, given a certain level of confidence (see the sketch below)
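A sketch of that significance check for the worked example, assuming the χ² of 18 from the region-by-partisanship table and df = (rows - 1)(columns - 1) = 6 (the df formula appears on a later slide):

```python
from scipy.stats import chi2

df = (4 - 1) * (3 - 1)     # (rows - 1)(columns - 1) for the 4 x 3 table
p_value = chi2.sf(18, df)  # probability of a statistic at least this large under independence
print(round(p_value, 4))   # ~0.0062, well below the usual .05 cutoff
```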

51
Cramer's V
  • Problem with Chi-Square: it is not comparable across different sample sizes (and their associated crosstabs)
  • Cramer's V is a standardization of the Chi-Square statistic

52
Calculating Cramer's V
  • V = √(χ² / (N × min(R - 1, C - 1)))
  • Where R = rows and C = columns
  • V ranges from 0 to 1
  • Example (region and partisanship):
  • V = √(18 / (100 × 2)) = √.09 = .30 (see the sketch below)
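A minimal sketch of this standardization in Python:

```python
import math

def cramers_v(chi2_stat, n, n_rows, n_cols):
    """Cramer's V = sqrt(chi2 / (N * min(R - 1, C - 1)))."""
    return math.sqrt(chi2_stat / (n * min(n_rows - 1, n_cols - 1)))

# Region (4 rows) by partisanship (3 columns), chi2 = 18, N = 100.
print(round(cramers_v(18, 100, 4, 3), 2))  # sqrt(.09) = 0.3
```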

53
Relationships between Ordinal Variables
  • There are several measures of association
    appropriate for relationships between ordinal
    variables
  • Gamma, Tau-b, Tau-c, Somers' d
  • All are based on identifying concordant,
    discordant, and tied pairs of observations

54
Concordant Pairs: Ideology and Voting
  • Ideology - conserv (1), moderate (2), liberal (3)
  • Voting - never (1), sometimes (2), often (3)
  • Consider two hypothetical individuals in the sample with scores
  • Individual A: Ideology = 1, Voting = 1
  • Individual B: Ideology = 2, Voting = 2
  • Pair AB is considered a concordant pair because B's ideology score is greater than A's score, and B's voting score is greater than A's score

55
Concordant Pairs (cont'd)
  • All of the following are concordant pairs
  • A(1,1) B(2,2)
  • A(1,1) B(2,3)
  • A(1,1) B(3,2)
  • A(1,2) B(2,3)
  • A(2,2) B(3,3)
  • Concordant pairs are consistent with a positive
    relationship between the IV and the DV (ideology
    and voting)

56
Discordant Pairs
  • All of the following are discordant pairs
  • A(1,2) B(2,1)
  • A(1,3) B(2,2)
  • A(2,2) B(3,1)
  • A(1,2) B(3,1)
  • A(3,1) B(1,2)
  • Discordant pairs are consistent with a negative
    relationship between the IV and the DV (ideology
    and voting)

57
Identifying Concordant Pairs
  • Concordant Pairs for Never - Conserv (1,1)
  • Concordant = 80×70 + 80×10 + 80×20 + 80×80
  • = 14,400

58
Identifying Concordant Pairs
  • Concordant Pairs for Never - Moderate (1,2)
  • Concordant 1010 1080 900

59
Identifying Discordant Pairs
  • Discordant Pairs for Often - Conserv (1,3)
  • Discordant 010 010 070 010 0

60
Identifying Discordant Pairs
  • Discordant Pairs for Often - Moderate (2,3)
  • Discordant 2010 2010

61
Gamma
  • Gamma is calculated by identifying all possible pairs of individuals in the sample and determining if they are concordant or discordant
  • Gamma = (C - D) / (C + D) (a counting sketch follows below)
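A sketch of the pair counting in Python. The ideology-by-voting cell counts below are reconstructed from the pair calculations shown on the preceding slides (the table itself appears only as a slide image), so read them as a plausible reconstruction rather than the authoritative data.

```python
import numpy as np

def gamma(table):
    """Goodman-Kruskal gamma for an ordinal crosstab.

    Rows and columns must be ordered from low to high on their variables.
    """
    t = np.asarray(table, dtype=float)
    n_rows, n_cols = t.shape
    concordant = discordant = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Cells below and to the right pair concordantly with cell (i, j).
            concordant += t[i, j] * t[i + 1:, j + 1:].sum()
            # Cells below and to the left pair discordantly with cell (i, j).
            discordant += t[i, j] * t[i + 1:, :j].sum()
    return (concordant - discordant) / (concordant + discordant)

# Voting (rows: never, sometimes, often) by ideology (columns: conserv, moderate, liberal).
counts = [[80, 10, 10],
          [20, 70, 10],
          [ 0, 20, 80]]
print(round(gamma(counts), 2))  # (22,900 - 1,500) / 24,400 = 0.88
```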

62
Interpreting Gamma
  • Gamma = 21,400/24,400 = .88
  • Gamma ranges from -1 to 1
  • Gamma does not account for tied pairs
  • Tau (b and c) and Somers' d account for tied pairs in different ways

63
Square tables
Non-Square tables
64
Example
  • NES 2004: What explains variation in one's political ideology?
  • Income?
  • Education?
  • Religion?
  • Race?

65
Bivariate Relationships and Hypothesis Testing
(Significance Testing)
  • 1. Determine the null and alternative hypotheses
  • Null There is no relationship between X and Y (X
    and Y are statistically independent and test
    statistic 0).
  • Alternative There IS a relationship between X
    and Y (test statistic does not equal 0).

66
Bivariate Relationships and Hypothesis Testing
  • 2. Determine Appropriate Test Statistic (based on
    measurement levels of X and Y)
  • 3. Identify the type of sampling distribution for the test statistic, and what it would look like if the null hypothesis were true.

67
Bivariate Relationships and Hypothesis Testing
  • 4. Calculate the test statistic from the sample
    data and determine the probability of observing a
    test statistic this large (in absolute terms) if
    the null hypothesis is true.
  • P-value (significance level) = the probability of observing a test statistic at least as large as our observed test statistic, if in fact the null hypothesis is true

68
Bivariate Relationships and Hypothesis Testing
  • 5. Choose an alpha level: a decision rule to guide us in determining which values of the p-value lead us to reject/not reject the null hypothesis
  • When the p-value is extremely small, we reject the null hypothesis (why?). The relationship is deemed statistically significant.
  • When the p-value is not small, we do not reject the null hypothesis (why?). The relationship is deemed statistically insignificant.
  • Most common alpha level: .05

69
Bottom Line
  • Assuming we will always use an alpha level of .05:
  • Reject the null hypothesis if P-value < .05
  • Do not reject the null hypothesis if P-value > .05

70
An Example
  • Dependent variable: Vote Choice in 2000
  • (Gore, Bush, Nader)
  • Independent variable: Ideology
  • (liberal, moderate, conservative)

71
An Example
  • 1. Determine the null and alternative hypotheses.

72
An Example
  • Null Hypothesis: There is no relationship between ideology and vote choice in 2000.
  • Alternative (Research) Hypothesis: There is a relationship between ideology and vote choice (liberals were more likely to vote for Gore, while conservatives were more likely to vote for Bush).

73
An Example
  • 2. Determine Appropriate Test Statistic (based on
    measurement levels of X and Y)
  • 3. Identify the type of sampling distribution for the test statistic, and what it would look like if the null hypothesis were true.

74
Sampling Distributions for the Chi-Squared Statistic (under the assumption of perfect independence): df = (rows - 1)(columns - 1)
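For reference, a minimal sketch of looking up the .05 critical value of this sampling distribution for the worked 4 x 3 example:

```python
from scipy.stats import chi2

df = (4 - 1) * (3 - 1)         # (rows - 1)(columns - 1) = 6
critical = chi2.ppf(0.95, df)  # value exceeded only 5% of the time under independence
print(round(critical, 2))      # 12.59; the observed statistic of 18 exceeds it
```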
75
Bivariate Relationships and Hypothesis Testing
  • 4. Calculate the test statistic from the sample
    data and determine the probability of observing a
    test statistic this large (in absolute terms) if
    the null hypothesis is true.
  • P-value (significance level) = the probability of observing a test statistic at least as large as our observed test statistic, if in fact the null hypothesis is true

76
Bivariate Relationships and Hypothesis Testing
  • 5. Choose an alpha level: a decision rule to guide us in determining which values of the p-value lead us to reject/not reject the null hypothesis
  • When the p-value is extremely small, we reject the null hypothesis (why?). The relationship is deemed statistically significant.
  • When the p-value is not small, we do not reject the null hypothesis (why?). The relationship is deemed statistically insignificant.
  • Most common alpha level: .05

77
In-Class Exercise
  • For some years now, political commentators have
    cited the importance of a gender gap in
    explaining election outcomes. What is the source
    of the gender gap?
  • Develop a simple theory and corresponding
    hypothesis (where gender is the independent
    variable) which seeks to explain the source of
    the gender gap.
  • Specifically, determine
  • Theory
  • Null and research hypothesis
  • Test statistic for a cross-tabulation to test
    your hypothesis