ASSOCIATION BETWEEN VARIABLES: SCATTERGRAMS

Slides: 38
Provided by: umbc

1
ASSOCIATION BETWEEN VARIABLES: SCATTERGRAMS
  • Topic 11

2
Like Father, Like Son
  • Though it is not especially relevant to
    political science, suppose we want to research
    the following bivariate hypothesis, which involves
    two interval and continuous variables.
  • FATHER'S HEIGHT (actual height in inches) →
    ADULT SON'S HEIGHT (actual height in inches)
    [unit of analysis: father-adult son pairs]

3
Like Father, Like Son (cont.)
  • We select a random sample of n = 1078 pairs of
    fathers and their adult sons and collect the
    relevant data; the first five cases, as well as
    the last case, appear below.
  • Note that the unit of analysis is father-adult
    son pairs and observed values have been very
    precisely measured and recorded, so probably each
    case has a unique recorded value on each
    variable.
  • Pair ID   Father's Height (inches)   Son's Height (inches)
  • 1         66.67                      68.42
  • 2         69.83                      70.32
  • 3         65.19                      69.76
  • 4         65.15                      73.85
  • 5         64.66                      70.17
  • . . .
  • 1078      62.31                      62.09

4
Like Father, Like Son (cont.)
  • We cannot straightforwardly crosstabulate these
    variables, because the variables are continuous,
    each having an infinite number of possible
    values, so each case would be in a unique row and
    column.
  • So what should we do? One possibility is to
    create class intervals for both variables (as
    discussed in Topic 5 on histograms), in effect
    turning them into discrete variables, and then
    proceed as before.
  • For example, we might create class intervals for
    both variables with this recoding scheme:
  • Short: less than 65 inches
  • Medium: 65-70 inches
  • Tall: greater than 70 inches
  • Pair ID   FH            SH
  • 1         66.67 Med     68.42 Med
  • 2         69.83 Med     70.32 Tall
  • 3         65.19 Med     69.76 Med
  • 4         65.15 Med     73.85 Tall
  • 5         64.66 Short   70.17 Tall
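
The recoding scheme above can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the function name `height_class` is hypothetical, and treating heights of exactly 65 or 70 inches as Medium is an assumption, since the scheme does not say which interval owns the boundaries.

```python
# Recode a continuous height (in inches) into the three class intervals.
# Assumption: heights of exactly 65 or 70 inches are counted as Medium.
def height_class(inches):
    if inches < 65:
        return "Short"
    if inches <= 70:
        return "Medium"
    return "Tall"

# The first five father-son pairs from the data listing: (pair ID, FH, SH).
pairs = [
    (1, 66.67, 68.42),
    (2, 69.83, 70.32),
    (3, 65.19, 69.76),
    (4, 65.15, 73.85),
    (5, 64.66, 70.17),
]

# Each case becomes (pair ID, father's class, son's class).
recoded = [(pid, height_class(fh), height_class(sh)) for pid, fh, sh in pairs]
for pid, fh_class, sh_class in recoded:
    print(pid, fh_class, sh_class)
```

Note how pair 2 (70.32) and pair 5 (70.17) are both recoded as Tall sons even though they are barely over the 70-inch boundary; this is the kind of information the recoding discards.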

5
Like Father, Like Son (cont.)
  • We then can set up a crosstabulation worksheet
    and begin to tally cases as shown above.
  • If the hypothesis is true (and if on average fathers
    and sons are about the same height), we would
    expect most cases to fall in the main diagonal,
    i.e., in the SS, MM, and TT cells.

6
Like Father, Like Son (cont.)
  • But note that this approach is not very
    satisfactory.
  • Pair ID   F's Height   S's Height   Cell
  • 1         66.67        68.42        MM
  • 2         69.83        70.32        MT
  • 3         65.19        69.76        MM
  • 4         65.15        73.85        MT
  • 5         64.66        70.17        ST

7
Like Father, Like Son (cont.)
  • Creating class intervals when you have gone to
    the trouble of measuring continuous variables
    quite precisely entails throwing away valuable
    information that bears on the hypothesis of
    interest.
  • This problem can be mitigated by creating more
    refined class intervals.
  • But what we really should do is have infinitely
    refined class intervals that match the very
    precise information we have collected.
  • A very nice analytical device called a
    scattergram (or scatterdiagram or scatterplot),
    similar in its basic logic to a crosstabulation,
    allows us to do just this.

8
Like Father, Like Son (cont.)
  • First, we need to set up a scattergram template
    or worksheet, which is similar in logic to that
    for a cross-tabulation but reflects the
    continuous character of both variables.
  • Figure 1 shows the general template for such a
    scattergram.
  • We draw a horizontal interval scale, just as in a
    histogram, representing values of the independent
    variable (corresponding to the column variable in
    a crosstab).
  • This scale should be appropriately labeled and
    calibrated to encompass the full range of
    observed values found in the data (but it
    needn't, and probably shouldn't, be much wider
    than this).
  • We then erect a vertical interval scale that
    similarly represents values of the dependent
    variable (corresponding to the row variable in a
    crosstab).

9
(No Transcript)
10
Like Father, Like Son (cont.)
  • Just as cases are placed in cells of a
    crosstabulation defined by the intersection of
    the row and column corresponding to the
    particular combination of (discrete) variable
    values that characterizes the case, each case in
    a scattergram is plotted at the point defined by
    the intersection of horizontal and vertical lines
    corresponding to the particular combination of
    (continuous) variable values that characterizes
    the case.
  • In a sense, each case falls (almost always) in
    its own unique and tiny cell.
  • Figure 2 shows the scattergram worksheet and the
    plotted points for each of the five father-son
    pairs listed in the previous slides.
  • To facilitate locating each point, I have put a
    (1" x 1") grid over the scattergram.

11
(No Transcript)
12
Some Guidelines for Creating Scattergrams
  • Graph paper is very useful for making hand-drawn
    scattergrams, as you will do in Problem Set 11.
  • Draw the interval measurement scales on the left
    and bottom margins of the scattergram.
  • The end points of each scale must accommodate the
    maximum and minimum (or range of) values observed
    in the data for each variable.
  • But (as a general rule) the end points should not
    be much more extreme than these maximums and
    minimums.
  • In other words, when the data has been plotted,
    there should not be a lot of unnecessary white
    space in the finished (presentation grade)
    scattergram.
  • Scale the axes so that the scattergram is either
    approximately square or somewhat wider than tall
    (SPSS does the latter by default).
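
The end-point guideline can be illustrated with a small Python sketch. It is not part of the original slides: the helper `axis_limits` is hypothetical, rounding outward to the nearest whole inch is just one reasonable choice, and only the six cases listed earlier are used, so the full sample of 1078 pairs would call for a wider range.

```python
import math

def axis_limits(values, step=1.0):
    """Return (low, high) end points rounded outward to a multiple of step."""
    low = math.floor(min(values) / step) * step
    high = math.ceil(max(values) / step) * step
    return low, high

# Heights (inches) of the six father-son pairs listed in the data table.
fathers = [66.67, 69.83, 65.19, 65.15, 64.66, 62.31]
sons = [68.42, 70.32, 69.76, 73.85, 70.17, 62.09]

x_low, x_high = axis_limits(fathers)   # horizontal scale: independent variable
y_low, y_high = axis_limits(sons)      # vertical scale: dependent variable
print("Father's height scale:", x_low, "to", x_high)
print("Son's height scale:", y_low, "to", y_high)
```

Rounding outward guarantees that every observed value fits on the scale while adding less than one inch of white space at either end.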

13
Like Father, Like Son (cont.)
  • A statistician named Karl Pearson actually
    conducted such a study of father-son pairs over
    100 years ago in England.
  • Having collected height data on 1078 father-son
    pairs, Pearson realized that a list of 1078 pairs
    of numbers would be impossible to grasp as raw
    data and that a crosstabulation using class
    intervals would have the problems discussed
    previously.
  • Pearson therefore developed the scattergram, an
    alternate analytic device appropriate for
    analyzing association in such data.
  • Figure 3 shows the scattergram of Pearson's data.
  • This scattergram is taken from David Freedman et
    al., Statistics, p. 110.

14
(No Transcript)
15
Questions Pertaining to the Pearson Scattergram
  • What is the significance of the 45° line in the
    scattergram?
  • On average, are the two generations about the
    same height? If not, which generation is taller
    on average?
  • What is the approximate average height of the
    sons? Of the fathers?
  • What is the significance of the two vertical
    dotted lines?
  • What is the average height of all sons whose
    fathers are about 72 inches tall?
  • How does the average height of sons vary with the
    height of their fathers?
  • Is there an association between the
    variables?
  • Is it positive or negative?
  • How strong is the association between the two
    variables?

16
Questions Pertaining to the Pearson Scattergram (cont.)
  • Points that lie (approximately) on the 45° line
    are cases in which it is (approximately) true
    that FH = SH.
  • Also note that a clear majority (maybe 60%) of
    points lie northwest of the 45° line,
    indicating that most sons are taller than their
    fathers.
  • Move a vertical line left and right until it
    appears that half of the points lie on each
    side. This line corresponds to the median height
    of fathers; their mean height is about the same.
  • There is a distinct football-shaped cloud of
    points running from southwest (Short-Short) to
    northeast (Tall-Tall), indicating a positive
    association.
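
The role of the 45° line can be checked numerically. The sketch below is an illustration only and is not from the original slides: it uses just the six cases listed in the data table, so the proportions it prints say nothing reliable about the full sample of 1078 pairs.

```python
# The six cases listed in the data table: (father's height, son's height).
cases = [
    (66.67, 68.42),
    (69.83, 70.32),
    (65.19, 69.76),
    (65.15, 73.85),
    (64.66, 70.17),
    (62.31, 62.09),
]

# A point lies northwest of the 45-degree line when SH > FH (son taller),
# southeast of it when SH < FH (father taller), and on it when SH equals FH.
northwest = sum(1 for fh, sh in cases if sh > fh)
southeast = sum(1 for fh, sh in cases if sh < fh)
on_line = len(cases) - northwest - southeast

print("Son taller (northwest):", northwest)
print("Father taller (southeast):", southeast)
print("Exactly equal (on the line):", on_line)
```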

17
(No Transcript)
18
Questions Pertaining to the Pearson Scattergram (cont.)
  • The points inside the two vertical dotted lines
    represent all cases in which the father is about
    72 inches tall.
  • On average these 72-inch fathers have sons who are
    shorter than their fathers, though at the same
    time taller than the average of all sons.
  • We can draw other vertical strips and find the
    average height of the sons of these fathers.
  • The line of averages indicates that the average
    height of sons increases with fathers' height
    (positive association).
  • But there is still a lot of dispersion in sons'
    height within each vertical strip, so the
    association is far from perfect.
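
The vertical-strip idea can be sketched in Python as follows. This is an illustration, not the procedure used to draw the figures: the function `strip_average` and the half-inch strip width are assumptions, and the six listed cases are far too few to reproduce the line of averages for the full data.

```python
def strip_average(cases, center, half_width=0.5):
    """Mean son's height among cases whose father's height is in the strip.

    The strip covers fathers within half_width inches of center.
    Returns None when no case falls inside the strip.
    """
    sons = [sh for fh, sh in cases if abs(fh - center) <= half_width]
    if not sons:
        return None
    return sum(sons) / len(sons)

# The six cases listed in the data table: (father's height, son's height).
cases = [
    (66.67, 68.42),
    (69.83, 70.32),
    (65.19, 69.76),
    (65.15, 73.85),
    (64.66, 70.17),
    (62.31, 62.09),
]

# Three of the listed fathers are about 65 inches tall.
avg_65 = strip_average(cases, 65.0)
# None of the listed fathers is about 72 inches tall, so this strip is empty.
avg_72 = strip_average(cases, 72.0)
print("Average son's height, fathers about 65 inches:", avg_65)
print("Average son's height, fathers about 72 inches:", avg_72)
```

Repeating the calculation for a series of strip centers and connecting the averages gives the line of averages described above.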

19
(No Transcript)
20
(No Transcript)
21
(No Transcript)
22
Association Visualized
  • The following figures show other scattergrams for
    small hypothetical data sets.
  • These were produced by the Statistical Applet
    on Correlation and Regression Demo available at
    the course web site.
  • These scattergrams correspond directly to Tables
    1B-1E near the beginning of the previous handout
    on crosstabulations, in that they show differing
    degrees of association running from high positive
    through zero to moderate negative.
  • The (Pearson) correlation coefficient is the
    standard measure of association between two
    interval variables.
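
The slides do not give the formula here, but the standard definition of the Pearson correlation coefficient is easy to sketch. The three small data sets below are hypothetical, chosen only to show the extremes and the zero case; they are not the data sets from the applet.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y divided by the
    square root of the product of their separate variations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data sets for illustration only.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # points lie exactly on an upward-sloping line
falling = [10, 8, 6, 4, 2]    # points lie exactly on a downward-sloping line
unrelated = [1, 3, 2, 3, 1]   # no linear tendency in either direction

print("rising:", pearson_r(x, rising))
print("falling:", pearson_r(x, falling))
print("unrelated:", pearson_r(x, unrelated))
```

The coefficient runs from -1 (perfect negative association) through 0 (no linear association) to +1 (perfect positive association).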

23
(No Transcript)
24
Crosstabs vs. Scattergram
  • It may at first blush seem puzzling that
  • in a standard crosstabulation a concentration of
    cases running from northwest to southeast
    (the main diagonal) reflects a positive
    association, while
  • in a scattergram the same pattern reflects a
    negative association (and vice versa).

25
Crosstabs vs. Scattergram (cont.)
  • This reflects only a cosmetic difference in
    setting things up:
  • in a crosstabulation, the values of the row
    (vertical) variable are usually placed in
    descending order (lowest value on top, highest
    value at the bottom), while
  • in a scattergram the values of the vertical (row)
    variable are invariably (and more sensibly)
    placed in ascending order (lowest value at the
    bottom, highest value on top).
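
The cosmetic nature of the difference can be seen by printing the same table both ways. The counts below are hypothetical and chosen only so that the heavy cells are easy to spot; the only operation involved is reversing the order of the rows.

```python
labels = ["Short", "Medium", "Tall"]

# Hypothetical counts. Rows are the dependent variable in crosstab order
# (lowest value on top); columns are the independent variable (lowest on
# the left).
crosstab_rows = [
    [8, 2, 0],   # Short
    [2, 8, 2],   # Medium
    [0, 2, 8],   # Tall
]

# Scattergram order: the same counts with the lowest row value at the bottom.
scattergram_rows = crosstab_rows[::-1]
scattergram_labels = labels[::-1]

def show(row_labels, rows):
    for label, row in zip(row_labels, rows):
        print(f"{label:>6}", *(f"{count:3d}" for count in row))

print("Crosstab layout:")
show(labels, crosstab_rows)
print("Scattergram layout:")
show(scattergram_labels, scattergram_rows)
```

In the first printout the heavy cells run from northwest to southeast; in the second they run from southwest to northeast. Both show the same positive association.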

26
Crosstabs vs. Scattergram (cont.)
  • But remember that scattergrams and
    crosstabulations are essentially similar devices.
  • We can show how directly they are logically
    connected, and also
  • how the former is much more informative than the
    latter.
  • Suppose that, having constructed Pearson's
    scattergram, we want also to construct a full
    crosstabulation of the data using the
    short-medium-tall class intervals we previously
    worked with.

27
Crosstabs vs. Scattergram (cont.)
  • This can be accomplished simply by
  • superimposing the appropriate 3x3 grid (table)
    on the scattergram, and then
  • counting the number of plotted points in each
    resulting cell.
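
Counting points in grid cells is mechanical enough to sketch in Python. This illustration is not from the original slides: it uses only the six cases listed in the data table rather than all 1078 pairs, the function `height_class` is hypothetical, and assigning heights of exactly 65 or 70 inches to Medium is an assumption.

```python
from collections import Counter

def height_class(inches):
    """Short: under 65; Medium: 65-70; Tall: over 70 (boundaries assumed)."""
    if inches < 65:
        return "Short"
    if inches <= 70:
        return "Medium"
    return "Tall"

# The six cases listed in the data table: (father's height, son's height).
cases = [
    (66.67, 68.42),
    (69.83, 70.32),
    (65.19, 69.76),
    (65.15, 73.85),
    (64.66, 70.17),
    (62.31, 62.09),
]

# Count the points in each cell, keyed as (father's class, son's class).
cells = Counter((height_class(fh), height_class(sh)) for fh, sh in cases)

# Print the table: son's class as rows (lowest on top, crosstab style),
# father's class as columns.
classes = ["Short", "Medium", "Tall"]
print(f"{'':>8}" + "".join(f"{c:>8}" for c in classes))
for son_class in classes:
    counts = "".join(
        f"{cells[(father_class, son_class)]:>8}" for father_class in classes
    )
    print(f"{son_class:>8}" + counts)
```

Going from points to cell counts is easy, but the reverse is impossible: the counts no longer say where inside each cell a point fell.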

28
Test 1 Blue Book Score by MC Score
29
  • Here is some more topical (and politically
    relevant) bivariate data
  • Data as of 11/05/08

30
Obama Vote in 2008 by Kerry Vote in 2004
  • A scattergram of more topical interest is shown
    on the following slide.
  • Since there are only 50 cases (states), the
    plotted points can be individually labeled.
  • What is the significance of the four quadrants of
    the scattergram?
  • What is the significance of the blue diagonal
    line?
  • What is the significance of the green diagonal
    line?
  • Note: DC is not included in the scattergram.

31
(No Transcript)
32
SPSS Scattergrams
  • SPSS can readily create scattergrams.
  • You can use SPSS (or Excel) for PS 11 if you
    wish (see Note 5).
  • For example, let's open the PRESIDENTIAL ELECTION
    data file, giving state-by-state vote totals for
    each Presidential candidate in each election, and
    then
  • find the variables dem2000, rep2000, dem2004, and
    rep2004,
  • compute the DEMOCRATIC PERCENT OF THE TWO-PARTY
    VOTE in each election,
  • by clicking on Transform > Compute and entering
    this expression in the Compute Variables dialog
    box:
  • d2pc2000 = 100 * dem2000/(dem2000 + rep2000)
  • and likewise for d2pc2004, and then
  • produce the following scattergram by clicking on
    Graphs > Scatter... > Simple/Define and then in
    the Simple Scatterplot dialog box put d2pc2004 on
    the Y Axis and d2pc2000 on the X Axis.
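
For readers working outside SPSS, the same computation can be sketched in Python. The vote totals below are hypothetical placeholders, not real election returns, and the function name `dem_two_party_pct` is an assumption; only the variable names dem2000, rep2000, dem2004, rep2004, d2pc2000, and d2pc2004 come from the slides.

```python
def dem_two_party_pct(dem_votes, rep_votes):
    """Democratic percent of the two-party (Democratic + Republican) vote."""
    return 100 * dem_votes / (dem_votes + rep_votes)

# Hypothetical vote totals for illustration only (not real election data).
states = {
    "State A": {"dem2000": 600_000, "rep2000": 400_000,
                "dem2004": 550_000, "rep2004": 450_000},
    "State B": {"dem2000": 300_000, "rep2000": 700_000,
                "dem2004": 350_000, "rep2004": 650_000},
}

# Each state becomes one point: d2pc2000 on the X axis, d2pc2004 on the Y axis.
points = {}
for name, votes in states.items():
    d2pc2000 = dem_two_party_pct(votes["dem2000"], votes["rep2000"])
    d2pc2004 = dem_two_party_pct(votes["dem2004"], votes["rep2004"])
    points[name] = (d2pc2000, d2pc2004)
    print(name, d2pc2000, d2pc2004)
```

Votes for third-party candidates are left out of the denominator, which is what makes this a two-party percentage.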

33
(No Transcript)
34
Histogram of Scores on Test 2
35
Scattergram of P2 by MC2
36
Scattergram of Score2 by Score 1
37
Correlation Matrix