Title: ASSOCIATION BETWEEN VARIABLES: SCATTERGRAMS
1ASSOCIATION BETWEEN VARIABLES SCATTERGRAMS
2Like Father, Like Son
- Though it is not especially relevant to
political science, suppose we want to research
the following bivariate hypothesis that involves
two interval and continuous variables.
- FATHER'S HEIGHT (actual height in inches) => ADULT SON'S HEIGHT (actual height in inches) [father-adult son pairs]
3Like Father, Like Son (cont.)
- We select a random sample of n = 1078 pairs of fathers and their adult sons and collect the relevant data; the first five cases and the last case appear below.
- Note that the unit of analysis is father-adult son pairs. Observed values have been measured and recorded very precisely, so each case probably has a unique recorded value on each variable.

  Pair ID   Father's Height (inches)   Son's Height (inches)
  1         66.67                      68.42
  2         69.83                      70.32
  3         65.19                      69.76
  4         65.15                      73.85
  5         64.66                      70.17
  . . .
  1078      62.31                      62.09
4Like Father, Like Son (cont.)
- We cannot straightforwardly crosstabulate these variables because they are continuous: each has an infinite number of possible values, so each case would be in a unique row and column.
- So what should we do? One possibility is to create class intervals for both variables (as discussed in Topic 5 on histograms), in effect turning them into discrete variables, and then proceed as before.
- For example, we might create class intervals for both variables with this recoding scheme:
- Short: less than 65 inches
- Medium: 65-70 inches
- Tall: greater than 70 inches

  Pair ID   Father's Height (FH)   Son's Height (SH)
  1         66.67  Med             68.42  Med
  2         69.83  Med             70.32  Tall
  3         65.19  Med             69.76  Med
  4         65.15  Med             73.85  Tall
  5         64.66  Short           70.17  Tall
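
The recoding scheme above can be sketched in Python. This is only an illustration and is not part of the original slides: the function name recode_height is hypothetical, and counting exactly 65 and exactly 70 inches as Medium is an assumption, since the scheme does not say where boundary values belong.

```python
# Heights (inches) for the first five father-son pairs listed above:
# (pair ID, father's height, son's height)
pairs = [
    (1, 66.67, 68.42),
    (2, 69.83, 70.32),
    (3, 65.19, 69.76),
    (4, 65.15, 73.85),
    (5, 64.66, 70.17),
]


def recode_height(inches):
    """Recode a height into Short / Medium / Tall (hypothetical helper).

    Assumption: exactly 65 and exactly 70 inches count as Medium.
    """
    if inches < 65:
        return "Short"
    if inches <= 70:
        return "Medium"
    return "Tall"


# Recode both variables for every pair.
recoded = [(pid, recode_height(fh), recode_height(sh)) for pid, fh, sh in pairs]

for pid, fh_class, sh_class in recoded:
    print(pid, fh_class, sh_class)
```

The recoded values agree with the table above; for example, pair 5 becomes Short (father) and Tall (son).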
5Like Father, Like Son (cont.)
- We then can set up a crosstabulation worksheet and begin to tally cases as shown above.
- If the hypothesis is true (and if on average fathers and sons are about the same height), we would expect most cases to fall on the main diagonal, i.e., in the SS, MM, and TT cells.
6Like Father, Like Son (cont.)
- But note that this approach is not very satisfactory.

  Pair ID   Father's Height   Son's Height   Cell
  1         66.67             68.42          MM
  2         69.83             70.32          MT
  3         65.19             69.76          MM
  4         65.15             73.85          MT
  5         64.66             70.17          ST
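
The tallying step can be sketched as follows, starting from the five cell codes in the table (first letter for the father, second for the son). This is an illustrative sketch only; with just five cases the tally says nothing about the full sample of 1078 pairs.

```python
from collections import Counter

# Cell codes for pairs 1-5, copied from the table above
# (first letter = father's class, second letter = son's class).
cell_codes = ["MM", "MT", "MM", "MT", "ST"]

# Tally how many cases fall in each cell of the 3x3 crosstabulation.
tally = Counter(cell_codes)

# Cases on the main diagonal are those where father and son
# fall in the same height class.
on_diagonal = tally["SS"] + tally["MM"] + tally["TT"]

print(dict(tally))
print("cases on the main diagonal:", on_diagonal, "of", len(cell_codes))
```

Note that pairs 2 and 4 land in the same MT cell even though the sons' recorded heights differ by about 3.5 inches, which is the kind of detail the class intervals discard.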
7Like Father, Like Son (cont.)
- Creating class intervals when you have gone to the trouble of measuring a continuous variable quite precisely entails throwing away valuable information that bears on the hypothesis of interest.
- This problem can be mitigated by creating more refined class intervals.
- But what we really should do is have infinitely refined class intervals that match the very precise information we have collected.
- A very nice analytical device called a scattergram (or scatterdiagram or scatterplot), similar in its basic logic to a crosstabulation, allows us to do just this.
8Like Father, Like Son (cont.)
- First, we need to set up a scattergram template or worksheet, which is similar in logic to that for a crosstabulation but reflects the continuous character of both variables.
- Figure 1 shows the general template for such a scattergram.
- We draw a horizontal interval scale, just as in a histogram, representing values of the independent variable (corresponding to the column variable in a crosstab).
- This scale should be appropriately labeled and calibrated to encompass the full range of observed values found in the data (but it needn't, and probably shouldn't, be much wider than this).
- We then erect a vertical interval scale that similarly represents values of the dependent variable (corresponding to the row variable in a crosstab).
10Like Father, Like Son (cont.)
- Just as cases are placed in cells of a crosstabulation defined by the intersection of the row and column corresponding to the particular combination of (discrete) variable values that characterizes the case, each case in a scattergram is plotted at the point defined by the intersection of horizontal and vertical lines corresponding to the particular combination of (continuous) variable values that characterizes the case.
- In a sense, each case falls (almost always) in its own unique and tiny cell.
- Figure 2 shows the scattergram worksheet and the plotted points for each of the five father-son pairs listed in the previous slides.
- To facilitate locating each point, I have put a (1" x 1") grid over the scattergram.
12Some Guidelines for Creating Scattergrams
- Graph paper is very useful for making hand-drawn scattergrams, as you will do in Problem Set 11.
- Draw the interval measurement scales on the left and bottom margins of the scattergram.
- The end points of each scale must accommodate the maximum and minimum (or range of) values observed in the data for each variable.
- But (as a general rule) the end points should not be much more extreme than these maximums and minimums.
- In other words, when the data has been plotted, there should not be a lot of unnecessary white space in the finished (presentation grade) scattergram.
- Scale the axes so that the scattergram is either approximately square or somewhat wider than tall (SPSS does the latter by default).
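
One way to apply the end-point guideline is to round the observed minimum down and the observed maximum up to the nearest whole inch. The sketch below is only an illustration: axis_limits is a hypothetical helper, rounding to whole inches is an assumption, and the data are just the six listed cases (pairs 1-5 and pair 1078), not the full sample.

```python
import math

# Heights (inches) of the six listed cases: pairs 1-5 and pair 1078.
fathers = [66.67, 69.83, 65.19, 65.15, 64.66, 62.31]
sons = [68.42, 70.32, 69.76, 73.85, 70.17, 62.09]


def axis_limits(values, step=1.0):
    """Hypothetical helper: widen the observed range to multiples of `step`.

    The scale then covers every observed value without adding
    much unnecessary white space.
    """
    low = math.floor(min(values) / step) * step
    high = math.ceil(max(values) / step) * step
    return low, high


x_limits = axis_limits(fathers)  # horizontal scale: independent variable
y_limits = axis_limits(sons)     # vertical scale: dependent variable

print("horizontal axis (fathers):", x_limits)
print("vertical axis (sons):", y_limits)
```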
13Like Father, Like Son (cont.)
- A statistician named Karl Pearson actually conducted such a study of father-son pairs over 100 years ago in England.
- Having collected height data on 1078 father-son pairs, Pearson realized that a list of 1078 pairs of numbers would be impossible to grasp as raw data and that a crosstabulation using class intervals would have the problems discussed previously.
- Pearson therefore developed the scattergram, an alternate analytic device appropriate for analyzing association in such data.
- Figure 3 shows the scattergram of Pearson's data.
- This scattergram is taken from David Freedman et al., Statistics, p. 110.
15Questions Pertaining to the Pearson Scattergram
- What is the significance of the 45° line in the scattergram?
- On average, are the two generations about the same height? If not, which generation is taller on average?
- What is the approximate average height of the sons? Of the fathers?
- What is the significance of the two vertical dotted lines?
- What is the average height of all sons whose fathers are about 72 inches tall?
- How does the average height of sons vary with the height of their fathers?
- Is there an association between the variables?
- Is it positive or negative?
- How strong is the association between the two variables?
16Questions Pertaining to the Pearson Scattergram
(cont.)
- Points that lie (approximately) on the 45° line are cases in which it is (approximately) true that FH = SH.
- Also note that a clear majority (maybe 60%) of points lie northwest of the 45° line, indicating that most sons are taller than their fathers.
- Move a vertical line left and right until it appears that half of the points lie on either side. This line corresponds to the median height of fathers; their mean height is about the same.
- There is a distinct football-shaped cloud of points running from southwest (Short-Short) to northeast (Tall-Tall), indicating a positive association.
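
Deciding whether a point lies northwest of the 45° line amounts to checking whether SH is greater than FH. A minimal sketch, using only the six cases listed earlier (pairs 1-5 and pair 1078), so the resulting share is illustrative and not an estimate for the full sample:

```python
# (father's height, son's height) in inches for pairs 1-5 and pair 1078.
pairs = [
    (66.67, 68.42),
    (69.83, 70.32),
    (65.19, 69.76),
    (65.15, 73.85),
    (64.66, 70.17),
    (62.31, 62.09),
]

# A point is northwest of (above) the 45-degree line when the son
# is taller than the father, and southeast of (below) it when shorter.
above = sum(1 for fh, sh in pairs if sh > fh)
below = sum(1 for fh, sh in pairs if sh < fh)
on_line = len(pairs) - above - below

print("son taller (above the line):", above)
print("son shorter (below the line):", below)
print("same height (on the line):", on_line)
```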
18Questions Pertaining to the Pearson Scattergram
(cont.)
- The points inside the two vertical dotted lines represent all cases in which the father is about 72 inches tall.
- On average, these 72-inch fathers have sons who are shorter than their fathers, though at the same time taller than the average of all sons.
- We can draw other vertical strips and find the average height of the sons of these fathers.
- The line of averages indicates that the average height of sons increases with fathers' height (positive association).
- But there is still a lot of dispersion in sons' height within each vertical strip, so the association is far from perfect.
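
The vertical-strip idea can be sketched as follows. The data here are made up for illustration and are not Pearson's; strip_average is a hypothetical helper, and the strip half-width of 0.5 inch is an assumption about what "about 72 inches" means.

```python
from statistics import mean

# HYPOTHETICAL (father's height, son's height) pairs in inches.
# These values are invented for illustration; they are not Pearson's data.
pairs = [
    (63.8, 66.0), (64.2, 67.5), (64.4, 65.1),
    (67.7, 68.3), (68.0, 69.4), (68.4, 68.9),
    (71.8, 70.9), (72.1, 71.5), (72.3, 70.2), (71.6, 72.4),
]


def strip_average(data, center, half_width=0.5):
    """Average son's height among fathers within `half_width` of `center`."""
    sons_in_strip = [sh for fh, sh in data if abs(fh - center) <= half_width]
    return mean(sons_in_strip)


# One point on the "line of averages" for each vertical strip.
line_of_averages = {c: strip_average(pairs, c) for c in (64, 68, 72)}

for center, avg in line_of_averages.items():
    print(f"fathers about {center} in.: average son {avg:.2f} in.")
```

In these invented data the strip averages rise with fathers' height, while the sons within any one strip still differ from one another, which mirrors the two points made above.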
22Association Visualized
- The following figures show other scattergrams for small hypothetical data sets.
- These were produced by the Statistical Applet on Correlation and Regression Demo available at the course web site.
- These scattergrams correspond directly to Tables 1B-1E near the beginning of the previous handout on crosstabulations, in that they show differing degrees of association running from high positive through zero to moderate negative.
- The (Pearson) correlation coefficient is the standard measure of association between two interval variables.
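
The slides do not give the formula, but the standard definition of the Pearson correlation coefficient is the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. A minimal sketch, where pearson_r is a hypothetical helper name and the data are only the six cases listed earlier (so the value it prints is not the correlation for the full sample):

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient from its standard definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Heights (inches) of the six listed cases: pairs 1-5 and pair 1078.
fathers = [66.67, 69.83, 65.19, 65.15, 64.66, 62.31]
sons = [68.42, 70.32, 69.76, 73.85, 70.17, 62.09]

r = pearson_r(fathers, sons)
print(f"r for the six listed cases: {r:.2f}")
```

The coefficient runs from -1 (perfect negative association) through 0 to +1 (perfect positive association).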
24Crosstabs vs. Scattergram
- It may at first blush seem puzzling that
- in a standard crosstabulation a concentration of cases running from northwest to southeast (the main diagonal) reflects a positive association, while
- in a scattergram the same pattern reflects a negative association (and vice versa).
25Crosstabs vs. Scattergram (cont.)
- This reflects only a cosmetic difference in setting things up:
- in a crosstabulation, the values of the row (vertical) variable are usually placed in descending order (lowest value on top, highest value at the bottom), while
- in a scattergram the values of the vertical (row) variable are invariably (and more sensibly) placed in ascending order (lowest value at the bottom, highest value on top).
26Crosstabs vs. Scattergram (cont.)
- But remember that scattergrams and crosstabulations are essentially similar devices.
- We can show how directly they are logically connected, and also
- how the former is much more informative than the latter.
- Suppose that, having constructed Pearson's scattergram, we want also to construct a full crosstabulation of the data using the short-medium-tall class intervals we previously worked with.
27Crosstabs vs. Scattergram (cont.)
- This can be accomplished simply by
- superimposing the appropriate 3x3 grid (table) on the scattergram, and then
- counting the number of plotted points in each resulting cell.
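
Superimposing the grid and counting points can be sketched as below, which also shows that the crosstab and scattergram layouts differ only in the order of the rows. This is an illustration under stated assumptions: classify and grid_rows are hypothetical helpers, boundary values of exactly 65 and 70 inches are assumed to be Medium, and the data are only the six cases listed earlier.

```python
from collections import Counter

# (father's height, son's height) in inches for pairs 1-5 and pair 1078.
pairs = [
    (66.67, 68.42), (69.83, 70.32), (65.19, 69.76),
    (65.15, 73.85), (64.66, 70.17), (62.31, 62.09),
]
CLASSES = ["Short", "Medium", "Tall"]


def classify(inches):
    """Hypothetical helper; assumes 65 and 70 inches count as Medium."""
    if inches < 65:
        return "Short"
    if inches <= 70:
        return "Medium"
    return "Tall"


# Count plotted points per grid cell: row = son's class, column = father's class.
counts = Counter((classify(sh), classify(fh)) for fh, sh in pairs)


def grid_rows(row_order):
    """Return one list of cell counts per row, with rows in the given order."""
    return [[counts[(row, col)] for col in CLASSES] for row in row_order]


crosstab_layout = grid_rows(CLASSES)           # lowest row value on top
scattergram_layout = grid_rows(CLASSES[::-1])  # highest row value on top

for title, order, layout in (
    ("Crosstab layout", CLASSES, crosstab_layout),
    ("Scattergram layout", CLASSES[::-1], scattergram_layout),
):
    print(title, "(columns: father Short, Medium, Tall)")
    for label, row in zip(order, layout):
        print(f"  son {label:<7}", row)
```

The two layouts contain the same counts; one is simply the other flipped top to bottom, which is why the same association appears to run in opposite diagonal directions.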
28Test 1 Blue Book Score by MC Score
29
- Here is some more topical (and politically relevant) bivariate data.
- Data as of 11/05/08
30Obama Vote in 2008 by Kerry Vote in 2004
- A scattergram of more topical interest is shown on the following slide.
- Since there are only 50 cases (states), the plotted points can be individually labeled.
- What is the significance of the four quadrants of the scattergram?
- What is the significance of the blue diagonal line?
- What is the significance of the green diagonal line?
- Note: DC is not included in the scattergram.
32SPSS Scattergrams
- SPSS can readily create scattergrams.
- You can use SPSS (or Excel) for PS 11 if you wish (see Note 5).
- For example, let's open the PRESIDENTIAL ELECTION data file, giving state-by-state vote totals for each Presidential candidate in each election, and then
- find the variables dem2000, rep2000, dem2004, and rep2004;
- compute the DEMOCRATIC PERCENT OF THE TWO-PARTY VOTE in each election,
- by clicking on Transform > Compute and entering this expression in the Compute Variables dialog box:
- d2pc2000 = 100 * dem2000 / (dem2000 + rep2000)
- and likewise for d2pc2004; and then
- produce the following scattergram by clicking on Graphs > Scatter... > Simple/Define and then, in the Simple Scatterplot dialog box, putting d2pc2004 on the Y Axis and d2pc2000 on the X Axis.
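
The same computation can be sketched outside SPSS. The variable names follow the data file described above, but the vote totals and state labels below are made up for illustration, and two_party_pct is a hypothetical helper.

```python
def two_party_pct(dem, rep):
    """Democratic percent of the two-party vote: 100 * dem / (dem + rep)."""
    return 100 * dem / (dem + rep)


# HYPOTHETICAL vote totals for two states (invented for illustration only).
states = {
    "State A": {"dem2000": 600, "rep2000": 400, "dem2004": 550, "rep2004": 450},
    "State B": {"dem2000": 300, "rep2000": 700, "dem2004": 350, "rep2004": 650},
}

# Each state becomes one plotted point: d2pc2000 on the X axis,
# d2pc2004 on the Y axis.
points = {
    name: (
        two_party_pct(v["dem2000"], v["rep2000"]),
        two_party_pct(v["dem2004"], v["rep2004"]),
    )
    for name, v in states.items()
}

for name, (d2pc2000, d2pc2004) in points.items():
    print(f"{name}: d2pc2000 = {d2pc2000:.1f}, d2pc2004 = {d2pc2004:.1f}")
```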
34Histogram of Scores on Test 2
35Scattergram of P2 by MC2
36Scattergram of Score2 by Score 1
37Correlation Matrix