Title: Scatterplots and Correlation
1Chapter 4
- Scatterplots and Correlation
2Variable (X) and Variable (Y)
- Prior chapters: one variable at a time
- This chapter: the relationship between two variables
- One variable is the outcome, or response variable (Y)
- The other variable is the predictor, or explanatory variable (X)
- Are X and Y related? Does X → Y?
3Question
A study investigates whether there is a
relationship between gross domestic product and
life expectancy. Which is the explanatory
variable (X)? Which is the response variable
(Y)? All other variables that may influence life
expectancy are lurking variables and may confound the
relation between X and Y. Are there lurking
variables in this analysis?
4Scatterplot
- This chapter considers the case in which both X
and Y are quantitative variables
- Bivariate data points (xi, yi) are plotted on
graph paper to form a scatterplot
5Example of a scatterplot
- X: percent of students taking the SAT
- Y: mean SAT verbal score
- What is the relationship between X and Y?
6Interpreting scatterplots
- Form: can the data be described by a straight line (linearity)?
- Direction: does the line slope upward or downward?
- Positive association: above-average values of Y
accompany above-average values of X (and vice
versa)
- Negative association: above-average values of Y
accompany below-average values of X (and vice
versa)
- Strength: do the data points adhere closely to an imaginary line?
7Form: discuss
8Strength and direction
- Direction: positive, negative, or flat
- Strength: how closely does a non-horizontal
straight line fit the points of a scatterplot?
- Close fitting → strong
- Loose fitting → weak
9Strength cannot be reliably judged visually
- These two scatterplots are of the same data (they
have the exact same correlation)
- The second scatterplot appears to show a stronger
correlation, but this is an artifact of the axis
scaling
10Correlation coefficient (r)
- Let r denote the correlation coefficient
- r is always between -1 and 1, inclusive
- Sign of r denotes direction of association
- Special values for r
- r = 1 → all points on an upward-sloping line
- r = -1 → all points on a downward-sloping line
- r = 0 → no line or a horizontal line
- The closer r is to -1 or 1, the better the fit
of points to the line
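The special values above can be checked numerically. The following is a minimal sketch, not part of the original slides: `pearson_r` is a hypothetical helper name, and all data values are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r of the paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]                 # made-up x values
upward = [2 * x + 1 for x in xs]     # all points on an upward-sloping line
downward = [10 - 3 * x for x in xs]  # all points on a downward-sloping line
no_trend = [2, 1, 3, 1, 2]           # made-up values with no linear trend

r_up = pearson_r(xs, upward)
r_down = pearson_r(xs, downward)
r_none = pearson_r(xs, no_trend)
print(r_up, r_down, r_none)
```

Note that this helper divides by zero if every y value is identical, so it cannot be applied to a perfectly horizontal set of points.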
11Examples of Correlations
- Husbands' versus wives' ages
- r = .94
- Husbands' versus wives' heights
- r = .36
- Professional golfers' putting success: distance
of putt in feet versus percent success
- r = -.94
12Correlation Coefficient r
- Data on variables X and Y for n individuals
- x1, x2, ..., xn and y1, y2, ..., yn
- Each variable has a mean (x̄, ȳ) and a standard deviation (sx, sy)
13Correlation coefficient r
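The formula for r can be understood by converting
data points to standardized scores:
$$ r = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} z_{y_i} $$
where
$$ z_{x_i} = \frac{x_i - \bar{x}}{s_x}, \qquad z_{y_i} = \frac{y_i - \bar{y}}{s_y} $$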
14Illustrative example (gdp_life.sav)
Per Capita Gross Domestic Product and Average
Life Expectancy for Countries in Western Europe.
Does GDP predict life expectancy?
15Illustrative example (gdp_life.sav)
Country           Per Capita GDP (X)   Life Expectancy (Y)
Austria           21.4                 77.48
Belgium           23.2                 77.53
Finland           20.0                 77.32
France            22.7                 78.63
Germany           20.8                 77.17
Ireland           18.6                 76.39
Italy             21.5                 78.51
Netherlands       22.0                 78.15
Switzerland       23.8                 78.99
United Kingdom    21.2                 77.37
16Illustrative example (gdp_life.sav) Scatterplot
17Illustrative example (gdp_life.sav)
x        y        zx       zy       zx·zy
21.4     77.48    -0.078   -0.345    0.027
23.2     77.53     1.097   -0.282   -0.309
20.0     77.32    -0.992   -0.546    0.542
22.7     78.63     0.770    1.102    0.849
20.8     77.17    -0.470   -0.735    0.345
18.6     76.39    -1.906   -1.716    3.271
21.5     78.51    -0.013    0.951   -0.012
22.0     78.15     0.313    0.498    0.156
23.8     78.99     1.489    1.555    2.315
21.2     77.37    -0.209   -0.483    0.101
Mean: x̄ = 21.52, ȳ = 77.754
Standard deviation: sx = 1.532, sy = 0.795
Sum of zx·zy = 7.285
18Illustrative example (gdp_life.sav)
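With the sum of the standardized-score products equal to 7.285 and n = 10 countries, r = 7.285 / (10 - 1) ≈ 0.809. The sketch below repeats that calculation in Python from the raw data, assuming the sample standard deviation (divisor n - 1), which is what the tabulated sx and sy match. The variable names are illustrative and not taken from the original slides.

```python
from statistics import mean, stdev

# Per capita GDP (X) and life expectancy (Y) for the ten countries in the table
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

n = len(gdp)
x_bar, y_bar = mean(gdp), mean(life)
s_x, s_y = stdev(gdp), stdev(life)   # sample standard deviations (divisor n - 1)

# Convert each observation to a standardized score
z_x = [(x - x_bar) / s_x for x in gdp]
z_y = [(y - y_bar) / s_y for y in life]

# r is the sum of the products of the paired z-scores, divided by n - 1
r = sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)
print(round(r, 3))  # 0.809
```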
19Interpretation of r
Direction of association: positive or negative.
Strength of association: the closer |r| is to 1,
the stronger the correlation. Here are guidelines:
- 0.0 ≤ |r| < 0.3 → weak correlation
- 0.3 ≤ |r| < 0.7 → moderate correlation
- 0.7 ≤ |r| < 1.0 → strong correlation
- |r| = 1.0 → perfect correlation
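These guidelines can be written as a small lookup. The sketch below assumes the cutoffs apply to the absolute value of r, so that negative correlations are graded the same way as positive ones. `describe_correlation` is a hypothetical helper name, not something from the original slides.

```python
def describe_correlation(r):
    """Return (strength, direction) for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1, inclusive")

    size = abs(r)  # strength depends only on the magnitude of r
    if size == 1.0:
        strength = "perfect"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.3:
        strength = "moderate"
    else:
        strength = "weak"

    # direction depends only on the sign of r
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "flat"
    return strength, direction

print(describe_correlation(0.809))   # GDP / life expectancy example
print(describe_correlation(-0.94))   # golfers' putting example
print(describe_correlation(0.36))    # husbands' versus wives' heights
```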
20Interpretation of r
For the GDP / life expectancy example, r = 0.809.
This indicates a strong positive correlation.
21Problems with Correlations
- Not all relations are linear
- Outliers can have large influence on r
- Lurking variables confound relations
22Not all Relationships are Linear: Miles per
Gallon versus Speed
- r ≈ 0 (flat line)
- But there is a non-linear relation
23Not all Relationships are Linear: Miles per
Gallon versus Speed
- Curved relationship.
- r was misleading.
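A small numerical sketch shows how a clear curved relationship can still give r = 0. The speed and mileage values below are made up for illustration (they are not the data behind the slide), and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r of the paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: mileage rises up to a middle speed, then falls again
speed = [20, 30, 40, 50, 60]
mpg = [22, 28, 30, 28, 22]

r_curve = pearson_r(speed, mpg)
print(r_curve)  # zero, even though mpg clearly depends on speed
```

The rising half of the curve and the falling half cancel each other out, which is why r should be read alongside the scatterplot rather than on its own.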
24Outliers and Correlation
The outlier in the above graph decreases r. If we
remove the outlier → strong relation
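The effect of a single outlier can be sketched with made-up numbers. These values are hypothetical and are not the data in the graph, and `pearson_r` is an illustrative helper name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r of the paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical points lying close to an upward-sloping line
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9, 9.1]

# The same points plus one outlier far below the line
xs_with = xs + [9]
ys_with = ys + [1.0]

r_without = pearson_r(xs, ys)
r_with = pearson_r(xs_with, ys_with)
print(round(r_without, 3), round(r_with, 3))
```

In this sketch one added point is enough to pull r from nearly 1 down to a moderate value.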
25Exercise 4.15 Calories and sodium content of hot
dogs
- What are the lowest and highest calorie counts?
The lowest and highest sodium levels?
- Positive or negative association?
- Any outliers? If we ignore the outlier, is the relation
still linear? Does the correlation become
stronger?
26Exercise 4.13 IQ and school grades
- Positive or negative association?
- Is form linear? Does it appear strong?
- What are the IQ and GPA of the outlier at the
bottom of the plot?