Title: Correlation and Regression Wisdom
Lesson 3-3
- Correlation and Regression Wisdom
Knowledge Objectives
- Recall the three limitations on the use of correlation and regression.
- Explain what is meant by an outlier in bivariate data.
- Explain what is meant by an influential observation and how it relates to regression.
- Define a lurking variable.
- Give an example of what it means to say association does not imply causation.
Construction Objectives
- Given a scatterplot in a regression setting, identify outliers and influential observations.
- Explain how correlations based on averages differ from correlations based on individuals.
Vocabulary
- Influential observation: an observation that, if removed, would markedly change the result of the regression calculation
Limitations
- Correlation and regression describe only linear relationships.
- Extrapolation (using the model outside the range of the data) often produces unreliable predictions.
- Correlation is not resistant (to outliers!).
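Two of these limitations can be seen with a few lines of arithmetic. The sketch below uses small made-up data sets (not from the lesson) and a hand-written helper, `correlation`, whose name is only illustrative: a perfectly curved relationship yields a correlation of zero, and a single outlier pulls a perfect correlation down sharply.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Only linear relationships: y = x^2 is a perfect relationship,
# but it is curved, so r comes out as zero.
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]
r_curve = correlation(curve_x, curve_y)

# Not resistant: five points exactly on a line, then the same
# points plus one made-up outlier at (6, 0).
line_x = [1, 2, 3, 4, 5]
line_y = [1, 2, 3, 4, 5]
r_line = correlation(line_x, line_y)
r_outlier = correlation(line_x + [6], line_y + [0])

print(f"curved data:        r = {r_curve:.3f}")
print(f"perfect line:       r = {r_line:.3f}")
print(f"line + one outlier: r = {r_outlier:.3f}")
```

Plotting the data first would reveal both problems, which is why a scatterplot should come before any correlation is reported.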
Outliers vs Influential Observations
- An outlier is an observation that lies outside the overall pattern of the other observations.
- Outliers in the Y direction will have large residuals but may not influence the slope of the regression line.
- Outliers in the X direction are often influential observations.
- An influential observation is one that, if removed, would markedly change the result of the regression calculation.
Example 1
- Does the age at which a child begins to talk predict a later score on a test of mental ability? A study of the development of 21 children recorded the age in months at which they spoke their first word and their later Gesell Adaptive Score (GAS).
Child  Age  GAS    Child  Age  GAS    Child  Age  GAS
    1   15   95        8   11  100       15   11  102
    2   26   71        9    8  104       16   10  100
    3   10   83       10   20   94       17   12  105
    4    9   91       11    7  113       18   42   57
    5   15  102       12    9   96       19   17  121
    6   20   87       13   10   83       20   11   86
    7   18   93       14   11   84       21   10  100
Example 1 (cont.)
- What is the equation of the LS regression line used to model these data?
- What is the interpretation of these data?
y-hat = 109.8738 - 1.127x, r = -0.64
The scatterplot and the slope of the regression line indicate a negative association: children who begin to speak later tend to have lower test scores than early talkers. The slope suggests that for each additional month of age at first word, the predicted score on the Gesell test decreases by about 1.13 points. The y-intercept has no real meaning in this case, because no child speaks a first word at 0 months.
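The regression line and correlation reported for Example 1 can be checked directly from the table of 21 children. The following is a minimal sketch in plain Python; the helper name `least_squares` is an illustrative choice, not something from the lesson.

```python
from math import sqrt

# (age in months at first word, GAS) for children 1 through 21
data = [
    (15, 95), (26, 71), (10, 83), (9, 91), (15, 102), (20, 87), (18, 93),
    (11, 100), (8, 104), (20, 94), (7, 113), (9, 96), (10, 83), (11, 84),
    (11, 102), (10, 100), (12, 105), (42, 57), (17, 121), (11, 86), (10, 100),
]

def least_squares(points):
    """Return (intercept, slope, r) for the least-squares line y = a + b*x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r

intercept, slope, r = least_squares(data)
print(f"y-hat = {intercept:.4f} + ({slope:.4f})x, r = {r:.2f}")
```

The negative slope is what carries the interpretation: each extra month before the first word lowers the predicted score by a little over one point.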
Example 1 (cont.)
- Are there any outliers?
- Are there any influential observations?
Child 19 is an outlier in the Y-direction, and child 18 is an outlier in the X-direction. Child 18 is also an influential observation because it has a strong influence on the position of the regression line.
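One way to check these claims is to refit the line with each child left out in turn and see how far the slope moves. The sketch below does this for children 18 and 19; the helper `fit` and the variable names are illustrative only.

```python
from math import sqrt

# (age in months at first word, GAS) for children 1 through 21
data = [
    (15, 95), (26, 71), (10, 83), (9, 91), (15, 102), (20, 87), (18, 93),
    (11, 100), (8, 104), (20, 94), (7, 113), (9, 96), (10, 83), (11, 84),
    (11, 102), (10, 100), (12, 105), (42, 57), (17, 121), (11, 86), (10, 100),
]

def fit(points):
    """Return (intercept, slope, r) for the least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy / sqrt(sxx * syy)

def without(child):
    """Data with one child (numbered from 1) removed."""
    return [pt for i, pt in enumerate(data, start=1) if i != child]

_, slope_all, r_all = fit(data)
_, slope_no18, r_no18 = fit(without(18))  # drop the X-direction outlier
_, slope_no19, r_no19 = fit(without(19))  # drop the Y-direction outlier

print(f"all 21 children:  slope = {slope_all:.3f}, r = {r_all:.3f}")
print(f"without child 18: slope = {slope_no18:.3f}, r = {r_no18:.3f}")
print(f"without child 19: slope = {slope_no19:.3f}, r = {r_no19:.3f}")
```

Dropping child 18 moves the slope far more than dropping child 19 does, which is the sense in which child 18 is influential while child 19 is only an outlier.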
Example 1 (cont.)
[Figures: scatterplot with regression line; residual plot]
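The values behind a residual plot are simple to compute: each residual is the observed GAS minus the GAS predicted by the least-squares line. The sketch below (variable names are illustrative) lists the children with the largest residuals in absolute value.

```python
# (age in months at first word, GAS) for children 1 through 21
data = [
    (15, 95), (26, 71), (10, 83), (9, 91), (15, 102), (20, 87), (18, 93),
    (11, 100), (8, 104), (20, 94), (7, 113), (9, 96), (10, 83), (11, 84),
    (11, 102), (10, 100), (12, 105), (42, 57), (17, 121), (11, 86), (10, 100),
]

# Least-squares fit computed from the data
n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
         / sum((x - mean_x) ** 2 for x, _ in data))
intercept = mean_y - slope * mean_x

# residual = observed y - predicted y, keyed by child number
residuals = {
    child: y - (intercept + slope * x)
    for child, (x, y) in enumerate(data, start=1)
}

# Children sorted by the size of their residual, largest first
ranked = sorted(residuals, key=lambda child: abs(residuals[child]), reverse=True)
for child in ranked[:3]:
    print(f"child {child}: residual = {residuals[child]:+.1f}")

worst_child = ranked[0]
```

Child 19 stands out with by far the largest residual, while child 18, despite its extreme age, sits fairly close to the line because it pulls the line toward itself.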
Lurking or Extraneous Variable
- The relationship between two variables can often be misunderstood unless you take other variables into account.
- Association does not imply causation!
- Monthly counts of Rocky Mountain spotted fever cases and drownings are highly correlated, but neither causes the other; both rise in warm weather, which acts as the lurking variable.
Remember Sampling Distributions
- When we looked at individual values, they had much broader spreads (variances) than when we looked at the distributions of x-bar.
- The same is true of correlations based on averaged data: strong correlations may exist between averages, but individuals will have much greater variances.
- Correlations based on averages are usually too high when applied to individuals.
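A small numerical illustration may help here. The data below are invented purely for illustration (five groups of three individuals, built so that the group averages fall exactly on a line); the helper `correlation` and all variable names are likewise hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: each individual deviates from the group centre
# (10g, 20g) by one of these (x, y) offsets, which average to zero.
offsets = [(-3, 15), (0, -30), (3, 15)]
groups = [
    [(10 * g + dx, 20 * g + dy) for dx, dy in offsets]
    for g in range(1, 6)
]

# Correlation computed from all 15 individuals
individuals = [pt for members in groups for pt in members]
r_individuals = correlation(
    [x for x, _ in individuals], [y for _, y in individuals]
)

# Correlation computed from the 5 group averages
avg_x = [sum(x for x, _ in m) / len(m) for m in groups]
avg_y = [sum(y for _, y in m) / len(m) for m in groups]
r_averages = correlation(avg_x, avg_y)

print(f"r from individuals:    {r_individuals:.3f}")
print(f"r from group averages: {r_averages:.3f}")
```

Averaging removes the individual-level scatter, so the correlation between averages overstates how well x predicts y for any one individual.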
Summary and Homework
- Summary
  - Correlation and regression must be interpreted with caution
  - Plot data to be sure that the relationship is roughly linear and to detect outliers
  - Check for influential observations that substantially change the regression line
  - Lurking variables may explain the relationship between the explanatory and response variables
- Homework
  - pp. 242-243, 3.63-3.67