Title: Chapters 10 and 11: Using Regression to Predict
2. Overview
- Predicting Values
- The Regression Line
- The RMS Error
- The Regression Effect
- A Second Regression Line
- Summary
3. Predicting Values
- We have previously seen that a pair of data sets,
X and Y, can be characterized by their
five-statistic summary
- µX, the average value in X
- SDX, the standard deviation of X
- µY, the average value in Y
- SDY, the standard deviation of Y
- r, the correlation coefficient
- Often, we want to predict a y-value given a
particular x-value
- We want to use only the five-statistic summary to
make the prediction
4. Predicting Values
- Suppose we have the following five-statistic summary
for the height (X) and weight (Y) of men in
the US
- µX = 70 inches, SDX = 3 inches
- µY = 162 lbs, SDY = 30 lbs
- r = 0.47
- If you had to guess the weight of any one man,
what would be your best bet?
5. Predicting Values
- Suppose we have the following five-statistic summary
for the height (X) and weight (Y) of men in
the US
- µX = 70 inches, SDX = 3 inches
- µY = 162 lbs, SDY = 30 lbs
- r = 0.47
- Suppose you know the man's height is 1 SD above
average
- Would your best guess for his weight be 1 SD
above average?
6.
- The SD line is the dashed line running through
the scatter plot
- If we guessed 1 SD above average weight, where
would we be on the plot?
- What would a better guess be?
7. The Regression Line
- Suppose we have the following five-statistic summary
for the height (X) and weight (Y) of men in
the US
- µX = 70 inches, SDX = 3 inches
- µY = 162 lbs, SDY = 30 lbs
- r = 0.47
- It turns out that the correlation coefficient
determines the best guess
- For every SD we move in X, we should move r SDs
in Y
8. The Regression Line
- The regression line from X to Y
- Runs through the point of averages
- Has a slope of r times the slope of the SD line
- The regression line predicts the average value
for y within the narrowed-down range specified by
a given x
9. The Regression Line
- The formula for the regression line from X to Y
is zY = r × zX
- Or, alternately,
(y - µY) / SDY = r × (x - µX) / SDX
- When is the regression line the same as the SD
line?
When r = 1 or r = -1
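The standardized form of the regression line can be rearranged into slope-intercept form. The short derivation below is a sketch that uses only the five-statistic summary; the subscripted symbols are just the typeset versions of µX, SDX, µY, SDY, zX and zY.

```latex
\begin{align*}
z_Y &= r \, z_X \\
\frac{y - \mu_Y}{SD_Y} &= r \cdot \frac{x - \mu_X}{SD_X} \\
y &= \mu_Y + r \cdot \frac{SD_Y}{SD_X} \, (x - \mu_X)
\end{align*}
```

Read this way, the regression line has slope r × SDY / SDX and passes through the point of averages (µX, µY). When r = 1 or r = -1 the slope becomes SDY / SDX or -SDY / SDX, which is the slope of the SD line.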
10.
- The regression line is the solid line running
through the scatter plot
- If we look at heights 1 SD above the average,
the regression line runs through the point 0.47
SDs above average in weight
11. The Regression Line
- Suppose we have the following five-statistic summary
for the height (X) and weight (Y) of men in
the US
- µX = 70 inches, SDX = 3 inches
- µY = 162 lbs, SDY = 30 lbs
- r = 0.47
- What is the average weight of all the men who are
73 inches tall?
- For a man 73 inches tall, what weight should we
predict?
176.1 lbs
12. The Regression Line
- Suppose we have the following five-statistic summary
for the height (X) and weight (Y) of men in
the US
- µX = 70 inches, SDX = 3 inches
- µY = 162 lbs, SDY = 30 lbs
- r = 0.47
- What is the average weight of all the men who are
64 inches tall?
- For a man 64 inches tall, what weight should we
predict?
133.8 lbs
13. The Regression Line
- To use the regression line from X to Y
- Standardize the given x-value to get zX
- Use the regression equation to go from X to Y:
zY = r × zX
- Unstandardize zY to get y
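The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `predict_y` and its parameter names are hypothetical, and the numbers are the height and weight summary used throughout these slides.

```python
def predict_y(x, mean_x, sd_x, mean_y, sd_y, r):
    """Predict y from x using only the five-statistic summary."""
    z_x = (x - mean_x) / sd_x       # step 1: standardize x
    z_y = r * z_x                   # step 2: regression equation zY = r * zX
    return mean_y + z_y * sd_y      # step 3: unstandardize zY


# Height (X) and weight (Y) summary for men in the US, from the slides.
for height in (73, 64):
    weight = predict_y(height, 70, 3, 162, 30, 0.47)
    print(f"{height} inches -> {weight:.1f} lbs")
```

For heights of 73 and 64 inches this gives 176.1 lbs and 133.8 lbs, the answers on the two preceding slides.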
14. The Regression Line
- Suppose we have the following five-statistic summary
for the height (X) and weight (Y) of men in
the US
- µX = 70 inches, SDX = 3 inches
- µY = 162 lbs, SDY = 30 lbs
- r = 0.47
- Predict the weight of a man who is 6'4" (76 inches)
tall
190.2 lbs
15. The Regression Line
- Suppose we have the following five-statistic summary
for the height (X) and weight (Y) of men in
the US
- µX = 70 inches, SDX = 3 inches
- µY = 162 lbs, SDY = 30 lbs
- r = 0.47
- Predict the weight of a man who is 5'6" (66 inches)
tall
143.2 lbs
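The last two exercises can be checked with a short sketch. It assumes the heights are given in feet and inches (6'4" and 5'6"), which is the reading that reproduces the stated answers; the helper names `to_inches` and `predict_weight` are hypothetical.

```python
def to_inches(feet, inches):
    """Convert a height in feet and inches to inches."""
    return 12 * feet + inches


def predict_weight(height_in):
    """Regression prediction of weight (lbs) from height (inches)."""
    mean_x, sd_x = 70, 3        # height summary from the slides
    mean_y, sd_y = 162, 30      # weight summary from the slides
    r = 0.47
    z_x = (height_in - mean_x) / sd_x
    return mean_y + r * z_x * sd_y


for feet, inches in ((6, 4), (5, 6)):
    height = to_inches(feet, inches)
    print(f"{feet}'{inches}\" = {height} in -> {predict_weight(height):.1f} lbs")
```

The conversion to inches has to happen before standardizing, because µX and SDX are both stated in inches.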
16. The Regression Line
- Important notes about the regression line from X
to Y
- It predicts the average value for y given an x
value
- If the scatter plot is football-shaped, this
prediction will be above about half of the points
with that x-value and below the other half
- This is because the variables are approximately
normal
- The slope of the regression line will always be
17. The RMS Error
- Recall that an average alone did not uniquely
describe a data set
- A spread measure was needed
- Since the regression method only gives us an
average value as its prediction, we can't really
tell from this alone how good a guess it is
18.
- The prediction given by the regression line for a
height of 73 inches is at (73 in, 176 lbs)
- How much does the heaviest 73-inch-tall man weigh?
- How much does the lightest 73-inch-tall man weigh?
19. The RMS Error
- If we are given a specific man to predict, we are
likely to be a little off with the regression
prediction
- You can think of the prediction error as being
the vertical distance from the point to the
regression line
- That is, error = actual - predicted
- If we want to get a good sense of what the
typical error for a given x-value is, we can find
the RMS of all the errors for all the points
- This value is called the RMS error for the
regression line
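The definition above can be sketched directly: compute actual minus predicted for every point, then take the root mean square. The five (height, weight) pairs below are made up purely for illustration and are not data from the slides; the function names are hypothetical.

```python
import math


def predict_weight(height_in):
    """Regression line from the slides' height/weight summary."""
    return 162 + 0.47 * (height_in - 70) / 3 * 30


def rms(values):
    """Root mean square: square, average, then take the square root."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical (height in inches, weight in lbs) pairs, for illustration only.
sample = [(66, 150), (68, 140), (70, 175), (73, 160), (76, 210)]

# error = actual - predicted, i.e. the vertical distance to the line
errors = [weight - predict_weight(height) for height, weight in sample]
print(f"RMS error for this sample: {rms(errors):.1f} lbs")
```

Squaring before averaging keeps positive and negative errors from cancelling, which is the same reason the SD is built from squared deviations.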
20. The RMS Error
- The RMS error is to the regression line what the
SD is to the average
- The RMS error measures the spread around a
prediction from the regression line
- Recall we are generally assuming the data sets
are approximately normal
- About 68% of the points on a scatter plot will
fall within the strip that runs from one RMS
error below to one RMS error above the regression
line
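The 68% figure can be checked with a small simulation. This is a sketch under stated assumptions: it generates a football-shaped cloud in standard units from normal random numbers with correlation close to 0.47, uses the line zY = r × zX with the generating value of r, and fixes the random seed so the run is repeatable. None of this is data from the slides.

```python
import math
import random

random.seed(0)          # fixed seed so the sketch is repeatable
r = 0.47
n = 20000

# Build points in standard units whose correlation is close to r.
points = []
for _ in range(n):
    z_x = random.gauss(0, 1)
    z_y = r * z_x + math.sqrt(1 - r * r) * random.gauss(0, 1)
    points.append((z_x, z_y))

# Errors are vertical distances from the regression line zY = r * zX.
errors = [z_y - r * z_x for z_x, z_y in points]
rms_error = math.sqrt(sum(e * e for e in errors) / n)

# Fraction of points inside the strip of one RMS error either side.
fraction_inside = sum(1 for e in errors if abs(e) <= rms_error) / n
print(f"fraction within one RMS error of the line: {fraction_inside:.3f}")
```

The fraction comes out near 0.68 because the errors themselves are approximately normal, so the usual one-SD rule applies to them.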
22. The RMS Error
- The RMS error for regression from X to Y (denoted
R) can be calculated from the five-statistic
summary by R = sqrt(1 - r^2) × SDY
- What units would R have?
- What happens when r gets close to 0?
- What happens when r gets close to 1 or -1?
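The three questions above can be explored numerically. The sketch below assumes the formula R = sqrt(1 - r^2) × SDY, which reproduces the 26.5 lbs quoted in the later exercises; the function name `regression_rms_error` is hypothetical.

```python
import math


def regression_rms_error(r, sd_y):
    """RMS error for regression from X to Y; same units as sd_y."""
    return math.sqrt(1 - r ** 2) * sd_y


sd_weight = 30  # SDY in lbs, from the slides
for r in (0.0, 0.47, 0.9, 0.99, 1.0, -1.0):
    print(f"r = {r:5.2f}  ->  R = {regression_rms_error(r, sd_weight):.1f} lbs")
```

R carries the units of Y. With r near 0 it is close to SDY, so knowing x barely improves on guessing the average; with r near 1 or -1 it shrinks toward 0, because the points hug the line.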
23. The RMS Error
- The RMS error allows us to give a range around
our prediction
- If the scatter plot is football-shaped, the RMS
error is roughly constant across the entire range
of the data set
- The vertical spread around one part is about the
same as the vertical spread around other parts
24. The RMS Error
- Suppose we have the following five-statistic summary
for the height (X) and weight (Y) of men in
the US
- µX = 70 inches, SDX = 3 inches
- µY = 162 lbs, SDY = 30 lbs
- r = 0.47
- Predict and give the RMS error for the weight of
a man who is 6'2" (74 inches) tall
180.8 lbs, give or take 26.5 lbs
25. The RMS Error
- Suppose we have the following five-statistic summary
for the height (X) and weight (Y) of men in
the US
- µX = 70 inches, SDX = 3 inches
- µY = 162 lbs, SDY = 30 lbs
- r = 0.47
- Predict and give the RMS error for the weight of
a man who is 5'4" (64 inches) tall
133.8 lbs, give or take 26.5 lbs
26. The Regression Effect
- A preschool program attempts to boost students'
IQ scores
- The children are tested when they enter the
program (pretest)
- The children are retested when they leave the
program (post-test)
27. The Regression Effect
- On both occasions, the average IQ score was 100,
with an SD of 15
- Also, students with below-average IQs on the
pretest had scores that went up on average by
5 points
- Students with above-average scores on the pretest
had their scores drop by an average of 5 points
28. The Regression Effect
- Does the program equalize intelligence?
No. If the program really equalized
intelligence, then the SD for the post-test
results should be smaller than that of the
pretest results, but both SDs are 15. This is an
example of the regression effect.
29. The Regression Effect
- The regression effect is a byproduct of the fact
that predictions from a regression line are
average values
- Some of the people who did very well on the
pretest may simply have had a good test day
- Their scores shouldn't necessarily be as high on
the post-test as they were on the pretest
- Similarly, some of the people who did poorly on
the pretest may simply have had a bad test day
- Their scores shouldn't necessarily be as low on
the post-test as they were on the pretest
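The "good day, bad day" explanation can be illustrated with a simulation in which the program does nothing at all. This is a sketch under made-up assumptions: each child has a stable score with SD 12, and each test adds independent luck with SD 9, so both tests have an SD of about 15. The numbers 12 and 9 are hypothetical choices, not values from the study.

```python
import random
import statistics

random.seed(1)   # fixed seed so the sketch is repeatable
n = 20000

pre, post = [], []
for _ in range(n):
    stable = random.gauss(100, 12)            # hypothetical stable component
    pre.append(stable + random.gauss(0, 9))   # pretest = stable + luck
    post.append(stable + random.gauss(0, 9))  # post-test = stable + new luck

pre_mean = statistics.mean(pre)
low_gains = [b - a for a, b in zip(pre, post) if a < pre_mean]
high_gains = [b - a for a, b in zip(pre, post) if a >= pre_mean]

mean_gain_low = statistics.mean(low_gains)
mean_gain_high = statistics.mean(high_gains)
sd_pre, sd_post = statistics.pstdev(pre), statistics.pstdev(post)

print(f"below-average pretest group: average change {mean_gain_low:+.1f}")
print(f"above-average pretest group: average change {mean_gain_high:+.1f}")
print(f"SD pretest {sd_pre:.1f}, SD post-test {sd_post:.1f}")
```

The low group rises and the high group falls by a few points on average, while the two SDs stay about the same, which is the pattern described for the preschool program even though nothing in the simulation changes any child's stable score.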
30. The Regression Effect
- Sometimes researchers mistake the regression
effect for some important underlying cause in the
study (the regression fallacy)
- Tall fathers tend to have tall sons who are, on
average, slightly shorter than their fathers
- There is no biological cause for this reduction
- It is strictly statistical
31. The Regression Effect
- As part of their training, air force pilots make
practice landings with instructors, and are rated
on performance
- The instructors discuss the ratings with the
pilots after each landing
- Statistical analysis shows that pilots who make
poor landings the first time tend to do better
the second time
- Conversely, pilots who make good landings the
first time tend to do worse the second time
32. The Regression Effect
- The conclusion drawn was that criticism helps the
pilots while praise makes them do worse
- As a result, instructors were ordered to
criticize all landings, good or bad
- Was this warranted by the facts?
No. This is an example of the regression fallacy.
33. The Regression Effect
- An instructor gives a midterm
- She asks the students who score 20 points below
average to see her regularly during her office
hours for special tutoring
- They all score at class average or above on the
final
- Can this improvement be attributed to the
regression effect? Why/why not?
No. If it were only the regression effect, most
of the students still would have scored below
average. The fact that everyone in the tutoring
group scored at or above average indicates that
the tutoring had a real effect.
34. A Second Regression Line
- The focus so far has been on the regression line
from X to Y
- Note, however, that there is also a regression
line from Y to X
- What would the difference between the two lines
be?
The regression line from X to Y is given by
zY = r × zX, while the regression line from Y to X
is given by zX = r × zY
35. A Second Regression Line
- A study of 1,000 families gives the following
- The husbands' average height was 68 inches with
an SD of 2.7 inches
- The wives' average height was 63 inches with an
SD of 2.5 inches
- The correlation between them was 0.25
- Predict and give the RMS error for the husband's
height when his wife's height is 68 inches
69.35 inches, give or take 2.61 inches
36. A Second Regression Line
- A study of 1,000 families gives the following
- The husbands' average height was 68 inches with
an SD of 2.7 inches
- The wives' average height was 63 inches with an
SD of 2.5 inches
- The correlation between them was 0.25
- Predict and give the RMS error for the wife's
height when her husband's height is 69.35 inches
63.31 inches, give or take 2.42 inches
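The two exercises above use the two different regression lines, and a short sketch shows why the round trip does not come back to where it started. The function names `predict` and `rms_error` are hypothetical; the numbers are the husband and wife summary from the slides.

```python
import math


def predict(value, mean_from, sd_from, mean_to, sd_to, r):
    """Regression prediction of the 'to' variable from the 'from' variable."""
    z_from = (value - mean_from) / sd_from
    return mean_to + r * z_from * sd_to


def rms_error(r, sd_to):
    """RMS error, in the units of the variable being predicted."""
    return math.sqrt(1 - r ** 2) * sd_to


r = 0.25
wife_mean, wife_sd = 63, 2.5
husband_mean, husband_sd = 68, 2.7

# Wife's height -> husband's height
husband = predict(68, wife_mean, wife_sd, husband_mean, husband_sd, r)
print(f"husband: {husband:.2f} in, give or take {rms_error(r, husband_sd):.2f} in")

# Husband's height -> wife's height (the other regression line)
wife = predict(husband, husband_mean, husband_sd, wife_mean, wife_sd, r)
print(f"wife: {wife:.2f} in, give or take {rms_error(r, wife_sd):.2f} in")
```

Starting from a wife 68 inches tall, the prediction for her husband is 69.35 inches, but predicting back from a 69.35-inch husband gives a wife of about 63.31 inches, not 68. Each prediction pulls toward the average by a factor of r, so the two lines are not inverses of each other unless r is 1 or -1.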
37-39. A Second Regression Line
[Figures: Regression Line from X to Y]
40. Summary
- When trying to make predictions from a
football-shaped plot, a good predictor is the
average value for one variable within a
restricted range in the other
- The regression line runs through all of these
averages
- For every SD moved in the independent variable,
the regression line predicts a move of r SDs in
the dependent variable
- The prediction from the regression line is likely
to be off by about the RMS error
- The RMS error for regression from X to Y can be
calculated as R = sqrt(1 - r^2) × SDY
41. Summary
- The regression effect is purely statistical
- It does not reflect a significant underlying
trend in the data
- There are two regression lines for a scatter plot
- Which one to use depends on which variable you
are predicting