Title: Lecture 4: Data Relationships part 2
Slide 1: Lecture 4, Data Relationships (part 2)
- Least squares regression (Section 2.3)
- Goodness of fit measures
- ($R^2$, residuals, r.m.s. error)
Slide 2: Regression line: fitting a line to data
If the scatter plot shows a clear linear pattern, a straight line through the points can describe the overall pattern. Fitting a line means drawing a line that is as close as possible to the points; the best such straight line is the regression line.
[Figure: scatter plot of birth rate (per 1,000 population) against log G.N.P.]
Slide 3: Prediction errors
For a given x, use the regression line to predict the response. The accuracy of the prediction depends on how spread out the observations are around the line.
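[Figure: scatter plot of Y against x with the regression line; for one point, the observed value y, the predicted value, and the error between them are marked.]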
Slide 4: Example: productivity level
To see how productivity was related to the level of maintenance, a firm randomly selected 5 of its high-speed machines for an experiment. Each machine was randomly assigned a different level of maintenance X and then had its average number of stoppages Y recorded.
[Figure: scatter plot of the number of interruptions Y (vertical axis, 0 to 1.8) against the hours of maintenance X (horizontal axis, 2 to 16); $r = -0.94$.]
Slide 5: Least squares regression line
Definition: the regression line of y on x is the line that makes the sum of the squares of the vertical distances (deviations) of the data points from the line as small as possible.
It is defined as
$\hat{y} = a x + b$
where $a = r \times \mathrm{sd}_y / \mathrm{sd}_x$ and $b = \mathrm{ave}_y - a \times \mathrm{ave}_x$.
Note: a has the same sign as r.
We use $\hat{y}$ to distinguish the values predicted from the regression line from the observed values y.
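The definition above can be checked numerically. The following is a minimal sketch, not the lecture's own code: the data are hypothetical, and the helper names `correlation`, `regression_line`, and `sum_sq` are introduced only for illustration. It computes a and b from the formulas and shows that nudging either coefficient increases the sum of squared vertical distances.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Correlation r: average product of the standardized x and y values."""
    ax, ay = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return mean([(x - ax) / sx * (y - ay) / sy for x, y in zip(xs, ys)])

def regression_line(xs, ys):
    """Return (a, b) for y_hat = a*x + b with a = r*sd_y/sd_x, b = ave_y - a*ave_x."""
    # The ratio sd_y/sd_x is the same for population and sample sd.
    a = correlation(xs, ys) * pstdev(ys) / pstdev(xs)
    b = mean(ys) - a * mean(xs)
    return a, b

def sum_sq(xs, ys, a, b):
    """Sum of squared vertical distances of the points from y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = regression_line(xs, ys)
best = sum_sq(xs, ys, a, b)
# Sum of squares after nudging the slope or the intercept by 0.1.
nudged = [sum_sq(xs, ys, a + da, b + db)
          for da, db in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]]

print(f"y_hat = {a:.3f} x + {b:.3f}, sum of squares = {best:.4f}")
print("every nudged line is worse:", all(s > best for s in nudged))
```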
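Slide 6: Example (continued)
The regression line of the number of interruptions on the hours of maintenance per week is calculated as follows. The descriptive statistics for x and y are $\mathrm{ave}_x = 8$, $\mathrm{sd}_x = 3.16$, $\mathrm{ave}_y = 1$, $\mathrm{sd}_y = 0.45$, and $r = -0.94$.
Slope: $a = r \times \mathrm{sd}_y / \mathrm{sd}_x = -0.135$
Intercept: $b = \mathrm{ave}_y - a \times \mathrm{ave}_x = 1 - (-0.135) \times 8 = 2.08$
Regression line: $\hat{y} = -0.135\,x + 2.08$, that is, $\hat{y} = -0.135 \times \text{hours} + 2.08$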
Slide 7: Regression line $\hat{y} = -0.135 \times \text{hours} + 2.08$
To draw the line, find two points that satisfy the regression equation and connect them!
- Point of averages: (8, 1)
- Point on the line: (6, 1.27), found by plugging x = 6 into the regression equation: $\hat{y} = -0.135 \times 6 + 2.08 = 1.27$
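The arithmetic of this example can be replayed in a few lines. This is a sketch under two assumptions: r is taken as negative (consistent with the negative slope), and `predict` is a hypothetical helper using the slides' rounded coefficients. With the rounded statistics the formulas give a slope near -0.134 and an intercept near 2.07, slightly different from the slides' -0.135 and 2.08, presumably because the slides started from unrounded statistics.

```python
# Summary statistics from the example (r assumed negative, see lead-in).
ave_x, sd_x = 8, 3.16
ave_y, sd_y = 1, 0.45
r = -0.94

a = r * sd_y / sd_x      # slope
b = ave_y - a * ave_x    # intercept
print(f"from rounded statistics: slope = {a:.3f}, intercept = {b:.2f}")

def predict(hours, slope=-0.135, intercept=2.08):
    """Predicted number of interruptions, using the slides' rounded line."""
    return slope * hours + intercept

# Two points that lie on the line and can be connected to draw it.
print(f"point of averages: (8, {predict(8):.2f})")
print(f"second point:      (6, {predict(6):.2f})")
```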
Slide 8: Example: CPU usage
A study was conducted to examine what factors affect CPU usage. A set of 38 processes written in a programming language was considered. For each program, data were collected on the CPU usage (time), in seconds, and the number of lines (line), in thousands, generated by the program execution.
The scatter plot shows a clear positive association. We'll fit a regression line to model the association!
[Figure: scatter plot of CPU usage against number of lines.]
Slide 9:
Variable   N    Mean      Std Dev   Sum         Minimum   Maximum
Y time     38   0.15710   0.13129   5.96980     0.01960   0.46780
X linet    38   3.16195   3.96094   120.15400   0.10200   14.87200
Pearson correlation coefficient: 0.89802
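Applying the slope and intercept formulas to these statistics, the regression line is approximately $\hat{y} = 0.0630 + 0.0298 \times \text{line}$.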
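As a check on the CPU usage example, the slope and intercept can be recomputed from the summary statistics alone. This is a sketch, not output from the original analysis: `line_from_summary` is a hypothetical helper, and the 5-thousand-line program at the end is an invented input used only to show a prediction.

```python
def line_from_summary(ave_x, sd_x, ave_y, sd_y, r):
    """Slope a and intercept b of y_hat = a*x + b from summary statistics."""
    a = r * sd_y / sd_x
    b = ave_y - a * ave_x
    return a, b

# Values copied from the summary table (Y = time, X = linet).
a, b = line_from_summary(ave_x=3.16195, sd_x=3.96094,
                         ave_y=0.15710, sd_y=0.13129, r=0.89802)
print(f"time_hat = {b:.4f} + {a:.4f} * line")

# Hypothetical input: a program that generates 5 thousand lines.
print(f"predicted CPU time: {a * 5 + b:.3f} seconds")
```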
Slide 10: Coefficient of determination
Goodness of fit measures
- $R^2 = (\text{correlation coefficient})^2 = r^2$
- Describes how good the regression line is in explaining the response y.
- Is the fraction of the variation in the values of y that is explained by the regression line of y on x.
- Varies between 0 and 1. If the value is close to 1, the regression line provides a good explanation of the data; if it is close to zero, the regression line is not able to capture the variability in the data.
- Example: the correlation coefficient is $r = -0.94$, so $r^2 = 0.8836$. The regression line is able to capture about 88% of the variability in the data.
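The phrase "fraction of the variation explained" can be made concrete. The sketch below uses hypothetical data and checks that 1 minus (sum of squared residuals / total variation of y about its mean) equals the squared correlation for the least squares line.

```python
from statistics import mean

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]

ax, ay = mean(xs), mean(ys)
sxx = sum((x - ax) ** 2 for x in xs)
syy = sum((y - ay) ** 2 for y in ys)           # total variation of y
sxy = sum((x - ax) * (y - ay) for x, y in zip(xs, ys))

r = sxy / (sxx * syy) ** 0.5                   # correlation coefficient
a = sxy / sxx                                  # same as r * sd_y / sd_x
b = ay - a * ax

sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))  # left-over variation
explained = 1 - sse / syy

print(f"r^2 = {r ** 2:.4f}, fraction explained = {explained:.4f}")
```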
Slide 11: 2. Residuals
The vertical distances between the observed points and the regression line can be regarded as the left-over variation in the response after fitting the regression line. A residual is the difference between an observed value of the response variable y and the value predicted by the regression line:
residual $e = \text{observed } y - \text{predicted } y = y - \hat{y}$
A special property: the average of the residuals is always zero.
In statistics, a prediction error is called a residual.
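The zero-average property is easy to verify. In this sketch the data are hypothetical and `residuals` is a helper introduced here; the second call shifts the intercept to show that the property belongs to the least squares line, not to an arbitrary line.

```python
from statistics import mean

def residuals(xs, ys, a, b):
    """Observed y minus predicted y for the line y_hat = a*x + b."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]

# Hypothetical data, for illustration only.
xs = [2, 4, 6, 8, 10]
ys = [3.1, 4.8, 7.2, 8.9, 11.0]

ax, ay = mean(xs), mean(ys)
a = sum((x - ax) * (y - ay) for x, y in zip(xs, ys)) / sum((x - ax) ** 2 for x in xs)
b = ay - a * ax

e_fit = residuals(xs, ys, a, b)            # least squares line
e_shifted = residuals(xs, ys, a, b + 0.5)  # same slope, intercept moved up by 0.5

print(f"mean residual, least squares line: {mean(e_fit):.6f}")
print(f"mean residual, shifted line:       {mean(e_shifted):.6f}")
```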
Slide 12: Example: residuals for the regression line $\hat{y} = 2.08 - 0.135\,x$ for the number of interruptions Y on the hours of maintenance X.
Slide 13: 3. Accuracy of the predictions
If the cloud of points is football-shaped, the prediction errors are similar along the regression line. One possible measure of the accuracy of the regression predictions is given by the root mean square error (r.m.s. error). The r.m.s. error is defined as the square root of the average squared residual. This is an estimate of the variation of y about the regression line.
Slide 14:
- Within 1 r.m.s. error of the regression line: roughly 68% of the points
- Within 2 r.m.s. errors of the regression line: roughly 95% of the points
Slide 15: Computing the r.m.s. error
The r.m.s. error is $\sqrt{0.0911/4} = 0.151$.
If the company schedules 7 hours of maintenance per week, the predicted weekly number of interruptions of the machine will be $2.08 - 0.135 \times 7 = 1.135$ on average. Using the r.m.s. error, the number of interruptions will most likely be between $1.135 - 2 \times 0.151 = 0.833$ and $1.135 + 2 \times 0.151 = 1.437$.
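The numbers on this slide can be reproduced directly. This is a small sketch that only replays the slide's arithmetic (the value 0.0911 and the divisor 4 are taken from the slide as given); the variable names are illustrative.

```python
import math

# r.m.s. error, using the slide's figures.
rms = math.sqrt(0.0911 / 4)

# Prediction for 7 hours of maintenance per week.
hours = 7
predicted = 2.08 - 0.135 * hours

# Range of plus or minus 2 r.m.s. errors around the prediction.
low = predicted - 2 * rms
high = predicted + 2 * rms

print(f"r.m.s. error = {rms:.3f}")
print(f"prediction   = {predicted:.3f}")
print(f"likely range = ({low:.3f}, {high:.3f})")
```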
Slide 16: Looking at vertical strips
When all the vertical strips in a scatter plot show a similar amount of spread, the diagram is said to be homoscedastic. A football-shaped cloud of points is homoscedastic! Consider the data on the birth rate and the GNP index in 97 countries.
[Figure: scatter plot of birth rate (per 1,000 population) against log G.N.P., with the predicted points marked in the corresponding vertical strips.]
Slide 17:
In a football-shaped scatter diagram, consider the points in a vertical strip. The value predicted by the regression line can be regarded as the average of their y-values. Their standard deviation is about equal to the r.m.s. error of the regression line.
[Figure: scatter plot against log G.N.P. with one vertical strip highlighted; $\hat{y}$ is the average of the y-values in the strip, and their s.d. is roughly the r.m.s. error.]
Slide 18: Computing the r.m.s. error
In large data sets, the r.m.s. error is approximately equal to $\sqrt{1 - r^2} \times \mathrm{sd}_y$.
Consider the example on birth rate and GNP index, where r = 0.74.
- For x = 8 the predicted birth rate is $\hat{y} = 26.35$.
How accurate is this prediction?
Slide 19:
The r.m.s. error is $\sqrt{1 - 0.74^2} \times 13.55 = 9.11$. Thus roughly 68% of the countries with log GNP = 8, equal to about 3000 per capita, have a birth rate between $26.35 - 9.11 = 17.24$ and $26.35 + 9.11 = 35.46$. Most likely the countries with log GNP = 8 have a birth rate in the interval (8.13, 44.57), since $26.35 - 2 \times 9.11 = 8.13$ and $26.35 + 2 \times 9.11 = 44.57$.
[Figure: scatter plot of birth rate against log G.N.P.; 95% of the points in the strip for log GNP = 8 lie between 8.13 and 44.57.]
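The large-sample shortcut and the two intervals can be replayed as follows. This is a sketch of the slide's arithmetic only; the r.m.s. error is rounded to two decimals before building the intervals, as the slide appears to do.

```python
import math

r = 0.74          # correlation as printed on the slide (only r squared is used)
sd_y = 13.55      # standard deviation of the birth rates
predicted = 26.35 # predicted birth rate at log GNP = 8

# Large-sample approximation: sqrt(1 - r^2) * sd_y, rounded as on the slide.
rms = round(math.sqrt(1 - r ** 2) * sd_y, 2)

# Roughly 68% of the strip lies within 1 r.m.s. error, 95% within 2.
band_68 = (round(predicted - rms, 2), round(predicted + rms, 2))
band_95 = (round(predicted - 2 * rms, 2), round(predicted + 2 * rms, 2))

print(f"r.m.s. error = {rms}")
print(f"68% band: {band_68}")
print(f"95% band: {band_95}")
```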
Slide 20: A tool to detect possible problems in the regression analysis: the residual plot
- The analysis of the residuals is useful to detect possible problems and anomalies in the regression.
- A residual plot is a scatter plot of the regression residuals against the explanatory variable.
- Points should be randomly scattered inside a band centered around the horizontal line at zero (the mean of the residuals).
Slide 21:
[Figure: residual plots (residual against X). Good case. Bad cases: variation of y changing with x; non-linear relationship.]
Slide 22: Anomalies in the regression analysis
- If the residual plot displays a curve, the straight line is not a good description of the association between x and y.
- If the residual plot is fan-shaped, the variation of y is not constant. In the figure above, predictions of y will be less precise as x increases, since y shows a higher variability for higher values of x.
- Be careful if you use the r.m.s. error!
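The curved-residual case can be illustrated without a plot. In this sketch the data are deliberately non-linear and entirely hypothetical (y = x squared), and `fit_line` is a helper introduced here. The signs of the residuals, listed in order of x, come out in long runs instead of being mixed at random, which is what a curve in the residual plot looks like.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least squares slope and intercept for y_hat = a*x + b."""
    ax, ay = mean(xs), mean(ys)
    sxx = sum((x - ax) ** 2 for x in xs)
    sxy = sum((x - ax) * (y - ay) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, ay - a * ax

# Hypothetical, deliberately curved data: y = x^2.
xs = list(range(1, 10))
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]

# Sign of each residual, in order of increasing x.
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(signs)  # prints "++-----++"
```

Positive residuals at both ends and negative ones in the middle mean the line undershoots, then overshoots, then undershoots again, so a straight line is not a good description of these data.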
Slide 23: Example of CPU usage data: residual plot
Do you see any striking pattern?
Slide 24: Example: 100-meter dash
At the 1987 World Championship in Rome, Ben Johnson set a new world record in the 100-meter dash.
The data: Y = the elapsed time from the start of the race, in 10-meter increments, for Ben Johnson; X = meters.
              Meters   Johnson's time
Average       55       5.83
St. dev.      30.27    2.52
Correlation   0.999
[Figure: scatter plot of Johnson's elapsed time against meters.]
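Slide 25: Regression line
[Figure: scatter plot of elapsed time against meters with the fitted regression line.]
The fitted regression line is $\hat{y} = 1.11 + 0.09 \times \text{meters}$. The value of $R^2$ is 0.999; therefore 99.9% of the variability in the data is explained by the regression line.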
Slide 26: Residual plot
[Figure: residual plot, residual against meters.]
Does the graph show any anomaly?
Slide 27: Confounding factor
A confounding factor is a variable that has an important effect on the relationship among the variables in a study but is not included in the study.
Example: the mathematics department of a large university must plan the timetable for the following year. Data are collected on the enrollment year, the number x of first-year students, and the number y of students enrolled in elementary math courses.
The fitted regression line has equation $\hat{y} = 2491.69 + 1.0663\,x$, with $R^2 = 0.694$.
Slide 28: Residual analysis
Do the residuals have a random pattern?
Slide 29: Scatter plot of residuals vs. year
[Figure: residuals plotted against enrollment year, 1991 to 1997.]
The plot of the residuals against the year suggests that a change took place between 1994 and 1995. This caused a higher number of students to take math courses.
Slide 30: Outliers and influential points
An outlier is an observation that lies outside the overall pattern of the other observations.
[Figure: scatter plot with a fitted line; one point far from the line is labeled as an outlier with a large residual.]
Slide 31: Influential point
An observation is influential for the regression line if removing it would considerably change the fitted line. An influential point pulls the regression line towards it.
[Figure: scatter plot with one influential point, showing the fitted regression line and the regression line obtained if the influential point is omitted.]
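The pull of an influential point can be seen by fitting the line twice, with and without it. The data below are hypothetical (five points close to a line plus one point far to the right), and `fit_line` is a helper introduced for this sketch.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least squares slope and intercept for y_hat = a*x + b."""
    ax, ay = mean(xs), mean(ys)
    sxx = sum((x - ax) ** 2 for x in xs)
    sxy = sum((x - ax) * (y - ay) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, ay - a * ax

# Hypothetical data: the last point (15, 3.0) lies far from the others in x.
xs = [1, 2, 3, 4, 5, 15]
ys = [1.0, 2.1, 2.9, 4.2, 5.0, 3.0]

slope_all, intercept_all = fit_line(xs, ys)
slope_omit, intercept_omit = fit_line(xs[:-1], ys[:-1])

print(f"all points:       y_hat = {slope_all:.3f} x + {intercept_all:.3f}")
print(f"point omitted:    y_hat = {slope_omit:.3f} x + {intercept_omit:.3f}")
```

With the extreme point included, the slope drops from about 1 to below 0.1: a single observation has pulled the line towards itself.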
Slide 32: Example: house tax and prices in Albuquerque
The regression line is $\hat{y} = 365.66 + 0.5488 \times \text{price}$. The coefficient of determination is $R^2 = 0.4274$.
[Figure: scatter plot of annual tax against selling price.]
What does the value of $R^2$ say?
Are there any possible outliers and/or influential points?
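Slide 33: New analysis omitting the influential points
The regression line is $\hat{y} = -55.364 + 0.8483 \times \text{price}$. The coefficient of determination is $R^2 = 0.8273$.
[Figure: scatter plot of annual tax against selling price, showing the new regression line and the previous regression line.]
The new regression line explains 82.7% of the variation in y.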
Slide 34: Extrapolation
- Extrapolation is when we use a regression equation to predict values outside of the observation range. This is dangerous and often inappropriate, and may produce unreasonable answers.
- Example 1: a linear model which relates weight gain to age for young children. Applying such a model to adults, or even teenagers, would be absurd.
- Example 2: selling price of houses. The regression line should not be used to predict the annual taxes for expensive houses that cost over 500,000 dollars.
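The maintenance example from earlier in the lecture gives a quick illustration of an unreasonable extrapolated answer. This is a sketch under a stated assumption: the observed range of 4 to 12 hours is hypothetical (the slides do not list the data), and `predict_interruptions` is a helper introduced here that refuses to predict outside that range.

```python
def predict_interruptions(hours, observed_range=(4, 12)):
    """Predict from y_hat = 2.08 - 0.135*hours, but only inside the observed range.

    The default range (4, 12) is an assumption made for this sketch.
    """
    low, high = observed_range
    if not low <= hours <= high:
        raise ValueError(
            f"{hours} hours is outside the observed range {observed_range}; "
            "the regression line should not be used here"
        )
    return 2.08 - 0.135 * hours

print(f"7 hours:  {predict_interruptions(7):.3f} interruptions")

# Blindly applying the line at 20 hours gives a negative count,
# which is not a possible number of interruptions.
naive = 2.08 - 0.135 * 20
print(f"20 hours (naive extrapolation): {naive:.2f}")

try:
    predict_interruptions(20)
except ValueError as err:
    print("refused:", err)
```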
Slide 35: Summary
- Correlation measures linear association; the regression line should be used only when the association is linear.
- Check residual plots to detect anomalies and hidden patterns which are not captured by the regression line.
- Correlation and the regression line are sensitive to influential / extreme points.
- Extrapolation: do not use the regression line to predict values outside the observed range; such predictions are not reliable.