Title: Chapter 4 Describing the Relation Between Two Variables
Section 4.3: Diagnostics on the Least-Squares Regression Line
The coefficient of determination, $R^2$, measures
the percentage of total variation in the response
variable that is explained by the least-squares
regression line.
The coefficient of determination is a number
between 0 and 1, inclusive. That is, $0 \le R^2 \le 1$.
If $R^2 = 0$, the line has no explanatory value. If
$R^2 = 1$, the line explains 100% of the variation
in the response variable.
The following data are based on a study for
drilling rock. The researchers wanted to
determine whether the time it takes to dry drill
a distance of 5 feet in rock increases with the
depth at which the drilling begins. So, depth at
which drilling begins is the predictor variable,
x, and time (in minutes) to drill five feet is
the response variable, y.
Source: Penner, R., and Watts, D.G. "Mining
Information." The American Statistician, Vol. 45,
No. 1, Feb. 1991, p. 6.
Sample Statistics
         Mean     Standard Deviation
Depth    126.2    52.2
Time     6.99     0.781
Correlation between Depth and Time: 0.773
Regression Analysis
The regression equation is
Time = 5.53 + 0.0116 Depth
Suppose we were asked to predict the time to
drill an additional 5 feet, but we did not know
the current depth of the drill. What would be
our best guess?
ANSWER: The mean time to drill an additional
5 feet, which is 6.99 minutes.
Now suppose that we are asked to predict the time
to drill an additional 5 feet when the current
depth of the drill is 160 feet.
ANSWER: The regression equation gives
5.53 + 0.0116(160) = 7.39 minutes (rounded).
Our guess increased from 6.99 minutes to 7.39
minutes based on the knowledge that drill depth
is positively associated with drill time.
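The two guesses above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function name `predict_time` is an illustrative assumption, while the intercept, slope, and mean are the values reported in the regression output for the drilling data.

```python
# Values reported for the drilling data: Time = 5.53 + 0.0116 Depth
INTERCEPT = 5.53   # minutes
SLOPE = 0.0116     # minutes per foot of starting depth
MEAN_TIME = 6.99   # sample mean drill time, in minutes

def predict_time(depth_ft):
    """Predicted time (minutes) to drill 5 feet starting at depth_ft."""
    return INTERCEPT + SLOPE * depth_ft

# Without depth information, the best guess is the sample mean.
best_guess_without_depth = MEAN_TIME
# With depth information, use the regression line.
best_guess_at_160 = predict_time(160)

print(f"Depth unknown:  {best_guess_without_depth:.2f} minutes")
print(f"Depth 160 feet: {best_guess_at_160:.2f} minutes")
```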
The difference between the predicted drill time
of 6.99 minutes and the predicted drill time of
7.39 minutes is due to the depth of the drill.
In other words, the difference in our guess is
explained by the depth of the drill. The
difference between the predicted value of 7.39
minutes and the observed drill time of 7.92
minutes is explained by factors other than drill
depth.
In the following definitions, $y$ is the observed
value of the response variable, $\hat{y}$ is the
value predicted by the regression line, and
$\bar{y}$ is the mean of the response variable.
The difference between the observed value of the
response variable and the mean value of the
response variable is called the total deviation
and is equal to $y - \bar{y}$.
The difference between the predicted value of the
response variable and the mean value of the
response variable is called the explained
deviation and is equal to $\hat{y} - \bar{y}$.
The difference between the observed value of the
response variable and the predicted value of the
response variable is called the unexplained
deviation and is equal to $y - \hat{y}$.
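As a worked check, the three deviations can be computed for the drilling observation discussed earlier (starting depth 160 feet, observed time 7.92 minutes). This is a sketch that reuses the reported regression equation and sample mean; the variable names are illustrative assumptions.

```python
MEAN_TIME = 6.99                  # sample mean drill time (minutes)
observed = 7.92                   # observed drill time at depth 160 feet
predicted = 5.53 + 0.0116 * 160   # regression prediction at depth 160 feet

total = observed - MEAN_TIME        # total deviation
explained = predicted - MEAN_TIME   # explained deviation
unexplained = observed - predicted  # unexplained deviation

print(f"total       = {total:.3f}")
print(f"explained   = {explained:.3f}")
print(f"unexplained = {unexplained:.3f}")
# The explained and unexplained parts add up to the total deviation.
print(f"explained + unexplained = {explained + unexplained:.3f}")
```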
Total Deviation = Unexplained Deviation + Explained Deviation
Total Variation = Unexplained Variation + Explained Variation
Dividing both sides by the total variation gives
$$1 = \frac{\text{Unexplained Variation}}{\text{Total Variation}} + \frac{\text{Explained Variation}}{\text{Total Variation}}$$
so that
$$\frac{\text{Explained Variation}}{\text{Total Variation}} = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}}$$
To determine $R^2$ for the linear regression model,
simply square the value of the linear correlation
coefficient, $r$.
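The decomposition and the shortcut $R^2 = r^2$ can be checked numerically. The sketch below uses the eight radioactive-decay measurements that appear later in this section, because the raw drilling data are not reproduced in this transcript. The helpers `least_squares` and `correlation` are illustrative names written out by hand, and the sketch assumes that "variation" means a sum of squared deviations.

```python
import math

# Radioactive-decay measurements from this section: day and weight in grams
days = [0, 1, 2, 3, 4, 5, 6, 7]
grams = [1000.0, 897.1, 802.5, 719.8, 651.1, 583.4, 521.7, 468.3]

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def correlation(x, y):
    """Linear correlation coefficient r."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

b0, b1 = least_squares(days, grams)
y_bar = sum(grams) / len(grams)
fitted = [b0 + b1 * d for d in days]

# Variation = sum of squared deviations (assumed definition, see lead-in)
total_variation = sum((y - y_bar) ** 2 for y in grams)
explained_variation = sum((f - y_bar) ** 2 for f in fitted)
unexplained_variation = sum((y - f) ** 2 for y, f in zip(grams, fitted))

r_squared = explained_variation / total_variation
r = correlation(days, grams)

print(f"total = {total_variation:.1f}")
print(f"explained + unexplained = {explained_variation + unexplained_variation:.1f}")
print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```

For a least-squares line the two printed totals agree, and the ratio of explained to total variation matches the squared correlation coefficient.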
EXAMPLE: Determining the Coefficient of
Determination
Find and interpret the coefficient of
determination for the drilling data.
Because the linear correlation coefficient, $r$, is
0.773, we have that $R^2 = 0.773^2 = 0.5975 = 59.75\%$.
So, 59.75% of the variability in
drilling time is explained by the least-squares
regression line.
Draw a scatter diagram for each of these data
sets. For each data set, the variance of y is
17.49.
Data Set A: $R^2 = 100\%$
Data Set B: $R^2 = 94.7\%$
Data Set C: $R^2 = 9.4\%$
Residuals play an important role in determining
the adequacy of the linear model. In fact,
residuals can be used for the following purposes:
- To determine whether a linear model is
  appropriate to describe the relation between the
  predictor and response variables.
- To determine whether the variance of the
  residuals is constant.
- To check for outliers.
If a plot of the residuals against the predictor
variable shows a discernible pattern, such as a
curve, then the response and predictor variables
may not be linearly related.
A chemist has a 1000-gram sample of a radioactive
material. She records the amount of radioactive
material remaining in the sample every day for a
week and obtains the following data.
Day    Weight (in grams)
0      1000.0
1      897.1
2      802.5
3      719.8
4      651.1
5      583.4
6      521.7
7      468.3
Linear correlation coefficient: $r = -0.994$
Linear model not appropriate: despite the strong
linear correlation, the residuals follow a curved
pattern.
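The curved pattern can be seen without a plot by listing the residuals from a straight-line fit to the decay data. This is a minimal sketch; `least_squares` is an illustrative helper written out by hand, not a function from any particular package.

```python
# Radioactive-decay measurements from this section: day and weight in grams
days = [0, 1, 2, 3, 4, 5, 6, 7]
grams = [1000.0, 897.1, 802.5, 719.8, 651.1, 583.4, 521.7, 468.3]

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

b0, b1 = least_squares(days, grams)

# Residual = observed value minus value predicted by the line
residuals = [y - (b0 + b1 * d) for d, y in zip(days, grams)]

for d, e in zip(days, residuals):
    print(f"day {d}: residual {e:+.1f}")
```

The residuals are positive at both ends of the week and negative in the middle, which is the kind of discernible pattern that signals a nonlinear relation.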
If a plot of the residuals against the predictor
variable shows the spread of the residuals
increasing or decreasing as the predictor
increases, then a strict requirement of the
linear model is violated. This requirement is
called constant error variance. The statistical
term for constant error variance is
homoscedasticity.
A plot of residuals against the predictor
variable may also reveal outliers. These values
will be easy to identify because the residual
will lie far from the rest of the plot.
We can also use a boxplot of residuals to
identify outliers.
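A boxplot flags outliers using fences built from the quartiles. The sketch below assumes the common 1.5 × IQR fence rule, which the slides do not state explicitly, and it runs on a made-up list of residuals that is purely hypothetical (it is not the drilling data). The function name `iqr_outliers` is also an illustrative assumption.

```python
import statistics

def iqr_outliers(values):
    """Return the values outside the 1.5 * IQR fences (assumed boxplot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical residuals, invented only to illustrate the rule
hypothetical_residuals = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.6, 0.9, 5.0]

print(iqr_outliers(hypothetical_residuals))
```

Software packages compute quartiles in slightly different ways, so fences from another tool may differ a little from the ones computed here.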
EXAMPLE: Residual Analysis
Draw a residual plot of the drilling time data.
Comment on the appropriateness of the linear
least-squares regression model.
[Figure: Boxplot of Residuals for the Drilling Data]
An influential observation is one that has a
disproportionate effect on the value of the slope
and y-intercept in the least-squares regression
equation.
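[Figure: scatter diagrams labeled Case 1 (outlier),
Case 2, and Case 3 (influential)]
Influential observations typically occur when a
point's x-value lies far from the other x-values
and its y-value does not follow the trend of the
remaining data.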
EXAMPLE: Influential Observations
Suppose an additional data point is added to the
drilling data. At a depth of 300 feet, it took
12.49 minutes to drill 5 feet. Is this point
influential?
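A first look at the new point needs only the summary statistics and the regression equation already reported for the drilling data. The sketch below measures how far the point lies from the other depths and from the original line. It does not settle whether the point is influential, since that requires refitting the line with and without the point using the full data set, which is not reproduced in this transcript. Variable names are illustrative assumptions.

```python
# Reported values for the original drilling data
INTERCEPT, SLOPE = 5.53, 0.0116      # Time = 5.53 + 0.0116 Depth
MEAN_DEPTH, SD_DEPTH = 126.2, 52.2   # feet

# The additional observation
new_depth, new_time = 300, 12.49

# Horizontal position: how many standard deviations from the mean depth?
depth_z = (new_depth - MEAN_DEPTH) / SD_DEPTH

# Vertical position: residual relative to the original regression line
predicted = INTERCEPT + SLOPE * new_depth
residual = new_time - predicted

print(f"depth is {depth_z:.2f} standard deviations above the mean depth")
print(f"predicted time {predicted:.2f} min, residual {residual:+.2f} min")
```

The point sits far to the right of the other depths and well above the original line, which is the combination that tends to pull a least-squares line toward a single observation.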
[Figure: least-squares fits with and without the
influential observation]
As with outliers, influential observations should
be removed only if there is justification to do
so. When an influential observation occurs in a
data set and its removal is not warranted, there
are two courses of action: (1) collect more data
so that additional points near the influential
observation are obtained, or (2) use techniques
that reduce the influence of the influential
observation, such as a transformation or a
different method of estimation (e.g., minimizing
absolute deviations).
Section 4.4: Nonlinear Regression Transformations
EXAMPLE: Using the Definition of a Logarithm
Rewrite the logarithmic expression as an
equivalent expression involving an exponent.
Rewrite the exponential expression as an
equivalent logarithmic expression.
(a) $\log_3 15 = a$
(b) $4^5 = z$
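Both parts rest on the standard definition of a logarithm, which is assumed here because the defining slide did not survive in the transcript: $y = \log_a x$ means $x = a^y$. Under that definition, (a) becomes $3^a = 15$ and (b) becomes $\log_4 z = 5$. The short sketch below checks both rewrites numerically.

```python
import math

# (a) log_3(15) = a  is equivalent to  3**a = 15
a = math.log(15, 3)
print(f"a = {a:.4f}, 3**a = {3 ** a:.4f}")

# (b) 4**5 = z  is equivalent to  log_4(z) = 5
z = 4 ** 5
print(f"z = {z}, log_4(z) = {math.log(z, 4):.4f}")
```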
In the following properties, M, N, and a are
positive real numbers, with $a \ne 1$, and r is any
real number.
$\log_a (MN) = \log_a M + \log_a N$
$\log_a M^r = r \log_a M$
EXAMPLE: Simplifying Logarithms
Write the following logarithms as the sum of
logarithms. Express exponents as factors.
(a) $\log_2 x^4$
(b) $\log_5(a^4 b)$
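One way to work the two parts, applying the product and power properties of logarithms, is sketched below. The slides do not show the worked solution, so this is offered as a check rather than as the original answer key.

```latex
\begin{align*}
\text{(a)}\quad \log_2 x^4 &= 4 \log_2 x \\
\text{(b)}\quad \log_5 (a^4 b) &= \log_5 a^4 + \log_5 b = 4 \log_5 a + \log_5 b
\end{align*}
```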
If $a = 10$ in the expression $y = \log_a x$, the
resulting logarithm, $y = \log_{10} x$, is called the
common logarithm. It is common practice to omit
the base, a, when it is equal to 10 and write the
common logarithm as $y = \log x$.
EXAMPLE: Evaluating Exponential and
Logarithmic Expressions
Evaluate the following expressions. Round your
answers to three decimal places.
(a) $\log 23$
(b) $10^{2.6}$
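Both expressions can be evaluated with Python's standard `math` module in place of a calculator. This is a small sketch; the variable names are illustrative.

```python
import math

log_23 = round(math.log10(23), 3)   # (a) common logarithm of 23
ten_to_2_6 = round(10 ** 2.6, 3)    # (b) 10 raised to the power 2.6

print(f"(a) log 23 = {log_23}")
print(f"(b) 10^2.6 = {ten_to_2_6}")
```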
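Exponential Model:
$y = ab^x$
Take the common logarithm of both sides:
$\log y = \log (ab^x)$
$\log y = \log a + \log b^x$
$\log y = \log a + x \log b$
$Y = A + Bx$
where $Y = \log y$, $A = \log a$, and $B = \log b$,
so that $b = 10^B$ and $a = 10^A$.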
EXAMPLE 4: Finding the Curve of Best Fit to an
Exponential Model
A chemist has a 1000-gram sample of a radioactive
material. She records the amount of radioactive
material remaining in the sample every day for a
week and obtains the following data.
Day    Weight (in grams)
0      1000.0
1      897.1
2      802.5
3      719.8
4      651.1
5      583.4
6      521.7
7      468.3
(a) Draw a scatter diagram of the data treating
the day, x, as the predictor variable.
(b) Determine $Y = \log y$ and draw a scatter diagram
treating the day, x, as the predictor variable
and $Y = \log y$ as the response variable.
Comment on the shape of the scatter diagram.
(c) Find the least-squares regression line of the
transformed data.
(d) Determine the exponential equation of best fit
and graph it on the scatter diagram obtained in
part (a).
(e) Use the exponential equation of best fit to
predict the amount of radioactive material that
is left after 8 days.
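Parts (c) through (e) can be carried out numerically as follows. This is a sketch, not the worked solution from the slides, which did not survive in the transcript. `least_squares` and `predict_grams` are illustrative names, and the fit is done by hand so that no statistics package is assumed.

```python
import math

# Radioactive-decay measurements from the example: day and weight in grams
days = [0, 1, 2, 3, 4, 5, 6, 7]
grams = [1000.0, 897.1, 802.5, 719.8, 651.1, 583.4, 521.7, 468.3]

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# (b)/(c) Transform the response, Y = log y, and fit Y = A + B x
log_grams = [math.log10(g) for g in grams]
A, B = least_squares(days, log_grams)

# (d) Back-transform to the exponential model y = a * b**x
a = 10 ** A
b = 10 ** B

def predict_grams(day):
    """Predicted weight (grams) remaining on the given day."""
    return a * b ** day

print(f"log y = {A:.4f} + ({B:.4f}) x")
print(f"y = {a:.1f} * ({b:.4f})^x")
# (e) Prediction one day beyond the observed week
print(f"Predicted weight after 8 days: {predict_grams(8):.1f} grams")
```

The fitted base is a little under 0.9, in line with the data, where each day's weight is roughly 90% of the previous day's.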
Power Model:
$y = ax^b$
Take the common logarithm of both sides:
$\log y = \log (ax^b)$
$\log y = \log a + \log x^b$
$\log y = \log a + b \log x$
$Y = A + bX$
where $Y = \log y$, $X = \log x$, and $A = \log a$,
so that $a = 10^A$.
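To see the transformation at work, the sketch below generates points that lie exactly on a known power curve and checks that fitting a line to $(\log x, \log y)$ recovers the constants. The points are synthetic and noise-free, generated from an assumed curve $y = 2x^{-2}$; they are not measurements from any example in this section. `least_squares` is an illustrative helper.

```python
import math

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Synthetic points on the assumed curve y = 2 * x**(-2); NOT measured data
true_a, true_b = 2.0, -2.0
x_values = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
y_values = [true_a * x ** true_b for x in x_values]

# Transform both variables: X = log x, Y = log y
X = [math.log10(x) for x in x_values]
Y = [math.log10(y) for y in y_values]

# Fit Y = A + b X, then back-transform the intercept: a = 10**A
A, b = least_squares(X, Y)
a = 10 ** A

print(f"recovered a = {a:.4f}, b = {b:.4f}")
```

With real measurements the points would scatter around the line, so the recovered constants would be estimates and not exact values.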
EXAMPLE: Finding the Curve of Best Fit to a
Power Model
Cathy wishes to measure the relation between a
light bulb's intensity and the distance from the
light source. She measures a 40-watt light
bulb's intensity 1 meter from the bulb and at
0.1-meter intervals up to 2 meters from the bulb
and obtains the following data.
(a) Draw a scatter diagram of the data treating
the distance, x, as the predictor variable.
(b) Determine $X = \log x$ and $Y = \log y$ and draw a
scatter diagram treating $X = \log x$ as
the predictor variable and $Y = \log y$ as the
response variable. Comment on the shape of the
scatter diagram.
(c) Find the least-squares regression line of the
transformed data.
(d) Determine the power equation of best fit and
graph it on the scatter diagram obtained in part
(a).
(e) Use the power equation of best fit to
predict the intensity of the light if you stand
2.3 meters away from the bulb.
Modeling is not only a science but also an art
form. Selecting an appropriate model requires
experience and skill in the field in which you
are modeling. For example, knowledge of
economics is imperative when trying to determine
a model to predict unemployment. The main reason
for this is that there are theories in the field
that can help the modeler to select appropriate
relations and variables.