Chapter 4 Describing Relationships Between Variables


Provided by: karl252

1
Chapter 4 Describing Relationships Between
Variables

4.1 Fitting least squares lines
4.2 Fitting Curves and Surfaces

2
4.1 Fitting Least Squares Lines: Abstract

Least squares line
How to find the least squares line
Interpretation
Prediction
Extrapolation
Linear fit
Correlation: strength and direction
Coefficient of determination
Residual plot: check for random scatter
Normal probability plot of residuals: check for a straight line
Regression cautions

3
4.1 Fitting a Least Squares Line

Describe a relationship between two variables x and y.
We will find the best linear fit of y versus x: y ≈ β0 + β1 x.
β0 and β1 are unknown parameters.
Goal: find estimates b0 and b1 for the parameters β0 and β1.

4
Example 4.1

Eight batches of plastic are made and from each
batch one test item is molded and its hardness y
is measured at time x. The following are the 8
measurements

5
Example 4.1

Scatterplot: is a linear relationship appropriate?
How do we find an equation for this line?

By looking at this scatterplot, we see that there appears to be a strong, positive, linear relationship between x and y.
6
Least Squares Principle

We will fit a line given by ŷ = b0 + b1 x, where b0 and b1 are estimates for the parameters β0 and β1.
Note that a straight line will not pass perfectly through every one of our data points.
Thus, if we plug a data value xi into the equation ŷ = b0 + b1 x, the value we get for ŷi will not be exactly our data value yi.

[Figure: the data value yi and the fitted value b0 + b1 xi at the point xi]
7
Least Squares Principle

We need to minimize the squared distances between the actual data values, yi, and the values given by our equation, ŷi = b0 + b1 xi.
Thus, we wish to minimize
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

8
Least Squares Principle

How do we find estimates for β0 and β1?
Use calculus.
Plugging ŷi = b0 + b1 xi into the sum of squared distances yields
$$\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$
How to minimize:
Take partial derivatives with respect to b0 and b1.
Set the derivatives equal to zero.
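
The calculus step can be written out explicitly. The block below is a sketch of the two partial derivatives of the sum of squared distances; the label S(b_0, b_1) is introduced here for convenience and does not appear in the slides.

```latex
S(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2

\frac{\partial S}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0

\frac{\partial S}{\partial b_1} = -2 \sum_{i=1}^{n} x_i \, (y_i - b_0 - b_1 x_i) = 0
```

Dividing each equation by -2 and moving the terms in b_0 and b_1 to one side gives two linear equations in the two unknowns.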

9
Normal Equations

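Taking partial derivatives with respect to b0 and b1 and setting them equal to 0 yields what are known as the Normal Equations:
$$n b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
$$b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$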

10
Least Squares Estimates

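Solving these equations (details omitted) for b0 and b1 yields the following:
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$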
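
As a rough illustration, the least squares estimates can be computed directly from the deviation-from-mean form of the solution (slope = sum of cross-deviations over sum of squared x deviations, intercept = mean of y minus slope times mean of x). The function name `fit_least_squares_line` and the data are made up for this sketch; they are not the hardness measurements from Example 4.1.

```python
def fit_least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_least_squares_line(xs, ys)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

The two `ValueError` checks are a design choice of this sketch: with fewer than two points, or with no spread in x, no unique line exists.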

11
Example 4.2

Continued from example 4.1
Find the least squares estimates:

b0 = 153.060
b1 = 2.433
12
Interpretation

b1: for every 1-unit increase in the x variable, the y variable changes, on average, by the value of b1 (an increase if b1 is positive, a decrease if it is negative).
Only true for a linear model.
b0: on average, the value of y when x is equal to 0.
Not always meaningful.
Example: GPA vs. ACT score, b0 = -5.7

13
Example 4.3

Continued from example 4.1
Eight batches of plastic are made and from each
batch one test item is molded and its hardness y
is measured at time x.
b1 = 2.433 means that for every 1-unit increase in time, the hardness increases, on average, by 2.433.
b0 = 153.060 means that when no time has passed (x = 0), the hardness is, on average, 153.060.

14
Prediction

We can predict y with the least squares line.
Simply insert a value of x into the least squares
equation to obtain a predicted value of y.
What is the predicted hardness for time x = 24?
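
The prediction amounts to plugging x = 24 into the fitted line with the estimates reported in Example 4.2 (b0 = 153.060, b1 = 2.433). A minimal sketch; the function name `predict_hardness` is invented for illustration.

```python
B0 = 153.060  # intercept estimate reported in Example 4.2
B1 = 2.433    # slope estimate reported in Example 4.2


def predict_hardness(time_x):
    """Predicted hardness y-hat = b0 + b1 * x for a given time x."""
    return B0 + B1 * time_x


y_hat = predict_hardness(24)
print(round(y_hat, 3))  # 211.452
```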

15
Extrapolation

Extrapolation is when a value of x beyond the range of our actual x observations is used to find a predicted ŷ.
Predicted values should not be used when extrapolating beyond the data set.
Why? Because we do not know the behavior of y beyond the range of our x values.
Example: What is the predicted hardness for time x = 110?
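
One way to guard against accidental extrapolation is to carry the observed x range along with the fitted line and warn when a prediction falls outside it. The sketch below uses the estimates from Example 4.2; the function name and the range bounds (16 to 80) are assumptions made for illustration, since the observed x values are not listed in this transcript.

```python
import warnings


def predict_with_range_check(x, b0, b1, x_min, x_max):
    """Return b0 + b1 * x, warning if x lies outside [x_min, x_max]."""
    if not (x_min <= x <= x_max):
        warnings.warn(
            f"x = {x} is outside the observed range [{x_min}, {x_max}]; "
            "this prediction is an extrapolation",
            stacklevel=2,
        )
    return b0 + b1 * x


# b0 and b1 come from Example 4.2; x_min and x_max are placeholder bounds.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    y_hat = predict_with_range_check(110, 153.060, 2.433, x_min=16, x_max=80)

print(round(y_hat, 2), "warnings raised:", len(caught))
```

The line still returns a number for x = 110; the warning is a reminder that nothing in the data supports trusting it.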

16
Linear Fit

We have a fitted line, but does it fit well?
To check the fit
Correlation
Coefficient of Determination
Residual Plots

17
Correlation

The correlation r quantifies the strength of the linear relationship between y and x.
r will always lie between -1 and 1.
r close to 0 indicates a weak linear relationship.
r close to either -1 or 1 indicates a strong linear relationship.
The sign of r indicates whether the relationship is positive or negative.
So a positive value of r tells us that y is increasing linearly in x, and a negative value of r tells us that y is decreasing linearly in x.
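
As a sketch of how r behaves, the sample correlation can be computed from the same deviation sums used for the least squares line. The function name `sample_correlation` and all data below are invented for illustration.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation r between paired observations xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data sets.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = sample_correlation(xs, [2.0, 4.0, 6.0, 8.0, 10.0])    # exact increasing line
r_down = sample_correlation(xs, [10.0, 8.0, 6.0, 4.0, 2.0])  # exact decreasing line
r_noisy = sample_correlation(xs, [2.1, 3.9, 6.2, 7.8, 10.1])  # nearly linear
print(r_up, r_down, round(r_noisy, 4))
```

Points lying exactly on an increasing line give r = 1, and on a decreasing line r = -1; scatter around the line pulls r toward 0.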

18
Coefficient of Determination

Coefficient of determination (R^2): the fraction of raw variation in y accounted for by the fitted equation.
Can be used to quantify the fit of other types of relationships (not just linear).
The value of R^2 will always lie between 0 and 1.
Values closer to 0 indicate a poor fit of our model.
Values closer to 1 indicate a good fit of our model.
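
A minimal sketch of the calculation, reading "fraction of raw variation accounted for" as (total variation minus leftover variation) divided by total variation. The function name and data are hypothetical; the fitted values use the least squares line for these made-up points, which works out to ŷ = 0.05 + 1.99 x.

```python
def coefficient_of_determination(ys, fitted):
    """Fraction of the raw variation in ys accounted for by the fitted values."""
    y_bar = sum(ys) / len(ys)
    ss_total = sum((y - y_bar) ** 2 for y in ys)               # raw variation in y
    ss_resid = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # variation left over
    return (ss_total - ss_resid) / ss_total


# Hypothetical data and its least squares line (b0 = 0.05, b1 = 1.99).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [0.05 + 1.99 * x for x in xs]
r_squared = coefficient_of_determination(ys, fitted)
print(round(r_squared, 4))
```

Because the function only needs observed and fitted values, the same calculation applies to curves and surfaces as well as lines.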

19
Example 4.6

Continued from example 4.1
From r we can tell that there is a strong, positive, linear relationship (the linear model fits well).
From R^2 we can tell that our model fits well.
R^2 = r^2 only with a linear model.

20
Residuals

We hope that the fitted values, ŷi, will look like our data, yi, except for small fluctuations explainable only as random variation.
To assess this, we look at what are called residuals:
$$e_i = y_i - \hat{y}_i$$

21
Residuals

When we are fitting a least squares line, we are minimizing the sum of squared residuals.
These residuals should be patternless (randomly scattered), as indicated by a cloud of points scattered above and below 0 in plots of
the residuals against x
the residuals against the fitted values ŷ
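
To make the residuals concrete, the sketch below fits a line to made-up data and lists each residual (observed y minus fitted y). The function names and data are hypothetical. For a least squares line with an intercept the residuals sum to zero, so the points in a residual plot necessarily fall both above and below 0; what matters is whether they do so without a pattern.

```python
def fit_line(xs, ys):
    """Least squares intercept b0 and slope b1 (deviation-from-mean formulas)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def residuals(xs, ys, b0, b1):
    """Residual y_i - (b0 + b1 * x_i) for each observation."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(xs, ys)
res = residuals(xs, ys, b0, b1)
for x, e in zip(xs, res):
    print(f"x = {x:.1f}  residual = {e:+.3f}")
```

Plotting `res` against `xs`, or against the fitted values, gives the residual plots described above.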

22
Residuals

To use residuals to check the fit, we need to
check their pattern.
We now look at some different residual plots.
First, look at a plot that shows what we want to see from residual plots, namely patternless scatter.
Then explore some problems/patterns that may be
identified through residual plots.

23
Residual Plot 1
[Figure panels: Actual Data, Residual Plot]
The residuals are randomly scattered around 0. Thus, the residual plot shows a good fit (the linear model is appropriate).
24
Residual Plot 2
[Figure panels: Actual Data, Residual Plot]
The residual plot shows a distinct curved pattern. Thus, a linear model is not appropriate. The data are probably better described with a quadratic model.
25
Residual Plot 3
[Figure panels: Actual Data, Residual Plot]
The residual plot shows a cone-shaped pattern: there is more spread for larger fitted values. The researcher may want to investigate the data collection process.
26
Residual Plot 4

Residuals vs. the time order of the observation
As time increases the residuals increase.
This pattern suggests that some variable changing
in time is acting on y and has not been accounted
for in fitting the model.
After seeing a residual plot with this pattern,
the researcher may want to inspect the process
from which the data was obtained.
Example: instrumental drift could produce a pattern like this.

[Figure: Ordered Residual Plot]
27
Normal Prob. Plot for Residuals

If we really have random variation, we expect the residuals to be
centered at zero,
scattered evenly above and below zero,
mostly close to zero, with fewer residuals appearing as we move further from zero.
A histogram of the residuals should look roughly bell-shaped.

28
Normal Prob. Plot for Residuals (continued)

A normal probability plot can be used for checking whether or not a set of residuals comes from a bell-shaped distribution.
A systematic departure from a straight line, such as a curve or an S-shape, means that the residuals are not bell-shaped (for example, skewed or heavy-tailed).
A straight line, by contrast, indicates a bell shape.
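
The points of a normal probability plot can be sketched as follows: sort the residuals and pair the i-th smallest with the standard normal quantile at (i - 0.5) / n. That plotting position is one common convention and is an assumption here, since the slides do not state which one they use; the function name and residual values are also invented.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each sorted residual with its standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting position (i - 0.5) / n for the i-th smallest residual (assumed).
    quantiles = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


# Hypothetical residuals, for illustration only.
points = normal_plot_points([0.3, -0.1, 0.0, 0.2, -0.4])
for q, e in points:
    print(f"normal quantile = {q:+.3f}  residual = {e:+.2f}")
```

If the residuals are roughly bell-shaped, plotting the residual against the quantile for each pair gives points close to a straight line.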

29
Example 4.7

Continued from example 4.1

30
Example 4.7
[Figure: Residual Plot]

The residual plot shows random scatter around 0.
The normal probability plot follows a straight line.
Conclusion: the linear model fits well.

31
Linear Regression Cautions

r measures only linear relationships. There could be a very good nonlinear model but a small r.
Correlation does not imply causation.
An example from Wikipedia: since the 1950s, both the atmospheric CO2 level and crime levels have increased sharply. Thus, we would expect a large correlation between crime and CO2 levels. However, we would not assume that atmospheric CO2 causes crime.
Both R^2 and r can be drastically affected by a few unusual data points.
Example on page 137
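
The sensitivity of r to unusual points can be seen in a small experiment. The data below are invented (they are not the textbook example referenced above): five points on an exact line, then the same five points plus one unusual observation.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation r between paired observations xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: five points on the line y = x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.0, 3.0, 4.0, 5.0]
r_clean = sample_correlation(xs, ys)

# The same data with one unusual point, (10, 0), appended.
r_with_outlier = sample_correlation(xs + [10.0], ys + [0.0])

print(round(r_clean, 3), round(r_with_outlier, 3))
```

In this made-up case a single point is enough to turn a perfect positive correlation into a weak negative one, which is why a scatterplot should be inspected alongside r and R^2.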

32
Summary of 4.1

Least squares line
How to find the least squares line
Interpretation
Prediction
Extrapolation
Linear fit
Correlation: strength and direction
Coefficient of determination
Residual plot: check for random scatter
Normal probability plot of residuals: check for a straight line
Regression cautions

33
4.2 Fitting Curves and Surfaces: Abstract

Curve fitting
Surface fitting
Interpretation given
More on model fitting

34
4.2 Fitting Curves and Surfaces

Use least squares.
Computation and interpretation become more complicated.
Curve fitting:
A natural generalization of the linear equation is the polynomial equation
$$\hat{y} = b_0 + b_1 x + b_2 x^2 + \cdots + b_k x^k$$
Computation of the estimates b0, b1, ..., bk is done by computer.
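
As a rough idea of what the computer does, the sketch below fits a polynomial by building the normal equations for the powers of x and solving them with Gaussian elimination. This is one straightforward approach chosen for illustration, not necessarily what any particular statistics package does; the function names and data are made up.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; the fit is not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least squares coefficients [b0, b1, ..., bk] for a degree-k polynomial."""
    size = degree + 1
    # Normal equations: sums of x^(j+k) on the left, sums of x^j * y on the right.
    a = [[sum(x ** (j + k) for x in xs) for k in range(size)] for j in range(size)]
    b = [sum((x ** j) * y for x, y in zip(xs, ys)) for j in range(size)]
    return solve_linear_system(a, b)


# Hypothetical data lying exactly on y = 1 + x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 5.0, 10.0, 17.0]
coeffs = fit_polynomial(xs, ys, 2)
print([round(c, 6) for c in coeffs])
```

For high degrees or widely spread x values the normal equations become numerically delicate, which is one reason established software is preferred in practice.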

35
Surface Fitting

In surface fitting we have more than one predictor variable (x's) for our response (y).
Again, computation of the estimates b0, b1, b2, ... is done by computer.
Example: we want to predict brick strength (y) given a level of temperature (x1) and humidity (x2).
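
A surface with two predictors can be fitted with the same least squares idea: build the normal equations from the columns 1, x1 and x2, then solve for b0, b1 and b2. The sketch below is an illustration under that approach; the function names and the numbers are invented and are not real brick strength measurements.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; the fit is not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_surface(x1s, x2s, ys):
    """Least squares [b0, b1, b2] for y-hat = b0 + b1 * x1 + b2 * x2."""
    columns = [[1.0] * len(ys), list(x1s), list(x2s)]
    # Normal equations: inner products of the columns with each other and with y.
    a = [[sum(u * v for u, v in zip(cj, ck)) for ck in columns] for cj in columns]
    b = [sum(u * y for u, y in zip(cj, ys)) for cj in columns]
    return solve_linear_system(a, b)


# Hypothetical data lying exactly on y = 10 + 2 * x1 - 3 * x2.
x1s = [0.0, 1.0, 0.0, 1.0, 2.0]   # stands in for temperature
x2s = [0.0, 0.0, 1.0, 1.0, 1.0]   # stands in for humidity
ys = [10.0, 12.0, 7.0, 9.0, 11.0]
b0, b1, b2 = fit_surface(x1s, x2s, ys)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```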

36
Interpretation

Given ŷ = b0 + b1 x1 + b2 x2, the interpretation is as follows:
b0 represents, on average, the value of y when x1 = 0 and x2 = 0.
b1 represents, on average, the increase/decrease in y for every one-unit increase in x1, holding x2 constant.
b2 represents, on average, the increase/decrease in y for every one-unit increase in x2, holding x1 constant.
Note: these statements are general.
You will need to interpret the estimates within the context of the problem.

37
Residual Plots

Computed the same way as before
Normal probability plot of residuals
Residual plot against x
Residual plot against fitted values
Use computer due to computational intensity

38
More on Model Fit

It is often wise to check multiple forms of model
fit.
Each assessment may only be painting half the
picture
Most common combination:
R^2
Residual plot

39
Example 4.8

Trying to predict stopping distance (ft) given
the current speed (mph).

[Figure: Distance vs. Speed]
40
Example 4.8
[Figure: Residual Plot]

Although the data seemed linear, and R^2 was extremely high, the residual plot shows a distinct curved pattern.
Thus, the fit could be improved upon.
Use a quadratic model instead of a linear one.

41
Example 4.9

Predicting win percentage based on rebounds/game
for NBA teams.
Residual plot: there is random scatter around 0.
Linear model seems to fit well.

42
Example 4.9

Although the residual plot indicates a good fit, R^2 = 0.2014, which is very low.
From the scatterplot, we notice that the data are somewhat linear, but only a very weak relationship exists (thus the low R^2).

43
Summary of 4.2