Assessing the Fit of Regression Models

1
Assessing the Fit of Regression Models

Engineering Experimental Design
Valerie L. Young

2
In today's lecture . . .

Coefficient of Determination (R²)
Correlation Coefficient (R)
Residuals and Residual Plots
Confidence Limits on Adjustable Parameters
ANOVA
Communicating your results

3
Regression: A set of statistical tools that can . . .

define a mathematical relationship between
factors and a response (a model).
NOT proof of any physical relationship (though
ideally terms in the model have physical
significance)
quantify the significance of each factor's
correlation with the response.
estimate values for the constants in a model.
indicate how well a particular model fits the
data.

4
Models

Every model consists of two parts
Terms that describe the predictable way in which
the value of the response varies with changes in
the values of the factor(s)
The random variation in the response due to
random variations in the measurement technique or
the system being measured

5
Assessing Model Fit

Statistical techniques for assessing how well a
model fits your data are based on
Quantifying the fraction of the total variation
in the response that is accounted for by the
predictable terms in the model
Assessing whether the leftover, non-predictable
variation is random (i.e., residuals normally
distributed around zero)
YOU must assess physical validity

6
Coefficient of Determination (R²)

R² = fraction of total variation in response that
is explained by the predictable part of the
model
0 ≤ R² ≤ 1
R² is not sufficient to validate a model. You
must demonstrate that the leftover variation is
randomly distributed around zero.
Can calculate R² for any type of model (linear,
nonlinear)
Refer to a statistics text for equations
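
As a rough illustration of the definition above, R² can be computed as one minus the ratio of leftover (residual) variation to total variation. This is a minimal sketch; the function name and the data are made up for illustration and are not from the lecture.

```python
def r_squared(measured, predicted):
    """Fraction of total variation in the response explained by the model."""
    mean_y = sum(measured) / len(measured)
    # Total variation of the response about its mean
    ss_total = sum((y - mean_y) ** 2 for y in measured)
    # Leftover variation not captured by the model
    ss_resid = sum((y - f) ** 2 for y, f in zip(measured, predicted))
    return 1.0 - ss_resid / ss_total

# Hypothetical measurements and two hypothetical sets of model predictions
measured = [2.0, 4.0, 6.0, 8.0]
perfect_model = [2.0, 4.0, 6.0, 8.0]    # reproduces every point
mean_only_model = [5.0, 5.0, 5.0, 5.0]  # predicts the mean everywhere

r2_perfect = r_squared(measured, perfect_model)
r2_mean_only = r_squared(measured, mean_only_model)
print(r2_perfect, r2_mean_only)
```

A model that reproduces every point gives R² = 1, and a model that only predicts the mean gives R² = 0.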

7
Correlation Coefficient (R)

Mathematically, R = ±sqrt(R²)
Conceptually, R means nothing except for simple
linear models
Sign on R is same as sign on slope
R < 0: Negative correlation between y and x
R > 0: Positive correlation between y and x
The closer R is to ±1, the closer the data are
to a straight line.
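
The relationship between the sign of R and the sign of the slope can be checked numerically. The sketch below computes the Pearson correlation coefficient and the least-squares slope by hand; the function names and data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient R for a simple linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ls_slope(x, y):
    """Least-squares slope of y versus x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data: one exact rising line, one noisy falling trend
x = [1.0, 2.0, 3.0, 4.0]
y_line = [3.0, 5.0, 7.0, 9.0]
y_falling = [8.0, 6.0, 4.0, 2.5]

r_line = pearson_r(x, y_line)
r_falling = pearson_r(x, y_falling)
slope_falling = ls_slope(x, y_falling)
print(r_line, r_falling, slope_falling)
```

The exact line gives R = 1, while the falling data give a negative R together with a negative slope.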

8
Residuals

A residual is . . . the error for a given
data point. The difference between a measured
value of the response and the value the model
predicts.

9
Residuals

A residual is . . . the error for a given data
point. The difference between a measured value
of the response and the value the model predicts.
If the predictable part of the model is
well-chosen, the residuals will include only
random error.
Residuals will be randomly distributed around
zero.
One way to tell is with a residual plot.
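
To make the definition concrete, the sketch below fits a straight line by least squares and lists the residuals (measured minus predicted). The data and function names are hypothetical. In a real analysis you would plot these residuals against the factor or the predicted value.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope*x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical measurements scattered about a straight line
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
# Residual = measured value minus the value the model predicts
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
for xi, ri in zip(x, residuals):
    print(f"x = {xi:.0f}  residual = {ri:+.2f}")
```

For these numbers the residuals change sign from point to point and sum to zero, which is what scatter around zero looks like in a small data set.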

10
Test Three Models

PAI = β1·xAl + β0,
where β1 and β0 are constants, xAl is the mass
fraction of aluminum, and PAI is the phosphate
adsorption index.
PAI = β1·xAl + β2·xFe + β0,
where β1, β2 and β0 are constants, xAl is the mass
fraction of aluminum, xFe is the mass fraction of
iron, and PAI is the phosphate adsorption
index.
PAI = β1·xAl + β2·xAl² + β0,
where β1, β2 and β0 are constants, xAl is the mass
fraction of aluminum, and PAI is the phosphate
adsorption index.

11
PAI = β1·xAl + β0
Uncertainty from 95% confidence limits.

β1 = 0.23 ± 0.07 g soil / mg Al
β0 = -11 ± 13
R² = 0.825

I edited the plot that is automatically generated
by Excel to make the labels more meaningful.
Further editing is required if you want to
include one of these plots in a report.
Good. Looks pretty random around 0.
12
PAI = β1·xAl + β0

β1 = 0.23 ± 0.07 g soil / mg Al
β0 = -11 ± 13
R² = 0.825

What does this mean?
13
PAI = β1·xAl + β0

β1 = 0.23 ± 0.07 g soil / mg Al
β0 = -11 ± 13
R² = 0.825

82.5% of the observed variation in PAI can be
explained by this model.
14
PAI = β1·xAl + β2·xFe + β0
I'm 95% sure that if I collected an infinite
number of data points, the values of the
coefficients would be inside these ranges.

β1 = 0.11 ± 0.07 g soil / mg Al
β2 = 0.35 ± 0.16 g soil / mg Fe
β0 = -7 ± 8
R² = 0.948
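
For readers who want to see where such ± ranges come from, here is a minimal sketch for the simpler one-factor case (a straight line), using the textbook standard-error formulas and Student's t. The data are hypothetical, the function name is made up, and the small table of t values is copied from a standard t-table rather than computed. This is not necessarily how Excel reports its limits.

```python
import math

# Two-sided 95% Student's t critical values from a standard t-table,
# keyed by residual degrees of freedom (n - 2 for a straight-line fit).
T_95 = {3: 3.182, 4: 2.776, 5: 2.571, 10: 2.228, 11: 2.201}

def line_fit_with_limits(x, y):
    """Return (slope, slope_half_width, intercept, intercept_half_width)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    ss_resid = sum((yi - (b1 * xi + b0)) ** 2 for xi, yi in zip(x, y))
    s2 = ss_resid / (n - 2)  # residual variance
    se_b1 = math.sqrt(s2 / sxx)
    se_b0 = math.sqrt(s2 * (1.0 / n + mean_x ** 2 / sxx))
    t = T_95[n - 2]  # only the tabulated degrees of freedom are supported
    return b1, t * se_b1, b0, t * se_b0

# Hypothetical data, not the PAI data from the lecture
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b1, b1_hw, b0, b0_hw = line_fit_with_limits(x, y)
print(f"slope = {b1:.2f} ± {b1_hw:.2f}, intercept = {b0:.2f} ± {b0_hw:.2f}")
```

With these made-up numbers the slope range excludes zero while the intercept range includes it, the same pattern seen for the PAI coefficients and intercept.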

15
PAI = β1·xAl + β2·xFe + β0

β1 = 0.11 ± 0.07 g soil / mg Al
β2 = 0.35 ± 0.16 g soil / mg Fe
β0 = -7 ± 8
R² = 0.948

Adding the xFe term accounts for a greater
fraction of the observed variation in PAI.
16
PAI = β1·xAl + β2·xFe + β0

β1 = 0.11 ± 0.07 g soil / mg Al
β2 = 0.35 ± 0.16 g soil / mg Fe
β0 = -7 ± 8
R² = 0.948

Adding the xFe term reduces the range of the
residuals.
17
PAI = β1·xAl + β2·xFe + β0
Both of these coefficients are significant at the
5% significance level. (You are 95% sure they
are different from zero.)

β1 = 0.11 ± 0.07 g soil / mg Al
β2 = 0.35 ± 0.16 g soil / mg Fe
β0 = -7 ± 8
R² = 0.948
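
The significance check described here amounts to asking whether the 95% confidence range excludes zero. The sketch below applies that test to the estimates and half-widths quoted on the slides; the helper name and dictionary labels are made up for illustration.

```python
def excludes_zero(estimate, half_width):
    """True if the range estimate ± half_width does not contain zero."""
    return abs(estimate) > half_width

# Estimates and 95% half-widths as quoted on the slides
two_factor_model = {
    "xAl coefficient": (0.11, 0.07),
    "xFe coefficient": (0.35, 0.16),
    "intercept": (-7.0, 8.0),
}
quadratic_model = {
    "xAl coefficient": (0.2, 0.3),
    "xAl^2 coefficient": (2e-5, 80e-5),
    "intercept": (-10.0, 30.0),
}

two_factor_flags = {k: excludes_zero(*v) for k, v in two_factor_model.items()}
quadratic_flags = {k: excludes_zero(*v) for k, v in quadratic_model.items()}
print(two_factor_flags)
print(quadratic_flags)
```

Both coefficients in the xAl + xFe model pass the test, while every parameter in the xAl + xAl² model has a range that includes zero.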

18
PAI = β1·xAl + β2·xAl² + β0
To use this model in Excel, let xAl be one
independent variable and xAl² be another, then do
multiple linear regression.

β1 = 0.2 ± 0.3 g soil / mg Al
β2 = (2 ± 80) × 10⁻⁵ g² soil / mg² Al
β0 = -10 ± 30
R² = 0.825

Although this model explains 82.5% of the
variation in PAI, NONE of the adjustable
parameters are significantly different from zero.
This is a common result when you have included
an unnecessary factor. Remove the least-likely
factor (xAl2 in this case) and redo the
regression.
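
The trick of treating x and x² as two separate independent variables works outside Excel as well. Below is a minimal sketch that solves the least-squares normal equations in plain Python. The function names are made up, and the data are hypothetical, generated from y = 1 + 2x + 3x² with no noise, so the fit should recover those constants.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z

def multiple_linear_regression(columns, y):
    """Fit y = b0 + b1*col1 + b2*col2 + ... and return [b0, b1, b2, ...]."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical noise-free data from y = 1 + 2x + 3x^2
x = [0.0, 1.0, 2.0, 3.0, 4.0]
x_squared = [v ** 2 for v in x]  # second "independent variable"
y = [1.0 + 2.0 * v + 3.0 * v ** 2 for v in x]

b0, b1, b2 = multiple_linear_regression([x, x_squared], y)
print(b0, b1, b2)
```

The model is nonlinear in x but still linear in its adjustable parameters, which is why ordinary multiple linear regression applies.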
19
ANOVA for PAI = β1·xAl + β2·xFe + β0

ANOVA = Analysis of Variance
More on ANOVA later in the course

20
ANOVA for PAI = β1·xAl + β2·xFe + β0
Variation in PAI explained by model
Probability of getting a value of F that big (or
bigger) from 13 observations if PAI is not
correlated with xAl or xFe.
Variation in PAI not explained by model

ANOVA = Analysis of Variance
More on ANOVA later in the course

21
ANOVA for PAI = β1·xAl + β0
This model also explains a significant fraction
of the variation in PAI.
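
The F value in an ANOVA table is the explained variation per model degree of freedom divided by the unexplained variation per residual degree of freedom. For a model with an intercept it can be written in terms of R², which allows a rough check using the R² values and the 13 observations quoted in these slides. The function name is made up, and because the R² values are rounded the results are approximate.

```python
def f_statistic(r2, n_obs, n_factors):
    """Overall F for a regression with an intercept, computed from R²."""
    df_model = n_factors              # one degree of freedom per factor term
    df_resid = n_obs - n_factors - 1  # what is left after fitting the intercept too
    return (r2 / df_model) / ((1.0 - r2) / df_resid)

# R² values and observation count as quoted on the slides
f_two_factor = f_statistic(0.948, 13, 2)  # model with xAl and xFe
f_one_factor = f_statistic(0.825, 13, 1)  # model with xAl only
print(round(f_two_factor, 1), round(f_one_factor, 1))
```

Turning F into the probability shown in the ANOVA table requires the F distribution, which is left to the regression software.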
22
How might you write about this model comparison?

Least-squares regression using 13 observations
supports a simple linear dependence of PAI
(phosphate adsorption index) on xAl (extractable
aluminum mass fraction) and xFe (extractable iron
mass fraction). The model
PAI = β1·xAl + β0 (1)
gives R² = 0.825 and a residual plot with points
randomly distributed around zero. Adding an xAl²
term to equation (1) to account for any nonlinear
dependence does not improve the model: R² does
not increase, and values for the adjustable
parameters are not significantly different from
zero. The ability to predict PAI is improved by
adding a linear dependence on xFe to the model.
The resulting equation is
PAI = β1·xAl + β2·xFe + β0 (2)
where β1 = 0.11 ± 0.07 g soil / mg Al, β2 = 0.35 ±
0.16 g soil / mg Fe, and β0 = -7 ± 8.
Uncertainties span the 95% confidence limits on
the adjustable parameters. Equation (2) gives R² =
0.948. The residual plot shows points randomly
distributed around zero, indicating that the
predictable behavior of PAI has been described
and only random error remains. ANOVA confirms
the significance of the model (significance level
< 1 × 10⁻⁶). More testing of low-mineral-content
soils is recommended to try to narrow the
confidence limits on β0.

23
Another version

Least-squares regression using 13 observations
supports a simple linear dependence of PAI
(phosphate adsorption index) on xAl (extractable
aluminum mass fraction) and xFe (extractable iron
mass fraction). The model proposed is
PAI = β1·xAl + β2·xFe + β0 (1)
where β1 = 0.11 ± 0.07 g soil / mg Al, β2 = 0.35 ±
0.16 g soil / mg Fe, and β0 = -7 ± 8.
Uncertainties span the 95% confidence limits on
the adjustable parameters. Equation (1) gives R² =
0.948. The residual plot shows points randomly
distributed around zero, indicating that the
predictable behavior of PAI has been described
and only random error remains. ANOVA confirms the
significance of the model (significance level <
1 × 10⁻⁶). More testing of low-mineral-content
soils is recommended to try to narrow the
confidence limits on β0. Note that measuring
either xAl or xFe alone may allow a reasonable
prediction of PAI for some applications. For
example, the model
PAI = β1·xAl + β0 (2)
gives R² = 0.825 and a residual plot with points
randomly distributed around zero. No
higher-order terms are required in the model.
For example, adding an xAl² term to equation (2)
to account for any nonlinear dependence results
in no improvement to R², and values for the
adjustable parameters are not significantly
different from zero.

24
Cells = β1·m + β0
25
Cells = β1·m + β0
Stats look okay. Constants significant. 80.9%
of variation in Cells explained.

β1 = -22 ± 9 cells / g inhibitor
β0 = 160 ± 50 cells
R² = 0.809

26
Cells = β1·m + β0
Stats look okay. Constants significant. 80.9%
of variation in Cells explained.

β1 = -22 ± 9 cells / g inhibitor
β0 = 160 ± 50 cells
R² = 0.809

Yuck! Obvious pattern in residuals. (Looks like
a parabola.) Cells is not a linear function of
m.
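
The same situation is easy to reproduce with made-up numbers. The sketch below fits a straight line to hypothetical data that actually follow y = x²: R² comes out high, yet the residuals trace out a parabola. None of these values are the cell-count data from the slide.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope*x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical curved data: y = x^2
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [xi ** 2 for xi in x]

slope, intercept = fit_line(x, y)
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

mean_y = sum(y) / len(y)
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_resid = sum(r ** 2 for r in residuals)
r2 = 1.0 - ss_resid / ss_total

print(f"R^2 = {r2:.3f}")
print("residuals:", [round(r, 6) for r in residuals])
```

The residuals are positive at both ends and negative in the middle, a systematic pattern that R² alone does not reveal.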
27
Cells = β1·m + β0 ?!?!
Of course, just looking at this plot, we should
have known not to use a linear function. It is
always valuable to look at a plot before you dive
into regression, and to check the plot with the
regression line on it at the end.
28
(I = β2·t² + β1·t + β0) or (I = β0·e^(-β1·t))?
R² = 0.960, but polynomial model performs very
poorly for t > 6 days.
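
A made-up example shows why a polynomial can fit well inside the measured range and still fail beyond it. The sketch below assumes, purely for illustration, that the true behavior is I = 100·e^(-0.5·t), samples it at t = 0 to 6, fits a quadratic by least squares, and then evaluates both at t = 10. The decay constants and function names are hypothetical, not taken from the lecture data.

```python
import math

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z

def fit_quadratic(t, y):
    """Least-squares constants (c0, c1, c2) for y = c0 + c1*t + c2*t^2."""
    rows = [[1.0, ti, ti ** 2] for ti in t]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)

def exponential(t):
    """Assumed 'true' behavior for this illustration only."""
    return 100.0 * math.exp(-0.5 * t)

t_data = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_data = [exponential(ti) for ti in t_data]
c0, c1, c2 = fit_quadratic(t_data, y_data)

def quadratic(t):
    return c0 + c1 * t + c2 * t ** 2

for t in (3.0, 10.0):
    print(f"t = {t:4.1f}  exponential = {exponential(t):7.2f}  quadratic = {quadratic(t):7.2f}")
```

Inside the fitted range the two curves stay close, but the quadratic turns upward and badly overpredicts at t = 10, where the exponential has decayed almost to zero.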
29
Stuff to Remember

Plot the data before you start regression to make
sure you pick a reasonable model.
Use more than just R² to evaluate model quality.
Plot the data and the model together to make sure
the model is satisfactory over the whole region of
interest.
