Title: Assessing the Fit of Regression Models
1 Assessing the Fit of Regression Models
- Engineering Experimental Design
- Valerie L. Young
2 In today's lecture . . .
- Coefficient of Determination (R²)
- Correlation Coefficient (R)
- Residuals and Residual Plots
- Confidence Limits on Adjustable Parameters
- ANOVA
- Communicating your results
3 Regression: a set of statistical tools that can . . .
- define a mathematical relationship between factors and a response (a model).
  - NOT proof of any physical relationship (though ideally terms in the model have physical significance)
- quantify the significance of each factor's correlation with the response.
- estimate values for the constants in a model.
- indicate how well a particular model fits the data.
4 Models
- Every model consists of two parts:
  - Terms that describe the predictable way in which the value of the response varies with changes in the values of the factor(s)
  - The random variation in the response due to random variations in the measurement technique or the system being measured
5 Assessing Model Fit
- Statistical techniques for assessing how well a model fits your data are based on:
  - Quantifying the fraction of the total variation in the response that is accounted for by the predictable terms in the model
  - Assessing whether the leftover, non-predictable variation is random (i.e., residuals normally distributed around zero)
- YOU must assess physical validity.
6 Coefficient of Determination (R²)
- R² = fraction of total variation in response that is explained by the predictable part of the model
- 0 ≤ R² ≤ 1
- R² is not sufficient to validate a model. You must demonstrate that the leftover variation is randomly distributed around zero.
- Can calculate R² for any type of model (linear, nonlinear)
- Refer to a statistics text for equations
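As a rough illustration of the definition above, R² can be computed from measured and predicted responses as 1 minus the ratio of residual variation to total variation. This is a minimal sketch, not the lecture's own calculation: the function name `r_squared`, the data, and the model y = 2x are all made up for the example.

```python
def r_squared(y_measured, y_predicted):
    """Return 1 - SS_residual / SS_total for any model's predictions."""
    y_mean = sum(y_measured) / len(y_measured)
    # Total variation of the response around its mean.
    ss_total = sum((y - y_mean) ** 2 for y in y_measured)
    # Variation left over after the model's predictions are subtracted.
    ss_residual = sum((y - yp) ** 2 for y, yp in zip(y_measured, y_predicted))
    return 1.0 - ss_residual / ss_total


# Hypothetical measurements (not from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Hypothetical model y = 2x. Predictions from a nonlinear model would be
# scored in exactly the same way.
y_model = [2.0 * xi for xi in x]

r2 = r_squared(y, y_model)
print(f"R^2 = {r2:.4f}")
```

Because the function only needs predictions, it applies to any type of model. For a least-squares fit that includes a constant term, the result falls between 0 and 1.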
7 Correlation Coefficient (R)
- Mathematically, R = ±sqrt(R²)
- Conceptually, R means nothing except for simple linear models
  - Sign on R is same as sign on slope
  - R < 0: negative correlation between y and x
  - R > 0: positive correlation between y and x
- The closer |R| is to 1, the closer the data are to a straight line.
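The relationships on this slide can be checked numerically for a simple linear model. The sketch below uses made-up data with a downward trend; the helper names `fit_line` and `correlation_coefficient` are illustrative, and R is computed with the usual Pearson formula.

```python
import math


def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def correlation_coefficient(x, y):
    """Pearson correlation coefficient R between x and y."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with a downward trend (not from the lecture).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [9.8, 8.1, 6.2, 3.9, 2.0]

slope, intercept = fit_line(x, y)
r = correlation_coefficient(x, y)

# R^2 computed independently from the residuals of the fitted line.
y_mean = sum(y) / len(y)
ss_total = sum((yi - y_mean) ** 2 for yi in y)
ss_residual = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
r2 = 1.0 - ss_residual / ss_total

print(f"slope = {slope:.3f}, R = {r:.4f}, R^2 = {r2:.4f}, R*R = {r * r:.4f}")
```

For these data the slope and R are both negative, and R squared matches the R² obtained from the residuals.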
8 Residuals
- A residual is . . . the error for a given data point: the difference between a measured value of the response and the value the model predicts.
9 Residuals
- If the predictable part of the model is well-chosen, the residuals will include only random error.
  - Residuals will be randomly distributed around zero.
  - One way to tell is with a residual plot.
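A residual is simply measured minus predicted, so the numbers behind a residual plot are easy to generate. The sketch below fits a straight line to made-up data and prints a crude text version of a residual plot. The data and helper names are assumptions for illustration; in practice you would plot the residuals in Excel or a plotting package.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def residuals(x, y, slope, intercept):
    """Measured response minus the value the model predicts, per data point."""
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


# Hypothetical data (not from the lecture).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]

slope, intercept = fit_line(x, y)
res = residuals(x, y, slope, intercept)

# Text stand-in for a residual plot: "|" marks zero, "*" marks the residual.
scale = max(abs(r) for r in res)
for xi, ri in zip(x, res):
    offset = round(10 * ri / scale)  # position between -10 and +10
    row = [" "] * 21
    row[10] = "|"
    row[10 + offset] = "*"
    print(f"x = {xi:4.1f}  residual = {ri:+.3f}  {''.join(row)}")
```

For a least-squares line with an intercept the residuals always sum to zero, so the thing to look for is whether they scatter on both sides of zero without any trend in x.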
10 Test Three Models
- PAI = α xAl + β,
  - where α and β are constants, xAl is the mass fraction of aluminum, and PAI is the phosphate adsorption index.
- PAI = α xAl + β xFe + γ,
  - where α, β and γ are constants, xAl is the mass fraction of aluminum, xFe is the mass fraction of iron, and PAI is the phosphate adsorption index.
- PAI = α xAl + β xAl² + γ,
  - where α, β and γ are constants, xAl is the mass fraction of aluminum, and PAI is the phosphate adsorption index.
11 PAI = α xAl + β
Uncertainty from 95% confidence limits.
- α = 0.23 ± 0.07 g soil / mg Al
- β = -11 ± 13
- R² = 0.825
I edited the plot that is automatically generated by Excel to make the labels more meaningful. Further editing is required if you want to include one of these plots in a report.
Residual plot: good. Looks pretty random around 0.
12 PAI = α xAl + β
- R² = 0.825
What does this mean?
13 PAI = α xAl + β
R² = 0.825: 82.5% of the observed variation in PAI can be explained by this model.
14 PAI = α xAl + β xFe + γ
I'm 95% sure that if I collected an infinite number of data points, the values of the coefficients would be inside these ranges.
- α = 0.11 ± 0.07 g soil / mg Al
- β = 0.35 ± 0.16 g soil / mg Fe
- γ = -7 ± 8
- R² = 0.948
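The ± values quoted on these slides come from Excel's regression output. To show where such numbers come from, here is a minimal sketch for the simpler one-factor case (y = slope·x + intercept), using the standard textbook formulas for the standard errors and a short table of two-sided 95% Student-t values. The data, the function name `line_fit_with_limits`, and the table `T_95` are assumptions for the example. It does not reproduce the lecture's PAI numbers, because the underlying data are not given.

```python
import math

# Two-sided 95% Student-t critical values, keyed by degrees of freedom.
T_95 = {3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365,
        8: 2.306, 9: 2.262, 10: 2.228, 11: 2.201}


def line_fit_with_limits(x, y):
    """Return slope, its 95% half-width, intercept, its 95% half-width."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    # Residual variance with n - 2 degrees of freedom (two fitted constants).
    dof = n - 2
    ss_residual = sum((yi - (slope * xi + intercept)) ** 2
                      for xi, yi in zip(x, y))
    s2 = ss_residual / dof

    se_slope = math.sqrt(s2 / sxx)
    se_intercept = math.sqrt(s2 * (1.0 / n + x_mean ** 2 / sxx))
    t = T_95[dof]
    return slope, t * se_slope, intercept, t * se_intercept


# Hypothetical data (not from the lecture).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]

slope, slope_hw, intercept, intercept_hw = line_fit_with_limits(x, y)
print(f"slope     = {slope:.2f} +/- {slope_hw:.2f}")
print(f"intercept = {intercept:.2f} +/- {intercept_hw:.2f}")

# A constant is significant at the 5% level when its interval excludes zero.
print("slope significant:", abs(slope) > slope_hw)
```

With 13 observations and a two-constant model, as in the PAI example, the table would be read at 11 degrees of freedom.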
15 PAI = α xAl + β xFe + γ
- R² = 0.948
Adding the xFe term accounts for a greater fraction of the observed variation in PAI (R² rises from 0.825 to 0.948).
16 PAI = α xAl + β xFe + γ
Adding the xFe term reduces the range of the residuals.
17 PAI = α xAl + β xFe + γ
- α = 0.11 ± 0.07 g soil / mg Al
- β = 0.35 ± 0.16 g soil / mg Fe
Both of these coefficients are significant at the 5% significance level. (You are 95% sure they are different from zero.)
18 PAI = α xAl + β xAl² + γ
To use this model in Excel, let xAl be one independent variable and xAl² be another, then do multiple linear regression.
- α = 0.2 ± 0.3 g soil / mg Al
- β = (2 ± 80) × 10⁻⁵ g² soil / mg² Al
- γ = -10 ± 30
- R² = 0.825
Although this model explains 82.5% of the variation in PAI, NONE of the adjustable parameters are significantly different from zero. This is a common result when you have included an unnecessary factor. Remove the least-likely factor (xAl² in this case) and redo the regression.
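The trick described for Excel (treat x and x² as two separate independent variables) works the same way in code. The sketch below is one possible pure-Python version that solves the least-squares normal equations by Gaussian elimination. The function names and the data are assumptions for illustration, and the data are generated from a known quadratic so the recovered constants can be checked.

```python
def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef


def multiple_linear_regression(columns, y):
    """Least-squares coefficients for each column, followed by the intercept."""
    # Each row holds one observation's variables plus a 1 for the constant.
    rows = [list(values) + [1.0] for values in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated from y = 2x + 0.5x^2 + 3 (not from the lecture).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * xi + 0.5 * xi ** 2 + 3.0 for xi in x]

# x is one independent variable and x^2 is another.
x_squared = [xi ** 2 for xi in x]
alpha, beta, gamma = multiple_linear_regression([x, x_squared], y)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, gamma = {gamma:.3f}")
```

The model is still "linear" in the regression sense because it is linear in the constants, even though it is quadratic in x. This sketch only estimates the constants; deciding whether the x² term is needed still requires its confidence limits.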
19 ANOVA for PAI = α xAl + β xFe + γ
- ANOVA = Analysis of Variance
- More on ANOVA later in the course
20 ANOVA for PAI = α xAl + β xFe + γ
Annotations on the ANOVA table:
- Variation in PAI explained by model
- Variation in PAI not explained by model
- Probability of getting a value of F that big (or bigger) from 13 observations if PAI is not correlated with xAl or xFe.
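The quantities annotated on the ANOVA table can be sketched in a few lines: split the total variation into a part explained by the model and a part left in the residuals, divide each by its degrees of freedom, and take the ratio to get F. The example below does this for a one-factor straight-line fit to made-up data. The function names are illustrative, and the probability attached to F is not computed here because it needs the F distribution, which Excel reports directly.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def anova(y_measured, y_predicted, n_factors):
    """ANOVA quantities for a least-squares model that includes a constant."""
    n = len(y_measured)
    y_mean = sum(y_measured) / n
    ss_total = sum((y - y_mean) ** 2 for y in y_measured)
    ss_residual = sum((y - yp) ** 2 for y, yp in zip(y_measured, y_predicted))
    ss_model = ss_total - ss_residual  # variation explained by the model
    df_model = n_factors
    df_residual = n - n_factors - 1
    ms_model = ss_model / df_model
    ms_residual = ss_residual / df_residual
    return {
        "ss_model": ss_model, "df_model": df_model,
        "ss_residual": ss_residual, "df_residual": df_residual,
        "ss_total": ss_total, "F": ms_model / ms_residual,
    }


# Hypothetical data (not from the lecture).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]

slope, intercept = fit_line(x, y)
table = anova(y, [slope * xi + intercept for xi in x], n_factors=1)
for name, value in table.items():
    print(f"{name:12s} {value:10.4f}")
```

A large F means the model explains far more variation per degree of freedom than is left in the residuals. For the two-factor PAI model with 13 observations, the degrees of freedom would be 2 for the model and 10 for the residuals.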
21 ANOVA for PAI = α xAl + β
This model also explains a significant fraction
of the variation in PAI.
22 How might you write about this model comparison?
Least-squares regression using 13 observations supports a simple linear dependence of PAI (phosphate adsorption index) on xAl (extractable aluminum mass fraction) and xFe (extractable iron mass fraction). The model
    PAI = α xAl + β    (1)
gives R² = 0.825 and a residual plot with points randomly distributed around zero. Adding an xAl² term to equation (1) to account for any nonlinear dependence does not improve the model: R² does not increase, and values for the adjustable parameters are not significantly different from zero. The ability to predict PAI is improved by adding a linear dependence on xFe to the model. The resulting equation is
    PAI = α xAl + β xFe + γ    (2)
where α = 0.11 ± 0.07 g soil / mg Al, β = 0.35 ± 0.16 g soil / mg Fe, and γ = -7 ± 8. Uncertainties span the 95% confidence limits on the adjustable parameters. Equation (2) gives R² = 0.948. The residual plot shows points randomly distributed around zero, indicating that the predictable behavior of PAI has been described and only random error remains. ANOVA confirms the significance of the model (significance level < 1×10⁻⁶). More testing of low-mineral-content soils is recommended to try to narrow the confidence limits on γ.
23 Another version
Least-squares regression using 13 observations supports a simple linear dependence of PAI (phosphate adsorption index) on xAl (extractable aluminum mass fraction) and xFe (extractable iron mass fraction). The model proposed is
    PAI = α xAl + β xFe + γ    (1)
where α = 0.11 ± 0.07 g soil / mg Al, β = 0.35 ± 0.16 g soil / mg Fe, and γ = -7 ± 8. Uncertainties span the 95% confidence limits on the adjustable parameters. Equation (1) gives R² = 0.948. The residual plot shows points randomly distributed around zero, indicating that the predictable behavior of PAI has been described and only random error remains. ANOVA confirms the significance of the model (significance level < 1×10⁻⁶). More testing of low-mineral-content soils is recommended to try to narrow the confidence limits on γ. Note that measuring either xAl or xFe alone may allow a reasonable prediction of PAI for some applications. For example, the model
    PAI = α xAl + β    (2)
gives R² = 0.825 and a residual plot with points randomly distributed around zero. No higher-order terms are required in the model. For example, adding an xAl² term to equation (2) to account for any nonlinear dependence results in no improvement to R², and in values for the adjustable parameters that are not significantly different from zero.
24 Cells = α m + β
25 Cells = α m + β
Stats look okay. Constants significant. 80.9% of variation in Cells explained.
- α = -22 ± 9 cells / g inhibitor
- β = 160 ± 50 cells
- R² = 0.809
26 Cells = α m + β
- R² = 0.809
Yuck! Obvious pattern in residuals. (Looks like a parabola.) Cells is not a linear function of m.
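The same trap is easy to reproduce with made-up numbers. In the sketch below, a curved (quadratic) response is fitted with a straight line: R² comes out above 0.9, yet the residuals are positive at both ends and negative in the middle, which is the parabola-shaped pattern described above. The data and names are hypothetical and are not the lecture's cell counts.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Hypothetical curved response: cells = (6 - m)^2 for m = 0..6.
m = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cells = [(6.0 - mi) ** 2 for mi in m]

slope, intercept = fit_line(m, cells)
predicted = [slope * mi + intercept for mi in m]
res = [ci - pi for ci, pi in zip(cells, predicted)]

cells_mean = sum(cells) / len(cells)
ss_total = sum((ci - cells_mean) ** 2 for ci in cells)
ss_residual = sum(ri ** 2 for ri in res)
r2 = 1.0 - ss_residual / ss_total

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r2:.3f}")
for mi, ri in zip(m, res):
    print(f"m = {mi:.0f}  residual = {ri:+.2f}")
```

R² alone would have passed this model. The sign pattern of the residuals is what exposes the wrong functional form.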
27 Cells = α m + β ?!?!
Of course, just looking at this plot, we should
have known not to use a linear function. It is
always valuable to look at a plot before you dive
into regression, and to check the plot with the
regression line on it at the end.
28 (I = α t² + β t + γ) or (I = α e^(-β t))?
R² = 0.960, but the polynomial model performs very poorly for t > 6 days.
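To see why a polynomial can score well on R² and still fail outside the fitted range, the sketch below generates synthetic data from an exponential decay over 0 to 6 days, fits a quadratic by least squares, and then evaluates both at t = 10 days. The decay constants, function names, and data are assumptions for illustration, not the lecture's data set.

```python
import math


def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef


def fit_quadratic(t, y):
    """Least-squares a, b, c for y = a*t^2 + b*t + c (normal equations)."""
    rows = [[ti ** 2, ti, 1.0] for ti in t]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Synthetic data from a hypothetical decay I = 100 * exp(-0.5 t), t in days.
t = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
intensity = [100.0 * math.exp(-0.5 * ti) for ti in t]

a, b, c = fit_quadratic(t, intensity)
predicted = [a * ti ** 2 + b * ti + c for ti in t]

mean_i = sum(intensity) / len(intensity)
ss_total = sum((yi - mean_i) ** 2 for yi in intensity)
ss_residual = sum((yi - pi) ** 2 for yi, pi in zip(intensity, predicted))
r2 = 1.0 - ss_residual / ss_total

# Extrapolate beyond the fitted range.
t_new = 10.0
poly_at_10 = a * t_new ** 2 + b * t_new + c
true_at_10 = 100.0 * math.exp(-0.5 * t_new)

print(f"quadratic R^2 on 0-6 days = {r2:.3f}")
print(f"at t = 10 days: quadratic = {poly_at_10:.1f}, exponential = {true_at_10:.2f}")
```

Inside the fitted range the quadratic tracks the data closely, but the parabola turns upward after its minimum, so its prediction at 10 days is far above the decaying curve it was fitted to.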
29 Stuff to Remember
- Plot the data before you start regression to make sure you pick a reasonable model.
- Use more than just R² to evaluate model quality.
- Plot the data and the model together to make sure the model is satisfactory over the whole region of interest.