Assessing the Fit of Regression Models

1
Assessing the Fit of Regression Models
  • Engineering Experimental Design
  • Valerie L. Young

2
In today's lecture . . .
  • Coefficient of Determination (R²)
  • Correlation Coefficient (R)
  • Residuals and Residual Plots
  • Confidence Limits on Adjustable Parameters
  • ANOVA
  • Communicating your results

3
Regression: A set of statistical tools that can . . .
  • define a mathematical relationship between
    factors and a response (a model).
  • NOT proof of any physical relationship (though
    ideally terms in the model have physical
    significance)
  • quantify the significance of each factor's
    correlation with the response.
  • estimate values for the constants in a model.
  • indicate how well a particular model fits the
    data.

4
Models
  • Every model consists of two parts:
  • Terms that describe the predictable way in which
    the value of the response varies with changes in
    the values of the factor(s)
  • The random variation in the response due to
    random variations in the measurement technique or
    the system being measured

5
Assessing Model Fit
  • Statistical techniques for assessing how well a
    model fits your data are based on:
  • Quantifying the fraction of the total variation
    in the response that is accounted for by the
    predictable terms in the model
  • Assessing whether the leftover, non-predictable
    variation is random (i.e., residuals normally
    distributed around zero)
  • YOU must assess physical validity

6
Coefficient of Determination (R²)
  • R² = fraction of total variation in response that
    is explained by the predictable part of the
    model
  • 0 ≤ R² ≤ 1
  • R² is not sufficient to validate a model. You
    must demonstrate that the leftover variation is
    randomly distributed around zero.
  • Can calculate R² for any type of model (linear,
    nonlinear)
  • Refer to a statistics text for equations
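For readers who want the equation in executable form, here is a minimal sketch of the usual definition, R² = 1 - (residual sum of squares) / (total sum of squares). The function name and the data are invented for illustration and do not come from the lecture. The 0 ≤ R² ≤ 1 bound applies to a least-squares fit that includes a constant term.

```python
def r_squared(observed, predicted):
    """Return R^2 = 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    # Total variation of the response around its mean.
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    # Variation left over after the model's predictions are subtracted.
    ss_residual = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    return 1.0 - ss_residual / ss_total


# Hypothetical measurements and model predictions (illustration only).
y_measured = [2.0, 4.1, 5.9, 8.2, 9.8]
y_model = [2.0, 4.0, 6.0, 8.0, 10.0]
print(f"R^2 = {r_squared(y_measured, y_model):.4f}")
```

The same function works for linear and nonlinear models, because it only needs the measured values and the model's predictions.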

7
Correlation Coefficient (R)
  • Mathematically, R = ±sqrt(R²)
  • Conceptually, R means nothing except for simple
    linear models
  • Sign on R is same as sign on slope
  • R < 0: Negative correlation between y and x
  • R > 0: Positive correlation between y and x
  • The closer R is to ±1, the closer the data are
    to a straight line.
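A short sketch of the point about signs, using the standard sums-of-squares formulas for a simple linear model. The helper names and the data are assumptions made for this example. R and the least-squares slope share the numerator Sxy, which is why they always have the same sign.

```python
import math


def centered_sums(x, y):
    """Return (Sxy, Sxx, Syy), the sums of centered products and squares."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy, sxx, syy


def correlation_coefficient(x, y):
    """R for a simple linear model y = slope * x + intercept."""
    sxy, sxx, syy = centered_sums(x, y)
    return sxy / math.sqrt(sxx * syy)


def least_squares_slope(x, y):
    sxy, sxx, _ = centered_sums(x, y)
    return sxy / sxx


# Hypothetical data with a downward trend (illustration only).
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
y_values = [9.8, 8.1, 6.2, 3.9, 2.0]
r = correlation_coefficient(x_values, y_values)
b = least_squares_slope(x_values, y_values)
print(f"R = {r:.3f}, slope = {b:.3f}")
```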

8
Residuals
  • A residual is . . . the error for a given data
    point. The difference between a measured value
    of the response and the value the model predicts.

9
Residuals
  • A residual is . . . the error for a given data
    point. The difference between a measured value
    of the response and the value the model predicts.
  • If the predictable part of the model is
    well-chosen, the residuals will include only
    random error.
  • Residuals will be randomly distributed around
    zero.
  • One way to tell is with a residual plot.
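As a rough illustration of the definition, the sketch below fits a straight line by least squares and lists the residuals (measured minus predicted). The function names and the data are hypothetical. Plotting these residuals against x would give the residual plot described above.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def line_residuals(x, y, slope, intercept):
    """Residual = measured value minus the value the model predicts."""
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


# Hypothetical, roughly linear data (illustration only).
x_data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_data = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
slope, intercept = fit_line(x_data, y_data)
res = line_residuals(x_data, y_data, slope, intercept)
for xi, ri in zip(x_data, res):
    print(f"x = {xi:.0f}  residual = {ri:+.3f}")
```

For a least-squares fit with a constant term the residuals always sum to zero, so the thing to look for is a pattern, not the average.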

10
Test Three Models
  • PAI = α xAl + β,
  • where α and β are constants, xAl is the mass
    fraction of aluminum, and PAI is the phosphate
    adsorption index.
  • PAI = α xAl + β xFe + γ,
  • where α, β, and γ are constants, xAl is the mass
    fraction of aluminum, xFe is the mass fraction of
    iron, and PAI is the phosphate adsorption
    index.
  • PAI = α xAl + β xAl² + γ,
  • where α, β, and γ are constants, xAl is the mass
    fraction of aluminum, and PAI is the phosphate
    adsorption index.

11
PAI = α xAl + β
Uncertainty from 95% confidence limits.
  • α = 0.23 ± 0.07 g soil / mg Al
  • β = -11 ± 13
  • R² = 0.825

I edited the plot that is automatically generated
by Excel to make the labels more meaningful.
Further editing is required if you want to
include one of these plots in a report.
Good. Looks pretty random around 0.
12
PAI = α xAl + β
  • α = 0.23 ± 0.07 g soil / mg Al
  • β = -11 ± 13
  • R² = 0.825

What does this mean?
13
PAI = α xAl + β
  • α = 0.23 ± 0.07 g soil / mg Al
  • β = -11 ± 13
  • R² = 0.825

82.5% of the observed variation in PAI can be
explained by this model.
14
PAI = α xAl + β xFe + γ
I'm 95% sure that if I collected an infinite
number of data points, the values of the
coefficients would be inside these ranges.
  • α = 0.11 ± 0.07 g soil / mg Al
  • β = 0.35 ± 0.16 g soil / mg Fe
  • γ = -7 ± 8
  • R² = 0.948
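To show where a "value ± uncertainty" of this kind can come from, here is a hedged sketch for the simpler one-factor case (a straight line), not the two-factor model on this slide. It uses the textbook standard error of the slope and a Student's t critical value that the caller must supply. The data are invented, and 2.776 is the two-sided 95% t value for 4 degrees of freedom (6 points minus 2 fitted constants).

```python
import math


def slope_confidence_interval(x, y, t_critical):
    """Return (slope, half_width) so that the limits are slope ± half_width."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_residual = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    # Residual variance uses n - 2 degrees of freedom (two fitted constants).
    standard_error = math.sqrt(ss_residual / (n - 2) / sxx)
    return slope, t_critical * standard_error


# Hypothetical data (illustration only).
x_data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_data = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
T_95_DOF_4 = 2.776  # two-sided 95% Student's t value for 4 degrees of freedom
slope, half_width = slope_confidence_interval(x_data, y_data, T_95_DOF_4)
print(f"slope = {slope:.2f} ± {half_width:.2f}")
# If the interval excludes zero, the slope is significant at the 5% level.
print("significant:", abs(slope) > half_width)
```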

15
PAI = α xAl + β xFe + γ
  • α = 0.11 ± 0.07 g soil / mg Al
  • β = 0.35 ± 0.16 g soil / mg Fe
  • γ = -7 ± 8
  • R² = 0.948

Adding the xFe term accounts for a greater
fraction of the observed variation in PAI.
16
PAI = α xAl + β xFe + γ
  • α = 0.11 ± 0.07 g soil / mg Al
  • β = 0.35 ± 0.16 g soil / mg Fe
  • γ = -7 ± 8
  • R² = 0.948

Adding the xFe term reduces the range of the
residuals.
17
PAI = α xAl + β xFe + γ
Both of these coefficients are significant at the
5% significance level. (You are 95% sure they
are different from zero.)
  • α = 0.11 ± 0.07 g soil / mg Al
  • β = 0.35 ± 0.16 g soil / mg Fe
  • γ = -7 ± 8
  • R² = 0.948

18
PAI = α xAl + β xAl² + γ
To use this model in Excel, let xAl be one
independent variable and xAl² be another, then do
multiple linear regression.
  • α = 0.2 ± 0.3 g soil / mg Al
  • β = (2 ± 80) × 10⁻⁵ g² soil / mg² Al
  • γ = -10 ± 30
  • R² = 0.825

Although this model explains 82.5% of the
variation in PAI, NONE of the adjustable
parameters are significantly different from zero.
This is a common result when you have included
an unnecessary factor. Remove the least-likely
factor (xAl² in this case) and redo the
regression.
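
The trick of treating the squared term as a second independent variable can be sketched outside Excel as well. The code below is an illustration under stated assumptions, not the lecture's calculation: it builds one column per term plus a constant column and solves the least-squares normal equations by Gaussian elimination. All names and the noise-free data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a * coeffs = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (m[r][n] - known) / m[r][r]
    return coeffs


def fit_multiple_linear(columns, y):
    """Least-squares fit of y = sum(coef_i * column_i) + constant."""
    design = [[col[i] for col in columns] + [1.0] for i in range(len(y))]
    k = len(design[0])
    # Normal equations: (X^T X) coeffs = X^T y.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical factor values; the response is an exact quadratic (no noise).
x_al = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x_al_squared = [v ** 2 for v in x_al]  # the "second independent variable"
response = [2.0 * v + 0.5 * v ** 2 + 1.0 for v in x_al]
alpha, beta, gamma = fit_multiple_linear([x_al, x_al_squared], response)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, gamma = {gamma:.3f}")
```

The model is nonlinear in xAl but still linear in the constants, which is why multiple linear regression applies.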
19
ANOVA for PAI = α xAl + β xFe + γ
  • ANOVA = Analysis of Variance
  • More on ANOVA later in the course

20
ANOVA for PAI = α xAl + β xFe + γ
Variation in PAI explained by model
Probability of getting a value of F that big (or
bigger) from 13 observations if PAI is not
correlated with xAl or xFe.
Variation in PAI not explained by model
  • ANOVA = Analysis of Variance
  • More on ANOVA later in the course
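The F value in a regression ANOVA table is the ratio of explained to unexplained variation, each divided by its degrees of freedom. As a sketch, it can be recovered from R², the number of observations, and the number of factors using the standard identity below. The inputs are the rounded values quoted in these slides, so the results are approximate. The function name is an assumption, and the probability (significance) itself is not computed here because it requires the F distribution.

```python
def f_statistic_from_r2(r2, n_observations, n_factors):
    """F = (explained / n_factors) / (unexplained / (n - n_factors - 1))."""
    explained_per_dof = r2 / n_factors
    unexplained_per_dof = (1.0 - r2) / (n_observations - n_factors - 1)
    return explained_per_dof / unexplained_per_dof


# Rounded values from the slides: 13 observations in both cases.
f_two_factor = f_statistic_from_r2(0.948, 13, 2)  # model with xAl and xFe
f_one_factor = f_statistic_from_r2(0.825, 13, 1)  # model with xAl only
print(f"F (xAl and xFe) = {f_two_factor:.1f}")
print(f"F (xAl only)    = {f_one_factor:.1f}")
```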

21
ANOVA for PAI = α xAl + β
This model also explains a significant fraction
of the variation in PAI.
22
How might you write about this model comparison?
  • Least-squares regression using 13 observations
    supports a simple linear dependence of PAI
    (phosphate adsorption index) on xAl (extractable
    aluminum mass fraction) and xFe (extractable iron
    mass fraction). The model
  • PAI = α xAl + β (1)
  • gives R² = 0.825 and a residual plot with points
    randomly distributed around zero. Adding an xAl²
    term to equation (1) to account for any nonlinear
    dependence does not improve the model: R² does
    not increase, and values for the adjustable
    parameters are not significantly different from
    zero. The ability to predict PAI is improved by
    adding a linear dependence on xFe to the model.
    The resulting equation is
  • PAI = α xAl + β xFe + γ (2)
  • where α = 0.11 ± 0.07 g soil / mg Al, β = 0.35 ±
    0.16 g soil / mg Fe, and γ = -7 ± 8.
    Uncertainties span the 95% confidence limits on
    the adjustable parameters. Equation (2) gives R² =
    0.948. The residual plot shows points randomly
    distributed around zero, indicating that the
    predictable behavior of PAI has been described
    and only random error remains. ANOVA confirms
    the significance of the model (significance level
    < 1 × 10⁻⁶). More testing of low-mineral-content
    soils is recommended to try to narrow the
    confidence limits on γ.

23
Another version
  • Least-squares regression using 13 observations
    supports a simple linear dependence of PAI
    (phosphate adsorption index) on xAl (extractable
    aluminum mass fraction) and xFe (extractable iron
    mass fraction). The model proposed is
  • PAI = α xAl + β xFe + γ (1)
  • where α = 0.11 ± 0.07 g soil / mg Al, β = 0.35 ±
    0.16 g soil / mg Fe, and γ = -7 ± 8.
    Uncertainties span the 95% confidence limits on
    the adjustable parameters. Equation (1) gives R² =
    0.948. The residual plot shows points randomly
    distributed around zero, indicating that the
    predictable behavior of PAI has been described
    and only random error remains. ANOVA confirms the
    significance of the model (significance level <
    1 × 10⁻⁶). More testing of low-mineral-content
    soils is recommended to try to narrow the
    confidence limits on γ. Note that measuring
    either xAl or xFe alone may allow a reasonable
    prediction of PAI for some applications. For
    example, the model
  • PAI = α xAl + β (2)
  • gives R² = 0.825 and a residual plot with points
    randomly distributed around zero. No
    higher-order terms are required in the model.
    For example, adding an xAl² term to equation (2)
    to account for any nonlinear dependence results
    in no improvement to R², and values for the
    adjustable parameters are not significantly
    different from zero.

24
Cells = α m + β
25
Cells = α m + β
Stats look okay. Constants significant. 80.9%
of variation in Cells explained.
  • α = -22 ± 9 cells / g inhibitor
  • β = 160 ± 50 cells
  • R² = 0.809

26
Cells = α m + β
Stats look okay. Constants significant. 80.9%
of variation in Cells explained.
  • α = -22 ± 9 cells / g inhibitor
  • β = 160 ± 50 cells
  • R² = 0.809

Yuck! Obvious pattern in residuals. (Looks like
a parabola.) Cells is not a linear function of
m.
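
The same failure is easy to reproduce with made-up numbers. In this sketch (hypothetical data, not the cell-count data from the lecture) a straight line is fitted to points that actually follow a curve. R² still comes out above 0.9, yet the residuals are positive at both ends and negative in the middle, which is the parabola-shaped pattern described above.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical curved data: the response is exactly x squared.
x_data = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_data = [v ** 2 for v in x_data]

slope, intercept = fit_line(x_data, y_data)
residuals = [b - (slope * a + intercept) for a, b in zip(x_data, y_data)]

mean_y = sum(y_data) / len(y_data)
ss_total = sum((b - mean_y) ** 2 for b in y_data)
ss_residual = sum(r ** 2 for r in residuals)
r2 = 1.0 - ss_residual / ss_total

print(f"R^2 = {r2:.3f}")
print("residuals:", [round(r, 2) for r in residuals])
```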
27
Cells = α m + β ?!?!
Of course, just looking at this plot, we should
have known not to use a linear function. It is
always valuable to look at a plot before you dive
into regression, and to check the plot with the
regression line on it at the end.
28
(I = α t² + β t + γ) or (I = α e^(-β t))?
R² = 0.960, but the polynomial model performs very
poorly for t > 6 days.
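
A small sketch of why a polynomial can look fine inside the measured range and fail outside it. This is not the lecture's regression: for brevity it passes a quadratic exactly through three points sampled from an assumed exponential decay (amplitude 100, rate 0.5 per day, both invented), then compares the two curves inside and beyond the sampled range of 0 to 6 days.

```python
import math


def exponential_decay(t, amplitude=100.0, rate=0.5):
    """Assumed 'true' behavior, with hypothetical constants."""
    return amplitude * math.exp(-rate * t)


def quadratic_through(points):
    """Return the quadratic passing exactly through three (t, value) points."""
    (t0, v0), (t1, v1), (t2, v2) = points

    def evaluate(t):
        # Lagrange form of the interpolating polynomial.
        return (v0 * (t - t1) * (t - t2) / ((t0 - t1) * (t0 - t2))
                + v1 * (t - t0) * (t - t2) / ((t1 - t0) * (t1 - t2))
                + v2 * (t - t0) * (t - t1) / ((t2 - t0) * (t2 - t1)))

    return evaluate


sample_times = [0.0, 3.0, 6.0]
quadratic = quadratic_through([(t, exponential_decay(t)) for t in sample_times])

error_inside = abs(quadratic(4.0) - exponential_decay(4.0))
error_outside = abs(quadratic(10.0) - exponential_decay(10.0))
print(f"error at t = 4 days:  {error_inside:.1f}")
print(f"error at t = 10 days: {error_outside:.1f}")
```

The quadratic turns upward after its minimum while the decay keeps falling toward zero, so the disagreement grows quickly outside the sampled range.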
29
Stuff to Remember
  • Plot the data before you start regression to make
    sure you pick a reasonable model.
  • Use more than just R² to evaluate model quality.
  • Plot the data and the model together to make sure
    the model is satisfactory over the whole region of
    interest.