Title: Worked Example
Worked Example
> plot(y ~ x)
> plot(epsilon1 ~ x)
This is a plot of the residuals against the
explanatory variable, x.
> plot(epsilon1 ~ yhat)
This is a plot of the residuals against the fitted
values, yhat.
Both graphs show the same thing: the residuals
are scattered randomly, with no systematic pattern.
Note: since the fitted equation is approximately
y = x, the fitted values are close to x, so both
graphs are extremely similar in this case.
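The two residual plots can be reproduced with a small sketch. The data of the worked example are not given in the text, so the values below are simulated under the assumption that y is roughly equal to x; the names model1, epsilon1 and yhat simply follow the plot commands above.

```r
# Simulated stand-in for the worked example (assumption: y is about x).
set.seed(1)
x <- 1:20
y <- x + rnorm(20, sd = 1)

model1   <- lm(y ~ x)        # least squares fit
epsilon1 <- resid(model1)    # residuals
yhat     <- fitted(model1)   # fitted values

# Residuals against x and against the fitted values, side by side.
old_par <- par(mfrow = c(1, 2))
plot(epsilon1 ~ x, main = "Residuals vs x")
abline(h = 0, lty = 2)
plot(epsilon1 ~ yhat, main = "Residuals vs fitted")
abline(h = 0, lty = 2)
par(old_par)
```

With a single explanatory variable the fitted values are a linear function of x, so the second plot is the first with a rescaled horizontal axis.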
Model Diagnostics: Residuals and Influence
Consider again the problem of fitting the model
y_i = f(x_i) + e_i,   i = 1, ..., n.
Assume again a single continuous response variable y.
The explanatory variable x may be either a
single variable or a vector of variables. How do
we assess the quality of a given fit f?
While summary statistics are helpful, they are
not sufficient. Good diagnostics are typically
based on case analysis, i.e. an examination of
each observation in turn in relation to the
fitting procedure. This leads to an examination
of residuals and influence.
Residuals
The residuals should be thought of as what is
left of the values of the response variable after
the fit has been subtracted. Ideally they should
show no further dependence (especially no further
location dependence) on x.
In general this should be investigated
graphically by plotting residuals against the
explanatory variable(s) x. For linear models, we
frequently compromise by plotting residuals
against fitted values.
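When x is a vector of variables, a single plot against the fitted values stands in for one plot per explanatory variable. A minimal sketch, using R's built-in mtcars data purely as an illustration (the choice of dataset and of the variables wt and hp is an assumption, not part of the original example):

```r
# Two explanatory variables: one residual plot against the fitted values.
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)

plot(fitted(fit), resid(fit),
     xlab = "fitted values", ylab = "residuals",
     main = "Residuals vs fitted values")
abline(h = 0, lty = 2)
```

For a least squares fit with an intercept, the residuals are uncorrelated with the fitted values by construction, so any visible trend or change of spread in this plot points to a problem with the model rather than to an artefact of the fit.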
In particular the residuals provide information
about:
- whether the best relation has been fitted
- the relative merits of different fits
- mild, but non-random, departures from the
  hypothesised fit
- the magnitude of the residual variation
- the identification of outliers
- possible further dependence on x, other than
  through location, of the conditional distribution
  of y given x, in particular heterogeneity of
  spread of the residuals.
Example: Anscombe's Artificial Data
The R data frame anscombe is made available by
> data(anscombe)
This contains 4 artificial
datasets, each of 11 observations of a continuous
response variable y and a continuous explanatory
variable x. Each dataset is now plotted along with
the least squares linear model fitted to it.
All the usual summary statistics related to the
classical analyses of the fitted models are
identical (to about two decimal places) across
the 4 datasets. This includes
the coefficients a and b and their standard
errors and confidence intervals, together with
the residual standard errors and correlation
coefficients.
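The agreement of the summary statistics can be checked directly. The following is a sketch only; the table layout and the names fits and summary_table are illustrative choices, not part of the original slides.

```r
# Fit y_k ~ x_k for each of the four Anscombe datasets and tabulate
# the usual summary statistics.
data(anscombe)

fits <- lapply(1:4, function(k) {
  lm(reformulate(paste0("x", k), response = paste0("y", k)), data = anscombe)
})

summary_table <- t(sapply(fits, function(fit) {
  s <- summary(fit)
  c(a        = unname(coef(fit)[1]),              # intercept
    b        = unname(coef(fit)[2]),              # slope
    se_a     = s$coefficients[1, "Std. Error"],
    se_b     = s$coefficients[2, "Std. Error"],
    resid_se = s$sigma,                           # residual standard error
    r        = sqrt(s$r.squared))                 # correlation (slope > 0)
}))
rownames(summary_table) <- paste("dataset", 1:4)

round(summary_table, 2)
```

Rounded to two decimal places, the four rows of the table coincide.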
Consideration of the residuals shows that very
different judgements should be made about the
appropriateness of the fitted model to each of
the 4 cases. A full discussion is given by
Weisberg (1985, pp. 107-108).
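A sketch of how the four residual plots might be drawn follows. The 2-by-2 layout and the extra quadratic fit for the second dataset are illustrative additions, not taken from the slides or from Weisberg.

```r
# Residuals against x for each of the four Anscombe fits.
data(anscombe)

old_par <- par(mfrow = c(2, 2))
for (k in 1:4) {
  x <- anscombe[[paste0("x", k)]]
  y <- anscombe[[paste0("y", k)]]
  fit <- lm(y ~ x)
  plot(x, resid(fit), main = paste("Dataset", k),
       xlab = "x", ylab = "residual")
  abline(h = 0, lty = 2)
}
par(old_par)

# The curved residual pattern in dataset 2 suggests a missing quadratic
# term; adding one gives an almost perfect fit.
quad2 <- lm(y2 ~ x2 + I(x2^2), data = anscombe)
summary(quad2)$r.squared
```

Although the summary statistics agree, the residual plots differ markedly between the datasets, which is the point of the example.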
Influence
Influence measures the extent to which a fit is
affected by individual observations. A possible
formal definition is the following: the influence
of any observation is a measure of the difference
between the fit and the fit which would be
obtained if that observation were omitted.
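This definition can be applied literally by refitting the model once per observation. The sketch below uses the first Anscombe dataset and measures "difference between the fits" as the sum of squared changes in the predictions at all n design points; that particular measure, and the name loo_shift, are assumptions made for illustration.

```r
# Leave-one-out influence for a simple linear model.
data(anscombe)
d <- data.frame(x = anscombe$x1, y = anscombe$y1)

full_fit  <- lm(y ~ x, data = d)
full_pred <- fitted(full_fit)

# For each observation i: refit without it, predict at all n points,
# and record how far the predictions move.
loo_shift <- sapply(seq_len(nrow(d)), function(i) {
  fit_i  <- lm(y ~ x, data = d[-i, ])
  pred_i <- predict(fit_i, newdata = d)
  sum((full_pred - pred_i)^2)
})

round(loo_shift, 3)
```

Observations with the largest entries are those whose removal changes the fit most.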
Obviously observations with large influences
require more careful checking. Especially for
linear models, influence is often measured by
Cook's distance.
Cook's Distance Formula
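The formula itself is not reproduced in the text. The standard textbook definition for a linear model is given here on the assumption that it matches what the slide displayed. The symbols are not defined elsewhere in the slides: p is the number of fitted parameters, s^2 the residual mean square, h_ii the leverage of observation i, r_i its standardised residual, and \hat{y}_{j(i)} the fitted value at x_j when observation i is omitted.

```latex
D_i \;=\; \frac{\sum_{j=1}^{n} \bigl(\hat{y}_j - \hat{y}_{j(i)}\bigr)^2}{p\, s^2}
    \;=\; \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}},
\qquad
r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}
```

The first form matches the definition of influence as the difference between the fit and the fit with observation i omitted; the second shows that D_i combines the size of the residual with the leverage.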
As a rule of thumb, observations for which D_i >
1 make a noticeable difference to the parameter
estimates, and should be examined carefully for
the appropriateness of their use in fitting the
model. Other things being equal, an observation
with a large residual also has a large influence.
However, an observation with an unusual value of its
explanatory variable(s) can pull a fit towards it
and have a large influence despite a small
residual.
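The last point can be illustrated with made-up data: ten points close to the line y = x, plus one point far out in x whose response lies well below that line. The data and the name diag_tab are hypothetical; resid, hatvalues and cooks.distance are standard R functions.

```r
# Made-up data: observation 11 has an unusual x value.
x <- c(1:10, 30)
y <- c(1:10 + rep(c(0.5, -0.5), 5), 20)

fit <- lm(y ~ x)

diag_tab <- data.frame(
  x        = x,
  residual = resid(fit),
  leverage = hatvalues(fit),
  cooks_d  = cooks.distance(fit)
)
print(round(diag_tab, 3))

# Observations flagged by the rule of thumb D_i > 1.
which(diag_tab$cooks_d > 1)
```

Because the fit is pulled towards the distant point, its residual is unremarkable compared with the others, yet it is the only observation flagged by the rule of thumb.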
Example: Anscombe's third data set. The last
graph produced by the plot function shows that
observation number 3 has an unusually large
value of Cook's distance, D_3 = 1.39.
The graphs are produced by
> plot(model3)
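The value of Cook's distance can also be read off numerically rather than from the graph. This sketch assumes model3 is the least squares fit of y3 on x3; the definition of model3 does not appear in the text.

```r
# Cook's distances for the third Anscombe dataset.
data(anscombe)
model3 <- lm(y3 ~ x3, data = anscombe)   # assumed definition of model3

d3 <- cooks.distance(model3)
round(d3, 2)

# Observations exceeding the rule-of-thumb threshold of 1.
which(d3 > 1)

# which = 4 asks plot.lm for the Cook's distance panel only.
plot(model3, which = 4)
```

Which panels plot(model3) shows by default depends on the R version, so requesting the Cook's distance panel explicitly with which = 4 is the safer choice.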
We now refit the model omitting this
observation.
> x5 <- x3[-3]
> y5 <- y3[-3]
> model5 <- lm(y5 ~ x5)
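The effect of omitting the observation can be seen by comparing the two fits side by side. A sketch, in which the comparison table is an illustrative addition and the variables are taken from the anscombe data frame rather than from an attached workspace:

```r
# Compare the fit to all 11 points with the fit omitting observation 3.
data(anscombe)
x5 <- anscombe$x3[-3]
y5 <- anscombe$y3[-3]

model3 <- lm(y3 ~ x3, data = anscombe)
model5 <- lm(y5 ~ x5)

comparison <- rbind(
  all_points    = unname(coef(model3)),
  without_obs_3 = unname(coef(model5))
)
colnames(comparison) <- c("intercept", "slope")
round(comparison, 3)

# Residual standard errors of the two fits.
c(all_points    = summary(model3)$sigma,
  without_obs_3 = summary(model5)$sigma)
```

The slope drops from about 0.5 to about 0.35 and the residual standard error becomes very small, because the remaining ten points lie almost exactly on a straight line.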