Title: Outline
1 Session 6

2 Outline
- Residual Analysis
  - Are they normal?
  - Do they have a common variance?
- Multicollinearity
- Autocorrelation, serial correlation
  - Runs test
  - Durbin-Watson
3 Residual Analysis
- Assumptions about regression models
- The Form of the Model
- The Residual Errors
- The Predictor Variables
- The Data
4 Regression Assumptions
- Recall the assumptions about regression models
- The Form of the Model
  - The relationship between Y and each X is assumed to be linear.
5 - The Residual Errors
  - The residuals are normally distributed.
  - The residuals have a mean of zero.
  - The residuals have the same variance.
  - The residuals are independent of each other.
6 - The Predictor Variables
  - The X variables are nonrandom (i.e. fixed or selected in advance). This assumption is rarely true in business regression analysis.
  - The data are measured without error. This assumption is rarely true in business regression analysis.
  - The X variables are linearly independent of each other (uncorrelated, or orthogonal). This assumption is rarely true in business regression analysis.
7 - The Data
  - The observations are equally reliable and have equal weights in determining the regression model.
8 Because many of these assumptions center on the
residuals, we need to spend some time studying
the residuals in our model, to assess the degree
to which these assumptions are valid.
9 Example: Anscombe's Quartet
Here are four bivariate data sets, devised by F.
J. Anscombe.
Anscombe, F. J. (1973), Graphs in Statistical
Analysis, The American Statistician, 27, 17-21.
15 - Three observations
  - These data sets are clearly different from each other.
  - The differences would not be made obvious by any descriptive statistics or summary regression statistics.
  - We need tools to identify characteristics such as those which differentiate these four data sets.
16 The differences can be detected in the different
ways that these data sets violate the basic
regression assumptions regarding residual errors.
17 Assumption: The residuals have a mean of zero.
This assumption is not likely to be a problem, because the regression procedure ensures that this will be true, unless there is a serious skewness problem.
18 Assumption: The residuals are normally distributed.
We can check this with a number of methods. We might plot a histogram of the residuals to see if they look reasonably normal. For this purpose we might want to standardize the residuals, so that their values can be compared with our expectations in terms of the standard error.
19 Standardized Residuals
In order to judge whether residuals are outliers or have an inordinate impact on the regression, they are commonly standardized. The variance of the ith residual $e_i$, perhaps surprisingly, is not $\sigma^2$, though this is in many examples a reasonable approximation.

20 The correct variance is
$\mathrm{Var}(e_i) = \sigma^2 (1 - p_{ii})$,
where $p_{ii}$ is the ith diagonal element of the hat (projection) matrix. One way to go, therefore, is to calculate the so-called standardized residual for each observation:
$z_i = e_i / \hat{\sigma}$
Alternatively, we could use the so-called studentized residuals:
$r_i = e_i / (\hat{\sigma} \sqrt{1 - p_{ii}})$
21 These are both measures of how far individual
observations are from their predicted values, and
large values of either are signals of concern.
Excel (and any other stats software package)
produces standardized residuals on command.
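The slides do this in Excel/Minitab; purely as an illustration, here is a minimal Python sketch of the same quantities using statsmodels. The simulated X and y and all variable names are assumptions for the sketch, not part of any case data.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                  # two illustrative predictors
y = 5 + X @ np.array([2.0, -1.0]) + rng.normal(size=30)

model = sm.OLS(y, sm.add_constant(X)).fit()
e = model.resid                               # raw residuals e_i
sigma_hat = np.sqrt(model.mse_resid)          # estimate of sigma
p_ii = model.get_influence().hat_matrix_diag  # leverages (hat-matrix diagonal)

standardized = e / sigma_hat                          # z_i = e_i / sigma_hat
studentized = e / (sigma_hat * np.sqrt(1 - p_ii))     # r_i = e_i / (sigma_hat * sqrt(1 - p_ii))
print(np.round(studentized, 2))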
23 Another way to assess normality is to use a normal probability plot, which graphs the distribution of residuals against what we would expect to see from a standard normal distribution. The normal score is calculated using the following procedure (a code sketch follows the list):
- Order the observations in increasing order of their residual errors.
- Calculate a quantile, which basically measures what proportion of the data lie below each observation.
- Calculate the normal score, which is a measure of where we would expect the quantiles to be if we drew a sample of this size from a perfect standard normal distribution.
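As an aside (not from the slides), the same three steps can be written as a short Python sketch. The residual values below are made up, and the (i - 0.5)/n quantile is just one common plotting-position choice.

import numpy as np
from scipy import stats

resid = np.array([0.8, -1.2, 0.3, 2.1, -0.5, -0.1, 1.4, -2.0])   # made-up residuals

sorted_resid = np.sort(resid)                 # step 1: order the residuals
n = len(resid)
quantile = (np.arange(1, n + 1) - 0.5) / n    # step 2: proportion of data below each point
normal_score = stats.norm.ppf(quantile)       # step 3: expected standard-normal quantile

# Plotting sorted_resid against normal_score gives the normal probability
# plot; an approximately straight line supports the normality assumption.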
27 Trouble!
- Excel gives us a normal probability plot for the dependent variable, not the residuals.
- We have never assumed that Y is normally distributed.
- Another reason to switch to Minitab, SAS, SPSS, etc.
29 Assumption: The residuals have the same variance.
One way to check this is to plot the actual values of Y against the predicted values. In the case of simple regression, this is a lot like plotting them against the X variable.
31 Another method is to plot the residuals against the predicted value of Y (or the actual observed value of Y, or, in simple regression, against the X variable).
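For illustration only, a matplotlib sketch of this plot, assuming a fitted statsmodels result named model as in the earlier residual sketch:

import matplotlib.pyplot as plt

plt.scatter(model.fittedvalues, model.resid)   # residuals vs predicted Y
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Y")
plt.ylabel("Residual")
# A fan or funnel shape suggests the residuals do not share a common variance.
plt.show()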
32 Collinearity
Collinearity (also called multicollinearity) is the situation in which one or more of the predictor variables are nearly a linear combination of other predictors. (The opposite condition, in which all independent variables are more or less independent, is called orthogonality.)
33 In the extreme case of exact dependence, the $X^T X$ matrix cannot be inverted and the regression procedure will fail. In less extreme cases, we suffer from several possible problems:
- The independent variables are not independent. We can't talk about the slope coefficients in terms of effects of one variable on Y "all other things held constant," because changes in one of the X variables are associated with expected changes in other X variables.
- The slope coefficient values can be very sensitive to changes in the data, and/or which other independent variables are included in the model.
- Forecasting problems
- Large standard errors for all parameters
- Uncertainty about whether true relationships have been detected
- Uncertainty about the stability of the correlation structure
34 Sources of Collinearity
Data collection method. Some combination of X variable values does not exist in the data.
Example: Say that we did the tool wear case without ever trying the Type A machine at low speed or the Type B machine at high speed. Collinearity here is the result of the experimental design.
35 Sources of Collinearity
Constraints on the Model or in the Population. Some combination of X variable values does not exist in the population.
Example: In the cigarette data, imagine if the states with a high proportion of high school graduates also had a high proportion of black citizens. Collinearity here is the result of attributes of the population.
36 Sources of Collinearity
Model Specification. Adding or including variables that are tightly correlated with other variables already in the model.
Example: In a study to predict the profitability of TV programs, we might include both the Nielsen rating and the Nielsen share. Collinearity here is the result of including multiple variables that contain more or less the same information.
37 Sources of Collinearity
Over-definition. We may have a relatively small number of observations, but a large number of independent variables for each. Collinearity here is the result of too few degrees of freedom. In other words, n - p - 1 is small (or, in the extreme, negative), because p is large compared with n.
38 Detecting Collinearity
First, be aware of the potential problem, and be
vigilant. Second, check the various combinations
of independent variables for obvious evidence of
collinearity. This might include pairwise
correlation analysis, or even regressing each
independent variable against all of the others. A
high R-square coefficient would be a sign of
trouble.
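A small Python sketch of this screening step. The data are synthetic, with x2 deliberately constructed to be nearly collinear with x1; all names are illustrative.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = x1 + 0.1 * rng.normal(size=50)          # nearly a copy of x1
x3 = rng.normal(size=50)
X_df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(X_df.corr().round(2))                  # pairwise correlations

for col in X_df.columns:                     # regress each X on all of the others
    others = sm.add_constant(X_df.drop(columns=col))
    r2 = sm.OLS(X_df[col], others).fit().rsquared
    print(col, round(r2, 3))                 # a high R-square is a sign of trouble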
39 Detecting Collinearity
Third, after a regression model has been estimated, watch for these clues:
- Large changes in a coefficient as an independent variable is added or removed.
- Large changes in a coefficient as an observation is added or removed.
- Inappropriate signs or magnitudes of an estimated coefficient as compared to common sense or prior expectations.
The Variance Inflation Factor (VIF) is one measure of collinearity's impact.
40 Variance Inflation Factor
For predictor j,
$\mathrm{VIF}_j = \dfrac{1}{1 - R_j^2}$,
where $R_j^2$ is the R-square obtained by regressing $X_j$ on all of the other predictors. A VIF near 1 indicates little collinearity; a common rule of thumb treats values above about 10 as a sign of trouble.
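Continuing the synthetic X_df from the previous sketch, the VIF can be computed with the helper that statsmodels provides (an illustration only, not the slides' Excel/Minitab workflow):

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

X_const = sm.add_constant(X_df)              # statsmodels expects the constant column
for j, col in enumerate(X_const.columns):
    if col == "const":
        continue
    print(col, round(variance_inflation_factor(X_const.values, j), 1))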
41 Countermeasures
Design as much orthogonality into the data as you can. You may improve a pre-existing situation by collecting additional data, as orthogonally as possible. Exclude variables from the model that you know are correlated. Principal Components Analysis: basically creating a small set of new independent variables, each of which is a linear combination of the larger set of original independent variables (Ch. 9.5, RABE).
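A minimal sketch of the principal-components idea using scikit-learn, again with the synthetic X_df; the choice of two components is arbitrary.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Z = StandardScaler().fit_transform(X_df)       # center and scale the predictors first
pca = PCA(n_components=2)
components = pca.fit_transform(Z)              # new, mutually orthogonal predictors
print(pca.explained_variance_ratio_.round(2))  # share of variance each component keeps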
42 Countermeasures
In some cases, rescaling and centering the data can diminish the collinearity. For example, we can translate each observation into a z-score (by subtracting the mean and dividing by the standard deviation).
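In pandas, with the same illustrative X_df, the centering-and-scaling step is one line:

Z_df = (X_df - X_df.mean()) / X_df.std()           # z-score each column
print(Z_df.mean().round(2), Z_df.std().round(2))   # means ~0, standard deviations ~1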
43 Collinearity in the Supervisor Data
47 Cars
48 New model: Dependent variable is Volkswagen.
49 Reduced Volkswagen model
50 There are 42 cars in the data set, 29 represented by the dummy variables above; 13 remain, of which 5 are VW.
51 Some Minitab Output

Regression Analysis: MSRP versus MPG City, HP, ...

* Volkswagen is highly correlated with other X variables
* Volkswagen has been removed from the equation

The regression equation is
MSRP = -14613 + 666 MPG City + 141 HP - 1 Trunk - 54.2 Powertrain Warranty (miles)
       + 8634 Audi + 1770 Chrysler - 3912 Ford - 594 Honda + 2852 Lexus + 621 Mazda
       - 1071 Nissan + 1899 Saturn - 1814 Toyota + 7563 RWD + 11443 AWD

Predictor     Coef  SE Coef      T      P  VIF
Constant    -14613    11465  -1.27  0.214
MPG City     666.4    222.7   2.99  0.006  3.0
HP          140.68    21.55   6.53  0.000  7.2
Trunk         -0.7    430.3  -0.00  0.999  1.7
Powertra    -54.21    75.28  -0.72  0.478  3.2
Audi          8634     4141   2.09  0.047  3.8
Chrysler      1770     4297   0.41  0.684  1.8
Ford         -3912     3308  -1.18  0.248  1.5
Honda         -594     3804  -0.16  0.877  1.4
Lexus         2852     4651   0.61  0.545  7.1
Mazda          621     3808   0.16  0.872  1.4
Nissan       -1071     2992  -0.36  0.723  1.6
Saturn        1899     3891   0.49  0.630  1.5
Toyota       -1814     3054  -0.59  0.558  2.1
RWD           7563     3842   1.97  0.060  4.9
AWD          11443     4772   2.40  0.024  3.2

S = 4435    R-Sq = 95.2%    R-Sq(adj) = 92.4%
52 Analysis of Variance

Source          DF           SS         MS      F      P
Regression      15  10146131594  676408773  34.38  0.000
Residual Error  26    511504646   19673256
Total           41  10657636240

Source    DF      Seq SS
MPG City   1  2848829819
HP         1  6511403002
Trunk      1   157281251
Powertra   1    17113839
Audi       1   307573475
Chrysler   1    25891900
Ford       1    27808684
Honda      1     1308751
Lexus      1    61068345
Mazda      1     4942428
Nissan     1      273100
Saturn     1     3974517
Toyota     1     4953584
RWD        1    60597371
AWD        1   113111527

Unusual Observations
Obs  MPG City   MSRP    Fit  SE Fit  Residual  St Resid
  3      16.0  71465  62642    2630      8823     2.47R
  5      14.0  51785  59903    2643     -8118    -2.28R
 19      22.0  56565  47739    2793      8826     2.56R
 22      18.0  37070  47039    1667     -9969    -2.43R
 24      16.0  63665  56113    2509      7552     2.06R

R denotes an observation with a large standardized residual

Sequential sum of squares: the marginal contribution to SSR of variables, adjusted for the order in which they are included in the model.
53 Serial Correlation
(A.k.a. Autocorrelation) Here we are concerned
with the assumption that the residuals are
independent of each other. In particular, we are
suspicious that the sequential residuals have a
positive correlation. In other words, some
information about an observed value of the
dependent variable is contained in the previous
observation.
54 Consider the following historical data set, in
which the dependent variable is Consumer
Expenditure and the independent variable is Money
Stock. (Economists are interested in the effect
of Money Stock on Expenditure, because if it is
significant it presents an opportunity to
influence the economy through public policy.)
61 Something funny is going on here!
64 There seems to be a relationship between each observation and the ones around it. In other words, there is some positive correlation between the observations and their successors. If true, this suggests that a lot of the variability in observation $Y_i$ can be explained by observation $Y_{i-1}$. In turn, this might suggest that the importance of Money Stock is being overstated by our original model.
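One informal way to quantify this, not shown on the slides, is the correlation between each residual and its predecessor; a sketch with made-up residual values:

import numpy as np

resid = np.array([1.2, 0.9, 0.7, 0.2, -0.3, -0.8, -0.5, 0.1, 0.6, 1.0])  # made-up values

# Correlation between e_i and e_{i-1}; a clearly positive value is
# consistent with positive serial correlation.
lag1_corr = np.corrcoef(resid[1:], resid[:-1])[0, 1]
print(round(lag1_corr, 2))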
74 Runs Test
A run is a maximal sequence of consecutive residuals that all have the same sign (all positive or all negative).
75 Let $n_1$ be the observed number of positive residuals and $n_2$ be the observed number of negative residuals. The total number of runs in a set of $n$ uncorrelated residuals can be shown to have a mean of
$\mu = \dfrac{2 n_1 n_2}{n_1 + n_2} + 1$
and a variance of
$\sigma^2 = \dfrac{2 n_1 n_2 \,(2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}$
76 In our Money Stock case, the expected value is
8.1 and the standard deviation ought to be about
1.97.
77 Our Model 1 has 5 runs, which is 1.57 standard deviations below the expected value, an unusually small number of runs. This suggests that the residuals are not independent. (This is an approximation based on the central limit theorem; it doesn't work well with small samples.) Our Model 2 has 7 runs, only 0.56 standard deviations below the expected value.
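A Python sketch of the runs-test arithmetic, using the mean and variance formulas from the earlier slide; the residual values are made up for illustration.

import numpy as np

resid = np.array([1.2, 0.8, 0.3, -0.4, -1.1, 0.6, 0.9, -0.2, -0.7, 1.5])  # made-up values

signs = np.sign(resid)
runs = 1 + int(np.sum(signs[1:] != signs[:-1]))   # observed number of runs
n1 = int(np.sum(signs > 0))                       # number of positive residuals
n2 = int(np.sum(signs < 0))                       # number of negative residuals

mean_runs = 2 * n1 * n2 / (n1 + n2) + 1
var_runs = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
            / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
z = (runs - mean_runs) / np.sqrt(var_runs)        # large-sample normal approximation
print(runs, round(mean_runs, 2), round(z, 2))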
78 Durbin-Watson
Another popular hypothesis-testing procedure:
H0: Correlation = 0
HA: Correlation > 0
The test statistic is
$d = \dfrac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$
79 In general,
values of d close to zero indicate strong positive correlation, and values of d close to 2 suggest weak correlation. Precise definitions of "close to zero" and "close to 2" depend on the sample size and the number of independent variables; see p. 346 in RABE for a Durbin-Watson table.
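For illustration, the statistic can be computed directly from its definition, or with the helper in statsmodels (made-up residuals again):

import numpy as np
from statsmodels.stats.stattools import durbin_watson

resid = np.array([1.2, 0.9, 0.7, 0.2, -0.3, -0.8, -0.5, 0.1, 0.6, 1.0])  # made-up values

d_manual = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(round(d_manual, 3), round(durbin_watson(resid), 3))   # the two values agree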
80 The Durbin-Watson procedure will result in one of three possible decisions: reject H0 if d falls below the lower limit, do not reject H0 if d lies above the upper limit, and remain inconclusive if d falls between the two limits. From the Durbin-Watson table, we see that our Model 1 has upper and lower limits of 1.15 and 0.95, respectively. Model 2 has limits of 1.26 and 0.83.
83 In Model 1, we reject the null hypothesis and conclude there is significant positive correlation between sequential residuals. In Model 2, we do not reject the null hypothesis; the serial correlation is not significantly greater than zero.
84 Residual Analysis from the Tool-Wear Model
86 Normal score calculations
91 Assumption: The Form of the Model
The relationship between Y and each X is assumed to be linear.
92 Summary
- Residual Analysis
  - Are they normal?
  - Do they have a common variance?
- Multicollinearity
- Autocorrelation, serial correlation
  - Runs test
  - Durbin-Watson