Title: Detecting outliers
Detecting outliers
- We've seen how to use hat matrix diagonals as a way of detecting potentially high-leverage points
- But what about points that may not have much leverage, but are large outliers?
- Such points interfere with model fit measures and our tests of significance
- We've looked at the residuals for a number of uses, but we haven't made much (direct) use of the properties of the residuals
- Recall that for all of the models we've discussed, we've assumed that the residuals are normally distributed, with mean zero and standard deviation sigma
Standardized residuals
- When we learned about the normal distribution we saw how to work out normal tail probabilities
- Recall that if $X \sim N(\mu, \sigma^2)$ then $Z = (X - \mu)/\sigma \sim N(0, 1)$
- So if the residuals $r_i$ have an approximate normal distribution with mean 0 and variance $\sigma^2$, then $r_i/\sigma$ is approximately $N(0, 1)$
Standardized residuals
- The trouble is that we don't know the standard deviation of the residuals
- But we have an estimate of it: a standard error
- It turns out that the standard error of the ith residual is the estimate of the overall standard deviation (the square root of the Residual Mean Square), $s$, times a function of the associated hat matrix diagonal, i.e. $\mathrm{se}(r_i) = s\sqrt{1 - h_{ii}}$
- So the standardized residuals are given by $d_i = r_i / (s\sqrt{1 - h_{ii}})$ (see the sketch below)
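A minimal numpy sketch of this calculation, assuming the design matrix X already includes a column of ones:

import numpy as np

def standardized_residuals(X, y):
    """Standardized residuals r_i / (s * sqrt(1 - h_ii)) for an OLS fit."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares coefficients
    resid = y - X @ beta_hat                          # raw residuals r_i
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)     # hat matrix diagonals h_ii
    s = np.sqrt(resid @ resid / (n - p))              # sqrt of the Residual Mean Square
    return resid / (s * np.sqrt(1 - h))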
Studentized residuals
- Complementary to the idea of the standardized residuals are the Studentized residuals, $t_i = r_i / (s_{(i)}\sqrt{1 - h_{ii}})$, where $s_{(i)}$ is the residual standard deviation when the least squares procedure is run without the ith observation
- Numerically, they are similar to the standardized residuals
- However, they may have better theoretical properties
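A sketch of the same computation for the Studentized residuals, using the standard leave-one-out identity for $s_{(i)}$ rather than refitting the regression n times (statsmodels exposes equivalent quantities via results.get_influence()):

import numpy as np

def studentized_residuals(X, y):
    """Studentized residuals r_i / (s_(i) * sqrt(1 - h_ii)) for an OLS fit."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    sse = resid @ resid                               # residual sum of squares
    # Leave-one-out residual variance, without n separate refits.
    s_i = np.sqrt((sse - resid**2 / (1 - h)) / (n - p - 1))
    return resid / (s_i * np.sqrt(1 - h))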
Using Standardised / Studentized residuals
- Theory aside, how do we use these quantities?
- If the standardized residuals have a standard normal distribution, then it is very easy to spot large residuals
- We know that for a standard normal Z:
- Pr(Z < -1.96 or Z > 1.96) = 0.05
- Pr(Z < -2.57 or Z > 2.57) = 0.01
- Large residuals will have large Z-scores and hence will have low probabilities of being observed if these points truly came from a standard normal distribution
- Therefore, if we have large (in magnitude) standardized residuals, we might think that these points do not belong to the model (see the sketch below)
- The Studentized residuals actually have a t-distribution, so if the regression is on a small number of points (< 20) the Studentized residuals might be preferable.
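A small sketch of this check, assuming standardized residuals computed as above; for small samples the normal cutoff could be swapped for a t cutoff applied to the Studentized residuals:

import numpy as np
from scipy import stats

def flag_large_residuals(std_resid, alpha=0.05):
    """Indices whose standardized residuals exceed the two-sided normal cutoff
    (1.96 for alpha = 0.05, 2.57 for alpha = 0.01)."""
    cutoff = stats.norm.ppf(1 - alpha / 2)
    return np.where(np.abs(std_resid) > cutoff)[0]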
Multicollinearity
- One problem that we have in multiple regression that we don't have in simple linear regression is multicollinearity
- The concept of multicollinearity is simple; detecting it and deciding what to do about it is hard
- Before we make a definition of multicollinearity we need a couple of extra terms
- In regression, the response is often called the dependent variable, its value being dependent on the value of the predictors
- Correspondingly, the predictors are called the independent variables
Multicollinearity: a definition
- The previous terms give us a handle on multicollinearity
- Multicollinearity occurs when there is a linear relationship or dependency amongst the independent variables
- This sounds a little bizarre: a model containing both $X$ and $X^2$ is okay (and in fact is quite a common one)
- But a model containing both $X$ and a second predictor that is an exact linear function of $X$, such as $Z = 3X$, is not, and would suffer from the effects of multicollinearity
- The difference is that $Z$ is linearly related to $X$, whereas $X^2$ is a non-linear function of $X$
- In an example as extreme as the previous one, we would know straight away there was a problem.
- This is because in calculating the least squares solution, the matrix $X^TX$ would be singular. This means it has no inverse
- If you don't know any linear algebra, you can think of this as trying to calculate 1/x when x = 0
- Minitab would report
- z is highly correlated with other X variables
- z has been removed from the equation
- Most other packages will report that $X^TX$ is singular or ill-conditioned
- In more realistic problems, there may be linear dependence between the variables, but not perfect dependence (a small demonstration follows below)
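A small numpy demonstration of the extreme case, with z an exact linear function of x (illustrative data only):

import numpy as np

x = np.arange(10, dtype=float)
z = 3 * x                                      # z is an exact linear function of x
X = np.column_stack([np.ones_like(x), x, z])   # design matrix: intercept, x, z

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))              # 2, not 3: X'X is rank-deficient
print(np.linalg.cond(XtX))                     # huge (or infinite) condition number
# Inverting X'X here is the "1/x when x = 0" situation described above.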
Example
- The following dataset comes from a cheese tasting experiment.
- As cheddar cheese matures, a variety of chemical processes take place. The taste of matured cheese is related to the concentration of several chemicals in the final product.
- In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analysed for their chemical composition and were subjected to taste tests.
- Overall taste scores were obtained by combining the scores from several tasters.
- The response in this model is taste
- The predictors are the log concentrations of lactic acid (lactic), acetic acid (acetic) and hydrogen sulphide (H2S)
- Lactic acid is present in all milk products
- Acetic acid comes from vinegar and is a preservative
- Hydrogen sulphide, the gas that gives Rotorua its fresh smell, is used as a preservative
- Because we don't know too much about the relationship between the predictors and the response (we haven't been given any other information), we fit a straightforward multiple linear regression model (see the sketch below)
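One way to fit this model in Python with statsmodels; the file name and column names (taste, Lactic, Acetic, H2S) are assumptions about how the data are stored:

import pandas as pd
import statsmodels.api as sm

cheese = pd.read_csv("cheese.csv")        # assumed columns: taste, Lactic, Acetic, H2S

X = sm.add_constant(cheese[["Lactic", "Acetic", "H2S"]])
full_model = sm.OLS(cheese["taste"], X).fit()
print(full_model.summary())               # coefficient table, R-Sq, ANOVA F-test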
Predictor       Coef   StDev      T      P
Constant      -28.88   19.74  -1.46  0.155
Lactic        19.671   8.629   2.28  0.031
Acetic         0.328   4.460   0.07  0.942
H2S            3.912   1.248   3.13  0.004

S = 10.13   R-Sq = 65.2%   R-Sq(adj) = 61.2%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       3  4994.5  1664.8  16.22  0.000
Residual Error  26  2668.4   102.6
Total           29  7662.9
- The P-value from the ANOVA table is small, so the regression is significant
- The regression table indicates that perhaps the lactic acid and hydrogen sulphide concentrations are important in predicting taste, whereas the acetic acid concentration is not
- The adjusted R² is reasonable: we're explaining about 60% of the variation
- Once again we see a typical pattern
- The norplot (normal probability plot) of the residuals looks fine
- There appears to be a strong linear relationship
- No one point deviates from the line in any significant way
- However, the pred-res plot (predicted values versus residuals) tells a different story
- Very strong funnel shape plus curvature
- Something is wrong with our model
Detecting multicollinearity
- When we saw this pattern in the pred-res plot previously, it was because we had fit a linear model to a non-linear curve
- How would we know that multicollinearity may be the cause this time and not just some non-linear effect?
- We don't, but because we're fitting a multiple linear regression model, we should check for multicollinearity as a matter of course
- We defined multicollinearity to be the existence of a linear dependency between the predictors
- How do we observe linear relationships? Scatterplots
- How do we quantify linear relationships? Correlation coefficients
Correlations (Pearson)

         Lactic  Acetic
Acetic    0.604
H2S       0.645   0.618
- A scatterplot matrix gives a scatterplot for each pair of predictors
- We can see from the scatterplot matrix that there is definitely a relationship between each pair of predictors
- This is backed up by the correlation matrix, which gives the linear correlation coefficient for each pair of predictors
- All three correlation coefficients are over 0.6, indicating a significant problem with multicollinearity (a short sketch of these checks follows below)
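A short pandas sketch of both checks, reusing the cheese data frame loaded earlier:

import pandas as pd
import matplotlib.pyplot as plt

predictors = cheese[["Lactic", "Acetic", "H2S"]]

print(predictors.corr())                  # pairwise Pearson correlation coefficients
pd.plotting.scatter_matrix(predictors)    # scatterplot for each pair of predictors
plt.show()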
Possible solutions
- If detection of multicollinearity is hard, deciding what to do about it is even harder
- One solution, if two variables have a significant linear relationship between them, is to drop one of the variables
- In our extreme case (where one variable was an exact linear function of the other), dropping one of them loses no information
- How does it work out in this example?
- We need to think about dropping variables in a sensible or systematic way
- One way of doing this is with a partial regression plot
Partial regression plots
- Think for a moment about the general case where we have k possible explanatory variables
- We wish to decide whether to include the kth variable in the model
- This might fail to happen in two different ways:
- The variable may be completely unrelated to the response, so including it will not improve model fit, or
- Given the other variables in the model, the additional variability explained by this variable may be negligible, i.e. this variable might be linearly dependent on the other predictors (multicollinearity)
- We can use this to our advantage
- The part of the response not accounted for by the other variables, $x_1, \ldots, x_{k-1}$, is simply the residual we get when we regress the response on these variables
- We want to see if there is any relationship between these residuals and the part of $x_k$ not related to $x_1, \ldots, x_{k-1}$.
- This unrelated part of $x_k$ is best measured by the residuals we get when we regress $x_k$ on $x_1, \ldots, x_{k-1}$
- We examine the relationship between these two sets of residuals by plotting the first set of residuals versus the second; this is called a partial regression plot (a sketch follows below)
- If there appears to be no relationship in this plot then adding the variable will not improve the fit
- If the relationship seems linear, then adding the variable will very likely improve the fit
- Finally, if there is a curve in the plot then a transformation of the predictor may well improve the fit further
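A sketch of how such a plot can be built directly from the two sets of residuals, using a hypothetical helper and the cheese data frame from earlier:

import matplotlib.pyplot as plt
import statsmodels.api as sm

def partial_regression_plot(df, response, predictor, others):
    """Plot residuals of response ~ others against residuals of predictor ~ others."""
    Z = sm.add_constant(df[others])
    resid_y = sm.OLS(df[response], Z).fit().resid   # part of y not explained by the others
    resid_x = sm.OLS(df[predictor], Z).fit().resid  # part of x_k not explained by the others
    plt.scatter(resid_x, resid_y)
    plt.xlabel(predictor + " residuals")
    plt.ylabel(response + " residuals")
    plt.show()

# e.g. partial_regression_plot(cheese, "taste", "Acetic", ["Lactic", "H2S"])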
- These plots can be quite hard to interpret
- However, there appears to be at least a weak relationship for H2S and Lactic, but not for Acetic, so we will fit the model with Lactic and H2S as the only predictors
Predictor       Coef   StDev      T      P
Constant     -27.592   8.982  -3.07  0.005
Lactic        19.887   7.959   2.50  0.019
H2S            3.946   1.136   3.47  0.002

S = 9.942   R-Sq = 65.2%   R-Sq(adj) = 62.6%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       2  4993.9  2497.0  25.26  0.000
Residual Error  27  2669.0    98.9
Total           29  7662.9
- The regression seems okay, with significant coefficients and a reasonable R² value
- But we still have non-constant variance
- We might try taking logs of the response (a sketch of both refits follows below)
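A sketch of both refits, again assuming the cheese data frame from earlier:

import numpy as np
import statsmodels.api as sm

X2 = sm.add_constant(cheese[["Lactic", "H2S"]])     # drop Acetic
reduced = sm.OLS(cheese["taste"], X2).fit()         # the model tabled above
logged = sm.OLS(np.log(cheese["taste"]), X2).fit()  # log-transformed response
print(reduced.rsquared_adj, logged.rsquared_adj)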
- Taking the logarithm of the response has cured our residual problems, but there are two residuals which are disturbing the fit
- If we remove these points, the R² goes from 42% to 53.6%, an acceptable level