Title: Detecting outliers
Detecting outliers
- We've seen how to use hat matrix diagonals as a way of detecting potentially high-leverage points
- But what about points that may not have much leverage, but are large outliers?
- Such points interfere with model fit measures and our tests of significance
- We've looked at the residuals for a number of uses, but we haven't made much (direct) use of the properties of the residuals
- Recall that for all of the models we've discussed, we've assumed that the residuals are normally distributed, with mean zero and standard deviation sigma
Standardized residuals
- When we learned about the normal distribution we saw how to work out normal tail probabilities
- Recall that if $X \sim N(\mu, \sigma^2)$ then $Z = (X - \mu)/\sigma \sim N(0, 1)$
- So if the residuals $r_i$ have an approximate normal distribution with mean 0 and variance $\sigma^2$, then $r_i/\sigma$ is approximately $N(0, 1)$
Standardized residuals
- The trouble is that we don't know the standard deviation of the residuals
- But we have an estimate of it: a standard error
- It turns out that the standard error of the ith residual is the estimate of the overall standard deviation (the square root of the Residual Mean Square), $s$, times a function of the associated hat matrix diagonal, i.e. $\mathrm{se}(r_i) = s\sqrt{1 - h_{ii}}$
- So the standardized residuals are given by $d_i = r_i / (s\sqrt{1 - h_{ii}})$ (see the sketch below)
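A minimal numpy sketch of this calculation, assuming the design matrix X already includes a column of ones:

import numpy as np

def standardized_residuals(X, y):
    """Standardized residuals r_i / (s * sqrt(1 - h_ii)) for an OLS fit."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares coefficients
    resid = y - X @ beta_hat                          # raw residuals r_i
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)     # hat matrix diagonals h_ii
    s = np.sqrt(resid @ resid / (n - p))              # sqrt of the Residual Mean Square
    return resid / (s * np.sqrt(1 - h))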
Studentized residuals
- Complementary to the idea of the standardized residuals are the Studentized residuals, $t_i = r_i / (s_{(i)}\sqrt{1 - h_{ii}})$, where $s_{(i)}$ is the residual standard deviation when the least squares procedure is run without the ith observation
- Numerically, they are similar to the standardized residuals
- However, they may have better theoretical properties
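A sketch of the same computation for the Studentized residuals, using the standard leave-one-out identity for $s_{(i)}$ rather than refitting the regression n times (statsmodels exposes equivalent quantities via results.get_influence()):

import numpy as np

def studentized_residuals(X, y):
    """Studentized residuals r_i / (s_(i) * sqrt(1 - h_ii)) for an OLS fit."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    sse = resid @ resid                               # residual sum of squares
    # Leave-one-out residual variance, without n separate refits.
    s_i = np.sqrt((sse - resid**2 / (1 - h)) / (n - p - 1))
    return resid / (s_i * np.sqrt(1 - h))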
Using Standardised / Studentized residuals
- Theory aside, how do we use these quantities?
- If the standardized residuals have a standard normal distribution, then it is very easy to spot large residuals
- We know that for a standard normal Z:
- Pr(Z < -1.96 or Z > 1.96) = 0.05
- Pr(Z < -2.57 or Z > 2.57) = 0.01
- Large residuals will have large Z-scores and hence will have low probabilities of being observed if these points truly came from a standard normal distribution
- Therefore, if we have large (in magnitude) standardized residuals, we might think that these points do not belong to the model (see the sketch below)
- The Studentized residuals actually have a t-distribution, so if the regression is on a small number of points (< 20) the Studentized residuals might be preferable.
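A small sketch of this check, assuming standardized residuals computed as above; for small samples the normal cutoff could be swapped for a t cutoff applied to the Studentized residuals:

import numpy as np
from scipy import stats

def flag_large_residuals(std_resid, alpha=0.05):
    """Indices whose standardized residuals exceed the two-sided normal cutoff
    (1.96 for alpha = 0.05, 2.57 for alpha = 0.01)."""
    cutoff = stats.norm.ppf(1 - alpha / 2)
    return np.where(np.abs(std_resid) > cutoff)[0]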
Multicollinearity
- One problem that we have in multiple regression that we don't have in simple linear regression is multicollinearity
- The concept of multicollinearity is simple; detecting it and deciding what to do about it is hard
- Before we make a definition of multicollinearity we need a couple of extra terms
- In regression, the response is often called the dependent variable, its value being dependent on the value of the predictors
- Correspondingly, the predictors are called the independent variables
Multicollinearity: a definition
- The previous terms give us a handle on multicollinearity
- Multicollinearity occurs when there is a linear relationship or dependency amongst the independent variables
- This sounds a little bizarre: a model containing both $X$ and $X^2$ is okay (and in fact is quite a common one)
- But a model containing both $X$ and a second predictor that is an exact linear function of $X$, such as $Z = 3X$, is not, and would suffer from the effects of multicollinearity
- The difference is that $Z$ is linearly related to $X$, whereas $X^2$ is a non-linear function of $X$
- In an example as extreme as the previous one, we would know straight away there was a problem.
- This is because in calculating the least squares solution, the matrix $X^TX$ would be singular. This means it has no inverse
- If you don't know any linear algebra, you can think of this as trying to calculate 1/x when x = 0
- Minitab would report
- z is highly correlated with other X variables
- z has been removed from the equation
- Most other packages will report that $X^TX$ is singular or ill-conditioned
- In more realistic problems, there may be linear dependence between the variables, but not perfect dependence (a small demonstration follows below)
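A small numpy demonstration of the extreme case, with z an exact linear function of x (illustrative data only):

import numpy as np

x = np.arange(10, dtype=float)
z = 3 * x                                      # z is an exact linear function of x
X = np.column_stack([np.ones_like(x), x, z])   # design matrix: intercept, x, z

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))              # 2, not 3: X'X is rank-deficient
print(np.linalg.cond(XtX))                     # huge (or infinite) condition number
# Inverting X'X here is the "1/x when x = 0" situation described above.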
Example
- The following dataset comes from a cheese tasting experiment.
- As cheddar cheese matures, a variety of chemical processes take place. The taste of matured cheese is related to the concentration of several chemicals in the final product.
- In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analysed for their chemical composition and were subjected to taste tests.
- Overall taste scores were obtained by combining the scores from several tasters.
- The response in this model is taste
- The predictors are the log concentrations of lactic acid (lactic), acetic acid (acetic) and hydrogen sulphide (H2S)
- Lactic acid is present in all milk products
- Acetic acid comes from vinegar and is a preservative
- Hydrogen sulphide, the gas that gives Rotorua its fresh smell, is used as a preservative
- Because we don't know too much about the relationship between the predictors and the response (we haven't been given any other information), we fit a straightforward multiple linear regression model (see the sketch below)
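One way to fit this model in Python with statsmodels; the file name and column names (taste, Lactic, Acetic, H2S) are assumptions about how the data are stored:

import pandas as pd
import statsmodels.api as sm

cheese = pd.read_csv("cheese.csv")        # assumed columns: taste, Lactic, Acetic, H2S

X = sm.add_constant(cheese[["Lactic", "Acetic", "H2S"]])
full_model = sm.OLS(cheese["taste"], X).fit()
print(full_model.summary())               # coefficient table, R-Sq, ANOVA F-test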
Predictor       Coef   StDev      T      P
Constant      -28.88   19.74  -1.46  0.155
Lactic        19.671   8.629   2.28  0.031
Acetic         0.328   4.460   0.07  0.942
H2S            3.912   1.248   3.13  0.004

S = 10.13   R-Sq = 65.2%   R-Sq(adj) = 61.2%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       3  4994.5  1664.8  16.22  0.000
Residual Error  26  2668.4   102.6
Total           29  7662.9
- The P-value from the ANOVA table is small, so the regression is significant
- The regression table indicates that perhaps the lactic acid and hydrogen sulphide concentrations are important in predicting taste, whereas the acetic acid concentration is not
- The adjusted R² is reasonable: we're explaining about 60% of the variation
- Once again we see a typical pattern
- The norplot (normal probability plot) of the residuals looks fine
- There appears to be a strong linear relationship
- No one point deviates from the line in any significant way
- However, the pred-res plot (predicted values versus residuals) tells a different story
- Very strong funnel shape plus curvature
- Something is wrong with our model
Detecting multicollinearity
- When we saw this pattern in the pred-res plot previously, it was because we had fit a linear model to a non-linear curve
- How would we know that multicollinearity may be the cause this time and not just some non-linear effect?
- We don't, but because we're fitting a multiple linear regression model, we should check for multicollinearity as a matter of course
- We defined multicollinearity to be the existence of a linear dependency between the predictors
- How do we observe linear relationships? Scatterplots
- How do we quantify linear relationships? Correlation coefficients
Correlations (Pearson)

         Lactic  Acetic
Acetic    0.604
H2S       0.645   0.618
- A scatterplot matrix gives a scatterplot for each pair of predictors
- We can see from the scatterplot matrix that there is definitely a relationship between each pair of predictors
- This is backed up by the correlation matrix, which gives the linear correlation coefficient for each pair of predictors
- All three correlation coefficients are over 0.6, indicating a significant problem with multicollinearity (a short sketch of these checks follows below)
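A short pandas sketch of both checks, reusing the cheese data frame loaded earlier:

import pandas as pd
import matplotlib.pyplot as plt

predictors = cheese[["Lactic", "Acetic", "H2S"]]

print(predictors.corr())                  # pairwise Pearson correlation coefficients
pd.plotting.scatter_matrix(predictors)    # scatterplot for each pair of predictors
plt.show()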
Possible solutions
- If detection of multicollinearity is hard, deciding what to do about it is even harder
- One solution, if two variables have a significant linear relationship between them, is to drop one of the variables
- In our extreme case (where one variable was an exact linear function of the other), dropping one of them loses no information
- How does it work out in this example?
- We need to think about dropping variables in a sensible or systematic way
- One way of doing this is with a partial regression plot
Partial regression plots
- Think for a moment about the general case where we have k possible explanatory variables
- We wish to decide whether to include the kth variable in the model
- This might fail to happen in two different ways:
- The variable may be completely unrelated to the response, so including it will not improve model fit, or
- Given the other variables in the model, the additional variability explained by this variable may be negligible, i.e. this variable might be linearly dependent on the other predictors (multicollinearity)
- We can use this to our advantage
- The part of the response not accounted for by the other variables, $x_1, \ldots, x_{k-1}$, is simply the residual we get when we regress the response on these variables
- We want to see if there is any relationship between these residuals and the part of $x_k$ not related to $x_1, \ldots, x_{k-1}$.
- This unrelated part of $x_k$ is best measured by the residuals we get when we regress $x_k$ on $x_1, \ldots, x_{k-1}$
- We examine the relationship between these two sets of residuals by plotting the first set of residuals versus the second; this is called a partial regression plot (a sketch follows below)
- If there appears to be no relationship in this plot then adding the variable will not improve the fit
- If the relationship seems linear, then adding the variable will very likely improve the fit
- Finally, if there is a curve in the plot then a transformation of the predictor may well improve the fit further
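A sketch of how such a plot can be built directly from the two sets of residuals, using a hypothetical helper and the cheese data frame from earlier:

import matplotlib.pyplot as plt
import statsmodels.api as sm

def partial_regression_plot(df, response, predictor, others):
    """Plot residuals of response ~ others against residuals of predictor ~ others."""
    Z = sm.add_constant(df[others])
    resid_y = sm.OLS(df[response], Z).fit().resid   # part of y not explained by the others
    resid_x = sm.OLS(df[predictor], Z).fit().resid  # part of x_k not explained by the others
    plt.scatter(resid_x, resid_y)
    plt.xlabel(predictor + " residuals")
    plt.ylabel(response + " residuals")
    plt.show()

# e.g. partial_regression_plot(cheese, "taste", "Acetic", ["Lactic", "H2S"])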
- These plots can be quite hard to interpret
- However, there appears to be at least a weak relationship for H2S and Lactic, but not for Acetic, so we will fit the model with Lactic and H2S as the only predictors
Predictor       Coef   StDev      T      P
Constant     -27.592   8.982  -3.07  0.005
Lactic        19.887   7.959   2.50  0.019
H2S            3.946   1.136   3.47  0.002

S = 9.942   R-Sq = 65.2%   R-Sq(adj) = 62.6%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       2  4993.9  2497.0  25.26  0.000
Residual Error  27  2669.0    98.9
Total           29  7662.9
- The regression seems okay, with significant coefficients and a reasonable R² value
- But we still have non-constant variance
- We might try taking logs of the response (a sketch of both refits follows below)
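A sketch of both refits, again assuming the cheese data frame from earlier:

import numpy as np
import statsmodels.api as sm

X2 = sm.add_constant(cheese[["Lactic", "H2S"]])     # drop Acetic
reduced = sm.OLS(cheese["taste"], X2).fit()         # the model tabled above
logged = sm.OLS(np.log(cheese["taste"]), X2).fit()  # log-transformed response
print(reduced.rsquared_adj, logged.rsquared_adj)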
- Taking the logarithm of the response has cured our residual problems, but there are two residuals which are disturbing the fit
- If we remove these points, the R² goes from 42% to 53.6%, an acceptable level