Title: Regression III: Robust regressions
1. Regression III: Robust regressions
- Outliers
- Tests for outlier detection
- Robust regressions
- Breakdown point
- Least trimmed squares
- M-estimators
2. Outliers
- Outliers are observations that deviate significantly from all others. They may occur by accident or they may result from measurement errors. The presence of outliers may lead to misleading results. In the example shown in the figure, the mean without the outlier is -0.41 and with the outlier it is -0.01. If we test the hypothesis H0: mean = 0, then without the outlier we conclude that H0 can be rejected (p-value 0.009); with the outlier, however, we cannot reject the null hypothesis (p-value 0.98). These numbers are reproduced in the sketch below.
- Analysis of and dealing with outliers is an important ingredient of modern statistical analysis. Careful analysis of outliers, and their removal or down-weighting, may change the conclusions considerably.
[Figure: the sample with and without the outlier; the outlier is marked.]
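A minimal sketch reproducing the figure's numbers, assuming del1 is the eleven-point sample listed later on the breakdown point slide (with 4.0 as the outlier):

    del1 <- c(-0.8, -0.6, -0.3, 0.1, -1.1, 0.2, -0.3, -0.5, -0.5, -0.3, 4.0)
    mean(del1[-11])             # without the outlier: -0.41
    mean(del1)                  # with the outlier:    -0.01
    t.test(del1[-11])$p.value   # H0: mean = 0 rejected,     about 0.009
    t.test(del1)$p.value        # H0: mean = 0 not rejected, about 0.98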
3. Dealing with outliers
- For simple cases such as mean and variance calculations, one way of dealing with outliers is to compute the statistic on trimmed data. For example, in the case above the mean without the outlier is -0.41, with the outlier it is -0.01, and the trimmed mean with 10% removed is -0.33. The trimmed mean gives a better estimate than the mean based on all the data.
- Often the mean and other statistics are used for testing some hypothesis. In these cases non-parametric tests (wilcox.test) may be a better alternative to t.test (in the example wilcox.test gives p-value 0.09 where t.test gives 0.98).
- For simple cases like mean and covariance calculations, and tests based on them, the usual approach is to use the ranks of the observations instead of their values. When ranks are used the power of the tests is reduced, but the conclusions are more reliable in the presence of outliers.
- Before carrying out analysis and tests it is always a good idea to visualise and explore the data to see if there are outliers (see the sketch below).
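A brief sketch of both remedies with the same assumed del1 sample:

    mean(del1, trim = 0.1)   # trim 10% from each tail: about -0.33
    wilcox.test(del1)        # rank-based test: p-value about 0.09
    t.test(del1)             # for comparison:  p-value about 0.98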
4. Outlier detection: Grubbs test
- It may be a good idea to test for outliers and remove them if possible before starting to analyse the data and doing hypothesis testing. One of the techniques for doing this is the Grubbs test. It tests
- H0: there is no outlier versus H1: there is an outlier
- To do this Grubbs suggested using the statistic
- G = (max(y) - mean(y)) / sd(y)
- to test if the maximum value is an outlier,
- G = (mean(y) - min(y)) / sd(y)
- to test if the minimum value is an outlier, and
- G = max_i |y_i - mean(y)| / sd(y)
- to test if the maximum or the minimum is an outlier. There are also versions for two outliers.
- Obviously the distribution of the test statistic depends on the number of observations.
- The Grubbs test (grubbs.test) is available from the package outliers. It is not part of the standard R distribution (a call sketch follows).
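Once the package is installed, the test is a one-liner (its output is shown on the next slide; del1 is the assumed sample from above):

    install.packages("outliers")   # one-off; the package is not in base R
    library(outliers)
    grubbs.test(del1)              # by default tests the value furthest from the mean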
5. Outlier detection: Grubbs test
- Applying grubbs.test to the example we get:

    Grubbs test for one outlier
    data:  del1
    G = 2.9063, U = 0.0709, p-value = 9.858e-06
    alternative hypothesis: highest value 4 is an outlier

- Since the p-value is very small, we can reject the null hypothesis that there is no outlier.
- Once we are sure that there is an outlier, we remove it and carry the tests and/or analysis further.
6. Outlier detection: Simulated distribution
- If the outliers package is not available, we can generate the distribution of the statistics given above ourselves. Let us write a function for one of them (H0: the maximum value is not an outlier):

    outdist <- function(nsample, n) {
        ff <- vector(length = n)
        for (i in 1:n) {
            rr <- rnorm(nsample)
            ff[i] <- max(rr - mean(rr)) / sd(rr)
        }
        ff
    }
- Now we can generate the distribution of the statistic for samples of different sizes. For example, the distributions for samples of sizes 10, 15 and 20 are shown in the figure. To generate the figure the following commands were used (add = TRUE overlays the curves on one plot):

    oo10 <- outdist(10, 10000)
    curve(ecdf(oo10)(x), from = 0, to = 15, lwd = 3)
    oo15 <- outdist(15, 10000)
    curve(ecdf(oo15)(x), from = 0, to = 15, lwd = 3, add = TRUE)
    oo20 <- outdist(20, 10000)
    curve(ecdf(oo20)(x), from = 0, to = 15, lwd = 3, add = TRUE)

- Obviously, as the sample size increases, the probability that large values are genuinely observed increases. That is why the distribution shifts to the right as the sample size grows.
7. Outlier detection: Simulated distribution
- Once we have the desired distributions we can calculate p-values. For the case we considered, the sample size is 11. Let us generate the empirical cumulative distribution function and use it for outlier detection:

    oo11 <- outdist(11, 10000)
    ec11 <- ecdf(oo11)
    st   <- max(del1 - mean(del1)) / sd(del1)
    1 - ec11(st)

- This sequence of commands produces a p-value. If the maximum value is 4 the p-value is 0, if the maximum value is 2 the p-value is 0.001, and if the maximum value is 1 the p-value is 0.04. We can reject the null hypothesis in the first and second cases; in the third case we should be careful.
8. Breakdown point
- The breakdown point of an estimator is the smallest fraction of the sample that, when changed arbitrarily, can change the estimate arbitrarily. For example, for the mean, by changing a single value we can change the estimate as much as we want (demonstrated in the sketch below). Let us take an example:
- -0.8 -0.6 -0.3 0.1 -1.1 0.2 -0.3 -0.5 -0.5 -0.3 4.0
- The sample size is 11 and the mean is -0.01. If we change the last value to 100 then the mean becomes 8.72. The breakdown point of the mean is 0.
- The other limiting case is the median. The median of the above sample is -0.3. If we change one value and make it extremely large (or small), the median will not change much. For example, if we change the last value to -100 then the median becomes -0.5. The breakdown point of the median is 0.5, i.e. more than 50% of the sample must be changed arbitrarily to change the median arbitrarily. A breakdown point of 0.5 is the theoretical limit.
- The efficiency of estimators with a high breakdown point is usually worse than that of estimators with a lower breakdown point. In other words, the variances of estimators with a high breakdown point are larger.
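A short demonstration of both breakdown points with this sample:

    x <- c(-0.8, -0.6, -0.3, 0.1, -1.1, 0.2, -0.3, -0.5, -0.5, -0.3, 4.0)
    mean(x); median(x)   # -0.01 and -0.3
    x[11] <- 100         # corrupt a single observation
    mean(x)              # 8.72: one bad point drags the mean anywhere
    x[11] <- -100
    median(x)            # -0.5: the median barely moves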
9. Outliers and regression
[Figure: regression fit without outliers and the same fit with an outlier; the outlier is marked.]
- Let us recall the form of the least-squares equations for regression: we minimise the sum of squared residuals, sum_i (y_i - g(beta, x_i))^2, where x is a vector of input (predictor) variables, beta is a vector of parameters, y is the output, and n is the number of sample points.
- As we know, in the special case when g(beta, x) = beta, a single constant, least-squares estimation gives the mean value of y. We can therefore consider least-squares regression as an extension of mean value estimation. Its breakdown point is 0, so least squares is very sensitive to outliers.
- There are several approaches to dealing with outliers in regression analysis. We will consider only two of them: 1) least trimmed squares and 2) M-estimators. A small simulated example of the sensitivity follows.
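A minimal simulated example (the data and the true coefficients 2 and 0.5 are made up for illustration; the same x and y are reused in the sketches on the following slides):

    set.seed(1)
    x <- 1:20
    y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)
    y[20] <- -10                  # plant one gross outlier
    coef(lm(y ~ x))               # slope pulled well away from 0.5
    coef(lm(y[-20] ~ x[-20]))     # without the outlier: close to 2 and 0.5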
10. Least trimmed squares
- Least trimmed squares (LTS) works iteratively:
- 1) Set up initial values for the model parameters (for example using the simple least-squares method implemented in lm)
- 2) Calculate the squared residuals r_i^2 = (y_i - g(beta, x_i))^2
- 3) Sort the squared residuals
- 4) Remove the fraction of observations for which the squared residuals are largest
- 5) Minimise the sum of squares using the remaining observations only
- 6) Repeat 2)-5) until convergence is achieved
- The function lqs in R (package MASS) does LTS, and several other methods as special cases of LTS (least median and least quantile squares). The number of residuals used differs between the methods: for least median squares it is floor((n+1)/2), for least quantile squares floor((n+p+1)/2), and for LTS floor(n/2) + floor((p+1)/2), where floor() is the integer part of the argument. A call sketch follows the figure below.
[Figure: the result of the default lqs fit.]
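A minimal call sketch, reusing the simulated x and y with the planted outlier:

    library(MASS)
    fit.lts <- lqs(y ~ x, method = "lts")   # least trimmed squares (the default)
    fit.lms <- lqs(y ~ x, method = "lms")   # least median squares
    coef(fit.lts)                           # should be close to 2 and 0.5 despite y[20]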
11. Robust M-estimators
- A general extension of least squares has the form: minimise L(beta) = sum_i rho(y_i - g(beta, x_i)). The form of the function rho defines the various robust M-estimators. When rho(z) = z^2 it becomes simple least squares.
- Let us first analyse this function. To minimise it, let us use the Gauss-Newton method. For this we need the first derivative and (more precisely, an approximation for) the second derivative. Writing r_i = y_i - g(beta, x_i):
- dL/dbeta = -sum_i psi(r_i) dg/dbeta
- d2L/dbeta2 = sum_i psi'(r_i) (dg/dbeta)(dg/dbeta)^T - sum_i psi(r_i) d2g/dbeta2
- where psi = rho' and psi' are the first and second derivatives of rho. In Gauss-Newton methods the second term of the second-derivative equation is usually ignored. The notations psi = rho' and w(z) = psi(z)/z are commonly used. Looking at the equations, they are an extension of the least-squares equations with weights w(r_i). The minimisation is carried out iteratively, using iteratively reweighted least squares (IRLS or IWLS).
- The psi function is the influence function. Analysing its values at the observations may help to understand which points are outliers and how they are dealt with (a small IRLS sketch is given below).
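A minimal IRLS sketch for a straight-line fit with Huber weights, reusing the simulated x and y; the function irls and its defaults are hypothetical, not the implementation used by rlm:

    # w(z) = psi(z)/z for Huber's psi: 1 when |z| <= k, k/|z| otherwise.
    irls <- function(x, y, k = 1.345, iters = 20) {
        beta <- coef(lm(y ~ x))               # least-squares starting values
        for (it in 1:iters) {
            r <- y - (beta[1] + beta[2] * x)  # current residuals
            s <- mad(r)                       # robust scale estimate
            w <- pmin(1, k / abs(r / s))      # weight down large residuals
            beta <- coef(lm(y ~ x, weights = w))
        }
        beta
    }
    irls(x, y)   # close to the true coefficients despite the outlier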
12. Forms of robust regression
[Figure: example of rho and psi for the Geman-McClure function.]
- Robust M-estimators are usually chosen so as to make the gradient contributions of large residuals small, in other words to weight down large deviations. They can be specified either through rho or through psi.
- The basic idea behind robust estimators: for small deviations the function should behave like least squares, while the contributions of large deviations should be weighted down. Different functions differ in the degree of weighting.
13. Forms of robust regression
- The most popular forms of robust estimators are (standard forms are sketched below):
- Huber
- Tukey's bisquare
- Geman-McClure
- Welsch
- t-distribution (actually a slightly modified form of the log t-distribution)
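A sketch of the standard textbook rho functions for the first four, written as R one-liners; the tuning constants k are commonly quoted defaults and are assumptions here:

    huber    <- function(z, k = 1.345) ifelse(abs(z) <= k, z^2 / 2, k * abs(z) - k^2 / 2)
    bisquare <- function(z, k = 4.685) ifelse(abs(z) <= k, k^2 / 6 * (1 - (1 - (z / k)^2)^3), k^2 / 6)
    geman    <- function(z) z^2 / (2 * (1 + z^2))
    welsch   <- function(z, k = 2.985) k^2 / 2 * (1 - exp(-(z / k)^2))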
14. Robust estimators
- The result of robust regression with the Huber function is very good. In practice the choice of the function will depend on the number and severity of the outliers (see the rlm call below).
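In R this fit is a single call to rlm from MASS; psi.huber is its default psi function:

    library(MASS)
    fit <- rlm(y ~ x, psi = psi.huber)   # Huber M-estimator fitted by IRLS
    coef(fit)                            # robust estimates of intercept and slope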
15. Robust estimators: Diagnostic plots
16. Boxplots for residuals and weighted residuals
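Such plots can be produced from the rlm fit above; rlm stores the final IRLS weights in fit$w:

    r <- resid(fit)
    boxplot(list(residuals = r, weighted = r * fit$w),
            main = "Residuals before and after IRLS weighting")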
17. R commands for robust estimation
- lqs: least trimmed squares and least median squares estimation (package MASS)
- rlm: robust linear model estimation using M-estimators (package MASS)
- The outliers package may also be helpful.
18. References
- Huber, P. J. (1981) Robust Statistics. Wiley.
- Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986) Robust Statistics: The Approach Based on Influence Functions. Wiley.
- Marazzi, A. (1993) Algorithms, Routines and S Functions for Robust Statistics. Wadsworth & Brooks/Cole.
- Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Springer.