Title: Outliers and influential data points
1Outliers and influential data points
2The distinction
- An outlier is a data point whose response y does not follow the general trend of the rest of the data.
- A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses, the estimated beta coefficients, or the hypothesis test results.
3No outliers? No influential data points?
4Any outliers? Any influential data points?
5Any outliers? Any influential data points?
6Any outliers? Any influential data points?
7Any outliers? Any influential data points?
8Any outliers? Any influential data points?
9Any outliers? Any influential data points?
10Impact on regression analyses
- Not every outlier strongly influences the regression analysis.
- Always determine if the regression analysis is unduly influenced by one or a few data points.
- Use simple plots for simple linear regression.
- Use summary measures for multiple linear regression.
11The leverages hi
12The leverages hi
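The predicted response can be written as a linear combination of the n observed values $y_1, y_2, \ldots, y_n$:

$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n$

where the weights $h_{i1}, h_{i2}, \ldots, h_{ii}, \ldots, h_{in}$ depend only on the predictor values. The leverage $h_i$ is the weight $h_{ii}$ that the ith observed response receives in its own predicted value.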
13Properties of the leverages hi
- The leverage $h_i$ is
  - a measure of the distance between the x value for the ith data point and the mean of the x values for all n data points.
  - a number between 0 and 1, inclusive.
- The sum of the $h_i$ equals p, the number of parameters.
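To make the weights and their properties concrete, here is a minimal sketch for simple linear regression only, where the weights have the closed form $h_{ij} = 1/n + (x_i - \bar{x})(x_j - \bar{x})/S_{xx}$. The function name `hat_matrix` is made up for this illustration, and the four data points are the ones used on the residuals slides (x = 1, 2, 3, 4 and y = 2, 5, 6, 9).

```python
def hat_matrix(x):
    """Weights h[i][j] for a straight-line model with an intercept.

    Assumption: the closed form h_ij = 1/n + (x_i - xbar)(x_j - xbar)/Sxx,
    which applies to simple linear regression only.
    """
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [[1.0 / n + (xi - xbar) * (xj - xbar) / sxx for xj in x] for xi in x]


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 5.0, 6.0, 9.0]

H = hat_matrix(x)

# The leverage h_i is the weight that y_i receives in its own fitted value.
leverages = [H[i][i] for i in range(len(x))]

# Each fitted value is a weighted sum of all observed responses.
fitted = [sum(h_ij * y_j for h_ij, y_j in zip(row, y)) for row in H]

print(leverages)       # every value lies between 0 and 1
print(sum(leverages))  # equals p = 2 (intercept and slope), up to rounding
print(fitted)
```

The weights are computed from x alone, which is why the leverages can be inspected before looking at any response values.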
14Any high leverages hi?
15
HI1
0.176297  0.157454  0.127014  0.119313  0.086145
0.077744  0.065028  0.061276  0.050974  0.049628
0.048147  0.049313  0.051829  0.055760  0.069311
0.072580  0.109616  0.127489  0.140453  0.141136
0.163492
Sum of HI1 = 2.0000
16Any high leverages hi?
17
HI1
0.153481  0.139367  0.116292  0.110382  0.084374
0.077557  0.066879  0.063589  0.050033  0.052121
0.047632  0.048156  0.049557  0.055893  0.057574
0.078121  0.088549  0.096634  0.096227  0.110048
0.357535
Sum of HI1 = 2.0000
18Identifying data points whose x values are extreme ... and therefore potentially influential
19Using leverages to identify extreme x values
Minitab flags any observation whose leverage value, $h_i$, is more than 3 times the mean leverage value,

$\bar{h} = \frac{1}{n}\sum_{i=1}^{n} h_i = \frac{p}{n}$,

or greater than 0.99, whichever cutoff is smaller.
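The flagging rule can be sketched in a few lines. This is an illustration of the rule as stated on the slide, not Minitab's own code; the helper name `flag_high_leverage` is hypothetical. The leverages are the 21 HI1 values from the second listing (the one ending in 0.357535), with p = 2.

```python
def flag_high_leverage(leverages, p):
    """Return (cutoff, flagged observation numbers), numbering from 1.

    The mean leverage is p/n because the leverages sum to p, so the
    cutoff is the smaller of 3*p/n and 0.99.
    """
    n = len(leverages)
    cutoff = min(3.0 * p / n, 0.99)
    flagged = [i + 1 for i, h in enumerate(leverages) if h > cutoff]
    return cutoff, flagged


hi1 = [
    0.153481, 0.139367, 0.116292, 0.110382, 0.084374,
    0.077557, 0.066879, 0.063589, 0.050033, 0.052121,
    0.047632, 0.048156, 0.049557, 0.055893, 0.057574,
    0.078121, 0.088549, 0.096634, 0.096227, 0.110048,
    0.357535,
]

cutoff, flagged = flag_high_leverage(hi1, p=2)
print(round(cutoff, 4), flagged)
```

With n = 21 and p = 2 the cutoff is 6/21, so only the last observation (leverage 0.357535) is flagged.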
20
    x      y       HI1
14.00  68.00  0.357535

Unusual Observations
Obs     x      y     Fit  SE Fit  Residual  St Resid
 21  14.0  68.00  71.449   1.620    -3.449    -1.59 X
X denotes an observation whose X value gives it large influence.
21
    x      y       HI2
13.00  15.00  0.311532

Unusual Observations
Obs     x      y    Fit  SE Fit  Residual  St Resid
 21  13.0  15.00  51.66    5.83    -36.66   -4.23RX
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
22Identifying outliers (unusual y values)
23Identifying outliers
- Residuals
- Standardized residuals
- also called internally studentized residuals
24Residuals
Ordinary residuals are defined for each observation, i = 1, ..., n, as

$e_i = y_i - \hat{y}_i$

x  y  FITS1  RESI1
1  2    2.2   -0.2
2  5    4.4    0.6
3  6    6.6   -0.6
4  9    8.8    0.2
25Standardized residuals
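Standardized residuals are defined for each observation, i = 1, ..., n, as the ordinary residual $e_i$ divided by its estimated standard deviation:

$r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE\,(1 - h_i)}}$

MSE1 = 0.400000

x  y  FITS1  RESI1  HI1     SRES1
1  2    2.2   -0.2  0.7  -0.57735
2  5    4.4    0.6  0.3   1.13389
3  6    6.6   -0.6  0.3  -1.13389
4  9    8.8    0.2  0.7   0.57735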
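The small table above can be reproduced by hand or with a short script. The following is a sketch for a straight-line fit (p = 2) using only the standard library; the function name `standardized_residuals` is chosen for this example and is not a Minitab command.

```python
import math


def standardized_residuals(x, y):
    """Return (residuals, leverages, MSE, standardized residuals) for a
    straight-line fit y = b0 + b1*x, which has p = 2 parameters."""
    n, p = len(x), 2
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar

    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]      # e_i
    mse = sum(e * e for e in resid) / (n - p)                   # MSE
    lev = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]        # h_i
    sres = [e / math.sqrt(mse * (1.0 - h)) for e, h in zip(resid, lev)]
    return resid, lev, mse, sres


resid, lev, mse, sres = standardized_residuals([1, 2, 3, 4], [2, 5, 6, 9])
for row in zip(resid, lev, sres):
    print("%7.4f %5.2f %9.5f" % row)
```

Note that the two end points have the smallest raw residuals (0.2 in absolute value) but, because of their higher leverage (0.7), their standardized residuals are not proportionally smaller.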
26Standardized residuals
- Standardized residuals quantify how large the residuals are in standard deviation units.
- An observation with a standardized residual that is larger than 3 (in absolute value) is considered an outlier.
- Recall that Minitab flags any observation with a standardized residual that is larger than 2 (in absolute value).
27An outlier?
28
S = 4.711

      x        y    FITS1       HI1     s(e)    RESI1     SRES1
0.10000  -0.0716   3.4614  0.176297  4.27561  -3.5330  -0.82635
0.45401   4.1673   5.2446  0.157454  4.32424  -1.0774  -0.24916
1.09765   6.5703   8.4869  0.127014  4.40166  -1.9166  -0.43544
1.27936  13.8150   9.4022  0.119313  4.42103   4.4128   0.99818
2.20611  11.4501  14.0706  0.086145  4.50352  -2.6205  -0.58191
...
8.70156  46.5475  46.7904  0.140453  4.36765  -0.2429  -0.05561
9.16463  45.7762  49.1230  0.163492  4.30872  -3.3468  -0.77679
4.00000  40.0000  23.1070  0.050974  4.58936  16.8930   3.68110

Unusual Observations
Obs     x      y    Fit  SE Fit  Residual  St Resid
 21  4.00  40.00  23.11    1.06     16.89     3.68R
R denotes an observation with a large standardized residual.
29Identifying influential data points
30Identifying influential data points
- Deleted residuals
- Deleted t residuals
- also called studentized deleted residuals
- also called externally studentized residuals
- Difference in fits, DFITS
- Cook's distance measure
31Basic idea of these four measures
- Delete the observations one at a time, each time refitting the regression model on the remaining n-1 observations.
- Compare the results using all n observations to the results with the ith observation deleted to see how much influence the observation has on the analysis.
32Deleted residuals
$y_i$ = the observed response for the ith observation
$\hat{y}_{(i)}$ = the predicted response for the ith observation, based on the estimated model with the ith observation deleted

Deleted residual: $d_i = y_i - \hat{y}_{(i)}$
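The delete-and-refit idea translates directly into a loop. The sketch below assumes a straight-line model and uses the four-point data set that appears with the deleted t residuals a few slides later (x = 1, 2, 3, 10 and y = 2.1, 3.8, 5.2, 2.1); the helper names `fit_line` and `deleted_residuals` are made up for this illustration.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def deleted_residuals(x, y):
    """d_i = y_i - yhat_(i), refitting without observation i each time."""
    out = []
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]      # drop the ith observation
        y_rest = y[:i] + y[i + 1:]
        b0, b1 = fit_line(x_rest, y_rest)
        out.append(y[i] - (b0 + b1 * x[i]))
    return out


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
d = deleted_residuals(x, y)
print([round(v, 4) for v in d])
```

For the point at x = 10 the ordinary residual is only -0.42, because the point pulls the fitted line toward itself, while the deleted residual is far larger in magnitude. That contrast is the reason for deleting the point before predicting it.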
34Deleted t residuals
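A deleted t residual is just a standardized deleted residual:

$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}\,(1 - h_i)}}$

where $d_i$ is the deleted residual, $e_i$ is the ordinary residual, $h_i$ is the leverage, and $MSE_{(i)}$ is the mean square error of the model fitted with the ith observation deleted.

The deleted t residuals follow a t distribution with (n-1)-p degrees of freedom.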
35
 x    y  RESI1     TRES1
 1  2.1  -1.59   -1.7431
 2  3.8   0.24    0.1217
 3  5.2   1.77    1.6361
10  2.1  -0.42  -19.7990
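The TRES1 column for this four-point data set can be reproduced with an explicit leave-one-out loop. This is a sketch for a straight-line model (p = 2), not Minitab's implementation; `fit_line` and `deleted_t_residuals` are illustrative names.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def deleted_t_residuals(x, y, p=2):
    """t_i = e_i / sqrt(MSE_(i) * (1 - h_i)) for a straight-line model."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    out = []
    for i in range(n):
        e_i = y[i] - (b0 + b1 * x[i])               # ordinary residual
        h_i = 1.0 / n + (x[i] - xbar) ** 2 / sxx    # leverage
        xr, yr = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a0, a1 = fit_line(xr, yr)                   # refit without point i
        sse_del = sum((yj - (a0 + a1 * xj)) ** 2 for xj, yj in zip(xr, yr))
        mse_del = sse_del / ((n - 1) - p)           # MSE_(i)
        out.append(e_i / math.sqrt(mse_del * (1.0 - h_i)))
    return out


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
tres = deleted_t_residuals(x, y)
print([round(t, 4) for t in tres])
```

The fourth point has a small ordinary residual but a very large deleted t residual, since the remaining three points lie almost exactly on a line and $MSE_{(4)}$ is tiny.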
36
Row        x        y    RESI1     SRES1     TRES1
  1  0.10000  -0.0716  -3.5330  -0.82635  -0.81916
  2  0.45401   4.1673  -1.0774  -0.24916  -0.24291
  3  1.09765   6.5703  -1.9166  -0.43544  -0.42596
...
 19  8.70156  46.5475  -0.2429  -0.05561  -0.05413
 20  9.16463  45.7762  -3.3468  -0.77679  -0.76837
 21  4.00000  40.0000  16.8930   3.68110   6.69012
37DFITS
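The difference in fits,

$DFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}\, h_i}}$,

is the number of standard deviations that the fitted value changes when the ith case is omitted. Here $\hat{y}_i$ is the fitted value using all n observations, $\hat{y}_{(i)}$ is the fitted value with the ith observation deleted, $MSE_{(i)}$ is the mean square error of that reduced fit, and $h_i$ is the leverage.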
38DFITS
An observation is deemed influential if the absolute value of its DFITS value is greater than 1 (for small to medium data sets), or if it just sticks out like a sore thumb relative to the DFITS values of the other observations.
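As a rough illustration of the definition and the cutoff of 1, the sketch below computes DFITS by explicit refitting for a straight-line model. It reuses the four-point data set from the deleted t residuals slide (x = 1, 2, 3, 10); the names `fit_line` and `dffits` are chosen for this example.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def dffits(x, y, p=2):
    """(yhat_i - yhat_(i)) / sqrt(MSE_(i) * h_i) for a straight-line model."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    out = []
    for i in range(n):
        h_i = 1.0 / n + (x[i] - xbar) ** 2 / sxx
        xr, yr = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a0, a1 = fit_line(xr, yr)                   # refit without point i
        sse_del = sum((yj - (a0 + a1 * xj)) ** 2 for xj, yj in zip(xr, yr))
        mse_del = sse_del / ((n - 1) - p)
        fit_all = b0 + b1 * x[i]                    # fitted value, all n points
        fit_del = a0 + a1 * x[i]                    # fitted value, point i deleted
        out.append((fit_all - fit_del) / math.sqrt(mse_del * h_i))
    return out


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
values = dffits(x, y)
flagged = [i + 1 for i, v in enumerate(values) if abs(v) > 1.0]
print([round(v, 3) for v in values], flagged)
```

With only four points this is a toy case, but it shows the pattern: the point at x = 10 moves its own fitted value by more than a hundred standard deviations when it is dropped.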
39
    x      y     DFIT1
14.00  68.00  -1.23841
40
Row        x        y     DFIT1
  1   0.1000  -0.0716  -0.52503
  2   0.4540   4.1673  -0.08388
  3   1.0977   6.5703  -0.18232
  4   1.2794  13.8150   0.75898
  5   2.2061  11.4501  -0.21823
  6   2.5006  12.9554  -0.20155
  7   3.0403  20.1575   0.27774
  8   3.2358  17.5633  -0.08230
  9   4.4531  26.0317   0.13865
 10   4.1699  22.7573  -0.02221
 11   5.2847  26.3030  -0.18487
 12   5.5924  30.6885   0.05523
 13   5.9209  33.9402   0.19741
 14   6.6607  30.9228  -0.42449
 15   6.7995  34.1100  -0.17249
 16   7.9794  44.4536   0.29918
 17   8.4154  46.5022   0.30960
 18   8.7161  50.0568   0.63049
 19   8.7016  46.5475   0.14948
 20   9.1646  45.7762  -0.25094
 21  14.0000  68.0000  -1.23841
41
    x      y     DFIT2
13.00  15.00  -11.4670
42
Row        x        y     DFIT2
  1   0.1000  -0.0716   -0.4028
  2   0.4540   4.1673   -0.2438
  3   1.0977   6.5703   -0.2058
  4   1.2794  13.8150    0.0376
  5   2.2061  11.4501   -0.1314
  6   2.5006  12.9554   -0.1096
  7   3.0403  20.1575    0.0405
  8   3.2358  17.5633   -0.0424
  9   4.4531  26.0317    0.0602
 10   4.1699  22.7573    0.0092
 11   5.2847  26.3030    0.0054
 12   5.5924  30.6885    0.0782
 13   5.9209  33.9402    0.1278
 14   6.6607  30.9228    0.0072
 15   6.7995  34.1100    0.0731
 16   7.9794  44.4536    0.2805
 17   8.4154  46.5022    0.3236
 18   8.7161  50.0568    0.4361
 19   8.7016  46.5475    0.3089
 20   9.1646  45.7762    0.2492
 21  13.0000  15.0000  -11.4670
43Cook's distance
Cook's distance:

$D_i = \frac{(y_i - \hat{y}_i)^2}{p \cdot MSE} \cdot \frac{h_i}{(1 - h_i)^2}$

- $D_i$ depends on both the residual $e_i = y_i - \hat{y}_i$ and the leverage $h_i$.
- $D_i$ summarizes how much all of the estimated beta coefficients change when deleting the ith observation.
- A large $D_i$ indicates $y_i$ has a strong influence on the estimated beta coefficients.
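One way to see that a single number can capture the change in the whole fit is to compute Cook's distance two ways: from the residual-and-leverage formula, and from the equivalent refit form $D_i = \sum_j (\hat{y}_j - \hat{y}_{j(i)})^2 / (p \cdot MSE)$, which sums the shift in every fitted value when observation i is deleted. The sketch below assumes a straight-line model and uses the four-point data set from the deleted t residuals slide; the function names are illustrative.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def cooks_distance(x, y, p=2):
    """Return Cook's distance computed two ways: (shortcut, refit)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)

    shortcut, refit = [], []
    for i in range(n):
        h_i = 1.0 / n + (x[i] - xbar) ** 2 / sxx
        # Residual-and-leverage formula: no refitting needed.
        shortcut.append(resid[i] ** 2 / (p * mse) * h_i / (1.0 - h_i) ** 2)
        # Refit form: total squared shift in all n fitted values.
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shift = sum(((b0 + b1 * xj) - (a0 + a1 * xj)) ** 2 for xj in x)
        refit.append(shift / (p * mse))
    return shortcut, refit


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
shortcut, refit = cooks_distance(x, y)
print([round(v, 4) for v in shortcut])
print([round(v, 4) for v in refit])
```

The two lists agree up to floating-point rounding, so in practice the shortcut is used and no model has to be refitted.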
44Cook's distance
- Compare Di to the F(p, n-p) distribution.
- If Di is greater than the 50th percentile,
F(0.50, p, n-p), then the ith observation has
lots of influence.
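The percentile rule can be checked for the two flagged observations in the Minitab output shown earlier (n = 21, p = 2). The sketch below makes two assumptions that are not on the slides. First, Cook's distance is recomputed from the rounded standardized residuals via the equivalent form $D_i = \frac{r_i^2}{p} \cdot \frac{h_i}{1 - h_i}$, so the values only approximate the reported COOK1 and COOK2. Second, the 50th percentile uses a closed form that holds only when the numerator degrees of freedom equal 2; for general p a statistics library would be needed (for example `scipy.stats.f.ppf(0.5, p, n - p)`).

```python
def f_median_df1_2(df2):
    """50th percentile of F(2, df2).

    Valid only for numerator df = 2, where the CDF is
    1 - (1 + 2*x/df2) ** (-df2/2) and can be inverted by hand.
    """
    return (df2 / 2.0) * (2.0 ** (2.0 / df2) - 1.0)


def cooks_from_std_resid(std_resid, h, p):
    """Cook's distance from a standardized residual and a leverage."""
    return std_resid ** 2 / p * h / (1.0 - h)


n, p = 21, 2
cutoff = f_median_df1_2(n - p)

# (St Resid, leverage) pairs read from the Unusual Observations output.
d1 = cooks_from_std_resid(-1.59, 0.357535, p)   # x = 14, y = 68
d2 = cooks_from_std_resid(-4.23, 0.311532, p)   # x = 13, y = 15

print(round(cutoff, 4), round(d1, 4), round(d2, 4))
print("influential:", d1 > cutoff, d2 > cutoff)
```

Under these assumptions the first point falls just below the cutoff and the second far above it, which matches the ordering of the reported values 0.701960 and 4.04801.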
45
    x      y     COOK1
14.00  68.00  0.701960
46
    x      y    COOK2
13.00  15.00  4.04801