Outliers and influential data points

Slides: 47

Provided by: lsi4

1
Outliers and influential data points
2
The distinction

An outlier is a data point whose response y does
not follow the general trend of the rest of the
data.
A data point is influential if it unduly
influences any part of a regression analysis,
such as predicted responses, estimated beta
coefficients, hypothesis test results, etc.

3
No outliers? No influential data points?
4
Any outliers? Any influential data points?
5
Any outliers? Any influential data points?
6
Any outliers? Any influential data points?
7
Any outliers? Any influential data points?
8
Any outliers? Any influential data points?
9
Any outliers? Any influential data points?
10
Impact on regression analyses

Not every outlier strongly influences the
regression analysis.
Always determine if the regression analysis is
unduly influenced by one or a few data points.
Use simple plots for simple linear regression.
Use summary measures for multiple linear regression.

11
The leverages hi
12
The leverages hi
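The predicted response can be written as a linear
combination of the n observed values y1, y2, ..., yn:
$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n$
where the weights $h_{i1}, h_{i2}, \ldots, h_{ii}, \ldots, h_{in}$ depend
only on the predictor values. The leverage hi is the weight $h_{ii}$
that the ith observed response receives in its own predicted response.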
13
Properties of the leverages hi

The leverage hi is
a measure of the distance between the x value for
the ith data point and the mean of the x values
for all n data points.
a number between 0 and 1, inclusive.
The sum of the hi equals p, the number of
parameters (including the intercept).
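
For simple linear regression with an intercept, the leverages have a standard closed form, $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. That formula is not shown on the slides, but it makes the properties above easy to check. The sketch below assumes that form; the function name `leverages` and the x values are illustrative only.

```python
def leverages(x):
    """Leverages h_i for simple linear regression with an intercept.

    Assumes the standard closed form h_i = 1/n + (x_i - x_bar)^2 / Sxx,
    which is not taken from the slides.
    """
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]


x = [1.0, 2.0, 3.0, 4.0]  # illustrative predictor values
h = leverages(x)

# Points far from the mean of x get larger leverages.
for xi, hi in zip(x, h):
    print(f"x = {xi:4.1f}  h = {hi:.3f}")

# The leverages sum to p = 2 (intercept and slope), up to rounding error.
print(f"sum of leverages = {sum(h):.4f}")
```

The two end points, which lie farthest from the mean of x, receive the largest leverages.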

14
Any high leverages hi?
15
HI1
0.176297  0.157454  0.127014  0.119313  0.086145  0.077744  0.065028
0.061276  0.050974  0.049628  0.048147  0.049313  0.051829  0.055760
0.069311  0.072580  0.109616  0.127489  0.140453  0.141136  0.163492
Sum of HI1 = 2.0000
16
Any high leverages hi?
17
HI1
0.153481  0.139367  0.116292  0.110382  0.084374  0.077557  0.066879
0.063589  0.050033  0.052121  0.047632  0.048156  0.049557  0.055893
0.057574  0.078121  0.088549  0.096634  0.096227  0.110048  0.357535
Sum of HI1 = 2.0000
18
Identifying data points whose x values are
extreme ... and therefore potentially influential
19
Using leverages to identify extreme x values
Minitab flags any observation whose leverage
value, hi, is more than 3 times the mean
leverage value, p/n, or greater than 0.99,
whichever cutoff is smaller.
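
The flagging rule can be sketched in a few lines. This is an illustration of the rule as stated on the slide, not Minitab's actual implementation; the function name `flag_high_leverage` is hypothetical, and the leverage values are the 21 HI1 values listed on slide 17.

```python
def flag_high_leverage(h, p):
    """Return the cutoff and the 0-based indices of leverages above it.

    The cutoff is the smaller of 3 * (p / n) and 0.99, following the
    rule stated on the slide.
    """
    n = len(h)
    cutoff = min(3.0 * p / n, 0.99)
    flagged = [i for i, hi in enumerate(h) if hi > cutoff]
    return cutoff, flagged


# Leverages copied from slide 17 (n = 21, p = 2).
hi_values = [
    0.153481, 0.139367, 0.116292, 0.110382, 0.084374, 0.077557, 0.066879,
    0.063589, 0.050033, 0.052121, 0.047632, 0.048156, 0.049557, 0.055893,
    0.057574, 0.078121, 0.088549, 0.096634, 0.096227, 0.110048, 0.357535,
]

cutoff, flagged = flag_high_leverage(hi_values, p=2)
print(f"cutoff = {cutoff:.4f}")
print("flagged observations (1-based):", [i + 1 for i in flagged])
```

With n = 21 and p = 2 the cutoff is 3(2/21), about 0.286, so only the last observation, with leverage 0.357535, exceeds it.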
20
    x      y       HI1
14.00  68.00  0.357535

Unusual Observations
Obs     x      y     Fit  SE Fit  Residual  St Resid
 21  14.0  68.00  71.449   1.620    -3.449    -1.59 X

X denotes an observation whose X value gives it large influence.
21
    x      y       HI2
13.00  15.00  0.311532

Unusual Observations
Obs     x      y    Fit  SE Fit  Residual  St Resid
 21  13.0  15.00  51.66    5.83    -36.66   -4.23RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
22
Identifying outliers (unusual y values)
23
Identifying outliers

Residuals
Standardized residuals
also called internally studentized residuals

24
Residuals
Ordinary residuals are defined for each observation,
i = 1, ..., n:
$e_i = y_i - \hat{y}_i$

x  y  FITS1  RESI1
1  2    2.2   -0.2
2  5    4.4    0.6
3  6    6.6   -0.6
4  9    8.8    0.2
25
Standardized residuals
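Standardized residuals are defined for each
observation, i = 1, ..., n:
$e_i^* = \frac{e_i}{\sqrt{MSE \, (1 - h_i)}}$
where $e_i$ is the ordinary residual and $h_i$ is the leverage.

MSE1 = 0.400000

x  y  FITS1  RESI1  HI1     SRES1
1  2    2.2   -0.2  0.7  -0.57735
2  5    4.4    0.6  0.3   1.13389
3  6    6.6   -0.6  0.3  -1.13389
4  9    8.8    0.2  0.7   0.57735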
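
The SRES1 column can be reproduced by hand. The sketch below fits the least-squares line to the four (x, y) pairs on this slide and divides each residual by sqrt(MSE (1 - hi)). The function name `standardized_residuals` is illustrative, and the closed-form leverage for simple linear regression is a standard result assumed here rather than taken from the slides.

```python
import math


def standardized_residuals(x, y, p=2):
    """Internally studentized residuals for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx               # slope
    b0 = y_bar - b1 * x_bar      # intercept
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    # Assumed closed form for the leverages in simple linear regression.
    h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [e / math.sqrt(mse * (1.0 - hi)) for e, hi in zip(resid, h)]


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 5.0, 6.0, 9.0]
sres = standardized_residuals(x, y)
for xi, si in zip(x, sres):
    print(f"x = {xi:.0f}  SRES = {si:.5f}")
```

For the first observation this gives -0.2 / sqrt(0.4 (1 - 0.7)), which agrees with the tabulated -0.57735.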
26
Standardized residuals

Standardized residuals quantify how large the
residuals are in standard deviation units.
An observation with a standardized residual that
is larger than 3 (in absolute value) is
considered an outlier.
Recall that Minitab flags any observation with a
standardized residual that is larger than 2 (in
absolute value).

27
An outlier?
28
S = 4.711

      x        y    FITS1       HI1     s(e)    RESI1     SRES1
0.10000  -0.0716   3.4614  0.176297  4.27561  -3.5330  -0.82635
0.45401   4.1673   5.2446  0.157454  4.32424  -1.0774  -0.24916
1.09765   6.5703   8.4869  0.127014  4.40166  -1.9166  -0.43544
1.27936  13.8150   9.4022  0.119313  4.42103   4.4128   0.99818
2.20611  11.4501  14.0706  0.086145  4.50352  -2.6205  -0.58191
...
8.70156  46.5475  46.7904  0.140453  4.36765  -0.2429  -0.05561
9.16463  45.7762  49.1230  0.163492  4.30872  -3.3468  -0.77679
4.00000  40.0000  23.1070  0.050974  4.58936  16.8930   3.68110

Unusual Observations
Obs     x      y    Fit  SE Fit  Residual  St Resid
 21  4.00  40.00  23.11    1.06     16.89     3.68R

R denotes an observation with a large standardized residual.

29
Identifying influential data points
30
Identifying influential data points

Deleted residuals
Deleted t residuals
also called studentized deleted residuals
also called externally studentized residuals
Difference in fits, DFITS
Cook's distance measure

31
Basic idea of these four measures

Delete the observations one at a time, each time
refitting the regression model on the remaining
n-1 observations.
Compare the results using all n observations to
the results with the ith observation deleted to
see how much influence the observation has on the
analysis.
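
The delete-and-refit idea can be sketched as a short loop. The example below uses the four (x, y) pairs that appear on the deleted t residuals slide and compares the coefficients from the full fit with those obtained after dropping each observation in turn. The function names `fit_line` and `leave_one_out_fits` are illustrative, and the sketch covers simple linear regression only.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def leave_one_out_fits(x, y):
    """Refit the line n times, each time with one observation removed."""
    fits = []
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        fits.append(fit_line(x_rest, y_rest))
    return fits


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]

full = fit_line(x, y)
loo = leave_one_out_fits(x, y)

print(f"all n observations: b0 = {full[0]:.3f}, b1 = {full[1]:.3f}")
for i, (b0, b1) in enumerate(loo, start=1):
    print(f"without obs {i}:      b0 = {b0:.3f}, b1 = {b1:.3f}")
```

In this data set, dropping the observation at x = 10 changes the slope from about -0.13 to about 1.55, which is the kind of change the four measures are designed to detect.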

32
Deleted residuals
$y_i$ = the observed response for the ith observation
$\hat{y}_{(i)}$ = the predicted response for the ith observation
based on the estimated model with the ith
observation deleted
Deleted residual: $d_i = y_i - \hat{y}_{(i)}$
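
A minimal sketch of the deleted residual, computed literally by refitting without each observation and then predicting the omitted response. The data are the four (x, y) pairs from the deleted t residuals slide; the function names are illustrative and the code handles simple linear regression only.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def deleted_residuals(x, y):
    """d_i = y_i minus the prediction from the fit without observation i."""
    d = []
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        d.append(y[i] - (b0 + b1 * x[i]))
    return d


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
d = deleted_residuals(x, y)
for i, di in enumerate(d, start=1):
    print(f"obs {i}: deleted residual = {di:.4f}")
```

The fourth observation has an ordinary residual of only -0.42 but a deleted residual of about -14, because the full fit is pulled toward it. A standard identity that is not on the slides, d_i = e_i / (1 - h_i), gives the same values without refitting.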
34
Deleted t residuals
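A deleted t residual is just a standardized
deleted residual:
$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)} \, (1 - h_i)}}$
where $d_i$ is the deleted residual, $e_i$ is the ordinary residual,
$h_i$ is the leverage, and $MSE_{(i)}$ is the mean squared error of
the model fitted without the ith observation.
The deleted t residuals follow a t distribution
with ((n-1)-p) degrees of freedom.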
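
The sketch below computes deleted t residuals for simple linear regression without refitting. It relies on two standard results that are assumptions here, not content from the slides: the closed-form leverage, and the identity SSE(i) = SSE - e_i^2 / (1 - h_i) for the error sum of squares after deleting the ith observation. The function name `deleted_t_residuals` is illustrative.

```python
import math


def deleted_t_residuals(x, y, p=2):
    """Externally studentized residuals for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    sse = sum(ei ** 2 for ei in e)
    t = []
    for ei, hi in zip(e, h):
        # MSE of the fit without this observation, on (n - 1) - p df.
        mse_del = (sse - ei ** 2 / (1.0 - hi)) / (n - 1 - p)
        t.append(ei / math.sqrt(mse_del * (1.0 - hi)))
    return t


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
t = deleted_t_residuals(x, y)
for xi, ti in zip(x, t):
    print(f"x = {xi:4.1f}  TRES = {ti:.4f}")
```

Run on the four (x, y) pairs from the next slide, this should agree with the TRES1 values listed there, including the value near -19.8 for the point at x = 10.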
35
 x    y  RESI1     TRES1
 1  2.1  -1.59   -1.7431
 2  3.8   0.24    0.1217
 3  5.2   1.77    1.6361
10  2.1  -0.42  -19.7990
36
Row        x        y    RESI1     SRES1     TRES1
  1  0.10000  -0.0716  -3.5330  -0.82635  -0.81916
  2  0.45401   4.1673  -1.0774  -0.24916  -0.24291
  3  1.09765   6.5703  -1.9166  -0.43544  -0.42596
...
 19  8.70156  46.5475  -0.2429  -0.05561  -0.05413
 20  9.16463  45.7762  -3.3468  -0.77679  -0.76837
 21  4.00000  40.0000  16.8930   3.68110   6.69012
37
DFITS
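The difference in fits
$DFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)} \, h_i}}$
is the number of standard deviations that the
fitted value changes when the ith case is
omitted. Here $\hat{y}_i$ is the fitted value using all n
observations, $\hat{y}_{(i)}$ is the prediction from the model
fitted without the ith observation, $MSE_{(i)}$ is that model's
mean squared error, and $h_i$ is the leverage.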
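
A minimal sketch of DFITS for simple linear regression, computed by refitting without each observation. The function names `fit_line` and `dffits` are illustrative, the leverage uses the standard closed form for a single predictor (an assumption, not slide content), and the data are the four (x, y) pairs from the deleted t residuals slide.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def dffits(x, y, p=2):
    """Scaled change in each fitted value when that case is omitted."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    out = []
    for i in range(n):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        c0, c1 = fit_line(x_rest, y_rest)
        sse_del = sum((yj - (c0 + c1 * xj)) ** 2
                      for xj, yj in zip(x_rest, y_rest))
        mse_del = sse_del / (n - 1 - p)
        h_i = 1.0 / n + (x[i] - x_bar) ** 2 / sxx
        change = (b0 + b1 * x[i]) - (c0 + c1 * x[i])
        out.append(change / math.sqrt(mse_del * h_i))
    return out


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
values = dffits(x, y)
for i, v in enumerate(values, start=1):
    print(f"obs {i}: DFITS = {v:.3f}")
```

The point at x = 10 produces a DFITS value far larger in magnitude than the others. Note that n - 1 - p must be positive, so this sketch needs at least four observations.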
38
DFITS
An observation is deemed influential if the
absolute value of its DFITS value is
greater than 1 for small to medium data sets,
or if it just sticks out like a sore thumb.
39
    x      y     DFIT1
14.00  68.00  -1.23841
40
Row        x        y     DFIT1
  1   0.1000  -0.0716  -0.52503
  2   0.4540   4.1673  -0.08388
  3   1.0977   6.5703  -0.18232
  4   1.2794  13.8150   0.75898
  5   2.2061  11.4501  -0.21823
  6   2.5006  12.9554  -0.20155
  7   3.0403  20.1575   0.27774
  8   3.2358  17.5633  -0.08230
  9   4.4531  26.0317   0.13865
 10   4.1699  22.7573  -0.02221
 11   5.2847  26.3030  -0.18487
 12   5.5924  30.6885   0.05523
 13   5.9209  33.9402   0.19741
 14   6.6607  30.9228  -0.42449
 15   6.7995  34.1100  -0.17249
 16   7.9794  44.4536   0.29918
 17   8.4154  46.5022   0.30960
 18   8.7161  50.0568   0.63049
 19   8.7016  46.5475   0.14948
 20   9.1646  45.7762  -0.25094
 21  14.0000  68.0000  -1.23841
41
    x      y     DFIT2
13.00  15.00  -11.4670
42
Row        x        y     DFIT2
  1   0.1000  -0.0716   -0.4028
  2   0.4540   4.1673   -0.2438
  3   1.0977   6.5703   -0.2058
  4   1.2794  13.8150    0.0376
  5   2.2061  11.4501   -0.1314
  6   2.5006  12.9554   -0.1096
  7   3.0403  20.1575    0.0405
  8   3.2358  17.5633   -0.0424
  9   4.4531  26.0317    0.0602
 10   4.1699  22.7573    0.0092
 11   5.2847  26.3030    0.0054
 12   5.5924  30.6885    0.0782
 13   5.9209  33.9402    0.1278
 14   6.6607  30.9228    0.0072
 15   6.7995  34.1100    0.0731
 16   7.9794  44.4536    0.2805
 17   8.4154  46.5022    0.3236
 18   8.7161  50.0568    0.4361
 19   8.7016  46.5475    0.3089
 20   9.1646  45.7762    0.2492
 21  13.0000  15.0000  -11.4670
43
Cook's distance

Di depends on both residual ei and leverage hi.
Di summarizes how much all of the estimated beta
coefficients change when deleting the ith
observation.
A large Di indicates yi has a strong influence
on the estimated beta coefficients.
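
The dependence on both the residual and the leverage is visible in the usual closed form, $D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_i}{(1 - h_i)^2}$. That formula is a standard result assumed here; it does not survive in this transcript. The sketch below applies it to simple linear regression, with an illustrative function name `cooks_distance` and the four (x, y) pairs from the deleted t residuals slide.

```python
def cooks_distance(x, y, p=2):
    """Cook's distance for each observation in simple linear regression.

    Assumes D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    mse = sum(ei ** 2 for ei in e) / (n - p)
    return [ei ** 2 / (p * mse) * hi / (1.0 - hi) ** 2
            for ei, hi in zip(e, h)]


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
D = cooks_distance(x, y)
for i, di in enumerate(D, start=1):
    print(f"obs {i}: D = {di:.3f}")
```

The point at x = 10 has a small residual but a leverage of 0.97, so the factor h_i / (1 - h_i)^2 dominates and its Di is much larger than the rest.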

44
Cook's distance