Title: Stat 6601 Project: Regression Diagnostics (VR 6.3)
1. Stat 6601 Project: Regression Diagnostics (VR 6.3)
- Presenters: Anthony Britto, Kathy Fung, Kai Koo
2. Basic Definition of Regression Diagnostics
- A classical robust method
- Developed to detect possibly erroneous data points and reject them iteratively through analysis of the globally fitted model
3. Regression Diagnostics
- Goal
- Detection of possibly erroneous data through analysis of the globally fitted model.
- Typical approach (see the sketch below)
- (1) Determine an initial fitted model
- (2) Compute the residuals
- (3) Reject / identify outliers
- (4) Rebuild the model, or track down the source of the errors
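A minimal sketch of this cycle, assuming a data frame dat with response y and predictor x (hypothetical names); in practice flagged points should be investigated, not dropped automatically:
fit <- lm(y ~ x, data = dat)            # (1) initial fitted model
r   <- rstudent(fit)                    # (2) Studentized residuals
bad <- which(abs(r) > 2)                # (3) flag suspect observations
if (length(bad) > 0)                    # (4) rebuild the model without them
  fit <- lm(y ~ x, data = dat[-bad, ])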
4. Influence and Leverage (1)
- Influence: An observation is influential if the estimates change substantially when this observation is omitted.
- Leverage: The "horizontal" distance of the x-value from the mean of x. The further from the mean, the more leverage an observation has.
- y-discrepancy: The vertical distance between y_obs and y_pred.
- Conceptual formula: Influence = Leverage × y-Discrepancy
5. Influence and Leverage (2)
- High influence point (5, 60): (x − mean of x)² = 830; y_obs − y_pred = 45
- Low influence point (30, 105): (x − mean of x)² = 15; y_obs − y_pred = 45
- Both points have the same y-discrepancy (45), but the first lies much further from the mean of x, so it has far more leverage and hence far more influence.
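A sketch of how to rank the observations of any fitted lm object fit (hypothetical) by the conceptual product above; the statistic lev * |disc| mirrors the slide's formula and is illustrative, not a standard measure:
lev  <- hatvalues(fit)    # leverage, grows with (x_i - mean(x))^2
disc <- residuals(fit)    # y-discrepancy, y_obs - y_pred
data.frame(lev, disc)[order(-(lev * abs(disc))), ]   # most influential first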
6. Detecting Outliers
- Distinguish between two types of outliers:
- 1st type: outliers in the response variable represent model failure; such observations are called outliers.
- 2nd type: outliers with respect to the predictors are called leverage points.
- Both types can affect the regression model: they may almost uniquely determine the regression coefficients, and they may cause the standard errors of the regression coefficients to be much smaller than they would be if the observation were excluded (see the sketch below).
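A small simulation sketch (our own illustration, not from the slides) of how a single leverage point can shrink the slope's standard error:
set.seed(1)
x <- c(rnorm(20), 15)       # observation 21 lies far from mean(x)
y <- 2 + 3 * x + rnorm(21)
coef.se <- function(f) summary(f)$coefficients["x", "Std. Error"]
coef.se(lm(y ~ x))                  # SE of the slope with the leverage point
coef.se(lm(y ~ x, subset = -21))    # noticeably larger without it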
7. Methods to detect outliers in R
- Outliers in the predictors can often be detected by simply examining the distribution of the predictors, for example with:
- Dot plots
- Stem-and-leaf plots
- Box plots
- Histograms
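A minimal sketch of these displays, using an artificial predictor vector elev (hypothetical data):
elev <- c(rnorm(30, mean = 10, sd = 2), 25)   # one artificial extreme value
dotchart(elev)    # dot plot
stem(elev)        # stem-and-leaf plot
boxplot(elev)     # box plot
hist(elev)        # histogram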
8. Linear Model
Y = b0 + b1*x1 + b2*x2 + ... + bk*xk + e
Matrix form: Y = Xb + e
where Y is the n × 1 response vector, X the n × (k+1) design matrix, b the (k+1) × 1 coefficient vector, and e the n × 1 error vector.
9. R Functions for Regression Diagnostics
Package   Function                 Description
base      plot(model)              Basic diagnostic plots
base      ls.diag(lsfit(x, y))     Diagnostic tool
car       cr.plots(model)          Partial residual plots
car       av.plots(model)          Partial regression plots
car       hatvalues(model)         Hat values (leverages)
car       outlier.test(model)      Test for the largest residual
car       df.betas(model)          DFBETAS measure of influence
car       cookd(model)             Cook's D measure of influence
car       rstudent(model)          Studentized residuals
car       vif(model)               VIF or GVIF for each term in the model
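A minimal usage sketch, assuming some fitted object model <- lm(...) (hypothetical); note that newer versions of car rename several of these functions (cr.plots → crPlots, av.plots → avPlots, outlier.test → outlierTest):
par(mfrow = c(2, 2))
plot(model)                 # the basic diagnostic plots
hatvalues(model)            # leverages (diagonal of the hat matrix)
rstudent(model)             # Studentized residuals
cooks.distance(model)       # Cook's D, the base-R analogue of cookd()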
10. R Functions for Robust Regression
Package   Function          Description
MASS      rlm(y ~ x)        M-estimation
lqs       ltsreg(y ~ x)     Least-trimmed squares
lqs       lmsreg(y ~ x)     Least-median-of-squares regression
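A runnable sketch on the built-in stackloss data (our choice of example data, not from the slides); in current MASS the least-trimmed and least-median fits are both available through lqs():
library(MASS)                                                     # provides rlm() and lqs()
fit.m   <- rlm(stack.loss ~ ., data = stackloss)                  # M-estimation
fit.lts <- lqs(stack.loss ~ ., data = stackloss, method = "lts")  # least-trimmed squares
fit.lms <- lqs(stack.loss ~ ., data = stackloss, method = "lms")  # least-median-of-squares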
11. Example: Linear regression (one independent variable) 1
Matrix form: Y = Xb + e
R / S-plus script:
> xd <- c(rep(1,5), 1, 3, 4, 5, 7)
> yd <- c(6, 14, 10, 14, 26)
> x <- matrix(xd, 5, 2, byrow = F)
> y <- matrix(yd, 5, 1, byrow = T)
> xtrp <- t(x)                      # matrix transpose
> xxtrp <- xtrp %*% x               # matrix multiplication
> inxxtrp <- solve(xxtrp)           # matrix inverse
> b.hat <- inxxtrp %*% xtrp %*% y   # least-squares estimates
> b.hat
     [,1]
[1,]    2
[2,]    3
> H <- x %*% inxxtrp %*% xtrp       # hat matrix
> H
      [,1] [,2] [,3] [,4]  [,5]
[1,]  0.65 0.35  0.2 0.05 -0.25
[2,]  0.35 0.25  0.2 0.15  0.05
[3,]  0.20 0.20  0.2 0.20  0.20
[4,]  0.05 0.15  0.2 0.25  0.35
[5,] -0.25 0.05  0.2 0.35  0.65
12. Example: Linear regression (one independent variable) 2
Extraction of leverages and predicted values.
Leverage of the ith observation (for one independent variable; n = number of obs., p = 1):
h_ii = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2
> n <- 5
> lev <- numeric(n)
> for (i in 1:n) lev[i] <- H[i,i]
> lev
[1] 0.65 0.25 0.20 0.25 0.65
> h <- lm.influence(lm(y ~ x))$hat
> h
[1] 0.65 0.25 0.20 0.25 0.65
> ls.diag(lsfit(x[,2], y))$hat
[1] 0.65 0.25 0.20 0.25 0.65
> y1.pred <- 0
> for (i in 1:n) y1.pred <- y1.pred + H[1,i] * y[i]
> y1.pred    # y1.pred = 3 (slope) * x_1 (= 1) + 2 (intercept)
[1] 5
h_ij is the leverage of (x_i, y_i) when i = j.
13. Example: Linear regression (measurement of residuals)
- From y-discrepancy to influence (see the sketch below):
- Raw residual (y-discrepancy): e_i = y_i - y_i.pred
- Standardized residual (influence): e_i / (S * sqrt(1 - h_ii)), where S = sqrt(sum(e_j^2)/(n-2)) for one predictor
- Studentized residual (influence): e_i / (S_(i) * sqrt(1 - h_ii)), where S_(i) is computed with observation i left out
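These quantities can also be obtained directly in R; a sketch using the objects from slide 11 (rstandard() and rstudent() should reproduce the ls.diag values shown on slide 15):
fit <- lm(yd ~ x[, 2])   # the same line fitted on slide 11
residuals(fit)           # raw residuals (y-discrepancy)
rstandard(fit)           # standardized residuals
rstudent(fit)            # (externally) Studentized residuals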
14. Influence, leverage and discrepancy
The influence of observations can be determined
by their residual values and leverages.
15. Calculation of residual values
Do it yourself in R:
> y.pred <- numeric(n)
> for (i in 1:n) for (j in 1:n) y.pred[i] <- y.pred[i] + H[i,j] * yd[j]
> res <- yd - y.pred
> Sy <- sqrt(sum(res^2)/(n-2))
> resstd <- res/(Sy * sqrt(1 - lev))
> resstd
[1]  0.4413674  0.9045340 -1.1677484 -0.9045340  1.3241022
Using ls.diag to get the residuals:
> ls.diag(lsfit(x[,2], y))$std.res    # standardized residuals
[1]  0.4413674  0.9045340 -1.1677484 -0.9045340  1.3241022
> ls.diag(lsfit(x[,2], y))$stud.res   # Studentized residuals
[1]  0.3726780  0.8660254 -1.2909944 -0.8660254  1.6770510
16. Example: Multiple regression
R / S-plus script:
project.data <- read.csv("projdata.csv")
model1 <- glm(log10price ~ elevation + date + flood + distance,
              data = project.data)
summary(model1)
R output:
Call:
glm(formula = log10price ~ elevation + date + flood + distance,
    data = project.data)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.22145  -0.09075  -0.04765   0.07475   0.43564

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.226620   0.092763  13.223 4.74e-13 ***
elevation    0.032394   0.007304   4.435 0.000149 ***
date         0.008065   0.001168   6.902 2.50e-07 ***
flood       -0.338254   0.087451  -3.868 0.000659 ***
distance     0.025659   0.007177   3.575 0.001401 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.02453675)

    Null deviance: 2.90725  on 30  degrees of freedom
Residual deviance: 0.63796  on 26  degrees of freedom
AIC: -20.414

Number of Fisher Scoring iterations: 2
17. Example: Multiple regression (measurement of influence using R / S-plus)
R / S-plus script (measurement of influence):
y <- matrix(log10price, 31, 1, byrow = T)
x <- matrix(c(elevation, date, flood, distance), 31, 4, byrow = F)
lesi <- ls.diag(lsfit(x, y))   # regression diagnostics
lesi$stud.res                  # extraction of Studentized residuals
plot(lesi$stud.res, ylab = "Studentized residuals", xlab = "obs")
lesi$cooks                     # extraction of Cook's distances
 [1] 1.392863e-02 3.528960e-01 8.396778e-02 1.518977e-01 1.390608e-01
 [6] 1.145438e-02 2.437453e-03 1.972966e-03 1.705327e-01 9.386767e-02
[11] 7.468621e-03 1.134031e-06 1.945352e-04 1.678359e-03 8.794873e-03
[16] 5.150404e-03 2.257051e-05 4.193730e-03 1.961141e-02 1.120336e-03
[21] 1.075247e-01 1.071167e-02 2.825819e-02 2.193734e-03 5.710213e-02
[26] 7.024345e-02 1.166287e-03 1.322331e-02 2.616666e-03 1.411050e-01
[31] 1.06727e-02
[Figure: residual plot of the Studentized residuals against observation number]
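A sketch that flags large Cook's distances with the common 4/n rule of thumb (the cutoff is a convention we assume, not one given in the slides):
n <- 31
which(lesi$cooks > 4/n)   # here obs 2, 4, 5, 9 and 30 exceed the cutoff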
18. Example: Multiple regression (SAS)
SAS script:
data land (drop=county sewer);
  infile "c:\stat 6401\projdata.csv" delimiter=',' firstobs=2;
  input price county size elevation sewer date flood distance;
  log10price = log10(price);
run;

proc reg data=land;
  model log10price = elevation size date flood / r;
  plot rstudent.*log10price='*';
  output out=pred p=phat;
  title 'linear regression for housing prices';
run;

Output:
The REG Procedure
Model: MODEL1
Dependent Variable: log10price

                 Analysis of Variance

                           Sum of       Mean
Source             DF     Squares     Square   F Value   Pr > F
Model               4     2.00013    0.50003     14.33   <.0001
Error              26     0.90712    0.03489
Corrected Total    30     2.90725

Root MSE          0.18679   R-Square   0.6880
Dependent Mean    0.98126   Adj R-Sq   0.6400
Coeff Var        19.03533

                 Parameter Estimates

                  Parameter     Standard
Variable    DF     Estimate        Error   t Value   Pr > |t|
Intercept    1      1.38737      0.09877     14.05     <.0001
size         1   0.00012958   0.00011481      1.13     0.2694
elevation    1      0.02820      0.00866      3.26     0.0031
flood        1     -0.23779      0.09837     -2.42     0.0229
date         1      0.00881      0.00150      5.88     <.0001
19. Example: Multiple regression (SAS)
Output Statistics:

       Dep Var  Predicted    Std Error               Std Error   Student   Cook's
 Obs  log10price    Value  Mean Predict   Residual    Residual  Residual        D
   1     0.6532   0.7276      0.0698      -0.0744       0.140    -0.530    0.014
   2     1.0253   0.5897      0.0628       0.4356       0.144     3.036    0.353
   3     0.2304   0.3623      0.0850      -0.1319       0.132    -1.002    0.084
   4     0.6990   0.8682      0.0872      -0.1692       0.130    -1.301    0.152
   5     0.6990   0.5380      0.0875       0.1609       0.130     1.239    0.139
   6     0.5185   0.5978      0.0623      -0.0793       0.144    -0.552    0.011
   7     0.7559   0.8078      0.0474      -0.0519       0.149    -0.348    0.002
   8     0.7924   0.8400      0.0466      -0.0477       0.150    -0.319    0.002
   9     1.2878   1.3972      0.1082      -0.1094       0.113    -0.966    0.171
  10     0.5051   0.7266      0.0635      -0.2215       0.143    -1.546    0.094
  11     0.6721   0.7347      0.0634      -0.0626       0.143    -0.437    0.007
  12     0.8388   0.8399      0.0495     -0.001063      0.149    -0.0072   0.000
  13     0.9085   0.9256      0.0416      -0.0171       0.151    -0.113    0.000
  14     1.0645   1.1228      0.0364      -0.0584       0.152    -0.383    0.002
  15     1.2856   1.1825      0.0457       0.1031       0.150     0.688    0.009
  16     1.0682   1.1578      0.0409      -0.0896       0.151    -0.593    0.005
  17     1.1239   1.1305      0.0368     -0.006693      0.152    -0.0440   0.000
  18     1.1790   1.0709      0.0315       0.1081       0.153     0.704    0.004
  19     1.0934   1.2252      0.0519      -0.1317       0.148    -0.891    0.020
  20     1.1847   1.1382      0.0373       0.0464       0.152     0.305    0.001
  21     1.0864   1.2145      0.0920      -0.1281       0.127    -1.011    0.108
  22     1.2577   1.0865      0.0318       0.1712       0.153     1.116    0.011
  23     1.2253   1.0051      0.0393       0.2202       0.152     1.452    0.028
  24     0.7709   0.7985      0.0728      -0.0277       0.139    -0.200    0.002
  25     0.6021   0.7314      0.0769      -0.1293       0.136    -0.948    0.057
  26     1.5705   1.2588      0.0431       0.3117       0.151     2.070    0.070
  27     1.2601   1.2152      0.0391       0.0449       0.152     0.296    0.001
  28     1.1790   1.2709      0.0589      -0.0919       0.145    -0.633    0.013
  29     1.3598   1.4007      0.0590      -0.0408       0.145    -0.281    0.003
  30     1.1818   1.0540      0.0980       0.1279       0.122     1.047    0.141
  31     1.3404   1.4003      0.0737      -0.0598       0.138    -0.433    0.011
20. Example: Multiple regression (SAS)
[Figures: residual plot and Studentized residual plot]
21. Further studies for regression analysis
- Analysis of models (see the sketch below)
- Multicollinearity
- Heteroscedasticity
- Autocorrelation
- Validation of models
- Website of R functions for modern regression: http://socserv.socsci.mcmaster.ca/andersen/ICPSR/RFunctions.pdf
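A hedged sketch of these checks on the housing-price model: vif() is from the car package listed earlier; bptest() and dwtest() come from the lmtest package, which is our assumption and not mentioned in the slides.
m <- lm(log10price ~ elevation + date + flood + distance, data = project.data)
library(car)      # vif()
library(lmtest)   # bptest(), dwtest() -- assumed add-on package
vif(m)       # multicollinearity: variance inflation factors
bptest(m)    # heteroscedasticity: Breusch-Pagan test
dwtest(m)    # autocorrelation: Durbin-Watson test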
22. The End