Stat 6601 Project: Regression Diagnostics V - PowerPoint PPT Presentation

Provided by: kathy153
1
Stat 6601 Project: Regression Diagnostics (VR 6.3)
  • Presenters
  • Anthony Britto, Kathy Fung, Kai Koo

2
Basic Definition of Regression Diagnostics
  • An old robust method
  • Developed to measure and iteratively detect
    possibly wrong data and reject them through
    analysis of a globally fitted model

3
Regression Diagnostics
  • Goal
  • Detection of possibly wrong data through analysis
    of a globally fitted model.
  • Typical approach
  • (1) Determine an initial fitted model
  • (2) Compute the residuals
  • (3) Reject / identify outliers
  • (4) Rebuild the model or track the source of errors
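
The four steps above can be sketched in R roughly as follows. The data are made up for illustration, and the cutoff of 2 on the absolute Studentized residual is a common rule of thumb assumed here, not something the slides prescribe.

```r
# Hypothetical data: a linear trend with one corrupted response (obs. 8).
x <- 1:12
noise <- c(0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.3, 15, -0.3, 0.1, -0.2, 0.2)
y <- 2 + 3 * x + noise

fit0 <- lm(y ~ x)                  # (1) initial fitted model
t0 <- rstudent(fit0)               # (2) residuals (Studentized)
flagged <- which(abs(t0) > 2)      # (3) identify candidate outliers (assumed cutoff)
keep <- setdiff(seq_along(y), flagged)
fit1 <- lm(y ~ x, subset = keep)   # (4) rebuild the model without them

rbind(initial = coef(fit0), rebuilt = coef(fit1))
```

In practice step (4) may instead mean going back to the data source to find out why the flagged observation looks wrong, rather than dropping it.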

4
Influence and Leverage (1)
  • Influence: An observation is influential if the
    estimates change substantially when this
    observation is omitted.
  • Leverage: The "horizontal" distance of the
    x-value from the mean of x. The further from the
    mean, the more leverage an observation has.
  • y-discrepancy: The vertical distance between
    the observed y (yobs) and the predicted y (ypred).
  • Conceptual formula: Influence = Leverage ×
    y-Discrepancy

5
Influence and Leverage (2)
High influence point (5, 60): (x - mean of x)^2 = 830, yobs - ypred = 45
Low influence point (30, 105): (x - mean of x)^2 = 15, yobs - ypred = 45
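
The idea that the same y-discrepancy matters more at high leverage can be tried out with a small R sketch. The data below are hypothetical and do not reproduce the figure on this slide; `influence_of_extra` is a helper name introduced only for this illustration.

```r
# Hypothetical base data lying close to the line y = 10 + 2x.
x0 <- c(20, 24, 28, 30, 32, 36, 40)
y0 <- 10 + 2 * x0 + c(1, -1, 0.5, 0, -0.5, 1, -1)

# Add one extra point that sits 45 units above that line, either far from
# the mean of x (high leverage) or at the mean of x (low leverage).
influence_of_extra <- function(x_new) {
  x <- c(x0, x_new)
  y <- c(y0, 10 + 2 * x_new + 45)
  fit <- lm(y ~ x)
  k <- length(x)                      # index of the added point
  c(leverage = unname(hatvalues(fit)[k]),
    cooks_d  = unname(cooks.distance(fit)[k]),
    slope    = unname(coef(fit)[2]))
}

far  <- influence_of_extra(5)    # far from mean(x0) = 30
near <- influence_of_extra(30)   # at mean(x0)
rbind(far, near)
```

The added point has the same vertical offset in both cases, but only the far one should pull the fitted slope noticeably away from 2.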
6
Detecting Outliers
  • Distinguish between two types of outliers
  • 1st type: outliers in the response variable
    represent model failure. Such observations are
    called outliers.
  • 2nd type: outliers with respect to the predictors
    are called leverage points.
  • Both types can affect the regression model.
    Leverage points may almost uniquely determine the
    regression coefficients. They may also cause the
    standard errors of the regression coefficients to
    be much smaller than they would be if the
    observation were excluded.

7
Methods to detect outliers in R
  • Outliers in the predictors can often be detected
    by simply examining the distribution of the
    predictors.
  • Dot Plots
  • Stem-and-leaf plots
  • Box Plots
  • Histograms
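
A minimal sketch of these four displays in base R, using a made-up predictor whose last value is deliberately extreme:

```r
# Hypothetical predictor values; the last one is deliberately extreme.
x <- c(12, 15, 14, 13, 16, 15, 14, 13, 15, 14, 38)

stripchart(x, method = "stack", pch = 16, main = "Dot plot")
stem(x)                                    # stem-and-leaf plot (text output)
boxplot(x, horizontal = TRUE, main = "Box plot")
hist(x, main = "Histogram")

# boxplot.stats() returns the points that a box plot draws beyond the whiskers.
suspects <- boxplot.stats(x)$out
suspects
```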

8
Linear Model
Y = b0 + b1x1 + b2x2 + ... + bkxk + e
Matrix form: Y = Xb + e
(Y: response vector, X: design matrix, b: coefficient vector, e: error vector)
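
For reference, the standard least-squares quantities that follow from the matrix form are written out below. This is textbook material rather than a transcription of the slide; the hat notation for estimates is an assumption that matches the `b.hat` and `H` objects used in the later R example.

```latex
\hat{b} = (X^{\mathsf T} X)^{-1} X^{\mathsf T} Y, \qquad
\hat{Y} = X \hat{b} = H Y, \qquad
H = X (X^{\mathsf T} X)^{-1} X^{\mathsf T}, \qquad
\hat{e} = Y - \hat{Y} = (I - H) Y
```

The diagonal elements of the hat matrix H are the leverages of the individual observations.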
9
R Functions for Regression Diagnostics
Package  Function               Description
Base     plot(model)            Basic diagnostic plots
Base     ls.diag(lsfit(x,y))    Diagnostic tool
car      cr.plots(model)        Partial residual plots
car      av.plots(model)        Partial regression plots
car      hatvalues(model)       Hat values
car      outlier.test(model)    Test for largest residual
car      df.betas(model)        DfBetas measure of influence
car      cookd(model)           Cook's D measure of influence
car      rstudent(model)        Studentized residuals
car      vif(model)             VIF or GVIF for each term in the model
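
Several of the function names above come from the version of car in use when the slides were written. As a hedged sketch, the same kinds of diagnostics can be obtained from functions that ship with R itself (the stats package); the built-in `cars` data set is used here only as a stand-in for a real model.

```r
# 'cars' is a built-in data set used here only as a stand-in.
model <- lm(dist ~ speed, data = cars)

h  <- hatvalues(model)        # hat values (leverages)
rs <- rstudent(model)         # Studentized residuals
cd <- cooks.distance(model)   # Cook's D
db <- dfbetas(model)          # DFBETAS, one column per coefficient

# Rank the observations by Cook's D to see the most influential ones.
head(sort(cd, decreasing = TRUE), 3)

# The four basic diagnostic plots on one page.
op <- par(mfrow = c(2, 2))
plot(model)
par(op)
```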

10
R function for Robust Regression
Package  Function         Description
MASS     rlm(y ~ x)       M-estimation
lqs      ltsreg(y ~ x)    Least trimmed squares
lqs      lmsreg(y ~ x)    Least median of squares regression
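
A short sketch of how these fits might be compared with ordinary least squares. The data are hypothetical, and it is assumed that a current MASS installation is available, which provides `rlm()`, `ltsreg()` and `lmsreg()`.

```r
library(MASS)  # assumed to provide rlm(), ltsreg() and lmsreg()

# Hypothetical data: a line y = 1 + 2x with small noise and two gross outliers.
x <- 1:20
y <- 1 + 2 * x + rep(c(0.2, -0.1, 0.3, -0.3, -0.1), 4)
y[c(19, 20)] <- c(5, 0)

fit_ols <- lm(y ~ x)       # ordinary least squares, for comparison
fit_m   <- rlm(y ~ x)      # M-estimation (Huber psi by default)
set.seed(1)                # lqs-type fits may use random resampling
fit_lts <- ltsreg(y ~ x)   # least trimmed squares
fit_lms <- lmsreg(y ~ x)   # least median of squares

round(rbind(ols = coef(fit_ols), m = coef(fit_m),
            lts = coef(fit_lts), lms = coef(fit_lms)), 3)
```

Because the two outliers sit at the largest x values, the least-squares slope is dragged well below 2, while the trimmed fit should stay close to it.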

11
Example: Linear regression (one independent variable) 1
Matrix form: Y = Xb + e
R / S-plus script:
> xd <- c(rep(1,5),1,3,4,5,7)
> yd <- c(6,14,10,14,26)
> x <- matrix(xd,5,2, byrow=F)
> y <- matrix(yd,5,1, byrow=T)
> xtrp <- t(x)                      # Matrix transpose
> xxtrp <- xtrp %*% x               # Matrix multiplication
> inxxtrp <- solve(xxtrp)           # Matrix inverting
> b.hat <- inxxtrp %*% xtrp %*% y
> b.hat
     [,1]
[1,]    2
[2,]    3
> H <- x %*% inxxtrp %*% xtrp       # hat matrix
> H
      [,1] [,2] [,3] [,4]  [,5]
[1,]  0.65 0.35  0.2 0.05 -0.25
[2,]  0.35 0.25  0.2 0.15  0.05
[3,]  0.20 0.20  0.2 0.20  0.20
[4,]  0.05 0.15  0.2 0.25  0.35
[5,] -0.25 0.05  0.2 0.35  0.65
12
Example: Linear regression (one independent variable) 2
Extraction of leverages and predicted values
Leverage of the ith observation (hii) (for one
independent variable: n = # of obs., p = 1)
> n <- 5
> lev <- numeric(n)
> for (i in 1:n) lev[i] <- H[i,i]
> lev
[1] 0.65 0.25 0.20 0.25 0.65
> h <- lm.influence(lm(y~x))$hat
> h
[1] 0.65 0.25 0.20 0.25 0.65
> ls.diag(lsfit(x[,2],y))$hat
[1] 0.65 0.25 0.20 0.25 0.65
> y1.pred <- 0
> for (i in 1:n) y1.pred <- y1.pred + H[1,i] * y[i]
> y1.pred      # y1.pred (x1 = 1) = 3 (slope) + 2 (intercept)
[1] 5
hij = leverage of (xi, yi) if i = j
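
For one independent variable the leverage also has a well-known closed form, hii = 1/n + (xi - mean of x)^2 / sum of (xj - mean of x)^2. The slide's own formula did not survive in this transcript, so the sketch below assumes that standard form and checks it against R's `hatvalues()` using the x and y values of this example.

```r
xd <- c(1, 3, 4, 5, 7)      # x values from this example
yd <- c(6, 14, 10, 14, 26)  # y values from this example
n <- length(xd)

# Closed-form leverage for simple linear regression (assumed standard form).
lev_formula <- 1 / n + (xd - mean(xd))^2 / sum((xd - mean(xd))^2)

# Leverage as computed by R from the fitted model.
lev_lm <- unname(hatvalues(lm(yd ~ xd)))

cbind(lev_formula, lev_lm)
```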
13
Example linear regression (measurement of
residuals)
  • From y-discrepancy to influence

Raw residual value (y-discrepancy)
Standardized residual value (influence)
Studentized residual value (influence)
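
The formulas shown on this slide are not preserved in the transcript. The usual textbook definitions, which agree with the numbers computed on the later "Calculation of residual values" slide, are given here as an assumption about what was displayed (n observations, p predictors, leverage h_ii):

```latex
e_i = y_i - \hat{y}_i
\qquad\text{(raw residual)}

r_i = \frac{e_i}{s\,\sqrt{1 - h_{ii}}},
\quad s^2 = \frac{\sum_{j=1}^{n} e_j^2}{n - p - 1}
\qquad\text{(standardized residual)}

t_i = \frac{e_i}{s_{(i)}\,\sqrt{1 - h_{ii}}}
    = r_i \sqrt{\frac{n - p - 2}{n - p - 1 - r_i^2}}
\qquad\text{(Studentized residual)}
```

Here s_(i) is the residual standard deviation estimated from the fit with observation i left out.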
14
Influence, leverage and discrepancy
The influence of observations can be determined
by their residual values and leverages.
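
One common way to combine residual and leverage into a single influence number is Cook's distance. The sketch below uses the five data points from the one-variable example in these slides and checks a hand-computed value against R's `cooks.distance()`; presenting Cook's D at this point is an illustration choice, not necessarily what the original slide showed.

```r
# Data from the one-variable example in these slides.
xd <- c(1, 3, 4, 5, 7)
yd <- c(6, 14, 10, 14, 26)
fit <- lm(yd ~ xd)

h <- hatvalues(fit)       # leverage
r <- rstandard(fit)       # standardized residual (scaled y-discrepancy)
k <- length(coef(fit))    # number of estimated coefficients (here 2)

# Cook's D grows with both the squared residual and the leverage.
cook_manual <- (r^2 / k) * h / (1 - h)
cbind(leverage = h, std.res = r, cooks = cook_manual)
```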
15
Calculation of residual values
Do it by yourself in R:
> y.pred <- numeric(n)
> for (i in 1:n) for (j in 1:n) y.pred[i] <- y.pred[i] + H[i,j] * yd[j]
> res <- yd - y.pred
> Sy <- sqrt(sum(res^2)/(n-2))
> resstd <- res/(Sy*sqrt(1-lev))
> resstd
[1]  0.4413674  0.9045340 -1.1677484 -0.9045340  1.3241022
Using ls.diag to get residuals:
> ls.diag(lsfit(x[,2],y))$std.res     # standardized residuals
[1]  0.4413674  0.9045340 -1.1677484 -0.9045340  1.3241022
> ls.diag(lsfit(x[,2],y))$stud.res    # Studentized residuals
[1]  0.3726780  0.8660254 -1.2909944 -0.8660254  1.6770510
16
Example: Multiple regression
R / S-plus script:
> project.data <- read.csv("projdata.csv")
> model1 <- glm(log10price ~ elevation + date + flood + distance,
+               data = project.data)
> summary(model1)
R output:
Call:
glm(formula = log10price ~ elevation + date + flood + distance,
    data = project.data)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.22145  -0.09075  -0.04765   0.07475   0.43564

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.226620   0.092763  13.223 4.74e-13 ***
elevation    0.032394   0.007304   4.435 0.000149 ***
date         0.008065   0.001168   6.902 2.50e-07 ***
flood       -0.338254   0.087451  -3.868 0.000659 ***
distance     0.025659   0.007177   3.575 0.001401 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.02453675)

    Null deviance: 2.90725  on 30  degrees of freedom
Residual deviance: 0.63796  on 26  degrees of freedom
AIC: -20.414

Number of Fisher Scoring iterations: 2
17
Example: Multiple regression (measurement of
influence using R / S-plus)
[Figure: residual plot of Studentized residuals against observation number]
R / S-plus script:
# Measurement of influence
> y <- matrix(log10price,31,1, byrow=T)
> x <- matrix(c(elevation, date, flood, distance), 31,4,byrow=F)
> lesi <- ls.diag(lsfit(x,y))      # Regression diagnostics
> lesi$stud.res                    # Extraction of Studentized residuals
> plot(lesi$stud.res, ylab="Studentized residuals", xlab="obs ")
> lesi$cooks                       # Extraction of Cook's D
 [1] 1.392863e-02 3.528960e-01 8.396778e-02 1.518977e-01 1.390608e-01
 [6] 1.145438e-02 2.437453e-03 1.972966e-03 1.705327e-01 9.386767e-02
[11] 7.468621e-03 1.134031e-06 1.945352e-04 1.678359e-03 8.794873e-03
[16] 5.150404e-03 2.257051e-05 4.193730e-03 1.961141e-02 1.120336e-03
[21] 1.075247e-01 1.071167e-02 2.825819e-02 2.193734e-03 5.710213e-02
[26] 7.024345e-02 1.166287e-03 1.322331e-02 2.616666e-03 1.411050e-01
[31] 1.06727e-02
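
To turn a list of Cook's distances like this into a short list of observations worth a second look, one option is a simple cutoff. The sketch below re-enters the values printed on this slide and applies the 4/n rule of thumb; that cutoff is an assumption made here for illustration and is not stated in the slides.

```r
# Cook's distances as printed on this slide (observations 1 to 31).
cooks <- c(1.392863e-02, 3.528960e-01, 8.396778e-02, 1.518977e-01, 1.390608e-01,
           1.145438e-02, 2.437453e-03, 1.972966e-03, 1.705327e-01, 9.386767e-02,
           7.468621e-03, 1.134031e-06, 1.945352e-04, 1.678359e-03, 8.794873e-03,
           5.150404e-03, 2.257051e-05, 4.193730e-03, 1.961141e-02, 1.120336e-03,
           1.075247e-01, 1.071167e-02, 2.825819e-02, 2.193734e-03, 5.710213e-02,
           7.024345e-02, 1.166287e-03, 1.322331e-02, 2.616666e-03, 1.411050e-01,
           1.06727e-02)

n <- length(cooks)
cutoff <- 4 / n                    # assumed rule of thumb, not from the slides
flagged <- which(cooks > cutoff)   # observation numbers above the cutoff
flagged
```

With these values the cutoff is about 0.129 and it flags observations 2, 4, 5, 9 and 30, with observation 2 having by far the largest Cook's D.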
18
Example: Multiple regression (SAS)
SAS script:
data land(drop=county sewer);
  infile "c:\stat 6401\projdata.csv" delimiter=',' firstobs=2;
  input price county size elevation sewer date flood distance;
  log10price=log10(price);
run;

proc reg data=land;
  model log10price=elevation size date flood /r;
  plot rstudent.*log10price='*';
  output out=pred pred=phat;
  title 'linear regression for housing prices';
run;

Output:
The REG Procedure
Model: MODEL1
Dependent Variable: log10price

Analysis of Variance
                         Sum of       Mean
Source            DF    Squares     Square   F Value   Pr > F
Model              4    2.00013    0.50003     14.33   <.0001
Error             26    0.90712    0.03489
Corrected Total   30    2.90725

Root MSE          0.18679    R-Square   0.6880
Dependent Mean    0.98126    Adj R-Sq   0.6400
Coeff Var        19.03533

Parameter Estimates
                  Parameter     Standard
Variable    DF     Estimate        Error   t Value   Pr > |t|
Intercept    1      1.38737      0.09877     14.05     <.0001
size         1   0.00012958   0.00011481      1.13     0.2694
elevation    1      0.02820      0.00866      3.26     0.0031
flood        1     -0.23779      0.09837     -2.42     0.0229
date         1      0.00881      0.00150      5.88     <.0001

19
Example: Multiple regression (SAS)
Output Statistics
      Dep Var     Predicted  Std Error                 Std Error  Student   Cook's
Obs   log10price  Value      Mean Predict  Residual    Residual   Residual  D
  1   0.6532      0.7276     0.0698        -0.0744     0.140      -0.530    0.014
  2   1.0253      0.5897     0.0628         0.4356     0.144       3.036    0.353
  3   0.2304      0.3623     0.0850        -0.1319     0.132      -1.002    0.084
  4   0.6990      0.8682     0.0872        -0.1692     0.130      -1.301    0.152
  5   0.6990      0.5380     0.0875         0.1609     0.130       1.239    0.139
  6   0.5185      0.5978     0.0623        -0.0793     0.144      -0.552    0.011
  7   0.7559      0.8078     0.0474        -0.0519     0.149      -0.348    0.002
  8   0.7924      0.8400     0.0466        -0.0477     0.150      -0.319    0.002
  9   1.2878      1.3972     0.1082        -0.1094     0.113      -0.966    0.171
 10   0.5051      0.7266     0.0635        -0.2215     0.143      -1.546    0.094
 11   0.6721      0.7347     0.0634        -0.0626     0.143      -0.437    0.007
 12   0.8388      0.8399     0.0495        -0.001063   0.149      -0.0072   0.000
 13   0.9085      0.9256     0.0416        -0.0171     0.151      -0.113    0.000
 14   1.0645      1.1228     0.0364        -0.0584     0.152      -0.383    0.002
 15   1.2856      1.1825     0.0457         0.1031     0.150       0.688    0.009
 16   1.0682      1.1578     0.0409        -0.0896     0.151      -0.593    0.005
 17   1.1239      1.1305     0.0368        -0.006693   0.152      -0.0440   0.000
 18   1.1790      1.0709     0.0315         0.1081     0.153       0.704    0.004
 19   1.0934      1.2252     0.0519        -0.1317     0.148      -0.891    0.020
 20   1.1847      1.1382     0.0373         0.0464     0.152       0.305    0.001
 21   1.0864      1.2145     0.0920        -0.1281     0.127      -1.011    0.108
 22   1.2577      1.0865     0.0318         0.1712     0.153       1.116    0.011
 23   1.2253      1.0051     0.0393         0.2202     0.152       1.452    0.028
 24   0.7709      0.7985     0.0728        -0.0277     0.139      -0.200    0.002
 25   0.6021      0.7314     0.0769        -0.1293     0.136      -0.948    0.057
 26   1.5705      1.2588     0.0431         0.3117     0.151       2.070    0.070
 27   1.2601      1.2152     0.0391         0.0449     0.152       0.296    0.001
 28   1.1790      1.2709     0.0589        -0.0919     0.145      -0.633    0.013
 29   1.3598      1.4007     0.0590        -0.0408     0.145      -0.281    0.003
 30   1.1818      1.0540     0.0980         0.1279     0.122       1.047    0.141
 31   1.3404      1.4003     0.0737        -0.0598     0.138      -0.433    0.011
20
Example Multiple regression (SAS)
Residual plot Studentized Residual plot
21
Further studies for regression analysis
  • Analysis of models
  • Multicollinearity
  • Heteroscedasticity
  • Autocorrelation
  • Validation of models
  • Website of R functions for modern regression:
    http://socserv.socsci.mcmaster.ca/andersen/ICPSR/RFunctions.pdf

22
The End