1
Chapter 6 Diagnostics for Leverage and Influence
  • Ray-Bing Chen
  • Institute of Statistics
  • National University of Kaohsiung

2
6.1 Importance of Detecting Influential Observations
  • We usually assume equal weights for the
    observations; the sample mean, for example, weights
    each observation equally.
  • In Section 2.7 we saw that the location of
    observations in x-space can play an important role
    in determining the regression coefficients (see
    Figures 2.6 and 2.7).
  • Outliers are observations that have unusual y
    values.
  • In Section 4.4 we saw that outliers can be
    identified from the residuals.

3
  • See Figure 6.1

4
  • The point A in Figure 6.1 is called a leverage
    point.
  • A leverage point has an unusual x value and may
    control certain model properties.
  • Such a point does not affect the estimates of the
    regression coefficients, but it certainly has a
    dramatic effect on the model summary statistics,
    such as R2 and the standard errors of the
    regression coefficients.

5
  • See the point A in Figure 6.2

6
  • Influence point:
  • The point A in Figure 6.2 has a moderately unusual
    x coordinate, and its y value is unusual as well.
  • An influence point has a noticeable impact on the
    model coefficients in that it pulls the regression
    model in its direction.
  • Sometimes we find that a small subset of the data
    exerts a disproportionate influence on the model
    coefficients and properties.
  • In the extreme case, the parameter estimates may
    depend more on the influential subset of points
    than on the majority of the data.

7
  • We would like a regression model to be
    representative of all of the sample observations,
    not an artifact of a few.
  • If the influential points are bad values, they
    should be eliminated from the sample.
  • If they are not bad values, there may be nothing
    wrong with these points, but if they control key
    model properties we would like to know it, since
    that could affect the end use of the regression
    model.
  • Here we present several diagnostics for leverage
    and influence. It is important to use these
    diagnostics in conjunction with the residual
    analysis techniques of Chapter 4.

8
6.2 Leverage
  • The location of points in x-space is potentially
    important in determining the properties of the
    regression model.
  • In particular, remote points potentially have a
    disproportionate impact on the parameter estimates,
    standard errors, predicted values, and model
    summary statistics.

9
  • The hat matrix plays an important role in
    identifying influential observations:
  • H = X(X'X)^{-1}X'
  • H determines the variances and covariances of the
    fitted values and the residuals e.
  • The element hij of H may be interpreted as the
    amount of leverage exerted by the jth observation
    yj on the ith fitted value (the hat diagonals hii
    are written out below).
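
For reference, the hat diagonal and the 2p/n rule of thumb used on
the next slides take their standard form; this is a reconstruction,
since the original formula slide was not transcribed:

\[
h_{ii} = \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i,
\qquad
\frac{1}{n}\sum_{i=1}^{n} h_{ii}
  = \frac{\operatorname{tr}(\mathbf{H})}{n}
  = \frac{p}{n},
\]

so twice the average hat diagonal, 2p/n, serves as the traditional
cutoff: observations with h_{ii} > 2p/n are flagged as leverage
points.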

11
  • The point A in Figure 6.1 will have a large hat
    diagonal and is assuredly a leverage point, but it
    has almost no effect on the regression coefficients
    because it lies almost on the line passing through
    the remaining observations. (The hat diagonals
    examine only the location of the observation in
    x-space.)
  • Observations with large hat diagonals and large
    residuals are likely to be influential (a
    computational sketch of the hat-diagonal check
    follows this list).
  • If 2p/n > 1, then the cutoff value does not apply.
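
A minimal NumPy sketch of this check, assuming a design matrix X
whose first column is the intercept; the data below are synthetic
placeholders, not the delivery time data:

    import numpy as np

    def hat_diagonals(X):
        """Diagonal of the hat matrix H = X (X'X)^{-1} X'."""
        XtX_inv = np.linalg.inv(X.T @ X)
        return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

    # Synthetic stand-in data: intercept plus two regressors (n = 25, p = 3).
    rng = np.random.default_rng(0)
    n, p = 25, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

    h = hat_diagonals(X)
    cutoff = 2 * p / n              # rule of thumb; does not apply if 2p/n > 1
    print(np.where(h > cutoff)[0])  # indices flagged as leverage points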

12
  • Example 6.1 The Delivery Time Data
  • In Example 3.1, p = 3 and n = 25, so the cutoff
    value is 2p/n = 0.24. That is, if hii exceeds 0.24,
    the ith observation is a leverage point.
  • Observations 9 and 22 are leverage points.
  • See Figure 3.4 (the matrix of scatterplots),
    Figure 3.11, and Table 4.1 (the studentized
    residuals and R-student).
  • The corresponding residuals for observation 22 are
    not unusually large, which indicates that
    observation 22 has little influence on the fitted
    model.

13
  • Both scaled residuals for observation 9 are
    moderately large, suggesting that this observation
    may have moderate influence on the model.

15
6.3 Measures of Influence: Cook's D
  • It is desirable to consider both the location of
    the point in x-space and the response variable in
    measuring influence.
  • Cook (1977, 1979) suggested using a measure of the
    squared distance between the least-squares estimate
    based on all n points and the estimate obtained by
    deleting the ith point (the usual form is given
    after this list).
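
In its usual form (a reconstruction, since the formula slide itself
was not transcribed), Cook's distance measure is

\[
D_i = \frac{(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})'\,
            \mathbf{X}'\mathbf{X}\,
            (\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})}
           {p\,MS_{Res}},
\qquad i = 1, 2, \ldots, n,
\]

where \(\hat{\boldsymbol{\beta}}_{(i)}\) is the least-squares
estimate computed with the ith observation deleted.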

16
  • Points with large values of Di have considerable
    influence on the least-squares estimates.
  • The magnitude of Di is usually assessed by
    comparing it to F(α, p, n-p).
  • If Di = F(0.5, p, n-p), then deleting point i would
    move the least-squares estimate to the boundary of
    an approximate 50% confidence region for β based on
    the complete data set.

17
  • Such a large displacement indicates that the
    least-squares estimate is sensitive to the ith data
    point.
  • Since F(0.5, p, n-p) ≈ 1, points with Di > 1 are
    usually considered influential.
  • The Di statistic may be rewritten as
    Di = (ri^2 / p) · hii / (1 - hii),
    where ri is the ith studentized residual.
  • Thus Di is proportional to the product of the
    square of the ith studentized residual and
    hii / (1 - hii).
  • This ratio can be shown to be the distance from the
    vector xi to the centroid of the remaining data.
  • So Di is made up of a component that reflects how
    well the model fits the ith observation yi and a
    component that measures how far that point is from
    the rest of the data.

18
  • Either component (or both) may contribute to a
    large value of Di.
  • Di combines residual magnitude for the ith
    observation and the location of that point in
    x-space to assess influence.
  • Because ŷ = Xβ̂, another way to write Cook's
    distance measure is
    Di = (ŷ(i) - ŷ)'(ŷ(i) - ŷ) / (p · MSRes).
  • Thus Di is also the squared distance that the
    vector of fitted values moves when the ith
    observation is deleted (a computational sketch
    follows this list).
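
A minimal NumPy sketch of Cook's distance using the
studentized-residual form above (synthetic stand-in data again;
X is assumed to include an intercept column):

    import numpy as np

    def cooks_distance(X, y):
        """D_i = (r_i^2 / p) * h_ii / (1 - h_ii), where r_i is the
        internally studentized residual; algebraically this equals
        the deletion form of Cook's distance."""
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # hat diagonals
        beta = XtX_inv @ X.T @ y
        e = y - X @ beta
        ms_res = e @ e / (n - p)
        r = e / np.sqrt(ms_res * (1.0 - h))           # studentized residuals
        return (r**2 / p) * h / (1.0 - h)

    rng = np.random.default_rng(1)
    n, p = 25, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    y = X @ np.array([2.0, 1.5, -0.5]) + rng.normal(scale=0.5, size=n)
    D = cooks_distance(X, y)
    print(np.where(D > 1.0)[0])   # D_i > 1 (about F(0.5, p, n-p)) is the usual flag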

19
  • Example 6.2 The Delivery Time Data
  • Column (b) of Table 6.1 contains the values of
    Cook's distance measure for the soft drink delivery
    time data.

22
6.4 Measures of Influence: DFFITS and DFBETAS
  • Cook's D is a deletion diagnostic.
  • Belsley, Kuh, and Welsch (1980) introduced two
    useful measures of deletion influence.
  • The first, DFBETAS, indicates how much the jth
    regression coefficient changes, in standard
    deviation units, if the ith observation is deleted
    (see the formula after this list).
  • Here Cjj is the jth diagonal element of (X'X)^{-1}.
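
In its standard form (a reconstruction; the formula was not
transcribed),

\[
DFBETAS_{j,i}
  = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}
         {\sqrt{S_{(i)}^2\,C_{jj}}},
\]

where \(\hat{\beta}_{j(i)}\) is the jth coefficient computed without
the ith observation and \(S_{(i)}^2\) is the residual mean square
from that fit.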

23
  • A large value of DFBETASj,i indicates that
    observation i has considerable influence on the jth
    regression coefficient.
  • Define R = (X'X)^{-1}X'.
  • The n elements in the jth row of R produce the
    leverage that the n observations in the sample have
    on the estimate of the jth coefficient.

24
  • DFBETASj,i measures both leverage and the effect of
    a large residual.
  • Cutoff value: 2/√n.
  • That is, if |DFBETASj,i| > 2/√n, then the ith
    observation warrants examination.
  • The second measure, DFFITS, assesses the deletion
    influence of the ith observation on the predicted
    or fitted value (see the formula after this list).
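
In its standard form (again a reconstruction),

\[
DFFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}
                {\sqrt{S_{(i)}^2\,h_{ii}}},
\]

where \(\hat{y}_{(i)}\) is the fitted value for the ith observation
computed without using that observation.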

25
  • DFFITSi is the number of standard deviations that
    the fitted value changes if observation i is
    removed.
  • DFFITSi is also affected by both leverage and
    prediction error.
  • Cutoff value: 2√(p/n). That is, if |DFFITSi| >
    2√(p/n), the ith observation warrants attention (a
    computational sketch follows this list).
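
A brute-force NumPy sketch that computes DFBETAS and DFFITS directly
from their deletion definitions by refitting the model without each
observation in turn (not optimized; the usual closed-form identities
give the same values):

    import numpy as np

    def dfbetas_dffits(X, y):
        """DFBETAS_{j,i} and DFFITS_i via explicit leave-one-out refits."""
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        C = np.diag(XtX_inv)                          # C_jj
        h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # hat diagonals
        beta = XtX_inv @ X.T @ y
        dfbetas = np.zeros((p, n))
        dffits = np.zeros(n)
        for i in range(n):
            keep = np.arange(n) != i
            Xi, yi = X[keep], y[keep]
            beta_i = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
            resid_i = yi - Xi @ beta_i
            s2_i = resid_i @ resid_i / (n - 1 - p)    # S^2_(i)
            dfbetas[:, i] = (beta - beta_i) / np.sqrt(s2_i * C)
            dffits[i] = X[i] @ (beta - beta_i) / np.sqrt(s2_i * h[i])
        return dfbetas, dffits

    # Rules of thumb: |DFBETAS_{j,i}| > 2/sqrt(n), |DFFITS_i| > 2*sqrt(p/n).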

28
6.5 A Measure of Model Performance
  • The diagnostics Di, DFBETASj,i, and DFFITSi provide
    insight about the effect of observations on the
    estimated coefficients and the fitted values.
  • They give no information about the overall
    precision of estimation.
  • A convenient scalar measure of precision is the
    generalized variance of β̂,
    GV(β̂) = |Var(β̂)| = |σ^2 (X'X)^{-1}|.

29
  • To express the role of the ith observation on the
    precision of estimation, we define COVRATIOi as the
    ratio of the generalized variances with and without
    the ith observation (the standard form is given
    after this list).
  • If COVRATIOi > 1, then the ith observation improves
    the precision of estimation.
  • If COVRATIOi < 1, inclusion of the ith point
    degrades precision.
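
In its standard form (a reconstruction of the untranscribed
formula),

\[
COVRATIO_i
  = \frac{\bigl|\,(\mathbf{X}_{(i)}'\mathbf{X}_{(i)})^{-1} S_{(i)}^2\,\bigr|}
         {\bigl|\,(\mathbf{X}'\mathbf{X})^{-1} MS_{Res}\,\bigr|}
  = \left(\frac{S_{(i)}^2}{MS_{Res}}\right)^{\!p} \frac{1}{1 - h_{ii}},
\]

which makes the 1 / (1 - hii) factor mentioned on the next slide
explicit.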

30
  • Because of the 1 / (1 - hii) factor, a
    high-leverage point will tend to make COVRATIOi
    large.
  • The ith point is considered influential if
    COVRATIOi > 1 + 3p/n or COVRATIOi < 1 - 3p/n (a
    computational sketch follows this list).
  • These cutoffs are recommended only for large
    samples.
  • Example 6.4 The Delivery Time Data
  • With p = 3 and n = 25, the cutoffs are
    1 + 3p/n = 1.36 and 1 - 3p/n = 0.64.
  • Observations 9 and 22 are influential.
  • Obs. 9 degrades the precision of estimation.
  • The influence of Obs. 22 is fairly small.
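
A minimal NumPy sketch of COVRATIO using the closed form above,
assuming X and y as in the earlier sketches:

    import numpy as np

    def covratio(X, y):
        """COVRATIO_i = (S^2_(i) / MS_Res)^p / (1 - h_ii)."""
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # hat diagonals
        e = y - X @ (XtX_inv @ X.T @ y)               # residuals
        ms_res = e @ e / (n - p)
        # Leave-one-out residual variance via the standard identity
        # (n - p) MS_Res = (n - p - 1) S^2_(i) + e_i^2 / (1 - h_ii)
        s2_i = ((n - p) * ms_res - e**2 / (1.0 - h)) / (n - p - 1)
        return (s2_i / ms_res) ** p / (1.0 - h)

    # Flag the ith point if COVRATIO_i lies outside (1 - 3p/n, 1 + 3p/n).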

31
6.6 Detecting Groups of Influential Observations
  • The methods above focus on single-observation
    deletion diagnostics for influence and leverage.
  • These single-observation diagnostics can be
    extended to the multiple-observation case.
  • Extend Cook's distance measure:
  • Let i denote the m × 1 vector of indices specifying
    the m points to be deleted, and define
    Di = (β̂(i) - β̂)'X'X(β̂(i) - β̂) / (p · MSRes),
    where β̂(i) is the least-squares estimate with the
    m points in i deleted.

32
  • Di is a multiple-observation version of Cook's
    distance measure.
  • A large value of Di indicates that the set of m
    points is influential (a computational sketch
    follows this list).
  • In some data sets, subsets of points are jointly
    influential even though the individual points are
    not!
  • Sebert, Montgomery, and Rollier (1998) investigate
    the use of cluster analysis (a single-linkage
    clustering procedure) to find sets of influential
    observations in regression.
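
A minimal NumPy sketch of the subset-deletion version of Cook's
distance, assuming X and y as in the earlier sketches; the index set
in the final comment is only illustrative:

    import numpy as np

    def group_cooks_distance(X, y, idx):
        """Cook's distance for deleting the group of observations in idx:
        D = (beta_del - beta)' X'X (beta_del - beta) / (p * MS_Res)."""
        n, p = X.shape
        beta = np.linalg.solve(X.T @ X, X.T @ y)
        e = y - X @ beta
        ms_res = e @ e / (n - p)
        keep = np.ones(n, dtype=bool)
        keep[np.asarray(idx)] = False
        beta_del = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
        d = beta_del - beta
        return float(d @ (X.T @ X) @ d) / (p * ms_res)

    # e.g. group_cooks_distance(X, y, [8, 21]) deletes observations 9 and 22
    # (0-based indices).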

33
6.7 Treatment of Influential Observations
  • Diagnostics for leverage and influence are an
    important part of the regression model-builder's
    arsenal of tools.
  • They offer the analyst insight about the data and
    signal which observations may deserve more
    scrutiny.
  • Should influential observations ever be discarded?
  • A compromise between deleting an observation and
    retaining it is to consider using an estimation
    technique that is not affected as severely by
    influential points as least squares.