Transcript and Presenter's Notes

Title: Stat 112: Lecture 16 Notes


1
Stat 112 Lecture 16 Notes
  • Finish Chapter 6
  • Influential Points for Multiple Regression
    (Section 6.7)
  • Assessing the Independence Assumption and
    Remedies for Its Violation (Section 6.8)
  • Homework 5 due next Thursday. I will e-mail it
    tonight.
  • Please let me know of any ideas you want to
    discuss for the final project.

2
Multiple regression, modeling and outliers,
leverage and influential points: Pollution Example
  • Data set pollution.JMP provides information about
    the relationship between pollution and mortality
    for 60 cities between 1959-1961.
  • The variables are
  • MORT (y): total age-adjusted mortality in deaths
    per 100,000 population
  • PRECIP: mean annual precipitation (in inches)
  • EDUC: median number of school years completed for
    persons 25 and older
  • NONWHITE: percentage of the 1960 population that is
    nonwhite
  • NOX: relative pollution potential of NOx (related
    to the tons of NOx emitted per day per square
    kilometer)
  • SO2: relative pollution potential of SO2
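A minimal sketch, outside of JMP, of loading these variables for the later
examples. The file name pollution.csv is a hypothetical CSV export of
pollution.JMP; the column names follow the list above.

    # Load the pollution data into a pandas DataFrame (hypothetical CSV
    # export of pollution.JMP).
    import pandas as pd

    df = pd.read_csv("pollution.csv")
    print(df[["MORT", "PRECIP", "EDUC", "NONWHITE", "NOX", "SO2"]].describe())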

3
Multiple Regression Steps in Analysis
  • Preliminaries: Define the question of interest.
    Review the design of the study. Correct errors
    in the data.
  • Explore the data: Use graphical tools, e.g., a
    scatterplot matrix; consider transformations of
    explanatory variables; fit a tentative model;
    check for outliers and influential points.
  • Formulate an inferential model: Word the
    questions of interest in terms of model
    parameters.

4
Multiple Regression Steps in Analysis Continued
  4. Check the model. (a) Check the model assumptions
    of linearity, constant variance, and normality. (b)
    If needed, return to step 2 and make changes to
    the model (such as transformations or adding
    terms for interaction and curvature). (c) Drop
    variables from the model that are not of central
    interest and are not significant.
  5. Infer the answers to the questions of interest
    using appropriate inferential tools (e.g.,
    confidence intervals, hypothesis tests,
    prediction intervals).
  6. Presentation: Communicate the results to the
    intended audience.

5
Scatterplot Matrix
  • Before fitting a multiple linear regression
    model, it is a good idea to make scatterplots of
    the response variable versus each explanatory
    variable. These can suggest transformations of
    the explanatory variables that need to be done as
    well as potential outliers and influential
    points.
  • Scatterplot matrix in JMP: Click Analyze,
    Multivariate Methods, Multivariate, and then put
    the response variable first in the Y, Columns box,
    followed by the explanatory variables in the same
    Y, Columns box. (A sketch of an equivalent plot
    outside JMP follows.)
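A minimal sketch of an equivalent scatterplot matrix outside of JMP, assuming
the DataFrame df from the loading sketch above.

    # Scatterplot matrix of the response (MORT) and the explanatory variables,
    # analogous to JMP's Multivariate scatterplot matrix.
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    cols = ["MORT", "PRECIP", "EDUC", "NONWHITE", "NOX", "SO2"]
    scatter_matrix(df[cols], figsize=(8, 8), diagonal="hist")
    plt.show()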

6
Scatterplot Matrix
7
Crunched Variables
  • When an X variable is crunched meaning that
    most of its values are crunched together and a
    few are far apart there will be influential
    points. To reduce the effects of crunching, it
    is a good idea to transform the variable to log
    of the variable.

8
  • 2. a) From the scatterplot of MORT vs. NOX we
    see that the NOX values are crunched very tightly.
    A log transformation of NOX is needed.
  • b) The curvature in MORT vs. SO2 indicates a log
    transformation of SO2 may be suitable.
  • After the two transformations we have the
    correlations shown on the next slide (a sketch of
    this step follows).
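A minimal sketch of the two log transformations and the resulting correlation
matrix, assuming the DataFrame df from the loading sketch above.

    # Log-transform the crunched/curved variables and recompute correlations.
    import numpy as np

    df["logNOX"] = np.log(df["NOX"])
    df["logSO2"] = np.log(df["SO2"])
    print(df[["MORT", "PRECIP", "EDUC", "NONWHITE", "logNOX", "logSO2"]].corr())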

9
(Correlation table shown on slide; no transcript.)
10
Influential Points, High Leverage Points,
Outliers in Multiple Regression
  • As in simple linear regression, we identify high
    leverage and high influence points by checking
    the leverages and Cook's distances (use Save
    Columns to save Cook's D Influence and Hats).
  • High influence points: Cook's distance > 1.
  • High leverage points: a point with hat greater
    than 3(number of explanatory variables + 1)/n
    has high leverage. These are points whose
    explanatory variables are an outlier in a
    multidimensional sense.
  • Use the same guidelines for dealing with
    influential observations as in simple linear
    regression.
  • A point that has an unusual Y given its explanatory
    variables is a point with a residual that is more
    than 3 RMSEs away from zero. (See the sketch of
    these diagnostics after this list.)
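A minimal sketch of these diagnostics outside of JMP (JMP: Save Columns >
Cook's D Influence and Hats), assuming the DataFrame df with the logNOX and
logSO2 columns created above.

    # Fit the multiple regression and flag high-influence, high-leverage, and
    # residual-outlier points using the thresholds from the slide.
    import statsmodels.formula.api as smf

    model = smf.ols("MORT ~ PRECIP + EDUC + NONWHITE + logNOX + logSO2",
                    data=df).fit()
    infl = model.get_influence()
    cooks = infl.cooks_distance[0]            # Cook's distance per city
    hats = infl.hat_matrix_diag               # leverage (hat value) per city

    n = len(df)
    k = 5                                     # number of explanatory variables
    print(df.index[cooks > 1])                # high influence: Cook's D > 1
    print(df.index[hats > 3 * (k + 1) / n])   # high leverage: hat > 3(k+1)/n
    rmse = model.mse_resid ** 0.5
    print(df.index[abs(model.resid) > 3 * rmse])  # residual > 3 RMSEs from zero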

11
New Orleans has Cook's distance greater than 1, so
New Orleans may be influential.
3 RMSEs = 108; no points are outliers in residuals.
12
Labeling Observations
  • To have points identified by a certain column, go
    to the column, click Columns, and click Label
    (click Unlabel to unlabel).
  • To label a row, go to the row, click Rows, and
    click Label.

13

14
Dealing with New Orleans
  • New Orleans is influential.
  • New Orleans also has high leverage:
    hat = 0.45 > 3(5 + 1)/60 = 0.3.
  • Thus, it is reasonable to exclude New Orleans
    from the analysis, report that we excluded New
    Orleans, and note that our model does not apply
    to cities with explanatory variables in the range
    of New Orleans's. (A sketch of the refit follows.)
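A minimal sketch of the refit without New Orleans, assuming the DataFrame df
from above and a hypothetical city-name column called "City".

    # Exclude New Orleans and refit the same model.
    import statsmodels.formula.api as smf

    df_no_nola = df[df["City"] != "New Orleans"]
    model2 = smf.ols("MORT ~ PRECIP + EDUC + NONWHITE + logNOX + logSO2",
                     data=df_no_nola).fit()
    print(model2.summary())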

15
Leverage Plots
  • A simple regression view of a multiple
    regression coefficient. For xj, plot the residual
    of y (regressed on all x's except xj) vs. the
    residual of xj (regressed on the rest of the x's);
    both axes are recentered.
  • The slope in the leverage plot equals the
    coefficient for that variable in the multiple
    regression.
  • Distances from the points to the LS line are the
    multiple regression residuals. The distance from a
    point to the horizontal line is the residual if the
    explanatory variable is not included in the model.
  • Useful to identify (for xj) outliers, leverage,
    and influential points. (Use them the same way as
    in simple regression to identify the effect of
    points on the regression coefficient of a
    particular variable.)
  • Leverage plots are particularly useful for
    identifying points which are influential for a
    particular coefficient in the regression. (A
    sketch follows.)
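Outside of JMP, leverage plots are usually called added-variable or partial
regression plots; a minimal sketch with statsmodels, assuming the fitted
"model" from the diagnostics sketch above.

    # One added-variable (leverage) plot per explanatory variable.
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    fig = sm.graphics.plot_partregress_grid(model)
    fig.tight_layout()
    plt.show()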

16
The enlarged observation, New Orleans, is an
outlier for estimating each coefficient and is
highly leveraged for estimating the coefficients
of interest on log NOX and log SO2. Since New
Orleans is both highly leveraged and an outlier,
we expect it to be influential.
17
Analysis without New Orleans
18
Checking the Model
19
Linearity, constant variance and normality
assumptions all appear reasonable.
20
Inference About Questions of Interest
  • Strong evidence that mortality is positively
    associated with SO2 for fixed levels of
    precipitation, education, nonwhite, and NOX.
  • No strong evidence that mortality is associated
    with NOX for fixed levels of precipitation,
    education, nonwhite, and SO2.
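A minimal sketch of the inference behind these two statements: coefficient
t-test p-values and 95% confidence intervals for logSO2 and logNOX, assuming
the fit without New Orleans ("model2") from above.

    # Two-sided p-values and confidence intervals for the pollution terms,
    # holding the other explanatory variables fixed.
    print(model2.pvalues.loc[["logSO2", "logNOX"]])
    print(model2.conf_int().loc[["logSO2", "logNOX"]])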

21
Multiple Regression and Causal Inference
  • Goal: Figure out what the causal effect on
    mortality would be of decreasing air pollution
    (keeping everything else in the world fixed).
  • Lurking variable: a variable that is associated
    with both air pollution in a city and mortality
    in a city.
  • In order to figure out whether air pollution
    causes mortality, we want to compare mean
    mortality among cities with different air
    pollution levels but the same values of the
    confounding variables.
  • If we include all of the lurking variables in the
    multiple regression model, the coefficient on air
    pollution represents the change in the mean of
    mortality that is caused by a one unit increase
    in air pollution.
  • If we omit some of the lurking variables, then
    there is omitted variables bias, i.e., the
    multiple regression coefficient on air pollution
    does not measure the causal effect of air
    pollution. (A small simulation sketch follows this
    list.)
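A small simulation sketch of omitted-variable bias; all numbers are made up
purely for illustration.

    # A lurking variable ("lurk") drives both pollution and mortality. Omitting
    # it biases the coefficient on pollution away from its causal value of 2.0.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 1000
    lurk = rng.normal(size=n)
    pollution = 0.8 * lurk + rng.normal(size=n)
    mort = 2.0 * pollution + 5.0 * lurk + rng.normal(size=n)
    sim = pd.DataFrame({"mort": mort, "pollution": pollution, "lurk": lurk})

    full = smf.ols("mort ~ pollution + lurk", data=sim).fit()
    omitted = smf.ols("mort ~ pollution", data=sim).fit()
    print(full.params["pollution"])     # close to the causal value 2.0
    print(omitted.params["pollution"])  # biased upward because lurk is omitted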

22
Time Series Data and Autocorrelation
  • When Y is a variable collected for the same
    entity (person, state, country) over time, we
    call the data time series data.
  • For time series data, we need to consider the
    independence assumption for the simple and
    multiple regression model.
  • Independence Assumption: The residuals are
    independent of one another. This means that if
    the residual is positive this year, it is equally
    likely for the residual to be positive or
    negative next year, i.e., there is no
    autocorrelation.
  • Positive autocorrelation: Positive residuals are
    more likely to be followed by positive residuals
    than by negative residuals.
  • Negative autocorrelation: Positive residuals are
    more likely to be followed by negative residuals
    than by positive residuals.

23
Example Melanoma Incidence
  • Is the incidence of melanoma (skin cancer)
    increasing over time? Is melanoma related to
    solar radiation?
  • We address these questions by looking at melanoma
    incidence among males from the Connecticut Tumor
    Registry from 1936 to 1972. The data are in
    melanoma.JMP.

24
Residuals suggest positive autocorrelation.
25
Test of Independence
  • The Durbin-Watson test is a test of whether the
    residuals are independent. The null hypothesis
    is that the residuals are independent and the
    alternative hypothesis is that the residuals are
    not independent (either positively or negatively
    autocorrelated).
  • To compute the Durbin-Watson test in JMP, after Fit
    Model, click the red triangle next to Response,
    click Row Diagnostics, and click Durbin-Watson
    Test. Then click the red triangle next to
    Durbin-Watson to get the p-value.
  • For the melanoma data, the Durbin-Watson output is
    shown on the slide.
  • Remedy for autocorrelation: Add the lagged value of Y
    to the model. (A sketch of both steps follows.)
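A minimal sketch of the Durbin-Watson statistic and the lagged-Y remedy
outside of JMP. The column names "incidence" and "year", and the file name
melanoma.csv (a hypothetical export of melanoma.JMP), are assumptions; note
that statsmodels reports the statistic only, not the p-value JMP shows.

    # Durbin-Watson statistic for the melanoma regression, then the remedy of
    # adding the lagged value of Y as an explanatory variable.
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.stattools import durbin_watson

    mel = pd.read_csv("melanoma.csv")
    fit = smf.ols("incidence ~ year", data=mel).fit()
    print(durbin_watson(fit.resid))   # values near 2 indicate no autocorrelation

    mel["lag_incidence"] = mel["incidence"].shift(1)
    fit_lag = smf.ols("incidence ~ year + lag_incidence",
                      data=mel.dropna()).fit()
    print(durbin_watson(fit_lag.resid))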