Title: Stat 112: Lecture 16 Notes
1. Stat 112 Lecture 16 Notes
- Finish Chapter 6:
- Influential Points for Multiple Regression (Section 6.7)
- Assessing the Independence Assumption and Remedies for Its Violation (Section 6.8)
- Homework 5 is due next Thursday. I will e-mail it tonight.
- Please let me know of any ideas you want to discuss for the final project.
2. Multiple Regression, Modeling and Outliers, Leverage and Influential Points
Pollution Example
- The data set pollution.JMP provides information about the relationship between pollution and mortality for 60 cities between 1959 and 1961.
- The variables are:
- y (MORT): total age-adjusted mortality in deaths per 100,000 population
- PRECIP: mean annual precipitation (in inches)
- EDUC: median number of school years completed for persons 25 and older
- NONWHITE: percentage of the 1960 population that is nonwhite
- NOX: relative pollution potential of NOx (related to tons of NOx emitted per day per square kilometer)
- SO2: relative pollution potential of SO2
3. Multiple Regression: Steps in Analysis
- Preliminaries: Define the question of interest. Review the design of the study. Correct errors in the data.
- Explore the data: Use graphical tools, e.g., a scatterplot matrix; consider transformations of the explanatory variables; fit a tentative model; check for outliers and influential points.
- Formulate an inferential model: Word the questions of interest in terms of model parameters.
4. Multiple Regression: Steps in Analysis, Continued
- Check the model: (a) Check the model assumptions of linearity, constant variance, and normality. (b) If needed, return to step 2 and make changes to the model (such as transformations or adding terms for interaction and curvature). (c) Drop variables from the model that are not of central interest and are not significant.
- Infer the answers to the questions of interest using appropriate inferential tools (e.g., confidence intervals, hypothesis tests, prediction intervals).
- Presentation: Communicate the results to the intended audience.
5. Scatterplot Matrix
- Before fitting a multiple linear regression model, it is a good idea to make scatterplots of the response variable versus each explanatory variable. These can suggest transformations of the explanatory variables that need to be done, as well as potential outliers and influential points.
- Scatterplot matrix in JMP: Click Analyze, then Multivariate Methods and Multivariate, and put the response variable first in the Y, Columns box, followed by the explanatory variables.
6. Scatterplot Matrix
7. Crunched Variables
- When an X variable is "crunched," meaning that most of its values are packed closely together and a few are far apart, there will be influential points. To reduce the effects of crunching, it is a good idea to transform the variable by taking its log.
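As an illustrative sketch (made-up data in numpy, not the JMP workflow the course uses), the effect of a log transform on a crunched variable shows up in its skewness:

```python
# Illustrative sketch: a "crunched" variable has most values packed together
# with a few far apart (strong right skew); a log transform spreads them out.
# The data here are simulated, chosen only to resemble a variable like NOX.
import numpy as np

rng = np.random.default_rng(0)
nox = rng.lognormal(mean=1.0, sigma=1.5, size=60)  # heavily right-skewed

def skewness(x):
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

print(skewness(nox))          # large positive skew before transforming
print(skewness(np.log(nox)))  # much closer to zero after the log transform
```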
8.
- a) From the scatterplot of MORT vs. NOX we see that the NOX values are crunched very tightly; a log transformation of NOX is needed.
- b) The curvature in MORT vs. SO2 indicates that a log transformation of SO2 may be suitable.
- After the two transformations we have the following correlations:
9. (Figure: correlation matrix after the transformations; not transcribed)
10. Influential Points, High Leverage Points, and Outliers in Multiple Regression
- As in simple linear regression, we identify high leverage and high influence points by checking the leverages and Cook's distances (use Save Columns to save Cook's D Influence and Hats).
- High influence points: Cook's distance > 1.
- High leverage points: a point with a hat value greater than 3(number of explanatory variables + 1)/n has high leverage. These are points whose explanatory variables are an outlier in a multidimensional sense.
- Use the same guidelines for dealing with influential observations as in simple linear regression.
- A point that has an unusual Y given its explanatory variables is one with a residual that is more than 3 RMSEs away from zero.
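The hat values and Cook's distances that JMP saves can also be computed directly from the regression. A sketch with numpy on simulated data (not the pollution data), applying the two thresholds above:

```python
# Sketch of the diagnostics described on the slide, using numpy only.
# Data are simulated: 60 "cities" and 5 explanatory variables.
import numpy as np

rng = np.random.default_rng(1)
n, k = 60, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + 5 x's
y = X @ rng.normal(size=k + 1) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
hats = np.diag(H)                        # leverages (JMP's "Hats")
p = k + 1                                # coefficients, including intercept
mse = resid @ resid / (n - p)
cooks_d = (resid**2 / (p * mse)) * hats / (1 - hats) ** 2

high_leverage = hats > 3 * p / n         # hat > 3(#explanatory vars + 1)/n
high_influence = cooks_d > 1             # Cook's distance > 1
print(high_leverage.sum(), high_influence.sum())
```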
11. New Orleans has a Cook's distance greater than 1, so New Orleans may be influential.
3 RMSEs = 108; no points are outliers in the residuals.
12. Labeling Observations
- To have points identified by a certain column, go to the column, click Columns, and click Label (click Unlabel to unlabel).
- To label a row, go to the row, click Rows, and click Label.
13. (Figure)
14. Dealing with New Orleans
- New Orleans is influential.
- New Orleans also has high leverage: hat = 0.45 > 3(5 + 1)/60 = 0.3.
- Thus, it is reasonable to exclude New Orleans from the analysis, report that we excluded New Orleans, and note that our model does not apply to cities with explanatory variables in the range of New Orleans's.
15. Leverage Plots
- A leverage plot gives a simple regression view of a multiple regression coefficient. For xj, plot the residual of y (regressed on all x's except xj) vs. the residual of xj (regressed on the rest of the x's); both axes are recentered.
- The slope in the leverage plot equals the coefficient for that variable in the multiple regression.
- Distances from the points to the least squares line are the multiple regression residuals. The distance from a point to the horizontal line is its residual if the explanatory variable were not included in the model.
- Leverage plots are useful for identifying (for xj): outliers, leverage, and influential points.
- (Use them the same way as in a simple regression to identify the effect of points on the regression coefficient of a particular variable.)
- Leverage plots are particularly useful for identifying points that are influential for a particular coefficient in the regression.
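The slide's claim that the leverage-plot slope equals the multiple regression coefficient (the Frisch-Waugh-Lovell result) can be verified numerically. A sketch with simulated data in numpy, not the pollution data:

```python
# Numeric check: in a leverage (added-variable) plot for x_j, the least-squares
# slope of residual-y on residual-x_j equals the multiple regression
# coefficient on x_j. Simulated data; numpy only.
import numpy as np

rng = np.random.default_rng(2)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

full_coef = np.linalg.lstsq(X, y, rcond=None)[0]

j = 2                                    # examine the coefficient on column 2
others = np.delete(X, j, axis=1)
# Residual of y regressed on all x's except x_j:
ry = y - others @ np.linalg.lstsq(others, y, rcond=None)[0]
# Residual of x_j regressed on the rest of the x's:
rx = X[:, j] - others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
slope = (rx @ ry) / (rx @ rx)            # slope of the leverage plot

print(slope, full_coef[j])               # the two values agree
```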
16. The enlarged observation, New Orleans, is an outlier for estimating each coefficient and is highly leveraged for estimating the coefficients of interest on log NOX and log SO2. Since New Orleans is both highly leveraged and an outlier, we expect it to be influential.
17. Analysis Without New Orleans
18. Checking the Model
19. The linearity, constant variance, and normality assumptions all appear reasonable.
20. Inference About Questions of Interest
- There is strong evidence that mortality is positively associated with SO2 for fixed levels of precipitation, education, nonwhite, and NOX.
- There is no strong evidence that mortality is associated with NOX for fixed levels of precipitation, education, nonwhite, and SO2.
21. Multiple Regression and Causal Inference
- Goal: Figure out what the causal effect on mortality would be of decreasing air pollution (keeping everything else in the world fixed).
- Lurking variable: a variable that is associated with both air pollution in a city and mortality in a city.
- In order to figure out whether air pollution causes mortality, we want to compare mean mortality among cities with different air pollution levels but the same values of the confounding variables.
- If we include all of the lurking variables in the multiple regression model, the coefficient on air pollution represents the change in the mean of mortality that is caused by a one-unit increase in air pollution.
- If we omit some of the lurking variables, then there is omitted-variables bias: the multiple regression coefficient on air pollution does not measure the causal effect of air pollution.
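A small simulation (entirely made-up numbers, not the pollution data) illustrates omitted-variables bias: leaving a lurking variable z out of the regression shifts the coefficient on x away from its true causal value.

```python
# Toy illustration of omitted-variables bias. z is a lurking variable that
# affects y and is correlated with x; omitting it biases the coefficient on x.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
z = rng.normal(size=n)                       # lurking variable
x = 0.8 * z + rng.normal(size=n)             # "pollution", correlated with z
y = 2.0 * x + 3.0 * z + rng.normal(size=n)   # true causal effect of x is 2.0

X_full = np.column_stack([np.ones(n), x, z])
X_omit = np.column_stack([np.ones(n), x])
b_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
b_omit = np.linalg.lstsq(X_omit, y, rcond=None)[0]

print(b_full[1])   # close to the causal effect, 2.0
print(b_omit[1])   # biased upward, roughly 2 + 3*cov(x,z)/var(x)
```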
22. Time Series Data and Autocorrelation
- When Y is a variable collected for the same entity (person, state, country) over time, we call the data time series data.
- For time series data, we need to check the independence assumption of the simple and multiple regression model.
- Independence assumption: the residuals are independent of one another. This means that if the residual is positive this year, it must be equally likely for the residual to be positive or negative next year; i.e., there is no autocorrelation.
- Positive autocorrelation: positive residuals are more likely to be followed by positive residuals than by negative residuals.
- Negative autocorrelation: positive residuals are more likely to be followed by negative residuals than by positive residuals.
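These patterns can be quantified with the Durbin-Watson statistic, d = Σ(e_t − e_{t−1})² / Σe_t² (the basis of the Durbin-Watson test covered later in these notes). A sketch with simulated residuals, not the melanoma data:

```python
# Durbin-Watson statistic from its definition. Values near 2 suggest no
# autocorrelation; values well below 2 suggest positive autocorrelation,
# and values well above 2 suggest negative autocorrelation.
import numpy as np

def durbin_watson(resid):
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid**2)

rng = np.random.default_rng(4)
e_indep = rng.normal(size=500)       # independent residuals

# Positively autocorrelated residuals via an AR(1) recursion (assumed model)
e_pos = np.empty(500)
e_pos[0] = rng.normal()
for t in range(1, 500):
    e_pos[t] = 0.8 * e_pos[t - 1] + rng.normal()

print(durbin_watson(e_indep))        # near 2
print(durbin_watson(e_pos))          # well below 2
```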
23. Example: Melanoma Incidence
- Is the incidence of melanoma (skin cancer) increasing over time? Is melanoma related to solar radiation?
- We address these questions by looking at melanoma incidence among males from the Connecticut Tumor Registry from 1936 to 1972. The data are in melanoma.JMP.
24. Residuals suggest positive autocorrelation.
25. Test of Independence
- The Durbin-Watson test is a test of whether the residuals are independent. The null hypothesis is that the residuals are independent, and the alternative hypothesis is that the residuals are not independent (either positively or negatively autocorrelated).
- To compute the Durbin-Watson test in JMP, after Fit Model, click the red triangle next to Response, click Row Diagnostics, and click Durbin-Watson Test. Then click the red triangle next to Durbin-Watson to get the p-value.
- For the melanoma data: (test results not transcribed)
- Remedy for autocorrelation: add the lagged value of Y to the model.
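A sketch of this remedy on a simulated trending series (numpy only; an assumed AR(1)-with-trend model, not the melanoma data): adding lag(Y) as a predictor moves the Durbin-Watson statistic back toward 2.

```python
# Remedy for autocorrelation: include the lagged value of Y as a predictor.
# Simulated series with a time trend and AR(1) dependence.
import numpy as np

rng = np.random.default_rng(5)
n = 200
t = np.arange(n, dtype=float)
y = np.empty(n)
y[0] = 0.0
for i in range(1, n):
    y[i] = 0.05 * t[i] + 0.7 * y[i - 1] + rng.normal()

def durbin_watson(resid):
    return np.sum(np.diff(resid) ** 2) / np.sum(resid**2)

# Regression on time only: residuals are positively autocorrelated
X1 = np.column_stack([np.ones(n - 1), t[1:]])
r1 = y[1:] - X1 @ np.linalg.lstsq(X1, y[1:], rcond=None)[0]

# Adding lag(Y) as a predictor removes most of the autocorrelation
X2 = np.column_stack([np.ones(n - 1), t[1:], y[:-1]])
r2 = y[1:] - X2 @ np.linalg.lstsq(X2, y[1:], rcond=None)[0]

print(durbin_watson(r1), durbin_watson(r2))  # far below 2 vs. near 2
```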