Title: Assumptions Underlying Multiple Regression Analysis
Multiple regression analysis requires meeting several assumptions. We will (1) identify some of these assumptions, (2) describe how to tell if they have been met, and (3) suggest how to overcome or adjust for violations of the assumptions, if violations are detected.
Some Assumptions Underlying Multiple Regression

1. No specification error
2. Continuous variables
3. Additivity
4. No multicollinearity
5. Normally distributed error term
6. No heteroscedasticity
A. Absence of Specification Error

This refers to three different things:
1. No relevant variable is absent from the model.
2. No unnecessary variable is present in the model.
3. The model is estimated using the proper functional form.
A1. No relevant variable is absent from the model.

There is no statistical check for this type of error. Theory and knowledge of the dependent variable (that is, the phenomenon of interest) are the only checks. The strength of causal claims is directly proportional to the adequacy and completeness of the model specification.
A2. No unnecessary variable is present in the model.

Another, less serious, type of specification error is the inclusion of some variable or variables that are NOT associated with the dependent variable. You discover this when an independent variable proves NOT to be statistically significant. However, if theory and knowledge of the subject demanded that the variable be included, then this is not really specification error. Be careful not to remove statistically insignificant variables in order to re-estimate models without them. This smacks of one of the sins of multiple regression analysis, stepwise regression.
A3. Proper functional form.

A third aspect of proper model specification relates to what is called the functional form of the analysis. Multiple regression analysis assumes that the model has been estimated using the correct mathematical function. Recall that our discussions of the line of best fit, etc., have all emphasized the idea that the data can be described by a straight line, however imperfectly, rather than by some other mathematical function. This is what functional form is all about.
To determine whether the assumption of linear form is violated, simply create scatterplots of the relationship between each independent variable and the dependent variable. Examine each scatterplot to see if there is overwhelming evidence of nonlinearity. Use PROC PLOT for this:

  libname old 'a:\';
  libname library 'a:\';
  proc plot data=old.cities;
    plot crimrate*(policexp incomepc stress74);
  title1 'Plots of Dependent and All Independent Variables';
  run;
[Scatterplot: NUMBER OF SERIOUS CRIMES PER 1,000 (CRIMRATE) by POLICE EXPENDITURES PER CAPITA (POLICEXP)]
[Scatterplot: NUMBER OF SERIOUS CRIMES PER 1,000 (CRIMRATE) by INCOME PER CAPITA, IN 10S (INCOMEPC)]
[Scatterplot: NUMBER OF SERIOUS CRIMES PER 1,000 (CRIMRATE) by LONGTERM DEBT PER CAPITA, 1974 (STRESS74)]
If one or more of the relationships look nonlinear, then the INDEPENDENT variable can be transformed. Let's pretend that the INCOMEPC variable in our annual salary model showed an especially nonlinear relationship with salary. The most frequent transformations involve converting raw values of the independent variable to their natural logarithms, their squares, or their inverses. This is easily done in a SAS DATA step:

  data temp1;
    set old.cities;
    logofinc = log(incomepc);
    incsqrd  = incomepc**2;
    incinv   = 1 / incomepc;
  run;
Then the transformed values are replotted against the dependent variable. Inspection will tell which transformation has produced the most linear relationship:

  proc plot data=temp1;
    plot salary*(logofinc incsqrd incinv);
  title1 'New Scatterplots After Transformations';
  run;
We would estimate the model using the logarithmic values of income rather than the raw values; the other variables would be used in their original forms.

  proc reg data=temp1;
    model salary = educ logofinc age;
  title1 'Regression with Transformed Variable';
  run;

Notice that the variable 'logofinc' exists ONLY in the temporary data set, 'temp1', which we use in the analysis. With this variable transformed, the model is estimated with the proper functional form, and specification error is avoided.
B. Continuous Variables

We have stressed throughout our discussions of simple and multiple regression analysis that continuous variables are required. Our use of graphs to introduce the concept of the scatterplot in fact REQUIRED variables that were measured on at least an equal-interval scale. However, in constructing multiple regression models, one invites specification error if one does not include discrete variables such as ethnicity or gender as independent variables. How does one do this when these are not continuous variables? The answer is to create what are called dummy variables.
Dummy variables are binary variables, that is, they have values of 0 and 1. The value 0 means that the phenomenon is absent; the value 1 means that it is present. These are by definition equal-interval variables, since the distance between 0 and 1 is equal to the distance between 1 and 0. In fact, if you calculate the mean of a binary variable, the result is the proportion of observations in the "1" category. With a variable like GENDER, creating a dummy is simple: recode the category "f" (female) as 1 and the category "m" (male) as 0 (or vice versa) to create the new variable, FEMALE:
  libname old 'a:\';
  libname library 'a:\';
  data old.somedata;
    set old.somedata;
    if gender = 'f' then female = 1;
    else if gender = 'm' then female = 0;
    else female = .;
  run;

The new variable, FEMALE, can then be added as a (continuous) control variable.
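As a quick check on the recode, recall that the mean of a 0/1 variable equals the proportion of cases coded 1. A minimal sketch of such a check (not in the original handout; it assumes the data set created above):

  proc means data=old.somedata n mean;
    var female;     /* the mean of FEMALE is the proportion of women */
  title1 'Check: Mean of FEMALE Equals Proportion Female';
  run;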
What about a variable like ethnicity (ETHNIC)? Let's say that ETHNIC is coded as follows:

  1 = Anglo
  2 = Hispanic
  3 = African American
  4 = Asian American
  9 = All other

To create continuous variables, we would create J - 1 dummy variables, where J is the number of categories of the original variable. This is easy:
  libname old 'a:\';
  libname library 'a:\';
  data old.somedata;
    set old.somedata;
    if ethnic = 1 then anglo = 1;
    else if ethnic = 2 or ethnic = 3 or ethnic = 4 or ethnic = 9 then anglo = 0;
    else anglo = .;
    if ethnic = 2 then hispan = 1;
    else if ethnic = 1 or ethnic = 3 or ethnic = 4 or ethnic = 9 then hispan = 0;
    else hispan = .;
    if ethnic = 3 then afam = 1;
    else if ethnic = 1 or ethnic = 2 or ethnic = 4 or ethnic = 9 then afam = 0;
    else afam = .;
    if ethnic = 4 then asianam = 1;
    else if ethnic = 1 or ethnic = 2 or ethnic = 3 or ethnic = 9 then asianam = 0;
    else asianam = .;
  run;
Now we can include the four new dummy variables in the multiple regression model:

  libname old 'a:\';
  libname library 'a:\';
  options nodate nonumber ps=66;
  proc reg data=old.somedata;
    model salary = educ age logofinc anglo hispan afam asianam;
  title1 'Regression with Dummy Variables';
  run;
These four dummy variables will behave as continuous variables. Notice that the number of dummies is one less than the total number of categories in the original variable. This is because "All Other" is already present: cases in the "All Other" ethnic category are represented by observations whose values are ANGLO = 0, HISPAN = 0, AFAM = 0, and ASIANAM = 0. To create a fifth variable, OTHER, would be redundant and would create a problem of multicollinearity, which we will look at shortly.
C. Additivity

If you recall the algebraic version of our multiple regression model,

  Yi = α + b1X1i + b2X2i + b3X3i + εi

you will remember that the individual terms are joined together by addition (+) signs. This is an additive model, in other words. What this means as far as cause and effect is concerned is that each independent variable has its own separate, individual influence on the dependent variable. The additive model does not recognize the influence of combinations of independent variables that may exist over and above the separate influence of those variables.
Let's use an analogy: hydrogen and oxygen have chemical properties different from those of their combination, H2O. A regression model that had only H2 and O2 in it would not contain an H2O molecule.

[Diagram: the separate variables X2 and X3 combine to form the interaction term b4X2X3]
In our example model, the specification states that education and parents' income have separate and independent (net) effects on respondents' annual salary. However, there are probably clusters of education and parental income combinations that affect salary over and above the two separate variables. To capture such influences, and to avoid committing specification error, we can create an interaction term and add it to the model. Interaction terms for continuous variables are created by multiplication. Because of this, such variables are sometimes called product terms. For example, to create an interaction term for education and parents' income with SAS, the DATA step would be:
  libname old 'a:\';
  libname library 'a:\';
  data old.somedata;
    set old.somedata;
    educinc = educ*income;
  run;

The interaction term would then be added to the model:

  proc reg data=old.somedata;
    model salary = educ age income educinc;
  title1 'Regression with Interaction Term';
  run;
Symbolically, the multiple regression model now is

  Yi = α + b1X1i + b2X2i + b3X3i + b4X2iX3i + εi

where b4 is the multiple regression coefficient for the interaction term, X2iX3i. If this coefficient is statistically significant (as evaluated by a t-test), then we conclude that there is a joint influence over and above the separate influences of the two variables, X2i and X3i.
D. Absence of Multicollinearity

When we discussed the creation of dummy variables, we mentioned that J - 1 dummies were created to avoid redundancy. Sometimes we have two or more independent variables in a multiple regression model that are, unknown to us, in reality measures of the same underlying phenomenon. They are seemingly different measures of the same thing (e.g., gender and education in a world in which men have all the advanced education and women have none). Having "advanced education" really means the same thing as being "male," and "lack of education" is really the same as being in the "female" category. This is known as multicollinearity. It results in extremely strong associations between two (or more) independent variables thought to be measures of different things.
Multicollinearity is identified by regressing each INDEPENDENT variable on all of the other INDEPENDENT variables. The model R-squares are then examined to see if any (one or more) are greater than 0.90. If so, the variables that are the dependent variable in the one model and an independent variable in the other model are said to be collinear. The process is easier than it sounds. For our model:

  proc reg data=temp2;
    model educ = age pincome;
    model age = educ pincome;
    model pincome = educ age;
  title1 'Test for Multicollinearity';
  run;
If Models 1 and 2 have R-squares greater than 0.90, this means that education and age are collinear. SAS now has two options that diagnose multicollinearity automatically. One is the creation of the "tolerance" statistic. Tolerance is simply 1 - R2. Thus, two or more tolerance measures equal to or less than 0.10 (for R2 ≥ 0.90) indicate the presence of multicollinearity. To produce the tolerance measure, simply add the optional subcommand TOL to the MODEL statement after the "/":
  proc reg data=old.somedata;
    model salary = educ age pincome / tol;
  title1 'Regression with Test for Multicollinearity';
  run;

Common solutions for the presence of multicollinearity include (1) dropping one of the variables from the model or (2) creating a new variable by combining the collinear variables, such as in an interaction term or through factor analysis. Creating a new variable through factor analysis is probably preferable, provided that it makes sense substantively.
OLS REGRESSION RESULTS

Model: MODEL1
Dependent Variable: CRIMRATE   NUMBER OF SERIOUS CRIMES PER 1,000

Analysis of Variance

  Source    DF   Sum of Squares   Mean Square   F Value   Prob>F
  Model      3       7871.03150    2623.67717    11.126   0.0001
  Error     59      13912.52405     235.80549
  C Total   62      21783.55556

  Root MSE   15.35596    R-square   0.3613
  Dep Mean   44.44444    Adj R-sq   0.3289
  C.V.       34.55091

Parameter Estimates

                    Parameter       Standard     T for H0:
  Variable   DF      Estimate          Error   Parameter=0   Prob > |T|
  INTERCEP    1     14.482581    12.95814942         1.118       0.2683
  POLICEXP    1      0.772946     0.15818555         4.886       0.0001
  INCOMEPC    1      0.020073     0.03573539         0.562       0.5764
  STRESS74    1      0.005875     0.00800288         0.734       0.4658

                  Standardized
  Variable   DF       Estimate    Tolerance   Variable Label
  INTERCEP    1     0.00000000            .   Intercept
  POLICEXP    1     0.55792749   0.83163748   POLICE EXPENDITURES PER CAPITA
  INCOMEPC    1     0.05911456   0.97889220   INCOME PER CAPITA, IN 10S
  STRESS74    1     0.08431770   0.82285609   LONGTERM DEBT PER CAPITA, 1974
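To illustrate the factor-analysis option mentioned above, one could extract a single factor from the collinear predictors and use its score in their place. This is only a rough sketch, not part of the original handout; the data set TEMP2 and the variables EDUC, AGE, SALARY, and PINCOME are assumed from the earlier examples:

  proc factor data=temp2 nfactors=1 out=withfac;
    var educ age;                      /* the two collinear independent variables */
  title1 'Single Factor Combining EDUC and AGE';
  run;

  proc reg data=withfac;
    model salary = factor1 pincome;    /* FACTOR1 replaces EDUC and AGE */
  title1 'Regression Using the Factor Score';
  run;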
E. Normally Distributed Error Term

There are several assumptions underlying multiple regression analysis that involve the pattern of the distribution of the residuals (εi). You will recall that the residual is the error term, the unexplained variance; that is, the difference between the actual location of Yi, given the values of the variables in the model, and the predicted location, Ŷi. One serious departure from normality is the presence of outliers. Outliers are data points that are extremely different from all the others. For example, if a random sample of cities from across the U.S. ranged in size from 25,000 to 500,000, except for one city, New York, then New York would be an outlier. Its values on almost any variable would be vastly different from those of the other cities.
Outliers can be detected by requesting studentized residuals. These are like standard scores (i.e., z-scores) except that they are expressed in Student's t values (due to non-normality). As a rule of thumb, any studentized residual greater than 3.00 or less than -3.00 is considered an outlier. The studentized residuals may be requested by simply adding the optional subcommand R to the MODEL statement after the "/":

  proc reg data=old.somedata;
    model salary = educ age pincome / r;
  title1 'Regression with Test for Outliers';
  run;
If one or more outliers are detected, the solution is to re-estimate the model after deleting the outlying observation(s). Both sets of results would be presented, so that you can show results for all cases as well as for the subset of cases that are most alike.
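One minimal way to sketch this re-estimation in SAS (an illustration only, not from the original handout; the ±3 cutoff follows the rule of thumb above, and the data set and variable names repeat the earlier example):

  proc reg data=old.somedata;
    model salary = educ age pincome;
    output out=withres student=rstud;   /* save the studentized residuals */
  run;

  data nooutlier;
    set withres;
    if abs(rstud) <= 3;                 /* keep only non-outlying observations */
  run;

  proc reg data=nooutlier;
    model salary = educ age pincome;    /* re-estimate without the outliers */
  title1 'Regression After Deleting Outliers';
  run;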
OLS REGRESSION RESULTS

         Dep Var    Predict   Std Err               Std Err    Student
  Obs   CRIMRATE      Value   Predict   Residual   Residual   Residual
    1    48.0000    38.2208     2.257     9.7792     15.189      0.644
    2    29.0000    40.3939     2.332   -11.3939     15.178     -0.751
    3    31.0000    32.6852     3.043    -1.6852     15.051     -0.112
    4    55.0000    44.7336     6.757    10.2664     13.789      0.745
    5    55.0000    45.1087     3.876     9.8913     14.859      0.666
    6    44.0000    39.3328     5.157     4.6672     14.464      0.323
    7    22.0000    33.1233     4.350   -11.1233     14.727     -0.755
    8    60.0000    61.9258     3.901    -1.9258     14.852     -0.130
    9    40.0000    52.1408     3.044   -12.1408     15.051     -0.807
   10    75.0000    47.8937     2.902    27.1063     15.079      1.798
   11    54.0000    41.1859     2.057    12.8141     15.218      0.842
   12    15.0000    39.8490     3.564   -24.8490     14.937     -1.664
   13    21.0000    31.7722     3.180   -10.7722     15.023     -0.717
   14     9.0000    38.9681     3.558   -29.9681     14.938     -2.006
   15    40.0000    37.7368     2.631     2.2632     15.129      0.150
   16    33.0000    39.7868     4.477    -6.7868     14.689     -0.462
   17    65.0000    35.0524     3.407    29.9476     14.973      2.000
   18    30.0000    35.4104     2.761    -5.4104     15.106     -0.358
   19    41.0000    37.8033     2.252     3.1967     15.190      0.210
   20    34.0000    35.3039     2.708    -1.3039     15.115     -0.086
   21    30.0000    39.7113     2.682    -9.7113     15.120     -0.642
   22    49.0000    37.5937     5.094    11.4063     14.486      0.787
OLS REGRESSION RESULTS

  Obs   Cook's D
    1      0.002
    2      0.003
    3      0.000
    4      0.033
    5      0.008
    6      0.003
    7      0.012
    8      0.000
    9      0.007
   10      0.030
   11      0.003
   12      0.039
   13      0.006
   14      0.057
   15      0.000
   16      0.005
   17      0.052
   18      0.001
   19      0.000
   20      0.000
   21      0.003
   22      0.019

(The printed output also includes a small text plot of each studentized residual on a -2 to 2 scale; it is not reproduced here.)
F. Absence of Heteroscedasticity

A second assumption involving the distribution of the residuals has to do with their variance. It is assumed that the variance of the residuals is constant throughout the range of the model. Since the residual reflects how accurately the model predicts actual Y values, constant variance of the residuals means that the model is equally predictive at low, medium, and high values of the model (i.e., of Ŷi). If it is not, this suggests that the dependent variable is explained better in some ranges of the model than in others. The property of constant variance is called homoscedasticity. Its absence is called heteroscedasticity.
The presence of heteroscedasticity is detected by examining a plot of the residuals against the predicted Y-values. This is easy to do. In SAS, you include an OUTPUT statement in the regression procedure, in which you give names to the predicted Y-values (following P=) and to the residuals (following R=).
  libname old 'a:\';
  libname library 'a:\';
  proc reg data=old.somedata;
    model salary = educ age pincome;
    output out=temp5 p=yhat r=yresid;
  title1 'Regression Output for Heteroscedasticity Test';
  run;

  proc plot data=temp5;
    plot yresid*yhat = '*' / vref=0;
  title1 'Plot of Residuals';
  run;
[Plot: residuals (vertical axis, roughly -30 to 30) against the predicted value of SALARY (horizontal axis, roughly 30 to 90), with a reference line at residual = 0]
If heteroscedasticity is absent, the plot should look like this:

[Illustration: residuals scattered in a band of roughly constant width around the zero line across the full range of predicted Y-values]

Here there is no marked change in the magnitude of the residuals throughout the range of the model (defined by the Ŷ values). Compare this with the following pattern:
[Illustration: the band of residuals around the zero line narrows as the predicted Y-values increase]

Here, the residuals are smaller at greater predicted Y-values. That is, the residuals are not uniformly scattered about the reference line (at residual = 0). This means that the model does not give consistent predictions. This is heteroscedasticity. The solution is to create a weighting variable and to perform weighted least squares.
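A minimal sketch of that last step (not from the original handout; the 1/yhat**2 weighting scheme is only one assumed choice, and TEMP5 and YHAT come from the earlier OUTPUT statement):

  data temp6;
    set temp5;
    wt = 1 / (yhat**2);        /* assumed weight: downweight cases with larger variance */
  run;

  proc reg data=temp6;
    weight wt;                 /* the WEIGHT statement produces weighted least squares */
    model salary = educ age pincome;
  title1 'Weighted Least Squares Regression';
  run;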
Summary of Multiple Regression Assumptions

1. Absence of specification error: no relevant variables omitted; no irrelevant variables included; proper functional form (e.g., linear relationships)
2. Continuous variables: can use dummy variables as independent variables
Multiple Regression Assumptions (continued)

3. Additivity: can create interaction (product) variables, if necessary
4. No multicollinearity: may need to combine two (or more) of the independent variables
5. Normally distributed error term: may need to repeat the analysis without outlier(s)
Multiple Regression Assumptions (continued)

6. No heteroscedasticity: may need to perform weighted regression