Title: Multiple Regression
1. Multiple Regression
- We can include more than one independent or explanatory variable. The general model is
  y = a + b1x1 + b2x2 + ... + bkxk + e
- a is the intercept (the value of y when all xi = 0)
- bj is the regression coefficient or slope for variable xj: the change in y per unit change in xj, holding all other x variables constant (all else equal)
- e is a normally distributed, independent random variable with mean 0 and standard deviation σ
- k is the number of independent variables
- df = n - k - 1
2. Least Squares Estimation
- Using sample data, we select values of a, b1, b2, ..., bk such that Σei² is a minimum.
- The equations for b1, b2, ..., bk are too complicated to permit hand calculation: use Excel! Check that the values of a, b1, b2, ..., bk make sense.
- The best-fit surface is a k-dimensional hyperplane (a sketch of the fit in code follows).
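The deck relies on Excel for the arithmetic; as a rough equivalent, here is a minimal sketch of a multiple regression fit in Python with statsmodels. The data are hypothetical (salary as a function of experience and a gender dummy, chosen to match the later examples), and later sketches reuse these variables.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: y = salary, x1 = years of experience, x2 = gender dummy
rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 20, n)
x2 = rng.integers(0, 2, n)
y = 40 + 2.0 * x1 + 5.0 * x2 + rng.normal(0, 3, n)

# Build the design matrix with an intercept column and fit by least squares
X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()

print(model.params)     # a, b1, b2: intercept and slopes
print(model.summary())  # full table: se, R2, adjusted R2, F, p-values
```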
4. Standard Error
- As before, the standard error of the regression (or of the estimate) is the standard deviation of the residuals, adjusted for degrees of freedom:
  se = sqrt( Σei² / (n - k - 1) )
- As before, the key assumption is that the standard error is constant, independent of the values of the independent variables.
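Continuing the sketch above, se can be computed directly from the residuals (statsmodels exposes the same quantity as the square root of mse_resid):

```python
resid = model.resid
se = np.sqrt(np.sum(resid**2) / model.df_resid)   # df_resid = n - k - 1
print(se, np.sqrt(model.mse_resid))               # the two values agree
```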
5. Goodness of Fit: R²
- As before, R² = SSR/SST is the fraction of variability in y that is explained by the best-fit equation.
- Adding an additional variable will always increase R², even if it has no additional explanatory power, because of slight correlations in sample data.
- The corrected or adjusted R² takes this into account. Radj² will increase only if the standard error, se, decreases.
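A sketch of both quantities, continuing the fit above; the adjusted R² here uses the usual degrees-of-freedom correction, which is consistent with the statement that it rises only when se falls:

```python
k = X.shape[1] - 1                               # number of explanatory variables
sst = np.sum((y - y.mean())**2)                  # total sum of squares
r2 = 1 - np.sum(model.resid**2) / sst
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, model.rsquared)                        # should match
print(r2_adj, model.rsquared_adj)                # should match
```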
6. Goodness of Fit: F
- The F statistic is the ratio of the explained variance to the unexplained variance.
- The corresponding p-value is a test of the null hypothesis that all the regression coefficients (b1, b2, ..., bk) are zero, i.e., that the regression equation has no explanatory power. It is, in effect, a test of the statistical significance of R².
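The same statistic computed from the sums of squares in the sketch above (scipy is used only for the F distribution's tail probability):

```python
from scipy import stats

sse = np.sum(model.resid**2)                      # unexplained (residual) sum of squares
f_stat = ((sst - sse) / k) / (sse / (n - k - 1))  # explained vs. unexplained variance
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(f_stat, model.fvalue)                       # should match
print(p_value, model.f_pvalue)                    # should match
```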
7. Types of Explanatory Variables
- In ordinary least squares, the response or dependent variable must be continuous.
- Explanatory variables are usually continuous, but
- categorical variables can be included by using dummy and interaction variables
- proportions (0 to 1) can be included by converting them into an odds ratio (0 to ∞) or the log of the odds ratio (-∞ to +∞)
- We can also use various transformations to linearize relationships or stabilize variance.
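For illustration, the log-odds transformation of a proportion is one line (the values here are arbitrary):

```python
p = np.array([0.10, 0.50, 0.90])
log_odds = np.log(p / (1 - p))   # maps (0, 1) onto (-inf, +inf)
print(log_odds)
```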
8. Dummy Variables
- Use dummy variables to include categorical variables that have a constant effect on y (independent of the other xj).
- For example, if salary (y) is a function of experience (x1):
- y = α + β1x1
- To test for gender bias:
- y = α + β1x1 + β2x2
- where x2 = 0 if female, x2 = 1 if male; β2 is the difference in salary, independent of experience (x1).
- For females:
- y = a + b1x1
- and for males:
- y = a + b1x1 + b2 = (a + b2) + b1x1
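A sketch of the gender-bias test, reusing the hypothetical data from the first code block (x2 is the 0/1 gender dummy):

```python
X = sm.add_constant(np.column_stack([x1, x2]))   # x2 = 0 female, 1 male
fit = sm.OLS(y, X).fit()
a, b1, b2 = fit.params
print(b2, fit.pvalues[2])                        # estimated gap and p-value for H0: beta2 = 0
```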
10. Dummy Variables
- To test for bias, test the null hypothesis β2 = 0.
- If the variable has m values, use (m - 1) dummy variables. For example, if salary depends on experience (x1) and education (BS, MS, or PhD):
- y = a + b1x1 + b2x2 + b3x3
- where x2 = 1 for an MS (0 otherwise) and x3 = 1 for a PhD (0 otherwise), so BS is the baseline category.
11. Interaction Variables
- Suppose that the gender gap varied with the number of years of experience. To include this, use an interaction variable, x3 = x1x2:
- y = a + b1x1 + b2x2 + b3(x1x2)
- where x1 is years of experience and x2 is gender.
- For women (x2 = 0):
- y = a + b1x1
- while for men (x2 = 1):
- y = a + b1x1 + b2 + b3x1 = (a + b2) + (b1 + b3)x1
- b2 is the difference in starting salary; b3 is the difference in the rate at which salary increases with experience.
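A sketch of the interaction model on the same hypothetical data; the product column is the interaction variable:

```python
X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))
fit = sm.OLS(y, X).fit()
a, b1, b2, b3 = fit.params
print(b2, b3)   # b2: gap at zero experience; b3: difference in slope per year of experience
```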
13. Dummy Var + Log Transformations
- y = LN(salary), x1 = experience, x2 = 1 if male.
- If β2 = 0.05, e^β2 ≈ 1.05: men's salaries are about 5% higher.
- If β2 = -0.05, e^β2 ≈ 0.95: men's salaries are about 5% lower.
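A quick check of the arithmetic behind the percentage interpretation:

```python
for b2_ in (0.05, -0.05):
    print(b2_, np.exp(b2_))   # approx. 1.051 and 0.951: roughly +5% and -5%
```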
14. Dummy Var + Log-Log Transformation
- y = LN(salary), x1 = LN(experience), x2 = 1 if male.
- As before, men's salaries are a constant factor e^β2 higher/lower than women's salaries (all else equal); if β2 << 1, then men's salaries are approximately 100β2 percent higher/lower than women's salaries.
15. Inferences about Regression Coefficients
- The standard errors of the regression coefficients, sbj, are too complicated to calculate by hand; the values appear in the Excel output.
- The standard errors can be used to
- construct confidence intervals for βj: bj ± tα,df·sbj
- test the null hypothesis that βj = 0 (no association between xj and y, taking into account the other explanatory variables in the model)
- Excel output contains confidence intervals and p-values for each bj.
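The same quantities from the statsmodels fit above (Excel's Regression tool reports the equivalent columns):

```python
print(fit.bse)                    # standard errors of the coefficients
print(fit.conf_int(alpha=0.05))   # 95% confidence intervals for each beta_j
print(fit.pvalues)                # two-sided p-values for H0: beta_j = 0
```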
16. Uncertainty in Predictions
- The standard error for a predicted value of y is too complicated to calculate by hand.
- Use Tools / Data Analysis Plus / Prediction Interval, or
- shift the variables and use sa: for example, to predict at (x1, x2) = (10, 20), redo the regression with z1 = x1 - 10 and z2 = x2 - 20. The intercept of the shifted regression is the predicted value at that point, and sa is its standard error.
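A sketch of the shift trick with the hypothetical data above; since x2 here is a 0/1 dummy, the prediction point (10, 1) is chosen for illustration rather than the slide's (10, 20):

```python
x1_0, x2_0 = 10.0, 1.0
Z = sm.add_constant(np.column_stack([x1 - x1_0, x2 - x2_0]))
shifted = sm.OLS(y, Z).fit()
print(shifted.params[0], shifted.bse[0])   # predicted y at (10, 1) and its standard error (sa)
```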
17. Other Ways to Estimate Errors
- The variation in b from sample to sample sometimes exceeds predictions based on sb.
- This is particularly true in multiple regression, when variables are included based on how well they fit the sample data. There are several ways to deal with this.
- Jackknife: compute the regression n times, each time omitting one case. The standard error of b is
  sb = sqrt( ((n - 1)/n) · Σi (b(i) - mean(b))² )
- where b(i) is the computed slope when case i is omitted and mean(b) is the average of the b(i).
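A minimal jackknife sketch for the experience slope b1, reusing the hypothetical data (the helper below refits the two-variable model with one case left out):

```python
def slope_b1(x1s, x2s, ys):
    Xs = sm.add_constant(np.column_stack([x1s, x2s]))
    return sm.OLS(ys, Xs).fit().params[1]

b1_jack = np.array([slope_b1(np.delete(x1, i), np.delete(x2, i), np.delete(y, i))
                    for i in range(n)])
sb1_jack = np.sqrt((n - 1) / n * np.sum((b1_jack - b1_jack.mean())**2))
print(sb1_jack)
```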
18. Other Ways to Estimate Errors
- Bootstrap: select a random sample of size n, with replacement, from the original sample of size n and compute the regression; repeat hundreds of times; compute sb from the estimates of b, as with the jackknife. (Variation in b is possible because each case can be selected more than once, or not selected at all.)
- Cross-validation: omit a case and compute the regression; use the regression to predict y for the omitted case and note the estimation error; repeat for all cases. The prediction error is given by the standard deviation of the errors.
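A bootstrap sketch for the same slope, reusing slope_b1 from the jackknife sketch; 500 resamples is an arbitrary choice:

```python
B = 500
idx = rng.integers(0, n, size=(B, n))                    # resample cases with replacement
b1_boot = np.array([slope_b1(x1[i], x2[i], y[i]) for i in idx])
print(b1_boot.std(ddof=1))                               # bootstrap estimate of sb1
```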
19. Validation with Data Splitting
- As explained above, the best-fit model can underestimate errors, since it is fit to the data.
- The jackknife, bootstrap, and cross-validation methods can give more realistic estimates of the errors, but they are convenient only with software that can perform these manipulations automatically.
- A quick-and-dirty way to validate a model is to randomly split the data set into two parts. Run a regression on one part, and use the result to produce predictions and residuals for the other part. If the standard deviation of those residuals is roughly equal to se, then the model is OK.
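A sketch of the split check on the hypothetical data (a 50/50 random split; compare the holdout residual spread with the in-sample se):

```python
perm = rng.permutation(n)
train, test = perm[: n // 2], perm[n // 2 :]

X_all = sm.add_constant(np.column_stack([x1, x2]))
fit_train = sm.OLS(y[train], X_all[train]).fit()

resid_test = y[test] - fit_train.predict(X_all[test])
print(resid_test.std(ddof=1), np.sqrt(fit_train.mse_resid))   # should be roughly equal
```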
20. Multicollinearity
- The p-value for F can be low even if the p-values for the βj are high, if the xj are correlated with each other.
- Multicollinearity makes it difficult to separate the effect of x1 on y from the effect of x2 on y.
- Multicollinearity can be eliminated by choosing the values of x1, x2, ..., xk randomly, over a broad range. But often we must take the data as they come.
- Examples: air pollution (SO2, NOx, and PM) vs. mortality; diet (meat and fiber) vs. intestinal cancer.
- Multicollinearity can be reduced by eliminating redundant independent variables.
21. Multicollinearity
22. Correlation Matrix
- To check for multicollinearity (and to assist in model-building), construct a correlation matrix (a sketch follows).
- See which xj are most correlated with y, and which xj are strongly correlated with each other.
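A sketch with numpy; np.corrcoef returns the matrix of pairwise correlations among the listed variables (here the hypothetical x1, x2, and y):

```python
corr = np.corrcoef(np.vstack([x1, x2, y]))
print(np.round(corr, 2))   # rows/columns in order: x1, x2, y
```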
23. Model Building
- The key to regression analysis is to properly specify the model. You want to include all the important variables and omit any extraneous ones (variables that don't significantly improve the explanatory power of the model).
- A model that omits an important variable is misspecified; one that has only the variables necessary to accurately describe the data is parsimonious.
- How do you decide which variables to include and which to omit?
24. Model Building
- Theory: ideally, we have a theory that describes or predicts the factors that determine y. But often we don't have a complete theory, or the proper data cannot be collected.
- Intuition: lacking a solid theory, we can test various plausible explanations for the factors that determine y.
- Data availability: too often, people simply assemble all the data that might possibly be relevant and use regression to discover the causes of y. But spurious correlation is a problem, particularly for small data sets.
25. Include/Exclude Decisions
- Once you have assembled data for the candidate explanatory variables, there are two approaches to deciding which to include in the final model.
- Bottom-up or forward: begin with the explanatory variable having the lowest p-value; then add the explanatory variable having the lowest p-value in a two-variable regression, and so on, until no variable can be added that would have a p-value below the given threshold (e.g., 0.05). (A sketch of this procedure follows.)
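A minimal sketch of the forward procedure in Python (a greedy loop over p-values; the threshold and candidate list are illustrative, and this is not the Data Analysis Plus tool itself):

```python
def forward_select(y, candidates, names, threshold=0.05):
    """Greedy forward selection: repeatedly add the candidate with the lowest p-value."""
    chosen, remaining = [], list(range(len(candidates)))
    while remaining:
        best = None
        for j in remaining:
            Xj = sm.add_constant(np.column_stack([candidates[i] for i in chosen + [j]]))
            p = sm.OLS(y, Xj).fit().pvalues[-1]   # p-value of the newly added variable
            if best is None or p < best[1]:
                best = (j, p)
        if best[1] >= threshold:
            break
        chosen.append(best[0])
        remaining.remove(best[0])
    return [names[i] for i in chosen]

print(forward_select(y, [x1, x2, x1 * x2], ["x1", "x2", "x1*x2"]))
```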
26. Include/Exclude Decisions
- Stepwise: like the forward procedure, except deletions are considered along the way.
- Use Data Analysis Plus / Stepwise Regression; use p-values for the include/exclude decisions.
- Top-down or backward: begin with all the explanatory variables and delete the one with the largest p-value; continue until all remaining variables have p-values below the given threshold.
- Use the OLS Regression tool available on the web page.
27. Include/Exclude Decisions
- In some cases, a set of explanatory variables forms a logical group (e.g., dummy variables for race, education, department, etc.).
- It is common (but not necessary) to include or exclude all of the variables in the group.
- Use the partial F test to test the null hypothesis that the entire set of variables has no explanatory power.
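A sketch of the partial F test, comparing the full interaction model to a reduced model without the group (here the hypothetical group is x2 and x1*x2; scipy was imported in the F-statistic sketch above):

```python
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x1 * x2]))).fit()
reduced = sm.OLS(y, sm.add_constant(x1)).fit()

q = 2                                                      # variables in the group being tested
f_partial = ((reduced.ssr - full.ssr) / q) / full.mse_resid
p_partial = stats.f.sf(f_partial, q, full.df_resid)
print(f_partial, p_partial)
```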
28. Include/Exclude Decisions
- To decide which variables to include in the model, we can aim for a model with
- the highest adjusted R²
- the lowest se
- the lowest p-value for F
- all regression coefficients with |t| > tc or p < α
- the fewest explanatory variables
- There is no one right way to select the best model. It depends on the problem. Many times, different paths lead to the same solution.
29. Tainted Variables
- Beware of tainted variables when building models. A variable is tainted if it is, like y, a consequence of the other explanatory variables, rather than a cause of y.
- For example, rank might appear to explain differences in salary. Although rank is correlated with salary, it should be regarded, like salary, as a measure or consequence of job performance, not a cause of job performance. Rank might be an alternative dependent variable, but not an independent variable.
30. Analysis of Residuals
- Plot the residuals as a function of the fitted/predicted response variable and look for outliers or signs of heteroscedasticity.
- Plot the residuals vs. each explanatory variable and look for signs of nonlinearity.
- Make a histogram of the residuals and look for non-normality.
- In the case of time-series data (even if time is not an explanatory variable in the model), plot the residuals vs. time and look for signs of autocorrelation; compute the Durbin-Watson statistic. (A sketch of these checks follows.)
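A sketch of these diagnostics with the fit from the partial F example (matplotlib for the plots; the Durbin-Watson statistic comes from statsmodels, and the row order stands in for time):

```python
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(full.fittedvalues, full.resid)   # outliers / heteroscedasticity
axes[1].scatter(x1, full.resid)                  # nonlinearity in x1
axes[2].hist(full.resid, bins=15)                # non-normality
plt.show()

print(durbin_watson(full.resid))                 # values near 2 suggest little autocorrelation
```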