Transcript and Presenter's Notes

Title: Multiple Regression


1
Multiple Regression
  • We can include more than one independent or
    explanatory variable. The general model is

    y = a + b1x1 + b2x2 + ... + bkxk + e

where
  a  = the intercept (the value of y when all xi = 0)
  bj = the regression coefficient or slope for variable
       xj: the change in y per unit change in xj, holding
       all other x variables constant (all else equal)
  e  = a normally distributed, independent random
       variable with mean 0 and standard deviation s
  k  = the number of independent variables
  df = n - k - 1
2
Least Squares Estimation
  • Using sample data, we select values of a, b1,
    b2, ..., bk such that Σei² is a minimum

The equations for b1, b2, ..., bk are too complicated
to permit hand calculation; use Excel! Check that
the values of a, b1, b2, ..., bk make sense. The
best-fit surface is a k-dimensional hyperplane.
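A minimal sketch of the same least-squares calculation
in Python with numpy; the data arrays here are
hypothetical and only illustrate the mechanics:

  import numpy as np

  # Hypothetical sample data: two explanatory variables and a response
  x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
  x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
  y  = np.array([3.1, 4.2, 7.9, 8.1, 12.2, 12.0])

  # Design matrix with a leading column of 1s for the intercept a
  X = np.column_stack([np.ones_like(x1), x1, x2])

  # Choose a, b1, b2 so that the sum of squared residuals is a minimum
  coef, *_ = np.linalg.lstsq(X, y, rcond=None)
  a, b1, b2 = coef

  residuals = y - X @ coef
  print(a, b1, b2, (residuals ** 2).sum())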
3
(No Transcript)
4
Standard Error
  • As before, the standard error of the regression
    (or of the estimate) is the standard deviation of
    the residuals, adjusted for degrees of freedom:

    se = sqrt( Σei² / (n - k - 1) )

As before, the key assumption is that the
standard error is constant, independent of the
values of the independent variables.
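As a sketch of the calculation (with a made-up
residual vector), the standard error follows directly
from the formula above:

  import numpy as np

  # Hypothetical residuals from a fit with k = 2 explanatory variables
  residuals = np.array([0.2, -0.3, 0.1, 0.4, -0.2, -0.2])
  n, k = len(residuals), 2

  # se = sqrt( sum of squared residuals / (n - k - 1) )
  se = np.sqrt((residuals ** 2).sum() / (n - k - 1))
  print(se)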
5
Goodness of Fit R2
  • As before, R2 = SSR/SST = the fraction of
    variability in y that is explained by the best-fit
    equation
  • Adding an additional variable will always
    increase R2, even if it has no additional
    explanatory power, because of slight correlations
    in sample data
  • The corrected or adjusted R2 takes this into
    account. Rc2 will increase only if the standard
    error, se, decreases
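A short sketch of both measures, assuming you already
have the observed and fitted values (the numbers here
are hypothetical):

  import numpy as np

  y     = np.array([3.1, 4.2, 7.9, 8.1, 12.2, 12.0])   # observed (hypothetical)
  y_hat = np.array([3.0, 4.5, 7.6, 8.4, 11.9, 12.1])   # fitted values
  n, k = len(y), 2                                      # cases, explanatory variables

  sse = ((y - y_hat) ** 2).sum()                        # unexplained variation
  sst = ((y - y.mean()) ** 2).sum()                     # total variation
  r2 = 1 - sse / sst                                    # = SSR/SST
  r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))    # penalizes extra variables
  print(r2, r2_adj)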

6
Goodness of Fit F
  • The F statistic is the ratio of the explained
    variance to the unexplained variance:

    F = (SSR/k) / (SSE/(n - k - 1))

The corresponding p-value is a test of the null
hypothesis that all the regression coefficients
(β1, β2, ..., βk) are zero, i.e., that the regression
equation has no explanatory power. It is, in
effect, a test of the statistical significance of
R2.
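A sketch of the same test with placeholder sums of
squares, using scipy for the p-value:

  from scipy import stats

  # Placeholder values: regression and error sums of squares, n cases, k variables
  ssr, sse, n, k = 95.0, 5.0, 30, 3

  f_stat = (ssr / k) / (sse / (n - k - 1))
  # Probability of an F this large if all coefficients were really zero
  p_value = stats.f.sf(f_stat, k, n - k - 1)
  print(f_stat, p_value)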
7
Types of Explanatory Variables
  • In ordinary least squares, the response or
    dependent variable must be continuous
  • Explanatory variables are usually continuous,
    but
  • categorical variables can be included by using
    dummy and interaction variables
  • proportions (0 to 1) can be included by
    converting them into odds, p/(1-p), which range
    from 0 to ∞, or into the log of the odds, which
    ranges from -∞ to ∞ (see the sketch below)
  • We can also use various transformations to
    linearize or stabilize variance
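A one-line sketch of the odds and log-odds transform
for proportions (hypothetical values):

  import numpy as np

  p = np.array([0.10, 0.25, 0.50, 0.80])   # hypothetical proportions in (0, 1)
  odds = p / (1 - p)                        # ranges over (0, infinity)
  log_odds = np.log(odds)                   # ranges over (-infinity, +infinity)
  print(odds, log_odds)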

8
Dummy Variables
  • Use dummy variables to include categorical
    variables that have a constant effect on y
    (independent of other xj).
  • For example, if salary (y) is a function of
    experience (x1):
  • y = α + β1x1
  • To test for gender bias:
  • y = α + β1x1 + β2x2
  • where x2 = 0 if female, x2 = 1 if male; β2 is the
    difference in salary, independent of experience
    (x1). For females,
  • y = a + b1x1
  • and for males
  • y = a + b1x1 + b2 = (a + b2) + b1x1
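A sketch of fitting the gender-bias model in
Python/numpy, with hypothetical salary data (x2 = 1
for male, 0 for female):

  import numpy as np

  exper  = np.array([1, 3, 5, 7, 2, 4, 6, 8], dtype=float)     # x1, years
  male   = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)     # x2, dummy
  salary = np.array([42, 48, 55, 61, 45, 52, 58, 66], dtype=float)

  X = np.column_stack([np.ones_like(exper), exper, male])
  a, b1, b2 = np.linalg.lstsq(X, salary, rcond=None)[0]

  # b2 estimates the male-female salary difference, holding experience constant
  print(a, b1, b2)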

9
(No Transcript)
10
Dummy Variables
  • To test for bias, test the null hypothesis β2 = 0.
  • If the variable has m values, use (m - 1) dummy
    variables. For example, if salary depends on
    experience (x1) and education (BS, MS, or PhD):
  • y = a + b1x1 + b2x2 + b3x3
  • where x2 = 1 for an MS (0 otherwise), x3 = 1 for a
    PhD (0 otherwise), and BS is the baseline category.

11
Interaction Variables
  • Suppose that the gender gap varies with the number
    of years of experience. To include this, use an
    interaction variable, x3 = x1x2:
  • y = a + b1x1 + b2x2 + b3(x1x2)
  • where x1 is years of experience and x2 is gender.
  • For women (x2 = 0),
  • y = a + b1x1
  • while for men (x2 = 1),
  • y = a + b1x1 + b2 + b3x1 = (a + b2) + (b1 + b3)x1
  • b2 is the difference in starting salary; b3 is the
    difference in the rate at which salary increases
    with experience.
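A sketch of the interaction model: the x3 column is
simply the product x1*x2, added before fitting (data
again hypothetical):

  import numpy as np

  exper  = np.array([1, 3, 5, 7, 2, 4, 6, 8], dtype=float)     # x1
  male   = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)     # x2
  salary = np.array([42, 48, 55, 61, 44, 53, 62, 71], dtype=float)

  interaction = exper * male                                    # x3 = x1 * x2
  X = np.column_stack([np.ones_like(exper), exper, male, interaction])
  a, b1, b2, b3 = np.linalg.lstsq(X, salary, rcond=None)[0]

  # b2: difference in starting salary; b3: difference in slope with experience
  print(a, b1, b2, b3)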

12
(No Transcript)
13
Dummy Var & Log Transformation
  • y = LN(salary), x1 = experience, x2 = 1 if male
  • If female (x2 = 0), then LN(salary) = a + b1x1
  • If male (x2 = 1), then LN(salary) = a + b1x1 + b2

If β2 = 0.05, e^β2 ≈ 1.05: men's salary is about 5%
higher. If β2 = -0.05, e^β2 ≈ 0.95: men's salary is
about 5% lower.
14
Dummy Var & Log-Log Transformation
  • y = LN(salary), x1 = experience, x2 = 1 if male
  • If female (x2 = 0), then LN(salary) = a + b1LN(x1)
  • If male (x2 = 1), then LN(salary) = a + b1LN(x1) + b2
  • As before, men's salaries are a constant factor
    e^β2 higher/lower than women's salaries (all else
    equal); if β2 << 1, then men's salaries are about
    100β2 percent higher/lower than women's salaries.

15
Inferences about Regression Coefficients
  • The standard errors of the regression coefficients,
    sb, are too complicated to calculate by hand; the
    values appear in the Excel output
  • The standard errors can be used to
  • construct confidence intervals for βj: bj ±
    tα,df · sbj
  • test the null hypothesis that βj = 0 (no
    association between xj and y, taking into
    account the other explanatory variables in the
    model)
  • Excel output contains confidence intervals and
    p-values for each bj
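As a sketch of where the Excel numbers come from, the
standard errors, t tests, and confidence intervals
follow from the usual OLS formulas (hypothetical data;
numpy and scipy):

  import numpy as np
  from scipy import stats

  x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
  x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
  y  = np.array([3.1, 4.2, 7.9, 8.1, 12.2, 12.0, 15.8, 16.1])

  X = np.column_stack([np.ones_like(x1), x1, x2])
  n, p = X.shape                        # p = k + 1 coefficients
  df = n - p

  b = np.linalg.lstsq(X, y, rcond=None)[0]
  resid = y - X @ b
  se2 = (resid ** 2).sum() / df         # residual variance
  cov_b = se2 * np.linalg.inv(X.T @ X)  # covariance matrix of the coefficients
  sb = np.sqrt(np.diag(cov_b))          # standard errors of a, b1, b2

  t_stats = b / sb
  p_values = 2 * stats.t.sf(np.abs(t_stats), df)   # test of beta_j = 0
  t_crit = stats.t.ppf(0.975, df)                  # 95% confidence intervals
  ci = np.column_stack([b - t_crit * sb, b + t_crit * sb])
  print(b, sb, p_values, ci)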

16
Uncertainty in Predictions
  • The standard error for a predicted value of y is
    too complicated to calculate by hand.
  • Use Tools/Data Analysis Plus/Prediction Interval.
  • Or shift the variables and use sa, the standard
    error of the intercept.
  • For example, if (x1, x2) = (10, 20), redo the
    regression with z1 = x1 - 10, z2 = x2 - 20; the
    new intercept is the predicted y at (10, 20) and
    sa is its standard error.

17
Other Ways to Estimate Errors
  • The variation in b from sample to sample
    sometimes exceeds predictions based on sb
  • Particularly true in multiple regression, when
    variables are included based on how well they fit
    the sample data. Several ways to deal with this:
  • Jackknife: compute the regression n times, each
    time omitting one case. The standard error of b
    is

    sb = sqrt( ((n - 1)/n) Σ (bi - b̄)² )

where bi is the computed slope when case i is
omitted and b̄ is the average of the bi.
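A sketch of the jackknife for a single slope, using a
simple one-variable regression to keep it short
(hypothetical data):

  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
  y = np.array([2.9, 5.2, 6.8, 9.1, 11.2, 12.8, 15.3, 16.9])
  n = len(x)

  def slope(xs, ys):
      X = np.column_stack([np.ones_like(xs), xs])
      return np.linalg.lstsq(X, ys, rcond=None)[0][1]

  # Recompute the slope n times, each time omitting one case
  b_i = np.array([slope(np.delete(x, i), np.delete(y, i)) for i in range(n)])

  # Jackknife standard error of b
  sb_jack = np.sqrt((n - 1) / n * ((b_i - b_i.mean()) ** 2).sum())
  print(sb_jack)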
18
Other Ways to Estimate Errors
  • Bootstrap: select a random sample of size n (with
    replacement) from the original sample of size n
    and compute the regression; repeat hundreds of
    times; compute sb from the estimates of b, as with
    the jackknife. (Variation in b is possible because
    each case can be selected more than once, or not
    selected at all.)
  • Cross-validation: omit a case and compute the
    regression; use the regression to predict y for
    the omitted case and note the estimation error;
    repeat for all cases; the prediction standard
    error is given by the standard deviation of the
    errors.
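A sketch of the bootstrap version; the 500
replications and the data are arbitrary choices for
illustration:

  import numpy as np

  rng = np.random.default_rng(0)
  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
  y = np.array([2.9, 5.2, 6.8, 9.1, 11.2, 12.8, 15.3, 16.9])
  n = len(x)

  def slope(xs, ys):
      X = np.column_stack([np.ones_like(xs), xs])
      return np.linalg.lstsq(X, ys, rcond=None)[0][1]

  # Resample n cases with replacement, refit, repeat hundreds of times
  boot_b = []
  for _ in range(500):
      idx = rng.integers(0, n, size=n)   # sample case indices with replacement
      boot_b.append(slope(x[idx], y[idx]))

  sb_boot = np.std(boot_b, ddof=1)       # spread of b across bootstrap samples
  print(sb_boot)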

19
Validation with Data Splitting
  • As explained above, the best-fit model can
    underestimate errors, since it is fit to the data
  • The jackknife, bootstrap, and cross validation
    methods can obtain more realistic estimates of
    errors, but are convenient only with software
    that can automatically perform these
    manipulations
  • A quick and dirty way to validate a model is
    to randomly split the data set into two parts. Run
    a regression for one part, and use the result to
    produce predictions and residuals for the other
    part. If the standard deviation of those residuals
    is roughly equal to se, then the model is OK.
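A sketch of that split-sample check with simulated
data (all numbers hypothetical):

  import numpy as np

  rng = np.random.default_rng(1)
  x = rng.uniform(0, 10, size=40)
  y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=40)      # hypothetical data

  # Randomly split the cases into two halves
  idx = rng.permutation(len(x))
  fit_idx, val_idx = idx[:20], idx[20:]

  # Fit on one half and compute its se (k = 1 explanatory variable)
  X_fit = np.column_stack([np.ones(20), x[fit_idx]])
  coef = np.linalg.lstsq(X_fit, y[fit_idx], rcond=None)[0]
  se = np.sqrt(((y[fit_idx] - X_fit @ coef) ** 2).sum() / (20 - 1 - 1))

  # Predict the other half and compare the residual spread to se
  X_val = np.column_stack([np.ones(20), x[val_idx]])
  val_resid = y[val_idx] - X_val @ coef
  print(se, val_resid.std(ddof=1))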

20
Multicollinearity
  • The p-value for F can be low even if the p-values
    for the individual βj are high, if the xj are
    correlated with each other.
  • Multicollinearity makes it difficult to separate
    the effect of x1 on y from the effect of x2 on y.
  • Multicollinearity can be eliminated by choosing
    values of x1, x2, ..., xk randomly, over a broad
    range. But often we must take data as they come.
  • Examples: air pollution (SO2, NOx, and PM) v.
    mortality; diet (meat and fiber) v. intestinal
    cancer.
  • Multicollinearity can be reduced by eliminating
    redundant independent variables

21
Multicollinearity
22
Correlation Matrix
  • To check for multicollinearity (and assist in
    model-building), construct a correlation matrix.
  • See which xj are most correlated with y, and
    which xj are strongly correlated with each other.
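A sketch of building the matrix with numpy; each row
passed to np.corrcoef is treated as one variable (the
data are simulated so that x2 is deliberately
collinear with x1):

  import numpy as np

  rng = np.random.default_rng(2)
  x1 = rng.normal(size=50)
  x2 = 0.8 * x1 + rng.normal(scale=0.3, size=50)   # strongly correlated with x1
  x3 = rng.normal(size=50)
  y  = 1.0 + 2.0 * x1 + 0.5 * x3 + rng.normal(size=50)

  corr = np.corrcoef(np.vstack([y, x1, x2, x3]))
  # Look for xj highly correlated with y, and xj highly correlated with each other
  print(np.round(corr, 2))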

23
Model Building
  • The key to regression analysis is to properly
    specify the model. You want to include all the
    important variables and omit any extraneous ones
    (variables that don't significantly improve the
    explanatory power of the model).
  • A model that omits an important variable is
    misspecified; one that has only the variables
    necessary to accurately describe the data is
    parsimonious.
  • How do you decide which variables to include and
    which to omit?

24
Model Building
  • Theory: ideally, we have a theory that describes
    or predicts the factors that determine y. But
    often we don't have a complete theory, or the
    proper data cannot be collected.
  • Intuition: lacking a solid theory, we can test
    various plausible explanations for the factors
    that determine y.
  • Data availability: too often, people simply
    assemble all the data that might possibly be
    relevant and use regression to discover the
    causes of y. But spurious correlation is a
    problem, particularly for small data sets.

25
Include/Exclude Decisions
  • Once you have assembled data for candidate
    explanatory variables, there are two approaches
    to decide which to include in the final model:
  • Bottom-up or forward: begin with the explanatory
    variable having the lowest p-value; then add the
    explanatory variable having the lowest p-value in
    a two-variable regression, and so on, until no
    variable can be added that would have a p-value
    below the given threshold (e.g., 0.05).

26
Include/Exclude Decisions
  • Stepwise: like the forward procedure, except
    deletions are considered along the way.
  • Data Analysis Plus/Stepwise Regression. Use
    p-values for include/exclude decisions.
  • Top-down or backward: begin with all the
    explanatory variables and delete the one with the
    largest p-value; continue until all remaining
    variables have a p-value below the given threshold.
  • Use the OLS Regression tool available on the web
    page.

27
Include/Exclude Decisions
  • In some cases, a set of explanatory variables
    form a logical group (e.g., dummy variables for
    race, education, department, etc.)
  • It is common (but not necessary) to include or
    exclude all of the variables in the group
  • Use the partial F test to test the null
    hypothesis that the entire set of variables has
    no explanatory power.
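A sketch of the partial F test with placeholder sums
of squares: SSE_full comes from the model with the
group of q variables, SSE_reduced from the model
without it:

  from scipy import stats

  sse_reduced, sse_full = 120.0, 95.0   # placeholder error sums of squares
  n, k, q = 60, 8, 3                    # cases, variables in full model, group size

  f_stat = ((sse_reduced - sse_full) / q) / (sse_full / (n - k - 1))
  p_value = stats.f.sf(f_stat, q, n - k - 1)
  print(f_stat, p_value)                # small p: the group has explanatory power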

28
Include/Exclude Decisions
  • To decide which variables to include in the
    model, we can aim for a model with
  • the highest adjusted R2
  • the lowest se
  • the lowest p-value for F
  • all regression coefficients with t > tc or p < α
  • fewest explanatory variables
  • There is no right way to select the best model.
    It depends on the problem. Many times, different
    paths lead to the same solution.

29
Tainted Variables
  • Beware of tainted variables when building
    models. A variable is tainted if it is, like y, a
    consequence of the other explanatory variables,
    rather than a cause of y
  • For example, rank might explain differences in
    salary. Although rank is correlated with salary,
    it should be regarded, like salary, as a measure
    or consequence of job performance, not a cause of
    job performance. Rank might be an alternative
    dependent variable, but not an independent
    variable.

30
Analysis of Residuals
  • Plot the residuals as a function of the
    fitted/predicted response variable and look for
    outliers or signs of heteroscedasticity
  • Plot the residuals v. each explanatory variable,
    and look for signs of nonlinearity
  • Make a histogram and look for non-normality
  • In the case of time series data (even if time is
    not an explanatory variable in the model), plot
    the residuals v. time and look for signs of
    autocorrelation; compute the Durbin-Watson statistic
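Once the residuals are in time order, the
Durbin-Watson statistic is easy to compute directly (a
sketch with hypothetical residuals; values near 2
suggest little autocorrelation):

  import numpy as np

  # Hypothetical residuals, in time order
  e = np.array([0.5, 0.4, 0.1, -0.2, -0.4, -0.1, 0.2, 0.3, -0.3, 0.1])

  # DW = sum of squared successive differences / sum of squared residuals
  dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
  print(dw)   # ranges from 0 to 4; near 2 means little autocorrelation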

31
(No Transcript)