Transcript and Presenter's Notes

Title: Multiple Regression


1
Multiple Regression
  • Not Multivariate Regression (that term means several dependent variables;
    here we add several independent variables but keep one dependent variable)

2
Why more than two?
  • We often think multiple factors affect the
    dependent variable. Getting it done in one
    regression seems easier than doing a bunch of
    regressions with two variables
  • Usually, though, it is more than a matter of convenience;
    it is a matter of necessity if we want correct
    estimates. Consider this:

3
Salary and Height?
4
Salary and Height?
5
Spurious Relationships
  • This relationship is said to be spurious
  • When we did the bivariate relationship, we said that height drives salary
  • In reality, things looked like this:

[Diagram 1 (what we said): Height → Salary]
[Diagram 2 (reality): Gender → Height and Gender → Salary]
6
So what happened?
  • Statistics don't determine causality; we do.
  • Regression just saw the connection between height
    and salary. It doesn't tell us why that connection exists
  • We know it exists because gender is driving both
    factors

7
Getting things under Control
  • If a third factor is correlated with your
    dependent variable and with an independent variable,
    you must control for it.
  • Instead of  y = a + bx + e
  • we need  y = a + b1x1 + b2x2 + e

8
How do we get there?
We still want to minimize the sum of squared errors:
Σ (yi - a - b1x1i - b2x2i)²
9
Take the partial derivatives of that sum, set them to 0, and solve for a, b1, b2
(the closed-form slope for the two-regressor case is sketched below).
Notes:
1. If x1 = x2, then the denominator = 0. The variables cannot be exactly the
   same (collinear).
2. Multiple regression controls automatically for the correlation between x1
   and x2.
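The slide's own formulas did not survive the transcript. As a sketch of what solving those equations gives in the two-regressor case (standard OLS algebra, stated here rather than taken from the slide), the slope on x1 is

\[
b_1 = \frac{\left[\sum (x_{1i}-\bar{x}_1)(y_i-\bar{y})\right]\sum (x_{2i}-\bar{x}_2)^2 - \left[\sum (x_{2i}-\bar{x}_2)(y_i-\bar{y})\right]\sum (x_{1i}-\bar{x}_1)(x_{2i}-\bar{x}_2)}{\sum (x_{1i}-\bar{x}_1)^2 \sum (x_{2i}-\bar{x}_2)^2 - \left[\sum (x_{1i}-\bar{x}_1)(x_{2i}-\bar{x}_2)\right]^2}
\]

with the symmetric expression for b2. If x1 = x2, the denominator collapses to zero (Note 1), and because the cross-product term between x1 and x2 appears explicitly, the correlation between the two regressors is adjusted for automatically (Note 2).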
10
Example
  • We previously said Turnout = a + b(Education) + e
  • But we also know that education and income are
    related.

[Diagram: Education → Turnout]
We need to have both education and income in the
regression to see what the true effects are
[Diagram: Education → Turnout, with Income related to both Education and Turnout]
11
We need an equation like
Turnout = a + b1(Education) + b2(Income) + e
Regression will automatically control for the
correlation between education and income!
12
  • Type regress turnout diplomau mdnincm

. regress turnout diplomau mdnincm

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  2,   423) =   31.37
       Model |  1.4313e+11     2  7.1565e+10           Prob > F      =  0.0000
    Residual |  9.6512e+11   423  2.2816e+09           R-squared     =  0.1291
-------------+------------------------------           Adj R-squared =  0.1250
       Total |  1.1082e+12   425  2.6076e+09           Root MSE      =   47766

------------------------------------------------------------------------------
     turnout |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    diplomau |   1101.359   504.4476     2.18   0.030      109.823    2092.895
     mdnincm |   1.111589   .4325834     2.57   0.011      .261308    1.961869
       _cons |   154154.4   9641.523    15.99   0.000     135203.1    173105.6
------------------------------------------------------------------------------
How do we interpret this?
13
Interpretation
  • Interpretation of multiple regression requires care
  • First, a 1 unit change in education only
    generates a 1,101 person increase in turnout if we
    hold median income constant. (Why?)
  • Second, as before, interpretation only makes
    sense in terms of the expected value of y
    (remember, E(y | x) = a + b1x1 + b2x2). Thus, we
    don't always see exactly a 1,101 person increase.
    Because this is the average of some distribution
    around E(y | x), we say "on average" (see the
    Stata sketch below)
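A minimal Stata sketch of the "expected value" point (the variable names yhat and ehat are mine, not from the slides): predict yhat, xb stores the fitted values, which estimate E(y | x); predict ehat, residuals stores the gap between observed turnout and the fitted value; and summarizing the residuals should show a mean of essentially zero.

. quietly regress turnout diplomau mdnincm
. predict yhat, xb
. predict ehat, residuals
. summarize ehat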

14
Instead of drawing a line
  • Now that there are two variables, we essentially
    fit a plane to the data points
  • It's like sticking a piece of paper at some angle
    through a cube of data points

15
(No Transcript)
16
What about 3 or more variables?
  • This gets too messy to do by hand
  • It follows the same process: take partial derivatives
    with respect to a and each bn, set them equal to
    zero, and solve
  • An alternative method, matrix algebra, makes it
    very easy
  • We can go a long way without that; we'll do it
    later
  • In the meantime, Stata will take care of things

17
Let's add several more controls
  • Add spending, age65, and black

. regress turnout diplomau mdnincm spending age65 black

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  5,   420) =   27.69
       Model |  2.7474e+11     5  5.4948e+10           Prob > F      =  0.0000
    Residual |  8.3350e+11   420  1.9845e+09           R-squared     =  0.2479
-------------+------------------------------           Adj R-squared =  0.2390
       Total |  1.1082e+12   425  2.6076e+09           Root MSE      =   44548

------------------------------------------------------------------------------
     turnout |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    diplomau |   1473.793   473.9758     3.11   0.002     542.1332    2405.454
     mdnincm |   .3228469    .421679     0.77   0.444    -.5060173    1.151711
    spending |   .0052333   .0017983     2.91   0.004     .0016985    .0087681
       age65 |   .2137993   .0997664     2.14   0.033     .0176956    .4099029
       black |  -.1665995   .0255038    -6.53   0.000    -.2167304   -.1164685
       _cons |   164964.6   13711.48    12.03   0.000     138012.9    191916.2
------------------------------------------------------------------------------
18
Goodness of fit in multiple regression
  • Our options for measuring goodness of fit are the
    same; only the formulas change slightly
  • The basic pieces are still the same: we want to know
    how close ŷi is to yi

Total Sum of Squares (TSS) = Σ(yi - ȳ)²
Residual Sum of Squares (RSS) = Σ(yi - ŷi)²
Regression (Model) Sum of Squares = TSS - RSS
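A small Stata sketch (assuming the five-variable regression above) that pulls these pieces out of the stored results: after regress, e(mss) and e(rss) hold the model and residual sums of squares, so the TSS, RSS, and r2 above can be recovered directly.

. quietly regress turnout diplomau mdnincm spending age65 black
. display "TSS = " e(mss) + e(rss)
. display "RSS = " e(rss)
. display "r2  = " 1 - e(rss)/(e(mss) + e(rss))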
19
RMSE (Std. Err. of regression)
  • Before (for the bivariate case): RMSE = sqrt[ RSS / (n - 2) ]
  • Now (for any number k of IVs): RMSE = sqrt[ RSS / (n - k - 1) ]

It takes 1 d.f. to estimate a and each bn
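As a check against the output on the previous slides (my arithmetic, not the slide's): with RSS = 8.3350e+11, n = 426, and k = 5,

\[
\text{RMSE} = \sqrt{\frac{RSS}{n-k-1}} = \sqrt{\frac{8.3350\times 10^{11}}{420}} \approx 44548,
\]

which matches the Root MSE Stata reports.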
20
r2
  • The formula remains the same: r2 = 1 - RSS/TSS
  • However, with multiple variables, adjusted r2 is
    preferred: adj. r2 = 1 - [RSS/(n - k - 1)] / [TSS/(n - 1)]
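A worked check against the same five-variable output (my arithmetic):

\[
r^2 = 1-\frac{RSS}{TSS} = 1-\frac{8.3350\times 10^{11}}{1.1082\times 10^{12}} \approx 0.2479,
\qquad
\text{adj. } r^2 = 1-\frac{RSS/(n-k-1)}{TSS/(n-1)} = 1-\frac{8.3350\times 10^{11}/420}{1.1082\times 10^{12}/425} \approx 0.2390,
\]

both of which match Stata's R-squared and Adj R-squared.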

21
Final notes on r2 and fit
  • While r2 may be on the same scale across all
    regressions, in some instances a low r2 may be
    more acceptable: some things are hard to predict
  • r2 is not really a true statistic: there
    is no standard error or measure of uncertainty
    for it, even though it is based on a sample
  • Explaining variation is not always the same as
    explaining the political world: use judgment and
    theory too (not just fit)
  • Consequence: Use adj. r2, but with caution

22
Assessing the relative importance of multiple
coefficients
  • Which variable is most important?

. regress turnout diplomau mdnincm spending age65 black

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  5,   420) =   27.69
       Model |  2.7474e+11     5  5.4948e+10           Prob > F      =  0.0000
    Residual |  8.3350e+11   420  1.9845e+09           R-squared     =  0.2479
-------------+------------------------------           Adj R-squared =  0.2390
       Total |  1.1082e+12   425  2.6076e+09           Root MSE      =   44548

------------------------------------------------------------------------------
     turnout |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    diplomau |   1473.793   473.9758     3.11   0.002     542.1332    2405.454
     mdnincm |   .3228469    .421679     0.77   0.444    -.5060173    1.151711
    spending |   .0052333   .0017983     2.91   0.004     .0016985    .0087681
       age65 |   .2137993   .0997664     2.14   0.033     .0176956    .4099029
       black |  -.1665995   .0255038    -6.53   0.000    -.2167304   -.1164685
       _cons |   164964.6   13711.48    12.03   0.000     138012.9    191916.2
------------------------------------------------------------------------------
23
Assessing the relative importance of multiple
coefficients
  • A 1 unit change in diplomau makes a difference of
    over 1,000 votes, on average, holding all other
    factors constant
  • But changing diplomau by 1 unit is hard
  • A 1 unit change in mdnincm only yields an
    increase of .3 votes
  • And increasing mdnincm by 1 is not hard

24
Approach 1: Standardized Coefficients
  • There is relatively little variance in the
    percentage of university diploma holders (about
    5-50), a change of 45 points in the scale.
  • There is a lot of variance in mdnincm (about
    15,000 to 55,000), a change of 40,000 points in
    the scale
  • If we convert all of the variables to units of
    their standard deviation, we answer the question:
    how much change in y (in std. devs.) for a 1
    standard deviation change in x?

25
Standardized Coefficients
  • Standardized coefficients are also colloquially
    referred to as "standardized betas" or just "betas"
  • In Stata, add ", beta" to the end of the regress command

. regress turnout diplomau mdnincm spending age65 black, beta

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  5,   420) =   27.69
       Model |  2.7474e+11     5  5.4948e+10           Prob > F      =  0.0000
    Residual |  8.3350e+11   420  1.9845e+09           R-squared     =  0.2479
-------------+------------------------------           Adj R-squared =  0.2390
       Total |  1.1082e+12   425  2.6076e+09           Root MSE      =   44548

------------------------------------------------------------------------------
     turnout |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
    diplomau |   1473.793   473.9758     3.11   0.002                 .2315058
     mdnincm |   .3228469    .421679     0.77   0.444                 .0591382
    spending |   .0052333   .0017983     2.91   0.004                 .1253994
       age65 |   .2137993   .0997664     2.14   0.033                 .0932332
       black |  -.1665995   .0255038    -6.53   0.000                -.2973935
       _cons |   164964.6   13711.48    12.03   0.000                        .
------------------------------------------------------------------------------
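The Beta column is just the raw coefficient rescaled by the sample standard deviations of x and y. A worked check (my arithmetic, using the standard deviations from the summarize output a few slides below):

\[
\beta_{\text{diplomau}} = b_{\text{diplomau}} \cdot \frac{s_{\text{diplomau}}}{s_{\text{turnout}}} = 1473.793 \times \frac{8.018776}{51065.02} \approx .231,
\]

close to the .2315 Stata reports; the small gap arises because those standard deviations are computed on all 427 observations while the regression uses the 426 with nonmissing turnout.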
26
Why this is Potentially Bad
  • A 1 standard deviation change in one variable
    might not be conceptually equal to a 1 standard
    deviation change in another (especially if the
    variables are distributed asymmetrically)
  • We have only the sample std. dev., and samples
    vary. Standardized coefficients are not
    comparable across regressions based on different
    samples
  • When we get to the point where we cram nominal
    and ordinal IVs in, this doesn't work well
  • Result: Generally avoid using these, but you are
    likely to see them as you read older work

27
Approach 2: Use your Theory
  • Only theory and sound research design, in concert
    with your statistical analysis, determine which
    factors are really the most important.
  • There is no perfect statistical answer to this
    question

28
Approach 3: If you must
  • Rescale all of your variables to range between
    0 and 1 (see the formula below):
  • Add or subtract some number to every observation
    to make the lowest value of the variable 0
  • Then divide every observation by the new highest
    value
  • The coefficients then tell you the effect of changing
    across the whole range of the variable
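In formula form (a restatement of the steps above, not from the slide), each variable x is replaced by

\[
x^{01}_i = \frac{x_i - \min(x)}{\max(x) - \min(x)},
\]

so the rescaled variable equals 0 at its sample minimum and 1 at its sample maximum.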

29
. summarize turnout diplomau mdnincm spending age65 black

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     turnout |       426    216568.7    51065.02      68770     401389
    diplomau |       427     20.1171    8.018776        5.3       51.4
     mdnincm |       427    36176.92    9356.165      16683      64199
    spending |       427     1165419     1222216          0   1.12e+07
       age65 |       427    71750.04    22264.35          7     174436
       black |       427    67012.02     91091.2        719     430627

. generate turnout01 = (turnout-68770)/(401389-68770)
. generate diplomau01 = (diplomau-5.3)/(51.4-5.3)
. generate mdnincm01 = (mdnincm-16683)/(64199-16683)
. generate spending01 = spending/11240972
. generate age6501 = (age65-7)/(174436-7)
. generate black01 = (black-719)/(430627-719)

. summarize turnout01 diplomau01 mdnincm01 spending01 age6501 black01

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   turnout01 |       426    .4443484    .1535241          0          1
  diplomau01 |       427    .3214121    .1739431          0          1
   mdnincm01 |       427      .41026    .1969056          0          1
  spending01 |       427     .103676    .1087287          0          1
     age6501 |       427    .4113023    .1276414          0          1
     black01 |       427    .1542028    .2118853          0          1
30
And the results
. regress turnout01 diplomau01 mdnincm01 spending01 age6501 black01

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  5,   420) =   27.69
       Model |  2.48331225     5  .496662451           Prob > F      =  0.0000
    Residual |  7.53378351   420   .01793758           R-squared     =  0.2479
-------------+------------------------------           Adj R-squared =  0.2390
       Total |  10.0170958   425  .023569637           Root MSE      =  .13393

------------------------------------------------------------------------------
   turnout01 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  diplomau01 |   .2042633   .0656916     3.11   0.002      .075138    .3333887
   mdnincm01 |     .04612   .0602386     0.77   0.444    -.0722867    .1645267
  spending01 |   .1768615    .060774     2.91   0.004     .0574024    .2963206
     age6501 |   .1121186   .0523186     2.14   0.033     .0092798    .2149575
     black01 |  -.2153288   .0329635    -6.53   0.000    -.2801227   -.1505348
       _cons |   .3285244     .03113    10.55   0.000     .2673344    .3897144
------------------------------------------------------------------------------
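A worked check (my arithmetic) that the 0-1 coefficients are the original coefficients stretched by the ranges of x and y:

\[
b^{01}_{\text{diplomau}} = b_{\text{diplomau}} \cdot \frac{\max(\text{diplomau})-\min(\text{diplomau})}{\max(\text{turnout})-\min(\text{turnout})}
= 1473.793 \times \frac{51.4-5.3}{401389-68770} \approx .2043,
\]

which matches the diplomau01 coefficient: it is the expected change in turnout, as a share of turnout's full range, from moving diplomau across its full range.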
31
Assumptions of OLS Regression
  • 1. Linearity: The regression model is linear in the
    parameters

Not a model in which the parameters enter nonlinearly (examples below)
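The slide's example equations did not survive the transcript; as illustrative examples (standard ones, not necessarily the slide's own):

\[
y = a + b_1 x_1 + b_2 x_1^2 + e \quad\text{is fine (nonlinear in } x_1\text{, linear in the parameters)},
\]
\[
y = a + x_1^{\,b_1} + e \quad\text{is not (the parameter } b_1 \text{ enters nonlinearly)}.
\]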
32
Assumptions of OLS Regression
  • 2. Y is conditional upon Xi
  • - Xi are fixed in repeated sampling
  • - We assume X is exogenous
  • - This is violated when y causes x instead of x
    causing y

33
Assumptions of OLS Regression
  • 3. The error term has a mean of zero: E(e) = 0
  • Two key consequences:
  • a. E(y | x) = a + b1x1 + b2x2
  •    Otherwise we could not drop the e
  •    when we estimate the model.
  • b. The errors around E(y | x) are random

34
When is this violated?
  • When x is correlated with the error term
  • When a variable is omitted that is correlated
    with x and with y
  • When x and y are endogenous

35
Assumptions of OLS Regression
  • 4. Homoskedastic variance
  • Homo = same/equal; skedannumi = to disperse or
    scatter
  • The errors around the regression line have a
    mean of zero (by assumption 3). But do they all
    have the same variance around the line? If not,
    we have heteroskedastic variance.
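In symbols (my notation, consistent with the error term e used above), homoskedasticity requires

\[
\operatorname{Var}(e_i \mid x_i) = \sigma^2 \ \text{for every observation } i,
\]

whereas heteroskedasticity means the error variance \(\sigma_i^2\) changes from observation to observation (growing with x, for example).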

36
Heteroskedasticity (bad)
Homoskedasticity (good)
37
Homoskedasticity
  • When is this violated?
  • When observations in one category (or at some
    range of the data) have more error around them than
    observations in another category
  • Example: Suppose we are trying to predict the
    sales in a firm by the number of salespeople in
    that firm. When there are 0 salespeople, there
    are 0 sales; with 1 salesperson, there are a few
    sales, and probably not a great deal of variance
    around predicted sales. With many salespeople,
    there is greater variance
  • Example: We are studying support for lotteries
    across states. There will be greater variance
    around the predicted support in heterogeneous
    states than in homogeneous states.

38
Assumptions of OLS Regression
  • 5. No autocorrelation between the errors
  • - Also called "no serial correlation"
  • - It says that the errors in predicting y for
    two different levels of x are not correlated
  • - This is most often violated when we have data
    observed over time.
  • - If drawing a high value of y from the
    conditional distribution of y at one value of x
    meant we also tended to draw a high value of y at
    another value of x, the errors would be
    correlated: Cov(ei, ej) ≠ 0

39
Assumptions of OLS Regression
  • 6. No correlation between ei and xi
  • - This is automatically true if assumptions 2 and
    3 hold
  • - If ei and xi have separate, unrelated effects
    on y, we can estimate the effect of x on y even
    if we do not observe e
  • - If ei and xi are correlated, we would
    have to observe ei to separate the effects

[Path diagram with variables: Education, Income, Region, Own biz., Turnout, F.T. for Bush]
40
Assumptions of OLS Regression
  • 7. n > k
  • - You must have more observations than
    independent variables
  • 8. Variability in x
  • - You must have at least two values of x.
    Otherwise it's a constant, not a variable
  • - Side note: All else being equal, more variance
    in x means better estimates.

41
Assumptions of OLS Regression
  • 9. The regression model is correctly specified
  • - The right variables are in the model
  • - The functional form is correct
  • 10. No perfect multicollinearity
  • - No variable is a linear combination of another
    variable (or variables)
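A minimal Stata sketch of what perfect multicollinearity looks like in practice (the variable name edu_plus_inc is mine, purely illustrative):

. generate edu_plus_inc = diplomau + mdnincm
. regress turnout diplomau mdnincm edu_plus_inc

Because edu_plus_inc is an exact linear combination of the other two regressors, Stata drops one of the three and notes that it was omitted because of collinearity; as on slide 9, the relevant denominator is zero, so the coefficients cannot be separately estimated.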

42
Regression in Practice
  • Ansolabehere and Gerber (1994)
  • yi = a + b1(Chal. spend.) + b2(Inc. spend.) + b3(...) + e
  • Problem in measuring campaign spending for
    incumbents:
  • Incumbents in close campaigns spend tons of money
    (because it is a close race)
  • Incumbents who are not threatened spend very
    little money
  • They claim a different problem:
  • Incumbents don't spend money on campaigning

43
(No Transcript)
44
Results
  • Is there a problem with mismeasuring campaign
    spending (as they claim)?
  • Does correctly measuring spending resolve the
    endogeneity problem?