Title: Multiple Regression
1. Multiple Regression
- Not Multivariate Regression
2. Why more than two?
- We often think multiple factors affect the dependent variable. Getting it done in one regression seems easier than doing a bunch of regressions with two variables.
- Usually, it is more than a matter of convenience; it is a matter of necessity if we want correct estimates. Consider this:
3. Salary and Height?
4. Salary and Height?
5. Spurious Relationships
- This relationship is said to be spurious.
- When we did the bivariate relationship, we said: Height → Salary
- In reality, things looked like this: Gender → Height and Gender → Salary
6. So what happened?
- Statistics don't determine causality; we do.
- Regression just saw the connection between height and salary. It doesn't tell us why that exists.
- We know it exists because gender is driving both factors.
7. Getting things under Control
- If a third factor is correlated with your dependent variable and an independent variable, you must control for it.
- Instead of y = a + b1x1 + e
- we need y = a + b1x1 + b2x2 + e
8. How do we get there?
We still want to minimize the errors: min Σ(yi − a − b1x1i − b2x2i)²
9. Set to 0 and solve for a, b1, b2
Notes:
1. If x1 = x2, then the denominator = 0. The variables cannot be exactly the same (collinear).
2. Multiple regression controls automatically for correlation between x1 and x2.
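The slides do this in Stata; as a rough sketch in Python (made-up data, not from the slides), here is the closed-form two-predictor solution you get by setting the partial derivatives of the SSE to zero. Note the denominator D, which is exactly the quantity that goes to 0 when x1 and x2 are collinear:

```python
import random

# Illustrative sketch with simulated data: true model y = 2 + 3*x1 - 1.5*x2 + e.
random.seed(0)
n = 500
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 3.0 * u - 1.5 * v + random.gauss(0, 0.5) for u, v in zip(x1, x2)]

def mean(vals):
    return sum(vals) / len(vals)

m1, m2, my = mean(x1), mean(x2), mean(y)
# Sums of squares and cross-products in deviation-from-mean form.
S11 = sum((u - m1) ** 2 for u in x1)
S22 = sum((v - m2) ** 2 for v in x2)
S12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
S1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
S2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))

# The denominator: 0 when x1 and x2 are perfectly collinear (note 1 above).
D = S11 * S22 - S12 ** 2
b1 = (S22 * S1y - S12 * S2y) / D
b2 = (S11 * S2y - S12 * S1y) / D
a = my - b1 * m1 - b2 * m2
print(a, b1, b2)  # estimates should land near 2, 3, -1.5
```

Because b1 and b2 are solved jointly through S12, the estimate of each slope already accounts for the correlation between x1 and x2 (note 2 above).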
10. Example
- We previously said: Education → Turnout
- But we also know that education and income are related. We need to have both education and income in the regression to see what the true effects are: Education → Turnout and Income → Turnout (with Education and Income correlated)
11. We need an Equation like
Turnout = a + b1(Education) + b2(Income) + e
Regression will automatically control for the correlation between education and income!
12. Type "regress turnout diplomau mdnincm"

. regress turnout diplomau mdnincm

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  2,   423) =   31.37
       Model |  1.4313e+11     2  7.1565e+10           Prob > F      =  0.0000
    Residual |  9.6512e+11   423  2.2816e+09           R-squared     =  0.1291
-------------+------------------------------           Adj R-squared =  0.1250
       Total |  1.1082e+12   425  2.6076e+09           Root MSE      =   47766

------------------------------------------------------------------------------
     turnout |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    diplomau |   1101.359   504.4476     2.18   0.030     109.823    2092.895
     mdnincm |   1.111589   .4325834     2.57   0.011     .261308    1.961869
       _cons |   154154.4   9641.523    15.99   0.000    135203.1    173105.6
------------------------------------------------------------------------------

How do we interpret this?
13. Interpretation
- Interpretation of multiple regression requires care.
- First, a 1-unit change in education only generates a 1101-person increase in turnout if we hold median income constant. (Why?)
- Second, as before, interpretation only makes sense in terms of the expected value of y (remember, ŷ = E(y|x)). Thus, we don't always see exactly an 1101-person increase. Because this is the average of some distribution around E(y|x), we say "on average."
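To make the "holding median income constant" point concrete, here is a small sketch that plugs the coefficients from the Stata output above into the fitted equation (the example values of diplomau and mdnincm are made up for illustration):

```python
# Coefficients from the two-variable regression on the previous slide.
a, b_diploma, b_income = 154154.4, 1101.359, 1.111589

def predicted_turnout(diplomau, mdnincm):
    # Expected turnout E(y | x) from the fitted equation.
    return a + b_diploma * diplomau + b_income * mdnincm

# With mdnincm held fixed, a 1-unit change in diplomau moves the
# expected turnout by exactly the coefficient on diplomau.
diff = predicted_turnout(21, 36000) - predicted_turnout(20, 36000)
print(round(diff, 3))  # 1101.359
```

If income were allowed to change at the same time, the difference would mix both effects; fixing it isolates the education coefficient.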
14. Instead of drawing a line
- Now that there are two variables, we essentially fit a plane to the data points.
- It's like sticking a piece of paper on some angle through a square.
15. (No transcript: figure-only slide)
16. What about 3 or more variables?
- This gets too messy.
- Follows the same process: take partial derivatives with respect to a and each bn, set them equal to zero, and solve.
- An alternative method, matrix algebra, makes it very easy.
- We can go a long way without that; we'll do it later.
- In the meantime, Stata will take care of things.
17. Let's add several more controls
- Add spending, age65, and black

. regress turnout diplomau mdnincm spending age65 black

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  5,   420) =   27.69
       Model |  2.7474e+11     5  5.4948e+10           Prob > F      =  0.0000
    Residual |  8.3350e+11   420  1.9845e+09           R-squared     =  0.2479
-------------+------------------------------           Adj R-squared =  0.2390
       Total |  1.1082e+12   425  2.6076e+09           Root MSE      =   44548

------------------------------------------------------------------------------
     turnout |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    diplomau |   1473.793   473.9758     3.11   0.002    542.1332    2405.454
     mdnincm |   .3228469    .421679     0.77   0.444   -.5060173    1.151711
    spending |   .0052333   .0017983     2.91   0.004    .0016985    .0087681
       age65 |   .2137993   .0997664     2.14   0.033    .0176956    .4099029
       black |  -.1665995   .0255038    -6.53   0.000   -.2167304   -.1164685
       _cons |   164964.6   13711.48    12.03   0.000    138012.9    191916.2
------------------------------------------------------------------------------
18. Goodness of fit in multiple regression
- Our options for measuring goodness of fit are the same; the formulas change slightly.
- The basic pieces are still the same: we want to know how close ŷ is to y.

Total Sum of Squares (TSS) = Σ(yi − ȳ)²
Residual Sum of Squares (RSS) = Σ(yi − ŷi)²
TSS − RSS = Regression (Model) Sum of Squares
19. RMSE (Std. Err. of regression)
- Before (for the bivariate case): RMSE = √(RSS / (n − 2))
- Now (for any number of IVs): RMSE = √(RSS / (n − k − 1))
It takes 1 d.f. to estimate each bn.
20. r²
- The formula remains the same: r² = 1 − RSS/TSS
- However, with multiple variables, adjusted r² is preferred:
  adj. r² = 1 − [RSS/(n − k − 1)] / [TSS/(n − 1)]
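As a check on these formulas, here is a sketch that recomputes the fit statistics from the sums of squares printed in the five-variable regression earlier in the deck (TSS and RSS are taken as printed, so small rounding differences from Stata's internal values are expected):

```python
import math

# Sums of squares as printed in the Stata ANOVA table (rounded there).
TSS = 1.1082e12   # Total SS
RSS = 8.3350e11   # Residual SS
n, k = 426, 5     # observations, number of IVs

rmse = math.sqrt(RSS / (n - k - 1))                  # Root MSE
r2 = 1 - RSS / TSS                                   # r-squared
adj_r2 = 1 - (RSS / (n - k - 1)) / (TSS / (n - 1))   # adjusted r-squared

print(round(rmse), round(r2, 4), round(adj_r2, 4))
```

The results line up with the Stata output (Root MSE 44548, R-squared 0.2479, Adj R-squared ≈ 0.2390).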
21. Final notes on r² and fit
- While r² may be on the same scale across all regressions, in some instances a low r² may be more acceptable: some things are hard to predict.
- r² is not really a true statistic because there is no standard error or measure of uncertainty for it (it is based on a sample).
- Explaining variation is not always the same as explaining the political world; use judgment and theory too (not just fit).
- Consequence: use adj. r², but with caution.
22. Assessing the relative importance of multiple coefficients
- Which variable is most important?

. regress turnout diplomau mdnincm spending age65 black

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  5,   420) =   27.69
       Model |  2.7474e+11     5  5.4948e+10           Prob > F      =  0.0000
    Residual |  8.3350e+11   420  1.9845e+09           R-squared     =  0.2479
-------------+------------------------------           Adj R-squared =  0.2390
       Total |  1.1082e+12   425  2.6076e+09           Root MSE      =   44548

------------------------------------------------------------------------------
     turnout |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    diplomau |   1473.793   473.9758     3.11   0.002    542.1332    2405.454
     mdnincm |   .3228469    .421679     0.77   0.444   -.5060173    1.151711
    spending |   .0052333   .0017983     2.91   0.004    .0016985    .0087681
       age65 |   .2137993   .0997664     2.14   0.033    .0176956    .4099029
       black |  -.1665995   .0255038    -6.53   0.000   -.2167304   -.1164685
       _cons |   164964.6   13711.48    12.03   0.000    138012.9    191916.2
------------------------------------------------------------------------------
23. Assessing the relative importance of multiple coefficients
- A 1-unit change in diplomau makes a difference of over 1000 votes, on average, holding all other factors constant. But changing diplomau by 1 unit is hard.
- A 1-unit change in mdnincm only yields an increase of .3 votes. But increasing mdnincm by 1 is not hard.
24. Approach 1: Standardized Coefficients
- There is relatively little variance in the percentage of university diploma holders (about 5 to 50), a change of 45 points in the scale.
- There is a lot of variance in mdnincm (about 15,000 to 55,000), a change of 40,000 points in the scale.
- If we convert all of the variables to units of their standard deviation, we answer the question: how much change in y (in std. devs.) for a 1 standard deviation change in x?
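The conversion itself is simple: a standardized coefficient is beta_j = b_j · sd(x_j) / sd(y). A sketch using the diplomau coefficient and the standard deviations reported on the "summarize" slide later in the deck:

```python
# Numbers taken from the slides (regression coefficient and summarize output).
b_diplomau = 1473.793      # raw coefficient on diplomau
sd_diplomau = 8.018776     # std. dev. of diplomau
sd_turnout = 51065.02      # std. dev. of turnout

# Standardized ("beta") coefficient: SDs of y per 1-SD change in x.
beta = b_diplomau * sd_diplomau / sd_turnout
print(round(beta, 3))  # close to Stata's .2315 (summarize uses 427 obs, the regression 426)
```

Stata's `, beta` option reports exactly this rescaling for every coefficient.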
25. Standardized Coefficients
- Standardized coefficients are also colloquially referred to as "standardized betas" or "betas."
- In Stata, after the command, type ", beta"

. regress turnout diplomau mdnincm spending age65 black, beta

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  5,   420) =   27.69
       Model |  2.7474e+11     5  5.4948e+10           Prob > F      =  0.0000
    Residual |  8.3350e+11   420  1.9845e+09           R-squared     =  0.2479
-------------+------------------------------           Adj R-squared =  0.2390
       Total |  1.1082e+12   425  2.6076e+09           Root MSE      =   44548

------------------------------------------------------------------------------
     turnout |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
    diplomau |   1473.793   473.9758     3.11   0.002                 .2315058
     mdnincm |   .3228469    .421679     0.77   0.444                 .0591382
    spending |   .0052333   .0017983     2.91   0.004                 .1253994
       age65 |   .2137993   .0997664     2.14   0.033                 .0932332
       black |  -.1665995   .0255038    -6.53   0.000                -.2973935
       _cons |   164964.6   13711.48    12.03   0.000                        .
------------------------------------------------------------------------------
26. Why this is Potentially Bad
- A 1 standard deviation change in one variable might not be conceptually equal to a 1 standard deviation change in another (especially if the variables are distributed asymmetrically).
- We have only the sample std. dev., and samples vary. Standardized coefficients are not comparable across regressions based on different samples.
- When we get to the point where we cram nominal and ordinal IVs in, this doesn't work well.
- Result: generally avoid using these, but you are likely to see them as you read older work.
27. Approach 2: Use your Theory
- Only theory and sound research design, in concert with your statistical analysis, determine which factors are really the most important.
- There is no perfect statistical answer to this question.
28. Approach 3: If you must
- Rescale all of your variables to range between 0 and 1:
- Add or subtract some number to every observation to make the lowest value in the variable 0.
- From these new values, find the highest value and divide by it, so the highest value becomes 1.
- Coefficients then tell you the effect of changing across the whole range of the variable.
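The steps above amount to a min-max rescaling. A minimal sketch (the example values are made up, chosen to span the diplomau range from the next slide):

```python
def rescale01(values):
    # Subtract the minimum, then divide by the range:
    # the lowest value maps to 0 and the highest to 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

diplomau = [5.3, 20.1, 51.4]   # illustrative values
print(rescale01(diplomau))
```

This mirrors the `generate ...01` commands shown on the next slide, which hard-code each variable's observed minimum and maximum.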
29.
. summarize turnout diplomau mdnincm spending age65 black

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     turnout |       426    216568.7    51065.02      68770     401389
    diplomau |       427     20.1171    8.018776        5.3       51.4
     mdnincm |       427    36176.92    9356.165      16683      64199
    spending |       427     1165419     1222216          0   1.12e+07
       age65 |       427    71750.04    22264.35          7     174436
       black |       427    67012.02     91091.2        719     430627

. generate turnout01 = (turnout-68770)/(401389-68770)
. generate diplomau01 = (diplomau-5.3)/(51.4-5.3)
. generate mdnincm01 = (mdnincm-16683)/(64199-16683)
. generate spending01 = spending/11240972
. generate age6501 = (age65-7)/(174436-7)
. generate black01 = (black-719)/(430627-719)

. summarize turnout01 diplomau01 mdnincm01 spending01 age6501 black01

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   turnout01 |       426    .4443484    .1535241          0          1
  diplomau01 |       427    .3214121    .1739431          0          1
   mdnincm01 |       427      .41026    .1969056          0          1
  spending01 |       427     .103676    .1087287          0          1
     age6501 |       427    .4113023    .1276414          0          1
     black01 |       427    .1542028    .2118853          0          1
30. And the results

. regress turnout01 diplomau01 mdnincm01 spending01 age6501 black01

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  5,   420) =   27.69
       Model |  2.48331225     5  .496662451           Prob > F      =  0.0000
    Residual |  7.53378351   420   .01793758           R-squared     =  0.2479
-------------+------------------------------           Adj R-squared =  0.2390
       Total |  10.0170958   425  .023569637           Root MSE      =  .13393

------------------------------------------------------------------------------
   turnout01 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  diplomau01 |   .2042633   .0656916     3.11   0.002     .075138    .3333887
   mdnincm01 |     .04612   .0602386     0.77   0.444   -.0722867    .1645267
  spending01 |   .1768615    .060774     2.91   0.004    .0574024    .2963206
     age6501 |   .1121186   .0523186     2.14   0.033    .0092798    .2149575
     black01 |  -.2153288   .0329635    -6.53   0.000   -.2801227   -.1505348
       _cons |   .3285244     .03113    10.55   0.000    .2673344    .3897144
------------------------------------------------------------------------------
31. Assumptions of OLS Regression
- 1. Linearity: The regression model is linear in the parameters.
  y = a + b1x1 + b2x2 + ... + e
  Not, e.g., y = a + x1^b1 + e
32. Assumptions of OLS Regression
- 2. Y is conditional upon Xi
  - Xi are fixed in repeated sampling
  - We assume X is exogenous
  - This is violated when y causes x instead of x causing y
33. Assumptions of OLS Regression
- 3. The error term has a mean of zero: E(ei) = 0
- Two key consequences:
  - a. E(y|x) = a + b1x1 + b2x2. Otherwise we could not drop the e when we estimate the model.
  - b. The errors around E(y|x) are random.
34. When is this violated?
- When x is correlated with the error term
- When a variable is omitted that is correlated with x and with y
- When x and y are endogenous
35. Assumptions of OLS Regression
- 4. Homoskedastic Variance: Var(ei | xi) = σ² for all i
- Homo = same/equal; skedannymi (Greek) = to disperse or scatter
- The errors around the regression line have a mean of zero (by assumption 3). But are their variances equal to each other? If not, we have heteroskedastic variance.
36. Heteroskedasticity (bad) vs. Homoskedasticity (good)
37. Homoskedasticity
- When is this violated?
- When observations in one category (or at some range of data) have more error around them than observations in another category.
- Example: Suppose we are trying to predict the sales in a firm by the number of salespeople in that firm. When there are 0 salespeople, there are 0 sales; with 1 salesperson, there are a few sales, and probably not a great deal of variance around predicted sales. With many salespeople, there is greater variance.
- Example: We are studying support for lotteries across states. There will be greater variance around the predicted support in heterogeneous states than in homogeneous states.
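The salespeople example can be sketched with simulated data (made up for illustration): the error's standard deviation is set proportional to x, so the residuals fan out as x grows.

```python
import random

# Simulated heteroskedastic data: error spread grows with x.
random.seed(1)
x = [random.uniform(1, 10) for _ in range(2000)]
y = [5 + 2 * xi + random.gauss(0, xi) for xi in x]   # sd of error = xi

def spread(pairs):
    # Sample standard deviation of residuals around the true line.
    r = [yi - (5 + 2 * xi) for xi, yi in pairs]
    m = sum(r) / len(r)
    return (sum((ri - m) ** 2 for ri in r) / (len(r) - 1)) ** 0.5

low = spread([(xi, yi) for xi, yi in zip(x, y) if xi < 4])
high = spread([(xi, yi) for xi, yi in zip(x, y) if xi > 7])
print(low < high)  # True: more residual spread at high x
```

In a scatterplot of these residuals against x you would see the classic funnel shape of heteroskedasticity.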
38. Assumptions of OLS Regression
- 5. No autocorrelation between the errors: cov(ei, ej) = 0 for i ≠ j
  - Also called "no serial correlation"
  - It says that the errors in predicting y for two different levels of x are not correlated.
  - This is most often violated when we have data observed over time.
  - If we draw a high value of y from the conditional distribution of y given one value of x, we will also draw a high value of y at another value of x. This would mean that cov(ei, ej) ≠ 0.
39. Assumptions of OLS Regression
- 6. No correlation between ei and xi: cov(ei, xi) = 0
  - This is automatically true if assumptions 2 and 3 hold.
  - If ei and xi have different, unrelated effects on y, we can estimate the effect of x on y even if we do not observe e.
  - If ei and xi are correlated, we would have to observe ei to separate the effects.
[Figure: causal diagram linking Education, Income, Region, and Own biz. to Turnout and F.T. for Bush]
40. Assumptions of OLS Regression
- 7. n > k
  - You must have more observations than independent variables.
- 8. Variability in x
  - You must have at least two values of x. Otherwise it's a constant, not a variable.
  - Side note: All else being equal, more variance in x means better estimates.
41. Assumptions of OLS Regression
- 9. The regression model is correctly specified
  - The right variables are in the model
  - The functional form is correct
- 10. No perfect multicollinearity
  - No variable is a linear combination of another variable (or variables)
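A quick sketch of why perfect multicollinearity is fatal (made-up data): when one regressor is an exact linear combination of the others, the X'X matrix of the normal equations is singular, so no unique solution exists.

```python
# x2 is an exact linear combination of the constant and x1.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2 * v + 1 for v in x1]
n = len(x1)

s12 = sum(a * b for a, b in zip(x1, x2))
# Build X'X for the design matrix X = [1, x1, x2].
XtX = [
    [n,        sum(x1),                    sum(x2)],
    [sum(x1),  sum(v * v for v in x1),     s12],
    [sum(x2),  s12,                        sum(v * v for v in x2)],
]

def det3(m):
    # Determinant of a 3x3 matrix by cofactor expansion along the first row.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

print(det3(XtX))  # 0.0: the normal equations cannot be solved uniquely
```

This is the same denominator-equals-zero problem noted back on the "set to 0 and solve" slide, in matrix form. Stata responds by dropping one of the offending variables.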
42. Regression in Practice
- Ansolabehere and Gerber (1994)
- yi = a + b1(Chal. spend.) + b2(Inc. spend.) + b3(...) + ei
- Problem in measuring campaign spending for incumbents:
- Incumbents in close campaigns spend tons of money (because it is a close race)
- Incumbents who are not threatened spend very little money
- They claim a different problem:
- Incumbents don't spend money on campaigning
43. (No transcript: figure-only slide)
44. Results
- Is there a problem with mismeasuring campaign spending (as they claim)?
- Does correctly measuring spending resolve the endogeneity problem?