Title: Multiple Regression
1. Multiple Regression
- Not Multivariate Regression
2. Why more than two?
- We often think multiple factors affect the dependent variable. Getting it done in one regression seems easier than doing a bunch of regressions with two variables.
- Usually, it is more than a matter of convenience; it is a matter of necessity if we want correct estimates. Consider this:
3. Salary and Height?
4. Salary and Height?
5. Spurious Relationships
- This relationship is said to be spurious.
- When we did the bivariate relationship, we said: Height → Salary
- In reality, things looked like this: Gender → Height and Gender → Salary
6. So what happened?
- Statistics don't determine causality; we do.
- Regression just saw the connection between height and salary. It doesn't tell us why that exists.
- We know it exists because gender is driving both factors.
7. Getting things under Control
- If a third factor is correlated with your dependent variable and an independent variable, you must control for it.
- Instead of y = a + b1x1 + e
- we need y = a + b1x1 + b2x2 + e
8. How do we get there?
We still want to minimize the errors: min Σ(yi − a − b1x1i − b2x2i)²
9. Set to 0 and solve for a, b1, b2
Notes:
1. If x1 = x2, then the denominator = 0. The variables cannot be exactly the same (collinear).
2. Multiple regression controls automatically for correlation between x1 and x2.
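The slides do this in Stata; as a rough sketch in Python (made-up data, not from the slides), here is the closed-form two-predictor solution you get by setting the partial derivatives of the SSE to zero. Note the denominator D, which is exactly the quantity that goes to 0 when x1 and x2 are collinear:

```python
import random

# Illustrative sketch with simulated data: true model y = 2 + 3*x1 - 1.5*x2 + e.
random.seed(0)
n = 500
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 3.0 * u - 1.5 * v + random.gauss(0, 0.5) for u, v in zip(x1, x2)]

def mean(vals):
    return sum(vals) / len(vals)

m1, m2, my = mean(x1), mean(x2), mean(y)
# Sums of squares and cross-products in deviation-from-mean form.
S11 = sum((u - m1) ** 2 for u in x1)
S22 = sum((v - m2) ** 2 for v in x2)
S12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
S1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
S2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))

# The denominator: 0 when x1 and x2 are perfectly collinear (note 1 above).
D = S11 * S22 - S12 ** 2
b1 = (S22 * S1y - S12 * S2y) / D
b2 = (S11 * S2y - S12 * S1y) / D
a = my - b1 * m1 - b2 * m2
print(a, b1, b2)  # estimates should land near 2, 3, -1.5
```

Because b1 and b2 are solved jointly through S12, the estimate of each slope already accounts for the correlation between x1 and x2 (note 2 above).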
10. Example
- We previously said: Education → Turnout
- But we also know that education and income are related. We need to have both education and income in the regression to see what the true effects are: Education → Turnout and Income → Turnout (with Education and Income correlated)
11. We need an Equation like
Turnout = a + b1(Education) + b2(Income) + e
Regression will automatically control for the correlation between education and income!
12. Type "regress turnout diplomau mdnincm"

. regress turnout diplomau mdnincm

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  2,   423) =   31.37
       Model |  1.4313e+11     2  7.1565e+10           Prob > F      =  0.0000
    Residual |  9.6512e+11   423  2.2816e+09           R-squared     =  0.1291
-------------+------------------------------           Adj R-squared =  0.1250
       Total |  1.1082e+12   425  2.6076e+09           Root MSE      =   47766

------------------------------------------------------------------------------
     turnout |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    diplomau |   1101.359   504.4476     2.18   0.030     109.823    2092.895
     mdnincm |   1.111589   .4325834     2.57   0.011     .261308    1.961869
       _cons |   154154.4   9641.523    15.99   0.000    135203.1    173105.6
------------------------------------------------------------------------------

How do we interpret this?
13. Interpretation
- Interpretation of multiple regression requires care.
- First, a 1-unit change in education only generates a 1101-person increase in turnout if we hold median income constant. (Why?)
- Second, as before, interpretation only makes sense in terms of the expected value of y (remember, ŷ = E(y|x)). Thus, we don't always see exactly an 1101-person increase. Because this is the average of some distribution around E(y|x), we say "on average."
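To make the "holding median income constant" point concrete, here is a small sketch that plugs the coefficients from the Stata output above into the fitted equation (the example values of diplomau and mdnincm are made up for illustration):

```python
# Coefficients from the two-variable regression on the previous slide.
a, b_diploma, b_income = 154154.4, 1101.359, 1.111589

def predicted_turnout(diplomau, mdnincm):
    # Expected turnout E(y | x) from the fitted equation.
    return a + b_diploma * diplomau + b_income * mdnincm

# With mdnincm held fixed, a 1-unit change in diplomau moves the
# expected turnout by exactly the coefficient on diplomau.
diff = predicted_turnout(21, 36000) - predicted_turnout(20, 36000)
print(round(diff, 3))  # 1101.359
```

If income were allowed to change at the same time, the difference would mix both effects; fixing it isolates the education coefficient.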
14. Instead of drawing a line
- Now that there are two variables, we essentially fit a plane to the data points.
- It's like sticking a piece of paper on some angle through a square.
15. (No transcript: figure-only slide)
16. What about 3 or more variables?
- This gets too messy.
- Follows the same process: take partial derivatives with respect to a and each bn, set them equal to zero, and solve.
- An alternative method, matrix algebra, makes it very easy.
- We can go a long way without that; we'll do it later.
- In the meantime, Stata will take care of things.
17. Let's add several more controls
- Add spending, age65, and black

. regress turnout diplomau mdnincm spending age65 black

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  5,   420) =   27.69
       Model |  2.7474e+11     5  5.4948e+10           Prob > F      =  0.0000
    Residual |  8.3350e+11   420  1.9845e+09           R-squared     =  0.2479
-------------+------------------------------           Adj R-squared =  0.2390
       Total |  1.1082e+12   425  2.6076e+09           Root MSE      =   44548

------------------------------------------------------------------------------
     turnout |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    diplomau |   1473.793   473.9758     3.11   0.002    542.1332    2405.454
     mdnincm |   .3228469    .421679     0.77   0.444   -.5060173    1.151711
    spending |   .0052333   .0017983     2.91   0.004    .0016985    .0087681
       age65 |   .2137993   .0997664     2.14   0.033    .0176956    .4099029
       black |  -.1665995   .0255038    -6.53   0.000   -.2167304   -.1164685
       _cons |   164964.6   13711.48    12.03   0.000    138012.9    191916.2
------------------------------------------------------------------------------
18. Goodness of fit in multiple regression
- Our options for measuring goodness of fit are the same; the formulas change slightly.
- The basic pieces are still the same: we want to know how close ŷ is to y.

Total Sum of Squares (TSS) = Σ(yi − ȳ)²
Residual Sum of Squares (RSS) = Σ(yi − ŷi)²
TSS − RSS = Regression (Model) Sum of Squares
19. RMSE (Std. Err. of regression)
- Before (for the bivariate case): RMSE = √(RSS / (n − 2))
- Now (for any number of IVs): RMSE = √(RSS / (n − k − 1))
It takes 1 d.f. to estimate each bn.
20. r²
- The formula remains the same: r² = 1 − RSS/TSS
- However, with multiple variables, adjusted r² is preferred:
  adj. r² = 1 − [RSS/(n − k − 1)] / [TSS/(n − 1)]
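As a check on these formulas, here is a sketch that recomputes the fit statistics from the sums of squares printed in the five-variable regression earlier in the deck (TSS and RSS are taken as printed, so small rounding differences from Stata's internal values are expected):

```python
import math

# Sums of squares as printed in the Stata ANOVA table (rounded there).
TSS = 1.1082e12   # Total SS
RSS = 8.3350e11   # Residual SS
n, k = 426, 5     # observations, number of IVs

rmse = math.sqrt(RSS / (n - k - 1))                  # Root MSE
r2 = 1 - RSS / TSS                                   # r-squared
adj_r2 = 1 - (RSS / (n - k - 1)) / (TSS / (n - 1))   # adjusted r-squared

print(round(rmse), round(r2, 4), round(adj_r2, 4))
```

The results line up with the Stata output (Root MSE 44548, R-squared 0.2479, Adj R-squared ≈ 0.2390).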
21. Final notes on r² and fit
- While r² may be on the same scale across all regressions, in some instances a low r² may be more acceptable: some things are hard to predict.
- r² is not really a true statistic because there is no standard error or measure of uncertainty for it (it is based on a sample).
- Explaining variation is not always the same as explaining the political world; use judgment and theory too (not just fit).
- Consequence: use adj. r², but with caution.
22. Assessing the relative importance of multiple coefficients
- Which variable is most important?

. regress turnout diplomau mdnincm spending age65 black

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  5,   420) =   27.69
       Model |  2.7474e+11     5  5.4948e+10           Prob > F      =  0.0000
    Residual |  8.3350e+11   420  1.9845e+09           R-squared     =  0.2479
-------------+------------------------------           Adj R-squared =  0.2390
       Total |  1.1082e+12   425  2.6076e+09           Root MSE      =   44548

------------------------------------------------------------------------------
     turnout |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    diplomau |   1473.793   473.9758     3.11   0.002    542.1332    2405.454
     mdnincm |   .3228469    .421679     0.77   0.444   -.5060173    1.151711
    spending |   .0052333   .0017983     2.91   0.004    .0016985    .0087681
       age65 |   .2137993   .0997664     2.14   0.033    .0176956    .4099029
       black |  -.1665995   .0255038    -6.53   0.000   -.2167304   -.1164685
       _cons |   164964.6   13711.48    12.03   0.000    138012.9    191916.2
------------------------------------------------------------------------------
23. Assessing the relative importance of multiple coefficients
- A 1-unit change in diplomau makes a difference of over 1000 votes, on average, holding all other factors constant. But changing diplomau by 1 unit is hard.
- A 1-unit change in mdnincm only yields an increase of .3 votes. But increasing mdnincm by 1 is not hard.
24. Approach 1: Standardized Coefficients
- There is relatively little variance in the percentage of university diploma holders (about 5 to 50), a change of 45 points in the scale.
- There is a lot of variance in mdnincm (about 15,000 to 55,000), a change of 40,000 points in the scale.
- If we convert all of the variables to units of their standard deviation, we answer the question: how much change in y (in std. devs.) for a 1 standard deviation change in x?
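The conversion itself is simple: a standardized coefficient is beta_j = b_j · sd(x_j) / sd(y). A sketch using the diplomau coefficient and the standard deviations reported on the "summarize" slide later in the deck:

```python
# Numbers taken from the slides (regression coefficient and summarize output).
b_diplomau = 1473.793      # raw coefficient on diplomau
sd_diplomau = 8.018776     # std. dev. of diplomau
sd_turnout = 51065.02      # std. dev. of turnout

# Standardized ("beta") coefficient: SDs of y per 1-SD change in x.
beta = b_diplomau * sd_diplomau / sd_turnout
print(round(beta, 3))  # close to Stata's .2315 (summarize uses 427 obs, the regression 426)
```

Stata's `, beta` option reports exactly this rescaling for every coefficient.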
25. Standardized Coefficients
- Standardized coefficients are also colloquially referred to as "standardized betas" or "betas."
- In Stata, after the command, type ", beta"

. regress turnout diplomau mdnincm spending age65 black, beta

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  5,   420) =   27.69
       Model |  2.7474e+11     5  5.4948e+10           Prob > F      =  0.0000
    Residual |  8.3350e+11   420  1.9845e+09           R-squared     =  0.2479
-------------+------------------------------           Adj R-squared =  0.2390
       Total |  1.1082e+12   425  2.6076e+09           Root MSE      =   44548

------------------------------------------------------------------------------
     turnout |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
    diplomau |   1473.793   473.9758     3.11   0.002                 .2315058
     mdnincm |   .3228469    .421679     0.77   0.444                 .0591382
    spending |   .0052333   .0017983     2.91   0.004                 .1253994
       age65 |   .2137993   .0997664     2.14   0.033                 .0932332
       black |  -.1665995   .0255038    -6.53   0.000                -.2973935
       _cons |   164964.6   13711.48    12.03   0.000                        .
------------------------------------------------------------------------------
26. Why this is Potentially Bad
- A 1 standard deviation change in one variable might not be conceptually equal to a 1 standard deviation change in another (especially if the variables are distributed asymmetrically).
- We have only the sample std. dev., and samples vary. Standardized coefficients are not comparable across regressions based on different samples.
- When we get to the point where we cram nominal and ordinal IVs in, this doesn't work well.
- Result: generally avoid using these, but you are likely to see them as you read older work.
27. Approach 2: Use your Theory
- Only theory and sound research design, in concert with your statistical analysis, determine which factors are really the most important.
- There is no perfect statistical answer to this question.
28. Approach 3: If you must
- Rescale all of your variables to range between 0 and 1:
- Add or subtract some number to every observation to make the lowest value in the variable 0.
- From these new values, find the highest value and divide by it, so the highest value becomes 1.
- Coefficients then tell you the effect of changing across the whole range of the variable.
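The steps above amount to a min-max rescaling. A minimal sketch (the example values are made up, chosen to span the diplomau range from the next slide):

```python
def rescale01(values):
    # Subtract the minimum, then divide by the range:
    # the lowest value maps to 0 and the highest to 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

diplomau = [5.3, 20.1, 51.4]   # illustrative values
print(rescale01(diplomau))
```

This mirrors the `generate ...01` commands shown on the next slide, which hard-code each variable's observed minimum and maximum.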
29.
. summarize turnout diplomau mdnincm spending age65 black

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     turnout |       426    216568.7    51065.02      68770     401389
    diplomau |       427     20.1171    8.018776        5.3       51.4
     mdnincm |       427    36176.92    9356.165      16683      64199
    spending |       427     1165419     1222216          0   1.12e+07
       age65 |       427    71750.04    22264.35          7     174436
       black |       427    67012.02     91091.2        719     430627

. generate turnout01 = (turnout-68770)/(401389-68770)
. generate diplomau01 = (diplomau-5.3)/(51.4-5.3)
. generate mdnincm01 = (mdnincm-16683)/(64199-16683)
. generate spending01 = spending/11240972
. generate age6501 = (age65-7)/(174436-7)
. generate black01 = (black-719)/(430627-719)

. summarize turnout01 diplomau01 mdnincm01 spending01 age6501 black01

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   turnout01 |       426    .4443484    .1535241          0          1
  diplomau01 |       427    .3214121    .1739431          0          1
   mdnincm01 |       427      .41026    .1969056          0          1
  spending01 |       427     .103676    .1087287          0          1
     age6501 |       427    .4113023    .1276414          0          1
     black01 |       427    .1542028    .2118853          0          1
30. And the results

. regress turnout01 diplomau01 mdnincm01 spending01 age6501 black01

      Source |       SS       df       MS              Number of obs =     426
-------------+------------------------------           F(  5,   420) =   27.69
       Model |  2.48331225     5  .496662451           Prob > F      =  0.0000
    Residual |  7.53378351   420   .01793758           R-squared     =  0.2479
-------------+------------------------------           Adj R-squared =  0.2390
       Total |  10.0170958   425  .023569637           Root MSE      =  .13393

------------------------------------------------------------------------------
   turnout01 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  diplomau01 |   .2042633   .0656916     3.11   0.002     .075138    .3333887
   mdnincm01 |     .04612   .0602386     0.77   0.444   -.0722867    .1645267
  spending01 |   .1768615    .060774     2.91   0.004    .0574024    .2963206
     age6501 |   .1121186   .0523186     2.14   0.033    .0092798    .2149575
     black01 |  -.2153288   .0329635    -6.53   0.000   -.2801227   -.1505348
       _cons |   .3285244     .03113    10.55   0.000    .2673344    .3897144
------------------------------------------------------------------------------
31. Assumptions of OLS Regression
- 1. Linearity: The regression model is linear in the parameters.
  y = a + b1x1 + b2x2 + ... + e
  Not, e.g., y = a + x1^b1 + e
32. Assumptions of OLS Regression
- 2. Y is conditional upon Xi
  - Xi are fixed in repeated sampling
  - We assume X is exogenous
  - This is violated when y causes x instead of x causing y
33. Assumptions of OLS Regression
- 3. The error term has a mean of zero: E(ei) = 0
- Two key consequences:
  - a. E(y|x) = a + b1x1 + b2x2. Otherwise we could not drop the e when we estimate the model.
  - b. The errors around E(y|x) are random.
34. When is this violated?
- When x is correlated with the error term
- When a variable is omitted that is correlated with x and with y
- When x and y are endogenous
35. Assumptions of OLS Regression
- 4. Homoskedastic Variance: Var(ei | xi) = σ² for all i
- Homo = same/equal; skedannymi (Greek) = to disperse or scatter
- The errors around the regression line have a mean of zero (by assumption 3). But are their variances equal to each other? If not, we have heteroskedastic variance.
36. Heteroskedasticity (bad) vs. Homoskedasticity (good)
37. Homoskedasticity
- When is this violated?
- When observations in one category (or at some range of data) have more error around them than observations in another category.
- Example: Suppose we are trying to predict the sales in a firm by the number of salespeople in that firm. When there are 0 salespeople, there are 0 sales; with 1 salesperson, there are a few sales, and probably not a great deal of variance around predicted sales. With many salespeople, there is greater variance.
- Example: We are studying support for lotteries across states. There will be greater variance around the predicted support in heterogeneous states than in homogeneous states.
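The salespeople example can be sketched with simulated data (made up for illustration): the error's standard deviation is set proportional to x, so the residuals fan out as x grows.

```python
import random

# Simulated heteroskedastic data: error spread grows with x.
random.seed(1)
x = [random.uniform(1, 10) for _ in range(2000)]
y = [5 + 2 * xi + random.gauss(0, xi) for xi in x]   # sd of error = xi

def spread(pairs):
    # Sample standard deviation of residuals around the true line.
    r = [yi - (5 + 2 * xi) for xi, yi in pairs]
    m = sum(r) / len(r)
    return (sum((ri - m) ** 2 for ri in r) / (len(r) - 1)) ** 0.5

low = spread([(xi, yi) for xi, yi in zip(x, y) if xi < 4])
high = spread([(xi, yi) for xi, yi in zip(x, y) if xi > 7])
print(low < high)  # True: more residual spread at high x
```

In a scatterplot of these residuals against x you would see the classic funnel shape of heteroskedasticity.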
38. Assumptions of OLS Regression
- 5. No autocorrelation between the errors: cov(ei, ej) = 0 for i ≠ j
  - Also called "no serial correlation"
  - It says that the errors in predicting y for two different levels of x are not correlated.
  - This is most often violated when we have data observed over time.
  - If we draw a high value of y from the conditional distribution of y given one value of x, we will also draw a high value of y at another value of x. This would mean that cov(ei, ej) ≠ 0.
39. Assumptions of OLS Regression
- 6. No correlation between ei and xi: cov(ei, xi) = 0
  - This is automatically true if assumptions 2 and 3 hold.
  - If ei and xi have different, unrelated effects on y, we can estimate the effect of x on y even if we do not observe e.
  - If ei and xi are correlated, we would have to observe ei to separate the effects.
[Figure: causal diagram linking Education, Income, Region, and Own biz. to Turnout and F.T. for Bush]
40. Assumptions of OLS Regression
- 7. n > k
  - You must have more observations than independent variables.
- 8. Variability in x
  - You must have at least two values of x. Otherwise it's a constant, not a variable.
  - Side note: All else being equal, more variance in x means better estimates.
41. Assumptions of OLS Regression
- 9. The regression model is correctly specified
  - The right variables are in the model
  - The functional form is correct
- 10. No perfect multicollinearity
  - No variable is a linear combination of another variable (or variables)
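A quick sketch of why perfect multicollinearity is fatal (made-up data): when one regressor is an exact linear combination of the others, the X'X matrix of the normal equations is singular, so no unique solution exists.

```python
# x2 is an exact linear combination of the constant and x1.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2 * v + 1 for v in x1]
n = len(x1)

s12 = sum(a * b for a, b in zip(x1, x2))
# Build X'X for the design matrix X = [1, x1, x2].
XtX = [
    [n,        sum(x1),                    sum(x2)],
    [sum(x1),  sum(v * v for v in x1),     s12],
    [sum(x2),  s12,                        sum(v * v for v in x2)],
]

def det3(m):
    # Determinant of a 3x3 matrix by cofactor expansion along the first row.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

print(det3(XtX))  # 0.0: the normal equations cannot be solved uniquely
```

This is the same denominator-equals-zero problem noted back on the "set to 0 and solve" slide, in matrix form. Stata responds by dropping one of the offending variables.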
42. Regression in Practice
- Ansolabehere and Gerber (1994)
- yi = a + b1(Chal. spend.) + b2(Inc. spend.) + b3(...) + ei
- Problem in measuring campaign spending for incumbents:
- Incumbents in close campaigns spend tons of money (because it is a close race)
- Incumbents who are not threatened spend very little money
- They claim a different problem:
- Incumbents don't spend money on campaigning
43. (No transcript: figure-only slide)
44. Results
- Is there a problem with mismeasuring campaign spending (as they claim)?
- Does correctly measuring spending resolve the endogeneity problem?