Title: Multiple Regression and Model Building
1Chapter 12
- Multiple Regression and Model Building
2Learning Objectives
- 1. Explain the Linear Multiple Regression Model
- 2. Test Overall Significance
- 3. Describe Various Types of Models
- 4. Evaluate Portions of a Regression Model
- 5. Interpret Linear Multiple Regression Computer Output
- 6. Describe Stepwise Regression
- 7. Explain Residual Analysis
- 8. Describe Regression Pitfalls
3Types of Regression Models
4Regression Modeling Steps
- 1. Hypothesize Deterministic Component
- 2. Estimate Unknown Model Parameters
- 3. Specify Probability Distribution of Random Error Term
- Estimate Standard Deviation of Error
- 4. Evaluate Model
- 5. Use Model for Prediction and Estimation
5Regression Modeling Steps
- 1. Hypothesize Deterministic Component
- 2. Estimate Unknown Model Parameters
- 3. Specify Probability Distribution of Random Error Term
- Estimate Standard Deviation of Error
- 4. Evaluate Model
- 5. Use Model for Prediction and Estimation
Expanded in Multiple Regression
6Linear Multiple Regression Model
- Hypothesizing the Deterministic Component
Expanded in Multiple Regression
7Regression Modeling Steps
- 1. Hypothesize Deterministic Component
- 2. Estimate Unknown Model Parameters
- 3. Specify Probability Distribution of Random Error Term
- Estimate Standard Deviation of Error
- 4. Evaluate Model
- 5. Use Model for Prediction and Estimation
8Linear Multiple Regression Model
- 1. Relationship between 1 dependent and 2 or more independent variables is a linear function:

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

where Y is the dependent (response) variable, X1, ..., Xk are the independent (explanatory) variables, β0 is the population Y-intercept, β1, ..., βk are the population slopes, and ε is the random error.
9Population Multiple Regression Model
Bivariate model
10Sample Multiple Regression Model
Bivariate model
11Parameter Estimation
Expanded in Multiple Regression
12Regression Modeling Steps
- 1. Hypothesize Deterministic Component
- 2. Estimate Unknown Model Parameters
- 3. Specify Probability Distribution of Random Error Term
- Estimate Standard Deviation of Error
- 4. Evaluate Model
- 5. Use Model for Prediction and Estimation
13Multiple Linear Regression Equations
Too complicated by hand!
Ouch!
14Interpretation of Estimated Coefficients
15Interpretation of Estimated Coefficients
- 1. Slope (β̂k)
- Estimated Y Changes by β̂k for Each 1 Unit Increase in Xk, Holding All Other Variables Constant
- Example: If β̂1 = 2, then Sales (Y) is expected to increase by 2 for each 1 unit increase in Advertising (X1), holding the number of Sales Reps fixed at any particular level
- The effect of more advertising is the same for any fixed number of sales reps
16Interpretation of Estimated Coefficients
- 1. Slope (β̂k)
- Estimated Y Changes by β̂k for Each 1 Unit Increase in Xk, Holding All Other Variables Constant
- Example: If β̂1 = 2, then Sales (Y) is expected to increase by 2 for each 1 unit increase in Advertising (X1), given the number of Sales Reps (X2)
- 2. Y-Intercept (β̂0)
- Average Value of Y When All Xk = 0
17Parameter Estimation Example
- You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.) and newspaper circulation (000) on the number of ad responses (00).
- You've collected the following data:

  Resp  Size  Circ
     1     1     2
     4     8     8
     1     3     1
     3     5     7
     2     6     4
     4    10     6
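If you want to reproduce the printout on the next slide yourself, a minimal Stata sketch for entering these six observations and fitting the model (variable names resp, size, and circ match that output) would be:

  clear
  input resp size circ
  1  1  2
  4  8  8
  1  3  1
  3  5  7
  2  6  4
  4 10  6
  end

  regress resp size circ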
18Parameter Estimation Computer Output
. regress resp size circ

      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  2,     3) =   55.44
       Model |  9.24973638     2  4.62486819           Prob > F      =  0.0043
    Residual |   .25026362     3  .083421207           R-squared     =  0.9737
-------------+------------------------------           Adj R-squared =  0.9561
       Total |         9.5     5         1.9           Root MSE      =  .28883

------------------------------------------------------------------------------
        resp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   .2049209   .0588218     3.48   0.040     .0177238     .392118
        circ |   .2804921   .0686017     4.09   0.026      .062171    .4988132
       _cons |   .0639719   .2598628     0.25   0.821    -.7630274    .8909712
------------------------------------------------------------------------------
19Interpretation of Coefficients Solution
20Interpretation of Coefficients Solution
- 1. Slope (β̂1)
- Responses to Ad Are Expected to Increase by .2049 (20.49) for Each 1 Sq. In. Increase in Ad Size, Holding Circulation Constant
21Interpretation of Coefficients Solution
- 1. Slope (β̂1)
- Responses to Ad Are Expected to Increase by .2049 (20.49) for Each 1 Sq. In. Increase in Ad Size, Holding Circulation Constant
- 2. Slope (β̂2)
- Responses to Ad Are Expected to Increase by .2805 (28.05) for Each 1 Unit (1,000) Increase in Circulation, Holding Ad Size Constant
22Evaluating the Model
Expanded in Multiple Regression
23Regression Modeling Steps
- 1. Hypothesize Deterministic Component
- 2. Estimate Unknown Model Parameters
- 3. Specify Probability Distribution of Random Error Term
- Estimate Standard Deviation of Error
- 4. Evaluate Model
- 5. Use Model for Prediction and Estimation
24Evaluating Multiple Regression Model Steps
- 1. Examine Variation Measures
- 2. Do Residual Analysis
- 3. Test Parameter Significance
- Overall Model
- Individual Coefficients
- 4. Test for Multicollinearity
25Evaluating Multiple Regression Model Steps
Expanded!
- 1. Examine Variation Measures
- 2. Do Residual Analysis
- 3. Test Parameter Significance
- Overall Model
- Individual Coefficients
- Test for Multicollinearity
New!
New!
New!
26Evaluating Multiple Regression Model Steps
Expanded!
- 1. Examine Variation Measures
- 2. Do Residual Analysis
- 3. Test Parameter Significance
- Overall Model
- Individual Coefficients
- 4. Test for Multicollinearity
New!
New!
New!
27Variation Measures
28Coefficient of Multiple Determination
- Proportion of Variation in Y Explained by All X
Variables Taken Together
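The defining formula (shown as a figure in the original slides, so it did not survive this transcript) is the standard one:

\[ R^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}} \]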
29Check Your Understanding
- If you add a variable to the model
- How will that affect the R-squared value for the
model?
30Adjusted R2
- R² Never Decreases When a New X Variable Is Added to the Model - Only the Y Values Determine SSyy
- Disadvantage When Comparing Models
- Solution: Adjusted R²
- Each additional variable reduces adjusted R², unless SSE falls enough to compensate (see the formula below)
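The adjustment the slide refers to is usually written as follows (the slide's own formula was lost in this transcript):

\[ \bar{R}^2 = 1 - \left[\frac{n-1}{n-(k+1)}\right](1 - R^2) \]

so adding a variable that explains little can lower adjusted R² even though R² itself rises.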
31Variance of Error
- Assuming the model is correctly specified
- Best (unbiased) estimator of σ² is s² = SSE / [n - (k + 1)]
- s is used in the formulas for computing standard errors, test statistics, and confidence intervals
- The exact formula is too complicated to show
- But a higher value for s leads to higher standard errors and wider intervals
32Check Your Understanding
- If you add a variable to the model
- Exercise 12.5 How will that affect the estimate
of standard deviation (of the error term)?
33Individual Coefficients
34T Statistics
. regress resp size circ

      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  2,     3) =   55.44
       Model |  9.24973638     2  4.62486819           Prob > F      =  0.0043
    Residual |   .25026362     3  .083421207           R-squared     =  0.9737
-------------+------------------------------           Adj R-squared =  0.9561
       Total |         9.5     5         1.9           Root MSE      =  .28883

------------------------------------------------------------------------------
        resp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   .2049209   .0588218     3.48   0.040     .0177238     .392118
        circ |   .2804921   .0686017     4.09   0.026      .062171    .4988132
       _cons |   .0639719   .2598628     0.25   0.821    -.7630274    .8909712
------------------------------------------------------------------------------
35Exercise 12.7
- n = 30
- H0: β2 = 0 is NOT rejected
- H0: β3 = 0 is rejected
- Explain this result despite β̂2 > β̂3
36Evaluating Multiple Regression Model Steps
Expanded!
- 1. Examine Variation Measures
- 2. Do Residual Analysis
- 3. Test Parameter Significance
- Overall Model
- Individual Coefficients
- 4. Test for Multicollinearity
New!
New!
New!
37Testing Overall Significance
- 1. Shows If There Is a Linear Relationship Between All X Variables Taken Together and Y
- 2. Uses F Test Statistic
- 3. Hypotheses
- H0: β1 = β2 = ... = βk = 0
- No Linear Relationship
- Ha: At Least One Coefficient Is Not 0
- At Least One X Variable Affects Y
38T Statistics
(Callouts on the output below: the F statistic's numerator df = k and denominator df = n - k - 1.)
. regress resp size circ

      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  2,     3) =   55.44
       Model |  9.24973638     2  4.62486819           Prob > F      =  0.0043
    Residual |   .25026362     3  .083421207           R-squared     =  0.9737
-------------+------------------------------           Adj R-squared =  0.9561
       Total |         9.5     5         1.9           Root MSE      =  .28883

------------------------------------------------------------------------------
        resp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   .2049209   .0588218     3.48   0.040     .0177238     .392118
        circ |   .2804921   .0686017     4.09   0.026      .062171    .4988132
       _cons |   .0639719   .2598628     0.25   0.821    -.7630274    .8909712
------------------------------------------------------------------------------
(Callouts: F = MS(Model) / MS(Error) = 4.6249 / .08342 = 55.44, and its p-value is reported as Prob > F = 0.0043; see also the general formula below.)
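For reference, the overall F statistic highlighted above can be written either as a ratio of mean squares or in terms of R² (a standard identity, not shown on the slide):

\[ F = \frac{MS(\text{Model})}{MS(\text{Error})} = \frac{R^2 / k}{(1 - R^2)/[\,n - (k+1)\,]} \]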
39Exercise 12.6
- See minitab printout p. 678
40Exercise 12.12
- F-test for model is significant
- Is the model the best available predictor for y?
- Are all the terms in the model important for
predicting y? - Or what does it mean?
41Exercise 12.26
- 18 variables
- n = 20
- R² = .95
- Compute adjusted R²
- Compute the F statistic
- Can you reject the null hypothesis that all coefficients = 0?
42Exercise 12.28 Soln
- 18 variables
- n = 20
- R² = .95
43Exercise 12.28 Soln
- k = 18, n = 20, R² = .95
- Would need an F value > 245.9 to reject the null hypothesis (see the computation sketch below)!
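A quick Stata sketch that reproduces these numbers (display and invFtail are built-in; the 245.9 cutoff matches the slide):

  display 1 - (1 - 0.95)*(20 - 1)/(20 - 18 - 1)    // adjusted R-squared = .05
  display (0.95/18)/((1 - 0.95)/(20 - 18 - 1))     // F statistic, about 1.06
  display invFtail(18, 1, 0.05)                    // critical F(18, 1) at alpha = .05, about 245.9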
44Exercise 12.143
- Model salary based on gender
- Other variables included
- Race
- Education level
- Tenure in firm
- Number of hours/week worked
- e. Why would one want to adjust/control for these
other factors when testing for gender
discrimination?
45GFCLOCKS Dataset
- Dependent variable
- Auction price
- Independent variables
- Age
- Number of bidders
46Simple Linear Model (compare to Minitab p. 686)
. regress price age numbids

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =  120.19
       Model |  4283062.96     2  2141531.48           Prob > F      =  0.0000
    Residual |   516726.54    29  17818.1565           R-squared     =  0.8923
-------------+------------------------------           Adj R-squared =  0.8849
       Total |   4799789.5    31  154831.919           Root MSE      =  133.48

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   12.74057   .9047403    14.08   0.000     10.89017    14.59098
     numbids |   85.95298   8.728523     9.85   0.000     68.10115    103.8048
       _cons |  -1338.951   173.8095    -7.70   0.000    -1694.432   -983.4711
------------------------------------------------------------------------------
47Types of Regression Models
48Models With a Single Quantitative Variable
49Types of Regression Models
50First-Order Model With 1 Independent Variable
51First-Order Model With 1 Independent Variable
- 1. Relationship Between 1 Dependent and 1 Independent Variable Is Linear
52First-Order Model With 1 Independent Variable
- 1. Relationship Between 1 Dependent and 1 Independent Variable Is Linear
- 2. Used When the Expected Rate of Change in Y Per Unit Change in X Is Stable
53First-Order Model With 1 Independent Variable
- 1. Relationship Between 1 Dependent and 1 Independent Variable Is Linear
- 2. Used When the Expected Rate of Change in Y Per Unit Change in X Is Stable
- 3. Used With Curvilinear Relationships If the Relevant Range Is Linear
54First-Order Model Relationships
(Two panels: Y plotted against X, a line with β1 < 0 in one and a line with β1 > 0 in the other.)
55First-Order Model Worksheet
Run regression with Y, X1
57GFClocks Revisited
. regress price numbids

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =    5.55
       Model |  749662.281     1  749662.281           Prob > F      =  0.0252
    Residual |  4050127.22    30  135004.241           R-squared     =  0.1562
-------------+------------------------------           Adj R-squared =  0.1281
       Total |   4799789.5    31  154831.919           Root MSE      =  367.43

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     numbids |   54.76335   23.23972     2.36   0.025      7.30151    102.2252
       _cons |   804.9119   230.8305     3.49   0.002     333.4931    1276.331
------------------------------------------------------------------------------
58Types of Regression Models
59Second-Order Model With 1 Independent Variable
- 1. Relationship Between 1 Dependent and 1 Independent Variable Is a Quadratic Function
- 2. Useful 1st Model If a Non-Linear Relationship Is Suspected
60Second-Order Model With 1 Independent Variable
- 1. Relationship Between 1 Dependent and 1 Independent Variable Is a Quadratic Function
- 2. Useful 1st Model If a Non-Linear Relationship Is Suspected
- 3. Model: E(Y) = β0 + β1X1 + β2X1², where β1X1 is the linear effect and β2X1² is the curvilinear effect
61Second-Order Model Relationships
(Four panels of quadratic curves: two with β2 > 0, two with β2 < 0.)
62Second-Order Model Worksheet
Create an X1² column. Run regression with Y, X1, X1².
63GFClocks revisited
. gen bidssq = numbids*numbids
. regress price numbids bidssq

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =    4.30
       Model |  1098297.13     2  549148.563           Prob > F      =  0.0231
    Residual |  3701492.37    29  127637.668           R-squared     =  0.2288
-------------+------------------------------           Adj R-squared =  0.1756
       Total |   4799789.5    31  154831.919           Root MSE      =  357.26

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     numbids |   326.1043   165.7274     1.97   0.059    -12.84632    665.0548
      bidssq |  -13.44477   8.134995    -1.65   0.109     -30.0827    3.193167
       _cons |  -454.8959   794.6254    -0.57   0.571    -2080.087    1170.295
------------------------------------------------------------------------------
64Exercise 12.53, p. 705
- Graph the equations
- What effect does the 2x term have on the graphs?
- What effect does the x² term have on the graphs?
65Exercise 12.53, p. 705
66Exercise 12.55, p. 706
- Plot scattergram
- If you only had data for x < 33, what kind of model would you suggest?
- If only x > 33?
- If all the data?
67Types of Regression Models
68Third-Order Model With 1 Independent Variable
- 1. Relationship Between 1 Dependent and 1 Independent Variable Has a Wave
- 2. Used If 1 Reversal in Curvature
69Third-Order Model With 1 Independent Variable
- 1. Relationship Between 1 Dependent and 1 Independent Variable Has a Wave
- 2. Used If 1 Reversal in Curvature
- 3. Model: E(Y) = β0 + β1X1 + β2X1² + β3X1³ (β1X1 is the linear effect; the squared and cubed terms are the curvilinear effects)
70Third-Order Model Relationships
(Two panels: one curve with β3 < 0, one with β3 > 0.)
71Third-Order Model Worksheet
Multiply X1 by X1 to get X1². Multiply X1 by X1 by X1 to get X1³. Run regression with Y, X1, X1², X1³.
72Models With Two or More Quantitative Variables
73Types of Regression Models
74First-Order Model With 2 Independent Variables
- 1. Relationship Between 1 Dependent and 2 Independent Variables Is a Linear Function
- 2. Assumes No Interaction Between X1 and X2
- Effect of X1 on E(Y) Is the Same Regardless of X2 Values
75First-Order Model With 2 Independent Variables
- 1. Relationship Between 1 Dependent and 2 Independent Variables Is a Linear Function
- 2. Assumes No Interaction Between X1 and X2
- Effect of X1 on E(Y) Is the Same Regardless of X2 Values
- 3. Model: E(Y) = β0 + β1X1 + β2X2
76No Interaction
77-82 No Interaction (one graph built up across slides 77-82)
(Graph: E(Y) from 0 to 12 plotted against X1 from 0 to 1.5, one line per value of X2.)
E(Y) = 1 + 2X1 + 3X2
E(Y) = 1 + 2X1 + 3(0) = 1 + 2X1
E(Y) = 1 + 2X1 + 3(1) = 4 + 2X1
E(Y) = 1 + 2X1 + 3(2) = 7 + 2X1
E(Y) = 1 + 2X1 + 3(3) = 10 + 2X1
Effect (slope) of X1 on E(Y) does not depend on the X2 value
83First-Order Model Relationships
84First-Order Model Worksheet
Run regression with Y, X1, X2
85Types of Regression Models
86Interaction Model With 2 Independent Variables
- 1. Hypothesizes Interaction Between Pairs of X
Variables - Response to One X Variable Varies at Different
Levels of Another X Variable
87Interaction Model With 2 Independent Variables
- 1. Hypothesizes Interaction Between Pairs of X
Variables - Response to One X Variable Varies at Different
Levels of Another X Variable - 2. Contains Two-Way Cross Product Terms
88Interaction Model With 2 Independent Variables
- 1. Hypothesizes Interaction Between Pairs of X
Variables - Response to One X Variable Varies at Different
Levels of Another X Variable - 2. Contains Two-Way Cross Product Terms
- 3. Can Be Combined With Other Models
- Example Dummy-Variable Model
89Effect of Interaction
90Effect of Interaction
91Effect of Interaction
- 1. Given: E(Y) = β0 + β1X1 + β2X2 + β3X1X2
- 2. Without Interaction Term, Effect of X1 on Y Is Measured by β1
92Effect of Interaction
- 1. Given: E(Y) = β0 + β1X1 + β2X2 + β3X1X2
- 2. Without Interaction Term, Effect of X1 on Y Is Measured by β1
- 3. With Interaction Term, Effect of X1 on Y Is Measured by β1 + β3X2
- Effect Increases As X2 Increases (see the derivation below)
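The claim follows directly from differentiating the interaction model with respect to X1 (a standard step, not spelled out on the slide):

\[ E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \quad\Longrightarrow\quad \frac{\partial E(Y)}{\partial X_1} = \beta_1 + \beta_3 X_2 \]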
93Interaction Model Relationships
94-97 Interaction Model Relationships (one graph built up across slides 94-97)
(Graph: E(Y) from 0 to 12 plotted against X1 from 0 to 1.5, one line per value of X2.)
E(Y) = 1 + 2X1 + 3X2 + 4X1X2
E(Y) = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1
E(Y) = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
Effect (slope) of X1 on E(Y) does depend on the X2 value
98Interaction Model Worksheet
Multiply X1 by X2 to get X1X2. Run regression
with Y, X1, X2 , X1X2
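A minimal Stata sketch of this worksheet, using generic variable names y, x1, and x2 (the GFCLOCKS slide that follows does the same thing with price, age, numbids, and a cross-product called agebid, presumably created the same way):

  gen x1x2 = x1*x2         // two-way cross-product (interaction) term
  regress y x1 x2 x1x2     // first-order terms plus the interaction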
99GFClocks Revisited (compare Minitab p. 693)
. regress price age numbids agebid

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  3,    28) =  193.04
       Model |  4578427.37     3  1526142.46           Prob > F      =  0.0000
    Residual |  221362.133    28  7905.79047           R-squared     =  0.9539
-------------+------------------------------           Adj R-squared =  0.9489
       Total |   4799789.5    31  154831.919           Root MSE      =  88.915

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .8781425   2.032156     0.43   0.669     -3.28454    5.040825
     numbids |  -93.26482   29.89162    -3.12   0.004     -154.495   -32.03463
      agebid |   1.297846   .2123326     6.11   0.000     .8629022    1.732789
       _cons |    320.458   295.1413     1.09   0.287    -284.1115    925.0275
------------------------------------------------------------------------------
100Exercise 12.41
- Minitab printout p.695
- What is the prediction equation?
- Describe the geometric form of the response surface
- Plot for x2 = 1, 3, 5
- Explain what it means for x1 and x2 to interact
- Specify the null hypothesis for the test of interaction
- Conduct the test with alpha = .01
101Exercise 12.43a
- p. 695
- Y frequency of alcohol consumption
- X1 personal attitude toward drinking
- X2 social support (?for drinking?)
- Interpret X1X2 interaction
102Types of Regression Models
103Second-Order Model With 2 Independent Variables
- 1. Relationship Between 1 Dependent and 2 or More Independent Variables Is a Quadratic Function
- 2. Useful 1st Model If a Non-Linear Relationship Is Suspected
104Second-Order Model With 2 Independent Variables
- 1. Relationship Between 1 Dependent and 2 or More Independent Variables Is a Quadratic Function
- 2. Useful 1st Model If a Non-Linear Relationship Is Suspected
- 3. Model: E(Y) = β0 + β1X1 + β2X2 + β3X1X2 + β4X1² + β5X2²
105Second-Order Model Relationships
(Three response-surface panels: β4 + β5 > 0, β4 + β5 < 0, and β3² > 4β4β5.)
106Second-Order Model Worksheet
Multiply X1 by X2 to get X1X2; then create X1² and X2². Run regression with Y, X1, X2, X1X2, X1², X2².
107Stata Code
gen x1x2 = x1*x2
gen x1sq = x1*x1
gen x2sq = x2*x2
regress y x1 x2 x1x2 x1sq x2sq
108Models With One Qualitative Independent Variable
109Types of Regression Models
110Dummy-Variable Model
- Involves Categorical X Variable With 2 (or More) Levels
- e.g., Male-Female; College-No College
- If 2 Levels, One Dummy Variable
- Coded 0 and 1
- May Be Combined With Quantitative Variable (1st
Order or 2nd Order Model)
111Dummy-Variable Model Worksheet
X2 levels: 0 = Group 1, 1 = Group 2. Run regression with Y, X1, X2.
112Interpreting Dummy-Variable Model Equation
113Interpreting Dummy-Variable Model Equation
Given: Ŷi = β̂0 + β̂1X1i + β̂2X2i
  Y = Starting salary of college grads
  X1 = GPA
  X2 = 0 if Male, 1 if Female
114Interpreting Dummy-Variable Model Equation
Given: Ŷi = β̂0 + β̂1X1i + β̂2X2i
  Y = Starting salary of college grads
  X1 = GPA
  X2 = 0 if Male, 1 if Female
Males (X2 = 0): Ŷi = β̂0 + β̂1X1i + β̂2(0) = β̂0 + β̂1X1i
115Interpreting Dummy-Variable Model Equation
Given: Ŷi = β̂0 + β̂1X1i + β̂2X2i
  Y = Starting salary of college grads
  X1 = GPA
  X2 = 0 if Male, 1 if Female
Males (X2 = 0): Ŷi = β̂0 + β̂1X1i + β̂2(0) = β̂0 + β̂1X1i
Females (X2 = 1): Ŷi = β̂0 + β̂1X1i + β̂2(1) = (β̂0 + β̂2) + β̂1X1i
Same slopes
116Dummy-Variable Model Relationships
(Graph: Ŷ vs. X1, two parallel lines with the same slope β̂1; the Females line has intercept β̂0 + β̂2 and the Males line has intercept β̂0.)
117Dummy-Variable Model Example
118Dummy-Variable Model Example
Computer Output: Ŷi = 3 + 5X1i + 7X2i
  X2 = 0 if Male, 1 if Female
119Dummy-Variable Model Example
Computer Output: Ŷi = 3 + 5X1i + 7X2i
  X2 = 0 if Male, 1 if Female
Males (X2 = 0): Ŷi = 3 + 5X1i + 7(0) = 3 + 5X1i
120Dummy-Variable Model Example
Computer Output: Ŷi = 3 + 5X1i + 7X2i
  X2 = 0 if Male, 1 if Female
Males (X2 = 0): Ŷi = 3 + 5X1i + 7(0) = 3 + 5X1i
Females (X2 = 1): Ŷi = 3 + 5X1i + 7(1) = (3 + 7) + 5X1i = 10 + 5X1i
Same slopes
121Dummies for More than Two Levels
- Categorical variable X with k levels
- Choose one level as base
- The left-out value
- Generate a dummy variable for each other level i
- Xi = 1 if the observation is at level i, otherwise Xi = 0
- Interpret the coefficient on Xi
- Impact of a move from the base level to level i
122GOLFCRD (p.711)
insheet using golfcrd.txt
rename v1 brand
rename v2 distance
gen b = 1 if brand == "B"
replace b = 0 if brand != "B"
gen c = 1 if brand == "C"
replace c = 0 if brand != "C"
gen d = 1 if brand == "D"
replace d = 0 if brand != "D"
123GOLFCRD (SPSS p. 712)
. regress distance b c d

      Source |       SS       df       MS              Number of obs =      40
-------------+------------------------------           F(  3,    36) =   43.99
       Model |  2794.38913     3  931.463043           Prob > F      =  0.0000
    Residual |  762.300429    36  21.1750119           R-squared     =  0.7857
-------------+------------------------------           Adj R-squared =  0.7678
       Total |  3556.68956    39  91.1971681           Root MSE      =  4.6016

------------------------------------------------------------------------------
    distance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           b |      10.28   2.057912     5.00   0.000     6.106356    14.45363
           c |      19.17   2.057912     9.32   0.000     14.99636    23.34364
           d |  -1.460002   2.057912    -0.71   0.483    -5.633641    2.713637
       _cons |     250.78   1.455164   172.34   0.000     247.8288    253.7312
------------------------------------------------------------------------------
124Exercise 12.67
- p. 713
- What is the least squares equation?
- Interpret the betas
- Interpret the null hypothesis beta1 = beta2 = 0 in terms of the mu values for the different levels
- Conduct the hypothesis test from c.
125Exercise 12.80
126Exercise 12.79
127Dummy Interactions
- Quantitative Variable X1
- Dummy Variable X2
- Model with interaction term (written out below)
- What is the slope of X1 when X2 is 0?
- What is the slope of X1 when X2 is 1?
- So β3 gives X2's impact on the slope of X1
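Writing out the model the slide has in mind and answering its two questions (standard dummy-interaction algebra):

\[ E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \]
\[ X_2 = 0: \quad E(Y) = \beta_0 + \beta_1 X_1 \quad (\text{slope } \beta_1) \]
\[ X_2 = 1: \quad E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_1 \quad (\text{slope } \beta_1 + \beta_3) \]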
128Testing Model Portions
129Testing Model Portions
- 1. Tests the Contribution of a Set of X Variables to the Relationship With Y
- 2. Null Hypothesis: H0: βg+1 = ... = βk = 0
- Variables in the Set Do Not Significantly Improve the Model When All Other Variables Are Included
- 3. Used in Selecting X Variables or Models
- Part of Most Computer Programs
130F-Test for Nested Models
- Numerator
- Reduction in SSE from the additional parameters
- df = k - g = number of additional parameters
- Denominator
- SSE of complete model
- df = n - (k + 1) = error df (see the formula below)
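Putting the numerator and denominator together gives the usual nested-model (partial) F statistic, where SSE_R and SSE_C are the reduced- and complete-model sums of squared errors:

\[ F = \frac{(SSE_R - SSE_C)/(k - g)}{SSE_C/[\,n - (k+1)\,]} \]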
131Exercise 12.89
- Which of these models is nested?
- p. 732
132Exercise 12.93
133Exercise 12.90
- Why is the F-test a one-tailed, upper-tailed test?
134Selecting Variables in Model Building
135Selecting Variables in Model Building
A Butterfly Flaps its Wings in Japan, Which
Causes It to Rain in Nebraska. -- Anonymous
Use Theory Only!
Use Computer Search!
136Model Building with Computer Searches
- 1. Rule Use as Few X Variables As Possible
- 2. Stepwise Regression
- Computer Selects X Variable Most Highly
Correlated With Y - Continues to Add or Remove Variables Depending on
SSE - 3. Best Subset Approach
- Computer Examines All Possible Sets
137Should You Do It?
- It's quite problematic
- You've run a large number of tests, so the probability of at least one error is high
- P-values come out too low, confidence intervals too narrow
- Gives biased estimates of coefficients for the variables not dropped
- See http://www.stata.com/support/faqs/stat/stepwise.html
- But it's commonly done
138Residual Analysis
139Evaluating Multiple Regression Model Steps
Expanded!
- 1. Examine Variation Measures
- 2. Do Residual Analysis
- 3. Test Parameter Significance
- Overall Model
- Individual Coefficients
- 4. Test for Multicollinearity
New!
New!
New!
140Residual Analysis
- 1. Graphical Analysis of Residuals
- Plot Estimated Errors vs. Xi Values
- Difference Between Actual Yi and Predicted Yi
- Estimated Errors Are Called Residuals
- Plot Histogram or Stem-and-Leaf of Residuals
- 2. Purposes
- Examine Functional Form (Linear vs. Non-Linear Model)
- Evaluate Violations of Assumptions
141Linear Regression Assumptions
- 1. Mean of Probability Distribution of Error Is 0
- 2. Probability Distribution of Error Has Constant
Variance - 3. Probability Distribution of Error is Normal
- 4. Errors Are Independent
142Residual Plot for Functional Form
Add X² Term
Correct Specification
143Residual Plot for Constant Variance
Unequal Variance
Correct Specification
Fan-shaped. Standardized residuals are typically used (residual divided by the standard error of prediction).
144Residual Plot for Independence
Not Independent
Correct Specification
145GFCLOCKS again
. regress price age numbids

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =  120.19
       Model |  4283062.96     2  2141531.48           Prob > F      =  0.0000
    Residual |   516726.54    29  17818.1565           R-squared     =  0.8923
-------------+------------------------------           Adj R-squared =  0.8849
       Total |   4799789.5    31  154831.919           Root MSE      =  133.48

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   12.74057   .9047403    14.08   0.000     10.89017    14.59098
     numbids |   85.95298   8.728523     9.85   0.000     68.10115    103.8048
       _cons |  -1338.951   173.8095    -7.70   0.000    -1694.432   -983.4711
------------------------------------------------------------------------------
146Calculating Residuals
. predict yhat, xb
. predict ehat, residuals
. predict stdehat, rstandard
. predict stderr, stdr
. list price yhat ehat stderr stdehat in 1/5

     +----------------------------------------------------+
     | price       yhat        ehat     stderr     stdehat |
     |----------------------------------------------------|
  1. |  1235    1396.49   -161.4904   127.7914   -1.263703 |
  2. |  1080   1157.651   -77.65049   127.9045   -.6070973 |
  3. |   845   880.7725   -35.77246   127.7805   -.2799525 |
  4. |  1522   1345.712    176.2884   131.2617     1.34303 |
  5. |  1047   1164.296   -117.2961   127.9363   -.9168323 |
     +----------------------------------------------------+
147Plot Residuals Against Each X
148Plot Standardized Residuals
- . scatter stdehat numbids
149Regression Pitfalls
150Evaluating Multiple Regression Model Steps
Expanded!
- 1. Examine Variation Measures
- 2. Do Residual Analysis
- 3. Test Parameter Significance
- Overall Model
- Individual Coefficients
- 4. Test for Multicollinearity
New!
New!
New!
151Multicollinearity
- 1. High Correlation Between X Variables
- 2. Coefficients Measure Combined Effect
- 3. Leads to Unstable Coefficients, Depending on Which X Variables Are in the Model
- 4. Always Exists -- a Matter of Degree
- 5. Example: Using Both Age and Height as Explanatory Variables in the Same Model
152Detecting Multicollinearity
- 1. Examine Correlation Matrix
- Correlations Between Pairs of X Variables Are Higher Than Those With the Y Variable
- 2. Examine Variance Inflation Factor (VIF)
- If VIFj > 5 (or 10, according to the text), Multicollinearity Is a Problem (see the sketch below)
- 3. Few Remedies
- Obtain New Sample Data
- Eliminate One Correlated X Variable
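In Stata, VIFs are available right after regress; a sketch using the GFCLOCKS variables from these slides (recall VIF_j = 1/(1 - R_j²), where R_j² comes from regressing X_j on the other X's):

  regress price age numbids    // fit the model first
  estat vif                    // report a variance inflation factor for each regressor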
153Correlation Matrix Computer Output
. corr price age numbids
(obs=32)

             |    price      age  numbids
-------------+---------------------------
       price |   1.0000
         age |   0.7296   1.0000
     numbids |   0.3952  -0.2537   1.0000
But only correlation among independent variables
matters
154Extrapolation
(Graph: Y vs. X; predictions within the relevant range of X are interpolation, predictions outside it on either side are extrapolation.)
155Cause & Effect
(Illustration: Liquor Consumption vs. Teachers.)
156Exercise 12.116
- p. 764: what's wrong in each of the residual plots?
157Exercise 12.153
- p. 776
- Analyze FLAG dataset
- Any multicollinearity?
- Test regression model with interaction term
- Conduct residual analysis
- Good exercise, but we won't have time in class
158Conclusion
- 1. Explained the Linear Multiple Regression Model
- 2. Tested Overall Significance
- 3. Described Various Types of Models
- 4. Evaluated Portions of a Regression Model
- 5. Interpreted Linear Multiple Regression Computer Output
- 6. Described Stepwise Regression
- 7. Explained Residual Analysis
- 8. Described Regression Pitfalls
159End of Chapter