Title: Statistics for Business and Economics
1 Statistics for Business and Economics
- Chapter 11: Multiple Regression and Model Building
- John J. McGill/Lyn Noble
- Revisions by Peter Jurkat
2 Learning Objectives
- Explain the Linear Multiple Regression Model
- Describe Inference About Individual Parameters
- Test Overall Significance
- Explain Estimation and Prediction
- Describe Various Types of Models
- Describe Model Building
- Explain Residual Analysis
- Describe Regression Pitfalls
3 Types of Regression Models
4 Models With Two or More Quantitative Variables
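6 Multiple Regression Model
- General form: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
- k independent variables
- $x_1, x_2, \ldots, x_k$ may be functions of variables
- e.g., $x_2 = (x_1)^2$
- Example: PropertyPrice = $\beta_0 + \beta_1$(LotSize) + $\beta_2$(LivingArea) + $\beta_3$(NoRooms) + $\beta_4$(BRs) + $\beta_5$(Pool)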
7 Regression Modeling Steps
- Hypothesize deterministic component
- Estimate unknown model parameters
- Specify probability distribution of random error term
- Estimate standard deviation of error
- Evaluate model
- Use model for prediction and estimation
8 Probability Distribution of Random Error
9 Linear Multiple Regression Model
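12 First-Order Multiple Regression Model
- Relationship between 1 dependent and 2 or more independent variables is a linear function
- Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
- $y$: dependent (response) variable
- $x_1, \ldots, x_k$: independent (explanatory) variables
- $\beta_0$: population Y-intercept
- $\beta_1, \ldots, \beta_k$: population slopes
- $\varepsilon$: random error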
13 Assumptions for Probability Distribution of $\varepsilon$
- Mean is 0
- Constant variance, $\sigma^2$
- Normally distributed
- Errors are independent
14 First-Order Model With 2 Independent Variables
- Relationship between 1 dependent and 2 independent variables is a linear function
- Model: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
- Assumes no interaction between x1 and x2
- Effect of x1 on E(y) is the same regardless of x2 values
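15 Population Multiple Regression Model
[Figure: bivariate model. Axes y, x1, x2. The response plane has intercept $\beta_0$; the observed y at (x1i, x2i) lies a vertical distance $\varepsilon_i$ from the plane.]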
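16 Sample Multiple Regression Model
[Figure: bivariate model. Axes y, x1, x2. The fitted response plane with its estimated intercept; the observed y at (x1i, x2i) lies a vertical distance (the residual for case i) from the plane.]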
17 Parameter Estimation
19 First-Order Model Worksheet
Case, i   yi   x1i   x2i
1         1    1     3
2         4    8     5
3         1    3     2
4         3    5     6
Run regression with y, x1, x2
20 Multiple Linear Regression Equations
Too complicated by hand!
Ouch!
21 Interpretation of Estimated Coefficients
22 1st Order Model Example
- You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.) and newspaper circulation (000) on the number of ad responses (00). Estimate the unknown parameters.
- NYTAdSizeCirc.xls
You've collected the following data:
Resp (y)   Size (x1)   Circ (x2)
1          1           2
4          8           8
1          3           1
3          5           7
2          6           4
4          10          6
23 Parameter Estimation Computer Output
Parameter Estimates
Variable   DF   Parameter Estimate   Standard Error   T for H0: Param=0   Prob>|T|
INTERCEP   1    0.0640               0.2599           0.246               0.8214
ADSIZE     1    0.2049               0.0588           3.484               0.0399
CIRC       1    0.2805               0.0686           4.089               0.0264
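The estimates in this output can be reproduced from the six observations in the example. What follows is a minimal sketch, assuming ordinary least squares solved through the normal equations with the standard library only; the helper name `fit_ols` is hypothetical and does not come from the slides.

```python
# Least-squares fit of y = b0 + b1*x1 + b2*x2 to the ad-response data.

def fit_ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    aug = [
        [float(sum(r[i] * r[j] for r in X)) for j in range(p)]
        + [float(sum(r[i] * yi for r, yi in zip(X, y)))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting for numerical stability
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[p] for row in aug]

resp = [1, 4, 1, 3, 2, 4]      # y: ad responses (00)
size = [1, 8, 3, 5, 6, 10]     # x1: ad size (sq. in.)
circ = [2, 8, 1, 7, 4, 6]      # x2: circulation (000)

# Leading 1 in each row is the intercept column
X = [[1, s, c] for s, c in zip(size, circ)]
b0, b1, b2 = fit_ols(X, resp)
print(f"b0={b0:.4f}  b1={b1:.4f}  b2={b2:.4f}")
```

The printed values should agree with the INTERCEP, ADSIZE, and CIRC estimates in the output to four decimals.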
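24 Interpretation of Coefficients Solution
- Slope for ad size (0.2049): the number of responses (00) is expected to increase by 0.2049 (about 20 responses) for each 1 sq. in. increase in ad size, holding circulation constant
- Slope for circulation (0.2805): the number of responses (00) is expected to increase by 0.2805 (about 28 responses) for each 1-unit (1,000) increase in circulation, holding ad size constant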
25 Estimation of $\sigma^2$
28 Estimation of $\sigma^2$
For a model with k independent variables:
$s^2 = \dfrac{SSE}{n - (k + 1)}$, $s = \sqrt{s^2}$
29 Calculating $s^2$ and $s$ Example
- You work in advertising for the New York Times.
You want to find the effect of ad size (sq. in.),
x1, and newspaper circulation (000), x2, on the
number of ad responses (00), y. Find SSE, $s^2$, and
$s$.
30 Analysis of Variance Computer Output
Analysis of Variance
Source           DF   SS         MS         F       P
Regression       2    9.249736   4.624868   55.44   .0043
Residual Error   3    .250264    .083421
Total            5    9.5
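The requested quantities follow directly from the Residual Error row. This is a small sketch, assuming the usual estimator s^2 = SSE / (n - (k + 1)) with n = 6 observations and k = 2 independent variables; the variable names are illustrative only.

```python
import math

sse = 0.250264        # Residual Error SS from the analysis of variance output
n, k = 6, 2           # 6 observations, 2 independent variables

s2 = sse / (n - (k + 1))   # estimate of the error variance
s = math.sqrt(s2)          # estimate of the error standard deviation

print(f"SSE={sse}  s^2={s2:.6f}  s={s:.4f}")
```

The value of `s2` matches the MS entry for Residual Error (.083421), which is the same calculation.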
31 Evaluating the Model
33 Evaluating Multiple Regression Model Steps
- Examine variation measures
- Test parameter significance
- Individual coefficients
- Overall model
- Do residual analysis
34 Variation Measures
36 Multiple Coefficient of Determination
- Proportion of variation in y explained by all x variables taken together
- $R^2 = \dfrac{SS_{yy} - SSE}{SS_{yy}} = 1 - \dfrac{SSE}{SS_{yy}}$
- Never decreases when new x variable is added to model
- Only y values determine $SS_{yy}$
- Disadvantage when comparing models
37 Adjusted Multiple Coefficient of Determination
- Takes into account n and number of parameters
- Similar interpretation to $R^2$
38 Estimation of $R^2$ and $R_a^2$ Example
- You work in advertising for the New York Times.
You want to find the effect of ad size (sq. in.),
x1, and newspaper circulation (000), x2, on the
number of ad responses (00), y. Find $R^2$ and $R_a^2$.
39 Excel Computer Output Solution
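The Excel output for this slide did not survive extraction, but both measures can be recomputed from the sums of squares in the analysis of variance output. The sketch below assumes the standard definitions R^2 = 1 - SSE/SSyy and Ra^2 = 1 - [(n - 1)/(n - (k + 1))](1 - R^2); the variable names are illustrative.

```python
ss_yy = 9.5          # Total SS from the analysis of variance output
sse = 0.250264       # Residual Error SS from the same output
n, k = 6, 2          # observations, independent variables

r2 = 1 - sse / ss_yy
# The adjustment penalizes R^2 for the number of parameters relative to n
r2_adj = 1 - (n - 1) / (n - (k + 1)) * (1 - r2)

print(f"R^2={r2:.4f}  adjusted R^2={r2_adj:.4f}")
```

With only six observations the adjustment is noticeable: the adjusted value is lower than R^2 even though both independent variables are useful.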
40 Testing Parameters
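42 Inference for an Individual $\beta$ Parameter
- Confidence interval: $\hat{\beta}_i \pm t_{\alpha/2}\, s_{\hat{\beta}_i}$, with $df = n - (k + 1)$
- Hypothesis test: $H_0: \beta_i = 0$; $H_a: \beta_i \ne 0$ (or $<$ or $>$)
- Test statistic: $t = \hat{\beta}_i / s_{\hat{\beta}_i}$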
43 Confidence Interval Example
- You work in advertising for the New York Times.
You want to find the effect of ad size (sq. in.),
x1, and newspaper circulation (000), x2, on the
number of ad responses (00), y. Find a 95%
confidence interval for $\beta_1$.
44 Excel Computer Output Solution
45 Confidence Interval Solution
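The worked solution on this slide was lost in extraction. A sketch of the calculation, assuming the ADSIZE estimate and standard error from the parameter estimates output and a tabled critical value t(.025) = 3.182 for n - (k + 1) = 3 degrees of freedom (the critical value is taken from a standard t table, not from the slides):

```python
b1 = 0.2049          # ADSIZE parameter estimate
se_b1 = 0.0588       # ADSIZE standard error
t_crit = 3.182       # t_{.025} with 3 df, from a t table (assumption)

half_width = t_crit * se_b1
lower = b1 - half_width
upper = b1 + half_width

print(f"95% CI for beta_1: ({lower:.4f}, {upper:.4f})")
```

The interval excludes 0, which is consistent with the two-tailed p-value of 0.0399 reported for ADSIZE.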
46 Hypothesis Test Example
- You work in advertising for the New York Times.
You want to find the effect of ad size (sq. in.),
x1, and newspaper circulation (000), x2, on the
number of ad responses (00), y. Test the
hypothesis that the mean ad response increases as
circulation increases (ad size constant). Use
$\alpha = .05$.
48 Excel Computer Output Solution
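49 Hypothesis Test Solution
- $H_0: \beta_2 = 0$
- $H_a: \beta_2 > 0$
- $\alpha = .05$
- $df = 6 - 3 = 3$
- Critical value(s): $t = 2.353$ (reject $H_0$ if $t > 2.353$)
- Test statistic: $t = 0.2805 / 0.0686 = 4.089$
- Decision: Reject at $\alpha = .05$
- Conclusion: There is evidence the mean ad response increases as circulation increases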
50 Excel Computer Output Solution
P-Value
52 Testing Overall Significance
- Shows if there is a linear relationship between all x variables together and y
- Hypotheses
- $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
- No linear relationship
- $H_a$: At least one coefficient is not 0
- At least one x variable affects y
53 Testing Overall Significance
- Test statistic: $F = \dfrac{MS(\text{Model})}{MS(\text{Error})}$
- Degrees of freedom: $\nu_1 = k$, $\nu_2 = n - (k + 1)$
- k = number of independent variables
- n = sample size
54 Testing Overall Significance Example
- You work in advertising for the New York Times.
You want to find the effect of ad size (sq. in.),
x1, and newspaper circulation (000), x2, on the
number of ad responses (00), y. Conduct the
global F-test of model usefulness. Use $\alpha = .05$.
56 Testing Overall Significance Computer Output
Analysis of Variance
Source    DF   Sum of Squares   Mean Square   F Value   Prob>F
Model     2    9.2497           4.6249        55.440    0.0043
Error     3    0.2503           0.0834
C Total   5    9.5000
Error DF = n - (k + 1); F Value = MS(Model) / MS(Error)
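57 Testing Overall Significance Solution
- $H_0: \beta_1 = \beta_2 = 0$
- $H_a$: At least 1 is not zero
- $\alpha = .05$
- $\nu_1 = 2$, $\nu_2 = 3$
- Critical value(s): $F = 9.55$ (reject $H_0$ if $F > 9.55$)
- Test statistic: $F = 4.6249 / 0.0834 = 55.44$
- Decision: Reject at $\alpha = .05$
- Conclusion: There is evidence at least 1 of the coefficients is not zero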
59 Interaction Models
61 Interaction Model With 2 Independent Variables
- Hypothesizes interaction between pairs of x variables
- Response to one x variable varies at different levels of another x variable
- Can be combined with other models
- Example: dummy-variable model
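62 Effect of Interaction
Given: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
- Without interaction term, effect of x1 on y is measured by $\beta_1$
- With interaction term, effect of x1 on y is measured by $\beta_1 + \beta_3 x_2$
- Effect increases as x2 increases (when $\beta_3 > 0$)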
63 No Interaction
$E(y) = 1 + 2x_1 + 3x_2$
[Figure: E(y) (0 to 12) plotted against x1 (0 to 1.5) for fixed values of x2; the lines are parallel.]
Effect (slope) of x1 on E(y) does not depend on x2 value
64 Interaction Model Relationships
$E(y) = 1 + 2x_1 + 3x_2 + 4x_1 x_2$
[Figure: E(y) (0 to 12) plotted against x1 (0 to 1.5) for fixed values of x2; the lines have different slopes.]
Effect (slope) of x1 on E(y) depends on x2 value
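The contrast between the two models, E(y) = 1 + 2x1 + 3x2 and E(y) = 1 + 2x1 + 3x2 + 4x1x2, can be checked numerically. In this sketch the x2 values 0, 1, and 2 are chosen for illustration only (the values plotted on the slides did not survive), and the function names are hypothetical.

```python
def e_y_no_interaction(x1, x2):
    return 1 + 2 * x1 + 3 * x2

def e_y_interaction(x1, x2):
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

def slope_in_x1(model, x2):
    """Change in E(y) per unit increase in x1, holding x2 fixed."""
    return model(1.0, x2) - model(0.0, x2)

x2_values = [0, 1, 2]   # illustrative values, not taken from the slides
slopes_additive = [slope_in_x1(e_y_no_interaction, x2) for x2 in x2_values]
slopes_interaction = [slope_in_x1(e_y_interaction, x2) for x2 in x2_values]

print("no interaction:", slopes_additive)      # same slope at every x2
print("interaction:   ", slopes_interaction)   # slope is 2 + 4*x2
```

Without the cross-product term the slope stays at 2 for every x2; with it, the slope is 2 + 4x2, so it grows as x2 grows.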
65 Interaction Model Worksheet
Case, i   yi   x1i   x2i   x1i x2i
1         1    1     3     3
2         4    8     5     40
3         1    3     2     6
4         3    5     6     30
Multiply x1 by x2 to get x1x2. Run regression with y, x1, x2, x1x2
66 Interaction Example
- You work in advertising for the New York Times.
You want to find the effect of ad size (sq. in.),
x1, and newspaper circulation (000), x2, on the
number of ad responses (00), y. Conduct a test
for interaction. Use $\alpha = .05$.
67 Interaction Model Worksheet
yi   x1i   x2i   x1i x2i
1    1     2     2
4    8     8     64
1    3     1     3
3    5     7     35
2    6     4     24
4    10    6     60
Multiply x1 by x2 to get x1x2. Run regression with y, x1, x2, x1x2
68 Excel Computer Output Solution
Global F-test (F and P-Value in the output) indicates at least one parameter is not zero
70 Excel Computer Output Solution
71 Interaction Test Solution
- $H_0: \beta_3 = 0$
- $H_a: \beta_3 \ne 0$
- $\alpha = .05$
- $df = 6 - 4 = 2$
- Critical value(s): $t = \pm 4.303$
- Test statistic: $t = 1.8528$
- Decision: Do not reject at $\alpha = .05$
- Conclusion: There is no evidence of interaction
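The interaction test can be reproduced from the worksheet data. The following is a sketch under stated assumptions: ordinary least squares via the normal equations, standard errors taken from the diagonal of s^2 (X'X)^-1, and a tabled critical value t(.025) = 4.303 for 2 degrees of freedom. The helper `ols` is hypothetical, not something the slides define.

```python
import math

def ols(X, y):
    """Return (coefficients, inverse of X'X) using Gauss-Jordan elimination."""
    p = len(X[0])
    xtx = [[float(sum(r[i] * r[j] for r in X)) for j in range(p)] for i in range(p)]
    xty = [float(sum(r[i] * yi for r, yi in zip(X, y))) for i in range(p)]
    # Reduce [X'X | I] to [I | (X'X)^-1]
    aug = [xtx[i] + [1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(p):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    inv = [row[p:] for row in aug]
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    return beta, inv

resp = [1, 4, 1, 3, 2, 4]      # y
size = [1, 8, 3, 5, 6, 10]     # x1
circ = [2, 8, 1, 7, 4, 6]      # x2
# Columns: intercept, x1, x2, x1*x2
X = [[1, s, c, s * c] for s, c in zip(size, circ)]

beta, inv = ols(X, resp)
fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
sse = sum((yi - fi) ** 2 for yi, fi in zip(resp, fitted))
df = len(resp) - len(beta)             # n - (k + 1) = 6 - 4 = 2
s2 = sse / df
se_b3 = math.sqrt(s2 * inv[3][3])      # standard error of the interaction term
t_b3 = beta[3] / se_b3

t_crit = 4.303                         # t_{.025} with 2 df, from a t table
reject = abs(t_b3) > t_crit
print(f"t for interaction = {t_b3:.4f}, reject H0: {reject}")
```

The computed statistic should land close to the t = 1.8528 reported in the solution, well inside the critical values.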
72 Second-Order Models
74 Second-Order Model With 1 Independent Variable
- Relationship between 1 dependent and 1 independent variable is a quadratic function
- Useful 1st model if non-linear relationship suspected
- Model: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2$
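75 Second-Order Model Relationships
[Figure: four panels of y against x1. Two panels show curves with $\beta_2 > 0$ (opening upward) and two show curves with $\beta_2 < 0$ (opening downward).]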
76 Second-Order Model Worksheet
Case, i   yi   xi   xi^2
1         1    1    1
2         4    8    64
3         1    3    9
4         3    5    25
Create x^2 column. Run regression with y, x, x^2.
77 2nd Order Model Example
Errors (y)   Weeks (x)
20           1
18           1
16           2
10           4
8            4
4            5
3            6
1            8
2            10
1            11
0            12
1            12
- The data show the number of weeks employed and the number of errors made per day for a sample of assembly line workers. Find a 2nd order model, conduct the global F-test, and test if $\beta_2 \ne 0$. Use $\alpha = .05$ for all tests.
78 Second-Order Model Worksheet
yi   xi   xi^2
20   1    1
18   1    1
16   2    4
10   4    16
Create x^2 column. Run regression with y, x, x^2.
79 Excel Computer Output Solution
80 Overall Model Test Solution
Global F-test (F and P-Value in the output) indicates at least one parameter is not zero
81 $\beta_2$ Parameter Test Solution
$\beta_2$ test (t in the output) indicates curvilinear relationship exists
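The Excel output for this example was not preserved, so the fit itself can be sketched from the errors-and-weeks data. This assumes ordinary least squares on the columns 1, x, and x^2; `fit_ols` and `predict` are hypothetical helper names, and no numeric results from the lost output are claimed here.

```python
def fit_ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    p = len(X[0])
    aug = [
        [float(sum(r[i] * r[j] for r in X)) for j in range(p)]
        + [float(sum(r[i] * yi for r, yi in zip(X, y)))]
        for i in range(p)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[p] for row in aug]

errors = [20, 18, 16, 10, 8, 4, 3, 1, 2, 1, 0, 1]    # y: errors per day
weeks = [1, 1, 2, 4, 4, 5, 6, 8, 10, 11, 12, 12]     # x: weeks employed

# "Create x^2 column", then regress y on x and x^2
X = [[1, x, x * x] for x in weeks]
b0, b1, b2 = fit_ols(X, errors)

def predict(x):
    return b0 + b1 * x + b2 * x * x

residuals = [y - predict(x) for x, y in zip(weeks, errors)]
print(f"b0={b0:.3f}  b1={b1:.3f}  b2={b2:.3f}")
```

A positive estimate for the squared term is what the slide's conclusion of a curvilinear relationship would lead one to expect: errors fall quickly in the first weeks and then level off.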
83 Second-Order Model With 2 Independent Variables
- Relationship between 1 dependent and 2 independent variables is a quadratic function
- Useful 1st model if non-linear relationship suspected
- Model: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$
84 Second-Order Model Relationships
[Figure: two response surfaces plotted over the x1 and x2 axes.]
85 Second-Order Model Worksheet
Case, i   yi   x1i   x2i   x1i x2i   x1i^2   x2i^2
1         1    1     3     3         1       9
2         4    8     5     40        64      25
3         1    3     2     6         9       4
4         3    5     6     30        25      36
Multiply x1 by x2 to get x1x2, then create x1^2 and x2^2. Run regression with y, x1, x2, x1x2, x1^2, x2^2.
86 Models With One Qualitative Independent Variable
88 Dummy-Variable Model
- Involves categorical x variable with 2 levels
- e.g., male-female, college-no college
- Variable levels coded 0 and 1
- Number of dummy variables is 1 less than number of levels of variable
- May be combined with quantitative variable (1st order or 2nd order model)
See QtrGDPAnalyzed.xls
89 Dummy-Variable Model Worksheet
Case, i   yi   x1i   x2i
1         1    1     1
2         4    8     0
3         1    3     1
4         3    5     1
x2 levels: 0 = Group 1, 1 = Group 2. Run regression with y, x1, x2
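90 Interpreting Dummy-Variable Model Equation
Given: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
y = starting salary of college graduates
x1 = GPA
x2 = 0 if male, 1 if female
Males (x2 = 0): $E(y) = \beta_0 + \beta_1 x_1$
Females (x2 = 1): $E(y) = (\beta_0 + \beta_2) + \beta_1 x_1$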
91 Dummy-Variable Model Example
Computer Output
x2 = 0 if male, 1 if female
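92 Dummy-Variable Model Relationships
[Figure: y against x1, showing two parallel lines with the same slope $\beta_1$. The female line has intercept $\beta_0 + \beta_2$; the male line has intercept $\beta_0$.]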
93 Nested Models
94 Comparing Nested Models
- A nested (reduced) model contains a subset of terms in the complete (full) model
- Tests the contribution of a set of x variables to the relationship with y
- Null hypothesis $H_0: \beta_{g+1} = \cdots = \beta_k = 0$
- Variables in set do not significantly improve the model when all other variables are included
- Used in selecting x variables or models
- Part of most computer programs
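The slide does not show the test statistic, so the sketch below assumes the standard nested-model form F = [(SSE_R - SSE_C) / (k - g)] / [SSE_C / (n - (k + 1))], where SSE_R and SSE_C are the error sums of squares of the reduced and complete models. The function name `nested_f` is hypothetical. As an illustration, the reduced model is the intercept-only model (g = 0, so SSE_R equals the Total SS of 9.5) and the complete model is the first-order ad-response model (SSE_C = 0.250264).

```python
def nested_f(sse_reduced, sse_complete, n, k, g):
    """F statistic for H0: beta_{g+1} = ... = beta_k = 0.

    k = number of x terms in the complete model,
    g = number of x terms in the reduced model.
    """
    numerator = (sse_reduced - sse_complete) / (k - g)
    denominator = sse_complete / (n - (k + 1))
    return numerator / denominator

# Reduced: intercept only (SSE = Total SS); complete: y on x1 and x2
f_stat = nested_f(sse_reduced=9.5, sse_complete=0.250264, n=6, k=2, g=0)
print(f"F = {f_stat:.2f} on ({2 - 0}, {6 - 3}) degrees of freedom")
```

Dropping every x term turns the nested test into the global F-test, which is why this call returns the F value of 55.44 from the analysis of variance output.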
95 Selecting Variables in Model Building
96 Selecting Variables in Model Building
"A butterfly flaps its wings in Japan, which causes it to rain in Nebraska." (Anonymous)
- Use theory only!
- Use computer search!
97 Model Building with Computer Searches
- Rule: Use as few x variables as possible
- Stepwise regression
- Computer selects x variable most highly correlated with y
- Continues to add or remove variables depending on SSE
- Best subset approach
- Computer examines all possible sets
98 Residual Analysis
100 Residual Analysis
- Graphical analysis of residuals
- Plot estimated errors versus xi values
- Difference between actual yi and predicted yi
- Estimated errors are called residuals
- Plot histogram or stem-and-leaf of residuals
- Purposes
- Examine functional form (linear v. non-linear model)
- Evaluate violations of assumptions
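101 Residual Plot for Functional Form
[Figure: two residual plots, one labeled "Add x^2 Term" and one labeled "Correct Specification".]
102 Residual Plot for Equal Variance
[Figure: two residual plots, one labeled "Unequal Variance" (fan-shaped) and one labeled "Correct Specification".]
Standardized residuals are typically used.
103 Residual Plot for Independence
[Figure: two plots of residuals e against x, one labeled "Not Independent" and one labeled "Correct Specification".]
Plots reflect the sequence in which the data were collected.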
104 Residual Analysis Computer Output
Obs   Dep Var SALES   Predict Value   Residual   Student Residual
1     1.0000          0.6000          0.4000     1.044
2     1.0000          1.3000          -0.3000    -0.592
3     2.0000          2.0000          0          0.000
4     2.0000          2.7000          -0.7000    -1.382
5     4.0000          3.4000          0.6000     1.567
The final output column (scale -2 to 2) is a plot of the standardized (student) residuals.
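The Residual column in this output is simply the observed value minus the predicted value. A short sketch of that calculation follows, together with a screen of the student residuals against a cutoff of 2 in absolute value; the cutoff is an assumption here, a common rule of thumb suggested by the -2 to 2 plot scale rather than something the slide states.

```python
sales = [1.0, 1.0, 2.0, 2.0, 4.0]          # Dep Var SALES
predicted = [0.6, 1.3, 2.0, 2.7, 3.4]      # Predict Value
student = [1.044, -0.592, 0.000, -1.382, 1.567]   # Student Residual column

# Residual = actual y minus predicted y (rounded to the output's 4 decimals)
residuals = [round(y - y_hat, 4) for y, y_hat in zip(sales, predicted)]

# Observations whose standardized residual falls outside +/-2 (assumed cutoff)
flagged = [obs for obs, r in enumerate(student, start=1) if abs(r) > 2]

print("residuals:", residuals)
print("flagged observations:", flagged)
```

The residuals sum to zero, as least-squares residuals from a model with an intercept must, and no observation is flagged.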
105 Regression Pitfalls
106 Regression Pitfalls
- Parameter estimability
- Number of different x-values must be at least one more than order of model
- Multicollinearity
- Two or more x-variables in the model are correlated
- Extrapolation
- Predicting y-values outside sampled range
- Correlated errors
107 Multicollinearity
- High correlation between x variables
- Coefficients measure combined effect
- Leads to unstable coefficients depending on x variables in model
- Always exists; matter of degree
- Example: using both age and height as explanatory variables in same model
See RealestateAnalyzed.xls
108 Detecting Multicollinearity
- Correlations between pairs of x variables are significant and stronger than their correlations with the y variable
- Nonsignificant t-tests for most of the individual parameters, but overall model test is significant
- Estimated parameters have wrong sign
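The first check can be illustrated with the ad-response data from the earlier example, by comparing the correlation between the two x variables with each variable's correlation with y. This is a sketch only; `pearson_r` is a hypothetical helper, and using this data set for the check is an illustration rather than an analysis from the slides.

```python
import math

def pearson_r(a, b):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)

resp = [1, 4, 1, 3, 2, 4]      # y: ad responses (00)
size = [1, 8, 3, 5, 6, 10]     # x1: ad size (sq. in.)
circ = [2, 8, 1, 7, 4, 6]      # x2: circulation (000)

r_x1x2 = pearson_r(size, circ)
r_x1y = pearson_r(size, resp)
r_x2y = pearson_r(circ, resp)

print(f"r(x1,x2)={r_x1x2:.3f}  r(x1,y)={r_x1y:.3f}  r(x2,y)={r_x2y:.3f}")
```

Here the x variables are correlated with each other (about 0.74), but less strongly than either is with y, so this particular warning sign is not triggered.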
109 Solutions to Multicollinearity
- Eliminate one or more of the correlated x variables
- Avoid inference on individual parameters
- Do not extrapolate
110 Extrapolation
[Figure: y against x. Interpolation lies within the sampled range of x; extrapolation lies on either side of it.]
111 NPP Not Straight
- When regression measures cannot guarantee reliability (bent normal probability plot (NPP), high Sig-F), variables can be transformed; usually applied to the dependent variable (DV)
- Raise values to some power; see ladder of powers in Variable Transformations.doc
- Powers > 1 can make left-skewed data less skewed
- Powers < 1 can make right-skewed data less skewed
- Could make NPP straighter for DVs
112 Conclusion
- Explained the Linear Multiple Regression Model
- Described Inference About Individual Parameters
- Tested Overall Significance
- Explained Estimation and Prediction
- Described Various Types of Models
- Described Model Building
- Explained Residual Analysis
- Described Regression Pitfalls