Title: Additional topics in Regression including Multiple Regression
1Part VI
- Additional topics in Regression including
Multiple Regression
Dr. Stephen H. Russell
Weber State University
2Statistical outliers
- Suppose your sample data looked like this:
[Figure: scatter plot of Y against X; one observation appears inconsistent with the rest of the sample data]
3Statistical outliers
- Three possible explanations
- Error in data collection
- Relationship between X and Y is not strong
- Relationship between X and Y is non-linear
Statistical outliers should be taken out of the
data sample only if there was obviously an error
in taking that observation
4Statistical outliers
- MINITAB will highlight statistical outliers for you. This MINITAB reminder invites you to reassess your data and your model.
- MINITAB defines an observation as unusual if the observation is more than two standard deviations away from the fitted regression line.
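The two-standard-deviation rule can be sketched in a few lines of Python. This is a minimal illustration of the rule as stated on the slide, not MINITAB's own procedure; the data and the function names `fit_line` and `flag_unusual` are hypothetical.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for a simple regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def flag_unusual(x, y, cutoff=2.0):
    """Return indices whose residual exceeds `cutoff` standard errors of estimate."""
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Standard error of estimate: sqrt(SSE / (n - 2)) with one explanatory variable.
    s = math.sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))
    return [i for i, e in enumerate(residuals) if abs(e) > cutoff * s]

# Hypothetical sample: y is roughly 2x, except for the fifth observation.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 25.0, 12.0, 13.9, 16.2, 18.0, 20.1]
unusual = flag_unusual(x, y)
print(unusual)  # indices of flagged observations
```

A flagged observation is a prompt to check the data and the model, not an automatic reason to delete the point.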
5Two types of data
- Time series
- Example: Crime rates in the United States year by year
- Cross Sectional
- Example: Crime rates at a particular point in time for New York City, Detroit, Kansas City, Reno, Phoenix, Fresno, Montgomery, Louisville, etc.
Cross sectional studies are generally preferred
over time series due to risk of autocorrelation
in time series data.
6Multiple Regression Analysis
- Several independent variables are used to estimate the value of an unknown dependent variable
- Each predictor variable explains part of the total variation of the dependent variable
- Intent is to increase R2 and get a solid, statistically valid relationship between Y and a whole set of pertinent explanatory variables.
Number of explanatory variables is k
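As a rough illustration of fitting a model with k explanatory variables, the sketch below solves the least-squares normal equations in plain Python. The data are hypothetical and generated exactly from y = 1 + 2*x1 + 3*x2, so the fit should recover those coefficients with R2 = 1. The function names are illustrative only; in practice MINITAB or a statistics library does this work.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_multiple(rows, y):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

def r_squared(rows, y, coefs):
    """R2 = 1 - SSE/SST."""
    fitted = [coefs[0] + sum(b * v for b, v in zip(coefs[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

# Hypothetical data with k = 2 explanatory variables.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefs = fit_multiple(rows, y)
r2 = r_squared(rows, y, coefs)
print(coefs, r2)
```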
7The Global Test F test
- We test the ability of the entire set of
independent variables in multiple regression to
explain the behavior of the dependent variable.
- Let's assume four independent variables
- The hypotheses are
  H0: β1 = β2 = β3 = β4 = 0
  HA: Not all the βs are 0
- These hypotheses are assessed on the basis of an F Test calculation
- The global test asks, "Could the amount of explained variation, R2, occur by chance?"
8The Global Test F test
This test requires the employment of an ANOVA
Table based upon regression
SOURCE       DF   SS   MS     F
Regression    4   10   2.50   10.0
Error        20    5   0.25
Total        24   15
9The Degrees of Freedom Concept
- Degrees of Freedom refers to the number of elements that are free to take on any value
- Any Total has n - 1 degrees of freedom
- Since the number of explanatory variables is totally free, the regression degrees of freedom are fixed once k is chosen (regression df = k).
- Error degrees of freedom, by default, is n - 1 - k
10The Global Test F test
This test requires the employment of an ANOVA
Table based upon regression
SOURCE       DF   SS   MS     F
Regression    4   10   2.50   10.0
Error        20    5   0.25
Total        24   15
How many explanatory variables comprise this
model?
What is sample size here?
11The Global Test F Test
- The Global Test of multiple explanatory variables is a ratio of two variances:
  F = MSR / MSE = (SSR / k) / (SSE / (n - 1 - k))
12ANOVA table for Regression
SOURCE       DF   SS   MS     F
Regression    4   10   2.50   10.0
Error        20    5   0.25
Total        24   15
- Reminder:
  Regression degrees of freedom is k (number of explanatory variables)
  Error degrees of freedom is n - 1 - k
  Total degrees of freedom is n - 1
13ANOVA table for Regression
- Fill in an ANOVA table like the above for this problem: A researcher tests a model with 5 explanatory variables and 51 observations. Results are SST = 215 and SSE = 180.
14ANOVA table for Regression
ANSWER
SOURCE       DF   SS    MS   F
Regression    5    35   7    1.75
Error        45   180   4
Total        50   215
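The arithmetic behind the ANOVA table can be sketched as a small Python function. The name `anova_table` and the dictionary layout are illustrative choices, not MINITAB output; the inputs are the ones given in the exercise (k = 5, n = 51, SST = 215, SSE = 180).

```python
def anova_table(k, n, sst, sse):
    """Build the regression ANOVA quantities from k, n, SST and SSE."""
    ssr = sst - sse                          # explained variation
    df_reg, df_err, df_tot = k, n - 1 - k, n - 1
    msr = ssr / df_reg                       # mean square, regression
    mse = sse / df_err                       # mean square, error
    return {
        "Regression": (df_reg, ssr, msr),
        "Error": (df_err, sse, mse),
        "Total": (df_tot, sst),
        "F": msr / mse,
    }

table = anova_table(k=5, n=51, sst=215, sse=180)
for source, values in table.items():
    print(source, values)
```

Calling `anova_table(k=4, n=25, sst=15, sse=5)` gives the quantities for the earlier table with four explanatory variables.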
15The Global Test F Test
- The FCalc is compared to FCrit to make a decision on the hypotheses
  H0: β1 = β2 = β3 = β4 = 0
  HA: Not all the βs are 0
- Recall that the F Test is always a ratio of two variances, with dfn and dfd.
- In the regression case, dfn is k and dfd is n - 1 - k
- Hence FCrit is of the form Fα(k, n - 1 - k)
16A MINITAB example
17Another Way to Calculate F
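The content of these slides did not survive in the text. One standard identity, assumed here rather than taken from the slides, computes F directly from R2: F = (R2 / k) / ((1 - R2) / (n - 1 - k)). A minimal Python sketch, checked against the earlier ANOVA table where SSR = 10 and SST = 15:

```python
def f_from_r2(r2, k, n):
    """Global F statistic computed from R2, k explanatory variables, n observations."""
    return (r2 / k) / ((1 - r2) / (n - 1 - k))

# R2 = SSR / SST = 10 / 15 for the table with k = 4 and n = 25.
f_stat = f_from_r2(r2=10 / 15, k=4, n=25)
print(round(f_stat, 2))
```

This agrees with MSR / MSE because dividing both SSR and SSE by SST leaves their ratio unchanged.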
21Multiple Regression Analysis
In addition to the Global F Test, we do a t test on the coefficient of each explanatory variable:
H0: β1 = 0   HA: β1 ≠ 0
H0: β2 = 0   HA: β2 ≠ 0
Etc.
We keep or throw out potential explanatory variables on the basis of the P-values associated with each t test.
22Example
Suppose the manager of an automotive parts distributor that serves the market of Virginia, North Carolina, and South Carolina wants to estimate total annual sales. Four independent variables were recommended to him by his marketing department: the number of retail outlets in the region that stock this company's parts (X1), the number of automobiles registered in this region (X2), the average age of automobiles registered (X3), and the commission rate paid to company sales people (X4).
23Example
Predictor  Coef  StDev  T  P
Constant
NumberOut  1427.00  9.41  0.000
Automobil  24.90  2.60  0.001
AverageAg  19.80  0.49  0.883
Commissio  4.33  1.89  0.078
Which explanatory variables are statistically
significant?
24Example
Specification Errors
- Including variables that are not statistically significant.
- Excluding variables that ought to be included.
25Example
We need to run this regression again with at
least X3 excluded.
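The keep-or-drop decision can be sketched by comparing each reported P-value with a significance level. The P-values below are the ones in the MINITAB output above; the cutoff of 0.05 is an assumption, since the slides do not state a level.

```python
ALPHA = 0.05  # assumed significance level, not stated in the slides

# P-values as reported for each predictor in the output above.
p_values = {
    "NumberOut": 0.000,  # X1
    "Automobil": 0.001,  # X2
    "AverageAg": 0.883,  # X3
    "Commissio": 0.078,  # X4
}

keep = [name for name, p in p_values.items() if p < ALPHA]
drop = [name for name, p in p_values.items() if p >= ALPHA]
print("significant:", keep)
print("not significant:", drop)
```

At this cutoff X3 is clearly not significant, while X4 is a borderline case, which fits the advice to rerun the regression with at least X3 excluded.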
26The Issue of Multicollinearity
Multicollinearity occurs when some of the independent variables tend to move together, i.e., they are correlated with each other.
With multicollinearity we can no longer give meaning to the b coefficients!
27Consider this example problem
- I want to explain earnings in the accounting profession. I suspect two good explanatory variables are age and years of experience. I collect sample information and build these estimating equations:
- First try
  Salary = 13.0 + 0.912 Age                      R2 = 72.6%   P-Value = .002
- Second try
  Salary = 35.8 + 0.878 Yrs Exper                R2 = 77.3%   P-Value = .001
- Third try
  Salary = 36.1 + 0.014 Age + 0.890 Yrs Exper    R2 = 77.3%   P-Value (Age) = .986   P-Value (Yrs Exper) = .269
28Consider this example problem
The problem here is that we have
multicollinearity!
29Multicollinearity
- The two classic signs of multicollinearity
- High R2 with low t values (that is, P-values that are not low)
- Wrong signs on one or more coefficients
- Implications
- Cannot give interpretations to the coefficients of the individual explanatory variables; we can't disentangle the separate influences on Y.
- Have to ignore the results of the t tests (the P values).
30Identifying Multicollinearity
- The sure way of assessing whether or not you are employing predictor variables that are correlated with each other is with a correlation matrix of all predictor variables
- The MTB command is
  MTB > correlation c2 c3 c4 c5
Note: If you identify multicollinearity, you must decide whether to live with it or drop one of the correlated variables from your model.
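Outside MINITAB, the same check can be sketched in Python by computing the Pearson correlation for every pair of predictors. The age and experience figures below are hypothetical, chosen to move together the way the salary example suggests; `pearson` and `correlation_matrix` are illustrative names.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sv = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (su * sv)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations, keyed by predictor name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical predictor data for eight accountants.
predictors = {
    "Age": [25, 30, 35, 40, 45, 50, 55, 60],
    "YrsExper": [2, 7, 11, 17, 21, 27, 31, 38],
}
corr = correlation_matrix(predictors)
for name, row in corr.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

An off-diagonal entry close to 1 or -1 signals that two predictors carry nearly the same information.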
31A clarifying reminder
- If the residuals (error terms) are correlated, you have autocorrelation, a serious violation of assumptions.
- If two or more of the explanatory variables are correlated, you have multicollinearity, which makes it impossible to interpret the various b's.
32Causes of wrong signs
- Multicollinearity
- Specification error (important explanatory variable or variables left out)
- Non-representative sample
- Sample too small
- Biased sample
33Adjusted R2
- r is the statistic that estimates the parameter ρ
- R2 is the statistic that estimates the parameter ρ2 (the true coefficient of determination)
- SSR/SST is a biased estimator of ρ2
- R2 (adjusted) is the bias-corrected estimator
- MINITAB shows both R2 and R2 (adj)
34R2 (Adjusted)
- The adjustment to eliminate bias is
  R2 (adj) = 1 - (1 - R2) (n - 1) / (n - 1 - k)
  where n = sample size and k = number of explanatory variables
Adjusted R2 is noticeably smaller than R2 only when k is large in comparison to n. For example, compare k = 2 and n = 100 vs. k = 4 and n = 15.
35R2 (Adjusted)
- For k = 2 and n = 100, the adjustment factor (n - 1) / (n - 1 - k) = 99/97 is small (only 1.02).
36R2 (Adjusted)
- For k = 4 and n = 15, the adjustment factor (n - 1) / (n - 1 - k) = 14/10 is substantial (1.40)!
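The two adjustment factors quoted above (1.02 and 1.40) follow from the ratio (n - 1) / (n - 1 - k). A short Python sketch, with an R2 of 0.80 used purely as a hypothetical input:

```python
def adjustment_factor(n, k):
    """Ratio (n - 1) / (n - 1 - k) that inflates the unexplained share 1 - R2."""
    return (n - 1) / (n - 1 - k)

def adjusted_r2(r2, n, k):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - 1 - k)."""
    return 1 - (1 - r2) * adjustment_factor(n, k)

small = adjustment_factor(n=100, k=2)   # large sample, few variables
large = adjustment_factor(n=15, k=4)    # small sample, several variables
print(round(small, 2), round(large, 2))

# Hypothetical R2 of 0.80 under each design.
print(round(adjusted_r2(0.80, n=100, k=2), 3))
print(round(adjusted_r2(0.80, n=15, k=4), 3))
```

With the same R2, the small-sample model loses far more after adjustment, which is the point of the comparison.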
37Three nifty model-building techniques
- Lagged independent variable
- Trend variables
- Dummy variables for qualitative data
38I. Lagged independent variable
- Suppose we suspect that sales in March are influenced by price in March and the advertising we do in February.
- Our regression specification would be of the form
  Sales(t) = b0 + b1 X1(t) + b2 X2(t-1)
  where X1 = price and X2 = advertising dollars
39Inserting data for a lagged specification
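The slide images showing the data layout are not preserved. As a minimal sketch of the idea, the Python below shifts the advertising column down one period so that each month's sales line up with the previous month's advertising. All figures and the helper name `lag` are hypothetical.

```python
def lag(values, periods=1):
    """Shift a series by `periods` (>= 1); early entries have no lagged value."""
    return [None] * periods + values[:-periods]

# Hypothetical monthly data.
months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 150, 140, 160]
price = [9.5, 9.5, 9.0, 9.2, 8.9]
advertising = [10, 14, 12, 15, 11]

adv_lag1 = lag(advertising)  # February advertising now sits on the March row

# Keep only rows where the lagged value exists; the first month is lost.
rows = [
    (m, s, p, a)
    for m, s, p, a in zip(months, sales, price, adv_lag1)
    if a is not None
]
for row in rows:
    print(row)
```

Each lag costs one observation at the start of the sample, so n shrinks by the number of periods lagged.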
42II. Trend Variables
- Sometimes data have a momentum of their own. GDP or the annual sales at Wal-Mart might be examples.
[Figure: Sales plotted against Time]
48Data for a Trend Variable
Period   Sales   Advertising   Trend Variable
1Q 01    34.9    3.0           1
2Q 01    38.0    3.1           2
3Q 01    37.1    2.9           3
4Q 01    38.0    3.3           4
1Q 02    37.9    2.8           5
2Q 02    38.4    3.2           6
3Q 02    38.0    3.3           7
4Q 02    41.5    3.6           8
1Q 03    40.8    3.3           9
2Q 03    ?                     10
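Building the trend column is just a matter of numbering the periods in order. The Python sketch below uses the nine quarters from the table and also computes the simple least-squares slope of sales on the trend variable as a rough check that sales drift upward over time. The slope calculation is an added illustration, not part of the slides.

```python
periods = ["1Q 01", "2Q 01", "3Q 01", "4Q 01", "1Q 02",
           "2Q 02", "3Q 02", "4Q 02", "1Q 03"]
sales = [34.9, 38.0, 37.1, 38.0, 37.9, 38.4, 38.0, 41.5, 40.8]
advertising = [3.0, 3.1, 2.9, 3.3, 2.8, 3.2, 3.3, 3.6, 3.3]

# Trend variable: 1 for the first period, 2 for the second, and so on.
trend = list(range(1, len(periods) + 1))

# The period to be forecast (2Q 03) takes the next value in the sequence.
next_trend = trend[-1] + 1

# Simple least-squares slope of sales on trend (sales change per quarter).
mean_t = sum(trend) / len(trend)
mean_s = sum(sales) / len(sales)
slope = (
    sum((t - mean_t) * (s - mean_s) for t, s in zip(trend, sales))
    / sum((t - mean_t) ** 2 for t in trend)
)
print(trend, next_trend, round(slope, 3))
```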
49III. Dummy Variables
- Dummy Variables allow us to incorporate qualitative explanatory variables (like type of neighborhood, ethnic group, color of car, season of the year, etc.) into our set of explanatory variables.
- Developed after World War II in an attempt to explain consumer spending
- Consumption = f(National Income, and War or Peace)
- X1 = National Income (a quantitative variable)
- X2 = 1 if War (a qualitative or dummy variable)
       0 if Peace
50Dummy Variables
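[Figure: Consumption (Y) plotted against National Income (X1), showing two lines: the consumption function during Peace (labeled 50.1 on the Y axis) and the consumption function during War (labeled 42.1 on the Y axis)]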
51Example of a data table
Year   Consumption (Y)   X1   X2
1944   62.5              90   1
1945   62.9              91   1
1946   69.0              92   0
1947   78.0              96   0
etc.
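Coding the dummy column can be sketched in Python as a simple mapping from the qualitative state to 1 or 0. The War and Peace labels per year are inferred from the X2 values in the table; the variable names are illustrative.

```python
years = [1944, 1945, 1946, 1947]
consumption = [62.5, 62.9, 69.0, 78.0]   # Y
income = [90, 91, 92, 96]                # X1, quantitative

# Qualitative state for each year, inferred from the X2 column of the table.
state = {1944: "War", 1945: "War", 1946: "Peace", 1947: "Peace"}

# X2 = 1 if War, 0 if Peace.
x2 = [1 if state[year] == "War" else 0 for year in years]

for row in zip(years, consumption, income, x2):
    print(row)
```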
52The Number of Dummy Variables
- For each qualitative factor, the number of dummy variables you need is one less than the number of states of nature. . .
- War or Peace?
- Seasons of the year?
- Whether or not a car is red?
- Neighborhood is exclusive, or above average, or average, or below average, or blighted area.
- Whether or not it is December?
53Try this one
- Suppose I want to do a regression that postulates that the market value of a single-family home is determined by square footage; number of bathrooms; whether or not the home is in the Ogden City School District; and whether the home is a tract home, a semi-custom home, or a one-of-a-kind home. How many dummy variables would I need in specifying the regression equation?
54Try this one
- Answer: 3 (one for the school district, which has two states, plus two for the home type, which has three states)
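The one-less-than-the-number-of-states rule can be sketched in Python. The helper names and the choice of baseline state (the first one listed) are assumptions for illustration; any state can serve as the baseline.

```python
def dummies_needed(n_states):
    """One dummy fewer than the number of states of nature."""
    return n_states - 1

def encode(value, states):
    """Drop-first coding: the first state is the baseline (all zeros)."""
    return [1 if value == s else 0 for s in states[1:]]

# Qualitative factors in the home-value example.
factors = {
    "school_district": ["outside Ogden", "in Ogden"],
    "home_type": ["tract", "semi-custom", "one-of-a-kind"],
}

total = sum(dummies_needed(len(states)) for states in factors.values())
print(total)

print(encode("tract", factors["home_type"]))          # baseline
print(encode("semi-custom", factors["home_type"]))
print(encode("one-of-a-kind", factors["home_type"]))
```

Square footage and number of bathrooms are quantitative, so they enter the equation directly and need no dummies.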
55A regression problem with dummy variables
- A study for a chain of 16 restaurants examined the relationship between restaurant sales (in thousands) during a recent period (Y) and number of households (in thousands) in the restaurant's trading area (X1) and location of restaurant (near Interstate, in shopping mall, on city street).
- X1 = number of households
- X2 = 1 if shopping mall location, 0 otherwise
- X3 = 1 if city street location, 0 otherwise
What does this equation reduce to if the restaurant's location is a shopping mall?
56A regression problem with dummy variables
This is the estimating equation if the restaurant's location is a shopping mall
57A regression problem with dummy variables
What does this estimating equation reduce to if the restaurant's location is near an Interstate?
58A regression problem with dummy variables
This is the estimating equation for sales if the restaurant's location is near an Interstate
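The fitted coefficients from the slides are not preserved in the text, so the Python sketch below uses hypothetical values to show the mechanics. With a model of the form Y = b0 + b1 X1 + b2 X2 + b3 X3, plugging in the dummy values folds b2 or b3 into the intercept and leaves the slope on X1 unchanged.

```python
def reduce_equation(b0, b1, b2, b3, location):
    """Return (intercept, slope on X1) once the location dummies are plugged in."""
    x2 = 1 if location == "shopping mall" else 0
    x3 = 1 if location == "city street" else 0
    intercept = b0 + b2 * x2 + b3 * x3
    return intercept, b1

# Hypothetical coefficients, NOT the values from the study.
B0, B1, B2, B3 = 10.0, 0.5, 4.0, -2.0

for place in ("near Interstate", "shopping mall", "city street"):
    intercept, slope = reduce_equation(B0, B1, B2, B3, place)
    print(f"{place}: Y = {intercept} + {slope} X1")
```

Near the Interstate both dummies are 0, so that location is the baseline and its equation is simply Y = b0 + b1 X1.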
59Stepwise Regression
- MINITAB has a controversial capability called
Stepwise regression wherein the computer will
select the best set of explanatory variables
for you.
60Summary
- Multiple regression: an analysis technique that uses several independent variables to estimate or explain the value of a single dependent variable.
- Variables are retained or dropped on the basis of individual t tests (and the associated P values)
- The sample coefficient of multiple determination, R2, is the most important measure of the overall strength of association among the variables.
- Assumptions underlying multiple regression analysis must be at least roughly valid for the model to be reliable in predicting (or explaining movement in) the dependent variable.
61Final Course Comment . . .
The P-Value concept is critically important in statistical analysis. When you read research literature involving statistical analysis in any discipline, low P-Values are consistent with what the researcher is trying to demonstrate.
Hypothetical practice exercise: You read a scientific paper that is attempting to demonstrate that rubbing olive oil on a man's head daily prevents baldness. If the reported P-value is .031 (on a study of 4000 men over a 20 year period), how would you interpret the author's results?
62A farewell thought . . .