Additional topics in Regression including Multiple Regression

Slides: 63

Provided by: lauriea3

Learn more at: http://faculty.weber.edu

Transcript and Presenter's Notes

1
Part VI

Additional topics in Regression including
Multiple Regression

Dr. Stephen H. Russell
Weber State University
2
Statistical outliers

Suppose your sample data looked like this:

[Figure: scatter plot of Y against X. One observation
appears inconsistent with the rest of the sample data.]
3
Statistical outliers

Three possible explanations
Error in data collection
Relationship between X and Y is not strong
Relationship between X and Y is non-linear

Statistical outliers should be taken out of the
data sample only if there was obviously an error
in taking that observation
4
Statistical outliers

MINITAB will highlight statistical outliers for
you. This MINITAB reminder invites you to
reassess your data and your model.
MINITAB flags an observation as unusual if its
standardized residual is larger than 2 in absolute
value, that is, if the observation lies more than
about two standard deviations away from the fitted
regression line.
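
The two-standard-deviation rule can be sketched in a few lines of Python. This is a simplified illustration, not MINITAB's exact calculation: it divides each residual by the standard error of the estimate and ignores the leverage adjustment that a full standardized residual includes. The data and the function name `flag_unusual` are hypothetical.

```python
import math

def flag_unusual(x, y, threshold=2.0):
    """Fit a simple least-squares line and return the indices of points
    whose residual exceeds `threshold` standard errors of the estimate."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Standard error of the estimate: sqrt(SSE / (n - 2)) with one predictor.
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return [i for i, e in enumerate(residuals) if abs(e / s) > threshold]

# Hypothetical data: y = 2x, except for one inconsistent observation at x = 6.
x = list(range(1, 11))
y = [2 * xi for xi in x]
y[5] += 10

unusual = flag_unusual(x, y)
print(unusual)  # indices of the flagged observations
```

A flagged point is an invitation to recheck the data and the model, not an automatic reason to delete the observation.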

5
Two types of data

Time series
Example: Crime rates in the United States, year
by year
Cross sectional
Example: Crime rates at a particular point in
time for New York City, Detroit, Kansas City,
Reno, Phoenix, Fresno, Montgomery, Louisville,
etc.

Cross sectional studies are generally preferred
over time series because of the risk of
autocorrelation in time series data.
6
Multiple Regression Analysis

Several independent variables are used to
estimate the value of an unknown dependent
variable
Each predictor variable explains part of the
total variation of the dependent variable
Intent is to increase R² and get a solid,
statistically valid relationship between Y and a
whole set of pertinent explanatory variables.

Number of explanatory variables is k
7
The Global Test F test

We test the ability of the entire set of
independent variables in multiple regression to
explain the behavior of the dependent variable.
Let's assume four independent variables
The hypotheses are
$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$
$H_A$: Not all the $\beta$s are 0
These hypotheses are assessed on the basis of an
F Test calculation
The global test asks, "Could the amount of
explained variation, R², occur by chance?"

8
The Global Test F test
This test requires the employment of an ANOVA
Table based upon regression

SOURCE        DF    SS     MS      F
Regression     4    10    2.50   10.0
Error         20     5    0.25
Total         24    15
9
The Degrees of Freedom Concept

Degrees of Freedom refers to the number of
elements that are free to take on any value
Any Total has n - 1 degrees of freedom
Since the number of explanatory variables is
freely chosen, the regression degrees of freedom
is set when k is chosen (it equals k).
Error degrees of freedom, by default, is
n - 1 - k

10
The Global Test F test
This test requires the employment of an ANOVA
Table based upon regression

SOURCE        DF    SS     MS      F
Regression     4    10    2.50   10.0
Error         20     5    0.25
Total         24    15

How many explanatory variables comprise this
model?
What is sample size here?
11
The Global Test F Test

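The Global Test of multiple explanatory variables
is a ratio of two variances, the two mean squares
in the ANOVA table:

$F = \frac{MS_{Regression}}{MS_{Error}} = \frac{SS_{Regression} / k}{SS_{Error} / (n - 1 - k)}$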

12
ANOVA table for Regression
SOURCE        DF    SS     MS      F
Regression     4    10    2.50   10.0
Error         20     5    0.25
Total         24    15

Reminder:
Regression degrees of freedom is k (number of
explanatory variables)
Error degrees of freedom is n - 1 - k
Total degrees of freedom is n - 1

13
ANOVA table for Regression
SOURCE        DF    SS     MS      F
Regression     4    10    2.50   10.0
Error         20     5    0.25
Total         24    15

Fill in an ANOVA table like the above for this
problem: A researcher tests a model with 5
explanatory variables and 51 observations.
Results are SST = 215 and SSE = 180.

14
ANOVA table for Regression
ANSWER

SOURCE        DF    SS     MS      F
Regression     5    35     7     1.75
Error         45   180     4
Total         50   215

Fill in an ANOVA table like the above for this
problem: A researcher tests a model with 5
explanatory variables and 51 observations.
Results are SST = 215 and SSE = 180.
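
The arithmetic behind this answer can be checked with a short Python sketch. The function name `regression_anova` and the dictionary layout are illustrative choices, not MINITAB output; the inputs are the k, n, SST, and SSE given in the problem.

```python
def regression_anova(k, n, sst, sse):
    """Build the regression ANOVA table from k explanatory variables,
    n observations, total sum of squares, and error sum of squares."""
    ssr = sst - sse                 # regression (explained) sum of squares
    df_reg = k                      # regression degrees of freedom
    df_err = n - 1 - k              # error degrees of freedom
    df_tot = n - 1                  # total degrees of freedom
    msr = ssr / df_reg              # mean square, regression
    mse = sse / df_err              # mean square, error
    return {
        "Regression": {"DF": df_reg, "SS": ssr, "MS": msr},
        "Error": {"DF": df_err, "SS": sse, "MS": mse},
        "Total": {"DF": df_tot, "SS": sst},
        "F": msr / mse,             # ratio of the two variances
    }

table = regression_anova(k=5, n=51, sst=215, sse=180)
for source in ("Regression", "Error", "Total"):
    print(source, table[source])
print("F =", table["F"])
```

Calling the same function with k = 4, n = 25, SST = 15, and SSE = 5 gives the table used on the earlier slides.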

15
The Global Test F Test

The FCalc is compared to FCrit to make a
decision on the hypotheses
$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$
$H_A$: Not all the $\beta$s are 0
Recall that the F Test is always a ratio of two
variances, with dfn and dfd.
In the regression case, dfn is k
and dfd is n - 1 - k
Hence FCrit is of the form $F_\alpha(k,\ n-1-k)$

16
A MINITAB example
17
Another Way to Calculate F
18
Another Way to Calculate F
19
Another Way to Calculate F
Notice
20
Another Way to Calculate F

Therefore
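
The derivation on these slides is not preserved in this transcript. One standard identity, which may or may not be the route the slides took, expresses F directly in terms of R². Since SSR = R² · SST and SSE = (1 - R²) · SST, the ratio of mean squares becomes

$F = \frac{R^2 / k}{(1 - R^2) / (n - 1 - k)}$

A minimal Python check against the ANOVA table used earlier (SSR = 10, SST = 15, k = 4, n = 25); the function name `f_from_r2` is hypothetical:

```python
def f_from_r2(r2, k, n):
    """Global F statistic computed from R-squared, the number of
    explanatory variables k, and the sample size n."""
    return (r2 / k) / ((1 - r2) / (n - 1 - k))

r2 = 10 / 15                       # SSR / SST from the earlier ANOVA table
f_value = f_from_r2(r2, k=4, n=25)
print(round(f_value, 2))
```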

21
Multiple Regression Analysis
In addition to the Global F Test, we do a t
test on the coefficient of each explanatory
variable:
$H_0: \beta_1 = 0$, $H_A: \beta_1 \neq 0$
$H_0: \beta_2 = 0$, $H_A: \beta_2 \neq 0$
Etc.
We keep or throw out potential explanatory
variables on the basis of the P-values associated
with each t test
22
Example
Suppose the manager of an automotive parts
distributor that serves the market of Virginia,
North Carolina, and South Carolina wants to
estimate total annual sales each year. Four
independent variables were recommended to him by
his marketing department: the number of retail
outlets in the region that stock this company's
parts (X1), the number of automobiles registered
in this region (X2), the average age of
automobiles registered (X3), and the commission
rate paid to company salespeople (X4).
23
Example
Predictor    Coef    StDev        T        P
Constant
NumberOut         1427.00      9.41    0.000
Automobil           24.90      2.60    0.001
AverageAg           19.80      0.49    0.883
Commissio            4.33      1.89    0.078

Which explanatory variables are statistically
significant?
24
Example
Specification errors:
Including variables that are not statistically
significant.
Excluding variables that ought to be included.
25
Example
We need to run this regression again with at
least X3 excluded.
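
The keep-or-drop decision can be sketched as a simple screen on the reported P-values. The 0.05 cutoff is an assumption (the slides do not state a significance level), and the function name `split_by_significance` is hypothetical; the P-values are the ones in the output above.

```python
# P-values from the regression output, keyed by MINITAB's truncated names.
p_values = {
    "NumberOut": 0.000,   # X1
    "Automobil": 0.001,   # X2
    "AverageAg": 0.883,   # X3
    "Commissio": 0.078,   # X4
}

def split_by_significance(p_values, alpha=0.05):
    """Separate predictors into those significant at level alpha
    and those that are not."""
    keep = [name for name, p in p_values.items() if p < alpha]
    drop = [name for name, p in p_values.items() if p >= alpha]
    return keep, drop

keep, drop = split_by_significance(p_values)
print("keep:", keep)
print("candidates to drop:", drop)
```

With a looser cutoff such as 0.10, Commissio (X4) would be retained, which is why only X3 is a clear candidate for exclusion.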
26
The Issue of Multicollinearity
Multicollinearity occurs when some of the
independent variables tend to move together
i.e., they are correlated with each other.
With multicollinearity we can no longer give
meaning to the b coefficients!
27
Consider this example problem

I want to explain earnings in the accounting
profession. I suspect two good explanatory
variables are age and years of experience. I
collect sample information and build these
estimating equations:
First try
Salary = 13.0 + 0.912 Age
R² = 72.6%    P-Value = .002
Second try
Salary = 35.8 + 0.878 Yrs Exper
R² = 77.3%    P-Value = .001
Third try
Salary = 36.1 + 0.014 Age + 0.890 Yrs Exper
R² = 77.3%
P-Value (Age) = .986    P-Value (Yrs Exper) = .269

28
Consider this example problem

The problem here is that we have
multicollinearity!
29
Multicollinearity

The two classic signs of multicollinearity:
High R² with low t values (that is, P-values that
are not low)
Wrong signs on one or more coefficients
Implications:
Cannot give interpretations to the coefficients
of the individual explanatory variables; we can't
disentangle the separate influences on Y.
Have to ignore the results of the t tests (the P
values).

30
Identifying Multicollinearity

The sure way of assessing whether or not you are
employing predictor variables that are correlated
to each other is with a correlation matrix of all
predictor variables
The MTB command is
MTB > correlation c2 c3 c4 c5

Note: If you identify multicollinearity, you
must decide whether to live with it or drop one
of the correlated variables from your model.
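
Outside MINITAB, the same check can be sketched in plain Python. The predictor data below are hypothetical (made up so that age and years of experience move together), and the 0.8 cutoff for calling a pair "correlated" is an assumption, not a rule from the slides.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def correlation_matrix(columns):
    """Correlation of every predictor with every other predictor."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names}
            for r in names}

def correlated_pairs(matrix, cutoff=0.8):
    """Pairs of predictors whose absolute correlation exceeds the cutoff."""
    names = list(matrix)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(matrix[a][b]) > cutoff]

# Hypothetical predictor columns.
predictors = {
    "Age": [25, 30, 35, 40, 45, 50],
    "YrsExper": [2, 7, 11, 17, 21, 27],
    "Clients": [3, 1, 2, 3, 1, 2],
}

matrix = correlation_matrix(predictors)
pairs = correlated_pairs(matrix)
print(pairs)
```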
31
A clarifying reminder

If the residuals (error terms) are correlated,
you have autocorrelation, a serious violation of
assumptions.
If two or more of the explanatory variables are
correlated, you have multicollinearity, which
makes it impossible to interpret the various
b's.

32
Causes of wrong signs

Multicollinearity
Specification error (important explanatory
variable or variables left out)
Non-representative sample
Sample too small
Biased sample

33
Adjusted R²

r is the statistic that estimates the parameter
ρ
R² is the statistic that estimates the
parameter ρ² (the true coefficient of
determination)
SSR/SST is a biased estimator of ρ²
R² (adjusted) is the unbiased estimator
MINITAB shows both R² and R² (adj)

34
R² (Adjusted)

The adjustment to eliminate bias is

$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - 1 - k}$

where n = sample size and k = number of
explanatory variables
Adjusted R² is noticeably smaller than R² only
when k is large in comparison to n. For example,
compare k = 2 and n = 100 with k = 4 and n = 15.
35
R² (Adjusted)

With k = 2 and n = 100, the adjustment factor
(n - 1)/(n - 1 - k) = 99/97 is small (only 1.02).
36
R² (Adjusted)

With k = 4 and n = 15, the adjustment factor
(n - 1)/(n - 1 - k) = 14/10 is substantial (1.40)!
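
The two adjustment factors quoted above, 1.02 and 1.40, can be reproduced with a short Python sketch. The R² value of 0.80 used at the end is a hypothetical number chosen only to show the size of the effect; the function names are illustrative.

```python
def adjustment_factor(n, k):
    """Factor (n - 1) / (n - 1 - k) that scales the unexplained share."""
    return (n - 1) / (n - 1 - k)

def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - 1 - k)."""
    return 1 - (1 - r2) * adjustment_factor(n, k)

small = adjustment_factor(n=100, k=2)   # 99 / 97
large = adjustment_factor(n=15, k=4)    # 14 / 10

print(round(small, 2), round(large, 2))

# Hypothetical R^2 of 0.80 under each sample design.
print(round(adjusted_r2(0.80, n=100, k=2), 3))
print(round(adjusted_r2(0.80, n=15, k=4), 3))
```

With many observations per explanatory variable the adjustment barely moves R²; with few, it removes a noticeable share.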
37
Three nifty model-building techniques

Lagged independent variable
Trend variables
Dummy variables for qualitative data

38
I. Lagged independent variable

Suppose we suspect that sales in March are
influenced by price in March and the advertising
we do in February.
Our regression specification would be of the form

$\hat{Y}_t = a + b_1 X_{1,t} + b_2 X_{2,t-1}$

where X1 = price and X2 = advertising dollars
39
Inserting data for a lagged specification
40
Inserting data for a lagged specification
41
Inserting data for a lagged specification
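
The data tables from these slides are not preserved in the transcript. As a sketch of the idea, the Python below shifts the advertising column down one period so that each month's sales line up with the previous month's advertising. All numbers and the helper name `lag` are hypothetical.

```python
def lag(series, periods=1):
    """Shift a series down by `periods`, padding the start with None."""
    return [None] * periods + series[:len(series) - periods]

# Hypothetical monthly data.
months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 128, 140, 150]
price = [9.9, 9.5, 9.7, 9.4, 9.2]
advertising = [10, 12, 11, 13, 15]

adv_lag1 = lag(advertising)   # advertising from the previous month

# Keep only rows where the lagged value exists; the first month is lost.
rows = [(m, s, p, a)
        for m, s, p, a in zip(months, sales, price, adv_lag1)
        if a is not None]

for row in rows:
    print(row)   # (month, sales, price this month, advertising last month)
```

Each lag costs one observation at the start of the sample, which matters when n is small.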
42
II. Trend Variables

Sometimes data have a momentum of their own; GDP
or the annual sales at Wal-Mart might be examples.

[Figure: Sales plotted against Time]
43
II. Trend Variables

44
Trend Variables

[Figure: Sales plotted against Time]
45
Data for a Trend Variable

Period   Sales   Advertising   Trend Variable
1Q 01    34.9    3.0           1
2Q 01    38.0    3.1           2
3Q 01    37.1    2.9           3
4Q 01    38.0    3.3           4
1Q 02    37.9    2.8           5
2Q 02    38.4    3.2           6
3Q 02    38.0    3.3           7
4Q 02    41.5    3.6           8
1Q 03    40.8    3.3           9
2Q 03

46
Data for a Trend Variable


47
Data for a Trend Variable


48
Data for a Trend Variable

Period   Sales   Advertising   Trend Variable
1Q 01    34.9    3.0           1
2Q 01    38.0    3.1           2
3Q 01    37.1    2.9           3
4Q 01    38.0    3.3           4
1Q 02    37.9    2.8           5
2Q 02    38.4    3.2           6
3Q 02    38.0    3.3           7
4Q 02    41.5    3.6           8
1Q 03    40.8    3.3           9
2Q 03    ?                     10
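
Building the trend column is just a matter of numbering the periods. The sketch below uses the data from the table above; the variable names are illustrative, and the step of fitting the regression itself is left out.

```python
periods = ["1Q 01", "2Q 01", "3Q 01", "4Q 01",
           "1Q 02", "2Q 02", "3Q 02", "4Q 02", "1Q 03"]
sales = [34.9, 38.0, 37.1, 38.0, 37.9, 38.4, 38.0, 41.5, 40.8]
advertising = [3.0, 3.1, 2.9, 3.3, 2.8, 3.2, 3.3, 3.6, 3.3]

# The trend variable simply counts periods: 1, 2, 3, ...
trend = list(range(1, len(periods) + 1))

# Explanatory-variable rows (advertising, trend) for the regression.
design_rows = list(zip(advertising, trend))

# To forecast the next quarter (2Q 03), the trend value is one more
# than the last observed value.
next_trend = trend[-1] + 1

print(design_rows[0], design_rows[-1], next_trend)
```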

49
III. Dummy Variables

Dummy Variables allow us to incorporate
qualitative explanatory variables (like type of
neighborhood, ethnic group, color of car,
season of the year, etc.) into our set of
explanatory variables.
Developed after World War II in an attempt to
explain consumer spending
Consumption = f(National Income, War or Peace)
X1 = National Income (a quantitative variable)
X2 = 1 if War, 0 if Peace (a qualitative or
dummy variable)

50
Dummy Variables

[Figure: Y plotted against X1, showing two
consumption functions: one during Peace (Y-axis
label 50.1) and one during War (Y-axis label 42.1).]

51
Example of a data table

Year   Consumption (Y)   X1   X2
1944   62.5              90   1
1945   62.9              91   1
1946   69.0              92   0
1947   78.0              96   0
etc.

52
The Number of Dummy Variables

For each qualitative factor, the number of dummy
variables you need is one less than the number of
states of nature. How many for each of these?
War or Peace?
Seasons of the year?
Whether or not a car is red?
Neighborhood is exclusive, or above average,
or average, or below average, or blighted
area?
Whether or not it is December?
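
The "one less than the number of states" rule can be sketched as follows. Treating the first listed state as the baseline (all dummies equal to 0) is an arbitrary convention chosen for this illustration, and the function names are hypothetical.

```python
def dummies_needed(states):
    """Number of dummy variables for a qualitative factor."""
    return len(states) - 1

def encode(value, states):
    """Encode one observation as 0/1 dummies.
    The first state is the baseline and is coded as all zeros."""
    if value not in states:
        raise ValueError(f"unknown state: {value!r}")
    return [1 if value == s else 0 for s in states[1:]]

war_or_peace = ["Peace", "War"]
seasons = ["Winter", "Spring", "Summer", "Fall"]
neighborhood = ["exclusive", "above average", "average",
                "below average", "blighted"]

print(dummies_needed(war_or_peace))   # two states
print(dummies_needed(seasons))        # four states
print(dummies_needed(neighborhood))   # five states
print(encode("Summer", seasons))
```

Using one dummy per state instead would make the dummies sum to 1 in every row, duplicating the constant term.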

53
Try this one

Suppose I want to do a regression that postulates
that the market value of a single-family home is
determined by square footage; number of
bathrooms; whether or not the home is in the
Ogden City School District; and whether the home
is a tract home, a semi-custom home, or a
one-of-a-kind home. How many dummy variables
would I need in specifying the regression equation?

54
Try this one

Answer: 3 (one for the school district, and two
for the three types of home)

55
A regression problem with dummy variables

A study for a chain of 16 restaurants examined
the relationship between restaurant sales in
thousands (Y) and number of households (in
thousands) in the restaurant's trading area (X1)
and location of restaurant (near Interstate, in
shopping mall, on city street).
X1 = number of households
X2 = 1 if shopping mall location, 0 otherwise
X3 = 1 if city street location, 0 otherwise

What does this equation reduce to if the
restaurant's location is a shopping mall?
56
A regression problem with dummy variables

A study for a chain of 16 restaurants examined
the relationship between restaurant sales (in
thousands) during a recent period (Y) and number
of households (in thousands) in the restaurant's
trading area (X1) and location of restaurant
(near Interstate, in shopping mall, on city
street).
X1 = number of households
X2 = 1 if shopping mall location, 0 otherwise
X3 = 1 if city street location, 0 otherwise

This is the estimating equation if the
restaurant's location is a shopping mall
57
A regression problem with dummy variables

What does this estimating equation reduce to if
the restaurant's location is near an Interstate?
58
A regression problem with dummy variables

This is the estimating equation for sales if the
restaurant's location is near an Interstate
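
The fitted equation from these slides is not preserved in the transcript, so the sketch below uses hypothetical coefficients (b0, b1, b2, b3) purely to show how the equation collapses for each location. With X2 and X3 as defined above, "near Interstate" is the baseline where both dummies are 0.

```python
def reduced_equation(b0, b1, b2, b3, location):
    """Return (intercept, slope on X1) of Y = b0 + b1*X1 + b2*X2 + b3*X3
    once the dummies X2 and X3 are set for the given location."""
    locations = ("near Interstate", "shopping mall", "city street")
    if location not in locations:
        raise ValueError(f"unknown location: {location!r}")
    x2 = 1 if location == "shopping mall" else 0
    x3 = 1 if location == "city street" else 0
    intercept = b0 + b2 * x2 + b3 * x3   # dummies only shift the intercept
    return intercept, b1

# Hypothetical coefficients, NOT the values from the study.
b0, b1, b2, b3 = 10.0, 2.0, 5.0, -3.0

for place in ("near Interstate", "shopping mall", "city street"):
    intercept, slope = reduced_equation(b0, b1, b2, b3, place)
    print(f"{place}: Y = {intercept} + {slope} X1")
```

The slope on X1 is the same in all three cases; the location dummies only raise or lower the line.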
59
Stepwise Regression

MINITAB has a controversial capability called
Stepwise regression wherein the computer will
select the best set of explanatory variables
for you.

60
Summary

Multiple regression analysis is a technique that
uses several independent variables to estimate or
explain the value of a single dependent variable.
Variables are retained or dropped on the basis of
individual t tests (and the associated P values).
The sample coefficient of multiple determination,
R², is the most important measure of the overall
strength of association among the variables.
Assumptions underlying multiple regression
analysis must be at least roughly valid for the
model to be reliable in predicting (or explaining
movement in) the dependent variable.

61
Final Course Comment . . .
The P-Value concept is critically important in
statistical analysis. When you read research
literature involving statistical analysis in any
discipline, low P-Values are consistent with what
the researcher is trying to demonstrate.
Hypothetical practice exercise: You read a
scientific paper that is attempting to demonstrate
that rubbing olive oil on a man's head daily
prevents baldness. If the reported P-value is .031
(on a study of 4000 men over a 20 year period), how
would you interpret the author's results?
62
A farewell thought . . .
