Additional topics in Regression including Multiple Regression

Provided by: lauriea3
Learn more at: http://faculty.weber.edu
1
Part VI
  • Additional topics in Regression including
    Multiple Regression

Dr. Stephen H. Russell
Weber State University
2
Statistical outliers
  • Suppose your sample data looked like this:









(Scatter plot of Y against X: one observation appears inconsistent with
the rest of the sample data.)
3
Statistical outliers
  • Three possible explanations
  • Error in data collection
  • Relationship between X and Y is not strong
  • Relationship between X and Y is non-linear

Statistical outliers should be taken out of the
data sample only if there was obviously an error
in taking that observation
4
Statistical outliers
  • MINITAB will highlight statistical outliers for
    you. This MINITAB reminder invites you to
    reassess your data and your model.
  • MINITAB defines an observation as unusual if
    the observation is more than two standard
    deviations away from the fitted regression line.

5
Two types of data
  • Time series
  • Example: Crime rates in the United States year
    by year
  • Cross-sectional
  • Example: Crime rates at a particular point in
    time for New York City, Detroit, Kansas City,
    Reno, Phoenix, Fresno, Montgomery, Louisville,
    etc.

Cross-sectional studies are generally preferred
over time-series studies because of the risk of
autocorrelation in time-series data.
6
Multiple Regression Analysis
  • Several independent variables are used to
    estimate the value of an unknown dependent
    variable
  • Each predictor variable explains part of the
    total variation of the dependent variable
  • The intent is to increase R² and get a solid,
    statistically valid relationship between Y and a
    whole set of pertinent explanatory variables.


Ŷ = b₀ + b₁X₁ + b₂X₂ + … + bₖXₖ

Number of explanatory variables is k
7
The Global Test F test
  • We test the ability of the entire set of
    independent variables in multiple regression to
    explain the behavior of the dependent variable.
  • Let's assume four independent variables
  • The hypotheses are H₀: β₁ = β₂ = β₃ = β₄ = 0; Hₐ:
    Not all the βs = 0
  • These hypotheses are assessed on the basis of an
    F-test calculation
  • The global test asks: Could the amount of
    explained variation, R², occur by chance?

8
The Global Test F test
This test requires the employment of an ANOVA
table based upon the regression:

SOURCE       DF   SS    MS     F
Regression    4   10   2.50  10.0
Error        20    5   0.25
Total        24   15
9
The Degrees of Freedom Concept
  • Degrees of freedom refers to the number of
    elements that are free to take on any value
  • Any total has n - 1 degrees of freedom
  • Since the number of explanatory variables is
    totally free, the regression degrees of freedom
    is chosen when k is chosen.
  • Error degrees of freedom, by default, is
    n - 1 - k

10
The Global Test F test
This test requires the employment of an ANOVA
table based upon the regression:

SOURCE       DF   SS    MS     F
Regression    4   10   2.50  10.0
Error        20    5   0.25
Total        24   15
How many explanatory variables comprise this
model?
What is sample size here?
11
The Global Test F Test
  • The Global Test of multiple explanatory variables
    is a ratio of two variances:

    F = MSR / MSE
12
ANOVA table for Regression
SOURCE       DF   SS    MS     F
Regression    4   10   2.50  10.0
Error        20    5   0.25
Total        24   15

  • Reminder: Regression degrees of freedom
    is k (number of explanatory variables);
    Error degrees of freedom is n - 1 - k;
    Total degrees of freedom is n - 1

13
ANOVA table for Regression
SOURCE       DF   SS    MS     F
Regression    4   10   2.50  10.0
Error        20    5   0.25
Total        24   15

  • Fill in an ANOVA table like the above for this
    problem: A researcher tests a model with 5
    explanatory variables and 51 observations.
    Results are SST = 215 and SSE = 180.

14
ANOVA table for Regression
ANSWER

SOURCE       DF   SS    MS     F
Regression    5   35   7.00  1.75
Error        45  180   4.00
Total        50  215

  • Fill in an ANOVA table like the above for this
    problem: A researcher tests a model with 5
    explanatory variables and 51 observations.
    Results are SST = 215 and SSE = 180.
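As a check, the table arithmetic can be reproduced in a few lines of Python (a sketch; the variable names are mine):

```python
# Reconstruct the ANOVA table from k, n, SST, and SSE.
k, n = 5, 51                      # explanatory variables, observations
SST, SSE = 215.0, 180.0
SSR = SST - SSE                   # regression sum of squares
df_reg, df_err, df_tot = k, n - 1 - k, n - 1
MSR, MSE = SSR / df_reg, SSE / df_err
F = MSR / MSE
print(df_reg, df_err, df_tot)     # 5 45 50
print(SSR, MSR, MSE, F)           # 35.0 7.0 4.0 1.75
```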

15
The Global Test F Test
  • The FCalc is compared to FCrit to make a
    decision on the hypotheses H₀: β₁ = β₂ = β₃ = β₄ = 0;
    Hₐ: Not all the βs = 0
  • Recall that the F test is always a ratio of two
    variances, with dfn and dfd.
  • In the regression case, dfn is k
    and dfd is n - 1 - k
  • Hence FCrit is of the form Fα(k, n - 1 - k)
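For the earlier ANOVA table (k = 4, n = 25), the decision rule can be sketched as follows; the critical value 2.87 is F₀.₀₅(4, 20) read from a standard F table:

```python
# Global F test decision: compare FCalc with FCrit at alpha = 0.05.
k, n = 4, 25
F_calc = 10.0                 # MSR / MSE from the ANOVA table
F_crit = 2.87                 # F_0.05(dfn = 4, dfd = 20), from an F table
reject_H0 = F_calc > F_crit   # True -> at least one beta differs from 0
print(reject_H0)  # True
```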

16
A MINITAB example
17
Another Way to Calculate F
18
Another Way to Calculate F
19
Another Way to Calculate F
Notice
20
Another Way to Calculate F
  • Therefore:

    F = (R² / k) / ((1 - R²) / (n - 1 - k))
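With the ANOVA table's numbers (SSR = 10, SST = 15, k = 4, n = 25), the R²-based formula gives the same F as MSR/MSE; a quick check:

```python
# Compute F from R² alone and confirm it matches MSR/MSE = 10.0.
k, n = 4, 25
SSR, SST = 10.0, 15.0
R2 = SSR / SST                                    # 2/3
F_from_R2 = (R2 / k) / ((1 - R2) / (n - 1 - k))
print(round(F_from_R2, 6))  # 10.0
```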

21
Multiple Regression Analysis
In addition to the Global F Test, we do a t
test on the coefficients of each explanatory
variable: H₀: β₁ = 0  Hₐ: β₁ ≠ 0
H₀: β₂ = 0  Hₐ: β₂ ≠ 0
We keep or throw out potential explanatory
variables on the basis of the P-values associated
with each t test
Etc.
22
Example
Suppose the manager of an automotive parts
distributor that serves the market of Virginia,
North Carolina, and South Carolina wants to
estimate total annual sales each year. Four
independent variables were recommended to him by
his marketing department: number of retail
outlets in the region that stock this company's
parts (X1), the number of automobiles registered
in this region (X2), the average age of
automobiles registered (X3), and the commission
rate paid to company salespeople (X4).
23
Example
Suppose the manager of an automotive parts
distributor which serves the market of Virginia,
North Carolina, and South Carolina wants to
estimate total annual sales each year. Four
independent variables were recommended to him by
his marketing department Number of retail
outlets in the region who stock this companys
parts (X1), the number of automobiles registered
in this region (X2), the average age of
automobiles registered (X3), and the commission
rate paid to company sales people (X4).
Predictor    Coef    StDev     T      P
Constant
NumberOut  1427.00          9.41  0.000
Automobil    24.90          2.60  0.001
AverageAg    19.80          0.49  0.883
Commissio     4.33          1.89  0.078

Which explanatory variables are statistically
significant?
24
Example
Specification Errors:
  • Including variables that are not statistically
    significant.
  • Excluding variables that ought to be included.

Predictor    Coef    StDev     T      P
Constant
NumberOut  1427.00          9.41  0.000
Automobil    24.90          2.60  0.001
AverageAg    19.80          0.49  0.883
Commissio     4.33          1.89  0.078

Which explanatory variables are statistically
significant?
25
Example
Specification Errors:
  • Including variables that are not statistically
    significant.
  • Excluding variables that ought to be included.

Predictor    Coef    StDev     T      P
Constant
NumberOut  1427.00          9.41  0.000
Automobil    24.90          2.60  0.001
AverageAg    19.80          0.49  0.883
Commissio     4.33          1.89  0.078

We need to run this regression again with at
least X3 excluded.
26
The Issue of Multicollinearity
Multicollinearity occurs when some of the
independent variables tend to move together,
i.e., they are correlated with each other.
With multicollinearity we can no longer give
meaning to the b coefficients!
27
Consider this example problem
  • I want to explain earnings in the accounting
    profession. I suspect two good explanatory
    variables are age and years of experience. I
    collect sample information and build these
    estimating equations:
  • First try:
    Salary = 13.0 + 0.912 Age                    R² = 72.6%
                    P-Value = .002
  • Second try:
    Salary = 35.8 + 0.878 Yrs Exper              R² = 77.3%
                    P-Value = .001
  • Third try:
    Salary = 36.1 + 0.014 Age + 0.890 Yrs Exper  R² = 77.3%
                    P-Value = .986  P-Value = .269

28
Consider this example problem
  • I want to explain earnings in the accounting
    profession. I suspect two good explanatory
    variables are age and years of experience. I
    collect sample information and build these
    estimating equations:
  • First try:
    Salary = 13.0 + 0.912 Age                    R² = 72.6%
                    P-Value = .002
  • Second try:
    Salary = 35.8 + 0.878 Yrs Exper              R² = 77.3%
                    P-Value = .001
  • Third try:
    Salary = 36.1 + 0.014 Age + 0.890 Yrs Exper  R² = 77.3%
                    P-Value = .986  P-Value = .269

The problem here is that we have
multicollinearity!
29
Multicollinearity
  • The two classic signs of multicollinearity:
  • High R² with low t values (i.e., P-values that
    are not low)
  • Wrong signs on one or more coefficients
  • Implications:
  • Cannot give interpretations to the coefficients
    of the individual explanatory variables; can't
    disentangle the separate influences on Y.
  • Have to ignore the results of the t tests (the
    P-values).

30
Identifying Multicollinearity
  • The sure way of assessing whether or not you are
    employing predictor variables that are correlated
    with each other is a correlation matrix of all
    predictor variables
  • The MTB command is:
    MTB> correlation c2 c3 c4 c5

Note If you identify multicollinearity, you
must decide whether to live with it or drop one
of the correlated variables from your model.
31
A clarifying reminder
  • If the residuals (error terms) are correlated,
    you have autocorrelation, a serious violation of
    assumptions.
  • If two or more of the explanatory variables are
    correlated, you have multicollinearity, which
    makes it impossible to interpret the various
    b's.

32
Causes of wrong signs
  • Multicollinearity
  • Specification error (important explanatory
    variable or variables left out)
  • Non-representative sample
  • Sample too small
  • Biased sample

33
Adjusted R2
  • r is the statistic that estimates the parameter
    ρ
  • R² is the statistic that estimates the
    parameter ρ² (the true coefficient of
    determination)
  • SSR/SST is a biased estimator of ρ²
  • R² (adjusted) is the unbiased estimator
  • MINITAB shows both R² and R² (adj)

34
R2 (Adjusted)
  • The adjustment to eliminate bias is:

    R²(adj) = 1 - (1 - R²) · (n - 1)/(n - 1 - k)

n = sample size;  k = number of
explanatory variables
Adjusted R² is noticeably smaller than R² only
when k is large in comparison to n. For example:
k = 2 and n = 100 vs. k = 4 and n = 15.
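The adjustment factor (n - 1)/(n - 1 - k) drives the shrinkage; both examples can be checked in Python (the R² of 0.80 below is a made-up illustration):

```python
# Adjusted R² and its adjustment factor (n - 1)/(n - 1 - k).
def adj_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - 1 - k)

print(round((100 - 1) / (100 - 1 - 2), 2))   # 1.02: k=2, n=100 (small)
print(round((15 - 1) / (15 - 1 - 4), 2))     # 1.4:  k=4, n=15 (substantial)
print(round(adj_r2(0.80, 15, 4), 2))         # 0.72: R² = 0.80 shrinks to 0.72
```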
35
R2 (Adjusted)
  • The adjustment to eliminate bias is:

    R²(adj) = 1 - (1 - R²) · (n - 1)/(n - 1 - k)

With k = 2 and n = 100, the adjustment factor
(n - 1)/(n - 1 - k) is small (only 1.02).
n = sample size;  k = number of explanatory variables.
Adjusted R² is noticeably smaller than R² only
when k is large in comparison to n.
36
R2 (Adjusted)
  • The adjustment to eliminate bias is:

    R²(adj) = 1 - (1 - R²) · (n - 1)/(n - 1 - k)

With k = 4 and n = 15, the adjustment factor
(n - 1)/(n - 1 - k) is substantial (1.40)!
n = sample size;  k = number of explanatory variables.
Adjusted R² is noticeably smaller than R² only
when k is large in comparison to n.
37
Three nifty model-building techniques
  • Lagged independent variable
  • Trend variables
  • Dummy variables for qualitative data

38
I. Lagged independent variable
  • Suppose we suspect that sales in March are
    influenced by price in March and the advertising
    we did in February.
  • Our regression specification would be of the form:

    Ŷt = b₀ + b₁X₁,t + b₂X₂,t-1

X1 = price;  X2 = advertising dollars
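Setting up the data is the only wrinkle: each sales row must be paired with the previous period's advertising. A minimal sketch with made-up monthly figures:

```python
# Pair each month's price with the PRIOR month's advertising (lag of 1).
price       = [9.9, 10.1, 10.0, 9.8, 10.2]   # X1, months 1..5 (made up)
advertising = [3.0,  3.1,  2.9,  3.3,  2.8]  # X2, months 1..5 (made up)

# Row t uses advertising[t - 1]; the first month is dropped because it
# has no prior-month advertising value.
rows = [(price[t], advertising[t - 1]) for t in range(1, len(price))]
print(rows[0])    # (10.1, 3.0): month 2's price with month 1's advertising
print(len(rows))  # 4: one observation is lost to the lag
```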
39
Inserting data for a lagged specification
42
II. Trend Variables
  • Sometimes data have a momentum of their own; GDP
    or the annual sales at Wal-Mart might be examples.

(Chart: Sales plotted over Time.)
44
Trend Variables


(Chart: Sales plotted over Time.)
45
Data for a Trend Variable
Period   Sales   Advertising   Trend Variable
1Q 01    34.9    3.0            1
2Q 01    38.0    3.1            2
3Q 01    37.1    2.9            3
4Q 01    38.0    3.3            4
1Q 02    37.9    2.8            5
2Q 02    38.4    3.2            6
3Q 02    38.0    3.3            7
4Q 02    41.5    3.6            8
1Q 03    40.8    3.3            9
2Q 03
48
Data for a Trend Variable
Period   Sales   Advertising   Trend Variable
1Q 01    34.9    3.0            1
2Q 01    38.0    3.1            2
3Q 01    37.1    2.9            3
4Q 01    38.0    3.3            4
1Q 02    37.9    2.8            5
2Q 02    38.4    3.2            6
3Q 02    38.0    3.3            7
4Q 02    41.5    3.6            8
1Q 03    40.8    3.3            9
2Q 03                          10
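The trend variable is nothing more than the observation number, so 2Q 03, the tenth row, gets trend value 10; in code:

```python
# Trend variable = position of the quarter in the sample (1..n).
quarters = ["1Q 01", "2Q 01", "3Q 01", "4Q 01", "1Q 02",
            "2Q 02", "3Q 02", "4Q 02", "1Q 03", "2Q 03"]
trend = {q: i for i, q in enumerate(quarters, start=1)}
print(trend["2Q 03"])  # 10
```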
49
III. Dummy Variables
  • Dummy variables allow us to incorporate
    qualitative explanatory variables (like type of
    neighborhood, ethnic group, color of car,
    season of the year, etc.) into our set of
    explanatory variables.
  • Developed after World War II in an attempt to
    explain consumer spending:
  • Consumption = f(National Income, War or Peace)
  • X1 = National Income (a quantitative variable)
  • X2 = 1 if War (a qualitative, or
    "dummy", variable)
       = 0 if Peace
50
Dummy Variables

(Chart: the consumption function during Peace, with intercept 50.1,
lies above the consumption function during War, with intercept 42.1;
both are plotted against X1.)
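The two parallel lines correspond to one equation whose intercept shifts by the dummy coefficient. A sketch using the chart's intercepts (the slope b1 is a made-up placeholder):

```python
# War/peace dummy: same slope, intercept shifted by the dummy coefficient.
b0 = 50.1           # peace-time intercept (X2 = 0), from the chart
b2 = 42.1 - 50.1    # dummy coefficient: war intercept minus peace = -8.0
b1 = 0.6            # hypothetical slope on national income

def consumption(income, war):
    return b0 + b1 * income + b2 * war

# At any income level, war lowers predicted consumption by exactly 8.0.
print(round(consumption(90, war=1) - consumption(90, war=0), 1))  # -8.0
```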

51
Example of a data table
Year   Consumption (Y)   X1   X2
1944   62.5              90   1
1945   62.9              91   1
1946   69.0              92   0
1947   78.0              96   0
etc.

52
The Number of Dummy Variables
  • For each qualitative factor, the number of dummy
    variables you need is one less than the number of
    states of nature . . .
  • War or Peace?
  • Seasons of the year?
  • Whether or not a car is red?
  • Neighborhood is "exclusive", or "above average",
    or "average", or "below average", or "blighted
    area".
  • Whether or not it is December?

53
Try this one
  • Suppose I want to do a regression that postulates
    that the market value of a single-family home is
    determined by square footage; number of
    bathrooms; whether or not the home is in the
    Ogden City School District; and whether the home
    is a tract home, a semi-custom home, or a
    one-of-a-kind home. How many dummy variables
    would I need in specifying the regression
    equation?

54
Try this one
  • Suppose I want to do a regression that postulates
    that the market value of a single-family home is
    determined by square footage; number of
    bathrooms; whether or not the home is in the
    Ogden City School District; and whether the home
    is a tract home, a semi-custom home, or a
    one-of-a-kind home. How many dummy variables
    would I need in specifying the regression
    equation?
  • 3
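The count follows directly from the "one less than the states of nature" rule, applied factor by factor:

```python
# Dummies needed: (states of nature - 1) for each qualitative factor.
factors = {
    "in Ogden City School District": 2,                   # yes / no
    "home type (tract / semi-custom / one-of-a-kind)": 3,
}
total = sum(states - 1 for states in factors.values())
print(total)  # 3: one dummy for the district, two for home type
```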

55
A regression problem with dummy variables
  • A study for a chain of 16 restaurants examined
    the relationship between restaurant sales in
    thousands (Y) and number of households (in
    thousands) in the restaurant's trading area (X1)
    and location of restaurant (near Interstate, in
    shopping mall, on city street).
  • X1 = number of households
  • X2 = 1 if shopping mall location, 0 otherwise
  • X3 = 1 if city street location, 0 otherwise

What does this equation reduce to if the
restaurant's location is a shopping mall?
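The fitted equation itself appeared as an image on the slide, so the coefficients below are made-up placeholders; the point is how the equation collapses when the dummies are set:

```python
# Sales = b0 + b1*X1 + b2*X2 + b3*X3, with X2/X3 the location dummies.
# All coefficient values here are hypothetical.
b0, b1, b2, b3 = 10.0, 0.9, 5.0, 3.0

def sales(households, mall=0, street=0):
    return b0 + b1 * households + b2 * mall + b3 * street

# Shopping mall (X2 = 1, X3 = 0): intercept becomes b0 + b2, slope unchanged.
print(sales(40, mall=1))   # 51.0
# Interstate is the baseline (X2 = 0, X3 = 0): intercept is just b0.
print(sales(40))           # 46.0
```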
56
A regression problem with dummy variables
  • A study for a chain of 16 restaurants examined
    the relationship between restaurant sales (in
    thousands) during a recent period (Y) and number
    of households (in thousands) in the restaurant's
    trading area (X1) and location of restaurant
    (near Interstate, in shopping mall, on city
    street)
  • X1 = number of households
  • X2 = 1 if shopping mall location, 0 otherwise
  • X3 = 1 if city street location, 0 otherwise

This is the estimating equation if the
restaurant's location is a shopping mall
57
A regression problem with dummy variables
  • A study for a chain of 16 restaurants examined
    the relationship between restaurant sales (in
    thousands) during a recent period (Y) and number
    of households (in thousands) in the restaurant's
    trading area (X1) and location of restaurant
    (near Interstate, in shopping mall, on city
    street)
  • X1 = number of households
  • X2 = 1 if shopping mall location, 0 otherwise
  • X3 = 1 if city street location, 0 otherwise

What does this estimating equation reduce to if
the restaurant's location is near an Interstate?
58
A regression problem with dummy variables
  • A study for a chain of 16 restaurants examined
    the relationship between restaurant sales (in
    thousands) during a recent period (Y) and number
    of households (in thousands) in the restaurant's
    trading area (X1) and location of restaurant
    (near Interstate, in shopping mall, on city
    street)
  • X1 = number of households
  • X2 = 1 if shopping mall location, 0 otherwise
  • X3 = 1 if city street location, 0 otherwise

This is the estimating equation for sales if the
restaurant's location is near an Interstate
59
Stepwise Regression
  • MINITAB has a controversial capability called
    stepwise regression, wherein the computer will
    select the "best" set of explanatory variables
    for you.

60
Summary
  • Multiple regression analysis: a technique that
    uses several independent variables to estimate or
    explain the value of a single dependent variable.
  • Variables are retained or dropped on the basis of
    individual t tests (and the associated P-values).
  • The sample coefficient of multiple determination,
    R2, is the most important measure of the overall
    strength of association among the variables.
  • Assumptions underlying multiple regression
    analysis must be at least roughly valid for the
    model to be reliable in predicting (or explaining
    movement in) the dependent variable.

61
Final Course Comment . . .
The P-Value concept is critically important in
statistical analysis. When you read research
literature involving statistical analysis in any
discipline, low P-Values are consistent with what
the researcher is trying to demonstrate.
Hypothetical practice exercise: You read a
scientific paper that is attempting to demonstrate
that rubbing olive oil on a man's head daily
prevents baldness. If the reported P-value is .031
(in a study of 4,000 men over a 20-year period),
how would you interpret the author's results?
62
A farewell thought . . .