Title: Forecasting with Multiple Regression
1. Chapter 5
- Forecasting with Multiple Regression
2. Selecting Independent Variables
- Ideally, we want each of our independent variables to be correlated with Y, but not linearly correlated with the other independent variables.
- The reason is that we do not want our RHS variables to overlap too much in what they measure.
- If our RHS variables are highly correlated with one another, they end up measuring the same part of the variance in the dependent variable.
- This causes a problem called multicollinearity.
3. Difficulties in finding RHS variables
- Sometimes it's difficult to find a RHS variable that measures what we want.
- Average interest rates for all installment loans might be similar to just mortgage interest rates.
- Sometimes it's impossible and we need to be more creative.
- Demand for housing would be good to know if you are trying to buy or sell a house, but how do we proxy for demand if we are trying to predict sales?
4. Looking at Our First Multivariate Regression
- We are going to forecast seasonally adjusted new houses sold (NHS).
- Let's first look at the bivariate model using interest rates (IR) for comparison later on.
- NHS = b0 + b1 (IR)
- Expected sign of b1: (+/-?)
- Data
5. Our bivariate estimate of b
- NHSF = 5,543.74 - 415.90 (IR)
- When we were looking at a bivariate model as a function of T, we just kept adding 1 to the last T for the forecast.
- What do we do here?
- We need to forecast interest rates (IR) to get our forecast of NHS.
6. Forecasting our RHS to Forecast our LHS
- What model should we use to forecast our RHS variable?
- In reality, you can choose from any of the forecasting models that are available.
- What does ForecastX use in the simple bivariate case?
- Simple Naïve Model
- This is probably not the best you can do.
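To make the two-step idea concrete, here is a minimal Python sketch: forecast the RHS variable first, then plug that forecast into the estimated equation. The coefficients are the ones shown on slide 5, with the sign on IR assumed to be negative. The interest-rate history and the helper names `naive_forecast` and `forecast_nhs` are made up for illustration; this is not ForecastX's implementation.

```python
# Bivariate model from slide 5. The minus sign on IR is an assumption
# (it is the expected sign for interest rates).
B0, B1 = 5543.74, -415.90


def naive_forecast(history, horizon):
    """Naive model: every future period equals the last observed value."""
    return [history[-1]] * horizon


def forecast_nhs(ir_values):
    """Plug (forecasted) interest rates into the estimated regression."""
    return [B0 + B1 * ir for ir in ir_values]


# Hypothetical interest-rate history in percent; NOT the course data.
ir_history = [6.3, 6.1, 5.9, 5.8]

ir_forecast = naive_forecast(ir_history, horizon=4)
nhs_forecast = forecast_nhs(ir_forecast)

print(ir_forecast)
print([round(v, 2) for v in nhs_forecast])
```

Because the naive IR forecast is flat, the NHS forecast is flat too, which is one reason the naive model is probably not the best you can do.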
7. Holt vs. Naïve (bivariate)
Note the in-sample estimates are coming from the
bivariate model and actual IR. Only when IR
forecasts differ do the forecasts for NHS differ.
8. Which is the better choice in this case?
- The RMSE in the holdout period (2004) is lower for the Holt-forecasted IR than for the Naïve-forecasted IR.
9. Forecasting our RHS to Forecast our LHS
- Typically, we rely on one of the time-trend methods, other forecasts, or an expert opinion.
- In this exercise, we use Holt's exponential smoothing model, but we aren't limited to that.
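For reference, a minimal sketch of Holt's two-parameter exponential smoothing (a smoothed level plus a smoothed trend). The smoothing constants `alpha` and `gamma`, the simple initialization, and the sample series are assumptions chosen for illustration; in practice the constants are usually picked to minimize in-sample error.

```python
def holt_forecast(series, alpha, gamma, horizon):
    """Holt's exponential smoothing: returns forecasts 1..horizon steps ahead."""
    level = series[0]
    trend = series[1] - series[0]  # simple initialization; other choices exist
    for x in series[1:]:
        prev_level = level
        # Update the level from the new observation and last period's projection.
        level = alpha * x + (1 - alpha) * (prev_level + trend)
        # Update the trend from the change in level.
        trend = gamma * (level - prev_level) + (1 - gamma) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


# Hypothetical interest-rate history in percent; NOT the course data.
ir_history = [6.3, 6.1, 5.9, 5.8, 5.6]
print([round(v, 3) for v in holt_forecast(ir_history, alpha=0.5, gamma=0.5, horizon=4)])
```

Unlike the naive model, the Holt forecast keeps moving in the direction of the recent trend, so the forecasted RHS values (and therefore the NHS forecasts) differ from period to period.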
10. Multivariate Estimated Regression Model
- Ŷ = b0 + b1X1 + b2X2 + ... + bkXk, where
- Ŷ is the forecast value of the dependent variable
- X1 ... Xk are explanatory variables
- b1 ... bk show the change in Y for a one-unit change in X1, X2, etc., when all other explanatory variables are held constant.
11. Now we look at NHS with 2 RHS variables (multivariate)
- Our second model incorporates per capita disposable personal income (DPIPC) and the interest rate (IR).
- NHSF2 = b0 + b1 (DPIPC) + b2 (IR)
- Expected signs of b1 and b2: (+/-?)
- Data
12. Results
NHSF2 = -324.33 + 0.17 (DPIPC) - 168.13 (IR)
13. The multivariate forecast
- NHSF2 = -324.33 + 0.17 (DPIPC) - 168.13 (IR)
[Table: NHSF2 forecasts using Holt-forecasted DPIPC and IR, with the holdout RMSE shown alongside the RMSE from the bivariate model]
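The multivariate forecast and its holdout RMSE can be sketched as follows. The coefficients are the ones on the slide; the forecasted DPIPC and IR values and the "actual" NHS values are invented purely to make the example run, so the printed RMSE says nothing about the real model.

```python
import math


def nhsf2(dpipc, ir):
    """Estimated two-variable model from the slide."""
    return -324.33 + 0.17 * dpipc - 168.13 * ir


def rmse(actual, forecast):
    """Root mean squared error over paired observations."""
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical forecasted RHS values for four holdout quarters; NOT course data.
dpipc_forecast = [10000.0, 10050.0, 10100.0, 10150.0]
ir_forecast = [5.8, 5.8, 5.9, 6.0]

nhs_forecast = [nhsf2(d, i) for d, i in zip(dpipc_forecast, ir_forecast)]

# Hypothetical actual values for the same quarters; NOT course data.
nhs_actual = [410.0, 395.0, 400.0, 380.0]

print([round(v, 2) for v in nhs_forecast])
print(round(rmse(nhs_actual, nhs_forecast), 2))
```

Computing the same RMSE for the bivariate forecasts over the same holdout quarters gives the comparison the slide is making.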
14. Evaluating the models (three questions)
- Do the signs on the coefficients make sense?
- Are they statistically significant?
- How much of the variance is explained?
15. Evaluating the models
- Both beta coefficients have the expected signs and are statistically significant.
- Adding the new variable DPIPC adds to the explanatory power of the model.
- Additionally:
- RMSE decreased in-sample and in the holdout.
- R-sq increased from 62.41% in the bivariate model to 92.03% in the multivariate model. But, more importantly, adjusted R-sq went up as well, from 61.50% to 91.68%.
- Conclusion: Adding the second RHS variable improved the model and increased its accuracy. We also know the relationship between IR, DPIPC, and seasonally adjusted NHS.
- What do we do to get actual UN-adjusted sales? Just checking.
16. The Regression Plane
17. Multicollinearity at Work
- Let's now consider a multivariate model with three RHS variables, but two are very similar.
- NHSF = b0 + b1 (DPIPC) + b2 (GDP) + b3 (IR)
- Expected signs: (+/-?)
- Data
18. Multicollinear Regression
GDP and DPIPC are highly correlated.
DPIPC changes sign and becomes insignificant when GDP is added.
19. Checking the Correlation between the RHS variables
From a statistical standpoint, GDP and DPIPC are essentially the same thing. They move together 99% of the time.
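A quick way to run this check yourself is to compute the correlation coefficient between each pair of RHS variables before estimating the model. The sketch below uses a plain-Python Pearson correlation; the two short series are hypothetical stand-ins for GDP and DPIPC, not the course data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical series that trend upward together; NOT the course data.
gdp = [100.0, 104.0, 109.0, 113.0, 118.0]
dpipc = [20.0, 20.9, 21.7, 22.8, 23.5]

r = pearson_r(gdp, dpipc)
print(round(r, 3))
```

A correlation this close to 1 between two RHS variables is a warning sign that they will compete to explain the same variance in Y.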
20. New stats to consider in evaluating a model
- Adjusted R-Squared: Adding one more independent variable never lowers R-sq, so adjusted R-sq factors in the loss in DF. For multivariate models, we typically use only the adjusted R-sq to evaluate fit.
- F-test: Tests the overall significance of the regression. It tests the joint null hypothesis that the regression has NO explanatory power (all slope coefficients are zero at the same time).
21. Adjusted R-Sq
Adjusted R-Sq is scaled down by the loss in DF
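The formula on this slide did not survive, so the sketch below uses the standard textbook definition, adjusted R-sq = 1 - (1 - R-sq)(n - 1)/(n - K - 1), where n is the number of observations and K the number of RHS variables. The function name `adjusted_r2` is just for illustration. The inputs are the multivariate results quoted earlier (R-sq of 92.03 percent, 48 observations, 2 RHS variables).

```python
def adjusted_r2(r2, n, k):
    """Scale R-squared down for the degrees of freedom used by k RHS variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


adj = adjusted_r2(0.9203, n=48, k=2)
print(round(adj, 4))  # 0.9168
```

This reproduces the 91.68 adjusted R-sq reported for the two-variable model.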
22. Way of interpreting the Adjusted R-sq
- The way R-sq is calculated, adding more RHS variables will never worsen the fit, even if they are insignificant.
- The adjusted R-sq declines if the added RHS variable does not predict well. It goes up if the added RHS variable does predict well.
- So, we can use the adjusted R-sq to determine whether adding a particular RHS variable is beneficial to our regression (i.e., whether the added explanatory power offsets the loss in DF).
23. Rule of Thumb for getting rid of RHS vars that could be causing MC problems
- If |t-stat| < 1, removing the variable raises adjusted r2.
- If |t-stat| > 1, removing the variable lowers adjusted r2.
24. One Technique for Selecting Independent Variables for forecasting
- Adjusted r2 Criterion
- Choose the set of variables that maximizes adjusted r2 (minimizes Root Mean Square Error).
- If removing the variable in question causes adjusted r2 to increase, leave it out.
- If removing the variable in question causes adjusted r2 to decrease, leave it in.
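The criterion can be sketched in plain Python: fit each candidate set of RHS variables by least squares, compute adjusted r2, and keep the set with the highest value. Everything here is illustrative: the helper names `ols` and `fit_stats`, the tiny data set, and the use of the normal equations with Gauss-Jordan elimination (fine for a small example, not what a statistics package would do).

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X)b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * t for row, t in zip(X, y))] for i in range(k)]
    for c in range(k):
        # Partial pivoting, then eliminate column c from every other row.
        p = max(range(c, k), key=lambda idx: abs(A[idx][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def fit_stats(columns, y):
    """Return (r2, adjusted r2) for a regression of y on the given columns."""
    n, k = len(y), len(columns)
    X = [[1.0] + [col[t] for col in columns] for t in range(n)]  # add intercept
    b = ols(X, y)
    fitted = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - sse / sst
    return r2, 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical data: y depends on x1; x2 is an unrelated series.
x1 = [float(t) for t in range(1, 11)]
x2 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
noise = [0.5, -0.3, 0.2, -0.6, 0.4, -0.1, 0.3, -0.4, 0.1, -0.2]
y = [2.0 + 3.0 * a + e for a, e in zip(x1, noise)]

candidates = {"x1": [x1], "x1 + x2": [x1, x2]}
results = {name: fit_stats(cols, y) for name, cols in candidates.items()}
best = max(results, key=lambda name: results[name][1])

for name, (r2, adj) in results.items():
    print(name, round(r2, 4), round(adj, 4))
print("keep:", best)
```

R-sq for the larger set can never be lower than for the smaller one, so only the adjusted figure is useful for deciding whether the extra variable earns its place.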
25. The F-test
- For the multiple regression with 2 RHS vars, K = 2. We used 48 observations, so n - (K+1) = 45. We then go to our F-table.
26. The F-distribution
If our calculated F is larger than the critical value from the table, we reject the null that all b's = 0 jointly.
27. F-statistic (F calculated) is related to r2
- r2 = Explained Variation in Y / Total Variation in Y
- F = Explained Variation / Unexplained Variation (each divided by its degrees of freedom)
- As Explained Variation in Y rises, r2 rises and F rises.
- F-test is used to test if r2 is statistically different from zero
- H0: r2 = 0
- H1: r2 ≠ 0
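The link between r2 and F can be made explicit with the standard identity F = (r2 / K) / ((1 - r2) / (n - K - 1)). The sketch below applies it to the two-variable NHS model (r2 of 92.03 percent, n = 48, K = 2). The function name `f_from_r2` is illustrative, and the critical value is an approximate 5 percent figure for (2, 45) degrees of freedom read from a standard F-table, not a number taken from the slides.

```python
def f_from_r2(r2, n, k):
    """F statistic implied by R-squared, n observations, and k RHS variables."""
    explained_per_df = r2 / k
    unexplained_per_df = (1 - r2) / (n - k - 1)
    return explained_per_df / unexplained_per_df


f_calc = f_from_r2(0.9203, n=48, k=2)

# Approximate 5% critical value for (2, 45) degrees of freedom (assumption:
# read from a standard F-table; use your own table or software in practice).
F_CRITICAL = 3.2

reject_null = f_calc > F_CRITICAL
print(round(f_calc, 1), reject_null)
```

With an F this far above the critical value, the null that all slope coefficients are jointly zero is rejected.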
28. Handling Seasonality in a MV Regression
29. What are Dummy Variables?
- They are constructed RHS variables that act like switches. They turn on and off when something is true or false.
- They are often used in situations when no continuous measure is available or where we expect there to be discrete differences in effects (like seasons).
- To estimate a model with seasonality, we need to CONSTRUCT seasonal dummy variables.
30. Constructing the Seasonal Dummy
- In all the data sets we have seen thus far, we have had data on month and year.
- To construct month dummies, we need to construct 11 (NOT 12) new columns of data. Each month except the omitted base month gets its own column.
31. Data Construction
New Vars
Here we have made dummies for all months, but we need to drop one month for use in regressions. With all 12 included, the dummies always sum to 1 and duplicate the intercept: perfect collinearity!
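A minimal sketch of building the 11 month dummies from a column of month labels. The helper name `month_dummies`, the choice of December as the omitted base month, and the sample observations are all assumptions for illustration; any single month can serve as the base.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def month_dummies(observed_months, base="Dec"):
    """Return one 0/1 column per month, leaving out the base month."""
    kept = [m for m in MONTHS if m != base]
    return {m: [1 if obs == m else 0 for obs in observed_months] for m in kept}


# Hypothetical month labels for four observations.
observed = ["Jan", "Feb", "Dec", "Jan"]
dummies = month_dummies(observed)

print(len(dummies))    # number of dummy columns
print(dummies["Jan"])  # switch is on only for January observations
```

An observation from the base month has zeros in all 11 columns, so its seasonal effect is absorbed by the intercept and every other month's coefficient is read relative to it.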
32. Ungraded Homework (Ch. 4, 7)
- Provide a quarterly forecast of sales.
- Prepare a time series plot of the data and explain what you see. Is a simple linear trend forecast useful in this case? Estimate the trend and address parts A-E.
- Data
33. Exercise 7, Part A
Although there's probably some seasonality, there is also definitely a trend, and it looks fairly linear and positive.
34. Exercise 7, Parts B & C: Estimate the trend line. Does it have a significant trend?
C. Yes, significant.
35. Exercise 7, Part D: Forecast 4 qtrs of 2004
- Sales = 88,741.01 + 5,362.62 (41) = 308,608.43 (T = 41, Mar-2004)
- Sales = 88,741.01 + 5,362.62 (42) = 313,971.16 (T = 42, Jun-2004)
- Sales = 88,741.01 + 5,362.62 (43) = 319,333.78 (T = 43, Sep-2004)
- Sales = 88,741.01 + 5,362.62 (44) = 324,696.41 (T = 44, Dec-2004)
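The trend forecasts are just the estimated line evaluated at the next four values of T. A small sketch, using the rounded intercept and slope shown on the slide (the plus sign between them is inferred from the arithmetic); because the coefficients are rounded, some results can differ from the slide's figures by a few cents.

```python
# Rounded trend-line coefficients from the slide.
INTERCEPT, SLOPE = 88741.01, 5362.62


def trend_forecast(t):
    """Linear trend forecast of sales for time index t."""
    return INTERCEPT + SLOPE * t


quarters = {41: "Mar-2004", 42: "Jun-2004", 43: "Sep-2004", 44: "Dec-2004"}
forecasts = {t: trend_forecast(t) for t in quarters}

for t, label in quarters.items():
    print(label, t, round(forecasts[t], 2))
```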
36. Exercise 7, Part E: Accuracy for 2004
RMSE for the 4 quarters of 2004 is 19,571, or about 5.9% of the average quarterly sales for 2004.
37. Ungraded Homework (Ch. 4, 8)
- Use the unemployment rate to estimate sales.
- Unemployment Data
- Part A does not actually ask you to do anything. Go figure?!
38. Exercise 8, Part B: Plot a scattergram of sales vs. regional unemployment rate
There might be some positive relationship, which
is kinda odd. Typically, you would expect sales
to fall if unemployment rises.
39. Exercise 8, Part C: Bivariate Reg. of Sales as a fcn. of Regional Unemployment Rate
How did we do here?
40. Exercise 8, Part D: Take a memo
- Dear Ms. Lynch,
- The regression of Sales on the Northern Regional Unemployment Rate does not provide the kind of accuracy we were seeking. Although the estimated effect of the unemployment rate on sales was statistically significant, the model only explained about 10% (R-sq = 9.99%) of the variance in the dependent variable, and the sign was not what we expected. Furthermore, the MAPE was more than 31%, indicating a poor fit to the sample data. Finally, our Theil's U indicates that we would be much better off using a simple naïve forecast rather than one based on the unemployment rate.
- Your humble servant,
- Flippin Hades Turwilliger
41. Exercise 8, Part E: Forecast using the model and the forecast for the region's unemployment rate (FNRUR)
- Sales = 73,222.19 + 15,369.38 (NRUR) for each of Mar-2004, Jun-2004, Sep-2004, and Dec-2004
- 190,029.50 = 73,222.19 + 15,369.38 (7.6), Mar-2004
- 191,566.44 = 73,222.19 + 15,369.38 (7.7), Jun-2004
- 188,492.56 = 73,222.19 + 15,369.38 (7.5), Sep-2004
- 186,955.62 = 73,222.19 + 15,369.38 (7.4), Dec-2004
42. Exercise 8, Part F: Calculate RMSE and compare with the earlier model using the Time Trend
Unemployment Regression
Trend Regression
43. Exercise 8, Part G: Scattergram of Sales and Inc.
Now, that looks more like what we want! It looks like there is a relationship between inc. and sales, which makes sense, right?!
44. Exercise 8, Parts H & I: Estimate the Bivariate Reg. of Sales on Income
- Sales = 76,808.83 + 120.11 (Income)
- Dear Ms. Lynch,
- Using income, rather than the unemployment rate, substantially improved our estimates of sales. We obtained the expected sign, with statistically significant results. Our t-stats indicate significance at a confidence level in excess of 99%. Our R-sq indicates that we explain about 86% of the variance in sales with income, and the MAPE decreased to 12.76%. Overall, this is a much better model than the one using the unemployment rate. Now, please promote me to a job that doesn't require me to do this anymore.
- Your personal slave,
- Flippin Hades Turwilliger