Title: Autocorrelation in Regression Analysis
1. Autocorrelation in Regression Analysis
- Tests for Autocorrelation
- Examples
- Durbin-Watson Tests
- Modeling Autoregressive Relationships
2. What causes autocorrelation?
- Misspecification
- Data manipulation
  - Before receipt
  - After receipt
- Event inertia
- Spatial ordering
3. Checking for Autocorrelation
- Test the Durbin-Watson statistic
4. Consider the following regression
      Source |       SS       df       MS              Number of obs =     328
-------------+------------------------------           F(  2,   325) =   52.63
       Model |  .354067287     2  .177033643           Prob > F      =  0.0000
    Residual |  1.09315071   325  .003363541           R-squared     =  0.2447
-------------+------------------------------           Adj R-squared =  0.2400
       Total |    1.447218   327  .004425743           Root MSE      =    .058
------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ice |    .060075    .006827     8.80   0.000     .0466443    .0735056
    quantity |  -2.27e-06   2.91e-07    -7.79   0.000    -2.84e-06   -1.69e-06
       _cons |   .2783773   .0077177    36.07   0.000     .2631944    .2935602
------------------------------------------------------------------------------
Because this is time series data, we should consider the possibility of autocorrelation. To run the Durbin-Watson test, first we have to declare the data as time series with the tsset command. Next we use the dwstat command.
Durbin-Watson d-statistic( 3, 328) = .2109072
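The steps above can be sketched as follows; the name of the time index variable is an assumption, and in newer Stata releases dwstat is provided as the postestimation command estat dwatson:

```stata
* Declare the data as time series, fit the model, then request the
* Durbin-Watson statistic. "time" is an assumed time index variable.
tsset time
regress price ice quantity
dwstat            // estat dwatson in newer Stata versions
```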
5. Find the D-upper and D-lower
- Check a Durbin-Watson table for the values of d-upper and d-lower.
- http://hadm.sph.sc.edu/courses/J716/Dw.html
- For n = 20 and k = 2 at α = .05, the values are:
  - Lower: 1.643
  - Upper: 1.704
Durbin's alternative test for autocorrelation
------------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+----------------------------------------------------------------
      1      |       1292.509               1                    0.0000
------------------------------------------------------------------------------
H0: no serial correlation
6. Alternatives to the d-statistic
- The d-statistic is not valid in models with a lagged dependent variable.
- In the case of a lagged LHS variable you must use the Durbin alternative test (the command is durbina in Stata).
- Also, the d-statistic is only for first-order autocorrelation. In other instances you may use the Durbin alternative test.
- Why would you suspect other than 1st-order autocorrelation?
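A minimal sketch of the test described above, assuming a time index named time; in recent Stata releases the durbina command is available as the postestimation command estat durbinalt, which can also test higher lag orders (lags 1-4 here are an illustrative choice):

```stata
* Fit a model with a lagged dependent variable on the RHS, then run
* Durbin's alternative test at several lag orders.
tsset time
regress price L.price ice quantity
estat durbinalt, lags(1/4)
```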
7. The Runs Test
- An alternative to the D-W test is a formalized examination of the signs of the residuals. We would expect that the signs of the residuals will be random in the absence of autocorrelation.
- The first step is to estimate the model and predict the residuals.
8. Runs continued
- Next, order the signs of the residuals against
time (or spatial ordering in the case of
cross-sectional data) and see if there are
excessive runs of positives or negatives.
Alternatively, you can graph the residuals and
look for the same trends.
9. Runs test continued
The final step is to use the expected mean and deviation in a standard t-test. Stata does this automatically with the runtest command!
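The steps on the last three slides might look like this; variable names are assumptions:

```stata
* Estimate the model, predict the residuals, and apply the runs test
* to their ordering.
regress price ice quantity
predict e, resid
runtest e
```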
10. Visual diagnosis of autocorrelation (in a single series)
- A correlogram is a good tool to identify whether a series is autocorrelated.
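A correlogram sketch in Stata, with the time index name assumed:

```stata
* ac draws the autocorrelation plot; corrgram lists the
* autocorrelations with portmanteau (Q) statistics.
tsset time
ac price
corrgram price
```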
11. Dealing with autocorrelation
- D-W is not appropriate for auto-regressive (AR) models, where the lagged dependent variable appears on the right-hand side (Y_t depends on Y_(t-1)).
- In this case, we use the Durbin alternative test.
- For AR models, we need to explicitly estimate the correlation between Y_t and Y_(t-1) as a model parameter.
- Techniques:
  - AR1 models (closest to regression; 1st order only)
  - ARIMA (any order)
12. Dealing with Autocorrelation
- There are several approaches to resolving problems of autocorrelation:
- Lagged dependent variables
- Differencing the dependent variable
- GLS
- ARIMA
13. Lagged dependent variables
- The most common solution.
- Simply create a new variable that equals Y at t-1, and use it as a RHS variable.
- To do this in Stata, simply use the generate command with the new variable equal to L.variable:
  - gen lagy = L.y
  - gen laglagy = L2.y
- This correction should be based on a theoretical belief for the specification.
- May cause more problems than it solves.
- Also costs a degree of freedom (lost observation).
- There are several advanced techniques for dealing with this as well.
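Applied to this lecture's variables, the fix above might be sketched as (time index name assumed):

```stata
* Generate the lag of the dependent variable and add it to the RHS.
* The first observation is lost to the lag.
tsset time
gen lagprice = L.price
regress price lagprice ice quantity
```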
14. Differencing
- Differencing is simply the act of subtracting the previous observation's value from the current observation.
- To do this in Stata, again use the generate command with a capital D (instead of the L for lags).
- This process is effective; however, it is an EXPENSIVE correction.
- This technique throws away long-term trends.
- Assumes that rho = 1 exactly.
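A differencing sketch, with the time index name assumed:

```stata
* D.price equals price - L.price once the data are tsset.
tsset time
gen dprice = D.price
regress dprice ice quantity
```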
15. GLS and ARIMA
- GLS approaches use maximum likelihood to estimate rho and correct the model.
- These are good corrections, and can be replicated in OLS.
- ARIMA is an acronym for Autoregressive Integrated Moving Average.
- This process is a univariate filter used to cleanse variables of a variety of pathologies before analysis.
16. Corrections based on Rho
- There are several ways to estimate rho, the simplest being to calculate it from the residuals:

    rho-hat = ( sum_t e_t * e_(t-1) ) / ( sum_t e_(t-1)^2 )

- We then estimate the regression by transforming the regressors (the Cochrane-Orcutt transformation) so that

    Y*_t = Y_t - rho-hat * Y_(t-1)    and    X*_t = X_t - rho-hat * X_(t-1)

- This gives the regression

    Y*_t = b0(1 - rho-hat) + b1 * X*_t + u_t
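The calculation above can be done by hand in Stata; this sketch assumes the lecture's variable names and a time index called time:

```stata
* Estimate rho from the OLS residuals, then re-run the regression
* on the rho-transformed (quasi-differenced) variables.
tsset time
regress price ice quantity
predict e, resid
regress e L.e, noconstant        // slope is the estimate of rho
scalar rhohat = _b[L.e]
gen ystar   = price    - rhohat*L.price
gen icestar = ice      - rhohat*L.ice
gen qstar   = quantity - rhohat*L.quantity
regress ystar icestar qstar
```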
17. High tech solutions
- Stata also offers the option of estimating the model with the AR structure built in (with multiple ways of estimating rho). There is also what is known as a Prais-Winsten regression, which generates values for the lost observation.
- For the truly adventurous, there is also the option of doing a full ARIMA model.
18. Prais-Winsten regression

Prais-Winsten AR(1) regression -- iterated estimates

      Source |       SS       df       MS              Number of obs =     328
-------------+------------------------------           F(  2,   325) =   15.39
       Model |  .012722308     2  .006361154           Prob > F      =  0.0000
    Residual |  .134323736   325  .000413304           R-squared     =  0.0865
-------------+------------------------------           Adj R-squared =  0.0809
       Total |  .147046044   327  .000449682           Root MSE      =  .02033
------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ice |   .0098603   .0059994     1.64   0.101    -.0019422    .0216629
    quantity |  -1.11e-07   1.70e-07    -0.66   0.512    -4.45e-07    2.22e-07
       _cons |   .2517135   .0195727    12.86   0.000     .2132082    .2902188
-------------+----------------------------------------------------------------
         rho |   .9436986
------------------------------------------------------------------------------
Durbin-Watson statistic (original)    0.210907
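Output of this form comes from the prais command; a minimal sketch, with the time index name assumed:

```stata
* Prais-Winsten (iterated) estimation; the corc option would give
* Cochrane-Orcutt instead, which drops the first observation.
tsset time
prais price ice quantity
```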
19. ARIMA
- The ARIMA model allows us to test the hypothesis of autocorrelation and remove it from the data.
- This is an iterative process akin to the purging we did when creating the ystar variable.
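A sketch of the commands behind the next two slides, with the time index name assumed:

```stata
* AR(1) model of price alone, then with the covariates added.
tsset time
arima price, ar(1)
arima price ice quantity, ar(1)
```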
20. The model

ARIMA regression

Sample:  1 to 328                                Number of obs   =        328
                                                 Wald chi2(1)    =    3804.80
Log likelihood =  811.6018                       Prob > chi2     =     0.0000
------------------------------------------------------------------------------
             |                 OPG
       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
       _cons |   .2558135   .0207937    12.30   0.000     .2150587    .2965683
-------------+----------------------------------------------------------------
ARMA         |
ar           |
         L1. |   .9567067     .01551    61.68   0.000     .9263076    .9871058
-------------+----------------------------------------------------------------
      /sigma |   .0203009    .000342    59.35   0.000     .0196305    .0209713
------------------------------------------------------------------------------

The ar L1. coefficient is the estimate of rho -- a significant lag.
21. The residuals of the ARIMA model
There are a few significant lags a ways back. Generally we should expect some, but this mess is probably an indicator of a seasonal trend (well beyond the scope of this lecture)!
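Checking the ARIMA residuals as described might be sketched as (residual variable name is an assumption):

```stata
* After arima, predict the residuals and inspect their correlogram
* for leftover structure such as seasonal lags.
predict ehat, resid
ac ehat
corrgram ehat
```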
22. ARIMA with a covariate

ARIMA regression

Sample:  1 to 328                                Number of obs   =        328
                                                 Wald chi2(3)    =    3569.57
Log likelihood =  812.9607                       Prob > chi2     =     0.0000
------------------------------------------------------------------------------
             |                 OPG
       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
         ice |   .0095013   .0064945     1.46   0.143    -.0032276    .0222303
    quantity |  -1.04e-07   1.22e-07    -0.85   0.393    -3.43e-07    1.35e-07
       _cons |   .2531552   .0220777    11.47   0.000     .2098838    .2964267
-------------+----------------------------------------------------------------
ARMA         |
ar           |
         L1. |   .9542692     .01628    58.62   0.000     .9223611    .9861773
------------------------------------------------------------------------------
23. Final thoughts
- Each correction has a best application.
- If we wanted to evaluate a mean shift (dummy-variable-only model), calculating rho will not be a good choice. Then we would want to use the lagged dependent variable.
- Also, where we want to test the effect of inertia, it is probably better to use the lag.
24. Final Thoughts Continued
- In small-N settings, calculating rho tends to be more accurate.
- ARIMA is one of the best options; however, it is very complicated!
- When dealing with time, the number of time periods and the spacing of the observations is VERY IMPORTANT!
- When using estimates of rho, a good rule of thumb is to make sure you have 25-30 time points at a minimum. More if the observations are too close together for the process you are observing!
25. Next Time
- Review for Exam
- Plenary Session
- Exam Posting
  - Available after class Wednesday