Title: Heteroskedasticity
1Heteroskedasticity
- Outline
- 1) What is it?
- 2) What are the consequences for our Least Squares estimator when we have heteroskedasticity?
- 3) How do we test for heteroskedasticity?
- 4) How do we correct a model that has heteroskedasticity?
2What is Heteroskedasticity
Review the assumptions of Gauss-Markov
- Linear Regression Model: $y = \beta_1 + \beta_2 x + e$
- Error term has a mean of zero: $E(e) = 0 \Rightarrow E(y) = \beta_1 + \beta_2 x$
- Error term has constant variance: $Var(e) = E(e^2) = \sigma^2$, a CONSTANT
- In other words, we assume all the observations are equally reliable
- Error term is not correlated with itself (no serial correlation): $Cov(e_i, e_j) = E(e_i e_j) = 0$ for $i \neq j$
- Data on X are not random and thus are uncorrelated with the error term: $Cov(X, e) = E(Xe) = 0$
The constant variance $\sigma^2$ is the assumption of a homoskedastic error.
A homoskedastic error is one that has constant
variance. A heteroskedastic error is one that has
a nonconstant variance.
Heteroskedasticity is more commonly a problem for
cross-section data sets, although a time-series
model can also have a non-constant variance.
3This diagram shows a constant variance (OLD assumption) for the error term. The line shows E(y|x) (food) for any given x. Notice that a family making 500 is expected to deviate from its E(y|x) by the same amount as a family making 1500.
[Figure: conditional densities f(y|x) of y (food) around the line E(y|x), drawn at x (income) = 500, 1000 and 1500; σ² has the SAME value at each x.]
4This diagram shows a non-constant variance for the error term that appears to increase as X increases. The variation in the Gates' food budget from their average is greater than for the Correas.
[Figure: conditional densities f(y|x) of y around the line E(y|x), drawn at x (income) = 500 (Correas), 1000 (Trump) and 1500 (Gates); σ² increases with x.]
5What are the causes?
Direct
- Scale effects
- Structural shifts
- Learning effects
Indirect
- Omitted variables
- Outliers
- Parameter variation
Again, heteroskedasticity is more commonly a
problem for cross-section data sets, although a
time-series model can also have a non-constant
variance.
6What are the Implications for Least Squares?
- We have to ask: where did we use the assumption? Or why was the assumption needed in the first place?
- We used the assumption in the derivation of the variance formulas for the least squares estimators, b1 and b2.
- For b2 the formula for Var(b2) was
$Var(b_2) = \frac{\sum (x_t - \bar{x})^2 \, Var(e_t)}{\left[\sum (x_t - \bar{x})^2\right]^2} = \frac{\sigma^2}{\sum (x_t - \bar{x})^2}$
BUT this last step uses the assumption that $\sigma_t^2$ is a constant $\sigma^2$.
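7If this is not the case, then the formula is
$Var(b_2) = \sum w_t^2 \sigma_t^2 = \frac{\sum (x_t - \bar{x})^2 \sigma_t^2}{\left[\sum (x_t - \bar{x})^2\right]^2}$
Remember
$w_t = \frac{x_t - \bar{x}}{\sum (x_t - \bar{x})^2}$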
Therefore, if we ignore the problem of a
heteroskedastic error and estimate the variance
of b2 using the formula on the previous slide,
when in fact we should have used the formula
directly on this slide, then our estimates of
the variance of b2 are wrong. Any hypothesis
tests or confidence intervals based on them will
be invalid. Note, however, that the proof of
unbiasedness, $E(b_2) = \beta_2$, did not use the
assumption of a homoskedastic error. Therefore,
a heteroskedastic error will not bias the
coefficient estimates, but it will bias the
estimates of their variances.
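A small numerical sketch of this point. With weights w_t = (x_t - x̄) / Σ(x_t - x̄)², the general variance of b2 is Σ w_t² σ_t², and it only collapses to the usual σ² / Σ(x_t - x̄)² when every σ_t² is the same. The x values and error variances below are made up for illustration, and the function names are hypothetical.

```python
# Illustration only: the x values and error variances below are made up.

def slope_variance_usual(x, sigma2):
    """Var(b2) = sigma^2 / sum (x_t - xbar)^2, valid under homoskedasticity."""
    xbar = sum(x) / len(x)
    sxx = sum((xt - xbar) ** 2 for xt in x)
    return sigma2 / sxx


def slope_variance_general(x, sigma2_t):
    """Var(b2) = sum w_t^2 * sigma_t^2, w_t = (x_t - xbar) / sum (x_t - xbar)^2."""
    xbar = sum(x) / len(x)
    sxx = sum((xt - xbar) ** 2 for xt in x)
    return sum(((xt - xbar) / sxx) ** 2 * s2 for xt, s2 in zip(x, sigma2_t))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma2_t = [xt ** 2 for xt in x]             # assumed: variance grows with x_t
avg_sigma2 = sum(sigma2_t) / len(sigma2_t)   # one "constant" variance to plug in

usual = slope_variance_usual(x, avg_sigma2)
correct = slope_variance_general(x, sigma2_t)
print(usual, correct)   # the two formulas disagree when sigma2_t varies
```

If the same constant is passed for every σ_t², the two functions agree, which is exactly the homoskedastic case.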
8In other words, if there is heteroskedasticity
- OLS estimators are still LINEAR and UNBIASED
- OLS estimators are NOT EFFICIENT
- Usual formulas give INCORRECT STANDARD ERRORS for OLS
- Any hypothesis tests or confidence intervals based on the usual formulas for the standard errors are WRONG
9How do We Test for a Heteroskedastic Error
- 1) Visual Inspection of the residuals
- Because we never observe actual values for the
error term, we never know for sure whether it is
heteroskedastic or not. However, we can run a
least squares regression and examine the
residuals to see if they show a pattern
consistent with a non-constant variance.
10This regression resulted in the following
residuals plotted against the variable X (weekly
income). It appears that the variation in the
residuals increases with higher values of X,
suggesting a heteroskedastic error.
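A plot is the usual tool here, but the same idea can be sketched numerically: fit the line, then compare the average squared residual for low and high values of x. The data below are made up (they are not the food expenditure data), and the helper names are hypothetical.

```python
def ols_fit(x, y):
    """Least squares intercept and slope for y = b1 + b2*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xt - xbar) ** 2 for xt in x)
    sxy = sum((xt - xbar) * (yt - ybar) for xt, yt in zip(x, y))
    b2 = sxy / sxx
    b1 = ybar - b2 * xbar
    return b1, b2


def residual_spread_by_half(x, y):
    """Mean squared residual for the low-x half and the high-x half."""
    b1, b2 = ols_fit(x, y)
    pairs = sorted(zip(x, y))                   # sort observations by x
    resid2 = [(yt - b1 - b2 * xt) ** 2 for xt, yt in pairs]
    half = len(resid2) // 2
    low = sum(resid2[:half]) / half
    high = sum(resid2[half:]) / (len(resid2) - half)
    return low, high


# Made-up data: y = 2 + 3x plus errors that are small for low x, large for high x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
errors = [0.1, -0.2, 0.1, 1.0, -2.0, 1.0]
y = [2.0 + 3.0 * xt + et for xt, et in zip(x, errors)]

low, high = residual_spread_by_half(x, y)
print(low, high)   # a much larger value for the high-x half hints at heteroskedasticity
```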
11- Formal Tests for Heteroskedasticity
There are many different tests that can be used for heteroskedasticity. We will look at 4 of them.
- 2) Goldfeld-Quandt Test
- a) Suppose we think that the error might be heteroskedastic. We examine the residuals and notice that the variance in the residuals appears to be larger for larger values of a (continuous) explanatory variable xj.
- Note that it is necessary to make some assumption about the form of the heteroskedasticity, that is, an assumption about how the variance of et changes. For the food expenditure problem, the residuals tell us that an increasing function of xt (weekly income) is a good candidate. Other models may have a variance that is a decreasing function of xt or is a function of some variable other than xt.
12- The idea behind the Goldfeld-Quandt Test
- We want to test if there is heteroskedasticity of the kind that is proportional to xj.
- Sort the data in descending order by the variable xj that you think causes the heteroskedasticity, and then split the data in half. Omit a few of the middle observations.
- Run the regression on each half of the data.
- Conduct a formal hypothesis test to decide whether or not there is a heteroskedastic error based on an examination of the SSE from each half.
- If the error is heteroskedastic with a larger variance for the larger values of xt, then we should find
$\hat{\sigma}^2_{large} > \hat{\sigma}^2_{small}$
where
$\hat{\sigma}^2_{large} = \frac{SSE_{large}}{t_{large} - k}$ and $\hat{\sigma}^2_{small} = \frac{SSE_{small}}{t_{small} - k}$
and where SSElarge comes from the regression using the subset of large values of xt, which has tlarge observations, and SSEsmall comes from the regression using the subset of small values of xt, which has tsmall observations.
13Ho: The error is Homoskedastic so that $\sigma^2_{large} = \sigma^2_{small}$
H1: The error is Heteroskedastic, $\sigma^2_{large} > \sigma^2_{small}$
The test statistic is $GQ = \hat{\sigma}^2_{large} / \hat{\sigma}^2_{small}$
It can be shown that the GQ statistic has an F-distribution with (tl-k) d.o.f. in the numerator and (ts-k) d.o.f. in the denominator. If GQ > Fc, we reject Ho. We find that the error is heteroskedastic.
14Food Expenditure Example
This code sorts the data according to X because we believe that the error variance is increasing in xt.
proc sort data=food;
  by descending x;
data food_large;
  set food;
  if _n_ <= 20;
proc reg;
  bigvalues: model y = x;
data food_small;
  set food;
  if _n_ >= 21;
proc reg;
  littlevalues: model y = x;
run;
The first PROC REG estimates the model for the first 20 observations, which are the observations with large values of xt.
The second PROC REG estimates the model for the second 20 observations, which are the observations with small values of xt.
15 The REG Procedure
Model: bigvalues
Dependent Variable: y

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1       4756.81422    4756.81422      2.08   0.1663
Error             18            41147    2285.93938
Corrected Total   19            45904

Root MSE          47.81150   R-Square   0.1036
Dependent Mean   148.32250   Adj R-Sq   0.0538
Coeff Var         32.23483

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             48.17674         70.24191      0.69     0.5015
x            1              0.11767          0.08157      1.44     0.1663

The REG Procedure
Model: littlevalues
Dependent Variable: y

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1       8370.95124    8370.95124     12.27   0.0025
Error             18            12284     682.45537
Corrected Total   19            20655

Root MSE          26.12385   R-Square   0.4053
Dependent Mean   112.30350   Adj R-Sq   0.3722
Coeff Var         23.26183

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             12.93884         28.96658      0.45     0.6604
x            1              0.18234          0.05206      3.50     0.0025
GQ = 2285.93938 / 682.45537 ≈ 3.35 > Fc = 2.22 (see SAS) ⇒ Reject Ho
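As a cross-check, the GQ statistic can be recomputed from the error sums of squares in the two REG outputs. This is only a sketch: the function name is hypothetical, and k = 2 is inferred from the 18 error degrees of freedom with 20 observations in each half.

```python
def goldfeld_quandt(sse_large, t_large, sse_small, t_small, k):
    """GQ = [SSE_large / (t_large - k)] / [SSE_small / (t_small - k)]."""
    var_large = sse_large / (t_large - k)
    var_small = sse_small / (t_small - k)
    return var_large / var_small


# Error sums of squares and sample sizes read from the two REG outputs;
# k = 2 because each regression estimates an intercept and a slope.
gq = goldfeld_quandt(sse_large=41147.0, t_large=20,
                     sse_small=12284.0, t_small=20, k=2)
f_critical = 2.22   # the Fc value quoted on the slide
print(round(gq, 2), gq > f_critical)
```

Because both halves have the same degrees of freedom here, GQ reduces to the ratio of the two SSE values.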
16- 3) Park Test
- This test is described in Gujarati 1995, p.369. This test proposes that the error variance is a log-log function of one (or more) explanatory variable(s), say X, in the form
$\ln(\sigma_t^2) = b_0 + b_1 \ln(X_t) + u_t$
- Note that the relationship is NOT LINEAR, unlike before. Look at Figure 6.3(b) and Figure 6.3(c) in the book.
- Following OLS estimation, use the OLS estimated residuals êt in the auxiliary regression
$\ln(\hat{e}_t^2) = b_0 + b_1 \ln(X_t) + u_t$
- The test statistic is the t-ratio on the parameter estimate for b1. If the t-ratio shows that the estimated parameter b1 is significantly different from zero then there is evidence for heteroskedasticity. Since this is an approximate test it is appropriate to consider that the test statistic has an asymptotic normal distribution so that at a 5% significance level the critical value is 1.96.
- The Park test and the Goldfeld-Quandt test require precise beforehand knowledge of the cause of the HET: the xj variable(s) causing the HET AND the functional form of this HET. If you have such knowledge, then the Park test and the Goldfeld-Quandt test are more powerful (less Type II error: you less often accept a FALSE null when the alternative that you specified is true) than the subsequent tests. The Park test is only asymptotically valid.
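The auxiliary regression of the Park test can be sketched in a few lines. The function name and the numbers are hypothetical: the "residuals" are constructed by hand so that ln(ê²) rises with ln(X) with a slope of 2, and with only six points the 1.96 cutoff is purely illustrative.

```python
import math


def park_test(residuals, x):
    """Regress ln(e_t^2) on ln(x_t); return (b1, t-ratio of b1)."""
    z = [math.log(xt) for xt in x]
    v = [math.log(et ** 2) for et in residuals]
    n = len(z)
    zbar = sum(z) / n
    vbar = sum(v) / n
    szz = sum((zt - zbar) ** 2 for zt in z)
    b1 = sum((zt - zbar) * (vt - vbar) for zt, vt in zip(z, v)) / szz
    b0 = vbar - b1 * zbar
    sse = sum((vt - b0 - b1 * zt) ** 2 for zt, vt in zip(z, v))
    se_b1 = math.sqrt(sse / (n - 2) / szz)
    return b1, b1 / se_b1


# Made-up inputs: ln(e^2) = 1 + 2*ln(x) + small noise.
log_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.1, -0.2, 0.1, 0.1, -0.2, 0.1]
x = [math.exp(z) for z in log_x]
residuals = [math.sqrt(math.exp(1.0 + 2.0 * z + u)) for z, u in zip(log_x, noise)]

b1, t_ratio = park_test(residuals, x)
print(b1, t_ratio, abs(t_ratio) > 1.96)
```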
17- 4) Breusch-Pagan Test
Is there some variation in the squared residuals which can be explained by variation in some independent variables?
- Estimate the OLS regression and obtain the residuals
- Use the squared residuals as the dependent variable in a secondary equation that includes the independent variables suspected of being related to the error term.
- $\hat{e}_t^2 = b_0 + b_1 \ln X_{1t} + b_2 \ln X_{2t} + \dots + u_t$
- Test the joint hypothesis that coefficients of ALL the Xs in the second regression are zero. (An F test of significance. Use TEST in SAS. See Ch 8.1 and 8.2)
- Can also test $nR^2 \sim \chi^2_{df}$ where R2 is the R-squared from the auxiliary regression and df = number of regressors (Xs) in the auxiliary regression.
- The Park and Goldfeld-Quandt tests require knowledge of the form of the HET, that is, the particular functional form. If you have such knowledge, then the previous tests are more powerful (less Type II error: you less often accept a FALSE null when the alternative that you specified is true). The Breusch-Pagan test does not require knowledge of the functional form of the HET, but it still assumes that we know which variables cause it. It is also sensitive to deviations from normality.
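The nR² version of the test is easy to sketch for one auxiliary regressor. Everything below is an assumption for illustration: the function name, the single regressor z (which could be X or ln X), and the made-up squared residuals. With six observations the chi-square comparison is not meaningful in practice.

```python
def breusch_pagan_lm(squared_residuals, z):
    """Regress e_t^2 on z_t and return (R^2, n * R^2)."""
    n = len(z)
    zbar = sum(z) / n
    ebar = sum(squared_residuals) / n
    szz = sum((zt - zbar) ** 2 for zt in z)
    sze = sum((zt - zbar) * (e2 - ebar) for zt, e2 in zip(z, squared_residuals))
    b1 = sze / szz
    b0 = ebar - b1 * zbar
    sse = sum((e2 - b0 - b1 * zt) ** 2 for zt, e2 in zip(z, squared_residuals))
    sst = sum((e2 - ebar) ** 2 for e2 in squared_residuals)
    r2 = 1.0 - sse / sst
    return r2, n * r2


z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
squared_residuals = [1.5, 1.0, 3.5, 4.5, 4.0, 6.5]   # hypothetical values

r2, lm = breusch_pagan_lm(squared_residuals, z)
chi2_critical = 3.84   # 5% critical value of chi-square with 1 d.o.f.
print(r2, lm, lm > chi2_critical)
```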
19- 5) White's Test: a variation of Breusch-Pagan, but using ALL THE Xs
- Estimate the OLS regression and obtain the residuals
- Use the squared residuals as the dependent variable in a secondary equation that includes EVERY ONE of the explanatory variables, their squares, and all their pairwise cross products (i.e. x1x2, x1x3 and x2x3 but not x1x2x3)
- $\hat{e}_t^2 = b_0 + b_1 X_t + b_2 Z_t + b_3 X_t^2 + b_4 Z_t^2 + b_5 X_t Z_t + u_t$
- Test the joint hypothesis that coefficients of ALL the Xs in the second regression are zero. (An F test of significance. Use TEST in SAS. See Ch 8.1 and 8.2)
- Can also test $nR^2 \sim \chi^2_{df}$ where R2 is the R-squared from the auxiliary regression and df = number of regressors (Xs) in the auxiliary regression.
- This test does not assume knowledge of which variables cause the HET. If you have such knowledge, all the previous tests are more powerful (less Type II error). The White test is only asymptotically valid (needs lots of data), and is more commonly used now. SAS can do it automatically:
PROC REG DATA=whatever; MODEL whatever = whatever / SPEC; RUN; QUIT;
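To see which regressors go into White's auxiliary regression, the list can be generated mechanically. This is a sketch with a hypothetical helper name; it only builds the term labels and does not run the regression.

```python
from itertools import combinations


def white_auxiliary_terms(names):
    """Regressors for White's auxiliary regression: levels, squares, pairwise products."""
    levels = list(names)
    squares = [f"{name}^2" for name in names]
    cross = [f"{a}*{b}" for a, b in combinations(names, 2)]
    return levels + squares + cross


terms = white_auxiliary_terms(["x1", "x2", "x3"])
print(terms)
print("df =", len(terms))   # number of regressors in the auxiliary regression
```

With three explanatory variables this gives nine regressors, and the triple product x1x2x3 is never formed.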
20Question: I am running PROC REG with the ACOV and SPEC options to obtain heteroscedasticity consistent ("White-corrected") test statistics. I need to collect these test statistics into a SAS dataset. The variance covariance matrix that is output into the parameters dataset using the OUTEST and COVOUT options does not seem to be White corrected. Any suggestions as to how I might pull out the test statistics? Thanks.
Answer:
proc reg data=file1;
  model y = x / acov spec;
  ods output ParameterEstimates=the_parms ACovEst=the_acov SpecTest=the_spec;
run;
which will yield files named the_parms, the_acov, and the_spec with the tables from those sections of the output.
21How Do We Correct for a Heteroskedastic Error?
- 1) Just redefine the variables (for example use income per capita instead of income). This works sometimes.
- 2) Robust OLS estimation using the White Standard Errors: earlier we saw that in the presence of heteroskedasticity, the correct formula for the variance of b2 is
$Var(b_2) = \frac{\sum (x_t - \bar{x})^2 \sigma_t^2}{\left[\sum (x_t - \bar{x})^2\right]^2}$
- So we just run OLS, and calculate the variance of the betas separately, with the formula above. In this formula, we use the squared residual for each observation as the estimate of its variance; these are called White's estimators of the error variance.
- Remember, OLS parameter estimates are still UNBIASED: E(b) = true β.
- We will not do this by hand though. Fortunately, White asymptotic covariance estimation can be performed with the ACOV option in SAS PROC REG. (Also explore PROC ROBUSTREG)
- PROC REG DATA=thedata; MODEL depvar = indep vars / ACOV; RUN; QUIT;
- HOWEVER, the variances are reported separately in the White var-cov section of SAS output. BEWARE that the t stats that SAS reports in the regular regression output are WRONG. So we have to calculate them manually by dividing estimate/sqrt_of_variance.
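The manual calculation can be sketched for a one-regressor model: estimate the line, plug each squared residual in for that observation's variance, and divide the estimate by the square root of the resulting variance. The data and the function name are made up for illustration.

```python
import math


def white_slope_variance(x, y):
    """OLS slope b2 and White's estimate of Var(b2) for y = b1 + b2*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xt - xbar) ** 2 for xt in x)
    b2 = sum((xt - xbar) * (yt - ybar) for xt, yt in zip(x, y)) / sxx
    b1 = ybar - b2 * xbar
    resid = [yt - b1 - b2 * xt for xt, yt in zip(x, y)]
    # Each squared residual stands in for that observation's error variance.
    var_b2 = sum((xt - xbar) ** 2 * et ** 2 for xt, et in zip(x, resid)) / sxx ** 2
    return b2, var_b2


# Made-up data: y = 2 + 3x with errors that grow with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
errors = [0.1, -0.2, 0.1, 1.0, -2.0, 1.0]
y = [2.0 + 3.0 * xt + et for xt, et in zip(x, errors)]

b2, var_b2 = white_slope_variance(x, y)
t_stat = b2 / math.sqrt(var_b2)   # estimate / sqrt_of_variance
print(b2, var_b2, t_stat)
```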
22How Do We Correct for a Heteroskedastic Error?
- It is a pain to have to calculate the t statistics manually from the regression output. There is a way to make SAS do this for us too:
- PROC REG DATA=thedata;
- MODEL depvar = firstX secondX thirdX / ACOV;
- TEST firstX = 0;
- TEST secondX = 0;
- TEST thirdX = 0;
- RUN; QUIT;
- This will provide us with heteroskedasticity-consistent test statistics and p-values for each of the regressors, so we do not have to calculate them manually.
- It is important to say that this only works for large samples (LOTS OF DATA!!!).
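23- 3) Generalized Least Squares (GLS)
- Idea: Transform the model with a heteroskedastic error into a model with a homoskedastic error. Then apply the method of least squares. This requires us to assume a specification for the error variance. As earlier, we will assume that the variance increases with xt:
$y_t = \beta_1 + \beta_2 x_t + e_t$
where
$Var(e_t) = \sigma_t^2 = \sigma^2 x_t$
Transform the model by dividing every piece of it by $\sqrt{x_t}$, which is proportional to the standard deviation of the error:
$\frac{y_t}{\sqrt{x_t}} = \beta_1 \frac{1}{\sqrt{x_t}} + \beta_2 \frac{x_t}{\sqrt{x_t}} + \frac{e_t}{\sqrt{x_t}}$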
24This new model has an error term that is the
original error term divided by the square root of
xt. Its variance is constant.
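The constant variance can be checked in one line, assuming (as the GLS slide does) that the error variance is proportional to xt, i.e. $Var(e_t) = \sigma^2 x_t$. The symbol $e_t^*$ for the transformed error is introduced here only for convenience.

```latex
Var(e_t^*) = Var\!\left(\frac{e_t}{\sqrt{x_t}}\right)
           = \frac{1}{x_t}\, Var(e_t)
           = \frac{1}{x_t}\, \sigma^2 x_t
           = \sigma^2
```

The $x_t$ cancels, so the transformed error has the same variance for every observation.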
This method is called Weighted Least
Squares. It is more efficient than simply
applying Least Squares to the model. Least
Squares gives equal weight to all observations.
Weighted Least Squares gives each observation a
weight that is inversely related to its value of
the square root of xt. Therefore, large values
of xt, which we have assumed have a large variance,
will get less weight than smaller values of xt
when estimating the intercept and slope of the
regression line.
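25We need to estimate this model
$\frac{y_t}{\sqrt{x_t}} = \beta_1 \frac{1}{\sqrt{x_t}} + \beta_2 \frac{x_t}{\sqrt{x_t}} + \frac{e_t}{\sqrt{x_t}}$
This requires us to construct 3 new variables
$y_t^* = \frac{y_t}{\sqrt{x_t}}, \quad x_{t1}^* = \frac{1}{\sqrt{x_t}}, \quad x_{t2}^* = \frac{x_t}{\sqrt{x_t}}$
and to estimate the model
$y_t^* = \beta_1 x_{t1}^* + \beta_2 x_{t2}^* + e_t^*$, where $e_t^* = e_t / \sqrt{x_t}$
Notice that it does NOT have an INTERCEPT, so use the /NOINT option in SAS.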
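It is possible to do this in SAS automatically with a WEIGHT statement. The weights are applied to the original (untransformed) model, so the intercept is kept in this version:
PROC REG DATA=thedata; MODEL depvar = indep vars; WEIGHT variabletoweightby; RUN; QUIT; /* or PROC MODEL or PROC GLM */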
26SAS code to perform Weighted Least Squares. NOTE: look at 11.24 for another way to do it
data whatever;
  set whatever;
  ystar = y/sqrt(x);
  x1star = 1/sqrt(x);
  x2star = x/sqrt(x);
  output;
run;
proc reg data=whatever;
  foodgls: model ystar = x1star x2star / noint;  /* noint to run the model without an intercept */
run;
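The same transformation can be sketched outside SAS: build ystar, x1star and x2star, then solve the two normal equations of the no-intercept regression. The function name and data are hypothetical, and the variance is assumed proportional to x as in the SAS code.

```python
import math


def wls_by_transformation(x, y):
    """Divide y, 1 and x by sqrt(x), then run least squares without an intercept."""
    ystar = [yt / math.sqrt(xt) for xt, yt in zip(x, y)]
    x1star = [1.0 / math.sqrt(xt) for xt in x]
    x2star = [xt / math.sqrt(xt) for xt in x]
    # Normal equations for ystar = b1*x1star + b2*x2star (no intercept).
    s11 = sum(a * a for a in x1star)
    s22 = sum(b * b for b in x2star)
    s12 = sum(a * b for a, b in zip(x1star, x2star))
    s1y = sum(a * c for a, c in zip(x1star, ystar))
    s2y = sum(b * c for b, c in zip(x2star, ystar))
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Made-up data lying exactly on y = 2 + 3x, so the estimates should recover 2 and 3.
x = [1.0, 4.0, 9.0, 16.0]
y = [2.0 + 3.0 * xt for xt in x]
b1, b2 = wls_by_transformation(x, y)
print(b1, b2)
```

Note that b1, the coefficient on x1star, is still the intercept of the original model even though the transformed regression has no constant term.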
28SAS code to test for Heterosk. another way
/* We have the variables dep_var, inc and height from our dataset whatever */
/* The following code just runs the tests */
proc model data=whatever;
  parms a1 b1 b2;  /* declares parameters of a model. Each parameter has a single value associated with it which is the same for all observations */
  dep_var = a1 + b1*inc + b2*height;
  fit  /* fit estimates the model */
    / white pagan=(inc height)  /* we do White's test, and Pagan's test on the vars inc and height */
      out=resid1 outresid;  /* output residuals to an outside file */
run;
/* white and pagan may also work with PROC REG. We may not have to necessarily use proc model */
29SAS code to perform weighted least squares another way
proc model data=whatever;
  parms a1 b1 b2;
  inc2_inv = 1/inc2;  /* we create the weights. In this case they are 1/var_squared because we are assuming heterosk. is of the form sigma2_t = constant_sigma2 * X_jt^2 */
  exp = a1 + b1*inc + b2*inc2;
  fit exp  /* fit exp tells SAS to estimate just the dependent variable exp. We could omit the exp and just write fit because there is only one equation being fitted in this model */
    /* NOTE the model above DOES have an intercept. This is because we assumed the form of the heterosk. to be such that the sigma depends on the SQUARE of the X. See also 11.26 */
    / white pagan=(1 inc inc2);
  weight inc2_inv;  /* tells SAS to weight each observation in the model by the variable inc2_inv */
run;
/* NOTE II: WEIGHT vartoweightby works also with proc reg and proc autoreg commands, right before the run statement */
30- If instead, the proportional heteroskedasticity is suspected to be of the form
- We could proceed by forming the variable
- and then proceeding EXACTLY as we did before, with the model
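The formulas on this slide did not survive in the transcript. Purely as an illustration, and as an assumption rather than the slide's own content, suppose the variance were proportional to the square of xt (the form used in the PROC MODEL weighting code earlier). The same recipe would then divide the model by xt instead of its square root:

```latex
% Hypothetical form, assumed for illustration only:
Var(e_t) = \sigma^2 x_t^2

% Dividing the model by x_t:
\frac{y_t}{x_t} = \beta_1 \frac{1}{x_t} + \beta_2 + \frac{e_t}{x_t},
\qquad
Var\!\left(\frac{e_t}{x_t}\right) = \frac{1}{x_t^2}\, \sigma^2 x_t^2 = \sigma^2
```

Under this assumed form the transformed model keeps a constant term, which is now the slope β2 of the original model.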
31- Testing for Heteroscedasticity: White test in SAS
- The regression model is specified as $y_i = \mathbf{x}_i \beta + \epsilon_i$, where the $\epsilon_i$'s are identically and independently distributed: $E(\epsilon) = 0$ and $E(\epsilon \epsilon') = \sigma^2 I$. If the $\epsilon_i$'s are not independent or their variances are not constant, the parameter estimates are unbiased, but the estimate of the covariance matrix is inconsistent. In the case of heteroscedasticity, the ACOV option provides a consistent estimate of the covariance matrix. If the regression data are from a simple random sample, the ACOV option produces the covariance matrix. This matrix is
- $(X'X)^{-1} (X' \, \mathrm{diag}(e_i^2) \, X) (X'X)^{-1}$
- where
- $e_i = y_i - \mathbf{x}_i b$
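For a model with an intercept and one regressor, the sandwich formula can be written out by hand. This is a sketch of the formula only, not of SAS's implementation; the function name and the residual values are hypothetical (the residuals are chosen to sum to zero and to be uncorrelated with x, as OLS residuals must be).

```python
def acov_simple_regression(x, residuals):
    """(X'X)^-1 (X' diag(e_i^2) X) (X'X)^-1 for X = [1, x], as a 2x2 nested list."""
    n = len(x)
    sx = sum(x)
    sxx = sum(xi * xi for xi in x)
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]   # (X'X)^-1
    e2 = [ei * ei for ei in residuals]
    m00 = sum(e2)
    m01 = sum(xi * wi for xi, wi in zip(x, e2))
    m11 = sum(xi * xi * wi for xi, wi in zip(x, e2))
    meat = [[m00, m01], [m01, m11]]                        # X' diag(e_i^2) X

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

    return matmul(matmul(inv, meat), inv)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
residuals = [0.1, -0.2, 0.1, 1.0, -2.0, 1.0]   # hypothetical OLS residuals

acov = acov_simple_regression(x, residuals)
print(acov[1][1])   # robust variance of the slope estimate
```

The lower-right element equals Σ(x_i - x̄)² e_i² / [Σ(x_i - x̄)²]², the single-regressor form of White's variance for the slope.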
- The SPEC option performs a model specification
test. The null hypothesis for this test maintains
that the errors are homoscedastic, independent of
the regressors and that several technical
assumptions about the model specification are
valid. For details, see theorem 2 and assumptions
1-7 of White (1980). When the model is correctly
specified and the errors are independent of the
regressors, the rejection of this null hypothesis
is evidence of heteroscedasticity. In
implementing this test, an estimator of the
average covariance matrix (White 1980, p. 822) is
constructed and inverted. The nonsingularity of
this matrix is one of the assumptions in the null
hypothesis about the model specification. When
PROC REG determines this matrix to be numerically
singular, a generalized inverse is used and a
note to this effect is written to the log. In
such cases, care should be taken in interpreting
the results of this test.
- When you specify the SPEC option, tests listed in
the TEST statement are performed with both the
usual covariance matrix and the
heteroscedasticity consistent covariance matrix.
Tests performed with the consistent covariance
matrix are asymptotic. For more information,
refer to White (1980).
- Both the ACOV and SPEC options can be specified
in a MODEL or PRINT statement.