Title: 2. Linear dependent variables
12. Linear dependent variables
- 2.1 The basic idea underlying linear regression
- 2.2 Single variable OLS
- 2.3 Correctly interpreting the coefficients
- 2.4 Examining the residuals
- 2.5 Multiple regression
- 2.6 Heteroskedasticity
- 2.7 Correlated errors
- 2.8 Multicollinearity
- 2.9 Outlying observations
- 2.10 Median regression
- 2.11 Looping
22.1 The basic idea underlying linear regression
- A simple linear regression aims to characterize
the relation between a dependent variable and one
independent variable using a straight line - You have already seen how to fit a line between
two variables using the scatter command - Linear regression does the same thing but it can
be extended to include multiple independent
variables
32.1 The basic idea
- For example, you predict that larger companies usually pay higher fees
- You can formalize the effect of company size on predicted fees using a simple equation: Expected Fees = a0 + a1*Size
- The parameter a0 represents what fees are expected to be in the case that Size = 0.
- The parameter a1 captures the impact of a one-unit increase in Size on expected fees.
42.1 The basic idea
- The parameters a0 and a1 are assumed to be the
same for all observations and they are called
regression coefficients - You may argue that company size is not the only
variable that affects audit fees. For example,
the complexity of the audit engagement, or the
size of the audit firm may also matter. - If you do not know all the factors that influence
fees, the predicted fee that you calculate from
the above equation will differ from the
actual fee.
52.1 The basic idea
- The deviation between the predicted fee and the
actual fee is called the residual. In general,
you might represent the relation between actual
fees and predicted fees in the following way: Actual Fees = Predicted Fees + e
- where e represents the residual term (i.e., the difference between actual and predicted fees)
62.1 The basic idea
- Putting the two together we can express actual fees using the following equation: Actual Fees = a0 + a1*Size + e
- The goal of regression analysis is to estimate the parameters a0 and a1
72.1 The basic idea
- One of the simplest techniques to estimate the
coefficients is known as ordinary least squares
(OLS). - The objective of OLS is to make the difference
between the predicted and actual values as small
as possible - In other words, the goal is to minimize the
magnitude of the residuals
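- In symbols (a standard way of stating the OLS objective, not reproduced from the original slide), the coefficients a0 and a1 are chosen to minimize the sum of squared residuals across all observations i: minimize sum of (yi - a0 - a1*xi)^2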
82.1 The basic idea
- Go to http://ihome.ust.hk/accl/Phd_teaching.htm
- Download ols.dta to your hard drive and open in STATA (use "C:\phd\ols.dta", clear)
- Examine the graphical relation between the two variables: twoway (scatter y x) (lfit y x)
92.1 The basic idea
- This line is fitted by minimizing the sum of the squared differences between the observed and predicted values of y (known as the residual sum of squares, RSS)
- The main assumptions required to obtain these
coefficients are that - The relation between y and x is linear
- The x variable is uncorrelated with the residuals
(i.e., x is exogenous) - The residuals have a mean value of zero
102.1 The basic idea
12Class exercise 2a
- Using the formulas and the data currently in
STATA, calculate the parameters a1 and a0
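- To check your answer, the standard OLS formulas are a1 = sum of (xi - xbar)(yi - ybar) divided by sum of (xi - xbar)^2, and a0 = ybar - a1*xbar. One possible Stata sketch (it uses egen's mean() and total() functions; the mx, my, sxy and sxx names are just illustrative):
- egen mx=mean(x)
- egen my=mean(y)
- gen dxy=(x-mx)*(y-my)
- gen dxx=(x-mx)^2
- egen sxy=total(dxy)
- egen sxx=total(dxx)
- gen a1=sxy/sxx
- gen a0=my-a1*mx
- list a1 a0 in 1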
132.2 Single variable OLS (regress)
- Instead of using these formulas to calculate the
regression coefficients, we can instead use the
regress command - regress y x
- The first variable (y) is the dependent variable
while the second (x) is the independent variable
142.2 Single variable OLS (regress)
- This gives the following output
152.2 Single variable OLS (regress)
- The coefficient estimates are 3.000 for the a0
parameter and 0.500 for the a1 parameter
- We can use these to predict the values of Y for any given value of X. For example, when X = 5 we predict that Y will be
- display 3.000091+0.500090*5
- which gives a predicted value of approximately 5.50
162.2 Single variable OLS (_b)
- Alternatively, we do not need to type the coefficient estimates because STATA will remember them for us. They are stored by STATA using the name _b[varname] where varname is replaced with the name of the independent variable or the constant (_cons)
- display _b[_cons]+_b[x]*5
172.2 Single variable OLS
- Note that the predicted value of y when x equals 5 differs from the actual value
- list y if x==5
- The actual value is 5.68 compared to the
predicted value of 5.50. The difference for this
observation is the residual error that arises
because x is not a perfect predictor of y.
182.2 Single variable OLS
- If we want to compute the predicted value of y for each value of x in our dataset, we can use the saved coefficients
- gen y_hat=_b[_cons]+_b[x]*x
- The estimated residuals are the difference between the observed y values and the predicted y values
- gen y_res=y-y_hat
- list x y_hat y y_res
192.2 Single variable OLS (predict)
- A quicker way to do this would be to use the
predict command after regress - predict yhat
- predict yres, resid
- Checking that this gives the same answer
- list yhat y_hat yres y_res
- You should also note that the values of x, yhat
and yres correspond with those found on the
scatter graph - sort x
- list x y y_hat y_res
202.2 Single variable OLS
212.2 Single variable OLS
- Note that by construction, there is zero
correlation between the x variable and the
residuals - twoway (scatter y_res x) (lfit y_res x)
222.2 Single variable OLS
- Standard errors
- Typically our data comprises a sample that is
taken from a larger population - The coefficients are only estimates of the true
a0 and a1 values that describe the entire
population - If we obtained a second random sample from the
same population, we would obtain different
coefficient estimates for a0 and a1
232.2 Single variable OLS
- We therefore need a way to describe the
variability that would obtain if we were to apply
OLS to many different samples - Equivalently, we want a measure of how
precisely our coefficients are estimated - The solution is to calculate standard errors,
which are simply the sample standard deviations
associated with the estimated coefficients - Standard errors (SEs) allow us to perform
statistical tests, e.g., is our estimate of a1
significantly greater than zero?
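- As a quick illustration (this sketch is not on the original slide; it uses the y and x variables from ols.dta), the t-statistic and a formal test can be obtained from Stata's stored results after regress:
- reg y x
- display _b[x]/_se[x]
- * this reproduces the t-statistic reported for x
- test x = 0
- * Wald test that the coefficient on x is zero; the reported F statistic equals the squared t-statistic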
242.2 Single variable OLS
- The techniques for estimating standard errors are
based on additional OLS assumptions - Homoscedasticity (i.e., the residuals have a
constant variance) - Non-correlation (i.e., the residuals are not
correlated with each other) - Normality (i.e., the residuals are normally
distributed)
252.2 Single variable OLS
- The t-statistic is obtained by dividing the
coefficient estimate by the standard error
262.2 Single variable OLS
- The p-values are from the t-distribution and they
tell you how likely it is that you would have
observed the estimated coefficient under the
assumption that the true coefficient in the
population is zero.
- The p-value of 0.002 tells you that it is very unlikely (prob = 0.2%) that the true coefficient on x is zero.
- The confidence intervals mean you can be 95% confident that the true coefficient of x lies between 0.233 and 0.767.
272.2 Single variable OLS
- To explain this we need some notation
- The total sum of squares (TSS) captures the variation in y around its mean
- The residual sum of squares (RSS) captures the variation that is not explained by x
- The explained sum of squares (ESS) captures the variation that is explained by x
282.2 Single variable OLS
- The total sum of squares (TSS) = 41.27
- The explained sum of squares (ESS) = 27.51
- The residual sum of squares (RSS) = 13.76
- Note that TSS = ESS + RSS.
292.2 Single variable OLS
- The column labeled df contains the number of degrees of freedom
- For the ESS, df = k - 1 where k = number of regression coefficients (df = 2 - 1 = 1)
- For the RSS, df = n - k where n = number of observations (= 11 - 2 = 9)
- For the TSS, df = n - 1 (= 11 - 1 = 10)
- The last column (MS) reports the ESS, RSS and TSS divided by their respective degrees of freedom
302.2 Single variable OLS
- The first number simply tells us how many
observations are used to estimate the model - The other statistics here tell you how well the
model explains the variation in Y
312.2 Single variable OLS
- The R-squared = ESS / TSS (= 27.51 / 41.27 = 0.666)
- So x explains 66.6% of the variation in y.
- Unfortunately, many researchers in accounting (and other fields) evaluate the quality of a model by looking only at the R-squared.
- This is not only invalid, it is also very dangerous (I will explain why later)
322.2 Single variable OLS
- One problem with the R-squared is that it will never decrease (and will usually increase) when an independent variable is added, even one that has very little explanatory power.
- Adding another variable is not always a good idea as you lose one degree of freedom for each additional coefficient that needs to be estimated. Adding insignificant variables can be especially inefficient if you are working with a small sample size.
- The adjusted R-squared corrects for this by accounting for the number of model parameters, k, that need to be estimated
- Adj R-squared = 1-(1-R2)(n-1)/(n-k) = 1-(1-.666)(10)/9 = 0.629
- In fact the adjusted R-squared can even take on negative values. For example, suppose that y and x are uncorrelated, in which case the unadjusted R-squared is zero
- Adj R-squared = 1-(n-1)/(n-2) = (n-2-n+1)/(n-2) = -1/(n-2)
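- You can verify the adjusted R-squared calculation using Stata's stored results after regress (a sketch; e(r2) and e(N) are saved by regress, and k = 2 here because the model has a constant plus one slope):
- reg y x
- display 1-(1-e(r2))*(e(N)-1)/(e(N)-2)
- display e(r2_a)
- * the two displayed numbers should match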
332.2 Single variable OLS
- You might think that another way to measure the
fit of the model is to add up the residuals.
However, by definition the residuals will sum to
zero. - An alternative is to square the residuals, add
them up (giving the RSS) and then take the square
root.
- Root MSE = square root of (RSS/(n-k))
- = (13.76/(11-2))^0.5 = 1.236
- One way to interpret the root MSE is that it shows how far away, on average, the model's predictions are from the actual values of y
- The F-statistic = (ESS/(k-1))/(RSS/(n-k))
- = (27.51/1)/(13.76/9) = 17.99
- the F statistic is used to test whether the R-squared is significantly greater than zero (i.e., are the independent variables jointly significant?)
- Prob > F gives the probability of observing an F statistic this large if the true R-squared in the population is actually equal to zero
- This F test is used to test the overall statistical significance of the regression model
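- These sums of squares are also stored by Stata after regress, so the table can be reproduced by hand (a sketch; e(mss) is the model/explained sum of squares, e(rss) the residual sum of squares, and e(df_m) and e(df_r) their degrees of freedom):
- reg y x
- display (e(mss)/e(df_m))/(e(rss)/e(df_r))
- * this reproduces the F statistic
- display e(mss)/(e(mss)+e(rss))
- * this reproduces the R-squared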
34Class exercise 2b
- Open your Fees.dta file and run the following two
regressions - audit fees on total assets
- the log of audit fees on the log of total assets
- What does the output of your regression mean?
- Which model appears to have the better fit? (A possible starting point is sketched below.)
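- One possible starting sketch (it assumes Fees.dta is saved in C:\phd and uses the auditfees and totalassets variable names from the earlier slides):
- use "C:\phd\Fees.dta", clear
- reg auditfees totalassets
- gen lnaf=ln(auditfees)
- gen lnta=ln(totalassets)
- reg lnaf lnta
- * compare the residual plots and the distribution of the residuals across the two models, not just the R-squared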
352.3 Correctly interpreting the coefficients
- So far we have considered the case where the
independent variable is continuous. - Interpretation of results is even more
straightforward when the independent variable is
a dummy.
- reg auditfees big6
- ttest auditfees, by(big6)
362.3 Correctly interpreting the coefficients
- Suppose we wish to test whether the Big 6 fee
premium is significantly different between listed
and non-listed companies
372.3 Correctly interpreting the coefficients
- gen listed=0
- replace listed=1 if companytype==2 | companytype==3 | companytype==5
- reg auditfees big6 if listed==0
- ttest auditfees if listed==0, by(big6)
- reg auditfees big6 if listed==1
- ttest auditfees if listed==1, by(big6)
- gen listed_big6=listed*big6
- reg auditfees big6 listed listed_big6
382.3 Correctly interpreting the coefficients
- Some studies report the economic significance
of the estimated coefficients as well as the
statistical significance - Economic significance refers to the magnitude of
the impact of x on y - There is no single way to evaluate economic
significance but many studies describe the
change in the predicted value of y as x increases
from the 25th percentile to the 75th (or as x
changes by one standard deviation around its mean)
392.3 Correctly interpreting the coefficients
- For example, we can calculate the expected change
in audit fees as company size increases from the
25th to 75th percentiles
- reg auditfees totalassets
- sum totalassets if auditfees<., detail
- gen fees_low=_b[_cons]+_b[totalassets]*r(p25)
- gen fees_high=_b[_cons]+_b[totalassets]*r(p75)
- sum fees_low fees_high
40Class exercise 2c
- Estimate the audit fee model in logs rather than
in absolute values - Calculate the expected change in audit fees as
company size increases from the 25th to 75th
percentiles - Compare your results for economic significance to
those we obtained when the fee model was
estimated using the absolute values of fees and
assets.
- Hint: you will need to take the exponential of the predicted log of fees in order to make this comparison (one possible sketch follows below).
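- A sketch of one way to follow the hint (it assumes lnaf and lnta have already been generated as the logs of audit fees and total assets; the lnfee_p25 and fee_p25 names are just illustrative):
- reg lnaf lnta
- sum lnta if lnaf<., detail
- gen lnfee_p25=_b[_cons]+_b[lnta]*r(p25)
- gen lnfee_p75=_b[_cons]+_b[lnta]*r(p75)
- gen fee_p25=exp(lnfee_p25)
- gen fee_p75=exp(lnfee_p75)
- sum fee_p25 fee_p75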
412.3 Correctly interpreting the coefficients
- When evaluating the economic significance of a
dummy variable coefficient, we usually do so
using the values zero and one rather than
percentiles - For example
- reg lnaf big6
- gen fees_nb6=exp(_b[_cons])
- gen fees_b6=exp(_b[_cons]+_b[big6])
- sum fees_nb6 fees_b6
422.3 Correctly interpreting the coefficients
- Suppose we believe that the impact of a Big 6 audit on fees depends upon the size of the company; we can capture this by adding an interaction term, big6*lnta, to the fee model (as in the regression on the next slide)
- Usually, we would quantify this impact using a range of values for lnta (e.g., as lnta increases from the 25th to the 75th percentile)
432.3 Correctly interpreting the coefficients
- For example
- gen big6_lnta=big6*lnta
- reg lnaf big6 lnta big6_lnta
- sum lnta if lnaf<. & big6<., detail
- gen big6_low=_b[big6]+_b[big6_lnta]*r(p25)
- gen big6_high=_b[big6]+_b[big6_lnta]*r(p75)
- gen big6_mean=_b[big6]+_b[big6_lnta]*r(mean)
- sum big6_low big6_high big6_mean
44- It is amazing how many studies give a misleading
interpretation of the coefficients when using
interaction terms. For example, Blackwell et al.
46- Class questions
- Theoretically, how should auditing affect the
interest rate that the company has to pay? - Empirically, how do we measure the impact of
auditing on the interest rate using eq. (1)?
48- Class question: At what values of total assets (000) is the effect of the Audit Dummy on the interest rate
- negative, zero, or positive?
50- Class questions
- What is the mean value of total assets within
their sample? - How does auditing affect the interest rate for
the average company in their sample?
52- Verify that the above claim is true.
- Suppose Blackwell et al. had reported the impact
for a firm with 11m in assets and another firm
with 15m in assets. - How would this have changed the conclusions
drawn? - Do you think the paper would have been published
if the authors had made this comparison?
542.4 Examining the residuals
- Go to http://ihome.ust.hk/accl/Phd_teaching.htm
- Download anscombe.dta to your hard drive (use "C:\phd\anscombe.dta", clear)
- Run the following regressions
- reg y1 x1
- reg y2 x2
- reg y3 x3
- reg y4 x4
- Note that the output from these regressions is virtually identical
- intercept = 3.0 (t-stat = 2.67)
- x coefficient = 0.5 (t-stat = 4.24)
- R-squared = 66%
55Class exercise 2d
- If you did not know about regression assumptions
or regression diagnostics you would probably stop
your analysis at this point, concluding that you
have a good fit for all four models. - In fact, only one of these four models is well
specified. - Draw scatter graphs for each of these four
associations (e.g., twoway (scatter y1 x1) (lfit
y1 x1)).
- Of the four models, which do you think is the well-specified one?
- Draw scatter graphs for the residuals against the x variable for each of the four regressions. Is there a pattern?
- Which of the OLS assumptions are violated in these four regressions?
562.4 Examining the residuals
- Unfortunately, it is common among researchers to
judge whether a model is well-specified solely
in terms of its explanatory power (i.e., the
R-squared). - Many researchers fail to report other types of
diagnostic tests - is there significant heteroscedasticity?
- is there any pattern to the residuals?
- are there any problems of outliers?
572.4 Examining the residuals
- For example, many audit fee studies claim that
their models are well-specified because they have
high R2 - Carson et al. (2003)
582.4 Examining the residuals
- Gu (2007) points out that
- econometricians consider R2 values to be
relatively unimportant (accounting researchers
put far too much emphasis on the magnitude of the
R2) - regression R2s should not be compared across
different samples - in contrast there is a large accounting
literature that uses R2s to determine whether the
value relevance of accounting information has
changed over time
59- It is easy to show that the same economic model
can yield very different R2 depending on how the
variables are transformed
- Using either eq. (1) or (2), we will obtain
exactly the same coefficient estimates because
the economic model is the same - If eq. (1) is well-specified, so also is eq. (2)
- If eq. (1) is mis-specified, so also is eq. (2)
- However, the R2 of eq. (1) will be very different
from the R2 of eq. (2)
60- Example
- use "C\phd\Fees.dta", clear
- gen lnafln(auditfees)
- gen lntaln(totalassets)
- sort companyid yearend
- by companyid gen lnaf_laglnaf_n-1
- egen missrmiss(lnaf lnta lnaf_lag)
- gen chlnaflnaf-lnaf_lag
- reg lnaf lnta lnaf_lag if miss0
- reg chlnaf lnta lnaf_lag if miss0
- The lnta coefficients are exactly the same in the
two models. - The lnaf_lag coefficient in eq. (2) equals the
lnaf_lag coefficient in eq. (1) minus one. - The R2 is much higher in eq. (1) than eq. (2).
- The high R2 in eq. (1) does not imply that the
model is well-specified. - The low R2 in eq. (2) does not imply that the
model is mis-specified. - Either both equations are well-specified or they
are both mis-specified. - The R2 tells us nothing about whether our
hypothesis about the determinants of Y is
correct.
612.4 Examining the residuals
- Instead of relying only on the R2, an examination
of the residuals can help us to identify whether
the model is well specified. For example, compare
the audit fee model which is not logged - reg auditfees totalassets
- predict res1, resid
- twoway (scatter res1 totalassets, msize(tiny))
(lfit res1 totalassets) - With the logged audit fee model
- reg lnaf lnta
- predict res2, resid
- twoway (scatter res2 lnta, msize(tiny)) (lfit
res2 lnta)
- Notice that the residuals are more spherical (i.e., they display less of an obvious pattern) in the logged model.
622.4 Examining the residuals
- In order for the standard t-statistics and p-values to be valid (especially in small samples), we also assume that the residuals are normally distributed
- We can test this using a histogram of the residuals
- hist res1
- this does not give us what we need because there are severe outliers
- sum res1, detail
- hist res1 if res1>-22 & res1<208, normal xlabel(-25(25)210)
- hist res2
- sum res2, detail
- hist res2 if res2>-2 & res2<1.8, normal xlabel(-2(0.5)2)
- The residuals are much closer to the assumed normal distribution when the variables are measured in logs
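- If you want a formal test to supplement the histograms, Stata's skewness/kurtosis test for normality can be applied to the residuals (a sketch; note that with very large samples even minor departures from normality will be rejected):
- sktest res1 res2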
64Class exercise 2e
- Following Pong and Whittington (1994), estimate the raw value of audit fees as a function of raw assets and assets squared
- Examine the residuals
- Do you think this model is better specified than
the one in logs?
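- A possible sketch of this exercise (the ta2 and res_pw variable names are just illustrative):
- gen ta2=totalassets^2
- reg auditfees totalassets ta2
- predict res_pw, resid
- twoway (scatter res_pw totalassets, msize(tiny)) (lfit res_pw totalassets)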
652.5 Multiple regression
- Researchers use multiple regression when they
believe that Y is affected by multiple
independent variables
- Y = a0 + a1 X1 + a2 X2 + e
- Why is it important to control for multiple
factors that influence Y?
662.5 Multiple regression
- Suppose the true model is
- Y = a0 + a1 X1 + a2 X2 + e
- where X1 and X2 are uncorrelated with the error, e
- Suppose the OLS model that we estimate is
- Y = a0 + a1 X1 + u
- where u = a2 X2 + e
- OLS imposes the assumption that X1 is uncorrelated with the residual term, u.
- Since X1 is uncorrelated with e, the assumption that X1 is uncorrelated with u is equivalent to assuming that X1 is uncorrelated with X2.
672.5 Multiple regression
- If X1 is correlated with X2 the OLS estimate of
a1 is biased. - The magnitude of the bias depends upon the
strength of the correlation between X1 and X2. - Of course, we often do not know whether the model
we estimate is the true model - In other words, we are unsure whether there is an
omitted variable (X2) that affects Y and that is
correlated with our variable of interest (X1)
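- A small simulation (purely illustrative, not part of the original example; it uses rnormal(), available in Stata 10 and later) makes the bias easy to see: X2 affects Y and is correlated with X1, so omitting X2 biases the coefficient on X1
- clear
- set obs 1000
- set seed 123
- gen x2=rnormal()
- gen x1=0.5*x2+rnormal()
- gen y=1+2*x1+3*x2+rnormal()
- reg y x1 x2
- * the coefficient on x1 is close to its true value of 2
- reg y x1
- * omitting x2 pushes the x1 coefficient well above 2 because x1 and x2 are positively correlated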
682.5 Multiple regression
- We can judge whether or not there is likely to be
a correlated omitted variable problem using - theory
- prior empirical studies
- our understanding of the data generation process
692.5 Multiple regression
- Theory
- Does theory suggest that X2 affects Y?
- Unfortunately, theory often fails to give a clear
guide to empirical researchers as to which
variables need to be controlled for
702.5 Multiple regression
- The data generation process (DGP)
- Many researchers go wrong simply because they
fail to understand the underlying process that
generates the data (e.g., they fail to understand
the institutional details).
- Let me give you an example from my research on the reports issued by the Public Company Accounting Oversight Board (PCAOB)
- The PCAOB has been issuing reports about weaknesses that it has found in audit firms' work
71- The dependent variable equals the number of
weaknesses disclosed in the PCAOB's report about
the audit firm - Ln(CLIENTS) is a continuous measure of audit
firm size (the log of the number of companies
audited by the audit firm) - BIG is a dummy variable capturing audit firm size
- The audit firm size coefficients are positive and
highly significant - The PCAOB has been reporting more weaknesses at
the larger audit firms
722.5 Multiple regression
- A working paper has also reported this positive
relation and concluded that the larger audit
firms have been offering lower quality audits - this conclusion contradicts evidence from many
other studies that find larger audit firms
provide higher quality audits - The researchers made this mistake because they
failed to understand the data generation
process for the weaknesses disclosed in PCAOB
reports.
732.5 Multiple regression
- To understand the data generation process, it is often important to understand how the data originate: http://www.pcaobus.org/Inspections/Public_Reports/index.aspx
77- In Col. (1) there is a severe omitted variable problem because
- the PCAOB's sample size is not included as a control variable.
- the PCAOB's sample size is highly correlated with audit firm size.
- In Col. (3), we see there is no significant relation between audit firm size and the number of reported weaknesses, after we control for the size of the PCAOB's sample.
- An understanding of the data generation process is vital if we are to avoid drawing invalid conclusions.
- A PCAOB report discloses all serious weaknesses found in the inspector's sample.
- There is a biased association between audit firm size and the number of reported weaknesses if the size of the PCAOB's sample is not controlled for.
782.5 Multiple regression
- What does it mean to control for the effect of
a variable? - In a multiple regression, the coefficient a1
captures the effect of a one-unit increase in X1
on Y, after controlling for (i.e., holding
constant) X2
- Y = a0 + a1 X1 + a2 X2 + e
- I will now explain this concept in more detail
using an empirical example
792.5 Multiple regression
- We are going to look at the effect of non-audit fees on audit fees after controlling for the effect of company size
- lnaf = a0 + a1 lnta + a2 lnnaf + e
- gen lnnaf=ln(1+nonauditfees)
- capture drop miss
- egen miss=rmiss(lnaf lnta lnnaf)
- First I estimate the following model and calculate the residuals
- lnaf = a0 + a1 lnta + res1
- reg lnaf lnta if miss==0
- predict res1 if miss==0, resid
- note that res1 reflects the part of lnaf that has nothing to do with lnta (res1 = a2 lnnaf + e)
- Next I estimate the following model and calculate the residuals
- lnnaf = b0 + b1 lnta + res2
- reg lnnaf lnta if miss==0
- predict res2 if miss==0, resid
- note that res2 reflects the part of lnnaf that has nothing to do with lnta (by construction res2 is uncorrelated with lnta)
802.5 Multiple regression
- Finally I estimate the following two models
- lnaf = a0 + a1 lnta + a2 lnnaf + e (1)
- res1 = a2 res2 + e (2)
- reg lnaf lnta lnnaf if miss==0
- reg res1 res2 if miss==0
- Note that the coefficient and t-statistic on res2 in eq. (2) are exactly the same as the coefficient and t-statistic for lnnaf in eq. (1)
- What does all this mean?
- The coefficient a2 in eq. (1) captures the impact of lnnaf on lnaf after controlling for the fact that
- (1) lnta affects lnaf (a1 > 0 in eq. (1)), and
- (2) there is a significant positive correlation between lnnaf and lnta (b1 > 0)
812.5 Multiple regression
- Note that if there had been zero correlation between lnnaf and lnta (b1 = 0), the coefficient a1 would be the same in both the simple and multiple regression models
- lnaf = a0 + a1 lnta + a2 lnnaf + e
- lnaf = a0 + a1 lnta + res1
- The reason is that res1 would be uncorrelated with lnta if there is zero correlation between lnnaf and lnta (b1 = 0).
- In other words the coefficient a1 is estimated with bias only if res1 is significantly correlated with lnta.
822.5 Multiple regression
- This reinforces the intuition that it is only necessary to control for those variables that
- affect Y, AND
- are correlated with the independent variable whose coefficient we want to estimate
- For example, if we want to estimate the effect of lnnaf on lnaf, we must control for lnta because
- lnta affects lnaf, and
- lnta is correlated with lnnaf
- Note that both of these conditions are necessary for there to be an omitted variable problem.
832.5 Multiple regression
- Previously, when we were using simple regression with one independent variable, we checked whether there was a pattern between the residuals and the independent variable
- lnaf = a0 + a1 lnta + res1
- twoway (scatter res1 lnta) (lfit res1 lnta)
- When we are using multiple regression, we want to test whether there is a pattern between the residuals and the right hand side of the equation as a whole
- The right hand side of the equation as a whole is the same thing as the predicted value of the dependent variable
842.5 Multiple regression
- So we should examine whether there is a pattern
between the residuals and the predicted values of
the dependent variable
- For example, let's estimate a model where audit fees depend on non-audit fees, company size, audit firm size, and whether the company is listed on a stock market
- gen listed=0
- replace listed=1 if companytype==2 | companytype==3 | companytype==5
- reg lnaf lnnaf lnta big6 listed
- predict lnaf_hat
- predict lnaf_res, resid
- twoway (scatter lnaf_res lnaf_hat) (lfit lnaf_res
lnaf_hat)
852.5 Multiple regression (rvfplot)
- In fact, those nice guys at STATA have given us a
command which enables us to short-cut having to
use the predict command for calculating the
residuals and the fitted values - reg lnaf lnnaf lnta big6 listed
- rvfplot
- rvf stands for residuals versus fitted
862.6 Heteroscedasticity (hettest)
- The OLS techniques for estimating standard errors
are based on an assumption that the variance of
the errors is the same for all values of the
independent variables (homoscedasticity) - In many cases, the homoscedasticity assumption is
clearly violated. For example - reg auditfees nonauditfees totalassets big6
listed - rvfplot
- the homoscedasticity assumption can be tested
using the hettest command after we do the
regression - reg auditfees nonauditfees totalassets big6
listed - hettest
- Heteroscedasticity does not bias the coefficient
estimates but it does bias the standard errors of
the coefficients
872.6 Heteroscedasticity (robust)
- Heteroscedasticity is often caused by using a
dependent variable that is not symmetric - for example the auditfees variable is highly
skewed due to the fact that it has a lower bound
of zero
- much of the heteroscedasticity can often be
removed by transforming the dependent variable
(e.g., use the log of audit fees instead of the
raw values) - When you find that there is heteroscedasticity,
you need to adjust the standard errors using the
Huber/White/sandwich estimator - In STATA it is easy to do this adjustment using
the robust option - reg auditfees nonauditfees totalassets big6
listed, robust - Compare the adjusted and unadjusted results
- reg auditfees nonauditfees totalassets big6
listed - note that the coefficients are exactly the same
- the t-statistics on the independent variables are
much smaller when the standard errors are
adjusted for heteroscedasticity
88Class exercise 2f
- Estimate the audit fee model in logs rather than absolute values
- Using rvfplot, assess whether the variance of the residuals appears to be non-constant
- Using hettest, provide a formal test for heteroscedasticity
- Compare the coefficients and t-statistics when you estimate the standard errors with and without adjusting for heteroscedasticity (a sketch of these steps follows below).
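- A sketch of the steps (it assumes lnaf, lnnaf, lnta, big6 and listed have been created as on the earlier slides):
- reg lnaf lnnaf lnta big6 listed
- rvfplot
- hettest
- reg lnaf lnnaf lnta big6 listed, robust
- * the coefficients are identical with and without the robust option; only the standard errors and t-statistics change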
892.7 Correlated errors
- The OLS techniques for estimating standard errors
are based on an assumption that the errors are
not correlated - This assumption is typically violated when we use
repeated annual observations on the same
companies - The residuals of a given firm are correlated
across years (time series dependence)
90Time-series dependence
- Time-series dependence is nearly always a problem
when researchers use panel data
- Panel data = data that are pooled for the same companies across time
- In panel data, there are likely to be unobserved
company-specific characteristics that are
relatively constant over time
91- Let's start with a simple regression model where
the errors are assumed to be uncorrelated - We now relax the assumption of independent errors
by assuming that the error term has an unobserved
company-specific component that does not vary
over time and an idiosyncratic component that is
unique to each company-year observation - Similarly, we can assume that the X variable has
a company-specific component that does not vary
over time and an idiosyncratic component
92Time-series dependence
- In this case, the OLS standard errors tend to be
biased downwards and the magnitude of this bias
is increasing in the number of years within the
panel. - To understand the intuition, consider the extreme
case where the residuals and independent
variables are perfectly correlated across time. - In this case, each additional year provides no
additional information and will have no effect on
the true standard error - However, under the incorrect assumption of
time-series independence, it is assumed that each
additional year provides additional observations
and the estimated standard errors will shrink
accordingly and incorrectly - This problem can be avoided by adjusting the
standard errors for the clustering of yearly
observations across a given company
93Time-series dependence
- To understand all this, it is helpful to review
the following example - First, I estimate the model using just one
observation for each company (in the year 1998)
- gen fye=date(yearend, "mdy")
- gen year=year(fye)
- drop if year!=1998
- sort companyid
- drop if companyid==companyid[_n-1]
- reg lnaf lnta big6 listed, robust
94Time-series dependence
- Now I create a dataset in which each observation
is duplicated - Each duplicated observation provides no
additional information and will have no effect on
the true standard error but it will reduce the
estimated standard error (i.e., the estimated
standard error will be biased downwards)
- save "C:\phd\Fees98.dta", replace
- append using "C:\phd\Fees98.dta"
- reg lnaf lnta big6 listed, robust
- Notice that the coefficient estimates in the duplicated dataset are exactly the same as in the dataset that had only one observation per company.
- However, the estimated standard errors are
smaller and the t-statistics are larger in the
duplicated dataset because we are using twice as
many observations.
95Time-series dependence (robust cluster())
- We can obtain correct standard errors in the
duplicate dataset using the robust cluster()
option which adjusts the standard errors for
clustering of observations (here they are
duplicated) for each company
- reg lnaf lnta big6 listed, robust cluster(companyid)
- The t-statistics here are exactly the same as when the model is estimated using just one observation per company.
96Time-series dependence
- In reality the observations of a given company
are not exactly the same from one year to the
next (i.e., they are not exact duplicates). - However, the observations of a given company
often do not change much from one year to the
next.
- For example, a company's size and the fees that
it pays may not change much over time (i.e.,
there is a strong unobserved company-specific
component to the variables). - Failing to account for this in panel data tends
to overstate the magnitude of the t-statistics.
97Time-series dependence
- It is easy to demonstrate that the residuals of a
given company tend to be very highly correlated
over time - First, start again with the original data
- use "C\phd\Fees.dta", clear
- gen fyedate(yearend, "mdy")
- gen yearyear(fye)
- gen lnafln(auditfees)
- gen lntaln(totalassets)
- save "C\phd\Fees1.dta", replace
- Estimate the fee model and obtain the residuals
for each company-year observation - reg lnaf lnta
- predict res, resid
98Time-series dependence
- Reshape the data so that we have each company as
a row and there are separate variables for each
yearly set of residuals
- keep companyid year res
- sort companyid year
- drop if companyid==companyid[_n-1] & year==year[_n-1]
- reshape wide res, i(companyid) j(year)
- browse
- Examine the correlations between the residuals of a given company
- pwcorr res1998-res2002
99Time-series dependence
- We can easily control for this problem of
time-series dependence using the robust cluster()
option - use "C\phd\Fees1.dta", clear
- reg lnaf lnta, robust cluster(companyid)
- Note that if we do not control for time-series
dependence, the t-statistic is biased upwards
even though we have controlled for the
heteroscedasticity - reg lnaf lnta, robust
- If we do not control for heteroscedasticity, the
upward bias would be even worse - reg lnaf lnta
- TOP TIP: Whenever you use panel data, you should get into the habit of using the robust cluster() option; otherwise your significant results from pooled regressions may be spurious.
1002.8 Multicollinearity
- Perfect collinearity occurs if there is a perfect
linear relation between multiple variables of the
regression model. - For example, our dataset covers a sample period
of five years (1998-2002). Suppose we create a
dummy for each year and include all five year
dummies in the fee regression. - tabulate year, gen(year_)
- reg lnaf year_1 year_2 year_3 year_4 year_5
- Note that STATA excludes one of the year dummies when estimating the model. Why is that?
1012.8 Multicollinearity
- The reason is that a linear combination of the
year dummies equals the constant in the model
- year_1 + year_2 + year_3 + year_4 + year_5 = 1
- where 1 is a constant
- The model can only be estimated if one of the
year dummies or the constant is excluded - reg lnaf year_1 year_2 year_3 year_4 year_5,
nocons - STATA automatically throws away one of the year
dummies so that the model can be estimated
102Class exercise 2g
- Go to http://ihome.ust.hk/accl/Phd_teaching.htm
- Download international.dta to your hard drive
and open in STATA - You are interested in testing whether legal
enforcement affects the importance of equity
markets to the economy - Create dummy variables for each country in your
dataset - Run a regression where importanceofequitymarket
is the dependent variable and legalenforcement is
the independent variable - How many country dummies can be included in your
regression? Explain. - Are your results for the legalenforcement
coefficient sensitive to your choice for which
country dummies to exclude? Explain.
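- One possible sketch (it assumes the country identifier in international.dta is called country; the actual variable name may differ):
- use "C:\phd\international.dta", clear
- tabulate country, gen(cdum_)
- reg importanceofequitymarket legalenforcement
- reg importanceofequitymarket legalenforcement cdum_*
- * note which dummies STATA drops, and think about why the constant and a full set of country dummies cannot all be included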
1032.8 Multicollinearity
- We have seen that when there is perfect
collinearity between independent variables, STATA
will have to exclude one of them. - For example, a linear combination of all year
dummies equals the constant in the model
- year_1 + year_2 + year_3 + year_4 + year_5 = constant
- so we cannot estimate a model that includes all
the year dummies and the constant term - Even if the independent variables are not
perfectly collinear, there can still be a problem
if they are highly correlated
1042.8 Multicollinearity
- Multicollinearity can cause
- the standard errors of the coefficients to be
large (i.e., the coefficients are not estimated
precisely) - the coefficient estimates can be highly unstable
- Example
- use "C\phd\Fees.dta", clear
- gen lnafln(auditfees)
- gen lntaln(totalassets)
- gen lnta1lnta
- reg lnaf lnta lnta1
- Obviously, you must exclude one of these
variables because lnta and lnta1 are perfectly
correlated
1052.8 Multicollinearity
- Let's see what happens if we change the value of lnta1 for just one observation
- list lnta if _n==1
- replace lnta1=8 if _n==1
- reg lnaf lnta
- reg lnaf lnta1
- reg lnaf lnta lnta1
- Notice that the lnta and lnta1 coefficients are
highly significant when included separately but
they are insignificant when included together - The reason of course is that, by construction,
lnta and lnta1 are very highly correlated - pwcorr lnta lnta1, sig
1062.8 Multicollinearity
- As another example, we can see that the
coefficients can flip signs as a result of high
collinearity - sort lnaf lnta
- replace lnta1=10 if _n<100
- reg lnaf lnta
- reg lnaf lnta1
- reg lnaf lnta lnta1
- pwcorr lnta lnta1, sig
1072.8 Multicollinearity (vif)
- Variance-inflation factors (VIF) can be used to
assess whether multicollinearity is a problem for
a particular independent variable
- The VIF takes account of the variable's correlations with all other independent variables on the right hand side
- The VIF shows the increase in the variance of the coefficient estimate that is attributable to the variable's correlations with other independent variables in the model (for variable j, VIF = 1/(1-Rj2), where Rj2 is the R-squared from regressing variable j on the other independent variables)
- reg lnaf lnta big6 lnta1
- vif
- reg lnaf lnta big6
- vif
- Multicollinearity is generally regarded as high
(very high) if the VIF is greater than 10 (20)
1082.9 Outlying observations
- We have already seen that outlying observations
heavily influence the results of OLS models
1092.9 Outlying observations
- In simple regression (with just one independent
variable), it is easy to spot outliers from a
scatterplot of Y on X - For example, a company is an outlier if it is
very small in terms of size and it pays an audit
fee that is very high - In multiple regression (where there are multiple
X variables), some observations may be outliers
even though they do not show up on the
scatterplot - Moreover, observations that show up as outliers
on the scatterplot might actually be normal once
we control for other factors in the multiple
regression - For example the small company may pay a high
audit fee because other characteristics of that
company make it a complex audit.
1102.9 Outlying observations (cooksd)
- We can calculate the influence of each observation on the estimated coefficients using Cook's D
- Values of Cook's D that are higher than 4/N are considered large, where N is the number of observations used in the regression
- reg lnaf lnta big6
- predict cook, cooksd
- sum cook, detail
- gen max=4/e(N)
- e(N) is the number of observations in the most recent regression model (the estimation sample size is stored by STATA as an internal result)
- count if cook>max & cook<.
1112.9 Outlying observations (cooksd)
- We can discard the observations whose Cook's D values are larger than this cutoff and re-estimate the model
- reg lnaf lnta big6 if cook<max
- For example, Ke and Petroni (2004, p.906) explain that they use Cook's D to exclude outliers and
the standard errors are adjusted for
heteroscedasticity and time-series dependence
(they are using a panel dataset)
1122.9 Outlying observations (winsor)
- Rather than drop outlying observations, some
researchers choose to winsorize the data - Winsorizing replaces the extreme values of a
variable with the values at certain percentiles
(e.g., the top and bottom 1%)
- You can winsorize variables in STATA using the user-written winsor command (it can be installed by typing ssc install winsor)
- winsor lnaf, gen(wlnaf) p(0.01)
- winsor lnta, gen(wlnta) p(0.01)
- sum lnaf wlnaf lnta wlnta, detail
- reg wlnaf wlnta big6
- A disadvantage with winsorizing is that the
researcher is assuming that outliers lie only at
the extremes of the variable's distribution.
1132.10 Median regression
- Median regression is quite similar to OLS but it
can be more reliable especially when we have
problems of outlying observations - Recall that in OLS, the coefficient estimates are
chosen to minimize the sum of the squared
residuals
1142.10 Median regression
- In median regression, the coefficient estimates
are chosen to minimize the sum of the absolute
residuals - Squaring the residuals in OLS means that large
residuals are more heavily weighted than small
residuals. - Because the residuals are not squared in median
regression, the coefficient estimates are less
sensitive to outliers
1152.10 Median regression
- Median regression takes its name from its
predicted values, which are estimates of the
conditional median of the dependent variable. - In OLS, the predicted values are estimates of the
conditional mean of the dependent variable. - The predicted values of both regression
techniques therefore measure the central tendency
(i.e., mean or median) of the dependent variable.
1162.10 Median regression
- STATA treats median regression as a special case
of quantile regression.
- In quantile regression, the coefficients are estimated so that the sum of the weighted absolute residuals is minimized
- where the weights wi differ for positive and negative residuals (for the quantile q, positive residuals receive a weight of 2q and negative residuals a weight of 2(1-q))
1172.10 Median regression (qreg)
- Weights can be different for positive and
negative residuals. If positive and negative
residuals are weighted equally, you get a median
regression. If positive residuals are weighted by
the factor 1.5 and negative residuals are
weighted by the factor 0.5, you get a 3rd
quartile regression, etc. - In STATA you perform quantile regressions using
the qreg command - qreg lnaf lnta big6
- reg lnaf lnta big6
118Class exercise 2h
- Open the anscombe.dta file
- Do a scatterplot of y3 and x3
- Do an OLS regression of y3 on x3 for the full
sample. - Calculate Cooks D to test for the presence of
outliers. - Do an OLS regression of y3 on x3 after dropping
any outliers. - Do a median regression of y3 on x3 for the full
sample.
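- A sketch of the steps (the d3 and cut3 names are just illustrative):
- use "C:\phd\anscombe.dta", clear
- twoway (scatter y3 x3) (lfit y3 x3)
- reg y3 x3
- predict d3, cooksd
- gen cut3=4/e(N)
- reg y3 x3 if d3<cut3
- qreg y3 x3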
1192.10 Median regression
- Basu and Markov (2004) compare the results of OLS
and median regressions to determine whether
analysts who issue earnings forecasts attempt to
minimize - the sum of squared forecast errors (OLS), or
- the sum of absolute forecast errors (median)
121- The LAD (least absolute deviations) estimator is simply the median regression that we estimated earlier using the qreg command
122- Basu and Markov (2004) conclude that analysts' forecasts may accurately reflect their rational expectations
- Their study is a good example of how we can make
an important contribution to the literature if we
use an estimation technique that is not widely
used by accounting researchers
1232.11 Looping
- Looping can be very useful when we want to carry
out the same operations many times - Looping significantly reduces the length of our
do files because it means we do not have to state
the same commands many times - When software designers use the word
programming they mean they are creating a new
command - Usually we do not need new commands because what
we need has already been written for us in STATA - However, programming is necessary if we want to
use looping
1242.11 Looping (program, forvalues)
- Example
- program ten
- forvalues i = 1(1)10 {
- display `i'
- }
- end
- To run this program we simply type ten
1252.11 Looping (program, forvalues)
- What's happening?
- program ten: we are telling STATA that the name of our program is ten and that we are starting to write a program
- end: we are telling STATA that we have finished writing the program
- { }: everything inside these curly brackets is part of a loop
- forvalues i: the program will perform the commands inside the brackets for each value of i (i is called a local macro)
- = 1(1)10: i goes from one to ten, increasing by the value one every time
- display `i': this is the command inside the brackets and STATA will execute this command for each value of i from one to ten. Note that ` is at the top left of your keyboard whereas ' is next to the Enter key
1262.11 Looping (program, forvalues)
- The macro i has a left quote (`) and a right quote (') around it. These quotes tell Stata to replace the macro with the value that it holds before executing the command. So the first time through the loop, i holds the value of 1. Stata first replaces `i' with 1, and then it executes the command
- display 1
- The next time through, i holds the value of 2. Stata first replaces `i' with the value 2, and then it executes the command
- display 2
- This process continues through the values 3, 4,...,10.
1272.11 Looping (capture)
- Suppose we make a mistake in the program or we
want to modify the program in some way - We first need to drop this program from STATAs
memory - program drop ten
- we can then go on to write a new program called
ten - It is good practice to drop any program that
might exist with the same name before writing a
new program - capture program drop ten
1282.11 Looping
- Our program is now
- capture program drop ten
- program ten
- forvalues i = 1(1)10 {
- display `i'
- }
- end
- To run this program we simply type ten
129Another example
- Earnings management studies often estimate abnormal accruals using the Jones model
- ACCRUALSit = a0k (1/ASSETit-1) + a1k (ΔSALESit / ASSETit-1) + a2k (PPEit / ASSETit-1) + uit
- ACCRUALSit = change in non-cash current assets minus change in non-debt current liabilities, scaled by lagged assets.
- The k sub-scripts indicate that the model is estimated separately for each industry.
- Industries are identified using Standard Industrial Classification (SIC) codes
130Another example
- The number of industries
- 10 using one digit codes,
- 100 using two digit codes,
- 1,000 using three digit codes, etc
- Your do file could be very long if you had
separate lines for each industry - Estimate abnormal accruals for SIC 1
- Estimate abnormal accruals for SIC 2
- Estimate abnormal accruals for SIC 3
- ..
- Estimate abnormal accruals for SIC 10, etc.
131Another example
- Your do file will be much shorter if you use the
looping technique
- capture program drop ab_acc
- program ab_acc
- forvalues i = 1(1)10 {
- insert commands that you want to execute on each industry SIC code
- }
- end
132Another example
- Go to http://ihome.ust.hk/accl/Phd_teaching.htm
- open accruals.dta in STATA and generate the variables we need
- the regressions will be estimated at the one-digit level
- use "C:\phd\accruals.dta", clear
- gen one_sic=int(sic/1000)
- gen ncca=current_assets-cash
- gen ndcl=current_liabilities-debt_in_current_liabilities
- sort cik year
- gen ch_ncca=ncca-ncca[_n-1] if cik==cik[_n-1]
- gen ch_ndcl=ndcl-ndcl[_n-1] if cik==cik[_n-1]
- gen accruals=(ch_ncca-ch_ndcl)/assets[_n-1] if cik==cik[_n-1]
- gen lag_assets=assets[_n-1] if cik==cik[_n-1]
- gen ppe_scaled=ppe/assets[_n-1] if cik==cik[_n-1]
- gen chsales_scaled=(sales-sales[_n-1])/assets[_n-1] if cik==cik[_n-1]
133Another example
- gen ab_acc=.
- capture program drop ab_acc
- program ab_acc
- forvalues i = 0(1)9 {
- capture reg accruals lag_assets ppe_scaled chsales_scaled if one_sic==`i'
- capture predict ab_acc`i' if one_sic==`i', resid
- replace ab_acc=ab_acc`i' if one_sic==`i'
- capture drop ab_acc`i'
- }
- end
- ab_acc
134Explaining this program
- forvalues i = 0(1)9 {
- the one_sic variable takes values from 0 to 9
- capture reg accruals lag_assets ppe_scaled chsales_scaled if one_sic==`i'
- the regressions are run at the one digit level because some industries have insufficient observations at the two-digit level
- capture predict ab_acc`i' if one_sic==`i', resid
- For each industry, I create a separate abnormal accrual variable (ab_acc1 if industry = 1, ab_acc2 if industry = 2, etc.).
- If this line had been capture predict ab_acc if one_sic==`i', resid we would not have been able to go beyond the first industry because ab_acc would already be defined
- replace ab_acc=ab_acc`i' if one_sic==`i'
- The overall abnormal accrual variable (ab_acc) equals ab_acc1 if industry = 1, ab_acc2 if industry = 2, etc.
- before starting the program I had to gen ab_acc=. in order for this replace command to work
- capture drop ab_acc`i'
- I drop ab_acc1, ab_acc2, etc. because I only need the ab_acc variable.
135Conclusion
- You should now have a good understanding of
- how OLS models work,
- how to interpret the results of OLS models,
- how to find out whether the assumptions of OLS
are violated, - how to correct the standard errors for
heteroscedasticity, time-series dependence and
cross-sectional dependence, - how to handle problems of outliers
136Conclusion
- So far, we have been discussing the case where
our dependent variable is continuous (e.g., lnaf) - When the dependent variable is not continuous, we
cannot use OLS (or quantile) regression. - The next topic considers how to estimate models
where our dependent variable is not continuous.