Title: 2. Linear dependent variables
12. Linear dependent variables
- 2.1 The basic idea underlying linear regression
- 2.2 Single variable OLS
- 2.3 Correctly interpreting the coefficients
- 2.4 Examining the residuals
- 2.5 Multiple regression
- 2.6 Heteroskedasticity
- 2.7 Correlated errors
- 2.8 Multicollinearity
- 2.9 Outlying observations
- 2.10 Median regression
- 2.11 Looping
22.1 The basic idea underlying linear regression
- A simple linear regression aims to characterize
the relation between a dependent variable and one
independent variable using a straight line - You have already seen how to fit a line between
two variables using the scatter command - Linear regression does the same thing but it can
be extended to include multiple independent
variables
32.1 The basic idea
- For example, you predict that larger companies usually pay higher fees
- You can formalize the effect of company size on predicted fees using a simple equation: Expected Fees = a0 + a1*Size
- The parameter a0 represents what fees are expected to be in the case that Size = 0.
- The parameter a1 captures the impact of a one-unit increase in Size on expected fees.
42.1 The basic idea
- The parameters a0 and a1 are assumed to be the
same for all observations and they are called
regression coefficients - You may argue that company size is not the only
variable that affects audit fees. For example,
the complexity of the audit engagement, or the
size of the audit firm may also matter. - If you do not know all the factors that influence
fees, the predicted fee that you calculate from
the above equation will differ from the
actual fee.
52.1 The basic idea
- The deviation between the predicted fee and the
actual fee is called the residual. In general,
you might represent the relation between actual
fees and predicted fees in the following way: Actual Fees = Predicted Fees + e
- where e represents the residual term (i.e., the difference between actual and predicted fees)
62.1 The basic idea
- Putting the two together we can express actual fees using the following equation: Actual Fees = a0 + a1*Size + e
- The goal of regression analysis is to estimate the parameters a0 and a1
72.1 The basic idea
- One of the simplest techniques to estimate the
coefficients is known as ordinary least squares
(OLS). - The objective of OLS is to make the difference
between the predicted and actual values as small
as possible - In other words, the goal is to minimize the
magnitude of the residuals
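- In symbols (a standard way of stating the OLS objective, not reproduced from the original slide), the coefficients a0 and a1 are chosen to minimize the sum of squared residuals across all observations i: minimize sum of (yi - a0 - a1*xi)^2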
82.1 The basic idea
- Go to http://ihome.ust.hk/accl/Phd_teaching.htm
- Download ols.dta to your hard drive and open in STATA (use "C:\phd\ols.dta", clear)
- Examine the graphical relation between the two variables: twoway (scatter y x) (lfit y x)
92.1 The basic idea
- This line is fitted by minimizing the sum of the squared differences between the observed and predicted values of y (known as the residual sum of squares, RSS)
- The main assumptions required to obtain these
coefficients are that - The relation between y and x is linear
- The x variable is uncorrelated with the residuals
(i.e., x is exogenous) - The residuals have a mean value of zero
102.1 The basic idea
12Class exercise 2a
- Using the formulas and the data currently in
STATA, calculate the parameters a1 and a0
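- To check your answer, the standard OLS formulas are a1 = sum of (xi - xbar)(yi - ybar) divided by sum of (xi - xbar)^2, and a0 = ybar - a1*xbar. One possible Stata sketch (it uses egen's mean() and total() functions; the mx, my, sxy and sxx names are just illustrative):
- egen mx=mean(x)
- egen my=mean(y)
- gen dxy=(x-mx)*(y-my)
- gen dxx=(x-mx)^2
- egen sxy=total(dxy)
- egen sxx=total(dxx)
- gen a1=sxy/sxx
- gen a0=my-a1*mx
- list a1 a0 in 1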
132.2 Single variable OLS (regress)
- Instead of using these formulas to calculate the
regression coefficients, we can instead use the
regress command - regress y x
- The first variable (y) is the dependent variable
while the second (x) is the independent variable
142.2 Single variable OLS (regress)
- This gives the following output
152.2 Single variable OLS (regress)
- The coefficient estimates are 3.000 for the a0
parameter and 0.500 for the a1 parameter
- We can use these to predict the values of Y for any given value of X. For example, when X = 5 we predict that Y will be
- display 3.000091+0.500090*5
- which gives a predicted value of approximately 5.50
162.2 Single variable OLS (_b)
- Alternatively, we do not need to type the coefficient estimates because STATA will remember them for us. They are stored by STATA using the name _b[varname] where varname is replaced with the name of the independent variable or the constant (_cons)
- display _b[_cons]+_b[x]*5
172.2 Single variable OLS
- Note that the predicted value of y when x equals 5 differs from the actual value
- list y if x==5
- The actual value is 5.68 compared to the
predicted value of 5.50. The difference for this
observation is the residual error that arises
because x is not a perfect predictor of y.
182.2 Single variable OLS
- If we want to compute the predicted value of y for each value of x in our dataset, we can use the saved coefficients
- gen y_hat=_b[_cons]+_b[x]*x
- The estimated residuals are the difference between the observed y values and the predicted y values
- gen y_res=y-y_hat
- list x y_hat y y_res
192.2 Single variable OLS (predict)
- A quicker way to do this would be to use the
predict command after regress - predict yhat
- predict yres, resid
- Checking that this gives the same answer
- list yhat y_hat yres y_res
- You should also note that the values of x, yhat
and yres correspond with those found on the
scatter graph - sort x
- list x y y_hat y_res
202.2 Single variable OLS
212.2 Single variable OLS
- Note that by construction, there is zero
correlation between the x variable and the
residuals - twoway (scatter y_res x) (lfit y_res x)
222.2 Single variable OLS
- Standard errors
- Typically our data comprises a sample that is
taken from a larger population - The coefficients are only estimates of the true
a0 and a1 values that describe the entire
population - If we obtained a second random sample from the
same population, we would obtain different
coefficient estimates for a0 and a1
232.2 Single variable OLS
- We therefore need a way to describe the
variability that would obtain if we were to apply
OLS to many different samples - Equivalently, we want a measure of how
precisely our coefficients are estimated - The solution is to calculate standard errors,
which are simply the sample standard deviations
associated with the estimated coefficients - Standard errors (SEs) allow us to perform
statistical tests, e.g., is our estimate of a1
significantly greater than zero?
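- As a quick illustration (this sketch is not on the original slide; it uses the y and x variables from ols.dta), the t-statistic and a formal test can be obtained from Stata's stored results after regress:
- reg y x
- display _b[x]/_se[x]
- * this reproduces the t-statistic reported for x
- test x = 0
- * Wald test that the coefficient on x is zero; the reported F statistic equals the squared t-statistic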
242.2 Single variable OLS
- The techniques for estimating standard errors are
based on additional OLS assumptions - Homoscedasticity (i.e., the residuals have a
constant variance) - Non-correlation (i.e., the residuals are not
correlated with each other) - Normality (i.e., the residuals are normally
distributed)
252.2 Single variable OLS
- The t-statistic is obtained by dividing the
coefficient estimate by the standard error
262.2 Single variable OLS
- The p-values are from the t-distribution and they
tell you how likely it is that you would have
observed the estimated coefficient under the
assumption that the true coefficient in the
population is zero.
- The p-value of 0.002 tells you that it is very unlikely (prob = 0.2%) that the true coefficient on x is zero.
- The confidence intervals mean you can be 95% confident that the true coefficient of x lies between 0.233 and 0.767.
272.2 Single variable OLS
- To explain this we need some notation
- The total sum of squares (TSS) captures the variation in y around its mean
- The residual sum of squares (RSS) captures the variation that is not explained by x
- The explained sum of squares (ESS) captures the variation that is explained by x
282.2 Single variable OLS
- The total sum of squares (TSS) = 41.27
- The explained sum of squares (ESS) = 27.51
- The residual sum of squares (RSS) = 13.76
- Note that TSS = ESS + RSS.
292.2 Single variable OLS
- The column labeled df contains the number of degrees of freedom
- For the ESS, df = k - 1 where k = number of regression coefficients (df = 2 - 1 = 1)
- For the RSS, df = n - k where n = number of observations (= 11 - 2 = 9)
- For the TSS, df = n - 1 (= 11 - 1 = 10)
- The last column (MS) reports the ESS, RSS and TSS divided by their respective degrees of freedom
302.2 Single variable OLS
- The first number simply tells us how many
observations are used to estimate the model - The other statistics here tell you how well the
model explains the variation in Y
312.2 Single variable OLS
- The R-squared = ESS / TSS (= 27.51 / 41.27 = 0.666)
- So x explains 66.6% of the variation in y.
- Unfortunately, many researchers in accounting (and other fields) evaluate the quality of a model by looking only at the R-squared.
- This is not only invalid, it is also very dangerous (I will explain why later)
322.2 Single variable OLS
- One problem with the R-squared is that it will never decrease (and will usually increase) when an independent variable is added, even one that has very little explanatory power.
- Adding another variable is not always a good idea as you lose one degree of freedom for each additional coefficient that needs to be estimated. Adding insignificant variables can be especially inefficient if you are working with a small sample size.
- The adjusted R-squared corrects for this by accounting for the number of model parameters, k, that need to be estimated
- Adj R-squared = 1-(1-R2)(n-1)/(n-k) = 1-(1-.666)(10)/9 = 0.629
- In fact the adjusted R-squared can even take on negative values. For example, suppose that y and x are uncorrelated, in which case the unadjusted R-squared is zero
- Adj R-squared = 1-(n-1)/(n-2) = (n-2-n+1)/(n-2) = -1/(n-2)
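- You can verify the adjusted R-squared calculation using Stata's stored results after regress (a sketch; e(r2) and e(N) are saved by regress, and k = 2 here because the model has a constant plus one slope):
- reg y x
- display 1-(1-e(r2))*(e(N)-1)/(e(N)-2)
- display e(r2_a)
- * the two displayed numbers should match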
332.2 Single variable OLS
- You might think that another way to measure the
fit of the model is to add up the residuals.
However, by definition the residuals will sum to
zero. - An alternative is to square the residuals, add
them up (giving the RSS) and then take the square
root.
- Root MSE = square root of (RSS/(n-k))
- = (13.76/(11-2))^0.5 = 1.236
- One way to interpret the root MSE is that it shows how far away, on average, the model's predictions are from the actual values of y
- The F-statistic = (ESS/(k-1))/(RSS/(n-k))
- = (27.51/1)/(13.76/9) = 17.99
- the F statistic is used to test whether the R-squared is significantly greater than zero (i.e., are the independent variables jointly significant?)
- Prob > F gives the probability of observing an F statistic this large if the true R-squared in the population is actually equal to zero
- This F test is used to test the overall statistical significance of the regression model
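- These sums of squares are also stored by Stata after regress, so the table can be reproduced by hand (a sketch; e(mss) is the model/explained sum of squares, e(rss) the residual sum of squares, and e(df_m) and e(df_r) their degrees of freedom):
- reg y x
- display (e(mss)/e(df_m))/(e(rss)/e(df_r))
- * this reproduces the F statistic
- display e(mss)/(e(mss)+e(rss))
- * this reproduces the R-squared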
34Class exercise 2b
- Open your Fees.dta file and run the following two
regressions - audit fees on total assets
- the log of audit fees on the log of total assets
- What does the output of your regression mean?
- Which model appears to have the better fit? (A possible starting point is sketched below.)
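- One possible starting sketch (it assumes Fees.dta is saved in C:\phd and uses the auditfees and totalassets variable names from the earlier slides):
- use "C:\phd\Fees.dta", clear
- reg auditfees totalassets
- gen lnaf=ln(auditfees)
- gen lnta=ln(totalassets)
- reg lnaf lnta
- * compare the residual plots and the distribution of the residuals across the two models, not just the R-squared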
352.3 Correctly interpreting the coefficients
- So far we have considered the case where the
independent variable is continuous. - Interpretation of results is even more
straightforward when the independent variable is
a dummy.
- reg auditfees big6
- ttest auditfees, by(big6)
362.3 Correctly interpreting the coefficients
- Suppose we wish to test whether the Big 6 fee
premium is significantly different between listed
and non-listed companies
372.3 Correctly interpreting the coefficients
- gen listed=0
- replace listed=1 if companytype==2 | companytype==3 | companytype==5
- reg auditfees big6 if listed==0
- ttest auditfees if listed==0, by(big6)
- reg auditfees big6 if listed==1
- ttest auditfees if listed==1, by(big6)
- gen listed_big6=listed*big6
- reg auditfees big6 listed listed_big6
382.3 Correctly interpreting the coefficients
- Some studies report the economic significance
of the estimated coefficients as well as the
statistical significance - Economic significance refers to the magnitude of
the impact of x on y - There is no single way to evaluate economic
significance but many studies describe the
change in the predicted value of y as x increases
from the 25th percentile to the 75th (or as x
changes by one standard deviation around its mean)
392.3 Correctly interpreting the coefficients
- For example, we can calculate the expected change
in audit fees as company size increases from the
25th to 75th percentiles
- reg auditfees totalassets
- sum totalassets if auditfees<., detail
- gen fees_low=_b[_cons]+_b[totalassets]*r(p25)
- gen fees_high=_b[_cons]+_b[totalassets]*r(p75)
- sum fees_low fees_high
40Class exercise 2c
- Estimate the audit fee model in logs rather than
in absolute values - Calculate the expected change in audit fees as
company size increases from the 25th to 75th
percentiles - Compare your results for economic significance to
those we obtained when the fee model was
estimated using the absolute values of fees and
assets.
- Hint: you will need to take the exponential of the predicted log of fees in order to make this comparison (one possible sketch follows below).
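- A sketch of one way to follow the hint (it assumes lnaf and lnta have already been generated as the logs of audit fees and total assets; the lnfee_p25 and fee_p25 names are just illustrative):
- reg lnaf lnta
- sum lnta if lnaf<., detail
- gen lnfee_p25=_b[_cons]+_b[lnta]*r(p25)
- gen lnfee_p75=_b[_cons]+_b[lnta]*r(p75)
- gen fee_p25=exp(lnfee_p25)
- gen fee_p75=exp(lnfee_p75)
- sum fee_p25 fee_p75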
412.3 Correctly interpreting the coefficients
- When evaluating the economic significance of a
dummy variable coefficient, we usually do so
using the values zero and one rather than
percentiles - For example
- reg lnaf big6
- gen fees_nb6=exp(_b[_cons])
- gen fees_b6=exp(_b[_cons]+_b[big6])
- sum fees_nb6 fees_b6
422.3 Correctly interpreting the coefficients
- Suppose we believe that the impact of a Big 6 audit on fees depends upon the size of the company; we can capture this by adding an interaction term, big6*lnta, to the fee model (as in the regression on the next slide)
- Usually, we would quantify this impact using a range of values for lnta (e.g., as lnta increases from the 25th to the 75th percentile)
432.3 Correctly interpreting the coefficients
- For example
- gen big6_lnta=big6*lnta
- reg lnaf big6 lnta big6_lnta
- sum lnta if lnaf<. & big6<., detail
- gen big6_low=_b[big6]+_b[big6_lnta]*r(p25)
- gen big6_high=_b[big6]+_b[big6_lnta]*r(p75)
- gen big6_mean=_b[big6]+_b[big6_lnta]*r(mean)
- sum big6_low big6_high big6_mean
44- It is amazing how many studies give a misleading
interpretation of the coefficients when using
interaction terms. For example, Blackwell et al.
46- Class questions
- Theoretically, how should auditing affect the
interest rate that the company has to pay? - Empirically, how do we measure the impact of
auditing on the interest rate using eq. (1)?
48- Class question: At what values of total assets (000) is the effect of the Audit Dummy on the interest rate
- negative, zero, or positive?
50- Class questions
- What is the mean value of total assets within
their sample? - How does auditing affect the interest rate for
the average company in their sample?
52- Verify that the above claim is true.
- Suppose Blackwell et al. had reported the impact
for a firm with 11m in assets and another firm
with 15m in assets. - How would this have changed the conclusions
drawn? - Do you think the paper would have been published
if the authors had made this comparison?
542.4 Examining the residuals
- Go to http://ihome.ust.hk/accl/Phd_teaching.htm
- Download anscombe.dta to your hard drive (use "C:\phd\anscombe.dta", clear)
- Run the following regressions
- reg y1 x1
- reg y2 x2
- reg y3 x3
- reg y4 x4
- Note that the output from these regressions is virtually identical
- intercept = 3.0 (t-stat = 2.67)
- x coefficient = 0.5 (t-stat = 4.24)
- R-squared = 66%
55Class exercise 2d
- If you did not know about regression assumptions
or regression diagnostics you would probably stop
your analysis at this point, concluding that you
have a good fit for all four models. - In fact, only one of these four models is well
specified. - Draw scatter graphs for each of these four
associations (e.g., twoway (scatter y1 x1) (lfit
y1 x1)).
- Of the four models, which do you think is the well-specified one?
- Draw scatter graphs for the residuals against the x variable for each of the four regressions. Is there a pattern?
- Which of the OLS assumptions are violated in these four regressions?
562.4 Examining the residuals
- Unfortunately, it is common among researchers to
judge whether a model is well-specified solely
in terms of its explanatory power (i.e., the
R-squared). - Many researchers fail to report other types of
diagnostic tests - is there significant heteroscedasticity?
- is there any pattern to the residuals?
- are there any problems of outliers?
572.4 Examining the residuals
- For example, many audit fee studies claim that
their models are well-specified because they have
high R2 - Carson et al. (2003)
582.4 Examining the residuals
- Gu (2007) points out that
- econometricians consider R2 values to be
relatively unimportant (accounting researchers
put far too much emphasis on the magnitude of the
R2) - regression R2s should not be compared across
different samples - in contrast there is a large accounting
literature that uses R2s to determine whether the
value relevance of accounting information has
changed over time
59- It is easy to show that the same economic model
can yield very different R2 depending on how the
variables are transformed
- Using either eq. (1) or (2), we will obtain
exactly the same coefficient estimates because
the economic model is the same - If eq. (1) is well-specified, so also is eq. (2)
- If eq. (1) is mis-specified, so also is eq. (2)
- However, the R2 of eq. (1) will be very different
from the R2 of eq. (2)
60- Example
- use "C\phd\Fees.dta", clear
- gen lnafln(auditfees)
- gen lntaln(totalassets)
- sort companyid yearend
- by companyid gen lnaf_laglnaf_n-1
- egen missrmiss(lnaf lnta lnaf_lag)
- gen chlnaflnaf-lnaf_lag
- reg lnaf lnta lnaf_lag if miss0
- reg chlnaf lnta lnaf_lag if miss0
- The lnta coefficients are exactly the same in the
two models. - The lnaf_lag coefficient in eq. (2) equals the
lnaf_lag coefficient in eq. (1) minus one. - The R2 is much higher in eq. (1) than eq. (2).
- The high R2 in eq. (1) does not imply that the
model is well-specified. - The low R2 in eq. (2) does not imply that the
model is mis-specified. - Either both equations are well-specified or they
are both mis-specified. - The R2 tells us nothing about whether our
hypothesis about the determinants of Y is
correct.
612.4 Examining the residuals
- Instead of relying only on the R2, an examination
of the residuals can help us to identify whether
the model is well specified. For example, compare
the audit fee model which is not logged - reg auditfees totalassets
- predict res1, resid
- twoway (scatter res1 totalassets, msize(tiny))
(lfit res1 totalassets) - With the logged audit fee model
- reg lnaf lnta
- predict res2, resid
- twoway (scatter res2 lnta, msize(tiny)) (lfit
res2 lnta)
- Notice that the residuals are more spherical (i.e., they display less of an obvious pattern) in the logged model.
622.4 Examining the residuals
- In order for the standard t-statistics and p-values to be valid (especially in small samples), we also assume that the residuals are normally distributed
- We can test this using a histogram of the residuals
- hist res1
- this does not give us what we need because there are severe outliers
- sum res1, detail
- hist res1 if res1>-22 & res1<208, normal xlabel(-25(25)210)
- hist res2
- sum res2, detail
- hist res2 if res2>-2 & res2<1.8, normal xlabel(-2(0.5)2)
- The residuals are much closer to the assumed normal distribution when the variables are measured in logs
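- If you want a formal test to supplement the histograms, Stata's skewness/kurtosis test for normality can be applied to the residuals (a sketch; note that with very large samples even minor departures from normality will be rejected):
- sktest res1 res2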
64Class exercise 2e
- Following Pong and Whittington (1994), estimate the raw value of audit fees as a function of raw assets and assets squared
- Examine the residuals
- Do you think this model is better specified than
the one in logs?
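- A possible sketch of this exercise (the ta2 and res_pw variable names are just illustrative):
- gen ta2=totalassets^2
- reg auditfees totalassets ta2
- predict res_pw, resid
- twoway (scatter res_pw totalassets, msize(tiny)) (lfit res_pw totalassets)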
652.5 Multiple regression
- Researchers use multiple regression when they
believe that Y is affected by multiple
independent variables
- Y = a0 + a1 X1 + a2 X2 + e
- Why is it important to control for multiple
factors that influence Y?
662.5 Multiple regression
- Suppose the true model is
- Y = a0 + a1 X1 + a2 X2 + e
- where X1 and X2 are uncorrelated with the error, e
- Suppose the OLS model that we estimate is
- Y = a0 + a1 X1 + u
- where u = a2 X2 + e
- OLS imposes the assumption that X1 is uncorrelated with the residual term, u.
- Since X1 is uncorrelated with e, the assumption that X1 is uncorrelated with u is equivalent to assuming that X1 is uncorrelated with X2.
672.5 Multiple regression
- If X1 is correlated with X2 the OLS estimate of
a1 is biased. - The magnitude of the bias depends upon the
strength of the correlation between X1 and X2. - Of course, we often do not know whether the model
we estimate is the true model - In other words, we are unsure whether there is an
omitted variable (X2) that affects Y and that is
correlated with our variable of interest (X1)
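- A small simulation (purely illustrative, not part of the original example; it uses rnormal(), available in Stata 10 and later) makes the bias easy to see: X2 affects Y and is correlated with X1, so omitting X2 biases the coefficient on X1
- clear
- set obs 1000
- set seed 123
- gen x2=rnormal()
- gen x1=0.5*x2+rnormal()
- gen y=1+2*x1+3*x2+rnormal()
- reg y x1 x2
- * the coefficient on x1 is close to its true value of 2
- reg y x1
- * omitting x2 pushes the x1 coefficient well above 2 because x1 and x2 are positively correlated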
682.5 Multiple regression
- We can judge whether or not there is likely to be
a correlated omitted variable problem using - theory
- prior empirical studies
- our understanding of the data generation process
692.5 Multiple regression
- Theory
- Does theory suggest that X2 affects Y?
- Unfortunately, theory often fails to give a clear
guide to empirical researchers as to which
variables need to be controlled for
702.5 Multiple regression
- The data generation process (DGP)
- Many researchers go wrong simply because they
fail to understand the underlying process that
generates the data (e.g., they fail to understand
the institutional details).
- Let me give you an example from my research on the reports issued by the Public Company Accounting Oversight Board (PCAOB)
- The PCAOB has been issuing reports about weaknesses that it has found in audit firms' work
71- The dependent variable equals the number of
weaknesses disclosed in the PCAOB's report about
the audit firm - Ln(CLIENTS) is a continuous measure of audit
firm size (the log of the number of companies
audited by the audit firm) - BIG is a dummy variable capturing audit firm size
- The audit firm size coefficients are positive and
highly significant - The PCAOB has been reporting more weaknesses at
the larger audit firms
722.5 Multiple regression
- A working paper has also reported this positive
relation and concluded that the larger audit
firms have been offering lower quality audits - this conclusion contradicts evidence from many
other studies that find larger audit firms
provide higher quality audits - The researchers made this mistake because they
failed to understand the data generation
process for the weaknesses disclosed in PCAOB
reports.
732.5 Multiple regression
- To understand the data generation process, it is often important to understand how the data originate: http://www.pcaobus.org/Inspections/Public_Reports/index.aspx
77- In Col. (1) there is a severe omitted variable problem because
- the PCAOB's sample size is not included as a control variable.
- the PCAOB's sample size is highly correlated with audit firm size.
- In Col. (3), we see there is no significant relation between audit firm size and the number of reported weaknesses, after we control for the size of the PCAOB's sample.
- An understanding of the data generation process is vital if we are to avoid drawing invalid conclusions.
- A PCAOB report discloses all serious weaknesses found in the inspector's sample.
- There is a biased association between audit firm size and the number of reported weaknesses if the size of the PCAOB's sample is not controlled for.
782.5 Multiple regression
- What does it mean to control for the effect of
a variable? - In a multiple regression, the coefficient a1
captures the effect of a one-unit increase in X1
on Y, after controlling for (i.e., holding
constant) X2
- Y = a0 + a1 X1 + a2 X2 + e
- I will now explain this concept in more detail
using an empirical example
792.5 Multiple regression
- We are going to look at the effect of non-audit fees on audit fees after controlling for the effect of company size
- lnaf = a0 + a1 lnta + a2 lnnaf + e
- gen lnnaf=ln(1+nonauditfees)
- capture drop miss
- egen miss=rmiss(lnaf lnta lnnaf)
- First I estimate the following model and calculate the residuals
- lnaf = a0 + a1 lnta + res1
- reg lnaf lnta if miss==0
- predict res1 if miss==0, resid
- note that res1 reflects the part of lnaf that has nothing to do with lnta (res1 = a2 lnnaf + e)
- Next I estimate the following model and calculate the residuals
- lnnaf = b0 + b1 lnta + res2
- reg lnnaf lnta if miss==0
- predict res2 if miss==0, resid
- note that res2 reflects the part of lnnaf that has nothing to do with lnta (by construction res2 is uncorrelated with lnta)
802.5 Multiple regression
- Finally I estimate the following two models
- lnaf = a0 + a1 lnta + a2 lnnaf + e (1)
- res1 = a2 res2 + e (2)
- reg lnaf lnta lnnaf if miss==0
- reg res1 res2 if miss==0
- Note that the coefficient and t-statistic on res2 in eq. (2) are exactly the same as the coefficient and t-statistic for lnnaf in eq. (1)
- What does all this mean?
- The coefficient a2 in eq. (1) captures the impact of lnnaf on lnaf after controlling for the fact that
- (1) lnta affects lnaf (a1 > 0 in eq. (1)), and
- (2) there is a significant positive correlation between lnnaf and lnta (b1 > 0)
812.5 Multiple regression
- Note that if there had been zero correlation between lnnaf and lnta (b1 = 0), the coefficient a1 would be the same in both the simple and multiple regression models
- lnaf = a0 + a1 lnta + a2 lnnaf + e
- lnaf = a0 + a1 lnta + res1
- The reason is that res1 would be uncorrelated with lnta if there is zero correlation between lnnaf and lnta (b1 = 0).
- In other words the coefficient a1 is estimated with bias only if res1 is significantly correlated with lnta.
822.5 Multiple regression
- This reinforces the intuition that it is only necessary to control for those variables that
- affect Y, AND
- are correlated with the independent variable whose coefficient we want to estimate
- For example, if we want to estimate the effect of lnnaf on lnaf, we must control for lnta because
- lnta affects lnaf, and
- lnta is correlated with lnnaf
- Note that both of these conditions are necessary for there to be an omitted variable problem.
832.5 Multiple regression
- Previously, when we were using simple regression with one independent variable, we checked whether there was a pattern between the residuals and the independent variable
- lnaf = a0 + a1 lnta + res1
- twoway (scatter res1 lnta) (lfit res1 lnta)
- When we are using multiple regression, we want to test whether there is a pattern between the residuals and the right hand side of the equation as a whole
- The right hand side of the equation as a whole is the same thing as the predicted value of the dependent variable
842.5 Multiple regression
- So we should examine whether there is a pattern
between the residuals and the predicted values of
the dependent variable
- For example, let's estimate a model where audit fees depend on non-audit fees, company size, audit firm size, and whether the company is listed on a stock market
- gen listed=0
- replace listed=1 if companytype==2 | companytype==3 | companytype==5
- reg lnaf lnnaf lnta big6 listed
- predict lnaf_hat
- predict lnaf_res, resid
- twoway (scatter lnaf_res lnaf_hat) (lfit lnaf_res
lnaf_hat)
852.5 Multiple regression (rvfplot)
- In fact, those nice guys at STATA have given us a
command which enables us to short-cut having to
use the predict command for calculating the
residuals and the fitted values - reg lnaf lnnaf lnta big6 listed
- rvfplot
- rvf stands for residuals versus fitted
862.6 Heteroscedasticity (hettest)
- The OLS techniques for estimating standard errors
are based on an assumption that the variance of
the errors is the same for all values of the
independent variables (homoscedasticity) - In many cases, the homoscedasticity assumption is
clearly violated. For example - reg auditfees nonauditfees totalassets big6
listed - rvfplot
- the homoscedasticity assumption can be tested
using the hettest command after we do the
regression - reg auditfees nonauditfees totalassets big6
listed - hettest
- Heteroscedasticity does not bias the coefficient
estimates but it does bias the standard errors of
the coefficients
872.6 Heteroscedasticity (robust)
- Heteroscedasticity is often caused by using a
dependent variable that is not symmetric - for example the auditfees variable is highly
skewed due to the fact that it has a lower bound
of zero
- much of the heteroscedasticity can often be
removed by transforming the dependent variable
(e.g., use the log of audit fees instead of the
raw values) - When you find that there is heteroscedasticity,
you need to adjust the standard errors using the
Huber/White/sandwich estimator - In STATA it is easy to do this adjustment using
the robust option - reg auditfees nonauditfees totalassets big6
listed, robust - Compare the adjusted and unadjusted results
- reg auditfees nonauditfees totalassets big6
listed - note that the coefficients are exactly the same
- the t-statistics on the independent variables are
much smaller when the standard errors are
adjusted for heteroscedasticity
88Class exercise 2f
- Estimate the audit fee model in logs rather than absolute values
- Using rvfplot, assess whether the variance of the residuals appears to be non-constant
- Using hettest, provide a formal test for heteroscedasticity
- Compare the coefficients and t-statistics when you estimate the standard errors with and without adjusting for heteroscedasticity (a sketch of these steps follows below).
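- A sketch of the steps (it assumes lnaf, lnnaf, lnta, big6 and listed have been created as on the earlier slides):
- reg lnaf lnnaf lnta big6 listed
- rvfplot
- hettest
- reg lnaf lnnaf lnta big6 listed, robust
- * the coefficients are identical with and without the robust option; only the standard errors and t-statistics change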
892.7 Correlated errors
- The OLS techniques for estimating standard errors
are based on an assumption that the errors are
not correlated - This assumption is typically violated when we use
repeated annual observations on the same
companies - The residuals of a given firm are correlated
across years (time series dependence)
90Time-series dependence
- Time-series dependence is nearly always a problem
when researchers use panel data
- Panel data = data that are pooled for the same companies across time
- In panel data, there are likely to be unobserved
company-specific characteristics that are
relatively constant over time
91- Let's start with a simple regression model where
the errors are assumed to be uncorrelated - We now relax the assumption of independent errors
by assuming that the error term has an unobserved
company-specific component that does not vary
over time and an idiosyncratic component that is
unique to each company-year observation - Similarly, we can assume that the X variable has
a company-specific component that does not vary
over time and an idiosyncratic component
92Time-series dependence
- In this case, the OLS standard errors tend to be
biased downwards and the magnitude of this bias
is increasing in the number of years within the
panel. - To understand the intuition, consider the extreme
case where the residuals and independent
variables are perfectly correlated across time. - In this case, each additional year provides no
additional information and will have no effect on
the true standard error - However, under the incorrect assumption of
time-series independence, it is assumed that each
additional year provides additional observations
and the estimated standard errors will shrink
accordingly and incorrectly - This problem can be avoided by adjusting the
standard errors for the clustering of yearly
observations across a given company
93Time-series dependence
- To understand all this, it is helpful to review
the following example - First, I estimate the model using just one
observation for each company (in the year 1998)
- gen fye=date(yearend, "mdy")
- gen year=year(fye)
- drop if year!=1998
- sort companyid
- drop if companyid==companyid[_n-1]
- reg lnaf lnta big6 listed, robust
94Time-series dependence
- Now I create a dataset in which each observation
is duplicated - Each duplicated observation provides no
additional information and will have no effect on
the true standard error but it will reduce the
estimated standard error (i.e., the estimated
standard error will be biased downwards)
- save "C:\phd\Fees98.dta", replace
- append using "C:\phd\Fees98.dta"
- reg lnaf lnta big6 listed, robust
- Notice that the coefficient estimates in the duplicated dataset are exactly the same as in the dataset that had only one observation per company.
- However, the estimated standard errors are
smaller and the t-statistics are larger in the
duplicated dataset because we are using twice as
many observations.
95Time-series dependence (robust cluster())
- We can obtain correct standard errors in the
duplicate dataset using the robust cluster()
option which adjusts the standard errors for
clustering of observations (here they are
duplicated) for each company
- reg lnaf lnta big6 listed, robust cluster(companyid)
- The t-statistics here are exactly the same as when the model is estimated using just one observation per company.
96Time-series dependence
- In reality the observations of a given company
are not exactly the same from one year to the
next (i.e., they are not exact duplicates). - However, the observations of a given company
often do not change much from one year to the
next.
- For example, a company's size and the fees that
it pays may not change much over time (i.e.,
there is a strong unobserved company-specific
component to the variables). - Failing to account for this in panel data tends
to overstate the magnitude of the t-statistics.
97Time-series dependence
- It is easy to demonstrate that the residuals of a
given company tend to be very highly correlated
over time - First, start again with the original data
- use "C\phd\Fees.dta", clear
- gen fyedate(yearend, "mdy")
- gen yearyear(fye)
- gen lnafln(auditfees)
- gen lntaln(totalassets)
- save "C\phd\Fees1.dta", replace
- Estimate the fee model and obtain the residuals
for each company-year observation - reg lnaf lnta
- predict res, resid
98Time-series dependence
- Reshape the data so that we have each company as
a row and there are separate variables for each
yearly set of residuals
- keep companyid year res
- sort companyid year
- drop if companyid==companyid[_n-1] & year==year[_n-1]
- reshape wide res, i(companyid) j(year)
- browse
- Examine the correlations between the residuals of a given company
- pwcorr res1998-res2002
99Time-series dependence
- We can easily control for this problem of
time-series dependence using the robust cluster()
option - use "C\phd\Fees1.dta", clear
- reg lnaf lnta, robust cluster(companyid)
- Note that if we do not control for time-series
dependence, the t-statistic is biased upwards
even though we have controlled for the
heteroscedasticity - reg lnaf lnta, robust
- If we do not control for heteroscedasticity, the
upward bias would be even worse - reg lnaf lnta
- TOP TIP: Whenever you use panel data, you should get into the habit of using the robust cluster() option; otherwise your significant results from pooled regressions may be spurious.
1002.8 Multicollinearity
- Perfect collinearity occurs if there is a perfect
linear relation between multiple variables of the
regression model. - For example, our dataset covers a sample period
of five years (1998-2002). Suppose we create a
dummy for each year and include all five year
dummies in the fee regression. - tabulate year, gen(year_)
- reg lnaf year_1 year_2 year_3 year_4 year_5
- Note that STATA excludes one of the year dummies when estimating the model. Why is that?
1012.8 Multicollinearity
- The reason is that a linear combination of the
year dummies equals the constant in the model
- year_1 + year_2 + year_3 + year_4 + year_5 = 1
- where 1 is a constant
- The model can only be estimated if one of the
year dummies or the constant is excluded - reg lnaf year_1 year_2 year_3 year_4 year_5,
nocons - STATA automatically throws away one of the year
dummies so that the model can be estimated
102Class exercise 2g
- Go to http://ihome.ust.hk/accl/Phd_teaching.htm
- Download international.dta to your hard drive
and open in STATA - You are interested in testing whether legal
enforcement affects the importance of equity
markets to the economy - Create dummy variables for each country in your
dataset - Run a regression where importanceofequitymarket
is the dependent variable and legalenforcement is
the independent variable - How many country dummies can be included in your
regression? Explain. - Are your results for the legalenforcement
coefficient sensitive to your choice for which
country dummies to exclude? Explain.
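- One possible sketch (it assumes the country identifier in international.dta is called country; the actual variable name may differ):
- use "C:\phd\international.dta", clear
- tabulate country, gen(cdum_)
- reg importanceofequitymarket legalenforcement
- reg importanceofequitymarket legalenforcement cdum_*
- * note which dummies STATA drops, and think about why the constant and a full set of country dummies cannot all be included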
1032.8 Multicollinearity
- We have seen that when there is perfect
collinearity between independent variables, STATA
will have to exclude one of them. - For example, a linear combination of all year
dummies equals the constant in the model
- year_1 + year_2 + year_3 + year_4 + year_5 = constant
- so we cannot estimate a model that includes all
the year dummies and the constant term - Even if the independent variables are not
perfectly collinear, there can still be a problem
if they are highly correlated
1042.8 Multicollinearity
- Multicollinearity can cause
- the standard errors of the coefficients to be
large (i.e., the coefficients are not estimated
precisely) - the coefficient estimates can be highly unstable
- Example
- use "C\phd\Fees.dta", clear
- gen lnafln(auditfees)
- gen lntaln(totalassets)
- gen lnta1lnta
- reg lnaf lnta lnta1
- Obviously, you must exclude one of these
variables because lnta and lnta1 are perfectly
correlated
1052.8 Multicollinearity
- Let's see what happens if we change the value of lnta1 for just one observation
- list lnta if _n==1
- replace lnta1=8 if _n==1
- reg lnaf lnta
- reg lnaf lnta1
- reg lnaf lnta lnta1
- Notice that the lnta and lnta1 coefficients are
highly significant when included separately but
they are insignificant when included together - The reason of course is that, by construction,
lnta and lnta1 are very highly correlated - pwcorr lnta lnta1, sig
1062.8 Multicollinearity
- As another example, we can see that the
coefficients can flip signs as a result of high
collinearity - sort lnaf lnta
- replace lnta1=10 if _n<100
- reg lnaf lnta
- reg lnaf lnta1
- reg lnaf lnta lnta1
- pwcorr lnta lnta1, sig
1072.8 Multicollinearity (vif)
- Variance-inflation factors (VIF) can be used to
assess whether multicollinearity is a problem for
a particular independent variable
- The VIF takes account of the variable's correlations with all other independent variables on the right hand side
- The VIF shows the increase in the variance of the coefficient estimate that is attributable to the variable's correlations with other independent variables in the model (for variable j, VIF = 1/(1-Rj2), where Rj2 is the R-squared from regressing variable j on the other independent variables)
- reg lnaf lnta big6 lnta1
- vif
- reg lnaf lnta big6
- vif
- Multicollinearity is generally regarded as high
(very high) if the VIF is greater than 10 (20)
1082.9 Outlying observations
- We have already seen that outlying observations
heavily influence the results of OLS models
1092.9 Outlying observations
- In simple regression (with just one independent
variable), it is easy to spot outliers from a
scatterplot of Y on X - For example, a company is an outlier if it is
very small in terms of size and it pays an audit
fee that is very high - In multiple regression (where there are multiple
X variables), some observations may be outliers
even though they do not show up on the
scatterplot - Moreover, observations that show up as outliers
on the scatterplot might actually be normal once
we control for other factors in the multiple
regression - For example the small company may pay a high
audit fee because other characteristics of that
company make it a complex audit.
1102.9 Outlying observations (cooksd)
- We can calculate the influence of each observation on the estimated coefficients using Cook's D
- Values of Cook's D that are higher than 4/N are considered large, where N is the number of observations used in the regression
- reg lnaf lnta big6
- predict cook, cooksd
- sum cook, detail
- gen max=4/e(N)
- e(N) is the number of observations in the most recent regression model (the estimation sample size is stored by STATA as an internal result)
- count if cook>max & cook<.
1112.9 Outlying observations (cooksd)
- We can discard the observations whose Cook's D values are larger than this cutoff and re-estimate the model
- reg lnaf lnta big6 if cook<max
- For example, Ke and Petroni (2004, p.906) explain that they use Cook's D to exclude outliers and
the standard errors are adjusted for
heteroscedasticity and time-series dependence
(they are using a panel dataset)
1122.9 Outlying observations (winsor)
- Rather than drop outlying observations, some
researchers choose to winsorize the data - Winsorizing replaces the extreme values of a
variable with the values at certain percentiles
(e.g., the top and bottom 1%)
- You can winsorize variables in STATA using the user-written winsor command (it can be installed by typing ssc install winsor)
- winsor lnaf, gen(wlnaf) p(0.01)
- winsor lnta, gen(wlnta) p(0.01)
- sum lnaf wlnaf lnta wlnta, detail
- reg wlnaf wlnta big6
- A disadvantage with winsorizing is that the
researcher is assuming that outliers lie only at
the extremes of the variable's distribution.
1132.10 Median regression
- Median regression is quite similar to OLS but it
can be more reliable especially when we have
problems of outlying observations - Recall that in OLS, the coefficient estimates are
chosen to minimize the sum of the squared
residuals
1142.10 Median regression
- In median regression, the coefficient estimates
are chosen to minimize the sum of the absolute
residuals - Squaring the residuals in OLS means that large
residuals are more heavily weighted than small
residuals. - Because the residuals are not squared in median
regression, the coefficient estimates are less
sensitive to outliers
1152.10 Median regression
- Median regression takes its name from its
predicted values, which are estimates of the
conditional median of the dependent variable. - In OLS, the predicted values are estimates of the
conditional mean of the dependent variable. - The predicted values of both regression
techniques therefore measure the central tendency
(i.e., mean or median) of the dependent variable.
1162.10 Median regression
- STATA treats median regression as a special case
of quantile regression.
- In quantile regression, the coefficients are estimated so that the sum of the weighted absolute residuals is minimized
- where the weights wi differ for positive and negative residuals (for the quantile q, positive residuals receive a weight of 2q and negative residuals a weight of 2(1-q))
1172.10 Median regression (qreg)
- Weights can be different for positive and
negative residuals. If positive and negative
residuals are weighted equally, you get a median
regression. If positive residuals are weighted by
the factor 1.5 and negative residuals are
weighted by the factor 0.5, you get a 3rd
quartile regression, etc. - In STATA you perform quantile regressions using
the qreg command - qreg lnaf lnta big6
- reg lnaf lnta big6
118Class exercise 2h
- Open the anscombe.dta file
- Do a scatterplot of y3 and x3
- Do an OLS regression of y3 on x3 for the full
sample. - Calculate Cooks D to test for the presence of
outliers. - Do an OLS regression of y3 on x3 after dropping
any outliers. - Do a median regression of y3 on x3 for the full
sample.
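- A sketch of the steps (the d3 and cut3 names are just illustrative):
- use "C:\phd\anscombe.dta", clear
- twoway (scatter y3 x3) (lfit y3 x3)
- reg y3 x3
- predict d3, cooksd
- gen cut3=4/e(N)
- reg y3 x3 if d3<cut3
- qreg y3 x3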
1192.10 Median regression
- Basu and Markov (2004) compare the results of OLS
and median regressions to determine whether
analysts who issue earnings forecasts attempt to
minimize - the sum of squared forecast errors (OLS), or
- the sum of absolute forecast errors (median)
121- The LAD (least absolute deviations) estimator is simply the median regression that we estimated earlier using the qreg command
122- Basu and Markov (2004) conclude that analysts' forecasts may accurately reflect their rational expectations
- Their study is a good example of how we can make
an important contribution to the literature if we
use an estimation technique that is not widely
used by accounting researchers
1232.11 Looping
- Looping can be very useful when we want to carry
out the same operations many times - Looping significantly reduces the length of our
do files because it means we do not have to state
the same commands many times - When software designers use the word
programming they mean they are creating a new
command - Usually we do not need new commands because what
we need has already been written for us in STATA - However, programming is necessary if we want to
use looping
1242.11 Looping (program, forvalues)
- Example
- program ten
- forvalues i = 1(1)10 {
- display `i'
- }
- end
- To run this program we simply type ten
1252.11 Looping (program, forvalues)
- What's happening?
- program ten: we are telling STATA that the name of our program is ten and that we are starting to write a program
- end: we are telling STATA that we have finished writing the program
- { }: everything inside these curly brackets is part of a loop
- forvalues i: the program will perform the commands inside the brackets for each value of i (i is called a local macro)
- = 1(1)10: i goes from one to ten, increasing by the value one every time
- display `i': this is the command inside the brackets and STATA will execute this command for each value of i from one to ten. Note that ` is at the top left of your keyboard whereas ' is next to the Enter key
1262.11 Looping (program, forvalues)
- The macro i has a left quote (`) and a right quote (') around it. These quotes tell Stata to replace the macro with the value that it holds before executing the command. So the first time through the loop, i holds the value of 1. Stata first replaces `i' with 1, and then it executes the command
- display 1
- The next time through, i holds the value of 2. Stata first replaces `i' with the value 2, and then it executes the command
- display 2
- This process continues through the values 3, 4,...,10.
1272.11 Looping (capture)
- Suppose we make a mistake in the program or we
want to modify the program in some way - We first need to drop this program from STATAs
memory - program drop ten
- we can then go on to write a new program called
ten - It is good practice to drop any program that
might exist with the same name before writing a
new program - capture program drop ten
1282.11 Looping
- Our program is now
- capture program drop ten
- program ten
- forvalues i = 1(1)10 {
- display `i'
- }
- end
- To run this program we simply type ten
129Another example
- Earnings management studies often estimate abnormal accruals using the Jones model
- ACCRUALSit = a0k (1/ASSETit-1) + a1k (ΔSALESit / ASSETit-1) + a2k (PPEit / ASSETit-1) + uit
- ACCRUALSit = change in non-cash current assets minus change in non-debt current liabilities, scaled by lagged assets.
- The k sub-scripts indicate that the model is estimated separately for each industry.
- Industries are identified using Standard Industrial Classification (SIC) codes
130Another example
- The number of industries
- 10 using one digit codes,
- 100 using two digit codes,
- 1,000 using three digit codes, etc
- Your do file could be very long if you had
separate lines for each industry - Estimate abnormal accruals for SIC 1
- Estimate abnormal accruals for SIC 2
- Estimate abnormal accruals for SIC 3
- ..
- Estimate abnormal accruals for SIC 10, etc.
131Another example
- Your do file will be much shorter if you use the
looping technique
- capture program drop ab_acc
- program ab_acc
- forvalues i = 1(1)10 {
- insert commands that you want to execute on each industry SIC code
- }
- end
132Another example
- Go to http://ihome.ust.hk/accl/Phd_teaching.htm
- open accruals.dta in STATA and generate the variables we need
- the regressions will be estimated at the one-digit level
- use "C:\phd\accruals.dta", clear
- gen one_sic=int(sic/1000)
- gen ncca=current_assets-cash
- gen ndcl=current_liabilities-debt_in_current_liabilities
- sort cik year
- gen ch_ncca=ncca-ncca[_n-1] if cik==cik[_n-1]
- gen ch_ndcl=ndcl-ndcl[_n-1] if cik==cik[_n-1]
- gen accruals=(ch_ncca-ch_ndcl)/assets[_n-1] if cik==cik[_n-1]
- gen lag_assets=assets[_n-1] if cik==cik[_n-1]
- gen ppe_scaled=ppe/assets[_n-1] if cik==cik[_n-1]
- gen chsales_scaled=(sales-sales[_n-1])/assets[_n-1] if cik==cik[_n-1]
133Another example
- gen ab_acc=.
- capture program drop ab_acc
- program ab_acc
- forvalues i = 0(1)9 {
- capture reg accruals lag_assets ppe_scaled chsales_scaled if one_sic==`i'
- capture predict ab_acc`i' if one_sic==`i', resid
- replace ab_acc=ab_acc`i' if one_sic==`i'
- capture drop ab_acc`i'
- }
- end
- ab_acc
134Explaining this program
- forvalues i = 0(1)9 {
- the one_sic variable takes values from 0 to 9
- capture reg accruals lag_assets ppe_scaled chsales_scaled if one_sic==`i'
- the regressions are run at the one digit level because some industries have insufficient observations at the two-digit level
- capture predict ab_acc`i' if one_sic==`i', resid
- For each industry, I create a separate abnormal accrual variable (ab_acc1 if industry = 1, ab_acc2 if industry = 2, etc.).
- If this line had been capture predict ab_acc if one_sic==`i', resid we would not have been able to go beyond the first industry because ab_acc would already be defined
- replace ab_acc=ab_acc`i' if one_sic==`i'
- The overall abnormal accrual variable (ab_acc) equals ab_acc1 if industry = 1, ab_acc2 if industry = 2, etc.
- before starting the program I had to gen ab_acc=. in order for this replace command to work
- capture drop ab_acc`i'
- I drop ab_acc1, ab_acc2, etc. because I only need the ab_acc variable.
135Conclusion
- You should now have a good understanding of
- how OLS models work,
- how to interpret the results of OLS models,
- how to find out whether the assumptions of OLS
are violated, - how to correct the standard errors for
heteroscedasticity, time-series dependence and
cross-sectional dependence, - how to handle problems of outliers
136Conclusion
- So far, we have been discussing the case where
our dependent variable is continuous (e.g., lnaf) - When the dependent variable is not continuous, we
cannot use OLS (or quantile) regression. - The next topic considers how to estimate models
where our dependent variable is not continuous.