Title: Objectives of this class
Objectives of this class
- By the end of this class you should be able to:
- explain how OLS models work
- interpret the results of OLS models
- spot potential problems of outliers and heteroscedasticity
- correct the standard errors for heteroscedasticity
2. Ordinary least squares (OLS) regression
- 2.1 The basic idea
- 2.2 Single variable OLS
- 2.3 Correctly interpreting the coefficients
- 2.4 Examining the residuals
- 2.5 Multiple regression
- 2.6 Heteroscedasticity
2.1 The basic idea
- A simple linear regression aims to characterize the relation between one dependent variable and one independent variable using a straight line
- For example, you predict that larger companies pay higher fees
- You can formalize the effect of company size on predicted fees using a simple equation
- Predicted fees = a0 + a1 Size
- The parameter a0 represents what fees are expected to be in the case that Size = 0.
- The parameter a1 captures the impact of a one-unit increase in Size on expected fees.
2.1 The basic idea
- The parameters a0 and a1 are assumed to be the same for all observations and they are called regression coefficients.
- However, company size is not the only variable that affects audit fees. For example, the complexity of the audit engagement, or the size of the audit firm, may also matter.
- You do not know all the factors that influence fees, so the predicted fee that you calculate from the above equation will differ from the actual fee.
2.1 The basic idea
- The deviation between the predicted fee and the actual fee is called the residual. You can represent the relation between actual fees and predicted fees in the following way
- Actual fees = Predicted fees + e
- where e represents the residual term (i.e., the difference between actual and predicted fees)
2.1 The basic idea
- Putting the two together we can express actual fees using the following equation
- Actual fees = a0 + a1 Size + e
- The goal of regression analysis is to estimate the parameters a0 and a1
2.1 The basic idea
- One of the simplest techniques involves estimating the coefficients using ordinary least squares (OLS) regression.
- The objective of OLS is to make the difference between the predicted and actual values as small as possible
2.1 The basic idea
- First let's start with a very small and easy-to-visualize dataset
- Go to http://ihome.ust.hk/accl/Phd_teaching.htm
- Download ols.dta to your hard drive and open it in STATA (use "C:\phd\ols.dta", clear) or open it directly from the internet (use "http://ihome.ust.hk/accl/ols.dta", clear)
2.1 The basic idea
- Examine a scatter plot of the two variables
- twoway (scatter y x) (lfit y x)
2.1 The basic idea
- This line is fitted by minimizing the sum of the squared differences between the observed and predicted values of y (known as the residual sum of squares, RSS)
- The assumptions required to obtain the coefficients are
- The relation between y and x is linear
- The x variable is uncorrelated with the residuals
- The residuals have a mean value of zero
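A minimal sketch of what minimizing the RSS amounts to when there is one regressor and a constant: the slope equals Cov(x, y)/Var(x) and the intercept equals mean(y) minus the slope times mean(x). The eleven observations typed in below are Anscombe's first dataset, used as a stand-in; that ols.dta contains exactly these rows is an assumption, and the scalar names (b0, b1, xbar, ybar) are illustrative.

```stata
* Hypothetical stand-in data (Anscombe's first dataset); not read from ols.dta
clear
input x y
10 8.04
8 6.95
13 7.58
9 8.81
11 8.33
14 9.96
6 7.24
4 4.26
12 10.84
7 4.82
5 5.68
end

* Closed-form OLS solution for one regressor plus a constant
quietly correlate y x, covariance
scalar b1 = r(cov_12)/r(Var_2)      // slope = Cov(y,x)/Var(x)
quietly summarize y
scalar ybar = r(mean)
quietly summarize x
scalar xbar = r(mean)
scalar b0 = ybar - b1*xbar          // intercept = mean(y) - slope*mean(x)
display "by hand:  a0 = " b0 "   a1 = " b1

* The regress command should return the same two numbers
quietly regress y x
display "regress:  a0 = " _b[_cons] "   a1 = " _b[x]
```

The two display lines should agree, which is the sense in which regress "fits the line" shown in the scatter plot.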
2.2 Single variable OLS (regress)
- We estimate the model using the regress command
- regress y x
- The first variable (y) is the dependent variable while the second (x) is the independent variable
2.2 Single variable OLS (regress)
- This gives the following output
2.2 Single variable OLS (regress)
- The coefficient estimates are 3.000 for the a0 parameter and 0.500 for the a1 parameter
- We can use these to predict the values of Y for any given value of X. For example, we can predict what Y will be when X = 5 using the display command, which performs simple calculations
- display 3.000091+0.5000909*5
2.2 Single variable OLS (_b)
- Actually, we do not need to type the coefficient estimates because STATA will remember them for us. They are stored by STATA using the name _b[varname], where varname is replaced with the name of the independent variable or the constant (_cons)
- display _b[_cons]+_b[x]*5
2.2 Single variable OLS
- Note that the predicted value of y when x equals 5 differs from the actual value
- list y if x==5
- The actual value is 5.68 compared to the predicted value of 5.50. The difference is the residual error that arises because x is not a perfect predictor of y.
2.2 Single variable OLS
When x = 5, the actual value of y is 5.68 compared to the predicted y value of 5.50. The residual prediction error is the vertical distance between the observation and the line.
2.2 Single variable OLS
- If we want to compute the predicted value of y for each value of x in our dataset, we can use the saved coefficients
- gen y_hat=_b[_cons]+_b[x]*x
- The estimated residuals are the difference between the observed y values and the predicted y values
- gen y_res=y-y_hat
- list x y_hat y y_res
2.2 Single variable OLS (predict)
- A quicker way to do this would be to use the predict command after regress
- predict yhat
- predict yres, resid
- Checking that this gives the same answer
- list yhat y_hat yres y_res
- You should note that the values of x, yhat and yres correspond with those found on the scatter graph
- sort x
- list x y y_hat y_res
2.2 Single variable OLS
- By construction, there is zero correlation between the x variable and the residuals
- twoway (scatter y_res x) (lfit y_res x) or reg y_res x
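The two properties of OLS residuals (zero mean and zero correlation with x) can be checked numerically as well as graphically. The sketch below uses the auto dataset that ships with Stata as a stand-in, which is an assumption made so the example runs without the class files; price and weight play the roles of y and x.

```stata
sysuse auto, clear                  // stand-in data shipped with Stata
quietly regress price weight
predict res, residuals

* Both numbers are zero up to rounding error, whatever the data
summarize res
correlate res weight
```

Because these properties hold by construction, they cannot be used as evidence that the model is well specified.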
2.2 Single variable OLS
- Standard errors
- Typically our data comprises a sample that is taken from a larger population
- The coefficients are only estimates of the true a0 and a1 values that describe the entire population
- If we obtained a second random sample from the same population, we would obtain different coefficient estimates for a0 and a1
2.2 Single variable OLS
- We therefore need a way to describe the variability that would be obtained if we were to apply OLS to many different samples
- Equivalently, we want a measure of how precisely our coefficients are estimated
- The solution is to calculate standard errors, which are estimates of the standard deviations of the coefficient estimates across repeated samples
- Standard errors (SEs) allow us to perform statistical tests, e.g., is our estimate of a1 significantly greater than zero?
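The idea of "many different samples" can be illustrated by simulation. The sketch below is not part of the class material: it assumes a hypothetical population in which y = 3 + 0.5x + noise, draws 1,000 samples of 50 observations, and stores the slope and its reported standard error from each. The program name onesample and the variable names b1 and se1 are illustrative.

```stata
clear all
set seed 2024

* Hypothetical data-generating process: y = 3 + 0.5*x + noise, n = 50
program define onesample, rclass
    drop _all
    set obs 50
    generate x = rnormal()
    generate y = 3 + 0.5*x + rnormal()
    regress y x
    return scalar b1  = _b[x]       // slope estimate in this sample
    return scalar se1 = _se[x]      // its reported standard error
end

* Draw 1,000 independent samples and keep the slope and its SE from each
simulate b1=r(b1) se1=r(se1), reps(1000) nodots: onesample

* Compare the spread of the slope estimates with the average reported SE
summarize b1 se1
```

In the summarize output, the standard deviation of b1 across samples should be close to the mean of se1, which is what a standard error is meant to estimate when the OLS assumptions hold.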
2.2 Single variable OLS
- The techniques for estimating standard errors are based on OLS assumptions that often do not hold in practice
- Homoscedasticity (i.e., the residuals have a constant variance)
- Non-correlation (i.e., the residuals are not correlated with each other)
- Normality (i.e., the residuals are normally distributed)
2.2 Single variable OLS
- The t-statistic is obtained by dividing the
coefficient estimate by the standard error
2.2 Single variable OLS
- The p-values are from the t-distribution and they tell you how likely it is that you would have observed an estimate at least as far from zero as the estimated coefficient under the assumption that the true coefficient in the population is zero.
- The p-value of 0.002 tells you that such an estimate would be very unlikely (prob = 0.2%) if the true coefficient on x were zero.
- The confidence intervals mean you can be 95% confident that the true coefficient of x lies between 0.23337 and 0.76681.
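The t-statistic, p-value and confidence interval in the regress output can be rebuilt from the stored coefficient and standard error. The sketch below uses Stata's bundled auto dataset as a stand-in (an assumption; it is not the class data), and the scalar names tstat, pval, tcrit, ci_lo and ci_hi are illustrative.

```stata
sysuse auto, clear                  // stand-in data shipped with Stata
regress price weight

scalar tstat = _b[weight]/_se[weight]           // coefficient / standard error
scalar pval  = 2*ttail(e(df_r), abs(tstat))     // two-sided p-value, n-k df
scalar tcrit = invttail(e(df_r), 0.025)         // critical value for a 95% interval
scalar ci_lo = _b[weight] - tcrit*_se[weight]
scalar ci_hi = _b[weight] + tcrit*_se[weight]

display "t = " tstat "   p = " pval
display "95% CI: " ci_lo " to " ci_hi
```

The displayed values should match the t, P>|t| and confidence-interval columns printed by regress for the weight row.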
2.2 Single variable OLS
- The total sum of squares (TSS) = 41.27
- The explained sum of squares (ESS) = 27.51
- The residual sum of squares (RSS) = 13.76
- Note that TSS = ESS + RSS.
2.2 Single variable OLS
- The column labeled df contains the number of degrees of freedom
- For the ESS, df = k-1 where k = number of regression coefficients (df = 2 - 1)
- For the RSS, df = n - k where n = number of observations (= 11 - 2)
- For the TSS, df = n-1 (= 11 - 1)
- The last column (MS) reports the ESS, RSS and TSS divided by their respective degrees of freedom
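After regress, Stata keeps the pieces of this table as stored results, so the decomposition can be checked directly. A minimal sketch, again assuming the bundled auto dataset as a stand-in for the class data; the scalar names ess, rss and tss are illustrative.

```stata
sysuse auto, clear                  // stand-in data shipped with Stata
quietly regress price weight
scalar ess = e(mss)                 // explained (model) sum of squares
scalar rss = e(rss)                 // residual sum of squares

* TSS computed independently as (n-1) times the sample variance of y
quietly summarize price
scalar tss = r(Var)*(r(N)-1)

display "ESS + RSS = " (ess+rss) "   TSS = " tss
display "df: model = " e(df_m) "   residual = " e(df_r) "   total = " (e(N)-1)
display "MS: model = " (ess/e(df_m)) "   residual = " (rss/e(df_r))
```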
2.2 Single variable OLS
- The first number simply tells us how many observations are used to estimate the model
- The other statistics here tell you how well the model explains the variation in Y
2.2 Single variable OLS
- R-squared = ESS / TSS (= 27.51 / 41.27 = 0.666)
- So x explains about 67% of the variation in y.
- Unfortunately, many researchers in accounting (and other fields) evaluate the quality of a model by looking only at the R-squared.
- This is not only invalid; it is also very dangerous (I will explain why later)
2.2 Single variable OLS
- One problem with the R-squared is that it never decreases when an independent variable is added, even one that has very little explanatory power.
- The adjusted R-squared corrects for this by accounting for the number of model parameters, k, that need to be estimated
- Adj R-squared = 1-(1-R2)(n-1)/(n-k) = 1-(1-.666)(10)/9 = 0.629
- In fact the adjusted R-squared can even take on negative values. For example, suppose that y and x are uncorrelated, in which case the unadjusted R-squared is zero
- Adj R-squared = 1-(n-1)/(n-2) = (n-2-n+1)/(n-2) = -1/(n-2)
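The arithmetic on this slide can be reproduced with the display command, using only the numbers reported in the regression output (ESS = 27.51, TSS = 41.27, n = 11, k = 2).

```stata
* R-squared = ESS/TSS
display 27.51/41.27

* Adjusted R-squared = 1-(1-R2)(n-1)/(n-k) with R2 = 0.666, n = 11, k = 2
display 1 - (1 - 0.666)*(11 - 1)/(11 - 2)

* Adjusted R-squared when the unadjusted R-squared is zero: -1/(n-2)
display 1 - (1 - 0)*(11 - 1)/(11 - 2)
```

These evaluate to roughly 0.667, 0.629 and -0.111, the last showing that the adjusted R-squared is negative when x has no explanatory power.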
2.2 Single variable OLS
- You might think that another way to measure the fit of the model is to add up the residuals. However, by construction, the residuals will always sum to zero when the model includes a constant.
- An alternative is to square the residuals, add them up (giving the RSS), divide by the degrees of freedom and then take the square root.
- Root MSE = square root of RSS/(n-k) = [13.76 / (11-2)]^0.5 = 1.236
- One way to interpret the root MSE is that it shows how far away on average the model is from explaining y
- The F-statistic = (ESS/(k-1))/(RSS/(n-k)) = (27.51 / 1)/(13.76/9) = 17.99
- The F-statistic is used to test whether the R-squared is significantly greater than zero (i.e., are the independent variables jointly significant?)
- Prob > F gives the probability that an R-squared at least as large as the one we calculated would be observed if the true R-squared in the population were actually equal to zero
- This F test is used to test the overall statistical significance of the regression model
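The root MSE, the F-statistic and Prob > F can be checked with display, using the sums of squares and degrees of freedom reported in the output. Ftail() is Stata's upper-tail probability function for the F distribution.

```stata
* Root MSE = sqrt(RSS/(n-k)) with RSS = 13.76, n = 11, k = 2
display sqrt(13.76/(11 - 2))

* F = (ESS/(k-1))/(RSS/(n-k)) with ESS = 27.51
display (27.51/(2 - 1))/(13.76/(11 - 2))

* Prob > F: upper-tail probability of an F(1, 9) variable at 17.99
display Ftail(1, 9, 17.99)
```

With a single independent variable the F test and the t test on a1 are equivalent, so the last number should be close to the p-value reported for x.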
Class exercise 1
- Import Fees.dta into STATA from http://ihome.ust.hk/accl/Phd_teaching.htm
- Run the following two regressions
- audit fees on total assets
- the log of audit fees on the log of total assets
- What does the output of your regression mean?
- Which model appears to have the better fit?
2.3 Correctly interpreting the coefficients
- So far we have considered the case where the independent variable is continuous.
- Interpretation of results is even more straightforward when the independent variable is a dummy.
- reg auditfees big6
- ttest auditfees, by(big6)
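Why the dummy case is straightforward: the coefficient on a 0/1 variable is the difference between the two group means, and the constant is the mean of the group coded 0. The sketch below shows this with Stata's bundled auto dataset as a stand-in (an assumption; price and foreign take the place of auditfees and big6), and the scalar names mean0, mean1 and gap are illustrative.

```stata
sysuse auto, clear                  // stand-in data shipped with Stata
regress price foreign               // foreign is a 0/1 dummy
ttest price, by(foreign)

* The dummy's coefficient equals the difference between the two group means
quietly summarize price if foreign == 1
scalar mean1 = r(mean)
quietly summarize price if foreign == 0
scalar mean0 = r(mean)
scalar gap = mean1 - mean0
display "coefficient = " _b[foreign] "   difference in means = " gap
display "constant = " _b[_cons] "   mean of group 0 = " mean0
```

With the default equal-variance ttest, the t-statistic has the same magnitude as the one regress reports for the dummy.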
- Sometimes even published studies give an incorrect interpretation of the estimated coefficients. For example, consider the study by Blackwell et al. discussed in the questions that follow.
- Class questions
- Theoretically, how should auditing affect the interest rate that the company has to pay?
- Empirically, how do we measure the impact of auditing on the interest rate using eq. (1)?
- Class question: At what values of total assets (in thousands) is the effect of the Audit Dummy on the interest rate negative, zero, or positive?
- Class questions
- What is the mean value of total assets?
- How does auditing affect the interest rate for the average company in their sample?
- Verify that the above claim is true.
- Suppose Blackwell et al. had reported the impact for a firm with 11m in assets and a firm with 15m in assets.
- How would this have changed the conclusions drawn?
2.4 Examining the residuals
- Go to http://ihome.ust.hk/accl/Phd_teaching.htm
- Import anscombe.dta into STATA (use "C:\phd\anscombe.dta", clear) and run the following regressions
- reg y1 x1
- reg y2 x2
- reg y3 x3
- reg y4 x4
- Note that the output from these regressions is virtually identical
- intercept = 3.0 (t-stat = 2.67)
- x coefficient = 0.5 (t-stat = 4.24)
- R-squared = 67%
- If you did not know about OLS assumptions, you would probably stop your analysis at this point, concluding that you have a good fit for all four models.
- In fact, only one of these four models is well specified.
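If the class file is not to hand, the four regressions can be reproduced by typing in Anscombe's published quartet. That anscombe.dta uses exactly these values and the variable names x1-x4 and y1-y4 is an assumption based on the commands above.

```stata
* Anscombe's quartet entered by hand (assumed to match anscombe.dta)
clear
input x1 y1 x2 y2 x3 y3 x4 y4
10 8.04 10 9.14 10 7.46 8 6.58
8 6.95 8 8.14 8 6.77 8 5.76
13 7.58 13 8.74 13 12.74 8 7.71
9 8.81 9 8.77 9 7.11 8 8.84
11 8.33 11 9.26 11 7.81 8 8.47
14 9.96 14 8.10 14 8.84 8 7.04
6 7.24 6 6.13 6 6.08 8 5.25
4 4.26 4 3.10 4 5.39 19 12.50
12 10.84 12 9.13 12 8.15 8 5.56
7 4.82 7 7.26 7 6.42 8 7.91
5 5.68 5 4.74 5 5.73 8 6.89
end

* Run the four regressions and print the headline numbers side by side
forvalues i = 1/4 {
    quietly regress y`i' x`i'
    display "model `i': a0 = " %5.3f _b[_cons] "  a1 = " %5.3f _b[x`i'] "  R-squared = " %5.3f e(r2)
}
```

The loop prints one line per model, which makes the near-identical coefficients and R-squared values easy to compare before looking at the scatter plots.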
Class exercise 2
- Draw scatter graphs for each of these four associations. For example
- twoway (scatter y1 x1) (lfit y1 x1)
- Of the four models, which do you think is the well specified one?
- Draw scatter graphs of the residuals against the x variable for each of the four regressions. Is there a pattern?
- Which of the OLS assumptions are violated in these four regressions?
2.4 Examining the residuals
- Unfortunately, it is common among researchers to judge whether a model is well specified solely in terms of its explanatory power (i.e., the R-squared).
- You should report other types of diagnostic tests
- is there significant heteroscedasticity?
- is there any pattern to the residuals?
- are there any problems of outliers?
2.4 Examining the residuals
- An examination of the residuals can help us to identify whether the model is well specified. For example
- use "C:\phd\Fees.dta", clear
- reg auditfees totalassets
- predict res1, resid
- twoway (scatter res1 totalassets, msize(tiny)) (lfit res1 totalassets)
- gen lnaf=ln(auditfees)
- gen lnta=ln(totalassets)
- reg lnaf lnta
- predict res2, resid
- twoway (scatter res2 lnta, msize(tiny)) (lfit res2 lnta)
- Notice that the residuals are more spherical, displaying less of an obvious pattern, in the logged model.
Class exercise 3
- Following Pong and Whittington (1994), estimate the raw value of audit fees as a function of raw assets and assets squared
- Examine the residuals
- Do you think this model is better specified than the one in logs?
2.5 Multiple regression
- Researchers use multiple regression when they believe that Y is affected by multiple independent variables
- Y = a0 + a1 X1 + a2 X2 + e
- Why is it important to control for multiple factors that influence Y?
2.5 Multiple regression
- Suppose the true model is
- Y = a0 + a1 X1 + a2 X2 + e
- where X1 and X2 are uncorrelated with the error, e
- Suppose the OLS model that we estimate is
- Y = a0 + a1 X1 + u
- where u = a2 X2 + e
- OLS imposes the assumption that X1 is uncorrelated with the residual term, u.
- Since X1 is uncorrelated with e, the assumption that X1 is uncorrelated with u is equivalent to assuming that X1 is uncorrelated with X2.
2.5 Multiple regression
- If X1 is correlated with X2, the OLS estimate of a1 is biased.
- The magnitude of the bias depends upon the strength of the correlation between X1 and X2 and on the size of a2.
- Of course, we often do not know whether the model we estimate is the true model
- In other words, we are unsure whether there is an omitted variable (X2) that affects Y and that is correlated with our variable of interest (X1)
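A small simulation makes the bias visible. Everything in the sketch below is hypothetical: the true model is assumed to be Y = 1 + 2 X1 + 3 X2 + e, with X1 built to be correlated with X2. For this design the short regression's slope converges to a1 + a2 Cov(X1, X2)/Var(X1) = 2 + 3(0.5/1.25) = 3.2 rather than 2.

```stata
clear
set seed 12345
set obs 10000

* Hypothetical population: true a1 = 2, true a2 = 3
generate x2 = rnormal()
generate x1 = 0.5*x2 + rnormal()            // x1 is correlated with x2
generate y  = 1 + 2*x1 + 3*x2 + rnormal()

regress y x1 x2      // correctly specified: coefficient on x1 is close to 2
regress y x1         // x2 omitted: coefficient on x1 is pushed towards 3.2
```

Setting the 0.5 in the x1 line to zero removes the correlation, and the short regression then recovers a coefficient close to 2 even though X2 is still omitted.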
2.5 Multiple regression
- We can judge whether or not there is likely to be a correlated omitted variable problem using
- theory
- prior empirical studies
2.5 Multiple regression
- Previously, we checked whether there was a pattern between the residuals and one independent variable
- lnaf = a0 + a1 lnta + res2
- twoway (scatter res2 lnta) (lfit res2 lnta)
- When we are using multiple regression, we want to test whether there is a pattern between the residuals and the right hand side variables as a whole
- The right hand side of the equation is the same thing as the predicted value of the dependent variable
2.5 Multiple regression
- So we should examine whether there is a pattern between the residuals and the predicted values of the dependent variable
- STATA has a command which enables us to examine the relation between the residuals and the fitted (i.e., predicted) values
- reg lnaf lnta big6
- rvfplot
- rvf stands for residuals versus fitted
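What rvfplot draws can be built by hand from predict, which helps when you want to customize the graph. A minimal sketch using Stata's bundled auto dataset as a stand-in (an assumption; it is not the class data); the variable names fitted and resid are illustrative.

```stata
sysuse auto, clear                  // stand-in data shipped with Stata
quietly regress price weight foreign

predict fitted, xb                  // predicted values of the dependent variable
predict resid, residuals            // observed minus predicted

* Manual equivalent of rvfplot, with a reference line at zero
twoway (scatter resid fitted), yline(0)
```

A well-specified model shows a shapeless cloud around the zero line; a funnel or curve suggests heteroscedasticity or a wrong functional form.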
2.6 Heteroscedasticity (hettest)
- The OLS techniques for estimating standard errors are based on an assumption that the variance of the errors is the same for all values of the independent variables (homoscedasticity)
- In many cases, the homoscedasticity assumption is violated. For example
- reg auditfees totalassets big6
- rvfplot
- The homoscedasticity assumption can be tested using the hettest command after we do the regression
- reg auditfees totalassets big6
- hettest
- Heteroscedasticity does not bias the coefficient estimates, but it biases the standard errors of the coefficients, often downwards (so that the t-stats are biased upwards)
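In recent versions of Stata the same test is run as estat hettest, and its results are left behind as stored results that can be reused. The sketch below assumes the bundled auto dataset as a stand-in for Fees.dta.

```stata
sysuse auto, clear                  // stand-in data shipped with Stata
regress price weight foreign

* Breusch-Pagan / Cook-Weisberg test; the null hypothesis is constant variance
estat hettest
display "chi2 = " r(chi2) "   p-value = " r(p)
```

A small p-value rejects the null of homoscedasticity, in which case the default standard errors should not be relied on.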
2.6 Heteroscedasticity (robust)
- Heteroscedasticity often occurs if the dependent variable is not symmetrically distributed
- For example, the auditfees variable is highly skewed due to the fact that it has a lower bound of zero
- Much of the heteroscedasticity can often be removed by transforming the dependent variable (e.g., use the log of audit fees instead of the raw values)
2.6 Heteroscedasticity (robust)
- When you find that there is heteroscedasticity, you need to correct the standard errors using the Huber/White/sandwich estimator
- In STATA it is easy to do this adjustment using the robust option
- reg auditfees totalassets big6, robust
- Compare the adjusted and unadjusted results
- reg auditfees totalassets big6
- Note that the coefficients are exactly the same
- The t-statistics on the independent variables are much smaller when the standard errors are adjusted for heteroscedasticity
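One convenient way to make this comparison is to store both sets of estimates and print them side by side. A sketch using Stata's bundled auto dataset as a stand-in (an assumption; the size and direction of the change in standard errors will differ for Fees.dta); the names ols, rob and the b_/se_ scalars are illustrative.

```stata
sysuse auto, clear                  // stand-in data shipped with Stata

quietly regress price weight foreign
estimates store ols
scalar b_ols  = _b[weight]
scalar se_ols = _se[weight]

quietly regress price weight foreign, robust
estimates store rob
scalar b_rob  = _b[weight]
scalar se_rob = _se[weight]

* Coefficients with standard errors underneath, one column per model
estimates table ols rob, b(%9.3f) se(%9.3f)
display "coefficient on weight: " b_ols " vs " b_rob
display "standard error:        " se_ols " vs " se_rob
```

The robust option changes only the standard errors (and therefore the t-statistics, p-values and confidence intervals), never the coefficients.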
Class exercise 4
- Estimate the audit fee model in logs rather than absolute values
- Using rvfplot, assess whether the variance of the residuals appears to be non-constant
- Using hettest, provide a formal test for heteroscedasticity
- Compare the coefficients and t-statistics when you estimate the standard errors with and without adjusting for heteroscedasticity.
Conclusion
- You should now understand
- how OLS models work
- how to interpret the results of OLS models
- how to spot potential problems of outliers and heteroscedasticity
- how to correct the standard errors for heteroscedasticity