Objectives of this class

Slides: 57

Provided by: accl5

1
Objectives of this class

By the end of this class you should be able to:
explain how OLS models work
interpret the results of OLS models
spot potential problems of outliers and
heteroscedasticity
correct the standard errors for heteroscedasticity

2
2. Ordinary least squares (OLS) regression

2.1 The basic idea
2.2 Single variable OLS
2.3 Correctly interpreting the coefficients
2.4 Examining the residuals
2.5 Multiple regression
2.6 Heteroscedasticity

3
2.1 The basic idea

A simple linear regression aims to characterize
the relation between one dependent variable and
one independent variable using a straight line.
For example, you predict that larger companies
pay higher fees.
You can formalize the effect of company size on
predicted fees using a simple equation:
$\widehat{Fees} = a_0 + a_1 \, Size$
The parameter a0 represents what fees are
expected to be in the case that Size = 0.
The parameter a1 captures the impact of a
one-unit increase in Size on expected fees.

4
2.1 The basic idea

The parameters a0 and a1 are assumed to be the
same for all observations and they are called
regression coefficients.
However, company size is not the only variable
that affects audit fees. For example, the
complexity of the audit engagement, or the size
of the audit firm may also matter.
You do not know all the factors that influence
fees, so the predicted fee that you calculate
from the above equation will differ from the
actual fee.

5
2.1 The basic idea

The deviation between the predicted fee and the
actual fee is called the residual. You can
represent the relation between actual fees and
predicted fees in the following way:
$Fees = \widehat{Fees} + e$
where e represents the residual term (i.e., the
difference between actual and predicted fees)

6
2.1 The basic idea

Putting the two together we can express actual
fees using the following equation:
$Fees = a_0 + a_1 \, Size + e$
The goal of regression analysis is to estimate
the parameters a0 and a1

7
2.1 The basic idea

One of the simplest techniques involves
estimating the coefficients using ordinary least
squares (OLS) regression.
The objective of OLS is to make the difference
between the predicted and actual values as small
as possible

8
2.1 The basic idea

First, let's start with a very small and easy to
visualize dataset
Go to http://ihome.ust.hk/accl/Phd_teaching.htm
Download ols.dta to your hard drive and open in
STATA (use "C:\phd\ols.dta", clear) or open
directly from the internet (use
"http://ihome.ust.hk/accl/ols.dta", clear)

9
2.1 The basic idea

Examine a scatter plot of the two variables
twoway (scatter y x) (lfit y x)

10
2.1 The basic idea

This line is fitted by minimizing the sum of the
squared differences between the observed and
predicted values of y (known as the residual sum
of squares, RSS)

The assumptions required to obtain the
coefficients are
The relation between y and x is linear
The x variable is uncorrelated with the residuals
The residuals have a mean value of zero
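As a rough illustration of the least-squares idea on this slide, the Python sketch here computes the intercept and slope that minimize the RSS, using the closed-form solution for a single regressor. The 11 observations are the first set of Anscombe's quartet, which is only assumed to match ols.dta; the helper names fit_ols and rss are hypothetical.

```python
# Assumption: Anscombe's first dataset stands in for ols.dta.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def fit_ols(x, y):
    """Return (a0, a1) that minimize the RSS of y = a0 + a1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    a1 = s_xy / s_xx            # slope
    a0 = y_bar - a1 * x_bar     # intercept: the line passes through the means
    return a0, a1


def rss(a0, a1, x, y):
    """Residual sum of squares for a candidate line."""
    return sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))


a0, a1 = fit_ols(x, y)
print(f"a0 = {a0:.4f}, a1 = {a1:.4f}")
print(f"RSS at the OLS estimates:    {rss(a0, a1, x, y):.3f}")
print(f"RSS after nudging the slope: {rss(a0, a1 + 0.05, x, y):.3f}")
```

Any other line, such as the one with the nudged slope, gives a larger RSS than the OLS line.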

11
2.2 Single variable OLS (regress)

We estimate the model using the regress command
regress y x
The first variable (y) is the dependent variable
while the second (x) is the independent variable

12
2.2 Single variable OLS (regress)

This gives the following output

13
2.2 Single variable OLS (regress)

The coefficient estimates are 3.000 for the a0
parameter and 0.500 for the a1 parameter

We can use these to predict the values of Y for
any given value of X. For example, we can
predict what Y will be when X = 5 using the
display command which performs simple
calculations
display 3.000091+0.5000909*5

14
2.2 Single variable OLS (_b)

Actually, we do not need to type the coefficient
estimates because STATA will remember them for
us. They are stored by STATA using the name
_b[varname] where varname is replaced with the
name of the independent variable or the constant
(_cons)
display _b[_cons]+_b[x]*5

15
2.2 Single variable OLS

Note that the predicted value of y when x equals
5 differs from the actual value
list y if x==5
The actual value is 5.68 compared to the
predicted value of 5.50. The difference is the
residual error that arises because x is not a
perfect predictor of y.

16
2.2 Single variable OLS
When x = 5, the actual value y is 5.68 compared
to the predicted y value of 5.50. The residual
prediction error is the vertical distance between
the observation and the line.
17
2.2 Single variable OLS

If we want to compute the predicted value of y
for each value of x in our dataset, we can use
the saved coefficients
gen y_hat=_b[_cons]+_b[x]*x
The estimated residuals are the difference
between the observed y values and the predicted y
values
gen y_res=y-y_hat
list x y_hat y y_res

18
2.2 Single variable OLS (predict)

A quicker way to do this would be to use the
predict command after regress
predict yhat
predict yres, resid
Checking that this gives the same answer
list yhat y_hat yres y_res
You should note that the values of x, yhat and
yres correspond with those found on the scatter
graph
sort x
list x y y_hat y_res

20
2.2 Single variable OLS

By construction, there is zero correlation
between the x variable and the residuals
twoway (scatter y_res x) (lfit y_res x)
or
reg y_res x
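The steps performed with gen and predict can be sketched in Python to check the "by construction" claim numerically. This is a minimal sketch that again assumes Anscombe's first dataset stands in for ols.dta; the variable names simply echo the Stata ones.

```python
# Assumption: Anscombe's first dataset stands in for ols.dta.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
a1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
a0 = y_bar - a1 * x_bar

# Equivalent of: gen y_hat=_b[_cons]+_b[x]*x  and  gen y_res=y-y_hat
y_hat = [a0 + a1 * xi for xi in x]
y_res = [yi - yh for yi, yh in zip(y, y_hat)]

# Prediction for x = 5, as in the display example
pred_at_5 = a0 + a1 * 5

# The residuals sum to zero and have zero sample covariance with x
sum_res = sum(y_res)
cov_x_res = sum((xi - x_bar) * ri for xi, ri in zip(x, y_res)) / (n - 1)

print(f"predicted y at x = 5: {pred_at_5:.2f}")
print(f"sum of residuals:     {sum_res:.2e}")
print(f"cov(x, residuals):    {cov_x_res:.2e}")
```

Both quantities are zero up to floating-point rounding, which is why the fitted line in a plot of the residuals against x is flat.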

21
2.2 Single variable OLS

Standard errors
Typically our data comprises a sample that is
taken from a larger population
The coefficients are only estimates of the true
a0 and a1 values that describe the entire
population
If we obtained a second random sample from the
same population, we would obtain different
coefficient estimates for a0 and a1

22
2.2 Single variable OLS

We therefore need a way to describe the
variability that would obtain if we were to apply
OLS to many different samples
Equivalently, we want a measure of how
precisely our coefficients are estimated
The solution is to calculate standard errors,
which estimate the standard deviations of the
coefficient estimates across repeated samples
Standard errors (SEs) allow us to perform
statistical tests, e.g., is our estimate of a1
significantly greater than zero?

23
2.2 Single variable OLS

The techniques for estimating standard errors are
based on OLS assumptions that often do not hold
in practice
Homoscedasticity (i.e., the residuals have a
constant variance)
Non-correlation (i.e., the residuals are not
correlated with each other)
Normality (i.e., the residuals are normally
distributed)

24
2.2 Single variable OLS

The t-statistic is obtained by dividing the
coefficient estimate by the standard error

25
2.2 Single variable OLS

The p-values are from the t-distribution and they
tell you how likely it is that you would have
observed a coefficient estimate at least as far
from zero as the one estimated, under the
assumption that the true coefficient in the
population is zero.
The p-value of 0.002 tells you that an estimate
this extreme would be very unlikely (prob = 0.2%)
if the true coefficient on x were zero.
The confidence intervals mean you can be 95%
confident that the true coefficient of x lies
between 0.23337 and 0.76681.
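The standard error, t-statistic and confidence interval discussed on these slides can be reproduced by hand. The sketch here uses the conventional (homoscedastic) formulas and again assumes Anscombe's first dataset stands in for ols.dta. The critical value 2.262157 is the 97.5th percentile of the t-distribution with 9 degrees of freedom, hard-coded because the Python standard library has no t-distribution.

```python
import math

# Assumption: Anscombe's first dataset stands in for ols.dta.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n, k = len(x), 2                      # observations, estimated coefficients
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
a1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a0 = y_bar - a1 * x_bar

rss = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))
mse = rss / (n - k)                   # estimated residual variance

# Conventional OLS standard errors for slope and intercept
se_a1 = math.sqrt(mse / s_xx)
se_a0 = math.sqrt(mse * (1 / n + x_bar ** 2 / s_xx))

# t-statistic = coefficient estimate / standard error
t_a1 = a1 / se_a1
t_a0 = a0 / se_a0

# 95% confidence interval for the slope (t critical value with n - k = 9 df)
T_CRIT = 2.262157
ci_low = a1 - T_CRIT * se_a1
ci_high = a1 + T_CRIT * se_a1

print(f"slope:     {a1:.4f}  se {se_a1:.4f}  t {t_a1:.2f}")
print(f"intercept: {a0:.4f}  se {se_a0:.4f}  t {t_a0:.2f}")
print(f"95% CI for slope: [{ci_low:.5f}, {ci_high:.5f}]")
```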

26
2.2 Single variable OLS

The total sum of squares (TSS) = 41.27
The explained sum of squares (ESS) = 27.51
The residual sum of squares (RSS) = 13.76
Note that TSS = ESS + RSS.

27
2.2 Single variable OLS

The column labeled df contains the number of
degrees of freedom
For the ESS, df = k-1 where k = number of
regression coefficients (df = 2 - 1)
For the RSS, df = n - k where n = number of
observations (= 11 - 2)
For the TSS, df = n-1 (= 11 - 1)
The last column (MS) reports the ESS, RSS and TSS
divided by their respective degrees of freedom

28
2.2 Single variable OLS

The first number simply tells us how many
observations are used to estimate the model
The other statistics here tell you how well the
model explains the variation in Y

29
2.2 Single variable OLS

R-squared = ESS / TSS (= 27.51 / 41.27 = 0.666)
So x explains about 66.6% of the variation in y.
Unfortunately, many researchers in accounting
(and other fields) evaluate the quality of a
model by looking only at the R-squared.
This is not only invalid but also very
dangerous (I will explain why later)

30
2.2 Single variable OLS

One problem with the R-squared is that it never
decreases when an independent variable is added,
even one that has very little explanatory power.
The adjusted R-squared corrects for this by
accounting for the number of model parameters, k,
that need to be estimated
$\text{Adj } R^2 = 1 - (1 - R^2)\frac{n-1}{n-k} = 1 - (1 - 0.666)\frac{10}{9} = 0.629$
In fact the adjusted R-squared can even take on
negative values. For example, suppose that y and
x are uncorrelated in which case the unadjusted
R-squared is zero
$\text{Adj } R^2 = 1 - \frac{n-1}{n-2} = \frac{n-2-n+1}{n-2} = -\frac{1}{n-2}$

31
2.2 Single variable OLS

You might think that another way to measure the
fit of the model is to add up the residuals.
However, by definition, the residuals will always
sum to zero.
An alternative is to square the residuals, add
them up (giving the RSS), divide by the degrees
of freedom and then take the square root.
$\text{Root MSE} = \sqrt{RSS/(n-k)} = \sqrt{13.76/(11-2)} = 1.236$
One way to interpret the root MSE is that it
shows how far away on average the model is from
explaining y
$F = \frac{ESS/(k-1)}{RSS/(n-k)} = \frac{27.51/1}{13.76/9} = 17.99$
The F-statistic is used to test whether the
R-squared is significantly greater than zero
(i.e., are the independent variables jointly
significant?)
Prob > F gives the probability that an R-squared
at least as large as the one we calculated would
be observed if the true R-squared in the
population is actually equal to zero
This F test is used to test the overall
statistical significance of the regression model
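The fit statistics in the regression header can be rebuilt from the three sums of squares. The sketch here does so in Python, again under the assumption that Anscombe's first dataset stands in for ols.dta; all variable names are illustrative.

```python
import math

# Assumption: Anscombe's first dataset stands in for ols.dta.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n, k = len(x), 2
x_bar = sum(x) / n
y_bar = sum(y) / n
a1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
a0 = y_bar - a1 * x_bar
y_hat = [a0 + a1 * xi for xi in x]

# Sums of squares: total, explained, residual
tss = sum((yi - y_bar) ** 2 for yi in y)
ess = sum((yh - y_bar) ** 2 for yh in y_hat)
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r2 = ess / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
root_mse = math.sqrt(rss / (n - k))
f_stat = (ess / (k - 1)) / (rss / (n - k))

print(f"TSS {tss:.2f} = ESS {ess:.2f} + RSS {rss:.2f}")
print(f"R-squared {r2:.3f}, adjusted R-squared {adj_r2:.3f}")
print(f"Root MSE {root_mse:.3f}, F({k - 1}, {n - k}) = {f_stat:.2f}")
```

Note that with a single regressor the F-statistic equals the square of the slope's t-statistic.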

32
Class exercise 1

Import Fees.dta into STATA from
http://ihome.ust.hk/accl/Phd_teaching.htm
Run the following two regressions
audit fees on total assets
the log of audit fees on the log of total assets
What does the output of your regression mean?
Which model appears to have the better fit?

33
2.3 Correctly interpreting the coefficients

So far we have considered the case where the
independent variable is continuous.
Interpretation of results is even more
straightforward when the independent variable is
a dummy.

reg auditfees big6
ttest auditfees, by(big6)

Sometimes even published studies give an
incorrect interpretation of the estimated
coefficients. For example

36

Class questions
Theoretically, how should auditing affect the
interest rate that the company has to pay?
Empirically, how do we measure the impact of
auditing on the interest rate using eq. (1)?

Class question: At what values of total assets
(000) is the effect of the Audit Dummy on the
interest rate negative, zero, positive?

Class questions
What is the mean value of total assets?
How does auditing affect the interest rate for
the average company in their sample?

Verify that the above claim is true.
Suppose Blackwell et al. had reported the impact
for a firm with 11m in assets and a firm with
15m in assets.
How would this have changed the conclusions drawn?

40
2.4 Examining the residuals

Go to http://ihome.ust.hk/accl/Phd_teaching.htm
Import anscombe.dta into STATA (use
"C:\phd\anscombe.dta", clear) and run the
following regressions
reg y1 x1
reg y2 x2
reg y3 x3
reg y4 x4
Note that the output from these regressions is
virtually identical
intercept = 3.0 (t-stat = 2.67)
x coefficient = 0.5 (t-stat = 4.24)
R-squared = 66%
If you did not know about OLS assumptions, you
would probably stop your analysis at this point,
concluding that you have a good fit for all four
models.
In fact, only one of these four models is well
specified.
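To see how similar the four sets of estimates are, the sketch here fits the same single-variable OLS to each pair in Python. It assumes that anscombe.dta holds the standard Anscombe (1973) quartet, whose values are typed in directly; the helper ols is hypothetical.

```python
# Assumption: anscombe.dta contains the standard Anscombe quartet.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "y1 on x1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                        7.24, 4.26, 10.84, 4.82, 5.68]),
    "y2 on x2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                        6.13, 3.10, 9.13, 7.26, 4.74]),
    "y3 on x3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                        6.08, 5.39, 8.15, 6.42, 5.73]),
    "y4 on x4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                 [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                  5.25, 12.50, 5.56, 7.91, 6.89]),
}


def ols(x, y):
    """Return (intercept, slope, R-squared) for a single-regressor OLS fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    a1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a0 = y_bar - a1 * x_bar
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))
    return a0, a1, 1 - rss / tss


results = {name: ols(x, y) for name, (x, y) in quartet.items()}
for name, (a0, a1, r2) in results.items():
    print(f"{name}: intercept {a0:.2f}, slope {a1:.3f}, R-squared {r2:.3f}")
```

The numbers alone cannot distinguish the four models, which is why the scatter plots and residual plots matter.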

41
Class exercise 2

Draw scatter graphs for each of these four
associations. For example
twoway (scatter y1 x1) (lfit y1 x1)
Of the four models, which do you think is the
well specified one?
Draw scatter graphs of the residuals against the
x variable for each of the four regressions. Is
there a pattern?
Which of the OLS assumptions are violated in
these four regressions?

42
2.4 Examining the residuals

Unfortunately, it is common among researchers to
judge whether a model is well-specified solely
in terms of its explanatory power (i.e., the
R-squared).
You should report other types of diagnostic tests:
Is there significant heteroscedasticity?
Is there any pattern to the residuals?
Are there any problems of outliers?

43
2.4 Examining the residuals

An examination of the residuals can help us to
identify whether the model is well specified. For
example
use "C:\phd\Fees.dta", clear
reg auditfees totalassets
predict res1, resid
twoway (scatter res1 totalassets, msize(tiny)) (lfit res1 totalassets)
gen lnaf=ln(auditfees)
gen lnta=ln(totalassets)
reg lnaf lnta
predict res2, resid
twoway (scatter res2 lnta, msize(tiny)) (lfit res2 lnta)
Notice that the residuals are more spherical,
displaying less of an obvious pattern, in the
logged model.

45
Class exercise 3

Following Pong and Whittington (1994) estimate
the raw value of audit fees as a function of raw
assets and assets squared
Examine the residuals
Do you think this model is better specified than
the one in logs?

46
2.5 Multiple regression

Researchers use multiple regression when they
believe that Y is affected by multiple
independent variables
$Y = a_0 + a_1 X_1 + a_2 X_2 + e$
Why is it important to control for multiple
factors that influence Y?

47
2.5 Multiple regression

Suppose the true model is
$Y = a_0 + a_1 X_1 + a_2 X_2 + e$
where X1 and X2 are uncorrelated with the error, e
Suppose the OLS model that we estimate is
$Y = a_0 + a_1 X_1 + u$
where $u = a_2 X_2 + e$
OLS imposes the assumption that X1 is
uncorrelated with the residual term, u.
Since X1 is uncorrelated with e, the assumption
that X1 is uncorrelated with u is equivalent to
assuming that X1 is uncorrelated with X2.

48
2.5 Multiple regression

If X1 is correlated with X2 (and a2 is not zero),
the OLS estimate of a1 is biased.
The magnitude of the bias depends upon the size
of a2 and the strength of the relation between
X1 and X2.
Of course, we often do not know whether the model
we estimate is the true model
In other words, we are unsure whether there is an
omitted variable (X2) that affects Y and that is
correlated with our variable of interest (X1)
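A small simulation can make the omitted variable problem concrete. The sketch here is purely illustrative: the true coefficients (a0 = 1, a1 = 2, a2 = 3), the sample size, the seed and the name delta (the slope of X2 on X1) are all assumptions, not values from the course data. It regresses Y on X1 alone, once when X2 is unrelated to X1 and once when it is related.

```python
import random


def slope(x, y):
    """OLS slope from a regression of y on x (with a constant)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


# Hypothetical true model: Y = A0 + A1*X1 + A2*X2 + e
A0, A1, A2 = 1.0, 2.0, 3.0
N = 5000
rng = random.Random(0)


def short_regression_slope(delta):
    """Simulate data with X2 = delta*X1 + noise, then regress Y on X1 only."""
    x1 = [rng.gauss(0, 1) for _ in range(N)]
    x2 = [delta * v + rng.gauss(0, 1) for v in x1]
    y = [A0 + A1 * u + A2 * w + rng.gauss(0, 1) for u, w in zip(x1, x2)]
    return slope(x1, y)


slope_uncorrelated = short_regression_slope(0.0)   # X2 unrelated to X1
slope_correlated = short_regression_slope(0.8)     # X2 related to X1

print(f"true a1:                          {A1:.2f}")
print(f"estimate, X2 uncorrelated with X1: {slope_uncorrelated:.2f}")
print(f"estimate, X2 correlated with X1:   {slope_correlated:.2f}")
```

In this setup the short regression picks up roughly A1 + A2*delta, so omitting X2 is harmless when delta is zero and badly misleading when it is not.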

49
2.5 Multiple regression

We can judge whether or not there is likely to be
a correlated omitted variable problem using
theory
prior empirical studies

50
2.5 Multiple regression

Previously, we checked whether there was a
pattern between the residuals and one independent
variable
$lnaf = a_0 + a_1 \, lnta + res2$
twoway (scatter res2 lnta) (lfit res2 lnta)
When we are using multiple regression, we want to
test whether there is a pattern between the
residuals and the right hand side variables as a
whole
The right hand side of the equation is the same
thing as the predicted value of the dependent
variable

51
2.5 Multiple regression

So we should examine whether there is a pattern
between the residuals and the predicted values of
the dependent variable
STATA has a command which enables us to examine
the relation between the residuals and the fitted
(i.e., predicted) values
reg lnaf lnta big6
rvfplot
rvf stands for residuals versus fitted

52
2.6 Heteroscedasticity (hettest)

The OLS techniques for estimating standard errors
are based on an assumption that the variance of
the errors is the same for all values of the
independent variables (homoscedasticity)
In many cases, the homoscedasticity assumption is
violated. For example
reg auditfees totalassets big6
rvfplot
The homoscedasticity assumption can be tested
using the hettest command after we do the
regression
reg auditfees totalassets big6
hettest
Heteroscedasticity does not bias the coefficient
estimates but it biases the standard errors of
the coefficients, typically downwards (so the
t-stats are biased upwards)

53
2.6 Heteroscedasticity (robust)

Heteroscedasticity often occurs if the dependent
variable is not symmetrically distributed
For example, the auditfees variable is highly
skewed due to the fact that it has a lower bound
of zero
Much of the heteroscedasticity can often be
removed by transforming the dependent variable
(e.g., use the log of audit fees instead of the
raw values)

54
2.6 Heteroscedasticity (robust)

When you find that there is heteroscedasticity,
you need to correct the standard errors using the
Huber/White/sandwich estimator
In STATA it is easy to do this adjustment using
the robust option
reg auditfees totalassets big6, robust
Compare the adjusted and unadjusted results
reg auditfees totalassets big6
note that the coefficients are exactly the same
the t-statistics on the independent variables are
much smaller when the standard errors are
adjusted for heteroscedasticity
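The effect of the adjustment can be sketched for a single regressor in Python. The data here are hypothetical and constructed so that the residuals are larger at extreme values of x. The robust formula is the sandwich estimator with the n/(n-k) small-sample scaling (often called HC1), which is understood to be what the robust option applies; treat that correspondence as an assumption.

```python
import math

# Hypothetical data: errors are larger in magnitude when |x| is large.
x = [-3, -1, 1, 3, -3, -1, 1, 3]
e = [2.0, 0.5, 0.5, 2.0, -2.0, -0.5, -0.5, -2.0]
y = [1 + 2 * xi + ei for xi, ei in zip(x, e)]

n, k = len(x), 2
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
a1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a0 = y_bar - a1 * x_bar
res = [yi - a0 - a1 * xi for xi, yi in zip(x, y)]

# Conventional standard error: assumes one common residual variance
rss = sum(r ** 2 for r in res)
se_conventional = math.sqrt(rss / (n - k) / s_xx)

# Robust standard error: each squared residual is weighted by its own
# squared deviation of x, then scaled by n / (n - k)
meat = sum((xi - x_bar) ** 2 * r ** 2 for xi, r in zip(x, res))
se_robust = math.sqrt(n / (n - k) * meat / s_xx ** 2)

print(f"slope {a1:.3f} (identical under both methods)")
print(f"conventional se {se_conventional:.4f}, t = {a1 / se_conventional:.2f}")
print(f"robust se       {se_robust:.4f}, t = {a1 / se_robust:.2f}")
```

The coefficient is unchanged; only the standard error, and therefore the t-statistic, differs.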

55
Class exercise 4

Estimate the audit fee model in logs rather than
absolute values
Using rvfplot, assess whether the residuals
appear to be non-constant
Using hettest, provide a formal test for
heteroscedasticity
Compare the coefficients and t-statistics when
you estimate the standard errors with and without
adjusting for heteroscedasticity.

56
Conclusion