1
Objectives of this class
  • By the end of this class you should be able to
  • explain how OLS models work
  • interpret the results of OLS models
  • spot potential problems of outliers and
    heteroscedasticity
  • correct the standard errors for heteroscedasticity

2
2. Ordinary least squares (OLS) regression
  • 2.1 The basic idea
  • 2.2 Single variable OLS
  • 2.3 Correctly interpreting the coefficients
  • 2.4 Examining the residuals
  • 2.5 Multiple regression
  • 2.6 Heteroscedasticity

3
2.1 The basic idea
  • A simple linear regression aims to characterize
    the relation between one dependent variable and
    one independent variable using a straight line
  • For example, you predict that larger companies
    pay higher fees
  • You can formalize the effect of company size on
    predicted fees using a simple equation
    Expected Fees = a0 + a1 Size
  • The parameter a0 represents what fees are
    expected to be in the case that Size = 0.
  • The parameter a1 captures the impact of an
    increase in Size on expected fees.

4
2.1 The basic idea
  • The parameters a0 and a1 are assumed to be the
    same for all observations and they are called
    regression coefficients.
  • However, company size is not the only variable
    that affects audit fees. For example, the
    complexity of the audit engagement, or the size
    of the audit firm may also matter.
  • You do not know all the factors that influence
    fees, so the predicted fee that you calculate
    from the above equation will differ from the
    actual fee.

5
2.1 The basic idea
  • The deviation between the predicted fee and the
    actual fee is called the residual. You can
    represent the relation between actual fees and
    predicted fees in the following way
    Actual Fees = Predicted Fees + e
  • where e represents the residual term (i.e., the
    difference between actual and predicted fees)

6
2.1 The basic idea
  • Putting the two together, we can express actual
    fees using the following equation
    Fees = a0 + a1 Size + e
  • The goal of regression analysis is to estimate
    the parameters a0 and a1

7
2.1 The basic idea
  • One of the simplest techniques involves
    estimating the coefficients using ordinary least
    squares (OLS) regression.
  • The objective of OLS is to make the difference
    between the predicted and actual values as small
    as possible
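
In symbols (standard notation, not shown on the
slide), OLS chooses a0 and a1 to minimize the
residual sum of squares
    RSS = sum of (yi - a0 - a1 xi)^2
For single-variable OLS this has the closed-form
solution
    a1 = Cov(x, y) / Var(x)
    a0 = mean(y) - a1 * mean(x)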

8
2.1 The basic idea
  • First let's start with a very small and
    easy-to-visualize dataset
  • Go to http://ihome.ust.hk/accl/Phd_teaching.htm
  • Download ols.dta to your hard drive and open it
    in STATA (use "C:\phd\ols.dta", clear) or open it
    directly from the internet (use
    "http://ihome.ust.hk/accl/ols.dta", clear)

9
2.1 The basic idea
  • Examine a scatter plot of the two variables
  • twoway (scatter y x) (lfit y x)

10
2.1 The basic idea
  • This line is fitted by minimizing the sum of the
    squared differences between the observed and
    predicted values of y (known as the residual sum
    of squares, RSS)
  • The assumptions required to obtain the
    coefficients are
  • The relation between y and x is linear
  • The x variable is uncorrelated with the residuals
  • The residuals have a mean value of zero

11
2.2 Single variable OLS (regress)
  • We estimate the model using the regress command
  • regress y x
  • The first variable (y) is the dependent variable
    while the second (x) is the independent variable

12
2.2 Single variable OLS (regress)
  • This gives the following output

13
2.2 Single variable OLS (regress)
  • The coefficient estimates are 3.000 for the a0
    parameter and 0.500 for the a1 parameter
  • We can use these to predict the value of Y for
    any given value of X. For example, we can
    predict what Y will be when X = 5 using the
    display command, which performs simple
    calculations
  • display 3.000091+0.500909*5

14
2.2 Single variable OLS (_b)
  • Actually, we do not need to type the coefficient
    estimates because STATA will remember them for
    us. They are stored by STATA under the name
    _b[varname], where varname is replaced with the
    name of the independent variable or the constant
    (_cons)
  • display _b[_cons]+_b[x]*5

15
2.2 Single variable OLS
  • Note that the predicted value of y when x equals
    5 differs from the actual value
  • list y if x==5
  • The actual value is 5.68 compared to the
    predicted value of 5.50. The difference is the
    residual error that arises because x is not a
    perfect predictor of y.

16
2.2 Single variable OLS
When x = 5, the actual value of y is 5.68 compared
to the predicted y value of 5.50. The residual
prediction error is the vertical distance between
the observation and the line.
17
2.2 Single variable OLS
  • If we want to compute the predicted value of y
    for each value of x in our dataset, we can use
    the saved coefficients
  • gen y_hat = _b[_cons]+_b[x]*x
  • The estimated residuals are the difference
    between the observed y values and the predicted y
    values
  • gen y_res = y - y_hat
  • list x y_hat y y_res

18
2.2 Single variable OLS (predict)
  • A quicker way to do this would be to use the
    predict command after regress
  • predict yhat
  • predict yres, resid
  • Checking that this gives the same answer
  • list yhat y_hat yres y_res
  • You should note that the values of x, yhat and
    yres correspond with those found on the scatter
    graph
  • sort x
  • list x y y_hat y_res

20
2.2 Single variable OLS
  • By construction, there is zero correlation
    between the x variable and the residuals
  • twoway (scatter y_res x) (lfit y_res x) or reg
    y_res x
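
One quick numerical check is the correlate command;
the reported correlation should be zero up to
rounding error
  • correlate y_res x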

21
2.2 Single variable OLS
  • Standard errors
  • Typically our data comprises a sample that is
    taken from a larger population
  • The coefficients are only estimates of the true
    a0 and a1 values that describe the entire
    population
  • If we obtained a second random sample from the
    same population, we would obtain different
    coefficient estimates for a0 and a1

22
2.2 Single variable OLS
  • We therefore need a way to describe the
    variability that would obtain if we were to apply
    OLS to many different samples
  • Equivalently, we want a measure of how
    precisely our coefficients are estimated
  • The solution is to calculate standard errors,
    which are simply the sample standard deviations
    associated with the estimated coefficients
  • Standard errors (SEs) allow us to perform
    statistical tests, e.g., is our estimate of a1
    significantly greater than zero?
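
For single-variable OLS, the slope's standard error
has a simple closed form (standard textbook
notation, not shown on the slide)
    SE(a1) = s / [sum of (xi - xbar)^2]^0.5,
    where s^2 = RSS/(n - k)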

23
2.2 Single variable OLS
  • The techniques for estimating standard errors are
    based on OLS assumptions that often do not hold
    in practice
  • Homoscedasticity (i.e., the residuals have a
    constant variance)
  • Non-correlation (i.e., the residuals are not
    correlated with each other)
  • Normality (i.e., the residuals are normally
    distributed)

24
2.2 Single variable OLS
  • The t-statistic is obtained by dividing the
    coefficient estimate by the standard error
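
For example, using the output above: the slope
estimate is 0.500 and its standard error is about
0.118 (implied by the confidence interval reported
on the next slide), so t = 0.500/0.118 = 4.24.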

25
2.2 Single variable OLS
  • The p-values are from the t-distribution and they
    tell you how likely it is that you would have
    observed the estimated coefficient under the
    assumption that the true coefficient in the
    population is zero.
  • The p-value of 0.002 tells you that it is very
    unlikely (a probability of 0.2%) that you would
    observe such a large coefficient if the true
    coefficient on x were zero.
  • The confidence interval means you can be 95%
    confident that the true coefficient on x lies
    between 0.23337 and 0.76681.

26
2.2 Single variable OLS
  • The total sum of squares (TSS) = 41.27
  • The explained sum of squares (ESS) = 27.51
  • The residual sum of squares (RSS) = 13.76
  • Note that TSS = ESS + RSS.

27
2.2 Single variable OLS
  • The column labeled df contains the number of
    degrees of freedom
  • For the ESS, df = k - 1, where k = number of
    regression coefficients (df = 2 - 1)
  • For the RSS, df = n - k, where n = number of
    observations (= 11 - 2)
  • For the TSS, df = n - 1 (= 11 - 1)
  • The last column (MS) reports the ESS, RSS and TSS
    divided by their respective degrees of freedom

28
2.2 Single variable OLS
  • The first number simply tells us how many
    observations are used to estimate the model
  • The other statistics here tell you how well the
    model explains the variation in Y

29
2.2 Single variable OLS
  • R-squared = ESS / TSS (= 27.51 / 41.27 = 0.666)
  • So x explains 66% of the variation in y.
  • Unfortunately, many researchers in accounting
    (and other fields) evaluate the quality of a
    model by looking only at the R-squared.
  • This is not only invalid, it is also very
    dangerous (I will explain why later)

30
2.2 Single variable OLS
  • One problem with the R-squared is that it
    always increases even when an independent variable
    is added that has very little explanatory power.
  • The adjusted R-squared corrects for this by
    accounting for the number of model parameters, k,
    that need to be estimated
  • Adj R-squared = 1 - (1 - R2)(n-1)/(n-k)
    = 1 - (1 - 0.666)(10/9) = 0.629
  • In fact, the adjusted R-squared can even take on
    negative values. For example, suppose that y and
    x are uncorrelated, in which case the unadjusted
    R-squared is zero
  • Adj R-squared = 1 - (n-1)/(n-2) = (n-2-n+1)/(n-2)
    = -1/(n-2)

31
2.2 Single variable OLS
  • You might think that another way to measure the
    fit of the model is to add up the residuals.
    However, by definition, the residuals always
    sum to zero.
  • An alternative is to square the residuals, add
    them up (giving the RSS), divide by the degrees
    of freedom, and take the square root.
  • Root MSE = [RSS/(n-k)]^0.5 = [13.76/(11-2)]^0.5
    = 1.236
  • One way to interpret the root MSE is that it
    shows how far away, on average, the model is from
    explaining y
  • The F-statistic = [ESS/(k-1)]/[RSS/(n-k)]
    = (27.51/1)/(13.76/9) = 17.99
  • The F-statistic is used to test whether the
    R-squared is significantly greater than zero
    (i.e., are the independent variables jointly
    significant?)
  • Prob > F gives the probability of observing the
    R-squared we calculated if the true R-squared in
    the population is actually equal to zero
  • This F test is used to test the overall
    statistical significance of the regression model
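
These numbers can be verified directly with the
display command; the first line should reproduce the
root MSE of 1.236 and the second the F-statistic of
17.99
  • display sqrt(13.76/(11-2))
  • display (27.51/1)/(13.76/(11-2))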

32
Class exercise 1
  • Import Fees.dta into STATA from
  • http://ihome.ust.hk/accl/Phd_teaching.htm
  • Run the following two regressions
  • audit fees on total assets
  • the log of audit fees on the log of total assets
  • What does the output of your regression mean?
  • Which model appears to have the better fit? (one
    possible set of commands is sketched below)
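
A sketch of commands for this exercise, assuming the
variable names auditfees and totalassets that appear
later in these slides
  • use "C:\phd\Fees.dta", clear
  • reg auditfees totalassets
  • gen lnaf = ln(auditfees)
  • gen lnta = ln(totalassets)
  • reg lnaf lnta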

33
2.3 Correctly interpreting the coefficients
  • So far we have considered the case where the
    independent variable is continuous.
  • Interpretation of results is even more
    straightforward when the independent variable is
    a dummy.
  • reg auditfees big6
  • ttest auditfees, by(big6)

34
  • Sometimes even published studies give an
    incorrect interpretation of the estimated
    coefficients. For example

35
(excerpt from the Blackwell et al. study, including
eq. (1), shown on this slide)
36
  • Class questions
  • Theoretically, how should auditing affect the
    interest rate that the company has to pay?
  • Empirically, how do we measure the impact of
    auditing on the interest rate using eq. (1)?

37
  • Class question: At what values of total assets
    (in thousands) is the effect of the Audit Dummy
    on the interest rate
  • negative, zero, positive?

38
  • Class questions
  • What is the mean value of total assets?
  • How does auditing affect the interest rate for
    the average company in their sample?

39
  • Verify that the above claim is true.
  • Suppose Blackwell et al. had reported the impact
    for a firm with $11m in assets and a firm with
    $15m in assets.
  • How would this have changed the conclusions drawn?

40
2.4 Examining the residuals
  • Go to http://ihome.ust.hk/accl/Phd_teaching.htm
  • Import anscombe.dta into STATA (use
    "C:\phd\anscombe.dta", clear) and run the
    following regressions
  • reg y1 x1
  • reg y2 x2
  • reg y3 x3
  • reg y4 x4
  • Note that the output from these regressions is
    virtually identical
  • intercept = 3.0 (t-stat = 2.67)
  • x coefficient = 0.5 (t-stat = 4.24)
  • R-squared = 66%
  • If you did not know about the OLS assumptions,
    you would probably stop your analysis at this
    point, concluding that you have a good fit for
    all four models.
  • In fact, only one of these four models is well
    specified.

41
Class exercise 2
  • Draw scatter graphs for each of these four
    associations. For example
  • twoway (scatter y1 x1) (lfit y1 x1)
  • Of the four models, which do you think is the
    well-specified one?
  • Draw scatter graphs of the residuals against the
    x variable for each of the four regressions. Is
    there a pattern?
  • Which of the OLS assumptions are violated in
    these four regressions?

42
2.4 Examining the residuals
  • Unfortunately, it is common among researchers to
    judge whether a model is well-specified solely
    in terms of its explanatory power (i.e., the
    R-squared).
  • You should report other types of diagnostic tests
  • is there significant heteroscedasticity?
  • is there any pattern to the residuals?
  • are there any problems of outliers?

43
2.4 Examining the residuals
  • An examination of the residuals can help us to
    identify whether the model is well specified. For
    example
  • use "C:\phd\Fees.dta", clear
  • reg auditfees totalassets
  • predict res1, resid
  • twoway (scatter res1 totalassets, msize(tiny))
    (lfit res1 totalassets)
  • gen lnaf = ln(auditfees)
  • gen lnta = ln(totalassets)
  • reg lnaf lnta
  • predict res2, resid
  • twoway (scatter res2 lnta, msize(tiny)) (lfit
    res2 lnta)
  • Notice that the residuals are more spherical,
    displaying less of an obvious pattern, in the
    logged model.

45
Class exercise 3
  • Following Pong and Whittington (1994), estimate
    the raw value of audit fees as a function of raw
    assets and assets squared
  • Examine the residuals
  • Do you think this model is better specified than
    the one in logs? (one possible setup is sketched
    below)
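
A sketch of one way to set this up; the
squared-assets variable name ta2 is chosen for
illustration
  • gen ta2 = totalassets^2
  • reg auditfees totalassets ta2
  • predict res3, resid
  • twoway (scatter res3 totalassets, msize(tiny))
    (lfit res3 totalassets)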

46
2.5 Multiple regression
  • Researchers use multiple regression when they
    believe that Y is affected by multiple
    independent variables
  • Y = a0 + a1 X1 + a2 X2 + e
  • Why is it important to control for multiple
    factors that influence Y?

47
2.5 Multiple regression
  • Suppose the true model is
  • Y = a0 + a1 X1 + a2 X2 + e
  • where X1 and X2 are uncorrelated with the error, e
  • Suppose the OLS model that we estimate is
  • Y = a0 + a1 X1 + u
  • where u = a2 X2 + e
  • OLS imposes the assumption that X1 is
    uncorrelated with the residual term, u.
  • Since X1 is uncorrelated with e, the assumption
    that X1 is uncorrelated with u is equivalent to
    assuming that X1 is uncorrelated with X2.
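
A small simulation sketch of this point (hypothetical
data; the coefficient values and variable names are
chosen for illustration). Here a2 = 1 and
Cov(X1, X2)/Var(X1) = 0.5, so the omitted-variable
bias formula, bias = a2 Cov(X1, X2)/Var(X1), predicts
that the short regression's slope on x1 will be close
to 1.5 rather than the true value of 1
  • clear
  • set obs 1000
  • set seed 12345
  • gen x1 = rnormal()
  • gen x2 = 0.5*x1 + rnormal()
  • gen y = 1 + x1 + x2 + rnormal()
  • reg y x1 x2
  • reg y x1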

48
2.5 Multiple regression
  • If X1 is correlated with X2 the OLS estimate of
    a1 is biased.
  • The magnitude of the bias depends upon the
    strength of the correlation between X1 and X2.
  • Of course, we often do not know whether the model
    we estimate is the true model
  • In other words, we are unsure whether there is an
    omitted variable (X2) that affects Y and that is
    correlated with our variable of interest (X1)

49
2.5 Multiple regression
  • We can judge whether or not there is likely to be
    a correlated omitted variable problem using
  • theory
  • prior empirical studies

50
2.5 Multiple regression
  • Previously, we checked whether there was a
    pattern between the residuals and one independent
    variable
  • lnaf = a0 + a1 lnta + res2
  • twoway (scatter res2 lnta) (lfit res2 lnta)
  • When we are using multiple regression, we want to
    test whether there is a pattern between the
    residuals and the right-hand-side variables as a
    whole
  • The right-hand side of the equation (excluding
    the residual) is the same thing as the predicted
    value of the dependent variable

51
2.5 Multiple regression
  • So we should examine whether there is a pattern
    between the residuals and the predicted values of
    the dependent variable
  • STATA has a command which enables us to examine
    the relation between the residuals and the fitted
    (i.e., predicted) values
  • reg lnaf lnta big6
  • rvfplot
  • rvf stands for residuals versus fitted

52
2.6 Heteroscedasticity (hettest)
  • The OLS techniques for estimating standard errors
    are based on an assumption that the variance of
    the errors is the same for all values of the
    independent variables (homoscedasticity)
  • In many cases, the homoscedasticity assumption is
    violated. For example
  • reg auditfees totalassets big6
  • rvfplot
  • The homoscedasticity assumption can be tested
    using the hettest command after we run the
    regression
  • reg auditfees totalassets big6
  • hettest
  • Heteroscedasticity does not bias the coefficient
    estimates, but it typically biases the standard
    errors of the coefficients downwards (so the
    t-stats are biased upwards)
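
For reference, hettest performs the Breusch-Pagan
test; in more recent versions of STATA the same test
is run as a post-estimation command
  • reg auditfees totalassets big6
  • estat hettest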

53
2.6 Heteroscedasticity (robust)
  • Heteroscedasticity often occurs if the dependent
    variable is not symmetrically distributed
  • for example, the auditfees variable is highly
    skewed because it has a lower bound of zero
  • much of the heteroscedasticity can often be
    removed by transforming the dependent variable
    (e.g., use the log of audit fees instead of the
    raw values)

54
2.6 Heteroscedasticity (robust)
  • When you find that there is heteroscedasticity,
    you need to correct the standard errors using the
    Huber/White/sandwich estimator
  • In STATA it is easy to do this adjustment using
    the robust option
  • reg auditfees totalassets big6, robust
  • Compare the adjusted and unadjusted results
  • reg auditfees totalassets big6
  • note that the coefficients are exactly the same
  • the t-statistics on the independent variables are
    much smaller when the standard errors are
    adjusted for heteroscedasticity
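
To see the two sets of results side by side, one
option is STATA's estimates store and estimates
table commands (not covered in these slides)
  • reg auditfees totalassets big6
  • estimates store plain
  • reg auditfees totalassets big6, robust
  • estimates store rob
  • estimates table plain rob, b(%9.4f) se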

55
Class exercise 4
  • Estimate the audit fee model in logs rather than
    absolute values
  • Using rvfplot, assess whether the residual
    variance appears to be non-constant
  • Using hettest, provide a formal test for
    heteroscedasticity
  • Compare the coefficients and t-statistics when
    you estimate the standard errors with and without
    adjusting for heteroscedasticity (one possible
    sequence of commands is sketched below).
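
A sketch of one possible sequence, assuming the lnaf
and lnta variables created earlier
  • reg lnaf lnta big6
  • rvfplot
  • hettest
  • reg lnaf lnta big6, robust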

56
Conclusion
  • You should now understand
  • how OLS models work
  • how to interpret the results of OLS models
  • how to spot potential problems of outliers and
    heteroscedasticity
  • how to correct the standard errors for
    heteroscedasticity