Provided by: accl5



Objectives of this class
  • By the end of this class you should be able to
  • explain how OLS models work
  • interpret the results of OLS models
  • spot potential problems of outliers and
  • correct the standard errors for heteroscedasticity

2. Ordinary least squares (OLS) regression
  • 2.1 The basic idea
  • 2.2 Single variable OLS
  • 2.3 Correctly interpreting the coefficients
  • 2.4 Examining the residuals
  • 2.5 Multiple regression
  • 2.6 Heteroscedasticity

2.1 The basic idea
  • A simple linear regression aims to characterize
    the relation between one dependent variable and
    one independent variable using a straight line
  • For example, you predict that larger companies
    pay higher fees
  • You can formalize the effect of company size on
    predicted fees using a simple equation:
    Predicted fees = a0 + a1 Size
  • The parameter a0 represents what fees are
    expected to be in the case that Size = 0.
  • The parameter a1 captures the impact of a
    one-unit increase in Size on expected fees.

2.1 The basic idea
  • The parameters a0 and a1 are assumed to be the
    same for all observations and they are called
    regression coefficients.
  • However, company size is not the only variable
    that affects audit fees. For example, the
    complexity of the audit engagement, or the size
    of the audit firm may also matter.
  • You do not know all the factors that influence
    fees, so the predicted fee that you calculate
    from the above equation will differ from the
    actual fee.

2.1 The basic idea
  • The deviation between the predicted fee and the
    actual fee is called the residual. You can
    represent the relation between actual fees and
    predicted fees in the following way:
    Actual fees = Predicted fees + e
  • where e represents the residual term (i.e., the
    difference between actual and predicted fees)

2.1 The basic idea
  • Putting the two together, we can express actual
    fees using the following equation:
    Actual fees = a0 + a1 Size + e
  • The goal of regression analysis is to estimate
    the parameters a0 and a1

2.1 The basic idea
  • One of the simplest techniques involves
    estimating the coefficients using ordinary least
    squares (OLS) regression.
  • The objective of OLS is to make the difference
    between the predicted and actual values as small
    as possible

2.1 The basic idea
  • First, let's start with a very small and
    easy-to-visualize dataset
  • Go to http://ihome.ust.hk/accl/Phd_teaching.htm
  • Download ols.dta to your hard drive and open it in
    STATA (use "C:\phd\ols.dta", clear) or open it
    directly from the internet (use
    "http://ihome.ust.hk/accl/ols.dta", clear)

2.1 The basic idea
  • Examine a scatter plot of the two variables
  • twoway (scatter y x) (lfit y x)

2.1 The basic idea
  • This line is fitted by minimizing the sum of the
    squared differences between the observed and
    predicted values of y (known as the residual sum
    of squares, RSS)
  • The assumptions required to obtain the
    coefficients are:
  • The relation between y and x is linear
  • The x variable is uncorrelated with the residuals
  • The residuals have a mean value of zero
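
The minimization described above can be written out explicitly. This is a sketch in standard textbook notation; the hats (for estimates) and bars (for sample means) are notation assumed here and do not appear in the slides.

```latex
\begin{align*}
\mathrm{RSS}(a_0, a_1) &= \sum_{i=1}^{n} \left( y_i - a_0 - a_1 x_i \right)^2 \\
\hat{a}_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{a}_0 = \bar{y} - \hat{a}_1 \bar{x}
\end{align*}
```

Setting the derivatives of RSS with respect to a0 and a1 to zero gives these two estimates. The same two conditions force the residuals to sum to zero and to be uncorrelated with x in the sample.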

2.2 Single variable OLS (regress)
  • We estimate the model using the regress command
  • regress y x
  • The first variable (y) is the dependent variable
    while the second (x) is the independent variable

2.2 Single variable OLS (regress)
  • This gives the following output

2.2 Single variable OLS (regress)
  • The coefficient estimates are 3.000 for the a0
    parameter and 0.500 for the a1 parameter
  • We can use these to predict the values of Y for
    any given value of X. For example, we can
    predict what Y will be when X = 5 using the
    display command, which performs simple arithmetic
  • display 3.000091 + 0.5000909*5
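
For readers who want to see where these coefficients come from, here is a minimal Python sketch of the closed-form single-variable OLS calculation. It assumes that ols.dta contains Anscombe's first dataset; that is an assumption, made because the coefficients, sums of squares and the value y = 5.68 at x = 5 reported in these slides all match it.

```python
# Data assumed to match ols.dta: Anscombe's first dataset (an assumption,
# chosen because it reproduces the output reported in the slides).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS estimates for a single regressor.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
a1 = s_xy / s_xx          # slope, the coefficient on x
a0 = y_bar - a1 * x_bar   # intercept, the constant

# Same arithmetic as the display command: predicted y when x = 5.
y_pred_at_5 = a0 + a1 * 5
print(round(a0, 4), round(a1, 4), round(y_pred_at_5, 4))
```

The printed intercept and slope should agree with the regress output to the displayed precision.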

2.2 Single variable OLS (_b)
  • Actually, we do not need to type the coefficient
    estimates because STATA will remember them for
    us. They are stored by STATA using the name
    _b[varname], where varname is replaced with the
    name of the independent variable or the constant
    (_cons)
  • display _b[_cons] + _b[x]*5

2.2 Single variable OLS
  • Note that the predicted value of y when x equals
    5 differs from the actual value
  • list y if x == 5
  • The actual value is 5.68 compared to the
    predicted value of 5.50. The difference is the
    residual error that arises because x is not a
    perfect predictor of y.

2.2 Single variable OLS
When x = 5, the actual value of y is 5.68 compared
to the predicted y value of 5.50. The residual
prediction error is the vertical distance between
the observation and the line.

2.2 Single variable OLS
  • If we want to compute the predicted value of y
    for each value of x in our dataset, we can use
    the saved coefficients
  • gen y_hat = _b[_cons] + _b[x]*x
  • The estimated residuals are the difference
    between the observed y values and the predicted y
    values
  • gen y_res = y - y_hat
  • list x y_hat y y_res

2.2 Single variable OLS (predict)
  • A quicker way to do this would be to use the
    predict command after regress
  • predict yhat
  • predict yres, resid
  • Checking that this gives the same answer
  • list yhat y_hat yres y_res
  • You should note that the values of x, yhat and
    yres correspond with those found on the scatter
    plot
  • sort x
  • list x y y_hat y_res

2.2 Single variable OLS
  • By construction, there is zero correlation
    between the x variable and the residuals
  • twoway (scatter y_res x) (lfit y_res x) or reg
    y_res x
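
The "by construction" claim can be checked numerically. The Python sketch below recomputes the fitted values and residuals by hand, again assuming that ols.dta is Anscombe's first dataset (an assumption based on the matching output), and confirms that the residuals sum to zero and have zero sample covariance with x.

```python
# Data assumed to match ols.dta (Anscombe's first dataset; an assumption).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
a1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a0 = y_bar - a1 * x_bar

# Equivalent of: gen y_hat = _b[_cons] + _b[x]*x  and  gen y_res = y - y_hat
y_hat = [a0 + a1 * xi for xi in x]
y_res = [yi - yh for yi, yh in zip(y, y_hat)]

# Both quantities are zero up to floating-point rounding.
sum_res = sum(y_res)
cov_x_res = sum((xi - x_bar) * ri for xi, ri in zip(x, y_res)) / (n - 1)
print(sum_res, cov_x_res)
```

Because the covariance is zero, a regression of y_res on x returns a slope of zero, which is what the fitted line in the residual plot shows.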

2.2 Single variable OLS
  • Standard errors
  • Typically our data comprises a sample that is
    taken from a larger population
  • The coefficients are only estimates of the true
    a0 and a1 values that describe the entire
    population
  • If we obtained a second random sample from the
    same population, we would obtain different
    coefficient estimates for a0 and a1

2.2 Single variable OLS
  • We therefore need a way to describe the
    variability that would obtain if we were to apply
    OLS to many different samples
  • Equivalently, we want a measure of how
    precisely our coefficients are estimated
  • The solution is to calculate standard errors,
    which are estimates of the standard deviations
    of the coefficient estimates across repeated
    samples
  • Standard errors (SEs) allow us to perform
    statistical tests, e.g., is our estimate of a1
    significantly greater than zero?

2.2 Single variable OLS
  • The techniques for estimating standard errors are
    based on OLS assumptions that often do not hold
    in practice
  • Homoscedasticity (i.e., the residuals have a
    constant variance)
  • Non-correlation (i.e., the residuals are not
    correlated with each other)
  • Normality (i.e., the residuals are normally
    distributed)

2.2 Single variable OLS
  • The t-statistic is obtained by dividing the
    coefficient estimate by the standard error
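
A minimal Python sketch of the conventional standard error and t-statistic calculations for the single-variable model. It assumes that ols.dta is Anscombe's first dataset (an assumption based on the matching output), and it uses the textbook homoscedastic formulas rather than anything specific to STATA.

```python
import math

# Data assumed to match ols.dta (Anscombe's first dataset; an assumption).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n, k = len(x), 2  # k = number of estimated coefficients (a0 and a1)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
a1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a0 = y_bar - a1 * x_bar

# Residual variance estimate: RSS divided by its degrees of freedom.
rss = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))
s2 = rss / (n - k)

# Conventional (homoscedastic) standard errors.
se_a1 = math.sqrt(s2 / s_xx)
se_a0 = math.sqrt(s2 * (1 / n + x_bar ** 2 / s_xx))

# t-statistic = coefficient estimate / standard error.
t_a1 = a1 / se_a1
t_a0 = a0 / se_a0
print(round(t_a0, 2), round(t_a1, 2))
```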

2.2 Single variable OLS
  • The p-values are from the t-distribution. They
    tell you how likely it is that you would have
    observed a coefficient estimate at least as far
    from zero as the one obtained, under the
    assumption that the true coefficient in the
    population is zero.
  • The p-value of 0.002 tells you that, if the true
    coefficient on x were zero, an estimate this
    extreme would be very unlikely (prob. = 0.2%).
  • The confidence interval means you can be 95%
    confident that the true coefficient of x lies
    between 0.23337 and 0.76681.
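
The confidence interval is the coefficient estimate plus or minus a critical t-value times the standard error. The Python sketch below reproduces it, assuming that ols.dta is Anscombe's first dataset (an assumption based on the matching output). The critical value is hard-coded from standard t tables because Python's standard library has no t-distribution quantile function.

```python
import math

# Data assumed to match ols.dta (Anscombe's first dataset; an assumption).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n, k = len(x), 2
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
a1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a0 = y_bar - a1 * x_bar
rss = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))
se_a1 = math.sqrt(rss / (n - k) / s_xx)

# Two-sided 95% critical value of the t-distribution with n - k = 9 degrees
# of freedom, taken from standard t tables (hard-coded, not computed).
T_CRIT_95_DF9 = 2.262157

ci_lower = a1 - T_CRIT_95_DF9 * se_a1
ci_upper = a1 + T_CRIT_95_DF9 * se_a1
print(round(ci_lower, 5), round(ci_upper, 5))
```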

2.2 Single variable OLS
  • The total sum of squares (TSS) = 41.27
  • The explained sum of squares (ESS) = 27.51
  • The residual sum of squares (RSS) = 13.76
  • Note that TSS = ESS + RSS.
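
The decomposition can be verified by computing each sum of squares directly. A minimal Python sketch, assuming that ols.dta is Anscombe's first dataset (an assumption based on the matching output):

```python
# Data assumed to match ols.dta (Anscombe's first dataset; an assumption).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
a1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a0 = y_bar - a1 * x_bar
y_hat = [a0 + a1 * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)                # total variation in y
ess = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by the model
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # left in the residuals
print(round(tss, 2), round(ess, 2), round(rss, 2))
```

The identity TSS = ESS + RSS holds for any OLS regression that includes a constant term.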

2.2 Single variable OLS
  • The column labeled df contains the number of
    degrees of freedom
  • For the ESS, df = k - 1, where k = number of
    regression coefficients (df = 2 - 1 = 1)
  • For the RSS, df = n - k, where n = number of
    observations (= 11 - 2 = 9)
  • For the TSS, df = n - 1 (= 11 - 1 = 10)
  • The last column (MS) reports the ESS, RSS and TSS
    divided by their respective degrees of freedom

2.2 Single variable OLS
  • The first number simply tells us how many
    observations are used to estimate the model
  • The other statistics here tell you how well the
    model explains the variation in Y

2.2 Single variable OLS
  • R-squared = ESS / TSS (= 27.51 / 41.27 = 0.666)
  • So x explains about 67% of the variation in y.
  • Unfortunately, many researchers in accounting
    (and other fields) evaluate the quality of a
    model by looking only at the R-squared.
  • This is not only invalid, it is also very
    dangerous (I will explain why later)

2.2 Single variable OLS
  • One problem with the R-squared is that it never
    decreases (and typically increases) when an
    independent variable is added, even one that has
    very little explanatory power.
  • The adjusted R-squared corrects for this by
    accounting for the number of model parameters, k,
    that need to be estimated
  • Adj R-squared = 1 - (1 - R2)(n-1)/(n-k)
    = 1 - (1 - 0.666)(10)/9 = 0.629
  • In fact the adjusted R-squared can even take on
    negative values. For example, suppose that y and
    x are uncorrelated, in which case the unadjusted
    R-squared is zero
  • Adj R-squared = 1 - (n-1)/(n-2)
    = (n - 2 - n + 1)/(n-2) = -1/(n-2), which is
    negative

2.2 Single variable OLS
  • You might think that another way to measure the
    fit of the model is to add up the residuals.
    However, by definition, the residuals will always
    sum to zero.
  • An alternative is to square the residuals, add
    them up (giving the RSS) and then take the square
    root (after dividing by the degrees of freedom)
  • Root MSE = square root of RSS/(n-k)
    = (13.76 / (11-2))^0.5 = 1.236
  • One way to interpret the root MSE is that it
    shows how far away, on average, the model is from
    explaining y
  • The F-statistic = (ESS/(k-1))/(RSS/(n-k))
    = (27.51/1)/(13.76/9) = 17.99
  • The F-statistic is used to test whether the
    R-squared is significantly greater than zero
    (i.e., are the independent variables jointly
    significant?)
  • Prob > F gives the probability that an R-squared
    at least as large as the one we calculated would
    be observed if the true R-squared in the
    population is actually equal to zero
  • This F test is used to test the overall
    statistical significance of the regression model
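
The fit statistics above can all be derived from the three sums of squares. A minimal Python sketch, assuming that ols.dta is Anscombe's first dataset (an assumption based on the matching output):

```python
import math

# Data assumed to match ols.dta (Anscombe's first dataset; an assumption).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n, k = len(x), 2  # observations and estimated coefficients
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
a1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a0 = y_bar - a1 * x_bar

tss = sum((yi - y_bar) ** 2 for yi in y)
rss = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))
ess = tss - rss

r2 = ess / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
root_mse = math.sqrt(rss / (n - k))
f_stat = (ess / (k - 1)) / (rss / (n - k))
print(round(r2, 3), round(adj_r2, 3), round(root_mse, 3), round(f_stat, 2))
```

With a single independent variable, the F-statistic equals the square of the t-statistic on x.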

Class exercise 1
  • Import Fees.dta into STATA from
  • http://ihome.ust.hk/accl/Phd_teaching.htm
  • Run the following two regressions
  • audit fees on total assets
  • the log of audit fees on the log of total assets
  • What does the output of your regression mean?
  • Which model appears to have the better fit?

2.3 Correctly interpreting the coefficients
  • So far we have considered the case where the
    independent variable is continuous.
  • Interpretation of results is even more
    straightforward when the independent variable is
    a dummy.
  • reg auditfees big6
  • ttest auditfees, by(big6)
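
To see why the dummy case is so easy to interpret, here is a minimal Python sketch. The numbers are hypothetical and are NOT taken from Fees.dta; only the variable names auditfees and big6 come from the slides. With a 0/1 independent variable, the OLS intercept equals the mean of the group where the dummy is 0, and the slope equals the difference between the two group means.

```python
# Hypothetical illustrative data (not from Fees.dta).
big6 = [0, 0, 0, 1, 1, 1, 1]
auditfees = [10, 12, 14, 20, 22, 24, 26]

n = len(big6)
x_bar = sum(big6) / n
y_bar = sum(auditfees) / n
s_xx = sum((xi - x_bar) ** 2 for xi in big6)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(big6, auditfees))
a1 = s_xy / s_xx          # coefficient on the dummy
a0 = y_bar - a1 * x_bar   # constant

# Group means computed directly, as a two-sample comparison would.
fees_0 = [f for f, d in zip(auditfees, big6) if d == 0]
fees_1 = [f for f, d in zip(auditfees, big6) if d == 1]
mean_0 = sum(fees_0) / len(fees_0)
mean_1 = sum(fees_1) / len(fees_1)
print(a0, a1, mean_0, mean_1 - mean_0)
```

This is why reg and ttest tell the same story: a two-sample t-test that assumes equal variances yields a t-statistic of the same magnitude as the one on the dummy coefficient (the sign depends on which group mean is subtracted from which).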

  • Sometimes even published studies give an
    incorrect interpretation of the estimated
    coefficients. For example

  • Class questions
  • Theoretically, how should auditing affect the
    interest rate that the company has to pay?
  • Empirically, how do we measure the impact of
    auditing on the interest rate using eq. (1)?

  • Class question: At what values of total assets
    (000) is the effect of the Audit Dummy on the
    interest rate negative, zero, or positive?

  • Class questions
  • What is the mean value of total assets?
  • How does auditing affect the interest rate for
    the average company in their sample?

  • Verify that the above claim is true.
  • Suppose Blackwell et al. had reported the impact
    for a firm with 11m in assets and a firm with
    15m in assets.
  • How would this have changed the conclusions drawn?

2.4 Examining the residuals
  • Go to http://ihome.ust.hk/accl/Phd_teaching.htm
  • Import anscombe.dta into STATA (use
    "C:\phd\anscombe.dta", clear) and run the
    following regressions
  • reg y1 x1
  • reg y2 x2
  • reg y3 x3
  • reg y4 x4
  • Note that the output from these regressions is
    virtually identical
  • intercept = 3.0 (t-stat = 2.67)
  • x coefficient = 0.5 (t-stat = 4.24)
  • R-squared = about 67%
  • If you did not know about OLS assumptions, you
    would probably stop your analysis at this point,
    concluding that you have a good fit for all four
    models
  • In fact, only one of these four models is well
    specified
Class exercise 2
  • Draw scatter graphs for each of these four
    associations. For example
  • twoway (scatter y1 x1) (lfit y1 x1)
  • Of the four models, which do you think is the
    well specified one?
  • Draw scatter graphs of the residuals against the
    x variable for each of the four regressions. Is
    there a pattern?
  • Which of the OLS assumptions are violated in
    these four regressions?

2.4 Examining the residuals
  • Unfortunately, it is common among researchers to
    judge whether a model is well-specified solely
    in terms of its explanatory power (i.e., the
    R-squared)
  • You should report other types of diagnostic tests
  • is there significant heteroscedasticity?
  • is there any pattern to the residuals?
  • are there any problems of outliers?

2.4 Examining the residuals
  • An examination of the residuals can help us to
    identify whether the model is well specified. For
    example
  • use "C:\phd\Fees.dta", clear
  • reg auditfees totalassets
  • predict res1, resid
  • twoway (scatter res1 totalassets, msize(tiny))
    (lfit res1 totalassets)
  • gen lnaf = ln(auditfees)
  • gen lnta = ln(totalassets)
  • reg lnaf lnta
  • predict res2, resid
  • twoway (scatter res2 lnta, msize(tiny)) (lfit
    res2 lnta)
  • Notice that the residuals are more spherical,
    displaying less of an obvious pattern, in the
    logged model.

Class exercise 3
  • Following Pong and Whittington (1994) estimate
    the raw value of audit fees as a function of raw
    assets and assets squared
  • Examine the residuals
  • Do you think this model is better specified than
    the one in logs?

2.5 Multiple regression
  • Researchers use multiple regression when they
    believe that Y is affected by multiple
    independent variables
  • Y = a0 + a1 X1 + a2 X2 + e
  • Why is it important to control for multiple
    factors that influence Y?

2.5 Multiple regression
  • Suppose the true model is
  • Y = a0 + a1 X1 + a2 X2 + e
  • where X1 and X2 are uncorrelated with the error, e
  • Suppose the OLS model that we estimate is
  • Y = a0 + a1 X1 + u
  • where u = a2 X2 + e
  • OLS imposes the assumption that X1 is
    uncorrelated with the residual term, u.
  • Since X1 is uncorrelated with e, the assumption
    that X1 is uncorrelated with u is equivalent
    (when a2 is not zero) to assuming that X1 is
    uncorrelated with X2.

2.5 Multiple regression
  • If X1 is correlated with X2 (and a2 is not zero),
    the OLS estimate of a1 is biased.
  • The magnitude of the bias depends upon the size
    of a2 and the strength of the relation between
    X1 and X2.
  • Of course, we often do not know whether the model
    we estimate is the true model
  • In other words, we are unsure whether there is an
    omitted variable (X2) that affects Y and that is
    correlated with our variable of interest (X1)
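
The size of the omitted variable bias can be made explicit. This is a sketch of the standard large-sample result, using the same a0, a1, a2, X1, X2 and e as the equations above; the hat and the plim notation are assumed here and do not appear in the slides.

```latex
\begin{align*}
\operatorname{plim} \hat{a}_1
  &= \frac{\operatorname{Cov}(X_1, Y)}{\operatorname{Var}(X_1)}
   = \frac{\operatorname{Cov}(X_1,\; a_0 + a_1 X_1 + a_2 X_2 + e)}{\operatorname{Var}(X_1)} \\
  &= a_1 + a_2 \, \frac{\operatorname{Cov}(X_1, X_2)}{\operatorname{Var}(X_1)}
   \qquad \text{since } \operatorname{Cov}(X_1, e) = 0
\end{align*}
```

The second term is the bias. It vanishes only if a2 = 0 (X2 does not affect Y) or if X1 and X2 are uncorrelated, and its sign is the sign of a2 times the sign of the covariance.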

2.5 Multiple regression
  • We can judge whether or not there is likely to be
    a correlated omitted variable problem using
  • theory
  • prior empirical studies

2.5 Multiple regression
  • Previously, we checked whether there was a
    pattern between the residuals and one independent
    variable
  • lnaf = a0 + a1 lnta + res2
  • twoway (scatter res2 lnta) (lfit res2 lnta)
  • When we are using multiple regression, we want to
    test whether there is a pattern between the
    residuals and the right hand side variables as a
    whole
  • The right hand side of the equation is the same
    thing as the predicted value of the dependent
    variable
2.5 Multiple regression
  • So we should examine whether there is a pattern
    between the residuals and the predicted values of
    the dependent variable
  • STATA has a command which enables us to examine
    the relation between the residuals and the fitted
    (i.e., predicted) values
  • reg lnaf lnta big6
  • rvfplot
  • rvf stands for residuals versus fitted

2.6 Heteroscedasticity (hettest)
  • The OLS techniques for estimating standard errors
    are based on an assumption that the variance of
    the errors is the same for all values of the
    independent variables (homoscedasticity)
  • In many cases, the homoscedasticity assumption is
    violated. For example
  • reg auditfees totalassets big6
  • rvfplot
  • The homoscedasticity assumption can be tested
    using the hettest command after we run the
    regression
  • reg auditfees totalassets big6
  • hettest
  • Heteroscedasticity does not bias the coefficient
    estimates, but it biases the conventional
    standard errors of the coefficients, often
    downwards (so the t-stats are biased upwards)

2.6 Heteroscedasticity (robust)
  • Heteroscedasticity often occurs if the dependent
    variable is not symmetrically distributed
  • For example, the auditfees variable is highly
    skewed due to the fact that it has a lower bound
    of zero
  • Much of the heteroscedasticity can often be
    removed by transforming the dependent variable
    (e.g., use the log of audit fees instead of the
    raw values)

2.6 Heteroscedasticity (robust)
  • When you find that there is heteroscedasticity,
    you need to correct the standard errors using the
    Huber/White/sandwich estimator
  • In STATA it is easy to do this adjustment using
    the robust option
  • reg auditfees totalassets big6, robust
  • Compare the adjusted and unadjusted results
  • reg auditfees totalassets big6
  • note that the coefficients are exactly the same
  • the t-statistics on the independent variables are
    much smaller when the standard errors are
    adjusted for heteroscedasticity
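
To make the adjustment concrete, here is a minimal Python sketch of the sandwich calculation for the slope in a single-variable model. The data are hypothetical (noise that grows with x) and are NOT from Fees.dta, and the function name ols_with_se is made up for illustration. The sketch applies the n/(n-k) small-sample scaling (often called HC1), which is assumed here to be what the robust option uses.

```python
import math

def ols_with_se(x, y):
    """Single-regressor OLS. Returns (a0, a1, conventional SE of a1, HC1 robust SE of a1)."""
    n, k = len(x), 2
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    s_xx = sum(d * d for d in dx)
    a1 = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / s_xx
    a0 = y_bar - a1 * x_bar
    res = [yi - a0 - a1 * xi for xi, yi in zip(x, y)]

    # Conventional SE: one common residual variance for every observation.
    se_conv = math.sqrt(sum(r * r for r in res) / (n - k) / s_xx)

    # Sandwich SE: each observation contributes its own squared residual,
    # weighted by how far its x lies from the mean of x.
    hc0_var = sum((d * r) ** 2 for d, r in zip(dx, res)) / s_xx ** 2
    se_hc1 = math.sqrt(hc0_var * n / (n - k))
    return a0, a1, se_conv, se_hc1

# Hypothetical data in which the scatter around the line widens as x grows.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.8, 6.3, 7.5, 10.9, 10.6, 16.2, 14.1]
a0, a1, se_conv, se_hc1 = ols_with_se(x, y)
print(round(a1, 3), round(se_conv, 3), round(se_hc1, 3))
```

The coefficient estimates are computed before either standard error and are untouched by the adjustment, which is why the robust and non-robust regressions report exactly the same coefficients.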

Class exercise 4
  • Estimate the audit fee model in logs rather than
    absolute values
  • Using rvfplot, assess whether the residuals
    appear to be non-constant
  • Using hettest, provide a formal test for
    heteroscedasticity
  • Compare the coefficients and t-statistics when
    you estimate the standard errors with and without
    adjusting for heteroscedasticity.

  • You should now understand
  • how OLS models work
  • how to interpret the results of OLS models
  • how to spot potential problems of outliers and
  • how to correct the standard errors for
    heteroscedasticity