Title: 5. Endogenous right hand side variables
Provided by: accl5


1
5. Endogenous right hand side variables
  • 5.1 The problem of endogeneity bias
  • 5.2 The basic idea underlying the use of
    instrumental variables
  • 5.3 When the endogenous right hand side variable
    is continuous
  • 5.4 When the endogenous right hand side variable
    is binary

2
5.1 Endogeneity bias
  • Consider a simple OLS regression
  • Yit = a0 + a1 X1it + uit
  • Recall that our estimate of a1 will be unbiased
    only if we can assume that X1it is uncorrelated
    with the error term (uit)
  • We have discussed two ways to help ensure that
    this assumption is true
  • First, we should control for any observable
    variables that affect Yit and which are
    correlated with X1it. For example, we should
    control for X2it if X2it affects Yit and X2it is
    correlated with X1it (see Chapter 2)
  • Yit = a0 + a1 X1it + a2 X2it + uit
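
The bias from leaving out X2it can be written down explicitly. The following is the standard omitted-variable result, stated here as a sketch; the tilde notation for the error of the short regression is an assumption introduced for illustration and is not in the original slides.

```latex
% True model, and the short regression that omits X2
\[ Y_{it} = a_0 + a_1 X_{1it} + a_2 X_{2it} + u_{it} \]
\[ Y_{it} = a_0 + a_1 X_{1it} + \tilde{u}_{it}, \qquad \tilde{u}_{it} = a_2 X_{2it} + u_{it} \]
% Probability limit of the OLS slope in the short regression
\[ \operatorname{plim}\, \hat{a}_1 = a_1 + a_2\, \frac{\operatorname{Cov}(X_{1it}, X_{2it})}{\operatorname{Var}(X_{1it})} \]
```

The second term vanishes only if X2it has no effect on Yit (a2 = 0) or is uncorrelated with X1it, which matches the two conditions stated in the bullet above.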

3
5.1 Endogeneity bias
  • Second, if we have panel data, we can control for
    any unobservable firm-specific characteristics
    (ui) that affect Yit and which are correlated
    with the X variables.
  • From Chapter 4
  • Yit = a0 + a1 X1it + a2 X2it + ui + eit
  • We control for the correlations between ui and
    the X variables by estimating fixed effects
    models.
  • Our estimates of a1 and a2 are unbiased if the X
    variables are uncorrelated with eit. In this
    case, we say that the X variables are exogenous.

4
5.1 Endogeneity bias
  • Unfortunately, multiple regression and fixed
    effects models do not always ensure that the X
    variables are uncorrelated with the error term
  • if we do not observe all the variables that
    affect Y and that are correlated with X, multiple
    regression will not solve the problem.
  • if we do not have panel data, the fixed effects
    models cannot be estimated.
  • even if we have panel data, the Y and X variables
    may display little variation over time in which
    case the fixed effects models can be unreliable
    (Zhou, 2001).
  • even if we have panel data and the Y and X
    variables display sufficient variation over time,
    the unobservable variables that are correlated
    with X may not be constant over time in which
    case the fixed effects models will not solve the
    problem.

5
  • A variable is more likely to be correlated with
    the error term if it is endogenous
  • Endogenous means that the variable is
    determined within the economic model that we are
    trying to estimate.
  • For example, suppose that Y2it is an endogenous
    explanatory variable
  • Y1it = a0 + a1 Y2it + a2 Xit + uit     (1)
  • Y2it = b0 + b1 Xit + b2 Zit + vit     (2)
  • Equations (1) and (2) have a triangular
    structure since Y2it is assumed to affect Y1it,
    but Y1it is assumed not to affect Y2it
  • Given this triangular structure, the OLS estimate
    of a1 in equation (1) is unbiased only if vit is
    uncorrelated with uit
  • If vit is correlated with uit, then Y2it is
    correlated with uit which means that the OLS
    estimate of a1 would be biased
  • To avoid this bias, we must estimate equation (1)
    using instrumental variables (IV) regression rather
    than OLS.
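
The step from "vit is correlated with uit" to "Y2it is correlated with uit" follows directly from equation (2). A short sketch, assuming (as the slides do) that Xit and Zit are uncorrelated with uit:

```latex
\[
\operatorname{Cov}(Y_{2it}, u_{it})
  = \operatorname{Cov}(b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it},\; u_{it})
  = \operatorname{Cov}(v_{it}, u_{it})
\]
```

So the regressor Y2it is correlated with the error in equation (1) exactly when the two error terms are correlated.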

6
  • Equations (1) and (2) are called structural
    equations because they describe the economic
    relationship between Y1it and Y2it
  • We can obtain a reduced-form equation by
    substituting eq. (2) into eq. (1)
  • Y1it = a0 + a1 (b0 + b1 Xit + b2 Zit + vit) + a2
    Xit + uit
  • In this reduced-form equation, all the
    explanatory variables (Xit and Zit) are exogenous
  • The basic idea underlying IV regression is to
    remove vit from the Y1it model so that our
    estimate of a1 is unbiased.
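
Collecting terms in the substituted equation makes the structure of the reduced form easier to see. This is only a rearrangement of the equation on this slide:

```latex
\[
Y_{1it} = (a_0 + a_1 b_0) + (a_1 b_1 + a_2)\, X_{it} + a_1 b_2\, Z_{it} + (a_1 v_{it} + u_{it})
\]
```

The coefficient on Zit is the product a1b2, and the composite error a1 vit + uit is where vit enters the Y1it model.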

7
5.2 The basic idea underlying the use of
instrumental variables
  • Note that vit is removed from the Y1it model if
    we use the predicted rather than the actual
    values of Y2it on the right hand side.
  • We predict Y2it using all the exogenous variables
    in the system (in our example, we use the two
    exogenous variables Xit and Zit)

8
5.2 The basic idea
  • We then use the predicted rather than the actual
    values of Y2it when estimating the Y1it model
  • The a1 estimate is biased in eq. (3) but it is
    unbiased in eq. (4) because the vit term has been
    removed.
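
The equations referred to as (3) and (4) were not captured in this transcript. One reading that is consistent with the surrounding bullets is sketched below; the hats, the starred error term and the assignment of the numbers (3) and (4) are assumptions, not the original slide content.

```latex
% First stage: predicted values from all exogenous variables
\[ \hat{Y}_{2it} = \hat{b}_0 + \hat{b}_1 X_{it} + \hat{b}_2 Z_{it} \]
% (3) actual values of Y2 on the right hand side: v is still present
\[ Y_{1it} = a_0 + a_1 (b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}) + a_2 X_{it} + u_{it} \]
% (4) predicted values of Y2 on the right hand side: v has been removed
\[ Y_{1it} = a_0 + a_1 (\hat{b}_0 + \hat{b}_1 X_{it} + \hat{b}_2 Z_{it}) + a_2 X_{it} + u^{*}_{it} \]
```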

9
  • In eq. (4) the estimated coefficient for the Zit
    variable is a1b2
  • We already know the value of b2 from eq.
    (2)
  • Therefore a1 = (a1b2) / b2
  • It is important to note that the a1
    coefficient can be estimated only if there is at
    least one exogenous variable in the structural
    model for Y2it that is excluded from the
    structural model for Y1it
  • This is the Zit variable in eq. (2)

10
  • In eq. (4) the a1 coefficient is just
    identified because there is only one exogenous
    variable (Zit) that is in the Y2it model and that
    is excluded from the Y1it model

11
  • Suppose we had included Zit in both models
  • In this case, the a1 coefficient cannot be
    identified because we estimate b2 and (a3 + a1b2)
  • In other words, we cannot determine whether the
    effect of Zit on Y1it is a main effect (a3) or an
    indirect effect through Y2it (a1b2)
  • Here we say that the system of equations is
    under-identified

12
  • Suppose we had included two exogenous variables
    in the Y2it model and we excluded both these
    variables from the Y1it model
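  • Now we have estimates of the two first-stage
    coefficients on these variables and of the two
    corresponding reduced-form coefficients.
  • Therefore there are two ways to recover a1, one
    from each excluded variable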
  • Here we say that the system of equations is
    over-identified
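
The over-identified case can be sketched in the same notation as before. The names Z1it and Z2it for the two excluded variables, and b2 and b3 for their first-stage coefficients, are hypothetical labels chosen for illustration.

```latex
\[ Y_{2it} = b_0 + b_1 X_{it} + b_2 Z_{1it} + b_3 Z_{2it} + v_{it} \]
% Reduced form for Y1 after substitution
\[ Y_{1it} = (a_0 + a_1 b_0) + (a_1 b_1 + a_2) X_{it} + a_1 b_2\, Z_{1it} + a_1 b_3\, Z_{2it} + (a_1 v_{it} + u_{it}) \]
% Two ratios that both identify a1
\[ a_1 = \frac{a_1 b_2}{b_2} = \frac{a_1 b_3}{b_3} \]
```

In a finite sample the two ratios generally give different numbers, which is what makes a test of the over-identifying restrictions possible.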

13
5.3 When the endogenous right hand side variable
is continuous
  • When the models have a triangular structure, the
    models can be estimated using the ivreg command
    (NB In STATA 10.0 this command has changed to
    ivregress)
  • In our example, the system is triangular because
    there are two equations and one endogenous
    right-hand side variable

14
5.3.1 Estimating triangular models using 2SLS
(ivreg)
  • Go to http://ihome.ust.hk/accl/Phd_teaching.htm
  • Open up the housing.dta file which provides data
    from 50 U.S. states (1980 Census)
  • use "C:\phd\housing.dta", clear
  • pct_population_urban = the % of the population
    that lives in urban areas
  • family_income = median annual family income
  • housing_value = median value of private housing
  • rent = median monthly housing rental payments
  • region1 - region4 = dummy variables for four
    regions in the U.S.

15
  • Suppose we want to estimate the following
  • rent = a0 + a1 pct_population_urban
    + a2 housing_value + u
  • housing_value = b0 + b1 family_income
    + b2 region2 + b3 region3 + b4 region4 + v
  • This is a triangular system because there are two
    equations and one endogenous right hand side
    variable (housing_value)
  • If u and v are correlated, the OLS estimate of a2
    will be biased in the rent model

16
  • If we ignore the endogeneity problem and estimate
    the rent model using simple OLS
  • reg rent housing_value pct_population_urban
  • To take account of the potential endogeneity
    problem we use the ivreg command
  • ivreg depvar1 varlist1 (depvar2 = varlistiv)
  • depvar1 is the dependent variable for the model
    which has an endogenous regressor
  • varlist1 are the exogenous variables in the model
    that has the endogenous regressor
  • depvar2 is the endogenous regressor
  • varlistiv are the exogenous variables that are
    believed to affect the endogenous regressor

17
  • The models that we want to estimate are
  • rent = a0 + a1 pct_population_urban
    + a2 housing_value + u
  • housing_value = b0 + b1 family_income
    + b2 region2 + b3 region3 + b4 region4 + v
  • Therefore
  • ivreg rent pct_population_urban (housing_value =
    family_income region2 region3 region4)
  • The housing_value model can be estimated using
    OLS as there are no endogenous regressors.
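
For readers on a release where ivreg has been replaced by ivregress (see the note in section 5.3), the same model can be written as below. This is a minimal sketch that assumes housing.dta is already loaded with the variable names used in these slides; the /// line continuation works in a do-file, not at the interactive prompt.

```stata
* Same 2SLS model as the ivreg call above, in ivregress syntax.
* The "first" option also displays the first-stage regression.
ivregress 2sls rent pct_population_urban ///
    (housing_value = family_income region2 region3 region4), first
```

By default ivregress reports large-sample statistics, so the standard errors and test statistics may differ slightly from those of ivreg; adding the small option requests the degrees-of-freedom adjustment.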

18
  • STATA tells us that
  • the endogenous regressor is housing_value
  • the pct_population_urban, family_income region2 -
    region4 variables are assumed to be exogenous
    (i.e., they are instruments)

19
  • We would get exactly the same coefficients if we
    first estimated the housing_value model and then
    included the predicted values of housing_value in
    the rent model
  • However, the standard errors are biased under OLS
    so, in practice, you should use the ivreg command
  • reg housing_value family_income region2 region3
    region4 pct_population_urban
  • NB the housing_value model must be estimated on
    all the exogenous variables (including
    pct_population_urban)
  • predict housing_value_hat
  • reg rent housing_value_hat pct_population_urban

20
  • These OLS coefficients are the same as we
    obtained using ivreg.
  • However, the standard errors are different.
  • The OLS coefficients from the second-stage model
    using the predicted housing_value variable are

21
  • We should test whether
  • our chosen instruments are exogenous (i.e., they
    should be uncorrelated with the error term) and
  • it is valid to exclude some of them from the
    model that has the endogenous regressor.
  • If they are not exogenous or they should not be
    excluded, they are not valid instruments.

22
  • The Sargan and Basmann tests are used to test for
    instrument validity
  • they are tests of over-identifying restrictions
    because the tests can only be performed if the
    model with the endogenous regressor is
    overidentified
  • the tests assume that at least one of the chosen
    instruments is valid (unfortunately this
    assumption cannot be tested)
  • In our example, the instrumented housing_value
    variable is overidentified because four of the
    exogenous variables (family_income region2
    region3 region4) are excluded from the rent
    model.
  • If we had excluded only one of these variables,
    the instrumented housing_value variable would
    have been just identified in which case the
    Sargan and Basmann tests would have been
    unavailable.

23
  • We obtain the Sargan and Basmann tests by typing
    overid after we run ivreg
  • ivreg rent pct_population_urban (housing_value =
    family_income region2 region3 region4)
  • overid
  • If overid is not installed on your computer you
    can install it from the STATA Technical Bulletin
    by typing findit overid
  • findit commandname is a very useful way of
    downloading commands that have been written but
    not installed on your version of STATA (e.g.,
    your version is out of date or the command has
    not yet been included in the latest version)

24
  • These tests are statistically significant, which
    means the chosen instruments are not valid.
  • This is not surprising because we did not have
    good reason to assume that they are exogenous and
    validly excluded from the rent model. For
    example
  • family incomes may depend on housing values and
    rents (e.g., families may own housing for
    investment purposes), so family_income is
    endogenous
  • rents may be different across the four regions,
    so the region dummies should not be excluded from
    the rent model

25
  • We can also test whether the coefficient of the
    endogenous regressor is biased under OLS.
  • The Hausman tests for endogeneity bias are only
    reliable if the chosen instruments are valid.
  • We obtain two Hausman tests for endogeneity bias
    by typing ivendog after we run ivreg
  • Given these results, we can strongly reject the
    hypothesis that housing_value is exogenous
  • Therefore, we have reason to be concerned about
    endogeneity bias (however, this test is not
    reliable as our chosen instruments are not
    valid).
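
In releases that provide ivregress, the instrument-validity and endogeneity tests are available as built-in postestimation commands rather than the user-written overid and ivendog. The sketch below assumes housing.dta is loaded with the variable names used in these slides.

```stata
* Fit the model by 2SLS with the default (unadjusted) standard errors.
ivregress 2sls rent pct_population_urban ///
    (housing_value = family_income region2 region3 region4)

estat overid        // Sargan and Basmann tests of overidentifying restrictions
estat endogenous    // Durbin and Wu-Hausman tests that housing_value is exogenous
estat firststage    // first-stage statistics on the strength of the instruments
```

The caveats on this slide still apply: the endogeneity test is only as reliable as the instruments it is built on.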

26
  • It is easy to correct for heteroscedasticity and
    time-series dependence because we can use the
    robust cluster() option with ivreg
  • However, we can only run the overid and ivendog
    commands after running ivreg not after running
    ivreg, robust cluster()
  • Therefore, you can get the correct standard
    errors using robust cluster() and then test for
    endogeneity bias and instrument validity by
    running ivreg without the robust cluster() option

27
Class exercise 5a
  • Using the fees.dta file, estimate the following
    models for audit fees and company size
  • lnaf = a0 + a1 lnta + a2 big6 + u
  • lnta = b0 + b1 ln_age + b2 listed + v
  • where lnaf is the log of audit fees, lnta is the
    log of total assets, ln_age is the log of the
    company's age in years, listed is a dummy
    variable indicating whether the company's shares
    are publicly traded on a market.
  • Is the instrumented lnta variable
    over-identified, just-identified, or
    under-identified? Explain.
  • Estimate the audit fee model using IV regression,
    controlling for heteroscedasticity and
    time-series dependence.
  • Check that the coefficients are the same if you
    instead use the OLS two-step approach.
  • Test the validity of the chosen instrumental
    variables.
  • Test whether the lnta variable is affected by
    endogeneity bias.
  • Verify that the test for instrument validity is
    not available if you change the model so that it
    is just-identified.

28
Class exercise 5a
  • The instrumented lnta variable is over-identified
    because two exogenous variables (ln_age and
    listed) are excluded from the lnaf model.
  • Generating the variables and dropping
    observations with missing data
  • use "C:\phd\Fees.dta", clear
  • gen fye = date(yearend, "mdy")
  • format fye %d
  • gen year = year(fye)
  • gen age = year - incorporationyear
  • gen ln_age = ln(age)
  • gen listed = 0
  • replace listed = 1 if companytype == 2 |
    companytype == 3 | companytype == 5
  • gen lnaf = ln(auditfees)
  • gen lnta = ln(totalassets)
  • egen miss = rmiss(ln_age listed lnaf lnta big6)
  • drop if miss != 0
  • ivreg lnaf big6 (lnta = ln_age listed), robust
    cluster(companyid)

29
Class exercise 5a
30
Class exercise 5a
  • Checking against the two-step OLS results
  • reg lnta ln_age listed big6
  • predict lnta_hat
  • reg lnaf lnta_hat big6

31
Class exercise 5a
  • Testing for instrument validity and endogeneity
    bias
  • Remember to drop the robust cluster() option
  • ivreg lnaf big6 (lnta = ln_age listed)
  • overid
  • ivendog
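
With ivregress the option does not necessarily have to be dropped. The Stata documentation describes robust versions of both tests that are reported when a robust variance estimator was used at estimation; whether your release supports them with a clustered estimator is an assumption worth checking. A sketch, assuming the Fees.dta variables generated above:

```stata
* 2SLS with standard errors clustered by company.
ivregress 2sls lnaf big6 (lnta = ln_age listed), vce(cluster companyid)

estat overid        // robust score test of the overidentifying restrictions
estat endogenous    // robust tests that lnta is exogenous
```

This applies to ivregress only; the advice on this slide still holds for ivreg with overid and ivendog.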

32
Class exercise 5a
  • To check that the test for instrument validity
    requires the model to be over-identified, we can
    include ln_age or listed in the audit fee model
    so that it becomes just-identified. For example
  • ivreg lnaf big6 ln_age (lnta = ln_age listed)
  • overid
  • Or
  • ivreg lnaf big6 listed (lnta = ln_age listed)
  • overid
  • If we include both ln_age and listed in the audit
    fee model, it is under-identified and we cannot
    estimate the effect of company size on fees
  • ivreg lnaf big6 ln_age listed (lnta = ln_age
    listed)

33
  • The key to estimating IV models is to find one or
    more exogenous variables that explain the
    endogenous regressor and that can be safely
    excluded from the main equation.
  • Unfortunately, most accounting studies that use
    IV regression do not attempt to justify why their
    chosen instruments are exogenous or why they can
    be excluded from the structural model.
  • As a result, Larcker and Rusticus (2007)
    criticize the way in which accounting studies
    have applied IV regression
  • A key problem is that the IV results can be very
    sensitive to the researcher's choice of which
    variables to exclude from the structural model
    and, in many studies, these variables have been
    chosen in a very arbitrary way

34
(No Transcript)
35
(No Transcript)
36
(No Transcript)
37
  • When testing instrument validity (overid) and
    endogeneity bias (ivendog), it is important to
    consider your sample size
  • in large samples, the tests may reject a null
    hypothesis that is nearly true.
  • in small samples, the tests may fail to reject a
    null hypothesis that is very false.
  • Larcker and Rusticus (2007) recommend that formal
    tests for instrument validity and endogeneity
    should be supplemented with sensitivity analyses.
    For example, researchers should
  • report both the OLS and IV results
  • examine whether the results are sensitive to
    using different instrumental variables

38
  • It is important to note that the overid test for
    instrument validity relies on the assumption that
    at least one of the chosen instrumental variables
    is valid
  • Larcker and Rusticus (2007) recommend that the
    overid test should be used to check the validity
    of instruments that are justified using theory
    or some basic economic intuition
  • the overid test should not be used to select
    instruments on purely statistical grounds

39
5.3.2 Estimating simultaneous equations using
3SLS (reg3)
  • So far we have been examining a triangular
    system. For example, Y2it affects Y1it but Y1it
    does not affect Y2it
  • Y1it = a0 + a1 Y2it + a2 Xit + a3 Z2it + uit
  • Y2it = b0 + b2 Xit + b3 Z1it + vit
  • In a simultaneous system, both dependent
    variables affect each other
  • Y1it = a0 + a1 Y2it + a2 Xit + a3 Z2it + uit
  • Y2it = b0 + b1 Y1it + b2 Xit + b3 Z1it + vit

40
  • Y1it = a0 + a1 Y2it + a2 Xit + a3 Z2it + uit     (1)
  • Y2it = b0 + b1 Y1it + b2 Xit + b3 Z1it + vit     (2)
  • In this case, the OLS estimates are biased
    because
  • Eq. (1) shows that uit affects Y1it while eq. (2)
    shows that Y1it affects Y2it. As a result, it
    must be true that uit is correlated with Y2it in
    eq. (1). Therefore, the OLS estimate of a1 would
    be biased in eq. (1).
  • Eq. (2) shows that vit affects Y2it while eq. (1)
    shows that Y2it affects Y1it. As a result, it
    must be true that vit is correlated with Y1it in
    eq. (2). Therefore, the OLS estimate of b1 would
    be biased in eq. (2).
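
The verbal argument can be checked by solving the two equations for Y2it. Substituting eq. (1) into eq. (2) and rearranging gives the following, under the assumption that a1b1 is not equal to 1:

```latex
\[
Y_{2it} = \frac{(b_0 + b_1 a_0) + (b_2 + b_1 a_2)\, X_{it} + b_1 a_3\, Z_{2it} + b_3\, Z_{1it} + b_1 u_{it} + v_{it}}{1 - a_1 b_1}
\]
```

Because uit appears on the right hand side whenever b1 is non-zero, Y2it is correlated with uit, which is the source of the bias in eq. (1). The same substitution in the other direction shows that Y1it depends on vit.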

41
  • For example, it seems reasonable to argue that
    housing values depend on rents as well as rents
    depending on housing values
  • rent = a0 + a1 housing_value + a2
    pct_population_urban + u
  • housing_value = b0 + b1 rent + b2 family_income
    + b3 region2 + b4 region3 + b5 region4 + v
  • Note that for identification, each equation must
    contain at least one exogenous variable that is
    not included in the other equation. These are
  • pct_population_urban in the rent model
  • family_income, region2 - region4 in the
    housing_value model

42
  • We estimate this kind of model using the reg3
    command
  • reg3 (depvar1 varlist1) (depvar2 varlist2)
  • use "C:\phd\housing.dta", clear
  • reg3 (rent housing_value pct_population_urban)
    (housing_value rent family_income region2
    region3 region4)
  • Note that the robust cluster() option and the
    overid and ivendog commands are not available
    with reg3

43
5.4 When the endogenous right hand side variable
is binary
  • So far we have been dealing with the case where
    the endogenous regressor is continuous.
  • We may want to estimate a model in which the
    endogenous regressor is binary.
  • This brings us to a special class of models which
    are known as self-selection or Heckman
    models. Selectivity = endogeneity where the
    endogenous regressor is binary
  • The basic idea is similar to the instrumental
    variable techniques that we have already
    discussed.
  • Unfortunately, many accounting researchers have
    been misusing the Heckman model (Lennox and
    Francis, 2009).

44
  • Examples of endogenous binary variables in
    accounting
  • Companies decide whether to use hedge contracts
    (Barton, 2001; Pincus and Rajgopal, 2002).
  • Companies decide whether to grant stock options
    (Core and Guay, 1999).
  • Companies decide whether to hire Big 5 or non-Big
    5 auditors (e.g., Chaney et al., 2004).
  • Governments decide whether to fully or partially
    privatize (Guedhami and Pittman, 2006).
  • Companies decide whether to follow international
    financial reporting strategy (Leuz and
    Verrecchia, 2000).
  • Companies decide whether to recognize financial
    instruments at fair value or disclose (Ahmed et
    al., 2006).
  • Companies decide whether or not to go private
    (Engel et al., 2002).

45
Selection model
  • Concerns about selectivity arise when the RHS
    dummy variable (D) is endogenous
  • Endogeneity results in bias if E(u | D) ≠ 0. The
    intuition underlying Heckman is to estimate and
    then control for E(u | D). First model the choice
    of D
  • Z is a vector of exogenous variables that affect
    D but have no direct effect on Y.

46
Selection model
[Diagram: Z affects D, and D affects Y]
47
Selection model
  • The error terms in the two equations (u and v)
    are assumed to have a bivariate normal
    distribution with mean zero and
    variance-covariance matrix
  • If u and v are correlated (ρ ≠ 0), then
    E(u | D) ≠ 0, in which case the OLS
    estimate of the effect of D on Y would be biased.

48
Selection model
  • The intuition underlying Heckman is to estimate
    E(u | D) and include it as a control variable on
    the RHS of the Y model
  • E(u | D) = ρσ IMR where
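
The definition of the IMR (inverse Mills ratio) was not captured in this transcript. Under the usual setup, in which D = 1 when Zγ + v > 0 and v is standard normal, the textbook expressions are as follows; the symbol γ for the coefficients of the choice model is an assumption introduced for illustration.

```latex
\[
\mathrm{IMR} =
\begin{cases}
\dfrac{\phi(Z\gamma)}{\Phi(Z\gamma)} & \text{if } D = 1 \\[2ex]
\dfrac{-\phi(Z\gamma)}{1 - \Phi(Z\gamma)} & \text{if } D = 0
\end{cases}
\]
```

Here φ and Φ are the standard normal density and distribution functions, ρ is the correlation between u and v, and σ is the standard deviation of u.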

49
Selection model
  • The IMR variable is added as a control for
    selectivity in the Y model
  • The OLS estimate of the effect of D on Y is now
    unbiased because E(e | D) = 0.
  • The D and Y models can be estimated in two-steps
    or estimated jointly using maximum likelihood
    (ML)
  • ML yields separate estimates of ρ and σ.
  • The two-step yields an estimate of ρσ.
  • Under the null of no selectivity bias, ρ = 0 and
    ρσ = 0.

50
Selection model
  • In the Y model it has been assumed that the slope
    coefficients on the X variables are equal for the
    cases where D = 0 and D = 1
  • This assumption can be relaxed by estimating the
    model separately on the two sub-samples

51
Selection model
  • Again, selectivity is controlled for by including
    IMR variables on the RHS

52
  • For example, Chaney, Jeter and Shivakumar (2004)
    examine the case where Y = audit fees and D = a
    dummy for Big 6 (Non-Big 6) audits.
  • They argue that an OLS regression of eq. (1)
    gives biased estimates of the Big 6 fee premium.
  • They also argue that the slope coefficients
    may differ between Big 6 and Non-Big 6 audit
    clients

53
Class exercise 5b
  • As an example, we are going to look at a
    fictional dataset on 2,000 women.
  • use "C:\phd\heckman.dta", clear
  • sum age education married children wage
  • Suppose we believe that older and more highly
    educated women earn higher wages. Why would it be
    wrong to estimate the following model?
  • reg wage age education
  • Estimate a probit model to test whether women are
    more likely to be employed if they are married,
    have children, are older and more highly educated.

54
Class exercise 5b
  • Of the 2,000 women in our dataset, only 1,343 are
    in a paid job.
  • This raises a selection (endogeneity) problem
    because the sub-sample of 1,343 women is probably
    not representative of the population (which
    includes women who are not earning wages).
  • Put another way, we do not observe the wages that
    would have been earned by the 657 women if they
    had been in employment.

55
Class exercise 5b
  • wage = a0 + a1 age + a2 education + u
  • If older and more highly educated women earn
    higher wages, we expect a1 > 0 and a2 > 0.
  • However, the dependent variable (wage) is only
    observed for women who are in employment.
  • To overcome this problem, we need to think about
    what determines the likelihood of female
    employment. For example, we may argue that women
    are more likely to be employed if they are
    married, have children, are older and more highly
    educated
  • emp = b0 + b1 married + b2 children + b3 age
    + b4 education + v
  • gen emp = 0 if wage == .
  • replace emp = 1 if wage != .
  • probit emp married children age education

56
5.4 When the endogenous right hand side variable
is binary (heckman)
  • It is easy to estimate the two-step Heckman model
    in STATA
  • heckman depvar1 varlist1, select(depvar2 =
    varlist2) twostep
  • where depvar1 is the dependent variable in the
    main equation, depvar2 is the dependent
    variable in the selection model and varlist2 are
    the explanatory variables in the selection model
  • Going back to our dataset on female wages
  • heckman wage education age, select(emp = married
    children education age) twostep
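
To see what the twostep option does, the two steps can be run by hand. This is a sketch for intuition only, assuming heckman.dta is loaded and emp has been generated as above; the variable names zg and imr are hypothetical.

```stata
* Step 1: probit for employment, then the inverse Mills ratio (IMR).
probit emp married children education age
predict zg, xb                          // linear index from the probit
gen imr = normalden(zg) / normal(zg)    // IMR for the employed women

* Step 2: wage regression with the IMR as an extra control.
* wage is missing for women who are not employed, so they drop out.
reg wage education age imr
```

The coefficients should match those from heckman with the twostep option, but the standard errors from this hand-rolled second stage do not account for the IMR being estimated, so the heckman command should be used in practice.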

57
(No Transcript)
58
  • The 657 censored observations are the women who
    are not in employment.
  • The Wald chi2 tests the overall significance of
    the model.
  • Women's wages are higher if they are older and
    more highly educated
  • The probit model of employment is exactly the
    same as what we had before
  • Women are more likely to be in employment if they
    are married, have children, are more highly
    educated or older.

59
  • Recall that we are trying to estimate the error
    in the wage equation which is truncated because
    we only observe wages of the women who are in
    employment
  • The lambda variable is simply the IMR that was
    estimated from the emp model.
  • The IMR coefficient is 4.00.
  • Since the IMR coefficient is statistically
    significant, it may be concluded that there is
    statistically significant evidence of a selection
    effect.
  • The IMR coefficient can also be written as the
    product of rho and sigma (ρσ)
  • rho (ρ) is the correlation between u and v
  • sigma (σ) is the standard deviation of u
  • Thus, 4.00 = 0.67 × 5.95 (up to rounding)

60
  • The selection model can also be estimated using
    maximum likelihood (ML) rather than the two-step
    approach.
  • This can be useful if we want to test the
    statistical significance of rho.
  • If rho = 0, the IMR coefficient must also be zero
    in which case there is no need to control for
    selectivity.
  • There is an unresolved debate in the econometrics
    literature as to whether the two-step or ML
    approach is best.
  • STATA automatically gives us the ML results if we
    do not specify twostep as an option
  • heckman wage education age, select(emp = married
    children education age)

61
(No Transcript)
62
  • Here, the results for the wage and employment
    models are similar using either ML or the
    two-step.
  • NB Sometimes the results are different between
    ML and the two-step. Also you may find that the
    ML model does not converge if the likelihood
    function is not concave.
  • /athrho is the inverse hyperbolic tangent of rho
  • /lnsigma is the log of the standard deviation of
    u (σ)
  • STATA estimates athrho and lnsigma rather than
    rho and sigma directly in order to increase the
    numerical stability of the maximization routine
    for the likelihood function.
  • STATA also reports the untransformed values of
    rho and sigma.
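
The two transformations, and the way the reported values of rho, sigma and lambda are recovered from them, can be written as:

```latex
\[ \text{athrho} = \tanh^{-1}(\rho) = \tfrac{1}{2} \ln\!\left( \frac{1 + \rho}{1 - \rho} \right), \qquad \text{lnsigma} = \ln(\sigma) \]
\[ \rho = \tanh(\text{athrho}), \qquad \sigma = \exp(\text{lnsigma}), \qquad \lambda = \rho\,\sigma \]
```

The transformations keep rho inside (-1, 1) and sigma positive while the maximization routine searches over unrestricted values.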

63
  • The Likelihood-ratio statistic allows us to
    reject the hypothesis that rho = 0, which means
    that there is significant evidence of
    selectivity.
  • When rho = 0, it is also true that athrho
    equals zero.
  • The statistical significance of athrho implies
    that there is significant evidence of selectivity.

64
Class exercise 5c
  • Estimate the following audit fee models
    separately for Big 6 and Non-Big 6 audit clients
  • lnaf = a0 + a1 lnta + u     (1)
  • lnaf = a0 + a1 lnsales + u     (2)
  • where lnaf = log of audit fees, lnta = log of
    total assets, lnsales = log of sales
  • Use the heckman command to control for
    endogeneity with respect to the company's
    selected auditor. Your auditor choice models are
    as follows
  • big6 = b0 + b1 lnsales + b2 lnta + v
  • nbig6 = c0 + c1 lnsales + c2 lnta + w
  • where big6 = 1 (big6 = 0) if the company chooses
    a Big 6 (Non-Big 6) auditor and nbig6 = 1 (nbig6
    = 0) if the company chooses a Non-Big 6 (Big 6)
    auditor.

65
Class exercise 5c
  • What exclusion restrictions are you imposing in
    equations (1) and (2)?
  • Is there statistically significant evidence of
    selectivity?
  • For the two different specifications of the audit
    fee model
  • what are the signs of the IMR coefficients?
  • what are the signs of rho?

66
Class exercise 5c
  • In equation (1) we impose the restriction that
    lnsales does not affect lnaf. In equation (2) we
    impose the restriction that lnta does not affect
    lnaf.
  • use "C:\phd\Fees.dta", clear
  • gen lnsales = ln(sales)
  • gen lnaf = ln(auditfees)
  • gen lnta = ln(totalassets)
  • egen miss = rmiss(lnaf lnta lnsales)
  • drop if miss != 0
  • gen nbig6 = 0
  • replace nbig6 = 1 if big6 == 0
  • heckman lnaf lnta, select(big6 = lnta lnsales)
    twostep
  • heckman lnaf lnsales, select(big6 = lnta
    lnsales) twostep
  • heckman lnaf lnta, select(nbig6 = lnta lnsales)
    twostep
  • heckman lnaf lnsales, select(nbig6 = lnta
    lnsales) twostep

67
Class exercise 5c
  • The coefficients on the IMRs and rho are positive
    in equations (1) and (2) when the fee models are
    estimated for Non-Big 6 clients.
  • The coefficients on the IMRs and rho are positive
    in equation (1) but they are negative in equation
    (2) when the fee models are estimated for Big 6
    clients.
  • Therefore, the estimated effects of selectivity
    are sensitive to which exclusion restrictions are
    imposed on the audit fee model for Big 6 clients.
  • The problem is that we have chosen arbitrary
    exclusion restrictions that lack any intuitive or
    theoretical justification.

68
Treatment effects model
  • In exercise 5c, we estimated the audit fee models
    separately for the Big 6 and non-Big 6 audit
    clients
  • To do this, we use the heckman command

69
Treatment effects model
  • Suppose that we want to estimate one audit fee
    model with Big 6 on the right hand side of the
    equation (i.e., we assume that the X coefficients
    have the same slope in the two equations)

70
Treatment effects model
  • We can estimate this model using the treatreg
    command
  • treatreg lnaf lnta, treat(big6 = lnta lnsales)
    twostep
  • treatreg lnaf lnsales, treat(big6 = lnta
    lnsales) twostep
  • If we don't specify the twostep option we will
    get the ML estimates (sometimes the ML model will
    not converge due to a nonconcave likelihood
    function)
  • treatreg lnaf lnta, treat(big6 = lnta lnsales)
  • treatreg lnaf lnsales, treat(big6 = lnta
    lnsales)
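
In later Stata releases (13 onward, to the best of my knowledge) the treatreg command is provided under the name etregress with the same treat() syntax. A sketch of the two-step calls above in that form, assuming the Fees.dta variables generated in exercise 5c:

```stata
* Endogenous treatment-effects regression, two-step estimates.
etregress lnaf lnta, treat(big6 = lnta lnsales) twostep
etregress lnaf lnsales, treat(big6 = lnta lnsales) twostep
```

Omitting twostep requests the maximum likelihood estimates, as with treatreg.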

71
Treatment effects model
  • The results for both the treatment effects and
    Heckman models can be very sensitive to the model
    specification.
  • For example, the Big 6 fee premium can easily
    flip signs from positive to negative
  • treatreg lnaf lnta, treat(big6 = lnta lnsales)
    twostep
  • treatreg lnaf lnta lnsales, treat(big6 = lnta
    lnsales) twostep
  • Note that there are no exclusion restrictions (Z
    variables) in the second specification since lnta
    and lnsales appear in both the first stage and
    second stage models

72
Exclusion restrictions
  • Francis and Lennox (2009) argue that many
    accounting studies have estimated the Heckman and
    treatment effects models incorrectly
  • It is well recognized (in economics) that
    exogenous Z variables from the first stage choice
    model need to be validly excluded from the second
    stage outcome regression (Little, 1985; Little
    and Rubin, 1987; Manning et al., 1987).
  • Accounting studies have generally failed to (a)
    impose exclusion restrictions, or (b) provide
    compelling grounds for the validity of the
    exclusion restrictions.

73
Exclusion restrictions
  • Of the 38 accounting studies in our survey
  • 4 studies explicitly fail to nominate any Z
    variable (5 studies estimate specifications both
    with and without Z variables; 4 studies fail to
    disclose whether they include a Z variable).
  • Only 2 studies provide a rationale for including
    the Z variable in the first stage model and
    excluding it from the second stage.
  • Only 2 studies report robustness tests using
    alternative exclusion restrictions (i.e.,
    alternative Z variables).

74
Exclusion restrictions
  • Economists recognize that it is important to
    justify why the Zs can be validly excluded from
    the Y model.
  • For example, Angrist (1990) examines how military
    service affects the earnings of veteran soldiers
    after they are discharged from the army.
  • This involves a selection issue because
    individuals join the military if they have poor
    wage offers in other types of job.
  • Angrist (1990) tackles the selectivity issue
    using data from the Vietnam era, when military
    service was partly determined by a draft lottery.

75
Exclusion restrictions
D = military service
Z = random draft lottery
Y = civilian earnings
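
The D, Z, Y structure above can be sketched with a simple Wald (instrumental variables) calculation. This is only an illustration, assuming a binary instrument Z that affects Y solely through D; the function name and the toy numbers are hypothetical and are not taken from Angrist (1990).

```python
from statistics import mean


def wald_iv_estimate(y, d, z):
    """Wald estimate of the effect of D on Y with a binary instrument Z.

    Reduced-form difference in mean Y across Z groups, divided by the
    first-stage difference in mean D across Z groups.
    """
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    d1 = [di for di, zi in zip(d, z) if zi == 1]
    d0 = [di for di, zi in zip(d, z) if zi == 0]

    first_stage = mean(d1) - mean(d0)    # how strongly Z shifts D
    if first_stage == 0:
        raise ValueError("instrument does not shift D; estimate undefined")
    reduced_form = mean(y1) - mean(y0)   # how Z shifts Y (only via D, by assumption)
    return reduced_form / first_stage


# Hypothetical toy data: z = lottery outcome, d = served, y = earnings.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 1, 0, 1, 0, 0, 0]
y = [8.0, 9.0, 7.0, 12.0, 8.0, 11.0, 12.0, 13.0]

estimate = wald_iv_estimate(y, d, z)
print(estimate)
```

The division only makes sense if Z is a powerful predictor of D (a non-trivial first stage) and has no direct effect on Y, which is exactly what the exclusion restriction asserts.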
76
Exclusion restrictions
  • Other examples from economics:
  • Levitt (1997) tests whether additional policing
    results in less crime.
  • Selectivity is an issue because more police are
    hired if crime increases (or if it is expected
    that crime will increase).
  • Levitt uses the electoral cycle as an instrument
    for policing.
  • Angrist and Evans (1998) test whether child
    bearing reduces female participation in the labor
    market.
  • Selectivity is an issue because women are more
    likely to have children rather than enter the
    labor market if their wage offers would be low
    (i.e., lower opportunity cost).
  • They use the sex composition of the first two
    children as an instrument for the decision to
    have a third child.

77
Levitt (1997): Exclusion restriction
D = policing
Z = electoral cycle
Y = crime
78
Angrist and Evans (1998): Exclusion restriction
D = decision to have a third child
Z = sex composition of first two children
Y = female participation in labor market
79
Exclusion restrictions
  • Of the 38 accounting studies in our survey, only
    two attempt to justify why Z has no direct impact
    on Y.
  • Many studies do not report results for the D
    model, so the reader cannot evaluate the power of
    the Z variables for identifying selectivity.
  • At least 9 studies (possibly 13) estimate models
    in which there are no nominated Z variables.

80
Exclusion restrictions
  • When there are no exclusion restrictions,
    identification of the IMR coefficients relies
    solely on the assumed non-linearity of the IMRs
    (i.e., on functional form).
  • The IMRs would then capture any misspecification
    of the functional relation between X and Y (e.g.,
    non-linearity) in addition to any selectivity
    bias.

81
Exclusion restrictions
  • Little (1985): Relying on nonlinearities to
    identify selectivity bias is unappealing
    because it is very difficult to distinguish
    empirically between selectivity and
    misspecification of the model's functional form.
  • Stata manual: Theoretically, one does not need
    such identifying variables, but without them, one
    is depending on functional form to identify the
    model. It would be difficult to take such results
    seriously since the functional-form assumptions
    have no firm basis in theory.
  • A failure to nominate any Z variables can lead to
    serious problems of multicollinearity (Manning et
    al., 1987; Puhani, 2000; Leung and Yu, 2000).
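
The multicollinearity concern can be illustrated numerically. The sketch below assumes a hypothetical first-stage probit with no Z variable, so the index is just g0 + g1 * x, and it uses the standard inverse Mills ratio for selected observations, phi(index) / Phi(index). The coefficient values and the grid of x values are made up for illustration.

```python
import math


def normal_pdf(v):
    """Standard normal density."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)


def normal_cdf(v):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def inverse_mills_ratio(index):
    """IMR for selected observations: phi(index) / Phi(index)."""
    return normal_pdf(index) / normal_cdf(index)


def pearson_corr(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical first stage without exclusion restrictions:
# the only input to the index is the same x that appears in the Y model.
g0, g1 = 0.0, 1.0
x = [-1.0 + 0.02 * i for i in range(101)]   # x on a grid from -1 to 1
imr = [inverse_mills_ratio(g0 + g1 * xi) for xi in x]

corr_x_imr = pearson_corr(x, imr)
print(corr_x_imr)   # strongly negative: the IMR is almost a linear function of x
```

Over this range the IMR is only mildly curved, so it is nearly collinear with x. Adding a valid Z variable to the first stage gives the IMR variation that x alone cannot explain.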

82
Re-examine Chaney et al. (2004)
  • In some respects, their study is fairly typical
    of those in our survey
  • 26 out of 38 papers attempt to control for
    selectivity in a treatment variable.
  • 15 studies rely on the selection model for their
    primary results (even if those results contradict
    the OLS findings).
  • The Chaney et al. study does not include any Z
    variables.

83
Chaney, Jeter and Shivakumar (2004)
D = BIG5 (company hires a Big 5 or non-Big 5
auditor)
Y = audit fees
Z = null set
84
  • OLS models of audit fees

85
  • CJS argue that it is important to allow the slope
    coefficients to differ between Big 5 and Non-Big
    5 clients
  • Without controlling for selectivity, the mean fee
    premiums of Big 5 auditors are
  • Without controlling for selectivity, the mean fee
    premiums of non-Big 5 auditors are
  • What do these results mean?

86
  • The results are very different when the IMRs are
    added as RHS variables
  • After controlling for selectivity, the mean fee
    premiums of Big 5 auditors are
  • After controlling for selectivity, the mean fee
    premiums of non-Big 5 auditors are
  • What do these results mean?

87
Chaney, Jeter and Shivakumar (2004)
  • We want to test whether these results are robust.
  • Company size is the most important determinant of
    both auditor choice and audit fees.
  • We find evidence of multicollinearity problems
    due to the high correlations between company size
    (LTA) and the IMRs.
  • We try alternative specifications of the company
    size variable in the auditor choice and audit fee
    models.

91
Chaney, Jeter and Shivakumar (2004)
  • To ensure a level playing field, we estimate the
    same specifications of the fee models without
    controlling for selectivity:
  • LTA alone
  • LTS alone
  • LTA and LTS
  • The results consistently indicate that Big 5
    clients pay significant fee premiums.

92
Matched propensity scores (MPS)
  • Given the problems with using selection models,
    it would be good to find an alternative or
    complementary approach.
  • The major advantage of MPS is that exclusion
    restrictions and assumptions about functional
    form are unnecessary because the Y model does not
    include the IMRs.
  • Selection is assumed to take place on the
    independent variables in the D model so MPS does
    not control for any selectivity on
    unobservables.
  • MPS is used by only 3 of the 38 accounting
    studies in our survey.

93
Matched propensity scores
  • Steps:
  • Estimate the D model.
  • Obtain the predicted probability that D = 1 for
    each observation in the sample.
  • Match each D = 1 observation to a D = 0
    observation that has the closest predicted
    probability.
  • The above three steps can be done using the
    psmatch2 command in Stata.
  • Estimate the Y model on the matched sample.
  • Compare results to the unmatched sample to
    determine if there is selectivity bias.
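
The matching step above can be sketched as follows. This is a minimal illustration, not a substitute for psmatch2: it assumes the predicted probabilities from the D model are already available, matches with replacement to the single nearest neighbor, and summarizes the matched sample with a simple mean difference rather than a full Y model. All names and numbers are hypothetical.

```python
def match_nearest_neighbor(pscores, treated):
    """Match each D = 1 observation to the D = 0 observation with the
    closest propensity score (with replacement).

    Returns a dict mapping treated index -> matched control index.
    """
    controls = [i for i, d in enumerate(treated) if d == 0]
    if not controls:
        raise ValueError("no D = 0 observations to match against")
    matches = {}
    for i, d in enumerate(treated):
        if d == 1:
            matches[i] = min(controls,
                             key=lambda j: abs(pscores[j] - pscores[i]))
    return matches


def mean_matched_difference(outcome, matches):
    """Average outcome difference between treated units and their matches."""
    diffs = [outcome[t] - outcome[c] for t, c in matches.items()]
    return sum(diffs) / len(diffs)


# Hypothetical toy inputs: D indicator, predicted P(D = 1), and outcome Y.
treated = [1, 1, 0, 0, 0]
pscores = [0.80, 0.55, 0.75, 0.50, 0.20]
outcome = [5.0, 4.0, 4.5, 3.0, 2.0]

matches = match_nearest_neighbor(pscores, treated)
att = mean_matched_difference(outcome, matches)
print(matches, att)
```

Because matching uses only the variables in the D model, this controls for selection on observables only; it says nothing about selection on unobservables.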

97
Conclusions
  • The conclusions of prior studies may be fragile,
    especially when:
  • they attempt to control for selectivity in a
    treatment variable
  • results for the selection model are not
    corroborated by single equation estimates
  • researchers fail to nominate or justify the
    chosen exclusion restrictions.
  • This is true of nearly all studies in our survey!

99
Example: Leuz and Verrecchia (2000)
D = IR97 (international reporting)
Z = ROA, capital intensity, UK/US listing
Y = cost of capital
100
Leuz and Verrecchia (2000)
  • Is it valid to assume that ROA, Capital
    intensity, and UK/US listing have no direct
    effect on the cost of capital?
  • Are these Z variables really exogenous?

102
Leuz and Verrecchia (2000)
  • Are the tests for selectivity bias powerful?
  • Are the results sensitive to functional form?
    (see the free float variable).
  • LV do not report results using OLS.
  • LV do not report whether their results are
    sensitive to alternative model specifications.
  • LV do not report tests for multicollinearity, nor
    do they try the MPS approach.

103
Going forward
  • Researchers need to be aware that Heckman and
    treatment effects models can provide results that
    are extremely fragile. Sensitivity primarily
    affects the RHS variable that is assumed to be
    endogenous (D) and the IMRs.
  • Studies need to discuss:
  • why the Zs are exogenous
  • why the Zs have no direct effect on Y
  • whether the Zs are powerful predictors of D
  • The signs and significance of the IMRs alone do
    not provide compelling evidence as to the
    direction or existence of selectivity bias.
  • Selection studies should routinely report tests
    for multicollinearity problems.
  • Researchers can consider using the MPS
    methodology to determine whether there is
    evidence of selection on observables.

104
Summary
  • When the endogenous regressor is continuous, you
    can control for endogeneity using the ivreg or
    reg3 commands.
  • When the endogenous regressor is binary, you can
    control for endogeneity using the heckman or
    treatreg commands.
  • If you want to control for endogeneity, it is
    vitally important that you have a good
    justification for your chosen exclusion
    restrictions.
  • Choosing arbitrary exclusion restrictions will
    very likely give you garbage results.

105
Concluding comments
  • When writing a paper, you normally follow three
    steps:
  • Step 1: Find a research idea (either before or
    after you get the data)
  • Step 2: Perform the empirical analysis
  • Step 3: Write up the results
  • This course has focused on step 2 but it has also
    touched on step 1.
  • There are opportunities to improve on what prior
    accounting studies have done (e.g., Rock et al.,
    2001; Larcker and Rusticus, 2007; Francis and
    Lennox, 2008).
  • Step 3 is also very important.
  • You should spend lots of time learning how to
    write.
  • Practice is very important, just as it is with
    data analysis and programming.
  • Having a well written paper is crucial for
    publication.
  • Badly written papers are sometimes rejected even
    if the idea is good and the data analysis is well
    done.
  • Well written papers are sometimes accepted even
    if the empirical analysis is poor or the results
    are misinterpreted.