Automatisation in Stata

1
Automatisation in Stata
  • Jan Hagemejer
  • Joanna Tyrowicz

2
Plan
  • Standard solutions
  • Where do they not work?
  • Usually more than one way to estimate: how to
    choose?
  • Using loops and global macros together
  • Generating the resultssets for atypical
    estimations
  • Difficulties with using bootstrap (and obtaining
    resultssets)
  • Summary comments and some advice

3
The standard route
  • Problem: several estimations of similar form.
  • Need to compare results.
  • Three simple solutions:
  • Solution 1: brute force (sit and type)
  • Solution 2: use parmby/parmest if the estimations are
    on simple categories in the data (limitations of the
    by command); commands developed by Roger Newson
  • Solution 3: use loops (see N. Cox's material from a
    previous SUGM)
  • outreg/outreg2:
  • nicely formatted tables,
  • publication-ready,
  • in many formats, even LaTeX.
  • Note: if you need nice summary statistics, you
    can use outsum, either with by or within loops

4
Where do the problems come from?
  • The 2nd and 3rd solutions work only with
    regression-type estimations
  • However, some procedures are incompatible with
    pre-cooked solutions
  • Examples:
  • Marginal effects
  • In Stata 10: use outreg2 with dprobit/logit
    instead of probit/logit
  • In Stata 11: use outreg2 with margins and/or mfx2
    (remember the replace option)
  • Nice statistics
  • Use tempname and postfile syntax
  • Rolling window on any of this type of analysis
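The tempname and postfile route can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: it uses Stata's bundled auto dataset, and the handle name memhold, the temporary file results and the chosen variables are all hypothetical.

```stata
* Minimal sketch: collect summary statistics into a resultsset with postfile.
* Assumptions: bundled auto data; handle, file and variable names are illustrative.
sysuse auto, clear
tempname memhold
tempfile results
postfile `memhold' str32 varname double(mean sd) long n using `results'
foreach v of varlist price mpg weight {
    quietly summarize `v'
    * one row per variable: name, mean, standard deviation, sample size
    post `memhold' ("`v'") (r(mean)) (r(sd)) (r(N))
}
postclose `memhold'
use `results', clear
list, noobs
```

Each call to post adds one observation, so the same pattern works inside a rolling-window loop: post one row per window instead of one row per variable.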

5
Not everything may be solved this way
  • Reason 1: things are more complex than they seem
    (to come in a sec...)
  • Reason 2: some things are not listed in the
    output
  • Example: various versions of R2 or sample size in
    simple regressions
  • outreg/parmest typically do not include them
  • they can be included as additional locals
  • you need to know what locals they are ->
    solution: the return list family of commands
  • ret li -> results stored in r(), general commands
  • eret li -> results stored in e(), estimation
    commands
  • sret li -> results stored in s(), programming
    commands
  • Practical example
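A short sketch of how the return list family is used to discover and keep statistics. It assumes the bundled auto dataset and an arbitrary regression; the local names are illustrative only.

```stata
* Minimal sketch: inspect r() and e() and copy what is needed into locals.
* Assumptions: bundled auto data; the model and local names are illustrative.
sysuse auto, clear
quietly summarize price
return list                  // r(): results left by general commands
local mean_price = r(mean)
quietly regress price mpg weight
ereturn list                 // e(): results left by estimation commands
local r2   = e(r2)
local r2_a = e(r2_a)
local nobs = e(N)
display "R2 = " %6.4f `r2' ", adj. R2 = " %6.4f `r2_a' ", N = " `nobs'
```

Copying the values into locals straight away matters because the next command of the same class replaces the stored results.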

6
Cookbook for simple problems
  • Run the procedure
  • Check, using the return list family, which
    statistics you need
  • Add locals that should be generated after the
    procedure
  • Add these statistics to the outreg2/parmest commands
  • forvalues no = 1(1)10 {
  • xi: xtreg x y z i.year i.month if g`no'==1, fe
    robust
  • local Between = e(r2_b)
  • local Within = e(r2_w)
  • local No_min = e(g_min)
  • local No_max = e(g_max)
  • local No_avg = e(g_avg)
  • outreg2 using file.xls, bdec(4) title(Title)
    ctitle(`no') append excel addstat(R2 between,
    `Between', R2 within, `Within', No min, `No_min',
    No max, `No_max', No average, `No_avg')
  • }

7
Our problem is different: application to PSM
  • Need to report:
  • output of the procedure
  • sample properties after matching
  • balancing properties of matching
  • Problem 1: actually, none of these is in the
    typical output
  • Problem 2: we need it for many estimations looped
    over many variables, and each one of them takes a
    looooong time

8
Detailed problem description
  • Analyse the effects of privatisation
  • Observe what happens before and after the event
    of privatisation, but time runs
  • E.g. firm A may be one year before privatisation
    in 1999 and firm B in 2006, so event is an
    anchor and time runs both ways.
  • Effects may be observed in many spheres
  • E.g. profits, investments, international
    competitiveness, employment
  • Effects may be due to self-selection
  • E.g. only better firms are privatised, so
    difference in performance is not due to the
    privatisation
  • Effects may be largely due to self-selection
  • Heckman correction will tell about the
    statistical significance but not about the
    economic relevance
  • Propensity score matching is the best solution

9
Detailed problem description
  • Run logistic regression
  • Dependent variable: Y = 1 if the unit participates,
    Y = 0 otherwise.
  • Choose appropriate conditioning (instrumental)
    variables.
  • Obtain the propensity score: predicted probability
    (p) or log[p/(1 - p)].
  • Match each participant to one or more
    nonparticipants on the propensity score
  • Choose an adequate metric
  • Compare outcome variables
  • Example: test equality of means in the treated
    and control groups
  • In PSM obtaining the pscore is irrelevant, but
    matching is key
  • To verify whether matching is OK, you need to run
    some diagnostics
  • Example: compare the balancing properties after
    matching (so-called bias reduction thanks to
    matching)
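The first steps listed above (logit, propensity score, comparison of means) can be sketched with built-in commands only. This is an illustration, not the authors' procedure: the bundled auto dataset stands in for the firm data, foreign plays the role of the treatment dummy, and mpg and weight are arbitrary conditioning variables. The matching step itself is left out because it needs a user-written command such as psmatch2.

```stata
* Minimal sketch: propensity score from a logit, then a raw comparison of means.
* Assumptions: bundled auto data; foreign = "treatment"; covariates are arbitrary.
sysuse auto, clear
quietly logit foreign mpg weight
predict double p, pr                 // predicted probability p
gen double logodds = ln(p/(1 - p))   // alternative score: log[p/(1 - p)]
predict double xb, xb                // linear index, equal to the log odds
summarize p logodds xb
* unmatched comparison of an outcome between the two groups
ttest price, by(foreign) unequal
```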

10
Detailed problem description
  • Thus, in our case:
  • Many time periods (for each time-to-anchor, a
    separate estimation)
  • Many variables (for each variable, separate
    outcomes, but within one period the same
    balancing properties)
  • Two ways of estimating: regular and bootstrapping
    (especially the latter made things complex)
  • Each estimation: roughly 1.5-3.5 hours
  • Over a hundred estimations
  • Additional pitfalls:
  • We needed some statistics for all estimations and
    they were not in the return list
  • More precisely: the procedure computes them to be
    able to produce output, but they were not added
    to the return list by the authors

11
Summary of the problems
  • Our problem was quite specific BUT consisted of
    many general problems
  • Loops take a lot of time: need to find efficient
    ways
  • Some things cannot be obtained fast -> even more
    reason to run them automatically
  • Obtaining datasets of the variables we need
    (so-called resultssets)
  • Getting visible data if they are not in the output
  • Using invisible data
  • Getting around with bootstrap

12
The structure of our estimations

13
Using pscore or psmatch?

14
Using pscore or psmatch?
Event loop
  • Typical psmatch2 syntax:
  • psmatch2 treat treatment_determinants,
    out(outcomes) options
  • Alternative:
  • Estimate the pscore first:
  • pscore treatment treatment_determinants,
    pscore(name)
  • Then run:
  • psmatch2 treatment pscore, out(outcomes) options
  • How to choose?
  • If you want to bootstrap, a pscore estimated once
    will save you time
  • If you want to introduce a data-fitted caliper into
    the options, estimating the pscore first is a must

15
How can global macros be useful?

16
Using global macros for estimations
Event loop
  • Our application: observe the same firms back and
    forth from the moment of privatisation
    (the event)
  • Events happen in different years
  • But we can only match on one dimension: has or
    has not had the event
  • Conceptual solution: use lags and forwards to get
    the time dimension
  • Technical problem: many outcome variables and de
    facto many loops
  • Technical solution: define separately the matching
    variables and the output variables
  • global in "cut remoteness eksporter energia
    obrot klratio roa ros indebtedness wsk_plynnosci
    net_income_efficiency klratio_new roa_new
    indebtedness_new indebtedness_new
    wsk_plynnosci_new"
  • global out "te_new redukcja wzrost_zatr
    share_export lewar s_eff"
  • global outf1 "ff1_te_new ff2_te_new ff3_te_new
    ff4_te_new ff5_te_new ff1_redukcja ff2_redukcja
    ff3_redukcja ff4_redukcja ff5_redukcja
    ff1_wzrost_zatr ff2_wzrost_zatr ff3_wzrost_zatr
    ff4_wzrost_zatr ff5_wzrost_zatr"
  • global outf2 "ff1_share_export ff2_share_export
    ff3_share_export ff4_share_export
    ff5_share_export ff1_lewar ff2_lewar ff3_lewar
    ff4_lewar ff5_lewar ff1_s_eff ff2_s_eff ff3_s_eff
    ff4_s_eff ff5_s_eff"
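Lists such as outf1 and outf2 do not have to be typed by hand. The sketch below builds forward values and the matching global list in a loop. It rests on assumptions that the slides do not confirm: that the ff1_ to ff5_ prefixes denote values 1 to 5 periods ahead, and that the data are a firm-year panel. The toy panel, the firm and year identifiers and the global name outf are hypothetical.

```stata
* Minimal sketch: generate forward values and collect their names in a global.
* Assumptions: ffK_ means "K periods ahead"; toy panel with made-up values.
clear all
set obs 24
gen firm = ceil(_n/8)                  // 3 firms
bysort firm: gen year = 1999 + _n      // 8 years each
set seed 12345
gen te_new = runiform()
gen lewar  = runiform()
xtset firm year

global out "te_new lewar"
global outf ""
foreach v of global out {
    forvalues k = 1/3 {
        gen ff`k'_`v' = F`k'.`v'       // value k periods ahead
        global outf "$outf ff`k'_`v'"
    }
}
display "$outf"
summarize $out $outf
```

Once the lists exist as globals, every later command can refer to $out or $outf instead of repeating the variable names.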

17
The beginning of the estimations so far
Event loop
  • forvalues d = 6(1)18 {
  • use data, clear
  • capture log close
  • capture drop our_pscore* caliper* mean* diff*
    ttest* se_after* se_before* treated nontreated
  • log using priv_caliper_`d', text replace
  • pscore d`d' $in, pscore(our_pscore_`d')
  • ttest our_pscore_`d', by(d`d') unequal
  • capture drop sd_nontreated sd_treated
  • gen sd_nontreated = `r(sd_1)'
  • gen sd_treated = `r(sd_2)'
  • gen caliper_`d' = ((sd_treated^2 + sd_nontreated^2)/
    2)^0.5
  • sum caliper_`d'
  • local c_real = `r(mean)'
  • hist our_pscore_`d', by(d`d')
  • graph export "our_pscore_d`d'.png", replace
  • psmatch2 d`d' our_pscore_`d', out($out $outf1
    $outf2) common add mahalanobis(nace)
    caliper(`c_real')
18
Getting from results to resultssets

19
Why (and what) do we need (in) the resultssets?
  • Why?
  • Most importantly: without resultssets we cannot
  • analyse the changes over time
  • decompose the observed differentials
  • If we do not do it automatically, it would have
    to be copied manually from the logs: many
    estimations, many variables, etc.
  • What? Step 1: find out the reality
  • Size of each of the three groups: treated, total
    and control (matched)
  • Averages in all three groups (medians, etc.)
  • Knowledge of whether they are in fact different
    (test of statistical significance based on the
    difference and the standard error of this difference)
  • What? Step 2: find out how good the findings are
    statistically
  • Balancing properties!

20
Our solution to step 1
Variables loop
  • * store the psmatch2 results first: sum and ttest
    below overwrite r()
  • foreach out in $out $outf1 $outf2 {
  • local se_after_`out' = r(seatt_`out')
  • local diff_after_`out' = r(att_`out')
  • }
  • foreach out in $out $outf1 $outf2 {
  • gen se_after_`out' = `se_after_`out''
  • gen diff_after_`out' = `diff_after_`out''
  • sum `out' if d`d'==0 & _support==1
  • local mean_nontreated = r(mean)
  • gen mean_nontreated_`out' = `mean_nontreated'
  • sum `out' if d`d'==1 & _support==1
  • local mean_treated = r(mean)
  • gen mean_treated_`out' = `mean_treated'
  • ttest `out' if _support==1, by(d`d') unequal
  • local se_before = r(se)
  • gen se_before_`out' = `se_before'
  • local mean_before = r(mu_2)-r(mu_1)
  • gen diff_before_`out' = `mean_before'
  • gen ttest_before_`out' = diff_before_`out'/
    se_before_`out'
  • gen ttest_after_`out' = diff_after_`out'/
    se_after_`out'
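The "before matching" statistics rely on what ttest leaves in r(). A small sketch on the bundled auto dataset (an assumption, as are the local names) shows the difference, its standard error and the implied t statistic, which should agree in absolute value with the t that ttest reports.

```stata
* Minimal sketch: difference, standard error and t statistic from ttest results.
* Assumptions: bundled auto data; foreign = "treatment"; price = outcome.
sysuse auto, clear
ttest price, by(foreign) unequal
* copy r() immediately: the next r-class command replaces these results
local se_before   = r(se)
local diff_before = r(mu_2) - r(mu_1)
local t_reported  = r(t)
local t_before    = `diff_before'/`se_before'
display "diff = " `diff_before' "  se = " `se_before' "  t = " `t_before'
```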

21
Our solution to step 1 - continued
Variables loop
  • foreach type in before after {
  • label var se_`type'_`out' "Standard error of
    difference `type' matching"
  • label var diff_`type'_`out' "Difference `type'
    matching"
  • label var ttest_`type'_`out' "T-test of
    difference"
  • }
  • label var mean_treated_`out' "Mean of treated
    companies"
  • label var mean_nontreated_`out' "Mean of
    non-treated companies (before matching)"
  • }
  • count if d`d'==1 & _support==1
  • local treated = r(N)
  • gen treated = `treated'
  • label var treated "No of treated companies"
  • count if d`d'==0 & _support==1
  • local nontreated = r(N)
  • gen nontreated = `nontreated'
  • label var nontreated "No of control companies"

22
Our solution to step 2
Variables loop
  • pstest $in
  • foreach in in $in {
  • capture local bias_reduction = r(bired_`in')
  • capture local pvalue_bef = r(pbef_`in')
  • capture local pvalue_after = r(paft_`in')
  • capture gen b_red_`in' = `bias_reduction'
  • capture gen pval_bef_`in' = `pvalue_bef'
  • capture gen pval_aft_`in' = `pvalue_after'
  • }
  • outsheet b_red* pval* using stats_priv_`d',
    replace
  • psgraph
  • graph save priv_support_`d', replace
  • graph export priv_support`d'.png, replace
  • drop b_red* pval*

23
Missing statistics

24
Solving the problem of missing statistics
  • Look into the ado file you are using
    (the procedure)
  • Throughout the file, there are commands such as
  • return scalar x = `somelocal'
  • Sometimes, for clarity, scalars are dropped at
    the end of the procedure
  • Your preferred statistic (if it is in the output,
    it has to be at least a local) would simply have
    to have a local like that too
  • If it does not, you can always generate it based
    on your preferences and the available locals
  • -> Modify the original ado file
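The mechanism behind return scalar can be seen in a tiny r-class program. The sketch below is a hypothetical stand-alone program (meandiff is an invented name, not part of any package) that computes two group means and hands them back through r(), in the same way an edited ado file would expose extra statistics.

```stata
* Minimal sketch: an r-class program that returns extra scalars.
* Assumptions: meandiff is a hypothetical program; bundled auto data.
capture program drop meandiff
program define meandiff, rclass
    syntax varname, by(varname)
    tempname m0 m1
    quietly summarize `varlist' if `by' == 0, meanonly
    scalar `m0' = r(mean)
    quietly summarize `varlist' if `by' == 1, meanonly
    scalar `m1' = r(mean)
    * everything the caller should see must be returned explicitly
    return scalar mean0 = `m0'
    return scalar mean1 = `m1'
    return scalar diff  = `m1' - `m0'
end

sysuse auto, clear
meandiff price, by(foreign)
scalar sc_mean0 = r(mean0)
scalar sc_mean1 = r(mean1)
scalar sc_diff  = r(diff)
return list
```

Temporary scalars created with tempname vanish when the program ends, which is why a statistic that is computed but never returned cannot be reached from outside.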

25
Solving the problem of missing statistics: example 1
  • Original ado file, line 380:
  • qui foreach v of varlist `varlist' {
  • replace _`v' = . if _support==0
  • tempname m1t m0t u0u u1u att dif0
  • sum `v' if _treated==1, mean
  • scalar `u1u' = r(mean)
  • sum `v' if _treated==0, mean
  • scalar `u0u' = r(mean)
  • sum `v' if _treated==1 & _support==1, mean
  • scalar `m1t' = r(mean)
  • local n1 = r(N)
  • sum _`v' if _treated==1 & _support==1, mean
  • scalar `m0t' = r(mean)
  • scalar `att' = `m1t' - `m0t'
  • scalar `dif0' = `u1u' - `u0u'
  • return scalar att = `att'
  • return scalar att_`v' = `att'
  • Modified ado file, line 380:
  • qui foreach v of varlist `varlist' {
  • replace _`v' = . if _support==0
  • tempname m1t m0t u0u u1u att dif0
  • /* all the same as earlier, plus */
  • return scalar diff = `dif0'
  • return scalar diff_`v' = `dif0'
  • return scalar mean0 = `u0u'
  • return scalar mean0_`v' = `u0u'
  • return scalar mean1 = `u1u'
  • return scalar mean1_`v' = `u1u'

26
Solving the problem of missing statistics: example 2
  • Original ado file, line 440:
  • return scalar seatt = `stderr'
  • return scalar seatt_`v' = `stderr'
  • qui regress `v' _treated
  • scalar `ols' = _b[_treated]
  • scalar `seols' = _se[_treated]
  • Modified ado file, line 440:
  • return scalar seatt = `stderr'
  • return scalar seatt_`v' = `stderr'
  • qui regress `v' _treated
  • scalar `ols' = _b[_treated]
  • scalar `seols' = _se[_treated]
  • return scalar seols = `seols'
  • return scalar seols_`v' = `seols'

27
Problems with bootstrap

28
Problems with bootstrap
  • Why did we need the bootstrap?
  • After estimation, the s.e.s were relatively large
    (heterogeneous sample)
  • When we tried bootstrapping, the reduction in the
    size of the s.e.s was roughly 50%, while the
    estimators were essentially unaffected
  • What problems with the bootstrap?
  • Need to run it separately for each variable (it
    bootstraps only one standard error at a time)
  • Output is given in a totally different form
  • It takes a looong time
  • New piece of code just for BS standard errors ->
  • new variable loops within each time loop

29
Problems with bootstrap
  • foreach out in $out $outf1 $outf2 {
  • use data, clear
  • sum caliper_`d' /* this is where the initial
    pscore comes in useful */
  • local c_real = r(mean)
  • bootstrap r(att): psmatch2 d`d' our_pscore_`d',
    out(`out') common add mahalanobis(nace)
    caliper(`c_real')
  • matrix mat = e(b), e(se) /* without this, no
    resultssets */
  • mat li mat
  • svmat mat
  • rename mat1 a`d'_diff_after_bs_`out'
  • rename mat2 a`d'_se_after_bs_`out'
  • gen time_of_event = `d'
  • keep se* diff* ttest* mean* time_of_event a*
  • drop if _n>1
  • save priv_bs_`out'`d', replace
  • }
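The same bootstrap-to-resultsset pattern can be tried with built-in commands before running the long job. In this sketch ttest stands in for psmatch2, the bundled auto dataset stands in for the firm data, and reps(50) with a fixed seed is an arbitrary choice to keep it fast; none of this comes from the authors' code.

```stata
* Minimal sketch: bootstrap one statistic and turn e(b), e(se) into a dataset.
* Assumptions: ttest replaces psmatch2; bundled auto data; 50 replications.
sysuse auto, clear
bootstrap diff = (r(mu_2) - r(mu_1)), reps(50) seed(12345) nodots: ///
    ttest price, by(foreign) unequal
matrix mat = e(b), e(se)      // point estimate and bootstrap standard error
matrix list mat
svmat double mat              // creates mat1 and mat2, filled in observation 1
rename mat1 diff_bs
rename mat2 se_bs
keep diff_bs se_bs
drop if _n > 1
list, noobs
```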

30
Final steps
  1. Merge files obtained from bootstrap on event
    (to have a complete resultsset within each
    event period)
  2. Merge bootstrap resultssets with
  3. Append the files for event periods
  4. Organise the data
  5. Produce tables and graphs (again in loops)
  6. Write paper
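Steps 1 and 3 (merge within an event period, then append the periods) can be sketched with toy one-row resultssets. Everything here is hypothetical: the variable names, the numbers and the three event periods. The merge 1:1 syntax assumes Stata 11 or later.

```stata
* Minimal sketch: merge one-row resultssets per period, then append periods.
* Assumptions: toy data; names and values are made up; Stata 11+ merge syntax.
clear all
tempfile combined
local first = 1
forvalues d = 6/8 {
    * resultsset for a first outcome in event period d
    clear
    set obs 1
    gen time_of_event = `d'
    gen diff_after_bs_a = 0.1*`d'
    tempfile part_a
    save `part_a'
    * resultsset for a second outcome in the same event period
    clear
    set obs 1
    gen time_of_event = `d'
    gen diff_after_bs_b = 0.2*`d'
    * step 1: merge within the event period
    merge 1:1 time_of_event using `part_a', nogenerate
    * step 3: append the event periods collected so far
    if `first' == 0 {
        append using `combined'
    }
    save `combined', replace
    local first = 0
}
use `combined', clear
sort time_of_event
list, noobs
```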

31
The resulting graphs (1)
  • There are 6x3 figures altogether

32
The resulting graphs (2)
  • There are 6x2 figures altogether

33
The resulting graphs (3)
  • There are 6x3 figures altogether

34
Some advice we did not take at the right time
  • Save your computer's time (your wasted time is
    your problem)
  • Use sample 10 for testing your procedures:
    saves a lot of time
  • Leaving a mess is not useful if you ever want to
    come back
  • Your memory is shorter-lived than saved files:
    describing do-files really helps
  • Loops are better than copy-paste and less messy
    too
  • Stata is not that complicated: modifying
    ado-files is really easy if you know what you
    want
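The sample 10 advice amounts to developing the do-file on a 10 percent random subsample. A minimal sketch, assuming the bundled auto dataset in place of the real data and an arbitrary seed so the test subsample is reproducible:

```stata
* Minimal sketch: test procedures on a 10 percent subsample.
* Assumptions: bundled auto data; the seed value is arbitrary.
sysuse auto, clear
local n_full = _N
set seed 12345
sample 10                     // keep roughly 10 percent of the observations
display "testing on " _N " of `n_full' observations"
```

Comment out the sample line once the loops run cleanly, then launch the full estimation.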

35
Thank you for your attention!