A new architecture for handling multiply imputed data in Stata

Provided by: roryw
Learn more at: http://repec.org
Transcript and Presenter's Notes


1
A new architecture for handling multiply imputed
data in Stata
  • JC Galati (1), JB Carlin (1,2), P Royston (3)
  • (1) Murdoch Childrens Research Institute (MCRI),
    Melbourne
  • (2) The University of Melbourne
  • (3) MRC Clinical Trials Unit, London

2
Missing data
  • Why do we need additional tools for analysing
    datasets with missing values?
  • Traditional methods work with complete datasets
  • Statistical packages discard incomplete
    observations when analysing an incomplete dataset
  • i.e. a complete-case analysis is performed
  • This can lead to loss of power, and possibly to
    biased estimates, depending on why the data went
    missing
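
To make the cost of a complete-case analysis concrete, here is a minimal Python sketch (an illustration only, not part of the Stata tools discussed in this talk). It reuses the six-observation toy data from the storage illustration on slide 13, with None standing in for a missing value of x, and shows how many observations listwise deletion throws away.

```python
# Toy data from the storage illustration (slide 13); None marks a missing x.
rows = [
    {"y": 1.1, "x": 105},
    {"y": 9.2, "x": 106},
    {"y": 1.1, "x": None},
    {"y": 2.3, "x": None},
    {"y": 7.5, "x": 108},
    {"y": 7.9, "x": None},
]


def complete_cases(data):
    """Keep only observations with no missing values (listwise deletion)."""
    return [r for r in data if all(v is not None for v in r.values())]


kept = complete_cases(rows)
print(len(rows), "observations,", len(kept), "complete cases")
```

Half of the observations are dropped even though y is fully observed, which is where the loss of power comes from.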

3
Multiple imputation (MI)
  • Introduced by Donald Rubin (1987 book, Wiley)
  • Based on Bayesian principles
  • Both the data-generating mechanism and the
    missingness mechanism are modelled
  • Fairly broad assumptions about data-generating
    model
  • Fairly restrictive assumptions about missingness
    mechanism
  • Modelling assumptions apply at the imputation
    level
  • Statistical modelling is general (once data is
    imputed)
  • Post-estimation: some more work needs to be
    done
  • Diagnostics: theory and practice not yet worked
    out
  • Model-building: in its infancy, work has started

4
MI data analysis
  • Start with a dataset with some values missing
  • Missing values are imputed multiple times
  • Using a Bayesianly proper imputation method
  • This creates m sets of completed data
  • Each completed dataset is analysed separately
  • Standard complete-data estimation methods are
    used
  • E.g. linear regression, logistic regression

5
Inference (estimation) using MI
  • Coefficient estimates and variances (SEs) from
    complete-data analyses are combined using Rubin's
    rules
  • Parameter estimates
  • Average of the complete-data parameter estimates
  • Variance is the sum of two components
  • Within-imputation variance (average of the
    complete-data variances)
  • Between-imputation variance (determined from
    complete-data parameter estimates)
  • Point estimators divided by SE have approximate t
    distributions
  • Estimate d.f. and use t-multipliers to get
    confidence intervals
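
The combination step above can be sketched in a few lines of Python for a single scalar parameter. This is an illustration under stated assumptions, not the code used by any of the Stata tools. The slide does not spell out the formulas, so the (1 + 1/m) inflation of the between-imputation variance and the degrees-of-freedom expression are taken from the standard statement of Rubin's (1987) rules. The function name rubin_pool and the input numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def rubin_pool(estimates, variances):
    """Combine m complete-data estimates and their variances (squared SEs)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b      # total variance
    # Rubin's degrees of freedom; infinite when the estimates do not vary
    df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2 if b > 0 else float("inf")
    return {"estimate": q_bar, "se": sqrt(t), "df": df, "within": w, "between": b}


# Hypothetical estimates and variances from m = 3 imputed datasets
pooled = rubin_pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled)
```

A confidence interval would then be the estimate plus or minus the SE multiplied by a t quantile with df degrees of freedom. The quantile is left out here because the Python standard library does not provide one.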

6
Background (MI in Stata)
  • What is available in Stata currently?
  • MI Tools, Carlin et al., Stata J. 2003
  • Imputed datasets stored in separate dta files
  • myfile1.dta, ... , myfilem.dta
  • Estimation
  • mifit with
  • regress, logit, probit, clogit, glm,
  • logistic, poisson, svyreg, svylogit,
  • svyprobit, svypoisson, xtgee, xtreg
  • Post estimation
  • milincom, mitestparm
  • Data manipulation
  • miset, miappend, mimerge, mido, misave

7
Background (MI in Stata)
  • Main drawbacks of MI Tools
  • Loose association between original and imputed
    data
  • Loose association between individual imputed
    datasets
  • Limit to range of estimation commands supported
    (13)
  • Choice of coding of some aspects resulted in slow
    execution time in some cases
  • No capacity to perform imputation

8
Background (MI in Stata)
  • What is available in Stata currently? (cont.)
  • ice, micombine (Royston, Stata J. 2004/05)
  • ice stores imputed datasets in a single dta file
  • uses impid and obsid vars
  • Estimation
  • micombine with
  • clogit, cnreg, glm, logistic, logit,
  • poisson, probit, qreg, regress, rreg,
  • xtgee, streg, stcox, ologit, oprobit, mlogit
  • Post estimation
  • results returned in e(b) , e(V) etc.
  • onus on user to know when post-estimation command
    applied directly to combined estimates is valid

9
Background (MI in Stata)
  • ice, micombine (Royston, Stata J. 2004/05) (cont.)
  • Data manipulation
  • left to user, but stacked format facilitates
    simple transformation of variables etc.
  • mijoin, misplit (for conversion between formats)
  • Main drawbacks
  • Limit to range of estimation commands supported
    (16)
  • Manipulation that changes the number of observations
    in each dataset is not easily supported (e.g.
    reshape)
  • Not clear when/if post-estimation is valid

10
mim: A new architecture
  • Main aims
  • To unify two sets of tools into a single
    architecture
  • To combine functionality of both sets of tools
  • To simplify the command syntax
  • To extend the range of estimation commands
    supported
  • Better post-estimation facilities
  • testparm, lincom, predict
  • Make it harder to do crazy things
  • Add other post-estimation commands later

11
mim: A new architecture
  • Scope
  • Creation of imputations is NOT included
  • But easy for users to put imputed datasets into
    mim format
  • Architecture covers analysis and manipulation of
    existing imputed datasets
  • Designed to handle
  • Estimation
  • Data manipulation (reshape, append, merge)
  • Post-estimation (lincom, testparm, predict)
  • Replay (management of estimation results)
  • Utility functions

12
mim: A new architecture
  • Storage of imputed datasets
  • Based on Royston's stacked format
  • Fixed names for impid and obsid vars
  • _mj (impid) and _mi (obsid)
  • no need for
  • dataset characteristics to record the names
  • additional command options to specify the names
  • dedicated set command to manage the
    characteristics
  • stacking requires only generate, append and
    replace
  • Original data stored in the stack
  • _mj = 0

13
mim: A new architecture
  • Storage of datasets: illustration

    _mj  _mi    y        x
    ------------------------
      0    1  1.1      105
      0    2  9.2      106
      0    3  1.1        .
      0    4  2.3        .
      0    5  7.5      108
      0    6  7.9        .
      1    1  1.1      105
      1    2  9.2      106
      1    3  1.1  109.796
      1    4  2.3  110.456
      1    5  7.5      108
      1    6  7.9  102.243
      2    1  1.1      105
      2    2  9.2      106
      2    3  1.1  107.952
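
The stacked layout can be mimicked outside Stata with a short Python sketch. This is only an analogue of the "generate, append and replace" recipe mentioned on the previous slide, not Stata code. The function name stack is hypothetical, and the example uses just the first three observations of the illustration.

```python
def stack(original, imputations):
    """Stack datasets: original rows get _mj = 0, the j-th imputed copy gets _mj = j."""
    stacked_rows = []
    for j, dataset in enumerate([original] + list(imputations)):
        if len(dataset) != len(original):
            raise ValueError("every imputed dataset must hold the same observations")
        for i, row in enumerate(dataset, start=1):
            # _mi numbers the observations 1..n within each dataset
            stacked_rows.append({"_mj": j, "_mi": i, **row})
    return stacked_rows


# First three observations from the illustration; None is a missing x.
original = [{"y": 1.1, "x": 105}, {"y": 9.2, "x": 106}, {"y": 1.1, "x": None}]
imputed_1 = [{"y": 1.1, "x": 105}, {"y": 9.2, "x": 106}, {"y": 1.1, "x": 109.796}]
imputed_2 = [{"y": 1.1, "x": 105}, {"y": 9.2, "x": 106}, {"y": 1.1, "x": 107.952}]

stacked = stack(original, [imputed_1, imputed_2])
for row in stacked:
    print(row)
```

Because the identifiers always have the same names, any later step can find the original data (_mj equal to 0) or a given imputation without extra bookkeeping.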

14
mim: A new architecture
  • Command structure
  • A single command prefix called mim
  • mim processes the multiply-imputed dataset
    currently in memory
  • Typical syntax
  • . mim: command
  • E.g.
  • . use myImputedData, clear
  • . mim: regress y x1 x2 x3
  • . mim: predict yhat
  • . mim: lincom x1x2x3, or

15
mim: A new architecture
  • Commands (cont.)
  • General syntax
  • Default behaviour of mim may be modified through
    mim options
  • mim, mim_options: command
  • mim_options depend on whether one wishes to do
  • estimation
  • data manipulation
  • post-estimation
  • replay

16
Using mim
  • Estimation
  • mim recognises 28 estimation commands
  • regress mean proportion ratio logistic
  • logit ologit mlogit probit oprobit poisson
  • glm binreg blogit clogit cnreg mvreg rreg
  • qreg iqreg sqreg bsqreg stcox streg
  • xtgee xtreg xtlogit xtmixed
  • and 11 svy commands
  • svyregress svymean ... svypoisson
  • Plus, in principle any Stata estimation command
    may be used
  • mim, category(fit): estimation_command

17
Using mim
  • Data manipulation
  • Stacked format allows simple manipulation using
    existing Stata commands
  • . generate, replace, label etc.
  • . by _mj: tabulate ...
  • mim recognises 3 data manipulation commands
  • . mim: reshape cmdline
  • . mim: append using another mim dataset
  • . mim, sort(varlist): merge using ...
  • In principle, any Stata data manipulation command
    may be used with mim
  • mim, category(manip) sort(varlist): manip_command
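
Since the stack is an ordinary dataset, a per-imputation summary is just a group-by on _mj. The Python sketch below is a rough analogue of running a by-group command over _mj in Stata, not the Stata command itself. The rows are copied from the storage illustration (original data plus the first imputation), and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (_mj, _mi, y, x) rows from the storage illustration; None is a missing x.
stacked = [
    (0, 1, 1.1, 105), (0, 2, 9.2, 106), (0, 3, 1.1, None),
    (0, 4, 2.3, None), (0, 5, 7.5, 108), (0, 6, 7.9, None),
    (1, 1, 1.1, 105), (1, 2, 9.2, 106), (1, 3, 1.1, 109.796),
    (1, 4, 2.3, 110.456), (1, 5, 7.5, 108), (1, 6, 7.9, 102.243),
]


def summarise_x_by_imputation(rows):
    """Return {_mj: (non-missing count, mean of x)} for each dataset in the stack."""
    groups = defaultdict(list)
    for mj, _mi, _y, x in rows:
        if x is not None:
            groups[mj].append(x)
    return {mj: (len(xs), mean(xs)) for mj, xs in sorted(groups.items())}


summary = summarise_x_by_imputation(stacked)
for mj, (n, avg) in summary.items():
    print(f"_mj={mj}: n={n}, mean(x)={avg:.3f}")
```

The group with _mj equal to 0 reports only the observed values, while each imputed group reports all six.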

18
Using mim
  • mim recognises some post-estimation commands
  • . mim: lincom cmdline
  • . mim: testparm cmdline
  • . mim: predict xbvar , eq(name)
  • . mim: predict sevar, stdp eq(name)
  • Replay combined estimates
  • . mim
  • Replay individual estimates (jth imputed
    dataset)
  • . mim, j(#)
19
Using mim
  • Interactive example in Stata

20
Final comments
  • Difficulties faced
  • Simplicity of programming versus ease of use and
    flexibility
  • Inconsistencies between commands resulted in more
    tailoring than we'd hoped
  • Progress
  • Coding of version 1 complete
  • Current version is 1.0.3
  • Help file written
  • Has been in beta-testing for several months
  • Submitted for publication in Stata Journal