Missing Data: - PowerPoint PPT Presentation

Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics – PowerPoint PPT presentation


Transcript and Presenter's Notes

1
Missing Data: Where has my data gone?
Peter T. Donnan, Professor of Epidemiology and
Biostatistics
2
The Tao of Missingness
"The inside and the outside are one." (Zen
philosopher)
"Nothing is more real than nothing." (Samuel
Beckett)
3
Overview
  • Why missing data matters
  • Some useful definitions
  • Practical issues
  • Methods for imputation

4
Missing data is inevitable!
  • Trials or observational studies are set up to
    obtain complete data from everyone
  • Multiple reminders for questionnaire data
  • Important to distinguish valid unknown, not
    applicable, lost to follow-up, etc
  • It's not missing, it's unknown!
  • Despite investigators' best efforts, missing data
    is inevitable
  • The key is to minimise loss of data in the first
    place

5
Why does data go missing?
  • Poor trial management, lack of follow-up
  • Patients have Adverse Events (AE) and drop out
  • Patients fail to attend clinic / fill in
    questionnaire
  • Migrate with no information available ("They don't
    write, they don't call!")
  • Leave study for no apparent reason

6
Some real examples of reasons for missing data
  • "Emergency Christmas shopping" (reason for missed
    visit, early November)
  • "The drugs will interfere with my drinking"
    (reason for eligible pt saying No to trial)
  • "No, you can't come and see me, I'm better" (pt
    dropping out at V3)
  • Changed address and/or phone number rendered pts
    untraceable (more frequent in the West)
  • Two pts co-operated but refused photographs, one
    on religious grounds (despite giving consent)

7
Does it matter?
  • Missing data can seriously damage a study's
    credibility
  • Two main problems:
  • 1) May introduce bias
  • 2) Reduces power

8
Note that it is even worse in regression
ID BMI HBA1c LDL Chol HDL
1 35.2 9.1 5.8 0.8
2 26.3 7.0 4.3 1.1
3
4 28.3 11.3 5.4 6.1 0.7
5 8.4 3.9
6 40.7 10.2 4.0
7 30.5 9.3 2.9 4.1 1.0
8 26.1 3.5 5.2
  • Pairwise comparisons leave out 38%
  • So two-group comparisons not too bad
  • Regression or any other multidimensional analysis
    leaves out 75% of data

- COMPLETE-CASE ONLY ANALYSIS
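A minimal sketch of why the loss compounds as more variables enter the analysis. The records below are hypothetical (they are not the slide's table), `None` marks a missing value, and `fraction_dropped` is an illustrative helper name.

```python
# Hypothetical patient records; None marks a missing measurement.
rows = [
    {"bmi": 31.0, "hba1c": 8.2, "ldl": 3.1},
    {"bmi": None, "hba1c": 7.4, "ldl": 2.8},
    {"bmi": 27.5, "hba1c": None, "ldl": 3.6},
    {"bmi": 29.9, "hba1c": 9.0, "ldl": None},
    {"bmi": 33.3, "hba1c": 7.9, "ldl": 2.5},
    {"bmi": 25.1, "hba1c": 8.8, "ldl": 3.0},
    {"bmi": None, "hba1c": None, "ldl": None},
    {"bmi": 36.4, "hba1c": 10.1, "ldl": 4.2},
]


def fraction_dropped(rows, variables):
    """Share of rows lost when every listed variable must be observed."""
    kept = [r for r in rows if all(r[v] is not None for v in variables)]
    return 1 - len(kept) / len(rows)


# An analysis of one variable loses only the rows missing that variable.
one_variable = fraction_dropped(rows, ["bmi"])
# A regression using all three variables needs complete rows only.
all_variables = fraction_dropped(rows, ["bmi", "hba1c", "ldl"])

print(f"BMI only: {one_variable:.0%} dropped")
print(f"All three variables: {all_variables:.0%} dropped")
```

In this made-up example a quarter of patients are lost for BMI alone, but half are lost once all three variables are required.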
9
Practical Tip 1
  • Complete Case analysis is where the missing data
    problem is ignored
  • Patients with missing data are excluded
  • This will be obvious from the constructed tables
  • The n in the tables reporting the analysis will
    be less than the N enrolled
  • Even worse the dataset used may differ by outcome
    as n may change

10
Practical Tip 1
  • A useful and informative procedure is to create a
    table comparing the characteristics of the
    complete case dataset and those missing e.g.

Factor      Complete Cases   Missing at 8 weeks
Mean Age    32               50
Mean BMI    19               28
Male (%)    50               65
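A table of this kind could be produced along the following lines. This is only a sketch: the patient records and field names are hypothetical, chosen so the group means are easy to check by hand.

```python
from statistics import mean

# Hypothetical patients; 'missing' flags an absent 8-week outcome.
patients = [
    {"age": 30, "bmi": 18.0, "male": 1, "missing": False},
    {"age": 34, "bmi": 20.0, "male": 0, "missing": False},
    {"age": 32, "bmi": 19.0, "male": 1, "missing": False},
    {"age": 32, "bmi": 19.0, "male": 0, "missing": False},
    {"age": 48, "bmi": 27.0, "male": 1, "missing": True},
    {"age": 52, "bmi": 29.0, "male": 1, "missing": True},
]


def summarise(group):
    """Baseline characteristics of one group of patients."""
    return {
        "n": len(group),
        "mean_age": mean(p["age"] for p in group),
        "mean_bmi": mean(p["bmi"] for p in group),
        "pct_male": 100 * mean(p["male"] for p in group),
    }


complete = summarise([p for p in patients if not p["missing"]])
dropped = summarise([p for p in patients if p["missing"]])

for key in ("n", "mean_age", "mean_bmi", "pct_male"):
    print(f"{key:10} {complete[key]:>8} {dropped[key]:>8}")
```

Large differences between the two columns suggest that those with missing outcomes are not a random subset of the patients enrolled.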
11
One Solution? Missing-indicator method
  • Code all missing as unknown and include unknown
    category in regression model (Mea culpa!)
  • Advantage that no subject excluded
  • Difficult to interpret
  • Does not deal with main issue of potential BIAS
  • In fact, it will add bias
  • Fudge rather than solution

12
Example: Unknown stage (n = 40/476) in Cox PH model
for colorectal cancer
[Figure: HR for Unknown Stage vs. Stage A]
N.B. Effects of known stages are now biased
13
Imputation: Another Solution
  • Impute missing values and then carry out analysis
    with complete dataset
  • Advantage that no subject excluded
  • Many methods of estimating the missing values:
  • LVCF (LOCF): Last Value Carried Forward
  • Mean or median value of measurements
  • Expected value based on regression
  • Expected value based on E-M algorithm

14
Some notation
  • Yobs = observed data
  • Ymiss = missing data
  • R = missing data indicator
  • R = 1 indicates data observed,
  • R = 0 missing
  • Prob(R = 0 | Yobs) = prob of missing data given
    values of observed data

15
Some very difficult, opaque, but essential
definitions (1)
  • Missing Completely at Random (MCAR)
  • Prob (Missing) is independent of both
  • 1) observed data and
  • 2) unobserved data
  • Essentially observed data is a random sample of
    full data
  • MCAR is what everyone falsely assumes!
  • If MCAR is assumed, observed-case or
    complete-case analysis is valid.
  • Observed-case analysis is software default!

16
Representation of R as a stratification factor
for responses
Response Indicators    Response Vector
R1 R2 R3 R4            Y1 Y2 Y3 Y4
1  1  1  1             y  y  y  y
1  0  1  1             y  .  y  y
1  1  0  0             y  y  .  .
(. marks a missing response)
For MCAR: Prob(R = 0 | Yobs, Ymiss, X) = Prob(R = 0 | X)
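As a small illustration of R as a stratification factor, the indicator vector can be derived from each response vector and the patterns counted. The response values are hypothetical and `None` stands for a missing response.

```python
from collections import Counter

# Hypothetical responses over four waves; None marks a missing response.
responses = [
    [5.1, 4.8, 5.0, 4.9],
    [6.2, None, 6.0, 6.1],
    [4.4, 4.6, None, None],
    [5.9, 5.7, 5.8, 5.6],
]


def indicators(y):
    """R_j = 1 if Y_j is observed, 0 if it is missing."""
    return tuple(0 if value is None else 1 for value in y)


# Count how many subjects fall into each missing data pattern.
patterns = Counter(indicators(y) for y in responses)

for pattern, count in patterns.items():
    print(pattern, count)
```

Each distinct pattern defines a stratum of subjects; these patterns are also what a pattern-indicator test for MCAR would use as covariates.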
17
Possible to test for MCAR
  • Park-Lee test for MCAR
  • Within framework of GEE (Liang and Zeger)
  • Define indicator variables for each missing data
    pattern
  • Fit model with indicators as covariates
  • Test regression coefficients for indicators and
    if significant missing data mechanism is not MCAR

Park T and Lee S-Y. A test of missing completely
at random for longitudinal data with missing
observations. Statist Med 1997; 16: 1859-1871.
18
Example of Park-Lee test for MCAR
Fit three indicator variables: Ik = 1 if missing
pattern k, 0 otherwise (O = observed, M = missing)
Missing data pattern   Wave 1   Wave 2   Wave 3
0                      O        O        O
1                      O        M        O
2                      O        O        M
3                      O        M        M
Covariate   Est/SE
I1          0.65
I2          2.03
I3          3.51
For overall test, p = 0.0023
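To read the Est/SE column, each ratio can be converted to an approximate two-sided p-value. This sketch assumes each ratio is referred to a standard normal distribution (a Wald-type test), which the slide does not state; it does not reproduce the overall test, which is a joint test of all three indicators.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# Est/SE ratios for the three pattern indicators, as given on the slide.
ratios = {"I1": 0.65, "I2": 2.03, "I3": 3.51}
p_values = {name: two_sided_p(z) for name, z in ratios.items()}

for name, p in p_values.items():
    print(f"{name}: Est/SE = {ratios[name]:.2f}, p = {p:.4f}")
```

Under that assumption, I2 and I3 would be individually significant at the 5% level while I1 would not, in line with the small overall p-value.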
19
Examples: MCAR
  • Six Cities Air Pollution Study: children changed
    schools because of parents, so unrelated to health
    of children
  • In a trial, a practice changed computer system, so
    missing observations not related to previous
    observed or future values

20
Practical Tip 2
  • Check data for MCAR (note SPSS carries out
    Little's test)
  • If assumption seems reasonable analyse using
    complete-case only with impunity
  • If missing data constitutes < 5%, probably
    reasonable to assume MCAR
  • If not, complete-case analysis is likely to be
    biased
  • N.b. MCAR not that common

21
Another essential definition
  • Missing At Random (MAR)
  • Prob (Missing) is independent of
  • 1) unobserved data but
  • 2) dependent on observed data
  • Essentially observed data is a random sample of
    full data in each stratum
  • MAR is weaker version of MCAR assumption
  • If MAR is assumed, many methods possible to
    impute data using observed data.

22
Missing At Random (MAR)
  • Prob (missing) depends on Yobs but not on missing
    Ymiss
  • Prob(R = 0 | Yobs, Ymiss, X) = Prob(R = 0 | Yobs, X)
  • MCAR is a special case of MAR
  • Use fact that missing Y for a person with same
    age, gender, BP, chol, BMI, etc. will be similar
    to a person with same characteristics who does
    have outcome
  • Allows imputation methods based on observed data
    e.g. mean, regression

23
Examples: MAR
  • Six Cities Air Pollution Study: children moved
    out of area because of non-respiratory problems
    (e.g. type 1 diabetes)
  • Men less likely to attend for follow-up visit but
    not related to values of their likely outcomes
  • Repeated measures where missingness is not
    related to the values that would have been obtained

24
Single Imputation
ID BMI HBA1c LDL Chol HDL
1 35.2 9.1 5.8 0.8
2 26.3 7.0 4.3 1.1
3
4 28.3 11.3 5.4 6.1 0.7
5 8.4 3.9
6 40.7 10.2 4.0
7 30.5 9.3 2.9 4.1 1.0
8 26.1 3.5 5.2
  • Most common approach is to impute the mean of the
    observed values for each missing value
  • Takes no account of differences related to other
    factors e.g. HbA1c
  • Takes no account of uncertainty in estimating
    missing value
  • Makes clinicians uneasy!

(Slide shows the mean BMI, 31.2, imputed for the two missing BMI values)
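A minimal sketch of mean imputation for the BMI column. It assumes that the first value in each row of the table that starts with a BMI-like figure is the BMI, and that IDs 3 and 5 have no BMI recorded; the column alignment is not fully recoverable from the transcript.

```python
from statistics import mean, variance

# BMI for IDs 1 to 8; None marks the two assumed missing values (IDs 3 and 5).
bmi = [35.2, 26.3, None, 28.3, None, 40.7, 30.5, 26.1]

observed = [v for v in bmi if v is not None]
fill = mean(observed)  # single value used for every missing BMI
imputed = [fill if v is None else v for v in bmi]

print(f"Imputed value: {fill:.1f}")
print(f"Sample variance, observed only: {variance(observed):.1f}")
print(f"Sample variance, after imputation: {variance(imputed):.1f}")
```

The imputed value rounds to 31.2, and the variance shrinks after imputation because the filled-in values sit exactly at the mean. This is one way of seeing that single mean imputation understates uncertainty.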
25
Single Imputation
  • Common method in longitudinal data
  • Last Value Carried Forward (LVCF or LOCF)
  • Common in RCTs
  • Some journals and even FDA endorse
  • But statistically unsound unless strong and
    unrealistic assumptions met (see LSHTM website)

ID Baseline 4 weeks 8 weeks
1 15 13 13
2 29 32 32
3 43 43
4 32 29 25
5 19 36 26
6 10 10 13
7 31 25 20
8 19 18
(Slide shows the carried-forward values, 43 and 19)
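LVCF itself is simple to state in code. The sketch below uses hypothetical visit values, with `None` marking a missed visit, and `locf` is an illustrative function name.

```python
def locf(visits):
    """Replace each None with the most recent observed value, if any."""
    filled, last = [], None
    for value in visits:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical scores at baseline, 4 weeks and 8 weeks.
print(locf([43, None, None]))  # both later visits take the baseline value
print(locf([19, 18, None]))    # the 8-week visit takes the 4-week value
print(locf([None, 25, None]))  # nothing to carry forward into the first visit
```

The method assumes a patient's value stays frozen after drop-out, which is the strong assumption the slide warns about.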
26
Examples: Single Imputation
  • Last Value Carried Forward (LVCF or LOCF) very
    common in RCTs
  • Adalimumab in severe Crohn's disease: nearly 50%
    of patients were lost to follow-up at 52 weeks
    in one trial and LVCF used (but
    relapsing-remitting condition!)
  • But legitimate use in Bell's Palsy Trial!
  • No disagreement among statisticians that method
    is unsound

27
Solution is Multiple Imputation!
ID Baseline 4 weeks 8 weeks
1 15 13 12
2 29 32 30
3 35 44
4 32 29 25
5 19 36 26
  1. Assumes data MAR
  2. Missing data filled in m times
  3. The m complete datasets are each analysed by
    using standard procedures
  4. The results for the m complete datasets are
    combined for inference

36
ID Baseline 4 weeks 8 weeks
1 15 13 13
2 29 32 32
3 43   43
4 32 29 25
5 19 36 26
ID Baseline 4 weeks 8 weeks
1 15 13 16
2 29 32 25
3 39 28  40
4 32 29 25
5 19 36 26
19
28
Multiple Imputation (MI)
  • Process derived by Donald Rubin (1987)
  • Replace missing values with a set of plausible
    values that also represents the uncertainty about
    the correct value
  • Requires MAR assumption but NOT MCAR
  • Many methods of estimating imputed values: 1)
    regression, 2) propensity score, 3) MCMC

29
Step 1: Multiple Imputation Methods
  • 1) Regression: missing values predicted by
    regression model of previous values and
    covariates
  • Fit model Xβ using any variables available
    (previous values and covariates)
  • Repeat if further follow-up results missing
  • Extract predicted value and save new dataset with
    predicted value inserted
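The regression approach can be sketched as below, using hypothetical scores and a single predictor (baseline). This is a simplification and an assumption on my part: it adds random residual noise to each prediction so that the m datasets differ, but it does not redraw the regression coefficients for each dataset as a full multiple-imputation procedure would.

```python
import random
from statistics import mean

# Hypothetical scores; None marks a missing 8-week value.
baseline = [15, 29, 43, 32, 19, 10, 31, 19]
week8 = [13, 32, None, 25, 26, 13, 20, None]


def fit_line(x, y):
    """Least-squares intercept, slope and residual standard deviation."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return intercept, slope, (rss / (len(x) - 2)) ** 0.5


def impute(baseline, week8, m=5, seed=0):
    """Return m completed copies of week8 (prediction plus random noise)."""
    rng = random.Random(seed)
    seen = [(x, y) for x, y in zip(baseline, week8) if y is not None]
    a, b, sd = fit_line([x for x, _ in seen], [y for _, y in seen])
    return [
        [y if y is not None else a + b * x + rng.gauss(0, sd)
         for x, y in zip(baseline, week8)]
        for _ in range(m)
    ]


datasets = impute(baseline, week8)
for d in datasets:
    print([round(v, 1) for v in d])
```

Observed values are identical in every dataset; only the two imputed entries vary, and that variation is what carries the uncertainty into the later analysis.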

30
Missingness Model
  • How do I choose what factors to use in predicting
    imputed values?
  • All factors related to outcome (i.e. all Xs)
  • Plus importantly the outcome
  • Any other factors possibly related to the reason
    for being missing
  • Better to be overly inclusive and statistical
    significance not important

31
Multiple Imputation: A Cautionary Tale
  • Hippisley-Cox et al, BMJ 2007 developed a risk
    algorithm for CVD called QRISK
  • 70% of cholesterol values were missing and
    imputed using MI assuming data MAR
  • Found NO association between CVD and cholesterol
  • Investigation showed they had not used CVD
    outcome in the imputation model
  • When rectified, true association found!

32
Step 1: Multiple Imputation Methods
  • 2) Propensity score
  • Create indicator variable R = 0 for missing
  • Fit logistic model Xβ of propensity to be missing
    (R = 0).
  • Divide observations by quintiles of propensity
    score
  • Allow random draws (Bayesian bootstrap) of
    values from observed data in matching quintile to
    fill in missing data

33
Step 1: Multiple Imputation Methods
  • 3) Markov Chain Monte Carlo (MCMC)
  • Imputation draws from conditional distribution of
    Ymiss | Yobs
  • Posterior step simulates posterior mean and
    covariance matrix
  • New estimates used iteratively in imputation step
  • Process converges (hopefully)
  • Incorporates EM algorithm

34
Step 1: Multiple Imputation
  • All available in PROC MI in SAS software and
    creates m number of datasets
  • Now available in SPSS v. 17
  • Note SPSS carries out Little's test for MCAR
  • S-plus some functions
  • Stata has full set of programs for MI

35
How many (m) datasets do I need?
  • Too many leads to data management problem
  • Relative efficiency of using finite m imputations
    is given by RE = (1 + λ/m)^(-1)
  • where λ is fraction of missing information

         λ (%)
m     10     20     30     50     70
3     0.97   0.94   0.91   0.86   0.81
5     0.98   0.96   0.94   0.91   0.88
10    0.99   0.98   0.97   0.95   0.93
20    0.99   0.99   0.98   0.98   0.97
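The table can be checked directly from the relative efficiency formula, RE = 1 / (1 + λ/m), with λ the fraction of missing information. The function name is illustrative.

```python
def relative_efficiency(m, lam):
    """Relative efficiency of m imputations versus infinitely many."""
    return 1 / (1 + lam / m)


fractions = (0.1, 0.2, 0.3, 0.5, 0.7)
table = {
    m: [round(relative_efficiency(m, lam), 2) for lam in fractions]
    for m in (3, 5, 10, 20)
}

for m, row in table.items():
    print(m, row)
```

This reproduces the rows for m = 3, 5 and 10 to two decimal places, and the m = 20 row agrees to within 0.01. Even with 50% missing information, five imputations already give about 91% efficiency.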
36
Step 2: Multiple Imputation
  • Analyse the now complete datasets in standard way
  • T-test, Regression, Survival, Logistic, GLM,
    Mixed model, etc
  • Creates a set of parameter estimates for each of
    m datasets

37
Step 3: Multiple Imputation
  • Combine results from m datasets
  • Standard way is calculate mean and variance of
    parameter estimate

Let Ū be the within-imputation variance and B the
between-imputation variance; then the total
variance T is
T = Ū + (1 + 1/m) B
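The combining step can be sketched as follows, using Rubin's standard rule in which the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The three estimates and their variances are hypothetical.

```python
from statistics import mean, variance


def pool(estimates, variances):
    """Combine m estimates and their variances using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)   # pooled point estimate
    u_bar = mean(variances)   # within-imputation variance
    b = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total


# Hypothetical regression coefficients and squared standard errors
# from m = 3 imputed datasets.
estimate, total_variance = pool([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])

print(f"Pooled estimate: {estimate:.2f}")
print(f"Total variance: {total_variance:.4f}")
print(f"Pooled SE: {total_variance ** 0.5:.3f}")
```

Here the within-imputation variance alone is 0.04, but the total is about 0.093, so ignoring the between-imputation spread would make the standard error too small.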
38
Step 3: Multiple Imputation
  • Relatively easy, but fortunately SAS has a
    procedure to implement this called PROC MIANALYZE
  • Good documentation
  • SPSS now does this step in v.17!
  • MI now considered gold standard methodology for
    drawing valid inferences in the face of missing
    data (with MAR)
  • Still many people wary

39
Alternative Solution: Weighting
  • Weight observed data to take account of
    under-representation of certain response profiles
  • Does not involve imputation but assumes MAR
  • First proposed in sample survey literature
  • Relatively easy as most standard programs allow
    addition of weighting factor
  • Requires weight wi and then complete case
    analysis weighted by 1/wi

40
Alternative Solution: Weighting
  • Estimate wi = Pr(R = 1 | Yobs, X), the probability
    of being observed
  • Repeat for multiple time points
  • Analyse complete cases weighted by 1/wi
  • Example: GEE with MAR
  • Intuitively appealing, as observed people are
    up-weighted to stand in for similar people with
    missing data
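A minimal sketch of inverse probability weighting with hypothetical data. Here wi is taken to be the estimated probability of being observed (R = 1), and missingness is assumed to depend on sex only, so the data are MAR given sex.

```python
from collections import defaultdict

# Hypothetical (sex, outcome) pairs; None marks a missing outcome.
people = [
    ("M", 10.0), ("M", 12.0), ("M", None), ("M", None),
    ("F", 20.0), ("F", 22.0), ("F", 18.0), ("F", 20.0),
]

# Estimate w = Pr(observed) within each stratum of the covariate.
counts = defaultdict(lambda: [0, 0])  # sex -> [observed, total]
for sex, y in people:
    counts[sex][0] += y is not None
    counts[sex][1] += 1
w = {sex: obs / total for sex, (obs, total) in counts.items()}

observed = [(sex, y) for sex, y in people if y is not None]

# Unweighted complete-case mean over-represents women here.
cc_mean = sum(y for _, y in observed) / len(observed)

# Each observed person is weighted by 1/w for their stratum.
ipw_mean = (sum(y / w[sex] for sex, y in observed)
            / sum(1 / w[sex] for sex, _ in observed))

print(f"Complete-case mean: {cc_mean:.1f}")
print(f"Weighted mean: {ipw_mean:.1f}")
```

Each observed man counts twice because only half the men responded, which pulls the estimate back towards what a balanced sample would give.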

41
Practical Tip 3
  • If we assume MAR, method of MI provides means of
    valid inference
  • Comprehensive software in SAS and now SPSS
  • Other software incorporate as standard (Stata)
  • Consider weighting method as intuitively appealing

42
Another essential definition
  • Missing Not At Random (MNAR)
  • Prob (Missing) is dependent on both
  • 1) unobserved data and
  • 2) observed data
  • Often referred to as nonignorable missing
    mechanism or informative missingness
  • MNAR is completely unverifiable from the data
  • Need to assess the sensitivity of results to
    different plausible explanations
  • All standard methods are NOT valid
  • Ongoing area of research in statistical methods

43
Examples: MNAR
  • QOL missing in those with low quality of life and
    so missingness related to what might have been
    QOL
  • Measurement of weight loss more likely to be
    missing if weight loss likely to be low
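A small simulation can illustrate why MNAR is harmful where MCAR is not. All numbers here are assumptions chosen for illustration: quality-of-life scores are drawn from a normal distribution, and under MNAR low scores are much more likely to be missing.

```python
import random

rng = random.Random(42)
n = 20000


def average(values):
    return sum(values) / len(values)


# Hypothetical full QOL scores (never fully seen in practice).
qol = [rng.gauss(50, 10) for _ in range(n)]

# MCAR: every score has the same 30% chance of being missing.
mcar = [y for y in qol if rng.random() > 0.3]

# MNAR: scores below 45 are missing 60% of the time, others 10%.
mnar = [y for y in qol if rng.random() > (0.6 if y < 45 else 0.1)]

bias_mcar = average(mcar) - average(qol)
bias_mnar = average(mnar) - average(qol)

print(f"Bias of observed mean under MCAR: {bias_mcar:+.2f}")
print(f"Bias of observed mean under MNAR: {bias_mnar:+.2f}")
```

Under MCAR the observed mean stays close to the full-data mean, while under this MNAR mechanism it is clearly too high, and nothing in the observed scores alone reveals that.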

44
Missing Not At Random (MNAR)
  • One method uses Structural Equation Modelling
    (SEM)
  • Requires specialist software
  • Often referred to as nonignorable missing
    mechanism or informative missingness
  • MNAR is completely unverifiable from the data
  • Need to assess the sensitivity of results to
    different plausible explanations
  • All standard methods are NOT valid
  • Ongoing area of research in statistical methods

45
Summary
  • Consider hierarchy of missing data
  • MCAR, MAR, MNAR
  • Ideal is to use MI if MAR
  • or Weighting methods if MAR
  • Tools now in SPSS
  • Need to model missingness mechanism jointly with
    analysis of outcome if MNAR
  • Complete case analysis needs to be justified!
  • LVCF needs to be justified!

46
Summary
  • "it is time to place CC analysis and simple
    imputation methods, in particular LOCF, in the
    Museum of Statistical Science..."
  • Geert Molenberghs
  • Editorial, JRSS A, 2007; 861-863

47
References
  • LSHTM website on missing data, sponsored by ESRC
    (www.lshtm.ac.uk/missingdata/start.html)
  • Donders AR, van der Heijden GJ, Stijnen T, Moons
    KG. Review: a gentle introduction to imputation
    of missing values. J Clin Epidemiol 2006; 59:
    1087-91.
  • Sterne JAC, White IR, Carlin JB, Spratt M,
    Royston P, Kenward MG, Wood AM, Carpenter JR.
    Multiple imputation for missing data in
    epidemiological and clinical research: potential
    and pitfalls. BMJ 2009; 338: b2393.
  • Hippisley-Cox J, Coupland C, Vinogradova Y,
    Robson J, May M, Brindle P. Derivation and
    validation of QRISK, a new cardiovascular disease
    risk score for the United Kingdom: prospective
    open cohort study. BMJ 2007; 335: 136.
  • Little RJA and Rubin DB (1987).
    Statistical Analysis with Missing Data. John Wiley
    and Sons, New York.

48
References
  • Dempster AP, Laird NM and Rubin DB. Maximum
    Likelihood from Incomplete Data via the EM
    Algorithm. Journal of the Royal Statistical
    Society, Ser. B 1977; 39: 1-38.
  • Rubin DB (1987). Multiple Imputation for
    Nonresponse in Surveys. John Wiley & Sons, New
    York.
  • Yuan YC. Multiple imputation for missing data:
    concepts and new development. SAS Institute Inc
    (P267-25).
  • Software documentation for SAS, S-PLUS and
    SPSS.
  • R Development Core Team (2005). R: A language and
    environment for statistical computing. R
    Foundation for Statistical Computing, Vienna,
    Austria.

49
The Tao of Missingness
"There are known knowns. These are things we
know that we know. There are known unknowns. That
is to say, there are things that we know we
don't know. But there are also unknown unknowns.
There are things we don't know we don't know."
(Donald Rumsfeld)