Title: Missing Data:
1. Missing Data: Where has my data gone?
Peter T. Donnan, Professor of Epidemiology and Biostatistics
2. The Tao of Missingness
"The inside and the outside are one" (Zen philosopher)
"Nothing is more real than nothing" (Samuel Beckett)
3. Overview
- Why missing data matters
- Some useful definitions
- Practical issues
- Methods for imputation
4. Missing data is inevitable!
- Trials and observational studies are set up to obtain complete data from everyone
- Multiple reminders for questionnaire data
- Important to distinguish valid unknown, not applicable, lost to follow-up, etc.
- It's not missing, it's unknown!
- Despite investigators' best efforts, missing data is inevitable
- The key is to minimise loss of data in the first place
5. Why does data go missing?
- Poor trial management, lack of follow-up
- Patients have Adverse Events (AE) and drop out
- Patients fail to attend clinic / fill in questionnaire
- Migrate with no information available ("They don't write, they don't call!")
- Leave study for no apparent reason
6. Some real examples of reasons for missing data
- "Emergency Christmas shopping" (reason for missed visit, early November)
- "The drugs will interfere with my drinking" (reason for eligible pt saying No to trial)
- "No, you can't come and see me, I'm better" (pt dropping out at V3)
- Changed address and/or phone number rendered pts untraceable (more frequent in the West)
- Two pts co-operated but refused photographs, one on religious grounds (despite giving consent)
7. Does it matter?
- Missing data can seriously damage a study's credibility
- Two main problems:
- May introduce bias
- Reduces power
8. Note that it is even worse in regression
ID BMI HbA1c LDL Chol HDL
1 35.2 9.1 5.8 0.8
2 26.3 7.0 4.3 1.1
3
4 28.3 11.3 5.4 6.1 0.7
5 8.4 3.9
6 40.7 10.2 4.0
7 30.5 9.3 2.9 4.1 1.0
8 26.1 3.5 5.2
- Pairwise comparisons leave out 38% of the data
- So two-group comparisons are not too bad
- Regression or any other multidimensional analysis leaves out 75% of the data (only IDs 4 and 7 are complete)
- This is COMPLETE-CASE ONLY ANALYSIS
9. Practical Tip 1
- Complete-case analysis is where the missing data problem is ignored
- Patients with missing data are excluded
- This will be obvious from the constructed tables
- The n in the tables reporting the analysis will be less than the N enrolled
- Even worse, the dataset used may differ by outcome, as n may change
10. Practical Tip 1 (continued)
- A useful and informative procedure is to create a table comparing the characteristics of the complete-case dataset and those with missing data, e.g.

Factor      Complete cases   Missing at 8 weeks
Mean age    32               50
Mean BMI    19               28
Male (%)    50               65
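
As a minimal sketch of how such a comparison table could be produced, the snippet below splits patients by whether their 8-week outcome is missing and summarises each group. The records and field names (`age`, `bmi`, `male`, `outcome_8wk`) are hypothetical, chosen only for illustration, and are not the data behind the table above.

```python
from statistics import mean

# Hypothetical records: outcome_8wk is None when the 8-week value is missing.
patients = [
    {"age": 30, "bmi": 18.5, "male": True,  "outcome_8wk": 4.2},
    {"age": 34, "bmi": 19.5, "male": False, "outcome_8wk": 3.9},
    {"age": 48, "bmi": 27.0, "male": True,  "outcome_8wk": None},
    {"age": 52, "bmi": 29.0, "male": True,  "outcome_8wk": None},
]

def summarise(group):
    """Return n, mean age, mean BMI and percentage male for a list of patients."""
    return {
        "n": len(group),
        "mean_age": mean(p["age"] for p in group),
        "mean_bmi": mean(p["bmi"] for p in group),
        "pct_male": 100 * mean(1 if p["male"] else 0 for p in group),
    }

# Split on the missing-data indicator for the outcome.
complete = [p for p in patients if p["outcome_8wk"] is not None]
missing = [p for p in patients if p["outcome_8wk"] is None]

comparison = {"complete": summarise(complete), "missing": summarise(missing)}
for label, stats in comparison.items():
    print(label, stats)
```

Large differences between the two groups suggest that the complete cases are not a random sample of everyone enrolled.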
11. One Solution? Missing-indicator method
- Code all missing as unknown and include the unknown category in the regression model (Mea culpa!)
- Advantage that no subject is excluded
- Difficult to interpret
- Does not deal with the main issue of potential BIAS
- In fact, it will add bias
- Fudge rather than solution
12. Example: Unknown stage (n = 40/476) in Cox PH model for colorectal cancer
HR: Unknown Stage vs. Stage A
N.B. Effects of known stages are now biased
13. Imputation: Another Solution
- Impute missing values and then carry out analysis with the complete dataset
- Advantage that no subject is excluded
- Many methods of estimating the missing values:
- LVCF (LOCF): Last Value Carried Forward
- Mean or median value of measurements
- Expected value based on regression
- Expected value based on E-M algorithm
14. Some notation
- $Y_{obs}$: observed data
- $Y_{miss}$: missing data
- $R$: missing data indicator
- $R = 1$ indicates data observed,
- $R = 0$ missing
- $\Pr(R = 0 \mid Y_{obs})$: probability of missing data given the values of the observed data
15. Some very difficult, opaque, but essential definitions (1)
- Missing Completely at Random (MCAR)
- Prob (Missing) is independent of both
- 1) observed data and
- 2) unobserved data
- Essentially, observed data is a random sample of the full data
- MCAR is what everyone falsely assumes!
- If MCAR is assumed, observed-case or complete-case analysis is valid.
- Observed-case analysis is the software default!
16. Representation of R as a stratification factor for responses

Response indicators      Response vector
R1  R2  R3  R4           Y1  Y2  Y3  Y4
1   1   1   1            y   y   y   y
1   0   1   1            y   .   y   y
1   1   0   0            y   y   .   .

(y = observed response, . = missing response)

For MCAR: $\Pr(R = 0 \mid Y_{obs}, Y_{miss}, X) = \Pr(R = 0 \mid X)$
17. Possible to test for MCAR
- Park-Lee test for MCAR
- Within framework of GEE (Liang and Zeger)
- Define indicator variables for each missing data pattern
- Fit model with indicators as covariates
- Test regression coefficients for indicators; if significant, the missing data mechanism is not MCAR
Park T and Lee S-Y. A test of missing completely at random for longitudinal data with missing observations. Statist Med 1997; 16: 1859-1871
18. Example of Park-Lee test for MCAR
Fit three indicator variables: $I_k = 1$ if missing pattern k, 0 otherwise

Missing data pattern   Wave 1   Wave 2   Wave 3
0                      O        O        O
1                      O        M        O
2                      O        O        M
3                      O        M        M

(O = observed, M = missing)

Covariate   Est/SE
I1          0.65
I2          2.03
I3          3.51

For overall test, p = 0.0023
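
As a rough guide to reading the Est/SE column, each ratio can be compared with a standard normal distribution. The sketch below does this, on the assumption (not stated on the slide) that the ratios are approximately standard normal under MCAR; it does not reproduce the overall test.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a statistic assumed to be standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

# Est/SE ratios for the three missing-pattern indicators on the slide.
est_over_se = {"I1": 0.65, "I2": 2.03, "I3": 3.51}

p_values = {name: two_sided_p(z) for name, z in est_over_se.items()}
for name, p in p_values.items():
    print(f"{name}: Est/SE = {est_over_se[name]:.2f}, approx. p = {p:.4f}")
```

On this reading, patterns 2 and 3 differ from the fully observed pattern at the 5% level while pattern 1 does not, which is in line with the overall test rejecting MCAR.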
19. Examples: MCAR
- Six Cities Air Pollution Study: children changed schools because of parents, so unrelated to health of children
- In a trial, a practice changed computer system, so missing observations were not related to previous observed or future values
20. Practical Tip 2
- Check data for MCAR (note SPSS carries out Little's test)
- If the assumption seems reasonable, analyse using complete cases only with impunity
- If missing data constitutes < 5%, probably reasonable to assume MCAR
- If not, complete-case analysis is likely to be biased
- N.B. MCAR is not that common
21. Another essential definition
- Missing At Random (MAR)
- Prob (Missing) is independent of
- 1) unobserved data but
- 2) dependent on observed data
- Essentially, observed data is a random sample of the full data in each stratum
- MAR is a weaker version of the MCAR assumption
- If MAR is assumed, many methods are possible to impute data using observed data.
22. Missing At Random (MAR)
- Prob (missing) depends on $Y_{obs}$ but not on the missing $Y_{miss}$
- $\Pr(R = 0 \mid Y_{obs}, Y_{miss}, X) = \Pr(R = 0 \mid Y_{obs}, X)$
- MCAR is a special case of MAR
- Use the fact that the missing Y for a person with a given age, gender, BP, chol, BMI, etc. will be similar to that of a person with the same characteristics who does have the outcome
- Allows imputation methods based on observed data, e.g. mean, regression
23. Examples: MAR
- Six Cities Air Pollution Study: children moved out of area because of non-respiratory problems (e.g. type 1 diabetes)
- Men less likely to attend for follow-up visit, but this is not related to the values of their likely outcomes
- Repeated measures where missingness is not related to the values that would have been obtained
24. Single Imputation
ID BMI HbA1c LDL Chol HDL
1 35.2 9.1 5.8 0.8
2 26.3 7.0 4.3 1.1
3
4 28.3 11.3 5.4 6.1 0.7
5 8.4 3.9
6 40.7 10.2 4.0
7 30.5 9.3 2.9 4.1 1.0
8 26.1 3.5 5.2
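- Most common approach is to use the mean of the observed values to impute the missing ones
- Takes no account of differences related to other factors, e.g. HbA1c
- Takes no account of uncertainty in estimating the missing value
- Makes clinicians uneasy!
Imputed BMI for the two patients without a BMI value (IDs 3 and 5): 31.2, the mean of the six observed values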
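
A small sketch of mean imputation using the BMI column from the table, with the missing entries for IDs 3 and 5 coded as `None`. It illustrates the point about uncertainty: the filled-in values sit exactly on the mean, so the spread of the imputed column is artificially reduced.

```python
from statistics import mean, stdev

# BMI column from the slide; None marks the patients with no BMI recorded.
bmi = [35.2, 26.3, None, 28.3, None, 40.7, 30.5, 26.1]

observed = [v for v in bmi if v is not None]
fill = mean(observed)  # single value used for every missing entry
imputed = [fill if v is None else v for v in bmi]

print("imputed value:", round(fill, 1))
print("SD of observed values:", round(stdev(observed), 2))
print("SD after mean imputation:", round(stdev(imputed), 2))
```

The standard deviation after imputation is smaller than that of the observed values, even though no new information has been added.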
25. Single Imputation
- Common method in longitudinal data
- Last Value Carried Forward (LVCF or LOCF)
- Common in RCTs
- Some journals and even FDA endorse
- But statistically unsound unless strong and
unrealistic assumptions met (see LSHTM website)
ID Baseline 4 weeks 8 weeks
1 15 13 13
2 29 32 32
3 43 43
4 32 29 25
5 19 36 26
6 10 10 13
7 31 25 20
8 19 18
Values filled in by LVCF: 43 (ID 3) and 19 (ID 8)
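
A minimal sketch of how LVCF fills in a patient's record. The two rows are hypothetical; the positions of the missing visits are assumptions made for illustration, since the layout of the table above does not show which visit was missed.

```python
def locf(values):
    """Carry the last observed value forward over missing (None) entries.

    Leading missing values stay None because there is nothing to carry forward.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Hypothetical visit records: baseline, 4 weeks, 8 weeks.
print(locf([43, 43, None]))   # final visit missing
print(locf([19, None, 18]))   # middle visit missing
```

The method assumes that nothing changes after the last observed visit, which is why it is hard to defend for conditions that fluctuate over time.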
26. Examples: Single Imputation
- Last Value Carried Forward (LVCF or LOCF) is very common in RCTs
- Adalimumab in severe Crohn's disease: nearly 50% of patients were lost to follow-up at 52 weeks in one trial and LVCF was used (but it is a relapsing-remitting condition!)
- But legitimate use in Bell's Palsy Trial!
- No disagreement among statisticians that the method is unsound
27. Solution is Multiple Imputation!
ID Baseline 4 weeks 8 weeks
1 15 13 12
2 29 32 30
3 35 44
4 32 29 25
5 19 36 26
- Assumes data MAR
- Missing data filled in m times
- The m complete datasets are each analysed using standard procedures
- The results for the m complete datasets are combined for inference
36
ID Baseline 4 weeks 8 weeks
1 15 13 13
2 29 32 32
3 43 43
4 32 29 25
5 19 36 26
ID Baseline 4 weeks 8 weeks
1 15 13 16
2 29 32 25
3 39 28 40
4 32 29 25
5 19 36 26
19
28. Multiple Imputation (MI)
- Process derived by Donald Rubin (1987)
- Replace each missing value with a set of plausible values
- The set of values also represents the uncertainty about the correct value
- Requires MAR assumption but NOT MCAR
- Many methods of estimating imputed values: 1) regression, 2) propensity score, 3) MCMC
29. Step 1: Multiple Imputation Methods
- 1) Regression: missing values predicted by regression model of previous values and covariates
- Fit model $X\beta$ using any variables available (previous values and covariates)
- Repeat if further follow-up results are missing
- Extract predicted value and save new dataset with predicted value inserted
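
The regression approach can be sketched as follows. This is a deliberately simplified version under stated assumptions: one predictor (baseline), a normal residual, and random noise added to each prediction so that the m datasets differ. A proper multiple imputation would also draw the regression coefficients from their posterior distribution; that step is omitted here. The data and function names are hypothetical.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual SD)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, (rss / (len(x) - 2)) ** 0.5

def impute_once(baseline, followup, rng):
    """Fill missing follow-up values with prediction plus random residual noise."""
    obs = [(x, y) for x, y in zip(baseline, followup) if y is not None]
    a, b, sd = fit_line([x for x, _ in obs], [y for _, y in obs])
    return [y if y is not None else a + b * x + rng.gauss(0, sd)
            for x, y in zip(baseline, followup)]

def impute_m_times(baseline, followup, m=5, seed=0):
    """Create m completed copies of the follow-up variable."""
    rng = random.Random(seed)
    return [impute_once(baseline, followup, rng) for _ in range(m)]

# Hypothetical data: follow-up is missing (None) for the third and last patient.
baseline = [15, 29, 43, 32, 19, 10, 31, 19]
followup = [13, 32, None, 25, 26, 13, 20, None]

datasets = impute_m_times(baseline, followup, m=5)
for d in datasets:
    print([round(v, 1) for v in d])
```

Observed values are left untouched in every dataset; only the imputed entries vary, and that variation is what carries the uncertainty into the final analysis.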
30. Missingness Model
- How do I choose what factors to use in predicting imputed values?
- All factors related to outcome (i.e. all Xs)
- Plus, importantly, the outcome
- Any other factors possibly related to the reason for being missing
- Better to be overly inclusive; statistical significance is not important
31. Multiple Imputation: A Cautionary Tale
- Hippisley-Cox et al, BMJ 2007, developed a risk algorithm for CVD called QRISK
- 70% of cholesterol values were missing and imputed using MI, assuming data MAR
- Found NO association between CVD and cholesterol
- Investigation showed they had not used the CVD outcome in the imputation model
- When rectified, the true association was found!
32. Step 1: Multiple Imputation Methods
- 2) Propensity score
- Create indicator variable $R = 0$ for missing
- Fit logistic model $X\beta$ of propensity to be missing ($R = 0$)
- Divide observations by quintiles of propensity score
- Allow random draws (Bayesian bootstrap) of values from observed data in matching quintile to fill in missing data
33. Step 1: Multiple Imputation Methods
- 3) Markov chain Monte Carlo (MCMC)
- Imputation step draws from the conditional distribution of $Y_{miss} \mid Y_{obs}$
- Posterior step simulates posterior mean and covariance matrix
- New estimates used iteratively in imputation step
- Process converges (hopefully)
- Incorporates EM algorithm
34. Step 1: Multiple Imputation
- All available in PROC MI in SAS software, which creates m datasets
- Now available in SPSS v. 17
- Note SPSS carries out Little's test for MCAR
- S-PLUS has some functions
- Stata has a full set of programs for MI
35. How many (m) datasets do I need?
- Too many leads to a data management problem
- Relative efficiency (RE) of using a finite number m of imputations is given by
  $RE = (1 + \gamma / m)^{-1}$
  where $\gamma$ is the fraction of missing information

RE by number of imputations m and fraction of missing information $\gamma$:

m     10%    20%    30%    50%    70%
3     0.97   0.94   0.91   0.86   0.81
5     0.98   0.96   0.94   0.91   0.88
10    0.99   0.98   0.97   0.95   0.93
20    0.99   0.99   0.98   0.98   0.97
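
The relative efficiency formula is easy to evaluate directly. The sketch below computes $RE = (1 + \gamma/m)^{-1}$ over the same grid of m and fraction of missing information as the table; small differences in the second decimal place can arise from rounding.

```python
def relative_efficiency(gamma, m):
    """Relative efficiency of m imputations when the fraction of
    missing information is gamma (a proportion between 0 and 1)."""
    return 1.0 / (1.0 + gamma / m)

fractions = (0.1, 0.2, 0.3, 0.5, 0.7)
print("m    " + "  ".join(f"{int(g * 100)}%".rjust(5) for g in fractions))
for m in (3, 5, 10, 20):
    row = [relative_efficiency(g, m) for g in fractions]
    print(f"{m:<4} " + "  ".join(f"{re:5.2f}" for re in row))
```

Even with half of the information missing, five imputations already give about 91% efficiency, which is why small values of m were traditionally considered enough.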
36. Step 2: Multiple Imputation
- Analyse the now complete datasets in the standard way
- T-test, Regression, Survival, Logistic, GLM, Mixed model, etc.
- Creates a set of parameter estimates for each of the m datasets
37. Step 3: Multiple Imputation
- Combine results from the m datasets
- Standard way is to calculate the mean and variance of the parameter estimate
Let $\bar{U}$ be the within-imputation variance and $B$ the between-imputation variance; then the total variance $T$ is
$$T = \bar{U} + \left(1 + \frac{1}{m}\right) B$$
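
The combination step can be sketched in a few lines. The function below applies Rubin's rules: the pooled estimate is the mean of the m estimates, the within-imputation variance is the mean of the m squared standard errors, and the between-imputation variance is the sample variance of the m estimates. The input numbers are hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine m estimates and their variances using Rubin's rules.

    Returns (pooled estimate, total variance T).
    """
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(variances)        # within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    t = u_bar + (1 + 1 / m) * b    # total variance
    return q_bar, t

# Hypothetical results from m = 3 imputed datasets.
estimates = [2.0, 2.4, 2.2]
variances = [0.10, 0.12, 0.11]

pooled, total_var = pool_rubin(estimates, variances)
print("pooled estimate:", round(pooled, 3))
print("total variance:", round(total_var, 4))
print("pooled SE:", round(total_var ** 0.5, 3))
```

The total variance is larger than the average within-imputation variance because it also reflects the disagreement between the imputed datasets.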
38. Step 3: Multiple Imputation
- Relatively easy, and fortunately SAS has a procedure to implement this, called PROC MIANALYZE
- Good documentation
- SPSS now does this step in v. 17!
- MI now considered gold standard methodology for drawing valid inferences in the face of missing data (with MAR)
- Still many people wary
39. Alternative Solution: Weighting
- Weight observed data to take account of under-representation of certain response profiles
- Does not involve imputation but assumes MAR
- First proposed in sample survey literature
- Relatively easy, as most standard programs allow addition of a weighting factor
- Requires a weight $w_i$ (the probability of being observed), followed by a complete-case analysis weighted by $1/w_i$
40. Alternative Solution: Weighting
- Estimate $w_i = \Pr(R_i = 1 \mid Y_{obs}, X)$, the probability that subject i is observed
- Repeat for multiple time points
- Analyse complete cases weighted by $1/w_i$
- Example: GEE with MAR
- Intuitively appealing, as observed people who are similar to those with missing data are given more weight
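
A minimal sketch of inverse probability weighting. For simplicity the probability of being observed is estimated as the observed proportion within each stratum, rather than from a logistic model as would be usual in practice; the strata and outcome values are hypothetical.

```python
from collections import defaultdict

# Hypothetical data: (stratum, outcome); outcome None means missing.
records = [("A", 10), ("A", 10), ("A", 10), ("A", 10),
           ("B", 20), ("B", 20), ("B", None), ("B", None)]

# Estimate w_i = Pr(observed) as the observed proportion in each stratum.
counts = defaultdict(lambda: [0, 0])  # stratum -> [n observed, n total]
for stratum, y in records:
    counts[stratum][0] += y is not None
    counts[stratum][1] += 1
p_observed = {s: obs / total for s, (obs, total) in counts.items()}

# Complete-case analysis, unweighted and weighted by 1 / w_i.
observed = [(s, y) for s, y in records if y is not None]
cc_mean = sum(y for _, y in observed) / len(observed)
weights = [1 / p_observed[s] for s, _ in observed]
ipw_mean = sum(w * y for w, (_, y) in zip(weights, observed)) / sum(weights)

print("complete-case mean:", round(cc_mean, 2))
print("weighted mean:", round(ipw_mean, 2))
```

Here the unweighted complete-case mean is 13.33, pulled towards stratum A because half of stratum B is missing, whereas the weighted mean is 15.0, the value expected if the missing people in stratum B resemble the observed ones.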
41. Practical Tip 3
- If we assume MAR, the method of MI provides a means of valid inference
- Comprehensive software in SAS and now SPSS
- Other software incorporates it as standard (Stata)
- Consider the weighting method as intuitively appealing
42. Another essential definition
- Missing Not At Random (MNAR)
- Prob (Missing) is dependent on both
- 1) unobserved data and
- 2) observed data
- Often referred to as non-ignorable missingness mechanism or informative missingness
- MNAR is completely unverifiable from the data
- Need to assess the sensitivity of results to different plausible explanations
- All standard methods are NOT valid
- Ongoing area of research in statistical methods
43. Examples: MNAR
- QOL missing in those with low quality of life, so missingness is related to what the QOL would have been
- Measurement of weight loss more likely to be missing if weight loss is likely to be low
44. Missing Not At Random (MNAR)
- One method uses Structural Equation Modelling (SEM)
- Requires specialist software
45. Summary
- Consider hierarchy of missing data: MCAR, MAR, MNAR
- Ideal is to use MI if MAR
- or weighting methods if MAR
- Tools now in SPSS
- Need to model missingness mechanism jointly with analysis of outcome if MNAR
- Complete-case analysis needs to be justified!
- LVCF needs to be justified!
46. Summary
"... it is time to place CC analysis and simple imputation methods, in particular LOCF, in the Museum of Statistical Science ..."
Geert Molenberghs, Editorial, JRSS A 2007; 861-863
47. References
- LSHTM website on missing data, sponsored by ESRC (www.lshtm.ac.uk/missingdata/start.html)
- Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 2006; 59: 1087-91.
- Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009; 338: b2393.
- Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, May M, Brindle P. Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study. BMJ 2007; 335: 136.
- Little RJA, Rubin DB. (1987). Statistical Analysis with Missing Data. John Wiley and Sons, New York.
48. References
- Dempster AP, Laird NM, Rubin DB. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Ser. B 1977; 39: 1-38.
- Rubin DB. (1987). Multiple imputation for nonresponse in surveys. John Wiley & Sons, New York.
- Yuan YC. Multiple imputation for missing data: concepts and new development. SAS Institute Inc (P267-25).
- Software documentation for SAS, S-PLUS and SPSS.
- R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
49. The Tao of Missingness
"There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know."
Donald Rumsfeld