Missing Data: Where has my data gone? Peter T. Donnan, Professor of Epidemiology and Biostatistics

1
Missing Data: Where has my data gone?
Peter T. Donnan, Professor of Epidemiology and Biostatistics
2
The Tao of Missingness
"The inside and the outside are one" (Zen philosopher)
"Nothing is more real than nothing" (Samuel Beckett)
3
Overview

Why missing data matters
Some useful definitions
Practical issues
Methods for imputation

4
Missing data is inevitable!

Trials or observational studies are set up to obtain complete data from everyone
Multiple reminders for questionnaire data
Important to distinguish valid unknown, not applicable, lost to follow-up, etc.
It's not missing, it's unknown!
Despite investigators' best efforts, missing data is inevitable
The key is to minimise loss of data in the first place

5
Why does data go missing?

Poor trial management, lack of follow-up
Patients have Adverse Events (AE) and drop out
Patients fail to attend clinic / fill in questionnaire
Migrate with no information available (They don't write, they don't call!)
Leave study for no apparent reason

6
Some real examples of reasons for missing data

Emergency Christmas shopping (reason for missed visit, early November)
"The drugs will interfere with my drinking" (reason for eligible pt saying No to trial)
"No, you can't come and see me, I'm better" (pt dropping out at V3)
Changed address and/or phone number rendered pts untraceable (more frequent in the West)
Two pts co-operated but refused photographs, one on religious grounds (despite giving consent)

7
Does it matter?

Missing data can seriously damage a study's credibility
Two main problems
May introduce bias
Reduces Power

8
Note that even worse in regression
ID BMI HBA1c LDL Chol HDL
1 35.2 9.1 5.8 0.8
2 26.3 7.0 4.3 1.1
3
4 28.3 11.3 5.4 6.1 0.7
5 8.4 3.9
6 40.7 10.2 4.0
7 30.5 9.3 2.9 4.1 1.0
8 26.1 3.5 5.2

Pairwise comparisons leave out 38%
So two-group comparisons not too bad
Regression or any other multidimensional analysis leaves out 75% of data

- COMPLETE-CASE ONLY ANALYSIS
9
Practical Tip 1

Complete Case analysis is where the missing data
problem is ignored
Patients with missing data are excluded
This will be obvious from the constructed tables
The n in the tables reporting the analysis will
be less than the N enrolled
Even worse the dataset used may differ by outcome
as n may change
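
The point about n changing by outcome can be checked with a short script. This is only a sketch: the records and variable names below are hypothetical (they are not the table from the earlier slide), and None marks a missing value.

```python
# Hypothetical records for illustration; None marks a missing value.
records = [
    {"id": 1, "bmi": 35.2, "hba1c": 9.1, "hdl": 0.8},
    {"id": 2, "bmi": 26.3, "hba1c": 7.0, "hdl": 1.1},
    {"id": 3, "bmi": None, "hba1c": None, "hdl": None},
    {"id": 4, "bmi": 28.3, "hba1c": 11.3, "hdl": 0.7},
    {"id": 5, "bmi": None, "hba1c": 8.4, "hdl": None},
    {"id": 6, "bmi": 40.7, "hba1c": 10.2, "hdl": None},
]


def complete_cases(rows, variables):
    """Return the rows with no missing value in any of the named variables."""
    return [row for row in rows if all(row[v] is not None for v in variables)]


N = len(records)  # number enrolled
n_two_vars = len(complete_cases(records, ["bmi", "hba1c"]))
n_three_vars = len(complete_cases(records, ["bmi", "hba1c", "hdl"]))

print(f"Enrolled N = {N}")
print(f"Complete cases for BMI and HbA1c: n = {n_two_vars}")
print(f"Complete cases for BMI, HbA1c and HDL: n = {n_three_vars}")
```

Each extra variable in the model can only shrink the complete-case set, which is why the n reported in different tables of the same study may not agree.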

10
Practical Tip 1

A useful and informative procedure is to create a
table comparing the characteristics of the
complete case dataset and those missing e.g.

Factor      Complete Cases   Missing at 8 weeks
Mean Age    32               50
Mean BMI    19               28
Male (%)    50               65
11
One Solution? Missing-indicator method

Code all missing as unknown and include unknown
category in regression model (Mea culpa!)
Advantage that no subject excluded
Difficult to interpret
Does not deal with main issue of potential BIAS
In fact, it will add bias
Fudge rather than solution

12
Example: Unknown stage (n = 40/476) in Cox PH model for colorectal cancer
HR Unknown Stage vs. Stage A
N.B. Effects of known stages are now biased
13
Imputation Another Solution

Impute missing values and then carry out analysis
with complete dataset
Advantage that no subject excluded
Many methods of estimating the missing values
LVCF (LOCF) Last Value Carried Forward
Mean or median value of measurements
Expected value based on regression
Expected value based on E-M algorithm

14
Some notation

Yobs = observed data
Ymiss = missing data
R = missing data indicator
R = 1 indicates data observed, R = 0 missing
Prob(R = 0 | Yobs) = probability of missing data given values of observed data
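
As a small illustration of this notation, the indicator R can be derived directly from a data matrix. The sketch below assumes missing values are stored as None; the subject labels and measurements are hypothetical.

```python
def response_indicator(values):
    """R = 1 where a value is observed, R = 0 where it is missing (None)."""
    return [0 if value is None else 1 for value in values]


# Hypothetical repeated measurements for three subjects over four waves.
y = {
    "A": [5.1, 4.9, 5.3, 5.0],
    "B": [6.2, None, 6.0, 5.8],
    "C": [4.4, 4.6, None, None],
}

r = {subject: response_indicator(values) for subject, values in y.items()}

# Crude (unconditional) proportion missing at each wave.
n_subjects = len(y)
prop_missing = [
    sum(1 - r[s][wave] for s in r) / n_subjects for wave in range(4)
]

for subject, pattern in r.items():
    print(subject, pattern)
print("Proportion missing by wave:", prop_missing)
```

The conditional probability Prob(R = 0 | Yobs) would need a model (for example a logistic regression of R on observed values); the proportions here are only the unconditional starting point.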

15
Some very difficult, opaque, but essential
definitions (1)

Missing Completely at Random (MCAR)
Prob (Missing) is independent of both
1) observed data and
2) unobserved data
Essentially observed data is a random sample of
full data
MCAR is what everyone falsely assumes!
If MCAR is assumed, observed-case or
complete-case analysis is valid.
Observed-case analysis is software default!

16
Representation of R as a stratification factor
for responses
Response Indicators
R1 R2 R3 R4
1  1  1  1
1  0  1  1
1  1  0  0
Response Vector (- marks a missing response)
Y1 Y2 Y3 Y4
y  y  y  y
y  -  y  y
y  y  -  -
For MCAR: Prob(R = 0 | Yobs, Ymiss, X) = Prob(R = 0 | X)
17
Possible to test for MCAR

Park-Lee test for MCAR
Within framework of GEE (Liang and Zeger)
Define indicator variables for each missing data
pattern
Fit model with indicators as covariates
Test regression coefficients for indicators and
if significant missing data mechanism is not MCAR

Park T and Lee S-Y. A test of missing completely at random for longitudinal data with missing observations. Statist Med 1997; 16: 1859-1871.
18
Example of Park-Lee test for MCAR
Fit three indicator variables: Ik = 1 if missing pattern k, 0 otherwise
(O = observed, M = missing)
Missing data pattern   Wave 1   Wave 2   Wave 3
0                      O        O        O
1                      O        M        O
2                      O        O        M
3                      O        M        M
Covariate   Est/SE
I1          0.65
I2          2.03
I3          3.51
For overall test p = 0.0023
19
Examples: MCAR

Six Cities Air Pollution Study: children changed schools because of parents, so unrelated to health of children
In a trial, a practice changed computer system, so missing observations not related to previous observed or future values

20
Practical Tip 2

Check data for MCAR (note SPSS carries out Little's test)
If assumption seems reasonable, analyse using complete cases only with impunity
If missing data constitutes < 5%, probably reasonable to assume MCAR
If not, complete-case analysis is likely to be biased
N.B. MCAR not that common

21
Another essential definition

Missing At Random (MAR)
Prob (Missing) is independent of
1) unobserved data but
2) dependent on observed data
Essentially observed data is a random sample of
full data in each stratum
MAR is weaker version of MCAR assumption
If MAR is assumed, many methods possible to
impute data using observed data.

22
Missing At Random (MAR)

Prob (missing) depends on Yobs but not on missing
Ymiss
Prob(R = 0 | Yobs, Ymiss, X) = Prob(R = 0 | Yobs, X)
MCAR is a special case of MAR
Use fact that missing Y for a person with same
age, gender, BP, chol, BMI, etc. will be similar
to a person with same characteristics who does
have outcome
Allows imputation methods based on observed data
e.g. mean, regression

23
Examples: MAR

Six Cities Air Pollution Study: children moved out of area because of non-respiratory problems (e.g. type 1 diabetes)
Men less likely to attend for follow-up visit, but this is not related to the values of their likely outcomes
Repeated measures where missingness is not related to the values that would have been obtained

24
Single Imputation
ID BMI HBA1c LDL Chol HDL
1 35.2 9.1 5.8 0.8
2 26.3 7.0 4.3 1.1
3
4 28.3 11.3 5.4 6.1 0.7
5 8.4 3.9
6 40.7 10.2 4.0
7 30.5 9.3 2.9 4.1 1.0
8 26.1 3.5 5.2

Most common approach is to impute each missing value with the mean of the observed values
Takes no account of differences related to other factors, e.g. HbA1c
Takes no account of uncertainty in estimating
missing value
Makes clinicians uneasy!

Imputed value for both missing BMI entries (mean of observed BMI): 31.2
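
The mean-imputation step for the BMI column of the table above can be sketched as follows. The sketch assumes, from the table layout, that BMI is missing for IDs 3 and 5; it also shows one reason the method is criticised, namely that it shrinks the spread of the variable.

```python
from statistics import mean, stdev

# BMI column from the table above; None marks the values assumed missing.
bmi = {1: 35.2, 2: 26.3, 3: None, 4: 28.3, 5: None, 6: 40.7, 7: 30.5, 8: 26.1}

observed = [value for value in bmi.values() if value is not None]
fill_value = round(mean(observed), 1)

# Single imputation: every missing BMI receives the same observed mean.
imputed = {pid: (fill_value if value is None else value) for pid, value in bmi.items()}

sd_before = stdev(observed)
sd_after = stdev(imputed.values())

print("Imputed value:", fill_value)
print("SD of observed values:", round(sd_before, 2))
print("SD after mean imputation:", round(sd_after, 2))
```

Because every imputed subject sits exactly at the mean, the standard deviation after imputation is smaller than in the observed data, so standard errors computed from the filled-in dataset are too optimistic.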
25
Single Imputation

Common method in longitudinal data
Last Value Carried Forward (LVCF or LOCF)
Common in RCTs
Some journals and even FDA endorse
But statistically unsound unless strong and
unrealistic assumptions met (see LSHTM website)

ID Baseline 4 weeks 8 weeks
1 15 13 13
2 29 32 32
3 43 43
4 32 29 25
5 19 36 26
6 10 10 13
7 31 25 20
8 19 18
Values carried forward: 43 and 19
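
A minimal sketch of how LVCF/LOCF fills a longitudinal record is given below. The example series are hypothetical and are not taken from the table above; None marks a missed visit.

```python
def locf(series):
    """Carry the last observed value forward over missing (None) entries.

    Values missing before the first observation stay None, because there
    is nothing to carry forward.
    """
    filled = []
    last_seen = None
    for value in series:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# Hypothetical scores at baseline, 4 weeks and 8 weeks.
print(locf([43, None, None]))
print(locf([19, 18, None]))
print(locf([None, 25, None]))
```

The function makes the hidden assumption explicit: a subject's value is taken to be frozen at the last visit attended, which is hard to defend for a relapsing-remitting condition.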
26
Examples Single Imputation

Last Value Carried Forward (LVCF or LOCF) very common in RCTs
Adalimumab in severe Crohn's disease: nearly 50% of patients were lost to follow-up at 52 weeks in one trial and LVCF used (but relapsing-remitting condition!)
But legitimate use in Bell's Palsy Trial!
No disagreement among statisticians that method
is unsound

27
Solution is Multiple Imputation!
ID Baseline 4 weeks 8 weeks
1 15 13 12
2 29 32 30
3 35 44
4 32 29 25
5 19 36 26

Assumes data MAR
Missing data filled in m times
The m complete datasets are each analysed by
using standard procedures
The results for the m complete datasets are
combined for inference

36
ID Baseline 4 weeks 8 weeks
1 15 13 13
2 29 32 32
3 43 43
4 32 29 25
5 19 36 26
ID Baseline 4 weeks 8 weeks
1 15 13 16
2 29 32 25
3 39 28 40
4 32 29 25
5 19 36 26
19
28
Multiple Imputation (MI)

Process derived by Donald Rubin (1987)
Replace missing values with a set of plausible values that also represents the uncertainty about the correct value
Requires MAR assumption but NOT MCAR
Many methods of estimating imputed values 1)
regression, 2) propensity score, 3) MCMC

29
Step 1 Multiple Imputation Methods

1) Regression: missing values predicted by regression model of previous values and covariates
Fit model Xβ using any variables available (previous values and covariates)
Repeat if further follow-up results missing
Extract predicted value and save new dataset with
predicted value inserted
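
The regression method can be sketched roughly as below, under several stated assumptions: a single predictor (baseline), hypothetical data, and random noise added to each prediction so that the m datasets differ. A full implementation such as PROC MI also draws the regression parameters themselves from their posterior distribution, which this sketch omits.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; also returns the residual SD."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    resid_sd = (sum(res * res for res in residuals) / (n - 2)) ** 0.5
    return intercept, slope, resid_sd


def impute_once(baseline, followup, rng):
    """Fill missing follow-up values with prediction plus random residual noise."""
    pairs = [(x0, y0) for x0, y0 in zip(baseline, followup) if y0 is not None]
    intercept, slope, resid_sd = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [
        y0 if y0 is not None else intercept + slope * x0 + rng.gauss(0.0, resid_sd)
        for x0, y0 in zip(baseline, followup)
    ]


# Hypothetical data: follow-up is missing for the third and last subjects.
baseline = [15, 29, 43, 32, 19, 10, 31, 19]
followup = [13, 30, None, 25, 26, 13, 20, None]

m = 5  # number of imputed datasets
datasets = [impute_once(baseline, followup, random.Random(seed)) for seed in range(m)]
for dataset in datasets:
    print([round(value, 1) for value in dataset])
```

Observed values are identical across the m datasets; only the filled-in entries vary, and that variation is what later carries the imputation uncertainty into the pooled variance.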

30
Missingness Model

How do I choose what factors to use in predicting
imputed values?
All factors related to outcome (i.e. all Xs)
Plus importantly the outcome
Any other factors possibly related to the reason
for being missing
Better to be overly inclusive and statistical
significance not important

31
Multiple Imputation: A Cautionary Tale

Hippisley-Cox et al, BMJ 2007 developed a risk algorithm for CVD called QRISK
70% of cholesterol values were missing and imputed using MI assuming data MAR
Found NO association between CVD and cholesterol
Investigation showed they had not used CVD outcome in the imputation model
When rectified, true association found!

32
Step 1 Multiple Imputation Methods

2) Propensity score
Create indicator variable R = 0 for missing
Fit logistic model Xβ of propensity to be missing (R = 0)
Divide observations by quintiles of propensity score
Allow random draws (Bayesian bootstrap) of values from observed data in matching quintile to fill in missing data
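
The last two steps can be sketched as follows. This is an illustration under assumptions: the propensity scores are taken as already estimated (the numbers are hypothetical, no logistic model is fitted here), and the draw uses the approximate Bayesian bootstrap, a commonly used stand-in for the Bayesian bootstrap named above.

```python
import random


def quintile_of(scores):
    """Assign each observation a quintile index 0-4 by rank of its score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    group = [0] * len(scores)
    for rank, i in enumerate(order):
        group[i] = rank * 5 // len(scores)
    return group


def abb_draw(observed, n_missing, rng):
    """Approximate Bayesian bootstrap: resample the donors, then draw from them."""
    donors = [rng.choice(observed) for _ in observed]
    return [rng.choice(donors) for _ in range(n_missing)]


def impute_by_quintile(scores, y, rng):
    """Fill each missing y with a draw from observed y in the same quintile."""
    group = quintile_of(scores)
    filled = list(y)
    for g in range(5):
        members = [i for i in range(len(y)) if group[i] == g]
        observed = [y[i] for i in members if y[i] is not None]
        missing = [i for i in members if y[i] is None]
        if observed and missing:
            for i, value in zip(missing, abb_draw(observed, len(missing), rng)):
                filled[i] = value
    return filled


# Hypothetical propensity scores (ascending) and outcomes; None = missing.
scores = [i / 16 for i in range(1, 16)]
y = [10, 12, None, 14, None, 15, 17, 18, None, 20, 21, None, None, 24, 26]

result = impute_by_quintile(scores, y, random.Random(1))
print(result)
```

Repeating the call with m different random seeds would give the m imputed datasets. A quintile with no observed donors is left unfilled in this sketch.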

33
Step 1 Multiple Imputation Methods

3) Markov Chain Monte Carlo (MCMC)
Imputation draws from the conditional distribution of Ymiss | Yobs
Posterior step simulates posterior mean and
covariance matrix
New estimates used iteratively in imputation step
Process converges (hopefully)
Incorporates EM algorithm

34
Step 1 Multiple Imputation

All available in PROC MI in SAS software and
creates m number of datasets
Now available in SPSS v. 17
Note SPSS carries out Little's test for MCAR
S-plus some functions
Stata has full set of programs for MI

35
How many (m) datasets do I need?

Too many leads to data management problem
Relative efficiency of using finite m imputations is given by
RE = (1 + γ/m)^-1
where γ is fraction of missing information

Relative efficiency (RE) by number of imputations m and fraction of missing information γ
m     γ=10%   γ=20%   γ=30%   γ=50%   γ=70%
3     0.97    0.94    0.91    0.86    0.81
5     0.98    0.96    0.94    0.91    0.88
10    0.99    0.98    0.97    0.95    0.93
20    0.99    0.99    0.98    0.98    0.97
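
The relative efficiency formula is easy to evaluate for any m and fraction of missing information. The function name below is an illustrative choice; the formula is the one stated on this slide.

```python
def relative_efficiency(gamma, m):
    """Relative efficiency of m imputations versus infinitely many.

    gamma is the fraction of missing information (0 to 1).
    """
    return 1.0 / (1.0 + gamma / m)


fractions = (0.1, 0.2, 0.3, 0.5, 0.7)
for m in (3, 5, 10, 20):
    row = "  ".join(f"{relative_efficiency(gamma, m):.3f}" for gamma in fractions)
    print(f"m={m:>2}  {row}")
```

Even with 50% missing information, five imputations already reach about 91% efficiency, which is why small values of m were traditionally considered adequate.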
36
Step 2 Multiple Imputation

Analyse the now complete datasets in standard way
T-test, Regression, Survival, Logistic, GLM,
Mixed model, etc
Creates a set of parameter estimates for each of
m datasets

37
Step 3 Multiple Imputation

Combine results from m datasets
Standard way is calculate mean and variance of
parameter estimate

Let Ū be the within-imputation variance (the mean of the m variance estimates) and B the between-imputation variance of the m parameter estimates; then the total variance T is
T = Ū + (1 + 1/m) B
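
The combining step (Rubin's rules) can be sketched in a few lines. The estimates and variances below are hypothetical stand-ins for the output of Step 2; the pooled estimate is the mean of the m estimates and the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
from statistics import mean, variance


def pool(estimates, variances):
    """Combine m estimates and their variances using Rubin's rules.

    Returns (pooled estimate, within variance, between variance, total variance).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)      # average of the m squared standard errors
    between = variance(estimates)  # sample variance of the estimates (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, within, between, total


# Hypothetical regression coefficients and squared standard errors from m = 5 datasets.
estimates = [1.92, 2.10, 2.05, 1.88, 2.15]
variances = [0.040, 0.038, 0.042, 0.041, 0.039]

pooled, within, between, total = pool(estimates, variances)
print(f"Pooled estimate: {pooled:.3f}")
print(f"Within: {within:.4f}  Between: {between:.4f}  Total: {total:.4f}")
print(f"Pooled standard error: {total ** 0.5:.3f}")
```

The between-imputation term is what single imputation leaves out: it is zero when only one filled-in dataset is analysed, so the uncertainty about the missing values never reaches the standard error.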
38
Step 3 Multiple Imputation

Relatively easy, but fortunately SAS has a
procedure to implement this called PROC MIANALYZE
Good documentation
SPSS now does this step in v.17!
MI now considered gold standard methodology for
drawing valid inferences in the face of missing
data (with MAR)
Still many people wary

39
Alternative Solution Weighting

Weight observed data to take account of
under-representation of certain response profiles
Does not involve imputation but assumes MAR
First proposed in sample survey literature
Relatively easy as most standard programs allow
addition of weighting factor
Requires weight wi and then complete case
analysis weighted by 1/wi

40
Alternative Solution Weighting

Estimate wi = Prob(R = 1 | Yobs, X), the probability of being observed
Repeat for multiple time points
Analyse complete cases weighted by 1/wi
Example: GEE with MAR
Intuitively appealing: observed subjects who resemble those with missing data are given more weight
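
A minimal sketch of the weighting idea, assuming the probability of being observed depends only on one categorical covariate (sex) and is estimated by the observed proportion in each stratum. The data are hypothetical; in practice the probability would usually come from a logistic model with several covariates.

```python
from collections import defaultdict

# Hypothetical (sex, outcome) pairs; None marks a missing outcome.
data = [
    ("F", 10.0), ("F", 10.0), ("F", 12.0), ("F", 12.0),
    ("M", 20.0), ("M", None), ("M", 22.0), ("M", None),
]


def observation_probabilities(rows):
    """Estimate the probability of being observed within each stratum."""
    total = defaultdict(int)
    seen = defaultdict(int)
    for stratum, outcome in rows:
        total[stratum] += 1
        if outcome is not None:
            seen[stratum] += 1
    return {stratum: seen[stratum] / total[stratum] for stratum in total}


def ipw_mean(rows):
    """Mean of the observed outcomes, each weighted by 1 / Prob(observed)."""
    prob = observation_probabilities(rows)
    numerator = denominator = 0.0
    for stratum, outcome in rows:
        if outcome is not None:
            weight = 1.0 / prob[stratum]
            numerator += weight * outcome
            denominator += weight
    return numerator / denominator


observed = [outcome for _, outcome in data if outcome is not None]
complete_case_mean = sum(observed) / len(observed)
weighted_mean = ipw_mean(data)

print("Complete-case mean:", round(complete_case_mean, 2))
print("Weighted mean:", round(weighted_mean, 2))
```

Here half of the men are missing, so each observed man counts twice; the weighted mean moves back towards what the full sample would have given under MAR.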

41
Practical Tip 3

If we assume MAR, method of MI provides means of
valid inference
Comprehensive software in SAS and now SPSS
Other software incorporate as standard (Stata)
Consider weighting method as intuitively appealing

42
Another essential definition

Missing Not At Random (MNAR)
Prob (Missing) is dependent on both
1) unobserved data and
2) observed data
Often referred to as nonignorable missing
mechanism or informative missingness
MNAR is completely unverifiable from the data
Need to assess the sensitivity of results to
different plausible explanations
All standard methods are NOT valid
Ongoing area of research in statistical methods
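
One simple way to carry out such a sensitivity analysis is to shift the values imputed under MAR by a range of offsets and see how far the conclusion moves. This "delta adjustment" is an illustrative choice and is not taken from the slides; all numbers below are hypothetical.

```python
def delta_adjusted_mean(observed, imputed, delta):
    """Mean outcome after shifting every imputed value by delta.

    delta = 0 reproduces the MAR analysis; a non-zero delta encodes the
    assumption that missing subjects differ systematically from similar
    observed subjects.
    """
    shifted = [value + delta for value in imputed]
    combined = list(observed) + shifted
    return sum(combined) / len(combined)


# Hypothetical observed outcomes and values imputed under MAR.
observed = [10.0, 12.0]
imputed_under_mar = [11.0, 13.0]

results = {
    delta: delta_adjusted_mean(observed, imputed_under_mar, delta)
    for delta in (0.0, -2.0, -4.0)
}
for delta, value in results.items():
    print(f"delta = {delta:+.1f}: mean = {value:.2f}")
```

If the substantive conclusion survives across the range of offsets judged clinically plausible, the result is less dependent on the untestable MAR assumption.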

43
Examples: MNAR

QOL missing in those with low quality of life and
so missingness related to what might have been
QOL
Measurement of weight loss more likely to be
missing if weight loss likely to be low

44
Missing Not At Random (MNAR)

One method uses Structural Equation Modelling
(SEM)
Requires specialist software
Often referred to as nonignorable missing
mechanism or informative missingness
MNAR is completely unverifiable from the data
Need to assess the sensitivity of results to
different plausible explanations
All standard methods are NOT valid
Ongoing area of research in statistical methods

45
Summary

Consider hierarchy of missing data
MCAR, MAR, MNAR
Ideal is to use MI if MAR
or Weighting methods if MAR
Tools now in SPSS
Need to model missingness mechanism jointly with
analysis of outcome if MNAR
Complete case analysis needs to be justified!
LVCF needs to be justified!

46
Summary

"it is time to place CC analysis and simple imputation methods, in particular LOCF, in the Museum of Statistical Science."
Geert Molenberghs
Editorial, JRSS A, 2007: 861-863

47
References

LSHTM website on missing data, sponsored by ESRC (www.lshtm.ac.uk/missingdata/start.html)
Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 2006; 59: 1087-91.
Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009; 338: b2393.
Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, May M, Brindle P. Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study. BMJ 2007; 335: 136.
Little, Roderick JA and Rubin, Donald B. (1987). Statistical Analysis with Missing Data. John Wiley and Sons, New York.

48
References

Dempster AP, Laird NM and Rubin DB. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B 1977; 39: 1-38.
Rubin DB. (1987). Multiple imputation for nonresponse in surveys. John Wiley & Sons, New York.
Yuan YC. Multiple imputation for missing data: concepts and new development. SAS Institute Inc (P267-25).
Software documentation for SAS, S-PLUS and SPSS.
R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

49
The Tao of Missingness
"There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know."
Donald Rumsfeld
