Title: Statistical, Practical, and Design Issues in Analysis with Missing Data
1Statistical, Practical, and Design Issues in
Analysis with Missing Data
- John Graham
- The Methodology Center
- Penn State University
- American Psychological Association, Toronto,
August 8, 2003
2Acknowledgements
- Joe Schafer
- Scott Hofer
- Patricio Cumsille
- Bonnie Taylor
- Steve West
- NIAAA
- NIDA
- The Hanley Family Foundation
3Presentation in Two Parts
- (1) Introductory Material and Practical Issues
- (2) Planned Missingness Designs
4Recent Papers
- Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.
- Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177.
- Graham, J. W., Cumsille, P. E., & Elek-Fisk, E. (2003). Methods for handling missing data. In J. A. Schinka & W. F. Velicer (Eds.), Research Methods in Psychology (pp. 87-114). Volume 2 of Handbook of Psychology (I. B. Weiner, Editor-in-Chief). New York: John Wiley & Sons.
- http://mcgee.hhdev.psu.edu/publication_resources
5Recent Papers
- Graham, J. W., Taylor, B. J., & Cumsille, P. E. (2001). Planned missing data designs in analysis of change. In L. Collins & A. Sayer (Eds.), New methods for the analysis of change (pp. 335-353). Washington, DC: American Psychological Association.
- Graham, J. W. (2003). Adding missing-data relevant variables to FIML-based structural equation models. Structural Equation Modeling, 10, 80-100.
- Graham, J. W., & Schafer, J. L. (1999). On the performance of multiple imputation for multivariate data with small sample size. In R. Hoyle (Ed.), Statistical Strategies for Small Sample Research (pp. 1-29). Thousand Oaks, CA: Sage.
6Part I: Missing Data: Introductory Material and Practical Issues
7Problem with Missing Data
- Analysis procedures were designed for complete
data. . .
8Solution 1
- Design new procedures
- Missing Data and Parameter Estimation in One Step
- Full Information Maximum Likelihood (FIML): SEM and Other Latent Variable Programs (Amos, Mx, LISREL, Mplus, LTA)
9Solution 2
- Missing data: Multiple Imputation (MI)
- Two Steps
- Step 1: Replace Missing Values with Plausible Values
- Step 2: Analyze Data as if there were No Missing Data
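The two steps can be sketched in Python with NumPy. This is a simplified illustration on simulated data, not the procedure used in the talk: the variable names, the bootstrap-based regression imputation, and the combining of results across imputed data sets (Rubin's rules) are all assumptions added here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): y depends on x, and the true mean of y is 2.0.
n = 2000
x = rng.normal(size=n)
y = 2.0 + x + rng.normal(size=n)

# Make y missing more often when x is high (MAR: depends only on observed x).
y_obs = y.copy()
y_obs[rng.random(n) < 1.0 / (1.0 + np.exp(-x))] = np.nan
observed = ~np.isnan(y_obs)


def impute_once(x, y_obs, rng):
    """Step 1: replace missing y with plausible values (one imputed data set)."""
    obs_idx = np.flatnonzero(~np.isnan(y_obs))
    # Bootstrap the observed cases so each imputation reflects uncertainty
    # in the regression estimates (a rough stand-in for a posterior draw).
    boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
    slope, intercept = np.polyfit(x[boot], y_obs[boot], 1)
    resid = y_obs[boot] - (intercept + slope * x[boot])
    resid_sd = resid.std(ddof=2)
    y_imp = y_obs.copy()
    miss = np.isnan(y_obs)
    # Add random noise so imputed values keep the variability of real data.
    y_imp[miss] = intercept + slope * x[miss] + rng.normal(0.0, resid_sd, miss.sum())
    return y_imp


def pool(estimates, variances):
    """Combine results across imputed data sets (Rubin's rules)."""
    m = len(estimates)
    within = np.mean(variances)
    between = np.var(estimates, ddof=1)
    return np.mean(estimates), within + (1.0 + 1.0 / m) * between


# Step 2: analyze each imputed data set as if it were complete.
m = 20
estimates, variances = [], []
for _ in range(m):
    y_imp = impute_once(x, y_obs, rng)
    estimates.append(y_imp.mean())
    variances.append(y_imp.var(ddof=1) / n)

mi_mean, mi_var = pool(estimates, variances)
cc_mean = y_obs[observed].mean()  # complete-case estimate for comparison
print(f"complete cases: {cc_mean:.2f}, MI: {mi_mean:.2f} (SE {np.sqrt(mi_var):.3f})")
```

In this simulation the complete-case mean should fall below the true value, while the pooled estimate should land close to it. The between-imputation term in `pool` is what keeps the standard error from being too small.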
10FAQ
- Aren't you somehow helping yourself with
imputation?. . .
11NO. Missing data imputation . . .
- does NOT give you something for nothing
- DOES let you make use of all data you have
- . . .
12FAQ
- Is the imputed value what the person would have
given?
13NO. When we impute a value . . .
- We do not impute for the sake of the value itself
- We impute to preserve important characteristics of the whole data set
- . . .
14We want . . .
- unbiased parameter estimation
- e.g., b-weights
- Good estimate of variability
- e.g., standard errors
- best statistical power
15Causes of Missingness
- Ignorable
- MCAR: Missing Completely At Random
- MAR: Missing At Random
- Non-Ignorable
- MNAR: Missing Not At Random
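The three mechanisms can be illustrated with a small simulation. This is a sketch under assumed settings (simulated data, a logistic missingness rule, and hypothetical variable names), intended only to show how the mean of the observed cases behaves under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)             # always observed
y = 2.0 + x + rng.normal(size=n)   # sometimes missing; true mean is 2.0


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


# True = the y value is missing for that case.
missing = {
    "MCAR": rng.random(n) < 0.5,                # unrelated to x or y
    "MAR": rng.random(n) < logistic(x),         # depends only on observed x
    "MNAR": rng.random(n) < logistic(y - 2.0),  # depends on the missing y itself
}

# Mean of y among the cases where y is observed.
cc_means = {name: y[~mask].mean() for name, mask in missing.items()}
for name, value in cc_means.items():
    print(f"{name}: mean of observed y = {value:.2f}")
```

In this setup the MCAR mean should stay near 2.0, while the MAR and MNAR means should both come out too low. Under MAR the cause of missingness (x) is in the data set and available to the analysis; under MNAR it is not.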
16Practical Issues
- How much difference does it make?
- How easy is the "sell"?
- Which is better: FIML or MI?
- "Auxiliary" Variables (Collins, Schafer, & Kam, 2001; Graham, 2003)
- Small sample size (Graham & Schafer, 1999)
- Too many variables
- Automation
17Practical Issues: Biggest problems in multiple imputation
- How do I write my data out of SPSS?
- How can I use MI with ANOVA?
- How do I use MI with SPSS, STATA, SUDAAN, EQS, Mplus?
- Is there a less tedious way?
18Part II: Planned Missingness Designs
19Planned Missingness
- Why would anyone want to plan to have missingness?
- To manage costs, data quality, and statistical power
- In fact, we do it all the time. . .
20Common Sampling Designs
- Random sampling of
- Subjects
- Items
- Goal
- Collect smaller, more manageable amount of data
- Draw reasonable conclusions
21Planned Missingness
- Is sampling items within subjects so hard?
- Trick is to sample all item combinations
22Why NOT Use Planned Missingness?
- Past: Not convenient to do analyses
- Present: Many statistical solutions
- Now is the time to consider design alternatives
23Design Examples
24Lighten Burden I: Longitudinal Measurement
- Problem: Studying Growth
- Participants may grow tired of measurement
- One Solution: sample times for measurement
- See Graham, Taylor, & Cumsille (2001)
25Planned Missingness for Growth Modeling
- Growth modeling increasingly common
- multiple (e.g., 5) measurement waves
- identify intercept, slope, etc.
- predict slope, etc. with other variables
26Example: Preventing College Alcohol Problems
- Alcohol use in first year of college
- Baseline rate: steep onset
- After program . . . shallower onset rate
27Could collect data at all five time points
- Advantages
- easy to analyze
- Disadvantages
- expensive in per-subject costs
- expensive in data quality (taxes subjects)
- Explore missing data designs
28Design 1: all combinations of 1 time missing (17% missing)
- 1 1 1 1 1 57 0 0 0 0 0
- 1 1 1 1 0 57 1 1 1 1 1
- 1 1 1 0 1 57 1 1 1 1 1
- 1 1 0 1 1 57 1 1 1 1 1
- 1 0 1 1 1 57 1 1 1 1 1
- 0 1 1 1 1 57 1 1 1 1 1
- ___
- N = 342
29Design 3: all combinations of 2 times missing (36% missing)
- 1 1 1 1 1 31 0 0 0 0 0
- 1 1 1 0 0 31 0 0 0 0 0
- 1 1 0 1 0 31 0 0 0 0 0
- 1 0 1 1 0 31 0 0 0 0 0
- 0 1 1 1 0 31 1 1 1 1 1
- 1 1 0 0 1 31 1 1 1 1 1
- 1 0 1 0 1 31 1 1 1 1 1
- 0 1 1 0 1 31 1 1 1 1 1
- 1 0 0 1 1 31 1 1 1 1 1
- 0 1 0 1 1 31 1 1 1 1 1
- 0 0 1 1 1 31 1 1 1 1 1
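The percent-missing figures for these designs can be checked with a short calculation. The sketch below assumes five measurement waves, one complete-data pattern, and equal numbers of subjects in every pattern (as in the 57-per-pattern and 31-per-pattern layouts shown); the function names are hypothetical.

```python
from itertools import combinations


def all_combinations_design(n_times, n_missing, include_complete=True):
    """Return measurement patterns (1 = measured, 0 = missing by design)."""
    patterns = []
    if include_complete:
        patterns.append((1,) * n_times)
    for missing in combinations(range(n_times), n_missing):
        patterns.append(tuple(0 if t in missing else 1 for t in range(n_times)))
    return patterns


def percent_missing(patterns):
    """Percent of data points missing, assuming equal n per pattern."""
    cells = sum(len(p) for p in patterns)
    missing = sum(p.count(0) for p in patterns)
    return 100.0 * missing / cells


design_1 = all_combinations_design(n_times=5, n_missing=1)
design_3 = all_combinations_design(n_times=5, n_missing=2)

print(len(design_1), round(percent_missing(design_1)))  # 6 patterns, 17 percent
print(len(design_3), round(percent_missing(design_3)))  # 11 patterns, 36 percent
```

With 57 subjects in each of 6 patterns the total is 342, and with 31 subjects in each of 11 patterns it is 341.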
30Planned Missingness Designs
- all combinations of ___ missing
- Design   # of missing times   % missing data points
- ------   ------------------   ---------------------
- 1        1                    17
- 2        1 & 2                29
- 3        2                    36
- 4        2 & 3                45
- 5        3                    54
31Standard Errors for Various Designs
[Figure: standard errors by number of data points, for Complete Cases Designs and Missing Data Designs]
32Missing data designs
- Often better than complete cases designs
- Always cheaper
- Often acceptable drop in power
33Lighten Burden on Respondents II
- The problem
- 7th graders can answer only 100 questions
- We want to ask 133 questions
- One Solution: The 3-form design
34 3-Form Design
- Student Received Item Set?
- ----------------------------
- X A B C
- Form 1 yes yes yes NO
- Form 2 yes yes NO yes
- Form 3 yes NO yes yes
- Form 4 yes yes yes yes
35 3-Form Design
- Item Sets   X    A    B    C    total
-             34   33   33   33   133
- Form        X    A    B    C    total
- 1           34   33   33   0    100
- 2           34   33   0    33   100
- 3           34   0    33   33   100
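The bookkeeping behind the 3-form design can be verified in a few lines. This sketch uses the item-set sizes and form layout from the tables above; the dictionary and variable names are hypothetical.

```python
from itertools import combinations

# Number of items in each item set.
ITEM_SETS = {"X": 34, "A": 33, "B": 33, "C": 33}

# Item sets included on each form (each form omits one of A, B, C).
FORMS = {
    1: ("X", "A", "B"),
    2: ("X", "A", "C"),
    3: ("X", "B", "C"),
}

total_items = sum(ITEM_SETS.values())
items_per_form = {
    form: sum(ITEM_SETS[s] for s in sets) for form, sets in FORMS.items()
}

# Every pair of item sets should appear together on at least one form,
# so that every pair of items is answered together by some subjects.
pair_coverage = {
    pair: [form for form, sets in FORMS.items() if set(pair) <= set(sets)]
    for pair in combinations(ITEM_SETS, 2)
}

print(total_items)     # 133 questions asked overall
print(items_per_form)  # 100 questions on each form
print(pair_coverage)
```

Each student answers 100 questions, the study collects data on all 133, and every pair of item sets is observed together on at least one form.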
36 3-Form Design Item Order
- Form 1: X A B
- Form 2: X C A
- Form 3: X B C
37 3-Form Design Item Order
- Form 1: X A B C
- Form 2: X C A B
- Form 3: X B C A
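The item orders above follow a simple rotation: X always comes first, and the remaining sets are shifted one position per form, so the set a form would otherwise omit ends up last. A minimal sketch, with a hypothetical helper name:

```python
def rotate_right(items, k):
    """Return a copy of items shifted k positions to the right."""
    k %= len(items)
    return list(items[-k:]) + list(items[:-k]) if k else list(items)


orders = {
    form: ["X"] + rotate_right(["A", "B", "C"], form - 1) for form in (1, 2, 3)
}

for form, order in orders.items():
    print(f"Form {form}: {' '.join(order)}")
```

Dropping the last set from each order gives the three-set versions X A B, X C A, and X B C.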
38 3-Form Design Item Order
- Form 1: X A B C
- Form 2: X C A B
- Form 3: X B C A
- Could pay some subjects to complete extra questions
39 3-Form Design Item Order
- Form 1: X A B C
- Form 2: X C A B
- Form 3: X B C A
- Give questions as shown, measure reasons for non-completion
- poor reading
- low motivation
- "Managed" missingness
40Planned Missingness: Expensive Measures I: Whole Constructs Sometimes Missing
41Research Example
- Recent Prevention Study
- Adolescent Alcohol Prevention Trial
- Two Drug-Abuse Prevention Curricula
- Resistance Training
- Normative Education
42[Figure: N = 1000; N = 3000]
43Expensive Measures II: Items Sometimes Missing from Larger Construct
44Research Examples
- Smoking Research
- less expensive: Self-Reports
- more expensive: CO and Saliva Cotinine
- Alcohol Research
- less expensive: Brief Self-reports
- more expensive: Time Line Follow Back
45Research Examples
- Nutrition Research
- less expensive: Brief Nutrition Survey
- more expensive: Extensive 24-hr Recall
- Survey Research
- less expensive: Brief Mail Survey
- more expensive: Extensive Face-to-Face Interview
46Expensive Measures II
[Figure labels: "Larger N, Less Expensive"; "r = .30"; "Smaller N, More Expensive"]
47Example Study
- r = -.30 (smoking and health)
- Self-report Smoking
- two items
- Biochemical Smoking Measures
- Expired Air CO
- Saliva Cotinine
48Example Study
- $15,050 for Measuring Smoking
- Self-Reports: $7.30 per subject
- CO / Cotinine: $16.78 per subject
- self-reports      bio-chem
- 625 x $7.30   +   625 x $16.78   =  $15,050
- 1200 x $7.30  +   375 x $16.78   ≈  $15,050
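The budget arithmetic can be reproduced directly. The sketch below works in whole cents to avoid floating-point rounding, and it treats the amounts as dollars, which is an assumption; the function names are hypothetical.

```python
SELF_REPORT_CENTS = 730    # 7.30 per subject
BIOCHEM_CENTS = 1678       # 16.78 per subject (CO / cotinine)
BUDGET_CENTS = 1_505_000   # 15,050 total


def total_cost_cents(n_self_report, n_biochem):
    """Cost of measuring n_self_report subjects cheaply and n_biochem expensively."""
    return n_self_report * SELF_REPORT_CENTS + n_biochem * BIOCHEM_CENTS


def max_biochem_n(n_self_report, budget_cents=BUDGET_CENTS):
    """Largest biochemical sample that fits in the budget after self-reports."""
    remaining = budget_cents - n_self_report * SELF_REPORT_CENTS
    return remaining // BIOCHEM_CENTS


for n_sr, n_bio in [(625, 625), (1200, 375)]:
    print(n_sr, n_bio, total_cost_cents(n_sr, n_bio) / 100)

print(max_biochem_n(625))   # complete-data design: everyone gets both measures
print(max_biochem_n(1200))
```

The complete-data design costs exactly 15,050.00. The 1200 / 375 design comes to 15,052.50, marginally over the stated budget, so a strict cap would allow 374 biochemical measurements rather than 375.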
49Standard Errors
[Figure: standard errors by sample size (self-reports)]
50But is that all there is to it?
- What about importance of
- the main analysis?
- secondary analyses?
- Are conclusions the same
- when the main analysis is
- a little more important?
- moderately more important?
- a lot more important?
51Standard Errors
[Figure: standard errors by sample size (self-reports)]
52Which Design is Best? Importance Factor 20
- Sample Size (cheap measure)   Sample Size (expensive measure)   Overall Value
- 625                           625 (complete)                    21.42
- 900                           505 (optimal)                     23.00
- 1500                          217 (extreme)                     20.86
53Which Design is Best? Importance Factor 5
- Sample Size (cheap measure)   Sample Size (expensive measure)   Overall Value
- 625                           625 (complete)                    6.92
- 900                           505 (optimal)                     8.00
- 1500                          217 (extreme)                     8.97
54Conclusions
- Optimal (missing data) design better than complete cases
- Extreme missing data design often best
55Why would anyone ever want to PLAN to have
missing data?
- Easy to analyze
- Cheaper
- Optimal design in most cases
- So . . .
57Recent Papers
- http://mcgee.hhdev.psu.edu/publication_resources
- email jgraham_at_psu.edu