NHANES Design and Analysis 1999-2006 - PowerPoint PPT Presentation

Provided by: chiayi
Transcript and Presenter's Notes

1
NHANES Design and Analysis 1999-2006
  • Lester R. Curtin
  • lrc2_at_cdc.gov

2
NHANES Analysis Clichés
  • One Size does NOT fit all
  • Every answer to a statistical question should
    start with "It depends"
  • Stuff happens

3
Today's Agenda
  • Survey Design and Analysis
  • Multi-stage National Household Designs
  • Design versus Model based estimation
  • NHANES Design
  • Sample Design
  • Sample Weights
  • Specific Analytic Topics
  • Rare events
  • Test statistics

4
Some Basic Design Considerations
  • List Frame Census Addresses
  • Census every 10 years
  • Some differential undercount
  • Confidentiality
  • Area Frame using Census counts
  • MOS out of date
  • Migration
  • New construction
  • Define/Stratify Stages - Segments/Clusters

5
Area/Multi-Stage Design
  • Civilian, non-institutionalized population
  • Area Frame: 4-stage design
  • Primary sample units: county
  • Segments: Census geographic groups
  • Households/Dwelling units
  • Persons

6
Census Geography
  • National
  • Region
  • Division
  • State
  • County
  • Tracts
  • Block Groups
  • Blocks

7
Stage 1: Counties
Stage 2: Segments
Stage 3: Households
Stage 4: SPs
8
Sample Segment map
9
Characteristics of a Survey Design
  • Controlled Selection
  • Stratification/Clusters
  • Screening
  • Differential selection (weights)
  • Costs/efficiency
  • Randomization
  • Effective Sample Size
  • Multiple Objectives

10
Design based Analytic Issues
  • Estimation/Weights
  • Influential weights
  • Subsamples - many
  • Variance Estimation
  • Approximation Methods
  • Degrees of Freedom
  • Missing units
  • Design Effects
  • Subdomains versus totals
  • Effective sample size

11
Variance Equations
  • Linearization (Taylor)
  • Variance Linear nonlinear
  • BRR
  • Delete half of PSUs at each time
  • Jackknife
  • Bootstrap
  • Complicated
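
As a rough illustration of the linearization (Taylor) option listed above, the sketch below estimates the variance of a weighted mean from stratum and PSU identifiers. It assumes PSUs are treated as sampled with replacement within strata and that every stratum has at least two PSUs; the function name and record layout are hypothetical, not taken from the slides or from any survey package.

```python
from collections import defaultdict

def linearized_variance_of_mean(records):
    """records: (stratum, psu, weight, y) tuples. Returns (mean, variance).

    Assumption: with-replacement PSU sampling within strata, so only
    between-PSU variation inside each stratum enters the estimate.
    """
    records = list(records)
    total_w = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / total_w

    # Linearized value for the ratio mean, summed to the PSU level.
    psu_totals = defaultdict(float)
    for stratum, psu, w, y in records:
        psu_totals[(stratum, psu)] += w * (y - mean) / total_w

    by_stratum = defaultdict(list)
    for (stratum, _), z in psu_totals.items():
        by_stratum[stratum].append(z)

    variance = 0.0
    for zs in by_stratum.values():
        n_h = len(zs)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(zs) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in zs)
    return mean, variance
```

With two strata of two PSUs each, the degrees of freedom (PSUs minus strata) would be only 2, which is why the later slides stress the small number of PSUs.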

12
Variance Estimation
13
Design Impacts on Variance
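  • $\mathrm{DEFF} = \mathrm{Var}_{\text{complex}} / \mathrm{Var}_{\text{SRS}}$
  • Weights
  • $\mathrm{DEFF} \approx 1 + CV^2_{\text{wts}}$
  • Clustering
  • $\mathrm{DEFF} \approx 1 + (m - 1)\rho$
  • Net Effect
  • $\mathrm{DEFF} \approx (1 + CV^2_{\text{wts}})\,(1 + (m - 1)\rho)$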
  • Subdomain versus Total Population
  • Between and Within PSU variation
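
The two design-effect components on this slide can be sketched numerically as follows. This is a minimal illustration that reads the slide's "p" as the intraclass correlation and "m" as the average cluster size; the function names are hypothetical.

```python
def deff_from_weights(weights):
    """Weighting component: 1 + squared coefficient of variation of the weights."""
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n  # population-style variance
    return 1.0 + var / mean ** 2

def deff_from_clustering(m, rho):
    """Clustering component: 1 + (m - 1) * rho for average cluster size m."""
    return 1.0 + (m - 1) * rho

def net_deff(weights, m, rho):
    """Net effect: product of the weighting and clustering components."""
    return deff_from_weights(weights) * deff_from_clustering(m, rho)

def effective_sample_size(n, deff):
    """Sample size divided by the design effect."""
    return n / deff
```

For example, equal weights give a weighting component of 1, so the net design effect reduces to the clustering term alone.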

14
Range of DEFF for well behaved MEC variables
  • White males, 12-19 (0.9 , 1.2)
  • Mexican American (1.1 , 2.1)
  • NonHispanic Black (1.2 , 2.3)
  • NonHispanic White (1.4, 2.8)
  • Total population (2.4 , 7.2)

15
Design Effects for some Laboratory Tests (means),
NHANES 1999-2000
  • Glucose, serum: 2.24 (RSE 0.37)
  • Creatinine, serum: 2.77 (RSE 0.67)
  • Total Cholesterol: 3.42 (RSE 0.46)
  • C-reactive protein: 4.49 (RSE 3.58)
  • Creatinine, urine: 5.45 (RSE 1.69)
  • Measles Antibody: 8.01 (RSE 2.69)
  • Blood Lead: 9.50 (RSE 2.28)
  • Total Mercury: 10.59 (RSE 8.05)
  • Calcium: 25.63 (RSE 0.29)
  • Chloride: 34.10 (RSE 0.22)

16
Analysis/Interpretation of Data
  • Weights - Bias versus Variance
  • Generalize Population of Inference
  • Efficiency
  • Variance Estimates - Normal versus Student's t (d degrees of freedom)
  • Degrees of Freedom
  • Larger C.I. for small number of PSUs
  • Rare Events proportion, percentiles
  • Chi-square versus F-test
  • Wald (Koch, Freeman, Freeman)
  • Pearson (Rao-Scott)
  • Model based analysis

17
(No Transcript)
18
NHANES Mobile Exam Center
19
NHANES History
  • NHES (three cycles) First in 1960
  • NHANES I 1971-1974
  • NHANES II 1976-1980
  • Hispanic Hanes 1982-1984
  • NHANES III 1988-1994
  • Current NHANES (annual sample design)
  • 1999-2000
  • 2001-2002
  • 2003-2004
  • 2005-2006
  • 2007-2008

20
NHANES III
  • Six years (1988-1994)
  • 89 stands, 81 PSUs
  • 30,818 examined persons
  • Three-year national sample
  • Highly screened sample
  • 9,090 Mexican Americans
  • 9,009 NonHisp Black Americans
  • 11,283 NonHisp White
  • 1,436 remainder

21
NHANES 1999-2006
  • WESTAT data collection contract
  • Approximately 5,000 persons per year
  • Domains Black American and Mexican American
    Under age 20
  • 15 PSUs per year ANNUAL SAMPLE
  • Within PSU stratify segments by MOS
  • Screen for Race/Ethnicity/Age
  • More than 1 sample person per household
  • But random selection not family based

22
Sample Selection 1999-2000
  • Number of PSUs: 26
  • Number of Stands: 27
  • Number of Segments: 681
  • Number of HH Screened: 22,839
  • Number of HH with identified SP: 6,005
  • Number of identified SPs: 12,160
  • Number interviewed (82%): 9,965
  • Number completing MEC (76%): 9,282

23
2002 Survey Design Changes
  • 1999-2001 NHIS PSUs
  • 2002-2006 Independent set of PSUs
  • MOS - Population and Percent Race/Ethnic
  • 18 Self Representing PSUs random 3 per year
  • 12 Non Self Representing Strata 1 per year
  • 2007-2011 New set of PSUs
  • Change Age specific sampling fractions
  • Change to Hispanic (Still Mexican Americans)

24
(No Transcript)
25
Impact of Design and Data release cycle on
Confidentiality
  • Limited Geography
  • no PSU on PUMS
  • No State Estimates
  • Limited SES variables
  • Education (3 categories)
  • Income (PIR 3 categories)
  • Occupation and Industry
  • Race/Ethnic
  • No Household/family link
  • Need to Use Research Data Center

26
Analytic Requirements
  • Design effects around 1.5
  • 10-percent statistic with 30-percent relative
    standard error
  • Sample size of 150
  • 10 percent difference (in proportion)
  • 95 percent significance, 90 percent power
  • Sample size 420

27
Sample Subdomains
  • Mexican American
  • Non-Hispanic Black
  • White/other non-low income
  • White/other low income (2000)
  • M/F (0-11 mo, 1-2 yr, 3-5 yr)
  • M and F 6-11, 12-15, 16-19
  • Black/MA M/F 20-39, 40-59, 60+
  • White/Other M/F 20-29, 30-39, 40-49,
    50-59, 60-69, 70-79, 80+
  • Pregnant Women

28
Race/Ethnicity Issues
  • Design for Mexican American, Black, and Other
  • 60 Percent Mexican Americans did not report race
    (approx 20 percent of sample)
  • Multiple race and coding for other races
  • Comparability with Census, other NCHS surveys
  • Conclusion: no 'Race' variable at this time
  • Recommendation: Do not attempt estimates for
    total Hispanics for two-year cycles
  • Use Non-Hispanic Black even though sample
    weights are post-stratified to Black Population

29
Sample Size, 1999-2000
30
(No Transcript)
31
Weighting the Data
  • Create Base Weights (inverse probability of
    selection)
  • Adjust for new construction, subsampling,
    deselection
  • Adjust for screener non-response
  • Post-stratify to collapsed sampling domain
    controls (Screener Weight)
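
The sequence of weighting steps above can be sketched for a single adjustment cell as follows. This is a simplified illustration, not the production NHANES procedure: it ignores the new-construction and subsampling adjustments, the cell collapsing, and the trimming, and all function names are hypothetical.

```python
def base_weights(selection_probs):
    """Base weight = inverse of each unit's probability of selection."""
    return [1.0 / p for p in selection_probs]

def nonresponse_adjust(weights, responded):
    """Within one cell, shift the weight of nonrespondents onto respondents."""
    total = sum(weights)
    resp_total = sum(w for w, r in zip(weights, responded) if r)
    if resp_total == 0:
        raise ValueError("cell has no respondents; collapse it with another cell")
    factor = total / resp_total
    return [w * factor if r else 0.0 for w, r in zip(weights, responded)]

def post_stratify(weights, control_total):
    """Scale weights in one cell so they sum to an external control total."""
    factor = control_total / sum(weights)
    return [w * factor for w in weights]
```

A usage sketch: `post_stratify(nonresponse_adjust(base_weights(probs), responded), control_total)` applies the three steps in the order listed on the slide.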

32
Interview Weights
  • Adjust for Interview Non-response (race/ethnic,
    age, sex, household size)
  • 96 cells collapsed to 65 (n > 30, max = 1.35)
  • Trim
  • Post-stratify to control totals (Interview
    weight)

33
Examination Weights
  • Adjust for MEC non-response
  • Race/ethnic, age, sex, household size, household
    education, self-reported health status, length of
    stay at current residence
  • 941 cells collapsed to 195 (n > 30, max = 1.35)
  • Trim
  • Post-stratify to control totals
  • Exam weight

34
Examination Weights White/Other
  • Group           Min     Mean      Max
  • M/F < 6       1,816   34,855   86,892
  • Male 6-19     4,950   48,874  196,502
  • Male 20+      9,438   67,518  212,358
  • Female 6-19   7,107   46,234  190,233
  • Female 20+    4,876   67,866  261,361

35
Examination Weights Black, NonHispanic
  • Group           Min     Mean      Max
  • M/F < 6       3,977    9,357   22,008
  • Male 6-19     4,113    8,656   20,961
  • Male 20+      6,804   24,044   58,992
  • Female 6-19   4,193    8,415   17,921
  • Female 20+    4,415   24,888   99,282

36
Examination Weights Mexican Americans
  • Group           Min     Mean      Max
  • M/F < 6       1,163    5,276   19,233
  • Male 6-19     1,684    4,054   16,409
  • Male 20+      1,523   11,446   43,866
  • Female 6-19     980    4,022   16,594
  • Female 20+    1,250    9,049   37,001

37
(No Transcript)
38
(No Transcript)
39
(No Transcript)
40
Percent Distribution Population compared to
sample
41
Prevalence Estimates
  • Do NOT sum weights
  • Estimate by calculating proportion, multiply by
    population count
  • Detailed age/race/sex (subject to reliability)
  • Reason component/demographic non-response
  • Reason Population Controls by limited age
  • Reason - 1990 vs 2000 based control totals
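
The recommendation above (estimate a weighted proportion, then multiply by an external population count instead of summing weights) can be sketched as follows. The function names and the population figure in the usage note are hypothetical.

```python
def weighted_proportion(weights, has_condition):
    """Weighted share of sample persons with the condition."""
    total = sum(weights)
    with_condition = sum(w for w, c in zip(weights, has_condition) if c)
    return with_condition / total

def estimated_count(weights, has_condition, population_count):
    """Prevalence count = weighted proportion times an external population count."""
    return weighted_proportion(weights, has_condition) * population_count
```

For instance, a weighted proportion of 0.5 applied to a hypothetical population count of 1,000 yields an estimated 500 persons, regardless of what the weights themselves happen to sum to.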

42
Two 2-year or One 4-year Survey
  • Recommended Analysis One 4-year Survey
  • Due to sample size
  • Due to number of PSUs
  • Geographic representation
  • Degrees of Freedom for sample errors
  • Greater demographic detail
  • Exceptions
  • Component in for two years only
  • Public Health importance of short-term trends
  • Statistical power to detect change
  • Internal/External validity

43
Merging 2-year Files
  • Consecutive numbering system used for stratum
  • Original PSU pairing (> 50) for stratum
  • MVUs: just 1 or 2 within stratum
  • 1999-2000: numbered 1-12
  • 2001-2002: numbered 13-27
  • 2003-2004: numbered 28-42

44
Sample weights Which weights?
45
Why a four year weight for 1999-02
  • Have 1999-2000 Weights based on 1990 Census
    brought forward to 1999-2000
  • Have 2001-2002 weights based on 2000 census
  • Error of Closure for post 1990 estimates versus
    2000 Census (especially Hispanic)
  • Thus 1999-02 weights based on 2000 Census

46
Two, Four, Six, Eight - How can we estimate?
  • For 4 years of data from 2001-2004:
  • if sddsrvyr = 2 or 3 (2001-2004) then
  • MEC4YR = 1/2 * WTMEC2YR
  • For 6 years of data from 1999-2004:
  • if sddsrvyr = 1 or 2 (1999-2002) then
  • MEC6YR = 2/3 * WTMEC4YR
  • if sddsrvyr = 3 (2003-2004) then
  • MEC6YR = 1/3 * WTMEC2YR
  • When analyzing only years 1999-2002, you should
    not combine the 2-year weights; use the 4-year
    weights provided.

47
Two, Four, Six, Eight - How can we estimate?
  • Future years of data will be combined similarly
  • For 6 years of data from 2001-2006:
  • if sddsrvyr = 2 or 3 or 4 (2001-2006) then
  • MEC6YR = 1/3 * WTMEC2YR
  • For 8 years of data from 1999-2006:
  • if sddsrvyr = 1 or 2 (1999-2002) then
  • MEC8YR = 1/2 * WTMEC4YR (1999-2002)
  • if sddsrvyr = 3 or 4 then
  • MEC8YR = 1/4 * WTMEC2YR, etc.
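
The combination rules on these two slides follow one pattern: each 2-year cycle contributes 1/k of its weight when k cycles are pooled, except that 1999-2000 and 2001-2002 together use the supplied 4-year weight scaled by 2/k. A minimal sketch, assuming the SDDSRVYR coding 1 = 1999-2000 through 4 = 2005-2006; the function name is hypothetical.

```python
def combined_mec_weight(sddsrvyr, wtmec2yr, wtmec4yr, cycles):
    """Return the pooled MEC weight for one sample person.

    cycles: collection of SDDSRVYR codes being pooled, e.g. (1, 2, 3).
    wtmec4yr is only used for persons in cycles 1 and 2 when both are pooled.
    """
    cycles = tuple(sorted(set(cycles)))
    if sddsrvyr not in cycles:
        raise ValueError("person's cycle is not among the pooled cycles")
    k = len(cycles)
    if 1 in cycles and 2 in cycles and sddsrvyr in (1, 2):
        # 1999-2002 carries a dedicated 4-year weight (2000 Census basis).
        return (2.0 / k) * wtmec4yr
    return wtmec2yr / k
```

With `cycles=(1, 2)` the function returns `wtmec4yr` unchanged, matching the advice to use the 4-year weights directly for 1999-2002.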

48
How Many Weights?
  • Full Sample (Interview, MEC, HH)
  • Half Samples
  • AM (fasting)/PM
  • Audiometry
  • Balance (99-00 only)
  • CIDI
  • Environmental Samples
  • Dioxins
  • PAH, Phthalates
  • Heavy Metals
  • T4/TSH
  • Volatile Organic Compounds

49
Subsample Weights
  • MEC weight as start
  • Form adjustment cells
  • Demographics/sample size
  • Calculate ratio
  • Sum(WTMEC)/Sum(WTSubsample respondents)
  • Probability of selection/nonresponse
  • Re-weight subsample within cells
  • Lohr, 1999 pp xxx-xxx
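
The ratio adjustment described above can be sketched as follows: within each adjustment cell, the subsample weight is the MEC weight times the ratio of the summed MEC weights of all examined persons to the summed MEC weights of the subsample respondents. The function and argument names are hypothetical, and the sketch assumes every cell contains at least one subsample respondent.

```python
from collections import defaultdict

def subsample_weights(mec_weights, cells, in_subsample):
    """Re-weight subsample respondents so each cell keeps its full MEC weight total."""
    cell_total = defaultdict(float)
    cell_sub_total = defaultdict(float)
    for w, c, s in zip(mec_weights, cells, in_subsample):
        cell_total[c] += w
        if s:
            cell_sub_total[c] += w
    out = []
    for w, c, s in zip(mec_weights, cells, in_subsample):
        if not s:
            out.append(0.0)
        elif cell_sub_total[c] == 0:
            raise ValueError("cell has no subsample respondents; collapse cells")
        else:
            out.append(w * cell_total[c] / cell_sub_total[c])
    return out
```

The adjusted weights sum to the same total as the original MEC weights, cell by cell.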

50
(No Transcript)
51
2003-2004 Nutrition Weights
  • Two days dietary recall
  • 10 percent: 1999-2001
  • 100 percent: 2002-2006
  • Day 1 Weights (NR day of week)
  • Day 2 Weights (NR weekend/weekday)

52
Sample Size for 2003-2004 Nutrition Samples
  • Stage          Number   Percent of previous stage   Cumulative percent
  • Selected       12,761   100                         100
  • Interviewed    10,122   79.3                        79.3
  • Examined        9,643   95.3                        75.6
  • Day 1           9,034   93.7                        70.8
  • Day 2           8,354   92.5                        65.5
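
The stage and cumulative percentages above follow directly from the counts: each stage is divided by the stage before it and by the number originally selected. A short sketch (the function name is hypothetical) that reproduces the slide's figures:

```python
def response_rates(counts):
    """counts: list of (stage, n), starting with the selected sample.

    Returns (stage, n, percent of previous stage, cumulative percent) rows.
    """
    rows = []
    first = counts[0][1]
    previous = first
    for stage, n in counts:
        rows.append((stage, n,
                     round(100.0 * n / previous, 1),
                     round(100.0 * n / first, 1)))
        previous = n
    return rows

nutrition_2003_2004 = [
    ("Selected", 12761),
    ("Interviewed", 10122),
    ("Examined", 9643),
    ("Day 1", 9034),
    ("Day 2", 8354),
]
for row in response_rates(nutrition_2003_2004):
    print(row)
```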

53
Minimum Sample Size: Is there a simple
Rule/Guideline?

  • Deff = 1.5
  • Proportion (%)   RSE 30%   RSE 20%   H3-USDA
  • 50                    17        38        45
  • 40 or 60              25        56        45
  • 30 or 70              39        88        45
  • 20 or 80              67       150        60
  • 10 or 90             150       338       120
  • 5 or 95              317       713       240
  • 2 or 98              817     1,838       800
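
The RSE columns in this table appear to follow from setting the relative standard error of a proportion, sqrt(DEFF * p(1 - p) / n) / p, equal to the target and solving for n, which gives n = DEFF * (1 - p) / (p * RSE^2) with p taken as the smaller of p and 1 - p. The sketch below is a reconstruction under that assumption (the slides do not state the formula), and the function name is hypothetical; the H3-USDA column is not reproduced.

```python
import math

def min_sample_size(p, rse, deff=1.5):
    """Smallest n (rounded to nearest) giving the target RSE for proportion p.

    p and rse are fractions, e.g. p=0.10 and rse=0.30.
    """
    q = min(p, 1.0 - p)            # the slide pairs 40 with 60, 10 with 90, etc.
    n = deff * (1.0 - q) / (q * rse ** 2)
    # Round half up, with a small tolerance for floating-point error.
    return math.floor(n + 0.5 + 1e-9)

for pct in (50, 40, 30, 20, 10, 5, 2):
    print(pct, min_sample_size(pct / 100, 0.30), min_sample_size(pct / 100, 0.20))
```

Under this reading, a 10-percent statistic with a 30-percent RSE needs about 150 sample persons.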

54
RSE = 30%, DEFF = 1.5
55
Sampling errors are point estimates
  • RSE = 1/SQRT(DF)
  • DF = PSUs - Strata
  • DEFF = 1.5
  • DF = 25: RSE = 20%, SE = 0.3, CI (0.9, 2.1)
  • DF = 9: RSE = 33%, SE = 0.5, CI (0.5, 2.5)
  • Note: Use of t-statistic instead of normal (z)
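
The figures on this slide can be reproduced with a few lines of arithmetic. The sketch assumes the intervals were formed as DEFF plus or minus 2 standard errors, which matches the numbers shown; the multiplier and the function name are inferences, not something the slide states.

```python
import math

def deff_interval(deff, df, multiplier=2.0):
    """Return (rse, se, (lower, upper)) for a design effect estimated with df
    degrees of freedom, using the slide's rule RSE = 1 / sqrt(DF)."""
    rse = 1.0 / math.sqrt(df)
    se = rse * deff
    return rse, se, (deff - multiplier * se, deff + multiplier * se)

print(deff_interval(1.5, 25))
print(deff_interval(1.5, 9))
```

With only 9 degrees of freedom, an estimated design effect of 1.5 is compatible with anything from about 0.5 to 2.5, which is the slide's point that sampling errors are themselves estimates.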

56
Determining Minimum Sample Size
  • SRS sample size (, mean, odds ratio )
  • Inflate by DEFF (point estimate of se versus
    smoothed se)
  • Inflate by degrees of freedom for se
  • DF = 14: ratio = 2.14/1.96
  • DF = 8: ratio = 2.31/1.96

57
Minimum Sample Size for Proportions
  • Depends on min (p, 1-p)
  • Depends on DEFF
  • Depends on Degrees of Freedom for DEFF
  • Transformation for rare events
  • Subject matter common sense
  • Internal Validity
  • External Validity

58
Estimating Sampling Errors in NHANES
  • Confidentiality: Pseudo-PSU vs MVU
  • Options: Use BRR, JK, Taylor
  • Software: Use SUDAAN, WesVar;
    need STATA, SAS, SPSS
  • Additive for combining sets of 2-years
  • Linearization yes
  • Replication - problems

59
Swapping Options
  • Swap segments between pairs of PSUs
  • Swap segments from any PSU with PSU
  • Number of Segments (3, 4, ..., 12)
  • Actual proportion of SPs: 20 to 25 percent
  • Matching variables for Segments
  • Census information
  • Some current NHANES information

60
(No Transcript)
61
(No Transcript)
62
Statistical Issues for NHANES 1999-2004
  • Stability of Complex Survey Variance estimator
  • few PSUs subdomain problems
  • heterogeneity
  • (Effective) Degrees of Freedom
  • Number of model parameters exceeds the design
    based Degrees of Freedom
  • Assumptions underlying test statistics
  • Combining Years Combining Domains

63
Analysis Considerations When Events are Rare
  • Influential weights/PSUs - outliers
  • Variance estimate for proportion
  • CI for proportions
  • Weighted/Unweighted
  • Population/geographic heterogeneity
  • Testing difference between proportions
  • CI for percentiles Woodruff method
  • Logistic regression

64
(No Transcript)
65
Blood Lead Level
  • Weighted cumulative histogram of blood lead
    levels, with calculation of the 90 percent confidence
    interval for the linearly interpolated median.
  • Lower confidence limit: 13.92; upper confidence
    limit: 16.21

66
Simple Random Sample case - Basic problems for
CI(p)
  • Binomial: sum of iid Bernoulli, p fixed
  • Discrete limited outcome space
  • Rare skewed distribution
  • Confidence Interval beyond (0,1) bounds
  • nominal coverage for two-sided
  • Computational ease for alternatives
  • Software limitations

67
Alternative SRS methods
  • Wald (Normal)
  • with transformed data (log, logit, arcsine)
  • With Continuity correction
  • Agresti-Coull
  • Wilson score
  • Clopper-Pearson (Exact)
  • Bayesian - Jeffreys Prior
  • Likelihood ratio
  • Poisson approximation

68
SRS CI for proportions
  • Wald
  • Wilson score

69
Transformations
  • Logit
  • Arcsine
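
As with the previous slide, the formulas are missing from the transcript. The sketch below shows the usual way these two transformations are applied under simple random sampling: compute the interval on the transformed scale, then back-transform. The delta-method standard errors and the function names are assumptions for illustration.

```python
import math

def logit_ci(p, n, z=1.96):
    """Interval on the logit scale, back-transformed to a proportion."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit transform needs 0 < p < 1")
    se_p = math.sqrt(p * (1.0 - p) / n)
    logit = math.log(p / (1.0 - p))
    se_logit = se_p / (p * (1.0 - p))      # delta method
    def expit(x):
        return 1.0 / (1.0 + math.exp(-x))
    return expit(logit - z * se_logit), expit(logit + z * se_logit)

def arcsine_ci(p, n, z=1.96):
    """Interval on the arcsine-square-root scale, back-transformed."""
    phi = math.asin(math.sqrt(p))
    half = z / (2.0 * math.sqrt(n))        # approximate se is 1 / (2 sqrt(n))
    lower = math.sin(max(phi - half, 0.0)) ** 2
    upper = math.sin(min(phi + half, math.pi / 2.0)) ** 2
    return lower, upper
```

Both intervals are asymmetric around a small p and cannot leave the unit interval, which is the motivation for using them with rare events.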

70
SRS CI for proportions
  • Clopper-Pearson Exact Binomial
  • Bayesian (Jeffreys prior)

71
Staph aureus (MRSA), Age 1-19: p = 36.9%, n = 4772
72
Elevated Blood Lead (NHBF 12-19): N = 290, p = 0.7%
73
A Note on Unweighted Estimates
  • Why unweighted
  • Influential weights for rare events
  • Methods study only
  • Pop of inference NOT national
  • Why NOT unweighted
  • Heterogeneity
  • Age/Race/Ethnicity/Sex
  • Geographic Between PSU variation
  • Inflated wts inflated CI
  • Wts typically informative w/respect to outcome

74
Test Statistics for Survey Data Motivation
  • I ran SUDAAN on my model and got 4 different
    test statistics. Only one test statistic shows
    my important model parameter as significant. Can
    I just use the one that indicates significance
    and ignore the other ones?
  • The NHANES Analytic Guidelines do not indicate
    the appropriate test statistic to use for
    multivariate analysis.

75
Wald Chi-square (Koch et al)
  • $H_0: CB = 0$
  • $Q = (CB)'\,(CVC')^{-1}\,(CB) \sim \chi^2_r$
  • $V$ = design-based estimate of $V_B$
  • $r = \mathrm{rank}(C)$
  • Problems: lacks statistical power

76
Satterthwaite adjusted chi-square (Rao/Scott)
  • $Q = (CB)'\,(CVC')^{-1}\,(CB)$
  • $Q / [d\,(1 + a^2)] \sim \chi^2_{r^*}$
  • $V$ = SRS estimate of $V_B$
  • $r^* = r / (1 + a^2)$

77
Wald F-test (Fellegi)
  • $\dfrac{d - r + 1}{r\,d}\,Q \sim F_{r,\,d-r+1}$
  • $d$ = number of PSUs - number of strata
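
The conversion from the Wald chi-square Q to the Wald F statistic is simple arithmetic once d (PSUs minus strata) and r (the number of contrasts) are known. A minimal sketch with a hypothetical function name; computing a p-value would additionally need an F distribution routine, which is left out here.

```python
def wald_f(q, r, n_psus, n_strata):
    """Return (F, (numerator df, denominator df)) for a Wald chi-square q."""
    d = n_psus - n_strata
    denom_df = d - r + 1
    if denom_df <= 0:
        raise ValueError("more contrasts than the design-based df can support")
    f_stat = (d - r + 1) / (r * d) * q
    return f_stat, (r, denom_df)
```

With 30 PSUs in 15 strata, d is 15, so a test of r = 2 contrasts has only 14 denominator degrees of freedom, and a model with more than 15 contrasts cannot be tested this way at all.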

78
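Satterthwaite adjusted F (Thomas and Rao)
  • $\dfrac{Q / [d\,(1 + a^2)]}{r} \sim F_{r,\,e}$
  • Here $d$ = average of the eigenvalues of $V_{\mathrm{SRS}}^{-1} V_{\mathrm{design}}$
  • $a^2$ = squared coefficient of variation of the eigenvalues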

79
Categorical Response Rao and Thomas (2003)
  • Wald Chi-Square
  • Rao-Scott (R-S) First and Second Order
    Corrections
  • F-statistic variants to Wald, R-S
  • Fay Jackknife
  • Bonferroni Adjustments

80
Rao-Thomas (2003)
  • Avoid Wald statistic: extremely liberal
  • Determining factor: variation in generalized
    design effects
  • All procedures derived from the F-based Wald test
    exhibited low power for small number of clusters
  • Note limits to past simulations, e.g., Thomas,
    Singh, and Roberts (1996): all cluster sizes set equal
    to 20 and number of clusters ranged from 15 to 70

81
Effective Degrees of Freedom
  • Asymptotic normality (too large)
  • Number of PSUs - Number of Strata
  • Satterthwaite (too small)
  • $2\,[\mathrm{var}(y)]^2 / \mathrm{var}(\mathrm{var}(y))$
  • Korn and Graubard (1999)
  • Jang and Eltinge (1996)
  • Rust (1986)

82
Degrees of Freedom NHANES III
83
NHANES Number of PSUs with Domain Sample size
84
Additional Concerns
  • NHANES has a LARGE number and variety of analytic
    variables (Interview, Examination, Laboratory)
  • DEFF varies considerably for domains and analytic
    variables (0.4 to 12, 20 ???)
  • Proportion of within-PSU variance components varies a
    lot (15 to 100 percent)

85
Recent research work
  • Rao, Scott and Skinner (1998)
  • Hidiroglou, Rao, Yung (recent ASA)
  • Fay and Graubard (2001)
  • McCaffrey and Bell (2002)
  • Mancl and DeRouen (2001)
  • Pan and Wall (2002)
  • Effective Degrees of Freedom

86
Future Research
  • Variance components for NHANES
  • (Design-based components, Korn and Graubard
    2003)
  • Simple Modifications for small number of clusters
  • Simulation Study for NHANES situation
  • Empirical based finite population
  • Model-based for rare events

87
Summary
  • NHANES 99+ is an annual sample
  • 2-year data release; may need 4 or 6 years
  • Area Probability Sample Strata/Stages
  • Over-sampling Density Strata/Screening
  • Many Sample Weights can be confusing
  • Many analytic issues, especially small numbers

88
Analytic Guidelines for NHANES 1999-2006
  • Read all Documentation
  • NHANES III Guidelines can be used
  • Use Survey weights for estimation
  • Undertake descriptive (or exploratory ) analysis
    of data
  • Due to small sample sizes, limited
    Race/Ethnic/Age/Sex for 2 year cycles
  • Possible problems with limited geography
  • Extra Careful with Design Based estimation
  • Influential values, sample weights

89
WEB sites
  • NHANES tutorial
  • http://www.cdc.gov/nchs/tutorials/Nhanes/index.htm
  • ASA Survey Methods Section
  • http://www.amstat.org/sections/SRMS/links.html
  • http://www.fas.harvard.edu/stats/survey-soft/survey-soft.html
  • UCLA
  • http://www.ats.ucla.edu/stat/
  • http://www.ats.ucla.edu/stat/survey/survey_howtochoose.htm
  • UNC CPC
  • http://www.cpc.unc.edu/projects/usda/help/SUDAAN_STATA.html
  • Harvard
  • http://www.iq.harvard.edu/psr/harvard_survey_resources.html
  • http://www.hcp.med.harvard.edu/statistics/survey-soft/
  • PSU
  • http://www.psu.edu/help/cacpri/sudaan/sudaan.htm

90
Analyzing Data from NHANES 1999-2004
  • Analytic Guidelines
  • Detailed guidelines for working with NHANES data
    can be found at
  • http://www.cdc.gov/nchs/nhanes.htm
  • This document contains everything discussed today
    and will continue to grow to include guidelines
    for statistical tests, multivariate analyses,
    modeling and more!
  • Web-based tutorial also currently in development.
  • Target date for release is Dec 31st 2006.