Title: NHANES 1999-2004
1. NHANES 1999-2004 Analytic Strategies
Deanna Kruszon-Moran, MS
2. Analyzing Data NHANES 1999-2004: Preparing your data files
- Downloading demographic, questionnaire, exam and lab files.
- Files are no longer available as self-extracting zip files.
- Documentation and procedure files are now in Adobe PDF format and can be viewed or accessed directly via the web link.
- Clicking on the data link will allow you to store the data file or open it directly with SAS.
- Data files are in SAS transport (.xpt) format.
3. Know your data
- Read the documentation!!
- Read the documentation!!
- Read the documentation!!
- Read the documentation!!
4. Preparing your data files
- Merging
  - Merge all files by sequence number to the demographic file.
  - Verify the numbers of records merged and the final sample number against the published frequencies on the web.
  - Be sure they are what you expected and that all merges worked correctly.
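The merge-and-verify step above can be sketched in a few lines. This is a minimal illustration in Python, not the SAS workflow the slides assume: the rows are toy values, the sequence number is assumed to be stored in a column named SEQN, and `merge_on_seqn` is a hypothetical helper.

```python
def merge_on_seqn(demo, other):
    """Left-merge `other` onto the demographic rows by SEQN.

    Both inputs are lists of dicts. Returns the merged rows and the
    number of demographic records that found a match in `other`.
    """
    lookup = {}
    for row in other:
        if row["SEQN"] in lookup:
            # A component file should hold one record per sample person.
            raise ValueError(f"duplicate SEQN {row['SEQN']} in component file")
        lookup[row["SEQN"]] = row

    merged, matched = [], 0
    for row in demo:
        extra = lookup.get(row["SEQN"])
        if extra is not None:
            matched += 1
        merged.append({**row, **(extra or {})})
    return merged, matched


# Toy data: two interviewed persons, only one of whom has an exam record.
demo = [{"SEQN": 1, "RIAGENDR": 1}, {"SEQN": 2, "RIAGENDR": 2}]
exam = [{"SEQN": 2, "BMXBMI": 27.5}]
merged, matched = merge_on_seqn(demo, exam)
print(len(merged), "records after merge;", matched, "matched the exam file")
```

Comparing `len(merged)` and `matched` with the published frequencies is the check the slide asks for.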
5. Know your data
- Run basic frequencies and cross tabulations.
- Know your target population.
- Understand how each item was measured (how the item is defined, topcoded, recoded).
- Recode variables as necessary (for example age groups, positive/negative lab tests, high/low BP, high/low cholesterol, etc.).
- Recode unknowns/refusals as missing data (77, 99 recode to missing).
- Check your coding: run frequencies in SAS.
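As a rough illustration of the recode-then-check step, the sketch below sets the codes 77 and 99 to missing and then tabulates frequencies. It is written in Python rather than SAS, the values are made up, and `recode_missing` is a hypothetical helper. The codes that mean "refused" or "don't know" differ from item to item, so the documentation has to confirm them: for a variable where 77 is a legitimate value, this recode would be wrong.

```python
from collections import Counter


def recode_missing(values, missing_codes=(77, 99)):
    """Replace unknown/refused codes with None (missing)."""
    return [None if v in missing_codes else v for v in values]


raw = [3, 77, 12, 99, 5, 12]   # toy questionnaire responses
clean = recode_missing(raw)

# Frequency check after recoding: None counts the records set to missing.
freq = Counter(clean)
for value, count in freq.items():
    print(value, count)
```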
6. Know your data
- Continuous Outcome Data
  - Look for outliers in your measure.
    - Run Proc Univariate.
  - Look for outliers among the weights.
    - Use Proc Univariate on the weight variable.
  - Outlying values, especially those with large weights, can really influence your estimates.
  - Look at normality.
  - Consider transformations.
    - Log, square root, power.
7. NHANES Sample Design
- NHANES is a complex, multistage, probability cluster design of the civilian, noninstitutionalized US population.
8. Sample Weights
- To analyze NHANES data you must use the sample weights to account for:
9. (1) The base probability of selection
10. (2) Oversampling
- NHANES 1999-2004 oversampled:
  - African Americans
  - Mexican Americans
  - Persons with low income
  - Adolescents aged 12-19
  - Persons aged 60+
11. (3) Non-response to the interview and exam
- Sample persons age 20+
12. Non-response issues for NHANES
- Non-response
  - Most components have some level of individual item or component non-response.
  - ONLY non-response to the interview and exam has already been accounted for in the weights.
  - All additional non-response to the outcome measure of interest should be examined against all possible predictors.
  - Potential biases should be discussed.
  - If non-response is high, re-weighting should be considered.
13. Why weight?
Sample Subdomain       US Population (%)   Sample, unweighted (%)   Sample, weighted (%)
Non-Hispanic Blacks    13                  25                       12
Mexican Americans      9                   28                       9
12-19 year olds        12                  24                       12
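The effect shown in the table can be reproduced with a toy sample. In this hedged Python sketch the weights are invented so that an oversampled group makes up 25% of the sample but about 12% of the weighted total. They are not NHANES weights, and `group_shares` is a hypothetical helper.

```python
def group_shares(in_group, weights):
    """Return (unweighted share, weighted share) of a subgroup."""
    unweighted = sum(in_group) / len(in_group)
    weighted = sum(w for g, w in zip(in_group, weights) if g) / sum(weights)
    return unweighted, weighted


# 100 sample persons: 25 from an oversampled group (small weights),
# 75 from the rest of the population (larger weights).
in_group = [True] * 25 + [False] * 75
weights = [36] * 25 + [88] * 75

unweighted, weighted = group_shares(in_group, weights)
print(f"unweighted {unweighted:.0%}, weighted {weighted:.0%}")
```

Each oversampled person carries a smaller weight, so the weighted share falls back toward the group's share of the population.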
14. Sample weights: Which weights?
Weight Variables to Use
Years of data                                            Household Interview Data ONLY                        ANY Data from Exam/Lab/MEC Interview
Any 2 yrs of data (1999-2000, 2001-2002 or 2003-2004)    WTINT2YR                                             WTMEC2YR
4 yrs of data (1999-2002)                                WTINT4YR                                             WTMEC4YR
4 yrs (2001-2004) or 6 yrs (1999-2004) of data           Combine appropriate 2 or 4 year weights as follows   Combine appropriate 2 or 4 year weights as follows
15. Two, Four, Six, Eight - How can we estimate?
- For 4 years of data from 2001-2004:
    MEC4YR = 1/2 * WTMEC2YR;
- For 6 years of data from 1999-2004:
    if sddsrvyr=1 or sddsrvyr=2 then
      MEC6YR = 2/3 * WTMEC4YR;  /* for 1999-2002 */
    if sddsrvyr=3 then
      MEC6YR = 1/3 * WTMEC2YR;  /* for 2003-2004 */
- When analyzing only years 1999-2002, you should not combine the 2 year weights but use the 4 year weights provided.
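The 6-year rule above can also be written as a small function. This is a sketch in Python rather than SAS, using toy weight values. `mec6yr` is a hypothetical helper, and it assumes sddsrvyr takes the values 1, 2 and 3 for the 1999-2000, 2001-2002 and 2003-2004 cycles, as the slide's conditions imply.

```python
def mec6yr(sddsrvyr, wtmec2yr, wtmec4yr):
    """Combined 6-year MEC weight (1999-2004) for one sample person."""
    if sddsrvyr in (1, 2):
        # 1999-2002: take 2/3 of the 4-year weight provided.
        return wtmec4yr * 2 / 3
    if sddsrvyr == 3:
        # 2003-2004: take 1/3 of the 2-year weight.
        return wtmec2yr / 3
    raise ValueError(f"cycle {sddsrvyr} is outside 1999-2004")


print(mec6yr(1, wtmec2yr=1000.0, wtmec4yr=3000.0))
print(mec6yr(3, wtmec2yr=3000.0, wtmec4yr=None))
```

The fractions follow each period's share of the six years: four of six years for 1999-2002 and two of six for 2003-2004.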
16. Two, Four, Six, Eight - How can we estimate?
- Future years of data will be combined similarly.
- For 6 years of data from 2001-2006:
    if sddsrvyr in (2,3,4) then
      MEC6YR = 1/3 * WTMEC2YR;
- For 8 years of data from 1999-2006:
    if sddsrvyr=1 or sddsrvyr=2 then
      MEC8YR = 1/2 * WTMEC4YR;  /* for 1999-2002 */
    if sddsrvyr=3 or sddsrvyr=4 then
      MEC8YR = 1/4 * WTMEC2YR;  /* for 2003-2006, etc. */
17. Sample Weights - Subsamples
- Subsamples and appropriate weights
  - Look at your primary variable of interest and the corresponding weight.
  - Look at all other variables you want to combine with it.
  - Are all from the interview? Exam? Subsample (e.g. fasting, audiometry, dioxin, VOCs)?
  - Use the weight from the smallest subsample for your analysis.
  - Be consistent!
18. Sample Weights - Subsamples
- Subsamples and appropriate weights
  - Be careful about combining subsamples beyond MEC + VOCs, Interview + Dioxin, etc.
  - Combining subsamples such as Environmental + AM fasting could be problematic.
  - Some subsamples are mutually exclusive.
  - Weights were not designed for combining subsamples and may not produce good estimates.
19. Preparing for Analyses
- Subsetting the data for SUDAAN
  - If using MEC exam weights, SUBSET the data on those MEC EXAMINED in SAS before using SUDAAN.
  - If using other subsample weights, subset the data on those in the subsample corresponding to the weights you are using.
  - Then use the SUBPOPN statement in the SUDAAN procedure to further subset your data by age, gender, etc. to reflect the target population you are interested in analyzing.
20. Sample Weights
- Example
  - You are interested in examining the association of high triglycerides, blood pressure, and body mass index (BMI), controlling for race/ethnicity, among females age 20-59 from the 6 years of data from 1999-2004.
21. Sample Weights
- Step 1: Determine the smallest sample population for the analysis to determine the correct weight to use.
  - Race/ethnicity, gender and age are in the interview.
  - Blood pressure and body weight come from the MEC exam, a subset of those interviewed.
  - Triglycerides were measured on a subsample of those MEC examined who fasted for 8 hours and came to the AM MEC exam.
  - Therefore, the fasting subsample is the smallest subsample in the analysis and you would use the AM fasting weights (WTSAF2YR and WTSAF4YR).
22. Sample Weights
- Step 2: Combine weights in SAS prior to the SUDAAN procedure for the 6 years from 1999-2004.
    if sddsrvyr in (1,2) then
      WEIGHT6 = 2/3 * WTSAF4YR;  /* 1999-2002 */
    if sddsrvyr=3 then
      WEIGHT6 = 1/3 * WTSAF2YR;  /* 2003-2004 */
23. Sample Weights
- Step 3: Subset your data set in SAS to reflect the weight being used (AM fasting weights WTSAF2YR or WTSAF4YR).
- SAS Code:
    IF WTSAF2YR ne . or WTSAF4YR ne . ;
24. Sample Weights
- Step 4: Last, specify the correct weight using the WEIGHT statement in SUDAAN, and subset your data to obtain the subpopulation of interest (females age 20-59) using the SUBPOPN statement in SUDAAN.
    WEIGHT WEIGHT6;
    SUBPOPN riagendr=2 and ridageyr>19 and ridageyr<60;
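Steps 2 through 4 of this example can be traced end to end on a few toy records. The Python sketch below only illustrates the data preparation and does not replace the SAS and SUDAAN steps. `prepare_fasting_rows` and the flag name IN_SUBPOP are hypothetical, a missing weight is represented as None, and the weight values are invented.

```python
def prepare_fasting_rows(rows):
    """Keep the AM fasting subsample, attach WEIGHT6, flag females 20-59."""
    prepared = []
    for r in rows:
        cycle = r["SDDSRVYR"]
        if cycle in (1, 2):        # 1999-2002: 2/3 of the 4-year weight
            base, num, den = r.get("WTSAF4YR"), 2, 3
        elif cycle == 3:           # 2003-2004: 1/3 of the 2-year weight
            base, num, den = r.get("WTSAF2YR"), 1, 3
        else:
            raise ValueError(f"cycle {cycle} is outside 1999-2004")
        if base is None:
            # Step 3: not in the fasting subsample, so drop the record.
            continue
        # Step 4: flag the subpopulation instead of deleting other records,
        # which is the role the SUBPOPN statement plays in SUDAAN.
        in_subpop = r["RIAGENDR"] == 2 and 19 < r["RIDAGEYR"] < 60
        prepared.append({**r, "WEIGHT6": base * num / den, "IN_SUBPOP": in_subpop})
    return prepared


rows = [
    {"SEQN": 1, "SDDSRVYR": 1, "WTSAF2YR": 50.0, "WTSAF4YR": 30.0, "RIAGENDR": 2, "RIDAGEYR": 35},
    {"SEQN": 2, "SDDSRVYR": 3, "WTSAF2YR": 60.0, "WTSAF4YR": None, "RIAGENDR": 1, "RIDAGEYR": 40},
    {"SEQN": 3, "SDDSRVYR": 2, "WTSAF2YR": None, "WTSAF4YR": None, "RIAGENDR": 2, "RIDAGEYR": 25},
]
prepared = prepare_fasting_rows(rows)
for r in prepared:
    print(r["SEQN"], r["WEIGHT6"], r["IN_SUBPOP"])
```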
25. NHANES 1999-2000: Variance Estimation
- Why must you use the sample design to estimate the variance?
  - NHANES is a cluster design.
  - Individuals within a cluster are more similar to each other than to those in other clusters.
  - This homogeneity or clustering results in a reduction of our effective sample size because we choose individuals within clusters rather than randomly throughout the population.
26. NHANES 1999-2004: Variance Estimation
- Why must you use the sample design to estimate the variance?
  - Variance estimates that do not account for this intra-cluster correlation are too low and biased.
  - Survey software such as SUDAAN or the SAS Survey procedures must be used to account for the complex design and produce unbiased variance estimates.
  - These procedures require information on the sample design (i.e. identification of the PSU and strata) for each sample person.
27. NHANES 1999-2000: Variance Estimation
- For the initial 1999-2000 data release we recommended:
  - Using the JK-1 (jackknife, leave-one-out) procedure.
  - Required 52 replicate weights, one for each of 52 groups created. Only provided for 1999-2000.
  - Can still be used if you have software that can produce the replicate weights.
  - Replicate weights for this procedure will no longer be created on the data set.
  - Too cumbersome.
28. NHANES 1999-2004: Variance Estimation
- We now recommend:
  - Using the Taylor series (linearization) method.
  - Same as that used in NHANES III.
  - We now provide Masked Variance Units (MVUs) in place of primary sampling units (PSUs) to maintain confidentiality.
  - Design variables are called SDMVSTRA and SDMVPSU.
29. Design Variables
- SDMVSTRA and SDMVPSU
  - Found in the demographic file.
  - Found in all two year data sets and can be combined for 4 or 6 year data sets.
  - Can be used the same as the actual stratum and PSU variables.
  - Produce variance estimates close to those using the true design.
  - Data MUST be sorted by SDMVSTRA and SDMVPSU first, before using SUDAAN.
30. Sample SUDAAN Code
31. Preparing for Analysis: Setting up the procedure in SAS Surveymeans
32. Other data analysis issues from NHANES
- Calculating Population Totals
  - Estimates of the number of persons in the U.S. population with a particular condition must be done carefully.
  - Recommended procedure:
    - First, estimate the proportion with the condition for each subdomain of interest.
    - Multiply that by the population control total for that subdomain.
  - Tables are available on the NCHS web site with the current March 2001 CPS control totals as part of the analytic guidelines.
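The recommended calculation is simple arithmetic, sketched here in Python. The subdomain names, proportions and control totals are invented for illustration. Real control totals come from the CPS tables mentioned above, and `population_total` is a hypothetical helper.

```python
def population_total(proportion, control_total):
    """Estimated number of persons with the condition in one subdomain."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    return proportion * control_total


# Hypothetical weighted proportions and control totals by subdomain.
subdomains = {
    "females 20-39": (0.25, 10_000_000),
    "females 40-59": (0.50, 8_000_000),
}
totals = {name: population_total(p, n) for name, (p, n) in subdomains.items()}
overall = sum(totals.values())
print(totals, overall)
```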
33. Other data analysis issues from NHANES
- Calculating Population Totals
  - Estimates of the number of persons with a condition can be obtained by summing the weights of those positive.
  - These estimates will be less reliable due to:
    - item non-response
    - and sampling error
  - Not the recommended method.
34. Analyzing within NHANES 1999-2004
- Things to consider
  - Data released in two year cycles.
  - We STRONGLY RECOMMEND using two or more cycles (4 or more years) to produce reliable estimates.
  - Verify data items collected were comparable in wording and methods.
  - When combining years, remember to use the correct combined weights.
35. Analyzing trends with NHANES: NHANES III to NHANES 1999-2004
- Things to consider
  - What is your sample from each survey? Age?
  - How different was the question wording or the interview methods?
  - How different were the lab or exam methodologies? Cutoffs used? Definitions?
  - For current NHANES 1999-2004, sample sizes may be smaller depending on the number of years measured, especially in subdomains.
    - Larger sampling variation.
    - May need to limit comparisons.
36. Race/Ethnicity NHANES 1999-2004
- Two variables available:
  - RIDRETH1
  - RIDRETH2
37. Race/Ethnicity NHANES 1999-2004
- RIDRETH1: Use for analyses of 1999-2004 data alone.
  - 1 = Mexican American
  - 2 = other Hispanic
  - 3 = non-Hispanic white
  - 4 = non-Hispanic black
  - 5 = other races including multiracial
- For 2 and 4 years of data we know there is insufficient sample size to analyze other Hispanics (group 2) alone or to analyze all Hispanics.
- Analyses to evaluate whether 6 years of data (1999-2004) are sufficient to analyze these Hispanic groups are ongoing.
- Groups 2 and 5 can AND should continue to be combined to represent all other races.
38. Race/Ethnicity NHANES 1999-2004
- RIDRETH2
  - Use for analyzing trends from NHANES III to NHANES 1999-2004.
  - Most comparable to the race/ethnicity variable collected in NHANES III.
  - Coded as:
    - 1 = non-Hispanic white
    - 2 = non-Hispanic black
    - 3 = Mexican American
    - 4 = other including multiracial
    - 5 = other Hispanic
39. Analyzing data from NHANES 1999-2004
- Crude versus Age Standardized Estimates
  - Age distributions within survey samples vary by racial/ethnic group.
  - Age distributions also vary by survey: NHANES III vs. NHANES 1999-2004.
  - When comparing estimates across racial/ethnic groups or between surveys you may need to age standardize.
  - Also present all age specific estimates!
40. Analyzing data from NHANES 1999-2004
- When Age Standardizing
  - Use the 2000 U.S. Census population for consistency, for both NHANES III and all NHANES from 1999-2000 onward.
  - For guidelines and population proportions see the website below for the Klein and Schoenborn HP2010 Statistical Notes on Age Adjustment using the 2000 Projected U.S. Population.
  - http://www.cdc.gov/nchs/data/statnt/statnt20.pdf
41. Analyzing data from NHANES 1999-2004
- When Age Standardizing
  - In SUDAAN, use the STDVAR and STDWGT statements.
  - STDVAR = variable name for the age groups.
  - STDWGT = corresponding proportion of the 2000 U.S. Census population for that age subgroup.
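Direct age standardization, which is what the STDVAR/STDWGT pairing expresses, amounts to a weighted average of the age-specific estimates. The Python sketch below uses invented age groups, rates and standard proportions. The real proportions come from the 2000 U.S. standard population tables, and `age_standardize` is a hypothetical helper.

```python
def age_standardize(age_specific_rates, standard_proportions):
    """Directly standardized rate: sum over age groups of rate_i * proportion_i."""
    if len(age_specific_rates) != len(standard_proportions):
        raise ValueError("need one standard proportion per age group")
    total = sum(standard_proportions)  # normalize in case they do not sum to 1
    return sum(r * p for r, p in zip(age_specific_rates, standard_proportions)) / total


# Hypothetical prevalence (%) in three age groups and the share of the
# standard population that falls in each group.
rates = [2.0, 4.0, 10.0]
std_props = [0.5, 0.3, 0.2]
adjusted = age_standardize(rates, std_props)
print(round(adjusted, 2))
```

Two groups with identical age-specific rates get the same standardized estimate even if their own age distributions differ.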
42. Age standardization for NHANES
- Crude vs. Age Standardized Estimates, Example: Hepatitis B, NHANES III
                     Non-Hispanic White   Non-Hispanic Black   Mexican American
Crude Prevalence     3.1 (2.6-3.6)        11.9 (10.6-13.2)     3.6 (2.8-4.6)
Age Standardized     2.6 (2.2-3.1)        11.9 (10.7-13.3)     4.4 (3.4-5.6)
43. Other data analysis issues from NHANES
- Design Effect
  - Sample design effect: the ratio of the variance estimated under the complex sample design to the variance under simple random sampling.
  - DEFF = Var (CSD) / Var (SRS)
  - SUDAAN: DEFT2 option in Proc Descript
  - Design effect can be averaged.
44. Other data analysis issues from NHANES
- Effective Sample sizes
  - Sample sizes should be adjusted by the sample design effect (DEFF).
  - Effective N = N/DEFF
  - Minimum sample size for reporting each individual estimate depends on the statistic being calculated, its relative size, stability of the SE estimate, degrees of freedom and other special circumstances.
  - Please refer to the Analytic Guidelines on our web site for more details.
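The two ratios above, the design effect and the effective sample size, are shown here on made-up numbers. This is a Python sketch. In practice the complex-design variance would come from survey software, and the function names are hypothetical.

```python
def design_effect(var_csd, var_srs):
    """DEFF: variance under the complex design over variance under SRS."""
    return var_csd / var_srs


def effective_n(n, deff):
    """Sample size adjusted for the design effect."""
    return n / deff


# Hypothetical variances for a prevalence estimate based on 1,000 persons.
deff = design_effect(var_csd=0.0004, var_srs=0.0002)
n_eff = effective_n(1000, deff)
print(deff, n_eff)
```

A design effect of 2 means the 1,000 sampled persons carry about as much information as 500 persons drawn by simple random sampling.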
45. Other data analysis issues from NHANES: Estimate Stability
- Relative Standard Errors
  - For estimates such as means/prevalences, calculate the relative standard error (RSE) as follows: (SE mean / mean) X 100
  - For prevalence estimates near 100% (i.e. > 90%), look at the RSE for the percent negative, not just the percent positive.
  - i.e. calculate the RSE for the minimum of p or 1-p.
46. Other data analysis issues from NHANES
- Relative Standard Errors and Rare Events
  - RSEs < 20%: estimates are most likely reportable.
  - RSEs > 30%: consider whether the estimate provides useful information.
  - Estimates of 50% with SE of 15% and RSE of 30% give a 95% CI of approximately 20-80%. Is this really useful information?
  - Estimates of low prevalence (i.e. 5%) with SE of 1.5% also give RSE of 30%, but the 95% CI is approximately 2-8%. This may be very useful information.
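The two cases above can be checked numerically. This Python sketch computes the RSE and a normal-approximation 95% CI (estimate plus or minus 1.96 standard errors), with all quantities in percent. The function names are hypothetical.

```python
def rse(estimate, se):
    """Relative standard error in percent: (SE / estimate) x 100."""
    return 100 * se / estimate


def rse_prevalence(p_pct, se_pct):
    """RSE based on the smaller of p and 100 - p, for prevalences in percent."""
    return 100 * se_pct / min(p_pct, 100 - p_pct)


def normal_ci(estimate, se, z=1.96):
    """Normal-approximation confidence interval."""
    return estimate - z * se, estimate + z * se


for est, se in [(50, 15), (5, 1.5)]:
    low, high = normal_ci(est, se)
    print(f"estimate {est}%, RSE {rse(est, se):.0f}%, 95% CI {low:.1f}-{high:.1f}%")

# A prevalence of 95% with SE 1.5% looks stable by the ordinary RSE,
# but the RSE of the 5% negative share is much larger.
print(rse(95, 1.5), rse_prevalence(95, 1.5))
```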
47. Other data analysis Issues from NHANES
- Confidence Limits for rare (> 90% or < 10%) events
  - Standard normal approaches for calculating 95% CIs may give lower bounds < 0 or upper bounds > 100%.
  - Statistical literature describes alternative methods under these situations.
  - For an evaluation of these various methods, see the analytic guidelines on the NCHS web site.
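To make the boundary problem concrete, the sketch below compares the normal interval with a logit-transformed interval for a hypothetical prevalence of 2% with SE 1.5%. The logit interval is only one of the alternatives described in the literature, chosen here for illustration. The slides do not say which method the guidelines recommend, and the function names are hypothetical.

```python
import math


def normal_ci(p, se, z=1.96):
    """Normal-approximation interval on the proportion scale."""
    return p - z * se, p + z * se


def logit_ci(p, se, z=1.96):
    """Interval built on the logit scale and transformed back to (0, 1)."""
    logit = math.log(p / (1 - p))
    se_logit = se / (p * (1 - p))   # delta-method SE of logit(p)

    def inv(x):
        return 1 / (1 + math.exp(-x))

    return inv(logit - z * se_logit), inv(logit + z * se_logit)


p, se = 0.02, 0.015   # hypothetical rare-event prevalence and its SE
norm_low, norm_high = normal_ci(p, se)
logit_low, logit_high = logit_ci(p, se)
print("normal:", norm_low, norm_high)
print("logit: ", logit_low, logit_high)
```

The normal interval dips below zero here, while the back-transformed logit interval stays between 0 and 1 and is asymmetric around the estimate.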
48. Other data analysis Issues from NHANES
- Degrees of freedom (DF) for t-statistics
  - Must calculate the DF to obtain a correct t-statistic for calculating confidence limits.
  - DF = number of clusters in the 2nd level of sampling (# PSUs) - number of clusters in the 1st level of sampling (# strata) in your subgroup of interest.
  - Same for both SAS and SUDAAN when all strata and PSUs are represented in your subgroup.
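The DF rule can be sketched as a count over the records in the subgroup. This Python illustration uses toy design values. It assumes PSU codes repeat across strata, so a PSU is identified by the (SDMVSTRA, SDMVPSU) pair, and `design_df` is a hypothetical helper.

```python
def design_df(rows, in_subgroup=lambda r: True):
    """Degrees of freedom: number of PSUs minus number of strata in the subgroup."""
    strata, psus = set(), set()
    for r in rows:
        if in_subgroup(r):
            strata.add(r["SDMVSTRA"])
            # PSU numbers restart within each stratum (an assumption here),
            # so count stratum/PSU pairs rather than PSU codes alone.
            psus.add((r["SDMVSTRA"], r["SDMVPSU"]))
    return len(psus) - len(strata)


# Toy design: 3 strata with 2 PSUs each. Females appear in every PSU;
# males appear in both PSUs of stratum 1 but only one PSU of stratum 2.
rows = [{"SDMVSTRA": s, "SDMVPSU": p, "RIAGENDR": 2} for s in (1, 2, 3) for p in (1, 2)]
rows += [
    {"SDMVSTRA": 1, "SDMVPSU": 1, "RIAGENDR": 1},
    {"SDMVSTRA": 1, "SDMVPSU": 2, "RIAGENDR": 1},
    {"SDMVSTRA": 2, "SDMVPSU": 1, "RIAGENDR": 1},
]
df_all = design_df(rows)
df_males = design_df(rows, in_subgroup=lambda r: r["RIAGENDR"] == 1)
print(df_all, df_males)
```

A subgroup missing from some PSUs or strata has fewer degrees of freedom than the full design.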
49. Other data analysis Issues from NHANES
- Degrees of freedom (DF) for t-statistics
  - SAS and SUDAAN do not calculate DF the same way when your subgroup is NOT represented in all PSUs and strata.
  - SAS is currently working on correcting this.
  - In SUDAAN, to calculate DF you must output the strata and the PSUs using the ATLEVEL1=1 and ATLEVEL2=2 options in your PROC Descript or PROC Crosstab.
50. Analyzing Data from NHANES 1999-2004
- Analytic Guidelines
  - Detailed guidelines for working with NHANES data can be found at http://www.cdc.gov/nchs/nhanes.htm
  - This document contains everything discussed today and will continue to grow to include guidelines for statistical tests, multivariate analyses, modeling and more!
  - A web-based tutorial is also currently available and continuously being developed.