Title: Epidemiologic design from a sampling perspective
1. Epidemiologic design from a sampling perspective
- Epidemiology II Lecture
- April 14, 2005
- David Jacobs
2. Why different epidemiologic designs?
- It is generally not possible to observe everyone in a population
- New questions arise after data / samples have been collected
- Cost and feasibility
- Statistical efficiency and appropriateness to study question
3. The possibilities
- There are many approaches
- Sampling from the whole population
- Sampling from exposure
- Sampling from caseness
- Haphazard selection
4. True Population Configuration Underlying Epidemiologic Study Designs
[Figure: population configuration at Time 0 and Time 1]
5. True Population Configuration Underlying Epidemiologic Study Designs: Approximate numbers for Minnesota
[Figure: population configuration at Time 0 and Time 1]
6. Alternate Format: Population Values, Exposed and Diseased
              Diseased   Not diseased
Exposed       A          C
Not exposed   B          D
The numbers A, B, C, D are fixed and known exactly.
7. Alternate Format: Population Values, Exposed and Diseased
              Diseased      Not diseased
Exposed       A = 50,000    C = 950,000
Not exposed   B = 50,000    D = 2,950,000
The numbers A, B, C, D are fixed and known exactly.
8. Measures of Risk and Relative Risk: Whole Population
                 Diseased   Not diseased   Odds   Risk (Probability)
Exposed          A          C              A/C    A/(A+C)
Not exposed      B          D              B/D    B/(B+D)
Exposure Ratio   A/(A+B)    C/(C+D)
- Risk Difference = A/(A+C) - B/(B+D)
- Risk Ratio, Relative Risk = [A/(A+C)] / [B/(B+D)]
- Odds Ratio = (A/C) / (B/D) = AD/BC
9. Measures of Risk and Relative Risk: Whole Population
                 Diseased   Not diseased   Odds    Risk (Probability)
Exposed          50,000     950,000        0.053   0.05
Not exposed      50,000     2,950,000      0.017   0.017
Exposure Ratio   0.5        0.244
- Risk Difference = 0.033
- Risk Ratio, Relative Risk = 3
- Odds Ratio = 3.11
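The whole-population measures can be checked with a few lines of Python. This is only an illustrative sketch: the function name and dictionary keys are invented here, and the cell layout (A, B diseased; C, D not diseased) follows the slides.

```python
def two_by_two_measures(A, B, C, D):
    """Summary measures for a 2x2 table laid out as on the slides.

    A, B: diseased (exposed, not exposed); C, D: not diseased (exposed, not exposed).
    """
    risk_exposed = A / (A + C)
    risk_unexposed = B / (B + D)
    return {
        "odds_exposed": A / C,
        "odds_unexposed": B / D,
        "risk_exposed": risk_exposed,
        "risk_unexposed": risk_unexposed,
        "exposure_in_diseased": A / (A + B),
        "exposure_in_not_diseased": C / (C + D),
        "risk_difference": risk_exposed - risk_unexposed,
        "risk_ratio": risk_exposed / risk_unexposed,
        "odds_ratio": (A * D) / (B * C),
    }


# Approximate Minnesota numbers from the slides.
population = two_by_two_measures(A=50_000, B=50_000, C=950_000, D=2_950_000)
for name, value in population.items():
    print(f"{name}: {value:.3f}")
```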
10.
- Epidemiologic studies sample from A, B, C, and D to estimate Odds or Risk; Risk Difference, Risk Ratio, or Relative Risk
- Epidemiologic design is determined by investigator control, temporality, sampling fraction
11. Epidemiologic design: level of investigator control
- Clinical Trial
- Exposure assigned (at random)
- Reflects temporary state
- Observational
- Exposure occurs naturally
- Often reflects long term state
12. Epidemiologic design: temporality
- Clinical Trial, Cohort, Nested case control, Case-cohort
- Exposure assessed at variable times before disease
- Cross-sectional
- Exposure assessed simultaneously with disease
- Case-control
- Past exposure assessed simultaneously with disease
13. Epidemiologic design: sampling fraction
- Cells A, B, C, and D are sampled at random with constant probability (called the sampling fraction)
- Sample size is a, b, c, d
- If a/A = b/B = c/C = d/D then the sampling fraction is equal for all cells
14. Sampling fractions
              Diseased    Not diseased
Exposed       a/A = fA    c/C = fC
Not exposed   b/B = fB    d/D = fD
The numbers A, B, C, D are fixed and known exactly. The numbers a, b, c, d are realized in a given study, determined during the study.
15. Expected observations given sampling fractions
              Diseased                 Not diseased
Exposed       a = 5,000 (fA = 0.1)     c = 950 (fC = 0.001)
Not exposed   b = 1,250 (fB = 0.025)   d = 5,900 (fD = 0.002)
- Naïve (and wrong) risks: 5000/5950 = 0.84 and 1250/7150 = 0.175; naïve relative risk = 4.8
- Correct risks: (5000/0.1) / (5000/0.1 + 950/0.001) = 0.05 and (1250/0.025) / (1250/0.025 + 5900/0.002) = 0.017, leading to relative risk = 0.05/0.017 = 3
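The correction on this slide amounts to dividing each observed cell by its sampling fraction before computing risks. A minimal sketch of that arithmetic, using the slide's expected counts (variable and function names are illustrative, not from the lecture):

```python
def risk(diseased, not_diseased):
    """Proportion diseased within one exposure group."""
    return diseased / (diseased + not_diseased)


# Expected sample counts and sampling fractions from the slide.
sample = {"a": 5000, "b": 1250, "c": 950, "d": 5900}
fractions = {"a": 0.1, "b": 0.025, "c": 0.001, "d": 0.002}

# Naive analysis: treat the sample as if every cell had the same sampling fraction.
naive_rr = risk(sample["a"], sample["c"]) / risk(sample["b"], sample["d"])

# Weighted analysis: scale each cell back up by 1 / (sampling fraction).
weighted = {cell: count / fractions[cell] for cell, count in sample.items()}
correct_rr = risk(weighted["a"], weighted["c"]) / risk(weighted["b"], weighted["d"])

print(f"naive relative risk:    {naive_rr:.2f}")
print(f"weighted relative risk: {correct_rr:.2f}")
```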
16. Observations given sampling fractions
              Diseased                      Not diseased
Exposed       a = 5,000 + ea (fA = 0.1)     c = 950 + ec (fC = 0.001)
Not exposed   b = 1,250 + eb (fB = 0.025)   d = 5,900 + ed (fD = 0.002)
All estimates differ from population values by random amounts (see example in Excel file).
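The random amounts ea, eb, ec, ed can be illustrated by simulation. The sketch below is not the Excel example referred to on the slide; it assumes each person is included independently with the sampling fraction of their cell, and the seed and function names are arbitrary choices made here.

```python
import random

POPULATION = {"A": 50_000, "B": 50_000, "C": 950_000, "D": 2_950_000}
FRACTIONS = {"A": 0.1, "B": 0.025, "C": 0.001, "D": 0.002}


def draw_sample(rng):
    """Include each person independently with the sampling fraction of their cell."""
    return {
        cell: sum(1 for _ in range(size) if rng.random() < FRACTIONS[cell])
        for cell, size in POPULATION.items()
    }


def weighted_relative_risk(sample):
    """Relative risk after scaling each cell by 1 / (sampling fraction)."""
    w = {cell: count / FRACTIONS[cell] for cell, count in sample.items()}
    risk_exposed = w["A"] / (w["A"] + w["C"])
    risk_unexposed = w["B"] / (w["B"] + w["D"])
    return risk_exposed / risk_unexposed


rng = random.Random(2005)  # arbitrary seed so the run is repeatable
observed = draw_sample(rng)
estimate = weighted_relative_risk(observed)
print(observed)
print(f"weighted relative risk: {estimate:.2f}")
```

Rerunning with other seeds gives counts that scatter around 5,000, 1,250, 950, and 5,900, and relative risk estimates that scatter around 3.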
17. Epidemiologic design: sampling fraction
- Cross-sectional: sample equally from everyone, fA = fB = fC = fD
- Clinical trial, Cohort study: sample equally from initial exposure groups
- fAC and fBD
- (a+c)/(A+C) usually differs from (b+d)/(B+D) in clinical trial, usually the same in cohort study
18. Cross-sectional Study
              Diseased   Not diseased
Exposed       a/A = f    c/C = f
Not exposed   b/B = f    d/D = f
- Sampling fraction is the same in all cells.
- Risk and odds estimates are unbiased, so risk differences and ratios are unbiased.
19. Expected Cross-sectional Study
              Diseased               Not diseased
Exposed       a = 50 (fA = 0.001)    c = 950 (fC = 0.001)
Not exposed   b = 50 (fB = 0.001)    d = 2,950 (fD = 0.001)
Naïve risks and relative risks are correct! 50/1000 = 0.05, etc.
20. Observed Cross-sectional Study
              Diseased                    Not diseased
Exposed       a = 50 + ea (fA = 0.001)    c = 950 + ec (fC = 0.001)
Not exposed   b = 50 + eb (fB = 0.001)    d = 2,950 + eabc (fD = 0.001)
All estimates differ from population values by random amounts.
21. Clinical Trial or Cohort Study
              Diseased     Not diseased
Exposed       a/A = fAC    c/C = fAC
Not exposed   b/B = fBD    d/D = fBD
- Sampling fraction is fixed within exposed and within not exposed. Usually fAC differs from fBD in clinical trial, fAC = fBD in cohort study (which has cross-sectional baseline).
- Risk and odds estimates are unbiased, so risk differences and ratios are unbiased.
22. Expected Clinical Trial or Cohort Study
              Diseased                 Not diseased
Exposed       a = 100 (fAC = 0.002)    c = 1900 (fAC = 0.002)
Not exposed   b = 50 (fBD = 0.001)     d = 2950 (fBD = 0.001)
- fAC usually = fBD in a cohort study
- fAC may differ from fBD in a clinical trial (if treatment allocation is not 1:1)
23. Expected Measures of Risk and Relative Risk: Clinical Trial or Cohort Study
                 Diseased   Not diseased   Odds    Risk (Probability)
Exposed          100        1,900          0.053   0.05
Not exposed      50         2,950          0.017   0.017
Exposure Ratio   0.67       0.39           <- differs from total population
- Correct Risk Difference = 0.033
- Correct Risk Ratio, Relative Risk = 3
- Odds Ratio = 3.11
24. Observed Clinical Trial or Cohort Study
              Diseased                      Not diseased
Exposed       a = 100 + ea (fAC = 0.002)    c = 1900 - ea (fAC = 0.002)
Not exposed   b = 50 + eb (fBD = 0.001)     d = 2950 - eb (fBD = 0.001)
All estimates differ from population values by random amounts.
25. Epidemiologic design: sampling fraction
- Case control: sample differentially within diseased and within nondiseased
- fA = fB = fAB and fC = fD = fCD
- Usually fAB much greater than fCD
26. Case-control Study
              Diseased     Not diseased
Exposed       a/A = fAB    c/C = fCD
Not exposed   b/B = fAB    d/D = fCD
- Sampling fraction is fixed within diseased and within not diseased.
- Exposure probabilities and odds estimates are unbiased, but risk, disease odds, risk differences and ratios are biased.
- Odds ratio ≈ relative risk when disease is rare.
27. Expected Case-control Study
              Diseased                Not diseased
Exposed       a = 500 (fAB = 0.01)    c = 494 (fCD = 0.00052)
Not exposed   b = 500 (fAB = 0.01)    d = 1534 (fCD = 0.00052)
fAB = 19.23 × fCD
28. Expected Measures of Risk and Relative Risk: Case-Control Study
                 Diseased   Not diseased   Odds   Risk (Probability)
Exposed          500        494            1.01   0.503
Not exposed      500        1,534          0.33   0.246
Exposure Ratio   0.5        0.24
(Odds and Risk columns differ from total population)
- Incorrect Risk Difference = 0.257
- Incorrect Risk Ratio, Relative Risk = 2.04
- Odds Ratio = 3.11 (correct, and approximately the true Relative Risk)
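Why the odds ratio survives case-control sampling while the risk ratio does not can be seen by applying different sampling fractions to the population table: fAB and fCD cancel in ad/bc but not in the risks. The sketch below works with expected counts only (no random error); the function names and the extra sampling-fraction pairs are illustrative choices, not from the lecture.

```python
POPULATION = {"A": 50_000, "B": 50_000, "C": 950_000, "D": 2_950_000}


def expected_case_control(f_cases, f_controls):
    """Expected cell counts when diseased are sampled at f_cases and nondiseased at f_controls."""
    return {
        "a": POPULATION["A"] * f_cases,
        "b": POPULATION["B"] * f_cases,
        "c": POPULATION["C"] * f_controls,
        "d": POPULATION["D"] * f_controls,
    }


def odds_ratio(t):
    return (t["a"] * t["d"]) / (t["b"] * t["c"])


def naive_risk_ratio(t):
    return (t["a"] / (t["a"] + t["c"])) / (t["b"] / (t["b"] + t["d"]))


# First pair is the slide's example; the other two are arbitrary comparisons.
results = []
for f_cases, f_controls in [(0.01, 0.00052), (0.01, 0.001), (0.5, 0.5)]:
    table = expected_case_control(f_cases, f_controls)
    results.append((f_cases, f_controls, odds_ratio(table), naive_risk_ratio(table)))

for f_cases, f_controls, or_, rr in results:
    print(f"fAB={f_cases}, fCD={f_controls}: odds ratio={or_:.2f}, naive risk ratio={rr:.2f}")
```

Only when the two fractions are equal (the cross-sectional case) does the naive risk ratio return to the population value of 3.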
29. Observed Case-control Study
              Diseased                     Not diseased
Exposed       a = 500 + ea (fAB = 0.01)    c = 494 + ec (fCD = 0.00052)
Not exposed   b = 500 - ea (fAB = 0.01)    d = 1534 - ec (fCD = 0.00052)
30. Epidemiologic design: sampling fraction
- Nested case control: sample differentially within diseased and within nondiseased starting with a cross-sectional base, so exposure measured prior to disease diagnosis
- fA = fB = fAB and fC = fD = fCD
- Often fAB = 1
- Usually fAB somewhat greater than fCD
31. Nested Case-Control Study, 1: Observed Cross-sectional Study
              Diseased                    Not diseased
Exposed       a = 500 + ea (fA = 0.01)    c = 9500 + ec (fC = 0.01)
Not exposed   b = 500 + eb (fB = 0.01)    d = 29500 + eabc (fD = 0.01)
Previous cross-sectional example with sampling fractions increased by a factor of 10.
32. Nested Case-Control Study, 2: Sampling from the cross-section
              Diseased     Not diseased
Exposed       a/A = fAB    c/C = fCD
Not exposed   b/B = fAB    d/D = fCD
- Sampling fraction is fixed within diseased and within not diseased; temporality preserved.
- Exposure probabilities and odds estimates are unbiased, but risk, disease odds, risk differences and ratios are biased.
- Odds ratio ≈ relative risk when disease is rare.
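33. Observed Nested Case-Control Study
              Diseased                  Not diseased
Exposed       a = 500 + ea (fAB = 1)    c = 950 + ec + ec1 (fCD = 0.1)
Not exposed   b = 500 + eb (fAB = 1)    d = 2950 + eabc + ec1 (fCD = 0.1)
Sampling fractions here are relative to the cross-sectional sample (950 of 9500 and 2950 of 29500 nondiseased). ea, eb, ec are ignored; if fAB < 1 then there is an ea1.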
34. Expected Measures of Risk and Relative Risk: Nested Case-Control Study
                 Diseased   Not diseased   Odds    Risk (Probability)
Exposed          500        950            0.526   0.344
Not exposed      500        2,950          0.169   0.145
Exposure Ratio   0.5        0.24
(Odds and Risk columns differ from total population)
- Incorrect Risk Difference = 0.199
- Incorrect Risk Ratio, Relative Risk = 2.38
- Odds Ratio = 3.11 (correct, and approximately the true Relative Risk)
35. Epidemiologic design: sampling fraction
- Case cohort: sample differentially within diseased and within everyone (diseased + nondiseased) starting with a cross-sectional base, so exposure measured prior to disease diagnosis
- fA = fB = fAB; the whole cohort is sampled at fABCD
- Usually fAB = 1, while fABCD is a sizeable fraction like 0.1 or 0.25.
36. Case-Cohort Study, 1: Observed Cross-sectional Study
              Diseased                    Not diseased
Exposed       a = 500 + ea (fA = 0.01)    c = 9500 + ec (fC = 0.01)
Not exposed   b = 500 + eb (fB = 0.01)    d = 29500 + eabc (fD = 0.01)
Previous cross-sectional example with sampling fractions increased by a factor of 10.
37. Case-Cohort Study, 2: Sampling from the cross-section
              Diseased        Cohort (part of all participants)
Exposed       A/A, fAB = 1    (a+c)/(A+C) = fABCD
Not exposed   B/B, fAB = 1    (b+d)/(B+D) = fABCD
- Sampling fraction is fixed within diseased and within the cohort; temporality preserved; cohort includes cases and noncases.
- Risk and odds estimates are unbiased within exposed and within unexposed but differently weighted, so risk differences are biased
- Risk ratios are unbiased.
38. Observed Case-Cohort Study
              Diseased                  Cohort
Exposed       a = 500 + ea (fAB = 1)    c = 1000 + ec + ec1 (fABCD = 0.1)
Not exposed   b = 500 + eb (fAB = 1)    d = 3000 + eabc + ec1 (fABCD = 0.1)
39. Observed Case-Cohort Study
              Case (fAB = 1)   Cohort (fABCD = 0.1)   Cohort (fABCD = 0.1)
              Diseased         Diseased               Not diseased
Exposed       a = 500 + ea     50 + ec + ec1          950 + ec + ec3
Not exposed   b = 500 + eb     50 + ec + ec2          2950 + eabc + ec123
- When fAB = 1, cohort diseased is a subset of case diseased.
- When fAB < 1, cohort diseased usually overlaps case diseased.
40. Nested Case-Control vs Case Cohort
- Same cases in both
- For a certain sampling strategy, same noncases in both
- Analytic strategy different
41. Expected Measures of Risk and Relative Risk: Case-Cohort Study
                 Diseased   Cohort   Odds   Risk (Probability)
Exposed          500        1,000    n/a    0.5
Not exposed      500        3,000    n/a    0.167
Exposure Ratio   0.5        0.25     <- differs from total population
- Incorrect Risk Difference = 0.333 (= true risk diff / fABCD)
- Odds ratio = Correct Risk Ratio, Relative Risk = 3
- Relative risk would be correct even if the disease were not rare
42. Analysis of Case Cohort Study
- Set up table as if cohort were the control group
- Include the overlapping cases in both cases and cohort
- Compute ad/bc
- You have estimated relative risk
- Note
- If you know the cohort sampling fraction, you can multiply the cohort up and estimate true risks
- Given additional error in second stage cohort sampling, this is less efficient than estimating relative risk without upweighting
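The steps above can be sketched with the expected case-cohort counts from the earlier slides (500 cases per exposure group, cohort samples of 1,000 and 3,000 drawn at fABCD = 0.1 from the cross-sectional base). Variable names are illustrative, and random error is ignored.

```python
# All cases (fAB = 1) and the cohort sample, which includes its own cases.
cases = {"exposed": 500, "not_exposed": 500}
cohort = {"exposed": 1000, "not_exposed": 3000}
F_ABCD = 0.1  # cohort sampling fraction, relative to the cross-sectional base

# Treat the cohort as the "control" column and compute ad/bc.
relative_risk = (cases["exposed"] * cohort["not_exposed"]) / (
    cases["not_exposed"] * cohort["exposed"]
)

# If the cohort sampling fraction is known, multiply the cohort up to estimate risks.
full_cohort = {group: n / F_ABCD for group, n in cohort.items()}
risks = {group: cases[group] / full_cohort[group] for group in cases}

print(f"relative risk (ad/bc): {relative_risk:.2f}")
print({group: round(r, 3) for group, r in risks.items()})
```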
43. Analysis of Case Control and Case Cohort Study
- Case Control
- Logistic regression
- exp(b) is an odds ratio
- Temporal bias?
- Nested Case Control
- Logistic regression
- exp(b) is an odds ratio
- No temporal bias
- Case Cohort
- Logistic regression or Linear regression
- exp(b) is a relative risk, not an odds ratio
- No temporal bias
- Variance somewhat high unless robust variance estimate is used (e.g. PROC GENMOD with GEE option)
44. Disadvantage of Case Control vs. Case Cohort Study
- Case Control and Nested Case Control
- Inflexible: outcome is fixed
- Even in nested case control study the sampling structure is usually unknown
- Case Cohort
- Ideal for the intended outcome or for multiple outcomes
- If cohort is large enough, multiple outcomes can be analyzed
- Cases can be included in analysis of alternate dependent variable because sampling structure is known
45. Person years of risk, 1
- The foregoing assumes that all cases occur at the same time (or can safely be treated as such)
- In many, even most studies, this assumption is reasonable
- Person years ≈ length of followup × number of participants when events are rare and/or all participants start followup at nearly the same time
46. Person years of risk, 2
- Even if there are 50% events and they occur on average somewhat after the midpoint of followup, person years > ¾ × length of followup × number of participants
- Incidence density rates are somewhat higher than correspondingly scaled cumulative incidence rates, but relative risks are probably not much affected by computation of incidence density vs cumulative incidence
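The ¾ bound follows from simple arithmetic: participants without an event contribute the full followup, and those with an event contribute, on average, more than half of it. A worked sketch with hypothetical numbers (1,000 participants, 10 years of followup, events at 60% of followup on average; no other censoring is assumed):

```python
def person_years(n, followup_years, event_fraction, mean_event_time_fraction):
    """Person years when event-free participants are followed to the end of followup."""
    n_events = n * event_fraction
    n_event_free = n - n_events
    return (
        n_event_free * followup_years
        + n_events * mean_event_time_fraction * followup_years
    )


n, years = 1_000, 10  # hypothetical cohort size and followup length
py = person_years(n, years, event_fraction=0.5, mean_event_time_fraction=0.6)

incidence_density = (n * 0.5) / py            # events per person year
cumulative_incidence_per_year = 0.5 / years   # cumulative incidence scaled by followup

print(f"person years: {py:.0f} of a maximum {n * years}")
print(f"incidence density: {incidence_density:.4f} per person year")
print(f"scaled cumulative incidence: {cumulative_incidence_per_year:.4f} per year")
```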
47. Person years of risk, 3
- Proportional hazards models do not allow time dependency in prediction, so most analyses are not considering followup time in this way.
- The timing of events vs. censoring and competing risk may cause differences in findings for incidence density vs. cumulative incidence, but this would be rare
- Subgroups with very different followup times could create problems, but this is rare
48.
- In prospective studies, events are accumulated over time, so incidence density methods can be applied
- Nested case control
- Case cohort
49. Nested Case-Control Study (1)
Consider the following hypothetical cohort:
[Figure: follow-up of cohort members along a Time axis marked t1, t2, t3. X = lung cancer case; O = loss to follow-up]
50. Nested Case-Control Study (2)
- At time t1 the first case occurs, for which 8 eligible controls are identified
- Similarly, there are 5 eligible controls for the case at time t2, and 4 eligible controls for the case occurring at time t3
- A control can become a case at a later time (e.g., cases at t2 and t3 serve as control for case at t1)
- Controls can be selected randomly from all eligible controls (i.e., 1 or more controls for each case)
- Number of eligible controls decreases with increasing number of matching factors
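Time-based control selection of this kind can be sketched as follows. The cohort below is hypothetical: nine members who all enter at time 0, with exit times chosen here so that the eligible-control counts come out as 8, 5, and 4. The original figure is not reproduced, and matching factors are not modeled.

```python
import random

# Hypothetical cohort: (id, exit time, reason for exit). Everyone enters at time 0.
COHORT = [
    ("p1", 2.0, "case"),   # case at t1
    ("p2", 3.0, "lost"),
    ("p3", 4.0, "lost"),
    ("p4", 5.0, "case"),   # case at t2
    ("p5", 7.0, "case"),   # case at t3
    ("p6", 10.0, "end"),
    ("p7", 10.0, "end"),
    ("p8", 10.0, "end"),
    ("p9", 10.0, "end"),
]


def eligible_controls(cohort):
    """For each case, list members still under follow-up at the case's event time."""
    risk_sets = []
    for case_id, event_time, reason in sorted(cohort, key=lambda m: m[1]):
        if reason != "case":
            continue
        # Strict inequality excludes the case itself; later cases remain eligible.
        at_risk = [pid for pid, exit_time, _ in cohort if exit_time > event_time]
        risk_sets.append((case_id, event_time, at_risk))
    return risk_sets


def sample_controls(cohort, controls_per_case, rng):
    """Randomly pick controls for each case from its eligible set."""
    matched = {}
    for case_id, _, at_risk in eligible_controls(cohort):
        k = min(controls_per_case, len(at_risk))
        matched[case_id] = rng.sample(at_risk, k)
    return matched


risk_sets = eligible_controls(COHORT)
matched = sample_controls(COHORT, controls_per_case=2, rng=random.Random(0))
for case_id, event_time, at_risk in risk_sets:
    print(case_id, event_time, len(at_risk), matched[case_id])
```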
51. Case-Cohort Study (1)
Consider the following hypothetical cohort:
[Figure: follow-up of cohort members along a Time axis starting at t0. X = lung cancer case; O = loss to follow-up]
52. Case-Cohort Study (2)
- In closed cohort (in this case, when everybody enters cohort at t0), a sample of all subjects (sub-cohort) is randomly selected from cohort members at start of follow-up t0
- In open cohort (i.e., when time of entry into cohort is variable), a sample of all subjects (sub-cohort) is randomly selected from members of cohort as it is followed over time (i.e., regardless of when subjects entered the cohort)
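For the closed-cohort case, sub-cohort selection is a single random draw at t0 made without regard to later case status; the cases that arise outside the sub-cohort are then added for analysis. A minimal sketch with a hypothetical 40-member cohort (the member ids, case ids, sampling fraction, and seed are all invented for illustration):

```python
import random

# Hypothetical closed cohort: 40 members enrolled at t0, 4 of whom later become cases.
members = [f"m{i:02d}" for i in range(1, 41)]
cases = {"m03", "m11", "m27", "m35"}
SUBCOHORT_FRACTION = 0.25

rng = random.Random(52)  # arbitrary seed so the draw is repeatable
# The sub-cohort is drawn from all members at t0, ignoring who will become a case.
subcohort = set(rng.sample(members, int(len(members) * SUBCOHORT_FRACTION)))

overlap = subcohort & cases        # cases that happen to fall in the sub-cohort
analysis_set = subcohort | cases   # sub-cohort plus all cases

print("sub-cohort:", sorted(subcohort))
print("cases also in sub-cohort:", sorted(overlap))
print("total analyzed:", len(analysis_set))
```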
53. Merits of time-based selection
- From a black and white theoretical perspective, time-based selection makes sense: a person is a noncase until a certain time, then becomes a case.
- In the life table approach, consistent with this thought, the risk set at any time point is cases and noncases at that time point.
- However, much chronic disease develops slowly and the black-white, case-noncase formulation does not apply very well.
- I am unenthusiastic generally about time-based selection of controls or noncases because I would like to maintain maximum separation between cases and noncases.
54. Analysis of events that evolve in time
- Nevertheless, taking person years into account (incidence density analysis) is more precise than is analysis of cumulative incidence.
- Analysis is therefore by Cox proportional hazards life table regression methods, or some similar technique.
- The nested case-control method is not easily adapted to this type of analysis
- The case-cohort method is easily analyzed this way (PROC PHREG or GENMOD with Poisson regression), but the variances of the slopes tend to be too large.
- Robust variance estimation is possible in GENMOD with GEE option; Barlow provides a SAS macro for use of PHREG.