Title: A short introduction to epidemiology Chapter 9: Data analysis
1 A short introduction to epidemiology, Chapter 9: Data analysis
- Neil Pearce
- Centre for Public Health Research
- Massey University
- Wellington, New Zealand
2 Chapter 9: Data analysis
- Basic principles
- Basic analyses
- Control of confounding
3 Basic principles
- Effect estimation
- Confidence intervals
- P-values
4 Testing and estimation
- The effect estimate provides an estimate of the effect (e.g. relative risk, risk difference) of exposure on the occurrence of disease
- The confidence interval provides a range of values in which it is plausible that the true effect may lie
- The p-value is the probability that differences as large as or larger than those observed could have arisen by chance if the null hypothesis (of no association between exposure and disease) is correct
- The principal aim of an individual study should be to estimate the size of the effect (using the effect estimate and confidence interval) rather than just to decide whether or not an effect is present (using the p-value)
5 Problems of significance testing
- The p-value depends on two factors: the size of the effect and the size of the study
- A very small difference may be statistically significant if the study is very large, whereas a very large difference may not be significant if the study is very small.
- The purpose of significance testing is to reach a decision based on a single study. However, decisions should be based on information from all available studies, as well as non-statistical considerations such as the plausibility and coherence of the effect in the light of current theoretical and empirical knowledge (see chapter 10).
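The dependence of the p-value on study size can be illustrated with a small sketch. The two tables below are hypothetical (they are not taken from the chapter): both have the same odds ratio, but the second study is ten times larger. The function name `chi2_p_value` and the choice of the Pearson chi-square with one degree of freedom are assumptions made for illustration only.

```python
import math

def chi2_p_value(a, b, c, d):
    """Pearson chi-square (1 df) and its p-value for a 2x2 table.

    a, b = exposed and non-exposed cases; c, d = exposed and non-exposed controls.
    """
    n1, n0 = a + c, b + d          # exposed and non-exposed totals
    m1, m0 = a + b, c + d          # case and control totals
    t = m1 + m0
    chi2 = t * (a * d - b * c) ** 2 / (n1 * n0 * m1 * m0)
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

small = (12, 8, 8, 12)                     # hypothetical study, T = 40
large = tuple(10 * cell for cell in small)  # same proportions, T = 400

for label, table in (("small", small), ("large", large)):
    a, b, c, d = table
    chi2, p = chi2_p_value(*table)
    print(f"{label}: OR = {a * d / (b * c):.2f}, chi-square = {chi2:.1f}, p = {p:.5f}")
```

Both hypothetical studies give the same effect estimate (an odds ratio of 2.25), but only the larger one gives a p-value below 0.05.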
7 Basic analyses
- Measures of occurrence
  - Incidence proportion (risk)
  - Incidence rate
  - Incidence odds
- Measures of effect
  - Risk ratio
  - Rate ratio
  - Odds ratio
8 Example

            Exposed (E)   Non-exposed (Ē)   Total
Cases       a             b                 M1
Controls    c             d                 M0
Total       N1            N0                T
9 Example: Smoking and Ovarian Cancer

            Smokers (E)   Non-smokers (Ē)   Total
Cases       24            36                60
Controls    58            40                98
Total       82            76                158
12 The χ² test for this table is based on the assumptions that the marginal totals of the table (N1, N0, M1, M0) are fixed and that the proportion of exposed cases is the same as the proportion of exposed controls (i.e. that the overall proportion exposed, N1/T, applies to both cases and controls)
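As a rough sketch of how these assumptions translate into a calculation, the code below applies them to the smoking and ovarian cancer table. The specific form used here (expected value N1·M1/T and a fixed-margins variance with T − 1 in the denominator) is an assumption, since the slide carrying the formula itself did not survive; variable names are illustrative.

```python
import math

# 2x2 table from the smoking and ovarian cancer example
a, b = 24, 36   # cases: smokers, non-smokers
c, d = 58, 40   # controls: smokers, non-smokers

m1, m0 = a + b, c + d   # total cases, total controls
n1, n0 = a + c, b + d   # total smokers, total non-smokers
t = m1 + m0

# Expected number of exposed cases if the overall proportion exposed
# (N1/T) applied to cases and controls alike
expected_a = n1 * m1 / t

# Assumption: variance of a with all marginal totals held fixed
var_a = m1 * m0 * n1 * n0 / (t ** 2 * (t - 1))

chi2 = (a - expected_a) ** 2 / var_a
p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail, 1 degree of freedom

print(f"E(a) = {expected_a:.2f}, chi-square = {chi2:.2f}, p = {p_value:.3f}")
```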
13 The natural logarithm of the odds ratio has (under a binomial model) an approximate standard error of

$$SE[\ln(OR)] = (1/a + 1/b + 1/c + 1/d)^{0.5}$$

An approximate 95% confidence interval for the odds ratio is then given by

$$OR \, e^{\pm 1.96 \, SE}$$
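Applied to the smoking and ovarian cancer table, the standard error and confidence interval can be sketched as follows. The variable names are illustrative; the cell counts are those of the example table.

```python
import math

# Smoking and ovarian cancer example:
# a, b = exposed and non-exposed cases; c, d = exposed and non-exposed controls
a, b, c, d = 24, 36, 58, 40

odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# 95% confidence limits: OR * exp(-1.96 SE) and OR * exp(+1.96 SE)
lower = odds_ratio * math.exp(-1.96 * se_log_or)
upper = odds_ratio * math.exp(1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

This gives a crude odds ratio of about 0.46 with an interval of roughly 0.24 to 0.89.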
15 Control of confounding
- There are two methods of calculating a summary effect estimate to control confounding:
  - Pooling
  - Standardisation
16 Example of pooling
The unadjusted (crude) findings indicate that there is a strong association between smoking and ovarian cancer. Suppose, however, that we are concerned about the possibility that the effect of smoking is confounded by use of oral contraception (this would occur if oral contraception affected the risk of ovarian cancer and if oral contraception was associated with smoking). We then need to stratify the data into those who have used oral contraceptives and those who have not.
17
            OC use: Yes              OC use: No
            Smoking                  Smoking
            Yes    No     Total      Yes    No     Total
Cases       15     4      19         9      32     41
Controls    50     12     62         8      28     36
Total       65     16     81         17     60     77
18 In those who have used oral contraceptives, the odds ratio for smoking is

$$OR = \frac{15 \times 12}{4 \times 50} = 0.90$$

In those who have not used oral contraceptives, the odds ratio for smoking is

$$OR = \frac{9 \times 28}{32 \times 8} = 0.98$$
19 Thus, the crude OR for smoking (0.46) was biased downwards (away from 1.0) by confounding by OC use. When we remove this problem (by stratifying on oral contraceptive use) the odds ratios increase and are close to 1.0
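The comparison between the crude and stratum-specific odds ratios can be checked with a short sketch. The counts are those of the stratified table; the dictionary keys and the function name are illustrative.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: a, b = exposed and non-exposed cases,
    c, d = exposed and non-exposed controls."""
    return (a * d) / (b * c)

# Smoking and ovarian cancer, stratified by oral contraceptive (OC) use
strata = {
    "OC use: yes": (15, 4, 50, 12),
    "OC use: no": (9, 32, 8, 28),
}

# Collapsing over the strata gives back the crude table (24, 36, 58, 40)
crude = tuple(sum(cells) for cells in zip(*strata.values()))

crude_or = odds_ratio(*crude)
stratum_ors = {name: odds_ratio(*cells) for name, cells in strata.items()}

print(f"crude OR = {crude_or:.2f}")
for name, value in stratum_ors.items():
    print(f"{name}: OR = {value:.2f}")
```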
20 In this example, the odds ratios are not exactly
the same in each stratum. If they are very
different (e.g. 1.0 in one stratum and 4.0 in the
other stratum) then we would usually report the
findings separately for each stratum. However,
if the odds ratio estimates are reasonably
similar then we usually wish to summarize our
findings into a single summary odds ratio by
taking a weighted average of the OR estimates in
each stratum.
21
$$OR = \frac{\sum_i W_i \, OR_i}{\sum_i W_i}$$

where $OR_i$ = OR in stratum i and $W_i$ = weight given to stratum i
22 One obvious choice of weights would be to weight
each stratum by the inverse of its variance
(precision-based estimates). However, this
method of obtaining a summary odds ratio yields
estimates which are unstable and highly affected
by small numbers in particular strata.
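A minimal sketch of a precision-based summary odds ratio for the smoking and oral contraceptive data follows. It makes one assumption that goes beyond the text: the weighting is applied to the log odds ratios, using the inverse of the variance 1/a + 1/b + 1/c + 1/d, which is a common way of applying inverse-variance weights. The function name is illustrative.

```python
import math

# Each stratum is (a, b, c, d): exposed cases, non-exposed cases,
# exposed controls, non-exposed controls
strata = [(15, 4, 50, 12), (9, 32, 8, 28)]

def precision_weighted_or(strata):
    """Inverse-variance weighted average of the stratum log odds ratios."""
    weighted_sum = 0.0
    total_weight = 0.0
    for a, b, c, d in strata:
        log_or = math.log((a * d) / (b * c))
        variance = 1 / a + 1 / b + 1 / c + 1 / d   # variance of ln(OR)
        weight = 1 / variance
        weighted_sum += weight * log_or
        total_weight += weight
    return math.exp(weighted_sum / total_weight)

pooled_or = precision_weighted_or(strata)
print(f"precision-weighted OR = {pooled_or:.2f}")
```

The sensitivity to small numbers is visible in the code: a single zero cell makes the variance term undefined, and very small cells dominate it.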
23 A better set of weights was developed by Mantel and Haenszel. These involve using the weights $W_i = b_i c_i / T_i$, which gives

$$OR_{MH} = \frac{\sum_i a_i d_i / T_i}{\sum_i b_i c_i / T_i}$$
24
            Stratum 1                Stratum 2
            E      Ē      Total      E      Ē      Total
Cases       15     4      19         9      32     41
Controls    50     12     62         8      28     36
Total       65     16     81         17     60     77
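With the Mantel-Haenszel weights b·c/T, the weighted average of the stratum odds ratios reduces to Σ(a·d/T) divided by Σ(b·c/T). A minimal sketch for the two strata above (the function name is illustrative):

```python
# Each stratum is (a, b, c, d): exposed cases, non-exposed cases,
# exposed controls, non-exposed controls
strata = [
    (15, 4, 50, 12),   # stratum 1: oral contraceptive users
    (9, 32, 8, 28),    # stratum 2: non-users
]

def mantel_haenszel_or(strata):
    """Summary odds ratio: sum(a*d/T) / sum(b*c/T) over the strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

or_mh = mantel_haenszel_or(strata)
print(f"Mantel-Haenszel OR = {or_mh:.2f}")
```

The summary estimate is about 0.95, which lies between the two stratum-specific odds ratios.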
25 This set of weights yields summary odds ratio
estimates which are very close to being
statistically optimal (they are very close to the
maximum likelihood estimates) and are very robust
in that they are not unduly affected by small
numbers in particular strata (provided that the
strata do not have any zero marginal totals).
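26 We can calculate a corresponding chi-square:

$$\chi^2_{MH} = \frac{\left[\sum_i a_i - \sum_i N_{1i} M_{1i} / T_i\right]^2}{\sum_i M_{1i} M_{0i} N_{1i} N_{0i} / \left[T_i^2 (T_i - 1)\right]}$$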
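A sketch of the summary chi-square for the two oral contraceptive strata is given below. It assumes the usual Mantel-Haenszel form, in which the observed exposed cases are compared with their expected value N1·M1/T in each stratum and the variance is computed with all marginal totals fixed; the function name is illustrative.

```python
import math

# Each stratum is (a, b, c, d): exposed cases, non-exposed cases,
# exposed controls, non-exposed controls
strata = [(15, 4, 50, 12), (9, 32, 8, 28)]

def mantel_haenszel_chi2(strata):
    """Mantel-Haenszel chi-square (1 degree of freedom) across strata."""
    observed = expected = variance = 0.0
    for a, b, c, d in strata:
        m1, m0 = a + b, c + d      # cases, controls
        n1, n0 = a + c, b + d      # exposed, non-exposed
        t = m1 + m0
        observed += a
        expected += n1 * m1 / t
        variance += m1 * m0 * n1 * n0 / (t ** 2 * (t - 1))
    return (observed - expected) ** 2 / variance

chi2_mh = mantel_haenszel_chi2(strata)
p_value = math.erfc(math.sqrt(chi2_mh / 2))   # upper tail, 1 degree of freedom
print(f"chi-square (MH) = {chi2_mh:.3f}, p = {p_value:.2f}")
```

After stratification the chi-square is close to zero, consistent with stratum odds ratios close to 1.0.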
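28 The natural logarithm of the summary (Mantel-Haenszel) odds ratio has (under a binomial model) an approximate standard error of

$$SE = \left[\frac{\sum_i P_i R_i}{2 R_+^2} + \frac{\sum_i (P_i S_i + Q_i R_i)}{2 R_+ S_+} + \frac{\sum_i Q_i S_i}{2 S_+^2}\right]^{0.5}$$

where $P_i = (a_i + d_i)/T_i$, $Q_i = (b_i + c_i)/T_i$, $R_i = a_i d_i / T_i$, $S_i = b_i c_i / T_i$, $R_+ = \sum_i R_i$ and $S_+ = \sum_i S_i$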
29 An approximate 95% confidence interval for the odds ratio is then given by

$$OR_{MH} \, e^{\pm 1.96 \, SE}$$
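The standard error and confidence interval for the summary odds ratio can be sketched for the smoking and oral contraceptive strata as follows. The sketch takes the square root of the sum of the three variance terms defined by P, Q, R and S; the function name and return layout are illustrative.

```python
import math

# Each stratum is (a, b, c, d): exposed cases, non-exposed cases,
# exposed controls, non-exposed controls
strata = [(15, 4, 50, 12), (9, 32, 8, 28)]

def mh_or_with_ci(strata, z=1.96):
    """Mantel-Haenszel OR, SE of its natural log, and confidence limits."""
    sum_r = sum_s = sum_pr = sum_ps_qr = sum_qs = 0.0
    for a, b, c, d in strata:
        t = a + b + c + d
        p, q = (a + d) / t, (b + c) / t
        r, s = a * d / t, b * c / t
        sum_r += r
        sum_s += s
        sum_pr += p * r
        sum_ps_qr += p * s + q * r
        sum_qs += q * s
    or_mh = sum_r / sum_s
    se = math.sqrt(
        sum_pr / (2 * sum_r ** 2)
        + sum_ps_qr / (2 * sum_r * sum_s)
        + sum_qs / (2 * sum_s ** 2)
    )
    return or_mh, se, or_mh * math.exp(-z * se), or_mh * math.exp(z * se)

or_mh, se, lower, upper = mh_or_with_ci(strata)
print(f"OR(MH) = {or_mh:.2f}, SE = {se:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

For these data the interval runs from roughly 0.42 to 2.16, so it comfortably includes 1.0.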
30 Rate ratios

                Exposed (E)   Non-exposed (Ē)   Total
Cases           a             b                 M1
Person-years    Y1            Y0                T
31
E        Cases   PY       Rate
         350     10,000   0.00350
         125     10,000   0.00125
35 The summary Mantel-Haenszel rate ratio involves taking the weights $b_i Y_{1i} / T_i$ to yield

$$RR_{MH} = \frac{\sum_i a_i Y_{0i} / T_i}{\sum_i b_i Y_{1i} / T_i}$$
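37 The equivalent Mantel-Haenszel chi-square is

$$\chi^2_{MH} = \frac{\left[\sum_i a_i - \sum_i M_{1i} Y_{1i} / T_i\right]^2}{\sum_i M_{1i} Y_{1i} Y_{0i} / T_i^2}$$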
38 This is very similar to the $\chi^2_{MH}$ for case-control studies, but it has some minor modifications to take account of the fact that we are using person-time data rather than binomial data.
40 An approximate standard error for the natural log of the rate ratio is

$$SE = \frac{\left[\sum_i M_{1i} Y_{1i} Y_{0i} / T_i^2\right]^{0.5}}{\left[\left(\sum_i a_i Y_{0i} / T_i\right)\left(\sum_i b_i Y_{1i} / T_i\right)\right]^{0.5}}$$
41 An approximate 95% confidence interval for the rate ratio is then given by

$$RR \, e^{\pm 1.96 \, SE}$$
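The person-time calculations can be sketched in one function. The strata below are hypothetical (the chapter gives no stratified person-time data), and the summary rate ratio and chi-square are written in the usual Mantel-Haenszel form for person-time data, which is an assumption here; the function name is illustrative.

```python
import math

# Hypothetical strata, not from the chapter: (a, b, y1, y0) =
# exposed cases, non-exposed cases, exposed and non-exposed person-years
strata = [(10, 5, 1000, 1000), (20, 10, 2000, 2000)]

def mh_rate_ratio(strata, z=1.96):
    """Mantel-Haenszel rate ratio, chi-square, SE of ln(RR) and confidence limits."""
    num = den = var_num = sum_a = sum_expected = 0.0
    for a, b, y1, y0 in strata:
        t = y1 + y0                      # total person-years in the stratum
        m1 = a + b                       # total cases in the stratum
        num += a * y0 / t
        den += b * y1 / t
        var_num += m1 * y1 * y0 / t ** 2
        sum_a += a
        sum_expected += m1 * y1 / t
    rr = num / den
    chi2 = (sum_a - sum_expected) ** 2 / var_num
    se = math.sqrt(var_num) / math.sqrt(num * den)
    return rr, chi2, se, rr * math.exp(-z * se), rr * math.exp(z * se)

rr, chi2, se, lower, upper = mh_rate_ratio(strata)
print(f"RR(MH) = {rr:.2f}, chi-square = {chi2:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```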
42 Risk ratios

             Exposed (E)   Non-exposed (Ē)   Total
Cases        a             b                 M1
Non-cases    c             d                 M0
Total        N1            N0                T
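45 An approximate standard error for the natural log of the risk ratio is

$$SE = \frac{\left[\sum_i \left(M_{1i} N_{1i} N_{0i} / T_i^2 - a_i b_i / T_i\right)\right]^{0.5}}{\left[\left(\sum_i a_i N_{0i} / T_i\right)\left(\sum_i b_i N_{1i} / T_i\right)\right]^{0.5}}$$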
46 An approximate 95% confidence interval for the risk ratio is then given by

$$RR \, e^{\pm 1.96 \, SE}$$
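A sketch of the corresponding risk ratio calculation follows. The strata are hypothetical (not from the chapter), and the summary risk ratio is assumed to take the Mantel-Haenszel form Σ(a·N0/T) / Σ(b·N1/T), which matches the two sums in the denominator of the standard error; the function name is illustrative.

```python
import math

# Hypothetical strata, not from the chapter: (a, b, n1, n0) =
# exposed cases, non-exposed cases, exposed and non-exposed persons
strata = [(10, 5, 100, 100), (20, 10, 200, 200)]

def mh_risk_ratio(strata, z=1.96):
    """Mantel-Haenszel risk ratio, SE of ln(RR) and confidence limits."""
    num = den = var_num = 0.0
    for a, b, n1, n0 in strata:
        t = n1 + n0                      # total persons in the stratum
        m1 = a + b                       # total cases in the stratum
        num += a * n0 / t
        den += b * n1 / t
        var_num += m1 * n1 * n0 / t ** 2 - a * b / t
    rr = num / den
    se = math.sqrt(var_num) / math.sqrt(num * den)
    return rr, se, rr * math.exp(-z * se), rr * math.exp(z * se)

rr, se, lower, upper = mh_risk_ratio(strata)
print(f"RR(MH) = {rr:.2f}, SE = {se:.3f}, 95% CI {lower:.2f} to {upper:.2f}")
```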
47 Standardization, in contrast to pooling, involves taking a weighted average of the rates in each stratum (e.g. age-group) before taking the ratio of the two standardized rates. Standardization has many advantages in descriptive epidemiology involving comparisons between countries, regions, ethnic groups or gender groups. However, pooling (when done appropriately) has some superior statistical properties when comparing exposed and non-exposed in a specific study.
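The order of operations in standardization (average the rates first, then take the ratio) can be sketched as below. All numbers are hypothetical: the stratum weights stand in for a standard population, and the rates are invented for illustration.

```python
# Hypothetical example, not from the chapter: two age-groups
standard_weights = [0.6, 0.4]          # share of the standard population in each stratum
exposed_rates = [0.002, 0.010]         # stratum-specific rates in the exposed
non_exposed_rates = [0.001, 0.005]     # stratum-specific rates in the non-exposed

def standardized_rate(rates, weights):
    """Weighted average of stratum-specific rates using standard weights."""
    return sum(w * r for w, r in zip(weights, rates)) / sum(weights)

std_exposed = standardized_rate(exposed_rates, standard_weights)
std_non_exposed = standardized_rate(non_exposed_rates, standard_weights)

# The ratio is taken only after each group's rates have been averaged
standardized_rate_ratio = std_exposed / std_non_exposed
print(f"standardized rate ratio = {standardized_rate_ratio:.2f}")
```

Because both groups are averaged with the same weights, differences in stratum composition between the groups do not affect the comparison.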
48 Summary of Stratified Analysis
- If we are concerned about confounding by a factor such as age, gender or smoking, then we need to stratify on this factor (or on all factors simultaneously if there is more than one potential confounder) and calculate the exposure effect separately in each stratum.
- If the effect is very different in different strata then we would report the findings separately for each stratum.
49 If the effect is similar in each stratum then we can obtain a summary estimate by taking a weighted average of the effect in each stratum. If the adjusted effect is different from the crude effect, this means that the crude effect was biased due to confounding.
50 Usually we need to adjust the findings for (i.e. stratify on) age, gender, and some other factors. If we have five age-groups and two gender-groups then we need to divide the data into ten age-gender-groups. If we have too many strata then we begin to get strata with zero marginal totals (e.g. with no cases or no controls). The analysis then begins to break down and we have to consider using mathematical modelling.