Title: Geographical Disease Surveillance
1Geographical Disease Surveillance
- Martin Kulldorff
- Harvard Medical School and
- Harvard Pilgrim Health Care
2Two Applications of Spatial Statistics in
Epidemiology
- Studies of Specific Hypotheses: Evaluate the relationship between cancer and geographical variables of interest such as radon, pesticide use or income levels, adjusting for geographical variation.
- Surveillance: Evaluate the geographical variation of cancer, adjusting for known or suspected variables such as age, gender or income.
3Reasons for Geographical Disease Surveillance
- Disease Etiology
- Known Etiology but Unknown Presence
- Health Services
- Public Education
- Outbreak Detection
- New Diseases
4Example Questions
- Are people in some geographical area at higher risk of brain cancer? This could be due to environmental, socioeconomic, behavioral or genetic risk factors.
- Are there geographical differences in the access to and/or use of early detection programs, such as mammography screening?
- Are there geographical differences in the access to and/or use of prostate cancer treatment?
5Different Types of Disease Data
- Count Data: Incidence, Mortality, Prevalence
- Categorical Data: Stage, Histology, Treatment
- Continuous Data: Survival, Lead levels, BMI
6For Incidence and Mortality
Poisson Data
Numerator: Number of Cases
Denominator: Person-years at risk
7For Prevalence
Bernoulli Data (0/1 Data)
Numerator: People with Thyroid Cancer
Denominator: Those without Thyroid Cancer
Note: When prevalence is low, a Poisson model is a very good approximation for Bernoulli data.
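The note on low prevalence can be checked numerically. The sketch below compares binomial and Poisson probabilities for two made-up settings: 10,000 people with a prevalence of 0.1%, and 100 people with a prevalence of 30%. The population sizes and prevalences are illustrative assumptions, not figures from the presentation.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k cases among n people, each with risk p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, mu):
    """Probability of exactly k cases when mu cases are expected."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def max_abs_difference(n, p, k_max):
    """Largest gap between the two models over k = 0..k_max."""
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, n * p))
               for k in range(k_max + 1))

# Hypothetical numbers: low prevalence versus high prevalence.
low = max_abs_difference(10_000, 0.001, 40)
high = max_abs_difference(100, 0.3, 100)
print(f"low prevalence:  {low:.6f}")
print(f"high prevalence: {high:.6f}")
```

The gap between the two models should be far smaller in the low-prevalence setting, which is the situation the note describes.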
8For Stage, Histology and Treatment
Bernoulli Data (0/1 Data)
Numerator: Cases of a specific type, e.g. late stage.
Denominator: All cases.
Ordinal Data: For example Stage 1, 2, 3, 4
9For Survival
Survival Data: Length of Survival (Censored Data is Common)
10For Weight, Lead Levels, etc
Normal Data: One may sometimes want to use a log transformation to normalize the data
11Data Aggregation (spatial resolution)
- Exact Location
- Census Block Group
- Zip Code
- Census Tract
- County
- State
12Data Aggregation
- Some level of aggregation is usually needed due to data availability.
- Less aggregation is typically better as more information is retained.
- Many statistical methods can be used irrespective of aggregation level.
13Exploratory/Descriptive Mapping Techniques
- Maps of rates or relative risks
- Probability maps
- Smoothed rates or relative risks
- Smoothed probability maps
14Iowa Breast Cancer Incidence, 1993-1996: Age-Adjusted Relative Risks
15Uncertainty of Rate Estimates
In a regular map, a relative risk of 2 could mean
that there are 2000 cases with 1000 expected in
an urban county, or 2 cases with 1 expected in a
rural county. For the urban county, the
relative risk of 2 is a good estimate of the true
relative risk, but not for the rural county.
16Probability Map
For a particular county, one can test whether the observed cases are significantly more than expected, providing a p-value for that county. A map of these p-values is called a probability map.
Reference: Choynowski M. Maps Based on Probabilities. Journal of the American Statistical Association, 54:385-388, 1959.
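A county p-value of this kind can be sketched as a one-sided Poisson tail probability. The version below is a minimal illustration that assumes the observed count follows a Poisson distribution with mean equal to the expected count; the function names are hypothetical, and the example reuses the urban (2000 observed, 1000 expected) and rural (2 observed, 1 expected) counties from the slide on uncertainty of rate estimates.

```python
import math

def poisson_log_pmf(k, mu):
    """Log of the Poisson probability of exactly k events with mean mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def poisson_upper_tail(observed, expected):
    """P(X >= observed) when X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    if observed <= expected:
        # The p-value is large here, so the complement is numerically safe.
        lower = sum(math.exp(poisson_log_pmf(k, expected))
                    for k in range(observed))
        return max(0.0, 1.0 - lower)
    # Above the mean the terms shrink, so sum the upper tail directly
    # to keep very small p-values from being lost to cancellation.
    k = observed
    term = math.exp(poisson_log_pmf(k, expected))
    total = 0.0
    while term > 1e-17 * total:
        total += term
        k += 1
        term *= expected / k
    return min(total, 1.0)

# Both counties have a relative risk of 2, but very different evidence.
print("urban county:", poisson_upper_tail(2000, 1000))
print("rural county:", poisson_upper_tail(2, 1))
```

The same relative risk gives a vanishingly small p-value for the urban county and an unremarkable one for the rural county, which is what a probability map is meant to expose.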
17County p-values
County   Obs  Exp  RR    p
Dubuque  275  235  1.17  0.004
Polk     892  817  1.09  0.004
Clayton   77   57  1.34  0.006
Mills     51   36  1.43  0.006
Scott    411  368  1.12  0.012
Linn     467  429  1.09  0.033
Marion    97   82  1.18  0.048
18Regular vs. Probability Map
p < 0.05
0.05 < p < 0.10
19Warning
By chance alone, about 5% of the counties will have a statistically significant p-value at the 0.05 level. Need to adjust for multiple testing.
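The size of the multiple testing problem can be illustrated with a short calculation. The sketch below assumes the county tests are independent, which neighboring counties only approximate, and the count of 100 counties is a made-up round number.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Average number of areas flagged when no area has elevated risk."""
    return n_tests * alpha

def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of at least one flagged area, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

# Hypothetical map with 100 counties and no true excess risk anywhere.
print(expected_false_positives(100))
print(prob_at_least_one_false_positive(100))
```

Under these assumptions, a probability map with no true clusters would still be expected to show several "significant" counties.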
20Maps of rates and probability maps are very
useful for descriptive purposes
Problem:
- Maps of Rates: No statistical testing
- Probability Maps: Multiple testing
Solution:
- Tests for Spatial Randomness: One test
21Statistical Significance: Tests for Spatial Randomness
22Whether or not there are true geographical
differences in risk, there will always be some
geographical patterns apparent to the naked eye.
As in all medical research, it is important to
evaluate whether observed patterns/results are
likely to be due to chance or not.
23Breast Cancer Incidence, Relative
Risks Age-Adjusted, Indirect Standardization
24Brain Cancer Mortality, Children 1986-1995
25Brain Cancer Mortality, Adults 1986-1995
26Tests for Spatial Randomness
Null Hypothesis: The risk of disease is the same in all parts of the map.
27Covariate Adjustments
For incidence and mortality analyses, it is important to adjust for age, and sometimes for other variables as well. This is done using indirect standardization, so that a covariate-adjusted expected number of cases is obtained for each census area. This adjustment can be used with any test for spatial randomness.
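Indirect standardization can be sketched in a few lines: stratum-specific rates are taken from the whole study region and applied to each area's population. The two areas, the two age groups and all counts below are hypothetical and serve only to show the arithmetic.

```python
def stratum_rates(total_cases, total_population):
    """Region-wide rate for each stratum (for example, each age group)."""
    return {s: total_cases[s] / total_population[s] for s in total_population}

def expected_cases(area_population, rates):
    """Covariate-adjusted expected cases: sum of population times rate."""
    return sum(area_population[s] * rates[s] for s in area_population)

# Hypothetical data: two areas, two age groups.
population = {
    "A": {"<50": 8000, "50+": 2000},
    "B": {"<50": 3000, "50+": 7000},
}
cases_by_age = {"<50": 11, "50+": 90}
observed = {"A": 30, "B": 71}

total_population = {s: sum(area[s] for area in population.values())
                    for s in cases_by_age}
rates = stratum_rates(cases_by_age, total_population)
expected = {name: expected_cases(pop, rates) for name, pop in population.items()}
relative_risk = {name: observed[name] / expected[name] for name in population}
print(expected, relative_risk)
```

Because the rates come from the same region, the expected counts add up to the total number of observed cases, so an older area is not flagged merely for having an older population.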
28Tests for Spatial Randomness
Three Different Types
- Global Clustering Tests
- Cluster Detection Tests
- Focused Tests
Complementary. Used for different purposes.
29Global Clustering Tests
Evaluate whether clustering exists as a global phenomenon throughout the map, without pinpointing the location of specific clusters.
30Global Clustering Tests
- Chi-square, Pearson, 1900
- Forbes' Coefficient of Association, 1907
- Troup-Maynard, 1912
- Poisson Dispersion Index, Fisher 1922
- Renkonen's Test, 1938
- Dice's Association Index, 1945
- Dice's Coincidence Index, 1945
31Global Clustering Tests
- Jahn et al.'s Index 1, 1947
- Jahn et al.'s Index 2, 1947
- Jahn et al.'s Index 4, 1947
- Moran's BB Join Count, 1948
- Moran's BW Join Count, 1948
- Moran's I, 1950
- Jahn's Reproducibility Index, 1950
32Global Clustering Tests
- Geary's c for binary data, 1954
- Duncan-Duncan's C0, 1955
- Morisita's Cd, 1959
- Pielou's S, 1961
- Martínez-Picó et al., 1965
- Horn's Ro, 1966
- Horn's Adjustment of Morisita's Cd, 1966
33Global Clustering Tests
- Potthoff-Whittinghill's V, 1966
- Potthoff-Whittinghill's z, 1966
- Lloyd's Mean Crowding, 1967
- Levin's a, 1968
- Mantel-Bailar, 1970
- Pianka-Stewart's Ojk, Pianka 1973
- Lloyd-Robert's Close Pairs Test, 1973
34Global Clustering Tests
- Walter, 1974
- Smith-Pike's T, 1976
- Ohno-Aoki-Aoki's Test, 1979
- Weighted Moran's I, Cliff-Ord 1981
- Grimson-Wang-Johnson, 1981
- Lotwick-Silverman's Empty Space Test 1982
- Fraser's Clustering Index, 1983
35Global Clustering Tests
- Esseen's Test, 1983
- Symons-Grimson-Yuan, 1983
- White 1983
- Jansson's Count, 1983
- Fraser's Clustering Index, 1983
- Barnes et al, 1987
- Whittemore et al. 1987
36Global Clustering Tests
- Cuzick-Edwards k-NN, 1990
- Cuzick-Edwards Tinv, 1990
- Grimson-Rose's Method, 1991
- Besag-Newell's R, 1991
- Diggle-Chetwynd's D(s), 1991
- Diggle-Chetwynd's Dmax, 1991
- Diggle-Chetwynd's Dsum, 1991
37Global Clustering Tests
- Alexander's NNA, 1991
- Grimson's U, 1993
- Palmer, 1993
- Dixon's Zs, 1994
- Dixon's ZAA, 1994
- Tango's Excess Events Test, 1995
- Britton 1997
38Global Clustering Tests
- Anderson-Titterington's Thh, 1997
- Swartz' Entropy Test, 1998
- Conradt's Segregation Coefficient, 1998
- Perry's T, 1998
- Rogerson's R, 1999
- Bithell's D, 1999
- Bithell's Tvar, 1999
39Global Clustering Tests
- Assuncao-Reis Empirical Bayes Index, 1999
- Tango's Max Excess Events Test, 2000
- Bonetti's Polytope Tests, 2000
- Gangnon-Clayton's Test, 2001
- Bonetti-Pagano's M, 2004
- Bonetti-Pagano's M(1), 2004
- Bonetti-Pagano's M(KL), 2004
- etc.
40Cluster Detection Tests
Determine the location and statistical
significance of clusters without prior
assumptions about their locations.
41Cluster Detection Tests
- Janson's Largest Cluster Test, 1983
- Turnbull's CEPP, 1990
- Grimson's MAX, 1993
- Kulldorff's Spatial Scan Statistic, 1995, 1997
- Bithell's M, 1999
- Tango-Takahashi's Flex Scan, 2005
- etc
42Focused Tests
Determine whether there is a cluster around a
pre-specified point or linear feature.
43Focused Tests
- Fixed Cut-Off, Lyon et al. 1981
- Isotonic Regression, Stone 1988
- Diggle's D, 1990
- Lawson-Waller's Score Test, 1992, 1993
- Bithell's Linear Rank Score Test, 1995
- Rogerson's Ri, 1999
- etc.
44The Spatial Scan Statistic
45One-Dimensional Scan Statistic
46The Spatial Scan Statistic
- Create a regular or irregular grid of centroids covering the whole study region.
- Create an infinite number of circles around each centroid, with the radius anywhere from zero up to a maximum so that at most 50 percent of the population is included.
47Collection of overlapping circles of different
sizes.
48
- For each circle
  - Obtain actual and expected number of cases inside and outside the circle.
  - Calculate Likelihood Function.
- Compare Circles
  - Pick circle with highest likelihood function as Most Likely Cluster.
- Inference
  - Generate random replicas of the data set under the null hypothesis of no clusters (Monte Carlo sampling).
  - Compare most likely clusters in real and random data sets (Likelihood ratio test).
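The procedure can be sketched for aggregated count data as follows. This is a simplified illustration, not the SaTScan implementation: it assumes a Poisson model with expected counts proportional to population, scans for high-rate clusters only, grows circles one nearest centroid at a time (ties in distance are broken by index order), and uses a hypothetical 5 x 5 grid with made-up counts. All function names are illustrative.

```python
import math
import random

def poisson_llr(c, e, total):
    """Log likelihood ratio for a window with c observed and e expected cases."""
    if c <= e:
        return 0.0  # only high-rate clusters are scanned for in this sketch
    inside = c * math.log(c / e)
    outside = (total - c) * math.log((total - c) / (total - e)) if c < total else 0.0
    return inside + outside

def build_windows(coords, expected, max_fraction=0.5):
    """Circles around each centroid, holding at most half of the expected cases."""
    total = sum(expected)
    windows = set()
    for cx, cy in coords:
        order = sorted(range(len(coords)),
                       key=lambda j: math.hypot(coords[j][0] - cx, coords[j][1] - cy))
        members, e = [], 0.0
        for j in order:
            e += expected[j]
            if e > max_fraction * total:
                break
            members.append(j)
            windows.add(frozenset(members))
    return windows

def most_likely_cluster(cases, expected, windows):
    """Window with the highest log likelihood ratio."""
    total = sum(cases)
    best_llr, best_window = 0.0, frozenset()
    for w in windows:
        c = sum(cases[i] for i in w)
        e = sum(expected[i] for i in w)
        llr = poisson_llr(c, e, total)
        if llr > best_llr:
            best_llr, best_window = llr, w
    return best_llr, best_window

def scan_test(cases, population, coords, n_replicas=999, seed=0):
    """Most likely cluster and its Monte Carlo p-value."""
    rng = random.Random(seed)
    total = sum(cases)
    expected = [total * n / sum(population) for n in population]
    windows = build_windows(coords, expected)
    observed_llr, cluster = most_likely_cluster(cases, expected, windows)
    areas = range(len(cases))
    at_least_as_extreme = 0
    for _ in range(n_replicas):
        # Null hypothesis: each case falls in an area with probability
        # proportional to that area's expected count.
        sim = [0] * len(cases)
        for i in rng.choices(areas, weights=expected, k=total):
            sim[i] += 1
        if most_likely_cluster(sim, expected, windows)[0] >= observed_llr:
            at_least_as_extreme += 1
    p_value = (at_least_as_extreme + 1) / (n_replicas + 1)
    return cluster, observed_llr, p_value

# Hypothetical data: equal populations, elevated counts in one corner.
coords = [(x, y) for y in range(5) for x in range(5)]
population = [1000] * 25
cases = [3] * 25
for i in (0, 1, 5):
    cases[i] = 12
cluster, llr, p_value = scan_test(cases, population, coords, n_replicas=199)
print(sorted(cluster), round(llr, 2), p_value)
```

Because only the maximum over all circles is compared between the real and the random data sets, a single p-value is obtained for the most likely cluster, which is how the method accounts for multiple testing.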
49Spatial Scan Statistic Properties
- Adjusts for inhomogeneous population density.
- Simultaneously tests for clusters of any size and any location, by using circular windows with continuously variable radius.
- Accounts for multiple testing.
- Possibility to include confounding variables, such as age, sex or socio-economic variables.
- Aggregated or non-aggregated data (states, counties, census tracts, block groups, households, individuals).
50Breast Cancer Incidence, Relative
Risks Age-Adjusted, Indirect Standardization
51A small sample of the circles used
52Four Most Likely Clusters
p = 0.99
p = 0.11
p = 0.37
p = 0.88
53Four Most Likely Clusters
Cluster     Obs   Exp   RR    p
East       1853  1722  1.08  0.11
Central     986   899  1.10  0.37
Southwest    51    36  1.43  0.89
Northwest   199   172  1.16  0.99
54Geographical Aggregation
- In traditional mapping of rates or relative risks for disjoint geographical areas, there is a trade-off between the stability of the estimates and the geographical resolution.
- With tests for spatial randomness, less geographical data aggregation is always better
  - Ability to detect clusters not conforming to political boundaries.
  - More accurate data / less loss of information.
55Breast Cancer Incidence: Census Tract Analysis
732 census tracts
56Eight Most Likely Clusters for Breast Cancer
Incidence
(approximate locations)
57Iowa Breast Cancer Incidence
Census Tract Aggregation
Cluster   Obs   Exp   RR   LLR   p
1         341   240  1.4  19.4  0.001
2          28    11  2.6   9.8  0.03
3        1843  1708  1.1   6.7  0.39
4          29    15  2.0   5.3  0.80
5          21    10  2.1   4.4  0.98
6          30    17  1.8   4.4  0.98
7         208   171  1.2   3.8  0.99
8          41    26  1.6   3.8  0.99
58Iowa Breast Cancer Staging
Census Tract Aggregation
Late Stage Cases: 758
Total Cases: 7415
59Six Most Likely Clusters of Late Stage Breast
Cancer
(Map with clusters labeled A through F)
60Late Stage Breast Cancer
Census Tract Aggregation
Cluster  Obs  Exp   RR   LLR  p
A         15   4.5  3.3  9.2  0.049
B         13   4.7  2.8  5.9  0.62
C          6   1.3  4.5  5.5  0.75
D         44  27.1  1.6  5.3  0.81
E          9   3.1  2.9  4.5  0.97
F          4   0.9  3.5  4.3  0.99
61Summary Breast Cancer in Iowa
- A cluster of high breast cancer incidence was found west of Des Moines.
- The geographical distribution of late stage breast cancer is rather even, with only one marginally significant cluster
62Summary Spatial Scan Statistic
- Cluster detection irrespective of political boundaries, and without assumptions about cluster size or location.
- Adjusts for multiple testing.
- It is only possible to pinpoint the general location of a cluster. The borders are approximate.
- It is a surveillance tool. The cause of a cluster must be investigated through other means.
63Global Clustering Tests
64Notation
Census areas: $i = 1, 2, \ldots, L$
Observed cases in area i: $c_i$
Total cases: $C = \sum_i c_i$
Population in area i: $n_i$
Total population: $N = \sum_i n_i$
Distance between areas i and j: $d_{ij}$
65Moran's I
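- Let $\bar{r} = \frac{1}{L} \sum_i r_i$, where $r_i = c_i / n_i$.
- Moran's I is defined as
  $$I = \frac{L \sum_i \sum_j a(i,j)\,(r_i - \bar{r})(r_j - \bar{r})}{\left(\sum_i \sum_j a(i,j)\right) \sum_i (r_i - \bar{r})^2}$$
- where a(i,j) is 1 if counties i and j are neighbors and 0 otherwise.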
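As a small illustration, the sketch below computes Moran's I in its standard form, using the unweighted mean of the area rates as the reference level; that choice, the function name and the four-area example are assumptions made for illustration. Neighbor relations are passed as unordered pairs, so each pair is counted twice to match the symmetric a(i,j).

```python
def morans_i(rates, neighbor_pairs):
    """Moran's I for area rates r_i, given unordered pairs of neighboring areas."""
    n_areas = len(rates)
    mean = sum(rates) / n_areas
    dev = [r - mean for r in rates]
    # a(i, j) is symmetric, so every unordered pair contributes twice.
    cross = 2 * sum(dev[i] * dev[j] for i, j in neighbor_pairs)
    weight_sum = 2 * len(neighbor_pairs)
    return (n_areas / weight_sum) * cross / sum(d * d for d in dev)

# Hypothetical example: four areas in a row, each bordering the next.
pairs = [(0, 1), (1, 2), (2, 3)]
print(morans_i([1, 1, 3, 3], pairs))  # similar rates sit next to each other
print(morans_i([1, 3, 1, 3], pairs))  # high and low rates alternate
```

Positive values indicate that neighboring areas tend to have similar rates, and negative values indicate that they tend to differ.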
66Whittemore's Test
- Whittemore et al. proposed the statistic
67Cuzick-Edwards k-NN Test
The test statistic is $\sum_i \sum_j c_i c_j \, I(d_{ij} < d_{i k(i)})$, where k(i) is the county with the k-th nearest neighbor to an individual in county i.
Note: This test is a special case of the Weighted Moran's I Test, proposed by Cliff and Ord, 1981.
68Besag-Newell's R
- For each case, find the collection of nearest counties so that there are a total of at least k cases in the area of the original and neighboring counties.
- Using the Poisson distribution, check if this area is statistically significant (not adjusting for multiple testing)
- R is the number of cases for which this procedure creates a significant area
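One reading of these steps is sketched below for county-level data. It assumes that significance is judged by the Poisson probability of seeing at least k cases given the expected count in the accumulated counties, and that every case in a county shares that county's result. The function names, the 0.05 level and the four-county example are illustrative assumptions.

```python
import math

def poisson_sf(k, mu):
    """P(X >= k) for X ~ Poisson(mu), via the complement of the lower tail."""
    lower = sum(math.exp(s * math.log(mu) - mu - math.lgamma(s + 1))
                for s in range(k))
    return max(0.0, 1.0 - lower)

def besag_newell_r(cases, expected, dist, k, alpha=0.05):
    """Number of cases whose k-case neighborhood is significant at level alpha."""
    r = 0
    for i in range(len(cases)):
        if cases[i] == 0:
            continue  # no cases start their search from this county
        # Add counties in order of distance until at least k cases are covered.
        order = sorted(range(len(cases)), key=lambda j: dist[i][j])
        c, e = 0, 0.0
        for j in order:
            c += cases[j]
            e += expected[j]
            if c >= k:
                break
        if c >= k and poisson_sf(k, e) <= alpha:
            r += cases[i]  # every case in county i yields the same area
    return r

# Hypothetical example: four counties in a row at positions 0, 1, 2, 3.
positions = [0, 1, 2, 3]
dist = [[abs(a - b) for b in positions] for a in positions]
cases = [8, 0, 1, 1]
expected = [2.5, 2.5, 2.5, 2.5]
print(besag_newell_r(cases, expected, dist, k=6))
```

The result depends strongly on k, since k fixes the scale of the neighborhoods that are examined.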
69Swartz's Entropy Test
- The test statistic is defined as
70Tango's Maximized Excess Events Test
- For a given constant λ, the Excess Events Test statistic is defined as
- To be able to detect clustering irrespective of its geographical scale, Tango proposed the Maximized Excess Events Test
71Global Clustering Tests: Power Evaluation
Joint work with Toshiro Tango, Peter Park and
Changhong Song
72Power Evaluation
- 245 counties and county equivalents in Northeastern United States
- 600 randomly distributed cases, according to population size and different clustering models
- Different parameters used for Besag-Newell's R and Cuzick-Edwards' k-NN tests
Kulldorff M, Tango T, Park PJ. Power comparisons for disease clustering tests. Computational Statistics and Data Analysis, 2003, 42:665-684.
73Global Chain Clustering
- Each county has the same expected number of cases under the null and alternative hypotheses
- 300 cases are distributed according to complete spatial randomness
- Each of these has a twin case, located nearby, at some distance from the original case (distance zero, 1% and 5% of the population along a chain, respectively).
74Power: Zero Distance
- Moran's I 0.05
- Whittemore's Test 0.13
- Cuzick-Edwards 1.00 0.92 0.73
- Besag-Newell 0.48 0.49 0.42
- Swartz's Entropy 1.00
- Tango's MEET 0.99
- Spatial Scan 0.79
75Power: Random Distance, 1%
- Moran's I 0.12
- Whittemore's Test 0.12
- Cuzick-Edwards 0.53 0.52 0.47
- Besag-Newell 0.14 0.21 0.27
- Swartz's Entropy 0.39
- Tango's MEET 0.56
- Spatial Scan 0.35
76Power: Random Distance, 4%
- Moran's I 0.07
- Whittemore's Test 0.10
- Cuzick-Edwards 0.14 0.17 0.18
- Besag-Newell 0.08 0.10 0.12
- Swartz's Entropy 0.13
- Tango's MEET 0.25
- Spatial Scan 0.18
77Hot Spot Clusters
- One or more neighboring counties have higher risk than those outside.
- Constant risks among counties in the cluster, as well as among those outside the cluster
78Power: Grand Isle, Vermont (RR=193)
- Moran's I 0.00
- Whittemore's Test 0.01
- Cuzick-Edwards 0.75 0.17 0.04
- Besag-Newell 0.71 0.39 0.09
- Swartz's Entropy 0.94
- Tango's MEET 0.20
- Spatial Scan 1.00
79Power: Grand Isle + 15 neighbors (RR=3.9)
- Moran's I 0.71
- Whittemore's Test 0.01
- Cuzick-Edwards 0.76 0.62 0.25
- Besag-Newell 0.82 0.88 0.50
- Swartz's Entropy 0.71
- Tango's MEET 0.23
- Spatial Scan 0.97
80Power: Pittsburgh, PA (RR=2.85)
- Moran's I 0.05
- Whittemore's Test 0.00
- Cuzick-Edwards 0.65 0.92 0.90
- Besag-Newell 0.04 0.02 0.98
- Swartz's Entropy 0.27
- Tango's MEET 0.92
- Spatial Scan 0.94
81Power: Pittsburgh + 15 neighbors (RR=2.1)
- Moran's I 0.19
- Whittemore's Test 0.00
- Cuzick-Edwards 0.60 0.72 0.84
- Besag-Newell 0.29 0.28 0.91
- Swartz's Entropy 0.35
- Tango's MEET 0.83
- Spatial Scan 0.95
82Power: Manhattan (RR=2.73)
- Moran's I 0.05
- Whittemore's Test 0.27
- Cuzick-Edwards 0.63 0.86 0.89
- Besag-Newell 0.04 0.03 0.95
- Swartz's Entropy 0.26
- Tango's MEET 0.94
- Spatial Scan 0.92
83Power: Manhattan + 15 neighbors (RR=1.53)
- Moran's I 0.07
- Whittemore's Test 0.87
- Cuzick-Edwards 0.26 0.65 0.80
- Besag-Newell 0.01 0.06 0.37
- Swartz's Entropy 0.05
- Tango's MEET 0.99
- Spatial Scan 0.93
84Conclusions
- Besag-Newell's R and Cuzick-Edwards' k-NN often perform well, but are highly dependent on the chosen parameter
- Moran's I and Whittemore's Test have problems with many types of clustering
- Tango's MEET performs very well for global clustering
- The spatial scan statistic performs well for hot-spot clusters
85Brain Cancer Mortality in the United States
- Joint work with
- Zixing Fang, Cancer Prevention Institute
- David Gregorio, Univ Connecticut
86U.S. Brain Cancer Mortality 1986-1995
                     deaths   rate* (95% CI)
Children (age <20)    5,062   0.75 (0.66-0.83)
Adults (age 20+)    106,710   6.0 (5.8-6.2)
Adult Women          48,650   4.9 (4.7-5.0)
Adult Men            58,060   7.2 (7.0-7.5)
* annual deaths / 100,000
87Brain Cancer
- Known risk factors
- High dose ionizing radiation
- Selected congenital and genetic disorders
- Explains only a small percent of cases.
- Potential risk factors
- N-nitroso compounds?, phenols?, pesticides?,
polycyclic aromatic hydrocarbons?, organic
solvents?
88Adjustments
All subsequent analyses were adjusted for
- Age
- Gender
- Ethnicity (African-American, White, Other)
89Brain Cancer Mortality, Children 1986-1995
90Cuzick-Edwards Test: Children
k    p-value
200  0.04
500  0.13
91Tango's Excess Events Test: Children
λ      p-value
1000   0.005
2000   0.06
5000   0.21
10000  0.29
92Spatial Scan Statistic, Children
93Children: Seven Most Likely Clusters
Cluster           Obs  Exp  RR   p
1. Carolinas       86   51  1.7  0.24
2. California      16  4.9  3.3  0.74
3. Michigan       318  250  1.3  0.74
4. S Carolina      24   10  2.5  0.79
5. Kentucky-Tenn  127   88  1.4  0.79
6. Wisconsin       10  2.4  4.1  0.98
7. Nebraska        12  3.6  3.3  0.99
94Conclusions Children
Some evidence of global spatial clustering, but
rather weak. No statistically significant
clusters detected. Any part of the pattern seen
on the original map may be due to chance.
95How About Adults?
96Brain Cancer Mortality, Adults 1986-1995
97Cuzick-Edwards k-NN: All Adults
k      p-value
4000   0.0001
10000  0.0001
98Tango's EET: All Adults
λ      p-value
1000   0.0001
2000   0.0001
5000   0.0001
10000  0.0001
99Spatial Scan Statistic Adults
100Brain Cancer Mortality, Adults 1986-1995
101Cuzick-Edwards: Women
k     p-value
1500  0.0001
3000  0.0001
102Tango's EET: Women
λ      p-value
1000   0.0001
2000   0.0001
5000   0.0001
10000  0.0001
103Spatial Scan Statistic, Women
104Women: Most Likely Clusters
Cluster                 Obs   Exp   RR    p
1. Arkansas et al.     2830  2328  1.22  0.0001
2. Carolinas           1783  1518  1.17  0.0001
3. Oklahoma et al.     1709  1496  1.14  0.003
4. Minnesota et al.    2616  2369  1.10  0.01
10. N.J. / N.Y.        1809  2300  0.79  0.0001
11. S Texas             127   214  0.59  0.0001
12. New Mexico et al.   849  1049  0.81  0.0001
105Cuzick-Edwards: Men
k     p-value
2000  0.0001
4000  0.0001
106Tango's EET: Men
λ      p-value
1000   0.0001
2000   0.0001
5000   0.0001
10000  0.0001
107Spatial Scan Statistic Men
108Men: Most Likely Clusters
Cluster                   Obs   Exp   RR    p
1. Kentucky et al.       3295  2860  1.15  0.0001
2. Carolinas             1925  1658  1.16  0.0001
3. Arkansas et al.       1143   964  1.19  0.001
4. Washington et al.     1664  1455  1.14  0.003
5. Michigan              1251  1074  1.17  0.005
11. N.J. / N.Y.          2084  2615  0.80  0.0001
12. S Texas               157   262  0.60  0.0001
13. New Mexico et al.    1418  1680  0.84  0.0001
14. Upstate N.Y. et al.  1642  1895  0.87  0.0001
109Conclusions Adults
Strong evidence of global spatial clustering. It
is possible to pinpoint specific areas with
higher and lower rates that are statistically
significant, and unlikely to be due to
chance. The exact borders of detected clusters
are uncertain. Similar patterns for men and
women.
110Conclusion General
Tests for spatial randomness are very useful
additions to cancer maps, in order to determine
if the observed patterns are likely due to chance
or not. Different tests provide complementary
information.
111Reference
- Fang, Kulldorff, Gregorio. Brain cancer in the United States 1986-1995, A Geographical Analysis. Neuro-Oncology, 2004, 6:179-187.
112Childhood Leukemia in Sweden
- Ulf Hjalmars, Martin Kulldorff
- Göran Gustafsson, Neville Nagarwalla
Statistics in Medicine, 1995
113Leukemia Incidence Data
- Acute leukemia
- Children, age 0-15 years
- Years 1973-1993
- 1523 cases
- 2577 parishes
- Denominator: 1,703,235 children, based on an average for years 1976, 1982 and 1988.
114Three Most Likely Clusters
115Three Most Likely Clusters
Cluster   obs  exp  pop    p
Okome       3  0.1    133  0.70
Ö Tunhem    5  0.6    695  0.91
Stavnäs    20  9.9  10380  0.99
116Conclusions
- No evidence of any childhood cancer clusters in Sweden
- A leukemia cluster in Åstorp that received media attention in 1981 was detected, but it was not among the top three clusters nor statistically significant.
118Breast Cancer Mortality: Northeastern United States
States: Maine, N.H., Vermont, Mass., R.I., Connecticut, N.Y., N.J., Pennsylvania, Delaware, Maryland, D.C.
Years: 1988-1992
Deaths: 58,943
Population: 29,535,210
Geographical Aggregation: 245 counties
Joint work with E Feuer, B Miller, L Freedman, NCI
119Breast cancer mortality
121Breast cancer mortality: Most likely cluster
p = 0.001
122Most Likely Clusters
Location            Obs     Exp     RR     LLR   p
NY/Philadelphia  24,044  23,040  1.074  35.7  0.001
Buffalo           1,416   1,280  1.109   7.1  0.12
Washington DC       712     618  1.154   6.9  0.15
Boston            5,966   5,726  1.047   5.5  0.40
Eastern Maine       267     229  1.166   3.0  0.99
123References
General Theory: Kulldorff M. A Spatial Scan Statistic. Communications in Statistics: Theory and Methods, 26:1481-1496, 1997.
Application: Kulldorff M, Feuer E, Miller B, Freedman L. Breast Cancer in Northeast United States: A Geographic Analysis. American Journal of Epidemiology, 146:161-170, 1997.