Title: Apresenta
1Geoinfo 2006
What is the true shape of a disease cluster? The
multi-objective genetic scan
Luiz Duczmal
André L.F. Cançado
Ricardo C.H. Takahashi
Univ. Federal Minas Gerais, Brazil, Statistics
Dept., Electrical Engineering Dept.,
Mathematics Dept.
2Irregularly shaped spatial disease clusters occur
commonly in epidemiological studies, but their
geographic delineation is poorly defined. Most
current spatial scan software displays only one
of the many possible cluster solutions. These
range in shape from the most compact round
cluster to the most irregular one, depending on
how strongly the freedom of shape is penalized.
Even when a fairly complete set of solutions
is available, the choice of the most appropriate
parameter setting is left to the practitioner,
whose decision is often subjective.
3We propose quantitative criteria for choosing the
best cluster solution, through multi-objective
optimization, by finding the Pareto-set in the
solution space. Two competing objectives are
involved in the search: regularity of shape and
scan statistic value. Instead of sequentially
running a cluster-finding algorithm with
varying degrees of penalization, the complete set
of solutions is found in parallel by a
genetic algorithm.
4The cluster significance concept is extended
to this set in a natural and unbiased way, and is
employed as a decision criterion for choosing the
optimal solution. The Gumbel distribution is
used to approximate the empirical scan statistic
distribution, speeding up the significance
estimation. The method is fast, with good power
of detection. An application to breast cancer
clusters is discussed.
Keywords: spatial scan statistic, disease clusters,
geometric compactness penalty correction,
Pareto-sets, multi-objective optimization, vector
optimization, Gumbel distribution, genetic
algorithm.
5Spatial Scan Statistics, Kulldorff (1997)
Map with m regions; total population N; C cases.
Under the null hypothesis there is no cluster
in the map, and the number of cases in each
region is Poisson distributed.
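6For each circle centered at each region's
centroid, let z be the collection of regions that
lie inside it. Let $c_z$ be the number of cases
inside z and $\mu_z$ the expected number of cases
inside z under the null hypothesis.
The scan statistic is defined as
$$L(z) = \left(\frac{c_z}{\mu_z}\right)^{c_z} \left(\frac{C - c_z}{C - \mu_z}\right)^{C - c_z}$$
if $c_z > \mu_z$, and one otherwise.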
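The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function and parameter names are hypothetical, and the expected count is assumed to be proportional to the zone's share of the total population. The logarithm of L(z) is returned because it is the quantity (LLR) used in the later slides.

```python
import math

def expected_cases(total_cases, zone_population, total_population):
    """Expected cases in a zone under the null hypothesis (assumed to be
    proportional to the zone's share of the total population)."""
    return total_cases * zone_population / total_population

def log_likelihood_ratio(cases_in_zone, expected_in_zone, total_cases):
    """Logarithm of L(z) for a single zone z."""
    c, mu, C = cases_in_zone, expected_in_zone, total_cases
    if c <= mu:
        return 0.0  # L(z) = 1 when the zone has no excess of cases
    llr = c * math.log(c / mu)
    if c < C:  # the second factor equals 1 when all cases lie inside z
        llr += (C - c) * math.log((C - c) / (C - mu))
    return llr

# Hypothetical zone: 20 observed cases where 10 are expected, 100 cases in total.
mu = expected_cases(100, 1_000, 10_000)
print(log_likelihood_ratio(20, mu, 100))
```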
7The collection (or zone) z with the highest L(z)
is the most likely cluster.
We sweep through all the $m^2$ possible circular
zones, looking for the highest L(z) value.
We need to compare this value against the maximum
L(z) for maps with cases distributed
randomly under the null hypothesis.
The whole procedure is repeated thousands of
times, once for each set of randomly distributed
cases (Monte Carlo, Dwass (1957)).
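The Monte Carlo procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `statistic` is a hypothetical callable standing in for the maximum L(z) over all zones, cases are redistributed with probability proportional to population, and the p-value is the rank of the observed value among the simulated ones.

```python
import random

def monte_carlo_p_value(cases, populations, statistic, replications=999, seed=0):
    """Rank-based Monte Carlo p-value for statistic(cases, populations)."""
    rng = random.Random(seed)
    total_cases = sum(cases)
    regions = range(len(populations))
    observed = statistic(cases, populations)
    at_least_as_extreme = 0
    for _ in range(replications):
        # Redistribute all cases at random, proportionally to population.
        simulated = [0] * len(populations)
        for region in rng.choices(regions, weights=populations, k=total_cases):
            simulated[region] += 1
        if statistic(simulated, populations) >= observed:
            at_least_as_extreme += 1
    # The observed map counts as one of the replications + 1 data sets.
    return (at_least_as_extreme + 1) / (replications + 1)

def max_count(cases, populations):
    """Toy stand-in for the scan statistic: the largest regional count."""
    return max(cases)

print(monte_carlo_p_value([40, 0, 0, 0], [1, 1, 1, 1], max_count, replications=99))
```

With 999 replications the smallest attainable p-value is 0.001, which is why the number of replications bounds the resolution of the significance estimate.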
8Extreme example of an irregularly shaped cluster
Penalty function to control the freedom of shape
(joint work with Kulldorff and Huang)
9A(z) = area of the zone z; H(z) = perimeter of the
convex hull of z.
Intuitively, the convex hull of a planar object
is the region inside a rubber band stretched
around it.
Compactness:
K(z) = the area of z divided by the area of the
circle with perimeter H(z), that is,
$K(z) = 4\pi A(z) / H(z)^2$.
10Compactness for some common shapes
Circle: K(z) = 1; Square: K(z) = π/4
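The compactness values for these shapes can be checked numerically. The sketch below assumes the zone is a single simple polygon given by its vertices (a real zone is a union of regions), and all function names are hypothetical. It uses the shoelace formula for the area and Andrew's monotone chain for the convex hull.

```python
import math

def polygon_area(vertices):
    """Area of a simple polygon (shoelace formula)."""
    total = 0.0
    for i, (x1, y1) in enumerate(vertices):
        x2, y2 = vertices[(i + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def convex_hull(points):
    """Convex hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(sequence):
        chain = []
        for p in sequence:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half_hull(pts) + half_hull(reversed(pts))

def perimeter(vertices):
    """Length of the closed polyline through the vertices."""
    n = len(vertices)
    return sum(math.dist(vertices[i], vertices[(i + 1) % n]) for i in range(n))

def compactness(vertices):
    """K(z): area of z over the area of the circle with perimeter H(z)."""
    hull_perimeter = perimeter(convex_hull(vertices))
    return 4.0 * math.pi * polygon_area(vertices) / hull_perimeter ** 2

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
print(compactness(square), compactness(l_shape))
```

The square reproduces π/4, and the L-shaped polygon scores lower because its convex hull encloses area that the polygon itself does not cover.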
11Penalty function for the log of the likelihood
ratio (LLR(z)):
$K(z) \cdot LLR(z)$
Generalized compactness correction:
$K(z)^a \cdot LLR(z)$
a = 1: full compactness correction; a = 0.5:
medium compactness correction; a = 0: no
compactness correction
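A small sketch of how the exponent a changes which zone is preferred. The two zones and their K(z) and LLR(z) values are invented for illustration, and the function names are hypothetical.

```python
def penalized_llr(llr, compactness, a=1.0):
    """Compactness-corrected statistic K(z)**a * LLR(z)."""
    if not 0.0 <= compactness <= 1.0:
        raise ValueError("compactness K(z) must lie in [0, 1]")
    return compactness ** a * llr

# Hypothetical zones as (label, K(z), LLR(z)); the numbers are invented.
zones = [("compact", 0.8, 10.0), ("irregular", 0.3, 20.0)]

def best_zone(candidates, a):
    """Label of the zone with the highest penalized LLR for exponent a."""
    return max(candidates, key=lambda z: penalized_llr(z[2], z[1], a))[0]

for a in (1.0, 0.5, 0.0):
    print(a, best_zone(zones, a))
```

With full correction (a = 1) the compact zone wins (8.0 against 6.0); with medium or no correction the irregular zone takes over, which is the dependence on the penalty setting that the talk sets out to remove.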
12The Elliptic Scan Statistic (joint work with
Kulldorff, Huang and Pickle)
The scanning window has variable location, size,
shape and angle. A penalty function may be used.
13Breast Cancer Mortality Rates
15[Figure: circular cluster, penalty correction]
16[Figure: elliptical cluster, penalty correction]
17[Figure: irregular cluster, penalty correction]
18[Figure: irregular cluster, no penalty correction: disaster!]
19Extreme example of an irregularly shaped cluster
(joint work with Martin Kulldorff and Lan Huang)
20Homicide average 1998-2002, Minas Gerais State,
Brazil (homicides per 100,000 inhabitants per
year; 853 municipalities). Source: DATASUS. Map
by Ricardo Tavares.
[Map scale bar: 100 km]
21Genetic Algorithms (joint work with Cançado,
Takahashi and Bessegato)
- OBJECTIVE: Find a quasi-optimal solution for a
maximization problem.
- Initial population.
- Random crossing-over of parents and offspring
generation.
- Selection of children and parents for the next
generation.
- Random mutation.
- Repeat the previous steps for a predefined number
of generations or until there is no improvement in
the functional.
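The steps listed above can be sketched as a generic genetic algorithm. This is not the authors' zone-based algorithm: for brevity it evolves bit strings with one-point crossover, and every name and parameter value is an assumption chosen for illustration.

```python
import random

def genetic_maximize(fitness, n_bits, pop_size=30, generations=100,
                     mutation_rate=0.02, patience=20, seed=0):
    """Maximize fitness over bit strings of length n_bits (n_bits >= 2)."""
    rng = random.Random(seed)
    # Initial population.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    stale = 0
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            # Random crossing-over of two parents.
            mother, father = rng.sample(population, 2)
            cut = rng.randrange(1, n_bits)
            child = mother[:cut] + father[cut:]
            # Random mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        # Selection of children and parents for the next generation.
        population = sorted(population + children, key=fitness,
                            reverse=True)[:pop_size]
        # Stop early when the functional has stopped improving.
        if fitness(population[0]) > fitness(best):
            best, stale = population[0], 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best

# Toy functional: the number of ones in the string.
solution = genetic_maximize(sum, n_bits=20)
print(solution, sum(solution))
```

Keeping parents and children in the same selection pool makes the best value non-decreasing from one generation to the next, so the stopping rule based on lack of improvement is well defined.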
22 We minimize the graph-related operations by
means of fast offspring generation and
evaluation of Kulldorff's scan likelihood
ratio statistic. This algorithm is more than
ten times faster and exhibits less variance
than a similar approach using simulated
annealing, and thus gives better confidence
intervals for the Monte Carlo inference process
of significance evaluation for the most likely
cluster found.
25Incidence of Malaria Deaths in the Brazilian
Amazon (1998-2002)
27Initial population construction: Start at a region
of the map.
28Initial population construction: Add the neighbor
which forms the highest LLR 2-cell zone.
29Initial population construction: Add the neighbor
which forms the highest LLR 3-cell zone.
30Initial population construction: Add the neighbor
which forms the highest LLR 4-cell zone.
31Initial population construction: Stop. (It is
impossible to form a higher LLR 5-cell zone.)
32Initial population construction: Start at another
region of the map.
33Initial population construction: Add the neighbor
which forms the highest LLR 2-cell zone.
34Initial population construction: etc. Repeat the
previous steps for all the regions of the map.
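The greedy construction walked through in the slides above can be sketched as follows. The adjacency dictionary `neighbors` and the callable `score` (standing in for the LLR of a zone) are hypothetical inputs, and the toy score in the example is invented only to exercise the code.

```python
def grow_zone(start, neighbors, score):
    """Grow a connected zone from start, adding the best neighbor each step."""
    zone = {start}
    current = score(zone)
    while True:
        # Regions adjacent to the zone but not yet part of it.
        frontier = {n for region in zone for n in neighbors[region]} - zone
        if not frontier:
            break
        best = max(sorted(frontier), key=lambda n: score(zone | {n}))
        best_score = score(zone | {best})
        if best_score <= current:
            break  # no neighbor forms a higher-scoring zone: stop
        zone.add(best)
        current = best_score
    return zone

def initial_population(neighbors, score):
    """One greedily grown zone per starting region of the map."""
    return [grow_zone(region, neighbors, score) for region in neighbors]

# Toy map: four regions in a row, scored by an invented additive excess.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
excess = {0: 1, 1: 3, 2: 2, 3: -5}

def toy_score(zone):
    return sum(excess[region] for region in zone)

print(initial_population(neighbors, toy_score))
```

Because a region is only ever added from the frontier of the current zone, every zone in the initial population is connected by construction.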
35THE OFFSPRING GENERATION (a simple example)
36THE OFFSPRING GENERATION (a simple example)
37THE OFFSPRING GENERATION (a simple example)
38THE OFFSPRING GENERATION (a simple example)
Another possible numbering
39THE OFFSPRING GENERATION (a more sophisticated
example)
40One instance of two parent trees
41Advantages
- The offspring generation is very inexpensive.
- All the children zones are automatically
connected.
- Random mutations are easy to implement.
- The selection for the next generation is
straightforward.
- Fast evolution convergence.
- The variance between different test runs is
small.
42Population Evolution Performance
43Irregularly shaped clusters benchmark, Northeast
US counties map.
Duczmal L, Kulldorff M, Huang L. (2006)
Evaluation of spatial scan statistics for
irregularly shaped clusters. J. Comput. Graph.
Stat.
44 Power evaluation of the genetic algorithm,
compared to the simulated annealing algorithm.
45Cluster of high incidence of breast cancer. São
Paulo State, Brazil, 2002. Population adjusted
for age and under-reporting.
46Cluster of high incidence of breast cancer. São
Paulo State, Brazil, 2002. Population adjusted
for age and under-reporting.
Data source: DATASUS, G.L. Souza
Compactness correction: 1.0; cluster cases:
2,924; cluster population: 346,024; incidence:
0.00845; LLR: 298.9; p-value = 0.001
[Map scale bar: 0 to 100 km]
47Cluster of high incidence of breast cancer. São
Paulo State, Brazil, 2002. Population adjusted
for age and under-reporting.
Data source: DATASUS, G.L. Souza
Compactness correction: 0.5; cluster cases:
3,078; cluster population: 361,373; incidence:
0.00852; LLR: 343.8; p-value = 0.001
[Map scale bar: 0 to 100 km]
48Cluster of high incidence of breast cancer. São
Paulo State, Brazil, 2002. Population adjusted
for age and under-reporting.
Data source: DATASUS, G.L. Souza
Compactness correction: 0.0; cluster cases:
3,324; cluster population: 394,294; incidence:
0.00843; LLR: 449.6; p-value = 0.001
[Map scale bar: 0 to 100 km]
49Conclusions
- The genetic algorithm for disease cluster
detection is fast and exhibits less variance
than similar approaches.
- Its use in epidemiological studies
and syndromic surveillance is encouraged.
- The need for penalty functions for the
irregularity of cluster shape is clearly
demonstrated by the power evaluation tests.
- The power of detection of clusters is similar to
that of the simulated annealing algorithm.
- The flexibility of shape control gives the
practitioner more insight into the geographic
cluster delineation.
50Northeast US counties map with observed cases.
Age-adjusted female breast cancer, 1995.
Kulldorff M., Feuer E.J., Miller B.A., Freedman
L.S. (1997) Breast cancer clusters in the
Northeast United States: a geographic analysis.
American Journal of Epidemiology, 146:161-170.
[Map legend, percent below/above expected: > 20;
12 to 20; 4 to 12; -4 to 4; -12 to -4;
-20 to -12; < -20]
51The Gumbel parametric approximation to the log
likelihood ratio scan. Joint work with Cançado
and Takahashi. Based on the results of Abrams,
Kulldorff and Kleinman.
[Plot axis: LLR]
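A minimal sketch of how a Gumbel approximation can replace the empirical null distribution of the maximum LLR. The talk does not say how the parameters are estimated; the method of moments used here is an assumption, as are the function names and the sample values.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_gumbel(null_values):
    """Method-of-moments estimates (location, scale) of a Gumbel distribution."""
    scale = statistics.stdev(null_values) * math.sqrt(6.0) / math.pi
    location = statistics.mean(null_values) - EULER_GAMMA * scale
    return location, scale

def gumbel_p_value(observed, location, scale):
    """Upper-tail probability P(X >= observed) under the fitted Gumbel."""
    # 1 - exp(-exp(-t)), written with expm1 to keep precision in the far tail.
    return -math.expm1(-math.exp(-(observed - location) / scale))

# Invented maximum-LLR values from a handful of null simulations.
null_llr = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.8, 3.9, 4.9]
location, scale = fit_gumbel(null_llr)
print(gumbel_p_value(9.0, location, scale))
```

Unlike the rank-based Monte Carlo estimate, the fitted tail is not bounded below by one over the number of simulations, so far fewer null replications are needed to resolve small p-values.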
52Pareto Sets
The detection of irregularly shaped disease
clusters through multi-objective optimization.
53The genetic algorithm is used to maximize two
objectives:
- the scan statistic;
- the regularity of shape (compactness).
54Elite (red dots): each red dot is not surpassed
by any other point on all variables
simultaneously.
[Plot axes: compactness, log likelihood ratio]
58The Pareto surface is formed by joining the elite
points.
[Plot axes: compactness, log likelihood ratio]
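The elite set can be extracted with a simple non-dominance filter. The sketch below uses the usual Pareto convention (a point is dominated when another point is at least as good on both objectives and strictly better on one), which is an assumption about the exact rule used in the talk; the candidate values are invented.

```python
def dominates(p, q):
    """True if p is at least as good as q everywhere and better somewhere."""
    return (all(a >= b for a, b in zip(p, q))
            and any(a > b for a, b in zip(p, q)))

def pareto_set(points):
    """Points that no other point dominates (both objectives maximized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Invented candidate zones as (compactness, log likelihood ratio) pairs.
candidates = [(0.9, 5.0), (0.5, 12.0), (0.4, 10.0), (0.2, 20.0), (0.9, 4.0)]
print(pareto_set(candidates))
```

Here (0.4, 10.0) is dropped because (0.5, 12.0) beats it on both objectives, and (0.9, 4.0) is dropped because (0.9, 5.0) matches its compactness with a higher LLR. The quadratic scan over all pairs is adequate for populations of the size a genetic algorithm carries.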
66Null hypothesis: critical-value Pareto surface,
95th percentile (circles). 100 elites (from 100
simulations under the null hypothesis).
[Plot axes: compactness, log likelihood ratio]
67Power test: Pareto surface, 95th percentile under
the null hypothesis (red circles). 100 elites (from
100 simulations under the alternative hypothesis).
[Plot axes: compactness, log likelihood ratio]
76References
- Duczmal L, Kulldorff M, Huang L. (2006)
Evaluation of spatial scan statistics for
irregularly shaped clusters. J. Comput. Graph.
Stat., 15:2, 1-15.
- Duczmal L, Cançado ALF, Takahashi RHC, Bessegato
LF. (2006) A genetic algorithm for irregularly
shaped spatial scan statistics (submitted).
- Duczmal L, Cançado ALF, Takahashi RHC. (2006)
Delineation of irregularly shaped disease
clusters through multi-objective optimization
(submitted).
- Duczmal L, Assunção R. (2004) A simulated
annealing strategy for the detection of
arbitrarily shaped spatial clusters. Comp. Stat.
Data Anal., 45, 269-286.
- Kulldorff M, Huang L, Pickle L, Duczmal L.
(2005) An elliptic spatial scan statistic.
Statistics in Medicine (to appear).
- Patil GP, Taillie C. (2004) Upper level set scan
statistic for detecting arbitrarily shaped
hotspots. Envir. Ecol. Stat., 11, 183-197.
- Kulldorff M. (1997) A spatial scan statistic.
Comm. Statist. Theory Meth., 26(6), 1481-1496.
- Kulldorff M, Tango T, Park PJ. (2003) Power
comparisons for disease clustering tests. Comp.
Stat. Data Anal., 42, 665-684.
- Kulldorff M, Feuer EJ, Miller BA, Freedman LS.
(1997) Breast cancer clusters in the Northeast
United States: a geographic analysis. Amer. J.
Epidem., 146:161-170.
- de Souza Jr. GL. (2005) The detection of clusters
of breast cancer in São Paulo State, Brazil.
M.Sc. Dissertation, Univ. Fed. Minas Gerais.