Title: Summary of
1Summary of "A Spatial Scan Statistic" by M. Kulldorff
- Presented by Gauri S. Datta
- gauri_at_stat.uga.edu
- Mid-Year Meeting
- February 3, 2006
2Background
- Scan Statistic
- A tool to detect clusters in a point process
- Naus (1965, JASA) studied it in one dimension
- Tests whether a 1-dim point process is purely random
- Point Process
- Consider a time interval [a, b] and a window A = [t, t + w] of fixed width w
- µ(A) = number of e-mails that arrived in the time window A
- n(A) = n_A = number of junk e-mails = number of points
- Arrival times of junk e-mails define a point process
3Main Idea in Scan Statistic
- Move a window [t, t + w] of size w < b - a over a time interval [a, b]
- Over all possible values of t, record the maximum number of points in the window
- Compare this number with cutoff points under the hypothesis of a purely random (Poisson) process
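The three steps above can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the function names are hypothetical, and the cutoff is approximated by Monte Carlo under the assumption that, given the number of points n, a purely random process places them independently and uniformly on [a, b].

```python
import random

def scan_statistic(times, w):
    """Maximum number of points falling in any closed window [t, t + w]."""
    pts = sorted(times)
    best, left = 0, 0
    for right in range(len(pts)):
        # Advance the left end until pts[left..right] fits in a window of width w.
        while pts[right] - pts[left] > w:
            left += 1
        best = max(best, right - left + 1)
    return best

def scan_mc_p_value(times, a, b, w, n_sim=999, seed=0):
    """Monte Carlo p-value of the observed scan statistic.

    Assumption: under H0, conditional on n = len(times), the points are
    i.i.d. uniform on [a, b].
    """
    rng = random.Random(seed)
    observed = scan_statistic(times, w)
    n = len(times)
    exceed = 0
    for _ in range(n_sim):
        sim = [rng.uniform(a, b) for _ in range(n)]
        if scan_statistic(sim, w) >= observed:
            exceed += 1
    # Rank-based p-value: the observed data set counts as one replicate.
    return observed, (exceed + 1) / (n_sim + 1)
```

The two-pointer loop relies on the fact that the maximum over all t is attained by a window whose left edge sits on an observed point, so only n windows need to be checked.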
6Building block of Scan Test
- Repeated use of tests for equality of two Binomial or Poisson populations
- The two populations are defined by the scanning window A and its complement A^c
- As in multiple comparisons, these tests are dependent as one moves the scanning window
7Spatial Scan Statistic (SSS)
- Kulldorff (1997) used the SSS to detect clusters in a spatial point process
- SSS can be used
- In multi-dim point processes
- With variable window size
- With a baseline process that is an inhomogeneous Poisson process or a Bernoulli process
8SSS (continued)
- Scanning window can be any predefined shape
- SSS is defined on a geographical space G with a measure µ
- In a traditional point process, G is a line and µ is a uniform measure
- In 2-dim, G is a plane and µ is Lebesgue measure
10Examples
- Forestry
- Spatial clustering of trees.
- Want to look for clusters of a specific kind of tree after adjusting for the uneven spatial distribution of all trees
- µ(A) = total number of trees in region A
- n_A = number of trees of the specific kind in A
11Examples (continued)
- Epidemiology
- Interest in detecting geographical clusters of disease
- Need to adjust for uneven population density
- Rural vs. urban population
- For data aggregated into census districts, the measure is concentrated at the central coordinates of the districts
12Examples (continued)
- If interest is in space-time clusters of a disease, the measure will still be concentrated in the geographical region as in the prior example
- Adjusting for uneven population distribution is not always enough. Confounding factors should be taken into account. E.g., in epidemiology the measure can reflect the standardized expected incidence rate
13SS LR statistic
- For a fixed-size window, the scan statistic is the maximum number of points in the window at any given time/geographical region
- The test statistic is equivalent to the LR test statistic for testing H0: θ1 = θ2 vs. Ha: θ1 > θ2 (θ1, θ2 = rates inside and outside the window)
- Generalization to the LR test is important for variable windows
14Generalized SS Notation/Models
- G: geographical area / study space
- A: window ⊂ G
- N(A): random number of points in A
- N: a spatial point process
- Goal: to find the most prominent cluster
- Two useful models for the point process:
- (a) Bernoulli model
- (b) Poisson model
15Standard Models for SS
- For the Bernoulli model, the measure µ is such that µ(A) is an integer for all subsets A of G
- Two states (disease or no disease) for each unit
- Locations of the points define a point process
17LR Test Bernoulli Model
18LR Test Bernoulli Model
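Only the titles of these two slides survive in the transcript. As a reference point, the block below restates the Bernoulli-model likelihood ratio in its standard form from Kulldorff (1997), using the notation of the neighbouring slides: n_Z and n_G are the observed numbers of points in zone Z and in G, p and q are the probabilities inside and outside Z, and 𝒵 is the collection of candidate zones. It is a reconstruction, not the slide content, and should be checked against the paper.

```latex
\begin{align*}
L(Z,p,q) &= p^{\,n_Z}(1-p)^{\mu(Z)-n_Z}\;
            q^{\,n_G-n_Z}(1-q)^{(\mu(G)-\mu(Z))-(n_G-n_Z)},\\[4pt]
\hat p &= \frac{n_Z}{\mu(Z)}, \qquad
\hat q = \frac{n_G-n_Z}{\mu(G)-\mu(Z)},\\[4pt]
L(Z) &= \begin{cases}
          L(Z,\hat p,\hat q) & \text{if } \hat p > \hat q,\\
          L_0 & \text{otherwise,}
        \end{cases}\\[4pt]
L_0 &= \left(\frac{n_G}{\mu(G)}\right)^{n_G}
       \left(1-\frac{n_G}{\mu(G)}\right)^{\mu(G)-n_G},\\[4pt]
\lambda &= \frac{\sup_{Z\in\mathcal{Z}} L(Z)}{L_0}.
\end{align*}
```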
19Poisson Model
- Under the Poisson model, points are generated by an inhomogeneous Poisson process. There is exactly one zone Z ⊂ G s.t. N(A) ~ Po(pµ(A∩Z) + qµ(A∩Z^c)) for all A.
- Null hypothesis H0: p = q
- Alternative hypothesis H1: p > q, Z ∈ 𝒵 (the collection of candidate zones).
- Under H0, N(A) ~ Po(pµ(A)) for all A.
- The parameter Z disappears under H0
20Poisson Model (continued)
21Poisson Model (continued)
22Poisson Model (continued)
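The content of these three continuation slides is missing from the transcript. For orientation, the Poisson-model likelihood ratio in its standard form from Kulldorff (1997) is restated below in the same notation (n_Z, n_G observed counts; µ the measure; x_i the observed point locations). This is a reconstruction offered as a sketch, not the original slide text.

```latex
\begin{align*}
L(Z,p,q) &= \frac{e^{-p\mu(Z)-q(\mu(G)-\mu(Z))}}{n_G!}\;
            p^{\,n_Z}\,q^{\,n_G-n_Z}\prod_{i}\mu(x_i),\\[4pt]
\hat p &= \frac{n_Z}{\mu(Z)}, \qquad
\hat q = \frac{n_G-n_Z}{\mu(G)-\mu(Z)},\\[4pt]
L_0 &= \frac{e^{-n_G}}{n_G!}\left(\frac{n_G}{\mu(G)}\right)^{n_G}\prod_{i}\mu(x_i),\\[4pt]
\frac{L(Z)}{L_0} &=
  \begin{cases}
    \dfrac{\left(\dfrac{n_Z}{\mu(Z)}\right)^{n_Z}
           \left(\dfrac{n_G-n_Z}{\mu(G)-\mu(Z)}\right)^{n_G-n_Z}}
          {\left(\dfrac{n_G}{\mu(G)}\right)^{n_G}}
      & \text{if } \hat p > \hat q,\\[14pt]
    1 & \text{otherwise,}
  \end{cases}\\[4pt]
\lambda &= \sup_{Z\in\mathcal{Z}} \frac{L(Z)}{L_0}.
\end{align*}
```

The exponential factors cancel because p̂µ(Z) + q̂(µ(G) - µ(Z)) = n_G, which is why only the counts and the measure of each zone enter the ratio.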
23Choice of Zones
- How is 𝒵, the collection of zones, selected? Possibilities:
- (1) All circular subsets
- (2) All circles centered at any of several foci on a fixed grid, with a possible upper limit on size
- (3) Same as (2) but with a fixed size
- (4) All rectangles of fixed size and shape
- (5) If looking for space-time clusters, use cylinders scanning circular geographical areas over variable time intervals
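As an illustration of the second option (circles around fixed foci with an upper size limit), the sketch below enumerates zones for data aggregated at point locations. The function name, the argument layout, and the default limit of half the total measure are assumptions made for the example. Ties in distance are broken arbitrarily, which is a simplification.

```python
import math

def circular_zones(coords, measure, foci, max_fraction=0.5):
    """Yield (focus, radius, member_indices) for circles grown around each focus.

    coords  : list of (x, y) locations carrying the measure
    measure : measure attached to each location (e.g. population)
    foci    : list of (x, y) circle centres
    Only radii at which a new location enters the circle are generated,
    and growth stops once the zone would exceed max_fraction of the total.
    """
    total = sum(measure)
    for f in foci:
        # Visit locations in order of increasing distance from the focus.
        order = sorted(range(len(coords)), key=lambda i: math.dist(f, coords[i]))
        members, mass = [], 0.0
        for i in order:
            mass += measure[i]
            if mass > max_fraction * total:
                break
            members.append(i)
            yield f, math.dist(f, coords[i]), tuple(members)
```

Each focus produces a nested sequence of zones, so the collection stays finite even though the set of all circles is not.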
24Bernoulli vs. Poisson Model
- Choice between a Bernoulli or Poisson model does not matter much if n(G) << µ(G)
- In other cases, use the model most appropriate for the application
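A small numerical check can make the first point concrete. The sketch below evaluates the log likelihood ratio of a single zone under both models, using the standard Kulldorff (1997) forms. The function names and the example counts are hypothetical, and the code assumes 0 < µ(Z) < µ(G) with counts not exceeding the measure.

```python
import math

def _xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def poisson_llr(n_z, mu_z, n_g, mu_g):
    """Log likelihood ratio of zone Z under the Poisson model."""
    p_hat = n_z / mu_z
    q_hat = (n_g - n_z) / (mu_g - mu_z)
    if p_hat <= q_hat:  # H1 requires a higher rate inside the zone
        return 0.0
    return (_xlogy(n_z, p_hat) + _xlogy(n_g - n_z, q_hat)
            - _xlogy(n_g, n_g / mu_g))

def bernoulli_llr(n_z, mu_z, n_g, mu_g):
    """Log likelihood ratio of zone Z under the Bernoulli model."""
    p_hat = n_z / mu_z
    q_hat = (n_g - n_z) / (mu_g - mu_z)
    if p_hat <= q_hat:
        return 0.0

    def loglik(n, m):
        # Binomial log likelihood (without the constant) at the MLE n / m.
        r = n / m
        return _xlogy(n, r) + _xlogy(m - n, 1 - r)

    return loglik(n_z, mu_z) + loglik(n_g - n_z, mu_g - mu_z) - loglik(n_g, mu_g)

# Hypothetical counts: a rare outcome (200 cases in 100000 units) ...
rare = (20, 5000, 200, 100000)
# ... and a common one (20000 cases in 100000 units).
common = (2000, 5000, 20000, 100000)
gap_rare = abs(bernoulli_llr(*rare) - poisson_llr(*rare))
gap_common = abs(bernoulli_llr(*common) - poisson_llr(*common))
```

With these made-up counts the two statistics nearly coincide in the rare case and differ substantially in the common one, which is the sense in which the choice of model stops mattering when n(G) << µ(G).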
25A Useful Result
- An important result on the most likely cluster under these models is given in the paper (Theorem 1). It states that as long as the points within the zone constituting the most likely cluster are located where they are, H0 will be rejected irrespective of the locations of the other points in G. If a cluster is located in Seattle, the locations of the points on the East Coast of the U.S. do not matter
26Computations and MC
- To find the value of λ, we need to calculate the LR maximized over the collection of zones in H1. This seems like a daunting task, since the number of zones could be infinite.
- The number of observed points is finite
- For a fixed number of points, the likelihood decreases as µ(Z) increases
27Computations (contd)
- If the circle size increases for a fixed focus, the likelihood needs to be recalculated only when a new point enters the circle. For a finite number of points, the number of likelihood recalculations for each focus is finite.
- The distribution of λ is difficult to derive. MC simulation is used to generate a histogram of λ. Under H0, replicate the data sets conditional on n_G.
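The Monte Carlo step can be sketched as follows for aggregated data under the Poisson model. The helper names are hypothetical, zones are given as tuples of location indices, and the null replicates are drawn by allocating the n_G cases to locations with probabilities proportional to the measure, which is one way of conditioning on n_G.

```python
import math
import random

def poisson_llr(n_z, mu_z, n_g, mu_g):
    """Poisson-model log likelihood ratio of one zone (0 if no excess inside)."""
    if n_z == 0 or mu_z >= mu_g:
        return 0.0
    p_hat = n_z / mu_z
    q_hat = (n_g - n_z) / (mu_g - mu_z)
    if p_hat <= q_hat:
        return 0.0
    out = n_z * math.log(p_hat) - n_g * math.log(n_g / mu_g)
    if n_g > n_z:
        out += (n_g - n_z) * math.log(q_hat)
    return out

def max_llr(cases, measure, zones):
    """log(lambda): the largest log likelihood ratio over all candidate zones."""
    n_g, mu_g = sum(cases), sum(measure)
    best = 0.0
    for zone in zones:
        n_z = sum(cases[i] for i in zone)
        mu_z = sum(measure[i] for i in zone)
        best = max(best, poisson_llr(n_z, mu_z, n_g, mu_g))
    return best

def scan_p_value(cases, measure, zones, n_sim=999, seed=0):
    """Monte Carlo p-value of the observed log(lambda), conditional on n_G."""
    rng = random.Random(seed)
    n_g = sum(cases)
    observed = max_llr(cases, measure, zones)
    locations = range(len(measure))
    exceed = 0
    for _ in range(n_sim):
        # Null replicate: each case lands at location i with prob. measure[i] / mu(G).
        sim = [0] * len(measure)
        for i in rng.choices(locations, weights=measure, k=n_g):
            sim[i] += 1
        if max_llr(sim, measure, zones) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_sim + 1)
```

Because the maximum is recomputed over all zones in every replicate, the resulting p-value already accounts for the multiple, dependent tests performed while scanning.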
28Application of SSS to SIDS
- Bernoulli and Poisson models are illustrated using the SIDS data from NC
- For 100 counties in NC, total number of live births and number of SIDS cases for 1974-84.
- Live births range from 567 to 52345
- Locations of county seats are the coordinates. The measure is the number of live births in a county
29Application to SIDS (continued)
- Zones for the scanning window are circles centered at a county coordinate point and including at most half of the total population
- Zones are circular only wrt the aggregated data. As circles around a county seat are drawn, each other county will either be completely part of a zone or not at all, depending on whether its county seat is within the circle or not
30Bernoulli model for SIDS
- Bernoulli model is very natural. Each birth can correspond to at most one SIDS case. Table 1 summarizes the results of the analysis.
- From Figure 1, the most likely cluster, A, consists of Bladen, Columbus, Hoke, Robeson, and Scotland counties.
- Using a conservative test, a secondary cluster, B, consists of Halifax, Hertford and Northampton counties.
31Poisson model for SIDS
- For a rare disease such as SIDS, the Poisson model gives a close approximation to the Bernoulli model. Results are reported in Table 1
- Both models detect the same cluster
- P-values for the primary cluster are the same for both models; p-values for the secondary cluster are very close
32Application to SIDS (continued)
33Two significant clusters based on SSS
34SSS adjusted for Race
- For SIDS, one useful covariate is race
- Race is related to SIDS through unobserved covariates such as quality of housing and access to health care
- Overall incidence of SIDS is 1.512 per 1000 for white children and 2.970 per 1000 for black children.
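One way to fold these rates into the scan, assumed here rather than stated on the slide, is indirect standardization: replace the raw birth count of each county with the number of SIDS cases expected from its racial composition, and use that as the measure in the Poisson model. The function name and the per-county birth counts in the usage line are hypothetical. Only the two rates come from the slide.

```python
# Overall SIDS incidence per birth, from the slide.
RATE_WHITE = 1.512 / 1000
RATE_BLACK = 2.970 / 1000

def race_adjusted_measure(white_births, black_births):
    """Expected number of SIDS cases per county given its births by race.

    Assumption: the adjusted measure is the sum of race-specific rates
    times race-specific births (indirect standardization).
    """
    return [RATE_WHITE * w + RATE_BLACK * b
            for w, b in zip(white_births, black_births)]

# Hypothetical counties: same total births, different racial composition.
adjusted = race_adjusted_measure([9000, 5000], [1000, 5000])
```

Two counties with equal total births then carry different measure, so an excess of cases that is explained by composition alone no longer registers as a cluster.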
35SSS race-adjusted (continued)
- Racial distribution differs widely among the counties in NC
- This analysis leads to the same primary cluster (see Figure 2)
- The previous secondary cluster disappears, but a third cluster, C, emerges as a secondary cluster. Cluster C consists of a group of counties in the western part of the state
36Application to SIDS (continued)
37SSS to SIDS adjusted for race
38A Bayesian alternative to SSS
- Scott and Berger (2006): idea of Bayesian multiple testing.
- Observe Xj ~ N(µj, σ²), j = 1, …, M
- To determine which µj are nonzero, we have M (conditionally) independent tests, each testing H0j: µj = 0 vs. H1j: µj ≠ 0
- p0 = prior probability that µj is zero
- Crucial point here: let the data estimate p0.
- SB use the hierarchical model
- 1. Xj | µj, σ², γj ~ N(γj µj, σ²), independently
- 2. µj | τ² ~ i.i.d. N(0, τ²), γj | p0 ~ i.i.d. Bern(1 - p0)
- 3. (τ², σ²) ~ π(τ², σ²) ∝ (τ² + σ²)^(-2), p0 ~ π(p0)
- Several choices for π(p0): Uniform, Beta(α, 1)
- SB computed the posterior probability that γj = 1.
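To see what the posterior probability of γj = 1 looks like, the sketch below treats σ², τ² and p0 as known, which is a deliberate simplification: in the model above they have priors and are integrated out. With fixed hyperparameters, Xj is marginally N(0, σ²) when γj = 0 and N(0, σ² + τ²) when γj = 1, so the posterior follows from Bayes' rule. Function names are hypothetical.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2 * math.pi * var)

def posterior_nonzero(x, sigma2, tau2, p0):
    """P(gamma_j = 1 | X_j = x) for fixed sigma2, tau2 and p0.

    Simplifying assumption: hyperparameters are known rather than
    given the priors used in the full hierarchical model.
    """
    null = p0 * normal_pdf(x, sigma2)              # gamma_j = 0: mu_j plays no role
    alt = (1 - p0) * normal_pdf(x, sigma2 + tau2)  # gamma_j = 1: mu_j integrated out
    return alt / (null + alt)

# An observation near zero favours the null; a large one favours a signal.
near_zero = posterior_nonzero(0.0, sigma2=1.0, tau2=4.0, p0=0.5)
far_out = posterior_nonzero(5.0, sigma2=1.0, tau2=4.0, p0=0.5)
```

In the full model p0 is learned from all M observations, so the same value of Xj can receive a different posterior probability depending on how many other signals the data suggest.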
39Modification of SB Model
- Assume Xj ~ N(µj, σ²), j = 1, …, M
- To determine which µj are positive, we have M (conditionally) independent tests, each testing H0j: µj = 0 vs. H1j: µj > 0
- As before
- 1. Xj | µj, σ², γj ~ N(γj µj, σ²), independently
- 2. µj | µ(-j), ρ, τ² ~ N(ρ Σk qjk µk, τ²), CAR
- γj | pj ~ ind. Bern(1 - pj)
- 3. (τ², σ², ρ) ~ π(τ², σ², ρ) ∝ (τ² + σ²)^(-2)
- 4. CAR model on logit(pj)
- Compute the posterior probability of µj > 0.