A Bayesian Scan Statistic for Spatial Cluster Detection

1
A Bayesian Scan Statistic for Spatial Cluster Detection
Daniel B. Neill (1), Andrew W. Moore (1), Gregory F. Cooper (2)
(1) Carnegie Mellon University, School of Computer Science
(2) University of Pittsburgh, Center for Biomedical Informatics
{neill, awm}@cs.cmu.edu, gfc@cbmi.pitt.edu
2
Prospective disease surveillance

Nationwide disease surveillance data, aggregated by zip code:
Daily counts of OTC drug sales in 18 product categories from 20,000 retail stores (NRDM).
Daily counts of Emergency Department visits, grouped by syndrome.
Each day we want to answer the questions: what's happening, and where?
Are there any emerging clusters of symptoms that are worthy of further investigation?
Where are they, how large, and how serious?

National Retail Data Monitor: rods.health.pitt.edu
Goal: automatically detect emerging disease outbreaks, as quickly as possible, while keeping the number of false positives low.
3
Spatial cluster detection
Given: count c_i and baseline b_i for each zip code s_i.
Count c_i: e.g. number of Emergency Dept. visits, or over-the-counter drug sales, of a specific type.
Baseline b_i: at-risk population or expected count, inferred from historical data.
Our typical assumption: counts are aggregated to a uniform grid, and we search over the set of rectangles on the grid.
Does any spatial region S have sufficiently high counts c_i to be indicative of an emerging disease epidemic in that area?
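Searching over rectangles needs the aggregate count and baseline of every rectangle. The following is a minimal sketch of one common way to get them, using 2D prefix sums; the slides do not say how the sums are computed, so the function names and the toy 3 x 3 grid are illustrative assumptions.

```python
def prefix_sums(grid):
    """Return a (rows+1) x (cols+1) table of 2D prefix sums of grid."""
    rows, cols = len(grid), len(grid[0])
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            table[r + 1][c + 1] = (grid[r][c] + table[r][c + 1]
                                   + table[r + 1][c] - table[r][c])
    return table


def rect_sum(table, r0, c0, r1, c1):
    """Sum over rows r0..r1-1 and columns c0..c1-1 (half-open bounds)."""
    return table[r1][c1] - table[r0][c1] - table[r1][c0] + table[r0][c0]


def all_rectangles(rows, cols):
    """Yield every axis-aligned rectangle as (r0, c0, r1, c1), half-open."""
    for r0 in range(rows):
        for r1 in range(r0 + 1, rows + 1):
            for c0 in range(cols):
                for c1 in range(c0 + 1, cols + 1):
                    yield r0, c0, r1, c1


# Toy data (hypothetical): counts c_i and baselines b_i on a 3 x 3 grid.
counts = [[1, 2, 0], [0, 5, 1], [2, 1, 3]]
baselines = [[1.0, 1.5, 1.0], [1.0, 2.0, 1.0], [1.5, 1.0, 2.0]]
count_table = prefix_sums(counts)
base_table = prefix_sums(baselines)
regions = list(all_rectangles(3, 3))
```

After the two tables are built, each region's C(S) and B(S) cost four lookups, so the cost of the search is dominated by the number of rectangles rather than their size.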
4
The spatial scan statistic

The spatial scan statistic (Kulldorff, 1997) is a powerful method for spatial cluster detection:
Search over a given set of spatial regions.
Find those regions which are most likely to be clusters.
Correctly adjust for multiple hypothesis testing.
Problems with the spatial scan:
Difficult to incorporate prior knowledge: size and shape of outbreak? Impact on disease rate?
Computing statistical significance (p-values) by randomization requires searching a huge number of replica datasets.

Computationally infeasible for massive datasets!
5
The spatial scan statistic

Computing statistical significance (p-values) by randomization requires searching a huge number of replica datasets and comparing the results to the original. This is computationally infeasible for massive datasets!

Here we propose a Bayesian spatial scan statistic, which allows us to incorporate prior knowledge and, since randomization testing is unnecessary, is much more efficient to compute.
6
The generalized spatial scan

Obtain data for a set of spatial locations s_i.
Choose a set of spatial regions S to search.
Choose models of the data under the null hypothesis H_0 (no clusters) and the alternative hypotheses H_1(S) (cluster in region S).
Derive a score function F(S) based on H_1(S) and H_0.
Find the most anomalous regions (i.e. those regions S with the highest F(S)).
Determine whether each of these potential clusters is actually an anomalous cluster.

7
Population-based model (Kulldorff, 1997)

Each count c_i (number of cases in location s_i) is generated from a Poisson distribution with mean q_i b_i.
b_i represents the at-risk population, often estimated from census data.
q_i represents the disease rate.
Is there any region with disease rates significantly higher inside than outside?

[Figure: example region with q_out = .01 and q_in = .02]
8
The frequentist model
c_i ~ Poisson(q_i b_i)
Null hypothesis H_0 (no clusters): q_i = q_all everywhere.
Use the maximum likelihood estimate of q_all.
Alternative hypothesis H_1(S) (cluster in region S): q_i = q_in inside region S, q_i = q_out elsewhere.
Use the maximum likelihood estimates of q_in and q_out, subject to q_in > q_out.
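Plugging the maximum likelihood estimates q = C/B into the Poisson likelihoods gives the familiar closed-form likelihood ratio for this model. The sketch below computes its logarithm; the function names are illustrative, and it assumes region S is a proper subset of the search area so that the outside baseline is positive.

```python
import math


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kulldorff_log_score(c_in, b_in, c_all, b_all):
    """log F(S) for the population-based Poisson model.

    F(S) = (C_in/B_in)^C_in * (C_out/B_out)^C_out / (C_all/B_all)^C_all
    when C_in/B_in > C_out/B_out, and F(S) = 1 (log score 0) otherwise.
    Assumes 0 < b_in < b_all.
    """
    c_out, b_out = c_all - c_in, b_all - b_in
    q_in, q_out, q_all = c_in / b_in, c_out / b_out, c_all / b_all
    if q_in <= q_out:
        return 0.0  # constraint q_in > q_out is not satisfied
    return xlogy(c_in, q_in) + xlogy(c_out, q_out) - xlogy(c_all, q_all)
```

For example, a region with 20 cases on baseline 10, inside an area with 30 cases on baseline 30, has rate 2 inside versus 0.5 outside and log score 10 ln 2.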
9
The Bayesian hierarchical model
c_i ~ Poisson(q_i b_i)
Null hypothesis H_0 (no clusters): q_i = q_all everywhere.
q_all ~ Gamma(α_all, β_all)
Alternative hypothesis H_1(S) (cluster in region S): q_i = q_in inside region S, q_i = q_out elsewhere.
q_in ~ Gamma(α_in(S), β_in(S)), q_out ~ Gamma(α_out(S), β_out(S))
Gamma(α, β): μ = α / β, σ² = α / β²
Top two levels of the hierarchy are the same as in the frequentist model.
10
Frequentist approach:
Use likelihood ratio.
Use maximum likelihood parameter estimates.
Calculate statistical significance by randomization testing: compute the maximum F(S) for each of R = 1000 replica grids generated under H_0. p-value = (R_beat + 1) / (R + 1), where R_beat = number of replicas with max score > original region.
Bayesian approach:
Use posterior probability.
Use marginal likelihood (integrate over possible values of parameters).
No randomization testing necessary. Instead, normalize posterior probabilities by computing and dividing by the total data likelihood, P(Data) = P(Data | H_0) P(H_0) + Σ_S P(Data | H_1(S)) P(H_1(S)). This gives the probability of an outbreak in each region; sum these to get the total probability of an outbreak.
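The randomization p-value on the frequentist side is a one-line computation once the replica scores exist. A minimal sketch, assuming the maximum score of each replica grid has already been computed (the function name is illustrative):

```python
def randomization_p_value(original_max_score, replica_max_scores):
    """p-value = (R_beat + 1) / (R + 1).

    R is the number of replica grids generated under H_0 and R_beat is the
    number of replicas whose maximum score exceeds the original score.
    """
    r_beat = sum(1 for score in replica_max_scores if score > original_max_score)
    return (r_beat + 1) / (len(replica_max_scores) + 1)
```

The expensive part is not this formula but producing `replica_max_scores`: each entry requires a full search of one replica grid.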
11
Computing Bayesian likelihoods

Marginal likelihood approach: integrate over possible values of the disease rate parameters (q_in, q_out, q_all), weighted by prior probability.
Conjugate prior allows a closed-form solution.
Gamma prior, Poisson counts → negative binomial:

P(Data | α, β) ∝ β^α Γ(α + C) / ((β + B)^(α + C) Γ(α))

where C = Σ c_i, B = Σ b_i
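Integrating the Poisson likelihood against a Gamma(α, β) prior gives β^α Γ(α + C) / ((β + B)^(α + C) Γ(α)), up to a factor that depends only on the data and is the same under every hypothesis. The sketch below evaluates this in log space to avoid overflow; the function names are illustrative, and writing the H_1(S) likelihood as a product of an inside and an outside term assumes independent priors on q_in and q_out.

```python
import math


def log_marginal(c_total, b_total, alpha, beta):
    """log of beta^alpha * Gamma(alpha + C) / ((beta + B)^(alpha + C) * Gamma(alpha)).

    c_total is C (sum of counts) and b_total is B (sum of baselines).
    The data-only factor shared by all hypotheses is omitted.
    """
    return (alpha * math.log(beta)
            + math.lgamma(alpha + c_total)
            - (alpha + c_total) * math.log(beta + b_total)
            - math.lgamma(alpha))


def log_lik_h0(c_all, b_all, alpha_all, beta_all):
    """log Pr(Data | H_0): a single rate q_all everywhere."""
    return log_marginal(c_all, b_all, alpha_all, beta_all)


def log_lik_h1(c_in, b_in, c_out, b_out, alpha_in, beta_in, alpha_out, beta_out):
    """log Pr(Data | H_1(S)): rate q_in inside S and q_out outside S."""
    return (log_marginal(c_in, b_in, alpha_in, beta_in)
            + log_marginal(c_out, b_out, alpha_out, beta_out))
```

As a check, with α = β = 1 and C = B = 1 the expression is 1 · Γ(2) / (2² · Γ(1)) = 1/4.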
12
Obtaining priors

Choose prior outbreak probability P_1, and assume a uniform region prior: P(H_0) = 1 - P_1, P(H_1(S)) = P_1 / N_reg.
Choose parameter priors (α and β) by matching α/β and α/β² to the mean and variance of the observed disease rate q = C/B.
Assume the outbreak increases the rate by a multiplicative factor m.
Use a (discretized) uniform distribution for m: m ~ Uniform[1, 3].
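The prior choices above reduce to a few lines of arithmetic. This is a sketch under stated assumptions: the function names and the number of grid points for m are illustrative, and how m enters the prior on q_in is not spelled out on the slide, so it is not modeled here.

```python
def gamma_from_moments(mean, variance):
    """Solve alpha/beta = mean and alpha/beta^2 = variance for (alpha, beta)."""
    beta = mean / variance
    alpha = mean * beta
    return alpha, beta


def hypothesis_priors(p1, n_regions):
    """Return (P(H_0), P(H_1(S))) under a uniform region prior."""
    return 1.0 - p1, p1 / n_regions


def discretized_uniform(low=1.0, high=3.0, steps=5):
    """Equally spaced, equally weighted values of the multiplier m."""
    width = (high - low) / (steps - 1)
    return [low + k * width for k in range(steps)]
```

For instance, an observed disease rate with mean 2.0 and variance 0.5 gives α = 8 and β = 4.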
13
Computing the Bayesian statistic
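Do for all regions S:
From α_in(S), β_in(S), α_out(S), β_out(S) and C_in(S), B_in(S), C_out(S), B_out(S), compute Pr(Data | H_1(S)).
score(S) = Pr(Data | H_1(S)) × Pr(H_1(S)), where Pr(H_1(S)) is set from P_1.
From α_all, β_all and C_all, B_all, compute Pr(Data | H_0).
score(H_0) = Pr(Data | H_0) × Pr(H_0), where Pr(H_0) is set from P_1.
Pr(Data) = score(H_0) + Σ_S score(S).
Pr(H_0 | Data) = score(H_0) / Pr(Data), Pr(H_1(S) | Data) = score(S) / Pr(Data).
Report all regions with probability > P_thresh, OR sound the alarm if the total probability of outbreak > P_alarm.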
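The normalization step can be sketched as follows, assuming the log-likelihoods under H_0 and under each H_1(S) are already available. Working in log space with a max-shift is an implementation choice made here for numerical stability, not something the slides specify; the names are illustrative and 0 < P_1 < 1 is assumed.

```python
import math


def posteriors(log_lik_h0, log_lik_h1_by_region, p1):
    """Return (Pr(H_0 | Data), {S: Pr(H_1(S) | Data)}) with a uniform region prior."""
    n_reg = len(log_lik_h1_by_region)
    log_score_h0 = log_lik_h0 + math.log(1.0 - p1)
    log_scores = {region: ll + math.log(p1 / n_reg)
                  for region, ll in log_lik_h1_by_region.items()}
    # Shift by the largest log score before exponentiating (log-sum-exp).
    top = max([log_score_h0, *log_scores.values()])
    total = math.exp(log_score_h0 - top) + sum(
        math.exp(v - top) for v in log_scores.values())
    log_pr_data = top + math.log(total)
    post_h0 = math.exp(log_score_h0 - log_pr_data)
    post_regions = {region: math.exp(v - log_pr_data)
                    for region, v in log_scores.items()}
    return post_h0, post_regions


def detect(post_regions, p_thresh, p_alarm):
    """Regions above P_thresh, and whether the total outbreak probability exceeds P_alarm."""
    reported = [region for region, p in post_regions.items() if p > p_thresh]
    alarm = sum(post_regions.values()) > p_alarm
    return reported, alarm
```

Because the posteriors of H_0 and all H_1(S) sum to one, the total outbreak probability equals 1 - Pr(H_0 | Data).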
14
Testing detection power

We use a semi-synthetic testing framework, in which we inject simulated respiratory outbreaks into real baseline ED and OTC data (assumed to have no outbreaks).

Baseline data:
1 year of ED data from Allegheny County, 2002
1 year of OTC data from Allegheny County, 2004-2005
Simulated outbreaks (injected into the baseline):
BARD-simulated anthrax cases
Fictional Linear Onset Outbreak (FLOO)
15
Testing methodology

On baseline data without injected cases, compute the score F for each day.
For each simulated outbreak:
Inject the outbreak into the baseline data.
Compute the score F for each day of the outbreak.
For each day of the outbreak (t = 1..T_outbreak), compute the fraction of baseline days with scores higher than the maximum score of outbreak days 1 through t. This is the proportion of false positives we would have to accept to detect that outbreak by day t.
Average over multiple outbreaks to obtain an AMOC curve (avg. days to detect outbreak vs. false positive rate).
Also compute the proportion of outbreaks detected and days to detect at various false positive rates (e.g. 1/month).

Frequentist: F = max_S F(S). Bayesian: F = Σ_S Pr(H_1(S) | Data).
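The per-outbreak part of this procedure can be sketched as below, assuming the daily scores F are already computed for the baseline days and for the days of one injected outbreak. The function names are illustrative, and the handling of outbreaks that are never detected when averaging is left open because the slide does not specify it.

```python
def false_positive_fractions(baseline_scores, outbreak_scores):
    """For each outbreak day t, the fraction of baseline days whose score is
    higher than the maximum score of outbreak days 1 through t."""
    fractions = []
    running_max = float("-inf")
    for score in outbreak_scores:
        running_max = max(running_max, score)
        beaten = sum(1 for b in baseline_scores if b > running_max)
        fractions.append(beaten / len(baseline_scores))
    return fractions


def days_to_detect(fractions, allowed_fp_rate):
    """First outbreak day (1-based) detectable at the allowed false positive
    rate, or None if the outbreak is never detected at that rate."""
    for day, fraction in enumerate(fractions, start=1):
        if fraction <= allowed_fp_rate:
            return day
    return None
```

With baseline scores [1, 2, 3, 4] and outbreak scores [2.5, 1.0, 5.0], half of the baseline days outscore the outbreak on days 1 and 2, and none do by day 3.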
16
Why Bayesian? (part 1)

Better detection power!

[Table: average days to detect at 1 false positive/month, frequentist vs. Bayesian, by outbreak type]
Bayesian better for 6 of 7 outbreaks, by an average of 0.22 days.
Should do even better with an informative prior!
17
Why Bayesian? (part 2)

It's fast! Because no randomization testing is necessary, we can search approximately 1000x faster than the (naïve) frequentist approach.
For a 128 x 128 grid: 44 minutes (Bayesian) vs. 31 days (naïve frequentist).
For small to moderate grid sizes, this is even faster than the fast spatial scan (Neill and Moore, 2004).
For a 128 x 128 grid: 44 minutes (Bayesian) vs. 77 minutes (fast spatial scan).
For larger grid sizes, the fast spatial scan wins.
For a 256 x 256 grid: 10 hours (fast spatial scan) vs. 12 hours (Bayesian).
18
Why Bayesian? (part 3)

Easier to interpret results (posterior probability of an outbreak, distribution over possible regions).
Easier to visualize.
Easier to calibrate (by setting prior probability P_1).
Easier to combine evidence from multiple detectors, by modeling the joint distribution.

[Figure: cell color is based on the posterior probability of outbreak in that cell, ranging from white (0%) to black (100%). Red rectangle represents the most likely region.]
Total posterior probability of outbreak: 86.61%. Maximum region outbreak probability: 12.27%. Maximum cell outbreak probability: 86.57%.
19
Making the spatial scan fast
256 x 256 grid = 1 billion rectangular regions!
Naïve frequentist scan: 1000 replicas x 12 hrs / replica = 500 days!
Fast frequentist scan (Neill and Moore, KDD 2004): 1000 replicas x 36 sec / replica = 10 hrs
Naïve Bayesian scan (Neill, Moore, and Cooper, NIPS 2005): 12 hrs (to search original grid)
Fast Bayesian scan??
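The figures on this slide can be checked with a few lines of arithmetic. The sketch assumes the search covers every axis-aligned rectangle of every size on the grid, which matches the "1 billion" figure; the names are illustrative.

```python
def count_rectangles(n):
    """Number of axis-aligned rectangles on an n x n grid.

    Each axis contributes n * (n + 1) / 2 choices of (start, end) interval.
    """
    per_axis = n * (n + 1) // 2
    return per_axis ** 2


regions_256 = count_rectangles(256)            # about 1.08 billion
naive_frequentist_days = 1000 * 12 / 24        # 1000 replicas x 12 hrs each
fast_frequentist_hours = 1000 * 36 / 3600      # 1000 replicas x 36 sec each
```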
