Title: A Nonparametric Scan Statistic for Multivariate Disease Surveillance
1A Nonparametric Scan Statistic for Multivariate Disease Surveillance
Daniel B. Neill
Carnegie Mellon University, Heinz School of Public Policy
E-mail: neill_at_cs.cmu.edu
Joint work with Jeff Lingwall (Heinz School,
CMU). Partially supported by NSF grant
IIS-0325581 and CDC grant 1 R01 PH000028-01.
2Multivariate disease surveillance
Time series of counts $c_{i,m}^t$ for each zip code $s_i$
and each data stream $d_m$:
$d_1$: respiratory ED
$d_2$: constitutional ED
$d_3$: OTC cough/cold
$d_4$: OTC anti-fever
etc.
Daily data feeds from over 20,000 hospitals and
pharmacies nationwide.
Given all of this nationwide health data on a
daily basis, we want to obtain a complete
situational awareness by integrating information
from the multiple data streams.
More precisely, we have three main goals: to
detect any emerging outbreaks of disease, to
characterize the type of outbreak, and to
pinpoint the affected areas.
3Expectation-based scan statistics
(Kulldorff, 1997; Neill and Moore, 2005)
To detect and localize outbreaks, we can search
for spatial regions where the counts are
significantly higher than expected.
Imagine moving a space-time window around the
scan area, allowing the window size, shape, and
duration to vary.
[Figure: historical counts and current counts (3-day duration)]
For each of these regions, we compare the current
counts for each location to the time series of
historical counts for that location.
7Parametric scan statistics
For the standard scan statistic approach, we
assume that each count is drawn from a Poisson
distribution with unknown mean.
We perform time series analysis to find the
expected counts for each recent day, then compare
actual to expected counts.
[Figure: historical counts, expected counts, and current counts (3-day duration) for each location]
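A minimal sketch of the comparison of actual to expected counts. The slides do not specify the time series method or give the score formula, so two assumptions are made here: the expected count is a plain moving average of recent history (hypothetical helper `moving_average_baseline`), and the score is the log-likelihood ratio commonly used for the expectation-based Poisson statistic, F = C log(C/B) + B - C when the aggregate count C exceeds the aggregate expected count B, and 0 otherwise. The counts are made-up illustration data.

```python
import math

def moving_average_baseline(history, window=28):
    """Expected count: mean of the most recent `window` historical counts (assumption)."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def ebp_score(counts, baselines):
    """Expectation-based Poisson score for one region.

    counts and baselines hold the current and expected counts of the
    locations in the region; they are aggregated before scoring.
    """
    c = sum(counts)
    b = sum(baselines)
    if b <= 0 or c <= b:
        return 0.0  # no excess over expectation, so no evidence of an outbreak
    return c * math.log(c / b) + b - c

# Made-up data: two zip codes, eight days of history each, and today's counts.
histories = {"zip_a": [10, 12, 9, 11, 10, 12, 8, 10],
             "zip_b": [4, 5, 3, 6, 4, 5, 4, 5]}
current = {"zip_a": 19, "zip_b": 9}

baselines = {z: moving_average_baseline(h, window=8) for z, h in histories.items()}
score = ebp_score(current.values(), baselines.values())
print(baselines, round(score, 3))
```

In a full scan, this score would be evaluated for every candidate space-time region rather than for a single fixed one.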
Similarly, we can compute a Gaussian scan
statistic by obtaining the expectations and
variances from historical data.
9Parametric scan statistics
In either case, we find the regions with highest
values of a likelihood ratio statistic, and
compute the statistical significance of each
region by randomization testing.
Null hypothesis: no outbreak
Alternative hypothesis: outbreak in region S
Maximum region score = 9.8: significant (p = .013)
2nd highest score = 8.4: not significant (p = .098)
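Randomization testing, as described above, can be sketched as follows. This assumes the usual Monte Carlo convention in which the p-value is the rank of the observed score among the maximum region scores of replica datasets generated under the null hypothesis. The slides do not state the exact formula or the number of replicas, and the replica scores below are made up.

```python
def randomization_p_value(observed_score, replica_max_scores):
    """Monte Carlo p-value of a region score.

    replica_max_scores: the maximum region score found in each replica
    dataset generated under the null hypothesis of no outbreak.
    """
    beat = sum(1 for s in replica_max_scores if s >= observed_score)
    return (beat + 1) / (len(replica_max_scores) + 1)

# Made-up maximum scores from nine null replicas (a real test would use many more).
replica_max_scores = [3.1, 4.7, 5.2, 6.0, 7.9, 8.8, 9.1, 10.4, 5.5]

print(randomization_p_value(9.8, replica_max_scores))
print(randomization_p_value(8.4, replica_max_scores))
```

Comparing every region, including secondary ones, against the replica maxima keeps the test conservative for regions other than the top-scoring one.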
Parametric scan statistic approaches assume some
parametric model for the distribution of counts,
and learn the parameters from historical data.
Performance degrades when models are incorrect.
How to combine multiple data streams sensibly?
10The nonparametric scan statistic
Rather than assuming a parametric distribution
and learning the mean and variance parameters
from past counts, NPSS compares the current
counts to the entire empirical distribution of
historical counts.
Simple assumption: under $H_0$, all counts for a
given location and data stream are drawn
independently from the same distribution.
In this case, the proportion of historical counts
that are greater than the current count $c_{i,m}^t$ will be
asymptotically uniformly distributed on $[0, 1]$.
Compute the empirical p-value $P_{i,m}^t$ corresponding
to each current count $c_{i,m}^t$:
[Figure: historical counts and current counts (3-day duration)]
$P_{i,m}^t = \frac{T_{beat} + 1}{T + 1}$
where $T_{beat}$ is the number of historical counts greater than $c_{i,m}^t$
and $T$ is the total number of historical counts.
Under $H_0$, $P_{i,m}^t \sim U[0, 1]$.
Under $H_1(S)$, the counts in region S will be
higher than expected under $H_0$, and thus the
empirical p-values will be lower than expected.
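The empirical p-value computation can be sketched in a few lines. The function name and the historical counts are illustrative assumptions; the formula is the (T_beat + 1) / (T + 1) ratio given on the slide.

```python
def empirical_p_value(current, history):
    """Empirical p-value of a current count against its own history.

    t_beat is the number of historical counts strictly greater than the
    current count; the +1 terms keep the p-value strictly positive.
    """
    t_beat = sum(1 for h in history if h > current)
    return (t_beat + 1) / (len(history) + 1)

# Made-up history of ten daily counts for one location and data stream.
history = [3, 5, 4, 6, 2, 5, 4, 3, 7, 4]

for current in (4, 6, 9):
    print(current, empirical_p_value(current, history))
```

A typical count gets a mid-range p-value, while a count higher than anything in the history gets the smallest possible value, 1 / (T + 1).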
12The nonparametric scan statistic
We search for regions (D, S, W) with a
surprisingly large number of low empirical
p-values.
$D$: subset of data streams. $S$: set of spatial
locations. $W$: number of days.
Total number of empirical p-values in region: $N = |D| \times |S| \times W$.
How many low empirical p-values ($P_{i,m}^t < \alpha$) do we
expect under $H_0$?
Let $N_\alpha = \#\{P_{i,m}^t < \alpha\}$. Then $N_\alpha \sim \mathrm{Binomial}(N, \alpha)$,
with mean $N\alpha$ and variance $N\alpha(1 - \alpha)$.
Following Donoho and Jin (2004), we define the
higher criticism statistic
$F(D, S, W) = \max_\alpha \frac{N_\alpha - N\alpha}{\sqrt{N\alpha(1 - \alpha)}}$.
We find the multivariate space-time regions (D,
S, W) with highest scores F(D, S, W), and compute
statistical significance by randomization testing.
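As a rough illustration of the higher criticism score for one region, the sketch below counts the empirical p-values below each threshold and standardizes the excess by the binomial mean and standard deviation. Maximizing over a small fixed grid of thresholds (`alphas`) is an assumption made here for simplicity; the slides do not say which threshold values are searched. The p-values are made up.

```python
import math

def higher_criticism(p_values, alphas=(0.01, 0.05, 0.10, 0.20)):
    """Higher criticism score of a region from its N empirical p-values.

    For each threshold alpha, the count of p-values below alpha is compared
    with its null mean N*alpha, in units of the null standard deviation.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("region contains no p-values")

    def standardized_excess(alpha):
        n_alpha = sum(1 for p in p_values if p < alpha)
        return (n_alpha - n * alpha) / math.sqrt(n * alpha * (1 - alpha))

    return max(standardized_excess(a) for a in alphas)

# Made-up region of 10 p-values: four are very low, six are unremarkable.
affected = [0.005] * 4 + [0.5] * 6
unaffected = [0.3, 0.6, 0.9]

print(higher_criticism(affected), higher_criticism(unaffected))
```

A region with many low p-values scores high, while a region with none scores slightly below zero.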
13The nonparametric scan statistic
Advantages of the nonparametric scan statistic (NPSS):
No parametric model assumptions.
Can easily combine information from multiple streams
and identify which subset of streams is affected.
Randomization testing is easy (draw each $P_{i,m}^t \sim U[0, 1]$).
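The ease of randomization testing can be illustrated with a short sketch: null replicas are generated by drawing uniform p-values directly, with no need to simulate counts. For brevity this scores a single fixed region of `n` p-values; a full scan would instead take the maximum score over all regions in each replica. The threshold grid, replica count, and seed are assumptions.

```python
import math
import random

def higher_criticism(p_values, alphas=(0.01, 0.05, 0.10, 0.20)):
    """Higher criticism score, maximized over a fixed grid of thresholds (assumption)."""
    n = len(p_values)
    return max(
        (sum(1 for p in p_values if p < a) - n * a) / math.sqrt(n * a * (1 - a))
        for a in alphas
    )

def npss_randomization_p_value(observed_score, n, replicas=999, seed=0):
    """Significance of a region score under H0 via uniform null p-values."""
    rng = random.Random(seed)
    beat = 0
    for _ in range(replicas):
        null_p_values = [rng.random() for _ in range(n)]  # each P ~ U[0, 1] under H0
        if higher_criticism(null_p_values) >= observed_score:
            beat += 1
    return (beat + 1) / (replicas + 1)

# Made-up region: 20 p-values, six of them very low.
region = [0.004] * 6 + [0.35, 0.5, 0.62, 0.8] * 3 + [0.45, 0.9]
observed = higher_criticism(region)
print(observed, npss_randomization_p_value(observed, n=len(region)))
```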
However, NPSS assumes that all of the counts for
a given time series are drawn from the same
(unknown) distribution, which will not be true if
the time series is nonstationary.
Solution: use the standardized residuals
$r_{i,m}^t = (c_{i,m}^t - b_{i,m}^t) / \sqrt{b_{i,m}^t}$, where the expected
counts $b_{i,m}^t$ are inferred by time series analysis.
This method also assumes that counts are drawn
independently, but we can extend it to deal with
auto- and cross-correlations.
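The standardized-residual adjustment described above can be sketched as follows. The expected counts are supplied directly as made-up numbers, since the slides do not specify the time series model used to infer them. The empirical p-value is then computed on residuals instead of raw counts.

```python
import math

def standardized_residual(count, expected):
    """Residual (c - b) / sqrt(b) for a count c with expected count b."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (count - expected) / math.sqrt(expected)

def empirical_p_value(current, history):
    """(T_beat + 1) / (T + 1), applied here to residuals rather than counts."""
    t_beat = sum(1 for h in history if h > current)
    return (t_beat + 1) / (len(history) + 1)

# Made-up nonstationary series: counts trend upward, and so do expectations.
past_counts = [10, 12, 15, 17, 21, 24]
past_expected = [10.0, 12.5, 15.0, 17.5, 20.0, 22.5]
past_residuals = [standardized_residual(c, b)
                  for c, b in zip(past_counts, past_expected)]

today = standardized_residual(36, 25.0)
print(today, empirical_p_value(today, past_residuals))
```

Because the trend is absorbed by the expected counts, today's count is compared with past deviations from expectation rather than with past raw counts, which would all look low.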
14Evaluation framework
We compared the nonparametric scan to the
standard expectation-based Poisson statistic
(EBP) on a variety of spatial disease
surveillance tasks.
We used five different datasets: four datasets
from Allegheny County, PA (respiratory ED visits,
OTC cough/cold sales, OTC antifever, OTC
thermometers) and simulated biological sensor
data.
Binary sensors triggered with Pr = 0.5 if
outbreak, Pr = 0.2 if no outbreak.
Experiment 1: univariate detection power for
simulated outbreaks of varying size injected into
each dataset.
Experiment 2: multivariate detection power for
simulated outbreaks of varying size injected
simultaneously into the three OTC datasets.
We computed each method's detection rate and days
to detect at 1 false positive per month.
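One plausible way to fix the false positive rate is sketched below; the slides do not describe the exact procedure, so the whole approach is an assumption. The alarm threshold is chosen from daily maximum scores on outbreak-free data so that it is exceeded about once per 30 days. Days to detect is then the first outbreak day whose score exceeds that threshold. Function names and scores are hypothetical.

```python
def threshold_for_fp_rate(null_daily_scores, fp_per_month=1.0, days_per_month=30):
    """Score threshold exceeded on at most the allowed number of null days."""
    allowed = int(len(null_daily_scores) * fp_per_month / days_per_month)
    ranked = sorted(null_daily_scores, reverse=True)
    # ranked[allowed] is strictly exceeded by at most `allowed` null days.
    return ranked[min(allowed, len(ranked) - 1)]

def days_to_detect(outbreak_daily_scores, threshold):
    """1-based day of the first score above threshold, or None if undetected."""
    for day, score in enumerate(outbreak_daily_scores, start=1):
        if score > threshold:
            return day
    return None

# Made-up scores: 60 outbreak-free days, then one simulated outbreak.
null_daily_scores = list(range(60))
threshold = threshold_for_fp_rate(null_daily_scores)
print(threshold, days_to_detect([10, 50, 58, 70], threshold))
```

The detection rate would then be the fraction of simulated outbreaks for which `days_to_detect` returns a day rather than `None`.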
15Results univariate
At a fixed false positive rate of 1/month, the
nonparametric scan performed comparably to the
parametric method for the four traditional data
sources (0.5 days faster for ED, 0.3 days slower
for OTC), but also detected 0.7 days faster for
the biological sensor data.
16Results multivariate
The multivariate NPSS method showed further gains
in detection power for outbreaks that
simultaneously affected the three streams of OTC
data, detecting in 3.71 days at 1 fp/month (as
compared to 4.12, 4.81, and 5.34 days for the
three univariate NPSS detectors).
17More (preliminary) results
Comparing the nonparametric scan statistic to the
parametric multivariate scan statistic (Kulldorff
et al., 2007), we see comparable detection power
on OTC datasets.
However, the nonparametric scan statistic was
better able to identify which subset of streams
was affected. It also demonstrated better
calibration on the OTC data (5% false positive rate
at $\alpha = .05$, as compared to 20-40%). On the other hand, NPSS
tended to underestimate the outbreak size and
detect only the most affected locations.
Further comparisons are needed, on a wider range
of multivariate datasets, to confirm and quantify
these results.
18Some open questions
What is the best way to compute the empirical
p-values?
Dealing with tied counts: compute E[F] using
upper and lower p-values.
Dealing with small N: don't use the normal
approximation to the binomial.
Dealing with high counts: use kernel density
estimation.
What is the best way to combine the empirical
p-values?
Many other nonparametric statistics can be
defined, including the Kolmogorov-Smirnov,
Cramer-von Mises, Anderson-Darling, and
Berk-Jones statistics.
Berk-Jones has particular appeal since it can be
interpreted as a log-likelihood ratio: under $H_1$,
p-values are uniformly generated on $[0, k]$ with
some probability $> k$, and uniformly generated on
$[k, 1]$ otherwise. This may make a Bayesian scan
statistic possible.
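For comparison with higher criticism, a Berk-Jones score can be sketched as N times the Kullback-Leibler divergence between the observed fraction of p-values below a threshold and the fraction expected under the null. This is the standard log-likelihood-ratio form of the statistic. The one-sided restriction (scoring only an excess of low p-values) and the fixed threshold grid are assumptions made here, not details taken from the slides.

```python
import math

def bernoulli_kl(x, y):
    """KL divergence between Bernoulli(x) and Bernoulli(y), for 0 < y < 1."""
    total = 0.0
    if x > 0:
        total += x * math.log(x / y)
    if x < 1:
        total += (1 - x) * math.log((1 - x) / (1 - y))
    return total

def berk_jones(p_values, alphas=(0.01, 0.05, 0.10, 0.20)):
    """One-sided Berk-Jones score, maximized over a grid of thresholds."""
    n = len(p_values)
    best = 0.0
    for alpha in alphas:
        fraction_low = sum(1 for p in p_values if p < alpha) / n
        if fraction_low > alpha:  # only an excess of low p-values counts
            best = max(best, n * bernoulli_kl(fraction_low, alpha))
    return best

# Made-up region of 10 p-values: four are very low, six are unremarkable.
affected = [0.005] * 4 + [0.5] * 6
print(berk_jones(affected), berk_jones([0.3, 0.6, 0.9]))
```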
19Conclusions
The nonparametric scan statistic has high
univariate detection power and improved
calibration, especially for datasets that do not
fit the standard parametric assumptions.
By combining multiple data streams in a
principled manner, the NPSS method achieves high
multivariate detection power, and also enables
outbreak characterization by accurately
identifying the affected streams.
Unlike our recently proposed multivariate
Bayesian scan statistic (MBSS), NPSS cannot model
and differentiate between multiple potential
causes of an outbreak.
Instead, it is a general multivariate anomaly
detector that complements MBSS by detecting and
characterizing relevant patterns, without relying
on parametric model assumptions.