Title: Statistical Change Detection for Multi-Dimensional Data
1. Statistical Change Detection for Multi-Dimensional Data
Presented by Xiuyao Song
Authors: Xiuyao Song, Mingxi Wu, Chris Jermaine, Sanjay Ranka
2. Motivation example: antibiotic resistance pattern
- Culture a specimen of E. coli bacteria, then test its resistance rate to multiple antibiotic drugs.
- Typical data could look like this (R = resistant, S = susceptible, U = undetermined):
  drug1  drug2  drug3  drug4
  R      S      U      R
- We have a baseline data set and a recently observed data set.
- Question: does E. coli show a different resistance pattern recently?
- If a change is detected, it might be caused by the presence of new E. coli strains. We will raise an alarm for further investigation.
3. Problem definition
Multi-dimensional space
data set S: baseline data
data set S': recently observed data
Question: F_S = F_S' ?
4. Related work
- For uni-dimensional data, many tests exist, such as the K-S test and the chi-square test.
- Only two tests detect a generic distributional change in multi-dimensional space:
  - Kdq-tree by Dasu et al.: relies on a discretization scheme and suffers from the curse of dimensionality.
  - Cross-match by Rosenbaum: computationally expensive due to the maximum matching algorithm.
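For the uni-dimensional case mentioned above, the two-sample K-S statistic is the largest gap between the two empirical CDFs. A minimal Python sketch, for illustration only (the function name `ks_statistic` is hypothetical, and no p-value is computed):

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

This only works on one coordinate at a time, which is why it does not directly answer the multi-dimensional question.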
5. Hypothesis test framework
data set S
data set S'
Null hypothesis H0: F_S = F_S'
Null distribution ?
6. Density test: high-level overview
data set S
data set S'
Step 1: Gaussian kernel density estimate of S1, denoted K_S1.
Step 2: define and calculate the test statistic.
Step 3: derive the null distribution.
Step 4: calculate the critical value and make a decision.
7. Step 1: Kernel Density Estimate (KDE), bandwidth selection
- Plug-in bandwidth: asymptotically efficient, but not accurate.
- Data-driven bandwidth: converges better to the true distribution.
Effect of the bandwidth: the correctness of the density test is always guaranteed; the power of the test increases when the estimate is accurate.
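To make the role of the bandwidth concrete, here is a minimal sketch of a Gaussian product-kernel KDE with one bandwidth per dimension, using Scott's rule as an example of a plug-in bandwidth. The function names are hypothetical and the diagonal-bandwidth form is an assumption made for illustration; it is not taken from the slides.

```python
import math
import statistics


def scott_bandwidth(points):
    """Scott's plug-in rule: h_k = sigma_k * n^(-1/(d+4)) for each dimension k."""
    n, d = len(points), len(points[0])
    factor = n ** (-1.0 / (d + 4))
    return [statistics.stdev([p[k] for p in points]) * factor for k in range(d)]


def kde_log_density(x, centers, h):
    """Log of the KDE at x: the average of Gaussian kernels centered on `centers`."""
    d = len(h)
    log_norm = -sum(math.log(hk) for hk in h) - 0.5 * d * math.log(2 * math.pi)
    exps = [
        -0.5 * sum(((xk - ck) / hk) ** 2 for xk, ck, hk in zip(x, c, h))
        for c in centers
    ]
    m = max(exps)  # log-sum-exp trick to avoid underflow far from the data
    return log_norm + m + math.log(sum(math.exp(e - m) for e in exps)) - math.log(len(centers))
```

A bandwidth that is too large smears the estimate and one that is too small makes it spiky; either way the test stays valid but loses power, which motivates the data-driven choice.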
8. Choose bandwidth by MLE/EM (maximum likelihood estimation / expectation maximization)
kernel
adding constraint
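One possible reading of an MLE/EM bandwidth choice, sketched here under stated assumptions rather than as the authors' exact procedure: treat the KDE as a Gaussian mixture whose centers are the data points, with equal weights and a shared per-dimension bandwidth, and maximize the leave-one-out likelihood with EM. Leaving each point out of its own mixture is the assumption that keeps the bandwidth from collapsing to zero; the data points are assumed distinct. The names `loo_log_likelihood` and `em_step` are hypothetical.

```python
import math


def _kernel_exponents(x, others, h):
    # Exponent of the Gaussian kernel between x and every other point
    return [-0.5 * sum(((a - b) / hk) ** 2 for a, b, hk in zip(x, c, h)) for c in others]


def loo_log_likelihood(points, h):
    """Leave-one-out log-likelihood of the KDE with per-dimension bandwidth h."""
    n, d = len(points), len(h)
    log_norm = -sum(math.log(hk) for hk in h) - 0.5 * d * math.log(2 * math.pi)
    total = 0.0
    for i, x in enumerate(points):
        others = points[:i] + points[i + 1:]
        exps = _kernel_exponents(x, others, h)
        m = max(exps)
        total += log_norm + m + math.log(sum(math.exp(e - m) for e in exps)) - math.log(n - 1)
    return total


def em_step(points, h):
    """One EM update of the shared bandwidth (centers and weights stay fixed)."""
    n, d = len(points), len(h)
    sq = [0.0] * d
    for i, x in enumerate(points):
        others = points[:i] + points[i + 1:]
        exps = _kernel_exponents(x, others, h)
        m = max(exps)
        w = [math.exp(e - m) for e in exps]
        z = sum(w)
        # E-step: w_j / z is the responsibility of center j for point x.
        # M-step: responsibility-weighted mean squared distance per dimension.
        for c, wj in zip(others, w):
            for k in range(d):
                sq[k] += (wj / z) * (x[k] - c[k]) ** 2
    return [math.sqrt(s / n) for s in sq]
```

Starting from any positive bandwidth and calling `em_step` repeatedly should never decrease `loo_log_likelihood`, which is a convenient sanity check when experimenting.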
9. Effectiveness of EM bandwidth
Samples from the real distribution
Samples from KDE with Scott's plug-in bandwidth
Samples from KDE with our EM bandwidth
10. Step 2: define and calculate the test statistic
data set S
data set S'
Kernel Density Estimate K_S1
11. Step 3: derive the null distribution
?: normal, by the Central Limit Theorem
?1: normal
?2: normal
Need to be estimated
Let T_k be a r.v. with distribution F_S
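The Central Limit Theorem argument can be illustrated with a small sketch. The assumptions here are not stated on the slide: the statistic is taken to be the difference between the mean log-density, under the baseline KDE, of the recent points and of held-out baseline points. Under the null hypothesis both means average i.i.d. terms from the same distribution, so their difference is approximately normal with mean 0, and its variance can be estimated from the held-out baseline terms. The function names and the inputs `loglik_base` and `loglik_new` are hypothetical.

```python
import math
import statistics


def statistic(loglik_base, loglik_new):
    """Difference of mean log-densities: recent data minus held-out baseline."""
    return statistics.fmean(loglik_new) - statistics.fmean(loglik_base)


def null_distribution(loglik_base, n_new):
    """Normal approximation (mean, std) of the statistic under the null.

    The per-point variance is estimated from the held-out baseline terms;
    under the null the recent terms share that variance.
    """
    var = statistics.variance(loglik_base)
    std = math.sqrt(var / n_new + var / len(loglik_base))
    return 0.0, std
```

The approximation improves as both samples grow, since each mean is then closer to normal.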
12. Estimating
13. Step 4: calculate the critical value and make a decision
estimated null distribution ?
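Once the null distribution has been estimated as a normal distribution, the critical value is one of its quantiles. A minimal sketch, assuming a one-sided rule that rejects when the statistic falls in the lower tail (recent data that fit the baseline density unusually poorly); the name `decide`, the zero null mean and the default significance level are illustrative assumptions.

```python
from statistics import NormalDist


def decide(delta, null_std, alpha=0.05):
    """Return (critical value, reject?) for a lower-tail test at level alpha."""
    critical = NormalDist(0.0, null_std).inv_cdf(alpha)
    return critical, delta < critical
```

For example, `decide(-2.0, 1.0)` rejects at the 5% level while `decide(-1.0, 1.0)` does not, because the 5% quantile of the standard normal is about -1.645.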
14. Density test: all 4 steps
data set S
data set S'
Step 1: Gaussian kernel density estimate of S1, denoted K_S1.
Step 2: define and calculate the test statistic.
Step 3: derive the null distribution.
Step 4: calculate the critical value and make a decision.
15. Run density test in 2 directions
The test is not symmetric, so a 2-way test may increase the power.
E.g. (figure): densities F_S and F_S' with samples S and S'.
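The asymmetry can be reproduced with a small 1-D sketch. Everything below is an illustrative assumption, not the authors' implementation: the statistic is a difference of mean log-densities, the bandwidth comes from Scott's rule (swapped in for the EM bandwidth to keep the example short), the rejection region is the lower tail, and the significance level is split in half across the two directions (a Bonferroni-style choice). All function names are hypothetical.

```python
import math
import statistics
from statistics import NormalDist


def log_kde(x, centers, h):
    """Log-density at x of a 1-D Gaussian KDE with bandwidth h."""
    exps = [-0.5 * ((x - c) / h) ** 2 for c in centers]
    m = max(exps)  # log-sum-exp trick to avoid underflow
    norm = len(centers) * h * math.sqrt(2 * math.pi)
    return m + math.log(sum(math.exp(e - m) for e in exps)) - math.log(norm)


def one_way_test(base, recent, alpha):
    """True if `recent` fits the density of `base` significantly worse than expected."""
    half = len(base) // 2
    s1, s2 = base[:half], base[half:]  # s1 builds the KDE, s2 is held out
    h = statistics.stdev(s1) * len(s1) ** (-0.2)  # Scott's rule in 1-D
    ll_base = [log_kde(x, s1, h) for x in s2]
    ll_new = [log_kde(x, s1, h) for x in recent]
    delta = statistics.fmean(ll_new) - statistics.fmean(ll_base)
    var = statistics.variance(ll_base)
    std = math.sqrt(var / len(ll_new) + var / len(ll_base))
    return delta < NormalDist(0.0, std).inv_cdf(alpha)


def two_way_test(s, s_prime, alpha=0.05):
    """Run the test in both directions, splitting alpha between them."""
    return one_way_test(s, s_prime, alpha / 2) or one_way_test(s_prime, s, alpha / 2)
```

If the recent data are concentrated inside the baseline distribution, they fit the baseline density well and the forward direction sees nothing unusual; the reverse direction, which builds the density on the recent data, finds that the baseline points fit it poorly and flags the change.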
16. False positive
Data consists of a low-D group and a high-D group. The user-given p-value is 8%.
17. False negative on low-D group
Figure axes: false negatives vs. type of change
18. False negative on high-D group
Figure axes: false negatives vs. type of change
19. Scalability
The density test has an amortizable time cost (one-time cost 84)
20. Conclusion
- Our density test
  - can correctly bound the type I error
  - is most powerful on all 5 changes
  - can easily scale to large data sets and has an amortizable time cost
Poster session ( 15)
21. Thanks for your attention!