Statistical Change Detection for Multi-Dimensional Data

1
Statistical Change Detection for
Multi-Dimensional Data
Presented by: Xiuyao Song
Authors: Xiuyao Song, Mingxi Wu, Chris Jermaine, Sanjay Ranka
2
Motivation example: antibiotic
resistance pattern

Culturing a specimen of E. coli bacteria and
then testing its resistance to multiple
antibiotic drugs.
Typical data might look like this (R = resistant,
S = susceptible, U = undetermined):
drug1 drug2 drug3 drug4
R S U
R
We have a baseline data set and a recently
observed data set.
Question: Does E. coli show a different resistance
pattern recently?
If a change is detected, it might be caused by
the presence of new E. coli strains. We then
raise an alarm for further investigation.

3
Problem definition
Multi-dimensional space
Data set S: baseline data
Data set S': recently observed data
Question: do S and S' come from the same distribution, i.e. is F_S = F_S'?
4
Related work

For one-dimensional data, many tests exist,
such as the K-S test and the chi-square test.
Only two tests detect a generic distributional
change in multi-dimensional space:
kdq-tree by Dasu et al.: relies on a discretization
scheme and suffers from the curse of dimensionality.
Cross-match by Rosenbaum: computationally
expensive due to the maximum matching algorithm.
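For reference, the one-dimensional K-S test mentioned above compares two empirical CDFs and takes their largest gap. The sketch below is a minimal standard-library illustration of that statistic only; the function name and the sample values are made up for the example, and no p-value is computed.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:  # the largest gap is attained at an observed point
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Illustrative one-dimensional samples (not from the presentation).
baseline = [0.1, 0.4, 0.5, 0.9, 1.3]
recent = [1.1, 1.6, 1.8, 2.2, 2.5]
print(ks_statistic(baseline, baseline))
print(ks_statistic(baseline, recent))
```

The statistic relies on a total order of the observations, which is why it does not carry over directly to multi-dimensional data.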

5
Hypothesis test framework
Data set S (baseline), data set S' (recently observed)
Null hypothesis H0: F_S = F_S'
Null distribution?
6
Density test: high-level overview
Data set S (baseline), data set S' (recently observed)
Step 1: Gaussian kernel density estimate of S1.
Step 2: define and calculate the test statistic.
Step 3: derive the null distribution.
Step 4: calculate the critical value and make a
decision.
Diagram labels: kernel density estimate K_S1, null distribution.
7
Step 1: Kernel Density Estimate (KDE),
bandwidth selection

Plug-in bandwidth: asymptotically efficient, but
not accurate.
Data-driven bandwidth: converges better to the
true distribution.

Effect of the bandwidth: the correctness of the
density test is always guaranteed; the power of the
test increases when the density estimate is accurate.
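As an illustration of a plug-in bandwidth, the sketch below builds a multi-dimensional Gaussian KDE with Scott's rule, which sets the bandwidth in dimension j to sigma_j * n^(-1/(d+4)). The diagonal (per-dimension) bandwidth, the function names, and the sample points are assumptions made for this example, not details taken from the slides.

```python
import math
import statistics

def scott_bandwidths(data):
    """Per-dimension Scott's-rule bandwidths: sigma_j * n ** (-1 / (d + 4))."""
    n, d = len(data), len(data[0])
    factor = n ** (-1.0 / (d + 4))
    return [statistics.stdev([row[j] for row in data]) * factor for j in range(d)]

def kde_log_density(x, data, h):
    """Log density at x under a product-Gaussian KDE with bandwidths h."""
    log_norm = -sum(math.log(hj * math.sqrt(2 * math.pi)) for hj in h)
    exponents = [
        -0.5 * sum(((xj - pj) / hj) ** 2 for xj, pj, hj in zip(x, p, h))
        for p in data
    ]
    m = max(exponents)  # log-sum-exp keeps far-away points from underflowing
    log_sum = m + math.log(sum(math.exp(e - m) for e in exponents))
    return log_norm + log_sum - math.log(len(data))

# Illustrative 2-D points (not from the presentation).
points = [(0.0, 1.0), (1.0, 0.5), (2.0, 2.0), (3.0, 1.5)]
h = scott_bandwidths(points)
print(h)
print(kde_log_density((1.5, 1.2), points, h))
```

Working in log space matters once the estimate is evaluated on many points, because products of small densities underflow quickly.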
8
Choose the bandwidth by MLE/EM (maximum likelihood
estimation / expectation maximization)
kernel
adding constraint
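The slide's formulas did not survive extraction, so the EM procedure and its constraint cannot be reproduced here. What can be said with certainty is that an unconstrained maximum-likelihood fit is degenerate: evaluated on the same points that carry the kernels, the likelihood grows without bound as the bandwidth shrinks to zero. The sketch below uses a different, standard workaround (leave-one-out likelihood with a grid search, in one dimension) purely to illustrate likelihood-driven bandwidth selection; it is not the authors' method, and all names and values are hypothetical.

```python
import math

def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood of a 1-D Gaussian KDE with bandwidth h."""
    n = len(data)
    total = 0.0
    for i, x in enumerate(data):
        # Density at x from every kernel except the one centred on x itself.
        s = sum(math.exp(-0.5 * ((x - p) / h) ** 2)
                for j, p in enumerate(data) if j != i)
        if s == 0.0:
            return float("-inf")  # bandwidth too small to reach any neighbour
        total += math.log(s / ((n - 1) * h * math.sqrt(2 * math.pi)))
    return total

def select_bandwidth(data, candidates):
    """Pick the candidate bandwidth with the highest leave-one-out likelihood."""
    return max(candidates, key=lambda h: loo_log_likelihood(data, h))

# Illustrative sample and candidate grid (not from the presentation).
sample = [0.0, 0.2, 0.3, 1.9, 2.0, 2.3, 4.1, 4.2]
grid = [0.05 * k for k in range(1, 41)]
best_h = select_bandwidth(sample, grid)
print(best_h)
```

Leaving each point out of its own density estimate removes the reward for collapsing the kernels onto the data, which is the role a constraint has to play in any likelihood-based choice.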
9
Effectiveness of EM bandwidth
Samples from the real distribution
Samples from KDE with Scott's plug-in bandwidth
Samples from KDE with our EM bandwidth
10
Step 2: define and calculate the test statistic
Data set S (baseline), data set S' (recently observed)
Diagram label: kernel density estimate K_S1
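The definition of the statistic is missing from this transcript. As a hedged illustration only, the sketch below assumes one natural choice: fit a KDE on one part of the baseline (S1), then compare how well it explains a held-out part of the baseline (S2) against how well it explains the recent data. The statistic, the fixed bandwidth, the function names, and the data are all assumptions of this example.

```python
import math

def log_density(x, sample, h):
    """Log density of x under a 1-D Gaussian KDE built on `sample`."""
    exps = [-0.5 * ((x - p) / h) ** 2 for p in sample]
    m = max(exps)  # log-sum-exp for numerical stability
    return (m + math.log(sum(math.exp(e - m) for e in exps))
            - math.log(len(sample) * h * math.sqrt(2 * math.pi)))

def mean_log_likelihood(points, sample, h):
    """Average log density of `points` under the KDE built on `sample`."""
    return sum(log_density(x, sample, h) for x in points) / len(points)

def likelihood_gap(s1, s2, recent, h):
    """ASSUMED statistic: held-out baseline fit minus recent-data fit."""
    return mean_log_likelihood(s2, s1, h) - mean_log_likelihood(recent, s1, h)

# Illustrative 1-D data (not from the presentation).
s1 = [0.0, 0.5, 1.0, 1.5, 2.0]
s2 = [0.2, 0.7, 1.2, 1.7]
shifted = [5.2, 5.7, 6.2, 6.7]
print(likelihood_gap(s1, s2, s2, 0.5))
print(likelihood_gap(s1, s2, shifted, 0.5))
```

Under this assumed form, a large positive gap means the recent data fall where the baseline density is low.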
11
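Step 3: derive the null distribution
The test statistic is normal, by the Central Limit Theorem.
Its two components (indexed 1 and 2 on the slide) are each normal.
The parameters of the null distribution need to be estimated.
Let T_k be a random variable with distribution F_S.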
12
Estimating
13
Step 4: calculate the critical value and
make a decision
Diagram label: estimated null distribution
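To make the last two steps concrete, the sketch below assumes the statistic is a gap between two mean log-likelihoods, so that the Central Limit Theorem gives an approximately normal null distribution with mean zero. The variance estimate, the one-sided decision rule, the default `alpha`, and the input numbers are assumptions of this example rather than details recovered from the slides.

```python
import math
from statistics import NormalDist, mean, variance

def density_test_decision(ll_heldout, ll_recent, alpha=0.05):
    """One-sided z-test on a gap of mean log-likelihoods (assumed statistic)."""
    gap = mean(ll_heldout) - mean(ll_recent)
    # Under H0 both lists come from one distribution, so the gap has mean 0
    # and variance var * (1/n + 1/m); var is estimated from the held-out list.
    var = variance(ll_heldout)
    se = math.sqrt(var * (1 / len(ll_heldout) + 1 / len(ll_recent)))
    critical = NormalDist(0.0, se).inv_cdf(1 - alpha)
    return gap, critical, gap > critical

# Made-up per-point log-likelihoods, for illustration only.
ll_heldout = [-1.2, -0.9, -1.5, -1.1, -1.3, -1.0]
ll_similar = [-1.1, -1.4, -1.0, -1.2, -0.8, -1.3]
ll_changed = [-4.0, -3.5, -5.1, -4.4, -3.9, -4.6]
print(density_test_decision(ll_heldout, ll_similar))
print(density_test_decision(ll_heldout, ll_changed))
```

The third element of the returned tuple is the decision: True means the gap exceeds the critical value and a change is flagged.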
14
Density test: all 4 steps
Data set S (baseline), data set S' (recently observed)
Step 1: Gaussian kernel density estimate of S1.
Step 2: define and calculate the test statistic.
Step 3: derive the null distribution.
Step 4: calculate the critical value and make a
decision.
Diagram labels: kernel density estimate K_S1, null distribution.
15
Run the density test in two directions
The test is not symmetric, so a two-way test may
increase the power.
[Example figure; labels: F_S, F_S', S, S']
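A small sketch of why running an asymmetric test in both directions can help. The one-way test here is a deliberately simple, hypothetical stand-in (it flags a change when too many points of one set fall outside the range of the other), not the density test itself; the point is only that swapping the roles of the two data sets can expose a change that one direction misses.

```python
def outside_range_test(reference, other, max_fraction=0.1):
    """Hypothetical one-way test: too many `other` points outside reference's range."""
    low, high = min(reference), max(reference)
    outside = sum(1 for x in other if x < low or x > high)
    return outside / len(other) > max_fraction

def two_way_test(one_way, baseline, recent):
    """Flag a change if the one-way test fires in either direction."""
    return one_way(baseline, recent) or one_way(recent, baseline)

# Illustrative data: the recent sample is more concentrated than the baseline.
wide = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
narrow = [-0.5, -0.2, 0.0, 0.1, 0.4]
print(outside_range_test(wide, narrow))                # forward direction
print(outside_range_test(narrow, wide))                # backward direction
print(two_way_test(outside_range_test, wide, narrow))  # either direction
```

Running two tests instead of one raises the overall false-positive rate unless the significance level of each run is tightened accordingly; how the presentation handles this is not recoverable from the transcript.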
16
False positive
The data consist of a low-D group and a high-D
group. The user-given p-value is 8%.
17
False negatives on the low-D group
[Chart: false-negative rate by type of change]
18
False negatives on the high-D group
[Chart: false-negative rate by type of change]
19
Scalability
The density test has an amortizable time cost (one-time
cost 84).
20
Conclusion