Title: Detecting Projected Outliers in High-dimensional Data Streams
Detecting Projected Outliers in High-dimensional Data Streams
Dr. Ji Zhang
Department of Mathematics and Computing, University of Southern Queensland
Ji.Zhang_at_usq.edu.au
15 Oct 2009
About me
- Lecturer, University of Southern Queensland, Australia
- Postdoc, CSIRO ICT Centre, Hobart, Australia
- Ph.D., Dalhousie University, Canada
- M.Sc., National University of Singapore, Singapore
- B.E., Southeast University, Nanjing, China
Research Interests
- Data Mining and Knowledge Discovery
- Outlier detection
- Clustering
- Very large databases
- Data Stream management
- Web data management
- Data Quality
- Data Privacy and Security
- Privacy preserving data management
- Intrusion detection
- Bioinformatics
- Gene expression data
- Computational Intelligence
- Genetic Algorithm
- Machine learning
Roadmap
- Introduction
- Data Synopsis
- Stream Projected Outlier Detector (SPOT)
- Multi-objective Genetic Algorithm
- Experimental Results
- Conclusions
Introduction
- The explosion of data streams has sparked a lot of research in recent years.
- Outlier detection in these data streams can lead to the discovery of useful abnormal and irregular patterns hidden in the streams.
- Outlier detection in data streams is useful in many fields, such as the analysis and monitoring of
  - financial transactions,
  - sensor networks, and
  - network traffic.
Introduction (cont.)
- Applications
  - Outlier detection in wireless sensor networks
    - Chemical sensors deployed in the environment to monitor toxic spills and nuclear incidents gather chemical data periodically. Outlier detection can trigger alarms and locate the source when abnormal data are generated.
  - Computer/network security
    - Find abnormal network traffic data that may indicate intrusions or attacks.
  - Credit card fraud detection
    - Find credit card transactions that are abnormal in transaction time, place or amount.
Introduction (cont.)
- Characteristics of data in high-dimensional space
  - Data points tend to be equidistant from each other (due to the curse of dimensionality)
  - Data density/sparsity can only be observed in relatively low-dimensional subspaces
  - Outliers are embedded only in low-dimensional subspaces
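The equidistance effect can be checked empirically. The sketch below is illustrative only and is not part of the talk: it assumes uniformly random points in the unit hypercube, and the function name `relative_contrast` and all parameter values are hypothetical. It measures how much farther the farthest neighbour of a query point is than its nearest neighbour; the gap tends to shrink as the dimensionality grows.

```python
import math
import random

def relative_contrast(num_points, num_dims, seed=0):
    """(d_max - d_min) / d_min for one query point among uniform random points."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(num_dims)] for _ in range(num_points)]
    query = points[0]
    dists = [math.dist(query, p) for p in points[1:]]
    return (max(dists) - min(dists)) / min(dists)

# Compare a low-, a medium- and a high-dimensional setting.
for d in (2, 20, 200):
    print(d, round(relative_contrast(200, d), 2))
```

A small contrast in the full space is the reason the method looks for outliers in low-dimensional projections instead.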
Introduction (cont.)
- Challenges
  - The nature of streaming data applications
    - Only one pass over the data stream is possible.
    - Data must be processed incrementally and in real time.
    - Space is limited and processing is time-critical.
[Figure: a data stream mining algorithm performing a single data scan]
Introduction (cont.)
- Challenges
  - High dimensionality further complicates the problem.
    - Detection requires a subspace exploration/search mechanism.
    - The exhaustive search for outlying subspaces is an NP problem.
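A short worked calculation shows why exhaustive search is impractical. It assumes that the candidate subspaces are the axis-parallel subspaces formed by the non-empty subsets of the d attributes; the symbol d is introduced here for illustration.

```latex
\[
  \#\,\text{subspaces}(d) \;=\; \sum_{k=1}^{d} \binom{d}{k} \;=\; 2^{d} - 1,
  \qquad
  2^{42} - 1 \approx 4.4 \times 10^{12} \quad \text{for } d = 42 .
\]
```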
Introduction (cont.)
- Traditional outlier detection methods: low-dimensional static data
- High-dimensional outlier detection methods: static data only
- Data stream outlier detection methods: full data space only
- Gap (marked "?" in the diagram): a new outlier detection method for high-dimensional data streams
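Introduction (cont.)
- Problem formulation
  - A projected outlier detection method performs the mapping $f: p_i \rightarrow (b_i, S_i, Score_i)$ for each data point $p_i$, where
    - $b_i$ is a Boolean (true/false) indicating whether $p_i$ is a projected outlier,
    - $S_i = \{S_{i1}, S_{i2}, \ldots, S_{in}\}$ is the set of outlying subspaces of $p_i$, and
    - $Score_i = \{Score_{i1}, Score_{i2}, \ldots, Score_{in}\}$ holds the score of $p_i$ in each subspace of $S_i$.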
Data Synopsis
- A data synopsis captures the data characteristics needed for outlier detection.
- Equi-width partition of the domain space
  - The domain space is partitioned into a set of non-overlapping cells with equal side length in each dimension.
  - Only an equi-width partition is applicable to data stream applications.
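The equi-width partition can be sketched as a mapping from a point to the index of its cell. This is a minimal illustration, not the talk's implementation: the names `cell_index` and `project`, the explicit domain bounds, and the clamping of out-of-range values are all assumptions.

```python
def cell_index(point, lows, highs, intervals):
    """Map a point to the tuple of per-dimension interval indices of its cell."""
    index = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / intervals          # equal side length in this dimension
        i = int((x - lo) / width)
        index.append(min(max(i, 0), intervals - 1))  # clamp values on or beyond the border
    return tuple(index)

def project(index, subspace):
    """Project a full-space cell index onto a subspace given as a tuple of dimensions."""
    return tuple(index[d] for d in subspace)

full_cell = cell_index((0.25, 0.95, 0.55), (0, 0, 0), (1, 1, 1), intervals=10)
print(full_cell, project(full_cell, (0, 2)))
```

Because the cell boundaries are fixed in advance, a point can be assigned to its cell in a single pass without revisiting earlier data.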
Data Synopsis (cont.)
- Projected Cell Summary (PCS)
  - The Projected Cell Summary of a cell c in a subspace s is a triplet of scalars defined as $PCS(c, s) = \langle RD(c, s), IRSD(c, s), IkRD(c, s) \rangle$, where
    - RD: Relative Density
    - IRSD: Inverse Relative Standard Deviation
    - IkRD: Inverse k-Relative Distance
- Data synopses are constructed for projected cells in subspaces in order to detect outliers
- The PCS of projected cells in subspaces can be computed and updated efficiently (in an online manner)
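The slides do not give the formulas for RD, IRSD or IkRD, so the sketch below does not reproduce PCS. It only illustrates the online-update idea with a stand-in: a density ratio that compares a cell's count with the count expected if the points were spread evenly over all cells of the subspace. The class name `SubspaceSynopsis` and this stand-in definition are assumptions.

```python
from collections import defaultdict

class SubspaceSynopsis:
    """Per-subspace cell counts that are updated one point at a time."""

    def __init__(self, subspace, intervals):
        self.subspace = subspace
        self.total_cells = intervals ** len(subspace)
        self.counts = defaultdict(int)   # only populated cells are stored
        self.n = 0

    def update(self, cell):
        """Constant-time update when a new point falls into `cell`."""
        self.counts[cell] += 1
        self.n += 1

    def relative_density(self, cell):
        """Stand-in density ratio: cell count / count expected under an even spread."""
        expected = self.n / self.total_cells
        return self.counts.get(cell, 0) / expected if expected else 0.0

syn = SubspaceSynopsis(subspace=(0, 1), intervals=2)
for cell in [(0, 0), (0, 0), (0, 0), (1, 1)]:
    syn.update(cell)
print(syn.relative_density((0, 0)), syn.relative_density((1, 1)))
```

Storing only populated cells keeps memory proportional to the occupied part of each subspace and not to the full grid.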
Data Synopsis (cont.)
- Outlying cell
  - A projected cell in which a PCS component (RD, IRSD or IkRD) is lower than a user-specified threshold.
- Outlying subspace
  - An outlying subspace s of a data point p is a subspace in which p falls into an outlying cell.
- Projected outlier
  - A data point p is considered a projected outlier if there exists at least one outlying subspace s of p.
[Figure: a data point p plotted against the axes X1 and X2]
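The three definitions above chain together directly. The sketch below assumes that the PCS values of the cells containing p are already available per subspace; the function names, the dictionary layout, the threshold values and the PCS numbers are all hypothetical.

```python
def outlying_subspaces(pcs_by_subspace, thresholds):
    """Subspaces in which the point's cell has any PCS component below its threshold."""
    return [
        subspace
        for subspace, pcs in pcs_by_subspace.items()
        if any(pcs[name] < limit for name, limit in thresholds.items())
    ]

def is_projected_outlier(pcs_by_subspace, thresholds):
    """A point is a projected outlier if it has at least one outlying subspace."""
    return bool(outlying_subspaces(pcs_by_subspace, thresholds))

# Hypothetical thresholds and PCS values for the cells that contain p.
thresholds = {"RD": 0.5, "IRSD": 0.5, "IkRD": 0.5}
pcs_of_p = {
    (0,): {"RD": 1.8, "IRSD": 1.1, "IkRD": 0.9},
    (1,): {"RD": 1.5, "IRSD": 0.9, "IkRD": 1.2},
    (0, 1): {"RD": 0.2, "IRSD": 0.7, "IkRD": 0.6},
}
print(outlying_subspaces(pcs_of_p, thresholds), is_projected_outlier(pcs_of_p, thresholds))
```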
Our technique: SPOT
- SPOT: Stream Projected Outlier Detector
  - A technique for detecting outliers embedded in lower-dimensional subspaces
SPOT (cont.)
- Major advantages
  - SPOT supports projected outlier detection and analysis for low-, medium- and high-dimensional data streams
  - SPOT supports both supervised and unsupervised outlier detection
  - SPOT allows multi-criteria outlier-ness measurement and employs a Genetic Algorithm for effective subspace search
  - SPOT is equipped with a mechanism to handle false positives when labeled data are available.
SPOT (cont.)
- Two stages
  - Training stage: construct the SST
  - Detection stage: detect outliers from the data stream
Training Stage
- The problem we are facing
  - The number of subspaces grows exponentially with the dimensionality of the data stream
  - Evaluating each data point in every possible subspace is prohibitively expensive.
- Instead, we evaluate the outlier-ness of each point in only a few subspaces of the space lattice. This set of subspaces is called the Sparse Subspace Template (SST).
- The SST consists of three groups of subspaces:
  - Fixed SST Subspaces (FS)
  - Unsupervised SST Subspaces (US)
  - Supervised SST Subspaces (SS)
Training Stage (cont.)
- Fixed SST Subspaces (FS)
  - FS contains all the subspaces in the full lattice whose dimension is at most MaxDimension, where MaxDimension is a user-specified parameter.
  - That is, FS contains all the subspaces with 1, 2, ..., MaxDimension dimensions. MaxDimension is typically small, e.g., 2 or 3.
  - FS establishes the baseline of the detection performance.
[Figure: create the space lattice and specify MaxDimension to obtain FS]
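Enumerating FS amounts to listing all attribute combinations up to MaxDimension. A minimal sketch, assuming a subspace is represented as a tuple of attribute indices; the function name `fixed_sst_subspaces` is hypothetical.

```python
from itertools import combinations

def fixed_sst_subspaces(num_dims, max_dimension):
    """All subspaces with 1, 2, ..., max_dimension attributes, as tuples of indices."""
    fs = []
    for k in range(1, max_dimension + 1):
        fs.extend(combinations(range(num_dims), k))
    return fs

fs = fixed_sst_subspaces(num_dims=4, max_dimension=2)
print(len(fs), fs[:5])
```

The size of FS grows polynomially in the number of attributes for a fixed MaxDimension: a 42-dimensional stream with MaxDimension = 3 gives 42 + 861 + 11480 = 12383 subspaces, which is why MaxDimension is kept small.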
Training Stage (cont.)
- Unsupervised SST Subspaces (US)
  - US consists of the outlying subspaces of the top training data, i.e., the training data that have the highest overall outlying degree.
- Rationale
  - The selected training data are more likely to be outliers.
  - Their outlying subspaces can potentially be used to detect more subsequent outliers in the stream.
  - This allows for unsupervised learning.
[Figure: unlabeled training data -> clustering -> top training data -> MOGA -> outlying subspaces -> US]
Training Stage (cont.)
- Supervised SST Subspaces (SS)
  - A few outlier examples may be provided by domain experts or by a previous detection process.
  - SS is the set of outlying subspaces of these outlier examples.
- Rationale
  - Based on SS, example-based outlier detection can be performed to detect more outliers that are similar to these outlier examples.
  - This allows for supervised learning.
[Figure: outlier examples -> MOGA -> outlying subspaces -> SS]
Detection Stage
- The detection stage performs outlier detection on incoming stream data.
- Two sub-steps
  - Update step
    - The data synopses of the projected cells to which the incoming point belongs are updated in each subspace of the SST.
  - Detection step
    - The outlier-ness of the point is checked in each subspace of the SST to decide whether or not it is a projected outlier.
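The update-then-detect order can be sketched as a single-pass loop. This is a simplified illustration under stated assumptions, not SPOT itself: points lie in the unit hypercube, the synopsis is a plain cell count, the only criterion is a stand-in density ratio (the slides do not define RD, IRSD or IkRD), and the names `detect_stream` and `rd_threshold` are hypothetical.

```python
from collections import defaultdict

def detect_stream(stream, sst, intervals, rd_threshold):
    """Yield (point, outlying_subspaces) for each point, in one pass over the stream."""
    counts = {s: defaultdict(int) for s in sst}
    seen = 0
    for point in stream:
        seen += 1
        outlying = []
        for s in sst:
            # Projected cell of the point in subspace s (equi-width grid on [0, 1)).
            cell = tuple(min(int(point[d] * intervals), intervals - 1) for d in s)
            counts[s][cell] += 1                        # update step
            expected = seen / intervals ** len(s)       # count under an even spread
            if counts[s][cell] / expected < rd_threshold:
                outlying.append(s)                      # detection step
        yield point, outlying

stream = [(0.1, 0.1)] * 20 + [(0.9, 0.9)]
sst = [(0,), (1,), (0, 1)]
results = list(detect_stream(stream, sst, intervals=2, rd_threshold=0.5))
print(results[-1])
```

Each point is touched once and the work per point is proportional to the number of subspaces in the SST, which is what makes the size of the SST the main cost factor.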
System Architecture of SPOT
Multi-objective Genetic Algorithm
- A Genetic Algorithm (GA) is used to search for the subspaces in which the top training data or outlier examples are more outlying.
  - It is used when constructing US and SS.
- A Multi-objective Genetic Algorithm (MOGA) is used for the subspace search because SPOT optimizes multiple (i.e., three) criteria.
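The slides do not describe how MOGA ranks candidate subspaces. A common building block of multi-objective search is Pareto dominance, sketched below under the assumption that each subspace is scored by three values that should all be minimised (lower meaning more outlying). The function names and the example scores are hypothetical.

```python
def dominates(a, b):
    """a dominates b: no worse in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Subspaces whose objective vector is not dominated by any other candidate."""
    return [
        s for s, obj in candidates.items()
        if not any(dominates(other, obj) for t, other in candidates.items() if t != s)
    ]

# Hypothetical objective vectors (three criteria) for three candidate subspaces.
candidates = {
    (0,): (0.2, 0.5, 0.4),
    (1,): (0.3, 0.6, 0.5),
    (0, 1): (0.1, 0.7, 0.4),
}
print(pareto_front(candidates))
```

With several criteria there is usually no single best subspace, so a search of this kind keeps a set of non-dominated subspaces and not one winner.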
Experimental Results
- Synthetic data sets
  - Generated by two high-dimensional data generators.
  - SD1 produces data sets that generally exhibit remarkably different data characteristics in the projections onto different subsets of features, in terms of the number, location, size and distribution of the data generated.
Experimental Results (cont.)
- Synthetic data sets
  - SD2 is specially designed for the comparative study of SPOT and the existing method. An important characteristic of SD2 is that the projected outliers appear perfectly normal in all 1-dimensional subspaces.
Experimental Results (cont.)
- Real data sets
  - RD1: Letter Image (17 dimensions, UCI machine learning repository)
  - RD2: Musk (168 dimensions, UCI machine learning repository)
  - RD3: MIT wireless network data set (15 dimensions)
  - RD4: KDD-CUP99 Network Intrusion Detection stream data set (42 dimensions)
Experimental Results (cont.)
[Figure panels:]
- Scalability of training w.r.t. stream length
- Scalability of training w.r.t. stream dimensionality
- Scalability of detection w.r.t. stream length
- Scalability of detection w.r.t. stream dimensionality
Experimental Results (cont.)
- Competitive methods
  - Histogram: non-parametric statistical method
  - Kernel Function: non-parametric statistical method
  - Incremental LOF: incremental variant of LOF
  - HPStream: projected clustering method for high-dimensional data streams
  - Largest_Cluster: clustering method (normal data reside only in the largest cluster)
Experimental Results (cont.)
- Experimental results
  - Purple: DR > 85% and FPR < 10%
  - Red: DR > 90% and FPR < 1%
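The slides do not define DR and FPR. Assuming the usual definitions (detection rate = detected outliers / all true outliers, false positive rate = normal points flagged / all normal points), they can be computed as follows. The function name and the example labels are hypothetical.

```python
def detection_and_false_positive_rate(labels, predictions):
    """Return (DR, FPR) from boolean ground-truth labels and boolean predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # outliers that were flagged
    fn = sum(1 for y, p in pairs if y and not p)      # outliers that were missed
    fp = sum(1 for y, p in pairs if not y and p)      # normal points that were flagged
    tn = sum(1 for y, p in pairs if not y and not p)  # normal points left alone
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical example: 10 true outliers and 90 normal points.
labels = [True] * 10 + [False] * 90
predictions = [True] * 9 + [False] + [True] * 9 + [False] * 81
dr, fpr = detection_and_false_positive_rate(labels, predictions)
print(dr, fpr)
```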
Experimental Results (cont.)
- Experimental results
  - Compared with the competitive methods, SPOT has two advantages:
    - It is equipped with a subspace exploration capability, which contributes to a good detection rate, and
    - It uses multiple criteria, which enables more accurate detection and helps reduce its false positive rate.
Conclusions
- We approach the problem of projected outlier detection for high-dimensional data streams (new and challenging!)
- SPOT utilizes a compact data synopsis, PCS, to capture the statistical information needed for outlier detection.
- SPOT detects outliers using the SST, a well-designed subspace template.
- SPOT adopts a flexible framework that uses multiple measures for outlier detection. MOGA serves as an effective search method to find the subspaces that optimize these outlier-ness criteria.
- Experimental results demonstrate the good performance of SPOT.
Conclusions (cont.)
- Limitations of SPOT
  - Generally, SPOT is slower than traditional outlier detection methods (SPOT explores hundreds or thousands of subspaces)
  - A large SST puts SPOT under stronger pressure to produce a high false positive rate. This becomes salient when dealing with unlabeled data.
  - There is a tradeoff between detection rate and false positive rate
  - Using an equal number of intervals for each dimension may not be the optimal partition.
Thank you for your time!