Detecting Projected Outliers in High-dimensional Data Streams

1
Detecting Projected Outliers in High-dimensional
Data Streams
Dr. Ji Zhang
Department of Mathematics and Computing
University of Southern Queensland
Ji.Zhang_at_usq.edu.au
15 Oct 2009
1
2
About me

Lecturer, University of Southern Queensland, Australia
Postdoc, CSIRO ICT Centre, Hobart, Australia
Ph.D., Dalhousie University, Canada
M.Sc., National University of Singapore, Singapore
B.E., Southeast University, Nanjing, China

2
3
Research Interests

Data Mining and Knowledge Discovery
Outlier detection
Clustering
Very large databases
Data Stream management
Web data management
Data Quality
Data Privacy and Security
Privacy preserving data management
Intrusion detection
Bioinformatics
Gene expression data
Computational Intelligence
Genetic Algorithm
Machine learning

3
5
Roadmap

Introduction
Data Synopsis
Stream Projected Outlier Detector (SPOT)
Multi-objective Genetic Algorithm
Experimental Results
Conclusions

5
7
Introduction

The explosion of data streams has sparked a lot of research in recent years.
Outlier detection from these data streams can potentially lead to the discovery of useful abnormal and irregular patterns hidden in the streams.
Outlier detection in data streams can be useful in many fields, such as the analysis and monitoring of financial transactions, sensor networks and network traffic.

7
8
Introduction (cont.)

Applications
Outlier detection in wireless sensor networks
Chemical sensors deployed in the environment to monitor toxic spills and nuclear incidents gather chemical data periodically. Outlier detection can trigger alarms and locate the source when abnormal data are generated.
Computer/network security
Find abnormal network traffic data that may indicate intrusions/attacks.
Credit card fraud detection
Find credit card transactions that are abnormal in transaction time, transaction place and amount.

8
9
Introduction (cont.)

Characteristics of data in high-dimensional space
Data points tend to be nearly equidistant from each other (due to the curse of dimensionality)
Data density/sparsity can only be observed in relatively lower-dimensional subspaces
Outliers are only embedded in low-dimensional subspaces
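The equidistance effect can be checked with a small experiment. The sketch below is only an illustration, not part of the presentation: it assumes points drawn uniformly from the unit hypercube, and the function name `relative_contrast` is made up for this example. It measures how much the farthest neighbour differs from the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one random
    query point to n_points uniform random points in [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast shrinks as the dimensionality grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

In low dimensions the contrast is large, so near and far neighbours are easy to tell apart; in hundreds of dimensions it drops well below 1, which is why full-space distances stop being informative.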

9
10
Introduction (cont.)

Challenges
The nature of streaming data applications
Take only one pass over the data stream.
Process data incrementally and in real time.
Work under space limitations and time-criticality.

[Figure: a data stream mining algorithm making a single scan over the data]
10
11
Introduction (cont.)

Challenges
High dimensionality further complicates the problem.
Detection requires a subspace exploration/search mechanism.
The exhaustive search for the outlying subspaces is an NP problem.
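To get a feel for the size of the search space, the sketch below counts candidate subspaces. It assumes that a subspace is any non-empty subset of the attributes (an axis-parallel projection); the function name is hypothetical.

```python
def count_subspaces(num_dims):
    """Number of non-empty attribute subsets of a num_dims-dimensional space."""
    return 2 ** num_dims - 1


if __name__ == "__main__":
    for d in (10, 20, 42, 168):
        print(d, count_subspaces(d))
```

Each added dimension doubles the count, so checking every subspace for every arriving point quickly becomes infeasible.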

11
12
Introduction (cont.)

The state of the art

Traditional outlier detection methods: low-dimensional static data
High-dimensional outlier detection methods: static data only
Data stream outlier detection methods: full data space only
Needed: a new outlier detection method for high-dimensional data streams
12
13
Introduction (cont.)

Problem formulation
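A projected outlier detection method performs the mapping
$f: p_i \rightarrow (b_i, S_i, Score_i)$
where
$b_i$ is a Boolean (true/false) indicating whether the point $p_i$ is a projected outlier,
$S_i = \{S_{i1}, S_{i2}, \ldots, S_{in}\}$ is the set of outlying subspaces of $p_i$, and
$Score_i = \{Score_{i1}, Score_{i2}, \ldots, Score_{in}\}$ holds the outlier-ness score of $p_i$ in each of those subspaces.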
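The output of the mapping (a true/false flag, a list of subspaces and a matching list of scores) can be pictured as a small record type. This is a sketch only: the class name, the field names and the encoding of a subspace as a tuple of attribute indices are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical encoding: a subspace is a tuple of attribute indices.
Subspace = Tuple[int, ...]


@dataclass
class DetectionResult:
    is_outlier: bool                                          # the flag b
    outlying_subspaces: List[Subspace] = field(default_factory=list)  # S
    scores: List[float] = field(default_factory=list)                  # Score

    def __post_init__(self):
        # One score is expected per outlying subspace.
        if len(self.outlying_subspaces) != len(self.scores):
            raise ValueError("need exactly one score per outlying subspace")


if __name__ == "__main__":
    print(DetectionResult(True, [(0, 2), (1,)], [0.04, 0.07]))
```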
13
14
Roadmap

Introduction
Data Synopsis
Stream Projected Outlier Detector (SPOT)
Multi-objective Genetic Algorithm
Experimental Results
Conclusions

14
15
Data Synopsis

A data synopsis is used to capture the data characteristics that can be used for outlier detection.
Equi-width partition of the domain space
The domain space is partitioned into a set of non-overlapping cells with equal side length in each dimension.
Only an equi-width partition is applicable to data stream applications.
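A minimal sketch of how a point could be assigned to its cell under an equi-width partition. The function name, the per-dimension bounds `lows`/`highs` and the clamping of out-of-range values to the border cells are assumptions for this example, not details taken from the slides.

```python
def cell_index(point, subspace, lows, highs, intervals):
    """Map a point to its equi-width cell in the given subspace.

    point     -- sequence of attribute values
    subspace  -- tuple of attribute indices that define the projection
    lows/highs -- assumed known lower and upper bound of every attribute
    intervals -- number of equal-width intervals per dimension
    """
    index = []
    for dim in subspace:
        width = (highs[dim] - lows[dim]) / intervals
        k = int((point[dim] - lows[dim]) / width)
        # Clamp so that values on or beyond the bounds fall into a border cell.
        index.append(min(max(k, 0), intervals - 1))
    return tuple(index)


if __name__ == "__main__":
    p = [0.05, 0.55, 0.95]
    print(cell_index(p, (0, 2), [0.0] * 3, [1.0] * 3, 10))
```

Because the cell boundaries depend only on the attribute bounds and the number of intervals, the cell of a new point can be found without looking at any earlier data, which suits a single-pass setting.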

15
16
Data Synopsis (cont.)

Projected Cell Summary (PCS)
The Projected Cell Summary of a cell c in a subspace s is a triplet of scalars defined as
$PCS(c, s) = (RD(c, s), IRSD(c, s), IkRD(c, s))$
where
RD: Relative Density
IRSD: Inverse Relative Standard Deviation
IkRD: Inverse k-Relative Distance
A data synopsis is constructed for the projected cells in subspaces in order to detect outliers.
The PCS of projected cells in subspaces can be computed and updated efficiently (in an online manner).
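The slides do not give the update formulas for the PCS components. As a generic illustration of what "updated in an online manner" can mean for a statistic such as a standard deviation, the sketch below keeps a running mean and variance per dimension for the points of one cell using Welford's one-pass method. The class name is hypothetical, and this is not claimed to be the way SPOT computes RD, IRSD or IkRD.

```python
class RunningCellStats:
    """One-pass mean and variance of the points that fall into one cell."""

    def __init__(self, num_dims):
        self.count = 0
        self.mean = [0.0] * num_dims
        self.m2 = [0.0] * num_dims  # sum of squared deviations from the mean

    def update(self, point):
        """Fold one point into the statistics in O(num_dims) time."""
        self.count += 1
        for j, x in enumerate(point):
            delta = x - self.mean[j]
            self.mean[j] += delta / self.count
            self.m2[j] += delta * (x - self.mean[j])

    def variance(self):
        """Population variance per dimension (zeros if no points seen)."""
        if self.count == 0:
            return [0.0] * len(self.m2)
        return [m / self.count for m in self.m2]


if __name__ == "__main__":
    stats = RunningCellStats(2)
    for p in [(1.0, 2.0), (3.0, 6.0)]:
        stats.update(p)
    print(stats.count, stats.mean, stats.variance())
```

Only a count and two small vectors are stored per cell, so no past points need to be kept, in line with the one-pass constraint on stream processing.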

16
17
Data Synopsis (cont.)

Outlying cell
A projected cell in which at least one PCS component (RD, IRSD or IkRD) is lower than a user-specified threshold.
Outlying subspace
An outlying subspace s of a point p is a subspace in which p falls into an outlying cell.
Projected outlier
A data point p is considered a projected outlier if there exists at least one outlying subspace s of p.
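The three definitions translate into a simple check once the PCS of the cell holding p is known in each examined subspace. The sketch below assumes those values are already computed and passed in as plain dictionaries; the function names, the dictionary layout and the example numbers are made up for illustration.

```python
PCS_COMPONENTS = ("RD", "IRSD", "IkRD")


def outlying_subspaces(pcs_by_subspace, thresholds):
    """Return the subspaces in which the cell containing the point is outlying,
    i.e. at least one PCS component lies below its threshold."""
    result = []
    for subspace, pcs in pcs_by_subspace.items():
        if any(pcs[name] < thresholds[name] for name in PCS_COMPONENTS):
            result.append(subspace)
    return result


def is_projected_outlier(pcs_by_subspace, thresholds):
    """A point is a projected outlier if it has at least one outlying subspace."""
    return len(outlying_subspaces(pcs_by_subspace, thresholds)) > 0


if __name__ == "__main__":
    # Hypothetical PCS values of the cells that contain one point p.
    pcs = {
        (0,): {"RD": 1.2, "IRSD": 0.9, "IkRD": 1.1},
        (0, 3): {"RD": 0.05, "IRSD": 0.8, "IkRD": 0.9},
    }
    limits = {"RD": 0.1, "IRSD": 0.1, "IkRD": 0.1}
    print(outlying_subspaces(pcs, limits), is_projected_outlier(pcs, limits))
```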

[Figure: a point p plotted in the subspace with axes X1 and X2]
17
18
Roadmap

Introduction
Data Synopsis
Stream Projected Outlier Detector (SPOT)
Multi-objective Genetic Algorithm
Experimental Results
Conclusions

18
19
Our technique: SPOT

SPOT: Stream Projected Outlier Detector
A technique for detecting outliers embedded in lower-dimensional subspaces

19
20
SPOT (cont.)

Major advantages
SPOT supports projected outlier detection and analysis for low-, medium- and high-dimensional data streams.
SPOT supports both supervised and unsupervised outlier detection.
SPOT allows for multi-criteria outlier-ness measurement and employs a genetic algorithm for effective subspace search.
SPOT is equipped with a mechanism to handle false positives when labeled data are available.

20
21
SPOT (cont.)

Two stages
Training stage: construct the SST
Detection stage: detect outliers from the data stream
21
22
Training Stage

Problem we are facing
The number of subspaces grows exponentially with the number of dimensions of the data stream.
Evaluating each data point in every possible subspace is prohibitively expensive.
Instead, we evaluate the outlier-ness of each point in only a few subspaces of the space lattice.
This set of subspaces is called the Sparse Subspace Template (SST).

The SST consists of three groups of subspaces:
Fixed SST Subspaces (FS)
Unsupervised SST Subspaces (US)
Supervised SST Subspaces (SS)
22
23
Training Stage (cont.)

Fixed SST Subspaces (FS)
FS contains all the subspaces in the full lattice whose dimension is at most MaxDimension, where MaxDimension is a user-specified parameter.
That is, FS contains all the subspaces with dimensions of 1, 2, ..., MaxDimension. MaxDimension is typically small, e.g., 2 or 3.
FS establishes the baseline of the detection performance.
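Building FS amounts to listing every attribute combination of size 1 up to MaxDimension. A minimal sketch, assuming subspaces are encoded as tuples of attribute indices (the function name and this encoding are illustrative choices):

```python
from itertools import combinations
from math import comb


def fixed_sst_subspaces(num_dims, max_dimension):
    """List all subspaces with 1, 2, ..., max_dimension attributes."""
    subspaces = []
    for k in range(1, max_dimension + 1):
        subspaces.extend(combinations(range(num_dims), k))
    return subspaces


if __name__ == "__main__":
    fs = fixed_sst_subspaces(4, 2)
    print(fs)
    # Size check against the closed form sum of C(d, k) for k = 1..MaxDimension.
    print(len(fs), sum(comb(4, k) for k in (1, 2)))
```

For example, a 42-dimensional stream with MaxDimension = 2 gives 42 + 861 = 903 subspaces, a small fraction of the full lattice.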

[Figure: specify MaxDimension, create the space lattice, obtain FS]
23
24
Training Stage (cont.)

Unsupervised SST Subspaces (US)
US consists of the outlying subspaces of the top training data, i.e., the training data that have the highest overall outlying degree.
Rationale
The selected training data are more likely to be outliers.
Their outlying subspaces can potentially be used to detect more subsequent outliers in the stream.
This allows for unsupervised learning.

[Figure: unlabeled training data, clustering, top training data, MOGA, outlying subspaces, US]
24
25
Training Stage (cont.)

Supervised SST Subspaces (SS)
A few outlier examples may be provided by domain experts or by a previous detection process.
SS is the set of outlying subspaces of these outlier examples.
Rationale
Based on SS, example-based outlier detection can be performed, which detects more outliers that are similar to these outlier examples.
This allows for supervised learning.

[Figure: outlier examples, MOGA, outlying subspaces, SS]
25
26
Detection Stage

The detection stage performs outlier detection
for incoming stream data.
Two sub-steps
Update step
For each subspace in the SST, the data synopsis of the projected cell to which the incoming point belongs is updated.
Detection step
The outlier-ness of the point is checked in each of the subspaces in the SST to decide whether or not it is a projected outlier.
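The update-then-detect order can be sketched as a single pass over the stream. This is a simplified illustration under stated assumptions, not the SPOT implementation: attribute values are assumed to lie in [0, 1), only a density-style measure is used (the count of the cell divided by the count expected if points were spread evenly over all cells), and the names `detect_stream` and `rd_threshold` are hypothetical.

```python
from collections import defaultdict


def detect_stream(stream, sst, intervals, rd_threshold):
    """Yield (point, flagged_subspaces) for every point, in one pass.

    sst is a list of subspaces, each a tuple of attribute indices.
    """
    counts = {s: defaultdict(int) for s in sst}
    seen = 0
    for point in stream:
        seen += 1
        flagged = []
        for s in sst:
            cell = tuple(min(int(point[d] * intervals), intervals - 1) for d in s)
            counts[s][cell] += 1                      # update step
            expected = seen / (intervals ** len(s))   # even spread over all cells
            if counts[s][cell] / expected < rd_threshold:   # detection step
                flagged.append(s)
        yield point, flagged


if __name__ == "__main__":
    # 100 points in one region of attribute 0, then one point far away.
    data = [[0.05]] * 100 + [[0.95]]
    for point, flagged in detect_stream(data, [(0,)], 10, 0.5):
        if flagged:
            print(point, flagged)
```

In practice the synopses would already be populated by the training stage before detection begins, so the very first points are not judged against an empty summary.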

26
27
System Architecture of SPOT
27
28
Roadmap

Introduction
Data Synopsis
Stream Projected Outlier Detector (SPOT)
Multi-objective Genetic Algorithm
Experimental Results
Conclusions

28
29
Multi-objective Genetic Algorithm

A Genetic Algorithm (GA) is used to search for the subspaces in which the top training data/outlier examples are more outlying.
It is used when constructing US and SS.
A Multi-objective Genetic Algorithm (MOGA) is used for the subspace search because SPOT optimizes multiple (i.e., three) criteria.
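The slides do not spell out the internals of the MOGA. A common building block of multi-objective search is Pareto dominance, sketched below for three criteria. It assumes that lower values mean "more outlying" for every criterion; the function names and the example numbers are illustrative only.

```python
def dominates(a, b):
    """True if objective vector a is at least as good as b in every criterion
    and strictly better in at least one (lower is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the subspaces whose objective vectors no other candidate dominates.

    candidates maps a subspace (tuple of attribute indices) to a tuple of
    three criterion values, e.g. (RD, IRSD, IkRD).
    """
    front = []
    for subspace, objectives in candidates.items():
        dominated = any(
            dominates(other, objectives)
            for other_subspace, other in candidates.items()
            if other_subspace != subspace
        )
        if not dominated:
            front.append(subspace)
    return front


if __name__ == "__main__":
    # Hypothetical criterion values of one point in four candidate subspaces.
    scores = {
        (0,): (0.9, 0.8, 0.7),
        (0, 1): (0.1, 0.2, 0.3),
        (1, 2): (0.2, 0.1, 0.4),
        (2,): (0.5, 0.5, 0.5),
    }
    print(pareto_front(scores))
```

With several criteria there is usually no single best subspace, so a multi-objective search keeps a set of non-dominated candidates rather than one winner.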

29
30
Multi-objective Genetic Algorithm
30
31
Roadmap

Introduction
Data Synopsis
Stream Projected Outlier Detector (SPOT)
Multi-objective Genetic Algorithm
Experimental Results
Conclusions

31
32
Experimental Results

Synthetic data sets
Generated by two high-dimensional data generators.
SD1 produces data sets that generally exhibit remarkably different data characteristics in projections onto different subsets of features, in terms of the number, location, size and distribution of the data generated.

32
33
Experimental Results (cont.)

Synthetic data sets
SD2 is specially designed for the comparative study of SPOT and the existing methods. An important characteristic of SD2 is that the projected outliers appear perfectly normal in all 1-dimensional subspaces.

33
34
Experimental Results (cont.)

Real data sets
RD1: Letter Image (17 dimensions, UCI Machine Learning Repository)
RD2: Musk (168 dimensions, UCI Machine Learning Repository)
RD3: MIT wireless network data set (15 dimensions)
RD4: KDD-CUP99 Network Intrusion Detection stream data set (42 dimensions)

34
35
Experimental Results (cont.)

Scalability study

Scalability of training w.r.t. stream length
Scalability of training w.r.t. stream dimensionality
Scalability of detection w.r.t. stream length
Scalability of detection w.r.t. stream dimensionality
35
36
Experimental Results (cont.)

Competing methods
Histogram: non-parametric statistical method
Kernel Function: non-parametric statistical method
Incremental LOF: incremental variant of LOF
HPStream: projected clustering method for high-dimensional data streams
Largest_Cluster: clustering method (normal data are assumed to reside only in the largest cluster)

36
37
Experimental Results (cont.)

Experimental results
Purple: DR > 85% and FPR < 10%
Red: DR > 90% and FPR < 1%
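The slides do not define DR and FPR. The sketch below assumes the usual definitions: detection rate is the fraction of true outliers that are flagged, and false positive rate is the fraction of normal points that are flagged. The function name and the toy labels are made up for illustration.

```python
def detection_metrics(true_labels, predicted_labels):
    """Return (detection_rate, false_positive_rate).

    Labels are truthy for outliers and falsy for normal points.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    dr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return dr, fpr


if __name__ == "__main__":
    truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    flagged = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    print(detection_metrics(truth, flagged))
```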

37
38
Experimental Results (cont.)

Experimental results
Compared with the competing methods, SPOT has two advantages:
It is equipped with a subspace exploration capability, which contributes to a good detection rate.
It uses multiple criteria, which enables much more accurate detection and helps SPOT reduce its false positive rate.

38
39
Roadmap

Introduction
Data Synopsis
Stream Projected Outlier Detector (SPOT)
Multi-objective Genetic Algorithm
Experimental Results
Conclusions

39
40
Conclusions

We approach the problem of projected outlier detection for high-dimensional data streams (new and challenging!).
SPOT uses a compact data synopsis, PCS, to capture the statistical information needed for outlier detection.
SPOT detects outliers in the subspaces of the SST, a well-designed subspace template.
SPOT adopts a flexible framework that uses multiple measures for outlier detection. MOGA serves as an effective search method for finding subspaces that optimize these outlier-ness criteria.
Experimental results demonstrate the good performance of SPOT.

40
41
Conclusions (cont.)

Limitations of SPOT
SPOT is generally slower than traditional outlier detection methods, because it explores hundreds or thousands of subspaces.
A large SST puts more pressure on SPOT toward a high false positive rate. This becomes salient when dealing with unlabeled data.
There is a tradeoff between detection rate and false positive rate.
Using an equal number of intervals for each dimension may not give the optimal partition.

41
42
Thank you for your time!
42
