1
Detecting Projected Outliers in High-dimensional
Data Streams
Dr. Ji Zhang
Department of Mathematics and Computing
University of Southern Queensland
Ji.Zhang_at_usq.edu.au
15 Oct 2009
1
2
About me
  • Lecturer, University of Southern Queensland,
    Australia
  • Postdoc, CSIRO ICT Centre, Hobart, Australia
  • Ph.D., Dalhousie University, Canada
  • M.Sc., National University of Singapore,
    Singapore
  • B. E., Southeast University, Nanjing, China

2
3
Research Interests
  • Data Mining and Knowledge Discovery
  • Outlier detection
  • Clustering
  • Very large databases
  • Data Stream management
  • Web data management
  • Data Quality
  • Data Privacy and Security
  • Privacy preserving data management
  • Intrusion detection
  • Bioinformatics
  • Gene expression data
  • Computational Intelligence
  • Genetic Algorithm
  • Machine learning

3
5
Roadmap
  • Introduction
  • Data Synopsis
  • Stream Projected Outlier Detector (SPOT)
  • Multi-objective Genetic Algorithm
  • Experimental Results
  • Conclusions

5
7
Introduction
  • The explosion of data streams has sparked a lot
    of research in recent years.
  • Outlier detection in these data streams can
    lead to the discovery of useful abnormal
    and irregular patterns hidden in the streams.
  • Outlier detection in data streams is useful
    in many fields, such as the analysis and monitoring of
      • financial transactions,
      • sensor networks, and
      • network traffic.

7
8
Introduction (cont.)
  • Applications
      • Outlier detection in wireless sensor networks
          • Chemical sensors deployed in the environment to monitor
            toxic spills and nuclear incidents gather chemical data
            periodically. Outlier detection can trigger alarms and
            locate the source when abnormal data are generated.
      • Computer/network security
          • Find abnormal network traffic data that may indicate
            intrusions/attacks.
      • Credit card fraud detection
          • Find credit card transactions that are abnormal in
            transaction time, transaction place and amount.

8
9
Introduction (cont.)
  • Characteristics of data in high-dimensional space
      • Data points tend to be equidistant from each other (due
        to the curse of dimensionality)
      • Data density/sparsity can only be observed in
        relatively low-dimensional subspaces
      • Outliers are embedded only in low-dimensional
        subspaces
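
The equidistance effect can be observed with a small experiment. The sketch below is only an illustration, not part of the talk: it assumes points drawn uniformly from the unit cube, and the function name relative_contrast and its parameters are hypothetical. It measures how much the farthest point differs from the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(num_points, num_dims, seed=0):
    """Return (max - min) / min over distances from one point to all others.

    Points are drawn uniformly from the unit cube (an assumption made
    purely for illustration).
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(num_dims)] for _ in range(num_points)]
    query = points[0]
    dists = [math.dist(query, p) for p in points[1:]]
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as dimensionality grows, so distances in the
# full space become less informative for separating outliers.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(200, d), 3))
```

A small contrast in the full space is one reason to look for outliers in low-dimensional projections instead.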

9
10
Introduction (cont.)
  • Challenges
      • The nature of streaming data applications
          • Only one pass over the data stream is possible.
          • Data must be processed incrementally and in real
            time.
          • Space is limited and processing is time-critical.
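
As a minimal sketch of what one-pass, incremental processing with constant memory looks like, the class below maintains a running mean and variance using Welford's online algorithm. This is a generic streaming technique chosen for illustration, not something the slides attribute to SPOT, and the class name RunningStats is hypothetical.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in a single pass."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the summary; the value is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self.m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

Each arriving value is touched once and then discarded, which is the constraint a stream mining algorithm has to respect.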

Figure: a data stream mining algorithm performs a single data scan.
10
11
Introduction (cont.)
  • Challenges
      • High dimensionality further complicates the
        problem.
          • Detection requires a subspace exploration/search
            mechanism.
          • The exhaustive search for outlying subspaces
            is an NP problem.
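
To give a feel for the size of the search space, the sketch below counts the non-empty attribute subsets of a d-dimensional stream, which is 2^d - 1 in total. The helper name count_subspaces is hypothetical; the optional max_dim argument restricts the count to subspaces with at most that many attributes.

```python
from math import comb


def count_subspaces(num_dims, max_dim=None):
    """Count non-empty attribute subsets with at most max_dim attributes."""
    if max_dim is None:
        max_dim = num_dims
    return sum(comb(num_dims, k) for k in range(1, max_dim + 1))


# A 42-dimensional stream has 2**42 - 1 subspaces in total, but only
# 12383 of them have three or fewer attributes.
all_subspaces = count_subspaces(42)
low_dim_subspaces = count_subspaces(42, max_dim=3)
```

The gap between the two counts is why an exhaustive search is impractical while a bounded, low-dimensional search remains feasible.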

11
12
Introduction (cont.)
  • The state of the art
      • Traditional outlier detection methods: low-dimensional
        static data
      • High-dimensional outlier detection methods: static data
        only
      • Data stream outlier detection methods: full data space
        only
      • Open question (?): a new outlier detection method for
        high-dimensional data streams
12
13
Introduction (cont.)
  • Problem formulation
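  • A projected outlier detection method performs the
    mapping
      $f : p_i \rightarrow (b_i, S_i, Score_i)$
    where
      • $b_i$ is a Boolean (true/false) flag indicating whether $p_i$ is a
        projected outlier,
      • $S_i = \{S_{i1}, S_{i2}, \ldots, S_{in}\}$ is the set of outlying
        subspaces of $p_i$, and
      • $Score_i = \{Score_{i1}, Score_{i2}, \ldots, Score_{in}\}$ holds the
        outlier-ness score of $p_i$ in each subspace of $S_i$.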
13
15
Data Synopsis
  • A data synopsis captures the data
    characteristics that are needed for outlier
    detection.
  • Equi-width partition of the domain space
      • The domain space is partitioned into a set of
        non-overlapping cells with equal side
        length in each dimension.
      • Only an equi-width partition is applicable to
        data stream applications.
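
A minimal sketch of how a point could be mapped to its cell under an equi-width partition is shown below. The function name cell_index, its parameters, and the choice to clamp out-of-range values into the boundary cells are assumptions made for illustration; the slides do not specify these details.

```python
def cell_index(point, subspace, lows, highs, num_intervals):
    """Map a point to its equi-width cell in the given subspace.

    subspace is a tuple of attribute indices; lows and highs give the
    assumed domain bounds of every attribute.
    """
    index = []
    for dim in subspace:
        width = (highs[dim] - lows[dim]) / num_intervals
        i = int((point[dim] - lows[dim]) / width)
        # Clamp so that boundary or out-of-range values land in a valid cell.
        index.append(min(max(i, 0), num_intervals - 1))
    return tuple(index)


point = (0.05, 0.95, 0.55)
lows, highs = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
cell = cell_index(point, (0, 1), lows, highs, num_intervals=10)
```

Because the cell boundaries are fixed in advance, the cell of a new point can be computed without revisiting earlier data.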

15
16
Data Synopsis (cont.)
  • Projected Cell Summary (PCS)
      • The Projected Cell Summary of a cell c in a
        subspace s is a triplet of scalars defined as
          $PCS(c, s) = (RD(c, s), IRSD(c, s), IkRD(c, s))$
        where
          • RD: Relative Density
          • IRSD: Inverse Relative Standard Deviation
          • IkRD: Inverse k-Relative Distance
      • A data synopsis is constructed for the projected
        cells in subspaces in order to detect outliers
      • The PCS of projected cells in subspaces can be
        computed and updated efficiently (in an online
        manner)
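
The slides name the PCS components but do not give their formulas, so the sketch below rests on stated assumptions: RD is taken to be the cell's point count divided by the average count per cell, and IRSD is taken to be the standard deviation of a uniform spread over the cell divided by the observed standard deviation. IkRD is left out. All names (CellStats, relative_density, inverse_relative_std) are hypothetical. The point of the sketch is that additive per-cell sums are enough to update such quantities online.

```python
import math


class CellStats:
    """Additive per-cell summary: count, per-dimension sum and sum of squares."""

    def __init__(self, num_dims):
        self.count = 0
        self.sums = [0.0] * num_dims
        self.sq_sums = [0.0] * num_dims

    def add(self, coords):
        """Fold one point (its coordinates in this subspace) into the summary."""
        self.count += 1
        for j, x in enumerate(coords):
            self.sums[j] += x
            self.sq_sums[j] += x * x

    def std(self):
        """Square root of the total variance of the points in the cell."""
        var = 0.0
        for s, q in zip(self.sums, self.sq_sums):
            mean = s / self.count
            var += max(q / self.count - mean * mean, 0.0)
        return math.sqrt(var)


def relative_density(cell, total_points, num_cells):
    """Assumed RD: cell count divided by the average count per cell."""
    return cell.count / (total_points / num_cells)


def inverse_relative_std(cell, cell_width, num_dims):
    """Assumed IRSD: uniform-spread std divided by the observed std."""
    uniform_std = math.sqrt(num_dims * cell_width ** 2 / 12.0)
    observed = cell.std()
    # How a zero-spread cell should be scored is not stated in the slides;
    # this sketch returns infinity in that case.
    return uniform_std / observed if observed > 0 else float("inf")


cell = CellStats(num_dims=1)
for x in (0.1, 0.2, 0.3):
    cell.add((x,))
rd = relative_density(cell, total_points=30, num_cells=10)
irsd = inverse_relative_std(cell, cell_width=1.0, num_dims=1)
```

Under these assumptions a low RD marks a sparsely populated cell and a low IRSD marks a cell whose points are widely scattered.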

16
17
Data Synopsis (cont.)
  • Outlying cell
      • A projected cell for which a PCS component (RD, IRSD or
        IkRD) is lower than a user-specified threshold.
  • Outlying subspace
      • An outlying subspace s of a point p is a subspace
        in which the cell containing p is an outlying cell.
  • Projected outlier
      • A data point p is considered a projected
        outlier if there exists at least one outlying
        subspace s of p.

Figure: a point p shown in the two-dimensional space (X1, X2).
17
19
Our technique: SPOT
  • SPOT: Stream Projected Outlier Detector
  • A technique for detecting outliers embedded in
    lower-dimensional subspaces

19
20
SPOT (cont.)
  • Major advantages
      • SPOT supports projected outlier
        detection and analysis for low-, medium-
        and high-dimensional data streams
      • SPOT supports both supervised and unsupervised
        outlier detection
      • SPOT allows multi-criteria outlier-ness
        measurement and employs a genetic algorithm for
        effective subspace search
      • SPOT is equipped with a mechanism to handle false
        positives when labeled data are available.

20
21
SPOT (cont.)
  • Two stages
      • Training stage: construct the SST (Sparse Subspace
        Template)
      • Detection stage: detect outliers from the data stream
21
22
Training Stage
  • The problem we are facing
      • The number of subspaces grows exponentially with
        the number of dimensions of the data stream
      • Evaluating each data point in every possible
        subspace is prohibitively expensive.
  • Instead, we evaluate the outlier-ness of each point in only a few
    subspaces of the space lattice.
    This set of subspaces is called the Sparse Subspace
    Template (SST).
  • The SST consists of three groups of subspaces
      • Fixed SST Subspaces (FS)
      • Unsupervised SST Subspaces (US)
      • Supervised SST Subspaces (SS)
22
23
Training Stage (cont.)
  • Fixed SST Subspaces (FS)
      • FS contains all the subspaces in the full lattice
        whose dimension is at most MaxDimension, where
        MaxDimension is a user-specified parameter.
      • That is, FS contains all the subspaces with dimensions
        1, 2, ..., MaxDimension. MaxDimension is
        typically small, e.g., 2 or 3.
      • FS establishes the baseline of the detection
        performance.
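
The definition of FS translates directly into an enumeration of attribute subsets. The sketch below is one way to write it, with each subspace represented as a tuple of attribute indices; the function name fixed_sst_subspaces and this representation are assumptions, not taken from the talk.

```python
from itertools import combinations


def fixed_sst_subspaces(num_dims, max_dimension):
    """Return every attribute subset of size 1 up to max_dimension."""
    return [
        subspace
        for k in range(1, max_dimension + 1)
        for subspace in combinations(range(num_dims), k)
    ]


# For a 4-dimensional stream and MaxDimension = 2 this yields the
# four 1-dimensional and six 2-dimensional subspaces.
fs = fixed_sst_subspaces(num_dims=4, max_dimension=2)
```

Keeping MaxDimension at 2 or 3 keeps this list short even when the stream has many attributes.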

Figure: building FS (specify MaxDimension, create space lattice, FS).
23
24
Training Stage (cont.)
  • Unsupervised SST Subspaces (US)
      • US consists of the outlying subspaces of the top
        training data, i.e., the training data that have the
        highest overall outlying degree.
  • Rationale
      • The selected training data are more likely to be
        outliers.
      • Their outlying subspaces can potentially be used to
        detect more subsequent outliers in the stream.
      • This allows for unsupervised learning.

Figure: unlabeled training data -> clustering -> top training data -> MOGA -> outlying subspaces -> US.
24
25
Training Stage (cont.)
  • Supervised SST Subspaces (SS)
      • A few outlier examples may be provided by domain
        experts or by a previous detection process.
      • SS is the set of outlying subspaces of these
        outlier examples.
  • Rationale
      • Based on SS, example-based outlier detection can
        be performed, which detects more outliers that are
        similar to these outlier examples.
      • This allows for supervised learning.

Figure: outlier examples -> MOGA -> outlying subspaces -> SS.
25
26
Detection Stage
  • The detection stage performs outlier detection
    on incoming stream data.
  • Two sub-steps
      • Update step
          • The data synopses of the projected cells to which
            the incoming point belongs, in each subspace of the
            SST, are updated.
      • Detection step
          • The outlier-ness of the point is checked in each
            subspace of the SST to decide whether or not it
            is a projected outlier.
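
The two sub-steps can be sketched as a small toy detector. This is not SPOT itself: it uses only a density criterion (assumed here to be the cell count divided by the expected count under a uniform spread), it assumes every attribute is scaled to [0, 1), and the class name SpotSketch and its parameters are hypothetical. The return value mirrors the (b, S, Score) output of the problem formulation.

```python
from collections import defaultdict


class SpotSketch:
    """Toy update-then-detect loop over the subspaces of an SST."""

    def __init__(self, sst, num_intervals, rd_threshold):
        self.sst = sst                      # list of subspaces (tuples of attribute indices)
        self.num_intervals = num_intervals  # equi-width intervals per attribute
        self.rd_threshold = rd_threshold    # user-specified density threshold
        self.total = 0
        self.counts = {s: defaultdict(int) for s in sst}

    def _cell(self, point, subspace):
        m = self.num_intervals
        return tuple(min(int(point[d] * m), m - 1) for d in subspace)

    def process(self, point):
        # Update step: refresh the synopsis of the point's cell in every subspace.
        self.total += 1
        for s in self.sst:
            self.counts[s][self._cell(point, s)] += 1
        # Detection step: check the point's cell in every subspace.
        outlying, scores = [], []
        for s in self.sst:
            expected = self.total / self.num_intervals ** len(s)
            rd = self.counts[s][self._cell(point, s)] / expected
            if rd < self.rd_threshold:
                outlying.append(s)
                scores.append(rd)
        return bool(outlying), outlying, scores


detector = SpotSketch(sst=[(0,), (1,), (0, 1)], num_intervals=10, rd_threshold=0.6)
for i in range(200):
    x = (i % 10) / 10 + 0.05
    detector.process((x, x))              # normal points lie on the diagonal
result = detector.process((0.15, 0.85))   # ordinary in each attribute, odd as a pair
```

In this toy stream the last point looks normal in both 1-dimensional subspaces and is flagged only in the 2-dimensional subspace (0, 1), which is the kind of outlier a full-space or single-attribute check would miss.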

26
27
System Architecture of SPOT
27
29
Multi-objective Genetic Algorithm
  • A genetic algorithm (GA) is used to search for
    subspaces in which the top training data/outlier
    examples are more outlying.
  • It is used when constructing US and SS.
  • A multi-objective genetic algorithm (MOGA) is used
    for the subspace search because SPOT optimizes multiple (i.e.,
    three) criteria.
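
The transcript does not describe how MOGA is implemented, so the sketch below shows only Pareto dominance, the comparison that multi-objective searches commonly use to rank candidates when no single criterion decides. It assumes each candidate subspace has three objective values to be minimized; the function names and the example numbers are made up for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the subspaces whose objective tuples no other candidate dominates."""
    return [
        s
        for s, obj in candidates.items()
        if not any(dominates(other, obj) for t, other in candidates.items() if t != s)
    ]


# Hypothetical objective values (lower is more outlying) for four subspaces.
candidates = {
    (0, 1): (0.2, 0.5, 0.4),
    (0, 2): (0.3, 0.6, 0.5),   # worse than (0, 1) in every objective
    (1, 2): (0.1, 0.9, 0.3),
    (2,): (0.8, 0.4, 0.9),
}
front = pareto_front(candidates)
```

A genetic algorithm would generate new candidate subspaces by crossover and mutation and keep the non-dominated ones across generations; that part is omitted here.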

29
30
Multi-objective Genetic Algorithm
30
32
Experimental Results
  • Synthetic data sets
      • Generated by two high-dimensional data
        generators.
      • SD1 produces data sets that exhibit
        remarkably different data characteristics in
        projections onto different subsets of features, in
        terms of the number, location, size and
        distribution of the data generated.

32
33
Experimental Results (cont.)
  • Synthetic data sets
      • SD2 is specially designed for a comparative study of
        SPOT and the existing methods. An important
        characteristic of SD2 is that the projected
        outliers appear perfectly normal in all
        1-dimensional subspaces.

33
34
Experimental Results (cont.)
  • Real data sets
      • RD1: Letter Image (17 dimensions, UCI machine
        learning repository)
      • RD2: Musk (168 dimensions, UCI machine learning
        repository)
      • RD3: MIT wireless network data set (15
        dimensions)
      • RD4: KDD-CUP99 Network Intrusion Detection
        stream data set (42 dimensions)

34
35
Experimental Results (cont.)
  • Scalability study
      • Scalability of training w.r.t. stream length
      • Scalability of training w.r.t. stream dimensionality
      • Scalability of detection w.r.t. stream length
      • Scalability of detection w.r.t. stream dimensionality
35
36
Experimental Results (cont.)
  • Competitive methods
      • Histogram: non-parametric statistical method
      • Kernel function: non-parametric statistical method
      • Incremental LOF: incremental variant of LOF
      • HPStream: projected clustering method for
        high-dimensional data streams
      • Largest_Cluster: clustering method (normal data reside
        only in the largest cluster)

36
37
Experimental Results (cont.)
  • Experimental results
      • Purple: DR > 85% and FPR < 10%
      • Red: DR > 90% and FPR < 1%
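
The slides do not spell out how DR and FPR are computed. The sketch below assumes the usual definitions: detection rate is the fraction of true outliers that are flagged, and false positive rate is the fraction of normal points that are flagged. The function name and the sample labels are hypothetical.

```python
def detection_and_false_positive_rate(labels, predictions):
    """Return (DR, FPR) from ground-truth labels and detector flags."""
    true_pos = sum(1 for y, p in zip(labels, predictions) if y and p)
    false_pos = sum(1 for y, p in zip(labels, predictions) if not y and p)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    return true_pos / positives, false_pos / negatives


# 4 true outliers followed by 16 normal points (made-up data).
labels = [True] * 4 + [False] * 16
predictions = [True, True, True, False] + [True, True] + [False] * 14
dr, fpr = detection_and_false_positive_rate(labels, predictions)
```

In this made-up example three of four outliers are caught and two of sixteen normal points are flagged, so neither colour threshold above would be met.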

37
38
Experimental Results (cont.)
  • Experimental results
      • Compared with the other competitive methods, SPOT has
        two advantages:
          • It is equipped with subspace exploration
            capability, which contributes to a good detection
            rate, and
          • It uses multiple criteria, which enables it to deliver
            more accurate detection and helps
            reduce its false positive rate.

38
40
Conclusions
  • We address the problem of projected outlier
    detection for high-dimensional data streams (new
    and challenging!)
  • SPOT uses a compact data synopsis, PCS, to
    capture the statistical information needed
    for outlier detection.
  • SPOT detects outliers using the SST, a well-designed
    subspace template.
  • SPOT adopts a flexible framework for using
    multiple measures for outlier detection. MOGA serves as
    an effective search method to find subspaces that
    optimize these outlier-ness criteria.
  • Experimental results demonstrate the good
    performance of SPOT.

40
41
Conclusions (cont.)
  • Limitations of SPOT
      • Generally, SPOT is slower than traditional
        outlier detection methods (SPOT explores hundreds
        or thousands of subspaces)
      • A large SST puts SPOT at greater risk of
        a high false positive rate.
        This becomes salient when dealing with unlabeled data.
      • There is a tradeoff between detection rate and false
        positive rate
      • Using an equal number of intervals for each dimension
        may not be the optimal partition.

41
42
Thank you for your time!
42