1

Disk Aware Discord Discovery
Finding Unusual Time Series in Terabyte Sized
Datasets

Dragomir Yankov, Eamonn Keogh
Computer Science & Eng. Dept., University of California, Riverside
Umaa Rebbapragada
Dept. of Computer Science, Tufts University
Best paper winner, ICDM 2007
2
Outline

What inspired the current work
The time series discord detection problem
An efficient algorithm for mining disk resident
discords
Detecting range-based discords
Detecting the top k discords
Experimental results
Evaluating the effectiveness of the discord
definition
Scalability of the discord detection algorithm

3
A motivating example

Myriads of telescopes around the world constantly
record valuable astronomical data, e.g. star
light-curves

A light-curve is a real-valued time series
of light magnitude measurements
derived from telescopic images

Eclipsed binary (movie by kind permission of Prof. Richard W. Pogge, OSU)
Sirius AB (image: Chandra X-ray Observatory)
4
A motivating example (cont)

The American Association of Variable Star
Observers has a database of over 10.5 million
variable star brightness measurements going back
over ninety years
Over 400,000 new variable star brightness
measurements are added to the database every year
Many of the observations are noisy or are
preprocessed inaccurately prior to storing
Efficient, unsupervised methods for cleaning the
data are required

5
A motivating example (cont)

Data are inherently non-convex and hard to model
probabilistically.

Anomalies should be defined with respect to the non-linear manifolds
defined by the light-curve time series (true for many time series
datasets).

6
Definitions and assumptions

Notation
time series
subsequence
time series database
Distance function (may not be a metric): defines an ordering for the
elements in the time series database

Nasdaq Composite (Oct06-Oct07)
7
Time series discords

Most-significant discord: the subsequence
with maximal distance to its
nearest neighbor
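
As a concrete illustration of this definition, here is a minimal in-memory sketch. It is not the disk aware algorithm itself. It assumes each row of a NumPy array is one time series and uses Euclidean distance (the slides allow any distance function); the names `nn_distances` and `most_significant_discord` are hypothetical.

```python
import numpy as np

def nn_distances(X):
    """Distance from each row of X to its nearest neighbor among the other rows."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))   # all pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)            # a series is not its own neighbor
    return D.min(axis=1)

def most_significant_discord(X):
    """Index and NN distance of the row that is farthest from its nearest neighbor."""
    nn = nn_distances(X)
    idx = int(np.argmax(nn))
    return idx, float(nn[idx])

# Toy data (assumption): 50 short random walks plus one planted unusual series.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16)).cumsum(axis=1)
X[7] = 100.0 * np.sin(np.linspace(0, 6, 16))
idx, dist = most_significant_discord(X)
print(idx, dist)
```

This brute-force version computes every pairwise distance, which is the quadratic cost a disk-resident method needs to avoid.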

8
Generalized discord definitions

Most-significant k-th NN discord: the
subsequence with maximal distance
to its k-th nearest neighbor

9
Generalized discord definitions

Most-significant k-NN discord: the subsequence
with maximal distance to its k nearest
neighbors in the database

The algorithm utilizes the first of these discord
definitions for its computational efficiency and
intuitive interpretation
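
The two generalized definitions can be contrasted with the nearest-neighbor one in a small sketch. It assumes Euclidean distance on in-memory rows, and it assumes that "distance to its k nearest neighbors" means the mean of those k distances, which the slides do not specify. The function names are hypothetical.

```python
import numpy as np

def sorted_neighbor_distances(X):
    """Row i holds the distances from series i to all others, in ascending order."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(D, np.inf)            # self-distance sorts to the last column
    return np.sort(D, axis=1)

def kth_nn_discord(X, k):
    """Series with maximal distance to its k-th nearest neighbor (k=1 is the plain discord)."""
    score = sorted_neighbor_distances(X)[:, k - 1]
    return int(np.argmax(score)), float(score.max())

def knn_discord(X, k):
    """Series with maximal mean distance to its k nearest neighbors (assumed aggregate)."""
    score = sorted_neighbor_distances(X)[:, :k].mean(axis=1)
    return int(np.argmax(score)), float(score.max())

# Toy data (assumption): a tight cluster, one isolated point, and two far-away "twins".
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
              [4, 0.5], [10, 10], [10, 10.1]], dtype=float)
print(kth_nn_discord(X, 1), kth_nn_discord(X, 2), knn_discord(X, 2))
```

With k = 1 the twins hide each other because each has a very close neighbor; the k-th NN and k-NN variants with k = 2 expose them, at the price of extra bookkeeping per series.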
10
Disk aware discord detection

Detecting discords is harder than finding similar patterns:
anytime algorithms can quickly detect similarities
anomalies require computation time
Indexing is not a solution:
time series are high dimensional
dimensionality reduction is often inadequate
linear scan is faster than 10% random disk accesses

We are looking for an algorithm that performs two
disk scans and an approximately linear number of
computations
11
Discord detection algorithm

Phase 1 candidates selection phase

r - discord range
16
Discord detection algorithm

Phase 2 candidates refinement phase

r - discord range
18
Discord detection algorithm

Phase 2 candidates refinement phase

Upon completion, sort the candidate list C

19
Correctness of the algorithm

The candidate set C contains all discords at
distance at least r from their NN, plus some
other elements
The refinement phase removes from C all false
positives, and no real discord is pruned
Correctness: the range discord algorithm detects
all discords, and only the discords, with respect
to the specified range r
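
The two phases and the correctness claim above can be sketched as follows. This is a reconstruction under assumptions, since the slide figures that showed the actual steps are not in the transcript: the disk is simulated by a hypothetical `scan()` function that yields `(index, series)` pairs sequentially, the distance is Euclidean, and the pruning rule (drop any candidate found within r of another series) is inferred from the statements about the candidate set C.

```python
import numpy as np

def select_candidates(scan, r):
    """Phase 1 (first scan): keep every series not yet seen within r of another one."""
    candidates = {}                          # index -> series, the set C
    for i, s in scan():
        is_candidate = True
        for j in list(candidates):
            if np.linalg.norm(s - candidates[j]) < r:
                del candidates[j]            # j has a neighbor closer than r
                is_candidate = False         # and so does s
        if is_candidate:
            candidates[i] = s
    return candidates

def refine_candidates(scan, candidates, r):
    """Phase 2 (second scan): drop false positives and record exact NN distances."""
    nn_dist = {j: np.inf for j in candidates}
    for i, s in scan():
        for j in list(nn_dist):
            if i == j:
                continue
            d = np.linalg.norm(s - candidates[j])
            if d < r:
                del nn_dist[j]               # false positive: a neighbor lies within r
            else:
                nn_dist[j] = min(nn_dist[j], d)
    # Upon completion, sort the candidates by NN distance, largest first.
    return sorted(nn_dist.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (assumption): Gaussian series with three planted far-away series.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
X[:3] += 20.0 * np.arange(1, 4)[:, None]

def scan():
    return enumerate(X)                      # stands in for one sequential disk pass

r = 10.0
discords = refine_candidates(scan, select_candidates(scan, r), r)
print(discords)
```

A true discord has no series within r, so it is always added in phase 1 and never deleted in either phase; anything else is deleted in phase 2 at the latest. Only the candidate set has to fit in main memory.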

20
Finding a good range parameter

Selecting too large an r may result in an empty discord
set, while too small an r can render the algorithm
inefficient
Computing the nearest neighbor distance distribution (NNDD) is
expensive
NNDD depends on the number of examples in the data

21
Approximating NNDD

Intuition: though the relative volume in the
upper tail decreases, the absolute number of
discords cut by r remains sufficient when adding
more data
Detecting the top k discords:
Select a uniformly random sample of the database
Compute the top k discords in the sample
Order their NN distances
Set the range parameter r based on the ordered distances
Run the disk aware algorithm with range parameter r
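
The sampling steps above can be sketched as below. The rule used to set r is an assumption: the transcript lost the formula, so this sketch takes r to be the k-th largest NN distance found in the sample. The function names and the sample size are hypothetical, and distances are Euclidean.

```python
import numpy as np

def nn_distances(X):
    """Distance from each row of X to its nearest neighbor among the other rows."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(D, np.inf)
    return D.min(axis=1)

def estimate_range(X, k, sample_size, rng):
    """Guess a range r for the top-k discords from a uniform random sample."""
    idx = rng.choice(len(X), size=sample_size, replace=False)   # uniform sample
    nn = nn_distances(X[idx])                # NN distances inside the sample only
    top_k = np.sort(nn)[::-1][:k]            # d_1 >= d_2 >= ... >= d_k
    return float(top_k[-1])                  # assumed rule: r = d_k

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
r = estimate_range(X, k=3, sample_size=100, rng=rng)
print(r)
```

A series' nearest neighbor in a sample can never be closer than its nearest neighbor in the full data, so an r chosen this way may be too large. If the full run returns fewer than k discords, one option is to lower r and run again.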

22
Experimental evaluation

We performed two sets of experiments
Experiments showing the utility of the time
series discord definition
Experiments showing the scalability of the disk
aware discord detection algorithm

23
Experimental evaluation - utility of the discord definition

Star light-curve data from the Optical Gravitational Lensing
Experiment (OGLE)
Three classes of light-curves:
Eclipsed binaries
Cepheids
RR Lyrae variables

typical examples
top two discords in each class
24
Experimental evaluation - utility of the discord definition

MSN web queries made in 2002
The most significant discord using rotation
invariant Euclidean distance

patterns dominated by a weekly cycle
anticipated bursts
periodicity 29.5 days: the length of a synodic month
25
Experimental evaluation - utility of the discord definition

Anomaly detection in video sequences
(multivariate data)
Adapting the method as a data cleaning procedure

the top one discord shown with only one of the
existing clusters
our method achieves 100% accuracy on the planted
anomalous trajectories
26
Experimental evaluation - utility of the discord definition

Population growth data: we studied the growth
rate of 206 countries for the last 25 years,
looking for the most dramatic 5-year event

the top 2 discords with a set of 10
representative countries for contrast
27
Experimental evaluation - scalability of the disk aware algorithm

We generated 3 data sets of size up to 0.35 TB
of random walk time series
Six non-random walk time series were planted;
we looked for the top 10 discords
Time efficiency on the three random walk data sets

two of the planted series (top) were among the
top 10 discords
28
Experimental evaluation - scalability of the disk aware algorithm

Time efficiency (Heterogeneous data)
Main memory requirement for different thresholds

29
Experimental evaluation - scalability of the disk aware algorithm

Parallelizing the algorithm (m computers)

Candidate selection phase
Candidate refinement phase
30
Experimental evaluation - scalability of the disk aware algorithm

Parallelizing the algorithm (dataset: one million
random walks)

The runtime overhead for 8 computers is
approximately 30%. This is due to the increased
candidate set size C at the end of phase 1
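
One way the parallel version could be organized is sketched below. It is an assumption, since the diagrams for the two parallel phases are missing from the transcript. Each of m workers runs candidate selection on its own partition, the per-partition candidate sets are merged, and each worker then refines the merged set against its partition. Workers are simulated by a plain loop, distances are Euclidean, and all names are hypothetical.

```python
import numpy as np

def phase1(part, r):
    """Candidate selection on one partition; part is a list of (index, series)."""
    cand = {}
    for i, s in part:
        keep = True
        for j in list(cand):
            if np.linalg.norm(s - cand[j]) < r:
                del cand[j]
                keep = False
        if keep:
            cand[i] = s
    return cand

def phase2(part, cand, r):
    """Refine candidates against one partition; returns survivors with local NN distance."""
    nn = {j: np.inf for j in cand}
    for i, s in part:
        for j in list(nn):
            if i != j:
                d = np.linalg.norm(s - cand[j])
                if d < r:
                    del nn[j]
                else:
                    nn[j] = min(nn[j], d)
    return nn

def parallel_discords(X, r, m):
    splits = np.array_split(np.arange(len(X)), m)
    parts = [[(int(i), X[i]) for i in idx] for idx in splits]
    merged = {}
    for p in parts:                          # each iteration could run on its own machine
        merged.update(phase1(p, r))          # union of the local candidate sets
    results = [phase2(p, merged, r) for p in parts]
    survivors = set(merged)
    for res in results:
        survivors &= set(res)                # must survive refinement on every partition
    final = {j: min(res[j] for res in results) for j in survivors}
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
X[:3] += 20.0 * np.arange(1, 4)[:, None]     # three planted far-away series
print(parallel_discords(X, r=10.0, m=8))
```

Each worker prunes only against its own partition, so the merged candidate set is typically larger than in a single-machine run, which is consistent with the overhead reported on the slide.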
31
Conclusion

Discords provide for an effective definition of
rare time series patterns.
The presented disk aware algorithm meets all the
requirements of a good off-the-shelf data mining
tool:
The results are interpretable
It is extremely efficient and largely scalable
Very easy to implement (8 lines in Matlab)
Allows for straight-forward parallel and online
extensions

32
Acknowledgements