Disk Aware Discord Discovery: - PowerPoint PPT Presentation

Slides: 34
Provided by: drag77
Learn more at: http://www.cs.ucr.edu

Transcript and Presenter's Notes

Title: Disk Aware Discord Discovery:


1
  • Disk Aware Discord Discovery
  • Finding Unusual Time Series in Terabyte Sized
    Datasets

Dragomir Yankov, Eamonn Keogh, Computer Science &
Eng. Dept., University of California, Riverside
Umaa Rebbapragada, Dept. of Computer
Science, Tufts University
Best paper winner, ICDM 2007
2
Outline
  • What inspired the current work
  • The time series discord detection problem
  • An efficient algorithm for mining disk resident
    discords
  • Detecting range-based discords
  • Detecting the top k discords
  • Experimental results
  • Evaluating the effectiveness of the discord
    definition
  • Scalability of the discord detection algorithm

3
A motivating example
  • Myriads of telescopes around the world constantly
    record valuable astronomical data, e.g. star
    light-curves

  • A light-curve is a real-valued time series
    of light magnitude measurements
    derived from telescopic images

Eclipsed binary Sirius AB
Movie: by kind permission of Prof. Richard W.
Pogge, OSU
Image: Chandra X-ray Observatory
4
A motivating example (cont)
  • The American Association of Variable Star
    Observers has a database of over 10.5 million
    variable star brightness measurements going back
    over ninety years
  • Over 400,000 new variable star brightness
    measurements are added to the database every year
  • Many of the observations are noisy or are
    preprocessed inaccurately prior to storing
  • Efficient, unsupervised methods for cleaning the
    data are required

5
A motivating example (cont)
  • Data are inherently non-convex and hard to model
    probabilistically.
  • Anomalies should be defined with respect to
    the non-linear manifolds defined by the
    light-curve time series (true for many time
    series datasets)

6
Definitions and assumptions
  • Notation
  • time series
  • subsequence
  • time series database
  • Distance function (may not be a
    metric) defines an ordering for the elements in
    the database

Nasdaq Composite (Oct06-Oct07)
7
Time series discords
  • Most-significant discord: the subsequence
    with maximal distance to its
    nearest neighbor
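
The definition above can be illustrated with a brute-force search. This is a minimal sketch, not the authors' code: it assumes Euclidean distance, treats each list as one item of the database, and ignores the overlap handling that sliding-window subsequences would need. The names `euclidean` and `most_significant_discord` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_significant_discord(series, dist=euclidean):
    """Return (index, nn_distance) of the item farthest from its nearest neighbor."""
    best_idx, best_nn = None, -math.inf
    for i, s in enumerate(series):
        # Nearest-neighbor distance of item i, excluding the item itself.
        nn = min(dist(s, t) for j, t in enumerate(series) if j != i)
        if nn > best_nn:
            best_idx, best_nn = i, nn
    return best_idx, best_nn

# Toy data: three similar items and one unusual item.
data = [[0, 0, 0], [0, 1, 0], [1, 0, 0], [5, 5, 5]]
idx, nn = most_significant_discord(data)
```

The search is quadratic in the number of items, which is what the disk aware algorithm is designed to avoid.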

8
Generalized discord definitions
  • Most-significant k-th NN discord: the
    subsequence with maximal distance
    to its k-th nearest neighbor

9
Generalized discord definitions
  • Most-significant k-NN discord: the subsequence
    with maximal distance to its k nearest
    neighbors in the database

The algorithm utilizes the first of these discord
definitions for its computational efficiency and
intuitive interpretation
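
The two generalized definitions can be compared on a toy example. This is a sketch under assumptions: Euclidean distance is used, and the k-NN score is taken to be the mean distance to the k nearest neighbors (the slide does not say how the k distances are aggregated). All function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neighbor_distances(i, series, dist=euclidean):
    """Sorted distances from item i to every other item."""
    return sorted(dist(series[i], t) for j, t in enumerate(series) if j != i)

def kth_nn_score(i, series, k):
    """Distance from item i to its k-th nearest neighbor."""
    return neighbor_distances(i, series)[k - 1]

def knn_score(i, series, k):
    """Mean distance from item i to its k nearest neighbors (assumed aggregate)."""
    nearest = neighbor_distances(i, series)[:k]
    return sum(nearest) / len(nearest)

def top_discord(series, score, k):
    """Index of the item that maximizes the given discord score."""
    return max(range(len(series)), key=lambda i: score(i, series, k))

# Two unusual items (10 and 10.1) sit close to each other, so each has a
# near neighbor; looking beyond the first neighbor (k = 2) exposes them.
data = [[0.0], [0.1], [0.2], [10.0], [10.1]]
kth_winner = top_discord(data, kth_nn_score, 2)
knn_winner = top_discord(data, knn_score, 2)
```

Both generalized scores need the k smallest distances per item rather than a single minimum, which is part of why the plain nearest-neighbor definition is cheaper to work with.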
10
Disk aware discord detection
  • Detecting discords is harder than finding similar
    patterns
  • anytime algorithms can quickly detect
    similarities
  • anomalies require computation
    time
  • Indexing is not a solution
  • time series are high dimensional
  • dimensionality reduction is often inadequate
  • a linear scan is faster than random disk
    accesses to 10% of the data

We are looking for an algorithm that performs two
disk scans and an approximately linear number of
computations
11
Discord detection algorithm
  • Phase 1: candidate selection phase

r - discord range
16
Discord detection algorithm
  • Phase 2: candidate refinement phase

r - discord range
18
Discord detection algorithm
  • Phase 2: candidate refinement phase

Upon completion, sort the candidates list C


19
Correctness of the algorithm
  • The candidates set C contains all discords at
    distance at least r from their NN, plus some
    other elements
  • The refinement phase removes from C all false
    positives, and no real discord is pruned
  • Correctness: the range discord algorithm detects
    all discords, and only the discords, with respect
    to the specified range r
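
The two phases and the correctness argument above can be sketched in memory as follows. This is an illustrative reconstruction rather than the authors' implementation: `scan` is a hypothetical callable that stands in for one sequential pass over the disk, Euclidean distance stands in for the distance function, and the function names are made up for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_candidates(scan, r, dist=euclidean):
    """Phase 1: one pass; keep items not yet seen within r of another item."""
    candidates = {}  # index -> item
    for i, s in enumerate(scan()):
        is_candidate = True
        for j in list(candidates):
            if dist(s, candidates[j]) < r:
                del candidates[j]     # candidate j has a neighbor closer than r
                is_candidate = False  # and so does the current item
        if is_candidate:
            candidates[i] = s
    return candidates

def refine_candidates(scan, candidates, r, dist=euclidean):
    """Phase 2: second pass; drop false positives, keep exact NN distances."""
    nn = {j: math.inf for j in candidates}
    for i, s in enumerate(scan()):
        for j in list(nn):
            if i == j:
                continue  # never compare a candidate with itself
            d = dist(s, candidates[j])
            if d < r:
                del nn[j]  # false positive: a neighbor lies within r
            else:
                nn[j] = min(nn[j], d)
    # Sort the surviving discords by NN distance, most significant first.
    return sorted(nn.items(), key=lambda kv: kv[1], reverse=True)

def range_discords(scan, r, dist=euclidean):
    """All items at distance at least r from their nearest neighbor."""
    return refine_candidates(scan, select_candidates(scan, r, dist), r, dist)

data = [[0, 0], [0, 1], [1, 0], [10, 10], [20, 0]]
scan = lambda: iter(data)                 # stand-in for a sequential disk scan
phase1 = select_candidates(scan, 5.0)     # item 2 survives as a false positive
discords = range_discords(scan, 5.0)      # items 4 and 3 remain
```

An item whose nearest neighbor is at least r away is never removed in phase 1, so no real discord is lost; item 2 in the toy data shows how a false positive can survive the first pass and is then removed by the second.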

20
Finding a good range parameter
  • Selecting too large an r may result in an empty
    discord set, while too small an r can render the
    algorithm inefficient
  • Computing the nearest neighbor distance
    distribution (NNDD) is expensive
  • NNDD depends on the number of examples in
    the data

21
Approximating NNDD
  • Intuition: although the relative volume in the
    upper tail decreases, the absolute number of
    discords cut by r remains sufficient when adding
    more data
  • Detecting the top k discords
  • Select a uniformly random sample
  • Compute the top k discords in the sample
  • Order their NN distances as
  • Set
  • Run the disk aware algorithm with range parameter
    r
22
Experimental evaluation
  • We performed two sets of experiments
  • Experiments showing the utility of the time
    series discord definition
  • Experiments showing the scalability of the disk
    aware discord detection algorithm

23
Experimental evaluation - utility of the discord
definition
  • Star light-curve data from the Optical
    Gravitational Lensing Experiment (OGLE)
  • Three classes of light-curves
  • Eclipsed binaries
  • Cepheids
  • RR Lyrae variables

typical examples
top two discords in each class
24
Experimental evaluation - utility of the discord
definition
  • MSN web queries made in 2002
  • The most significant discord using rotation
    invariant Euclidean distance
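
The slide names rotation invariant Euclidean distance without defining it. One common definition, assumed here, is the minimum Euclidean distance over all circular shifts of one of the two series. The brute-force sketch below uses that definition; `rotation_invariant_distance` is a hypothetical name.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rotation_invariant_distance(a, b):
    """Minimum Euclidean distance between a and every circular shift of b."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    # b[s:] + b[:s] is b rotated left by s positions (b must be a list).
    return min(euclidean(a, b[s:] + b[:s]) for s in range(len(b)))

same_shape = rotation_invariant_distance([1, 2, 3, 4], [3, 4, 1, 2])
different = rotation_invariant_distance([1, 2, 3, 4], [4, 3, 2, 1])
```

Trying every shift costs quadratic time in the series length, so this form is meant only to make the definition concrete.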

patterns dominated by a weekly cycle
anticipated bursts
periodicity of 29.5 days, the length of a synodic
month
25
Experimental evaluation - utility of the discord
definition
  • Anomaly detection in video sequences
    (multivariate data)
  • Adapting the method as a data cleaning
    procedure

the top one discord shown with only one of the
existing clusters
our method achieves 100% accuracy on the planted
anomalous trajectories
26
Experimental evaluation - utility of the discord
definition
  • Population growth data: we studied the growth
    rate of 206 countries for the last 25 years,
    looking for the most dramatic 5-year event

the top 2 discords with a set of 10
representative countries for contrast
27
Experimental evaluation - scalability of the disk
aware algorithm
  • We generated 3 data sets of size up to 0.35 TB
    of random walk time series
  • Six non-random walk time series were planted,
    and we looked for the top 10 discords
  • Time efficiency on the three random walk data
    sets

two of the planted series (top) were among the
top 10 discords
28
Experimental evaluation - scalability of the disk
aware algorithm
  • Time efficiency (Heterogeneous data)
  • Main memory requirement for different thresholds

29
Experimental evaluation - scalability of the disk
aware algorithm
  • Parallelizing the algorithm (m computers)



Candidate selection phase
Candidate refinement phase
30
Experimental evaluation - scalability of the disk
aware algorithm
  • Parallelizing the algorithm (dataset: one million
    random walks)

The runtime overhead for 8 computers is
approximately 30%. This is due to the increased
size of the candidate set C at the end of phase 1
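
One way the two phases could be split across m machines is sketched below. The slide's diagram is not part of this transcript, so the scheme is an assumption: each partition selects candidates on its own, the local sets are merged, every partition refines the merged set against its own data, and a candidate is kept only if it survives everywhere. The partitions are processed sequentially here, and all names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_candidates(items, r):
    """Phase 1 on one partition; items is a list of (global_id, series)."""
    cands = {}
    for i, s in items:
        keep = True
        for j in list(cands):
            if euclidean(s, cands[j]) < r:
                del cands[j]
                keep = False
        if keep:
            cands[i] = s
    return cands

def refine(items, cands, r):
    """Phase 2 on one partition; returns surviving id -> local NN distance."""
    nn = {j: math.inf for j in cands}
    for i, s in items:
        for j in list(nn):
            if i == j:
                continue
            d = euclidean(s, cands[j])
            if d < r:
                del nn[j]
            else:
                nn[j] = min(nn[j], d)
    return nn

def parallel_range_discords(partitions, r):
    """Simulate m workers, one per partition, run one after another."""
    merged = {}
    for part in partitions:              # candidate selection phase
        merged.update(select_candidates(part, r))
    partial = [refine(part, merged, r) for part in partitions]  # refinement
    surviving = set(merged)
    for nn in partial:                   # must survive on every partition
        surviving &= set(nn)
    result = {j: min(nn[j] for nn in partial) for j in surviving}
    return sorted(result.items(), key=lambda kv: kv[1], reverse=True)

partitions = [
    [(0, [0, 0]), (1, [0, 1])],
    [(2, [1, 0]), (3, [10, 10]), (4, [20, 0])],
]
discords = parallel_range_discords(partitions, 5.0)
```

Because each partition prunes only against its own data, the merged candidate set tends to be larger than in a single sequential pass, which matches the overhead described above.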
31
Conclusion
  • Discords provide an effective definition of
    rare time series patterns.
  • The presented disk aware algorithm meets all
    the requirements of a good off-the-shelf data
    mining tool
  • The results are interpretable
  • It is extremely efficient and largely scalable
  • Very easy to implement (8 lines in Matlab)
  • Allows for straightforward parallel and online
    extensions

32
Acknowledgements
  • We would like to thank
  • Dr. Pavlos Protopapas (Harvard University) -
    light-curve dataset
  • Dr. Michail Vlachos (IBM Watson) - MSN web query
    data
  • Dr. Longin Jan Latecki (Temple University) -
    Trajectory dataset1
  • Dr. Andrew Naftel (University of Manchester) -
    Trajectory dataset2
  • also
  • Dr. Jessica Lin (George Mason University) and
  • Dr. Ada Fu (Chinese University of Hong Kong)
    for useful discussions

33
  • All datasets and the code can be downloaded from
    http://www.cs.ucr.edu/dyankov/projects/
  • THANK YOU!