iSAX: Indexing and Mining Terabyte Sized Time Series - PowerPoint PPT Presentation

About This Presentation
Title:

iSAX: Indexing and Mining Terabyte Sized Time Series

Description:

84 = 4,096 possible SAX word labels. Place time series which map to the same label in the same file on disk. Compute label for query and retrieve matching file ... – PowerPoint PPT presentation

Number of Views:237
Avg rating:3.0/5.0
Slides: 18
Provided by: drag77
Learn more at: http://www.cs.ucr.edu
Category:

less

Transcript and Presenter's Notes

Title: iSAX: Indexing and Mining Terabyte Sized Time Series


1
  • iSAX Indexing and Mining Terabyte Sized Time
    Series

Jin Shieh, Eamonn Keogh Computer Science Eng.
Dept. University of California, Riverside
2
Outline
  • Introduction
  • Motivating example
  • iSAX representation
  • Indexing time series
  • Experimental evaluation
  • Conclusion

3
Introduction
  • Our work extends a popular symbolic
    representation of time series to allow for the
    indexing and retrieval of millions of time series
  • Symbolic Aggregate approXimation (SAX)
  • Represent a time series T of length n in
    w-dimensional space using PAA
  • Where the ith element of is
  • Then discretize into a vector of symbols
  • Breakpoints map to a small alphabet a of
    symbols

4
Introduction (cont.)
  • SAX is lower bounding
  • Given a SAX representations Ta, Sa a lower bound
    to the Euclidean distance is
  • MINDIST(Ta, Sa)
  • dist(ti,si) is the smallest distance between the
    breakpoints that characterize each symbol, 0 if
    they overlap

5
Motivating Example
  • Why not just index using SAX?
  • For example index 1,000,000 time series using
    SAX
  • Choose SAX parameters
  • cardinality 8, wordlength 4
  • 84 4,096 possible SAX word labels
  • Place time series which map to the same label in
    the same file on disk
  • Compute label for query and retrieve matching
    file
  • Time series in file likely to be good approximate
    matches
  • Average label occupancy 1,000,000/4,096 244
    (reasonable)

6
Motivating Example (cont.)
  • In practice, the distribution of time series to
    SAX word labels is not uniform!
  • Empty
  • Disproportionate percentage of the dataset
  • Ideal condition We want to give a threshold th,
    and have the number of entries n mapped to a
    label to be 1 n th
  • Favor larger n
  • How can we achieve this? We need to make SAX more
    flexible

7
iSAX Representation
  • SAX uses a single hard-coded cardinality
  • Unable to differentiate only on dimensions of
    interest
  • We will show that the indexing problem can be
    solved if we extend SAX to allow
  • Different cardinalities within a single word
  • Comparison of words with different cardinalities
  • We call this extension indexable SAX (iSAX)

8
iSAX Representation (cont.)
  • Multi-resolution property
  • Readily convert to any lower resolution that
    differs by a power of two
  • Lower bounding distance between iSAX words
    enforced through examination of both sets of
    breakpoints
  • iSAX offers a bit aware, quantized,
    multi-resolution representation with variable
    granularity

12,13, 6, 1 1100,1101,0110,0001 6, 6, 3, 0 110 ,110 ,011 ,000 3, 3, 1, 0 11 ,11 ,01 ,00 1, 1, 0, 0 1 ,1 ,0 ,0
9
Indexing with iSAX
  • Split a set of time series represented by a
    common iSAX word into mutually exclusive subsets
    (using multi-resolution property)
  • Increase cardinality along dimensions d, word
    length w, 1 d w
  • Fan-out rate bound by 2d
  • Iterative doubling
  • Given a base cardinality b, cardinality at i-th
    increase is b2i
  • Alignment of breakpoints overlap
  • Allows for index structures which are
    hierarchical, with non-overlapping regions, and a
    controlled fan-out rate

10
Indexing with iSAX (cont.)
  • Simple tree-based index (base cardinality b, word
    length w, threshold th)
  • Hierarchically subdivides SAX space until entries
    in each subspace falls within th
  • Leaf nodes point to index files on disk
  • Internal nodes designate a split in SAX space
  • Approximate Search
  • Similar time series often represented by same
    iSAX word
  • Traverse index until leaf
  • Match iSAX representation at each level
  • Apply heuristics if no match
  • Exact Search
  • Leverage approximate search
  • Prune search space
  • Lower bounding distance

11
Experimental Evaluation
  • We conduct experiments to identify
    characteristics of the iSAX representation
  • Tightness of the lower bound
  • Indexing performance on massive datasets
  • Applicability to data-mining algorithms

12
Tightness of Lower Bounds
  • TLB LowerBoundDist(T,S) / EuclideanDist(T,S)
  • For a given dataset
  • Time series length 480, 960, 1440, 1920
  • Bytes available for representation 16, 24, 32,
    40
  • Results similar across thirty datasets

13
Indexing Performance on Massive Datasets
  • Indexed random walk datasets of 1, 2, 4, 8
    million time series of length 256
  • Parameters b 4, w 8, th 100
  • Generated 39,255, 57,365, 92,209, 162,340 index
    files
  • Approximate Search (1000 queries)
  • Exact Search (100 queries)

Avg. Time/Query (min) Avg. Time/Query (min) Avg. Time/Query (min) Avg. Time/Query (min) Avg. Time/Query (min)
1M 2M 4M 8M
Exact Search 3.8 5.8 9.0 14.1
Sequential Scan 71.5 104.8 168.8 297.6
Avg. Disk Accesses/Query Avg. Disk Accesses/Query Avg. Disk Accesses/Query Avg. Disk Accesses/Query Avg. Disk Accesses/Query
1M 2M 4M 8M
Exact Search 2115.3 3172.5 4925.3 7719.1
Sequential Scan 39255 57365 92209 162340
14
Data Mining
  • Definition Time Series Set Difference (TSSD)
    (A,B). Given two collections of time series A and
    B, the time series set difference is the
    subsequence in A whose distance from its nearest
    neighbor in B is maximal
  • Electrocardiogram dataset from a 45 year old male
    subject with suspected sleep-disordered
    breathing
  • 7.2 hours as reference set B (1,000,000 time
    series)
  • 8 minutes 39 seconds as novel set A (20,000
    time series) where the patient woke up

The Time Series Set Difference discovered between
ECGs recorded during a waking cycle and the
previous 7.2 hours (respiration pattern change in
accordance with change in sleep stages)
15
Data Mining (cont.)
  • Solutions
  • Sequential scan A across B
  • Exact search each entry in A using index on B
  • Leverage approximate and exact search
  • Order A by approximate search distance in a queue
  • Perform exact search using index on B in
    descending distance
  • Suspend if distance becomes lower than next entry
    in the queue
  • If search completes, return as TSSD

Distance Computations Disk Accesses Est. Time
1) Sequential Scan 20,000,000,000 31,196 6.25 days
2) Exact Search 325,604,200 5,676,400 1.04 days
3) Leveraged 2,365,553 43,779 34 minutes
16
Conclusion
  • Introduced the iSAX representation and shown how
    it can be used for indexing time series
  • Demonstrated scalability and efficacy on massive
    datasets
  • Showed how approximate and exact search can be
    used in conjunction to produce exact results on
    data mining problems

17
  • THANK YOU!
Write a Comment
User Comments (0)
About PowerShow.com