Mining Data Streams


Slides: 8
Provided by: jia120

Transcript and Presenter's Notes

1
Mining Data Streams
Key Technical Challenges
  • Large data size
  • High dimensionality
  • Temporal nature of the data
  • Skewed class distribution
    • Rare interesting data => "needle in a haystack"
  • Data preprocessing
    • Converting network traffic into data
  • High-performance computing (HPC)
    • Critical for on-line analysis and scalability to
      very large data sets

2
Mining Data Streams
Tasks
  • Multi-dimensional (on-line) analysis of streams
  • Clustering data streams
  • Classification of stream data
  • Mining in data streams
    • Frequent/sequential patterns
    • Partial periodicity
    • Notable gradients
    • Outliers and unusual patterns

3
Mining Data Streams
Structure
  • Massive sequences arrive at a rapid rate
  • Mining data streams engine
    • Single pass
    • Bounded storage
    • Real-time
  • Deterministic bounds
  • Probabilistic bounds

4
Mining Data Streams
Stream Synopses Computation
  • Sampling
    • A small random sample of the data often represents
      all the data well
    • Answering queries using samples
  • Example
    • Query: SELECT AVG(R.e) FROM R WHERE R.e is odd (n = 12)
    • AVG returns the average of the odd elements
    • Original data stream: 9 3 5 2 7 1 6 5 8 4 9 1
    • Answer: (9 + 3 + 5 + 7 + 1 + 5 + 9 + 1) / 8 = 5
    • Sample S: 9 5 1 8
    • Answer: (9 + 5 + 1) / 3 = 5
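
The sampling example can be checked with a short Python sketch. The helper names `avg_odd` and `reservoir_sample` are illustrative and do not come from the slides. Reservoir sampling is one common way to draw a fixed-size uniform sample in a single pass; the slides do not say which sampling method they assume, so treat it as a substituted technique.

```python
import random


def avg_odd(values):
    """Average of the odd elements, or None if there are none."""
    odd = [v for v in values if v % 2 == 1]
    return sum(odd) / len(odd) if odd else None


def reservoir_sample(stream, k, rng):
    """Uniform random sample of k items in one pass (Algorithm R).

    Assumption: the slides only say "sampling"; this method is a
    stand-in chosen because it needs bounded storage (k items).
    """
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                sample[j] = item         # replace with probability k/(i+1)
    return sample


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]   # stream from the slide
slide_sample = [9, 5, 1, 8]                      # sample S from the slide

exact = avg_odd(stream)            # (9+3+5+7+1+5+9+1) / 8
estimate = avg_odd(slide_sample)   # (9+5+1) / 3
random_sample = reservoir_sample(stream, 4, random.Random(0))
```

The slide's sample happens to reproduce the exact answer. A different random sample, such as `random_sample`, will generally give only an approximation.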

5
Mining Data Streams
Stream Synopses Computation
  • Histograms
    • Approximate the frequency distribution of
      element values in a stream
  • A histogram consists of
    • A partitioning of element domain values into
      buckets
    • A count per bucket B (of the number of elements
      in B)
  • Equi-depth histograms
    • Select buckets such that counts per bucket
      are equal
  • V-optimal histograms
    • Select buckets to minimize frequency variance
      within buckets
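
A minimal Python sketch of the equi-depth idea, assuming all values are available at once and the value count divides evenly by the number of buckets. It sorts the data, so it is an offline illustration rather than a single-pass streaming algorithm. The function name `equi_depth_histogram` is hypothetical.

```python
def equi_depth_histogram(values, num_buckets):
    """Split sorted values into buckets holding equal counts.

    Returns a list of (low, high, count) tuples, one per bucket.
    Assumption: len(values) is a positive multiple of num_buckets.
    """
    if num_buckets <= 0 or not values or len(values) % num_buckets != 0:
        raise ValueError("need a non-empty input divisible into equal buckets")
    ordered = sorted(values)
    depth = len(ordered) // num_buckets      # elements per bucket
    buckets = []
    for start in range(0, len(ordered), depth):
        chunk = ordered[start:start + depth]
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets


# Reusing the 12-element stream from the sampling slide.
stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
histogram = equi_depth_histogram(stream, 4)
# -> [(1, 2, 3), (3, 5, 3), (5, 7, 3), (8, 9, 3)]
```

Every bucket holds three elements, while the bucket widths differ. Because the split is by rank, a repeated value (here 5) can fall on both sides of a boundary.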

6
Mining Data Streams
Stream Synopses Computation
  • Wavelets
    • Mathematical tool for hierarchical decomposition
      of functions/signals
  • Haar-wavelet histogram
  • Properties
    • Simple and strict local support
    • Poor localization in frequency
    • Poor compression performance
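
The hierarchical decomposition can be illustrated with a small Haar transform in Python. This sketch uses the unnormalized averaging/differencing form (pairwise average and half-difference), which is one common convention. The slides do not specify a normalization, and the input vector is an illustrative assumption.

```python
def haar_transform(values):
    """Unnormalized Haar decomposition of a power-of-two-length sequence.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = [float(v) for v in values]
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = diffs + details    # coarser levels go in front
        current = averages
    return current + details


signal = [2, 2, 0, 2, 3, 5, 4, 4]
coefficients = haar_transform(signal)
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

A synopsis would keep only a few of these coefficients, typically those with the largest magnitude after normalization, and treat the rest as zero.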

7
Mining Data Streams
  • Decision trees
    • Construct the tree in the same way, but wait for
      significant differences
    • Instead of re-reading the dataset, use new data
      from the stream
  • Online aggregation model
  • Clustering data streams
    • Build a larger model out of smaller building
      blocks
    • Argue that composition does not lose too much
      accuracy
  • Composing approximate query operators
  • Association rules
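
One common way to make "wait for significant differences" concrete in streaming decision trees is the Hoeffding bound. The slides do not name it, so the sketch below is an assumption about the intended technique, and the function names and numbers are illustrative. A node splits only once the observed gap between the two best split attributes exceeds the bound for the number of examples seen so far.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n
    independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split when the gap between the top two attributes is significant."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical gains: the same gap is not significant after 50 examples
# but is significant after 1000, so the node waits for more stream data.
early = should_split(0.30, 0.10, value_range=1.0, delta=1e-7, n=50)
later = should_split(0.30, 0.10, value_range=1.0, delta=1e-7, n=1000)
```

Because the bound shrinks as `n` grows, the tree can decide from new stream data alone, without re-reading earlier examples.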