Applications of Sketch Based Techniques to Data Mining Problems - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Applications of Sketch Based Techniques to Data
Mining Problems
  • Nikos Koudas
  • AT&T Labs Research
  • joint work with
  • G. Cormode, P. Indyk, S. Muthukrishnan

2
Taming Massive Data Sets
  • Requirements of data mining algorithms
  • operate on very large data sets
  • scalability
  • incremental operation
  • Most data mining algorithms have super-linear complexity
  • Deploying mining algorithms directly on very large data sets will most likely result in poor performance

3
Sketching
  • Reduce the dimensionality of the data in a
    systematic way, constructing short data
    summaries
  • Effectively reduce data volume.
  • Deploy mining algorithms on the reduced data
    volume.
  • Main issue: preserve the data properties the mining algorithm depends on (e.g., distances) in the reduced data volume.

4
Applications
  • Clustering time series data
  • Clustering tabular data sets

5
Introduction
  • Time series data abound in many applications.
  • Financial, performance data, geographical and
    meteorological information, solar and space data
    etc.
  • Various works deal with management and analysis
    aspects of time series data
  • Indexing, storage and retrieval
  • Analysis and mining (forecasting, outlier and
    deviation detection, etc)
  • Active research area in many research communities.

6
Representative Trends
Relaxed period
Average trend
7
Usage Examples
  • Mining/Prediction
  • Identifying periodic trends
  • Uncovering unexpected periodic trends
  • Performance management
  • Networking (routing, traffic engineering,
    bandwidth allocation)
  • System Tuning
  • Financial databases
  • Cyclic behavior

8
Definitions
  • Given a time series V of length n and an integer T, define V_i(T) = (v_{iT+1}, v_{iT+2}, ..., v_{iT+T}), for 0 ≤ i ≤ n/T − 1
  • Define C_i(V(T)) = Σ_{0 ≤ j ≤ n/T−1} D(V_i(T), V_j(T))
  • Thus
  • Relaxed period: the T in [l, u] minimizing C_0(V(T))
  • Average trend: the pair (i, T), T in [l, u], minimizing C_i(V(T))
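As a concrete reading of these definitions, here is a brute-force Python sketch (function names are ours, not the paper's; D is taken to be Euclidean distance) that scans every candidate T in [l, u]:

```python
import math

def subvectors(v, T):
    """Split the series v into consecutive non-overlapping windows of length T."""
    n = len(v) // T
    return [v[i * T:(i + 1) * T] for i in range(n)]

def dist(a, b):
    """Euclidean distance D between two equal-length windows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relaxed_period(v, lo, hi):
    """The T in [lo, hi] whose first window is closest, in total distance,
    to all other windows (the C_0 criterion)."""
    best_T, best_cost = None, float("inf")
    for T in range(lo, hi + 1):
        w = subvectors(v, T)
        cost = sum(dist(w[0], wj) for wj in w[1:])
        if cost < best_cost:
            best_T, best_cost = T, cost
    return best_T

def average_trend(v, lo, hi):
    """The pair (T, i) minimizing C_i(V(T)) over T in [lo, hi]."""
    best_cost, best = float("inf"), None
    for T in range(lo, hi + 1):
        w = subvectors(v, T)
        for i in range(len(w)):
            cost = sum(dist(w[i], wj) for wj in w)
            if cost < best_cost:
                best_cost, best = cost, (T, i)
    return best
```

On a series that exactly repeats a length-4 pattern, such as [2, 1, 3, 1] * 8, `relaxed_period(series, 2, 6)` returns 4. These loops are the quadratic and cubic brute-force algorithms the deck discusses next.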

9
Definitions
(Diagram: a length-n series is split into n/T windows for each T in [l, u]; for the relaxed period the first window is compared against all the others, while for the average trend each of the n/T windows is a candidate.)
10
Algorithms
  • There exists a quadratic algorithm for
    identifying relaxed periods.
  • There exists a cubic algorithm for identifying
    average trends.
  • Simply evaluate the clustering for each T in [l, u]; it takes linear time to evaluate the relaxed period for each T and quadratic time for average trends.

11
Algorithms
  • But can we really run these?
  • Consider the length of sessions in an AT&T service, recorded every second for a year: more than 31M values, approximately 256MB.
  • Consider running the previous algorithms for, say, 10 years, or on a finer time scale.
  • Both brute-force algorithms are impractical.
  • Can we run faster on large datasets, too large to fit in memory?

12
Our Approach
  • Identify representative trends faster but provide
    approximate answers
  • General approach (expresses various notions of
    representative trends)
  • Provides guaranteed approximation performance,
    with high probability
  • We present our approach in the following steps
  • Define the sketch of a vector
  • Algorithms for finding the sketch of all
    sub-vectors of width T
  • Determine the sketch of all sub-vectors of width
    in a given range.

13
Sketch of a vector
  • Given a vector t of length l, we generate its sketch S(t) as follows
  • Pick k random vectors u^(1), ..., u^(k) of length l, choosing each component from the normal distribution N(0,1) (each vector normalized to unit length).
  • Define
  • S(t)_i = t · u^(i) = Σ_j t_j u^(i)_j
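A minimal Python rendering of this construction (illustrative only; names are ours), drawing each random vector from N(0,1) and normalizing it as the slide describes:

```python
import math
import random

def sketch(t, k, seed=0):
    """JL-style sketch: component i is the dot product of t with the i-th
    random Gaussian vector, normalized to unit length."""
    rng = random.Random(seed)
    out = []
    for _ in range(k):
        u = [rng.gauss(0.0, 1.0) for _ in t]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]          # normalize to unit length
        out.append(sum(ti * ui for ti, ui in zip(t, u)))
    return out
```

Because each component is a dot product, sketching is linear: with the same random vectors (same seed), S(a) − S(b) = S(a − b), which is why distances between sketches track distances between the original vectors.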

14
Sketch Properties
  • Theorem
  • For any given set L of vectors of length l and a fixed ε < 1/2, if k = 9 log|L| / ε², then for any pair of vectors u, w we have
  • (1 − ε)‖u − w‖² ≤ ‖S(u) − S(w)‖² ≤ (1 + ε)‖u − w‖², with probability 1/2.
  • By increasing k we can increase the probability of success.
  • This is the Johnson-Lindenstrauss (JL) lemma.

15
Fixed window sketches
  • Compute all sketches of sub-vectors of length l
    in a sequence of length n.
  • There are n − l + 1 such sub-vectors.
  • A straightforward application of JL would require O(nlk) time, since there are O(n) sub-vectors, each of length l, and the sketch is of size k; this is not practical.

16
Key Observation
  • We can compute ALL such sketches fast by using the fast Fourier transform.
  • The problem of computing sketches of all sub-vectors of length l simultaneously is exactly the problem of computing the convolution of two vectors t and u.
  • Given two vectors A[1..a] and B[1..b], their convolution is C[1..a+b] where
  • C_k = Σ_{1 ≤ i ≤ b} A_{k−i} B_i, for 2 ≤ k ≤ a + b
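The key point is that one sketch component of every window is a sliding dot product of the series with one random vector. A direct O(n·l) loop makes this concrete (the FFT-based convolution on the slide computes the same numbers in O(n log n); the function name is ours):

```python
def all_window_dots(t, u):
    """Dot product of u with every length-len(u) window of t.
    These are exactly the sketch components that one FFT-based
    convolution pass produces; a direct loop is shown for clarity."""
    l = len(u)
    return [sum(t[i + j] * u[j] for j in range(l))
            for i in range(len(t) - l + 1)]
```

For example, `all_window_dots([2, 1, 3, 1], [1, 1])` returns `[3, 4, 4]`: the sums of each adjacent pair.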

17
Example
(Example: for the series (2, 1, 3, 1), sliding dot products with the random vector (−0.97, −0.2) give (−0.4, −2.14, −1.57, −3.11, −0.97), boundary terms included; the middle entries are the first sketch component of each of the three length-2 windows. Likewise the random vector (0.11, 0.99) gives (1.98, 1.21, 3.08, 1.32, 0.11), whose middle entries are the second sketch component.)
18
Computing all sketches of width in a given range
  • Compute all sketches of all sub-vectors of length
    between l and u.
  • Brute force is cubic and prohibitive
  • Applying our observation would be quadratic and
    still prohibitive
  • Can we compute all sketches of width in a given
    range faster?

19
Approach
  • We will construct a pool of sketches that we will pre-compute and store. Following this preprocessing we will be able to determine the sketch of any sub-vector in O(1), fairly accurately.
  • Pick an L with l ≤ L ≤ u and construct all sketches of length L as before using convolutions; this is O(n log² n) in the worst case. Assume for now that L is a power of 2; we actually construct two such pools, S1 and S2.

20
Approach
  • Consider any sub-vector t[i, ..., i+j−1]; we have two cases
  • j is a power of 2 (= L): in this case we have it in the pool and we can look it up in O(1)
  • 2^r < j < 2^{r+1}: in this case we can compute the sketch as follows
  • S(t[i, ..., i+j−1]) = S1(t[i, ..., i+2^r−1]) + S2(t[i+j−2^r, ..., i+j−1]); both terms belong to the pool
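A small Python sketch of the two-pool scheme (names and pool layout are our own illustration, under the assumption that the two pools use independent randomness):

```python
import random

def gauss_vectors(length, k, seed):
    """k random Gaussian vectors of the given length (one pool's randomness)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(k)]

def pool(t, L, vecs):
    """Sketch of every length-L window of t against the pool's random vectors."""
    return [[sum(t[i + j] * u[j] for j in range(L)) for u in vecs]
            for i in range(len(t) - L + 1)]

def combined_sketch(S1, S2, i, j, L):
    """Sketch of t[i : i+j] for L < j < 2L: the pool-1 sketch of the
    length-L prefix plus the pool-2 sketch of the length-L suffix.
    The two windows overlap in the middle, which is why the distance
    guarantee weakens by a factor of 2 (the theorem on slide 22)."""
    return [a + b for a, b in zip(S1[i], S2[i + j - L])]
```

Both lookups index precomputed pools, so assembling the combined sketch is O(k), i.e., O(1) in n.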

21
Example
(Diagram: for U = (2, 1, 3, 1, 2, 3, 2, 1), the sketch S(U) of a window is assembled from a prefix sketch drawn from pool S1 and a suffix sketch drawn from pool S2.)
22
Why is this enough?
  • Theorem
  • For any given set L of vectors of length l and a fixed ε < 1/2, if k = 9 log|L| / ε², then for any pair of vectors u, w in L
  • (1 − ε)‖u − w‖² ≤ ‖S(u) − S(w)‖² ≤ 2(1 + ε)‖u − w‖², with probability 1/2.

23
Putting it all together
  • Given V and the range [l, u]
  • Relaxed period
  • Compute sketches in time O(n log(u−l) k log u)
  • Consider every T in [l, u] and compute C_0(V(T)) for every T.
  • Choosing k as described will guarantee that we are at most 2ε away from the true relaxed period
  • Average trends
  • Proceed similarly by evaluating C_i(V(T)) for every i

24
Implementation Issues
  • Computing Sketches
  • The pool of sketches can be computed with a
    single pass over the data set. We only need to
    keep a window worth of data across successive
    sketch computations.
  • Retrieving Sketches
  • Required sketches are retrieved by performing random I/O. However, across successive evaluations for various values of T, the required sketches are related. Random I/O can be limited through prefetching.

25
Experimental Evaluation
  • Real data from a service AT&T provides (utilization information).
  • Size varying from 16MB (approx. 1 month) to 256MB
    (approx 1 year) worth of data.
  • Evaluated
  • Time to construct sketches
  • Scalability of sketch construction
  • Efficiency of the proposed convolution based
    technique
  • Time to compute relaxed period and average trends
  • computing sketches from scratch and with
    pre-computed sketches
  • Using brute force approaches
  • Accuracy of sketching
  • Comparison with other time series reduction
    techniques

26
Time to construct sketches
27
Time to construct sketches
28
Time to construct without convolution
29
Time to construct sketches without convolution
30
Computing relaxed periods
31
Computing Relaxed Periods
32
Computing relaxed period with precomputed sketches
33
Computing Relaxed Periods Without Precomputed
Sketches
34
Brute Force Algorithms
35
Brute Force Algorithms
36
Computing Average Trend
37
Computing Average Trend
38
Accuracy of Sketches
39
Clustering Tabular Data
  • Many applications produce data in two dimensional
    array form.
  • Consider traditional telecommunication
    applications
  • Data are collected from a variety of collection
    stations across the country, recording call
    volume at some temporal granularity.
  • 2d call volume data set (spatial ordering of
    collection stations versus time) recording
    temporal call activity, approx. 18MB/day.

40
(No Transcript)
41
Clustering tabular data
  • Data elements to be clustered are rectangular
    data regions.
  • Clustering might reveal interesting similarities
    (in call volume and time) between geographical
    regions.
  • One month ≈ 600MB of data.
  • Sketch rectangular regions
  • extend sketches to 2-d
  • sketching with respect to any Lp norm, p in (0, 2]

42
(No Transcript)
43
Summary of results
  • Sketch construction scales nicely with respect to
    data volume and sketch size.
  • Convolution based sketch computation is very
    effective.
  • Sketch based approach is orders of magnitude
    better than brute force for computing relaxed
    periods and average trends.
  • Performance benefits increase for larger data
    sets.
  • If sketches are pre-computed, clustering can be
    performed in seconds even for very large data
    sets.
  • In practice sketches of low dimensionality
    provide great accuracy.
  • Compared with other dimensionality reduction
    techniques, the sketch based approach is more
    accurate and effective.

44
Conclusions
  • Scalability to large data volumes is a requirement of the data mining process.
  • Sketches effectively reduce data volume.
  • They preserve the data properties required by mining algorithms (e.g., various distances).
  • These are core techniques; various algorithms could benefit from them.
  • Very large performance benefits, at a small loss in accuracy.

45
Contact
  • koudas@research.att.com
  • www.research.att.com/koudas