Applications of Sketch Based Techniques to Data Mining Problems - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Applications of Sketch Based Techniques to Data
Mining Problems
  • Nikos Koudas
  • AT&T Labs Research
  • joint work with
  • G. Cormode, P. Indyk, S. Muthukrishnan

2
Taming Massive Data Sets
  • Requirements of data mining algorithms
  • operate on very large data sets
  • scalability
  • incremental operation
  • Most data mining algorithms have super-linear complexity
  • Deploying mining algorithms directly on very large data sets will most likely result in poor performance

3
Sketching
  • Reduce the dimensionality of the data in a
    systematic way, constructing short data
    summaries
  • Effectively reduce data volume.
  • Deploy mining algorithms on the reduced data
    volume.
  • Main issue: preserve the data properties the mining algorithm depends on (e.g., distances) in the reduced data volume.

4
Applications
  • Clustering time series data
  • Clustering tabular data sets

5
Introduction
  • Time series data abound in many applications.
  • Financial, performance data, geographical and
    meteorological information, solar and space data
    etc.
  • Various works deal with management and analysis
    aspects of time series data
  • Indexing, storage and retrieval
  • Analysis and mining (forecasting, outlier and
    deviation detection, etc)
  • Active research area in many research communities.

6
Representative Trends
Relaxed period
Average trend
7
Usage Examples
  • Mining/Prediction
  • Identifying periodic trends
  • Uncovering unexpected periodic trends
  • Performance management
  • Networking (routing, traffic engineering,
    bandwidth allocation)
  • System Tuning
  • Financial databases
  • Cyclic behavior

8
Definitions
  • Given a time series V of length n and an integer T, define V_i(T) = (v_{iT+1}, v_{iT+2}, ..., v_{iT+T}), for 0 ≤ i ≤ n/T − 1
  • Define C_i(V(T)) = Σ_{0 ≤ j ≤ n/T−1} D(V_i(T), V_j(T))
  • Thus
  • Relaxed period: the T in [l, u] minimizing C_0(V(T))
  • Average trend: the pair (i, T), T in [l, u], minimizing C_i(V(T))
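As a concrete reading of these definitions, here is a brute-force Python sketch (function names are ours, not the paper's; D is taken to be Euclidean distance) that scans every candidate T in [l, u]:

```python
import math

def subvectors(v, T):
    """Split the series v into consecutive non-overlapping windows of length T."""
    n = len(v) // T
    return [v[i * T:(i + 1) * T] for i in range(n)]

def dist(a, b):
    """Euclidean distance D between two equal-length windows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relaxed_period(v, lo, hi):
    """The T in [lo, hi] whose first window is closest, in total distance,
    to all other windows (the C_0 criterion)."""
    best_T, best_cost = None, float("inf")
    for T in range(lo, hi + 1):
        w = subvectors(v, T)
        cost = sum(dist(w[0], wj) for wj in w[1:])
        if cost < best_cost:
            best_T, best_cost = T, cost
    return best_T

def average_trend(v, lo, hi):
    """The pair (T, i) minimizing C_i(V(T)) over T in [lo, hi]."""
    best_cost, best = float("inf"), None
    for T in range(lo, hi + 1):
        w = subvectors(v, T)
        for i in range(len(w)):
            cost = sum(dist(w[i], wj) for wj in w)
            if cost < best_cost:
                best_cost, best = cost, (T, i)
    return best
```

On a series that exactly repeats a length-4 pattern, such as [2, 1, 3, 1] * 8, `relaxed_period(series, 2, 6)` returns 4. These loops are the quadratic and cubic brute-force algorithms the deck discusses next.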

9
Definitions
(Diagram: a length-n series is split into n/T windows for each T in [l, u]; for the relaxed period the first window is compared against all the others, while for the average trend each of the n/T windows is a candidate.)
10
Algorithms
  • There exists a quadratic algorithm for
    identifying relaxed periods.
  • There exists a cubic algorithm for identifying
    average trends.
  • Simply evaluate the clustering for each T in [l, u]; it takes linear time to evaluate the relaxed period for each T and quadratic time for average trends.

11
Algorithms
  • But can we really run these?
  • Consider the length of sessions in an AT&T service, recorded every second for a year: more than 31M values, approximately 256MB.
  • Consider running the previous algorithms for, say, 10 years, or on a finer time scale.
  • Both brute-force algorithms are impractical.
  • Can we run faster on large datasets, too large to fit in memory?

12
Our Approach
  • Identify representative trends faster but provide
    approximate answers
  • General approach (expresses various notions of
    representative trends)
  • Provides guaranteed approximation performance,
    with high probability
  • We present our approach in the following steps
  • Define the sketch of a vector
  • Algorithms for finding the sketch of all
    sub-vectors of width T
  • Determine the sketch of all sub-vectors of width
    in a given range.

13
Sketch of a vector
  • Given a vector t of length l, we generate its sketch S(t) as follows
  • Pick k random vectors u^(1), ..., u^(k) of length l, choosing each component from the normal distribution N(0,1) (each vector normalized to unit length).
  • Define
  • S(t)_i = t · u^(i) = Σ_j t_j u^(i)_j
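A minimal Python rendering of this construction (illustrative only; names are ours), drawing each random vector from N(0,1) and normalizing it as the slide describes:

```python
import math
import random

def sketch(t, k, seed=0):
    """JL-style sketch: component i is the dot product of t with the i-th
    random Gaussian vector, normalized to unit length."""
    rng = random.Random(seed)
    out = []
    for _ in range(k):
        u = [rng.gauss(0.0, 1.0) for _ in t]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]          # normalize to unit length
        out.append(sum(ti * ui for ti, ui in zip(t, u)))
    return out
```

Because each component is a dot product, sketching is linear: with the same random vectors (same seed), S(a) − S(b) = S(a − b), which is why distances between sketches track distances between the original vectors.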

14
Sketch Properties
  • Theorem
  • For any given set L of vectors of length l and a fixed ε < 1/2, if k = 9 log|L| / ε², then for any pair of vectors u, w we have
  • (1 − ε)‖u − w‖² ≤ ‖S(u) − S(w)‖² ≤ (1 + ε)‖u − w‖², with probability 1/2.
  • By increasing k we can increase the probability of success.
  • This is the Johnson-Lindenstrauss (JL) lemma.

15
Fixed window sketches
  • Compute all sketches of sub-vectors of length l
    in a sequence of length n.
  • There are n − l + 1 such sub-vectors.
  • A straightforward application of JL would require O(nlk) time, since there are O(n) sub-vectors, each of length l, and the sketch is of size k; this is not practical.

16
Key Observation
  • We can compute ALL such sketches fast by using the fast Fourier transform.
  • The problem of computing sketches of all sub-vectors of length l simultaneously is exactly the problem of computing the convolution of two vectors t and u.
  • Given two vectors A[1..a] and B[1..b], their convolution is C[1..a+b] where
  • C_k = Σ_{1 ≤ i ≤ b} A_{k−i} B_i, for 2 ≤ k ≤ a + b
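The key point is that one sketch component of every window is a sliding dot product of the series with one random vector. A direct O(n·l) loop makes this concrete (the FFT-based convolution on the slide computes the same numbers in O(n log n); the function name is ours):

```python
def all_window_dots(t, u):
    """Dot product of u with every length-len(u) window of t.
    These are exactly the sketch components that one FFT-based
    convolution pass produces; a direct loop is shown for clarity."""
    l = len(u)
    return [sum(t[i + j] * u[j] for j in range(l))
            for i in range(len(t) - l + 1)]
```

For example, `all_window_dots([2, 1, 3, 1], [1, 1])` returns `[3, 4, 4]`: the sums of each adjacent pair.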

17
Example
(Example: for the series (2, 1, 3, 1), sliding dot products with the random vector (−0.97, −0.2) give (−0.4, −2.14, −1.57, −3.11, −0.97), boundary terms included; the middle entries are the first sketch component of each of the three length-2 windows. Likewise the random vector (0.11, 0.99) gives (1.98, 1.21, 3.08, 1.32, 0.11), whose middle entries are the second sketch component.)
18
Computing all sketches of width in a given range
  • Compute all sketches of all sub-vectors of length
    between l and u.
  • Brute force is cubic and prohibitive
  • Applying our observation would be quadratic and
    still prohibitive
  • Can we compute all sketches of width in a given
    range faster?

19
Approach
  • We will construct a pool of sketches that we will pre-compute and store. Following this preprocessing we will be able to determine the sketch of any sub-vector in O(1), fairly accurately.
  • Pick an L with l ≤ L ≤ u and construct all sketches of length L as before using convolutions; this is O(n log² n) in the worst case. Assume for now that L is a power of 2; we actually construct two such pools, S1 and S2.

20
Approach
  • Consider any sub-vector t[i, ..., i+j−1]; we have two cases
  • j is a power of 2 (= L): in this case we have it in the pool and we can look it up in O(1)
  • 2^r < j < 2^{r+1}: in this case we can compute the sketch as follows
  • S(t[i, ..., i+j−1]) = S1(t[i, ..., i+2^r−1]) + S2(t[i+j−2^r, ..., i+j−1]); both terms belong to the pool
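A small Python sketch of the two-pool scheme (names and pool layout are our own illustration, under the assumption that the two pools use independent randomness):

```python
import random

def gauss_vectors(length, k, seed):
    """k random Gaussian vectors of the given length (one pool's randomness)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(k)]

def pool(t, L, vecs):
    """Sketch of every length-L window of t against the pool's random vectors."""
    return [[sum(t[i + j] * u[j] for j in range(L)) for u in vecs]
            for i in range(len(t) - L + 1)]

def combined_sketch(S1, S2, i, j, L):
    """Sketch of t[i : i+j] for L < j < 2L: the pool-1 sketch of the
    length-L prefix plus the pool-2 sketch of the length-L suffix.
    The two windows overlap in the middle, which is why the distance
    guarantee weakens by a factor of 2 (the theorem on slide 22)."""
    return [a + b for a, b in zip(S1[i], S2[i + j - L])]
```

Both lookups index precomputed pools, so assembling the combined sketch is O(k), i.e., O(1) in n.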

21
Example
(Diagram: for U = (2, 1, 3, 1, 2, 3, 2, 1), the sketch S(U) of a window is assembled from a prefix sketch drawn from pool S1 and a suffix sketch drawn from pool S2.)
22
Why is this enough?
  • Theorem
  • For any given set L of vectors of length l and a fixed ε < 1/2, if k = 9 log|L| / ε², then for any pair of vectors u, w in L
  • (1 − ε)‖u − w‖² ≤ ‖S(u) − S(w)‖² ≤ 2(1 + ε)‖u − w‖², with probability 1/2.

23
Putting it all together
  • Given V and the range [l, u]
  • Relaxed period
  • Compute sketches in time O(n log(u−l) k log u)
  • Consider every T in [l, u] and compute C_0(V(T)) for every T.
  • Choosing k as described will guarantee that we are at most 2ε away from the true relaxed period
  • Average trends
  • Proceed similarly by evaluating C_i(V(T)) for every i

24
Implementation Issues
  • Computing Sketches
  • The pool of sketches can be computed with a
    single pass over the data set. We only need to
    keep a window worth of data across successive
    sketch computations.
  • Retrieving Sketches
  • Required sketches are retrieved by performing random I/O. However, across successive evaluations for various values of T, the required sketches are related. Random I/O can be limited through prefetching.

25
Experimental Evaluation
  • Real data from a service AT&T provides (utilization information).
  • Size varying from 16MB (approx. 1 month) to 256MB
    (approx 1 year) worth of data.
  • Evaluated
  • Time to construct sketches
  • Scalability of sketch construction
  • Efficiency of the proposed convolution based
    technique
  • Time to compute relaxed period and average trends
  • computing sketches from scratch and with
    pre-computed sketches
  • Using brute force approaches
  • Accuracy of sketching
  • Comparison with other time series reduction
    techniques

26
Time to construct sketches
27
Time to construct sketches
28
Time to construct without convolution
29
Time to construct sketches without convolution
30
Computing relaxed periods
31
Computing Relaxed Periods
32
Computing relaxed period with precomputed sketches
33
Computing Relaxed Periods Without Precomputed
Sketches
34
Brute Force Algorithms
35
Brute Force Algorithms
36
Computing Average Trend
37
Computing Average Trend
38
Accuracy of Sketches
39
Clustering Tabular Data
  • Many applications produce data in two dimensional
    array form.
  • Consider traditional telecommunication
    applications
  • Data are collected from a variety of collection
    stations across the country, recording call
    volume at some temporal granularity.
  • 2d call volume data set (spatial ordering of
    collection stations versus time) recording
    temporal call activity, approx. 18MB/day.

40
(No Transcript)
41
Clustering tabular data
  • Data elements to be clustered are rectangular
    data regions.
  • Clustering might reveal interesting similarities
    (in call volume and time) between geographical
    regions.
  • One month ≈ 600MB of data.
  • Sketch rectangular regions
  • extend sketches to 2-d
  • sketching with respect to any Lp norm, p in (0, 2]

42
(No Transcript)
43
Summary of results
  • Sketch construction scales nicely with respect to
    data volume and sketch size.
  • Convolution based sketch computation is very
    effective.
  • Sketch based approach is orders of magnitude
    better than brute force for computing relaxed
    periods and average trends.
  • Performance benefits increase for larger data
    sets.
  • If sketches are pre-computed, clustering can be
    performed in seconds even for very large data
    sets.
  • In practice sketches of low dimensionality
    provide great accuracy.
  • Compared with other dimensionality reduction
    techniques, the sketch based approach is more
    accurate and effective.

44
Conclusions
  • Scalability to large data volumes is a requirement of the data mining process.
  • Sketches effectively reduce data volume.
  • They preserve the data properties required by mining algorithms (e.g., various distances).
  • These are core techniques; various algorithms could benefit from them.
  • Very large performance benefits, at a small loss in accuracy.

45
Contact
  • koudas@research.att.com
  • www.research.att.com/koudas