1. PARSIMONY: Scalable Parallel Data Mining
- Alok N. Choudhary
- Northwestern University
- Acknowledgments: Harsha Nagesh (Bell Labs) and Sanjay Goil (Sun)
2. Outline
- Overview of PARSIMONY
- MAFIA (subspace clustering)
- pMAFIA (parallel subspace clustering)
- Performance results
- PARSIMONY
  - Multidimensional data analysis
  - Parallel classification
- Summary
3. Overview of the Knowledge Discovery Process
4. PARSIMONY Overview
5. MAFIA: Subspace Clustering for High-Dimensional Data Sets
- Clustering and subspace clustering
- Base MAFIA Algorithm
- pMAFIA (parallelization)
- Performance Results
6. Clustering
- Discovery of interesting patterns in large multi-dimensional data sets.
- What is the average credit for a particular income group? (Financial Services)
- Areas with maximum collect calls (Telecommunications)
- Categorize stocks based on their movement (Investment Banking)
- Target mailing (Marketing)
- Analysis of satellite data, detection of clusters in geographic information systems, categorization of web documents, etc.
7. Clustering Multi-Dimensional Data Sets
Determine the range of attributes in each dimension of the cluster(s).
8. Issues to be Addressed
- Computational optimizations of the basic algorithm
- Scalability with database size (out-of-core data sets)
- Scalability with the dimensionality of the data
- Efficient parallelization
- Recognition of arbitrarily shaped clusters
9. Related Work
- Partition-based clustering
  - User-specified k representative points are taken as cluster centers and points are assigned to them
  - k-means, k-medoids, CLARANS (VLDB 94), BIRCH (SIGMOD 96), ...
  - Treats clustering as a partitioning of the points
- Hierarchical clustering
  - Each point starts as a cluster; similar clusters are merged together gradually
  - CURE (SIGMOD 98) (uses sampling)
- Categorical clustering
  - Clustering of categorical data, e.g. automobile sales data: color, year, model, price, etc.
  - Best suited for non-numerical data
  - CACTUS (KDD 99), STIRR (VLDB 98)
10. Related Work
- Density- and grid-based clustering
  - Clusters are regions of higher density than their surroundings
  - WaveCluster (VLDB 98), DBSCAN, CLIQUE (SIGMOD 98)
  - The number of subspaces is exponential in the data dimensionality
  - The multidimensional space is divided into a grid and the histogram of each hyper-rectangle is computed; grid regions with a significant histogram value are cluster regions
  - Post-processing is done to grow the connected cluster regions
  - A fine grid size results in an explosion in the number of hyper-rectangles, while coarser grids fail to detect clusters
  - Choosing the correct grid size is critical!
11. Subspace Clustering
- CLIQUE (SIGMOD 98): user-specified grid size and threshold for each dimension
  - Finer grids mean enormous computation; coarser grids mean loss of quality
  - Noise is a further concern with finer grids
  - A bottom-up algorithm that combines dense regions in different subspaces
  - A hyper-rectangle in a multidimensional space is dense if it contains more points than a user-specified threshold percentage of the total number of points
- PROCLUS (SIGMOD 99): a modification of the k-means algorithm
  - Requires the user to supply the number of clusters and the average cluster dimensionality, which is unrealistic for real-world data sets
  - Uses cluster centers and the points near them to compute statistics; these determine the relevant dimensions of the clusters!
- ENCLUS (KDD 99): identifies the dimensions of a cluster, followed by the application of any clustering algorithm
  - Entropy-based clustering: uses entropy as a measure of correlation between dimensions; requires entropy thresholds to be set
12. Subspace Clustering
- Observation: if a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k-1)-dimensional projection of the space.
- Algorithm (growing clusters): candidate dense units in any k dimensions are obtained by merging dense units in (k-1) dimensions which share any (k-2) dimensions (see the sketch after this list).
- Example: ({a1, b7, d9}, {b7, c8, d9}) → {a1, b7, c8, d9}
- Candidate dense units are populated by a pass over the data set, and the dense units in each dimension are identified.
- The dense units found are combined to form new candidate dense units.
- The algorithm terminates when no more candidate dense units are found.
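The merge rule above is easy to express directly. A minimal sketch, assuming each dense unit is stored as a set of (dimension, bin) pairs (an illustrative representation, not the paper's actual data structure):

```python
from itertools import combinations

def build_candidate_dense_units(dense_units):
    """Merge (k-1)-dimensional dense units that share any (k-2) dimensions."""
    candidates = set()
    for u, v in combinations(dense_units, 2):
        merged = u | v                         # union of (dimension, bin) pairs
        dims = {d for d, _ in merged}
        # Sharing (k-2) pairs leaves exactly one extra pair in the union,
        # and it must lie in a new dimension to give a k-dimensional CDU.
        if len(merged) == len(u) + 1 and len(dims) == len(merged):
            candidates.add(frozenset(merged))
    return candidates

# The example from the slide: ({a1, b7, d9}, {b7, c8, d9}) -> {a1, b7, c8, d9},
# with letters standing for dimensions and numbers for bin indices.
units = [frozenset({("a", 1), ("b", 7), ("d", 9)}),
         frozenset({("b", 7), ("c", 8), ("d", 9)})]
print(build_candidate_dense_units(units))
```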
13. Adaptive Grids (reducing computation in practice)
- Automatic grid fitting based on the data distribution
- MAFIA: Merging of Adaptive Finite Intervals!
- Optimal bins in each dimension lead to very few units in the grid (candidate dense units)
- (Figure: grids for (a) CLIQUE and (b) MAFIA)
14. Base MAFIA Algorithm
- Divide each dimension into very fine regions.
- Compute the histogram in these regions along every dimension.
- Set the value of a sliding window to the maximum histogram value within the window.
- Adjacent units which have nearly the same histogram values are merged together to form larger bins.
- The threshold of each bin formed is computed automatically.
- A bin whose histogram value is much greater (by a factor of 2-3) than that of an equi-distribution of the data is DENSE (a code sketch of this binning follows).
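A minimal one-dimensional sketch of this adaptive binning, with assumed parameter values (number of fine regions, window width, merge tolerance, density factor), since the slide only fixes the density factor at roughly 2-3:

```python
import numpy as np

def adaptive_bins_1d(values, n_fine=1000, window=5, merge_tol=0.20, dense_factor=2.5):
    """Sketch of adaptive binning along one dimension (parameters are assumptions)."""
    hist, edges = np.histogram(values, bins=n_fine)

    # Sliding window: each window is represented by its maximum histogram value.
    win_max = [int(hist[i:i + window].max()) for i in range(0, n_fine, window)]

    # Merge adjacent windows whose representative values are nearly equal.
    groups = [[0, 1]]                                  # [first_window, last_window)
    for i in range(1, len(win_max)):
        ref = win_max[groups[-1][0]]
        if abs(win_max[i] - ref) <= merge_tol * max(ref, 1):
            groups[-1][1] = i + 1
        else:
            groups.append([i, i + 1])

    # A merged bin is dense if it holds far more points than a uniform spread would.
    bins = []
    for first, last in groups:
        lo, hi = edges[first * window], edges[min(last * window, n_fine)]
        count = int(hist[first * window:last * window].sum())
        expected = len(values) * (last - first) / len(win_max)
        bins.append((lo, hi, count > dense_factor * expected))
    return bins
```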
16. MAFIA Algorithm (merging dense units)
- Algorithm: candidate dense units in any k dimensions are obtained by merging dense units in (k-1) dimensions which share any (k-2) dimensions.
- Example: ({a1, c7, b8}, {c7, b8, d9}) → {a1, c7, b8, d9}
17. CLIQUE vs. MAFIA
- CLIQUE: CDUs in any dimension k are formed by combining dense units of dimension (k-1) which share the first (k-2) dimensions.
- MAFIA: CDUs in any dimension k are formed by combining dense units of dimension (k-1) which share any (k-2) dimensions.
- (Diagram) CLIQUE: Data Set → Fixed Grids → "first (k-2)" algorithm → Huge Search Space → Huge CDU Set → non-cluster dimensions reported.
- (Diagram) MAFIA: Data Set → Adaptive Grids (dimensions aware of the data distribution) → "any (k-2)" algorithm → Reduced Search Space → much reduced CDU Set → correct cluster dimensions reported.
18. Parallel MAFIA
- pMAFIA: parallel subspace clustering
- A grid- and density-based clustering algorithm, scalable in data size and number of dimensions
- Parallelization provides speedup for the subspace clustering algorithm.
19. pMAFIA Algorithm
- Each processor reads part of the data in a data-parallel fashion and constructs a histogram in every dimension (a sketch of this step follows the pseudocode).
  // Data is read in chunks (out of core).
- Reduce communication to obtain the global histograms. All processors build the adaptive grids from the histograms.
  // Each bin formed is a candidate dense unit.
- Current dimension k = 1
- while (new dense units are being found)
  - if (k > 1): Build Candidate Dense Units()
  - Populate the candidate dense units in a data-parallel fashion, in chunks (out of core) of B records.
  - Reduce communication to obtain the global CDU populations.
  - Identify Dense Units()
  - Build Dense Unit Data Structures()  // for the next higher dimension
  - k = k + 1
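A sketch of the first, data-parallel step (per-dimension histograms followed by a reduction), assuming an MPI implementation via mpi4py; the chunk iterator, bin edges, and equal bin counts per dimension are illustrative assumptions:

```python
import numpy as np
from mpi4py import MPI

def global_histograms(local_chunks, bin_edges):
    """Each process bins its own out-of-core chunks; an Allreduce then yields
    the global per-dimension histograms used to build the adaptive grids."""
    comm = MPI.COMM_WORLD
    n_dims, n_bins = len(bin_edges), len(bin_edges[0]) - 1   # same bin count per dimension assumed
    local = np.zeros((n_dims, n_bins), dtype=np.int64)

    for chunk in local_chunks:                # one chunk of B records at a time
        for d in range(n_dims):
            counts, _ = np.histogram(chunk[:, d], bins=bin_edges[d])
            local[d] += counts

    global_hist = np.empty_like(local)
    comm.Allreduce(local, global_hist, op=MPI.SUM)
    return global_hist
```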
20. Build Candidate Dense Units
21. Build Candidate Dense Units
- CLIQUE: CDUs in any dimension k are formed by combining dense units of dimension (k-1) such that they share the first (k-2) dimensions.
- pMAFIA: CDUs in any dimension k are formed by combining dense units of dimension (k-1) such that they share any (k-2) dimensions.
- For data dimension d = 10 and current dimension k = 5, CLIQUE does not explore 93.3% of the possible combinations. In general, ...
- This problem is even more pronounced in data sets whose clusters have a very high subspace coverage.
22. Build Candidate Dense Units
- Each dense unit is compared with every other dense unit to form CDUs, resulting in an O(Ndu^2) algorithm (Ndu = number of dense units).
- For large values of Ndu, the CDUs are built in parallel.
- Processors 0, ..., (p-1) work in parallel on parts of the total Ndu dense units. Processor i compares the dense units between N_i and N_(i+1) with all the other dense units; for optimal task partitioning we have ... (see the load-balancing sketch below).
- Identical CDUs generated during the process need to be discarded.
- Each generated CDU is compared with every other CDU to identify duplicates, resulting in an O(Ncdu^2) algorithm.
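One way to realize the optimal task partitioning mentioned above is to balance the number of unit pairs handled per processor; the rule below is an illustrative stand-in for the formula on the slide:

```python
def balanced_ranges(n_units, n_procs):
    """Split the O(Ndu^2) pairwise comparisons over processors.

    Processor i gets dense units [start_i, end_i) and compares each of them
    against all later units; boundaries are chosen so every processor handles
    roughly the same number of pairs.
    """
    total_pairs = n_units * (n_units - 1) // 2
    target = total_pairs / n_procs
    ranges, start = [], 0
    for p in range(n_procs):
        end, pairs = start, 0
        while end < n_units and (pairs < target or p == n_procs - 1):
            pairs += n_units - end - 1        # unit `end` pairs with all later units
            end += 1
        ranges.append((start, end))
        start = end
    return ranges

# Example: 1000 dense units on 4 processors; the first range is the narrowest
# because the earliest units have the most later units to compare against.
print(balanced_ranges(1000, 4))
```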
23. Task Parallelism
- Identify Dense Units
  - The CDUs generated are populated in a data-parallel fashion.
  - If the histogram count of a CDU is greater than the threshold of every bin which forms the CDU in its respective dimension, the CDU is a dense unit (see the sketch after this list).
  - If Ncdu is large, each processor processes Ncdu/p candidate dense units.
- Build Dense Unit Data Structures
  - If Ndu, the number of dense units, is large, the dense unit data structures are constructed in parallel.
  - Each dense unit is completely represented by a set of dimensions and the corresponding bin indices in those dimensions.
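The dense-unit test from the first group of bullets becomes a one-line check once every adaptive bin carries its automatically computed threshold; the (dimension, bin) keyed mapping here is an assumed representation:

```python
def is_dense(cdu_count, cdu_bins, bin_threshold):
    """A CDU is dense if its population exceeds the threshold of every bin
    that composes it. `cdu_bins` is a list of (dimension, bin_index) pairs and
    `bin_threshold` maps (dimension, bin_index) -> threshold from the
    adaptive-grid step; both representations are assumptions for this sketch."""
    return all(cdu_count > bin_threshold[(d, b)] for d, b in cdu_bins)
```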
24. Parallelization (Building CDUs)
(Figure: total work done by processor P_i.)
25. pMAFIA Analysis
- Data parallelism in populating the CDUs in every dimension is effective for massive data sets.
- The gains of task parallelism are realized when the data contains a large number of dense units (clusters).
- A k-dimensional dense unit is allocated just 2k bytes of memory: k bytes for the dimensions and k bytes for the bin indices (a packing sketch follows this list).
- Data structures in the form of linear arrays of bytes mean very small message buffers are communicated, a space optimization.
- Although the bottom-up algorithm is exponential in the data dimension, for low subspace coverage the use of adaptive grids and the parallel formulation gives very promising results.
- With k the dimension of the highest-dimensional dense unit, we explore all possible subspaces of these k dimensions, i.e. O(c^k) work.
- The k passes over the data set cost O((N / (p·B)) · k · T_io), where N is the total number of records, p the number of processors, B the number of records per chunk, and T_io the I/O access time for a block of B records.
- Communication overhead is O(T_comm · S · p · k), where T_comm is a constant for communication and S is the size of the message exchanged.
- Total cost: O(c^k + (N / (p·B)) · k · T_io + T_comm · S · p · k).
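The 2k-byte dense-unit representation in the list above can be made concrete with a small packing helper; the byte layout (dimension ids first, bin indices second) is an assumption, since the slide only states the total size:

```python
import numpy as np

def pack_dense_unit(dims, bins):
    """Pack a k-dimensional dense unit into 2k bytes: k one-byte dimension ids
    followed by k one-byte bin indices (field order assumed). Valid while the
    data dimensionality and the number of adaptive bins both stay below 256."""
    assert len(dims) == len(bins)
    return np.array(list(dims) + list(bins), dtype=np.uint8).tobytes()

# Example: the 4-dimensional dense unit {dim 0 bin 1, dim 1 bin 7, dim 2 bin 8, dim 3 bin 9}
packed = pack_dense_unit([0, 1, 2, 3], [1, 7, 8, 9])
assert len(packed) == 8          # 2k bytes for k = 4
```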
26. pMAFIA Outline
- Clustering and subspace clustering
- Base MAFIA Algorithm
- pMAFIA (parallelization)
- Performance Results
- Adaptivity performance
- Scalability with data set size.
- Scalability with data dimensionality.
- Scalability with cluster dimensionality.
- Some Real world data sets.
27. Quality of Results
- (a) CLIQUE
  - Loss of quality: reports pqrs as the cluster!
  - Requires a complicated post-processing step.
  - Bin selection and threshold fixing are a non-trivial problem; the results cannot be validated.
- (b) MAFIA: almost exact cluster boundaries are recognized
  - No post-processing step is required.
28. CLIQUE vs. MAFIA
- 400,000 records in 10 dimensions; 2 clusters in 2 four-dimensional subspaces.
29. Advantage of Adaptive Grids
- 300,000 records in a 15-dimensional space with 1 cluster of 5 dimensions.
- A speedup of 80 was obtained over CLIQUE.
- CLIQUE failed to produce results, even with our modified CDU generation algorithm, within 2 hours on 16 processors.
- This relatively small data set was mined in 32 seconds on 1 processor.
30. Scalability with Data Set Size
- 20-dimensional data with 5 clusters in 5 different subspaces.
- Data sets of up to 11.8 million records.
- Clusters detected in just about 3 minutes on 16 processors!
- Runtime is almost linear in the data set size (because most of the time is spent scanning the data).
31. Parallelization (on IBM SP2)
- 30-dimensional data with 8.3 million records; 5 clusters, each in a 6-dimensional subspace.
- Near-linear speedups.
- Negligible communication overheads (<1%).
32. Data Dimensionality
- 250,000 records; 3 clusters in different 5-dimensional subspaces.
- Near-linear behavior with data dimensionality: the algorithm depends on the maximum number of dimensions in a cluster, not on the data dimensionality.
33. Cluster Dimensionality
- 50-dimensional data with 1 cluster and 650,000 records; the cluster dimensionality varies from 3 to 10.
- Behavior is in line with the order of the algorithm: runtime increases with the subspace coverage of the cluster.
34. Scalability on Movie Data
- 72,916 users rated 1,628 movies over 18 months: 2.8 million ratings.
- 4-dimensional data: user-id, movie-id, weight, score.
- Discovered seven interesting 2-dimensional clusters!
35. Other Data Sets
- One-day-ahead prediction of the DAX (German stock exchange)
  - The DAX prediction data set was based on 12 input time series such as stock indices, bond indices, gold prices, etc.
  - 22 dimensions with 2,757 records; major gains from task parallelism.
  - Mined the clusters in 8.16 seconds on 8 processors.
  - Unique clusters were discovered in 3-, 4-, 5- and 6-dimensional subspaces.
- Ionosphere data (UCI repository)
  - Radar data collected in Goose Bay, Labrador.
  - 34-dimensional data, 351 records.
  - Discovered unique clusters only in 3- and 4-dimensional subspaces.
36. Performance Results
- Implemented on a distributed-memory machine (IBM SP2) and on a network of workstations.
- Massive data sets with more than 10 million records in very high-dimensional spaces (>30 dimensions).
- A two-orders-of-magnitude improvement over existing techniques.
- Parallelization adds further scalability to the algorithm.
- Near-linear speedups achieved.
- Negligible communication overheads.
- Performance results on both synthetic and real data sets.
37. Summary and Conclusions for pMAFIA
- MAFIA: an unsupervised subspace clustering algorithm.
- Introduced the adaptive grids formulation.
- The first parallel subspace clustering algorithm for massive data sets.
- Incorporates both task and data parallelism.
- pMAFIA: a scalable, parallel implementation in both data size and number of dimensions.
38. PARSIMONY Overview
39. Sparse Data Structures and Representations
41. Using OLAP Framework for ...