1. PARSIMONY: Scalable Parallel Data Mining
- Alok N. Choudhary
- Northwestern University
- Acknowledgments: Harsha Nagesh (Bell Labs) and Sanjay Goil (Sun)
2. Outline
- Overview of PARSIMONY
- MAFIA (subspace clustering)
- pMAFIA (parallel subspace clustering)
- Performance results
- PARSIMONY
  - Multidimensional data analysis
  - Parallel classification
- Summary
3. Overview of the Knowledge Discovery Process
4. PARSIMONY Overview
5. MAFIA: Subspace Clustering for High-Dimensional Data Sets
- Clustering and subspace clustering
- Base MAFIA Algorithm
- pMAFIA (parallelization)
- Performance Results
6. Clustering
- Discovery of interesting patterns in large multi-dimensional data sets.
- What is the average credit for a particular income group? (Financial Services)
- Areas with maximum collect calls (Telecommunications)
- Categorize stocks based on their movement (Investment Banking)
- Target mailing (Marketing)
- Analysis of satellite data, detection of clusters in geographic information systems, categorization of web documents, etc.
7. Clustering Multi-Dimensional Data Sets
Determine the range of attributes in each dimension of the cluster(s).
8. Issues to be Addressed
- Computational optimizations of the basic algorithm
- Scalability with database size (out-of-core data sets)
- Scalability with the dimensionality of the data
- Efficient parallelization
- Recognition of arbitrarily shaped clusters
9. Related Work
- Partition-based clustering
  - User-specified k representative points are taken as cluster centers and points are assigned to them
  - k-means, k-medoids, CLARANS (VLDB 94), BIRCH (SIGMOD 96), ...
  - Treats clustering as a partitioning of the points
- Hierarchical clustering
  - Each point starts as a cluster; similar clusters are merged together gradually
  - CURE (SIGMOD 98) (uses sampling)
- Categorical clustering
  - Clustering of categorical data, e.g. automobile sales data: color, year, model, price, etc.
  - Best suited for non-numerical data
  - CACTUS (KDD 99), STIRR (VLDB 98)
10. Related Work
- Density- and grid-based clustering
  - Clusters are regions of higher density than their surroundings
  - WaveCluster (VLDB 98), DBSCAN, CLIQUE (SIGMOD 98)
  - The number of subspaces is exponential in the data dimensionality
  - The multidimensional space is divided into a grid and the histogram of each hyper-rectangle is computed; grid regions with a significant histogram value are cluster regions
  - Post-processing is done to grow the connected cluster regions
  - A fine grid size results in an explosion in the number of hyper-rectangles, while coarser grids fail to detect clusters
  - Choosing the correct grid size is critical!
11. Subspace Clustering
- CLIQUE (SIGMOD 98): user-specified grid size and threshold for each dimension
  - Finer grids mean enormous computation; coarser grids mean loss of quality
  - Noise is a further concern with finer grids
  - A bottom-up algorithm that combines dense regions in different subspaces
  - A hyper-rectangle in a multidimensional space is dense if it contains more points than a user-specified threshold percentage of the total number of points
- PROCLUS (SIGMOD 99): a modification of the k-means algorithm
  - Requires the user to supply the number of clusters and the average cluster dimensionality, which is unrealistic for real-world data sets
  - Uses cluster centers and the points near them to compute statistics; these determine the relevant dimensions of the clusters!
- ENCLUS (KDD 99): identifies the dimensions of a cluster, followed by the application of any clustering algorithm
  - Entropy-based clustering: uses entropy as a measure of correlation between dimensions; requires entropy thresholds to be set
12. Subspace Clustering
- Observation: if a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in any (k-1)-dimensional projection of the space.
- Algorithm (growing clusters): candidate dense units in any k dimensions are obtained by merging dense units in (k-1) dimensions which share any (k-2) dimensions (see the sketch after this list).
- Example: ({a1, b7, d9}, {b7, c8, d9}) → {a1, b7, c8, d9}
- Candidate dense units are populated by a pass over the data set, and the dense units in each dimension are identified.
- The dense units found are combined to form new candidate dense units.
- The algorithm terminates when no more candidate dense units are found.
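The merge rule above is easy to express directly. A minimal sketch, assuming each dense unit is stored as a set of (dimension, bin) pairs (an illustrative representation, not the paper's actual data structure):

```python
from itertools import combinations

def build_candidate_dense_units(dense_units):
    """Merge (k-1)-dimensional dense units that share any (k-2) dimensions."""
    candidates = set()
    for u, v in combinations(dense_units, 2):
        merged = u | v                         # union of (dimension, bin) pairs
        dims = {d for d, _ in merged}
        # Sharing (k-2) pairs leaves exactly one extra pair in the union,
        # and it must lie in a new dimension to give a k-dimensional CDU.
        if len(merged) == len(u) + 1 and len(dims) == len(merged):
            candidates.add(frozenset(merged))
    return candidates

# The example from the slide: ({a1, b7, d9}, {b7, c8, d9}) -> {a1, b7, c8, d9},
# with letters standing for dimensions and numbers for bin indices.
units = [frozenset({("a", 1), ("b", 7), ("d", 9)}),
         frozenset({("b", 7), ("c", 8), ("d", 9)})]
print(build_candidate_dense_units(units))
```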
13. Adaptive Grids (reducing computation in practice)
- Automatic grid fitting based on the data distribution
- MAFIA: Merging of Adaptive Finite Intervals!
- Optimal bins in each dimension lead to very few units in the grid (candidate dense units)
- (Figure: grids for (a) CLIQUE and (b) MAFIA)
14. Base MAFIA Algorithm
- Divide each dimension into very fine regions.
- Compute the histogram in these regions along every dimension.
- Set the value of a sliding window to the maximum histogram value within the window.
- Adjacent units which have nearly the same histogram values are merged together to form larger bins.
- The threshold of each bin formed is computed automatically.
- A bin whose histogram value is much greater (by a factor of 2-3) than that of an equi-distribution of the data is DENSE (a code sketch of this binning follows).
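A minimal one-dimensional sketch of this adaptive binning, with assumed parameter values (number of fine regions, window width, merge tolerance, density factor), since the slide only fixes the density factor at roughly 2-3:

```python
import numpy as np

def adaptive_bins_1d(values, n_fine=1000, window=5, merge_tol=0.20, dense_factor=2.5):
    """Sketch of adaptive binning along one dimension (parameters are assumptions)."""
    hist, edges = np.histogram(values, bins=n_fine)

    # Sliding window: each window is represented by its maximum histogram value.
    win_max = [int(hist[i:i + window].max()) for i in range(0, n_fine, window)]

    # Merge adjacent windows whose representative values are nearly equal.
    groups = [[0, 1]]                                  # [first_window, last_window)
    for i in range(1, len(win_max)):
        ref = win_max[groups[-1][0]]
        if abs(win_max[i] - ref) <= merge_tol * max(ref, 1):
            groups[-1][1] = i + 1
        else:
            groups.append([i, i + 1])

    # A merged bin is dense if it holds far more points than a uniform spread would.
    bins = []
    for first, last in groups:
        lo, hi = edges[first * window], edges[min(last * window, n_fine)]
        count = int(hist[first * window:last * window].sum())
        expected = len(values) * (last - first) / len(win_max)
        bins.append((lo, hi, count > dense_factor * expected))
    return bins
```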
16. MAFIA Algorithm (merging dense units)
- Algorithm: candidate dense units in any k dimensions are obtained by merging dense units in (k-1) dimensions which share any (k-2) dimensions.
- Example: ({a1, c7, b8}, {c7, b8, d9}) → {a1, c7, b8, d9}
17. CLIQUE vs. MAFIA
- CLIQUE: CDUs in any dimension k are formed by combining dense units of dimension (k-1) which share the first (k-2) dimensions.
- MAFIA: CDUs in any dimension k are formed by combining dense units of dimension (k-1) which share any (k-2) dimensions.
- (Diagram) CLIQUE: Data Set → Fixed Grids → "first (k-2)" algorithm → Huge Search Space → Huge CDU Set → non-cluster dimensions reported.
- (Diagram) MAFIA: Data Set → Adaptive Grids (dimensions aware of the data distribution) → "any (k-2)" algorithm → Reduced Search Space → much reduced CDU Set → correct cluster dimensions reported.
18. Parallel MAFIA
- pMAFIA: parallel subspace clustering
- A grid- and density-based clustering algorithm, scalable in data size and number of dimensions
- Parallelization provides speedup for the subspace clustering algorithm.
19. pMAFIA Algorithm
- Each processor reads part of the data in a data-parallel fashion and constructs a histogram in every dimension (a sketch of this step follows the pseudocode).
  // Data is read in chunks (out of core).
- Reduce communication to obtain the global histograms. All processors build the adaptive grids from the histograms.
  // Each bin formed is a candidate dense unit.
- Current dimension k = 1
- while (new dense units are being found)
  - if (k > 1): Build Candidate Dense Units()
  - Populate the candidate dense units in a data-parallel fashion, in chunks (out of core) of B records.
  - Reduce communication to obtain the global CDU populations.
  - Identify Dense Units()
  - Build Dense Unit Data Structures()  // for the next higher dimension
  - k = k + 1
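A sketch of the first, data-parallel step (per-dimension histograms followed by a reduction), assuming an MPI implementation via mpi4py; the chunk iterator, bin edges, and equal bin counts per dimension are illustrative assumptions:

```python
import numpy as np
from mpi4py import MPI

def global_histograms(local_chunks, bin_edges):
    """Each process bins its own out-of-core chunks; an Allreduce then yields
    the global per-dimension histograms used to build the adaptive grids."""
    comm = MPI.COMM_WORLD
    n_dims, n_bins = len(bin_edges), len(bin_edges[0]) - 1   # same bin count per dimension assumed
    local = np.zeros((n_dims, n_bins), dtype=np.int64)

    for chunk in local_chunks:                # one chunk of B records at a time
        for d in range(n_dims):
            counts, _ = np.histogram(chunk[:, d], bins=bin_edges[d])
            local[d] += counts

    global_hist = np.empty_like(local)
    comm.Allreduce(local, global_hist, op=MPI.SUM)
    return global_hist
```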
20. Build Candidate Dense Units
21. Build Candidate Dense Units
- CLIQUE: CDUs in any dimension k are formed by combining dense units of dimension (k-1) such that they share the first (k-2) dimensions.
- pMAFIA: CDUs in any dimension k are formed by combining dense units of dimension (k-1) such that they share any (k-2) dimensions.
- For data dimension d = 10 and current dimension k = 5, CLIQUE does not explore 93.3% of the possible combinations. In general, ...
- This problem is even more pronounced in data sets whose clusters have a very high subspace coverage.
22. Build Candidate Dense Units
- Each dense unit is compared with every other dense unit to form CDUs, resulting in an O(Ndu^2) algorithm (Ndu = number of dense units).
- For large values of Ndu, the CDUs are built in parallel.
- Processors 0, ..., (p-1) work in parallel on parts of the total Ndu dense units. Processor i compares the dense units between N_i and N_(i+1) with all the other dense units; for optimal task partitioning we have ... (see the load-balancing sketch below).
- Identical CDUs generated during the process need to be discarded.
- Each generated CDU is compared with every other CDU to identify duplicates, resulting in an O(Ncdu^2) algorithm.
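One way to realize the optimal task partitioning mentioned above is to balance the number of unit pairs handled per processor; the rule below is an illustrative stand-in for the formula on the slide:

```python
def balanced_ranges(n_units, n_procs):
    """Split the O(Ndu^2) pairwise comparisons over processors.

    Processor i gets dense units [start_i, end_i) and compares each of them
    against all later units; boundaries are chosen so every processor handles
    roughly the same number of pairs.
    """
    total_pairs = n_units * (n_units - 1) // 2
    target = total_pairs / n_procs
    ranges, start = [], 0
    for p in range(n_procs):
        end, pairs = start, 0
        while end < n_units and (pairs < target or p == n_procs - 1):
            pairs += n_units - end - 1        # unit `end` pairs with all later units
            end += 1
        ranges.append((start, end))
        start = end
    return ranges

# Example: 1000 dense units on 4 processors; the first range is the narrowest
# because the earliest units have the most later units to compare against.
print(balanced_ranges(1000, 4))
```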
23. Task Parallelism
- Identify Dense Units
  - The CDUs generated are populated in a data-parallel fashion.
  - If the histogram count of a CDU is greater than the threshold of every bin which forms the CDU in its respective dimension, the CDU is a dense unit (see the sketch after this list).
  - If Ncdu is large, each processor processes Ncdu/p candidate dense units.
- Build Dense Unit Data Structures
  - If Ndu, the number of dense units, is large, the dense unit data structures are constructed in parallel.
  - Each dense unit is completely represented by a set of dimensions and the corresponding bin indices in those dimensions.
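The dense-unit test from the first group of bullets becomes a one-line check once every adaptive bin carries its automatically computed threshold; the (dimension, bin) keyed mapping here is an assumed representation:

```python
def is_dense(cdu_count, cdu_bins, bin_threshold):
    """A CDU is dense if its population exceeds the threshold of every bin
    that composes it. `cdu_bins` is a list of (dimension, bin_index) pairs and
    `bin_threshold` maps (dimension, bin_index) -> threshold from the
    adaptive-grid step; both representations are assumptions for this sketch."""
    return all(cdu_count > bin_threshold[(d, b)] for d, b in cdu_bins)
```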
24. Parallelization (Building CDUs)
(Figure: total work done by processor P_i.)
25. pMAFIA Analysis
- Data parallelism in populating the CDUs in every dimension is effective for massive data sets.
- The gains of task parallelism are realized when the data contains a large number of dense units (clusters).
- A k-dimensional dense unit is allocated just 2k bytes of memory: k bytes for the dimensions and k bytes for the bin indices (a packing sketch follows this list).
- Data structures in the form of linear arrays of bytes mean very small message buffers are communicated, a space optimization.
- Although the bottom-up algorithm is exponential in the data dimension, for low subspace coverage the use of adaptive grids and the parallel formulation gives very promising results.
- With k the dimension of the highest-dimensional dense unit, we explore all possible subspaces of these k dimensions, i.e. O(c^k) work.
- The k passes over the data set cost O((N / (p·B)) · k · T_io), where N is the total number of records, p the number of processors, B the number of records per chunk, and T_io the I/O access time for a block of B records.
- Communication overhead is O(T_comm · S · p · k), where T_comm is a constant for communication and S is the size of the message exchanged.
- Total cost: O(c^k + (N / (p·B)) · k · T_io + T_comm · S · p · k).
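The 2k-byte dense-unit representation in the list above can be made concrete with a small packing helper; the byte layout (dimension ids first, bin indices second) is an assumption, since the slide only states the total size:

```python
import numpy as np

def pack_dense_unit(dims, bins):
    """Pack a k-dimensional dense unit into 2k bytes: k one-byte dimension ids
    followed by k one-byte bin indices (field order assumed). Valid while the
    data dimensionality and the number of adaptive bins both stay below 256."""
    assert len(dims) == len(bins)
    return np.array(list(dims) + list(bins), dtype=np.uint8).tobytes()

# Example: the 4-dimensional dense unit {dim 0 bin 1, dim 1 bin 7, dim 2 bin 8, dim 3 bin 9}
packed = pack_dense_unit([0, 1, 2, 3], [1, 7, 8, 9])
assert len(packed) == 8          # 2k bytes for k = 4
```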
26. pMAFIA Outline
- Clustering and subspace clustering
- Base MAFIA Algorithm
- pMAFIA (parallelization)
- Performance Results
- Adaptivity performance
- Scalability with data set size.
- Scalability with data dimensionality.
- Scalability with cluster dimensionality.
- Some Real world data sets.
27. Quality of Results
- (a) CLIQUE
  - Loss of quality: reports pqrs as the cluster!
  - Requires a complicated post-processing step.
  - Bin selection and threshold fixing are a non-trivial problem; the results cannot be validated.
- (b) MAFIA: almost exact cluster boundaries are recognized
  - No post-processing step is required.
28. CLIQUE vs. MAFIA
- 400,000 records in 10 dimensions; 2 clusters in 2 four-dimensional subspaces.
29. Advantage of Adaptive Grids
- 300,000 records in a 15-dimensional space with 1 cluster of 5 dimensions.
- A speedup of 80 was obtained over CLIQUE.
- CLIQUE failed to produce results, even with our modified CDU generation algorithm, within 2 hours on 16 processors.
- This relatively small data set was mined in 32 seconds on 1 processor.
30. Scalability with Data Set Size
- 20-dimensional data with 5 clusters in 5 different subspaces.
- Data sets of up to 11.8 million records.
- Clusters detected in just about 3 minutes on 16 processors!
- Runtime is almost linear in the data set size (because most of the time is spent scanning the data).
31. Parallelization (on IBM SP2)
- 30-dimensional data with 8.3 million records; 5 clusters, each in a 6-dimensional subspace.
- Near-linear speedups.
- Negligible communication overheads (<1%).
32. Data Dimensionality
- 250,000 records; 3 clusters in different 5-dimensional subspaces.
- Near-linear behavior with data dimensionality: the algorithm depends on the maximum number of dimensions in a cluster, not on the data dimensionality.
33. Cluster Dimensionality
- 50-dimensional data with 1 cluster and 650,000 records; the cluster dimensionality varies from 3 to 10.
- Behavior is in line with the order of the algorithm: runtime increases with the subspace coverage of the cluster.
34. Scalability on Movie Data
- 72,916 users rated 1,628 movies over 18 months: 2.8 million ratings.
- 4-dimensional data: user-id, movie-id, weight, score.
- Discovered seven interesting 2-dimensional clusters!
35. Other Data Sets
- One-day-ahead prediction of the DAX (German stock exchange)
  - The DAX prediction data set was based on 12 input time series such as stock indices, bond indices, gold prices, etc.
  - 22 dimensions with 2,757 records; major gains from task parallelism.
  - Mined the clusters in 8.16 seconds on 8 processors.
  - Unique clusters were discovered in 3-, 4-, 5- and 6-dimensional subspaces.
- Ionosphere data (UCI repository)
  - Radar data collected in Goose Bay, Labrador.
  - 34-dimensional data, 351 records.
  - Discovered unique clusters only in 3- and 4-dimensional subspaces.
36. Performance Results
- Implemented on a distributed-memory machine (IBM SP2) and on a network of workstations.
- Massive data sets with more than 10 million records in very high-dimensional spaces (>30 dimensions).
- A two-orders-of-magnitude improvement over existing techniques.
- Parallelization adds further scalability to the algorithm.
- Near-linear speedups achieved.
- Negligible communication overheads.
- Performance results on both synthetic and real data sets.
37. Summary and Conclusions for pMAFIA
- MAFIA: an unsupervised subspace clustering algorithm.
- Introduced the adaptive grids formulation.
- The first parallel subspace clustering algorithm for massive data sets.
- Incorporates both task and data parallelism.
- pMAFIA: a scalable, parallel implementation in both data size and number of dimensions.
38. PARSIMONY Overview
39. Sparse Data Structures and Representations
41. Using OLAP Framework for ...