1
PARSIMONY: Scalable Parallel Data Mining
  • Alok N. Choudhary
  • Northwestern University
  • (Acknowledgments: Harsha Nagesh (Bell Labs) and
    Sanjay Goil (Sun))

2
Outline
  • Overview of PARSIMONY
  • MAFIA (Subspace clustering)
  • pMAFIA (Parallel Subspace Clustering)
  • Performance Results
  • PARSIMONY
  • multidimensional data analysis
  • parallel classification
  • Summary

3
Overview of Knowledge Discovery Process
4
PARSIMONY Overview
5
MAFIA: Subspace Clustering for High-Dimensional
Data Sets
  • Clustering and subspace clustering
  • Base MAFIA Algorithm
  • pMAFIA (parallelization)
  • Performance Results

6
Clustering
  • Discovery of interesting patterns in large
    multi-dimensional data sets.
  • What is the average credit for a particular
    income group? (Financial Services)
  • Areas with the maximum collect calls
    (Telecommunications)
  • Categorize stocks based on their movement
    (Investment Banking)
  • Target mailing (Marketing)
  • Analysis of satellite data, detection of clusters
    in geographic information systems, categorization
    of web documents, etc.

7
Clustering Multi-dimensional Data Sets
Determine the range of attributes in each dimension
of the cluster(s)
8
Issues to be addressed
  • Computational optimizations of the basic algorithm
  • Scalability with database size (out-of-core data
    sets)
  • Scalability with the dimensionality of the data
  • Efficient parallelization
  • Recognition of arbitrarily shaped clusters

9
Related Work
  • Partition-based Clustering
  • User-specified k representative points are taken
    as cluster centers, and points are assigned to
    the centers
  • k-means, k-medoids, CLARANS (VLDB 94), BIRCH
    (SIGMOD 96), ...
  • Considers clustering as a partitioning of points
  • Hierarchical Clustering
  • Each point starts as a cluster. Similar points
    are merged together gradually.
  • CURE (SIGMOD 98) (uses sampling)
  • Categorical Clustering
  • Clustering of categorical data, e.g. automobile
    sales data: color, year, model, price, etc.
  • Best suited for non-numerical data
  • CACTUS (KDD 99), STIRR (VLDB 98)

10
Related Work
  • Density- and Grid-Based Clustering
  • Clusters are regions of higher density than their
    surroundings
  • WaveCluster (VLDB 98), DBSCAN, CLIQUE (SIGMOD 98)
  • The number of subspaces is exponential in the
    data dimensionality
  • The multidimensional space is divided into grids
    and the histogram in each hyper-rectangle is
    computed. Grid regions with a significant
    histogram value are cluster regions.
  • Post-processing is done to grow the connected
    cluster regions.
  • A fine grid size results in an explosion in the
    number of hyper-rectangles; coarser grids fail to
    detect clusters.
  • The correct grid size is very critical!

11
Subspace Clustering
  • CLIQUE (SIGMOD 98) - user-specified grid size and
    threshold for each dimension
  • Finer grids: enormous computation; coarser grids:
    loss of quality
  • Noise is another consideration with finer grids
  • A bottom-up algorithm that combines dense regions
    in different subspaces.
  • A hyper-rectangle in a multidimensional space is
    dense if it contains more points than a
    user-specified threshold percentage of the total
    number of points.
  • PROCLUS (SIGMOD 99) - a modification of the
    k-means algorithm.
  • User input of the number of clusters and the
    average cluster dimensionality is unrealistic for
    real-world data sets
  • Uses cluster centers and points near them to
    compute statistics. These determine the relevant
    cluster dimensions of the clusters!
  • ENCLUS (KDD 99) - identifies the dimensions of a
    cluster followed by the application of any
    clustering algorithm.
  • Entropy-based clustering: uses entropy as a
    measure of correlation between dimensions;
    requires entropy thresholds to be set

12
Subspace Clustering
  • Observation: If a collection of points S is a
    cluster in a k-dimensional space, then S is also
    part of a cluster in any (k-1)-dimensional
    projection of the space
  • Algorithm: Growing clusters - candidate dense
    units in any k dimensions are obtained by merging
    dense units in (k-1) dimensions which share any
    (k-2) dimensions (a sketch of this merge step
    follows below).
  • Ex: ({a1,b7,d9}, {b7,c8,d9}) --> {a1,b7,c8,d9}
  • Candidate dense units are populated by a pass
    over the data set, and the dense units among them
    are identified in each dimension.
  • The dense units found are combined to form new
    candidate dense units.
  • The algorithm terminates when no more candidate
    dense units are found.
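
A minimal Python sketch of this merge step. The set-of-(dimension, bin)-pairs representation is an illustrative assumption; the actual implementation packs each dense unit into byte arrays of dimensions and bin indices (see slide 23).

from itertools import combinations

def build_candidate_dense_units(dense_units):
    # Merge (k-1)-dimensional dense units that share any (k-2) dimensions
    # into k-dimensional candidate dense units (CDUs), discarding duplicates.
    # A dense unit is modelled here as a frozenset of (dimension, bin) pairs.
    candidates = set()
    for u, v in combinations(dense_units, 2):
        if len(u & v) != len(u) - 1:          # must agree on exactly (k-2) dims
            continue
        merged = u | v                        # k (dimension, bin) pairs
        dims = {d for d, _ in merged}
        if len(dims) == len(merged):          # all k dimensions must be distinct
            candidates.add(merged)            # the set() drops identical CDUs
    return candidates

# Example from the slide: ({a1,b7,d9}, {b7,c8,d9}) --> {a1,b7,c8,d9}
u1 = frozenset({('a', 1), ('b', 7), ('d', 9)})
u2 = frozenset({('b', 7), ('c', 8), ('d', 9)})
print(build_candidate_dense_units([u1, u2]))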

13
Adaptive Grids (reducing computation in practice)
  • Automatic grid fitting based on the data
    distribution
  • MAFIA: Merging of Adaptive Finite Intervals!
  • Optimal bins in each dimension lead to very few
    units in the grid (candidate dense units)
  • (a) CLIQUE
  • (b) MAFIA

14
Base MAFIA Algorithm
  • Divide each dimension into very fine intervals.
  • Compute the histogram over these intervals along
    every dimension.
  • Set the value of a sliding window to the maximum
    histogram value in the window.
  • Adjacent intervals with nearly the same histogram
    values are merged together to form larger bins.
  • The threshold of each bin formed is computed
    automatically.
  • A bin whose histogram value is much greater (by a
    factor of 2-3) than that expected under an
    equi-distribution of the data is DENSE (a sketch
    of this binning step follows below).
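
A minimal one-dimensional Python sketch of the adaptive-grid binning described above. The parameter names and default values (n_fine, window, merge_tol, dense_factor) are illustrative assumptions, not values taken from the algorithm.

import numpy as np

def adaptive_bins(values, n_fine=1000, window=5, merge_tol=0.2, dense_factor=2.5):
    # 1. Histogram the dimension into many fine intervals.
    counts, edges = np.histogram(values, bins=n_fine)

    # 2. Sliding window: each interval takes the maximum count in its window.
    smoothed = np.array([counts[max(0, i - window + 1):i + 1].max()
                         for i in range(n_fine)])

    # 3. Merge adjacent intervals with nearly the same (smoothed) counts.
    bins = [[0, 0, counts[0]]]          # [first interval, last interval, population]
    for i in range(1, n_fine):
        ref = smoothed[bins[-1][1]]
        if ref > 0 and abs(smoothed[i] - ref) / ref <= merge_tol:
            bins[-1][1] = i
            bins[-1][2] += counts[i]
        else:
            bins.append([i, i, counts[i]])

    # 4. A bin is DENSE if its population is much greater (dense_factor) than
    #    what an equi-distribution of the data over that bin would give.
    uniform = len(values) / n_fine      # expected count per fine interval
    result = []
    for first, last, pop in bins:
        width = last - first + 1
        dense = pop > dense_factor * uniform * width
        result.append((edges[first], edges[last + 1], pop, dense))
    return result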

15
(No Transcript)
16
MAFIA Algorithm (merging dense units)
  • Algorithm: Candidate dense units in any k
    dimensions are obtained by merging dense units in
    (k-1) dimensions which share any (k-2)
    dimensions.
  • Ex: ({a1,c7,b8}, {c7,b8,d9}) --> {a1,c7,b8,d9}

17
  • CLIQUE: CDUs in any dimension k are formed by
    combining dense units of dimension (k-1) which
    share the first (k-2) dimensions.
  • MAFIA: CDUs in any dimension k are formed by
    combining dense units of dimension (k-1) which
    share any (k-2) dimensions.

(Figure: comparison of the two pipelines)
CLIQUE: Data Set -> Fixed Grids -> Huge Search Space ->
"first (k-2)" algo -> Huge CDU Set -> Non-cluster
dimensions reported
MAFIA: Data Set -> Adaptive Grids (dimensions aware of
the data distribution) -> Reduced Search Space ->
"any (k-2)" algo -> Much reduced CDU Set -> Correct
cluster dimensions reported
18
Parallel MAFIA
  • pMAFIA: Parallel Subspace Clustering
  • A grid- and density-based clustering algorithm,
    scalable in data size and number of dimensions
  • Parallelization provides speedup for the subspace
    clustering algorithm.

19
pMAFIA Algorithm
  • Each processor reads part of the data in a data
    parallel fashion and constructs histograms in
    every dimension (a sketch of this step follows
    below).
  • // Data is read in out-of-core chunks
  • Reduction communication to obtain the global
    histograms. All processors build the adaptive
    grids using the histograms.
  • // Each bin formed is a candidate dense unit.
  • Current dimension k = 1
  • while (dense units are found)
  • if (k > 1) Build-Candidate-Dense-Units()
  • Populate the candidate dense units in a data
    parallel fashion and in out-of-core chunks of
    B records.
  • Reduction communication to obtain the global CDU
    populations.
  • Identify-Dense-Units()
  • Build-Dense-Unit-Data-Structures() // for the
    next higher dimension
  • k = k + 1
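
A minimal sketch of the data-parallel, out-of-core histogram step (the first two bullets above), assuming an MPI-style implementation via mpi4py. The binary file layout, the chunk assignment, and all parameter names are assumptions for illustration, not the original code.

import numpy as np
from mpi4py import MPI

def global_histograms(filename, n_dims, n_fine, lo, hi, chunk_records=100_000):
    # Each processor reads its own chunks of the record file (out of core),
    # accumulates per-dimension histograms locally, and a single reduction
    # then gives every processor the global histograms.
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    data = np.memmap(filename, dtype=np.float32, mode='r').reshape(-1, n_dims)
    n_records = data.shape[0]
    local = np.zeros((n_dims, n_fine), dtype=np.int64)

    # Round-robin assignment of fixed-size chunks to processors.
    for start in range(rank * chunk_records, n_records, size * chunk_records):
        chunk = np.asarray(data[start:start + chunk_records])
        for d in range(n_dims):
            counts, _ = np.histogram(chunk[:, d], bins=n_fine,
                                     range=(lo[d], hi[d]))
            local[d] += counts

    # Reduction communication: global histograms on all processors.
    global_hist = np.empty_like(local)
    comm.Allreduce(local, global_hist, op=MPI.SUM)
    return global_hist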

20
Build-Candidate-Dense-Units
  • Current dimension (k) = 3.

21
Build Candidate Dense Units
  • Current dimension (k) = 3.
  • CLIQUE: CDUs in any dimension k are formed by
    combining dense units of dimension (k-1) such
    that they share the first (k-2) dimensions.
  • pMAFIA: CDUs in any dimension k are formed by
    combining dense units of dimension (k-1) such
    that they share any (k-2) dimensions.
  • For data dimensionality d = 10 and current
    dimension k = 5, CLIQUE does not explore 93.3%
    of the possible combinations. In general, the
    unexplored fraction grows with k.
  • This problem is more pronounced in data sets with
    clusters having a very high subspace coverage

22
Build Candidate Dense Units
  • Each dense unit is compared with every other
    dense unit to form CDUs, resulting in an O(Ndu^2)
    algorithm (Ndu = number of dense units).
  • For large values of Ndu, the CDUs are built in
    parallel (see the sketch below).
  • Processors 0, ..., (p-1) work in parallel on
    parts of the total Ndu dense units. Processor i
    compares the dense units with indices between Ni
    and Ni+1 with all the other dense units; for
    optimal task partitioning the boundaries Ni are
    chosen so that each processor performs roughly
    the same number of comparisons.
  • Identical CDUs generated during the process need
    to be discarded.
  • Each generated CDU is compared with every other
    CDU to identify identical ones, resulting in an
    O(Ncdu^2) algorithm.
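
A Python sketch of one way to realize the balanced task partitioning described above. The boundary-selection rule (an equal number of pairwise comparisons per processor) is one plausible reading of "optimal task partitioning", not necessarily the exact rule used by pMAFIA.

def partition_pairwise_work(n_du, p):
    # Split the O(Ndu^2) pairwise comparison of dense units across p
    # processors.  Comparing unit i with all later units costs (n_du - i - 1)
    # comparisons, so equal index ranges would overload the low-numbered
    # processors.  Boundaries N_0 < N_1 < ... < N_p are chosen so that each
    # processor gets roughly the same number of pairs.
    total_pairs = n_du * (n_du - 1) // 2
    share = total_pairs / p

    bounds = [0]
    done = 0
    for i in range(n_du):
        done += n_du - i - 1
        if len(bounds) < p and done >= share * len(bounds):
            bounds.append(i + 1)
    bounds.append(n_du)
    return bounds   # processor k compares units with index in [bounds[k], bounds[k+1])

# Example: 1,000 dense units split across 4 processors.
print(partition_pairwise_work(1000, 4))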

23
Task Parallelism
  • Identify Dense Units
  • The CDUs generated are populated in a data
    parallel fashion.
  • If the histogram count of a CDU is greater than
    the threshold of all the bins which form the CDU
    in their respective dimensions, the CDU is a
    dense unit.
  • If Ncdu is large, each processor processes Ncdu/p
    candidate dense units
  • Build Dense Unit Data Structures
  • If Ndu, the number of dense units, is large, the
    dense unit data structures are constructed in
    parallel.
  • Each dense unit is completely represented by a
    set of dimensions and the corresponding bin
    indices in those dimensions (see the sketch
    below).
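
A minimal sketch of the compact dense-unit representation from the last bullet (2k bytes per k-dimensional unit, as slide 25 also notes). The class and field names are illustrative assumptions.

import numpy as np

class DenseUnit:
    # A k-dimensional dense unit is fully described by its k dimension
    # indices plus the k bin indices in those dimensions: 2k bytes in all,
    # stored as linear byte arrays so that message buffers stay very small.
    __slots__ = ("dims", "bins")

    def __init__(self, dims, bins):
        self.dims = np.asarray(dims, dtype=np.uint8)   # k bytes
        self.bins = np.asarray(bins, dtype=np.uint8)   # k bytes

    def to_bytes(self):
        # Linear byte layout, ready for packing into a communication buffer.
        return self.dims.tobytes() + self.bins.tobytes()

# Example: the 4-dimensional unit {a1, b7, c8, d9}, with dims a..d coded 0..3.
u = DenseUnit(dims=[0, 1, 2, 3], bins=[1, 7, 8, 9])
print(len(u.to_bytes()))   # 8 bytes for k = 4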

24
Parallelization (Building CDUs)
Total work done by processor Pi is shown
25
pMAFIA ANALYSIS
  • Data parallelism in populating the CDUs in every
    dimension is effective for massive data sets.
  • The gains of task parallelism are realized when
    the data contains a large number of dense units
    (clusters).
  • A k-dimensional dense unit is allocated just 2k
    bytes of memory: k bytes for the dimensions and k
    for the bin indices.
  • Data structures in the form of linear arrays of
    bytes: very small message buffers are
    communicated (space optimization).
  • Although the bottom-up algorithm is exponential
    in the data dimension, for low subspace coverage
    the use of adaptive grids and the parallel
    formulation gives very promising results.
  • k = dimension of the highest-dimensional dense
    unit; we explore all possible subspaces of these
    k dimensions => O(c^k)
  • For k passes over the data set: O((N/(pB)) * k *
    Tio), where
  • N = total number of records, p = number of
    processors,
  • B = records per chunk, Tio = I/O access time for
    a block of B records.
  • Communication overhead: O(Tcomm * S * p * k),
    where
  • Tcomm = constant for communication, S = size of
    the message exchanged.
  • Overall: O(c^k + (N/(pB)) * k * Tio + Tcomm * S *
    p * k)

26
pMAFIA Outline
  • Clustering and subspace clustering
  • Base MAFIA Algorithm
  • pMAFIA (parallelization)
  • Performance Results
  • Adaptivity performance
  • Scalability with data set size
  • Scalability with data dimensionality
  • Scalability with cluster dimensionality
  • Some real-world data sets

27
Quality of Results
  • (a) CLIQUE
  • Loss of quality: reports pqrs as the cluster!
  • Requires a complicated post-processing step.
  • Bin selection and threshold fixing is a
    non-trivial problem. Cannot validate the results.
  • (b) MAFIA: almost exact cluster boundaries are
    recognized
  • No post-processing step is required.

28
CLIQUE-MAFIA
  • 400,000 records in 10 dimensions; 2 clusters in 2
    different 4-dimensional subspaces.

29
Advantage of Adaptive Grids
  • 300,000 records in a 15-dimension space with 1
    cluster of 5 dimensions.
  • A speedup of 80 was obtained over CLIQUE.
  • CLIQUE failed to produce results even after 2
    hours on 16 processors, even with our modified
    CDU generation algorithm.
  • This relatively small data set was mined in 32
    seconds on 1 processor.

30
Scalability with Data Set Size
  • 20-dimension data with 5 clusters in 5 different
    subspaces
  • Data sets of up to 11.8 million records
  • Clusters detected in about 3 minutes on 16
    processors!
  • Run time is almost linear in the data set size
    (because most of the time is spent scanning the
    data)

31
Parallelization (on IBM SP2)
  • 30-dimension data with 8.3 million records; 5
    clusters, each in a 6-dimension subspace.
  • Near-linear speedups
  • Negligible communication overheads (<1%)

32
Data Dimensionality
  • 250,000 records, 3 clusters in different
    5-dimensional subspaces.
  • Near-linear behavior with data dimensionality:
    the algorithm depends on the maximum number of
    dimensions in a cluster and not on the data
    dimensionality.

33
Cluster Dimensionality
  • 50-dimension data with 1 cluster and 650,000
    records; the cluster dimension is varied from 3
    to 10.
  • Behavior is in line with the order of the
    algorithm: run time increases with the subspace
    coverage of the cluster.

34
Scalability on Movie Data
  • 72,916 users rated 1,628 movies over 18 months:
    2.8 million ratings
  • 4-D data: user-id, movie-id, weight, score
  • Discovered seven interesting 2-d clusters!

35
Other Data Sets
  • One-day-ahead prediction of the DAX (German Stock
    Exchange)
  • The DAX prediction data set was based on 12 input
    time series such as stock indices, bond indices,
    gold prices, etc.
  • 22 dimensions with 2,757 records; major gains
    from task parallelism
  • Mined the clusters in 8.16 seconds on 8
    processors.
  • Unique clusters discovered in 3-, 4-, 5- and
    6-dimensional subspaces.
  • Ionosphere data (UCI repository)
  • Radar data collected at Goose Bay, Labrador
  • 34-dimension data, 351 records.
  • Discovered unique clusters only in 3- and
    4-dimensional subspaces.

36
Performance Results
  • Implementation on the distributed-memory IBM SP2
    machine and on a network of workstations.
  • Massive data sets with more than 10 million
    records in very high-dimensional spaces (>30).
  • A two-order-of-magnitude improvement over
    existing techniques.
  • Parallelization adds further scalability to the
    algorithm
  • Near-linear speedups achieved.
  • Negligible communication overheads.
  • Performance results on both synthetic and real
    data sets.

37
Summary and Conclusions for pMAFIA
  • MAFIA: an unsupervised subspace clustering
    algorithm
  • Introduced the adaptive grids formulation
  • The first parallel subspace clustering algorithm
    for massive data sets
  • Incorporates both task and data parallelism.
  • pMAFIA: a scalable parallel implementation, in
    both data size and number of dimensions

38
PARSIMONY Overview
39
Sparse Data Structures and Representations
40
(No Transcript)
41
Using OLAP Framework for
42
(No Transcript)
43
(No Transcript)
44
(No Transcript)
45
(No Transcript)
46
(No Transcript)
47
(No Transcript)