MOSAIC: A Proximity Graph Approach for Agglomerative Clustering

Transcript and Presenter's Notes
1
MOSAIC: A Proximity Graph Approach for
Agglomerative Clustering
  • Jiyeon Choo, Rachsuda Jiamthapthaksin, Chun-sheng
    Chen, Ulvi Celepcikay, Christian Giusti, and
    Christoph F. Eick
  • Department of Computer Science, University of
    Houston
  • Organization
  • Motivation
  • Scope of the research
  • Region Discovery
  • Traditional Clustering
  • Clustering with Plug-In Fitness Functions
  • Shape-aware Clustering Algorithms
  • Ideas of MOSAIC
  • Background
  • The MOSAIC Algorithm
  • Experimental Evaluation
  • Related Work
  • Conclusion and Future Work

2
1.1 Motivation: Examples of Region Discovery
  • Application 1: Hot-spot Discovery [EVDW06]
  • Application 2: Find Interesting Regions with respect
    to a Continuous Variable
  • Application 3: Find Representative Regions (Sampling)
  • Application 4: Regional Co-location Mining
  • Application 5: Regional Association Rule Mining [DEWY06]
  • Application 6: Regional Association Rule Scoping [EDYKN07]

Fig.: RD-Algorithm applied to wells in Texas for β=1.01 and
β=1.04 (green: well that is safe with respect to arsenic;
red: unsafe well)
3
Region Discovery Framework
  • The algorithms we currently investigate solve the
    following problem:
  • Given:
  • A dataset O with a schema R
  • A distance function d defined on instances of R
  • A fitness function q(X) that evaluates a clustering
    X = {c1, ..., ck} as follows (a minimal sketch of such
    a function follows below):
  • q(X) = Σc∈X reward(c) = Σc∈X interestingness(c) · size(c)^β, with β > 1
  • Objective:
  • Find c1, ..., ck ⊆ O such that
  • ci ∩ cj = ∅ if i ≠ j
  • X = {c1, ..., ck} maximizes q(X)
  • All clusters ci ∈ X are contiguous
  • c1 ∪ ... ∪ ck ⊆ O
  • c1, ..., ck are usually ranked based on the reward
    each cluster receives, and low-reward clusters
    are frequently not reported
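
A minimal Python sketch of an additive fitness function of the
form above; the per-cluster interestingness measure is assumed to
be supplied by the application, and all names and the default beta
are illustrative rather than part of the talk:

from typing import Callable, List, Sequence

def additive_fitness(
    clusters: List[Sequence[int]],                      # each cluster: indices of its objects
    interestingness: Callable[[Sequence[int]], float],  # domain-supplied per-cluster measure
    beta: float = 1.01,                                  # beta > 1 (the wells figure uses 1.01 and 1.04)
) -> float:
    # q(X) = sum over clusters c of interestingness(c) * size(c)**beta
    return sum(interestingness(c) * (len(c) ** beta) for c in clusters)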

4
1.2 Clustering with Plug-In Fitness Functions
Fig.: Taxonomy of clustering algorithms by their use of fitness
functions: no fitness function (e.g., DBSCAN, hierarchical
clustering), fixed or implicit fitness function (e.g., K-Means,
PAM), and support for a plug-in fitness function (e.g.,
CHAMELEON, MOSAIC)
5
1.3 Shape-aware Clustering
  • Shape is a significant characteristic in
    traditional clustering and region discovery
  • Examples

Fig. 1: Some chain-like patterns in the Volcano dataset
Fig. 2: Arbitrary-shape regions of high (low) arsenic
concentration in Texas wells
6
1.4 Ideas Underlying MOSAIC
  • MOSAIC provides a generic framework that
    integrates representative-based clustering,
    agglomerative clustering, and proximity graphs,
    and which approximates arbitrary shape clusters
    using unions of small convex polygons

Fig. 6: An illustration of MOSAIC's approach: (a) input, (b) output
7
Talk Organization
  • Motivation
  • Background
  • Representative-based clustering
  • Agglomerative clustering
  • Proximity Graphs
  • The MOSAIC Algorithm
  • Experimental Evaluation
  • Related Work
  • Conclusion and Future Work

8
2.1 Representative-based Clustering

Fig.: Four clusters (1-4) induced by representatives in a
two-dimensional attribute space (Attribute1 vs. Attribute2)

Objective: Find a set of objects O_R such that the
clustering X obtained by using the objects in O_R
as representatives minimizes q(X).
Properties: Cluster shapes are convex polygons.
Popular Algorithms: K-means, K-medoids, SCEC (a sketch of
this step follows below)
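
As an illustration of this step, a sketch that uses scikit-learn's
KMeans as the representative-based algorithm; the choice of KMeans
and the parameter values are assumptions, not necessarily what the
talk used:

import numpy as np
from sklearn.cluster import KMeans

def representative_clustering(points: np.ndarray, k: int = 100, seed: int = 0):
    # Create a deliberately large number of small convex (Voronoi-cell) clusters;
    # the centroids serve as the representatives that MOSAIC later reads in.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(points)
    return km.labels_, km.cluster_centers_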
9
2.2 MOSAIC and Agglomerative Clustering
  • Advantages of MOSAIC over traditional agglomerative
    clustering:
  • Wider search: considers all neighboring clusters
  • Plug-in fitness function
  • Clusters are always contiguous
  • The expensive algorithm is only run for 20-1000
    iterations
  • Highly generic algorithm

10
2.3 Proximity Graphs
  • How can we identify neighboring clusters for
    representative-based clustering algorithms?
  • Proximity graphs provide various definitions of
    "neighbour":

NNG: Nearest Neighbour Graph
MST: Minimum Spanning Tree
RNG: Relative Neighbourhood Graph
GG: Gabriel Graph
DT: Delaunay Triangulation (neighbours of a 1NN-classifier)
11
Proximity Graphs: Delaunay
  • The Delaunay Triangulation is the dual of the
    Voronoi diagram
  • Three points are each other's neighbours if their
    circumscribed sphere contains no other points
  • Complete: captures all neighbouring clusters
  • Expensive to compute in high dimensions (a sketch of
    extracting Delaunay neighbour pairs follows below)
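
A sketch of how the Delaunay neighbour pairs of the cluster
representatives could be obtained with scipy.spatial.Delaunay in
the low-dimensional case; the function name is illustrative:

from itertools import combinations
from scipy.spatial import Delaunay

def delaunay_neighbors(representatives):
    # Index pairs of representatives that share an edge in the Delaunay triangulation.
    tri = Delaunay(representatives)
    edges = set()
    for simplex in tri.simplices:            # each simplex lists the indices of its vertices
        for i, j in combinations(simplex, 2):
            edges.add((int(min(i, j)), int(max(i, j))))
    return edges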

12
Proximity Graphs: Gabriel
  • The Gabriel graph is a subgraph of the Delaunay
    Triangulation (some decision boundaries might be
    missed)
  • Two points are neighbours only if their (diametral)
    sphere of influence is empty
  • Can be computed more efficiently: O(k³) (see the
    brute-force sketch below)
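
A sketch of the brute-force O(k³) Gabriel-graph construction: two
representatives are neighbours iff no third representative lies
inside the sphere whose diameter is the segment between them; the
function name is illustrative:

import numpy as np

def gabriel_graph(reps):
    # Brute-force O(k^3) Gabriel graph over the k representatives (rows of reps).
    reps = np.asarray(reps, dtype=float)
    k = len(reps)
    edges = set()
    for i in range(k):
        for j in range(i + 1, k):
            center = (reps[i] + reps[j]) / 2.0
            radius2 = np.sum((reps[i] - reps[j]) ** 2) / 4.0
            # i and j are Gabriel neighbours iff the open diametral sphere is empty
            if all(np.sum((reps[m] - center) ** 2) >= radius2
                   for m in range(k) if m != i and m != j):
                edges.add((i, j))
    return edges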

13
3. MOSAIC
Fig. 10: Gabriel graph for clusters generated by
a representative-based clustering algorithm
14
Pseudo Code: MOSAIC
1. Run a representative-based clustering
algorithm to create a large number of
clusters.
2. Read the representatives of the
obtained clusters.
3. Create a merge-candidate relation
using proximity graphs.
4. WHILE there are merge candidates (Ci, Cj) left
BEGIN
   Merge the pair of merge candidates (Ci, Cj) that
   enhances the fitness function q the most
   into a new cluster C'
   Update merge candidates:
   ∀C: Merge-Candidate(C', C) ⇔
       Merge-Candidate(Ci, C) ∨ Merge-Candidate(Cj, C)
END
RETURN the best clustering X found.
(A Python rendering of this loop follows below.)
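
A compact Python rendering of the pseudocode, assuming clusters are
given as lists of object indices, the merge-candidate relation is a
set of id pairs (e.g., from the Gabriel graph), and q scores a whole
clustering; identifiers are illustrative and each candidate is
re-evaluated from scratch:

from typing import Callable, Dict, FrozenSet, List, Set, Tuple

def mosaic_merge(
    clusters: Dict[int, List[int]],               # cluster id -> object indices
    candidates: Set[FrozenSet[int]],              # neighbouring cluster-id pairs
    fitness: Callable[[List[List[int]]], float],  # plug-in fitness q over a clustering
) -> Tuple[List[List[int]], float]:
    clusters = {cid: list(objs) for cid, objs in clusters.items()}
    best = [list(c) for c in clusters.values()]
    best_q = fitness(best)
    next_id = max(clusters) + 1
    while candidates:
        # pick the merge-candidate pair whose merge yields the highest fitness
        best_pair, q_new = None, float("-inf")
        for pair in candidates:
            ci, cj = tuple(pair)
            merged = clusters[ci] + clusters[cj]
            rest = [c for cid, c in clusters.items() if cid not in pair]
            q = fitness(rest + [merged])
            if q > q_new:
                q_new, best_pair = q, pair
        ci, cj = tuple(best_pair)
        clusters[next_id] = clusters.pop(ci) + clusters.pop(cj)
        # the merged cluster C' inherits the merge candidates of Ci and Cj
        candidates = {
            (p - best_pair) | {next_id} if p & best_pair else p
            for p in candidates if p != best_pair
        }
        if q_new > best_q:                        # remember the best clustering seen so far
            best_q, best = q_new, [list(c) for c in clusters.values()]
        next_id += 1
    return best, best_q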
15
Complexity of MOSAIC
  • Let
  • n be the number of objects in the dataset
  • k be the number of clusters returned by the
    representative-based algorithm
  • Complexity of MOSAIC: O(k³ + k²·O(q(X)))
  • Remarks:
  • The above formula assumes that fitness is
    computed from scratch whenever a new clustering
    is obtained
  • Lower complexities can be obtained by
    incrementally reusing the results of previous
    fitness computations (see the sketch below)
  • Our current implementation assumes that only
    additive fitness functions are used
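
A sketch of the incremental evaluation this remark alludes to,
under the assumption of an additive fitness as above: when Ci and
Cj are merged into C', only the per-cluster term of C' has to be
recomputed, since the contributions of all other clusters are
unchanged (helper names are illustrative):

def merge_gain(reward_i, reward_j, merged_cluster, reward):
    # Change in an additive q(X) caused by merging two clusters whose cached
    # per-cluster rewards are reward_i and reward_j; `reward` computes the
    # per-cluster term, e.g. interestingness(c) * len(c)**beta.
    return reward(merged_cluster) - reward_i - reward_j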

16
4. Experimental Evaluation for Traditional
Clustering
  • Compared MOSAIC with DBSCAN and K-means
  • Used silhouette as q(X) when running MOSAIC;
    silhouette considers cohesion and separation
    (measured as the distance to the nearest
    cluster). A sketch of this plug-in follows below.
  • Used the 9-Diamonds, Volcano, Diabetes, Ionosphere,
    and Vehicle datasets in the experimental
    evaluation
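
A sketch of how silhouette could be wired in as the plug-in q(X),
here via scikit-learn's silhouette_score; the talk does not specify
the implementation, and the wrapper is illustrative:

import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_fitness(points: np.ndarray):
    # Returns a q(X) callable that scores a clustering, given as a list of
    # index lists covering all points, by its average silhouette coefficient.
    def q(clusters):
        labels = np.empty(len(points), dtype=int)
        for label, members in enumerate(clusters):
            labels[list(members)] = label
        return silhouette_score(points, labels)   # requires at least two clusters
    return q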

17
Experimental Results
  • Finding a good parameter setting for DBSCAN turned
    out to be problematic for the 9-Diamonds and
    Volcano spatial datasets.
  • Neither DBSCAN nor MOSAIC was able to identify
    all chain-like patterns in the Volcano
    dataset.
  • We compared MOSAIC and K-means on the
    Ionosphere, Diabetes, and Vehicle
    high-dimensional datasets. Cluster quality was
    measured using silhouette. MOSAIC outperformed
    K-means on these datasets.

18
Volcano Dataset Result: MOSAIC
19
Volcano Dataset Result: DBSCAN
20
Open Issues: What is a Good Fitness Function for
Traditional Clustering?
  • The use of plug-in fitness functions within
    traditional clustering algorithms is not very
    common.
  • Using existing cluster evaluation measures, such
    as cohesion, separation, and silhouette, as
    fitness functions does not lead to very good
    clusterings when confronted with arbitrary-shape
    clusters [Choo07].
  • Question: Can we find better cluster evaluation
    measures, or is finding good evaluation measures
    for traditional clustering a hopeless project?

21
5. Related Work
  • CURE integrates a partitioning algorithm with an
    agglomerative hierarchical algorithm [GRS98].
  • CHAMELEON [KHK99] is a sophisticated two-phase
    clustering algorithm that combines a multilevel
    graph partitioning algorithm with agglomerative
    clustering on a k-nearest-neighbour sparse graph.

22
Related Work Continued
  • Lin [LC02] and Zhong [ZG03] propose hybrid
    clustering algorithms that combine
    representative-based clustering and agglomerative
    clustering methods.
  • Surdeanu et al. [STA05] propose a hybrid clustering
    approach that combines an agglomerative clustering
    algorithm with the Expectation-Maximization (EM)
    algorithm.

23
6. Conclusion
  • A new clustering algorithm was introduced that
    approximates arbitrary shape clusters through
    unions of convex polygons
  • The algorithm performs a wider search by
    considering all neighboring clusters as merge
    candidates. Gabriel graphs are used to determine
    neighboring clusters
  • The algorithm is generic in that it can be used
    with any initial merge-candidate relation, any
    fitness function, and any representative-based
    clustering algorithm
  • MOSAIC can also be seen as a generalization of
    agglomerative grid-based clustering algorithms.
  • We mainly use MOSAIC in the region discovery
    project mentioned earlier.

24
Future Work: Learn a Fitness Function Based on
Feedback
  • Idea: employ machine learning techniques to
    learn a fitness function from the feedback of
    a domain expert.
  • Pros:
  • It provides a more adaptive approach that tailors
    the fitness function to the domain expert's
    requirements.
  • The process of finding an appropriate fitness
    function is automatic.
  • Cons:
  • Feature selection is non-trivial
  • Learning the fitness function is a difficult
    machine learning task