1
ABOUT ME
  • BHARATH RENGARAJAN
  • PURSUING MY MASTERS IN COMPUTER SCIENCE
  • FALL 2008

2
CONTENTS
  • Problems in the traditional clustering method
  • CURE clustering
  • Summary
  • Drawbacks

3
PROBLEMS IN THE TRADITIONAL CLUSTERING METHOD
4
PARTITIONAL CLUSTERING ALGORITHM
  • Attempts to find k partitions of the data that
    minimize a chosen criterion function.
  • The square-error criterion is the most commonly
    used criterion function (see the formula below).
  • Works well for compact, well-separated clusters.
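For reference, the square-error criterion mentioned above is the usual
sum-of-squared-error objective (standard definition, not transcribed from
the slides), where m_i denotes the mean of cluster C_i:

  E = sum_{i=1..k} sum_{x in C_i} || x - m_i ||^2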

5
PARTITIONAL CLUSTERING ALGORITHM
  • Errors can occur when the square error is reduced
    by splitting some large cluster in order to favor
    some other group.

6
HIERARCHICAL CLUSTERING ALGORITHMS
  • This category of clustering methods merges
    sequences of disjoint clusters into the target k
    clusters based on the minimum distance between two
    clusters.
  • The distance between clusters can be measured as
    (see the formulas below)
  • Distance between means (d_mean)
  • Distance between the two nearest points, one from
    each cluster (d_min)
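In symbols, writing m_i for the mean (centroid) of cluster C_i, these two
measures are commonly defined as (standard definitions, not transcribed
from the slides):

  d_mean(C_i, C_j) = || m_i - m_j ||
  d_min(C_i, C_j)  = min { || p - q || : p in C_i, q in C_j }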

7
HIERARCHICAL CLUSTERING ALGORITHMS
  • Result of d_mean

8
HIERARCHICAL CLUSTERING ALGORITHMS
  • Result of d_min

9
PROBLEMS
  • Traditional clustering mainly favors spherical
    cluster shapes.
  • Data within a cluster must be packed compactly
    together.
  • Each cluster must be separated far enough from the
    others.
  • Outliers will greatly disturb the clustering
    result.

10
CURE CLUSTERING
11
CURE CLUSTERING
  • It is similar to the hierarchical clustering
    approach, but it uses a set of sample points as
    the cluster representative rather than every point
    in the cluster.
  • First set a target sample number c, then try to
    select c well-scattered sample points from the
    cluster.
  • The chosen scattered points are shrunk toward the
    centroid by a fraction α, where 0 < α < 1.
  • These points are used as the representatives of
    the clusters and serve as the points in the d_min
    cluster-merging approach (see the sketch below).
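A minimal Python sketch of the selection-and-shrinking step described
above; the function name select_representatives and the farthest-point
heuristic for picking the well-scattered points are illustrative
assumptions, not code from the presentation.

  import numpy as np

  def select_representatives(points, c, alpha):
      """Pick up to c well-scattered points from one cluster and shrink
      them toward the cluster centroid by the fraction alpha (0 < alpha < 1)."""
      centroid = points.mean(axis=0)
      # Start from the point farthest from the centroid, then repeatedly
      # add the point farthest from the representatives chosen so far.
      reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
      while len(reps) < min(c, len(points)):
          dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
          reps.append(points[np.argmax(dists)])
      reps = np.array(reps)
      # Shrink each representative toward the centroid by the fraction alpha.
      return reps + alpha * (centroid - reps)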

12
CURE CLUSTERING
  • After each merge, c sample points are selected
    from the original representatives of the previous
    clusters to represent the new cluster.
  • Cluster merging stops once the target of k
    clusters is reached.

13
PSEUDO FUNCTION OF CURE
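The pseudo-code itself appears only as an image on this slide. As a
stand-in, here is a hedged Python sketch of the agglomerative loop the
previous slides describe, reusing the select_representatives helper
sketched above; the brute-force closest-pair search is for illustration
only (the actual algorithm uses a heap and a k-d tree, see slide 14).

  def cure_cluster(points, k, c, alpha):
      """Sketch of the CURE merge loop: start with one cluster per point and
      repeatedly merge the two clusters whose shrunken representatives are
      closest (d_min), until only k clusters remain."""
      clusters = [[p] for p in points]
      reps = [select_representatives(np.array(cl), c, alpha) for cl in clusters]
      while len(clusters) > k:
          # Brute-force search for the pair of clusters with the smallest
          # distance between any two of their representative points.
          pairs = ((i, j) for i in range(len(clusters))
                   for j in range(i + 1, len(clusters)))
          i, j = min(pairs, key=lambda ij: np.min(np.linalg.norm(
              reps[ij[0]][:, None, :] - reps[ij[1]][None, :, :], axis=2)))
          merged = clusters[i] + clusters[j]
          # Re-select and re-shrink c representatives for the merged cluster.
          new_reps = select_representatives(np.array(merged), c, alpha)
          for idx in sorted((i, j), reverse=True):
              del clusters[idx], reps[idx]
          clusters.append(merged)
          reps.append(new_reps)
      return clusters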
14
EFFICIENCY OF CURE
  • The worst-case time complexity is O(n² log n).
  • The space complexity is O(n), due to the use of a
    k-d tree and a heap.

15
RANDOM SAMPLING
  • When dealing with a large database, we cannot
    store every data point in memory.
  • Handling the merging of data in a large database
    takes a very long time.
  • We use random sampling to reduce both the time
    complexity and the memory usage.
  • With random sampling there is a trade-off between
    accuracy and efficiency.

16
OUTLIER ELIMINATION
  • We can introduce outlier elimination by two
    methods.
  • Random sampling: with random sampling, most of the
    outlier points are filtered out.
  • Outlier elimination: because outliers do not form
    compact groups, they grow in size very slowly
    during the cluster-merging stage. We therefore run
    an elimination procedure during the merging stage
    so that clusters with only 1-2 data points are
    removed from the cluster list (see the sketch
    below).
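A minimal sketch of such an elimination pass; the helper name and the
min_size threshold are illustrative assumptions (the presentation only
says that clusters of 1-2 points are dropped).

  def eliminate_outliers(clusters, reps, min_size=3):
      """Drop clusters that are still tiny part-way through the merge stage;
      such slow-growing clusters are treated as outliers."""
      kept = [i for i, cl in enumerate(clusters) if len(cl) >= min_size]
      return [clusters[i] for i in kept], [reps[i] for i in kept]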

17
DATA LABELING
  • Because random sampling is used, we need to label
    every remaining data point back to the proper
    cluster group.
  • Each data point is assigned to the cluster group
    containing the representative point nearest to
    that data point (see the sketch below).
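A minimal sketch of this labeling pass; the helper name label_points is
an illustrative assumption, and cluster_reps is assumed to hold the
shrunken representatives of each final cluster.

  def label_points(points, cluster_reps):
      """Assign each data point to the cluster whose nearest representative
      point is closest to it."""
      labels = []
      for x in points:
          # Distance from x to the closest representative of every cluster.
          d = [np.min(np.linalg.norm(reps - x, axis=1)) for reps in cluster_reps]
          labels.append(int(np.argmin(d)))
      return labels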

18
OVERVIEW OF CURE
19
SENSITIVITY TO SHRINK PARAMETER (α)
20
SENSITIVITY TO NO. OF REPRESENTATIVE POINTS (c)
21
SENSITIVITY TO THE NO. OF PARTITIONS
22
SUMMARY
23
SUMMARY
  • CURE can effectively detect the proper shape of a
    cluster with the help of scattered representative
    points and centroid shrinking.
  • CURE can reduce computation time with random
    sampling.
  • CURE can effectively remove outliers.
  • The quality and effectiveness of CURE can be tuned
    by varying s (sample size), p (number of
    partitions), c (number of representative points),
    and α (shrink factor) to adapt to different input
    data sets.

24
DRAWBACKS
25
DRAWBACKS
  • The clusters shown have somewhat standard shapes.
  • Too many parameters are involved.
