Clustering Analysis

1
  • Clustering Analysis
  • Presented by Ching-Pin Kao

2
Problem Description
  • Cluster analysis is the classification of items
    into a number of groups, or clusters.
  • Clusters: collections of objects whose
    intra-class similarity is high and inter-class
    similarity is low.
  • Tasks of clustering analysis:
    • Similarity measurements
    • Clustering methods
    • Validation techniques

3
Similarity Measurements
  • Types of similarity measurements:
    • Distance measurements
    • Correlation coefficients
    • Association coefficients
    • Probabilistic similarity coefficients

4
Similarity Measurements: Correlation Coefficients
  • The most popular correlation coefficient is the
    Pearson correlation coefficient (1892).
  • Correlation between X = {X1, X2, …, Xn} and
    Y = {Y1, Y2, …, Yn}:

        r = Σi (Xi - μX)(Yi - μY)
            / √( Σi (Xi - μX)² · Σi (Yi - μY)² )

  • where μX and μY are the means of X and Y.

5
Similarity Measurements: Correlation
Coefficients (Cont.)
  • It captures similarity of the shapes of two
    expression profiles, and ignores differences
    between their magnitudes.

[Figure: two expression profiles with the same shape but different magnitudes; r = 1.0]
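
As a minimal illustration of this property in Python (the profile values are made-up, not from the slides): a profile obtained by linearly rescaling and shifting another still gives r = 1.0.

    import math

    def pearson(x, y):
        # Pearson correlation coefficient r between two profiles
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        num = sum((a - mx) * (b - my) for a, b in zip(x, y))
        den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                        sum((b - my) ** 2 for b in y))
        return num / den

    a = [1.0, 3.0, 2.0, 5.0, 4.0]
    b = [2 * v + 8 for v in a]      # same shape, different magnitude
    print(pearson(a, b))            # 1.0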
6
Concept: Correlation Coefficients
7
Validation Techniques
  • Types of validation techniques:
    • External indices
      • based on some gold standard
      • e.g. matching coefficient, Jaccard coefficient
    • Internal indices
      • based on some statistics of the results
      • e.g. Hubert's Γ statistic

8
Validation Techniques: Hubert's Γ Statistic
  • X = [X(i, j)] and Y = [Y(i, j)] are two n × n matrices
  • X(i, j): similarity of object i and object j
  • Hubert's Γ statistic represents the point serial
    correlation:

        Γ = (1/M) Σ(i<j) X(i, j) · Y(i, j)

  • where M = n (n - 1) / 2 is the number of entries
    in the double sum
  • A higher value of Γ indicates better clustering
    quality.
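
A direct transcription of the statistic in Python (assuming, for example, that X holds pairwise similarities and Y is a 0/1 matrix indicating shared cluster membership):

    def hubert_gamma(X, Y):
        # Gamma = (1/M) * sum over i < j of X(i, j) * Y(i, j)
        n = len(X)
        M = n * (n - 1) // 2
        return sum(X[i][j] * Y[i][j]
                   for i in range(n - 1)
                   for j in range(i + 1, n)) / M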

9
Concept: Hubert's Γ Statistic
10
Validation Techniques: External Indices
  • Given two binary matrices A and B of the same
    dimensions, count the entry pairs:

              B = 1   B = 0
      A = 1     a       b
      A = 0     c       d

  • Matching coefficient = (a + d) / (a + b + c + d)
  • Jaccard coefficient = a / (a + b + c)
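
Both indices in a few lines of Python (A and B as 0/1 lists of lists, e.g. co-membership matrices for a clustering and a gold standard):

    def matching_and_jaccard(A, B):
        a = b = c = d = 0
        for row_a, row_b in zip(A, B):
            for x, y in zip(row_a, row_b):
                if x and y:   a += 1   # 1 in both
                elif x:       b += 1   # 1 only in A
                elif y:       c += 1   # 1 only in B
                else:         d += 1   # 0 in both
        return (a + d) / (a + b + c + d), a / (a + b + c)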
11
Types of Clustering Methods
  • Partitioning
    • K-Means, K-Medoids, PAM, CLARA, CLARANS, CAST, …
  • Hierarchical
    • HAC, BIRCH, CURE, ROCK, CHAMELEON, …
  • Density-based
    • DBSCAN, OPTICS, CLIQUE, WaveCluster, …
  • Grid-based
    • STING, CLIQUE, WaveCluster, …
  • Model-based
    • COBWEB, SOM, CLASSIT, AutoClass, …
  • Outlier analysis
    • OLAP, …

12
Hierarchical
  • Distinction between agglomerative (bottom-up) and
    divisive (top-down) techniques
13
(No Transcript)
14
HAC (Hierarchical Agglomerative Clustering)
  • Clusters are merged step by step according to a
    proximity matrix.
  • Merging criteria: single link and complete link.
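
A brief sketch using SciPy's hierarchical clustering routines (toy random data; method='single' and method='complete' select the two link criteria named above):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 2)                        # 20 toy points
    Z = linkage(X, method='complete')                # or method='single'
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
    print(labels)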
15
Partitioning
A partition with n = 20 and k = 3
16
k-Means Clustering
  • 1. Select an initial partition with K clusters.
    Repeat steps 2-5 until the cluster membership
    stabilizes.
  • 2. Generate a new partition by assigning each
    pattern to its closest cluster centre.
  • 3. Compute new cluster centres as the centroids
    of the clusters.
  • 4. Repeat steps 2 and 3 until an optimum value of
    the criterion function is found.
  • 5. Adjust the number of clusters by merging and
    splitting existing clusters, or by removing small
    clusters or outliers (a minimal code sketch of
    steps 2-3 follows the list).
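
A minimal sketch of steps 2-3 in Python (2-D points and Euclidean distance assumed; the merge/split adjustments of step 5 are omitted):

    import random

    def kmeans(points, k, iters=100):
        centres = random.sample(points, k)             # step 1: initial centres
        clusters = []
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:                           # step 2: assign to closest centre
                i = min(range(k),
                        key=lambda c: (p[0] - centres[c][0]) ** 2 +
                                      (p[1] - centres[c][1]) ** 2)
                clusters[i].append(p)
            new_centres = [                            # step 3: centroids of clusters
                (sum(p[0] for p in cl) / len(cl),
                 sum(p[1] for p in cl) / len(cl)) if cl else centres[i]
                for i, cl in enumerate(clusters)]
            if new_centres == centres:                 # membership stabilized
                break
            centres = new_centres
        return centres, clusters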

17
k-Means Clustering (Cont.)
18
k-Medoid Methods
  • Medoid: the optimal representative object for each
    cluster (the most centrally located object within
    the cluster)
  • k-medoid methods: methods that partition the data
    around medoids.
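
The phrase "most centrally located" can be made concrete in one line of Python (dist is any distance function; both names here are hypothetical):

    def medoid(cluster, dist):
        # Object with the minimal total distance to all other objects in the cluster
        return min(cluster, key=lambda o: sum(dist(o, p) for p in cluster))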

19
PAM (Partitioning Around Medoids)
  • Based on the k-medoid model
  • Oi: a selected object (current medoid)
  • Oh: a nonselected object
  • Swap: Oi is replaced by Oh as a medoid
  • Consider another nonselected object Oj and
    calculate its cost (contribution) Cjih to the swap

20
PAM (Cont.)
  • a. Oj belongs to a cluster other than the one
    represented by Oi. Let Ol be the representative
    object of that cluster.
    • a1. d(Oj, Oh) ≥ d(Oj, Ol)
      After swapping, Oj would still belong to the
      cluster represented by Ol.
      Cjih = 0
    • a2. d(Oj, Oh) < d(Oj, Ol)
      After swapping, Oj would belong to the
      cluster represented by Oh.
      Cjih = d(Oj, Oh) - d(Oj, Ol) < 0

21
PAM (Cont.)
  • b. Oj belongs to the cluster represented by Oi.
    Let Oj,2 be the second most similar medoid to Oj.
    • b1. d(Oj, Oh) ≥ d(Oj, Oj,2)
      After swapping, Oj would belong to the
      cluster represented by Oj,2.
      Cjih = d(Oj, Oj,2) - d(Oj, Oi) ≥ 0
    • b2. d(Oj, Oh) < d(Oj, Oj,2)
      After swapping, Oj would belong to the
      cluster represented by Oh.
      Cjih = d(Oj, Oh) - d(Oj, Oi)

22
PAM (Cont.)
  • The total cost of replacing Oi with Oh:

        TCih = Σj Cjih

[Figure: cases a and b, showing Oj reassigned to Ol, Oj,2, or Oh after Oi is replaced by Oh]
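
A sketch of the per-object cost and the total swap cost in Python, covering the four cases above in a single min-based formulation (dist and the object representation are assumptions; assumes k ≥ 2):

    def swap_cost(Oj, Oi, Oh, medoids, dist):
        # Contribution Cjih of nonselected object Oj to swapping medoid Oi for Oh
        others = [m for m in medoids if m != Oi]       # remaining medoids (Ol or Oj,2)
        nearest = min(others, key=lambda m: dist(Oj, m))
        d_now = min(dist(Oj, Oi), dist(Oj, nearest))   # current assignment cost
        d_after = min(dist(Oj, Oh), dist(Oj, nearest)) # cost after the swap
        return d_after - d_now

    def total_swap_cost(Oi, Oh, objects, medoids, dist):
        # TCih = sum of Cjih over all nonselected objects Oj
        return sum(swap_cost(Oj, Oi, Oh, medoids, dist)
                   for Oj in objects if Oj not in medoids and Oj != Oh)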
23
Algorithm PAM
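
The algorithm on this slide is an image and was not transcribed; a standard PAM loop built on the cost TCih defined above might look roughly like this (reusing the total_swap_cost sketch; initialization simplified):

    def pam(objects, k, dist):
        # Greedily perform the most cost-reducing swap until none remains
        medoids = objects[:k]                          # naive initial medoids
        while True:
            best = None
            for Oi in medoids:
                for Oh in objects:
                    if Oh in medoids:
                        continue
                    tc = total_swap_cost(Oi, Oh, objects, medoids, dist)
                    if best is None or tc < best[0]:
                        best = (tc, Oi, Oh)
            if best is None or best[0] >= 0:           # no swap with negative total cost
                return medoids
            _, Oi, Oh = best
            medoids = [Oh if m == Oi else m for m in medoids]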
24
CLARA (Clustering LARge Applications)
  • CLARA draws a sample of the data set, applies PAM
    on the sample, and finds the medoids of the sample.
  • CLARA draws multiple samples and gives the best
    clustering as the output.
  • Experiments indicate that 5 samples of size 40 + 2k
    give satisfactory results.

25
Algorithm CLARA
  • 1. For i = 1 to 5, repeat the following steps:
  • 2. Draw a sample of 40 + 2k objects randomly from
    the entire data set (1), and call Algorithm PAM to
    find k medoids of the sample.
  • (1) Apart from the first sample, subsequent samples
    include the best set of medoids found so far.

26
Algorithm CLARA (Cont.)
  • 3. For each object Oj in the entire data set,
    determine which of the k medoids is the most
    similar to Oj.
  • 4. Calculate the average distance of the
    clustering obtained in the previous step. If this
    value is less than the current minimum, use this
    value as the current minimum, and retain the k
    medoids found in Step (2) as the best set of
    medoids obtained so far.
  • 5. Return to Step (1) to start the next iteration.
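
A compact sketch of the CLARA loop in Python, reusing the pam() sketch above (dist is assumed; objects must be hashable, e.g. tuples; sampling details simplified):

    import random

    def clara(objects, k, dist, n_samples=5):
        best_medoids, best_avg = None, float('inf')
        for _ in range(n_samples):
            sample = random.sample(objects, min(40 + 2 * k, len(objects)))
            if best_medoids:                   # later samples include the best medoids so far
                sample = list(set(sample) | set(best_medoids))
            medoids = pam(sample, k, dist)
            # Steps 3-4: assign every object to its most similar medoid, average the distances
            avg = sum(min(dist(o, m) for m in medoids) for o in objects) / len(objects)
            if avg < best_avg:                 # retain the best set of medoids so far
                best_medoids, best_avg = medoids, avg
        return best_medoids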

27
Density-based: DBSCAN
  • Eps: the radius of the neighborhood NEps(q) of an
    object q
  • MinPts: the cardinality of the neighborhood has to
    exceed this threshold for a core object
  • directly density-reachable: p is directly
    density-reachable from q iff
    • p ∈ NEps(q)
    • Card(NEps(q)) ≥ MinPts
  • density-reachable (p >D q): there is a chain of
    objects from q to p, each directly
    density-reachable from its predecessor

[Figure: p directly density-reachable from q]
28
DBSCAN (Cont.)
  • density-connected: p and q are density-connected
    iff there is an object o with
    • p >D o
    • q >D o
  • cluster C:
    • Maximality: ∀ p, q ∈ D: if p ∈ C and q >D p,
      then q ∈ C
    • Connectivity: ∀ p, q ∈ C: p is density-connected
      to q in C
  • noise: {p ∈ D | ∀ i: p ∉ Ci}

29
DBSCAN (Cont.)
  • two different kinds of objects in a clustering:
    • core objects
    • non-core objects:
      • border objects
      • noise objects

30
DBSCAN Algorithm
    Algorithm DBSCAN(D, Eps, MinPts)
        // Precondition: All objects in D are unclassified.
        FORALL objects o in D DO
            IF o is unclassified THEN
                call function expand_cluster to construct
                a cluster wrt. Eps and MinPts containing o

31
DBSCAN Algorithm (Cont.)
    FUNCTION expand_cluster(o, D, Eps, MinPts)
        retrieve the Eps-neighborhood NEps(o) of o
        IF |NEps(o)| < MinPts THEN      // i.e. o is not a core object
            mark o as noise and RETURN
        ELSE                            // i.e. o is a core object
            select a new cluster-id and mark all objects
                in NEps(o) with this cluster-id
            push all objects from NEps(o) \ {o} onto the
                stack seeds
            WHILE NOT seeds.empty() DO
                currentObject := seeds.top()
                retrieve the Eps-neighborhood
                    NEps(currentObject) of currentObject
                IF |NEps(currentObject)| ≥ MinPts THEN
                    select all objects in NEps(currentObject)
                        that are not yet classified or are
                        marked as noise, push the unclassified
                        objects onto seeds, and mark all of
                        these objects with the current cluster-id
                seeds.pop()
            RETURN
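
A runnable Python rendering of the same control flow (Euclidean Eps-neighborhoods assumed; labels: None = unclassified, -1 = noise):

    def dbscan(D, eps, min_pts):
        labels = {i: None for i in range(len(D))}
        def neighborhood(i):
            return [j for j in range(len(D))
                    if sum((a - b) ** 2 for a, b in zip(D[i], D[j])) <= eps ** 2]
        cluster_id = 0
        for i in range(len(D)):
            if labels[i] is not None:           # already classified or noise
                continue
            neigh = neighborhood(i)
            if len(neigh) < min_pts:            # i is not a core object
                labels[i] = -1
                continue
            cluster_id += 1
            for j in neigh:                     # mark the whole neighborhood
                labels[j] = cluster_id
            seeds = [j for j in neigh if j != i]
            while seeds:
                cur = seeds.pop()
                cur_neigh = neighborhood(cur)
                if len(cur_neigh) >= min_pts:   # cur is also a core object
                    for j in cur_neigh:
                        if labels[j] is None:
                            seeds.append(j)
                            labels[j] = cluster_id
                        elif labels[j] == -1:   # noise reachable from a core: border object
                            labels[j] = cluster_id
        return labels

    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
    print(dbscan(points, eps=2.0, min_pts=2))   # two clusters plus one noise point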

32
CAST
  • The clusters are constructed one at a time.
  • The currently constructed cluster is denoted by
    Copen.
  • Affinity of element x: a(x) = Σ(y ∈ Copen) S(x, y)
    • high affinity: a(x) ≥ t · |Copen|
    • low affinity: a(x) < t · |Copen|
  • CAST alternates between adding high-affinity
    elements to Copen and removing low-affinity
    elements from it.

33
(No Transcript)
34
CAST-Example
  • 1. C = Ø, U = {1, 2, 3, …, 10}
  • 2. Copen = Ø, a(·) = 0
  • 3. ADD
    • 3.1 Arbitrarily choose an element, say 1 ∈ U:
      Copen = {1}, U = {2, 3, …, 10}
      a(1) += S(1, 1), …, a(10) += S(10, 1)
    • 3.2 If element 3 ∈ U has the maximal (high)
      affinity:
      Copen = {1, 3}, U = {2, 4, …, 10}
      a(1) += S(1, 3), …, a(10) += S(10, 3)
    • 3.3 Repeat ADD until all elements of U have low
      affinity

35
CAST-Example (Cont.)
  • 4. Suppose Copen = {1, 2, 3, 7, 10} and
    U = {4, 5, 6, 8, 9} after the ADD procedure.
  • 5. REMOVE
    • 5.1 If element 2 ∈ Copen has the minimal (low)
      affinity:
      Copen = {1, 3, 7, 10}, U = {2, 4, 5, 6, 8, 9}
      a(1) -= S(1, 2), …, a(10) -= S(10, 2)
    • 5.2 Repeat REMOVE until all elements of Copen
      have high affinity
  • 6. C = C ∪ {Copen}
  • 7. Repeat from step 2 until U = Ø
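
A sketch of the CAST loop walked through above (S is a precomputed n × n similarity matrix and t the affinity threshold; tie-breaking and convergence safeguards are omitted):

    def cast(S, t):
        n = len(S)
        U = set(range(n))
        clusters = []
        while U:
            c_open, a = set(), [0.0] * n
            changed = True
            while changed:
                changed = False
                # ADD: element of U with maximal affinity, if it is high-affinity
                if U:
                    x = max(U, key=lambda i: a[i])
                    if not c_open or a[x] >= t * len(c_open):
                        U.remove(x); c_open.add(x)
                        for i in range(n): a[i] += S[i][x]
                        changed = True
                # REMOVE: element of c_open with minimal affinity, if it is low-affinity
                if c_open:
                    y = min(c_open, key=lambda i: a[i])
                    if a[y] < t * len(c_open):
                        c_open.remove(y); U.add(y)
                        for i in range(n): a[i] -= S[i][y]
                        changed = True
            clusters.append(c_open)           # close the current cluster
        return clusters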