Clustering - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Clustering
  • 2002. 6. 15

2
Content
  • What is Clustering
  • Clustering Methods
  • Hierarchical Clustering
  • Single-link, complete-link, group-average ...
  • Non-Hierarchical Clustering
  • K-means
  • EM algorithm

3
What is Clustering (1)
  • Definition
  • Partition a set of objects into groups or
    clusters
  • Main uses of clustering in Statistical NLP
  • EDA (exploratory data analysis)
  • pictorial visualization
  • Generalization (forming bins or equivalence
    classes)
  • Monday, Tuesday, ..., Sunday (as objects of the
    preposition on)
  • Even when there is no entry for Friday, it can be
    inferred from its cluster (learning)
  • Sorts of clustering
  • Hierarchical vs. flat (non-hierarchical)
  • hard (1:1) vs. soft (1:n, degrees of membership)

4
What is Clustering (2)
5
Hierarchical Clustering (1)
  • The tree of hierarchical clustering can be
    produced in two ways
  • Bottom-up (agglomerative clustering)
  • start with the individual objects and group the
    most similar ones
  • join the pair of clusters with maximum similarity
  • Top-down (divisive clustering)
  • start with all the objects and divide them into
    groups so as to maximize within-group similarity
  • split the least coherent part of a cluster

6
Hierarchical Clustering (2)
7
Three methods in hierarchical clustering
  • Single-link
  • Similarity of the two most similar members
  • Complete-link
  • Similarity of the two least similar members
  • Group-average
  • Average similarity between members
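
A minimal sketch of the three criteria, assuming SciPy is available; the
library choice and the toy two-blob data are illustrative additions, not
part of the slides:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    # Two loose blobs of 2-D points, purely illustrative
    points = np.vstack([rng.normal(0.0, 0.5, (10, 2)),
                        rng.normal(3.0, 0.5, (10, 2))])

    for method in ("single", "complete", "average"):
        Z = linkage(points, method=method)               # bottom-up merge tree
        labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
        print(method, labels)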

8
Single-link Clustering
  • Similarity of the two most similar members → O(n²)
  • Locally coherent
  • close objects are in the same cluster
  • Chaining effect
  • Because a chain of large similarities is followed
    without taking the global context into account →
    low global cluster quality

9
Complete-link Clustering
  • Similarity of the two least similar members → O(n³)
  • The criterion focuses on global cluster quality
  • avoids elongated clusters
  • a/f or b/e is tighter than a/d (tighter clusters
    are better than straggly clusters)

10
Group-average agglomerative clustering
  • Average similarity between members
  • The complexity of computing average similarity is
    O(n²) (see the note below)
  • Average similarities are recomputed each time a
    new group is formed
  • A compromise between single-link and complete-link
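
A note on why O(n²) is attainable, assuming length-normalized vectors and
cosine similarity (a standard trick, not spelled out on the slide): keep the
sum vector \vec{s}(c) = \sum_{\vec{x} \in c} \vec{x} for each cluster c, so
the average pairwise similarity is available in constant time:

    A(c) = \frac{\|\vec{s}(c)\|^2 - |c|}{|c|\,(|c| - 1)}

Sum vectors simply add when two clusters merge, so no pairwise
recomputation is needed.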

11
Comparison
  • Single-link
  • Relatively efficient
  • Long, straggly clusters
  • Ellipsoidal clusters
  • Loosely bound clusters
  • Complete-link
  • Tightly bound clusters
  • Group-average
  • Intermediate between single-link and complete-link

12
Language Model
  • Improving a language model
  • By way of generalization
  • For rare events (which do not have enough
    training data)
  • More accurate prediction for rare events
  • Machine Translation
  • S: source language
  • T: target language
  • I went to school on last Sunday.
  • I went to school on last Saturday.
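
One worked form of this generalization (the class-based factorization in the
style of Brown et al.; the notation c(w) for the cluster of word w is an
addition, not from the slides): if "on last Saturday" is rare in training
data, its probability can be estimated through the weekday cluster,

    P(w \mid \text{on}) \approx P(c(w) \mid \text{on}) \cdot P(w \mid c(w))

so the evidence for Sunday, Monday, ... in the same cluster supports a
nonzero estimate for Saturday.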

13
Top-down
  • Splitting clusters down into single objects, by a
    measure of cohesion (a sketch follows the steps)
  1. Determine all inter-object cohesions in the
     cluster (initially only one cluster exists)
  2. Split the least coherent cluster into two
     clusters using cohesion
  3. Recalculate cohesion for all clusters
  4. Return to step 2 until every cluster has one
     object
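
A runnable sketch of these steps, with illustrative choices the slide does
not fix: cohesion is taken as average negative Euclidean distance, a cluster
is split around its two least similar members, and data points are assumed
distinct:

    import numpy as np

    def cohesion(X, members):
        # Average pairwise similarity inside one cluster,
        # measured here as negative Euclidean distance.
        if len(members) < 2:
            return float("inf")
        pts = X[members]
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        return -d.sum() / (len(members) * (len(members) - 1))

    def divisive(X):
        # Step 1: start with a single cluster holding every object.
        clusters = [list(range(len(X)))]
        while any(len(c) > 1 for c in clusters):
            # Step 2: split the least coherent splittable cluster,
            # seeding the halves with its two least similar members.
            worst = min((c for c in clusters if len(c) > 1),
                        key=lambda c: cohesion(X, c))
            pts = X[worst]
            d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
            i, j = np.unravel_index(d.argmax(), d.shape)
            left = [worst[m] for m in range(len(worst)) if d[m, i] < d[m, j]]
            right = [worst[m] for m in range(len(worst)) if d[m, i] >= d[m, j]]
            clusters.remove(worst)
            clusters += [left, right]
            # Steps 3-4: cohesions are recomputed on the next pass, until
            # every cluster contains exactly one object.
        return clusters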

14
Non-Hierarchical Clustering
  • Start with randomly selected seeds (one seed per
    cluster)
  • Iterative reallocation
  • Most non-hierarchical algorithms employ multiple
    passes of reallocating objects (in order to
    improve a global measure such as mutual
    information)
  • Hierarchical algorithms → need only one pass
  • Stopping criteria
  • Maximum likelihood (goodness of cluster quality)
  • When the curve of improvement flattens
  • When goodness starts decreasing
  • Number of clusters, cluster size, and so on →
    set heuristically
  • Finding the optimal solution is difficult

15
K-means (hard clustering algorithm)
  • Defines clusters by the center of mass of their
    members (a sketch follows the steps)
  1. Initial cluster centers are randomly selected
  2. Assign objects to clusters by the distance
     between center and object
  3. Re-compute the center of each cluster
  4. Return to step 2 until the stopping criterion is
     satisfied
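
A minimal NumPy sketch of these steps; the function name, the convergence
test, and the tie-breaking behavior are illustrative assumptions rather than
the slide's prescription:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: pick k random objects as the initial centers.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 2: assign each object to the nearest center.
            # argmin breaks distance ties by taking the first center,
            # one deterministic answer to the tie issue on the next slide.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: re-compute each center as the mean of its members;
            # an empty cluster keeps its old center.
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            # Step 4: stop when the centers no longer move.
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers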

16
K-means
  • Hard clustering
  • Issues in K-means
  • How to break ties when an object is equally close
    to several centers
  • Assign the object randomly to one of the candidate
    clusters (this can cause the algorithm not to
    converge)
  • Perturb objects slightly so that their new
    positions do not give rise to ties

17
EM algorithm (1)
  • A soft version of K-means clustering
  • both cluster centers move towards the centroid of
    all three objects
  • and reach a stable final state

18
EM algorithm (2)
  • We want to calculate the probability P(c_j | \vec{x}_i)
  • Assume that each cluster c_j has a normal
    distribution
  • Maximize a likelihood of the form shown below
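
The formula itself is an image in the original deck and missing from this
transcript; assuming the standard mixture-of-Gaussians form (with mixing
weights \pi_j, an addition here) for m-dimensional data, it would read

    P(\vec{x}_i \mid c_j) = \frac{1}{(2\pi)^{m/2} |\Sigma_j|^{1/2}}
        \exp\!\Big( -\tfrac{1}{2} (\vec{x}_i - \vec{\mu}_j)^\top
                    \Sigma_j^{-1} (\vec{x}_i - \vec{\mu}_j) \Big)

with the likelihood to maximize

    L = \prod_{i=1}^{n} \sum_{j=1}^{k} \pi_j \, P(\vec{x}_i \mid c_j)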

19
Procedure of EM
  • Expectation step (E)
  • Compute h_ij, the expectation of z_ij (the
    indicator that object i was generated by cluster j)
  • Maximization step (M)
  • Re-estimate the cluster parameters from the h_ij
    (updates shown below)
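
The update formulas are images in the original deck; the standard EM updates
for a Gaussian mixture, reconstructed here rather than recovered from the
slide, are

    E step:  h_{ij} = \frac{\pi_j \, P(\vec{x}_i \mid c_j)}
                           {\sum_{l=1}^{k} \pi_l \, P(\vec{x}_i \mid c_l)}

    M step:  \vec{\mu}_j = \frac{\sum_i h_{ij} \vec{x}_i}{\sum_i h_{ij}},
             \quad
             \Sigma_j = \frac{\sum_i h_{ij} (\vec{x}_i - \vec{\mu}_j)
                              (\vec{x}_i - \vec{\mu}_j)^\top}{\sum_i h_{ij}},
             \quad
             \pi_j = \frac{1}{n} \sum_i h_{ij}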

20
Advantages and disadvantages of EM
  • Advantages
  • Simple, easy to implement
  • Its memory requirements are reasonable
  • Disadvantages
  • Slow linear convergence
  • No guarantee of finding the global maximum

21
Properties of hierarchical and flat clustering
  • Hierarchical clustering
  • Preferable for detailed data analysis
  • Provides more information than flat clustering
  • No single best algorithm (dependent on the
    application)
  • Less efficient than flat clustering (an N × N
    similarity matrix is required)
  • Flat clustering
  • Preferable if efficiency is a consideration or
    data sets are very large
  • K-means is the conceptually simplest method
  • K-means assumes a simple Euclidean representation
    space, and so cannot be used for many data sets
  • In such cases, the EM algorithm is chosen