Clustering - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Clustering
  • 2002. 6. 15

2
Content
  • What is Clustering
  • Clustering Methods
  • Hierarchical Clustering
  • Single-link, complete-link, group-average ...
  • Non-Hierarchical Clustering
  • K-means
  • EM algorithm

3
What is Clustering (1)
  • Definition
  • Partition a set of objects into groups or
    clusters
  • Main uses of clustering in Statistical NLP
  • EDA (exploratory data analysis)
  • pictorial visualization
  • Generalization (forming bins or equivalence
    classes)
  • Monday, Tuesday, ..., Sunday (as objects of the
    preposition on)
  • Even when there is no entry for Friday, it can be
    inferred from its cluster (learning)
  • Sorts of clustering
  • Hierarchical vs. flat (non-hierarchical)
  • hard (1:1) vs. soft (1:n, degrees of membership)

4
What is Clustering (2)
5
Hierarchical Clustering (1)
  • The tree of hierarchical clustering can be
    produced in two ways
  • Bottom-up (agglomerative clustering)
  • start with the individual objects and group the
    most similar ones
  • join the pair of clusters with maximum similarity
  • Top-down (divisive clustering)
  • start with all the objects and divide them into
    groups so as to maximize within-group similarity
  • split the least coherent part of a cluster

6
Hierarchical Clustering (2)
7
Three methods in hierarchical clustering
  • Single-link
  • Similarity of the two most similar members
  • Complete-link
  • Similarity of the two least similar members
  • Group-average
  • Average similarity between members
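
A minimal sketch of the three criteria, assuming SciPy is available; the
library choice and the toy two-blob data are illustrative additions, not
part of the slides:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    # Two loose blobs of 2-D points, purely illustrative
    points = np.vstack([rng.normal(0.0, 0.5, (10, 2)),
                        rng.normal(3.0, 0.5, (10, 2))])

    for method in ("single", "complete", "average"):
        Z = linkage(points, method=method)               # bottom-up merge tree
        labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
        print(method, labels)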

8
Single-link Clustering
  • Similarity of the two most similar members → O(n²)
  • Locally coherent
  • close objects are in the same cluster
  • Chaining effect
  • Because a chain of large similarities is followed
    without taking the global context into account →
    low global cluster quality

9
Complete-link Clustering
  • Similarity of the two least similar members → O(n³)
  • The criterion focuses on global cluster quality
  • avoids elongated clusters
  • a/f or b/e is tighter than a/d (tighter clusters
    are better than straggly clusters)

10
Group-average agglomerative clustering
  • Average similarity between members
  • The complexity of computing average similarity is
    O(n²) (see the note below)
  • Average similarities are recomputed each time a
    new group is formed
  • A compromise between single-link and complete-link
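
A note on why O(n²) is attainable, assuming length-normalized vectors and
cosine similarity (a standard trick, not spelled out on the slide): keep the
sum vector \vec{s}(c) = \sum_{\vec{x} \in c} \vec{x} for each cluster c, so
the average pairwise similarity is available in constant time:

    A(c) = \frac{\|\vec{s}(c)\|^2 - |c|}{|c|\,(|c| - 1)}

Sum vectors simply add when two clusters merge, so no pairwise
recomputation is needed.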

11
Comparison
  • Single-link
  • Relatively efficient
  • Long, straggly clusters
  • Ellipsoidal clusters
  • Loosely bound clusters
  • Complete-link
  • Tightly bound clusters
  • Group-average
  • Intermediate between single-link and complete-link

12
Language Model
  • Improving a language model
  • By way of generalization
  • For rare events (which do not have enough
    training data)
  • More accurate prediction for rare events
  • Machine Translation
  • S: source language
  • T: target language
  • I went to school on last Sunday.
  • I went to school on last Saturday.
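
One worked form of this generalization (the class-based factorization in the
style of Brown et al.; the notation c(w) for the cluster of word w is an
addition, not from the slides): if "on last Saturday" is rare in training
data, its probability can be estimated through the weekday cluster,

    P(w \mid \text{on}) \approx P(c(w) \mid \text{on}) \cdot P(w \mid c(w))

so the evidence for Sunday, Monday, ... in the same cluster supports a
nonzero estimate for Saturday.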

13
Top-down
  • Splitting clusters down into single objects, by a
    measure of cohesion (a sketch follows the steps)
  1. Determine all inter-object cohesions in the
     cluster (initially only one cluster exists)
  2. Split the least coherent cluster into two
     clusters using cohesion
  3. Recalculate cohesion for all clusters
  4. Return to step 2 until every cluster has one
     object
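
A runnable sketch of these steps, with illustrative choices the slide does
not fix: cohesion is taken as average negative Euclidean distance, a cluster
is split around its two least similar members, and data points are assumed
distinct:

    import numpy as np

    def cohesion(X, members):
        # Average pairwise similarity inside one cluster,
        # measured here as negative Euclidean distance.
        if len(members) < 2:
            return float("inf")
        pts = X[members]
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        return -d.sum() / (len(members) * (len(members) - 1))

    def divisive(X):
        # Step 1: start with a single cluster holding every object.
        clusters = [list(range(len(X)))]
        while any(len(c) > 1 for c in clusters):
            # Step 2: split the least coherent splittable cluster,
            # seeding the halves with its two least similar members.
            worst = min((c for c in clusters if len(c) > 1),
                        key=lambda c: cohesion(X, c))
            pts = X[worst]
            d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
            i, j = np.unravel_index(d.argmax(), d.shape)
            left = [worst[m] for m in range(len(worst)) if d[m, i] < d[m, j]]
            right = [worst[m] for m in range(len(worst)) if d[m, i] >= d[m, j]]
            clusters.remove(worst)
            clusters += [left, right]
            # Steps 3-4: cohesions are recomputed on the next pass, until
            # every cluster contains exactly one object.
        return clusters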

14
Non-Hierarchical Clustering
  • Start with randomly selected seeds (one seed per
    cluster)
  • Iterative reallocation
  • Most non-hierarchical algorithms employ multiple
    passes of reallocating objects (in order to
    improve a global measure such as mutual
    information)
  • Hierarchical algorithms → need only one pass
  • Stopping criteria
  • Maximum likelihood (goodness of cluster quality)
  • When the curve of improvement flattens
  • When goodness starts decreasing
  • Number of clusters, cluster size, and so on →
    set heuristically
  • Finding the optimal solution is difficult

15
K-means (hard clustering algorithm)
  • Defines clusters by the center of mass of their
    members (a sketch follows the steps)
  1. Initial cluster centers are randomly selected
  2. Assign objects to clusters by the distance
     between center and object
  3. Re-compute the center of each cluster
  4. Return to step 2 until the stopping criterion is
     satisfied
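
A minimal NumPy sketch of these steps; the function name, the convergence
test, and the tie-breaking behavior are illustrative assumptions rather than
the slide's prescription:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: pick k random objects as the initial centers.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 2: assign each object to the nearest center.
            # argmin breaks distance ties by taking the first center,
            # one deterministic answer to the tie issue on the next slide.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: re-compute each center as the mean of its members;
            # an empty cluster keeps its old center.
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            # Step 4: stop when the centers no longer move.
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers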

16
K-means
  • Hard clustering
  • Issues in K-means
  • How to break ties when an object is equally close
    to several centers
  • Assign the object randomly to one of the candidate
    clusters (this can cause the algorithm not to
    converge)
  • Perturb objects slightly so that their new
    positions do not give rise to ties

17
EM algorithm (1)
  • A soft version of K-means clustering
  • both cluster centers move towards the centroid of
    all three objects
  • and reach a stable final state

18
EM algorithm (2)
  • We want to calculate the probability P(c_j | \vec{x}_i)
  • Assume that each cluster c_j has a normal
    distribution
  • Maximize a likelihood of the form shown below
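
The formula itself is an image in the original deck and missing from this
transcript; assuming the standard mixture-of-Gaussians form (with mixing
weights \pi_j, an addition here) for m-dimensional data, it would read

    P(\vec{x}_i \mid c_j) = \frac{1}{(2\pi)^{m/2} |\Sigma_j|^{1/2}}
        \exp\!\Big( -\tfrac{1}{2} (\vec{x}_i - \vec{\mu}_j)^\top
                    \Sigma_j^{-1} (\vec{x}_i - \vec{\mu}_j) \Big)

with the likelihood to maximize

    L = \prod_{i=1}^{n} \sum_{j=1}^{k} \pi_j \, P(\vec{x}_i \mid c_j)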

19
Procedure of EM
  • Expectation step (E)
  • Compute h_ij, the expectation of z_ij (the
    indicator that object i was generated by cluster j)
  • Maximization step (M)
  • Re-estimate the cluster parameters from the h_ij
    (updates shown below)
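
The update formulas are images in the original deck; the standard EM updates
for a Gaussian mixture, reconstructed here rather than recovered from the
slide, are

    E step:  h_{ij} = \frac{\pi_j \, P(\vec{x}_i \mid c_j)}
                           {\sum_{l=1}^{k} \pi_l \, P(\vec{x}_i \mid c_l)}

    M step:  \vec{\mu}_j = \frac{\sum_i h_{ij} \vec{x}_i}{\sum_i h_{ij}},
             \quad
             \Sigma_j = \frac{\sum_i h_{ij} (\vec{x}_i - \vec{\mu}_j)
                              (\vec{x}_i - \vec{\mu}_j)^\top}{\sum_i h_{ij}},
             \quad
             \pi_j = \frac{1}{n} \sum_i h_{ij}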

20
Advantages and disadvantages of EM
  • Advantages
  • Simple, easy to implement
  • Its memory requirements are reasonable
  • Disadvantages
  • Slow linear convergence
  • No guarantee of finding the global maximum

21
Properties of hierarchical and flat clustering
  • Hierarchical clustering
  • Preferable for detailed data analysis
  • Provides more information than flat clustering
  • No single best algorithm (dependent on the
    application)
  • Less efficient than flat clustering (an N × N
    similarity matrix is required)
  • Flat clustering
  • Preferable if efficiency is a consideration or
    data sets are very large
  • K-means is the conceptually simplest method
  • K-means assumes a simple Euclidean representation
    space, and so cannot be used for many data sets
  • In such cases, the EM algorithm is chosen