1
Document Clustering
  • Carl Staelin

2
Motivation
  • It is hard to rapidly understand a big bucket of
    documents
  • Humans look for patterns, and are good at pattern
    matching
  • Random collections of documents don't have a
    recognizable structure
  • Clustering documents into recognizable groups
    makes it easier to see patterns
  • Can rapidly eliminate irrelevant clusters

4
Basic Idea
  • Choose a document similarity measure
  • Choose a cluster cost or similarity criterion
  • Group like documents into clusters with minimal
    cluster cost
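
As one possible illustration of the first step, the sketch below computes cosine similarity between bag-of-words term counts. The slides do not name a particular document similarity measure, so cosine similarity, the naive whitespace tokenizer, and the function names `term_counts` and `cosine_similarity` are all assumptions made for this example.

```python
import math
from collections import Counter

def term_counts(text):
    """Bag of words from a naive lowercase, whitespace-split tokenizer (assumed)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Counter returns 0 for terms missing from b, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

docs = [
    "the cat sat on the mat",
    "the cat sat on the hat",
    "stock prices fell sharply",
]
vecs = [term_counts(d) for d in docs]
sim_close = cosine_similarity(vecs[0], vecs[1])  # documents sharing most terms
sim_far = cosine_similarity(vecs[0], vecs[2])    # documents sharing no terms
```

Any measure that scores like documents higher than unlike ones could be substituted; the clustering steps that follow only need pairwise comparisons.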

5
Cluster Cost Criteria
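  • Sum-of-squared-error
      • $\text{Cost} = \sum_i \|x_i - \bar{x}\|^2$, where $\bar{x}$ is the cluster mean
  • Average squared distance
      • $\text{Cost} = \frac{1}{n^2} \sum_i \sum_j \|x_i - x_j\|^2$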
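
The two criteria can be sketched for a single cluster as follows. This is a minimal illustration that assumes documents are already represented as equal-length numeric vectors and that the distance is Euclidean; the helper names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def sum_squared_error(points):
    """Sum over the cluster of squared distances to the cluster mean."""
    centre = mean_point(points)
    return sum(squared_distance(p, centre) for p in points)

def average_squared_distance(points):
    """(1/n^2) times the sum of squared distances over all ordered pairs."""
    n = len(points)
    total = sum(squared_distance(p, q) for p in points for q in points)
    return total / n ** 2

cluster = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
sse = sum_squared_error(cluster)
avg = average_squared_distance(cluster)
```

For Euclidean distance the two quantities are tied together: the average squared distance equals twice the sum-of-squared-error divided by n, so they rank clusters of equal size identically.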

6
Cluster Similarity Measure
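  • Measures the similarity of two clusters $C_i$, $C_j$
  • $d_{\min}(C_i, C_j) = \min_{x_i \in C_i,\, x_j \in C_j} \|x_i - x_j\|$
  • $d_{\max}(C_i, C_j) = \max_{x_i \in C_i,\, x_j \in C_j} \|x_i - x_j\|$
  • $d_{\text{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{x_i \in C_i} \sum_{x_j \in C_j} \|x_i - x_j\|$
  • $d_{\text{mean}}(C_i, C_j) = \left\| \frac{1}{n_i} \sum_{x_i \in C_i} x_i - \frac{1}{n_j} \sum_{x_j \in C_j} x_j \right\|$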
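
A small sketch of the four measures, assuming Euclidean distance between numeric vectors; the function names mirror the slide's symbols but are otherwise hypothetical. Note that each measure is a distance, so a smaller value means the two clusters are more similar.

```python
import math
from itertools import product

def distance(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def d_min(ci, cj):
    """Distance between the closest pair, one point from each cluster."""
    return min(distance(p, q) for p, q in product(ci, cj))

def d_max(ci, cj):
    """Distance between the farthest pair, one point from each cluster."""
    return max(distance(p, q) for p, q in product(ci, cj))

def d_avg(ci, cj):
    """Mean distance over all cross-cluster pairs."""
    return sum(distance(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))

def d_mean(ci, cj):
    """Distance between the two cluster means."""
    return distance(centroid(ci), centroid(cj))

ci = [[0.0, 0.0], [1.0, 0.0]]
cj = [[3.0, 0.0], [5.0, 0.0]]
results = {f.__name__: f(ci, cj) for f in (d_min, d_max, d_avg, d_mean)}
```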

7
Iterative Clustering
  • Assign points to initial k clusters
      • Often this is done by random assignment
  • Until done
      • Select a candidate point x, in cluster c
      • Find the best cluster c′ for x
      • If c′ ≠ c, then move x to c′
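
The loop above might be sketched as follows. The slide leaves "best cluster" and "done" open, so this example assumes that the best cluster is the one with the nearest centroid and that the loop is done when a full pass moves no point. The random initial assignment is dealt round-robin after a shuffle so that no cluster starts empty; that detail, and the function name `iterative_clustering`, are assumptions of this sketch.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def iterative_clustering(points, k, seed=0, max_passes=100):
    """Return one cluster label per point (requires k <= len(points))."""
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k  # random but balanced initial assignment
    for _ in range(max_passes):
        moved = False
        for i, x in enumerate(points):
            c = labels[i]
            groups = [[points[j] for j in range(len(points)) if labels[j] == m]
                      for m in range(k)]
            if len(groups[c]) == 1:
                continue  # never empty a cluster
            centres = [centroid(g) for g in groups]
            best = min(range(k), key=lambda m: squared_distance(x, centres[m]))
            if best != c:
                labels[i] = best  # move x to the better cluster
                moved = True
        if not moved:
            break  # a full pass with no moves: done
    return labels

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = iterative_clustering(points, k=2)
```

Recomputing every centroid for each candidate point keeps the sketch short at the price of speed; an implementation meant for large collections would update the two affected centroids incrementally.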

8
Iterative Clustering
  • The user must pre-select the number of clusters
  • Often the correct number is not known in
    advance!
  • The quality of the outcome is usually dependent
    on the quality of the initial assignment
  • Possibly use some other algorithm to create a
    good initial assignment?

9
Hierarchical Agglomerative Clustering
  • Create N single-document clusters
  • For i in 1..N−1
      • Merge the two most similar clusters (smallest
        inter-cluster distance)
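
A minimal sketch of the merge loop, assuming Euclidean distance and the minimum pairwise distance between clusters (single link) as the similarity measure. The slides do not say which measure the author used, so the `cluster_distance` parameter is left swappable, and all names here are hypothetical.

```python
import math
from itertools import combinations

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(ci, cj):
    """Smallest distance between a point of ci and a point of cj."""
    return min(distance(p, q) for p in ci for q in cj)

def hac(points, cluster_distance=single_link):
    """Merge clusters pairwise until one remains; return the merge history."""
    clusters = [[p] for p in points]  # one single-point cluster per input
    merges = []
    while len(clusters) > 1:
        # Most similar pair = the pair with the smallest cluster distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

points = [[0.0], [1.0], [5.0], [6.0], [20.0]]
merges = hac(points)
```

The history contains one entry per merge, so N points always produce N−1 entries. This naive version rescans every pair on each step; it is meant to show the structure of the algorithm rather than to scale.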

12
Hierarchical Agglomerative Clustering
  • Hierarchical agglomerative clustering gives a
    hierarchy of clusters
  • This makes it easier to explore the possible
    values of k and choose the best number of
    clusters
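
One way to see this is to record the partition after every merge, so that any value of k can be inspected without rerunning the algorithm. The sketch below assumes single-link merging with Euclidean distance; `hac_levels` is a hypothetical name.

```python
import math
from itertools import combinations

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(ci, cj):
    return min(distance(p, q) for p in ci for q in cj)

def hac_levels(points):
    """Return {k: clusters} for every k from len(points) down to 1."""
    clusters = [[p] for p in points]
    levels = {len(clusters): [list(c) for c in clusters]}
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        # Snapshot the partition that exists after this merge.
        levels[len(clusters)] = [list(c) for c in clusters]
    return levels

points = [[0.0], [1.0], [5.0], [6.0], [20.0]]
levels = hac_levels(points)
for k in (4, 3, 2):
    print(k, sorted(sorted(c) for c in levels[k]))
```

A single run therefore yields the candidate clustering for every k, which can then be compared under whichever cost criterion was chosen.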

14
High density variations
  • Intuitively correct clustering
  • HAC-generated clusters

15
Hybrid
  • Combine HAC and iterative clustering
  • Assign points to initial clusters using HAC
  • Until done
      • Select a candidate point x, in cluster c
      • Find the best cluster c′ for x
      • If c′ ≠ c, then move x to c′
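
The hybrid might look like the sketch below: HAC is stopped once k clusters remain, and its output seeds the iterative loop. The concrete choices are assumptions made for illustration, not the author's stated method: single-link merging, "best cluster" meaning nearest centroid, and "done" meaning a full pass with no moves. The names `hac_labels` and `refine` are hypothetical.

```python
from itertools import combinations

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def hac_labels(points, k):
    """Single-link HAC stopped at k clusters; returns one label per point."""
    clusters = [[i] for i in range(len(points))]  # clusters hold point indices

    def link(a, b):
        return min(squared_distance(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: link(clusters[ab[0]], clusters[ab[1]]))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def refine(points, labels, k, max_passes=100):
    """Move each point to its nearest-centroid cluster until nothing moves."""
    labels = list(labels)
    for _ in range(max_passes):
        moved = False
        for i, x in enumerate(points):
            c = labels[i]
            groups = [[points[j] for j in range(len(points)) if labels[j] == m]
                      for m in range(k)]
            if len(groups[c]) == 1:
                continue  # never empty a cluster
            centres = [centroid(g) for g in groups]
            best = min(range(k), key=lambda m: squared_distance(x, centres[m]))
            if best != c:
                labels[i] = best
                moved = True
        if not moved:
            break
    return labels

points = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [10.0]]
initial = hac_labels(points, k=2)
final = refine(points, initial, k=2)
```

On this evenly spaced chain, single-link HAC isolates the last point in its own cluster, and the refinement pass then pulls the neighbouring points across to give a more balanced split.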

16
Other Algorithms
  • Support Vector Clustering
  • Information Bottleneck

17
High density variations