Incremental Clustering - PowerPoint PPT Presentation

About This Presentation
Title:

Incremental Clustering

Slides: 24
Provided by: CSU67

Transcript and Presenter's Notes

1
Incremental Clustering
  • Previous clustering algorithms worked in batch
    mode: they processed all points at essentially the
    same time.
  • Some IR applications cluster an incoming document
    stream (e.g., topic tracking).
  • For these applications, we need incremental
    clustering algorithms.

2
Incremental Clustering Issues
  • How to be efficient? Should all documents be
    cached?
  • How to handle or support concept drift?
  • How to reduce sensitivity to ordering?
  • Goals:
      • minimize the maximum cluster diameter
      • minimize the number of clusters given a fixed
        diameter

3
Incremental Clustering Model (Charikar et al., 1997)
  • Extension to HAC, as follows:
  • Incremental Clustering: for an update sequence
    of n points in a metric space M, maintain a collection
    of k clusters such that, as each point is presented,
    either it is assigned to one of the current k
    clusters or it starts a new cluster while two
    existing clusters are merged into one.
  • Maintains a HAC for the points added up until the
    current time.

M. Charikar, C. Chekuri, T. Feder, R. Motwani.
Incremental Clustering and Dynamic Information
Retrieval, Proc. 29th Annual ACM Symposium on
Theory of Computing, 1997.
4
Doubling Algorithm (α = β = 2)
  • Assign the first k+1 points to k+1 clusters with each
    point as centroid; d_1 = distance between the closest
    two points.
  • Do while more points:
      • d_{t+1} = β d_t
      • Merge clusters until all old clusters are in some new
        cluster:
          • Pick an arbitrary cluster; merge all clusters
            whose centers are within d_{t+1} of its center
          • Remove the selected clusters from the old clusters
          • Calculate the centroid for the new cluster
      • Update clusters while number of clusters ≤ k:
          • Assign the new point to the closest cluster if within
            α d_{t+1} of its center; otherwise create a new cluster.
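
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the exact procedure from Charikar et al.: the function names (`centroid`, `merge_stage`, `doubling_cluster`) are made up for this example, points are assumed to be tuples of numbers compared with Euclidean distance, and "pick an arbitrary cluster" is implemented as "take the last one in the list".

```python
import math
from itertools import combinations, islice

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def merge_stage(clusters, d):
    """Merge clusters whose centroids lie within d of an arbitrarily picked seed."""
    old, new = list(clusters), []
    while old:
        seed = old.pop()                      # arbitrary choice (assumption)
        center = centroid(seed)
        merged, keep = list(seed), []
        for other in old:
            if math.dist(centroid(other), center) <= d:
                merged.extend(other)          # absorbed into the new cluster
            else:
                keep.append(other)
        old = keep
        new.append(merged)
    return new

def doubling_cluster(stream, k, alpha=2.0, beta=2.0):
    """Maintain at most k clusters over a stream of points."""
    it = iter(stream)
    clusters = [[p] for p in islice(it, k + 1)]   # first k+1 points
    if len(clusters) <= k:
        return clusters
    d = min(math.dist(a[0], b[0]) for a, b in combinations(clusters, 2))
    if d == 0:
        raise ValueError("sketch assumes the first k+1 points are distinct")
    while True:
        while len(clusters) > k:              # merge stage
            d *= beta
            clusters = merge_stage(clusters, d)
        for p in it:                          # update stage
            best = min(clusters, key=lambda c: math.dist(centroid(c), p))
            if math.dist(centroid(best), p) <= alpha * d:
                best.append(p)
            else:
                clusters.append([p])
                if len(clusters) > k:
                    break                     # too many clusters: merge again
        else:
            return clusters                   # stream exhausted

stream = [(0, 0), (1, 0), (10, 0), (11, 0), (0, 1), (10, 1)]
clusters = doubling_cluster(stream, k=2)
```

Recomputing centroids on every comparison keeps the sketch short; a real implementation would cache them per cluster.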

5
Example: Plot -- Incremental
[Figure: scatter plot of 16 points labeled 1 through 16.]
6
Example: Doubling Merge, d_2 = 24.08
[Figure: the 16 labeled points with one X marker.]
7
Example: Doubling Update, d_2 = 24.08
[Figure: the 16 labeled points with two X markers.]
8
Example: Doubling Update, d_2 = 24.08
[Figure: the 16 labeled points with two X markers.]
9
Example: Doubling Update, d_2 = 24.08
[Figure: the 16 labeled points with two X markers.]
10
Example: Doubling Solution
[Figure: the 16 labeled points.]
11
Clique Partition Background
  • A clique in G(V,E) is a subset V' of V s.t.
    every two vertices in V' are joined by an edge in
    E.
  • A clique partition for G is a partition of V into
    disjoint subsets V_1, ..., V_k s.t. for 1 ≤ i ≤ k, the
    subgraph induced by V_i is a complete graph.
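
As a small illustration of these two definitions, the following Python sketch checks whether a proposed partition is a clique partition. The representation is an assumption made for this example: edges are stored as a set of `frozenset` pairs, and the helper names `is_clique` and `is_clique_partition` are hypothetical.

```python
from itertools import combinations

def is_clique(vertices, edges):
    """True if every two vertices in `vertices` are joined by an edge."""
    return all(frozenset((u, v)) in edges for u, v in combinations(vertices, 2))

def is_clique_partition(all_vertices, parts, edges):
    """True if `parts` are disjoint, cover all vertices, and each is a clique."""
    seen = [v for part in parts for v in part]
    disjoint_cover = len(seen) == len(set(seen)) and set(seen) == set(all_vertices)
    return disjoint_cover and all(is_clique(part, edges) for part in parts)

V = {1, 2, 3, 4}
E = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (3, 4)]}
ok_two_parts = is_clique_partition(V, [{1, 2, 3}, {4}], E)
ok_pairs = is_clique_partition(V, [{1, 2}, {3, 4}], E)
bad_single = is_clique_partition(V, [{1, 2, 3, 4}], E)
```

Both two-part partitions are valid here, while the single part {1, 2, 3, 4} is not, because vertices 1 and 4 are not adjacent.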

12
Clique Partition Algorithm
  • Assign the first k+1 points to k+1 clusters with each
    point as centroid; d_1 = distance between the closest
    two points.
  • Do while more points:
      • d_{t+1} = 2 d_t
      • Merge clusters:
          • Compute a minimum clique partition of the d_{t+1}
            threshold graph
          • Merge the clusters in each clique
          • In each new cluster, arbitrarily assign one of
            the existing centers as the center of the new
            cluster
      • Update clusters while number of clusters ≤ k:
          • Assign the new point to a cluster if it is within
            d_{t+1} of that cluster's center or of one of its
            sub-cluster centers; otherwise create a new cluster.
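
The merge step can be illustrated with a short Python sketch. Note the substitution: finding a minimum clique partition is NP-hard in general, so the code below uses a simple greedy heuristic instead, which yields a valid clique partition of the threshold graph but not necessarily a minimum one. The function name `greedy_clique_partition` and the use of Euclidean distance between centers are assumptions for this example.

```python
import math

def greedy_clique_partition(centers, d):
    """Greedily partition centers into cliques of the d-threshold graph.

    Two centers are adjacent in the threshold graph when their distance
    is at most d. Each center joins the first clique in which it is
    adjacent to every member; otherwise it starts a new clique.
    """
    cliques = []
    for c in centers:
        for clique in cliques:
            if all(math.dist(c, member) <= d for member in clique):
                clique.append(c)
                break
        else:
            cliques.append([c])
    return cliques

centers = [(0, 0), (1, 0), (2, 0), (10, 0)]
tight = greedy_clique_partition(centers, d=1)
loose = greedy_clique_partition(centers, d=2)
```

After this step, the clusters whose centers fall in the same clique would be merged, and one of those centers kept as the center of the merged cluster.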

13
Example: CP Merge, d_1 = 12.04
[Figure: the 16 labeled points.]
14
Example: CP Update, d_2 = 24.08
[Figure: the labeled points.]
15
Web Document Clustering Applications
  • Organizing search engine retrieval results
      • Meta-search engine that hierarchically clusters
        results: Vivisimo
      • Meta-search engine that graphically displays
        clusters of results: Kartoo
  • Detecting redundancy (e.g., mirror sites or moved
    or re-formatted documents)
  • User interest profiles (aka filtering)

16
Vivisimo Result Organization
17
Kartoo Visual Clustering
18
Detecting Mirrors/Subsumed Web Documents
  • Resemblance assesses similarity between two
    documents.
  • Containment assesses the extent to which A is a subset of B.

A.Z. Broder, S.C. Glassman, M.S. Manasse, G.
Zweig, Syntactic Clustering of the Web,
Proceedings of WWW6, 1997.
19
Computing R and C
  • S(D,w) is the set of all unique shingles (contiguous
    subsequences of length w) in document D.
  • S(D) is S(D,w) for a fixed size w.
  • To reduce the storage and computation, we can
    sample the shingles for each doc:
      • First s: MIN_s(W), the s smallest elements of W
      • Every m-th: MOD_m(W), the elements of W that are 0 mod m
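
A minimal Python sketch of shingling, with resemblance and containment computed on full shingle sets. The formulas are the ones given by Broder et al.: r(A,B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| and c(A,B) = |S(A) ∩ S(B)| / |S(A)|. Splitting on whitespace to get tokens and returning 0.0 for empty sets are assumptions made for this example.

```python
def shingles(text, w):
    """S(D, w): the set of unique contiguous w-token subsequences of text."""
    tokens = text.split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(sa, sb):
    """r(A, B) = |S(A) & S(B)| / |S(A) | S(B)|, a value in [0, 1]."""
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def containment(sa, sb):
    """c(A, B) = |S(A) & S(B)| / |S(A)|: how much of A appears in B."""
    return len(sa & sb) / len(sa) if sa else 0.0

A = shingles("a rose is a rose is a rose", 4)
B = shingles("a rose is a rose", 4)
r_ab = resemblance(A, B)      # 2 shared shingles out of 3 in the union
c_ba = containment(B, A)      # every shingle of B also occurs in A
```

Here A has three unique 4-shingles and B has two, both of which occur in A, so B is fully contained in A even though the resemblance is only 2/3.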

20
Estimating R and C from a Portion of a Document
  • Keep a sketch of each document D, which consists
    of F(D) and/or V(D).
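
The slide does not define F(D) and V(D). The sketch below assumes they follow Broder et al., where F(D) is the MIN_s sample and V(D) is the MOD_m sample of the document's shingle fingerprints. Using `zlib.crc32` as the fingerprint function is a stand-in chosen for this example (the paper uses Rabin fingerprints), and all function names are hypothetical.

```python
import zlib

def fingerprints(text, w):
    """Map each w-token shingle of text to an integer (CRC32 is an assumption)."""
    tokens = text.split()
    return {zlib.crc32(" ".join(tokens[i:i + w]).encode())
            for i in range(len(tokens) - w + 1)}

def min_s(W, s):
    """MIN_s(W): the s smallest elements of W."""
    return set(sorted(W)[:s])

def mod_m(W, m):
    """MOD_m(W): the elements of W that are 0 mod m."""
    return {x for x in W if x % m == 0}

def estimate_r_min(fa, fb, s):
    """Resemblance estimate from two MIN_s sketches."""
    smallest = min_s(fa | fb, s)
    return len(smallest & fa & fb) / len(smallest) if smallest else 0.0

def estimate_r_mod(va, vb):
    """Resemblance estimate from two MOD_m sketches."""
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0

def estimate_c_mod(va, vb):
    """Containment estimate (A in B) from two MOD_m sketches."""
    return len(va & vb) / len(va) if va else 0.0

# Synthetic fingerprint sets: true r = 50/150, true c = 50/100.
Wa, Wb = set(range(0, 100)), set(range(50, 150))
r_est = estimate_r_mod(mod_m(Wa, 5), mod_m(Wb, 5))
c_est = estimate_c_mod(mod_m(Wa, 5), mod_m(Wb, 5))
```

The synthetic sets are evenly spread, so the MOD_5 estimates match the true values exactly; with real fingerprints the estimates are only approximate.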

21
Web Clustering with R and C
  • w = 10, m = 25, s = 50?, threshold = 0.5
  • Pre-process documents:
      • For each doc, calculate a sketch
      • Sort pairs of <shingle, docid>, removing
        lexically-equivalent and shingle-equivalent docs
      • Compute the list of doc pairs with their number of
        shared shingles, ignoring very common shingles
  • Generate clusters:
      • If r(A,B) > threshold, then add link A <-> B
      • Produce connected components using union-find
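
The pipeline above could look roughly like this in Python. It is a simplified in-memory sketch, not the implementation from the 1997 experiments: `cluster_docs` and its `max_docs_per_shingle` cutoff for "very common shingles" are hypothetical, resemblance is estimated directly from the sketch sets, and the step that removes lexically-equivalent and shingle-equivalent docs is omitted.

```python
from collections import defaultdict
from itertools import combinations

def cluster_docs(sketches, threshold=0.5, max_docs_per_shingle=1000):
    """sketches: dict mapping docid -> set of sampled shingle fingerprints."""
    # Sort <shingle, docid> pairs so docs sharing a shingle become adjacent.
    pairs = sorted((sh, doc) for doc, sk in sketches.items() for sh in sk)
    docs_by_shingle = defaultdict(list)
    for sh, doc in pairs:
        docs_by_shingle[sh].append(doc)

    # Count shared shingles per doc pair, ignoring very common shingles.
    shared = defaultdict(int)
    for docs in docs_by_shingle.values():
        if len(docs) > max_docs_per_shingle:
            continue
        for a, b in combinations(docs, 2):
            shared[(a, b)] += 1

    # Union-find with path halving.
    parent = {doc: doc for doc in sketches}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link A <-> B when the estimated resemblance exceeds the threshold.
    for (a, b), n in shared.items():
        union_size = len(sketches[a]) + len(sketches[b]) - n
        if n / union_size > threshold:
            parent[find(a)] = find(b)

    # Connected components are the clusters.
    clusters = defaultdict(set)
    for doc in sketches:
        clusters[find(doc)].add(doc)
    return list(clusters.values())

sketches = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 5},
            "c": {5, 20, 21, 22}, "d": {100, 101}}
result = sorted(sorted(c) for c in cluster_docs(sketches))
```

Documents "a" and "b" share 3 of 5 distinct shingles (0.6 > 0.5) and end up linked; "b" and "c" share only 1 of 7, so "c" and "d" remain singletons.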

22
Web Clustering Results 1997
  • 30M web pages, 150 GBytes
  • 600M shingles
  • 3.6M clusters of 12.3M docs
  • 2.1M clusters of 5.3M identical docs
  • Took 10.5 CPU days to compute

23
Web Applications of Resemblance Clusters
  • Find URL similar to
      • relies on fixed threshold and requires URLs to
        have been processed
  • WWW Lost and Found
      • requires keeping some historical sketch info
  • Remove similar docs from search results