Title: Incremental Clustering
1 Incremental Clustering
- Previous clustering algorithms worked in batch mode: they processed all points at essentially the same time.
- Some IR applications cluster an incoming document stream (e.g., topic tracking).
- For these applications, we need incremental clustering algorithms.
2 Incremental Clustering Issues
- How to be efficient? Should all documents be cached?
- How to handle or support concept drift?
- How to reduce sensitivity to ordering?
- Goals
  - minimize the maximum cluster diameter
  - minimize the number of clusters given a fixed diameter
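To make the first goal concrete, here is a minimal sketch of how the maximum cluster diameter could be measured. It assumes points are coordinate tuples compared by Euclidean distance; the function names are illustrative and do not come from the slides.

```python
import math
from itertools import combinations

def diameter(cluster):
    """Largest pairwise Euclidean distance within one cluster (0.0 for a single point)."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)

def max_diameter(clusters):
    """The quantity the first goal tries to minimize: the largest cluster diameter."""
    return max((diameter(c) for c in clusters), default=0.0)
```

The second goal is the dual view: fix a bound on `diameter` and ask for as few clusters as possible.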
3 Incremental Clustering Model [Charikar et al. 1997]
- Extension to hierarchical agglomerative clustering (HAC) as follows
- Incremental Clustering: for an update sequence of n points in a metric space M, maintain a collection of k clusters such that as each point is presented, either it is assigned to one of the current k clusters or it starts off a new cluster while two existing clusters are merged into one.
- Maintains a HAC for points added up until current time.
M. Charikar, C. Chekuri, T. Feder, R. Motwani. Incremental Clustering and Dynamic Information Retrieval. Proc. 29th Annual ACM Symposium on Theory of Computing, 1997.
4 Doubling Algorithm ($a = b = 2$)
- Assign first $k+1$ points to $k+1$ clusters with each point as centroid; $d_1$ = distance between closest two points.
- Do while more points
  - $d_{t+1} = b\,d_t$
  - Merge clusters until all clusters are in some new cluster
    - Pick an arbitrary cluster; merge all clusters whose centers are within $d_{t+1}$ of its center
    - Remove selected clusters from old clusters
    - Calculate the centroid for the new cluster
  - Update clusters while number of clusters $\le k$
    - Assign new point to closest cluster if within $a\,d_{t+1}$ of center; otherwise create new cluster.
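The steps above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: it assumes points are coordinate tuples under Euclidean distance, that the first k+1 points are distinct, and that "arbitrary cluster" means whichever cluster is last in the list. The names `centroid`, `merge_stage`, and `doubling_cluster` are illustrative.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def merge_stage(clusters, d):
    """Pick a cluster, absorb every cluster whose centroid is within d, repeat."""
    old, new = list(clusters), []
    while old:
        seed = old.pop()  # "arbitrary" choice: here simply the last cluster
        center = centroid(seed)
        near = [c for c in old if math.dist(centroid(c), center) <= d]
        old = [c for c in old if math.dist(centroid(c), center) > d]
        new.append(seed + [p for c in near for p in c])
    return new

def doubling_cluster(stream, k, a=2.0, b=2.0):
    """Cluster the points of `stream`; returns a list of clusters (lists of points)."""
    stream = iter(stream)
    clusters = []
    for p in stream:  # the first k+1 points each start their own cluster
        clusters.append([p])
        if len(clusters) == k + 1:
            break
    if len(clusters) <= k:  # fewer than k+1 points: nothing to merge
        return clusters
    centers = [c[0] for c in clusters]
    d = min(math.dist(p, q) for i, p in enumerate(centers) for q in centers[i + 1:])
    if d == 0:
        raise ValueError("this sketch assumes the first k+1 points are distinct")
    done = object()  # sentinel marking the end of the stream
    while True:
        d *= b                                   # d_{t+1} = b * d_t
        clusters = merge_stage(clusters, d)      # merge stage
        while len(clusters) <= k:                # update stage
            p = next(stream, done)
            if p is done:
                return clusters
            best = min(clusters, key=lambda c: math.dist(centroid(c), p))
            if math.dist(centroid(best), p) <= a * d:
                best.append(p)
            else:
                clusters.append([p])
```

Centroids are recomputed on every call to keep the sketch short; a practical version would cache one center per cluster.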
5 Example: Plot -- Incremental
[Figure: scatter plot of the example points, labeled 1-16.]
6 Example: Doubling Merge, $d_2 = 24.08$
[Figure: the example points 1-16 with one X marker.]
7 Example: Doubling Update, $d_2 = 24.08$
[Figure: the example points 1-16 with two X markers.]
8 Example: Doubling Update, $d_2 = 24.08$
[Figure: the example points 1-16 with two X markers.]
9 Example: Doubling Update, $d_2 = 24.08$
[Figure: the example points 1-16 with two X markers.]
10 Example: Doubling Solution
[Figure: the example points 1-16.]
11 Clique Partition Background
- A clique in $G(V,E)$ is a subset $V'$ of $V$ s.t. every two vertices in $V'$ are joined by an edge in $E$.
- A clique partition for $G$ is a partition of $V$ into disjoint subsets $V_1, \ldots, V_k$ s.t. for $1 \le i \le k$, the subgraph induced by $V_i$ is a complete graph.
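The two definitions can be checked mechanically. Below is a small sketch for an undirected graph given as a vertex collection and a list of edge pairs; the function names and the input representation are assumptions made for illustration.

```python
from itertools import combinations

def is_clique(vertices, edge_set):
    """True if every two of the given vertices are joined by an edge."""
    return all(frozenset((u, v)) in edge_set for u, v in combinations(vertices, 2))

def is_clique_partition(vertices, edges, parts):
    """True if `parts` splits `vertices` into disjoint subsets that are all cliques."""
    edge_set = {frozenset(e) for e in edges}
    covered = [v for part in parts for v in part]
    is_partition = len(covered) == len(set(covered)) and set(covered) == set(vertices)
    return is_partition and all(is_clique(part, edge_set) for part in parts)
```

For example, with edges (1,2), (2,3), (1,3), (3,4), both {1,2,3},{4} and {1,2},{3,4} are clique partitions, while {1,2,4},{3} is not because 1 and 4 are not adjacent.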
12 Clique Partition Algorithm
- Assign first $k+1$ points to $k+1$ clusters with each point as centroid; $d_1$ = distance between closest two points.
- Do while more points
  - $d_{t+1} = 2\,d_t$
  - Merge clusters
    - Compute minimum clique partition from the $d_{t+1}$ threshold graph
    - Merge clusters in each clique
    - In each new cluster, arbitrarily assign one of the existing centers as the center for the new cluster
  - Update clusters while number of clusters $\le k$
    - Assign new point to a cluster if it is within $d_{t+1}$ of that cluster's center or of the center of one of its sub-clusters; otherwise create new cluster.
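The merge stage can be sketched as below. Note the substitution: computing a minimum clique partition is NP-hard in general, so this sketch uses a simple greedy partition instead, which yields a valid clique partition but not necessarily a minimum one. The cluster representation (a dict with `center`, `subcenters`, and `points`) and all function names are assumptions for illustration.

```python
import math

def threshold_graph(centers, d):
    """Adjacency sets: centers i and j are joined when their distance is <= d."""
    n = len(centers)
    return {i: {j for j in range(n)
                if j != i and math.dist(centers[i], centers[j]) <= d}
            for i in range(n)}

def greedy_clique_partition(adj):
    """Greedy heuristic (NOT guaranteed minimum): grow one clique per unassigned vertex."""
    unassigned = set(adj)
    cliques = []
    for v in sorted(adj):
        if v not in unassigned:
            continue
        clique = [v]
        unassigned.remove(v)
        for u in sorted(adj[v] & unassigned):
            if all(u in adj[w] for w in clique):  # u must be adjacent to every member
                clique.append(u)
                unassigned.remove(u)
        cliques.append(clique)
    return cliques

def merge_cliques(clusters, d):
    """Merge the clusters in each clique; keep one existing center, remember the rest."""
    adj = threshold_graph([c["center"] for c in clusters], d)
    merged = []
    for clique in greedy_clique_partition(adj):
        members = [clusters[i] for i in clique]
        merged.append({
            "center": members[0]["center"],  # arbitrary pick among existing centers
            "subcenters": [s for m in members for s in m["subcenters"]],
            "points": [p for m in members for p in m["points"]],
        })
    return merged
```

Keeping `subcenters` is what lets the update stage test a new point against the center of the cluster and the centers of its sub-clusters.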
13 Example: CP Merge, $d_1 = 12.04$
[Figure: the example points 1-16.]
14 Example: CP Update, $d_2 = 24.08$
[Figure: the example points 1-16; the plot also carries a label 23.]
15 Web Document Clustering Applications
- Organizing search engine retrieval results
  - Meta-search engine that hierarchically clusters results: Vivisimo
  - Meta-search engine that graphically displays clusters of results: Kartoo
- Detecting redundancy (e.g., mirror sites or moved or re-formatted documents)
- User interest profiles (aka filtering)
16 Vivisimo Result Organization
17 Kartoo Visual Clustering
18 Detecting Mirrors/Subsumed Web Documents
- Resemblance assesses similarity between two documents.
- Containment assesses the extent to which document A is contained in (is a subset of) document B.
A.Z. Broder, S.C. Glassman, M.S. Manasse, G. Zweig. Syntactic Clustering of the Web. Proceedings of WWW6, 1997.
19 Computing R and C
- S(D,w) (the shingling of D) is the set of all unique contiguous subsequences (shingles) of length w in document D.
- S(D) is S(D,w) for a fixed size w.
- To reduce the storage and computation, we can sample the shingles for each doc
  - First s: $\mathrm{MIN}_s(W)$
  - Every mth: $\mathrm{MOD}_m(W)$
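As an illustration, the sketch below builds S(D,w) and computes resemblance and containment using the definitions in Broder et al.: r(A,B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| and c(A,B) = |S(A) ∩ S(B)| / |S(A)|. Splitting on whitespace and returning 0.0 for empty shingle sets are assumptions of this sketch, and the function names are illustrative.

```python
def shingles(text, w):
    """S(D, w): the set of unique contiguous w-token subsequences of a document."""
    tokens = text.split()  # assumption: whitespace tokenization
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(sa, sb):
    """r(A, B) = |S(A) & S(B)| / |S(A) | S(B)|, in [0, 1]."""
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def containment(sa, sb):
    """c(A, B) = |S(A) & S(B)| / |S(A)|: how much of A appears in B."""
    return len(sa & sb) / len(sa) if sa else 0.0

a = shingles("a rose is a rose is a rose", 4)
b = shingles("a rose is a rose", 4)
print(len(a), resemblance(a, b), containment(b, a))
```

Here `a` has three distinct 4-shingles even though the text has five 4-token windows, because repeated shingles collapse in the set; `b` is fully contained in `a`, so its containment is 1.0 while the resemblance is 2/3.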
20 Estimating R and C from a Portion of a Document
- Keep a sketch of each document D, which consists of F(D) and/or V(D).
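A sketch of how such estimates might be computed, assuming (following Broder et al.) that F(D) is the MIN_s sample and V(D) is the MOD_m sample of the document's shingles after mapping each shingle to an integer. The `fingerprint` function below uses a hash as a stand-in for the fingerprinting and random permutation in the paper; it and the other names are illustrative assumptions.

```python
import hashlib

def fingerprint(shingle):
    """Map a shingle (tuple of tokens) to a 64-bit integer (stand-in for the paper's scheme)."""
    data = " ".join(shingle).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def min_s(values, s):
    """MIN_s(W): the s smallest elements of W (all of W if it has fewer than s)."""
    return set(sorted(values)[:s])

def mod_m(values, m):
    """MOD_m(W): the elements of W that are 0 mod m."""
    return {v for v in values if v % m == 0}

def estimate_resemblance_min(fa, fb, s):
    """Estimate r(A, B) from F(A) and F(B)."""
    smallest = min_s(fa | fb, s)
    return len(smallest & fa & fb) / len(smallest) if smallest else 0.0

def estimate_resemblance_mod(va, vb):
    """Estimate r(A, B) from V(A) and V(B)."""
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0

def estimate_containment_mod(va, vb):
    """Estimate c(A, B) from V(A) and V(B)."""
    return len(va & vb) / len(va) if va else 0.0
```

The MIN_s sketch has a fixed size, which makes storage predictable; the MOD_m sketch grows with the document but also supports the containment estimate.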
21 Web Clustering with R and C
- w = 10, m = 25, s = 50?, threshold = 0.5
- Pre-process documents
- For each doc, calculate a sketch
- Sort pairs of <shingle, docid>, removing lexically-equivalent and shingle-equivalent docs
- Compute list of doc pairs with number of shared shingles, ignoring very common shingles
- Generate clusters
  - if r(A,B) > threshold, then add link A <-> B
  - Produce connected components using union-find
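The last three steps can be sketched as below. This is a simplified illustration, not the pipeline from the paper: it groups docids by shingle in memory instead of sorting <shingle, docid> pairs on disk, estimates r(A,B) from sketch sizes and the shared-shingle count, and uses a hypothetical `max_docs` cutoff to skip very common shingles.

```python
from collections import Counter, defaultdict
from itertools import combinations

def shared_shingle_counts(sketches, max_docs=100):
    """sketches: dict docid -> set of shingle fingerprints. Returns Counter[(a, b)]."""
    docs_by_shingle = defaultdict(list)
    for doc_id, sketch in sketches.items():
        for sh in sketch:
            docs_by_shingle[sh].append(doc_id)
    counts = Counter()
    for docs in docs_by_shingle.values():
        if len(docs) > max_docs:  # ignore very common shingles
            continue
        for a, b in combinations(sorted(docs), 2):
            counts[(a, b)] += 1
    return counts

def cluster_docs(sketches, threshold=0.5, max_docs=100):
    """Link docs whose estimated resemblance exceeds threshold; return connected components."""
    parent = {d: d for d in sketches}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), shared in shared_shingle_counts(sketches, max_docs).items():
        r = shared / (len(sketches[a]) + len(sketches[b]) - shared)
        if r > threshold:
            parent[find(b)] = find(a)  # union: add link A <-> B

    groups = defaultdict(list)
    for d in sketches:
        groups[find(d)].append(d)
    return sorted(sorted(g) for g in groups.values())
```

Because clusters are connected components, two documents can land in the same cluster through a chain of links even when their own resemblance is below the threshold.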
22 Web Clustering Results (1997)
- 30M web pages, 150 GBytes
- 600M shingles
- 3.6M clusters of 12.3M docs
- 2.1M clusters of 5.3M identical docs
- Took 10.5 CPU days to compute
23 Web Applications of Resemblance Clusters
- Find URL similar to
  - relies on fixed threshold and requires URLs to have been processed
- WWW Lost and Found
  - requires keeping some historical sketch info
- Remove similar docs from search results