Incremental Clustering - PowerPoint PPT Presentation

Slides: 24

Provided by: CSU67

Learn more at: https://www.cs.colostate.edu

Transcript and Presenter's Notes

Title: Incremental Clustering

1
Incremental Clustering

Previous clustering algorithms worked in batch
mode: they processed all points at essentially
the same time.
Some IR applications cluster an incoming document
stream (e.g., topic tracking).
For these applications, we need incremental
clustering algorithms.

2
Incremental Clustering Issues

How to be efficient? Should all documents be
cached?
How to handle or support concept drift?
How to reduce sensitivity to ordering?
Goals:
  minimize the maximum cluster diameter
  minimize the number of clusters given a fixed
  diameter

3
Incremental Clustering Model (Charikar et al., 1997)

Extension to HAC, as follows:
Incremental clustering: for an update sequence
of n points in a metric space M, maintain a
collection of k clusters such that, as each point
is presented, either it is assigned to one of the
current k clusters or it starts a new cluster
while two existing clusters are merged into one.
Maintains a HAC for the points added up to the
current time.

M. Charikar, C. Chekuri, T. Feder, R. Motwani.
Incremental Clustering and Dynamic Information
Retrieval. Proc. 29th Annual ACM Symposium on
Theory of Computing, 1997.

4
Doubling Algorithm (α = β = 2)

Assign the first k+1 points to k+1 clusters, with
each point as its cluster's centroid; d_1 =
distance between the two closest points.
Do while more points remain:
  d_{t+1} = β d_t
  Merge clusters until every old cluster is in
  some new cluster:
    Pick an arbitrary cluster and merge into it
    all clusters whose centers are within d_{t+1}
    of its center
    Remove the selected clusters from the old
    clusters
    Calculate the centroid for the new cluster
  Update clusters while the number of clusters ≤ k:
    Assign the new point to the closest cluster if
    it is within α d_{t+1} of that cluster's
    center; otherwise create a new cluster.
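
The merge and update stages above can be sketched in Python. This is a minimal illustration rather than the authors' implementation: it assumes points are tuples of floats compared with Euclidean distance, that the first k+1 points are distinct, and that "pick an arbitrary cluster" can mean "take the first one remaining". The names `merge_stage` and `doubling` and the dictionary layout of a cluster are hypothetical.

```python
import math
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def merge_stage(clusters, d):
    """Merge clusters whose centers lie within d of a picked cluster's center."""
    old, new = list(clusters), []
    while old:
        seed = old.pop(0)  # assumption: "arbitrary" = first remaining cluster
        near = [c for c in old if math.dist(c["center"], seed["center"]) <= d]
        old = [c for c in old if math.dist(c["center"], seed["center"]) > d]
        members = seed["points"] + [p for c in near for p in c["points"]]
        new.append({"center": centroid(members), "points": members})
    return new

def doubling(stream, k, alpha=2.0, beta=2.0):
    """Cluster a stream of points into at most k clusters."""
    stream = iter(stream)
    clusters = []
    for p in stream:  # the first k+1 points each start their own cluster
        clusters.append({"center": p, "points": [p]})
        if len(clusters) == k + 1:
            break
    if len(clusters) <= k:
        return clusters
    d = min(math.dist(a["center"], b["center"])
            for a, b in combinations(clusters, 2))
    if d == 0:
        raise ValueError("sketch assumes the first k+1 points are distinct")
    while True:
        d *= beta                          # d_{t+1} = beta * d_t
        clusters = merge_stage(clusters, d)
        while len(clusters) <= k:          # update stage
            p = next(stream, None)
            if p is None:
                return clusters
            nearest = min(clusters, key=lambda c: math.dist(c["center"], p))
            if math.dist(nearest["center"], p) <= alpha * d:
                nearest["points"].append(p)   # center is left unchanged here
            else:
                clusters.append({"center": p, "points": [p]})

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0),
          (11.0, 0.0), (0.0, 1.0), (10.0, 1.0)]
result = doubling(points, k=2)
print(len(result))  # 2
```

If a merge stage leaves k+1 clusters, the update loop is skipped and the distance is scaled by β again, so the sketch keeps enlarging the threshold until a merge happens.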

5
Example Plot -- Incremental

[Figure: scatter plot of 16 points labeled 1-16.]

6
Example: Doubling Merge, d_2 = 24.08

[Figure: scatter plot of points 1-16 with one X marker.]

7
Example: Doubling Update, d_2 = 24.08

[Figure: scatter plot of points 1-16 with two X markers.]

8
Example: Doubling Update, d_2 = 24.08

[Figure: scatter plot of points 1-16 with two X markers.]

9
Example: Doubling Update, d_2 = 24.08

[Figure: scatter plot of points 1-16 with two X markers.]

10
Example: Doubling Solution

[Figure: scatter plot of points 1-16.]

11
Clique Partition Background

A clique in G(V,E) is a subset V' of V s.t.
every two vertices in V' are joined by an edge in
E.
A clique partition for G is a partition of V into
disjoint subsets V1, ..., Vk s.t. for 1 ≤ i ≤ k,
the subgraph induced by Vi is a complete graph.
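
The two definitions above can be checked mechanically. The following sketch assumes an undirected graph stored as a set of `frozenset` edges; the function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def is_clique(vertices, edges):
    """True if every two vertices in `vertices` are joined by an edge."""
    return all(frozenset((u, v)) in edges for u, v in combinations(vertices, 2))

def is_clique_partition(parts, vertices, edges):
    """True if `parts` are disjoint, cover all vertices, and each is a clique."""
    covered = [v for part in parts for v in part]
    is_partition = (len(covered) == len(set(covered))
                    and set(covered) == set(vertices))
    return is_partition and all(is_clique(part, edges) for part in parts)

# Triangle 1-2-3 plus a pendant edge 3-4.
V = {1, 2, 3, 4}
E = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3), (3, 4)]}
print(is_clique_partition([{1, 2, 3}, {4}], V, E))  # True
print(is_clique_partition([{1, 2, 3, 4}], V, E))    # False
```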

12
Clique Partition Algorithm

Assign the first k+1 points to k+1 clusters, with
each point as its cluster's centroid; d_1 =
distance between the two closest points.
Do while more points remain:
  d_{t+1} = 2 d_t
  Merge clusters:
    Compute a minimum clique partition of the
    d_{t+1} threshold graph
    Merge the clusters in each clique
    In each new cluster, arbitrarily assign one of
    the existing centers as the center for the new
    cluster
  Update clusters while the number of clusters ≤ k:
    Assign the new point to a cluster if it is
    within d_{t+1} of that cluster's center or of
    one of its sub-clusters' centers; otherwise
    create a new cluster.
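
A rough Python sketch of the merge and assignment steps follows. Computing a minimum clique partition is NP-hard in general, so this sketch swaps in a simple greedy clique partition, which is not guaranteed to be minimum. The function names, the cluster dictionaries, and the `sub_centers` field are assumptions made for illustration.

```python
import math

def threshold_graph(centers, d):
    """Adjacency sets: i and j are joined if their centers are within d."""
    n = len(centers)
    return {i: {j for j in range(n)
                if j != i and math.dist(centers[i], centers[j]) <= d}
            for i in range(n)}

def greedy_clique_partition(adj):
    """Greedy stand-in for a minimum clique partition (not optimal)."""
    remaining, cliques = set(adj), []
    for v in sorted(adj):
        if v not in remaining:
            continue
        clique = [v]
        remaining.discard(v)
        for u in sorted(remaining):
            if all(u in adj[w] for w in clique):  # u adjacent to every member
                clique.append(u)
        remaining -= set(clique)
        cliques.append(clique)
    return cliques

def cp_merge(clusters, d):
    """Merge the clusters of each clique; keep one existing center."""
    centers = [c["center"] for c in clusters]
    merged = []
    for clique in greedy_clique_partition(threshold_graph(centers, d)):
        group = [clusters[i] for i in clique]
        merged.append({
            "center": group[0]["center"],  # arbitrary choice: the first one
            "sub_centers": [s for c in group for s in c["sub_centers"]],
            "points": [p for c in group for p in c["points"]],
        })
    return merged

def cp_assign(clusters, p, d):
    """Add p to a cluster with a (sub-)center within d, else start a new one."""
    for c in clusters:
        if any(math.dist(p, s) <= d for s in c["sub_centers"]):
            c["points"].append(p)
            return
    clusters.append({"center": p, "sub_centers": [p], "points": [p]})

start = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
clusters = [{"center": c, "sub_centers": [c], "points": [c]} for c in start]
merged = cp_merge(clusters, d=1.0)
print(len(merged))  # 3
```

Each cluster's own center is stored in `sub_centers`, so a single membership test covers both the center and the sub-cluster centers.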

13
Example: CP Merge, d_1 = 12.04

[Figure: scatter plot of points 1-16.]

14
Example: CP Update, d_2 = 24.08

[Figure: scatter plot of points 1-16 plus a point labeled 23.]

15
Web Document Clustering Applications

Organizing search engine retrieval results
  Meta-search engine that hierarchically clusters
  results (Vivisimo)
  Meta-search engine that graphically displays
  clusters of results (Kartoo)
Detecting redundancy (e.g., mirror sites, or moved
or re-formatted documents)
User interest profiles (aka filtering)

16
Vivisimo Result Organization
17
Kartoo Visual Clustering
18
Detecting Mirrors/Subsumed Web Documents

Resemblance r(A,B) assesses the similarity between
two documents A and B.
Containment c(A,B) assesses the extent to which A
is contained in B.

A. Z. Broder, S. C. Glassman, M. S. Manasse, G.
Zweig. Syntactic Clustering of the Web.
Proceedings of WWW6, 1997.

19
Computing R and C

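S(D,w) (the w-shingling of D) is the set of all
unique contiguous subsequences (shingles) of
length w in document D.
S(D) is S(D,w) for a fixed size w.
To reduce storage and computation, we can sample
the shingles for each doc, where W is the set of
shingle IDs:
  First s: MIN_s(W), the s smallest elements of W
  Every m-th: MOD_m(W), the elements of W that are
  0 mod m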
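
To make shingling and the two sampling schemes concrete, here is a small Python sketch. It assumes whitespace tokenization and uses a truncated SHA-1 digest as a stand-in for mapping shingles to integer IDs. Both are illustrative choices, not necessarily what the cited paper does. Resemblance and containment are computed here as |S(A) ∩ S(B)| / |S(A) ∪ S(B)| and |S(A) ∩ S(B)| / |S(A)|, the definitions given by Broder et al.

```python
import hashlib

def shingles(text, w):
    """S(D, w): set of unique contiguous w-token subsequences of the text."""
    tokens = text.split()  # assumption: whitespace tokenization
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(a, b, w):
    """r(A, B) = |S(A) & S(B)| / |S(A) | S(B)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def containment(a, b, w):
    """c(A, B) = |S(A) & S(B)| / |S(A)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0

def shingle_id(shingle):
    """Map a shingle to a 64-bit integer (illustrative fingerprint)."""
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def min_s(ids, s):
    """MIN_s(W): the s smallest elements of W (all of W if it has fewer)."""
    return set(sorted(ids)[:s])

def mod_m(ids, m):
    """MOD_m(W): the elements of W that are 0 mod m."""
    return {x for x in ids if x % m == 0}

doc_a = "a rose is a rose is a rose"
doc_b = "a rose is a rose"
print(len(shingles(doc_a, 4)))       # 3
print(containment(doc_b, doc_a, 4))  # 1.0
ids = {shingle_id(sh) for sh in shingles(doc_a, 4)}
print(len(min_s(ids, 2)))            # 2
```

MIN_s yields a sketch of bounded size, while the size of a MOD_m sketch grows with the document.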

20
Estimating R & C from a Portion of a Document

Keep a sketch of each document D, which consists
of F(D) and/or V(D).
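
The slide does not spell out F(D) and V(D). The sketch below assumes, following Broder et al., that F(D) is the MIN_s sample and V(D) the MOD_m sample of the document's shingle IDs, and shows how resemblance and containment might then be estimated from the sketches alone. The function names are hypothetical.

```python
def min_s(ids, s):
    """The s smallest elements of a set of integer shingle IDs."""
    return set(sorted(ids)[:s])

def resemblance_from_F(fa, fb, s):
    """Estimate r(A, B) from MIN_s sketches F(A) and F(B)."""
    smallest = min_s(fa | fb, s)
    return len(smallest & fa & fb) / len(smallest) if smallest else 0.0

def resemblance_from_V(va, vb):
    """Estimate r(A, B) from MOD_m sketches V(A) and V(B)."""
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0

def containment_from_V(va, vb):
    """Estimate c(A, B) from MOD_m sketches V(A) and V(B)."""
    return len(va & vb) / len(va) if va else 0.0

# Toy integer IDs standing in for sampled shingle IDs.
print(resemblance_from_F({1, 2, 3, 4}, {2, 3, 5, 6}, s=4))  # 0.5
print(resemblance_from_V({0, 25, 50}, {25, 50, 75}))        # 0.5
```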

21
Web Clustering with R & C

w = 10, m = 25, s = 50?, threshold = 0.5
Pre-process documents
  For each doc, calculate a sketch
  Sort pairs of <shingle, docid>, removing
  lexically-equivalent and shingle-equivalent docs
  Compute list of doc pairs with # of shared
  shingles, ignoring very common shingles
Generate clusters
  if r(A,B) > threshold, then add link A <-> B
  Produce connected components using union-find
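
The pipeline above might be sketched as follows. This is an illustration under stated assumptions, not the authors' system: sketches are given as sets of integer shingle IDs per document, resemblance is estimated as shared / (|A| + |B| - shared), and `max_docs_per_shingle` is a hypothetical cutoff for "very common" shingles. Removal of equivalent documents is omitted.

```python
from collections import Counter
from itertools import combinations, groupby

def shared_shingle_counts(sketches, max_docs_per_shingle=None):
    """Count shared shingles per doc pair via sorted <shingle, docid> pairs."""
    pairs = sorted((sh, doc) for doc, sk in sketches.items() for sh in sk)
    counts = Counter()
    for _, group in groupby(pairs, key=lambda pair: pair[0]):
        docs = [doc for _, doc in group]
        if max_docs_per_shingle is not None and len(docs) > max_docs_per_shingle:
            continue  # ignore very common shingles
        counts.update(combinations(docs, 2))
    return counts

def cluster_documents(sketches, threshold=0.5):
    """Link docs whose estimated resemblance exceeds threshold; return components."""
    parent = {doc: doc for doc in sketches}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), shared in shared_shingle_counts(sketches).items():
        r = shared / (len(sketches[a]) + len(sketches[b]) - shared)
        if r > threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # union
    components = {}
    for doc in sketches:
        components.setdefault(find(doc), []).append(doc)
    return sorted(sorted(c) for c in components.values())

sketches = {
    "A": {1, 2, 3, 4},
    "B": {2, 3, 4, 5},
    "C": {3, 4, 5, 6},
    "D": {10, 11},
}
print(cluster_documents(sketches))  # [['A', 'B', 'C'], ['D']]
```

A and C are not linked directly (estimated resemblance 2/6), but they end up in the same cluster through B. Connected components are transitive even though resemblance is not.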

22
Web Clustering Results 1997