1
DB Seminar Series: The Semi-supervised Clustering
Problem
  • By Kevin Yip
  • 2 Jan 2004
  • (Happy New Year!)

2
First of all, an example
  • Where are the 2 clusters?

3
First of all, an example
  • Where are the 2 clusters?

What if I tell you that these two are in the same
cluster
and these two are in the same cluster?
4
First of all, an example
  • Where are the 2 clusters?

5
Observations
  • The clustering accuracy was greatly improved
    (from 50% accuracy to 100%).
  • All objects were unlabeled => classification not
    possible.
  • Only a small amount of knowledge was added (the
    relationships between 2 pairs of objects, out of
    the total 105 pairs).
  • => Semi-supervised clustering

6
Observations
  • Is it more natural to form 4 clusters instead?
  • True, but then how many clusters are there?
  • It is most natural to set k to the number of
    known classes.

7
Outline
  • The problem (no formal definitions)
  • Why not clustering? Classification?
  • Potential applications of semi-supervised
    clustering
  • Supervision: what to input?
  • When to input?
  • How to alter the clustering process?
  • Some examples
  • Future Work

8
The problem
  • Input
  • A set of unlabeled objects, each described by a
    set of attributes (numeric and/or categorical)
  • A small amount of domain knowledge
  • Output
  • A partitioning of the objects into k clusters
    (possibly with some discarded as outliers)
  • Objective
  • Maximum intra-cluster similarity
  • Minimum inter-cluster similarity
  • High consistency between the partitioning and the
    domain knowledge

9
Why not clustering?
  • The clusters produced may not be the ones
    required.
  • Algorithms that fit the cluster model may not be
    available.
  • Sometimes there are multiple possible groupings.
  • There is no way to utilize the domain knowledge
    that is accessible (active learning vs. passive
    validation).

(Guha et al., 1998)
10
Why not classification?
  • Sometimes there is insufficient labeled data
  • Objects are not labeled.
  • The number of labeled objects is statistically
    insignificant.
  • The labeled objects do not cover all classes.
  • The labeled objects of a class do not cover all
    cases (e.g. they all come from one side of a
    class).
  • The underlying class model may not fit the data
    (e.g. pattern-based similarity).

11
Potential applications
  • Bioinformatics (gene and protein clustering)
  • Document hierarchy construction
  • News/email categorization
  • Image categorization
  • Road lane detection
  • XML document clustering?
  • => Cluster properties are not well known, but some
    domain knowledge is available

12
What to input?
  • A small amount of labeled data
  • Instance-level constraints (see the representation
    sketch below)
  • Must-link: the two objects must be put in the same
    cluster.
  • Cannot-link: the two objects must not be put in
    the same cluster.
  • Objects that are similar/dissimilar to each
    other.
  • First-order logic language
  • Good cluster, bad cluster
  • General comments (e.g. the clustering is too fine)
  • Others
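For concreteness, the sketches later in this transcript assume that instance-level constraints are given simply as lists of index pairs over the dataset; this representation (and the Python/NumPy setting) is my own choice, not part of the original slides.

# Instance-level constraints as index pairs (representation assumed by later sketches).
must_link = [(0, 7), (3, 12)]    # pairs of objects that must end up in the same cluster
cannot_link = [(0, 3)]           # pairs of objects that must end up in different clusters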

13
When to input?
  • At the beginning of clustering
  • During cluster validation, to guide the next
    round of clustering
  • When the algorithm requests
  • The user may give a "don't know" response.

14
How to alter the clustering process?
  • Guide the formation of seed clusters.
  • Force/recommend some objects to be put in the
    same cluster/different clusters.
  • Modify the objective function.
  • Modify the similarity function.
  • Modify the distance matrix.

15
Examples
16
Wagstaff et al. 2001
  • Input knowledge: must-link and cannot-link
    constraints
  • Algorithm: constrained K-means (COP-KMeans); a
    minimal sketch follows below
  • 1. Generate cluster seeds.
  • 2. Assign each object in turn to the closest seed
    such that no constraint is violated (otherwise,
    report an error and halt).
  • 3. Use the centroid of each cluster as the new
    seed.
  • 4. Repeat steps 2 and 3 until convergence.
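A minimal sketch of the procedure above, assuming Euclidean distances and the pair-list constraint representation from slide 12; the helper names are mine, and this is an illustration rather than the authors' implementation.

# COP-KMeans sketch (illustrative; not the authors' code).
import numpy as np

def violates(i, c, labels, must_link, cannot_link):
    """True if assigning object i to cluster c would break a constraint."""
    for a, b in must_link:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] != -1 and labels[j] != c:
            return True
    for a, b in cannot_link:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] == c:
            return True
    return False

def cop_kmeans(X, k, must_link, cannot_link, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]        # 1. generate cluster seeds
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        labels = np.full(len(X), -1)
        for i in range(len(X)):                              # 2. constrained assignment
            for c in np.argsort(np.linalg.norm(X[i] - centers, axis=1)):
                if not violates(i, c, labels, must_link, cannot_link):
                    labels[i] = c
                    break
            else:                                            # no legal cluster: report and halt
                raise RuntimeError(f"constraints cannot be satisfied for object {i}")
        new_centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                else centers[c] for c in range(k)])  # 3. new seeds
        if np.allclose(new_centers, centers):                # 4. repeat until convergence
            break
        centers = new_centers
    return labels, centers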

17
Wagstaff et al. 2001
  • Some results
  • Soybean (47 x 35, 4 classes)
  • At most 47 x 46 / 2 = 1081 constraints
  • 100% coverage can be achieved with 43 must-links.

18
Wagstaff et al. 2001
  • Some results
  • Mushroom (50 x 21, 2 classes)

19
Wagstaff et al. 2001
  • Some results
  • Tic-tac-toe (100 x 9, 2 classes)

20
Basu et al. 2002
  • Input knowledge: a small set of labeled objects
    for each class
  • Algorithm: Seeded-KMeans (seeding sketch below)
  • 1. Use the centroid of each set of labeled objects
    as an initial seed.
  • 2. Assign each object to the closest seed.
  • 3. Use the centroid of each cluster as the new
    seed.
  • 4. Repeat steps 2 and 3 until convergence.

21
Basu et al. 2002
  • Algorithm: Constrained-KMeans
  • 1. Use the centroid of each set of labeled objects
    as an initial seed.
  • 2. Assign each object in turn to the closest seed,
    such that all labeled objects of each class are
    kept in the same cluster.
  • 3. Use the centroid of each cluster as the new
    seed.
  • 4. Repeat steps 2 and 3 until convergence.
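Below is a minimal sketch of the seeding step shared by the two algorithms (names are my own). Seeded-KMeans then proceeds as ordinary K-means from these seeds; Constrained-KMeans additionally keeps each labeled object in the cluster seeded from its class during reassignment.

# Seed computation for Seeded-/Constrained-KMeans (illustrative names).
import numpy as np

def initial_seeds(X, labeled_sets):
    """labeled_sets: one array of object indices per known class.
    Returns one initial centroid per class."""
    return np.array([X[idx].mean(axis=0) for idx in labeled_sets])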

22
Basu et al. 2002
  • Some results
  • Same-3 Newsgroups (300 x 21631, 3 classes)

23
Basu et al. 2002
  • Some results
  • Different-3 Newsgroups (300 x 21631, 3 classes)

24
Klein et al. 2002
  • In the previous two approaches, unconstrained
    objects are not much affected if the amount of
    input is small.
  • Is it possible to propagate the influence to
    neighboring objects?

25
Klein et al. 2002
  • An illustration

26
Klein et al. 2002
  • Idea: modify the distance matrix such that
    must-link objects have zero distance and
    cannot-link objects have large distances.
  • Two problems:
  • The effects are still restricted to the constrained
    objects.
  • The resulting distance matrix may not satisfy the
    triangle inequality.
  • Solutions:
  • Propagate the must-link influence by running an
    all-pairs shortest-paths algorithm (see the sketch
    below).
  • Propagate the cannot-link influence implicitly
    during clustering.
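A sketch of the must-link propagation just described: must-linked pairs are given distance zero, and an all-pairs shortest-paths pass (Floyd-Warshall here; the concrete algorithm is my assumption) restores the triangle inequality while pulling the neighbourhoods of linked objects closer together.

# Must-link propagation via all-pairs shortest paths (Floyd-Warshall), as a sketch.
import numpy as np

def propagate_must_links(D, must_link):
    """D: symmetric distance matrix; must_link: iterable of (i, j) index pairs."""
    D = D.copy()
    for i, j in must_link:
        D[i, j] = D[j, i] = 0.0                     # collapse must-linked pairs
    for k in range(len(D)):                         # shortest paths restore metricity and
        D = np.minimum(D, D[:, [k]] + D[[k], :])    # spread the effect to nearby objects
    return D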

27
Klein et al. 2002
  • Must-link propagation

(Figure: objects A, B, C, D illustrating must-link propagation)
28
Klein et al. 2002
  • Must-link propagation

(Figure: objects A, B, C, D illustrating must-link propagation)
29
Klein et al. 2002
  • Must-link propagation

(Figure: objects A, B, C, D illustrating must-link propagation)
30
Klein et al. 2002
  • Cannot-link propagation
  • Similarly, if there is a cannot-link constraint
    on a pair of objects, the distance between the
    objects is set to a certain large value.
  • But it is harder to fix the triangle inequality.
  • If A and B must link, they will be given the same
    coordinates.
  • But if A and B cannot link, what coordinates
    should be given to them?
  • In general, it is desirable for the new distance
    matrix to be as similar as possible to the original
    one, but this poses a difficult optimization
    problem.

31
Klein et al. 2002
  • Cannot-link propagation
  • Solution: do not propagate explicitly; instead,
    choose an algorithm that implicitly produces the
    same effect during the clustering process.
  • E.g. the complete-link hierarchical algorithm
  • Distance between two clusters: the longest distance
    between two objects, one taken from each cluster
  • If A is cannot-linked with B, then when A merges
    with C, C is effectively cannot-linked with B as
    well (see the sketch below).
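A small sketch of why complete linkage achieves the implicit propagation: when cluster A, which carries a huge distance to B, merges with C, the merged cluster keeps that huge distance, so C is effectively cannot-linked with B as well. Function and variable names are illustrative.

# Complete-link merging keeps a large cannot-link distance alive (sketch).
import numpy as np

def complete_link_merge(D, a, b):
    """Merge clusters a and b under complete linkage; returns the reduced matrix."""
    d_new = np.maximum(D[a], D[b])                  # longest distance to every other cluster
    D = D.copy()
    D[a], D[:, a] = d_new, d_new
    D[a, a] = 0.0
    return np.delete(np.delete(D, b, axis=0), b, axis=1)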

32
Klein et al. 2002
  • Active learning
  • Up to now, we have assumed all constraints are
    provided at the beginning.
  • Is it possible for the algorithm to request useful
    constraints?
  • Advantages
  • Fewer constraints are needed.
  • The constraints obtained are the most useful ones.
  • Disadvantages
  • Queries must be answered on the fly.
  • Not all queries can be answered.

33
Klein et al. 2002
  • Active learning
  • Allow the algorithm to ask at most m questions.
  • Observations:
  • In hierarchical clustering, errors usually occur
    at later stages.
  • In complete-link clustering, the merged-cluster
    distance is monotonically non-decreasing.
  • Idea: estimate a distance threshold such that all
    merges involving clusters at or beyond that
    distance are guided by the m questions.

34
Klein et al. 2002
  • Experiments
  • Algorithms
  • CCL (distance matrix learning)
  • CCL-Active (CCL + active learning)
  • COP-KMeans (forced relationships)

35
Klein et al. 2002
  • Results on synthetic data (held-out acc.)

36
Klein et al. 2002
  • Results on synthetic data (held-out acc.)

37
Klein et al. 2002
  • Results on real data (held-out acc.)
  • Iris (150 x 4, 3 classes)
  • Crabs (200 x 5, 2 classes)
  • Soybean (562 x 35, 15 classes)

38
Klein et al. 2002
  • Effect of constraints

39
Xing et al. 2003
  • Is it possible to propagate the effects of
    constraints by transforming the distance function
    instead of the distance matrix?
  • Mahalanobis distance: d(x, y) =
    sqrt((x - y)^T A (x - y))
  • When A = I (the identity matrix), d(x, y) is the
    Euclidean distance between x and y.
  • When A is a diagonal matrix, d(x, y) is the
    weighted Euclidean distance between x and y.
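A one-function sketch of the parameterized distance used here (the function name d_A is mine); with A = I it reduces to the Euclidean distance, and with a diagonal A to a weighted Euclidean distance.

# Parameterized (Mahalanobis-style) distance d_A(x, y) = sqrt((x - y)^T A (x - y)).
import numpy as np

def d_A(x, y, A):
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ A @ diff))

# d_A(x, y, np.eye(len(x))) is the ordinary Euclidean distance between x and y.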

40
Xing et al. 2003
  • Input knowledge: a set S of object pairs that are
    similar.
  • Objective: find a matrix A that minimizes the sum
    of ||x - y||_A^2 over the pairs (x, y) in S,
    subject to the constraints that
  • the sum of ||x - y||_A over a set D of dissimilar
    pairs (pairs not known to be similar) is at least 1
    (to avoid the trivial solution A = 0)
  • A is positive semi-definite (to ensure the triangle
    inequality and d(x, y) >= 0 for all x, y)
  • (Is it also needed to minimize the deviation of
    the resulting distance matrix from the original
    one?)

41
Xing et al. 2003
  • Case 1: A is a diagonal matrix
  • It can be shown that the problem is equivalent to
    minimizing the function g(A) = sum over S of
    ||x - y||_A^2 - log(sum over D of ||x - y||_A),
    which can be solved by the Newton-Raphson method
    (see the sketch below).
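A sketch of the diagonal-case objective as reconstructed above (the slide's own formula was lost); the pair representation as difference vectors and the function name are my assumptions.

# Diagonal-case objective g(a) = sum_S ||x - y||_a^2 - log(sum_D ||x - y||_a)  (sketch).
import numpy as np

def g_diag(a, S_diffs, D_diffs):
    """a: diagonal entries of A; S_diffs/D_diffs: arrays of (x - y) vectors for the
    similar and dissimilar pairs."""
    sim = np.sum((S_diffs ** 2) @ a)                # sum of squared a-distances over S
    dis = np.sum(np.sqrt((D_diffs ** 2) @ a))       # sum of a-distances over D
    return sim - np.log(dis)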

42
Xing et al. 2003
  • Case 2: A is not a diagonal matrix
  • Newton's method becomes very expensive (O(n^6)).
  • Use a gradient descent method to optimize an
    equivalent formulation of the problem.
  • In both cases, metric learning is performed
    before the clustering process.

43
Xing et al. 2003
  • Visualization of the power of metric learning
    (with 100% input)

44
Xing et al. 2003
  • Experiments
  • Algorithms
  • KMeans
  • KMeans + learned diagonal A
  • KMeans + learned full A
  • COP-KMeans
  • COP-KMeans + learned diagonal A
  • COP-KMeans + learned full A
  • Input size:
  • Little domain knowledge: number of resulting
    connected components (Kc) = 90% of dataset size
  • Much domain knowledge: Kc = 70% of dataset size

45
Xing et al. 2003
  • Some results on real data

46
Basu et al. 2003
  • Main contributions
  • Combine metric learning and constraint
    enforcement.
  • Soft constraints: constraint violations incur
    penalties instead of halting the algorithm.
  • Introduce a new active learning method.

47
Basu et al. 2003
  • The generalized K-Means model
  • Maximizing the complete-data log-likelihood is
    equivalent to minimizing an objective function that
    sums, over all objects, the distortion between the
    object and the representative of its cluster.
  • Can be optimized (locally) by an EM algorithm.

48
Basu et al. 2003
  • Adding constraint violation penalties (see the
    sketch below)
  • M / C: the sets of must-link / cannot-link
    constraints
  • Per-constraint weights based on external knowledge
  • 1[true] = 1, 1[false] = 0 (the indicator function)
  • FM, FC: weight functions based on the distance
    between the objects (should FM(x, y) be larger or
    smaller when d(x, y) is larger?)
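A sketch of a constraint-penalized objective in the spirit of this slide: the usual K-means distortion plus a penalty for each violated must-link or cannot-link. For brevity the weights are constants here rather than the slide's external-knowledge and distance-based weight functions.

# Constraint-penalized K-means objective (simplified sketch with constant weights).
import numpy as np

def penalized_objective(X, labels, centroids, M, C, w=1.0, w_bar=1.0):
    distortion = sum(np.sum((X[labels == c] - mu) ** 2)
                     for c, mu in enumerate(centroids))
    ml_penalty = sum(w for i, j in M if labels[i] != labels[j])       # violated must-links
    cl_penalty = sum(w_bar for i, j in C if labels[i] == labels[j])   # violated cannot-links
    return distortion + ml_penalty + cl_penalty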

49
Basu et al. 2003
  • A simplified objective function
  • Algorithm (PCKMeans):
  • 1. From M and C, identify the pre-specified
    connected components (see the seeding sketch
    below).
  • 2. Use the centroids of the largest connected
    components as the initial seeds. If there are not
    enough seeds, use objects that are cannot-linked
    to all connected components. If still not enough,
    use random seeds.
  • 3. Perform KMeans with the modified objective
    function.
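A sketch of the seeding from must-link connected components, using SciPy's graph utilities (an implementation choice of mine); the fallbacks for too few components described above are omitted.

# PCKMeans seeding from must-link connected components (sketch; SciPy assumed).
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def pckmeans_seeds(X, must_link, k):
    n = len(X)
    rows, cols = zip(*must_link) if must_link else ((), ())
    graph = coo_matrix((np.ones(len(rows)), (list(rows), list(cols))), shape=(n, n))
    _, comp = connected_components(graph, directed=False)
    largest = np.argsort(np.bincount(comp))[::-1][:k]       # largest components first
    return np.array([X[comp == c].mean(axis=0) for c in largest])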

50
Basu et al. 2003
  • A new active learning algorithm: ask questions as
    early as possible
  • Explore step: form k initial neighborhoods.
    Observation: k-means centers are usually remote
    from each other. Procedure (see the sketch below):
  • 1. Draw the point that is farthest from all
    previously drawn points.
  • 2. Ask at most k questions to determine which
    neighborhood the point should be assigned to.
  • 3. Repeat until k neighborhoods are formed (or no
    more questions are allowed).
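A sketch of the Explore step, assuming Euclidean distances and a user oracle query(i, j) that answers "must" or "cannot" for a pair of object indices (both assumptions of mine).

# Explore step: farthest-first traversal with at most k questions per drawn point (sketch).
import numpy as np

def explore(X, k, query):
    neighborhoods = [[0]]                                   # start from an arbitrary object
    while len(neighborhoods) < k:
        placed = [i for nb in neighborhoods for i in nb]
        dists = np.linalg.norm(X[:, None] - X[placed][None], axis=2).min(axis=1)
        x = int(np.argmax(dists))                           # farthest from all drawn points
        for nb in neighborhoods:                            # at most k questions
            if query(x, nb[0]) == "must":
                nb.append(x)
                break
        else:
            neighborhoods.append([x])                       # cannot-linked to all: new seed
    return neighborhoods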

51
Basu et al. 2003
  • Consolidate step: form k clusters. Observation:
    each object is close to its corresponding seed.
    Procedure (see the sketch below):
  • 1. Use the centroids of the k neighborhoods as the
    initial seeds (all objects in a neighborhood are
    immediately assigned to that cluster).
  • 2. Pick an unassigned object x.
  • 3. Starting from the closest seed, ask for the
    relationship between x and an object in the
    cluster until the response is a must-link.
  • 4. Repeat to assign all objects (if questions run
    out, revert to normal KMeans).
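A matching sketch of the Consolidate step with the same assumed query oracle; the question budget and the fallback to plain K-means are not modelled.

# Consolidate step: query each remaining object against neighborhoods, closest first (sketch).
import numpy as np

def consolidate(X, neighborhoods, query):
    centroids = np.array([X[nb].mean(axis=0) for nb in neighborhoods])
    labels = np.full(len(X), -1)
    for c, nb in enumerate(neighborhoods):
        labels[nb] = c                                      # neighborhood members stay put
    for i in np.where(labels == -1)[0]:
        for c in np.argsort(np.linalg.norm(X[i] - centroids, axis=1)):
            if query(i, neighborhoods[c][0]) == "must":     # ask, starting from closest seed
                labels[i] = c
                break
    return labels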

52
Basu et al. 2003
  • Some experiment results
  • News-all20 (2000 x 16089, 20 classes) (left)
  • News-diff3 (300 x 3251, 3 classes) (right)

53
Discussions and Future Work
  • Special clustering problems
  • How can input knowledge help dimension selection
    in projected clustering / pattern-based
    clustering?
  • Is it possible to refine the produced clusters
    when new knowledge becomes available?
  • Other uses of input knowledge
  • Can input knowledge help determine algorithm
    parameter values / stopping conditions?

54
Discussions and Future Work
  • Clustering algorithm
  • What kind of algorithms could potentially benefit
    most from knowledge inputs?
  • Should we work on the constrained objects first,
    or mix them with the other objects from the
    beginning? (Induction vs. transduction)
  • Performance concerns
  • How is the speed/space performance affected by
    the introduction of external knowledge?
  • How will the paradigm be affected if there is a
    limited amount of memory?

55
Discussions and Future Work
  • Active learning
  • What to ask? Questions that are difficult for
    algorithms but easy for users.
  • When to ask? When there is a major consequence.
  • Noise
  • How to handle contradictory constraints /
    constraints that may not be 100% correct?
  • How much should be charged for a constraint
    violation?
  • What is the significance of knowledge inputs when
    the data is noisy?

56
Discussions and Future Work
  • Debate questions (from ICML 2003 Workshop
    homepage)
  • Unlabeled data is only useful when there are a
    large number of redundant features.
  • Why doesn't The No Free Lunch Theorem apply when
    working with unlabeled data?
  • Unlabeled data has to come from the same
    underlying distribution as the labeled data.
  • Can unlabeled data be used in temporal domains?
  • Feature engineering is more important than
    algorithm design for semi-supervised learning.
  • All the interesting problems in semi-supervised
    learning have been identified.
  • Active learning is an interesting "academic"
    problem.

57
Discussions and Future Work
  • Debate questions (from ICML 2003 Workshop
    homepage)
  • Active learning research without user interface
    design is only solving half the problem.
  • Using Unlabeled data in Data Mining is no
    different than using it in Machine Learning.
  • Massive data sets pose problems when using
    current semi-supervised algorithms.
  • Off-the-shelf data mining software incorporating
    labeled and unlabeled data is a fantasy.
  • Unlabeled data is only useful when the classes
    are well separated.

58
References
  • Ayhan Demiriz, K. P. Bennett and M. J. Embrechts,
    Semi-supervised Clustering using Genetic
    Algorithms, ANNIE 1999.
  • Luis Talavera and Javier Bejar, Integrating
    Declarative Knowledge in Hierarchical Clustering
    Tasks, ISIDA 1999.
  • David Cohn, Rich Caruana and Andrew McCallum,
    Semi-supervised Clustering with User Feedback,
    Unpublished Work 2000.
  • Kiri Wagstaff and Claire Cardie, Clustering with
    Instance-level Constraints, ICML 2000.
  • Kiri Wagstaff, Claire Cardie, Seth Rogers and
    Stefan Schroedl, Constrained K-means Clustering
    with Background Knowledge, ICML 2001.

59
References
  • Sugato Basu, Arindam Banerjee and Raymond Mooney,
    Semi-supervised Clustering by Seeding, ICML 2002.
  • Dan Klein, Sepandar D. Kamvar and Christopher D.
    Manning, From Instance-level Constraints to
    Space-level Constraints: Making the Most of Prior
    Knowledge in Data Clustering, ICML 2002.
  • Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and
    Stuart Russell, Distance Metric Learning, with
    Application to Clustering with Side-information,
    Advances in Neural Information Processing Systems
    15, 2003.

60
References
  • Sugato Basu, Mikhail Bilenko and Raymond J.
    Mooney, Comparing and Unifying Search-Based and
    Similarity-Based Approaches to Semi-Supervised
    Clustering, ICML-2003 Workshop on the Continuum
    from Labeled to Unlabeled Data in Machine
    Learning and Data Mining.
  • Sugato Basu, Arindam Banerjee and Raymond J.
    Mooney, Active Semi-Supervision for Pairwise
    Constrained Clustering, not yet published, 2003.