Title: Integrating Constraints and Metric Learning in Semi-Supervised Clustering
1Integrating Constraints and Metric Learning in
Semi-Supervised Clustering
- Mikhail Bilenko, Sugato Basu, Raymond J. Mooney
- ICML 2004
- Presented by Xin Li
2Semi-Supervised Clustering
- K = 4
3Semi-Supervised Clustering
4Semi-Supervised Clustering
5How to exploit supervision in clustering
- Incorporate supervision as constraints
- Learn a distance metric using supervision
- Integration of these two approaches
6K-means Clustering
- $X = \{x_1, x_2, \ldots\}$
- $L = \{l_1, l_2, \ldots, l_k\}$
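- Euclidean distance: $\|x_i - \mu_{l_i}\|^2$, where $\mu_{l_i}$ is the centroid of the cluster assigned to $x_i$
- Minimizing $\sum_{x_i \in X} \|x_i - \mu_{l_i}\|^2$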
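As a rough illustration of the quantity K-means minimizes, the sketch below computes the sum of squared Euclidean distances from each point to the centroid of its assigned cluster, together with one centroid update. The function names (`sq_euclidean`, `kmeans_objective`, `update_centroids`) and the toy data are assumptions made for this example and do not come from the slides.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_euclidean(x: Point, y: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans_objective(points: Sequence[Point], labels: Sequence[int],
                     centroids: Sequence[Point]) -> float:
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_euclidean(x, centroids[l]) for x, l in zip(points, labels))


def update_centroids(points: Sequence[Point], labels: Sequence[int],
                     k: int) -> List[List[float]]:
    """Recompute each centroid as the mean of the points assigned to it."""
    dim = len(points[0])
    centroids = []
    for h in range(k):
        members = [x for x, l in zip(points, labels) if l == h]
        if not members:
            raise ValueError(f"cluster {h} has no points")
        centroids.append([sum(x[d] for x in members) / len(members)
                          for d in range(dim)])
    return centroids


# Hypothetical toy data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centroids = update_centroids(points, labels, k=2)
objective = kmeans_objective(points, labels, centroids)
```

Alternating the assignment of points to their nearest centroid with this centroid update never increases the objective, which is why plain K-means converges to a local minimum.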
7Clustering with constraints
- Pairwise constraints
- M: must-link pairs
- (xi, xj) should be in the same cluster
- C: cannot-link pairs
- (xi, xj) should be in different clusters
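To make the two constraint types concrete, here is a minimal sketch that checks a cluster assignment against sets M and C and reports which pairs are violated. Representing constraints as pairs of point indices, and the name `violated_constraints`, are assumptions chosen for this example.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]


def violated_constraints(labels: Sequence[int],
                         must_link: Sequence[Pair],
                         cannot_link: Sequence[Pair]
                         ) -> Tuple[List[Pair], List[Pair]]:
    """Return the must-link and cannot-link pairs broken by `labels`."""
    # A must-link pair is violated when its points land in different clusters.
    broken_m = [(i, j) for i, j in must_link if labels[i] != labels[j]]
    # A cannot-link pair is violated when its points share a cluster.
    broken_c = [(i, j) for i, j in cannot_link if labels[i] == labels[j]]
    return broken_m, broken_c


# Hypothetical assignment of four points to two clusters.
labels = [0, 0, 1, 1]
must_link = [(0, 1), (1, 2)]
cannot_link = [(0, 3), (2, 3)]
broken_m, broken_c = violated_constraints(labels, must_link, cannot_link)
```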
8Learning a pairwise distance metric
- Binary classification: (xi, xj) → 0/1
- M → positive examples
- (xi, xj) are in the same cluster
- C → negative examples
- (xi, xj) are in different clusters
- Apply the learned distance metric in clustering
- Metric learning and clustering are carried out as separate, disjoint steps
9Unsupervised Clustering with Metric Learning
- Learn a distance metric that optimizes a clustering quality function
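The slide's equation did not survive extraction. A quality function of this kind, written here in the form believed to match the paper (treat the exact notation as an assumption), generalizes the K-means objective by giving each cluster $h$ its own weight matrix $A_h$ and adding a log-determinant normalizer so that the metric cannot shrink all distances to zero:

```latex
\mathcal{J} = \sum_{x_i \in X}
  \left( \|x_i - \mu_{l_i}\|^2_{A_{l_i}} - \log\det A_{l_i} \right),
\qquad
\|x - \mu\|^2_{A} = (x - \mu)^{\top} A \,(x - \mu)
```

Setting every $A_h$ to the identity matrix recovers the ordinary sum of squared Euclidean distances.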
10Integrating Constraints and Metric Learning
Combining the previous two equations leads to the
following objective function, which minimizes
cluster dispersion under the learned metrics
while reducing constraint violations.
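The combined objective itself is missing from the extracted slide. A reconstruction in the form believed to match the paper is given below; the symbols $w_{ij}$ and $\bar{w}_{ij}$ (constraint weights), $f_M$ and $f_C$ (penalty functions), $\mu_{l_i}$ (centroid of the cluster of $x_i$), and $A_{l_i}$ (that cluster's metric) are notation assumed here rather than recovered from the slide.

```latex
\mathcal{J}_{\mathrm{mpckm}} =
  \sum_{x_i \in X}
    \left( \|x_i - \mu_{l_i}\|^2_{A_{l_i}} - \log\det A_{l_i} \right)
  + \sum_{(x_i, x_j) \in M} w_{ij}\, f_M(x_i, x_j)\, \mathbb{1}[l_i \neq l_j]
  + \sum_{(x_i, x_j) \in C} \bar{w}_{ij}\, f_C(x_i, x_j)\, \mathbb{1}[l_i = l_j]
```

The first sum is the metric-weighted cluster dispersion; the second and third sums charge a penalty only for must-link pairs placed in different clusters and cannot-link pairs placed in the same cluster.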
11Penalty for violating constraints
- The penalty for violating a must-link constraint
between distant points should be higher than that
between nearby points.
- The penalty for violating a cannot-link constraint
between nearby points should be higher than that
between distant points.
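One way to obtain penalties with these properties is sketched below, assuming diagonal metrics stored as lists of per-dimension weights. The must-link penalty averages the distance under the two clusters' metrics, and the cannot-link penalty subtracts the pair's distance from the largest distance in the data set; this follows the form believed to be used in the paper, and the function names are hypothetical.

```python
from typing import Sequence

Point = Sequence[float]


def weighted_sq_dist(x: Point, y: Point, a: Sequence[float]) -> float:
    """Squared distance under a diagonal metric with weights `a`."""
    return sum(w * (p - q) ** 2 for w, p, q in zip(a, x, y))


def max_pairwise_sq_dist(points: Sequence[Point], a: Sequence[float]) -> float:
    """Largest weighted squared distance between any two points."""
    return max(weighted_sq_dist(p, q, a) for p in points for q in points)


def must_link_penalty(xi: Point, xj: Point,
                      a_i: Sequence[float], a_j: Sequence[float]) -> float:
    """Grows with distance: far-apart must-link violations cost more."""
    return 0.5 * weighted_sq_dist(xi, xj, a_i) + 0.5 * weighted_sq_dist(xi, xj, a_j)


def cannot_link_penalty(xi: Point, xj: Point,
                        a: Sequence[float], max_sq_dist: float) -> float:
    """Shrinks with distance: nearby cannot-link violations cost more."""
    return max_sq_dist - weighted_sq_dist(xi, xj, a)


# Hypothetical data on a line, with an unweighted (identity) metric.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
a = [1.0, 1.0]
far = max_pairwise_sq_dist(points, a)

ml_near = must_link_penalty(points[0], points[1], a, a)
ml_far = must_link_penalty(points[0], points[2], a, a)
cl_near = cannot_link_penalty(points[0], points[1], a, far)
cl_far = cannot_link_penalty(points[0], points[2], a, far)
```

Subtracting from the maximum distance keeps the cannot-link penalty non-negative for every pair in the data set.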
12MPCK-MEANS Algorithm
- Constraints are utilized during cluster
initialization and when assigning points to
clusters.
- The distance metric is adapted by re-estimating
the weights in the matrices $A_h$.
13Initialization
- An initial guess of the clusters.
- Assign each point x to one of K clusters in a way
that satisfies the constraints.
- Compute the centroid of each cluster.
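A simplified sketch of such an initialization follows. It takes the transitive closure of the must-link pairs with a union-find structure, treats each resulting group as a neighborhood, and uses the centroids of the K largest neighborhoods as initial cluster centers. This handles only must-link constraints; how cannot-link constraints and leftover clusters are handled is not stated on the slide and is left out here. All names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]
Pair = Tuple[int, int]


def must_link_neighborhoods(n_points: int,
                            must_link: Sequence[Pair]) -> List[List[int]]:
    """Group point indices connected by chains of must-link constraints."""
    parent = list(range(n_points))

    def find(i: int) -> int:
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in must_link:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups: Dict[int, List[int]] = {}
    for i in range(n_points):
        groups.setdefault(find(i), []).append(i)
    # Points with no must-link partner do not form a neighborhood.
    return [g for g in groups.values() if len(g) > 1]


def initial_centroids(points: Sequence[Point], must_link: Sequence[Pair],
                      k: int) -> List[List[float]]:
    """Centroids of the k largest neighborhoods (fewer if not enough exist)."""
    hoods = sorted(must_link_neighborhoods(len(points), must_link),
                   key=len, reverse=True)[:k]
    dim = len(points[0])
    return [[sum(points[i][d] for i in hood) / len(hood) for d in range(dim)]
            for hood in hoods]


# Hypothetical data: points 0-2 and points 3-4 are linked; point 5 is free.
points = [(0.0, 0.0), (0.0, 2.0), (0.0, 4.0),
          (10.0, 0.0), (10.0, 2.0), (5.0, 5.0)]
must_link = [(0, 1), (1, 2), (3, 4)]
hoods = must_link_neighborhoods(len(points), must_link)
centroids = initial_centroids(points, must_link, k=2)
```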
14E-step
- Every point x is assigned to the cluster that
minimizes the sum of the distance of x to the
cluster centroid according to the local metric
and the cost of any constraint violations
incurred by the cluster assignment.
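The assignment rule can be sketched for a single point as follows, assuming diagonal per-cluster metrics, a single constraint weight `w` shared by all constraints, and penalties that grow with distance for must-link violations and shrink with distance for cannot-link violations. The function `assign_point` and its arguments are hypothetical names for this example.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Sequence[float]
Pair = Tuple[int, int]


def weighted_sq_dist(x: Point, y: Point, a: Sequence[float]) -> float:
    """Squared distance under a diagonal metric with weights `a`."""
    return sum(w * (p - q) ** 2 for w, p, q in zip(a, x, y))


def assign_point(i: int, points: Sequence[Point], labels: Dict[int, int],
                 centroids: Sequence[Point], metrics: Sequence[Sequence[float]],
                 must_link: Sequence[Pair], cannot_link: Sequence[Pair],
                 w: float = 1.0) -> int:
    """Pick the cluster minimizing distance plus constraint penalties.

    `labels` maps already-assigned point indices to their clusters.
    """
    best_h, best_cost = -1, math.inf
    for h, (mu, a) in enumerate(zip(centroids, metrics)):
        # Distance to the centroid under cluster h's metric, minus
        # log det(A_h), which for a diagonal matrix is the sum of log weights.
        cost = weighted_sq_dist(points[i], mu, a) - sum(math.log(v) for v in a)
        for p, q in must_link:
            if i not in (p, q):
                continue
            j = q if p == i else p
            if j in labels and labels[j] != h:  # must-link would be broken
                a_j = metrics[labels[j]]
                cost += w * 0.5 * (weighted_sq_dist(points[i], points[j], a)
                                   + weighted_sq_dist(points[i], points[j], a_j))
        for p, q in cannot_link:
            if i not in (p, q):
                continue
            j = q if p == i else p
            if j in labels and labels[j] == h:  # cannot-link would be broken
                far = max(weighted_sq_dist(u, v, a)
                          for u in points for v in points)
                cost += w * (far - weighted_sq_dist(points[i], points[j], a))
        if cost < best_cost:
            best_h, best_cost = h, cost
    return best_h


# Hypothetical data: point 4 lies closer to cluster 0 than to cluster 1.
points = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0), (4.0, 0.0)]
labels = {0: 0, 1: 0, 2: 1, 3: 1}
centroids = [(0.5, 0.0), (9.5, 0.0)]
metrics = [[1.0, 1.0], [1.0, 1.0]]

unconstrained = assign_point(4, points, labels, centroids, metrics, [], [])
with_must_link = assign_point(4, points, labels, centroids, metrics, [(4, 3)], [])
with_cannot_link = assign_point(4, points, labels, centroids, metrics, [], [(4, 0)])
```

In this toy setup the point joins the nearer cluster when unconstrained, but either a must-link to a point in cluster 1 or a cannot-link to a point in cluster 0 is enough to move it to cluster 1.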
15M-Step
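The update equations for this slide did not survive extraction. The sketch below is a simplified stand-in, not the slide's content: it assumes a diagonal metric per cluster and ignores the constraint-violation terms. Under those assumptions, minimizing the metric-weighted dispersion minus the log-determinant term gives each centroid as the mean of its cluster and each weight as $a_{hd} = N_h / \sum_{x_i \in X_h} (x_{id} - \mu_{hd})^2$, where $N_h$ is the number of points in cluster $h$. The function name `m_step` is hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def m_step(points: Sequence[Point], labels: Sequence[int],
           k: int) -> Tuple[List[List[float]], List[List[float]]]:
    """Re-estimate centroids and diagonal metric weights for each cluster.

    Simplification: constraint-violation terms are omitted from the
    weight update, so each weight is the inverse per-dimension variance.
    """
    dim = len(points[0])
    centroids, metrics = [], []
    for h in range(k):
        members = [x for x, l in zip(points, labels) if l == h]
        if not members:
            raise ValueError(f"cluster {h} has no points")
        n_h = len(members)
        mu = [sum(x[d] for x in members) / n_h for d in range(dim)]
        scatter = [sum((x[d] - mu[d]) ** 2 for x in members) for d in range(dim)]
        # Fallback of 1.0 for zero scatter is an arbitrary choice for this sketch.
        a = [n_h / s if s > 0 else 1.0 for s in scatter]
        centroids.append(mu)
        metrics.append(a)
    return centroids, metrics


# Hypothetical cluster that is four times more spread out in dimension 1.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 4.0), (2.0, 4.0)]
labels = [0, 0, 0, 0]
centroids, metrics = m_step(points, labels, k=1)
```

Dimensions along which a cluster is tightly packed receive large weights, so the learned metric stretches exactly those directions in which cluster members agree.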
16Experimental Setting
17Single Metric, Diagonal Matrix A
18Single Metric, Diagonal Matrix A
19Multiple Metrics, Full Matrix A
20Multiple Metrics, Full Matrix A
21Conclusion and Discussion
- This paper has presented MPCK-MEANS, a new
approach to semi-supervised clustering.
- Supervision and metric learning are helpful in
clustering, and multiple distance metrics are not
necessary in most cases.
- Question 1: If we have supervision in
clustering, why not use that supervision in the
same way as in a typical classification task?
- Question 2: If there is an infinite number of
classes, can we gain from supervision on only
some of them?