Constrained Clustering - PowerPoint PPT Presentation

Provided by: bob483
Transcript and Presenter's Notes

Title: Constrained Clustering

1
Constrained Clustering
INFS 795

Bojun Yan

2
Outline

1. Introduction and background knowledge
2. Constraint-based methods
a. COP-KMeans algorithm
b. Seeded-KMeans
c. Constrained-KMeans
3. Metric-based methods
a. Mahalanobis distance trained using
convex optimization

3
1. Introduction and background knowledge

In many learning tasks, there is a large supply
of unlabeled data but insufficient labeled data
since it can be expensive to generate.
Semi-supervised learning combines labeled and
unlabeled data during training to improve
performance. Semi-supervised learning is
applicable to both classification and clustering.

In supervised classification, there is a known,
fixed set of categories and category-labeled
training data is used to induce a classification
function.
In semi-supervised classification, training also
exploits additional unlabeled data, frequently
resulting in a more accurate classification
function (Blum & Mitchell, 1998; Ghahramani &
Jordan, 1994).

In unsupervised clustering, an unlabeled dataset
is partitioned into groups of similar examples,
typically by optimizing an objective function
that characterizes good partitions.
In semi-supervised clustering, some labeled data
is used along with the unlabeled data to obtain a
better clustering.

Clustering algorithms are generally used in an
unsupervised fashion. The algorithm has access
only to the set of features describing each
object. It is not given any information (e.g.
labels) as to where each instance should be
placed within the partition.
However, in real application domains, it is often
the case that the experimenter possesses some
background knowledge (about the domain or the
dataset) that could be useful in clustering the
data.

Traditional clustering algorithms have no way to
take advantage of this information even when it
does exist.
Semi-supervised clustering integrates background
information into the clustering algorithm. Its
focus is on clustering large amounts of
unlabeled data in the presence of a small amount
of supervised data.

Some real-world tasks
a. similar text searching
b. image retrieval
c. speaker identification in a conversation
d. visual correspondence in multiview
image processing

9
2. Constraint-based methods

a. COP-KMeans algorithm
b. Seeded-KMeans
c. Constrained-KMeans

10
a. COP-KMeans algorithm

In COP-KMeans, the initial background knowledge,
provided in the form of constraints between
instances in the dataset, is used in the
clustering process.
There are two types of constraints, must-link
(two instances have to be together in the same
cluster) and cannot-link (two instances have to
be in different clusters).
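
The two constraint types can be checked with a small helper. The following is a minimal sketch, not the algorithm from the slides: the function names, the pair-list representation of constraints, and the `assignment` dictionary are all assumptions made for illustration.

```python
def partner(pair, point):
    """Return the other member of `pair` if `point` is in it, else None."""
    a, b = pair
    if a == point:
        return b
    if b == point:
        return a
    return None


def violates_constraints(point, cluster, assignment, must_link, cannot_link):
    """Return True if placing `point` in `cluster` breaks any constraint.

    assignment  -- dict mapping already-placed instances to cluster ids
    must_link   -- iterable of (a, b) pairs that must share a cluster
    cannot_link -- iterable of (a, b) pairs that must be separated
    """
    for pair in must_link:
        other = partner(pair, point)
        # A must-link partner already placed in another cluster is a violation.
        if other is not None and other in assignment and assignment[other] != cluster:
            return True
    for pair in cannot_link:
        other = partner(pair, point)
        # A cannot-link partner already placed in this cluster is a violation.
        if other is not None and other in assignment and assignment[other] == cluster:
            return True
    return False
```

A constrained assignment step could call such a check for each candidate cluster, nearest first, and take the first cluster for which it returns False.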

The must-link constraints define a transitive
binary relation over the instances. Consequently,
when making use of a set of constraints (of both
kinds), we take a transitive closure over the
constraints. The full set of derived constraints
is then presented to the clustering algorithm.
In general, constraints may be derived from
partially labeled data or from background
knowledge about the domain or dataset.

How to compute the transitive closure?
If di must-link dj, dj must-link dh, dl must-link
dk, and di cannot-link dl, then the transitive
closure is derived as follows:
di must-link dj
dj must-link dh  =>  di must-link dh
dl must-link dk

di cannot-link dl  =>  dj cannot-link dl
                       dh cannot-link dl
                       di cannot-link dk
                       dj cannot-link dk
                       dh cannot-link dk
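
The derivation above can be sketched in code. This is one possible approach under stated assumptions, not the method from the slides: must-link pairs are grouped into connected components with a small union-find, and each cannot-link pair is then extended to every pair drawn from the two components involved. The function name and the use of `frozenset` pairs are illustrative choices.

```python
from itertools import combinations


def transitive_closure(must_link, cannot_link):
    """Return (ml, cl): sets of frozenset pairs closed under transitivity."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving; unseen items become roots.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in must_link:
        parent[find(a)] = find(b)
    for a, b in cannot_link:
        find(a)
        find(b)

    # Collect the members of each must-link component.
    components = {}
    for x in list(parent):
        components.setdefault(find(x), set()).add(x)

    # Every pair inside a component is must-linked.
    ml = set()
    for members in components.values():
        ml.update(frozenset(p) for p in combinations(members, 2))

    # A cannot-link between two components applies to all cross pairs.
    cl = set()
    for a, b in cannot_link:
        ra, rb = find(a), find(b)
        if ra == rb:
            raise ValueError("inconsistent constraints: %r and %r" % (a, b))
        for u in components[ra]:
            for v in components[rb]:
                cl.add(frozenset((u, v)))
    return ml, cl
```

Applied to the example, `transitive_closure([("di", "dj"), ("dj", "dh"), ("dl", "dk")], [("di", "dl")])` yields the derived pair di must-link dh and all six cannot-link pairs between {di, dj, dh} and {dl, dk}.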

15
b. Seeded-KMeans
c. Constrained-KMeans
16
Seeding
19

In Seeded-KMeans, the seed clustering is used
to initialize the KMeans algorithm. Thus, rather
than initializing KMeans from K random means, the
mean of the l-th cluster is initialized with the
mean of the l-th partition S_l of the seed set.
The seed clustering is only used for
initialization, and the seeds are not used in the
following steps of the algorithm.
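
The seeding step can be sketched as follows. This is a minimal pure-Python illustration, assuming points are tuples of numbers, that `seeds` is a list of K non-empty lists (the seed partitions), and that the seed points are also included in `points`. The function names and the fixed iteration count `n_iter` are assumptions, not part of the slides.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centers):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda l: sq_dist(p, centers[l]))


def seeded_kmeans(points, seeds, n_iter=20):
    # Initialization: the l-th center is the mean of the l-th seed partition.
    centers = [mean(s) for s in seeds]
    for _ in range(n_iter):
        # Seeds get no special treatment here: every point is reassigned.
        labels = [nearest(p, centers) for p in points]
        for l in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if members:  # keep the old center if a cluster becomes empty
                centers[l] = mean(members)
    return [nearest(p, centers) for p in points], centers
```

After initialization the loop is ordinary KMeans, which is why a seed point may end up in a different cluster than its seed partition.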

In Constrained-KMeans, the seed clustering is
used to initialize the KMeans algorithm as
described for the Seeded-KMeans algorithm.
However, in the subsequent steps, the cluster
memberships of the data points in the seed set
are not re-computed in the cluster-assignment
steps of the algorithm: the cluster labels of the
seed data are kept unchanged, and only the labels
of the non-seed data are re-estimated.
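
The difference from Seeded-KMeans lies only in the assignment step, as the sketch below tries to show. It assumes points are tuples of numbers, that `seed_labels` is a dictionary from point index to cluster id, and that every cluster has at least one seed. These names and the fixed iteration count are illustrative assumptions.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def constrained_kmeans(points, seed_labels, k, n_iter=20):
    def assign(centers):
        labels = []
        for i, p in enumerate(points):
            if i in seed_labels:
                # Seed points keep their given label in every iteration.
                labels.append(seed_labels[i])
            else:
                labels.append(min(range(k), key=lambda l: sq_dist(p, centers[l])))
        return labels

    # Initialize each center from the seeds carrying that label.
    centers = [
        mean([points[i] for i, lab in seed_labels.items() if lab == l])
        for l in range(k)
    ]
    for _ in range(n_iter):
        labels = assign(centers)
        # No cluster can be empty: each one always retains its seeds.
        centers = [
            mean([p for p, lab in zip(points, labels) if lab == l])
            for l in range(k)
        ]
    return assign(centers), centers
```

For example, with points `[(0, 0), (1, 0), (10, 0), (11, 0), (8, 0)]` and `seed_labels={0: 0, 2: 1, 4: 0}`, the point `(8, 0)` stays in cluster 0 even though it ends up closer to the center of cluster 1.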

21
3. Metric-based methods

a. Mahalanobis distance trained
using convex optimization
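
For reference, a Mahalanobis-style distance parameterized by a matrix A is d_A(x, y) = sqrt((x - y)^T A (x - y)), where A is positive semidefinite. The sketch below only evaluates this distance for a given A; it does not show how A would be learned from constraints, since that part of the slides has no transcript. The function name and the list-of-rows matrix format are assumptions.

```python
import math


def mahalanobis(x, y, A):
    """Distance sqrt((x - y)^T A (x - y)) for a positive semidefinite A.

    x, y -- sequences of numbers of equal length n
    A    -- n x n matrix given as a list of rows
    """
    d = [a - b for a, b in zip(x, y)]
    # Matrix-vector product A d.
    Ad = [sum(row[j] * d[j] for j in range(len(d))) for row in A]
    # Quadratic form d^T A d.
    return math.sqrt(sum(di * v for di, v in zip(d, Ad)))
```

With A set to the identity matrix this reduces to the Euclidean distance; a diagonal A rescales individual features, and a full A also accounts for correlations between features.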

30
It is over!