Constrained Clustering

1
Constrained Clustering
INFS 795
  • Bojun Yan

2
Outline
  • 1. Introduction and background knowledge
  • 2. Constraint-based methods
  • a. COP-KMeans algorithm
  • b. Seeded-KMeans
  • c. Constrained-KMeans
  • 3. Metric-based methods
  • a. Mahalanobis distance trained using
    convex optimization

3
1. Introduction and background knowledge
  • In many learning tasks, there is a large supply
    of unlabeled data but insufficient labeled data,
    since labels can be expensive to generate.
    Semi-supervised learning combines labeled and
    unlabeled data during training to improve
    performance. Semi-supervised learning is
    applicable to both classification and clustering.

4
  • In supervised classification, there is a known,
    fixed set of categories and category-labeled
    training data is used to induce a classification
    function.
  • In semi-supervised classification, training also
    exploits additional unlabeled data, frequently
    resulting in a more accurate classification
    function (Blum & Mitchell, 1998; Ghahramani &
    Jordan, 1994).

5
  • In unsupervised clustering, an unlabeled dataset
    is partitioned into groups of similar examples,
    typically by optimizing an objective function
    that characterizes good partitions.
  • In semi-supervised clustering, some labeled data
    is used along with the unlabeled data to obtain a
    better clustering.

6
  • Clustering algorithms are generally used in an
    unsupervised fashion. The algorithm has access
    only to the set of features describing each
    object. It is not given any information (e.g.
    labels) as to where each of the instances should
    be placed within the partition.
  • However, in real application domains, it is often
    the case that the experimenter possesses some
    background knowledge (about the domain or the
    dataset) that could be useful in clustering the
    data.

7
  • Traditional clustering algorithms have no way to
    take advantage of this information even when it
    does exist.
  • Semi-supervised clustering integrates background
    information into the clustering algorithm.
    Its focus is on clustering large amounts of
    unsupervised data in the presence of a small amount
    of supervised data.

8
  • Some real-world tasks
  • a. similar-text search
  • b. image retrieval
  • c. speaker identification in a conversation
  • d. visual correspondence in multiview
    image processing

9
2. Constraint-based methods
  • a. COP-KMeans algorithm
  • b. Seeded-KMeans
  • c. Constrained-KMeans

10
a. COP-KMeans algorithm
  • In COP-KMeans, the initial background
    knowledge, provided in the form of constraints
    between instances in the dataset, is used in the
    clustering process.
  • There are two types of constraints, must-link
    (two instances have to be together in the same
    cluster) and cannot-link (two instances have to
    be in different clusters).
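The two constraint types can be checked mechanically before an instance is placed in a cluster. The following is a minimal sketch, not the algorithm from the slides: the function name violates_constraints, the dict-based assignment, and the representation of constraints as pairs of instance ids are all assumptions made for illustration.

```python
def violates_constraints(point, cluster, assignment, must_link, cannot_link):
    """Return True if placing `point` in `cluster` breaks a constraint.

    assignment  : dict {instance id: cluster id} for instances placed so far
    must_link   : iterable of (a, b) pairs that must share a cluster
    cannot_link : iterable of (a, b) pairs that must be in different clusters
    """
    for a, b in must_link:
        if point in (a, b):
            other = b if a == point else a
            # The partner is already placed in a different cluster.
            if other in assignment and assignment[other] != cluster:
                return True
    for a, b in cannot_link:
        if point in (a, b):
            other = b if a == point else a
            # The partner is already placed in this very cluster.
            if other in assignment and assignment[other] == cluster:
                return True
    return False


# Hypothetical instances d1..d4; d1 and d2 are already assigned.
assignment = {"d1": 0, "d2": 1}
must_link = [("d1", "d3")]
cannot_link = [("d2", "d4")]
d3_in_cluster_1 = violates_constraints("d3", 1, assignment, must_link, cannot_link)
d4_in_cluster_0 = violates_constraints("d4", 0, assignment, must_link, cannot_link)
```

A constrained assignment step would try clusters in order of increasing distance and take the first one for which this check returns False.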

11
  • The must-link constraints define a transitive
    binary relation over the instances. Consequently,
    when making use of a set of constraints (of both
    kinds), we take a transitive closure over the
    constraints. The full set of derived constraints
    is then presented to the clustering algorithm.
  • In general, constraints may be derived from
    partially labeled data or from background
    knowledge about the domain or dataset.

12
  • How to do transitive closure?
  • If di must-link dj, dj must-link dh,
    dl must-link dk, and di cannot-link dl, then
    we can take the transitive closure as follows:
  • di must-link dj
  • dj must-link dh => di must-link dh
  • dl must-link dk

13
  • di cannot-link dl => dj cannot-link dl
  • dh cannot-link dl
  • di cannot-link dk
  • dj cannot-link dk
  • dh cannot-link dk
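The closure in this example can be reproduced with a short sketch. It assumes constraints are given as pairs of instance names and uses a union-find structure to group instances joined by must-link; the function name transitive_closure and the sorted-tuple output format are illustrative choices, not taken from the slides.

```python
from itertools import combinations


def transitive_closure(must_link, cannot_link):
    """Return (must_link_closure, cannot_link_closure) as sets of sorted pairs."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving; unseen items are registered.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in must_link:
        parent[find(a)] = find(b)          # merge the two must-link groups
    for a, b in cannot_link:
        if find(a) == find(b):
            raise ValueError(f"inconsistent constraints on {a!r} and {b!r}")

    # Collect the members of each must-link group.
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)

    # Every pair inside a group is must-linked.
    ml = {pair for g in groups.values() for pair in combinations(sorted(g), 2)}

    # A cannot-link between two groups applies to every cross pair.
    cl = set()
    for a, b in cannot_link:
        for x in groups[find(a)]:
            for y in groups[find(b)]:
                cl.add(tuple(sorted((x, y))))
    return ml, cl


# The example from the slides.
ml, cl = transitive_closure(
    must_link=[("di", "dj"), ("dj", "dh"), ("dl", "dk")],
    cannot_link=[("di", "dl")],
)
```

For the example input, ml contains the derived pair di/dh, and cl contains all six cross pairs between {di, dj, dh} and {dl, dk}.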

15
b. Seeded-KMeans
c. Constrained-KMeans
16
Seeding
19
  • In Seeded-KMeans, the seed clustering is used
    to initialize the KMeans algorithm. Thus, rather
    than initializing KMeans from K random means, the
    mean of the lth cluster is initialized with the
    mean of the lth partition Sl of the seed set. The
    seed clustering is only used for initialization,
    and the seeds are not used in the following steps
    of the algorithm.
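A minimal pure-Python sketch of this initialization follows. It assumes the seed set is given as K lists of seed vectors, one list per cluster, and that the seed points also appear in the data; the function and parameter names are hypothetical.

```python
def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def seeded_kmeans(points, seeds, n_iter=100):
    """points: list of vectors; seeds: list of K lists of seed vectors."""
    # Initialization: the l-th center is the mean of the l-th seed partition.
    centers = [mean(partition) for partition in seeds]
    labels = None
    for _ in range(n_iter):
        # Assignment step: every point, seeds included, goes to the
        # nearest center. The seeds play no further role.
        new_labels = [
            min(range(len(centers)), key=lambda l: sq_dist(p, centers[l]))
            for p in points
        ]
        if new_labels == labels:
            break                      # converged
        labels = new_labels
        # Update step: recompute each non-empty cluster's mean.
        for l in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if members:
                centers[l] = mean(members)
    return labels, centers


points = [[0, 0], [0, 1], [10, 10], [10, 11]]
seeds = [[[0, 0]], [[10, 10]]]
labels, centers = seeded_kmeans(points, seeds)
```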

20
  • In Constrained-KMeans, the seed clustering is
    used to initialize the KMeans algorithm as
    described for the Seeded-KMeans algorithm.
    However, in the subsequent steps, the cluster
    memberships of the data points in the seed set
    are not re-computed in the cluster-assignment steps
    of the algorithm -- the cluster labels of the
    seed data are kept unchanged, and only the labels
    of the non-seed data are re-estimated.
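The difference from Seeded-KMeans can be seen in a short sketch. It assumes the seeds are passed as a dict from point index to cluster id and that every cluster has at least one seed; the names constrained_kmeans and seed_labels are illustrative.

```python
def constrained_kmeans(points, seed_labels, k, n_iter=100):
    """seed_labels: dict {index into points: cluster id in range(k)}."""

    def mean(vectors):
        n = len(vectors)
        return [sum(component) / n for component in zip(*vectors)]

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Initialization as in Seeded-KMeans: means of the seed partitions
    # (assumes each cluster has at least one seed).
    centers = [
        mean([points[i] for i, lab in seed_labels.items() if lab == c])
        for c in range(k)
    ]
    labels = None
    for _ in range(n_iter):
        # Assignment step: seed points keep their given label;
        # only non-seed points are re-estimated.
        new_labels = [
            seed_labels[i] if i in seed_labels
            else min(range(k), key=lambda c: sq_dist(p, centers[c]))
            for i, p in enumerate(points)
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: means over all members, seeds included.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = mean(members)
    return labels, centers


# Point 4 is a seed for cluster 0 although it ends up nearer to center 1.
points = [[0, 0], [1, 0], [9, 0], [10, 0], [6, 0]]
labels, centers = constrained_kmeans(points, {0: 0, 3: 1, 4: 0}, k=2)
```

In this toy run the seed at index 4 stays in cluster 0 even though the final center of cluster 1 is closer to it, which is the behaviour that distinguishes the two variants.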

21
3. Metric-based methods
  • a. Mahalanobis distance trained
    using convex optimization
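The transcript does not include the slides that define the metric, so the following is only a sketch of the commonly used parametrized form d_A(x, y) = sqrt((x - y)^T A (x - y)), with A assumed to be a positive semidefinite matrix. How A is trained is not shown here; the function name and the list-of-lists matrix format are assumptions.

```python
import math


def mahalanobis(x, y, A):
    """Distance sqrt((x - y)^T A (x - y)) for a PSD matrix A (list of rows)."""
    diff = [a - b for a, b in zip(x, y)]
    # Matrix-vector product A @ diff.
    a_diff = [sum(row[j] * diff[j] for j in range(len(diff))) for row in A]
    return math.sqrt(sum(d * v for d, v in zip(diff, a_diff)))


identity = [[1, 0], [0, 1]]
stretched = [[4, 0], [0, 1]]   # weights the first feature more heavily

d_euclid = mahalanobis([0, 0], [3, 4], identity)
d_weighted = mahalanobis([0, 0], [1, 0], stretched)
```

With the identity matrix the distance reduces to the Euclidean distance; a learned A rescales or rotates the feature space so that must-link pairs come out close together and cannot-link pairs far apart.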

30
It is over!
  • Questions?
  • Thanks!