1
Semi-Supervised Clustering
Jieping Ye
Department of Computer Science and Engineering
Arizona State University
http://www.public.asu.edu/jye02
2
Outline
  • Overview of clustering and classification
  • What is semi-supervised learning?
    • Semi-supervised clustering
    • Semi-supervised classification
  • Semi-supervised clustering
    • What is semi-supervised clustering?
    • Why semi-supervised clustering?
    • Semi-supervised clustering algorithms

3
Supervised classification versus unsupervised clustering
  • Unsupervised clustering: group similar objects together to find clusters
    • Minimize intra-class distance
    • Maximize inter-class distance
  • Supervised classification: the class label for each training sample is given
    • Build a model from the training data
    • Predict class labels for unseen future data points

4
What is clustering?
  • Finding groups of objects such that the objects
    in a group will be similar (or related) to one
    another and different from (or unrelated to) the
    objects in other groups

5
What is Classification?
6
Clustering algorithms
  • K-Means
  • Hierarchical clustering
  • Graph-based clustering (spectral clustering)
  • Bi-clustering

7
Classification algorithms
  • K-Nearest-Neighbor classifiers
  • Naïve Bayes classifier
  • Linear Discriminant Analysis (LDA)
  • Support Vector Machines (SVM)
  • Logistic Regression
  • Neural Networks

8
Supervised Classification Example
(figure: data points)
9
Supervised Classification Example
(figure: data points)
10
Supervised Classification Example
(figure: data points)
11
Unsupervised Clustering Example
(figure: data points)
12
Unsupervised Clustering Example
(figure: data points)
13
Semi-Supervised Learning
  • Combines labeled and unlabeled data during training to improve performance
  • Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
  • Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

14
Semi-Supervised Classification Example
(figure: data points)
15
Semi-Supervised Classification Example
(figure: data points)
16
Semi-Supervised Classification
  • Algorithms
    • Semi-supervised EM [Ghahramani, NIPS94; Nigam, ML00]
    • Co-training [Blum, COLT98]
    • Transductive SVMs [Vapnik 98; Joachims, ICML99]
    • Graph-based algorithms
  • Assumptions
    • A known, fixed set of categories is given in the labeled data.
    • The goal is to improve classification of examples into these known categories.

17
Semi-Supervised Clustering Example
(figure: data points)
18
Semi-Supervised Clustering Example
(figure: data points)
19
Second Semi-Supervised Clustering Example
(figure: data points)
20
Second Semi-Supervised Clustering Example
(figure: data points)
21
Semi-supervised clustering problem definition
  • Input
    • A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
    • A small amount of domain knowledge
  • Output
    • A partitioning of the objects into k clusters (possibly with some discarded as outliers)
  • Objective
    • Maximum intra-cluster similarity
    • Minimum inter-cluster similarity
    • High consistency between the partitioning and the domain knowledge

22
Why semi-supervised clustering?
  • Why not clustering?
    • The clusters produced may not be the ones required.
    • Sometimes there are multiple possible groupings.
  • Why not classification?
    • Sometimes there is insufficient labeled data.
  • Potential applications
    • Bioinformatics (gene and protein clustering)
    • Document hierarchy construction
    • News/email categorization
    • Image categorization

23
Semi-Supervised Clustering
  • Domain knowledge
    • Partial label information is given
    • Some constraints apply (must-links and cannot-links)
  • Approaches
    • Search-based semi-supervised clustering
      • Alter the clustering algorithm using the constraints
    • Similarity-based semi-supervised clustering
      • Alter the similarity measure based on the constraints
    • Combination of both

24
Search-Based Semi-Supervised Clustering
  • Alter the clustering algorithm that searches for a good partitioning by:
    • Modifying the objective function to give a reward for obeying labels on the supervised data [Demiriz, ANNIE99]
    • Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff, ICML00; Wagstaff, ICML01]
    • Using the labeled data to initialize clusters in an iterative refinement algorithm (K-Means, ...) [Basu, ICML02]

25
Overview of K-Means Clustering
  • K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters.
  • Algorithm: initialize K cluster centers randomly, then repeat until convergence (see the sketch below):
    • Cluster assignment step: assign each data point x to the cluster X_l whose center µ_l is nearest to x in L2 distance.
    • Center re-estimation step: re-estimate each cluster center µ_l as the mean of the points currently in that cluster.
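
A minimal NumPy sketch of these two steps (plain Lloyd-style K-Means; function and variable names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means: random initialization, then alternate the
    cluster assignment and center re-estimation steps."""
    rng = np.random.default_rng(seed)
    # Initialize K cluster centers by picking K distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Cluster assignment step: nearest center in L2 distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: mean of the points in each cluster.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```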

26
K-Means Objective Function
  • Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers (written out below)
  • Initialization of the K cluster centers:
    • Totally random
    • Random perturbation from the global mean
    • Heuristic to ensure well-separated centers
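
With clusters X_1, ..., X_K and centers µ_1, ..., µ_K, the objective can be written in the standard form (notation assumed, not taken from the slides):

```latex
J \;=\; \sum_{l=1}^{K} \sum_{x \in X_l} \lVert x - \mu_l \rVert^2
```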

27
K-Means Example
28
K-Means Example: Randomly Initialize Means
(figure: two randomly initialized cluster centers marked ×)
29
K-Means Example: Assign Points to Clusters
(figure: points assigned to the nearest of the two centers marked ×)
30
K-Means Example: Re-estimate Means
(figure: the two centers marked × move to the means of their clusters)
31
K-Means Example: Re-assign Points to Clusters
(figure: points re-assigned to the nearest center marked ×)
32
K-Means Example: Re-estimate Means
(figure: the two centers marked × move to the updated cluster means)
33
K-Means Example: Re-assign Points to Clusters
(figure: points re-assigned to the nearest center marked ×)
34
K-Means Example: Re-estimate Means and Converge
(figure: the centers marked × stop moving; the algorithm has converged)
35
Semi-Supervised K-Means
  • Partial label information is given
    • Seeded K-Means
    • Constrained K-Means
  • Constraints (must-link, cannot-link)
    • COP K-Means

36
Semi-Supervised K-Means for partially labeled data
  • Seeded K-Means
    • Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i.
    • Seed points are used only for initialization, not in subsequent steps.
  • Constrained K-Means
    • Labeled data provided by the user are used to initialize the K-Means algorithm.
    • Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated (a sketch of both variants follows).
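
A minimal sketch of both variants in NumPy (seeds given as a dict from point index to cluster label; it assumes every cluster receives at least one seed; names are illustrative):

```python
import numpy as np

def semi_supervised_kmeans(X, k, seeds, constrained=False, max_iter=100):
    """seeds: dict {point index: cluster label in 0..k-1}.
    Seeded K-Means uses the seeds only to initialize the centers;
    Constrained K-Means additionally keeps seed labels fixed."""
    # Initialization: center i is the mean of the seed points labeled i.
    centers = np.array([
        X[[idx for idx, lab in seeds.items() if lab == i]].mean(axis=0)
        for i in range(k)  # assumes each cluster has at least one seed
    ])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment step: nearest center in L2 distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        if constrained:
            # Constrained variant: clamp the labels of the seed points.
            for idx, lab in seeds.items():
                labels[idx] = lab
        # Re-estimation step.
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```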

37
Seeded K-Means
Use labeled data to find the initial centroids
and then run K-Means. The labels for seeded
points may change.
38
Seeded K-Means Example
39
Seeded K-Means Example: Initialize Means Using Labeled Data
(figure: two cluster centers marked ×, initialized from the labeled seed points)
40
Seeded K-Means Example: Assign Points to Clusters
(figure: points assigned to the two centers marked ×)
41
Seeded K-Means Example: Re-estimate Means
(figure: the two centers marked × move to the cluster means)
42
Seeded K-Means Example: Assign Points to Clusters and Converge
(figure: final assignment; the label of one seed point is changed)
43
Exercise
Compute the clustering using seeded K-Means.
44
Constrained K-Means
Use labeled data to find the initial centroids
and then run K-Means. The labels for seeded
points will not change.
45
Constrained K-Means Example
46
Constrained K-Means Example: Initialize Means Using Labeled Data
(figure: two cluster centers marked ×, initialized from the labeled seed points)
47
Constrained K-Means Example: Assign Points to Clusters
(figure: points assigned to the two centers marked ×; seed labels stay fixed)
48
Constrained K-Means Example: Re-estimate Means and Converge
49
Exercise
Compute the clustering using constrained K-Means.
50
COP K-Means
  • COP K-Means [Wagstaff et al., ICML01] is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
  • Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints that it participates in are enforced (so that the linked points cannot later be chosen as the center of another cluster).
  • Algorithm: during the cluster assignment step of COP-K-Means, each point is assigned to its nearest cluster that does not violate any of its constraints. If no such assignment exists, abort.

51
COP K-Means Algorithm
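
A minimal Python sketch of the constraint check and the constrained assignment step just described (constraints given as lists of index pairs; helper names are illustrative):

```python
import numpy as np

def violates(idx, cluster, labels, must, cannot):
    """Would placing point idx into `cluster` break a constraint,
    given the labels assigned so far this round (-1 = unassigned)?"""
    for a, b in must:
        other = b if a == idx else a if b == idx else None
        if other is not None and labels[other] not in (-1, cluster):
            return True  # must-link partner sits in a different cluster
    for a, b in cannot:
        other = b if a == idx else a if b == idx else None
        if other is not None and labels[other] == cluster:
            return True  # cannot-link partner already in this cluster
    return False

def cop_assign(X, centers, must, cannot):
    """One COP-K-Means assignment pass: each point goes to its nearest
    non-violating cluster; if none exists, the algorithm aborts."""
    labels = np.full(len(X), -1)
    for idx in range(len(X)):
        for c in np.argsort(np.linalg.norm(centers - X[idx], axis=1)):
            if not violates(idx, c, labels, must, cannot):
                labels[idx] = c
                break
        else:
            raise RuntimeError("no valid assignment exists: clustering fails")
    return labels
```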
52
Illustration: determine the label of a new point
(figure: a must-link constraint ties the point to the red class, so it is assigned to the red class)
53
Illustration: determine the label of a new point
(figure: a cannot-link constraint rules out the other class, so the point is assigned to the red class)
54
Illustration: determine the label of a new point
(figure: the point has both a must-link and a cannot-link that cannot be satisfied simultaneously; the clustering algorithm fails)
55
Similarity-based semi-supervised clustering
  • Alter the similarity measure based on the constraints
  • Paper: "From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering", D. Klein et al.

Two types of constraints: must-links and cannot-links
Clustering algorithm: hierarchical clustering
56
Constraints
57
Overview of Hierarchical Clustering Algorithm
  • Agglomerative versus divisive
  • Basic algorithm of agglomerative clustering (a sketch follows this list):
    • Compute the distance matrix
    • Let each data point be a cluster
    • Repeat
      • Merge the two closest clusters
      • Update the distance matrix
    • Until only a single cluster remains
  • The key operation is the update of the distance between two clusters
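
A compact sketch of that loop, using the complete-link (MAX) cluster distance that the later slides rely on (pure Python on a full distance matrix; names are illustrative):

```python
def complete_link_clustering(D, target_k=1):
    """Agglomerative clustering over a symmetric distance matrix D
    (list of lists). Merges the two closest clusters until target_k
    remain; complete link = largest pairwise point distance."""
    clusters = [[i] for i in range(len(D))]

    def cluster_dist(a, b):
        return max(D[p][q] for p in a for q in b)  # MAX / complete link

    while len(clusters) > target_k:
        # Find the two closest clusters ...
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        # ... and merge them (the "distance matrix update" is implicit,
        # since cluster_dist recomputes distances from point pairs).
        clusters[i] += clusters.pop(j)
    return clusters
```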

58
How to Define the Inter-Cluster Distance?
  • MIN
  • MAX
  • Group average
  • Distance between centroids

(figure: distance matrix)
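
In symbols, for clusters A and B with centroids µ_A and µ_B, the four options are (standard definitions; the slide lists only the names):

```latex
\begin{aligned}
d_{\min}(A,B) &= \min_{a \in A,\, b \in B} d(a,b) \\
d_{\max}(A,B) &= \max_{a \in A,\, b \in B} d(a,b) \\
d_{\mathrm{avg}}(A,B) &= \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b) \\
d_{\mathrm{cen}}(A,B) &= d(\mu_A, \mu_B)
\end{aligned}
```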
59
Must-link constraints
  • Set the distance between each must-link pair to zero.
  • Derive a new metric by running an all-pairs-shortest-distances algorithm (see the sketch below).
    • It is still a metric.
    • It is faithful to the original metric.
  • Computational complexity: O(N²·C)
    • C: number of points involved in must-link constraints
    • N: total number of points
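
A minimal NumPy sketch of this transform (Floyd-Warshall relaxation restricted to points that appear in must-link constraints, which is where the O(N²·C) bound comes from; names are illustrative):

```python
import numpy as np

def must_link_metric(D, must_links):
    """Zero out must-link distances, then propagate shortest paths
    through the constrained points so the result is again a metric."""
    D = D.copy()
    for a, b in must_links:
        D[a, b] = D[b, a] = 0.0
    constrained = {p for pair in must_links for p in pair}
    # Relax via each constrained point: C relaxations of an N x N matrix.
    for k in constrained:
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D
```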

60
New distance matrix based on must-link constraints
Hierarchical clustering can be carried out based on the new distance matrix.
What is missing?
(figure: new distance matrix)
61
Cannot-link constraints
  • Run hierarchical clustering with complete link (MAX):
    • The distance between two clusters is determined by the largest pairwise distance.
  • Set the distance between each cannot-link pair to ∞ (in practice, a value larger than every other distance).
    • The new distance matrix no longer defines a metric.
    • This works very well in practice.

62
Constrained complete-link clustering algorithm
Derive a new distance matrix based on both types of constraints.
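
Putting the two transforms together, a short sketch reusing must_link_metric and complete_link_clustering from the earlier sketches (the finite stand-in for ∞ is an assumption, mirroring the 0.9 used in the illustration that follows):

```python
def constrained_distances(D, must_links, cannot_links, big=1e9):
    """Must-link shortest-path transform, then overwrite each
    cannot-link pair with a huge distance (stand-in for infinity)."""
    D2 = must_link_metric(D, must_links)
    for a, b in cannot_links:
        D2[a, b] = D2[b, a] = big
    return D2

# Usage: complete-link clustering on the transformed matrix, e.g.
# clusters = complete_link_clustering(
#     constrained_distances(D, ml, cl).tolist(), target_k=3)
```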
63
Illustration
(figure: five points, labeled 1-5)
Initial distance matrix:

        1     2     3     4     5
   1    0    0.2   0.5   0.1   0.8
   2          0    0.4   0.2   0.6
   3                0    0.3   0.2
   4                      0    0.5
   5                            0
64
New distance matrix
Must-links: 1-2, 3-4. Cannot-link: 2-3.

Original matrix:

        1     2     3     4     5
   1    0    0.2   0.5   0.1   0.8
   2          0    0.4   0.2   0.6
   3                0    0.3   0.2
   4                      0    0.5
   5                            0

New matrix after the must-link transform (the cannot-link pair 2-3 is then overwritten with 0.9, a value larger than every other distance):

        1     2     3     4     5
   1    0     0    0.1   0.1   0.8
   2          0    0.2   0.2   0.6
   3                0     0    0.2
   4                      0    0.2
   5                            0
65
Hierarchical clustering
1 and 2 form a cluster, and 3 and 4 form another cluster.

Cluster-level distance matrix (complete link):

          {1,2}  {3,4}    5
   {1,2}    0     0.9    0.8
   {3,4}           0     0.2
     5                    0

(figure: dendrogram over points 1-5)
66
Summary
  • Seeded and Constrained K-Means: partially labeled data
  • COP K-Means: constraints (must-link and cannot-link)
  • Constrained K-Means and COP K-Means require all the constraints to be satisfied.
    • They may not be effective if the seeds contain noise.
  • Seeded K-Means uses the seeds only in the first step, to determine the initial centroids.
    • It is less sensitive to noise in the seeds.
  • Semi-supervised hierarchical clustering