1
Semi-Supervised Clustering
Jieping Ye
Department of Computer Science and Engineering
Arizona State University
http://www.public.asu.edu/jye02
2
Outline
  • Overview of clustering and classification
  • What is semi-supervised learning?
    • Semi-supervised clustering
    • Semi-supervised classification
  • Semi-supervised clustering
    • What is semi-supervised clustering?
    • Why semi-supervised clustering?
    • Semi-supervised clustering algorithms

3
Supervised classification versus unsupervised clustering
  • Unsupervised clustering: group similar objects together to find clusters
    • Minimize intra-class distance
    • Maximize inter-class distance
  • Supervised classification: the class label for each training sample is given
    • Build a model from the training data
    • Predict class labels for unseen future data points

4
What is clustering?
  • Finding groups of objects such that the objects
    in a group will be similar (or related) to one
    another and different from (or unrelated to) the
    objects in other groups

5
What is Classification?
6
Clustering algorithms
  • K-Means
  • Hierarchical clustering
  • Graph-based clustering (spectral clustering)
  • Bi-clustering

7
Classification algorithms
  • K-Nearest-Neighbor classifiers
  • Naïve Bayes classifier
  • Linear Discriminant Analysis (LDA)
  • Support Vector Machines (SVM)
  • Logistic Regression
  • Neural Networks

8
Supervised Classification Example
(figure: data points)
9
Supervised Classification Example
(figure: data points)
10
Supervised Classification Example
(figure: data points)
11
Unsupervised Clustering Example
(figure: data points)
12
Unsupervised Clustering Example
(figure: data points)
13
Semi-Supervised Learning
  • Combines labeled and unlabeled data during training to improve performance
  • Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
  • Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

14
Semi-Supervised Classification Example
(figure: data points)
15
Semi-Supervised Classification Example
(figure: data points)
16
Semi-Supervised Classification
  • Algorithms
    • Semi-supervised EM [Ghahramani, NIPS94; Nigam, ML00]
    • Co-training [Blum, COLT98]
    • Transductive SVMs [Vapnik 98; Joachims, ICML99]
    • Graph-based algorithms
  • Assumptions
    • A known, fixed set of categories is given in the labeled data.
    • The goal is to improve classification of examples into these known categories.

17
Semi-Supervised Clustering Example
(figure: data points)
18
Semi-Supervised Clustering Example
(figure: data points)
19
Second Semi-Supervised Clustering Example
(figure: data points)
20
Second Semi-Supervised Clustering Example
(figure: data points)
21
Semi-supervised clustering problem definition
  • Input
    • A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
    • A small amount of domain knowledge
  • Output
    • A partitioning of the objects into k clusters (possibly with some discarded as outliers)
  • Objective
    • Maximum intra-cluster similarity
    • Minimum inter-cluster similarity
    • High consistency between the partitioning and the domain knowledge

22
Why semi-supervised clustering?
  • Why not clustering?
    • The clusters produced may not be the ones required.
    • Sometimes there are multiple possible groupings.
  • Why not classification?
    • Sometimes there is insufficient labeled data.
  • Potential applications
    • Bioinformatics (gene and protein clustering)
    • Document hierarchy construction
    • News/email categorization
    • Image categorization

23
Semi-Supervised Clustering
  • Domain knowledge
    • Partial label information is given
    • Some constraints apply (must-links and cannot-links)
  • Approaches
    • Search-based semi-supervised clustering
      • Alter the clustering algorithm using the constraints
    • Similarity-based semi-supervised clustering
      • Alter the similarity measure based on the constraints
    • Combination of both

24
Search-Based Semi-Supervised Clustering
  • Alter the clustering algorithm that searches for a good partitioning by:
    • Modifying the objective function to give a reward for obeying labels on the supervised data [Demiriz, ANNIE99]
    • Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff, ICML00; Wagstaff, ICML01]
    • Using the labeled data to initialize clusters in an iterative refinement algorithm (K-Means, ...) [Basu, ICML02]

25
Overview of K-Means Clustering
  • K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters.
  • Algorithm: initialize K cluster centers randomly, then repeat until convergence (see the sketch below):
    • Cluster assignment step: assign each data point x to the cluster X_l whose center µ_l is nearest to x in L2 distance.
    • Center re-estimation step: re-estimate each cluster center µ_l as the mean of the points currently in that cluster.
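
A minimal NumPy sketch of these two steps (plain Lloyd-style K-Means; function and variable names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means: random initialization, then alternate the
    cluster assignment and center re-estimation steps."""
    rng = np.random.default_rng(seed)
    # Initialize K cluster centers by picking K distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Cluster assignment step: nearest center in L2 distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: mean of the points in each cluster.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```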

26
K-Means Objective Function
  • Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers (written out below)
  • Initialization of the K cluster centers:
    • Totally random
    • Random perturbation from the global mean
    • Heuristic to ensure well-separated centers
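
With clusters X_1, ..., X_K and centers µ_1, ..., µ_K, the objective can be written in the standard form (notation assumed, not taken from the slides):

```latex
J \;=\; \sum_{l=1}^{K} \sum_{x \in X_l} \lVert x - \mu_l \rVert^2
```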

27
K-Means Example
28
K-Means Example: Randomly Initialize Means
(figure: two randomly initialized cluster centers marked ×)
29
K-Means Example: Assign Points to Clusters
(figure: points assigned to the nearest of the two centers marked ×)
30
K-Means Example: Re-estimate Means
(figure: the two centers marked × move to the means of their clusters)
31
K-Means Example: Re-assign Points to Clusters
(figure: points re-assigned to the nearest center marked ×)
32
K-Means Example: Re-estimate Means
(figure: the two centers marked × move to the updated cluster means)
33
K-Means Example: Re-assign Points to Clusters
(figure: points re-assigned to the nearest center marked ×)
34
K-Means Example: Re-estimate Means and Converge
(figure: the centers marked × stop moving; the algorithm has converged)
35
Semi-Supervised K-Means
  • Partial label information is given
    • Seeded K-Means
    • Constrained K-Means
  • Constraints (must-link, cannot-link)
    • COP K-Means

36
Semi-Supervised K-Means for partially labeled data
  • Seeded K-Means
    • Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i.
    • Seed points are used only for initialization, not in subsequent steps.
  • Constrained K-Means
    • Labeled data provided by the user are used to initialize the K-Means algorithm.
    • Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated (a sketch of both variants follows).
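
A minimal sketch of both variants in NumPy (seeds given as a dict from point index to cluster label; it assumes every cluster receives at least one seed; names are illustrative):

```python
import numpy as np

def semi_supervised_kmeans(X, k, seeds, constrained=False, max_iter=100):
    """seeds: dict {point index: cluster label in 0..k-1}.
    Seeded K-Means uses the seeds only to initialize the centers;
    Constrained K-Means additionally keeps seed labels fixed."""
    # Initialization: center i is the mean of the seed points labeled i.
    centers = np.array([
        X[[idx for idx, lab in seeds.items() if lab == i]].mean(axis=0)
        for i in range(k)  # assumes each cluster has at least one seed
    ])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment step: nearest center in L2 distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        if constrained:
            # Constrained variant: clamp the labels of the seed points.
            for idx, lab in seeds.items():
                labels[idx] = lab
        # Re-estimation step.
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```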

37
Seeded K-Means
Use labeled data to find the initial centroids
and then run K-Means. The labels for seeded
points may change.
38
Seeded K-Means Example
39
Seeded K-Means Example: Initialize Means Using Labeled Data
(figure: two cluster centers marked ×, initialized from the labeled seed points)
40
Seeded K-Means Example: Assign Points to Clusters
(figure: points assigned to the two centers marked ×)
41
Seeded K-Means Example: Re-estimate Means
(figure: the two centers marked × move to the cluster means)
42
Seeded K-Means Example: Assign Points to Clusters and Converge
(figure: final assignment; the label of one seed point is changed)
43
Exercise
Compute the clustering using seeded K-Means.
44
Constrained K-Means
Use labeled data to find the initial centroids
and then run K-Means. The labels for seeded
points will not change.
45
Constrained K-Means Example
46
Constrained K-Means Example: Initialize Means Using Labeled Data
(figure: two cluster centers marked ×, initialized from the labeled seed points)
47
Constrained K-Means Example: Assign Points to Clusters
(figure: points assigned to the two centers marked ×; seed labels stay fixed)
48
Constrained K-Means Example: Re-estimate Means and Converge
49
Exercise
Compute the clustering using constrained K-Means.
50
COP K-Means
  • COP K-Means [Wagstaff et al., ICML01] is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
  • Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints that it participates in are enforced (so that the linked points cannot later be chosen as the center of another cluster).
  • Algorithm: during the cluster assignment step of COP-K-Means, each point is assigned to its nearest cluster that does not violate any of its constraints. If no such assignment exists, abort.

51
COP K-Means Algorithm
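
A minimal Python sketch of the constraint check and the constrained assignment step just described (constraints given as lists of index pairs; helper names are illustrative):

```python
import numpy as np

def violates(idx, cluster, labels, must, cannot):
    """Would placing point idx into `cluster` break a constraint,
    given the labels assigned so far this round (-1 = unassigned)?"""
    for a, b in must:
        other = b if a == idx else a if b == idx else None
        if other is not None and labels[other] not in (-1, cluster):
            return True  # must-link partner sits in a different cluster
    for a, b in cannot:
        other = b if a == idx else a if b == idx else None
        if other is not None and labels[other] == cluster:
            return True  # cannot-link partner already in this cluster
    return False

def cop_assign(X, centers, must, cannot):
    """One COP-K-Means assignment pass: each point goes to its nearest
    non-violating cluster; if none exists, the algorithm aborts."""
    labels = np.full(len(X), -1)
    for idx in range(len(X)):
        for c in np.argsort(np.linalg.norm(centers - X[idx], axis=1)):
            if not violates(idx, c, labels, must, cannot):
                labels[idx] = c
                break
        else:
            raise RuntimeError("no valid assignment exists: clustering fails")
    return labels
```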
52
Illustration: determine the label of a new point
(figure: a must-link constraint ties the point to the red class, so it is assigned to the red class)
53
Illustration: determine the label of a new point
(figure: a cannot-link constraint rules out the other class, so the point is assigned to the red class)
54
Illustration: determine the label of a new point
(figure: the point has both a must-link and a cannot-link that cannot be satisfied simultaneously; the clustering algorithm fails)
55
Similarity-based semi-supervised clustering
  • Alter the similarity measure based on the constraints
  • Paper: "From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering", D. Klein et al.

Two types of constraints: must-links and cannot-links
Clustering algorithm: hierarchical clustering
56
Constraints
57
Overview of Hierarchical Clustering Algorithm
  • Agglomerative versus divisive
  • Basic algorithm of agglomerative clustering (a sketch follows this list):
    • Compute the distance matrix
    • Let each data point be a cluster
    • Repeat
      • Merge the two closest clusters
      • Update the distance matrix
    • Until only a single cluster remains
  • The key operation is the update of the distance between two clusters
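
A compact sketch of that loop, using the complete-link (MAX) cluster distance that the later slides rely on (pure Python on a full distance matrix; names are illustrative):

```python
def complete_link_clustering(D, target_k=1):
    """Agglomerative clustering over a symmetric distance matrix D
    (list of lists). Merges the two closest clusters until target_k
    remain; complete link = largest pairwise point distance."""
    clusters = [[i] for i in range(len(D))]

    def cluster_dist(a, b):
        return max(D[p][q] for p in a for q in b)  # MAX / complete link

    while len(clusters) > target_k:
        # Find the two closest clusters ...
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        # ... and merge them (the "distance matrix update" is implicit,
        # since cluster_dist recomputes distances from point pairs).
        clusters[i] += clusters.pop(j)
    return clusters
```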

58
How to Define the Inter-Cluster Distance?
  • MIN
  • MAX
  • Group average
  • Distance between centroids

(figure: distance matrix)
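
In symbols, for clusters A and B with centroids µ_A and µ_B, the four options are (standard definitions; the slide lists only the names):

```latex
\begin{aligned}
d_{\min}(A,B) &= \min_{a \in A,\, b \in B} d(a,b) \\
d_{\max}(A,B) &= \max_{a \in A,\, b \in B} d(a,b) \\
d_{\mathrm{avg}}(A,B) &= \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b) \\
d_{\mathrm{cen}}(A,B) &= d(\mu_A, \mu_B)
\end{aligned}
```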
59
Must-link constraints
  • Set the distance between each must-link pair to zero.
  • Derive a new metric by running an all-pairs-shortest-distances algorithm (see the sketch below).
    • It is still a metric.
    • It is faithful to the original metric.
  • Computational complexity: O(N²·C)
    • C: number of points involved in must-link constraints
    • N: total number of points
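
A minimal NumPy sketch of this transform (Floyd-Warshall relaxation restricted to points that appear in must-link constraints, which is where the O(N²·C) bound comes from; names are illustrative):

```python
import numpy as np

def must_link_metric(D, must_links):
    """Zero out must-link distances, then propagate shortest paths
    through the constrained points so the result is again a metric."""
    D = D.copy()
    for a, b in must_links:
        D[a, b] = D[b, a] = 0.0
    constrained = {p for pair in must_links for p in pair}
    # Relax via each constrained point: C relaxations of an N x N matrix.
    for k in constrained:
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D
```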

60
New distance matrix based on must-link constraints
Hierarchical clustering can be carried out based on the new distance matrix.
What is missing?
(figure: new distance matrix)
61
Cannot-link constraints
  • Run hierarchical clustering with complete link (MAX):
    • The distance between two clusters is determined by the largest pairwise distance.
  • Set the distance between each cannot-link pair to ∞ (in practice, a value larger than every other distance).
    • The new distance matrix no longer defines a metric.
    • This works very well in practice.

62
Constrained complete-link clustering algorithm
Derive a new distance matrix based on both types of constraints.
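
Putting the two transforms together, a short sketch reusing must_link_metric and complete_link_clustering from the earlier sketches (the finite stand-in for ∞ is an assumption, mirroring the 0.9 used in the illustration that follows):

```python
def constrained_distances(D, must_links, cannot_links, big=1e9):
    """Must-link shortest-path transform, then overwrite each
    cannot-link pair with a huge distance (stand-in for infinity)."""
    D2 = must_link_metric(D, must_links)
    for a, b in cannot_links:
        D2[a, b] = D2[b, a] = big
    return D2

# Usage: complete-link clustering on the transformed matrix, e.g.
# clusters = complete_link_clustering(
#     constrained_distances(D, ml, cl).tolist(), target_k=3)
```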
63
Illustration
(figure: five points, labeled 1-5)
Initial distance matrix:

        1     2     3     4     5
   1    0    0.2   0.5   0.1   0.8
   2          0    0.4   0.2   0.6
   3                0    0.3   0.2
   4                      0    0.5
   5                            0
64
New distance matrix
Must-links: 1-2, 3-4. Cannot-link: 2-3.

Original matrix:

        1     2     3     4     5
   1    0    0.2   0.5   0.1   0.8
   2          0    0.4   0.2   0.6
   3                0    0.3   0.2
   4                      0    0.5
   5                            0

New matrix after the must-link transform (the cannot-link pair 2-3 is then overwritten with 0.9, a value larger than every other distance):

        1     2     3     4     5
   1    0     0    0.1   0.1   0.8
   2          0    0.2   0.2   0.6
   3                0     0    0.2
   4                      0    0.2
   5                            0
65
Hierarchical clustering
1 and 2 form a cluster, and 3 and 4 form another cluster.

Cluster-level distance matrix (complete link):

          {1,2}  {3,4}    5
   {1,2}    0     0.9    0.8
   {3,4}           0     0.2
     5                    0

(figure: dendrogram over points 1-5)
66
Summary
  • Seeded and Constrained K-Means: partially labeled data
  • COP K-Means: constraints (must-link and cannot-link)
  • Constrained K-Means and COP K-Means require all the constraints to be satisfied.
    • They may not be effective if the seeds contain noise.
  • Seeded K-Means uses the seeds only in the first step, to determine the initial centroids.
    • It is less sensitive to noise in the seeds.
  • Semi-supervised hierarchical clustering