Title: Semi-supervised Learning
1. Semi-supervised Learning
2. Spectrum of Learning Problems
3. What is Semi-supervised Learning?
- Learning from a mixture of labeled and unlabeled examples
- Labeled data
- Unlabeled data
- Total number of examples
4. Why Semi-supervised Learning?
- Labeling is expensive and difficult
- Labeling is unreliable
- Ex. Segmentation applications
- Need for multiple experts
- Unlabeled examples
- Easy to obtain in large numbers
- Ex. Web pages, text documents, etc.
5. Semi-supervised Learning Problems
- Classification
- Transductive: predict labels of the unlabeled data
- Inductive: learn a classification function
- Clustering (constrained clustering)
- Ranking (semi-supervised ranking)
- Almost every learning problem has a semi-supervised counterpart.
6. Why Unlabeled Data Could Be Helpful
- Clustering assumption
- Unlabeled data help decide the decision boundary
- Manifold assumption
- Unlabeled data help decide the decision function
7. Clustering Assumption
8. Clustering Assumption
Can you suggest a simple algorithm for semi-supervised learning?
- Points with the same label are connected through high-density regions, thereby defining a cluster
- Clusters are separated by low-density regions
9. Manifold Assumption
- Graph representation
- Vertex: a training example (labeled or unlabeled)
- Edge: connects similar examples
- Regularize the classification function f(x)
10. Manifold Assumption
- Graph representation
- Vertex: a training example (labeled or unlabeled)
- Edge: connects similar examples
- Manifold assumption
- Data lie on a low-dimensional manifold
- The classification function f(x) should follow the data manifold
11. Statistical View
- Generative model for classification
12. Statistical View
- Generative model for classification
- Unlabeled data help estimate the model parameters θ
- → Clustering assumption
13. Statistical View
- Discriminative model for classification
(Graphical model: parameters θ and μ over variables X and Y)
14. Statistical View
- Discriminative model for classification
- Unlabeled data help regularize θ via a prior
- → Manifold assumption
(Graphical model: parameters θ and μ over variables X and Y)
15. Semi-supervised Learning Algorithms
- Label propagation
- Graph partitioning based approaches
- Transductive Support Vector Machine (TSVM)
- Co-training
16. Label Propagation: Key Idea
- A decision boundary based on the labeled examples is unable to take into account the layout of the data points
- How to incorporate the data distribution into the prediction of class labels?
17. Label Propagation: Key Idea
- Connect the data points that are close to each other
18. Label Propagation: Key Idea
- Connect the data points that are close to each other
- Propagate the class labels over the connected graph
19. Label Propagation: Key Idea
- Connect the data points that are close to each other
- Propagate the class labels over the connected graph
- Different from K-nearest-neighbor classification
20. Label Propagation: Representation
- Adjacency matrix
- Similarity matrix
- Degree matrix
21. Label Propagation: Representation
- Adjacency matrix
- Similarity matrix
- Degree matrix (a construction sketch follows below)
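The slides name these matrices without giving formulas; below is a minimal numpy sketch. The Gaussian similarity and the k-nearest-neighbor sparsification are assumptions, since the deck does not fix a particular construction.

```python
import numpy as np

def build_graph(X, sigma=1.0, k=10):
    """Build adjacency, similarity, and degree matrices for label
    propagation. Gaussian similarity and k-NN sparsification are
    assumptions; the slides do not fix a particular construction."""
    n = X.shape[0]
    sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
    W = np.exp(-np.maximum(sq, 0) / (2 * sigma ** 2))  # similarity matrix
    np.fill_diagonal(W, 0.0)
    A = np.zeros_like(W)                               # adjacency matrix
    nn = np.argsort(-W, axis=1)[:, :k]                 # k most similar points
    A[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    A = np.maximum(A, A.T)                             # symmetrize
    W = W * A                                          # sparsified similarities
    D = np.diag(W.sum(axis=1))                         # degree matrix
    return A, W, D
```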
22. Label Propagation: Representation
23. Label Propagation: Representation
24. Label Propagation
- Initial class assignments
- Predicted class assignments
- First predict the confidence scores
- Then predict the class assignments
25. Label Propagation
- Initial class assignments
- Predicted class assignments
- First predict the confidence scores
- Then predict the class assignments
26. Label Propagation (II)
27. Label Propagation (II)
- Two rounds of propagation
- How to generalize to any number of iterations?
28. Label Propagation (II)
- Two rounds of propagation
- Results for any number of iterations
29. Label Propagation (II)
- Two rounds of propagation
- Results for an infinite number of iterations
30. Label Propagation (II)
- Two rounds of propagation
- Results for an infinite number of iterations (a sketch follows below)
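The propagation formulas were lost in extraction; a minimal sketch following the normalized form of Zhou et al. (NIPS 2003), which the deck cites next. The decay parameter α is the one named in the summary slide, and the closed form for infinitely many iterations uses the O(n³) matrix inversion mentioned there.

```python
import numpy as np

def label_propagation(W, Y, alpha=0.9, n_iter=None):
    """W: (n, n) similarity matrix. Y: (n, c) initial class assignments
    (one-hot rows for labeled points, zero rows for unlabeled ones)."""
    d = np.maximum(W.sum(axis=1), 1e-12)
    S = W / np.sqrt(d[:, None] * d[None, :])   # S = D^{-1/2} W D^{-1/2}
    if n_iter is not None:                     # finitely many rounds
        F = Y.astype(float)
        for _ in range(n_iter):
            F = alpha * (S @ F) + (1 - alpha) * Y
    else:                                      # limit of infinite iterations:
        n = W.shape[0]                         # F* = (1-a)(I - a S)^{-1} Y
        F = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y)
    return F.argmax(axis=1)                    # confidence scores -> labels
```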
31. Local and Global Consistency (Zhou et al., NIPS 2003)
32. Summary
- Construct a graph using pairwise similarities
- Propagate class labels along the graph
- Key parameters
- α: the decay of propagation
- W: the similarity matrix
- Computational complexity
- Matrix inversion: O(n³)
- Cholesky decomposition
- Clustering
33. Questions?
34. Application: Text Classification (Zhou et al., NIPS 2003)
- 20-newsgroups
- autos, motorcycles, baseball, and hockey under rec
- Pre-processing
- stemming, removal of stopwords and rare words, headers skipped
- 3,970 documents, 8,014 words
35. Application: Image Retrieval (Wang et al., ACM MM 2004)
- 5,000 images
- Relevance feedback for the top 20 ranked images
- Classification problem
- Relevant or not?
- f(x): degree of relevance
- Learning the relevance function f(x)
- Supervised learning: SVM
- Label propagation
(Figure: retrieval results of label propagation vs. SVM)
36. Semi-supervised Learning Algorithms
- Label propagation
- Graph partitioning based approaches
- Transductive Support Vector Machine (TSVM)
- Co-training
37. Graph Partitioning
- Classification as graph partitioning
- Search for a classification boundary
- Consistent with labeled examples
- Partition with small graph cut
38. Graph Partitioning
- Classification as graph partitioning
- Search for a classification boundary
- Consistent with labeled examples
- Partition with small graph cut
39. Min-cuts (Blum and Chawla, ICML 2001)
- Additional nodes
- V+: the source, V-: the sink
- Infinite weights connect the labeled examples to the source and sink
- High computational cost
(Figure: graph with source node V+ and sink node V-)
40. Harmonic Function (Zhu et al., ICML 2003)
- Weight matrix W
- w_ij ≥ 0: the similarity between x_i and x_j
- Membership vector
41. Harmonic Function (cont'd)
- Graph cut
- Degree matrix D
- Diagonal elements: d_i = Σ_j w_ij
42. Harmonic Function (cont'd)
- Graph cut
- Graph Laplacian L = D − W
- Captures the pairwise relationships among data points
- Captures the manifold geometry of the data
43. Harmonic Function
44. Harmonic Function
- Relaxation: {−1, +1} → continuous real numbers
- Convert the continuous f back to binary labels (a sketch follows below)
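A sketch of the relaxed solution, under the convention that the first l points are labeled; the closed form f_u = −L_uu⁻¹ L_ul f_l is the standard harmonic-function result of Zhu et al. (ICML 2003).

```python
import numpy as np

def harmonic_function(W, f_l):
    """W: (n, n) similarity matrix with the first l points labeled.
    f_l: (l,) labels in {-1, +1}. Returns binary labels for the rest."""
    l = f_l.shape[0]
    L = np.diag(W.sum(axis=1)) - W        # graph Laplacian L = D - W
    # Relaxed solution on the unlabeled block: f_u = -L_uu^{-1} L_ul f_l
    f_u = -np.linalg.solve(L[l:, l:], L[l:, :l] @ f_l)
    return np.sign(f_u)                   # convert continuous f to binary
```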
45. Harmonic Function
How to handle a large number of unlabeled data points?
46. Harmonic Function
47. Harmonic Function
Local propagation
Sound familiar?
Global propagation
48. Spectral Graph Transducer (Joachims, 2003)
Soften the hard constraints
49. Spectral Graph Transducer (Joachims, 2003)
Solved as a constrained eigenvector problem (see the sketch below)
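The constrained eigenproblem itself is not reproduced on the slides; the unconstrained spectral core, partitioning by the second-smallest eigenvector of the normalized Laplacian, can be sketched as follows. SGT additionally constrains this solution to agree with the labeled examples.

```python
import numpy as np

def spectral_partition(W):
    """Two-way partition from the second-smallest eigenvector of the
    normalized Laplacian; SGT adds label constraints on top of this."""
    d = np.maximum(W.sum(axis=1), 1e-12)
    S = W / np.sqrt(d[:, None] * d[None, :])
    L_norm = np.eye(W.shape[0]) - S          # normalized Laplacian
    _, vecs = np.linalg.eigh(L_norm)         # eigenvalues in ascending order
    return np.sign(vecs[:, 1])               # sign of the Fiedler vector
```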
50. Manifold Regularization (Belkin et al., 2006)
51. Manifold Regularization (Belkin et al., 2006)
52. Summary
- Construct a graph using pairwise similarities
- Key quantity: the graph Laplacian
- Captures the geometry of the graph
- The decision boundary is consistent with
- the graph structure
- the labeled examples
- Parameters
- the regularization weights and the similarity measure
53. Questions?
54. Application: Text Classification
- 20-newsgroups
- autos, motorcycles, baseball, and hockey under rec
- Pre-processing
- stemming, removal of stopwords and rare words, headers skipped
- 3,970 documents, 8,014 words
(Figure: results of SVM, KNN, label propagation, and the harmonic function)
55. Application: Text Classification
PRBEP: precision/recall break-even point.
56. Application: Text Classification
- Improvement in PRBEP by SGT
57. Semi-supervised Classification Algorithms
- Label propagation
- Graph partitioning based approaches
- Transductive Support Vector Machine (TSVM)
- Co-training
58. Transductive SVM
- Support vector machine
- Classification margin
- Maximum classification margin
- Decision boundary given a small number of labeled examples
59. Transductive SVM
- Decision boundary given a small number of labeled examples
- How to change the decision boundary given both labeled and unlabeled examples?
60. Transductive SVM
- Decision boundary given a small number of labeled examples
- Move the decision boundary to a region of low local density
61. Transductive SVM
- Classification margin
- f(x): the classification function
- Supervised learning
- Semi-supervised learning
- Optimize over both f(x) and y_u
62. Transductive SVM
- Classification margin
- f(x): the classification function
- Supervised learning
- Semi-supervised learning
- Optimize over both f(x) and y_u
63. Transductive SVM
- Classification margin
- f(x): the classification function
- Supervised learning
- Semi-supervised learning
- Optimize over both f(x) and y_u
64. Transductive SVM
- Decision boundary given a small number of labeled examples
- Move the decision boundary to a place with low local density
- Classification results
- How to formulate this idea?
65. Transductive SVM: Formulation
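The formulation on the original slide was lost in extraction; a standard TSVM objective consistent with the surrounding description (hinge losses on both sets, optimized over both f and the unlabeled labels y_u) reads:

```latex
\min_{w,\, b,\; y_{l+1},\dots,y_{l+u} \in \{\pm 1\}}\;
\frac{1}{2}\lVert w \rVert^2
+ C_l \sum_{i=1}^{l} \max\bigl(0,\, 1 - y_i\, f(x_i)\bigr)
+ C_u \sum_{j=l+1}^{l+u} \max\bigl(0,\, 1 - y_j\, f(x_j)\bigr),
\qquad f(x) = w^\top x + b
```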
66. Computational Issue
- No longer a convex optimization problem
- Alternating optimization (see the sketch below)
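A minimal sketch of the alternation, assuming scikit-learn's LinearSVC. This is a simplified self-labeling variant: real TSVM solvers such as Joachims' also anneal the unlabeled cost C_u and keep the pseudo-label proportions balanced.

```python
import numpy as np
from sklearn.svm import LinearSVC

def tsvm_alternate(X_l, y_l, X_u, n_rounds=10, C=1.0):
    """Alternate between fixing y_u and training an SVM (optimize f),
    and fixing the SVM and re-assigning y_u (optimize the labels)."""
    X = np.vstack([X_l, X_u])
    y_u = None
    svm = LinearSVC(C=C).fit(X_l, y_l)
    for _ in range(n_rounds):
        y_u_new = svm.predict(X_u)                   # fix f, optimize y_u
        if y_u is not None and np.array_equal(y_u, y_u_new):
            break                                    # labels stable: converged
        y_u = y_u_new
        svm = LinearSVC(C=C).fit(X, np.concatenate([y_l, y_u]))  # fix y_u, optimize f
    return svm, y_u
```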
67. Summary
- Based on the maximum margin principle
- The classification margin is decided by
- Labeled examples
- Class labels assigned to the unlabeled data
- High computational cost
- Variants: Low Density Separation (LDS), Semi-Supervised Support Vector Machine (S3VM), ∇TSVM
68. Questions?
69. Text Classification by TSVM
- 10 categories from the Reuters collection
- 3,299 test documents
- 1,000 informative words selected by the mutual-information criterion
70. Semi-supervised Classification Algorithms
- Label propagation
- Graph partitioning based approaches
- Transductive Support Vector Machine (TSVM)
- Co-training
71. Co-training (Blum & Mitchell, 1998)
- Classify web pages into
- a category for students and a category for professors
- Two views of a web page
- Content
- "I am currently the second year Ph.D. student ..."
- Hyperlinks
- "My advisor is ..."
- → Students
72. Co-training for Semi-supervised Learning
73. Co-training for Semi-supervised Learning
It is easier to classify this web page using its hyperlinks
It is easy to classify the type of this web page based on its content
74. Co-training
- Two representations of each web page
Content representation: (doctoral, student, computer, university)
Hyperlink representation: Inlinks: Prof. Cheng; Outlinks: Prof. Cheng
75. Co-training
- Train a content-based classifier
76. Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
77. Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
- Train a hyperlink-based classifier
78. Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
- Train a hyperlink-based classifier
- Label the unlabeled examples that are confidently classified
79. Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
- Train a hyperlink-based classifier
- Label the unlabeled examples that are confidently classified
80. Co-training
- Assumes two views of the objects
- Two sufficient representations
- Key idea
- Augment the training examples of one view by exploiting the classifier of the other view (see the sketch below)
- Extension to multiple views
- Problem: how to find equivalent views
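A minimal co-training loop under the deck's two-view setting. The naive Bayes base classifiers and count features are assumptions; the original example uses content and hyperlink views of web pages.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, n_rounds=10, per_round=5):
    """X1_*, X2_*: count features of the two views (e.g., content, links)."""
    X1, X2, y = X1_l.copy(), X2_l.copy(), y_l.copy()
    unl = np.arange(X1_u.shape[0])          # indices still unlabeled
    for _ in range(n_rounds):
        c1 = MultinomialNB().fit(X1, y)     # content-based classifier
        c2 = MultinomialNB().fit(X2, y)     # hyperlink-based classifier
        for clf, Xu in ((c1, X1_u), (c2, X2_u)):
            if unl.size == 0:
                break
            conf = clf.predict_proba(Xu[unl]).max(axis=1)
            pick = unl[np.argsort(-conf)[:per_round]]   # most confident
            # each view's classifier labels examples for the shared pool
            X1 = np.vstack([X1, X1_u[pick]])
            X2 = np.vstack([X2, X2_u[pick]])
            y = np.concatenate([y, clf.predict(Xu[pick])])
            unl = np.setdiff1d(unl, pick)
    return c1, c2
```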
81. Active Learning
- Active learning
- Select the most informative examples
- In contrast to passive learning
- Key question: which examples are informative?
- Uncertainty principle: the most informative example is the one that is most uncertain to classify
- Measure classification uncertainty
82. Active Learning
- Query by committee (QBC)
- Construct an ensemble of classifiers
- Classification uncertainty: the degree of disagreement among the ensemble
- SVM-based approach
- Classification uncertainty: the distance to the decision boundary
- Simple but very effective approaches (sketches below)
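Both selection criteria fit in a few lines. These sketches assume scikit-learn-style classifiers (a decision_function for the SVM margin, predict for the committee members); vote entropy as the disagreement measure is one common choice, not the only one.

```python
import numpy as np

def svm_uncertain(clf, X_pool, n_queries=10):
    """SVM-based selection: the most uncertain examples are the ones
    closest to the decision boundary, i.e. smallest |f(x)|."""
    return np.argsort(np.abs(clf.decision_function(X_pool)))[:n_queries]

def qbc_uncertain(committee, X_pool, n_queries=10):
    """Query-by-committee: pick the examples with the largest
    disagreement, measured here by the vote entropy."""
    votes = np.stack([m.predict(X_pool) for m in committee])  # (m, n)
    ent = np.empty(votes.shape[1])
    for i in range(votes.shape[1]):
        _, counts = np.unique(votes[:, i], return_counts=True)
        p = counts / counts.sum()
        ent[i] = -(p * np.log(p)).sum()
    return np.argsort(-ent)[:n_queries]                       # query an oracle
```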
83. Semi-supervised Clustering
- Clustering data into two clusters
84. Semi-supervised Clustering
- Clustering data into two clusters
- Side information
- Must-links vs. cannot-links
85. Semi-supervised Clustering
- Also called constrained clustering
- Two types of approaches
- Restricted data partitions
- Distance metric learning approaches
86. Restricted Data Partition
- Require data partitions to be consistent with the given links
- Links as hard constraints
- E.g., constrained K-means (Wagstaff et al., 2001); a sketch follows below
- Links as soft constraints
- E.g., Metric Pairwise Constrained K-means (Basu et al., 2004)
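A sketch of the hard-constraint case in the style of COP-KMeans (Wagstaff et al., 2001): each point is assigned to the nearest centroid that violates no link with the points already assigned in the current pass, and the algorithm gives up if no feasible cluster exists.

```python
import numpy as np

def violates(i, c, labels, must, cannot):
    """Would assigning point i to cluster c violate any link constraint?"""
    for a, b in must:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] != -1 and labels[j] != c:
            return True
    for a, b in cannot:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] == c:
            return True
    return False

def cop_kmeans(X, k, must, cannot, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        labels[:] = -1
        for i in range(len(X)):
            # try centroids from nearest to farthest
            for c in np.argsort(((X[i] - centers) ** 2).sum(axis=1)):
                if not violates(i, int(c), labels, must, cannot):
                    labels[i] = c
                    break
            if labels[i] == -1:                     # no feasible cluster:
                raise ValueError("constraints unsatisfiable")  # abort
        for c in range(k):                          # standard centroid update
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels
```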
87. Restricted Data Partition
- Hard constraints
- Cluster memberships must obey the link constraints
Yes
88. Restricted Data Partition
- Hard constraints
- Cluster memberships must obey the link constraints
Yes
89. Restricted Data Partition
- Hard constraints
- Cluster memberships must obey the link constraints
No
90. Restricted Data Partition
- Soft constraints
- Penalize the data clustering if it violates some links
Penalty = 0
91. Restricted Data Partition
- Soft constraints
- Penalize the data clustering if it violates some links
Penalty = 0
92. Restricted Data Partition
- Soft constraints
- Penalize the data clustering if it violates some links
Penalty = 1
93. Distance Metric Learning
- Learn a distance metric from the pairwise links
- Enlarge the distance for a cannot-link
- Shorten the distance for a must-link
- Apply K-means with pairwise distances measured by the learned distance metric (a sketch follows below)
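A minimal metric-learning sketch in the spirit of RCA (Relevant Component Analysis): the difference directions of must-linked pairs are whitened, so must-linked points move closer. Cannot-links are ignored here, a simplification relative to methods that also enlarge cannot-link distances.

```python
import numpy as np

def must_link_metric(X, must_links, reg=1e-6):
    """Mahalanobis matrix M learned from must-links only (RCA-style)."""
    diffs = np.array([X[i] - X[j] for i, j in must_links])
    C = diffs.T @ diffs / len(diffs) + reg * np.eye(X.shape[1])
    return np.linalg.inv(C)        # shrinks must-link difference directions

def mahalanobis(x, y, M):
    d = x - y
    return float(np.sqrt(d @ M @ d))

# K-means can then run on the transformed points L @ x with M = L^T L
# (e.g. L = np.linalg.cholesky(M).T): Euclidean distance in the new
# space equals the learned Mahalanobis distance.
```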
94. Example of Distance Metric Learning
2D data projection using the Euclidean distance metric
Solid lines: must-links; dotted lines: cannot-links
95. BoostCluster (Liu, Jin & Jain, 2007)
- General framework for semi-supervised clustering
- Improves any given unsupervised clustering algorithm with pairwise constraints
- Key challenges
- How to influence an arbitrary clustering algorithm with side information?
- Encode the constraints into the data representation
- How to take into account the performance of the underlying clustering algorithm?
- Iteratively improve the clustering performance
96. BoostCluster
Given (a) pairwise constraints, (b) data examples, and (c) a clustering algorithm
97. BoostCluster
Find the best data representation that encodes the unsatisfied pairwise constraints
98. BoostCluster
Obtain the clustering results given the new data representation
99. BoostCluster
Update the kernel with the clustering results
100. BoostCluster
Run the procedure iteratively
101. BoostCluster
Compute the final clustering result
102. Summary
- Clustering data under given pairwise constraints
- Must-links vs. cannot-links
- Two types of approaches
- Restricted data partitions (either soft or hard)
- Distance metric learning
- Question: how to acquire the links/constraints?
- Manual assignment
- Derived from side information: hyperlinks, citations, user logs, etc.
- May be noisy and unreliable
103. Application: Document Clustering (Basu et al., 2004)
- 300 docs from three topics (atheism, baseball, space) of 20-newsgroups
- 3,251 unique words after removal of stopwords and rare words, and stemming
- Evaluation metric: Normalized Mutual Information (NMI)
- KMeans-x-x: different variants of constrained clustering algorithms
104. Kernel Learning
- Kernels play a central role in machine learning
- Kernel functions can be learned from data
- Kernel alignment, multiple kernel learning, non-parametric learning, etc.
- Kernel learning is well suited to IR
- The similarity measure is key to IR
- Kernel learning allows us to identify the optimal similarity measure automatically
105. Transfer Learning
- Different document categories are correlated
- We should be able to borrow information from one class for the training of another class
- Key question: what to transfer between classes?
- Representations, model priors, similarity measures