1
Semi-supervised Learning
  • Rong Jin

2
Spectrum of Learning Problems
3
What is Semi-supervised Learning
  • Learning from a mixture of labeled and unlabeled
    examples

(Notation on the slide: labeled data, unlabeled data, and the total number of examples)
4
Why Semi-supervised Learning?
  • Labeling is expensive and difficult
  • Labeling is unreliable
  • Ex. Segmentation applications
  • Need for multiple experts
  • Unlabeled examples
  • Easy to obtain in large numbers
  • Ex. Web pages, text documents, etc.

5
Semi-supervised Learning Problems
  • Classification
  • Transductive: predict labels of the unlabeled data
  • Inductive: learn a classification function
  • Clustering (constrained clustering)
  • Ranking (semi-supervised ranking)
  • Almost every learning problem has a
    semi-supervised counterpart.

6
Why Unlabeled Could be Helpful
  • Clustering assumption
  • Unlabeled data help decide the decision boundary
  • Manifold assumption
  • Unlabeled data help decide the decision function

7
Clustering Assumption
8
Clustering Assumption
Can you suggest a simple algorithm for semi-supervised learning?
  • Points with same label are connected through high
    density regions, thereby defining a cluster
  • Clusters are separated through low-density regions

9
Manifold Assumption
  • Graph representation
  • Vertex: a training example (labeled or unlabeled)
  • Edge: connects similar examples

Labeled examples
  • Regularize the classification function f(x)

10
Manifold Assumption
  • Graph representation
  • Vertex: a training example (labeled or unlabeled)
  • Edge: connects similar examples
  • Manifold assumption
  • Data lie on a low-dimensional manifold
  • Classification function f(x) should follow the data manifold (see the regularizer sketched below)
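
A standard way to write this smoothness requirement (not spelled out on the slides, but common to the graph-based methods that follow) is the graph regularizer built from the edge weights w_ij:

    S(f) = \frac{1}{2} \sum_{i,j} w_{ij} \bigl( f(x_i) - f(x_j) \bigr)^2 = \mathbf{f}^\top L \mathbf{f}, \qquad L = D - W

A small S(f) means f changes little across heavily weighted edges, i.e., it follows the data manifold; L is the graph Laplacian that reappears in the graph-partitioning slides below.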

11
Statistical View
  • Generative model for classification

12
Statistical View
  • Generative model for classification
  • Unlabeled data help estimate the data distribution
  • → Clustering assumption

13
Statistical View
  • Discriminative model for classification

(Graphical model: parameters θ and µ, class label Y, input X)
14
Statistical View
  • Discriminative model for classification
  • Unlabeled data help regularize the model parameters θ
  • via a prior → Manifold assumption

(Graphical model: parameters θ and µ, class label Y, input X)
15
Semi-supervised Learning Algorithms
  • Label propagation
  • Graph partitioning based approaches
  • Transductive Support Vector Machine (TSVM)
  • Co-training

16
Label Propagation: Key Idea
  • A decision boundary based on the labeled examples
    is unable to take into account the layout of the
    data points
  • How to incorporate the data distribution into the
    prediction of class labels?

17
Label Propagation: Key Idea
  • Connect the data points that are close to each
    other

18
Label Propagation: Key Idea
  • Connect the data points that are close to each
    other
  • Propagate the class labels over the connected
    graph

19
Label Propagation: Key Idea
  • Connect the data points that are close to each
    other
  • Propagate the class labels over the connected
    graph
  • Different from K-nearest-neighbor classification

20
Label Propagation: Representation
  • Adjacency matrix
  • Similarity matrix
  • Degree matrix

21
Label Propagation: Representation
  • Adjacency matrix
  • Similarity matrix
  • Degree matrix (a small construction sketch follows)
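
As a concrete illustration of the three matrices above, here is a minimal sketch; the k-nearest-neighbor connectivity and the RBF width sigma are illustrative choices, not values taken from the slides.

    import numpy as np

    def build_graph(X, k=5, sigma=1.0):
        """Build the adjacency, similarity, and degree matrices used by label propagation."""
        n = X.shape[0]
        # pairwise squared Euclidean distances
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        # adjacency matrix A: connect each point to its k nearest neighbors
        A = np.zeros((n, n))
        for i in range(n):
            neighbors = np.argsort(d2[i])[1:k + 1]   # skip the point itself
            A[i, neighbors] = 1
        A = np.maximum(A, A.T)                       # make the graph undirected
        # similarity matrix W: RBF weights on the edges of A
        W = A * np.exp(-d2 / (2 * sigma ** 2))
        # degree matrix D: row sums of W on the diagonal
        D = np.diag(W.sum(axis=1))
        return A, W, D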

22
Label Propagation: Representation
  • Given
  • Label information

23
Label Propagation: Representation
  • Given
  • Label information

24
Label Propagation
  • Initial class assignments
  • Predicted class assignments
  • First predict the confidence scores
  • Then predict the class assignments

25
Label Propagation
  • Initial class assignments
  • Predicted class assignments
  • First predict the confidence scores
  • Then predict the class assignments (a minimal sketch follows)
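
The two steps above can be sketched as follows; the symmetric normalization of W and the decay parameter alpha follow the usual local-and-global-consistency recipe and are assumptions rather than values read off the slides.

    import numpy as np

    def propagate_labels(W, Y0, alpha=0.9, n_iter=50):
        """Y0 is n x c; row i is the one-hot label of example i (all zeros if unlabeled)."""
        d = W.sum(axis=1)
        S = W / np.sqrt(np.outer(d, d))       # symmetrically normalized similarity
        F = Y0.astype(float).copy()
        for _ in range(n_iter):               # repeated rounds of propagation
            F = alpha * S.dot(F) + (1 - alpha) * Y0
        return F.argmax(axis=1)               # confidence scores -> class assignments

Labeled rows keep being pulled back toward their initial assignment by the (1 − alpha) Y0 term, while unlabeled rows accumulate evidence from their neighbors.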

26
Label Propagation (II)
  • One round of propagation

27
Label Propagation (II)
  • Two rounds of propagation
  • How to extend this to an arbitrary number of iterations?

28
Label Propagation (II)
  • Two rounds of propagation
  • Results for any number of iterations

29
Label Propagation (II)
  • Two rounds of propagation
  • Results for an infinite number of iterations

30
Label Propagation (II)
  • Two rounds of propagation
  • Results for an infinite number of iterations (closed form below)
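
For reference, the limit of this iteration (the "infinite number of iterations" case) has the well-known closed form of Zhou et al.:

    F^{(t+1)} = \alpha S F^{(t)} + (1 - \alpha) Y, \qquad F^{*} = \lim_{t \to \infty} F^{(t)} = (1 - \alpha) (I - \alpha S)^{-1} Y

where S is the normalized similarity matrix and 0 < α < 1 is the decay of propagation; the matrix inverse here is the O(n³) cost mentioned in the summary slide.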

31
Local and Global Consistency (Zhou et al., NIPS 2003)
32
Summary
  • Construct a graph using pairwise similarities
  • Propagate class labels along the graph
  • Key parameters
  • α: the decay of propagation
  • W: the similarity matrix
  • Computational complexity
  • Matrix inverse: O(n³)
  • Cholesky decomposition
  • Clustering

33
Questions
?
34
Application: Text Classification (Zhou et al., NIPS 2003)
  • 20-newsgroups
  • autos, motorcycles, baseball, and hockey under
    rec
  • Pre-processing
  • stemming, removal of stopwords and rare words, and skipping headers
  • Docs: 3970, words: 8014

35
Application: Image Retrieval (Wang et al., ACM MM 2004)
  • 5,000 images
  • Relevance feedback for the top 20 ranked images
  • Classification problem
  • Relevant or not?
  • f(x): degree of relevance
  • Learning relevance function f(x)
  • Supervised learning: SVM
  • Label propagation

(Figure: retrieval accuracy of label propagation vs. SVM)
36
Semi-supervised Learning Algorithms
  • Label propagation
  • Graph partitioning based approaches
  • Transductive Support Vector Machine (TSVM)
  • Co-training

37
Graph Partition
  • Classification as graph partitioning
  • Search for a classification boundary
  • Consistent with labeled examples
  • Partition with small graph cut

38
Graph Partitioning
  • Classification as graph partitioning
  • Search for a classification boundary
  • Consistent with labeled examples
  • Partition with small graph cut

39
Min-cuts (Blum and Chawla, ICML 2001)
  • Additional nodes
  • v+: source, v−: sink
  • Infinite-weight edges connect the labeled examples to the source/sink
  • High computational cost

(Figure: graph augmented with a source node v+ and a sink node v−)
40
Harmonic Function (Zhu et al., ICML 2003)
  • Weight matrix W
  • w_ij ≥ 0: similarity between x_i and x_j
  • Membership vector

41
Harmonic Function (contd)
  • Graph cut
  • Degree matrix
  • Diagonal element

42
Harmonic Function (contd)
  • Graph cut
  • Graph Laplacian L = D − W
  • Pairwise relationships among data points
  • Manifold geometry of the data (see the identity below)
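
The link between the graph cut and the Laplacian can be written explicitly; with class labels encoded as f_i ∈ {−1, +1}, a standard identity (not shown as text on the slide) is

    \mathrm{cut}(f) = \frac{1}{4} \mathbf{f}^\top L \mathbf{f}, \qquad \mathbf{f}^\top L \mathbf{f} = \frac{1}{2} \sum_{i,j} w_{ij} (f_i - f_j)^2, \qquad L = D - W

so minimizing the cut subject to the labeled examples is the same as minimizing f^T L f.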

43
Harmonic Function
44
Harmonic Function
  • Relaxation: {−1, +1} → continuous real values
  • Convert the continuous f back to binary labels (closed-form solution below)
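
With the labels relaxed to real values, the harmonic solution on the unlabeled points has a closed form (Zhu et al., ICML 2003); splitting the Laplacian into labeled (l) and unlabeled (u) blocks,

    f_u = -L_{uu}^{-1} L_{ul} f_l = (D_{uu} - W_{uu})^{-1} W_{ul} f_l

and the binary labels are recovered by thresholding f_u (e.g., at 0 for the ±1 encoding).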

45
Harmonic Function
How to handle a large number of unlabeled data
points
46
Harmonic Function
47
Harmonic Function
Local Propagation
Sound familiar ?
Global propagation
48
Spectral Graph Transducer (Joachims, 2003)
Soften hard constraints
49
Spectral Graph Transducer (Joachims, 2003)
Solved as a constrained eigenvector problem
50
Manifold Regularization (Belkin et al., 2006)
51
Manifold Regularization (Belkin et al., 2006)
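
The content of these two slides is largely graphical in the original; for completeness, the manifold regularization objective of Belkin et al. (2006) has the standard form

    \min_{f} \; \frac{1}{n_l} \sum_{i=1}^{n_l} \ell\bigl(f(x_i), y_i\bigr) \;+\; \gamma_A \|f\|_K^2 \;+\; \gamma_I \, \mathbf{f}^\top L \mathbf{f}

where the first term fits the labeled examples, the second controls complexity in the kernel space, and the third (the graph Laplacian term) forces the function to be smooth along the data manifold.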
52
Summary
  • Construct a graph using pairwise similarity
  • Key quantity: the graph Laplacian
  • Captures the geometry of the graph
  • Decision boundary is consistent with
  • Graph structure
  • Labeled examples
  • Parameters
  • Regularization weights and the similarity measure

53
Questions
?
54
Application: Text Classification
  • 20-newsgroups
  • autos, motorcycles, baseball, and hockey under
    rec
  • Pre-processing
  • stemming, removal of stopwords and rare words, and skipping headers
  • Docs: 3970, words: 8014

(Figure: classification results comparing SVM, kNN, label propagation, and the harmonic function)
55
Application: Text Classification
PRBEP: precision-recall break-even point.
56
Application: Text Classification
  • Improvement in PRBEP by SGT

57
Semi-supervised Classification Algorithms
  • Label propagation
  • Graph partitioning based approaches
  • Transductive Support Vector Machine (TSVM)
  • Co-training

58
Transductive SVM
  • Support vector machine
  • Classification margin
  • Maximum classification margin
  • Decision boundary given a small number of labeled
    examples

59
Transductive SVM
  • Decision boundary given a small number of labeled
    examples
  • How to change decision boundary given both
    labeled and unlabeled examples ?

60
Transductive SVM
  • Decision boundary given a small number of labeled
    examples
  • Move the decision boundary to a region of low local density

61
Transductive SVM
  • Classification margin
  • f(x): classification function
  • Supervised learning
  • Semi-supervised learning
  • Optimize over both f(x) and yu

62
Transductive SVM
  • Classification margin
  • f(x): classification function
  • Supervised learning
  • Semi-supervised learning
  • Optimize over both f(x) and yu

63
Transductive SVM
  • Classification margin
  • f(x): classification function
  • Supervised learning
  • Semi-supervised learning
  • Optimize over both f(x) and yu

64
Transductive SVM
  • Decision boundary given a small number of labeled
    examples
  • Move the decision boundary to a region of low local density
  • Classification results
  • How to formulate this idea?

65
Transductive SVM: Formulation
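
The formulation on this slide is an image in the original deck; the standard TSVM objective it refers to maximizes the margin over both the classifier and the unknown labels of the unlabeled examples:

    \min_{\mathbf{w},\, b,\; y_{n_l+1}, \dots, y_n \in \{\pm 1\}} \;\; \frac{1}{2}\|\mathbf{w}\|^2 + C_l \sum_{i=1}^{n_l} \xi_i + C_u \sum_{j=n_l+1}^{n} \xi_j
    \quad \text{s.t.} \quad y_i (\mathbf{w}^\top x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \;\; \text{for all } i

The separate costs C_l and C_u for labeled and unlabeled slack are the usual convention; their names are not taken from the slides.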
66
Computational Issue
  • No longer a convex optimization problem
  • Alternating optimization (sketched below)
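
A minimal sketch of the alternating scheme, assuming scikit-learn's SVC as the base learner (an illustrative choice, not part of the original slides): fix the labels of the unlabeled data and fit the SVM, then fix the SVM and re-assign those labels, and repeat.

    import numpy as np
    from sklearn.svm import SVC

    def tsvm_alternate(Xl, yl, Xu, C=1.0, n_iter=10):
        """Crude alternating optimization for a TSVM-like objective."""
        clf = SVC(kernel='linear', C=C).fit(Xl, yl)
        yu = clf.predict(Xu)                           # initial guess for the unlabeled labels
        for _ in range(n_iter):
            X = np.vstack([Xl, Xu])
            y = np.concatenate([yl, yu])
            clf = SVC(kernel='linear', C=C).fit(X, y)  # optimize f(x) with yu fixed
            yu_new = clf.predict(Xu)                   # optimize yu with f(x) fixed
            if np.array_equal(yu_new, yu):             # stop when the assignments no longer change
                break
            yu = yu_new
        return clf, yu

Joachims' TSVM additionally uses a separate, gradually increased cost for the unlabeled part and a class-balance constraint; those refinements are omitted here.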

67
Summary
  • Based on maximum margin principle
  • Classification margin is decided by
  • Labeled examples
  • Class labels assigned to unlabeled data
  • High computational cost
  • Variants: Low Density Separation (LDS), Semi-Supervised Support Vector Machine (S3VM), ∇TSVM

68
Questions
?
69
Text Classification by TSVM
  • 10 categories from the Reuters collection
  • 3299 test documents
  • 1000 informative words selected by the mutual-information (MI) criterion

70
Semi-supervised Classification Algorithms
  • Label propagation
  • Graph partitioning based approaches
  • Transductive Support Vector Machine (TSVM)
  • Co-training

71
Co-training (Blum & Mitchell, 1998)
  • Classify web pages into
  • a category for students and a category for professors
  • Two views of web pages
  • Content
  • I am currently the second year Ph.D. student
  • Hyperlinks
  • My advisor is
  • Students

72
Co-training for Semi-Supervised Learning
73
Co-training for Semi-Supervised Learning
It is easier to classify this web page using
hyperlinks
It is easy to classify the type of this web page
based on its content
74
Co-training
  • Two representation for each web page

Content representation: (doctoral, student, computer, university)
Hyperlink representation: inlinks from Prof. Cheng, outlinks to Prof. Cheng
75
Co-training
  • Train a content-based classifier

76
Co-training
  • Train a content-based classifier using labeled
    examples
  • Label the unlabeled examples that are confidently
    classified

77
Co-training
  • Train a content-based classifier using labeled
    examples
  • Label the unlabeled examples that are confidently
    classified
  • Train a hyperlink-based classifier

78
Co-training
  • Train a content-based classifier using labeled
    examples
  • Label the unlabeled examples that are confidently
    classified
  • Train a hyperlink-based classifier
  • Label the unlabeled examples that are confidently
    classified

79
Co-training
  • Train a content-based classifier using labeled
    examples
  • Label the unlabeled examples that are confidently
    classified
  • Train a hyperlink-based classifier
  • Label the unlabeled examples that are confidently classified (a sketch of this loop follows)
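
A minimal sketch of the loop described on the last few slides, assuming two feature views X1 and X2 (e.g., content and hyperlinks) of the same examples; the Naive Bayes base classifiers and the confidence threshold are illustrative choices, and Blum & Mitchell's original pool-based variant differs in its details.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def co_train(X1_l, X2_l, y_l, X1_u, X2_u, n_rounds=10, threshold=0.9):
        """Each round: train one classifier per view, then move confidently
        labeled examples from the unlabeled pool into the training set."""
        remaining = np.arange(X1_u.shape[0])
        clf1, clf2 = MultinomialNB(), MultinomialNB()
        for _ in range(n_rounds):
            if remaining.size == 0:
                break
            clf1.fit(X1_l, y_l)                      # content-based classifier
            clf2.fit(X2_l, y_l)                      # hyperlink-based classifier
            p1 = clf1.predict_proba(X1_u[remaining])
            p2 = clf2.predict_proba(X2_u[remaining])
            confident = np.maximum(p1.max(axis=1), p2.max(axis=1)) >= threshold
            if not confident.any():
                break
            # take the label from whichever view is more confident
            labels = np.where(p1.max(axis=1) >= p2.max(axis=1),
                              clf1.classes_[p1.argmax(axis=1)],
                              clf2.classes_[p2.argmax(axis=1)])[confident]
            idx = remaining[confident]
            X1_l = np.vstack([X1_l, X1_u[idx]])
            X2_l = np.vstack([X2_l, X2_u[idx]])
            y_l = np.concatenate([y_l, labels])
            remaining = remaining[~confident]
        return clf1, clf2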

80
Co-training
  • Assume two views of objects
  • Two sufficient representations
  • Key idea
  • Augment training examples of one view by
    exploiting the classifier of the other view
  • Extension to multiple views
  • Problem: how to find equivalent views

81
Active Learning
  • Active learning
  • Select the most informative examples
  • In contrast to passive learning
  • Key question: which examples are informative?
  • Uncertainty principle: the most informative example is the one that is most uncertain to classify
  • Measure classification uncertainty

82
Active Learning
  • Query by committee (QBC)
  • Construct an ensemble of classifiers
  • Classification uncertainty → largest degree of disagreement
  • SVM-based approach
  • Classification uncertainty → distance to the decision boundary
  • Simple but very effective approaches (a sketch follows)
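
A minimal sketch of the SVM-based criterion for the binary case: pick the unlabeled example closest to the current decision boundary. scikit-learn's SVC is used here as an illustrative base learner.

    import numpy as np
    from sklearn.svm import SVC

    def most_uncertain(Xl, yl, Xu):
        """Index of the unlabeled example nearest the SVM decision boundary."""
        clf = SVC(kernel='linear').fit(Xl, yl)
        margin = np.abs(clf.decision_function(Xu))  # |f(x)|: distance-like score to the boundary
        return int(np.argmin(margin))               # smallest margin = most uncertain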

83
Semi-supervised Clustering
  • Clustering data into two clusters

84
Semi-supervised Clustering
Must-link
Cannot-link
  • Clustering data into two clusters
  • Side information
  • Must links vs. cannot links

85
Semi-supervised Clustering
  • Also called constrained clustering
  • Two types of approaches
  • Restricted data partitions
  • Distance metric learning approaches

86
Restricted Data Partition
  • Require data partitions to be consistent with the
    given links
  • Links → hard constraints
  • E.g., constrained K-Means (Wagstaff et al., 2001)
  • Links → soft constraints
  • E.g., Metric Pairwise Constrained K-means (Basu et al., 2004); see the objective sketched after this list
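
For the soft-constraint case, the clustering objective typically adds penalty terms for violated links to the usual K-means cost, roughly in the spirit of the methods cited above (the penalty weights w and w̄ below are illustrative):

    \min_{\{c_i\},\, \{\mu_k\}} \; \sum_i \|x_i - \mu_{c_i}\|^2 \;+\; \sum_{(i,j) \in \mathcal{M}} w_{ij}\, \mathbf{1}[c_i \ne c_j] \;+\; \sum_{(i,j) \in \mathcal{C}} \bar{w}_{ij}\, \mathbf{1}[c_i = c_j]

where M is the set of must-links and C the set of cannot-links; counting each violated link with unit weight gives the "Penalty = 0 / Penalty = 1" annotations on the following slides.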

87
Restricted Data Partition
  • Hard constraints
  • Cluster memberships must obey the link constraints

Yes
88
Restricted Data Partition
  • Hard constraints
  • Cluster memberships must obey the link constraints

Yes
89
Restricted Data Partition
  • Hard constraints
  • Cluster memberships must obey the link constraints

No
90
Restricted Data Partition
  • Soft constraints
  • Penalize data clustering if it violates some links

Penalty = 0
91
Restricted Data Partition
  • Soft constraints
  • Penalize data clustering if it violates some links

Penalty = 0
92
Restricted Data Partition
  • Soft constraints
  • Penalize data clustering if it violates some links

Penalty = 1
93
Distance Metric Learning
  • Learning a distance metric from pairwise links
  • Enlarge the distance for a cannot-link
  • Shorten the distance for a must-link
  • Apply K-means with pairwise distances measured by the learned distance metric (a sketch follows)
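
A minimal sketch of this pipeline, assuming the metric matrix A has already been learned from the links (the metric-learning step itself, e.g., Xing et al.'s method, is omitted); K-means is then run in the linearly transformed space where Euclidean distance equals the learned Mahalanobis distance.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_with_metric(X, A, n_clusters=2):
        """K-means under the learned metric d(x, y)^2 = (x - y)^T A (x - y)."""
        # symmetric square root of A, so that ||Lx - Ly||^2 is the Mahalanobis distance
        vals, vecs = np.linalg.eigh(A)
        L = (vecs * np.sqrt(np.clip(vals, 0, None))).dot(vecs.T)
        Z = X.dot(L.T)                               # map data into the learned metric space
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)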

94
Example of Distance Metric Learning
2D data projection using Euclidean distance metric
Solid lines: must-links; dotted lines: cannot-links
95
BoostCluster (Liu, Jin & Jain, 2007)
  • General framework for semi-supervised clustering
  • Improves any given unsupervised clustering
    algorithm with pairwise constraints
  • Key challenges
  • How to influence an arbitrary clustering
    algorithm by side information?
  • Encode constraints into data representation
  • How to take into account the performance of
    underlying clustering algorithm?
  • Iteratively improve the clustering performance

96
BoostCluster
Given (a) pairwise constraints, (b) data
examples, and (c) a clustering algorithm
97
BoostCluster
Find the best data representation that encodes the unsatisfied pairwise constraints
98
BoostCluster
Obtain the clustering results given the new data
representation
99
BoostCluster
Update the kernel with the clustering results
100
BoostCluster
Run the procedure iteratively
101
BoostCluster
Compute the final clustering result
102
Summary
  • Clustering data under given pairwise constraints
  • Must links vs. cannot links
  • Two types of approaches
  • Restricted data partitions (either soft or hard)
  • Distance metric learning
  • Question: how to acquire the links/constraints?
  • Manual assignments
  • Derived from side information: hyperlinks, citations, user logs, etc.
  • May be noisy and unreliable

103
Application: Document Clustering (Basu et al., 2004)
  • 300 docs from topics (atheism, baseball, space)
    of 20-newsgroups
  • 3251 unique words after removal of stopwords and
    rare words and stemming
  • Evaluation metric: Normalized Mutual Information (NMI)
  • KMeans-x-x: different variants of constrained clustering algorithms

104
Kernel Learning
  • Kernels play a central role in machine learning
  • Kernel functions can be learned from data
  • Kernel alignment, multiple kernel learning, non-parametric learning, etc. (an alignment example follows the list)
  • Kernel learning is suitable for IR
  • Similarity measure is key to IR
  • Kernel learning allows us to identify the optimal
    similarity measure automatically
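
As one concrete instance of the kernel-learning tools mentioned above, the empirical kernel alignment between a candidate kernel K1 and a target kernel K2 (for example K2 = y yᵀ built from class labels) is the standard Frobenius-cosine score below; this definition is common knowledge rather than something spelled out on the slide.

    import numpy as np

    def kernel_alignment(K1, K2):
        """Empirical alignment between two kernel (Gram) matrices."""
        inner = np.sum(K1 * K2)                                   # Frobenius inner product
        return inner / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

Choosing kernel parameters (or combination weights in multiple kernel learning) to maximize this alignment is one simple way to adapt the similarity measure to the task.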

105
Transfer Learning
  • Different document categories are correlated
  • We should be able to borrow information from one class for the training of another class
  • Key question: what to transfer between classes?
  • Representation, model priors, similarity measure