Title: Semi-supervised Learning
1. Semi-supervised Learning
2. Spectrum of Learning Problems
3. What is Semi-supervised Learning?
- Learning from a mixture of labeled and unlabeled examples
- Labeled data
- Unlabeled data
- Total number of examples
4. Why Semi-supervised Learning?
- Labeling is expensive and difficult
- Labeling is unreliable
- Ex. Segmentation applications
- Need for multiple experts
- Unlabeled examples
- Easy to obtain in large numbers
- Ex. Web pages, text documents, etc.
5. Semi-supervised Learning Problems
- Classification
- Transductive: predict labels of the unlabeled data
- Inductive: learn a classification function
- Clustering (constrained clustering)
- Ranking (semi-supervised ranking)
- Almost every learning problem has a semi-supervised counterpart.
6. Why Unlabeled Data Could Be Helpful
- Clustering assumption
- Unlabeled data help decide the decision boundary
- Manifold assumption
- Unlabeled data help decide the decision function
7. Clustering Assumption
8. Clustering Assumption
Can you suggest a simple algorithm for semi-supervised learning?
- Points with the same label are connected through high-density regions, thereby defining a cluster
- Clusters are separated by low-density regions
9. Manifold Assumption
- Graph representation
- Vertex: a training example (labeled or unlabeled)
- Edge: connects similar examples
- Regularize the classification function f(x)
10. Manifold Assumption
- Graph representation
- Vertex: a training example (labeled or unlabeled)
- Edge: connects similar examples
- Manifold assumption
- Data lie on a low-dimensional manifold
- The classification function f(x) should follow the data manifold
11. Statistical View
- Generative model for classification
12. Statistical View
- Generative model for classification
- Unlabeled data help estimate the model parameters θ
- → Clustering assumption
13. Statistical View
- Discriminative model for classification
(Graphical model: parameters θ and μ over variables X and Y)
14. Statistical View
- Discriminative model for classification
- Unlabeled data help regularize θ via a prior
- → Manifold assumption
(Graphical model: parameters θ and μ over variables X and Y)
15. Semi-supervised Learning Algorithms
- Label propagation
- Graph partitioning based approaches
- Transductive Support Vector Machine (TSVM)
- Co-training
16. Label Propagation: Key Idea
- A decision boundary based on the labeled examples is unable to take into account the layout of the data points
- How to incorporate the data distribution into the prediction of class labels?
17. Label Propagation: Key Idea
- Connect the data points that are close to each other
18. Label Propagation: Key Idea
- Connect the data points that are close to each other
- Propagate the class labels over the connected graph
19. Label Propagation: Key Idea
- Connect the data points that are close to each other
- Propagate the class labels over the connected graph
- Different from K-nearest-neighbor classification
20. Label Propagation: Representation
- Adjacency matrix
- Similarity matrix
- Degree matrix
21. Label Propagation: Representation
- Adjacency matrix
- Similarity matrix
- Degree matrix (a construction sketch follows below)
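The slides name these matrices without giving formulas; below is a minimal numpy sketch. The Gaussian similarity and the k-nearest-neighbor sparsification are assumptions, since the deck does not fix a particular construction.

```python
import numpy as np

def build_graph(X, sigma=1.0, k=10):
    """Build adjacency, similarity, and degree matrices for label
    propagation. Gaussian similarity and k-NN sparsification are
    assumptions; the slides do not fix a particular construction."""
    n = X.shape[0]
    sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
    W = np.exp(-np.maximum(sq, 0) / (2 * sigma ** 2))  # similarity matrix
    np.fill_diagonal(W, 0.0)
    A = np.zeros_like(W)                               # adjacency matrix
    nn = np.argsort(-W, axis=1)[:, :k]                 # k most similar points
    A[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    A = np.maximum(A, A.T)                             # symmetrize
    W = W * A                                          # sparsified similarities
    D = np.diag(W.sum(axis=1))                         # degree matrix
    return A, W, D
```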
22. Label Propagation: Representation
23. Label Propagation: Representation
24. Label Propagation
- Initial class assignments
- Predicted class assignments
- First predict the confidence scores
- Then predict the class assignments
25. Label Propagation
- Initial class assignments
- Predicted class assignments
- First predict the confidence scores
- Then predict the class assignments
26. Label Propagation (II)
27. Label Propagation (II)
- Two rounds of propagation
- How to generalize to any number of iterations?
28. Label Propagation (II)
- Two rounds of propagation
- Results for any number of iterations
29. Label Propagation (II)
- Two rounds of propagation
- Results for an infinite number of iterations
30. Label Propagation (II)
- Two rounds of propagation
- Results for an infinite number of iterations (a sketch follows below)
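The propagation formulas were lost in extraction; a minimal sketch following the normalized form of Zhou et al. (NIPS 2003), which the deck cites next. The decay parameter α is the one named in the summary slide, and the closed form for infinitely many iterations uses the O(n³) matrix inversion mentioned there.

```python
import numpy as np

def label_propagation(W, Y, alpha=0.9, n_iter=None):
    """W: (n, n) similarity matrix. Y: (n, c) initial class assignments
    (one-hot rows for labeled points, zero rows for unlabeled ones)."""
    d = np.maximum(W.sum(axis=1), 1e-12)
    S = W / np.sqrt(d[:, None] * d[None, :])   # S = D^{-1/2} W D^{-1/2}
    if n_iter is not None:                     # finitely many rounds
        F = Y.astype(float)
        for _ in range(n_iter):
            F = alpha * (S @ F) + (1 - alpha) * Y
    else:                                      # limit of infinite iterations:
        n = W.shape[0]                         # F* = (1-a)(I - a S)^{-1} Y
        F = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y)
    return F.argmax(axis=1)                    # confidence scores -> labels
```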
31. Local and Global Consistency (Zhou et al., NIPS 2003)
32. Summary
- Construct a graph using pairwise similarities
- Propagate class labels along the graph
- Key parameters
- α: the decay of propagation
- W: the similarity matrix
- Computational complexity
- Matrix inversion: O(n³)
- Cholesky decomposition
- Clustering
33. Questions?
34. Application: Text Classification (Zhou et al., NIPS 2003)
- 20-newsgroups
- autos, motorcycles, baseball, and hockey under rec
- Pre-processing
- stemming, removal of stopwords and rare words, headers skipped
- 3,970 documents, 8,014 words
35. Application: Image Retrieval (Wang et al., ACM MM 2004)
- 5,000 images
- Relevance feedback for the top 20 ranked images
- Classification problem
- Relevant or not?
- f(x): degree of relevance
- Learning the relevance function f(x)
- Supervised learning: SVM
- Label propagation
(Figure: retrieval results of label propagation vs. SVM)
36. Semi-supervised Learning Algorithms
- Label propagation
- Graph partitioning based approaches
- Transductive Support Vector Machine (TSVM)
- Co-training
37. Graph Partitioning
- Classification as graph partitioning
- Search for a classification boundary
- Consistent with labeled examples
- Partition with small graph cut
38. Graph Partitioning
- Classification as graph partitioning
- Search for a classification boundary
- Consistent with labeled examples
- Partition with small graph cut
39. Min-cuts (Blum and Chawla, ICML 2001)
- Additional nodes
- V+: the source, V-: the sink
- Infinite weights connect the labeled examples to the source and sink
- High computational cost
(Figure: graph with source node V+ and sink node V-)
40. Harmonic Function (Zhu et al., ICML 2003)
- Weight matrix W
- w_ij ≥ 0: the similarity between x_i and x_j
- Membership vector
41. Harmonic Function (cont'd)
- Graph cut
- Degree matrix D
- Diagonal elements: d_i = Σ_j w_ij
42. Harmonic Function (cont'd)
- Graph cut
- Graph Laplacian L = D − W
- Captures the pairwise relationships among data points
- Captures the manifold geometry of the data
43. Harmonic Function
44. Harmonic Function
- Relaxation: {−1, +1} → continuous real numbers
- Convert the continuous f back to binary labels (a sketch follows below)
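A sketch of the relaxed solution, under the convention that the first l points are labeled; the closed form f_u = −L_uu⁻¹ L_ul f_l is the standard harmonic-function result of Zhu et al. (ICML 2003).

```python
import numpy as np

def harmonic_function(W, f_l):
    """W: (n, n) similarity matrix with the first l points labeled.
    f_l: (l,) labels in {-1, +1}. Returns binary labels for the rest."""
    l = f_l.shape[0]
    L = np.diag(W.sum(axis=1)) - W        # graph Laplacian L = D - W
    # Relaxed solution on the unlabeled block: f_u = -L_uu^{-1} L_ul f_l
    f_u = -np.linalg.solve(L[l:, l:], L[l:, :l] @ f_l)
    return np.sign(f_u)                   # convert continuous f to binary
```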
45. Harmonic Function
How to handle a large number of unlabeled data points?
46. Harmonic Function
47. Harmonic Function
Local propagation
Sound familiar?
Global propagation
48. Spectral Graph Transducer (Joachims, 2003)
Soften the hard constraints
49. Spectral Graph Transducer (Joachims, 2003)
Solved as a constrained eigenvector problem (see the sketch below)
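The constrained eigenproblem itself is not reproduced on the slides; the unconstrained spectral core, partitioning by the second-smallest eigenvector of the normalized Laplacian, can be sketched as follows. SGT additionally constrains this solution to agree with the labeled examples.

```python
import numpy as np

def spectral_partition(W):
    """Two-way partition from the second-smallest eigenvector of the
    normalized Laplacian; SGT adds label constraints on top of this."""
    d = np.maximum(W.sum(axis=1), 1e-12)
    S = W / np.sqrt(d[:, None] * d[None, :])
    L_norm = np.eye(W.shape[0]) - S          # normalized Laplacian
    _, vecs = np.linalg.eigh(L_norm)         # eigenvalues in ascending order
    return np.sign(vecs[:, 1])               # sign of the Fiedler vector
```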
50. Manifold Regularization (Belkin et al., 2006)
51. Manifold Regularization (Belkin et al., 2006)
52. Summary
- Construct a graph using pairwise similarities
- Key quantity: the graph Laplacian
- Captures the geometry of the graph
- The decision boundary is consistent with
- the graph structure
- the labeled examples
- Parameters
- the regularization weights and the similarity measure
53. Questions?
54. Application: Text Classification
- 20-newsgroups
- autos, motorcycles, baseball, and hockey under rec
- Pre-processing
- stemming, removal of stopwords and rare words, headers skipped
- 3,970 documents, 8,014 words
(Figure: results of SVM, KNN, label propagation, and the harmonic function)
55. Application: Text Classification
PRBEP: precision/recall break-even point.
56. Application: Text Classification
- Improvement in PRBEP by SGT
57. Semi-supervised Classification Algorithms
- Label propagation
- Graph partitioning based approaches
- Transductive Support Vector Machine (TSVM)
- Co-training
58. Transductive SVM
- Support vector machine
- Classification margin
- Maximum classification margin
- Decision boundary given a small number of labeled examples
59. Transductive SVM
- Decision boundary given a small number of labeled examples
- How to change the decision boundary given both labeled and unlabeled examples?
60. Transductive SVM
- Decision boundary given a small number of labeled examples
- Move the decision boundary to a region of low local density
61. Transductive SVM
- Classification margin
- f(x): the classification function
- Supervised learning
- Semi-supervised learning
- Optimize over both f(x) and y_u
62. Transductive SVM
- Classification margin
- f(x): the classification function
- Supervised learning
- Semi-supervised learning
- Optimize over both f(x) and y_u
63. Transductive SVM
- Classification margin
- f(x): the classification function
- Supervised learning
- Semi-supervised learning
- Optimize over both f(x) and y_u
64. Transductive SVM
- Decision boundary given a small number of labeled examples
- Move the decision boundary to a place with low local density
- Classification results
- How to formulate this idea?
65. Transductive SVM: Formulation
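The formulation on the original slide was lost in extraction; a standard TSVM objective consistent with the surrounding description (hinge losses on both sets, optimized over both f and the unlabeled labels y_u) reads:

```latex
\min_{w,\, b,\; y_{l+1},\dots,y_{l+u} \in \{\pm 1\}}\;
\frac{1}{2}\lVert w \rVert^2
+ C_l \sum_{i=1}^{l} \max\bigl(0,\, 1 - y_i\, f(x_i)\bigr)
+ C_u \sum_{j=l+1}^{l+u} \max\bigl(0,\, 1 - y_j\, f(x_j)\bigr),
\qquad f(x) = w^\top x + b
```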
66. Computational Issue
- No longer a convex optimization problem
- Alternating optimization (see the sketch below)
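A minimal sketch of the alternation, assuming scikit-learn's LinearSVC. This is a simplified self-labeling variant: real TSVM solvers such as Joachims' also anneal the unlabeled cost C_u and keep the pseudo-label proportions balanced.

```python
import numpy as np
from sklearn.svm import LinearSVC

def tsvm_alternate(X_l, y_l, X_u, n_rounds=10, C=1.0):
    """Alternate between fixing y_u and training an SVM (optimize f),
    and fixing the SVM and re-assigning y_u (optimize the labels)."""
    X = np.vstack([X_l, X_u])
    y_u = None
    svm = LinearSVC(C=C).fit(X_l, y_l)
    for _ in range(n_rounds):
        y_u_new = svm.predict(X_u)                   # fix f, optimize y_u
        if y_u is not None and np.array_equal(y_u, y_u_new):
            break                                    # labels stable: converged
        y_u = y_u_new
        svm = LinearSVC(C=C).fit(X, np.concatenate([y_l, y_u]))  # fix y_u, optimize f
    return svm, y_u
```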
67. Summary
- Based on the maximum margin principle
- The classification margin is decided by
- Labeled examples
- Class labels assigned to the unlabeled data
- High computational cost
- Variants: Low Density Separation (LDS), Semi-Supervised Support Vector Machine (S3VM), ∇TSVM
68. Questions?
69. Text Classification by TSVM
- 10 categories from the Reuters collection
- 3,299 test documents
- 1,000 informative words selected by the mutual-information criterion
70. Semi-supervised Classification Algorithms
- Label propagation
- Graph partitioning based approaches
- Transductive Support Vector Machine (TSVM)
- Co-training
71. Co-training (Blum & Mitchell, 1998)
- Classify web pages into
- a category for students and a category for professors
- Two views of a web page
- Content
- "I am currently the second year Ph.D. student ..."
- Hyperlinks
- "My advisor is ..."
- → Students
72. Co-training for Semi-supervised Learning
73. Co-training for Semi-supervised Learning
It is easier to classify this web page using its hyperlinks
It is easy to classify the type of this web page based on its content
74. Co-training
- Two representations of each web page
Content representation: (doctoral, student, computer, university)
Hyperlink representation: Inlinks: Prof. Cheng; Outlinks: Prof. Cheng
75. Co-training
- Train a content-based classifier
76. Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
77. Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
- Train a hyperlink-based classifier
78. Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
- Train a hyperlink-based classifier
- Label the unlabeled examples that are confidently classified
79. Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
- Train a hyperlink-based classifier
- Label the unlabeled examples that are confidently classified
80. Co-training
- Assumes two views of the objects
- Two sufficient representations
- Key idea
- Augment the training examples of one view by exploiting the classifier of the other view (see the sketch below)
- Extension to multiple views
- Problem: how to find equivalent views
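A minimal co-training loop under the deck's two-view setting. The naive Bayes base classifiers and count features are assumptions; the original example uses content and hyperlink views of web pages.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, n_rounds=10, per_round=5):
    """X1_*, X2_*: count features of the two views (e.g., content, links)."""
    X1, X2, y = X1_l.copy(), X2_l.copy(), y_l.copy()
    unl = np.arange(X1_u.shape[0])          # indices still unlabeled
    for _ in range(n_rounds):
        c1 = MultinomialNB().fit(X1, y)     # content-based classifier
        c2 = MultinomialNB().fit(X2, y)     # hyperlink-based classifier
        for clf, Xu in ((c1, X1_u), (c2, X2_u)):
            if unl.size == 0:
                break
            conf = clf.predict_proba(Xu[unl]).max(axis=1)
            pick = unl[np.argsort(-conf)[:per_round]]   # most confident
            # each view's classifier labels examples for the shared pool
            X1 = np.vstack([X1, X1_u[pick]])
            X2 = np.vstack([X2, X2_u[pick]])
            y = np.concatenate([y, clf.predict(Xu[pick])])
            unl = np.setdiff1d(unl, pick)
    return c1, c2
```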
81. Active Learning
- Active learning
- Select the most informative examples
- In contrast to passive learning
- Key question: which examples are informative?
- Uncertainty principle: the most informative example is the one that is most uncertain to classify
- Measure classification uncertainty
82. Active Learning
- Query by committee (QBC)
- Construct an ensemble of classifiers
- Classification uncertainty: the degree of disagreement among the ensemble
- SVM-based approach
- Classification uncertainty: the distance to the decision boundary
- Simple but very effective approaches (sketches below)
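Both selection criteria fit in a few lines. These sketches assume scikit-learn-style classifiers (a decision_function for the SVM margin, predict for the committee members); vote entropy as the disagreement measure is one common choice, not the only one.

```python
import numpy as np

def svm_uncertain(clf, X_pool, n_queries=10):
    """SVM-based selection: the most uncertain examples are the ones
    closest to the decision boundary, i.e. smallest |f(x)|."""
    return np.argsort(np.abs(clf.decision_function(X_pool)))[:n_queries]

def qbc_uncertain(committee, X_pool, n_queries=10):
    """Query-by-committee: pick the examples with the largest
    disagreement, measured here by the vote entropy."""
    votes = np.stack([m.predict(X_pool) for m in committee])  # (m, n)
    ent = np.empty(votes.shape[1])
    for i in range(votes.shape[1]):
        _, counts = np.unique(votes[:, i], return_counts=True)
        p = counts / counts.sum()
        ent[i] = -(p * np.log(p)).sum()
    return np.argsort(-ent)[:n_queries]                       # query an oracle
```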
83. Semi-supervised Clustering
- Clustering data into two clusters
84. Semi-supervised Clustering
- Clustering data into two clusters
- Side information
- Must-links vs. cannot-links
85. Semi-supervised Clustering
- Also called constrained clustering
- Two types of approaches
- Restricted data partitions
- Distance metric learning approaches
86. Restricted Data Partition
- Require data partitions to be consistent with the given links
- Links as hard constraints
- E.g., constrained K-means (Wagstaff et al., 2001); a sketch follows below
- Links as soft constraints
- E.g., Metric Pairwise Constrained K-means (Basu et al., 2004)
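A sketch of the hard-constraint case in the style of COP-KMeans (Wagstaff et al., 2001): each point is assigned to the nearest centroid that violates no link with the points already assigned in the current pass, and the algorithm gives up if no feasible cluster exists.

```python
import numpy as np

def violates(i, c, labels, must, cannot):
    """Would assigning point i to cluster c violate any link constraint?"""
    for a, b in must:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] != -1 and labels[j] != c:
            return True
    for a, b in cannot:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] == c:
            return True
    return False

def cop_kmeans(X, k, must, cannot, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        labels[:] = -1
        for i in range(len(X)):
            # try centroids from nearest to farthest
            for c in np.argsort(((X[i] - centers) ** 2).sum(axis=1)):
                if not violates(i, int(c), labels, must, cannot):
                    labels[i] = c
                    break
            if labels[i] == -1:                     # no feasible cluster:
                raise ValueError("constraints unsatisfiable")  # abort
        for c in range(k):                          # standard centroid update
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels
```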
87. Restricted Data Partition
- Hard constraints
- Cluster memberships must obey the link constraints
Yes
88. Restricted Data Partition
- Hard constraints
- Cluster memberships must obey the link constraints
Yes
89. Restricted Data Partition
- Hard constraints
- Cluster memberships must obey the link constraints
No
90. Restricted Data Partition
- Soft constraints
- Penalize the data clustering if it violates some links
Penalty = 0
91. Restricted Data Partition
- Soft constraints
- Penalize the data clustering if it violates some links
Penalty = 0
92. Restricted Data Partition
- Soft constraints
- Penalize the data clustering if it violates some links
Penalty = 1
93. Distance Metric Learning
- Learn a distance metric from the pairwise links
- Enlarge the distance for a cannot-link
- Shorten the distance for a must-link
- Apply K-means with pairwise distances measured by the learned distance metric (a sketch follows below)
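A minimal metric-learning sketch in the spirit of RCA (Relevant Component Analysis): the difference directions of must-linked pairs are whitened, so must-linked points move closer. Cannot-links are ignored here, a simplification relative to methods that also enlarge cannot-link distances.

```python
import numpy as np

def must_link_metric(X, must_links, reg=1e-6):
    """Mahalanobis matrix M learned from must-links only (RCA-style)."""
    diffs = np.array([X[i] - X[j] for i, j in must_links])
    C = diffs.T @ diffs / len(diffs) + reg * np.eye(X.shape[1])
    return np.linalg.inv(C)        # shrinks must-link difference directions

def mahalanobis(x, y, M):
    d = x - y
    return float(np.sqrt(d @ M @ d))

# K-means can then run on the transformed points L @ x with M = L^T L
# (e.g. L = np.linalg.cholesky(M).T): Euclidean distance in the new
# space equals the learned Mahalanobis distance.
```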
94. Example of Distance Metric Learning
2D data projection using the Euclidean distance metric
Solid lines: must-links; dotted lines: cannot-links
95. BoostCluster (Liu, Jin & Jain, 2007)
- General framework for semi-supervised clustering
- Improves any given unsupervised clustering algorithm with pairwise constraints
- Key challenges
- How to influence an arbitrary clustering algorithm with side information?
- Encode the constraints into the data representation
- How to take into account the performance of the underlying clustering algorithm?
- Iteratively improve the clustering performance
96. BoostCluster
Given (a) pairwise constraints, (b) data examples, and (c) a clustering algorithm
97. BoostCluster
Find the best data representation that encodes the unsatisfied pairwise constraints
98. BoostCluster
Obtain the clustering results given the new data representation
99. BoostCluster
Update the kernel with the clustering results
100. BoostCluster
Run the procedure iteratively
101. BoostCluster
Compute the final clustering result
102. Summary
- Clustering data under given pairwise constraints
- Must-links vs. cannot-links
- Two types of approaches
- Restricted data partitions (either soft or hard)
- Distance metric learning
- Question: how to acquire the links/constraints?
- Manual assignment
- Derived from side information: hyperlinks, citations, user logs, etc.
- May be noisy and unreliable
103. Application: Document Clustering (Basu et al., 2004)
- 300 docs from three topics (atheism, baseball, space) of 20-newsgroups
- 3,251 unique words after removal of stopwords and rare words, and stemming
- Evaluation metric: Normalized Mutual Information (NMI)
- KMeans-x-x: different variants of constrained clustering algorithms
104. Kernel Learning
- Kernels play a central role in machine learning
- Kernel functions can be learned from data
- Kernel alignment, multiple kernel learning, non-parametric learning, etc.
- Kernel learning is well suited to IR
- The similarity measure is key to IR
- Kernel learning allows us to identify the optimal similarity measure automatically
105. Transfer Learning
- Different document categories are correlated
- We should be able to borrow information from one class for the training of another class
- Key question: what to transfer between classes?
- Representations, model priors, similarity measures