Title: Techniques For Exploiting Unlabeled Data
1Techniques For Exploiting Unlabeled Data
Thesis Defense
September 8, 2008
Committee: Avrim Blum, CMU (Co-Chair); John Lafferty, CMU (Co-Chair); William Cohen, CMU; Xiaojin (Jerry) Zhu, Wisconsin
2Motivation
Supervised Machine Learning
Labeled Examples (xi, yi) → (induction) → Model: x → y
Problems: Document classification, image classification, protein sequence determination.
Algorithms: SVM, Neural Nets, Decision Trees, etc.
3Motivation
- In recent years, there has been growing interest in techniques for using unlabeled data:
- More data is being collected than ever before.
- Labeling examples can be expensive and/or require human intervention.
4Examples
Images: Abundantly available (digital cameras); labeling requires humans (captchas).
Web Pages: Can be easily crawled on the web; labeling requires human intervention.
Proteins: Sequence can be easily determined; structure determination is a hard problem.
5Motivation
Semi-Supervised Machine Learning
Labeled Examples (xi, yi) and Unlabeled Examples xi → Model: x → y
6Motivation
7However
Techniques are not as well developed as supervised techniques; in particular we lack:
- Best practices for using unlabeled data
- Techniques for adapting supervised algorithms to semi-supervised algorithms
8Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Conclusion and Questions
9 Graph Mincut (Blum & Chawla, 2001)
10Construct an (unweighted) Graph
11Add auxiliary super-nodes
12Obtain s-t mincut
13Classification
14Plain mincut can give very unbalanced cuts.
15Add random weights to the edges
Run plain mincut and obtain a classification.
Repeat the above process several times.
For each unlabeled example take a majority vote.
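A minimal sketch of this randomized mincut procedure, assuming a graph has already been built over the labeled and unlabeled examples (networkx is used for the s-t mincut; the noise range, number of rounds, and super-node names are illustrative assumptions, not the thesis's exact settings):

```python
# Randomized mincut: perturb edge weights, cut, and vote (illustrative sketch).
import random
import networkx as nx

def randomized_mincut(G, pos_nodes, neg_nodes, unlabeled, n_rounds=20, noise=0.5):
    votes = {u: 0 for u in unlabeled}
    for _ in range(n_rounds):
        H = G.copy()
        for u, v in H.edges():
            # perturb each unit edge weight by a small random amount
            H[u][v]["capacity"] = 1.0 + random.uniform(0.0, noise)
        # auxiliary super-nodes; edges without a capacity attribute are
        # treated by networkx as having infinite capacity
        for p in pos_nodes:
            H.add_edge("s+", p)
        for q in neg_nodes:
            H.add_edge("t-", q)
        _, (s_side, _) = nx.minimum_cut(H, "s+", "t-")
        for u in unlabeled:
            votes[u] += 1 if u in s_side else -1
    # majority vote over all randomized cuts
    return {u: (1 if votes[u] >= 0 else -1) for u in unlabeled}
```

Because the super-node edges carry no capacity attribute, networkx treats them as infinite, so no randomized cut can separate a labeled example from its super-node.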
16 Before adding random weights
(Figure: an unbalanced mincut)
17 After adding random weights
(Figure: a more balanced mincut)
18 PAC-Bayes
- PAC-Bayes bounds suggest that when the graph has many small cuts consistent with the labeling, randomization should improve generalization performance.
- In this case each distinct cut corresponds to a different hypothesis.
- Hence the average of these cuts will be less likely to overfit than any single cut.
19 Markov Random Fields
- Ideally we would like to assign a weight to each cut in the graph (a higher weight to small cuts) and then take a weighted vote over all the cuts in the graph.
- This corresponds to a Markov Random Field model.
- We don't know how to do this efficiently, but we can view randomized mincuts as an approximation.
20- How to construct the graph?
- k-NN
- Graph may not have small balanced cuts.
- How to learn k?
- Connect all points within distance d
- Can have disconnected components.
- How to learn d?
- Minimum Spanning Tree (a sketch of this construction follows after this list)
- No parameters to learn.
- Gives a connected, sparse graph.
- Seems to work well on most datasets.
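As a rough illustration of the minimum spanning tree construction (assuming Euclidean distances and distinct points; the helper names are ours, not the thesis code):

```python
# Build an unweighted graph from the MST of pairwise distances (sketch).
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree
import networkx as nx

def mst_graph(X):
    D = cdist(X, X)                        # pairwise Euclidean distances
    mst = minimum_spanning_tree(D)         # sparse matrix of MST edges
    G = nx.Graph()
    rows, cols = mst.nonzero()
    for i, j in zip(rows, cols):
        G.add_edge(int(i), int(j), capacity=1.0)   # unit capacities for mincut
    return G
```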
21Experiments
- ONE vs. TWO: 1128 examples (8 × 8 array of integers, Euclidean distance).
- ODD vs. EVEN: 4000 examples (16 × 16 array of integers, Euclidean distance).
- PC vs. MAC: 1943 examples (20 newsgroups dataset, TFIDF distance).
22ONE vs. TWO
23ODD vs. EVEN
24PC vs. MAC
25Summary
Randomization helps plain mincut achieve performance comparable to Gaussian Fields.
We can apply PAC sample complexity analysis and interpret it in terms of Markov Random Fields.
There is an intuitive interpretation of the confidence of a prediction in terms of the margin of the vote.
Reference: A. Blum, J. Lafferty, M.R. Rwebangira, R. Reddy. Semi-supervised Learning Using Randomized Mincuts. ICML 2004.
26Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Proposed Work and Time Line
27(Supervised) Linear Regression
(Figure: y vs. x, labeled points with a fitted line)
28Semi-Supervised Regression
(Figure: y vs. x, labeled and unlabeled points)
29Smoothness assumption
Things that are close together should have similar values:
Φ(f) = Σij wij (fi - fj)²
where wij is the similarity between examples i and j, and fi and fj are the predictions for examples i and j.
Gaussian Fields (Zhu, Ghahramani & Lafferty)
30 Local Constancy
The predictions made by Gaussian Fields are locally constant.
(Figure: y vs. x, comparing the prediction at u and at u + δ)
More formally: m(u + δ) ≈ m(u)
31 Local Linearity
For many regression tasks we would prefer predictions to be locally linear.
(Figure: y vs. x, comparing the prediction at u and at u + δ)
More formally: m(u + δ) ≈ m(u) + m′(u)δ
32Problem
Develop a version of Gaussian Fields which is locally linear,
or equivalently a semi-supervised version of Linear Regression:
Local Linear Semi-supervised Regression (LLSR)
33Local Linear Semi-supervised Regression
By analogy with Σij wij (fi - fj)², each point xi gets its own local linear fit βi (with intercept βi0), and we penalize the weighted disagreement wij (βi0 - XjiTβj)², where XjiTβj is the local fit at xj evaluated at xi.
(Figure: local fits βi at xi and βj at xj, with the discrepancy (βi0 - XjiTβj)² marked)
34Local Linear Semi-supervised Regression
So we find β to minimize the following objective function:
Φ(β) = Σij wij (βi0 - XjiTβj)²
where wij is the similarity between xi and xj.
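For concreteness, a small sketch of evaluating this objective, under the assumption that Xji = [1, xi - xj] so that XjiTβj is j's local fit evaluated at xi (the variable names are illustrative, not code from the thesis):

```python
# Evaluate the LLSR objective for a given set of local fits (sketch).
import numpy as np

def llsr_objective(beta, X, W):
    """beta: (n, d+1) local coefficients, beta[i, 0] = intercept; X: (n, d); W: (n, n) similarities."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            X_ji = np.concatenate(([1.0], X[i] - X[j]))      # local design vector at xi relative to xj
            total += W[i, j] * (beta[i, 0] - X_ji @ beta[j]) ** 2
    return total
```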
35 Synthetic Data: Gong
Gong function: y = (1/x) sin(15/x)
σ² = 0.1 (noise)
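A quick sketch of sampling this synthetic function (the x range and sample size are assumptions made for illustration):

```python
# Sample noisy points from the Gong function y = (1/x) sin(15/x).
import numpy as np

def gong_data(n=200, noise_var=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.1, 1.0, size=n)                 # avoid x = 0
    y = (1.0 / x) * np.sin(15.0 / x) + rng.normal(0, np.sqrt(noise_var), size=n)
    return x, y
```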
36 Experimental Results: GONG
Weighted Kernel Regression, MSE = 25.7
37 Experimental Results: GONG
Local Linear Regression, MSE = 14.4
38 Experimental Results: GONG
LLSR, MSE = 7.99
39 PROBLEM: RUNNING TIME
If we have n examples and dimension d, then to compute a closed form solution we have to invert an n(d+1) × n(d+1) matrix.
This is prohibitively expensive, especially if d is large.
For example, if n = 1500 and d = 199 then n(d+1) = 300,000, and a 300,000 × 300,000 matrix takes 720 GB in Matlab's double precision format.
40 SOLUTION: ITERATION
It turns out that because of the form of the equation we can start from an arbitrary initial guess and do an iterative computation that provably converges to the desired solution.
In the case of n = 1500 and d = 199, instead of dealing with a 720 GB matrix we only have to store about 2.4 MB (a vector of 300,000 doubles) in memory, which makes the algorithm much more practical.
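One way to realize this iterative idea is a matrix-free solver that only needs matrix-vector products. The sketch below uses SciPy's conjugate gradient as a stand-in; it assumes the system matrix is symmetric positive definite, as for a least-squares objective, and apply_A is a placeholder for the actual LLSR operator rather than the thesis's implementation:

```python
# Solve A x = b iteratively from an arbitrary initial guess, never forming
# or inverting the full n(d+1) x n(d+1) matrix (sketch).
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_iteratively(apply_A, b, x0=None):
    """apply_A(v) must return A @ v; only matrix-vector products are needed."""
    n = b.shape[0]
    A = LinearOperator((n, n), matvec=apply_A)   # matrix-free operator
    x, info = cg(A, b, x0=x0)                    # conjugate gradient iterations
    return x
```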
41 Experiments on Real Data
We do model selection using Leave-One-Out Cross Validation.
We compare:
- Weighted Kernel Regression (WKR), a purely supervised method.
- Local Linear Regression (LLR), another purely supervised method.
- Local Learning Regularization (LL-Reg), an up-to-date semi-supervised method.
- Local Linear Semi-supervised Regression (LLSR).
For each algorithm and dataset we give:
1. The mean and standard deviation of 10 runs.
2. The results of an OPTIMAL choice of parameters.
42 Experimental Results (mean ± std over 10 runs)

Dataset     n    d  nl   LLSR      LLSR-OPT   WKR       WKR-OPT
Carbon      58   1  10   27±25     19±11      70±36     37±11
Alligators  25   1  10   288±176   209±162    336±210   324±211
Smoke       25   1  10   82±13     79±13      83±19     80±15
Autompg     392  7  100  50±2      49±1       57±3      57±3

Dataset     n    d  nl   LLR       LLR-OPT    LL-Reg    LL-Reg-OPT
Carbon      58   1  10   57±16     54±10      162±199   74±22
Alligators  25   1  10   207±140   207±140    289±222   248±157
Smoke       25   1  10   82±12     80±13      82±14     70±6
Autompg     392  7  100  53±3      52±3       53±4      51±2
43Summary
LLSR is a natural semi-supervised generalization of Linear Regression.
While the analysis is not as clear as for semi-supervised classification, semi-supervised regression can perform better than supervised regression when the target function is smooth, as with the GONG function.
FUTURE WORK
Carefully analyzing the assumptions under which
unlabeled data can be useful in regression.
44Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Proposed Work and Time Line
45Kernels
K(x,y): informally considered as a measure of similarity between x and y.
Kernel trick: K(x,y) = Φ(x)·Φ(y) (Mercer's theorem).
This allows us to implicitly project non-linearly separable data into a high dimensional space where a linear separator can be found.
A kernel must satisfy strict mathematical conditions:
1. Continuous
2. Symmetric
3. Positive semi-definite
46Problems with Kernels
There is a conceptual disconnect between the notion of kernels as similarity functions and the notion of finding max-margin separators in possibly infinite dimensional Hilbert spaces.
The properties of kernels, such as being Positive Semi-Definite, are rather restrictive; in particular, similarity functions used in certain domains, such as the Smith-Waterman score in molecular biology, do not fit in this framework.
WANTED: A method for using similarity functions that is both easy and general.
47The Balcan-Blum approach
An approach fitting these requirements was recently proposed by Balcan and Blum, who:
- Gave a general definition of a good similarity function for learning.
- Showed that kernels are a special case of their definition.
- Gave an algorithm for learning with good similarity functions.
48The Balcan-Blum approach
Suppose S(x,y) ∈ [-1,1] is our similarity function. Then:
1. Draw d examples x1, x2, ..., xd uniformly at random from the data set.
2. For each example x compute the mapping x → (S(x,x1), S(x,x2), ..., S(x,xd)).
KEY POINT: This method can make use of UNLABELED DATA.
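A minimal sketch of this mapping; the landmark points x1, ..., xd can be drawn from the unlabeled data, and sim is any similarity function returning values in [-1, 1] (all names here are illustrative):

```python
# Map each example x to its vector of similarities to d landmark points (sketch).
import numpy as np

def similarity_features(X, landmarks, sim):
    """Return the matrix whose row for x is (S(x, x1), ..., S(x, xd))."""
    return np.array([[sim(x, l) for l in landmarks] for x in X])

# Example usage (landmarks drawn from unlabeled data; sim is an assumed similarity):
# rng = np.random.default_rng(0)
# idx = rng.choice(len(X_unlabeled), size=50, replace=False)
# Z = similarity_features(X_labeled, X_unlabeled[idx], sim)
```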
49Combining Feature based and Graph Based Methods
Feature based methods operate directly on the native features, e.g. Decision Trees, MaxEnt, Winnow, Perceptron.
Graph based methods operate on the graph of similarities between examples, e.g. Kernel methods, Gaussian Fields, Graph mincut and most semi-supervised learning methods.
These two families can each work well on different datasets; we want to find a way to COMBINE these approaches into one algorithm.
50 SOLUTION: Similarity functions + Winnow
Use the Balcan-Blum approach to generate extra features.
Append the extra features to the original features: x → (x, S(x,x1), S(x,x2), ..., S(x,xd)).
Run the Winnow algorithm on the combined features (Winnow is known to be resistant to irrelevant features).
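A sketch of the combined pipeline; the balanced Winnow variant, learning rate, and feature rescaling below are assumptions made for illustration rather than the exact setup used in the experiments:

```python
# Winnow on the combined (original + similarity) features (sketch).
import numpy as np

def train_winnow(X, y, epochs=10, eta=1.5):
    """Balanced Winnow; works best with non-negative features, y in {-1, +1}."""
    n, d = X.shape
    w_pos, w_neg = np.ones(d), np.ones(d)
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = 1 if (w_pos - w_neg) @ x >= 0 else -1
            if pred != label:
                # multiplicative promotion/demotion on a mistake
                w_pos *= eta ** (label * x)
                w_neg *= eta ** (-label * x)
    return w_pos - w_neg

# Combined representation: original features plus the similarity features of slide 48,
# with similarities rescaled from [-1, 1] to [0, 1]:
# Z = similarity_features(X_orig, landmarks, sim)
# X_combined = np.hstack([X_orig, (Z + 1) / 2])
# w = train_winnow(X_combined, y_train)
```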
51Our Contributions
Practical techniques for using similarity
functions
Combining graph based and feature based learning.
52 How to define a good similarity function?
By modifying a distance metric: K(x,y) = 1/(D(x,y)+1).
Problem: We can end up with all similarities close to ZERO (not good).
Solution: Scale the similarities as follows:
- Sort the similarities for example x from most similar to least.
- Give the most similar example similarity 1 and the least similar example similarity -1, and interpolate the remaining examples in between.
VERY IMPORTANT: The ranked similarity may not be symmetric, which is a big difference from kernels.
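A small sketch of this rank-based rescaling (illustrative names; note that the output matrix is generally not symmetric):

```python
# Replace each row of a raw similarity matrix by rank-scaled values in [-1, 1] (sketch).
import numpy as np

def ranked_similarity(S):
    """S: (n, n) raw similarities; returns the rank-scaled, possibly asymmetric version."""
    n = S.shape[0]
    R = np.empty_like(S, dtype=float)
    targets = np.linspace(1.0, -1.0, n)          # most similar -> +1, least similar -> -1
    for i in range(n):
        order = np.argsort(-S[i])                # indices from most to least similar to x_i
        R[i, order] = targets
    return R
```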
53 Evaluating a similarity function
K is a strongly (ε, γ)-good similarity function for a learning problem P if at least a (1 - ε) probability mass of examples x satisfy:
Ex'~P[K(x,x') | l(x') = l(x)] ≥ Ex'~P[K(x,x') | l(x') ≠ l(x)] + γ
For a particular similarity function and dataset we can compute the margin γ for each example and then plot the examples by decreasing margin. If the margin is large for most examples, this is an indication that the similarity function may perform well on that dataset.
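A sketch of computing these per-example margins on a labeled sample, following the definition above (names are illustrative):

```python
# Per-example margin: average same-class similarity minus average other-class similarity (sketch).
import numpy as np

def goodness_margins(S, y):
    y = np.asarray(y)
    margins = np.empty(len(y))
    for i in range(len(y)):
        same = (y == y[i]); same[i] = False        # exclude the example itself
        diff = (y != y[i])
        margins[i] = S[i, same].mean() - S[i, diff].mean()
    return np.sort(margins)[::-1]                  # sorted by decreasing margin, for plotting
```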
54Compatibility of the naïve similarity function on
Digits1
55Compatibility of the ranked similarity function
on Digits1
56Experimental Results
We'll look at some experimental results on both real and synthetic datasets.
57Synthetic Data Circle
58Experimental Results Circle
59Synthetic Data Blobs and Lines
Can we create a data set that needs BOTH the original and the new features to do well?
To answer this we create a data set we will call Blobs and Lines.
We generate the data in the following way (see the sketch after this list):
1. We select k points to be the centers of our blobs and assign them labels in {-1, 1}.
2. We flip a coin.
3. If heads, then we set x to be a random boolean vector of dimension d and set the label to be the first coordinate of x.
4. If tails, we pick one of the centers, flip r of its bits, set x equal to that, and set the label to the label of the center.
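A sketch of this generator; the particular values of n, d, k and r are arbitrary choices for illustration:

```python
# Generate the "Blobs and Lines" synthetic data described above (sketch).
import numpy as np

def blobs_and_lines(n=500, d=20, k=4, r=2, seed=0):
    rng = np.random.default_rng(seed)
    centers = rng.integers(0, 2, size=(k, d))           # k boolean blob centers
    center_labels = rng.choice([-1, 1], size=k)
    X, y = [], []
    for _ in range(n):
        if rng.random() < 0.5:                           # "heads": a linearly separable point
            x = rng.integers(0, 2, size=d)
            label = 1 if x[0] == 1 else -1               # label = first coordinate (mapped to +/-1)
        else:                                            # "tails": a noisy copy of a blob center
            c = rng.integers(0, k)
            x = centers[c].copy()
            flip = rng.choice(d, size=r, replace=False)
            x[flip] = 1 - x[flip]
            label = center_labels[c]
        X.append(x); y.append(label)
    return np.array(X), np.array(y)
```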
60Synthetic Data Blobs and Lines
61Experimental Results Blobs and Lines
62Experimental Results Real Data
Dataset n d nl Winnow SVM NN SIM WinnowSVM
Congress 435 16 100 93.79 94.93 90.8 90.90 92.24
Webmaster 582 1406 100 81.97 71.78 72.5 69.90 81.20
Credit 653 46 100 78.50 55.52 61.5 59.10 77.36
Wisc 683 89 100 95.03 94.51 95.3 93.65 94.49
Digit1 1500 241 100 73.26 88.79 94.0 94.21 91.31
USPS 1500 241 100 71.85 74.21 92.0 86.72 88.57
63Experimental Results Concatenation
What if we did something halfway between synthetic and real, by concatenating two different datasets? This can be viewed as simulating a dataset that has two different kinds of data.
We concatenated the datasets by padding each of them with a block of ZEROS:
[ Credit (653 × 46)   Padding (653 × 241) ]
[ Padding (653 × 46)  Digit1 (653 × 241) ]
Dataset n d nl Winnow SVM NN SIM WinnowSVM
Credit+Digit1 1306 287 100 72.41 51.74 75.46 74.25 83.95
64Conclusions
Generic similarity functions have a lot of potential to be applied to practical applications.
By combining feature based and graph based methods we can often get the best of both worlds.
FUTURE WORK
- Designing similarity functions suited to particular domains.
- Theoretically provable guarantees on the quality of a similarity function.
65QUESTIONS?
66Back Up Slides
67References
- A. Blum, J. Lafferty, M.R. Rwebangira, R. Reddy. Semi-supervised Learning Using Randomized Mincuts. ICML 2004.
68My Work
Techniques for improving graph mincut algorithms
for semi-supervised classification
Techniques for extending Local Linear Regression
to the semi-supervised setting
- Practical techniques for using unlabeled data and generic similarity functions to kernelize the Winnow algorithm.
69There may be several minimum cuts in the graph.
Indeed, there are potentially exponentially many
minimum cuts in the graph.
70Real Data CO2
Carbon dioxide concentration in the atmosphere over the last two centuries.
Source: World Watch Institute
71Experimental Results CO2
Local Linear Regression, MSE 144
72Experimental Results CO2
Weighted Kernel Regression, MSE 660
73 Experimental Results CO2
LLSR, MSE 97.4
74 Winnow
A linear separator algorithm, first proposed by Littlestone.
We are particularly interested in Winnow because:
1. It is known to be able to effectively learn in the presence of irrelevant attributes. Since we will be creating many new features, we expect many of them will be irrelevant.
2. It is fast and does not require a lot of memory. Since we hope to use large amounts of unlabeled data, scalability is an important consideration.
75 Gaussian Fields (Zhu, Ghahramani & Lafferty)
This algorithm minimizes the following functional:
Φ(f) = Σij wij (fi - fj)²
where wij is the similarity between examples i and j, and fi and fj are the predictions for examples i and j.
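For reference, a sketch of the standard closed-form minimizer of this functional with the labeled predictions clamped (the harmonic solution fu = Luu⁻¹ Wul yl); variable names are illustrative:

```python
# Harmonic solution of the Gaussian Fields objective on the unlabeled points (sketch).
import numpy as np

def gaussian_fields(W, y_labeled, labeled_idx, unlabeled_idx):
    """W: (n, n) similarity matrix; returns predictions on the unlabeled points."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                            # graph Laplacian
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    # f_u = L_uu^{-1} W_ul y_l minimizes sum_ij w_ij (f_i - f_j)^2 with f_l fixed
    return np.linalg.solve(L_uu, W_ul @ y_labeled)
```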
76Locally Constant (Kernel regression)
77Locally Linear
y
x
78 Local Linear Regression
This algorithm minimizes the following functional:
Φ(β) = Σi wi (yi - βTXxi)²
where wi is the similarity between example i and the query point x, and β is the coefficient vector of the local linear fit at x.
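A sketch of making a prediction with this objective at a single query point x, using a Gaussian weighting kernel as an illustrative choice of wi (the bandwidth and ridge term are assumptions for the sketch):

```python
# Local linear regression at a query point x via weighted least squares (sketch).
import numpy as np

def local_linear_predict(X, y, x, bandwidth=1.0):
    diffs = X - x                                        # centered design: X_xi = [1, x_i - x]
    A = np.hstack([np.ones((len(X), 1)), diffs])
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * bandwidth ** 2))   # Gaussian weights
    # weighted least squares: minimize sum_i w_i (y_i - beta^T X_xi)^2
    WA = A * w[:, None]
    beta = np.linalg.solve(A.T @ WA + 1e-8 * np.eye(A.shape[1]), A.T @ (w * y))
    return beta[0]                                       # intercept = prediction at x
```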
79 PROPOSED WORK: Improving Running Time
Sparsification: Ignore examples which are far away so as to get a sparser matrix to invert.
Iterative methods for solving linear systems: For a matrix equation Ax = b, we can obtain successive approximations x1, x2, ..., xk. This can be significantly faster if the matrix A is sparse.
80 PROPOSED WORK: Improving Running Time
Power series: Use the identity (I - A)⁻¹ = I + A + A² + A³ + ...
y = (Q + γΔ)⁻¹Py = Q⁻¹Py + (-γQ⁻¹Δ)Q⁻¹Py + (-γQ⁻¹Δ)²Q⁻¹Py + ...
A few terms may be sufficient to get a good approximation.
Compute the supervised answer first, then smooth the answer to get the semi-supervised solution. This can be combined with iterative methods, as we can use the supervised solution as the starting point for our iterative algorithm.
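A sketch of the power-series idea in its simplest form, approximating (I - A)⁻¹b with a few matrix-vector products (valid when the series converges, e.g. when the norm of A is below 1; names are illustrative):

```python
# Neumann (power) series approximation of (I - A)^{-1} b (sketch).
import numpy as np

def neumann_solve(A, b, n_terms=10):
    """Approximate x = (I - A)^{-1} b as b + A b + A^2 b + ..."""
    x = b.copy()
    term = b.copy()
    for _ in range(n_terms - 1):
        term = A @ term            # next power-series term A^k b
        x += term
    return x
```

Each additional term costs only one matrix-vector product, so a sparse A keeps the whole approximation cheap.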
81PROPOSED WORK Experimental Evaluation
Comparison against other proposed semi-supervised
regression algorithms.
Evaluation on a large variety of data sets,
especially high dimensional ones.
82PROPOSED WORK
Overall goal Investigate the practical
applicability of this theory and find out what is
needed to make it work on real problems.
Two main application areas
1. Domains which have expert defined similarity
functions that are not kernels (protein
homology).
2. Domains which have many irrelevant features
and in which the data may not be linearly
separable in the original features (text
classification).
83PROPOSED WORK Protein Homology
The Smith-Waterman score is the best performing
measure of similarity but it does not satisfy the
kernel properties.
Machine learning applications have either used other similarity functions or tried to force the SW score into a kernel.
Can we achieve better performance by using SW
score directly?
84PROPOSED WORK Text Classification
The most popular technique is Bag-of-Words (BOW), where each document is converted into a vector and each position in the vector indicates how many times each word occurred.
The vectors tend to be sparse and there will be many irrelevant features, hence this is well suited to the Winnow algorithm. Our approach makes the Winnow algorithm more powerful.
Within this framework we have strong motivation for investigating domain specific similarity functions, e.g. edit distance between documents instead of cosine similarity.
Can we achieve better performance than current
techniques using domain specific similarity
functions?
85PROPOSED WORK Domain Specific Similarity
Functions
As mentioned in the previous two slides, designing specific similarity functions for each domain is well motivated in this approach.
What are the best practice principles for
designing domain specific similarity functions?
In what circumstances are domain specific
similarity functions likely to be most useful?
We will answer these questions by generalizing
from several different datasets and
systematically noting what seems to work best.
86Proposed Work and Time Line
Summer 2007 Speeding up LLSR Learning with similarity in protein homology and text classification domain.
Fall 2007 Comparison of LLSR with other semi-supervised regression algs. Investigate principles of domain specific similarity functions.
Spring 2008 Start Writing Thesis
Summer 2008 Finish Writing Thesis
87Kernels
K(x,y) = Φ(x)·Φ(y)
Allows us to implicitly project non-linearly separable data into a high dimensional space where a linear separator can be found.
A kernel must satisfy strict mathematical conditions:
1. Continuous
2. Symmetric
3. Positive semi-definite
88Generic similarity Functions
What if the best similarity function in a given
domain does not satisfy the properties of a
kernel?
Two options
1. Use a kernel with inferior performance
2. Try to coerce the similarity function into a
kernel by building a kernel that has similar
behavior.
There is another way