1
Techniques For Exploiting Unlabeled Data
Thesis Proposal
  • Mugizi Rwebangira

May 11, 2007
Committee: Avrim Blum, CMU (Co-Chair); John Lafferty, CMU (Co-Chair); William Cohen, CMU; Xiaojin (Jerry) Zhu, Wisconsin
2
Motivation
Supervised Machine Learning
Labeled Examples (x_i, y_i) → [induction] → Model: x → y
Problems: document classification, image classification, protein structure determination.
Algorithms: SVM, Neural Nets, Decision Trees, etc.
3
Motivation
  • In recent years, there has been growing interest
    in techniques for using unlabeled data

• More data is being collected than ever before.
• Labeling examples can be expensive and/or require human intervention.
4
Examples
• Images: abundantly available (digital cameras); labeling requires humans (CAPTCHAs).
• Web pages: can be easily crawled on the web; labeling requires human intervention.
• Proteins: sequence can be easily determined; structure determination is a hard problem.

5
Motivation
Semi-Supervised Machine Learning
Labeled Examples (x_i, y_i) + Unlabeled Examples x_i → [induction] → Model: x → y
6
Motivation
7
However
• Techniques are not as well developed as supervised techniques.
• Best practices for using unlabeled data are not well established.
• Techniques for adapting supervised algorithms to semi-supervised algorithms are needed.
8
Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Proposed Work and Time Line
9
Graph Mincut (Blum & Chawla, 2001)
10
Construct an (unweighted) Graph
11
Add auxiliary super-nodes
12
Obtain s-t mincut
[Figure: the s-t mincut separating the graph into a + region and a − region]
13
Classification
14
  • Problem

Plain mincut can give very unbalanced cuts.
15
• Solution
Add random weights to the edges.
Run plain mincut and obtain a classification.
Repeat the above process several times.
For each unlabeled example, take a majority vote. (A minimal sketch follows below.)
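The loop could look roughly like the following, assuming a networkx graph whose labeled examples are already connected to the auxiliary super-nodes (named 'v+' and 'v-' here); the node names, noise scale, and number of rounds are illustrative assumptions, not the settings used in the experiments.

```python
import random
import networkx as nx

def randomized_mincut(G, pos_node='v+', neg_node='v-', n_rounds=20, noise=0.5):
    # Majority-vote labels over repeated mincuts on randomly perturbed weights.
    votes = {u: 0 for u in G.nodes if u not in (pos_node, neg_node)}
    for _ in range(n_rounds):
        H = G.copy()
        for u, v in H.edges:
            # Perturb each edge capacity with a small random weight.
            H[u][v]['capacity'] = H[u][v].get('capacity', 1.0) + noise * random.random()
        _, (pos_side, _neg_side) = nx.minimum_cut(H, pos_node, neg_node)
        for u in votes:
            votes[u] += 1 if u in pos_side else -1
    # Sign of the vote is the predicted label; ties broken toward +.
    return {u: (1 if c >= 0 else -1) for u, c in votes.items()}
```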
16
Before adding random weights

-
Mincut
17
After adding random weights

-
Mincut
18
  • PAC-Bayes
  • PAC-Bayes bounds suggest that when the graph has many small cuts consistent with the labeling, randomization should improve generalization performance.
  • In this case each distinct cut corresponds to a
    different hypothesis.
  • Hence the average of these cuts will be less
    likely to overfit than any single cut.

19
  • Markov Random Fields
  • Ideally we would like to assign a weight to each
    cut in the graph (a higher weight to small cuts)
    and then take a weighted vote over all the cuts
    in the graph.
  • This corresponds to a Markov Random Field model.
  • We don't know how to do this efficiently, but we can view randomized mincuts as an approximation.

20
  • How to construct the graph?
  • k-NN: the graph may not have small balanced cuts; how do we learn k?
  • Connect all points within distance d: can have disconnected components; how do we learn d?
  • Minimum Spanning Tree: no parameters to learn; gives a connected, sparse graph; seems to work well on most datasets (see the sketch after this list).
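As a concrete illustration, here is one plausible way to build the MST graph from raw feature vectors; the use of SciPy and dense Euclidean distances is an assumption made for this sketch, not a statement about the thesis implementation.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_graph(X):
    # Pairwise Euclidean distances over labeled + unlabeled points.
    D = squareform(pdist(X))
    # Edges of the minimum spanning tree: connected, sparse, parameter-free.
    T = minimum_spanning_tree(D)
    rows, cols = T.nonzero()
    return list(zip(rows.tolist(), cols.tolist()))
```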

21
Experiments
  • ONE vs. TWO: 1128 examples (8 × 8 arrays of integers, Euclidean distance).
  • ODD vs. EVEN: 4000 examples (16 × 16 arrays of integers, Euclidean distance).
  • PC vs. MAC: 1943 examples (20 Newsgroups dataset, TFIDF distance).

22
ONE vs. TWO
23
ODD vs. EVEN
24
PC vs. MAC
25
Summary
Randomization helps plain mincut achieve performance comparable to Gaussian Fields.
We can apply PAC sample complexity analysis and interpret it in terms of Markov Random Fields.
There is an intuitive interpretation of the confidence of a prediction in terms of the margin of the vote.
  • Semi-supervised Learning Using Randomized Mincuts,
  • A. Blum, J. Lafferty, M.R. Rwebangira, R. Reddy,
  • ICML 2004

26
Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Proposed Work and Time Line
27
Gaussian Fields (Zhu, Ghahramani & Lafferty)
This algorithm minimizes the following functional:
Φ(f) = Σ_ij w_ij (f_i − f_j)^2
where w_ij is the similarity between examples i and j, and f_i and f_j are the predictions for examples i and j.
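For intuition, the minimizer of this functional with the labeled values clamped has the well-known harmonic closed form f_u = (L_uu)^{-1} W_ul f_l, with L = D − W. The sketch below assumes a dense similarity matrix W and index arrays for the labeled/unlabeled split.

```python
import numpy as np

def gaussian_fields(W, labeled_idx, unlabeled_idx, f_l):
    # Graph Laplacian L = D - W, with D the diagonal degree matrix.
    D = np.diag(W.sum(axis=1))
    L = D - W
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    # Harmonic solution: predictions on unlabeled points given clamped labels.
    return np.linalg.solve(L_uu, W_ul @ f_l)
```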
28
Locally Constant (Kernel regression)
[Figure: locally constant (kernel regression) fit of y against x]
29
Locally Linear
[Figure: locally linear fit of y against x]
30
Local Linear Regression
This algorithm minimizes the following functional:
Φ(β) = Σ_i w_i (y_i − β^T X_xi)^2
where w_i is the similarity between example x_i and the query point x, and β is the coefficient vector of the local linear fit at x.
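A minimal sketch of one such weighted least-squares fit, assuming Gaussian weights with bandwidth h and the centered design X_xi = [1, x_i − x] (both assumptions, chosen so that the intercept equals the prediction at x):

```python
import numpy as np

def local_linear_predict(X, y, x, h=1.0):
    # Gaussian similarity w_i between each x_i and the query point x.
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))
    # Centered design: column of ones plus (x_i - x), so beta[0] = f(x).
    Xa = np.hstack([np.ones((len(X), 1)), X - x])
    A = Xa.T @ (w[:, None] * Xa)
    b = Xa.T @ (w * y)
    beta = np.linalg.solve(A, b)
    return beta[0]
```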
31
Problem
Develop a local linear version of Gaussian Fields, or equivalently a semi-supervised version of Local Linear Regression:
Local Linear Semi-supervised Regression (LLSR)
32
Local Linear Semi-supervised Regression
[Figure: local fits β_i at x_i and β_j at x_j; the term (β_i0 − X_ji^T β_j)^2 penalizes disagreement between the local value β_i0 at x_i and the fit at x_j extrapolated to x_i]
33
Local Linear Semi-supervised Regression
This algorithm minimizes the following functional:
Φ(β) = Σ_ij w_ij (β_i0 − X_ji^T β_j)^2
where w_ij is the similarity between x_i and x_j.
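A direct (if slow) evaluation of this coupling objective can be sketched as follows, assuming X_ji = [1, x_i − x_j] and that row i of beta stores [β_i0, slope_i]; these conventions are assumptions made for illustration.

```python
import numpy as np

def llsr_objective(X, beta, W):
    # beta: (n, d+1) array of local fits; W: (n, n) similarity matrix.
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(n):
            X_ji = np.concatenate(([1.0], X[i] - X[j]))
            # Fit at x_j, extrapolated to x_i, compared with value at x_i.
            total += W[i, j] * (beta[i, 0] - X_ji @ beta[j]) ** 2
    return total
```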
34
Synthetic Data: Doppler
Doppler function: y = (1/x) sin(15/x)
σ^2 = 0.1 (noise)
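A sketch of how such data could be generated; the sampling range of x and the sample size are assumptions, since the slide specifies only the function and the noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=500)   # assumed sampling range and size
# y = (1/x) sin(15/x) plus Gaussian noise with variance 0.1.
y = (1.0 / x) * np.sin(15.0 / x) + rng.normal(0.0, np.sqrt(0.1), size=x.shape)
```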
35
Experimental Results: Doppler
Weighted Kernel Regression: LOOCV MSE = 6.54, MSE = 25.7
36
Experimental Results: Doppler
Local Linear Regression: LOOCV MSE = 80.8, MSE = 14.4
37
Experimental Results: Doppler
LLSR: LOOCV MSE = 2.00, MSE = 7.99
38
PROBLEM: RUNNING TIME
If the number of examples is n and the dimension of the examples is d, then we have to invert an n(d+1) × n(d+1) matrix.
This is prohibitively expensive, especially if d is large.
39
PROPOSED WORK: Improving Running Time
Sparsification: ignore examples which are far away, so as to get a sparser matrix to invert.
Iterative methods for solving linear systems: for a matrix equation Ax = b, we can obtain successive approximations x_1, x_2, …, x_k. This can be significantly faster if the matrix A is sparse.
40
PROPOSED WORK: Improving Running Time
Power series: use the identity (I − A)^{-1} = I + A + A^2 + A^3 + …
ŷ = (Q + λΔ)^{-1} P y = Q^{-1} P y + (−λ Q^{-1} Δ) Q^{-1} P y + (−λ Q^{-1} Δ)^2 Q^{-1} P y + …
A few terms may be sufficient to get a good approximation.
Compute the supervised answer first, then smooth the answer to get the semi-supervised solution. This can be combined with iterative methods, as we can use the supervised solution as the starting point for our iterative algorithm.
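A sketch of the truncated power series under the naming above; the split of the system into Q, Δ, P and the scale λ are reconstructed from the slide's formula, so the exact roles of these matrices should be treated as assumptions.

```python
import numpy as np

def power_series_solve(Q, Delta, P, y, lam=1.0, n_terms=5):
    # Start from the supervised answer Q^{-1} P y ...
    term = np.linalg.solve(Q, P @ y)
    total = term.copy()
    # ... then repeatedly apply the correction A = -lam * Q^{-1} Delta.
    for _ in range(n_terms - 1):
        term = -lam * np.linalg.solve(Q, Delta @ term)
        total += term
    return total
```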
41
PROPOSED WORK Experimental Evaluation
Comparison against other proposed semi-supervised
regression algorithms.
Evaluation on a large variety of datasets, especially high-dimensional ones.
42
Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Proposed Work and Time Line
43
Kernels
K(x,y) = Φ(x)·Φ(y)
Kernels allow us to implicitly project non-linearly separable data into a high-dimensional space where a linear separator can be found.
A kernel must satisfy strict mathematical conditions:
1. Continuous
2. Symmetric
3. Positive semi-definite
44
Generic Similarity Functions
What if the best similarity function in a given domain does not satisfy the properties of a kernel?
Two options:
1. Use a kernel with inferior performance.
2. Try to coerce the similarity function into a kernel by building a kernel that has similar behavior.
There is another way…
45
The Balcan-Blum approach
Recently, Balcan and Blum initiated the theory of learning with generic similarity functions.
They gave a general definition of a good similarity function for learning and showed that the popular large-margin kernels are a special case of their definition.
They also gave an algorithm for learning with good similarity functions.
Their approach makes use of unlabeled data.
46
The Balcan-Blum approach
The algorithm is very simple. Suppose S(x,y) is our similarity function. Then:
1. Draw d examples x_1, x_2, …, x_d uniformly at random from the data set.
2. For each example x, compute the mapping x → (S(x,x_1), S(x,x_2), …, S(x,x_d)).
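A minimal sketch of this mapping, assuming S is any Python callable returning a similarity score; the landmark count d and the seed are illustrative.

```python
import random
import numpy as np

def similarity_features(data, S, d=100, seed=0):
    rng = random.Random(seed)
    # Step 1: draw d landmark examples uniformly at random from the data.
    landmarks = rng.sample(list(data), d)
    # Step 2: represent each x by its similarities to the landmarks.
    return np.array([[S(x, l) for l in landmarks] for x in data])
```

A standard linear learner (e.g. Winnow, discussed later) can then be trained on these features.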
47
Synthetic Data: Circle
48
Experimental Results: Circle
49
PROPOSED WORK
Overall goal: investigate the practical applicability of this theory and find out what is needed to make it work on real problems.
Two main application areas:
1. Domains which have expert-defined similarity functions that are not kernels (protein homology).
2. Domains which have many irrelevant features and in which the data may not be linearly separable in the original features (text classification).
50
PROPOSED WORK: Protein Homology
The Smith-Waterman score is the best-performing measure of similarity, but it does not satisfy the kernel properties.
Machine learning applications have either used other similarity functions or tried to force the SW score into a kernel.
Can we achieve better performance by using the SW score directly?
51
PROPOSED WORK: Text Classification
The most popular technique is Bag-of-Words (BOW), where each document is converted into a vector and each position in the vector indicates how many times each word occurred.
The vectors tend to be sparse and there will be many irrelevant features, hence this is well suited to the Winnow algorithm. Our approach makes the Winnow algorithm more powerful.
Within this framework we have strong motivation for investigating domain-specific similarity functions, e.g. edit distance between documents instead of cosine similarity.
Can we achieve better performance than current techniques using domain-specific similarity functions?
52
PROPOSED WORK: Domain-Specific Similarity Functions
As mentioned in the previous two slides, designing specific similarity functions for each domain is well motivated in this approach.
What are the 'best practice' principles for designing domain-specific similarity functions?
In what circumstances are domain-specific similarity functions likely to be most useful?
We will answer these questions by generalizing from several different datasets and systematically noting what seems to work best.
53
Proposed Work and Time Line
Summer 2007: Speed up LLSR; learning with similarity functions in the protein homology and text classification domains.
Fall 2007: Comparison of LLSR with other semi-supervised regression algorithms; investigate principles of domain-specific similarity functions.
Spring 2008: Start writing thesis.
Summer 2008: Finish writing thesis.
54
Back Up Slides
55
References
  • A. Blum, J. Lafferty, M.R. Rwebangira, R. Reddy. Semi-supervised Learning Using Randomized Mincuts. ICML 2004.

56
My Work
Techniques for improving graph mincut algorithms for semi-supervised classification.
Techniques for extending Local Linear Regression to the semi-supervised setting.
  • Practical techniques for using unlabeled data and generic similarity functions to kernelize the Winnow algorithm.

57
  • Problem

There may be several minimum cuts in the graph; indeed, there are potentially exponentially many.
58
Real Data: CO2
Carbon dioxide concentration in the atmosphere over the last two centuries.
Source: World Watch Institute
59
Experimental Results: CO2
Weighted Kernel Regression: MSE = 660
60
Experimental Results: CO2
Local Linear Regression: MSE = 144
61
Experimental Results: CO2
LLSR: MSE = 97.4
62
Winnow
A linear separator algorithm, first proposed by Littlestone.
We are particularly interested in Winnow because:
1. It is known to be able to learn effectively in the presence of irrelevant attributes. Since we will be creating many new features, we expect many of them will be irrelevant.
2. It is fast and does not require a lot of memory. Since we hope to use large amounts of unlabeled data, scalability is an important consideration.
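A minimal sketch of Winnow's multiplicative update for {0,1} features; the promotion factor α = 2 and threshold = n are conventional choices, assumed here rather than taken from the thesis.

```python
import numpy as np

def winnow_train(X, y, alpha=2.0, epochs=1):
    # X: (m, n) array with entries in {0, 1}; y: labels in {0, 1}.
    n = X.shape[1]
    w = np.ones(n)
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = 1 if w @ x >= n else 0
            if pred != label:
                # Promote (label=1) or demote (label=0) active features only.
                w *= alpha ** ((label - pred) * x)
    return w
```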
63
Synthetic Data: Blobs and Lines
Can we create a data set that needs BOTH the original and the new features to do well? To answer this, we create a data set we call Blobs and Lines.
We generate the data in the following way:
1. We select k points to be the centers of our blobs and assign them labels in {−1, +1}.
2. We flip a coin.
3. If heads, then we set x to be a random boolean vector of dimension d and set the label to be the first coordinate of x.
4. If tails, we pick one of the centers, flip r bits, set x equal to that, and set the label to the label of the center.
A generator following these steps is sketched below.
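One plausible implementation of the recipe above; the values of n, k, d, r and the mapping of the first coordinate to a ±1 label are assumptions, since the slide does not fix them.

```python
import numpy as np

def blobs_and_lines(n=500, k=4, d=20, r=2, seed=0):
    rng = np.random.default_rng(seed)
    centers = rng.integers(0, 2, size=(k, d))    # blob centers
    center_labels = rng.choice([-1, 1], size=k)  # labels in {-1, +1}
    X, y = [], []
    for _ in range(n):
        if rng.random() < 0.5:
            # Heads: random boolean vector, label from its first coordinate.
            x = rng.integers(0, 2, size=d)
            X.append(x)
            y.append(1 if x[0] == 1 else -1)
        else:
            # Tails: a blob center with r bits flipped, center's label.
            c = rng.integers(k)
            x = centers[c].copy()
            flip = rng.choice(d, size=r, replace=False)
            x[flip] ^= 1
            X.append(x)
            y.append(int(center_labels[c]))
    return np.array(X), np.array(y)
```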
64
Synthetic Data: Blobs and Lines
[Figure: sample of the Blobs and Lines data, showing clusters of + and − points (the blobs) alongside linearly labeled + and − points (the lines)]
65
Experimental Results: Blobs and Lines