Title: Inductive Transfer via Embedding into a Common Feature Space
1. Inductive Transfer via Embedding into a Common Feature Space
- Shai Ben-David
- School of Computer Science
- University of Waterloo
2. Background
- In April at Snowbird I heard John Blitzer talk about his work with Crammer and Pereira on transferring Part-Of-Speech (POS) tagging from one domain to another.
- It clicked with my interest in investigating the theoretical foundations of practically successful learning heuristics.
3. Inductive transfer for POS tagging
- POS tagging is a common preprocessing step in NLP systems.
- Can an automatic POS tagger be trained on one domain (say, legal documents) and then be used to tag a different domain (say, biomedical abstracts)?
- The relationship between Inductive Transfer and Multi-Task Learning is worth discussing; I'll postpone that to the end of my talk.
4. Structural Correspondence Learning (Blitzer, McDonald, Pereira)
- Choose a set of pivot words (determiners, prepositions, connectors and frequently occurring verbs).
- Represent every word in a text as a vector of its correlations with each of the pivot words.
- Train a linear separator on the images of the training data coming from one domain and use it for tagging on the other. (A simplified sketch of this pipeline appears below.)
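
As a rough illustration only, the sketch below mimics the pivot-based pipeline described on this slide, using co-occurrence counts with the pivot words as a crude stand-in for correlation statistics; it omits the pivot-predictor/SVD stage of full SCL. The pivot list, context window, toy sentences, binary tags, and the scikit-learn classifier are all invented for the example.

```python
# Simplified sketch of the pivot-based representation described above; NOT the
# full SCL algorithm (no pivot predictors, no SVD). All concrete choices here
# (pivots, window size, toy data, classifier) are illustrative assumptions.
from collections import Counter

import numpy as np
from sklearn.linear_model import LogisticRegression

PIVOTS = ["the", "a", "of", "in", "and", "to", "is"]   # assumed pivot words
WINDOW = 2                                             # co-occurrence window


def pivot_representation(tokens):
    """Map each token to its vector of co-occurrence counts with the pivots."""
    vectors = []
    for i in range(len(tokens)):
        context = tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]
        counts = Counter(context)
        vectors.append([counts[p] for p in PIVOTS])
    return np.asarray(vectors, dtype=float)


# Train a linear tagger on the source domain, then apply it to the target
# domain. The "tags" are a toy binary distinction (e.g. noun vs. not-noun).
src_tokens = ["the", "court", "ruled", "on", "the", "appeal"]
src_tags = [0, 1, 0, 0, 0, 1]
tgt_tokens = ["the", "protein", "binds", "to", "the", "receptor"]

tagger = LogisticRegression().fit(pivot_representation(src_tokens), src_tags)
print(tagger.predict(pivot_representation(tgt_tokens)))
```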
5. Another potential application
- Spam detection.
- Consider the scenario where a spam filter is trained on the mail of in-house users, but then has to succeed on the mail of customers.
6. Why does it work?
- The SCL representation seems to enjoy two properties:
- It is sufficiently rich to allow tagging.
- The images of the two domains look similar under this representation.
7. A formal framework
- Our Setting:
- There is a fixed (unique) domain set X.
- There is a unique probabilistic labeling function f : X → [0,1].
- Two tasks, S and T, induce two different probability distributions, D_S and D_T, over X.
- We are given a training sample drawn by D_S and labeled by Prob(l(x) = 1) = f(x).
- The test data is generated by D_T.
8. Our main Inductive Transfer Tool
- We propose to embed the original attribute space into some feature space in which the two tasks look similar.
- Then, treat the images of points from both domains as if they are coming from a single domain.
9. The Common-Feature-Space Idea
10. Some more notation
- We call a function R : X → Z a representation, and its range, Z, a feature space.
- Let D_S^R and D_T^R denote the probability distributions induced by D_S and D_T via R.
- Let f_S^R and f_T^R be the corresponding images of f.
- E.g., f_S^R(x) = E_S[ f(y) | R(y) = R(x) ]. (A toy example follows.)
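
A toy, finite-domain illustration of this notation; the domain X, distribution D_S, labeling function f and representation R below are invented purely to show how D_S^R and f_S^R are computed.

```python
# Toy illustration of the induced distribution D_S^R and induced labeling
# f_S^R on a small finite domain. All values below are made up.
from collections import defaultdict

X = ["x1", "x2", "x3", "x4"]
D_S = {"x1": 0.4, "x2": 0.3, "x3": 0.2, "x4": 0.1}      # source distribution
f = {"x1": 0.9, "x2": 0.8, "x3": 0.1, "x4": 0.2}        # Prob(label(x) = 1)
R = {"x1": "z1", "x2": "z1", "x3": "z2", "x4": "z2"}    # representation X -> Z

# Induced distribution over the feature space: push D_S forward through R.
D_S_R = defaultdict(float)
for x in X:
    D_S_R[R[x]] += D_S[x]

def f_S_R(x):
    """f_S^R(x) = E_{y ~ D_S}[ f(y) | R(y) = R(x) ]."""
    bucket = [y for y in X if R[y] == R[x]]
    mass = sum(D_S[y] for y in bucket)
    return sum(D_S[y] * f[y] for y in bucket) / mass

print(dict(D_S_R))     # {'z1': 0.7, 'z2': 0.3} (up to float rounding)
print(f_S_R("x1"))     # (0.4*0.9 + 0.3*0.8) / 0.7 ≈ 0.857
```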
11. Requirements from such an embedding
- The embedding should retain labeling-relevant information.
- Namely, E_{D_S}[ |f_S^R − ½| ] should be as large as possible.
- (We'll refine this requirement later.)
12. Requirements from such an embedding
- The induced distributions D_S^R and D_T^R should be as similar as possible.
13. How should one measure similarity of distributions?
- Common measures:
- Total Variation: TV(D, D') = sup_{E measurable} |D(E) − D'(E)|.
- KL divergence.
- They are too sensitive for our needs and cannot be reliably estimated from finite samples [Batu et al. 2000].
14. A new distribution-similarity measure
- In VLDB 2004, Kifer, B-D and Gehrke introduced the A-similarity measure,
- d_A(D, D') = sup_{E ∈ A} |D(E) − D'(E)|.
- They prove that it can be reliably estimated from finite samples drawn i.i.d. via D and D' (respectively).
- The required sample sizes are a function of the VC-dim of A.
15. Estimating d_A from samples
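A minimal plug-in sketch of such an estimate: replace D and D' by the empirical distributions of the two samples and take the supremum over the collection A. The choice of A (one-sided threshold sets on the real line, a VC-dimension-1 class), the Gaussian samples and the threshold grid below are all illustrative assumptions, not the talk's actual construction.

```python
# Minimal plug-in estimate of d_A: substitute the empirical distributions of
# the two samples into the definition on slide 14 and take the sup over A.
# Here A is (arbitrarily, for illustration) the VC-dimension-1 class of
# threshold sets {x : x <= t} on the real line.
import numpy as np

def empirical_dA(sample_D, sample_Dprime, thresholds):
    sample_D = np.asarray(sample_D)
    sample_Dprime = np.asarray(sample_Dprime)
    gaps = [abs(np.mean(sample_D <= t) - np.mean(sample_Dprime <= t))
            for t in thresholds]
    return max(gaps)

rng = np.random.default_rng(0)
S = rng.normal(0.0, 1.0, size=500)   # i.i.d. sample from D
T = rng.normal(0.5, 1.0, size=500)   # i.i.d. sample from D'
print(empirical_dA(S, T, thresholds=np.linspace(-3.0, 3.0, 61)))
```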
16. A generalization bound
- Theorem:
- Let Z be any feature space, and R any representation of X into Z. Let H be a class of functions from Z to {0,1}.
- If a random labeled sample of size m is generated by applying R to a D_S-i.i.d. sample labeled according to f,
- then, with probability > (1 − δ), for every h in H the bound below holds,
- where d is the VC-dim of H.
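
For orientation only: the analogous theorem in the companion paper (Ben-David, Blitzer, Crammer and Pereira, "Analysis of Representations for Domain Adaptation", NIPS 2006) bounds the target error roughly as follows; the constants and exact form on the original slide may differ.

```latex
% Rough shape of the bound (following the companion NIPS 2006 paper);
% the original slide's exact formula may differ.
\epsilon_T(h) \;\le\; \hat{\epsilon}_S(h)
  \;+\; \sqrt{\frac{4}{m}\Bigl(d \log\frac{2em}{d} + \log\frac{4}{\delta}\Bigr)}
  \;+\; d_{\mathcal{H}}\!\bigl(D_S^R, D_T^R\bigr)
  \;+\; \lambda,
\qquad
\lambda \;\approx\; \min_{h' \in \mathcal{H}} \bigl(\epsilon_S(h') + \epsilon_T(h')\bigr).
```

Here ε_T and ε_S are the target and source errors, ε̂_S is the empirical source error on the m-point sample, d_H is the slide-14 distance (with A taken to be H) applied to the induced distributions, and λ is roughly the combined error of the best joint predictor in H.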
17. An algorithmic conclusion
- Assuming that:
- The (unlabeled) distributions induced by the Source and Target domains under the representation R are similar.
- There exists a predictor in H that works reasonably well for both distributions.
- Then the empirical risk minimizer over training data from the source domain is a good predictor for the target domain.
18. Learning good embeddings
- We can use a semi-supervised learning approach:
- Take care of the distribution-similarity requirement using unlabeled data from both tasks.
- Minimize the empirical error Er_S(h) using a labeled training sample from the source domain.
19. The resulting algorithm
- Fix a family R of potential embeddings.
- Fix a family H of predictors over the feature space(s).
- Draw a labeled training sample from the source domain.
- Search simultaneously over R and H to minimize the sum of the empirical error of h and d_H(D_S^R, D_T^R). (A rough sketch of this search appears below.)
- (We implicitly assume that this search does not have a significant effect on the value of λ.)
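
A minimal sketch of this joint search, under several assumptions not in the slides: a finite list of candidate embeddings, scikit-learn logistic regression standing in for the predictor class H, a domain-classifier score as a crude stand-in for d_H between the induced (unlabeled) distributions, and an arbitrary trade-off weight.

```python
# Minimal sketch of the search over embeddings R and predictors H described
# above: choose the pair minimizing (empirical source error) + (estimated
# distance between the induced unlabeled distributions). The candidate
# embeddings, predictor class, distance proxy and weight are all assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def induced_distance(Z_src, Z_tgt):
    """Crude proxy for d_H(D_S^R, D_T^R): train a source-vs-target classifier
    and map its accuracy to [0, 1] (0 = indistinguishable, 1 = separable)."""
    Z = np.vstack([Z_src, Z_tgt])
    domain = np.r_[np.zeros(len(Z_src)), np.ones(len(Z_tgt))]
    accuracy = LogisticRegression(max_iter=1000).fit(Z, domain).score(Z, domain)
    return 2.0 * abs(accuracy - 0.5)

def select_embedding_and_predictor(candidate_Rs, X_src, y_src, X_tgt, weight=1.0):
    best = None
    for R in candidate_Rs:                    # each R maps raw points to features
        Z_src, Z_tgt = R(X_src), R(X_tgt)
        h = LogisticRegression(max_iter=1000).fit(Z_src, y_src)
        objective = (1.0 - h.score(Z_src, y_src)          # empirical source error
                     + weight * induced_distance(Z_src, Z_tgt))
        if best is None or objective < best[0]:
            best = (objective, R, h)
    return best[1], best[2]                   # the chosen embedding and predictor
```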
20. A generalization bound for the semi-supervised approach
- We get a bound similar to our basic error bound, with two added terms:
- The first depends on VC-dim(H), based on K B-D G 2004, to account for estimating d_H(D_S^R, D_T^R) from (unlabeled) samples.
- The second term accounts for the search over the space R of potential embeddings, and depends on a pseudo-dimension of R.
21. Some experimental results
22. Comparison to a random projection
23. On the relationship between Inductive Transfer and Multi-Task Learning
- Inductive Transfer is concerned with learning to predict in one domain based on training data coming from another domain.
- Multi-Task Learning is concerned with utilizing training data from different domains to help prediction in all of them.
- Clearly, each of them implies the other.
- Can we formalize and quantify these implications?
24. Some other attempts of mine to theorize about practical heuristics
- Can kernels always provide large margins? (with Eiron and Simon, 2001)
- Can kernels be learned without over-fitting? (with Srebro, 2006)
- Does stability work for selecting clustering parameters? (with von Luxburg and Pal, 2006)