Title: Inductive Transfer via Embedding into a Common Feature Space
1. Inductive Transfer via Embedding into a Common Feature Space
- Shai Ben-David
- School of Computer Science
- University of Waterloo
2. Background
- In April at Snowbird I heard John Blitzer talk about his work with Crammer and Pereira on transferring Part-Of-Speech (POS) tagging from one domain to another.
- It clicked with my interest in investigating the theoretical foundations of practically successful learning heuristics.
3. Inductive transfer for POS tagging
- POS tagging is a common preprocessing step in NLP systems.
- Can an automatic POS tagger be trained on one domain (say, legal documents) and then be used to tag a different domain (say, biomedical abstracts)?
- The relationship between Inductive Transfer and Multi-Task Learning is worth discussing; I'll postpone that to the end of my talk.
4. Structural Correspondence Learning (Blitzer, McDonald, Pereira)
- Choose a set of pivot words (determiners, prepositions, connectors and frequently occurring verbs).
- Represent every word in a text as a vector of its correlations with each of the pivot words.
- Train a linear separator on the images of the training data coming from one domain and use it for tagging on the other. (A simplified sketch of this pipeline appears below.)
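
As a rough illustration only, the sketch below mimics the pivot-based pipeline described on this slide, using co-occurrence counts with the pivot words as a crude stand-in for correlation statistics; it omits the pivot-predictor/SVD stage of full SCL. The pivot list, context window, toy sentences, binary tags, and the scikit-learn classifier are all invented for the example.

```python
# Simplified sketch of the pivot-based representation described above; NOT the
# full SCL algorithm (no pivot predictors, no SVD). All concrete choices here
# (pivots, window size, toy data, classifier) are illustrative assumptions.
from collections import Counter

import numpy as np
from sklearn.linear_model import LogisticRegression

PIVOTS = ["the", "a", "of", "in", "and", "to", "is"]   # assumed pivot words
WINDOW = 2                                             # co-occurrence window


def pivot_representation(tokens):
    """Map each token to its vector of co-occurrence counts with the pivots."""
    vectors = []
    for i in range(len(tokens)):
        context = tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]
        counts = Counter(context)
        vectors.append([counts[p] for p in PIVOTS])
    return np.asarray(vectors, dtype=float)


# Train a linear tagger on the source domain, then apply it to the target
# domain. The "tags" are a toy binary distinction (e.g. noun vs. not-noun).
src_tokens = ["the", "court", "ruled", "on", "the", "appeal"]
src_tags = [0, 1, 0, 0, 0, 1]
tgt_tokens = ["the", "protein", "binds", "to", "the", "receptor"]

tagger = LogisticRegression().fit(pivot_representation(src_tokens), src_tags)
print(tagger.predict(pivot_representation(tgt_tokens)))
```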
5. Another potential application
- Spam detection.
- Consider the scenario where a spam filter is trained on the mail of in-house users, but then has to succeed on the mail of customers.
6. Why does it work?
- The SCL representation seems to enjoy two properties:
- It is sufficiently rich to allow tagging.
- The images of the two domains look similar under this representation.
7. A formal framework
- Our Setting:
- There is a fixed (unique) domain set X.
- There is a unique probabilistic labeling function f : X → [0,1].
- Two tasks, S and T, induce two different probability distributions, D_S and D_T, over X.
- We are given a training sample drawn by D_S and labeled by Prob(l(x) = 1) = f(x).
- The test data is generated by D_T.
8. Our main Inductive Transfer Tool
- We propose to embed the original attribute space into some feature space in which the two tasks look similar.
- Then, treat the images of points from both domains as if they are coming from a single domain.
9. The Common-Feature-Space Idea
10. Some more notation
- We call a function R : X → Z a representation, and its range, Z, a feature space.
- Let D_S^R and D_T^R denote the probability distributions induced by D_S and D_T via R.
- Let f_S^R and f_T^R be the corresponding images of f.
- E.g., f_S^R(x) = E_S[ f(y) | R(y) = R(x) ]. (A toy example follows.)
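
A toy, finite-domain illustration of this notation; the domain X, distribution D_S, labeling function f and representation R below are invented purely to show how D_S^R and f_S^R are computed.

```python
# Toy illustration of the induced distribution D_S^R and induced labeling
# f_S^R on a small finite domain. All values below are made up.
from collections import defaultdict

X = ["x1", "x2", "x3", "x4"]
D_S = {"x1": 0.4, "x2": 0.3, "x3": 0.2, "x4": 0.1}      # source distribution
f = {"x1": 0.9, "x2": 0.8, "x3": 0.1, "x4": 0.2}        # Prob(label(x) = 1)
R = {"x1": "z1", "x2": "z1", "x3": "z2", "x4": "z2"}    # representation X -> Z

# Induced distribution over the feature space: push D_S forward through R.
D_S_R = defaultdict(float)
for x in X:
    D_S_R[R[x]] += D_S[x]

def f_S_R(x):
    """f_S^R(x) = E_{y ~ D_S}[ f(y) | R(y) = R(x) ]."""
    bucket = [y for y in X if R[y] == R[x]]
    mass = sum(D_S[y] for y in bucket)
    return sum(D_S[y] * f[y] for y in bucket) / mass

print(dict(D_S_R))     # {'z1': 0.7, 'z2': 0.3} (up to float rounding)
print(f_S_R("x1"))     # (0.4*0.9 + 0.3*0.8) / 0.7 ≈ 0.857
```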
11. Requirements from such an embedding
- The embedding should retain labeling-relevant information.
- Namely, E_{D_S}[ |f_S^R − ½| ] should be as large as possible.
- (We'll refine this requirement later.)
12. Requirements from such an embedding
- The induced distributions D_S^R and D_T^R should be as similar as possible.
13. How should one measure similarity of distributions?
- Common measures:
- Total Variation: TV(D, D') = sup_{E measurable} |D(E) − D'(E)|.
- KL divergence.
- They are too sensitive for our needs and cannot be reliably estimated from finite samples [Batu et al. 2000].
14. A new distribution-similarity measure
- In VLDB 2004, Kifer, B-D and Gehrke introduced the A-similarity measure,
- d_A(D, D') = sup_{E ∈ A} |D(E) − D'(E)|.
- They prove that it can be reliably estimated from finite samples drawn i.i.d. via D and D' (respectively).
- The required sample sizes are a function of the VC-dim of A.
15. Estimating d_A from samples
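A minimal plug-in sketch of such an estimate: replace D and D' by the empirical distributions of the two samples and take the supremum over the collection A. The choice of A (one-sided threshold sets on the real line, a VC-dimension-1 class), the Gaussian samples and the threshold grid below are all illustrative assumptions, not the talk's actual construction.

```python
# Minimal plug-in estimate of d_A: substitute the empirical distributions of
# the two samples into the definition on slide 14 and take the sup over A.
# Here A is (arbitrarily, for illustration) the VC-dimension-1 class of
# threshold sets {x : x <= t} on the real line.
import numpy as np

def empirical_dA(sample_D, sample_Dprime, thresholds):
    sample_D = np.asarray(sample_D)
    sample_Dprime = np.asarray(sample_Dprime)
    gaps = [abs(np.mean(sample_D <= t) - np.mean(sample_Dprime <= t))
            for t in thresholds]
    return max(gaps)

rng = np.random.default_rng(0)
S = rng.normal(0.0, 1.0, size=500)   # i.i.d. sample from D
T = rng.normal(0.5, 1.0, size=500)   # i.i.d. sample from D'
print(empirical_dA(S, T, thresholds=np.linspace(-3.0, 3.0, 61)))
```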
16. A generalization bound
- Theorem:
- Let Z be any feature space, and R any representation of X into Z. Let H be a class of functions from Z to {0,1}.
- If a random labeled sample of size m is generated by applying R to a D_S-i.i.d. sample labeled according to f,
- then, with probability > (1 − δ), for every h in H the bound below holds,
- where d is the VC-dim of H.
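
For orientation only: the analogous theorem in the companion paper (Ben-David, Blitzer, Crammer and Pereira, "Analysis of Representations for Domain Adaptation", NIPS 2006) bounds the target error roughly as follows; the constants and exact form on the original slide may differ.

```latex
% Rough shape of the bound (following the companion NIPS 2006 paper);
% the original slide's exact formula may differ.
\epsilon_T(h) \;\le\; \hat{\epsilon}_S(h)
  \;+\; \sqrt{\frac{4}{m}\Bigl(d \log\frac{2em}{d} + \log\frac{4}{\delta}\Bigr)}
  \;+\; d_{\mathcal{H}}\!\bigl(D_S^R, D_T^R\bigr)
  \;+\; \lambda,
\qquad
\lambda \;\approx\; \min_{h' \in \mathcal{H}} \bigl(\epsilon_S(h') + \epsilon_T(h')\bigr).
```

Here ε_T and ε_S are the target and source errors, ε̂_S is the empirical source error on the m-point sample, d_H is the slide-14 distance (with A taken to be H) applied to the induced distributions, and λ is roughly the combined error of the best joint predictor in H.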
17. An algorithmic conclusion
- Assuming that:
- The (unlabeled) distributions induced by the Source and Target domains under the representation R are similar.
- There exists a predictor in H that works reasonably well for both distributions.
- Then the empirical risk minimizer over training data from the source domain is a good predictor for the target domain.
18. Learning good embeddings
- We can use a semi-supervised learning approach:
- Take care of the distribution-similarity requirement using unlabeled data from both tasks.
- Minimize the empirical error Er_S(h) using a labeled training sample from the source domain.
19. The resulting algorithm
- Fix a family R of potential embeddings.
- Fix a family H of predictors over the feature space(s).
- Draw a labeled training sample from the source domain.
- Search simultaneously over R and H to minimize the sum of the empirical error of h and d_H(D_S^R, D_T^R). (A rough sketch of this search appears below.)
- (We implicitly assume that this search does not have a significant effect on the value of λ.)
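
A minimal sketch of this joint search, under several assumptions not in the slides: a finite list of candidate embeddings, scikit-learn logistic regression standing in for the predictor class H, a domain-classifier score as a crude stand-in for d_H between the induced (unlabeled) distributions, and an arbitrary trade-off weight.

```python
# Minimal sketch of the search over embeddings R and predictors H described
# above: choose the pair minimizing (empirical source error) + (estimated
# distance between the induced unlabeled distributions). The candidate
# embeddings, predictor class, distance proxy and weight are all assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def induced_distance(Z_src, Z_tgt):
    """Crude proxy for d_H(D_S^R, D_T^R): train a source-vs-target classifier
    and map its accuracy to [0, 1] (0 = indistinguishable, 1 = separable)."""
    Z = np.vstack([Z_src, Z_tgt])
    domain = np.r_[np.zeros(len(Z_src)), np.ones(len(Z_tgt))]
    accuracy = LogisticRegression(max_iter=1000).fit(Z, domain).score(Z, domain)
    return 2.0 * abs(accuracy - 0.5)

def select_embedding_and_predictor(candidate_Rs, X_src, y_src, X_tgt, weight=1.0):
    best = None
    for R in candidate_Rs:                    # each R maps raw points to features
        Z_src, Z_tgt = R(X_src), R(X_tgt)
        h = LogisticRegression(max_iter=1000).fit(Z_src, y_src)
        objective = (1.0 - h.score(Z_src, y_src)          # empirical source error
                     + weight * induced_distance(Z_src, Z_tgt))
        if best is None or objective < best[0]:
            best = (objective, R, h)
    return best[1], best[2]                   # the chosen embedding and predictor
```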
20. A generalization bound for the semi-supervised approach
- We get a bound similar to our basic error bound, with two added terms:
- The first depends on VC-dim(H), based on K B-D G 2004, to account for estimating d_H(D_S^R, D_T^R) from (unlabeled) samples.
- The second term accounts for the search over the space R of potential embeddings, and depends on a pseudo-dimension of R.
21. Some experimental results
22. Comparison to a random projection
23. On the relationship between Inductive Transfer and Multi-Task Learning
- Inductive Transfer is concerned with learning to predict in one domain based on training data coming from another domain.
- Multi-Task Learning is concerned with utilizing training data from different domains to help prediction in all of them.
- Clearly, each of them implies the other.
- Can we formalize and quantify these implications?
24. Some other attempts of mine to theorize about practical heuristics
- Can kernels always provide large margins? (with Eiron and Simon, 2001)
- Can kernels be learned without over-fitting? (with Srebro, 2006)
- Does stability work for selecting clustering parameters? (with von Luxburg and Pal, 2006)