1
Inductive Transfer via Embedding into a Common
Feature Space
  • Shai Ben-David
  • School of Computer Science
  • University of Waterloo

2
Background
  • In April at Snowbird I heard John Blitzer talk
    about his work with Crammer and Pereira on
    transferring Part-of-Speech (POS) tagging from
    one domain to another.
  • It clicked with my interest in investigating the
    theoretical foundations of practically successful
    learning heuristics.

3
Inductive transfer for POS tagging
  • POS tagging is a common preprocessing step in
    NLP systems.
  • Can an automatic POS tagger be trained on one
    domain (say, legal documents) and then be used
    to tag a different domain (say, biomedical
    abstracts)?
  • The relationship between Inductive Transfer and
    Multi-Task Learning is worth discussing; I'll
    postpone that to the end of my talk.

4
Structural Correspondence Learning (Blitzer,
McDonald, Pereira)
  • Choose a set of pivot words (determiners,
    prepositions, connectors and frequently occurring
    verbs).
  • Represent every word in a text as a vector of its
    correlations with each of the pivot words.
  • Train a linear separator on the (images of the)
    training data coming from one domain and use it
    for tagging on the other. (A toy sketch follows
    below.)
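To make the pipeline concrete, here is a minimal sketch of the pivot-correlation idea described above. It is not the authors' implementation: the pivot list, the window-based co-occurrence notion of "correlation", the toy sentences and the noun/not-noun labels are all illustrative assumptions.

# Toy sketch of a pivot-correlation representation (in the spirit of the
# description above, not the authors' code). "Correlation" here is a simple
# co-occurrence count inside a context window.
import numpy as np
from sklearn.linear_model import LogisticRegression

PIVOTS = ["the", "a", "of", "in", "and", "to", "is"]   # hypothetical pivot set
WINDOW = 2

def pivot_vector(tokens, i):
    # Co-occurrence of tokens[i] with each pivot word within +/- WINDOW.
    vec = np.zeros(len(PIVOTS))
    lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
    for j in range(lo, hi):
        if j != i and tokens[j] in PIVOTS:
            vec[PIVOTS.index(tokens[j])] += 1.0
    return vec

def represent(sentences):
    # Map every token occurrence to its pivot-correlation vector.
    return np.array([pivot_vector(toks, i)
                     for toks in sentences for i in range(len(toks))])

# Illustrative source-domain data: per-token labels, 1 = noun, 0 = other.
src_sents = [["the", "court", "ruled", "in", "favour", "of", "the", "plaintiff"]]
src_labels = np.array([0, 1, 0, 0, 1, 0, 0, 1])

clf = LogisticRegression().fit(represent(src_sents), src_labels)

# The same representation + classifier is then applied to a new domain.
tgt_sents = [["the", "protein", "binds", "to", "the", "receptor"]]
print(clf.predict(represent(tgt_sents)))

The point of the representation is that the pivot features are shared across domains, so a separator trained on the source images has a chance of transferring to the target.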

5
Another potential application
  • Spam detection.
  • Consider the scenario where a spam filter is
    trained on the mail of in-house users, but then
    has to succeed on the mail of customers.

6
Why does it work?
  • The SCL representation seems to enjoy two
    properties:
  • It is sufficiently rich to allow tagging.
  • The images of the two domains look similar
    under this representation.

7
A formal framework
  • Our setting:
  • There is a fixed (unique) domain set X.
  • There is a unique probabilistic labeling function
    f : X → [0,1].
  • Two tasks, S and T, induce two different
    probability distributions, D_S and D_T, over X.
  • We are given a training sample drawn from D_S
    and labeled by Prob(l(x) = 1) = f(x).
  • The test data is generated by D_T. (A toy
    instantiation of this data model follows below.)
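The sketch below just makes the data model concrete; the Gaussian domain distributions and the logistic choice of f are hypothetical, chosen only to show that the same f labels both samples while only the input distribution changes.

# Toy instantiation of the setting: one labeling function f, two different
# input distributions D_S (source) and D_T (target). All concrete choices
# here (Gaussians, logistic f) are illustrative assumptions, not from the talk.
import numpy as np

rng = np.random.default_rng(0)

def f(X):
    # Probabilistic labeling function f : X -> [0,1], shared by both tasks.
    return 1.0 / (1.0 + np.exp(-X.sum(axis=1)))

def sample(mean, n):
    # Draw n points from a Gaussian domain distribution and label each one
    # by a coin flip with bias f(x), i.e. Prob(l(x) = 1) = f(x).
    X = rng.normal(loc=mean, scale=1.0, size=(n, 2))
    y = (rng.random(n) < f(X)).astype(int)
    return X, y

X_src, y_src = sample(mean=[+1.0, 0.0], n=500)   # labeled training data ~ D_S
X_tgt, y_tgt = sample(mean=[-1.0, 0.5], n=500)   # test data            ~ D_T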

8
Our main Inductive Transfer Tool
  • We propose to embed the original attribute space
    into some feature space in which the two tasks
    look similar.
  • Then, treat the images of points from both
    domains as if they are coming from a single
    domain.

9
The Common-Feature-Space Idea
10
Some more notation
  • We call a function R : X → Z a representation,
    and its range, Z, a feature space.
  • Let D_S^R and D_T^R denote the probability
    distributions induced by D_S and D_T via R
    (spelled out below).
  • Let f_S^R and f_T^R be the corresponding images
    of f.
  • E.g., f_S^R(x) = E_S[ f(y) | R(y) = R(x) ].
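For completeness, here is my reading of the induced objects, written out in LaTeX; the slide only defines the induced labeling function by example, so treat this as a reconstruction consistent with the notation above.

% Induced distribution (pushforward) and induced labeling function:
\[
  D_S^R(E) \;=\; D_S\!\left(R^{-1}(E)\right)
  \qquad \text{for measurable } E \subseteq Z,
\]
\[
  f_S^R(z) \;=\; \mathbb{E}_{y \sim D_S}\!\left[\, f(y) \;\middle|\; R(y) = z \,\right],
\]
% and analogously D_T^R, f_T^R with D_T in place of D_S.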

11
Requirements from such an embedding
  • The embedding should retain labeling-relevant
    information.
  • Namely, E_{D_S}[ |f_S^R - 1/2| ] should be as
    large as possible.
  • (We'll refine this requirement later.)

12
Requirements from such an embedding
  • The induced distributions D_S^R and D_T^R
    should be as similar as possible.

13
How should one measure similarity of
distributions?
  • Common measures:
  • Total Variation:
    TV(D, D') = sup_{E measurable} |D(E) - D'(E)|.
  • KL divergence.
  • They are too sensitive for our needs and cannot
    be reliably estimated from finite samples
    [Batu et al., 2000].

14
A new distribution-similarity measure
  • In VLDB 2004, Kifer, Ben-David and Gehrke
    introduced the A-similarity measure
    d_A(D, D') = sup_{E ∈ A} |D(E) - D'(E)|.
  • They prove that it can be reliably estimated
    from finite samples drawn i.i.d. from D and D'
    (respectively).
  • The required sample sizes are a function of the
    VC-dimension of A.

15
Estimating d_A from samples
  • Kifer, Ben-David, Gehrke
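The slide's content (the estimator and its sample-size bound) is not reproduced in this transcript. As an illustration only, the sketch below computes the empirical d_A exactly for the simple class A of threshold sets on the line (VC-dimension 1), where it coincides with the Kolmogorov-Smirnov statistic; the choice of A and the Gaussian samples are my assumptions.

# Empirical d_A between two samples, for the illustrative class
# A = { (-inf, t] : t in R } (thresholds on the line, VC-dimension 1):
#   d_A-hat = max_t | P1_hat((-inf, t]) - P2_hat((-inf, t]) |.
import numpy as np

def empirical_dA_thresholds(sample1, sample2):
    # Exact empirical d_A over threshold sets (== the KS statistic).
    pts = np.sort(np.concatenate([sample1, sample2]))
    cdf1 = np.searchsorted(np.sort(sample1), pts, side="right") / len(sample1)
    cdf2 = np.searchsorted(np.sort(sample2), pts, side="right") / len(sample2)
    return np.max(np.abs(cdf1 - cdf2))

rng = np.random.default_rng(1)
src = rng.normal(0.0, 1.0, size=2000)   # i.i.d. sample from D
tgt = rng.normal(0.5, 1.0, size=2000)   # i.i.d. sample from D'
print(empirical_dA_thresholds(src, tgt))

Because A has small VC-dimension, the empirical value concentrates around the true d_A at the usual uniform-convergence rate, which is the point of the Kifer, Ben-David and Gehrke result.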

16
A generalization bound
  • Theorem:
  • Let Z be any feature space, and R any
    representation of X into Z. Let H be a class of
    functions from Z to {0,1}.
  • If a random labeled sample of size m is generated
    by applying R to a D_S-i.i.d. sample labeled
    according to f,
  • then, with probability > (1 - δ), for every h in H
    the bound displayed below holds,
  • where d is the VC-dimension of H.
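The displayed bound itself did not survive the transcript; the reconstruction below follows the closely related bound of Ben-David, Blitzer, Crammer and Pereira (NIPS 2006) for this setting, so read it as indicative rather than verbatim.

% Reconstructed bound (after Ben-David, Blitzer, Crammer, Pereira 2006):
\[
  \epsilon_T(h) \;\le\; \hat{\epsilon}_S(h)
  \;+\; \sqrt{\tfrac{4}{m}\left( d \log\tfrac{2em}{d} + \log\tfrac{4}{\delta} \right)}
  \;+\; d_{\mathcal{H}}\!\left(D_S^R, D_T^R\right)
  \;+\; \lambda,
\]
% where \lambda = \min_{h \in H} \left[ \epsilon_S(h) + \epsilon_T(h) \right]
% is the error of the best joint predictor.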

17
An algorithmic conclusion
  • Assuming that
  • the (unlabeled) distributions induced by the
    Source and Target domains under the
    representation R are similar, and
  • there exists a predictor in H that works
    reasonably well for both distributions,
  • empirical risk minimization over training data in
    the source domain yields a good predictor for the
    target domain.

18
Learning good embeddings
  • We can use a semi-supervised learning approach:
  • Take care of the distribution-similarity
    requirement using unlabeled data from both tasks.
  • Minimize the empirical error Er_S(h) using a
    labeled training sample from the source domain.

19
The resulting algorithm
  • Fix a family R of potential embeddings.
  • Fix a family H of predictors over the feature
    space(s).
  • Draw a labeled training sample from the source
    domain.
  • Search simultaneously over R and H to minimize
    the sum of the empirical error of h and
    d_H(D_S^R, D_T^R). (A toy rendering of this
    objective follows below.)
  • (We implicitly assume that this search does not
    have a significant effect on the value of the
    joint-error term λ.)
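An entirely illustrative rendering of that search: the embedding family R is taken to be one-dimensional linear projections, the predictors are logistic/threshold classifiers on the line, and the distance term is the empirical d_A over threshold sets from the earlier sketch; none of these concrete choices come from the talk.

# Toy joint search over embeddings R (here: 1-D linear projections
# x -> w.x) and predictors on the line, minimizing
#   (empirical source error) + (empirical d_A between projected samples).
import numpy as np
from sklearn.linear_model import LogisticRegression

def empirical_dA_thresholds(s1, s2):
    pts = np.sort(np.concatenate([s1, s2]))
    c1 = np.searchsorted(np.sort(s1), pts, side="right") / len(s1)
    c2 = np.searchsorted(np.sort(s2), pts, side="right") / len(s2)
    return np.max(np.abs(c1 - c2))

def joint_search(X_src, y_src, X_tgt, n_candidates=50, seed=0):
    # y_src is assumed to contain both classes.
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_candidates):
        w = rng.normal(size=X_src.shape[1])
        w /= np.linalg.norm(w)                    # candidate embedding R(x) = w.x
        z_src, z_tgt = X_src @ w, X_tgt @ w
        clf = LogisticRegression().fit(z_src.reshape(-1, 1), y_src)
        err = 1.0 - clf.score(z_src.reshape(-1, 1), y_src)   # empirical error
        gap = empirical_dA_thresholds(z_src, z_tgt)           # distribution gap
        if best is None or err + gap < best[0]:
            best = (err + gap, w, clf)
    return best   # (objective value, chosen embedding, chosen predictor)

With the toy data generated in the earlier sketch, joint_search(X_src, y_src, X_tgt) trades training accuracy against how distinguishable the two projected samples are, which mirrors the objective on this slide.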

20
A generalization bound for the semi-supervised
approach
  • We get a bound similar to our basic error bound,
    with two added terms:
  • The first depends on VC-dim(H), based on Kifer,
    Ben-David and Gehrke 2004, to account for
    estimating d_H(D_S^R, D_T^R) from (unlabeled)
    samples.
  • The second term accounts for the search over the
    space R of potential embeddings, and depends on
    a pseudo-dimension of R.

21
Some experimental results
22
Comparison to a random projection
23
On the relationship between Inductive Transfer
and Multi-Task Learning
  • Inductive Transfer is concerned with learning to
    predict in one domain based on training data
    coming from another domain.
  • Multi-Task Learning is concerned with utilizing
    training data from different domains to help
    prediction in all of them.
  • Clearly each of them implies the other.
  • Can we formalize and quantify these implications?

24
Some other attempts of mine to theorize about
practical heuristics
  • Can kernels always provide large margins?
    (with Eiron and Simon, 2001) - Negative.
  • Can kernels be learned without over-fitting?
    (with Srebro, 2006) - Positive.
  • Does stability work for selecting clustering
    parameters? (with von Luxburg and Pal, 2006)
    - Negative.