Title: A Comparison of Methods for Transductive Transfer Learning
1. A Comparison of Methods for Transductive Transfer Learning
- Andrew Arnold
- Advised by William W. Cohen
- Machine Learning Department
- School of Computer Science
- Carnegie Mellon University
- May 30, 2007
2. What we are able to do
- Supervised learning
- Train on large, labeled data sets drawn from the same distribution as the testing data
- Well-studied problem
[Figure: example training and test sentences drawn from the same distribution, e.g. "Reversible histone acetylation changes the chromatin structure and can modulate gene transcription. Mammalian histone deacetylase 1 (HDAC1) ..." and "The neuronal cyclin-dependent kinase p35/cdk5 comprises a catalytic subunit (cdk5) and an activator subunit (p35) ..."]
3. What we're getting better at doing
- Semi-supervised learning
- Same as before, but now add large unlabeled or weakly labeled data sets from the same domain
- Zhu 05, Grandvalet 05
[Figure: labeled training sentences plus auxiliary unlabeled sentences from the same domain, using the same histone acetylation and p35/cdk5 example sentences as above]
4. What we're getting better at doing
- Transductive learning
- Unlabeled test data is available during training
- Easier than inductive learning: learning specific predictions rather than a general function
- Joachims 99, 03, Sindhwani 05, Vapnik 98
[Figure: the auxiliary unlabeled data is both the auxiliary data and the eventual test data, again using the histone acetylation and p35/cdk5 example sentences]
5. What we'd like to be able to do
- Transfer learning (domain adaptation)
- Leverage large, previously labeled data sets from a related domain
- Source: the related domain we'll be training on (with lots of data)
- Target: the domain we're interested in and will be tested on (data scarce)
- Ng 06, Daumé 06, Jiang 06, Blitzer 06, Ben-David 07, Thrun 96
[Figure: domain adaptation examples. Train on source domain E-mail, test on target domain IM; train on source domain Abstract ("The neuronal cyclin-dependent kinase p35/cdk5 comprises a catalytic subunit (cdk5) and an activator subunit (p35)"), test on target domain Caption ("Neuronal cyclin-dependent kinase p35/cdk5 (Fig 1, a) comprises a catalytic subunit (cdk5, left panel) and an activator subunit (p35, fmi 4)")]
6. What we'd like to be able to do
- Transfer learning (multi-task)
- Same domain, but slightly different task
- Source: the related task we'll be training on (with lots of data)
- Target: the task we're interested in and will be tested on (data scarce)
- Ando 05, Sutton 05
[Figure: multi-task examples. Train on source task Names, test on target task Pronouns; train on source task Proteins, test on target task Action Verbs, using the same histone acetylation and p35/cdk5 example sentences]
7. Motivation
- Why is transfer important?
- We often violate the non-transfer assumption without realizing it. How much data is truly identically distributed (i.i.d.)?
- E.g. different authors, annotators, time periods, sources
- Large amounts of labeled data and trained classifiers already exist. Why waste data and computation?
- Can learning be made easier by leveraging related domains/problems?
- Life-long learning
- Why is transduction important?
- Why solve a harder problem than we need to?
- Unlabeled data is vast and cheap
- Are transduction and transfer so different?
- Can we learn more about one by studying the other?
8. Outline
- Motivating Problems
- Supervised learning
- Semi-supervised learning
- Transductive learning
- Transfer learning: domain adaptation
- Transfer learning: multi-task
- Methods
- Maximum entropy (MaxEnt)
- Source regularized maximum entropy
- Feature space expansion
- Feature selection
- Feature space transformation
- Iterative Pseudo Labeling (IPL)
- Biased thresholding
- Support Vector Machines (SVMs)
- Inductive SVM
- Transductive SVM
- Experiment
9. Maximum Entropy (MaxEnt)
- Discriminative model
- Matches the feature expectations of the model to those of the data
- Trained by regularized optimization of the conditional likelihood (a standard formulation is sketched below)
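The slide's equations did not survive extraction; the following is a standard MaxEnt formulation, reconstructed in generic notation (weights λ_k, features f_k, Gaussian-prior variance σ²), which may differ from the notation on the original slide.

```latex
% Conditional likelihood under a log-linear (MaxEnt) model
p_\Lambda(y \mid x) = \frac{1}{Z_\Lambda(x)} \exp\Big( \sum_k \lambda_k f_k(x, y) \Big),
\qquad
Z_\Lambda(x) = \sum_{y'} \exp\Big( \sum_k \lambda_k f_k(x, y') \Big)

% Regularized optimization: maximize conditional log-likelihood
% under a Gaussian (L2) prior centered at zero
\Lambda^* = \arg\max_\Lambda \sum_i \log p_\Lambda(y_i \mid x_i) - \frac{\lVert \Lambda \rVert^2}{2\sigma^2}
```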
10. Summary of Learning Settings
[Table: summary of the learning settings above; contents not recoverable from the extraction]
11. Source-regularized MaxEnt
- Instead of regularizing towards zero:
- Learn a model on the source data
- During target training, regularize towards the source-trained weights (Λ_s in the sketch below)
- Chelba 04
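A sketch of the modified objective, following the Chelba 04 idea of centering the Gaussian prior at the source-trained weights; the symbols Λ_s and σ² are my notation, since the slide's formula is not recoverable.

```latex
% Learn \Lambda_s on the source data, then during target training
% regularize towards \Lambda_s instead of towards zero
\Lambda_t^* = \arg\max_\Lambda \sum_{i \in \mathrm{target}} \log p_\Lambda(y_i \mid x_i)
- \frac{\lVert \Lambda - \Lambda_s \rVert^2}{2\sigma^2}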
12. Feature Space Expansion
- Add extra degrees of freedom
- Allow the classifier to discern general vs. domain-specific features (a minimal sketch follows)
- Daumé 06, 07
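A minimal sketch of a Daumé-style feature augmentation (my illustration, not code from the talk): each feature gets a shared copy plus a domain-specific copy, so the classifier can learn separate weights for general and domain-specific behavior.

```python
import numpy as np

def augment_features(X, domain):
    """Daumé-style 'frustratingly easy' augmentation (sketch).

    X      : (n, d) feature matrix
    domain : "source" or "target"
    Returns an (n, 3*d) matrix: [shared copy | source-only copy | target-only copy].
    """
    n, d = X.shape
    zeros = np.zeros((n, d))
    if domain == "source":
        return np.hstack([X, X, zeros])   # shared + source-specific copies active
    return np.hstack([X, zeros, X])       # shared + target-specific copies active

# Usage (hypothetical): stack augmented source and target examples,
# then train any linear classifier on the expanded representation.
```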
13. Feature Selection
- Emphasize features shared by the source and target data
- Minimize domain-specific (different) features
- How to measure?
- Fisher exact test (a sketch follows this list)
- Is P(feature | source) ≈ P(feature | target)?
- If so, it is a shared feature → keep
- If not, it is a domain-specific feature → discard
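A sketch of how the per-feature test could be applied (my illustration; scipy's fisher_exact and the 0.05 significance cutoff are assumptions, not details from the slides).

```python
from scipy.stats import fisher_exact

def keep_feature(src_count, src_total, tgt_count, tgt_total, alpha=0.05):
    """Keep a feature if its occurrence rate is not significantly different
    between the source and target corpora (i.e. it looks 'shared')."""
    table = [[src_count, src_total - src_count],
             [tgt_count, tgt_total - tgt_count]]
    _, p_value = fisher_exact(table)
    return p_value >= alpha  # large p-value: no evidence the domains differ on this feature
```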
14. Feature Space Transformation
- Source and target are originally only independently separable
- Learn a transformation, G, that allows joint separation (a sketch follows)
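One way to write the idea, purely as my sketch (the slide gives no formula): learn a transformation G of the target features so that a single hyperplane w separates the source points and the transformed target points jointly.

```latex
% Sketch: one separator w for source points x_i^{(s)} and transformed target points G x_j^{(t)}
y_i^{(s)} \big( w^\top x_i^{(s)} \big) > 0 \;\; \forall i,
\qquad
y_j^{(t)} \big( w^\top G\, x_j^{(t)} \big) > 0 \;\; \forall j
```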
15. Iterative Pseudo Labeling (IPL)
- Novel algorithm for MaxEnt-based transfer
- Adjust feature values to match feature expectations in the source and target (a sketch of the loop follows)
- A trade-off parameter (α in the sketch below) balances certainty vs. adaptivity
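A high-level sketch of what such an iterative pseudo-labeling loop could look like, reconstructed from the bullet points only; the MaxEntModel class, its methods, and the exact use of the trade-off parameter alpha are all hypothetical.

```python
def iterative_pseudo_labeling(source_X, source_y, target_X, alpha=0.95, n_iters=10):
    """IPL sketch: train on source, pseudo-label the target, and nudge the model's
    feature expectations toward the target each round.
    alpha near 1 = conservative (trust the source); smaller = more adaptive."""
    model = MaxEntModel()                                # hypothetical MaxEnt classifier
    model.fit(source_X, source_y)
    for _ in range(n_iters):
        pseudo_y = model.predict(target_X)               # pseudo-label the target data
        src_exp = model.feature_expectations(source_X, source_y)
        tgt_exp = model.feature_expectations(target_X, pseudo_y)
        mixed = alpha * src_exp + (1 - alpha) * tgt_exp  # interpolate the two expectations
        model.refit_to_expectations(mixed)               # retrain to match the mixed expectations
    return model
```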
16. IPL analysis
- Given a linear transform, we can express the conditional feature expectations of the target data in terms of a transformation of the source expectations (sketched below)
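The derivation on the slide did not survive extraction; a plausible form of the statement, in notation I am assuming (G the linear transform relating source and target feature vectors f):

```latex
% If target features are a linear transform of source features,
%   f^{(t)}(x, y) = G \, f^{(s)}(x, y),
% then conditional feature expectations transform the same way:
\mathbb{E}\big[ f^{(t)}(x, y) \,\big|\, y \big] = G \; \mathbb{E}\big[ f^{(s)}(x, y) \,\big|\, y \big]
```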
17. Biased Thresholding
- Source and target may have different proportions of positive examples
- E.g. learning to predict rain in humid vs. arid climates
- How to maximize F1 (and not accuracy)? Two schemes (both sketched below):
- Score Cut (s-cut)
- Select a score threshold over the ranked training scores, then apply it to the test data
- Percentage Cut (p-cut)
- Estimate the proportion of positive examples expected in the target data, and set the threshold so as to select that proportion
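A minimal sketch of the two thresholding schemes (my illustration; the function names and the choice of F1 as the criterion for the s-cut search are assumptions).

```python
import numpy as np
from sklearn.metrics import f1_score

def s_cut_threshold(train_scores, train_labels):
    """Score cut: choose the threshold over the ranked training scores that maximizes F1."""
    best_t, best_f1 = 0.0, -1.0
    for t in np.sort(train_scores):
        f1 = f1_score(train_labels, (train_scores >= t).astype(int))
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

def p_cut_threshold(test_scores, expected_positive_rate):
    """Percentage cut: set the threshold so that the expected fraction of
    test examples is labeled positive."""
    return np.quantile(test_scores, 1.0 - expected_positive_rate)

# Usage (hypothetical): y_pred = (test_scores >= threshold)
```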
18. Support Vector Machines (SVMs)
- Inductive (standard) SVM
- Learn a separating hyperplane on labeled training data, then evaluate on held-out testing data
- Transductive SVM (a standard form of the objective is sketched below)
- Learn the hyperplane in the presence of labeled training data AND unlabeled testing data, using the distribution of the testing points to assist
- Easier to learn particular labels than a whole function
- More expensive than the inductive SVM
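For reference, a common way to write the transductive SVM objective, in standard Joachims-style notation rather than anything recoverable from the slide: the labels y*_j of the unlabeled test points are optimized jointly with the hyperplane.

```latex
% Transductive SVM (sketch): jointly choose the hyperplane (w, b) and
% the labels y^*_j of the m unlabeled test points
\min_{w,\, b,\, \{y^*_j\},\, \xi,\, \xi^*} \;
\tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{m} \xi^*_j
\quad \text{s.t.} \quad
y_i (w^\top x_i + b) \ge 1 - \xi_i, \;\;
y^*_j (w^\top x^*_j + b) \ge 1 - \xi^*_j, \;\;
\xi_i,\, \xi^*_j \ge 0
```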
19. Transductive vs. Inductive SVM
- Joachims 99, 03
20. Domain
21. Data
<Protname>p35</Protname>/<Protname>cdk5</Protname> binds and phosphorylates <Protname>beta-catenin</Protname> and regulates <Protname>beta-catenin</Protname>/<Protname>presenilin-1</Protname> interaction.
<prot>p38 stress-activated protein kinase</prot> inhibitor reverses <prot>bradykinin B(1) receptor</prot>-mediated component of inflammatory hyperalgesia.
- Notice the differences in:
- Length and density of protein names
- Number of training examples (UT ≈ 4× Yapex)
- Proportion of positive examples (twice as many in Yapex)
22. Experiment
- Examining three dimensions:
- Labeled vs. unlabeled vs. prior auxiliary data
- E.g. the proportion of positive target examples, or a few labeled target examples
- Transduction vs. induction
- Transfer vs. non-transfer
- Since there are few true positives, we focus on
- F1 = (2 × Precision × Recall) / (Precision + Recall)
- Source: UT, target: Yapex
- For IPL, the trade-off parameter was set to .95 (conservative)
23. Results: Transfer
- Transfer is much more difficult
- Accuracy is not the problem
24. Results: Transduction
- Transduction helps in transfer setting
- TSVM copes better than MaxEnt, ISVM
25. Results: IPL
- IPL can help boost performance
- Makes transfer MaxEnt competitive with TSVM
- But bounded by quality of initial pseudo-labels
26. Results: Priors
- Priors improve unsupervised transfer
- The threshold helps balance recall and precision → better F1
- A little bit of knowledge can help a lot
27. Results: Supervision
- Supervised transfer beats supervised non-transfer
- Significant at a 99% binomial CI on precision and recall
- But not by as much as might be hoped for
- Even relatively simple transfer methods can help
28. Conclusions & Contributions
- Introduced IPL, a novel MaxEnt transfer method
- Can match transduction in the unsupervised setting
- Gives probabilistic results
- Analyzed and compared various methods related to transfer learning and concluded:
- Transfer is hard
- But it is made easier when explicitly addressed
- Transduction is a good start
- TSVM excels even with scant prior knowledge
- A little prior target knowledge is even better
- No need for a fully labeled target data set
29. Limitations & Future Work
- The threshold is important
- Currently only used at test time
- Why not incorporate it earlier and get better pseudo-labels?
- Priors seem to help a lot
- Currently only using feature means; what about variances?
- Can structuring the feature space lead to parsimonious, transferable priors?
[Figure: example feature hierarchy over a token, e.g. left/right context, token.is.capitalized, token.is.numeric]
30. Limitations & Future Work: high-level
- How can we make better use of the source data?
- Why doesn't the source data help more?
- Is IPL convex?
- Is this exactly what we want to optimize?
- How does regularization affect convexity?
- What, exactly, is the relationship between transduction and transfer?
- Can their theories be unified?
- When is it worth explicitly modeling transfer?
- How different do the domains need to be?
- How much source/target data do we need?
- What kind of priors do we need?
31. Thank you!
Questions?
32. References