1. Transfer Learning by Mapping and Revising Relational Knowledge
- Raymond J. Mooney
- University of Texas at Austin
- with acknowledgements to
- Lily Mihalkova, Tuyen Huynh
2. Transfer Learning
- Most machine learning methods learn each new task from scratch, failing to utilize previously learned knowledge.
- Transfer learning concerns using knowledge acquired in a previous source task to facilitate learning in a related target task.
- Usually assumes significant training data was available in the source domain but only limited training data is available in the target domain.
- By exploiting knowledge from the source, learning in the target can be
  - More accurate: learned knowledge makes better predictions.
  - Faster: training time is reduced.
3. Transfer Learning Curves
- Transfer learning increases accuracy in the target domain.
[Figure: learning curves; y-axis: predictive accuracy; x-axis: amount of training data in the target domain.]
4. Recent Work on Transfer Learning
- A recent DARPA program on Transfer Learning has led to significant research in the area.
- Some work focuses on feature-vector classification:
  - Hierarchical Bayes (Yu et al., 2005; Lawrence & Platt, 2004)
  - Informative Bayesian priors (Raina et al., 2005)
  - Boosting for transfer learning (Dai et al., 2007)
  - Structural Correspondence Learning (Blitzer et al., 2007)
- Some work focuses on reinforcement learning:
  - Value-function transfer (Taylor & Stone, 2005, 2007)
  - Advice-based policy transfer (Torrey et al., 2005, 2007)
5. Similar Research Problems
- Multi-Task Learning (Caruana, 1997)
  - Learn multiple tasks simultaneously, each one helped by the others.
- Life-Long Learning (Thrun, 1996)
  - Transfer learning from a number of prior source problems, picking the correct source problems to use.
6. Logical Paradigm
- Represents knowledge and data in binary symbolic logic such as first-order predicate calculus.
  - (+) Rich representation that handles arbitrary sets of objects, with properties, relations, quantifiers, etc.
  - (-) Unable to handle uncertain knowledge and probabilistic reasoning.
7. Probabilistic Paradigm
- Represents knowledge and data as a fixed set of random variables with a joint probability distribution.
  - (+) Handles uncertain knowledge and probabilistic reasoning.
  - (-) Unable to handle arbitrary sets of objects, with properties, relations, quantifiers, etc.
8. Statistical Relational Learning (SRL)
- Most machine learning methods assume i.i.d. examples represented as fixed-length feature vectors.
- Many domains require learning and making inferences about unbounded sets of entities that are richly relationally connected.
- SRL methods attempt to integrate methods from predicate logic and probabilistic graphical models to handle such structured, multi-relational data.
9. Statistical Relational Learning
[Figure: multi-relational data (Actor, Movie, Director, and WorkedFor relations) is fed to a learning algorithm, which outputs a probabilistic graphical model.]
10. Multi-Relational Data Challenges
- Examples cannot be effectively represented as feature vectors.
- Predictions for connected facts are not independent (e.g., WorkedFor(brando, coppola) and Movie(godFather, brando)).
- Data is not i.i.d.
- Requires collective inference (classification) (Taskar et al., 2001).
- A single independent example (a "mega-example") often contains information about a large number of interconnected entities and can vary in length.
  - Leave-one-university-out testing (Craven et al., 1998)
11. TL, SRL, and I.I.D.
- Standard machine learning assumes examples are independent and identically distributed (i.i.d.).
- TL breaks the assumption that test examples are drawn from the same distribution as the training instances.
- SRL breaks the assumption that examples are independent.
12. Multi-Relational Domains
- Domains about people
- Academic departments (UW-CSE)
- Movies (IMDB)
- Biochemical domains
- Mutagenesis
- Alzheimer drug design
- Linked text domains
- WebKB
- Cora
13. Relational Learning Methods
- Inductive Logic Programming (ILP)
  - Produces sets of first-order rules.
  - Not appropriate for probabilistic reasoning.
  - Example rule: if a student wrote a paper with a professor, then the professor is the student's advisor.
- SRL models and learning algorithms
  - SLPs (Muggleton, 1996)
  - PRMs (Koller, 1999)
  - BLPs (Kersting & De Raedt, 2001)
  - RMNs (Taskar et al., 2002)
  - MLNs (Richardson & Domingos, 2006)
14. MLN Transfer (Mihalkova, Huynh, & Mooney, 2007)
- Given two multi-relational domains, such as an academic source domain (UW-CSE) and a movie target domain (IMDB).
- Transfer a Markov logic network learned in the source to the target by
  - Mapping the source predicates to the target.
  - Revising the mapped knowledge.
15. First-Order Logic Basics
- Literal: a predicate (or its negation) applied to constants and/or variables.
  - Gliteral: a ground literal, e.g., WorkedFor(brando, coppola).
  - Vliteral: a variablized literal, e.g., WorkedFor(A, B).
- We assume predicates have typed arguments.
  - For example, in Movie(godFather, coppola) the first argument is of type movie and the second of type person.
16. First-Order Clauses
- Clause: a disjunction of literals.
- Can be rewritten as a set of rules, as in the example below.
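For instance, using the movie predicates from these slides, a clause with one positive literal is logically equivalent to an implication rule:

$$\lnot\,\mathit{Movie}(T,A) \lor \lnot\,\mathit{WorkedFor}(A,B) \lor \mathit{Movie}(T,B) \;\equiv\; \mathit{Movie}(T,A) \land \mathit{WorkedFor}(A,B) \Rightarrow \mathit{Movie}(T,B)$$

Since any literal can be moved to the consequent side, a clause with several positive literals corresponds to a set of equivalent rules, one per choice of consequent.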
17. Representing the Data
- Makes a closed-world assumption:
  - The gliterals listed are true; the rest are false (a minimal example follows).
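As an illustration, in the style of Alchemy's data files (which list only the true gliterals; these particular facts are taken from the running example):

```
Actor(Brando)
Director(Coppola)
WorkedFor(Brando, Coppola)
Movie(GodFather, Brando)
Movie(GodFather, Coppola)
```

Under the closed-world assumption, any unlisted gliteral, such as WorkedFor(Coppola, Brando), is taken to be false.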
18. Markov Logic Networks (Richardson & Domingos, 2006)
- Set of first-order clauses, each assigned a weight.
- Larger weight indicates stronger belief that the clause should hold.
- The clauses are called the structure of the MLN.
19. Markov Networks (Pearl, 1988)
- A concise representation of the joint probability distribution of a set of random variables using an undirected graph.
[Figure: a small Markov network with nodes such as "Reputation of Author" and "Quality of Paper", together with its joint distribution.]
- The same probability distribution can be represented as the product of a set of functions defined over the cliques of the graph.
20. Markov Network Equations
- General form
- Log-linear models
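The slide's equation images did not survive extraction; the standard forms (as given in Richardson & Domingos, 2006) are, for cliques $k$ with potential functions $\phi_k$:

$$P(X{=}x) \;=\; \frac{1}{Z}\prod_k \phi_k\big(x_{\{k\}}\big), \qquad Z \;=\; \sum_{x'}\prod_k \phi_k\big(x'_{\{k\}}\big)$$

and, in log-linear form, with one weight $w_j$ per feature $f_j$:

$$P(X{=}x) \;=\; \frac{1}{Z}\exp\Big(\sum_j w_j f_j(x)\Big)$$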
21. Ground Markov Network for an MLN
- MLNs are templates for constructing Markov networks for a given set of constants.
- Include a node for each type-consistent grounding (a gliteral) of each predicate in the MLN (sketched below).
- Two nodes are connected by an edge if their corresponding gliterals appear together in any grounding of any clause in the MLN.
- Include a feature for each grounding of each clause in the MLN, with weight equal to the weight of the clause.
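A minimal sketch of the node-creation step, assuming hypothetical type and predicate tables that match the running example (this is illustrative, not Alchemy's implementation):

```python
from itertools import product

# Hypothetical domain: each type with its constants.
constants = {
    "person": ["brando", "coppola"],
    "movie": ["godFather"],
}
# Predicates with their typed argument lists.
predicates = {
    "Actor": ["person"],
    "Director": ["person"],
    "WorkedFor": ["person", "person"],
    "Movie": ["movie", "person"],
}

def ground_atoms():
    """Yield one gliteral (one network node) per type-consistent grounding."""
    for pred, arg_types in predicates.items():
        for args in product(*(constants[t] for t in arg_types)):
            yield f"{pred}({', '.join(args)})"

for atom in ground_atoms():
    print(atom)  # Actor(brando), ..., WorkedFor(coppola, coppola), ...
```

Run on these tables, this enumerates exactly the ten nodes shown in the next slide's figure.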
22. Example Ground Network
[Figure: the ground Markov network for the constants coppola, brando, and godFather; its nodes are the gliterals Actor(brando), Actor(coppola), Director(brando), Director(coppola), WorkedFor over every person pair, and Movie(godFather, ·) for each person; the clause weights shown are 1.3, 1.2, and 0.5.]
23. MLN Equations
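The equation image is missing from the transcript; the standard MLN distribution over possible worlds $x$ (Richardson & Domingos, 2006) is:

$$P(X{=}x) \;=\; \frac{1}{Z}\exp\Big(\sum_i w_i\, n_i(x)\Big)$$

where $w_i$ is the weight of clause $i$ and $n_i(x)$ is the number of true groundings of clause $i$ in world $x$.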
24. MLN Equation Intuition
- A possible world (a truth assignment to all
gliterals) becomes exponentially less likely as
the total weight of all the grounded clauses it
violates increases.
25. MLN Inference
- Given truth assignments for a given set of evidence gliterals, infer the probability that each member of a set of unknown query gliterals is true.
26. Example: Inference in the Ground Network
[Figure: the ground network from slide 22, distinguishing evidence gliterals from query gliterals.]
27. MLN Inference Algorithms
- Gibbs Sampling (Richardson & Domingos, 2006)
- MC-SAT (Poon & Domingos, 2006)
28. MLN Learning
- Weight learning (Richardson & Domingos, 2006; Lowd & Domingos, 2007)
  - Performed using optimization methods.
- Structure learning (Kok & Domingos, 2005)
  - Proceeds in iterations of beam search, adding the best-performing clause after each iteration to the MLN.
  - Clauses are evaluated using the WPLL score.
29. WPLL (Kok & Domingos, 2005)
- Weighted pseudo-log-likelihood, written out below.
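The defining equation is missing from the transcript; as defined by Kok & Domingos (2005), the WPLL conditions each gliteral on its Markov blanket and weights each predicate equally:

$$\mathrm{WPLL}(w,x) \;=\; \sum_{r \in R} c_r \sum_{g \in G_r} \log P_w\big(X_g = x_g \mid MB_x(X_g)\big)$$

where $R$ is the set of predicates, $G_r$ the set of groundings of predicate $r$, $MB_x(X_g)$ the state of the Markov blanket of gliteral $g$, and $c_r = 1/|G_r|$.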
30. Alchemy
- Open-source package of MLN software provided by UW that includes:
  - Inference algorithms
  - Weight learning algorithms
  - A structure learning algorithm
  - Sample data sets
- All our software uses and extends Alchemy.
31. TAMAR (Transfer via Automatic Mapping And Revision)
[Figure: TAMAR system diagram; a source MLN is mapped to the target domain and then revised against target (IMDB) data.]
32. Predicate Mapping
- Each clause is mapped independently of the others.
- The algorithm considers all possible ways to map a clause such that:
  - Each predicate in the source clause is mapped to some target predicate.
  - Each argument type in the source is mapped to exactly one argument type in the target.
- Each mapped clause is evaluated by measuring its WPLL on the target data, and the most accurate mapping is kept (see the sketch below).
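A hedged sketch of this per-clause search, not the authors' code: `src_sig`/`tgt_sig` are hypothetical tables mapping predicate names to argument-type lists, `src_preds` holds the distinct predicates of one source clause, and `score_wpll` is a stand-in for evaluating a mapped clause's WPLL on the target data.

```python
from itertools import product

def extend_type_map(type_map, src_types, tgt_types):
    """Extend the source->target type map, or return None on a conflict."""
    if len(src_types) != len(tgt_types):
        return None
    new_map = dict(type_map)
    for s, t in zip(src_types, tgt_types):
        # Each source type must map to exactly one target type.
        if new_map.setdefault(s, t) != t:
            return None
    return new_map

def best_clause_mapping(src_preds, src_sig, tgt_sig, score_wpll):
    """Try every assignment of target predicates to the clause's source
    predicates, keep the type-consistent ones, and return the best scorer."""
    best, best_score = None, float("-inf")
    for choice in product(tgt_sig, repeat=len(src_preds)):
        type_map = {}
        for sp, tp in zip(src_preds, choice):
            type_map = extend_type_map(type_map, src_sig[sp], tgt_sig[tp])
            if type_map is None:
                break
        if type_map is None:
            continue
        mapping = dict(zip(src_preds, choice))
        score = score_wpll(mapping)
        if score > best_score:
            best, best_score = mapping, score
    return best
```

With hypothetical signatures such as Publication(title, person) in the source and a target predicate over (name, person), the search would consider the mapping shown on the next slide, with type map title → name, person → person.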
33. Predicate Mapping Example
[Figure: a source clause mapped to a target clause under the consistent type mapping title → name, person → person.]
34. Predicate Mapping Example 2
[Figure: a second mapping under the consistent type mapping title → person, person → gend.]
35. TAMAR (Transfer via Automatic Mapping And Revision)
[Figure: the TAMAR diagram again, now focusing on the revision of the mapped MLN against target (IMDB) data.]
36. Transfer Learning as Revision
- Regard the mapped source MLN as an approximate model for the target task that needs to be accurately and efficiently revised.
- Thus our general approach is similar to that taken by theory revision systems (Richards & Mooney, 1995).
- Revisions are proposed in a bottom-up fashion.
37. R-TAMAR
[Figure: the R-TAMAR revision pipeline; relational data drives self-diagnosis and new clause discovery, and new candidate clauses are scored by their change in WPLL (e.g., 0.1, -0.2, 0.5, 1.7, 1.3).]
38. R-TAMAR Self-Diagnosis
- Use the mapped source MLN to make inferences in the target and observe the behavior of each clause.
- Consider each predicate P in the domain in turn.
  - Use Gibbs sampling to infer truth values for the gliterals of P, using the remaining gliterals as evidence.
  - Bin the clauses containing gliterals of P based on whether they behave as desired.
- Revisions are focused only on clauses in the "Bad" bins (see the sketch below).
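A rough sketch of this loop under stated assumptions: `gibbs_infer` and `diagnose` are hypothetical stand-ins (the latter assigns a clause to the "good" bin or to a "bad" bin such as "shorten" or "lengthen"); the actual procedure is described in Mihalkova et al. (2007).

```python
from collections import defaultdict

def self_diagnose(mln, gliterals, gibbs_infer, diagnose):
    """Bin each clause of the mapped MLN by its inferred behavior."""
    bins = defaultdict(list)
    predicates = {g.pred for g in gliterals}
    for pred in predicates:
        # Infer the gliterals of one predicate, with the rest as evidence.
        query = [g for g in gliterals if g.pred == pred]
        evidence = [g for g in gliterals if g.pred != pred]
        marginals = gibbs_infer(mln, query, evidence)
        for clause in mln.clauses:
            if clause.mentions(pred):
                bins[diagnose(clause, marginals)].append(clause)
    return bins  # revisions are proposed only for clauses in the "bad" bins
```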
39-42. Self-Diagnosis Clause Bins
[Figure, built up over four animation slides: a mega-example containing Actor(brando), Director(coppola), Movie(godFather, brando), Movie(godFather, coppola), Movie(rainMaker, coppola), and WorkedFor(brando, coppola); with current gliteral Actor(brando), each clause mentioning it is placed in turn into a Good or Bad bin.]
43. Structure Revisions
- Uses directed beam search:
  - Literal deletions are attempted only on clauses marked for shortening.
  - Literal additions are attempted only on clauses marked for lengthening.
- Training is much faster since the search space is constrained by:
  - Limiting the clauses considered for updates.
  - Restricting the type of updates allowed (a sketch of this step follows).
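A minimal sketch of the directed revision step, assuming `bins` comes from self-diagnosis, clauses expose hypothetical `without()`/`plus()` constructors, and `delta_wpll` is a stand-in that scores a candidate clause by its change in WPLL on the target data.

```python
def propose_revisions(bins, candidate_literals, delta_wpll, beam_width=5):
    """Generate directed revisions and keep only the improving ones."""
    candidates = []
    # Deletions only for clauses diagnosed as too long.
    for clause in bins["shorten"]:
        candidates += [clause.without(lit) for lit in clause.literals]
    # Additions only for clauses diagnosed as too short.
    for clause in bins["lengthen"]:
        candidates += [clause.plus(lit) for lit in candidate_literals]
    # Beam keeps the top-scoring candidates; retain only WPLL improvements.
    beam = sorted(candidates, key=delta_wpll, reverse=True)[:beam_width]
    return [c for c in beam if delta_wpll(c) > 0]
```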
44. New Clause Discovery
- Uses relational pathfinding (Richards & Mooney, 1992), illustrated and sketched below.
[Figure: the mega-example (Actor(brando), Director(coppola), Movie(godFather, brando), Movie(godFather, coppola), Movie(rainMaker, coppola), WorkedFor(brando, coppola)) drawn as a graph over the constants brando, coppola, godFather, and rainMaker, with WorkedFor and Movie edges; a path through godFather connects brando and coppola.]
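A simplified sketch of the idea, treating binary gliterals as edges over constants; Richards & Mooney (1992) describe the general algorithm. The gliteral being explained should be removed from the fact list first, so the search finds an alternative connecting path, which is then variablized into a candidate clause body.

```python
from collections import defaultdict, deque

def find_path(gliterals, start, goal, max_len=4):
    """BFS over the constant graph induced by true binary gliterals."""
    graph = defaultdict(list)
    for pred, a, b in gliterals:
        graph[a].append((pred, a, b))
        graph[b].append((pred, a, b))
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        node, path = frontier.popleft()
        if node == goal:
            return path
        if len(path) >= max_len:
            continue
        for pred, a, b in graph[node]:
            nxt = b if node == a else a
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [(pred, a, b)]))
    return None

def variablize(path):
    """Replace constants along the path with fresh variables."""
    names = {}
    def var(c):
        if c not in names:
            names[c] = f"V{len(names)}"
        return names[c]
    return [(pred, var(a), var(b)) for pred, a, b in path]

facts = [("Movie", "godFather", "brando"),
         ("Movie", "godFather", "coppola"),
         ("Movie", "rainMaker", "coppola")]
path = find_path(facts, "brando", "coppola")
print(variablize(path))
# [('Movie', 'V0', 'V1'), ('Movie', 'V0', 'V2')], i.e. Movie(M, A) and
# Movie(M, B): a candidate body for explaining WorkedFor(A, B).
```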
45. Weight Revision
[Figure: the source clause Publication(T,A) ∧ AdvisedBy(A,B) ⇒ Publication(T,B) is mapped and its weights are re-estimated on target (IMDB) data, yielding the weighted clauses Movie(T,A) ∧ WorkedFor(A,B) ⇒ Movie(T,B) and Movie(T,A) ∧ WorkedFor(A,B) ∧ Relative(A,B) ⇒ Movie(T,B).]
46. Experiments: Domains
- UW-CSE
  - Data about members of the UW CSE department.
  - Predicates include Professor, Student, AdvisedBy, TaughtBy, Publication, etc.
- IMDB
  - Data about 20 movies.
  - Predicates include Actor, Director, Movie, WorkedFor, Genre, etc.
- WebKB
  - Entity relations from the original WebKB domain (Craven et al., 1998).
  - Predicates include Faculty, Student, Project, CourseTA, etc.
47. Dataset Statistics
Data is organized as mega-examples:
- Each mega-example contains information about a group of related entities.
- Mega-examples are independent and disconnected from each other.
48. Manually Developed Source KB
- UW-KB is a hand-built knowledge base (set of clauses) for the UW-CSE domain.
- When used as a source domain, transfer learning is a form of theory refinement that also includes mapping to a new domain with a different representation.
49. Systems Compared
- TAMAR: the complete transfer system.
- ScrKD: the algorithm of Kok & Domingos (2005), learning from scratch.
- TrKD: the algorithm of Kok & Domingos (2005) performing transfer, using M-TAMAR to produce a mapping.
50. Methodology: Training and Testing
- Generated learning curves using leave-one-out cross-validation:
  - Each run keeps one mega-example for testing and trains on the remaining ones, provided one by one.
  - Curves are averages over all runs.
- Evaluated the learned MLN by performing inference for all gliterals of each predicate in turn, providing the rest as evidence, and averaging the results.
51. Methodology: Metrics (Kok & Domingos, 2005)
- CLL: conditional log-likelihood
  - The log of the probability predicted by the model that a gliteral has the correct truth value given in the data.
  - Averaged over all test gliterals.
- AUC: area under the precision-recall (PR) curve
  - Produce a PR curve by varying the probability threshold.
  - Compute the area under this curve.
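A reconstruction consistent with the CLL description above: for a set $G$ of test gliterals with true values $x_g$ and evidence $e$,

$$\mathrm{CLL} \;=\; \frac{1}{|G|}\sum_{g \in G} \log P\big(X_g = x_g \mid e\big)$$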
52. Metrics to Summarize Curves
- Transfer ratio (Cohen et al., 2007)
  - Gives an overall idea of the improvement achieved over learning from scratch (one formulation below).
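The defining formula is not in the transcript; following Cohen et al. (2007), the transfer ratio compares areas under the two learning curves, so a value above 1 indicates beneficial transfer:

$$\mathrm{TR} \;=\; \frac{\text{area under the learning curve with transfer}}{\text{area under the learning curve from scratch}}$$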
53. Transfer Scenarios
- Source/target pairs tested:
  - WebKB → IMDB
  - UW-CSE → IMDB
  - UW-KB → IMDB
  - WebKB → UW-CSE
  - IMDB → UW-CSE
- WebKB was not used as a target since one mega-example is sufficient to learn an accurate theory for its limited predicate set.
56. Sample Learning Curve
[Figure: learning curves for ScrKD; TrKD with hand mapping; TAMAR with hand mapping; TrKD; and TAMAR.]
58. Future Research Issues
- More realistic application domains.
- Application to other SRL models (e.g., SLPs, BLPs).
- More flexible predicate mapping:
  - Allow argument ordering or arity to change.
  - Map one predicate to a conjunction of predicates, e.g., AdvisedBy(X,Y) ↔ Movie(M,X) ∧ Director(M,Y)
59. Multiple Source Transfer
- Transfer from multiple source problems to a given target problem.
- Determine which clauses to map and revise from different source MLNs.
60. Source Selection
- Select useful source domains from a large number of previously learned tasks.
- Ideally, picking source domain(s) is sub-linear in the number of previously learned tasks.
61. Conclusions
- Presented TAMAR, a complete transfer system for SRL that:
  - Maps relational knowledge in the source to the target domain.
  - Revises the mapped knowledge to further improve accuracy.
- Showed experimentally that TAMAR improves speed and accuracy over existing methods.
62. Questions?
- Related papers at:
  - http://www.cs.utexas.edu/users/ml/publication/transfer.html
63. Why MLNs?
- Inherit the expressivity of first-order logic:
  - Can apply insights from ILP.
- Inherit the flexibility of probabilistic graphical models:
  - Can deal with noisy, uncertain environments.
- Undirected models:
  - Do not need to learn causal directions.
- Subsume all other SRL models that are special cases of first-order logic or probabilistic graphical models (Richardson, 2004).
- Publicly available software package: Alchemy.
64. Predicate Mapping Comments
- A particular source predicate can be mapped to different target predicates in different clauses.
  - This makes our approach context-sensitive.
  - It also makes it more scalable:
    - In the worst case, the number of mappings is exponential in the number of predicates.
    - The number of predicates in a clause is generally much smaller than the total number of predicates in a domain.
65. Relationship to the Structure Mapping Engine (Falkenhainer et al., 1989)
- A system for mapping relations using analogy, based on a psychological theory.
- Mappings are evaluated based only on the structural relational similarity between the two domains.
  - Does not consider the accuracy of mapped knowledge in the target when determining the preferred mapping.
- Determines a single global mapping for a given source and target.
66. Summary of Methodology
- Learn MLNs for each point on the learning curve.
- Perform inference over the learned models.
- Summarize inference results using two metrics, CLL and AUC, thus producing two learning curves.
- Summarize each learning curve using the transfer ratio and the percentage improvement from one mega-example.