Title: Markov Logic in Natural Language Processing
1Markov Logic in Natural Language Processing
- Hoifung Poon
- Dept. of Computer Science & Engineering
- University of Washington
2Overview
- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
3Languages Are Structural
governments lmpxtm (according
to their families)
4Languages Are Structural
govern-ment-s  l-mpx-t-m (according to their families)
[Figure: parse tree S → NP VP, VP → V NP, for "IL-4 induces CD11B"]
Involvement of p70(S6)-kinase activation in IL-10
up-regulation in human monocytes by gp41 ...
George Walker Bush was the 43rd President of the
United States. Bush was the eldest son of
President G. H. W. Bush and Barbara Bush. ... In
November 1977, he met Laura Welch at a barbecue.
[Figure: nested event structure with Theme/Cause/Site argument edges linking involvement, up-regulation, activation, IL-10, gp41, p70(S6)-kinase, and human monocyte]
6Languages Are Structural
- Objects are not just feature vectors
- They have parts and subparts
- Which have relations with each other
- They can be trees, graphs, etc.
- Objects are seldom i.i.d. (independent and identically distributed)
- They exhibit local and global dependencies
- They form class hierarchies (with multiple inheritance)
- Object properties depend on those of related objects
- Deeply interwoven with knowledge
7First-Order Logic
- Main theoretical foundation of computer science
- General language for describing complex structures and knowledge
- Trees, graphs, dependencies, hierarchies, etc. easily expressed
- Inference algorithms (satisfiability testing, theorem proving, etc.)
8Languages Are Statistical
Microsoft buys Powerset
Microsoft acquires Powerset
Powerset is acquired by Microsoft Corporation
The Redmond software giant buys Powerset
Microsoft's purchase of Powerset, ...
[Figure: two parses of "I saw the man with the telescope", attaching the PP to the NP or to the VP]
G. W. Bush ... Laura Bush ... Mrs. Bush. Which one?
Here in London, Frances Deek is a retired teacher ...
In the Israeli town ..., Karen London says ... Now London says ...
London? PERSON or LOCATION?
9Languages Are Statistical
- Languages are ambiguous
- Our information is always incomplete
- We need to model correlations
- Our predictions are uncertain
- Statistics provides the tools to handle this
10Probabilistic Graphical Models
- Mixture models
- Hidden Markov models
- Bayesian networks
- Markov random fields
- Maximum entropy models
- Conditional random fields
- Etc.
11The Problem
- Logic is deterministic, requires manual coding
- Statistical models assume i.i.d. data, objects = feature vectors
- Historically, statistical and logical NLP have been pursued separately
- We need to unify the two!
- Burgeoning field in machine learning:
- Statistical relational learning
12Costs and Benefits of Statistical Relational Learning
- Benefits
- Better predictive accuracy
- Better understanding of domains
- Enable learning with less or no labeled data
- Costs
- Learning is much harder
- Inference becomes a crucial issue
- Greater complexity for user
13Progress to Date
- Probabilistic logic [Nilsson, 1986]
- Statistics and beliefs [Halpern, 1990]
- Knowledge-based model construction [Wellman et al., 1992]
- Stochastic logic programs [Muggleton, 1996]
- Probabilistic relational models [Friedman et al., 1999]
- Relational Markov networks [Taskar et al., 2002]
- Etc.
- This talk: Markov logic [Domingos & Lowd, 2009]
14Markov Logic: A Unifying Framework
- Probabilistic graphical models and first-order logic are special cases
- Unified inference and learning algorithms
- Easy-to-use software: Alchemy
- Broad applicability
- Goal of this tutorial: Quickly learn how to use Markov logic and Alchemy for a broad spectrum of NLP applications
15Overview
- Motivation
- Foundational areas
- Probabilistic inference
- Statistical learning
- Logical inference
- Inductive logic programming
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
16Markov Networks
- Undirected graphical models
[Graph over Smoking, Cancer, Asthma, Cough]
- Potential functions defined over cliques

Smoking | Cancer | Φ(S,C)
False   | False  | 4.5
False   | True   | 4.5
True    | False  | 2.7
True    | True   | 4.5
17Markov Networks
- Undirected graphical models
[Graph over Smoking, Cancer, Asthma, Cough]
- Log-linear form:
P(x) = (1/Z) exp( Σi wi fi(x) )
where wi is the weight of feature i and fi(x) is feature i
18Markov Nets vs. Bayes Nets
Property        | Markov Nets      | Bayes Nets
Form            | Prod. potentials | Prod. potentials
Potentials      | Arbitrary        | Cond. probabilities
Cycles          | Allowed          | Forbidden
Partition func. | Z = ?            | Z = 1
Indep. check    | Graph separation | D-separation
Indep. props.   | Some             | Some
Inference       | MCMC, BP, etc.   | Convert to Markov
19Inference in Markov Networks
- Goal: compute marginals and conditionals of the distribution
- Exact inference is #P-complete
- Conditioning on the Markov blanket is easy
- Gibbs sampling exploits this
20MCMC: Gibbs Sampling
state ← random truth assignment
for i ← 1 to num-samples do
    for each variable x
        sample x according to P(x | neighbors(x))
        state ← state with new value of x
P(F) ← fraction of states in which F is true
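To make the loop concrete, here is a minimal Python sketch of Gibbs sampling for a tiny two-variable network. The features, weights, and names (Smoking, Cancer) are illustrative assumptions, and for simplicity each update rescores the whole state rather than just the Markov blanket.

import math, random

# Log-linear features over Boolean variables S (Smoking) and C (Cancer):
# each entry is (feature function, weight). Values here are illustrative.
features = [
    (lambda a: a["S"] and a["C"], 1.5),   # soft rule: Smoking => Cancer
    (lambda a: a["S"], -0.5),             # prior against Smoking
]

def score(a):
    # Unnormalized log-probability: sum of weights of the true features
    return sum(w for f, w in features if f(a))

def gibbs(num_samples=20000):
    state = {"S": random.random() < 0.5, "C": random.random() < 0.5}
    hits = 0
    for _ in range(num_samples):
        for x in list(state):             # resample each variable in turn
            state[x] = True
            p_true = math.exp(score(state))
            state[x] = False
            p_false = math.exp(score(state))
            state[x] = random.random() < p_true / (p_true + p_false)
        hits += state["C"]                # P(F) = fraction of states where F holds
    return hits / num_samples

print("P(Cancer) ~", gibbs())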
21Other Inference Methods
- Belief propagation (sum-product)
- Mean field / Variational approximations
22MAP/MPE Inference
- Goal: Find most likely state of world given evidence
arg max_y P(y | x)   (y: query, x: evidence)
23MAP Inference Algorithms
- Iterated conditional modes
- Simulated annealing
- Graph cuts
- Belief propagation (max-product)
- LP relaxation
24Overview
- Motivation
- Foundational areas
- Probabilistic inference
- Statistical learning
- Logical inference
- Inductive logic programming
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
25Generative Weight Learning
- Maximize likelihood
- Use gradient ascent or L-BFGS
- No local maxima
∂/∂wi log P(x) = ni(x) − E[ni(x)]
(no. of times feature i is true in data, minus expected no. of times it is true according to the model)
- Requires inference at each step (slow!)
26Pseudo-Likelihood
PL(x) = Πi P(xi | neighbors(xi))
- Likelihood of each variable given its neighbors in the data
- Does not require inference at each step
- Widely used in vision, spatial statistics, etc.
- But PL parameters may not work well for long inference chains
27Discriminative Weight Learning
- Maximize conditional likelihood of query (y) given evidence (x)
∂/∂wi log P(y|x) = ni(x,y) − E[ni(x,y)]
(no. of true groundings of clause i in data, minus expected no. of true groundings according to the model)
- Approximate expected counts by counts in MAP state of y given x
28Voted Perceptron
- Originally proposed for training HMMs discriminatively
- Assumes network is linear chain
- Can be generalized to arbitrary networks

wi ← 0
for t ← 1 to T do
    yMAP ← Viterbi(x)
    wi ← wi + η [counti(yData) − counti(yMAP)]
return Σt wi / T
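A minimal Python sketch of the same loop, as an averaged perceptron on a toy labeling problem; the two features and the brute-force argmax (standing in for Viterbi) are illustrative assumptions.

from itertools import product

def counts(x, y):
    # Two features: emission agreement and transition agreement
    emit = sum(1 for xt, yt in zip(x, y) if xt == yt)
    trans = sum(1 for a, b in zip(y, y[1:]) if a == b)
    return [emit, trans]

def map_state(x, w):
    # Brute-force argmax over label sequences (Viterbi would do this efficiently)
    return max(product((0, 1), repeat=len(x)),
               key=lambda y: sum(wi * ci for wi, ci in zip(w, counts(x, y))))

data = [((0, 0, 1), (0, 0, 1)), ((1, 1, 0), (1, 1, 0))]
w, w_sum, T, eta = [0.0, 0.0], [0.0, 0.0], 10, 0.1
for _ in range(T):
    for x, y_true in data:
        y_map = map_state(x, w)
        c_true, c_map = counts(x, y_true), counts(x, y_map)
        for i in range(len(w)):   # wi += eta * (counti(yData) - counti(yMAP))
            w[i] += eta * (c_true[i] - c_map[i])
        w_sum = [s + wi for s, wi in zip(w_sum, w)]
print("averaged weights:", [s / (T * len(data)) for s in w_sum])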
29Overview
- Motivation
- Foundational areas
- Probabilistic inference
- Statistical learning
- Logical inference
- Inductive logic programming
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
30First-Order Logic
- Constants, variables, functions, predicates
  E.g.: Anna, x, MotherOf(x), Friends(x, y)
- Literal: Predicate or its negation
- Clause: Disjunction of literals
- Grounding: Replace all variables by constants
  E.g.: Friends(Anna, Bob)
- World (model, interpretation): Assignment of truth values to all ground predicates
31Inference in First-Order Logic
- Traditionally done by theorem proving (e.g., Prolog)
- Propositionalization followed by model checking turns out to be faster (often by a lot)
- Propositionalization: Create all ground atoms and clauses
- Model checking: Satisfiability testing
- Two main approaches:
- Backtracking (e.g., DPLL)
- Stochastic local search (e.g., WalkSAT)
32Satisfiability
- Input: Set of clauses (convert KB to conjunctive normal form (CNF))
- Output: Truth assignment that satisfies all clauses, or failure
- The paradigmatic NP-complete problem
- Solution: Search
- Key point: Most SAT problems are actually easy
- Hard region: Narrow range of #Clauses / #Variables
33Stochastic Local Search
- Uses complete assignments instead of partial
- Start with random state
- Flip variables in unsatisfied clauses
- Hill-climbing: Minimize # unsatisfied clauses
- Avoid local minima: Random flips
- Multiple restarts
34The WalkSAT Algorithm
for i ? 1 to max-tries do solution random
truth assignment for j ? 1 to max-flips do
if all clauses satisfied then
return solution c ? random unsatisfied
clause with probability p
flip a random variable in c else
flip variable in c that maximizes satisfied
clauses return failure
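Here is a minimal runnable Python version of the same algorithm; clauses are tuples of signed integers (a common DIMACS-style convention), and the instance at the bottom is illustrative.

import random

def walksat(clauses, n_vars, max_tries=10, max_flips=1000, p=0.5):
    def satisfied(clause, a):
        return any((lit > 0) == a[abs(lit)] for lit in clause)
    for _ in range(max_tries):
        a = {v: random.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(max_flips):
            unsat = [c for c in clauses if not satisfied(c, a)]
            if not unsat:
                return a                       # all clauses satisfied
            c = random.choice(unsat)
            if random.random() < p:            # random-walk step
                v = abs(random.choice(c))
            else:                              # greedy step
                def gain(v):
                    a[v] = not a[v]
                    n = sum(satisfied(cl, a) for cl in clauses)
                    a[v] = not a[v]
                    return n
                v = max((abs(lit) for lit in c), key=gain)
            a[v] = not a[v]
    return None                                # failure

# (x1 v !x2) ^ (x2 v x3) ^ (!x1 v !x3)
print(walksat([(1, -2), (2, 3), (-1, -3)], 3))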
35Overview
- Motivation
- Foundational areas
- Probabilistic inference
- Statistical learning
- Logical inference
- Inductive logic programming
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
36Rule Induction
- Given: Set of positive and negative examples of some concept
- Example: (x1, x2, …, xn, y)
- y: concept (Boolean)
- x1, x2, …, xn: attributes (assume Boolean)
- Goal: Induce a set of rules that cover all positive examples and no negative ones
- Rule: xa ∧ xb ∧ … ⇒ y (xa: literal, i.e., xi or its negation)
- Same as Horn clause: Body ⇒ Head
- Rule r covers example x iff x satisfies body of r
- Eval(r): Accuracy, info gain, coverage, support, etc.
37Learning a Single Rule
head ? y body ? Ø repeat for each literal x
rx ? r with x added to body
Eval(rx) body ? body best x until no x
improves Eval(r) return r
38Learning a Set of Rules
R ? Ø S ? examples repeat learn a single rule
r R ? R U r S ? S - positive examples
covered by r until S Ø return R
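The two loops above fit in a few lines of Python; here is a minimal sketch under illustrative assumptions (Boolean attribute vectors, rule evaluation by accuracy).

def covers(body, x):                   # body: set of (attribute index, value) literals
    return all(x[i] == v for i, v in body)

def eval_rule(body, pos, neg):         # accuracy of the rule on remaining data
    covered_pos = sum(covers(body, x) for x in pos)
    covered_neg = sum(covers(body, x) for x in neg)
    total = covered_pos + covered_neg
    return covered_pos / total if total else 0.0

def learn_one_rule(pos, neg, n_attrs):
    body, best = set(), 0.0
    while True:
        candidates = [(i, v) for i in range(n_attrs) for v in (True, False)
                      if (i, v) not in body]
        score, lit = max((eval_rule(body | {lit}, pos, neg), lit)
                         for lit in candidates)
        if score <= best:              # no literal improves Eval(r)
            return body
        body, best = body | {lit}, score

def learn_rule_set(pos, neg, n_attrs):
    rules = []
    while pos:                         # until no positive examples remain
        r = learn_one_rule(pos, neg, n_attrs)
        rules.append(r)
        pos = [x for x in pos if not covers(r, x)]
    return rules

pos = [(True, True, False), (True, False, False)]
neg = [(False, True, True), (False, False, True)]
print(learn_rule_set(pos, neg, 3))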
39First-Order Rule Induction
- y and xi are now predicates with arguments
  E.g.: y is Ancestor(x,y), xi is Parent(x,y)
- Literals to add are predicates or their negations
- Literal to add must include at least one variable already appearing in rule
- Adding a literal changes groundings of rule
  E.g.: Ancestor(x,z) ∧ Parent(z,y) ⇒ Ancestor(x,y)
- Eval(r) must take this into account
  E.g.: Multiply by # positive groundings of rule still covered after adding literal
40Overview
- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
41Markov Logic
- Syntax: Weighted first-order formulas
- Semantics: Feature templates for Markov networks
- Intuition: Soften logical constraints
- Give each formula a weight (higher weight ⇒ stronger constraint)
42Example Coreference Resolution
Barack Obama, the 44th President of the United
States, is the first African American to hold the
office.
45Example Coreference Resolution
Two mention constants: A and B
Apposition(A,B)
Head(A,President)
Head(B,President)
MentionOf(A,Obama)
MentionOf(B,Obama)
Head(A,Obama)
Head(B,Obama)
Apposition(B,A)
46Markov Logic Networks
- MLN is template for ground Markov nets
- Probability of a world x:
P(x) = (1/Z) exp( Σi wi ni(x) )
where wi is the weight of formula i and ni(x) is the no. of true groundings of formula i in x
- Typed variables and constants greatly reduce size of ground Markov net
- Functions, existential quantifiers, etc.
- Can handle infinite domains [Singla & Domingos, 2007] and continuous domains [Wang & Domingos, 2008]
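To see the semantics in action, the sketch below enumerates every world of a two-constant domain and computes P(x) directly from the definition; the two formulas and their weights are the classic smoking example, used here as an illustrative assumption.

import math
from itertools import product

people = ["A", "B"]
atoms = ([("Smokes", p) for p in people] + [("Cancer", p) for p in people] +
         [("Friends", p, q) for p in people for q in people])

def n1(x):  # true groundings of: Smokes(p) => Cancer(p)
    return sum((not x[("Smokes", p)]) or x[("Cancer", p)] for p in people)

def n2(x):  # true groundings of: Friends(p,q) => (Smokes(p) <=> Smokes(q))
    return sum((not x[("Friends", p, q)]) or (x[("Smokes", p)] == x[("Smokes", q)])
               for p in people for q in people)

formulas = [(n1, 1.5), (n2, 1.1)]
worlds = [dict(zip(atoms, vals))
          for vals in product((False, True), repeat=len(atoms))]
# P(x) = exp(sum_i wi ni(x)) / Z, with Z summing over all 2^8 worlds
scores = [math.exp(sum(w * n(x) for n, w in formulas)) for x in worlds]
Z = sum(scores)
p_cancer_A = sum(s for x, s in zip(worlds, scores) if x[("Cancer", "A")]) / Z
print("Z =", Z, "  P(Cancer(A)) =", p_cancer_A)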
47Relation to Statistical Models
- Special cases
- Markov networks
- Markov random fields
- Bayesian networks
- Log-linear models
- Exponential models
- Max. entropy models
- Gibbs distributions
- Boltzmann machines
- Logistic regression
- Hidden Markov models
- Conditional random fields
- Obtained by making all predicates zero-arity
- Markov logic allows objects to be interdependent
(non-i.i.d.)
48Relation to First-Order Logic
- Infinite weights ⇒ First-order logic
- Satisfiable KB, positive weights ⇒ Satisfying assignments = Modes of distribution
- Markov logic allows contradictions between formulas
49MLN Algorithms: The First Three Generations
Problem            | First generation        | Second generation | Third generation
MAP inference      | Weighted satisfiability | Lazy inference    | Cutting planes
Marginal inference | Gibbs sampling          | MC-SAT            | Lifted inference
Weight learning    | Pseudo-likelihood       | Voted perceptron  | Scaled conj. gradient
Structure learning | Inductive logic progr.  | ILP + PL (etc.)   | Clustering + pathfinding
50MAP/MPE Inference
- Problem: Find most likely state of world given evidence
arg max_y P(y | x)   (y: query, x: evidence)
53MAP/MPE Inference
- Problem: Find most likely state of world given evidence
- This is just the weighted MaxSAT problem
- Use weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997])
54The MaxWalkSAT Algorithm
for i ← 1 to max-tries do
    solution ← random truth assignment
    for j ← 1 to max-flips do
        if Σ weights(sat. clauses) > threshold then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip variable in c that maximizes Σ weights(sat. clauses)
return failure, best solution found
55Computing Probabilities
- P(Formula | MLN, C) = ?
- MCMC: Sample worlds, check formula holds
- P(Formula1 | Formula2, MLN, C) = ?
- If Formula2 = conjunction of ground atoms:
- First construct min subset of network necessary to answer query (generalization of KBMC)
- Then apply MCMC
56But Insufficient for Logic
- Problem: Deterministic dependencies break MCMC; near-deterministic ones make it very slow
- Solution: Combine MCMC and WalkSAT → the MC-SAT algorithm [Poon & Domingos, 2006]
57Auxiliary-Variable Methods
- Main ideas:
- Use auxiliary variables to capture dependencies
- Turn difficult sampling into uniform sampling
- Given distribution P(x), define f(x, u) whose marginal over u is P(x)
- Sample from f(x, u), then discard u
58Slice Sampling [Damien et al., 1999]
[Figure: given the current sample x(k), draw u(k) uniformly from [0, P(x(k))]; then draw x(k+1) uniformly from the slice { x : P(x) ≥ u(k) }]
59Slice Sampling
- Identifying the slice may be difficult
- Introduce an auxiliary variable ui for each potential Φi
60The MC-SAT Algorithm
- Select random subset M of satisfied clauses
- Each clause Ci is selected with probability 1 − exp(−wi)
- Larger wi ⇒ Ci more likely to be selected
- Hard clause (wi → ∞): Always selected
- Slice = states that satisfy all clauses in M
- Uses SAT solver to sample x | u
- Orders of magnitude faster than Gibbs sampling, etc.
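The clause-selection (slice) step is easy to show concretely. Below is a minimal Python sketch of just that step, with made-up clauses and weights; a full MC-SAT implementation would then call a near-uniform SAT sampler over the selected set M, which is omitted here.

import math, random

def choose_M(satisfied_clauses):
    # satisfied_clauses: list of (clause, weight) satisfied by the current state
    M = []
    for clause, w in satisfied_clauses:
        # hard clauses (w -> inf) are always selected; exp(-inf) == 0.0
        if random.random() < 1 - math.exp(-w):
            M.append(clause)
    return M

sat = [((1, -2), 2.0), ((2, 3), 0.5), ((-1, -3), float("inf"))]
print(choose_M(sat))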
61But It Is Not Scalable
- 1000 researchers
- Coauthor(x,y): 1 million ground atoms
- Coauthor(x,y) ^ Coauthor(y,z) => Coauthor(x,z): 1 billion ground clauses
- Exponential in arity
62Sparsity to the Rescue
- 1000 researchers
- Coauthor(x,y): 1 million ground atoms
- But most atoms are false
- Coauthor(x,y) ^ Coauthor(y,z) => Coauthor(x,z): 1 billion ground clauses
- Most trivially satisfied if most atoms are false
- No need to explicitly compute most of them
63Lazy Inference
- LazySAT [Singla & Domingos, 2006a]
- Lazy version of WalkSAT [Selman et al., 1996]
- Grounds atoms/clauses as needed
- Greatly reduces memory usage
- The idea is much more general [Poon & Domingos, 2008a]
64General Method for Lazy Inference
- If most variables assume the default value,
wasteful to instantiate all variables / functions - Main idea
- Allocate memory for a small subset of
- active variables / functions
- Activate more if necessary as inference proceeds
- Applicable to a diverse set of algorithms
Satisfiability solvers (systematic,
local-search), Markov chain Monte Carlo, MPE /
MAP algorithms, Maximum expected utility
algorithms, Belief propagation, MC-SAT, Etc. - Reduce memory and time by orders of magnitude
65Lifted Inference
- Consider belief propagation (BP)
- Often in large problems, many nodes are interchangeable: they send and receive the same messages throughout BP
- Basic idea: Group them into supernodes, forming lifted network
- Smaller network ⇒ Faster inference
- Akin to resolution in first-order logic
66Belief Propagation
[Figure: factor graph with feature nodes (f) and variable nodes (x) passing messages]
67Lifted Belief Propagation
[Figure: interchangeable features and nodes grouped into supernodes]
68Lifted Belief Propagation
[Figure: lifted network; message multiplicities α, β are functions of edge counts]
69Learning
- Data is a relational database
- Closed world assumption (if not EM)
- Learning parameters (weights)
- Learning structure (formulas)
70Parameter Learning
- Parameter tying: Groundings of same clause
- Generative learning: Pseudo-likelihood
- Discriminative learning: Conditional likelihood; use MC-SAT or MaxWalkSAT for inference
∂/∂wi log P(x) = ni(x) − E[ni(x)]
(no. of times clause i is true in data, minus expected no. of times clause i is true according to the MLN)
71Parameter Learning
- Pseudo-likelihood + L-BFGS is fast and robust but can give poor inference results
- Voted perceptron: Gradient descent + MAP inference
- Scaled conjugate gradient
72Voted Perceptron for MLNs
- HMMs are special case of MLNs
- Replace Viterbi by MaxWalkSAT
- Network can now be arbitrary graph

wi ← 0
for t ← 1 to T do
    yMAP ← MaxWalkSAT(x)
    wi ← wi + η [counti(yData) − counti(yMAP)]
return Σt wi / T
73Problem: Multiple Modes
- Not alleviated by contrastive divergence
- Alleviated by MC-SAT
- Warm start: Start each MC-SAT run at previous end state
74Problem: Extreme Ill-Conditioning
- Solvable by quasi-Newton, conjugate gradient, etc.
- But line searches require exact inference
- Solution: Scaled conjugate gradient [Lowd & Domingos, 2007]
- Use Hessian to choose step size
- Compute quadratic form inside MC-SAT
- Use inverse diagonal Hessian as preconditioner
75Structure Learning
- Standard inductive logic programming optimizes the wrong thing
- But can be used to overgenerate for L1 pruning
- Our approach: ILP + Pseudo-likelihood + Structure priors
- For each candidate structure change: Start from current weights and relax convergence
- Use subsampling to compute sufficient statistics
76Structure Learning
- Initial state: Unit clauses or prototype KB
- Operators: Add/remove literal, flip sign
- Evaluation function: Pseudo-likelihood + Structure prior
- Search: Beam search, shortest-first search
77Alchemy
- Open-source software including:
- Full first-order logic syntax
- Generative and discriminative weight learning
- Structure learning
- Weighted satisfiability, MCMC, lifted BP
- Programming language features
alchemy.cs.washington.edu
78Alchemy vs. Prolog vs. BUGS
               | Alchemy                         | Prolog          | BUGS
Representation | F.O. logic + Markov nets        | Horn clauses    | Bayes nets
Inference      | Model checking, MCMC, lifted BP | Theorem proving | MCMC
Learning       | Parameters and structure        | No              | Params.
Uncertainty    | Yes                             | No              | Yes
Relational     | Yes                             | Yes             | No
79Constrained Conditional Model
- Representation: Integer linear programs
- Local classifiers + Global constraints
- Inference: LP solver
- Parameter learning: None for constraints
- Weights of soft constraints set heuristically
- Local weights typically learned independently
- Structure learning: None to date
- But see latest development in NAACL-10
80Running Alchemy
- Programs
- Infer
- Learnwts
- Learnstruct
- Options
- MLN file
- Types (optional)
- Predicates
- Formulas
- Database files
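A typical session chains these programs together. The sketch below is assembled from memory of the Alchemy documentation, so treat the flag names (-d for discriminative learning, -t for the training database, -ne for non-evidence predicates, -ms for MC-SAT, -e/-r/-q for evidence, results, and query) as assumptions to verify against alchemy.cs.washington.edu:

# learn weights discriminatively for the non-evidence predicate Topic
learnwts -d -i program.mln -o learned.mln -t train.db -ne Topic
# run MC-SAT inference over new evidence, writing marginals for Topic
infer -ms -i learned.mln -e evidence.db -r topic.results -q Topic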
81Overview
- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
82Uniform Distribn.: Empty MLN
- Example: Unbiased coin flips
- Type: flip = { 1, …, 20 }
- Predicate: Heads(flip)
83Binomial Distribn.: Unit Clause
- Example: Biased coin flips
- Type: flip = { 1, …, 20 }
- Predicate: Heads(flip)
- Formula: Heads(f)
- Weight: Log odds of heads
- By default, MLN includes unit clauses for all predicates (captures marginal distributions, etc.)
84Multinomial Distribution
- Example Throwing die
- Types throw 1, , 20
- face 1, , 6
- Predicate Outcome(throw,face)
- Formulas Outcome(t,f) f ! f gt
!Outcome(t,f). - Exist f Outcome(t,f).
- Too cumbersome!
85Multinomial Distrib.: ! Notation
- Example: Throwing die
- Types: throw = { 1, …, 20 }, face = { 1, …, 6 }
- Predicate: Outcome(throw,face!)
- Formulas: (none needed)
- Semantics: Arguments without ! determine arguments with !
- Also makes inference more efficient (triggers blocking)
86Multinomial Distrib.: + Notation
- Example: Throwing biased die
- Types: throw = { 1, …, 20 }, face = { 1, …, 6 }
- Predicate: Outcome(throw,face!)
- Formulas: Outcome(t,+f)
- Semantics: Learn a separate weight for each grounding of the arguments with +
87Logistic Regression (MaxEnt)
Logistic regression Type
obj 1, ... , n Query predicate
C(obj) Evidence predicates Fi(obj) Formulas
a C(x)
bi Fi(x) C(x) Resulting distribution
Therefore Alternative form Fi(x) gt
C(x)
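A small numeric check of this equivalence: with a unit-clause weight a on C(x) and weights bi on Fi(x) ^ C(x), conditioning the MLN on the evidence f yields exactly the logistic sigmoid of a + Σi bi fi. The weights and evidence values below are illustrative.

import math

a, b = -1.0, [2.0, 0.5]
f = [1, 0]                              # observed evidence F1(x)=1, F2(x)=0

def unnorm(c):                          # exp(a*c + sum_i bi * fi * c)
    return math.exp(a * c + sum(bi * fi * c for bi, fi in zip(b, f)))

p_mln = unnorm(1) / (unnorm(0) + unnorm(1))
p_logreg = 1 / (1 + math.exp(-(a + sum(bi * fi for bi, fi in zip(b, f)))))
print(p_mln, p_logreg)                  # identical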
88Hidden Markov Models
obs Red, Green, Yellow state Stop,
Drive, Slow time 0, ..., 100
State(state!,time) Obs(obs!,time) State(s,0)
State(s,t) State(s',t1) Obs(o,t)
State(s,t) Sparse HMM State(s,t) gt
State(s1,t1) v State(s2, t1) v ... .
89Bayesian Networks
- Use all binary predicates with same first argument (the object x)
- One predicate for each variable A: A(x,v!)
- One clause for each line in the CPT and value of the variable
- Context-specific independence: One clause for each path in the decision tree
- Logistic regression: As before
- Noisy OR: Deterministic OR + Pairwise clauses
90Relational Models
- Knowledge-based model construction
- Allow only Horn clauses
- Same as Bayes nets, except arbitrary relations
- Combin. function: Logistic regression, noisy-OR, or external
- Stochastic logic programs
- Allow only Horn clauses
- Weight of clause = log(p)
- Add formulas: Head holds ⇒ Exactly one body holds
- Probabilistic relational models
- Allow only binary relations
- Same as Bayes nets, except first argument can vary
91Relational Models
- Relational Markov networks
- SQL → Datalog → First-order logic
- One clause for each state of a clique
- + syntax in Alchemy facilitates this
- Bayesian logic
- Object = Cluster of similar/related observations
- Observation constants + Object constants
- Predicate InstanceOf(Obs,Obj) and clauses using it
- Unknown relations: Second-order Markov logic
- More: S. Kok and P. Domingos, "Statistical Predicate Invention", in Proc. ICML-2007.
92Overview
- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
93Text Classification
The 56th quadrennial United States presidential
election was held on November 4, 2008. Outgoing
Republican President George W. Bush's policies
and actions and the American public's desire for
change were key issues throughout the campaign.
Topic: politics
The Chicago Bulls are an American professional
basketball team based in Chicago, Illinois,
playing in the Central Division of the Eastern
Conference in the National Basketball Association
(NBA).
Topic: sports
94Text Classification
page = { 1, …, max }
word = { … }
topic = { … }

Topic(page,topic)
HasWord(page,word)

Topic(p,+t)
HasWord(p,+w) => Topic(p,+t)

If topics mutually exclusive: Topic(page,topic!)
95Text Classification
page = { 1, …, max }
word = { … }
topic = { … }

Topic(page,topic)
HasWord(page,word)
Links(page,page)

Topic(p,+t)
HasWord(p,+w) => Topic(p,+t)
Topic(p,t) ^ Links(p,p') => Topic(p',t)

Cf. S. Chakrabarti, B. Dom and P. Indyk, "Hypertext Classification Using Hyperlinks", in Proc. SIGMOD-1998.
96Entity Resolution
AUTHOR H. POON P. DOMINGOS TITLE UNSUPERVISED
SEMANTIC PARSING VENUE EMNLP-09
SAME?
AUTHOR Hoifung Poon and Pedro Domings TITLE
Unsupervised semantic parsing VENUE Proceedings
of the 2009 Conference on Empirical Methods in
Natural Language Processing
AUTHOR Poon, Hoifung and Domings, Pedro TITLE
Unsupervised ontology induction from text VENUE
Proceedings of the Forty-Eighth Annual Meeting of
the Association for Computational Linguistics
SAME?
AUTHOR H. Poon, P. Domings TITLE Unsupervised
ontology induction VENUE ACL-10
97Entity Resolution
Problem Given database, find duplicate
records HasToken(token,field,record) SameField(fi
eld,record,record) SameRecord(record,record) HasT
oken(t,f,r) HasToken(t,f,r) gt
SameField(f,r,r) SameField(f,r,r) gt
SameRecord(r,r)
98Entity Resolution
Problem: Given database, find duplicate records

HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)

HasToken(+t,f,r) ^ HasToken(+t,f,r') => SameField(f,r,r')
SameField(f,r,r') => SameRecord(r,r')
SameRecord(r,r') ^ SameRecord(r',r'') => SameRecord(r,r'')

Cf. A. McCallum and B. Wellner, "Conditional Models of Identity Uncertainty with Application to Noun Coreference", in Adv. NIPS 17, 2005.
99Entity Resolution
Can also resolve fields HasToken(token,field,rec
ord) SameField(field,record,record) SameRecord(rec
ord,record) HasToken(t,f,r)
HasToken(t,f,r) gt SameField(f,r,r) SameFi
eld(f,r,r) ltgt SameRecord(r,r) SameRecord(r,r)
SameRecord(r,r) gt SameRecord(r,r) SameFi
eld(f,r,r) SameField(f,r,r) gt
SameField(f,r,r) More P. Singla P. Domingos,
Entity Resolution with Markov Logic, in Proc.
ICDM-2006.
100Information Extraction
Unsupervised Semantic Parsing, Hoifung Poon and
Pedro Domingos. Proceedings of the 2009
Conference on Empirical Methods in Natural
Language Processing. Singapore ACL.
UNSUPERVISED SEMANTIC PARSING. H. POON P.
DOMINGOS. EMNLP-2009.
101Information Extraction
Author
Title
Venue
Unsupervised Semantic Parsing, Hoifung Poon and
Pedro Domingos. Proceedings of the 2009
Conference on Empirical Methods in Natural
Language Processing. Singapore ACL.
SAME?
UNSUPERVISED SEMANTIC PARSING. H. POON P.
DOMINGOS. EMNLP-2009.
102Information Extraction
- Problem Extract database from text
orsemi-structured sources - Example Extract database of publications from
citation list(s) (the CiteSeer problem) - Two steps
- SegmentationUse HMM to assign tokens to fields
- Entity resolutionUse logistic regression and
transitivity
103Information Extraction
Token(token, position, citation) InField(position,
field!, citation) SameField(field, citation,
citation) SameCit(citation, citation) Token(t,i,
c) gt InField(i,f,c) InField(i,f,c)
InField(i1,f,c) Token(t,i,c)
InField(i,f,c) Token(t,i,c)
InField(i,f,c) gt SameField(f,c,c) SameField(
f,c,c) ltgt SameCit(c,c) SameField(f,c,c)
SameField(f,c,c) gt SameField(f,c,c) SameCit(c,
c) SameCit(c,c) gt SameCit(c,c)
104Information Extraction
Token(token, position, citation)
InField(position, field!, citation)
SameField(field, citation, citation)
SameCit(citation, citation)

Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) ^ !Token(".",i,c) ^ InField(i+1,+f,c)
Token(+t,i,c) ^ InField(i,f,c) ^ Token(+t,i',c') ^ InField(i',f,c') => SameField(f,c,c')
SameField(f,c,c') <=> SameCit(c,c')
SameField(f,c,c') ^ SameField(f,c',c'') => SameField(f,c,c'')
SameCit(c,c') ^ SameCit(c',c'') => SameCit(c,c'')

More: H. Poon and P. Domingos, "Joint Inference in Information Extraction", in Proc. AAAI-2007.
105Biomedical Text Mining
- Traditionally, name entity recognition or
information extraction - E.g., protein recognition, protein-protein
identification - BioNLP-09 shared task Nested bio-events
- Much harder than traditional IE
- Top F1 around 50
- Naturally calls for joint inference
106Bio-Event Extraction
Involvement of p70(S6)-kinase activation in IL-10
up-regulation in human monocytes by gp41 envelope
protein of human immunodeficiency virus type 1 ...
[Figure: nested event structure with Theme/Cause/Site argument edges linking involvement, up-regulation, activation, IL-10, gp41, p70(S6)-kinase, and human monocyte]
107Bio-Event Extraction
Token(position, token) DepEdge(position,
position, dependency) IsProtein(position) EvtType(
position, evtType) InArgPath(position, position,
argType!) Token(i,w) gt EvtType(i,t) Token(j,w)
DepEdge(i,j,d) gt EvtType(i,t) DepEdge(i,j,d
) gt InArgPath(i,j,a) Token(i,w)
DepEdge(i,j,d) gt InArgPath(i,j,a)
Logistic regression
108Bio-Event Extraction
Token(position, token)
DepEdge(position, position, dependency)
IsProtein(position)
EvtType(position, evtType)
InArgPath(position, position, argType!)

Token(i,+w) => EvtType(i,+t)
Token(j,+w) ^ DepEdge(i,j,+d) => EvtType(i,+t)
DepEdge(i,j,+d) => InArgPath(i,j,+a)
Token(i,+w) ^ DepEdge(i,j,+d) => InArgPath(i,j,+a)
InArgPath(i,j,Theme) => IsProtein(j) v (Exist k k != i ^ InArgPath(j,k,Theme)).

Adding a few joint inference rules doubles the F1.
More: H. Poon and L. Vanderwende, "Joint Inference for Knowledge Extraction from Biomedical Literature", 10:40 am, June 4, Gold Room.
109Temporal Information Extraction
- Identify event times and temporal relations
(BEFORE, AFTER, OVERLAP) - E.g., who is the President of U.S.A.?
- Obama 1/20/2009 ? present
- G. W. Bush 1/20/2001 ? 1/19/2009
- Etc.
110Temporal Information Extraction
DepEdge(position, position, dependency) Event(posi
tion, event) After(event, event)
DepEdge(i,j,d) Event(i,p) Event(j,q) gt
After(p,q) After(p,q) After(q,r) gt
After(p,r)
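If the transitivity formula is made a hard constraint, its effect on the After relation is deductive closure. As a concrete illustration, the sketch below closes a hypothetical set of extracted After facts under After(p,q) ^ After(q,r) => After(p,r); in the MLN this closure emerges jointly during inference rather than as a separate pass.

def transitive_closure(after):
    # after: set of (p, q) pairs meaning "p is after q" (or before; direction is uniform)
    events = {e for pair in after for e in pair}
    closed = set(after)
    changed = True
    while changed:
        changed = False
        for p, q in list(closed):
            for r in events:
                if (q, r) in closed and (p, r) not in closed:
                    closed.add((p, r))
                    changed = True
    return closed

facts = {("bush_term", "clinton_term"), ("obama_term", "bush_term")}
print(transitive_closure(facts))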
111Temporal Information Extraction
DepEdge(position, position, dependency) Event(posi
tion, event) After(event, event) Role(position,
position, role) DepEdge(I,j,d) Event(i,p)
Event(j,q) gt After(p,q) Role(i,j,ROLE-AFTER)
Event(i,p) Event(j,q) gt After(p,q) After(p,q)
After(q,r) gt After(p,r) More K. Yoshikawa,
S. Riedel, M. Asahara and Y. Matsumoto, Jointly
Identifying Temporal Relations with Markov
Logic, in Proc. ACL-2009. X. Ling D. Weld,
Temporal Information Extraction, in Proc.
AAAI-2010.
112Semantic Role Labeling
- Problem: Identify arguments for a predicate
- Two steps:
- Argument identification: Determine whether a phrase is an argument
- Role classification: Determine the type of an argument (agent, theme, temporal, adjunct, etc.)
113Semantic Role Labeling
Token(position, token) DepPath(position,
position, path) IsPredicate(position) Role(positio
n, position, role!) HasRole(position, position)
Token(i,t) gt IsPredicate(i) DepPath(i,j,p)
gt Role(i,j,r) HasRole(i,j) gt
IsPredicate(i) IsPredicate(i) gt Exist j
HasRole(i,j) HasRole(i,j) gt Exist r
Role(i,j,r) Role(i,j,r) gt HasRole(i,j) Cf. K.
Toutanova, A. Haghighi, C. Manning, A global
joint model for semantic role labeling, in
Computational Linguistics 2008.
114Joint Semantic Role Labeling and Word Sense Disambiguation
Token(position, token)
DepPath(position, position, path)
IsPredicate(position)
Role(position, position, role!)
HasRole(position, position)
Sense(position, sense!)

Token(i,+t) => IsPredicate(i)
DepPath(i,j,+p) => Role(i,j,+r)
Sense(i,s) => IsPredicate(i)
HasRole(i,j) => IsPredicate(i)
IsPredicate(i) => Exist j HasRole(i,j)
HasRole(i,j) => Exist r Role(i,j,r)
Role(i,j,r) => HasRole(i,j)
Token(i,+t) ^ Role(i,j,+r) => Sense(i,+s)

More: I. Meza-Ruiz and S. Riedel, "Jointly Identifying Predicates, Arguments and Senses using Markov Logic", in Proc. NAACL-2009.
115Practical Tips: Modeling
- Add all unit clauses (the default)
- How to handle uncertain data: R(x,y) => R'(x,y) (the HMM trick)
- Implications vs. conjunctions
- For soft correlation, conjunctions often better
- Implication A => B is equivalent to !(A ^ !B)
- Shares cases with others like A => C
- Makes learning unnecessarily harder
116Practical Tips: Efficiency
- Open/closed world assumptions
- Low clause arities
- Low numbers of constants
- Short inference chains
117Practical Tips: Development
- Start with easy components
- Gradually expand to full task
- Use the simplest MLN that works
- Cycle: Add/delete formulas, learn, and test
118Overview
- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
119Unsupervised Learning: Why?
- Virtually unlimited supply of unlabeled text
- Labeling is expensive (cf. Penn Treebank)
- Often difficult to label with consistency and high quality (e.g., semantic parses)
- Emerging field: Machine reading
- Extract knowledge from unstructured text with high precision/recall and minimal human effort
- Check out the LBR Workshop (WS9) on Sunday
120Unsupervised Learning: How?
- I.i.d. learning: Sophisticated model requires more labeled data
- Statistical relational learning: Sophisticated model may require less labeled data
- Relational dependencies constrain problem space
- One formula is worth a thousand labels
- Small amount of domain knowledge + large-scale joint inference
121Unsupervised Learning: How?
- Ambiguities vary among objects
- Joint inference ⇒ Propagate information from unambiguous objects to ambiguous ones
- E.g.:
- G. W. Bush
- He
-
- Mrs. Bush
Are they coreferent?
122Unsupervised Learning: How?
- Ambiguities vary among objects
- Joint inference ⇒ Propagate information from unambiguous objects to ambiguous ones
- E.g.:
- G. W. Bush
- He
-
- Mrs. Bush
Should be coreferent
123Unsupervised Learning: How?
- Ambiguities vary among objects
- Joint inference ⇒ Propagate information from unambiguous objects to ambiguous ones
- E.g.:
- G. W. Bush
- He
-
- Mrs. Bush
So must be singular male!
124Unsupervised Learning: How?
- Ambiguities vary among objects
- Joint inference ⇒ Propagate information from unambiguous objects to ambiguous ones
- E.g.:
- G. W. Bush
- He
-
- Mrs. Bush
Must be singular female!
125Unsupervised Learning: How?
- Ambiguities vary among objects
- Joint inference ⇒ Propagate information from unambiguous objects to ambiguous ones
- E.g.:
- G. W. Bush
- He
-
- Mrs. Bush
Verdict Not coreferent!
126Parameter Learning
- Marginalize out hidden variables z
∂/∂wi log P(x) = Ez|x[ni(x,z)] − Ex,z[ni(x,z)]
(first expectation: sum over z, conditioned on observed x; second: summed over both x and z)
- Use MC-SAT to approximate both expectations
- May also combine with contrastive estimation [Poon, Cherry & Toutanova, NAACL-2009]
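For intuition, the sketch below computes both expectations exactly by enumeration for a hypothetical one-feature model with one observed and one hidden binary variable, then runs gradient ascent; real MLNs replace the enumeration with MC-SAT samples.

import math
from itertools import product

def n(x, z):
    # Single illustrative feature: x ^ z (both true)
    return 1.0 if x == 1 and z == 1 else 0.0

def gradient(w, x_obs):
    # E_{z|x}[n]: sum over z, conditioned on the observed x
    num = sum(n(x_obs, z) * math.exp(w * n(x_obs, z)) for z in (0, 1))
    den = sum(math.exp(w * n(x_obs, z)) for z in (0, 1))
    e_cond = num / den
    # E_{x,z}[n]: summed over both x and z
    num = sum(n(x, z) * math.exp(w * n(x, z)) for x, z in product((0, 1), repeat=2))
    den = sum(math.exp(w * n(x, z)) for x, z in product((0, 1), repeat=2))
    return e_cond - num / den

w = 0.0
for _ in range(100):        # gradient ascent on log P(x_obs = 1)
    w += 0.5 * gradient(w, x_obs=1)
print("learned weight:", w)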
127Unsupervised Coreference Resolution
Head(mention, string) Type(mention,
type) MentionOf(mention, entity)
MentionOf(m,e) Type(m,t) Head(m,h)
MentionOf(m,e) MentionOf(a,e) MentionOf(b,e)
gt (Type(a,t) ltgt Type(b,t)) (similarly for
Number, Gender etc.)
Mixture model
Joint inference formulas Enforce agreement
128Unsupervised Coreference Resolution
Head(mention, string)
Type(mention, type)
MentionOf(mention, entity)
Apposition(mention, mention)

Mixture model:
MentionOf(m,e)
Type(m,t)
Head(m,h)

Joint inference formulas (enforce agreement):
MentionOf(a,e) ^ MentionOf(b,e) => (Type(a,t) <=> Type(b,t))
(similarly for Number, Gender, etc.)

Joint inference formulas (leverage apposition):
Apposition(a,b) => (MentionOf(a,e) <=> MentionOf(b,e))

More: H. Poon and P. Domingos, "Joint Unsupervised Coreference Resolution with Markov Logic", in Proc. EMNLP-2008.
129Relational Clustering: Discover Unknown Predicates
- Cluster relations along with objects
- Use second-order Markov logic [Kok & Domingos, 2007, 2008]
- Key idea: Cluster combination determines likelihood of relations
InClust(r,+c) ^ InClust(x,+a) ^ InClust(y,+b) => r(x,y)
- Input: Relational tuples extracted by TextRunner [Banko et al., 2007]
- Output: Semantic network
130Recursive Relational Clustering
- Unsupervised semantic parsing [Poon & Domingos, EMNLP-2009]
- Text → Knowledge
- Start directly from text
- Identify meaning units + Resolve variations
- Use high-order Markov logic (variables over arbitrary lambda forms and their clusters)
- End-to-end machine reading: Read text, then answer questions
131Semantic Parsing
IL-4 protein induces CD11b
Logical form: INDUCE(e1), INDUCER(e1,e2), INDUCED(e1,e3), IL-4(e2), CD11B(e3)
Structured prediction: Partition + Assignment
[Figure: the dependency tree of the sentence (nsubj, dobj, nn edges) is partitioned into parts, and each part is assigned to a cluster: INDUCE, INDUCER, INDUCED, IL-4, CD11B]
132Challenge: Same Meaning, Many Variations
- IL-4 up-regulates CD11b
- Protein IL-4 enhances the expression of CD11b
- CD11b expression is induced by IL-4 protein
- The cytokine interleukin-4 induces CD11b expression
- IL-4's up-regulation of CD11b, ...
- ...
133Unsupervised Semantic Parsing
- USP: Recursively cluster arbitrary expressions composed with / by similar expressions
- IL-4 induces CD11b
- Protein IL-4 enhances the expression of CD11b
- CD11b expression is enhanced by IL-4 protein
- The cytokine interleukin-4 induces CD11b expression
- IL-4's up-regulation of CD11b, ...
134Unsupervised Semantic Parsing
- USP: Recursively cluster arbitrary expressions composed with / by similar expressions
- IL-4 induces CD11b
- Protein IL-4 enhances the expression of CD11b
- CD11b expression is enhanced by IL-4 protein
- The cytokine interleukin-4 induces CD11b expression
- IL-4's up-regulation of CD11b, ...
Cluster same forms at the atom level
135Unsupervised Semantic Parsing
- USP: Recursively cluster arbitrary expressions composed with / by similar expressions
- IL-4 induces CD11b
- Protein IL-4 enhances the expression of CD11b
- CD11b expression is enhanced by IL-4 protein
- The cytokine interleukin-4 induces CD11b expression
- IL-4's up-regulation of CD11b, ...
Cluster forms in composition with same forms
139Unsupervised Semantic Parsing
- Exponential prior on number of parameters
- Event/object/property cluster mixtures:
InClust(e,+c) ^ HasValue(e,+v)
[Figure: object/event cluster INDUCE with a distribution over forms (induces, enhances, ...); property cluster INDUCER with distributions over argument forms (IL-4, IL-8, ...), dependencies (nsubj, agent, ...), and argument number (None, One, ...)]
140But State Space Too Large
- Coreference: #clusters ~ #mentions
- USP: #clusters ~ exp(#tokens)
- Also, meaning units often small and many singleton clusters
- ⇒ Use combinatorial search
141Inference: Hill-Climb Probability
[Figure: initialize with the lambda-form tree for "IL-4 protein induces CD11b" (induces with nsubj and dobj children; protein with nn child IL-4); search operators such as lambda reduction compose "IL-4 protein" into a single meaning unit, hill-climbing on probability]
142Learning: Hill-Climb Likelihood
[Figure: initialize with one cluster per form; the MERGE operator merges the clusters for "induces" and "enhances", and the COMPOSE operator composes "IL-4" and "protein" into "IL-4 protein", hill-climbing on likelihood]
143Unsupervised Ontology Induction
- Limitations of USP:
- No ISA hierarchy among clusters
- Little smoothing
- Limited capability to generalize
- OntoUSP [Poon & Domingos, ACL-2010]
- Extends USP to also induce ISA hierarchy
- Joint approach for ontology induction, population, and knowledge extraction
- To appear in ACL (see you in Uppsala :-))
144OntoUSP
- Modify the cluster mixture formula:
InClust(e,+c) ^ ISA(+c,+d) ^ HasValue(e,+v)
- Hierarchical smoothing + clustering
- New operator in learning:
[Figure: MERGE with ABSTRACTION. Instead of directly merging the INDUCE cluster (induces, enhances, up-regulates) and the INHIBIT cluster (inhibits, suppresses), create a new parent cluster REGULATE and link both to it with ISA edges]
145End of The Beginning
- Not merely a user guide of MLN and Alchemy
- Statistical relational learning
- Growth area for machine learning and NLP
146Future Work: Inference
- Scale up inference
- Cutting-plane methods (e.g., [Riedel, 2008])
- Unify lifted inference with sampling
- Coarse-to-fine inference
- Alternative technology
- E.g., linear programming, Lagrangian relaxation
147Future Work: Supervised Learning
- Alternative optimization objectives
- E.g., max-margin learning [Huynh & Mooney, 2009]
- Learning for efficient inference
- E.g., learning arithmetic circuits [Lowd & Domingos, 2008]
- Structure learning: Improve accuracy and scalability
- E.g., [Kok & Domingos, 2009]
148Future Work: Unsupervised Learning
- Model: Learning objective, formalism, etc.
- Learning: Local optima, intractability, etc.
- Hyperparameter tuning
- Leverage available resources
- Semi-supervised learning
- Multi-task learning
- Transfer learning (e.g., domain adaptation)
- Human in the loop
- E.g., interactive ML, active learning, crowdsourcing
149Future Work NLP Applications
- Existing application areas
- More joint inference opportunities
- Additional domain knowledge
- Combine multiple pipeline stages
- A killer app: Machine reading
- Many, many more awaiting YOU to discover
150Summary
- We need to unify logical and statistical NLP
- Markov logic provides a language for this
- Syntax: Weighted first-order formulas
- Semantics: Feature templates of Markov nets
- Inference: Satisfiability, MCMC, lifted BP, etc.
- Learning: Pseudo-likelihood, VP, PSCG, ILP, etc.
- Growing set of NLP applications
- Open-source software: Alchemy
- Book: Domingos & Lowd, Markov Logic, Morgan & Claypool, 2009.
alchemy.cs.washington.edu
151References
- [Banko et al., 2007] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, Oren Etzioni, "Open Information Extraction from the Web", in Proc. IJCAI-2007.
- [Chakrabarti et al., 1998] Soumen Chakrabarti, Byron Dom, Piotr Indyk, "Hypertext Classification Using Hyperlinks", in Proc. SIGMOD-1998.
- [Damien et al., 1999] Paul Damien, Jon Wakefield, Stephen Walker, "Gibbs sampling for Bayesian non-conjugate and hierarchical models by auxiliary variables", Journal of the Royal Statistical Society B, 61(2).
- [Domingos & Lowd, 2009] Pedro Domingos and Daniel Lowd, Markov Logic, Morgan & Claypool.
- [Friedman et al., 1999] Nir Friedman, Lise Getoor, Daphne Koller, Avi Pfeffer, "Learning probabilistic relational models", in Proc. IJCAI-1999.
152References
- [Halpern, 1990] Joe Halpern, "An analysis of first-order logics of probability", Artificial Intelligence 46.
- [Huynh & Mooney, 2009] Tuyen Huynh and Raymond Mooney, "Max-Margin Weight Learning for Markov Logic Networks", in Proc. ECML-2009.
- [Kautz et al., 1997] Henry Kautz, Bart Selman, Yuejun Jiang, "A general stochastic approach to solving problems with hard and soft constraints", in The Satisfiability Problem: Theory and Applications. AMS.
- [Kok & Domingos, 2007] Stanley Kok and Pedro Domingos, "Statistical Predicate Invention", in Proc. ICML-2007.
- [Kok & Domingos, 2008] Stanley Kok and Pedro Domingos, "Extracting Semantic Networks from Text via Relational Clustering", in Proc. ECML-2008.
153References
- [Kok & Domingos, 2009] Stanley Kok and Pedro Domingos, "Learning Markov Logic Network Structure via Hypergraph Lifting", in Proc. ICML-2009.
- [Ling & Weld, 2010] Xiao Ling and Daniel S. Weld, "Temporal Information Extraction", in Proc. AAAI-2010.
- [Lowd & Domingos, 2007] Daniel Lowd and Pedro Domingos, "Efficient Weight Learning for Markov Logic Networks", in Proc. PKDD-2007.
- [Lowd & Domingos, 2008] Daniel Lowd and Pedro Domingos, "Learning Arithmetic Circuits", in Proc. UAI-2008.
- [Meza-Ruiz & Riedel, 2009] Ivan Meza-Ruiz and Sebastian Riedel, "Jointly Identifying Predicates, Arguments and Senses using Markov Logic", in Proc. NAACL-2009.
154References
- [Muggleton, 1996] Stephen Muggleton, "Stochastic logic programs", in Proc. ILP-1996.
- [Nilsson, 1986] Nils Nilsson, "Probabilistic logic", Artificial Intelligence 28.
- [Page et al., 1998] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Tech. Rept., Stanford University, 1998.
- [Poon & Domingos, 2006] Hoifung Poon and Pedro Domingos, "Sound and Efficient Inference with Probabilistic and Deterministic Dependencies", in Proc. AAAI-2006.
- [Poon & Domingos, 2007] Hoifung Poon and Pedro Domingos, "Joint Inference in Information Extraction", in Proc. AAAI-2007.
155References
- [Poon & Domingos, 2008a] Hoifung Poon, Pedro Domingos, Marc Sumner, "A General Method for Reducing the Complexity of Relational Inference and its Application to MCMC", in Proc. AAAI-2008.
- [Poon & Domingos, 2008b] Hoifung Poon and Pedro Domingos, "Joint Unsupervised Coreference Resolution with Markov Logic", in Proc. EMNLP-2008.
- [Poon & Domingos, 2009] Hoifung Poon and Pedro Domingos, "Unsupervised Semantic Parsing", in Proc. EMNLP-2009.
- [Poon, Cherry & Toutanova, 2009] Hoifung Poon, Colin Cherry, Kristina Toutanova, "Unsupervised Morphological Segmentation with Log-Linear Models", in Proc. NAACL-2009.
156References
- [Poon & Vanderwende, 2010] Hoifung Poon and Lucy Vanderwende, "Joint Inference for Knowledge Extraction from Biomedical Literature", in Proc. NAACL-2010.
- [Poon & Domingos, 2010] Hoifung Poon and Pedro Domingos, "Unsupervised Ontology Induction from Text", in Proc. ACL-2010.
- [Riedel, 2008] Sebastian Riedel, "Improving the Accuracy and Efficiency of MAP Inference for Markov Logic", in Proc. UAI-2008.
- [Riedel et al., 2009] Sebastian Riedel, Hong-Woo Chun, Toshihisa Takagi and Jun'ichi Tsujii, "A Markov Logic Approach to Bio-Molecular Event Extraction", in Proc. BioNLP 2009 Shared Task.
- [Selman et al., 1996] Bart Selman, Henry Kautz, Bram Cohen, "Local search strategies for satisfiability testing", in Cliques, Coloring, and Satisfiability: Second DIMACS Implementation Challenge. AMS.
157References
- [Singla & Domingos, 2006a] Parag Singla and Pedro Domingos, "Memory-Efficient Inference in Relational Domains", in Proc. AAAI-2006.
- [Singla & Domingos, 2006b] Parag Singla and Pedro Domingos, "Entity Resolution with Markov Logic", in Proc. ICDM-2006.
- [Singla & Domingos, 2007] Parag Singla and Pedro Domingos, "Markov Logic in Infinite Domains", in Proc. UAI-2007.
- [Singla & Domingos, 2008] Parag Singla and Pedro Domingos, "Lifted First-Order Belief Propagation", in Proc. AAAI-2008.
- [Taskar et al., 2002] Ben Taskar, Pieter Abbeel, Daphne Koller, "Discriminative probabilistic models for relational data", in Proc. UAI-2002.
158References
- [Toutanova, Haghighi & Manning, 2008] Kristina Toutanova, Aria Haghighi, Chris Manning, "A global joint model for semantic role labeling", Computational Linguistics.
- [Wang & Domingos, 2008] Jue Wang and Pedro Domingos, "Hybrid Markov Logic Networks", in Proc. AAAI-2008.
- [Wellman et al., 1992] Michael Wellman, John S. Breese, Robert P. Goldman, "From knowledge bases to decision models", Knowledge Engineering Review 7.
- [Yoshikawa et al., 2009] Katsumasa Yoshikawa, Sebastian Riedel, Masayuki Asahara and Yuji Matsumoto, "Jointly Identifying Temporal Relations with Markov Logic", in Proc. ACL-2009.