Title: Markov Logic in Natural Language Processing
1Markov Logic in Natural Language Processing
- Hoifung Poon
- Dept. of Computer Science & Engineering
- University of Washington
2Overview
- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
3Languages Are Structural
governments lmpxtm (according
to their families)
4Languages Are Structural
govern-ment-s  l-mpx-t-m (according to their families)
[Figure: parse tree S → NP VP, VP → V NP, for "IL-4 induces CD11B"]
Involvement of p70(S6)-kinase activation in IL-10
up-regulation in human monocytes by gp41 ...
George Walker Bush was the 43rd President of the
United States. Bush was the eldest son of
President G. H. W. Bush and Barbara Bush. ... In
November 1977, he met Laura Welch at a barbecue.
[Figure: nested event structure with Theme/Cause/Site argument edges linking involvement, up-regulation, activation, IL-10, gp41, p70(S6)-kinase, and human monocyte]
6Languages Are Structural
- Objects are not just feature vectors
- They have parts and subparts
- Which have relations with each other
- They can be trees, graphs, etc.
- Objects are seldom i.i.d. (independent and identically distributed)
- They exhibit local and global dependencies
- They form class hierarchies (with multiple inheritance)
- Object properties depend on those of related objects
- Deeply interwoven with knowledge
7First-Order Logic
- Main theoretical foundation of computer science
- General language for describing complex structures and knowledge
- Trees, graphs, dependencies, hierarchies, etc. easily expressed
- Inference algorithms (satisfiability testing, theorem proving, etc.)
8Languages Are Statistical
Microsoft buys Powerset
Microsoft acquires Powerset
Powerset is acquired by Microsoft Corporation
The Redmond software giant buys Powerset
Microsoft's purchase of Powerset, ...
[Figure: two parses of "I saw the man with the telescope", attaching the PP to the NP or to the VP]
G. W. Bush ... Laura Bush ... Mrs. Bush. Which one?
Here in London, Frances Deek is a retired teacher ...
In the Israeli town ..., Karen London says ... Now London says ...
London? PERSON or LOCATION?
9Languages Are Statistical
- Languages are ambiguous
- Our information is always incomplete
- We need to model correlations
- Our predictions are uncertain
- Statistics provides the tools to handle this
10Probabilistic Graphical Models
- Mixture models
- Hidden Markov models
- Bayesian networks
- Markov random fields
- Maximum entropy models
- Conditional random fields
- Etc.
11The Problem
- Logic is deterministic, requires manual coding
- Statistical models assume i.i.d. data, objects = feature vectors
- Historically, statistical and logical NLP have been pursued separately
- We need to unify the two!
- Burgeoning field in machine learning:
- Statistical relational learning
12Costs and Benefits of Statistical Relational Learning
- Benefits
- Better predictive accuracy
- Better understanding of domains
- Enable learning with less or no labeled data
- Costs
- Learning is much harder
- Inference becomes a crucial issue
- Greater complexity for user
13Progress to Date
- Probabilistic logic [Nilsson, 1986]
- Statistics and beliefs [Halpern, 1990]
- Knowledge-based model construction [Wellman et al., 1992]
- Stochastic logic programs [Muggleton, 1996]
- Probabilistic relational models [Friedman et al., 1999]
- Relational Markov networks [Taskar et al., 2002]
- Etc.
- This talk: Markov logic [Domingos & Lowd, 2009]
14Markov Logic: A Unifying Framework
- Probabilistic graphical models and first-order logic are special cases
- Unified inference and learning algorithms
- Easy-to-use software: Alchemy
- Broad applicability
- Goal of this tutorial: Quickly learn how to use Markov logic and Alchemy for a broad spectrum of NLP applications
15Overview
- Motivation
- Foundational areas
- Probabilistic inference
- Statistical learning
- Logical inference
- Inductive logic programming
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
16Markov Networks
- Undirected graphical models
[Graph over Smoking, Cancer, Asthma, Cough]
- Potential functions defined over cliques

Smoking | Cancer | Φ(S,C)
False   | False  | 4.5
False   | True   | 4.5
True    | False  | 2.7
True    | True   | 4.5
17Markov Networks
- Undirected graphical models
[Graph over Smoking, Cancer, Asthma, Cough]
- Log-linear form:
P(x) = (1/Z) exp( Σi wi fi(x) )
where wi is the weight of feature i and fi(x) is feature i
18Markov Nets vs. Bayes Nets
Property        | Markov Nets      | Bayes Nets
Form            | Prod. potentials | Prod. potentials
Potentials      | Arbitrary        | Cond. probabilities
Cycles          | Allowed          | Forbidden
Partition func. | Z = ?            | Z = 1
Indep. check    | Graph separation | D-separation
Indep. props.   | Some             | Some
Inference       | MCMC, BP, etc.   | Convert to Markov
19Inference in Markov Networks
- Goal: compute marginals and conditionals of the distribution
- Exact inference is #P-complete
- Conditioning on the Markov blanket is easy
- Gibbs sampling exploits this
20MCMC: Gibbs Sampling
state ← random truth assignment
for i ← 1 to num-samples do
    for each variable x
        sample x according to P(x | neighbors(x))
        state ← state with new value of x
P(F) ← fraction of states in which F is true
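To make the loop concrete, here is a minimal Python sketch of Gibbs sampling for a tiny two-variable network. The features, weights, and names (Smoking, Cancer) are illustrative assumptions, and for simplicity each update rescores the whole state rather than just the Markov blanket.

import math, random

# Log-linear features over Boolean variables S (Smoking) and C (Cancer):
# each entry is (feature function, weight). Values here are illustrative.
features = [
    (lambda a: a["S"] and a["C"], 1.5),   # soft rule: Smoking => Cancer
    (lambda a: a["S"], -0.5),             # prior against Smoking
]

def score(a):
    # Unnormalized log-probability: sum of weights of the true features
    return sum(w for f, w in features if f(a))

def gibbs(num_samples=20000):
    state = {"S": random.random() < 0.5, "C": random.random() < 0.5}
    hits = 0
    for _ in range(num_samples):
        for x in list(state):             # resample each variable in turn
            state[x] = True
            p_true = math.exp(score(state))
            state[x] = False
            p_false = math.exp(score(state))
            state[x] = random.random() < p_true / (p_true + p_false)
        hits += state["C"]                # P(F) = fraction of states where F holds
    return hits / num_samples

print("P(Cancer) ~", gibbs())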
21Other Inference Methods
- Belief propagation (sum-product)
- Mean field / Variational approximations
22MAP/MPE Inference
- Goal: Find most likely state of world given evidence
arg max_y P(y | x)   (y: query, x: evidence)
23MAP Inference Algorithms
- Iterated conditional modes
- Simulated annealing
- Graph cuts
- Belief propagation (max-product)
- LP relaxation
24Overview
- Motivation
- Foundational areas
- Probabilistic inference
- Statistical learning
- Logical inference
- Inductive logic programming
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
25Generative Weight Learning
- Maximize likelihood
- Use gradient ascent or L-BFGS
- No local maxima
∂/∂wi log P(x) = ni(x) − E[ni(x)]
(no. of times feature i is true in data, minus expected no. of times it is true according to the model)
- Requires inference at each step (slow!)
26Pseudo-Likelihood
PL(x) = Πi P(xi | neighbors(xi))
- Likelihood of each variable given its neighbors in the data
- Does not require inference at each step
- Widely used in vision, spatial statistics, etc.
- But PL parameters may not work well for long inference chains
27Discriminative Weight Learning
- Maximize conditional likelihood of query (y) given evidence (x)
∂/∂wi log P(y|x) = ni(x,y) − E[ni(x,y)]
(no. of true groundings of clause i in data, minus expected no. of true groundings according to the model)
- Approximate expected counts by counts in MAP state of y given x
28Voted Perceptron
- Originally proposed for training HMMs discriminatively
- Assumes network is linear chain
- Can be generalized to arbitrary networks

wi ← 0
for t ← 1 to T do
    yMAP ← Viterbi(x)
    wi ← wi + η [counti(yData) − counti(yMAP)]
return Σt wi / T
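A minimal Python sketch of the same loop, as an averaged perceptron on a toy labeling problem; the two features and the brute-force argmax (standing in for Viterbi) are illustrative assumptions.

from itertools import product

def counts(x, y):
    # Two features: emission agreement and transition agreement
    emit = sum(1 for xt, yt in zip(x, y) if xt == yt)
    trans = sum(1 for a, b in zip(y, y[1:]) if a == b)
    return [emit, trans]

def map_state(x, w):
    # Brute-force argmax over label sequences (Viterbi would do this efficiently)
    return max(product((0, 1), repeat=len(x)),
               key=lambda y: sum(wi * ci for wi, ci in zip(w, counts(x, y))))

data = [((0, 0, 1), (0, 0, 1)), ((1, 1, 0), (1, 1, 0))]
w, w_sum, T, eta = [0.0, 0.0], [0.0, 0.0], 10, 0.1
for _ in range(T):
    for x, y_true in data:
        y_map = map_state(x, w)
        c_true, c_map = counts(x, y_true), counts(x, y_map)
        for i in range(len(w)):   # wi += eta * (counti(yData) - counti(yMAP))
            w[i] += eta * (c_true[i] - c_map[i])
        w_sum = [s + wi for s, wi in zip(w_sum, w)]
print("averaged weights:", [s / (T * len(data)) for s in w_sum])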
29Overview
- Motivation
- Foundational areas
- Probabilistic inference
- Statistical learning
- Logical inference
- Inductive logic programming
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
30First-Order Logic
- Constants, variables, functions, predicates
  E.g.: Anna, x, MotherOf(x), Friends(x, y)
- Literal: Predicate or its negation
- Clause: Disjunction of literals
- Grounding: Replace all variables by constants
  E.g.: Friends(Anna, Bob)
- World (model, interpretation): Assignment of truth values to all ground predicates
31Inference in First-Order Logic
- Traditionally done by theorem proving (e.g., Prolog)
- Propositionalization followed by model checking turns out to be faster (often by a lot)
- Propositionalization: Create all ground atoms and clauses
- Model checking: Satisfiability testing
- Two main approaches:
- Backtracking (e.g., DPLL)
- Stochastic local search (e.g., WalkSAT)
32Satisfiability
- Input: Set of clauses (convert KB to conjunctive normal form (CNF))
- Output: Truth assignment that satisfies all clauses, or failure
- The paradigmatic NP-complete problem
- Solution: Search
- Key point: Most SAT problems are actually easy
- Hard region: Narrow range of #Clauses / #Variables
33Stochastic Local Search
- Uses complete assignments instead of partial
- Start with random state
- Flip variables in unsatisfied clauses
- Hill-climbing: Minimize # unsatisfied clauses
- Avoid local minima: Random flips
- Multiple restarts
34The WalkSAT Algorithm
for i ? 1 to max-tries do solution random
truth assignment for j ? 1 to max-flips do
if all clauses satisfied then
return solution c ? random unsatisfied
clause with probability p
flip a random variable in c else
flip variable in c that maximizes satisfied
clauses return failure
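Here is a minimal runnable Python version of the same algorithm; clauses are tuples of signed integers (a common DIMACS-style convention), and the instance at the bottom is illustrative.

import random

def walksat(clauses, n_vars, max_tries=10, max_flips=1000, p=0.5):
    def satisfied(clause, a):
        return any((lit > 0) == a[abs(lit)] for lit in clause)
    for _ in range(max_tries):
        a = {v: random.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(max_flips):
            unsat = [c for c in clauses if not satisfied(c, a)]
            if not unsat:
                return a                       # all clauses satisfied
            c = random.choice(unsat)
            if random.random() < p:            # random-walk step
                v = abs(random.choice(c))
            else:                              # greedy step
                def gain(v):
                    a[v] = not a[v]
                    n = sum(satisfied(cl, a) for cl in clauses)
                    a[v] = not a[v]
                    return n
                v = max((abs(lit) for lit in c), key=gain)
            a[v] = not a[v]
    return None                                # failure

# (x1 v !x2) ^ (x2 v x3) ^ (!x1 v !x3)
print(walksat([(1, -2), (2, 3), (-1, -3)], 3))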
35Overview
- Motivation
- Foundational areas
- Probabilistic inference
- Statistical learning
- Logical inference
- Inductive logic programming
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
36Rule Induction
- Given: Set of positive and negative examples of some concept
- Example: (x1, x2, …, xn, y)
- y: concept (Boolean)
- x1, x2, …, xn: attributes (assume Boolean)
- Goal: Induce a set of rules that cover all positive examples and no negative ones
- Rule: xa ∧ xb ∧ … ⇒ y (xa: literal, i.e., xi or its negation)
- Same as Horn clause: Body ⇒ Head
- Rule r covers example x iff x satisfies body of r
- Eval(r): Accuracy, info gain, coverage, support, etc.
37Learning a Single Rule
head ? y body ? Ø repeat for each literal x
rx ? r with x added to body
Eval(rx) body ? body best x until no x
improves Eval(r) return r
38Learning a Set of Rules
R ? Ø S ? examples repeat learn a single rule
r R ? R U r S ? S - positive examples
covered by r until S Ø return R
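The two loops above fit in a few lines of Python; here is a minimal sketch under illustrative assumptions (Boolean attribute vectors, rule evaluation by accuracy).

def covers(body, x):                   # body: set of (attribute index, value) literals
    return all(x[i] == v for i, v in body)

def eval_rule(body, pos, neg):         # accuracy of the rule on remaining data
    covered_pos = sum(covers(body, x) for x in pos)
    covered_neg = sum(covers(body, x) for x in neg)
    total = covered_pos + covered_neg
    return covered_pos / total if total else 0.0

def learn_one_rule(pos, neg, n_attrs):
    body, best = set(), 0.0
    while True:
        candidates = [(i, v) for i in range(n_attrs) for v in (True, False)
                      if (i, v) not in body]
        score, lit = max((eval_rule(body | {lit}, pos, neg), lit)
                         for lit in candidates)
        if score <= best:              # no literal improves Eval(r)
            return body
        body, best = body | {lit}, score

def learn_rule_set(pos, neg, n_attrs):
    rules = []
    while pos:                         # until no positive examples remain
        r = learn_one_rule(pos, neg, n_attrs)
        rules.append(r)
        pos = [x for x in pos if not covers(r, x)]
    return rules

pos = [(True, True, False), (True, False, False)]
neg = [(False, True, True), (False, False, True)]
print(learn_rule_set(pos, neg, 3))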
39First-Order Rule Induction
- y and xi are now predicates with arguments
  E.g.: y is Ancestor(x,y), xi is Parent(x,y)
- Literals to add are predicates or their negations
- Literal to add must include at least one variable already appearing in rule
- Adding a literal changes groundings of rule
  E.g.: Ancestor(x,z) ∧ Parent(z,y) ⇒ Ancestor(x,y)
- Eval(r) must take this into account
  E.g.: Multiply by # positive groundings of rule still covered after adding literal
40Overview
- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
41Markov Logic
- Syntax: Weighted first-order formulas
- Semantics: Feature templates for Markov networks
- Intuition: Soften logical constraints
- Give each formula a weight (higher weight ⇒ stronger constraint)
42Example Coreference Resolution
Barack Obama, the 44th President of the United
States, is the first African American to hold the
office.
45Example Coreference Resolution
Two mention constants: A and B
Apposition(A,B)
Head(A,President)
Head(B,President)
MentionOf(A,Obama)
MentionOf(B,Obama)
Head(A,Obama)
Head(B,Obama)
Apposition(B,A)
46Markov Logic Networks
- MLN is template for ground Markov nets
- Probability of a world x:
P(x) = (1/Z) exp( Σi wi ni(x) )
where wi is the weight of formula i and ni(x) is the no. of true groundings of formula i in x
- Typed variables and constants greatly reduce size of ground Markov net
- Functions, existential quantifiers, etc.
- Can handle infinite domains [Singla & Domingos, 2007] and continuous domains [Wang & Domingos, 2008]
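To see the semantics in action, the sketch below enumerates every world of a two-constant domain and computes P(x) directly from the definition; the two formulas and their weights are the classic smoking example, used here as an illustrative assumption.

import math
from itertools import product

people = ["A", "B"]
atoms = ([("Smokes", p) for p in people] + [("Cancer", p) for p in people] +
         [("Friends", p, q) for p in people for q in people])

def n1(x):  # true groundings of: Smokes(p) => Cancer(p)
    return sum((not x[("Smokes", p)]) or x[("Cancer", p)] for p in people)

def n2(x):  # true groundings of: Friends(p,q) => (Smokes(p) <=> Smokes(q))
    return sum((not x[("Friends", p, q)]) or (x[("Smokes", p)] == x[("Smokes", q)])
               for p in people for q in people)

formulas = [(n1, 1.5), (n2, 1.1)]
worlds = [dict(zip(atoms, vals))
          for vals in product((False, True), repeat=len(atoms))]
# P(x) = exp(sum_i wi ni(x)) / Z, with Z summing over all 2^8 worlds
scores = [math.exp(sum(w * n(x) for n, w in formulas)) for x in worlds]
Z = sum(scores)
p_cancer_A = sum(s for x, s in zip(worlds, scores) if x[("Cancer", "A")]) / Z
print("Z =", Z, "  P(Cancer(A)) =", p_cancer_A)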
47Relation to Statistical Models
- Special cases
- Markov networks
- Markov random fields
- Bayesian networks
- Log-linear models
- Exponential models
- Max. entropy models
- Gibbs distributions
- Boltzmann machines
- Logistic regression
- Hidden Markov models
- Conditional random fields
- Obtained by making all predicates zero-arity
- Markov logic allows objects to be interdependent
(non-i.i.d.)
48Relation to First-Order Logic
- Infinite weights ⇒ First-order logic
- Satisfiable KB, positive weights ⇒ Satisfying assignments = Modes of distribution
- Markov logic allows contradictions between formulas
49MLN Algorithms: The First Three Generations
Problem            | First generation        | Second generation | Third generation
MAP inference      | Weighted satisfiability | Lazy inference    | Cutting planes
Marginal inference | Gibbs sampling          | MC-SAT            | Lifted inference
Weight learning    | Pseudo-likelihood       | Voted perceptron  | Scaled conj. gradient
Structure learning | Inductive logic progr.  | ILP + PL (etc.)   | Clustering + pathfinding
50MAP/MPE Inference
- Problem: Find most likely state of world given evidence
arg max_y P(y | x)   (y: query, x: evidence)
53MAP/MPE Inference
- Problem: Find most likely state of world given evidence
- This is just the weighted MaxSAT problem
- Use weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997])
54The MaxWalkSAT Algorithm
for i ← 1 to max-tries do
    solution ← random truth assignment
    for j ← 1 to max-flips do
        if Σ weights(sat. clauses) > threshold then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip variable in c that maximizes Σ weights(sat. clauses)
return failure, best solution found
55Computing Probabilities
- P(Formula | MLN, C) = ?
- MCMC: Sample worlds, check formula holds
- P(Formula1 | Formula2, MLN, C) = ?
- If Formula2 = conjunction of ground atoms:
- First construct min subset of network necessary to answer query (generalization of KBMC)
- Then apply MCMC
56But Insufficient for Logic
- Problem: Deterministic dependencies break MCMC; near-deterministic ones make it very slow
- Solution: Combine MCMC and WalkSAT → the MC-SAT algorithm [Poon & Domingos, 2006]
57Auxiliary-Variable Methods
- Main ideas:
- Use auxiliary variables to capture dependencies
- Turn difficult sampling into uniform sampling
- Given distribution P(x), define f(x, u) whose marginal over u is P(x)
- Sample from f(x, u), then discard u
58Slice Sampling [Damien et al., 1999]
[Figure: given the current sample x(k), draw u(k) uniformly from [0, P(x(k))]; then draw x(k+1) uniformly from the slice { x : P(x) ≥ u(k) }]
59Slice Sampling
- Identifying the slice may be difficult
- Introduce an auxiliary variable ui for each potential Φi
60The MC-SAT Algorithm
- Select random subset M of satisfied clauses
- Each clause Ci is selected with probability 1 − exp(−wi)
- Larger wi ⇒ Ci more likely to be selected
- Hard clause (wi → ∞): Always selected
- Slice = states that satisfy all clauses in M
- Uses SAT solver to sample x | u
- Orders of magnitude faster than Gibbs sampling, etc.
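The clause-selection (slice) step is easy to show concretely. Below is a minimal Python sketch of just that step, with made-up clauses and weights; a full MC-SAT implementation would then call a near-uniform SAT sampler over the selected set M, which is omitted here.

import math, random

def choose_M(satisfied_clauses):
    # satisfied_clauses: list of (clause, weight) satisfied by the current state
    M = []
    for clause, w in satisfied_clauses:
        # hard clauses (w -> inf) are always selected; exp(-inf) == 0.0
        if random.random() < 1 - math.exp(-w):
            M.append(clause)
    return M

sat = [((1, -2), 2.0), ((2, 3), 0.5), ((-1, -3), float("inf"))]
print(choose_M(sat))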
61But It Is Not Scalable
- 1000 researchers
- Coauthor(x,y): 1 million ground atoms
- Coauthor(x,y) ^ Coauthor(y,z) => Coauthor(x,z): 1 billion ground clauses
- Exponential in arity
62Sparsity to the Rescue
- 1000 researchers
- Coauthor(x,y): 1 million ground atoms
- But most atoms are false
- Coauthor(x,y) ^ Coauthor(y,z) => Coauthor(x,z): 1 billion ground clauses
- Most trivially satisfied if most atoms are false
- No need to explicitly compute most of them
63Lazy Inference
- LazySAT [Singla & Domingos, 2006a]
- Lazy version of WalkSAT [Selman et al., 1996]
- Grounds atoms/clauses as needed
- Greatly reduces memory usage
- The idea is much more general [Poon & Domingos, 2008a]
64General Method for Lazy Inference
- If most variables assume the default value,
wasteful to instantiate all variables / functions - Main idea
- Allocate memory for a small subset of
- active variables / functions
- Activate more if necessary as inference proceeds
- Applicable to a diverse set of algorithms
Satisfiability solvers (systematic,
local-search), Markov chain Monte Carlo, MPE /
MAP algorithms, Maximum expected utility
algorithms, Belief propagation, MC-SAT, Etc. - Reduce memory and time by orders of magnitude
65Lifted Inference
- Consider belief propagation (BP)
- Often in large problems, many nodes are interchangeable: they send and receive the same messages throughout BP
- Basic idea: Group them into supernodes, forming lifted network
- Smaller network ⇒ Faster inference
- Akin to resolution in first-order logic
66Belief Propagation
[Figure: factor graph with feature nodes (f) and variable nodes (x) passing messages]
67Lifted Belief Propagation
[Figure: interchangeable features and nodes grouped into supernodes]
68Lifted Belief Propagation
[Figure: lifted network; message multiplicities α, β are functions of edge counts]
69Learning
- Data is a relational database
- Closed world assumption (if not EM)
- Learning parameters (weights)
- Learning structure (formulas)
70Parameter Learning
- Parameter tying: Groundings of same clause
- Generative learning: Pseudo-likelihood
- Discriminative learning: Conditional likelihood; use MC-SAT or MaxWalkSAT for inference
∂/∂wi log P(x) = ni(x) − E[ni(x)]
(no. of times clause i is true in data, minus expected no. of times clause i is true according to the MLN)
71Parameter Learning
- Pseudo-likelihood + L-BFGS is fast and robust but can give poor inference results
- Voted perceptron: Gradient descent + MAP inference
- Scaled conjugate gradient
72Voted Perceptron for MLNs
- HMMs are special case of MLNs
- Replace Viterbi by MaxWalkSAT
- Network can now be arbitrary graph

wi ← 0
for t ← 1 to T do
    yMAP ← MaxWalkSAT(x)
    wi ← wi + η [counti(yData) − counti(yMAP)]
return Σt wi / T
73Problem: Multiple Modes
- Not alleviated by contrastive divergence
- Alleviated by MC-SAT
- Warm start: Start each MC-SAT run at previous end state
74Problem: Extreme Ill-Conditioning
- Solvable by quasi-Newton, conjugate gradient, etc.
- But line searches require exact inference
- Solution: Scaled conjugate gradient [Lowd & Domingos, 2007]
- Use Hessian to choose step size
- Compute quadratic form inside MC-SAT
- Use inverse diagonal Hessian as preconditioner
75Structure Learning
- Standard inductive logic programming optimizes the wrong thing
- But can be used to overgenerate for L1 pruning
- Our approach: ILP + Pseudo-likelihood + Structure priors
- For each candidate structure change: Start from current weights and relax convergence
- Use subsampling to compute sufficient statistics
76Structure Learning
- Initial state: Unit clauses or prototype KB
- Operators: Add/remove literal, flip sign
- Evaluation function: Pseudo-likelihood + Structure prior
- Search: Beam search, shortest-first search
77Alchemy
- Open-source software including:
- Full first-order logic syntax
- Generative and discriminative weight learning
- Structure learning
- Weighted satisfiability, MCMC, lifted BP
- Programming language features
alchemy.cs.washington.edu
78Alchemy vs. Prolog vs. BUGS
               | Alchemy                         | Prolog          | BUGS
Representation | F.O. logic + Markov nets        | Horn clauses    | Bayes nets
Inference      | Model checking, MCMC, lifted BP | Theorem proving | MCMC
Learning       | Parameters and structure        | No              | Params.
Uncertainty    | Yes                             | No              | Yes
Relational     | Yes                             | Yes             | No
79Constrained Conditional Model
- Representation: Integer linear programs
- Local classifiers + Global constraints
- Inference: LP solver
- Parameter learning: None for constraints
- Weights of soft constraints set heuristically
- Local weights typically learned independently
- Structure learning: None to date
- But see latest development in NAACL-10
80Running Alchemy
- Programs
- Infer
- Learnwts
- Learnstruct
- Options
- MLN file
- Types (optional)
- Predicates
- Formulas
- Database files
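A typical session chains these programs together. The sketch below is assembled from memory of the Alchemy documentation, so treat the flag names (-d for discriminative learning, -t for the training database, -ne for non-evidence predicates, -ms for MC-SAT, -e/-r/-q for evidence, results, and query) as assumptions to verify against alchemy.cs.washington.edu:

# learn weights discriminatively for the non-evidence predicate Topic
learnwts -d -i program.mln -o learned.mln -t train.db -ne Topic
# run MC-SAT inference over new evidence, writing marginals for Topic
infer -ms -i learned.mln -e evidence.db -r topic.results -q Topic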
81Overview
- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
82Uniform Distribn.: Empty MLN
- Example: Unbiased coin flips
- Type: flip = { 1, …, 20 }
- Predicate: Heads(flip)
83Binomial Distribn.: Unit Clause
- Example: Biased coin flips
- Type: flip = { 1, …, 20 }
- Predicate: Heads(flip)
- Formula: Heads(f)
- Weight: Log odds of heads
- By default, MLN includes unit clauses for all predicates (captures marginal distributions, etc.)
84Multinomial Distribution
- Example Throwing die
- Types throw 1, , 20
- face 1, , 6
- Predicate Outcome(throw,face)
- Formulas Outcome(t,f) f ! f gt
!Outcome(t,f). - Exist f Outcome(t,f).
- Too cumbersome!
85Multinomial Distrib.: ! Notation
- Example: Throwing die
- Types: throw = { 1, …, 20 }, face = { 1, …, 6 }
- Predicate: Outcome(throw,face!)
- Formulas: (none needed)
- Semantics: Arguments without ! determine arguments with !
- Also makes inference more efficient (triggers blocking)
86Multinomial Distrib.: + Notation
- Example: Throwing biased die
- Types: throw = { 1, …, 20 }, face = { 1, …, 6 }
- Predicate: Outcome(throw,face!)
- Formulas: Outcome(t,+f)
- Semantics: Learn a separate weight for each grounding of the arguments with +
87Logistic Regression (MaxEnt)
Logistic regression Type
obj 1, ... , n Query predicate
C(obj) Evidence predicates Fi(obj) Formulas
a C(x)
bi Fi(x) C(x) Resulting distribution
Therefore Alternative form Fi(x) gt
C(x)
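A small numeric check of this equivalence: with a unit-clause weight a on C(x) and weights bi on Fi(x) ^ C(x), conditioning the MLN on the evidence f yields exactly the logistic sigmoid of a + Σi bi fi. The weights and evidence values below are illustrative.

import math

a, b = -1.0, [2.0, 0.5]
f = [1, 0]                              # observed evidence F1(x)=1, F2(x)=0

def unnorm(c):                          # exp(a*c + sum_i bi * fi * c)
    return math.exp(a * c + sum(bi * fi * c for bi, fi in zip(b, f)))

p_mln = unnorm(1) / (unnorm(0) + unnorm(1))
p_logreg = 1 / (1 + math.exp(-(a + sum(bi * fi for bi, fi in zip(b, f)))))
print(p_mln, p_logreg)                  # identical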
88Hidden Markov Models
obs Red, Green, Yellow state Stop,
Drive, Slow time 0, ..., 100
State(state!,time) Obs(obs!,time) State(s,0)
State(s,t) State(s',t1) Obs(o,t)
State(s,t) Sparse HMM State(s,t) gt
State(s1,t1) v State(s2, t1) v ... .
89Bayesian Networks
- Use all binary predicates with same first argument (the object x)
- One predicate for each variable A: A(x,v!)
- One clause for each line in the CPT and value of the variable
- Context-specific independence: One clause for each path in the decision tree
- Logistic regression: As before
- Noisy OR: Deterministic OR + Pairwise clauses
90Relational Models
- Knowledge-based model construction
- Allow only Horn clauses
- Same as Bayes nets, except arbitrary relations
- Combin. function: Logistic regression, noisy-OR, or external
- Stochastic logic programs
- Allow only Horn clauses
- Weight of clause = log(p)
- Add formulas: Head holds ⇒ Exactly one body holds
- Probabilistic relational models
- Allow only binary relations
- Same as Bayes nets, except first argument can vary
91Relational Models
- Relational Markov networks
- SQL → Datalog → First-order logic
- One clause for each state of a clique
- + syntax in Alchemy facilitates this
- Bayesian logic
- Object = Cluster of similar/related observations
- Observation constants + Object constants
- Predicate InstanceOf(Obs,Obj) and clauses using it
- Unknown relations: Second-order Markov logic
- More: S. Kok and P. Domingos, "Statistical Predicate Invention", in Proc. ICML-2007.
92Overview
- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
93Text Classification
The 56th quadrennial United States presidential
election was held on November 4, 2008. Outgoing
Republican President George W. Bush's policies
and actions and the American public's desire for
change were key issues throughout the campaign.
Topic: politics
The Chicago Bulls are an American professional
basketball team based in Chicago, Illinois,
playing in the Central Division of the Eastern
Conference in the National Basketball Association
(NBA).
Topic: sports
94Text Classification
page = { 1, …, max }
word = { … }
topic = { … }

Topic(page,topic)
HasWord(page,word)

Topic(p,+t)
HasWord(p,+w) => Topic(p,+t)

If topics mutually exclusive: Topic(page,topic!)
95Text Classification
page = { 1, …, max }
word = { … }
topic = { … }

Topic(page,topic)
HasWord(page,word)
Links(page,page)

Topic(p,+t)
HasWord(p,+w) => Topic(p,+t)
Topic(p,t) ^ Links(p,p') => Topic(p',t)

Cf. S. Chakrabarti, B. Dom and P. Indyk, "Hypertext Classification Using Hyperlinks", in Proc. SIGMOD-1998.
96Entity Resolution
AUTHOR H. POON P. DOMINGOS TITLE UNSUPERVISED
SEMANTIC PARSING VENUE EMNLP-09
SAME?
AUTHOR Hoifung Poon and Pedro Domings TITLE
Unsupervised semantic parsing VENUE Proceedings
of the 2009 Conference on Empirical Methods in
Natural Language Processing
AUTHOR Poon, Hoifung and Domings, Pedro TITLE
Unsupervised ontology induction from text VENUE
Proceedings of the Forty-Eighth Annual Meeting of
the Association for Computational Linguistics
SAME?
AUTHOR H. Poon, P. Domings TITLE Unsupervised
ontology induction VENUE ACL-10
97Entity Resolution
Problem Given database, find duplicate
records HasToken(token,field,record) SameField(fi
eld,record,record) SameRecord(record,record) HasT
oken(t,f,r) HasToken(t,f,r) gt
SameField(f,r,r) SameField(f,r,r) gt
SameRecord(r,r)
98Entity Resolution
Problem: Given database, find duplicate records

HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)

HasToken(+t,f,r) ^ HasToken(+t,f,r') => SameField(f,r,r')
SameField(f,r,r') => SameRecord(r,r')
SameRecord(r,r') ^ SameRecord(r',r'') => SameRecord(r,r'')

Cf. A. McCallum and B. Wellner, "Conditional Models of Identity Uncertainty with Application to Noun Coreference", in Adv. NIPS 17, 2005.
99Entity Resolution
Can also resolve fields HasToken(token,field,rec
ord) SameField(field,record,record) SameRecord(rec
ord,record) HasToken(t,f,r)
HasToken(t,f,r) gt SameField(f,r,r) SameFi
eld(f,r,r) ltgt SameRecord(r,r) SameRecord(r,r)
SameRecord(r,r) gt SameRecord(r,r) SameFi
eld(f,r,r) SameField(f,r,r) gt
SameField(f,r,r) More P. Singla P. Domingos,
Entity Resolution with Markov Logic, in Proc.
ICDM-2006.
100Information Extraction
Unsupervised Semantic Parsing, Hoifung Poon and
Pedro Domingos. Proceedings of the 2009
Conference on Empirical Methods in Natural
Language Processing. Singapore ACL.
UNSUPERVISED SEMANTIC PARSING. H. POON P.
DOMINGOS. EMNLP-2009.
101Information Extraction
Author
Title
Venue
Unsupervised Semantic Parsing, Hoifung Poon and
Pedro Domingos. Proceedings of the 2009
Conference on Empirical Methods in Natural
Language Processing. Singapore ACL.
SAME?
UNSUPERVISED SEMANTIC PARSING. H. POON P.
DOMINGOS. EMNLP-2009.
102Information Extraction
- Problem Extract database from text
orsemi-structured sources - Example Extract database of publications from
citation list(s) (the CiteSeer problem) - Two steps
- SegmentationUse HMM to assign tokens to fields
- Entity resolutionUse logistic regression and
transitivity
103Information Extraction
Token(token, position, citation) InField(position,
field!, citation) SameField(field, citation,
citation) SameCit(citation, citation) Token(t,i,
c) gt InField(i,f,c) InField(i,f,c)
InField(i1,f,c) Token(t,i,c)
InField(i,f,c) Token(t,i,c)
InField(i,f,c) gt SameField(f,c,c) SameField(
f,c,c) ltgt SameCit(c,c) SameField(f,c,c)
SameField(f,c,c) gt SameField(f,c,c) SameCit(c,
c) SameCit(c,c) gt SameCit(c,c)
104Information Extraction
Token(token, position, citation)
InField(position, field!, citation)
SameField(field, citation, citation)
SameCit(citation, citation)

Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) ^ !Token(".",i,c) ^ InField(i+1,+f,c)
Token(+t,i,c) ^ InField(i,f,c) ^ Token(+t,i',c') ^ InField(i',f,c') => SameField(f,c,c')
SameField(f,c,c') <=> SameCit(c,c')
SameField(f,c,c') ^ SameField(f,c',c'') => SameField(f,c,c'')
SameCit(c,c') ^ SameCit(c',c'') => SameCit(c,c'')

More: H. Poon and P. Domingos, "Joint Inference in Information Extraction", in Proc. AAAI-2007.
105Biomedical Text Mining
- Traditionally, name entity recognition or
information extraction - E.g., protein recognition, protein-protein
identification - BioNLP-09 shared task Nested bio-events
- Much harder than traditional IE
- Top F1 around 50
- Naturally calls for joint inference
106Bio-Event Extraction
Involvement of p70(S6)-kinase activation in IL-10
up-regulation in human monocytes by gp41 envelope
protein of human immunodeficiency virus type 1 ...
[Figure: nested event structure with Theme/Cause/Site argument edges linking involvement, up-regulation, activation, IL-10, gp41, p70(S6)-kinase, and human monocyte]
107Bio-Event Extraction
Token(position, token) DepEdge(position,
position, dependency) IsProtein(position) EvtType(
position, evtType) InArgPath(position, position,
argType!) Token(i,w) gt EvtType(i,t) Token(j,w)
DepEdge(i,j,d) gt EvtType(i,t) DepEdge(i,j,d
) gt InArgPath(i,j,a) Token(i,w)
DepEdge(i,j,d) gt InArgPath(i,j,a)
Logistic regression
108Bio-Event Extraction
Token(position, token)
DepEdge(position, position, dependency)
IsProtein(position)
EvtType(position, evtType)
InArgPath(position, position, argType!)

Token(i,+w) => EvtType(i,+t)
Token(j,+w) ^ DepEdge(i,j,+d) => EvtType(i,+t)
DepEdge(i,j,+d) => InArgPath(i,j,+a)
Token(i,+w) ^ DepEdge(i,j,+d) => InArgPath(i,j,+a)
InArgPath(i,j,Theme) => IsProtein(j) v (Exist k k != i ^ InArgPath(j,k,Theme)).

Adding a few joint inference rules doubles the F1.
More: H. Poon and L. Vanderwende, "Joint Inference for Knowledge Extraction from Biomedical Literature", 10:40 am, June 4, Gold Room.
109Temporal Information Extraction
- Identify event times and temporal relations
(BEFORE, AFTER, OVERLAP) - E.g., who is the President of U.S.A.?
- Obama 1/20/2009 ? present
- G. W. Bush 1/20/2001 ? 1/19/2009
- Etc.
110Temporal Information Extraction
DepEdge(position, position, dependency) Event(posi
tion, event) After(event, event)
DepEdge(i,j,d) Event(i,p) Event(j,q) gt
After(p,q) After(p,q) After(q,r) gt
After(p,r)
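If the transitivity formula is made a hard constraint, its effect on the After relation is deductive closure. As a concrete illustration, the sketch below closes a hypothetical set of extracted After facts under After(p,q) ^ After(q,r) => After(p,r); in the MLN this closure emerges jointly during inference rather than as a separate pass.

def transitive_closure(after):
    # after: set of (p, q) pairs meaning "p is after q" (or before; direction is uniform)
    events = {e for pair in after for e in pair}
    closed = set(after)
    changed = True
    while changed:
        changed = False
        for p, q in list(closed):
            for r in events:
                if (q, r) in closed and (p, r) not in closed:
                    closed.add((p, r))
                    changed = True
    return closed

facts = {("bush_term", "clinton_term"), ("obama_term", "bush_term")}
print(transitive_closure(facts))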
111Temporal Information Extraction
DepEdge(position, position, dependency) Event(posi
tion, event) After(event, event) Role(position,
position, role) DepEdge(I,j,d) Event(i,p)
Event(j,q) gt After(p,q) Role(i,j,ROLE-AFTER)
Event(i,p) Event(j,q) gt After(p,q) After(p,q)
After(q,r) gt After(p,r) More K. Yoshikawa,
S. Riedel, M. Asahara and Y. Matsumoto, Jointly
Identifying Temporal Relations with Markov
Logic, in Proc. ACL-2009. X. Ling D. Weld,
Temporal Information Extraction, in Proc.
AAAI-2010.
112Semantic Role Labeling
- Problem: Identify arguments for a predicate
- Two steps:
- Argument identification: Determine whether a phrase is an argument
- Role classification: Determine the type of an argument (agent, theme, temporal, adjunct, etc.)
113Semantic Role Labeling
Token(position, token) DepPath(position,
position, path) IsPredicate(position) Role(positio
n, position, role!) HasRole(position, position)
Token(i,t) gt IsPredicate(i) DepPath(i,j,p)
gt Role(i,j,r) HasRole(i,j) gt
IsPredicate(i) IsPredicate(i) gt Exist j
HasRole(i,j) HasRole(i,j) gt Exist r
Role(i,j,r) Role(i,j,r) gt HasRole(i,j) Cf. K.
Toutanova, A. Haghighi, C. Manning, A global
joint model for semantic role labeling, in
Computational Linguistics 2008.
114Joint Semantic Role Labeling and Word Sense Disambiguation
Token(position, token)
DepPath(position, position, path)
IsPredicate(position)
Role(position, position, role!)
HasRole(position, position)
Sense(position, sense!)

Token(i,+t) => IsPredicate(i)
DepPath(i,j,+p) => Role(i,j,+r)
Sense(i,s) => IsPredicate(i)
HasRole(i,j) => IsPredicate(i)
IsPredicate(i) => Exist j HasRole(i,j)
HasRole(i,j) => Exist r Role(i,j,r)
Role(i,j,r) => HasRole(i,j)
Token(i,+t) ^ Role(i,j,+r) => Sense(i,+s)

More: I. Meza-Ruiz and S. Riedel, "Jointly Identifying Predicates, Arguments and Senses using Markov Logic", in Proc. NAACL-2009.
115Practical Tips: Modeling
- Add all unit clauses (the default)
- How to handle uncertain data: R(x,y) => R'(x,y) (the HMM trick)
- Implications vs. conjunctions
- For soft correlation, conjunctions often better
- Implication A => B is equivalent to !(A ^ !B)
- Shares cases with others like A => C
- Makes learning unnecessarily harder
116Practical Tips: Efficiency
- Open/closed world assumptions
- Low clause arities
- Low numbers of constants
- Short inference chains
117Practical Tips: Development
- Start with easy components
- Gradually expand to full task
- Use the simplest MLN that works
- Cycle: Add/delete formulas, learn, and test
118Overview
- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning
119Unsupervised Learning: Why?
- Virtually unlimited supply of unlabeled text
- Labeling is expensive (cf. Penn Treebank)
- Often difficult to label with consistency and high quality (e.g., semantic parses)
- Emerging field: Machine reading
- Extract knowledge from unstructured text with high precision/recall and minimal human effort
- Check out the LBR Workshop (WS9) on Sunday
120Unsupervised Learning: How?
- I.i.d. learning: Sophisticated model requires more labeled data
- Statistical relational learning: Sophisticated model may require less labeled data
- Relational dependencies constrain problem space
- One formula is worth a thousand labels
- Small amount of domain knowledge + large-scale joint inference
121Unsupervised Learning: How?
- Ambiguities vary among objects
- Joint inference ⇒ Propagate information from unambiguous objects to ambiguous ones
- E.g.:
- G. W. Bush
- He
-
- Mrs. Bush
Are they coreferent?
122Unsupervised Learning: How?
- Ambiguities vary among objects
- Joint inference ⇒ Propagate information from unambiguous objects to ambiguous ones
- E.g.:
- G. W. Bush
- He
-
- Mrs. Bush
Should be coreferent
123Unsupervised Learning: How?
- Ambiguities vary among objects
- Joint inference ⇒ Propagate information from unambiguous objects to ambiguous ones
- E.g.:
- G. W. Bush
- He
-
- Mrs. Bush
So must be singular male!
124Unsupervised Learning: How?
- Ambiguities vary among objects
- Joint inference ⇒ Propagate information from unambiguous objects to ambiguous ones
- E.g.:
- G. W. Bush
- He
-
- Mrs. Bush
Must be singular female!
125Unsupervised Learning: How?
- Ambiguities vary among objects
- Joint inference ⇒ Propagate information from unambiguous objects to ambiguous ones
- E.g.:
- G. W. Bush
- He
-
- Mrs. Bush
Verdict Not coreferent!
126Parameter Learning
- Marginalize out hidden variables z
∂/∂wi log P(x) = Ez|x[ni(x,z)] − Ex,z[ni(x,z)]
(first expectation: sum over z, conditioned on observed x; second: summed over both x and z)
- Use MC-SAT to approximate both expectations
- May also combine with contrastive estimation [Poon, Cherry & Toutanova, NAACL-2009]
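For intuition, the sketch below computes both expectations exactly by enumeration for a hypothetical one-feature model with one observed and one hidden binary variable, then runs gradient ascent; real MLNs replace the enumeration with MC-SAT samples.

import math
from itertools import product

def n(x, z):
    # Single illustrative feature: x ^ z (both true)
    return 1.0 if x == 1 and z == 1 else 0.0

def gradient(w, x_obs):
    # E_{z|x}[n]: sum over z, conditioned on the observed x
    num = sum(n(x_obs, z) * math.exp(w * n(x_obs, z)) for z in (0, 1))
    den = sum(math.exp(w * n(x_obs, z)) for z in (0, 1))
    e_cond = num / den
    # E_{x,z}[n]: summed over both x and z
    num = sum(n(x, z) * math.exp(w * n(x, z)) for x, z in product((0, 1), repeat=2))
    den = sum(math.exp(w * n(x, z)) for x, z in product((0, 1), repeat=2))
    return e_cond - num / den

w = 0.0
for _ in range(100):        # gradient ascent on log P(x_obs = 1)
    w += 0.5 * gradient(w, x_obs=1)
print("learned weight:", w)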
127Unsupervised Coreference Resolution
Head(mention, string) Type(mention,
type) MentionOf(mention, entity)
MentionOf(m,e) Type(m,t) Head(m,h)
MentionOf(m,e) MentionOf(a,e) MentionOf(b,e)
gt (Type(a,t) ltgt Type(b,t)) (similarly for
Number, Gender etc.)
Mixture model
Joint inference formulas Enforce agreement
128Unsupervised Coreference Resolution
Head(mention, string)
Type(mention, type)
MentionOf(mention, entity)
Apposition(mention, mention)

Mixture model:
MentionOf(m,e)
Type(m,t)
Head(m,h)

Joint inference formulas (enforce agreement):
MentionOf(a,e) ^ MentionOf(b,e) => (Type(a,t) <=> Type(b,t))
(similarly for Number, Gender, etc.)

Joint inference formulas (leverage apposition):
Apposition(a,b) => (MentionOf(a,e) <=> MentionOf(b,e))

More: H. Poon and P. Domingos, "Joint Unsupervised Coreference Resolution with Markov Logic", in Proc. EMNLP-2008.
129Relational Clustering: Discover Unknown Predicates
- Cluster relations along with objects
- Use second-order Markov logic [Kok & Domingos, 2007, 2008]
- Key idea: Cluster combination determines likelihood of relations
InClust(r,+c) ^ InClust(x,+a) ^ InClust(y,+b) => r(x,y)
- Input: Relational tuples extracted by TextRunner [Banko et al., 2007]
- Output: Semantic network
130Recursive Relational Clustering
- Unsupervised semantic parsing [Poon & Domingos, EMNLP-2009]
- Text → Knowledge
- Start directly from text
- Identify meaning units + Resolve variations
- Use high-order Markov logic (variables over arbitrary lambda forms and their clusters)
- End-to-end machine reading: Read text, then answer questions
131Semantic Parsing
IL-4 protein induces CD11b
Logical form: INDUCE(e1), INDUCER(e1,e2), INDUCED(e1,e3), IL-4(e2), CD11B(e3)
Structured prediction: Partition + Assignment
[Figure: the dependency tree of the sentence (nsubj, dobj, nn edges) is partitioned into parts, and each part is assigned to a cluster: INDUCE, INDUCER, INDUCED, IL-4, CD11B]
132Challenge: Same Meaning, Many Variations
- IL-4 up-regulates CD11b
- Protein IL-4 enhances the expression of CD11b
- CD11b expression is induced by IL-4 protein
- The cytokine interleukin-4 induces CD11b expression
- IL-4's up-regulation of CD11b, ...
- ...
133Unsupervised Semantic Parsing
- USP: Recursively cluster arbitrary expressions composed with / by similar expressions
- IL-4 induces CD11b
- Protein IL-4 enhances the expression of CD11b
- CD11b expression is enhanced by IL-4 protein
- The cytokine interleukin-4 induces CD11b expression
- IL-4's up-regulation of CD11b, ...
134Unsupervised Semantic Parsing
- USP: Recursively cluster arbitrary expressions composed with / by similar expressions
- IL-4 induces CD11b
- Protein IL-4 enhances the expression of CD11b
- CD11b expression is enhanced by IL-4 protein
- The cytokine interleukin-4 induces CD11b expression
- IL-4's up-regulation of CD11b, ...
Cluster same forms at the atom level
135Unsupervised Semantic Parsing
- USP: Recursively cluster arbitrary expressions composed with / by similar expressions
- IL-4 induces CD11b
- Protein IL-4 enhances the expression of CD11b
- CD11b expression is enhanced by IL-4 protein
- The cytokine interleukin-4 induces CD11b expression
- IL-4's up-regulation of CD11b, ...
Cluster forms in composition with same forms
139Unsupervised Semantic Parsing
- Exponential prior on number of parameters
- Event/object/property cluster mixtures:
InClust(e,+c) ^ HasValue(e,+v)
[Figure: object/event cluster INDUCE with a distribution over forms (induces, enhances, ...); property cluster INDUCER with distributions over argument forms (IL-4, IL-8, ...), dependencies (nsubj, agent, ...), and argument number (None, One, ...)]
140But State Space Too Large
- Coreference: #clusters ~ #mentions
- USP: #clusters ~ exp(#tokens)
- Also, meaning units often small and many singleton clusters
- ⇒ Use combinatorial search
141Inference: Hill-Climb Probability
[Figure: initialize with the lambda-form tree for "IL-4 protein induces CD11b" (induces with nsubj and dobj children; protein with nn child IL-4); search operators such as lambda reduction compose "IL-4 protein" into a single meaning unit, hill-climbing on probability]
142Learning: Hill-Climb Likelihood
[Figure: initialize with one cluster per form; the MERGE operator merges the clusters for "induces" and "enhances", and the COMPOSE operator composes "IL-4" and "protein" into "IL-4 protein", hill-climbing on likelihood]
143Unsupervised Ontology Induction
- Limitations of USP:
- No ISA hierarchy among clusters
- Little smoothing
- Limited capability to generalize
- OntoUSP [Poon & Domingos, ACL-2010]
- Extends USP to also induce ISA hierarchy
- Joint approach for ontology induction, population, and knowledge extraction
- To appear in ACL (see you in Uppsala :-))
144OntoUSP
- Modify the cluster mixture formula:
InClust(e,+c) ^ ISA(+c,+d) ^ HasValue(e,+v)
- Hierarchical smoothing + clustering
- New operator in learning:
[Figure: MERGE with ABSTRACTION. Instead of directly merging the INDUCE cluster (induces, enhances, up-regulates) and the INHIBIT cluster (inhibits, suppresses), create a new parent cluster REGULATE and link both to it with ISA edges]
145End of The Beginning
- Not merely a user guide of MLN and Alchemy
- Statistical relational learning
- Growth area for machine learning and NLP
146Future Work: Inference
- Scale up inference
- Cutting-plane methods (e.g., [Riedel, 2008])
- Unify lifted inference with sampling
- Coarse-to-fine inference
- Alternative technology
- E.g., linear programming, Lagrangian relaxation
147Future Work: Supervised Learning
- Alternative optimization objectives
- E.g., max-margin learning [Huynh & Mooney, 2009]
- Learning for efficient inference
- E.g., learning arithmetic circuits [Lowd & Domingos, 2008]
- Structure learning: Improve accuracy and scalability
- E.g., [Kok & Domingos, 2009]
148Future Work: Unsupervised Learning
- Model: Learning objective, formalism, etc.
- Learning: Local optima, intractability, etc.
- Hyperparameter tuning
- Leverage available resources
- Semi-supervised learning
- Multi-task learning
- Transfer learning (e.g., domain adaptation)
- Human in the loop
- E.g., interactive ML, active learning, crowdsourcing
149Future Work NLP Applications
- Existing application areas
- More joint inference opportunities
- Additional domain knowledge
- Combine multiple pipeline stages
- A killer app: Machine reading
- Many, many more awaiting YOU to discover
150Summary
- We need to unify logical and statistical NLP
- Markov logic provides a language for this
- Syntax: Weighted first-order formulas
- Semantics: Feature templates of Markov nets
- Inference: Satisfiability, MCMC, lifted BP, etc.
- Learning: Pseudo-likelihood, VP, PSCG, ILP, etc.
- Growing set of NLP applications
- Open-source software: Alchemy
- Book: Domingos & Lowd, Markov Logic, Morgan & Claypool, 2009.
alchemy.cs.washington.edu
151References
- [Banko et al., 2007] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, Oren Etzioni, "Open Information Extraction from the Web", in Proc. IJCAI-2007.
- [Chakrabarti et al., 1998] Soumen Chakrabarti, Byron Dom, Piotr Indyk, "Hypertext Classification Using Hyperlinks", in Proc. SIGMOD-1998.
- [Damien et al., 1999] Paul Damien, Jon Wakefield, Stephen Walker, "Gibbs sampling for Bayesian non-conjugate and hierarchical models by auxiliary variables", Journal of the Royal Statistical Society B, 61(2).
- [Domingos & Lowd, 2009] Pedro Domingos and Daniel Lowd, Markov Logic, Morgan & Claypool.
- [Friedman et al., 1999] Nir Friedman, Lise Getoor, Daphne Koller, Avi Pfeffer, "Learning probabilistic relational models", in Proc. IJCAI-1999.
152References
- [Halpern, 1990] Joe Halpern, "An analysis of first-order logics of probability", Artificial Intelligence 46.
- [Huynh & Mooney, 2009] Tuyen Huynh and Raymond Mooney, "Max-Margin Weight Learning for Markov Logic Networks", in Proc. ECML-2009.
- [Kautz et al., 1997] Henry Kautz, Bart Selman, Yuejun Jiang, "A general stochastic approach to solving problems with hard and soft constraints", in The Satisfiability Problem: Theory and Applications. AMS.
- [Kok & Domingos, 2007] Stanley Kok and Pedro Domingos, "Statistical Predicate Invention", in Proc. ICML-2007.
- [Kok & Domingos, 2008] Stanley Kok and Pedro Domingos, "Extracting Semantic Networks from Text via Relational Clustering", in Proc. ECML-2008.
153References
- [Kok & Domingos, 2009] Stanley Kok and Pedro Domingos, "Learning Markov Logic Network Structure via Hypergraph Lifting", in Proc. ICML-2009.
- [Ling & Weld, 2010] Xiao Ling and Daniel S. Weld, "Temporal Information Extraction", in Proc. AAAI-2010.
- [Lowd & Domingos, 2007] Daniel Lowd and Pedro Domingos, "Efficient Weight Learning for Markov Logic Networks", in Proc. PKDD-2007.
- [Lowd & Domingos, 2008] Daniel Lowd and Pedro Domingos, "Learning Arithmetic Circuits", in Proc. UAI-2008.
- [Meza-Ruiz & Riedel, 2009] Ivan Meza-Ruiz and Sebastian Riedel, "Jointly Identifying Predicates, Arguments and Senses using Markov Logic", in Proc. NAACL-2009.
154References
- [Muggleton, 1996] Stephen Muggleton, "Stochastic logic programs", in Proc. ILP-1996.
- [Nilsson, 1986] Nils Nilsson, "Probabilistic logic", Artificial Intelligence 28.
- [Page et al., 1998] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Tech. Rept., Stanford University, 1998.
- [Poon & Domingos, 2006] Hoifung Poon and Pedro Domingos, "Sound and Efficient Inference with Probabilistic and Deterministic Dependencies", in Proc. AAAI-2006.
- [Poon & Domingos, 2007] Hoifung Poon and Pedro Domingos, "Joint Inference in Information Extraction", in Proc. AAAI-2007.
155References
- [Poon & Domingos, 2008a] Hoifung Poon, Pedro Domingos, Marc Sumner, "A General Method for Reducing the Complexity of Relational Inference and its Application to MCMC", in Proc. AAAI-2008.
- [Poon & Domingos, 2008b] Hoifung Poon and Pedro Domingos, "Joint Unsupervised Coreference Resolution with Markov Logic", in Proc. EMNLP-2008.
- [Poon & Domingos, 2009] Hoifung Poon and Pedro Domingos, "Unsupervised Semantic Parsing", in Proc. EMNLP-2009.
- [Poon, Cherry & Toutanova, 2009] Hoifung Poon, Colin Cherry, Kristina Toutanova, "Unsupervised Morphological Segmentation with Log-Linear Models", in Proc. NAACL-2009.
156References
- [Poon & Vanderwende, 2010] Hoifung Poon and Lucy Vanderwende, "Joint Inference for Knowledge Extraction from Biomedical Literature", in Proc. NAACL-2010.
- [Poon & Domingos, 2010] Hoifung Poon and Pedro Domingos, "Unsupervised Ontology Induction from Text", in Proc. ACL-2010.
- [Riedel, 2008] Sebastian Riedel, "Improving the Accuracy and Efficiency of MAP Inference for Markov Logic", in Proc. UAI-2008.
- [Riedel et al., 2009] Sebastian Riedel, Hong-Woo Chun, Toshihisa Takagi and Jun'ichi Tsujii, "A Markov Logic Approach to Bio-Molecular Event Extraction", in Proc. BioNLP 2009 Shared Task.
- [Selman et al., 1996] Bart Selman, Henry Kautz, Bram Cohen, "Local search strategies for satisfiability testing", in Cliques, Coloring, and Satisfiability: Second DIMACS Implementation Challenge. AMS.
157References
- [Singla & Domingos, 2006a] Parag Singla and Pedro Domingos, "Memory-Efficient Inference in Relational Domains", in Proc. AAAI-2006.
- [Singla & Domingos, 2006b] Parag Singla and Pedro Domingos, "Entity Resolution with Markov Logic", in Proc. ICDM-2006.
- [Singla & Domingos, 2007] Parag Singla and Pedro Domingos, "Markov Logic in Infinite Domains", in Proc. UAI-2007.
- [Singla & Domingos, 2008] Parag Singla and Pedro Domingos, "Lifted First-Order Belief Propagation", in Proc. AAAI-2008.
- [Taskar et al., 2002] Ben Taskar, Pieter Abbeel, Daphne Koller, "Discriminative probabilistic models for relational data", in Proc. UAI-2002.
158References
- [Toutanova, Haghighi & Manning, 2008] Kristina Toutanova, Aria Haghighi, Chris Manning, "A global joint model for semantic role labeling", Computational Linguistics.
- [Wang & Domingos, 2008] Jue Wang and Pedro Domingos, "Hybrid Markov Logic Networks", in Proc. AAAI-2008.
- [Wellman et al., 1992] Michael Wellman, John S. Breese, Robert P. Goldman, "From knowledge bases to decision models", Knowledge Engineering Review 7.
- [Yoshikawa et al., 2009] Katsumasa Yoshikawa, Sebastian Riedel, Masayuki Asahara and Yuji Matsumoto, "Jointly Identifying Temporal Relations with Markov Logic", in Proc. ACL-2009.