1
Natural Language Semantics Combining Logical and
Distributional Methods using Probabilistic Logic
  • Raymond J. Mooney
  • Katrin Erk
  • Islam Beltagy
  • University of Texas at Austin

2
Logical AI Paradigm
  • Represents knowledge and data in a binary
    symbolic logic such as FOPC.
  • Rich representation that handles arbitrary
    sets of objects, with properties, relations,
    quantifiers, etc.
  • Weakness: unable to handle uncertain knowledge
    and probabilistic reasoning.

3
Logical Semantics for Language
  • Richard Montague (1970) developed a formal method
    for mapping natural language to FOPC using
    Church's lambda calculus of functions and the
    fundamental principle of semantic compositionality:
    recursively computing the meaning of each
    syntactic constituent from the meanings of its
    sub-constituents.
  • Later called Montague Grammar or Montague
    Semantics
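  • For example, with man ⇒ λx. man(x), walks ⇒
    λx. walk(x), and every ⇒ λP.λQ.∀x (P(x) → Q(x)),
    composing "every man walks" applies every to man
    and then to walks, yielding ∀x (man(x) → walk(x)).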

4
Interesting Book on Montague
  • See Aifric Campbell's (2009) novel The Semantics
    of Murder for a fictionalized account of his
    mysterious death in 1971 (homicide or homoerotic
    asphyxiation??).

5
Semantic Parsing
  • Mapping a natural-language sentence to a detailed
    representation of its complete meaning in a fully
    formal language that:
  • Has a rich ontology of types, properties, and
    relations.
  • Supports automated reasoning or execution.

6
Geoquery: A Database Query Application
  • Query application for a U.S. geography database
    containing about 800 facts [Zelle & Mooney, 1996]

What is the smallest state by area?
  ↓ Semantic Parsing
Query: answer(x1,smallest(x2,(state(x1),area(x1,x2))))
  ↓ Query execution
Answer: Rhode Island
7
Distributional (Vector-Space) Lexical Semantics
  • Represent word meanings as points (vectors) in a
    (high-dimensional) Euclidean space.
  • Dimensions encode aspects of the context in which
    the word appears (e.g., how often it co-occurs
    with another specific word).
  • Semantic similarity defined as distance between
    points in this semantic space.
  • Many specific mathematical models for computing
    dimensions and similarity.
  • First model (1990): Latent Semantic Analysis (LSA)
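A minimal sketch (ours, not from the slides) of the idea: build co-occurrence vectors from a toy corpus and measure cosine similarity. The corpus and the sentence-sized context window are illustrative assumptions.

    import numpy as np
    from itertools import combinations

    corpus = ["the man drinks water from a cup",
              "the woman fills the bottle with water",
              "the cat chases the dog"]

    vocab = sorted({w for s in corpus for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))

    # Count co-occurrences of word pairs within the same sentence.
    for sent in corpus:
        for a, b in combinations(sent.split(), 2):
            counts[index[a], index[b]] += 1
            counts[index[b], index[a]] += 1

    def cos(a, b):
        # Cosine similarity between the context vectors of a and b.
        va, vb = counts[index[a]], counts[index[b]]
        return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9)

    print(cos("water", "bottle"))  # related words share contexts
    print(cos("water", "dog"))     # unrelated words share fewer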

8
Sample Lexical Vector Space (reduced to 2 dimensions)
[Figure: 2-D scatter plot of word vectors for bottle, cup, water, dog, cat, computer, robot, woman, rock, and man, with semantically related words plotted near one another.]
9
Issues with Distributional Semantics
  • How to compose meanings of larger phrases and
    sentences from lexical representations? (many
    recent proposals)
  • None of the proposals for compositionality
    capture the full representational or inferential
    power of FOPC (Grefenstette, 2013).

"You can't cram the meaning of a whole %&!$# sentence
into a single $&!#* vector!"
10
Using Distributional Semantics with Standard
Logical Form
  • Recent work on unsupervised semantic parsing
    (Poon & Domingos, 2009) and work by Lewis and
    Steedman (2013) automatically create an ontology
    of predicates by clustering them based on
    distributional information.
  • But they do not allow gradedness and uncertainty
    in the final semantic representation and
    inference.

11
Probabilistic AI Paradigm
  • Represents knowledge and data as a fixed set of
    random variables with a joint probability
    distribution.
  • Handles uncertain knowledge and probabilistic
    reasoning.
  • Weakness: unable to handle arbitrary sets of
    objects, with properties, relations, quantifiers,
    etc.

12
Statistical Relational Learning (SRL)
  • SRL methods attempt to integrate methods from
    predicate logic (or relational databases) and
    probabilistic graphical models to handle
    structured, multi-relational data.

13
SRL Approaches (A Taste of the Alphabet Soup)
  • Stochastic Logic Programs (SLPs)
    (Muggleton, 1996)
  • Probabilistic Relational Models (PRMs)
    (Koller, 1999)
  • Bayesian Logic Programs (BLPs)
    (Kersting & De Raedt, 2001)
  • Markov Logic Networks (MLNs)
    (Richardson & Domingos, 2006)
  • Probabilistic Soft Logic (PSL)
    (Kimmig et al., 2012)

14
Formal Semantics for Natural Language using
Probabilistic Logical Form
  • Represent the meaning of natural language in a
    formal probabilistic logic (Beltagy et al., 2013,
    2014).
  • Markov Logic Networks (MLNs)
  • Probabilistic Soft Logic (PSL)
  • "Montague Meets Markov"

15
Markov Logic Networks [Richardson & Domingos, 2006]
  • Set of weighted clauses in first-order predicate
    logic.
  • Larger weight indicates stronger belief that the
    clause should hold.
  • MLNs are templates for constructing Markov
    networks for a given set of constants.

MLN Example: Friends & Smokers
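The weighted clauses in the standard version of this
example are roughly:
  1.5   ∀x Smokes(x) ⇒ Cancer(x)
  1.1   ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))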
16
Example: Friends & Smokers
Two constants: Anna (A) and Bob (B)
[Figure, built up across slides 17–19: the ground Markov network for constants Anna and Bob, with nodes Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), and Cancer(B), and edges added for each ground clause.]
20
Probability of a possible world
P(X = x) = (1/Z) exp( Σ_i w_i n_i(x) )
where x is a possible world, w_i is the weight of
formula i, n_i(x) is the number of its true groundings
in x, and Z is the normalization constant.
A possible world becomes exponentially less likely as
the total weight of the ground clauses it violates
increases.
21
MLN Inference
  • Infer probability of a particular query given a
    set of evidence facts.
  • P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob))
  • Use standard algorithms for inference in
    graphical models such as Gibbs Sampling or belief
    propagation.

22
MLN Learning
  • Learning weights for an existing set of clauses:
  • EM
  • Max-margin
  • On-line
  • Learning logical clauses (a.k.a. structure
    learning):
  • Inductive Logic Programming methods
  • Top-down and bottom-up MLN clause learning
  • On-line MLN clause learning

23
Strengths of MLNs
  • Fully subsumes first-order predicate logic.
  • Just give infinite (∞) weight to all clauses.
  • Fully subsumes probabilistic graphical models.
  • Can represent any joint distribution over an
    arbitrary set of discrete random variables.
  • Can utilize prior knowledge in both symbolic and
    probabilistic forms.
  • Large existing base of open-source software
    (Alchemy)

24
Weaknesses of MLNs
  • Inherits the computational intractability of
    general methods for both logical and probabilistic
    inference and learning.
  • Inference in FOPC is only semi-decidable.
  • Inference in general graphical models is
    PSPACE-complete.
  • Just producing the ground Markov net can cause a
    combinatorial explosion.
  • Current lifted inference methods do not help
    reasoning with many kinds of nested quantifiers.

25
PSL: Probabilistic Soft Logic [Kimmig, Bach,
Broecheler, Huang & Getoor, NIPS 2012]
  • Probabilistic logic framework designed with
    efficient inference in mind.
  • Input: a set of weighted first-order logic rules
    and a set of evidence, just as in BLP or MLN.
  • MPE inference is a linear-programming problem
    that can efficiently draw probabilistic
    conclusions.

26
PSL vs. MLN
  • MLN:
  • Atoms have Boolean truth values {0, 1}.
  • Inference finds the probability of atoms given
    the rules and evidence.
  • Calculates the conditional probability of a query
    atom given evidence.
  • A combinatorial counting problem.
  • PSL:
  • Atoms have continuous truth values in the
    interval [0,1].
  • Inference finds the truth values of all atoms
    that best satisfy the rules and evidence.
  • MPE inference: Most Probable Explanation.
  • A linear optimization problem.

27
PSL Example
  • First-order logic weighted rules, presumably of
    the form used in the canonical voting example:
    friend(A,B) ∧ votesFor(B,P) → votesFor(A,P)
    (lower weight) and spouse(A,B) ∧ votesFor(B,P) →
    votesFor(A,P) (higher weight).
  • Evidence:
  • I(friend(John,Alex)) = 1, I(spouse(John,Mary)) = 1
  • I(votesFor(Alex,Romney)) = 1,
    I(votesFor(Mary,Obama)) = 1
  • Inference:
  • I(votesFor(John,Obama)) = 1
  • I(votesFor(John,Romney)) = 0
28
PSL's Interpretation of Logical Connectives
  • Lukasiewicz relaxation of AND, OR, NOT:
  • I(l1 ∧ l2) = max{0, I(l1) + I(l2) − 1}
  • I(l1 ∨ l2) = min{1, I(l1) + I(l2)}
  • I(¬l1) = 1 − I(l1)
  • Distance to satisfaction:
  • An implication l1 → l2 is satisfied iff
    I(l1) ≤ I(l2)
  • d = max{0, I(l1) − I(l2)}
  • Example:
  • I(l1) = 0.3, I(l2) = 0.9 ⇒ d = 0
  • I(l1) = 0.9, I(l2) = 0.3 ⇒ d = 0.6
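A minimal Python sketch (ours, not from the slides) of
these relaxed connectives and the distance to
satisfaction:

    def l_and(a, b):
        # Lukasiewicz AND: max(0, a + b - 1)
        return max(0.0, a + b - 1.0)

    def l_or(a, b):
        # Lukasiewicz OR: min(1, a + b)
        return min(1.0, a + b)

    def l_not(a):
        # Lukasiewicz NOT: 1 - a
        return 1.0 - a

    def dist_to_satisfaction(body, head):
        # A rule body -> head is satisfied iff I(body) <= I(head).
        return max(0.0, body - head)

    print(dist_to_satisfaction(0.3, 0.9))  # 0.0 (satisfied)
    print(dist_to_satisfaction(0.9, 0.3))  # 0.6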

28
29
PSL Probability Distribution
  • PDF over continuous truth assignments I:
    p(I) = (1/Z) exp( − Σ_{r ∈ R} λ_r · d_r(I) )
    where d_r(I) is the distance to satisfaction of
    rule r, λ_r is the weight of rule r, Z is the
    normalization constant, R is the set of all rules,
    and I is a possible continuous truth assignment.
30
PSL Inference
  • MPE inference (Most Probable Explanation):
  • Find the interpretation that maximizes the PDF.
  • Equivalently, find the interpretation that
    minimizes the summation Σ_r λ_r · d_r(I).
  • Distance to satisfaction is a linear function,
    so this is a linear optimization problem.
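To make the linear program concrete, here is a minimal
sketch (ours, with made-up weights and evidence) of MPE
inference for two competing ground rules, using scipy:

    from scipy.optimize import linprog

    # Evidence: I(a) = I(b) = 1. Two ground rules:
    #   weight 1:  a -> x       distance d1 >= 1 - I(x)
    #   weight 2:  b -> NOT x   distance d2 >= I(x)
    # MPE minimizes 1*d1 + 2*d2 over I(x) in [0,1].
    c = [0.0, 1.0, 2.0]            # variables: [I(x), d1, d2]
    A_ub = [[-1.0, -1.0,  0.0],    # 1 - I(x) - d1 <= 0
            [ 1.0,  0.0, -1.0]]    #     I(x) - d2 <= 0
    b_ub = [-1.0, 0.0]
    bounds = [(0, 1), (0, None), (0, None)]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    print(res.x)  # I(x) ends up at 0: the higher-weighted rule wins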

31
Semantic Representations
  • Formal semantics:
  • Uses first-order logic
  • Deep
  • Brittle
  • Distributional semantics:
  • Statistical method
  • Robust
  • Shallow
  • Combining both logical and distributional
    semantics:
  • Represent meaning using a probabilistic logic
  • Markov Logic Networks (MLNs)
  • Probabilistic Soft Logic (PSL)
  • Generate soft inference rules from
    distributional semantics

32
System Architecture [Garrette et al., 2011, 2012;
Beltagy et al., 2013]

[Architecture diagram: Sent1 and Sent2 are mapped by
BOXER to logical forms LF1 and LF2; a vector space
feeds the Dist. Rule Constructor, which fills a Rule
Base; LF1, LF2, and the Rule Base go to MLN/PSL
Inference, which outputs the result.]

  • BOXER [Bos et al., 2004] maps sentences to
    logical form.
  • The distributional rule constructor generates
    relevant soft inference rules based on
    distributional similarity.
  • MLN/PSL performs probabilistic inference.
  • Result: a degree of entailment or a semantic
    similarity score (depending on the task).
33
Recognizing Textual Entailment (RTE)
  • Premise: A man is cutting a pickle
  • ∃x,y,z. man(x) ∧ cut(y) ∧ agent(y, x) ∧ pickle(z)
    ∧ patient(y, z)
  • Hypothesis: A guy is slicing a cucumber
  • ∃x,y,z. guy(x) ∧ slice(y) ∧ agent(y, x) ∧
    cucumber(z) ∧ patient(y, z)
  • Inference: Pr(Hypothesis | Premise)
  • Degree of entailment

34
Distributional Lexical Rules
  • For all pairs of words (a, b) where a is in S1
    and b is in S2 add a soft rule relating the two
  • ?x a(x) ? b(x) wt(a, b)
  • wt(a, b) f( cos(a, b) )
  • Premise A man is cutting pickles
  • Hypothesis A guy is slicing cucumber
  • ?x man(x) ? guy(x) wt(man, guy)
  • ?x cut(x) ? slice(x) wt(cut, slice)
  • ?x pickle(x) ? cucumber(x) wt(pickle, cucumber)
  • ?x man(x) ? cucumber(x) wt(man, cucumber)
  • ?x pickle(x) ? guy(x) wt(pickle, guy)

? ?
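A sketch (ours; the toy vectors and the weighting
function f are illustrative assumptions) of generating
these soft lexical rules:

    import numpy as np

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    def wt(a, b, vec, f=lambda s: max(s, 0.0)):
        # Map distributional similarity to a rule weight via f.
        return f(cos(vec[a], vec[b]))

    def lexical_rules(s1_words, s2_words, vec):
        # One soft rule  forall x. a(x) -> b(x) | wt(a, b)  per pair.
        return [(a, b, wt(a, b, vec))
                for a in s1_words for b in s2_words]

    # Toy vectors standing in for a real distributional space.
    vec = {"man": np.array([1.0, 0.2]), "guy": np.array([0.9, 0.3]),
           "cut": np.array([0.1, 1.0]), "slice": np.array([0.2, 0.9])}
    for a, b, w in lexical_rules(["man", "cut"], ["guy", "slice"], vec):
        print(f"forall x. {a}(x) -> {b}(x) | {w:.2f}")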
35
Distributional Phrase Rules
  • Premise A boy is playing
  • Hypothesis A little kid is playing
  • Need rules for phrases
  • ?x boy(x) ? little(x) ? kid(x) wt(boy, "little
    kid")
  • Compute vectors for phrases using vector addition
    Mitchell Lapata, 2010
  • "little kid" little kid

36
Paraphrase Rules
  • Generate inference rules from pre-compiled
    paraphrase collections like [Berant et al., 2012]
  • e.g.,
  • X solves Y → X finds a solution to Y | w

37
Evaluation (RTE using MLNs)
  • Datasets:
  • RTE-1, RTE-2, RTE-3
  • Each dataset has 800 training pairs and 800
    testing pairs.
  • Use multiple parses to reduce the impact of
    misparses.

38
Evaluation (RTE using MLNs)
Logic-only baseline; the KB is WordNet.
  System                  RTE-1  RTE-2  RTE-3
  Bos & Markert [2005]    0.52   –      –
  MLN                     0.57   0.58   0.55
  MLN-multi-parse         0.56   0.58   0.57
  MLN-paraphrases         0.60   0.60   0.60

39
Enhancing MLN inference for the RTE task
  • An inference algorithm to compute probabilities
    of complete query formulas, not just individual
    ground atoms (QF):
  • Pr(Q|R) is the ratio between the partition
    function Z of the MLN with and without Q added
    as a hard clause.
  • Use SampleSearch to estimate Z.
  • A modified closed-world assumption (MCW) that
    removes unnecessary ground atoms from the ground
    network:
  • All ground atoms are false by default unless they
    are reachable from the evidence.

40
Evaluation (RTE using enhanced MLN inference)
  • Using the SICK dataset from SemEval 2014.
  System       Accuracy (%)  CPU Time    Timeouts (%, 30 min limit)
  MLN          57            2 min 27 s  96
  MLN+QF       69            1 min 51 s  30
  MLN+MCW      66            10 s         2.5
  MLN+QF+MCW   72             7 s         2.1

41
Semantic Textual Similarity (STS)
  • Rate the semantic similarity of two sentences on
    a 0 to 5 scale.
  • Gold standards are averaged over multiple human
    judgments.
  • Evaluate by measuring correlation to human
    ratings.
  S1                           S2                             score
  A man is slicing a cucumber  A guy is cutting a cucumber    5
  A man is slicing a cucumber  A guy is cutting a zucchini    4
  A man is slicing a cucumber  A woman is cooking a zucchini  3
  A man is slicing a cucumber  A monkey is riding a bicycle   1

42
Softening Conjunction for STS
  • Premise: A man is driving
  • ∃x,y. man(x) ∧ drive(y) ∧ agent(y, x)
  • Hypothesis: A man is driving a bus
  • ∃x,y,z. man(x) ∧ drive(y) ∧ agent(y, x) ∧ bus(z)
    ∧ patient(y, z)
  • Break the sentence into mini-clauses, then
    combine their evidence using an averaging
    combiner [Natarajan et al., 2010] (see the sketch
    after this list).
  • The hypothesis becomes:
  • ∀x,y,z. man(x) ∧ agent(y, x) → result()
  • ∀x,y,z. drive(y) ∧ agent(y, x) → result()
  • ∀x,y,z. drive(y) ∧ patient(y, z) → result()
  • ∀x,y,z. bus(z) ∧ patient(y, z) → result()
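A rough sketch (ours) of the averaging combiner: each
mini-clause contributes its truth value, and the result
is their mean rather than a strict conjunction:

    def avg_combiner(mini_clause_truths):
        # Average the evidence from the mini-clauses instead of
        # taking a hard (Lukasiewicz) conjunction.
        return sum(mini_clause_truths) / len(mini_clause_truths)

    # Against the premise "A man is driving", the man/agent and
    # drive/agent mini-clauses hold but the patient/bus ones do not.
    print(avg_combiner([1.0, 1.0, 0.0, 0.0]))  # 0.5, a graded score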

43
Evaluation (STS using MLN)
  • Microsoft video description corpus (SemEval
    2012): short video descriptions.
  System                                Pearson r
  Logic only (no distributional rules)  0.52
  With lexical rules                    0.60
  With lexical and phrase rules         0.63

44
PSL: Probabilistic Soft Logic [Kimmig, Bach,
Broecheler, Huang & Getoor, NIPS 2012]
  • MLN inference is very slow.
  • PSL is a probabilistic logic framework designed
    with efficient inference in mind.
  • Inference is a linear program.

45
STS using PSL - Conjunction
  • The Lukasiewicz relaxation of AND is very
    restrictive:
  • I(l1 ∧ l2) = max{0, I(l1) + I(l2) − 1}
  • Replace AND with a weighted average:
  • I(l1 ∧ … ∧ ln) = w_avg( I(l1), …, I(ln) )
  • Learning the weights is future work; for now
    they are equal.
  • Inference: the weighted average is a linear
    function, so the optimization problem is
    unchanged.

46
Evaluation (STS using PSL)
msr-vid: Microsoft video description corpus (SemEval
2012), short video description sentences. msr-par:
Microsoft paraphrase corpus (SemEval 2012), long news
sentences. SICK: SemEval 2014.
  System                   msr-vid  msr-par  SICK
  vec-add (dist. only)     0.78     0.24     0.65
  vec-mul (dist. only)     0.76     0.12     0.62
  MLN (logic + dist.)      0.63     0.16     0.47
  PSL-no-DIR (logic only)  0.74     0.46     0.72
  PSL (logic + dist.)      0.79     0.53     0.74

47
Evaluation (STS using PSL)
  Measure                msr-vid   msr-par    SICK
  PSL time/pair          8 s       30 s       10 s
  MLN time/pair          1 m 31 s  11 m 49 s  4 m 24 s
  MLN timeouts (10 min)  9%        97%        36%
48
Future Work
  • Improve inference efficiency for MLNs by
    exploiting the latest advances in lifted
    inference.
  • Improve parsing into logical form using the
    latest improvements in Boxer and semantic
    parsing.
  • Improve extraction and distributional
    representation of phrases.
  • Use asymmetric distributional similarity.
  • Add additional knowledge sources to the MLN:
  • WordNet
  • PPDB

49
Conclusions
  • Traditional logical and distributional approaches
    to natural language semantics have complementary
    strengths and weaknesses.
  • These competing approaches can be combined using
    a probabilistic logic (e.g., MLN, PSL) as a
    uniform semantic representation.
  • We have promising initial results using MLNs for
    RTE and PSL for STS.