Title: Slide 1
1SOFIE: A Self-Organizing Framework for Information Extraction
Fabian M. Suchanek, Mauro Sozio, Gerhard Weikum
(Max-Planck-Institute for Informatics, Saarbrücken, Germany)
2Ontologies
Entity
subclassOf
subclassOf
Singer
Country
type
type
DBpedia, YAGO, KYLIN, ...
Wikipedia
bornInPlace
USA
?
birth-place USA
"Elvis died in England"
Internet
3Information Extraction
Goal: Extract ontological information from natural language documents
diedInPlace
England
"Elvis died in England"
recoverWithout(most_people, medication)
areUnder(0, the_age_of_18)
support(these_findings, the_notion)
Previous approaches: Espresso, DIPRE, LEILA, Snowball, TextRunner, Alice, and many more
died in, perished in, was killed in
-> May deliver non-canonical relations
England, UK, Great Britain
-> May deliver non-canonical entities
diedInPlace(Elvis, England), diedInPlace(Elvis, Germany)
-> May deliver inconsistent facts
SOFIE aims to solve these problems in a new unified framework
4Pitfalls of Information Extraction
Ontology
Web page
Elvis died in England.
diedInPlace
France
Louis XIV died in France.
If a pattern occurs with two entities that stand in a relation, then the pattern maps to the relation.
"died in" -> diedInPlace
5Pitfalls of Information Extraction
Ontology
Web page
Elvis died in England.
Louis XIV died in France.
If a pattern occurs with two entities that stand in a relation, then the pattern maps to the relation.
"died in" -> diedInPlace
If a meaningful pattern occurs with two entities, then the entities stand in the relation.
diedInPlace
"Elvis"
"England"
6Pitfalls of Information Extraction
Ontology
Web page
?
Taxidophobist
Elvis died in England.
Louis XIV died in France.
If a pattern occurs with two entities that stand in a relation, then the pattern maps to the relation.
"died in" -> diedInPlace
If a meaningful pattern occurs with two entities, then the entities stand in the relation.
diedInPlace
"Elvis"
"England"
7Pitfalls of Information Extraction
Web page
Reasoning Problem
Elvis died in England.
Taxidophobist
Louis XIV died in France.
If a pattern occurs with two entities that stand in a relation, then the pattern maps to the relation.
"died in" -> diedInPlace
If a meaningful pattern occurs with two entities, then the entities stand in the relation.
diedInPlace
"Elvis"
"England"
8Pitfalls of Information Extraction
Web page
Reasoning Problem
Elvis died in England.
Taxidophobist
Louis XIV died in France.
If a pattern occurs with two entities that stand in a relation, then the pattern maps to the relation.
Disambiguation Problem
"died in" -> diedInPlace
If a meaningful pattern occurs with two entities, then the entities stand in the relation.
9Pitfalls of Information Extraction
Reasoning Problem
Pattern Matching Problem
Taxidophobist
Elvis died in England.
Louis XIV died in France.
"died in" -> diedInPlace ?
Disambiguation Problem
10Information Extraction as Formulas
Reasoning Problem
type(Elvis,Taxidophobist).
Taxidophobist
type(X,Taxidophobist) /\ bornInPlace(X,Y) => ¬diedInPlace(X,Z)   0.8
11Information Extraction as Formulas
Reasoning Problem
Pattern Matching Problem
type(Elvis,Taxidophobist).
Elvis died in England.
type(X,Taxidophobist) /\ bornInPlace(X,Y) => ¬diedInPlace(X,Z)
Louis XIV died in France.
"died in" -> diedInPlace ?
Disambiguation Problem
12Information Extraction as Formulas
Assumptions:
- In one document, the same word always has the same meaning
- The ontology already knows all important meanings of proper names
possibleMeaning(Elvis_at_D15, ElvisPresley). 0.7
Disambiguation Problem
13Information Extraction as Formulas
Assumptions:
- In one document, the same word always has the same meaning
- The ontology already knows all important meanings of proper names
possibleMeaning(Elvis_at_D15, ElvisPresley). 0.7
0.7: Prior estimate of the likelihood of this meaning, computed as |words(D15) ∩ rel(ElvisPresley)| / |words(D15)|
Elvis_at_D15: A word in context (wic). Here: the word "Elvis" in document D15
ElvisPresley: One possible meaning of "Elvis" as given by the ontology
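The prior on this slide can be illustrated with a minimal Python sketch. It assumes that words(D) is the set of distinct words of document D and that rel(E) is a set of words associated with entity E in the ontology; the function name `disambiguation_prior` and the toy word lists are hypothetical and not taken from the slides.

```python
def disambiguation_prior(doc_words, entity_words):
    """Overlap ratio |words(D) ∩ rel(E)| / |words(D)| over distinct words."""
    doc = set(doc_words)
    if not doc:
        return 0.0  # an empty document gives no evidence for any meaning
    return len(doc & set(entity_words)) / len(doc)


# Hypothetical toy data: ten distinct words of document D15
words_d15 = ["elvis", "sang", "rock", "music", "in",
             "memphis", "before", "he", "died", "quietly"]
# Hypothetical context words of two candidate meanings of "Elvis"
rel_presley = {"rock", "music", "memphis", "singer", "graceland"}
rel_saint = {"saint", "bishop", "munster", "ireland"}

prior_presley = disambiguation_prior(words_d15, rel_presley)
prior_saint = disambiguation_prior(words_d15, rel_saint)
print(prior_presley, prior_saint)
```

In this toy setting the singer shares three of the ten document words and the saint shares none, so the singer receives the higher prior.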
14Information Extraction as Formulas
Assumptions:
- In one document, the same word always has the same meaning
- The ontology already knows all important meanings of proper names
possibleMeaning(Elvis_at_D15, ElvisPresley). 0.7
possibleMeaning(X,Y) => means(X,Y)
means(X,Y) /\ Y≠Z => ¬means(X,Z)
15Information Extraction as Formulas
Reasoning Problem
Pattern Matching Problem
type(Elvis,Taxidophobist).
Elvis died in England.
type(X,Taxidophobist) /\ bornInPlace(X,Y) => ¬diedInPlace(X,Z)
Louis XIV died in France.
"died in" -> diedInPlace ?
Disambiguation Problem
meaning(Elvis_at_D15,
ElvisPresley). 0.7
16Information Extraction as Formulas
Pattern Matching Problem
occurs("died in", Elvis_at_D15,
England_at_D15). 14
Elvis died in England.
Louis XIV died in France.
"died in" -> diedInPlace ?
occurs(P,Wic1,Wic2) /\ means(Wic1,X) /\ means(Wic2,Y) /\ R(X,Y) => mapsTo(P,R)
occurs(P,Wic1,Wic2) /\ means(Wic1,X) /\ means(Wic2,Y) /\ mapsTo(P,R) => R(X,Y)
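The interplay of the two rules can be sketched in Python as follows. This is a simplified, crisp (unweighted) forward application, whereas SOFIE treats both rules as weighted formulas that are solved jointly. The document identifier D7, the function names, and the assumption that the ontology already contains diedInPlace(LouisXIV, France) are hypothetical.

```python
# Pattern occurrences: (pattern, word in context 1, word in context 2)
occurs = [
    ("died in", "Elvis_at_D15", "England_at_D15"),
    ("died in", "LouisXIV_at_D7", "France_at_D7"),
]
# Assumed disambiguation: word in context -> entity
means = {
    "Elvis_at_D15": "ElvisPresley",
    "England_at_D15": "England",
    "LouisXIV_at_D7": "LouisXIV",
    "France_at_D7": "France",
}
# Facts assumed to be known to the ontology: (relation, X, Y)
facts = {("diedInPlace", "LouisXIV", "France")}


def learn_mappings(occurs, means, facts):
    """Rule 1: a pattern seen with entities that stand in R maps to R."""
    maps_to = set()
    for pattern, wic1, wic2 in occurs:
        x, y = means[wic1], means[wic2]
        for relation, fx, fy in facts:
            if (fx, fy) == (x, y):
                maps_to.add((pattern, relation))
    return maps_to


def derive_facts(occurs, means, maps_to):
    """Rule 2: a mapped pattern seen with two entities yields R(X,Y)."""
    derived = set()
    for pattern, wic1, wic2 in occurs:
        x, y = means[wic1], means[wic2]
        for mapped_pattern, relation in maps_to:
            if mapped_pattern == pattern:
                derived.add((relation, x, y))
    return derived


maps_to = learn_mappings(occurs, means, facts)
new_facts = derive_facts(occurs, means, maps_to) - facts
print(maps_to, new_facts)
```

The sketch derives diedInPlace(ElvisPresley, England) without any check against other knowledge, which is exactly the kind of hypothesis that the reasoning rules are meant to weigh against the evidence.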
17Information Extraction as Formulas
Reasoning Problem
Pattern Matching Problem
type(Elvis,Taxidophobist).
occurs("died in", Elvis_at_D15,
England_at_D15). 14
type(X,Taxidophobist) /\ bornInPlace(X,Y) => ¬diedInPlace(X,Z)
Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized
means(Elvis_at_D15, ElvisPresley) ?
mapsTo("died In", diedInPlace) ?
diedIn(ElvisPresley, England) ?
Disambiguation Problem
meaning(Elvis_at_D15,
ElvisPresley). 0.7
18Weighted MAX SAT Problem
Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized
Structurally much simpler than Markov Logic Networks (MLNs). No need to model probabilities if we're just interested in the maximum.
Problems:
- The Weighted MAX SAT Problem is NP-hard
- Our instance of the problem is huge
- The most popular greedy approximation algorithm (Johnson's) does not work well with our type of formulas
bornInPlace(X,Y) => ¬bornInPlace(X,Z)
¬A v ¬B, ¬A v ¬C, ¬B v ¬C
Johnson's algorithm has an upper bound of 2/3 on its approximation ratio
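To make the problem statement concrete, here is a minimal brute-force sketch in Python. It enumerates all truth assignments, so it is only usable for toy instances, in line with the NP-hardness noted above. The three hypotheses A, B, C stand for competing values of a functional relation; all weights and function names are hypothetical.

```python
from itertools import product


def satisfied_weight(clauses, assignment):
    """Sum the weights of all clauses with at least one true literal."""
    total = 0.0
    for weight, literals in clauses:
        if any(assignment[var] == polarity for var, polarity in literals):
            total += weight
    return total


def brute_force_max_sat(variables, clauses):
    """Try every assignment and keep the one with maximal satisfied weight."""
    best_assignment, best_weight = None, float("-inf")
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = satisfied_weight(clauses, assignment)
        if weight > best_weight:
            best_assignment, best_weight = assignment, weight
    return best_assignment, best_weight


variables = ["A", "B", "C"]
# A literal is (variable, polarity); polarity False means negation.
clauses = [
    (10.0, [("A", False), ("B", False)]),  # ¬A v ¬B
    (10.0, [("A", False), ("C", False)]),  # ¬A v ¬C
    (10.0, [("B", False), ("C", False)]),  # ¬B v ¬C
    (3.0, [("A", True)]),                  # hypothetical evidence for A
    (2.0, [("B", True)]),                  # hypothetical evidence for B
    (1.0, [("C", True)]),                  # hypothetical evidence for C
]

best_assignment, best_weight = brute_force_max_sat(variables, clauses)
print(best_assignment, best_weight)
```

With these weights the optimum accepts only A: all three exclusion clauses stay satisfied and the strongest evidence clause is kept.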
19FMS Algorithm
The Functional MAX SAT Algorithm considers only
unit clauses.
Formulas
Hypotheses
¬A v ¬B   w1
¬A v ¬B   w2
¬B v ¬C   w3
C   w4
false
A B C
false
true
The Functional MAX SAT Algorithm propagates
Dominating Unit Clauses
¬A v B   10
¬A   10
A   30
A = true, since 30 > 10 + 10
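The propagation step can be sketched in Python as follows. This is a simplified illustration and not the full FMS algorithm: it assumes that a unit clause dominates when its weight strictly exceeds the total weight of all clauses containing the opposite literal, as in the 30 > 10 + 10 example. The function names are hypothetical.

```python
def simplify(clauses, var, value):
    """Apply var=value: drop satisfied clauses, remove falsified literals."""
    remaining = []
    for weight, literals in clauses:
        if (var, value) in literals:
            continue  # clause is satisfied by the assignment
        rest = frozenset(lit for lit in literals if lit[0] != var)
        if rest:
            remaining.append((weight, rest))
        # an empty rest means the clause is falsified and contributes nothing
    return remaining


def propagate_dominating_units(clauses):
    """Repeatedly fix variables whose unit clause outweighs all opposition."""
    assignment = {}
    clauses = list(clauses)
    changed = True
    while changed:
        changed = False
        # Total weight of unit clauses per literal
        unit_weight = {}
        for weight, literals in clauses:
            if len(literals) == 1:
                (literal,) = literals
                unit_weight[literal] = unit_weight.get(literal, 0) + weight
        for (var, polarity), weight in unit_weight.items():
            opposing = sum(w for w, lits in clauses if (var, not polarity) in lits)
            if weight > opposing:
                assignment[var] = polarity
                clauses = simplify(clauses, var, polarity)
                changed = True
                break  # clause set changed, recompute unit weights
    return assignment, clauses


# The example from the slide: ¬A v B (10), ¬A (10), A (30)
example = [
    (10, frozenset({("A", False), ("B", True)})),
    (10, frozenset({("A", False)})),
    (30, frozenset({("A", True)})),
]
assignment, remaining = propagate_dominating_units(example)
print(assignment, remaining)
```

Setting A to true turns ¬A v B into the unit clause B, which then dominates because nothing opposes it, so the sketch also sets B to true.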
20FMS Algorithm
Polynomial time
FMS Algorithm: FOR i=1 TO 42 ... NEXT i
Approximation Guarantee
Experiments show better performance in practice than Johnson's algorithm in our setting.
21FMS Algorithm
Elvis died in England
r(X,Y) /\ s(Y) => t(X,Y)
FMS Algorithm: FOR i=1 TO 42 ... NEXT i
22FMS Algorithm
Elvis died in England
r(X,Y) /\ s(Y) => t(X,Y)
type(Elvis,Taxidophobist) = 1
diedIn(Elvis,England) = 0
FMS Algorithm: FOR i=1 TO 42 ... NEXT i
means(Elvis_at_D15,Elvis) = 0
means(Elvis_at_D15,...) = 1
diedIn
England
St. Elvis
23SOFIE
r(X,Y) /\ s(Y) => t(X,Y)
diedIn
England
St. Elvis
24Other Experiments
(All experiments with the YAGO ontology)
25Conclusion
SOFIE unifies the tasks of
- entity disambiguation
- pattern extraction
- semantic constraint reasoning
in a single framework, delivering canonicalized facts of high precision
s(Y) t(X)
died in England...
but is alive!
http://mpii.de/yago-naga
26SOFIE rules!
occurs(P,WX,WY) /\ refersTo(WX,X) /\ refersTo(WY,Y) /\ R(X,Y) => expresses(P,R)
occurs(P,WX,WY) /\ expresses(P,R) /\ refersTo(WX,X) /\ refersTo(WY,Y) /\ domain(R,D1) /\ range(R,D2) /\ type(X,D1) /\ type(Y,D2) => R(X,Y)
R(X,Y) /\ R(X,Z) /\ type(R,function) => Y = Z
disambiguationPrior(W,X) => refersTo(W,X)
? R(X,Y)
bornInYear(X,B) /\ diedInYear(X,D) => B < D
27SOFIE Experiments
28SOFIE Large-Scale Experiment
Goal: Extract bornIn, bornOnDate, diedIn, diedOnDate, politicianOf
Corpus: 3700 biography documents downloaded from the Web
Results (precision in %)
Runtime (summed over 5 batches):
Parsing 7:05h, Hypothesis Generation 6:15h, Solving 2:30h, Total 15:50h
87 87 13 98 95
? 90
bornIn bornOnD diedIn diedOnD polOf
29SOFIE Relation to Markov Logic
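r(x,y) /\ s(x,z) => t(x,z)   w   ...
sat(i,X): Number of satisfied instances of the ith formula
w_i: Weight of the ith formula
$P(X) \propto \prod_i e^{\mathrm{sat}(i,X) \cdot w_i}$
$\max_X \prod_i e^{\mathrm{sat}(i,X) \cdot w_i}$
$\max_X \log \left( \prod_i e^{\mathrm{sat}(i,X) \cdot w_i} \right)$ (the logarithm is monotonic, so the maximizing X is unchanged)
$\max_X \sum_i \mathrm{sat}(i,X) \cdot w_i$
bornIn(Nicholas, Patras): false / true
Weighted MAX SAT problem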
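The reduction can be checked numerically on a toy instance. The following Python sketch assumes two hypothetical ground atoms a and b and three hypothetical weighted formulas, and compares the world that maximizes the unnormalized probability with the world that maximizes the satisfied weight.

```python
import math
from itertools import product

# Hypothetical weighted ground formulas over the atoms "a" and "b"
formulas = [
    (2.0, lambda x: (not x["a"]) or x["b"]),  # a => b
    (1.5, lambda x: x["a"]),                  # a
    (0.5, lambda x: not x["b"]),              # ¬b
]


def sat(formula, world):
    """1 if the ground formula holds in the world, else 0."""
    return 1 if formula(world) else 0


def unnormalized_probability(world):
    """Product over formulas of e^(sat * weight), as in Markov Logic."""
    return math.prod(math.exp(sat(f, world) * w) for w, f in formulas)


def satisfied_weight(world):
    """Sum over formulas of sat * weight, the Weighted MAX SAT objective."""
    return sum(sat(f, world) * w for w, f in formulas)


worlds = [dict(zip("ab", values)) for values in product([False, True], repeat=2)]
best_by_probability = max(worlds, key=unnormalized_probability)
best_by_weight = max(worlds, key=satisfied_weight)
print(best_by_probability, best_by_weight)
```

Both objectives select the same world, because the second is the logarithm of the first.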
30Grounding
r(X,Y) /\ s(Y) => t(X,Y)
¬r(X,Y) v ¬s(Y) v t(X,Y)
Immutable, complete facts (e.g. pattern occurrences): r(a,a), ¬r(a,b), ¬r(b,a), ¬r(b,b)
Entities: a, b
¬r(a,a) v ¬s(a) v t(a,a)
¬r(a,b) v ¬s(b) v t(a,b)
¬r(b,a) v ¬s(a) v t(b,a)
¬r(b,b) v ¬s(b) v t(b,b)
31Grounding
r(X,Y) /\ s(Y) => t(X,Y)   w
¬r(X,Y) v ¬s(Y) v t(X,Y)
Immutable, complete facts (e.g. pattern occurrences): r(a,a), ¬r(a,b), ¬r(b,a), ¬r(b,b)
¬s(a) v t(a,a)   w
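The grounding and simplification steps can be sketched in Python. The sketch grounds the rule over the entities a and b and then uses the immutable, complete relation r to drop satisfied clauses and remove false literals. The data layout (a literal as predicate, arguments, polarity) and the function names are assumptions made for illustration.

```python
from itertools import product

ENTITIES = ["a", "b"]
# r is immutable and complete: listed atoms are true, all others are false
R_FACTS = {("a", "a")}


def ground_rule(entities):
    """Ground the clause ¬r(X,Y) v ¬s(Y) v t(X,Y) for all entity pairs."""
    clauses = []
    for x, y in product(entities, repeat=2):
        clauses.append([
            ("r", (x, y), False),  # ¬r(x,y)
            ("s", (y,), False),    # ¬s(y)
            ("t", (x, y), True),   # t(x,y)
        ])
    return clauses


def simplify_with_facts(clauses, r_facts):
    """Evaluate all r literals against the facts and simplify the clauses."""
    simplified = []
    for clause in clauses:
        kept, satisfied = [], False
        for predicate, args, positive in clause:
            if predicate == "r":
                truth = args in r_facts
                if truth == positive:
                    satisfied = True  # literal is true, whole clause holds
                    break
                continue  # literal is false, remove it from the clause
            kept.append((predicate, args, positive))
        if not satisfied:
            simplified.append(kept)
    return simplified


ground = ground_rule(ENTITIES)
remaining = simplify_with_facts(ground, R_FACTS)
print(len(ground), remaining)
```

Of the four ground clauses only one survives, namely ¬s(a) v t(a,a), which matches the clause shown on the slide.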
32Grounding
¬s(a) v t(a,a)   w1
p(c,d) v ¬q(e) v ...   w2
Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized
means(Elvis_at_D15, ElvisPresley) = true?
mapsTo("died In", diedInPlace) = true?
diedIn(ElvisPresley, England) = true?