Title: Slide 1
1YAGO Yet Another Great Ontology
PhD Defense, Fabian M. Suchanek (Max Planck
Institute for Informatics, Saarbrücken)
2Overview
- Motivation: Why would anybody need ontologies?
- Building a Core Ontology: YAGO
- Extending the Core Ontology: SOFIE
3Santa Claus in Need
World population
4The Search for a Second Santa Claus
strong, tall guy , australian
Seeking strong, tall Australian man I'm 27, blue
eyes, looking for a tall strong Australian man.
girls-seek-guys.com/london/42 Cached
Similar pages
5The Search for a Second Santa Claus
strong person, > 1.90, Australian
Seeking strong, tall Australian man I'm 27, blue
eyes, looking for a tall strong Australian man.
... I'm 190 kg girls-seek-guys.com/london/42
Cached Similar pages
6The Search for a Second Santa Claus
Hi Larry, it's me, Santa Claus. I think you
misunderstood wh
Seeking strong, tall Australian man I'm 27, blue
eyes, looking for a tall strong Australian man.
girls-seek-guys.com/london/42 Cached
Similar pages
7Solution An Ontology
physical entity
is a
person
is a
is a
continent
is a
isFrom
height
Australia
1.90m
8Solution An Ontology
physical entity
is a
person
is a
Classes
is a
Relations
continent
is a
isFrom
Individuals
Australia
9Vision
Gathering the knowledge of this world in a
structured ontology.
- Semantic Search
- Question answering
- Machine Translation
- Document classification
The world, I'd like to say, even though some may
contradict, is not as it seems. It rather seems
as if the world seems not what it seems
10Plan of Attack
- Motivation ✓
- Building a Core Ontology: YAGO
- Extending the Core Ontology: SOFIE
11YAGO Goal
Goal Build a Large Ontology
Previous Approaches
- Assemble the ontology manually (WordNet, SUMO, Cyc, GeneOntology).
  Problem: usually low coverage (MPI is in none of these).
- Use community work (Semantic Wikipedia, Freebase).
  Problem: we don't know yet whether it takes off.
12YAGO Goal
Goal Build a Large Ontology
Our Approach
- Extract knowledge from Wikipedia and WordNet (securing high coverage)
- Use extensive quality control techniques (securing high consistency)
13YAGO Infoboxes
Claus K
bornIn
Sydney
(Filler article text: blah blah blub. Don't read this! Better listen to
the talk! In particular, blub, filler text, and so on.)
Exploit infoboxes
Born in Sydney ...
14YAGO Categories
Claus K
bornIn
born
Sydney
1980
Exploit infoboxes
Exploit relational categories
Categories
1980_births
15YAGO Categories
Australian Boxer
Claus K
isA
bornIn
born
Sydney
1980
Exploit infoboxes
Exploit relational categories
Categories
Exploit conceptual categories
Australian Boxers
16YAGO Categories
Australian Boxer
Kick boxing
Claus K
isA
isA
bornIn
born
Sydney
1980
Exploit infoboxes
Exploit relational categories
Categories
Exploit conceptual categories
Kick boxing
Avoid thematic categories
17YAGO Upper Model
entity
?
person
Australian boxer
is a
born
1980
18YAGO Upper Model
Business
Social_group
?
People_by_occupation
Australian boxer
is a
born
1980
19YAGO Upper Model
Person
subclass
WordNet
Boxer
subclass
Australian boxer
is a
Wikipedia
born
1980
Suchanek et al. WWW 2007
20YAGO Quality Control
1. Canonicalization
   1. ... of entities
Santa Klaus
Santa Clause
Santa Claus
Santa
21YAGO Quality Control
1. Canonicalization
   1. ... of entities
22YAGO Quality Control
1. Canonicalization
   1. ... of entities
   2. ... of facts
born
1980
born
1980-12-19
23YAGO Quality Control
1. Canonicalization
   1. ... of entities
   2. ... of facts
2. Type Checks
   1. Reductive Type Checking
range(bornOnDate, timepoint)
bornOnDate(Claus_Kent, Sydney)
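A minimal sketch of how a reductive type check could reject the fact above. The `RANGE` and `TYPE_OF` tables and the `type_check` helper are hypothetical names made up for illustration; they are not taken from the YAGO implementation.

```python
# Hypothetical miniature of reductive type checking: a candidate fact is
# dropped when its object does not have the type required by the range
# of its relation. All tables below are illustrative assumptions.
RANGE = {"bornOnDate": "timepoint", "bornIn": "city"}
TYPE_OF = {"Sydney": "city", "1980-12-19": "timepoint"}


def type_check(facts):
    """Keep only facts whose object matches range(relation)."""
    kept = []
    for subject, relation, obj in facts:
        expected = RANGE.get(relation)
        if expected is not None and TYPE_OF.get(obj) == expected:
            kept.append((subject, relation, obj))
    return kept


candidates = [
    ("Claus_Kent", "bornOnDate", "Sydney"),      # violates range(bornOnDate, timepoint)
    ("Claus_Kent", "bornOnDate", "1980-12-19"),
    ("Claus_Kent", "bornIn", "Sydney"),
]
accepted = type_check(candidates)
```

The check is "reductive" in the sense that it only removes candidate facts; it never adds or repairs them.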
24YAGO Quality Control
Entity
1. Canonicalization
   1. ... of entities
   2. ... of facts
2. Type Checks
   1. Reductive Type Checking
   2. Type Coherence Checking
Person
Artifact
Boxer, Swimmer, Flight instructor, Airplane
25YAGO Quality Control
1. Canonicalization
   1. ... of entities
   2. ... of facts
2. Type Checks
   1. Reductive Type Checking
   2. Type Coherence Checking
Every fact and every entity occurs exactly once.
Every fact fulfills its type constraints.
Suchanek et al. JWS 2008
26YAGO Numbers
Relations: 100 (bornIn, actedIn, hasInflation, ...)
Entities: 2 million
Facts: 19 million
Accuracy: 95%
One of the largest public free ontologies
Unprecedented quality among automatically
constructed ontologies
27YAGO Model
boxer
1: (ClausKent, is_a, boxer)
2: (1, since, 1990)
3: (1, source, Wikipedia)
since
1990
is a
source
Wikipedia
28YAGO Model
- A YAGO ontology over
  - a set of relations R
  - a set of common entities C
  - a set of fact identifiers I
- is a function
  - $I \to (R \cup C \cup I) \times R \times (R \cup I \cup C)$
1: (ClausKent, is_a, boxer)
2: (1, since, 1990)
3: (1, source, Wikipedia)
- We can talk about
  - facts: (1, source, Wikipedia)
  - additional arguments: (1, since, 1990)
  - relations: (since, hasRange, time_interval)
Consistency is still decidable.
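The model above maps fact identifiers to triples whose arguments may themselves be fact identifiers. A small sketch of that idea follows; the `#`-prefixed identifiers and the helper names `about` and `render` are assumptions chosen for illustration, not YAGO's storage format.

```python
# Each fact identifier maps to a (subject, relation, object) triple.
# Subjects and objects may be fact identifiers, so a fact can be
# described by other facts (reification).
facts = {
    "#1": ("ClausKent", "is_a", "boxer"),
    "#2": ("#1", "since", 1990),
    "#3": ("#1", "source", "Wikipedia"),
}


def about(fact_id):
    """Collect {relation: object} for all facts whose subject is fact_id."""
    return {rel: obj for subj, rel, obj in facts.values() if subj == fact_id}


def render(fact_id):
    """Print a fact, expanding nested fact identifiers recursively."""
    subj, rel, obj = facts[fact_id]

    def show(x):
        return render(x) if x in facts else str(x)

    return f"({show(subj)}, {rel}, {show(obj)})"


annotations = about("#1")
expanded = render("#2")
```

Here `about("#1")` gathers the additional arguments attached to fact `#1`, and `render("#2")` expands the nested fact for display.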
29YAGO Summary
YAGO is an ontology that is
- large (combining Wikipedia and WordNet)
- accurate (using extensive quality control)
- computationally tractable (with a decidable consistency)
30Plan of Attack
- Motivation ✓
- Building a Core Ontology: YAGO ✓
- Extending the Core Ontology: SOFIE
YAGO
31SOFIE Goal Statement
bornIn
Patara
Saint Nicholas
Goal Extending the ontology
Saint Nicholas was born in Patara.
32SOFIE Goal Statement
bornIn
Patara
Saint Nicholas
Goal Extending the ontology
Saint Nicholas се е родил в Patara. (Bulgarian: "Saint Nicholas was born in Patara.")
33SOFIE Goal Statement
bornIn
Patara
Saint Nicholas
Goal Extending the ontology
recoverWithout(most_people, medication)
areUnder(0, the_age_of_18)
support(these_findings, the_notion)
Saint Nicholas was born in Patara.
Previous Approaches
- Extract knowledge from corpora (e.g. the Web)
  (Text2Onto, Espresso, Snowball, TextRunner).
  Problems: low accuracy, non-canonicity
34SOFIE Goal Statement
bornIn
Patara
Saint Nicholas
Goal Extending the ontology
Saint Nicholas was born in Patara.
Our Approach (1)
- LEILA: Combining Linguistic and Statistical
  Analysis [Suchanek et al. KDD 2006]. Has high
  accuracy, but does not deliver canonicity.
35SOFIE Goal Statement
bornIn
Patara
Saint Nicholas
Goal Extending the ontology
Saint Nicholas was born in Patara.
Our Approach (2)
- SOFIE: Use logical reasoning to guarantee
  canonicity
36SOFIE Example
YAGO
Worshipped People
bornInYear
1935
Saint Nicholas was born in the year 1417.
Elvis Presley was born in the year 1935.
"was born in the year" expresses bornInYear
Pattern occurrence => pattern meaning
37SOFIE Example
YAGO
Worshipped People
bornInYear
1935
Saint Nicholas was born in the year 1417.
Elvis Presley was born in the year 1935.
"was born in the year" expresses bornInYear
Pattern occurrence => pattern meaning
bornInYear
Pattern occurrence => sentence meaning
1417
38SOFIE Example
YAGO
Worshipped People
bornInYear
1935
Saint Nicholas was born in the year 1417.
diedInYear
Elvis Presley was born in the year 1935.
347
"was born in the year" expresses bornInYear
Pattern occurrence => pattern meaning
bornInYear
Pattern occurrence => sentence meaning
1417
People should be born before they die.
39SOFIE Example
YAGO
Worshipped People
bornInYear
1935
Saint Nicholas was born in the year 1417.
diedInYear
Elvis Presley was born in the year 1935.
347
"was born in the year" expresses bornInYear
Pattern occurrence => pattern meaning
bornInYear
Pattern occurrence => sentence meaning
1417
People should be born before they die.
40SOFIE Example
YAGO
Task 1 Find Patterns
bornInYear
1935
Saint Nicholas was born in the year 1417.
diedInYear
Elvis Presley was born in the year 1935.
347
Task 2 Use semantic reasoning
Task 3 Disambiguate entities
Pattern occurrence => pattern meaning
Pattern occurrence => sentence meaning
bornInYear
1417
People should be born before they die.
41SOFIE Its all logical formulae!
YAGO
Task 1 Find Patterns
bornInYear(ElvisPresley, 1935)
diedInYear(NicholasOfMyra, 347)
occurs("was born in the year", SaintNicholas, 1417)
occurs("was born in the year", ElvisPresley, 1935)
Task 2 Use semantic reasoning
Task 3 Disambiguate entities
occurs(P,X,Y) /\ expresses(P,R) => R(X,Y)
means(SaintNicholas, NicholasOfMyra) [0.8]
means(SaintNicholas, NicholasOfFlüe) [0.2]
refersTo(SaintNicholas, NicholasOfFlüe) ?
bornOnDate(NicholasOfFlüe, 1417) ?
bornInYear(X,B) /\ diedInYear(X,D) => B < D
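The interplay of the three tasks on this slide can be sketched in a few lines. This is a greedy stand-in written for illustration only, not SOFIE's MAX SAT reasoning: it tries the disambiguation candidates in order of their prior and keeps the first one that does not violate the born-before-died rule. The function names, and the assumption that "ElvisPresley" is unambiguous, are mine.

```python
# Facts already in the ontology (taken from the slide).
known = {
    ("bornInYear", "ElvisPresley"): 1935,
    ("diedInYear", "NicholasOfMyra"): 347,
}
# Pattern occurrences found in the text: (pattern, word, value).
occurrences = [
    ("was born in the year", "SaintNicholas", 1417),
    ("was born in the year", "ElvisPresley", 1935),
]
# Disambiguation priors; the ElvisPresley entry is an assumption.
means = {
    "SaintNicholas": {"NicholasOfMyra": 0.8, "NicholasOfFlüe": 0.2},
    "ElvisPresley": {"ElvisPresley": 1.0},
}


def learn_patterns():
    """Task 1: an occurrence that matches a known fact suggests expresses(P, R)."""
    expresses = set()
    for pattern, word, value in occurrences:
        for entity in means[word]:
            for (relation, subject), known_value in known.items():
                if subject == entity and known_value == value:
                    expresses.add((pattern, relation))
    return expresses


def consistent(entity, relation, value):
    """Task 2: bornInYear(X, B) AND diedInYear(X, D) implies B < D."""
    if relation == "bornInYear":
        died = known.get(("diedInYear", entity))
        return died is None or value < died
    return True


def extract():
    """Task 3: pick the most likely entity that keeps the ontology consistent."""
    new_facts = []
    for pattern, word, value in occurrences:
        for learned_pattern, relation in learn_patterns():
            if learned_pattern != pattern:
                continue
            for entity in sorted(means[word], key=means[word].get, reverse=True):
                if known.get((relation, entity)) == value:
                    break  # already in the ontology, nothing new to add
                if consistent(entity, relation, value):
                    new_facts.append((relation, entity, value))
                    break
    return new_facts


new_facts = extract()
```

Although NicholasOfMyra has the higher prior, a birth in 1417 contradicts a death in 347, so the sketch attaches the new fact to NicholasOfFlüe.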
42SOFIE Information Extraction as MAX SAT
We have a Weighted MAX SAT Problem
r(x,y) /\ s(x,z) => t(x,z) [w] ...
Problem
- The Weighted MAX SAT Problem is NP-hard
- Our instance contains YAGO (19 million facts)
  and textual facts (e.g. 10,000 facts)
- The best-known approximation algorithm cannot
  deal well with our specific instance
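To make the problem statement concrete, here is a brute-force Weighted MAX SAT solver on a toy instance. It enumerates every truth assignment, which is only feasible for a handful of variables and illustrates why the full-scale instance needs a different algorithm. The variable names and weights are invented for the example.

```python
from itertools import product


def weighted_max_sat(variables, clauses):
    """Return the assignment that maximizes the total weight of satisfied clauses.

    clauses: list of (literals, weight); a literal is (variable, polarity).
    Exhaustive search over 2**len(variables) assignments.
    """
    best_assignment, best_weight = None, float("-inf")
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = sum(
            w for literals, w in clauses
            if any(assignment[var] == polarity for var, polarity in literals)
        )
        if weight > best_weight:
            best_assignment, best_weight = assignment, weight
    return best_assignment, best_weight


# Toy instance: the rule r /\ s => t becomes the clause (not r) or (not s) or t.
clauses = [
    ([("r", False), ("s", False), ("t", True)], 5.0),  # the rule, illustrative weight
    ([("r", True)], 1.0),                              # evidence for r
    ([("s", True)], 1.0),                              # evidence for s
    ([("t", False)], 0.5),                             # weak prior against t
]
best, total = weighted_max_sat(["r", "s", "t"], clauses)
```

With these weights the best assignment accepts `t`: violating the weak prior against `t` costs less than violating the rule or discarding the evidence.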
43SOFIE A Unifying Framework
r(a,b) => s(x,y)
Task 1 Find Patterns
Polynomial time
Algorithm: Functional MAX SAT
FOR i=1 TO 42 ... NEXT i
Task 2 Use semantic reasoning
Approximation Guarantee
Task 3 Disambiguate entities
1417
NicholasOfFlüe
Suchanek et al TR 2009
44SOFIE Experiments
Corpus                    Type             Docs  Relations  Time   Precision (%)
Wikipedia toy corpus      structured        100          3  8min   100
Wikipedia subcorpus       semi-structured  2000         15  15h    94
News article toy corpus   unstructured      150          1  24min  91
Biographies from Web      unstructured     3440          5  15h    90
45SOFIE Summary
SOFIE unifies 3 tasks in a single framework.
SOFIE delivers
- canonicalized facts
- of high precision
Task 1 Find Patterns
Task 2 Use semantic reasoning
Task 3 Disambiguate entities
46But back to the original question...
Is there any Australian guy taller than 1.90m who
could help me out?
47Conclusion Good News
- We made a great step towards gathering
  the knowledge of this world in a structured
  ontology
YAGO
SOFIE
- Christmas is safe!
48References
[Suchanek et al. KDD 2006] Fabian M. Suchanek, Georgiana Ifrim and Gerhard Weikum:
"Combining Linguistic and Statistical Analysis to Extract Relations from Web Documents".
Conference on Knowledge Discovery and Data Mining (KDD 2006).
[Suchanek et al. WWW 2007] Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum:
"YAGO - A Core of Semantic Knowledge".
International World Wide Web Conference (WWW 2007).
[Suchanek et al. JWS 2008] Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum:
"YAGO - A Large Ontology from Wikipedia and WordNet".
Journal of Web Semantics, 2008.
[Suchanek et al. TR 2009] Fabian M. Suchanek, Mauro Sozio and Gerhard Weikum:
"SOFIE: A Self-Organizing Framework for Information Extraction".
Submitted to the International World Wide Web Conference (WWW 2009).
See the Technical Report or my PhD thesis at
http://mpii.de/suchanek
49Acronyms
LEILA: Learning to Extract Information by Linguistic Analysis
YAGO: Yet Another Great Ontology
SOFIE: Self-Organizing Framework for Information Extraction
NAGA: Not Another Google Answer
50YAGO Thematic vs Conceptual Categories
Australian boxers of German origin => conceptual
Kick boxing in Australia => thematic
Shallow linguistic noun phrase parsing:
Premodifier, Head, Postmodifier
Heuristic: If the head is a plural word, the
category is conceptual
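The heuristic can be sketched as follows. This is a deliberately naive approximation: the preposition list and the "ends in s" plural test are assumptions for illustration, whereas the slide refers to proper shallow noun phrase parsing.

```python
# Assumed list of prepositions that start the postmodifier.
PREPOSITIONS = {"of", "in", "from", "by", "at", "for", "with"}


def parse_category(name):
    """Split a category name into (premodifier, head, postmodifier)."""
    words = name.replace("_", " ").split()
    cut = next(
        (i for i, w in enumerate(words) if w.lower() in PREPOSITIONS),
        len(words),
    )
    if cut == 0:  # name starts with a preposition: treat it as having no postmodifier
        cut = len(words)
    return words[:cut - 1], words[cut - 1], words[cut:]


def is_plural(word):
    """Very rough plural test; a real system would use a morphological analyzer."""
    lowered = word.lower()
    return lowered.endswith("s") and not lowered.endswith("ss")


def classify(name):
    """Conceptual if the head noun is plural, thematic otherwise."""
    _, head, _ = parse_category(name)
    return "conceptual" if is_plural(head) else "thematic"


label_boxers = classify("Australian boxers of German origin")
label_kick = classify("Kick boxing in Australia")
```

For the two slide examples, the heads are "boxers" (plural) and "boxing" (not plural), which yields the conceptual and thematic labels respectively.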
51YAGO Upper Model
Person
subclass
WordNet
Boxer 42
Boxer 1
....
Australian boxer
is a
Wikipedia
born
1980
52A Hitchhiker's Guide to Ontology
DBpedia (HU Berlin)
SUMO (research project)
YAGO forms taxonomic backbone
YAGO and SUMO have been merged
YAGO
YAGO is part of the project by its Web service
YAGO will be included
Linking Open Data (HU Berlin, U Leipzig, OLS Inc.)
Freebase (community)
Planned
YAGO contributes the entities
YAGO is used for bootstrapping
Cyc (commercial)
KOG (U Washington)
UMBEL (commercial)
Suchanek et al. JWS 2008
53YAGO Applications
NAGA (Semantic Search, Ranking) [Kasneci et al. ICDE 2008]
TagBooster (User Study on Social Tagging) [Suchanek et al. CIKM 2008]
YAGO
ESTER (Semantic Search, Full Text Search) [Bast et al. SIGIR 2007]
Projects by other people
54YAGO Relations
establishedOnDate, isMarriedTo, hasPopulation, hasHeight, hasWeight, hasInflation, actedIn, ...
is a, familyName, givenName, bornOnDate, diedOnDate, bornIn, diedIn, locatedIn
100 relations
55YAGO Size
[Bar chart of ontology sizes: KnowItAll 30,000; SUMO 60,000; WordNet 200,000; OpenCyc 300,000; Cyc 3,000,000; YAGO 19,000,000]
Publicly available ontologies with a quality
guarantee. Size is not correlated with usefulness.
56YAGO Model
Axioms: (x, is_a, y) /\ (y, subclass, z) => (x,
is_a, z) ...
person
subclass
saint
is a
is a
57YAGO Model
finite, unique
f1, f2, f3, f4, f5, f6, f7, f8, f9, f10
Axioms: (x, is_a, y) /\ (y, subclass, z) => (x,
is_a, z) ...
derive facts
f1, f2, f3, f4, f5
Eliminate facts
f1, f2, f3
finite, unique
Suchanek et al. WWW 2007
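Deriving facts with the axiom shown above can be sketched as a fixpoint computation. The sketch only implements the one axiom stated on the slide; the example entities are illustrative, and the `derive` helper is a hypothetical name.

```python
def derive(facts):
    """Apply (x, is_a, y) AND (y, subclass, z) => (x, is_a, z) until nothing changes."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for x, r1, y in list(derived):
            if r1 != "is_a":
                continue
            for y2, r2, z in list(derived):
                if r2 == "subclass" and y2 == y and (x, "is_a", z) not in derived:
                    derived.add((x, "is_a", z))
                    changed = True
    return derived


base = {
    ("NicholasOfMyra", "is_a", "saint"),
    ("saint", "subclass", "person"),
    ("person", "subclass", "entity"),
}
closure = derive(base)
```

Because the set of entities is finite, the loop terminates, and the result is the same regardless of the order in which facts are visited.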
58YAGO Knowledge Representation
OWL Full
RDFS
YAGO
ADTs
Acyclicity Datatypes
Reification
subClassOf
Transitivity
Property Restrictions
OWL DL
59SOFIE rules!
occurs(P,WX,WY) /\ refersTo(WX,X) /\
refersTo(WY,Y) /\ R(X,Y) => expresses(P,R)
occurs(P,WX,WY) /\ expresses(P,R) /\
refersTo(WX,X) /\ refersTo(WY,Y) /\
domain(R,D1) /\ range(R,D2) /\ type(X,D1) /\
type(Y,D2) => R(X,Y)
R(X,Y) /\ R(X,Z) /\ type(R,functionalRelation)
=> Y = Z
disambiguationPrior(W,X) => refersTo(W,X)
? R(X,Y)
relation-dependent rules
bornInYear(X,B) /\ diedInYear(X,D) => B < D
60SOFIE Clause transformation
Rules
r(X,Y) /\ s(X,Y) => t(X,X)
u(a)
Entities
a, b
Grounded Rules
r(a,a) /\ s(a,a) => t(a,a)
r(a,b) /\ s(a,b) => t(a,a)
r(b,a) /\ s(b,a) => t(b,b)
r(b,b) /\ s(b,b) => t(b,b)
u(a)
Clauses
¬r(a,a) \/ ¬s(a,a) \/ t(a,a)
¬r(a,b) \/ ¬s(a,b) \/ t(a,a)
¬r(b,a) \/ ¬s(b,a) \/ t(b,b)
¬r(b,b) \/ ¬s(b,b) \/ t(b,b)
u(a)
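The transformation on this slide, grounding a rule over all entities and rewriting each implication as a disjunction, can be reproduced with a short sketch. The data layout (`BODY`, `HEAD`, string-valued literals) is an assumption chosen for readability.

```python
from itertools import product

ENTITIES = ["a", "b"]
# The rule r(X,Y) /\ s(X,Y) => t(X,X), as (predicate, variables) pairs.
BODY = [("r", ("X", "Y")), ("s", ("X", "Y"))]
HEAD = ("t", ("X", "X"))


def ground(atom, binding):
    """Substitute entities for variables, e.g. r(X,Y) -> r(a,b)."""
    predicate, variables = atom
    return f"{predicate}({','.join(binding[v] for v in variables)})"


def grounded_clauses():
    """Ground the rule for every (X, Y) pair and convert it to a clause.

    A /\ B => C is logically equivalent to (not A) \/ (not B) \/ C.
    """
    clauses = []
    for x, y in product(ENTITIES, repeat=2):
        binding = {"X": x, "Y": y}
        negated_body = ["¬" + ground(atom, binding) for atom in BODY]
        clauses.append(negated_body + [ground(HEAD, binding)])
    return clauses


clauses = grounded_clauses()
```

With two entities and two variables the rule yields four grounded clauses, in the same order as on the slide.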
61SOFIE Clause transformation
Clauses
Textual Facts
1
¬r(a,a) \/ ¬s(a,a) \/ t(a,a)
¬r(a,b) \/ ¬s(a,b) \/ t(a,a)
¬r(b,a) \/ ¬s(b,a) \/ t(b,b)
¬r(b,b) \/ ¬s(b,b) \/ t(b,b)
u(a)
r(a,a) [w1]
r(a,b) [w2]
r(b,a) [w3]
r(b,b) [w4]
YAGO
s(a,a)
62SOFIE Clause weighting
Clauses
Textual Facts
¬1 \/ ¬1 \/ t(a,a) [w1]
¬1 \/ ¬s(a,b) \/ t(a,a) [w2]
¬1 \/ ¬s(b,a) \/ t(b,b) [w3]
¬1 \/ ¬s(b,b) \/ t(b,b) [w4]
u(a) [W]
r(a,a) [w1]
r(a,b) [w2]
r(b,a) [w3]
r(b,b) [w4]
YAGO
s(a,a)
63SOFIE Hypothesis generation
Textual Facts
Rules
r(a,b) [w1]
r(X,Y) /\ s(X,Y) => t(X,X)
Hypotheses
t(a,a) t(b,b)
64SOFIE Hypothesis generation
Grounded Rules
Rules
r(a,a) /\ s(a,a) => t(a,a)
r(a,b) /\ s(a,b) => t(a,a)
r(X,Y) /\ s(X,Y) => t(X,X)
Hypotheses
t(a,a)
65SOFIE Functional MAX SAT Algorithm
The functional MAX SAT Algorithm considers only
unit clauses.
Variables
Clauses
0
X Y Z
¬X \/ ¬Z [w1]
¬X \/ ¬Y [w1]
¬Y \/ ¬Z [w1]
Z [w1]
0
1
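The following is a simplified, hypothetical sketch of a procedure driven only by unit clauses; it is not the Functional MAX SAT algorithm from the technical report, whose details are not given on the slide. It repeatedly picks the heaviest unit clause, satisfies it, and simplifies the remaining clauses. Reading the slide's clauses as negated literals, and giving all of them weight 1.0, are both assumptions.

```python
def unit_clause_greedy(clauses):
    """Greedy assignment driven by unit clauses.

    clauses: list of (literals, weight); a literal is (variable, polarity).
    Variables that never appear in a unit clause stay unassigned.
    """
    clauses = [(list(literals), weight) for literals, weight in clauses]
    assignment = {}
    while True:
        units = [(weight, literals[0]) for literals, weight in clauses if len(literals) == 1]
        if not units:
            break
        _, (var, value) = max(units, key=lambda unit: unit[0])
        assignment[var] = value
        simplified = []
        for literals, weight in clauses:
            if (var, value) in literals:
                continue  # clause is satisfied, drop it
            rest = [lit for lit in literals if lit[0] != var]
            if rest:  # falsified literal removed; empty clauses are dropped
                simplified.append((rest, weight))
        clauses = simplified
    return assignment


example = [
    ([("X", False), ("Z", False)], 1.0),
    ([("X", False), ("Y", False)], 1.0),
    ([("Y", False), ("Z", False)], 1.0),
    ([("Z", True)], 1.0),
]
result = unit_clause_greedy(example)
```

On this instance the unit clause `Z` is satisfied first, which turns two of the binary clauses into unit clauses and forces `X` and `Y` to false, matching the 0, 0, 1 values shown on the slide.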
66SOFIE Experiments
Corpus                    Type                                   Docs  Rel  Time   Facts   Precision (%)  Recall (%)
Wikipedia toy corpus      structured                              100    3  8min   165     100            98
Wikipedia toy corpus      semi-structured, 50 infoboxes removed   100    3  8min   165     100            57
Wikipedia subcorpus       semi-structured                        2000   15  15h    505     94             ?
News article toy corpus   unstructured                            150    1  24min  35, 46  91             24, 31
Snowball                                                                           65      56             31
Biographies from Web      unstructured                           3440    5  15h    744     90             ?
67SOFIE Large-Scale Experiment
Goal Extract bornIn, bornOnDate, diedIn,
diedOnDate, politicianOf
Corpus 3700 biography documents downloaded from
the Web
Runtime (summed over 5 batches):
Parsing 705h, Hypothesis Generation 615h, Solving 230h, Total 1550h
Results (precision in %)
87 87 13 98 95
? 90
bornIn bornOnD diedIn diedOnD polOf
68SOFIE Relation to Markov Logic
r(x,y) /\ s(x,z) => t(x,z) [w] ...
$sat(i,X)$: number of satisfied instances of the ith formula
$w_i$: weight of the ith formula
$P(X) \propto \prod_i e^{sat(i,X) \cdot w_i}$
$\max_X \prod_i e^{sat(i,X) \cdot w_i}$
$\max_X \log \left( \prod_i e^{sat(i,X) \cdot w_i} \right)$
$\max_X \sum_i sat(i,X) \cdot w_i$
bornIn(Nicholas, Patras): false or true
=> Weighted MAX SAT problem
69LEILA Workflow
Fix one relation, e.g. foundedInYear
The UDS was founded in 1948. The UDS has 1974
employees. The MPII has 1'000 employees. The
MPI-SWS was founded in 2004. The MPI-SWS has
2'003 employees.
10 2
X was founded in Y
Examples UDS 1948 UDS 1949 UDS 1950 ... MPII
1988 MPII 1989 MPII 1990 ...
X has Y employees
3 20
foundedIn(MPI-SWS, 2004)
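The workflow on this slide can be approximated with surface patterns. This sketch replaces LEILA's linguistic analysis with a regular expression and its statistical learner with simple counting, so it only illustrates the example/counterexample idea; the regex, the single seed example, and the helper names are assumptions.

```python
import re

corpus = [
    "The UDS was founded in 1948.",
    "The UDS has 1974 employees.",
    "The MPII has 1'000 employees.",
    "The MPI-SWS was founded in 2004.",
    "The MPI-SWS has 2'003 employees.",
]
# Known instances of the fixed relation foundedInYear (seed example).
examples = {("UDS", "1948")}

# Matches "The <X> <text> <number> <text>" and captures the four parts.
PAIR = re.compile(r"^The (\S+) (.*?)(\d[\d']*)(.*)$")


def occurrences():
    """Yield (pattern, x, y) for every sentence linking an entity to a number."""
    for sentence in corpus:
        match = PAIR.match(sentence)
        if match:
            x, before, y, after = match.groups()
            yield f"X {before}Y{after}", x, y


def score_patterns():
    """Count examples and counterexamples per pattern.

    The relation is treated as functional: a different value for a known
    entity counts as a counterexample.
    """
    known_entities = {x for x, _ in examples}
    scores = {}
    for pattern, x, y in occurrences():
        positive, negative = scores.get(pattern, (0, 0))
        if (x, y) in examples:
            positive += 1
        elif x in known_entities:
            negative += 1
        scores[pattern] = (positive, negative)
    return scores


def extract():
    """Apply patterns with more examples than counterexamples to new pairs."""
    scores = score_patterns()
    good = {p for p, (positive, negative) in scores.items() if positive > negative}
    return {(x, y) for p, x, y in occurrences() if p in good and (x, y) not in examples}


new_pairs = extract()
```

The pattern "X has Y employees." produces the counterexample (UDS, 1974) and is rejected, while "X was founded in Y." matches the seed example and is then applied to the MPI-SWS sentence.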
70LEILA Theoretical considerations
THEOREM (Goodnaturedness): As the number of parsed
sentences increases, the probability of false
extractions decreases.
Intuition: One of two cases applies.
1. A pattern occurs very frequently. Then it is
   unlikely to be mistaken for a good pattern.
2. A pattern occurs very infrequently. Then it does
   not matter if it is mistaken for a good pattern.
Suchanek et al. KDD 2006
71LEILA The Linguistic Part
X was founded in Y
The MPI-SWS was founded in 2004.
foundedIn(MPI-SWS, 2004)
72LEILA The Linguistic Part
X was founded in Y
The MPI-SWS, the great institution, was founded
in 2004.
foundedIn(MPI-SWS, 2004)
73LEILA The Linguistic Part
X was founded in Y
The MPI-SWS, the great institution, was founded
in 2004.
foundedIn(MPI-SWS, 2004)
74Future Work With YAGO
- personalize (Shady, Maya)
- use social networks to extend YAGO (Maya, Sharat, Ashwin)
- make YAGO multilingual (Gerard)
- add Web services (Nicoleta)
- make querying efficient (Gjergji)
- store YAGO efficiently (Thomas)
- make reasoning efficient (Mauro, Martin)
- provide good visualization (Shady)
- add a temporal component to SOFIE
- add biomedical knowledge (Alessandro Fiori)
- add multimodal support (Martin Schreiber)
- add natural language support (help from workshop on Monday)
(slide by Prof. Gerhard Weikum)
75Future Work Beyond YAGO
- join forces with other ontology projects
- learn not just facts, but also relations
- apply the SOFIE approach in related settings
  (information extraction with music or pictures?)
76Acknowledgements
The following people have worked with me:
- LEILA: Georgiana Ifrim and Gerhard Weikum
- YAGO: Gjergji Kasneci and Gerhard Weikum
- SOFIE: Mauro Sozio and Gerhard Weikum
- TagBooster: Milan Vojnovic and Dinan Gunawardena
- NAGA: Gjergji Kasneci, Georgiana Ifrim, Shady Elbassuoni and Gerhard Weikum
- ESTER: Holger Bast, Ingmar Weber and Alex Chitea
- YAGOSUMO: Gerard de Melo and Adam Pease
- STAR: Gjergji, Mauro, Maya Ramanath and Gerhard
Thank you for making these projects possible!