Title: Joint work with
1 Joint work with Georgiana Ifrim, Gjergji Kasneci, Thomas Neumann, Maya Ramanath, Fabian Suchanek
2 Vision
Opportunity: Turn the Web (and Web 2.0 and Web 3.0 ...) into the world's most comprehensive knowledge base
- Approach
- 1) harvest and combine
- hand-crafted knowledge sources (Semantic Web, ontologies)
- automatic knowledge extraction (Statistical Web, text mining)
- social communities and human computing (Social Web, Web 2.0)
- 2) express knowledge queries, search, and rank
- 3) everything efficient and scalable
3 Why Google and Wikipedia Are Not Enough
Answer knowledge queries such as:
- proteins that inhibit proteases and other human enzymes
- connection between Thomas Mann and Goethe
- German Nobel prize winner who survived both world wars and all of his four children
- German universities with world-class computer scientists
- politicians who are also scientists
4 Why Google and Wikipedia Are Not Enough
Which politicians are also scientists?
- What is lacking?
"Information is not Knowledge. Knowledge is not Wisdom. Wisdom is not Truth. Truth is not Beauty. Beauty is not Music. Music is the best." (Frank Zappa)
- extract facts from Web pages
- capture user intention by concepts, entities, relations
5 NAGA Example
Query: x isa politician; x isa scientist
Results: Benjamin Franklin, Paul Wolfowitz, Angela Merkel
6 Related Work
[Diagram: related systems arranged by research area]
Areas: information extraction & ontology building; Web entity search & QA; semistructured IR & graph search
Systems shown: Cimple / DBlife, Libra, TextRunner, START, Avatar, Answers, UIMA, Hakia, Powerset, Freebase, Cyc, EntityRank, DBpedia, TopX, XQ-FT, YAGO & NAGA, Tijah, SPARQL, DBexplorer, BANKS, SWSE
7 Outline
✓ Motivation
Information Extraction & Knowledge Harvesting (YAGO)
Ranking for Search over Entity-Relation Graphs (NAGA)
Efficient Query Processing (RDF-3X)
Conclusion
8 Information Extraction (IE): Text to Records
- extracted facts often carry a confidence value
- this leads to a DB with uncertainty (probabilistic DB)
- IE is expensive and error-prone
- IE combines NLP, pattern matching, lexicons, statistical learning
9 High-Quality Knowledge Sources
General-purpose ontologies and thesauri: WordNet family
- 200 000 concepts and relations
- can be cast into description logics or a graph, with weights for relation strengths (derived from co-occurrence statistics)
WordNet entry:
scientist, man of science (a person with advanced knowledge)
  cosmographer, cosmographist
  biologist, life scientist
  chemist
  cognitive scientist
  computer scientist
  ...
  principal investigator, PI
  HAS INSTANCE: Bacon, Roger Bacon
10 Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
Infobox_Scientist
  name: Max Planck
  birth_date: April 23, 1858
  birth_place: Kiel, Germany
  death_date: October 4, 1947
  death_place: Göttingen, Germany
  residence: Germany
  nationality: [[Germany|German]]
  field: Physicist
  work_institution: University of Kiel, Humboldt-Universität zu Berlin, Georg-August-Universität Göttingen
  alma_mater: Ludwig-Maximilians-Universität München
  doctoral_advisor: Philipp von Jolly
  doctoral_students: Gustav Ludwig Hertz
  known_for: Planck's constant, [[Quantum mechanics|quantum theory]]
  prizes: Nobel Prize in Physics (1918)
11 Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
12 YAGO: Yet Another Great Ontology [F. Suchanek, G. Kasneci, G. Weikum: WWW'07]
- Turn Wikipedia into explicit knowledge base (semantic DB)
- keep source pages as witnesses
- Exploit hand-crafted categories and infobox templates
- Represent facts as explicit knowledge triples
- relation (entity1, entity2)
- (in FOL, compatible with RDF, OWL-lite, XML, etc.)
- Map (and disambiguate) relations into WordNet concept DAG
Examples:
relation        entity1       entity2
bornIn          Max_Planck    Kiel
isInstanceOf    Kiel          City
13 YAGO Knowledge Base [F. Suchanek et al.: WWW'07]
Accuracy ≈ 95%
[Diagram: excerpt of the YAGO graph in three layers]
concepts: Person subclass Entity; Location subclass Entity; Scientist subclass Person; City subclass Location; Country subclass Location; Biologist subclass Scientist; Physicist subclass Scientist
individuals: Max_Planck instanceOf Physicist; Kiel instanceOf City; Max_Planck bornIn Kiel; Max_Planck hasWon Nobel Prize; Max_Planck FatherOf Erwin_Planck; Max_Planck bornOn April 23, 1858; Max_Planck diedOn October 4, 1947
words: "Dr. Planck", "Max Karl Ernst Ludwig Planck", "Max Planck" means Max_Planck
Online access and download at http://www.mpi-inf.mpg.de/suchanek/yago/
14 Wikipedia Harvesting: Difficulties & Solutions
- instanceOf relation: misleading and difficult category names
- (disputed articles, particle physics, American Music of the 20th Century,
- Nobel laureates in physics, naturalized citizens of the United States, ...)
- → noun group parser: ignore when head word in singular
- isA relation: mapping categories onto WordNet classes
- Nobel laureates in physics → Nobel_laureates, people from Kiel → person
- → map to (singular of) head; exploit synsets and statistics
- Entity name ambiguities
- St. Petersburg, Saint Petersburg, M31, NGC224 → means ...
- → exploit Wikipedia redirects & disambiguations, WN synsets
- type checking for scrutinizing candidates
- accept fact candidate only if arguments have proper classes
- marriedTo (Max Planck, quantum physics) ∉ Person × Person
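
The type check in the last bullets can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not YAGO's actual code: the tables `SIGNATURES` and `ENTITY_CLASS` and the function `accept` are hypothetical names, and the class assignments are toy data chosen to mirror the slide's example.

```python
# Hypothetical relation signatures: relation -> (class of arg 1, class of arg 2).
SIGNATURES = {
    "marriedTo": ("Person", "Person"),
    "bornIn": ("Person", "City"),
}

# Toy class assignments (assumption); a real system would consult the ontology.
ENTITY_CLASS = {
    "Max Planck": "Person",
    "Kiel": "City",
    "quantum physics": "Topic",
}


def accept(relation, arg1, arg2):
    """Accept a fact candidate only if both arguments have the proper classes."""
    if relation not in SIGNATURES:
        return False
    expected1, expected2 = SIGNATURES[relation]
    return (ENTITY_CLASS.get(arg1) == expected1
            and ENTITY_CLASS.get(arg2) == expected2)


print(accept("bornIn", "Max Planck", "Kiel"))
print(accept("marriedTo", "Max Planck", "quantum physics"))
```

The sketch compares classes by exact match; a fuller version would also accept subclasses (for example a Physicist where a Person is expected) by walking the subclass hierarchy.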
15 Higher-Order Facts in YAGO
[Diagram: facts about facts, built on CapitalOf (Bonn, Germany) and CapitalOf (Berlin, Germany)]
16 Ongoing Work: YAGO for Easier IE
YAGO knows (almost) all (interesting) entities → leverage for discovering & extracting new facts in NL texts
IE with dependency parser is expensive!
Example sentence: "The city of Paris was founded on an island in the Seine in 300 BC"
[Diagram: Seine isa river; Paris isa city; Seine runsThrough Paris; locatedIn edges connecting Seine, Paris, France, and Europe]
- can filter out many uninteresting sentences
- can quickly identify relation arguments
- can eliminate many fact candidates by type checking
- can focus on specific properties like time
17 Outline
✓ Motivation
✓ Information Extraction & Knowledge Harvesting (YAGO)
Ranking for Search over Entity-Relation Graphs (NAGA)
Efficient Query Processing (RDF-3X)
Conclusion
18 NAGA: Graph Search [G. Kasneci et al.: ICDE'08]
Graph-based search on YAGO-style knowledge bases with built-in ranking based on confidence and informativeness
[Diagram: four example query graphs]
- discovery queries: x isa politician; x isa scientist
- connectedness queries: nodes Thomas Mann, Goethe, German novelist; isa edges
- complex queries (with regular expressions): variables x, p, u; edges isa, wonPrize, inField, worksAt | graduatedFrom, locatedIn; constants scientist, computer science, university, Germany
- queries over reified facts: variable c; edges isa, capitalOf, validIn; constants city, Germany, 1988
19 Search Results Without Ranking
q: Fisher isa scientist; Fisher isa x
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = alumnus_109165182
@Fisher = Irving_Fisher, @scientist = scientist_109871938, X = social_scientist_109927304
@Fisher = James_Fisher, @scientist = scientist_10981938, X = ornithologist_109711173
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = theorist_110008610
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = colleague_109301221
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = organism_100003226

mathematician_109635652 subClassOf scientist_109871938
Alumni_of_Gonville_and_Caius_College,_Cambridge subClassOf alumnus_109165182
"Fisher" familyNameOf Ronald_Fisher
Ronald_Fisher type Alumni_of_Gonville_and_Caius_College,_Cambridge
Ronald_Fisher type 20th_century_mathematicians
"scientist" means scientist_109871938
20 Ranking with Statistical Language Model
q: Fisher isa scientist; Fisher isa x
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = mathematician_109635652
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = statistician_109958989
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = president_109787431
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = geneticist_109475749
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = scientist_109871938

Score: 7.184462521168058E-13
mathematician_109635652 subClassOf scientist_109871938
"Fisher" familyNameOf Ronald_Fisher
Ronald_Fisher type 20th_century_mathematicians
"scientist" means scientist_109871938
20th_century_mathematicians subClassOf mathematician_109635652

→ statistical language model for result graphs
Online access at http://www.mpi-inf.mpg.de/kasneci/naga/
21 Ranking Factors
- Confidence
- Prefer results that are likely to be correct
- Certainty of IE
- Authenticity and Authority of Sources
Ex.: bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia)
Ex.: livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)
- Informativeness
- Prefer results that are likely important
- May prefer results that are likely new to user
- Frequency in answer
- Frequency in corpus (e.g. Web)
- Frequency in query log
Ex.: q: isa (Einstein, y) with answers isa (Einstein, scientist), isa (Einstein, vegetarian)
Ex.: q: isa (x, vegetarian) with answers isa (Einstein, vegetarian), isa (Al Nobody, vegetarian)
- Compactness
- Prefer results that are tightly connected
- Size of answer graph
[Diagram: two connections between Einstein and Bohr: a short one (Einstein won Nobel Prize, Bohr won Nobel Prize) and a longer one (Einstein isa vegetarian, Tom Cruise isa vegetarian, Tom Cruise bornIn 1962, Bohr diedIn 1962)]
22 NAGA Ranking Model
Following the paradigm of statistical language models (used in speech recognition and modern IR), applied to graphs
For query q with fact templates q1 ... qn (e.g. bornIn (x, Frankfurt)), rank result graphs g with facts g1 ... gn (e.g. bornIn (Goethe, Frankfurt)) by decreasing likelihoods, using a generative mixture model
[Formula annotations: background model; reflects informativeness; weights subqueries]
Ex.: bornIn (x, Germany); wonAward (x, Nobel)
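
The slide's formula is not preserved in this text, so the following Python sketch only illustrates the general idea of a generative mixture with a background model. It assumes the linear-interpolation form that is common in language-model-based retrieval; the function `score`, the weight `alpha`, and all probabilities are hypothetical and not taken from NAGA.

```python
import math


def score(fact_probs, background_probs, alpha=0.5):
    """Log-likelihood of a result graph under a two-component mixture.

    fact_probs[i]       : assumed estimate of P(q_i | g_i) for the i-th matched fact
    background_probs[i] : assumed background estimate P(q_i)
    alpha               : hypothetical mixture weight of the background model
    """
    total = 0.0
    for p_fact, p_bg in zip(fact_probs, background_probs):
        total += math.log((1.0 - alpha) * p_fact + alpha * p_bg)
    return total


# Two candidate result graphs for a query with two fact templates.
# All numbers are made up for illustration.
g1 = score([0.8, 0.6], [0.1, 0.1])
g2 = score([0.3, 0.2], [0.1, 0.1])
ranking = sorted([("g1", g1), ("g2", g2)], key=lambda kv: kv[1], reverse=True)
print(ranking)
```

Summing logarithms instead of multiplying probabilities avoids underflow for very small likelihoods, such as the score of order 1E-13 shown on the earlier results slide.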
23 NAGA Ranking Model: Informativeness
Estimate P(qi | gi) for qi = (x, r, z) with variable x (analogously for other cases)
Ex.: bornIn (x, Frankfurt): bornIn (Goethe, Frankfurt) vs. bornIn (GW, Frankfurt)
Ex.: isa (Einstein, z): isa (Einstein, physicist) vs. isa (Einstein, vegetarian)
Estimate on knowledge graph
Estimate on Web (exploit redundancy): freq (Einstein, isa, physicist) vs. freq (Einstein, isa, vegetarian)
[Diagram: Albert Einstein isa physicist; Albert Einstein isa vegetarian]
24 NAGA Example
Query: x isa politician; x isa scientist
Results: Benjamin Franklin, Paul Wolfowitz, Angela Merkel
25 User Study for Quality Assessment (1)
- Benchmark
- 55 queries from TREC QA 2005/2006
- Examples: 1) In what country is Luxor? 2) Discoveries of the 20th Century?
- 12 queries from work on SphereSearch
- Examples: 1) In which movies did a governor act? 2) Firstname of politician Rice?
- 18 regular expression queries by us
- Example: What do Albert Einstein and Niels Bohr have in common?
- Competitors
- NAGA vs. Google, Yahoo! Answers, BANKS (IIT Bombay), START (MIT)
26 User Study for Quality Assessment (2)
- Quality Measures
- Precision@1
- NDCG: normalized discounted cumulative gain
- based on ratings: highly relevant (2), somewhat relevant (1), irrelevant (0)
- with Wilson confidence intervals at α = 0.95
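
The two measures named above can be computed as in the sketch below. It assumes one common NDCG variant (the raw rating as gain, a log2 rank discount) and the standard Wilson score interval; the slides do not say which exact variants the study used, and the function names are illustrative.

```python
import math


def dcg(ratings):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(ratings, start=1))


def ndcg(ratings):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


# Ratings of a ranked result list: 2 = highly relevant, 1 = somewhat, 0 = irrelevant.
print(ndcg([2, 0, 1]))
# Interval for a precision estimate observed on 8 of 10 queries.
print(wilson_interval(8, 10))
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] even for small samples, which matters for benchmarks with only a dozen or so queries per group.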
27 Outline
✓ Motivation
✓ Information Extraction & Knowledge Harvesting (YAGO)
✓ Ranking for Search over Entity-Relation Graphs (NAGA)
Efficient Query Processing (RDF-3X)
Conclusion
28 Why RDF? Why a New Engine?
[Diagram: RDF graph for the example. Nodes: Marie Curie, Maria Sklodowska, Warsaw, Poland, Pierre Curie, Henri Becquerel, U Paris, Nobel Prize Physics, Nobel Prize Chemistry, 1852, 1867, 1908, 1934. Edge labels: bornAs, bornOn, bornIn, inCountry, diedOn, marriedTo, advisor, Alma Mater, wonAward]
- RDF triples (subject, property/predicate, value/object):
- (id1, Name, Marie Curie), (id1, bornAs, Maria Sklodowska), (id1, bornOn, 1867),
- (id1, bornIn, id2), (id2, Name, Warsaw), (id2, locatedIn, id3), (id3, Name, Poland),
- (id1, marriedTo, id4), (id4, Name, Pierre Curie), (id1, wonAward, id5), (id4, wonAward, id5), ...
- pay-as-you-go: schema-agnostic or schema later
- RDF triples form fine-grained (ER) graph
- queries bound to need many star-joins and long chain-joins
- physical design critical, but hardly predictable workload
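
To make the star-join and chain-join terminology concrete, here is a small Python sketch over the triples listed on this slide. The helper names `objects`, `star`, and `chain` are hypothetical, and the linear scans stand in for whatever access paths a real engine would use.

```python
# Triples copied from the slide (subject, property, object).
TRIPLES = [
    ("id1", "Name", "Marie Curie"),
    ("id1", "bornAs", "Maria Sklodowska"),
    ("id1", "bornOn", "1867"),
    ("id1", "bornIn", "id2"),
    ("id2", "Name", "Warsaw"),
    ("id2", "locatedIn", "id3"),
    ("id3", "Name", "Poland"),
    ("id1", "marriedTo", "id4"),
    ("id4", "Name", "Pierre Curie"),
    ("id1", "wonAward", "id5"),
    ("id4", "wonAward", "id5"),
]


def objects(subject, prop):
    """All objects o such that (subject, prop, o) is in the triple list."""
    return [o for s, p, o in TRIPLES if s == subject and p == prop]


def star(subject, props):
    """Star join: several properties of the same subject."""
    return {p: objects(subject, p) for p in props}


def chain(start, props):
    """Chain join: follow a sequence of properties starting from one subject."""
    frontier = [start]
    for p in props:
        frontier = [o for s in frontier for o in objects(s, p)]
    return frontier


print(star("id1", ["Name", "bornOn", "marriedTo"]))
print(chain("id1", ["bornIn", "locatedIn", "Name"]))
```

Each property in a star or each hop in a chain is one more self-join over the triple collection, which is why join ordering and physical design dominate query cost.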
29 SPARQL Query Language
SPJ combinations of triple patterns
Ex.: Select ?c Where { ?p isa scientist . ?p bornIn ?t . ?p hasWon ?a . ?t inCountry ?c . ?a Name NobelPrize }
options for filter predicates, duplicate handling, wildcard join, etc.
Ex.: Select Distinct ?c Where { ?p ?r1 ?t . ?t ?r2 ?c . ?c isa . ?p bornOn ?b . Filter (?b 1945) }
support for RDFS types
30 RDF & SPARQL Engines
choice of physical design is crucial
- giant triples table
- (vert. partitioned) property tables
- clustered property tables (+ leftover table)
[Example property tables: bornOn (S, O) with rows (id1, 1867), (id5, 1852); Advisor (S, O) with row (id1, id5)]
Systems: SESAME / OpenRDF, YARS2 (DERI), Jena (HP Labs), Oracle RDF_MATCH
Column stores: C-Store (MIT), MonetDB (CWI)
physical design wizard! materialized views
31 RDF-3X: a RISC-style Engine [T. Neumann, G. Weikum: VLDB 2008]
- Design rationale
- RDF-specific engine (not an RXORDBMS)
- Simplify operations
- Reduce implementation choices
- Optimize for common case
- Eliminate tuning knobs
- Key principles
- Mapping dictionary for encoding all literals into ids
- Exhaustive indexing of id triples
- Index-only store, high compression
- QP: mostly merge joins with order-preservation
- Very fast DP-based query optimizer
- Frequent-paths synopses, property-value histograms
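
Two of the key principles, the mapping dictionary and order-preserving merge joins, can be illustrated with a short Python sketch. It is a toy model under stated assumptions, not RDF-3X's implementation: `build_dictionary` and `merge_join` are hypothetical names, ids are assigned in first-seen order, and inputs to the join are plain sorted lists.

```python
def build_dictionary(triples):
    """Map every distinct term to an integer id and encode the triples."""
    ids = {}
    encoded = []
    for triple in triples:
        # setdefault assigns the next free id the first time a term is seen.
        encoded.append(tuple(ids.setdefault(term, len(ids)) for term in triple))
    return ids, encoded


def merge_join(left, right):
    """Join two lists of (key, value) pairs that are both sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left entry that carries the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out


ids, encoded = build_dictionary([
    ("Marie Curie", "bornIn", "Warsaw"),
    ("Warsaw", "inCountry", "Poland"),
])
print(encoded)
print(merge_join([(1, "a"), (2, "b"), (4, "d")], [(2, "x"), (2, "y"), (4, "w")]))
```

Because the join output is again sorted by the join key, it can feed another merge join on that key without an intermediate sort, which is the point of order-preservation.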
32 RDF-3X Indexing
- index all collation orders of subject-property-object id triples
- SPO, SOP, OSP, OPS, PSO, POS
- directly stored in clustered B+ trees
- high compression of indexes
- can choose any order for scan & join
- additionally index count-aggregated projections in all orders
- SP, SO, OS, OP, PS, PO with counter for each entry
- enables efficient bookkeeping for duplicates
- also index projections S, P, O with count-aggregation
also need two mapping indexes: literal → id, id → literal
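
The exhaustive indexing scheme can be mimicked in Python as follows. This is a sketch under stated assumptions: sorted lists stand in for the clustered B+ trees, no compression is applied, and `build_indexes` and `build_projections` are hypothetical names.

```python
from collections import Counter
from itertools import permutations


def build_indexes(triples):
    """Return {order name: sorted list} for all six orderings of (S, P, O) id triples."""
    indexes = {}
    for perm in permutations(range(3)):
        name = "".join("SPO"[k] for k in perm)
        indexes[name] = sorted(tuple(t[k] for k in perm) for t in triples)
    return indexes


def build_projections(triples):
    """Count-aggregated projections onto every ordered pair and every single position."""
    proj = {}
    for a, b in permutations(range(3), 2):
        proj["SPO"[a] + "SPO"[b]] = Counter((t[a], t[b]) for t in triples)
    for a in range(3):
        proj["SPO"[a]] = Counter(t[a] for t in triples)
    return proj


# Toy id triples (subject id, property id, object id); values are made up.
triples = [(1, 10, 2), (1, 10, 3), (4, 11, 2)]
idx = build_indexes(triples)
proj = build_projections(triples)
print(sorted(idx))
print(idx["OPS"])
print(proj["SP"])
```

With every ordering available, any triple pattern with bound positions becomes a range scan over the index whose prefix matches those positions, and the counters in the projections record how many full triples share each pair or single value.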
33 RDF-3X Query Optimization
- Principles
- optimizing join orders is key (star joins, long join chains)
- should exploit exhaustive indexes and order-preservation
- support merge-joins and hash-joins
Bottom-up dynamic programming for exhaustive plan enumeration
- Cost model based on selectivity estimation from
- histograms for each of the 6 SPO orderings (approx. equi-depth)
- frequent join paths (property sequences) for stars and chains
Example Query: [Diagram: query graph over variables ?x1 ... ?x6 connected by properties p1 ... p5]
34 Experimental Evaluation: Setup
- Setup and competitors
- 2GHz dual core, 2 GB RAM, 30MB/s disk, Linux
- column-store property tables by Abadi et al., using MonetDB
- triples store with SPO, POS, PSO indexes, using PostgreSQL
Datasets:
1) Barton library catalog: 51 million triples (4.1 GB)
2) YAGO knowledge base: 40 million triples (3.1 GB)
3) Librarything social-tagging excerpt: 30 million triples (1.8 GB)
Benchmark queries (7 or 8 per dataset) in the spirit of:
1) counts of French library items (books, music, etc.), with creator, publisher, language, etc.
2) scientist from Poland with French advisor who both won awards
3) books tagged with romance, love, mystery, suspense by users who like crime novels and have friends who ...
Example query: Select ?t Where { ?b hasTitle ?t . ?u romance ?b . ?u love ?b . ?u mystery ?b . ?u suspense ?b . ?u crimeNovel ?c . ?u hasFriend ?f . ?f ... }
35 Experimental Evaluation: Results
DB sizes [GB]
              Barton     Yago       LibThing
RDF-3X        2.8        2.7        1.6
MonetDB       1.6-2.0    1.1-2.4    0.7-6.9
PostgreSQL    8.7        7.5        5.7

DB load times [min]
              Barton     Yago       LibThing
RDF-3X        13         25         20
MonetDB       11         21         4
PostgreSQL    30         25         20

Geometric means for query run-times [sec] for warm (cold) cache
              Barton          Yago           LibThing
RDF-3X        0.4 (5.9)       0.04 (0.7)     0.13 (0.89)
MonetDB       3.8 (26.4)      54.6 (78.2)    4.39 (8.16)
PostgreSQL    64.3 (167.8)    0.56 (10.6)    30.4 (93.9)
36 Outline
✓ Motivation
✓ Information Extraction & Knowledge Harvesting (YAGO)
✓ Ranking for Search over Entity-Relation Graphs (NAGA)
✓ Efficient Query Processing (RDF-3X)
Conclusion
37 Summary & Outlook
lift the world's best information sources (Wikipedia, Web, Web 2.0) to the level of explicit knowledge (ER-oriented facts)
1) building knowledge graphs: combine semantic & statistical & social IE sources (for scholarly Web, digital libraries, enterprise know-how); challenges in consistency vs. uncertainty, long-term evolution
2) heterogeneity & uncertain IE necessitate ranking: new ranking models (e.g. statistical LM for graphs)
3) efficiency and scalability: challenges for search & ranking (top-k queries) and updates
38 Thank You!