Title: In collaboration with Giorgiana Ifrim, Gjergji Kasneci,
1 In collaboration with Giorgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchanek, Martin Theobald
2 Vision
Opportunity: Turn the Web (and Web 2.0 and Web 3.0 ...) into the world's most comprehensive knowledge base (semantic DB)
Challenge: seize the opportunity and make it happen!
- Approach: combine and exploit synergies of
- hand-crafted, high-quality knowledge sources → Semantic Web
- automatic knowledge extraction → Statistical Web
- social networks and human computing → Social Web
3 Proof of Relevance
Vannevar Bush, "As We May Think", 1945:
"There is a growing mountain of research."
"A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory."
4 Proof of Relevance
Tim Berners-Lee: "In the Semantic Web information is given well-defined meaning."
Jim Gray: "system can answer questions about the text as precisely and quickly as a human expert."
Brewster Kahle: "The goal of universal access to our cultural heritage is within our grasp."
Jimmy Wales: "Our big-picture vision is to share knowledge with all of humanity."
Al Gore: "The future will be better tomorrow."
5 Proof of Relevance?
"To know that we know what we know, and that we do not know what we do not know, that is true knowledge."
"You cannot open a book without learning something."
"A journey of a thousand miles begins with a single step."
Confucius, 551-479 BC
"Ignorance is less remote from the truth than prejudice."
"When science, art, literature, and philosophy are simply the manifestation of personality, they can make a man's name live for thousands of years."
"Sentences are like sharp nails, which force truth upon our memories."
Denis Diderot, 1713-1784
6 Why Google and Wikipedia Are Not Enough
Turn the Web, Web 2.0, and Web 3.0 into the world's most comprehensive knowledge base (semantic DB/graph)!
Answer knowledge queries such as:
proteins that inhibit both protease and some other enzyme
neutron stars with X-ray bursts > 10^40 erg s^-1
black holes in 10
differences in Rembetiko music from Greece and from Turkey
connection between Thomas Mann and Goethe
market impact of Web 2.0 technology in December 2006
sympathy or antipathy for Germany from May to August 2006
Nobel laureate who survived both world wars and his children
drama with three women making a prophecy to a British nobleman that he will become king
7 Outline
Introduction: Search for Knowledge
- Harvesting Knowledge
- Leibniz Approach
- Planck Approach
- Darwin Approach
Conclusion
8 Three Roads to Knowledge
Leibniz Approach: Handcrafted High-Quality Knowledge Sources (Semantic Web)
Planck Approach: Large-scale Information Extraction & Harvesting (Statistical Web)
Darwin Approach: Social Wisdom from Web 2.0 Communities (Social Web)
9 Leibniz Approach (Semantic Web)
- Handcrafted High-Quality Knowledge
- Ontologies and other Lexical Sources
- Build on Rigorous Knowledge Atoms
- (Characteristica Universalis)
Gottfried Wilhelm Leibniz (1646 - 1716)
10 High-Quality Knowledge Sources
General-purpose ontologies for Semantic Web: SUMO, Cyc, etc.
11 High-Quality Knowledge Sources
General-purpose thesauri and concept networks: WordNet family
woman, adult female -- (an adult female person)
  => amazon, virago -- (a large strong and aggressive woman)
  => donna -- (an Italian woman of rank)
  => geisha, geisha girl -- (...)
  => lady -- (a polite name for any woman)
  ...
  => wife -- (a married woman, a man's partner in marriage)
  => witch -- (a being, usually female, imagined to have special powers derived from the devil)
12 High-Quality Knowledge Sources
General-purpose thesauri and concept networks: WordNet family
- 200 000 concepts and relations
- can be cast into
- description logics or
- graph, with weights for relation strengths
- (derived from co-occurrence statistics)
enzyme -- (any of several complex proteins that are produced by cells and act as catalysts in specific biochemical reactions)
  => protein -- (any of a large group of nitrogenous organic compounds that are essential constituents of living cells ...)
    => macromolecule, supermolecule
    ...
    => organic compound -- (any compound of carbon and another element or a radical)
  ...
  => catalyst, accelerator -- ((chemistry) a substance that initiates or accelerates a chemical reaction without itself being affected)
    => activator -- ((biology) any agency bringing about activation ...)
13 High-Quality Knowledge Sources
Domain ontologies (UMLS, GeneOntology, etc.)
- 1 million biomedical concepts, 135 categories,
- 54 relationships (e.g. virus causes (disease | symptom))
14 High-Quality Knowledge Sources
Wikipedia and other lexical sources
- 2 million articles
- 40 million hyperlinks
- many 1000s of categories and lists
- more than 100 languages
- growing very fast
15 Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
Infobox_Scientist
name = Max Planck
birth_date = April 23, 1858
birth_place = Kiel, Germany
death_date = October 4, 1947
death_place = Göttingen, Germany
residence = Germany
nationality = [[Germany|German]]
field = Physicist
work_institution = University of Kiel</br> Humboldt-Universität zu Berlin</br> Georg-August-Universität Göttingen
alma_mater = Ludwig-Maximilians-Universität München
doctoral_advisor = Philipp von Jolly
doctoral_students = Gustav Ludwig Hertz</br>
known_for = Planck's constant, [[Quantum mechanics|quantum theory]]
prizes = Nobel Prize in Physics (1918)
16 YAGO: Yet Another Great Ontology [F. Suchanek, G. Kasneci, G. Weikum: WWW'07]
- Turn Wikipedia into explicit knowledge base (semantic DB)
- Exploit hand-crafted categories and templates
- Represent facts as explicit knowledge triples
- relation (entity1, entity2)
- (in FOL, compatible with RDF, OWL-lite, XML, etc.)
- Map (and disambiguate) relations into WordNet concept DAG
Examples:
relation       entity1      entity2
bornIn         Max_Planck   Kiel
isInstanceOf   Kiel         City
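The triple representation above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: facts live in an in-memory set, and the `Fact` type and `match` helper are hypothetical names, not part of YAGO.

```python
from typing import NamedTuple, Optional


class Fact(NamedTuple):
    """One knowledge triple: relation(entity1, entity2)."""
    relation: str
    entity1: str
    entity2: str


# The two example facts from the slide.
facts = {
    Fact("bornIn", "Max_Planck", "Kiel"),
    Fact("isInstanceOf", "Kiel", "City"),
}


def match(facts, relation: Optional[str] = None,
          entity1: Optional[str] = None,
          entity2: Optional[str] = None):
    """Return all facts that agree with the given fields; None is a wildcard."""
    return sorted(
        f for f in facts
        if (relation is None or f.relation == relation)
        and (entity1 is None or f.entity1 == entity1)
        and (entity2 is None or f.entity2 == entity2)
    )


print(match(facts, relation="bornIn"))
print(match(facts, entity1="Kiel"))
```

Keeping the relation as the first field mirrors the `relation (entity1, entity2)` notation; an RDF-style subject-predicate-object order would work equally well.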
17 YAGO Knowledge Representation
Accuracy ≈ 97%
[Figure: YAGO graph in three layers. Concepts: Entity, Person, Location, Scientist, City, Country, Biologist, Physicist, connected by subclass edges. Individuals: Max_Planck, Erwin_Planck, Kiel, Nobel Prize, connected by instanceOf, bornIn, hasWon, FatherOf, bornOn (April 23, 1858), and diedOn (October 4, 1947) edges. Words: "Dr. Planck", "Max Karl Ernst Ludwig Planck", "Max Planck", connected to individuals by means edges.]
Online access and download at http://www.mpi-inf.mpg.de/suchanek/yago/
18 YAGO: Disambiguation & Uncertainty
capture confidence value for each fact
[Figure: YAGO graph with a confidence value on every edge. Concepts: Entity, Person, Location, City, Country, Mythological Figure, Celebrity (subclass edges, confidence 1.0). Individuals: Paris(Myth.), Paris(France), France, Paris Hilton (instanceOf and locatedIn edges, confidences between 0.4 and 1.0). Words: "Paris", "France", "La Grande Nation" (means edges, confidences between 0.05 and 0.95).]
additional harvesting of relations from natural-language texts by info-extraction tools
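A minimal Python sketch of how per-fact confidence values could drive disambiguation. The confidence numbers are illustrative only (they are not a reading of the figure), and treating edges as independent so that confidences multiply is an assumption made here, not something the slides state.

```python
# (word, entity) -> confidence of the "means" edge. Values are illustrative.
means = {
    ("Paris", "Paris(France)"): 0.7,
    ("Paris", "Paris(Myth.)"): 0.1,
    ("Paris", "Paris Hilton"): 0.2,
    ("France", "France"): 0.95,
    ("La Grande Nation", "France"): 0.05,
}


def candidates(word, min_conf=0.0):
    """Entities a word may denote, most confident first."""
    hits = [(conf, entity) for (w, entity), conf in means.items()
            if w == word and conf >= min_conf]
    return [(entity, conf) for conf, entity in sorted(hits, reverse=True)]


def chain_confidence(*confs):
    """Confidence of a derived fact, ASSUMING independent edges (product)."""
    result = 1.0
    for c in confs:
        result *= c
    return result


print(candidates("Paris"))
print(chain_confidence(0.7, 0.9))
```

A threshold such as `min_conf` is one simple way to trade recall for precision when resolving an ambiguous word.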
19 NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW'07]
Graph-based search on YAGO-style knowledge bases with built-in ranking based on statistical language model
[Figure: three example query graphs.
discovery queries: nodes x, y, a, b, Kiel, scientist, Nobel prize; edges isa, bornIn, hasWon, hasSon, diedOn, and a > comparison.
connectedness queries: nodes Thomas Mann, Goethe, German novelist; edge isa.
queries with regular expressions: nodes Ling, x, y, scientist, Zhejiang, Beng Chin Ooi; edges isa, hasFirstName | hasLastName, worksFor, (coAuthor | advisor), locatedIn.]
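A discovery query is a graph pattern with variables. The sketch below evaluates such a pattern by simple backtracking over a toy fact list; it is loosely modelled on the discovery query in the figure and is not NAGA's actual algorithm (in particular, it does no ranking). The `$` prefix for variables and all function names are assumptions for illustration.

```python
def is_var(term):
    """Variables are written with a leading $ in this sketch."""
    return term.startswith("$")


def unify(pattern, fact, binding):
    """Extend binding so that pattern equals fact, or return None."""
    new = dict(binding)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if new.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new


def answer(query, facts, binding=None):
    """Yield every variable binding that satisfies all query triples."""
    binding = {} if binding is None else binding
    if not query:
        yield binding
        return
    first, rest = query[0], query[1:]
    for fact in facts:
        extended = unify(first, fact, binding)
        if extended is not None:
            yield from answer(rest, facts, extended)


# Toy knowledge base (subject, relation, object).
facts = [
    ("Max_Planck", "isa", "scientist"),
    ("Max_Planck", "bornIn", "Kiel"),
    ("Max_Planck", "hasWon", "Nobel_prize"),
    ("Max_Planck", "hasSon", "Erwin_Planck"),
    ("Thomas_Mann", "isa", "novelist"),
    ("Thomas_Mann", "hasWon", "Nobel_prize"),
]

query = [
    ("$x", "isa", "scientist"),
    ("$x", "bornIn", "Kiel"),
    ("$x", "hasWon", "Nobel_prize"),
    ("$x", "hasSon", "$y"),
]
results = list(answer(query, facts))
print(results)
```

Each query triple is tried against every fact, so the cost grows quickly with query size; a real system would use indexes per relation.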
20 NAGA: Searching Knowledge
q: Fisher isa scientist; Fisher isa x
@Fisher Ronald_Fisher   @scientist scientist_109871938   X alumnus_109165182
@Fisher Irving_Fisher   @scientist scientist_109871938   X social_scientist_109927304
@Fisher James_Fisher    @scientist scientist_109871938   X ornithologist_109711173
@Fisher Ronald_Fisher   @scientist scientist_109871938   X theorist_110008610
@Fisher Ronald_Fisher   @scientist scientist_109871938   X colleague_109301221
@Fisher Ronald_Fisher   @scientist scientist_109871938   X organism_100003226
mathematician_109635652 subClassOf> scientist_109871938
Alumni_of_Gonville_and_Caius_College,_Cambridge subClassOf> alumnus_109165182
"Fisher" familyNameOf> Ronald_Fisher
Ronald_Fisher type> Alumni_of_Gonville_and_Caius_College,_Cambridge
Ronald_Fisher type> 20th_century_mathematicians
"scientist" means> scientist_109871938
21 NAGA: Searching & Ranking Knowledge
q: Fisher isa scientist; Fisher isa x
@Fisher Ronald_Fisher   @scientist scientist_109871938   X mathematician_109635652
@Fisher Ronald_Fisher   @scientist scientist_109871938   X statistician_109958989
@Fisher Ronald_Fisher   @scientist scientist_109871938   X president_109787431
@Fisher Ronald_Fisher   @scientist scientist_109871938   X geneticist_109475749
@Fisher Ronald_Fisher   @scientist scientist_109871938   X scientist_109871938
Score 7.184462521168058E-13
mathematician_109635652 subClassOf> scientist_109871938
"Fisher" familyNameOf> Ronald_Fisher
Ronald_Fisher type> 20th_century_mathematicians
"scientist" means> scientist_109871938
20th_century_mathematicians subClassOf> mathematician_109635652
Online access at http://www.mpi-inf.mpg.de/kasneci/naga/
22 Ranking Factors
- Confidence
- Prefer results that are likely to be correct
- Certainty of IE
- Authenticity and Authority of Sources
bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia)
livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)
- Informativeness
- Prefer results that are likely important
- May prefer results that are likely new to user
- Frequency in answer
- Frequency in corpus (e.g. Web)
- Frequency in query log
q: isa (Einstein, y)
isa (Einstein, scientist)
isa (Einstein, vegetarian)
q: isa (x, vegetarian)
isa (Einstein, vegetarian)
isa (Al Nobody, vegetarian)
- Compactness
- Prefer results that are tightly connected
- Size of answer graph
[Figure: answer graph linking Einstein, Bohr, Tom Cruise, vegetarian, Nobel Prize, and 1962 through isa, bornIn, won, and diedIn edges.]
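One possible reading of frequency-based informativeness, sketched in Python: a fact's score is its share of corpus witnesses among all facts that agree with it on the positions fixed by the query. The witness counts are invented for illustration, and this estimator is an assumption, not the ranking model actually used by NAGA.

```python
from collections import Counter

# Hypothetical witness counts: how often each fact was seen in a corpus.
witnesses = Counter({
    ("Einstein", "isa", "scientist"): 900,
    ("Einstein", "isa", "vegetarian"): 100,
    ("Al Nobody", "isa", "vegetarian"): 1,
})


def informativeness(fact, bound):
    """Share of the fact's witnesses among all facts that agree with it
    on the positions in `bound` (the positions the query fixes)."""
    total = sum(n for f, n in witnesses.items()
                if all(f[i] == fact[i] for i in bound))
    return witnesses[fact] / total if total else 0.0


# q: isa(Einstein, y) fixes positions 0 and 1.
print(informativeness(("Einstein", "isa", "scientist"), (0, 1)))
print(informativeness(("Einstein", "isa", "vegetarian"), (0, 1)))

# q: isa(x, vegetarian) fixes positions 1 and 2.
print(informativeness(("Einstein", "isa", "vegetarian"), (1, 2)))
print(informativeness(("Al Nobody", "isa", "vegetarian"), (1, 2)))
```

Under these counts the same fact, isa (Einstein, vegetarian), ranks low for the first query and high for the second, which is the contrast the two example queries illustrate.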
23 Summary of Leibniz Approach
Hand-crafted knowledge sources are great assets, but expensive, partial, and isolated
Great mileage even from informal & semiformal sources
Connecting & reconciling different sources gives added value (and sometimes is not even that hard)
- Challenge:
- Develop methods for comprehensive, highly accurate
- mappings across many knowledge sources
- Cross-lingual
- Cross-temporal
- Scalable
24 Planck Approach (Statistical Web)
- Information Extraction & Harvesting
- Gather Entities, Relations, Facts
- Live with Uncertainty
Max Planck (1858 - 1947)
25 Information Extraction (IE): Text to Records
combine NLP, pattern matching, lexicons, statistical learning
26 IE Technology: Rules, Patterns, Learning
- For natural-language text and for heterogeneous sources
- NLP techniques (parser, PoS tagging) for tokenization
- identify patterns (e.g. regular expressions) as features
- train statistical learners for segmentation and labeling
- use learned model to automatically tag newly seen input
Training data: The WWW conference takes place in Banff in Canada. Today's keynote speaker is Dr. Berners-Lee from W3C. The panel in Edinburgh, chaired by Ron Brachman from Yahoo!,
Tags: <location>, <organization>, <person>, <event>
Newly seen input: Ian Foster, father of the Grid, talks at the GES conference in Germany on 05/02/07.
Tags: <person>, <event>, <location>, <date>
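The pattern and lexicon steps of such a pipeline can be sketched as follows. This covers only rule-based tagging with a regular expression and a tiny hand-made lexicon; the statistical learner mentioned above is not modelled, and the `LEXICON` entries and the `tag` function are assumptions for illustration.

```python
import re

# Tiny hand-made lexicon; a real system would learn labels from training data.
LEXICON = {
    "Ian Foster": "person",
    "GES conference": "event",
    "Germany": "location",
}

# Dates written as two-digit groups, e.g. 05/02/07.
DATE = re.compile(r"\b\d{2}/\d{2}/\d{2}\b")


def tag(text):
    """Return (surface string, label) for every lexicon or date hit, in text order."""
    hits = [(m.start(), m.group(), "date") for m in DATE.finditer(text)]
    for phrase, label in LEXICON.items():
        for m in re.finditer(re.escape(phrase), text):
            hits.append((m.start(), phrase, label))
    return [(surface, label) for _, surface, label in sorted(hits)]


sentence = ("Ian Foster, father of the Grid, talks at the GES "
            "conference in Germany on 05/02/07.")
print(tag(sentence))
```

In a learned tagger, matches like these would typically serve as input features rather than as final labels.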
27 Knowledge Acquisition from the Web
- Learn Semantic Relations from Entire Corpora at Large Scale
- (as exhaustively as possible but with high accuracy)
- Examples:
- all cities, all basketball players, all composers
- headquarters of companies, CEOs of companies, synonyms of proteins
- birthdates of people, capitals of countries, rivers in cities
- which musician plays which instruments
- who discovered or invented what
- which enzyme catalyzes which biochemical reaction
Existing approaches and tools use almost-unsupervised pattern matching and learning: seeds (known facts) → patterns (in text) → (extraction) rule → (new) facts
28 Methods for Web-Scale Fact Extraction
seeds → text → rules → new facts
Example:
city (Seattle) → in downtown Seattle → in downtown X
city (Seattle) → Seattle and other towns → X and other towns
city (Las Vegas) → Las Vegas and other towns → X and other towns
plays (Zappa, guitar) → playing guitar Zappa → playing Y X
plays (Davis, trumpet) → Davis blows trumpet → X blows Y
New facts:
in downtown Delhi → city (Delhi)
Coltrane blows sax → plays (Coltrane, sax)
Next iteration:
city (Delhi) → old center of Delhi → old center of X
plays (Coltrane, sax) → sax player Coltrane → Y player X
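The seeds → text → rules → new facts loop can be sketched for the unary `city` relation as below. The corpus snippets come from the example above; the function names, the `SLOT` regular expression for capitalised names, and the exact-snippet matching are simplifying assumptions, not how any of the cited systems is implemented.

```python
import re

# Slot for a capitalised name of one or more words, e.g. "Las Vegas".
SLOT = r"([A-Z]\w*(?: [A-Z]\w*)*)"


def learn_patterns(entities, corpus):
    """Replace each known entity in a snippet by X to obtain a pattern."""
    patterns = set()
    for snippet in corpus:
        for entity in entities:
            if entity in snippet:
                patterns.add(snippet.replace(entity, "X"))
    return patterns


def apply_patterns(patterns, corpus):
    """Match every pattern against every snippet and collect the X fillers."""
    found = set()
    for pattern in patterns:
        regex = re.compile(
            SLOT.join(re.escape(part) for part in pattern.split("X")))
        for snippet in corpus:
            m = regex.fullmatch(snippet)
            if m:
                found.add(m.group(1))
    return found


corpus = [
    "in downtown Seattle",
    "Seattle and other towns",
    "Las Vegas and other towns",
    "in downtown Delhi",
    "old center of Delhi",
]
cities = {"Seattle", "Las Vegas"}                          # seeds

patterns = learn_patterns(cities, corpus)                  # seeds -> rules
new_cities = apply_patterns(patterns, corpus) - cities     # rules -> new facts
patterns_round2 = learn_patterns(cities | new_cities, corpus)

print(sorted(patterns), sorted(new_cities), sorted(patterns_round2))
```

Binary relations such as `plays (X, Y)` would follow the same loop with two slots per pattern. Nothing here guards against bad patterns, which is one reason extraction precision degrades without a scoring step.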
29 Performance of Web-IE
State-of-the-art precision/recall results:
relation     precision   recall   corpus      systems
countries    80%         90%      Web         KnowItAll
cities       80%         ???      Web         KnowItAll
scientists   60%         ???      Web         KnowItAll
CEOs         80%         50%      News        Snowball, LEILA
birthdates   80%         70%      Wikipedia   LEILA
instanceOf   40%         20%      Web         Text2Onto, LEILA
precision value-chain: entities 80%, attributes 70%, facts 60%, events 50%
Anecdotal evidence:
invented (A.G. Bell, telephone), married (Hillary Clinton, Bill Clinton), isa (yoga, relaxation technique), isa (zearalenone, mycotoxin), contains (chocolate, theobromine), contains (Singapore sling, gin)
invented (Johannes Kepler, logarithm tables), married (Segolene Royal, Francois Hollande), isa (yoga, excellent way), isa (your day, good one), contains (chocolate, raisins), plays (the liver, central role), makes (everybody, mistakes)
30 Beyond Surface Learning with LEILA
Learning to Extract Information by Linguistic Analysis [F. Suchanek et al.: KDD'06]
Limitation of surface patterns: who discovered or invented what
"Tesla's work formed the basis of AC electric power"
"Al Gore funded more work for a better basis of the Internet"
Almost-unsupervised Statistical Learning with Dependency Parsing
LEILA outperforms other Web-IE methods in precision and recall, but the dependency parser is slow
31 IE Efficiency and Accuracy Tradeoffs
[see also tutorials by Cohen, Doan/Ramakrishnan/Vaithyanathan, Agichtein/Sarawagi]
IE is cool, but what's in it for DB folks?
- precision vs. recall: two-stage processing (filter pipeline)
- recall-oriented harvesting
- precision-oriented scrutinizing
- preprocessing
- indexing: NLP trees & graphs, N-grams, PoS-tag patterns?
- exploit ontologies? exploit usage logs?
- turn crawl & extract into set-oriented query processing
- candidate finding
- efficient phrase, pattern, and proximity queries
- optimizing entire text-mining workflows [Ipeirotis et al.: SIGMOD'06]
32 Summary of Planck Approach
Human text (and speech) is diverse and produced at a higher rate than manual high-quality annotations
IE offers reasonably robust and scalable methods for harvesting named entities and binary relations
- Deep NLP and advanced ML are the computational bottleneck
- Disambiguation (entity matching, record linkage) needed: Joe Hellerstein (UC Berkeley) = Prof. Joseph M. Hellerstein, California); Max Planck Institute = MPI ≠ MPI (Message Passing Institute)
- Challenge:
- Achieve Web-scale IE throughput that can
- sustain rate of new content production (e.g. blogs)
- (may need large-scale P2P/Grid)
- with > 90% accuracy and Wikipedia-like coverage
33 Darwin Approach (Social Web)
- Social Wisdom & Natural Selection
- Evolution of (Web 2.0) species
- Survival of the fittest
Charles Darwin (1809 - 1882)
34 Wisdom of Crowds at Work on Web 2.0
- Information enrichment & knowledge extraction by humans
- Collaborative Recommendations & QA
- Amazon (product ratings & reviews, recommended products)
- Netflix: movie DVD rentals → $1 million Challenge
- answers.yahoo, iknow.baidu, etc.
- Social Tagging and Folksonomies
- del.icio.us: Web bookmarks and tags
- flickr: photo annotation, categorization, rating
- YouTube: same for video
- Human Computing in Game Form
- ESP and Google Image Labeler: image tagging
- Peekaboom: image segmenting and tagging
- Verbosity: facts from natural-language sentences
- Online Communities
- dblife.cs.wisc.edu for database research
- www.lt-world.org for language technology
- Yahoo! Groups, Myspace, Facebook, etc.
35 Social Tagging Example: Flickr (1)
36 Social Tagging Example: Flickr (2)
37 Social Tagging Example: Flickr (3)
38 Social-Tagging Community
> 1 million users, > 100 million photos, > 1 billion tags, 30% monthly growth
Source: www.flickr.com
39 ESP Game [Luis von Ahn et al. 2004]
played against random, anonymous partner on Internet
taboo: pyramid; Louvre; museum; Paris; art
- Game with a purpose
- Collects annotations (wisdom)
- Can exploit tag statistics (crowds)
- Attracts people, fun to play, some play hours
- ESP game collected > 10 million tags from > 20000 users
- 5000 people could tag all photos on the Web in 4 weeks
- (human computing)
my labels: reflection; water; Mitterand; Mona Lisa; metro lignes 7, 14; Da Vinci code
your partner has suggested 17 labels
Congratulations! You scored 1 point!
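The agreement rule of the game can be sketched in Python as follows. This is a simplification that ignores timing: it returns the first of my labels that the partner has also entered and that is not taboo. The partner's labels in the example are hypothetical, since the slide shows only how many there are.

```python
def esp_round(my_labels, partner_labels, taboo):
    """Return my first label that the partner also typed and that is not
    taboo (case-insensitive), or None if the players never agree."""
    partner = {label.lower() for label in partner_labels}
    banned = {word.lower() for word in taboo}
    for label in my_labels:
        key = label.lower()
        if key in partner and key not in banned:
            return label
    return None


taboo = ["pyramid", "Louvre", "museum", "Paris", "art"]
mine = ["reflection", "water", "Mitterand", "Mona Lisa",
        "metro lignes 7, 14", "Da Vinci code"]
partner = ["glass", "Paris", "night", "da vinci code"]  # hypothetical

print(esp_round(mine, partner, taboo))
```

Taboo words push both players past the obvious labels, so the agreed label adds new information about the image.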
40 More Human Computing
- Verbosity [von Ahn 2006]
- Collect common-knowledge facts (relation instances)
- 2 players: Narrator (N) and Guesser (G)
- N gives stylized clues:
- is a kind of ..., is used for ..., is typically near/in/on ..., is the opposite of ...
- random pairing for independence,
- can build statistics over many games for same concept
Peekaboom, Phetch, etc.: locating & tagging objects in images, finding images, etc.
- incentives to play?
- game design for moving up the value-chain?
41 Dark Side of Social Wisdom
- Spam (Web spam: not just for email anymore)
- lucky online casino, easy MBA diploma, cheap V!-4-gra, etc.
- lawsuits about appropriate Google rank
- Truthiness
- degree to which something is truthy (not necessarily facty)
- truthy: property of something you know from your guts
- Disputes
- editorial fights over critical Wikipedia articles
- Citizendium: new endeavor with "gentle expert oversight"
- Dishonesty, Bias, ...
44 The Wisdom of Crowds: PageRank
PageRank (PR): links are endorsements and increase page authority; authority is higher if links come from high-authority pages
Social Ranking
$PR(q) = \epsilon \cdot j(q) + (1 - \epsilon) \cdot \sum_{p \in IN(q)} PR(p) \cdot t(p, q)$ with $j(q) = 1/N$ and $t(p, q) = 1/\mathrm{outdegree}(p)$
equivalent to principal eigenvector
random walk: uniformly random choice of links + random jumps
add bias to transitions and jumps for personal PR, TrustRank, etc.
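A minimal power-iteration sketch of the random-walk view, using the standard PageRank formulation: with probability `eps` the surfer jumps to a uniformly random page, otherwise it follows a uniformly chosen outgoing link. The jump probability of 0.15, the dangling-page handling, and the three-page example graph are assumptions made here.

```python
def pagerank(links, eps=0.15, iterations=100):
    """Power iteration for
    PR(q) = eps / N + (1 - eps) * sum over p -> q of PR(p) / outdegree(p)."""
    nodes = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: eps / n for v in nodes}          # random-jump share
        for p in nodes:
            targets = links.get(p, [])
            if targets:
                share = (1 - eps) * pr[p] / len(targets)
                for q in targets:
                    nxt[q] += share
            else:
                # Dangling page: spread its score uniformly (an assumption).
                for q in nodes:
                    nxt[q] += (1 - eps) * pr[p] / n
        pr = nxt
    return pr


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
print(scores)
```

Replacing the uniform jump term `eps / n` with a biased distribution gives the personalised and trust-based variants mentioned above.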
45 The Wisdom of Crowds: Beyond PR
[Figure: graph connecting users, tags, and docs.]
Typed graphs: data items, users, friends, groups, postings, ratings, queries, clicks, ... with weighted edges → spectral analysis of various graphs
Evolving over time → tensor analysis
46 Decentralized Graph Analysis
- Graph spectral analysis applied to
- pages, sites, tags, users, groups, queries, clicks, opinions, etc. as nodes
- assessment and interaction relations as weighted edges
- can compute various notions of authority, reputation, trust, quality
- Decentralized computation in peer-to-peer network
- with arbitrary, a-priori unknown overlaps of graph fragments
47 JXP Algorithm [J.X. Parreira, G. Weikum: WebDB'05, VLDB'06]
Decentralized, asynchronous, peer-to-peer algorithm based on theory of Markov-chain aggregation (state lumping) [P.J. Courtois 1977, C.D. Meyer 1988]
- each peer aggregates non-local part of global graph into world node
- peers meet randomly,
- exchange data about their local computations, and
- iterate their local computations
Theorem: authority scores from local computations converge to global scores
supported by Minerva system: http://www.mpi-inf.mpg.de/departments/d5/software/minerva/index.html
48 Summary of Darwin Approach
Social tagging and social networks (Web 2.0) are potentially valuable knowledge sources
Games (human computing) are an interesting way of enticing knowledge input and collecting statistics
Spectral analysis is a highly versatile tool for rating & ranking that can be extended and scaled by decentralized algorithms
- Challenges:
- Design a game that intrigues serious scientists
- to semantically annotate their scholarly work
- Develop an analysis method that identifies the best facts,
- resilient to egoistic and malicious behaviors (incl. coalitions)
49 Outline
Introduction: Search for Knowledge
- Harvesting Knowledge
- Leibniz Approach
- Planck Approach
- Darwin Approach
Conclusion
50 Summary
- Harvesting knowledge & organizing in semantic DB/graph for
- scholarly Web,
- digital libraries,
- enterprise know-how,
- online communities, etc.
- Three roads to knowledge:
- Leibniz / Semantic Web: ontologies, encyclopedia, etc.
- Planck / Statistical Web: large-scale IE from text, speech, etc.
- Darwin / Social Web: wisdom of crowds, tagging, folksonomies
- Not covered here: search and ranking
- graph IR (for ER graphs, RDF, cross-linked XML, etc.)
- new ranking models (e.g. statistical LM for graphs)
- efficient and scalable query processing
51 Major Challenges
- Generalize YAGO approach (Wikipedia & WordNet)
- Methods for comprehensive, highly accurate
- mappings across many knowledge sources
- cross-lingual, cross-temporal
- scalable in size, diversity, number of sources
- Pursue DB support towards efficient IE (and NLP)
- Achieve Web-scale IE throughput that can
- sustain rate of new content production (e.g. blogs)
- with > 90% accuracy and Wikipedia-like coverage
- Integrate handcrafted knowledge with NLP/ML-based IE
- Incorporate social tagging and human computing
52 Potential Synergies among Leibniz, Planck, and Darwin
[Figure: cycle connecting Leibniz (Semantic Web), Planck (Statistical Web), and Darwin (Social Web), with the labels knowledge core, bootstrap, validate, emerge, statistics feedback, and communities.]
53 Thank you!