Title: Self-organization and the Semantic Web
1Self-organization and the Semantic Web
- Steffen Staab
- New Trends in Semantic Web
- December 2, 2004
2Estimations of Data Sizes
- My personal data: about 30 GByte
- SAP: 10^4 tables
- Large insurance company: 5000 databases
- Google: 8,000,000,000 URLs
- about 90% of web content comes from underlying
databases
- 95% of data is not in databases (files, etc.)
3Data Integration Purpose
Find & Condense Content
- eLearning: 10^6 schools, colleges ...
- Content Management: 10^6 documents
- Laptop file system: 17150 data files
4Data Integration Capabilities
Self-organising systems
- Manual data integration: technology and
maintenance feasible for up to 10^2 databases
5Dimensions of Self-organization
- Peer-to-Peer-like systems
- Ontology Learning & Population
- Automatic mapping
- Self-adaptive query routing
- Peer-to-peer services
- Autonomy
- Terminology
- Terminology mapping
- Query routing
- Self-organising services
7 The OL Layer Cake
- Rules
- Relations: cure(dom:DOCTOR, range:DISEASE)
- Concept Hierarchies: is_a(DOCTOR, PERSON)
- Concepts: DISEASE (disease, illness)
- Terms: disease, illness, hospital
8The ontology population/semantic annotation
problem in 4 cartoons
9The annotation problem from a scientific point
of view
10The annotation problem in practice
11The vicious cycle
12Current State-of-the-art
- Large-scale IE: SemTag&Seeker@WWW'03
- only disambiguation w.r.t. TAP
- Standard IE (MUC)
- need for handcrafted rules
- ML-based IE (e.g. Amilcare@OntoMat, MnM)
- need for a hand-annotated training corpus
- does not scale to large numbers of concepts
- rule induction takes time
- KnowItAll (Etzioni et al., WWW'04)
- shallow (pattern-matching-based) approach
13The Self-Annotating Web
- There is a huge amount of implicit knowledge in
the Web
- Make use of this implicit knowledge together with
statistical information to propose formal
annotations and overcome the vicious cycle
- semantics ≈ syntax + statistics?
- Annotation by maximal statistical evidence
PANKOW: Pattern-based Annotation by Knowledge On
the Web
14A small quiz
What is Laksa?
A dish
B city
C temple
D mountain
15Asking Google!
- cities such as Laksa 0 hits
- dishes such as Laksa 10 hits
- mountains such as Laksa 0 hits
- temples such as Laksa 0 hits
- Google knows more than all of you together!
- Example of using syntactic information +
statistics to derive semantic information
16Patterns
- HEARST1: <CONCEPT>s such as <INSTANCE>
- HEARST2: such <CONCEPT>s as <INSTANCE>
- HEARST3: <CONCEPT>s, (especially/including) <INSTANCE>
- HEARST4: <INSTANCE> (and/or) other <CONCEPT>s
- Examples
- dishes such as Laksa
- such dishes as Laksa
- dishes, especially Laksa
- dishes, including Laksa
- Laksa and other dishes
- Laksa or other dishes
17Patterns (Contd)
- DEFINITE1: the <INSTANCE> <CONCEPT>
- DEFINITE2: the <CONCEPT> <INSTANCE>
- APPOSITION: <INSTANCE>, a <CONCEPT>
- COPULA: <INSTANCE> is a <CONCEPT>
- Examples
- the Laksa dish
- the dish Laksa
- Laksa, a dish
- Laksa is a dish
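The patterns on these two slides can be turned into search phrases mechanically. The following is a minimal sketch, not the original PANKOW code: the template syntax, the `pluralize` helper and the function name `instantiate` are assumptions made for illustration.

```python
# Templates follow the pattern names on the slides. The placeholder syntax
# ({i} = instance, {c} = concept, {cs} = plural concept) is an assumption.
PATTERNS = {
    "HEARST1": ["{cs} such as {i}"],
    "HEARST2": ["such {cs} as {i}"],
    "HEARST3": ["{cs}, especially {i}", "{cs}, including {i}"],
    "HEARST4": ["{i} and other {cs}", "{i} or other {cs}"],
    "DEFINITE1": ["the {i} {c}"],
    "DEFINITE2": ["the {c} {i}"],
    "APPOSITION": ["{i}, a {c}"],
    "COPULA": ["{i} is a {c}"],
}


def pluralize(noun):
    """Naive English plural, sufficient for simple nouns such as 'dish'."""
    if noun.endswith(("s", "sh", "ch", "x", "z")):
        return noun + "es"
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"


def instantiate(instance, concept):
    """Return {pattern name: [search phrases]} for one instance/concept pair."""
    plural = pluralize(concept)
    return {
        name: [t.format(i=instance, c=concept, cs=plural) for t in templates]
        for name, templates in PATTERNS.items()
    }


if __name__ == "__main__":
    for name, phrases in instantiate("Laksa", "dish").items():
        print(name, phrases)
```

Each phrase would then be sent to the search engine as an exact-phrase query; that step is left out here because it depends on the search API being used.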
18PANKOW Process
19Asking Google (more formally)
- Instance i ∈ I, concept c ∈ C, pattern p ∈
{Hearst1, ..., Copula}: count(i,c,p) returns the
number of Google hits of the instantiated pattern
- E.g. count(Laksa,dish) := count(Laksa,dish,def1) + ...
- Restrict to the best ones beyond a threshold
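The selection step can be sketched as follows, assuming that count(i,c) is the sum of the per-pattern hit counts and that "beyond threshold" means strictly greater. The hit-count source is passed in as a function; the `HITS` table below is a stand-in that holds only the Laksa numbers from the quiz slide, not a real search API.

```python
def total_count(instance, concept, patterns, count):
    """count(i, c): sum of count(i, c, p) over all patterns p."""
    return sum(count(instance, concept, p) for p in patterns)


def annotate(instance, concepts, patterns, count, threshold=0):
    """Return (concept, hits) with maximal evidence, or None if the best
    concept does not exceed the threshold. Assumes concepts is non-empty."""
    scored = [(total_count(instance, c, patterns, count), c) for c in concepts]
    hits, best = max(scored)
    return (best, hits) if hits > threshold else None


# Stand-in for the search engine: hit counts for "<concept>s such as Laksa"
# as reported on the "Asking Google!" slide; everything else counts as 0.
HITS = {("Laksa", "dish", "HEARST1"): 10}


def lookup_count(instance, concept, pattern):
    return HITS.get((instance, concept, pattern), 0)


if __name__ == "__main__":
    concepts = ["city", "dish", "mountain", "temple"]
    print(annotate("Laksa", concepts, ["HEARST1"], lookup_count))
```

Ties between concepts are broken alphabetically here, which is an arbitrary choice of this sketch.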
20 Examples
Instance        Concept        Google hits
Atlantic        city           1520837
Bahamas         island         649166
USA             country        582275
Connecticut     state          302814
Caribbean       sea            227279
Mediterranean   sea            212284
Canada          country        176783
Guatemala       city           174439
Africa          region         131063
Australia       country        128607
France          country        125863
Germany         country        124421
Easter          island         96585
St Lawrence     river          65095
Commonwealth    state          49692
New Zealand     island         40711
Adriatic        sea            39726
Netherlands     country        37926
St John         church         34021
Belgium         country        33847
San Juan        island         31994
Mayotte         island         31540
EU              country        28035
UNESCO          organization   27739
Austria         group          24266
Greece          island         23021
Malawi          lake           21081
Israel          country        19732
Perth           street         17880
Luxembourg      city           16393
Nigeria         state          15650
St Croix        river          14952
Nakuru          lake           14840
Kenya           country        14382
Benin           city           14126
Cape Town       city           13768
21Evaluation Scenario
- Corpus: 45 texts from http://www.lonelyplanet.com/destinations
- Ontology: tourism ontology from the GETESS project
- concepts: original 1043; pruned 682
- Manual annotation by two subjects
- A: 436 instance/concept assignments
- B: 392 instance/concept assignments
- Overlap: 277 instances (Gold Standard)
- A and B used 59 different concepts
- Categorial (Kappa) agreement on 277 instances:
63.5%
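For readers unfamiliar with the agreement figure: the slide does not say which kappa variant was used, so the sketch below assumes Cohen's kappa for two annotators who each assign one concept per instance. The label lists are toy data, not the evaluation data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same instances."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of instances with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    a = ["city", "city", "island", "island"]  # toy annotations, annotator A
    b = ["city", "city", "island", "city"]    # toy annotations, annotator B
    print(cohen_kappa(a, b))
```

With many concepts (59 were used in the evaluation), chance agreement is small, so kappa stays close to the raw agreement rate.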
22Results
23Comparison
24Dimensions of Self-organization
- Peer-to-Peer-like systems
- Ontology Learning & Population
- Automatic mapping
- Self-adaptive query routing
- Peer-to-peer services
- Autonomy
- Terminology
- Terminology mapping
- Query routing
- Self-organising services
25Bibliography Use Case
I am searching for publications about Semantics.
Do you have items about Semantics?
Bibster Network
I know a peer sharing metadata about Semantics.
26Bibster Screenshot
Open Source: http://bibster.sourceforge.net/
27Sample BibTeX Entry
@ARTICLE{codd81relational,
  author  = {Edgar F. Codd},
  title   = {The capabilities of relational database management systems},
  journal = {IBM Research Report, San Jose, California},
  volume  = {RJ3132},
  year    = {1981}
}
28Sample Entry
29BIBSTER Lifecycle
- Wrapping / Scraping
- RDF Store Sesame
- SeRQL
- INGA: Interest-based Node Grouping
Architecture
- Duplicate Detection
- Generation of Data @ Peer
- Storage @ Peer
- Querying @ Peer
- Query Routing in Network
- Answering to Peer
30Expertise-based Peer Selection
31Expertise-Based Peer Selection
- Expertise: abstract semantic description of the
knowledge base of a peer, expressed using a
shared ontology
- Advertisements: promote semantic descriptions
of expertise in the network
- Peer Selection: ranks peers according to
similarity between their expertise and query
subject w.r.t. the shared ontology
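Peer selection as described above can be sketched in a few lines. The slides do not define the similarity function, so this example assumes a simple path-length measure over a topic tree; the `PARENT` table is a small hand-picked excerpt of ACM topic names (with an artificial root "ACM"), and all function names are hypothetical.

```python
# Toy excerpt of a shared topic hierarchy: child -> parent (assumption).
PARENT = {
    "Information Systems": "ACM",
    "Computing Methodologies": "ACM",
    "Database Management": "Information Systems",
    "Information Storage and Retrieval": "Information Systems",
    "Artificial Intelligence": "Computing Methodologies",
    "Robotics": "Artificial Intelligence",
}


def ancestors(topic):
    """Path from a topic up to the root, starting with the topic itself."""
    path = [topic]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def similarity(t1, t2):
    """1 / (1 + number of edges between t1 and t2); 0.0 if unrelated."""
    p1, p2 = ancestors(t1), ancestors(t2)
    common = next((t for t in p1 if t in p2), None)
    if common is None:
        return 0.0
    return 1.0 / (1.0 + p1.index(common) + p2.index(common))


def rank_peers(query_topic, expertise):
    """Rank peers by their best-matching advertised topic.
    expertise maps peer id -> non-empty list of topics."""
    scores = {
        peer: max(similarity(query_topic, t) for t in topics)
        for peer, topics in expertise.items()
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    expertise = {
        "peer1": ["Database Management"],
        "peer2": ["Robotics"],
        "peer3": ["Information Storage and Retrieval"],
    }
    for peer, score in rank_peers("Database Management", expertise):
        print(peer, round(score, 3))
```

An exact-match (keyword-style) baseline corresponds to a similarity that is 1.0 for identical topics and 0.0 otherwise; the tree-based measure additionally ranks the sibling topic above the unrelated one.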
32Expertise-Based Peer Selection
[Figure: similarity function matching the query "Find articles by Codd about Database Management" against Peer 1 and Peer 2]
33Semantic topology
- Advertising strategy determines
- whom to send advertisements (e.g. random,
semantically close)
- which advertisements to accept (e.g. all,
semantically close)
- Semantic topology formed by the knowledge about
the expertise of other peers
- Idea: Cluster peers with similar expertise
- Route queries along gradient of increasing
similarity between expertise and query subject
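Routing along the similarity gradient can be illustrated with a small greedy walk. This is a sketch under simplifying assumptions: each peer forwards to exactly one neighbour, `neighbors` holds the peers known from accepted advertisements, and `score` holds each peer's precomputed expertise/query similarity. None of these names come from the slides.

```python
def route(start, neighbors, score, max_hops):
    """Follow the neighbour with the highest similarity to the query
    until no neighbour improves on the current peer or hops run out."""
    path = [start]
    current = start
    for _ in range(max_hops):
        candidates = [p for p in neighbors.get(current, []) if p not in path]
        if not candidates:
            break
        best = max(candidates, key=lambda p: score[p])
        if score[best] <= score[current]:
            break  # local maximum of the similarity gradient
        path.append(best)
        current = best
    return path


if __name__ == "__main__":
    neighbors = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
    score = {"a": 0.1, "b": 0.5, "c": 0.3, "d": 0.9}
    print(route("a", neighbors, score, max_hops=5))
```

If peers with similar expertise are clustered, such a walk ends in the cluster that matches the query; with a random topology it tends to stop early at a local maximum.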
34Semantic Topologies
[Figure: semantic topology of peers labeled with their topics (Digital Libraries, Database Management, Information Search and Retrieval, Information Systems, Artificial Intelligence, Information Storage and Retrieval, Robotics); query "Find articles by Codd about Database Management"; query result]
35Simulation of the Scenario
- DBLP data set (380440 publications)
- Document Classification using ACM topic hierarchy
(based on title), classified subset of 126247
publications
- Document Distribution
- Topic Distribution: one peer for each of the ACM
topics (1287 peers)
- Proceedings Distribution: according to
proceedings and journals (2335 peers)
- Simulation Steps
- Setup network topology
- Advertise Knowledge
- Query Processing
36Evaluation Criteria
- Output Parameters
- Peer Selection (Peer Level)
- Recall: How many of the relevant peers were
reached?
- Precision: How many of the reached peers were
relevant?
- Query Answering (Document Level)
- Recall: How many of the relevant documents were
returned?
- Number of messages
- Input Parameters
- Distribution of documents
- Peer selection function
- Advertising strategy
- Maximum number of hops
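The peer-level output parameters reduce to set arithmetic. A minimal sketch, with hypothetical peer ids:

```python
def precision_recall(reached, relevant):
    """Peer-level precision and recall for one query.
    reached: peers the query was sent to; relevant: peers holding answers."""
    reached, relevant = set(reached), set(relevant)
    hits = len(reached & relevant)
    precision = hits / len(reached) if reached else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    reached = {"p1", "p2", "p3", "p4"}  # hypothetical peers contacted
    relevant = {"p1", "p2", "p5"}       # hypothetical peers with matching documents
    print(precision_recall(reached, relevant))
```

Document-level recall follows the same formula with returned and relevant documents in place of peers.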
37Hypotheses for Simulation
- Expertise based selection is better than a naive
broadcast approach based on random selection.
- Using a shared ontology with a metric for
semantic similarity improves the system compared
with an approach with exact matches (e.g. keyword
based)
- Performance can be improved further if the
semantic topology reflects the semantic
similarity of the expertise of the peers
- The "perfect" topology: perfect results if the
semantic topology coincides with a distribution
of the documents according to the shared ontology
38Experimental Settings
- Setting 1: baseline, naively selects random
peers
- Setting 2: expertise-based selection using a
similarity measure
- Setting 3: peers accept advertisements that are
semantically similar to their own expertise
- Setting 4: perfect topology, where the topology
coincides with the ACM topic hierarchy
39Recall (Peer Selection)
40Precision (Peer Selection)
41Number of Messages
42Simulation Results
43Advertisement-based Approach
- Expertise-based peer selection improves
performance of peer selection by an order of
magnitude
- Ontology-based similarity measure allows further
improvements
- Semantic topology that mirrors the domain
ontology yields best results
- Test driven in http://bibster.semanticweb.org
44... many open questions
- Still an eager approach
- What about real data?
- What about changes in the data?
- Now a lazy approach!
- Learning and Recommending Shortcuts in
Semantic Peer-to-Peer Networks: INGA
45Social expert network
I am searching for publications about Semantic
Web.
Bibster Network
Do you have items about Semantics?
Here is an entry of the book Handbook on
Ontologies.
Bootstrapping shortcut
Content shortcut
Expert's expert
Expert
Recommender shortcut
Expert's expert
I know a peer sharing metadata about Semantics.
46Semantic overlay network
I am searching for publications about Semantic
Web.
Query independent shortcut
Content shortcut
Recommender shortcut
47Semantic overlay network
I am searching for publications about Semantic
Web.
Content shortcut
48Semantic overlay network
I am searching for publications about Logics.
Recommender shortcut
49Semantic overlay network
I am searching for publications about Robotics.
Query independent shortcut
50Semantic overlay network
I am new to the network and search for archeology.
Baseline (e.g. JXTA visibility)
51Build content shortcut index
- Send query using most promising available layer
of semantic overlay topology
- Evaluate result of query
- Update shortcut index
52Content Provider Shortcut Creation
53Shortcut Index
54Build recommender shortcut index
- Active
- When answers are returned including the query
message path
- The last-but-one peer in the path is a recommender peer
- Passive
- Listen to incoming queries
- If a query is relevant to one's interests, add the
querying peer as recommender
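Both ways of discovering recommender peers can be sketched briefly. The message layout is an assumption: `path` is taken to list peer ids from the querying peer to the answering peer, and relevance is decided by a caller-supplied `similarity` function and `threshold`. All names are hypothetical.

```python
def active_recommender(path):
    """Active: from a returned answer, take the last-but-one peer on the
    query path. Needs at least one intermediate peer between querier and
    answering peer; otherwise there is no recommender."""
    return path[-2] if len(path) >= 3 else None


def passive_recommender(query_topic, querying_peer, own_interests,
                        similarity, threshold):
    """Passive: when an incoming query matches our own interests,
    remember the querying peer as a recommender."""
    relevant = any(similarity(query_topic, t) > threshold for t in own_interests)
    return querying_peer if relevant else None


if __name__ == "__main__":
    print(active_recommender(["me", "p1", "p2", "p3"]))  # p3 answered via p2

    def exact(a, b):
        # Toy similarity: 1.0 for identical topics, else 0.0.
        return 1.0 if a == b else 0.0

    print(passive_recommender("semantics", "p7", ["semantics", "logic"], exact, 0.5))
```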
55Recommender Shortcut Creation
56Shortcut Index - 2
57Query independent shortcut
58Limit index size
- Retain only a small number of shortcuts in the
index (e.g. 40 in our experiments)
- Delete based on least utility
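A bounded shortcut index with least-utility eviction might look like the sketch below. The slides do not define how utility is computed, so it is supplied by the caller as a number; the class and method names are hypothetical.

```python
class ShortcutIndex:
    """Keeps at most max_size (topic, peer) shortcuts, evicting the entry
    with the lowest utility when the limit is exceeded."""

    def __init__(self, max_size=40):
        self.max_size = max_size
        self.utility = {}  # (topic, peer) -> utility value

    def add(self, topic, peer, utility):
        """Insert or update a shortcut, then trim the index."""
        self.utility[(topic, peer)] = utility
        while len(self.utility) > self.max_size:
            worst = min(self.utility, key=self.utility.get)
            del self.utility[worst]

    def peers_for(self, topic):
        """Peers with a shortcut for this topic, most useful first."""
        matches = [(u, p) for (t, p), u in self.utility.items() if t == topic]
        return [p for _, p in sorted(matches, reverse=True)]


if __name__ == "__main__":
    index = ShortcutIndex(max_size=2)
    index.add("semantics", "p1", 0.9)
    index.add("semantics", "p2", 0.2)
    index.add("logic", "p3", 0.5)  # exceeds the limit, evicts ("semantics", "p2")
    print(index.peers_for("semantics"), index.peers_for("logic"))
```

Note that a newly added shortcut can be evicted immediately if it has the lowest utility; whether that is desirable depends on the utility definition.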
59while forwarding/answering a query
- Active forwarding of Pq.Bo: current message
contains Pq.Bo of querying peer → compare
against Pi.Bo and use if better
- Interest-based indexing: if similarity(query, content_i)
> threshold, then add Pq to our list of
recommender peers
- Add own Pid to message
60Query routing
- Greedy search preferring query dependent
shortcuts
- Query independent and baseline shortcuts for
fallback
Fireworks in regions of high similarity between
content and query
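The layer preference can be expressed as an ordered fallback. In this sketch each layer is a function from a query topic to candidate peers, listed from most to least specific; the layer contents and the parameter `k` (peers to contact) are assumptions, and the "fireworks" broadcast is not modelled.

```python
def select_peers(query_topic, layers, k=2):
    """Pick up to k distinct peers, exhausting earlier (query dependent)
    layers before falling back to later ones."""
    selected = []
    for lookup in layers:
        for peer in lookup(query_topic):
            if len(selected) >= k:
                return selected
            if peer not in selected:
                selected.append(peer)
    return selected[:k]


if __name__ == "__main__":
    content = {"semantics": ["p1"]}            # content provider shortcuts
    recommender = {"semantics": ["p2", "p1"]}  # recommender shortcuts
    layers = [
        lambda t: content.get(t, []),
        lambda t: recommender.get(t, []),
        lambda t: ["p9"],        # query independent shortcut
        lambda t: ["p7", "p8"],  # baseline neighbours
    ]
    print(select_peers("semantics", layers))
    print(select_peers("archeology", layers))
```

A peer that is new to the network has empty shortcut layers, so its queries fall through to the query independent and baseline layers, as in the archeology example.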
61Random contribution to query routing
- Greedy search preferring query dependent
shortcuts
- Query independent and baseline shortcuts for
fallback
Fireworks in regions of high similarity between
content and query
62Experimental hypotheses
- INGA performs at least as well in terms of recall
as the naive algorithm, KUNWADEE
(Sripanidkulchai et al.) and REMINDIN
- INGA performs better in terms of messages per
query than the naive algorithm, KUNWADEE and
REMINDIN
- The gain in efficiency can be attributed in equal
measure to the different layers
- A dynamic combination of query dependent and
independent search strategies reduces the number
of messages consumed per query while it retains a high
recall
63Comparison of Query Routing Algorithms (recall)
64Comparison of Routing Algorithms (messages)
65Contribution of different layers (peer f-measure)
66Contribution of different layers to message
reduction (messages)
67Lessons learned
- Focus on interest based shortcuts.
- Interest based Listening
- High Degree Shortcuts
- Scrutinize the result messages of one's issued
queries to create content provider and
recommender shortcuts
- Prefer a query dependent search strategy
- Greedy
- top-k
- Use a highest out degree strategy for baseline
selection
68Relevant Publications
- Peer-to-Peer-like systems
- Ontology Learning
- Automatic mapping
- Self-adaptive query routing
- Peer-to-peer services
- ISWC-041
-
- WWW-041, ECAI-04,
- SIGKDD Expl., WWW-051 submit
-
- ISWC-042, WWW-052 submit
- WWW-042, WWW-053 submit
- EU IST Integrated Project Adaptive Services
Grid
69Thank You!