1
Self-organization and the Semantic Web

Steffen Staab
New Trends in Semantic Web
December 2, 2004

2
Estimations of Data Sizes

My personal data: about 30 GByte
SAP: 10^4 tables
Large insurance company: 5000 databases
Google: 8,000,000,000 URLs
About 90% of web content comes from underlying databases
95% of data is not in databases (files, etc.)

3
Data Integration Purpose

ERP: 10^4 tables
WWW: 10^10 documents
Find & Condense Content
eLearning: 10^6 schools, colleges ...
Email (Staab): 24874
Content Management: 10^6 documents
Laptop file system: 17150 data files

4
Data Integration Capabilities
Self-organising systems

Manual data integration technology and maintenance feasible for up to 10^2 databases

5
Dimensions of Self-organization

Peer-to-Peer-like systems
Ontology Learning & Population
Automatic mapping
Self-adaptive query routing
Peer-to-peer services

Autonomy
Terminology
Terminology mapping
Query routing
Self-organising services

7
The OL Layer Cake
Rules
Relations: cure(dom:DOCTOR, range:DISEASE)
Concept Hierarchies: is_a(DOCTOR, PERSON)
Concepts: DISEASE = {disease, illness}
Terms: disease, illness, hospital
8
The ontology population/semantic annotation
problem in 4 cartoons
9
The annotation problem from a scientific point
of view
10
The annotation problem in practice
11
The vicious cycle
12
Current State-of-the-art

Large-scale IE (SemTag & Seeker @ WWW'03): only disambiguation w.r.t. TAP
Standard IE (MUC): needs handcrafted rules
ML-based IE (e.g. Amilcare @ OntoMat, MnM): needs a hand-annotated training corpus; does not scale to large numbers of concepts; rule induction takes time
KnowItAll (Etzioni et al., WWW'04): shallow (pattern-matching-based) approach

13
The Self-Annotating Web

There is a huge amount of implicit knowledge in the Web
Make use of this implicit knowledge together with statistical information to propose formal annotations and overcome the vicious cycle
semantics ≈ syntax + statistics?
Annotation by maximal statistical evidence

PANKOW: Pattern-based ANnotation by Knowledge On the Web
14
A small quiz
What is Laksa?
A: dish
B: city
C: temple
D: mountain
15
Asking Google!

"cities such as Laksa": 0 hits
"dishes such as Laksa": 10 hits
"mountains such as Laksa": 0 hits
"temples such as Laksa": 0 hits
Google knows more than all of you together!
Example of using syntactic information + statistics to derive semantic information

16
Patterns

HEARST1: <CONCEPT>s such as <INSTANCE>
HEARST2: such <CONCEPT>s as <INSTANCE>
HEARST3: <CONCEPT>s, (especially/including) <INSTANCE>
HEARST4: <INSTANCE> (and/or) other <CONCEPT>s
Examples
dishes such as Laksa
such dishes as Laksa
dishes, especially Laksa
dishes, including Laksa
Laksa and other dishes
Laksa or other dishes

17
Patterns (Contd)

DEFINITE1: the <INSTANCE> <CONCEPT>
DEFINITE2: the <CONCEPT> <INSTANCE>
APPOSITION: <INSTANCE>, a <CONCEPT>
COPULA: <INSTANCE> is a <CONCEPT>
Examples
the Laksa dish
the dish Laksa
Laksa, a dish
Laksa is a dish
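
The pattern lists on these two slides can be turned into query phrases mechanically. The following is a minimal sketch, not the PANKOW implementation: the template syntax ({i} for the instance, {c} for the concept, {cs} for its plural) and the naive pluralize helper are assumptions made for illustration.

```python
# Pattern names follow the slides; the template strings are an illustrative
# encoding: {i} = instance, {c} = concept (singular), {cs} = concept (plural).
PATTERNS = {
    "HEARST1": ["{cs} such as {i}"],
    "HEARST2": ["such {cs} as {i}"],
    "HEARST3": ["{cs}, especially {i}", "{cs}, including {i}"],
    "HEARST4": ["{i} and other {cs}", "{i} or other {cs}"],
    "DEFINITE1": ["the {i} {c}"],
    "DEFINITE2": ["the {c} {i}"],
    "APPOSITION": ["{i}, a {c}"],
    "COPULA": ["{i} is a {c}"],
}


def pluralize(noun: str) -> str:
    """Very rough English plural; a hypothetical helper, not a full stemmer."""
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"


def instantiate(instance: str, concept: str) -> dict[str, list[str]]:
    """Return every instantiated phrase, grouped by pattern name."""
    plural = pluralize(concept)
    return {
        name: [t.format(i=instance, c=concept, cs=plural) for t in templates]
        for name, templates in PATTERNS.items()
    }


if __name__ == "__main__":
    for name, phrases in instantiate("Laksa", "dish").items():
        print(name, phrases)
```

Each phrase would then be sent to a search engine as an exact-phrase query.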

18
PANKOW Process
19
Asking Google (more formally)

Instance i ∈ I, concept c ∈ C, pattern p ∈ {Hearst1, ..., Copula}
count(i,c,p) returns the number of Google hits of the instantiated pattern
E.g. count(Laksa,dish) := count(Laksa,dish,def1) + ...
Restrict to the best ones beyond a threshold
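
As a rough illustration of the counting step, the sketch below sums hit counts over patterns and keeps the best concept if it exceeds a threshold. The search engine is replaced by a stub table holding the four figures from the "Asking Google!" slide; the function names, the single HEARST1 pattern, and returning only one best concept are simplifying assumptions.

```python
from typing import Callable, Optional

# Stand-in for a search-engine lookup. The four figures are the ones quoted
# on the "Asking Google!" slide; any other phrase counts as 0 hits.
STUB_HITS = {
    "cities such as Laksa": 0,
    "dishes such as Laksa": 10,
    "mountains such as Laksa": 0,
    "temples such as Laksa": 0,
}


def stub_hit_count(phrase: str) -> int:
    return STUB_HITS.get(phrase, 0)


# pattern name -> function building the query phrase (only HEARST1 here,
# because that is the only pattern the stub has figures for)
PATTERNS = {"HEARST1": lambda instance, plural: f"{plural} such as {instance}"}


def count(instance: str, plural: str,
          hit_count: Callable[[str], int] = stub_hit_count) -> int:
    """count(i, c): sum of count(i, c, p) over all patterns p."""
    return sum(hit_count(build(instance, plural)) for build in PATTERNS.values())


def annotate(instance: str, plurals: dict[str, str], threshold: int = 0,
             hit_count: Callable[[str], int] = stub_hit_count) -> Optional[str]:
    """Return the concept with maximal evidence, or None if below threshold."""
    totals = {c: count(instance, p, hit_count) for c, p in plurals.items()}
    if not totals:
        return None
    best = max(totals, key=totals.get)
    return best if totals[best] > threshold else None


CANDIDATES = {"city": "cities", "dish": "dishes",
              "mountain": "mountains", "temple": "temples"}

if __name__ == "__main__":
    print(annotate("Laksa", CANDIDATES))
```

With the stub figures, only the dish reading has any evidence, so the instance is assigned to that concept.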

20
Examples
Instance / Concept / Count
Atlantic / city / 1520837
Bahamas / island / 649166
USA / country / 582275
Connecticut / state / 302814
Caribbean / sea / 227279
Mediterranean / sea / 212284
Canada / country / 176783
Guatemala / city / 174439
Africa / region / 131063
Australia / country / 128607
France / country / 125863
Germany / country / 124421
Easter / island / 96585
St Lawrence / river / 65095
Commonwealth / state / 49692
New Zealand / island / 40711
Adriatic / sea / 39726
Netherlands / country / 37926
St John / church / 34021
Belgium / country / 33847
San Juan / island / 31994
Mayotte / island / 31540
EU / country / 28035
UNESCO / organization / 27739
Austria / group / 24266
Greece / island / 23021
Malawi / lake / 21081
Israel / country / 19732
Perth / street / 17880
Luxembourg / city / 16393
Nigeria / state / 15650
St Croix / river / 14952
Nakuru / lake / 14840
Kenya / country / 14382
Benin / city / 14126
Cape Town / city / 13768
21
Evaluation Scenario

Corpus: 45 texts from http://www.lonelyplanet.com/destinations
Ontology: tourism ontology from the GETESS project
Concepts: original 1043, pruned 682
Manual annotation by two subjects
A: 436 instance/concept assignments
B: 392 instance/concept assignments
Overlap: 277 instances (Gold Standard)
A and B used 59 different concepts
Categorial (Kappa) agreement on 277 instances: 63.5%
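
For readers unfamiliar with the agreement figure, here is a small sketch of how a kappa value can be computed for two annotators. The slide does not say which kappa variant was used, so Cohen's kappa is an assumption, and the six toy labels are invented for illustration only; they are not the evaluation data.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of items on which both annotators chose the same concept.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration.
SUBJECT_A = ["city", "city", "country", "river", "city", "island"]
SUBJECT_B = ["city", "country", "country", "river", "city", "city"]

if __name__ == "__main__":
    print(round(cohen_kappa(SUBJECT_A, SUBJECT_B), 3))
```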

22
Results
23
Comparison
24
Dimensions of Self-organization

Peer-to-Peer-like systems
Ontology Learning & Population
Automatic mapping
Self-adaptive query routing
Peer-to-peer services

Autonomy
Terminology
Terminology mapping
Query routing
Self-organising services

25
Bibliography Use Case
I am searching for publications about Semantics.
Do you have items about Semantics?
Bibster Network
I know a peer sharing metadata about Semantics.
26
Bibster Screenshot
Open Source: http://bibster.sourceforge.net/
27
Sample BibTeX Entry

@ARTICLE{codd81relational,
  author = {Edgar F. Codd},
  title = {The capabilities of relational database management systems},
  journal = {IBM Research Report, San Jose, California},
  volume = {RJ3132},
  year = {1981}
}

28
Sample Entry
29
BIBSTER Lifecycle

Wrapping / Scraping
RDF Store Sesame
SeRQL
INGA: Interest-based Node Grouping Architecture
Duplicate Detection

Generation of Data @ Peer
Storage @ Peer
Querying @ Peer
Query Routing in Network
Answering to Peer

Expertise-based Peer Selection

31
Expertise-Based Peer Selection

Expertise: abstract semantic description of the knowledge base of a peer, expressed using a shared ontology
Advertisements: promote semantic descriptions of expertise in the network
Peer Selection: ranks peers according to the similarity between their expertise and the query subject w.r.t. the shared ontology
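
The ranking idea can be sketched as follows. This is only an illustration: the tiny topic tree, the peer names, and the path-length similarity 1 / (1 + distance) are assumptions of this sketch, not the similarity function used in Bibster.

```python
# Toy topic hierarchy (child -> parent); an illustrative stand-in for the
# shared ontology, with a hypothetical single root.
PARENT = {
    "ROOT": None,
    "Information Systems": "ROOT",
    "Database Management": "Information Systems",
    "Information Storage and Retrieval": "Information Systems",
    "Digital Libraries": "Information Storage and Retrieval",
    "Artificial Intelligence": "ROOT",
    "Robotics": "Artificial Intelligence",
}


def ancestors(topic: str) -> list[str]:
    """Path from the topic up to the root, topic first."""
    path = [topic]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def similarity(a: str, b: str) -> float:
    """Assumed measure: 1 / (1 + number of tree edges between a and b)."""
    path_a, path_b = ancestors(a), ancestors(b)
    common = next(t for t in path_a if t in path_b)  # lowest common ancestor
    distance = path_a.index(common) + path_b.index(common)
    return 1.0 / (1.0 + distance)


def rank_peers(expertise: dict[str, set[str]], query_topic: str) -> list[str]:
    """Order peers by their best-matching advertised topic, best first."""
    scores = {
        peer: max((similarity(t, query_topic) for t in topics), default=0.0)
        for peer, topics in expertise.items()
    }
    return sorted(scores, key=lambda p: scores[p], reverse=True)


# Hypothetical advertised expertise of three peers.
EXPERTISE = {
    "peer1": {"Database Management"},
    "peer2": {"Digital Libraries"},
    "peer3": {"Robotics"},
}

if __name__ == "__main__":
    print(rank_peers(EXPERTISE, "Database Management"))
```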

32
Expertise-Based Peer Selection
Similarity Function
Find articles by Codd about Database Management

Peer 1
Peer 2
33
Semantic topology

Advertising strategy determines
whom to send advertisements to (e.g. random, semantically close)
which advertisements to accept (e.g. all, semantically close)
Semantic topology: formed by the knowledge about the expertise of other peers
Idea: cluster peers with similar expertise
Route queries along the gradient of increasing similarity between expertise and query subject
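
One way to picture the two acceptance strategies and the gradient routing is the sketch below. The Peer class, the Jaccard overlap used as a similarity stand-in, and the threshold values are all assumptions for illustration; a threshold of 0.0 corresponds to accepting all advertisements, a positive one to accepting only semantically close ones.

```python
from typing import Optional


def jaccard(a: set, b: set) -> float:
    """Set overlap, used here as a simple stand-in for semantic similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class Peer:
    def __init__(self, name: str, expertise: set, accept_threshold: float = 0.0):
        self.name = name
        self.expertise = set(expertise)
        self.accept_threshold = accept_threshold
        self.known: dict[str, set] = {}  # peer name -> advertised expertise

    def receive_advertisement(self, sender: str, sender_expertise: set) -> bool:
        """Accept the advertisement only if the sender is close enough."""
        if jaccard(self.expertise, set(sender_expertise)) >= self.accept_threshold:
            self.known[sender] = set(sender_expertise)
            return True
        return False

    def next_hop(self, query_topics: set) -> Optional[str]:
        """Forward towards the known peer whose expertise best fits the query."""
        if not self.known:
            return None
        return max(self.known,
                   key=lambda n: jaccard(self.known[n], set(query_topics)))


if __name__ == "__main__":
    p = Peer("p", {"Database Management", "Information Systems"}, 0.3)
    p.receive_advertisement("q", {"Database Management"})
    p.receive_advertisement("r", {"Robotics"})
    print(p.known, p.next_hop({"Database Management"}))
```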

34
Semantic Topologies
[Diagram: a network of peers labelled with their expertise topics (Digital Libraries, Database Management, Information Search and Retrieval, Information Systems, Artificial Intelligence, Information Storage and Retrieval, Robotics), with Query and Result labels for the example query below]
Find articles by Codd about Database Management
35
Simulation of the Scenario

DBLP data set (380440 publications)
Document classification using the ACM topic hierarchy (based on title), classified subset of 126247 publications
Document distribution
Topic distribution: one peer for each of the ACM topics (1287 peers)
Proceedings distribution: according to proceedings and journals (2335 peers)
Simulation steps
Set up network topology
Advertise knowledge
Query processing

36
Evaluation Criteria

Output Parameters
Peer Selection (Peer Level)
Recall: how many of the relevant peers were reached
Precision: how many of the reached peers were relevant
Query Answering (Document Level)
Recall: how many of the relevant documents were returned
Number of messages
Input Parameters
Distribution of documents
Peer selection function
Advertising strategy
Maximum number of hops
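
The peer-level output parameters can be written down directly. A minimal sketch with hypothetical peer identifiers; the convention of returning 0.0 for an empty set is an assumption of this sketch.

```python
def precision_recall(reached: set, relevant: set) -> tuple[float, float]:
    """Peer-level precision and recall for one query."""
    hits = len(reached & relevant)
    # Precision: share of reached peers that were relevant.
    precision = hits / len(reached) if reached else 0.0
    # Recall: share of relevant peers that were reached.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    reached = {"p1", "p2", "p3", "p4"}   # hypothetical peers a query reached
    relevant = {"p2", "p4", "p7"}        # hypothetical peers holding answers
    print(precision_recall(reached, relevant))
```

The same function applies at document level if the sets contain document identifiers instead of peers.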

37
Hypotheses for Simulation

Expertise based selection is better than a naive
broadcast approach based on random selection.
Using a shared ontology with a metric for
semantic similarity improves the system compared
with an approach with exact matches (e.g. keyword
based)
Performance can be improved further, if the
semantic topology reflects the semantic
similarity of the expertise of the peers
The "perfect" topology: perfect results if the semantic topology coincides with a distribution of the documents according to the shared ontology

38
Experimental Settings

Setting 1: baseline, naively selects random peers
Setting 2: expertise-based selection using a similarity measure
Setting 3: peers accept advertisements that are semantically similar to their own expertise
Setting 4: perfect topology, where the topology coincides with the ACM topic hierarchy

39
Recall (Peer Selection)
40
Precision (Peer Selection)
41
Number of Messages
42
Simulation Results
43
Advertisement-based Approach

Expertise-based peer selection improves
performance of peer selection by an order of
magnitude
Ontology-based similarity measure allows further
improvements
Semantic topology that mirrors the domain
ontology yields best results
Test driven in http://bibster.semanticweb.org

44
... many open questions

Still an eager approach
What about real data?
What about changes in the data?
Now a lazy approach!
Learning and Recommending Shortcuts in Semantic Peer-to-Peer Networks: INGA

45
Social expert network
I am searching for publications about Semantic Web.
Bibster Network
Do you have items about Semantics?
Here is an entry of the book Handbook on Ontologies.
Bootstrapping shortcut
Content shortcut
Expert's expert
Expert
Recommender shortcut
Expert's expert
I know a peer sharing metadata about Semantics.
46
Semantic overlay network
I am searching for publications about Semantic Web.
Query independent shortcut
Content shortcut
Recommender shortcut
47
Semantic overlay network
I am searching for publications about Semantic Web.
Content shortcut
48
Semantic overlay network
I am searching for publications about Logics.
Recommender shortcut
49
Semantic overlay network
I am searching for publications about Robotics.
Query independent shortcut
50
Semantic overlay network
I am new to the network and search for archeology.
Baseline (e.g. JXTA visibility)
51
Build content shortcut index

Send query using most promising available layer
of semantic overlay topology
Evaluate result of query
Update shortcut index
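
The send, evaluate, update loop above might keep its state in a structure like the following. This is a sketch under assumptions: the class name, keying shortcuts by query topic, and scoring peers by the number of results they returned are illustrative choices, not details given on the slide.

```python
from collections import defaultdict


class ContentShortcutIndex:
    """Remembers which peers answered queries on which topic."""

    def __init__(self) -> None:
        # topic -> {peer: total number of results seen from that peer}
        self._by_topic: dict[str, dict[str, int]] = defaultdict(dict)

    def update(self, topic: str, answers: dict[str, int]) -> None:
        """Evaluate a query result: answers maps peer -> number of results."""
        for peer, n_results in answers.items():
            if n_results > 0:  # only peers that actually provided content
                seen = self._by_topic[topic].get(peer, 0)
                self._by_topic[topic][peer] = seen + n_results

    def best_peers(self, topic: str, k: int = 1) -> list[str]:
        """Most promising content providers for the topic, best first."""
        peers = self._by_topic.get(topic, {})
        return sorted(peers, key=lambda p: peers[p], reverse=True)[:k]


if __name__ == "__main__":
    index = ContentShortcutIndex()
    index.update("Semantics", {"p1": 3, "p2": 0, "p3": 1})
    print(index.best_peers("Semantics", k=2))
```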

52
Content Provider Shortcut Creation
53
Shortcut Index
54
Build recommender shortcut index

Active
When answers are returned, they include the query message path
The last-but-one peer in the path is a recommender peer

Passive
Listen to incoming queries
If a query is relevant to one's interests, add the querying peer as recommender
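
Both ways of finding a recommender can be sketched in a few lines. The function names, the path layout (querying peer first, answering peer last), and the overlap-based relevance test with its threshold are assumptions made for this illustration.

```python
from typing import Optional


def active_recommender(message_path: list[str]) -> Optional[str]:
    """Pick the last-but-one peer on the path of a returned answer.

    Assumed layout: [querying peer, ..., recommender, answering peer].
    With fewer than three peers there is no intermediate peer to credit.
    """
    if len(message_path) < 3:
        return None
    return message_path[-2]


def passive_recommender(query_topics: set, own_interests: set,
                        querying_peer: str,
                        threshold: float = 0.5) -> Optional[str]:
    """Return the querying peer if the overheard query matches our interests."""
    if not query_topics:
        return None
    # Assumed relevance test: share of query topics we are interested in.
    overlap = len(set(query_topics) & set(own_interests)) / len(set(query_topics))
    return querying_peer if overlap >= threshold else None


if __name__ == "__main__":
    print(active_recommender(["me", "a", "b", "answerer"]))
    print(passive_recommender({"Semantics"}, {"Semantics", "Logics"}, "pq"))
```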

55
Recommender Shortcut Creation
56
Shortcut Index - 2
57
Query independent shortcut
58
Limit index size

Retain only a small number of shortcuts in the
index (e.g. 40 in our experiments)
Delete based on least utility
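
A bounded index with least-utility eviction could look like the sketch below. The slide does not define how utility is computed, so it is passed in as a plain number here; the class name and the default capacity of 40 (the figure quoted above) are the only other choices made.

```python
class BoundedShortcutIndex:
    """Keeps at most `capacity` shortcuts, dropping the least useful first."""

    def __init__(self, capacity: int = 40) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.utility: dict[str, float] = {}  # shortcut (peer id) -> utility

    def add(self, peer: str, utility: float) -> None:
        """Insert or refresh a shortcut, then evict until the index fits."""
        self.utility[peer] = utility
        while len(self.utility) > self.capacity:
            worst = min(self.utility, key=self.utility.get)
            del self.utility[worst]


if __name__ == "__main__":
    index = BoundedShortcutIndex(capacity=2)
    index.add("a", 0.9)
    index.add("b", 0.1)
    index.add("c", 0.5)  # evicts "b", the shortcut with the lowest utility
    print(sorted(index.utility))
```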

59
while forwarding/answering a query

Active forwarding of Pq.Bo: the current message contains Pq.Bo of the querying peer; compare it against Pi.Bo and use it if better
Interest-based indexing: if similarity(query, content_i) > threshold, then add Pq to our list of recommender peers
Add own Pid to message

60
Query routing

Greedy search preferring query dependent
shortcuts
Query independent and baseline shortcuts for
fallback

Fireworks in regions of high similarity between
content and query
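
The layered fallback can be sketched as a greedy selection over shortcut layers. Assumptions of this sketch: the function name and data layout are hypothetical, content shortcuts are tried before recommender shortcuts, and the fireworks step (broadcasting in regions of high similarity) is not modelled.

```python
def select_next_peers(query_topic: str,
                      content: dict[str, list[str]],
                      recommender: dict[str, list[str]],
                      query_independent: list[str],
                      baseline: list[str],
                      k: int = 2) -> list[str]:
    """Greedily pick up to k peers, preferring query dependent shortcuts.

    content / recommender: topic -> peers, best first (query dependent).
    query_independent / baseline: peers used only as fallback.
    """
    if k <= 0:
        return []
    layers = (
        content.get(query_topic, []),
        recommender.get(query_topic, []),
        query_independent,
        baseline,
    )
    chosen: list[str] = []
    for layer in layers:
        for peer in layer:
            if peer not in chosen:
                chosen.append(peer)
            if len(chosen) == k:
                return chosen
    return chosen


if __name__ == "__main__":
    content = {"Semantics": ["p1"]}
    recommender = {"Semantics": ["p2", "p1"]}
    print(select_next_peers("Semantics", content, recommender, ["p3"], ["p4"], k=3))
    print(select_next_peers("Robotics", content, recommender, ["p3"], ["p4"]))
```

A query on a topic with no query dependent shortcuts falls straight through to the query independent and baseline layers.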
61
Random contribution to query routing

Greedy search preferring query dependent
shortcuts
Query independent and baseline shortcuts for
fallback

Fireworks in regions of high similarity between
content and query
62
Experimental hypotheses

INGA performs at least as well in terms of recall as the naive algorithm, KUNWADEE (Sripanidkulchai et al.) and REMINDIN.
INGA performs better in terms of messages per query than the naive algorithm, KUNWADEE and REMINDIN.
The gain in efficiency can be attributed in equal measure to the different layers.
A dynamic combination of query dependent and independent search strategies reduces the number of messages consumed per query while it retains a high recall.

63
Comparison of Query Routing Algorithms (recall)
64
Comparison of Routing Algorithms (messages)
65
Contribution of different layers (peer f-measure)
66
Contribution of different layers to message
reduction (messages)
67
Lessons learned

Focus on interest based shortcuts.
Interest based Listening
High Degree Shortcuts
Scrutinize the result messages of one's issued queries to create content provider and recommender shortcuts
Prefer a query dependent search strategy
Greedy
top-k
Use a highest out degree strategy for baseline
selection

68
Relevant Publications