Labeling and Indexing Schemes and Algorithms for the Semantic Web presentation

1
Labeling and Indexing Schemes and Algorithms for
the Semantic Web
1. Stuckenschmidt, H., Vdovjak, R., Houben, G., Broekstra, J. Index Structures and Algorithms for Querying Distributed RDF Repositories
2. Christophides, V., Plexousakis, D., Scholl, M., Tourtounis, S. On Labeling Schemes for the Semantic Web
Presentation for CSCI 8350 By Scott Patterson
2
The Problem

The Semantic Web is distributed by nature,
therefore
Information may be duplicated at different
locations
The information that is duplicated may be
expressed in different ways
The information desired from a query may be
stored in fragments

3
The Problem

Suppose you would like to search for titles of
articles written by employees of organizations
that have projects in the area of RDF
Realistically this information is not stored in
one place. Perhaps there are three RDF stores
(in this case databases) containing different and
similar information.

4
The Problem

Suppose the first db contains information on
articles, titles, authors, and their affiliations
The second contains information about industrial
projects, topics and organizations
The third contains all the relevant information
but is a research portal

5
The Problem

How can the information we desire be extracted
and joined in such a way that:
We have not lost any information
We have not duplicated any information
and it is done in a computationally efficient
manner

6
The Architecture

Extend an existing RDF storage and retrieval
system, Sesame
Queries are passed from the query engine to SAIL
(an RDF API) which abstracts from the repository
In a distributed environment, the repositories
may be implemented in different ways
Hence the introduction of another RDF API between
SAIL and the db

7
The Architecture

Now that the heterogeneity of the environment is
abstracted we are back to the problem
How to locate the relevant information, retrieve
it, and put it together for an answer to our
query
For this another new component is added between
Sesame and SAIL

8
The Architecture

This component is the Mediator SAIL
The job of the Mediator is to determine where to
send which parts of the query and how to optimize
the query plan

1
9
The Solution

The solution to this problem has two parts
First, use join indices
Second, find an algorithm that can efficiently
optimize a query plan that consists of joins

10
Join Indices

Create additional database tables that contain
the result of a join over a specific property
Rather than computing a join, a system can simply
access the index
The property here is a path
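
As a rough illustration of the idea, the sketch below precomputes the result of a join along a property path over a toy in-memory triple list. The data and the name `build_join_index` are assumptions made for this example; they are not part of Sesame or of the cited paper.

```python
from collections import defaultdict

# Toy (subject, property, object) triples; all values are made up.
triples = [
    ("article1", "author", "alice"),
    ("article2", "author", "bob"),
    ("alice", "affiliation", "orgA"),
    ("bob", "affiliation", "orgB"),
]

def build_join_index(triples, path):
    """Precompute the (start, end) pairs connected by the property path."""
    by_prop = defaultdict(list)
    for s, p, o in triples:
        by_prop[p].append((s, o))
    # Start with the pairs of the first property, then extend one step at a time.
    pairs = list(by_prop[path[0]])
    for prop in path[1:]:
        step = defaultdict(list)
        for s, o in by_prop[prop]:
            step[s].append(o)
        pairs = [(start, nxt) for start, mid in pairs for nxt in step.get(mid, [])]
    return sorted(pairs)

# The join is computed once; later queries only read the stored table.
index = build_join_index(triples, ("author", "affiliation"))
print(index)
```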

11
Join Indices

Because the information that makes up a path may
be distributed across different stores, it is
necessary to use an index structure that contains
information about sub-paths.
A source index is used here to determine where
instances of a path are stored.
This determines where to forward the query
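
A source index can be pictured as a lookup table from sub-paths to the stores that hold instances of them. The sketch below is only an illustration: the store names `db1` to `db3`, the table contents, and the function `route` are hypothetical.

```python
# Hypothetical source index: sub-path -> names of the sources holding instances of it.
source_index = {
    ("author",): {"db1", "db3"},
    ("affiliation",): {"db1", "db3"},
    ("author", "affiliation"): {"db1", "db3"},
    ("carriesOut",): {"db2", "db3"},
    ("topic",): {"db2", "db3"},
    ("carriesOut", "topic"): {"db2", "db3"},
}

def route(subpath):
    """Return the sources a sub-query over `subpath` should be forwarded to."""
    return sorted(source_index.get(tuple(subpath), set()))

targets = route(["author", "affiliation"])
print(targets)
```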

12
Join Indices

The join index hierarchy is an adaptation of join
indices
The root of the hierarchy is an index table for
elements in the path p0, p1, ..., pn-1 of
length n
The next level has two paths of length n-1
The last level has n paths of length 1

13
Join Indices

The hierarchy contains information about every
possible sub-path
These sub-paths may later be combined to answer
the query

14
Join Indices

From our running example we have
p0..3 (author, affiliation, carriesOut, topic)
p0..2 (author, affiliation, carriesOut)
p1..3 (affiliation, carriesOut, topic)
p0..1 (author, affiliation)
p1..2 (affiliation, carriesOut)
p2..3 (carriesOut, topic)
p0 (author)
p1 (affiliation)
p2 (carriesOut)
p3 (topic)
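
The levels listed above can be generated mechanically. This is a small sketch, with the hypothetical function name `join_index_hierarchy`, that enumerates every contiguous sub-path of the running example, longest first.

```python
def join_index_hierarchy(path):
    """Return the levels of contiguous sub-paths, from the full path down to length 1."""
    n = len(path)
    levels = []
    for length in range(n, 0, -1):
        # A path of n properties has n - length + 1 sub-paths of a given length.
        levels.append([tuple(path[i:i + length]) for i in range(n - length + 1)])
    return levels

path = ["author", "affiliation", "carriesOut", "topic"]
levels = join_index_hierarchy(path)
for level in levels:
    print(level)
```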

1
15
Remember The Problem

Suppose the first db contains information on
articles, titles, authors, and their affiliations
The second contains information about industrial
projects, topics and organizations
The third contains all the relevant information
but is a research portal

16
Join Indices

The time complexity of using the join index
hierarchy is
O(s · n^2), where s is the number of sources and
n is the length of the path, which is polynomial
time
The length of the path is a significant factor
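
The slides do not give a derivation for this bound. One plausible reading, stated here as an assumption, is that the quadratic factor comes from the number of sub-paths in the hierarchy, each of which may have to be checked against every one of the s sources:

```latex
\[
\sum_{k=1}^{n} (n - k + 1) \;=\; \frac{n(n+1)}{2} \;=\; O(n^2),
\qquad
s \cdot \frac{n(n+1)}{2} \;=\; O(s \cdot n^2)
\]
```

For the running example, n = 4 gives 10 sub-paths.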

17
Answering Algorithm

The algorithm must do several things
Determine all possible combinations of sub-paths
For each combination determine the source
containing the results for the sub-paths
Join the results into one result for the
complete path

18
Answering Algorithm

The algorithm must guarantee that all possible
combinations of sub-paths have been investigated
To do this, a tree-recursion algorithm is used
Splitting a complete path into all possible
combinations of sub-paths and then joining the
results is not computationally reasonable

19
Answering Algorithm

The solution is to use the tree-recursion
algorithm along with source information from the
index hierarchy.
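
A minimal sketch of this idea follows, under simplifying assumptions: a source is taken to hold a sub-path whenever it stores all of that sub-path's properties, only two hypothetical stores (`db1`, `db2`) are used so that pruning is visible, and all function names are invented for the example. The recursion tries every first sub-path and abandons a branch as soon as no source holds it.

```python
PATH = ("author", "affiliation", "carriesOut", "topic")

# Hypothetical coverage: which properties each source stores.
source_props = {
    "db1": ("author", "affiliation"),
    "db2": ("carriesOut", "topic"),
}

def contiguous_subpaths(path):
    n = len(path)
    return [tuple(path[i:j]) for i in range(n) for j in range(i + 1, n + 1)]

def build_source_index(path, source_props):
    """Simplification: a source holds a sub-path if it stores all of its properties."""
    index = {}
    for sub in contiguous_subpaths(path):
        holders = {name for name, props in source_props.items() if set(sub) <= set(props)}
        if holders:
            index[sub] = holders
    return index

def decompositions(path, source_index):
    """Yield each split of `path` into contiguous sub-paths that all have a source."""
    path = tuple(path)
    if not path:
        yield []
        return
    for cut in range(1, len(path) + 1):
        head = path[:cut]
        sources = source_index.get(head)
        if not sources:
            continue  # prune: no source holds this sub-path, so skip the whole branch
        for rest in decompositions(path[cut:], source_index):
            yield [(head, sorted(sources))] + rest

index = build_source_index(PATH, source_props)
plans = list(decompositions(PATH, index))
for plan in plans:
    print(plan)
```

Without pruning, a path of length n can be split in 2^(n-1) ways (8 for the running example); with this toy source index only 4 splits survive.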

1
20
Answering Algorithm

This is an algorithm for a distributed system, so
communications costs must be taken into account.
Since it is over an IP network the communication
costs will contribute significantly to the overall
processing costs

21
Answering Algorithm

Data is joined by the Mediator, therefore
minimization of the data that is transferred is
important.
There may be dependencies which allow those joins
which do not contribute to the result to be
pruned.
Human interaction is necessary.

22
Answering Algorithm

Joins need to be ordered in such a way that the
overall response time is minimized.
This is an NP-hard problem.
Evaluating all possible combinations of joins is
computationally infeasible
Therefore a Good Enough algorithm is used

23
Answering Algorithm

The objective of the algorithm now is to avoid a
bad query plan and not to find the optimal query
plan.
In most cases the optimal plan only improves the
solution marginally.
In order to achieve this goal, experience from
the database community is used

24
Answering Algorithm

A two phase strategy is applied
Iterative Improvement (II)
Simulated Annealing (SA)
The II algorithm
Randomly generates several solutions
These are used as starting points in the
traversal
The traversal is done by applying a series of
random moves from a predefined set

25
Answering Algorithm

The cost is evaluated for each move
The best solution is kept in memory
In the SA each sub-optimal solution is explored
further
Like the II, random moves are performed
Lower cost moves are accepted by the system
Unlike the II, higher cost moves can also be
accepted

26
Answering Algorithm

Acceptance of a higher cost move depends on the
temperature of the system and the cost
difference
Initially the system is hot and readily accepts
moves which yield a higher cost; as it cools,
such moves are accepted less often
This solution generates a Good Enough query
plan and guarantees completeness of the result
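
The two phases can be sketched as follows. Everything concrete here is an assumption made for illustration: the toy cost model (sum of intermediate result sizes with a uniform selectivity), the single swap move, the cooling schedule, and the acceptance rule exp(-delta / temp), which is the usual simulated annealing criterion rather than something stated in the slides.

```python
import math
import random

# Hypothetical result sizes of sub-queries; a plan is an order in which to join them.
SIZES = {"A": 1000, "B": 50, "C": 200, "D": 10, "E": 500}
SELECTIVITY = 0.01  # assumed uniform join selectivity, for illustration only

def cost(plan):
    """Toy cost: total size of the intermediate results of a left-deep join order."""
    total, current = 0.0, SIZES[plan[0]]
    for name in plan[1:]:
        current = current * SIZES[name] * SELECTIVITY
        total += current
    return total

def random_move(plan, rng):
    """One move from a tiny predefined set: swap two positions."""
    i, j = rng.sample(range(len(plan)), 2)
    new = list(plan)
    new[i], new[j] = new[j], new[i]
    return new

def iterative_improvement(rng, starts=5, steps=50):
    best = None
    for _ in range(starts):
        plan = rng.sample(list(SIZES), len(SIZES))  # random starting point
        for _ in range(steps):
            cand = random_move(plan, rng)
            if cost(cand) < cost(plan):  # II accepts only lower cost moves
                plan = cand
        if best is None or cost(plan) < cost(best):
            best = plan  # the best solution is kept in memory
    return best

def simulated_annealing(start, rng, temp=100.0, cooling=0.9, steps=20):
    plan, best = start, start
    while temp > 0.1:
        for _ in range(steps):
            cand = random_move(plan, rng)
            delta = cost(cand) - cost(plan)
            # Higher cost moves are accepted with probability exp(-delta / temp).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                plan = cand
                if cost(plan) < cost(best):
                    best = plan
        temp *= cooling  # the system cools, so uphill moves become rarer
    return best

rng = random.Random(0)
ii_plan = iterative_improvement(rng)
final_plan = simulated_annealing(ii_plan, rng)
print(final_plan, cost(final_plan))
```

Because the annealing phase starts from the plan found by iterative improvement and remembers the best plan seen, its result is never worse than its starting point under this cost model.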

27
Labeling

How can we label the information to increase the
efficiency of subsumption queries?
The data may be stored in a database, but we
should try to visualize it as a tree or graph
The labeling scheme should have minimal complexity

28
Labeling
2
29
Labeling

Bit-Vector
The label is a vector of n bits, where n is the
number of nodes
A 1 bit in a unique position identifies each
node
A node inherits bits from its ancestors
This allows subsumption checking in constant time
The construction of the labels is linear in the
number of nodes
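
A minimal sketch of bit-vector labeling, using Python integers as bit vectors. The class names in the example hierarchy and the function names are hypothetical.

```python
def bit_vector_labels(parents):
    """`parents` maps each node to a list of its direct parents (empty for roots)."""
    position = {node: i for i, node in enumerate(parents)}  # one bit per node
    labels = {}

    def label(node):
        if node not in labels:
            bits = 1 << position[node]  # the node's own identifying bit
            for parent in parents[node]:
                bits |= label(parent)  # inherit every ancestor bit
            labels[node] = bits
        return labels[node]

    for node in parents:
        label(node)
    return labels

def is_subsumed_by(labels, u, v):
    """True if u is v or a descendant of v: all of v's bits are set in u's label."""
    return (labels[u] & labels[v]) == labels[v]

parents = {
    "Thing": [],
    "Agent": ["Thing"],
    "Person": ["Agent"],
    "Organization": ["Agent"],
    "Document": ["Thing"],
}
labels = bit_vector_labels(parents)
print(is_subsumed_by(labels, "Person", "Agent"))
```

The subsumption check is a single AND and a comparison on the two labels.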

30
Labeling
2
31
Labeling

Prefix
A node is labeled with its parent node's
identification as a prefix
This allows subsumption checking in constant time
NCA can also be determined in constant time
Labels can be created in linear time
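
A minimal sketch of prefix labeling on a tree, where each label is the parent's label extended by the child's position. The example hierarchy and the function names are hypothetical. In this sketch the comparisons take time proportional to the label length.

```python
def prefix_labels(children, root):
    """Label the root (1,) and each child with its parent's label plus its position."""
    labels = {root: (1,)}
    stack = [root]
    while stack:
        node = stack.pop()
        for i, child in enumerate(children.get(node, []), start=1):
            labels[child] = labels[node] + (i,)
            stack.append(child)
    return labels

def is_descendant(labels, u, v):
    """u is a proper descendant of v if v's label is a proper prefix of u's label."""
    lu, lv = labels[u], labels[v]
    return len(lu) > len(lv) and lu[:len(lv)] == lv

def nca_label(labels, a, b):
    """The label of the nearest common ancestor is the longest common prefix."""
    common = []
    for x, y in zip(labels[a], labels[b]):
        if x != y:
            break
        common.append(x)
    return tuple(common)

children = {"Thing": ["Agent", "Document"], "Agent": ["Person", "Organization"]}
labels = prefix_labels(children, "Thing")
print(labels["Person"], nca_label(labels, "Person", "Organization"))
```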

32
Labeling
2
33
Labeling

Interval
A node is labeled with an interval consisting of
its preorder and postorder number or some
variation
For node u: start(u) = pre(u), end(u) = post(u)
An ancestor node v of u comes before u in preorder
and after u in postorder
pre(v) < pre(u) and post(v) > post(u)
Subsumption checking in constant time
Label construction is linear in the number of nodes
(variations may be polynomial)
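
A minimal sketch of the plain preorder/postorder variant on a tree. The example hierarchy and the function names are hypothetical.

```python
def interval_labels(children, root):
    """Label every node with (preorder number, postorder number) in one traversal."""
    labels = {}
    counter = {"pre": 0, "post": 0}

    def visit(node):
        counter["pre"] += 1
        pre = counter["pre"]  # assigned when the node is first reached
        for child in children.get(node, []):
            visit(child)
        counter["post"] += 1  # assigned after all descendants are done
        labels[node] = (pre, counter["post"])

    visit(root)
    return labels

def is_ancestor(labels, v, u):
    """v is an ancestor of u iff pre(v) < pre(u) and post(v) > post(u)."""
    return labels[v][0] < labels[u][0] and labels[v][1] > labels[u][1]

children = {"Thing": ["Agent", "Document"], "Agent": ["Person", "Organization"]}
labels = interval_labels(children, "Thing")
print(labels)
```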
