Labeling and Indexing Schemes and Algorithms for the Semantic Web PowerPoint PPT Presentation

1
Labeling and Indexing Schemes and Algorithms for
the Semantic Web
[1] Stuckenschmidt, H., Vdovjak, R., Houben, G., Broekstra, J.
    Index Structures and Algorithms for Querying Distributed RDF Repositories
[2] Christophides, V., Plexousakis, D., Scholl, M., Tourtounis, S.
    On Labeling Schemes for the Semantic Web
Presentation for CSCI 8350 By Scott Patterson
2
The Problem
  • The Semantic Web is distributed by nature,
    therefore:
  • Information may be duplicated at different
    locations
  • The information that is duplicated may be
    expressed in different ways
  • The information desired from a query may be
    stored in fragments

3
The Problem
  • Suppose you would like to search for titles of
    articles written by employees of organizations
    that have projects in the area of RDF
  • Realistically, this information is not stored in
    one place. Perhaps there are three RDF stores
    (in this case databases) containing different and
    similar information.

4
The Problem
  • Suppose the first db contains information on
    articles, titles, authors, and their affiliations
  • The second contains information about industrial
    projects, topics and organizations
  • The third contains all the relevant information
    but is a research portal

5
The Problem
  • How can the information we desire be extracted
    and joined in such a way that:
  • We have not lost any information
  • We have not duplicated any information
  • and it is done in a computationally efficient
    manner

6
The Architecture
  • Extend an existing RDF storage and retrieval
    system, Sesame
  • Queries are passed from the query engine to SAIL
    (an RDF API) which abstracts from the repository
  • In a distributed environment, the repositories
    may be implemented in different ways
  • Hence the introduction of another RDF API between
    SAIL and the db

7
The Architecture
  • Now that the heterogeneity of the environment is
    abstracted we are back to the problem
  • How to locate the relevant information, retrieve
    it, and put it together for an answer to our
    query
  • For this another new component is added between
    Sesame and SAIL

8
The Architecture
  • This component is the Mediator SAIL
  • The job of the Mediator is to determine where to
    send which parts of the query and how to optimize
    the query plan

[1]
9
The Solution
  • The solution to this problem has two parts
  • First, to use join indices
  • Second, find an algorithm that can efficiently
    optimize a query plan that consists of joins

10
Join Indices
  • Create additional database tables that contain
    the result of a join over a specific property
  • Rather than computing a join, a system can simply
    access the index
  • The property here is a path
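
As a rough illustration of the idea, a join index over a path can be precomputed once and then looked up instead of re-running the join. The sketch below is a minimal in-memory version; the triples, the function name build_join_index, and the pair-based index layout are assumptions made for illustration and do not come from the slides or from Sesame.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, property, object); not from the slides.
TRIPLES = [
    ("article1", "author", "alice"),
    ("alice", "affiliation", "acme"),
    ("article2", "author", "bob"),
    ("bob", "affiliation", "initech"),
]

def build_join_index(triples, path):
    """Precompute the (start, end) pairs connected by the property path."""
    by_property = defaultdict(list)
    for s, p, o in triples:
        by_property[p].append((s, o))
    # Start with the pairs of the first property, then extend one step at a time.
    pairs = set(by_property[path[0]])
    for prop in path[1:]:
        step = defaultdict(set)
        for s, o in by_property[prop]:
            step[s].add(o)
        pairs = {(start, end)
                 for start, mid in pairs
                 for end in step.get(mid, ())}
    return pairs

# The join is computed once here; later queries only read the stored pairs.
index = build_join_index(TRIPLES, ("author", "affiliation"))
print(sorted(index))
```

In a real store the resulting pairs would live in an additional database table rather than a Python set.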

11
Join Indices
  • Because the information that makes up a path may
    be distributed across different stores, it is
    necessary to use an index structure that contains
    information about sub-paths.
  • A source index is used here to determine where
    instances of a path are stored.
  • This determines where to forward the query

12
Join Indices
  • The join index hierarchy is an adaptation of join
    indices
  • The root of the hierarchy is an index table for
    elements in the path p0, p1, ..., pn-1 of
    length n
  • The next level has two paths of length n-1
  • The last level has n paths of length 1

13
Join Indices
  • The hierarchy contains information about every
    possible sub-path
  • These sub-paths may later be combined to answer
    the query

14
Join Indices
  • From our running example we have
  • p0..3 (author, affiliation, carriesOut, topic)
  • p0..2 (author, affiliation, carriesOut),
    p1..3 (affiliation, carriesOut, topic)
  • p0..1 (author, affiliation),
    p1..2 (affiliation, carriesOut),
    p2..3 (carriesOut, topic)
  • p0 (author),
    p1 (affiliation),
    p2 (carriesOut),
    p3 (topic)
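
The levels listed for the running example can be generated mechanically. The following is a small sketch, assuming a path is represented as a tuple of property names; the function name join_index_hierarchy is hypothetical.

```python
def join_index_hierarchy(path):
    """Return the levels of contiguous sub-paths, longest first."""
    n = len(path)
    levels = []
    for length in range(n, 0, -1):
        # A path of n properties has n - length + 1 sub-paths of this length.
        levels.append([tuple(path[i:i + length])
                       for i in range(n - length + 1)])
    return levels

path = ("author", "affiliation", "carriesOut", "topic")
for level in join_index_hierarchy(path):
    print(level)
```

Each printed level corresponds to one row of the hierarchy: one path of length 4, two of length 3, three of length 2, and four of length 1.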

[1]
15
Remember The Problem
  • Suppose the first db contains information on
    articles, titles, authors, and their affiliations
  • The second contains information about industrial
    projects, topics and organizations
  • The third contains all the relevant information
    but is a research portal

16
Join Indices
  • The time complexity of using the join index
    hierarchy is
  • O(s · n^2), where s is the number of sources and
    n is the length of the path, which is polynomial
    time
  • The length of the path is a significant factor
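
One plausible way to see where the quadratic factor comes from (the slides do not give the derivation, so this is an assumption): the hierarchy holds n - k + 1 sub-paths of length k, and each sub-path may have to be checked against up to s sources.

```latex
\[
\sum_{k=1}^{n} (n - k + 1) \;=\; \frac{n(n+1)}{2},
\qquad
s \cdot \frac{n(n+1)}{2} \;=\; O(s \cdot n^{2}).
\]
```

For the running example with n = 4 this gives 10 sub-paths, matching the hierarchy listed on slide 14.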

17
Answering Algorithm
  • The algorithm must do several things
  • Determine all possible combinations of sub-paths
  • For each combination determine the source
    containing the results for the sub-paths
  • Join the results into one result for the
    complete path

18
Answering Algorithm
  • The algorithm must guarantee that all possible
    combinations of sub-paths have been investigated
  • To do this, a tree-recursion algorithm is used
  • Splitting a complete path into all possible
    combinations of sub-paths and then joining the
    results is not computationally reasonable

19
Answering Algorithm
  • The solution is to use the tree-recursion
    algorithm along with source information from the
    index hierarchy.
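
A minimal sketch of how the recursion and the source information might work together: the path is split recursively, and any split containing a sub-path that no source holds is dropped early. Without pruning, a path of length n has 2^(n-1) possible splits. The source index contents, the store names db1 and db2, and the function name split_plans are assumptions for illustration, not the algorithm from [1].

```python
# Hypothetical source index: sub-path -> stores holding instances of it.
SOURCE_INDEX = {
    ("author",): {"db1"},
    ("affiliation",): {"db1"},
    ("author", "affiliation"): {"db1"},
    ("carriesOut",): {"db2"},
    ("topic",): {"db2"},
    ("carriesOut", "topic"): {"db2"},
}

def split_plans(path, source_index):
    """Yield every split of path into sub-paths that each have a source."""
    if not path:
        yield []
        return
    for cut in range(1, len(path) + 1):
        head = path[:cut]
        sources = source_index.get(head)
        if not sources:
            continue  # prune: no store holds this sub-path
        for rest in split_plans(path[cut:], source_index):
            yield [(head, sorted(sources))] + rest

full_path = ("author", "affiliation", "carriesOut", "topic")
plans = list(split_plans(full_path, SOURCE_INDEX))
for plan in plans:
    print(plan)
```

With this toy index only four of the eight possible splits survive, because no single store holds a sub-path that crosses from affiliation to carriesOut.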

[1]
20
Answering Algorithm
  • This is an algorithm for a distributed system, so
    communications costs must be taken into account.
  • Since it runs over an IP network, the communication
    costs will contribute significantly to the
    overall processing costs

21
Answering Algorithm
  • Data is joined by the Mediator, therefore
    minimization of the data that is transferred is
    important.
  • There may be dependencies which allow those joins
    which do not contribute to the result to be
    pruned.
  • Human interaction is necessary.

22
Answering Algorithm
  • Joins need to be ordered in such a way that the
    overall response time is minimized.
  • This is an NP-hard problem.
  • Evaluating all possible combinations of joins is
    impossible
  • Therefore a Good Enough algorithm is used

23
Answering Algorithm
  • The objective of the algorithm now is to avoid a
    bad query plan and not to find the optimal query
    plan.
  • In most cases the optimal plan only improves the
    solution marginally.
  • In order to achieve this goal, experience from
    the database community is used

24
Answering Algorithm
  • A two phase strategy is applied
  • Iterative Improvement (II)
  • Simulated Annealing (SA)
  • The II algorithm
  • Randomly generates several solutions
  • These are used as starting points in the
    traversal
  • The traversal is done by applying a series of
    random moves from a predefined set

25
Answering Algorithm
  • The cost is evaluated for each move
  • The best solution is kept in memory
  • In the SA each sub-optimal solution is explored
    further
  • Like the II, random moves are performed
  • Lower cost moves are accepted by the system
  • Unlike the II, higher cost moves can also be
    accepted

26
Answering Algorithm
  • Acceptance of a higher cost move depends on the
    temperature of the system and the cost
    difference
  • This is because initially the system is hot and
    easily accepts moves which yield a higher cost
  • This solution generates a Good Enough query
    plan and guarantees completeness of the result
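
The two phases described on slides 24 to 26 can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the result sizes, the uniform selectivity, the toy cost function, the swap move, the cooling schedule, and the Metropolis acceptance rule exp(-delta / temperature). The slides only state that acceptance depends on the temperature and the cost difference.

```python
import math
import random

# Hypothetical sizes of four sub-path results fetched by the mediator.
SIZES = {"A": 1000, "B": 50, "C": 200, "D": 10}
SELECTIVITY = 0.01  # assumed uniform join selectivity

def plan_cost(order):
    """Toy cost: sum of intermediate result sizes for a left-deep join order."""
    size = SIZES[order[0]]
    total = 0.0
    for name in order[1:]:
        size = size * SIZES[name] * SELECTIVITY
        total += size
    return total

def random_move(order, rng):
    """Swap two positions in the join order."""
    i, j = rng.sample(range(len(order)), 2)
    new = list(order)
    new[i], new[j] = new[j], new[i]
    return tuple(new)

def iterative_improvement(rng, starts=5, tries=50):
    """Phase 1: several random starts, accepting only lower-cost moves."""
    best = None
    for _ in range(starts):
        current = tuple(rng.sample(sorted(SIZES), len(SIZES)))
        for _ in range(tries):
            candidate = random_move(current, rng)
            if plan_cost(candidate) < plan_cost(current):
                current = candidate
        if best is None or plan_cost(current) < plan_cost(best):
            best = current
    return best

def simulated_annealing(start, rng, temperature=100.0, cooling=0.9, steps=200):
    """Phase 2: explore around start, sometimes accepting higher-cost moves."""
    current = best = start
    for _ in range(steps):
        candidate = random_move(current, rng)
        delta = plan_cost(candidate) - plan_cost(current)
        # Assumed Metropolis rule: uphill moves are likelier while hot.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
        if plan_cost(current) < plan_cost(best):
            best = current
        temperature *= cooling  # the system cools down over time
    return best

rng = random.Random(0)
start = iterative_improvement(rng)
plan = simulated_annealing(start, rng)
print(plan, plan_cost(plan))
```

The annealing phase keeps the best plan seen so far, so its result is never worse than the plan handed over by the first phase.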

27
Labeling
  • How can we label the information to increase the
    efficiency of subsumption queries?
  • The data may be stored in a database, but we
    should try to visualize it as a tree or graph
  • Labeling scheme should have a minimal complexity

28
Labeling
[2]
29
Labeling
  • Bit-Vector
  • Label is a vector of n-bits, n is the number of
    nodes
  • A 1 bit in some position is used to uniquely
    identify nodes
  • A node inherits bits from its ancestors
  • This allows subsumption checking in constant time
  • The construction of the labels is linear in the
    number of nodes
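
A small sketch of the bit-vector scheme, using Python integers as bit vectors. The class hierarchy and the function names are hypothetical. Note that the check is a single AND and compare, which is constant time only when the vector fits a fixed number of machine words.

```python
def bit_vector_labels(parents):
    """parents maps each node to a list of its parents (roots map to [])."""
    own_bit = {node: 1 << i for i, node in enumerate(parents)}
    labels = {}

    def label(node):
        if node not in labels:
            value = own_bit[node]           # the bit that identifies this node
            for parent in parents[node]:
                value |= label(parent)      # inherit the ancestors' bits
            labels[node] = value
        return labels[node]

    for node in parents:
        label(node)
    return labels

def is_subsumed_by(labels, descendant, ancestor):
    """True if every bit of the ancestor also appears in the descendant."""
    return (labels[descendant] & labels[ancestor]) == labels[ancestor]

# Hypothetical class hierarchy, not taken from the slides.
PARENTS = {
    "Person": [],
    "Organization": [],
    "Employee": ["Person"],
    "Researcher": ["Employee"],
}
labels = bit_vector_labels(PARENTS)
print(is_subsumed_by(labels, "Researcher", "Person"))
print(is_subsumed_by(labels, "Researcher", "Organization"))
```

As written, the check is reflexive: a node counts as subsumed by itself.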

30
Labeling
[2]
31
Labeling
  • Prefix
  • A node is labeled with the parent node's
    identification
  • This allows subsumption checking in constant time
  • NCA can also be determined in constant time
  • Labels can be created in linear time
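
One common form of prefix labeling is a Dewey-style scheme, in which a node's label is its parent's label followed by the node's position among its siblings. The sketch below assumes that variant and a hypothetical tree; the slides do not say which prefix scheme is meant. The comparisons here take time proportional to the label length, so the constant-time claim presumably treats that length as bounded.

```python
def prefix_labels(children, root):
    """Label each node with its parent's label plus its sibling position."""
    labels = {root: (1,)}
    stack = [root]
    while stack:
        node = stack.pop()
        for position, child in enumerate(children.get(node, []), start=1):
            labels[child] = labels[node] + (position,)
            stack.append(child)
    return labels

def is_ancestor(labels, ancestor, descendant):
    """An ancestor's label is a proper prefix of the descendant's label."""
    a, d = labels[ancestor], labels[descendant]
    return len(a) < len(d) and d[:len(a)] == a

def nearest_common_ancestor_label(labels, u, v):
    """The NCA's label is the longest common prefix of the two labels."""
    common = []
    for x, y in zip(labels[u], labels[v]):
        if x != y:
            break
        common.append(x)
    return tuple(common)

# Hypothetical tree, not taken from the slides.
CHILDREN = {
    "Thing": ["Person", "Organization"],
    "Person": ["Employee", "Student"],
}
labels = prefix_labels(CHILDREN, "Thing")
print(labels["Employee"])
print(nearest_common_ancestor_label(labels, "Employee", "Student"))
```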

32
Labeling
[2]
33
Labeling
  • Interval
  • A node is labeled with an interval consisting of
    its preorder and postorder number or some
    variation
  • For node u: start(u) = pre(u), end(u) = post(u)
  • An ancestor node of u is before u in preorder and
    after in postorder
  • pre(v) < pre(u) and post(v) > post(u)
  • Subsumption checking takes constant time
  • Label construction is linear in the number of nodes
  • Variations may be polynomial
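
The preorder and postorder condition can be sketched with a single depth-first traversal. The tree and function names below are hypothetical; the labels are the plain (pre, post) pair rather than any of the variations mentioned above.

```python
def interval_labels(children, root):
    """Assign each node its (preorder, postorder) numbers in one traversal."""
    labels = {}
    pre_counter = 0
    post_counter = 0

    def visit(node):
        nonlocal pre_counter, post_counter
        pre_counter += 1
        pre = pre_counter                   # numbered on the way down
        for child in children.get(node, []):
            visit(child)
        post_counter += 1
        labels[node] = (pre, post_counter)  # numbered on the way back up

    visit(root)
    return labels

def is_ancestor(labels, v, u):
    """v is an ancestor of u iff pre(v) < pre(u) and post(v) > post(u)."""
    pre_v, post_v = labels[v]
    pre_u, post_u = labels[u]
    return pre_v < pre_u and post_v > post_u

# Hypothetical tree, not taken from the slides.
CHILDREN = {
    "Thing": ["Person", "Organization"],
    "Person": ["Employee", "Student"],
}
labels = interval_labels(CHILDREN, "Thing")
print(labels)
print(is_ancestor(labels, "Person", "Employee"))
```

The check compares two pairs of integers, and the labels come from one pass over the tree, which is in line with the constant and linear bounds stated on the slide.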

34
Labeling
[2]
35
Questions
  • ???????