Title: Interconnection Semantics for Keyword Search in XML


1
Interconnection Semantics for Keyword Search in XML
  • Sara Cohen, Technion, Israel
  • Yaron Kanza, University of Toronto
  • Benny Kimelfeld, Hebrew University
  • Yehoshua Sagiv, Hebrew University

CIKM 2005, Bremen, Germany
2
Goal
  • We present a general framework to efficiently
    determine when different parts of an XML document
    are semantically related

Who cares?
  • A step towards bridging the gap between keyword
    search and database querying
  • For example
  • Simplification and enhanced flexibility of query
    languages for XML (e.g., schema free)
  • Taking into account structural properties for
    ranking/filtering results of keyword search

3
Schema-Free Queries
Can be formulated in the tools proposed by
[Cohen et al., ICDT'03] and [Li et al., VLDB'04]
SELECT article/title
FROM interconnected //article, //author, //publisher
WHERE author/name = 'A. Cohen' and
publisher/name = 'Am. Publishing'
Find the titles of articles that were written by
A. Cohen and were published by Am. Publishing
4
Schema-Free Queries
SELECT article/title
FROM interconnected //article, //author, //publisher
WHERE author/name = 'A. Hunt' and
publisher/name = 'Am. Publishing'
5
XML Structure 1
author is nested inside article
6
XML Structure 2
article is nested inside author
7
XML Structure 3
article and author are connected through ID
references
8
Keyword Search over XML
  • Typically, ranking of XML fragments is determined
    by the frequencies of the keywords in the
    fragments
  • We propose to take into account the answer to the
    following test
  • Do the keywords appear in semantically related
    parts of the fragment?

9
Keyword-Search Example
Keywords: Cohen, IR
10
A Result
Keywords: Cohen, IR
Cohen and IR are in the same department
11
Another Result
Keywords: Cohen, IR
Cohen and IR are in the same department and
Cohen wrote an article about IR
This fragment should have a higher rank
Identifying meaningful relationships can improve
ranking in keyword search
12
Existing Approaches
  • Each of the existing approaches proposes a
    specific method for deriving meaningful
    relationships
  • ID references are ignored
  • That is, documents are always trees
  • The schema is ignored
  • Therefore, missing information is not taken into
    account

13
Our Contribution
  • We propose a framework that enables a large
    variety of approaches
  • A single method is not likely to be appropriate
    in all cases
  • Both ID references and the schema can be taken
    into account
  • We propose specific semantics that are
    automatically derived from schemas
  • We present efficient algorithms for applying
    interconnection semantics

14
Contents
  • Introduction
  • Formal Framework
  • Derived Semantics
  • Computing Interconnectivity
  • Undirected and Universal Semantics
  • Conclusion and Future Work

16
An XML Document
A rooted directed graph
A node is an object that has a label and
(possibly) a value
Edges represent element nesting and ID references
17
A Schema
A rooted directed graph
A node is a label
18
A Schema Defines the Document Structure
The root of the schema is the label of the root
of the document
19
A Schema Defines the Document Structure
An edge in the document is allowed only if an
edge between the corresponding labels appears in
the schema
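
The two conformance rules above (matching root labels, and every document edge backed by a schema edge) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the class and function names (`Document`, `Schema`, `conforms`) and the example labels are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """A rooted directed graph of labeled objects (hypothetical representation)."""
    root: int      # id of the root object
    labels: dict   # object id -> label
    edges: set     # (parent id, child id): element nesting or ID reference


@dataclass
class Schema:
    """A rooted directed graph whose nodes are labels."""
    root: str      # label of the schema root
    edges: set     # (parent label, child label)


def conforms(doc: Document, schema: Schema) -> bool:
    """Check the two rules stated on the slides."""
    # Rule 1: the schema root is the label of the document root.
    if doc.labels[doc.root] != schema.root:
        return False
    # Rule 2: each document edge needs a schema edge between the matching labels.
    return all((doc.labels[u], doc.labels[v]) in schema.edges for u, v in doc.edges)


schema = Schema("department", {("department", "article"),
                               ("article", "title"), ("article", "author")})
doc = Document(1, {1: "department", 2: "article", 3: "title", 4: "author"},
               {(1, 2), (2, 3), (2, 4)})
print(conforms(doc, schema))
```

Storing nesting edges and ID-reference edges in one edge set mirrors the slides, which treat both as edges of the same graph.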
20
Relationships and Trees
  • In our framework, relationships among objects are
    represented by trees
  • Trees represent atomic relationships, i.e.,
    relationships that are indivisible
  • Our trees are directed
  • That is, they represent hierarchies
  • This restriction is relaxed later

21
Patterns
  • In the formal framework, patterns are the basic
    building blocks

Formally, a pattern is a pair (L,C)
Example: ({title, publication, author}, C), where C is
the tree of labels drawn on the slide (figure not reproduced)
  • L is a set of labels
  • C is a tree of labels
  • C contains L
  • C has no redundant edges

22
Interconnection by Patterns
  • A pattern defines when a set O of objects with a
    given set L of labels is interconnected
  • O is interconnected if the objects are in a tree
    that is isomorphic to the pattern

Example pattern: ({title, publication, author}, C), where C is
the tree of labels drawn on the slide (figure not reproduced)
23
Interconnection by Patterns
24
Interconnection by Patterns
Interconnected
25
Interconnection Semantics
  • An interconnection semantics P is a set of
    patterns
  • A set of objects is interconnected by P if it is
    interconnected by a pattern of P

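Example: two patterns for the same label set,
({title, name}, C1) and ({title, name}, C2), where C1 and C2
are the trees drawn on the slide (figures not reproduced)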
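
The two definitions above can be sketched as a small backtracking test. This is an illustration under stated assumptions, not the algorithm from the paper: the pattern tree is assumed to be uniquely labeled, a pattern is encoded as a hypothetical triple `(labels, root, children)`, and the document is given as `doc_labels` (object id to label) plus `doc_children` (object id to list of successors, covering both nesting and ID references).

```python
def interconnected_by(pattern, doc_labels, doc_children, objects):
    """True if `objects` lie in a document subtree isomorphic to the pattern tree."""
    labels, root, children = pattern
    # Each object must occupy the pattern node that carries its label.
    required = {doc_labels[o]: o for o in objects}
    if len(required) != len(objects) or set(required) != set(labels):
        return False

    def embed(label, node):
        # Can the pattern subtree rooted at `label` be mapped with `label` -> node?
        if doc_labels[node] != label:
            return False
        if label in required and required[label] != node:
            return False
        # Pattern labels are unique, so sibling subtrees map to disjoint objects.
        return all(
            any(embed(child, m) for m in doc_children.get(node, ()))
            for child in children.get(label, ())
        )

    return any(embed(root, n) for n in doc_labels)


def interconnected(semantics, doc_labels, doc_children, objects):
    """A set is interconnected by P if some pattern of P interconnects it."""
    return any(interconnected_by(p, doc_labels, doc_children, objects)
               for p in semantics)


doc_labels = {1: "department", 2: "article", 3: "title",
              4: "author", 5: "article", 6: "title"}
doc_children = {1: [2, 5], 2: [3, 4], 5: [6]}
pattern = ({"title", "author"}, "article", {"article": ["title", "author"]})
print(interconnected([pattern], doc_labels, doc_children, {3, 4}))
print(interconnected([pattern], doc_labels, doc_children, {6, 4}))
```

In the example, objects 3 and 4 share the article 2, while title 6 belongs to a different article than author 4, so only the first pair matches the pattern.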
26
A Framework of Semantics
  • Various approaches for deriving semantic
    relationships can be represented by means of
    interconnection semantics
  • Interconnection semantics can be given explicitly
    (e.g., by manually defining the patterns)
  • Alternatively, interconnection semantics can be
    given implicitly (i.e., be derived) by defining
    conditions that characterize patterns
  • Derived semantics are discussed next

27
Contents
  • Introduction
  • Formal Framework
  • Derived Semantics
  • Computing Interconnectivity
  • Undirected and Universal Semantics
  • Conclusion and Future Work

28
Derived Semantics
  • In principle, an interconnection semantics can be
    given explicitly by authors (or users) of XML
    documents
  • However, generating a semantics may be a
    cumbersome task
  • Moreover, the semantics may be very large

We need semantics that are automatically derived
29
Derived Semantics (cont.)
  • A derived semantics is essentially a rule that
    states which subtrees of the schema appear in
    patterns
  • A default schema can be obtained from the
    document
  • One semantics that always captures our intuition
    is not likely to exist, therefore it is of
    interest to explore several approaches for
    deriving semantics
  • We explore a wide spectrum of specific derived
    semantics
  • Our derived semantics demonstrate the strengths
    and weaknesses of different approaches for
    obtaining such semantics

30
A Simple Example
{title, publication}
31
The Interconnection Semantics Pall
  • The semantics Pall generalizes the approach of
    [Cohen et al., ICDT'03] to graph (rather than
    tree) documents
  • Given a schema S, the semantics Pall(S) is the
    set of all patterns (L,C), where C is a subtree
    of S
  • Note that a set of objects is
    Pall(S)-interconnected if it is contained in a
    uniquely labeled subtree of the document
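
For the special case of a tree document (no ID references), the "uniquely labeled subtree" test above reduces to checking the smallest subtree that connects the objects, because every subtree containing them must include it. The sketch below covers only that special case; the function name and the `parent`/`label` dictionaries are assumptions made for illustration.

```python
def pall_interconnected_tree(parent, label, objects):
    """Tree documents only: is the connecting subtree of `objects` uniquely labeled?"""
    def chain(n):
        # n, parent(n), ..., up to the document root
        out = [n]
        while out[-1] in parent:
            out.append(parent[out[-1]])
        return out

    chains = [chain(o) for o in objects]
    common = set(chains[0]).intersection(*chains[1:])
    # The lowest common ancestor is the first common node on any upward chain.
    lca = next(n for n in chains[0] if n in common)
    # Smallest subtree containing the objects: the paths from the LCA down to each.
    nodes = set()
    for c in chains:
        nodes.update(c[: c.index(lca) + 1])
    labels = [label[n] for n in nodes]
    return len(labels) == len(set(labels))


parent = {2: 1, 5: 1, 3: 2, 4: 2, 6: 5}
label = {1: "department", 2: "article", 3: "title",
         4: "author", 5: "article", 6: "title"}
print(pall_interconnected_tree(parent, label, [3, 4]))
print(pall_interconnected_tree(parent, label, [6, 4]))
```

The second call fails because connecting title 6 with author 4 requires passing through two different article objects, so the connecting subtree repeats the label article.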

32
The Subtrees of {title, publication}
33
The Subtrees of {title, author}
34
The Subtrees of {title, author}
Is this what we mean?
35
The Interconnection Semantics Pmin
  • The semantics Pall(S) may contain patterns that
    imply rather weak relationships
  • We follow the convention that small trees imply
    strong relationships
  • The semantics Pmin(S) contains, for each set L of
    labels, only the patterns (L,C) of Pall(S), such
    that C is of minimal size
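
The selection step above can be sketched as a filter over an already extracted list of Pall(S) patterns. This is only an illustration: size is assumed to mean the number of nodes of C, and a pattern is encoded as a hypothetical pair `(labels, edges)` with `labels` a frozenset and `edges` a frozenset of (parent, child) label pairs.

```python
from collections import defaultdict


def pattern_size(labels, edges):
    """Number of nodes of the pattern tree (covers single-node patterns too)."""
    nodes = set(labels)
    for parent, child in edges:
        nodes.update((parent, child))
    return len(nodes)


def p_min(patterns):
    """Keep, for each label set L, only the patterns (L, C) with C of minimal size."""
    by_labels = defaultdict(list)
    for labels, edges in patterns:
        by_labels[labels].append((labels, edges))
    result = []
    for group in by_labels.values():
        best = min(pattern_size(l, e) for l, e in group)
        result.extend(p for p in group if pattern_size(*p) == best)
    return result


L = frozenset({"title", "author"})
small = (L, frozenset({("article", "title"), ("article", "author")}))
large = (L, frozenset({("department", "article"), ("article", "title"),
                       ("department", "author")}))
print(p_min([small, large]) == [small])
```

Ties are kept on purpose: the slides say Pmin(S) contains "the patterns" of minimal size, so several patterns for the same L may survive.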

36
The Subtrees of {title, author}
This subtree of the schema is minimal w.r.t.
{title, author}
37
Pmin(S)-Interconnected title and author
This subtree of the document shows
Pmin(S)-interconnection!
38
The Subtrees of {title, author}
This subtree of the schema is not minimal w.r.t.
{title, author}
39
Pmin(S)-Interconnected title and author
This subtree does not show
Pmin(S)-interconnection!
40
What about these title and author?
41
What about these title and author?
This subtree does not show
Pmin(S)-interconnection
42
What about these title and author?
This subtree does not show
Pmin(S)-interconnection
because this subtree is smaller
43
The Interconnection Semantics Puca
  • We consider a different notion of minimality that
    is based on structure
  • Consider an acyclic schema S and a pattern
    p = (L,C) in Pall(S)
  • Intuitively, p is structurally minimal if
    internal nodes in C cannot be roots of trees
    containing L
  • Formally, p is structurally minimal if only the
    root of C is a common ancestor of L in S
  • The semantics Puca(S) is the set of all
    structurally minimal patterns in Pall(S)
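
Structural minimality can be sketched as a reachability test on the schema. The sketch reads the definition as restricted to the nodes of C (no node of C other than its root may be a common ancestor of L in S), which matches the intuitive statement above but is an interpretation. The schema used here and all names are hypothetical.

```python
def reachable(schema_children, start):
    """All labels reachable from `start` in the schema, including `start` itself."""
    seen, stack = {start}, [start]
    while stack:
        for child in schema_children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def structurally_minimal(schema_children, labels, pattern_root, pattern_nodes):
    """No non-root node of the pattern may be a common ancestor of all labels in S."""
    for node in pattern_nodes:
        if node != pattern_root and set(labels) <= reachable(schema_children, node):
            return False
    return True


# Hypothetical schema loosely based on the labels mentioned on the slides.
schema = {
    "department": ["publications"],
    "publications": ["article", "inproc"],
    "article": ["title", "author"],
    "inproc": ["title", "author"],
}
L = {"title", "author"}
print(structurally_minimal(schema, L, "article", {"article", "title", "author"}))
print(structurally_minimal(
    schema, L, "department",
    {"department", "publications", "inproc", "title", "author"}))
```

The second pattern is rejected because publications and inproc, both internal nodes of the pattern, already reach title and author in the schema.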

44
Some Examples
45
One Structurally Minimal Pattern
({title, author}, C), with C rooted at article (figure not reproduced)
In the schema, article is the only
common ancestor of {title, author}
46
Another Structurally Minimal Pattern
({title, author}, C), with C rooted at inproc. (figure not reproduced)
In the schema, inproc. is the only
common ancestor of {title, author}
47
A Third Structurally Minimal Pattern
({title, author}, C), with C rooted at inproc. (figure not reproduced)
In the schema, inproc. is the only
common ancestor of {title, author}
48
Not a Structurally Minimal Pattern
({title, author}, C), with C rooted at department (figure not reproduced)
In the schema, department, publications and
inproc. are all common ancestors of
{title, author}
49
Back to the Document
50
Puca(S)-Interconnected title and author
This subtree shows Puca(S)-interconnection!
51
Puca(S)-Interconnected title and author
This subtree shows Puca(S)-interconnection!
52
Not Puca(S)-Interconnected
This subtree does not show
Puca(S)-interconnection!
53
Contents
  • Introduction
  • Formal Framework
  • Derived Semantics
  • Computing Interconnectivity
  • Undirected and Universal Semantics
  • Conclusion and Future Work

54
Interconnectivity Problems
  • Given an interconnection semantics, we are
    interested in solving two problems
  • Determine whether a given set of objects is
    interconnected
  • Arises in keyword search
  • Generate all interconnected sets of objects
    having a given set of labels
  • Arises in evaluation of queries that incorporate
    interconnectivity

55
Computing Explicit Semantics
  • When the interconnection semantics is given
    explicitly, there are highly efficient algorithms
    for solving the interconnectivity problems
  • I.e., polynomial under query-and-data complexity
  • In query evaluation, the first k results can be
    enumerated quickly
  • I.e., evaluation in incremental polynomial time
  • We can solve the interconnectivity problems by
    translating patterns into queries of standard
    languages, e.g., XQuery
  • Thus, existing engines can be used

56
Computing Derived Semantics
  • There are two approaches for solving
    interconnectivity problems when the semantics is
    derived (implicit)
  • Pattern Extraction
  • Direct Computation (w/o extracting patterns)

57
1st Approach: Pattern Extraction
  • This approach is basically a reduction to the
    case where interconnection semantics are given
    explicitly
  • In particular, given an implicit semantics P and
    a set L of labels, we extract from the schema all
    patterns (L,C) of P
  • Thus, we create an explicit representation of the
    implicit semantics and use the corresponding
    explicit-semantics algorithms
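
For intuition, extracting all Pall(S) patterns for a label set L from a small acyclic schema can be sketched by brute force: pick a root, pick one root-to-label path per label, and keep the union of the paths when it forms a tree. This is not the polynomial-delay extraction the talk refers to; it is an exhaustive illustration with hypothetical names and a hypothetical schema.

```python
from itertools import product


def paths(schema_children, src, dst, prefix=()):
    """Yield every directed path from src to dst in an acyclic schema."""
    prefix = prefix + (src,)
    if src == dst:
        yield prefix
        return
    for child in schema_children.get(src, ()):
        yield from paths(schema_children, child, dst, prefix)


def extract_patterns(schema_children, labels):
    """All (root, edge set) pairs whose tree contains `labels` and has leaves only in `labels`."""
    labels = sorted(labels)
    nodes = set(schema_children) | {c for cs in schema_children.values() for c in cs}
    found = set()
    for root in nodes:
        per_label = [list(paths(schema_children, root, l)) for l in labels]
        for combo in product(*per_label):
            edges = {(p[i], p[i + 1]) for p in combo for i in range(len(p) - 1)}
            children = [child for _, child in edges]
            # A union of root paths is a tree iff no node has two parents.
            if len(children) == len(set(children)):
                found.add((root, frozenset(edges)))
    return found


schema = {
    "department": ["publications"],
    "publications": ["article", "inproc"],
    "article": ["title", "author"],
    "inproc": ["title", "author"],
}
patterns = extract_patterns(schema, {"title", "author"})
print(len(patterns))
```

Even this six-label schema yields ten patterns for {title, author}, which illustrates why the number of relevant patterns can grow quickly with the schema.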

58
Puca(S) Patterns for {title, author}
59
Extracting the Patterns
  • Given a set L of labels, we can extract the
    relevant patterns of our derived semantics very
    efficiently, when measuring the complexity in
    terms of the combined size of input and the
    output
  • I.e., with polynomial delay
  • For Pall(S) and Puca(S), we can use the
    keyword-search algorithms given in [Kimelfeld and
    Sagiv, DBPL'05]
  • An algorithm for Pmin(S) is given in the
    proceedings

60
2nd Approach: Direct Computation
  • There are highly efficient algorithms for
    evaluating the extracted patterns
  • However, there may be a large (i.e., exponential)
    number of relevant patterns
  • When the number of patterns is large, we are
    interested in solutions that compute
    interconnectivity directly, i.e., without
    extracting the patterns first

61
Algorithms for Direct Computation
  • There are efficient algorithms for direct
    computation of our derived semantics
  • An exception is Pall(S) in cyclic schemas, which
    is intractable
  • We use data complexity as a yardstick of
    efficiency, which is conventional in DB theory
  • In some common cases, interconnectivity problems
    are more tractable
  • I.e., there are efficient algorithms under
    query-and-data complexity (like tree join vs.
    general join)
  • Usually, when either the schema is a tree or the
    document has no ID references
  • Details are in the proceedings

62
Contents
  • Introduction
  • Formal Framework
  • Derived Semantics
  • Computing Interconnectivity
  • Undirected and Universal Semantics
  • Conclusion and Future Work

63
Undirected Relationships
An article was written by authors from two
different departments
64
Undirected Relationships
65
Undirected Relationships
66
Undirected Patterns and Semantics
Undirected Patterns
  • (L,C)
  • L is a set of labels
  • C is an undirected tree of labels

An undirected interconnection semantics consists
of undirected patterns
67
Universal Interconnectivity
  • Until now, an interconnection semantics was
    applied in an existential manner
  • That is, a set of objects is interconnected if
    there is evidence for interconnectivity (e.g.,
    a uniquely labeled subtree that contains the
    given objects)
  • When applying interconnection semantics in a
    universal manner, we consider all subtrees of the
    document (contexts) that contain the given
    objects
  • A set of objects is universally interconnected if
    every such subtree contains evidence for
    interconnectivity

68
An Example
69
An Example
This subtree shows interconnectivity
70
An Example
The two objects are NOT universally
interconnected!
This subtree does not show interconnectivity
71
Derived Semantics
  • We considered two undirected derived semantics
  • Pu-all(S): all undirected patterns of a schema S
  • Pu-min(S): all minimal patterns in Pu-all(S)
  • We considered the universal versions of the
    semantics Pall(S) and Pu-all(S)
  • Usually, these derived semantics can be computed
    efficiently
  • ∀Pall(S) is harder than Pall(S) and ∀Pu-all(S) is
    easier
  • However, completely different proof techniques
    are required
  • More details, including algorithms and complexity
    results, are in the proceedings

72
Contents
  • Introduction
  • Formal Framework
  • Derived Semantics
  • Computing Interconnectivity
  • Undirected and Universal Semantics
  • Conclusion and Future Work

73
Interconnection Semantics
  • A framework for determining meaningful
    relationships among XML objects
  • Interconnection semantics can be either
    explicitly constructed or automatically
    (implicitly) derived
  • Users can tweak implicit semantics by adding
    specific patterns
  • The framework enables many different types of
    semantics, with different strengths

74
Containments among our Derived Semantics
75
Complexity
  • Explicit semantics can be computed very
    efficiently
  • Under reasonable conditions, our derived
    semantics can also be efficiently computed
  • Different semantics require totally different
    techniques
  • Two computation approaches
  • Pattern extraction
  • Direct computation

76
Future Work
  • Experimentation
  • Which semantics works better in practical
    scenarios?
  • As a result of the experimentation, discover new
    (better) semantics
  • Information Retrieval
  • Combine our approach with IR techniques for
    ranking XML fragments
  • Richer Data Model
  • That is, richer schema and document models