Title: Interconnection Semantics for Keyword Search in XML
Sara Cohen, Technion, Israel
Yaron Kanza, University of Toronto
Benny Kimelfeld, Hebrew University
Yehoshua Sagiv, Hebrew University
CIKM 2005, Bremen, Germany
2Goal
- We present a general framework to efficiently
determine when different parts of an XML document
are semantically related
Who cares?
- A step towards bridging the gap between keyword search and database querying
- For example:
- Simplification and enhanced flexibility of query languages for XML (e.g., schema-free queries)
- Taking structural properties into account when ranking or filtering the results of keyword search
3Schema-Free Queries
Can be formulated in the tools proposed by Cohen
et al., ICDT03 and Li et al., VLDB04
SELECT article/title FROM interconnected
//article, //author, //publisher
WHERE author/name = 'A. Cohen' and
publisher/name = 'Am. Publishing'
Find the titles of articles that were written by
A. Cohen and were published by Am. Publishing
4Schema-Free Queries
SELECT article/title FROM interconnected
//article, //author, //publisher
WHERE author/name = 'A. Hunt' and
publisher/name = 'Am. Publishing'
5XML Structure 1
author is nested inside article
6XML Structure 2
article is nested inside author
7XML Structure 3
article and author are connected through ID
references
8Keyword Search over XML
- Typically, ranking of XML fragments is determined by the frequencies of the keywords in the fragments
- We propose to take into account the answer to the following test:
- Do the keywords appear in semantically related parts of the fragment?
9Keyword-Search Example
Cohen , IR
10A Result
Cohen , IR
Cohen and IR are in the same department
11Another Result
Cohen , IR
Cohen and IR are in the same department and
Cohen wrote an article about IR
This fragment should have a higher rank
Identifying meaningful relationships can improve
ranking in keyword search
12Existing Approaches
- Each of the existing approaches proposes a specific method for deriving meaningful relationships
- ID references are ignored
- That is, documents are always trees
- The schema is ignored
- Therefore, missing information is not taken into account
13Our Contribution
- We propose a framework that enables a large variety of approaches
- A single method is not likely to be appropriate in all cases
- Both ID references and the schema can be taken into account
- We propose specific semantics that are automatically derived from schemas
- We present efficient algorithms for applying interconnection semantics
14Contents
- Introduction
- Formal Framework
- Derived Semantics
- Computing Interconnectivity
- Undirected and Universal Semantics
- Conclusion and Future Work
16An XML Document
A rooted directed graph
A node is an object that has a label and
(possibly) a value
Edges represent element nesting and ID references
17A Schema
A rooted directed graph
A node is a label
18A Schema Defines the Document Structure
The root of the schema is the label of the root
of the document
19A Schema Defines the Document Structure
An edge in the document is allowed only if an
edge between the corresponding labels appears in
the schema
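A minimal Python sketch of the two definitions above, assuming a document is stored as a label map plus a set of directed edges (nesting and ID references alike) and a schema as a root label plus a set of label pairs. The names `Document`, `Schema`, and `conforms`, and the sample labels, are illustrative and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Document:
    root: int      # id of the root object
    labels: dict   # object id -> label
    edges: set     # (parent id, child id): element nesting or ID reference


@dataclass
class Schema:
    root: str      # label of the schema root
    edges: set     # (label, label) pairs that are allowed in documents


def conforms(doc, schema):
    """Check the two conditions stated on the slides."""
    # The root of the schema is the label of the root of the document.
    if doc.labels[doc.root] != schema.root:
        return False
    # Every document edge must correspond to an edge between labels in the schema.
    return all((doc.labels[u], doc.labels[v]) in schema.edges
               for u, v in doc.edges)


# Hypothetical schema and document, for illustration only.
schema = Schema("department", {
    ("department", "publications"), ("publications", "article"),
    ("article", "title"), ("article", "author"),
})
doc = Document(0, {0: "department", 1: "publications", 2: "article",
                   3: "title", 4: "author"},
               {(0, 1), (1, 2), (2, 3), (2, 4)})
bad_doc = Document(0, doc.labels, doc.edges | {(3, 4)})  # title -> author is not in the schema
```

Here `conforms(doc, schema)` holds, while `bad_doc` is rejected because of the extra title-to-author edge.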
20Relationships and Trees
- In our framework, relationships among objects are represented by trees
- Trees represent atomic relationships, i.e., relationships that are indivisible
- Our trees are directed
- That is, they represent hierarchies
- This restriction is relaxed later
21Patterns
- In the formal framework, patterns are the basic
building blocks
Formally, a pattern is a pair (L,C)
C is a tree of labels
L is a set of labels
Example: ({title, publication, author}, C), with the tree C shown as a figure on the slide
- C contains L
- C has no redundant edges
22Interconnection by Patterns
- A pattern defines when a set O of objects with a given set L of labels is interconnected
- O is interconnected if the objects are in a tree that is isomorphic to the pattern
Example: ({title, publication, author}, C), with the tree C shown as a figure on the slide
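The test above can be sketched in Python under two assumptions that the slides do not spell out: the tree C is uniquely labeled (each label occurs once), and "in a tree isomorphic to the pattern" means a subgraph of the document whose nodes carry the labels of C and whose edges mirror the edges of C. The function name `interconnected` and the sample data are hypothetical; this is a naive recursive check, not the algorithm from the paper.

```python
def interconnected(objects, pattern_root, pattern_edges, doc_labels, doc_edges):
    """Return True if `objects` lie in a document subtree matching the pattern tree."""
    # Each given object pins down the document node used for its label.
    fixed = {doc_labels[o]: o for o in objects}
    if len(fixed) != len(objects):
        return False  # two objects share a label, so no uniquely labeled tree fits
    pattern_labels = {pattern_root} | {b for _, b in pattern_edges}
    if not set(fixed) <= pattern_labels:
        return False  # some object's label does not occur in the pattern

    pattern_children, doc_children = {}, {}
    for a, b in pattern_edges:
        pattern_children.setdefault(a, []).append(b)
    for u, v in doc_edges:
        doc_children.setdefault(u, []).append(v)

    def embed(label, obj):
        # Can the pattern subtree rooted at `label` be mapped onto `obj`?
        if doc_labels[obj] != label:
            return False
        if label in fixed and fixed[label] != obj:
            return False
        return all(any(embed(child, o) for o in doc_children.get(obj, []))
                   for child in pattern_children.get(label, []))

    return any(embed(pattern_root, o) for o in doc_labels)


# Hypothetical document: two articles, each with its own title and author.
doc_labels = {0: "department", 1: "publications", 2: "article", 3: "title",
              4: "author", 5: "article", 6: "title", 7: "author"}
doc_edges = {(0, 1), (1, 2), (2, 3), (2, 4), (1, 5), (5, 6), (5, 7)}
pattern_edges = {("article", "title"), ("article", "author")}

same_article = interconnected({3, 4}, "article", pattern_edges, doc_labels, doc_edges)
different_articles = interconnected({3, 7}, "article", pattern_edges, doc_labels, doc_edges)
```

With this single pattern, the title and author of the same article are interconnected, while a title and an author taken from different articles are not.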
23Interconnection by Patterns
24Interconnection by Patterns
Interconnected
25Interconnection Semantics
- An interconnection semantics P is a set of patterns
- A set of objects is interconnected by P if it is interconnected by a pattern of P
Example: two patterns of the form ({title, name}, C), each with its own tree C shown as a figure on the slide
26A Framework of Semantics
- Various approaches for deriving semantic relationships can be represented by means of interconnection semantics
- Interconnection semantics can be given explicitly (e.g., by manually defining the patterns)
- Alternatively, interconnection semantics can be given implicitly (i.e., be derived) by defining conditions that characterize patterns
- Derived semantics are discussed next
27Contents
- Introduction
- Formal Framework
- Derived Semantics
- Computing Interconnectivity
- Undirected and Universal Semantics
- Conclusion and Future Work
28Derived Semantics
- In principle, an interconnection semantics can be given explicitly by authors (or users) of XML documents
- However, generating a semantics may be a cumbersome task
- Moreover, the semantics may be very large
We need semantics that are automatically derived
29Derived Semantics (cont.)
- A derived semantics is essentially a rule that states which subtrees of the schema appear in patterns
- A default schema can be obtained from the document
- One semantics that always captures our intuition is not likely to exist; therefore, it is of interest to explore several approaches for deriving semantics
- We explore a wide spectrum of specific derived semantics
- Our derived semantics demonstrate the strengths and weaknesses of different approaches for obtaining such semantics
30A Simple Example
title,publication
31The Interconnection Semantics Pall
- The semantics Pall generalizes the approach of Cohen et al., ICDT03 to graph (rather than tree) documents
- Given a schema S, the semantics Pall(S) is the set of all patterns (L,C), where C is a subtree of S
- Note that a set of objects is Pall(S)-interconnected if it is contained in a uniquely labeled subtree of the document
32The Subtrees of title,publication
33The Subtrees of title,author
34The Subtrees of title,author
?
Is this what we mean?
35The Interconnection Semantics Pmin
- The semantics Pall(S) may contain patterns that imply rather weak relationships
- We follow the convention that small trees imply strong relationships
- The semantics Pmin(S) contains, for each set L of labels, only the patterns (L,C) of Pall(S) such that C is of minimal size
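To make the relation between Pall(S) and Pmin(S) concrete, here is a brute-force Python sketch for a tiny schema. It assumes that "no redundant edges" means every leaf of C is in L and the root of C is either in L or has at least two children, and it measures size by the number of edges; both are readings of the slides, not statements from the paper. The schema below is hypothetical, and the exhaustive enumeration is exponential, so it only illustrates the definitions and is not the extraction algorithm from the proceedings.

```python
from itertools import combinations


def is_reduced_tree(edges, labels):
    """True if `edges` form a directed tree that contains `labels` with no redundant edges."""
    nodes = {n for e in edges for n in e}
    if not labels <= nodes or len(nodes) != len(edges) + 1:
        return False
    parent, children = {}, {}
    for a, b in edges:
        if b in parent:
            return False  # a node with two parents is not part of a tree
        parent[b] = a
        children.setdefault(a, []).append(b)
    roots = nodes - set(parent)
    if len(roots) != 1:
        return False
    root = next(iter(roots))
    # Every node must be reachable from the root (rules out stray cycles).
    seen, stack = {root}, [root]
    while stack:
        for c in children.get(stack.pop(), []):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    if seen != nodes:
        return False
    leaves = nodes - set(children)
    return leaves <= labels and (root in labels or len(children[root]) >= 2)


def patterns_all(schema_edges, labels):
    """Edge sets of all trees C such that (labels, C) is a pattern over the schema."""
    labels = frozenset(labels)
    found = [frozenset()] if len(labels) == 1 else []  # a single label is its own tree
    edges = sorted(schema_edges)
    for k in range(1, len(edges) + 1):
        found += [frozenset(s) for s in combinations(edges, k)
                  if is_reduced_tree(s, labels)]
    return found


def patterns_min(schema_edges, labels):
    """Keep only the patterns whose tree has the fewest edges."""
    candidates = patterns_all(schema_edges, labels)
    if not candidates:
        return []
    smallest = min(len(c) for c in candidates)
    return [c for c in candidates if len(c) == smallest]


# Hypothetical schema: authors appear under articles and directly under the department.
schema_edges = {("department", "publications"), ("department", "author"),
                ("publications", "article"), ("article", "title"),
                ("article", "author")}
all_patterns = patterns_all(schema_edges, {"title", "author"})
min_patterns = patterns_min(schema_edges, {"title", "author"})
```

For this schema, `all_patterns` contains two trees (one rooted at article, one rooted at department), and `min_patterns` keeps only the two-edge tree rooted at article.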
36The Subtrees of title,author
This subtree of the schema is minimal w.r.t.
title,author
37Pmin(S)-Interconnected title and author
This subtree of the document shows
Pmin(S)-interconnection!
38The Subtrees of title,author
This subtree of the schema is not minimal w.r.t.
title,author
39Pmin(S)-Interconnected title and author
This subtree does not show Pmin(S)-interconnection!
40What about these title and author?
41What about these title and author?
This subtree does not show Pmin(S)-interconnection
42What about these title and author?
This subtree does not show Pmin(S)-interconnection
because this subtree is smaller
43The Interconnection Semantics Puca
- We consider a different notion of minimality that
is based on structure - Consider an acyclic schema S and a pattern
p(L,C) in Pall(S) - Intuitively, p is structurally minimal if
internal nodes in C cannot be roots of trees
containing L - Formally, p is structurally minimal if only the
root of C is a common ancestor of L in S - The semantics Puca(S) is the set of all
structurally minimal patterns in Pall(S)
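The structural-minimality condition can be sketched in Python as follows. Two assumptions are made here: a label counts as its own ancestor (so that a root belonging to L is handled), and the condition is read as "among the nodes of C, only the root is a common ancestor of L in S". The function names and the schema are hypothetical.

```python
def ancestors_or_self(schema_edges, label):
    """All labels from which `label` is reachable in the schema, including itself."""
    parents = {}
    for a, b in schema_edges:
        parents.setdefault(b, set()).add(a)
    seen, stack = {label}, [label]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def is_structurally_minimal(schema_edges, labels, tree_root, tree_edges):
    """True if, among the nodes of C, only the root is a common ancestor of L in S."""
    common = set.intersection(*(ancestors_or_self(schema_edges, l) for l in labels))
    tree_nodes = {tree_root} | {n for e in tree_edges for n in e}
    return (common & tree_nodes) == {tree_root}


# Hypothetical acyclic schema.
schema_edges = {("department", "publications"), ("department", "author"),
                ("publications", "article"), ("article", "title"),
                ("article", "author")}
L = {"title", "author"}

small_tree = {("article", "title"), ("article", "author")}
large_tree = {("department", "author"), ("department", "publications"),
              ("publications", "article"), ("article", "title")}

small_is_minimal = is_structurally_minimal(schema_edges, L, "article", small_tree)
large_is_minimal = is_structurally_minimal(schema_edges, L, "department", large_tree)
```

The tree rooted at article is structurally minimal, whereas the tree rooted at department is not, because its internal nodes publications and article are also common ancestors of title and author.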
44Some Examples
45One Structurally Minimal Pattern
(title,author , )
In the schema, article is the only
common ancestor of title,author
46Another Structurally Minimal Pattern
(title,author , )
In the schema, inproc. is the only
common ancestor of title,author
47A Third Structurally Minimal Pattern
(title,author , )
In the schema, inproc. is the only
common ancestor of title,author
48Not a Structurally Minimal Pattern
(title,author , )
In the schema, department, publications and
inproc. are all common ancestors of
title,author
49Back to the Document
50Puca(S)-Interconnected title and author
This subtree shows Puca(S)-interconnection!
51Puca(S)-Interconnected title and author
This subtree shows Puca(S)-interconnection!
52Not Puca(S)-Interconnected
This subtree does not show Puca(S)-interconnection!
53Contents
- Introduction
- Formal Framework
- Derived Semantics
- Computing Interconnectivity
- Undirected and Universal Semantics
- Conclusion and Future Work
54Interconnectivity Problems
- Given an interconnection semantics, we are interested in solving two problems
- Determine whether a given set of objects is interconnected
- Arises in keyword search
- Generate all interconnected sets of objects having a given set of labels
- Arises in evaluation of queries that incorporate interconnectivity
55Computing Explicit Semantics
- When the interconnection semantics is given explicitly, there are highly efficient algorithms for solving the interconnectivity problems
- I.e., polynomial under query-and-data complexity
- In query evaluation, the first k results can be enumerated quickly
- I.e., evaluation in incremental polynomial time
- We can solve the interconnectivity problems by translating patterns into queries of standard languages, e.g., XQuery
- Thus, existing engines can be used
56Computing Derived Semantics
- There are two approaches for solving interconnectivity problems when the semantics is derived (implicit)
- Pattern extraction
- Direct computation (without extracting patterns)
57First Approach: Pattern Extraction
- This approach is basically a reduction to the case where interconnection semantics are given explicitly
- In particular, given an implicit semantics P and a set L of labels, we extract from the schema all patterns (L,C) of P
- Thus, we create an explicit representation of the implicit semantics and use the corresponding explicit-semantics algorithms
58Puca(S) Patterns for title,author
59Extracting the Patterns
- Given a set L of labels, we can extract the relevant patterns of our derived semantics very efficiently, when measuring the complexity in terms of the combined size of the input and the output
- I.e., with polynomial delay
- For Pall(S) and Puca(S), we can use the keyword-search algorithms given in Kimelfeld and Sagiv, DBPL05
- An algorithm for Pmin(S) is given in the proceedings
60Second Approach: Direct Computation
- There are highly efficient algorithms for evaluating the extracted patterns
- However, there may be a large (i.e., exponential) number of relevant patterns
- When the number of patterns is large, we are interested in solutions that compute interconnectivity directly, i.e., without extracting the patterns first
61Algorithms for Direct Computation
- There are efficient algorithms for direct computation of our derived semantics
- An exception is Pall(S) in cyclic schemas, which is intractable
- We use data complexity as a yardstick of efficiency, which is conventional in DB theory
- In some common cases, interconnectivity problems are more tractable
- I.e., there are efficient algorithms under query-and-data complexity (like tree join vs. general join)
- Usually, when either the schema is a tree or the document has no ID references
- Details are in the proceedings
62Contents
- Introduction
- Formal Framework
- Derived Semantics
- Computing Interconnectivity
- Undirected and Universal Semantics
- Conclusion and Future Work
63Undirected Relationships
An article was written by authors from two
different departments
64Undirected Relationships
65Undirected Relationships
66Undirected Patterns and Semantics
Undirected Patterns
- (L,C)
- L is a set of labels
- C is an undirected tree of labels
An undirected interconnection semantics consists
of undirected patterns
67Universal Interconnectivity
- Until now, an interconnection semantics was applied in an existential manner
- That is, a set of nodes is interconnected if there is evidence for interconnectivity (e.g., a uniquely labeled subtree that contains the given objects)
- When applying interconnection semantics in a universal manner, we consider all subtrees of the document (contexts) that contain the given objects
- A set of objects is universally interconnected if every such subtree contains evidence for interconnectivity
68An Example
69An Example
This subtree shows interconnectivity
70An Example
The two objects are NOT universally
interconnected!
This subtree does not show interconnectivity
71Derived Semantics
- We considered two undirected derived semantics
- Pu-all(S): all undirected patterns of a schema S
- Pu-min(S): all minimal patterns in Pu-all(S)
- We considered the universal versions of the semantics Pall(S) and Pu-all(S)
- Usually, these derived semantics can be computed efficiently
- ∀Pall(S) is harder than Pall(S) and ∀Pu-all(S) is easier
- However, completely different proof techniques are required
- More details, including algorithms and complexity results, are in the proceedings
72Contents
- Introduction
- Formal Framework
- Derived Semantics
- Computing Interconnectivity
- Undirected and Universal Semantics
- Conclusion and Future Work
73Interconnection Semantics
- A framework for determining meaningful relationships among XML objects
- Interconnection semantics can be either explicitly constructed or automatically (implicitly) derived
- Users can tweak implicit semantics by adding specific patterns
- The framework enables many different types of semantics, with different strengths
74Containments among our Derived Semantics
75Complexity
- Explicit semantics can be computed very efficiently
- Under reasonable conditions, our derived semantics can also be efficiently computed
- Different semantics require totally different techniques
- Two computation approaches
- Pattern extraction
- Direct computation
76Future Work
- Experimentation
- Which semantics works better in practical scenarios?
- As a result of the experimentation, discover new (better) semantics
- Information Retrieval
- Combine our approach with IR techniques for ranking XML fragments
- Richer Data Model
- That is, richer schema and document models