Title: XSEarch: A Semantic Search Engine for XML
1XSEarch A Semantic Search Engine for XML
- Sara Cohen, Jonathan Mamou, Yaron Kanza, Yehoshua Sagiv
- The Hebrew University of Jerusalem
- Presented by
- Deniz Kasap Sarp Baran Özkan
2XSEarch an XML Search Engine
- Goal
- Find the relevant XML fragments,
- given tag names and keywords
3Introduction
- It is becoming increasingly popular to publish data on the Web in the form of XML documents.
- Current search engines, which are an indispensable tool for finding HTML documents, have two main drawbacks when it comes to searching for XML documents.
- It is not possible to pose queries that explicitly refer to XML tags.
- Search engines return references (i.e., links) to documents and not specific fragments thereof. This is problematic, since large XML documents may contain thousands of elements storing many pieces of information that are not necessarily related to each other.
4Excerpt from the XML Version of DBLP
<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>
5A Search Example
- Find papers by Vianu on the topic of
- logical databases
How can we find such papers?
6Attempt 1 Standard Search Engine
A document containing some of the three query
terms is considered as a result.
7The document is not relevant to the query. This
does not work!!!
<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>
8- Since a reference to a whole XML document is usually not a useful answer, the granularity of the search should be refined.
- Instead of returning an entire document, an XML search engine should return fragments of XML documents.
9-
- A query language for XML, such as XQuery, can be used to extract data from XML documents.
- However, such a query language is not an alternative to an XML search engine for several reasons.
- The syntax of XQuery is more complicated than the syntax of a standard search query. Hence, it is not appropriate for a naive user.
- Extensive knowledge of the document structure is required in order to correctly formulate a query. Thus, queries must be formulated on a per-document basis.
- XQuery lacks any mechanism for ranking answers.
10Attempt 2 XML Query Language
FOR $i IN document("bib.xml")//inproceedings
WHERE $i/author contains "Vianu"
  AND $i/title contains "Logical"
  AND $i/title contains "Databases"
RETURN <result>
         <author> $i/author </author>
         <title> $i/title </title>
       </result>
This does work, BUT
- Complicated syntax
- Extensive knowledge of the document structure required to write the query
- No mechanism for ranking results
11Our Requirements from the Search Tool
- A simple syntax that can be used by naive users
- Search results should include XML fragments and not necessarily full documents
- The XML fragments in an answer should be semantically related
- For example, a paper and an author should be in an answer only if the paper was written by this author
- Search results should be ranked
- Search results should be returned in reasonable time
12- The design and implementation of XSEarch involved several challenges.
- Its syntax is suitable for a naive user.
- The theoretical results were adapted so that XSEarch always returns semantically related fragments as answers.
- Answers are highly relevant to the keywords of the query.
- A suitable ranking mechanism that takes into account both the degree of the semantic relationship and the relevance of the keywords has been developed.
- Index structures and evaluation algorithms that allow the system to deal efficiently with large documents have been developed.
- The implementation of XSEarch is extensible in the sense that it can easily accommodate different types of semantic relationships.
13Query Syntax
- The query language of a standard search engine is simply a list of keywords.
- Keywords with a plus (+) sign must appear in a satisfying document, whereas keywords without a plus sign may or may not appear in a satisfying document (but the appearance of such keywords is desirable).
14- The query language of XSEarch is a simple extension of the language described above. In addition to keywords, it allows the user to specify labels and keyword-label combinations that must or may appear in a satisfying document.
- A search term may have a plus sign prepended, in which case it is a required term. Otherwise, it is an optional term.
- We use t, t1, t2, etc., as an abstract notation for required and optional terms.
- A query has the form Q(S), where S = t1, ..., tm is a sequence of required and optional search terms.
15- Formally, a search term has the form
- l:k, l: or :k
- where
- l is a label and k is a keyword.
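The term syntax can be sketched in Python as follows. This is only an illustration, not the XSEarch parser: it assumes the colon-separated notation l:k, l: and :k for the three forms, a leading "+" for required terms, and whitespace-separated terms. The names SearchTerm, parse_term and parse_query are hypothetical.

```python
from typing import NamedTuple, Optional


class SearchTerm(NamedTuple):
    label: Optional[str]    # l, or None for a keyword-only term (:k)
    keyword: Optional[str]  # k, or None for a label-only term (l:)
    required: bool          # True when the term is prefixed with '+'


def parse_term(text: str) -> SearchTerm:
    """Parse one term of the form l:k, l: or :k, optionally prefixed by '+'."""
    required = text.startswith("+")
    body = text[1:] if required else text
    label, sep, keyword = body.partition(":")
    # A term needs a colon and at least one of label / keyword.
    if not sep or (not label and not keyword):
        raise ValueError(f"not a search term: {text!r}")
    return SearchTerm(label or None, keyword or None, required)


def parse_query(query: str) -> list[SearchTerm]:
    """Split a query on whitespace and parse every term."""
    return [parse_term(token) for token in query.split()]


for term in parse_query("+title: author:Vianu :logical"):
    print(term)
```

Keeping the required flag on the term itself means later stages can decide per term whether a null value is acceptable.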
16Example
- Find papers by Vianu on the topic of logical
databases
logical database inproceedings: author:Vianu
Note that the different document fragments
matching these query terms must be semantically
related
17Query Semantics
- This section presents the semantics of our queries.
- In order to satisfy a query Q, each of the required terms in Q must be satisfied.
- In addition, the elements satisfying Q must be meaningfully related.
18XSEarch
author:Vianu title:
<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>

<author>Victor Vianu</author>
<title>A Web Odyssey: From Codd to XML</title>
Good Result! title and author elements ARE
semantically related
19XSEarch
author:Vianu title:
<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>

<title>Querying Logical Databases</title>
<author>Victor Vianu</author>
Bad Result! title and author elements ARE NOT
semantically related
20Satisfaction of a Search Term
- XML documents are modeled as trees in the standard fashion.
- Each interior node is associated with a label and each leaf node is associated with a sequence of keywords.
- If k is a keyword in the sequence associated with a leaf node n, we say that n contains k.
- Figure 1 shows a tree that represents a small portion of the Sigmod Record.
- We will refer to this tree as Tsr.
21(No Transcript)
22- Let n be an interior node in a tree T.
- We say that n satisfies the search term
- l:k if n is labeled with l and has a descendant that contains the keyword k.
- l: if n is labeled with l.
- :k if n has a leaf child that contains the keyword k.
- Example
- In the tree Tsr,
- node number 14 satisfies :Kempster
- node number 9 satisfies authors:Kempster.
- node number 9 does not satisfy :Kempster, position: or :position.
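The satisfaction rules can be sketched as a small Python function. This is a sketch under assumptions, not the XSEarch code: the Node class and the function names are hypothetical, leaves are represented as nodes without a label, keyword matching is exact, and the tiny tree below only mirrors the shape of the Kempster example rather than reproducing Tsr.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    label: Optional[str] = None   # set for interior nodes, None for leaves
    keywords: tuple = ()          # keyword sequence of a leaf
    children: list = field(default_factory=list)


def leaf_descendants(node):
    """Yield every leaf below an interior node."""
    for child in node.children:
        if child.label is None:
            yield child
        else:
            yield from leaf_descendants(child)


def satisfies(node, label=None, keyword=None):
    """True if interior node `node` satisfies the term label:keyword."""
    if label is not None and node.label != label:
        return False
    if keyword is None:
        return label is not None              # term "l:"
    if label is not None:
        leaves = leaf_descendants(node)       # term "l:k": any descendant leaf
    else:
        leaves = (c for c in node.children if c.label is None)  # term ":k"
    return any(keyword in leaf.keywords for leaf in leaves)


author = Node("author", children=[Node(keywords=("Kempster",))])
authors = Node("authors", children=[author])
print(satisfies(author, keyword="Kempster"))
print(satisfies(authors, label="authors", keyword="Kempster"))
print(satisfies(authors, keyword="Kempster"))
```

The asymmetry is deliberate: a keyword-only term looks at leaf children only, while a label-keyword term may reach any descendant leaf.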
23Meaningfully Related Sets of Nodes
- Let T be a tree and R be a binary, reflexive and symmetric relationship on the nodes in T.
- We assume that R contains pairs of nodes that are meaningfully related.
- We present two different ways to extend R to arbitrary sets of nodes.
24- A set of nodes N is all-pairs R-related if (n1, n2) is in R for every pair of nodes n1, n2 in N.
- This states that a set of nodes is meaningfully related if every pair of nodes in the set is meaningfully related.
- N is star R-related if there is a node n* ∈ N such that the pair (n*, n) is in R for all nodes n ∈ N.
- This states that the nodes of a set are meaningfully related if all these nodes are meaningfully related to a node in the set.
- Depending on the structure of the documents, either the all-pairs relationship or the star relationship may be more appropriate.
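The two extensions of R to sets can be sketched in a few lines of Python. The function names and the toy relation are assumptions for illustration; R is passed in as a predicate `related(a, b)` that is expected to be reflexive and symmetric.

```python
from itertools import combinations


def all_pairs_related(nodes, related):
    """Every pair of nodes in the set must be in R."""
    return all(related(a, b) for a, b in combinations(nodes, 2))


def star_related(nodes, related):
    """Some node in the set must be related to every node in the set."""
    return any(all(related(center, n) for n in nodes) for center in nodes)


# Toy relation: a is related to b and to c, but b and c are unrelated.
R = {frozenset("ab"), frozenset("ac")}


def related(x, y):
    return x == y or frozenset((x, y)) in R


print(all_pairs_related(["a", "b", "c"], related))
print(star_related(["a", "b", "c"], related))
```

For the toy relation the set {a, b, c} is star related (with a as the center) but not all-pairs related, which shows that the star variant is the weaker condition.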
25Query Answers
- Let Q(t1, ..., tm) be a query.
- A sequence N = n1, ..., nm of nodes and null values is an all-pairs R-answer for Q if the nodes in N are all-pairs R-related and, for all 1 ≤ i ≤ m,
- ni is not the null value if ti is a required term
- ni satisfies ti if it is not the null value.
- Similarly, N is a star R-answer when the nodes in N are star R-related.
26- We use
- Ans_a,R(Q) to denote the set of all-pairs R-answers for the query Q over a tree T,
- Ans_s,R(Q) to denote the set of star R-answers for Q over T, and
- MaxAns_a,R(Q) to denote the set of maximal answers in Ans_a,R(Q).
27The Interconnection Relationship
- We present a relation which can be used to determine whether a pair of nodes is meaningfully related.
- Let T be a tree and let n1 and n2 be nodes in T.
- The shortest undirected path between n1 and n2 consists of the paths from the lowest common ancestor of n1 and n2 to n1 and n2.
28- We denote the tree consisting of these two paths as T_{n1,n2}.
- This tree describes the relationship between the nodes n1 and n2.
- For example, in Tsr, the tree T_{8,13} consists of the nodes 7, 8, 9, 12 and 13.
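A relationship tree for two nodes can be computed from parent pointers alone, as the following Python sketch shows. The function names are hypothetical, and the parent map is an assumption: it is the only shape consistent with the node list 7, 8, 9, 12, 13 given in the example, not a transcription of Tsr.

```python
def path_to_root(node, parent):
    """Return the list [node, parent(node), ..., root]."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def relationship_tree(n1, n2, parent):
    """Return the lowest common ancestor and the nodes on the two paths to it."""
    up1 = path_to_root(n1, parent)
    up2 = path_to_root(n2, parent)
    on_up2 = set(up2)
    # First ancestor of n1 (starting at n1 itself) that is also above n2.
    lca = next(a for a in up1 if a in on_up2)
    nodes = set(up1[: up1.index(lca) + 1]) | set(up2[: up2.index(lca) + 1])
    return lca, nodes


# child -> parent; node 7 is treated as the root of this fragment.
parent = {8: 7, 9: 7, 12: 9, 13: 12}
lca, nodes = relationship_tree(8, 13, parent)
print(lca, sorted(nodes))  # 7 [7, 8, 9, 12, 13]
```

Both nodes are assumed to belong to the same tree; otherwise no common ancestor exists and the lookup fails.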
29Relationship Trees
(Figure: the relationship tree of n1, n2, ..., nk, rooted at their lowest common ancestor)
30Our Semantic Relation Interconnection
- n1, ..., nk are interconnected if either
- the relationship tree of n1, ..., nk does not contain two nodes with the same label, or
- the only nodes with the same label in the relationship tree of n1, ..., nk are among n1, ..., nk
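For a pair of nodes, the interconnection test can be sketched directly from this definition. The code below is an illustrative Python version, not the indexed algorithm used by the system: it walks parent pointers on every call, the function names are hypothetical, and the node ids and labels form a made-up tree in the style of the DBLP excerpt.

```python
from collections import Counter


def path_to_root(node, parent):
    """Return the list [node, parent(node), ..., root]."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def interconnected(n1, n2, parent, label):
    """True if no label repeats in the relationship tree, except between n1 and n2."""
    up1, up2 = path_to_root(n1, parent), path_to_root(n2, parent)
    on_up2 = set(up2)
    lca = next(a for a in up1 if a in on_up2)
    tree = set(up1[: up1.index(lca) + 1]) | set(up2[: up2.index(lca) + 1])
    counts = Counter(label[n] for n in tree)
    # Any node other than n1 and n2 must carry a label that is unique in the tree.
    return all(counts[label[n]] == 1 for n in tree - {n1, n2})


parent = {1: 0, 2: 1, 3: 1, 4: 0, 5: 4, 6: 4, 7: 4}
label = {0: "proceedings", 1: "inproceedings", 2: "author", 3: "title",
         4: "inproceedings", 5: "author", 6: "title", 7: "author"}

print(interconnected(2, 3, parent, label))  # author and title of the same paper
print(interconnected(3, 5, parent, label))  # title and author of different papers
print(interconnected(5, 7, parent, label))  # two authors of the same paper
```

The second call fails because the relationship tree passes through two inproceedings nodes; the third succeeds because the only repeated label belongs to the two endpoints themselves.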
31Example (1)
Lowest common ancestor of circled nodes
Relationship tree
Circled nodes belong to different inproceedings
entities. They ARE NOT interconnected!
32Example (2)
Lowest common ancestor of circled nodes
Relationship tree
Circled nodes belong to the same inproceedings
entity. They ARE interconnected!
33Example (3)
(Figure: a proceedings element with two inproceedings children. The elements shown are title and author, with the values Moshe Y. Vardi, Victor Vianu, Serge Abiteboul, "Queries and Computation on the Web" and "Querying Logical Databases". The lowest common ancestor of the circled nodes and the relationship tree are marked.)
Circled nodes belong to the same inproceedings
entity, but are labeled with the same tag. They
ARE interconnected.
34Example 1 of Query Semantics
- Consider the query Q1 defined as
- Q1(+title:, author:).
- The query Q1 finds pairs of titles and authors belonging to the same article.
- Only tuples where the title is non-null will be returned.
- The answers created for Tsr are
- (8,10), (8,12), (8,14), (17,18) and (25, ⊥), where ⊥ denotes the null value
35Example 2 of Query Semantics
- The answers for Q1 over this document would consist of
- (6,3) and (6,4)
36Query Processing
- Document fragments are extracted using the interconnection index and other indices
- Extracted fragments are returned ranked by the estimated relevance
37Ranker
38Ranking Factors
- Several factors increase the rank of a result
- Similarity between query and result
- Weight of labels appearing in the result
- Characteristics of result tree
39Query and Result Similarity
- TFILF
- Extension of TFIDF, classical in IR
- Term Frequency: number of occurrences of a query term in a fragment
- Inverse Leaf Frequency: based on the number of leaves containing a query term relative to the number of leaves in the corpus
40TFILF
- Term frequency of keyword k in a leaf node nl
- Inverse leaf frequency
TFILF is the product of tf and ilf
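The slide formulas were not transcribed, so the following Python sketch fills them in with assumptions borrowed from classical TF-IDF: term frequency is normalized by the most frequent keyword in the leaf, and inverse leaf frequency is log(1 + total leaves / leaves containing the keyword). These normalizations and all function names are illustrative choices, not necessarily the ones used by XSEarch.

```python
import math
from collections import Counter


def tf(keyword, leaf_words):
    """Occurrences of keyword in a leaf, scaled by the leaf's most frequent word (assumed)."""
    counts = Counter(leaf_words)
    return counts[keyword] / max(counts.values()) if counts else 0.0


def ilf(keyword, leaves):
    """Assumed form: log(1 + total leaves / leaves that contain the keyword)."""
    containing = sum(1 for words in leaves if keyword in words)
    return math.log(1 + len(leaves) / containing) if containing else 0.0


def tfilf(keyword, leaf_words, leaves):
    """Product of tf and ilf for one keyword in one leaf."""
    return tf(keyword, leaf_words) * ilf(keyword, leaves)


leaves = [["xml", "search", "xml"], ["logical", "databases"], ["xml"]]
print(tfilf("xml", leaves[0], leaves))
print(tfilf("logical", leaves[1], leaves))
```

With these choices a keyword that occurs in few leaves, such as "logical" above, gets a higher ilf than one that occurs in most leaves.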
41Weight of Labels
- Some labels are considered more important than others
- Text under an element labeled with title is more important than text under an element labeled with section
- Label weights can be
- system generated
- user defined
42Relationship between Nodes
- Size of the relationship tree: a small fragment indicates that its nodes are closer, and thus probably more related
article: title:XML
43Relationship between Nodes
- Ancestor-descendant relationships: an ancestor-descendant relationship between a pair of nodes in a fragment indicates a strong relation between these nodes
section: title:XML
44Combining the Factors
- Given a query Q and an answer N, we use the measures
- sim(Q,N),
- tsize(N)
- and anc-des(N)
- to determine the ranking of the answer. We experimented with the following combination of factors by varying the values of α, β and γ:
- sim(Q,N)^α / tsize(N)^β × (1 + γ × anc-des(N))
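Reading the combination as sim(Q,N)^α / tsize(N)^β × (1 + γ × anc-des(N)), the score can be sketched in Python as below. The function name `rank`, the default exponents and the sample numbers are placeholders for illustration; they are not the tuned values from the experiments, and tsize is assumed to be at least 1.

```python
def rank(sim, tsize, anc_des, alpha=1.0, beta=1.0, gamma=1.0):
    """Combine similarity, relationship-tree size and ancestor-descendant count."""
    return (sim ** alpha) / (tsize ** beta) * (1 + gamma * anc_des)


# Hypothetical answers: (name, sim, tsize, anc_des)
answers = [("A", 0.8, 4, 1), ("B", 0.8, 2, 0), ("C", 0.5, 2, 0)]
ordered = sorted(answers, key=lambda a: rank(*a[1:], gamma=0.5), reverse=True)
print([name for name, *_ in ordered])
```

A larger β penalizes big relationship trees more strongly, while γ controls how much an ancestor-descendant pair boosts the score.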
45System Implementation
- The architecture of the XSEarch system is
depicted in the following figure
46(No Transcript)
47- The basic flow of information is as follows
- The user enters a query using a browser.
- The Search-Query Processor parses the query into a list of search terms.
- The Index Repository is used to find nodes that satisfy the search terms and to find whether pairs of nodes are interconnected.
- It responds by checking the stored indices.
- If these indices do not contain sufficient information, the Indexer is used to augment the current indices.
- Once the relevant information is returned to the Search-Query Processor, it creates the answers, which are ranked, sorted and then returned.
- The Indexer creates several different indices in the Index Repository based on a set of XML documents.
48- We focus on the most important and novel index structures
- The interconnection index
- The path index
- The interconnection index allows for rapid checking of the interconnection relationship.
- The path index allows us to create answers with higher estimated ranking first.
49Dynamic Offline Interconnection Indexing
- Checking for interconnection of nodes online is expensive.
- Hence, we first decided to create a node-interconnection index that would store information about the interconnection relationship between each pair of nodes.
- This requires solving the following problem
- Given a document T, for all pairs of nodes n and n′ in T, determine whether n and n′ are interconnected.
- The algorithm that solves this problem is based on the following lemma
50- Lemma (Interconnection Characterization)
- Let T be a document and let n and n′ be nodes in T.
- If n is an ancestor of n′, then n and n′ are interconnected if and only if the following hold
- The parent of n′ is strongly-interconnected with n
- The child of n on the path to n′ is strongly-interconnected with n′.
- If n is not an ancestor of n′ and n′ is not an ancestor of n, then n and n′ are interconnected if and only if the following hold
- The parent of n is strongly-interconnected with n′
- The parent of n′ is strongly-interconnected with n.
51- In the XSEarch system, we have explored the possibilities of storing the node-interconnection index in either a hashtable or a symmetric matrix.
- When implemented as a hashtable, the node-interconnection index contains pairs of ids of interconnected nodes.
- When implemented as a symmetric matrix, the node-interconnection index contains a boolean value for each pair of nodes, indicating whether they are interconnected or not.
- A comparison of the time and space efficiency of these structures is presented with the experimental results.
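The two storage options can be contrasted with a small Python sketch. Both classes are hypothetical and assume nodes are numbered 0 to n-1; the hash variant keeps only interconnected pairs, while the matrix variant keeps one flag per unordered pair in a flat lower-triangular array.

```python
class HashIndex:
    """Stores only the interconnected pairs, as normalized (low, high) tuples."""

    def __init__(self):
        self._pairs = set()

    def add(self, a, b):
        self._pairs.add((min(a, b), max(a, b)))

    def interconnected(self, a, b):
        return a == b or (min(a, b), max(a, b)) in self._pairs


class MatrixIndex:
    """Lower triangle of a symmetric boolean matrix in one flat bytearray."""

    def __init__(self, n):
        self._flags = bytearray(n * (n + 1) // 2)

    def _pos(self, a, b):
        i, j = max(a, b), min(a, b)
        return i * (i + 1) // 2 + j   # row-major position inside the triangle

    def add(self, a, b):
        self._flags[self._pos(a, b)] = 1

    def interconnected(self, a, b):
        return a == b or bool(self._flags[self._pos(a, b)])


for index in (HashIndex(), MatrixIndex(5)):
    index.add(3, 1)
    print(type(index).__name__, index.interconnected(1, 3), index.interconnected(2, 4))
```

The hash variant grows with the number of interconnected pairs, whereas the matrix variant always needs space quadratic in the number of nodes but avoids hashing on every lookup.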
52Dynamic Online Interconnection Indexing
- Offline computation of the node-interconnection index may be expensive.
- In order to amortize the cost of computing this index over the queries received, we have also considered an online indexing method.
- When indexing online, for each pair of nodes n and n′, we compute the section of the node-interconnection index corresponding to T_{n,n′}
53- We use a hashtable to store the part of the index that has already been computed at any given moment.
- The hashtable contains a boolean value for each pair of nodes whose interconnection has already been checked.
- The boolean value indicates whether the nodes are interconnected or not.
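The online scheme amounts to memoizing pairwise checks in a hashtable. The Python sketch below shows that idea in its simplest form; it is a simplification that caches one pair at a time rather than a whole section of the index, and the class name, the `check` callback and the `computed` counter are hypothetical.

```python
class OnlineInterconnectionIndex:
    """Caches the result of an expensive pairwise check the first time it is needed."""

    def __init__(self, check):
        self._check = check   # expensive test supplied by the caller
        self._cache = {}      # (low_id, high_id) -> bool
        self.computed = 0     # how many pairs were actually computed

    def interconnected(self, a, b):
        key = (a, b) if a <= b else (b, a)   # the relation is symmetric
        if key not in self._cache:
            self._cache[key] = self._check(*key)
            self.computed += 1
        return self._cache[key]


# Stand-in for the real test, used only to make the sketch runnable.
index = OnlineInterconnectionIndex(lambda a, b: (a + b) % 2 == 0)
index.interconnected(1, 3)
index.interconnected(3, 1)   # served from the cache
index.interconnected(1, 2)
print(index.computed)        # 2
```

Because only the pairs touched by queries are ever stored, the table stays small when queries keep returning to the same parts of a document.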
54- During query processing, usually only a small part of the node-interconnection index will be created, thus the slowdown in response time is not large.
- In addition, queries tend to be similar in the parts of the document that they must access.
- Therefore, even after many queries have been evaluated, it is likely for the node-interconnection index to be only partially computed.
- This speeds up execution time when loading the index into main memory.
55Experimental Results
56Hardware and Software Used
- Language: Java
- Processor: 1.6 GHz Pentium 4
- RAM: 2 GB (limited to 1.46 GB by JVM)
- OS: Windows XP
57Interconnection Index
- Is built offline
- Allows for checking interconnection between two nodes, during query processing, in O(1) time
- We have two implementations
- as a hash table
- as a symmetric matrix
- The Indexer is responsible for building the
Interconnection Index
58Choosing the Implementation for the
Interconnection Index
- We have experimented with the two implementations of the interconnection index
- 1. IIH: the index is a hash table
- 2. IIM: the index is a symmetric matrix
- We compare the two implementations
- Cost of building the index
- Cost of query processing, i.e., using the index
59Time For Building Indices
- Both implementations are reasonable
- IIM is better than IIH, because of the additional
overhead of hashing
60On the Fly Indexing (OFI)
- Fully building the indices as a preprocess of querying is expensive in memory for huge corpuses!
- Also expensive in time because of the additional overhead of using virtual memory
- Instead, compute the interconnection index incrementally, on the fly, during query processing, for each pair that must be checked
- By how much will query processing be slowed down?
61Time For Building Indices Comparing IIH, IIM, OFI
For these corpuses, OFI time is less than 10 ms.
Actually it is the time to build all the indices
other than the interconnection index.
62(No Transcript)
63Query Execution Time
- We generated 1000 random queries for the Sigmod Record corpus
- Each query had
- At most 3 optional search terms
- At most 3 required search terms
- We checked time with IIH, IIM and OFI
64IIH/IIM Query Processing Time
- Note: logarithmic scale
- Both approaches lead to similar results
- Average run time for queries: 35 ms
65OFI Query Processing Time
- After processing the 1000 queries, 0.75 of all pairs of nodes were checked for interconnection.
- Space saved in main memory
Slowdown in response time not too large! Locality property: queries tend to be similar in the parts of the document that they may access
- More than 50% of the queries processed in under 10 ms
66How Good are the Results?
- We measured recall and precision for the query
- Find papers written by Buneman that contain the keyword database in the title
- We tried two different queries that reflect different amounts of user knowledge
- Kw: Buneman database (classical search engine query)
- Tag-kw: author:Buneman title:database
- Corpus: Sigmod, DBLP
67Precision and Recall
- We computed the "correct answers" using XQuery
- Recall
- Perfect recall, i.e., XSEarch returns all the correct answers
- Precision at n
68Precision at 5, 10 and 20
Sigmod: perfect precision. DBLP: 0.8/0.9 for the query containing only keywords
Combining tags and keywords leads to perfect
precision
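Recall and precision at n can be computed from a ranked answer list and a set of correct answers, as in the Python sketch below. The function names and the sample data are made up for illustration and are not the experimental results.

```python
def precision_at(n, ranked, correct):
    """Fraction of the top-n ranked answers that are correct."""
    top = ranked[:n]
    return sum(1 for answer in top if answer in correct) / len(top) if top else 0.0


def recall(ranked, correct):
    """Fraction of the correct answers that appear anywhere in the result list."""
    return len(set(ranked) & correct) / len(correct) if correct else 1.0


ranked = ["a", "x", "b", "c", "y"]   # hypothetical ranked answers
correct = {"a", "b", "c"}            # hypothetical ground truth
print(precision_at(2, ranked, correct), precision_at(5, ranked, correct))
print(recall(ranked, correct))
```

Precision at n rewards rankers that put correct answers early, which is why it is reported at several cut-offs.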
69Related Work
- Numerous query languages for XML have been developed.
- For example, the XQuery working group is considering how to add full-text search features and ranking to XQuery. Such capabilities have already been added to various XML query languages. But these languages are not suitable for naive users, since the query syntax is always complex.
- A recent related work is the XRANK system for keyword searching in XML documents.
70Conclusions
- The main contribution of this paper is in laying the foundations for a semantic search engine over XML documents.
- XSEarch returns semantically related fragments, ranked by estimated relevance.
- This system is extensible and can easily accommodate different types of relationships between nodes.
- We have shown that it is possible to combine these qualities with an efficient, scalable and modular system.
- Thus, XSEarch can be seen as a general framework for semantic searching in XML documents.
71- Efficient index structures
- IIM/IIH for small documents
- OFI for big documents
- Efficient evaluation algorithms
- Dynamic algorithm for computing interconnection
- Extensible implementation
- The system can easily accommodate different types
of semantic relations between nodes, other than
interconnection