XSEarch: A Semantic Search Engine for XML

1
XSEarch: A Semantic Search Engine for XML

Sara Cohen, Jonathan Mamou, Yaron Kanza, Yehoshua Sagiv
The Hebrew University of Jerusalem
Presented by
Deniz Kasap Sarp Baran Özkan

2
XSEarch: an XML Search Engine

Goal
Find the relevant XML fragments,
given tag names and keywords

3
Introduction

It is becoming increasingly popular to publish
data on the Web in the form of XML documents.
Current search engines, which are an
indispensable tool for finding HTML documents,
have two main drawbacks when it comes to
searching for XML documents.
It is not possible to pose queries that
explicitly refer to XML tags.
Search engines return references (i.e. links) to
documents and not specific fragments thereof.
This is problematic, since large XML documents
may contain thousands of elements storing many
pieces of information that are not necessarily
related to each other.

4
Excerpt from the XML Version of DBLP

<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>

5
A Search Example

Find papers by Vianu on the topic of
logical databases

How can we find such papers?
6
Attempt 1: Standard Search Engine
A document containing some of the three query
terms is considered as a result.
7
The document is not relevant to the query. This
does not work!!!

<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>

Since a reference to a whole XML document is usually not a useful answer, the granularity of the search should be refined.
Instead of returning an entire document, an XML search engine should return fragments of XML documents.

A query language for XML, such as XQuery, can be used to extract data from XML documents.
However, such a query language is not an alternative to an XML search engine, for several reasons:
The syntax of XQuery is more complicated than the syntax of a standard search query. Hence, it is not appropriate for a naive user.
Extensive knowledge of the document structure is required in order to formulate a query correctly. Thus, queries must be formulated on a per-document basis.
XQuery lacks any mechanism for ranking answers.

10
Attempt 2: XML Query Language

FOR $i IN document("bib.xml")//inproceedings
WHERE $i/author contains 'Vianu'
  AND $i/title contains 'Logical'
  AND $i/title contains 'Databases'
RETURN <result>
         <author> $i/author </author>
         <title> $i/title </title>
       </result>

This does work, BUT

Complicated syntax
Extensive knowledge of the document structure
required to write the query
No mechanism for ranking results

11
Our Requirements from the Search Tool

A simple syntax that can be used by naive users
Search results should include XML fragments and
not necessarily full documents
The XML fragments in an answer should be
semantically related
For example, a paper and an author should be in
an answer only if the paper was written by this
author
Search results should be ranked
Search results should be returned in reasonable
time

The design and implementation of XSEarch involved several challenges.
A syntax suitable for a naive user was developed.
The theoretical results were adapted so that XSEarch always returns answers that are highly relevant to the keywords of the query.
A suitable ranking mechanism, which takes into account both the degree of the semantic relationship and the relevance of the keywords, has been developed.
Index structures and evaluation algorithms that allow the system to deal efficiently with large documents have been developed.
The implementation of XSEarch is extensible in the sense that it can easily accommodate different types of semantic relationships.

13
Query Syntax

The query language of a standard search engine is simply a list of keywords.
Keywords with a plus (+) sign must appear in a satisfying document, whereas keywords without a plus sign may or may not appear in a satisfying document (but the appearance of such keywords is desirable).

The query language of XSEarch is a simple extension of the language described above. In addition to keywords, it allows the user to specify labels and keyword-label combinations that must or may appear in a satisfying document.
A search term may have a plus sign prepended, in which case it is a required term. Otherwise, it is an optional term.
We use t, t1, t2, etc., as an abstract notation for required and optional terms.
A query has the form Q(S), where S = t1,...,tm is a sequence of required and optional search terms.

Formally, a search term has the form
l:k, l:, or :k
where
l is a label and k is a keyword.
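
The term forms above can be sketched as a small parser. This is only an illustration, not the XSEarch implementation: the names SearchTerm, parse_term and parse_query are hypothetical, and the sketch assumes that terms are separated by whitespace, that a colon separates label from keyword, and that a bare word with no colon is treated as a keyword.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SearchTerm:
    label: Optional[str]    # None when the term is keyword-only (":k")
    keyword: Optional[str]  # None when the term is label-only ("l:")
    required: bool          # True when the term had a leading "+"


def parse_term(text: str) -> SearchTerm:
    required = text.startswith("+")
    body = text[1:] if required else text
    if ":" in body:
        label, keyword = body.split(":", 1)
    else:
        # Assumption: a bare word is shorthand for a keyword-only term.
        label, keyword = "", body
    if not label and not keyword:
        raise ValueError(f"empty search term: {text!r}")
    return SearchTerm(label or None, keyword or None, required)


def parse_query(query: str) -> List[SearchTerm]:
    """Split a query string into its sequence of search terms t1,...,tm."""
    return [parse_term(t) for t in query.split()]
```

For instance, parse_query("+author:Vianu title:") yields one required label-keyword term followed by one optional label-only term.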

16
Example

Find papers by Vianu on the topic of logical
databases

logical database inproceedings: author:Vianu

Note that the different document fragments matching these query terms must be semantically related
17
Query Semantics

This section presents the semantics of our
queries.
In order to satisfy a query Q, each of the
required terms in Q must be satisfied.
In addition, the elements satisfying Q must be
meaningfully related.

18
XSEarch
author:Vianu title:

<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>

<author>Victor Vianu</author>
<title>A Web Odyssey: From Codd to XML</title>
Good Result! title and author elements ARE
semantically related
19
XSEarch
author:Vianu title:

<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>

<title>Querying Logical Databases</title>
<author>Victor Vianu</author>
Bad Result! title and author elements ARE NOT
semantically related
20
Satisfaction of a Search Term

XML documents are modeled as trees in the standard fashion.
Each interior node is associated with a label, and each leaf node is associated with a sequence of keywords.
If k is a keyword in the sequence associated with a leaf node n, we say that n contains k.
Figure 1 shows a tree that represents a small portion of the Sigmod Record.
We will refer to this tree as Tsr.

21
(No Transcript)
22

Let n be an interior node in a tree T.
We say that n satisfies the search term
l:k if n is labeled with l and has a descendant that contains the keyword k,
l: if n is labeled with l,
:k if n has a leaf child that contains the keyword k.
Example
In the tree Tsr,
node number 14 satisfies :Kempster,
node number 9 satisfies authors:Kempster,
node number 9 does not satisfy :Kempster, position:, or :position.
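
The three satisfaction rules can be sketched as a single check over a tree. The Node type and the satisfies function below are hypothetical helpers written for illustration; the sketch assumes that a node without children is a leaf carrying keywords, and that an absent label or keyword is passed as None.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    label: Optional[str] = None                         # set on interior nodes
    keywords: List[str] = field(default_factory=list)   # set on leaf nodes
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaf_descendants(n: Node) -> Iterator[Node]:
    for child in n.children:
        if child.is_leaf():
            yield child
        else:
            yield from leaf_descendants(child)


def satisfies(n: Node, label: Optional[str], keyword: Optional[str]) -> bool:
    """Check an interior node against a term with optional label and keyword."""
    if n.is_leaf():
        return False  # only interior nodes satisfy search terms
    if label is not None and n.label != label:
        return False
    if keyword is None:
        return True   # label-only term: the label match is enough
    if label is not None:
        # label and keyword: some descendant leaf must contain the keyword
        return any(keyword in leaf.keywords for leaf in leaf_descendants(n))
    # keyword-only term: a leaf child must contain the keyword
    return any(c.is_leaf() and keyword in c.keywords for c in n.children)
```

With an authors node whose author child holds the leaf keyword Kempster, the author node satisfies the keyword-only term, while the authors node satisfies only the label-keyword combination.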

23
Meaningfully Related Sets of Nodes

Let T be a tree and R be a binary, reflexive and symmetric relation on the nodes in T.
We assume that R contains pairs of nodes that are meaningfully related.
We present two different ways to extend R to arbitrary sets of nodes.

A set of nodes N is all-pairs R-related if (n1, n2) is in R for every pair of nodes n1, n2 ∈ N.
This states that a set of nodes is meaningfully related if every pair of nodes in the set is meaningfully related.
N is star R-related if there is a node n* ∈ N such that the pair (n*, n) is in R for all nodes n ∈ N.
This states that the nodes of a set are meaningfully related if all these nodes are meaningfully related to a single node in the set.
Depending on the structure of the documents, either the all-pairs relationship or the star relationship may be more appropriate.
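
The two extensions of R can be compared side by side in a short sketch. The function names are hypothetical, and the relation is passed in as a callable that is assumed to be reflexive and symmetric, as stated for R above.

```python
from itertools import combinations
from typing import Callable, Hashable, Iterable

Relation = Callable[[Hashable, Hashable], bool]


def all_pairs_related(nodes: Iterable[Hashable], related: Relation) -> bool:
    """Every pair of nodes in the set must be in the relation."""
    return all(related(a, b) for a, b in combinations(list(nodes), 2))


def star_related(nodes: Iterable[Hashable], related: Relation) -> bool:
    """Some node in the set must be related to every node in the set."""
    members = list(nodes)
    if not members:
        return True  # assumption: the empty set counts as related
    return any(all(related(center, n) for n in members) for center in members)
```

Every all-pairs related set is also star related, but not the other way round: a set in which one node is related to all others can fail the all-pairs test.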

25
Query Answers

Let Q(t1,...,tm) be a query.
A sequence N = n1,...,nm of nodes and null values is an all-pairs R-answer for Q if the nodes in N are all-pairs R-related and, for all 1 ≤ i ≤ m,
ni is not the null value if ti is a required term, and
ni satisfies ti if it is not the null value.
Similarly, N is a star R-answer for Q when the nodes in N are star R-related.

We use
Ans_{a,R}(Q) to denote the set of all-pairs R-answers for the query Q over a tree T,
Ans_{s,R}(Q) to denote the set of star R-answers for Q over T, and
MaxAns_{a,R}(Q) to denote the set of maximal answers in Ans_{a,R}(Q).

27
The Interconnection Relationship

We present a relation which can be used to
determine whether a pair of nodes is meaningfully
related.
Let T be a tree and let n1 and n2 be nodes in T.
The shortest undirected path between n1 and n2 consists of the paths from the lowest common ancestor of n1 and n2 to n1 and to n2.

We denote the tree consisting of these two paths as T_{n1,n2}.
This tree describes the relationship between the nodes n1 and n2.
For example, in Tsr, the tree T_{8,13} consists of the nodes 7, 8, 9, 12 and 13.

29
Relationship Trees
[Figure: the relationship tree of n1, n2, ..., nk, rooted at their lowest common ancestor]
30
Our Semantic Relation Interconnection

n1,..., nk are interconnected if either
the relationship tree of n1,..., nk does not contain two nodes with the same label, or
the only nodes with the same label in the relationship tree of n1,..., nk are among n1,..., nk
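
For the case of two nodes, the definition can be checked directly by walking up to the lowest common ancestor and counting labels. This is a naive sketch for illustration, not the indexed algorithm used by the system; the Node class with a parent pointer and the function names are assumptions.

```python
from collections import Counter
from typing import List, Optional


class Node:
    """A labeled tree node with a parent pointer (hypothetical helper type)."""

    def __init__(self, label: str, parent: Optional["Node"] = None):
        self.label = label
        self.parent = parent


def path_to_root(n: Optional[Node]) -> List[Node]:
    path = []
    while n is not None:
        path.append(n)
        n = n.parent
    return path


def relationship_tree(n1: Node, n2: Node) -> List[Node]:
    """Nodes on the two paths from the lowest common ancestor to n1 and n2."""
    up1 = path_to_root(n1)
    position = {id(n): i for i, n in enumerate(up1)}
    below_lca = []
    for n in path_to_root(n2):
        if id(n) in position:
            # n is the lowest common ancestor: keep n1..lca plus n2's side.
            return up1[: position[id(n)] + 1] + below_lca
        below_lca.append(n)
    raise ValueError("nodes do not belong to the same tree")


def interconnected(n1: Node, n2: Node) -> bool:
    counts = Counter(n.label for n in relationship_tree(n1, n2))
    repeated = [label for label, c in counts.items() if c > 1]
    if not repeated:
        return True
    # The only allowed repetition is n1 and n2 themselves sharing a label.
    return (repeated == [n1.label]
            and n1.label == n2.label
            and counts[n1.label] == 2)
```

On a proceedings tree with two inproceedings children, an author and a title under the same inproceedings node are interconnected, while an author and a title under different ones are not, because the label inproceedings appears twice in their relationship tree.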

31
Example (1)
[Figure: relationship tree of the circled nodes, rooted at their lowest common ancestor]
The circled nodes belong to different inproceedings entities. They ARE NOT interconnected!
32
Example (2)
[Figure: relationship tree of the circled nodes, rooted at their lowest common ancestor]
The circled nodes belong to the same inproceedings entity. They ARE interconnected!
33
Example (3)
[Figure: document tree rooted at proceedings, with two inproceedings subtrees containing title and author nodes (authors Moshe Y. Vardi, Victor Vianu and Serge Abiteboul; titles "Queries and Computation on the Web" and "Querying Logical Databases"). The circled nodes, their lowest common ancestor and their relationship tree are marked.]
The circled nodes belong to the same inproceedings entity, but are labeled with the same tag. They ARE interconnected.
34
Example 1 of Query Semantics

Consider the query Q1 defined as
Q1(+title:, author:).
The query Q1 finds pairs of titles and authors belonging to the same article.
Only tuples where the title is non-null will be returned.
The answers created for Tsr are
(8,10), (8,12), (8,14), (17,18) and (25, ⊥), where ⊥ denotes the null value.

35
Example 2 of Query Semantics

The answers for Q1 over this document would consist of
(6,3) and (6,4)

36
Query Processing

Document fragments are extracted using the
interconnection index and other indices
Extracted fragments are returned ranked by the
estimated relevance

37
Ranker
38
Ranking Factors

Several factors increase the rank of a result
Similarity between query and result
Weight of labels appearing in the result
Characteristics of result tree

39
Query and Result Similarity

TF-ILF
An extension of TF-IDF, which is classical in IR
Term Frequency: number of occurrences of a query term in a fragment
Inverse Leaf Frequency: the inverse of the fraction of leaves in the corpus that contain a query term (number of leaves containing the term divided by the number of leaves in the corpus)

40
TF-ILF

Term frequency of keyword k in a leaf node nl
Inverse leaf frequency

TF-ILF is the product of tf and ilf
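
The formulas on this slide did not survive in the transcript, so the following is only a sketch modeled on the usual TF-IDF pattern. It assumes that tf normalizes a keyword's count by the count of the most frequent keyword in the leaf and that ilf is log(1 + total leaves / leaves containing the keyword); both choices, and all function names, are assumptions rather than the system's confirmed definitions.

```python
import math
from collections import Counter
from typing import List, Sequence


def tf(keyword: str, leaf: Sequence[str]) -> float:
    """Occurrences of keyword in the leaf, normalized by the top count."""
    counts = Counter(leaf)
    if not counts:
        return 0.0
    return counts[keyword] / max(counts.values())


def ilf(keyword: str, leaves: List[Sequence[str]]) -> float:
    """Assumed form: log(1 + number of leaves / leaves containing keyword)."""
    containing = sum(1 for leaf in leaves if keyword in leaf)
    if containing == 0:
        return 0.0
    return math.log(1 + len(leaves) / containing)


def tf_ilf(keyword: str, leaf: Sequence[str], leaves: List[Sequence[str]]) -> float:
    return tf(keyword, leaf) * ilf(keyword, leaves)
```

Under these assumptions a keyword that is frequent in one leaf but rare across the corpus gets a high weight, which mirrors the role of TF-IDF at document level.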
41
Weight of Labels

Some labels are considered more important than
others
Text under an element labeled with title is more important than text under an element labeled with section
Label weights can be
system generated
user defined

42
Relationship between Nodes

Size of the relationship tree: a small fragment indicates that its nodes are closer and thus, probably, more related

article: title:XML
43
Relationship between Nodes

An ancestor-descendant relationship between a pair of nodes in a fragment indicates a strong relation between these nodes

section: title:XML
44
Combining the Factors

Given a query Q and an answer N, we use the measures
sim(Q,N),
tsize(N)
and anc-des(N)
to determine the ranking of the answer. We experimented with the following combination of factors, varying the values of α, β and γ:
$$\frac{\mathrm{sim}(Q,N)^{\alpha}}{\mathrm{tsize}(N)^{\beta}} \times \bigl(1 + \gamma \times \mathrm{anc\text{-}des}(N)\bigr)$$
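
As a small worked sketch, the combination can be written as a function of the three measures, assuming it has the form sim^α / tsize^β × (1 + γ × anc-des). The function name and the default parameter values of 1.0 are placeholders chosen for illustration, not values reported for the system.

```python
def rank(sim: float, tsize: int, anc_des: float,
         alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Combine similarity, relationship-tree size and ancestor-descendant score."""
    if tsize <= 0:
        raise ValueError("tsize must be positive")
    return (sim ** alpha) / (tsize ** beta) * (1 + gamma * anc_des)
```

With the placeholder defaults, a larger relationship tree lowers the score and an ancestor-descendant relationship raises it, which matches the direction of the ranking factors described on the previous slides.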

45
System Implementation

The architecture of the XSEarch system is
depicted in the following figure

46
(No Transcript)
47

The basic flow of information is as follows:
The user enters a query using a browser.
The Search-Query Processor parses the query into a list of search terms.
The Index Repository is used to find nodes that satisfy the search terms and to find whether pairs of nodes are interconnected.
It responds by checking the stored indices.
If these indices do not contain sufficient information, the Indexer is used to augment the current indices.
Once the relevant information is returned to the Search-Query Processor, it creates the answers, which are ranked, sorted and then returned.
The Indexer creates several different indices in the Index Repository based on a set of XML documents.

We focus on the most important and novel index structures:
The interconnection index
The path index
The interconnection index allows for rapid checking of the interconnection relationship.
The path index allows us to create answers with a higher estimated ranking first.

49
Dynamic Offline Interconnection Indexing

Checking for interconnection of nodes online is expensive.
Hence, we first decided to create a node-interconnection index that stores information about the interconnection relationship between each pair of nodes.
This requires solving the following problem:
Given a document T, for all pairs of nodes n and n' in T, determine whether n and n' are interconnected.
The algorithm that solves this problem is based on the following lemma.

Lemma (Interconnection Characterization)
Let T be a document and let n and n' be nodes in T.
If n is an ancestor of n', then n and n' are interconnected if and only if the following hold:
The parent of n' is strongly-interconnected with n.
The child of n on the path to n' is strongly-interconnected with n'.
If n is not an ancestor of n' and n' is not an ancestor of n, then n and n' are interconnected if and only if the following hold:
The parent of n is strongly-interconnected with n'.
The parent of n' is strongly-interconnected with n.

In the XSEarch system, we explored the possibilities of storing the node-interconnection index in either a hashtable or a symmetric matrix.
When implemented as a hashtable, the node-interconnection index contains pairs of ids of interconnected nodes.
When implemented as a symmetric matrix, the node-interconnection index contains a boolean value for each pair of nodes, indicating whether they are interconnected or not.
The time and space efficiency of these structures is compared in the experimental results.
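
The two storage options can be sketched as follows. The class names HashIndex and MatrixIndex are hypothetical, node ids are assumed to be integers from 0 to n-1, and the matrix variant stores only one triangle of the symmetric matrix in a flat byte array; none of these details are confirmed by the slides.

```python
class HashIndex:
    """Stores only the interconnected pairs, keyed by an ordered id pair."""

    def __init__(self) -> None:
        self._pairs = set()

    def add(self, a: int, b: int) -> None:
        self._pairs.add((min(a, b), max(a, b)))

    def interconnected(self, a: int, b: int) -> bool:
        # The relation is reflexive, so a node is related to itself.
        return a == b or (min(a, b), max(a, b)) in self._pairs


class MatrixIndex:
    """Stores one boolean per unordered pair in a triangular layout."""

    def __init__(self, n: int) -> None:
        self._cells = bytearray(n * (n + 1) // 2)
        for i in range(n):
            self._cells[self._pos(i, i)] = 1  # reflexive entries

    @staticmethod
    def _pos(a: int, b: int) -> int:
        hi, lo = max(a, b), min(a, b)
        return hi * (hi + 1) // 2 + lo

    def add(self, a: int, b: int) -> None:
        self._cells[self._pos(a, b)] = 1

    def interconnected(self, a: int, b: int) -> bool:
        return bool(self._cells[self._pos(a, b)])
```

Both variants answer a lookup in constant time. The hashtable grows with the number of interconnected pairs, while the matrix always needs space quadratic in the number of nodes.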

52
Dynamic Online Interconnection Indexing

Offline computation of the node-interconnection
index may be expensive.
In order to amortize the cost of computing this
index over the queries received, we have also
considered an online indexing method.
When indexing online, for each pair of nodes n and n', we compute the section of the node-interconnection index corresponding to T_{n,n'}

We use a hashtable to store the part of the index that has already been computed at any given moment.
The hashtable contains a boolean value for each
pair of nodes whose interconnection has already
been checked.
The boolean value indicates whether the nodes are
interconnected or not.
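
The hashtable of already-checked pairs behaves like a memoization cache. In the minimal sketch below, OnTheFlyIndex is a hypothetical name and check stands for any function that decides interconnection for a pair of node ids. As a simplification, the sketch caches only the pair that was asked for, not the whole section of the index for the pair's relationship tree.

```python
from typing import Callable, Dict, Tuple


class OnTheFlyIndex:
    """Caches interconnection results as pairs are checked during queries."""

    def __init__(self, check: Callable[[int, int], bool]) -> None:
        self._check = check
        self._cache: Dict[Tuple[int, int], bool] = {}

    def interconnected(self, a: int, b: int) -> bool:
        key = (a, b) if a <= b else (b, a)  # the relation is symmetric
        if key not in self._cache:
            self._cache[key] = self._check(*key)
        return self._cache[key]

    def computed_pairs(self) -> int:
        """Number of pairs whose interconnection has been checked so far."""
        return len(self._cache)
```

Storing False results as well as True ones is what lets a repeated lookup skip the expensive check in both cases.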

During query processing, usually only a small
part of the node-interconnection index will be
created, thus the slowdown in response time is
not large.
In addition, queries tend to be similar in the
parts of the document that they must access.
Therefore, even after many queries have been
evaluated, it is likely for the
node-interconnection index to be only partially
computed.
This speeds up execution time when loading the
index into main memory.

55
Experimental Results
56
Hardware and Software Used

Language: Java
Processor: 1.6 GHz Pentium 4
RAM: 2 GB (limited to 1.46 GB by JVM)
OS: Windows XP

57
Interconnection Index

Is built offline
Allows for checking interconnection between two
nodes, during query processing, in O(1) time
We have two implementations
as a hash table
as a symmetric matrix
The Indexer is responsible for building the
Interconnection Index

58
Choosing the Implementation for the
Interconnection Index

We have experimented with the two implementations of the interconnection index:
1. IIH: the index is a hash table
2. IIM: the index is a symmetric matrix
We compare the two implementations on:
Cost of building the index
Cost of query processing, i.e., using the index

59
Time For Building Indices

Both implementations are reasonable

IIM is better than IIH, because of the additional
overhead of hashing

60
On the Fly Indexing (OFI)

Fully building the indices as a preprocess of
querying is expensive in memory for huge
corpuses!
Also expensive in time because of the additional
overhead of using virtual memory
Instead, compute interconnection index
incrementally on-the-fly during query processing
for each pair that must be checked
By how much will query processing be slowed down?

61
Time For Building Indices: Comparing IIH, IIM, OFI
For these corpuses, OFI time is less than 10 ms.
Actually, it is the time to build all the indices other than the interconnection index.
62
(No Transcript)
63
Query Execution Time

We generated 1000 random queries for the Sigmod
Record corpus
Each query had
At most 3 optional search terms
At most 3 required search terms
We checked time with IIH, IIM and OFI

64
IIH/IIM Query Processing Time

Note: logarithmic scale
Both approaches lead to similar results
Average run time for queries: 35 ms

65
OFI Query Processing Time

After processing the 1000 queries, 0.75% of all pairs of nodes were checked for interconnection.
Space is saved in main memory

Slowdown in response time is not too large! Locality property: queries tend to be similar in the parts of the document that they may access

More than 50% of the queries were processed in under 10 ms

66
How Good are the Results?

We measured recall and precision for the query
Find papers written by Buneman that contain the keyword database in the title
We tried two different queries that reflect different amounts of user knowledge:
Kw: Buneman database (classical search engine query)
Tag-kw: author:Buneman title:database
Corpus: Sigmod Record, DBLP

67
Precision and Recall

We computed the "correct answers" using XQuery
Recall
Perfect recall, i.e., XSEarch returns all the correct answers
Precision at n
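
The two measures can be computed from a ranked result list and a set of correct answers, as in this short sketch. The function names are hypothetical, and precision at n is taken in its usual sense: the fraction of the top n results that are correct.

```python
from typing import Hashable, Sequence, Set


def precision_at(ranked: Sequence[Hashable], correct: Set[Hashable], n: int) -> float:
    """Fraction of the first n ranked results that are correct answers."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sum(1 for r in ranked[:n] if r in correct) / n


def recall(ranked: Sequence[Hashable], correct: Set[Hashable]) -> float:
    """Fraction of the correct answers that appear anywhere in the results."""
    if not correct:
        return 1.0  # assumption: nothing to find counts as perfect recall
    return len(set(ranked) & correct) / len(correct)
```

Perfect recall corresponds to recall(...) returning 1.0, and perfect precision at n corresponds to precision_at(..., n) returning 1.0.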

68
Precision at 5, 10 and 20
Sigmod: perfect precision. DBLP: 0.8/0.9 for the query containing only keywords
Combining tags and keywords leads to perfect precision
69
Related Work

Numerous query languages for XML have been
developed.
For example, the XQuery working group is
considering how to add full-text search features
and ranking to XQuery. Such capabilities have
already been added to various XML query
languages. But these languages are not suitable for a naïve user, since the query syntax is always complex.
A recent related work is the XRANK system for keyword searching in XML documents.

70
Conclusions

The main contribution of this paper is in laying the foundations for a semantic search engine over XML documents.
XSEarch returns semantically related fragments, ranked by estimated relevance.
This system is extensible and can easily accommodate different types of relationships between nodes.
We have shown that it is possible to combine these qualities with an efficient, scalable and modular system.
Thus, XSEarch can be seen as a general framework for semantic searching in XML documents.

Efficient index structures
IIM/IIH for small documents
OFI for big documents
Efficient evaluation algorithms
Dynamic algorithm for computing interconnection
Extensible implementation
The system can easily accommodate different types
of semantic relations between nodes, other than
interconnection