XSEarch: A Semantic Search Engine for XML

1
XSEarch: A Semantic Search Engine for XML
  • Sara Cohen, Jonathan Mamou, Yaron Kanza, Yehoshua Sagiv
  • The Hebrew University of Jerusalem
  • Presented by
  • Deniz Kasap Sarp Baran Özkan

2
XSEarch: an XML Search Engine
  • Goal: find the relevant XML fragments, given tag names and keywords

3
Introduction
  • It is becoming increasingly popular to publish
    data on the Web in the form of XML documents.
  • Current search engines, which are an
    indispensable tool for finding HTML documents,
    have two main drawbacks when it comes to
    searching for XML documents.
  • It is not possible to pose queries that
    explicitly refer to XML tags.
  • Search engines return references (i.e. links) to
    documents and not specific fragments thereof.
    This is problematic, since large XML documents
    may contain thousands of elements storing many
    pieces of information that are not necessarily
    related to each other.

4
Excerpt from the XML Version of DBLP
  <proceedings>
    <inproceedings>
      <author>Moshe Y. Vardi</author>
      <title>Querying Logical Databases</title>
    </inproceedings>
    <inproceedings>
      <author>Victor Vianu</author>
      <title>A Web Odyssey: From Codd to XML</title>
    </inproceedings>
  </proceedings>

5
A Search Example
  • Find papers by Vianu on the topic of logical databases

How can we find such papers?
6
Attempt 1: Standard Search Engine
A document containing some of the three query
terms is considered as a result.
7
The document is not relevant to the query. This
does not work!!!
  <proceedings>
    <inproceedings>
      <author>Moshe Y. Vardi</author>
      <title>Querying Logical Databases</title>
    </inproceedings>
    <inproceedings>
      <author>Victor Vianu</author>
      <title>A Web Odyssey: From Codd to XML</title>
    </inproceedings>
  </proceedings>

8
  • Since a reference to a whole XML document is
    usually not a useful answer, the granularity of
    the search should be refined.
  • Instead of returning an entire document, an XML
    search engine should return fragments of XML
    documents.

9
  • A query language for XML, such as XQuery, can be
    used to extract data from XML documents.
  • However, such a query language is not an
    alternative to an XML search engine for several
    reasons.
  • The syntax of XQuery is more complicated than the
    syntax of a standard search query. Hence, it is
    not appropriate for a naive user.
  • Extensive knowledge of the document structure is
    required in order to correctly formulate a query.
    Thus, queries must be formulated on a per
    document basis.
  • XQuery lacks any mechanism for ranking answers.

10
Attempt 2: XML Query Language
  FOR $i IN document("bib.xml")//inproceedings
  WHERE $i/author contains 'Vianu'
    AND $i/title contains 'Logical'
    AND $i/title contains 'Databases'
  RETURN <result>
           <author> $i/author </author>
           <title> $i/title </title>
         </result>

This does work, BUT
  • Complicated syntax
  • Extensive knowledge of the document structure
    required to write the query
  • No mechanism for ranking results

11
Our Requirements from the Search Tool
  • A simple syntax that can be used by naive users
  • Search results should include XML fragments and
    not necessarily full documents
  • The XML fragments in an answer should be
    semantically related
  • For example, a paper and an author should be in
    an answer only if the paper was written by this
    author
  • Search results should be ranked
  • Search results should be returned in reasonable
    time

12
  • The design and implementation of XSEarch involved
    several challenges.
  • A syntax that is suitable for a naive user had to
    be designed.
  • The theoretical results were adapted so that
    XSEarch always returns semantically related
    document fragments as answers.
  • Answers are highly relevant to the keywords of
    the query.
  • A suitable ranking mechanism that takes into
    account both the degree of the semantic
    relationship and the relevance of the keywords
    has been developed.
  • Index structures and evaluation algorithms that
    allow the system to deal efficiently with large
    documents have been developed.
  • The implementation of XSEarch is extensible in
    the sense that it can easily accommodate
    different types of semantic relationships.

13
Query Syntax
  • The query language of a standard search engine is
    simply a list of keywords.
  • Keywords with a plus (+) sign must appear in a
    satisfying document, whereas keywords without a
    plus sign may or may not appear in a satisfying
    document (but the appearance of such keywords is
    desirable).

14
  • The query language of XSEarch is a simple
    extension of the language described above. In
    addition to keywords, the user can specify labels
    and keyword-label combinations that must or may
    appear in a satisfying document.
  • A search term may have a plus sign prepended, in
    which case it is a required term. Otherwise, it
    is an optional term.
  • We use t, t1, t2, etc., as an abstract notation
    for required and optional terms.
  • A query has the form Q(S) where S = t1,...,tm is
    a sequence of required and optional search terms.

15
  • Formally, a search term has the form
  • l:k, l: or :k
  • where
  • l is a label and k is a keyword.
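The term forms above can be sketched as a small parser. This is only an illustration, assuming that a colon separates the label from the keyword and that a leading plus sign marks a required term; the names SearchTerm, parse_term and parse_query are hypothetical and not part of XSEarch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SearchTerm:
    label: Optional[str]    # None for a pure keyword term ":k"
    keyword: Optional[str]  # None for a pure label term "l:"
    required: bool          # True when the term was written with a leading "+"


def parse_term(text: str) -> SearchTerm:
    """Parse one term of the assumed form l:k, l: or :k, optionally prefixed by '+'."""
    required = text.startswith("+")
    body = text[1:] if required else text
    if ":" not in body:
        raise ValueError(f"search term {text!r} must contain ':'")
    label, keyword = body.split(":", 1)
    if not label and not keyword:
        raise ValueError(f"search term {text!r} has neither label nor keyword")
    return SearchTerm(label or None, keyword or None, required)


def parse_query(text: str) -> List[SearchTerm]:
    """Split a whitespace-separated query string into search terms."""
    return [parse_term(token) for token in text.split()]


# One required label-keyword term, one optional label, one optional keyword.
query = parse_query("+author:Vianu title: :logical")
```

Keeping the required flag on each term lets later stages decide which positions of an answer may be left empty.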

16
Example
  • Find papers by Vianu on the topic of logical
    databases

logical database inproceedings: author:Vianu
Note that the different document fragments
matching these query terms must be semantically
related
17
Query Semantics
  • This section presents the semantics of our
    queries.
  • In order to satisfy a query Q, each of the
    required terms in Q must be satisfied.
  • In addition, the elements satisfying Q must be
    meaningfully related.

18
XSEarch
author:Vianu title:
  <proceedings>
    <inproceedings>
      <author>Moshe Y. Vardi</author>
      <title>Querying Logical Databases</title>
    </inproceedings>
    <inproceedings>
      <author>Victor Vianu</author>
      <title>A Web Odyssey: From Codd to XML</title>
    </inproceedings>
  </proceedings>

<author>Victor Vianu</author>
<title>A Web Odyssey: From Codd to XML</title>
Good Result! title and author elements ARE
semantically related
19
XSEarch
author:Vianu title:
  <proceedings>
    <inproceedings>
      <author>Moshe Y. Vardi</author>
      <title>Querying Logical Databases</title>
    </inproceedings>
    <inproceedings>
      <author>Victor Vianu</author>
      <title>A Web Odyssey: From Codd to XML</title>
    </inproceedings>
  </proceedings>

<title>Querying Logical Databases</title>
<author>Victor Vianu</author>
Bad Result! title and author elements ARE NOT
semantically related
20
Satisfaction of a Search Term
  • XML documents are modeled as trees in the
    standard fashion.
  • Each interior node is associated with a label and
    each leaf node is associated with the sequence of
    keywords.
  • If k is a keyword in the sequence associated with
    a leaf node n, we say that n contains k.
  • Figure 1 shows a tree that represents a
    small portion of the Sigmod Record.
  • We will refer to this tree as Tsr.

21
(Figure 1: the tree Tsr; not transcribed)
22
  • Let n be an interior node in a tree T.
  • We say that n satisfies the search term
  • l:k if n is labeled with l and has a descendant
    that contains the keyword k.
  • l: if n is labeled with l.
  • :k if n has a leaf child that contains the
    keyword k.
  • Example
  • In the tree Tsr,
  • node number 14 satisfies :Kempster.
  • node number 9 satisfies authors:Kempster.
  • node 9 does not satisfy :Kempster, position: or
    :position.
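The satisfaction rules above can be sketched as follows. This is a minimal illustration, not the XSEarch implementation: the Node class, the function names and the two-level authors/author tree are assumptions made for the example (the real tree Tsr is in a figure that is not transcribed).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: Optional[str] = None                        # set for interior nodes
    words: List[str] = field(default_factory=list)     # set for leaf nodes
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def contains(node: Node, keyword: str) -> bool:
    """A leaf contains k if k is in its keyword sequence (case-insensitive here)."""
    return keyword.lower() in (w.lower() for w in node.words)


def has_leaf_child_containing(n: Node, keyword: str) -> bool:
    return any(c.is_leaf() and contains(c, keyword) for c in n.children)


def has_descendant_containing(n: Node, keyword: str) -> bool:
    return any(contains(c, keyword) or has_descendant_containing(c, keyword)
               for c in n.children)


def satisfies(n: Node, label: Optional[str], keyword: Optional[str]) -> bool:
    """Check a term given as (label, keyword); None marks the missing part."""
    if label is not None and n.label != label:
        return False
    if keyword is None:                      # term with a label only
        return True
    if label is None:                        # term with a keyword only
        return has_leaf_child_containing(n, keyword)
    return has_descendant_containing(n, keyword)   # label and keyword


# Hypothetical fragment: authors -> author -> leaf text "Kempster".
leaf = Node(words=["Kempster"])
author = Node(label="author", children=[leaf])
authors = Node(label="authors", children=[author])
```

In this fragment the author node satisfies the keyword-only term for Kempster, while the authors node satisfies only the combined label-keyword term, mirroring the example on the slide.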

23
Meaningfully Related Sets of Nodes
  • Let T be a tree and R be a binary, reflexive and
    symmetric relation on the nodes in T.
  • We assume that R contains pairs of nodes that are
    meaningfully related.
  • We present two different ways to extend R to
    arbitrary sets of nodes.

24
  • A set of nodes N is all-pairs R-related if
    (n1,n2) is in R for every pair of nodes n1, n2
    in N.
  • This states that a set of nodes is meaningfully
    related if every pair of nodes in the set is
    meaningfully related.
  • N is star R-related if there is a node n* ∈ N
    such that the pair (n*,n) is in R for all nodes
    n ∈ N.
  • This states that the nodes of a set are
    meaningfully related if all these nodes are
    meaningfully related to a node in the set.
  • Depending on the structure of the documents,
    either the all-pairs relationship or the star
    relationship may be more appropriate.
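The two ways of extending R to sets of nodes can be sketched as below. The relation is passed in as a function; the example relation over nodes 1, 2 and 3 is made up for illustration only.

```python
from itertools import combinations
from typing import Callable, Hashable, Iterable

Relation = Callable[[Hashable, Hashable], bool]


def all_pairs_related(nodes: Iterable[Hashable], related: Relation) -> bool:
    """Every pair of nodes in the set must be in R."""
    return all(related(a, b) for a, b in combinations(list(nodes), 2))


def star_related(nodes: Iterable[Hashable], related: Relation) -> bool:
    """Some node of the set must be in R with every node of the set."""
    members = list(nodes)
    return any(all(related(center, other) for other in members)
               for center in members)


# Hypothetical relation: 1 is related to 2 and to 3, but 2 and 3 are unrelated.
pairs = {(1, 2), (1, 3)}


def related(a: Hashable, b: Hashable) -> bool:
    # Reflexive and symmetric closure of the listed pairs.
    return a == b or (a, b) in pairs or (b, a) in pairs


star_ok = star_related([1, 2, 3], related)            # node 1 acts as the center
all_pairs_ok = all_pairs_related([1, 2, 3], related)  # fails on the pair (2, 3)
```

Every all-pairs related set with at least one node is also star related, so the star variant is the more permissive of the two.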

25
Query Answers
  • Let Q(t1,...,tm) be a query.
  • A sequence N = n1,...,nm of nodes and null values
    is an all-pairs R-answer for Q if the nodes in N
    are all-pairs R-related and, for all 1 ≤ i ≤ m,
  • ni is not the null value if ti is a required
    term
  • ni satisfies ti if it is not the null value.
  • Similarly, N is a star R-answer when the nodes in
    N are star R-related.

26
  • We use
  • Ans^a_R(Q) to denote the set of all-pairs R-answers
    for the query Q over a tree T,
  • Ans^s_R(Q) to denote the set of star R-answers
    for Q over T, and
  • MaxAns^a_R(Q) to denote the set of maximal answers in
    Ans^a_R(Q).

27
The Interconnection Relationship
  • We present a relation which can be used to
    determine whether a pair of nodes is meaningfully
    related.
  • Let T be a tree and let n1 and n2 be nodes in T.
  • The shortest undirected path between n1 and n2
    consists of the paths from the lowest common
    ancestor of n1 and n2 to n1 and n2.

28
  • We denote the tree consisting of these two paths
    as T_{n1,n2}.
  • This tree describes the relationship between the
    nodes n1 and n2.
  • For example, in Tsr the tree T_{8,13} consists of
    the nodes 7, 8, 9, 12 and 13.
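The construction of the tree for a pair of nodes can be sketched with a parent map. The map below is a guess at the relevant part of Tsr (the figure is not transcribed), chosen only so that the tree for nodes 8 and 13 comes out as on the slide; the function names are hypothetical.

```python
from typing import Dict, List, Optional, Set

Parent = Dict[int, Optional[int]]


def path_to_root(parent: Parent, n: int) -> List[int]:
    """Nodes from n up to the root, n first."""
    path = [n]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def relationship_tree(parent: Parent, n1: int, n2: int) -> Set[int]:
    """Nodes on the two paths from the lowest common ancestor to n1 and n2."""
    up1 = path_to_root(parent, n1)
    up2 = path_to_root(parent, n2)
    on_up2 = set(up2)
    # Walking upward from n1, the first node also above n2 is the LCA.
    lca = next(n for n in up1 if n in on_up2)
    return set(up1[: up1.index(lca) + 1]) | set(up2[: up2.index(lca) + 1])


# Assumed shape: 7 is the parent of 8 and 9, 9 of 12, and 12 of 13.
parent = {1: None, 7: 1, 8: 7, 9: 7, 12: 9, 13: 12}
tree = relationship_tree(parent, 8, 13)
```

Under this assumed shape the result is the node set {7, 8, 9, 12, 13}, with node 7 as the lowest common ancestor.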

29
Relationship Trees
(Figure: the relationship tree of n1, n2, ..., nk, rooted at the lowest common ancestor of n1, n2, ..., nk)
30
Our Semantic Relation: Interconnection
  • n1,..., nk are interconnected if either
  • the relationship tree of n1,..., nk does not
    contain two nodes with the same label, or
  • the only nodes with the same label in the
    relationship tree of n1,..., nk are among
    n1,..., nk
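The interconnection test can be sketched directly from this definition. The node ids and the small DBLP-like tree are assumptions for the example (node 7 is a second author added to the second paper), and the code is a brute-force check rather than the indexed method used by the system.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Set

Parent = Dict[int, Optional[int]]


def path_to_root(parent: Parent, n: int) -> List[int]:
    path = [n]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def relationship_tree(parent: Parent, nodes: Sequence[int]) -> Set[int]:
    """Union of the paths from the lowest common ancestor to each node."""
    paths = [path_to_root(parent, n) for n in nodes]
    common = set(paths[0]).intersection(*paths[1:])
    lca = next(n for n in paths[0] if n in common)   # deepest common ancestor
    tree: Set[int] = set()
    for p in paths:
        tree.update(p[: p.index(lca) + 1])
    return tree


def interconnected(parent: Parent, label: Dict[int, str],
                   nodes: Sequence[int]) -> bool:
    """Equal labels in the relationship tree are allowed only among the given nodes."""
    by_label = defaultdict(list)
    for n in relationship_tree(parent, nodes):
        by_label[label[n]].append(n)
    chosen = set(nodes)
    return all(len(group) == 1 or set(group) <= chosen
               for group in by_label.values())


# Hypothetical ids: 0 proceedings; 1 and 4 inproceedings; 2, 5, 7 author; 3, 6 title.
parent = {0: None, 1: 0, 2: 1, 3: 1, 4: 0, 5: 4, 6: 4, 7: 4}
label = {0: "proceedings", 1: "inproceedings", 2: "author", 3: "title",
         4: "inproceedings", 5: "author", 6: "title", 7: "author"}

same_paper = interconnected(parent, label, [5, 6])     # author and title of one paper
cross_paper = interconnected(parent, label, [5, 3])    # author and title of two papers
two_authors = interconnected(parent, label, [5, 7])    # two authors of one paper
```

The three calls correspond to the three example slides that follow: only the pair that crosses two inproceedings elements is rejected.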

31
Example (1)
Lowest common ancestor of circled nodes
Relationship tree
Circled nodes belong to different inproceedings
entities. They ARE NOT interconnected!
32
Example (2)
Lowest common ancestor of circled nodes
Relationship tree
Circled nodes belong to the same inproceedings
entity. They ARE interconnected!
33
Example (3)
Lowest common ancestor of circled nodes
Relationship tree
(Figure: a proceedings tree with two inproceedings elements and their title and author children. Text values shown: Moshe Y. Vardi, Victor Vianu, Serge Abiteboul, Queries and Computation on the Web, Querying Logical Databases.)
Circled nodes belong to the same inproceedings
entity, but are labeled with the same tag. They
ARE interconnected.
34
Example 1 of Query Semantics
  • Consider the query Q1 defined as
  • Q1(+title:, author:).
  • The query Q1 finds pairs of titles and authors
    belonging to the same article.
  • Only tuples where the title is non-null will be
    returned.
  • The answers created for Tsr are
  • (8,10), (8,12), (8,14), (17,18) and (25,⊥), where
    ⊥ is the null value

35
Example 2 of Query Semantics
  • The answers for Q1 over this document would
    consist of
  • (6,3) and (6,4)

36
Query Processing
  • Document fragments are extracted using the
    interconnection index and other indices
  • Extracted fragments are returned ranked by the
    estimated relevance

37
Ranker
38
Ranking Factors
  • Several factors increase the rank of a result
  • Similarity between query and result
  • Weight of labels appearing in the result
  • Characteristics of result tree

39
Query and Result Similarity
  • TF-ILF
  • Extension of TF-IDF, classical in IR
  • Term Frequency: number of occurrences of a query
    term in a fragment
  • Inverse Leaf Frequency: number of leaves in the
    corpus divided by the number of leaves
    containing a query term

40
TF-ILF
  • Term frequency of keyword k in a leaf node nl
  • Inverse leaf frequency

TF-ILF is the product of tf and ilf
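The formulas for tf and ilf are not in the transcript. The sketch below assumes common TF-IDF style definitions: tf is the count of the keyword in the leaf normalized by the count of the most frequent word in that leaf, and ilf is the logarithm of one plus the ratio of all leaves to the leaves containing the keyword. Both normalizations, the function names and the sample leaves are assumptions.

```python
import math
from collections import Counter
from typing import List

Leaf = List[str]   # a leaf is modeled as its sequence of keywords


def tf(keyword: str, leaf: Leaf) -> float:
    """Occurrences of the keyword, normalized by the most frequent word in the leaf."""
    counts = Counter(leaf)
    if not counts:
        return 0.0
    return counts[keyword] / max(counts.values())


def ilf(keyword: str, leaves: List[Leaf]) -> float:
    """Grows as the keyword appears in fewer leaves of the corpus."""
    containing = sum(1 for leaf in leaves if keyword in leaf)
    if containing == 0:
        return 0.0
    return math.log(1 + len(leaves) / containing)


def tf_ilf(keyword: str, leaf: Leaf, leaves: List[Leaf]) -> float:
    return tf(keyword, leaf) * ilf(keyword, leaves)


leaves = [
    ["querying", "logical", "databases"],
    ["a", "web", "odyssey", "from", "codd", "to", "xml"],
    ["logical", "logical", "xml"],
]
weight = tf_ilf("xml", leaves[2], leaves)
```

With these definitions a keyword that occurs in a single leaf, such as codd, gets a higher ilf than one that occurs in two of the three leaves, such as logical.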
41
Weight of Labels
  • Some labels are considered more important than
    others
  • Text under an element labeled with title is more
    important than text under element labeled with
    section
  • Label weights can be
  • system generated
  • user defined

42
Relationship between Nodes
  • Size of the relationship tree: a small fragment
    indicates that its nodes are closer and thus
    probably more related

article: title:XML
43
Relationship between Nodes
  • An ancestor-descendant relationship between a
    pair of nodes in a fragment indicates a strong
    relation between these nodes

section: title:XML
44
Combining the Factors
  • Given a query Q and an answer N, we use the
    measures
  • sim(Q,N),
  • tsize(N)
  • and anc-des(N)
  • to determine the ranking of the answer. We
    experimented with the following combination of
    factors by varying the values of α, β and γ
  • sim(Q,N)^α / tsize(N)^β × (1 + γ × anc-des(N))
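The combination can be sketched as a single scoring function, reading the formula as sim(Q,N) raised to α, divided by tsize(N) raised to β, times one plus γ times anc-des(N). The default exponent values and the sample answers are placeholders, not values reported by the authors.

```python
from typing import List, Tuple


def rank(sim: float, tsize: int, anc_des: int,
         alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Higher similarity and more ancestor-descendant pairs raise the score;
    a larger relationship tree lowers it."""
    if tsize <= 0:
        raise ValueError("tsize must be positive")
    return (sim ** alpha) / (tsize ** beta) * (1 + gamma * anc_des)


# Hypothetical answers as (name, sim, tsize, anc_des).
answers: List[Tuple[str, float, int, int]] = [
    ("wide fragment", 0.8, 8, 0),
    ("compact fragment", 0.8, 4, 0),
    ("compact with ancestor pair", 0.8, 4, 1),
]
ranked = sorted(answers,
                key=lambda a: rank(a[1], a[2], a[3], gamma=0.5),
                reverse=True)
best = ranked[0][0]
```

Raising β penalizes large relationship trees more strongly, while γ controls how much an ancestor-descendant pair boosts an answer.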

45
System Implementation
  • The architecture of the XSEarch system is
    depicted in the following figure

46
(Figure: architecture of the XSEarch system; not transcribed)
47
  • The basic flow of information is as follows
  • The user enters a query using a browser.
  • The Search-Query Processor parses the query into
    a list of search terms.
  • The Index Repository is used to find nodes that
    satisfy the search terms and to find
    whether pairs of nodes are interconnected.
  • It responds by checking the stored indices.
  • If these indices do not contain sufficient
    information, the Indexer is used to augment the
    current indices.
  • Once the relevant information is returned to the
    Search-Query Processor, it creates the answers,
    which are ranked, sorted and then returned.
  • The Indexer creates several different indices in
    the Index Repository based on a set of XML
    documents.

48
  • We focus on the most important and novel index
    structures
  • The interconnection index
  • Path index
  • The interconnection index allows for rapid
    checking of the interconnection relationship.
  • The path index allows us to create answers with
    a higher estimated ranking first.

49
Dynamic Offline Interconnection Indexing
  • Checking for interconnection of nodes online is
    expensive.
  • Hence, we first decided to create
    a node-interconnection index that stores
    information about the interconnection
    relationship between each pair of nodes.
  • This requires solving the following problem
  • Given a document T, for all pairs of nodes n and
    n' in T, determine whether n and n' are
    interconnected.
  • The algorithm that solves this
    problem is based on the following lemma

50
  • Lemma (Interconnection Characterization)
  • Let T be a document and let n and n' be nodes in
    T.
  • If n is an ancestor of n', then n and n' are
    interconnected if and only if the following hold
  • The parent of n' is strongly-interconnected with
    n
  • The child of n on the path to n' is
    strongly-interconnected with n'.
  • If n is not an ancestor of n' and n' is not an
    ancestor of n, then n and n' are interconnected
    if and only if the following hold
  • The parent of n is strongly-interconnected with
    n'
  • The parent of n' is strongly-interconnected with
    n.
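The lemma can be checked against the direct definition on a small tree. The sketch assumes that two nodes are strongly-interconnected when no two distinct nodes of their relationship tree share a label (the slides do not define the term), and that the primes dropped in the transcript distinguish the two nodes n and n'. The tree, with a nested pair of section elements, is made up for the check.

```python
from itertools import combinations
from typing import Dict, List, Optional

Parent = Dict[int, Optional[int]]
Labels = Dict[int, str]


def path_to_root(parent: Parent, n: int) -> List[int]:
    path = [n]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def relationship_tree(parent: Parent, a: int, b: int) -> List[int]:
    """Distinct nodes on the paths from the lowest common ancestor to a and b."""
    up_a, up_b = path_to_root(parent, a), path_to_root(parent, b)
    on_b = set(up_b)
    lca = next(n for n in up_a if n in on_b)
    return up_a[: up_a.index(lca) + 1] + up_b[: up_b.index(lca)]


def strongly_interconnected(parent: Parent, label: Labels, a: int, b: int) -> bool:
    # Assumed meaning: all labels in the relationship tree are distinct.
    labels = [label[n] for n in relationship_tree(parent, a, b)]
    return len(labels) == len(set(labels))


def interconnected(parent: Parent, label: Labels, a: int, b: int) -> bool:
    # Direct definition: equal labels are allowed only between a and b themselves.
    tree = relationship_tree(parent, a, b)
    return all(label[x] != label[y] or {x, y} == {a, b}
               for x, y in combinations(tree, 2))


def interconnected_by_lemma(parent: Parent, label: Labels, a: int, b: int) -> bool:
    if a == b:
        return True                      # trivial case, not covered by the lemma
    if a in path_to_root(parent, b):     # a is an ancestor of b
        anc, desc = a, b
    elif b in path_to_root(parent, a):   # b is an ancestor of a
        anc, desc = b, a
    else:                                # neither node is an ancestor of the other
        return (strongly_interconnected(parent, label, parent[a], b)
                and strongly_interconnected(parent, label, parent[b], a))
    child = next(c for c in path_to_root(parent, desc) if parent[c] == anc)
    return (strongly_interconnected(parent, label, parent[desc], anc)
            and strongly_interconnected(parent, label, child, desc))


parent = {0: None, 1: 0, 2: 1, 3: 1, 4: 0, 5: 4, 6: 4, 7: 4, 8: 4, 9: 8, 10: 9}
label = {0: "proceedings", 1: "inproceedings", 2: "author", 3: "title",
         4: "inproceedings", 5: "author", 6: "title", 7: "author",
         8: "section", 9: "section", 10: "title"}
agree = all(interconnected(parent, label, a, b)
            == interconnected_by_lemma(parent, label, a, b)
            for a in parent for b in parent)
```

In an indexing algorithm the two strongly-interconnected lookups would come from entries computed earlier for shorter paths; here they are recomputed by brute force to keep the sketch short.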

51
  • In the XSEarch system, we have explored the
    possibilities of storing the node-interconnection
    index in either a hash table or a symmetric
    matrix.
  • When implemented as a hash table, the
    node-interconnection index contains pairs of ids
    of interconnected nodes.
  • When implemented as a symmetric matrix, the
    node-interconnection index contains a boolean
    value for each pair of nodes, indicating whether
    they are interconnected or not.
  • A comparison of the time and space efficiency of
    these structures is given in the experimental
    results.
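The two storage options can be sketched as two classes with the same interface. The class names and the flat lower-triangle layout are assumptions for illustration (the original system is written in Java), and node ids are assumed to be the integers 0 to n-1.

```python
from typing import List, Set, Tuple


class HashIndex:
    """Stores only the pairs of ids of interconnected nodes."""

    def __init__(self) -> None:
        self._pairs: Set[Tuple[int, int]] = set()

    def add(self, a: int, b: int) -> None:
        self._pairs.add((min(a, b), max(a, b)))   # one entry per unordered pair

    def interconnected(self, a: int, b: int) -> bool:
        return a == b or (min(a, b), max(a, b)) in self._pairs


class MatrixIndex:
    """Stores one boolean per unordered pair: the lower triangle of a symmetric matrix."""

    def __init__(self, n: int) -> None:
        self._bits: List[bool] = [False] * (n * (n + 1) // 2)

    @staticmethod
    def _pos(a: int, b: int) -> int:
        hi, lo = max(a, b), min(a, b)
        return hi * (hi + 1) // 2 + lo             # row-major lower triangle

    def add(self, a: int, b: int) -> None:
        self._bits[self._pos(a, b)] = True

    def interconnected(self, a: int, b: int) -> bool:
        return a == b or self._bits[self._pos(a, b)]


hash_index, matrix_index = HashIndex(), MatrixIndex(4)
for a, b in [(1, 3), (0, 2)]:
    hash_index.add(a, b)
    matrix_index.add(a, b)
```

The hash table grows with the number of interconnected pairs, whereas the matrix always needs space for every pair but avoids hashing on lookup.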

52
Dynamic Online Interconnection Indexing
  • Offline computation of the node-interconnection
    index may be expensive.
  • In order to amortize the cost of computing this
    index over the queries received, we have also
    considered an online indexing method.
  • When indexing online, for each pair of nodes n
    and n' that must be checked, we compute the
    section of the node-interconnection index
    corresponding to T_{n,n'}.

53
  • We use a hash table to store the part of the
    index that has already been computed at any
    given moment.
  • The hashtable contains a boolean value for each
    pair of nodes whose interconnection has already
    been checked.
  • The boolean value indicates whether the nodes are
    interconnected or not.
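The online scheme can be sketched as a cache in front of an expensive check. This is a simplification: it caches only the queried pair, whereas the slides describe computing the whole section of the index for the relationship tree of the pair. The class name and the stand-in check function are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


class OnTheFlyIndex:
    """Computes interconnection on first request and remembers the result."""

    def __init__(self, check: Callable[[int, int], bool]) -> None:
        self._check = check
        self._known: Dict[Tuple[int, int], bool] = {}

    def interconnected(self, a: int, b: int) -> bool:
        key = (min(a, b), max(a, b))          # the relation is symmetric
        if key not in self._known:
            self._known[key] = self._check(*key)
        return self._known[key]

    def computed_pairs(self) -> int:
        return len(self._known)


calls: List[Tuple[int, int]] = []


def expensive_check(a: int, b: int) -> bool:
    # Stand-in for a real interconnection test on the document tree.
    calls.append((a, b))
    return (a + b) % 2 == 0


index = OnTheFlyIndex(expensive_check)
first = index.interconnected(4, 2)
second = index.interconnected(2, 4)   # answered from the hash table
```

Only pairs that queries actually touch are ever stored, which is what keeps the index partial when queries concentrate on the same parts of a document.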

54
  • During query processing, usually only a small
    part of the node-interconnection index will be
    created, thus the slowdown in response time is
    not large.
  • In addition, queries tend to be similar in the
    parts of the document that they must access.
  • Therefore, even after many queries have been
    evaluated, it is likely for the
    node-interconnection index to be only partially
    computed.
  • This speeds up execution time when loading the
    index into main memory.

55
Experimental Results
56
Hardware and Software Used
  • Language: Java
  • Processor: 1.6 GHz Pentium 4
  • RAM: 2 GB (limited to 1.46 GB by JVM)
  • OS: Windows XP

57
Interconnection Index
  • Is built offline
  • Allows for checking interconnection between two
    nodes, during query processing, in O(1) time
  • We have two implementations
  • as a hash table
  • as a symmetric matrix
  • The Indexer is responsible for building the
    Interconnection Index

58
Choosing the Implementation for the
Interconnection Index
  • We have experimented with the two implementations
    of the interconnection index
  • 1. IIH: the index is a hash table
  • 2. IIM: the index is a symmetric matrix
  • We compare the two implementations
  • Cost of building the index
  • Cost of query processing, i.e., using the index

59
Time For Building Indices
  • Both implementations are reasonable
  • IIM is better than IIH, because of the additional
    overhead of hashing

60
On the Fly Indexing (OFI)
  • Fully building the indices as a preprocessing
    step before querying is expensive in memory for
    huge corpuses!
  • Also expensive in time because of the additional
    overhead of using virtual memory
  • Instead, compute interconnection index
    incrementally on-the-fly during query processing
    for each pair that must be checked
  • By how much will query processing be slowed down?

61
Time For Building Indices: Comparing IIH, IIM, OFI
For these corpuses, OFI time is less than 10 ms.
Actually it is the time to build all the indices
other than the interconnection index.
62
(No Transcript)
63
Query Execution Time
  • We generated 1000 random queries for the Sigmod
    Record corpus
  • Each query had
  • At most 3 optional search terms
  • At most 3 required search terms
  • We checked time with IIH, IIM and OFI

64
IIH/IIM Query Processing Time
  • Note: logarithmic scale
  • Both approaches lead to similar results
  • Average run time for queries: 35 ms

65
OFI Query Processing Time
  • After processing the 1000 queries, 0.75 of all
    pairs of nodes were checked for interconnection.
  • Space saved in main memory

Slowdown in response time not too large! Locality
property: queries tend to be similar in the parts
of the document that they may access
  • More than 50% of the queries processed in under
    10 ms

66
How Good are the Results?
  • We measured recall and precision for the query
  • Find papers written by Buneman that contain the
    keyword database in the title
  • We tried two different queries that reflect
    different amounts of user knowledge
  • Kw: Buneman database (classical search engine
    query)
  • Tag-kw: author:Buneman title:database
  • Corpora: Sigmod, DBLP

67
Precision and Recall
  • We computed the "correct answers" using XQuery
  • Recall
  • Perfect recall, i.e., XSEarch returns all the
    correct answers
  • Precision at n
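The two measures can be sketched as follows, using the standard definitions: precision at n is the fraction of the top n returned answers that are correct, and recall is the fraction of correct answers that are returned. The answer ids are invented; in the experiments the set of correct answers came from XQuery.

```python
from typing import List, Set


def precision_at(ranked: List[str], relevant: Set[str], n: int) -> float:
    """Fraction of the top n ranked answers that are correct."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sum(1 for answer in ranked[:n] if answer in relevant) / n


def recall(ranked: List[str], relevant: Set[str]) -> float:
    """Fraction of the correct answers that appear anywhere in the results."""
    if not relevant:
        return 1.0
    return len(set(ranked) & relevant) / len(relevant)


# Hypothetical ranked result list and set of correct answers.
ranked = ["a", "x", "b", "c", "y"]
relevant = {"a", "b", "c"}

p_at_2 = precision_at(ranked, relevant, 2)
p_at_5 = precision_at(ranked, relevant, 5)
full_recall = recall(ranked, relevant)
```

Precision at small n rewards a ranking that puts correct answers first, which is why it is reported at several cutoffs.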

68
Precision at 5, 10 and 20
Sigmod: perfect precision. DBLP: 0.8/0.9 for the
query containing only keywords.
Combining tags and keywords leads to perfect
precision
69
Related Work
  • Numerous query languages for XML have been
    developed.
  • For example, the XQuery working group is
    considering how to add full-text search features
    and ranking to XQuery. Such capabilities have
    already been added to various XML query
    languages. But these languages are not suitable
    for a naive user, since the query syntax is always
    complex.
  • A recent related work is the XRANK system for
    keyword searching in XML documents

70
Conclusions
  • The main contribution of this paper is in laying
    the foundations for a semantic search engine over
    XML documents.
  • XSEarch returns semantically related fragments,
    ranked by estimated relevance.
  • This system is extensible and can easily
    accommodate different types of relationships
    between nodes.
  • We have shown that it is possible to combine
    these qualities with an efficient, scalable and
    modular system.
  • Thus, XSEarch can be seen as a general framework
    for semantic searching in XML documents.

71
  • Efficient index structures
  • IIM/IIH for small documents
  • OFI for big documents
  • Efficient evaluation algorithms
  • Dynamic algorithm for computing interconnection
  • Extensible implementation
  • The system can easily accommodate different types
    of semantic relations between nodes, other than
    interconnection

72
  • Thank You.
  • Questions?