XSEarch: A Semantic Search Engine for XML

Transcript and Presenter's Notes
1
XSEarch: A Semantic Search Engine for XML
  • Sara Cohen
  • Jonathan Mamou
  • Yaron Kanza
  • Yehoshua Sagiv
  • Presented at VLDB 2003, Germany

2
XSEarch: an XML Search Engine
  • Our goal: find the relevant XML fragments,
    given tag names and keywords

3
Excerpt from the XML Version of DBLP
  • <proceedings>
  • <inproceedings>
  • <author>Moshe Y. Vardi</author>
  • <title>Querying Logical Databases</title>
  • </inproceedings>
  • <inproceedings>
  • <author>Victor Vianu</author>
  • <title>A Web Odyssey: From Codd to XML</title>
  • </inproceedings>
  • </proceedings>

4
A Search Example
  • Find papers by Vianu on the topic of logical
    databases

How can we find such papers?
5
Attempt 1: Standard Search Engine
Each document in the corpus is treated as an
integral unit. A document containing some of the
three query terms is considered a result.
6
The document is not relevant to the query. This
does not work!!!
  • <proceedings>
  • <inproceedings>
  • <author>Moshe Y. Vardi</author>
  • <title>Querying Logical Databases</title>
  • </inproceedings>
  • <inproceedings>
  • <author>Victor Vianu</author>
  • <title>A Web Odyssey: From Codd to XML</title>
  • </inproceedings>
  • </proceedings>

7
Attempt 2: XML Query Language
  • FOR $i IN document("bib.xml")//inproceedings
  • WHERE $i/author contains "Vianu"
  • AND $i/title contains "Logical"
  • AND $i/title contains "Databases"
  • RETURN <result>
  • <author> $i/author </author>
  • <title> $i/title </title>
  • </result>

This does work, BUT
  • Complicated syntax
  • Extensive knowledge of the document structure
    required to write the query
  • No mechanism for ranking results

8
Our Requirements from the Search Tool
  • A simple syntax that can be used by naive users
  • Search results should include XML fragments and
    not necessarily full documents
  • The XML fragments in an answer should be
    semantically related
  • For example, a paper and an author should be in
    an answer only if the paper was written by this
    author
  • Search results should be ranked
  • Search results should be returned in reasonable
    time

9
Overall Architecture
[Architecture diagram: a User Interface, a Query Processor, a Ranker, an
Indexer, and an Index Repository holding the XML files and the indices.]
10
Query Syntax and Semantics
[The architecture diagram is repeated.]
11
XSEarch Query Syntax
  • A query is a list of query terms
  • A query term can be a
  • Keyword, e.g., database
  • Tag, e.g., inproceedings
  • Tag-keyword combination, e.g., author:Vianu
  • Each query term can optionally be preceded by a +,
    marking it as a required term

12
Example
  • Find papers by Vianu on the topic of logical
    databases

logical database inproceedings author:Vianu
Note that the different document fragments
matching these query terms must be semantically
related
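A minimal Python sketch of turning such a query string into query terms.
The "+" handling and the exact surface syntax are assumptions consistent
with the description above; the class and function names are illustrative,
not XSEarch's API.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class QueryTerm:
    tag: Optional[str]       # e.g. "author" in author:Vianu
    keyword: Optional[str]   # e.g. "Vianu"
    required: bool           # True if the term was prefixed with "+"

def parse_query(query: str) -> List[QueryTerm]:
    terms = []
    for token in query.split():
        required = token.startswith("+")
        if required:
            token = token[1:]
        if ":" in token:
            tag, keyword = token.split(":", 1)       # tag:keyword or tag:
            terms.append(QueryTerm(tag or None, keyword or None, required))
        else:                                        # bare keyword
            terms.append(QueryTerm(None, token, required))
    return terms

# The query from this slide:
print(parse_query("logical database inproceedings author:Vianu"))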
13
Semantic Relation: The Intuition
14
XSEarch
author:Vianu title
  • <proceedings>
  • <inproceedings>
  • <author>Moshe Y. Vardi</author>
  • <title>Querying Logical Databases</title>
  • </inproceedings>
  • <inproceedings>
  • <author>Victor Vianu</author>
  • <title>A Web Odyssey: From Codd to XML</title>
  • </inproceedings>
  • </proceedings>

<author>Victor Vianu</author>
<title>A Web Odyssey: From Codd to XML</title>
Good Result! title and author elements ARE
semantically related
15
XSEarch
author:Vianu title
  • <proceedings>
  • <inproceedings>
  • <author>Moshe Y. Vardi</author>
  • <title>Querying Logical Databases</title>
  • </inproceedings>
  • <inproceedings>
  • <author>Victor Vianu</author>
  • <title>A Web Odyssey: From Codd to XML</title>
  • </inproceedings>
  • </proceedings>

<title>Querying Logical Databases</title>
<author>Victor Vianu</author>
Bad Result! title and author elements ARE NOT
semantically related
16
Semantic Relation: Formalization
17
Data Model: Document Tree
[Figure: the DBLP excerpt as a document tree. Tag nodes (colored green): a
proceedings root with two inproceedings children, each having a title and an
author child. Data nodes (colored red): "Moshe Y. Vardi" and "Querying
Logical Databases" under the first inproceedings; "Victor Vianu" and
"A Web Odyssey: From Codd to XML" under the second.]
GOAL: Find pairs of semantically related titles and authors.
18
Relationship Trees
[Figure: the relationship tree of nodes n1, n2, ..., nk is the subtree
formed by the paths from their lowest common ancestor down to each of
n1, n2, ..., nk.]
19
Our Semantic Relation: Interconnection
  • n1,..., nk are strongly interconnected if the
    relationship tree of n1,..., nk does not contain
    two nodes with the same label
  • n1,..., nk are interconnected if either
  • they are strongly interconnected, or
  • the only nodes with the same label in the
    relationship tree of n1,..., nk are among
    n1,..., nk
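A minimal Python sketch of this definition for a pair of nodes: build the
relationship tree (the path through the lowest common ancestor) and inspect
its labels. The Node class and helper names are illustrative and only
element labels are modeled; checking a single pair this way costs time
proportional to the path length, i.e., O(|T|) in the worst case.

from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    label: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def child(self, label: str) -> "Node":
        c = Node(label, parent=self)
        self.children.append(c)
        return c

def relationship_tree(n1: Node, n2: Node) -> List[Node]:
    """Nodes on the path n1 .. lowest common ancestor .. n2."""
    up1, v = [], n1
    while v is not None:                      # n1 and all of its ancestors
        up1.append(v)
        v = v.parent
    ids1 = {id(u): i for i, u in enumerate(up1)}
    up2, v = [], n2
    while id(v) not in ids1:                  # climb from n2 until an ancestor of n1 (the LCA)
        up2.append(v)
        v = v.parent
    return up1[:ids1[id(v)] + 1] + up2

def strongly_interconnected(n1: Node, n2: Node) -> bool:
    labels = [v.label for v in relationship_tree(n1, n2)]
    return len(labels) == len(set(labels))    # no two nodes share a label

def interconnected(n1: Node, n2: Node) -> bool:
    tree = relationship_tree(n1, n2)
    counts = Counter(v.label for v in tree)
    # a repeated label is allowed only on the query nodes n1 and n2 themselves
    return all(counts[v.label] == 1 for v in tree if v is not n1 and v is not n2)

# The DBLP excerpt: a title and an author are related only within one paper.
proceedings = Node("proceedings")
paper1 = proceedings.child("inproceedings")
title1, author1 = paper1.child("title"), paper1.child("author")
paper2 = proceedings.child("inproceedings")
title2, author2 = paper2.child("title"), paper2.child("author")
print(interconnected(author2, title2))   # True:  same inproceedings
print(interconnected(author2, title1))   # False: different inproceedings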

20
Example (1)
[Figure: a title node and an author node from different inproceedings
elements are circled; the lowest common ancestor of the circled nodes is
proceedings, so their relationship tree contains two inproceedings nodes.]
The circled nodes belong to different inproceedings
entities. They are NEITHER strongly interconnected
NOR interconnected!
21
Example (2)
[Figure: a title node and an author node from the same inproceedings
element are circled; their relationship tree is the path through that
inproceedings node, their lowest common ancestor.]
The circled nodes belong to the same inproceedings
entity. They ARE strongly interconnected and, thus,
interconnected!
22
Example (3)
[Figure: a document tree with a proceedings root and two inproceedings
children. The first has the title "Querying Logical Databases" and the
author "Moshe Y. Vardi"; the second has the title "Queries and Computation
on the Web" and the authors "Serge Abiteboul" and "Victor Vianu". The two
author nodes of the second paper are circled; their lowest common ancestor
is that inproceedings node, and their relationship tree is the path
author - inproceedings - author.]
We can see the advantage of using interconnection
rather than strong interconnection. These two
author nodes ARE semantically related.
Circled nodes belong to the same inproceedings
entity, but are labeled with the same tag. They
ARE interconnected, BUT NOT strongly
interconnected!
23
Interconnection
  • Based on the theoretical results of "Generating
    Relations from XML Documents", S. Cohen, Y. Kanza,
    Y. Sagiv, ICDT 2003
  • Three types of interconnection
  • We have implemented two types of interconnection
  • XSEarch can easily accommodate different types of
    interconnection, or other semantic relations
    between nodes

24
Checking Whether Two Nodes Are Interconnected
  • During query processing, we need to check
    whether pairs of nodes are interconnected
  • Given a document T, it is possible to check
    whether nodes n and n′ are interconnected in
    O(|T|) time
  • Too expensive to do during query processing!

25
Interconnection Index
  • Is built offline
  • Allows for checking interconnection between two
    nodes, during query processing, in O(1) time
  • We have two implementations
  • as a hash table
  • as a symmetric matrix
  • The Indexer is responsible for building the
    Interconnection Index
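A small Python sketch of the two representations just mentioned: a hash
table keyed by unordered pairs of node numbers (storing only the
interconnected pairs), and a symmetric boolean matrix indexed by the nodes'
depth-first numbers. Both answer lookups in O(1); the class names are
illustrative, not XSEarch's.

from typing import List, Set, Tuple

class HashInterconnectionIndex:                    # "IIH"
    def __init__(self) -> None:
        self._pairs: Set[Tuple[int, int]] = set()  # only interconnected pairs are stored
    def add(self, u: int, v: int) -> None:
        self._pairs.add((min(u, v), max(u, v)))
    def interconnected(self, u: int, v: int) -> bool:
        return u == v or (min(u, v), max(u, v)) in self._pairs

class MatrixInterconnectionIndex:                  # "IIM"
    def __init__(self, n_nodes: int) -> None:
        self._m: List[List[bool]] = [[False] * n_nodes for _ in range(n_nodes)]
    def add(self, u: int, v: int) -> None:
        self._m[u][v] = self._m[v][u] = True       # keep the matrix symmetric
    def interconnected(self, u: int, v: int) -> bool:
        return u == v or self._m[u][v]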

26
Indexer
27
Building the Interconnection Index: Naïve Approach
  • For each pair of nodes, check whether the pair
    is interconnected
  • There are O(|T|²) pairs
  • Checking interconnection takes O(|T|) time
  • As a result, checking interconnection of all
    pairs of nodes in T takes O(|T|³) time
  • ⇒ Too expensive even when done offline!

28
Building the Interconnection Index: Dynamic
Programming Approach
  • Idea: checking whether two nodes are
    interconnected can be done by checking
    interconnection between their parents/children
  • There are two characterizations of node
    interconnection:
  • For ancestor-descendant pairs of nodes
  • For pairs of nodes that are not ancestor-descendant

29
Interconnection Characterization: n′ is an
ancestor of n
  • n and n′ are interconnected
  • if and only if
  • the parent of n is strongly
    interconnected with n′, and
  • the child of n′ on the path to n
    is strongly interconnected with n

[Figure: the path from n′ down to n, marking the child of n′ on that path
and the parent of n.]
30
Interconnection Characterization: n′ is not an
ancestor of n
  • n and n′ are interconnected
  • if and only if
  • the parent of n is strongly
    interconnected with n′, and
  • the parent of n′ is strongly
    interconnected with n

[Figure: the paths from the lowest common ancestor down to n and to n′,
marking the parent of n and the parent of n′.]
31
Building the Interconnection Index Using Dynamic
Programming
  • Theorem: Let T be a document. Then it is possible
    to determine interconnection of all pairs of
    nodes in T in O(|T|²) time
  • Proof hint:
  • Assign node numbers in T by a depth-first
    traversal of T
  • Compute the index using dynamic programming,
    based on the characterizations
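A Python sketch of this dynamic-programming construction. It assumes the
two pair characterizations above, plus two consequences of them: a node is
always interconnected with itself and with its parent, and two distinct
nodes are strongly interconnected exactly when they are interconnected and
carry different labels. Names and data structures are illustrative, not
XSEarch's implementation.

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Node:
    label: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    dfs: int = -1                                   # preorder number

    def child(self, label: str) -> "Node":
        c = Node(label, parent=self)
        self.children.append(c)
        return c

def preorder(root: Node) -> List[Node]:
    order, stack = [], [root]
    while stack:
        n = stack.pop()
        n.dfs = len(order)
        order.append(n)
        stack.extend(reversed(n.children))
    return order

def build_index(root: Node) -> Dict[Tuple[int, int], bool]:
    """index[(u.dfs, v.dfs)] with u.dfs < v.dfs: are u and v interconnected?"""
    order = preorder(root)
    index: Dict[Tuple[int, int], bool] = {}

    def inter(u: Node, v: Node) -> bool:            # symmetric O(1) lookup
        return True if u is v else index[(min(u.dfs, v.dfs), max(u.dfs, v.dfs))]

    def strong(u: Node, v: Node) -> bool:           # strongly interconnected
        return True if u is v else (inter(u, v) and u.label != v.label)

    for n in order[1:]:                             # parents come before children
        # Case 1: ancestors of n, walked from the parent upward.
        below, anc = n, n.parent                    # `below` = child of anc on the path to n
        ancestor_ids = set()
        while anc is not None:
            if anc is n.parent:
                ok = True                           # a node and its parent are always interconnected
            else:
                ok = strong(n.parent, anc) and strong(below, n)
            index[(anc.dfs, n.dfs)] = ok
            ancestor_ids.add(anc.dfs)
            below, anc = anc, anc.parent
        # Case 2: previously visited non-ancestors, in preorder.
        for v in order[:n.dfs]:
            if v.dfs in ancestor_ids:
                continue
            index[(v.dfs, n.dfs)] = strong(n.parent, v) and strong(v.parent, n)
    return index

# Tiny check on the DBLP excerpt from the earlier slides.
root = Node("proceedings")
p1 = root.child("inproceedings"); t1 = p1.child("title"); a1 = p1.child("author")
p2 = root.child("inproceedings"); t2 = p2.child("title"); a2 = p2.child("author")
idx = build_index(root)
print(idx[(t2.dfs, a2.dfs)])   # True:  title and author of the same paper
print(idx[(t1.dfs, a2.dfs)])   # False: title and author of different papers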

32
Query Processing
  • Document fragments are extracted using the
    interconnection index and other indices
  • Extracted fragments are returned ranked by the
    estimated relevance
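A brute-force Python sketch of this step: given, for each query term, the
nodes that match it, keep the combinations whose nodes are pairwise
interconnected according to the precomputed index. The exhaustive
enumeration is only an illustration of the intended semantics, not
XSEarch's evaluation algorithm.

from itertools import combinations, product
from typing import Callable, List, Sequence, Tuple

def extract_fragments(
    matches_per_term: Sequence[Sequence[object]],         # candidate nodes, one list per query term
    interconnected: Callable[[object, object], bool],      # O(1) lookup into the index
) -> List[Tuple]:
    answers = []
    for combo in product(*matches_per_term):                # one matching node per query term
        if all(u is v or interconnected(u, v) for u, v in combinations(combo, 2)):
            answers.append(combo)
    return answers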

33
Ranker
34
Ranking Factors
  • Several factors increase the rank of a result
  • Similarity between query and result
  • Weight of labels appearing in the result
  • Characteristics of result tree

35
Query and Result Similarity
  • TFILF
  • An extension of TFIDF, which is classical in IR
  • Term Frequency: the number of occurrences of a query
    term in a fragment
  • Inverse Leaf Frequency: the number of leaves in the
    corpus divided by the number of leaves containing
    the query term

36
TFILF
  • Term frequency tf(k, nl): the number of occurrences
    of keyword k in leaf node nl
  • Inverse leaf frequency ilf(k): grows with the rarity
    of k among the leaves of the corpus

TFILF is the product of the two: tfilf(k, nl) = tf(k, nl) · ilf(k)
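A small illustrative tf·ilf computation over a toy list of leaf texts. The
exact ilf formula used here, log(1 + N / N_k), is an assumption; the precise
weighting may differ, but the structure mirrors classical tf·idf.

import math
from typing import List

leaves: List[str] = [
    "querying logical databases",
    "a web odyssey from codd to xml",
    "logical foundations of databases",
]

def tf(keyword: str, leaf: str) -> int:
    """Number of occurrences of the keyword in one leaf."""
    return leaf.split().count(keyword)

def ilf(keyword: str, corpus_leaves: List[str]) -> float:
    """Higher for keywords that occur in few leaves (assumed log form)."""
    containing = sum(1 for leaf in corpus_leaves if keyword in leaf.split())
    return math.log(1 + len(corpus_leaves) / containing) if containing else 0.0

def tfilf(keyword: str, leaf: str, corpus_leaves: List[str]) -> float:
    return tf(keyword, leaf) * ilf(keyword, corpus_leaves)

print(tfilf("logical", leaves[0], leaves))   # > 0: present in this leaf, fairly rare
print(tfilf("xml", leaves[0], leaves))       # 0.0: absent from this leaf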
37
Weight of Labels
  • Some labels are considered more important than
    others
  • Text under an element labeled title is more
    important than text under an element labeled
    section
  • Label weights can be
  • system generated
  • user defined

38
Relationship between Nodes
  • Size of the relationship tree: a small fragment
    indicates that its nodes are closer, and thus,
    probably, more related

Example query: article title:XML
39
Relationship between Nodes
  • An ancestor-descendant relationship between a pair
    of nodes in a fragment indicates a strong
    relation between these nodes

Example query: section title:XML
40
Experimental Results
41
Hardware and Software Used
  • Language: Java
  • Processor: 1.6 GHz Pentium 4
  • RAM: 2 GB (limited to 1.46 GB by the JVM)
  • OS: Windows XP

42
Choosing the Implementation for the
Interconnection Index
  • We experimented with the two implementations of
    the interconnection index:
  • 1. IIH: the index is a hash table
  • 2. IIM: the index is a symmetric matrix
  • We compare the two implementations on:
  • Cost of building the index
  • Cost of query processing, i.e., using the index

43
Time For Building Indices
  XML corpus   Size (KB)   Number of nodes   IIH time (ms)   IIM time (ms)
  Dream              146             3,360              36              29
  Hamlet             281             6,635             185             114
  Sigmod             704            21,246           1,729           1,552
  Mondial          1,198            49,422           7,837           6,231
  • Both implementations are reasonable
  • IIM is faster than IIH, because IIH pays the
    additional overhead of hashing

44
On the Fly Indexing (OFI)
  • Fully building the indices as a preprocessing step
    is expensive in memory for huge corpora!
  • It is also expensive in time, because of the
    additional overhead of using virtual memory
  • Instead, compute the interconnection index
    incrementally, on the fly, during query processing,
    for each pair that must be checked
  • By how much will query processing be slowed down?
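A Python sketch of on-the-fly indexing as lazy memoization: a pair's
interconnection is computed only the first time some query asks about it
and is then cached for later queries. The pair_check argument is assumed to
be a direct check such as the O(|T|) test sketched earlier; this illustrates
the idea rather than XSEarch's actual implementation.

from typing import Callable, Dict, Tuple

class OnTheFlyIndex:
    def __init__(self, pair_check: Callable[[object, object], bool]) -> None:
        self._check = pair_check
        self._cache: Dict[Tuple[int, int], bool] = {}

    def interconnected(self, u: object, v: object) -> bool:
        key = (min(id(u), id(v)), max(id(u), id(v)))   # unordered pair of node identities
        if key not in self._cache:                     # each pair is computed at most once
            self._cache[key] = self._check(u, v)
        return self._cache[key]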

45
Time For Building Indices: Comparing IIH, IIM, OFI

  XML corpus   Size (KB)   Number of nodes   OFI time (ms)   IIH time (ms)   IIM time (ms)
  Dream              146             3,360             0.6              36              29
  Hamlet             281             6,635             1.1             185             114
  Sigmod             704            21,246             2.2           1,729           1,552
  Mondial          1,198            49,422            10.0           7,837           6,231
For these corpora, OFI time is at most 10 ms.
This is actually the time to build all the indices
other than the interconnection index.
46
Query Execution Time
  • We generated 1000 random queries for the Sigmod
    Record corpus
  • Each query had
  • At most 3 optional search terms
  • At most 3 required search terms
  • We checked time with IIH, IIM and OFI

47
IIH/IIM Query Processing Time
  • Note: logarithmic scale
  • Both approaches lead to similar results
  • Average run time for queries: 35 ms

48
OFI Query Processing Time
  • After processing the 1000 queries, only 0.75% of
    all pairs of nodes had been checked for
    interconnection
  • Space is saved in main memory

The slowdown in response time is not too large!
Locality property: queries tend to be similar in
the parts of the document that they access
  • More than 50% of the queries were processed in
    under 10 ms

49
How Good are the Results?
  • We measured recall and precision for the query:
  • Find papers written by Buneman that contain the
    keyword database in the title
  • We tried two different queries that reflect
    different amounts of user knowledge
  • Kw: Buneman database (a classical search-engine
    query)
  • Tag-kw: author:Buneman title:database
  • Corpora: Sigmod, DBLP

50
Precision and Recall
  • We computed the "correct answers" using XQuery
  • Recall: the fraction of the correct answers that
    are returned
  • ⇒ Perfect recall, i.e., XSEarch returns all the
    correct answers
  • Precision at n: the fraction of the top-n returned
    results that are correct answers
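Tiny Python helpers with the standard definitions of the two measures; the
"correct" set stands for the answers computed with the equivalent XQuery
query.

from typing import Sequence, Set

def recall(results: Sequence, correct: Set) -> float:
    """Fraction of the correct answers that appear anywhere in the results."""
    return len(set(results) & correct) / len(correct)

def precision_at_n(results: Sequence, correct: Set, n: int) -> float:
    """Fraction of the top-n returned results that are correct answers."""
    top = list(results)[:n]
    return (sum(1 for r in top if r in correct) / len(top)) if top else 0.0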

51
Precision at 5, 10 and 20
Sigmod: perfect precision. DBLP: 0.8/0.9 for the
query containing only keywords.
Combining tags and keywords leads to perfect
precision.
52
Conclusions
  • Paradigm for querying XML combining IR and
    database techniques
  • Returns semantically related fragments, ranked by
    estimated relevance
  • Combining tags and keywords in the query leads to
    good results

53
Conclusions
  • Efficient index structures
  • IIM/IIH for small documents
  • OFI for big documents
  • Efficient evaluation algorithms
  • Dynamic algorithm for computing interconnection
  • Extensible implementation
  • The system can easily accommodate different types
    of semantic relations between nodes, other than
    interconnection

54
  • Thank You.
  • Questions?