1
XSEarch: XML Search Engine
Jonathan MAMOU October 2002
2
Motivation
3
XML
  • Getting popular
  • Allows meta-data to be embedded into documents
  • Data-centric view: exchange format for
    structured data and meta-data
  • Document-centric view: content is text and meta-data
  • Querying data and meta-data

4
One Fish Two Fish by John Meyer, Peter Smith. Costs Only 7.95
Buy our Classic Children's books.
Goodnight Moon by Margaret Brown. Costs Only 10.55
Brown Bear by Bill Martin Jr. Costs Only 6.00
amazing.com
5
  • <bookinfo>
  • <book><title>One Fish Two Fish</title>
  • <author>John Meyer</author>
  • <author>Peter Smith</author>
  • <price>7.95</price></book>
  • <book><title>Goodnight Moon</title>
  • <author>Margaret Brown</author>
  • <price>10.55</price></book> ....
  • </bookinfo>

6
A query
  • Find titles and prices of books by Meyer or
    Smith

7
IR Approach
  • How to deal with tags?
  • Discard all tags
  • Simplicity
  • Loss of information (structure) → lower retrieval
    performance
  • Keep tags as keywords
  • How to write the query?
  • Title price book author Meyer Smith

8
IR Approach (contd)
  • Can't specify that Meyer and Smith are the
    authors
  • Can't specify that title, price and author
    belong to the same book
  • Can't specify the desired output (i.e., titles, prices)

9
Database approach
  • FOR $b IN document("bib.xml")//book
  • WHERE $b/author contains "Meyer" OR $b/author
    contains "Smith"
  • RETURN
  • <result>
  • <title> $b/title </title>
  • <price> $b/price </price>
  • </result>
  • Difficult for naive user
  • Requires knowledge of document structure
  • Dependent on document structure

10
Our Goal
  • Combine IR and database techniques: tags + text
  • Simple language
  • Logical Structure, not physical
  • Require knowledge of tag names, not structure
  • Queries should work even if structure changes
  • Rank results

11
Framework
12
Tree Representation
[Figure: XML tree rooted at bookinfo with two book nodes, each with title, author, and price children. Leaf values shown: Just Lost, 5.75, Brown Bear, 13.95, Mercy Meyer, Gina Meyer.]
We need to find tuples of related title and price nodes.
13
Another Tree Representation
[Figure: XML tree rooted at bookinfo with node labels author, name, book, title, and price. Leaf values shown: Dr. Meyer, M. Brown, Goodnight Moon, One Fish Two Fish, Cat in the Hat, 12.50, 14.95.]
Similar document, but with a different hierarchical structure from the previous one. We need to find tuples of related title, author and price nodes.
14
Interconnection
Consider a title node and a price node.
[Figure: the lowest common ancestor of the circled nodes.]
Intuition: the nodes belong to different book entities.
15
Interconnection (contd)
[Figure: the lowest common ancestor of the circled nodes. Labels shown: title, Just Lost, 13.95.]
Intuition: the nodes belong to the same book entity.
16
Interconnection (contd)
[Figure labels shown: title, Just Lost, 13.95.]
Intuition: the nodes belong to the same book entity.
17
Relationship tree
  • Nodes: n1, n2
  • n: their lowest common ancestor
  • Tn: the subtree rooted at n
  • The relationship tree of n1, n2 is the tree
    obtained by pruning from Tn all nodes, other than
    n1 and n2, that are not ancestors of n1 or n2
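
The definition above can be sketched in a few lines of Python. This is only an illustration, not the presenter's implementation: the `parent` dictionary, the node ids, and the function names are assumptions made for the example. The relationship tree is returned as the set of nodes on the two paths from n1 and n2 up to their lowest common ancestor.

```python
def path_to_root(node, parent):
    """Return the path from node up to the root, node first."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def relationship_tree(n1, n2, parent):
    """Return (lca, nodes) where nodes is the node set of the relationship tree."""
    path1 = path_to_root(n1, parent)
    path2 = path_to_root(n2, parent)
    on_path2 = set(path2)
    # The first node on n1's upward path that also lies on n2's path is the LCA.
    lca = next(n for n in path1 if n in on_path2)
    nodes = set(path1[:path1.index(lca) + 1]) | set(path2[:path2.index(lca) + 1])
    return lca, nodes


# Hypothetical tree: a bookinfo root with two books (ids are illustrative only).
parent = {
    "bookinfo": None,
    "book1": "bookinfo", "book2": "bookinfo",
    "title1": "book1", "price1": "book1",
    "title2": "book2", "price2": "book2",
}

same_book = relationship_tree("title1", "price1", parent)
other_book = relationship_tree("title1", "price2", parent)
```

Here `same_book` contains only the title, the price, and their shared book node, while `other_book` climbs to the root and therefore includes both book nodes.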

18
Interconnection
  • We say that n1, n2 are interconnected if
  • the relationship tree does not contain two distinct
    nodes with the same label
  • or
  • the relationship tree contains exactly one pair
    of distinct nodes with the same label, and this
    pair consists of n1 and n2
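
A direct check of this definition might look like the following sketch. It assumes a tree given as hypothetical `parent` and `label` dictionaries and simply counts labels in the relationship tree; it makes no attempt to be efficient.

```python
from collections import Counter


def path_to_root(node, parent):
    """Return the path from node up to the root, node first."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def interconnected(n1, n2, parent, label):
    """Apply the two-case definition to the relationship tree of n1 and n2."""
    p1, p2 = path_to_root(n1, parent), path_to_root(n2, parent)
    on_p2 = set(p2)
    lca = next(n for n in p1 if n in on_p2)
    nodes = set(p1[:p1.index(lca) + 1]) | set(p2[:p2.index(lca) + 1])
    counts = Counter(label[n] for n in nodes)
    repeated = [c for c in counts.values() if c > 1]
    if not repeated:
        return True  # case 1: no two distinct nodes share a label
    # case 2: exactly one same-label pair, and that pair is {n1, n2}
    return n1 != n2 and repeated == [2] and label[n1] == label[n2]


# Hypothetical tree: book1 has two authors, book2 has one title and one price.
parent = {"root": None, "b1": "root", "b2": "root",
          "t1": "b1", "a1": "b1", "a2": "b1", "p1": "b1",
          "t2": "b2", "p2": "b2"}
label = {"root": "bookinfo", "b1": "book", "b2": "book",
         "t1": "title", "a1": "author", "a2": "author", "p1": "price",
         "t2": "title", "p2": "price"}

title_price_same_book = interconnected("t1", "p1", parent, label)
title_price_other_book = interconnected("t1", "p2", parent, label)
two_authors_same_book = interconnected("a1", "a2", parent, label)
```

The two authors of the same book are interconnected through the second case: the only repeated label in their relationship tree is their own.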

19
All-Pairs Interconnection
  • A set of nodes is all-pairs interconnected if
    every pair of nodes in the set is interconnected

20
Star interconnection
The 2 names are not interconnected
21
Star Interconnection (contd)
  • A set of nodes is star interconnected if all the
    nodes in the set are interconnected to the same
    node
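
The two set-level notions can be contrasted with a small sketch. It assumes a pairwise predicate `connected(a, b)` is already available, and it assumes the star center is itself a member of the set, which the slide does not state explicitly. The node names and the set of connected pairs are made up for the example.

```python
from itertools import combinations


def all_pairs_interconnected(nodes, connected):
    """True if every pair of distinct nodes in the set is interconnected."""
    return all(connected(a, b) for a, b in combinations(nodes, 2))


def star_interconnected(nodes, connected):
    """True if some node of the set (the star center, an assumption here)
    is interconnected with every other node of the set."""
    return any(all(connected(c, n) for n in nodes if n != c) for c in nodes)


# Hypothetical pairwise relation: the title is interconnected with both names,
# but the two names are not interconnected with each other.
pairs = {frozenset(("title", "name1")), frozenset(("title", "name2"))}


def connected(a, b):
    return frozenset((a, b)) in pairs


nodes = ["title", "name1", "name2"]
is_all_pairs = all_pairs_interconnected(nodes, connected)
is_star = star_interconnected(nodes, connected)
```

In this example the set is star interconnected (with the title as center) but not all-pairs interconnected.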

22
Search terms, Search query
  • Search term: (l,k)
  • l: label (context)
  • k: keyword
  • Search query: AND L1 OR L2
  • L1, L2: lists of search terms
  • AND (title,) (price,)
  • OR (author,Meyer) (author,Smith)
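
One possible in-memory representation of such a query is sketched below. The class names, the field names `required` and `optional`, and the simplified whitespace-based keyword matching are all assumptions for illustration; the slides do not specify how terms are matched against node content.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SearchTerm:
    label: Optional[str] = None    # l: the tag name, or None if omitted
    keyword: Optional[str] = None  # k: the keyword, or None if omitted

    def matches(self, node_label: str, node_text: str) -> bool:
        """Simplified matching: label equality plus case-insensitive word match."""
        label_ok = self.label is None or self.label == node_label
        words = node_text.lower().split()
        keyword_ok = self.keyword is None or self.keyword.lower() in words
        return label_ok and keyword_ok


@dataclass(frozen=True)
class SearchQuery:
    required: Tuple[SearchTerm, ...]  # the AND list L1
    optional: Tuple[SearchTerm, ...]  # the OR list L2


# The query from the slide: titles and prices of books by Meyer or Smith.
query = SearchQuery(
    required=(SearchTerm("title"), SearchTerm("price")),
    optional=(SearchTerm("author", "Meyer"), SearchTerm("author", "Smith")),
)
```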

23
Answer
  • AND N1 OR N2
  • N1, N2 are lists of nodes
  • Matching between N1, N2 and L1, L2
  • N1 and N2 are interconnected
  • All all-pairs answers are star answers
  • Maximal answer

24
Example
[Figure: the bookinfo tree from the Tree Representation slide, with two book nodes and leaf values Just Lost, 5.75, Brown Bear, 13.95, Mercy Meyer, Gina Meyer.]
Query terms: (title,) (price,) (author,Meyer)
Find matchings of title, author and price to the nodes in the tree.
25
Computing answers
  • All-pairs:
  • Determining whether the set of answers is empty
    is NP-complete
  • If L1 is empty, computing the set of answers is
    polynomial in the size of the input and output
  • Star:
  • Computing the set of answers is polynomial in the
    size of the input and output

26
Ranking results
  • Unstructured
  • Keyword weight (tf·ilf)
  • Tags weight
  • Result size
  • Structured
  • Nodes distance
  • Ancestor-descendant

27
Keyword Weight
  • Compute the weight of a keyword k within a given
    node n
  • Variation of tf·idf, one of the metrics of the
    Vector Space Model (a classical model in IR)

28
Keyword Weight (contd)
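  • Term Frequency (tf): number of appearances of k
    within n, relative to the most frequent keyword in n
  • $\mathrm{tf}(k,n) = \mathrm{occ}(k,n) / \max_{k'} \mathrm{occ}(k',n)$
  • Inverse Leaf Frequency (ilf): inverse frequency
    of k among all the leaves in the corpus
  • $\mathrm{ilf}(k) = \log(1 + N/N_k)$, where $N$ is the number of leaves and $N_k$ the number of leaves containing k
  • $W(k,n) = \mathrm{tf}(k,n) \cdot \mathrm{ilf}(k)$
  • Normalized per leaf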
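
As a rough illustration of the weighting above, the sketch below computes tf, ilf, and their product for a toy corpus. The input format (a dictionary from leaf id to a token list) and the function name are assumptions, and the final per-leaf normalization step is deliberately left out because the slide does not say how it is done.

```python
import math
from collections import Counter


def keyword_weights(leaves):
    """leaves: dict leaf id -> list of keyword tokens.
    Returns dict (keyword, leaf id) -> tf * ilf, without normalization."""
    n_leaves = len(leaves)
    # N_k: number of leaves in which keyword k appears at least once.
    leaf_freq = Counter()
    for tokens in leaves.values():
        leaf_freq.update(set(tokens))

    weights = {}
    for leaf_id, tokens in leaves.items():
        occ = Counter(tokens)
        if not occ:
            continue
        max_occ = max(occ.values())
        for keyword, count in occ.items():
            tf = count / max_occ
            ilf = math.log(1 + n_leaves / leaf_freq[keyword])
            weights[(keyword, leaf_id)] = tf * ilf
    return weights


# Hypothetical leaves: two titles and one author name.
leaves = {
    "title1": ["brown", "bear"],
    "author1": ["margaret", "brown"],
    "title2": ["goodnight", "moon"],
}
weights = keyword_weights(leaves)
```

In this corpus "bear" occurs in one leaf out of three and "brown" in two, so "bear" gets the higher weight within `title1`.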

29
Tag Weight
  • Give weight to tags according to their importance
  • E.g., give more weight to <title> than to
    <abstract>

30
Result Size
  • Number of search terms appearing in the result
    (OR part)

31
Ranking-Structured
  • Node distance: size of the relationship tree
  • Ancestor-descendant relationship: more interconnected

32
System overview
33
XSEarch overview
[Figure: architecture diagram. Offline: the Indexer processes the XML corpus with logical hierarchy. Online: a query is passed to Search, which returns Results.]
34
Document Location array
  • Generate a unique id, did, for each document
  • Associate each did with the physical location of
    the corresponding document
  • Logical structure of the corpus

35
Node Encoding Array
  • Generate an id, nid, for each interior node
  • Node encoding
  • Defined recursively
  • Node encoding of its parent
  • Index of the node among its siblings
  • E.g., 13.8.1.9
  • Associate each nid with its node encoding
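
A recursive encoding of this kind makes some structural questions answerable from the encodings alone. The sketch below is an assumption about how such encodings could be used, not something the slides state: it represents an encoding as a tuple of integers and derives the parent, an ancestor test, and the lowest common ancestor by prefix comparison. All function names are hypothetical.

```python
def parse_encoding(text):
    """Turn '13.8.1.9' into (13, 8, 1, 9)."""
    return tuple(int(part) for part in text.split("."))


def format_encoding(encoding):
    """Turn (13, 8, 1, 9) back into '13.8.1.9'."""
    return ".".join(str(part) for part in encoding)


def child_encoding(parent_encoding, sibling_index):
    """Encoding of a node: its parent's encoding plus its index among siblings."""
    return parent_encoding + (sibling_index,)


def is_ancestor(a, b):
    """True if a is a proper ancestor of b, i.e. a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


def lowest_common_ancestor(a, b):
    """Longest common prefix of two encodings from the same tree."""
    common = []
    for x, y in zip(a, b):
        if x != y:
            break
        common.append(x)
    return tuple(common)


node = parse_encoding("13.8.1.9")
parent_of_node = node[:-1]
lca = lowest_common_ancestor(node, parse_encoding("13.8.4"))
```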

36
Node Label Array
  • Associate each nid with its label

37
Inverted Tag Index
  • For each tag, keep
  • posting list: list of nodes labeled with this tag
  • weight

38
Inverted Keyword Index
  • For each keyword, keep
  • posting list: list of leaves containing this
    keyword
  • weight of the keyword within the leaf (tf·ilf)
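
The two inverted indexes could be held in plain dictionaries, as in the sketch below. The input shapes, the function names, the default tag weight, and all numeric weights are hypothetical; the slides only say what each index stores.

```python
from collections import defaultdict


def build_tag_index(node_labels, tag_weights, default_weight=1.0):
    """node_labels: dict nid -> tag. Returns tag -> {'weight', 'postings'}."""
    postings = defaultdict(list)
    for nid in sorted(node_labels):
        postings[node_labels[nid]].append(nid)
    return {
        tag: {"weight": tag_weights.get(tag, default_weight), "postings": nids}
        for tag, nids in postings.items()
    }


def build_keyword_index(leaf_keyword_weights):
    """leaf_keyword_weights: dict (keyword, leaf nid) -> weight.
    Returns keyword -> list of (leaf nid, weight), ordered by nid."""
    index = defaultdict(list)
    for (keyword, nid), weight in sorted(leaf_keyword_weights.items()):
        index[keyword].append((nid, weight))
    return dict(index)


# Hypothetical inputs; the weights are made-up numbers.
node_labels = {1: "book", 2: "title", 3: "author", 4: "title"}
tag_index = build_tag_index(node_labels, {"title": 2.0})
keyword_index = build_keyword_index(
    {("brown", 2): 0.9, ("bear", 2): 1.4, ("brown", 3): 0.9}
)
```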

39
Node Interconnection Matrix
  • Element (i,j) contains
  • 1, if ni and nj are interconnected
  • 0, otherwise
  • n×n symmetric sparse matrix
  • Computed by dynamic programming

40
Alternative
  • Hash set: keep only interconnected node pairs
  • Key: pair (ni, nj)
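
A minimal sketch of the hash-set alternative follows. Because interconnection is symmetric, each pair is stored once under a canonical ordering; the class name and method names are assumptions for illustration.

```python
class InterconnectionSet:
    """Stores only the interconnected pairs, keyed by an ordered (ni, nj) tuple."""

    def __init__(self):
        self._pairs = set()

    @staticmethod
    def _key(ni, nj):
        # Canonical order so that (ni, nj) and (nj, ni) share one entry.
        return (ni, nj) if ni <= nj else (nj, ni)

    def add(self, ni, nj):
        self._pairs.add(self._key(ni, nj))

    def connected(self, ni, nj):
        return self._key(ni, nj) in self._pairs

    def __len__(self):
        return len(self._pairs)


# Hypothetical node ids.
pairs = InterconnectionSet()
pairs.add(7, 3)
pairs.add(3, 7)  # same pair, stored only once
pairs.add(3, 9)
```

Compared with an n×n matrix, memory use grows with the number of interconnected pairs rather than with the square of the number of nodes.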

41
Interconnection
  • Let n be the number of nodes
  • It is possible to determine whether n1 and n2 are
    interconnected in O(n) time
  • It is possible to determine interconnection of
    all pairs in O(n^2) time
  • Offline/Online computation

42
Interconnection
  • for (i = size-1; i >= 0; i--)
  • for (j = i+1; j < size; j++)
  • if i ancestor of j
  • connected(iChild,j) AND connected(i,jFather) AND
  • labelIChild != labelJ AND labelI != labelJFather
  • for (j = i+1; j < size; j++)
  • if i not ancestor of j
  • connected(i,jFather) AND connected(iFather,j) AND
  • labelI != labelJFather AND labelIFather != labelJ
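
The recurrence above can be tried out with the sketch below. It is not the presenter's code: it swaps the explicit loops for memoized recursion so that evaluation order is handled automatically, it identifies nodes by tuple encodings (an assumption that makes parent, child-on-path, and ancestor tests simple prefix operations), and it adds an explicit base case for parent-child pairs, which the slide does not show. All nodes are assumed to belong to a single tree.

```python
from functools import lru_cache


def make_connected(labels):
    """labels: dict mapping node encodings (tuples of ints) to labels.
    Returns a memoized function connected(i, j)."""

    def is_ancestor(a, b):
        return len(a) < len(b) and b[:len(a)] == a

    @lru_cache(maxsize=None)
    def connected(i, j):
        if i == j:
            return True
        if is_ancestor(j, i):
            i, j = j, i  # make i the ancestor
        if is_ancestor(i, j):
            j_father = j[:-1]
            if j_father == i:
                return True  # base case: a parent and its child
            i_child = j[:len(i) + 1]  # child of i on the path to j
            return (connected(i_child, j) and connected(i, j_father)
                    and labels[i_child] != labels[j]
                    and labels[i] != labels[j_father])
        i_father, j_father = i[:-1], j[:-1]
        return (connected(i, j_father) and connected(i_father, j)
                and labels[i] != labels[j_father]
                and labels[i_father] != labels[j])

    return connected


# Hypothetical tree: book (1,1) has two authors, book (1,2) has a title and a price.
labels = {
    (1,): "bookinfo",
    (1, 1): "book", (1, 1, 1): "title", (1, 1, 2): "author",
    (1, 1, 3): "author", (1, 1, 4): "price",
    (1, 2): "book", (1, 2, 1): "title", (1, 2, 2): "price",
}
connected = make_connected(labels)
same_book = connected((1, 1, 1), (1, 1, 4))
two_authors = connected((1, 1, 2), (1, 1, 3))
across_books = connected((1, 1, 1), (1, 2, 2))
```

Each recursive call works on a strictly smaller relationship tree, so the recursion terminates, and the cache ensures every pair is evaluated at most once.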

43
Demo