Title: XSEarch XML Search Engine
1 XSEarch XML Search Engine
Jonathan MAMOU October 2002
2 Motivation
3 XML
- Getting popular
- Allows meta-data to be embedded into documents
- Data-centric view: exchange format for structured data and meta-data
- Document-centric view: content (text, meta-data)
- Querying data and meta-data
4 amazing.com
Buy our Classic Children's books.
- One Fish Two Fish by John Meyer, Peter Smith. Costs Only 7.95
- Goodnight Moon by Margaret Brown. Costs Only 10.55
- Brown Bear by Bill Martin Jr. Costs Only 6.00
5
<bookinfo>
  <book><title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price></book>
  <book><title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price></book> ....
</bookinfo>
6 A query
- Find titles and prices of books by Meyer or Smith
7 IR Approach
- How to deal with tags?
  - Discard all tags
    - Simplicity
    - Loss of information (structure) → lower retrieval performance
  - Keep tags as keywords
- How to write the query?
  - Title price book author Meyer Smith
8 IR Approach (contd)
- Can't specify that Meyer and Smith are the authors
- Can't specify that title, price and author belong to the same book
- Can't specify the desired output (i.e., titles, prices)
9 Database approach
FOR $b IN document("bib.xml")//book
WHERE $b/author contains "Meyer" OR $b/author contains "Smith"
RETURN
  <result>
    <title> $b/title </title>
    <price> $b/price </price>
  </result>
- Difficult for a naive user
- Requires knowledge of document structure
- Dependent on document structure
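To make the dependence on structure concrete, here is a rough Python analogue of the query above. It is only a sketch using the standard `xml.etree.ElementTree` module; the function name `titles_and_prices` and the inline sample document are illustrative assumptions, not part of XSEarch.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the bookinfo example; the variable name is illustrative.
BIB = """
<bookinfo>
  <book><title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price></book>
  <book><title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price></book>
</bookinfo>
"""

def titles_and_prices(xml_text, names):
    """Return (title, price) for every book with an author containing one of names."""
    root = ET.fromstring(xml_text)
    results = []
    for book in root.iter("book"):
        # Hard-wired assumption: author, title and price are children of book.
        authors = [a.text or "" for a in book.findall("author")]
        if any(name in author for author in authors for name in names):
            results.append((book.findtext("title"), book.findtext("price")))
    return results

results = titles_and_prices(BIB, ["Meyer", "Smith"])
```

The query only works because the code knows that `author`, `title` and `price` sit directly under `book`; a document that nests books under authors would need a different program.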
10 Our Goal
- Combine IR and database techniques (tags and text)
- Simple language
- Logical Structure, not physical
- Require knowledge of tag names, not structure
- Queries should work even if structure changes
- Rank results
11 Framework
12 Tree Representation
[Figure: tree with root bookinfo and two book subtrees; the node labels shown are title, author and price, with leaf values Just Lost, Brown Bear, 5.75, 13.95, Mercy Meyer and Gina Meyer.]
We need to find tuples of related title and price nodes.
13 Another Tree Representation
[Figure: tree with root bookinfo; the node labels shown are author (2), name (2), book (3), title (3) and price (2), with leaf values Dr. Meyer, M. Brown, Goodnight Moon, One Fish Two Fish, Cat in the Hat, 12.50 and 14.95.]
Similar document, but with a different hierarchical structure from the previous one. We need to find tuples of related title, author and price nodes.
14 Interconnection
Consider a title node and a price node.
[Figure: the lowest common ancestor of the circled nodes]
Intuition: the nodes belong to different book entities.
15 Interconnection (contd)
[Figure: the lowest common ancestor of the circled nodes, the title "Just Lost" and the price 13.95]
Intuition: the nodes belong to the same book entity.
16 Interconnection (contd)
[Figure: the title "Just Lost" and the price 13.95]
Intuition: the nodes belong to the same book entity.
17 Relationship tree
- Nodes n1, n2
- n: their lowest common ancestor
- Tn: the subtree rooted at n
- The relationship tree of n1, n2 is the tree obtained by pruning from Tn all nodes, other than n1 and n2, that are not ancestors of n1 or n2
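The pruning described above leaves exactly the two paths from the lowest common ancestor down to n1 and n2. A minimal Python sketch of that idea follows; the `Node` class with a `parent` pointer and the function names are assumptions made for illustration, and both nodes are assumed to belong to the same tree.

```python
class Node:
    """Hypothetical tree node: a label plus a pointer to the parent."""
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent

def path_to_root(node):
    """Return the list [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def relationship_tree(n1, n2):
    """Nodes of the relationship tree: the LCA, then the branch to n1, then the branch to n2."""
    up1, up2 = path_to_root(n1), path_to_root(n2)
    on_path2 = set(up2)
    # The first ancestor-or-self of n1 that also lies above n2 is the LCA.
    lca = next(a for a in up1 if a in on_path2)
    return [lca] + up1[:up1.index(lca)] + up2[:up2.index(lca)]

# Small tree in the spirit of the bookinfo example.
root = Node("bookinfo")
book1, book2 = Node("book", root), Node("book", root)
title1, price1 = Node("title", book1), Node("price", book1)
price2 = Node("price", book2)

same_book = [n.label for n in relationship_tree(title1, price1)]
other_book = [n.label for n in relationship_tree(title1, price2)]
```

For a title and price of the same book the tree contains only `book`, `title` and `price`; for a title and price of different books it also contains `bookinfo` and both `book` nodes.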
18 Interconnection
- We say that n1, n2 are interconnected if
  - the relationship tree does not contain 2 distinct nodes with the same label, or
  - the relationship tree contains exactly one pair of distinct nodes with the same label, and this pair consists of n1 and n2
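The two conditions can be checked by counting labels in the relationship tree. The sketch below assumes the relationship tree is already available as a mapping from node id to label; that representation and the function name `interconnected` are illustrative choices, not the system's actual data structures.

```python
from collections import Counter

def interconnected(tree_labels, n1, n2):
    """tree_labels maps each node id of the relationship tree of n1, n2 to its label."""
    counts = Counter(tree_labels.values())
    repeated = [label for label, c in counts.items() if c > 1]
    if not repeated:
        # Condition 1: no two distinct nodes share a label.
        return True
    # Condition 2: exactly one pair shares a label, and that pair is (n1, n2).
    return (len(repeated) == 1
            and counts[repeated[0]] == 2
            and n1 != n2
            and tree_labels[n1] == tree_labels[n2] == repeated[0])

# Title and price of the same book.
same_book = {"b1": "book", "t1": "title", "p1": "price"}
# Title of one book, price of another: the two book nodes clash.
two_books = {"r": "bookinfo", "b1": "book", "t1": "title", "b2": "book", "p2": "price"}
# Two sibling books: the only repeated label belongs to the pair itself.
siblings = {"r": "bookinfo", "b1": "book", "b2": "book"}
```

Here `interconnected(same_book, "t1", "p1")` and `interconnected(siblings, "b1", "b2")` hold, while `interconnected(two_books, "t1", "p2")` does not.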
19 All-Pairs Interconnection
- A set of nodes is all-pairs interconnected if every pair of nodes in the set is interconnected
20 Star Interconnection
[Figure] The 2 names are not interconnected
21 Star Interconnection (contd)
- A set of nodes is star interconnected if all the nodes in the set are interconnected to the same node
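The difference between the two notions is easy to see in code. This sketch takes the pairwise test as a parameter `connected(a, b)`, and it assumes that the common node of a star (the "center") is itself a member of the set; both the function names and that reading of the definition are assumptions.

```python
from itertools import combinations

def all_pairs_interconnected(nodes, connected):
    """Every pair of nodes in the set must be interconnected."""
    return all(connected(a, b) for a, b in combinations(nodes, 2))

def star_interconnected(nodes, connected):
    """Some node of the set (assumed center) is interconnected with all the others."""
    return any(all(connected(c, n) for n in nodes if n != c) for c in nodes)

# Toy symmetric relation: "a" is interconnected with "b" and "c", but "b" and "c" are not.
pairs = {frozenset(("a", "b")), frozenset(("a", "c"))}

def connected(x, y):
    return frozenset((x, y)) in pairs

nodes = ["a", "b", "c"]
is_star = star_interconnected(nodes, connected)
is_all_pairs = all_pairs_interconnected(nodes, connected)
```

With this relation the set is star interconnected through `"a"` but not all-pairs interconnected, which mirrors the situation where two name nodes are each related to a third node but not to each other.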
22 Search terms, Search query
- Search term: (l, k)
  - l: label (context)
  - k: keyword
- Search query: AND L1 OR L2
  - L1, L2: lists of search terms
  - AND (title, ) (price, )
  - OR (author, Meyer) (author, Smith)
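One possible in-memory form of such a query is sketched below. The class names, the use of `None` for an empty keyword, and the matching rule in `term_matches` (exact label match, case-insensitive whole-word keyword match) are all assumptions for illustration; the slides do not specify them.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class SearchTerm:
    label: Optional[str]    # l: the tag name (context), None if unspecified
    keyword: Optional[str]  # k: the keyword, None if unspecified

@dataclass
class SearchQuery:
    required: List[SearchTerm] = field(default_factory=list)  # the AND list L1
    optional: List[SearchTerm] = field(default_factory=list)  # the OR list L2

def term_matches(term, node_label, node_text):
    """Assumed matching rule: label equal (if given) and keyword among the words (if given)."""
    label_ok = term.label is None or term.label == node_label
    words = node_text.lower().split()
    keyword_ok = term.keyword is None or term.keyword.lower() in words
    return label_ok and keyword_ok

# The example query: AND (title, ) (price, )  OR (author, Meyer) (author, Smith)
query = SearchQuery(
    required=[SearchTerm("title", None), SearchTerm("price", None)],
    optional=[SearchTerm("author", "Meyer"), SearchTerm("author", "Smith")],
)
```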
23 Answer
- AND N1 OR N2
- N1, N2 are lists of nodes
- Matching between N1, N2 and L1, L2
- N1 and N2 are interconnected
- All all-pairs answers are star answers
- Maximal answer
24 Example
[Figure: tree with root bookinfo and two book subtrees; the node labels shown are title, author and price, with leaf values Just Lost, Brown Bear, 5.75, 13.95, Mercy Meyer and Gina Meyer.]
(title, ) (price, ) (author, Meyer)
Find matchings of title, author and price to the nodes in the tree.
25 Computing answers
- All-pairs
  - Determining whether the set of answers is empty is NP-complete
  - If L1 is empty, computing the set of answers is polynomial in the size of the input and output
- Star
  - Computing the set of answers is polynomial in the size of the input and output
26 Ranking results
- Unstructured
  - Keyword weight (tf·ilf)
  - Tags weight
  - Result size
- Structured
  - Nodes distance
  - Ancestor-descendant
27 Keyword Weight
- Compute the weight of a keyword k within a given node n
- Variation of tf·idf, one of the metrics of the Vector Space Model (a classical model in IR)
28 Keyword Weight (contd)
- Term Frequency (tf): number of appearances of k within n
  - $tf(k,n) = occ(k,n) / \max_{k'} occ(k',n)$, with the maximum taken over the keywords k' of n
- Inverse Leaf Frequency (ilf): inverse frequency of k among all the leaves in the corpus
  - $ilf(k) = \log(1 + N/N_k)$, where N is the number of leaves and N_k the number of leaves containing k
- $W(k,n) = tf(k,n) \cdot ilf(k)$
- Normalized per leaf
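A small worked computation of these weights, as a sketch: it assumes the corpus is given as a mapping from leaf id to the list of keywords in that leaf, that the maximum in tf is taken over the keywords of the same leaf, and it leaves out the final per-leaf normalization. The function name `keyword_weights` is illustrative.

```python
import math
from collections import Counter

def keyword_weights(leaves):
    """leaves: dict leaf id -> list of keywords. Returns {(keyword, leaf id): tf * ilf}."""
    n_leaves = len(leaves)
    # N_k: number of leaves in which keyword k appears at least once.
    leaf_freq = Counter()
    for words in leaves.values():
        leaf_freq.update(set(words))

    weights = {}
    for leaf_id, words in leaves.items():
        occ = Counter(words)
        if not occ:
            continue  # an empty leaf contributes no weights
        max_occ = max(occ.values())
        for k, count in occ.items():
            tf = count / max_occ
            ilf = math.log(1 + n_leaves / leaf_freq[k])
            weights[(k, leaf_id)] = tf * ilf
    return weights

corpus = {1: ["brown", "bear", "brown"], 2: ["goodnight", "moon"]}
w = keyword_weights(corpus)
```

In this toy corpus "brown" is the most frequent keyword of leaf 1 (tf = 1) and occurs in one of two leaves, so its weight is log(1 + 2/1) = log 3; "bear" gets half of that.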
29 Tag Weight
- Give weight to tags according to their importance
- E.g., give more weight to <title> than to <abstract>
30 Result Size
- Number of search terms appearing in the result (OR part)
31 Ranking: Structured
- Nodes distance
  - size of the relationship tree
- Ancestor-descendant relationship
  - more interconnected
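32 System overview
33 XSEarch overview
[Figure: architecture diagram with an offline part and an online part; the components shown are the XML corpus with logical hierarchy, the Indexer, the query, Search and Results.]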
34 Document Location Array
- Generate a unique id, did, for each document
- Associate each did with the physical location of the corresponding document
- Logical structure of the corpus
35 Node Encoding Array
- Generate for each interior node an id, nid
- Node encoding
  - Defined recursively
    - Node encoding of its parent
    - Index of the node among its siblings
  - E.g., 13.8.1.9
- Associate each nid with its node encoding
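An encoding of this kind makes ancestor tests and lowest-common-ancestor lookups simple prefix operations. The sketch below stores an encoding as a tuple of integers; the helper names are hypothetical, and using the encoding for these two operations is an assumption about its purpose rather than something the slides state.

```python
def parse_encoding(text):
    """'13.8.1.9' -> (13, 8, 1, 9)"""
    return tuple(int(part) for part in text.split("."))

def format_encoding(encoding):
    """(13, 8, 1, 9) -> '13.8.1.9'"""
    return ".".join(str(part) for part in encoding)

def child_encoding(parent_encoding, sibling_index):
    """Recursive definition: the parent's encoding followed by the index among siblings."""
    return parent_encoding + (sibling_index,)

def is_ancestor(a, b):
    """a is a proper ancestor of b iff a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a

def lowest_common_ancestor(a, b):
    """The longest common prefix of the two encodings."""
    common = []
    for x, y in zip(a, b):
        if x != y:
            break
        common.append(x)
    return tuple(common)

node = parse_encoding("13.8.1.9")
lca = lowest_common_ancestor(node, parse_encoding("13.8.4"))
```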
36 Node Label Array
- Associate each nid with its label
37 Inverted Tag Index
- For each tag, keep
  - posting list: list of nodes labeled with this tag
  - weight
38 Inverted Keyword Index
- For each keyword, keep
  - posting list: list of leaves containing this keyword
  - weight of the keyword within the leaf (tf·ilf)
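A minimal sketch of how such a keyword index could be built in memory. The input format (leaf id mapped to its keywords), the function name, and passing the weight as a callback are assumptions for illustration; a tag index would be built the same way, keyed by tag and listing node ids.

```python
from collections import defaultdict

def build_keyword_index(leaves, weight):
    """leaves: dict leaf id -> list of keywords; weight(k, leaf_id) -> float.

    Returns {keyword: [(leaf id, weight of the keyword within that leaf), ...]}.
    """
    index = defaultdict(list)
    for leaf_id, words in leaves.items():
        for k in sorted(set(words)):  # one posting per (keyword, leaf)
            index[k].append((leaf_id, weight(k, leaf_id)))
    return dict(index)

leaves = {7: ["brown", "bear", "brown"], 9: ["brown"]}
# Placeholder weight: raw occurrence count instead of the real tf·ilf value.
index = build_keyword_index(leaves, lambda k, leaf_id: float(leaves[leaf_id].count(k)))
```

Looking up `index["brown"]` then returns the posting list for that keyword directly, without scanning the corpus.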
39 Node Interconnection Matrix
- Element (i, j) contains
  - 1, if ni and nj are interconnected
  - 0, otherwise
- n × n symmetric sparse matrix
- Dynamic programming
40 Alternative
- Hash set: keep only interconnected node pairs
- Key: pair (ni, nj)
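The hash-set alternative can be sketched as follows. Because the relation is symmetric, each pair is stored once in a canonical order; the class name and the convention that a node is interconnected with itself are assumptions.

```python
class InterconnectionSet:
    """Stores only the interconnected pairs, keyed by the ordered pair (ni, nj)."""

    def __init__(self):
        self._pairs = set()

    def add(self, ni, nj):
        # Canonical order so that (ni, nj) and (nj, ni) share one key.
        self._pairs.add((min(ni, nj), max(ni, nj)))

    def connected(self, ni, nj):
        # Assumption: a node counts as interconnected with itself.
        return ni == nj or (min(ni, nj), max(ni, nj)) in self._pairs

    def __len__(self):
        return len(self._pairs)

store = InterconnectionSet()
store.add(4, 2)
store.add(2, 4)  # same pair, stored once
store.add(2, 3)
```

Memory then grows with the number of interconnected pairs rather than with the n × n size of the full matrix.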
41 Interconnection
- Let n be the number of nodes
- It is possible to determine whether n1 and n2 are interconnected in O(n) time
- It is possible to determine interconnection of all pairs in O(n^2) time
- Offline/Online computation
42 Interconnection
for (i = size-1; i >= 0; i--)
  for (j = i+1; j < size; j++)
    if i is an ancestor of j
      connected(i,j) = connected(iChild,j) AND connected(i,jFather) AND
                       labelIChild != labelJ AND labelI != labelJFather
  for (j = i+1; j < size; j++)
    if i is not an ancestor of j
      connected(i,j) = connected(i,jFather) AND connected(iFather,j) AND
                       labelI != labelJFather AND labelIFather != labelJ
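The recurrences above can be turned into a runnable Python sketch under a few assumptions that the slide does not spell out: nodes are numbered in document (preorder) order, `parent[i]` gives the index of the parent, a parent and its child are always interconnected (base case), and the two cases are evaluated in two separate passes so that every entry is computed before it is read. The function name and the pass ordering are choices made for this sketch.

```python
def interconnection_matrix(parent, label):
    """parent[i]: preorder index of node i's parent (None for the root); label[i]: its tag."""
    size = len(parent)
    # last[i] = largest preorder index inside the subtree of i.
    last = list(range(size))
    for i in range(size - 1, 0, -1):
        last[parent[i]] = max(last[parent[i]], last[i])

    conn = [[False] * size for _ in range(size)]

    def get(a, b):
        # Only the upper triangle (row < column) is filled in.
        return a == b or conn[min(a, b)][max(a, b)]

    # Pass 1: i is an ancestor of j (j lies in the subtree of i); i descending.
    for i in range(size - 1, -1, -1):
        for j in range(i + 1, last[i] + 1):
            j_father = parent[j]
            if j_father == i:
                conn[i][j] = True  # base case: parent and child
                continue
            i_child = j
            while parent[i_child] != i:  # child of i on the path to j
                i_child = parent[i_child]
            conn[i][j] = (get(i_child, j) and get(i, j_father)
                          and label[i_child] != label[j]
                          and label[i] != label[j_father])

    # Pass 2: i is not an ancestor of j; i ascending.
    for i in range(size):
        for j in range(last[i] + 1, size):
            i_father, j_father = parent[i], parent[j]
            conn[i][j] = (get(i, j_father) and get(i_father, j)
                          and label[i] != label[j_father]
                          and label[i_father] != label[j])
    return conn

# bookinfo(0) -> book(1) -> title(2), author(3), price(4); book(5) -> title(6), author(7), price(8)
parent = [None, 0, 1, 1, 1, 0, 5, 5, 5]
label = ["bookinfo", "book", "title", "author", "price",
         "book", "title", "author", "price"]
conn = interconnection_matrix(parent, label)
```

On this tree a title and the price of the same book come out interconnected, while a title and the price of the other book do not, because the two `book` nodes on the connecting path share a label. Finding `i_child` by walking up from `j` keeps the sketch short but costs time proportional to the depth; a real implementation would likely cache it.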
43 Demo