Title: Effective XML Keyword Search with Relevance Oriented Ranking
1Effective XML Keyword Search with Relevance Oriented Ranking
- Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu
2Introduction
- XML Keyword search
- Inspired by IR style keyword search on the web
- Enables users to access information in an XML database
- XML data is modeled as a rooted, labeled tree
- Recent research efforts
- Efficiency
- Effectiveness
3Effectiveness
- Capture the user's search intention
- Identify the target that the user intends to search for
- Infer the predicate constraint that the user intends to search via
- Result ranking
- Rank the query results according to their objective relevance to the user's search intention
4State of the Art
- Search semantics design
- LCA (Lowest Common Ancestor)
- Node v is an LCA of keyword set K = {w1, w2, ..., wk} if the sub-tree rooted at v contains at least one occurrence of every keyword in K, after excluding the sub-elements that already contain all keywords in K
- SLCA (Smallest LCA)
- Node v is an SLCA of keyword set K = {w1, w2, ..., wk} if
- (1) v is an LCA of K
- (2) no proper descendant of v is an LCA of K
- XSeek
- Infers the search intention based on the concept of objects and an analysis of the matching between keywords and data nodes
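The SLCA definition above can be illustrated with a brute-force sketch. It assumes each node is identified by a Dewey ID (a tuple of child positions from the root) and that `matches` maps each keyword to the nodes that directly contain it; both the function name and the sample IDs are hypothetical. This is not the XKSearch or Multiway SLCA algorithm, only a direct reading of the definition.

```python
from typing import Dict, List, Set, Tuple

Dewey = Tuple[int, ...]


def slca(matches: Dict[str, List[Dewey]]) -> List[Dewey]:
    """Brute-force SLCA over Dewey IDs (illustrative helper, not an optimized algorithm)."""
    keywords = set(matches)
    # For every ancestor-or-self of a match, record which keywords its sub-tree contains.
    covered: Dict[Dewey, Set[str]] = {}
    for kw, nodes in matches.items():
        for node in nodes:
            for depth in range(1, len(node) + 1):
                covered.setdefault(node[:depth], set()).add(kw)
    # Nodes whose sub-tree contains every query keyword.
    full = {n for n, kws in covered.items() if kws == keywords}
    # Keep only nodes with no proper descendant that also contains every keyword.
    result = [n for n in full
              if not any(m != n and m[:len(n)] == n for m in full)]
    return sorted(result)


# Hypothetical Dewey IDs: one node contains both keywords, another only "rock".
example = {"rock": [(1, 1, 3, 2, 1), (1, 2, 1, 3)], "music": [(1, 1, 3, 2, 1)]}
smallest = slca(example)
```

A node that contains every keyword and has no proper descendant doing the same satisfies both conditions (1) and (2), which is why the exclusion step of the LCA definition does not need separate code here.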
5State of the Art (cont)
- Efficient result retrieval
- Designed based on a certain search semantics
- XKSearch, Multiway SLCA etc.
- Result ranking
- XRANK, XKSearch, EASE
- They only consider
- Structural compactness of matching results
- Keyword proximity
- Similarity at node level
6Problems Unaddressed
- Existing approaches do not adequately address the user's search intention
- Meaningfulness of query result
- SLCA is less meaningful in many cases
- Keyword ambiguity problems
- A keyword can appear both as an XML node type and as the text value of some other nodes
- A keyword can appear in the text values of different XML node types and carry different meanings
- Neither SLCA nor XSeek addresses keyword ambiguity well
7Meaningfulness
Problems
- Keyword query: rock music
- Search intention: find customers interested in rock music (C3)
- SLCA returns the interest node of C3
8Keyword Ambiguity
Problems
- Q: customer, interest, art
- Ambiguity 1: customer, interest; Ambiguity 2: art
- Intention: find customers whose interest is art
- Less relevant or irrelevant results are also returned: C1, C3, and B1's title
[Figure: fragment of the example XML tree (customer node with ID, name, interests and purchases)]
9Keyword Ambiguity (cont)
Problems
- Q: customer, art
- art can be the value of an interest node (C2, C4), a name node (C3), or a street node (C1) of a customer, or of the title node of a book (B1)
- customer can be the tag name of a customer node, or (part of) the value of a title node (B1)
- How to rank C1 to C4 and B1?
10Objectives & Challenges
- Address the following as a single problem
- Search intention identification
- Query result retrieval
- Result ranking
- Extend the original TFIDF from text databases to XML databases, while capturing the hierarchical structure of XML data
- Challenges
- How to decide which sub-tree(s) with appropriate node types capture the user's desired information
- How to return sub-trees of an appropriate size (i.e. containing enough but not overwhelming information)
- How to rank those sub-trees by their relevance
11Challenges
- Difficulty in applying TFIDF to XML
- An XML DB carries semantic information while a text DB contains pure text. XML TFIDF must be aware of the underlying semantics.
- All contents of XML data are stored in leaf nodes only
- What is the analogue of a flat document in XML?
- Sub-trees are classified according to their prefix paths
- The normalization factor is not simply the size of the sub-tree
- The structure of sub-trees may also affect the ranks
12TFIDF Recap
- Rule 1: A keyword appearing in many documents should not be regarded as more important than a keyword appearing in a few. (IDF)
- Rule 2: A document with more occurrences of a query keyword should not be regarded as less important for that keyword than a document that has fewer. (TF)
- Rule 3: A normalization factor is needed to balance between long and short documents
- Rule 2 discriminates against short documents, which have less chance to contain many occurrences of keywords.
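The three rules can be sketched with one common flat-document TFIDF weighting. The specific choices below (log-scaled TF, `log(1 + N/df)` for IDF, Euclidean length normalization) are assumptions made for illustration; many other variants satisfy the same rules. The toy documents are made up.

```python
import math
from collections import Counter
from typing import List


def tfidf_scores(query: List[str], docs: List[List[str]]) -> List[float]:
    """Score each tokenized document against the query with one TFIDF variant."""
    n = len(docs)
    counts = [Counter(d) for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for c in counts for term in c)
    scores = []
    for c in counts:
        # Rule 3: normalize by the length of the document's TF vector.
        norm = math.sqrt(sum((1 + math.log(tf)) ** 2 for tf in c.values())) or 1.0
        score = 0.0
        for k in query:
            if c[k] == 0:
                continue
            tf = 1 + math.log(c[k])        # Rule 2: more occurrences, not less important
            idf = math.log(1 + n / df[k])  # Rule 1: rarer keywords weigh more
            score += tf * idf
        scores.append(score / norm)
    return scores


toy_docs = [["art", "street"],
            ["rock", "music", "rock"],
            ["art", "art", "art", "of", "customer"]]
toy_scores = tfidf_scores(["art"], toy_docs)
```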
13Our Approach
- Extend IR-style keyword search techniques (like TFIDF) from text databases to XML databases, in order to capture the hierarchical structure of XML documents
- by analyzing statistics of the underlying XML data
- Major Contributions
- Identify the user's desired search-for node and search-via node(s) in a heuristic way
- Define XML TF (term frequency) and XML DF (document frequency)
- Confidence formulas for search-for and search-via candidates
- Define XML TFIDF similarity
- Propose 3 guidelines specifically for XML keyword search
- Take keyword ambiguity problems into account
- Design a keyword search engine, XReal
14Data Model
- Node type: two nodes are of the same node type if they share the same prefix path
- e.g. /storeDB/customers/customer/name vs. /storeDB/books/book/publisher/name
- Value node: the text value contained in a leaf node
- Structural node
- Single-valued node type, multi-valued node type
- Grouping type: all its children are of the same multi-valued type
[Figure: example XML tree rooted at storeDB, with a customers sub-tree (customer nodes C1 to C4) and a books sub-tree (book nodes B1 and B2); node labels include ID, name, contact, address, interests, interest, purchases, title, authors, author and publisher]
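The prefix-path notion of node type can be sketched with Python's standard `xml.etree.ElementTree`. The toy document below only borrows tag names from the example; the ID-to-name pairings are made up. Treating a type as multi-valued when it occurs more than once under some parent in the data is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

TOY_XML = """<storeDB>
  <customers>
    <customer><ID>C1</ID><name>Mary Smith</name></customer>
    <customer><ID>C2</ID><name>John Martin</name></customer>
  </customers>
  <books>
    <book><ID>B1</ID><publisher><name>Oxford</name></publisher></book>
  </books>
</storeDB>"""


def node_types(root):
    """Map each prefix path (node type) to the elements of that type."""
    types = defaultdict(list)

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        types[path].append(elem)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return types


def multi_valued_types(root):
    """Node types that occur more than once under at least one parent (assumed criterion)."""
    multi = set()

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        tags = Counter(child.tag for child in elem)
        for child in elem:
            if tags[child.tag] > 1:
                multi.add(f"{path}/{child.tag}")
            walk(child, path)

    walk(root, "")
    return multi


toy_root = ET.fromstring(TOY_XML)
toy_types = node_types(toy_root)
toy_multi = multi_valued_types(toy_root)
```

The two `name` elements of customers share one node type, while the publisher's `name` gets a different one because its prefix path differs.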
15XML TF and IDF
- XML DF (document frequency)
- The number of T-typed nodes that contain keyword k in their sub-trees in the XML database.
- The granularity of similarity measurement is the sub-tree of a certain node type T
- XML TF (term frequency)
- The number of occurrences of a keyword k in a given value node a in the XML database.
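A minimal sketch of how these two statistics could be collected in one pass over a small document. It assumes that a node "contains" a keyword if the keyword matches a tag name or a whitespace-separated word of a text value in its sub-tree (case-insensitive); that matching rule, the function names, and the toy data are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

SAMPLE = """
<storeDB><customers>
  <customer><name>Art Smith</name>
    <interests><interest>fashion</interest></interests></customer>
  <customer><name>Mary Smith</name>
    <interests><interest>street art</interest><interest>art</interest></interests></customer>
</customers></storeDB>
""".strip()


def tokens(elem):
    """Keywords found directly at this element: its tag name plus the words of its text."""
    return [elem.tag.lower()] + (elem.text or "").lower().split()


def xml_df(root):
    """df[T][k] = number of T-typed nodes whose sub-tree contains keyword k."""
    df = defaultdict(Counter)

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        seen = set(tokens(elem))
        for child in elem:
            seen |= walk(child, path)
        for k in seen:            # each node counts at most once per keyword
            df[path][k] += 1
        return seen

    walk(root, "")
    return df


def xml_tf(value_node, k):
    """Number of occurrences of keyword k in the text of one value node."""
    return (value_node.text or "").lower().split().count(k.lower())


sample_root = ET.fromstring(SAMPLE)
sample_df = xml_df(sample_root)
CUSTOMER = "/storeDB/customers/customer"
```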
16Infer the desired search-for node
- Guidelines: a node type T is considered a desired search-for node if
- T is intuitively related to every query keyword
- XML nodes of type T are informative enough to contain enough relevant information
- XML nodes of type T are not so overwhelming as to contain too much irrelevant information
- Confidence of T as the search-for node w.r.t. query q
- The product, instead of a sum, is used to follow the 1st guideline
- The log part is designed to follow the 3rd guideline
- The exponential part is designed to follow the 2nd guideline
- r is a decay factor in (0, 1]
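A minimal sketch of a confidence score shaped like the description above: a product over the query keywords, wrapped in a logarithm, multiplied by an exponential decay. The concrete form `log(1 + prod f_k^T) * r ** depth(T)`, the use of the depth of T as the exponent, and the default `r = 0.8` are assumptions of this sketch, where `f_T[k]` stands for the XML DF of keyword k for node type T.

```python
import math
from typing import Dict, List


def c_for(f_T: Dict[str, int], depth: int, query: List[str], r: float = 0.8) -> float:
    """Assumed confidence of node type T as the search-for node for a query.

    f_T[k] -- number of T-typed nodes containing keyword k in their sub-trees
    depth  -- depth of node type T in the XML tree (assumed exponent)
    r      -- decay factor in (0, 1]
    """
    product = 1
    for k in query:
        # A single unrelated keyword zeroes the product (1st guideline).
        product *= f_T.get(k, 0)
    # The log dampens very large counts; r ** depth decays with depth.
    return math.log(1 + product) * r ** depth


related = c_for({"customer": 4, "art": 3}, depth=3, query=["customer", "art"])
unrelated = c_for({"customer": 4}, depth=3, query=["customer", "art"])
```

Using a product rather than a sum means a node type that misses even one query keyword gets confidence zero, which is the behaviour the first guideline asks for.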
17Infer the Search-Via Nodes
- Infer structural node to search via
- A structural node n is a good candidate if it is related to as many (but not necessarily all) keywords as possible
- The search-via node type is normally not unique
- Infer individual value node to search via
- Statistics alone are not adequate to infer the likelihood of a value node as (part of) a search-via node
- Capture keyword co-occurrence
18Capture keyword co-occurrence
- E.g. Q: customer, name, rock, interest, art
- Easy to find that name and interest have high confidence to be the search-via nodes
- But hard to know whether rock is a value of name or interest, and whether art is a value of interest or name
- How to distinguish customer C4 from C3?
19Capture keyword co-occurrence
- Proximity factors for a value node v of type kt containing keyword k
- Given a query q and a certain value node v, if there are two keywords kt and k in q, s.t. kt matches the type of an ancestor node of v and k matches a keyword in v
- In-Query distance
- Distance between keyword k and node type kt in query q
- Favors kt appearing before k
- Structural distance
- Depth distance between v and the nearest kt-typed ancestor node of v
- Value-Type distance
- Max of the above two
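The three distances can be sketched as follows. The slide does not say how strongly "kt before k" is favoured, so the penalty used here (adding the query length when k precedes kt) is purely an assumption, as are the function names. A value node is represented by the list of tag names of its ancestors, from the root down to its parent element.

```python
from typing import List, Optional


def in_query_distance(query: List[str], kt: str, k: str) -> int:
    """Position gap between kt and k in the query; both must occur in the query."""
    i, j = query.index(kt), query.index(k)
    if i < j:
        return j - i
    # Assumed penalty when k appears before kt.
    return (i - j) + len(query)


def structural_distance(ancestor_tags: List[str], kt: str) -> Optional[int]:
    """Depth gap between value node v and its nearest kt-typed ancestor, if any."""
    for up, tag in enumerate(reversed(ancestor_tags), start=1):
        if tag == kt:
            return up
    return None


def value_type_distance(query: List[str], ancestor_tags: List[str],
                        kt: str, k: str) -> Optional[int]:
    """Max of the in-query and structural distances; None if kt is not an ancestor type."""
    s = structural_distance(ancestor_tags, kt)
    if s is None:
        return None
    return max(in_query_distance(query, kt, k), s)


q = ["customer", "name", "rock", "interest", "art"]
interest_value = ["storeDB", "customers", "customer", "interests", "interest"]
name_value = ["storeDB", "customers", "customer", "name"]
```

With this query, a value "art" under an interest node gets distance 1 for kt = interest, while a value containing "art" under a name node gets distance 3 for kt = name, so the interest reading is closer.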
20Principles of XML keyword search
- Principle 1
- When searching for D-typed nodes via a single-valued type V, ideally only the values and structures nested in V-typed nodes can affect the relevance, regardless of the size of other typed nodes nested in D-typed nodes.
- However, TFIDF similarity in IR normalizes the relevance score of each document w.r.t. its size
- Principle 2 (addresses keyword Ambiguity 2)
- When searching for nodes of type D via a multi-valued type V, the relevance of a D-typed node which contains a query-relevant V-typed node should not be affected (i.e. normalized) too much by other query-irrelevant V-typed nodes.
- Example: for query art, C4 should not be less relevant than C1
21Principles of XML keyword search
- Principle 1 and 2
- Especially useful for interpreting pure keyword queries
- Find the search-via node correctly
- Principle 3
- The order of keywords in a query is important to indicate the search intention
- Incorporate the search-via confidence Cvia defined before
22XML TFIDF Similarity
- To calculate the similarity between the search-for node and the query q
- Base case: similarity between value node a and q
- Apply the original TFIDF directly, since a contains keywords only, without any structure
- The base-case formula combines a TF part, an IDF part, and a normalization factor
- Recursive case: similarity between structural node n and q
- Based on the similarities of its children c and the confidence level of c as the node type to search via
23XML TFIDF Similarity (cont.)
- Recursive Case
- Intuition 2. An internal node n is relevant to q if n has a child c such that the type of c has high confidence to be a search-via node w.r.t. q (i.e. large Cvia(Tc, q)), and c is highly relevant to q (i.e. large sim(q, c)).
- Intuition 3. An internal node n is more relevant to q if n has more query-relevant children, all else being equal.
- Weighted sum of the similarities of all of n's children, weighted by their confidence to be the search-via node
- Overall weight of node n w.r.t. query q, which essentially plays the role of a normalization factor
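The recursive case can be sketched as a bottom-up computation. This sketch assumes the base-case similarity of each value node and the overall weight of each structural node are already known and stored on the node; how those two quantities are derived is not reproduced here, and the `Node` class, field names, and numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    node_type: str
    base_sim: float = 0.0   # similarity of a value node to q (assumed precomputed)
    weight: float = 1.0     # overall weight of a structural node w.r.t. q (assumed given)
    children: List["Node"] = field(default_factory=list)


def sim(node: Node, c_via: Dict[str, float]) -> float:
    """Similarity of a node to the query, computed bottom-up."""
    if not node.children:
        return node.base_sim
    # Weighted sum of the children's similarities, weighted by search-via confidence.
    total = sum(sim(c, c_via) * c_via.get(c.node_type, 0.0) for c in node.children)
    # The node's overall weight acts as the normalization factor.
    return total / node.weight if node.weight else 0.0


via = {"name": 0.5, "interests": 0.8, "interest": 0.8}   # made-up confidences
interests = Node("interests", children=[Node("interest", base_sim=0.9),
                                        Node("interest", base_sim=0.0)])
customer = Node("customer", weight=2.0, children=[Node("name"), interests])
customer_sim = sim(customer, via)
```

A child contributes only when both its own similarity and the confidence of its type are non-zero, matching Intuition 2, and each additional relevant child raises the sum, matching Intuition 3.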
24Flowchart of answering a query
- Identify user search intention
- Compute the confidence of all possible candidate node types and choose the desired search-for node type Tfor
- Relevance-oriented ranking
- Compute XML TFIDF similarity in a bottom-up approach, from value nodes containing keywords up to nodes of type Tfor
- Return a ranked list of sub-trees rooted at nodes of type Tfor
- If more than one search-for node type has comparable confidence, a ranked list for each search-for node type is returned
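The last two steps of the flow can be sketched as below. What counts as "comparable" confidence is not specified, so the `ratio` threshold (keep types within 90% of the best confidence) is an assumption, as are the function names and the made-up confidences and similarities.

```python
from typing import Dict, List


def choose_search_for(confidence: Dict[str, float], ratio: float = 0.9) -> List[str]:
    """Node types whose confidence is positive and within `ratio` of the best one."""
    if not confidence:
        return []
    best = max(confidence.values())
    chosen = [t for t, c in confidence.items() if c > 0 and c >= ratio * best]
    return sorted(chosen, key=lambda t: -confidence[t])


def ranked_lists(confidence: Dict[str, float],
                 sims: Dict[str, Dict[str, float]],
                 ratio: float = 0.9) -> Dict[str, List[str]]:
    """One ranked list of node IDs per chosen search-for type, best similarity first."""
    result = {}
    for t in choose_search_for(confidence, ratio):
        per_node = sims.get(t, {})
        result[t] = sorted(per_node, key=lambda n: per_node[n], reverse=True)
    return result


conf_by_type = {"customer": 2.0, "book": 1.9, "interest": 0.5}
sim_by_node = {"customer": {"C4": 0.7, "C1": 0.2, "C3": 0.4}, "book": {"B1": 0.3}}
answer = ranked_lists(conf_by_type, sim_by_node)
```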
25Experimental Result
- Data set
- DBLP, XMark, WSU, eBay
- Comparison
- Compare XReal with SLCA and XSeek
- Equipment
- Implemented in Java
- Run on a 3.6 GHz Pentium IV PC with 1 GB memory and Windows XP
- Berkeley DB Java Edition for storing keyword inverted lists and the keyword frequency table
26Search Effectiveness
- Accuracy in inferring the search-for node
- Conducted by user survey
- Tested queries contain at least one of the two ambiguity problems
- Conclusion
- XReal works well, especially when the search-for node is not given explicitly in the query
27Search Effectiveness
- Result effectiveness
- Measured by precision, recall, F-measure
- Observations
- XReal achieves higher precision than SLCA and XSeek for queries that contain ambiguities
- XReal performs as well as XSeek when queries have no ambiguity in the XML data
- XReal's top-100 precision is higher than its overall precision
- F-measure also shows good overall effectiveness of both XReal and XSeek
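For reference, the standard definitions of these measures can be sketched as below, including a top-k precision of the kind used for the top-100 figure. The node IDs are placeholders, not experimental data.

```python
from typing import Iterable, List, Tuple


def precision_recall_f(returned: Iterable[str],
                       relevant: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall, and balanced F-measure of a result set."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    p = hits / len(returned) if returned else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def precision_at_k(ranked: List[str], relevant: Iterable[str], k: int) -> float:
    """Precision over the first k results of a ranked list."""
    top = ranked[:k]
    relevant = set(relevant)
    return sum(1 for x in top if x in relevant) / len(top) if top else 0.0


p, r, f = precision_recall_f(["C1", "C2", "C4", "B1"], ["C2", "C4"])
p_top2 = precision_at_k(["C4", "C2", "C1", "B1"], ["C2", "C4"], 2)
```

When relevant results are ranked first, top-k precision exceeds overall precision, as in this placeholder example (1.0 for the top 2 versus 0.5 overall).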
28Ranking Effectiveness
- Metrics
- Number of Top-1 answers that are relevant
- Reciprocal Rank (R-Rank)
- Mean Average Precision (MAP)
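The three ranking metrics have standard definitions, sketched here over ranked lists of result IDs. The rankings and relevance judgments are placeholders, not experimental data.

```python
from typing import List, Set, Tuple


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is returned."""
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant result appears."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Average of the per-query average precision."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


def top1_relevant(runs: List[Tuple[List[str], Set[str]]]) -> int:
    """Number of queries whose first result is relevant."""
    return sum(1 for r, rel in runs if r and r[0] in rel)


runs = [(["C1", "C4", "B1", "C2"], {"C2", "C4"}),
        (["C4", "C2"], {"C2", "C4"})]
```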
29Efficiency & Scalability
- Compare three index variants for XReal, and SLCA
- Dup
- Stores only the Dewey ID and XML TF
- DupType
- Stores an extra node type (i.e. its prefix path)
- DupTypeNorm
- Stores an extra normalization factor Wa for the value node
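The three variants differ only in what each inverted-list entry carries, which can be sketched with plain data classes. The class and field names are hypothetical, and the sample entry is made up; the actual storage in Berkeley DB is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DupEntry:
    dewey: Tuple[int, ...]   # Dewey ID of the value node
    tf: int                  # XML TF of the keyword in that node


@dataclass(frozen=True)
class DupTypeEntry(DupEntry):
    node_type: str           # prefix path of the node


@dataclass(frozen=True)
class DupTypeNormEntry(DupTypeEntry):
    norm: float              # normalization factor Wa of the value node


# A keyword inverted list maps each keyword to its entries, kept in Dewey order.
inverted: Dict[str, List[DupTypeNormEntry]] = {
    "art": [DupTypeNormEntry(dewey=(1, 1, 2, 3, 1), tf=1,
                             node_type="/storeDB/customers/customer/interests/interest",
                             norm=1.0)],
}
entry = inverted["art"][0]
```

Each richer variant stores information in the index entry itself, at the cost of larger inverted lists.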
31Q&A