Effective XML Keyword Search with Relevance Oriented Ranking presentation

1
Effective XML Keyword Search with Relevance
Oriented Ranking

Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu

2
Introduction

XML keyword search
Inspired by IR-style keyword search on the web
Enables users to access information in an XML
database
XML data is modeled as a rooted, labeled tree
Recent research efforts
Efficiency
Effectiveness

3
Effectiveness

Capture the user's search intention
Identify the target that the user intends to
search for
Infer the predicate constraints that the user
intends to search via
Result ranking
Rank the query results according to their
objective relevance to the user's search intention

4
State of the Art

Search semantics design
LCA (Lowest Common Ancestor)
Node v is an LCA of keyword set K = {w1, w2, ..., wk}
if the sub-tree rooted at v contains at least one
occurrence of every keyword in K, after excluding
the sub-elements that already contain all
keywords in K
SLCA (Smallest LCA)
Node v is an SLCA of keyword set K = {w1, w2, ..., wk}
if
(1) v is an LCA of K
(2) no proper descendant of v is an LCA of K
XSeek
Infers the search intention based on the concept
of objects and an analysis of the matching
between keyword and data node
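
The SLCA definition above can be illustrated with a small brute-force sketch. It is not the XKSearch or Multiway SLCA algorithm. It uses the equivalent formulation "v contains every keyword and no proper descendant of v does", and it assumes nodes are identified by Dewey IDs stored as tuples. The function names and the sample IDs are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A Dewey ID such as (1, 1, 3) names the 3rd child of the 1st child of the root (1,).
Dewey = Tuple[int, ...]


def ancestors_or_self(node: Dewey) -> Set[Dewey]:
    """All prefixes of a Dewey ID: the node itself and every ancestor."""
    return {node[:i] for i in range(1, len(node) + 1)}


def slca(matches: Dict[str, List[Dewey]]) -> List[Dewey]:
    """Brute-force SLCA over keyword match lists (keyword -> matching nodes)."""
    if not matches:
        return []
    # For each keyword, the set of nodes whose sub-tree contains it.
    covering = [
        set().union(*(ancestors_or_self(m) for m in nodes))
        for nodes in matches.values()
    ]
    contains_all = set.intersection(*covering)
    # Keep a node only if no proper descendant also contains all keywords.
    return sorted(
        v for v in contains_all
        if not any(u != v and u[:len(v)] == v for u in contains_all)
    )


# Hypothetical Dewey IDs: both keywords occur in one interest node,
# and "music" also occurs in a book title elsewhere in the tree.
example = {
    "rock": [(1, 1, 3, 2, 1)],
    "music": [(1, 1, 3, 2, 1), (1, 2, 1, 1)],
}
result = slca(example)
```

Here `result` holds only the deepest node containing both keywords, which mirrors how SLCA can return an interest node rather than the enclosing customer.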

5
State of the Art (cont)

Efficient result retrieval
Designed based on a certain search semantics
XKSearch, Multiway SLCA etc.
Result ranking
XRANK, XKSearch, EASE
They only consider
Structural compactness of matching results
Keyword proximity
Similarity at node level

6
Problems Unaddressed

Existing approaches do not adequately address the
user's search intention
Meaningfulness of query results
SLCA is less meaningful in many cases
Keyword ambiguity problems
Ambiguity 1: a keyword can appear both as an XML
node type and as the text value of some other nodes
Ambiguity 2: a keyword can appear in the text
values of different XML node types and carry
different meanings

Neither SLCA nor XSeek addresses keyword ambiguity
well
7
Meaningfulness
Problems

Keyword query: rock music
Search intention: find customers interested in
rock music (C3)
SLCA returns the interest node of C3

8
Keyword Ambiguity
Problems

Q = {customer, interest, art}
Ambiguity 1: customer, interest. Ambiguity 2: art
Intention: find customers whose interest is art
Less relevant or irrelevant results are also
returned: C1, C3, and B1's title

[Figure: fragment of the example XML tree around
customer C2. Visible labels include customer, ID,
name, interests, interest, purchases, purchase, and
the values "Oxford", "street art" and "John Martin".]

9
Keyword Ambiguity (cont)
Problems

Q = {customer, art}
art can be the value of an interest node (C2, C4),
a name node (C3), a street node of a customer (C1),
or a title node of a book (B1)
customer can be the tag name of a customer node, or
(part of) the value of a title node (B1)

How should C1 to C4 and B1 be ranked?
10
Objectives & Challenges

Address the following as a single problem
Search intention identification
Query result retrieval
Result ranking
Extend the original TFIDF from text databases to
XML databases, while capturing the hierarchical
structure of XML data

Challenges
How to decide which sub-tree(s) with appropriate
node types capture the information the user wants
How to return sub-trees of an appropriate size
(i.e. containing enough, but not overwhelming,
information)
How to rank those sub-trees by their relevance

11
Challenges

Difficulty in applying TFIDF to XML
An XML DB carries semantic information, while a
text DB contains pure text. XML TFIDF must be aware
of the underlying semantics.
All contents of XML data are stored in leaf nodes
only
What is the analogue of a flat document in XML?
Sub-trees are classified according to their prefix
path
The normalization factor is not simply the size of
the sub-tree
The structure of sub-trees may also affect the ranks

12
TFIDF Recap

Rule 1: A keyword appearing in many documents
should not be regarded as more important than a
keyword appearing in a few. --- IDF
Rule 2: A document with more occurrences of a
query keyword should not be regarded as less
important for that keyword than a document that
has fewer. --- TF
Rule 3: A normalization factor is needed to
balance between long and short documents,
as Rule 2 discriminates against short documents,
which have less chance to contain more
occurrences of keywords.
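
The three rules can be made concrete with a minimal sketch of one common TFIDF variant for flat documents. The specific choices here are assumptions for illustration, not something the slides prescribe: log-scaled TF, log(N/df) as IDF, and the square root of document length as the normalization factor.

```python
import math
from collections import Counter
from typing import List


def tfidf_score(query: List[str], doc: List[str], corpus: List[List[str]]) -> float:
    """Score one tokenized document against a keyword query."""
    n_docs = len(corpus)
    tf = Counter(doc)
    score = 0.0
    for k in query:
        df = sum(1 for d in corpus if k in d)  # documents containing k
        if df == 0 or tf[k] == 0:
            continue
        idf = math.log(n_docs / df)            # Rule 1: rarer keywords weigh more
        score += (1 + math.log(tf[k])) * idf   # Rule 2: more occurrences never hurt
    norm = math.sqrt(len(doc)) or 1.0          # Rule 3: damp the advantage of long documents
    return score / norm


corpus = [["a", "b"], ["a", "c"]]
common = tfidf_score(["a"], corpus[0], corpus)  # "a" occurs in every document
rare = tfidf_score(["b"], corpus[0], corpus)    # "b" occurs in one document only
```

A keyword that occurs in every document contributes nothing under this IDF, while a rarer keyword contributes a positive score.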

13
Our Approach

Extend IR-style keyword search techniques (like
TFIDF) from text databases to XML databases, in
order to capture the hierarchical structure of
XML documents, by analyzing statistics of the
underlying XML data
Major Contributions
Identify the user's desired search-for node and
search-via node(s) in a heuristic way
Define XML TF (term frequency) and XML DF
(document frequency)
Confidence formulas for search-for/search-via
candidates
Define XML TFIDF similarity
Propose 3 guidelines specifically for XML keyword
search
Take keyword ambiguity problems into account
Design a keyword search engine, XReal

14
Data Model

Node type: two nodes are of the same node type if
they share the same prefix path
E.g. /storeDB/customers/customer/name and
/storeDB/books/book/publisher/name are different
node types

Value node: the text value contained in a leaf node
Structural node
Single-valued node type, multi-valued node type
Grouping type: all its children are of the same
multi-valued type
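
As a rough illustration of node types, the sketch below groups the elements of a toy document by prefix path using Python's standard `xml.etree.ElementTree`. The document is made up for the example and is not the tree from the slides. `node_types` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def node_types(root: ET.Element) -> Dict[str, List[ET.Element]]:
    """Group every element by its prefix path, i.e. by its node type."""
    groups: Dict[str, List[ET.Element]] = {}

    def visit(elem: ET.Element, parent_path: str) -> None:
        path = f"{parent_path}/{elem.tag}"
        groups.setdefault(path, []).append(elem)
        for child in elem:
            visit(child, path)

    visit(root, "")
    return groups


# Toy data: two elements share the tag "name" but have different prefix paths.
doc = ET.fromstring(
    "<storeDB>"
    "<customers><customer><name>Mary Smith</name></customer></customers>"
    "<books><book><publisher><name>Oxford</name></publisher></book></books>"
    "</storeDB>"
)
types = node_types(doc)
```

The two `name` elements end up under different keys, so they count as different node types even though their tags match.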

[Figure: example XML document tree rooted at
storeDB, with customer nodes C1 to C4 under
customers and book nodes B1 and B2 under books.
Node labels include ID, name, contact, address,
street, city, no, interests, interest, purchases,
purchase, title, authors, author and publisher.
Sample values include "rock music", "art",
"street art", "fashion", "Art Street", "Oxford",
"Art of Customer", "Interest Care", and person
names such as Art Smith and Mary Smith.]

15
XML TF and IDF

XML DF (document frequency)
The number of T-typed nodes that contain keyword
k in their sub-trees in the XML database.
The granularity of similarity measurement is
sub-trees of a certain node type T
XML TF (term frequency)
The number of occurrences of keyword k in a
given value node a in the XML database.
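
A minimal sketch of both statistics, assuming whitespace tokenization and case-insensitive matching. As a simplification it counts only keywords in text values, not keywords that match tag names. The helper names and the toy data are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def tokens(text) -> List[str]:
    """Lower-cased whitespace tokens of a text value (empty for None)."""
    return (text or "").lower().split()


def xml_tf(value_node: ET.Element, k: str) -> int:
    """XML TF: occurrences of keyword k in one value node."""
    return tokens(value_node.text).count(k.lower())


def subtree_contains(node: ET.Element, k: str) -> bool:
    """True if any text value in the sub-tree rooted at node contains k."""
    return any(k.lower() in tokens(e.text) for e in node.iter())


def xml_df(root: ET.Element, type_path: str, k: str) -> int:
    """XML DF: number of nodes of the given type whose sub-tree contains k.

    type_path is an ElementTree path relative to the root,
    e.g. "customers/customer" for /storeDB/customers/customer.
    """
    return sum(1 for n in root.findall(type_path) if subtree_contains(n, k))


doc = ET.fromstring("""
<storeDB><customers>
  <customer><name>Art Smith</name>
    <interests><interest>rock music</interest></interests></customer>
  <customer><name>Mary Smith</name>
    <interests><interest>art</interest><interest>street art</interest></interests></customer>
</customers></storeDB>
""")
df_customer_art = xml_df(doc, "customers/customer", "art")
df_interest_art = xml_df(doc, "customers/customer/interests/interest", "art")
```

The same keyword has a different DF for each node type, which is why the statistics are kept per type rather than per document.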

16
Infer the desired search-for node

Guidelines: a node type T is considered a
desired search-for node if
(1) T is intuitively related to every query keyword
(2) XML nodes of type T are informative enough
to contain enough relevant information
(3) XML nodes of type T are not so overwhelming
that they contain too much irrelevant information
Confidence of T as the search-for node w.r.t.
query q (formula not captured in the transcript)
The product, instead of a sum, is used to follow
the 1st guideline
The log part is designed to follow the 3rd guideline
The exponential part is designed to follow the 2nd
guideline
r is a decay factor in (0, 1].
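
The confidence formula itself did not survive in this transcript. The sketch below assumes one form that matches the three remarks: log(1 + product over the query keywords of the XML DF of k for type T), multiplied by r raised to the depth of T. The exact form, the function name `c_for` and the default r = 0.8 are assumptions for illustration only.

```python
import math
from typing import Dict, List


def c_for(df_by_keyword: Dict[str, int], query: List[str], depth: int, r: float = 0.8) -> float:
    """Assumed confidence of a node type T as the search-for node.

    df_by_keyword maps each keyword k to its XML DF for type T;
    depth is the depth of T in the tree; r is the decay factor in (0, 1].
    """
    product = 1
    for k in query:
        # Product, not sum: a single unrelated keyword (DF 0) zeroes the score.
        product *= df_by_keyword.get(k, 0)
    # The log damps large counts; the exponential part decays with depth.
    return math.log(1 + product) * (r ** depth)


unrelated = c_for({"art": 3, "rock": 0}, ["art", "rock"], depth=2)
shallow = c_for({"art": 3, "rock": 2}, ["art", "rock"], depth=1)
deep = c_for({"art": 3, "rock": 2}, ["art", "rock"], depth=3)
```

With equal statistics, the shallower type scores higher, and a type unrelated to any one keyword scores zero.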

17
Infer the Search-Via Nodes

Infer the structural node to search via
A structural node n is a good candidate if it is
related to as many (but not necessarily all)
keywords as possible
The search-via node type is normally not unique
Infer the individual value node to search via
Statistics alone are not adequate to infer the
likelihood of a value node being (part of) a
search-via node
Capture keyword co-occurrence

18
Capture keyword co-occurrence

E.g. Q = {customer, name, rock, interest, art}
It is easy to find that name and interest have
high confidence to be the search-via nodes
But it is hard to know whether rock is a value of
name or of interest, and likewise for art
How to tell customer C4 apart from C3?

19
Capture keyword co-occurrence

Proximity factors for a value node v under a
kt-typed ancestor, containing keyword k
Given a query q and a certain value node v,
suppose there are two keywords kt and k in q such
that kt matches the type of an ancestor node of v
and k matches a keyword in v
In-Query distance
Distance between keyword k and node type kt in
query q
Favors kt appearing before k
Structural distance
Depth distance between v and the nearest kt-typed
ancestor node of v
Value-Type distance
Max of the above two
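
The three distances can be sketched as follows. The slides do not give the exact formulas, so the details are assumptions: the in-query distance is the positional gap between kt and k, with a fixed penalty added when kt comes after k, and the structural distance counts levels from the value node up to the nearest kt-typed ancestor. All function names are hypothetical.

```python
from typing import List, Optional


def in_query_distance(query: List[str], kt: str, k: str, penalty: int = 1) -> int:
    """Positional gap between kt and k in the query; favors kt before k."""
    i, j = query.index(kt), query.index(k)
    return j - i if i < j else (i - j) + penalty


def structural_distance(ancestor_tags: List[str], kt: str) -> Optional[int]:
    """Levels from value node v up to its nearest kt-typed ancestor.

    ancestor_tags lists the tags from v's parent element up towards the root.
    Returns None if no ancestor is of type kt.
    """
    for depth, tag in enumerate(ancestor_tags, start=1):
        if tag == kt:
            return depth
    return None


def value_type_distance(query: List[str], ancestor_tags: List[str], kt: str, k: str) -> Optional[int]:
    """Max of the in-query distance and the structural distance."""
    s = structural_distance(ancestor_tags, kt)
    if s is None:
        return None
    return max(in_query_distance(query, kt, k), s)


q = ["customer", "name", "rock", "interest", "art"]
# "art" found in an interest value vs. "art" found in a name value.
art_as_interest = value_type_distance(q, ["interest", "interests", "customer"], "interest", "art")
art_as_name = value_type_distance(q, ["name", "customer"], "name", "art")
```

Under these assumptions, "art" found in an interest value is closer to the query than "art" found in a name value, which is the kind of signal needed to separate two customers that match the same keywords.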

20
Principles of XML keyword search

Principle 1
When searching for D-typed nodes via a
single-valued type V, ideally only the values and
structures nested in V-typed nodes should affect
the relevance, regardless of the size of nodes of
other types nested in D-typed nodes.
However, TFIDF similarity in IR normalizes the
relevance score of each document w.r.t. its size
Principle 2 (addresses keyword Ambiguity 2)
When searching for nodes of type D via a
multi-valued type V, the relevance of a D-typed
node which contains a query-relevant V-typed
node should not be affected (i.e. normalized) too
much by other query-irrelevant V-typed nodes.
Example: for query art, C4 should not be
less relevant than C1

21
Principles of XML keyword search

Principle 1 and 2
Especially useful for interpreting pure keyword
query - find search via node correctly
Principle 3
The order of keywords in a query is important to
indicate the search intention
Incorporate the search via confidence Cvia we
defined before

22
XML TFIDF Similarity

Calculates the similarity between the search-for
node and the query q
Base case: similarity between value node a and q
Apply the original TFIDF directly, since a contains
keywords only, without any structure
Recursive case: similarity between structural
node n and q
Based on the similarities of its children c and the
confidence level of c as the node type to search
via

[Formula annotations: TF, IDF, normalization factor]
23
XML TFIDF Similarity (cont.)

Recursive Case
Intuition 2. An internal node n is relevant to q,
if n has a child c such that the type of c has
high confidence to be a search via node w.r.t. q
(i.e. large Cvia(Tc , q)), and c is highly
relevant to q (i.e. large sim(q, c)).
Intuition 3. An internal node n is more relevant
to q if n has more query-relevant children when
all others being equal.

Weighted sum of the similarities of all of n's
children, weighted by their confidence to be the
search-via node
Overall weight of node n w.r.t. query q, which
essentially plays the role of a normalization
factor
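
The recursive case can be sketched as a bottom-up computation. This is a simplified reading of the two annotations above, not the exact formula from the slides. Each child's similarity is weighted by the search-via confidence of its node type, and the sum is divided by the node's overall weight. How that weight is computed is not given in the transcript, so it is a precomputed input here. The `Node` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    node_type: str
    base_sim: float = 0.0   # base-case TFIDF similarity; used for value nodes only
    weight: float = 1.0     # overall weight of the node w.r.t. q (assumed precomputed)
    children: List["Node"] = field(default_factory=list)


def sim(node: Node, c_via: Dict[str, float]) -> float:
    """Similarity between a node and the query, computed bottom-up."""
    if not node.children:
        return node.base_sim
    # Weighted sum over children: child similarity times the confidence
    # of the child's type as a search-via node.
    total = sum(sim(c, c_via) * c_via.get(c.node_type, 0.0) for c in node.children)
    return total / node.weight if node.weight else 0.0


c_via = {"name": 0.2, "interest": 0.9}
one_match = Node("customer", children=[Node("name"), Node("interest", base_sim=0.5)])
two_matches = Node("customer", children=[
    Node("name", base_sim=0.5), Node("interest", base_sim=0.5),
])
s1, s2 = sim(one_match, c_via), sim(two_matches, c_via)
```

With equal weights, the node with more query-relevant children scores higher, in line with Intuition 3.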
24
Flowchart of answering a query

Identify user search intention
Compute the confidence of all possible candidate
node types and choose desired search for node
Tfor
Relevance-oriented ranking
Compute XML TFIDF similarity in a bottom-up
approach from value nodes containing keywords up
to nodes of type Tfor
Return a ranked list of sub-trees rooted at nodes
of type Tfor
If more than one search-for node type has
comparable confidence, a ranked list for each
search-for node type is returned

25
Experimental Result

Data sets
DBLP, XMark, WSU, eBay
Comparison
Compare XReal with SLCA and XSeek
Equipment
Implemented in Java
Run on a 3.6GHz Pentium IV PC with 1 GB memory and
Windows XP
Berkeley DB Java Edition for storing keyword
inverted lists and the keyword frequency table

26
Search Effectiveness

Accuracy in inferring the search for node
Conducted by user survey
Tested queries contain at least one of the two
ambiguity problems
Conclusion
XReal works well, especially when the search for
node is not given explicitly in the query

27
Search Effectiveness

Result effectiveness
Measured by precision, recall, F-measure
Observations
XReal achieves higher precision than SLCA and
XSeek for queries that contain ambiguities
XReal performs as well as XSeek when queries have
no ambiguity in the XML data
XReal's Top-100 precision is higher than its
overall precision
F-measure also shows good overall effectiveness
of both XReal and XSeek
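
For reference, the three effectiveness measures can be computed as in the sketch below, which uses the standard definitions with the balanced F-measure. The result sets are made-up identifiers, not data from the experiments.

```python
from typing import Set, Tuple


def precision_recall_f(returned: Set[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-measure of one query result set."""
    hits = len(returned & relevant)
    p = hits / len(returned) if returned else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical query: three sub-trees returned, two of them relevant.
p, r, f = precision_recall_f({"C1", "C2", "C4"}, {"C2", "C4"})
```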

28
Ranking Effectiveness

Metrics
Number of Top-1 answers that are relevant
Reciprocal Rank (R-Rank)
Mean Average Precision (MAP)
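
A minimal sketch of the two rank-based metrics, using their standard definitions. The ranked list is a made-up example rather than a result from the experiments.

```python
from typing import List, Set, Tuple


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant answer, or 0.0 if none is returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant answer appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """MAP: average precision averaged over several queries."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


ranked, relevant = ["C1", "C4", "C2"], {"C2", "C4"}
rr = reciprocal_rank(ranked, relevant)
ap = average_precision(ranked, relevant)
```

The first relevant answer is at rank 2, so the reciprocal rank is 1/2. The average precision is the mean of 1/2 and 2/3.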

29
Efficiency & Scalability

Compare three index variants for XReal, and
SLCA
Dup
Stores only the Dewey ID and XML TF
DupType
Stores an extra node type (i.e. its prefix path)
DupTypeNorm
Stores an extra normalization factor Wa for the
value node
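
The difference between the three variants is what each inverted-list entry stores. The sketch below shows one way to model that. The class and field names are hypothetical and do not reflect XReal's actual storage layout.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DupEntry:
    """Dup: only the Dewey ID and the XML TF of the keyword in that value node."""
    dewey_id: Tuple[int, ...]
    xml_tf: int


@dataclass(frozen=True)
class DupTypeEntry(DupEntry):
    """DupType: additionally stores the node type (prefix path)."""
    node_type: str


@dataclass(frozen=True)
class DupTypeNormEntry(DupTypeEntry):
    """DupTypeNorm: additionally stores the normalization factor Wa."""
    norm_factor: float


entry = DupTypeNormEntry(
    dewey_id=(1, 1, 3, 2, 1),
    xml_tf=1,
    node_type="/storeDB/customers/customer/interests/interest",
    norm_factor=1.0,
)
```

Storing more per entry trades index size for less work at query time, since the node type and normalization factor no longer have to be derived while answering a query.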

31
Q&A

Thank You
