Effective XML Keyword Search with Relevance Oriented Ranking

Provided by: jiahe

1
Effective XML Keyword Search with Relevance
Oriented Ranking
  • Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu

2
Introduction
  • XML Keyword search
  • Inspired by IR style keyword search on the web
  • Enables users to access information in XML
    databases
  • XML data modeled as a rooted, labeled tree
  • Recent research efforts
  • Efficiency
  • Effectiveness

3
Effectiveness
  • Capture the user's search intention
  • Identify the target that the user intends to search
    for
  • Infer the predicate constraint that the user intends
    to search via
  • Result ranking
  • Rank the query results according to their
    objective relevance to the user's search intention

4
State of the Art
  • Search semantics design
  • LCA (Lowest Common Ancestor)
  • Node v is an LCA of keyword set K = {w1, w2, ..., wk} if
    the sub-tree rooted at v contains at least one
    occurrence of all keywords in K, after excluding
    the sub-elements that already contain all
    keywords in K
  • SLCA (Smallest LCA)
  • Node v is an SLCA of keyword set K = {w1, w2, ..., wk}
    if
  • (1) v is an LCA of K
  • (2) no proper descendant of v is an LCA of K
  • XSeek
  • Infers the search intention based on the concept
    of objects and an analysis of the matching
    between keyword and data node
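
The SLCA definition above can be illustrated with a brute-force sketch. It assumes nodes are identified by Dewey IDs written as Python tuples and that keyword match lists are already available; the function names and the toy IDs are hypothetical, and this is not the algorithm used by XKSearch or any other cited system.

```python
def ancestors_or_self(dewey):
    """All prefixes of a Dewey ID, from the root down to the node itself."""
    return [dewey[:i] for i in range(1, len(dewey) + 1)]


def slca(matches):
    """Brute-force SLCA.

    matches: dict mapping each keyword to the Dewey IDs (tuples) of the
    nodes that directly contain it.
    """
    # For each keyword, every node whose sub-tree contains that keyword.
    covering_sets = [
        {a for d in deweys for a in ancestors_or_self(d)}
        for deweys in matches.values()
    ]
    # Nodes whose sub-tree contains all keywords.
    covering = set.intersection(*covering_sets) if covering_sets else set()
    # Keep only nodes with no proper descendant that also covers all keywords.
    return sorted(
        v for v in covering
        if not any(len(u) > len(v) and u[:len(v)] == v for u in covering)
    )


# Hypothetical Dewey IDs: (0, 0, 2, 1) stands for an interest node that
# contains both "rock" and "music"; (0, 1, 0, 3) contains only "rock".
toy_matches = {"rock": [(0, 0, 2, 1), (0, 1, 0, 3)], "music": [(0, 0, 2, 1)]}
print(slca(toy_matches))
```

A node with no covering descendant trivially satisfies the exclusion clause of the LCA definition, which is why the sketch only needs to check for covering descendants.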

5
State of the Art (cont)
  • Efficient result retrieval
  • Designed based on a certain search semantics
  • XKSearch, Multiway SLCA etc.
  • Result ranking
  • XRANK, XKSearch, EASE
  • They only consider
  • Structural compactness of matching results
  • Keyword proximity
  • Similarity at node level

6
Problems Unaddressed
  • Existing approaches do not address the user's search
    intention adequately
  • Meaningfulness of query result
  • SLCA is less meaningful in many cases
  • Keyword ambiguity problems
  • Ambiguity 1: a keyword can appear both as an XML node type
    and as the text value of some other nodes
  • Ambiguity 2: a keyword can appear in the text values of
    different XML node types and carry different
    meanings

Neither SLCA nor XSeek addresses keyword ambiguity well

7
Meaningfulness
Problems
  • Keyword query: rock music
  • Search intention: find customers interested in
    rock music (C3)
  • SLCA returns the interest node of C3

8
Keyword Ambiguity
Problems
  • Q: customer, interest, art
  • Ambiguity 1: customer, interest; Ambiguity 2:
    art
  • Intention: find customers whose interest is art
  • Less relevant or irrelevant results are also
    returned: C1, C3, and B1's title

[Figure: fragment of the example XML tree around customer C2. Recoverable node labels: customer, name, ID, interests, interest, purchases, purchase. Recoverable values: Oxford, street art, John Martin.]

9
Keyword Ambiguity (cont)
Problems
  • Q: customer, art
  • art can be the value of an interest node (C2, C4),
    a name node (C3), a street node of a customer (C1), or
    the title node of a book (B1)
  • customer can be the tag name of a customer node, or
    (part of) the value of the title of a book (B1)

- How to rank C1 to C4 and B1?
10
Objectives and Challenges
  • Address the following as a single problem
  • Search intention identification
  • Query result retrieval
  • Result ranking
  • Extend the original TFIDF from text databases to XML
    databases, while capturing the hierarchical
    structure of XML data
  • Challenges
  • How to decide which sub-tree(s) with appropriate
    node types can capture user desired information
  • How to return sub-trees of an appropriate size
    (i.e. contain enough but non-overwhelming
    information)
  • How to rank those sub-trees by their relevance

11
Challenges
  • Difficulty in applying TFIDF to XML
  • An XML DB carries semantic information while a text DB
    contains pure text information. XML TFIDF must
    be aware of the underlying semantics.
  • All contents of XML data are stored in leaf nodes
    only
  • What is the analogy of a flat document in XML?
  • Sub-trees are classified according to their prefix paths
  • The normalization factor is not simply the size of
    the sub-tree
  • The structure of sub-trees may also affect the ranks

12
TFIDF Recap
  • Rule 1: A keyword appearing in many documents
    should not be regarded as more important than a
    keyword appearing in a few. --- IDF
  • Rule 2: A document with more occurrences of a
    query keyword should not be regarded as less
    important for that keyword than a document that
    has fewer. --- TF
  • Rule 3: A normalization factor is needed to
    balance between long and short documents,
    as Rule 2 discriminates against short documents,
    which may have less chance to contain more
    occurrences of keywords.
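
The three rules can be made concrete with a small sketch of one common TF-IDF variant for flat text documents. The specific weighting choices here (logarithmic TF, `log(1 + N/df)` for IDF, square-root length normalization) are assumptions picked for illustration; the slides do not say which variant the authors start from.

```python
import math
from collections import Counter


def tfidf_scores(query, docs):
    """Score each document (a list of tokens) against the query keywords."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: number of documents containing each query keyword.
    df = {k: sum(1 for c in counts if c[k] > 0) for k in query}
    scores = []
    for doc, c in zip(docs, counts):
        score = 0.0
        for k in query:
            if c[k] == 0:
                continue
            tf = 1.0 + math.log(c[k])              # Rule 2: more occurrences, more weight
            idf = math.log(1.0 + n_docs / df[k])   # Rule 1: rarer keywords weigh more
            score += tf * idf
        # Rule 3: normalize so that long documents are not favored by length alone.
        scores.append(score / math.sqrt(len(doc)) if doc else 0.0)
    return scores


toy_docs = [
    ["art", "art", "street"],
    ["rock", "music"],
    ["art", "of", "customer", "care", "and", "more", "words", "here", "x"],
]
print(tfidf_scores(["art"], toy_docs))
```

In the toy run, the short document that mentions the keyword twice outranks the long document that mentions it once, and the document without the keyword scores zero.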

13
Our Approach
  • Extend IR-style keyword search techniques (like
    TFIDF) from text database to XML database, in
    order to capture the hierarchical structure of
    xml document
  • by analyzing the knowledge of statistics of
    underlying XML data
  • Major Contributions
  • Identify the user's desired search-for node and
    search-via node(s) in a heuristic way
  • Define XML TF (term frequency) and XML DF
    (document frequency)
  • Confidence formulas for search for/via candidates
  • Define XML TFIDF similarity
  • Propose 3 guidelines specifically for XML keyword
    search
  • Take keyword ambiguity problems into account
  • Design a keyword search engine, XReal

14
Data Model
  • Node type: two nodes are of the same node type if
    they share the same prefix path
  • /storeDB/customers/customer/name vs.
    /storeDB/books/book/publisher/name
  • Value node: the text values contained in a leaf node
  • Structural node
  • Single-valued node type, multi-valued node type
  • Grouping type: all its children are of the same
    multi-valued type
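
The prefix-path notion of node type can be sketched with Python's standard `xml.etree.ElementTree`. The toy document and the function name `group_by_node_type` are illustrative assumptions; only the two prefix paths come from the slide.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TOY_XML = """
<storeDB>
  <customers>
    <customer><name>toy customer one</name></customer>
    <customer><name>toy customer two</name></customer>
  </customers>
  <books>
    <book><publisher><name>toy publisher</name></publisher></book>
  </books>
</storeDB>
"""


def group_by_node_type(root):
    """Map each prefix path (node type) to the elements that share it."""
    groups = defaultdict(list)

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        groups[path].append(elem)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return groups


types = group_by_node_type(ET.fromstring(TOY_XML))
# The two "name" elements under customers share one node type; the
# publisher's "name" has the same tag but a different prefix path.
print(len(types["/storeDB/customers/customer/name"]))
print(len(types["/storeDB/books/book/publisher/name"]))
```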

[Figure: example XML tree rooted at storeDB, with customers and books branches, customer nodes C1 to C4, and book nodes B1 and B2. Recoverable node labels: customer, ID, name, contact, address, no., street, city, interests, interest, purchases, purchase, book, title, publisher, authors, author. Recoverable values: Art Smith, Mary Smith, Edward Martin, Rock Davis, Sophia Jones, John Williams, Daniel Jones, John Martin, Oxford, Art Street, rock music, art, street art, fashion, Art of Customer, Interest Care.]

15
XML TF and IDF
  • XML DF (document frequency)
  • The number of T-typed nodes that contain keyword
    k in their sub-trees in XML database.
  • Granularity of similarity measurement is
    sub-trees of certain node type T
  • XML TF (term frequency)
  • The number of occurrences of a keyword k in a
    given value node a in XML database.
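
The two statistics can be computed on a toy document as follows. This is a minimal sketch: the toy data, the function names, whitespace tokenization, and counting a match on either a tag name or a text token are all assumptions made for illustration, not details given in the slides.

```python
import xml.etree.ElementTree as ET

TOY_XML = """
<storeDB><customers>
  <customer><name>Art Smith</name>
    <interests><interest>rock music</interest></interests></customer>
  <customer><name>John Martin</name>
    <interests><interest>street art</interest><interest>art</interest></interests></customer>
  <customer><name>Mary Jones</name>
    <interests><interest>fashion</interest></interests></customer>
</customers></storeDB>
"""


def tokens(text):
    """Lower-cased whitespace tokens of a text value (empty for None)."""
    return (text or "").lower().split()


def xml_tf(value_node, keyword):
    """XML TF: occurrences of the keyword in one value node."""
    return tokens(value_node.text).count(keyword.lower())


def subtree_contains(elem, keyword):
    """True if any node in the sub-tree matches the keyword by tag or text token."""
    kw = keyword.lower()
    return any(kw == e.tag.lower() or kw in tokens(e.text) for e in elem.iter())


def xml_df(root, node_type, keyword):
    """XML DF: number of nodes of the given prefix path whose sub-tree contains the keyword."""
    tags = node_type.strip("/").split("/")
    if tags[0] != root.tag:
        return 0
    nodes = [root] if len(tags) == 1 else root.findall("/".join(tags[1:]))
    return sum(1 for n in nodes if subtree_contains(n, keyword))


root = ET.fromstring(TOY_XML)
print(xml_df(root, "/storeDB/customers/customer", "art"))
print(xml_df(root, "/storeDB/customers/customer/interests/interest", "art"))
```

The same keyword gets a different DF for each node type, which is the sense in which the granularity of the measurement is the sub-trees of a given type T.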

16
Infer the desired search-for node
  • Guidelines: a node type T is considered as a
    desired search for node if
  • T is intuitively related to every query keyword
  • XML nodes of type T are informative enough
    to contain enough relevant information
  • XML nodes of type T are not so overwhelming as to
    contain too much irrelevant information
  • Confidence of T as the search for node w.r.t.
    query q
  • A product instead of a sum is used to follow the 1st
    guideline
  • The log part is designed to follow the 3rd guideline
  • The exponential part is designed to follow the 2nd guideline
  • r is a decay factor in (0, 1]
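
The confidence formula itself is not in the transcript. One form consistent with the bullets above (a product over keywords, a log, and an exponential decay in r) is C_for(T, q) = ln(1 + product over k in q of f_k^T) * r^depth(T), where f_k^T is the XML DF of keyword k for node type T. The sketch below assumes exactly that form. The formula, the depth convention (root at depth 1), and the default r = 0.8 are assumptions, not values confirmed by the slides.

```python
import math


def search_for_confidence(df_by_keyword, depth, r=0.8):
    """Assumed form: ln(1 + prod_k f_k^T) * r ** depth(T).

    df_by_keyword: XML DF of each query keyword for node type T.
    depth: depth of node type T in the tree (root assumed to be depth 1).
    r: decay factor in (0, 1].
    """
    # A product (not a sum): one keyword with DF 0 drives the confidence to 0.
    product = math.prod(df_by_keyword.values())
    return math.log(1.0 + product) * (r ** depth)


# Hypothetical statistics for the query {customer, art}.
candidates = {
    "/storeDB": search_for_confidence({"customer": 1, "art": 1}, depth=1),
    "/storeDB/customers/customer": search_for_confidence({"customer": 3, "art": 2}, depth=3),
    "/storeDB/customers/customer/interests/interest":
        search_for_confidence({"customer": 0, "art": 2}, depth=5),
}
best = max(candidates, key=candidates.get)
print(best)
```

With these made-up statistics the customer type wins: the root is related to every keyword but has too few instances, and the interest type is unrelated to one keyword.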

17
Infer the Search-Via Nodes
  • Infer structural node to search via
  • Structural node n is a good candidate if it is
    related to as many (but not necessarily all)
    keywords as possible
  • Search via node type normally is not unique
  • Infer individual value node to search via
  • Statistics alone is not adequate to infer the
    likelihood of a value node as (part of) search
    via node
  • Capture keyword co-occurrence

18
Capture keyword co-occurrence
  • E.g. Q: customer, name, rock, interest, art
  • It is easy to find that name and interest have high
    confidence to be the search via nodes
  • But it is hard to know whether rock is a value of name or
    interest, and whether art is a
    value of interest or name
  • How to differentiate customer C4 from C3?

19
Capture keyword co-occurrence
  • Proximity factors for a value node v of type kt
    containing keyword k
  • They apply when, given a query q and a certain value node v,
    there are two keywords kt and k in q, s.t. kt
    matches the type of an ancestor node of v and k
    matches a keyword in v
  • In-Query distance
  • Distance between keyword k and node type kt in
    query q
  • Favors kt appearing before k
  • Structural distance
  • Depth distance between v and the nearest kt-typed
    ancestor node of v
  • Value-Type distance
  • Max of the above two
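
The three distances can be sketched as below. Two points are assumptions rather than definitions from the slides: the in-query distance is treated as infinite when kt does not precede k (one way to "favor kt appearing before k"), and the element that directly holds the value counts as the nearest ancestor at distance 1. Function and parameter names are hypothetical.

```python
import math


def in_query_distance(query, kt, k):
    """Positional gap from kt to k in the query; infinite unless kt precedes k."""
    if kt not in query or k not in query:
        return math.inf
    gap = query.index(k) - query.index(kt)
    return gap if gap > 0 else math.inf


def structural_distance(ancestor_tags, kt):
    """Depth distance from the value to its nearest kt-typed ancestor.

    ancestor_tags: tag names from the root down to the element holding the value.
    """
    for steps, tag in enumerate(reversed(ancestor_tags), start=1):
        if tag == kt:
            return steps
    return math.inf


def value_type_distance(query, ancestor_tags, kt, k):
    """Max of the in-query distance and the structural distance."""
    return max(in_query_distance(query, kt, k),
               structural_distance(ancestor_tags, kt))


q = ["customer", "name", "rock", "interest", "art"]
interest_path = ["storeDB", "customers", "customer", "interests", "interest"]
name_path = ["storeDB", "customers", "customer", "name"]

# "art" read as an interest value versus "art" read as a name value.
print(value_type_distance(q, interest_path, "interest", "art"))
print(value_type_distance(q, name_path, "name", "art"))
```

Under these assumptions, "art" is closer to interest than to name in the example query, and "rock" is closer to name than to interest, which is the kind of signal that separates the two customers in the example.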

20
Principles of XML keyword search
  • Principle 1
  • When searching for D-typed nodes via a
    single-valued type V, ideally only the values and
    structures nested in V-typed nodes can affect the
    relevance, regardless of the size of other typed
    nodes nested in D-typed nodes.
  • However, TFIDF similarity in IR normalizes the
    relevance score of each document w.r.t. its size
  • Principle 2 (addresses keyword Ambiguity 2)
  • When searching for nodes of type D via a
    multi-valued type V, the relevance of a D-typed
    node which contains a query relevant V-typed
    node should not be affected (i.e. normalized) too
    much by other query-irrelevant V-typed nodes.
  • Example: for the query art, C4 should not be
    less relevant than C1

21
Principles of XML keyword search
  • Principle 1 and 2
  • Especially useful for interpreting pure keyword
    query - find search via node correctly
  • Principle 3
  • The order of keywords in a query is important to
    indicate the search intention
  • Incorporate the search via confidence Cvia we
    defined before

22
XML TFIDF Similarity
  • To calculate the similarity between the search
    for node and the query q
  • Base case similarity between value node a and q
  • Apply original TFIDF directly since a contains
    keywords only without any structure
  • Recursive case similarity between structural
    node n and q
  • Based on similarities of its children c and the
    confidence level of c as the node type to search
    via

[Formula annotations: TF, IDF, normalization factor]

23
XML TFIDF Similarity (cont.)
  • Recursive Case
  • Intuition 2. An internal node n is relevant to q,
    if n has a child c such that the type of c has
    high confidence to be a search via node w.r.t. q
    (i.e. large Cvia(Tc , q)), and c is highly
    relevant to q (i.e. large sim(q, c)).
  • Intuition 3. An internal node n is more relevant
    to q if n has more query-relevant children when
    all others being equal.

[Formula annotations: (1) weighted sum of the similarities of all of n's children and their confidence to be the search via node; (2) overall weight of node n w.r.t. query q, which essentially plays the role of a normalization factor]

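
The recursive structure described above can be sketched as a confidence-weighted sum over children divided by an overall weight. The formula is not in the transcript, so the `Node` class, the leaf scores, and especially the default `overall_weight` (an L2 norm of the children's confidences) are stand-in assumptions. The default normalization is not the authors' definition and is not designed to satisfy Principles 1 and 2.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    node_type: str
    children: list = field(default_factory=list)
    value_sim: float = 0.0  # for leaves: precomputed TF-IDF similarity to q


def l2_weight(confidences):
    """Placeholder normalization: L2 norm of the children's confidences."""
    return math.sqrt(sum(c * c for c in confidences))


def similarity(node, c_via, overall_weight=l2_weight):
    """Base case: a leaf returns its own score. Recursive case: weighted sum / weight."""
    if not node.children:
        return node.value_sim
    confs = [c_via.get(child.node_type, 0.0) for child in node.children]
    sims = [similarity(child, c_via, overall_weight) for child in node.children]
    weight = overall_weight(confs)
    if weight == 0:
        return 0.0
    return sum(c * s for c, s in zip(confs, sims)) / weight


# Hypothetical search-via confidences and leaf similarities.
c_via = {"name": 0.2, "interest": 0.9}
matches_interest = Node("customer", [Node("name"), Node("interest", value_sim=0.8)])
matches_name = Node("customer", [Node("name", value_sim=0.8), Node("interest")])
print(similarity(matches_interest, c_via) > similarity(matches_name, c_via))
```

A caller can pass a different `overall_weight` function to experiment with normalizations that are less sensitive to query-irrelevant children.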
24
Flowchart of answering a query
  • Identify user search intention
  • Compute the confidence of all possible candidate
    node types and choose desired search for node
    Tfor
  • Relevance-oriented ranking
  • Compute XML TFIDF similarity in a bottom-up
    approach from value nodes containing keywords up
    to nodes of type Tfor
  • Return a ranked list of sub-trees rooted at nodes
    of type Tfor
  • If more than one search for node type has
    comparable confidence, a ranked list for each
    search for node type is returned
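
The selection and ranking steps of this flow can be sketched as follows. The slides do not define "comparable confidence", so the relative `tolerance` threshold is an assumption, as are the function names and the toy numbers.

```python
def choose_search_for_types(confidences, tolerance=0.1):
    """Return node types whose confidence is within `tolerance` of the best, best first."""
    if not confidences:
        return []
    best = max(confidences.values())
    if best <= 0:
        return []
    chosen = [t for t, c in confidences.items() if c >= (1.0 - tolerance) * best]
    return sorted(chosen, key=lambda t: confidences[t], reverse=True)


def rank_results(similarity_by_node):
    """Rank sub-tree roots of the chosen type by descending similarity."""
    return sorted(similarity_by_node, key=similarity_by_node.get, reverse=True)


# Hypothetical confidences and similarities.
print(choose_search_for_types({"customer": 1.0, "book": 0.95, "interest": 0.2}))
print(rank_results({"C1": 0.2, "C2": 0.9, "C4": 0.7}))
```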

25
Experimental Result
  • Data set
  • DBLP, XMark, WSU, eBay
  • Comparison
  • Compare XReal with SLCA and XSeek
  • Equipment
  • Implemented in Java
  • Run on a 3.6GHz Pentium IV PC with 1 GB memory and
    Windows XP
  • Berkeley DB Java Edition for storing keyword
    inverted lists and the keyword frequency table

26
Search Effectiveness
  • Accuracy in inferring the search for node
  • Conducted by user survey
  • Tested queries contain at least one of the two
    ambiguity problems
  • Conclusion
  • XReal works well, especially when the search for
    node is not given explicitly in the query

27
Search Effectiveness
  • Result effectiveness
  • Measured by precision, recall, F-measure
  • Observations
  • XReal achieves higher precision than SLCA and
    XSeek for queries that contain ambiguities
  • XReal performs as well as XSeek when queries have
    no ambiguity in XML data
  • XReal's Top-100 precision is higher than its overall
    precision
  • F-measure also shows good overall effectiveness
    of both XReal and XSeek
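
For reference, the three effectiveness measures can be computed with the standard set-based definitions, sketched below. The result labels in the example are illustrative only and are not taken from the reported experiments.

```python
def precision_recall_f(returned, relevant):
    """Standard precision, recall, and F-measure over sets of result IDs."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy example: four results returned, two of them relevant.
p, r, f = precision_recall_f(["C1", "C2", "C4", "B1"], ["C2", "C4"])
print(p, r, f)
```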

28
Ranking Effectiveness
  • Metrics
  • Number of Top-1 answers that are relevant
  • Reciprocal Rank (R-Rank)
  • Mean Average Precision (MAP)
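
The ranking metrics listed above have standard definitions, sketched here over lists of per-rank relevance judgments. The input format (one boolean per ranked answer, best rank first) and the choice to divide average precision by the number of relevant answers found in the list are assumptions of this sketch.

```python
def top1_relevant(runs):
    """Number of queries whose first-ranked answer is relevant."""
    return sum(1 for rel in runs if rel and rel[0])


def reciprocal_rank(relevance):
    """1 / rank of the first relevant answer, or 0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def average_precision(relevance):
    """Mean of precision@rank taken at each relevant answer in the list."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs):
    """Average of average_precision over all queries."""
    return sum(average_precision(r) for r in runs) / len(runs) if runs else 0.0


toy_runs = [[True], [False, True]]
print(top1_relevant(toy_runs), mean_average_precision(toy_runs))
```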

29
Efficiency and Scalability
  • Compare three index variants for XReal, and
    SLCA
  • Dup
  • Stores only the Dewey ID and XML TF
  • DupType
  • Stores an extra node type (i.e. its prefix path)
  • DupTypeNorm
  • Stores an extra normalization factor Wa for the value
    node

30
(No Transcript)
31
Q&A
  • Thank You

32
(No Transcript)