Title: Gökay Burak AKKUS, Ece AKSU
1XRANK
- XRANK: Ranked Keyword Search over XML Documents
- Ece AKSU
- Gökay Burak AKKUS
2This Paper...
- Describes the architecture, implementation and evaluation of the XRANK system
- The contributions of the paper are
- (a) the problem definition and system architecture
- (b) an algorithm for computing the ranking of XML elements
- (c) new inverted list index structures and associated query processing algorithms
- (d) an experimental evaluation of XRANK
3Overview
- Problem: efficiently producing ranked results for keyword search queries over hierarchical XML documents.
- New challenges
- Queries return deeply nested XML elements.
- Ranking is at the granularity of an XML element (not the document).
- Keyword proximity is more complex.
4Overview - 2
- This paper presents the XRANK system to handle these features of XML keyword search.
- XRANK offers both space and performance benefits.
- XRANK generalizes a hyperlink-based HTML search engine such as Google.
- XRANK can be used to query both HTML and XML documents.
5Keyword Search Querying - 1
- Keyword search querying
- Advantage: simple
- Users do not have to learn a complex query language
- Users can issue queries without any prior knowledge about the structure of the underlying data.
- Consequence: the interface is flexible
- Queries may not always be precise and can return a large number of query results.
6Keyword Search Querying - 2
- An important requirement for keyword search is to rank the query results so that the most relevant results appear first.
- Certain limitations of the HTML data model make such systems ineffective in many domains.
- HTML is a presentation language
- HTML cannot capture much semantics
7Keyword Search Querying - 3
- The XML data model addresses this limitation by allowing for extensible element tags (example: Figure 1).
9Querying XML Documents
- One approach is the sophisticated query language XQuery
- Effective in some cases
- Users have to learn a complex query language and understand the schema of the underlying XML
- An alternative approach is XRANK
- Retain the simple keyword search query interface
- Exploit XML's tagged and nested structure during query processing.
10New Challenges
- Keyword searching over XML introduces many new challenges.
- 1. The result of the keyword search query can be a deeply nested XML element.
- Return the deepest node
- 2. Ranking is not solely based on hyperlinks.
- The semantics of containment links (relating parent and child elements) is very different from that of hyperlinks (such as IDREFs and XLinks)
11New Challenges
- 3. The notion of proximity among keywords is more complex
- In HTML, proximity among keywords translates directly to the distance between keywords in a document.
- For XML there is a 2-dimensional proximity metric.
- Keyword distance
- Ancestor distance
12XML Data Model
- XML is a hierarchical format for data representation and exchange.
- An XML document consists of
- A root element, nested sub-elements, attributes and values
- It supports intra-document and inter-document references.
13XML Data Model-2
- Intra-document references are represented using IDREFs.
- Inter-document references are represented using XLink.
- Both IDREFs and XLinks are referred to as hyperlinks!
14Definitions
- A collection of hyperlinked XML documents can be defined as a directed graph $G = (N, CE, HE)$
- $N$: the set of nodes, $N = N_E \cup N_V$
- $N_E$: the set of elements
- $N_V$: the set of values
- $CE$: the set of containment edges relating nodes
- $HE$: the set of hyperlink edges relating nodes
15Definitions - 2
- The edge $(u, v) \in CE$ iff $v$ is a value/nested sub-element of $u$.
- The edge $(u, v) \in HE$ iff $u$ contains a hyperlink reference to $v$.
- An element $u$ is a sub-element of an element $v$ if $(v, u) \in CE$.
- An element $u$ is the parent of node $v$ if $(u, v) \in CE$.
- The predicate $\mathit{contains}(v, k)$ is true if the node $v$ directly or indirectly contains the keyword $k$.
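As an illustration of the definitions above, the following Python sketch models a tiny collection as a graph with containment edges CE and hyperlink edges HE, and evaluates the contains predicate. The node names, the sample text and the helper `children` are hypothetical and only serve the example.

```python
# Toy instance of G = (N, CE, HE); all node names and texts are made up.
elements = {"paper", "title", "section"}               # N_E: element nodes
values = {"t1": "XRANK keyword search",                # N_V: value nodes and their text
          "s1": "ranked XML elements"}
CE = {("paper", "title"), ("paper", "section"),        # containment edges (parent, child)
      ("title", "t1"), ("section", "s1")}
HE = set()                                             # hyperlink edges (none in this toy graph)


def children(u):
    """Nodes v such that (u, v) is a containment edge."""
    return sorted(v for (p, v) in CE if p == u)


def contains(v, k):
    """True if node v directly or indirectly contains keyword k (case-insensitive)."""
    if v in values:                                    # value node: look at its words
        return k.lower() in values[v].lower().split()
    return any(contains(c, k) for c in children(v))   # element: recurse along CE
```

With this data, `contains("paper", "xrank")` holds because the keyword is reached through the title, while `contains("section", "xrank")` does not.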
16Keyword Query Results
- There are two possible semantics for keyword search queries
- Conjunctive keyword query semantics
- Elements that contain all of the query keywords are returned.
- Disjunctive keyword query semantics
- Elements that contain at least one of the query keywords are returned.
- This paper focuses on conjunctive keyword query semantics.
17Keyword Query Results - 2
- $Q = \{k_1, \ldots, k_n\}$
- $R_0 = \{\, v \mid v \in N_E \wedge \forall k \in Q\ (\mathit{contains}(v, k)) \,\}$
- The set of elements that directly or indirectly contain all of the query keywords.
- $\mathit{Result}(Q) = \{\, v \mid \forall k \in Q\ \exists c \in N\ ((v, c) \in CE \wedge c \notin R_0 \wedge \mathit{contains}(c, k)) \,\}$
- Ensures that only the most specific results are returned.
- Ensures that an element that has multiple independent occurrences of the query keywords is returned.
- Containment edges ($CE$) are considered for the result set; hyperlink edges ($HE$) are considered for ranking.
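The two set definitions above can be tried out on a small tree. The sketch below is only an illustration: the document, the element names and the convention that plain strings stand for value nodes are all assumptions made for the example.

```python
# Hypothetical document: each element maps to its children along containment edges.
# Strings that are not keys of `tree` play the role of value nodes.
tree = {
    "workshop": ["paper1", "paper2"],
    "paper1": ["title1", "body1"],
    "paper2": ["title2"],
    "title1": ["XQL and Proximal Nodes"],
    "body1": ["the XQL query language"],
    "title2": ["querying XML in Xyleme"],
}


def contains(v, k):
    """True if node v directly or indirectly contains the (lower-case) keyword k."""
    if v not in tree:                                  # value node: compare its words
        return k in v.lower().split()
    return any(contains(c, k) for c in tree[v])


def result(query):
    """Return (R0, Result(Q)) for a conjunctive keyword query."""
    # R0: elements that contain every keyword, directly or indirectly.
    r0 = {v for v in tree if all(contains(v, k) for k in query)}
    # Result(Q): every keyword must be reachable through a child that is not in R0.
    res = {v for v in tree
           if all(any(c not in r0 and contains(c, k) for c in tree[v]) for k in query)}
    return r0, res
```

For the query `["xql", "language"]` only `body1` is returned, although `paper1` and `workshop` also contain both keywords: they are less specific. For `["xql", "xml"]` the keywords occur in different papers, so `workshop` is the result.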
18Keyword Query Results - 3
- XML elements provide more context information
- This also poses interesting user-interface challenges.
- One solution is to allow the user to navigate up to the ancestors of the query result
- Another solution is to predefine a set of answer nodes AN.
- XRANK supports both
- Predefining answer nodes may require knowledge of the domain and underlying XML schema
19Ranking Keyword Query Results
- Desired properties of the ranking function
- 1) Result specificity: more specific results should rank higher than less specific results. This is one dimension of result proximity.
- 2) Keyword proximity: another dimension of result proximity.
- 3) Hyperlink awareness: the ranking should use the hyperlinked structure of XML documents.
20Ranking Function Definition
- ElemRank is defined at the granularity of an element and takes the nested structure of XML into account.
- Similar to Google's PageRank
- $Q = (k_1, k_2, \ldots, k_n)$
- $R = \mathit{Result}(Q)$
- A result element $v_1 \in R$
- First define the ranking of $v_1$ with respect to one query keyword $k_i$, $r(v_1, k_i)$, before defining the overall rank, $\mathit{rank}(v_1, Q)$.
21Ranking with respect to one keyword
- There exists a sub-element/value node $v_2$ of $v_1$ such that $v_2 \notin R_0$ and $\mathit{contains}(v_2, k_i)$.
- There is a sequence of containment edges in $CE$ of the form $(v_1, v_2), (v_2, v_3), \ldots, (v_t, v_{t+1})$ such that $v_{t+1}$ is a value node that directly contains the keyword $k_i$.
22Ranking with respect to one keyword
- $r(v_1, k_i)$ does not depend on the ElemRank of the result node $v_1$, except when $v_1 = v_t$, for two reasons
- 1. Less specific results indeed get lower ranks.
- 2. The rank is in fact related to $\mathit{ElemRank}(v_1)$ due to certain properties of containment edges.
- For multiple occurrences of $k_i$ in $v_1$, the combined rank is $f$ applied to the ranks of the individual occurrences, with $f = \max$
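The slides do not spell out the formula for r(v1, ki). A minimal sketch, assuming the decay-based form r(v1, ki) = ElemRank(vt) * decay^(t-1), where vt is the element that directly contains the keyword and decay is a factor between 0 and 1, could look as follows. The function names and the default decay value are hypothetical.

```python
def keyword_rank(elem_rank_vt, t, decay=0.75):
    """Rank of result v1 for one occurrence of a keyword.

    elem_rank_vt is the ElemRank of vt, the element that directly contains the
    keyword; t is the position of vt on the containment path starting at v1
    (t = 1 means v1 = vt). The decay form is an assumption, not taken from the slides.
    """
    return elem_rank_vt * decay ** (t - 1)


def combined_rank(occurrences, decay=0.75, f=max):
    """Combine several occurrences, given as (ElemRank(vt), t) pairs, with f."""
    return f(keyword_rank(e, t, decay) for e, t in occurrences)
```

Under this assumption each extra level of nesting scales the rank by the decay factor, which is one way to make less specific results rank lower. Passing `f=sum` instead of the default `max` would add up the occurrences.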
23Overall Ranking
- The overall ranking is the sum of the ranks with respect to each query keyword, multiplied by a measure of keyword proximity $p(v_1, k_1, k_2, \ldots, k_n)$.
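Written out as a formula, the statement above reads as follows. The symbol $\hat{r}(v_1, k_i)$ is notation assumed here for the combined rank of $v_1$ with respect to keyword $k_i$; the slides do not fix a particular proximity measure $p$.

```latex
\[
  \mathit{rank}(v_1, Q) \;=\;
  \Bigl( \sum_{1 \le i \le n} \hat{r}(v_1, k_i) \Bigr)
  \times p(v_1, k_1, k_2, \ldots, k_n)
\]
```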
24 XRANK System Architecture
25 XRANK System Architecture-2
- ElemRank Computation Module
- Computes the ElemRanks of XML elements
- Combined with ancestor info
- HDIL
- Generates an index structure called HDIL
- The Query Evaluator Module
- Evaluates queries using HDIL
- Returns ranked results.
26ElemRank Computational Module
- ElemRank is a measure of the objective importance of an XML element and is based on the hyperlinked structure of XML docs.
- The PageRank function is the sum of 2 probabilities
- Visiting v at random (probability $1 - d$, with $d = 0.85$)
- Visiting v by navigating (probability $d$)
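For reference, the two-term structure of plain PageRank can be sketched as a short power iteration. This is standard PageRank on a hypothetical link graph, not the ElemRank refinement; the function name, the iteration count and the handling of nodes without outgoing links are choices made for the example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Plain PageRank by power iteration; links maps a node to the nodes it links to."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # First term: reaching v by a random jump.
        new = {v: (1.0 - d) / n for v in nodes}
        # Second term: reaching v by following a link from some u.
        for u in nodes:
            out = links.get(u, [])
            if not out:
                continue          # simplification: rank of link-less nodes is not redistributed
            share = d * rank[u] / len(out)
            for v in out:
                new[v] += share
        rank = new
    return rank
```

On a three-node cycle every node ends up with rank 1/3; a node that is linked from two others ranks above them.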
27ElemRank Computational Module
- PageRank is unidirectional
- Forward ElemRank propagation
- Paper → section
- Reverse ElemRank propagation
- Paper → workshop
28Refinements of PageRank
- Bi-directional transfer of ElemRanks
- Discrimination between containment and hyperlink edges
- Aggregate ElemRanks for reverse containment relationships
29Bi-directional Transfer of ElemRanks
- A simple solution is to add reverse containment edges
- This does not distinguish between containment and hyperlink edges
30Discrimination between containment and hyperlink edges
- It weights forward and reverse containment
relationships similarly.
31Aggregate ElemRanks for reverse containment relationships
32XRANK System
- Efficiently Evaluating XML Keyword Search Queries
33Efficiently Evaluating XML Keyword Search Queries
- Naïve Approach
- Dewey Inverted List (DIL)
- Ranked Dewey Inverted List (RDIL)
- Hybrid Dewey Inverted List (HDIL)
34Naïve Approach
- Main difference between XML and HTML keyword search
- The granularity of query results
- XML keyword search returns elements
- HTML keyword search returns documents
- One way to do XML keyword search
- Treat each element as a document
35Problems of Naïve Approach
- Space Overhead
- Spurious Query Results
- Inaccurate ranking of results
36Space Overhead
- An inverted list contains, for each keyword, the list of documents that contain the keyword
- For XML documents, the list of elements
- A large space overhead because each inverted list contains
- The XML element that directly contains the keyword (1)
- All of (1)'s ancestors, redundantly
37Spurious Query Results
- The naïve approach ignores ancestor-descendant relationships.
- All elements are treated as independent documents
- Results will not correspond to the desired semantics for XML keyword search
38Inaccurate Ranking of Results
- Existing approaches do not take result
specificity into account when ranking results.
39Dewey Inverted List (DIL)
- The naïve approach has drawbacks
- It decouples the representation of ancestors and descendants.
- Dewey encoding of element IDs jointly captures ancestor and descendant information.
41DIL
- An interesting feature
- The ID of an ancestor is a prefix of the ID of a descendant.
- Ancestor-descendant relationships are implicitly captured in the Dewey ID.
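The prefix property can be checked directly on Dewey IDs. In the sketch below the IDs are represented as tuples of integers; the function names and the dotted string format are assumptions made for the example.

```python
def parse_dewey(s):
    """Turn a dotted Dewey ID such as "5.0.3" into a tuple of integers."""
    return tuple(int(part) for part in s.split("."))


def is_ancestor(a, d):
    """True if Dewey ID a is a proper ancestor of d, i.e. a strict prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a
```

Comparing component by component matters: as strings, "5.0.3" is a prefix of "5.0.30", but the elements (5, 0, 3) and (5, 0, 30) are siblings, and `is_ancestor` treats them as such.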
42DIL Data Structure
- The inverted list for a keyword k contains the Dewey IDs of all the XML elements that directly contain the keyword k.
- For multiple documents
- The first component of each Dewey ID is the document ID
43DIL Data Structure -2
- An entry in DIL
- The ElemRank of the corresponding XML element
- The list of all positions where the keyword k appears in that element.
- Entries are sorted by Dewey ID
- The size of DIL is smaller than that of the naïve approach.
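A minimal sketch of how such a list could be built from one document is shown below. It uses Python's standard `xml.etree.ElementTree`; the function name `build_dil`, the constant placeholder ElemRank of 1.0 and the choice of counting positions as word offsets within the document are assumptions for illustration, not details taken from the slides.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_dil(doc_id, xml_text, elem_rank=lambda dewey: 1.0):
    """Return {keyword: [(dewey_id, elem_rank, positions), ...]} sorted by Dewey ID."""
    index = defaultdict(dict)      # keyword -> {dewey_id: [positions]}
    position = 0                   # running word offset within the document

    def add_words(text, dewey):
        nonlocal position
        for word in (text or "").split():
            index[word.lower()].setdefault(dewey, []).append(position)
            position += 1

    def visit(elem, dewey):
        add_words(elem.text, dewey)            # text directly inside this element
        for i, child in enumerate(elem):
            visit(child, dewey + (i,))         # child i extends the parent's Dewey ID
            add_words(child.tail, dewey)       # text after a child belongs to the parent

    visit(ET.fromstring(xml_text), (doc_id,))  # first component is the document ID
    return {k: [(d, elem_rank(d), pos) for d, pos in sorted(entries.items())]
            for k, entries in index.items()}
```

Only elements that directly contain a keyword get an entry; their ancestors are implied by the Dewey ID prefix, which is where the space saving over the naïve approach comes from.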
45DIL Query Processing
- An algorithm that works in a single pass over the query keyword inverted lists.
- The key idea
- Merge the query keyword inverted lists
- Simultaneously compute the longest common prefix of the Dewey IDs in different lists.
48Ranked Dewey Inverted List (RDIL)
- If inverted lists are long (due to common
keywords or large document collections) even the
cost of a single scan of the inverted list can be
expensive, especially if the users want only the
top few results.
49RDIL -2
- One solution
- Order the inverted lists by ElemRank instead of by Dewey ID.
- Higher-ranked results will appear first in the inverted list.
- Threshold Algorithm.
50RDIL Data Structure
- RDIL is similar to DIL except that
- Inverted lists are ordered by ElemRank,
- Each inverted list has a B-tree index of the
Dewey ID field.
52RDIL Query Processing
- Consider an entry retrieved from the inverted list of keyword $k_i$.
- The entry contains the Dewey ID d of a top-ranked element that directly contains the query keyword $k_i$.
- To determine a query result, the longest prefix of d that also contains the other query keywords needs to be determined.
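The prefix lookup can be illustrated with a sorted list and binary search standing in for the index on the Dewey ID field. In a sorted list, the entry sharing the longest prefix with d is always adjacent to d's insertion point, so two neighbours are enough. The function names below are hypothetical, and the returned element is only a candidate: a full implementation would still have to verify and rank it.

```python
from bisect import bisect_left


def common_prefix_len(a, b):
    """Number of leading components shared by two Dewey IDs."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def longest_prefix_containing(d, sorted_ids):
    """Longest prefix of d that has an entry of sorted_ids at or below it."""
    i = bisect_left(sorted_ids, d)
    neighbours = sorted_ids[max(i - 1, 0):i + 1]     # predecessor and successor of d
    best = max((common_prefix_len(d, x) for x in neighbours), default=0)
    return d[:best]


def candidate_result(d, other_lists):
    """Deepest ancestor-or-self of d that contains every other query keyword."""
    depth = min((len(longest_prefix_containing(d, ids)) for ids in other_lists),
                default=len(d))
    return d[:depth]
```

An empty tuple means that no element, not even a document root, is shared between d and the other keyword lists.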
54Hybrid Dewey Inverted List (HDIL)
- In many cases RDIL is likely to perform well.
- It may perform worse than DIL when there is a
query where keywords are not correlated.
55HDIL -2
- The individual query keywords occur relatively frequently in the document collection but rarely occur together in the same document.
- Since the number of results is small
- RDIL has to scan most (or all) of the inverted lists to produce the output.
- Can we combine the benefits of DIL and RDIL without replicating the entire inverted list index?
57HDIL Query Processing
- An adaptive strategy
- Periodically monitor performance.
- Calculate
- Time spent: $t$
- The number of results above the threshold: $r$
- Estimated time remaining for RDIL: $(m - r) \cdot t / r$
- $m$: desired number of query results
- If the estimated time is more than the expected time for DIL, then switch to DIL.
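The switching rule above amounts to a small calculation. The sketch below uses hypothetical function names; treating r = 0 as an unbounded estimate and clamping the estimate at zero once m results are found are assumptions, since the slides do not cover those cases.

```python
def estimated_rdil_remaining(t, r, m):
    """Estimated remaining RDIL time: (m - r) * t / r.

    t: time spent so far, r: results found above the threshold,
    m: desired number of results.
    """
    if r == 0:
        return float("inf")        # assumption: no results yet means no finite estimate
    return max(m - r, 0) * t / r   # assumption: nothing remains once r >= m


def should_switch_to_dil(t, r, m, expected_dil_time):
    """True if finishing with RDIL is expected to take longer than running DIL."""
    return estimated_rdil_remaining(t, r, m) > expected_dil_time
```

For example, after 2 seconds with 4 of 10 desired results found, the estimate is (10 - 4) * 2 / 4 = 3 seconds, so the evaluator would switch only if DIL is expected to take less than that.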
58Experimental Evaluation
- Experimental Setup
- Quality and Ranking Function
- Space requirements
- Query Performance
- (1) the number of query keywords
- (2) the correlation between the keywords
- (3) the desired number of query results
- (4) the selectivity of the keywords.