Gökay Burak AKKUS, Ece AKSU - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Gökay Burak AKKUS, Ece AKSU

1
XRANK

XRANK: Ranked Keyword Search over
XML Documents
Ece AKSU
Gökay Burak AKKUS

2
This Paper...

Describes the architecture, implementation and
evaluation of the XRANK system
The contributions of the paper are
(a) the problem definition and system
architecture
(b) an algorithm for computing the ranking of
XML elements
(c) new inverted list index structures and
associated query processing algorithms
(d) an experimental evaluation of XRANK

3
Overview

Problem: Efficiently producing ranked results for
keyword search queries over hierarchical XML
documents.
New challenges:
Results can be deeply nested XML elements.
Ranking is at the granularity of an XML element
(not the document).
Keyword proximity is more complex.

4
Overview - 2

This paper presents the XRANK system to handle these
features of XML keyword search.
XRANK offers both space and performance benefits.
XRANK generalizes a hyperlink-based HTML search
engine such as Google.
XRANK can be used to query both HTML and XML
documents.

5
Keyword Search Querying - 1

Keyword search querying
Advantage: simple
Users do not have to learn a complex query
language.
Users can issue queries without any prior knowledge
about the structure of the underlying data.
Consequence: the interface is flexible, but
queries may not always be precise and can return
a large number of query results.

6
Keyword Search Querying - 2

An important requirement for keyword search is to
rank the query results so that the most relevant
results appear first.
Certain limitations of the HTML data model make
such systems ineffective in many domains.
HTML is a presentation language
HTML cannot capture much semantics

7
Keyword Search Querying - 3

The XML data model addresses this limitation by
allowing for extensible element tags. (Example:
Figure 1)

8
(No Transcript)
9
Querying XML Documents

One approach is the sophisticated query language
XQuery.
Effective in some cases
Users have to learn a complex query language and
understand the schema of the underlying XML.
An alternative approach is XRANK.
Retain the simple keyword search query interface.
Exploit XML's tagged and nested structure during
query processing.

10
New Challenges

Keyword searching over XML introduces many new
challenges.
1. The result of the keyword search query can be
a deeply nested XML element.
return the deepest node
2. Ranking is not solely based on hyperlinks.
semantics of containment links (relating parent
and child elements) is very different from that
of hyperlinks (such as IDREFs and XLinks)

11
New Challenges

3. The notion of proximity among keywords is
more complex
In HTML, proximity among keywords translates
directly to the distance between keywords in a
document.
For XML there is a 2-dimensional proximity
metric.
Keyword distance
Ancestor distance

12
XML Data Model

XML is a hierarchical format for data
representation and exchange.
An XML document consists of
Root element, nested sub-elements, attributes and
values,
supports intra-document and inter-document
references.

13
XML Data Model-2

Intra-document references are represented using
IDREFs.
Inter-document references are represented using
XLink.
Both IDREFs and XLinks are referred to as
hyperlinks!

14
Definitions

A collection of hyperlinked XML documents can be
defined as a directed graph
G = (N, CE, HE)
N: the set of nodes, N = NE ∪ NV
NE: the set of elements
NV: the set of values
CE: the set of containment edges relating nodes
HE: the set of hyperlink edges relating nodes

15
Definitions - 2

The edge (u, v) ∈ CE iff v is a value/nested
sub-element of u.
The edge (u, v) ∈ HE iff u contains a hyperlink
reference to v.
An element u is a sub-element of an element v if
(v, u) ∈ CE.
An element u is the parent of node v if (u, v) ∈
CE.
The predicate contains(v, k) is true if the node
v directly or indirectly contains the keyword k.
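
The definitions above can be sketched in code. The following is a minimal illustration only: the class name XmlGraph, its field names, and the whitespace-based keyword matching are assumptions made for this example and are not taken from the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class XmlGraph:
    """Hypothetical container for G = (N, CE, HE)."""
    elements: set = field(default_factory=set)      # NE
    values: dict = field(default_factory=dict)      # NV: value node -> text
    containment: set = field(default_factory=set)   # CE: (parent, child) pairs
    hyperlinks: set = field(default_factory=set)    # HE: (source, target) pairs

    def children(self, u):
        """Nodes v such that (u, v) is a containment edge."""
        return [v for (p, v) in self.containment if p == u]

    def contains(self, v, keyword):
        """True if v directly or indirectly contains the keyword."""
        if v in self.values:
            # Assumption: keywords are whitespace-separated, case-insensitive.
            return keyword.lower() in self.values[v].lower().split()
        return any(self.contains(c, keyword) for c in self.children(v))


g = XmlGraph()
g.elements |= {"paper", "title", "section"}
g.values.update({"t1": "XRANK keyword search", "s1": "ranking XML elements"})
g.containment |= {("paper", "title"), ("paper", "section"),
                  ("title", "t1"), ("section", "s1")}
```

Here contains("paper", "ranking") holds indirectly through the section element, while contains("title", "ranking") does not.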

16
Keyword Query Results

There are two possible semantics for keyword
search queries:
Conjunctive keyword query semantics:
elements that contain all of the query keywords are returned.
Disjunctive keyword query semantics:
elements that contain at least one of the query keywords are
returned.
This paper focuses on conjunctive keyword query
semantics.

17
Keyword Query Results - 2

Q = {k1, ..., kn}
R0 = {v | v ∈ NE ∧ ∀k ∈ Q (contains(v, k))}
R0 is the set of elements that directly or indirectly
contain all of the query keywords.
Result(Q) = {v | ∀k ∈ Q ∃c ∈ N ((v, c) ∈ CE ∧ c ∉ R0
∧ contains(c, k))}
This ensures that only the most specific results are
returned.
It also ensures that an element that has multiple
independent occurrences of the query keywords is
returned.
CE are considered for the result set; HE are
considered for ranking.
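
As a rough illustration of the two sets defined on this slide, the sketch below evaluates R0 and Result(Q) by brute force on a tiny tree. The element names, the sample text, and the dictionary layout are all hypothetical; the sketch only mirrors the set definitions and makes no attempt at efficiency.

```python
# Hypothetical document: element -> child nodes; value nodes map to their text.
children = {
    "proceedings": ["workshop"],
    "workshop": ["paper_a", "paper_b", "note"],
    "paper_a": ["title_a", "body_a"],
    "paper_b": ["title_b"],
    "title_a": ["v1"], "body_a": ["v2"], "title_b": ["v3"], "note": ["v4"],
}
text = {"v1": "xrank search", "v2": "ranked xml", "v3": "xrank", "v4": "ranked"}
elements = set(children)  # NE; the keys of `text` play the role of NV


def contains(v, k):
    """True if node v directly or indirectly contains keyword k."""
    if v in text:
        return k in text[v].split()
    return any(contains(c, k) for c in children.get(v, []))


def result(query):
    """Return (R0, Result(Q)) computed straight from the definitions."""
    r0 = {v for v in elements if all(contains(v, k) for k in query)}
    res = {v for v in elements
           if all(any(c not in r0 and contains(c, k) for c in children[v])
                  for k in query)}
    return r0, res


r0, res = result({"xrank", "ranked"})
```

In this example "proceedings" is in R0 but is not a result, because its only child already contains all keywords. "workshop" is a result alongside "paper_a", because it has independent occurrences of both keywords outside "paper_a".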

18
Keyword Query Results - 3

Returning XML elements provides more context information
but also poses interesting user-interface challenges.
One solution is to allow the user to navigate up
to the ancestors of the query result.
Another solution is to predefine a set of
answer nodes AN; this
may require knowledge of the domain and the
underlying XML schema.
XRANK supports both.

19
Ranking Keyword Query Results

Desired properties of the ranking function:
1) Result specificity: rank more specific results
higher than less specific results. This is one dimension
of result proximity.
2) Keyword proximity: this is another dimension of
result proximity.
3) Hyperlink awareness: use the hyperlinked structure of
XML documents.

20
Ranking Function Definition

ElemRank is defined at the granularity of an
element and takes the nested structure of XML
into account.
Similar to Google's PageRank
Q = (k1, k2, ..., kn)
R = Result(Q)
A result element v1 ∈ R
First define the ranking of v1 with respect to
one query keyword ki, r(v1,ki) before defining
the overall rank, rank(v1, Q).

21
Ranking with respect to one keyword

There exists a sub-element/value node
v2 of v1 such that
v2 ∉ R0 and contains(v2, ki).
There is a sequence of containment edges
in CE of the form (v1, v2), (v2, v3), ..., (vt,
vt+1) such that vt+1 is a value node that
directly contains the keyword ki.

22
Ranking with respect to one keyword

r(v1, ki) does not depend on the ElemRank of the
result node v1, except when v1 = vt, for 2
reasons:
1. Less specific results indeed get lower ranks.
2. r(v1, ki) is in fact related to ElemRank(v1) due to certain
properties of containment edges.
For multiple occurrences of ki in v1, the combined rank
is computed with a function f, where
f = max

23
Overall Ranking

The overall ranking is the sum of the ranks with
respect to each query keyword, multiplied by a
measure of keyword proximity p(v1, k1, k2, ...,
kn).
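
The combination described above can be sketched as follows. The per-occurrence formula is not shown in this transcript, so the sketch assumes that each occurrence contributes ElemRank(vt) scaled by a geometric decay in the path length t. That scaling, the decay value 0.75, and the function names are assumptions for illustration. The proximity measure p is taken as a given number.

```python
def keyword_rank(occurrences, decay=0.75, combine=max):
    """Rank of a result with respect to one keyword.

    occurrences: list of (elem_rank_of_vt, t) pairs, one per occurrence of
    the keyword, where t is the number of elements on the path v1 .. vt.
    Assumption: each occurrence scores elem_rank * decay ** (t - 1).
    Multiple occurrences are combined with f = max, as on the slide.
    """
    return combine(rank * decay ** (t - 1) for rank, t in occurrences)


def overall_rank(per_keyword_occurrences, proximity, decay=0.75):
    """Sum of the per-keyword ranks, multiplied by the proximity measure."""
    total = sum(keyword_rank(occ, decay) for occ in per_keyword_occurrences)
    return proximity * total


# Two keywords: the first occurs directly in v1 (t = 1), the second one level down.
score = overall_rank([[(0.4, 1)], [(0.8, 2)]], proximity=0.5, decay=0.5)
```

With decay below 1, an occurrence buried deeper under the result element contributes less, which is one way to make less specific results rank lower.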

24
XRANK System Architecture
25
XRANK System Architecture-2

ElemRank Computation Module
Computes the ElemRanks of XML elements
Combined with ancestor info
HDIL
Generates an index structure called HDIL
The Query Evaluator Module
Evaluates queries using HDIL
Returns ranked results.

26
ElemRank Computational Module

ElemRank is a measure of the objective importance
of an XML element and is based on the hyperlinked
structure of XML docs.
The PageRank function is the sum of 2 probabilities (d = 0.85):
Visiting v at random
Visiting v by navigating
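
For reference, plain PageRank (before the ElemRank refinements) can be sketched as a power iteration. This is the standard textbook formulation with damping parameter d, where (1 - d) / N is the random-visit term and the rest comes from following links. The function name, the iteration count, and the handling of nodes without out-links are choices made for this example.

```python
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Random-visit term: every node gets (1 - d) / n.
        new = {v: (1.0 - d) / n for v in nodes}
        for u in nodes:
            out = links.get(u, [])
            if out:
                # Navigation term: u splits d * rank[u] among its out-links.
                share = d * rank[u] / len(out)
                for v in out:
                    new[v] += share
            else:
                # Assumption: a node without out-links spreads its rank evenly.
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank


ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]})
```

Rank only flows in the direction of the links here, which is the unidirectional behavior that the following slides refine.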

27
ElemRank Computational Module

PageRank is unidirectional
Forward ElemRank propagation
Paper → section
Reverse ElemRank propagation
Paper → workshop

28
Refinements of PageRank

Bi-directional transfer of ElemRanks
Discrimination between containment and hyperlink
edges
Aggregate ElemRanks for reverse containment
relationships

29
Bi-directional Transfer of ElemRanks

A simple solution is to add reverse containment
edges, but this
does not distinguish between containment and
hyperlink edges.

30
Discrimination between containment and hyperlink
edges

It weights forward and reverse containment
relationships similarly.

31
Aggregate ElemRanks for reverse containment
relationships
32
XRANK System

Efficiently Evaluating XML Keyword Search Queries

33
Efficiently Evaluating XML Keyword Search Queries

Naïve Approach
Dewey Inverted List (DIL)
Ranked Dewey Inverted List (RDIL)
Hybrid Dewey Inverted List (HDIL)

34
Naïve Approach

Main Difference between XML and HTML keyword
search
The granularity of query results
XML keyword search returns elements
HTML keyword search returns documents
One way to do XML keyword search
Treat each element as a document

35
Problems of Naïve Approach

Space Overhead
Spurious Query Results
Inaccurate ranking of results

36
Space Overhead

An inverted list contains, for each keyword, the
list of documents that contain the keyword.
For XML documents, it contains the list of elements.
There is a large space overhead because each inverted
list contains
(1) the XML elements that directly contain the keyword
(2) all of the ancestors of (1), redundantly
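
The overhead can be seen in a small sketch of the naïve index. The element names and the parent/direct_text layout are hypothetical. The point is only that one keyword occurrence produces one entry for the element holding it plus one entry per ancestor.

```python
from collections import defaultdict

# Hypothetical element tree: element -> parent (None for the root),
# and element -> the text it contains directly.
parent = {"paper": None, "section": "paper", "para": "section"}
direct_text = {"paper": "", "section": "", "para": "xrank ranking"}


def naive_inverted_lists(parent, direct_text):
    """Treat every element as a document: index it under each keyword it
    contains directly or through any descendant."""
    lists = defaultdict(list)
    for elem, text in direct_text.items():
        for word in set(text.split()):
            node = elem
            while node is not None:   # walk up and add every ancestor too
                lists[word].append(node)
                node = parent[node]
    return dict(lists)


naive = naive_inverted_lists(parent, direct_text)
```

Here a single occurrence of "xrank" in "para" yields three entries, one per level of nesting.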

37
Spurious Query Results

The naïve approach ignores ancestor-descendant
relationships.
All elements treated as independent documents
Results will not correspond to the desired
semantics for XML keyword search

38
Inaccurate Ranking of Results

Existing approaches do not take result
specificity into account when ranking results.

39
Dewey Inverted List (DIL)

Naïve approach has drawbacks
Decouples representation of ancestors and
descendants.
Dewey encoding of Element IDs jointly captures
ancestor and descendant information.

40
(No Transcript)
41
DIL

An interesting feature
ID of an ancestor is a prefix of the ID of a
descendant.
Ancestor-descendant relationships are implicitly
captured in the Dewey ID.
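
A minimal sketch of the prefix property, assuming Dewey IDs are written as dot-separated integers and held as tuples (the helper names are made up for this example):

```python
def parse_dewey(s):
    """Turn a dotted Dewey ID such as '5.0.3' into a tuple of integers."""
    return tuple(int(part) for part in s.split("."))


def is_ancestor(a, d):
    """True if the element with Dewey ID a is a proper ancestor of d,
    i.e. a is a strict prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a


paper = parse_dewey("5.0.3")
para = parse_dewey("5.0.3.0.1")
```

Comparing tuples component by component avoids a pitfall of comparing the raw strings, where "5.0.3" would look like a prefix of "5.0.30".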

42
DIL Data Structure

The inverted list for a keyword k contains the
Dewey IDs of all the XML elements that directly
contain the keyword k.
For multiple documents
First component of each Dewey ID is the document
ID

43
DIL Data Structure -2

An entry in DIL contains:
The Dewey ID of the XML element
The ElemRank of the corresponding XML element
The list of all positions where the keyword k
appears in that element.
Entries are sorted by Dewey ID.
The size of DIL is smaller than that of the naïve
approach.
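
One possible in-memory shape for such an index is sketched below. The class and function names and the sample numbers are illustrative assumptions. How the lists are laid out on disk is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DilEntry:
    dewey_id: tuple      # Dewey ID of an element that directly contains the keyword
    elem_rank: float     # ElemRank of that element
    positions: tuple     # positions of the keyword within the element


def build_dil(postings):
    """postings: iterable of (keyword, dewey_id, elem_rank, positions).
    Returns keyword -> list of DilEntry sorted by Dewey ID."""
    index = {}
    for keyword, dewey_id, elem_rank, positions in postings:
        entry = DilEntry(tuple(dewey_id), elem_rank, tuple(positions))
        index.setdefault(keyword, []).append(entry)
    for entries in index.values():
        entries.sort(key=lambda e: e.dewey_id)
    return index


dil = build_dil([
    ("xrank", (5, 0, 3, 0, 1), 0.2, [32]),
    ("xrank", (5, 0, 3, 0, 0), 0.3, [38, 89]),
    ("ranking", (6, 0, 3, 8, 3), 0.1, [2]),
])
```

Only elements that directly contain the keyword are stored. Ancestors are recoverable from the Dewey ID prefixes, so they need no entries of their own.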

44
(No Transcript)
45
DIL Query Processing

An algorithm that works in a single pass over the
query keyword inverted lists.
The key idea
Merge the query keyword inverted lists
Simultaneously compute the longest common prefix
of the Dewey IDs in different lists.
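
A simplified sketch of this idea is given below. It merges the Dewey-sorted lists and keeps a stack holding the path of the current Dewey ID, popping components that fall outside the longest common prefix with the next ID. It only finds result elements under the conjunctive semantics. Ranks and keyword positions are left out, and the function and variable names are assumptions, so it should not be read as the algorithm from the paper.

```python
import heapq


def dil_query(lists):
    """lists: one Dewey-sorted list of ID tuples per query keyword.
    Returns the Dewey IDs of the result elements in the order found."""
    n = len(lists)
    merged = heapq.merge(*[[(d, i) for d in lst] for i, lst in enumerate(lists)])
    # Each stack entry: [dewey component, keywords seen below, contains_all flag]
    stack, results = [], []

    def pop_to(depth):
        while len(stack) > depth:
            comp, seen, contains_all = stack.pop()
            if len(seen) == n:
                # All keywords found without going through a more specific result.
                results.append(tuple(e[0] for e in stack) + (comp,))
                contains_all = True
            elif not contains_all and stack:
                stack[-1][1] |= seen      # pass occurrences up to the parent
            if contains_all and stack:
                stack[-1][2] = True       # the parent also contains all keywords

    for dewey, kw in merged:
        lcp = 0                           # longest common prefix with the stack
        while lcp < min(len(stack), len(dewey)) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        pop_to(lcp)
        for comp in dewey[lcp:]:
            stack.append([comp, set(), False])
        stack[-1][1].add(kw)
    pop_to(0)
    return results


hits = dil_query([[(0, 0, 0, 0), (0, 0, 1, 0)],   # keyword 0
                  [(0, 0, 0, 1), (0, 0, 2)]])     # keyword 1
```

Each list is read once and in order, which is what makes a single pass possible. An element that already contains all keywords does not pass its occurrences up, so ancestors are returned only when they have independent occurrences of their own.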

46
(No Transcript)
47
(No Transcript)
48
Ranked Dewey Inverted List (RDIL)

If inverted lists are long (due to common
keywords or large document collections) even the
cost of a single scan of the inverted list can be
expensive, especially if the users want only the
top few results.

49
RDIL -2

One solution
Order the inverted lists by the ElemRank instead
of by the Dewey ID.
Higher ranked results will appear first in the
inverted list.
Threshold Algorithm.

50
RDIL Data Structure

RDIL is similar to DIL except that
Inverted lists are ordered by ElemRank,
Each inverted list has a B-tree index on the
Dewey ID field.

51
(No Transcript)
52
RDIL Query Processing

Consider an entry retrieved from the inverted
list of keyword ki.
The entry contains the Dewey ID d of a top-ranked
element that directly contains the query keyword
ki.
To determine a query result, the longest prefix of
d that also contains the other query keywords
needs to be determined.
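
The prefix lookup for one other keyword can be sketched with a sorted list and binary search standing in for the index on the Dewey ID field. The function names and sample IDs are hypothetical. The sketch relies on the fact that, in sorted order, the ID sharing the longest prefix with d is one of the two neighbors of d's insertion point.

```python
import bisect


def lcp_len(a, b):
    """Length of the longest common prefix of two Dewey ID tuples."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def longest_prefix_containing(d, sorted_ids):
    """Longest prefix of d that is also a prefix of some ID in sorted_ids,
    i.e. the deepest ancestor-or-self of d that contains the other keyword."""
    i = bisect.bisect_left(sorted_ids, d)
    best = 0
    for j in (i - 1, i):                  # only the two neighbors can be best
        if 0 <= j < len(sorted_ids):
            best = max(best, lcp_len(d, sorted_ids[j]))
    return d[:best]


other_keyword_ids = [(5, 0, 3, 0, 0), (5, 0, 3, 8, 3), (6, 0, 1)]
ancestor = longest_prefix_containing((5, 0, 3, 0, 1), other_keyword_ids)
```

With more than two query keywords, the same lookup would be repeated against each other keyword's list and the shortest of the returned prefixes kept.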

53
(No Transcript)
54
Hybrid Dewey Inverted List (HDIL)

In many cases RDIL is likely to perform well.
It may perform worse than DIL when there is a
query where keywords are not correlated.

55
HDIL -2

The individual query keywords occur relatively
frequently in the document collection but rarely
occur together in the same document.
Since the number of results is small
RDIL has to scan most (or all) of the inverted
lists to produce the output.
Can we combine the benefits of DIL and RDIL
without replicating the entire inverted list
index?

56
(No Transcript)
57
HDIL Query Processing