Gökay Burak AKKUS, Ece AKSU - PowerPoint PPT Presentation

About This Presentation
Title:

Gökay Burak AKKUS, Ece AKSU

Description:

Dewey encoding of Element IDs jointly captures ancestor and descendant information. ... ID of an ancestor is a prefix of the ID of a descendant. ... – PowerPoint PPT presentation

Slides: 59
Provided by: GOK9

Transcript and Presenter's Notes


1
XRANK
  • XRANK: Ranked Keyword Search over XML Documents
  • Ece AKSU
  • Gökay Burak AKKUS

2
This Paper...
  • Describes the architecture, implementation and
    evaluation of the XRANK system
  • The contributions of the paper are
  • (a) the problem definition and system
    architecture
  • (b) an algorithm for computing the ranking of
    XML elements
  • (c) new inverted list index structures and
    associated query processing algorithms
  • (d) an experimental evaluation of XRANK

3
Overview
  • Problem: efficiently producing ranked results for
    keyword search queries over hierarchical XML
    documents.
  • New challenges:
  • Results can be deeply nested XML elements.
  • Ranking is at the granularity of an XML element
    (not the document).
  • Keyword proximity is more complex.

4
Overview - 2
  • This paper presents the XRANK system to handle these
    features of XML keyword search.
  • XRANK offers both space and performance benefits
  • XRANK generalizes a hyperlink based HTML search
    engine such as Google.
  • XRANK can be used to query both HTML and XML
    documents.

5
Keyword Search Querying - 1
  • Keyword search querying
  • Advantage: simple
  • users do not have to learn a complex query
    language
  • can issue queries without any prior knowledge
    about the structure of the underlying data.
  • Consequence: the interface is flexible
  • Queries may not always be precise and can return
    a large number of query results.

6
Keyword Search Querying - 2
  • An important requirement for keyword search is to
    rank the query results so that the most relevant
    results appear first.
  • Certain limitations of the HTML data model make
    such systems ineffective in many domains.
  • HTML is a presentation language
  • HTML cannot capture much semantics

7
Keyword Search Querying - 3
  • The XML data model addresses this limitation by
    allowing for extensible element tags. (Example:
    Figure 1)

8
(No Transcript)
9
Querying XML Documents
  • One approach is a sophisticated query language
    such as XQuery
  • Effective in some cases
  • Users have to learn a complex query language and
    understand the schema of the underlying XML
  • An alternative approach is XRANK
  • Retain the simple keyword search query interface
  • Exploit XML's tagged and nested structure during
    query processing.

10
New Challenges
  • Keyword searching over XML introduces many new
    challenges.
  • 1. The result of the keyword search query can be
    a deeply nested XML element.
  • Return the deepest node
  • 2. Ranking is not solely based on hyperlinks.
  • The semantics of containment links (relating parent
    and child elements) are very different from those
    of hyperlinks (such as IDREFs and XLinks)

11
New Challenges
  • 3. The notion of proximity among keywords is
    more complex
  • In HTML, proximity among keywords translates
    directly to the distance between keywords in a
    document.
  • For XML there is a 2-dimensional proximity
    metric.
  • Keyword distance
  • Ancestor distance

12
XML Data Model
  • XML is a hierarchical format for data
    representation and exchange.
  • An XML document consists of
  • Root element, nested sub-elements, attributes and
    values,
  • supports intra-document and inter-document
    references.

13
XML Data Model-2
  • Intra-document references are represented using
    IDREFs.
  • Inter-document references are represented using
    XLink.
  • Both IDREFs and XLinks are referred to as
    hyperlinks!

14
Definitions
  • A collection of hyperlinked XML documents can be
    defined as a directed graph
  • $G = (N, CE, HE)$
  • $N$: the set of nodes, $N = N_E \cup N_V$
  • $N_E$: the set of elements
  • $N_V$: the set of values
  • $CE$: the set of containment edges relating nodes
  • $HE$: the set of hyperlink edges relating nodes

15
Definitions - 2
  • The edge $(u, v) \in CE$ iff v is a value/nested
    sub-element of u.
  • The edge $(u, v) \in HE$ iff u contains a hyperlink
    reference to v.
  • An element u is a sub-element of an element v if
    $(v, u) \in CE$.
  • An element u is the parent of node v if
    $(u, v) \in CE$.
  • The predicate contains(v, k) is true if the node
    v directly or indirectly contains the keyword k.
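
The graph model and the contains predicate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `XmlGraph`, its field names, and the whitespace tokenization of value nodes are hypothetical choices, not part of the XRANK system.

```python
from dataclasses import dataclass, field

@dataclass
class XmlGraph:
    elements: set = field(default_factory=set)   # NE: element nodes
    values: dict = field(default_factory=dict)   # NV: value node -> its text
    ce: set = field(default_factory=set)         # containment edges (u, v)
    he: set = field(default_factory=set)         # hyperlink edges (u, v)

    def children(self, u):
        """Nodes v such that (u, v) is a containment edge."""
        return [v for (p, v) in self.ce if p == u]

    def contains(self, v, k):
        """True if node v directly or indirectly contains keyword k."""
        if v in self.values:
            # Assumption: keywords are lowercase, whitespace-separated tokens.
            return k in self.values[v].lower().split()
        return any(self.contains(c, k) for c in self.children(v))

g = XmlGraph()
g.elements = {"paper", "title", "section"}
g.values = {"t1": "XRANK keyword search", "t2": "ranking XML elements"}
g.ce = {("paper", "title"), ("paper", "section"),
        ("title", "t1"), ("section", "t2")}
print(g.contains("paper", "xml"))   # reached through section -> t2
print(g.contains("title", "xml"))   # title only holds t1
```

Only containment edges are followed here; hyperlink edges are stored but play no role in the contains predicate.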

16
Keyword Query Results
  • There are two possible semantics for keyword
    search queries
  • conjunctive keyword query semantics
  • Elements that contain all of the query keywords
    are returned.
  • disjunctive keyword query semantics
  • Elements that contain at least one of the query
    keywords are returned.
  • This paper focuses on conjunctive keyword query
    semantics.

17
Keyword Query Results - 2
  • $Q = \{k_1, \ldots, k_n\}$
  • $R_0 = \{ v \mid v \in N_E \wedge \forall k \in Q\ (\text{contains}(v, k)) \}$
  • the set of elements that directly or indirectly
    contain all of the query keywords.
  • $\text{Result}(Q) = \{ v \mid \forall k \in Q\ \exists c \in N\ ((v, c) \in CE \wedge c \notin R_0 \wedge \text{contains}(c, k)) \}$
  • ensures that only the most specific results are
    returned.
  • ensures that an element that has multiple
    independent occurrences of the query keywords is
    returned.
  • CE edges are considered for the result set; HE
    edges are considered for ranking
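
A direct, unoptimized reading of the two set definitions on this slide can be written as follows. It is a sketch for intuition only, assuming that a child in R0 must not contribute its keywords to its parent; the dictionary-based tree (`children`, `text`) and the sample node names are hypothetical.

```python
def contains(node, k, children, text):
    """True if node directly or indirectly contains keyword k."""
    if node in text:                       # value node
        return k in text[node].split()
    return any(contains(c, k, children, text) for c in children.get(node, []))

def r0(query, elements, children, text):
    """Elements that directly or indirectly contain every query keyword."""
    return {v for v in elements
            if all(contains(v, k, children, text) for k in query)}

def result(query, elements, children, text):
    """Elements that reach every keyword through a child that is not in R0."""
    base = r0(query, elements, children, text)
    return {v for v in elements
            if all(any(c not in base and contains(c, k, children, text)
                       for c in children.get(v, []))
                   for k in query)}

# doc -> sec -> (s1: "xml", s2: "search"), and doc -> p1: "xml"
children = {"doc": ["sec", "p1"], "sec": ["s1", "s2"],
            "s1": ["t1"], "s2": ["t2"], "p1": ["t3"]}
text = {"t1": "xml", "t2": "search", "t3": "xml"}
elements = {"doc", "sec", "s1", "s2", "p1"}
query = ["xml", "search"]
print(r0(query, elements, children, text))      # doc and sec
print(result(query, elements, children, text))  # only sec
```

In this example doc is in R0 but is not a result, because its only route to "search" goes through sec, which is itself in R0. Adding a second child of doc that contains "search" would make doc a result as well, which is the "multiple independent occurrences" case.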

18
Keyword Query Results - 3
  • XML elements provide more context information
  • This also poses interesting user-interface challenges.
  • One solution is to allow the user to navigate up
    to the ancestors of the query result
  • Another solution is to predefine a set of
    answer nodes AN.
  • XRANK supports both
  • Predefining answer nodes may require knowledge of
    the domain and underlying XML schema

19
Ranking Keyword Query Results
  • Desired Properties of Ranking Function
  • 1) Result specificity: more specific results should
    rank higher than less specific results. This is one
    dimension of result proximity.
  • 2) Keyword proximity: another dimension of
    result proximity.
  • 3) Hyperlink awareness: the ranking should use the
    hyperlinked structure of XML documents.

20
Ranking Function Definition
  • ElemRank is defined at the granularity of an
    element and takes the nested structure of XML
    into account.
  • Similar to Google's PageRank
  • $Q = (k_1, k_2, \ldots, k_n)$
  • $R = \text{Result}(Q)$
  • A result element $v_1 \in R$
  • First define the ranking of v1 with respect to
    one query keyword ki, r(v1,ki) before defining
    the overall rank, rank(v1, Q).

21
Ranking with respect to one keyword
  • There exists a sub-element/value node $v_2$ of $v_1$
    such that $v_2 \notin R_0$ and $\text{contains}(v_2, k_i)$.
  • There is a sequence of containment edges in CE of
    the form $(v_1, v_2), (v_2, v_3), \ldots, (v_t, v_{t+1})$
    such that $v_{t+1}$ is a value node that
    directly contains the keyword $k_i$.

22
Ranking with respect to one keyword
  • r(v1, ki) does not depend on the ElemRank of the
    result node v1, except when v1 = vt, for 2
    reasons:
  • 1. Less specific results indeed get lower ranks.
  • 2. ElemRank(vt) is in fact related to ElemRank(v1)
    due to certain properties of containment edges.
  • For multiple occurrences of ki in v1, the combined
    rank is given by an aggregation function f over the
    per-occurrence ranks
  • f = max

23
Overall Ranking
  • The overall ranking is the sum of the ranks with
    respect to each query keyword, multiplied by a
    measure of keyword proximity p(v1, k1, k2, ...,
    kn).
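
The combination described on these ranking slides can be illustrated with a small sketch. The transcript does not show the per-occurrence formula, so the form used here, ElemRank(vt) scaled by decay^(t-1), is an assumption chosen so that less specific results get lower ranks; the decay value 0.5, the function names, and the constant proximity value are likewise hypothetical.

```python
def keyword_rank(occurrences, decay=0.5, combine=max):
    """Rank of a result element with respect to one keyword.

    occurrences: list of (elem_rank_of_vt, t) pairs, one per occurrence,
    where t is the number of containment steps from the result element
    down to vt (t = 1 when the result element is vt itself).
    Assumed scaling: elem_rank * decay ** (t - 1).
    """
    return combine(er * decay ** (t - 1) for er, t in occurrences)

def overall_rank(per_keyword_occurrences, proximity=1.0, decay=0.5):
    """Sum of the per-keyword ranks, multiplied by a proximity measure."""
    total = sum(keyword_rank(occ, decay) for occ in per_keyword_occurrences)
    return proximity * total

# Two occurrences of one keyword: direct (t = 1) and two levels down (t = 3).
print(keyword_rank([(0.4, 1), (0.8, 3)]))
# Two keywords, one occurrence each, with a proximity factor of 0.5.
print(overall_rank([[(0.4, 1)], [(0.8, 2)]], proximity=0.5))
```

Passing `combine=sum` instead of `max` shows how a different aggregation function f would change the per-keyword rank.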

24
XRANK System Architecture
25
XRANK System Architecture-2
  • ElemRank Computation Module
  • Computes the ElemRanks of XML elements
  • Combined with ancestor info
  • HDIL
  • Generates an index structure called HDIL
  • The Query Evaluator Module
  • Evaluates queries using HDIL
  • Returns ranked results.

26
ElemRank Computational Module
  • ElemRank is a measure of the objective importance
    of an XML element and is based on the hyperlinked
    structure of XML docs.
  • The PageRank function is the sum of 2 probabilities
  • Visiting v at random (probability 1 - d, with d = 0.85)
  • Visiting v by navigating hyperlinks (probability d)
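
For reference, the two-term PageRank computation can be sketched as a simple power iteration. This is plain PageRank over a hyperlink graph, not ElemRank; the function name, the fixed iteration count, and the even redistribution of rank from nodes without outgoing links are assumptions of this sketch.

```python
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping a node to the list of nodes it links to."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Term 1: reaching v by a random jump.
        new = {v: (1.0 - d) / n for v in nodes}
        # Term 2: reaching v by following a hyperlink from u.
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = d * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # Assumption: a node with no out-links spreads its rank evenly.
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank

ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

Node "a" receives links from both "b" and "c", so it ends up with the highest score, while "c" only receives the random-jump share.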

27
ElemRank Computational Module
  • PageRank is unidirectional
  • Forward ElemRank propagation
  • Paper → section
  • Reverse ElemRank propagation
  • Paper → workshop

28
Refinements of PageRank
  • Bi-directional transfer of ElemRanks
  • Discrimination between containment and hyperlink
    edges
  • Aggregate ElemRanks for reverse containment
    relationships

29
Bi-directional Transfer of ElemRanks
  • A simple solution is to add reverse containment
    edges,
  • but this does not distinguish between containment
    and hyperlink edges

30
Discrimination between containment and hyperlink
edges
  • It weights forward and reverse containment
    relationships similarly.

31
Aggregate ElemRanks for reverse containment
relationships
32
XRANK System
  • Efficiently Evaluating XML Keyword Search Queries

33
Efficiently Evaluating XML Keyword Search Queries
  • Naïve Approach
  • Dewey Inverted List (DIL)
  • Ranked Dewey Inverted List (RDIL)
  • Hybrid Dewey Inverted List (HDIL)

34
Naïve Approach
  • Main Difference between XML and HTML keyword
    search
  • The granularity of query results
  • XML keyword search returns elements
  • HTML keyword search returns documents
  • One way to do XML keyword search
  • Treat each element as a document

35
Problems of Naïve Approach
  • Space Overhead
  • Spurious Query Results
  • Inaccurate ranking of results

36
Space Overhead
  • An inverted list contains, for each keyword, the
    list of documents that contain the keyword
  • For XML documents, it is the list of elements
  • This causes a large space overhead because each
    inverted list contains
  • (1) the XML elements that directly contain the keyword
  • (2) all ancestors of the elements in (1), redundantly
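
The redundancy can be made concrete with a small sketch of the naïve index, in which every ancestor of a matching element gets its own entry. The function name and the parent-pointer representation of the tree are hypothetical.

```python
def naive_inverted_list(direct, parent):
    """direct: keyword -> set of elements that directly contain it.
    parent: element -> its parent element (the root maps to None)."""
    index = {}
    for keyword, elems in direct.items():
        entries = set()
        for e in elems:
            # Walk up the tree: each ancestor is indexed as its own "document".
            while e is not None:
                entries.add(e)
                e = parent.get(e)
        index[keyword] = entries
    return index

parent = {"title": "paper", "paper": "workshop", "workshop": None}
direct = {"xrank": {"title"}}
print(naive_inverted_list(direct, parent))
```

A single direct occurrence in title produces three entries (title, paper, workshop), so the list grows with the depth of the document tree.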

37
Spurious Query Results
  • The naïve approach ignores ancestor-descendant
    relationships.
  • All elements treated as independent documents
  • Results will not correspond to the desired
    semantics for XML keyword search

38
Inaccurate Ranking of Results
  • Existing approaches do not take result
    specificity into account when ranking results.

39
Dewey Inverted List (DIL)
  • Naïve approach has drawbacks
  • Decouples representation of ancestors and
    descendants.
  • Dewey encoding of Element IDs jointly captures
    ancestor and descendant information.

40
(No Transcript)
41
DIL
  • An interesting feature
  • ID of an ancestor is a prefix of the ID of a
    descendant.
  • Ancestor-descendant relationships are implicitly
    captured in the Dewey ID.
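
The prefix property is easy to check in code. In this sketch Dewey IDs are represented as tuples of integers parsed from dotted strings; the helper names are hypothetical.

```python
def parse_dewey(s):
    """Turn a dotted Dewey ID such as "5.0.3" into a tuple of ints."""
    return tuple(int(part) for part in s.split("."))

def is_ancestor(a, d):
    """True if Dewey ID a is a proper prefix of Dewey ID d."""
    return len(a) < len(d) and d[:len(a)] == a

def longest_common_prefix(a, b):
    """Dewey ID of the deepest common ancestor-or-self of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]

section = parse_dewey("5.0.3")
paragraph = parse_dewey("5.0.3.0.1")
print(is_ancestor(section, paragraph))
print(longest_common_prefix(parse_dewey("5.0.3.0.1"), parse_dewey("5.0.4")))
```

Because ancestry reduces to a prefix test, no separate ancestor entries need to be stored in the index.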

42
DIL Data Structure
  • The inverted list for a keyword k contains the
    Dewey IDs of all the XML elements that directly
    contain the keyword k.
  • For multiple documents
  • First component of each Dewey ID is the document
    ID

43
DIL Data Structure -2
  • An entry in DIL
  • ElemRank of corresponding XML element
  • The list of all positions where the keyword k
    appears in that element.
  • Entries are sorted by Dewey IDs
  • The size of DIL is smaller than that of Naïve
    Approach.
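
A minimal in-memory version of this structure might look as follows. The `DilEntry` class, the `build_dil` helper, and the input format are hypothetical; the sketch only mirrors the fields listed on the slide (ElemRank, position list, entries sorted by Dewey ID).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class DilEntry:
    dewey_id: Tuple[int, ...]   # first component is the document ID
    elem_rank: float            # ElemRank of the element
    positions: List[int]        # positions of the keyword in the element

def build_dil(occurrences, elem_rank) -> Dict[str, List[DilEntry]]:
    """occurrences: iterable of (keyword, dewey_id, position) triples for
    elements that directly contain the keyword.
    elem_rank: dict mapping dewey_id -> ElemRank."""
    grouped = {}
    for keyword, dewey_id, pos in occurrences:
        grouped.setdefault(keyword, {}).setdefault(dewey_id, []).append(pos)
    # One list per keyword, sorted by Dewey ID.
    return {
        keyword: [DilEntry(d, elem_rank[d], sorted(p))
                  for d, p in sorted(by_id.items())]
        for keyword, by_id in grouped.items()
    }

occ = [("xml", (5, 0, 3), 17), ("xml", (5, 0, 1), 4), ("xml", (5, 0, 3), 2)]
dil = build_dil(occ, {(5, 0, 1): 0.2, (5, 0, 3): 0.7})
for entry in dil["xml"]:
    print(entry)
```

Only elements that directly contain the keyword appear, which is where the space saving over the naïve approach comes from.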

44
(No Transcript)
45
DIL Query Processing
  • An algorithm that works in a single pass over the
    query keyword inverted lists.
  • The key idea
  • Merge the query keyword inverted lists
  • Simultaneously compute the longest common prefix
    of the Dewey IDs in different lists.
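
The merge-and-prefix idea can be sketched with a stack that mirrors the Dewey ID of the entry currently being processed. This is a simplified illustration, not the paper's full algorithm: it tracks only which keywords were seen under each element and omits ranks and positions. The function name and the list-of-tuples input format are assumptions.

```python
import heapq

def dil_query(lists):
    """lists: one Dewey-sorted list of Dewey ID tuples per query keyword.
    Returns the Dewey IDs of the result elements."""
    n = len(lists)
    # Merge all lists in Dewey order, tagging each ID with its keyword index.
    merged = heapq.merge(*[[(d, i) for d in lst] for i, lst in enumerate(lists)])
    stack = []     # one frame per Dewey component: [component, keywords, in_r0]
    results = []

    def pop():
        comp, seen, in_r0 = stack.pop()
        dewey = tuple(frame[0] for frame in stack) + (comp,)
        if len(seen) == n:          # all keywords seen via non-R0 children
            results.append(dewey)
            in_r0 = True
        if stack:
            if in_r0:
                stack[-1][2] = True     # parent contains all keywords too
            else:
                stack[-1][1] |= seen    # pass keyword occurrences upward

    for dewey, kw in merged:
        # Longest common prefix between the stack and the next Dewey ID.
        lcp = 0
        while lcp < min(len(stack), len(dewey)) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        while len(stack) > lcp:
            pop()
        for comp in dewey[lcp:]:
            stack.append([comp, set(), False])
        stack[-1][1].add(kw)
    while stack:
        pop()
    return results

print(dil_query([[(0, 0, 0), (0, 1)], [(0, 0, 1), (0, 2)]]))
```

Each list is read once from front to back; an element is reported when it is popped with all keywords seen, and a child that already contains all keywords does not pass its occurrences to its parent.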

46
(No Transcript)
47
(No Transcript)
48
Ranked Dewey Inverted List (RDIL)
  • If inverted lists are long (due to common
    keywords or large document collections) even the
    cost of a single scan of the inverted list can be
    expensive, especially if the users want only the
    top few results.

49
RDIL -2
  • One solution
  • Order the inverted lists by the ElemRank instead
    of by the Dewey ID.
  • Higher ranked results will appear first in the
    inverted list.
  • Threshold Algorithm.

50
RDIL Data Structure
  • RDIL is similar to DIL except that
  • Inverted lists are ordered by ElemRank,
  • Each inverted list has a B-tree index of the
    Dewey ID field.

51
(No Transcript)
52
RDIL Query Processing
  • Consider an entry retrieved from the inverted
    list of keyword ki.
  • The entry contains the Dewey ID d of a top-ranked
    element that directly contains the query keyword
    ki.
  • To determine a query result, the longest prefix of
    d that also contains the other query keywords
    needs to be determined.
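
The prefix lookup can be sketched with a sorted list and binary search standing in for the Dewey ID index. It relies on the fact that, in a lexicographically sorted list, the ID sharing the longest prefix with d is adjacent to d's insertion point. The function names are hypothetical, and the returned prefix is only a candidate that would still need to be checked and ranked.

```python
from bisect import bisect_left

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def longest_prefix_containing(d, sorted_ids):
    """Longest prefix of Dewey ID d shared with any ID in sorted_ids,
    the Dewey-sorted IDs of elements that directly contain a keyword."""
    i = bisect_left(sorted_ids, d)
    best = 0
    for j in (i - 1, i):             # only the two neighbours can be closest
        if 0 <= j < len(sorted_ids):
            best = max(best, common_prefix_len(d, sorted_ids[j]))
    return d[:best]

def candidate_result(d, other_lists):
    """Longest prefix of d that contains every other query keyword."""
    depth = min((len(longest_prefix_containing(d, ids)) for ids in other_lists),
                default=len(d))
    return d[:depth]

d = (0, 1, 2)
others = [[(0, 0, 5), (0, 1, 0), (0, 3)], [(0, 1, 2, 4)]]
print(candidate_result(d, others))
```

An empty tuple means that d shares no ancestor with some keyword, for example because the keyword occurs only in other documents.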

53
(No Transcript)
54
Hybrid Dewey Inverted List (HDIL)
  • In many cases RDIL is likely to perform well.
  • It may perform worse than DIL when there is a
    query where keywords are not correlated.

55
HDIL -2
  • The individual query keywords occur relatively
    frequently in the document collection but rarely
    occur together in the same document.
  • Since the number of results is small
  • RDIL has to scan most (or all) of the inverted
    lists to produce the output.
  • Can we combine the benefits of DIL and RDIL
    without replicating the entire inverted list
    index?

56
(No Transcript)
57
HDIL Query Processing
  • An adaptive strategy
  • Periodically monitor performance.
  • Calculate
  • t: the time spent so far
  • r: the number of results above the threshold
  • Estimated time remaining for RDIL: $(m - r) \cdot t / r$
  • m: the desired number of query results
  • If the estimated time is more than the expected time
    for DIL, then switch to DIL.
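
The switching rule is simple arithmetic and can be sketched directly. The function names and the handling of the r = 0 case (treated as an unbounded estimate) are assumptions of this sketch; how the expected DIL time is obtained is left open.

```python
def estimated_rdil_remaining(t, r, m):
    """Estimated time still needed by RDIL: (m - r) * t / r.

    t: time spent so far, r: results found above the threshold,
    m: desired number of query results.
    """
    if r >= m:
        return 0.0
    if r == 0:
        return float("inf")   # assumption: no progress yet, no finite estimate
    return (m - r) * t / r

def should_switch_to_dil(t, r, m, expected_dil_time):
    return estimated_rdil_remaining(t, r, m) > expected_dil_time

# 4 of 10 results after 2.0 time units: (10 - 4) * 2.0 / 4 = 3.0 remaining.
print(estimated_rdil_remaining(2.0, 4, 10))
print(should_switch_to_dil(2.0, 4, 10, expected_dil_time=2.5))
```

With these numbers the estimate of 3.0 exceeds the expected DIL time of 2.5, so the evaluator would switch.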

58
Experimental Evaluation
  • Experimental Setup
  • Quality and Ranking Function
  • Space requirements
  • Query Performance
  • (1) the number of query keywords
  • (2) the correlation between the keywords
  • (3) the desired number of query results
  • (4) the selectivity of the keywords.