XRANK: Ranked Keyword Search over XML Documents

1
XRANK: Ranked Keyword Search over XML Documents
Lin Guo, Feng Shao, Chavdar Botev, Jayavel Shanmugasundaram
Presentation by Meghana Kshirsagar and Nitin Gupta,
Indian Institute of Technology, Bombay
2
Outline
  • Motivation
  • Problem Definition, Query Semantics
  • Ranking Function
  • A New Data Structure: Dewey Inverted List (DIL)
  • Algorithms
  • Performance Evaluation

3
Motivation
4
Motivation - I
  • Why do we need search over XML data?
  • Why not use search techniques used on WWW
    (keyword search on HTML)?

5
Motivation - II: Keyword Search, XML vs. HTML
  • XML
    • Structure
      • Links: IDREFs and XLinks
      • Tags: content specifiers
    • Ranking
      • Result: XML element (a tree)
      • Element-level ranking
      • Proximity: width and height
  • HTML
    • Structure
      • Links: document-to-document
      • Tags: format specifiers
    • Ranking
      • Result: document
      • Page-level ranking
      • Proximity: width (distance between words)

6
Problem Definition, Query Semantics, and Ranking
7
Problem Definition
  • Input: set of keywords
  • Output: ranked XML elements

What is a result? How should results be ranked?
8
Bird's-eye view of the system
[Figure labels: Query Keywords, XML doc repository, Preprocessing (ElemRank computation), Results]
9
What is a result?
  • A minimal Steiner tree of XML elements
  • The result set is the set of XML elements that
    contain every query keyword at least once, after
    excluding the occurrences of keywords in contained
    results (if any).

10
[Figure labels: result 1, result 2]
11
Result: Graphical representation
[Figure legend: containment edge, ancestor, descendant]
12
Ranking: Which results to return first?
  • Properties: the ranking function should
    • reflect result specificity
    • consider keyword proximity
    • be hyperlink-aware
  • Ranking function
    • f(height, width, link structure)

13
[Figure labels: less specific result, more specific result]
14
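Ranking Function
For a single XML element (node):
$r(v_1, k_i) = \mathrm{ElemRank}(v_t) \cdot \mathrm{decay}^{t-1}$
where $v_1$ is the result element, $v_t$ is the descendant of $v_1$
that directly contains keyword $k_i$, and $t - 1$ is the number of
containment edges between them.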
15
Ranking Function
Combining ranks in case of multiple occurrences
Overall Rank
16
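Semantics of the ranking function
$r(v_1, k_i) = \mathrm{ElemRank}(v_t) \cdot \mathrm{decay}^{t-1}$
  • Link structure: captured by $\mathrm{ElemRank}(v_t)$
  • Specificity (height): captured by $\mathrm{decay}^{t-1}$
  • Proximity (width): considered in the overall rank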
17
ElemRank Computation: adopt PageRank?
  • PageRank
  • Shortcomings: it fails to capture
    • bidirectional transfer of ElemRanks
    • discrimination between edge types (containment
      and hyperlink)
    • aggregation of ElemRanks over reverse
      containment relationships

18
ElemRank Computation - I
  • Consider both forward and reverse ElemRank
    propagation.
  • $N_e$: total number of XML elements
  • $N_h(u)$: number of hyperlinks from u
  • $N_c(u)$: number of children of u
  • $E = HE \cup CE \cup CE'$
  • $CE'$: reverse containment edges

19
ElemRank Computation - II
  • Separate containment and hyperlink edges
  • CE: containment edges
  • HE: hyperlink edges
  • ElemRank(sub-element) ∝ 1 / (number of sibling
    sub-elements)

20
ElemRank Computation - III
  • Sum over the reverse-containment edges,
    instead of distributing the weight
  • $N_d$: total number of XML documents
  • $N_{de}(v)$: number of elements in the XML document containing v
  • ElemRank(parent) ∝ Sum(ElemRank(sub-elements))
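The three refinements above can be read together as a fixed-point iteration. The sketch below is one possible reading, not the slides' own formula (the equation images did not survive extraction): it assumes a random-jump term of (1 - d1 - d2 - d3) / (Nd * Nde(v)), hyperlink rank divided by Nh(u), parent rank divided by Nc(u), and child ranks summed without division. The function name `elem_rank`, its argument layout, and the fixed iteration count are illustrative assumptions; the default damping values are the ones listed in the evaluation setup.

```python
def elem_rank(parent, hyperlinks, doc_of, d1=0.35, d2=0.25, d3=0.25, iters=50):
    """Iteratively approximate ElemRank (illustrative sketch, see assumptions above).

    parent:     dict child -> parent element (the containment edges CE)
    hyperlinks: list of (u, v) hyperlink edges (HE)
    doc_of:     dict element -> id of the document that contains it
    """
    elems = list(doc_of)
    n_docs = len(set(doc_of.values()))                  # Nd
    doc_size = {}                                       # Nde(v), keyed by document
    for v in elems:
        doc_size[doc_of[v]] = doc_size.get(doc_of[v], 0) + 1
    children = {v: [] for v in elems}
    for c, p in parent.items():
        children[p].append(c)
    out_links = {v: 0 for v in elems}                   # Nh(u)
    for u, _ in hyperlinks:
        out_links[u] += 1

    e = {v: 1.0 / len(elems) for v in elems}
    for _ in range(iters):
        nxt = {}
        for v in elems:
            jump = (1 - d1 - d2 - d3) / (n_docs * doc_size[doc_of[v]])
            # Hyperlinks: each source u splits its rank over its Nh(u) links.
            via_links = sum(e[u] / out_links[u] for u, w in hyperlinks if w == v)
            # Forward containment: the parent splits its rank over its Nc children.
            p = parent.get(v)
            via_parent = e[p] / len(children[p]) if p is not None else 0.0
            # Reverse containment: child ranks are summed, not divided.
            via_children = sum(e[c] for c in children[v])
            nxt[v] = jump + d1 * via_links + d2 * via_parent + d3 * via_children
        e = nxt
    return e


# One document: element "a" with two children "b" and "c", no hyperlinks.
ranks = elem_rank({"b": "a", "c": "a"}, [], {"a": 0, "b": 0, "c": 0})
```

In this small example the parent ends up ranked above either child, which is the aggregation effect the slide describes.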

21
Data Structures and Algorithms
22
Naïve Algorithm
  • Approach
    • Treat each XML element as a document
    • Use keyword search techniques from the WWW
  • Limitations
    • Space overhead (in inverted indices)
    • Failure to model hierarchical relationships
      (ancestor-descendant)
    • Inaccurate ranking
  • Need a new data structure that can model
    hierarchical relationships!
  • Answer: Dewey Inverted Lists

23
Labeling nodes using Dewey IDs
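A minimal sketch of Dewey labeling, assuming each element's ID is its parent's ID extended by the element's position among its siblings, and that the root carries the document number alone. The function name `dewey_labels` and the tuple representation are illustrative choices, not taken from the slides.

```python
import xml.etree.ElementTree as ET


def dewey_labels(root, doc_id):
    """Map a Dewey ID (tuple of ints) to every element below root."""
    labels = {}

    def visit(elem, dewey):
        labels[dewey] = elem
        # Iterating over an Element yields its child elements in document order.
        for position, child in enumerate(elem):
            visit(child, dewey + (position,))

    visit(root, (doc_id,))
    return labels


root = ET.fromstring("<a><b/><c><d/></c></a>")
for dewey, elem in sorted(dewey_labels(root, 5).items()):
    print(".".join(map(str, dewey)), elem.tag)
```

Every ancestor's ID is a prefix of its descendants' IDs, which is the property the inverted lists rely on.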
24
Dewey Inverted Lists
  • One entry per keyword
  • The entry for keyword k holds the Dewey IDs of the
    elements directly containing k
  • A simple equality merge-join of the Dewey ID lists
    won't work!
  • Need to compute prefixes.
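As an illustration of the list layout described above, the following sketch builds a toy inverted list from one XML document. It assumes whitespace tokenization, lower-casing, and element-local word positions, and it leaves out the ElemRank value that a full entry would also need; `build_dil` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_dil(root, doc_id):
    """Return {keyword: [(dewey_id, position_list), ...]} sorted by Dewey ID."""
    dil = defaultdict(list)

    def visit(elem, dewey):
        # Text directly contained in elem: its own text plus the tails of its
        # children (text that follows a child but still belongs to elem).
        direct = [elem.text or ""] + [child.tail or "" for child in elem]
        positions = defaultdict(list)
        for pos, word in enumerate(" ".join(direct).lower().split()):
            positions[word].append(pos)
        for word, pos_list in positions.items():
            dil[word].append((dewey, pos_list))
        for i, child in enumerate(elem):
            visit(child, dewey + (i,))

    visit(root, (doc_id,))
    for entries in dil.values():
        entries.sort()          # tuples compare component by component
    return dict(dil)


doc = ET.fromstring(
    "<paper><title>XQL query</title><author>Ricardo</author>"
    "<body>XQL and <b>XML</b> XQL</body></paper>"
)
dil = build_dil(doc, 5)
```

Only the element that directly contains a word is listed for it; ancestors are recovered later from Dewey ID prefixes.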

25
System Architecture
26
DIL Query Processing
  • Simple equality merge-join will not work
  • Need to find LCP (longest common prefix) over all
    elements with query keyword-match.
  • Single pass over the inverted lists suffices!
  • Compute the LCP while merging the inverted lists (ILs)
    of the individual keywords.
  • ILs are sorted on Dewey IDs
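The longest common prefix of two Dewey IDs identifies their deepest common ancestor. A small sketch, assuming IDs are parsed into integer tuples so that comparison is numeric per component (the helper names `parse` and `lcp` are illustrative):

```python
def parse(dewey):
    """'5.0.3.0.1' -> (5, 0, 3, 0, 1)"""
    return tuple(int(part) for part in dewey.split("."))


def lcp(a, b):
    """Longest common prefix of two Dewey IDs given as tuples."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


common = lcp(parse("5.0.3.0.0"), parse("5.0.3.0.1"))   # (5, 0, 3, 0)
```

Parsing into integers matters for the sort order as well: as strings, "5.10" would sort before "5.9".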

27
Data Structures
  • Array of all inverted lists: invertedList
    • invertedList[i] is the list for keyword i
    • each invertedList[i] is sorted on Dewey ID
  • Heap to maintain the top-m results: resultHeap
  • Stack to store the current Dewey ID, ranks, position
    lists, and longest common prefixes: deweyStack

28
Algorithm on DILs - Abstract
  • While not all inverted lists have been processed:
  1. Read the next entry from the DIL having the smallest
    Dewey ID; call this currentEntry.
  2. Find the longest common prefix (lcp) between the
    stack components and the entry read from the DIL:
    lcp(deweyStack, currentEntry).
  3. Pop the non-matching entries from deweyStack, one
    component at a time, adding a result to the heap if
    appropriate:
    • check whether the current top of stack contains all
      keywords
    • if yes, compute OverallRank and put this result onto
      the heap
    • else, update (rank, posList) of the entry below on
      each pop
  4. Push the non-matching part of currentEntry onto
    deweyStack, that is, the non-matching components of
    currentEntry.deweyID.
  5. Update the components of the top entry of deweyStack.
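The steps above can be sketched as a single-pass merge. This is a simplified illustration under stated assumptions, not the presenters' implementation: position lists and the proximity factor are omitted, multiple occurrences of a keyword are combined with `max`, the overall rank is the plain sum of the per-keyword ranks, and results are collected in a list and cut to the top m at the end instead of maintaining a bounded heap. The name `dil_query` and the default `decay` value are hypothetical.

```python
import heapq


def dil_query(lists, m=10, decay=0.8):
    """lists[i] is the inverted list of keyword i: (dewey_tuple, elem_rank)
    pairs sorted by Dewey ID. Returns up to m (rank, dewey_tuple) results."""
    n = len(lists)
    # Single pass: merge all lists by Dewey ID, remembering the keyword index.
    merged = heapq.merge(*[[(d, i, r) for d, r in lst]
                           for i, lst in enumerate(lists)])
    stack = []      # entries are [component, ranks]; ranks[i] is None if keyword i is absent
    results = []

    def pop():
        comp, ranks = stack.pop()
        dewey = tuple(c for c, _ in stack) + (comp,)
        if all(r is not None for r in ranks):
            # Contains every keyword: report it, and do not pass its
            # occurrences on to the ancestors (contained result).
            results.append((sum(ranks), dewey))
        elif stack:
            parent_ranks = stack[-1][1]
            for i, r in enumerate(ranks):
                if r is not None:
                    scaled = r * decay          # one level up costs one decay factor
                    old = parent_ranks[i]
                    parent_ranks[i] = scaled if old is None else max(old, scaled)

    for dewey, kw, rank in merged:
        # Length of the longest common prefix between the stack and the entry.
        k = 0
        while k < len(stack) and k < len(dewey) and stack[k][0] == dewey[k]:
            k += 1
        while len(stack) > k:                   # pop non-matching components
            pop()
        for comp in dewey[k:]:                  # push non-matching components
            stack.append([comp, [None] * n])
        top = stack[-1][1]
        top[kw] = rank if top[kw] is None else max(top[kw], rank)
    while stack:                                # flush what is left
        pop()
    return heapq.nlargest(m, results)


xql = [((5, 0, 3, 0, 0), 0.4), ((5, 0, 3, 8, 3), 0.2)]
ricardo = [((5, 0, 3, 0, 1), 0.3), ((6, 0, 3, 8, 3), 0.5)]
top = dil_query([xql, ricardo])
```

With these made-up ranks the only result is element 5.0.3.0, the deepest element that contains both keywords; its ancestors are not reported again because the occurrences inside a contained result are excluded.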

29
Example
Query: XQL Ricardo
30
Algorithm Trace: Step 1
Rank[i]: rank due to keyword i; PosList[i]: list of
occurrences of keyword i
Smallest ID: 5.0.3.0.0
[Figure: deweyStack and DIL invertedList]
Action: push all components and find rank, posList
31
Algorithm Trace: Step 2
Smallest ID: 5.0.3.0.1
[Figure: deweyStack and DIL invertedList]
Action: find lcp and pop non-matching components
32
Algorithm Trace: Step 3
Smallest ID: 5.0.3.0.1
[Figure: deweyStack and DIL invertedList]
Action: update rank, posList
33
Algorithm Trace: Step 4
Smallest ID: 5.0.3.0.1
[Figure: deweyStack and DIL invertedList]
Action: push non-matching components
34
Algorithm Trace: Step 5
Smallest ID: 6.0.3.8.3
[Figure: deweyStack and DIL invertedList]
Action: find lcp, update, finally pop all components
35
Problems with DIL
  • Scans the entire inverted-list for all keywords
    before a result is output
  • Very inefficient for top-k computation

36
Other Techniques - RDIL
  • Ranked Dewey Inverted List
  • For efficient top-k result computation
  • Each inverted list (IL) is ordered by ElemRank
  • Each IL has a B+ tree index on the Dewey IDs
  • The RDIL algorithm uses a threshold

37
Algorithm using RDIL (Abstract)
  • Choose the next entry from one of the inverted
    lists in round-robin fashion.
    • say the chosen list is invertedList[i]
    • d = top-ranked Dewey ID from invertedList[i]
  • Find the longest common prefix that contains all
    query keywords
    • Probe the B+ tree index of every other keyword's
      IL for the longest common prefix
  • Claim: for the list invertedList[j] of query keyword j,
    • d2 = smallest Dewey ID in invertedList[j] that is
      greater than d
    • d3 = immediate predecessor of d2
    • lcp = max_prefix(lcp(d, d2), lcp(d, d3))
  • Check whether lcp is a complete result
  • Recompute threshold = sum(ElemRank of the last
    processed element in each query keyword's IL)
  • If (rank of the top-k results on the heap) >
    threshold, return
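The claim about d2 and d3 can be illustrated with a sorted Python list standing in for the B+ tree index (an assumption made for brevity; `bisect_left` plays the role of the index probe). The helper names `lcp` and `probe` are illustrative.

```python
from bisect import bisect_left


def lcp(a, b):
    """Longest common prefix of two Dewey IDs given as tuples."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def probe(d, sorted_ids):
    """Longest prefix that d shares with any Dewey ID in sorted_ids.

    Only two neighbours need checking: the first ID not smaller than d (d2)
    and its immediate predecessor (d3).
    """
    pos = bisect_left(sorted_ids, d)
    best = ()
    for j in (pos - 1, pos):
        if 0 <= j < len(sorted_ids):
            candidate = lcp(d, sorted_ids[j])
            if len(candidate) > len(best):
                best = candidate
    return best


ids_for_keyword_j = [(5, 0, 1), (5, 0, 3, 0, 1), (5, 2), (6, 0)]
deepest = probe((5, 0, 3, 0, 0), ids_for_keyword_j)    # (5, 0, 3, 0)
```

The two-neighbour check is sufficient because, in lexicographic order, the IDs sharing the longest prefix with d sit immediately next to the position where d would be inserted.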

38
Performance of RDIL
  • Works well for queries with highly correlated
    keywords
  • BUT it becomes equivalent to (actually worse than)
    DIL for totally uncorrelated keywords
  • Need an intermediate technique

39
HDIL
  • Uses both DIL and RDIL
  • Adaptive strategy
  • Start with RDIL
  • Switch to DIL if performance is bad
  • How is performance judged?
    • Estimated remaining time for RDIL = $(m - r) \cdot t / r$
      • t: time spent so far
      • r: number of results above the threshold so far
      • m: desired number of results
    • Estimated remaining time for DIL?
      • Number of query keywords is known
      • Size of each IL is known
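A small sketch of the switching test, assuming the DIL cost estimate is supplied by the caller (the slides only say it can be derived from the number of keywords and the list sizes). Both function names and the `dil_estimate` parameter are hypothetical.

```python
def rdil_remaining(m, r, t):
    """Estimated remaining RDIL time, (m - r) * t / r.

    m: desired number of results
    r: results above the threshold so far
    t: time spent so far
    """
    if r == 0:
        return float("inf")     # no results yet, so the estimate is unbounded
    return (m - r) * t / r


def should_switch_to_dil(m, r, t, dil_estimate):
    """Switch when finishing with RDIL looks slower than a full DIL scan."""
    return rdil_remaining(m, r, t) > dil_estimate


# 2 of 10 results after 4 seconds: RDIL is expected to need 16 more seconds.
switch = should_switch_to_dil(m=10, r=2, t=4.0, dil_estimate=5.0)
```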

40
HDIL
  • Data structures?
  • Store the full IL sorted on Dewey ID
  • Store a small fraction of the IL sorted on ElemRank
  • Share the leaf level between the IL and the B+ tree (in
    RDIL)
  • Overhead: top levels of the B+ tree

41
Updating the lists
  • Updating is easy
  • Insertion is very bad!
    • techniques from Tatarinov et al.
    • (we've seen a better technique in this course:
      OrdPath)

42
Evaluation
  • Criteria
    • number of query keywords
    • correlation between query keywords
    • desired number of query results
    • selectivity of keywords
  • Setup
    • Datasets used: DBLP, XMark
    • d1 = 0.35, d2 = 0.25, d3 = 0.25
    • 2.8 GHz Pentium IV, 1 GB RAM, 80 GB HDD

43
Performance - 1
44
Performance - 2
45
Critique
  • New datastructure (DIL) defined to represent
    hierarchical relationships accurately and
    efficiently.
  • Hyperlinks and IDREFs are considered only while
    computing ElemRank. Not used while returning
    results.
  • Only containment edges (ancestor-descendant) are
    considered while computing result trees.
  • Works only on trees, can't handle graphs.

48
The SphereSearch Engine for Unified Ranked
Retrieval of Heterogeneous XML and Web Documents
  • Jens Graupmann, Ralf Schenkel, Gerhard Weikum
  • Max-Planck-Institut für Informatik
  • Presentation by Nitin Gupta and Meghana Kshirsagar
  • Indian Institute of Technology Bombay

49
Why another search engine?
  • To cope with diversity in the structures and
    annotations of the data
  • Ranked retrieval paradigm for producing
    relevance-ordered result lists rather than mere
    Boolean retrieval
  • Shortcomings of current search engines: they lack
    • concept awareness
    • context awareness (or link awareness)
    • abstraction awareness
    • a suitable query language

50
Concept awareness
  • Example: the query researcher max planck yields many
    results about researchers who work at institutes of
    the Max Planck Society
  • A better formulation would be researcher
    person=max planck
  • Objective attained by
    • Transformation to XML
    • Data annotation

51
Concept awareness: Transformation
  • XML:
      <Experiments>
        ... Text1 ...
        <Settings>
          ... Text2 ...
        </Settings>
      </Experiments>
      ...
  • HTML:
      <H1>Experiments</H1>
      ... Text1 ...
      <H2>Settings</H2>
      ... Text2 ...
      <H1> ...

52
Abstraction Awareness
  • Example: synonyms, ontologies
  • Is a connection to various encyclopedias or wikis
    possible?
  • Objective attained by using
    • an ontology service that provides quantified
      ontological information to the system
    • preprocessed information based on focused web
      crawls to estimate statistical correlations
      between the characteristic words of related
      concepts

53
Context Awareness
  • Query may not be answered by web search engines
    as no single web page may be a match
  • Unlike usual navigation axes in XML, context
    should go beyond trees.
  • Consider graph structure spanned by
    Xlink/XPointer references and href hyperlinks
  • Objective attained by introducing the concept of a
    sphere

54
Context Awareness: Sphere
  • What is a sphere?
  • Relevance of an element for a group of query
    conditions is not just determined by its own
    content, but also by the content of other
    neighboring elements, including linked documents,
    in an environment - called Sphere - of the
    element.

55
Query Language
  • A query $S = (Q, J)$ consists of
    • a set $Q = \{G_1, \ldots, G_q\}$ of query groups
    • a set $J = \{J_1, \ldots, J_m\}$ of join conditions
  • Each query group $G_i$ consists of
    • a set of keyword conditions $t_1, \ldots, t_k$
    • a set of concept-value conditions
      $c_1 = v_1, \ldots, c_l = v_l$
  • Each join has the form $Q_i.v = Q_j.w$ (or $Q_i.v \sim Q_j.w$)

56
Query Language
  • Examples
    • German professors who teach database courses and
      have projects on XML:
      P(professor, location=Germany)
      C(course, databases)
      R(project, XML)
    • Gothic and Romanic churches at the same location:
      A(gothic, church)
      B(romanic, church)
      A.location = B.location
57
Data Model
  • A collection $X = (D, L)$ of XML documents $D$ together
    with a set $L$ of (href, XPointer, or XLink) links
    between their elements
  • Treating all attributes as elements, the element-level
    graph $G_E(X) = (V_E(X), E_E(X))$ has the union
    of all elements of the documents as nodes and
    undirected edges between them
  • Each edge has a nonnegative weight:
    1 for parent-child edges, $\lambda$ for links
  • A distance function $d_X(x, y)$ computes the weight of
    a shortest path in $G_E(X)$ between $x$ and $y$
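The distance function can be sketched with Dijkstra's algorithm over the undirected element graph. The adjacency-dict layout, the helper names, and the link weight of 2.0 are assumptions for illustration; the `max_dist` cutoff anticipates that only nodes within a bounded distance matter for sphere scores.

```python
import heapq


def build_graph(parent_child, links, link_weight=2.0):
    """Undirected element graph: weight 1 for parent-child edges,
    link_weight (an assumed value) for link edges."""
    edges = {}

    def add(x, y, w):
        edges.setdefault(x, []).append((y, w))
        edges.setdefault(y, []).append((x, w))

    for parent, child in parent_child:
        add(parent, child, 1.0)
    for x, y in links:
        add(x, y, link_weight)
    return edges


def distances_from(source, edges, max_dist):
    """Shortest-path weight from source to every node within max_dist."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist.get(x, float("inf")):
            continue                            # stale heap entry
        for y, w in edges.get(x, ()):
            nd = d + w
            if nd <= max_dist and nd < dist.get(y, float("inf")):
                dist[y] = nd
                heapq.heappush(heap, (nd, y))
    return dist


graph = build_graph([("a", "b"), ("b", "c"), ("x", "y")], [("c", "x")])
near_a = distances_from("a", graph, max_dist=10)
```

Here "x" and "y" live in a second document and become reachable from "a" only through the link from "c" to "x".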

58
Spheres and Query Groups
  • The node score $ns(n, t)$ is computed using the Okapi
    BM25 model
  • Similarity condition $\sim K$: compute the expansion
    $exp(K)$ of the keyword. The node score is defined as
    $\max_{x \in exp(K)} sim(K, x) \cdot ns(n, x)$,
    where $sim(K, x)$ is the ontological similarity
  • Concept-value condition $c = v$:
    $ns(n, c = v) = 0$ if $name(n) \neq c$, and $ns(n, v)$
    otherwise
  • Similarity concept-value condition $\sim c = v$:
    $sim(name(n), c) \cdot ns(n, v)$
  • This is insufficient
    • in the presence of linked documents
    • when content is spread over several elements


59
Spheres and Query Groups
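  • Sphere $S_d(n)$: the set of nodes at distance $d$ from
    node $n$
  • $s_d(n, t) = \sum_{v \in S_d(n)} ns(v, t)$
  • $s(n, t) = \sum_{i} s_i(n, t) \, \alpha^i$

Example with $\alpha = 0.5$:
$s(1, t) = 1 + 4 \cdot 0.5 + 2 \cdot 0.25 + 5 \cdot 0.125 = 4.125$
$s(2, t) = 3 + 0 \cdot 0.5 + 0 \cdot 0.25 + 1 \cdot 0.125 = 3.125$
$s(n, G) = \sum_j s(n, t_j) + \sum_j s(n, c_j = v_j)$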
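The damped sum can be checked numerically. The sketch below assumes integer distances (unit edge weights) and a damping factor of 0.5, which is what the per-distance factors 0.5, 0.25, 0.125 in the example imply; the function names are illustrative.

```python
def sphere_sums(dist, node_score, max_d):
    """Per-distance sums of node scores.

    dist:       node -> integer distance from the sphere centre n
    node_score: node -> ns(v, t); nodes without a score count as 0
    """
    sums = [0.0] * (max_d + 1)
    for v, d in dist.items():
        if d <= max_d:
            sums[d] += node_score.get(v, 0.0)
    return sums


def sphere_score(per_distance_sums, alpha):
    """Sum of s_i * alpha**i over the distances i = 0, 1, 2, ..."""
    return sum(s * alpha ** i for i, s in enumerate(per_distance_sums))


# Per-distance sums of the two example nodes, with alpha = 0.5.
score_node_1 = sphere_score([1, 4, 2, 5], 0.5)   # 1 + 2 + 0.5 + 0.625 = 4.125
score_node_2 = sphere_score([3, 0, 0, 1], 0.5)   # 3 + 0 + 0 + 0.125 = 3.125
```

Node 2 has the higher own score (3 versus 1), but node 1 wins once its neighbourhood is taken into account.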
60
Spheres and Query Groups: Ranking
  • Create a connection graph $G(N) = (V(N), E(N))$
  • Weight of an edge between $x$ and $y$:
    • 0 if $x$ and $y$ are not connected
    • $1 / (d_X(x, y) + 1)$ otherwise

The compactness $C(N)$ of a potential answer $N$ is then
the sum of the edge weights of a maximum
spanning tree for $G(N)$, and the score is given
by $s(N, S) = \beta \, C(N) + (1 - \beta) \sum_i s(n_i, G_i)$
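A sketch of the compactness and final score, using Kruskal's algorithm with edges sorted by descending weight to obtain a maximum spanning tree. The `dist` callback, the function names, and the default value of beta are assumptions made for this illustration.

```python
def compactness(nodes, dist):
    """Total edge weight of a maximum spanning tree (or forest) of the
    connection graph. dist(x, y) returns the shortest-path distance between
    two nodes, or None when they are not connected."""
    edges = []
    for i, x in enumerate(nodes):
        for y in nodes[i + 1:]:
            d = dist(x, y)
            if d is not None:
                edges.append((1.0 / (d + 1), x, y))
    edges.sort(key=lambda e: e[0], reverse=True)    # heaviest edges first

    root = {n: n for n in nodes}                    # union-find forest

    def find(n):
        while root[n] != n:
            root[n] = root[root[n]]                 # path halving
            n = root[n]
        return n

    total = 0.0
    for w, x, y in edges:
        rx, ry = find(x), find(y)
        if rx != ry:                                # edge joins two components
            root[rx] = ry
            total += w
    return total


def answer_score(nodes, group_scores, dist, beta=0.5):
    """beta * C(N) + (1 - beta) * sum of the per-group sphere scores."""
    return beta * compactness(nodes, dist) + (1 - beta) * sum(group_scores)


table = {frozenset("ab"): 1, frozenset("bc"): 1, frozenset("ac"): 3}
lookup = lambda x, y: table.get(frozenset((x, y)))
score = answer_score(["a", "b", "c"], [2.0, 1.0, 1.0], lookup)
```

In the example the tree keeps the two edges of weight 0.5 and drops the edge of weight 0.25, so C(N) = 1.0.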
61
Spheres and Query Groups: Joins
  • New virtual links form an extended collection
    $X' = (D, L')$
  • Connect the elements that match the join condition
  • Similarity join: for $Q_i.v \sim Q_j.w$, consider the sets
    $N(v)$ (resp. $N(w)$) of elements that have name $v$ ($w$)
    or contain $v$ ($w$) in their content. For each pair
    $x \in N(v)$, $y \in N(w)$, add a link $\{x, y\}$ with
    weight $1 / csim(x, y)$

62
System Architecture
  • Focused web crawls are used to estimate statistical
    correlations between the characteristic words of
    related concepts. The current version uses the Dice
    coefficient.
  • Content is stored in inverted lists with
    corresponding tf-idf-style term statistics.
  • The indexer stores with each element the
    corresponding Dewey encoding of its position
    within the document.
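For reference, the Dice coefficient of two terms can be computed from the sets of crawled documents in which each term occurs. How the system actually collects those sets is not described on the slide, so the inputs below are assumed.

```python
def dice(docs_with_a, docs_with_b):
    """Dice coefficient 2 * |A ∩ B| / (|A| + |B|) of two document-id sets."""
    a, b = set(docs_with_a), set(docs_with_b)
    if not a and not b:
        return 0.0              # neither term was seen; treat as uncorrelated
    return 2 * len(a & b) / (len(a) + len(b))


# Two terms that co-occur in 2 of the documents they appear in.
correlation = dice({1, 2, 3}, {2, 3, 4})    # 2 * 2 / (3 + 3)
```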
63
Query Processor
  1. First compute a result list for each query group
  2. Add virtual links for join conditions
  3. Compute the compactness of a subset of all
    potential answers of the query in order to return
    the top-k results

Computing the result list for a query group:
  1. Compute a list of results for each of the query
    keywords and concept-value conditions.
  2. Candidate nodes: nodes that are at distance at
    most D from any node that occurs in at least one
    of the lists. The sphere score is computed only for
    these nodes, since only these can have a nonzero
    score.
  3. For each candidate node N, look up the node
    scores of the nodes in the sphere of N and add
    these scores with a proper damping factor.

64
Query Processor
  • Virtual links: the processor considers only a limited
    set of possible end points for efficient
    computation
    • Nodes in the spheres up to distance D around nodes
      with nonzero sphere score for any query group
    • Why? Any other node has distance at least
      D + 1 to any result node and thus contributes at
      most 1 / ((D + 1) + 1) to the compactness, which is
      negligible
  • This set of candidate nodes can be computed on
    the fly
  • The set is further reduced by testing join attributes;
    for example, A.x = B.y results in two sets of
    potential end points.

65
Query Processor
  • Generating answers
    • Naïve method: generate all possible potential
      answers from the answers to the query groups, compute
      connection graphs and compactness, and finally
      their scores
    • For top-k answers, use Fagin's Threshold
      Algorithm with sorted lists only
      • Input: sorted lists of node scores and pairwise
        node scores (edges)
      • Output: k potential answers with the best scores

66
Experiments
  • Sun V40z, 16 GB RAM, Windows 2003 Server, Tomcat
    4.0.6 environment, Oracle 10g database
  • Benchmarks: XMach, XMark, INEX, TREC
  • Limitations noted for these benchmarks
    • Does not consider XML at all
    • Semantically poor tags
    • Designed for XQuery-style exact match
  • Collections
    • Wikipedia Collection: from the Wikipedia project;
      HTML collection transformed into XML and annotated
    • Extended Wikipedia Collection: extension of
      Wikipedia with IMDB data, with generated XML
      files for each movie and actor
    • DBLP Collection: based on the DBLP project, which
      indexes more than 480,000 publications
    • INEX: set of 12,107 XML documents and a set of
      queries with and without structural constraints
67
Experiments
Conversion from HTML to XML
Dataset Statistics
68
Experiments
  • SSE-basic: basic version limited to keyword
    conditions using sphere-based scoring
  • SSE-CV: basic version plus concept-value
    conditions
  • SSE-QC: CV version plus query groups (full
    context awareness)
  • SSE-Join: full version with all features
  • SSE-KW: very restricted version with simple
    keyword search
  • GoogleWiki: Google search restricted to
    wikipedia.org
  • GoogleWiki~: Google on wikipedia.org with
    Google's ~ operator for query expansion
  • GoogleWeb: Google search on the entire web
  • GoogleWeb~: Google search on the entire web with
    query expansion

69
Experiments
Aggregated results for Wikipedia
70
Experiments
Aggregated results for Wikipedia and DBLP
71
Experiments
Graph showing the average runtimes for different
versions
72
Thank you