Ranked Information Retrieval on XML Data

1
Ranked Information Retrieval on XML Data

Seminar: Information Organization and Search with
XML
Dr. Ralf Schenkel
Summer semester 2003
Saarland University
July 8, 2003
Bernadette Blum, Christian Nicolaus, Markus Uhl

2
Outline

1. Introduction to Information Retrieval
2. Information Retrieval on XML Data
3. Approaches
   ELIXIR
      The ELIXIR language
      The ELIXIR query processing algorithm
      Experiments, Conclusion
   XRANK
      Data model
      Ranking function
      Data structures and algorithms
      Experiments
4. Conclusion

3
1. Introduction to Information Retrieval

Definition
Information Retrieval (IR) is the technology for
searching in collections (corpora, intranets,
Web) of weakly structured documents (text, HTML,
XML, ...)
Applications: search engines, digital libraries,
similarity search on scientific data
Vector space model (text analysis)
based on word occurrence frequency
documents and queries are vectors
result ranking based on a similarity metric in
vector space
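
The vector space model can be sketched in a few lines. This is a minimal illustration only, assuming raw term counts as weights and the cosine as similarity metric; the tokenizer, the function names and the sample documents are made up for this example and do not come from the slides.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a text to a sparse term-frequency vector (lower-cased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (score, document) pairs sorted by decreasing similarity to the query."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Illustrative documents (not from the slides).
docs = [
    "ranked retrieval on xml data",
    "keyword search over html pages",
    "xml query languages",
]
for score, doc in rank("xml retrieval", docs):
    print(f"{score:.3f}  {doc}")
```

Both the query and the documents go through the same `term_vector` function, which is what "documents and queries are vectors" amounts to in practice.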

4
1. Introduction to Information Retrieval (II)

Link analysis (structure analysis)
weighting documents
improve result ranking
PageRank approach (I)
web as directed graph G
random walk of a web surfer
follow hyperlinks with probability (1 - ε)
random jump with probability ε

5
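1. Introduction to Information Retrieval (III)

PageRank approach (II)

[Figure: example web graph of 5 documents (one labeled q) connected by hyperlinks. Edge labels: (1 - ε)/3 for following one of 3 hyperlinks, ε/5 for a random jump to one of the 5 documents.]
Probability of random jump: ε
Probability of following a hyperlink: (1 - ε)

[Formula for p(q), with one term for the random jump and one for the hyperlinks; not preserved in the transcript]
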
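
The random-surfer model can be illustrated with a short power iteration. This is a sketch under stated assumptions: the parameter name `eps` for the random-jump probability, its value 0.15, the handling of documents without outgoing links, and the toy graph are all choices made for this example, not taken from the slides.

```python
def pagerank(links, eps=0.15, iterations=100):
    """Power iteration for the random-surfer model.

    links maps each document to the list of documents it links to.
    eps is the random-jump probability (assumed value).
    """
    docs = list(links)
    n = len(docs)
    p = {d: 1.0 / n for d in docs}            # start from the uniform distribution
    for _ in range(iterations):
        nxt = {d: eps / n for d in docs}      # a random jump reaches every document
        for d in docs:
            targets = links[d] or docs        # assumption: a dead end jumps anywhere
            share = (1 - eps) * p[d] / len(targets)
            for t in targets:
                nxt[t] += share               # follow one of d's hyperlinks
        p = nxt
    return p


# Toy graph: "c" is linked from three documents, "d" from none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

In this toy graph the document without incoming links ends up with exactly the random-jump share `eps / n`, while the heavily linked document "c" collects the largest probability.
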
6
2. Information Retrieval on XML Data

XML: standard for the exchange of structured data and
documents
existing query languages (e.g. XML-QL, Quilt,
XQL, → XQuery)
no ranked or weighted results based on textual
similarity
but extensions exist (XXL, XIRQL)

2 approaches
ELIXIR: SQL-like approach
XRANK: keyword-based approach
7
3.1 ELIXIR

ELIXIR: an expressive and efficient language for
XML information retrieval
extension of XML-QL with a similarity operator (~)
similarity computed by WHIRL
returns the best r answers

8
ELIXIR: The ELIXIR language

Syntax
XML-QL syntax (SQL-like)

    CONSTRUCT <item>$b</>
    WHERE <items.book year=$yb>$b</> in "db.xml",
          <items.cd>$c</> in "db.xml",
          $yb > 1990, $b ~ $c.

CONSTRUCT clause: output format
WHERE clause: pattern statements and predicates
$yb > 1990: boolean operator
$b ~ $c: ELIXIR's similarity operator

similarity calculation even between 2 variables
(→ expressiveness)
no nested queries

9
ELIXIR: The ELIXIR language (II)

WHIRL (I)
Word-based Heterogeneous Information Retrieval
Logic
extends Datalog with a similarity operator (~)
only relational data
efficiently supports ranked IR
Syntax (Horn clause)

conjunction of relational predicates

    output(y, a, t) :- book(y, a, t), y > 1950, t ~ a.

output(y, a, t): output relation
book(y, a, t): input relation
y > 1950: boolean operator
t ~ a: similarity operator

10
ELIXIR: The ELIXIR language (III)

WHIRL (II)
Similarity computation
standard IR term vector techniques
weighting terms (TF-IDF values)
cosine measure

(V: vocabulary of distinct terms; terms t ∈ V;
documents d as vectors, d ∈ R^|V|)
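
The cosine measure over TF-IDF weights can be written out as follows. This is one common formulation, given here as an assumption because the slide's own formula did not survive in the transcript; N (number of documents in the collection) and n_t (number of documents containing term t) are names introduced for illustration.

```latex
w_{d,t} = \log(\mathrm{TF}_{d,t} + 1) \cdot \log\frac{N}{n_t},
\qquad
\mathrm{sim}(d, d') =
  \frac{\sum_{t \in V} w_{d,t}\, w_{d',t}}
       {\sqrt{\sum_{t \in V} w_{d,t}^{2}}\; \sqrt{\sum_{t \in V} w_{d',t}^{2}}}
```

Terms that occur in many documents get a small weight, so two texts are considered similar mainly when they share rare terms.
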
11
ELIXIR: The ELIXIR query processing algorithm

Example (naïve approach)

XML-QL query Q2

    <q2> CONSTRUCT <tuple><b>$b</><c>$c</></>
         WHERE <items.book>$b</> in "db.xml",
               <items.cd>$c</> in "db.xml" </>

full cross product!
Similarity computation for every tuple (b, c)
12
ELIXIR: The ELIXIR query processing algorithm
(II)

Problem

full cross product!
13
ELIXIR: The ELIXIR query processing algorithm
(III)

Solution
do not simply map the full XML data into the relational
model
invoke WHIRL as a subroutine (→ efficiency)

Avoid generating the full cross product!
14
ELIXIR: The ELIXIR query processing algorithm
(IV)
Start: query Q1
3 stages: intermediate queries Q2, Q3, Q4

1. Partition Q1 into a set Q21, ..., Q2N of XML-QL
queries
avoids generating the full cross product
evaluates the ordinary predicates

2 pattern statements with variables that are
compared by a similarity predicate => distinct
Q2j queries

2. WHIRL query Q3
evaluates the similarity predicates
yields an ordered table of the r best answers

3. XML-QL query Q4
transforms Q3's output
into the XML structure specified by Q1

15
ELIXIR: The ELIXIR query processing algorithm (V)

Example (Step I: partition into Q2j queries)

XML-QL query Q21

    <q21> CONSTRUCT <tuple><b>$b</></>
          WHERE <items.book>$b</> in "db.xml" </>

Result of Q21

    <q21><tuple><b>Traditional Ukrainian cookery</></>
         <tuple><b>Being and nothingness</></>
         <tuple><b>Shooting Elvis</></></>

XML-QL query Q22

    <q22> CONSTRUCT <tuple><c>$c</></>
          WHERE <items.cd>$c</> in "db.xml" </>

Result of Q22

    <q22><tuple><c>Ukrainian folk music</></>
         <tuple><c>Being there</></>
         <tuple><c>Milk cow blues</></></>

Avoid generating the full cross product!
16
ELIXIR: The ELIXIR query processing algorithm
(VI)

Example (Step II: WHIRL query Q3)

Inputs

    <q21><tuple><b>Traditional Ukrainian cookery</></>
         <tuple><b>Being and nothingness</></>
         <tuple><b>Shooting Elvis</></></>

    <q22><tuple><c>Ukrainian folk music</></>
         <tuple><c>Being there</></>
         <tuple><c>Milk cow blues</></></>

WHIRL query Q3

    q3(b) :- q21(b), q22(c), b ~ c.

Result of Q3

    <q3><tuple><b>Traditional Ukrainian cookery</></>
        <tuple><b>Being and nothingness</></></>
17
ELIXIR: The ELIXIR query processing algorithm
(VII)

Example (Step III: XML-QL query Q4)

Input

    <q3><tuple><b>Traditional Ukrainian cookery</></>
        <tuple><b>Being and nothingness</></></>

XML-QL query Q4

    <results> CONSTRUCT <item>$b</>
              WHERE <q3.tuple><b>$b</></> in "q3.xml" </>

Final XML output

    <results><item>Traditional Ukrainian cookery</>
             <item>Being and nothingness</></>
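
The three stages of the example can be mimicked in a small Python sketch. It is only an illustration under several assumptions: a dictionary stands in for db.xml, raw term counts replace TF-IDF weights, and the brute-force join in `stage2` stands in for WHIRL, which is designed to find the r best pairs without scoring every pair. All function names are hypothetical.

```python
import heapq
import math
from collections import Counter

# Toy stand-in for db.xml, using the titles from the example slides.
DB = {
    "book": ["Traditional Ukrainian cookery", "Being and nothingness", "Shooting Elvis"],
    "cd": ["Ukrainian folk music", "Being there", "Milk cow blues"],
}


def similarity(a, b):
    """Cosine similarity over raw term counts (a simplification of TF-IDF)."""
    u, v = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def stage1(kind):
    """Q21 / Q22: evaluate one pattern statement on its own (no cross product)."""
    return list(DB[kind])


def stage2(q21, q22, r):
    """Q3: similarity join keeping the r best (score, b, c) triples.

    Brute force for clarity; this is NOT how WHIRL evaluates the join.
    """
    scored = ((similarity(b, c), b, c) for b in q21 for c in q22)
    return heapq.nlargest(r, scored)


def stage3(q3):
    """Q4: wrap the ranked bindings of b in the output structure asked for by Q1."""
    items = "".join(f"<item>{b}</item>" for _, b, _ in q3)
    return f"<results>{items}</results>"


q3 = stage2(stage1("book"), stage1("cd"), r=2)
print(stage3(q3))
```

With r = 2 the sketch keeps the same two book titles as the slides, because each shares one word with a CD title; their relative order depends on the weighting scheme and may differ from the slide.
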
18
ELIXIR: Experiments, Conclusion

Experiments
Total processing time
depends on the details of each query and the input data
increases only marginally with the number of answers r
increases linearly with the number of similarity
join predicates
Partitioning of the initial query (step 1) dominates
(expensive parsing and traversal)

19
ELIXIR: Experiments, Conclusion (II)

Conclusion
ELIXIR extends XML-QL by supporting
IR similarity features for ranking
similarity joins even between 2 variables
(expressiveness)
Algorithm
rewrites the original ELIXIR query into a series of
intermediate XML-QL and WHIRL queries
no full cross product, only filtered tuples of
variable bindings (efficiency)
But
only non-nested queries
the strict three-stage approach may be suboptimal in
some cases (partitioning)

20
XRANK: Ranked Keyword Search over XML Documents
21
Introduction

XRANK: keyword search over XML documents
results: XML elements that contain all searched keywords
ranking: at the granularity of XML elements,
based on hyperlink structure
advantages:
user does not have to learn a query language
no knowledge about the structure of the XML
documents is needed
generalized keyword search engine
(both HTML and XML are possible)

22
Data Model

G = (V, CE, HE): collection of XML documents
V: set of XML elements (tags and attributes)
CE: set of containment edges
HE: set of hyperlink edges
(u,v) ∈ CE ⇔ v is a sub-element of u
(u,v) ∈ HE ⇔ u contains a hyperlink to v
contains(v,k) ⇔ v (in)directly contains the keyword k
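
The data model maps directly onto a small graph structure. The following is a minimal sketch, assuming elements are identified by plain strings and keywords are stored per element; the class and method names are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class XMLGraph:
    """G = (V, CE, HE) plus the keywords each element directly contains."""
    elements: set = field(default_factory=set)        # V
    containment: set = field(default_factory=set)     # CE: (u, v), v is a sub-element of u
    hyperlinks: set = field(default_factory=set)      # HE: (u, v), u links to v
    keywords: dict = field(default_factory=dict)      # element -> directly contained keywords

    def add_element(self, v, parent=None, words=()):
        """Add element v, optionally as a sub-element of parent."""
        self.elements.add(v)
        self.keywords[v] = set(words)
        if parent is not None:
            self.containment.add((parent, v))

    def contains(self, v, k):
        """True if v directly or indirectly (via containment edges) contains keyword k."""
        if k in self.keywords.get(v, ()):
            return True
        return any(self.contains(child, k)
                   for (u, child) in self.containment if u == v)


# Illustrative fragment of a CD collection.
g = XMLGraph()
g.add_element("CDs")
g.add_element("CD1", parent="CDs")
g.add_element("title1", parent="CD1", words=["r.e.m.", "out", "of", "time"])
g.add_element("song1", parent="CD1")
g.add_element("songtitle1", parent="song1", words=["radio", "song"])
```

Note that `contains` follows only containment edges; hyperlink edges matter for ranking, not for deciding whether an element contains a keyword.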

23
Example XML Graph
[Figure: example XML graph; legend: XML element, value]
24
Keyword Query Results (1)
How should the results of keyword search queries
over XML documents be defined?
elements with at least one sub-element
containing all keywords and at least one
sub-element containing some keywords
elements that contain all keywords while no
sub-element contains all keywords!
25
Ranking Elements
How to rank XML elements?
ElemRank

extension of PageRank at the granularity of
elements
objective importance of XML elements
based on hyperlinked and nested structure of XML
elements

26
ElemRank (1)
n: number of XML elements
n_c(u): number of sub-elements of u
n_h(u): number of outgoing hyperlinks from u
CE^{-1} = {(v,u) | (u,v) ∈ CE}: reverse containment edges
E = HE ∪ CE ∪ CE^{-1}
[Figure: example element u with n_c(u) = 3 and n_h(u) = 3; legend: containment edge, reverse containment edge, hyperlink edge]
27
ElemRank (2)
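α: prob. for following a hyperlink
β: prob. for using a containment edge
γ: prob. for using a reverse containment edge
1 - α - β - γ: prob. for a random jump
[Figure: example graph with transition probabilities on the hyperlink, containment and reverse containment edges (the probability of the edge type divided by the number of such edges, e.g. / 3 or / 1) and a random-jump share labeled e / 10 for each element; legend: containment edge, reverse containment edge, hyperlink edge]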
28
ElemRank (3)
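ElemRank e(v)

$$
e(v) = \frac{1 - \alpha - \beta - \gamma}{n}
     + \alpha \sum_{(u,v) \in HE} \frac{e(u)}{n_h(u)}
     + \beta \sum_{(u,v) \in CE} \frac{e(u)}{n_c(u)}
     + \gamma \sum_{(u,v) \in CE^{-1}} \frac{e(u)}{1}
     \qquad (0 \le \alpha, \beta, \gamma \le 1)
$$

Terms, in order: random navigation, via hyperlinks, via forward containment edges, via reverse containment edges
(α, β, γ: probabilities of following a hyperlink, a containment edge and a reverse containment edge)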
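
The ElemRank recurrence can be evaluated by fixed-point iteration, in the same way as PageRank. The sketch below assumes the parameter names `alpha`, `beta`, `gamma` for the probabilities of following a hyperlink, a containment edge and a reverse containment edge; their default values, the iteration count and the toy graph are assumptions for illustration only.

```python
def elem_rank(elements, ce, he, alpha=0.35, beta=0.25, gamma=0.25, iterations=100):
    """Iterate e(v) = (1 - alpha - beta - gamma) / n
                      + alpha * sum over (u,v) in HE    of e(u) / nh(u)
                      + beta  * sum over (u,v) in CE    of e(u) / nc(u)
                      + gamma * sum over (u,v) in CE^-1 of e(u).

    ce and he are sets of (u, v) edges; parameter values are assumed.
    """
    n = len(elements)
    nh = {u: sum(1 for (a, _) in he if a == u) for u in elements}  # outgoing hyperlinks
    nc = {u: sum(1 for (a, _) in ce if a == u) for u in elements}  # sub-elements
    e = {v: 1.0 / n for v in elements}
    for _ in range(iterations):
        nxt = {v: (1 - alpha - beta - gamma) / n for v in elements}  # random jump
        for (u, v) in he:
            nxt[v] += alpha * e[u] / nh[u]      # follow a hyperlink u -> v
        for (u, v) in ce:
            nxt[v] += beta * e[u] / nc[u]       # descend from u to its sub-element v
            nxt[u] += gamma * e[v]              # climb from v to its parent u
        e = nxt
    return e


# Toy graph: root contains a and b, and a holds a hyperlink to b.
elements = ["root", "a", "b"]
ce = {("root", "a"), ("root", "b")}
he = {("a", "b")}
scores = elem_rank(elements, ce, he)
```

In the toy graph, "b" scores higher than its sibling "a" because it receives the same containment share from the root plus the hyperlink share from "a".
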
29
Ranking Function (1)

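ranking functions should take into account
result specificity
hyperlinks
keyword proximity

contains(v,k)
⇒ there is a sequence of containment edges (v_1,v_2), ..., (v_{n-1},v_n) with v_1 = v s.t. v_n
directly contains k

$$ r(v,k) = \mathrm{ElemRank}(v_n) \cdot \mathrm{decay}^{\,n-1} \qquad (0 \le \mathrm{decay} \le 1) $$

ElemRank(v_n): based on hyperlinked structure
decay^{n-1}: result specificity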

30
Ranking Function (2)

m occurrences of keyword k

computation of r_1, ..., r_m

$$ r(v,k) = f(r_1, \ldots, r_m) $$

(with accumulation function f, e.g. max or sum)
p: proximity measure

query q consists of keywords k_1, ..., k_n

$$ R(v,q) = \Big( \sum_{1 \le i \le n} r(v,k_i) \Big) \cdot p(v, k_1, \ldots, k_n) $$

p(v, k_1, ..., k_n): keyword proximity
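
The ranking function can be spelled out in code as a sanity check. This is a sketch under assumptions: the function names, the default decay of 0.75 and the way occurrences are passed in (ElemRank of the directly containing element plus the path length n) are choices made for this illustration, and the proximity value is taken as a given number rather than computed.

```python
def keyword_rank(occurrences, decay=0.75, combine=max):
    """r(v, k) = f(r_1, ..., r_m) with r_i = ElemRank(v_n) * decay ** (n - 1).

    occurrences holds one (elem_rank, n) pair per occurrence of k below v:
    the ElemRank of the element that directly contains k, and the number n
    of elements on the containment path from v down to that element
    (n = 1 if v itself directly contains k).
    """
    return combine(rank * decay ** (n - 1) for rank, n in occurrences)


def overall_rank(per_keyword, proximity=1.0, decay=0.75, combine=max):
    """R(v, q) = (sum over keywords of r(v, k_i)) * p(v, k_1, ..., k_n)."""
    total = sum(keyword_rank(occ, decay, combine) for occ in per_keyword)
    return total * proximity


# One keyword directly in v, a second keyword one level further down.
score = overall_rank([[(0.8, 1)], [(0.4, 2)]], proximity=0.5, decay=0.5)
```

A smaller decay penalizes keywords that sit deep below the result element, which is how result specificity enters the score.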

31
<CDs>
  <CD id="1">
    <title> R.E.M. Out Of Time </title>
    <song>
      <title> Radio Song </title>
      <time> 4:12 </time>
    </song>
    <song>
      <title> Losing My Religion </title>
      <time> 4:26 </time>
    </song>
    ...
  </CD>
  <CD id="2">
    <title> R.E.M. Automatic For... </title>
    ...
  </CD>
  ...
</CDs>
32
XRANK Architecture
[Figure: XRANK architecture. Components: XML documents, ElemRank computation, XML elements with ElemRanks, index structures & algorithms, data access, Query Evaluator. Input: keyword search query. Output: ranked result list.]
33
Naïve Approach

naïve inverted list
contains all XML elements that contain the
keyword

key1: elem11, elem12, ...
key2: elem21, elem22, ...
etc.

Problems:
space overhead
spurious results
inaccurate ranking

34
Dewey IDs
0          <CDs>
0.0          <CD>
0.0.0          <title> R.E.M. Out Of Time
0.0.1          <song>
0.0.1.0          <title> Radio Song
0.0.1.1          <time> 4:12
0.0.2          <song>
0.0.2.0          <title> Losing My Religion
0.0.2.1          <time> 4:26
               ...
0.1          <CD>
0.1.0          <title> R.E.M. Automatic For The People
               ...
             ...
35
DIL Data Structure

Dewey inverted list
contains the Dewey IDs of all XML elements that
directly contain the keyword
sorted by Dewey ID (ascending)

Keyword   | Dewey ID | ElemRank | Position list
R.E.M.    | 0.0.0    | 75       | 0
R.E.M.    | 0.1.0    | 80       | 0
Religion  | 0.0.2.0  | 88       | 2

36
DIL Query Processing (1)

key idea: computation of the longest common prefix
(lcp) of Dewey IDs

[Figure: Dewey stack after step 1; columns: DeweyID, rank 1, rank 2, posList 1, posList 2, pot_result]
37
DIL Query Processing (2)
[Figure: Dewey stacks after steps 1 and 2, with the lcp of consecutive Dewey IDs marked; columns: DeweyID, rank 1, rank 2, posList 1, posList 2, pot_result]
38
DIL Query Processing (3)
[Figure: Dewey stacks after steps 1, 2 and 3, with the lcp of consecutive Dewey IDs marked; columns: DeweyID, rank 1, rank 2, posList 1, posList 2, pot_result]
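
The key idea, the longest common prefix of Dewey IDs, can be illustrated as follows. This sketch is deliberately simpler than the stack-based single-pass merge shown on the slides: it compares every pair of entries by brute force and only produces candidate elements, without ranks and without deciding which candidates count as results. The dictionary `DIL` reuses the entries from the DIL data structure slide; the function names are hypothetical.

```python
def lcp(a, b):
    """Longest common prefix of two Dewey IDs, compared component by component."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)


# Dewey inverted lists: keyword -> [(Dewey ID, ElemRank)], sorted by Dewey ID.
DIL = {
    "r.e.m.": [("0.0.0", 75), ("0.1.0", 80)],
    "religion": [("0.0.2.0", 88)],
}


def candidates(key1, key2):
    """Elements that contain at least one occurrence of each keyword.

    The lcp of two Dewey IDs is the deepest element containing both
    occurrences. Brute force over all pairs; no result filtering.
    """
    return sorted({lcp(a, b) for a, _ in DIL[key1] for b, _ in DIL[key2]})


print(candidates("r.e.m.", "religion"))
```

The comparison works on whole components rather than characters, so that for example 0.1 and 0.10 are not mistaken for sharing the prefix 0.1.
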
39
RDIL Data Structure

ranked Dewey inverted list
each Dewey ID in the list has a position in the
B-tree
B-tree sorted by Dewey ID (ascending)
inverted list sorted by ElemRank (descending)

Example for the keyword R.E.M.
B-tree on Dewey IDs: 0.0.0, 0.1.0
Inverted list (Dewey ID, ElemRank), sorted by ElemRank: (0.1.0, 80), (0.0.0, 75)

40
RDIL Query Processing (1)
[Figure: inverted lists for key1, key2 and key3, each sorted by ElemRank (entry11, entry12, entry13, ...; entry21, entry22, entry23, ...; entry31, entry32, entry33, ...), each with a B-tree on Dewey IDs]
lcp with the Dewey ID of entry11 → result heap
41
RDIL Query Processing (2)
[Figure: inverted lists for key1, key2 and key3, each sorted by ElemRank (entry11, entry12, entry13, ...; entry21, entry22, entry23, ...; entry31, entry32, entry33, ...), each with a B-tree on Dewey IDs]
lcp with the Dewey ID of entry21 → result heap
etc.
42
RDIL Query Processing (3)
[Figure: inverted lists for key1, key2 and key3, each sorted by ElemRank (entry11, entry12, entry13, ...; entry21, entry22, entry23, ...; entry31, entry32, entry33, ...), each with a B-tree on Dewey IDs; labels: ranking, max. reachable ranking, threshold]
43
RDIL Query Processing (4)
The RDIL algorithm stops if threshold < lowest
ElemRank in the result heap, because
max. reachable ranking ≤ threshold < lowest ElemRank in the result heap
⇒ max. reachable ranking < lowest ElemRank in the
result heap
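
The stopping rule follows the general pattern of threshold-style top-k evaluation over lists sorted by score. The sketch below shows only that pattern and is not the RDIL algorithm itself: elements are matched by identity through a dictionary lookup (standing in for the B-tree probe) instead of by longest common prefix of Dewey IDs, scores are simply summed, and all names and sample values are assumptions.

```python
import heapq


def top_k(lists, k):
    """Threshold-style top-k over per-keyword lists sorted by descending score.

    lists holds one non-empty list of (element, score) pairs per keyword.
    An element qualifies only if it occurs in every list; its overall score
    is the sum of its per-list scores.
    """
    if not lists or any(not entries for entries in lists):
        return []
    lookup = [dict(entries) for entries in lists]    # stands in for the B-tree probe
    heap, seen = [], set()                           # min-heap of (score, element)
    for depth in range(max(len(entries) for entries in lists)):
        for entries in lists:                        # round-robin over the lists
            if depth >= len(entries):
                continue
            element = entries[depth][0]
            if element in seen:
                continue
            seen.add(element)
            if all(element in table for table in lookup):
                total = sum(table[element] for table in lookup)
                heapq.heappush(heap, (total, element))
                if len(heap) > k:
                    heapq.heappop(heap)              # keep only the k best
        # Upper bound on the score of any element not seen so far.
        threshold = sum(e[min(depth, len(e) - 1)][1] for e in lists)
        if len(heap) == k and threshold < heap[0][0]:
            break                                    # nothing better can follow
    return sorted(heap, reverse=True)


lists = [
    [("a", 9), ("b", 8), ("c", 1)],
    [("a", 6), ("b", 5), ("c", 2)],
]
print(top_k(lists, 1))
```

With k = 1 the loop stops after the second position of each list, so the low-scoring element "c" is never read; that saving is the point of sorting the lists by rank.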

44
XRANK Architecture
[Figure: XRANK architecture. Components: XML documents, ElemRank computation, XML elements with ElemRanks, DIL / RDIL, data access, Query Evaluator. Input: keyword search query. Output: ranked result list.]
45
Experimental Results (1)
46
Experimental Results (2)
47
Comparison DIL - RDIL
DIL
RDIL