Title: Ranked Information Retrieval on XML Data
- Seminar "Informationsorganisation und -suche mit XML", Dr. Ralf Schenkel
- Summer semester (SS) 2003
- Saarland University
- July 8, 2003
- Bernadette Blum, Christian Nicolaus, Markus Uhl
Outline
- 1. Introduction to Information Retrieval
- 2. Information Retrieval on XML Data
- 3. Approaches
  - ELIXIR
    - The ELIXIR language
    - The ELIXIR query processing algorithm
    - Experiments, Conclusion
  - XRANK
    - Data model
    - Ranking function
    - Data structures and algorithms
    - Experiments
- 4. Conclusion
1. Introduction to Information Retrieval
- Definition
  - Information Retrieval (IR) is the technology for searching in collections (corpora, intranets, the Web) of weakly structured documents: text, HTML, XML, ...
  - used in search engines, digital libraries, and similarity search on scientific data
- Vector space model (text analysis)
  - based on word occurrence frequency
  - documents and queries are vectors
  - result ranking based on a similarity metric in vector space
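The vector space model can be illustrated with a small sketch. It assumes plain term-frequency vectors and the cosine as similarity metric; the function names (`tf_vector`, `cosine`, `rank`) and the sample documents are made up for illustration and do not come from the slides.

```python
import math
from collections import Counter

def tf_vector(text):
    """Map a text to a sparse term-frequency vector (term -> count)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, doc) pairs sorted by decreasing similarity to the query."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical mini corpus
docs = ["xml query languages", "ranked retrieval on xml data", "web link analysis"]
ranking = rank("ranked xml retrieval", docs)
for score, doc in ranking:
    print(f"{score:.3f}  {doc}")
```

The document sharing the most query terms comes first, and a document sharing no term with the query gets similarity 0.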
1. Introduction to Information Retrieval (II)
- Link analysis (structure analysis)
  - weighting documents
  - improve result ranking
- PageRank approach (I)
  - web as directed graph G
  - random walk of a web surfer
  - follow hyperlinks with probability (1-ε)
  - random jump with probability ε
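1. Introduction to Information Retrieval (III)
[Figure: example web graph with five documents, among them q. A random jump reaches each document with probability ε/5; each of three outgoing hyperlinks of a document is followed with probability (1-ε)/3. Legend: document, hyperlink, random jump.]
- probability of a random jump: ε
- probability of following a hyperlink: (1-ε)
- p(q) is the sum of a random-jump term and a hyperlink term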
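The random-surfer model can be sketched as a simple power iteration. This is a minimal illustration under assumptions, not the formula from the slide: `eps` stands for the random-jump probability (its value 0.15 is assumed), the graph is hypothetical, and documents without outgoing links spread their probability uniformly, which is one common convention.

```python
def pagerank(links, eps=0.15, iterations=50):
    """links: dict mapping every document to the list of documents it links to.
    eps: probability of a random jump (assumed value)."""
    docs = list(links)
    n = len(docs)
    p = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        nxt = {d: eps / n for d in docs}              # random jump term
        for d in docs:
            out = links[d]
            if out:
                share = (1 - eps) * p[d] / len(out)   # follow one of the hyperlinks
                for target in out:
                    nxt[target] += share
            else:
                # assumption: a document without links jumps uniformly
                for target in docs:
                    nxt[target] += (1 - eps) * p[d] / n
        p = nxt
    return p

# Hypothetical four-document web
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
p = pagerank(links)
print(sorted(p.items(), key=lambda kv: kv[1], reverse=True))
```

In this example nothing links to document `d`, so it is only ever reached by the random jump.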
2. Information Retrieval on XML Data
- XML: standard for the exchange of structured data and documents
- existing query languages (e.g. XML-QL, Quilt, XQL → XQuery)
  - no ranked or weighted results based on textual similarity
  - but extensions exist (XXL, XIRQL)
- 2 approaches
  - ELIXIR: SQL-like approach
  - XRANK: keyword-based approach
3.1 ELIXIR
- ELIXIR: expressive and efficient language for XML information retrieval
- extension of XML-QL: similarity operator ~
  - computed by WHIRL
  - returns the best r answers
ELIXIR: The ELIXIR language
- Syntax
  - XML-QL syntax (SQL-like)

      CONSTRUCT <item>$b</>
      WHERE <items.book year=$yb>$b</> in "db.xml",
            <items.cd>$c</> in "db.xml",
            $yb > 1990, $b ~ $c.

  - CONSTRUCT clause: output format
  - WHERE clause: pattern statements and predicates
  - $yb > 1990: boolean operator
  - $b ~ $c: ELIXIR's similarity operator
- similarity calculation even between 2 variables (→ expressiveness)
- no nested queries
ELIXIR: The ELIXIR language (II)
- WHIRL (I)
  - Word-based Heterogeneous Information Retrieval Logic
  - extends DATALOG with a similarity operator ~
  - only relational data
  - efficiently supports ranked IR
- Syntax (Horn clause): conjunction of relational predicates

      output(y, a, t) :- book(y, a, t), y > 1950, t ~ a.

  - output(y, a, t): output relation
  - book(y, a, t): input relation
  - y > 1950: boolean operator
  - t ~ a: similarity operator
ELIXIR: The ELIXIR language (III)
- WHIRL (II)
  - Similarity computation
    - standard IR term vector techniques
    - weighting terms (TF-IDF values)
    - cosine measure
  - Notation: V is the vocabulary of distinct terms, t ∈ V are the terms, and documents d, d' ∈ R^|V| are vectors with one component per term
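The cosine measure over TF-IDF weighted term vectors can be written out as follows. The first line is the textbook cosine; the weighting in the second line is one common TF-IDF variant and an assumption here (N, tf and df are names introduced for illustration: number of documents, term frequency, document frequency), not necessarily the exact weighting WHIRL uses.

```latex
\[
  \mathrm{sim}(d, d') =
    \frac{\sum_{t \in V} d_t \, d'_t}
         {\sqrt{\sum_{t \in V} d_t^{2}} \; \sqrt{\sum_{t \in V} {d'_t}^{2}}}
\]
\[
  d_t = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
\]
```

Terms that occur in few documents get a high weight, so two texts sharing a rare term are considered more similar than two texts sharing a frequent one.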
ELIXIR: The ELIXIR query processing algorithm
- Example (naïve approach)

  XML-QL query Q2:
      <q2> CONSTRUCT <tuple><b>$b</><c>$c</></>
           WHERE <items.book>$b</> in "db.xml",
                 <items.cd>$c</> in "db.xml" </>

- full cross product!
- similarity computation for every tuple ($b, $c)
ELIXIR: The ELIXIR query processing algorithm (II)
- full cross product!
ELIXIR: The ELIXIR query processing algorithm (III)
- Solution
  - do not simply map the full XML data into the relational model
  - invoke WHIRL as a subroutine (→ efficiency)
- Avoid generating the full cross product!
ELIXIR: The ELIXIR query processing algorithm (IV)
Start: query Q1
3 stages: intermediate queries Q2, Q3, Q4
- 1. Partition Q1 into a set Q21, ..., Q2N of XML-QL queries
  - avoids generating the full cross product
  - evaluates the ordinary predicates
  - 2 pattern statements whose variables are compared by a similarity predicate ⇒ distinct Q2j queries
- 2. WHIRL query Q3
  - evaluates the similarity predicates
  - returns an ordered table of the r best answers
- 3. XML-QL query Q4
  - transforms Q3's output
  - into the XML structure specified by Q1
ELIXIR: The ELIXIR query processing algorithm (V)
- Example (Step I: partition into Q2n queries)

  XML-QL query Q21:
      <q21> CONSTRUCT <tuple><b>$b</></>
            WHERE <items.book>$b</> in "db.xml" </>
  Output of Q21:
      <q21><tuple><b>Traditional Ukrainian cookery</></>
           <tuple><b>Being and nothingness</></>
           <tuple><b>Shooting Elvis</></></>

  XML-QL query Q22:
      <q22> CONSTRUCT <tuple><c>$c</></>
            WHERE <items.cd>$c</> in "db.xml" </>
  Output of Q22:
      <q22><tuple><c>Ukrainian folk music</></>
           <tuple><c>Being there</></>
           <tuple><c>Milk cow blues</></></>

- Avoid generating the full cross product!
ELIXIR: The ELIXIR query processing algorithm (VI)
- Example (Step II: WHIRL query Q3)

  Input q21:
      <q21><tuple><b>Traditional Ukrainian cookery</></>
           <tuple><b>Being and nothingness</></>
           <tuple><b>Shooting Elvis</></></>
  Input q22:
      <q22><tuple><c>Ukrainian folk music</></>
           <tuple><c>Being there</></>
           <tuple><c>Milk cow blues</></></>

  WHIRL query Q3:
      q3(b) :- q21(b), q22(c), b ~ c.

  Output q3:
      <q3><tuple><b>Traditional Ukrainian cookery</></>
          <tuple><b>Being and nothingness</></></>
ELIXIR: The ELIXIR query processing algorithm (VII)
- Example (Step III: XML-QL query Q4)

  Input q3:
      <q3><tuple><b>Traditional Ukrainian cookery</></>
          <tuple><b>Being and nothingness</></></>

  XML-QL query Q4:
      <results> CONSTRUCT <item>$b</>
                WHERE <q3.tuple><b>$b</></> in "q3.xml" </>

  Final XML output:
      <results><item>Traditional Ukrainian cookery</>
               <item>Being and nothingness</></>
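The three stages can be mimicked in a short sketch on the example data. It is not ELIXIR or WHIRL: the document layout in `DB`, the function names `stage1`, `stage2`, `stage3`, the TF-IDF weighting, and the inverted-index join are all assumptions made for illustration (WHIRL uses its own search procedure to find the r best answers).

```python
import heapq
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical content of db.xml
DB = """<items>
  <book>Traditional Ukrainian cookery</book>
  <book>Being and nothingness</book>
  <book>Shooting Elvis</book>
  <cd>Ukrainian folk music</cd>
  <cd>Being there</cd>
  <cd>Milk cow blues</cd>
</items>"""

def stage1(root, tag):
    """Stage 1 (Q2j): bind one variable at a time, no cross product."""
    return [el.text for el in root.iter(tag)]

def tfidf_vectors(texts, all_texts):
    """Unit-length TF-IDF vectors; document frequencies come from all_texts."""
    n = len(all_texts)
    df = Counter(t for text in all_texts for t in set(text.lower().split()))
    vectors = []
    for text in texts:
        tf = Counter(text.lower().split())
        vec = {t: c * math.log(1 + n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def stage2(books, cds, r):
    """Stage 2 (Q3): the r best $b with respect to $b ~ $c."""
    bvecs = tfidf_vectors(books, books + cds)
    cvecs = tfidf_vectors(cds, books + cds)
    index = defaultdict(list)                  # term -> [(cd number, weight)]
    for j, vec in enumerate(cvecs):
        for term, w in vec.items():
            index[term].append((j, w))
    best = []
    for i, vec in enumerate(bvecs):
        scores = defaultdict(float)            # only CDs sharing a term are scored
        for term, w in vec.items():
            for j, cw in index.get(term, []):
                scores[j] += w * cw
        if scores:
            best.append((max(scores.values()), books[i]))
    return [title for _, title in heapq.nlargest(r, best)]

def stage3(titles):
    """Stage 3 (Q4): wrap the ranked bindings in the requested XML structure."""
    results = ET.Element("results")
    for title in titles:
        ET.SubElement(results, "item").text = title
    return ET.tostring(results, encoding="unicode")

root = ET.fromstring(DB)
books, cds = stage1(root, "book"), stage1(root, "cd")
top = stage2(books, cds, r=2)
output = stage3(top)
print(output)
```

With r = 2 the sketch keeps the two books that share a term with some CD title. "Shooting Elvis" shares no term with any CD and is never scored. The order of the two items depends on the assumed weighting.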
ELIXIR: Experiments, Conclusion
- Experiments
  - Total processing time
    - depends on the details of each query and on the input data
    - increases only marginally with the number of answers r
    - increases linearly with the number of similarity join predicates
  - Partitioning of the initial query (step 1) dominates (expensive parsing and traversal)
ELIXIR: Experiments, Conclusion (II)
- Conclusion
  - ELIXIR extends XML-QL by supporting IR similarity features for ranking
  - similarity joins even between 2 variables (expressiveness)
- Algorithm
  - rewrites the original ELIXIR query into a series of intermediate XML-QL and WHIRL queries
  - no full cross product, only filtered tuples of variable bindings (efficiency)
- But
  - only non-nested queries
  - the strict three-stage approach may be suboptimal in some cases (partition)
XRANK: Ranked Keyword Search over XML Documents
Introduction
- XRANK: keyword search over XML documents
- results
  - XML elements that contain all searched keywords
- ranking
  - at the granularity of XML elements
  - based on hyperlink structure
- advantages
  - user does not have to learn a query language
  - no knowledge about the structure of the XML documents is needed
- generalized keyword search engine
  - (both HTML and XML are possible)
Data Model
- G = (V, CE, HE): collection of XML documents
- V: set of XML elements (tags and attributes)
- CE: set of containment edges
- HE: set of hyperlink edges
- (u,v) in CE ⇔ v is a sub-element of u
- (u,v) in HE ⇔ u contains a hyperlink to v
- contains(v,k) ⇔ v (in)directly contains the keyword k
Example XML Graph
[Figure: example XML graph; legend: XML element, value]
Keyword Query Results (1)
How to define the results of keyword search queries over XML documents?
- elements with at least one sub-element containing all keywords and at least one sub-element containing some keywords
- elements that contain all keywords while no sub-element contains all keywords!
Ranking Elements
How to rank XML elements?
ElemRank
- extension of PageRank at the granularity of elements
- objective importance of XML elements
- based on the hyperlinked and nested structure of XML elements
ElemRank (1)
- n: number of XML elements
- nc(u): number of sub-elements of u
- nh(u): number of outgoing hyperlinks from u
- CE^-1 = {(v,u) | (u,v) ∈ CE}: reverse containment edges
- E = HE ∪ CE ∪ CE^-1
[Figure: example element u with nc(u) = 3 and nh(u) = 3; legend: containment edge, reverse containment edge, hyperlink edge]
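ElemRank (2)
- d1: probability of following a hyperlink
- d2: probability of using a containment edge
- d3: probability of using a reverse containment edge
- ε = 1 - d1 - d2 - d3: probability of a random jump
[Figure: example graph with 10 elements. Each element is reached by a random jump with probability ε/10; an edge adds its navigation probability divided by the number of edges of that kind, e.g. a share of 1/3 for one of three edges and 1/1 for the single reverse containment edge. Legend: containment edge, reverse containment edge, hyperlink edge.]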
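ElemRank (3)
ElemRank e(v):
$$e(v) = \frac{1 - d_1 - d_2 - d_3}{n} + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{n_h(u)} + d_2 \sum_{(u,v) \in CE} \frac{e(u)}{n_c(u)} + d_3 \sum_{(u,v) \in CE^{-1}} \frac{e(u)}{1}$$
with $0 \le d_1, d_2, d_3 \le 1$
- first term: random jump
- remaining terms: navigation
  - via hyperlinks
  - via forward containment edges
  - via reverse containment edges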
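The ElemRank equation can be evaluated by fixed-point iteration, as in the following sketch. The names `d1`, `d2`, `d3` for the three navigation probabilities and their values are assumptions, as is the tiny example graph; elements are numbered 0..n-1 for simplicity.

```python
def elem_rank(n, ce, he, d1=0.35, d2=0.25, d3=0.25, iterations=200):
    """Iterate the ElemRank equation on elements 0..n-1.

    ce, he: lists of (u, v) containment / hyperlink edges.
    d1, d2, d3: assumed probabilities of following a hyperlink, a
    containment edge and a reverse containment edge.
    """
    nh = [0] * n                      # outgoing hyperlinks per element
    nc = [0] * n                      # sub-elements per element
    for u, _ in he:
        nh[u] += 1
    for u, _ in ce:
        nc[u] += 1
    e = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [(1 - d1 - d2 - d3) / n] * n       # random jump
        for u, v in he:
            nxt[v] += d1 * e[u] / nh[u]          # via hyperlinks
        for u, v in ce:
            nxt[v] += d2 * e[u] / nc[u]          # via forward containment edges
            nxt[u] += d3 * e[v]                  # via reverse containment edges
        e = nxt
    return e

# Hypothetical graph: element 0 contains 1 and 2; element 1 links to 2
ce = [(0, 1), (0, 2)]
he = [(1, 2)]
ranks = elem_rank(3, ce, he)
print(ranks)
```

Elements 1 and 2 receive the same share from their parent, so the incoming hyperlink is what lifts element 2 above element 1.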
Ranking Function (1)
- ranking functions should take into account
  - result specificity
  - hyperlinks
  - keyword proximity
- contains(v,k)
  - ⇒ ∃ sequence of containment edges (v1,v2), ..., (vn-1,vn) with v1 = v s.t. vn directly contains k
  - $r(v,k) = \mathrm{ElemRank}(v_n) \cdot \mathit{decay}^{\,n-1}$ with $0 \le \mathit{decay} \le 1$
  - based on hyperlinked structure
Ranking Function (2)
- m occurrences of keyword k
  - computation of r1, ..., rm
  - $r(v,k) = f(r_1, \ldots, r_m)$ (with accumulation function f, e.g. max or sum)
- p: proximity measure
- query q consists of keywords k1, ..., kn
- $R(v,q) = \left( \sum_{i=1}^{n} r(v,k_i) \right) \cdot p(v, k_1, \ldots, k_n)$
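A small sketch makes the two formulas concrete. The function names, the decay value 0.75, and the proximity value 1.0 are assumptions for illustration; the proximity measure p is simply passed in as a number because the slides do not define it.

```python
def keyword_rank(occurrences, decay=0.75, f=max):
    """r(v,k): accumulate the decayed ElemRanks of all occurrences of k under v.

    occurrences: list of (elem_rank, n) pairs, where elem_rank is the
    ElemRank of the element v_n that directly contains k and n is the
    number of elements on the containment path v = v_1, ..., v_n.
    """
    return f(rank * decay ** (n - 1) for rank, n in occurrences)

def overall_rank(per_keyword, proximity=1.0, decay=0.75, f=max):
    """R(v,q): sum of the keyword ranks, scaled by the proximity measure."""
    return sum(keyword_rank(occ, decay, f) for occ in per_keyword) * proximity

# Hypothetical element v with two query keywords:
# keyword 1 occurs one level below v (ElemRank 75, n = 2),
# keyword 2 occurs two levels below v (ElemRank 88, n = 3).
score = overall_rank([[(75, 2)], [(88, 3)]])
print(score)
```

The deeper an occurrence lies below v, the more its ElemRank is scaled down, which favours specific results over large enclosing elements.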
<CDs>
  <CD id="1">
    <title> R.E.M. Out Of Time </title>
    <song>
      <title> Radio Song </title>
      <time> 4:12 </time>
    </song>
    <song>
      <title> Losing My Religion </title>
      <time> 4:26 </time>
    </song>
    ...
  </CD>
  <CD id="2">
    <title> R.E.M. Automatic For... </title>
    ...
  </CD>
  ...
</CDs>
XRANK Architecture
[Figure: XRANK architecture. Components: XML documents, ElemRank computation, XML elements with ElemRanks, index structures and algorithms, data access, Query Evaluator. Input: keyword search query. Output: ranked result list.]
Naïve Approach
- naïve inverted list
  - contains all XML elements that contain the keyword
  - key1: elem11, elem12, ...
  - key2: elem21, elem22, ...
  - etc.
- problems
  - space overhead
  - spurious results
  - inaccurate ranking
Dewey IDs
0 <CDs>
  0.0 <CD>
    0.0.0 <title> R.E.M. Out Of Time
    0.0.1 <song>
      0.0.1.0 <title> Radio Song
      0.0.1.1 <time> 4:12
    0.0.2 <song>
      0.0.2.0 <title> Losing My Religion
      0.0.2.1 <time> 4:26
    ...
  0.1 <CD>
    0.1.0 <title> R.E.M. Automatic For The People
    ...
  ...
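Dewey IDs can be assigned in a single traversal of the document tree, as this sketch shows. It numbers elements only (attributes are ignored), and both the function name `dewey_ids` and the shortened sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<CDs>
  <CD id="1">
    <title>R.E.M. Out Of Time</title>
    <song><title>Radio Song</title><time>4:12</time></song>
    <song><title>Losing My Religion</title><time>4:26</time></song>
  </CD>
  <CD id="2">
    <title>R.E.M. Automatic For The People</title>
  </CD>
</CDs>"""

def dewey_ids(xml_text):
    """Return (dewey, tag, text) triples; the Dewey ID of a child is the
    Dewey ID of its parent extended by the child's position."""
    root = ET.fromstring(xml_text)
    result = []

    def visit(elem, dewey):
        result.append((dewey, elem.tag, (elem.text or "").strip()))
        for position, child in enumerate(elem):
            visit(child, dewey + (position,))

    visit(root, (0,))
    return result

ids = {".".join(map(str, d)): (tag, text) for d, tag, text in dewey_ids(XML)}
for dewey, (tag, text) in ids.items():
    print(dewey, tag, text)
```

Because every ID starts with the ID of its parent, the ancestors of an element can be read off its Dewey ID without touching the document.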
DIL Data Structure
- Dewey inverted list
  - contains the Dewey IDs of all XML elements that directly contain the keyword
  - sorted by Dewey ID (ascending)

Keyword  | Dewey ID | ElemRank | position list
R.E.M.   | 0.0.0    | 75       | 0
R.E.M.   | 0.1.0    | 80       | 0
Religion | 0.0.2.0  | 88       | 2
DIL Query Processing (1)
- key idea: computation of the longest common prefix (lcp) of Dewey IDs
[Figure: Dewey stack after step 1; columns: DeweyID, rank 1, posList 1, rank 2, posList 2, pot_result]
DIL Query Processing (2)
[Figure: Dewey stack after steps 1 and 2, with the lcp of the two Dewey IDs marked; columns: DeweyID, rank 1, posList 1, rank 2, posList 2, pot_result]
DIL Query Processing (3)
[Figure: Dewey stack after steps 1, 2 and 3, with the lcp marked in each step; columns: DeweyID, rank 1, posList 1, rank 2, posList 2, pot_result]
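The key idea, the longest common prefix of Dewey IDs, can be sketched as follows. This is a simplification, not the DIL algorithm: it keeps all prefixes in a dictionary instead of a Dewey stack, ignores ranks and position lists, and reports only the most specific elements that contain every keyword. The function names are made up for illustration.

```python
import heapq

def lcp(a, b):
    """Longest common prefix of two Dewey IDs given as tuples."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

def most_specific_results(lists):
    """lists: one Dewey-sorted list of Dewey IDs per keyword."""
    tagged = [[(dewey, k) for dewey in lst] for k, lst in enumerate(lists)]
    seen = {}                                   # element -> keyword numbers below it
    for dewey, k in heapq.merge(*tagged):       # one pass in Dewey order
        for depth in range(1, len(dewey) + 1):
            seen.setdefault(dewey[:depth], set()).add(k)
    full = [d for d, ks in seen.items() if len(ks) == len(lists)]
    # keep an element only if no descendant also contains all keywords
    return sorted(d for d in full
                  if not any(o != d and o[:len(d)] == d for o in full))

rem = [(0, 0, 0), (0, 1, 0)]        # Dewey IDs for "R.E.M."
religion = [(0, 0, 2, 0)]           # Dewey IDs for "Religion"
print(lcp(rem[0], religion[0]))
print(most_specific_results([rem, religion]))
```

The lcp of 0.0.0 and 0.0.2.0 is 0.0, the first CD element, which is the deepest element containing both keywords.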
RDIL Data Structure
- ranked Dewey inverted list
  - each Dewey ID in the list has a position in the B-tree
  - B-tree sorted by Dewey ID (ascending)
  - inverted list sorted by ElemRank (descending)
- Example for the keyword R.E.M.
  - B-tree on Dewey IDs: 0.0.0, 0.1.0
  - inverted list:

Dewey ID | ElemRank
0.1.0    | 80
0.0.0    | 75
RDIL Query Processing (1)
[Figure: keywords key1, key2, key3, each with a B-tree on Dewey IDs and an inverted list sorted by ElemRank (entry11, entry12, entry13, ... for key1, and likewise for key2 and key3)]
- lcp with Dewey ID11 → result heap
RDIL Query Processing (2)
[Figure: keywords key1, key2, key3, each with a B-tree on Dewey IDs and an inverted list sorted by ElemRank]
- lcp with Dewey ID21 → result heap
- etc.
RDIL Query Processing (3)
[Figure: keywords key1, key2, key3, each with a B-tree on Dewey IDs and an inverted list sorted by ElemRank]
- Σ ranking: max. reachable ranking (threshold)
RDIL Query Processing (4)
- The RDIL algorithm stops if threshold < lowest ElemRank in result heap
- because max. reachable ranking ≤ threshold < lowest ElemRank in result heap
- ⇒ max. reachable ranking < lowest ElemRank in result heap, so no element that has not been read yet can still enter the result heap!
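The stopping rule follows the pattern of a threshold algorithm. The sketch below shows that rule in a deliberately simplified setting: each keyword list is assumed to store one non-negative rank per candidate result element, whereas RDIL derives the candidates through lcp computations and B-tree lookups. The function name and the sample data are hypothetical.

```python
import heapq

def threshold_top_m(lists, m):
    """lists: per keyword, a dict {element: rank of the element for that keyword}.
    An element missing from a dict counts as rank 0.
    Returns the m best (element, score) pairs and the number of entries read."""
    if not lists or m <= 0:
        return [], 0
    by_rank = [sorted(l.items(), key=lambda e: e[1], reverse=True) for l in lists]
    scores = {}
    reads = 0
    for pos in range(max(len(lst) for lst in by_rank)):
        threshold = 0.0
        for lst in by_rank:                       # read the lists in turn
            if pos < len(lst):
                elem, rank = lst[pos]
                reads += 1
                threshold += rank                 # best rank still unread in this list
                if elem not in scores:            # look the element up in all lists
                    scores[elem] = sum(l.get(elem, 0.0) for l in lists)
        top = heapq.nlargest(m, scores.values())
        if len(top) == m and threshold < top[-1]:
            break                                 # nothing unread can beat the heap
    best = heapq.nlargest(m, scores.items(), key=lambda kv: kv[1])
    return best, reads

key1 = {"0.0": 90, "0.1": 80, "0.2": 10, "0.3": 5}
key2 = {"0.0": 85, "0.1": 20, "0.2": 15, "0.3": 5}
best, reads = threshold_top_m([key1, key2], m=1)
print(best, reads)
```

In this example the scan stops after two entries per list, because the sum of the ranks at the current positions is already below the score of the best element found.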
XRANK Architecture
[Figure: XRANK architecture. Components: XML documents, ElemRank computation, XML elements with ElemRanks, DIL / RDIL, data access, Query Evaluator. Input: keyword search query. Output: ranked result list.]
Experimental Results (1)
Experimental Results (2)
Comparison DIL - RDIL
- DIL
  - inverted lists sorted by Dewey ID
  - computes the longest common prefix on Dewey IDs
  - extracts the minimum of all remaining Dewey IDs
  - all lists are scanned completely
  - outperforms RDIL if keyword correlation is low
- RDIL
  - inverted lists sorted by ElemRank
  - chooses the next list sequentially
  - stops if a certain threshold is reached
  - outperforms DIL if keyword correlation is high
Conclusion
2 approaches
- ELIXIR
  - SQL-like, structure-based search
  - extends XML-QL by supporting IR similarity features for ranking
  - ranked results based only on textual similarity (even between 2 variables)
- XRANK
  - keyword-based search à la Google
  - ranked results based on
    - textual similarity
    - hierarchical and hyperlinked structure