Outline - PowerPoint PPT Presentation

About This Presentation

Title:

Outline

Description:

Ralf Schenkel. joint work with Gerhard Weikum, Jens Graupmann, ... Web forms & tables, Wikipedia. relationships quantified by. statistical correlation measures ... – PowerPoint PPT presentation

Slides: 38

Provided by: Wei84

Transcript and Presenter's Notes

Title: Outline

2
Outline

XML IR
Area Overview and Contributions
Efficiency: TopX
Effectiveness: SphereSearch

3
XML-IR Example (1)
Professor
Address
...
Name Gerhard Weikum
City SB
Country Germany
Teaching
Research
Course
Project
Title IR
Syllabus
Title Intelligent Search of XML Data
Description Information retrieval ...
...
...
Sponsor German Science Foundation
Book
Article
...
...
//Professor[contains(., "SB") and contains(.//Course, "IR") and contains(.//Research, "XML")]
4
XML-IR Example (2)
Professor
Lecturer
Address
...
Address Max-Planck Institute for CS, Germany
Name Gerhard Weikum
City SB
Country Germany
Name Ralf Schenkel
Teaching
Research
Interests Semistructured Data, IR
Teaching
Course
Project
Title IR

Challenges
Information spread over multiple documents
Heterogeneous schemas and content
Similarity queries with a vast number of
potential results

Seminar
Syllabus
Title Intelligent Search of XML Data
Description Information retrieval ...
...
...
Contents Ranked Search ...
Literature
Sponsor German Science Foundation
Article
Book
...
...
Book
...
Title Statistical Language Models
//Professor[contains(., "SB") and contains(.//Course, "IR") and contains(.//Research, "XML")]
5
XML-IR History and Related Work
IR on structured docs (SGML)
Web query languages
1995
W3QS (Technion Haifa)
OED etc. (U Waterloo)
Araneus (U Roma)
HySpirit (U Dortmund)
Lorel (Stanford U)
HyperStorM (GMD Darmstadt)
WebSQL (U Toronto)
WHIRL (CMU)
IR on XML
XML query languages
XIRQL (U Dortmund)
XML-QL (AT&T Labs)
XXL & TopX (U Saarland / MPI)
2000
XPath 1.0 (W3C)
ApproXQL (U Berlin / U Munich)
ELIXIR (U Dublin)
INEX benchmark, NEXI, XPath & XQuery Full-Text
PowerDB-IR (ETH Zurich)
JuruXML (IBM Haifa)
XPath 2.0 (W3C)
XSearch (Hebrew U)
Timber (U Michigan)
XQuery (W3C)
XRank & Quark (Cornell U)
FleXPath (AT&T Labs)
TeXQuery (AT&T Labs)
Commercial software (MarkLogic, Verity?, IBM?, Oracle?, Google?, ...)
2005
6
Area Overview and Contributions
Documents about XML
...
...
...
7
TopX: Efficient XML IR
Goal: Efficiently compute the best results of a
similarity query

Query and scoring model for similarity queries
Extend top-k query processing algorithms for
sorted lists [Buckley 85, Güntzer et al. 00,
Fagin 01] to XML queries & data, including
similarity queries
Exploit cheap disk space for highly redundant
indexing

8
TopX Data Model

<P> Gerhard Weikum <C>IR</C> SB <R>XML</R></P>

docid=1, pre=1, post=3, tag=P, content="Gerhard Weikum IR SB XML"
docid=1, pre=2, post=1, tag=C, content="IR"
docid=1, pre=3, post=2, tag=R, content="XML"

pure tree model, ignoring links
content of descendants replicated, per-element
term scores (using tf/idf scores or variant of
Okapi BM25 model)
pre/postorder labels reflecting element hierarchy
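
The pre/postorder labels allow testing the element hierarchy without touching the tree itself. The following is a minimal sketch of that test, using the three element records from this slide; the class and function names are illustrative assumptions, not taken from TopX.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One element record; field names follow the slide, the class itself is hypothetical."""
    docid: int
    pre: int      # preorder rank of the element in its document
    post: int     # postorder rank of the element in its document
    tag: str
    content: str  # own text plus the replicated text of all descendants


def is_ancestor(a: Element, d: Element) -> bool:
    """a is an ancestor of d iff both are in the same document, a starts
    before d in preorder, and a ends after d in postorder."""
    return a.docid == d.docid and a.pre < d.pre and a.post > d.post


# The three elements of the example document on this slide.
p = Element(1, 1, 3, "P", "Gerhard Weikum IR SB XML")
c = Element(1, 2, 1, "C", "IR")
r = Element(1, 3, 2, "R", "XML")

print(is_ancestor(p, c), is_ancestor(p, r), is_ancestor(c, r))  # True True False
```

Because the test needs only two integer comparisons, a path condition such as "C below P" can be checked in memory on the fetched index entries.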

9
Queries and Query Scores

Query: tree/graph pattern with
mandatory/optional content conditions (CC)
mandatory path conditions (PC)
Example CCs: Professor=SB, Course=IR, Research=XML
Content-based score for element e with tag A and
CC T=c, built from
relevance: tf-based or BM25
specificity: idf per tag type
compactness: subtree size
Candidate: connected sub-pattern with element
ids and scores
Result: candidate with scores for all mandatory
conditions
Content-based score of result with elements
e1,...,em for query q with CC T1=c1, ..., Tm=cm
10
Data Structures
Index lists for the content conditions Professor=SB, Course=IR, Research=XML:

oid   docid   score   pre   post
46    2       0.9     2     15
9     2       0.5     10    8
171   5       0.85    1     20
84    3       0.1     1     12

oid   docid   score   pre   post
216   17      0.9     2     15
72    3       0.8     10    8
51    2       0.5     4     12
671   31      0.4     12    23

oid   docid   score   pre   post
3     1       1.0     1     21
28    2       0.8     8     14
182   5       0.75    3     7
96    4       0.75    6     4

1) Build index lists for each tag-term pair,
grouped by document, sorted by max score in
document
2) Block-fetch all elements for the same doc
3) Create and/or update candidates, including
testing PCs in memory
4) Maintain score and best score for each candidate,
prune when possible
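
The steps above can be sketched in a strongly simplified form. The sketch below works at document level only: it ignores path conditions and per-element candidates, scores a document by the sum of its per-list maximum scores, and scans the lists round-robin. These simplifications and all function names are assumptions for illustration; this is not the TopX implementation. The sample data are the (oid, docid, score) columns of the three index lists on this slide.

```python
from collections import defaultdict


def build_index_list(entries):
    """Group (oid, docid, score) entries by document and sort the documents
    by descending maximum score within the document."""
    by_doc = defaultdict(list)
    for oid, docid, score in entries:
        by_doc[docid].append(score)
    blocks = [(docid, max(scores)) for docid, scores in by_doc.items()]
    return sorted(blocks, key=lambda block: block[1], reverse=True)


def top_k(index_lists, k):
    """Sorted-access-only top-k over lists of (docid, score) in descending
    score order. Returns up to k (docid, score lower bound) pairs."""
    if k <= 0:
        return []
    pos = [0] * len(index_lists)   # scan position per list
    worst = {}                     # docid -> sum of scores seen so far
    seen = defaultdict(set)        # docid -> lists in which it was seen
    while True:
        for i, lst in enumerate(index_lists):
            if pos[i] < len(lst):
                docid, score = lst[pos[i]]
                pos[i] += 1
                worst[docid] = worst.get(docid, 0.0) + score
                seen[docid].add(i)
        # Upper bound on what an unseen entry of list i can still contribute.
        high = [lst[pos[i] - 1][1] if 0 < pos[i] < len(lst) else 0.0
                for i, lst in enumerate(index_lists)]

        def best(docid):
            return worst[docid] + sum(h for i, h in enumerate(high)
                                      if i not in seen[docid])

        ranked = sorted(worst, key=worst.get, reverse=True)
        top, rest = ranked[:k], ranked[k:]
        min_k = worst[top[-1]] if len(top) == k else 0.0
        exhausted = all(pos[i] >= len(lst) for i, lst in enumerate(index_lists))
        # Stop when no other candidate and no unseen document can beat min-k.
        if exhausted or (len(top) == k and sum(high) <= min_k
                         and all(best(d) <= min_k for d in rest)):
            return [(d, worst[d]) for d in top]


LIST_A = [(46, 2, 0.9), (9, 2, 0.5), (171, 5, 0.85), (84, 3, 0.1)]
LIST_B = [(216, 17, 0.9), (72, 3, 0.8), (51, 2, 0.5), (671, 31, 0.4)]
LIST_C = [(3, 1, 1.0), (28, 2, 0.8), (182, 5, 0.75), (96, 4, 0.75)]

lists = [build_index_list(entries) for entries in (LIST_A, LIST_B, LIST_C)]
print([docid for docid, _ in top_k(lists, 2)])
```

Under this simplified scoring, the top-2 documents for the sample lists are documents 2 and 5. The candidate bookkeeping (worst score, best score, min-k) is the part that carries over to the element-level setting described on the slide.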

11
Query Evaluation By Example
Query conditions: Professor=SB, Course=IR, Research=XML
top-2 results
[Figure: step-by-step scan of the three index lists shown on the previous slide, with candidates Doc2, Doc17, Doc1, Doc5, Doc3 and a Pseudo-Doc]
Further speedup by additional random accesses for
promising documents
12
Ontology-Based Query Expansion
Thesaurus/Ontology: concepts, relationships,
glosses from WordNet, Gazetteers, Web forms &
tables, Wikipedia
relationships quantified by statistical
correlation measures
Similarity conditions like professor=SB//course=IR
disambiguation
Query expansion: δ-exp(x) = {w | sim(x,w) > δ}
Weighted expanded query. Example: (professor
lecturer (0.7) scholar (0.6) ...) = SB // (course
class (1.0) seminar (0.84) ...) = IR Web
search (0.653) ...
better recall, but possibly worse precision (due
to topic drift)
Efficient top-k search with dynamic expansion
[Figure: ontology graph around "professor" with the concepts alchemist, primadonna, director, artist, wizard, investigator, intellectual, researcher, educator, scientist, scholar, lecturer, mentor, teacher and the synonyms academic, academician, faculty member; one edge is labeled HYPONYM (0.7)]
13
TopX: Incremental Query Expansion
Consider expandable content condition
Professor=SB with score
max over T ∈ exp(Professor) of sim(Professor,T) · s(e, T=SB)
dynamic query expansion with incremental,
on-demand merging of additional index lists
[Figure: thesaurus/ontology lookup for "professor" feeding the index lists of Professor=SB, next to Research=XML]
Expansion terms: lecturer: 0.7, scholar: 0.6, academic: 0.53, scientist: 0.5
Index list entries (id: score):
57: 0.6, 44: 0.4, 52: 0.4, 33: 0.3, 75: 0.3
12: 0.9, 14: 0.8, 28: 0.6, 17: 0.55, 61: 0.5, 44: 0.5
...
much more efficient than threshold-based
expansion, no threshold tuning, better recall,
no topic drift
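
To make the two expansion styles concrete, here is a minimal sketch, assuming the thesaurus is a plain dictionary of similarity weights. The function names and the per-term element scores are hypothetical; only the similarity weights for "professor" come from the slide.

```python
def threshold_expand(term, thesaurus, delta):
    """Static expansion: all related terms w with sim(term, w) > delta."""
    return {w: sim for w, sim in thesaurus.get(term, {}).items() if sim > delta}


def expanded_score(term, thesaurus, element_scores):
    """Score of an expandable condition for one element: the maximum of
    sim(term, T) * s(e, T) over the term itself (sim = 1.0) and its expansions."""
    candidates = {term: 1.0, **thesaurus.get(term, {})}
    return max(sim * element_scores.get(t, 0.0) for t, sim in candidates.items())


# Similarity weights as shown on the slide.
thesaurus = {"professor": {"lecturer": 0.7, "scholar": 0.6,
                           "academic": 0.53, "scientist": 0.5}}

# Hypothetical content scores s(e, T) of a single element e.
element_scores = {"professor": 0.3, "lecturer": 0.9}

print(threshold_expand("professor", thesaurus, 0.55))
print(expanded_score("professor", thesaurus, element_scores))
```

Taking the maximum instead of a sum means an element that matches several related terms is not rewarded repeatedly. In an incremental scheme, the list for an expansion term only needs to be opened once its weighted maximum score could still matter; the sketch does not model that scheduling.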
14
Probabilistic Pruning

Add c to top-k result if current score(c) > min
score in top-k
Drop c only if best score(c) ≤ min-k, otherwise
keep it

⇒ Often overly conservative (deep scans, large
number of candidates)
score predictor can use Poisson approximations or
histogram convolution
⇒ E[rel. precision@k] = 1 − ε
[Figure: score plotted over scan depth, showing best score(c), current score(c) and min-k, with the question "Drop c?"]
15
Experimental Results: INEX Benchmark
on IEEE-CS journal and conference articles: 12,223
XML docs with 12 million elements, 7.9 GB for all
indexes
20 CO queries, e.g. XML editors or parsers
20 CAS queries, e.g. //article[.//bibl[about(.//QBIC)] and .//p[about(.//image retrieval)]]

                       join&sort   struct index   TopX (ε=0.0)   TopX (ε=0.1)
sorted accesses @10    9,122,318        761,970        635,507        426,986
random accesses @10            0      3,245,068         64,807         59,414
relative recall @10            1              1              1            0.8
precision@10                0.34           0.34           0.34           0.32
MAP@1000                    0.17           0.17           0.17           0.17
16
TopX: Current and Future Work

Integration of a structure index to speed up
queries with complex path expressions
Scheduling of index-scan steps and few random
accesses
Efficient consideration of correlated dimensions
Integrated support for all kinds of XML
similarity search: content & ontological sim,
structural sim
Integration of top-k operator into physical
algebra and query optimizer of XML engine
Extension to graph-structured data

17
SphereSearch: Unified Retrieval
Goal: Increase recall & precision for hard
queries on linked and heterogeneous data

Unified search for unstructured, semistructured,
structured data from heterogeneous sources
Graph-based model, including links
Annotation engines from NLP to recognize classes
of named entities (persons, locations, dates, ...)
Flexible query language with context-aware
scoring
Compactness-based scores

18
Unifying Search on Heterogeneous Data

Databases

19
Extended Data Model
<P> Gerhard Weikum <C>IR</C> SB <R>XML</R></P>

docid=1, pre=1, post=3, tag=P, content="Gerhard Weikum IR SB XML"
docid=1, pre=2, post=1, tag=C, content="IR"
docid=1, pre=3, post=2, tag=R, content="XML"
20
Queries

Query: query groups + (optional) join conditions
Query group: relaxed content condition with
keyword and concept-value conditions, possibly
including similarity operators
Course=IR: G1(Course=IR) or G1(Course, IR)
Professor=SB: G2(Professor=SB) or
G2(Professor, Location=SB)
Range conditions: Date<1980, Location=SB
Join condition: elements with similar
content, e.g. G1(gothic church) G2(roman church)
G1.location=G2.location

21
Scores for Query Groups
Q(A,B)
1

Weighted aggregation of scores in environment of
element (sphere score)

Rewards proximity of terms and compactness of
term distribution
22
Scores for Query Results

Join conditions like A.T=B.S
Find nodes n1 with type T, n2 with type S
For each pair, add edge (n1,n2) with weight
(conf(n1) · conf(n2) · sim(n1,n2))^-1
sim(n1,n2): content-based similarity
⇒ Influences compactness

Query result R: one result per query group

compactness: 1/size of a minimal spanning tree
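
A small sketch of how such a compactness value could be computed, assuming "size" means the total edge weight of a minimum spanning tree over the result nodes (the slide does not spell this out). The function names are illustrative; Kruskal's algorithm is a standard choice swapped in here, not necessarily what SphereSearch uses.

```python
def join_edge_weight(conf1, conf2, sim):
    """Weight of a virtual join edge: (conf(n1) * conf(n2) * sim(n1, n2)) ** -1."""
    return 1.0 / (conf1 * conf2 * sim)


def mst_weight(nodes, edges):
    """Total weight of a minimum spanning tree (Kruskal with union-find).
    edges is a list of (weight, u, v); returns None if the graph is not connected."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total, used = 0.0, 0
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            total += weight
            used += 1
    return total if used == len(nodes) - 1 else None


def compactness(nodes, edges):
    """1 / MST size; 0.0 for disconnected results, infinity for a zero-size tree."""
    size = mst_weight(nodes, edges)
    if size is None:
        return 0.0
    return float("inf") if size == 0 else 1.0 / size


# Hypothetical result with one node per query group.
nodes = ["a", "b", "c"]
edges = [(1.0, "a", "b"), (2.0, "b", "c"),
         (join_edge_weight(1.0, 0.5, 0.5), "a", "c")]  # join edge of weight 4.0
print(compactness(nodes, edges))
```

A low-confidence or low-similarity join produces a heavy edge, so results that depend on it come out less compact and rank lower.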
23
Experimental Results

On extended Wikipedia corpus with links to IMDB
and extended DBLP corpus (with links to homepages)

Queries like
G(California, governor) M(movie)
A(Madonna, husband) B(director) A.person=B.director

24
SphereSearch: Current and Future Work

Intuitive graphical user interface
Refined type-specific similarity measures (like
geographic distances)
Deep Web search through automatic portal queries
Parameter tuning with relevance feedback
Efficiency of query evaluation through
precomputation and integrated top-k

25
Integrating TopX and SphereSearch
[Figure: for a query (G1,..,Gn), distance-based aggregation top-k operators each deliver their top-k to a compactness-based top-k operator, which produces the top-k results]
26
Thank you!
27
HOPI: 2-Hop-Cover based Path Index
for query conditions like //course//professor:
test connectivity and find connections
B tree on element names
...
...
...
course
professor
17: Lin = ∅, Lout = {18, 19, 20, 23, 26, 27, 28, 29, 76, 85, 86}
76: Lin = {17, 20, 75}, Lout = {77, 78, 79, 80, 28, 29}
44: Lin = {...}, Lout = {...}
92: Lin = {...}, Lout = {...}
...
...
17 course
75 homepage
18 title
19 outline
20 lecturer
76 professor
21 chap
23 lit
22 chap
77 name
78 office
79 CV
80 projects
25 cite
81 bibl
82 honors
24 cite
26 title
27 publ
28 title
29 publ
83 book
84 paper
85 EU project
86 DFG project
28
Constructing a 2-Hop Cover

Definition (E. Cohen et al., SODA 2002)
a 2-hop cover of a graph G=(V,E)
is a labeling (Lin, Lout) of all nodes where
Lin(n) ⊆ {m | m ⇝ n}, Lout(n) ⊆ {p | n ⇝ p}, and
∀ (m,n) ∈ E ∃ center node x with x ∈ Lout(m) ∧ x ∈ Lin(n)

Theorem (Cohen et al.): The size of a 2-hop
cover is Σ_{n∈V} |Lin(n)| + |Lout(n)|. Finding a
minimal 2-hop cover is NP-complete.
Polynomial Algorithm with O(log |V|)
Approximation (Cohen et al.):
T := E   // uncovered connections
while T ≠ ∅
    for each node n construct center graph
        C(n) = {(m,p) | (m,n), (n,p) ∈ E}
    find node n with densest subgraph S(n) of C(n) ∩ T
    for each source s in S(n) do Lout(s) := Lout(s) ∪ {n}
    for each target t in S(n) do Lin(t) := Lin(t) ∪ {n}
    remove edges of S(n) from T
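
The greedy idea can be tried out on small graphs with the sketch below. It deliberately simplifies the algorithm on the slide: instead of searching for the densest subgraph of each center graph, it picks the center whose whole center graph covers the most uncovered connections, and it treats every node as implicitly contained in its own Lin and Lout. Both choices, and all function names, are assumptions made for illustration.

```python
def transitive_closure(nodes, edges):
    """All pairs (u, v), u != v, such that v is reachable from u."""
    succ = {n: set() for n in nodes}
    for u, v in edges:
        succ[u].add(v)
    closure = set()
    for start in nodes:
        stack, reached = list(succ[start]), set()
        while stack:
            v = stack.pop()
            if v not in reached:
                reached.add(v)
                stack.extend(succ[v])
        closure |= {(start, v) for v in reached if v != start}
    return closure


def greedy_two_hop_cover(nodes, edges):
    """Simplified greedy cover: repeatedly choose the center node that covers
    the most still-uncovered connections (no densest-subgraph search)."""
    tc = transitive_closure(nodes, edges)
    anc = {n: {n} for n in nodes}    # nodes that reach n, plus n itself
    desc = {n: {n} for n in nodes}   # nodes reachable from n, plus n itself
    for u, v in tc:
        desc[u].add(v)
        anc[v].add(u)
    lin = {n: set() for n in nodes}
    lout = {n: set() for n in nodes}
    uncovered = set(tc)
    while uncovered:
        best, best_cov = None, set()
        for n in nodes:
            cov = {(m, p) for m in anc[n] for p in desc[n] if (m, p) in uncovered}
            if len(cov) > len(best_cov):
                best, best_cov = n, cov
        for m, p in best_cov:
            if m != best:
                lout[m].add(best)
            if p != best:
                lin[p].add(best)
        uncovered -= best_cov
    return lin, lout


def connected(lin, lout, u, v):
    """u reaches v iff Lout(u) and Lin(v) share a center (self entries implicit)."""
    return bool((lout[u] | {u}) & (lin[v] | {v}))


nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (1, 4)]
lin, lout = greedy_two_hop_cover(nodes, edges)
print(connected(lin, lout, 1, 3), connected(lin, lout, 3, 1))  # True False
```

Like the algorithm on the slide, the sketch materializes the transitive closure first, which is exactly the cost that becomes prohibitive for large graphs.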
29
Making 2-Hop Cover Practically Viable
Small subset of DBLP: 6,210 documents, 168,991
elements, 25,368 links, 14 MB uncompressed XML
element-level graph has 168,991 nodes and 188,149
edges; its TC has 344,992,370 connections, ca.
2 GB
2-Hop Cover: 1,289,930 entries
⇒ compression factor of 267
⇒ fast lookups: 7.6 entries/node
But computation took 45 hours and 80 GB RAM!
Full DBLP has ca. 600,000 documents; the Web
has ?
HOPI (2-Hop Cover Based Index) [EDBT 04] for very
large, disk-resident graphs:
keep center graphs in priority queue
incrementally update center subgraphs
avoid precomputing complete transitive closure
support incremental updates
30
Efficient HOPI Construction [EDBT 04]

Divide-and-conquer
Partition G by partitioning the XML document
graph (using heuristics) with
node weights: elems in doc; edge weights:
cross-doc links
Compute 2-hop cover for each partition
Join covers:
for each cross-partition edge x → y with x ∈ P,
y ∈ Q
add x to Lout(a) for all a ∈ P with a ⇝ x
and
to Lin(b) for all b ∈ Q with y ⇝ b

Implementation stores Lin and Lout sets in
database tables:
Lin (Id, InId) with indexes on (Id, InId) and (InId, Id)
Lout (Id, OutId) with indexes on (Id, OutId) and (OutId, Id)
Elems (Id, ElemName)
efficiently supports connection queries
for all XPath axes on arbitrary XML data graphs
Performance: TC 344,992,370 connections
2-Hop Cover: 15,976,677 entries
⇒ compression factor of 21.6
⇒ lookups ok: 94.5 entries/node
⇒ build time feasible: 3 hours and 1 GB RAM
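
The table layout above makes a connection test a single join. The following sketch uses Python's built-in sqlite3 module as a stand-in database; the table and column names follow the slide, while the index names, the helper functions, and the exact SQL are assumptions. The sample labels are those of nodes 17 (course) and 76 (professor) from the HOPI example slide.

```python
import sqlite3


def build_db(lin, lout):
    """Store Lin/Lout label sets in an in-memory database with the two
    tables and four indexes named on the slide (index names are made up)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Lin  (Id INTEGER, InId  INTEGER);
        CREATE TABLE Lout (Id INTEGER, OutId INTEGER);
        CREATE INDEX lin_fwd  ON Lin  (Id, InId);
        CREATE INDEX lin_bwd  ON Lin  (InId, Id);
        CREATE INDEX lout_fwd ON Lout (Id, OutId);
        CREATE INDEX lout_bwd ON Lout (OutId, Id);
    """)
    con.executemany("INSERT INTO Lin VALUES (?, ?)",
                    [(n, x) for n, xs in lin.items() for x in xs])
    con.executemany("INSERT INTO Lout VALUES (?, ?)",
                    [(n, x) for n, xs in lout.items() for x in xs])
    return con


def is_connected(con, u, v):
    """u reaches v iff some center node is in both Lout(u) and Lin(v)."""
    row = con.execute(
        """SELECT 1 FROM Lout JOIN Lin ON Lout.OutId = Lin.InId
           WHERE Lout.Id = ? AND Lin.Id = ? LIMIT 1""",
        (u, v)).fetchone()
    return row is not None


# Labels of node 17 (course) and node 76 (professor) from the example.
lin = {17: set(), 76: {17, 20, 75}}
lout = {17: {18, 19, 20, 23, 26, 27, 28, 29, 76, 85, 86},
        76: {77, 78, 79, 80, 28, 29}}

con = build_db(lin, lout)
print(is_connected(con, 17, 76), is_connected(con, 76, 17))  # True False
```

Here the join finds node 20 as a common center of Lout(17) and Lin(76). A full implementation would also have to handle the case where u or v is itself the center, for example by storing each node in its own label sets; the sketch leaves that out.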
31
Structurally Recursive Cover Join [ICDE 05]

Basic Idea
Compute (small) graph from partitioning
Compute its two-hop cover Hin, Hout
Combine this cover with the covers of the
partitions

32
Example (1)
[Figure: example graph with nodes 1 to 8]
Build partition-level skeleton graph
33
Example (2)
[Figure: skeleton graph with the Hin and Hout labels of its nodes]

Enhanced Cover Join Algorithm
For each link source s, add Hout(s) to Lout(a)
for each ancestor a of s in partition of s
For each link target t, add Hin(t) to Lin(d) for
each descendant d of t in partition of t

Join can run independently for all links
34
Example (3)
Hout = {2, 7}
Hin = {2}
[Figure: example graph with nodes 1 to 8]
35
Experimental Results for Enhanced HOPI

on small DBLP subset
TC: 344,992,370 connections
2-Hop Cover: 9,999,052 entries
⇒ compression factor 34.5
⇒ lookups ok: 59.2 entries/node
⇒ fast build time: 23 minutes with 1 GB RAM
(with partition size 10,000 connections, edge
weight anc × desc)

macroscopic experiments on INEX benchmark (with
TopX search engine, using smart indexing and
probabilistic top-k)
12,000 IEEE-CS publications with fine-grained XML markup
40 queries such as //article[.//bibl[.//QBIC] and .//p[.//image retrieval]]
⇒ outperforms best competitor (U Wisconsin) by factor 10
⇒ achieves very good precision and recall
36
Ongoing and Future Work

Even better graph partitioning
Efficient incremental updates
Full integration of HOPI
with TopX and SphereSearch engines
Graph-based Scoring Models
Query Processing and Optimization
over XML Indexes
Graph Service for Biological Networks
(interoperate with BN at U Saarland / CBI)

37
TopX Algorithm
based on index table L (Tag, Term, MaxScore,
DocId, Score, ElemId, Pre, Post)
decompose query: content conditions (CC) & path
conditions (PC)   // conditions may be optional or mandatory
for each index list Li (extracted from L by tag and/or term) do
    block-scan next elements from same doc d
    test evaluable PCs of all elements of d
    drop elements and docs that do not satisfy mandatory PCs or CCs
    update score bookkeeping for d
    consider random accesses for d by cost-based scheduler
    drop d if (prob.) score threshold is not reached
