Title: Outline
2Outline
- XML IR
- Area Overview and Contributions
- Efficiency TopX
- Effectiveness SphereSearch
3XML-IR Example (1)
Professor
Address
...
Name Gerhard Weikum
City SB
Country Germany
Teaching
Research
Course
Project
Title IR
Syllabus
Title Intelligent Search of XML Data
Description Information retrieval ...
...
...
Sponsor German Science Foundation
Book
Article
...
...
//Professor[contains(., "SB") and contains(.//Course, "IR") and contains(.//Research, "XML")]
4XML-IR Example (2)
Professor
Lecturer
Address
...
Address Max-Planck Institute for CS, Germany
Name Gerhard Weikum
City SB
Country Germany
Name Ralf Schenkel
Teaching
Research
Interests Semistructured Data, IR
Teaching
Course
Project
Title IR
- Challenges
- Information spread over multiple documents
- Heterogeneous schemas and content
- Similarity queries with a vast number of
potential results
Seminar
Syllabus
Title Intelligent Search of XML Data
Description Information retrieval ...
...
...
Contents Ranked Search ...
Literature
Sponsor German Science Foundation
Article
Book
...
...
Book
...
Title Statistical Language Models
//Professor[contains(., "SB") and contains(.//Course, "IR") and contains(.//Research, "XML")]
5XML-IR History and Related Work
IR on structured docs (SGML)
Web query languages
1995
W3QS (Technion Haifa)
OED etc. (U Waterloo)
Araneus (U Roma)
HySpirit (U Dortmund)
Lorel (Stanford U)
HyperStorM (GMD Darmstadt)
WebSQL (U Toronto)
WHIRL (CMU)
IR on XML
XML query languages
XIRQL (U Dortmund)
XML-QL (AT&T Labs)
XXL & TopX (U Saarland / MPI)
2000
XPath 1.0 (W3C)
ApproXQL (U Berlin / U Munich)
ELIXIR (U Dublin)
INEX benchmark NEXI XPath XQuery Full-Text
PowerDB-IR (ETH Zurich)
JuruXML (IBM Haifa)
XPath 2.0 (W3C)
XSearch (Hebrew U)
Timber (U Michigan)
XQuery (W3C)
XRank & Quark (Cornell U)
FleXPath (AT&T Labs)
TeXQuery (AT&T Labs)
Commercial software (MarkLogic, Verity?, IBM?,
Oracle?, Google?, ...)
2005
6Area Overview and Contributions
Documents about XML
...
...
...
7TopX Efficient XML IR
Goal: Efficiently compute the best results of a similarity query
- Query and scoring model for similarity queries
- Extend top-k query processing algorithms for sorted lists [Buckley85, Güntzer et al. 00, Fagin01] to XML queries & data, including similarity queries
- Exploit cheap disk space for highly redundant indexing
8TopX Data Model
- <P> Gerhard Weikum <C>IR</C> SB <R>XML</R></P>
docid=1 pre=1 post=3 tag=P content=Gerhard Weikum IR SB XML
docid=1 pre=2 post=1 tag=C content=IR
docid=1 pre=3 post=2 tag=R content=XML
- pure tree model, ignoring links
- content of descendants replicated, per-element term scores (using tf/idf scores or a variant of the Okapi BM25 model)
- pre/postorder labels reflecting element hierarchy
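The labeling on this slide can be reproduced with a short sketch. It is an illustration only, not the TopX indexer: the function names `label_elements` and `is_ancestor` are made up for this example, and whitespace in the replicated content is normalized for readability.

```python
import xml.etree.ElementTree as ET


def label_elements(xml_text, docid=1):
    """Assign pre/postorder ranks and replicated content to every element."""
    root = ET.fromstring(xml_text)
    rows = []
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        counters["pre"] += 1
        row = {
            "docid": docid,
            "pre": counters["pre"],
            "tag": elem.tag,
            # Content of all descendants is replicated into the element itself.
            "content": " ".join("".join(elem.itertext()).split()),
        }
        rows.append(row)
        for child in elem:
            visit(child)
        counters["post"] += 1
        row["post"] = counters["post"]

    visit(root)
    return rows


def is_ancestor(a, d):
    """a is an ancestor of d iff it starts earlier and finishes later."""
    return a["docid"] == d["docid"] and a["pre"] < d["pre"] and a["post"] > d["post"]


if __name__ == "__main__":
    for r in label_elements("<P> Gerhard Weikum <C>IR</C> SB <R>XML</R></P>"):
        print(r)
```

With these labels, a path condition such as "C is a descendant of P" becomes two integer comparisons, which is why it can be tested in memory without touching the document.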
9Queries and Query Scores
- query: tree/graph pattern with
- mandatory/optional content conditions (CC)
- mandatory path conditions (PC)
- content-based score for element e with tag A and CC T=c
- candidate: connected sub-pattern with element ids and scores
- result: candidate with scores for all mandatory conditions
- content-based score of result with elements e1, ..., em for query q with CCs T1=c1, ..., Tm=cm
Professor=SB
Course=IR
Research=XML
relevance: tf-based or BM25; specificity: idf per tag type; compactness: subtree size
10Data Structures
Professor=SB
Course=IR
Research=XML
oid docid score pre post
46 2 0.9 2 15
9 2 0.5 10 8
171 5 0.85 1 20
84 3 0.1 1 12
oid docid score pre post
216 17 0.9 2 15
72 3 0.8 10 8
51 2 0.5 4 12
671 31 0.4 12 23
oid docid score pre post
3 1 1.0 1 21
28 2 0.8 8 14
182 5 0.75 3 7
96 4 0.75 6 4
- Build index lists for each tag-term pair, grouped by document, sorted by max score in document
- Block-fetch all elements for the same doc
- Create and/or update candidates, including testing PCs in memory
- Maintain score and best score for each candidate, prune when possible
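The score and best-score bookkeeping can be sketched as follows. This is a deliberately simplified, document-level version in the style of sorted-access-only threshold algorithms, not the TopX implementation: path conditions are ignored, and each of the three index lists above is collapsed to one entry per document carrying that document's maximum score (an assumption made for this example). The name `top_k` is hypothetical.

```python
# Index lists from the tables above, reduced to (docid, max score in document).
LISTS = [
    [(2, 0.9), (5, 0.85), (3, 0.1)],
    [(17, 0.9), (3, 0.8), (2, 0.5), (31, 0.4)],
    [(1, 1.0), (2, 0.8), (5, 0.75), (4, 0.75)],
]


def top_k(lists, k):
    """Round-robin sorted access with score / best-score pruning."""
    m = len(lists)
    pos = [0] * m
    high = [lst[0][1] if lst else 0.0 for lst in lists]  # bound on unread entries
    seen = {}  # docid -> {list index: score}
    while True:
        progressed = False
        for i, lst in enumerate(lists):
            if pos[i] < len(lst):
                doc, score = lst[pos[i]]
                pos[i] += 1
                seen.setdefault(doc, {})[i] = score
                high[i] = score
                progressed = True
            if pos[i] >= len(lst):
                high[i] = 0.0  # exhausted list cannot contribute any more
        # score = sum of seen scores; best score adds the bounds of unseen lists
        score_of = {d: sum(s.values()) for d, s in seen.items()}
        best_of = {
            d: score_of[d] + sum(high[i] for i in range(m) if i not in seen[d])
            for d in seen
        }
        ranked = sorted(score_of, key=lambda d: (-score_of[d], d))
        top = ranked[:k]
        if len(top) == k:
            min_k = score_of[top[-1]]
            others_done = all(best_of[d] <= min_k for d in ranked[k:])
            unseen_done = sum(high) <= min_k
            if others_done and unseen_done:
                return [(d, score_of[d]) for d in top]
        if not progressed:
            return [(d, score_of[d]) for d in top]


if __name__ == "__main__":
    print(top_k(LISTS, 2))
```

On this data the scan has to read all three lists to the end before documents 2 and 5 are confirmed as the top-2. That is the kind of conservative behaviour that additional random accesses and probabilistic pruning are meant to shorten.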
11Query Evaluation By Example
Professor=SB
Course=IR
Research=XML
top-2 results
oid docid score pre post
46 2 0.9 2 15
9 2 0.5 10 8
171 5 0.85 1 20
84 3 0.1 1 12
oid docid score pre post
216 17 0.9 2 15
72 3 0.8 10 8
51 2 0.5 4 12
671 31 0.4 12 23
oid docid score pre post
3 1 1.0 1 21
28 2 0.8 8 14
182 5 0.75 3 7
96 4 0.75 6 4
Doc2
Doc17
Doc1
Doc5
Further speedup by additional random accesses for
promising documents
Doc3
Pseudo-Doc
12Ontology-Based Query Expansion
Thesaurus/Ontology concepts, relationships,
glosses from WordNet, Gazetteers, Web forms
tables, Wikipedia
Similarity conditions like professorSB//course
IR
disambiguation
alchemist
Query expansion
primadonna
director
artist
δ-exp(x) = {w | sim(x,w) > δ}
wizard
investigator
Weighted expanded query Example (professor
lecturer(0.7) scholar(0.6) ...) SB//(course
class(1.0) seminar(0.84) ... ) IR Web
search (0.653) ...
intellectual
professor
researcher
educator
HYPONYM (0.7)
scientist
scholar
lecturer
better recall, but possibly worse precision (due to topic drift)
mentor
teacher
academic, academician, faculty member
relationships quantified by statistical
correlation measures
Efficient top-k search with dynamic expansion
13TopX Incremental Query Expansion
Consider expandable content condition Professor=SB with score
max_{T ∈ exp(Professor)} sim(Professor,T) · s(e, T=SB)
dynamic query expansion with incremental, on-demand merging of additional index lists
thesaurus/ontology
Professor=SB
Research=XML
professor
57 0.644 0.452 0.433 0.375 0.3
12 0.914 0.828 0.617 0.5561 0.5 44
0.5
lecturer 0.7 scholar 0.6 academic
0.53 scientist 0.5
...
...
...
...
...
much more efficient than threshold-based expansion: no threshold tuning, better recall, no topic drift
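A minimal sketch of on-demand merging, assuming each expansion term has its own index list sorted by descending score and that element scores are bounded by `max_score`. The function name `incremental_merge`, the element ids, and the list contents are hypothetical; only the similarity weights for lecturer and scholar are taken from the slide.

```python
import heapq


def incremental_merge(expansions, index_lists, opened=None, max_score=1.0):
    """Yield (element, term, sim * score) in descending order of combined score.

    expansions:  [(term, sim)] sorted by descending sim.
    index_lists: term -> [(element, score)] sorted by descending score.
    opened:      optional list that records which index lists were opened.
    """
    heap = []      # entries: (-combined score, term, position, sim)
    next_exp = 0   # next expansion term whose list is still unopened
    emitted = set()
    while heap or next_exp < len(expansions):
        # Open another list only if its best possible entry could beat the head.
        while next_exp < len(expansions) and (
            not heap or expansions[next_exp][1] * max_score > -heap[0][0]
        ):
            term, sim = expansions[next_exp]
            next_exp += 1
            if opened is not None:
                opened.append(term)
            if index_lists.get(term):
                heapq.heappush(heap, (-sim * index_lists[term][0][1], term, 0, sim))
        if not heap:
            break
        neg, term, i, sim = heapq.heappop(heap)
        elem = index_lists[term][i][0]
        if i + 1 < len(index_lists[term]):
            nxt = index_lists[term][i + 1][1]
            heapq.heappush(heap, (-sim * nxt, term, i + 1, sim))
        if elem not in emitted:  # first occurrence carries the max over all terms
            emitted.add(elem)
            yield elem, term, -neg


EXPANSIONS = [("professor", 1.0), ("lecturer", 0.7), ("scholar", 0.6)]
INDEX_LISTS = {  # hypothetical per-term lists for the condition on SB
    "professor": [("e57", 0.6), ("e44", 0.4)],
    "lecturer": [("e12", 0.9), ("e57", 0.5)],
    "scholar": [("e14", 0.8)],
}

if __name__ == "__main__":
    for item in incremental_merge(EXPANSIONS, INDEX_LISTS):
        print(item)
```

Because a list is opened only when its upper bound exceeds the current head of the merge, a top-k scan that stops early never reads the lists of weakly related terms, so no similarity threshold has to be chosen in advance.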
14Probabilistic Pruning
- Add c to top-k result if current score(c) > min score in top-k
- Drop c only if best score(c) ≤ min-k, otherwise keep it
⇒ Often overly conservative (deep scans, large number of candidates)
score predictor can use Poisson approximations or histogram convolution
[Figure: score versus scan depth, showing best score(c), current score(c), and min-k; question: drop c?]
⇒ E[rel. precision@k] = 1 − ε
15Experimental Results INEX Benchmark
on IEEE-CS journal and conference articles: 12,223 XML docs with 12 mio. elements, 7.9 GB for all indexes
20 CO queries, e.g. "XML editors or parsers"
20 CAS queries, e.g. //article[.//bibl[about(.//QBIC)] and .//p[about(.//image retrieval)]]

                        join&sort    struct index   TopX (ε=0.0)   TopX (ε=0.1)
sorted accesses @10     9,122,318    761,970        635,507        426,986
random accesses @10     0            3,245,068      64,807         59,414
relative recall @10     1            1              1              0.8
precision@10            0.34         0.34           0.34           0.32
MAP@1000                0.17         0.17           0.17           0.17
16TopX Current and Future Work
- Integration of a structure index to speed up queries with complex path expressions
- Scheduling of index-scan steps and few random accesses
- Efficient consideration of correlated dimensions
- Integrated support for all kinds of XML similarity search: content & ontological sim, structural sim
- Integration of top-k operator into physical algebra and query optimizer of XML engine
- Extension to graph-structured data
17SphereSearch Unified Retrieval
Goal: Increase recall & precision for hard queries on linked and heterogeneous data
- Unified search for unstructured, semistructured, structured data from heterogeneous sources
- Graph-based model, including links
- Annotation engines from NLP to recognize classes of named entities (persons, locations, dates, ...)
- Flexible query language with context-aware scoring
- Compactness-based scores
18Unifying Search on Heterogeneous Data
Databases
19Extended Data Model
- <P> Gerhard Weikum <C>IR</C> SB <R>XML</R></P>
docid=1 pre=1 post=3 tag=P content=Gerhard Weikum IR SB XML
docid=1 pre=2 post=1 tag=C content=IR
docid=1 pre=3 post=2 tag=R content=XML
20Queries
- query = query groups + (optional) join conditions
- query group: relaxed content condition with keyword and concept-value conditions, possibly including similarity operators
- Course=IR: G1(Course=IR) or G1(Course, IR)
- Professor=SB: G2(Professor=SB) or G2(Professor, Location=SB)
- Range conditions: Date<1980, Location~SB
- join condition: elements with similar content, e.g. G1(gothic church) G2(roman church) G1.location=G2.location
21Scores for Query Groups
Q(A,B)
1
- Weighted aggregation of scores in environment of
element (sphere score)
Rewards proximity of terms and compactness of
term distribution
22Scores for Query Results
- Join conditions like A.T=B.S
- Find nodes n1 with type T, n2 with type S
- For each pair, add edge (n1,n2) with weight (conf(n1) · conf(n2) · sim(n1,n2))^-1
- sim(n1,n2): content-based similarity
- ⇒ Influences compactness
- query result R: one result per query group
compactness: 1/size of a minimal spanning tree
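A small sketch of the compactness computation, assuming the result nodes and the pairwise distances between them are already known and form a complete graph. The helper names, the node names, and the distances are hypothetical; how SphereSearch derives the distances from the data graph is not shown here.

```python
def join_edge_weight(conf1, conf2, sim):
    """Weight of a virtual join edge: (conf(n1) * conf(n2) * sim(n1, n2))^-1."""
    return 1.0 / (conf1 * conf2 * sim)


def mst_size(nodes, dist):
    """Total weight of a minimal spanning tree (Prim's algorithm).

    dist maps frozenset({u, v}) to the edge weight between u and v.
    """
    nodes = list(nodes)
    in_tree = {nodes[0]}
    total = 0.0
    while len(in_tree) < len(nodes):
        # Cheapest edge that connects the tree to a node outside it.
        weight, node = min(
            (dist[frozenset((u, v))], v)
            for u in in_tree
            for v in nodes
            if v not in in_tree
        )
        total += weight
        in_tree.add(node)
    return total


def compactness(nodes, dist):
    """Compactness of a result: 1 / size of its minimal spanning tree."""
    return 1.0 / mst_size(nodes, dist)


if __name__ == "__main__":
    # One result node per query group; distances are made up for illustration.
    result = ["a", "b", "c"]
    dist = {
        frozenset(("a", "b")): 1.0,
        frozenset(("b", "c")): 2.0,
        frozenset(("a", "c")): 4.0,
    }
    print(mst_size(result, dist), compactness(result, dist))
```

A join edge between two confident, highly similar nodes gets a small weight, so results connected through good join matches end up with a smaller spanning tree and a higher compactness.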
23Experimental Results
- On extended Wikipedia corpus with links to IMDB
and extended DBLP corpus (with links to homepages)
- Queries like
- G(California,governor) M(movie)
- A(Madonna,husband) B(director) A.person=B.director
24SphereSearch Current and Future Work
- Intuitive graphical user interface
- Refined type-specific similarity measures (like geographic distances)
- Deep Web search through automatic portal queries
- Parameter tuning with relevance feedback
- Efficiency of query evaluation through precomputation and integrated top-k
25Integrating TopX and SphereSearch
[Figure: for a query (G1,..,Gn), one distance-based aggregation top-k operator per query group delivers its top-k to a compactness-based top-k operator, which produces the top-k results]
26Thank you!
27HOPI 2-Hop-Cover based Path Index
for query conditions such as //course//professor
test connectivity and find connections
B tree on element names
...
...
...
course
professor
Lin = ∅, Lout = {18, 19, 20, 23, 26, 27, 28, 29, 76, 85, 86}
17
76
Lin = {17, 20, 75}, Lout = {77, 78, 79, 80, 28, 29}
44
Lin..., Lout...
92
Lin..., Lout...
...
...
17 course
75 homepage
18 title
19 outline
20 lecturer
76 professor
21 chap
23 lit
22 chap
77 name
78 office
79 CV
80 projects
25 cite
81 bibl
82 honors
24 cite
26 title
27 publ
28 title
29 publ
83 book
84 paper
85 EU project
86 DFG project
28Constructing a 2-Hop Cover
- Definition (E. Cohen et al., SODA 2002)
- a 2-hop cover of a graph G = (V,E)
- is a labeling (Lin, Lout) of all nodes where
- Lin(n) ⊆ {m | m →* n}, Lout(n) ⊆ {p | n →* p}, and
- ∀ (m,n) ∈ E ∃ center node x with x ∈ Lout(m) ∧ x ∈ Lin(n)
Theorem (Cohen et al.): The size of a 2-hop cover is Σ_{n∈V} (|Lin(n)| + |Lout(n)|). Finding a minimal 2-hop cover is NP-complete.
Polynomial Algorithm with O(log |V|) Approximation (Cohen et al.):
T := E   // uncovered connections
while T ≠ ∅
  for each node n construct center graph C(n) = {(m,p) | (m,n), (n,p) ∈ E}
  find node n with densest subgraph S(n) of C(n) ∩ T
  for each source s in S(n) do Lout(s) := Lout(s) ∪ {n}
  for each target t in S(n) do Lin(t) := Lin(t) ∪ {n}
  remove edges of S(n) from T
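The greedy idea can be tried on a toy graph with the sketch below. It is a simplification under stated assumptions, not the algorithm of Cohen et al.: each round takes the whole uncovered center graph of the best node instead of its densest subgraph, the transitive closure is materialized up front, and every node is treated as implicitly contained in its own Lin and Lout. The example edges are hypothetical.

```python
from itertools import product


def transitive_closure(nodes, edges):
    """All pairs (u, v) connected by a non-empty path u -> ... -> v."""
    succ = {n: set() for n in nodes}
    for u, v in edges:
        succ[u].add(v)
    closure = set()
    for start in nodes:
        stack, reached = list(succ[start]), set()
        while stack:
            v = stack.pop()
            if v not in reached:
                reached.add(v)
                stack.extend(succ[v])
        closure |= {(start, v) for v in reached}
    return closure


def greedy_two_hop_cover(nodes, edges):
    """Simplified greedy cover: pick the center covering most uncovered pairs."""
    conns = transitive_closure(nodes, edges)
    uncovered = set(conns)
    lin = {n: set() for n in nodes}
    lout = {n: set() for n in nodes}

    def center_graph(n):
        # Uncovered connections (s, t) that can be routed through center n.
        sources = [s for s in nodes if s == n or (s, n) in conns]
        targets = [t for t in nodes if t == n or (n, t) in conns]
        return {(s, t) for s, t in product(sources, targets) if (s, t) in uncovered}

    while uncovered:
        center = max(nodes, key=lambda n: len(center_graph(n)))
        covered = center_graph(center)
        for s, t in covered:
            if s != center:
                lout[s].add(center)
            if t != center:
                lin[t].add(center)
        uncovered -= covered
    return lin, lout


def connected(lin, lout, u, v):
    """u reaches v iff the labels share a center (each node covers itself)."""
    return bool((lout[u] | {u}) & (lin[v] | {v}))


if __name__ == "__main__":
    nodes = [17, 20, 75, 76, 77]
    edges = [(17, 20), (20, 76), (75, 76), (76, 77)]  # hypothetical links
    lin, lout = greedy_two_hop_cover(nodes, edges)
    print(lin, lout, connected(lin, lout, 17, 77))
```

A connection test then needs only the two label sets of the endpoints, which is what makes the lookup cheap compared with storing the full transitive closure.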
29Making 2-Hop Cover Practically Viable
Small subset of DBLP: 6,210 documents, 168,991 elements, 25,368 links, 14 MB uncompressed XML
Element-level graph has 168,991 nodes and 188,149 edges; its TC has 344,992,370 connections, ca. 2 GB
2-Hop Cover: 1,289,930 entries
⇒ compression factor of 267
⇒ fast lookups: 7.6 entries/node
But computation took 45 hours and 80 GB RAM! Full DBLP has ca. 600,000 documents; the Web has ?
HOPI (2-Hop Cover Based Index) [EDBT 04] for very large, disk-resident graphs:
- keep center graphs in priority queue
- incrementally update center subgraphs
- avoid precomputing complete transitive closure
- support incremental updates
30Efficient HOPI Construction EDBT 04
- Divide-and-conquer
- Partition G by partitioning the XML document graph (using heuristics) with node weights = number of elems in doc, edge weights = number of cross-doc links
- Compute 2-hop cover for each partition
- Join covers: for each cross-partition edge x → y with x ∈ P, y ∈ Q
- add x to Lout(a) for all a ∈ P with a →* x and
- to Lin(b) for all b ∈ Q with y →* b
Implementation stores Lin and Lout sets in database tables:
Lin (Id, InId) with indexes on (Id, InId) and (InId, Id)
Lout (Id, OutId) with indexes on (Id, OutId) and (OutId, Id)
Elems (Id, ElemName)
Efficiently supports connection queries for all XPath axes on arbitrary XML data graphs
Performance: TC: 344,992,370 connections; 2-Hop Cover: 15,976,677 entries
⇒ compression factor of 21.6
⇒ lookups ok: 94.5 entries/node
⇒ build time feasible: 3 hours and 1 GB RAM
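How a connection query might run against these tables can be sketched with an in-memory SQLite database. The table and column names follow the slide; everything else is an assumption for illustration: the index names, the tiny hand-made labels for a hypothetical chain 17 → 20 → 76 → 77, and the convention that a center node stores itself in its own Lin and Lout set.

```python
import sqlite3


def build_demo_db():
    """Create the Lin/Lout/Elems tables and load a tiny hand-made cover."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Lin   (Id INTEGER, InId  INTEGER);
        CREATE TABLE Lout  (Id INTEGER, OutId INTEGER);
        CREATE TABLE Elems (Id INTEGER PRIMARY KEY, ElemName TEXT);
        CREATE INDEX lin_fwd  ON Lin  (Id, InId);
        CREATE INDEX lin_bwd  ON Lin  (InId, Id);
        CREATE INDEX lout_fwd ON Lout (Id, OutId);
        CREATE INDEX lout_bwd ON Lout (OutId, Id);
        """
    )
    conn.executemany(
        "INSERT INTO Elems VALUES (?, ?)",
        [(17, "course"), (20, "lecturer"), (76, "professor"), (77, "name")],
    )
    # Centers 20 and 76 cover all connections of the chain 17 -> 20 -> 76 -> 77.
    conn.executemany("INSERT INTO Lout VALUES (?, ?)",
                     [(17, 20), (17, 76), (20, 76), (76, 76)])
    conn.executemany("INSERT INTO Lin VALUES (?, ?)",
                     [(20, 20), (76, 76), (77, 76)])
    return conn


def reaches(conn, u, v):
    """Connection test: does some center appear in Lout(u) and in Lin(v)?"""
    row = conn.execute(
        "SELECT 1 FROM Lout JOIN Lin ON Lout.OutId = Lin.InId "
        "WHERE Lout.Id = ? AND Lin.Id = ? LIMIT 1",
        (u, v),
    ).fetchone()
    return row is not None


def descendants_named(conn, u, name):
    """All elements with the given name that are reachable from u."""
    rows = conn.execute(
        "SELECT DISTINCT Lin.Id FROM Lout "
        "JOIN Lin ON Lout.OutId = Lin.InId "
        "JOIN Elems ON Elems.Id = Lin.Id "
        "WHERE Lout.Id = ? AND Lin.Id <> ? AND Elems.ElemName = ? "
        "ORDER BY Lin.Id",
        (u, u, name),
    ).fetchall()
    return [r[0] for r in rows]


if __name__ == "__main__":
    db = build_demo_db()
    print(reaches(db, 17, 77), descendants_named(db, 17, "professor"))
```

The two composite indexes per table let the join be driven from either side, which matters when the query fixes only the source (descendant-style axes) or only the target (ancestor-style axes).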
31Structurally Recursive Cover Join ICDE 05
- Basic Idea
- Compute (small) graph from partitioning
- Compute its two-hop cover Hin, Hout
- Combine this cover with the covers of the
partitions
32Example (1)
[Figure: example graph with nodes 1 to 8]
Build partition-level skeleton graph
33Example (2)
[Figure: partition-level skeleton graph over nodes 1, 2, 7, 8, annotated with its Hin and Hout labels]
- Enhanced Cover Join Algorithm
- For each link source s, add Hout(s) to Lout(a) for each ancestor a of s in partition of s
- For each link target t, add Hin(t) to Lin(d) for each descendant d of t in partition of t
Join can run independently for all links
34Example (3)
Hout = {2, 7}
Hin = {2}
[Figure: example graph with nodes 1 to 8]
35Experimental Results for Enhanced HOPI
- on small DBLP subset
- TC: 344,992,370 connections
- 2-Hop Cover: 9,999,052 entries
- ⇒ compression factor 34.5
- ⇒ lookups ok: 59.2 entries/node
- ⇒ fast build time: 23 minutes with 1 GB RAM
- (with partition size 10,000 connections, edge
weight anc desc)
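The compression factors and entries-per-node figures quoted for the three cover constructions follow directly from the sizes on the slides. The short check below recomputes them; the dictionary keys are descriptive labels chosen for this example.

```python
TC_CONNECTIONS = 344_992_370   # size of the transitive closure
GRAPH_NODES = 168_991          # nodes of the element-level graph

COVER_ENTRIES = {
    "2-hop cover (centralized)": 1_289_930,
    "HOPI divide-and-conquer": 15_976_677,
    "HOPI enhanced cover join": 9_999_052,
}


def cover_stats(entries):
    """Return (compression factor vs. TC, average label entries per node)."""
    return TC_CONNECTIONS / entries, entries / GRAPH_NODES


if __name__ == "__main__":
    for name, entries in COVER_ENTRIES.items():
        factor, per_node = cover_stats(entries)
        print(f"{name}: factor {factor:.1f}, {per_node:.1f} entries/node")
```

The numbers show the trade-off: the partitioned constructions give up some compression relative to the centralized cover in exchange for build times of hours or minutes instead of days.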
Macroscopic experiments on INEX benchmark (with TopX search engine, using smart indexing and probabilistic top-k):
12,000 IEEE-CS publications with fine-grained XML markup
40 queries such as //article[.//bibl[about(.//QBIC)] and .//p[about(.//image retrieval)]]
⇒ outperforms best competitor (U Wisconsin) by factor 10
⇒ achieves very good precision and recall
36Ongoing and Future Work
- Even better graph partitioning
- Efficient incremental updates
- Full integration of HOPI with TopX and SphereSearch engines
- Graph-based Scoring Models
- Query Processing and Optimization over XML Indexes
- Graph Service for Biological Networks (interoperate with BN at U Saarland / CBI)
37TopX Algorithm
based on index table L (Tag, Term, MaxScore, DocId, Score, ElemId, Pre, Post)

decompose query: content conditions (CC) & path conditions (PC)
// conditions may be optional or mandatory
for each index list Li (extracted from L by tag and/or term) do:
    block-scan next elements from same doc d
    test evaluable PCs of all elements of d
    drop elements and docs that do not satisfy mandatory PCs or CCs
    update score bookkeeping for d
    consider random accesses for d by cost-based scheduler
    drop d if (prob.) score threshold is not reached