Title:

Outline

Description:

Ralf Schenkel, joint work with Gerhard Weikum, Jens Graupmann, ...


Transcript and Presenter's Notes

2
Outline
  • XML IR
  • Area Overview and Contributions
  • Efficiency: TopX
  • Effectiveness: SphereSearch

3
XML-IR Example (1)
[Figure: XML tree rooted at Professor. Node labels: Name: Gerhard Weikum; Address; City: SB; Country: Germany; Teaching; Course; Title: IR; Syllabus; Book; Article; Research; Project; Title: Intelligent Search of XML Data; Description: Information retrieval ...; Sponsor: German Science Foundation]
//Professor[contains(., "SB") and
contains(.//Course, "IR") and contains(.//Research, "XML")]
4
XML-IR Example (2)
[Figure: linked XML documents rooted at Professor and Lecturer. Node labels: Address: Max-Planck Institute for CS, Germany; Name: Gerhard Weikum; Address; City: SB; Country: Germany; Name: Ralf Schenkel; Teaching; Research; Interests: Semistructured Data, IR; Course; Project; Title: IR; Seminar; Syllabus; Title: Intelligent Search of XML Data; Description: Information retrieval ...; Contents: Ranked Search ...; Literature; Sponsor: German Science Foundation; Article; Book; Title: Statistical Language Models]
  • Challenges
  • Information spread over multiple documents
  • Heterogeneous schemas and content
  • Similarity queries with a vast number of
    potential results

//Professor[contains(., "SB") and
contains(.//Course, "IR") and contains(.//Research, "XML")]
5
XML-IR History and Related Work
IR on structured docs (SGML)
Web query languages
1995
W3QS (Technion Haifa)
OED etc. (U Waterloo)
Araneus (U Roma)
HySpirit (U Dortmund)
Lorel (Stanford U)
HyperStorM (GMD Darmstadt)
WebSQL (U Toronto)
WHIRL (CMU)
IR on XML
XML query languages
XIRQL (U Dortmund)
XML-QL (AT&T Labs)
XXL & TopX (U Saarland / MPI)
2000
XPath 1.0 (W3C)
ApproXQL (U Berlin / U Munich)
ELIXIR (U Dublin)
INEX benchmark, NEXI, XPath & XQuery Full-Text
PowerDB-IR (ETH Zurich)
JuruXML (IBM Haifa)
XPath 2.0 (W3C)
XSearch (Hebrew U)
Timber (U Michigan)
XQuery (W3C)
XRank & Quark (Cornell U)
FleXPath (AT&T Labs)
TeXQuery (AT&T Labs)
Commercial software (MarkLogic, Verity?, IBM?,
Oracle?, Google?, ...)
2005
6
Area Overview and Contributions
Documents about XML
...
...
...
7
TopX: Efficient XML IR
Goal: Efficiently compute the best results of a
similarity query
  • Query and scoring model for similarity queries
  • Extend top-k query processing algorithms for
    sorted lists [Buckley 85, Güntzer et al. 00,
    Fagin 01] to XML queries & data, including
    similarity queries
  • Exploit cheap disk space for highly redundant
    indexing

8
TopX Data Model
  • <P> Gerhard Weikum <C>IR</C> SB
    <R>XML</R></P>

docid=1, pre=1, post=3, tag=P, content=Gerhard Weikum IR SB XML
docid=1, pre=2, post=1, tag=C, content=IR
docid=1, pre=3, post=2, tag=R, content=XML
  • pure tree model, ignoring links
  • content of descendants replicated, per-element
    term scores (using tf/idf scores or variant of
    Okapi BM25 model)
  • pre/postorder labels reflecting element hierarchy
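
The pre/postorder labeling can be sketched in a few lines of Python. This is a minimal illustration, not TopX code: the `Node` class and the function names `label` and `is_ancestor` are assumptions made for this sketch. It labels the three-element example document from this slide and uses the labels for an ancestor test.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    pre: int = 0
    post: int = 0


def label(root):
    """Assign 1-based preorder and postorder ranks to every node."""
    counter = {"pre": 0, "post": 0}

    def visit(node):
        counter["pre"] += 1
        node.pre = counter["pre"]      # rank when the node is entered
        for child in node.children:
            visit(child)
        counter["post"] += 1
        node.post = counter["post"]    # rank when the node is left

    visit(root)


def is_ancestor(a, d):
    """a is a proper ancestor of d iff a is entered earlier and left later."""
    return a.pre < d.pre and a.post > d.post


# <P> Gerhard Weikum <C>IR</C> SB <R>XML</R></P>
c, r = Node("C"), Node("R")
p = Node("P", [c, r])
label(p)
labels = {n.tag: (n.pre, n.post) for n in (p, c, r)}
print(labels)
```

With these labels a path condition such as "C is a descendant of P" reduces to two integer comparisons, which is why the labels can be stored next to each index entry.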

9
Queries and Query Scores
  • query: tree/graph pattern with
  • mandatory/optional content conditions (CC)
  • mandatory path conditions (PC)
  • content-based score for element e with tag A and
    CC T=c
  • candidate: connected sub-pattern with element
    ids and scores
  • result: candidate with scores for all mandatory
    conditions
  • content-based score of result with elements
    e1, ..., em for query q with CC T1=c1, ..., Tm=cm

Professor=SB
Course=IR
Research=XML
relevance: tf-based or BM25; specificity: idf
per tag type; compactness: subtree size
10
Data Structures
Professor=SB
Course=IR
Research=XML
oid docid score pre post
46 2 0.9 2 15
9 2 0.5 10 8
171 5 0.85 1 20
84 3 0.1 1 12
oid docid score pre post
216 17 0.9 2 15
72 3 0.8 10 8
51 2 0.5 4 12
671 31 0.4 12 23
oid docid score pre post
3 1 1.0 1 21
28 2 0.8 8 14
182 5 0.75 3 7
96 4 0.75 6 4
  • 1) Build index lists for each tag-term pair,
    grouped by document, sorted by max score in
    document
  • Block-fetch all elements for the same doc
  • Create and/or update candidates, including
    testing PCs in memory
  • Maintain score and best score for each candidate,
    prune when possible
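
The bookkeeping of score and best score can be illustrated with a simplified, NRA-style sketch. It is not the TopX implementation: it works at document level only, ignores path conditions, treats every condition as optional, and reads one document block per list per round. The function name `top_k` is an assumption, and the three lists below take one `(docid, max score)` pair per document block from the tables on this slide, in the order the tables appear.

```python
def top_k(lists, k):
    """NRA-style top-k over per-condition lists of (docid, max_score)."""
    pos = [0] * len(lists)          # scan position per list
    seen = {}                       # docid -> {list index: score}
    while True:
        progressed = False
        for i, lst in enumerate(lists):      # one sorted access per list
            if pos[i] < len(lst):
                doc, score = lst[pos[i]]
                pos[i] += 1
                seen.setdefault(doc, {})[i] = score
                progressed = True
        # upper bound for any entry not yet read from list i
        high = [lst[pos[i]][1] if pos[i] < len(lst) else 0.0
                for i, lst in enumerate(lists)]
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(h for i, h in enumerate(high) if i not in s)
                for d, s in seen.items()}
        ranked = sorted(worst, key=worst.get, reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            min_k = worst[top[-1]]
            # stop when neither a seen candidate nor an unseen document can win
            if all(best[d] <= min_k for d in rest) and sum(high) <= min_k:
                return [(d, worst[d]) for d in top]
        if not progressed:                   # all lists exhausted
            return [(d, worst[d]) for d in top]


list_1 = [(2, 0.9), (5, 0.85), (3, 0.1)]
list_2 = [(17, 0.9), (3, 0.8), (2, 0.5), (31, 0.4)]
list_3 = [(1, 1.0), (2, 0.8), (5, 0.75), (4, 0.75)]
result = top_k([list_1, list_2, list_3], k=2)
print(result)
```

A candidate is pruned here only when its best score cannot exceed the current min-k score, which is the conservative rule; on this tiny input the scan still has to run to the end of the lists.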

11
Query Evaluation By Example
top-2 results
Professor=SB
Course=IR
Research=XML
oid docid score pre post
46 2 0.9 2 15
9 2 0.5 10 8
171 5 0.85 1 20
84 3 0.1 1 12
oid docid score pre post
216 17 0.9 2 15
72 3 0.8 10 8
51 2 0.5 4 12
671 31 0.4 12 23
oid docid score pre post
3 1 1.0 1 21
28 2 0.8 8 14
182 5 0.75 3 7
96 4 0.75 6 4
[Figure: candidate queue with Doc2, Doc17, Doc1, Doc5, Doc3, and a Pseudo-Doc]
Further speedup by additional random accesses for
promising documents
12
Ontology-Based Query Expansion
Thesaurus/Ontology: concepts, relationships,
glosses from WordNet, Gazetteers, Web forms &
tables, Wikipedia
Relationships quantified by statistical
correlation measures
Similarity conditions like professor=SB//course=IR
Disambiguation, then query expansion:
δ-exp(x) = {w | sim(x,w) > δ}
[Figure: ontology graph around "professor" with the concepts alchemist, primadonna, director, artist, wizard, investigator, intellectual, researcher, educator, scientist, scholar, lecturer, mentor, teacher, and "academic, academician, faculty member"; one edge is labeled HYPONYM (0.7)]
Weighted expanded query. Example: (professor,
lecturer (0.7), scholar (0.6), ...) = SB // (course,
class (1.0), seminar (0.84), ...) = (IR, Web
search (0.653), ...)
Better recall, but possibly worse precision (due
to topic drift)
Efficient top-k search with dynamic expansion
13
TopX Incremental Query Expansion
Consider expandable content condition
Professor=SB with score
max over T ∈ 0-exp(Professor) of sim(Professor,T) · s(e, T=SB)
Dynamic query expansion with incremental,
on-demand merging of additional index lists
[Figure: index lists for Professor=SB and Research=XML, merged with lists for thesaurus/ontology expansions of professor: lecturer 0.7, scholar 0.6, academic 0.53, scientist 0.5. List entries shown (id: score): 57: 0.6, 44: 0.4, 52: 0.4, 33: 0.3, 75: 0.3 and 12: 0.9, 14: 0.8, 28: 0.6, 17: 0.55, 61: 0.5, 44: 0.5]
Much more efficient than threshold-based
expansion: no threshold tuning, better recall,
no topic drift
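
The on-demand merging can be sketched with a priority queue. This is an illustrative sketch under stated assumptions, not the TopX code: the function name `merge_top`, the document ids, and the per-term index lists are hypothetical, only the similarity weights come from the slide, and local scores are assumed to lie in [0, 1] so that a term's similarity bounds every weighted score in its list.

```python
import heapq


def merge_top(expansions, index, n):
    """Return the n best (doc, sim*score, term) entries in descending order
    of weighted score, plus the terms whose index lists had to be opened."""
    pending = sorted(expansions.items(), key=lambda kv: -kv[1])
    heap, opened, out, nxt = [], [], [], 0
    while len(out) < n:
        # Open another list only while an unopened term could still beat
        # the best head entry (its similarity is an upper bound).
        while nxt < len(pending) and (not heap or pending[nxt][1] > -heap[0][0]):
            term, sim = pending[nxt]
            nxt += 1
            opened.append(term)
            if index.get(term):
                heapq.heappush(heap, (-sim * index[term][0][1], term, 0))
        if not heap:
            break
        neg, term, i = heapq.heappop(heap)
        out.append((index[term][i][0], -neg, term))
        if i + 1 < len(index[term]):
            next_score = expansions[term] * index[term][i + 1][1]
            heapq.heappush(heap, (-next_score, term, i + 1))
    return out, opened


expansions = {"professor": 1.0, "lecturer": 0.7, "scholar": 0.6,
              "academic": 0.53, "scientist": 0.5}
index = {  # hypothetical index lists, sorted by descending score
    "professor": [("d12", 0.9), ("d57", 0.6), ("d44", 0.5)],
    "lecturer": [("d37", 0.9), ("d44", 0.8)],
    "scholar": [("d92", 0.9)],
    "scientist": [("d5", 0.4)],
}
merged, opened = merge_top(expansions, index, n=3)
print(merged, opened)
```

In this run the lists for scholar, academic and scientist are never opened, which is the point of incremental expansion. The sketch does not combine multiple entries of the same document; taking the maximum per document, as in the score formula above, would be a small extension.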
14
Probabilistic Pruning
  • Add c to top-k result if current score(c) > min
    score in top-k (min-k)
  • Drop c only if best score(c) ≤ min-k, otherwise
    keep it

⇒ Often overly conservative (deep scans, large
number of candidates)
Score predictor can use Poisson approximations or
histogram convolution
[Figure: score plotted over scan depth, showing best score(c), current score(c), and min-k, with the question "Drop c?"]
⇒ E[rel. precision@k] = 1 − ε
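
A score predictor based on histogram convolution might look like the following sketch. It is an assumption-laden illustration rather than the method used in TopX: the function names, the discretization (histogram entry i is the probability that a missing score equals i · width), and the example histograms are all hypothetical.

```python
def convolve(p, q):
    """Distribution of the sum of two independent discretized scores."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def prob_exceeds(current, histograms, min_k, width):
    """P[current + sum of missing scores > min_k]; histogram entry i is
    the probability that a missing score equals i * width."""
    total = [1.0]
    for h in histograms:
        total = convolve(total, h)
    return sum(pr for i, pr in enumerate(total) if current + i * width > min_k)


def should_drop(current, histograms, min_k, width, eps):
    """Probabilistic pruning: drop c if it is unlikely to reach the top-k."""
    return prob_exceeds(current, histograms, min_k, width) < eps


# Candidate with current score 1.0 and two lists not yet seen.
h1 = [0.5, 0.25, 0.25]     # scores 0, 0.25, 0.5
h2 = [0.75, 0.25, 0.0]
prob = prob_exceeds(1.0, [h1, h2], min_k=1.6, width=0.25)
print(prob, should_drop(1.0, [h1, h2], 1.6, 0.25, eps=0.1))
```

The conservative rule would keep this candidate, because its best score 1.0 + 0.5 + 0.25 = 1.75 still exceeds min-k = 1.6. The probabilistic rule drops it for ε = 0.1, since only one score combination out of the convolved distribution gets it past min-k.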
15
Experimental Results INEX Benchmark
On IEEE-CS journal and conference articles: 12,223
XML docs with 12 million elements, 7.9 GB for all
indexes
20 CO queries, e.g. "XML editors or parsers"
20 CAS queries, e.g.
//article[.//bibl[about(.//"QBIC")] and .//p[about(.//"image retrieval")]]

                       join&sort   struct index   TopX (ε=0.0)   TopX (ε=0.1)
sorted accesses @10    9,122,318   761,970        635,507        426,986
random accesses @10    0           3,245,068      64,807         59,414
relative recall @10    1           1              1              0.8
precision@10           0.34        0.34           0.34           0.32
MAP@1000               0.17        0.17           0.17           0.17
16
TopX Current and Future Work
  • Integration of a structure index to speed up
    queries with complex path expressions
  • Scheduling of index-scan steps and few random
    accesses
  • Efficient consideration of correlated dimensions
  • Integrated support for all kinds of XML
    similarity search: content & ontological sim,
    structural sim
  • Integration of top-k operator into physical
    algebra and query optimizer of XML engine
  • Extension to graph-structured data

17
SphereSearch: Unified Retrieval
Goal: Increase recall & precision for hard
queries on linked and heterogeneous data
  • Unified search for unstructured, semistructured,
    structured data from heterogeneous sources
  • Graph-based model, including links
  • Annotation engines from NLP to recognize classes
    of named entities (persons, locations, dates, ...)
  • Flexible query language with context-aware
    scoring
  • Compactness-based scores

18
Unifying Search on Heterogeneous Data
[Figure: heterogeneous data sources, among them databases]
19
Extended Data Model
  • <P> Gerhard Weikum <C>IR</C> SB
    <R>XML</R></P>

docid=1, pre=1, post=3, tag=P, content=Gerhard Weikum IR SB XML
docid=1, pre=2, post=1, tag=C, content=IR
docid=1, pre=3, post=2, tag=R, content=XML
20
Queries
  • query = query groups + (optional) join conditions
  • query group: relaxed content condition with
    keyword and concept-value conditions, possibly
    including similarity operators
  • Course=IR: G1(Course=IR) or G1(Course, IR)
  • Professor=SB: G2(Professor=SB) or
    G2(Professor, Location=SB)
  • Range conditions: Date<1980, Location=SB
  • join condition: elements with similar
    content, e.g. G1(gothic church) G2(roman church)
    G1.location=G2.location

21
Scores for Query Groups
Q(A,B)
1
  • Weighted aggregation of scores in environment of
    element (sphere score)

Rewards proximity of terms and compactness of
term distribution
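
One way to read "weighted aggregation of scores in the environment of an element" is a distance-damped sum over a bounded neighborhood. The sketch below assumes exactly that form; the damping factor `alpha`, the `radius`, the function name `sphere_score`, and the toy graph are assumptions for illustration and are not given on the slide.

```python
from collections import deque


def sphere_score(graph, local, node, alpha=0.5, radius=2):
    """Sum local scores of all nodes within `radius` hops of `node`,
    damped by alpha ** distance (assumed form of the sphere score)."""
    dist = {node: 0}
    queue = deque([node])
    score = 0.0
    while queue:
        v = queue.popleft()
        score += (alpha ** dist[v]) * local.get(v, 0.0)
        if dist[v] < radius:
            for w in graph.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
    return score


# Undirected toy graph: a-b, b-c, a-d; the query term occurs in b and c.
graph = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b"], "d": ["a"]}
local = {"b": 1.0, "c": 1.0}
scores = {n: sphere_score(graph, local, n) for n in graph}
print(scores)
```

Node b scores highest because the term occurrences sit close together around it, which matches the stated goal of rewarding proximity and compact term distributions.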
22
Scores for Query Results
  • Join conditions like A.T=B.S
  • Find nodes N1 with type T, N2 with type S
  • For each pair (n1,n2), add edge (n1,n2) with weight
    (conf(n1) · conf(n2) · sim(n1,n2))⁻¹
  • sim(n1,n2): content-based similarity
  • ⇒ Influence compactness
  • query result R: one result per query group

compactness: 1 / size of a minimal spanning tree
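
The compactness score can be sketched with Kruskal's algorithm. This is a simplified illustration: it assumes the pairwise connection weights between the result nodes are already known (in a real data graph they would come from paths between the nodes), and the function names and the example weights are hypothetical.

```python
def join_edge_weight(conf1, conf2, sim):
    """Weight of a join edge: (conf(n1) * conf(n2) * sim(n1, n2)) ** -1."""
    return 1.0 / (conf1 * conf2 * sim)


def mst_weight(nodes, edges):
    """Total weight of a minimal spanning tree (Kruskal with union-find).
    `edges` is a list of (weight, u, v) tuples."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    total, used = 0.0, 0
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
            used += 1
    if used != len(nodes) - 1:
        raise ValueError("result nodes are not connected")
    return total


def compactness(nodes, edges):
    """1 / size of a minimal spanning tree (needs at least two nodes)."""
    return 1.0 / mst_weight(nodes, edges)


nodes = ["g1", "g2", "g3"]           # one result node per query group
edges = [(1.0, "g1", "g2"), (2.0, "g2", "g3"), (4.0, "g1", "g3")]
score = compactness(nodes, edges)
print(score, join_edge_weight(1.0, 1.0, 0.5))
```

A join edge between two nodes with low confidence or low similarity gets a large weight, so a result that depends on such an edge has a larger spanning tree and a lower compactness.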
23
Experimental Results
  • On extended Wikipedia corpus with links to IMDB
    and extended DBLP corpus (with links to homepages)
  • Queries like
  • G(California,governor) M(movie)
  • A(Madonna,husband) B(director) A.person=B.director

24
SphereSearch Current and Future Work
  • Intuitive graphical user interface
  • Refined type-specific similarity measures (like
    geographic distances)
  • Deep Web search through automatic portal queries
  • Parameter tuning with relevance feedback
  • Efficiency of query evaluation through
    precomputation and integrated top-k

25
Integrating TopX and SphereSearch
[Figure: for a query (G1, ..., Gn), each query group is evaluated by a distance-based aggregation top-k operator; the top-k outputs of these operators feed a compactness-based top-k operator that produces the top-k results]
26
Thank you!
27
HOPI: 2-Hop-Cover based Path Index
For query conditions like //course//professor:
test connectivity and find connections
B tree on element names
...
...
...
course
professor
17: Lin = ∅, Lout = {18, 19, 20, 23, 26, 27, 28, 29, 76, 85, 86}
76: Lin = {17, 20, 75}, Lout = {77, 78, 79, 80, 28, 29}
44: Lin = ..., Lout = ...
92: Lin = ..., Lout = ...
...
...
17 course
75 homepage
18 title
19 outline
20 lecturer
76 professor
21 chap
23 lit
22 chap
77 name
78 office
79 CV
80 projects
25 cite
81 bibl
82 honors
24 cite
26 title
27 publ
28 title
29 publ
83 book
84 paper
85 EU project
86 DFG project
28
Constructing a 2-Hop Cover
  • Definition (E. Cohen et al., SODA 2002)
  • a 2-hop cover of a graph G=(V,E)
  • is a labeling (Lin, Lout) of all nodes where
  • Lin(n) ⊆ {m | m →* n}, Lout(n) ⊆ {p | n →* p}, and
  • ∀ (m,n) ∈ E ∃ center node x with x ∈ Lout(m) ∧ x
    ∈ Lin(n) (E is read here as the set of connections,
    i.e. the transitive closure)

Theorem (Cohen et al.): The size of a 2-hop
cover is Σ n∈V (|Lin(n)| + |Lout(n)|). Finding a
minimal 2-hop cover is NP-complete.
Polynomial Algorithm with O(log |V|)
Approximation (Cohen et al.):
T := E   // uncovered connections
while T ≠ ∅
    for each node n construct center graph
        C(n) = {(m,p) | (m,n), (n,p) ∈ E}
    find node n with densest subgraph S(n) of C(n) ∩ T
    for each source s in S(n) do Lout(s) := Lout(s) ∪ {n}
    for each target t in S(n) do Lin(t) := Lin(t) ∪ {n}
    remove edges of S(n) from T
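
The definition and the greedy loop can be tried out on a small graph. The sketch below is deliberately simplified and is not the algorithm of Cohen et al.: instead of the densest subgraph it picks the center that covers the most uncovered connections, so the approximation guarantee does not carry over. All function names and the example graph are assumptions; the graph is assumed to be acyclic, and each node is treated as an implicit member of its own Lin and Lout.

```python
def closure(graph, nodes):
    """Map each node to the set of nodes it reaches (graph assumed acyclic)."""
    reach = {}
    for s in nodes:
        seen, stack = set(), [s]
        while stack:
            v = stack.pop()
            for w in graph.get(v, ()):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        reach[s] = seen
    return reach


def build_cover(graph, nodes):
    """Greedy 2-hop labeling: pick the center covering most open connections."""
    reach = closure(graph, nodes)
    uncovered = {(m, p) for m in nodes for p in reach[m]}
    lin = {n: set() for n in nodes}
    lout = {n: set() for n in nodes}

    def covered_by(n):
        anc = {m for m in nodes if n in reach[m]} | {n}
        desc = reach[n] | {n}
        return {(m, p) for (m, p) in uncovered if m in anc and p in desc}

    while uncovered:
        center = max(sorted(nodes), key=lambda n: len(covered_by(n)))
        cov = covered_by(center)
        for m, p in cov:
            if m != center:
                lout[m].add(center)   # m reaches the center
            if p != center:
                lin[p].add(center)    # the center reaches p
        uncovered -= cov
    return lin, lout


def reachable(lin, lout, m, n):
    """Connection test: do Lout(m) and Lin(n) share a center?"""
    return bool((lout[m] | {m}) & (lin[n] | {n}))


nodes = [17, 18, 20, 75, 76, 77]
graph = {17: [18, 20], 20: [76], 75: [76], 76: [77]}
lin, lout = build_cover(graph, nodes)
cover_size = sum(len(lin[n]) + len(lout[n]) for n in nodes)
print(lin, lout, cover_size)
```

On this toy graph the labels need fewer entries than the transitive closure has connections, and every connection test is a single set intersection.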
29
Making 2-Hop Cover Practically Viable
Small subset of DBLP: 6,210 documents, 168,991
elements, 25,368 links, 14 MB uncompressed XML
Element-level graph has 168,991 nodes and 188,149
edges; its TC has 344,992,370 connections, ca.
2 GB
2-Hop Cover: 1,289,930 entries
⇒ compression factor of 267
⇒ fast lookups: 7.6 entries/node
But computation took 45 hours and 80 GB
RAM! Full DBLP has ca. 600,000 documents; the Web
has ?
HOPI (2-Hop Cover Based Index) [EDBT 04] for very
large, disk-resident graphs:
keep center graphs in priority queue
incrementally update center subgraphs
avoid precomputing complete transitive closure
support incremental updates
30
Efficient HOPI Construction [EDBT 04]
  • Divide-and-conquer
  • Partition G by partitioning the XML document
    graph (using heuristics) with
    node weights: number of elements in doc; edge
    weights: number of cross-doc links
  • Compute 2-hop cover for each partition
  • Join covers
  • for each cross-partition edge x → y with x ∈ P,
    y ∈ Q
  • add x to Lout(a) for all a ∈ P with a →* x
    and
  • to Lin(b) for all b ∈ Q with y →* b

Implementation stores Lin and Lout sets in
database tables:
Lin (Id, InId) with indexes on (Id, InId) and (InId, Id)
Lout (Id, OutId) with indexes on (Id, OutId) and (OutId, Id)
Elems (Id, ElemName)
Efficiently supports connection queries
for all XPath axes on arbitrary XML data graphs
Performance: TC 344,992,370 connections, 2-Hop
Cover 15,976,677 entries
⇒ compression factor of 21.6
⇒ lookups ok: 94.5 entries/node
⇒ build time feasible: 3 hours and 1 GB RAM
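
The table layout above can be tried out with SQLite from Python. This is a sketch under assumptions, not the HOPI implementation: the index names, the tiny label sets (a subset in the spirit of the course/professor example), and the explicit self entries (each node stored in its own Lin and Lout so that one join covers all cases) are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Elems (Id INTEGER PRIMARY KEY, ElemName TEXT);
    CREATE TABLE Lin  (Id INTEGER, InId INTEGER);
    CREATE TABLE Lout (Id INTEGER, OutId INTEGER);
    CREATE INDEX LinFwd  ON Lin  (Id, InId);
    CREATE INDEX LinBwd  ON Lin  (InId, Id);
    CREATE INDEX LoutFwd ON Lout (Id, OutId);
    CREATE INDEX LoutBwd ON Lout (OutId, Id);
""")

elems = [(17, "course"), (20, "lecturer"), (75, "homepage"), (76, "professor")]
lin = {76: [17, 20, 75]}      # illustrative label sets
lout = {17: [20, 76]}

conn.executemany("INSERT INTO Elems VALUES (?, ?)", elems)
for node, _ in elems:
    # Self entries: an assumption of this sketch, see the lead-in.
    conn.execute("INSERT INTO Lin VALUES (?, ?)", (node, node))
    conn.execute("INSERT INTO Lout VALUES (?, ?)", (node, node))
conn.executemany("INSERT INTO Lin VALUES (?, ?)",
                 [(n, x) for n, xs in lin.items() for x in xs])
conn.executemany("INSERT INTO Lout VALUES (?, ?)",
                 [(n, x) for n, xs in lout.items() for x in xs])

# //course//professor: pairs (a, d) whose Lout(a) and Lin(d) share a center
rows = conn.execute("""
    SELECT DISTINCT a.Id, d.Id
    FROM Elems a, Elems d, Lout, Lin
    WHERE a.ElemName = 'course' AND d.ElemName = 'professor'
      AND Lout.Id = a.Id AND Lin.Id = d.Id
      AND Lout.OutId = Lin.InId
""").fetchall()
print(rows)
```

The connection test becomes a join on the center column, and the two indexes per table let the database start the join from either the ancestor or the descendant side.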
31
Structurally Recursive Cover Join [ICDE 05]
  • Basic Idea
  • Compute (small) graph from partitioning
  • Compute its two-hop cover Hin, Hout
  • Combine this cover with the covers of the
    partitions

32
Example (1)
[Figure: example graph with nodes 1 to 8]
Build partition-level skeleton graph
33
Example (2)
[Figure: skeleton graph with the Hin and Hout label sets of its two-hop cover]
  • Enhanced Cover Join Algorithm
  • For each link source s, add Hout(s) to Lout(a)
    for each ancestor a of s in partition of s
  • For each link target t, add Hin(t) to Lin(d) for
    each descendant d of t in partition of t

Join can run independently for all links
34
Example (3)
[Figure: example graph with nodes 1 to 8, annotated with Hout = {2, 7} and Hin = {2}]
35
Experimental Results for Enhanced HOPI
  • on small DBLP subset
  • TC: 344,992,370 connections
  • 2-Hop Cover: 9,999,052 entries
  • ⇒ compression factor 34.5
  • ⇒ lookups ok: 59.2 entries/node
  • ⇒ fast build time: 23 minutes with 1 GB RAM
  • (with partition size 10,000 connections, edge
    weight anc × desc)

Macroscopic experiments on INEX benchmark (with
TopX search engine, using smart indexing and
probabilistic top-k): 12,000 IEEE-CS
publications with fine-grained XML markup, 40
queries such as
//article[.//bibl[.//"QBIC"] and .//p[.//"image retrieval"]]
⇒ outperforms best competitor (U Wisconsin) by factor 10
⇒ achieves very good precision and recall
36
Ongoing and Future Work
  • Even better graph partitioning
  • Efficient incremental updates
  • Full integration of HOPI
    with TopX and SphereSearch engines
  • Graph-based Scoring Models
  • Query Processing and Optimization
    over XML Indexes
  • Graph Service for Biological Networks
    (interoperate with BN at U Saarland / CBI)

37
TopX Algorithm
based on index table L (Tag, Term, MaxScore,
DocId, Score, ElemId, Pre, Post)
decompose query: content conditions (CC) & path
conditions (PC)   // conditions may be optional or mandatory
for each index list Li (extracted from L by tag and/or term) do
    block-scan next elements from same doc d
    test evaluable PCs of all elements of d
    drop elements and docs that do not satisfy mandatory PCs or CCs
    update score bookkeeping for d
    consider random accesses for d by cost-based scheduler
    drop d if (prob.) score threshold is not reached