Title: XML Full-Text Search: Challenges and Opportunities
1XML Full-Text Search: Challenges and Opportunities
- Sihem Amer-Yahia, AT&T Labs Research
- Jayavel Shanmugasundaram, Cornell University
2Outline
- Motivation
- Full-Text Search Languages
- Scoring
- Query Processing
- Open Issues
3Motivation
- XML can represent a mix of structured and text information.
- XML applications: digital libraries, content management.
- XML repositories: IEEE INEX collection, LexisNexis, the Library of Congress collection.
4XML in Library of Congress: http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml
<bill bill-stage="Introduced-in-House">
  <congress>109th CONGRESS</congress>
  <session>1st Session</session>
  <legis-num>H. R. 2739</legis-num>
  <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
  <action>
    <action-date date="20050526">May 26, 2005</action-date>
    <action-desc><sponsor name-id="T000266">Mr. Tierney</sponsor> (for himself, <cosponsor name-id="M001143">Ms. McCollum of Minnesota</cosponsor>, <cosponsor name-id="M000725">Mr. George Miller of California</cosponsor>) introduced the following bill; which was referred to the <committee-name committee-id="HED00">Committee on Education and the Workforce</committee-name>
    </action-desc>
  </action>
5THOMAS Library of Congress
6INEX Data
<article>
<fno>K0271</fno>
<doi>10.1041/K0271s-2004</doi>
<fm>
<hdr><hdr1><ti>IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING</ti>
<crt><issn>1041-4347</issn>/04/$20.00 &copy; 2004 IEEE Published by the IEEE Computer Society</crt></hdr1>
<hdr2><obi><volno>Vol. 16</volno>, <issno>No. 2</issno></obi>
<pdt><mo>FEBRUARY</mo><yr>2004</yr></pdt>
<pp>pp. 271-288</pp></hdr2>
</hdr>
<tig><atl>A Graph-Based Approach for Timing Analysis and Refinement of OPS5 Knowledge-Based Systems</atl><pn>pp. 271-288</pn><ref rid="K02711aff" type="aff"></ref></tig>
<au sequence="first"><fnm>Albert Mo Kim</fnm><snm> <ref aid="K0271a1" type="prb">Cheng</ref></snm><role>Senior Member</role><aff><onm>IEEE</onm></aff></au>
<au sequence="additional"><fnm>Hsiu-yen</fnm><snm>Tsai</snm></au>
<abs><p><b>Abstract</b>&mdash;This paper examines the problem of predicting the timing behavior of knowledge-based systems for real-
7Example INEX Query
<inex_topic topic_id="275" query_type="CAS">
<castitle>//article[about(.//abs, "data mining")]//sec[about(., "frequent itemsets")]</castitle>
<description>sections about frequent itemsets from articles with abstract about data mining</description>
<narrative>To be relevant, a component has to be a section about "frequent itemsets". For example, it could be about algorithms for finding frequent itemsets, or uses of frequent itemsets to generate rules. Also, the article must have an abstract about "data mining". I need this information for a paper that I am writing. It is a survey of different algorithms for finding frequent itemsets. The paper will also have a section on why we would want to find frequent itemsets.</narrative>
</inex_topic>
8Challenges in XML FT Search
- Searching over Semi-Structured Data
- Users may specify a search context and return context.
- Expressive Power and Extensibility
- Users should be able to express complex full-text searches and combine them with structural searches.
- Scores and Ranking
- Users may specify a scoring condition, possibly over both full-text and structured predicates, and obtain top-k results based on query relevance scores.
- The language should allow for an efficient implementation.
9XML FT Search Definition
- Context expression: XML elements searched
- pre-defined XML nodes.
- XPath/XQuery queries.
- Return expression: XML fragments returned
- pre-defined meaningful XML fragments.
- XPath/XQuery to build answers.
- Search expression: FT search conditions
- Boolean keyword search.
- proximity distance, scoping, thesaurus, stop words, stemming.
- Score expression
- system-defined scoring function.
- user-defined scoring function.
- query-dependent keyword weights.
10Outline
- Motivation
- Full-Text Search Languages
- Scoring
- Query Processing
- Open Issues
11Four Classes of Languages
- Keyword search (INEX Content-Only Queries)
- "book xml"
- Tag Keyword search
- book: xml
- Path Expression Keyword search
- /book[./title about "xml db"]
- XQuery Complex full-text search
- for $b in /book let score $s := $b ftcontains "xml" && "db" distance 5
12Outline
- Motivation
- Full-Text Search Languages
- Simple Keyword Search
- Tags Keyword Search
- Path Expressions Keyword Search
- XQuery Complex Full-Text Search
- Scoring
- Query Processing
- Open Issues
13XRank (Guo et al., SIGMOD 2003)
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language ... </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML ...
        <subsection name="Related Work">
          The XQL language ...
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> ... </cite>
    </paper>
15XIRQL (Fuhr & Großjohann, SIGIR 2001)
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language ... </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML ...
        <em> The XQL language ... </em>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> ... </cite>
    </paper>
[Figure label: Index Node]
16Similar Notion of Results
- Nearest Concept Queries
- Schmidt et al., ICDE 2002
- XKSearch
- Xu & Papakonstantinou, SIGMOD 2005
17Outline
- Motivation
- Full-Text Search Languages
- Simple Keyword Search
- Tags Keyword Search
- Path Expressions Keyword Search
- XQuery Complex Full-Text Search
- Scoring
- Query Processing
- Open Issues
18XSearch (Cohen et al., VLDB 2003)
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language ... </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML ...
    </paper>
    <paper id="2">
      <title> XML Indexing </title>
[Figure annotation on <paper id="2">: Not a meaningful result]
19Outline
- Motivation
- Full-Text Search Languages
- Simple Keyword Search
- Tags Keyword Search
- Path Expressions Keyword Search
- XQuery Complex Full-Text Search
- Scoring
- Query Processing
- Open Issues
20XPath (W3C 2005)
- fn:contains(e, string) returns true iff the string value of e contains string as a substring
- //section[fn:contains(./title, "XML Indexing")]
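The substring semantics of fn:contains can be imitated in a few lines of Python. This is only a rough sketch using the standard library's ElementTree, not an XPath engine: the helper name `sections_with_title_containing` and the sample document are made up for illustration, and matching any `title` child is a simplification of the XPath predicate.

```python
import xml.etree.ElementTree as ET


def sections_with_title_containing(root, needle):
    """Return <section> elements having a <title> child whose text contains needle.

    Approximates //section[fn:contains(./title, needle)]: plain,
    case-sensitive substring matching on the title's string value.
    """
    hits = []
    for section in root.iter("section"):
        titles = section.findall("title")
        if any(needle in "".join(t.itertext()) for t in titles):
            hits.append(section)
    return hits


# Hypothetical sample document, used only for this illustration.
SAMPLE = ET.fromstring(
    "<doc>"
    "<section><title>XML Indexing basics</title></section>"
    "<section><title>Scoring</title></section>"
    "</doc>"
)
matching = sections_with_title_containing(SAMPLE, "XML Indexing")
```

The match is purely Boolean: there is no stemming, ranking, or proximity, which is the gap the full-text languages on the later slides address.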
21XIRQL (Fuhr & Großjohann, SIGIR 2001)
- Weighted extension to XQL (precursor to XPath)
- //section[0.6 · .//* $cw$ "XQL" + 0.4 · .//section $cw$ "syntax"]
22XXL (Theobald & Weikum, EDBT 2002)
- Introduces similarity operator ~
Select Z
From http://www.myzoos.edu/zoos.html
Where zoos.#.zoo As Z and
Z.animals.(animal)?.specimen as A and
A.species ~ "lion" and
A.birthplace.#.country as B and
A.region ~ B.content
23NEXI (Trotman & Sigurbjörnsson, INEX 2004)
- Narrowed Extended XPath I
- INEX Content-and-Structure (CAS) Queries
- //article[about(.//title, apple) and about(.//sec, computer)]
24Outline
- Motivation
- Full-Text Search Languages
- Simple Keyword Search
- Tags Keyword Search
- Path Expressions Keyword Search
- XQuery Complex Full-Text Search
- Scoring
- Query Processing
- Open Issues
25Schema-Free XQuery (Li, Yu, Jagadish, VLDB 2004)
- Meaningful least common ancestor (mlcas)
for $a in doc("bib.xml")//author,
    $b in doc("bib.xml")//title,
    $c in doc("bib.xml")//year
where $a/text() = "Mary" and exists mlcas($a, $b, $c)
return <result> {$b, $c} </result>
26XQuery Full-Text (W3C 2005)
- Two new XQuery constructs
- FTContainsExpr
- Expresses Boolean full-text search predicates
- Seamlessly composes with other XQuery expressions
- FTScoreClause
- Extension to FLWOR expression
- Can score FTContainsExpr and other expressions
27FTContainsExpr
- //book ftcontains "Usability" && "testing" distance 5
- //book[./content ftcontains "Usability" with stems]/title
- //book ftcontains /article[author="Dawkins"]/title
28FTScore Clause
In any order
- FOR $v [SCORE $s]? IN [FUZZY] Expr
- LET
- WHERE
- ORDER BY
- RETURN
- Example
- FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing"]
- ORDER BY $s
- RETURN <result score="{$s}"> {$b} </result>
29FTScore Clause
In any order
- FOR $v [SCORE $s]? IN [FUZZY] Expr
- LET
- WHERE
- ORDER BY
- RETURN
- Example
- FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing" and ./price < 10.00]
- ORDER BY $s
- RETURN $b
30FTScore Clause
In any order
- FOR $v [SCORE $s]? IN [FUZZY] Expr
- LET
- WHERE
- ORDER BY
- RETURN
- Example
- FOR $b SCORE $s in FUZZY /pub/book[. ftcontains "Usability" && "testing"]
- ORDER BY $s
- RETURN $b
31XQuery Full-Text Evolution
- 2002: Quark Full-Text Language (Cornell)
- 2003: IBM, Microsoft, Oracle proposals; TeXQuery (Cornell, AT&T Labs)
- 2004: XQuery Full-Text
- 2005: XQuery Full-Text (Second Draft)
32Outline
- Motivation
- Full-Text Search Languages
- Scoring
- Query Processing
- Open Issues
33Full-Text Scoring
- Score value should reflect relevance of answer to user query. Higher scores imply a higher degree of relevance.
- Queries return document fragments. Granularity of returned results affects scoring.
- For queries containing conditions on structure, structural conditions may affect scoring.
- Existing proposals extend common scoring methods: probabilistic or vector-based similarity.
34Granularity of Results
- Keyword queries
- compute possibly different scores for LCAs.
- Tag Keyword queries
- compute scores based on tags and keywords.
- Path Expression Keyword queries
- compute scores based on paths and keywords.
- XQuery Complex full-text queries
- compute scores for (newly constructed) XML
fragments satisfying XQuery (structural,
full-text and scalar conditions).
35Outline
- Motivation
- Full-Text Search Languages
- Scoring
- Simple Keyword Search
- Tags Keyword Search
- Path Expressions Keyword Search
- XQuery Complex Full-Text Search
- Query Processing
- Open Issues
36Granularity of Results
- Document as hierarchical structure of elements as opposed to flat document.
- XXL (Theobald & Weikum, EDBT 2002)
- XIRQL (Fuhr & Großjohann, SIGIR 2001)
- XRANK (Guo et al., SIGMOD 2003)
- Propagate keyword weights along document structure.
37XML Data Model
Containment edge
Hyperlink edge
38XXL (Theobald & Weikum, EDBT 2002)
- Compute similar terms with relevance score r1 using an ontology.
- Compute tf.idf of each term for a given element content with relevance score r2.
- Relevance of an element content for a term is r1 · r2.
- r1 and r2 are computed as a weighted distance in an ontology graph.
- Probabilities of conjunctions multiplied (independence assumption) along elements of same path to compute path score.
39Probabilistic Scoring: XIRQL (Fuhr & Großjohann, SIGIR 2001)
- Extension of XPath.
- Weighting and ranking
- weighting of query terms
- P(wsum((0.6,a), (0.4,b))) = 0.6 · P(a) + 0.4 · P(b)
- probabilistic interpretation of Boolean connectors
- P(a ∧ b) = P(a) · P(b)
40XIRQL Example
- Query
- Search for an artist named Ulbrich, living in Frankfurt, Germany about 100 years ago
- Data
- Ernst Olbrich, Darmstadt, 1899
- Weights and ranking
- P(Olbrich ≈p Ulbrich) = 0.8 (phonetic similarity)
- P(1899 ≈n 1903) = 0.9 (numeric similarity)
- P(Darmstadt ≈g Frankfurt) = 0.7 (geographic distance)
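The two XIRQL combination rules (weighted sum of query terms, and conjunction as a product under independence) can be illustrated with a short Python sketch. The function names `wsum` and `p_and` are chosen here for illustration and are not XIRQL syntax; combining the three example weights by conjunction is an assumption about how the query would be scored.

```python
from math import prod


def wsum(weighted_probs):
    """Weighted sum of term probabilities: sum of w_i * P(term_i)."""
    return sum(weight * prob for weight, prob in weighted_probs)


def p_and(*probs):
    """Probability of a conjunction, assuming the conditions are independent."""
    return prod(probs)


# Vague-predicate weights from the Olbrich example:
# phonetic 0.8, numeric 0.9, geographic 0.7.
olbrich_score = p_and(0.8, 0.9, 0.7)

# A weighted query over two terms with assumed probabilities P(a)=0.5, P(b)=1.0.
weighted_score = wsum([(0.6, 0.5), (0.4, 1.0)])
```

Under these assumptions the record "Ernst Olbrich, Darmstadt, 1899" would receive a score of about 0.504 for the query, so it is still retrieved, only ranked below any exact match.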
41PageRank (Brin & Page 1998)
[Figure: graph around node w; legend: Hyperlink edge]
- 1-d: Probability of random jump
42ElemRank (Guo et al., SIGMOD 2003)
[Figure: graph around node w; legend: Hyperlink edge, Containment edge]
- 1-d1-d2-d3: Probability of random jump
43Outline
- Motivation
- Full-Text Search Languages
- Scoring
- Simple Keyword Search
- Tags Keyword Search
- Path Expressions Keyword Search
- XQuery Complex Full-Text Search
- Query Processing
- Open Issues
44XSearch (Cohen et al., VLDB 2003)
- tf.ilf to compute weight of keyword for a leaf element.
- A vector is associated with each non-leaf element.
- sim(Q,N) = sum of the cosine distances between the vectors associated with nodes in N and vectors associated with terms matched in Q.
45Outline
- Motivation
- Full-Text Search Languages
- Scoring
- Simple Keyword Search
- Tags Keyword Search
- Path Expressions Keyword Search
- XQuery Complex Full-Text Search
- Query Processing
- Open Issues
46Vector-based Scoring: JuruXML (Mass et al., INEX 2002)
- Transform query into (term,path) conditions
- article/bm/bib/bibl/bb[about(., hypercube mesh torus nonnumerical database)]
- (term,path)-pairs
- hypercube, article/bm/bib/bibl/bb
- mesh, article/bm/bib/bibl/bb
- torus, article/bm/bib/bibl/bb
- nonnumerical, article/bm/bib/bibl/bb
- database, article/bm/bib/bibl/bb
- Modified cosine similarity as retrieval function for vague matching of path conditions.
47JuruXML Vague Path Matching
- Modified vector-based cosine similarity
- Example of length normalization: cr(article/bibl, article/bm/bib/bibl/bb) = 3/6 = 0.5
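The 3/6 in the length-normalization example is consistent with a resemblance measure of the form (1 + |query path|) / (1 + |document path|), applied when the query path's tags occur in order within the document path. The slide does not state the formula, so the sketch below is an assumption that merely reproduces the example; the function names are illustrative.

```python
def is_subsequence(short, long):
    """True if every tag of `short` appears in `long`, in the same order."""
    remaining = iter(long)
    return all(tag in remaining for tag in short)


def context_resemblance(query_path, doc_path):
    """Assumed length-normalized resemblance between two tag paths.

    Returns (1 + |q|) / (1 + |d|) when the query path is a subsequence of
    the document path, and 0.0 otherwise.
    """
    q = query_path.strip("/").split("/")
    d = doc_path.strip("/").split("/")
    if not is_subsequence(q, d):
        return 0.0
    return (1 + len(q)) / (1 + len(d))


cr_example = context_resemblance("article/bibl", "article/bm/bib/bibl/bb")
```

With this definition an exact path match scores 1.0, shorter query paths score lower against long document paths, and a query path whose tags do not all appear in order scores 0.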
48Query Relaxation on Structure
- Schlieder, EDBT 2002
- Delobel & Rousset, 2002
- Amer-Yahia et al, VLDB 2005
49XML Query Relaxation (Amer-Yahia et al., EDBT 2002); FlexPath (Amer-Yahia et al., SIGMOD 2004)
- Tree pattern relaxations
- Leaf node deletion
- Edge generalization
- Subtree promotion
[Figure: a query tree pattern over book, info, author (Dickens), and edition (paperback); its relaxed forms, including one with an optional edition; and example data trees with author values "Dickens", "Charles Dickens", and "C. Dickens".]
50Adaptation of tf.idf to XML: Whirlpool (Marian et al., ICDE 2005)
51A Family of XML Scoring Methods (Amer-Yahia et al., VLDB 2005)
- Twig scoring
- High quality
- Expensive computation
- Path scoring
- Binary scoring
- Low quality
- Fast computation
[Figure: query tree pattern with nodes book, info, author (Dickens), and edition (paperback).]
52Outline
- Motivation
- Full-Text Search Languages
- Scoring
- Simple Keyword Search
- Tags Keyword Search
- Path Expressions Keyword Search
- XQuery Complex Full-Text Search
- Query Processing
- Open Issues
53XIRQL Relaxation
- XIRQL proposes vague predicates, but it is not clear how to combine them with all of XQuery.
- How to relax all of XQuery, including structured and scalar predicates, remains an open issue.
54Outline
- Motivation
- Full-Text Search Languages
- Scoring
- Query Processing
- Open Issues
55Outline
- Motivation
- Full-Text Search Languages
- Scoring
- Query Processing
- Simple Keyword Search
- Tags Keyword Search
- Path Expressions Keyword Search
- XQuery Complex Full-Text Search
- Open Issues
56Main Issue
- Given: Query keywords
- Compute: Least Common Ancestors (LCAs) that contain query keywords, in ranked order
57Naïve Method
- Naïve inverted lists
- Ricardo: 1, 5, 6, 8
- XQL: 1, 5, 6, 7
[Figure: document tree with element IDs: 1 <workshop>; 2 date ("28 July ..."); 3 <title> ("XML and ..."); 4 <editors> ("David Carmel ..."); 5 <proceedings>; 6 <paper>; 7 <title> ("XQL and ..."); 8 <author> ("Ricardo ...").]
- Problems: 1. Space Overhead 2. Spurious Results
- Main issue: Decouples representation of ancestors and descendants
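The two problems with naïve lists can be seen by intersecting the lists for Ricardo and XQL. The following Python sketch uses the element IDs from the lists above; the parent links in `PARENT` are an assumed reading of the figure (workshop 1 at the root, proceedings 5 under it, paper 6 under proceedings, title 7 and author 8 under the paper), and the function names are illustrative.

```python
# Assumed parent links for the eight-element example tree.
PARENT = {1: None, 2: 1, 3: 1, 4: 1, 5: 1, 6: 5, 7: 6, 8: 6}

# Naive inverted lists: every ancestor of a matching element is listed too.
NAIVE_INDEX = {"Ricardo": [1, 5, 6, 8], "XQL": [1, 5, 6, 7]}


def naive_matches(index, keywords):
    """Intersect the naive lists: all elements containing every keyword."""
    ids = set(index[keywords[0]])
    for keyword in keywords[1:]:
        ids &= set(index[keyword])
    return sorted(ids)


def ancestors(node, parent):
    """All proper ancestors of a node."""
    found = set()
    node = parent[node]
    while node is not None:
        found.add(node)
        node = parent[node]
    return found


def most_specific(ids, parent):
    """Drop every match that is an ancestor of another match."""
    covered = set()
    for node in ids:
        covered |= ancestors(node, parent)
    return sorted(set(ids) - covered)


matches = naive_matches(NAIVE_INDEX, ["Ricardo", "XQL"])
specific = most_specific(matches, PARENT)
```

The intersection returns elements 1, 5, and 6, but only the paper (6) is a useful answer; 1 and 5 are spurious ancestors. Filtering them needs the separate parent table because the flat IDs carry no ancestry information, which is the decoupling problem named above.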
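58Dewey Encoding of IDs (1850s)
[Figure: the workshop document tree labeled with Dewey IDs]
- 0: <workshop>
- 0.0: date ("28 July ...")
- 0.1: <title> ("XML and ...")
- 0.2: <editors> ("David Carmel ...")
- 0.3: <proceedings>
- 0.3.0: <paper>
- 0.3.0.0: <title> ("XQL and ...")
- 0.3.0.1: <author> ("Ricardo ...")
- 0.3.1: <paper>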
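Dewey IDs can be assigned in one traversal: each node's ID is its parent's ID followed by its position among its siblings. The sketch below is a minimal Python illustration with the standard library's ElementTree; the helper name `dewey_ids` is made up, and `date` is modeled as a child element (rather than an attribute) only to mirror the numbering in the figure.

```python
import xml.etree.ElementTree as ET


def dewey_ids(root):
    """Map each Dewey ID (as a dotted string) to the tag of its element."""
    ids = {}

    def visit(elem, path):
        ids[".".join(str(step) for step in path)] = elem.tag
        for position, child in enumerate(elem):
            visit(child, path + [position])

    visit(root, [0])  # the root gets ID 0
    return ids


# Skeleton of the workshop document; text content is omitted.
WORKSHOP = ET.fromstring(
    "<workshop><date/><title/><editors/>"
    "<proceedings>"
    "<paper><title/><author/></paper>"
    "<paper/>"
    "</proceedings></workshop>"
)
ids = dewey_ids(WORKSHOP)
```

Because an element's ID is a prefix of the IDs of all its descendants, ancestry can be tested by prefix comparison alone, without a separate parent table.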
59XRank: Dewey Inverted List (DIL)
[Figure: inverted lists for the keywords XQL and Ricardo. Each entry holds a Dewey Id, a Score, and a Position List, and each list is sorted by Dewey Id. Dewey Ids shown: 5.0.3.0.0 and 8.0.3.8.3 for XQL; 5.0.3.0.1 and 8.2.1.4.2 for Ricardo.]
- Store IDs of elements that directly contain keyword
- Avoids space overhead
60DIL Query Processing
- Merge query keyword inverted lists in Dewey ID order
- Entries with common prefixes are processed together
- Compute Longest Common Prefix of Dewey IDs during the merge
- Longest common prefix ensures most specific results
- Also suppresses spurious results
- Keep top-m results seen so far in output heap
- Calculate rank using two-dimensional proximity metric
- Output contents of output heap after scanning inverted lists
- Algorithm works in a single scan over inverted lists
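The role of the longest common prefix can be sketched in Python for two keywords. This is a deliberately simplified illustration, not the single-scan DIL algorithm: it compares all pairs of entries instead of merging the sorted lists, ignores scores and the output heap, and treats any candidate that is an ancestor of another candidate as less specific. The Dewey IDs reuse the workshop example, with one extra hypothetical XQL occurrence in the second paper.

```python
def longest_common_prefix(a, b):
    """Longest common prefix of two Dewey IDs given as tuples of ints."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def most_specific_results(list_a, list_b):
    """Most specific elements containing one entry from each list.

    Simplified: candidates are the LCPs of all pairs; a candidate is dropped
    when it is a proper prefix (ancestor) of another candidate.
    """
    candidates = {longest_common_prefix(a, b) for a in list_a for b in list_b}
    candidates.discard(())  # entries with nothing in common
    return sorted(
        c for c in candidates
        if not any(o != c and o[:len(c)] == c for o in candidates)
    )


# XQL occurs in title 0.3.0.0 and (hypothetically) in 0.3.1.0;
# Ricardo occurs in author 0.3.0.1.
xql_list = [(0, 3, 0, 0), (0, 3, 1, 0)]
ricardo_list = [(0, 3, 0, 1)]
results = most_specific_results(xql_list, ricardo_list)
```

Here the first paper (0.3.0) is returned, while the proceedings element (0.3), which also contains both keywords, is suppressed as a less specific ancestor.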
61XRank: Ranked Dewey Inverted List (RDIL)
[Figure: the inverted list for XQL, sorted by score, with a B-tree on Dewey Id built over it; the same structure is kept for the other keywords.]
62RDIL Algorithm
- An element may be ranked highly in one list and low in another list
- B-tree helps search for low ranked element
- When to stop scanning inverted lists?
- Based on Threshold Algorithm (Fagin et al., 2002), which periodically calculates a threshold
- Can stop if we have sufficient results above the threshold
- Extension to most specific results
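The stopping rule can be illustrated with a small Python sketch of a threshold-style top-k scan. It is a simplified stand-in rather than the RDIL algorithm itself: scores are combined by a plain sum, an element missing from a list contributes 0, a dictionary plays the role of the B-tree for random access, and the extension to most specific results is left out. All names and sample scores are invented for illustration.

```python
import heapq


def threshold_topk(lists, k):
    """Top-k elements by summed score over lists sorted by descending score.

    lists: dict mapping keyword -> list of (element_id, score).
    Scanning stops once k results score at least the current threshold.
    """
    # Random-access lookup per keyword (the role the B-tree plays in RDIL).
    lookup = {kw: dict(entries) for kw, entries in lists.items()}
    seen = {}
    max_depth = max(len(entries) for entries in lists.values())
    for depth in range(max_depth):
        threshold = 0.0
        for kw, entries in lists.items():
            if depth < len(entries):
                elem, score = entries[depth]
                threshold += score  # best score an unseen element could reach
                if elem not in seen:
                    seen[elem] = sum(lookup[other].get(elem, 0.0) for other in lists)
        top = heapq.nlargest(k, seen.items(), key=lambda item: item[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top  # no unseen element can beat the current top-k
    return heapq.nlargest(k, seen.items(), key=lambda item: item[1])


# Hypothetical score-sorted lists for two keywords.
SCORED = {
    "XQL": [("a", 0.9), ("b", 0.8), ("c", 0.1)],
    "Ricardo": [("b", 0.7), ("c", 0.6), ("a", 0.2)],
}
best = threshold_topk(SCORED, 1)
```

In this example element b ranks second in one list and first in the other; its total is found by random access, and the scan stops after two rounds because the threshold has dropped below b's score. The early stop is only safe because the sum is monotone in the per-list scores.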
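63RDIL Query Processing
[Figure: RDIL query processing over the inverted lists for XQL and Ricardo, each with a B-tree on Dewey Id, plus a temp heap and an output heap. The current entry P = 9.0.4.2.0 leads to the candidate result Rank(9.0.4); a second entry is labeled R. Other Dewey Ids shown: 9.0.4.1.2, 8.2.1.4.2, 9.0.5.6, 10.8.3.]
- threshold = Score(P) + Score(R)
- threshold = Score(P) + Max-Score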
64ID Order vs. Rank Order
- Approaches that combine benefits
- Long ID inverted list, short score inverted list
- HDIL (Guo et al., SIGMOD 2003)
- Chunk inverted list based on score, organize by ID within chunk
- FlexPath (Amer-Yahia et al., SIGMOD 2004)
- SVR (Guo et al., ICDE 2005)
65Outline
- Motivation
- Full-Text Search Languages
- Scoring
- Query Processing
- Simple Keyword Search
- Tags Keyword Search
- Path Expressions Keyword Search
- XQuery Complex Full-Text Search
- Open Issues
66XSearch Technique
- Given: An interconnection relationship R between nodes (semantic relationship)
- R is reflexive and symmetric
- Node interconnection index
- Given two nodes n and n' in a document d, find if (n,n') are in R
- Use dynamic programming to compute closure
- Online vs. offline
67Outline
- Motivation
- Full-Text Search Languages
- Scoring
- Query Processing
- Simple Keyword Search
- Tags Keyword Search
- Path Expressions Keyword Search
- XQuery Complex Full-Text Search
- Open Issues
68XXL Indexing
- Element Path Index (EPI)
- Evaluates simple path expressions
- Element Content Index (ECI)
- Traditional inverted list (but replicates nested elements)
- Ontology Index (OI)
- Lookup similar concepts (for evaluating ~)
- Returned in ranked order
69Myaeng et al., SIGIR 1994
[Figure: inverted-list entry for XQL with Document ID 5 and Element ID 85, followed by (Element Tag, Probability) pairs: act 0.3, play 0.2, plays 0.1.]
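70Integrating Structure and IL (Kaushik et al., SIGMOD 2004)
[Figure: a structure index with numbered nodes labeled book, edition, info, author, and title, linked through a B-tree to the inverted list for XQL. Inverted-list entry fields: Document ID, Start ID, End ID, Depth, Index ID, Score; sample values shown: 5, 85, 99, 3, 5, 0.9.]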
71Outline
- Motivation
- Full-Text Search Languages
- Scoring
- Query Processing
- Simple Keyword Search
- Tags Keyword Search
- Path Expressions Keyword Search
- XQuery Complex Full-Text Search
- Open Issues
72Scoring Functions Critical for Top-k Query
Processing
- Top-k answer quality depends on scoring function.
- Efficient top-k query processing requires the scoring function to be:
- Monotone.
- Fast to compute.
73Structural Join Relaxation
//book[./info[./author ftcontains "Dickens"][./edition ftcontains "paperback"]]
[Figure: join plan for the query, read bottom-up: book and info are joined on pc(book,info) or ad(book,info); author is joined on pc(info,author); edition is joined on pc(info,edition) or ad(book,edition); the selections contains(author,Dickens) and contains(edition,paperback) are applied on top.]
74Quark/GalaTex Architecture
[Figure: architecture diagram with these components: input .xml documents (<xml> <doc> Text Text Text Text </doc> </xml>); Preprocessing and Inverted Lists Generation; Full-Text Primitives (FTWord, FTWindow, FTTimes, etc.) exposed as an API on positions; the XQFT Parser, which turns a Full-Text Query into an Equivalent XQuery Query; and the Quark/Galax XQuery Engine, which performs the evaluation.]
75Outline
- Motivation
- Full-Text Search Languages
- Scoring
- Query Processing
- Open Issues
76System Architecture
Integration Layer
XQuery Engine
IR Engine
77System Architecture
XQuery IR Engine
Quark/GalaTex use this architecture
78Structural Relaxation
- FOR $b SCORE $s in FUZZY /pub/book[. ftcontains "Usability" with stems]
- ORDER BY $s
- RETURN $b
79Search Over Views
- Data Source 1: <books> <book> </book> <book> </book> </books>
- Data Source 2: <reviews> <review> </review> <review> </review> </reviews>
- Integrated View: <book> <reviews> </reviews> </book>
80Other Open Issues
- Experimental evaluation of scoring functions and ranking algorithms for XML (INEX).
- Search over a mix of HTML and XML.
- Joint scoring on full-text and scalar predicates.
- Score-aware algebra for XML for the joint optimization of queries on both structure and text.