Title: Text Search over XML Documents
1Text Search over XML Documents
- Jayavel Shanmugasundaram
- Cornell University
2The HTML World
<body> <h1> XML and Information Retrieval: A SIGIR 2000 Workshop </h1>
<p> The workshop was held on 28 July 2000. The editors of the workshop were David Carmel, Yoelle Maarek, and Aya Soffer </p>
<h2> XQL and Proximal Nodes </h2>
<p> The paper was authored by Ricardo Baeza-Yates and Gonzalo Navarro. The abstract of this paper is given below. </p>
<p> We consider the recently proposed language </p>
<p> The paper references the following papers <a href="http://www.acm.org/www8/paper/xmlql"> </a> </p>
3The XML World
<workshop date="28 July 2000">
<title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
<editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
<proceedings>
<paper id="1">
<title> XQL and Proximal Nodes </title>
<author> Ricardo Baeza-Yates </author>
<author> Gonzalo Navarro </author>
<abstract> We consider the recently proposed language </abstract>
<section name="Introduction">
Searching on structured text is becoming more important with XML
<subsection name="Related Work">
The XQL language
</subsection> </section>
<cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
</paper>
4Key Aspect of XML
- Captures text and structure
- Applications
  - Digital libraries
  - Content management
- Many such XML repositories already available
  - Library of Congress documents
  - IEEE INEX collection
  - SIGMOD, DBLP, Shakespeare's plays, ...
5Searching XML Repositories
- Confluence of Information Retrieval (text) and Database (structure) techniques
- A spectrum of possibilities
Pure Keyword Search
Keyword Search in Context
Full-Text DB Queries
6Outline
- Pure Keyword Search
- Keyword Search in Context
- Full-Text DB Queries
- Related Work and Conclusion
7Keyword Search over HTML
Ranked Results
Query Keywords
Hyperlinked HTML Documents
8Keyword Search over XML [Guo, Shao, Botev, Shanmugasundaram, SIGMOD 2003]
XRank
Ranked Results
Query Keywords
Mix of Hyperlinked XML and HTML Documents
9Outline
- Pure Keyword Search
- Design Principles
- Indexing and Query Processing
- Keyword Search in Context
- Full-Text DB Queries
- Conclusion
10XML Document
<workshop date="28 July 2000">
<title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
<editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
<proceedings>
<paper id="1">
<title> XQL and Proximal Nodes </title>
<author> Ricardo Baeza-Yates </author>
<author> Gonzalo Navarro </author>
<abstract> We consider the recently proposed language </abstract>
<section name="Introduction">
Searching on structured text is becoming more important with XML
<subsection name="Related Work">
The XQL language
</subsection> </section>
<cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
</paper>
12Design Principles
- Return most specific element containing the query
keywords
13XML Document
<workshop date="28 July 2000">
<title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
<editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
<proceedings>
<paper id="1">
<title> XQL and Proximal Nodes </title>
<author> Ricardo Baeza-Yates </author>
<author> Gonzalo Navarro </author>
<abstract> We consider the recently proposed language </abstract>
<section name="Introduction">
Searching on structured text is becoming more important with XML
<subsection name="Related Work">
The XQL language
</subsection> </section>
<cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
</paper>
<paper id="2">
14Design Principles
- Return most specific element containing the query keywords
- Ranking has to be done at the granularity of elements
16Design Principles
- Return most specific element containing the query keywords
- Ranking has to be done at the granularity of elements
- Generalize HTML keyword search
18Data Model
Containment edge
Hyperlink edge
19ElemRank
- Captures importance of an element
- Analogous to Google's PageRank
- But computed at granularity of elements
- Exploit hyperlink edges and containment edges
- Naturally generalizes Google's PageRank
- Random walk interpretation
20PageRank [Brin & Page, 1998]
Hyperlink edge
w
1-d: Probability of random jump
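The random-walk reading of PageRank on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than code from the talk: the function name `pagerank`, the value d = 0.85, the three-page graph, and the uniform handling of pages without out-links are all assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Power iteration for PageRank.

    links: dict mapping a page to the list of pages it links to.
    d:     probability of following a hyperlink (so 1 - d is the
           probability of a random jump, as on the slide).
    """
    pages = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        nxt = {p: (1.0 - d) / n for p in pages}
        for u in pages:
            outs = links.get(u, [])
            if outs:
                share = d * rank[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                # Assumption: a page with no out-links spreads its rank uniformly.
                for v in pages:
                    nxt[v] += d * rank[u] / n
        rank = nxt
    return rank


# Hypothetical three-page graph.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Page "c" ends up ranked above "b" because it is linked from both "a" and "b", while "b" only receives half of the rank of "a".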
21ElemRank
Hyperlink edge
Containment edge
w
1-d1-d2-d3: Probability of random jump
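One plausible reading of this slide is a random walk with three kinds of moves: follow a hyperlink edge (probability d1), move down a containment edge to a child (d2), or move up a containment edge to the parent (d3), with a random jump otherwise. The sketch below illustrates that reading only and should not be taken as the published ElemRank formula: the function name `elemrank`, the values of d1, d2 and d3, the toy element graph, and the decision to simply drop probability mass when an element lacks an edge of some kind are all assumptions.

```python
def elemrank(elements, hyperlinks, children,
             d1=0.35, d2=0.25, d3=0.25, iterations=100):
    """Element-level rank over hyperlink and containment edges (sketch).

    elements:   list of element names
    hyperlinks: dict element -> list of hyperlinked elements
    children:   dict element -> list of contained (child) elements
    """
    parent = {c: p for p, cs in children.items() for c in cs}
    n = len(elements)
    jump = 1.0 - d1 - d2 - d3          # probability of a random jump
    rank = {e: 1.0 / n for e in elements}
    for _ in range(iterations):
        nxt = {e: jump / n for e in elements}
        for u in elements:
            for v in hyperlinks.get(u, []):      # hyperlink edges
                nxt[v] += d1 * rank[u] / len(hyperlinks[u])
            for v in children.get(u, []):        # containment, parent to child
                nxt[v] += d2 * rank[u] / len(children[u])
            if u in parent:                      # containment, child to parent
                nxt[parent[u]] += d3 * rank[u]
        rank = nxt
    return rank


# Hypothetical element graph loosely modelled on the workshop document.
elements = ["workshop", "title", "paper", "paper_title", "author"]
children = {"workshop": ["title", "paper"], "paper": ["paper_title", "author"]}
hyperlinks = {"author": ["title"]}
ranks = elemrank(elements, hyperlinks, children)
```

In this toy graph the `paper` element outranks its `author` child because it collects rank from both of its children as well as from its parent.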
22Outline
- Pure Keyword Search
- Design Principles
- Indexing and Query Processing
- Keyword Search in Context
- Full-Text DB Queries
- Conclusion
23System Architecture
Keyword query
Ranked Results
XML/HTML Documents
Query Evaluator
Data access
XML Elements with ElemRanks
ElemRank Computation
Hybrid Dewey Inverted List
Compute top-k query results as per definition of
ranking
24Naïve Method
- Naïve inverted lists
  - Ricardo: 1, 5, 6, 8
  - XQL: 1, 5, 6, 7
Element tree (numbers are element IDs)
1 <workshop>
  2 date: 28 July
  3 <title>: XML and
  4 <editors>: David Carmel
  5 <proceedings>
    6 <paper>
      7 <title>: XQL and
      8 <author>: Ricardo
    <paper>
Problems: 1. Space Overhead 2. Spurious Results
Main issue: Decouples representation of ancestors and descendants
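The two problems named above can be reproduced with a small Python sketch. It assumes the element numbering implied by the inverted lists on this slide (1 workshop, 5 proceedings, 6 paper, 7 title, 8 author); the names `PARENT`, `DIRECT`, and `naive_list` are made up for the illustration.

```python
# Parent pointers for the element tree (element ID -> parent ID).
PARENT = {1: None, 2: 1, 3: 1, 4: 1, 5: 1, 6: 5, 7: 6, 8: 6}

# Elements that directly contain each keyword.
DIRECT = {"XQL": [7], "Ricardo": [8]}


def naive_list(keyword):
    """Naive inverted list: every element containing the keyword,
    directly or through a descendant."""
    ids = set()
    for node in DIRECT[keyword]:
        while node is not None:      # replicate the entry for every ancestor
            ids.add(node)
            node = PARENT[node]
    return sorted(ids)


xql = naive_list("XQL")              # [1, 5, 6, 7]
ricardo = naive_list("Ricardo")      # [1, 5, 6, 8]
both = sorted(set(xql) & set(ricardo))   # [1, 5, 6]
```

Each keyword occurrence is stored once per ancestor (space overhead), and intersecting the lists returns elements 1 and 5 in addition to element 6, the only most specific match (spurious results).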
25Dewey IDs [1850s]
0 <workshop>
  0.0 date: 28 July
  0.1 <title>: XML and
  0.2 <editors>: David Carmel
  0.3 <proceedings>
    0.3.0 <paper>
      0.3.0.0 <title>: XQL and
      0.3.0.1 <author>: Ricardo
    0.3.1 <paper>
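A Dewey numbering like the one on this slide can be produced by a simple recursive walk. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `dewey_ids`, the abbreviated sample document, and the choice to number attributes before child elements (so that `date` receives 0.0, as on the slide) are assumptions.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the workshop document, for illustration only.
DOC = """<workshop date="28 July 2000">
  <title>XML and Information Retrieval</title>
  <editors>David Carmel</editors>
  <proceedings>
    <paper><title>XQL and Proximal Nodes</title><author>Ricardo Baeza-Yates</author></paper>
    <paper/>
  </proceedings>
</workshop>"""


def dewey_ids(elem, prefix="0"):
    """Yield (dewey_id, name) pairs in document order.

    The ID of a node is its parent's ID plus its position among the
    parent's attributes and children, so every ancestor's ID is a prefix.
    """
    yield prefix, elem.tag
    position = 0
    for name in elem.attrib:                 # attributes are numbered first
        yield f"{prefix}.{position}", "@" + name
        position += 1
    for child in elem:                       # then child elements
        yield from dewey_ids(child, f"{prefix}.{position}")
        position += 1


ids = dict(dewey_ids(ET.fromstring(DOC)))
```

Because an element's ID embeds the IDs of all its ancestors, ancestor-descendant relationships can be tested by prefix comparison alone.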
26Dewey Inverted List (DIL)
Position List
ElemRank
Dewey Id
5.0.3.0.0
85
32
XQL
Sorted by Dewey Id
91
8.0.3.8.3
38
89
5.0.3.0.1
82
38
Ricardo
Sorted by Dewey Id
8.2.1.4.2
99
52
Store IDs of elements that directly contain keyword
- Avoids space overhead
27DIL Query Processing
- Merge query keyword inverted lists in Dewey ID order
- Entries with common prefixes are processed together
- Compute Longest Common Prefix of Dewey IDs during the merge
- Longest common prefix ensures most specific results
- Also suppresses spurious results
- Keep top-k results seen so far in output heap
- Output contents of output heap after scanning inverted lists
- Algorithm works in a single scan over inverted lists
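The merge described above can be sketched with a stack that mirrors the current Dewey path. This is a simplified illustration, not the authors' implementation: it ignores ElemRank, position lists, and the top-k output heap, and simply returns every most specific element containing all keywords. The names `parse` and `dil_query` are made up; the Dewey IDs are the ones shown on the Dewey Inverted List slide.

```python
import heapq


def parse(dewey):
    """'5.0.3.0.0' -> (5, 0, 3, 0, 0)"""
    return tuple(int(c) for c in dewey.split("."))


def dil_query(lists):
    """lists: keyword -> list of Dewey IDs (tuples), each sorted by Dewey ID.
    Returns the most specific elements that contain all keywords."""
    keywords = set(lists)
    streams = [[(dewey, kw) for dewey in ids] for kw, ids in lists.items()]
    stack = []      # stack[i] = (i-th Dewey component, keywords seen below it)
    results = []

    def pop():
        comp, seen = stack.pop()
        dewey = tuple(c for c, _ in stack) + (comp,)
        if seen >= keywords:
            results.append(dewey)        # most specific result; not passed up
        elif stack:
            stack[-1][1].update(seen)    # pass partial matches to the parent

    # Single scan: merge all inverted lists in Dewey ID order.
    for dewey, kw in heapq.merge(*streams):
        # Longest common prefix of the current stack path and the new entry.
        lcp = 0
        while lcp < len(stack) and lcp < len(dewey) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        while len(stack) > lcp:          # close elements we have left
            pop()
        for comp in dewey[lcp:]:         # open the new entry's ancestors
            stack.append((comp, set()))
        stack[-1][1].add(kw)
    while stack:
        pop()
    return results


LISTS = {
    "XQL": [parse("5.0.3.0.0"), parse("8.0.3.8.3")],
    "Ricardo": [parse("5.0.3.0.1"), parse("8.2.1.4.2")],
}
results = dil_query(LISTS)
```

For this input the results are element 5.0.3.0 (the longest common prefix of 5.0.3.0.0 and 5.0.3.0.1) and element 8, the only element of document 8 that contains both keywords. Ancestors of 5.0.3.0 are not reported because a matched element does not pass its keywords up.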
28Ranked Dewey Inverted List (RDIL)
B-tree On Dewey Id
Inverted List
XQL
Sorted by ElemRank
B-tree On Dewey Id
Inverted List
Ricardo
Sorted by ElemRank
29RDIL Query Processing
Output Heap
Temp Heap
P
P
B-tree on Dewey Id
Ricardo
Inverted List
P 9.0.4.2.0
Rank(9.0.4)
threshold ElemRank(P)Max-ElemRank
threshold ElemRank(P)ElemRank(R)
XQL
9.0.4.1.2
R
8.2.1.4.2
9.0.4.1.2
9.0.5.6
10.8.3
9.0.5.6
9.0.4.1.2
B-tree on Dewey Id
9.0.4.2.0
30Motivation for DIL/RDIL Hybrid
- Correlation of query keywords: probability that the query keywords occur in same element
  - High correlation: RDIL likely to outperform DIL by stopping early
  - Low correlation: DIL likely to outperform RDIL because RDIL has to scan most (or entire) inverted list
- Dilemma
  - Each of DIL and RDIL is likely to outperform the other on some queries
  - But require inverted lists to be sorted in different orders
- Challenges
  - Get benefits of DIL and RDIL without doubling space?
  - How can keyword correlation be determined?
31Hybrid Dewey Inverted List (HDIL)
B-tree On Dewey Id
Full Inverted List
XQL
Sorted by Dewey id
Short List
Sorted by ElemRank
- RDIL is better only when it scans little of inverted list
  - Short list sorted by ElemRank
  - Saves space!
- Can reuse full inverted list as leaf of B-tree
  - Saves space!
32DBLP High Correlation Keywords
33DBLP Low Correlation Keywords
34Outline
- Pure Keyword Search
- Keyword Search in Context
- Full-Text DB Queries
- Related Work and Conclusion
35INEX IEEE
SIGMOD Record
...
Find relevant elements in Shakespeare's plays about the process of speech
- 9 of top 10 results for one repository were not in the top 10 results of the other repository
- XIRQL's [Fuhr & Großjohann, SIGIR 2001] TF-IDF scoring
36Explaining the Results
- TF-IDF scoring for a keyword k
  - TF (Term Frequency): # occurrences of k in element
    - Usually normalized by some factor
  - IDF (Inverse Document Frequency): (# elements)/(# elements that contain k)
  - Score: sum of TF*IDF for all query keywords
- Main reason for skewed results
  - Language of engineers very different from language of Shakespeare!
  - "process" common in INEX, "speech" uncommon
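The effect can be reproduced with a small Python sketch of the scoring rule as stated on this slide (raw term frequency, IDF as a plain ratio, no normalization). The function name `tf_idf_score` and both toy collections are invented for the illustration and are not data from the talk.

```python
def tf_idf_score(element, collection, query):
    """Sum of TF * IDF over the query keywords.

    element:    list of tokens in the element being scored
    collection: list of elements (token lists) that defines the IDF
    """
    n = len(collection)
    score = 0.0
    for k in query:
        tf = element.count(k)                        # occurrences of k in element
        df = sum(1 for e in collection if k in e)    # elements containing k
        if df:
            score += tf * (n / df)
    return score


element = ["process", "process", "speech"]

# Toy "engineering" collection: "process" is common, "speech" is rare.
engineering = [["process", "design"], ["process", "control"],
               ["signal", "process"], element]
# Toy "plays" collection: "speech" is common, "process" is rare.
plays = [["speech", "king"], ["speech", "queen"],
         ["speech", "crown"], element]

query = ["process", "speech"]
score_engineering = tf_idf_score(element, engineering, query)   # 2*1 + 1*4 = 6.0
score_plays = tf_idf_score(element, plays, query)               # 2*4 + 1*1 = 9.0
```

The same element receives a different score depending on which collection supplies the IDF statistics, which is why rankings computed against one repository do not carry over to another.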
37INEX IEEE
SIGMOD Record
...
Need a way to efficiently compute IDF (or other
corpus scoring statistic) on-the-fly
38Context-Sensitive Ranking [Botev & Shanmugasundaram, WebDB 2005]
- Use Dewey inverted lists and context B-trees
- Two pass algorithm
  - First pass: collect statistics
  - Second pass: compute results (entries cached from first pass)
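The two-pass idea can be illustrated with a rough Python sketch. It is an assumption-heavy simplification rather than the published algorithm: the context is taken to be a Dewey ID prefix, the number of elements in the context is passed in as a known value, a plain list scan stands in for the context B-trees, and the names `context_sensitive_scores` and `inverted` are hypothetical.

```python
def context_sensitive_scores(inverted, context, n_context_elements, query):
    """inverted: keyword -> list of (dewey_id, tf) entries, dewey_id a tuple.
    context:  Dewey prefix (tuple) identifying the subtree to search in.
    Returns element -> TF*IDF score, with IDF computed inside the context."""

    def in_context(dewey):
        return dewey[:len(context)] == context

    # Pass 1: collect statistics, caching the entries that fall in the context.
    cached, idf = {}, {}
    for k in query:
        cached[k] = [(d, tf) for d, tf in inverted.get(k, []) if in_context(d)]
        if cached[k]:
            idf[k] = n_context_elements / len(cached[k])

    # Pass 2: compute results from the cached entries only.
    scores = {}
    for k, entries in cached.items():
        for d, tf in entries:
            scores[d] = scores.get(d, 0.0) + tf * idf[k]
    return scores


# Made-up inverted lists over two documents (Dewey prefixes 0 and 1).
inverted = {
    "speech": [((0, 0), 1), ((0, 1), 2), ((1, 0), 1)],
    "process": [((0, 1), 1), ((1, 0), 3), ((1, 1), 1)],
}
scores = context_sensitive_scores(inverted, (0,), 4, ["process", "speech"])
```

Restricting the context to document 0 makes "process" rarer than "speech" there, so element (0, 1) scores 8.0 while (0, 0) scores 2.0.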
39Outline
- Pure Keyword Search
- Keyword Search in Context
- Full-Text DB Queries
- Related Work and Conclusion
40Motivation
- Many new applications require sophisticated DB queries and complex full-text search
  - Example: Library of Congress documents in XML
- Current XML query languages are mostly database languages
  - Examples: XQuery, XPath
- Provide very rudimentary text/IR support
  - fn:contains($e, keywords)
- No support for complex IR queries
  - Distance predicates, stemming, scoring, ...
41Example Queries
- From XQuery Full-Text Use Cases Document
  - Find the titles of the books whose body contains the phrases "Usability" and "Web site" in that order, in the same paragraph, using stemming if necessary to match the tokens
  - Find the titles of the books published after 1999 whose body contains "Usability" and "testing" within a window of 3 words, and return them in score order
42XQuery Full-Text W3C Working Draft
2002: Quark Full-Text Language (Cornell)
2003: IBM, Microsoft, Oracle proposals; TeXQuery (Cornell, AT&T)
2004: XQuery Full-Text
2005: XQuery Full-Text (Second Draft)
43XQuery Primer
Find the titles of books
//book/title
Find the titles of books with price < 25
//book[./price < 25]/title
Find books written by Dawkins, in order of price
for $b in //book[./author = "Dawkins"] order by $b/price return $b
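For readers more familiar with general-purpose languages, the three queries can be mimicked in Python with the standard `xml.etree.ElementTree` module. This is only an analogy to show what each query selects, not how an XQuery engine evaluates it; the catalog and its titles and prices are made-up sample data.

```python
import xml.etree.ElementTree as ET

# Made-up sample data for illustration only.
CATALOG = ET.fromstring("""
<catalog>
  <book><title>Book A</title><author>Dawkins</author><price>30</price></book>
  <book><title>Book B</title><author>Dawkins</author><price>12</price></book>
  <book><title>Book C</title><author>Smith</author><price>20</price></book>
</catalog>
""")

books = CATALOG.findall(".//book")

# //book/title
all_titles = [b.findtext("title") for b in books]

# //book[./price < 25]/title
cheap_titles = [b.findtext("title") for b in books
                if float(b.findtext("price")) < 25]

# for $b in //book[./author = "Dawkins"] order by $b/price return $b
by_dawkins = sorted((b for b in books if b.findtext("author") == "Dawkins"),
                    key=lambda b: float(b.findtext("price")))
dawkins_titles = [b.findtext("title") for b in by_dawkins]
```

The path expression becomes a traversal, the bracketed predicate becomes a filter, and the `order by` clause becomes a sort.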
44Syntax Overview [Amer-Yahia, Botev, Shanmugasundaram, WWW 2004]
- Two new XQuery constructs
- FTContainsExpr
- Expresses Boolean full-text search predicates
- Seamlessly composes with other XQuery expressions
- FTScoreClause
- Extension to FOR expression
- Can score FTContainsExpr and other expressions
45FTContainsExpr
- ContextExpr ftcontains FTSelection
- ContextExpr (any XQuery expression) is context spec
- FTSelection is search spec
- Returns true iff at least one node in ContextExpr satisfies the FTSelection
- Examples
  - //book ftcontains "Usability" && "testing" distance 5
  - //book[./content ftcontains "Usability" with stems]/title
  - //book ftcontains /article[author="Dawkins"]/title
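The Boolean nature of a distance predicate can be illustrated with a short Python sketch. It is not the semantics defined by the XQuery Full-Text drafts: the function names, the tokenizer, and the interpretation of "distance 5" as "at most 5 tokens between the two words" are assumptions made for the example.

```python
import re


def tokens(text):
    """Lower-cased word tokens of a text (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def ft_contains_distance(text, word_a, word_b, max_distance):
    """True iff both words occur with at most max_distance tokens between them."""
    toks = tokens(text)
    pos_a = [i for i, t in enumerate(toks) if t == word_a.lower()]
    pos_b = [i for i, t in enumerate(toks) if t == word_b.lower()]
    return any(abs(i - j) - 1 <= max_distance for i in pos_a for j in pos_b)


def any_node_matches(node_texts, word_a, word_b, max_distance):
    """Mirrors 'true iff at least one node in ContextExpr satisfies the FTSelection'."""
    return any(ft_contains_distance(t, word_a, word_b, max_distance)
               for t in node_texts)


sample = "Usability studies rely on careful testing"
near = ft_contains_distance(sample, "Usability", "testing", 5)   # 4 tokens between
far = ft_contains_distance(sample, "Usability", "testing", 3)
```

Here `near` is true and `far` is false, since four tokens separate the two words in the sample sentence.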
46FTScore Clause
- FOR $v [SCORE $s]? IN Expr
- ORDER BY
- RETURN
- Example
  FOR $b SCORE $s in
    /pub/book[. ftcontains "Usability" && "testing"]
  ORDER BY $s
  RETURN $b
47FTScore Clause
- FOR $v [SCORE $s]? IN Expr
- ORDER BY
- RETURN
- Example
  FOR $b SCORE $s in
    /pub/book[. ftcontains "Usability" && "testing"
              and ./price < 10.00]
  ORDER BY $s
  RETURN $b
48Quark
- An open-source C implementation of XQuery Full-Text
  - http://www.cs.cornell.edu/database/quark
  - Compiles on Linux and Windows
- Key features
- Mix of structured and full-text predicates
- Score all of XQuery!
- Full-text search over views
49Outline
- Pure Keyword Search
- Keyword Search in Context
- Full-Text DB Queries
- Related Work and Conclusion
50Related Work
- Semi-structured ranked keyword search
  - XIRQL [Fuhr and Großjohann]
  - XXL [Theobald and Weikum]
  - Commercial search engines [Luk et al.]
  - INEX initiative
- Keyword search over databases
  - BANKS [Bhalotia et al.]
  - DBXplorer [Agrawal et al.]
  - DISCOVER [Hristidis et al.]
  - LORE [Goldman et al.]
51 10,000 Foot View of Data Management
Figure: systems placed along two axes, Data (Structured vs. Unstructured) and Queries (Ranked Search vs. Complex and Structured)
- Information Retrieval Systems
- Database Systems