Text Search over XML Documents - PowerPoint PPT Presentation

Slides: 52

Provided by: jayavelsha

Learn more at: https://www.cs.cornell.edu

1
Text Search over XML Documents

Jayavel Shanmugasundaram
Cornell University

2
The HTML World
<body>
<h1> XML and Information Retrieval: A SIGIR 2000 Workshop </h1>
<p> The workshop was held on 28 July 2000. The editors of the workshop were David Carmel, Yoelle Maarek, and Aya Soffer </p>
<h2> XQL and Proximal Nodes </h2>
<p> The paper was authored by Ricardo Baeza-Yates and Gonzalo Navarro. The abstract of this paper is given below. </p>
<p> We consider the recently proposed language </p>
<p> The paper references the following papers <a href="http://www.acm.org/www8/paper/xmlql"> </a> </p>
3
The XML World
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
4
Key Aspect of XML

Captures text and structure
Applications
Digital libraries
Content management
Many such XML repositories already available
Library of Congress documents
IEEE INEX collection
SIGMOD, DBLP, Shakespeare's plays, ...

5
Searching XML Repositories

Confluence of Information Retrieval (text) and
Database (structure) techniques
A spectrum of possibilities

Pure Keyword Search
Full-Text DB Queries
Keyword Search in Context
6
Outline

Pure Keyword Search
Keyword Search in Context
Full-Text DB Queries
Related Work and Conclusion

7
Keyword Search over HTML
Ranked Results
Query Keywords
Hyperlinked HTML Documents
8
Keyword Search over XML [Guo, Shao, Botev, Shanmugasundaram, SIGMOD 2003]
XRank
Ranked Results
Query Keywords
Mix of Hyperlinked XML and HTML Documents
9
Outline

Pure Keyword Search
Design Principles
Indexing and Query Processing
Keyword Search in Context
Full-Text DB Queries
Conclusion

10
XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
12
Design Principles

Return most specific element containing the query
keywords

13
XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
    <paper id="2">
14
Design Principles

Return most specific element containing the query
keywords
Ranking has to be done at the granularity of
elements

15
XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
16
Design Principles

Return most specific element containing the query
keywords
Ranking has to be done at the granularity of
elements
Generalize HTML keyword search

18
Data Model
Containment edge
Hyperlink edge
19
ElemRank

Captures importance of an element
Analogous to Google's PageRank
But computed at granularity of elements
Exploit hyperlink edges and containment edges
Naturally generalizes Google's PageRank
Random walk interpretation

20
PageRank [Brin & Page 1998]
Hyperlink edge
w
1-d: Probability of random jump
21
ElemRank
Hyperlink edge
Containment edge
w
1-d1-d2-d3: Probability of random jump
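
The random-walk reading of ElemRank can be sketched as a power iteration. This is a minimal sketch, not the exact ElemRank formula (which the slides do not give). It assumes the walker follows a hyperlink with probability d1, descends a containment edge with probability d2, climbs to the parent with probability d3, and otherwise jumps to a random element. The function name `elem_rank`, the default values of d1, d2 and d3, and the way rank is split among outgoing edges are all illustrative assumptions.

```python
def elem_rank(n, hyperlinks, children, d1=0.35, d2=0.25, d3=0.25, iters=50):
    """Power iteration over n elements numbered 0..n-1 (illustrative sketch).

    hyperlinks: {u: [v, ...]}  hyperlink edges u -> v
    children:   {u: [v, ...]}  containment edges parent u -> child v
    """
    # Reverse containment edges: each child points back to its parent.
    parent = {v: u for u, kids in children.items() for v in kids}
    jump = 1.0 - d1 - d2 - d3          # probability of a random jump
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [jump / n] * n
        for u in range(n):
            for v in hyperlinks.get(u, []):
                new[v] += d1 * rank[u] / len(hyperlinks[u])
            for v in children.get(u, []):
                new[v] += d2 * rank[u] / len(children[u])
            if u in parent:
                new[parent[u]] += d3 * rank[u]
        rank = new
    return rank


# Element 0 contains elements 1 and 2; element 1 hyperlinks to element 2.
ranks = elem_rank(3, hyperlinks={1: [2]}, children={0: [1, 2]})
```

In this toy graph element 2 receives rank both through containment and through the hyperlink, so it ends up ranked above its sibling, element 1.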
22
Outline

Pure Keyword Search
Design Principles
Indexing and Query Processing
Keyword Search in Context
Full-Text DB Queries
Conclusion

23
System Architecture
Keyword query
Ranked Results
XML/HTML Documents
Query Evaluator
Data access
XML Elements with ElemRanks
ElemRank Computation
Hybrid Dewey Inverted List
Compute top-k query results as per definition of
ranking
24
Naïve Method

Naïve inverted lists
Ricardo: 1, 5, 6, 8
XQL: 1, 5, 6, 7

Element tree (numbers are element IDs)
1 <workshop>
  2 date: 28 July
  3 <title>: XML and
  4 <editors>: David Carmel
  5 <proceedings>
    6 <paper>
      7 <title>: XQL and
      8 <author>: Ricardo
    <paper>

Problems: 1. Space Overhead 2. Spurious Results
Main issue: Decouples representation of ancestors and descendants
25
Dewey IDs [1850s]
0 <workshop>
  0.0 date: 28 July
  0.1 <title>: XML and
  0.2 <editors>: David Carmel
  0.3 <proceedings>
    0.3.0 <paper>
      0.3.0.0 <title>: XQL and
      0.3.0.1 <author>: Ricardo
    0.3.1 <paper>
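
The numbering scheme above can be sketched in a few lines. This is a minimal sketch under one assumption taken from the slide's picture: attributes are numbered before child elements, so that `date` becomes 0.0. The function name `assign_dewey_ids` and the trimmed sample document are illustrative, not part of the original system.

```python
import xml.etree.ElementTree as ET


def assign_dewey_ids(root, root_id="0"):
    """Return [(dewey_id, name)] in document order.

    The i-th child of the node with ID X gets ID X.i, so the ID of every
    ancestor is a prefix of the IDs of its descendants.
    """
    out = []

    def visit(elem, dewey):
        out.append((dewey, elem.tag))
        i = 0
        for name in elem.attrib:          # attributes first (assumption)
            out.append((f"{dewey}.{i}", "@" + name))
            i += 1
        for child in elem:                # then child elements
            visit(child, f"{dewey}.{i}")
            i += 1

    visit(root, root_id)
    return out


doc = ET.fromstring(
    '<workshop date="28 July 2000"><title/><editors/>'
    '<proceedings><paper><title/><author/></paper><paper/></proceedings>'
    '</workshop>')
ids = assign_dewey_ids(doc)
```

Because an ancestor's ID is a prefix of each descendant's ID, a single ID encodes the whole ancestor path, which is what the Dewey inverted list relies on.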
26
Dewey Inverted List (DIL)
Keyword   Dewey Id    ElemRank  Position List
XQL       5.0.3.0.0   85        32
          8.0.3.8.3   38        89, 91
Ricardo   5.0.3.0.1   82        38
          8.2.1.4.2   99        52

Each keyword's list is sorted by Dewey Id
Store IDs of elements that directly contain keyword - avoids space overhead
27
DIL Query Processing

Merge query keyword inverted lists in Dewey ID order
Entries with common prefixes are processed together
Compute Longest Common Prefix of Dewey IDs during the merge
Longest common prefix ensures most specific results
Also suppresses spurious results
Keep top-k results seen so far in output heap
Output contents of output heap after scanning inverted lists
Algorithm works in a single scan over inverted lists
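
The merge described above can be sketched as a single pass with a stack that mirrors the current Dewey path. This is a minimal sketch, not the authors' implementation. Position lists are ignored, and the score (sum of the best ElemRank per keyword) is a placeholder assumption rather than the ranking function used by XRank. The name `dil_search` is hypothetical.

```python
import heapq


def dil_search(lists, k=10):
    """lists: {keyword: [(dewey_tuple, elem_rank), ...]}, each sorted by Dewey ID.

    Returns up to k (score, dewey_tuple) pairs for the most specific
    elements that contain every keyword.
    """
    keywords = set(lists)
    # Tag each entry with its keyword so the lists can be merged in Dewey order.
    streams = [[(dewey, kw, rank) for dewey, rank in entries]
               for kw, entries in lists.items()]
    path = []      # Dewey components of the element on top of the stack
    stack = []     # stack[i]: {keyword: best ElemRank} seen under path[:i+1]
    results = []

    def pop():
        seen = stack.pop()
        dewey = tuple(path)
        path.pop()
        if keywords.issubset(seen):
            # Most specific element with all keywords: report it and do NOT
            # pass its keywords upward, which suppresses spurious ancestors.
            results.append((sum(seen.values()), dewey))
        elif stack:
            parent = stack[-1]
            for kw, rank in seen.items():
                parent[kw] = max(parent.get(kw, 0), rank)

    for dewey, kw, rank in heapq.merge(*streams):
        # Longest common prefix between the current path and the new entry.
        lcp = 0
        while lcp < min(len(path), len(dewey)) and path[lcp] == dewey[lcp]:
            lcp += 1
        while len(path) > lcp:        # leave subtrees that are now finished
            pop()
        for comp in dewey[lcp:]:      # descend to the new entry
            path.append(comp)
            stack.append({})
        stack[-1][kw] = max(stack[-1].get(kw, 0), rank)
    while path:
        pop()
    return heapq.nlargest(k, results)


lists = {
    "XQL":     [((5, 0, 3, 0, 0), 85), ((8, 0, 3, 8, 3), 38)],
    "Ricardo": [((5, 0, 3, 0, 1), 82), ((8, 2, 1, 4, 2), 99)],
}
top = dil_search(lists)
```

With these four entries, the first pair shares the prefix 5.0.3.0, so that element is reported rather than its ancestors, while the second pair only meets at element 8.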

28
Ranked Dewey Inverted List (RDIL)
B-tree On Dewey Id
Inverted List
XQL
Sorted by ElemRank
B-tree On Dewey Id
Inverted List
Ricardo
Sorted by ElemRank
29
RDIL Query Processing
[Figure: RDIL query processing. Labels: Output Heap, Temp Heap, inverted lists for "Ricardo" and "XQL" (each with a B-tree on Dewey Id), current entries P (9.0.4.2.0) and R, Rank(9.0.4). Dewey IDs shown: 8.2.1.4.2, 9.0.4.1.2, 9.0.4.2.0, 9.0.5.6, 10.8.3]
threshold = ElemRank(P) + Max-ElemRank
threshold = ElemRank(P) + ElemRank(R)
30
Motivation for DIL/RDIL Hybrid

Correlation of query keywords: probability that the query keywords occur in same element
High correlation: RDIL likely to outperform DIL by stopping early
Low correlation: DIL likely to outperform RDIL because RDIL has to scan most of (or the entire) inverted list
Dilemma
DIL and RDIL can each outperform the other, depending on keyword correlation
But they require inverted lists to be sorted in different orders
Challenges
How to get benefits of DIL and RDIL without doubling space?
How can keyword correlation be determined?

31
Hybrid Dewey Inverted List (HDIL)
B-tree On Dewey Id
Full Inverted List
XQL
Sorted by Dewey id
Short List
Sorted by ElemRank

RDIL is better only when it scans little of inverted list
Short list sorted by ElemRank - saves space!
Can reuse full inverted list as leaf of B-tree - saves space!

32
DBLP High Correlation Keywords
33
DBLP Low Correlation Keywords
34
Outline

Pure Keyword Search
Keyword Search in Context
Full-Text DB Queries
Related Work and Conclusion

35
INEX IEEE
SIGMOD Record
...
Find relevant elements in Shakespeare's plays about "the process of speech"

9 of top 10 results for one repository were not in the top 10 results of other repository
XIRQL's [Fuhr & Großjohann, SIGIR 2001] TF-IDF scoring

36
Explaining the Results

TF-IDF scoring for a keyword k
TF (Term Frequency): number of occurrences of k in element
Usually normalized by some factor
IDF (Inverse Document Frequency): (number of elements) / (number of elements that contain k)
Score: sum of TF * IDF for all query keywords

Main reason for skewed results
Language of engineers very different from language of Shakespeare!
"process" common in INEX, "speech" uncommon
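
The effect of the corpus on the score can be illustrated with a small sketch of the formula above. It assumes TF is normalized by element length (the slide only says "some factor") and uses the plain ratio for IDF as written on the slide. The function name and the two toy corpora are made up for illustration.

```python
def tf_idf_score(query, element, corpus):
    """Score one element (a list of tokens) against a corpus of elements.

    score = sum over query keywords k of TF(k, element) * IDF(k, corpus)
    """
    score = 0.0
    for k in query:
        tf = element.count(k) / len(element)      # length-normalized (assumption)
        containing = sum(1 for e in corpus if k in e)
        if containing:                            # skip keywords absent from corpus
            score += tf * (len(corpus) / containing)
    return score


# Toy corpora: "process" is common in the first, "speech" in the second.
engineering = [["process", "control"], ["process", "model"],
               ["speech", "process"], ["data"]]
plays = [["speech", "king"], ["speech", "love"],
         ["process", "speech"], ["night"]]

element = ["speech", "recognition"]
query = ["process", "speech"]
score_engineering = tf_idf_score(query, element, engineering)
score_plays = tf_idf_score(query, element, plays)
```

The same element scores higher against the engineering corpus, where "speech" is rare, than against the plays corpus, where it is common. That is the skew the slide describes.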

37
INEX IEEE
SIGMOD Record
...
Need a way to efficiently compute IDF (or other
corpus scoring statistic) on-the-fly
38
Context-Sensitive Ranking [Botev & Shanmugasundaram, WebDB 2005]

Use Dewey inverted lists and context B-trees
Two-pass algorithm
First pass: collect statistics
Second pass: compute results (entries cached from first pass)
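
The two passes can be sketched as follows. This is a simplified illustration, not the published algorithm. A context is taken to be a Dewey ID prefix, a linear filter stands in for the context B-tree lookup, and the number of elements in the context (`n_context`) is assumed to be known in advance. All names are hypothetical.

```python
def context_search(lists, context, n_context, query):
    """lists: {keyword: [(dewey_tuple, tf), ...]}; context: Dewey ID prefix.

    Pass 1 collects per-keyword statistics inside the context and caches the
    matching entries; pass 2 scores the cached entries with those statistics.
    """
    prefix = tuple(context)
    cached, idf = {}, {}
    # Pass 1: restrict each inverted list to the context and compute IDF there.
    for k in query:
        cached[k] = [(d, tf) for d, tf in lists.get(k, [])
                     if d[:len(prefix)] == prefix]
        idf[k] = n_context / len(cached[k]) if cached[k] else 0.0
    # Pass 2: score cached entries only; the full lists are not read again.
    scores = {}
    for k in query:
        for d, tf in cached[k]:
            scores[d] = scores.get(d, 0.0) + tf * idf[k]
    return sorted(scores.items(), key=lambda kv: -kv[1])


lists = {
    "process": [((0, 1), 2), ((0, 2), 1), ((1, 0), 1)],
    "speech":  [((0, 1), 1), ((1, 0), 3), ((1, 1), 1)],
}
ranked = context_search(lists, context=(1,), n_context=4,
                        query=["process", "speech"])
```

Only entries under prefix 1 contribute to the statistics, so the IDF values reflect the chosen context rather than the whole collection.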

39
Outline

Pure Keyword Search
Keyword Search in Context
Full-Text DB Queries
Related Work and Conclusion

40
Motivation

Many new applications require sophisticated DB queries and complex full-text search
Example: Library of Congress documents in XML
Current XML query languages are mostly database languages
Examples: XQuery, XPath
Provide very rudimentary text/IR support
fn:contains($e, keywords)
No support for complex IR queries
Distance predicates, stemming, scoring, ...

41
Example Queries

From XQuery Full-Text Use Cases Document
Find the titles of the books whose body contains the phrases "Usability" and "Web site" in that order, in the same paragraph, using stemming if necessary to match the tokens
Find the titles of the books published after 1999 whose body contains "Usability" and "testing" within a window of 3 words, and return them in score order

42
XQuery Full-Text W3C Working Draft
Quark Full-Text Language (Cornell)
2002
IBM, Microsoft, Oracle proposals
TeXQuery (Cornell, AT&T)
2003
XQuery Full-Text
2004
XQuery Full-Text (Second Draft)
2005
43
XQuery Primer
Find the titles of books
//book/title
Find the titles of books with price < 25
//book[./price < 25]/title
Find books written by Dawkins, in order of price
for $b in //book[./author = "Dawkins"] order by $b/price return $b
44
Syntax Overview [Amer-Yahia, Botev, Shanmugasundaram, WWW 2004]

Two new XQuery constructs
FTContainsExpr
Expresses Boolean full-text search predicates
Seamlessly composes with other XQuery expressions
FTScoreClause
Extension to FOR expression
Can score FTContainsExpr and other expressions

45
FTContainsExpr

ContextExpr ftcontains FTSelection
ContextExpr (any XQuery expression) is context spec
FTSelection is search spec
Returns true iff at least one node in ContextExpr satisfies the FTSelection
Examples
//book ftcontains "Usability" && "testing" distance 5
//book[./content ftcontains "Usability" with stems]/title
//book ftcontains /article[author="Dawkins"]/title
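
The Boolean nature of a distance predicate such as the first example can be sketched in Python. This is not an implementation of XQuery Full-Text. It assumes "distance N" means the two words occur at most N token positions apart, matches case-insensitively, and does no stemming. The function name is hypothetical.

```python
def ft_contains_distance(tokens, a, b, distance):
    """True iff some occurrence of a and some occurrence of b lie within
    `distance` token positions of each other (assumed semantics)."""
    a, b = a.lower(), b.lower()
    pos_a = [i for i, t in enumerate(tokens) if t.lower() == a]
    pos_b = [i for i, t in enumerate(tokens) if t.lower() == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


near = "Usability testing of a Web site".split()
far = "Usability of a Web site requires careful testing".split()
near_match = ft_contains_distance(near, "Usability", "testing", 5)
far_match = ft_contains_distance(far, "Usability", "testing", 5)
```

Like FTContainsExpr, the function returns only true or false. Scoring is a separate concern, handled in the language by the FTScoreClause.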

46
FTScore Clause

FOR $v [SCORE $s]? IN Expr
ORDER BY ...
RETURN ...
Example
FOR $b SCORE $s in
/pub/book[. ftcontains "Usability" && "testing"]
ORDER BY $s RETURN $b

47
FTScore Clause

FOR $v [SCORE $s]? IN Expr
ORDER BY ...
RETURN ...
Example
FOR $b SCORE $s in
/pub/book[. ftcontains "Usability" && "testing"
and ./price < 10.00]
ORDER BY $s RETURN $b

48
Quark

An open-source C implementation of XQuery Full-Text
http://www.cs.cornell.edu/database/quark
Compiles on Linux and Windows
Key features
Mix of structured and full-text predicates
Score all of XQuery!
Full-text search over views

49
Outline

Pure Keyword Search
Keyword Search in Context
Full-Text DB Queries
Related Work and Conclusion

50
Related Work

Semi-structured ranked keyword search
XIRQL [Fuhr and Großjohann]
XXL [Theobald and Weikum]
Commercial search engines [Luk et al.]
INEX initiative
Keyword search over databases
BANKS [Bhalotia et al.]
DBXplorer [Agrawal et al.]
DISCOVER [Hristidis et al.]
LORE [Goldman et al.]

51
10,000-Foot View of Data Management
[Figure: systems plotted by Queries (Ranked Search vs. Complex and Structured) and Data (Structured vs. Unstructured); labeled regions: Information Retrieval Systems, Database Systems]
