Text Search over XML Documents

Transcript and Presenter's Notes

Title: Text Search over XML Documents


1
Text Search over XML Documents
  • Jayavel Shanmugasundaram
  • Cornell University

2
The HTML World
<body>
<h1> XML and Information Retrieval: A SIGIR 2000 Workshop </h1>
<p> The workshop was held on 28 July 2000. The editors of
the workshop were David Carmel, Yoelle Maarek, and Aya Soffer </p>
<h2> XQL and Proximal Nodes </h2>
<p> The paper was authored by Ricardo Baeza-Yates and Gonzalo
Navarro. The abstract of this paper is given below. </p>
<p> We consider the recently proposed language </p>
<p> The paper references the following papers
<a href="http://www.acm.org/www8/paper/xmlql"> </a>
</p>
3
The XML World
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
4
Key Aspect of XML
  • Captures text and structure
  • Applications
  • Digital libraries
  • Content management
  • Many such XML repositories already available
  • Library of Congress documents
  • IEEE INEX collection
  • SIGMOD, DBLP, Shakespeare's plays, ...

5
Searching XML Repositories
  • Confluence of Information Retrieval (text) and
    Database (structure) techniques
  • A spectrum of possibilities

Pure Keyword Search
Keyword Search in Context
Full-Text DB Queries
6
Outline
  • Pure Keyword Search
  • Keyword Search in Context
  • Full-Text DB Queries
  • Related Work and Conclusion

7
Keyword Search over HTML
Ranked Results
Query Keywords
Hyperlinked HTML Documents
8
Keyword Search over XML [Guo, Shao, Botev, Shanmugasundaram, SIGMOD 2003]
XRank
Ranked Results
Query Keywords
Mix of Hyperlinked XML and HTML Documents
9
Outline
  • Pure Keyword Search
  • Design Principles
  • Indexing and Query Processing
  • Keyword Search in Context
  • Full-Text DB Queries
  • Conclusion

10
XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
12
Design Principles
  • Return most specific element containing the query
    keywords

13
XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
    <paper id="2">
14
Design Principles
  • Return most specific element containing the query
    keywords
  • Ranking has to be done at the granularity of
    elements

15
XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
16
Design Principles
  • Return most specific element containing the query
    keywords
  • Ranking has to be done at the granularity of
    elements
  • Generalize HTML keyword search

18
Data Model
Containment edge
Hyperlink edge
19
ElemRank
  • Captures importance of an element
  • Analogous to Google's PageRank
  • But computed at granularity of elements
  • Exploit hyperlink edges and containment edges
  • Naturally generalizes Google's PageRank
  • Random walk interpretation

20
PageRank [Brin & Page, 1998]
Hyperlink edge
w
1-d: Probability of random jump
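The slide's figure and formula are not preserved in the transcript. As a rough illustration of the random-walk reading (follow a hyperlink with probability d, jump to a random page with probability 1-d), here is a minimal Python sketch of PageRank by power iteration. The function name `pagerank`, the default d = 0.85, the iteration count, and the treatment of pages without outgoing links are assumptions, not taken from the slides.

```python
def pagerank(links, d=0.85, iterations=50):
    """Power iteration over a dict: page -> list of pages it links to."""
    pages = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share (1 - d) / n.
        new_rank = {p: (1.0 - d) / n for p in pages}
        for u in pages:
            outs = links.get(u, [])
            if outs:
                # Rank of u is split evenly over its hyperlink edges.
                share = d * rank[u] / len(outs)
                for v in outs:
                    new_rank[v] += share
            else:
                # Assumption: a page without links spreads its rank uniformly.
                for v in pages:
                    new_rank[v] += d * rank[u] / n
        rank = new_rank
    return rank


example = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
```

In this toy graph, page `c` has no incoming links, so it keeps only the random-jump share.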
21
ElemRank
Hyperlink edge
Containment edge
w
1-d1-d2-d3: Probability of random jump
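The transcript only preserves the jump probability 1-d1-d2-d3, not the full ElemRank formula. The sketch below is one possible reading, under stated assumptions: d1 is the probability of following a hyperlink edge, d2 of moving down a containment edge to a child, and d3 of moving up a containment edge to the parent. The function name `elemrank`, the parameter values, and the exact normalization are hypothetical and may differ from the published definition.

```python
def elemrank(children, hyperlinks, d1=0.35, d2=0.25, d3=0.25, iterations=50):
    """children: element -> list of sub-elements (containment edges).
    hyperlinks: element -> list of elements it links to (hyperlink edges)."""
    parent = {c: p for p, kids in children.items() for c in kids}
    elems = sorted(
        set(children) | set(parent) | set(hyperlinks)
        | {v for outs in hyperlinks.values() for v in outs}
    )
    n = len(elems)
    rank = {e: 1.0 / n for e in elems}
    for _ in range(iterations):
        # Random jump with probability 1 - d1 - d2 - d3.
        new_rank = {e: (1.0 - d1 - d2 - d3) / n for e in elems}
        for u in elems:
            outs = hyperlinks.get(u, [])
            for v in outs:                      # follow a hyperlink edge
                new_rank[v] += d1 * rank[u] / len(outs)
            kids = children.get(u, [])
            for v in kids:                      # descend a containment edge
                new_rank[v] += d2 * rank[u] / len(kids)
            if u in parent:                     # ascend a containment edge
                new_rank[parent[u]] += d3 * rank[u]
        rank = new_rank
    return rank


ranks = elemrank({"doc": ["s1", "s2"]}, {"s1": ["s2"]})
```

With these assumptions, section `s2` outranks its sibling `s1` because it additionally receives rank over the hyperlink edge.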
22
Outline
  • Pure Keyword Search
  • Design Principles
  • Indexing and Query Processing
  • Keyword Search in Context
  • Full-Text DB Queries
  • Conclusion

23
System Architecture
Keyword query
Ranked Results
XML/HTML Documents
Query Evaluator
Data access
XML Elements with ElemRanks
ElemRank Computation
Hybrid Dewey Inverted List
Compute top-k query results as per definition of
ranking
24
Naïve Method
  • Naïve inverted lists
  • Ricardo: 1, 5, 6, 8
  • XQL: 1, 5, 6, 7

[Figure: element tree with numeric IDs. 1 <workshop> has children 2 date ("28 July"), 3 <title> ("XML and"), 4 <editors> ("David Carmel"), and 5 <proceedings>. 5 <proceedings> contains 6 <paper> and a further <paper>. 6 <paper> contains 7 <title> ("XQL and") and 8 <author> ("Ricardo").]

Problems: 1. Space Overhead 2. Spurious Results
Main issue: Decouples representation of ancestors
and descendants
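A small sketch of why the naïve lists give spurious results, using the two lists from this slide. The parent map is read off the slide's figure as best it can be recovered, and the function names are hypothetical.

```python
# Naive inverted lists: each keyword lists the containing element and all its ancestors.
NAIVE_LISTS = {"Ricardo": [1, 5, 6, 8], "XQL": [1, 5, 6, 7]}

# Parent of each element ID (assumed from the figure; 1 is the root).
PARENT = {2: 1, 3: 1, 4: 1, 5: 1, 6: 5, 7: 6, 8: 6}


def naive_search(lists, keywords):
    """Intersect the lists: every element that contains all keywords."""
    hits = set(lists[keywords[0]])
    for k in keywords[1:]:
        hits &= set(lists[k])
    return sorted(hits)


def most_specific(hits, parent):
    """Drop every hit that is an ancestor of another hit."""
    ancestors = set()
    for h in hits:
        node = h
        while node in parent:
            node = parent[node]
            ancestors.add(node)
    return [h for h in hits if h not in ancestors]


hits = naive_search(NAIVE_LISTS, ["Ricardo", "XQL"])   # [1, 5, 6]
best = most_specific(hits, PARENT)                     # [6]
```

The intersection returns the paper (6) together with its ancestors (1 and 5); only 6 is the most specific result, and finding that out needs the separate parent map because the lists themselves carry no ancestor information.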
25
Dewey IDs [1850s]
[Figure: the element tree labeled with Dewey IDs]
0        <workshop>
0.0      date ("28 July")
0.1      <title> ("XML and")
0.2      <editors> ("David Carmel")
0.3      <proceedings>
0.3.0    <paper>
0.3.0.0  <title> ("XQL and")
0.3.0.1  <author> ("Ricardo")
0.3.1    <paper>
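A minimal sketch of Dewey ID assignment with Python's standard `xml.etree.ElementTree`. To reproduce the numbering in the figure, it numbers an element's attributes before its child elements; that ordering rule, the function name `dewey_ids`, and the trimmed sample document (the `id` attribute of `<paper>` is left out) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<workshop date="28 July 2000">'
    "<title>XML and</title><editors>David Carmel</editors>"
    "<proceedings>"
    "<paper><title>XQL and</title><author>Ricardo</author></paper>"
    "<paper/>"
    "</proceedings></workshop>"
)


def dewey_ids(xml_text):
    """Return (Dewey ID, label) pairs in document order."""
    out = []

    def visit(elem, dewey):
        out.append((".".join(map(str, dewey)), elem.tag))
        position = 0
        for name in elem.attrib:            # attributes are numbered first
            out.append((".".join(map(str, dewey + (position,))), "@" + name))
            position += 1
        for child in elem:                  # then child elements, in order
            visit(child, dewey + (position,))
            position += 1

    visit(ET.fromstring(xml_text), (0,))
    return out


for dewey, label in dewey_ids(SAMPLE):
    print(dewey, label)
```

The ID of any ancestor is a prefix of the ID of its descendants, which is the property the Dewey inverted lists rely on.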
26
Dewey Inverted List (DIL)
Keyword   Dewey Id    ElemRank   Position List
XQL       5.0.3.0.0   85         32
          8.0.3.8.3   38         89, 91
Ricardo   5.0.3.0.1   82         38
          8.2.1.4.2   99         52
(each list is sorted by Dewey Id)

Store IDs of elements that directly contain
keyword: avoids space overhead
27
DIL Query Processing
  • Merge query keyword inverted lists in Dewey ID
    Order
  • Entries with common prefixes are processed
    together
  • Compute Longest Common Prefix of Dewey IDs during
    the merge
  • Longest common prefix ensures most specific
    results
  • Also suppresses spurious results
  • Keep top-k results seen so far in output heap
  • Output contents of output heap after scanning
    inverted lists
  • Algorithm works in a single scan over inverted
    lists
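The merge described above can be sketched as a single pass with a stack over the merged Dewey IDs. This is a simplified illustration, not the XRank implementation: it leaves out ElemRank, position lists, and the top-k output heap, and the function name `dil_search` and the tuple representation of Dewey IDs are assumptions.

```python
import heapq


def dil_search(lists):
    """lists: keyword -> Dewey IDs (tuples of ints), each list sorted ascending.
    Returns the most specific elements that contain every keyword."""
    keywords = set(lists)
    merged = heapq.merge(*[[(d, k) for d in ids] for k, ids in lists.items()])
    path = []       # Dewey components of the element currently on top of the stack
    seen = []       # seen[i]: keywords found so far under the element path[:i + 1]
    results = []

    def pop():
        found = seen.pop()
        elem = tuple(path)
        path.pop()
        if found == keywords:
            results.append(elem)    # most specific result: do not report ancestors
        elif seen:
            seen[-1] |= found       # pass partial matches up to the parent

    for dewey, keyword in merged:
        # Longest common prefix between the stack and the next entry.
        lcp = 0
        while lcp < min(len(path), len(dewey)) and path[lcp] == dewey[lcp]:
            lcp += 1
        while len(path) > lcp:
            pop()
        for component in dewey[lcp:]:
            path.append(component)
            seen.append(set())
        seen[-1].add(keyword)
    while path:
        pop()
    return results


hits = dil_search({
    "XQL": [(5, 0, 3, 0, 0), (8, 0, 3, 8, 3)],
    "Ricardo": [(5, 0, 3, 0, 1), (8, 2, 1, 4, 2)],
})
```

For the entries shown on the DIL slide, this yields 5.0.3.0 (the common parent of 5.0.3.0.0 and 5.0.3.0.1) and the whole document 8, and none of the ancestors of 5.0.3.0.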

28
Ranked Dewey Inverted List (RDIL)
B-tree On Dewey Id
Inverted List
XQL
Sorted by ElemRank
B-tree On Dewey Id
Inverted List
Ricardo
Sorted by ElemRank
29
RDIL Query Processing
[Figure: RDIL query processing example for the keywords "Ricardo" and "XQL". Each keyword has an inverted list and a B-tree on Dewey Id; the figure also shows a temp heap and an output heap. Labels shown: P = 9.0.4.2.0, R, Rank(9.0.4). Dewey IDs shown: 8.2.1.4.2, 9.0.4.1.2, 9.0.5.6, 10.8.3.]
threshold = ElemRank(P) + Max-ElemRank
threshold = ElemRank(P) + ElemRank(R)
30
Motivation for DIL/RDIL Hybrid
  • Correlation of query keywords: probability that
    the query keywords occur in same element
  • High correlation: RDIL likely to outperform DIL
    by stopping early
  • Low correlation: DIL likely to outperform RDIL
    because RDIL has to scan most (or entire)
    inverted list
  • Dilemma
  • DIL and RDIL are each likely to outperform the
    other on some queries
  • But require inverted lists to be sorted in
    different orders
  • Challenges
  • Get benefits of DIL and RDIL without doubling
    space?
  • How can keyword correlation be determined?

31
Hybrid Dewey Inverted List (HDIL)
B-tree On Dewey Id
Full Inverted List
XQL
Sorted by Dewey id
Short List
Sorted by ElemRank
  • RDIL is better only when it scans little of
    inverted list
  • Short list sorted by ElemRank - saves space!
  • Can reuse full inverted list as leaf of B-tree
  • Saves space!

32
DBLP High Correlation Keywords
33
DBLP Low Correlation Keywords
34
Outline
  • Pure Keyword Search
  • Keyword Search in Context
  • Full-Text DB Queries
  • Related Work and Conclusion

35
INEX IEEE
SIGMOD Record
...
Find relevant elements in Shakespeare's plays
about the process of speech
  • 9 of top 10 results for one repository were not
    in the top 10 results of other repository
  • XIRQL's [Fuhr & Großjohann, SIGIR 2001] TF-IDF
    scoring

36
Explaining the Results
  • TF-IDF scoring for a keyword k
  • TF (Term Frequency): # occurrences of k in element
  • Usually normalized by some factor
  • IDF (Inverse Document Frequency): (# elements)/(#
    elements that contain k)
  • Score: sum of TF × IDF for all query keywords
  • Main reason for skewed results
  • Language of engineers very different from
    language of Shakespeare!
  • "process" common in INEX, "speech" uncommon
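A minimal sketch of the scoring just described, using the slide's definitions literally: TF is the raw occurrence count, IDF is (# elements)/(# elements that contain k) without a logarithm, and no TF normalization is applied. The function name `tf_idf_scores` and the toy corpus are made up for illustration.

```python
def tf_idf_scores(elements, query):
    """elements: element id -> list of tokens. Returns element id -> score."""
    n = len(elements)
    # Number of elements that contain each query keyword.
    df = {k: sum(1 for toks in elements.values() if k in toks) for k in query}
    scores = {}
    for eid, toks in elements.items():
        score = 0.0
        for k in query:
            if df[k] == 0:
                continue                    # keyword absent from the corpus
            tf = toks.count(k)              # term frequency in this element
            idf = n / df[k]                 # inverse document frequency
            score += tf * idf
        scores[eid] = score
    return scores


corpus = {
    "e1": ["process", "process", "speech"],
    "e2": ["process"],
    "e3": ["process"],
    "e4": ["signal"],
}
scores = tf_idf_scores(corpus, ["process", "speech"])
```

In this toy corpus "process" is common and "speech" is rare, so a single occurrence of "speech" contributes more to the score of `e1` than its two occurrences of "process".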

37
INEX IEEE
SIGMOD Record
...
Need a way to efficiently compute IDF (or other
corpus scoring statistic) on-the-fly
38
Context-Sensitive Ranking [Botev &
Shanmugasundaram, WebDB 2005]
  • Use Dewey inverted lists + context B-trees
  • Two pass algorithm
  • First pass: collect statistics
  • Second pass: compute results (entries cached from
    first pass)
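The two passes can be illustrated with a brute-force sketch in which the search context is a Dewey ID prefix. It only shows why the statistics depend on the context; it does not use inverted lists, context B-trees, or caching, and the function name `context_scores` and the toy index are assumptions.

```python
def context_scores(index, context, query):
    """index: Dewey ID (tuple) -> list of tokens.
    context: Dewey ID prefix (tuple) that restricts the search."""
    in_context = {
        dewey: toks for dewey, toks in index.items()
        if dewey[:len(context)] == context
    }
    # First pass: collect statistics inside the context only.
    n = len(in_context)
    df = {k: sum(1 for toks in in_context.values() if k in toks) for k in query}
    # Second pass: score elements with the context-specific IDF.
    return {
        dewey: sum(toks.count(k) * n / df[k] for k in query if df[k])
        for dewey, toks in in_context.items()
    }


INDEX = {
    (0, 0): ["process", "speech"], (0, 1): ["process"], (0, 2): ["process"],
    (1, 0): ["speech", "process"], (1, 1): ["speech"], (1, 2): ["speech"],
}
engineering = context_scores(INDEX, (0,), ["process", "speech"])
everything = context_scores(INDEX, (), ["process", "speech"])
```

Element 0.0 gets a different score when the context is repository 0 than when the context is the whole collection, because the IDF of "speech" changes with the context.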

39
Outline
  • Pure Keyword Search
  • Keyword Search in Context
  • Full-Text DB Queries
  • Related Work and Conclusion

40
Motivation
  • Many new applications require sophisticated DB
    queries + complex full-text search
  • Example: Library of Congress documents in XML
  • Current XML query languages are mostly database
    languages
  • Examples: XQuery, XPath
  • Provide very rudimentary text/IR support
  • fn:contains($e, keywords)
  • No support for complex IR queries
  • Distance predicates, stemming, scoring, ...

41
Example Queries
  • From the XQuery Full-Text Use Cases document
  • Find the titles of the books whose body contains
    the phrases "Usability" and "Web site" in that
    order, in the same paragraph, using stemming if
    necessary to match the tokens
  • Find the titles of the books published after 1999
    whose body contains "Usability" and "testing"
    within a window of 3 words, and return them in
    score order

42
XQuery Full-Text W3C Working Draft
Quark Full-Text Language (Cornell)
2002
IBM, Microsoft, Oracle proposals
TeXQuery (Cornell, AT&T)
2003
XQuery Full-Text
2004
XQuery Full-Text (Second Draft)
2005
43
XQuery Primer
Find the titles of books
//book/title
Find the titles of books with price < 25
//book[./price < 25]/title
Find books written by Dawkins, in order of price
for $b in //book[./author = "Dawkins"] order by
$b/price return $b
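For readers without an XQuery processor, the three primer queries can be approximated in Python with the standard `xml.etree.ElementTree` module. Its XPath subset has no numeric comparisons or ordering, so those parts are done in plain Python. The catalog data, book titles, and function names below are made up for illustration.

```python
import xml.etree.ElementTree as ET

CATALOG = ET.fromstring(
    "<catalog>"
    "<book><title>Book A</title><author>Dawkins</author><price>30</price></book>"
    "<book><title>Book B</title><author>Dawkins</author><price>12</price></book>"
    "<book><title>Book C</title><author>Someone</author><price>20</price></book>"
    "</catalog>"
)


def titles(root):
    """//book/title"""
    return [t.text for t in root.findall(".//book/title")]


def cheap_titles(root, limit=25):
    """//book[./price < limit]/title, with the comparison done in Python."""
    return [
        b.findtext("title")
        for b in root.findall(".//book")
        if float(b.findtext("price")) < limit
    ]


def books_by(root, author):
    """for $b in //book[./author = author] order by $b/price return $b"""
    books = root.findall(f".//book[author='{author}']")
    return sorted(books, key=lambda b: float(b.findtext("price")))


print(titles(CATALOG))
print(cheap_titles(CATALOG))
print([b.findtext("title") for b in books_by(CATALOG, "Dawkins")])
```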
44
Syntax Overview [Amer-Yahia, Botev,
Shanmugasundaram, WWW 2004]
  • Two new XQuery constructs
  • FTContainsExpr
  • Expresses Boolean full-text search predicates
  • Seamlessly composes with other XQuery expressions
  • FTScoreClause
  • Extension to FOR expression
  • Can score FTContainsExpr and other expressions

45
FTContainsExpr
  • ContextExpr ftcontains FTSelection
  • ContextExpr (any XQuery expression) is context
    spec
  • FTSelection is search spec
  • Returns true iff at least one node in ContextExpr
    satisfies the FTSelection
  • Examples
  • //book ftcontains "Usability" && "testing"
    distance 5
  • //book[./content ftcontains "Usability" with
    stems]/title
  • //book ftcontains /article[author="Dawkins"]/title
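To make the distance predicate in the first example concrete, here is a rough Python sketch of a Boolean check on plain text. It assumes that "distance 5" means at most five tokens lie between the two matches and that tokens are lowercased word characters; the actual language semantics may differ, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def within_distance(text, first, second, max_gap):
    """True if some occurrence of `first` and some occurrence of `second`
    have at most `max_gap` tokens between them (in either order)."""
    tokens = tokenize(text)
    pos_first = [i for i, t in enumerate(tokens) if t == first.lower()]
    pos_second = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(
        i != j and abs(i - j) - 1 <= max_gap
        for i in pos_first
        for j in pos_second
    )


sentence = "Usability studies rely on careful testing"
print(within_distance(sentence, "Usability", "testing", 5))
```

Like FTContainsExpr, the check returns only true or false; scoring is a separate concern.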

46
FTScore Clause
  • FOR $v [SCORE $s]? IN Expr
  • ORDER BY
  • RETURN
  • Example
  • FOR $b SCORE $s in
  • /pub/book[. ftcontains "Usability" &&
    "testing"]
  • ORDER BY $s RETURN $b

47
FTScore Clause
  • FOR $v [SCORE $s]? IN Expr
  • ORDER BY
  • RETURN
  • Example
  • FOR $b SCORE $s in
  • /pub/book[. ftcontains "Usability" &&
    "testing"
  • and ./price < 10.00]
  • ORDER BY $s RETURN $b

48
Quark
  • An open-source C implementation of XQuery
    Full-Text
  • http://www.cs.cornell.edu/database/quark
  • Compiles on Linux and Windows
  • Key features
  • Mix of structured and full-text predicates
  • Score all of XQuery!
  • Full-text search over views

49
Outline
  • Pure Keyword Search
  • Keyword Search in Context
  • Full-Text DB Queries
  • Related Work and Conclusion

50
Related Work
  • Semi-structured ranked keyword search
  • XIRQL [Fuhr and Großjohann]
  • XXL [Theobald and Weikum]
  • Commercial search engines [Luk et al.]
  • INEX initiative
  • Keyword search over databases
  • BANKS [Bhalotia et al.]
  • DBXplorer [Agrawal et al.]
  • DISCOVER [Hristidis et al.]
  • LORE [Goldman et al.]

51
10,000-Foot View of Data Management
[Figure: chart with a Data axis (Structured, Unstructured) and a Queries axis (Ranked Search, Complex and Structured), with regions labeled Information Retrieval Systems and Database Systems.]