Text Search in XML - PowerPoint PPT Presentation

Slides: 33
Provided by: jayavelsha
Transcript and Presenter's Notes

1
Text Search in XML
  • Jayavel Shanmugasundaram
  • Cornell University

2
Motivation
  • A key benefit of XML is its ability to represent
    a mix of structured and unstructured (text) data
  • Applications
      • Digital libraries
      • Content management
  • Many such XML repositories already available
      • IEEE INEX collection
      • Library of Congress documents
      • Shakespeare's plays
      • SIGMOD, DBLP, ...

3
Example XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language ... </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML ...
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> ... </cite>
    </paper>
    ...
4
Summary
  • Source of imprecision in XML full-text search
  • Scoring and ranking query results
  • Why is it different from traditional IR?
      • Structure plays a large role in ranking
  • Why is it different from traditional imprecise
    databases?
      • Even operators are fuzzy (more possible worlds)

5
Outline
  • Beyond Traditional IR
  • Beyond Traditional DB
  • Conclusion

6
Query over the example XML document (slide 3):
Find relevant elements in important workshops
between the years 1999 and 2001 that are about
"Ricardo" and "XML"
7
Sources of Imprecision/Scores
  • Query results
      • XIRQL [Fuhr & Großjohann, SIGIR 2001]
      • XRANK [Guo et al., SIGMOD 2003]
      • XKSearch [Xu & Papakonstantinou, SIGMOD 2005]

8
Query over the example XML document (slide 3):
Find relevant elements in important workshops
between the years 1999 and 2001 that are about
"Ricardo" and "XML"
9
Data Model
[Figure: XML elements connected by two kinds of edges]
  • Containment edge
  • Hyperlink edge
10
PageRank [Brin & Page]
[Figure: hyperlink edges; label w]
  • 1 - d: probability of a random jump
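
The random-jump model on this slide can be sketched as a short power iteration. This is a minimal sketch, not code from the talk: the function name `pagerank`, the toy graph, the value d = 0.85, and the handling of pages without out-links (their rank is spread uniformly) are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Power iteration for PageRank.

    links maps each page to the list of pages it links to.
    d is the probability of following a hyperlink edge;
    1 - d is the probability of a random jump.
    """
    pages = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share (1 - d) / n.
        new_rank = {p: (1.0 - d) / n for p in pages}
        for u in pages:
            targets = links.get(u, [])
            if targets:
                share = d * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Assumption: a page with no out-links spreads its rank uniformly.
                for v in pages:
                    new_rank[v] += d * rank[u] / n
        rank = new_rank
    return rank


# Hypothetical three-page web: a and b both link to w, and w links back to a.
toy_web = {"a": ["w"], "b": ["w"], "w": ["a"]}
ranks = pagerank(toy_web)
```

In this toy graph, w ends up with the highest rank because it is the only page with two incoming hyperlink edges, and b, which nothing links to, keeps only its random-jump share.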
11
ElemRank [Guo et al.]
[Figure: hyperlink edges and containment edges; label w]
  • 1 - d1 - d2 - d3: probability of a random jump
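
ElemRank extends the same idea to element granularity by letting rank flow along containment edges as well as hyperlink edges. The sketch below is a simplified illustration under stated assumptions, not the published algorithm: it reads d1 as the probability of following a hyperlink edge, d2 as the probability of moving from a parent to a child, and d3 as the probability of moving from a child to its parent, and it uses a uniform random jump over all elements. The function name `elemrank`, the parameter values, and the toy document are hypothetical.

```python
def elemrank(children, hyperlinks, d1=0.35, d2=0.25, d3=0.25, iterations=100):
    """Simplified ElemRank-style iteration over an element graph.

    children maps an element to its sub-elements (containment edges).
    hyperlinks maps an element to the elements it links to.
    1 - d1 - d2 - d3 is the probability of a random jump.
    """
    elements = sorted(
        set(children)
        | set(hyperlinks)
        | {v for vs in children.values() for v in vs}
        | {v for vs in hyperlinks.values() for v in vs}
    )
    n = len(elements)
    jump = (1.0 - d1 - d2 - d3) / n
    rank = {e: 1.0 / n for e in elements}
    for _ in range(iterations):
        new_rank = {e: jump for e in elements}
        for u in elements:
            for v in hyperlinks.get(u, []):
                # Hyperlink edge u -> v.
                new_rank[v] += d1 * rank[u] / len(hyperlinks[u])
            for v in children.get(u, []):
                # Forward containment: the parent shares rank among its children.
                new_rank[v] += d2 * rank[u] / len(children[u])
                # Reverse containment: each child passes rank up to its parent.
                new_rank[u] += d3 * rank[v]
        rank = new_rank
    return rank


# Hypothetical document: the cite element inside paper2 links to paper1.
children = {
    "workshop": ["paper1", "paper2"],
    "paper1": ["section1"],
    "paper2": ["cite2"],
}
hyperlinks = {"cite2": ["paper1"]}
elem_ranks = elemrank(children, hyperlinks)
```

Here paper1 outranks paper2 only because of the incoming hyperlink edge, and the difference propagates down to section1 through the containment edge.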
12
Sources of Imprecision/Scores
  • Query results
      • XIRQL [Fuhr & Großjohann, SIGIR 2001]
      • XRANK [Guo et al., SIGMOD 2003]
      • XKSearch [Xu & Papakonstantinou, SIGMOD 2005]
  • Scores based on structure
      • XRANK (ElemRank), XIRQL (TF-IDF)

13
Repositories: INEX IEEE, SIGMOD Record, ...
Query: find relevant elements in Shakespeare's plays
about the process of speech
  • 9 of the top-10 results for one repository were not
    in the top-10 results of the other repository
    (XIRQL's TF-IDF scoring)
  • The language of engineers is different from that of
    Shakespeare!
  • "process" is common in INEX; "speech" is uncommon
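
Why the search context changes the ranking can be seen from the inverse document frequency (IDF) part of TF-IDF: the same term gets a different weight depending on which collection the statistics come from. The sketch below uses a common smoothed IDF formula and two tiny made-up collections; neither the formula variant nor the sample sentences come from the talk or from XIRQL.

```python
import math


def idf(term, documents):
    """Smoothed inverse document frequency of term over a list of token lists."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log((1 + len(documents)) / (1 + containing)) + 1.0


# Made-up stand-ins for an engineering collection and a collection of plays.
engineering = [
    "the process of query optimization".split(),
    "a process model for software".split(),
    "speech recognition process".split(),
    "the indexing process".split(),
]
plays = [
    "a speech to the senate".split(),
    "her speech was short".split(),
    "the process of time".split(),
    "give me leave to make a speech".split(),
]

weights = {
    "engineering": {t: idf(t, engineering) for t in ("process", "speech")},
    "plays": {t: idf(t, plays) for t in ("process", "speech")},
}
```

With these toy collections, "speech" is the more discriminating term in the engineering context and "process" is the more discriminating term in the plays context, so the same query ranks elements differently depending on whose statistics are used.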

14
Sources of Imprecision/Scores
  • Query results
      • XIRQL [Fuhr & Großjohann, SIGIR 2001]
      • XRANK [Guo et al., SIGMOD 2003]
      • XKSearch [Xu & Papakonstantinou, SIGMOD 2005]
  • Scores based on structure
      • XRANK (ElemRank), XIRQL (TF-IDF)
  • Scores based on search context
      • Quark [Botev & Shanmugasundaram, WebDB 2005]
      • PowerDB [Grabs & Schek, INEX 2005]

15
Outline
  • Beyond Traditional IR
  • Beyond Traditional DB
  • Conclusion

16
Why not use RDBMS with SQL/MM?
  • Relations not natural for modeling XML text
      • Flexible schema
      • Predominantly text
      • Hyperlinks
  • No strict demarcation between structured and text
    data
  • Can issue structured and text queries over same
    data
      • Find books with year > 1995
      • Find books containing keyword "1998"
  • Can embed structured queries in text queries
      • Find books that contain the keywords that occur
        in the title of Richard Dawkins' books

17
Current XML Query Languages
  • Current XML query languages are mostly database
    languages
      • Examples: XQuery, XPath
  • Provide very rudimentary text/IR support
      • fn:contains($e, keywords)
      • Returns true iff element $e contains keywords
  • No support for complex IR queries
      • Distance predicates, stemming, scoring, ...

18
Example Queries
  • From the XQuery Full-Text Use Cases document
      • Find the titles of the books whose body contains
        the phrases "Usability" and "Web site" in that
        order, in the same paragraph, using stemming if
        necessary to match the tokens
      • Find the titles of the books whose body contains
        "Usability" and "testing" within a window of 3
        words, and return them in score order

19
XQuery Full-Text
  • Full-text search extension to XQuery
      • W3C Working Draft
  • XQuery Full-Text evolution
      • Quark query language
        [Botev & Shanmugasundaram, 2003]
      • TeXQuery
        [Amer-Yahia, Botev, Shanmugasundaram, WWW 2004]
      • XQuery Full-Text
          • http://www.w3.org/TR/xquery-full-text
          • Invited experts (Botev and Shanmugasundaram)

20
Outline
  • Beyond Traditional IR
  • Beyond Traditional DB
  • XQuery Full-Text
  • Research Directions
  • Conclusion

21
Syntax Overview
  • Two new XQuery constructs
      • FTContainsExpr
          • Expresses Boolean full-text search predicates
          • Seamlessly composes with other XQuery expressions
      • FTScoreClause
          • Extension to FLWOR expression
          • Can score FTContainsExpr and other expressions

22
FTContainsExpr
  • ContextExpr ftcontains FTSelection
  • ContextExpr (any XQuery expression) is context
    spec
  • FTSelection is search spec
  • Returns true iff at least one node in ContextExpr
    satisfies the FTSelection
  • Examples
      • //book ftcontains "Usability" && "testing"
        distance 5
      • //book[./content ftcontains "Usability" with
        stems]/title
      • //book ftcontains /article[author="Dawkins"]/title
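
The Boolean semantics of an FTSelection with a distance predicate and stemming can be approximated in a few lines. This is only a sketch under stated assumptions and not the XQuery Full-Text definition: "distance n" is read as "at most n tokens between the matched words", tokenization is a plain alphanumeric split, and `crude_stem` is a toy suffix stripper standing in for a real stemmer. All function names are hypothetical.

```python
import re
from itertools import product


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    # Assumption: a toy suffix stripper, not a real stemming algorithm.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def ft_contains(text, words, distance=None, stems=False):
    """True iff every word occurs in text, optionally within a token distance."""
    normalize = crude_stem if stems else (lambda token: token)
    tokens = [normalize(t) for t in tokenize(text)]
    positions = []
    for word in words:
        target = normalize(word.lower())
        hits = [i for i, t in enumerate(tokens) if t == target]
        if not hits:
            return False  # Boolean conjunction: one missing word fails the match.
        positions.append(hits)
    if distance is None:
        return True
    # Accept if some choice of one position per word fits within the distance.
    return any(max(c) - min(c) - 1 <= distance for c in product(*positions))


body = "Usability studies rely on careful testing of every page"
near = ft_contains(body, ["Usability", "testing"], distance=5)
too_far = ft_contains(body, ["Usability", "testing"], distance=3)
```

In `body`, four tokens separate "Usability" and "testing", so the predicate holds for distance 5 and fails for distance 3.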

23
FTScore Clause
In any order
  • FOR $v [SCORE $s]? IN [FUZZY] Expr
  • LET ...
  • WHERE ...
  • ORDER BY ...
  • RETURN ...
  • Example
      FOR $b SCORE $s in
          /pub/book[. ftcontains "Usability" && "testing"]
      ORDER BY $s
      RETURN <result score="{$s}"> {$b} </result>
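
The effect of the SCORE clause, binding a relevance score to each book and ordering by it, can be imitated outside XQuery. The sketch below is an illustration only: the scoring function (matched-term occurrences divided by element length), the descending sort order, the sample `PUB` document, and the names `score` and `ranked_books` are assumptions, since the slides do not fix a scoring method.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample data with the /pub/book shape used in the slide example.
PUB = """<pub>
  <book><title>Web Usability</title><content>Usability testing and more usability testing</content></book>
  <book><title>Databases</title><content>Query processing</content></book>
  <book><title>Testing</title><content>Usability and testing basics for beginners</content></book>
</pub>"""


def score(element, words):
    """Assumed score: occurrences of the query words divided by element length."""
    tokens = re.findall(r"[a-z0-9]+", " ".join(element.itertext()).lower())
    if not tokens:
        return 0.0
    counts = [tokens.count(word.lower()) for word in words]
    if min(counts) == 0:
        return 0.0  # The Boolean conjunction fails, so the book does not qualify.
    return sum(counts) / len(tokens)


def ranked_books(xml_text, words):
    """Return (score, title) pairs for matching books, best score first."""
    root = ET.fromstring(xml_text)
    scored = [(score(book, words), book) for book in root.findall("book")]
    hits = [(s, book) for s, book in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [(round(s, 3), book.findtext("title")) for s, book in hits]


results = ranked_books(PUB, ["Usability", "testing"])
```

The book that mentions neither word pair in full is filtered out by the Boolean predicate, and the remaining two are ordered by the assumed score.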

24
FTScore Clause
In any order
  • FOR $v [SCORE $s]? IN [FUZZY] Expr
  • LET ...
  • WHERE ...
  • ORDER BY ...
  • RETURN ...
  • Example
      FOR $b SCORE $s in
          /pub/book[. ftcontains "Usability" && "testing"
                    and ./price < 10.00]
      ORDER BY $s
      RETURN $b

25
FTScore Clause
In any order
  • FOR $v [SCORE $s]? IN [FUZZY] Expr
  • LET ...
  • WHERE ...
  • ORDER BY ...
  • RETURN ...
  • Example
      FOR $b SCORE $s in FUZZY
          /pub/book[. ftcontains "Usability" && "testing"]
      ORDER BY $s
      RETURN $b

26
Outline
  • Beyond Traditional IR
  • Beyond Traditional DB
  • XQuery Full-Text
  • Research Directions
  • Conclusion

27
System Architecture
[Figure: Integration Layer, XQuery Engine, IR Engine]
28
System Architecture
[Figure: XQuery + IR Engine]
Quark @ Cornell uses this architecture
29
Structural Relaxation
  FOR $b SCORE $s in FUZZY
      /pub/book[. ftcontains "Usability" with stems]
  ORDER BY $s
  RETURN $b

30
Search Over Views
Data Source 1
  <books> <book> ... </book> <book> ... </book> ... </books>
Data Source 2
  <reviews> <review> ... </review> <review> ... </review> ... </reviews>
Integrated View
  <book> ... <reviews> ... </reviews> </book>
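
One naive way to search over such a view is to materialize it and then run the keyword search on the result. The sketch below does exactly that and is meant only to make the setting concrete: the `isbn` join attribute, the sample data, and the function names are hypothetical, and the slide does not say how the two sources are joined or that the view is materialized.

```python
import xml.etree.ElementTree as ET

# Made-up data sources; the isbn attribute is an assumed join key.
BOOKS = """<books>
  <book isbn="1"><title>Web Usability</title></book>
  <book isbn="2"><title>XML Basics</title></book>
</books>"""

REVIEWS = """<reviews>
  <review isbn="1">Great coverage of usability testing</review>
  <review isbn="2">A dry introduction</review>
  <review isbn="1">Practical examples</review>
</reviews>"""


def integrated_view(books_xml, reviews_xml):
    """Nest each book's reviews under the book, as in the integrated view."""
    books = ET.fromstring(books_xml)
    reviews = ET.fromstring(reviews_xml)
    view = ET.Element("books")
    for book in books.findall("book"):
        merged = ET.SubElement(view, "book", dict(book.attrib))
        merged.extend(list(book))  # Copy over the book's own children.
        nested = ET.SubElement(merged, "reviews")
        for review in reviews.findall("review"):
            if review.get("isbn") == book.get("isbn"):
                nested.append(review)
    return view


def search_view(view, keyword):
    """Titles of books whose integrated text contains the keyword."""
    keyword = keyword.lower()
    return [
        book.findtext("title")
        for book in view.findall("book")
        if keyword in " ".join(book.itertext()).lower()
    ]


view = integrated_view(BOOKS, REVIEWS)
matches = search_view(view, "testing")
```

The keyword "testing" appears only in a review, never in the book source itself, so the match exists only in the integrated view. Materializing the whole view before searching is the simplest option and can be costly for large sources.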
31
Outline
  • Beyond Traditional IR
  • Beyond Traditional DB
  • Conclusion

32
10000 Foot View of Data Management
[Figure: chart with a Queries axis (Ranked Search; Complex and Structured) and a Data axis (Structured; Unstructured), locating Information Retrieval Systems and Database Systems]