Title: Text Search in XML
1Text Search in XML
- Jayavel Shanmugasundaram
- Cornell University
2Motivation
- A key benefit of XML is its ability to represent a mix of structured and unstructured (text) data
- Applications
- Digital libraries
- Content management
- Many such XML repositories already available
- IEEE INEX collection
- Library of Congress documents
- Shakespeare's plays
- SIGMOD, DBLP,
3Example XML Document
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
4Summary
- Source of imprecision in XML full-text search
- Scoring and ranking query results
- Why is it different from traditional IR?
- Structure plays a large role in ranking
- Why is it different from traditional imprecise databases?
- Even operators are fuzzy (more possible worlds)
5Outline
- Beyond Traditional IR
- Beyond Traditional DB
- Conclusion
6Example XML Document (repeated from slide 3)
Find relevant elements in important workshops
between the years 1999 and 2001 that are about
Ricardo and XML
7Sources of Imprecision/Scores
- Query results
- XIRQL [Fuhr & Großjohann, SIGIR 2001]
- XRANK [Guo et al., SIGMOD 2003]
- XKSearch [Xu & Papakonstantinou, SIGMOD 2005]
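The systems cited above return XML elements rather than whole documents. A minimal sketch of that idea follows: it returns the most specific elements whose subtree contains every keyword. It is a brute-force illustration, not the indexed algorithms of those papers; the function names are hypothetical, and the document is the workshop excerpt from slide 3, trimmed and closed so that it parses.

```python
import re
import xml.etree.ElementTree as ET

# Workshop excerpt from slide 3, trimmed and closed so that it parses.
DOC = """
<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <proceedings>
    <paper id="1">
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
      <author>Gonzalo Navarro</author>
      <section name="Introduction">Searching on structured text is becoming more important with XML</section>
    </paper>
  </proceedings>
</workshop>
"""


def words(elem):
    """Lower-cased word set of all text in the element's subtree."""
    return set(re.findall(r"[a-z0-9]+", " ".join(elem.itertext()).lower()))


def most_specific_matches(root, keywords):
    """Elements that contain every keyword while none of their children do."""
    wanted = {k.lower() for k in keywords}

    def contains_all(elem):
        return wanted <= words(elem)

    return [
        elem
        for elem in root.iter()
        if contains_all(elem) and not any(contains_all(child) for child in elem)
    ]


root = ET.fromstring(DOC)
hits = most_specific_matches(root, ["Ricardo", "XML"])
print([hit.tag for hit in hits])
```

For the keywords "Ricardo" and "XML", the paper element is the smallest unit that covers both terms, so it is returned instead of the enclosing proceedings or workshop elements.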
9Data Model
Containment edge
Hyperlink edge
10PageRank [Brin & Page]
[Figure: graph with hyperlink edges and a node w. 1-d: probability of random jump]
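The random-surfer model on this slide can be sketched as a short power iteration. This is an illustrative version, assuming d is the probability of following a hyperlink (so 1-d is the random-jump probability named on the slide); the graph, the default d = 0.85, and the handling of pages without outgoing links are assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Power iteration for PageRank.

    links maps each page to the list of pages it links to.
    d is the probability of following a hyperlink; 1 - d is the
    probability of a random jump to any page.
    """
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - d) / n for v in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = d * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Assumption: a page without out-links spreads its rank uniformly.
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank


# Toy graph: pages a and b both link to w, and w links back to a.
ranks = pagerank({"a": ["w"], "b": ["w"], "w": ["a"]})
print(ranks)
```

In the toy graph, w collects rank from two pages and ends up ranked highest, while b, which nobody links to, keeps only its random-jump share.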
11ElemRank [Guo et al.]
[Figure: graph with hyperlink edges, containment edges and a node w. 1-d1-d2-d3: probability of random jump]
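A simplified sketch in the spirit of ElemRank follows. It assumes that d1 is the probability of following a hyperlink edge, d2 of moving down a containment edge to a child, and d3 of moving up a containment edge to the parent, which leaves 1-d1-d2-d3 for the random jump. The slide does not spell out these roles or the exact normalization, so the formula, the parameter values, and the element names below are illustrative assumptions rather than the definition from Guo et al.

```python
def elemrank(children, hyperlinks, d1=0.35, d2=0.25, d3=0.25, iterations=50):
    """ElemRank-style iteration over a graph of XML elements (simplified).

    children maps an element to its sub-elements (containment edges).
    hyperlinks maps an element to the elements it links to.
    The d1, d2, d3 defaults are illustrative values, not taken from the slides.
    """
    parent = {c: p for p, cs in children.items() for c in cs}
    nodes = sorted(
        set(children)
        | set(parent)
        | set(hyperlinks)
        | {v for targets in hyperlinks.values() for v in targets}
    )
    n = len(nodes)
    jump = (1.0 - d1 - d2 - d3) / n
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: jump for v in nodes}
        for u in nodes:
            for v in hyperlinks.get(u, []):
                new[v] += d1 * rank[u] / len(hyperlinks[u])  # follow a hyperlink
            for v in children.get(u, []):
                new[v] += d2 * rank[u] / len(children[u])    # descend to a child
            if u in parent:
                new[parent[u]] += d3 * rank[u]               # ascend to the parent
        rank = new
    return rank


# Hypothetical elements: a workshop with two papers; a citation in paper1 links to paper2.
children = {"workshop": ["paper1", "paper2"], "paper1": ["cite1"]}
hyperlinks = {"cite1": ["paper2"]}
ranks = elemrank(children, hyperlinks)
print(ranks)
```

Because rank flows along containment edges in both directions, an element benefits from the importance of both its parent and its sub-elements, which plain PageRank over hyperlinks alone would not capture.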
12Sources of Imprecision/Scores
- Query results
- XIRQL [Fuhr & Großjohann, SIGIR 2001]
- XRANK [Guo et al., SIGMOD 2003]
- XKSearch [Xu & Papakonstantinou, SIGMOD 2005]
- Scores based on structure
- XRANK (ElemRank), XIRQL (TF-IDF)
13INEX IEEE
SIGMOD Record
...
Find relevant elements in Shakespeare's plays about the process of speech
9 of the top-10 results for one repository were not in the top-10 results of the other repository! (XIRQL's TF-IDF scoring)
- Language of engineers is different from that of Shakespeare!
- "process" is common in INEX, "speech" is uncommon
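The effect described above comes from the IDF part of TF-IDF: the weight of a term depends on how common it is in the collection that supplies the statistics. The sketch below uses one common smoothed IDF variant and two tiny invented collections; XIRQL's exact weighting is not given on the slides, so both the formula and the data are assumptions.

```python
import math


def idf(term, documents):
    """Smoothed inverse document frequency of a term in a collection."""
    df = sum(1 for doc in documents if term in doc.lower().split())
    return math.log((1 + len(documents)) / (1 + df)) + 1.0


# Invented stand-ins for an engineering collection and a collection of plays.
engineering = [
    "the process scheduler starts each process",
    "process control for speech codecs",
    "a process model for software",
    "batch process monitoring",
]
plays = [
    "a speech before the senate",
    "his speech moved the crowd",
    "the king began his speech",
    "the slow process of speech",
]

for name, collection in (("engineering", engineering), ("plays", plays)):
    print(name, round(idf("process", collection), 3), round(idf("speech", collection), 3))
```

With engineering statistics "speech" is the discriminating term, while with statistics from the plays "process" is, so the same query ranks elements differently depending on the search context.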
14Sources of Imprecision/Scores
- Query results
- XIRQL [Fuhr & Großjohann, SIGIR 2001]
- XRANK [Guo et al., SIGMOD 2003]
- XKSearch [Xu & Papakonstantinou, SIGMOD 2005]
- Scores based on structure
- XRANK (ElemRank), XIRQL (TF-IDF)
- Scores based on search context
- Quark [Botev & Shanmugasundaram, WebDB 2005]
- PowerDB [Grabs & Schek, INEX 2005]
15Outline
- Beyond Traditional IR
- Beyond Traditional DB
- Conclusion
16Why not use RDBMS SQL/MM?
- Relations not natural for modeling XML text
- Flexible schema
- Predominantly text
- Hyperlinks
- No strict demarcation between structured and text data
- Can issue structured and text queries over same data
- Find books with year > 1995
- Find books containing keyword 1998
- Can embed structured queries in text queries
- Find books that contain the keywords that occur
in the title of Richard Dawkins books
17Current XML Query Languages
- Current XML query languages are mostly database languages
- Examples: XQuery, XPath
- Provide very rudimentary text/IR support
- fn:contains(e, keywords)
- Returns true iff element e contains keywords
- No support for complex IR queries
- Distance predicates, stemming, scoring, ...
18Example Queries
- From XQuery Full-Text Use Cases Document
- Find the titles of the books whose body contains the phrases "Usability" and "Web site" in that order, in the same paragraph, using stemming if necessary to match the tokens
- Find the titles of the books whose body contains "Usability" and "testing" within a window of 3 words, and return them in score order
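The window and stemming conditions in these use cases can be sketched as a predicate over token positions. This is an illustration only: the suffix-stripping stemmer is a crude stand-in for a real one, and the window is read as "all terms fit inside a span of at most N consecutive words", which is an assumption about the intended semantics. All function names are hypothetical.

```python
import re
from itertools import product


def tokenize(text):
    """Lower-cased alphanumeric tokens in document order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def positions(tokens, term, stem=False):
    """Indexes at which the term (optionally stemmed) occurs."""
    normalize = crude_stem if stem else (lambda t: t)
    target = normalize(term.lower())
    return [i for i, tok in enumerate(tokens) if normalize(tok) == target]


def within_window(text, terms, window, stem=False):
    """True if every term occurs and some occurrences fit in `window` words."""
    tokens = tokenize(text)
    position_lists = [positions(tokens, term, stem) for term in terms]
    if any(not plist for plist in position_lists):
        return False
    return any(max(combo) - min(combo) + 1 <= window for combo in product(*position_lists))


print(within_window("Usability testing of a Web site", ["Usability", "testing"], 3))
print(within_window("Usability is hard, and testing is harder", ["Usability", "testing"], 3))
print(within_window("Users tested usability early", ["Usability", "testing"], 3, stem=True))
```

A plain substring test such as fn:contains cannot express either condition, which is why the use cases call for dedicated full-text operators.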
19XQuery Full-Text
- Full-text search extension to XQuery
- W3C Working Draft
- XQuery Full-Text Evolution
- Quark query language
- [Botev & Shanmugasundaram, 2003]
- TeXQuery
- [Amer-Yahia, Botev, Shanmugasundaram, WWW 2004]
- XQuery Full-Text
- http://www.w3.org/TR/xquery-full-text
- Invited experts (Botev and Shanmugasundaram)
20Outline
- Beyond Traditional IR
- Beyond Traditional DB
- XQuery Full-Text
- Research Directions
- Conclusion
21Syntax Overview
- Two new XQuery constructs
- FTContainsExpr
- Expresses Boolean full-text search predicates
- Seamlessly composes with other XQuery expressions
- FTScoreClause
- Extension to FLWOR expression
- Can score FTContainsExpr and other expressions
22FTContainsExpr
- ContextExpr ftcontains FTSelection
- ContextExpr (any XQuery expression) is context spec
- FTSelection is search spec
- Returns true iff at least one node in ContextExpr satisfies the FTSelection
- Examples
- //book ftcontains "Usability" && "testing" distance 5
- //book[./content ftcontains "Usability" with stems]/title
- //book ftcontains /article[author="Dawkins"]/title
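The existential semantics stated above (true iff at least one context node satisfies the selection) can be mimicked in a few lines. This is a sketch of the semantics only, not of an XQuery Full-Text engine; the sample library document and the helper names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample data for the illustration.
LIBRARY = ET.fromstring("""
<library>
  <book><title>Web Design</title><content>Usability testing for every Web site</content></book>
  <book><title>Databases</title><content>Query processing and indexing</content></book>
</library>
""")


def has_all_words(*search_words):
    """Build a selection that holds when the text contains every given word."""
    wanted = [w.lower() for w in search_words]

    def selection(text):
        tokens = text.lower().split()
        return all(w in tokens for w in wanted)

    return selection


def ftcontains(context_nodes, selection):
    """True iff at least one context node satisfies the selection."""
    return any(selection(" ".join(node.itertext())) for node in context_nodes)


books = LIBRARY.findall(".//book")

# Roughly: //book ftcontains "Usability" && "testing"
any_match = ftcontains(books, has_all_words("Usability", "testing"))

# Roughly: //book[./content ftcontains "Usability"]/title
titles = [
    book.findtext("title")
    for book in books
    if ftcontains(book.findall("./content"), has_all_words("Usability"))
]
print(any_match, titles)
```

Because the predicate is an ordinary Boolean, it can sit inside a path filter like any other condition, which is what lets FTContainsExpr compose with the rest of the query.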
23FTScore Clause
In any order
- FOR $v [SCORE $s]? IN [FUZZY] Expr
- LET
- WHERE
- ORDER BY
- RETURN
- Example
- FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing"]
- ORDER BY $s
- RETURN <result score="{$s}"> {$b} </result>
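The example above binds each matching book to a score and orders the results by it. A rough emulation follows, assuming a toy score (the fraction of a book's tokens that are query terms) and highest-score-first ordering; the slides do not define the scoring function or the sort direction, so both are assumptions, as is the sample data.

```python
import xml.etree.ElementTree as ET

# Made-up sample data for the illustration.
PUB = ET.fromstring(
    "<pub>"
    "<book><title>A</title><body>usability testing and more usability testing</body></book>"
    "<book><title>B</title><body>a note on usability testing</body></book>"
    "<book><title>C</title><body>query optimization</body></book>"
    "</pub>"
)


def score(book, terms):
    """Toy score: share of the book's tokens that are query terms, 0 if a term is missing."""
    tokens = " ".join(book.itertext()).lower().split()
    if not all(term in tokens for term in terms):
        return 0.0
    return sum(tokens.count(term) for term in terms) / len(tokens)


def ranked_books(pub, terms):
    """Emulates: FOR $b SCORE $s in /pub/book[. ftcontains ...] ORDER BY $s RETURN ..."""
    terms = [t.lower() for t in terms]
    scored = [(score(book, terms), book) for book in pub.findall("book")]
    hits = [(s, book) for s, book in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)  # assumption: best score first
    return [(book.findtext("title"), s) for s, book in hits]


results = ranked_books(PUB, ["Usability", "testing"])
print(results)
```

Book C is filtered out by the Boolean predicate, and the remaining books are returned with their scores attached, much like the result elements constructed in the RETURN clause.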
24FTScore Clause
In any order
- FOR $v [SCORE $s]? IN [FUZZY] Expr
- LET
- WHERE
- ORDER BY
- RETURN
- Example
- FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing" and ./price < 10.00]
- ORDER BY $s
- RETURN $b
25FTScore Clause
In any order
- FOR $v [SCORE $s]? IN [FUZZY] Expr
- LET
- WHERE
- ORDER BY
- RETURN
- Example
- FOR $b SCORE $s in FUZZY /pub/book[. ftcontains "Usability" && "testing"]
- ORDER BY $s
- RETURN $b
26Outline
- Beyond Traditional IR
- Beyond Traditional DB
- XQuery Full-Text
- Research Directions
- Conclusion
27System Architecture
Integration Layer
XQuery Engine
IR Engine
28System Architecture
XQuery IR Engine
Quark @ Cornell uses this architecture
29Structural Relaxation
- FOR $b SCORE $s in FUZZY /pub/book[. ftcontains "Usability" with stems]
- ORDER BY $s
- RETURN $b
30Search Over Views
Data Source 1
<books> <book> </book> <book> </book> </books>
Data Source 2
<reviews> <review> </review> <review> </review> </reviews>
Integrated View
<book> <reviews> </reviews> </book>
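A small sketch of keyword search over such an integrated view follows. It materializes the view by nesting each book's reviews under the book and then searches the merged elements. The isbn join key, the sample data, and the function names are assumptions made for the illustration; the slide does not say how books and reviews are matched, nor that the view must be materialized.

```python
import copy
import xml.etree.ElementTree as ET

# Made-up sources; the isbn attribute is a hypothetical join key.
BOOKS = ET.fromstring(
    '<books><book isbn="1"><title>Usability Basics</title></book>'
    '<book isbn="2"><title>XML Indexing</title></book></books>'
)
REVIEWS = ET.fromstring(
    '<reviews><review isbn="1">A clear guide to testing</review>'
    '<review isbn="2">Dense but thorough</review></reviews>'
)


def integrated_view(books, reviews):
    """Nest each book's reviews under a copy of the book element."""
    view = []
    for book in books.findall("book"):
        merged = copy.deepcopy(book)
        nested = ET.SubElement(merged, "reviews")
        for review in reviews.findall("review"):
            if review.get("isbn") == book.get("isbn"):
                nested.append(copy.deepcopy(review))
        view.append(merged)
    return view


def search(elements, keywords):
    """Elements whose subtree text contains every keyword."""
    wanted = [k.lower() for k in keywords]
    return [
        elem
        for elem in elements
        if all(k in " ".join(elem.itertext()).lower().split() for k in wanted)
    ]


view = integrated_view(BOOKS, REVIEWS)
view_hits = search(view, ["usability", "testing"])
source_hits = search(BOOKS.findall("book"), ["usability", "testing"])
print([b.get("isbn") for b in view_hits], len(source_hits))
```

The query matches only in the view, because one keyword comes from the book title and the other from a review, which is what makes search over views different from searching each source separately.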
31Outline
- Beyond Traditional IR
- Beyond Traditional DB
- Conclusion
32 10,000-Foot View of Data Management
[Figure: two-axis chart. Queries axis: Ranked Search, Complex and Structured. Data axis: Structured, Unstructured. Regions: Information Retrieval Systems, Database Systems]