Title: XQuery and Information Retrieval
1. XQuery and Information Retrieval
- Jayavel Shanmugasundaram
- Cornell University
- (Invited Expert, XQuery Full-Text Task Force)
2. Motivation
- A key benefit of XML is its ability to represent a mix of structured and unstructured (text) data
- Applications
  - Digital libraries
  - Content management
- Many such XML repositories are already available
  - IEEE INEX collection
  - Library of Congress documents
  - Shakespeare's plays
  - SIGMOD, DBLP, ...
3. Example XML Document
<book isbn="3647593" year="1995">
  <author>Elina Rose</author>
  <content>
    <para>The usability of software measures how well the software
      provides support for quickly achieving specified goals.</para>
    <para>The users must not only be well served but must feel well
      served.</para>
  </content>
</book>
4. Current Query Languages
- Current XML query languages are mostly database languages
  - Examples: XQuery, XPath
- Provide very rudimentary text/IR support
  - fn:contains(e, keywords)
  - Returns true iff the string value of element e contains keywords as a substring
- No support for complex IR queries
  - Distance predicates, stemming, scoring, ...
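To make the limitation concrete, here is a minimal Python sketch (not XQuery) contrasting a substring test in the spirit of fn:contains with a whole-word test. The names `substring_contains`, `tokenize`, and `token_contains` are hypothetical, and the tokenizer is a naive assumption rather than what any real engine does.

```python
import re

def substring_contains(text, keyword):
    """Case-sensitive substring test, in the spirit of fn:contains."""
    return keyword in text

def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_contains(text, keyword):
    """Whole-word, case-insensitive test of the kind a text search expects."""
    return keyword.lower() in tokenize(text)

para = "The usability of software measures how well the software provides support."

# "soft" is a substring of "software" but not a word in the paragraph.
false_positive = substring_contains(para, "soft") and not token_contains(para, "soft")

# "Usability" differs from "usability" only by case.
missed_by_substring = token_contains(para, "Usability") and not substring_contains(para, "Usability")
```

Neither function can express distance, stemming, or scoring, which is the gap the full-text extension targets.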
5. Example Queries
- From the XQuery Full-Text Use Cases document
  - Find the titles of the books whose body contains the phrases "Usability" and "Web site" in that order, in the same paragraph, using stemming if necessary to match the tokens
  - Find the titles of the books whose body contains "Usability" and "testing" within a window of 3 words, and return them in score order
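The window condition in the second query can be sketched in Python as follows. This is an illustration under stated assumptions, not the semantics of the working draft: `within_window` is a hypothetical helper, tokens are naive lowercase words, and a "window of n words" is taken to mean that one occurrence of each keyword fits inside n consecutive tokens.

```python
import re
from itertools import product

def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(tokens, word):
    """Token positions at which `word` occurs."""
    return [i for i, t in enumerate(tokens) if t == word]

def within_window(text, words, size):
    """True if one occurrence of every word fits in `size` consecutive tokens."""
    tokens = tokenize(text)
    occurrence_lists = [positions(tokens, w.lower()) for w in words]
    # Try every combination of one occurrence per word; an empty list
    # (a missing word) yields no combinations, so the result is False.
    return any(max(combo) - min(combo) + 1 <= size
               for combo in product(*occurrence_lists))
```

For example, `within_window("Usability testing is hard", ["Usability", "testing"], 3)` holds because the two words are adjacent, while the same keywords seven tokens apart would fail a window of 3.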
6. Why not use SQL/MM (or a variant)?
- Key difference: no strict demarcation between structured and text data in XML
  - Can issue structured and text queries over the same data
    - Find books with year > 1995
    - Find books containing the keyword "1998"
  - Can embed structured queries in text queries
    - Find books that contain the keywords that occur in the titles of Richard Dawkins' books
- Other important differences
  - XML/XQuery data model
  - Composability of full-text primitives
7. Outline
- XQuery Full-Text Language
- Research Challenges
- Conclusion
8. XQuery Full-Text
- Full-text search extension to XQuery
  - W3C Working Draft
- Tightly integrated with the XQuery data model
  - Provides a well-defined model for reasoning about full-text operations and integration with XQuery
- Composability
  - Fully composable full-text primitives, including Boolean connectives, distance predicates, stemming
  - Can embed XQuery Full-Text primitives in XQuery and vice versa
- Flexible scoring construct
9. XQuery Full-Text Evolution
- 2002: Quark Full-Text Language (Cornell)
- 2003: IBM, Microsoft, Oracle proposals; TeXQuery (Cornell, AT&T)
- 2004: XQuery Full-Text
- 2005: XQuery Full-Text (Second Draft)
10. Design Goals
- Should be able to specify the following
  - Context spec: defines the nodes over which full-text search is to be performed, e.g., book chapters
  - Return spec: defines the nodes that are to be returned to users, e.g., book titles
  - Search spec: defines the full-text search condition, e.g., and, or, proximity, stemming
  - Scoring spec: defines how results are to be scored, e.g., how keywords should be weighted
11. Syntax Overview
- Two new XQuery constructs
  - FTContainsExpr
    - Expresses Boolean full-text search predicates
    - Seamlessly composes with other XQuery expressions
  - FTScoreClause
    - Extension to the FLWOR expression
    - Can score FTContainsExpr and other expressions
12. Outline
- XQuery Full-Text Language
  - FTContainsExpr
  - FTScoreClause
- Research Challenges
- Conclusion
13. FTContainsExpr
- Like other XQuery expressions
  - Takes in sequences of items (nodes) as input
  - Produces a sequence of items (nodes) as output
- Can seamlessly compose with other XQuery expressions
[Diagram: an XQuery expression evaluates to a sequence of items]
14. FTContainsExpr
- ContextExpr ftcontains FTSelection
  - ContextExpr (any XQuery expression) is the context spec
  - FTSelection is the search spec
  - Returns true iff at least one node in ContextExpr satisfies the FTSelection
- Examples
  - //book ftcontains "Usability" && "testing" distance 5
  - //book[./content ftcontains "Usability" with stems]/title
  - //book ftcontains /article[author="Dawkins"]/title
15. FTSelection
- Encapsulates all full-text conditions in FTContainsExpr
- Works in a new data model called AllMatch
  - Operates on positions within XML nodes (more fine-grained than the XQuery data model)
- Fully composable, similar to composition of relational (and XML) operators!
[Diagram: an FTSelection evaluates to an AllMatch]
16. FTSelection Composability
- "Usability"
- /book[author="Dawkins"]/title
- "Usability" && /book[author="Dawkins"]/title
- ("Usability" && /book[author="Dawkins"]/title) same sentence
- ("Usability" && /book[author="Dawkins"]/title) same sentence window 5
- All of these evaluate to an AllMatch!
  - Allows arbitrary composition of full-text primitives
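The composability claim can be illustrated with a deliberately simplified Python model. It is a sketch under loud assumptions: here an AllMatch is just a list of matches, each match a set of token positions, and the operator names `ft_word`, `ft_and`, `ft_or`, `ft_window`, and `ft_contains` are hypothetical. The draft's actual AllMatch model is richer, and this sketch does not cover nested XQuery expressions such as /book[author="Dawkins"]/title as search operands.

```python
import re
from itertools import product

def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ft_word(tokens, word):
    """AllMatch for one keyword: one single-position match per occurrence."""
    return [frozenset([i]) for i, t in enumerate(tokens) if t == word.lower()]

def ft_and(left, right):
    """Conjunction: combine every left match with every right match."""
    return [a | b for a, b in product(left, right)]

def ft_or(left, right):
    """Disjunction: any match from either side."""
    return left + right

def ft_window(all_match, size):
    """Keep only matches whose positions span at most `size` tokens."""
    return [m for m in all_match if max(m) - min(m) + 1 <= size]

def ft_contains(all_match):
    """The Boolean result: true iff at least one match survives."""
    return len(all_match) > 0

tokens = tokenize("Usability testing improves software; testing early helps usability")
both = ft_and(ft_word(tokens, "usability"), ft_word(tokens, "testing"))
near = ft_window(both, 2)
```

Because every operator takes and returns the same structure, `ft_window(ft_and(...), 2)` composes just like nested relational operators; only the final `ft_contains` collapses the result to a Boolean.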
17. FTContextModifier
- Can be applied on any FTSelection to specify aspects such as stemming, thesauri, case, etc.
- Fully composable with other context modifiers and FTSelections
- Examples
  - "Usability" && "testing" with stems
  - "Usability" && "testing" with stems window 5 without stop words
  - "Usability" && "testing" with stems window 5 without stop words case insensitive
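As a rough illustration of how modifiers such as "with stems" and "case insensitive" change what counts as a match, here is a small Python sketch. Everything in it is an assumption for illustration: `naive_stem` is a toy suffix stripper (real engines use a proper stemmer), and `normalize` and `matches` are hypothetical names. Stop words and thesauri are left out.

```python
import re

def naive_stem(token):
    """Toy suffix stripper (an assumption), not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize(token, stems=False, case_insensitive=True):
    """Apply the selected modifiers to one token."""
    if case_insensitive:
        token = token.lower()
    if stems:
        token = naive_stem(token)
    return token

def matches(text, keyword, **options):
    """True if some token equals the keyword after both are normalized."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    target = normalize(keyword, **options)
    return any(normalize(t, **options) == target for t in tokens)
```

With `stems=True`, the keyword "testing" matches the token "tested" because both reduce to "test" under this toy stemmer; with `case_insensitive=False`, "usability" no longer matches "Usability".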
18. Outline
- XQuery Full-Text Language
  - FTContainsExpr
  - FTScoreClause
- Research Challenges
- Conclusion
19. FTScoreClause
- Two alternatives
  - Both are extensions to the FLWOR expression
- Alternative 1
  - Score Boolean XQuery expressions, including FTContainsExpr
  - Current working draft syntax
- Alternative 2
  - Score arbitrary XQuery expressions
  - Under discussion
20. Alternative 1
- FOR
- LET
- SCORE $var AS Expr (Expr returns Boolean)
- WHERE
- ORDER BY
- RETURN
- Example
  FOR $b in /pubs/book
  SCORE $s AS $b ftcontains "software" weight 0.8 && "testing" weight 0.2
  ORDER BY $s
  RETURN <result score="{$s}"> {$b} </result>
[Slide annotation on the clause list: in any order]
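The slide does not say how a score is computed from the weights, so the following Python sketch shows just one hypothetical scoring function for a conjunction of weighted keywords. The function name `weighted_score`, the use of relative term frequency, and the choice to return 0 when a keyword is missing are all assumptions for illustration, not the draft's definition.

```python
import re

def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def weighted_score(text, weighted_terms):
    """Hypothetical score for a conjunction of weighted keywords.

    Returns 0.0 if any keyword is absent (the Boolean condition fails);
    otherwise the weight-averaged relative term frequency, in (0, 1].
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    total_weight = sum(weighted_terms.values())
    score = 0.0
    for term, weight in weighted_terms.items():
        count = tokens.count(term.lower())
        if count == 0:
            return 0.0
        score += weight * count / len(tokens)
    return score / total_weight

# Toy stand-ins for /pubs/book; the contents are made up for the example.
books = {
    "a": "software software testing",
    "b": "software testing testing",
    "c": "software design",
}
terms = {"software": 0.8, "testing": 0.2}

# Analogue of ORDER BY $s, highest score first.
ranked = sorted(books, key=lambda b: weighted_score(books[b], terms), reverse=True)
```

Book "a" ranks above "b" because the more heavily weighted keyword occurs more often in it, and "c" scores 0 because it fails the Boolean condition.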
21. Alternative 1
- FOR
- LET
- SCORE $var AS Expr (Expr returns Boolean)
- WHERE
- ORDER BY
- RETURN
- Example
  FOR $b in /pubs/book
  SCORE $s AS $b/price < 10.00
  ORDER BY $s
  RETURN <result score="{$s}"> {$b} </result>
[Slide annotation on the clause list: in any order]
22. Alternative 1: Analysis
- Not powerful enough for some XML IR queries
  - Case study: XML INEX initiative
- Want to relax /pubs/book (in addition to full-text predicates)
  - Boolean scoring expressions are insufficient
- Example: /pubs/book[. ftcontains "Usability" && "testing"]
23. Alternative 2
[Slide annotation on the clause list: in any order]
- FOR $v [SCORE $s]? [AT $i]? IN [FUZZY] Expr
- LET
- WHERE
- ORDER BY
- RETURN
- Example
  FOR $b SCORE $s in
    /pub/book[. ftcontains "Usability" && "testing"]
  ORDER BY $s
  RETURN <result score="{$s}"> {$b} </result>
24. Alternative 2
[Slide annotation on the clause list: in any order]
- FOR $v [SCORE $s]? [AT $i]? IN [FUZZY] Expr
- LET
- WHERE
- ORDER BY
- RETURN
- Example
  FOR $b SCORE $s in FUZZY
    /pub/book[. ftcontains "Usability" && "testing"]
  ORDER BY $s
  RETURN <result score="{$s}"> {$b} </result>
25. Outline
- XQuery Full-Text Language
- Research Challenges
- Conclusion
26. Challenge 1: System Architecture
[Diagram components: Integration Layer, XQuery Engine, IR Engine]
27. Challenge 1: System Architecture
[Diagram component: XQuery IR Engine (a single combined engine)]
28. Challenge 2: Structural Relaxation
FOR $b SCORE $s in FUZZY
  /pub/book[. ftcontains "Usability" with stems]
ORDER BY $s
RETURN <result score="{$s}"> {$b} </result>
29. Challenge 3: Search Over Views
LET $bookrevs := FOR $book IN //book
                 RETURN <bookrevs>
                          {$book}
                          <reviews>
                            {FOR $rev IN //review
                             WHERE $rev/bookid = $book/id
                             RETURN $rev}
                          </reviews>
                        </bookrevs>
FOR $bookrev IN $bookrevs
SCORE $score AS $bookrev ftcontains "Usability" with stems
ORDER BY $score
RETURN $bookrev
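A small Python sketch of what the query above asks for: each book is joined with its reviews, and the keyword search runs over the combined element. The data, field names, and the helpers `build_view`, `view_text`, and `search_view` are all made up for illustration. The sketch takes the naive route of materializing the view first; doing the same efficiently without materializing it is presumably where the research challenge lies.

```python
import re

# Toy stand-ins for //book and //review (hypothetical data).
books = [
    {"id": 1, "title": "Web Design", "content": "Layout and color."},
    {"id": 2, "title": "Testing", "content": "Unit tests."},
]
reviews = [
    {"bookid": 1, "text": "Great coverage of usability."},
    {"bookid": 2, "text": "Too short."},
]

def build_view(books, reviews):
    """Materialize the bookrevs view: each book with its matching reviews."""
    return [
        {"book": b, "reviews": [r for r in reviews if r["bookid"] == b["id"]]}
        for b in books
    ]

def view_text(entry):
    """All text of one bookrevs element: the book plus its reviews."""
    parts = [entry["book"]["title"], entry["book"]["content"]]
    parts += [r["text"] for r in entry["reviews"]]
    return " ".join(parts)

def search_view(view, keyword):
    """Ids of books whose bookrevs element contains the keyword as a word."""
    kw = keyword.lower()
    return [
        e["book"]["id"]
        for e in view
        if kw in re.findall(r"[a-z0-9]+", view_text(e).lower())
    ]

hits = search_view(build_view(books, reviews), "usability")
```

Book 1 is returned even though "usability" appears only in its review, not in the book itself, so the search cannot be answered from the base book data alone.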
30. Outline
- XQuery Full-Text Language
- Research Challenges
- Conclusion
31. Conclusion
- Unified querying of structured data and text is one of the most promising benefits of XML
  - XQuery Full-Text is a language designed to enable this
- Many research challenges
  - System implementation
  - Scoring
  - Requirements of a new class of applications
- Starting to see research prototypes
  - Quark (open-source software, Cornell)
  - GalaTeX (reference implementation, AT&T)