Title: Module 7
1 Module 7
- XML and Information Retrieval
- (XQuery Full-Text, Research)
2 The World is Flat (Friedman)
- Differences between Switzerland and Bangladesh are disappearing
- Of the traditional drivers (talent, origin, and luck), talent will dominate
- Individuals (not countries) compete
- Chocolate sauce today, vanilla tomorrow
- No premium for strength
- No premium for memory
- Premium for collaboration, communication
- Premium for understanding, adaptivity
- Premium for creativity
- Good for Computer Scientists (lazy greedy)
3 The World is Flat (XML)
- Remove barriers between machines
- Machines talk to machines
- communication, processing, storage
- One data model
- design and implementation
- Declarative programming, pay as you go along
- data and meta-data, structured/unstructured
- One data model
- here there, today tomorrow
- Decouple data from schema / interpretation
4 References
- XQuery 1.0 and XPath 2.0 Full-Text
- http://www.w3.org/XML/Query
- Latest version: November 2005
- Still work in progress!
- S. Amer-Yahia, J. Shanmugasundaram: XML Full-Text Search: Challenges and Opportunities
- Tutorial for VLDB Conf., August 2005
5 Motivation
- A key benefit of XML is its ability to represent a mix of structured and unstructured (text) data
- Applications
- Digital libraries
- Content management
- Many such XML repositories already available
- IEEE INEX collection
- Library of Congress documents
- Shakespeare's plays
- SIGMOD, DBLP,
6 XML in Library of Congress
http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml
- <bill bill-stage="Introduced-in-House">
- <congress>109th CONGRESS</congress> <session>1st Session</session>
- <legis-num>H. R. 2739</legis-num>
- <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
- <action>
- <action-date date="20050526">May 26, 2005</action-date>
- <action-desc><sponsor name-id="T000266">Mr. Tierney</sponsor> (for himself, <cosponsor name-id="M001143">Ms. McCollum of Minnesota</cosponsor>, <cosponsor name-id="M000725">Mr. George Miller of California</cosponsor>) introduced the following bill which was referred to the <committee-name committee-id="HED00">Committee on Education and the Workforce</committee-name>
- </action-desc>
- </action>
7 THOMAS Library of Congress
8 INEX Data
- <article> <fno>K0271</fno> <doi>10.1041/K0271s-2004</doi>
- <fm> <hdr><hdr1><ti>IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING</ti> <crt>
- <issn>1041-4347</issn>/04/20.00 &copy; 2004 IEEE Published by the IEEE Computer Society</crt></hdr1><hdr2><obi><volno>Vol. 16</volno>, <issno>No. 2</issno></obi> <pdt><mo>FEBRUARY</mo><yr>2004</yr></pdt>
- <pp>pp. 271-288</pp></hdr2> </hdr> <tig><atl>A Graph-Based Approach for Timing Analysis and Refinement of OPS5 Knowledge-Based Systems</atl><pn>pp. 271-288</pn><ref rid="K02711aff" type="aff"></ref></tig>
- <au sequence="first"><fnm>Albert Mo Kim</fnm><snm> <ref aid="K0271a1" type="prb">Cheng</ref></snm><role>Senior Member</role><aff><onm>IEEE</onm></aff></au><au sequence="additional"><fnm>Hsiu-yen</fnm><snm> Tsai</snm></au>
- <abs><p><b>Abstract</b>&mdash;This paper examines the problem of predicting the timing behavior of knowledge-based systems for real-
9 Current Query Languages
- Current XML query languages are mostly database languages
- Examples: XQuery, XPath
- Provide very rudimentary text/IR support
- fn:contains(e, keywords)
- Returns true iff element e contains keywords
- No support for complex IR queries
- Distance predicates, stemming, scoring,
10 Example Queries
- XQuery Full-Text Use Cases Document
- Find the titles of the books whose body contains the phrases "Usability" and "Web site" in that order, in the same paragraph, using stemming if necessary to match the tokens
- Find the titles of the books whose body contains "Usability" and "testing" within a window of 3 words, and return them in score order
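The window condition in the second use case can be illustrated with a minimal Python sketch. It is not the XQuery Full-Text semantics itself: the tokenizer, the helper names `tokenize` and `within_window`, and the reading of "window of n words" as "both words fall inside some run of n consecutive tokens" are assumptions made for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def within_window(tokens, word_a, word_b, window):
    """Return True if word_a and word_b both occur inside some run of
    `window` consecutive tokens (assumed reading of 'window of n words')."""
    positions_a = [i for i, t in enumerate(tokens) if t == word_a]
    positions_b = [i for i, t in enumerate(tokens) if t == word_b]
    return any(abs(i - j) + 1 <= window
               for i in positions_a for j in positions_b)


near = tokenize("Usability and testing of web sites")
far = tokenize("Usability of the site requires testing")
print(within_window(near, "usability", "testing", 3))
print(within_window(far, "usability", "testing", 3))
```

A plain substring test such as fn:contains cannot express this kind of positional condition, which is one motivation for a dedicated full-text extension.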
11 Example INEX Query
- <inex_topic topic_id="275" query_type="CAS">
- <castitle>//article[about(.//abs, "data mining")]//sec[about(., "frequent itemsets")]</castitle>
- <description>sections about frequent itemsets from articles with abstract about data mining</description>
- <narrative>To be relevant, a component has to be a section about "frequent itemsets". For example, it could be about algorithms for finding frequent itemsets, or uses of frequent itemsets to generate rules. Also, the article must have an abstract about "data mining". I need this information for a paper that I am writing. It is a survey of different algorithms for finding frequent itemsets. The paper will also have a section on why we would want to find frequent itemsets.</narrative>
- </inex_topic>
12 Grand Challenge Queries (Kossmann 98)
- Which ETH professor plays field hockey and has developed an efficient algorithm for computing the Pareto curve?
- Who copied my IDP complexity proof?
- Who is in this photo?
- In which play does the ambitious wife drive her husband to murder?
13 Why not use SQL/MM?
- Key difference: No strict demarcation between structured and text data in XML
- Can issue structured and text queries over same data
- Find books with year > 1995
- Find books containing keyword "1998"
- Can embed structured queries in text queries
- Find books that contain the keywords that occur in the title of Richard Dawkins' books
- Other important differences
- XML/XQuery data model
- Composability of full-text primitives
14 Challenges in XML FT Search
- Searching over Semi-Structured Data
- Users may specify a search context and return context.
- Expressive Power and Extensibility
- Users should be able to express complex full-text searches and combine them with structural searches.
- Scores and Ranking
- Users may specify a scoring condition, possibly over both full-text and structured predicates, and obtain top-k results based on query relevance scores.
- The language should allow for an efficient implementation.
15 XML FT Search Definition
- Context expression: XML elements searched
- pre-defined XML nodes.
- XPath/XQuery queries.
- Return expression: XML fragments returned
- pre-defined meaningful XML fragments.
- XPath/XQuery to build answers.
- Search expression: FT search conditions
- Boolean keyword search.
- proximity distance, scoping, thesaurus, stop words, stemming.
- Score expression
- system-defined scoring function.
- user-defined scoring function.
- query-dependent keyword weights.
16 Granularity of Results
- Keyword queries
- compute possibly different scores for LCAs.
- Tag + Keyword queries
- compute scores based on tags and keywords.
- Path Expression + Keyword queries
- compute scores based on paths and keywords.
- XQuery + Complex full-text queries
- compute scores for (newly constructed) XML fragments satisfying XQuery (structural, full-text and scalar conditions).
17 Four Classes of Languages
- Keyword search (INEX Content-Only Queries)
- "book xml"
- Tag + Keyword search
- book: xml
- Path Expression + Keyword search
- /book[./title about "xml db"]
- XQuery + Complex full-text search
- for $b in /book let score $s := $b ftcontains "xml" && "db" distance 5
18 XPath (W3C 2005)
- Special function in XQuery for keyword search. (First proposed by Florescu, Kossmann 2000)
- fn:contains(e, string) returns true iff e contains string
- What happens if string is generated by an expression that returns a sequence of strings?
- Does not allow specification of stemming, stop words, scoring, etc.
- Example: //section[fn:contains(./title, "XML Indexing")]
19 XQuery Full-Text
- Full-text search extension to XQuery
- W3C Working Draft
- Tightly integrated with the XQuery data model
- Provides well defined model for reasoning about full-text operations and integration with XQuery
- Composability
- Fully composable full-text primitives, including Boolean connectives, distance predicates, stemming
- Can embed XQuery Full-Text primitives in XQuery and vice versa
- Flexible scoring construct
- AllMatches Data Model, Tokenization!
20 XQuery Full-Text Evolution
- 2002: Quark Full-Text Language (Cornell)
- 2003: IBM, Microsoft, Oracle proposals; TeXQuery (Cornell, AT&T)
- 2004: XQuery Full-Text
- 2005: XQuery Full-Text (Second Draft)
21 Syntax Overview
- Two new XQuery constructs
- FTContainsExpr
- Expresses Boolean full-text search predicates
- Seamlessly composes with other XQuery expressions
- FTScoreClause
- Extension to FLWOR expression
- Can score FTContainsExpr and other expressions
22 FTContainsExpr
- Like other XQuery expressions
- Takes in a sequence of items (nodes) as input
- Produces a sequence of items (nodes) as output
- Seamlessly composes with other XQuery exprs.
- Do not confuse with fn:contains function!
- Figure: an XQuery expression evaluates to a sequence of items
23 FTContainsExpr
- ContextExpr ftcontains FTSelection
- ContextExpr (any XQuery expression) is context spec
- FTSelection is search spec
- Returns true iff at least one node in ContextExpr satisfies the FTSelection
- Examples
- //book ftcontains "Usability" && "testing" distance 5
- //book[./content ftcontains "Usability" with stems]/title
- //book ftcontains /article[author="Dawkins"]/title
24 FTSelection
- Encapsulates all full-text conditions in FTContainsExpr
- Works in a new data model called AllMatch
- Operates on positions within XML nodes (more fine grained than XQuery data model)
- Fully composable: similar to composition of relational (and XML) operators!
- Figure: an FTSelection evaluates to an AllMatch
25 FTSelection Examples
- "Usability"
- /book[author="Dawkins"]/title
- "Usability" && /book[author="Dawkins"]/title
- ("Usability" && /book[author="Dawkins"]/title) same sentence
- ("Usability" && /book[author="Dawkins"]/title) same sentence window 5
- All of these evaluate to an AllMatch!
- Allows arbitrary composition of full-text primitives
26 AllMatches Data Model
- Tokenization: Extend XQuery data model to represent each individual word (not just string or text). Each word is represented as a token
- AllMatches: Represent the results of FTSelections
- The following TokenInfo is kept for each match
- Word: string (the matching word itself)
- Pos: integer (position of word within document)
- Para: integer (position of paragraph)
- Sentence: integer (position of sentence)
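A minimal Python sketch of the TokenInfo record and a tokenizer that fills it in. The field names follow the slide; the splitting rules (blank line ends a paragraph, `.`, `!` or `?` ends a sentence) and the helper names `tokenize` and `matches` are assumptions for illustration, not part of the XQuery Full-Text draft.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenInfo:
    word: str      # the matching word itself
    pos: int       # position of the word within the document (1-based)
    para: int      # number of the paragraph containing the word
    sentence: int  # number of the sentence containing the word


def tokenize(text):
    """Turn plain text into TokenInfo records (assumed splitting rules)."""
    tokens = []
    pos = 0
    sentence_no = 0
    paragraphs = re.split(r"\n\s*\n", text.strip())
    for para_no, para in enumerate(paragraphs, start=1):
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            words = re.findall(r"\w+", sent)
            if not words:
                continue
            sentence_no += 1
            for w in words:
                pos += 1
                tokens.append(TokenInfo(w.lower(), pos, para_no, sentence_no))
    return tokens


def matches(tokens, word):
    """All TokenInfo records for one search word (case-insensitive)."""
    return [t for t in tokens if t.word == word.lower()]


doc = "Usability matters. Testing helps usability.\n\nWeb sites need testing."
for info in matches(tokenize(doc), "testing"):
    print(info)
```

With positions, paragraph numbers, and sentence numbers attached to every match, conditions such as "same sentence" or "window 5" reduce to comparisons on these integers.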
27 FTContextModifier
- Can be applied on any FTSelection to specify aspects such as stemming, thesauri, case, etc.
- Fully composable with other context modifiers and FTSelections
- Examples
- "Usability" && "testing" with stems
- "Usability" && "testing" without stop words
- "Usability" && "testing" case insensitive
28 Porter Algorithm for Stemming
- Transform the word (sequence of vowels and consonants) to a stem
- Works for the English language
- Applies a set of heuristics, e.g.,
- Plural: sses -> ss; ies -> i
- Tenses: eed -> ee (agreed -> agree); ed -> ε (empty string)
- Use thesauri to separate composite words
- Particularly useful in German
- Schwimmvogel -> Schwimm, Vogel
- Stop Words: Lists are available on Internet
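The suffix heuristics listed on this slide can be sketched in Python as follows. This covers only the plural rules and the two tense rules mentioned above (roughly Porter's steps 1a and part of 1b); the real algorithm has further steps and clean-up rules that are omitted here, and the function names are chosen for illustration.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number of vowel-to-consonant transitions (Porter's measure m)."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def strip_plural(word):
    """Plural rules: sses -> ss, ies -> i, ss -> ss, s -> (empty)."""
    if word.endswith("sses") or word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def strip_tense(word):
    """Tense rules: eed -> ee if the stem has measure > 0; ed -> (empty) if the stem has a vowel."""
    if word.endswith("eed"):
        return word[:-1] if measure(word[:-3]) > 0 else word
    if word.endswith("ed") and contains_vowel(word[:-2]):
        return word[:-2]
    return word


def stem(word):
    return strip_tense(strip_plural(word.lower()))


for w in ["caresses", "ponies", "agreed", "feed", "plastered", "bled"]:
    print(w, "->", stem(w))
```

The guard conditions matter: without the measure check "feed" would become "fee", and without the vowel check "bled" would become "bl".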
29 Full-Text Scoring
- Score value should reflect relevance of answer to user query.
- Higher scores imply a higher degree of relevance.
- Queries return document fragments.
- Granularity of returned results affects scoring.
- For queries containing conditions on structure,
- structural conditions may affect scoring.
- Existing proposals extend common scoring methods (standard does not care!)
- probabilistic or vector-based similarity.
30 Vector Space Model
- Consider document as a vector of weights
- One weight per word: tf * idf
- Term frequency (tf)
- Inverse document frequency (idf)
- Consider query as a vector of weights
- One weight per word in query: tf * idf
- Compute similarity of vectors of doc and query
- Textbook: cosine similarity
- Black art in each search engine
- Google: PageRank, based on random walk
- Goal: Maximize Precision and Recall
- Defined by humans! (AI-complete, no rigorous approach)
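A minimal Python sketch of the textbook variant: tf * idf weights and cosine similarity between the query vector and each document vector. The idf formula log(N / df), the whitespace tokenizer, and the three toy documents are assumptions for illustration; real engines use many refinements.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, doc_freq, num_docs):
    """Map each term to tf * idf, with idf = log(N / df) (one common choice)."""
    counts = Counter(tokens)
    return {t: c * math.log(num_docs / doc_freq[t])
            for t, c in counts.items() if doc_freq[t] > 0}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing cosine similarity."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document
    n = len(docs)
    q_vec = tf_idf_vector(query.lower().split(), doc_freq, n)
    scored = [(doc_id, cosine(q_vec, tf_idf_vector(tokens, doc_freq, n)))
              for doc_id, tokens in tokenized.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {
    "d1": "usability testing of web sites",
    "d2": "software testing",
    "d3": "xml indexing",
}
ranking = rank("usability testing", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

In this toy corpus the rarer word "usability" carries more weight than "testing", so the document containing both ranks first and the document sharing no query word scores zero.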
31 FTScoreClause
- Two alternatives
- Both extensions to FLWOR clause
- Alternative 1
- Score Boolean XQuery expressions, including FTContainsExpr
- Current working draft syntax
- Alternative 2
- Score arbitrary XQuery expressions
- Under discussion
- Exact scoring is implementation-dependent!!!
- Standard imposes competition between vendors
32 Alternative 1
- FOR
- LET
- SCORE $var AS Expr (Expr returns Boolean)
- WHERE
- ORDER BY
- RETURN
- Example
- FOR $b in /pubs/book
- SCORE $s AS
- $b ftcontains "software" weight 0.8 && "testing" weight 0.2
- ORDER BY $s
- RETURN <result score="{$s}"> {$b} </result>
In any order
33 Alternative 1
- FOR
- LET
- SCORE $var AS Expr (Expr returns Boolean)
- WHERE
- ORDER BY
- RETURN
- Example
- FOR $b in /pubs/book
- SCORE $s AS
- $b/price < 10.00
- ORDER BY $s
- RETURN <result score="{$s}"> {$b} </result>
In any order
34 Alternative 1 Analysis
- Not powerful enough for some XML IR queries
- Case study: XML INEX initiative
- Want to relax /pubs/book (in addition to full-text predicates)
- Boolean scoring expressions insufficient
- Example: /pubs/book[. ftcontains "Usability" && "testing"]
35 Alternative 2
In any order
- FOR $v [SCORE $s]? [AT $i]? IN [FUZZY] Expr
- LET
- WHERE
- ORDER BY
- RETURN
- Example
- FOR $b SCORE $s in
- /pub/book[. ftcontains "Usability" && "testing"]
- ORDER BY $s
- RETURN <result score="{$s}"> {$b} </result>
36 Alternative 2
In any order
- FOR $v [SCORE $s]? [AT $i]? IN [FUZZY] Expr
- LET
- WHERE
- ORDER BY
- RETURN
- Example
- FOR $b SCORE $s in FUZZY
- /pub/book[. ftcontains "Usability" && "testing"]
- ORDER BY $s
- RETURN <result score="{$s}"> {$b} </result>
37 Research Challenges
38 Challenge 1: System Architecture
- Figure components: Integration Layer, XQuery Engine, IR Engine
39 Challenge 1: System Architecture
- Figure components: XQuery + IR Engine
40 Challenge 2: Structural Relaxation
- FOR $b SCORE $s in FUZZY
- /pub/book[. ftcontains "Usability" with stems]
- ORDER BY $s
- RETURN <result score="{$s}"> {$b} </result>
41 Adaptation of tf.idf to XML (Whirlpool: Marian et al., ICDE 2005)
42 Challenge 3: Search Over Views
LET $bookrevs := FOR $book IN //book
  RETURN <bookrevs>
    {$book}
    <reviews>
      { FOR $rev IN //review
        WHERE $rev/bookid = $book/id
        RETURN $rev }
    </reviews>
  </bookrevs>
FOR $bookrev IN $bookrevs
SCORE $score AS $bookrev ftcontains "Usability" with stems
ORDER BY $score
RETURN $bookrev
43 Challenge 4: LCA
- Given: Query keywords
- Compute: Least Common Ancestors (LCAs) that contain all query keywords, in ranked order
44 Naïve Method
- Naïve inverted lists
- Ricardo: 1, 5, 6, 8
- XQL: 1, 5, 6, 7
- Figure: example document tree with element IDs
  1 <workshop>
    2 date: 28 July
    3 <title>: XML and
    4 <editors>: David Carmel
    5 <proceedings>
      6 <paper>
        7 <title>: XQL and
        8 <author>: Ricardo
      <paper> (no ID shown)
- Problems: 1. Space Overhead 2. Spurious Results
- Main issue: Decouples representation of ancestors and descendants
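A small Python sketch of why the naïve lists are problematic, using the two inverted lists from this slide. The function `naive_answers` and the reading of "spurious results" as "ancestors of the most specific answer are returned too" are assumptions made for illustration.

```python
# Naive inverted lists from the slide: every keyword is indexed under the
# element that contains it AND under all of that element's ancestors.
naive_lists = {
    "ricardo": [1, 5, 6, 8],
    "xql": [1, 5, 6, 7],
}


def naive_answers(keywords, lists):
    """Intersect the lists: every element ID that contains all keywords."""
    result = set(lists[keywords[0]])
    for keyword in keywords[1:]:
        result &= set(lists[keyword])
    return sorted(result)


answers = naive_answers(["ricardo", "xql"], naive_lists)
print(answers)

# Space overhead: one keyword occurrence is stored once per ancestor level.
print(sum(len(ids) for ids in naive_lists.values()), "entries for 2 occurrences")
```

The intersection contains the paper element (6) but also its ancestors (5 and 1). Because flat integer IDs say nothing about which element is an ancestor of which, the index alone cannot tell that 6 is the most specific answer.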
45 Dewey Encoding of IDs (1850s)
- Figure: the same document tree with Dewey IDs
  0 <workshop>
    0.0 date: 28 July
    0.1 <title>: XML and
    0.2 <editors>: David Carmel
    0.3 <proceedings>
      0.3.0 <paper>
        0.3.0.0 <title>: XQL and
        0.3.0.1 <author>: Ricardo
      0.3.1 <paper>
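With Dewey IDs the ancestor relationship is encoded in the ID itself: the ID of every ancestor is a prefix of the ID of its descendants. A minimal Python sketch, using the IDs from this slide, computes the deepest common ancestor of two keyword occurrences as the longest common prefix. The function names and the posting lists are assumptions for illustration and do not reproduce any particular system's algorithm.

```python
def parse_dewey(dewey_id):
    """'0.3.0.1' -> (0, 3, 0, 1)"""
    return tuple(int(part) for part in dewey_id.split("."))


def format_dewey(parts):
    """(0, 3, 0) -> '0.3.0'"""
    return ".".join(str(part) for part in parts)


def common_prefix(a, b):
    """Longest common prefix of two Dewey tuples, i.e. their deepest shared ancestor."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def deepest_common_ancestor(postings_a, postings_b):
    """Deepest element containing one occurrence from each posting list."""
    best = ()
    for a in postings_a:
        for b in postings_b:
            prefix = common_prefix(parse_dewey(a), parse_dewey(b))
            if len(prefix) > len(best):
                best = prefix
    return format_dewey(best) if best else None


# Only the elements that directly contain the keyword are indexed.
postings = {
    "xql": ["0.3.0.0"],      # <title> of the first paper
    "ricardo": ["0.3.0.1"],  # <author> of the first paper
    "xml": ["0.1"],          # <title> of the workshop
}
print(deepest_common_ancestor(postings["xql"], postings["ricardo"]))
print(deepest_common_ancestor(postings["xml"], postings["ricardo"]))
```

Each keyword occurrence is stored once, and the most specific answer falls out of a prefix comparison instead of a lookup in a separate ancestor table.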
46 Other Open Issues
- Experimental evaluation of scoring functions and ranking algorithms for XML (INEX).
- Search over a mix of HTML and XML.
- Joint scoring on full-text and scalar predicates.
- Score-aware algebra for XML for the joint optimization of queries on both structure and text.
47 Conclusion
- Unified querying of structured data and text is one of the most promising benefits of XML
- XQuery Full-Text is a language designed to enable this goal
- Many research challenges
- System implementation
- Scoring
- Requirements of a new class of applications
- Starting to see research prototypes
- Quark (Open-source software, Cornell)
- GalaTeX (Reference implementation, AT&T)