Module 7 - PowerPoint PPT Presentation

Provided by: donaldk

Transcript and Presenter's Notes

1
Module 7
  • XML and Information Retrieval
  • (XQuery Full-Text, Research)

2
The World is Flat (Friedman)
  • Differences between Switzerland and Bangladesh
    are disappearing
  • Of the traditional drivers (talent, origin, and
    luck), talent will dominate
  • Individuals (not countries) compete
  • Chocolate sauce today, vanilla tomorrow
  • No premium for strength
  • No premium for memory
  • Premium for collaboration, communication
  • Premium for understanding, adaptivity
  • Premium for creativity
  • Good for Computer Scientists (lazy greedy)

3
The World is Flat (XML)
  • Remove barriers between machines
  • Machines talk to machines
  • communication, processing, storage
  • One data model
  • design and implementation
  • Declarative programming, pay as you go along
  • data and meta-data, structured/unstructured
  • One data model
  • here and there, today and tomorrow
  • Decouple data from schema / interpretation

4
References
  • XQuery 1.0 and XPath 2.0 Full-Text
  • http://www.w3.org/XML/Query
  • Latest version: November 2005
  • Still work in progress!
  • S. Amer-Yahia, J. Shanmugasundaram: XML Full-Text
    Search: Challenges and Opportunities
  • Tutorial for VLDB Conf., August 2005

5
Motivation
  • A key benefit of XML is its ability to represent
    a mix of structured and unstructured (text) data
  • Applications
  • Digital libraries
  • Content management
  • Many such XML repositories already available
  • IEEE INEX collection
  • Library of Congress documents
  • Shakespeare's plays
  • SIGMOD, DBLP, ...

6
XML in Library of Congress
http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml
    <bill bill-stage="Introduced-in-House">
      <congress>109th CONGRESS</congress>
      <session>1st Session</session>
      <legis-num>H. R. 2739</legis-num>
      <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
      <action>
        <action-date date="20050526">May 26, 2005</action-date>
        <action-desc><sponsor name-id="T000266">Mr. Tierney</sponsor>
          (for himself, <cosponsor name-id="M001143">Ms. McCollum of
          Minnesota</cosponsor>, <cosponsor name-id="M000725">Mr. George
          Miller of California</cosponsor>) introduced the following
          bill which was referred to the <committee-name
          committee-id="HED00">Committee on Education and the
          Workforce</committee-name>
        </action-desc>
      </action>

7
THOMAS Library of Congress
8
INEX Data
    <article> <fno>K0271</fno>
      <doi>10.1041/K0271s-2004</doi>
      <fm> <hdr><hdr1><ti>IEEE TRANSACTIONS ON
        KNOWLEDGE AND DATA ENGINEERING</ti> <crt>
        <issn>1041-4347</issn>/04/20.00 &copy; 2004
        IEEE Published by the IEEE Computer
        Society</crt></hdr1><hdr2><obi><volno>Vol.
        16</volno>, <issno>No. 2</issno></obi>
        <pdt><mo>FEBRUARY</mo><yr>2004</yr></pdt>
        <pp>pp. 271-288</pp></hdr2> </hdr>
      <tig><atl>A Graph-Based Approach for Timing
        Analysis and Refinement of OPS5 Knowledge-Based
        Systems</atl><pn>pp. 271-288</pn><ref
        rid="K02711aff" type="aff"></ref></tig>
      <au sequence="first"><fnm>Albert Mo
        Kim</fnm><snm> <ref aid="K0271a1"
        type="prb">Cheng</ref></snm><role>Senior
        Member</role><aff><onm>IEEE</onm></aff></au>
      <au sequence="additional"><fnm>Hsiu-yen</fnm><snm>
        Tsai</snm></au>
      <abs><p><b>Abstract</b>&mdash;This paper
        examines the problem of predicting the timing
        behavior of knowledge-based systems for real-

9
Current Query Languages
  • Current XML query languages are mostly database
    languages
  • Examples: XQuery, XPath
  • Provide very rudimentary text/IR support
  • fn:contains($e, keywords)
  • Returns true iff element $e contains keywords
  • No support for complex IR queries
  • Distance predicates, stemming, scoring, ...

10
Example Queries
  • XQuery Full-Text Use Cases Document
  • Find the titles of the books whose body contains
    the phrases "Usability" and "Web site" in that
    order, in the same paragraph, using stemming if
    necessary to match the tokens
  • Find the titles of the books whose body contains
    "Usability" and "testing" within a window of 3
    words, and return them in score order

11
Example INEX Query
    <inex_topic topic_id="275" query_type="CAS">
      <castitle>//article[about(.//abs, "data
        mining")]//sec[about(., "frequent
        itemsets")]</castitle>
      <description>sections about frequent
        itemsets from articles with abstract about data
        mining</description>
      <narrative>To be relevant, a component
        has to be a section about "frequent itemsets".
        For example, it could be about algorithms for
        finding frequent itemsets, or uses of frequent
        itemsets to generate rules. Also, the article
        must have an abstract about "data mining". I need
        this information for a paper that I am writing.
        It is a survey of different algorithms for
        finding frequent itemsets. The paper will also
        have a section on why we would want to find
        frequent itemsets.</narrative>
    </inex_topic>

12
Grand Challenge Queries [Kossmann 98]
  • Which ETH professor plays field hockey and has
    developed an efficient algorithm for computing
    the Pareto curve?
  • Who copied my IDP complexity proof?
  • Who is in this photo?
  • In which play does the ambitious wife drive her
    husband to murder?

13
Why not use SQL/MM?
  • Key difference: no strict demarcation between
    structured and text data in XML
  • Can issue structured and text queries over the
    same data
  • Find books with year > 1995
  • Find books containing the keyword "1998"
  • Can embed structured queries in text queries
  • Find books that contain the keywords that occur
    in the titles of Richard Dawkins' books
  • Other important differences
  • XML/XQuery data model
  • Composability of full-text primitives

14
Challenges in XML FT Search
  • Searching over Semi-Structured Data
  • Users may specify a search context and return
    context.
  • Expressive Power and Extensibility
  • Users should be able to express complex full-text
    searches and combine them with structural
    searches.
  • Scores and Ranking
  • Users may specify a scoring condition, possibly
    over both full-text and structured predicates and
    obtain top-k results based on query relevance
    scores.
  • The language should allow for an efficient
    implementation.

15
XML FT Search Definition
  • Context expression: XML elements searched
      • pre-defined XML nodes.
      • XPath/XQuery queries.
  • Return expression: XML fragments returned
      • pre-defined meaningful XML fragments.
      • XPath/XQuery to build answers.
  • Search expression: FT search conditions
      • Boolean keyword search.
      • proximity distance, scoping, thesaurus, stop
        words, stemming.
  • Score expression
      • system-defined scoring function.
      • user-defined scoring function.
      • query-dependent keyword weights.

16
Granularity of Results
  • Keyword queries
      • compute possibly different scores for LCAs.
  • Tag + Keyword queries
      • compute scores based on tags and keywords.
  • Path Expression + Keyword queries
      • compute scores based on paths and keywords.
  • XQuery + Complex full-text queries
      • compute scores for (newly constructed) XML
        fragments satisfying XQuery (structural,
        full-text and scalar conditions).

17
Four Classes of Languages
  • Keyword search (INEX Content-Only Queries)
      • "book xml"
  • Tag + Keyword search
      • book: xml
  • Path Expression + Keyword search
      • /book[./title about "xml db"]
  • XQuery + Complex full-text search
      • for $b in /book
        let score $s := $b ftcontains "xml" && "db"
                        distance 5

18
XPath [W3C 2005]
  • Special function in XQuery for keyword search
    (first proposed by [Florescu, Kossmann 2000])
  • fn:contains($e, string) returns true iff $e
    contains string
  • What happens if string is generated by an
    expression that returns a sequence of strings?
  • Does not allow specification of stemming, stop
    words, scoring, etc.

//section[fn:contains(./title, "XML Indexing")]
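
To make the limitation concrete, here is a minimal Python sketch of the substring semantics described above. The helper name `fn_contains` and the sample title are illustrative assumptions, not part of any XQuery implementation; the sketch only mimics a plain, case-sensitive substring test.

```python
def fn_contains(element_text: str, search: str) -> bool:
    """Plain substring test: no tokenization, stemming, or scoring."""
    return search in element_text


# Hypothetical section title used only for illustration.
title = "Efficient XML Indexing Techniques"

exact = fn_contains(title, "XML Indexing")        # substring is present
partial = fn_contains(title, "XML Index")         # a word prefix also matches
other_case = fn_contains(title, "xml indexing")   # different case does not match
reordered = fn_contains(title, "Indexing XML")    # word order is fixed
```

A word prefix matching while a case variant fails is exactly the kind of behavior that stemming and case options in a full-text language are meant to control.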
19
XQuery Full-Text
  • Full-text search extension to XQuery
  • W3C Working Draft
  • Tightly integrated with the XQuery data model
  • Provides well defined model for reasoning about
    full-text operations and integration with XQuery
  • Composability
  • Fully composable full-text primitives, including
    Boolean connectives, distance predicates,
    stemming
  • Can embed XQuery Full-Text primitives in XQuery
    and vice versa
  • Flexible scoring construct
  • AllMatches Data Model Tokenization!

20
XQuery Full-Text Evolution
  • 2002: Quark Full-Text Language (Cornell)
  • 2003: IBM, Microsoft, Oracle proposals;
    TeXQuery (Cornell, AT&T)
  • 2004: XQuery Full-Text
  • 2005: XQuery Full-Text (Second Draft)
21
Syntax Overview
  • Two new XQuery constructs
  • FTContainsExpr
  • Expresses Boolean full-text search predicates
  • Seamlessly composes with other XQuery expressions
  • FTScoreClause
  • Extension to FLWOR expression
  • Can score FTContainsExpr and other expressions

22
FTContainsExpr
  • Like other XQuery expressions
  • Takes in a sequence of items (nodes) as input
  • Produces a sequence of items (nodes) as output
  • Seamlessly compose with other XQuery exprs.
  • Do not confuse with the fn:contains function!

[Diagram: XQuery expression, evaluates to a sequence of items]
23
FTContainsExpr
  • ContextExpr ftcontains FTSelection
  • ContextExpr (any XQuery expression) is the
    context spec
  • FTSelection is the search spec
  • Returns true iff at least one node in ContextExpr
    satisfies the FTSelection
  • Examples
      • //book ftcontains "Usability" && "testing"
        distance 5
      • //book[./content ftcontains "Usability" with
        stems]/title
      • //book ftcontains /article[author="Dawkins"]/title
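
The existential semantics ("true iff at least one node satisfies the selection") combined with a distance condition can be sketched in Python as follows. This is a simplified model under stated assumptions: context nodes are plain strings, tokens are lowercase word runs, and "distance 5" is read as "at most 5 intervening tokens between neighbouring matches". The function names are hypothetical and the working draft defines the real semantics.

```python
import re
from itertools import product


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def node_matches(text, words, max_distance):
    """True if every word occurs and some choice of occurrences keeps at most
    max_distance intervening tokens between neighbouring matches."""
    tokens = tokenize(text)
    pos_lists = [[i for i, t in enumerate(tokens) if t == w.lower()] for w in words]
    if any(not positions for positions in pos_lists):
        return False
    for combo in product(*pos_lists):
        ordered = sorted(combo)
        if all(b - a - 1 <= max_distance for a, b in zip(ordered, ordered[1:])):
            return True
    return False


def ft_contains(context_nodes, words, max_distance):
    """Existential semantics: one satisfying context node is enough."""
    return any(node_matches(node, words, max_distance) for node in context_nodes)


near = "Usability and careful testing of Web sites"
far = "Usability is discussed here " + "filler " * 10 + "testing comes later"
```

With these two sample strings, `ft_contains([near, far], ["Usability", "testing"], 5)` holds because of the first node alone, while the second node on its own does not satisfy the distance condition.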

24
FTSelection
  • Encapsulates all full-text conditions in
    FTContainsExpr
  • Works in a new data model called AllMatches
  • Operates on positions within XML nodes (more
    fine-grained than the XQuery data model)
  • Fully composable, similar to the composition of
    relational (and XML) operators!

[Diagram: FTSelection, evaluates to AllMatches]
25
FTSelection Examples
  • "Usability"
  • /book[author="Dawkins"]/title
  • "Usability" && /book[author="Dawkins"]/title
  • ("Usability" && /book[author="Dawkins"]/title)
    same sentence
  • ("Usability" && /book[author="Dawkins"]/title)
    same sentence window 5
  • All of these evaluate to an AllMatches!
  • Allows arbitrary composition of full-text
    primitives

26
AllMatches Data Model
  • Tokenization: extend the XQuery data model to
    represent each individual word (not just a string
    or text). Each word is represented as a token
  • AllMatches: represents the results of FTSelections
  • The following TokenInfo is kept for each match:
      • Word: string (the matching word itself)
      • Pos: integer (position of word within document)
      • Para: integer (position of paragraph)
      • Sentence: integer (position of sentence)
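
The TokenInfo fields listed above can be illustrated with a small Python sketch. The class layout, the blank-line paragraph rule, and the punctuation-based sentence splitter are assumptions made for the example; real tokenization is implementation-defined.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenInfo:
    word: str      # the matching word itself
    pos: int       # position of the word within the document
    para: int      # position of the paragraph
    sentence: int  # position of the sentence


def tokenize(text):
    """Assign word, paragraph, and sentence positions.
    Assumes paragraphs are separated by blank lines and sentences end in . ! or ?"""
    tokens, pos, sentence = [], 0, 0
    for para_no, para in enumerate(text.split("\n\n")):
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            words = re.findall(r"\w+", sent)
            if not words:
                continue
            for word in words:
                tokens.append(TokenInfo(word, pos, para_no, sentence))
                pos += 1
            sentence += 1
    return tokens


def matches(tokens, word):
    """All TokenInfo entries for one search word (case-insensitive here)."""
    return [t for t in tokens if t.word.lower() == word.lower()]


doc = "Usability matters. Testing helps usability.\n\nWeb site design."
tokens = tokenize(doc)
hits = matches(tokens, "usability")
```

Because each hit carries its paragraph and sentence numbers, conditions such as "same sentence" or "window 5" reduce to comparisons on these integers.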

27
FTContextModifier
  • Can be applied on any FTSelection to specify
    aspects such as stemming, thesauri, case, etc.
  • Fully composable with other context modifiers and
    FTSelections
  • Examples
  • "Usability testing" with stems
  • "Usability testing" without stop words
  • "Usability testing" case insensitive

28
Porter Algorithm for Stemming
  • Transform the word (sequence of vowels and
    consonants) to a stem
  • Works for the English language
  • Applies a set of heuristics, e.g.,
      • Plural: sses -> ss, ies -> i
      • Tenses: eed -> ee (agreed -> agree), ed -> e
  • Use thesauri to separate composite words
      • Particularly useful in German
      • Schwimmvogel -> Schwimm, Vogel
  • Stop words: lists are available on the Internet
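
A few of these heuristics can be written down directly. The Python sketch below covers only the plural rules (Porter's step 1a) and a simplified part of the tense rules (step 1b without its cleanup pass), so it is not a complete stemmer. For "ed" it follows the published Porter rule (drop "ed" when the remaining stem contains a vowel) rather than the slide's shorthand; the function names are illustrative.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Porter's m: the number of vowel-run to consonant-run transitions."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def step_1a(word):
    """Plural rules: sses -> ss, ies -> i, ss -> ss, s -> (nothing)."""
    if word.endswith("sses") or word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def step_1b_partial(word):
    """Tense rules, simplified: eed -> ee if m > 0; drop ed if the stem has a vowel."""
    if word.endswith("eed"):
        return word[:-1] if measure(word[:-3]) > 0 else word
    stem = word[:-2]
    if word.endswith("ed") and any(not is_consonant(stem, i) for i in range(len(stem))):
        return stem
    return word
```

For example, `step_1a("caresses")` yields "caress", `step_1a("ponies")` yields "poni", and `step_1b_partial("agreed")` yields "agree", while "feed" and "bled" are left unchanged because their stems fail the conditions.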

29
Full-Text Scoring
  • Score value should reflect relevance of answer to
    user query.
  • Higher scores imply a higher degree of relevance.
  • Queries return document fragments.
  • Granularity of returned results affects scoring.
  • For queries containing conditions on structure,
    the structural conditions may affect scoring.
  • Existing proposals extend common scoring methods
    (the standard does not prescribe one!)
  • probabilistic or vector-based similarity.

30
Vector Space Model
  • Consider the document as a vector of weights
  • One weight per word: tf * idf
  • Term frequency (tf)
  • Inverse document frequency (idf)
  • Consider the query as a vector of weights
  • One weight per word in the query: tf * idf
  • Compute the similarity of the document and query
    vectors
  • Textbook: cosine similarity
  • Black art in each search engine
  • Google: PageRank, based on random walk
  • Goal: maximize precision and recall
  • Defined by humans! (AI-complete, no rigorous
    approach)
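
The textbook variant can be sketched in a few lines of Python. The three toy documents, the raw-count tf, and the unsmoothed `log(N / df)` idf are assumptions chosen for brevity; production engines use many refinements, as noted above.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, doc_freq, n_docs):
    """One weight per word: term frequency times log(N / document frequency)."""
    tf = Counter(tokens)
    return {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf if doc_freq.get(w)}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy collection, invented for illustration only.
docs = {
    "d1": "xml full text search".split(),
    "d2": "xml query processing".split(),
    "d3": "relational query optimization".split(),
}
doc_freq = Counter(w for tokens in docs.values() for w in set(tokens))
vectors = {d: tf_idf_vector(t, doc_freq, len(docs)) for d, t in docs.items()}

query = tf_idf_vector("full text search".split(), doc_freq, len(docs))
ranking = sorted(docs, key=lambda d: cosine(query, vectors[d]), reverse=True)
```

Here `d1` ranks first because it is the only document sharing terms with the query; documents without any shared term get similarity 0.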

31
FTScoreClause
  • Two alternatives
  • Both extensions to FLWOR clause
  • Alternative 1
  • Score Boolean XQuery expressions, including
    FTContainsExpr
  • Current working draft syntax
  • Alternative 2
  • Score arbitrary XQuery expressions
  • Under discussion
  • Exact scoring is implementation-dependent!!!
  • Standard imposes competition between vendors

32
Alternative 1
  • FOR
  • LET
  • SCORE $var AS Expr (Expr returns a Boolean)
  • WHERE
  • ORDER BY
  • RETURN
  • (Diagram note on the clause list: in any order)
  • Example
      FOR $b IN /pubs/book
      SCORE $s AS
        $b ftcontains "software" weight 0.8
                   && "testing" weight 0.2
      ORDER BY $s
      RETURN <result score="{$s}">{$b}</result>
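
Since exact scoring is implementation-dependent, the following Python sketch shows just one conceivable way keyword weights could feed into a score and an ordering. The formula (weighted relative term frequency), the sample books, and the descending sort are all assumptions for illustration, not what any particular engine or the draft prescribes.

```python
def weighted_score(tokens, weights):
    """Hypothetical score: sum of weight * relative frequency per keyword."""
    if not tokens:
        return 0.0
    return sum(w * tokens.count(term) / len(tokens) for term, w in weights.items())


# Invented sample data standing in for /pubs/book.
books = {
    "b1": "software testing and more testing".split(),
    "b2": "software engineering for software teams".split(),
    "b3": "gardening basics".split(),
}
weights = {"software": 0.8, "testing": 0.2}

# Rank books by score, highest first.
scored = sorted(((weighted_score(t, weights), b) for b, t in books.items()), reverse=True)
ranked_ids = [book_id for _, book_id in scored]
```

With these weights, a book mentioning "software" twice outranks one mentioning "testing" twice, which is the intent of giving "software" the larger weight.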
33
Alternative 1
  • FOR
  • LET
  • SCORE $var AS Expr (Expr returns a Boolean)
  • WHERE
  • ORDER BY
  • RETURN
  • (Diagram note on the clause list: in any order)
  • Example
      FOR $b IN /pubs/book
      SCORE $s AS
        $b/price < 10.00
      ORDER BY $s
      RETURN <result score="{$s}">{$b}</result>
34
Alternative 1 Analysis
  • Not powerful enough for some XML IR queries
  • Case study: XML INEX initiative
  • Want to relax /pubs/book (in addition to
    full-text predicates)
  • Boolean scoring expressions insufficient

/pubs/book[. ftcontains "Usability testing"]
35
Alternative 2
  • FOR $v (SCORE $s)? (AT $i)? IN (FUZZY)? Expr
  • LET
  • WHERE
  • ORDER BY
  • RETURN
  • (Diagram note on the clause list: in any order)
  • Example
      FOR $b SCORE $s IN
        /pub/book[. ftcontains "Usability testing"]
      ORDER BY $s
      RETURN <result score="{$s}">{$b}</result>

36
Alternative 2
  • FOR $v (SCORE $s)? (AT $i)? IN (FUZZY)? Expr
  • LET
  • WHERE
  • ORDER BY
  • RETURN
  • (Diagram note on the clause list: in any order)
  • Example
      FOR $b SCORE $s IN FUZZY
        /pub/book[. ftcontains "Usability testing"]
      ORDER BY $s
      RETURN <result score="{$s}">{$b}</result>

37
Research Challenges
38
Challenge 1: System Architecture
[Diagram: Integration Layer, XQuery Engine, IR Engine]
39
Challenge 1: System Architecture
[Diagram: XQuery + IR Engine]
40
Challenge 2: Structural Relaxation
    FOR $b SCORE $s IN FUZZY
      /pub/book[. ftcontains "Usability" with stems]
    ORDER BY $s
    RETURN <result score="{$s}">{$b}</result>

41
Adaptation of tf.idf to XML: Whirlpool [Marian et
al., ICDE 2005]
42
Challenge 3: Search Over Views
    LET $bookrevs :=
      FOR $book IN //book
      RETURN <bookrevs>
               {$book}
               <reviews>
                 {FOR $rev IN //review
                  WHERE $rev/bookid = $book/id
                  RETURN $rev}
               </reviews>
             </bookrevs>
    FOR $bookrev IN $bookrevs
    SCORE $score AS
      $bookrev ftcontains "Usability" with stems
    ORDER BY $score
    RETURN $bookrev
43
Challenge 4: LCA
  • Given: query keywords
  • Compute: Least Common Ancestors (LCAs) that
    contain all query keywords, in ranked order

44
Naïve Method
  • Naïve inverted lists
      • Ricardo: 1 5 6 8
      • XQL: 1 5 6 7
  • Example document (element IDs on the left)
      1 <workshop>
        2 date: 28 July ...
        3 <title>: XML and ...
        4 <editors>: David Carmel ...
        5 <proceedings>
          6 <paper>
            7 <title>: XQL and ...
            8 <author>: Ricardo ...
          <paper> ...
  • Problems
      • 1. Space overhead
      • 2. Spurious results
  • Main issue: decouples the representation of
    ancestors and descendants
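
The two problems show up directly when the naïve lists are intersected. The Python sketch below uses the posting lists from the slide; the function name `naive_candidates` is an illustrative assumption.

```python
# Naive inverted lists from the slide: every ancestor of a match is stored too.
inverted = {
    "Ricardo": [1, 5, 6, 8],
    "XQL": [1, 5, 6, 7],
}


def naive_candidates(index, keywords):
    """Elements whose subtree contains all keywords, found by list intersection."""
    lists = [set(index.get(k, ())) for k in keywords]
    return sorted(set.intersection(*lists)) if lists else []


candidates = naive_candidates(inverted, ["Ricardo", "XQL"])
```

The intersection yields elements 1, 5, and 6. Only element 6 (the paper) is the most specific answer; elements 1 and 5 are its ancestors and therefore spurious, and storing every ancestor in every list is the source of the space overhead.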
45
Dewey Encoding of IDs [1850s]
    0 <workshop>
      0.0 date: 28 July ...
      0.1 <title>: XML and ...
      0.2 <editors>: David Carmel ...
      0.3 <proceedings>
        0.3.0 <paper>
          0.3.0.0 <title>: XQL and ...
          0.3.0.1 <author>: Ricardo ...
        0.3.1 <paper> ...
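
With Dewey IDs, an element's ancestors are encoded in its own ID, so a common ancestor is simply a common ID prefix. The Python sketch below uses the IDs from the slide; the posting lists (one entry per element that directly contains the keyword) and the function names are assumptions for illustration.

```python
from itertools import product


def parse(dewey):
    """Turn '0.3.0.1' into (0, 3, 0, 1)."""
    return tuple(int(part) for part in dewey.split("."))


def common_prefix(ids):
    """Longest shared prefix of several Dewey IDs: their lowest common ancestor."""
    prefix = []
    for parts in zip(*ids):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def keyword_lcas(index, keywords):
    """LCAs over every combination of one posting per keyword, deepest first."""
    postings = [[parse(d) for d in index.get(k, [])] for k in keywords]
    lcas = {common_prefix(combo) for combo in product(*postings)}
    return [".".join(map(str, l)) for l in sorted(lcas, key=len, reverse=True)]


# Only the elements that directly contain each keyword are stored.
index = {"XQL": ["0.3.0.0"], "Ricardo": ["0.3.0.1"], "XML": ["0.1"]}
```

For "XQL" and "Ricardo" the result is the paper element 0.3.0, and for "XML" and "Ricardo" it is the workshop root 0, without storing ancestors in the lists.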
46
Other Open Issues
  • Experimental evaluation of scoring functions and
    ranking algorithms for XML (INEX).
  • Search over a mix of HTML and XML.
  • Joint scoring on full-text and scalar predicates.
  • Score-aware algebra for XML for the joint
    optimization of queries on both structure and
    text.

47
Conclusion
  • Unified querying of structured data and text is
    one of the most promising benefits of XML
  • XQuery Full-Text is a language designed to enable
    this goal
  • Many research challenges
  • System implementation
  • Scoring
  • Requirements of a new class of applications
  • Starting to see research prototypes
  • Quark (open-source software, Cornell)
  • GalaTeX (reference implementation, AT&T)