XML Full-Text Search: Challenges and Opportunities

2 September 2005. VLDB Tutorial on XML Full-Text Search.

1
XML Full-Text Search: Challenges and Opportunities
  • Sihem Amer-Yahia, AT&T Labs Research
  • Jayavel Shanmugasundaram, Cornell University

2
Outline
  • Motivation
  • Full-Text Search Languages
  • Scoring
  • Query Processing
  • Open Issues

3
Motivation
  • XML is able to represent a mix of structured and
    text information.
  • XML applications: digital libraries, content
    management.
  • XML repositories: IEEE INEX collection,
    LexisNexis, the Library of Congress collection.

4
XML in Library of Congress: http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml
<bill bill-stage="Introduced-in-House">
  <congress>109th CONGRESS</congress>
  <session>1st Session</session>
  <legis-num>H. R. 2739</legis-num>
  <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
  <action>
    <action-date date="20050526">May 26, 2005</action-date>
    <action-desc><sponsor name-id="T000266">Mr. Tierney</sponsor> (for himself, <cosponsor name-id="M001143">Ms. McCollum of Minnesota</cosponsor>, <cosponsor name-id="M000725">Mr. George Miller of California</cosponsor>) introduced the following bill which was referred to the <committee-name committee-id="HED00">Committee on Education and the Workforce</committee-name>
    </action-desc>
  </action>

5
THOMAS Library of Congress
6
INEX Data
<article> <fno>K0271</fno> <doi>10.1041/K0271s-2004</doi>
<fm> <hdr><hdr1><ti>IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING</ti> <crt>
<issn>1041-4347</issn>/04/$20.00 &copy; 2004 IEEE Published by the IEEE Computer Society</crt></hdr1><hdr2><obi><volno>Vol. 16</volno>, <issno>No. 2</issno></obi> <pdt><mo>FEBRUARY</mo><yr>2004</yr></pdt>
<pp>pp. 271-288</pp></hdr2> </hdr>
<tig><atl>A Graph-Based Approach for Timing Analysis and Refinement of OPS5 Knowledge-Based Systems</atl><pn>pp. 271-288</pn><ref rid="K02711aff" type="aff"></ref></tig>
<au sequence="first"><fnm>Albert Mo Kim</fnm><snm> <ref aid="K0271a1" type="prb">Cheng</ref></snm><role>Senior Member</role><aff><onm>IEEE</onm></aff></au><au sequence="additional"><fnm>Hsiu-yen</fnm><snm> Tsai</snm></au>
<abs><p><b>Abstract</b>&mdash;This paper examines the problem of predicting the timing behavior of knowledge-based systems for real-

7
Example INEX Query
<inex_topic topic_id="275" query_type="CAS">
  <castitle>//article[about(.//abs, "data mining")]//sec[about(., "frequent itemsets")]</castitle>
  <description>sections about frequent itemsets from articles with abstract about data mining</description>
  <narrative>To be relevant, a component has to be a section about "frequent itemsets". For example, it could be about algorithms for finding frequent itemsets, or uses of frequent itemsets to generate rules. Also, the article must have an abstract about "data mining". I need this information for a paper that I am writing. It is a survey of different algorithms for finding frequent itemsets. The paper will also have a section on why we would want to find frequent itemsets.</narrative>
</inex_topic>

8
Challenges in XML FT Search
  • Searching over Semi-Structured Data
      • Users may specify a search context and return
        context.
  • Expressive Power and Extensibility
      • Users should be able to express complex full-text
        searches and combine them with structural
        searches.
  • Scores and Ranking
      • Users may specify a scoring condition, possibly
        over both full-text and structured predicates, and
        obtain top-k results based on query relevance
        scores.
  • The language should allow for an efficient
    implementation.

9
XML FT Search Definition
  • Context expression: XML elements searched
      • pre-defined XML nodes.
      • XPath/XQuery queries.
  • Return expression: XML fragments returned
      • pre-defined meaningful XML fragments.
      • XPath/XQuery to build answers.
  • Search expression: FT search conditions
      • Boolean keyword search.
      • proximity distance, scoping, thesaurus, stop
        words, stemming.
  • Score expression
      • system-defined scoring function.
      • user-defined scoring function.
      • query-dependent keyword weights.

10
Outline
  • Motivation
  • Full-Text Search Languages
  • Scoring
  • Query Processing
  • Open Issues

11
Four Classes of Languages
  • Keyword search (INEX Content-Only Queries)
      • "book xml"
  • Tag + Keyword search
      • book: xml
  • Path Expression + Keyword search
      • /book[./title about "xml db"]
  • XQuery + Complex full-text search
      • for $b in /book
        let score $s := $b ftcontains "xml" && "db"
        distance 5

12
Outline
  • Motivation
  • Full-Text Search Languages
  • Simple Keyword Search
  • Tags + Keyword Search
  • Path Expressions + Keyword Search
  • XQuery + Complex Full-Text Search
  • Scoring
  • Query Processing
  • Open Issues

13
XRank [Guo et al., SIGMOD 2003]
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <subsection name="Related Work">
          The XQL language
        </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
15
XIRQL [Fuhr & Großjohann, SIGIR 2001]
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
        <em> The XQL language </em>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> </cite>
    </paper>
Figure label: Index Node
16
Similar Notion of Results
  • Nearest Concept Queries
      • [Schmidt et al., ICDE 2002]
  • XKSearch
      • [Xu & Papakonstantinou, SIGMOD 2005]

17
Outline
  • Motivation
  • Full-Text Search Languages
  • Simple Keyword Search
  • Tags + Keyword Search
  • Path Expressions + Keyword Search
  • XQuery + Complex Full-Text Search
  • Scoring
  • Query Processing
  • Open Issues

18
XSearch [Cohen et al., VLDB 2003]
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML
    </paper>
    <paper id="2">
      <title> XML Indexing </title>
Figure label on <paper id="2">: Not a meaningful result
19
Outline
  • Motivation
  • Full-Text Search Languages
  • Simple Keyword Search
  • Tags + Keyword Search
  • Path Expressions + Keyword Search
  • XQuery + Complex Full-Text Search
  • Scoring
  • Query Processing
  • Open Issues

20
XPath [W3C 2005]
  • fn:contains($e, string) returns true iff $e
    contains string

//section[fn:contains(./title, "XML Indexing")]
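
As a rough illustration of what the predicate above selects, the following Python sketch emulates it with a plain substring test over a small made-up document. The sample XML and the helper name sections_with_title are assumptions for illustration only and are not part of the tutorial.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the slides.
DOC = """
<paper>
  <section><title>XML Indexing Techniques</title></section>
  <section><title>Related Work</title></section>
</paper>
"""

def sections_with_title(root, needle):
    """Emulate //section[fn:contains(./title, needle)] with a substring test."""
    hits = []
    for section in root.iter("section"):
        title = section.find("title")
        # fn:contains is a substring test on the string value, not a word match.
        if title is not None and needle in "".join(title.itertext()):
            hits.append(section)
    return hits

root = ET.fromstring(DOC)
matches = sections_with_title(root, "XML Indexing")
print([s.find("title").text for s in matches])
```

Because this is only a substring test, it has no notion of stemming, proximity or scoring, which is the gap the full-text languages on the following slides address.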
21
XIRQL [Fuhr & Großjohann, SIGIR 2001]
  • Weighted extension to XQL (precursor to XPath)

//section[0.6 · .//* $cw$ "XQL" +
0.4 · .//section $cw$ "syntax"]
22
XXL [Theobald & Weikum, EDBT 2002]
  • Introduces similarity operator (~)

Select Z From http://www.myzoos.edu/zoos.html
Where zoos.#.zoo As Z and
Z.animals.(animal)?.specimen as A and
A.species ~ "lion" and
A.birthplace.#.country as B and
A.region ~ B.content
23
NEXI [Trotman & Sigurbjörnsson, INEX 2004]
  • Narrowed Extended XPath I
  • INEX Content-and-Structure (CAS) Queries

//article[about(.//title, apple) and
about(.//sec, computer)]
24
Outline
  • Motivation
  • Full-Text Search Languages
  • Simple Keyword Search
  • Tags + Keyword Search
  • Path Expressions + Keyword Search
  • XQuery + Complex Full-Text Search
  • Scoring
  • Query Processing
  • Open Issues

25
Schema-Free XQuery [Li, Yu, Jagadish, VLDB 2004]
  • Meaningful least common ancestor (mlcas)

for $a in doc("bib.xml")//author,
    $b in doc("bib.xml")//title,
    $c in doc("bib.xml")//year
where $a/text() = "Mary" and exists mlcas($a, $b, $c)
return <result> { $b, $c } </result>
26
XQuery Full-Text [W3C 2005]
  • Two new XQuery constructs
  • FTContainsExpr
      • Expresses Boolean full-text search predicates
      • Seamlessly composes with other XQuery expressions
  • FTScoreClause
      • Extension to FLWOR expression
      • Can score FTContainsExpr and other expressions

27
FTContainsExpr
  • //book ftcontains "Usability" && "testing"
    distance 5
  • //book[./content ftcontains "Usability" with
    stems]/title
  • //book ftcontains /article[author="Dawkins"]/title
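
To make the proximity condition in the first example concrete, here is a minimal Python sketch of a distance check over tokenized text. It assumes that "distance 5" means at most five intervening words between neighbouring matches, and it ignores stemming and scoring. The function name ft_contains_distance and the tokenization rule are illustrative assumptions, not the semantics defined by the W3C draft.

```python
import re
from itertools import product

def ft_contains_distance(text, words, max_gap):
    """True if every word occurs and some choice of occurrences keeps
    neighbouring matches at most max_gap intervening tokens apart."""
    tokens = re.findall(r"\w+", text.lower())
    # Token positions of each query word.
    positions = [[i for i, t in enumerate(tokens) if t == w.lower()]
                 for w in words]
    if any(not p for p in positions):
        return False          # plain conjunction already fails
    for combo in product(*positions):
        ordered = sorted(combo)
        if all(b - a - 1 <= max_gap for a, b in zip(ordered, ordered[1:])):
            return True
    return False

text = "Usability and rigorous testing of software"
print(ft_contains_distance(text, ["Usability", "testing"], 5))
```

Enumerating all combinations of positions is exponential in the number of query words; it is used here only to keep the sketch short.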

28
FTScore Clause
In any order
  • FOR $v [SCORE $s]? IN [FUZZY] Expr
  • LET
  • WHERE
  • ORDER BY
  • RETURN
  • Example
      FOR $b SCORE $s in
          /pub/book[. ftcontains "Usability" && "testing"]
      ORDER BY $s
      RETURN <result score={$s}> {$b} </result>

29
FTScore Clause
In any order
  • FOR $v [SCORE $s]? IN [FUZZY] Expr
  • LET
  • WHERE
  • ORDER BY
  • RETURN
  • Example
      FOR $b SCORE $s in
          /pub/book[. ftcontains "Usability" && "testing"
                    and ./price < 10.00]
      ORDER BY $s
      RETURN $b

30
FTScore Clause
In any order
  • FOR $v [SCORE $s]? IN [FUZZY] Expr
  • LET
  • WHERE
  • ORDER BY
  • RETURN
  • Example
      FOR $b SCORE $s in FUZZY
          /pub/book[. ftcontains "Usability" && "testing"]
      ORDER BY $s
      RETURN $b

31
XQuery Full-Text Evolution
  • 2002: Quark Full-Text Language (Cornell)
  • 2003: IBM, Microsoft, Oracle proposals; TeXQuery (Cornell, AT&T Labs)
  • 2004: XQuery Full-Text
  • 2005: XQuery Full-Text (Second Draft)
32
Outline
  • Motivation
  • Full-Text Search Languages
  • Scoring
  • Query Processing
  • Open Issues

33
Full-Text Scoring
  • Score value should reflect relevance of answer to
    user query. Higher scores imply a higher degree
    of relevance.
  • Queries return document fragments. Granularity of
    returned results affects scoring.
  • For queries containing conditions on structure,
    structural conditions may affect scoring.
  • Existing proposals extend common scoring methods:
    probabilistic or vector-based similarity.

34
Granularity of Results
  • Keyword queries
      • compute possibly different scores for LCAs.
  • Tag + Keyword queries
      • compute scores based on tags and keywords.
  • Path Expression + Keyword queries
      • compute scores based on paths and keywords.
  • XQuery + Complex full-text queries
      • compute scores for (newly constructed) XML
        fragments satisfying XQuery (structural,
        full-text and scalar conditions).

35
Outline
  • Motivation
  • Full-Text Search Languages
  • Scoring
  • Simple Keyword Search
  • Tags + Keyword Search
  • Path Expressions + Keyword Search
  • XQuery + Complex Full-Text Search
  • Query Processing
  • Open Issues

36
Granularity of Results
  • Document as hierarchical structure of elements as
    opposed to flat document.
  • XXL [Theobald & Weikum, EDBT 2002]
  • XIRQL [Fuhr & Großjohann, SIGIR 2001]
  • XRANK [Guo et al., SIGMOD 2003]
  • Propagate keyword weights along document
    structure.

37
XML Data Model
Containment edge
Hyperlink edge
38
XXL [Theobald & Weikum, EDBT 2002]
  • Compute similar terms with relevance score r1
    using an ontology.
  • Compute tf.idf of each term for a given element
    content with relevance score r2.
  • Relevance of an element content for a term is
    r1 * r2.
  • r1 and r2 are computed as a weighted distance in
    an ontology graph.
  • Probabilities of conjunctions are multiplied
    (independence assumption) along elements of the same
    path to compute the path score.
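
The combination rules in the bullets above can be sketched as follows. The numeric values are made-up inputs, and the function names are illustrative assumptions; the slide does not specify how r1 and r2 are obtained beyond what is stated.

```python
from math import prod

def element_relevance(r1, r2):
    """Relevance of an element's content for a term: ontology similarity
    score r1 times the tf.idf-based score r2."""
    return r1 * r2

def path_score(element_scores):
    """Path score under the independence assumption: multiply the
    relevance scores of the elements along one path."""
    return prod(element_scores)

# Hypothetical inputs for two elements on the same path.
first = element_relevance(0.9, 0.5)
second = element_relevance(1.0, 0.8)
print(path_score([first, second]))
```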

39
Probabilistic Scoring: XIRQL [Fuhr & Großjohann,
SIGIR 2001]
  • Extension of XPath.
  • Weighting and ranking
      • weighting of query terms
      • P(wsum((0.6,a), (0.4,b))) = 0.6 · P(a) + 0.4 · P(b)
      • probabilistic interpretation of Boolean
        connectors
      • P(a ∧ b) = P(a) · P(b)
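
A minimal sketch of the two rules on this slide, the weighted sum of query terms and the probabilistic conjunction under independence. The function names wsum and p_and and the input probabilities are assumptions for illustration.

```python
def wsum(weighted_probs):
    """Weighted sum of term probabilities: sum of weight * P(term)."""
    return sum(weight * p for weight, p in weighted_probs)

def p_and(*probs):
    """Conjunction under independence: product of the probabilities."""
    result = 1.0
    for p in probs:
        result *= p
    return result

# Hypothetical probabilities P(a) = 0.5 and P(b) = 1.0.
print(wsum([(0.6, 0.5), (0.4, 1.0)]))
print(p_and(0.5, 1.0))
```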

40
XIRQL Example
  • Query
      • Search for an artist named Ulbrich, living in
        Frankfurt, Germany about 100 years ago
  • Data
      • Ernst Olbrich, Darmstadt, 1899
  • Weights and ranking
      • P(Olbrich ~p Ulbrich) = 0.8 (phonetic similarity)
      • P(1899 ~n 1903) = 0.9 (numeric similarity)
      • P(Darmstadt ~g Frankfurt) = 0.7 (geographic distance)
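
If the three vague predicates are combined as an independent conjunction, the record's overall weight is the product of the three probabilities. The slide does not state the combined value, so the computation below is an assumed reading, sketched for illustration.

```python
# Probabilities from the slide for the record "Ernst Olbrich, Darmstadt, 1899".
vague_matches = {
    "phonetic (Olbrich vs. Ulbrich)": 0.8,
    "numeric (1899 vs. 1903)": 0.9,
    "geographic (Darmstadt vs. Frankfurt)": 0.7,
}

# Assumption: the predicates are ANDed and treated as independent events.
weight = 1.0
for probability in vague_matches.values():
    weight *= probability

print(round(weight, 3))
```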

41
PageRank [Brin & Page, 1998]
[Figure: pages connected by hyperlink edges, with a weight w; 1-d is the probability of a random jump]
42
ElemRank [Guo et al., SIGMOD 2003]
[Figure: elements connected by hyperlink edges and containment edges, with a weight w; 1-d1-d2-d3 is the probability of a random jump]
43
Outline
  • Motivation
  • Full-Text Search Languages
  • Scoring
  • Simple Keyword Search
  • Tags + Keyword Search
  • Path Expressions + Keyword Search
  • XQuery + Complex Full-Text Search
  • Query Processing
  • Open Issues

44
XSearch [Cohen et al., VLDB 2003]
  • tf.ilf is used to compute the weight of a keyword
    for a leaf element.
  • A vector is associated with each non-leaf
    element.
  • sim(Q,N): sum of the cosine distances between the
    vectors associated with nodes in N and vectors
    associated with terms matched in Q.

45
Outline
  • Motivation
  • Full-Text Search Languages
  • Scoring
  • Simple Keyword Search
  • Tags + Keyword Search
  • Path Expressions + Keyword Search
  • XQuery + Complex Full-Text Search
  • Query Processing
  • Open Issues

46
Vector-based Scoring: JuruXML [Mass et al., INEX
2002]
  • Transform query into (term,path) conditions
      • article/bm/bib/bibl/bb[about(., hypercube
        mesh torus nonnumerical database)]
  • (term,path)-pairs
      • (hypercube, article/bm/bib/bibl/bb)
      • (mesh, article/bm/bib/bibl/bb)
      • (torus, article/bm/bib/bibl/bb)
      • (nonnumerical, article/bm/bib/bibl/bb)
      • (database, article/bm/bib/bibl/bb)
  • Modified cosine similarity as retrieval function
    for vague matching of path conditions.
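
The query-to-pairs transformation listed above can be sketched in a few lines of Python. The regular expression handles only the single-predicate query form shown on this slide, and the function name term_path_pairs is an illustrative assumption rather than part of JuruXML.

```python
import re

def term_path_pairs(query):
    """Split 'path[about(., t1 t2 ...)]' into (term, path) pairs."""
    pattern = r"\s*(?P<path>[^\[\]]+)\[about\(\.,\s*(?P<terms>[^)]*)\)\]\s*"
    match = re.fullmatch(pattern, query)
    if match is None:
        raise ValueError("unsupported query form: " + query)
    path = match.group("path").strip()
    # Every keyword inherits the path of the element it is searched in.
    return [(term, path) for term in match.group("terms").split()]

query = ("article/bm/bib/bibl/bb"
         "[about(., hypercube mesh torus nonnumerical database)]")
pairs = term_path_pairs(query)
for pair in pairs:
    print(pair)
```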

47
JuruXML Vague Path Matching
  • Modified vector-based cosine similarity

Example of length normalization:
cr(article/bibl, article/bm/bib/bibl/bb) = 3/6 = 0.5
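
One formula that reproduces the 3/6 in this example is (1 + |query path|) / (1 + |document path|) when the query path is a subsequence of the document path, and 0 otherwise. The slide does not spell out the formula, so the sketch below should be read as an assumed form that matches the example; the function names are illustrative.

```python
def is_subsequence(short, long):
    """True if the tags of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(tag in remaining for tag in short)

def context_resemblance(query_path, doc_path):
    """Assumed length-normalized resemblance between two tag paths."""
    q = query_path.strip("/").split("/")
    d = doc_path.strip("/").split("/")
    if not is_subsequence(q, d):
        return 0.0
    return (1 + len(q)) / (1 + len(d))

print(context_resemblance("article/bibl", "article/bm/bib/bibl/bb"))
```

Under this form an exact path match scores 1.0, and the score shrinks as the document path contains more tags that the query path skips.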
48
Query Relaxation on Structure
  • [Schlieder, EDBT 2002]
  • [Delobel & Rousset, 2002]
  • [Amer-Yahia et al., VLDB 2005]

49
XML Query Relaxation [Amer-Yahia et al., EDBT 2002]; FlexPath [Amer-Yahia et al., SIGMOD 2004]
  • Tree pattern relaxations
      • Leaf node deletion
      • Edge generalization
      • Subtree promotion

[Figure: a query tree pattern (book, info, edition: paperback, author: Dickens) shown next to relaxed patterns and data trees. Node labels in the figure: book, info, edition?, edition (paperback), edition: paperback, author: Dickens, author: Charles Dickens, author: C. Dickens]
50
Adaptation of tf.idf to XML: Whirlpool [Marian et al., ICDE 2005]
Document Collection (Information Retrieval) | XML Document
Document | XML Node (result is a subtree rooted at a returned node with a given tag and satisfying structural predicates in the query)
Keyword(s) | Tree Pattern
idf (inverse document frequency) is a function of the fraction of documents that contain the keyword(s) | idf is a function of the fraction of returned nodes that match the query tree pattern
tf (term frequency) is a function of the number of occurrences of the keyword in the document | tf is a function of the number of ways the query tree pattern matches the returned node
51
A Family of XML Scoring Methods [Amer-Yahia et al.,
VLDB 2005]
  • Twig scoring
      • High quality
      • Expensive computation
  • Path scoring
  • Binary scoring
      • Low quality
      • Fast computation

[Figure: query tree pattern with nodes book, info, edition (paperback), author (Dickens)]
52
Outline
  • Motivation
  • Full-Text Search Languages
  • Scoring
  • Simple Keyword Search
  • Tags + Keyword Search
  • Path Expressions + Keyword Search
  • XQuery + Complex Full-Text Search
  • Query Processing
  • Open Issues

53
XIRQL Relaxation
  • XIRQL proposes vague predicates, but it is not
    clear how to combine them with all of XQuery.
  • How to relax all of XQuery, including structured
    and scalar predicates, remains an open issue.

54
Outline
  • Motivation
  • Full-Text Search Languages
  • Scoring
  • Query Processing
  • Open Issues

55
Outline
  • Motivation
  • Full-Text Search Languages
  • Scoring
  • Query Processing
  • Simple Keyword Search
  • Tags + Keyword Search
  • Path Expressions + Keyword Search
  • XQuery + Complex Full-Text Search
  • Open Issues

56
Main Issue
  • Given: query keywords
  • Compute: Least Common Ancestors (LCAs) that
    contain query keywords, in ranked order

57
Naïve Method
  • Naïve inverted lists
      • Ricardo: 1, 5, 6, 8
      • XQL: 1, 5, 6, 7

Element IDs in the example document:
1 <workshop>
    2 date: 28 July
    3 <title>: XML and
    4 <editors>: David Carmel
    5 <proceedings>
        6 <paper>
            7 <title>: XQL and
            8 <author>: Ricardo
        <paper>

Problems: 1. Space Overhead 2. Spurious Results
Main issue: Decouples representation of ancestors
and descendants
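
The two problems can be seen by intersecting the naive lists from this slide. In the sketch below, the function name naive_conjunction is an illustrative assumption; the list contents are the ones shown above.

```python
# Naive inverted lists: each keyword maps to every element that contains it,
# including all ancestors of the element that directly holds the text.
NAIVE_LISTS = {
    "Ricardo": [1, 5, 6, 8],
    "XQL": [1, 5, 6, 7],
}

def naive_conjunction(lists, keywords):
    """Elements that contain all keywords, by plain list intersection."""
    common = set(lists[keywords[0]])
    for keyword in keywords[1:]:
        common &= set(lists[keyword])
    return sorted(common)

print(naive_conjunction(NAIVE_LISTS, ["Ricardo", "XQL"]))
```

The intersection returns elements 1, 5 and 6. Element 6 (the paper) is the most specific answer, while 1 and 5 are its ancestors and therefore spurious; storing ancestors in every list is also the source of the space overhead.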
58
Dewey Encoding of IDs [1850s]
0 <workshop>
    0.0 date: 28 July
    0.1 <title>: XML and
    0.2 <editors>: David Carmel
    0.3 <proceedings>
        0.3.0 <paper>
            0.3.0.0 <title>: XQL and
            0.3.0.1 <author>: Ricardo
        0.3.1 <paper>
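
With Dewey IDs, ancestor relationships can be read off the ID itself, because an ancestor's ID is a proper prefix of its descendants' IDs. A minimal sketch using the IDs from this slide follows; the helper names are illustrative assumptions.

```python
def parse_dewey(dewey_id):
    """'0.3.0.1' -> (0, 3, 0, 1)"""
    return tuple(int(part) for part in dewey_id.split("."))

def is_ancestor(ancestor_id, descendant_id):
    """An ancestor's Dewey ID is a proper prefix of the descendant's ID."""
    a, d = parse_dewey(ancestor_id), parse_dewey(descendant_id)
    return len(a) < len(d) and d[:len(a)] == a

def longest_common_prefix(id_a, id_b):
    """Dewey ID of the deepest element that contains both elements."""
    prefix = []
    for x, y in zip(parse_dewey(id_a), parse_dewey(id_b)):
        if x != y:
            break
        prefix.append(x)
    return ".".join(str(component) for component in prefix)

# <title> 0.3.0.0 and <author> 0.3.0.1 meet at <paper> 0.3.0.
print(longest_common_prefix("0.3.0.0", "0.3.0.1"))
print(is_ancestor("0.3", "0.3.0.1"))
```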
59
XRank: Dewey Inverted List (DIL)
Keyword | Dewey Id | Score | Position List
XQL | 5.0.3.0.0 | 85 | 32
XQL | 8.0.3.8.3 | 38 | 89, 91
Ricardo | 5.0.3.0.1 | 82 | 38
Ricardo | 8.2.1.4.2 | 99 | 52
Each keyword's list is sorted by Dewey Id.
Store IDs of elements that directly contain
keyword - Avoids space overhead
60
DIL Query Processing
  • Merge query keyword inverted lists in Dewey ID
    Order
  • Entries with common prefixes are processed
    together
  • Compute Longest Common Prefix of Dewey IDs during
    the merge
  • Longest common prefix ensures most specific
    results
  • Also suppresses spurious results
  • Keep top-m results seen so far in output heap
  • Calculate rank using two-dimensional proximity
    metric
  • Output contents of output heap after scanning
    inverted lists
  • Algorithm works in a single scan over inverted
    lists
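
The merge described above can be sketched as a single pass that keeps a stack for the current Dewey path. This is a simplified illustration, not XRank's actual implementation: it omits scores, position lists and the output heap, and it only reports the most specific elements that contain every keyword. The function name dil_merge and the input layout are assumptions.

```python
import heapq

def parse(dewey_id):
    return tuple(int(part) for part in dewey_id.split("."))

def dil_merge(inverted_lists):
    """inverted_lists maps keyword -> Dewey IDs of elements that directly
    contain it. Returns the most specific elements containing all keywords."""
    keywords = set(inverted_lists)
    streams = [sorted((parse(d), kw) for d in ids)
               for kw, ids in inverted_lists.items()]
    path = []     # components of the Dewey ID currently being processed
    seen = []     # seen[i]: keywords found so far below path[:i + 1]
    results = []

    def pop_to(depth):
        # Close every element deeper than the common prefix.
        while len(path) > depth:
            found = seen.pop()
            node = ".".join(str(component) for component in path)
            path.pop()
            if found >= keywords:
                results.append(node)      # most specific result
            elif seen:
                seen[-1] |= found         # pass keywords up to the parent

    # heapq.merge yields entries from all lists in Dewey ID order.
    for dewey, keyword in heapq.merge(*streams):
        lcp = 0
        while lcp < min(len(path), len(dewey)) and path[lcp] == dewey[lcp]:
            lcp += 1
        pop_to(lcp)
        for component in dewey[lcp:]:
            path.append(component)
            seen.append(set())
        seen[-1].add(keyword)
    pop_to(0)
    return results

lists = {"XQL": ["5.0.3.0.0", "8.0.3.8.3"],
         "Ricardo": ["5.0.3.0.1", "8.2.1.4.2"]}
print(dil_merge(lists))
```

Keywords inside an element that is already a result are not passed up to its ancestors, which is how this sketch suppresses the spurious ancestor results of the naive method.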

61
XRank: Ranked Dewey Inverted List (RDIL)
[Figure: the inverted list for XQL (and likewise for other keywords) is sorted by score, with a B-tree on Dewey Id over it]
62
RDIL Algorithm
  • An element may be ranked highly in one list and
    low in another list
  • B-tree helps search for low ranked element
  • When to stop scanning inverted lists?
  • Based on Threshold Algorithm [Fagin et al.,
    2002], which periodically calculates a threshold
  • Can stop if we have sufficient results above the
    threshold
  • Extension to most specific results
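
The stopping rule can be illustrated with a generic Threshold Algorithm sketch over score-sorted lists. It assumes that an element's overall score is the sum of its per-keyword scores and uses a dictionary lookup in place of the B-tree; it does not include RDIL's extension to most specific results. The function name and the sample data are assumptions.

```python
import heapq

def threshold_topk(lists, k):
    """lists: one list per keyword of (element_id, score) pairs, sorted by
    score in descending order. Returns the top-k (total_score, element_id)."""
    lookup = [dict(lst) for lst in lists]   # random access (B-tree stand-in)
    top = []                                # min-heap holding the best k so far
    seen = set()
    for depth in range(max(len(lst) for lst in lists)):
        last_scores = []
        for lst in lists:
            if depth >= len(lst):
                last_scores.append(0.0)     # list exhausted
                continue
            element, score = lst[depth]
            last_scores.append(score)
            if element not in seen:
                seen.add(element)
                # Random access: fetch this element's score in every list.
                total = sum(scores.get(element, 0.0) for scores in lookup)
                heapq.heappush(top, (total, element))
                if len(top) > k:
                    heapq.heappop(top)
        # No unseen element can score higher than the sum of the scores
        # at the current scan positions.
        threshold = sum(last_scores)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)

list_a = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
list_b = [("b", 0.7), ("c", 0.6), ("a", 0.2)]
print(threshold_topk([list_a, list_b], k=1))
```

In this example the scan stops after two positions per list, before element c's entry in the first list is ever read.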

63
RDIL Query Processing
[Figure: RDIL query processing for the keywords Ricardo and XQL. Each keyword has an inverted list and a B-tree on Dewey Id; candidates move through a Temp Heap to the Output Heap. Labels in the figure: P: 9.0.4.2.0; R; Rank(9.0.4); Dewey IDs 9.0.4.1.2, 8.2.1.4.2, 9.0.5.6, 10.8.3; threshold = Score(P) + Score(R); threshold = Score(P) + Max-Score]
64
ID Order vs. Rank Order
  • Approaches that combine benefits
  • Long ID inverted list, short score inverted list
  • HDIL (Guo et al., SIGMOD 2003)
  • Chunk inverted list based on score, organize by
    ID within chunk
  • FlexPath (Amer-Yahia et al., SIGMOD 2004)
  • SVR (Guo et al., ICDE 2005)

65
Outline
  • Motivation
  • Full-Text Search Languages
  • Scoring
  • Query Processing
  • Simple Keyword Search
  • Tags + Keyword Search
  • Path Expressions + Keyword Search
  • XQuery + Complex Full-Text Search
  • Open Issues

66
XSearch Technique
  • Given: an interconnection relationship R between
    nodes (semantic relationship)
      • R is reflexive and symmetric
  • Node interconnection index
      • Given two nodes n and n' in a document d, find if
        (n,n') is in R
      • Use dynamic programming to compute closure
      • Online vs. offline

67
Outline
  • Motivation
  • Full-Text Search Languages
  • Scoring
  • Query Processing
  • Simple Keyword Search
  • Tags + Keyword Search
  • Path Expressions + Keyword Search
  • XQuery + Complex Full-Text Search
  • Open Issues

68
XXL Indexing
  • Element Path Index (EPI)
  • Evaluates simple path expressions
  • Element Content Index (ECI)
  • Traditional inverted list (but replicates nested
    elements)
  • Ontology Index (OI)
  • Lookup similar concepts (for evaluating e)
  • Returned in ranked order

69
[Myaeng et al., SIGIR 1994]
[Figure: inverted list entry for the keyword XQL with fields Document ID (5), Element ID (85), and a sequence of (Element Tag, Probability) pairs: act 0.3, play 0.2, plays 0.1]

70
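Integrating Structure and IL [Kaushik et al.,
SIGMOD 2004]
[Figure: a document tree (nodes numbered 1 to 5, with tags book, edition, info, author, title) next to an inverted list entry for XQL, reached through a B tree. Entry fields: Document ID, Start ID, End ID, Depth, Score, Index ID. Sample values shown: 5, 85, 99, 3, 5, 0.9]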
71
Outline
  • Motivation
  • Full-Text Search Languages
  • Scoring
  • Query Processing
  • Simple Keyword Search
  • Tags + Keyword Search
  • Path Expressions + Keyword Search
  • XQuery + Complex Full-Text Search
  • Open Issues

72
Scoring Functions Critical for Top-k Query
Processing
  • Top-k answer quality depends on scoring function.
  • Efficient top-k query processing requires scoring
    function to be
  • Monotone.
  • Fast to compute.

73
Structural Join Relaxation
//book[./info[./author ftcontains "Dickens"]
             [./edition ftcontains "paperback"]]
[Figure: structural join plan over the nodes book, info, author, edition, with these join predicates]
  • contains(edition, paperback)
  • contains(author, Dickens)
  • pc(info, edition) or ad(book, edition)
  • pc(info, author)
  • pc(book, info) or ad(book, info)
74
Quark/GalaTex Architecture
[Figure: XML documents (.xml) go through preprocessing and inverted lists generation, which provides an API on positions to the full-text primitives (FTWord, FTWindow, FTTimes etc.). A full-text query goes through the XQFT parser, which produces an equivalent XQuery query for evaluation by the Quark/Galax XQuery engine.]
75
Outline
  • Motivation
  • Full-Text Search Languages
  • Scoring
  • Query Processing
  • Open Issues

76
System Architecture
Integration Layer
XQuery Engine
IR Engine
77
System Architecture
XQuery IR Engine
Quark/GalaTex use this architecture
78
Structural Relaxation
  • FOR $b SCORE $s in FUZZY
        /pub/book[. ftcontains "Usability" with stems]
    ORDER BY $s
    RETURN $b

79
Search Over Views
Data Source 1: <books> <book> </book> <book> </book> </books>
Data Source 2: <reviews> <review> </review> <review> </review> </reviews>
Integrated View: <book> <reviews> </reviews> </book>
80
Other Open Issues
  • Extensive experimental evaluation of scoring
    functions and ranking algorithms for XML (INEX).
  • Joint scoring on full-text and scalar predicates.
  • Score-aware algebra for XML for the joint
    optimization of queries on both structure and
    text.

81
Backup Slides
82
Why not use SQL/MM (or variant)?
  • Key difference: no strict demarcation between
    structured and text data in XML
      • Can issue structured and text queries over same
        data
          • Find books with year > 1995
          • Find books containing keyword "1998"
      • Can embed structured queries in text queries
          • Find books that contain the keywords that occur
            in the title of Richard Dawkins' books
  • Other important differences
      • XML/XQuery data model
      • Composability of full-text primitives

83
Scoring Function (monotonicity)
  • Required properties
  • Exact matches should be scored higher than
    relaxed matches (idf)
  • Returned elements with several matches should be
    ranked higher than those with fewer matches (tf)
  • How to combine tf and idf?
  • tf.idf, as used by IR, violates above properties
  • Ranking based on idf, then breaking ties using tf
    satisfies the properties
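
The difference between the two combinations can be sketched with two made-up matches: an exact match with a high idf and a single occurrence, and a relaxed match with a lower idf and many occurrences. The numbers and field names are assumptions for illustration.

```python
# Hypothetical matches: an exact match has the higher idf.
matches = [
    {"name": "relaxed", "idf": 1.0, "tf": 5},
    {"name": "exact", "idf": 2.0, "tf": 1},
]

def rank_by_product(items):
    """Classical tf.idf: order by the product tf * idf."""
    return sorted(items, key=lambda m: m["tf"] * m["idf"], reverse=True)

def rank_idf_then_tf(items):
    """Lexicographic order: idf first, tf only breaks ties."""
    return sorted(items, key=lambda m: (m["idf"], m["tf"]), reverse=True)

print([m["name"] for m in rank_by_product(matches)])
print([m["name"] for m in rank_idf_then_tf(matches)])
```

With the product, the relaxed match (5.0) outranks the exact match (2.0), which violates the first required property; the lexicographic order keeps the exact match first.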

[Figure: two pairs of matches (a) and (b), each rooted at book. Node labels: info, edition (paperback), author (Dickens), title (Great Expectations). One pair illustrates score(a) > score(b), the other score(a) < score(b).]