1
On Efficient Part-match Querying of XML Data
  • Michal Krátký, michal.kratky_at_vsb.cz
  • Marek Andrt, marek.andrt_at_vsb.cz
  • Department of Computer Science
  • VŠB-Technical University of Ostrava
  • Czech Republic

DATESO 2004
2
Contents
  • Introduction: XML, query languages, indexing XML
    data, part-match querying.
  • Multi-dimensional approach to indexing XML data.
  • Extension of the multi-dimensional approach for
    keyword-based querying.
  • Index data structures.
  • Preliminary experimental results.

3
Introduction
  • Native XML database: a set of documents is the
    database, and the DTD (XML Schema) is its
    database schema.
  • XML query languages (XPath, XQL, XQuery, ...).
  • A common feature is the ability to formulate
    paths in the XML graph (regular path expressions,
    XPath axes, and so on).
  • Indexing approaches are based on relational
    decomposition, tries, multi-dimensional data
    structures, signatures, and so on.

4
Part-match querying of XML data
  • Several approaches to keyword- or phrase-based
    searching have been published: XQuery-IR
    (WebDB'02), XKeyword (ICDE'03), and so on.
  • Knowledge from information retrieval (IR) is
    applied.
  • Query languages contain operators for matching
    term occurrences, for example contains(), ...

5
Multi-dimensional approach to indexing XML data
  • A graph is a set of paths. An XML document is
    decomposed into paths and labelled paths.
  • Labelled path lp ∈ X_LP:
    s_0, s_1, ..., s_lPN
  • Path p ∈ X_P:
    idU(u_0), idU(u_1), ..., idU(u_lLP), s
  • idU(u_i) is the unique number of a node u_i.

6
Indexes
  • Term index: a storage of the strings s_i of an
    XML document and their identifiers idT(s_i).
  • Labelled path index: a storage of points
    representing labelled paths.
  • Path index: a storage of points representing
    paths.

7
Example: labelled path index, path index
  • Labelled paths books,book,id; books,book,title;
    and books,book,author: the points (0,1,2),
    (0,1,4) and (0,1,6) are created from the idT of
    the element and attribute names. They are
    assigned idLP 0, 1 and 2.
  • For example, take the path to the value "The Two
    Towers". It belongs to the labelled path
    books,book,title with idLP 1. The vector
    (1,0,1,3,5) is created from the idLP, the unique
    numbers idU of the elements, and the idT of the
    term.
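
The decomposition above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the class `XmlIndex` and its methods are hypothetical names, the document is reconstructed from the ids shown on the slides, and the titles of the second and third book are invented. Identifiers are assigned in document order, which happens to reproduce the points (0,1,2), (0,1,4), (0,1,6) and the vector (1,0,1,3,5).

```python
import xml.etree.ElementTree as ET

# Assumed sample document: ids, authors and "The Two Towers" come from the
# slides; the other two titles are invented for illustration.
DOC = """
<books>
  <book id="003-04212"><title>The Two Towers</title><author>J.R.R. Tolkien</author></book>
  <book id="001-00863"><title>The Hobbit</title><author>J.R.R. Tolkien</author></book>
  <book id="045-00012"><title>Catch-22</title><author>Joseph Heller</author></book>
</books>
"""


class XmlIndex:
    def __init__(self):
        self.terms = {}    # term index: string -> idT
        self.lpaths = {}   # labelled path index: point of idT values -> idLP
        self.paths = []    # path index: points (idLP, idU..., idT of the value)
        self.next_idu = 0  # next unused unique node number idU

    def id_t(self, s):
        # Reuse the idT of a known string, otherwise assign the next one.
        return self.terms.setdefault(s, len(self.terms))

    def new_idu(self):
        self.next_idu += 1
        return self.next_idu - 1

    def add_leaf(self, names, nodes, value):
        id_lp = self.lpaths.setdefault(names, len(self.lpaths))
        self.paths.append((id_lp, *nodes, self.id_t(value)))

    def add(self, elem, names=(), nodes=()):
        names = names + (self.id_t(elem.tag),)
        nodes = nodes + (self.new_idu(),)
        # Attributes are treated as nodes with their own idU.
        for attr, value in elem.attrib.items():
            self.add_leaf(names + (self.id_t(attr),), nodes + (self.new_idu(),), value)
        text = (elem.text or "").strip()
        if text:
            self.add_leaf(names, nodes, text)
        for child in elem:
            self.add(child, names, nodes)


index = XmlIndex()
index.add(ET.fromstring(DOC.strip()))
print(index.lpaths)    # labelled path points and their idLP
print(index.paths[1])  # the path to "The Two Towers"
```

A dictionary and a list stand in for the multi-dimensional structures here; only the mapping from the document to points is shown.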

8
Query for values of elements and attributes
  • XPath query: books/book[author='Joseph Heller']
  • Query processing has 3 phases, finding:
    • the idT of the terms from the term index,
    • idLP 2 of the labelled path books,book,author
      from the labelled path index, using the point
      query (0,1,6),
    • the points from the path index, using the
      range query
      (2,0,0,0,12) : (2,max,max,max,12).
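
The three phases can be traced with a small, self-contained sketch. It is a stand-in under stated assumptions: the dictionaries and the list replace the real index structures, `range_query` is a linear scan rather than a multi-dimensional search, `MAX` is an assumed domain bound, and the term ids 9 and 11 are placeholders for two titles not given on the slides.

```python
MAX = 2**32 - 1  # assumed upper bound of every coordinate domain

# Term index (only the entries needed here), labelled path index, path index.
TERMS = {"books": 0, "book": 1, "author": 6, "J.R.R. Tolkien": 7, "Joseph Heller": 12}
LABELLED_PATHS = {(0, 1, 2): 0, (0, 1, 4): 1, (0, 1, 6): 2}
PATHS = [
    (0, 0, 1, 2, 3), (1, 0, 1, 3, 5), (2, 0, 1, 4, 7),
    (0, 0, 5, 6, 8), (1, 0, 5, 7, 9), (2, 0, 5, 8, 7),
    (0, 0, 9, 10, 10), (1, 0, 9, 11, 11), (2, 0, 9, 12, 12),
]


def range_query(points, low, high):
    """Return the points inside the hyper box low:high (linear-scan stand-in)."""
    return [
        p for p in points
        if len(p) == len(low) and all(l <= x <= h for l, x, h in zip(low, p, high))
    ]


def query_value(label_names, value):
    # Phase 1: idT of the searched term from the term index.
    id_t = TERMS[value]
    # Phase 2: point query in the labelled path index.
    id_lp = LABELLED_PATHS[tuple(TERMS[name] for name in label_names)]
    # Phase 3: range query in the path index; node numbers are left free.
    free = len(label_names)
    low = (id_lp, *([0] * free), id_t)
    high = (id_lp, *([MAX] * free), id_t)
    return range_query(PATHS, low, high)


result = query_value(["books", "book", "author"], "Joseph Heller")
print(result)
```

With the sample data the only matching point is (2,0,9,12,12), the path 0,9,12 to the author value "Joseph Heller".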

9
Enhanced querying
  • XPath axes are processed by a range query or a
    sequence of range queries. For example, the
    descendant axis: (0, idU(u_0), ..., idU(u_(l-1)),
    idU(u), 0, ..., 0) : (maxD, idU(u_0), ...,
    idU(u_(l-1)), idU(u), maxD, ..., maxD).
  • Regular path expressions. For example,
    //titlenameChaudhri is processed by a complex
    range query. The query can be processed in one
    run through the multi-dimensional data
    structure.
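
As a rough illustration of the descendant-axis box, the sketch below builds the two corner points for a node and applies them to the books example. The function `descendant_box`, the bound `MAX_D`, and the linear scan are assumptions made for the example; the term ids 9 and 11 stand for titles not given on the slides.

```python
MAX_D = 2**32 - 1  # assumed maximal value maxD of the domain

PATHS = [
    (0, 0, 1, 2, 3), (1, 0, 1, 3, 5), (2, 0, 1, 4, 7),
    (0, 0, 5, 6, 8), (1, 0, 5, 7, 9), (2, 0, 5, 8, 7),
    (0, 0, 9, 10, 10), (1, 0, 9, 11, 11), (2, 0, 9, 12, 12),
]


def descendant_box(ancestors, node, n):
    """Corner points of the range query for the descendants of `node`.

    `ancestors` holds idU(u_0), ..., idU(u_(l-1)); `n` is the dimension of
    the indexed points. The idLP and all trailing coordinates are left free.
    """
    prefix = (*ancestors, node)
    free = n - 1 - len(prefix)
    low = (0, *prefix, *([0] * free))
    high = (MAX_D, *prefix, *([MAX_D] * free))
    return low, high


def in_box(point, low, high):
    return all(l <= x <= h for l, x, h in zip(low, point, high))


# Everything below the second book element (idU 5, child of the root idU 0).
low, high = descendant_box(ancestors=(0,), node=5, n=5)
hits = [p for p in PATHS if in_box(p, low, high)]
print(low, high)
print(hits)
```

The box fixes only the node prefix, so one range query returns the id, title and author paths of that book regardless of their labelled path.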

10
Comparison of approaches
  • Mainstream approaches (XISS, XPath Accelerator)
    index single elements (attributes). For example,
    the query /e1e2dog is processed by joining the
    partial results.
  • Result formatting: for example, the result of
    the query //name is every matched subtree.
  • The Update and Insert operations are simple to
    perform.

11
Keyword-based searching
  • Motivation
  • /PLAYPERSONAE/PERSONAOTHELLO/TITLE
  • A Path-Labelled Path-Term (PLT) index is added.
  • The index covers a 3-dimensional space (idP,
    idLP, idT).
  • idP is added to the point representing a path:
    (idP, idLP, idU_0, idU_1, ..., idU_l, s).
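
A minimal sketch of how such a PLT index might be filled and queried follows. Everything concrete in it is an assumption: the sample values, the whitespace tokenisation, the sorted list used as the index, and the helper `paths_containing`. It only shows that a keyword lookup within one labelled path becomes a range query with idLP and idT fixed and idP free.

```python
MAX = 2**32 - 1  # assumed upper bound of the idP domain

# Hypothetical (idP, idLP, string value) records; the values are illustrative.
PATH_VALUES = [
    (0, 0, "The Tragedy of Othello"),
    (1, 1, "Dramatis Personae"),
    (2, 2, "OTHELLO a noble Moor"),
    (3, 2, "IAGO his ancient"),
]

terms = {}  # term index: word -> idT


def id_t(word):
    return terms.setdefault(word, len(terms))


# One 3-dimensional point (idP, idLP, idT) per word occurring in a value.
plt_index = sorted(
    (id_p, id_lp, id_t(word))
    for id_p, id_lp, value in PATH_VALUES
    for word in value.split()
)


def paths_containing(id_lp, word):
    """idP of all paths with labelled path id_lp whose value contains word."""
    if word not in terms:
        return []
    t = terms[word]
    low, high = (0, id_lp, t), (MAX, id_lp, t)  # idP free, idLP and idT fixed
    return [
        p[0] for p in plt_index
        if all(l <= x <= h for l, x, h in zip(low, p, high))
    ]


print(paths_containing(2, "OTHELLO"))
print(paths_containing(0, "OTHELLO"))
```

The returned idP values can then be used as fixed first coordinates of point queries in the path index.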

12
Path-Labelled Path-Term index: example
13
Query processing plan: example
14
Index data structures
  • Paged and balanced multi-dimensional data
    structures: (B)UB-trees, variants of R-trees.
  • Problems:
    • indexing points with different dimensions,
    • narrow range queries: a signature is applied
      for efficient processing (Signature R-tree).
  • Efficient processing of the complex range query.

15
Efficient processing of the complex range query
  • Complex range query: a sequence of range queries
    qb_1, qb_2, ..., qb_n.
  • The query can be processed in one run through
    the multi-dimensional data structure.
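
The benefit of a single run can be illustrated with a toy paged structure. This is not the algorithm used in the (B)UB-tree or R-tree implementations, whose details the slides do not give; it is a one-level model, assumed for the example, in which every page has a bounding box and each page read counts as one disk access (DAC).

```python
def intersects(mbr, box):
    (m_low, m_high), (low, high) = mbr, box
    return all(ml <= h and l <= mh for ml, mh, l, h in zip(m_low, m_high, low, high))


def inside(point, box):
    low, high = box
    return all(l <= x <= h for l, x, h in zip(low, point, high))


def make_pages(points, size):
    """Split sorted points into pages and store a bounding box per page."""
    points = sorted(points)
    pages = []
    for i in range(0, len(points), size):
        chunk = points[i:i + size]
        mbr = (tuple(map(min, zip(*chunk))), tuple(map(max, zip(*chunk))))
        pages.append((mbr, chunk))
    return pages


def naive(pages, boxes):
    """One pass per range query: a page may be read several times."""
    dac, result = 0, set()
    for box in boxes:
        for mbr, chunk in pages:
            if intersects(mbr, box):
                dac += 1
                result.update(p for p in chunk if inside(p, box))
    return result, dac


def one_run(pages, boxes):
    """Single pass: a page is read once if any query box intersects it."""
    dac, result = 0, set()
    for mbr, chunk in pages:
        hit = [b for b in boxes if intersects(mbr, b)]
        if hit:
            dac += 1
            result.update(p for p in chunk if any(inside(p, b) for b in hit))
    return result, dac


points = [(lp, u) for lp in range(4) for u in range(8)]
pages = make_pages(points, size=8)
boxes = [((0, 2), (0, 5)), ((0, 6), (0, 7)), ((2, 0), (2, 3))]
naive_result, naive_dac = naive(pages, boxes)
run_result, run_dac = one_run(pages, boxes)
print(naive_dac, run_dac, naive_result == run_result)
```

Both strategies return the same points; the single run saves accesses whenever several query boxes fall into the same page.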

16
Experimental results
  • Protein Sequence Database XML document:
    • the document size is 683 MB,
    • number of elements: 21,305,818,
    • number of attributes: 1,290,647,
    • maximal length of a path: 7.
  • BUB-forest, R-forest, Signature BUB-tree and
    R-tree. Index structures: trees indexing spaces
    of dimension n = 7 and n = 9.

17
Experimental results
Queries: ProteinDatabase/ProteinEntry/reference/refinfo/authors/author='Smith, E.L.'
18
Experimental results: regular path expression
  • Query: //uid='89071748'; 5 labelled paths were
    matched.
  • Naive processing of the complex range query:
    DAC 368
  • Efficient processing of the complex range query:
    DAC 139
  • Time 0.03 s, improvement 2.5x

19
Preliminary experimental results: keyword-based
searching
  • othello.xml:
    • document size: 250 kB,
    • maximal length of a path: 6,
    • number of paths: 4,967,
    • number of labelled paths: 13,
    • number of terms: 8,744,
    • PLT index: 27,127.

20
Preliminary experimental results: keyword-based
searching
  • Query: /PLAYPERSONAE/PERSONAOTHELLO/TITLE
  • Labelled path index: result size 1, DAC 3
  • PLT index: result size 1, DAC 3
  • Path index: result size 1, DAC 13
  • Path index: result size 1, DAC 4

21
Conclusion
  • T(m log n), T(c m log n) vs. T(m1 m2), m1
    ,m2 m.
  • Efficient processing of a query with an AND
    condition. A signature is applied.
  • The multi-dimensional approach to term searching
    may be applied (e.g. comp).
  • The update operation on XML documents.
  • Comparison with other approaches on test
    collections (INEX, XMark, ...).

http://www.cs.vsb.cz/arg
22
References
  • M. Krátký, J. Pokorný, V. Snášel: Implementation
    of XPath Axes in the Multi-dimensional Approach
    to Indexing XML Data. Accepted at the
    International Workshop on Database Technologies
    for Handling XML Information on the Web, DataX,
    Int'l Conference on EDBT, Heraklion, Crete,
    Greece, 2004.
  • M. Krátký, J. Pokorný, T. Skopal, V. Snášel: The
    Geometric Framework for Exact and Similarity
    Querying XML Data. In Proceedings of EurAsia-ICT
    2002, Shiraz, Iran. Springer Verlag, LNCS 2510.
  • M. Krátký, T. Skopal, V. Snášel:
    Multidimensional Term Indexing for Efficient
    Processing of Complex Queries. Kybernetika,
    Journal of the Academy of Sciences of the Czech
    Republic, 2004, accepted.

23
Paths, labelled paths
  • Paths 0,1,2,003-04212; 0,5,6,001-00863; and
    0,9,10,045-00012 belong to the labelled path
    books,book,id.
  • . . .
  • Paths 0,1,4,J.R.R. Tolkien; 0,5,8,J.R.R.
    Tolkien; and 0,9,12,Joseph Heller belong to the
    labelled path books,book,author.

24
Complex queries
  • Query for values combined with XPath axis
    processing,
  • e.g. books/book[author='Joseph Heller']/title
    • A combination of the techniques described
      above: query for a value, XPath axis
      processing.
  • Regular path expression queries,
  • for example books//author
    • A sequence of range queries in the path and
      labelled path indexes processes this query:
      books,author - books,*,author -
      books,*,...,*,author.
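
The sequence of range queries for books//author can be generated mechanically. The sketch below is an assumption-laden illustration: `descendant_queries` and `match_labelled_paths` are hypothetical helpers, `MAX` is an assumed domain bound, the free positions between the two names are modelled as full-domain intervals, and the labelled path index is a plain dictionary holding the three labelled paths of the books example.

```python
MAX = 2**32 - 1  # assumed upper bound of the idT domain

TERMS = {"books": 0, "book": 1, "id": 2, "title": 4, "author": 6}
LABELLED_PATHS = {(0, 1, 2): 0, (0, 1, 4): 1, (0, 1, 6): 2}


def descendant_queries(first, last, max_len):
    """Yield one box per labelled path length: first,last - first,*,last - ..."""
    a, b = TERMS[first], TERMS[last]
    for gap in range(max_len - 1):
        low = (a, *([0] * gap), b)
        high = (a, *([MAX] * gap), b)
        yield low, high


def match_labelled_paths(first, last, max_len):
    """idLP of every labelled path matched by first//last."""
    hits = []
    for low, high in descendant_queries(first, last, max_len):
        for point, id_lp in LABELLED_PATHS.items():
            same_dim = len(point) == len(low)
            if same_dim and all(l <= x <= h for l, x, h in zip(low, point, high)):
                hits.append(id_lp)
    return hits


queries = list(descendant_queries("books", "author", max_len=3))
matched = match_labelled_paths("books", "author", max_len=3)
print(queries)
print(matched)
```

Each matched idLP then contributes one range query to the complex range query evaluated in the path index.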

25
(B)UB-tree, R-tree
26
Narrow range query: signature multi-dimensional
data structures
  • Regions intersecting the query hyper box are
    searched, O(N_I log_c n).
  • The ratio c_R of relevant (N_R) to intersected
    (N_I) regions becomes much smaller than 1 with
    increasing dimension.
  • Signatures are applied for better filtering of
    irrelevant regions: signature multi-dimensional
    data structures.
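
The filtering idea can be sketched with a generic superimposed-coding signature. This is not the Signature R-tree design, which the slides do not specify; the hash in `value_bit`, the signature width `BITS`, and the two sample regions are assumptions. A region whose bounding box intersects the query box is still skipped when its signature lacks a bit required by the query.

```python
BITS = 64  # assumed signature width


def value_bit(dim, value):
    # Toy deterministic hash: one signature bit per (dimension, value) pair.
    return 1 << ((dim * 31 + value * 17) % BITS)


def signature(points):
    """Superimpose the bits of all coordinate values stored in a region."""
    sig = 0
    for point in points:
        for dim, value in enumerate(point):
            sig |= value_bit(dim, value)
    return sig


def may_contain(region_sig, fixed):
    """`fixed` maps a dimension to the single value the narrow query allows."""
    query_sig = 0
    for dim, value in fixed.items():
        query_sig |= value_bit(dim, value)
    return (region_sig & query_sig) == query_sig


# Two hypothetical regions; the bounding box of region_a spans idT 7..20,
# so it intersects a query for idT 12 although no stored point has idT 12.
region_a = [(2, 0, 1, 4, 7), (2, 0, 9, 12, 20)]
region_b = [(2, 0, 9, 12, 12)]
query = {0: 2, 4: 12}  # idLP = 2 and idT = 12, node numbers free

print(may_contain(signature(region_a), query))
print(may_contain(signature(region_b), query))
```

A signature test can report false positives when different values hash to the same bit, so a region that passes must still be searched; it never rejects a region that holds a match.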

27
Signature R-tree
28
Experimental results
Queries: ProteinDatabase/ProteinEntry/reference/refinfo/authors/author='Smith, E.L.'
29
Experimental results