Title: On Efficient Part-match Querying of XML Data
1 On Efficient Part-match Querying of XML Data
- Michal Krátký, michal.kratky_at_vsb.cz
- Marek Andrt, marek.andrt_at_vsb.cz
- Department of Computer Science
- VŠB-Technical University of Ostrava
- Czech Republic
DATESO 2004
2 Contents
- Introduction: XML, query languages, indexing XML data, part-match querying.
- Multi-dimensional approach to indexing XML data.
- Extension of the multi-dimensional approach for keyword-based querying.
- Index data structures.
- Preliminary experimental results.
3 Introduction
- Native XML database: a set of documents is a database, and the DTD (XML Schema) is its database schema.
- XML query languages (XPath, XQL, XQuery, ...).
- A common feature is the possibility to formulate paths in the XML graph (regular path expressions, XPath axes and so on).
- Approaches based on relational decomposition, tries, multi-dimensional structures, signatures and so on.
4 Part-match querying of XML data
- Some approaches to keyword-based or phrase-based searching have been published: XQuery-IR (WebDB 2002), XKeyword (ICDE 2003) and so on.
- Knowledge from IR is applied.
- Query languages contain operators for matching term occurrence, for example contains(), ...
5 Multi-dimensional approach to indexing XML data
- A graph is a set of paths. An XML document is decomposed into paths and labelled paths.
- labelled path lp ∈ X_LP: s_0, s_1, ..., s_{l_PN}
- path p ∈ X_P: idU(u_0), idU(u_1), ..., idU(u_{l_LP}), s
- idU(u_i): unique number of a node u_i
6 Indexes
- Term index: a storage of the strings s_i of an XML document and their idT(s_i).
- Labelled path index: a storage of points representing labelled paths.
- Path index: a storage of points representing paths.
7 Example: labelled path index, path index
- Labelled paths: books,book,id; books,book,title; and books,book,author. The points (0,1,2), (0,1,4) and (0,1,6) are created using the idT of element and attribute names; they get idLP 0, 1 and 2.
- For example, the path to the value "The Two Towers" belongs to the labelled path books,book,title with idLP 1. The vector (1,0,1,3,5) is created using the idLP, the unique numbers idU of the elements, and the idT of the term.
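The construction of the three indexes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: identifiers are handed out in document order, attributes count as nodes, and the second and third book titles in `DOC` are hypothetical (the slides give only the ids, the first title and the authors). The function name `build_indexes` is not from the original work.

```python
import xml.etree.ElementTree as ET

# Sample document: ids, first title and authors follow the slides;
# the other two titles are hypothetical placeholders.
DOC = """<books>
  <book id="003-04212"><title>The Two Towers</title><author>J.R.R. Tolkien</author></book>
  <book id="001-00863"><title>The Return of the King</title><author>J.R.R. Tolkien</author></book>
  <book id="045-00012"><title>Catch-22</title><author>Joseph Heller</author></book>
</books>"""


def build_indexes(xml_text):
    term_index = {}  # string -> idT
    lp_index = {}    # point of idT of names -> idLP
    path_index = []  # points (idLP, idU(u_0), ..., idU(u_l), idT(value))
    next_idu = 0

    def id_t(s):
        # Assumption: idT is assigned on first occurrence, in document order.
        return term_index.setdefault(s, len(term_index))

    def add_path(names, idus, value):
        lp_point = tuple(term_index[n] for n in names)
        id_lp = lp_index.setdefault(lp_point, len(lp_index))
        path_index.append((id_lp, *idus, id_t(value)))

    def visit(elem, names, idus):
        nonlocal next_idu
        id_t(elem.tag)
        names = names + [elem.tag]
        idus = idus + [next_idu]
        next_idu += 1
        for attr, value in elem.attrib.items():
            id_t(attr)
            add_path(names + [attr], idus + [next_idu], value)
            next_idu += 1
        text = (elem.text or "").strip()
        if text:
            add_path(names, idus, text)
        for child in elem:
            visit(child, names, idus)

    visit(ET.fromstring(xml_text), [], [])
    return term_index, lp_index, path_index


term_index, lp_index, path_index = build_indexes(DOC)
print(lp_index)       # {(0, 1, 2): 0, (0, 1, 4): 1, (0, 1, 6): 2}
print(path_index[1])  # (1, 0, 1, 3, 5), the path to "The Two Towers"
```

Under these assumptions the sketch reproduces the points of the example, which suggests the numbering scheme is at least consistent with the slides.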
8 Query for values of elements and attributes
- XPath query: books/book[author='Joseph Heller']
- Three phases of query processing, finding:
  - the idT of the terms from the term index,
  - idLP 2 of the labelled path books,book,author from the labelled path index: point query (0,1,6),
  - the points from the path index: range query (2,0,0,0,12):(2,max,max,max,12).
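The three phases can be illustrated with a small Python sketch. The index contents are hard-coded from the books example; the idT 7 for "J.R.R. Tolkien" is an assumption, and the linear scan in `range_query` merely stands in for a real multi-dimensional structure such as a UB-tree or R-tree. All function names are hypothetical.

```python
MAX = float("inf")

# Hard-coded excerpts of the three indexes (assumed values, see lead-in).
term_index = {"books": 0, "book": 1, "author": 6, "Joseph Heller": 12}
lp_index = {(0, 1, 2): 0, (0, 1, 4): 1, (0, 1, 6): 2}
path_index = [
    (1, 0, 1, 3, 5),
    (2, 0, 1, 4, 7),
    (2, 0, 5, 8, 7),
    (2, 0, 9, 12, 12),
]


def range_query(points, low, high):
    # Naive scan; a paged multi-dimensional tree would prune regions instead.
    return [
        p for p in points
        if len(p) == len(low)
        and all(lo <= x <= hi for lo, x, hi in zip(low, p, high))
    ]


def query_value(names, value):
    # Phase 1: idT of the terms from the term index.
    id_value = term_index[value]
    lp_point = tuple(term_index[n] for n in names)
    # Phase 2: point query in the labelled path index.
    id_lp = lp_index[lp_point]
    # Phase 3: range query in the path index; idU coordinates are left open.
    depth = len(names)
    low = (id_lp, *([0] * depth), id_value)
    high = (id_lp, *([MAX] * depth), id_value)
    return range_query(path_index, low, high)


print(query_value(["books", "book", "author"], "Joseph Heller"))
# [(2, 0, 9, 12, 12)]
```

The second coordinate of the result, idU 9, identifies the matching book element.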
9 Enhanced querying
- XPath axes are processed by a range query or a sequence of range queries. For example, the descendant axis: (0, idU(u_0), ..., idU(u_{l-1}), idU(u), 0, ..., 0):(max_D, idU(u_0), ..., idU(u_{l-1}), idU(u), max_D, ..., max_D).
- Regular path expressions. For example, //titlenameChaudhri is processed by a complex range query. The query can be processed in one run in the multi-dimensional data structure.
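The descendant-axis range query can be tried out on the path points of the books example. The sketch below assumes that all points have the same dimension (the slides list points with different dimensions as a separate problem) and that the term identifiers of the second and third book are as shown; `descendant_box` and `range_query` are hypothetical helper names, and the scan replaces a real index structure.

```python
MAX_D = 2**32 - 1  # assumed upper bound of the domain

# Path points (idLP, idU(books), idU(book), idU(leaf), idT(value)).
PATH_INDEX = [
    (0, 0, 1, 2, 3), (1, 0, 1, 3, 5), (2, 0, 1, 4, 7),
    (0, 0, 5, 6, 8), (1, 0, 5, 7, 9), (2, 0, 5, 8, 7),
    (0, 0, 9, 10, 10), (1, 0, 9, 11, 11), (2, 0, 9, 12, 12),
]


def descendant_box(prefix, dim):
    """Query box for all paths running through the node whose
    idU sequence from the root is `prefix`."""
    free = dim - 1 - len(prefix)  # coordinates after idU(u)
    low = (0, *prefix, *([0] * free))
    high = (MAX_D, *prefix, *([MAX_D] * free))
    return low, high


def range_query(points, low, high):
    return [
        p for p in points
        if all(lo <= x <= hi for lo, x, hi in zip(low, p, high))
    ]


# Descendants of the second book element (idU 5 below root idU 0).
low, high = descendant_box((0, 5), 5)
print(range_query(PATH_INDEX, low, high))
# [(0, 0, 5, 6, 8), (1, 0, 5, 7, 9), (2, 0, 5, 8, 7)]
```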
10 Comparison of approaches
- Mainline approaches (XISS, XPath Accelerator) index single elements (attributes). For example, the query /e1e2dog is processed by joining the single results.
- Result formatting. For example, the result of the query //name is all matched subtrees.
- The Update and Insert operations are straightforward.
11 Keyword-based searching
- Motivation:
- /PLAYPERSONAE/PERSONAOTHELLO/TITLE
- A Path-Labelled Path-Term (PLT) index is added.
- The index covers a 3-dimensional space (idP, idLP, idT).
- idP is added into the point representing a path: (idP, idLP, idU_0, idU_1, ..., idU_l, s).
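One possible reading of the PLT index is sketched below in Python. It is not the authors' implementation: the tokenization (lower-cased word characters), the use of the list position as idP, and the sample values are all assumptions made for illustration, and the names `build_plt` and `keyword_paths` are hypothetical.

```python
import re

# (idLP, text value) per path; idP is the list position.
# The strings are hypothetical stand-ins for values from a play document.
PATHS = [
    (0, "OTHELLO, the Moor"),
    (0, "IAGO, his ancient"),
    (1, "The Tragedy of Othello"),
]


def build_plt(paths):
    term_index = {}  # keyword -> idT
    plt = set()      # points (idP, idLP, idT)
    for id_p, (id_lp, text) in enumerate(paths):
        for word in re.findall(r"\w+", text.lower()):
            id_t = term_index.setdefault(word, len(term_index))
            plt.add((id_p, id_lp, id_t))
    return term_index, sorted(plt)


def keyword_paths(term_index, plt, id_lp, keyword):
    """Equivalent to the range query (0, idLP, idT):(max, idLP, idT)."""
    id_t = term_index.get(keyword.lower())
    if id_t is None:
        return []
    return [p for (p, lp, t) in plt if lp == id_lp and t == id_t]


term_index, plt = build_plt(PATHS)
print(keyword_paths(term_index, plt, 0, "OTHELLO"))  # [0]
```

The returned idP values can then be used as the first coordinate of a query in the path index.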
12 Path-Labelled Path-Term index: example
13 Query processing plan: example
14 Index data structures
- Paged and balanced multi-dimensional data structures: (B)UB-trees, variants of R-trees.
- Problems:
  - indexing points with different dimensions,
  - narrow range query: the signature is applied for efficient processing (Signature R-tree).
- Efficient processing of the complex range query.
15 Efficient processing of the complex range query
- Complex range query: a sequence of range queries q_b1, q_b2, ..., q_bn.
- The query can be processed in one run in the multi-dimensional data structure.
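Why a single run helps can be shown with a toy cost model in Python. The sketch assumes a flat list of leaf pages described only by their bounding boxes and counts one disk access per page read; a real (B)UB-tree or R-tree also has inner nodes, so the numbers are illustrative only. The function names and the sample boxes are hypothetical.

```python
def intersects(a, b):
    """True if boxes a = (low, high) and b = (low, high) overlap."""
    return all(
        a_lo <= b_hi and b_lo <= a_hi
        for a_lo, a_hi, b_lo, b_hi in zip(a[0], a[1], b[0], b[1])
    )


def naive_dac(pages, queries):
    # One traversal per query box: a page hit by k boxes is read k times.
    return sum(1 for q in queries for page in pages if intersects(page, q))


def one_run_dac(pages, queries):
    # One traversal for the whole sequence: each page is read at most once.
    return sum(1 for page in pages if any(intersects(page, q) for q in queries))


# Hypothetical 2-dimensional leaf pages and a complex range query of 3 boxes.
PAGES = [((0, 0), (4, 4)), ((5, 0), (9, 4)), ((0, 5), (4, 9))]
QUERIES = [((1, 1), (2, 2)), ((3, 3), (6, 3)), ((0, 6), (1, 7))]

print(naive_dac(PAGES, QUERIES), one_run_dac(PAGES, QUERIES))  # 4 3
```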
16 Experimental results
- Protein Sequence Database XML document:
  - the document size is 683 MB,
  - number of elements: 21,305,818,
  - number of attributes: 1,290,647,
  - maximal length of a path: 7.
- BUB-forest, R-forest, Signature BUB-tree and R-tree. Index structures: trees indexing spaces of dimension n = 7 and n = 9.
17 Experimental results
Queries: ProteinDatabase/ProteinEntry/reference/refinfo/authors/author='Smith, E.L.'
18 Experimental results: regular path expression
- Query //uid='89071748'; 5 labelled paths were matched.
- Naive processing of the complex range query: DAC 368.
- Efficient processing of the complex range query: DAC 139.
- Time 0.03 s, improvement 2.5x.
19 Preliminary experimental results: keyword-based searching
- othello.xml:
  - document size is 250 kB,
  - maximal length of a path: 6,
  - number of paths: 4,967,
  - number of labelled paths: 13,
  - number of terms: 8,744,
  - PLT index: 27,127.
20 Preliminary experimental results: keyword-based searching
- Query /PLAYPERSONAE/PERSONAOTHELLO/TITLE
- Labelled path index: result size 1, DAC 3
- PLT index: result size 1, DAC 3
- Path index: result size 1, DAC 13
- Path index: result size 1, DAC 4
21 Conclusion
- T(m log n), T(c m log n) vs. T(m_1 m_2), m_1, m_2 m.
- Efficient processing of a query with an AND condition. A signature is applied.
- The multi-dimensional approach to term searching may be applied (e.g. comp).
- The update operation of XML documents.
- Comparison with other approaches on test collections (INEX, XMark, ...).
http://www.cs.vsb.cz/arg
22 References
- M. Krátký, J. Pokorný, V. Snášel: Implementation of XPath Axes in the Multi-dimensional Approach to Indexing XML Data. Accepted at the International Workshop on Database Technologies for Handling XML Information on the Web, DataX, Int'l Conference on EDBT, Heraklion - Crete, Greece, 2004.
- M. Krátký, J. Pokorný, T. Skopal, V. Snášel: The Geometric Framework for Exact and Similarity Querying XML Data. In Proceedings of EurAsia-ICT 2002, Shiraz, Iran, Springer Verlag, LNCS 2510.
- M. Krátký, T. Skopal, V. Snášel: Multidimensional Term Indexing for Efficient Processing of Complex Queries. Kybernetika, Journal of the Academy of Sciences of the Czech Republic, 2004, accepted.
23 Paths, labelled paths
- Paths 0,1,2,003-04212; 0,5,6,001-00863; and 0,9,10,045-00012 belong to the labelled path books,book,id,
- . . .
- Paths 0,1,4,J.R.R. Tolkien; 0,5,8,J.R.R. Tolkien; and 0,9,12,Joseph Heller belong to the labelled path books,book,author.
24 Complex queries
- Query for values and XPath axis processing,
- e.g. books/book[author='Joseph Heller']/title
  - A combination of the techniques described above: query for a value, XPath axis processing.
- Regular path expression queries,
- for example books//author
  - A sequence of range queries processes this query in the path and labelled path index: books,author - books,,author - books,,,,author.
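The sequence of range queries for books//author can be sketched against the labelled path index of the books example. The sketch assumes that one box is generated per possible number of intermediate names, up to the longest labelled path stored, and that a box only matches points of its own dimension; `lp_boxes` and `match_labelled_paths` are hypothetical names.

```python
MAX = float("inf")

# Labelled path index of the books example: point of idT -> idLP.
LP_INDEX = {(0, 1, 2): 0, (0, 1, 4): 1, (0, 1, 6): 2}


def lp_boxes(first, last, max_len):
    """Yield one (low, high) box per number of names between first and last."""
    for inner in range(max_len - 1):
        low = (first, *([0] * inner), last)
        high = (first, *([MAX] * inner), last)
        yield low, high


def match_labelled_paths(lp_index, first, last):
    max_len = max(len(point) for point in lp_index)
    hits = []
    for low, high in lp_boxes(first, last, max_len):
        for point, id_lp in lp_index.items():
            if len(point) == len(low) and all(
                lo <= x <= hi for lo, x, hi in zip(low, point, high)
            ):
                hits.append(id_lp)
    return hits


# books has idT 0 and author has idT 6 in the example.
print(match_labelled_paths(LP_INDEX, 0, 6))  # [2]
```

Each matched idLP then becomes the first coordinate of a range query in the path index.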
25 (B)UB-tree, R-tree
26 Narrow range query: signature multi-dimensional data structures
- Regions intersecting a query hyper box are searched, O(N_I log_c n).
- The ratio c_R of relevant (N_R) to intersecting (N_I) regions becomes much smaller than 1 with an increasing dimension.
- Signatures are applied for better filtration of irrelevant regions: signature multi-dimensional structures.
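The filtering idea can be illustrated with a simple per-dimension bit signature in Python. This is a generic sketch, not the Signature R-tree itself: the signature width, the hash (value modulo 64) and the sample pages are assumptions, and only coordinates that the query fixes to a single value are tested.

```python
SIG_BITS = 64
MAX = 2**32 - 1


def signature(points):
    """One bit mask per dimension, with a bit set for every stored value."""
    sig = [0] * len(points[0])
    for p in points:
        for d, v in enumerate(p):
            sig[d] |= 1 << (v % SIG_BITS)
    return sig


def mbr(points):
    return tuple(map(min, zip(*points))), tuple(map(max, zip(*points)))


def intersects(box, low, high):
    return all(b_lo <= hi and lo <= b_hi
               for b_lo, b_hi, lo, hi in zip(box[0], box[1], low, high))


def may_contain(sig, low, high):
    # A cleared bit proves that the page holds no point with that value.
    return all(
        (sig[d] >> (lo % SIG_BITS)) & 1
        for d, (lo, hi) in enumerate(zip(low, high))
        if lo == hi
    )


# Narrow range query from the books example and two hypothetical pages.
LOW, HIGH = (2, 0, 0, 0, 12), (2, MAX, MAX, MAX, 12)
PAGE_A = [(2, 0, 1, 4, 7), (2, 0, 9, 12, 12)]  # holds a relevant point
PAGE_B = [(2, 0, 5, 8, 7), (3, 0, 1, 2, 20)]   # intersects, but irrelevant

for page in (PAGE_A, PAGE_B):
    print(intersects(mbr(page), LOW, HIGH), may_contain(signature(page), LOW, HIGH))
```

Both pages intersect the query box, but the signature test lets the second page be skipped without reading it.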
27 Signature R-tree
28 Experimental results
Queries: ProteinDatabase/ProteinEntry/reference/refinfo/authors/author='Smith, E.L.'
29 Experimental results