Title: BLAS: An Efficient XPath Processing System
1. BLAS: An Efficient XPath Processing System
- Zhimin Song
- Advanced Database System
- Professor Dr. Mengchi Liu
2. Outline
- Introduction
- BLAS System
- Experimental Results
- Conclusions
3. Example XML document:
<ProteinDatabase>
  <ProteinEntry>
    <protein>
      <name>cytochrome c validated</name>
      <classification>
        <superfamily>cytochrome c</superfamily>
      </classification>
    </protein>
    <reference>
      <refinfo>
        <authors>
          <author>Evans, M.J.</author>
        </authors>
        <year>2001</year>
        <title>The human somatic cytochrome c gene</title>
      </refinfo>
    </reference>
  </ProteinEntry>
</ProteinDatabase>
4. Introduction
- XML documents have a complex, tree-like structure of nodes.
- Languages for querying XML are based on path navigation (XPath [1]).
  - Given node → child node (child axis)
  - Given node → descendant node (descendant axis)
5. Introduction (cont.)
- Several techniques have already been proposed to improve XPath processing. For example, D-labeling handles descendant axis traversal efficiently.
- What about complex queries that include child axes and branches?
- For this case, the paper proposes P-labeling. It optimizes an important class of queries called suffix path queries.
6. BLAS (Bi-LAbeling based System)
- Basic definitions
- The labeling scheme (index generator)
- Query translator
7. Basic definitions
- BLAS: a system for efficiently processing complex queries, based on D-labeling and P-labeling.
- BLAS deals with a subset of XPath queries consisting of:
  - Child axis navigation ( / )
  - Descendant axis navigation ( // )
  - Branches ( [..] )
- The evaluation of a path expression P, written $[[P]]$, returns the set of nodes in an XML tree T that are reachable by P starting from the root of T.
- Since P can be evaluated to retrieve a set of XML nodes, we use "path expression" and "query" interchangeably.
- $P \subseteq Q$ if and only if $[[P]] \subseteq [[Q]]$.
- $P \cap Q = \emptyset$ if and only if $[[P]] \cap [[Q]] = \emptyset$.
8. Basic definitions (cont.)
- Suffix path expression: a path expression P that optionally begins with a descendant axis step (//), followed by zero or more child axis steps (/).
  - Example: //protein/name
  - Another one: /proteinDatabase/proteinEntry/protein/name
- SP(n): the unique simple path P from the root to the node n.
- So evaluating a suffix path expression Q means finding all the nodes n such that $SP(n) \subseteq Q$.
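The containment test in the last bullet can be illustrated with a minimal sketch. It assumes simple paths are written as strings: a query starting with // matches when SP(n) ends with the query's steps, and a query starting with a single / matches only when SP(n) equals it. The function name `matches_suffix_path` is hypothetical and not taken from the paper.

```python
def matches_suffix_path(sp, q):
    """Return True if the simple path sp (e.g. '/a/b/c') is contained in suffix path q."""
    sp_steps = [s for s in sp.split("/") if s]
    q_steps = [s for s in q.split("/") if s]
    n = len(q_steps)
    if q.startswith("//"):
        # Descendant start: the last n steps of SP(n) must equal the query steps.
        return n <= len(sp_steps) and sp_steps[len(sp_steps) - n:] == q_steps
    # Absolute query: SP(n) must be exactly the query.
    return sp_steps == q_steps


name_path = "/proteinDatabase/proteinEntry/protein/name"
title_path = "/proteinDatabase/proteinEntry/reference/refinfo/title"
print(matches_suffix_path(name_path, "//protein/name"))   # True
print(matches_suffix_path(title_path, "//protein/name"))  # False
```

This string comparison is only a reference definition. The point of P-labeling is to replace it with a numeric range test.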
9. Architecture of BLAS
10. The labeling scheme (index generator)
- D-labeling scheme: each XML node is assigned a triplet <d1, d2, d3> with d1 < d2. For two nodes n (n.d1 < n.d2) and m (m.d1 < m.d2):
  - m is a descendant of n if and only if n.d1 < m.d1 and n.d2 > m.d2.
  - m is a child of n if and only if m is a descendant of n and n.d3 + 1 = m.d3.
- Let d1 and d2 for a node n be the positions of its start tag and end tag.
- d3 is set to the level of n in the XML tree, which is the length of the path from the root to n.
- → A D-label is represented as <start, end, level>.
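A minimal sketch of D-label assignment follows, using Python's standard `xml.etree.ElementTree`. It makes two assumptions that the slides do not fix: positions are counted in tags (a counter incremented at every start and end tag), and the root has level 1. The helper names `d_labels`, `is_descendant` and `is_child` are hypothetical.

```python
import xml.etree.ElementTree as ET


def d_labels(root):
    """Map every element to a D-label (start, end, level) via depth-first traversal."""
    labels = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1                 # position of the start tag
        start = counter
        for child in node:
            visit(child, level + 1)
        counter += 1                 # position of the end tag
        labels[node] = (start, counter, level)

    visit(root, 1)
    return labels


def is_descendant(m, n):
    """m is a descendant of n iff n.start < m.start and n.end > m.end."""
    return n[0] < m[0] and n[1] > m[1]


def is_child(m, n):
    """m is a child of n iff m is a descendant of n and n.level + 1 == m.level."""
    return is_descendant(m, n) and n[2] + 1 == m[2]


doc = ("<proteinDatabase><proteinEntry><protein><name>cytochrome c</name>"
       "</protein></proteinEntry></proteinDatabase>")
root = ET.fromstring(doc)
labels = d_labels(root)
protein = labels[root.find("proteinEntry/protein")]
name = labels[root.find("proteinEntry/protein/name")]
print(labels[root], protein, name)   # (1, 8, 1) (3, 6, 3) (4, 5, 4)
```

With these labels, ancestor and parent tests reduce to integer comparisons, which is what lets them be expressed as join conditions in SQL.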
11. [Figure: tree for the example document, with nodes proteinDatabase, proteinEntry, protein, reference, superfamily ("cytochrome c"), refinfo, year ("2001"), author ("Evans, M.J.") and title, and two descendant (//) edges]
SELECT pDB.start, pDB.end, refinfo.start, refinfo.end
FROM pDB, refinfo
WHERE pDB.start < refinfo.start AND pDB.end > refinfo.end
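The descendant join above can be tried out with a small sketch on an in-memory SQLite database. The table contents are made-up illustrative values, and the columns are named `d_start` and `d_end` here (an assumption, chosen to avoid using the SQL keyword END as a column name).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pDB     (d_start INTEGER, d_end INTEGER);
    CREATE TABLE refinfo (d_start INTEGER, d_end INTEGER);
    -- Hypothetical D-labels: one proteinDatabase node and two refinfo nodes,
    -- only the first of which lies inside the proteinDatabase interval.
    INSERT INTO pDB     VALUES (1, 100);
    INSERT INTO refinfo VALUES (20, 35), (120, 130);
""")

rows = conn.execute("""
    SELECT pDB.d_start, pDB.d_end, refinfo.d_start, refinfo.d_end
    FROM pDB, refinfo
    WHERE pDB.d_start < refinfo.d_start AND pDB.d_end > refinfo.d_end
""").fetchall()
print(rows)   # [(1, 100, 20, 35)]
```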
12. P-labeling scheme
- It is also important to implement child axis navigation efficiently.
  - e.g. /proteinDatabase/proteinEntry/protein/name
- Target: improve / evaluation
- Focus on suffix path queries
  - e.g. //protein/name
13. Assign each node a number <p1> and each suffix path an interval <p1, p2> such that:
- For any two suffix paths Q1 and Q2, Q1 is contained in Q2 if the interval of Q1 lies inside the interval of Q2:
  - $Q_2.p_1 \le Q_1.p_1$ and $Q_1.p_2 \le Q_2.p_2$
- A node n is contained in the suffix path Q if
  - $Q.p_1 \le SP(n).p_1 \le Q.p_2$.
- Let Q be a suffix path query. Then
  - $[[Q]] = \{\, n \mid Q.p_1 \le n.plabel \le Q.p_2 \,\}$, where $n.plabel = SP(n).p_1$.
14. P-labeling construction (algorithm)
- Suppose that there are n distinct tags (t1, t2, ..., tn).
- Assign "/" a ratio r0 and each tag ti a ratio ri such that
  - $r_0 + r_1 + r_2 + \dots + r_n = 1$.
  - Let $r_i = 1/(n+1)$.
- Define the domain of the numbers in a P-label to be the integers in [0, m-1], where m is chosen such that
  - $m \ge (n+1)^h$, where h is the length of the longest path in the XML tree.
- The algorithm is as follows:
  - Path // is assigned the interval (P-label) <0, m-1>.
  - Partition the interval <0, m-1> in tag order, in proportion to the ratio ri of each tag ti for the path //ti, and to the ratio r0 for the child axis navigation "/".
  - This means we allocate the interval $\langle 0, m r_0 - 1 \rangle$ to "/" and $\langle p_i, p_{i+1} - 1 \rangle$ to each //ti such that $(p_{i+1} - p_i)/m = r_i$ and $p_1/m = r_0$.
15. P-labeling construction (example)
Query: //protein/name, with m = 10^12, 99 tags, and ri = 0.01
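The construction can be sketched in a few lines under the equal-ratio setting ri = 1/(n+1) and m = (n+1)^h. The sketch narrows the interval from the last step of the path to the first, because a longer suffix path such as //protein/name is contained in the shorter //name. The function names `plabel` and `contains`, the tag order (position in the `tags` list) and the four-tag vocabulary are assumptions made for illustration only.

```python
def plabel(path, tags, h):
    """Return the interval (p1, p2) of a suffix path such as '//protein/name'."""
    base = len(tags) + 1          # n + 1 equal slots: slot 0 for '/', slot i for tag t_i
    width = base ** h             # m = (n + 1)^h, the size of the whole domain
    absolute = not path.startswith("//")
    steps = [s for s in path.split("/") if s]
    if len(steps) + absolute > h:
        raise ValueError("path is longer than h")
    p1 = 0
    for tag in reversed(steps):   # narrow the interval from the last step backwards
        width //= base
        p1 += (tags.index(tag) + 1) * width
    if absolute:                  # a leading '/' takes slot 0, so p1 stays unchanged
        width //= base
    return p1, p1 + width - 1


def contains(outer, inner):
    """Interval containment: inner lies inside outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


tags = ["name", "protein", "proteinDatabase", "proteinEntry"]   # hypothetical tag order
h = 5
q = plabel("//protein/name", tags, h)
node = plabel("/proteinDatabase/proteinEntry/protein/name", tags, h)[0]
print(q, node)                    # (875, 999) 990
print(q[0] <= node <= q[1])       # True: the node answers //protein/name
```

With 99 tags and h = 6 the same function works over m = 100^6 = 10^12, the setting quoted for the example query.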
16. Query translator: translates an input XPath query into standard SQL.
- Query decomposition
  - Splits the query into a set of suffix path queries and records the ancestor-descendant relationships between them.
- SQL generation
  - Computes each query's P-label and generates a corresponding subquery in SQL.
- SQL composition
  - The subqueries are combined into a single SQL query based on D-labeling and the ancestor-descendant relationships.
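The SQL generation and composition steps might look like the following sketch for the query /a/b//c, decomposed into the suffix paths /a/b and //c. Everything concrete here is an assumption: the single `nodes` table, its column names, the helper `compose_sql`, and the P-label intervals, which are illustrative precomputed values for a three-tag vocabulary (a, b, c) with m = 256.

```python
import sqlite3


def compose_sql(intervals):
    """Build one SQL query from P-label intervals ordered from ancestor to descendant."""
    aliases = [f"t{i}" for i in range(len(intervals))]
    # SQL generation: one P-label range condition per suffix path query.
    conds = [f"{a}.plabel BETWEEN {p1} AND {p2}"
             for a, (p1, p2) in zip(aliases, intervals)]
    # SQL composition: a D-label join per recorded ancestor-descendant pair.
    conds += [f"{anc}.d_start < {desc}.d_start AND {anc}.d_end > {desc}.d_end"
              for anc, desc in zip(aliases, aliases[1:])]
    last = aliases[-1]
    tables = ", ".join(f"nodes {a}" for a in aliases)
    return (f"SELECT {last}.tag, {last}.d_start, {last}.d_end "
            f"FROM {tables} WHERE " + " AND ".join(conds))


# Hypothetical labels for the document <a><b><c/></b><c/></a>.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (tag TEXT, plabel INTEGER, d_start INTEGER, d_end INTEGER)")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?)", [
    ("a", 64, 1, 8), ("b", 144, 2, 5), ("c", 228, 3, 4), ("c", 208, 6, 7)])

sql = compose_sql([(144, 147), (192, 255)])   # assumed intervals of /a/b and //c
result = conn.execute(sql).fetchall()
print(result)   # [('c', 3, 4)]
```

Only the c node nested inside b is returned. The second c passes the P-label range test for //c but fails the D-label join.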
17. Split algorithm
- D-elimination (query tree Q): p//q → p and //q
  - Depth-first traversal of the query tree.
  - Split p//q into p and //q.
  - Invoke B-elimination if there are branches in Q. Otherwise, evaluate Q using P-labels.
  - Join intermediate results by their D-labels.
[Figure: query tree split at its descendant (//) edges into subqueries Q1, Q2 and Q3; nodes: proteinDatabase, proteinEntry, protein, reference, refinfo, superfamily ("cytochrome c"), year ("2001"), author ("Evans, M.J.") and title]
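For a path without branches, the D-elimination step above amounts to cutting the query at every descendant axis. A minimal sketch, assuming the query is given as a string and using the hypothetical helper name `d_eliminate`:

```python
def d_eliminate(path):
    """Split p//q into p and //q; also return the ancestor-descendant pairs to join on."""
    head, *rest = path.split("//")
    pieces = ([head] if head else []) + ["//" + part for part in rest]
    # Consecutive pieces are related by the descendant axis, so their
    # results are later joined using D-labels.
    links = list(zip(pieces, pieces[1:]))
    return pieces, links


pieces, links = d_eliminate("/proteinDatabase/proteinEntry/protein//superfamily")
print(pieces)   # ['/proteinDatabase/proteinEntry/protein', '//superfamily']
```

Each resulting piece is a suffix path query, so it can be answered with a single P-label range selection.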
18. B-elimination (query tree Q1)
p[q1, q2, ..., qi]/r → p, //q1, //q2, ..., //qi, //r
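19. B-elimination (cont.)
[Figure: query tree decomposed by B-elimination into subqueries Q4, Q5, Q7, Q8 and Q9; nodes: proteinDatabase, proteinEntry, reference, refinfo, year ("2001") and title, connected by // edges]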
20. Push-up algorithm: optimizes branch elimination (B-elimination).
Since p/qi and p/r are more specific than //qi and //r, the split becomes
p[q1, q2, ..., qi]/r → p, p/q1, p/q2, ..., p/qi, p/r
[Figure: subqueries after push-up (labels Q4 and Q5), each prefixed with the path from proteinDatabase; nodes: proteinDatabase, proteinEntry, protein, reference, refinfo, year ("2001") and title]
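The difference between plain B-elimination and the push-up variant can be shown side by side. This is a sketch on strings, with hypothetical helper names `b_eliminate` and `push_up`. The branch and continuation paths in the example are illustrative, not taken from the slides.

```python
def b_eliminate(p, branches, r):
    """p[q1, ..., qi]/r -> p, //q1, ..., //qi, //r"""
    return [p] + ["//" + q for q in branches] + ["//" + r]


def push_up(p, branches, r):
    """p[q1, ..., qi]/r -> p, p/q1, ..., p/qi, p/r (prefix p is pushed into each piece)."""
    return [p] + [f"{p}/{q}" for q in branches] + [f"{p}/{r}"]


p = "/proteinDatabase/proteinEntry"
plain = b_eliminate(p, ["protein/name"], "reference/refinfo")
pushed = push_up(p, ["protein/name"], "reference/refinfo")
print(plain)    # ['/proteinDatabase/proteinEntry', '//protein/name', '//reference/refinfo']
print(pushed)
```

The pushed-up pieces are still suffix paths, but each one selects a narrower P-label interval, so fewer nodes reach the D-label joins.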
21. Unfold algorithm: a further optimization of descendant axis elimination (D-elimination).
- An example:
  - Q2: /ProteinDatabase/ProteinEntry/protein//superfamily = "cytochrome c"
  - Q21: /ProteinDatabase/ProteinEntry/protein/classification/superfamily = "cytochrome c"
p//q → p/r1/q, p/r2/q, ..., p/ri/q
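One way to obtain the intermediate paths r1, ..., ri is to enumerate them from schema information. The sketch below assumes such information is available as a parent-to-children map (the hypothetical `schema` dict) and that it contains no recursion. The helper name `unfold` and the schema contents are illustrative.

```python
def unfold(p, q, schema):
    """Rewrite p//q into the child-axis-only paths p/r/q allowed by the schema."""
    start = p.rstrip("/").split("/")[-1]   # last tag of p
    target = q.split("/")[0]               # first tag of q
    results = []

    def walk(tag, middle):
        for child in schema.get(tag, []):
            if child == target:
                results.append("/".join([p] + middle + [q]))
            walk(child, middle + [child])  # assumes a non-recursive schema

    walk(start, [])
    return results


schema = {
    "protein": ["name", "classification"],
    "classification": ["superfamily"],
}
unfolded = unfold("/ProteinDatabase/ProteinEntry/protein", "superfamily", schema)
print(unfolded)
# ['/ProteinDatabase/ProteinEntry/protein/classification/superfamily']
```

Each unfolded path contains only child axis steps, so it is a single suffix path query and needs no D-label join.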
22. Experimental results
- Data sets
- Query sets
  - Suffix path queries
  - Path queries
  - XPath queries
- Query engine: RDBMS or file system
23. Query execution time
[Figure: query time for the Shakespeare, Protein and Auction data sets. Legend: 1 = suffix path query, 2 = path query, 3 = XPath query; A = Auction, P = Protein, S = Shakespeare]
24. Scalability
[Figure: performance of D-labeling, Split and Push-up for the suffix path query]
25. Conclusion
- A P-labeling scheme is proposed to evaluate suffix path queries efficiently.
- BLAS combines P-labeling and D-labeling to evaluate XPath queries.
- BLAS is more efficient because the queries translated from XPath queries require:
  - fewer disk accesses
  - fewer joins
- Experiments show the effectiveness of BLAS.
26. References
- [1] J. Clark and S. DeRose. XML Path Language (XPath), November 1999. http://www.w3.org/TR/xpath
- [13] D. DeHaan, D. Toman, M. Consens, and M. T. Ozsu. A comprehensive XQuery to SQL translation using dynamic interval encoding. In Proceedings of SIGMOD, 2001.
- [26] J.-K. Min, M.-J. Park, and C.-W. Chung. XPRESS: A queriable compression for XML data. In Proceedings of SIGMOD, 2003.
27. Thank you!
Questions?