Title: On Wrapping Query Languages and Efficient XML Integration
1On Wrapping Query Languages and Efficient XML Integration
- A paper by Vassilis Christophides and Sophie Cluet
- Speaker: Yu Wang
2Outline
- Introduction
- YAT System
- YAT XML Algebra
- Wrapping source query languages
- Optimization techniques
- Conclusion
3Introduction
- Applications require integrated access to various information sources, fast deployment, and low maintenance cost
- XML
- Enables easy wrapping of external sources
- Enables declarative integration
4Advantage in using XML
- Has a flexible format that can represent structured and semistructured information
- Data can be converted into XML easily
- Many languages exist that allow declarative integration of XML data (e.g. MSL, YATL)
- As a standard, it facilitates interoperability
5Hard issues
- Wrapping type information
- XML's current form of typing is not sufficient to capture rich type systems (e.g. an object database schema).
- Recent proposals (e.g. XML Schema, DCD) don't provide a definitive standard yet.
- Wrapping source query capabilities
- In the TSIMMIS system, query templates are used to describe source capabilities
- Templates do not allow an exhaustive description of a source's capabilities
- Processing XML queries efficiently
- XML does not have a well-understood algebra
6Solution
- This paper proposes an algebraic framework and optimization techniques to address the last 2 issues
- An algebra for XML
- Introduces an operational model based on a general-purpose algebra for XML
- A source description language
- Uses the algebra to wrap full-text queries and structured query languages (e.g. OQL, SQL)
- Query processing techniques
- Shows that the algebra is appropriate for optimizing integration applications
7Example
- An example shows the improvements this paper proposes
- Goal
- Integrate two sources
- Highly structured: an object database
- Partially structured: a document repository, full-text indexed with Wais
8Sample XML Data
<work>
  <artist> Claude Monet </artist>
  <title> Nympheas </title>
  <style> Impressionist </style>
  <size> 21 x 61 </size>
  <cplace>Giverny</cplace>
</work>
...
<work>
  <artist> Claude Monet </artist>
  <title> Waterloo Bridge </title>
  <style> Impressionist </style>
  <size> 29.2 x 46.4 </size>
  <history>Painted with <technique> Oil on canvas </technique> in ...
</work>
<object id="a1" class="artifact">
  <tuple>
    <title> Nympheas </title>
    <year> 1897 </year>
    <creator> Claude Monet </creator>
    <price> 10.000.000 </price>
    <owners refs="p1 p2 p3"/>
  </tuple>
</object>
...
<object id="p3" class="person">
  <tuple>
    <name> Doctor X </name>
    <auction> 10.1500.000</auction>
  </tuple>
</object>
9YAT System
- A semistructured data conversion system
- Relies on a library of generic wrappers and a declarative integration language, YATL
- The application example is set up with YAT in 3 steps
- Structural information is exported by the two wrappers, o2 and xmlwais
10Installing Wrappers and Mediators
- logossimeon o2-wrapper -server gringos.inria.fr -system cultural -base art -port 6066
- o2-wrapper is running at logos.inria.fr:6066
- logossimeon
- ------------------------------------------------------------------------------
- sapphochristop xmlwais-wrapper -directory christop/wais-sources/museum.src -port 6060
- xmlwais-wrapper is running at sappho.ics.forth.gr:6060
- sapphochristop
- ------------------------------------------------------------------------------
- cosmoscluet yat-mediator -port 6666
- yat-mediator is running at cosmos.inria.fr:6666
- yat> connect o2artifact logos.inria.fr:6066
- yat> connect xmlartwork sappho.ics.forth.gr:6060
- yat> import o2artifact
- yat> import xmlartwork
- yat> load "/u/cluet/YAT/view1.yat"
11YAT Type System
- Allows information to be represented at various levels of genericity (model, schema, data)
- Understands the connections between these levels
- This feature is used to wrap query languages
- A graphical representation of the YAT data model
- O2 data model
- Described as an atomic type, a tuple, a collection, or a reference
- A tuple type is represented as a collection of linear subtrees
12YAT Type System( Cont)
- The representation of the document exported by the xmlwais wrapper
- Described as a sequence of mandatory elements
- Captures partially structured information
- YAT meta-model
- Captures any tree
- The other models are instances of this model
13O2, XML-Wais and YAT mediator structural metadata
14Integration Programs
- Composed of
- A sequence of rules
- A sequence of queries with 3 clauses
- MATCH
- Performs pattern matching
- Filters are used to navigate in the source data and bind variables
- WHERE
- MAKE
- Constructs the result by creating a new tree
15Integrating information about the works of Art
- artworks()
- MAKE doc artwork(t,c) work title t,
- artist a,
- year y,
- price p,
- style s,
- size si,
- owners o,
- more fields
- MATCH artifacts WITH set class artifact
tuple - title t,
- year y,
- creator c,
- price p,
- Owners list class person tuple
- name o,
- auction au,
- works WITH works work artist a,
- title t',
16YAT XML Algebra
- Overview
- Characteristics
- Main tool for both the generic description of source query capabilities and XML query optimization
- Provides a fixed set of predefined operations
- Requirements it satisfies
- Expressive power
- Captures the evaluation of query and integration languages
- Support for flexible typing
- Support for optimization
- An extension of an object algebra
- Independent of any underlying physical access structure
17YAT XML Algebra( Cont)
- Operators
- Bind operator
- Extracts data from the input tree according to the filter
- Produces a tabular representation of the variable bindings
- Tree operator
- Returns a collection of trees conforming to the input pattern
- Equivalent to a grouping operation
- Skolem functions
- Create new identifiers
- Perform value assignment
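The two XML-specific operators can be illustrated with a minimal Python sketch. The tree encoding (label, children) and the function names `bind` and `tree` are assumptions made for this illustration only; they are not the YAT implementation. Bind turns matching trees into rows of variable bindings (a Tab), and Tree regroups those rows into new trees.

```python
# Trees are (label, children) pairs; each child here is a (label, text) leaf.
# This encoding is a hypothetical simplification for illustration.

def bind(docs, child_vars):
    """Bind sketch: for each tree, bind the requested child labels to variables.

    child_vars maps a child label to a variable name. The result is a Tab:
    a list of rows, each row a dict from variable name to value.
    """
    tab = []
    for _label, children in docs:
        found = dict(children)
        if all(label in found for label in child_vars):
            tab.append({var: found[label] for label, var in child_vars.items()})
    return tab


def tree(tab, group_var, member_var):
    """Tree sketch: group rows on group_var and nest member_var values under it."""
    groups = {}
    for row in tab:
        groups.setdefault(row[group_var], []).append(row[member_var])
    return [("artist", [("name", g)] + [("title", m) for m in members])
            for g, members in groups.items()]


works = [
    ("work", [("artist", "Claude Monet"), ("title", "Nympheas")]),
    ("work", [("artist", "Claude Monet"), ("title", "Waterloo Bridge")]),
]

tab = bind(works, {"artist": "a", "title": "t"})
result = tree(tab, "a", "t")
```

Here `tab` holds one row per work, and `result` holds a single artist tree with both titles nested under it, which is the grouping behaviour the slide attributes to Tree.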
18YAT XML Algebra( Cont)
- Object algebra
- Select/ Project / Join / Union / Intersection
- Group/ Sort/ Map / D-Join
- Applied at the top level of a Tab structure, except Map
19YAT XML Algebra( Cont)
20A Bind operation and resulting Tab structure
21The Tree operation
22YATL Algebraic Translation
- Steps
- Named documents are the input
- The MATCH clause is translated into a Bind operation
- The connection between the various inputs is materialized using a Join operation
- WHERE clauses are translated into a Select operation
- The MAKE clause is translated using the Tree operation
- The example shows the algebraic translation of the view definition of Figure 2 and an example query
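The translation steps can be sketched in Python as a small pipeline. The rows below stand in for the output of two Bind operations (one per source); the variable names, the join condition, and the selection `y > 1800` are hypothetical choices for illustration, since the slides do not show the full WHERE clause.

```python
# Rows as they might come out of one Bind per source (hypothetical variables).
o2_rows = [{"t": "Nympheas", "y": 1897, "c": "Claude Monet"}]
xml_rows = [
    {"t2": "Nympheas", "a": "Claude Monet", "s": "Impressionist"},
    {"t2": "Waterloo Bridge", "a": "Claude Monet", "s": "Impressionist"},
]


def join(left, right, pred):
    """Join: connect the rows of the two inputs."""
    return [{**lrow, **rrow} for lrow in left for rrow in right if pred(lrow, rrow)]


def select(tab, pred):
    """Select: keep rows satisfying the WHERE condition."""
    return [row for row in tab if pred(row)]


def tree(tab):
    """Tree: build one result tree per row, as the MAKE clause would."""
    return [{"work": {"title": r["t"], "artist": r["a"],
                      "year": r["y"], "style": r["s"]}} for r in tab]


joined = join(o2_rows, xml_rows,
              lambda lrow, rrow: lrow["t"] == rrow["t2"] and lrow["c"] == rrow["a"])
selected = select(joined, lambda row: row["y"] > 1800)  # assumed condition
view = tree(selected)
```

Only the Nympheas rows agree on title and artist, so `view` contains a single integrated work tree.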
23Query Example 1
- Q1: What are the artifacts created at "Giverny"?
- MAKE t
- MATCH artworks WITH doc.work. title.t, more.cplace.cl
- WHERE cl = "Giverny"
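As a rough illustration of what Q1 computes, the following Python sketch evaluates the same condition directly over two `work` elements taken from the sample data. The enclosing `<works>` root and the function name `q1` are assumptions for this sketch; the real query runs against the integrated artworks view, not the raw documents.

```python
import xml.etree.ElementTree as ET

# Two works from the sample data, wrapped in a hypothetical <works> root.
SAMPLE = """
<works>
  <work>
    <artist>Claude Monet</artist>
    <title>Nympheas</title>
    <cplace>Giverny</cplace>
  </work>
  <work>
    <artist>Claude Monet</artist>
    <title>Waterloo Bridge</title>
  </work>
</works>
"""


def q1(xml_text, place):
    """Return the titles of works whose cplace element equals place."""
    root = ET.fromstring(xml_text)
    titles = []
    for work in root.iter("work"):
        cplace = work.findtext("cplace")  # None when the element is absent
        if cplace is not None and cplace.strip() == place:
            titles.append(work.findtext("title").strip())
    return titles


answer = q1(SAMPLE, "Giverny")
```

The second work has no `cplace` element, so it simply fails to match, which mirrors how a filter over partially structured data skips trees that lack the requested branch.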
24Algebraization of YATL queries
25Wrapping source query languages
- Wrapping source operations in YAT is performed in two steps
- Signature
- Essential
- Manual
- Semantics
- Example function imported by the O2 wrapper
<operation name="external">
  <operation name="current_price">
    <input>
      <value model="Artifact_Schema" pattern="Artifact"/></input>
    <output>
      <leaf label="Float"/></output>
  </operation>
</operation>
26Describing OQL capabilities
- The YAT operational model borrows a large part of the OQL algebra
- OQL binding capabilities are more restricted
- This restriction is taken into account by restricting the Bind operation
- Bind is always the first operation in a query
27Describing OQL capabilities( Cont )
- Capturing binding capabilities
- The Bind operation has 2 parameters: a filter and the data to be filtered/bound
- Need to specify which filters are acceptable for OQL
- E.g. valid filters are described by filter patterns, called Fpatterns
- The integration programmer does not need to see them
- Coded by YAT developers
- Embedded within the O2 wrapper
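One way to picture what an Fpattern does is as a small grammar that a candidate filter must conform to before the Bind can be sent to the source. The Python sketch below is a hypothetical simplification: the nested-tuple encoding of filters and the function `accepts` are assumptions, and only the atomic type and collection names are taken from the O2 filter patterns shown in these slides.

```python
ATOMIC = {"Bool", "Char", "Int", "Float", "String"}
COLLECTIONS = {"set", "bag", "list", "array"}


def accepts(f):
    """Return True if filter f conforms to the simplified Ftype grammar.

    A filter is either an atomic type name (str) or a pair (kind, body):
      (collection, element_filter), ("tuple", {field: filter}),
      or ("class", filter) for a class wrapping a type.
    """
    if isinstance(f, str):
        return f in ATOMIC
    kind, body = f
    if kind in COLLECTIONS or kind == "class":
        return accepts(body)
    if kind == "tuple":
        return all(isinstance(name, str) and accepts(sub)
                   for name, sub in body.items())
    return False


ok_filter = ("set", ("class", ("tuple", {"title": "String", "year": "Int"})))
bad_filter = ("set", ("element", "String"))  # 'element' is not an O2 construct
```

A mediator could run such a check when deciding whether a Bind is a candidate for pushing to O2; filters that fail it would have to be evaluated elsewhere or rewritten first.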
28O2 Filter patterns exported in XML
<interface name="o2artifact">
  <operat>
    <fmodel name="o2fmodel">
      <fpattern name="Fclass">
        <node label="class" bind="tree">
          <node label="Symbol" bind="none" inst="ground">
            <value pattern="Ftype"/></node></node>
      </fpattern>

      <fpattern name="Ftype">
        <union>
          <leaf label="Bool"/>
          <leaf label="Char"/>
          <leaf label="Int"/>
          <leaf label="Float"/>
          <leaf label="String"/>
          <node label="tuple" col="set" bind="tree">
            <star inst="ground">
              <node label="Symbol" bind="none">
          <node label="set" col="set" bind="tree">
            <star inst="none"><value label="Ftype"/></star></node>
          <node label="bag" col="bag" bind="tree">
            <star inst="none"><value label="Ftype"/></star></node>
          <node label="list" bind="tree">
            <star inst="none"><value label="Ftype"/></star></node>
          <node label="array" bind="tree">
            <star inst="none"><value label="Ftype"/></star></node>
          <ref pattern="Fclass"/>
        </union>
      </fpattern>
    </fmodel>
  </operat>
</interface>
29Description for OQL. Below is a subset of the operational interface of the O2 wrapper
<omodel name="o2omodel">
  <operation name="algebraic">
    <union>
      <operation name="bind">
        <input>
          <value model="o2model" pattern="Type"/>
          <filter model="o2fmodel" pattern="Ftype"/>
        </input>
        <output>
          <value model="yatstruc" pattern="Tab"/>
        </output>
      </operation>
      <operation name="select"></operation>
      <operation name="map"></operation>
      ...
    </union>
  </operation>

  <operation name="boolean">
30Describing Wais capabilities
- Three steps to wrap the query capabilities of the XML-Wais source
- Specify the source Fpatterns
- Declare that the source supports Bind and Select
- Describe the full-text predicate contains supplied by Wais
31Interface to the XML-Wais wrapper
32Interface to the XML-Wais wrapper( Cont)
33Optimization techniques
- The algebra has two parts
- Object algebra
- Two operations to manipulate XML data
- Bind and Tree
- Optimization is divided into two parts
- Optimization techniques proposed for the relational or object models are directly applicable
- Rewriting techniques for the Bind/Tree operations
- Optimize user queries with views locally or by pushing queries to the external sources
34Bind Rewriting
- Reason
- A simpler Bind has a better chance of being pushed to a source
- Bind entails navigation that can be costly and should be transformed into more traditional associative access as much as possible
- 2 ways
- Vertical navigation
- Horizontal navigation and type filtering
35Bind and vertical navigation
- Two ways
- Split a complex Bind into elementary Binds, connected together through DJoins
- Split a complex Bind into a linear sequence of elementary ones, each one navigating down the result of the previous one
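The second strategy, splitting into a linear sequence, can be sketched in Python as follows. The dictionary encoding of trees and the names `bind_path` and `bind_step` are assumptions for illustration; the DJoin-based variant is not shown.

```python
# Each node is a dict mapping a child label to a list of children;
# the children at the last level are plain strings. Hypothetical encoding.

def bind_path(trees, path):
    """Complex Bind sketch: follow a whole path of labels in one operation."""
    if not path:
        return list(trees)
    head, rest = path[0], path[1:]
    children = [child for t in trees for child in t.get(head, [])]
    return bind_path(children, rest)


def bind_step(trees, label):
    """Elementary Bind sketch: navigate exactly one level down."""
    return [child for t in trees for child in t.get(label, [])]


docs = [{"work": [{"title": ["Nympheas"]}, {"title": ["Waterloo Bridge"]}]}]

one_shot = bind_path(docs, ["work", "title"])
stepwise = bind_step(bind_step(docs, "work"), "title")
```

Both expressions yield the same titles. The stepwise form exposes each navigation step as a separate operation, so each step can be matched individually against what a source is able to evaluate.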
36From Bind to Join
37Splitting Binds
38Bind, horizontal navigation and type filtering
- In the absence of type information
- In purely semistructured systems, the strategy is to navigate through the whole data graph
- Using type information about the data or the filter is useful for XML queries mixing structured and semistructured data
39Bind, horizontal navigation and type filtering( Cont)
- Semistructured queries over structured data
- Queries access both structure and content
- E.g. with precise type information, we can simplify the filter
- Structured queries over semistructured data
- By using a projection to rewrite the Bind operation, we can simplify the query
- Care is needed not to change the type-filtering semantics of the Bind
40Bind and Map or Project
41Tree-Bind Rewriting
- Tree captures the restructuring semantics of a query or view definition
- Tree can be rewritten as a sequence of Group, Sort and nested Map operations
- It is important to eliminate the intermediate Tree operations resulting from the composition of queries and view definitions
42Tree-Bind Rewriting( Cont)
- Optimization process
- Get rid of the Bind-Tree sequence that appears at the frontier between view definition and query
- Transform the Bind-Tree sequence into a simple projection with renaming
- Eliminate the branch corresponding to the O2 source and simplify the Bind on the XML source
- Merge the remaining Bind filters to obtain the final expression
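The first two steps of this process can be illustrated with a small Python sketch. The function names and the row and tree encodings are hypothetical; the point is only that a Bind applied directly to the output of a Tree yields the same rows as a projection with renaming applied to the Tree's input.

```python
def view_tree(tab):
    """Tree in the view: restructure each row into a 'work' tree."""
    return [{"work": {"title": r["t"], "artist": r["a"]}} for r in tab]


def query_bind(trees):
    """Bind in the query: extract the title of each 'work' tree into x."""
    return [{"x": tr["work"]["title"]} for tr in trees]


def project_rename(tab, mapping):
    """Projection with renaming: keep the mapped columns under new names."""
    return [{new: r[old] for old, new in mapping.items()} for r in tab]


source_tab = [
    {"t": "Nympheas", "a": "Claude Monet"},
    {"t": "Waterloo Bridge", "a": "Claude Monet"},
]

naive = query_bind(view_tree(source_tab))          # builds trees, then takes them apart
rewritten = project_rename(source_tab, {"t": "x"})  # never builds the trees
```

The rewritten plan skips the construction of intermediate trees entirely, which is where the saving comes from when a query is composed with a view definition.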
43Optimization of Q1
44Source Capability-based Rewriting
- Exploiting source capabilities during query processing is the most important technique in a distributed context
- Pushing query evaluation to an external source makes it possible to
- Reduce the processing time
- Minimize the communication costs
- Limit the use of system resources
- Benefit from possible parallelism
45Source Capability-based Rewriting( Cont)
- Optimization steps
- The Bind-Tree simplification
- The projection is used to simplify the Bind on each source, and selections are pushed
- Push as much evaluation as possible to the source
- On the O2 side, little work is required since both Bind and selection can be transformed into an OQL query
- On the XML-Wais side, the possibility is to push a simple Bind on XML documents along with a contains predicate
- Introduce a select with contains
- Split the Bind to match the Wais capabilities description
- Determine possible information passing between sources based on standard rewriting between Joins and DJoins
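A minimal Python sketch of the push-down decision is given below. It assumes a plan is a linear list of (operation, argument) pairs and that a source's capabilities are a set of operation names; both are simplifications invented for illustration, and the plan contents are hypothetical. The capability set for XML-Wais follows the slides, which state that the source supports Bind and Select.

```python
def split_plan(plan, capabilities):
    """Push the longest supported prefix of the plan; keep the rest local.

    Returns (pushed, local), where pushed is evaluated by the source and
    local is evaluated by the mediator on the source's result.
    """
    pushed = []
    for i, (op, arg) in enumerate(plan):
        if op not in capabilities:
            return pushed, plan[i:]
        pushed.append((op, arg))
    return pushed, []


WAIS_CAPABILITIES = {"bind", "select"}

# Hypothetical plan over the XML-Wais source.
plan = [
    ("bind", "work"),
    ("select", "contains(work, 'Impressionist')"),
    ("project", "title"),
]

pushed, local = split_plan(plan, WAIS_CAPABILITIES)
```

With these inputs the Bind and the contains selection are sent to the source, and the projection stays in the mediator. A real optimizer would also check the Bind filter against the source's Fpatterns rather than looking at operation names alone.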
46Query Example 2
- Q2: Which impressionist artworks are sold for less than 200,000.00?
- MAKE answer title t, artist a, price p
- MATCH works WITH doc work title t, artist a, price p, style s
- WHERE p < 200000 AND s = "Impressionist"
47Algebraic translation and optimization of Q2
48Conclusion
- Presents an algebraic framework to support efficient query evaluation in XML integration systems
- Relies on a general-purpose algebra
- Wraps, with appropriate type information, more structured query languages such as OQL and SQL
- Equips the algebra with a number of equivalences offering optimization opportunities