Title: Debugging Schema Mappings with Routes
1Debugging Schema Mappings with Routes
- Laura Chiticariu
- UC Santa Cruz
- (joint work with Wang-Chiew Tan)
2SPIDER A Schema Mapping Debugger
Demo group B
Today 1400-1530 Thursday 1100-1230
3Schema Mappings
- A schema mapping is a logical assertion that
describes the correspondence between two schemas - Key element in data exchange and data integration
systems - Data Exchange FKMP05
- Translate data conforming to a source schema S
into data conforming to a target schema T so that
the schema mapping M is satisfied
M
Schema S
Schema T
J
Target instance
4Debugging a Data Exchange Today
M
Schema S
Schema T
XQuery/XSLT/Java
J
Target instance
- Debugging at the (low) level of the
implementation - Specific to the data exchange engine
- Specific to the implementation language XQuery,
SQL, etc - Debugging at the level of schema mappings
- NO SUPPORT!!!
5Debugging Schema Mappings
M
Schema S
Schema T
J
Target instance
- Debugging schema mappings the process of
exploring, understanding and refining a schema
mapping through the use of (test) data at the
level of schema mappings
6Outline
- Overview
- Motivation
- Debugging schema mappings with routes
- Motivating example
- What are routes?
- Computing routes
- Related work
- Performance evaluation
- Conclusions
7Motivation
- Schema mappings are good
- Higher-level, declarative programming constructs
- Hide implementation details, allow for
optimization - Typically easier to understand vs.
SQL/XSLT/XQuery/Java - Serve a similar goal as model management
Bernstein03, MBHR05 - Uniformity in specifying and debugging
- Reduce programming effort by allowing a user to
specify and debug at the level of schema mappings - Schema mappings are often generated by schema
matching tools - Close to users intention, but may need further
refinements - Hard to understand without the help of tools
8Language for Schema Mappings
- Tuple generating dependencies (tgds)
- 8 x (?(x) ! 9 y ?(x,y))
- Equality generating dependencies (egds)
- 8 x (?(x) ! x1 x2)
- Remarks
- Widely used for relational schema mappings in
data exchange and data integration
Kolaitis05,Lenzerini02 - TGDs generalize LAV, GAV and are equivalent to
GLAV assertions in the terminology of data
integration - Extended to handle XML data exchange PVMHF02
9Relational Schema Mappings FKMP03
- Schema mapping M (S, T, ?st?t)
- S, T relational schemas with no relation symbols
in common - Source-to-target dependencies ?st
- Source-to-target tgds (s-t tgds) ?S(x) ! 9y
?T(x,y) - Target dependencies ?t
- Target tgds ?T(x) ! 9y ?T(x,y)
- Target egds ? T(x) ! x1 x2
?st
?t
Schema S
Schema T
J
Target instance
10Example Schema Mapping
S
T
MANHATTAN CREDIT CardHolders cardNo
² limit ² ssn
² name ²
Dependents accNo ² ssn
² name ²
Source-to-target dependencies, ?st m1
CardHolders(cn,l,s,n) ! 9L
(Accounts(cn,L,s) Æ Clients(s,n)) m2
Dependents(an,s,n) ! Clients(s,n) Target
dependencies, ?t m3 Clients(s,n) ! 9A 9L
(Accounts(A,L,s))
FARGO FINANCE Accounts ² accNo ² creditLine ²
accHolder Clients ² ssn ² name
m1
fk1
m3
m2
Solution for I under the schema mapping
Target instance J
Source instance I
CardHolders
Accounts
Clients
Dependents
11Example Debugging Scenario 1
Source instance I
Target instance J
CardHolders
Accounts
Clients
Dependents
Unknown credit limit?
A route for the Accounts tuple
Accounts
CardHolders
ID1
L1
123
m1
Alice
ID1
15K
123
Clients
Alice
ID1
15K is not copied over to the target
m1 CardHolders(cn,l,s,n) ! 9L (Accounts(cn,L,s)
Clients(s,n))
12Example Debugging Scenario 1
Source instance I
Target instance J
CardHolders
Accounts
Clients
Dependents
Unknown credit limit?
A route for the Accounts tuple
Accounts
CardHolders
ID1
L1
123
m1
Alice
ID1
15K
123
Clients
Alice
ID1
15K is not copied over to the target
m1 CardHolders(cn,l,s,n) !
(Accounts(cn,l,s) Clients(s,n))
13Example Debugging Scenario 2
Source instance I
Target instance J
CardHolders
Accounts
Clients
Dependents
Unknown account number?
123 is not copied over to the target as Bobs
account number
m2 Dependents(an,s,n) ! Clients(s,n)
14Example Debugging Scenario 2
Source instance I
Target instance J
CardHolders
Accounts
Clients
Dependents
Unknown account number?
123 is not copied over to the target as Bobs
account number
m2 CardHolders(an,l,s,n) Dependents(an,s,n)
! Accounts(an,l,s)
Clients(s,n)
15Debugging Schema Mappings with Routes
- Main intuition routes describe the relationships
between source and target data with the schema
mapping - Definition Let
- M be a schema mapping
- I be a source instance
- J be a solution for I under M and Js µ J
- A route for Js with M and (I,J) is a finite
non-empty sequence of satisfaction steps -
- (I,) ! (I,J1) ! !
(I,Jn) - such that
- Ji µ J, mi 2 ?st ?t, where 1 i n
- Js µ Jn
mn, hn
m1, h1
m2, h2
16Example of Satisfaction Step
Source instance I
Target instance J
CardHolders
Accounts
Clients
Dependents
Unknown credit limit?
Accounts
CardHolders
m1, h1
Clients
m1 CardHolders(cn, l, s, n) ! 9L (Accounts(cn,
L, s ) Clients(s, n ))
h1cn ! 123, l ! 15K, s ! ID1, n ! Alice, L !
L1
17Compute all routes
- The schema mapping M is fixed
- Input source instance I, a solution J for I
under M, a set of target tuples Js µ J - Output a forest representing all routes for Js
- Algorithm idea
- For each tuple t in Js, consider every possible ?
2 ?st ?t and h for witnessing t - Do the same for all target tuples encountered
during the process until tuples from the source
instance are obtained
18Compute all routes A simple example
T7(a)
- ?st
- ?1 S1(x) ! T1(x)
- ?2 S2(x) ! T2(x) Æ T6(x)
- ?t
- ?3 T2(x) ! T3(x)
- ?4 T3(x) ! T4(x)
- ?5 T4(x) Æ T1(x) ! T5(x)
- ?6 T4(x) Æ T6(x) ! T7(x)
- ?7 T5(x) ! T3(x)
- Source instance, I
- S1(a), S2(a)
- A solution, J
- T1(a), , T7(a)
19Compute all routes A simple example
T7(a)
- ?st
- ?1 S1(x) ! T1(x)
- ?2 S2(x) ! T2(x) Æ T6(x)
- ?t
- ?3 T2(x) ! T3(x)
- ?4 T3(x) ! T4(x)
- ?5 T4(x) Æ T1(x) ! T5(x)
- ?6 T4(x) Æ T6(x) ! T7(x)
- ?7 T5(x) ! T3(x)
- Source instance, I
- S1(a), S2(a)
- A solution, J
- T1(a), , T7(a)
?6
T4(a) T6(a)
20Compute all routes A simple example
T7(a)
- ?st
- ?1 S1(x) ! T1(x)
- ?2 S2(x) ! T2(x) Æ T6(x)
- ?t
- ?3 T2(x) ! T3(x)
- ?4 T3(x) ! T4(x)
- ?5 T4(x) Æ T1(x) ! T5(x)
- ?6 T4(x) Æ T6(x) ! T7(x)
- ?7 T5(x) ! T3(x)
- Source instance, I
- S1(a), S2(a)
- A solution, J
- T1(a), , T7(a)
?6
T4(a) T6(a)
?4
T3(a)
21Compute all routes A simple example
T7(a)
- ?st
- ?1 S1(x) ! T1(x)
- ?2 S2(x) ! T2(x) Æ T6(x)
- ?t
- ?3 T2(x) ! T3(x)
- ?4 T3(x) ! T4(x)
- ?5 T4(x) Æ T1(x) ! T5(x)
- ?6 T4(x) Æ T6(x) ! T7(x)
- ?7 T5(x) ! T3(x)
- Source instance, I
- S1(a), S2(a)
- A solution, J
- T1(a), , T7(a)
?6
T4(a) T6(a)
?4
T3(a)
?7
T5(a)
22Compute all routes A simple example
T7(a)
- ?st
- ?1 S1(x) ! T1(x)
- ?2 S2(x) ! T2(x) Æ T6(x)
- ?t
- ?3 T2(x) ! T3(x)
- ?4 T3(x) ! T4(x)
- ?5 T4(x) Æ T1(x) ! T5(x)
- ?6 T4(x) Æ T6(x) ! T7(x)
- ?7 T5(x) ! T3(x)
- Source instance, I
- S1(a), S2(a)
- A solution, J
- T1(a), , T7(a)
?6
T4(a) T6(a)
?4
T3(a)
?7
T5(a)
23Compute all routes A simple example
T7(a)
- ?st
- ?1 S1(x) ! T1(x)
- ?2 S2(x) ! T2(x) Æ T6(x)
- ?t
- ?3 T2(x) ! T3(x)
- ?4 T3(x) ! T4(x)
- ?5 T4(x) Æ T1(x) ! T5(x)
- ?6 T4(x) Æ T6(x) ! T7(x)
- ?7 T5(x) ! T3(x)
- Source instance, I
- S1(a), S2(a)
- A solution, J
- T1(a), , T7(a)
?6
T4(a) T6(a)
?4
?8
T3(a)
S2(a)
?7
?3
T5(a)
T2(a)
?2
S2(a)
Route for T7(a) ?2, ?3, ?4, ?8, ?6
24Properties of compute all routes
- Completeness
- Let F denote the route forest by our algorithm
returned on Js. If R is a minimal route for Js,
then it is represented in F. - Running time polynomial in the sizes of I, J and
Js - Every branch of a tuple once explored, is never
explored again - Polynomial number of branches for each tuple
since M is fixed - Challenge
- Exponentially many routes, but polynomial-size
representation constructed in polynomial time
25Compute one route
- Our experimental results indicate that compute
all routes can be expensive - Generate one route fast and alternative routes as
needed? - Our solution adapt compute all routes to compute
only one route - Non-exhaustive Stops when one witness is found.
A witness that uses source tuples is preferred - Inference procedure to deduce all consequences
of a proven tuple and avoid recomputation of
branches - Key step for polynomial time analysis
- Completeness If there is a route for Js, then
our algorithm will produce a route for Js
26Related work
- Commercial data exchange systems
- e.g., Altova MapForce, Stylus Studio
- Use lower-level languages (e.g., XSLT, XQuery)
to specify the exchange - Debugging is done at this low level
- Source tuple centric
- Data viewer YMHF01
- Constructs an example source instance
illustrative for the behavior of the schema
mapping - Complementary to our approach
- Works only for relational schema mappings
27Related work
- Computing routes for target data is related to
computing provenance (aka lineage) of data
28Empirical Evaluation
- Implementation on top of the Clio data exchange
system from IBM Almaden Research Center - Scalable push computation to the database
- Handles relational and XML schema mappings
PVMHF02 - Testbed
- Created relational and XML schema mappings based
on the TPCH schema - Created schema mappings based on Mondial, DBLP
and Amalgam schemas - Methodology - measured the influence of
- The sizes of I, J and Js
- The complexity of ?st ?t
- i.e., the number of tgds and the number of atoms
in each tgd - Setup P4 2.8GHz, 2Gb RAM, 256MB DB2 buffer pool
- Our regret No benchmark to base our comparisons
29ComputeOneRoute with Rel. schema
mappingInfluence of the Sizes of I and J
30ComputeOneRoute with Rel. schema
mappingInfluence of the Complexity of ?st ?t
31ComputeOneRoute vs. ComputeAllRoutes
32Experimental results with Mondial, DBLP and
Amalgam
33Experimental results with Mondial, DBLP and
Amalgam
- Two DBLP schemas and datasets, both XML
- DBLP1, DBLP2
- First relational schema from Amalgam test suite
34Experimental results with Mondial, DBLP and
Amalgam
- Two DBLP schemas and datasets, both XML
- DBLP1, DBLP2
- First relational schema from Amalgam test suite
- Two Mondial schemas and datasets
- one relational (Mondial1), the other XML
(Mondial2) - Designed ?st and used the foreign key constraints
as ?t
35Experimental results with Mondial, DBLP and
Amalgam
- Compute one route under 3 seconds for 1-10
randomly selected tuples - Compute all routes can take much longer
- 18 seconds to construct the route forest for 10
selected tuples in the target instance of Mondial - Compute one route took under 1 second
36Conclusions
- Debugging schema mappings with routes
- Complete, polynomial time algorithms for
computing routes - Extension for routes for selected source data
- Routes have declarative semantics, based on the
logical satisfaction of tgds - What we dont do illustrate data merging
- Future work
- Illustrate grouping semantics for nested schema
mappings - Adapt target instance to changes in the schema
mapping and data sources
37SPIDER A Schema Mappings Debugger
Demo group B
Today 1400-1530 Thursday 1100-1230
- Compute one/all routes
- Alternative routes
- Guided computation of routes
- Standard debugging features
- Breakpoints
- Watch windows
- Schema-level routes
38Thank you!