Title: Data Provenance in ETL Scenarios
1. Data Provenance in ETL Scenarios
- Panos Vassiliadis
- University of Ioannina
- (joint work with Alkis Simitsis, IBM Almaden Research Center; Timos Sellis and Dimitrios Skoutas, NTUA / ICCS)
2. Outline
- Introduction
- Conceptual Level
- Logical Level
- Physical Level
- Provenance & ETL
3. Outline
- Introduction
- Conceptual Level
- Logical Level
- Physical Level
- Provenance & ETL
4. Data Warehouse Environment
5. Extract-Transform-Load (ETL)
6. ETL importance
- ETL and Data Cleaning tools cost:
  - 30% of the effort and expenses in the budget of the DW
  - 55% of the total costs of DW runtime
  - 80% of the development time in a DW project
- The ETL market is a multi-million-dollar market
  - IBM paid 1.1 billion dollars for Ascential
- ETL tools in the market:
  - software packages
  - in-house development
- No standard, no common model
  - most vendors implement a core set of operators and provide a GUI to create a data flow
7Fundamental research question
- Now currently, ETL designers work directly at
the physical level (typically, via libraries of
physical-level templates) - Challenge can we design ETL flows as
declaratively as possible? - Detail independence
- no care for the algorithmic choices
- no care about the order of the transformations
- (hopefully) no care for the details of the
inter-attribute mappings
8. Now
(Figure: the current architecture — physical templates and the involved data stores yield a physical scenario, which is executed by the engine against the DW.)
9. Vision
(Figure: the envisioned architecture — schema mappings over the involved data stores feed a conceptual-to-logical mapper; logical templates produce a logical scenario; the optimizer, with physical templates, produces a physical scenario, which is executed by the engine against the DW.)
10. Detail independence
- Automate (as much as possible):
  - Conceptual: the details of the inter-attribute mappings
  - Logical: the order of the transformations
  - Physical: the algorithmic choices
(Figure: the architecture of the previous slide, annotated with these three levels.)
11. Outline
- Introduction
- Conceptual Level
- Logical Level
- Physical Level
- Provenance & ETL
12. Conceptual Model: first attempts
13. Conceptual Model: the Data Mapping Diagram
- An extension of UML to handle inter-attribute mappings
14. Conceptual Model: the Data Mapping Diagram
- Aggregation computes the quarterly sales for each product.
15. Conceptual Model: Skoutas' annotations
- Datastore mappings
- Datastore annotation
16. Conceptual Model: Skoutas' annotations
- Definition for class DS1_Products
17. Outline
- Introduction
- Conceptual Level
- Logical Level
- Physical Level
- Provenance & ETL
18. Logical Model
(Figure: an example logical-level scenario — the sources S1.PARTS and S2.PARTS are transferred via FTP1/FTP2 into the DSA tables DS.PS1/DS.PS2; activities such as the NotNULL and PK checks, AddDate/AddAttr2, A2EDate, the DIFF1/DIFF2 comparisons against DS.PSOLD1/DS.PSOLD2, and the surrogate-key assignments SK1/SK2 (over LOOKUP_PS.SKEY) populate DW.PARTS; rejected rows are sent to logs; Aggregate1 and Aggregate2 derive the views V1 (PKEY, DAY, MIN(COST)) and V2 (PKEY, MONTH, AVG(COST)).)
19. Logical Model
- Main question:
  - What information should we put inside a metadata repository to be able to answer questions like:
    - what is the architecture of my DW back stage?
    - which attributes/tables are involved in the population of an attribute?
    - what part of the scenario is affected if we delete an attribute?
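The last two questions reduce to reachability queries over the provider edges of the Architecture Graph. A minimal sketch in Python, using a hypothetical three-edge fragment (the node names below are illustrative, not the deck's full example):

```python
from collections import defaultdict, deque

# Provider edges: provider_of[x] = the set of nodes that feed node x.
provider_of = defaultdict(set)

def add_provider(target, source):
    provider_of[target].add(source)

# Hypothetical fragment of an Architecture Graph.
add_provider("DW.PARTS.COST", "SK1.COST")
add_provider("SK1.COST", "NotNULL.COST")
add_provider("NotNULL.COST", "S1.PARTS.COST")

def population_sources(attribute):
    """All nodes involved in the population of `attribute`:
    the transitive closure of the provider relation, via BFS."""
    seen, queue = set(), deque([attribute])
    while queue:
        node = queue.popleft()
        for provider in provider_of[node]:
            if provider not in seen:
                seen.add(provider)
                queue.append(provider)
    return seen
```

The "what is affected if we delete an attribute?" question is the same traversal over the reversed edges (consumers instead of providers).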
20. Architecture Graph
(Figure: the same example scenario as in slide 18, now viewed as an Architecture Graph.)
21. Architecture Graph: example
22. Architecture Graph: example (continued)
23. Optimization
- Which is the proper execution order?
24. Optimization
- Order equivalence?
- SK, f1, f2 or SK, f2, f1 or ... ?
25. Logical Optimization
- Can we push selections early enough?
- Can we aggregate before the join takes place?
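A common sufficient condition for such swaps is attribute-level: an activity can be pushed before its predecessor if it does not reference anything the predecessor generates. A simplified sketch (the full equivalence rules in the ETL-optimization literature also check schemas and semantics; this condition alone is only necessary-side bookkeeping):

```python
def can_push_before(pushed_attrs, predecessor_generated_attrs):
    """True when the activity being pushed (e.g., a selection) references
    none of the attributes its predecessor generates, so reordering
    cannot read a value that does not yet exist."""
    return pushed_attrs.isdisjoint(predecessor_generated_attrs)

# A selection on COST may move before a surrogate-key activity producing SKEY:
ok = can_push_before({"COST"}, {"SKEY"})
# A selection on SKEY may not:
bad = can_push_before({"SKEY"}, {"SKEY"})
```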
26. Outline
- Introduction
- Conceptual Level
- Logical Level
- Physical Level
- Provenance & ETL
27. Logical to Physical
- Identify the best possible physical implementation for a given logical ETL workflow
(Figure: the architecture again — the optimizer, with physical templates, maps the logical scenario to a physical scenario for the engine.)
28. Problem formulation
- Given a logical-level ETL workflow GL
- Compute a physical-level ETL workflow GP
- Such that:
  - the semantics of the workflow do not change
  - all constraints are met
  - the cost is minimal
29Solution
- We model the problem of finding the physical
implementation of an ETL process as a state-space
search problem. - States. A state is a graph GP that represents a
physical-level ETL workflow. - The initial state G0P is produced after the
random assignment of physical implementations to
logical activities w.r.t. preconditions and
constraints. - Transitions. Given a state GP, a new state GP is
generated by replacing the implementation of a
physical activity aP of GP with another valid
implementation for the same activity. - Extension introduction of a sorter activity (at
the physical-level) as a new node in the graph. - Sorter introduction
- Intentionally introduce sorters to reduce
execution resumption costs
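The state/transition formulation above can be sketched as follows. For illustration only, the activities, implementation names, and costs are hypothetical, and the costs are assumed independent per activity (so exhaustively enumerating assignments is equivalent to exploring the transition graph); in the real problem the cost also depends on orderings and the rest of the flow:

```python
import itertools

# Hypothetical per-activity costs of the available physical implementations.
IMPLEMENTATIONS = {
    "SK1":     {"nested_loops": 120, "merge_based": 70},
    "NotNULL": {"scan": 10},
    "DIFF1":   {"hash_diff": 90, "sort_merge_diff": 60},
}

def cost(state):
    """Total cost of a state = one chosen implementation per activity."""
    return sum(IMPLEMENTATIONS[a][impl] for a, impl in state.items())

def search():
    """A state assigns one physical implementation to every logical
    activity; a transition replaces one activity's implementation.
    With independent costs we simply enumerate all assignments."""
    activities = sorted(IMPLEMENTATIONS)
    best, best_cost = None, float("inf")
    for combo in itertools.product(*(sorted(IMPLEMENTATIONS[a]) for a in activities)):
        state = dict(zip(activities, combo))
        if cost(state) < best_cost:
            best, best_cost = state, cost(state)
    return best, best_cost
```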
30Sorters impact
- We intentionally introduce orderings, (via
appropriate physical-level sorter activities)
towards obtaining physical plans of lower cost. - Semantics unaffected
- Price to pay
- cost of sorting the stream of processed data
- Gain
- it is possible to employ order-aware algorithms
that significantly reduce processing cost - It is possible to amortize the cost over
activities that utilize common useful orderings
31. Sorter gains
- Without order:
  - cost(σi) = n
  - costSO(·) = n·log2(n) + n
- With the appropriate order:
  - cost(σi) = seli·n
  - costSO(·) = n
- Cost(G) = 100.000 + 10.000 + 3·(5.000·log2(5.000) + 5.000) ≈ 309.316
- If sorter S(A,B) is added to V:
  - Cost(G) = 100.000 + 10.000 + 2·(5.000·log2(5.000) + 5.000) + 5.000 ≈ 247.877
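The slide's arithmetic can be checked directly. Assuming (as the figures suggest) a fixed 100.000 + 10.000 cost for the order-insensitive part and three order-sensitive activities over streams of n = 5.000 tuples:

```python
from math import log2

n = 5_000  # tuples flowing through each order-sensitive activity

sorted_impl = n * log2(n) + n   # no useful order: sort the stream, then scan it
ordered_impl = n                # order already available: a single scan

# Without a sorter, all three order-sensitive activities sort on their own.
cost_without = 100_000 + 10_000 + 3 * sorted_impl

# Adding sorter S(A,B) to V lets one activity reuse the ordering and run
# in linear time; the other two still pay the sort.
cost_with = 100_000 + 10_000 + 2 * sorted_impl + ordered_impl
```

Rounding reproduces the slide's 309.316 and 247.877: the sorter saves one sort pass (n·log2(n) ≈ 61.439) at the price of one extra linear scan.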
32. Interesting orders
- A asc
- A desc
- (A,B) vs. (B,A)
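The distinctions on this slide (direction matters, attribute sequence matters) suggest the usual prefix test for whether an available ordering serves a required one. A small sketch under that assumption:

```python
def satisfies(available, required):
    """An available ordering satisfies a required one iff the required
    ordering is a prefix of it, with attribute and direction both matching.
    Orderings are lists of (attribute, direction) pairs."""
    return len(available) >= len(required) and \
        all(a == r for a, r in zip(available, required))

# (A asc, B asc) serves a consumer needing (A asc), but (A desc) does not,
# and (B, A) cannot stand in for (A, B).
```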
33. Outline
- Introduction
- Conceptual Level
- Logical Level
- Physical Level
- Provenance & ETL
34. A principled architecture for ETL
(Figure: the architecture annotated with WHY at the conceptual level (schema mappings), WHAT at the logical level (logical scenario), and HOW at the physical level (physical scenario and engine).)
35. Logical Model: questions revisited
- What information should we put inside a metadata repository to be able to answer questions like:
  - what is the architecture of my DW back stage?
    - it is described as the Architecture Graph
  - which attributes/tables are involved in the population of an attribute?
  - what part of the scenario is affected if we delete an attribute?
    - follow the appropriate path in the Architecture Graph
36. Fundamental questions on provenance & ETL
- Why do we have a certain record in the DW?
  - Because there is a process (described by the Architecture Graph at the logical level and by the conceptual model) that produces this kind of tuples.
- Where did this record come from in my DW?
  - Hard! If there is a way to derive an inverse workflow that links the DW tuples to their sources, you can answer it.
  - Not always possible: transformations are not invertible, and a DW is supposed to progressively summarize data.
  - See Widom's work on record lineage.
37. Fundamental questions on provenance & ETL
- How are updates to the sources managed?
  - (the update takes place at the source; the DW and data marts must be updated)
  - Done, although in a tedious way: log sniffing, mainly; also, diff comparison of extracted snapshots.
- When errors are discovered during the ETL process, how are they handled?
  - (the update takes place at the data staging area; the sources must be updated)
  - Too hard to back-fuse data into the sources, both for political and workload reasons. Currently, this is not automated.
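The "diff comparison of extracted snapshots" mentioned above (the role of the DIFF activities in the example scenario) can be sketched as a key-based comparison of two extractions. Row layout and key name are hypothetical:

```python
def diff_snapshots(new_snapshot, old_snapshot, key):
    """Return the rows of the new extraction that are new or changed
    with respect to the previous one, matching rows on `key`.
    Rows are plain dicts."""
    old_by_key = {row[key]: row for row in old_snapshot}
    return [row for row in new_snapshot
            if row[key] not in old_by_key or row != old_by_key[row[key]]]

old = [{"PKEY": 1, "COST": 10}, {"PKEY": 2, "COST": 25}]
new = [{"PKEY": 1, "COST": 10}, {"PKEY": 2, "COST": 20}, {"PKEY": 3, "COST": 30}]
delta = diff_snapshots(new, old, "PKEY")  # the changed row 2 and the new row 3
```

Deletions would be detected symmetrically (old keys absent from the new snapshot); log sniffing avoids the full comparison by reading changes from the source's log instead.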
38. Fundamental questions on provenance & ETL
- What happens if there are updates to the schema of the involved data sources?
  - Currently this is not automated, although the automation of the task is part of the detail-independence vision.
- What happens if we must update the workflow structure and semantics?
  - Nothing is versioned; still, there are not really any user requests for this to be supported.
- What is the equivalent of citations in ETL?
  - Nothing, really.