Data Provenance in ETL Scenarios - PowerPoint PPT Presentation

1 / 39
About This Presentation
Title:

Data Provenance in ETL Scenarios

Description:

Data Provenance in ETL Scenarios. Panos Vassiliadis ... Fundamental questions on provenance & ETL. Why do we have a certain ... on provenance & ETL ... – PowerPoint PPT presentation

Number of Views:130
Avg rating:3.0/5.0
Slides: 40
Provided by: alki
Category:

less

Transcript and Presenter's Notes

Title: Data Provenance in ETL Scenarios


1
Data Provenance in ETL Scenarios
  • Panos Vassiliadis
  • University of Ioannina
  • (joint work with Alkis Simitsis, IBM Almaden
    Research Center,
  • Timos Sellis and Dimitrios Skoutas, NTUA ICCS)

2
Outline
  • Introduction
  • Conceptual Level
  • Logical Level
  • Physical Level
  • Provenance ETL

3
Outline
  • Introduction
  • Conceptual Level
  • Logical Level
  • Physical Level
  • Provenance ETL

4
Data Warehouse Environment
5
Extract-Transform-Load (ETL)
6
ETL importance
  • ETL and Data Cleaning tools cost
  • 30 of effort and expenses in the budget of the
    DW
  • 55 of the total costs of DW runtime
  • 80 of the development time in a DW project
  • ETL market a multi-million market
  • IBM paid 1.1 billion dollars for Ascential
  • ETL tools in the market
  • software packages
  • in-house development
  • No standard, no common model
  • most vendors implement a core set of operators
    and provide GUI to create a data flow

7
Fundamental research question
  • Now currently, ETL designers work directly at
    the physical level (typically, via libraries of
    physical-level templates)
  • Challenge can we design ETL flows as
    declaratively as possible?
  • Detail independence
  • no care for the algorithmic choices
  • no care about the order of the transformations
  • (hopefully) no care for the details of the
    inter-attribute mappings

8
Now
DW
Physical templates
Involved data stores

Physical scenario
Engine
9
Vision
DW
Schema mappings
ETL tool
DW
Conceptual to logical mapping
Conceptual to logical mapper
Physical templates
Involved data stores

Logical templates
Logical scenario
Optimizer
Physical scenario
Physical templates
Physical scenario
Engine
Engine
10
Detail independence
DW
Schema mappings
ETL tool
Automate (as much as possible) Conceptual the
details of the inter-attribute mappings Logical
the order of the transformations Physical the
algorithmic choices
Conceptual to logical mapping
Conceptual to logical mapper
Logical templates
Logical scenario
Optimizer
Physical templates
Physical scenario
Engine
11
Outline
  • Introduction
  • Conceptual Level
  • Logical Level
  • Physical Level
  • Provenance ETL

12
Conceptual Model first attempts
13
Conceptual Model The Data Mapping Diagram
  • Extension of UML to handle inter-attribute
    mappings

14
Conceptual Model The Data Mapping Diagram
  • Aggregating computes the quarterly sales for each
    product.

15
Conceptual Model Skoutas annotations
  • Application vocabulary
  • Datastore mappings
  • Datastore annotation

16
Conceptual Model Skoutas annotations
  • The class hierarchy
  • Definition for class DS1_Products

17
Outline
  • Introduction
  • Conceptual Level
  • Logical Level
  • Physical Level
  • Provenance ETL

18
Logical Model
DSA
DS.PS1.PKEY, LOOKUP_PS.SKEY, SOURCE
DS.PSNEW2.PKEY, DS.PSOLD2.PKEY
DS.PSNEW2
QTY,COST
SOURCE
COST
DATE
DS.PS2
AddAttr2
?
DIFF2
SK2
2
A2EDate
DS.PSOLD2
rejected
rejected
rejected
rejected
Log
Log
Log
Log
DS.PS2.PKEY, LOOKUP_PS.SKEY, SOURCE
DS.PSNEW1
DS.PSNEW1.PKEY, DS.PSOLD1.PKEY
COST
DATESYSDATE
PKEY,DATE
DS.PS1
U
PK
SK1
AddDate
NotNULL
DIFF1
DS.PSOLD1
rejected
rejected
rejected
Log
Log
Log
PKEY, DAY MIN(COST)
S2.PARTS
DW.PARTS
FTP2
V1
Aggregate1
PKEY, MONTH AVG(COST)
DW.PARTSUPP.DATE, DAY
S1.PARTS
TIME
FTP1
V2
Aggregate2
??
Sources
DW
19
Logical Model
  • Main question
  • What information should we put inside a metadata
    repository to be able to answer questions like
  • what is the architecture of my DW back stage?
  • which attributes/tables are involved in the
    population of an attribute?
  • what part of the scenario is affected if we
    delete an attribute?

20
Architecture Graph
DSA
DS.PS1.PKEY, LOOKUP_PS.SKEY, SOURCE
DS.PSNEW2.PKEY, DS.PSOLD2.PKEY
DS.PSNEW2
QTY,COST
COST
DATE
SOURCE
DS.PS2
?
AddAttr2
2
A2EDate
DIFF2
SK2
DS.PSOLD2
rejected
rejected
rejected
rejected
Log
Log
Log
Log
DS.PS2.PKEY, LOOKUP_PS.SKEY, SOURCE
DS.PSNEW1
DS.PSNEW1.PKEY, DS.PSOLD1.PKEY
COST
DATESYSDATE
PKEY,DATE
DS.PS1
U
PK
SK1
AddDate
NotNULL
DIFF1
DS.PSOLD1
rejected
rejected
rejected
Log
Log
Log
PKEY, DAY MIN(COST)
S2.PARTS
DW.PARTS
FTP2
V1
Aggregate1
PKEY, MONTH AVG(COST)
DW.PARTSUPP.DATE, DAY
S1.PARTS
TIME
FTP1
V2
Aggregate2
??
Sources
DW
21
Architecture Graph
Example
2
22
Architecture Graph
Example
2
23
Optimization
  • Execution order

which is the proper execution order?
24
Optimization
  • Execution order

order equivalence?
SK,f1,f2 or SK,f2,f1 or ... ?
25
Logical Optimization
  • Can we push selection early enough?
  • Can we aggregate before 2 takes place?

26
Outline
  • Introduction
  • Conceptual Level
  • Logical Level
  • Physical Level
  • Provenance ETL

27
identify the best possible physical
implementation for a given logical ETL workflow
Logical to Physical
DW
Schema mappings
ETL tool
Conceptual to logical mapper
Conceptual to logical mapping
Logical templates
Logical scenario
Optimizer
Physical templates
Physical scenario
Engine
28
Problem formulation
  • Given a logical-level ETL workflow GL
  • Compute a physical-level ETL workflow GP
  • Such that
  • the semantics of the workflow do not change
  • all constraints are met
  • the cost is minimal

29
Solution
  • We model the problem of finding the physical
    implementation of an ETL process as a state-space
    search problem.
  • States. A state is a graph GP that represents a
    physical-level ETL workflow.
  • The initial state G0P is produced after the
    random assignment of physical implementations to
    logical activities w.r.t. preconditions and
    constraints.
  • Transitions. Given a state GP, a new state GP is
    generated by replacing the implementation of a
    physical activity aP of GP with another valid
    implementation for the same activity.
  • Extension introduction of a sorter activity (at
    the physical-level) as a new node in the graph.
  • Sorter introduction
  • Intentionally introduce sorters to reduce
    execution resumption costs

30
Sorters impact
  • We intentionally introduce orderings, (via
    appropriate physical-level sorter activities)
    towards obtaining physical plans of lower cost.
  • Semantics unaffected
  • Price to pay
  • cost of sorting the stream of processed data
  • Gain
  • it is possible to employ order-aware algorithms
    that significantly reduce processing cost
  • It is possible to amortize the cost over
    activities that utilize common useful orderings

31
Sorter gains
  • Without order
  • cost(si) n
  • costSO(?) nlog2(n)n
  • With appropriate order
  • cost(si) seli n
  • costSO(?) n
  • Cost(G) 100.00010.000 35.000log2(5.000)5.0
    00 309.316
  • If sorter SA,B is added to V
  • Cost(G) 100.00010.000 25.0005.000log2(5.0
    00)5.000 247.877

32
Interesting orders
A asc
A desc
A,B, A,B
33
Outline
  • Introduction
  • Conceptual Level
  • Logical Level
  • Physical Level
  • Provenance ETL

34
A principled architecture for ETL
DW
Schema mappings
ETL tool
WHY
Conceptual to logical mapper
Conceptual to logical mapping
Logical templates
Logical scenario
WHAT
Optimizer
Physical templates
Physical scenario
HOW
Engine
35
Logical Model Questions revisited
  • What information should we put inside a metadata
    repository to be able to answer questions like
  • what is the architecture of my DW back stage?
  • it is described as the Architecture Graph
  • which attributes/tables are involved in the
    population of an attribute?
  • what part of the scenario is affected if we
    delete an attribute?
  • follow the appropriate path in the Architecture
    Graph

36
Fundamental questions on provenance ETL
  • Why do we have a certain record in the DW?
  • Because there is a process (described by the
    Architecture Graph at the logical level the
    conceptual model) that produces this kind of
    tuples
  • Where did this record come from in my DW?
  • Hard! If there is a way to derive an inverse
    workflow that links the DW tuples to their
    sources you can answer it.
  • Not always possible transformations are not
    invertible, and a DW is supposed to progressively
    summarize data
  • Widoms work on record lineage

37
Fundamental questions on provenance ETL
  • How are updates to the sources managed?
  • (update takes place at the source, DWdata marts
    must be updated)
  • Done, although in a tedious way log sniffing,
    mainly. Also, diff comparison of extracted
    snapshots
  • When errors are discovered during the ETL
    process, how are they handled?
  • (update takes place at the data staging area,
    sources must be updated)
  • Too hard to back-fuse data into the sources,
    both for political and workload issues.
    Currently, this is not automated.

38
Fundamental questions on provenance ETL
  • What happens if there are updates to the schema
    of the involved data sources?
  • Currently this is not automated, although the
    automation of the task is part of the detail
    independence vision
  • What happens if we must update the workflow
    structure and semantics?
  • Nothing is versioned back still, not really any
    user requests for this to be supported
  • What is the equivalent of citations in ETL?
  • nothing really

39
  • Thank you!
Write a Comment
User Comments (0)
About PowerShow.com