Title: Tracing Data Lineage Using Schema Transformation Pathways
1Tracing Data Lineage Using Schema Transformation
Pathways
- Hao Fan, Alexandra Poulovassilis
- School of Computer Science and Information Systems
- Birkbeck College, University of London
- ECAI Workshop on Knowledge Transformation for
the Semantic Web, 21st-26th July 2002
2What is Data Lineage
For a given data warehouse item, how do we identify
the exact set of source data items from which it was derived?
3The open problems of Data Lineage Tracing
- Definition of Data Lineage
- Derivation of Tracing Queries
- Data Lineage Tracing procedures
- Lineage Tracing with set/bag semantics
- Lineage tracing using auxiliary views
4Applications
- On-line Data analysis and mining (OLAP/OLAM)
- Scientific Databases
- Data cleaning
- Authorization management
- View update problem
5AutoMed
- HDM (Hypergraph Data Model)
- <Nodes, Edges, Constraints>
- Instance / Extension Mapping (Ext_{S,I}(c))
- 8 Primitive transformations
- Composite transformations / Transformation
Pathways
- IQL language
6Simple IQL queries
- 1. q = D1 ++ D2 ++ ... ++ Dr
- /* bag union */
- 2. q = D1 -- D2
- /* bag monus */
- 3. q = group D
- /* group a bag of pairs on their first component */
- 4. q = sort D
- 5. q = sortDistinct D
- /* sort and remove duplicates */
- 6. q = aggFun D
- (aggFun = max | min | count | sum | avg)
- 7. q = gc aggFun D
- (aggFun = max | min | count | sum | avg)
- /* group a bag of pairs on their first component and apply an
aggregation function to the second component */
- 8. q = [p | p <- D1; member D2 p]
- /* members of D1 that are members of D2 */
- 9. q = [p | p <- D1; not (member D2 p)]
- /* members of D1 that are not members of D2 */
- 10. q = [p | p1 <- D1; ...; pr <- Dr; c1; ...; ck]
- /* a comprehension */
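To make the bag operators concrete, here is a minimal Python sketch of four of the simple queries (bag union, bag monus, group and gc). It is an illustration only, not part of AutoMed or IQL: bags are modelled as Python lists, and the function names `bag_union`, `bag_monus`, `group`, `gc` and `avg` are assumptions chosen for this example.

```python
from collections import Counter, defaultdict

def bag_union(*bags):
    """D1 ++ D2 ++ ... ++ Dr: concatenate bags, keeping duplicates."""
    out = []
    for bag in bags:
        out.extend(bag)
    return out

def bag_monus(d1, d2):
    """D1 -- D2: drop one copy from D1 for each matching copy in D2."""
    remaining = Counter(d2)
    out = []
    for x in d1:
        if remaining[x] > 0:
            remaining[x] -= 1      # this copy is cancelled by a copy in D2
        else:
            out.append(x)
    return out

def group(pairs):
    """group D: collect the second components under each first component."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def gc(agg_fun, pairs):
    """gc aggFun D: group on the first component, aggregate the second."""
    return [(key, agg_fun(values)) for key, values in group(pairs)]

def avg(values):
    return sum(values) / len(values)
```

For instance, `gc(avg, [("Maths", 30), ("Maths", 50)])` groups both salaries under "Maths" and averages them.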
7Example for transforming between HDM schemas
[Figure: Schema S1 (nodes mathematician, compScientist, salary) is transformed by the pathway T_{S1,S2} into Schema S2 (nodes person, dept, salary, avgDeptSalary).]
8The Transformation Pathway TS1,S2
- addNode (<<dept>>, {"Maths", "CompSci"})
- addNode (<<person>>, [x | x <- <<mathematician>>] ++
[x | x <- <<compScientist>>])
- addNode (<<avgDeptSalary>>,
{avg [s | (m,s) <- <<_, mathematician, salary>>]} ++
{avg [s | (c,s) <- <<_, compScientist, salary>>]})
- addEdge (<<_, dept, person>>,
[("Maths", x) | x <- <<mathematician>>] ++
[("CompSci", x) | x <- <<compScientist>>])
- addEdge (<<_, person, salary>>,
<<_, mathematician, salary>> ++ <<_, compScientist, salary>>)
- addEdge (<<_, dept, avgDeptSalary>>,
{("Maths", avg [s | (m,s) <- <<_, mathematician, salary>>]),
("CompSci", avg [s | (c,s) <- <<_, compScientist, salary>>])})
- delEdge (<<_, mathematician, salary>>,
[(p, s) | (d, p') <- <<_, dept, person>>;
(p, s) <- <<_, person, salary>>; d = "Maths"; p' = p])
- delEdge (<<_, compScientist, salary>>,
[(p, s) | (d, p') <- <<_, dept, person>>;
(p, s) <- <<_, person, salary>>; d = "CompSci"; p' = p])
- delNode (<<mathematician>>,
[p | (d, p) <- <<_, dept, person>>; d = "Maths"])
- delNode (<<compScientist>>,
[p | (d, p) <- <<_, dept, person>>; d = "CompSci"])
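The pathway can be checked on a small instance. The sketch below evaluates the add steps over invented S1 extents and then confirms that the queries in two of the del steps recover the removed S1 constructs. The sample data (names and salaries) and all Python variable names are assumptions for illustration; only the queries mirror the pathway.

```python
# Hypothetical S1 extents (sample data invented for illustration).
mathematician = ["ann", "bob"]
comp_scientist = ["cat"]
math_salary = [("ann", 30), ("bob", 50)]   # <<_, mathematician, salary>>
cs_salary = [("cat", 40)]                  # <<_, compScientist, salary>>

def avg(values):
    return sum(values) / len(values)

# add steps: each new S2 construct is defined by a query over existing constructs.
dept = ["Maths", "CompSci"]
person = [x for x in mathematician] + [x for x in comp_scientist]
dept_person = ([("Maths", x) for x in mathematician] +
               [("CompSci", x) for x in comp_scientist])
person_salary = math_salary + cs_salary
dept_avg = [("Maths", avg([s for (_, s) in math_salary])),
            ("CompSci", avg([s for (_, s) in cs_salary]))]

# del steps: the query says how the removed S1 construct is recovered from S2.
restored_math_salary = [(p, s)
                        for (d, p1) in dept_person
                        for (p, s) in person_salary
                        if d == "Maths" and p1 == p]
restored_mathematician = [p for (d, p) in dept_person if d == "Maths"]
```

On this data the restored extents equal the original ones, which is the property the del queries are meant to express.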
9Data Lineage Tracing
There are two kinds of Data Lineage
- Affect-provenance
- includes all of the source data that had some
influence on the result data
- Origin-provenance
- the specific data in the source databases from
which the resulting data is extracted.
10Our approach is to
- Consider data lineage tracing for simple IQL
queries
- Handle one transformation step
- Extend to an algorithm for whole transformation
pathways
- Extend to handle arbitrary IQL queries
11Data Lineage with set semantics in IQL
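Affect-set for a simple query in IQL: t's
affect-set in $T_1, \ldots, T_m$ according to $q$ is defined to be
$q^{AS}_{\langle T_1, \ldots, T_m \rangle}(t) = \langle T_1^*, \ldots, T_m^* \rangle$,
where $T_1^*, \ldots, T_m^*$ are maximal subsets of $T_1, \ldots, T_m$ such that
(a) $q(T_1^*, \ldots, T_m^*) = \{t\}$
(b) $\forall T_i' :\ q(T_1^*, \ldots, T_i', \ldots, T_m^*) = \{t\} \Rightarrow T_i' \subseteq T_i^*$
(c) $\forall T_i^*,\ \forall t^* \in T_i^* :\ q(T_1^*, \ldots, \{t^*\}, \ldots, T_m^*) \neq \emptyset$
Origin-set for a simple query in IQL: t's
origin-set in $T_1, \ldots, T_m$ according to $q$ is defined to be
$q^{OS}_{\langle T_1, \ldots, T_m \rangle}(t) = \langle T_1^*, \ldots, T_m^* \rangle$,
where $T_1^*, \ldots, T_m^*$ are minimal subsets of $T_1, \ldots, T_m$ such that
(a) $q(T_1^*, \ldots, T_m^*) = \{t\}$
(b) $\forall T_i' :\ T_i' \subset T_i^* \Rightarrow q(T_1^*, \ldots, T_i', \ldots, T_m^*) \neq \{t\}$
(c) $\forall T_i^*,\ \forall t^* \in T_i^* :\ q(T_1^*, \ldots, \{t^*\}, \ldots, T_m^*) \neq \emptyset$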
12Example
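Set $T_1 = \{1, 2, 3\}$, $T_2 = \{3, 4, 5\}$, $V = T_1 \mathbin{--} T_2 = \{1, 2\}$.
We obtain 1's affect-set as follows:
1. To satisfy conditions (a) and (b), $T_1^* = \{1, 3\}$, $T_2^* = \{3, 4, 5\}$.
2. $\{3\} \mathbin{--} \{3, 4, 5\} = \emptyset$, which does not satisfy (c), so delete 3 from $T_1^*$; then $T_1^* = \{1\}$.
3. The affect-set of 1 ($1 \in V$) is $T_1^* = \{1\}$, $T_2^* = \{3, 4, 5\}$.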
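The example can be reproduced by exhaustive search. The sketch below is a brute-force check, not the tracing-query method of the talk: it enumerates every pair of subsets, keeps those satisfying conditions (a) and (c), and returns the candidates that are maximal under componentwise inclusion (one reading of condition (b)). The function names `subsets`, `difference` and `affect_set` are assumptions for this illustration, and the search is only practical for tiny inputs.

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of s, as frozensets."""
    items = sorted(s)
    combos = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1))
    return [frozenset(c) for c in combos]

def difference(t1, t2):
    """The query q = T1 -- T2 under set semantics."""
    return set(t1) - set(t2)

def affect_set(q, t1, t2, t):
    """Brute-force search for maximal (T1*, T2*) satisfying (a) and (c)."""
    candidates = []
    for s1 in subsets(t1):
        for s2 in subsets(t2):
            cond_a = q(s1, s2) == {t}
            # (c): every single item left in a source must yield a non-empty result
            cond_c = (all(q({x}, s2) for x in s1) and
                      all(q(s1, {x}) for x in s2))
            if cond_a and cond_c:
                candidates.append((s1, s2))
    # Keep candidates not strictly contained (componentwise) in another one.
    return [c for c in candidates
            if not any(c != d and c[0] <= d[0] and c[1] <= d[1]
                       for d in candidates)]

result = affect_set(difference, {1, 2, 3}, {3, 4, 5}, 1)
```

Here `result` holds a single pair, `({1}, {3, 4, 5})`, matching step 3 of the example.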
13Data Lineage with bag semantics in IQL
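Affect-Pool for a simple query in IQL: t's
affect-pool in $T_1, \ldots, T_m$ according to $q$ is defined to be
$q^{AP}_{\langle T_1, \ldots, T_m \rangle}(t) = \langle T_1^*, \ldots, T_m^* \rangle$,
where $T_1^*, \ldots, T_m^*$ are maximal sub-bags of $T_1, \ldots, T_m$ such that
(a) $q(T_1^*, \ldots, T_m^*) = [x \mid x \leftarrow T;\ x = t]$
(b) $\forall T_i' :\ q(T_1^*, \ldots, T_i', \ldots, T_m^*) = [x \mid x \leftarrow T;\ x = t] \Rightarrow T_i' \subseteq T_i^*$
(c) $\forall T_i^*,\ \forall t^* \in T_i^* :\ q(T_1^*, \ldots, [t^*], \ldots, T_m^*) \neq \emptyset$
(here $T$ is read as the result bag $q(T_1, \ldots, T_m)$ that contains $t$)
Origin-Pool for a simple query in IQL: t's
origin-pool in $T_1, \ldots, T_m$ according to $q$ is defined to be
$q^{OP}_{\langle T_1, \ldots, T_m \rangle}(t) = \langle T_1^*, \ldots, T_m^* \rangle$,
where $T_1^*, \ldots, T_m^*$ are minimal sub-bags of $T_1, \ldots, T_m$ such that
(a) $q(T_1^*, \ldots, T_m^*) = [x \mid x \leftarrow T;\ x = t]$
(b) $\forall T_i^*,\ \forall t^* \in T_i^* :\ q(T_1^*, \ldots, [x \mid x \leftarrow T_i^*;\ x \neq t^*], \ldots, T_m^*) \neq [x \mid x \leftarrow T;\ x = t]$
(c) $\forall T_i^*,\ \forall t^* \in T_i^* :\ q(T_1^*, \ldots, [t^*], \ldots, T_m^*) \neq \emptyset$
(d) $\forall T_i^*,\ \neg\exists t^* :\ t^* \in T_i^*,\ t^* \in (T_i \mathbin{--} T_i^*)$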
14Affect- and Origin-pool for a tuple with IQL
simple queries
15Affect- and Origin-pool for a query-sequence
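$V = Q(D) = q_1 \circ q_2 \circ \cdots \circ q_r(D) = q_r(q_{r-1}(\ldots(q_1(D))\ldots))$
t's affect-pool in $D$ according to $Q$ is defined to be
$Q^{AP}_D(t) = D^*$, where $D_i^* = q_i^{AP}(D_{i+1}^*)$ $(1 \le i \le r)$,
$D_{r+1}^* = [t]$ and $D^* = D_1^*$.
t's origin-pool in $D$ according to $Q$ is defined to be
$Q^{OP}_D(t) = D^*$, where $D_i^* = q_i^{OP}(D_{i+1}^*)$ $(1 \le i \le r)$,
$D_{r+1}^* = [t]$ and $D^* = D_1^*$.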
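The recursion runs from the last query back to the first: start from the traced tuple and apply each step's tracer in reverse order. The sketch below shows that backward fold under simplifying assumptions: each step reads a single bag (a Python list), and each tracer is a hypothetical function `tracer(step_input, traced)` that returns the part of the step's input from which the traced tuples derive. None of these names come from the talk.

```python
def trace_sequence(tracers, inputs, t):
    """Trace tuple t backwards through the steps q1, ..., qr.

    tracers[i] is the tracer for step i+1 and inputs[i] is the bag that
    step i+1 reads; both lists are given in forward (q1 first) order.
    """
    traced = [t]                                  # start from the tuple itself
    for tracer, step_input in zip(reversed(tracers), reversed(inputs)):
        traced = tracer(step_input, traced)       # move one step towards the sources
    return traced                                 # lineage in the original bag D

def trace_same_values(step_input, traced):
    """Tracer for steps whose output tuples are copies of input tuples."""
    return [x for x in step_input if x in traced]

D = [2, 2, 3, 4]
q1_out = [x for x in D if x % 2 == 0]    # q1: keep the even numbers
q2_out = sorted(set(q1_out))             # q2: sortDistinct

lineage = trace_sequence([trace_same_values, trace_same_values],
                         [D, q1_out], 2)
```

Tracing the tuple 2 in `q2_out` yields both copies of 2 in `D`, since sortDistinct collapsed them into one result tuple.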
16Analysis of the data lineage problem for each
AutoMed transformation step
a) For an addConstruct(O, q) transformation:
the lineage of data in schema construct O is
located in the constructs that appear in the
query q.
b) For a renameConstruct(O', O) transformation:
the lineage of data in O is
located in the source construct O'.
c) All delConstruct(O, q) transformations can be ignored,
since they create no schema construct.
17Attributes for each input item
t: the tracing tuple.
O: a construct in the integrated schema GS
  relateTP: the transformation step that created O.
  extent: the current extent of O.
tp: a transformation step in the transformation pathway
  transfType: "add", "del" or "ren"
  query: the query used in tp (if any)
  source: O' (for renameConstruct(O', O)) or all
    constructs appearing in q (for addConstruct(O, q))
  result: O (for renameConstruct(O', O) and addConstruct(O, q))
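These attributes map naturally onto two record types. The sketch below is one possible Python representation, assuming constructs are identified by name and queries are kept as plain strings. The class names, the snake_case field names and the sample extents are assumptions for illustration, not AutoMed's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class TransformationStep:
    transf_type: str                                  # "add", "del" or "ren"
    result: Optional[str]                             # construct created (assumed None for "del")
    source: List[str] = field(default_factory=list)   # constructs the step reads
    query: Optional[str] = None                       # the query used in the step, if any

@dataclass
class Construct:
    name: str
    extent: List[Any] = field(default_factory=list)   # current extent of the construct
    relate_tp: Optional[TransformationStep] = None    # step that created it, if any

# The addNode step for <<person>> from the example pathway.
add_person = TransformationStep(
    transf_type="add",
    result="person",
    source=["mathematician", "compScientist"],
    query="[x | x <- <<mathematician>>] ++ [x | x <- <<compScientist>>]",
)
person = Construct("person", extent=["ann", "bob", "cat"], relate_tp=add_person)
mathematician = Construct("mathematician", extent=["ann", "bob"])
```

A construct with `relate_tp` set to `None` is a source construct, which is the case where tracing stops.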
18Output of the procedures
- The result of all tracing procedures is
- a sequence of pairs
- <(D1, O1), ..., (Dn, On)>
- in which
- Di is a bag containing the derivation
- Oi is the construct whose extent contains Di
19Tracing derivation for a tuple
Tracing affect-pool for a tuple t:
procedure affectPoolOfTuple(t, O)
begin
  if (O.relateTP = Ø)
    DL := <(t, O)>
  else
    D := [(O'.extent, O') | O' <- O.relateTP.source]
    D* := TQ^{AP}_D(t)
    DL := [(B, B.construct) | B <- D*]
  return(DL)
end
Tracing origin-pool for a tuple t:
procedure originPoolOfTuple(t, O)
begin
  if (O.relateTP = Ø)
    DL := <(t, O)>
  else
    D := [(O'.extent, O') | O' <- O.relateTP.source]
    D* := TQ^{OP}_D(t)
    DL := [(B, B.construct) | B <- D*]
  return(DL)
end
20Tracing Affect-Pool for a set of tuples
procedure affectPoolOfSet(T, O)
input: a tracing tuple set T = {t1, ..., tn}, and the construct O
  that contains tuple set T.
output: T's affect-pool, DL
begin
  DL := <>   // the empty sequence
  for each ti ∈ T do
    DL := merge(DL, affectPoolOfTuple(ti, O))
  return(DL)
end
The procedure originPoolOfSet is similar.
21The merge procedure
proc merge(DL, DLnew)
input: data lineage sequence DL = <(D1, O1), ..., (Dn, On)>;
  new data lineage sequence DLnew
output: merged data lineage sequence DL
begin
  for each (Dnew, Onew) ∈ DLnew do
    if Onew = Oi for some Oi in DL then
      oldData := Di
      newData := oldData ++ [x | x <- Dnew; not (member oldData x)]
      DL := (DL -- [(oldData, Oi)]) ++ [(newData, Oi)]
    else
      DL := DL ++ [(Dnew, Onew)]
  return(DL)
end
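The merge step can be sketched directly in Python. In this illustration a lineage sequence is a list of `(data, construct)` pairs, with `data` a list and `construct` a name. These representations and the sample values are assumptions, and the sketch returns a new list rather than updating DL in place.

```python
def merge(dl, dl_new):
    """Merge two lineage sequences of (data, construct) pairs."""
    merged = [(list(data), construct) for data, construct in dl]
    for new_data, new_construct in dl_new:
        for i, (old_data, construct) in enumerate(merged):
            if construct == new_construct:
                # Same construct: add only the items not already recorded.
                extra = [x for x in new_data if x not in old_data]
                merged[i] = (old_data + extra, construct)
                break
        else:
            # Construct not seen before: append a new pair.
            merged.append((list(new_data), new_construct))
    return merged

dl = [([("ann", 30)], "mathSalary")]
dl_new = [([("ann", 30), ("bob", 50)], "mathSalary"),
          ([("cat", 40)], "csSalary")]
result = merge(dl, dl_new)
```

Here the pair for `"mathSalary"` gains only `("bob", 50)`, and `"csSalary"` is appended as a new entry.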
22Algorithm for tracing Affect-Pool through a
transformation pathway
procedure traceAffectPool(B, O)
begin
  DL := <(B, O)>   // initialise DL
  for j := r downto 1 do
    case (tpj.transfType = del)
      continue
    case (tpj.transfType = ren)
      if tpj.result = Oi for some Oi in DL then
        DL := (DL -- [(Di, Oi)]) ++ [(Di, tpj.source)]
    case (tpj.transfType = add)
      if tpj.result = Oi for some Oi in DL then
        DL := DL -- [(Di, Oi)]
        Di := sortDistinct Di
        DL := merge(DL, affectPoolOfSet(Di, Oi))
  endfor
  return(DL)
end
The procedure traceOriginPool is similar.
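The backward walk over the pathway can be sketched end to end in Python. This is a simplified illustration, not the authors' implementation: steps are plain dictionaries, the per-step tracing query is replaced by a hypothetical function supplied in `tracers`, the tuple-by-tuple loop of affectPoolOfSet is inlined, and the three-step pathway (including the rename to `staff`) and its data are invented for the example.

```python
def merge(dl, dl_new):
    """Merge lineage sequences of (data, construct) pairs."""
    merged = [(list(d), o) for d, o in dl]
    for new_data, new_o in dl_new:
        for i, (old_data, o) in enumerate(merged):
            if o == new_o:
                extra = [x for x in new_data if x not in old_data]
                merged[i] = (old_data + extra, o)
                break
        else:
            merged.append((list(new_data), new_o))
    return merged

def trace_affect_pool(data, construct, pathway, extents, tracers):
    """Walk the pathway backwards, replacing derived constructs by their sources."""
    dl = [(list(data), construct)]
    for step in reversed(pathway):
        if step["type"] == "del":
            continue                                  # del steps create nothing
        hit = next((p for p in dl if p[1] == step["result"]), None)
        if hit is None:
            continue                                  # step is irrelevant to DL
        dl = [p for p in dl if p is not hit]
        if step["type"] == "ren":
            dl.append((hit[0], step["source"]))       # same data, old name
        else:                                         # "add"
            for t in sorted(set(hit[0])):             # sortDistinct
                dl = merge(dl, tracers[step["result"]](t, extents))
    return dl

def trace_person(t, extents):
    """Hypothetical tracer for person = mathematician ++ compScientist."""
    pairs = [([x for x in extents[name] if x == t], name)
             for name in ("mathematician", "compScientist")]
    return [(bag, name) for bag, name in pairs if bag]

extents = {"mathematician": ["ann", "bob"], "compScientist": ["cat"]}
pathway = [
    {"type": "add", "result": "person",
     "source": ["mathematician", "compScientist"]},
    {"type": "ren", "result": "staff", "source": "person"},
    {"type": "del", "result": None, "source": []},
]
lineage = trace_affect_pool(["ann", "cat"], "staff", pathway, extents,
                            {"person": trace_person})
```

Tracing `["ann", "cat"]` in `staff` first undoes the rename, then splits the data between the two source nodes that the add step read.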
23Contributions
- We have shown how individual steps of schema
transformation pathways can be used for tracing
affect-pool and origin-pool
- Although shown for the HDM data model and the IQL query
language, our approach is more generally
applicable to other data models and query
languages
- It is also applicable to inter-model
transformation pathways
- In particular, it could be applied to data lineage
tracing for derived information in the Semantic
Web, to trace the source resources of that
information
24Ongoing work
- Handling more complex IQL queries appearing in
transformation pathways.
- Implementing our lineage tracing algorithms.
- Combining our approach for tracing data lineage
with the problem of incremental view maintenance.
- Implementing the integrated approach.
- Extending the algorithms to a more expressive
transformation language.