Title: Schema Evolution in Data Warehousing Environments a schema transformationbased approach
1Schema Evolution in Data Warehousing
Environments a schema transformation-based
approach
- Hao Fan Alexandra Poulovassilis
- School of Computer Science Information Systems
- Birkbeck college, University of London
- ER 2004
- 8th 12th November 2004
2Outline
- Motivation
- Overview of AutoMed
- Expressing Schema and Data Model Evolution
- Handling Schema Evolution
- Conclusion
3Motivation - 1
There are two kinds of heterogeneity in data
integrations
- Model Heterogeneity
- Heterogeneous data expressed in different data
models - Schema Heterogeneity
- Heterogeneous data within different data
schemas expressed the same data model
4Motivation - 2
In data warehousing environments, schema
evolution includes
- Evolution of the schema
- Evolution of the modelling language in which the
schema is expressed
It is important for the integration framework to
be evolvable if a schema evolves, so that the
previous data integration effort can be reused
as much as possible.
5Overview of AutoMed
AutoMed is a heterogeneous data transformation
and integration system, which offers the
capability to handle data integration across
multiple data models.
- HDM data model
- The IQL query language
- Transformation Pathways
- The AutoMed repository
(See http//www.doc.ic.ac.uk/automed/ for details)
6HDM Data Model
HDM is a low-level Hypergraph Data Model.
HDM Schemas consist of a set of Nodes, Edges
and Constraints.
Schemas within conceptual data models, such as
Relational, ER, XML, and Flat-text, can be
represented in terms of HDM.
7A Simple Relational Data Model
CStab
CStab
Cstab,SID
Cstab,C03
Cstab,SName
Cstab,C02
Cstab,C01
The Extent of each construct CStab
CS01, CS02, Cstab,SID
CS01, CS01, CS02, CS02, Cstab,SNa
me CS01, Jack, CS02,
Tom, Cstab,C01 CS01, 95,
CS02, 88, Cstab,C02 CS01,
82, CS02, 94, Cstab,C03
CS01, 75, CS02, 81,
8The IQL Query Language
IQL is a comprehension-based functional query
language. Simply, the Syntax of IQL can be
described as following queries
6. aggFun E (aggFun max min
count sum avg) 7. gc aggFun
E /group and compute a bag of pairs on their
first component and apply an aggregation function
to the second component/ 8. e Q1
Qr /comprehension/ 9. map f E
- 1. E1 E2 Er
- / bag union/
- 2. E1 -- E2 / bag monus /
- 3. group E
- / group a bag of pairs on their first
component/ - 4. sort E
- 5. distinct E
- /remove duplicates/
9Primitive Transformations
- addC (c, q)
- extendC (c)
- deleteC (c, q)
- contractC (c)
- renameC (c, c)
- For simple relational model, C can be Rel, and
Att - For XML, C can be Element, Attribute, and
NestSet - For ER, C can be Entity, Attribute, and
Constraint - For HDM, C can be Node, Edge, and Constraint
10Transformation Pathways
11A Data Integration Example
7. contractHierarchy (ltltSalary,
Person,id,idgtgt) 8. contractHierarchy
(ltltSalary,Job,job_id,job_idgtgt) 9. contractAtt
(ltltSalary, salarygtgt) 10. contractAtt
(ltltSalary, dept_idgtgt) 11. contractFact
(ltltSalary, id, job_idgtgt) 12. contractAtt
(ltltJob, job_descrgtgt) 13. contractDim
(ltltJob, job_idgtgt) 14. contractAtt
(ltltPerson, namegtgt) 15. contractDim
(ltltPerson, idgtgt)
1. addRel (ltltDept,idgtgt, ltltDept_idgtgt) 2. addAtt
(ltltDept,dept_namegtgt, ltlt_,Dept_id,namegtgt) 3.
addAtt (ltltDept,total_salarygtgt,
gc sum (d,s)(i,j,s)lt-ltltSalary,salarygtgt
(i',j',d)lt-ltSalary,dept_idgtgt ii'
jj') 4. delEdge (ltlt_,Dept_id,namegtgt,
ltltDept,dept_namegtgt) 5. delNode
(ltltnamegtgt,n(d,n)lt-ltltDept,dept_namegtgt) 6.
delNode (ltltDept_idgtgt, ltltDept,idgtgt)
12The AutoMed Repository
13Discussion AutoMed vs. CDM - 1
14Discussion AutoMed vs. CDM - 2
Conceptual Data Model
AutoMed
- No semantic mismatch
- Possible to extend data warehouse views into a
different data model - Easily reuse the trans-formation and integration
efforts if a source schema is changed
- Semantic mismatches
- Tightly coupled with the CDM
- Not straightforward to reuse the integration
effort if a source schema is changed
15Materialised Data Integration in AutoMed
16Expressing Schema and Data Model Evolution
- Expressing the evolution of schema S to Snew
- S and Snew have different schema but expressed in
the same modelling language. - Snew is an equivalent schema of S expressed in
different modelling language. - S to Snew is an evolution of both a change in the
schema and in the data model.
17Handling Schema Evolution
Supposing evolution S ? Snew consisting of one
primitive Transformation, t , two scenarios need
to be considered
- Evolution of the global schema
- Evolution of a local schema
18Evolution of the Global Schema
- t is rename c c, nothing further need to do.
- t is add c q , nothing further need to do at the
schema level. The new construct c must be
populated. - t is extend c , an empty new construct c is
populated. - t is delete c q or contract c , the extend of c
must be removed from the global database, and the
transformation pathway US ? GSnew can be
optimized.
GSnew
t
GS
GD
T
US
Other transformations
19Evolution of a Local Schema
- t is rename c c, Tnew rename c c T.
- t is add c q , Tnew delete c q T.
- t is delete c q , Tnew add c q T.
- t is contract c , Tnew extend c T. Since the
extend of c is Void, the materialised data in CD
and GD must be modified so as to remove any data
derived from the old extend of c. - t is extend c , Tnew can be contract cT, but it
will lose information.(see next slide for
details)
GS
GD
US
CSi
CDi
T
Tnew
LSi
t
LSinew
LDi
20t extend c
- c appears in schema US and has the same semantics
as the newly added c in LSinew. - Removing extend c from CSi?US and propagating c
into CSi. - c does NOT appears in schema US but can be
derived from US by means of some transformation
T. - Keeping c in CSi and US, and populating c into
LDi. - Repairing other pathways LSj? CSj (i ! j),
adding c in each CSj. - c does NOT appears in schema US but CANNOT be
derived from US. - Keeping c in CSi and US, and populating c into
LDi. - Extending other pathways LSj? CSj (i ! j),
adding c in each CSj. - c appears in schema US but has the different
semantics as the newly added c in LSinew. - renaming c in LSinew to a new construct c.
21Conclusion Future Work
- Our approach is mainly automatic for handling
schema evolution, except for the aspects that
require domain or expert human knowledge
regarding the semantics of new schema constructs. - In our case, the new mappings are a composition
of the original transformation pathway and the
transformation pathway which expresses the schema
evolution. - In this paper, we use IQL. However, the AutoMed
toolkit allows any other query languages. Our
approach could be extended to other query
language. - Our approach is expected to handle the problem of
incremental view maintenance at the date and
schema levels.
Future Work
Implementing our approach and evaluating it in
the context of biological data warehousing
(ISPIDE project).
22Acknowledge
Thank you!