Title: A Principled Approach to Data Integration and Reconciliation in Data Warehousing
1. A Principled Approach to Data Integration and Reconciliation in Data Warehousing
- Diego Calvanese
- Giuseppe De Giacomo
- Maurizio Lenzerini
- Daniele Nardi
- Riccardo Rosati
- Presented by Alan Wessman
2. Introduction
- Problem: Acquire data from a set of sources for a particular application
- Typical architecture: wrappers and mediators
- Core problem: specify and implement mediators
- Paper focus: data warehouses
3. Data Warehouse Integration
- Most sources internal to organization
- Need global corporate view of data
- Conceptual model defines sources and data warehouse (local-as-view)
- Three levels of architecture
  - Conceptual: Global model
  - Logical: Query specifications for sources and warehouse
  - Physical: Wrappers and mediators implementing query specifications
4. Architecture
5. Specifying Logical Schemas
- For each table of source S, create an adorned query
  - Head: Table name, columns
  - Body: Content of table (query over conceptual model)
  - Adornment
    - Domains (data types) of columns
    - Key attributes
6. Adorned Query Example
Halibut(Date, Price) <- Menu(Date, Halibut, Price)
    | Price :: Lira, Date :: JulianDate
Swordfish(Date, Price) <- Menu(Date, Swordfish, Price)
    | Price :: Lira, Date :: JulianDate
SushiMenu(TunaPrice, SquidPrice, Date) <- Menu(Date, Tuna, TunaPrice), Menu(Date, Squid, SquidPrice)
    | TunaPrice :: Yen, SquidPrice :: Yen, Date :: JulianDate
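As a rough illustration, the first adorned query above could be held in a small record like the following Python sketch. The class name `AdornedQuery`, its field names, and the helper method are hypothetical and not taken from the paper; the query body is kept as an opaque string rather than parsed.

```python
from dataclasses import dataclass, field


@dataclass
class AdornedQuery:
    """Hypothetical container for one adorned query (illustrative names only)."""
    head: str        # table name
    columns: tuple   # columns listed in the head
    body: str        # query over the conceptual model, kept as plain text
    domains: dict = field(default_factory=dict)  # adornment: column -> domain
    keys: tuple = ()                             # adornment: key attributes

    def unannotated_columns(self):
        """Return the head columns that carry no domain annotation."""
        return [c for c in self.columns if c not in self.domains]


# The Halibut query from the slide; the slide lists no key attributes.
halibut = AdornedQuery(
    head="Halibut",
    columns=("Date", "Price"),
    body="Menu(Date, Halibut, Price)",
    domains={"Price": "Lira", "Date": "JulianDate"},
)
```

Keeping the adornment separate from the body mirrors the head / body / adornment split on the previous slide.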
7. Query Consistency
- Let Q be an adorned query and B its body.
- Let M be the conceptual model.
- B is inconsistent wrt M if, for every interpretation of M, the evaluation of B is empty
- Q is inconsistent wrt M if either B is inconsistent or the annotations are inconsistent
- Inference techniques exist for checking query consistency
8. Interschema Correspondences
- Specify how data in different schemas relates
- Non-materialized relational tables (computed on demand)
- Like adorned query but annotations identify helper programs
- Reusable by other correspondences
9. Interschema Correspondences
- Three types of correspondence
  - Conversion
    - How data from one source is converted into data fitting a different schema
  - Matching
    - How data from different sources matches
  - Reconciliation
    - How data from different sources is reconciled to become data in the warehouse
10. Conversion Correspondence
- How data from one source is converted into data fitting a different schema
- convert(x, y) <- conj(x, y, z) through program(x, y, z)
- conj: Conjunctive query, specifies when conversion applies
- program: Program that performs the conversion
- x: Input tuple of values satisfying conditions for x in conj
- y: Output tuple of values satisfying conditions for y in conj
- z: Additional parameters required by program
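A minimal Python sketch of how a conversion correspondence might be evaluated, assuming tuples of the form (date, price, currency). The functions `conj`, `program`, and `convert` are stand-ins named after the slide's symbols, and the exchange rate is a made-up value used only for illustration.

```python
def conj(x, y, z):
    """Stand-in for the conjunctive query: says when the conversion applies.

    Assumption: it applies to Lira input tuples, Yen output tuples,
    and a positive rate supplied in the extra parameters z.
    """
    return x[2] == "Lira" and y[2] == "Yen" and z["rate"] > 0


def program(x, z):
    """Stand-in for the helper program that builds the output tuple y."""
    date, price, _ = x
    return (date, price * z["rate"], "Yen")


def convert(x, z):
    """Return the converted tuple y, or None if the correspondence does not apply."""
    y = program(x, z)
    return y if conj(x, y, z) else None


z = {"rate": 0.5}  # made-up rate, for illustration only
y = convert((2451545, 1000, "Lira"), z)
```

Here `z` plays the role of the additional parameters on the slide: it is needed by the program but is part of neither the input nor the output tuple.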
11. Matching Correspondence
- How data from different sources matches
- match(x1, ..., xk) <- conj(x1, ..., xk, z) through program(x1, ..., xk, z)
- Differs from Conversion Correspondence in use of k tuples that may be matched
- program returns true if the k tuples match
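The matching case can be sketched in the same spirit. The example below assumes, purely for illustration, that tuples have the form (date, dish name, price) and that two tuples match when they name the same dish on the same date; `normalize`, `same_dish`, and the layout of `z` are hypothetical.

```python
def normalize(name):
    """Hypothetical cleanup so that ' tuna ' and 'Tuna' compare equal."""
    return name.strip().lower()


def same_dish(tuples, z):
    """Stand-in for program: True if all k tuples name the same dish on the same date."""
    keys = {(t[z["date_pos"]], normalize(t[z["name_pos"]])) for t in tuples}
    return len(keys) == 1


def match(*tuples, z):
    """Evaluate the correspondence over k tuples with extra parameters z."""
    # Stand-in for conj: every tuple must have the positions named in z.
    applies = all(len(t) > max(z.values()) for t in tuples)
    return applies and same_dish(tuples, z)


z = {"date_pos": 0, "name_pos": 1}  # extra parameters: where to look in each tuple
```

For example, `match((2451545, "Tuna", 300), (2451545, " tuna ", 310), z=z)` treats the two tuples as the same dish even though the prices differ.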
12. Reconciliation Correspondence
- How data from different sources is reconciled to the warehouse
- reconcile(x1, ..., xk, z) <- conj(x1, ..., xk, z, w) through program(x1, ..., xk, z, w)
- z: Data warehouse tuple, result of reconciliation
- w: Additional parameters (like z in previous slides)
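One way to picture a reconciliation program is a rule that picks the value from the most trusted source. The Python sketch below assumes source tuples of the form (source, date, dish, price) and a priority list passed in `w`; this policy and all names are illustrative assumptions, not the paper's method.

```python
def reconcile(tuples, w):
    """Build one warehouse tuple z from k source tuples, or None if they disagree.

    tuples: list of (source, date, dish, price)
    w:      extra parameters; w["priority"] lists sources from most to least trusted
    """
    # Stand-in for conj: all tuples must describe the same dish on the same date.
    if len({(date, dish) for _, date, dish, _ in tuples}) != 1:
        return None
    rank = {source: i for i, source in enumerate(w["priority"])}
    # Sources missing from the priority list rank last.
    best = min(tuples, key=lambda t: rank.get(t[0], len(rank)))
    _, date, dish, price = best
    return (date, dish, price)


w = {"priority": ["s2", "s1"]}  # hypothetical source names
z = reconcile([("s1", 1, "Tuna", 300), ("s2", 1, "Tuna", 310)], w)
```

The result `z` corresponds to the warehouse tuple on the slide, and `w` to the additional parameters.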
13. Reusing Correspondences
- Only reuse if previously defined
- Example 1
  - match(x, y) <- convert1(x, z), convert2(y, z), conj(x, y, z, w) through none
- Example 2
  - reconcile(x, y, z) <- convert1(x, w1), convert2(y, w2), match1(w1, w2), convert3(w1, z), conj(x, y, z, w) through none
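Example 1 can be read as "x and y match when both convert to the same z", with no helper program of its own. A small Python sketch under that reading follows; the two conversions and the code table are invented for illustration, and the extra `conj` condition is omitted.

```python
# Made-up dish codes for a hypothetical second source.
SOURCE2_CODES = {"T01": "tuna", "S02": "squid"}


def convert1(x):
    """Hypothetical conversion for source 1: free-text dish name -> canonical name."""
    return x.strip().lower()


def convert2(y):
    """Hypothetical conversion for source 2: dish code -> canonical name (None if unknown)."""
    return SOURCE2_CODES.get(y)


def match(x, y):
    """x and y match when both previously defined conversions yield the same z."""
    z = convert1(x)
    return z == convert2(y)
```

Because `match` is built only from `convert1` and `convert2`, it needs no program of its own, which is what "through none" expresses on the slide.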
14. Specifying Mediators
- Aim: Specify for each relation in warehouse how the tuples should be constructed from the sources
- Task: Materialize a new relation T in the warehouse
- Steps
  - Specify T as an adorned query q <- q' | c1, ..., cn
  - Look for a rewriting of q in terms of queries q1, ..., qs corresponding to materialized views in the warehouse
  - Look for a rewriting of (what remains of q) in terms of queries corresponding to tables in the sources and the conversion, matching, and reconciliation correspondences
- Resulting query is specification for the mediator for T
15. Computing the Rewriting
- Rewriting typically needs to merge results of several queries
- Produce set of merging clauses
- Form: merging tuple-spec1 and ... and tuple-specn such that matching-condition into tuple-spect1 and ... and tuple-spectm
- Generates template: designer specifies the "such that" and "into" parts, or writes custom merging clauses
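To make the shape of a merging clause concrete, here is a simplified Python sketch that applies one clause to in-memory relations. It assumes a single output tuple per match (the general form allows several target tuple specifications), and the function name `merging` and the sample rows are illustrative only.

```python
from itertools import product


def merging(relations, such_that, into):
    """Apply one merging clause.

    For each combination of one tuple per input relation that satisfies the
    matching condition `such_that`, emit the tuple built by `into`.
    """
    return [into(*combo) for combo in product(*relations) if such_that(*combo)]


# Made-up rows: (Date, TunaPrice) and (Date, SquidPrice).
tuna = [(1, 300), (2, 320)]
squid = [(1, 250), (3, 270)]

sushi_menu = merging(
    [tuna, squid],
    such_that=lambda t, s: t[0] == s[0],    # matching condition: same date
    into=lambda t, s: (t[1], s[1], t[0]),   # (TunaPrice, SquidPrice, Date)
)
```

The two lambdas are the parts the designer would supply in the generated template: the matching condition and the construction of the target tuple.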
16. Conclusion
- Start with conceptual model and several types of correspondences
- Query rewriting algorithm generates mediator specifications
- Designer fills in any remaining details
- No empirical results