Title: New England Database Society NEDS
1- New England Database Society (NEDS)
- Friday, January 23, 2004
- Volen 101, Brandeis University
Sponsored by Sun Microsystems
2From Data Integration to Data Sharing
- Renée J. Miller
- University of Toronto
- Periklis Andritsos, Ariel Fuxman
- Tasos Kementsietsidis, Yannis Velegrakis
3Pre-history Schema Integration
Integrated schema
Source schema S1
Source schema Sn
Design Problem integrated schema designed to
match sources
4Using Integrations
financial aid date amount
grant gid amount
project received gid date
Queries on target are queries on (integrated) view
create view G.financial(aid, amount, date) as (
select S1.grant.gid, S1.grant.amount,
S1.received.date from S1.grant, S1.received
where S1.grant.gid S1.received.gid) union (
select S2.contract.cid, S2.contract.funds,
S2.contract.date from S2.contract )
5Data Integration
Source Schema S
Source Schema S
data
Source Schema S
Target Schema T
data
- Data Integration answer target queries using
data from source(s) - Target data is virtual
6Data Exchange
Source Schema S
Source Schema S
data
Source Schema S
Target Schema T
data
- Data Exchange answer target queries using local
data - Target data is materialized
7Mappings
- Mappings from source to integration are views (on
sources) - Easy query answering
- Easy data exchange
- Notice
- Integration has no constraints! It is a view!
Global-as-view
8Data Sharing
Source Schema S
Target Schema T
conforms to
data
- Views are not enough in practice
9A New Look at Views
funding gid proj aid financial
aid date amount
grant gid amount project
Global-as-View (GAV)
create view funding(gid, proj, aid) as ( select
gid, project, NULL from grant
) create view financial(aid, date, amount) as
( select NULL, NULL, amount
from grant )
Local-as-View (LAV) LRO, VLDB96
create view grant(gid, amount, project) as (
select gid, amount, proj from
funding, financial where
funding.aid financial.aid )
10Using Views
Source schema S (Local)
Target schema T (Global)
queries
?st
I
J
- Views ?st have a special form
- GAV Qs(S) ? Ti where Ti is a relation in T, Qs
is a query on S - Grant(G,A,P),Received(G,D) ? Financial(G,D,A)
- Plain old view
- LAV Si ? Qt(T) where Si is a relation in S ,
Qt is a query on T - Grant(G,A,P) ? Funding(G,A,Aid),
Financial(Aid,D,A) - Notice Aid is existentially quantified
11Schema Mappings
queries
Source schema S (Local)
Target schema T (Global)
?t
?st
Virtual or Materialized
I
J
- Schema Mappings
- Qs(S) ? Qt(T)
- More general than views
- Generality often required when S, T are fixed
- No design control
- Note that both S, T may have constraints (?s , ?t
)
12Project Overview
- Schema Mapping reconciling differences in
schemas - Clio Creating mappings (VLDB00,VLDB02)
- Using semantics of schemas and data (SIGMOD01)
- ToMAS Managing schema mappings (VLDB03)
- Evolving schemas and semantics
- Using Mappings
- Data Exchange (ICDT03)
- Querying Inconsistent Data (IJCAI/IIWeb03)
- Data Mapping reconciling differences in data
- Hyperion managing data mappings (SIGMOD03)
- Using networks of P2P data mappings
13Clio Schema Mapping Data Exchange
- University of Toronto
- Renée J. Miller
- Periklis Andritsos, Ariel Fuxman
- Tasos Kemesetsidis, Yannis Velegrakis
- IBM Almaden
- Laura Haas
- Ron Fagin, Mauricio Hernández
- Howard Ho, Felix Naumann
- Lucian Popa, Ling-Ling Yan
- UC Santa Cruz
- Phokion Kolaitis
14Overview
- Goal interoperability between independent data
sources - Creating Mappings
- Managing Mappings as sources change
- Using Mappings to query and exchange data
- Even when data is dirty or inconsistent
- Challenges
- Schemas can be arbitrarily different
- Still, data must not lose its meaning
- Use semantics embedded in schemas data
- Facilitate specification of any additional
semantics - Performed manually complex user queries,
programs, etc. - Hard to debug understand verify correctness
15Schema Mapping
- Wants data from S
- Understands T
- May not understand S
- XML Schema
- DTD
- Relational
Source schema S
Target schema T
conforms to
data
- Automate (to the extent possible) the creation
of mappings - Mappings used for (virtual) data integration or
(materialized) data exchange
16Illustration Mapping Creation
- Support Nested Structures
- Element correspondences
- Human friendly
- Automatic discovery
- Preserve data meaning
- Discover data associations
- Use constraints schema
17Schema Matching
Creating Correspondences
- Graphical User Interface
- DBA interactively specifies
- Automatic Discovery
- Attribute (Element) Classifier
- Extensible to
- Other Schema Matchers
- RahmBernstein VLDB J. 01
- Based on syntactic information
- Within schema or data
18Interpreting Correspondences
statDB Set of Rcd cityStat Rcd
orgs Set of Rcd org Rcd
cid name
fundings Set of Rcd funding Rcd
gid proj aid
financials Set of Rcd financial
Rcd aid
date amount city
expenseDB Rcd companies Set of Rcd
company Rcd cid
name city
grants Set of Rcd grant Rcd
cid gid
amount project
What semantics do we associate to an arrow?
Good enough for one arrow !
?cid expenseDB.companies ? ?cid
statDB.cityStat.orgs
Still works for these two arrows!
?cid,name expenseDB.companies ? ?cid,name
statDB.cityStat.orgs
?gid expenseDB.grants ? ?gid
statDB.cityStat.orgs.fundings
How about now ?
19Associations btw Elements
statDB Set of Rcd cityStat Rcd
orgs Set of Rcd org Rcd
cid name
fundings Set of Rcd funding Rcd
gid proj aid
financials Set of Rcd financial
Rcd aid
date amount city
expenseDB Rcd companies Set of Rcd
company Rcd cid
name city
grants Set of Rcd grant Rcd
cid gid
amount project
- Association (in the source) grants ? companies
- Association (in the target) statDB ? orgs ?
fundings ? financials
20Schema Mapping
- Enumerate ALL logical associations consistent
with schema semantics - Constraints
- Nesting (schema structure)
- Data
- Interpret correspondences (arrows) over pair
source target association
21Clio Mappings
q
Source schema S (Local)
Target schema T (Global)
?t
?st
Virtual or Materialized
I
J
- Schema Mapping
- Source association Qs(S)
- Target association Qs(T)
- Use constraints on S and T (?s , ?t )
- Create queries
- Ensure target instance/query answers satisfy ?t
- Mapping Qs(S) ? Qt(T)
22Using Views State of the Art
q
Target schema T
Source schema S
?t
?st
virtual !
I
J
- Data Integration Lenzerini, PODS 02
- The target is not materialized it is just a
querying interface - Queries are posed on the target schema data is
in the source. - Problem how to answer the query in the best
possible way - AKA Answering queries using views
- GAV/LAV (mostly) assumes conjunctive queries
- (mostly) assumes no target constraints target
is a view - Uses relational (not nested relational) model
23Using Clio Mappings
q
Source schema S (Local)
Target schema T (Global)
?t
?st
Materialize!
I
J
- Mapping Qs(S) ? Qt(T) and constraints on S and T
(?s , ?t ) - Grant(C,G,A,P,S) ? Funding(C,G,Aid),Financials(Aid
,D,A) - Data Integration
- Not just relation, but also nested XML schemas
with constraints! - Data Exchange
- Queries on the target schema answered using
target data - Given instance of S there may be many instances
of T - Problem what is best instance to exchange
24Semantics
- Data Integration
- Query is answered using source data
- Answer is set of tuples in query result on ALL
possible target instances certain answers - Data Exchange
- Query is answered using ONE materialized target
- Can single target give same information as
source(s)? - Is query result the same in both settings?
- ICDT03
25When Can This Tool Help?
- In general, whenever
- schemas are complex (nesting constraints) or
different - it matters whether we lose data or not
- it is not obvious to a human user that there are
many forms of data that can coexist within a
schema - target has constraints that must be satisfied
- Specific applications
- data migration from one schema to another (schema
redesign) - data transmission (between independently created
schemas) - data publishing
- data integration and warehousing
26Clio
- Can map between any combination of
- XML schemas/DTDs/Relational schemas
- The generated queries are XQuery Skolem fcts
- (or XSLT or SQL)
- Tested Clio with many complex real data
- LifeScience schemas (private and public GeneX,
SwissProt) - Amalgam integration benchmark, DBLP, Mondial,
several others, - Demos SIGMOD01 (Relational), ICDE02 (XML)
- Papers VLDB00, SIGMOD01, VLDB02, ICDT03
- Clio technology already transferred into DB2 V8
27ToMAS
Source Schema S
Target Schema T
mappings
X
?
New Source Schema S
New Target Schema T
- Mapping Adaptation (Velegrakis et al, VLDB 03)
- When schemas change, mappings may become invalid
or inconsistent. - Old mappings (views and queries) need to be
rewritten for the new schemas - Requires schema experts
- ToMAS Provides tool support to reduce the
required user effort - Detects updates mappings affected by the schema
changes - VLDB03 (paper) ICDE04 (demo) www.cs.toronto.edu/
db/tomas
28Target Constraints
- Create Mappings that
- Ensure all referential constraints in target are
satisfied - Keyrefs
- Foreign keys
- Note that satisfaction does not require omitting
any source data - What about key constraints?
- Or any equality-generating dependency
29Data Inconsistency
Financial
Target Database
Human Resources
Mapping Financial(e,s) ? Global(e,s) HumanRes(e,s)
? Global(e,s)
30Data Inconsistency
Financial
Target Database
Human Resources
Mapping Financial(e,s) ? Global(e,s) HumanRes(e,s)
? Global(e,s)
31Reconciling Inconsistencies (I)
1 Delete all tuples for John
Financial
Target Database
Human Resources
Mapping Financial(e,s) ? Global(e,s) HumanRes(e,s)
? Global(e,s)
32Reconciling Inconsistencies (II)
2 Delete the salaries of John
Financial
Target Database
Human Resources
Mapping Financial(e,s) ? Global(e,s) HumanRes(e,s)
? Global(e,s)
33Reconciling Inconsistencies (III)
3 Delete only one tuple for John
Financial
Target Database
Human Resources
Mapping Financial(e,s) ? Global(e,s) HumanRes(e,s)
? Global(e,s)
34Repairing an integrated database
Repair 1
An integrated inconsistent database
35Repairing an integrated database
Repair 1
An integrated inconsistent database
Repair 2
36Consistent Query Answers
Repair 1
- Intuition
- Input query Q
- For each repair Î
- Get a query result Q(Î)
- Consistent answers
- Tuples that appear in all query results.
Repair 2
37Consistent Query Answers
Repair 1
Q(e,s)Target(e,s)
Get all employees and their salaries
Repair 2
38Consistent Query Answers
Repair 1
Q(e,s)Target(e,s)
Get all employees and their salaries
Repair 2
ConsistentS(Q,I)(Mary,3000)
39Consistent Query Answers
Repair 1
Q(e) ? s Global(e,s)
Get all employees
Repair 2
40Consistent Query Answers
Result 1
Q(e) ? s Global(e,s)
Get all employees
Result 2
ConsistentS(Q,I)(John),(Mary)
41Finding Consistent Answers
Problem Retrieving consistent answers is co-NP
complete in general (i.e., we need to explore
an exponential number of repairs) Chomicki and
Marcinkowski 2002, Cali et al. 2003
42Our work
Problem Retrieving consistent answers is co-NP
complete in general (i.e., we need to explore
an exponential number of repairs) Chomicki and
Marcinkowski 2002, Cali et al. 2003
- Goal Find a class of tractable queries (i.e.,
the consistent answers can be retrieved in
polynomial time without explicitly building all
repairs). - And those that can be retrieved using a query
43Inexpressibility result
- Query rewriting
- Input query Q
- Output query Q s.t. Q(I)consistentS(Q,I) for
every I. - Appealing approach
- tractable
- reuses existing DBMSs
- BUT so far known to be applicable only to a
restricted classes of queries (Arenas et al,
PODS 1999) - Our work rewriting is not always possible
44Data Mapping
- What if sources unwilling to share schemas?
- Common in more autonomous P2P settings
- How can such sources share data?
- Shared schema mappings not available
- Need to manage and share
- Data mappings
- Data associations and aliases
- Results of cleaning
- Results of consistency analysis
- Hyperion P2P data sharing
- Kementsietsidis et al, SIGMOD03
45A P2P Genome Example
- Peers store information about genes, proteins,
etc.
SwissProt(pid, name)
alias
Gene (gid, name)
- Characteristics of mapping tables
- The recorded associations can be 11, 1n or mn
- They are, in general, non-binary
- They associate values within or across domains
46Contributions
- State of the art
- Mapping tables represent expert knowledge.
- Currently, they are created manually by domain
specialists - or are the results of cleaning analysis tools.
- Our contributions
- We automate the creation and maintenance of
mappings. - More specifically
- We investigate alternative semantics for mapping
tables. - We motivate why reasoning capabilities are needed
to manage them. - We propose efficient algorithms for both finding
inconsistencies in mapping tables and inferring
new mapping tables
47Conclusions
- Managing Schema Inconsistency
- Very general mapping language for relational
XML schemas - Algorithms for query translation (data
integration) - Algorithms for creating target instance (data
exchange) - Managing Data Inconsistency
- Manage data mappings
- Tolerate inconsistency
- Permit retrieval of consistent data at query time
- www.cs.toronto.edu/db
- www.cs.toronto.edu/db/tomas
- www.cs.toronto.edu/db/hyperion
- www.cs.toronto.edu/db/miller