New England Database Society NEDS - PowerPoint PPT Presentation

1 / 44
About This Presentation
Title:

New England Database Society NEDS

Description:

Of Toronto. New England Database Society (NEDS) Friday, January 23, 2004 ... University of Toronto. Periklis Andritsos, Ariel Fuxman. Tasos Kementsietsidis, ... – PowerPoint PPT presentation

Number of Views:58
Avg rating:3.0/5.0
Slides: 45
Provided by: cseU3
Category:

less

Transcript and Presenter's Notes

Title: New England Database Society NEDS


1
  • New England Database Society (NEDS)
  • Friday, January 23, 2004
  • Volen 101, Brandeis University

Sponsored by Sun Microsystems
2
From Data Integration to Data Sharing
  • Renée J. Miller
  • University of Toronto
  • Periklis Andritsos, Ariel Fuxman
  • Tasos Kementsietsidis, Yannis Velegrakis

3
Pre-history Schema Integration
Integrated schema
Source schema S1
Source schema Sn
Design Problem integrated schema designed to
match sources
4
Using Integrations
financial aid date amount
grant gid amount
project received gid date
Queries on target are queries on (integrated) view
create view G.financial(aid, amount, date) as (
select S1.grant.gid, S1.grant.amount,
S1.received.date from S1.grant, S1.received
where S1.grant.gid S1.received.gid) union (
select S2.contract.cid, S2.contract.funds,
S2.contract.date from S2.contract )
5
Data Integration
Source Schema S
Source Schema S
data
Source Schema S
Target Schema T
data
  • Data Integration answer target queries using
    data from source(s)
  • Target data is virtual

6
Data Exchange
Source Schema S
Source Schema S
data
Source Schema S
Target Schema T
data
  • Data Exchange answer target queries using local
    data
  • Target data is materialized

7
Mappings
  • Mappings from source to integration are views (on
    sources)
  • Easy query answering
  • Easy data exchange
  • Notice
  • Integration has no constraints! It is a view!

Global-as-view
8
Data Sharing
Source Schema S
Target Schema T
conforms to
data
  • Views are not enough in practice

9
A New Look at Views
funding gid proj aid financial
aid date amount
grant gid amount project
Global-as-View (GAV)
create view funding(gid, proj, aid) as ( select
gid, project, NULL from grant
) create view financial(aid, date, amount) as
( select NULL, NULL, amount
from grant )
Local-as-View (LAV) LRO, VLDB96
create view grant(gid, amount, project) as (
select gid, amount, proj from
funding, financial where
funding.aid financial.aid )
10
Using Views
Source schema S (Local)
Target schema T (Global)
queries
?st
I
J
  • Views ?st have a special form
  • GAV Qs(S) ? Ti where Ti is a relation in T, Qs
    is a query on S
  • Grant(G,A,P),Received(G,D) ? Financial(G,D,A)
  • Plain old view
  • LAV Si ? Qt(T) where Si is a relation in S ,
    Qt is a query on T
  • Grant(G,A,P) ? Funding(G,A,Aid),
    Financial(Aid,D,A)
  • Notice Aid is existentially quantified

11
Schema Mappings
queries
Source schema S (Local)
Target schema T (Global)
?t
?st
Virtual or Materialized
I
J
  • Schema Mappings
  • Qs(S) ? Qt(T)
  • More general than views
  • Generality often required when S, T are fixed
  • No design control
  • Note that both S, T may have constraints (?s , ?t
    )

12
Project Overview
  • Schema Mapping reconciling differences in
    schemas
  • Clio Creating mappings (VLDB00,VLDB02)
  • Using semantics of schemas and data (SIGMOD01)
  • ToMAS Managing schema mappings (VLDB03)
  • Evolving schemas and semantics
  • Using Mappings
  • Data Exchange (ICDT03)
  • Querying Inconsistent Data (IJCAI/IIWeb03)
  • Data Mapping reconciling differences in data
  • Hyperion managing data mappings (SIGMOD03)
  • Using networks of P2P data mappings

13
Clio Schema Mapping Data Exchange
  • University of Toronto
  • Renée J. Miller
  • Periklis Andritsos, Ariel Fuxman
  • Tasos Kemesetsidis, Yannis Velegrakis
  • IBM Almaden
  • Laura Haas
  • Ron Fagin, Mauricio Hernández
  • Howard Ho, Felix Naumann
  • Lucian Popa, Ling-Ling Yan
  • UC Santa Cruz
  • Phokion Kolaitis

14
Overview
  • Goal interoperability between independent data
    sources
  • Creating Mappings
  • Managing Mappings as sources change
  • Using Mappings to query and exchange data
  • Even when data is dirty or inconsistent
  • Challenges
  • Schemas can be arbitrarily different
  • Still, data must not lose its meaning
  • Use semantics embedded in schemas data
  • Facilitate specification of any additional
    semantics
  • Performed manually complex user queries,
    programs, etc.
  • Hard to debug understand verify correctness

15
Schema Mapping
  • Wants data from S
  • Understands T
  • May not understand S
  • XML Schema
  • DTD
  • Relational

Source schema S
Target schema T
conforms to
data
  • Automate (to the extent possible) the creation
    of mappings
  • Mappings used for (virtual) data integration or
    (materialized) data exchange

16
Illustration Mapping Creation
  • Support Nested Structures
  • Element correspondences
  • Human friendly
  • Automatic discovery
  • Preserve data meaning
  • Discover data associations
  • Use constraints schema
  • Create New Target Values
  • Produce Correct Grouping

17
Schema Matching
Creating Correspondences
  • Graphical User Interface
  • DBA interactively specifies
  • Automatic Discovery
  • Attribute (Element) Classifier
  • Extensible to
  • Other Schema Matchers
  • RahmBernstein VLDB J. 01
  • Based on syntactic information
  • Within schema or data

18
Interpreting Correspondences
statDB Set of Rcd cityStat Rcd
orgs Set of Rcd org Rcd
cid name
fundings Set of Rcd funding Rcd
gid proj aid
financials Set of Rcd financial
Rcd aid
date amount city
expenseDB Rcd companies Set of Rcd
company Rcd cid
name city
grants Set of Rcd grant Rcd
cid gid
amount project
What semantics do we associate to an arrow?
Good enough for one arrow !
?cid expenseDB.companies ? ?cid
statDB.cityStat.orgs
Still works for these two arrows!
?cid,name expenseDB.companies ? ?cid,name
statDB.cityStat.orgs
?gid expenseDB.grants ? ?gid
statDB.cityStat.orgs.fundings
How about now ?
19
Associations btw Elements
statDB Set of Rcd cityStat Rcd
orgs Set of Rcd org Rcd
cid name
fundings Set of Rcd funding Rcd
gid proj aid
financials Set of Rcd financial
Rcd aid
date amount city
expenseDB Rcd companies Set of Rcd
company Rcd cid
name city
grants Set of Rcd grant Rcd
cid gid
amount project
  • Association (in the source) grants ? companies
  • Association (in the target) statDB ? orgs ?
    fundings ? financials

20
Schema Mapping
  • Enumerate ALL logical associations consistent
    with schema semantics
  • Constraints
  • Nesting (schema structure)
  • Data
  • Interpret correspondences (arrows) over pair
    source target association

21
Clio Mappings
q
Source schema S (Local)
Target schema T (Global)
?t
?st
Virtual or Materialized
I
J
  • Schema Mapping
  • Source association Qs(S)
  • Target association Qs(T)
  • Use constraints on S and T (?s , ?t )
  • Create queries
  • Ensure target instance/query answers satisfy ?t
  • Mapping Qs(S) ? Qt(T)

22
Using Views State of the Art
q
Target schema T
Source schema S
?t
?st
virtual !
I
J
  • Data Integration Lenzerini, PODS 02
  • The target is not materialized it is just a
    querying interface
  • Queries are posed on the target schema data is
    in the source.
  • Problem how to answer the query in the best
    possible way
  • AKA Answering queries using views
  • GAV/LAV (mostly) assumes conjunctive queries
  • (mostly) assumes no target constraints target
    is a view
  • Uses relational (not nested relational) model

23
Using Clio Mappings
q
Source schema S (Local)
Target schema T (Global)
?t
?st
Materialize!
I
J
  • Mapping Qs(S) ? Qt(T) and constraints on S and T
    (?s , ?t )
  • Grant(C,G,A,P,S) ? Funding(C,G,Aid),Financials(Aid
    ,D,A)
  • Data Integration
  • Not just relation, but also nested XML schemas
    with constraints!
  • Data Exchange
  • Queries on the target schema answered using
    target data
  • Given instance of S there may be many instances
    of T
  • Problem what is best instance to exchange

24
Semantics
  • Data Integration
  • Query is answered using source data
  • Answer is set of tuples in query result on ALL
    possible target instances certain answers
  • Data Exchange
  • Query is answered using ONE materialized target
  • Can single target give same information as
    source(s)?
  • Is query result the same in both settings?
  • ICDT03

25
When Can This Tool Help?
  • In general, whenever
  • schemas are complex (nesting constraints) or
    different
  • it matters whether we lose data or not
  • it is not obvious to a human user that there are
    many forms of data that can coexist within a
    schema
  • target has constraints that must be satisfied
  • Specific applications
  • data migration from one schema to another (schema
    redesign)
  • data transmission (between independently created
    schemas)
  • data publishing
  • data integration and warehousing

26
Clio
  • Can map between any combination of
  • XML schemas/DTDs/Relational schemas
  • The generated queries are XQuery Skolem fcts
  • (or XSLT or SQL)
  • Tested Clio with many complex real data
  • LifeScience schemas (private and public GeneX,
    SwissProt)
  • Amalgam integration benchmark, DBLP, Mondial,
    several others,
  • Demos SIGMOD01 (Relational), ICDE02 (XML)
  • Papers VLDB00, SIGMOD01, VLDB02, ICDT03
  • Clio technology already transferred into DB2 V8

27
ToMAS
Source Schema S
Target Schema T
mappings
X
?
New Source Schema S
New Target Schema T
  • Mapping Adaptation (Velegrakis et al, VLDB 03)
  • When schemas change, mappings may become invalid
    or inconsistent.
  • Old mappings (views and queries) need to be
    rewritten for the new schemas
  • Requires schema experts
  • ToMAS Provides tool support to reduce the
    required user effort
  • Detects updates mappings affected by the schema
    changes
  • VLDB03 (paper) ICDE04 (demo) www.cs.toronto.edu/
    db/tomas

28
Target Constraints
  • Create Mappings that
  • Ensure all referential constraints in target are
    satisfied
  • Keyrefs
  • Foreign keys
  • Note that satisfaction does not require omitting
    any source data
  • What about key constraints?
  • Or any equality-generating dependency

29
Data Inconsistency
Financial
Target Database
Human Resources
Mapping Financial(e,s) ? Global(e,s) HumanRes(e,s)
? Global(e,s)
30
Data Inconsistency
Financial
Target Database
Human Resources
Mapping Financial(e,s) ? Global(e,s) HumanRes(e,s)
? Global(e,s)
31
Reconciling Inconsistencies (I)
1 Delete all tuples for John
Financial
Target Database
Human Resources
Mapping Financial(e,s) ? Global(e,s) HumanRes(e,s)
? Global(e,s)
32
Reconciling Inconsistencies (II)
2 Delete the salaries of John
Financial
Target Database
Human Resources
Mapping Financial(e,s) ? Global(e,s) HumanRes(e,s)
? Global(e,s)
33
Reconciling Inconsistencies (III)
3 Delete only one tuple for John
Financial
Target Database
Human Resources
Mapping Financial(e,s) ? Global(e,s) HumanRes(e,s)
? Global(e,s)
34
Repairing an integrated database
Repair 1
An integrated inconsistent database
35
Repairing an integrated database
Repair 1
An integrated inconsistent database
Repair 2
36
Consistent Query Answers
Repair 1
  • Intuition
  • Input query Q
  • For each repair Î
  • Get a query result Q(Î)
  • Consistent answers
  • Tuples that appear in all query results.

Repair 2
37
Consistent Query Answers
Repair 1
Q(e,s)Target(e,s)
Get all employees and their salaries
Repair 2
38
Consistent Query Answers
Repair 1
Q(e,s)Target(e,s)
Get all employees and their salaries
Repair 2
ConsistentS(Q,I)(Mary,3000)
39
Consistent Query Answers
Repair 1
Q(e) ? s Global(e,s)
Get all employees
Repair 2
40
Consistent Query Answers
Result 1
Q(e) ? s Global(e,s)
Get all employees
Result 2
ConsistentS(Q,I)(John),(Mary)
41
Finding Consistent Answers
Problem Retrieving consistent answers is co-NP
complete in general (i.e., we need to explore
an exponential number of repairs) Chomicki and
Marcinkowski 2002, Cali et al. 2003
42
Our work
Problem Retrieving consistent answers is co-NP
complete in general (i.e., we need to explore
an exponential number of repairs) Chomicki and
Marcinkowski 2002, Cali et al. 2003
  • Goal Find a class of tractable queries (i.e.,
    the consistent answers can be retrieved in
    polynomial time without explicitly building all
    repairs).
  • And those that can be retrieved using a query

43
Inexpressibility result
  • Query rewriting
  • Input query Q
  • Output query Q s.t. Q(I)consistentS(Q,I) for
    every I.
  • Appealing approach
  • tractable
  • reuses existing DBMSs
  • BUT so far known to be applicable only to a
    restricted classes of queries (Arenas et al,
    PODS 1999)
  • Our work rewriting is not always possible

44
Data Mapping
  • What if sources unwilling to share schemas?
  • Common in more autonomous P2P settings
  • How can such sources share data?
  • Shared schema mappings not available
  • Need to manage and share
  • Data mappings
  • Data associations and aliases
  • Results of cleaning
  • Results of consistency analysis
  • Hyperion P2P data sharing
  • Kementsietsidis et al, SIGMOD03

45
A P2P Genome Example
  • Peers store information about genes, proteins,
    etc.

SwissProt(pid, name)
alias
Gene (gid, name)
  • Characteristics of mapping tables
  • The recorded associations can be 11, 1n or mn
  • They are, in general, non-binary
  • They associate values within or across domains

46
Contributions
  • State of the art
  • Mapping tables represent expert knowledge.
  • Currently, they are created manually by domain
    specialists
  • or are the results of cleaning analysis tools.
  • Our contributions
  • We automate the creation and maintenance of
    mappings.
  • More specifically
  • We investigate alternative semantics for mapping
    tables.
  • We motivate why reasoning capabilities are needed
    to manage them.
  • We propose efficient algorithms for both finding
    inconsistencies in mapping tables and inferring
    new mapping tables

47
Conclusions
  • Managing Schema Inconsistency
  • Very general mapping language for relational
    XML schemas
  • Algorithms for query translation (data
    integration)
  • Algorithms for creating target instance (data
    exchange)
  • Managing Data Inconsistency
  • Manage data mappings
  • Tolerate inconsistency
  • Permit retrieval of consistent data at query time
  • www.cs.toronto.edu/db
  • www.cs.toronto.edu/db/tomas
  • www.cs.toronto.edu/db/hyperion
  • www.cs.toronto.edu/db/miller
Write a Comment
User Comments (0)
About PowerShow.com