Clio: Translating Web Data - PowerPoint PPT Presentation

1 / 25
About This Presentation
Title:

Clio: Translating Web Data

Description:

Laura Haas. Ron Fagin, Mauricio Hern ndez. Howard Ho, Phokion Kolaitis, Felix Naumann. Lucian Popa, Ling-Ling Yan. 7/3/09. R.J. Miller ... – PowerPoint PPT presentation

Number of Views:202
Avg rating:3.0/5.0
Slides: 26
Provided by: queensDb
Category:
Tags: clio | data | laura | ling | translating | web

less

Transcript and Presenter's Notes

Title: Clio: Translating Web Data


1
Clio Translating Web Data
  • University of Toronto
  • Renée J. Miller
  • Periklis Andritsos, Ariel Fuxman
  • Tasos Kemesetsidis, Yannis Velegrakis
  • IBM Almaden
  • Laura Haas
  • Ron Fagin, Mauricio Hernández
  • Howard Ho, Phokion Kolaitis, Felix Naumann
  • Lucian Popa, Ling-Ling Yan

2
Schema Mapping Data Translation
  • Wants data from S
  • Understands T
  • May not understand S
  • XML Schema
  • DTD
  • Relational

Source schema S
Target schema T
conforms to
data
User interactively chooses among (or corrects)
translation queries
3
Mapping is not Schema Integration
Integrated schema
Source schema S1
Source schema Sn
Design Problem integrated schema designed to
match sources
4
Overview
  • Goal interoperability between independent data
    sources
  • Schema Mapping
  • Data Translation
  • Challenges
  • Schemas can be arbitrarily different
  • Still, data must not lose its meaning during
    translation
  • Maximum advantage of semantics embedded in
    schemas data
  • Used in compilation
  • Facilitate user specification of any additional
    semantics
  • As by-product user learns if semantics
    incorrect/incomplete
  • Performed manually complex user queries,
    programs, etc.
  • Output correct data translation program

5
Illustration Clio Schema Mapping
  • Support Nested Structures
  • Element correspondences
  • Human friendly
  • Automatic discovery
  • Preserve data meaning
  • Discover data associations
  • Use constraints schema
  • Create New Target Values
  • Produce Correct Grouping
  • And

6
Generate Queries (XQuery)
  • ltstatisticsDBgt
  • FOR x0 IN /expenseDB/grant, x1 IN
    /expenseDB/project, x2 IN /expenseDB/company
  • WHERE
  • x2/cid/text() x0/cid/text()
  • x0/project/text()
    x1/name/text()
  • RETURN
  • ltcityStatisticsgt
  • FOR x0L1 IN /expenseDB/grant,
    x1L1 IN /expenseDB/project, x2L1 IN
    /expenseDB/company
  • WHERE
  • x2L1/cid/text()
    x0L1/cid/text()
  • x0L1/project/text()
    x1L1/name/text()
  • x2/city/text()
    x2L1/city/text()
  • RETURN
  • ltorganizationgt
  • ltcidgt
    x0L1/cid/text() lt/cidgt
  • ltcnamegt
    x2L1/name/text() lt/cnamegt
  • FOR x0L2 IN
    /expenseDB/grant, x1L2 IN /expenseDB/project,
    x2L2 IN /expenseDB/company
  • WHERE

7
Outline
  • Schema Mapping (VLDB00, VLDB02)
  • Interpret high-level mapping (correspondences)
  • Relies on semantics of schemas
  • Generate low-level mapping (logical mapping)
  • Data Translation (VLDB02)
  • Query generation
  • Retrieve from source (join/unnest)
  • Insert into target (partition/nest)
  • Accommodate differences in data content
  • Skolemization used to create missing data
  • Understand Mapping (SIGMOD 01)
  • Explain mappings, resolve ambiguities using data

8
Creating Correspondence
  • Graphical User Interface
  • DBA interactively specifies
  • Automatic Discovery
  • Attribute (Element) Classifier
  • Extensible to
  • Other Schema Matchers
  • VLDB J. 01 Survey
  • Correspondence based on syntactic information

9
Interpreting Arrows
statDB Set of Rcd cityStat Rcd
orgs Set of Rcd org Rcd
cid name
fundings Set of Rcd funding Rcd
gid proj aid
financials Set of Rcd financial
Rcd aid
date amount city
expenseDB Rcd companies Set of Rcd
company Rcd cid
name city
grants Set of Rcd grant Rcd
cid gid
amount project
Good enough for one arrow !
?cid expenseDB.companies ? ?cid
statDB.cityStat.orgs
Still works for two arrows!
?cid,name expenseDB.companies ? ?cid,name
statDB.cityStat.orgs
?gid expenseDB.grants ? ?gid
statDB.cityStat.orgs.fundings
How about now ?
10
Data Associations
statDB Set of Rcd cityStat Rcd
orgs Set of Rcd org Rcd
cid name
fundings Set of Rcd funding Rcd
gid proj aid
financials Set of Rcd financial
Rcd aid
date amount city
expenseDB Rcd companies Set of Rcd
company Rcd cid
name city
grants Set of Rcd grant Rcd
cid gid
amount project
  • We must recognize that grants are associated to
    companies
  • Association (in the source) grants ? companies
  • Association (in the target) statDB ? orgs ?
    fundings ? financials

11
Logical Associations
statDB Set of Rcd cityStat Rcd
city orgs Set of Rcd org
Rcd cid name
fundings Set of Rcd
funding Rcd gid proj
aid financials Set of Rcd
financial Rcd aid
date amount
expenseDB Rcd companies Set of Rcd
company Rcd cid
name city
grants Set of Rcd grant Rcd
cid gid
amount sponsor
project
12
Multiple Logical Associations
statDB Set of Rcd cityStat Rcd
orgs Set of Rcd org Rcd
cid name
fundings Set of Rcd funding Rcd
gid proj aid
financials Set of Rcd financial
Rcd aid
date amount city
expenseDB Rcd companies Set of Rcd
company Rcd cid
name city
grants Set of Rcd grant Rcd
cid gid
amount project sponsor
  • Grants may be associated with companies in
    multiple ways
  • Association 1 grants ? companies join on cid
    cid
  • Association 2 grants ? companies join on
    sponsor cid
  • We do not make the one flavor assumption (URA)

13
Schema Mapping
  • Enumerate ALL logical associations consistent
    with schema semantics
  • Constraints
  • Nesting (schema structure)
  • Data (statistical models of data)
  • Interpret correspondences (arrows) over pair
    source target association

14
Features of Algorithm
  • Generate associations
  • Modified dependency chase algorithm
  • Nested schemas
  • Broad class of (nested) referential constraints
  • Including (nested) foreign keys
  • Consider both source and target constraints
  • Algorithm is complete generate all associations
    (in the UR sense) for acyclic dependencies
  • Logical mapping
  • Set of (alternative) inter-schema dependencies

15
Mapping Algorithm
Attribute-Attribute correspondences
Source schema S
Target schema T
1 Generate logical mappings
2 Select subset mappings
16
Outline
  • Schema Mapping
  • Interpret high-level mapping
  • Relies on semantics of schemas
  • Generate logical (low-level) mapping
  • Data Translation
  • Query generation
  • Retrieve from source (join/unnest)
  • Insert into target (partition/nest)
  • Accommodate differences in data content
  • Skolemization used to create missing data

17
From Dependencies to Queries
  • Logical mappings inter-schema dependencies
  • Convenient intermediate representation (in Clio)
  • can be reasoned about (query optimization)
  • can even be shown (graphically) to a user
  • but
  • not all target attributes determined source
  • flat representation of how schemas correspond
  • we need to materialize a nested target

18
Query Generation Issues
  • We translate logical mappings into queries
  • Two parts, conceptually
  • Unnest (join) the source association
  • Nest (partition) according to the shape of the
    target association
  • Skolemization algorithm the heart of query
    generation
  • Achieves a good nesting (grouping)
  • Generates new values (ids)
  • Not to many ids, but not too few either !

19
Query Generation
statDB Set of Rcd cityStat Rcd
city orgs Set of Rcd org
Rcd cid name
fundings Set of Rcd
funding Rcd gid proj
aid financials Set of Rcd
financial Rcd aid
date amount
expenseDB Rcd companies Set of Rcd
company Rcd cid
name city
grants Set of Rcd grant Rcd
cid gid
amount sponsor
project
Sk2
M2
Sk1name
Sk3name
  • Correspondences map only into some of the atomic
    attributes
  • We use Skolem functions to control the creation
    of the other elements
  • sets (this controls how we group elements in the
    target)
  • atomic values (this enforces the integrity of the
    target)

20
Mapping Algorithm
Source schema S
Target schema T
?1
?2

Target Associations
Source Associations
BUT we dont have It we need to create it!!
21
Correctness Guarantees
  • Information preservation
  • For every logical mapping M ?1 ? ?2
    generate query QM
  • For any source instance Is ?2
    (QM(?1(Is))) ?1(Is)
  • Map no less and no more data than was intended
    to be mapped
  • In fact, ?2 and QM are inverse to each other
    (view inversion)
  • Lossless data transmission
  • Target integrity constraints satisfied!!
  • Target nested referential constraints ?t
  • QM(?1(Is)) It where It ? ?t

22
When Can This Tool Help?
  • In general, whenever
  • schemas are complex (nesting constraints) or
    different
  • it matters whether we lose data or not
  • it is not obvious to a human user that there are
    many forms of data that can coexist within a
    schema
  • Specific applications
  • data migration from one schema to another (schema
    redesign)
  • data transmission (between independently created
    schemas)
  • data integration and warehousing

23
Project Status So Far
  • Can map between any combination of
    XML schemas/DTDs/Relational schemas
  • The generated queries are XQuery Skolem fcts
  • (or XSLT or SQL)
  • Tested Clio with many complex real data
  • LifeScience schemas (private and public GeneX,
    SwissProt)
  • Amalgam integration benchmark, DBLP, Mondial,
    several others,
  • Demos SIGMOD01 (Relational), ICDE02 (XML)
  • Papers VLDB00, SIGMOD01, VLDB02

No commercial tool correctly generates XML
transformations that are lossless !
24
Conclusion
  • Schema mapping framework with solid semantic
    foundation
  • Covers relational and core of XML Schema
  • Maps source associations to target associations
  • We build data transformations (XQuery with Skolem
    functions) rather than just queries
  • Separated into 2 phases
  • Logical mapping (arrows -gt dependencies)
  • Data translation (dependencies -gt queries)
  • It works (and we have users!)

25
Answering Queries Using Views
q
Target schema T
Source schema S
?t
?st
virtual !
I
J
  • The target is not materialized it is just a
    querying interface
  • Queries are posed on the target schema data is
    in the source.
  • Problem how to answer the query in the best
    possible way
  • Classical setting ?st have a special form
  • GAV Q(S) ? Ti where Ti is a relation in T, Q
    is a query on S
  • LAV Si ? Q(T) where Si is a relation in S , Q
    is a query on T
  • Schema Mapping Q1(S) ? Q2(T) and constraints on
    S and T (?s , ?t )
  • GAV/LAV (mostly) assumes conjunctive queries on
    relations
  • (mostly) assumes no target constraints target
    is a view
Write a Comment
User Comments (0)
About PowerShow.com