Title: Bridging Different Data Representations
1Bridging Different Data Representations
- Zachary G. Ives
- University of Pennsylvania
- CIS 550 Database Information Systems
- November 11, 2004
2A Special Type of Query Conjunctive Queries
- A single Datalog rule with no Ç, , 8 can
express select, project, and join a conjunctive
query - Conjunctive queries are possible to reason about
statically - (Note that we can write CQs in other languages,
e.g., SQL!) - We know how to minimize conjunctive queries
- An important simplification that cant be done
for general SQL - We can test whether one conjunctive querys
answers always contain another conjunctive
querys answers (for ANY instance) - Why might this be useful?
3Example of Containment
- Suppose we have two queriesq1(S,C) -
Student(S, N), Takes(S, C), Course(C, X),
inCIS(C), Course(C, DB Info
Systems)q2(S,C) - Student(S, N), Takes(S, C),
Course(C, X) - Intuitively, q1 must contain the same or fewer
answers vs. q2 - It has all of the same conditions, except one
extra conjunction (i.e., its more restricted) - Theres no union or any other way it can add more
data - We can say that q2 contains q1 because this holds
for any instance of our DB Student, Takes,
Course
4Wrapping up Datalog
- Weve seen a new language, Datalog
- Its basically a glorified DRC with a special
feature, recursion - Its much cleaner than SQL for reasoning about
- But negation (as in the DRC) poses some
challenges - Weve seen that a particular kind of query, the
conjunctive query, is written naturally in
Datalog - Conjunctive queries are possible to reason about
- We can minimize them, or check containment
- Conjunctive queries are very commonly used in our
next problem, data integration
5A Problem
- Weve seen that even with normalization and the
same needs, different people will arrive at
different schemas - In fact, most people also have different needs!
- Often people build databases in isolation, then
want to share their data - Different systems within an enterprise
- Different information brokers on the Web
- Scientific collaborators
- Researchers who want to publish their data for
others to use - This is the goal of data integration tie
together different sources, controlled by many
people, under a common schema
6Building a Data Integration System
- Create a middleware mediator or data
integration system over the sources - Can be warehoused (a data warehouse) or virtual
- Presents a uniform query interface and schema
- Abstracts away multitude of sources consults
them for relevant data - Unifies different source data formats (and
possibly schemas) - Sources are generally autonomous, not designed to
be integrated - Sources may be local DBs or remote web
sources/services - Sources may require certain input to return
output (e.g., web forms) binding patterns
describe these
7Typical Data Integration Components
Query
Results
Data Integration System / Mediator
Mediated Schema
Source Catalog
Mappings in Catalog
Wrapper
Wrapper
Wrapper
Source Relations
8Typical Data Integration Architecture
Source Descrs.
Query
Reformulator
Source Catalog
Query over sources
QueryProcessor
Results
Queries bindings
Data in mediated format
Wrapper
Wrapper
Wrapper
9Challenges of Mapping Schemas
- In a perfect world, it would be easy to match up
items from one schema with another - Every table would have a similar table in the
other schema - Every attribute would have an identical attribute
in the other schema - Every value would clearly map to a value in the
other schema - Real world as with human languages, things
dont map clearly! - May have different numbers of tables different
decompositions - Metadata in one relation may be data in another
- Values may not exactly correspond
- It may be unclear whether a value is the same
10A Few Simple Examples
- Movie(Title, Year, Director, Editor, Star1,
Star2) - Movie(Title, Year, Director, Editor, Star1,
Star2)
- PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
- MotionPicture(ID, Title, Year)Participant(ID,
Name, Role)
CustID CustName
1234 Ives, Z.
PennID EmpName
46732 Zachary Ives
11How Do We Relate Schemas?
- General approach is to use a view to define
relations in one schema (typically either the
mediated schema or the source schema), given data
in the other schema - This allows us to restructure or recompose
decompose our data in a new way - We can also define mappings between values in a
view - We use an intermediate table defining
correspondences a concordance table - It can be filled in using some type of code, and
corrected by hand
12Mapping Our Examples
- Movie(Title, Year, Director, Editor, Star1,
Star2) - Movie(Title, Year, Director, Editor, Star1,
Star2)
- PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
- MotionPicture(ID, Title, Year)Participant(ID,
Name, Role)
PieceOfArt(I, A, S, T, Movie) - Movie(T, Y, A,
_, S1, S2), ID T Y, S S1 S2
Movie(T, Y, D, E, S1, S2) - MotionPicture(I, T,
Y), Participant(I, D, Dir), Participant(I, E,
Editor), Participant(I, S1, Star1),
Participant(I, S2, Star2)
T1
CustID CustName
1234 Ives, Z.
PennID EmpName
46732 Zachary Ives
T2
Need a concordance table from CustIDs to PennIDs
13Two Important Approaches
- TSIMMIS Garcia-Molina97 Stanford
- Focus semistructured data (OEM), OQL-based
language (Lorel) - Creates a mediated schema as a view over the
sources - Spawned a UCSD project called MIX, which led to a
company now owned by BEA Systems - Other important systems of this vein Kleisli/K2
_at_ Penn - Information Manifold Levy96 ATT Research
- Focus local-as-view mappings, relational model
- Sources defined as views over mediated schema
- Requires a special
- Spawned Tukwila at Washington, and eventually a
company as well - Led to peer-to-peer integration approaches
(Piazza, etc.)
14TSIMMIS and Information Manifold
- Focus Web-based queryable sources
- CGI forms, online databases, maybe a few RDBMSs
- Each needs to be mapped into the system not as
easy as web search but the benefits are
significant vs. query engines - A few parenthetical notes
- Part of a slew of works on wrappers, source
profiling, etc. - The creation of mappings can be partly automated
systems such as LSD, Cupid, Clio, do this - Today most people look at integrating large
enterprises (thats where the is!) Nimble,
BEA, IBM
15TSIMMIS
- The Stanford-IBM Manager of Multiple Information
Sources or, a Yiddish stew - An instance of a global-as-view mediation
system - One of the first systems to support
semi-structured data, which predated XML by
several years
16Semi-structured Data OEM
- Observation given a particular schema, its
attributes may be unavailable from certain
sources inherent irregularity - Proposal Object Exchange Model, OEM
- OID ltlabel, type, valuegt
- How does it relate to XML?
- What problems does OEM solve, and not solve, in
a heterogeneous system?
17OEM Example
Show this XML fragment in OEM ltbookgt
ltauthorgtBernsteinlt/authorgt ltauthorgtNewcomerlt/aut
horgt lttitlegtPrinciples of TPlt/titlegtlt/bookgt ltbo
okgt ltauthorgtChamberlinlt/authorgt lttitlegtDB2
UDBlt/titlegtlt/bookgt
18Queries in TSIMMIS
- Specified in OQL-style language called Lorel
- OQL was an object-oriented query language
- Lorel is, in many ways, a predecessor to XQuery
- Based on path expressions over OEM structures
- select bookwhere book.title DB2 UDB and
book.author Chamberlin - This is basically like XQuery, which well use in
place of Lorel and the MSL template language.
Previous query restated - for b in AllData()/bookwhere b/title/text()
DB2 UDB and b/author/text()
Chamberlinreturn b
19Query Answering in TSIMMIS
- Basically, its view unfolding, i.e., composing a
query with a view - The query is the one being asked
- The views are the MSL templates for the wrappers
- Some of the views may actually require
parameters, e.g., an author name, before theyll
return answers - Common for web forms (see Amazon, Google, )
- XQuery functions (XQuerys version of views)
support parameters as well, so well see these in
action
20A Wrapper Definition in MSL
- Wrappers have templates and binding patterns (X)
in MSL - B - B ltbook ltauthor Xgtgt // select
from book where author X // - This reformats a SQL query over Book(author,
year, title) - In XQuery, this might look like
- define function GetBook(x AS xsdstring) as book
for b in sql(Amazon.DB, select
from book where author x )return
ltbookgtb/titleltauthorgtxlt/authorgtlt/bookgt
book
author
title
The union of GetBooks results is unioned with
others to form the view AllData()
21How to Answer the Query
- Given our query
- for b in AllData()/bookwhere b/title/text()
DB2 UDB and b/author/text()
Chamberlinreturn b - Find all wrapper definitions that
- Contain output enough structure to match the
conditions of the query - Or have already tested the conditions for us!
22Query Composition with Views
- We find all views that define book with author
and title, and we compose the query with each - define function GetBook(x AS xsdstring) as book
for b in sql(Amazon.DB, select
from book where author x )return
ltbookgt b/title ltauthorgtxlt/authorgtlt/bookgt -
- for b in AllData()/bookwhere b/title/text()
DB2 UDB and b/author/text()
Chamberlinreturn b
book
author
title
23Matching View Output to Our Querys Conditions
- Determine that b/book/author/text() ?? x by
matching the pattern on the functions output - define function GetBook(x AS xsdstring) as book
for b in sql(Amazon.DB, select
from book where author x )return
ltbookgt b/title ltauthorgtxlt/author
gtlt/bookgt -
- let x Chamberlinfor b in
GetBook(x)/bookwhere b/title/text() DB2
UDB return b
book
author
title
24The Final Step Unfolding
- let x Chamberlinfor b in ( for b in
sql(Amazon.com, select from book where
author x ) return ltbookgt b/title
ltauthorgtxlt/authorgtlt/bookgt - )/bookwhere b/title/text() DB2 UDB
return b - How do we simplify further to get to here?
- for b in sql(Amazon.com, select from
book where authorChamberlin)where
b/title/text() DB2 UDB return b
25Virtues of TSIMMIS
- Early adopter of semistructured data, greatly
predating XML - Can support data from many different kinds of
sources - Obviously, doesnt fully solve heterogeneity
problem - Presents a mediated schema that is the union of
multiple views - Query answering based on view unfolding
- Easily composed in a hierarchy of mediators
26Limitations of TSIMMIS Approach
- Some data sources may contain data with certain
ranges or properties - Books by Aho, Students at UPenn,
- If we ask a query for students at Columbia, dont
want to bother querying students at Penn - How do we express these?
- Mediated schema is basically the union of the
various MSL templates as they change, so may
the mediated schema
27An Alternate ApproachThe Information Manifold
(Levy et al.)
- When you integrate something, you have some
conceptual model of the integrated domain - Define that as a basic frame of reference,
everything else as a view over it - Local as View
- May have overlapping/incomplete sources
- Define each source as the subset of a query over
the mediated schema - We can use selection or join predicates to
specify that a source contains a range of values - ComputerBooks() ? Books(Title, , Subj), Subj
Computers
28The Local-as-View Model
- The basic model is the following
- Local sources are views over the mediated
schema - Sources have the data mediated schema is
virtual - Sources may not have all the data from the domain
open-world assumption - The system must use the sources (views) to answer
queries over the mediated schema
29Answering Queries Using Views
- Assumption conjunctive queries, set semantics
- Suppose we have a mediated schema author(aID,
isbn, year), book(isbn, title, publisher) - A conjunctive query might be q(a, t, p) -
author(a, i, _), book(i, t, p), t DB2 UDB - Recall intuitions about this class of queries
- Adding a conjunct to a query removes answers from
the result but never adds any - Any conjunctive query with at least the same
constraints conjuncts will give valid answers
30Query Answering
- Suppose we have the query
- q(a, t, p) - author(a, i, _), book(i, t, p)
- and sources
- s1(a,t) ? author(a, i, _), book(i, t, p), t
123 -
- s5(a,i) ? author(a, i, _)
- s6(i,p) ? book(i, t, p)
- We want to compose the query with the source
mappings but theyre in the wrong direction!
31Inverse Rules
- We can take every mapping and invert it, though
sometimes we may have insufficient information - If
- s5(a,i) ? author(a, i, _)
- then we can also infer that
- author(a, i, ???) ? s5(a,i)
- But how to handle the absence of the 3rd
(publisher) attribute? - We know that there must be AT LEAST one instance
of ??? in author for each (a,i) pair - So we might simply insert a NULL and define that
NULL means unknown (as opposed to missing)
32But NULLs Lose Information
- Suppose we take these rules and ask for
- q(a,t) - author(a, i, _), book(i, t, p)
- If we look at the rule
- s1(a,t) ? author(a, i, _), book(i, t, p), t
123 - Clearly q(a,t) ? s1(a,t)
- But if apply our inversion procedure, we get
- author(a, NULL, NULL) ? s1(a,t)
- book(NULL, t, p) ? s1(a,t), t 123
- and theres no way to kow to join author and book
on NULL! - We need a special NULL for each a-t combo so we
can figure out which as and ts go together
33The Solution Skolem Functions
- Skolem functions
- Conceptual perfect hash functions
- Each function returns a unique, deterministic
value for each combination of input values - Every function returns a non-overlapping set of
values (Skolem function F will never return a
value that matches any of Skolem function Gs
values) - Skolem functions wont ever be part of the answer
set or the computation it doesnt produce real
values - Theyre just a way of logically generating
special NULLs
34Revisiting Our Example
- Query
- q(a,t) - author(a, i, _), book(i, t, p)
- Mapping rule
- s1(a,t) ? author(a, i, _), book(i, t, p), t
123 - Inverse rules
- author(a, f(a,t), NULL) ? s1(a,t)
- book(f(a,t), t, p) ? s1(a,t), t 123
- Expand the query as follows
- q(a,t) - author(a, i, NULL), book(i, t, p), i
f(a,t) - q(a,t) - s1(a,t), s1(a,t), t 123, i f(a,t)
35Query Answering Using Inverse Rules
- Invert all rules using the procedures described
- Take the query and the possible rule expansions
and execute them in a Datalog interpreter - In the previous query, we expand with all
combinations of expansions of book and of author
every possible way of combining and
cross-correlating info from different sources - Then we throw away all unsatisfiable rewritings
(some expansions will be logically inconsistent) - More efficient, but equivalent, algorithms now
exist - Bucket algorithm Levy et al.
- MiniCon Pottinger Halevy
- Also related chase and backchase Popa,
Tannen, Deutsch
36Summary of Data Integration
- Local-as-view integration has replaced
global-as-view as the standard - More robust way of defining mediated schemas and
sources - Mediated schema is clearly defined, less likely
to change - Sources can be more accurately described
- Methods exist for query reformulation, including
inverse rules - Integration requires standardization on a single
schema - Can be hard to get consensus
- Today we have peer-to-peer data integration,
e.g., Piazza Halevy et al., Orchestra Ives et
al., Hyperion Miller et al. - Some other aspects of integration were addressed
in related papers - Overlap between sources coverage of data at
sources - Semi-automated creation of mappings and wrappers
- Data integration capabilities in commercial
products BEAs Liquid Data, IBMs DB2
Information Integrator, numerous packages from
middleware companies