Title: Data Integration Methods
1Data Integration Methods
- Zachary G. Ives
- University of Pennsylvania
- CIS 650 Database Information Systems
- February 16, 2004
2Administrivia
- Next reading assignment on scalable query
reformulation algorithms - Pottinger and Halevy MiniCon
- Write-up summarize the main ideas of this paper
3Todays Trivia Question
4A Problem
- Weve seen that even with normalization and the
same needs, different people will arrive at
different schemas - In fact, most people also have different needs!
- Often people build databases in isolation, then
want to share their data - Different systems within an enterprise
- Different information brokers on the Web
- Scientific collaborators
- Researchers who want to publish their data for
others to use - This is the goal of data integration tie
together different sources, controlled by many
people, under a common schema
5Example
- We want to build the UltimateMovieGuideTM
- Given
- TV Guide schema ShowingAt(time, title,
year) Shows(title, year, genre,
rating) Director(title, year, name) Starring(tit
le, year, name) - GoodMovies.com FourStarMovie(title, year,
genre) DecentMovie(title, year) - OscarWinners WinningDirector(director, title,
year) - Documentaries.org Documentary(title, year,
director, producer)
6Integrating Data
- Several steps
- Getting data out from a data source it may have
its own query/retrieval interface - Sometimes it will need a query before it returns
answers, e.g., a Web form - Getting all of the data into the same data model
and format - Translating the data into the same schema
- Answering queries
- How might we handle this?
7Data Warehouses Offline Replication
- Get experts together, define a schema they think
best captures all the info - Define a database with this schema
- Define procedural mappings in an ETL tool to
import the data - Perhaps perform data cleaning
- Periodically copy all of the data from the data
sources - Note that the sources and the warehouse are
basically independent at this point
Results
Query
Data Warehouse
Remote, Autonomous Data Sources
8Pros and Cons of Data Warehouses
- Need to spend time to design the physical
database layout, as well as logical - This actually takes a lot of effort!
- Data is generally not up-to-date (lazy or offline
refresh) - Queries over the warehouse dont disrupt the data
sources - Can run very heavy-duty computations, including
data mining and cleaning
9An Alternative Mediators or Virtual Integration
Systems
- Get experts together, define a schema they think
best captures all the info - Define as a virtual mediated schema
- Create declarative mappings specifying how to get
data from each source into the warehouse - Evaluate queries over the mediated schema on the
fly using the current data at the sources
Results
Query
Data Integration System
Mediated Schema
Source Catalog
Schema Mappings
Remote, Autonomous Data Sources
10Core Question How Do We Define and Use
Mappings?
- Queries must be directly composed with mappings
- Leads to use of views as the means of specifying
mappings - So which direction do we specify views?
- Mediated relations as views over source relations
- Source relations as views over mediated relations
- TSIMMIS chooses option 1
- Information Mainfold chooses option 2
- Neither is perfect or comprehensive, as well see
11The Job of Mappings
- Between different data sources
- May have different numbers of tables different
decompositions - Attributes may be broken down differently
(rating vs. EbertThumb and RoeperThumb) - Metadata in one relation may be data in another
- Values may not exactly correspond (shows vs.
movies) - It may be unclear whether a value is the same
(COPPOLA vs. Francis Ford Coppola) - May have different, but synonymous terms (ImdbID
123456 ? SSN 987-45-3210) - Might have sub/superclass relationships
12General Techniques
- Value-value correspondences accomplished using
concordance tables - Join through a table mapping values to values
- Imdb_Actor(ID, SAG_actor_name)
- Table-multitable correspondences accomplished
using joins (in one direction), projections (in
other direction) - Key question what happens if a needed attribute
is missing? (e.g., DecentMovie has no genre) - Super/subclass relationships generally must be
captured using selection (in one direction),
union (in other direction) - And sometimes we just cant specify the
correspondence!
13Some Examples of Mappings
- Show(ID, Title, Year, Lang, Genre)
- Movie(ID, Title, Year, Genre, Director, Star1,
Star2)
- EnglishMovie(Title, Year, Genre, Rating)
- Docu(ID, Title, Year)Participant(ID, Name, Role)
PieceOfArt(I, T, Y, English, G) -
EnglishMovie(T, Y, G, _), MovieIDFor(I, T, Y)
Movie(I, T, Y, doc, D, S1, S2) - Docu(I, T,
Y), Participant(I, D, Dir), Participant(I,
S1, Cast1), Participant(I, S2, Cast2)
T1
ImdbID CastOf
1234 Catwoman
Name CastOf
Berry, H. Monsters Ball
T2
Need a concordance table from ImdbIDs to actress
names
14TSIMMIS and Information Manifold
- Focus Web-based queryable sources
- CGI forms, online databases, maybe a few RDBMSs
- Each needs to be mapped into the system not as
easy as web search but the benefits are
significant vs. query engines - A few parenthetical notes
- Part of a slew of works on wrappers, source
profiling, etc. - The creation of mappings can be partly automated
systems such as LSD, Cupid, Clio, do this - Today most people look at integrating large
enterprises (thats where the is!) Nimble,
BEA, IBM
15TSIMMIS
- The Stanford-IBM Manager of Multiple Information
Sources or, a Yiddish stew - An instance of a global-as-view mediation
system - One of the first systems to support
semi-structured data, which predated XML by
several years
16Semi-structured Data OEM
- Observation given a particular schema, its
attributes may be unavailable from certain
sources inherent irregularity - Proposal Object Exchange Model, OEM
- OID ltlabel, type, valuegt
- 1 show 2 id 15 , 3 title Catwoman
, 4 year 2004 , 5 lang English ,
6 genre fantasy , 7 criticsrating 8
stars 0.5 , 9 source Bob
17Queries in TSIMMIS
- Specified in OQL-style language called Lorel
- OQL was an object-oriented query language
- Lorel is, in many ways, a predecessor to XQuery
- Based on path expressions over OEM structures
- select showwhere show.title Star Wars and
show.genre sci-fi - This is basically like XQuery, which well use in
place of Lorel and the MSL template language.
Previous query restated - for s in AllData()/showwhere s/title/text()
Star Wars and s/genre/text()
sci-fireturn s
18Query Answering in TSIMMIS
- Basically, its view unfolding, i.e., composing a
query with a view - The query is the one being asked
- The views are the MSL templates for the wrappers
- Some of the views may actually require
parameters, e.g., an author name, before theyll
return answers - Common for web forms (see Amazon, Google, )
- XQuery functions (XQuerys version of views)
support parameters as well, so well see these in
action
19A Wrapper Definition in MSL
- Wrappers have templates and binding patterns (T)
in MSL - S - S ltshow ltgenre Ggtgt // select
from movie where title T // - This reformats a SQL query over Movie(title,
year, genre) - In XQuery, this might look like
- define function GetShow(t AS xsdstring) as show
for s in sql(Amazon.DB, select
from movie where title t )return
ltshowgtlttitlegttlt/titlegt s/year,
s/genrelt/showgt
movie
genre
year
The union of GetShows results is unioned with
others to form the view AllData()
20How to Answer the Query
- Given our query
- for s in AllData()/showwhere s/title/text()
Star Wars and s/genre/text()
sci-fireturn s - Find all wrapper definitions that
- Contain output enough structure to match the
conditions of the query - Or have already tested the conditions for us!
21Query Composition with Views
- We find all views that define book with author
and title, and we compose the query with each - define function GetBook(x AS xsdstring) as book
for b in sql(Amazon.DB, select
from book where author x )return
ltbookgt b/title ltauthorgtxlt/authorgtlt/bookgt -
- for b in AllData()/bookwhere b/title/text()
DB2 UDB and b/author/text()
Chamberlinreturn b
book
author
title
22Example on Board
23Virtues of TSIMMIS
- Early adopter of semistructured data, greatly
predating XML - Can support data from many different kinds of
sources - Obviously, doesnt fully solve heterogeneity
problem - Presents a mediated schema that is the union of
multiple views - Query answering based on view unfolding
- Easily composed in a hierarchy of mediators
24Limitations of TSIMMIS Approach
- Some data sources may contain data with certain
ranges or properties - Books by Aho, Students at UPenn,
- If we ask a query for students at Columbia, dont
want to bother querying students at Penn - How do we express these?
- Mediated schema is basically the union of the
various MSL templates as they change, so may
the mediated schema
25An Alternate ApproachThe Information Manifold
(Levy et al.)
- When you integrate something, you have some
conceptual model of the integrated domain - Define that as a basic frame of reference,
everything else as a view over it - Local as View using mappings that are
conjunctive queries - May have overlapping/incomplete sources
- Define each source as the subset of a query over
the mediated schema the open world assumption - We can use selection or join predicates to
specify that a source contains a range of values - ComputerBooks() ? Books(Title, , Subj), Subj
Computers
26The Local-as-View Model
- The basic model is the following
- Local sources are views over the mediated
schema - Sources have the data mediated schema is
virtual - Sources may not have all the data from the domain
open-world assumption - The system must use the sources (views) to answer
queries over the mediated schema
27Answering Queries Using Views
- Assumption conjunctive queries, set semantics
- Suppose we have a mediated schema show(ID,
title, year, genre), rating(ID, stars, source) - A conjunctive query might be q(t) - show(i,
t, y, g), rating(i, 5, s), y 1997 - Recall intuitions about this class of queries
- Adding a conjunct to a query removes answers from
the result but never adds any - Any conjunctive query with at least the same
constraints conjuncts will give valid answers
28Query Answering
- Suppose we have the query
- q(t) - show(i, t, y, g), rating(i, 5, s), y
1997 - and sources
- 5star(i) ? show(i, t, y, g), rating(i, 5, s)
- TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
TVGuide) - movieInfo(i,t,y,g) ? show(i, t, y, g)
- critics(i,r, s) ? rating(i, r, s)
- goodMovies(t,y) ? show(i, t, y, drama),
rating(i, 5, s), y 1997 - We want to compose the query with the source
mappings but theyre in the wrong direction!
29Inverse Rules
- We can take every mapping and invert it, though
sometimes we may have insufficient information - If
- 5star(i) ? show(i, t, y, g), rating(i, 5, s)
- then we can also infer that
- show(i,??? ,??? ,??? ,???) ? 5star(i)
- But how to handle the absence of the missing
attributes? - We know that there must be AT LEAST one instance
of ??? for each attribute for each show ID - So we might simply insert a NULL and define that
NULL means unknown (as opposed to missing)
30But NULLs Lose Information
- Suppose we take these rules and ask for
- q(t) - show(i, t, y, g), rating(i, 5, s), y
1997 - If we look at the rule
- goodMovies(t,y) ? show(i, t, y, drama),
rating(i, 5, s), y 1997 - By inspection, q(t) ? goodMovies(t,y)
- But if apply our inversion procedure, we get
- show(i, t, y, g) ? goodMovies(t,y), i NULL, g
drama - rating(i, r, s) ? goodMovies(t,y), i NULL, r
5, s NULL - We need a special NULL so we can figure out
which IDs and ratings match up
31The Solution Skolem Functions
- Skolem functions
- Conceptual perfect hash functions
- Each function returns a unique, deterministic
value for each combination of input values - Every function returns a non-overlapping set of
values (Skolem function F will never return a
value that matches any of Skolem function Gs
values) - Skolem functions wont ever be part of the answer
set or the computation it doesnt produce real
values - Theyre just a way of logically generating
special NULLs
32Query Answering Using Inverse Rules
- Invert all rules using the procedures described
- Take the query and the possible rule expansions
and execute them in a Datalog interpreter - In the previous query, we expand with all
combinations of expansions of book and of author
every possible way of combining and
cross-correlating info from different sources - Then we throw away all unsatisfiable rewritings
(some expansions will be logically inconsistent) - More efficient, but equivalent, algorithms now
exist - Bucket algorithm Levy et al., which we discuss
next - MiniCon Pottinger Halevy (next time)
- Also related chase and backchase Popa,
Tannen, Deutsch
33The Bucket Algorithm
- Given a query Q with relations and predicates
- Create a bucket for each subgoal in Q
- Iterate over each view (source mapping)
- If source includes buckets subgoal
- Create mapping between qs vars and the views
var at the same position - If satisfiable with substitutions, add to bucket
- Do cross-product of buckets, see if result is
contained (exptime, but queries are probably
relatively small) - For each result, do a containment check to make
sure the rewriting is contained within the query
34Lets Try a Bucket Example
- Query
- q(t) - show(i, t, y, g), rating(i, 5, s), y
1997 - Sources
- 5star(i) ? show(i, t, y, g), rating(i, 5, s)
- TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
TVGuide) - movieInfo(i,t,y,g) ? show(i, t, y, g)
- critics(i,r, s) ? rating(i, r, s)
- goodMovies(t,y) ? show(i, t, y, drama),
rating(i, 5, s), y 1997 - good98(t,y) ? show(i, t, y, drama), rating(i,
5, s), y 1998
35Example of Containment Testing
- Suppose we have two queriesq1(S,C) -
Student(S, N), Takes(S, C), Course(C, X),
inCSE(C), Course(C, DB Info
Systems)q2(S,C) - Student(S, N), Takes(S, C),
Course(C, X) - Intuitively, q1 must contain the same or fewer
answers vs. q2 - It has all of the same conditions, except one
extra conjunction (i.e., its more restricted) - Theres no union or any other way it can add more
data - We can say that q2 contains q1 because this holds
for any instance of our DB Student, Takes,
Course
36Checking Containment via Canonical Databases
- To test for q1 µ q2
- Create a canonical DB that contains a tuple for
each subgoal in q1 - Execute q2 over it
- If q2 returns a tuple that matches the head of
q1, then q1 µ q2 - (This is an NP-complete algorithm in the size of
the query. Testing for full first-order logic
queries is undecidable!!!) - Lets see this for our example
37Example Canonical DB
- q1(S,C) - Student(S, N), Takes(S, C), Course(C,
X), inCSE(C), Course(C, DB Info Systems) - q2(S,C) - Student(S, N), Takes(S, C), Course(C,
X)
Student
Takes
Course
inCSE
C X
C DB Info Systems
S
S N
S C
Need to get tuple ltS,Cgt in executing q2 over this
database
38Next Time
- Well look at the state-of-the-art in query
reformulation, the MiniCon algorithm - Eliminates the need for the containment check
- Eliminates many cross-product comparisons
- This and the ChaseBackchase strategy of Tannen
et al are the two methods most used in virtual
data integration today - Please read the MiniCon paper (Pottinger Halevy)