Data Integration Methods - PowerPoint PPT Presentation

1 / 38

About This Presentation

Title:

Data Integration Methods

Description:

Today's Trivia Question. 4. A Problem ... where show.title = 'Star Wars' and show.genre = 'sci-fi' ... where $s/title/text() = 'Star Wars' and $s/genre/text ... – PowerPoint PPT presentation

Number of Views:103

Avg rating:3.0/5.0

Slides: 39

Provided by: zack4

Learn more at: https://www.seas.upenn.edu

Category:

more less

Transcript and Presenter's Notes

Title: Data Integration Methods

1
Data Integration Methods

Zachary G. Ives
University of Pennsylvania
CIS 650 Database Information Systems
February 16, 2004

2
Administrivia

Next reading assignment on scalable query
reformulation algorithms
Pottinger and Halevy MiniCon
Write-up summarize the main ideas of this paper

3
Todays Trivia Question
4
A Problem

Weve seen that even with normalization and the
same needs, different people will arrive at
different schemas
In fact, most people also have different needs!
Often people build databases in isolation, then
want to share their data
Different systems within an enterprise
Different information brokers on the Web
Scientific collaborators
Researchers who want to publish their data for
others to use
This is the goal of data integration tie
together different sources, controlled by many
people, under a common schema

5
Example

We want to build the UltimateMovieGuideTM
Given
TV Guide schema ShowingAt(time, title,
year) Shows(title, year, genre,
rating) Director(title, year, name) Starring(tit
le, year, name)
GoodMovies.com FourStarMovie(title, year,
genre) DecentMovie(title, year)
OscarWinners WinningDirector(director, title,
year)
Documentaries.org Documentary(title, year,
director, producer)

6
Integrating Data

Several steps
Getting data out from a data source it may have
its own query/retrieval interface
Sometimes it will need a query before it returns
answers, e.g., a Web form
Getting all of the data into the same data model
and format
Translating the data into the same schema
Answering queries
How might we handle this?

7
Data Warehouses Offline Replication

Get experts together, define a schema they think
best captures all the info
Define a database with this schema
Define procedural mappings in an ETL tool to
import the data
Perhaps perform data cleaning
Periodically copy all of the data from the data
sources
Note that the sources and the warehouse are
basically independent at this point

Results
Query
Data Warehouse
Remote, Autonomous Data Sources
8
Pros and Cons of Data Warehouses

Need to spend time to design the physical
database layout, as well as logical
This actually takes a lot of effort!
Data is generally not up-to-date (lazy or offline
refresh)
Queries over the warehouse dont disrupt the data
sources
Can run very heavy-duty computations, including
data mining and cleaning

9
An Alternative Mediators or Virtual Integration
Systems

Get experts together, define a schema they think
best captures all the info
Define as a virtual mediated schema
Create declarative mappings specifying how to get
data from each source into the warehouse
Evaluate queries over the mediated schema on the
fly using the current data at the sources

Results
Query
Data Integration System
Mediated Schema
Source Catalog
Schema Mappings
Remote, Autonomous Data Sources
10
Core Question How Do We Define and Use
Mappings?

Queries must be directly composed with mappings
Leads to use of views as the means of specifying
mappings
So which direction do we specify views?
Mediated relations as views over source relations
Source relations as views over mediated relations
TSIMMIS chooses option 1
Information Mainfold chooses option 2
Neither is perfect or comprehensive, as well see

11
The Job of Mappings

Between different data sources
May have different numbers of tables different
decompositions
Attributes may be broken down differently
(rating vs. EbertThumb and RoeperThumb)
Metadata in one relation may be data in another
Values may not exactly correspond (shows vs.
movies)
It may be unclear whether a value is the same
(COPPOLA vs. Francis Ford Coppola)
May have different, but synonymous terms (ImdbID
123456 ? SSN 987-45-3210)
Might have sub/superclass relationships

12
General Techniques

Value-value correspondences accomplished using
concordance tables
Join through a table mapping values to values
Imdb_Actor(ID, SAG_actor_name)
Table-multitable correspondences accomplished
using joins (in one direction), projections (in
other direction)
Key question what happens if a needed attribute
is missing? (e.g., DecentMovie has no genre)
Super/subclass relationships generally must be
captured using selection (in one direction),
union (in other direction)
And sometimes we just cant specify the
correspondence!

13
Some Examples of Mappings

Show(ID, Title, Year, Lang, Genre)
Movie(ID, Title, Year, Genre, Director, Star1,
Star2)

EnglishMovie(Title, Year, Genre, Rating)
Docu(ID, Title, Year)Participant(ID, Name, Role)

PieceOfArt(I, T, Y, English, G) -
EnglishMovie(T, Y, G, _), MovieIDFor(I, T, Y)
Movie(I, T, Y, doc, D, S1, S2) - Docu(I, T,
Y), Participant(I, D, Dir), Participant(I,
S1, Cast1), Participant(I, S2, Cast2)
T1
ImdbID CastOf
1234 Catwoman
Name CastOf
Berry, H. Monsters Ball
T2
Need a concordance table from ImdbIDs to actress
names
14
TSIMMIS and Information Manifold

Focus Web-based queryable sources
CGI forms, online databases, maybe a few RDBMSs
Each needs to be mapped into the system not as
easy as web search but the benefits are
significant vs. query engines
A few parenthetical notes
Part of a slew of works on wrappers, source
profiling, etc.
The creation of mappings can be partly automated
systems such as LSD, Cupid, Clio, do this
Today most people look at integrating large
enterprises (thats where the is!) Nimble,
BEA, IBM

15
TSIMMIS

The Stanford-IBM Manager of Multiple Information
Sources or, a Yiddish stew
An instance of a global-as-view mediation
system
One of the first systems to support
semi-structured data, which predated XML by
several years

16
Semi-structured Data OEM

Observation given a particular schema, its
attributes may be unavailable from certain
sources inherent irregularity
Proposal Object Exchange Model, OEM
OID ltlabel, type, valuegt

1 show 2 id 15 , 3 title Catwoman
, 4 year 2004 , 5 lang English ,
6 genre fantasy , 7 criticsrating 8
stars 0.5 , 9 source Bob

17
Queries in TSIMMIS

Specified in OQL-style language called Lorel
OQL was an object-oriented query language
Lorel is, in many ways, a predecessor to XQuery
Based on path expressions over OEM structures
select showwhere show.title Star Wars and
show.genre sci-fi
This is basically like XQuery, which well use in
place of Lorel and the MSL template language.
Previous query restated
for s in AllData()/showwhere s/title/text()
Star Wars and s/genre/text()
sci-fireturn s

18
Query Answering in TSIMMIS

Basically, its view unfolding, i.e., composing a
query with a view
The query is the one being asked
The views are the MSL templates for the wrappers
Some of the views may actually require
parameters, e.g., an author name, before theyll
return answers
Common for web forms (see Amazon, Google, )
XQuery functions (XQuerys version of views)
support parameters as well, so well see these in
action

19
A Wrapper Definition in MSL

Wrappers have templates and binding patterns (T)
in MSL
S - S ltshow ltgenre Ggtgt // select
from movie where title T //
This reformats a SQL query over Movie(title,
year, genre)
In XQuery, this might look like
define function GetShow(t AS xsdstring) as show
for s in sql(Amazon.DB, select
from movie where title t )return
ltshowgtlttitlegttlt/titlegt s/year,
s/genrelt/showgt

movie
genre
year

The union of GetShows results is unioned with
others to form the view AllData()
20
How to Answer the Query

Given our query
for s in AllData()/showwhere s/title/text()
Star Wars and s/genre/text()
sci-fireturn s
Find all wrapper definitions that
Contain output enough structure to match the
conditions of the query
Or have already tested the conditions for us!

21
Query Composition with Views

We find all views that define book with author
and title, and we compose the query with each
define function GetBook(x AS xsdstring) as book
for b in sql(Amazon.DB, select
from book where author x )return
ltbookgt b/title ltauthorgtxlt/authorgtlt/bookgt
for b in AllData()/bookwhere b/title/text()
DB2 UDB and b/author/text()
Chamberlinreturn b

book
author
title

22
Example on Board
23
Virtues of TSIMMIS

Early adopter of semistructured data, greatly
predating XML
Can support data from many different kinds of
sources
Obviously, doesnt fully solve heterogeneity
problem
Presents a mediated schema that is the union of
multiple views
Query answering based on view unfolding
Easily composed in a hierarchy of mediators

24
Limitations of TSIMMIS Approach

Some data sources may contain data with certain
ranges or properties
Books by Aho, Students at UPenn,
If we ask a query for students at Columbia, dont
want to bother querying students at Penn
How do we express these?
Mediated schema is basically the union of the
various MSL templates as they change, so may
the mediated schema

25
An Alternate ApproachThe Information Manifold
(Levy et al.)

When you integrate something, you have some
conceptual model of the integrated domain
Define that as a basic frame of reference,
everything else as a view over it
Local as View using mappings that are
conjunctive queries
May have overlapping/incomplete sources
Define each source as the subset of a query over
the mediated schema the open world assumption
We can use selection or join predicates to
specify that a source contains a range of values
ComputerBooks() ? Books(Title, , Subj), Subj
Computers

26
The Local-as-View Model

The basic model is the following
Local sources are views over the mediated
schema
Sources have the data mediated schema is
virtual
Sources may not have all the data from the domain
open-world assumption
The system must use the sources (views) to answer
queries over the mediated schema

27
Answering Queries Using Views

Assumption conjunctive queries, set semantics
Suppose we have a mediated schema show(ID,
title, year, genre), rating(ID, stars, source)
A conjunctive query might be q(t) - show(i,
t, y, g), rating(i, 5, s), y 1997
Recall intuitions about this class of queries
Adding a conjunct to a query removes answers from
the result but never adds any
Any conjunctive query with at least the same
constraints conjuncts will give valid answers

28
Query Answering

Suppose we have the query
q(t) - show(i, t, y, g), rating(i, 5, s), y
1997
and sources
5star(i) ? show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
TVGuide)
movieInfo(i,t,y,g) ? show(i, t, y, g)
critics(i,r, s) ? rating(i, r, s)
goodMovies(t,y) ? show(i, t, y, drama),
rating(i, 5, s), y 1997
We want to compose the query with the source
mappings but theyre in the wrong direction!

29
Inverse Rules

We can take every mapping and invert it, though
sometimes we may have insufficient information
If
5star(i) ? show(i, t, y, g), rating(i, 5, s)
then we can also infer that
show(i,??? ,??? ,??? ,???) ? 5star(i)
But how to handle the absence of the missing
attributes?
We know that there must be AT LEAST one instance
of ??? for each attribute for each show ID
So we might simply insert a NULL and define that
NULL means unknown (as opposed to missing)

30
But NULLs Lose Information

Suppose we take these rules and ask for
q(t) - show(i, t, y, g), rating(i, 5, s), y
1997
If we look at the rule
goodMovies(t,y) ? show(i, t, y, drama),
rating(i, 5, s), y 1997
By inspection, q(t) ? goodMovies(t,y)
But if apply our inversion procedure, we get
show(i, t, y, g) ? goodMovies(t,y), i NULL, g
drama
rating(i, r, s) ? goodMovies(t,y), i NULL, r
5, s NULL
We need a special NULL so we can figure out
which IDs and ratings match up

31
The Solution Skolem Functions

Skolem functions
Conceptual perfect hash functions
Each function returns a unique, deterministic
value for each combination of input values
Every function returns a non-overlapping set of
values (Skolem function F will never return a
value that matches any of Skolem function Gs
values)
Skolem functions wont ever be part of the answer
set or the computation it doesnt produce real
values
Theyre just a way of logically generating
special NULLs

32
Query Answering Using Inverse Rules

Invert all rules using the procedures described
Take the query and the possible rule expansions
and execute them in a Datalog interpreter
In the previous query, we expand with all
combinations of expansions of book and of author
every possible way of combining and
cross-correlating info from different sources
Then we throw away all unsatisfiable rewritings
(some expansions will be logically inconsistent)
More efficient, but equivalent, algorithms now
exist
Bucket algorithm Levy et al., which we discuss
next
MiniCon Pottinger Halevy (next time)
Also related chase and backchase Popa,
Tannen, Deutsch

33
The Bucket Algorithm

Given a query Q with relations and predicates
Create a bucket for each subgoal in Q
Iterate over each view (source mapping)
If source includes buckets subgoal
Create mapping between qs vars and the views
var at the same position
If satisfiable with substitutions, add to bucket
Do cross-product of buckets, see if result is
contained (exptime, but queries are probably
relatively small)
For each result, do a containment check to make
sure the rewriting is contained within the query

34
Lets Try a Bucket Example

Query
q(t) - show(i, t, y, g), rating(i, 5, s), y
1997
Sources
5star(i) ? show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r) ? show(i, t, y, g), rating(i, r,
TVGuide)
movieInfo(i,t,y,g) ? show(i, t, y, g)
critics(i,r, s) ? rating(i, r, s)
goodMovies(t,y) ? show(i, t, y, drama),
rating(i, 5, s), y 1997
good98(t,y) ? show(i, t, y, drama), rating(i,
5, s), y 1998

35
Example of Containment Testing

Suppose we have two queriesq1(S,C) -
Student(S, N), Takes(S, C), Course(C, X),
inCSE(C), Course(C, DB Info
Systems)q2(S,C) - Student(S, N), Takes(S, C),
Course(C, X)
Intuitively, q1 must contain the same or fewer
answers vs. q2
It has all of the same conditions, except one
extra conjunction (i.e., its more restricted)
Theres no union or any other way it can add more
data
We can say that q2 contains q1 because this holds
for any instance of our DB Student, Takes,
Course

36
Checking Containment via Canonical Databases

To test for q1 µ q2
Create a canonical DB that contains a tuple for
each subgoal in q1
Execute q2 over it
If q2 returns a tuple that matches the head of
q1, then q1 µ q2
(This is an NP-complete algorithm in the size of
the query. Testing for full first-order logic
queries is undecidable!!!)
Lets see this for our example

37
Example Canonical DB

q1(S,C) - Student(S, N), Takes(S, C), Course(C,
X), inCSE(C), Course(C, DB Info Systems)
q2(S,C) - Student(S, N), Takes(S, C), Course(C,
X)

Student
Takes
Course
inCSE
C X
C DB Info Systems
S
S N
S C
Need to get tuple ltS,Cgt in executing q2 over this
database
38
Next Time

Well look at the state-of-the-art in query
reformulation, the MiniCon algorithm
Eliminates the need for the containment check
Eliminates many cross-product comparisons
This and the ChaseBackchase strategy of Tannen
et al are the two methods most used in virtual
data integration today
Please read the MiniCon paper (Pottinger Halevy)

Write a Comment

User Comments (0)