Title: Lecture 18: Data Integration
1. Lecture 18: Data Integration
Nov. 2, 2007, ChengXiang Zhai
Most slides are taken from AnHai Doan's presentation
2. The General Problem
- How can we access a set of heterogeneous, distributed, autonomous databases as if accessing a single database?
- Arises in numerous contexts
  - on the Web, at enterprises, military, scientific cooperation, bio-informatics domains, e-commerce, etc.
- Currently very hot
  - in both database research and industry
3. Current State of Affairs
- Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money.
- Long-standing challenge in the DB community
- AI/WWW communities are on board
- Annual workshops, vision papers, ...
- Companies
  - Informatica, many others, ...
4. A Brief Research History
- Many early ad-hoc solutions
- Converged into two approaches
  - data warehousing vs. virtual DI systems
- Semi-structured data, XML
- Wrappers, info extraction
- Other issues: query optimization, schema matching, ...
- Current directions
  - DI for specialized domains (e.g., bioinformatics)
  - on-the-fly DI, entity-centric DI
  - simplify integration tasks
- New types of data sharing systems
  - P2P systems, Semantic Web
5. Data Warehouse Architecture
[Diagram: user queries and OLAP / decision support / data cubes / data mining run against a relational database (the warehouse), which is populated by data extraction programs and data cleaning/scrubbing from multiple data sources.]
6. Virtual Integration Architecture
[Diagram: user queries are posed against the mediated schema; the mediator (reformulation engine, optimizer, data source catalog, execution engine) sends sub-queries through wrappers to the underlying data sources.]
Sources can be relational, hierarchical (IMS), structured files, web sites.
7. A Brief Research History
- Many early ad-hoc solutions
- Converged into two approaches
  - data warehousing vs. virtual DI systems
- Semi-structured data, XML
- Wrappers, info extraction
- Other issues: query optimization, schema matching, ...
- Current directions
  - DI for specialized domains (e.g., bioinformatics)
  - on-the-fly DI, entity-centric DI
  - simplify integration tasks
- New types of data sharing systems
  - P2P systems, Semantic Web
8. Semi-structured Data
- What should be the underlying data model for DI contexts?
  - the relational model is not an ideal choice
- Developed semi-structured data models
  - started with OEM (the object exchange model)
- Then XML came along
  - it is now the most well-known semi-structured data model
  - generating much research in the DB community
- To be covered later
9. A Brief Research History
- Many early ad-hoc solutions
- Converged into two approaches
  - data warehousing vs. virtual DI systems
- Semi-structured data, XML
- Wrappers, info extraction
- Other issues: query optimization, schema matching, ...
- Current directions
  - DI for specialized domains (e.g., bioinformatics)
  - on-the-fly DI, entity-centric DI
  - simplify integration tasks
- New types of data sharing systems
  - P2P systems, Semantic Web
10. Wrapper Programs
- Task: communicate with the data sources and do format translations.
- They are built w.r.t. a specific source.
- They can sit either at the source or at the mediator.
- Often hard to build (very little science).
- Can be intelligent: perform source-specific optimizations.
11. Example
A wrapper transforms the HTML fragment

    <b> Introduction to DB </b> <i> Phil Bernstein </i>
    <i> Eric Newcomer </i> Addison Wesley, 1999

into the XML record

    <book>
      <title> Introduction to DB </title>
      <author> Phil Bernstein </author>
      <author> Eric Newcomer </author>
      <publisher> Addison Wesley </publisher>
      <year> 1999 </year>
    </book>
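To make the transformation concrete, here is a minimal Python sketch of such a wrapper (illustrative only, not from the lecture; real wrappers need far more robust parsing than these regular expressions):

    import re

    # Toy wrapper for the listing above: pull the fields out of the
    # HTML fragment and emit the XML record.
    html = ("<b> Introduction to DB </b> <i> Phil Bernstein </i> "
            "<i> Eric Newcomer </i> Addison Wesley, 1999")

    title = re.search(r"<b>(.*?)</b>", html).group(1).strip()
    authors = [a.strip() for a in re.findall(r"<i>(.*?)</i>", html)]
    publisher, year = re.search(r"</i>\s*([^,<]+),\s*(\d{4})", html).groups()

    print("<book>")
    print(f"  <title> {title} </title>")
    for a in authors:
        print(f"  <author> {a} </author>")
    print(f"  <publisher> {publisher.strip()} </publisher>")
    print(f"  <year> {year} </year>")
    print("</book>")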
12. Wrapper Construction
- Huge amount of research in the past decade
- Two major approaches:
  - supervised learning: typically requires some hand-labeled data
  - unsupervised learning: data-intensive, completely automatic
- Different focuses:
  - pull out each record (i.e., segment the page into records)
  - pull out the fields in each record
  - remove junk portions (ads, etc.)
- Current solutions are still brittle
- Unclear whether standards such as XML and Web services will eliminate the problem
  - the need likely will still remain
13. Information Extraction
- If the source cannot be wrapped with a grammar or some easy-to-parse rules
  - must do information extraction
- A huge body of research in the AI community
14. A Brief Research History
- Many early ad-hoc solutions
- Converged into two approaches
  - data warehousing vs. virtual DI systems
- Semi-structured data, XML
- Wrappers, info extraction
- Other issues: query optimization, schema matching, ...
- Current directions
  - DI for specialized domains (e.g., bioinformatics)
  - on-the-fly DI, entity-centric DI
  - simplify integration tasks
- New types of data sharing systems
  - P2P systems, Semantic Web
15. Virtual Integration Architecture
[Diagram repeated from slide 6: user queries -> mediated schema -> mediator (reformulation engine, optimizer, data source catalog, execution engine) -> wrappers -> data sources.]
16. Data Source Catalog
- Contains all meta-information about the sources:
  - logical source contents (books, new cars)
  - source capabilities (can answer SQL queries)
  - source completeness (has all books)
  - physical properties of source and network
  - statistics about the data (like in an RDBMS)
  - source reliability
  - mirror sources
  - update frequency
17. Content Descriptions
- User queries refer to the mediated schema.
- Data is stored in the sources in a local schema.
- Content descriptions provide the semantic mappings between the different schemas.
- The data integration system uses the descriptions to translate user queries into queries on the sources.
18. Approaches to Specifying Source Descriptions
- Global-as-view: express the mediated-schema relations as a set of views over the data source relations
- Local-as-view: express the source relations as views over the mediated schema
- The two can be combined with no additional cost
19. Global-as-View
- Mediated schema:
  - Movie(title, dir, year, genre)
  - Schedule(cinema, title, time)
- Create View Movie AS
    select * from S1                          -- S1(title, dir, year, genre)
    union
    select * from S2                          -- S2(title, dir, year, genre)
    union
    select S3.title, S3.dir, S4.year, S4.genre
    from S3, S4                               -- S3(title, dir), S4(title, year, genre)
    where S3.title = S4.title
20. Global-as-View: Example 2
- Mediated schema:
  - Movie(title, dir, year, genre)
  - Schedule(cinema, title, time)
- Create View Movie AS
    select title, dir, year, NULL
    from S1                                   -- S1(title, dir, year)
    union
    select title, dir, NULL, genre
    from S2                                   -- S2(title, dir, genre)
21. Global-as-View: Example 3
- Mediated schema:
  - Movie(title, dir, year, genre)
  - Schedule(cinema, title, time)
- Source S4: S4(cinema, genre)
- Create View Movie AS
    select NULL, NULL, NULL, genre
    from S4
- Create View Schedule AS
    select cinema, NULL, NULL
    from S4
- But what if we want to find which cinemas are playing comedies?
22. Global-as-View: Summary
- Query reformulation boils down to view unfolding (see the sketch below).
- Very easy conceptually.
- Can build hierarchies of mediated schemas.
- You sometimes lose information. Not always natural.
- Adding sources is hard: need to consider all other sources that are available.
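A minimal Python sketch of view unfolding (hypothetical helper, not from the lecture), using the GAV definition of Movie from slide 19:

    # Movie is defined as a union of source queries, so a user query over
    # Movie is rewritten by substituting that union for the Movie relation.
    GAV_MOVIE = [
        "SELECT title, dir, year, genre FROM S1",
        "SELECT title, dir, year, genre FROM S2",
        "SELECT S3.title, S3.dir, S4.year, S4.genre FROM S3, S4 "
        "WHERE S3.title = S4.title",
    ]

    def unfold(select_list, where):
        """Unfold SELECT <select_list> FROM Movie WHERE <where>."""
        branches = [f"SELECT {select_list} FROM ({src}) AS Movie WHERE {where}"
                    for src in GAV_MOVIE]
        return "\nUNION\n".join(branches)

    print(unfold("title", "genre = 'Comedy'"))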
23. Local-as-View: Example 1
- Mediated schema:
  - Movie(title, dir, year, genre)
  - Schedule(cinema, title, time)
- Create Source S1 AS
    select * from Movie
- Create Source S3 AS                         -- S3(title, dir)
    select title, dir from Movie
- Create Source S5 AS
    select title, dir, year
    from Movie
    where year > 1960 AND genre = 'Comedy'
24. Local-as-View: Example 2
- Mediated schema:
  - Movie(title, dir, year, genre)
  - Schedule(cinema, title, time)
- Source S4: S4(cinema, genre)
- Create Source S4 AS
    select cinema, genre
    from Movie m, Schedule s
    where m.title = s.title
- Now if we want to find which cinemas are playing comedies, there is hope: the definition tells the system that "select cinema from S4 where genre = 'Comedy'" answers the question.
25. Local-as-View: Summary
- Very flexible: you have the power of the entire query language to define the contents of the source.
- Hence, can easily distinguish between contents of closely related sources.
- Adding sources is easy: they're independent of each other.
- Query reformulation: answering queries using views.
26. Answering Queries Using Views
- Given a set of views V1, ..., Vn and a query Q, can we answer Q using only the answers to V1, ..., Vn?
- Many, many papers on this problem.
- The best performing algorithm: the MiniCon algorithm (Pottinger & Levy, 2000).
- Great survey on the topic: (Halevy, 2001).
27. The General Reformulation Problem
- Given
  - a query Q posed over the mediated schema
  - descriptions of the data sources
- Find
  - a query Q' over the data source relations, such that
    - Q' provides only correct answers to Q, and
    - Q' provides all possible answers to Q given the sources.
28. Query Reformulation
- Generate good logical query plans quickly
- Recent works (a simplified flavor of the bucket idea is sketched below):
  - bucket algorithm [Levy et al.]
  - inverse-rules algorithm [Duschka et al.]
  - MiniCon algorithm [Pottinger et al.]
  - iDRIPS/Streamer [Doan et al.]
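A drastically simplified, bucket-flavored sketch in Python (assumed structure for illustration only; the real algorithms also handle joins, variable mappings, and containment checks):

    # Each LAV view exports some attributes of a mediated relation. For
    # each relation a query needs, collect the views (a "bucket") that
    # cover the attributes the query asks for.
    VIEWS = {
        "S1": {"Movie": {"title", "dir", "year", "genre"}},
        "S3": {"Movie": {"title", "dir"}},
        "S4": {"Movie": {"title", "year", "genre"}},
    }

    def candidate_views(relation, needed_attrs):
        """Views over `relation` exporting every attribute the query needs."""
        return sorted(v for v, defs in VIEWS.items()
                      if relation in defs and needed_attrs <= defs[relation])

    # Query asking for titles and directors: S1 and S3 both qualify.
    print(candidate_views("Movie", {"title", "dir"}))  # ['S1', 'S3']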
29. Query Optimization
- Very related to query reformulation!
- Goal of the optimizer: find a physical plan with minimal cost.
- Key components in optimization:
  - search space of plans
  - search strategy
  - cost model
30. Can We Get RDBMS-Level Optimization for Data Integration, without Statistics?
- Multiple remote sources
- Described and mapped loosely
- Data changes frequently
- Generally, would like to support the same kinds of queries as in a local setting
[Diagram: queries and results flow through a data integration system (mediated schema, source catalog, schema mappings) sitting over remote, autonomous data sources.]
31. Adaptive Query Processing
- Many domains where cost-based query optimization fails:
  - complex queries in traditional databases: estimation error grows exponentially with the number of joins [IC91]
  - querying over the Internet: unpredictable access rates, delays
  - querying external data sources: limited information available about the properties of the source
- Monitor real-world conditions, adapt processing strategy in response
32. What Are the Sources of Inefficiency?
- Delays: we stall waiting for I/O
- Bad estimation of intermediate result sizes
- No info about source cardinalities
- Major challenges:
  - trading off information acquisition (exploration) vs. use (exploitation)
  - extrapolating performance based on what you've seen so far
33. Types of Adaptive Query Processing
- Adaptive scheduling (query scrambling [UF98], dynamic pipeline scheduling [UF01], XJoin [UF00], PH join [RS86, I99], ripple join [HH99])
  - changes CPU scheduling to improve feedback or reduce delays
- Redundant computation (competitive execution [AZ96])
  - compare two ways of executing the query
- Plan partitioning [S76, KD98, I99, M04]
  - break the plan into stages; re-optimize future stages as necessary
- Adaptive info passing [IT05 sub.]
  - pass data between parts of an executing plan
- Adaptive data partitioning
  - break the data into subsets; use a different plan for each subset
  - the only way to reduce overall computation with fine granularity
  - the first (and only) implementation has been eddies [AH00, R03, DH04]
34. Adaptive QP in Summary
- A variety of different techniques, focusing on
  - scheduling
  - comparison & competition
  - information passing
  - data & plan partitioning
- A field that is still fairly open; missing:
  - effective exploration methods
  - a true theory!
    - What's possible? What kinds of queries make sense to adapt?
    - Guarantees of optimality and convergence (perhaps under certain assumptions)
35. A Brief Research History
- Many early ad-hoc solutions
- Converged into two approaches
  - data warehousing vs. virtual DI systems
- Semi-structured data, XML
- Wrappers, info extraction
- Other issues: query optimization, schema matching, ...
- Current directions
  - DI for specialized domains (e.g., bioinformatics)
  - on-the-fly DI, entity-centric DI
  - simplify integration tasks
- New types of data sharing systems
  - P2P systems, Semantic Web
36. Semantic Matches between Schemas
Mediated schema: price, agent-name, address
Source schema (homes.com): listed-price, contact-name, city, state
- 1-1 matches: price = listed-price; agent-name = contact-name
- complex match: address = concat(city, state)
Sample homes.com data:

    listed-price | contact-name | city    | state
    320K         | Jane Brown   | Seattle | WA
    240K         | Mike Smith   | Miami   | FL
37. Schema Matching is Ubiquitous!
- Fundamental problem in numerous applications
- Databases
  - data integration
  - data translation
  - schema/view integration
  - data warehousing
  - semantic query processing
  - model management
  - peer data management
- AI
  - knowledge bases, ontology merging, information gathering agents, ...
- Web
  - e-commerce
  - marking up data using ontologies (e.g., on the Semantic Web)
38. Why Schema Matching is Difficult
- Schema & data never fully capture semantics!
  - not adequately documented
  - the schema creator has retired to Florida!
- Must rely on clues in the schema & data
  - using names, structures, types, data values, etc.
- Such clues can be unreliable
  - same name, different entities: area => location or square-feet
  - different names, same entity: area & address => location
- Intended semantics can be subjective
  - house-style = house-description?
  - military applications require committees to decide!
- Cannot be fully automated, needs user feedback!
39. Current State of Affairs
- Finding semantic mappings is now a key bottleneck!
  - largely done by hand
  - labor intensive & error prone
  - data integration at GTE [Li & Clifton, 2000]: 40 databases, 27,000 elements, estimated time 12 years
- Will only be exacerbated:
  - data sharing becomes pervasive
  - translation of legacy data
- Need semi-automatic approaches to scale up!
- Many research projects in the past few years
  - Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, U Wisconsin, NCSU, UIUC, Washington, ...
  - AI: Stanford, Karlsruhe University, NEC Japan, ...
40. LSD [Doan et al., '01]
- Learning Source Descriptions
- Developed at the Univ. of Washington, 2000-2001
- Designed for data integration settings
  - has since been adapted to several other contexts
- Desirable characteristics:
  - learns from previous matching activities
  - exploits multiple types of information in schema and data
  - incorporates domain integrity constraints
  - handles user feedback
  - achieves high matching accuracy (66% - 97%) on real-world data
41. Schema Matching for Data Integration: the LSD Approach
- Suppose a user wants to integrate 100 data sources
- 1. The user
  - manually creates matches for a few sources, say 3
  - shows LSD these matches
- 2. LSD learns from the matches
- 3. LSD predicts matches for the remaining 97 sources
42. Learning from the Manual Matches
Mediated schema: price, agent-name, agent-phone, office-phone, description
Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments

    listed-price | contact-name | contact-phone  | office         | comments
    250K         | James Smith  | (305) 729 0831 | (305) 616 1822 | Fantastic house
    320K         | Mike Doan    | (617) 253 1429 | (617) 112 2315 | Great location

Hypotheses learned from the manual matches:
- If "office" occurs in the element name => office-phone
- If "fantastic" & "great" occur frequently in data instances => description

homes.com: sold-at, contact-agent, extra-info

    sold-at | contact-agent  | extra-info
    350K    | (206) 634 9435 | Beautiful yard
    230K    | (617) 335 4243 | Close to Seattle
43. Must Exploit Multiple Types of Information!
[Figure repeated from slide 42: the "office" rule uses element names while the "fantastic"/"great" rule uses data values; no single type of clue suffices.]
44. Multi-Strategy Learning
- Use a set of base learners
  - each exploits certain types of information well
- To match a schema element of a new source
  - apply the base learners
  - combine their predictions using a meta-learner
- Meta-learner
  - uses training sources to measure base-learner accuracy
  - weighs each learner based on its accuracy
45. Base Learners
- Name Learner
  - training: (location, address), (contact name, name)
  - matching: agent-name => (name, 0.7), (phone, 0.3)
- Naive Bayes Learner
  - training: ("Seattle, WA", address), ("250K", price)
  - matching: "Kent, WA" => (address, 0.8), (name, 0.2)
- Labels are weighted by confidence scores (a toy Naive Bayes base learner is sketched below)
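The sketch below is my own minimal Python implementation, not LSD's; the training pairs mirror the slide, while the tokenization and add-one smoothing are assumptions:

    from collections import Counter, defaultdict

    # Learn word frequencies per mediated-schema label from the manually
    # matched sources, then score a value from a new source.
    train = [("Seattle, WA", "address"), ("Miami, FL", "address"),
             ("250K", "price"), ("320K", "price")]

    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in train:
        label_counts[label] += 1
        word_counts[label].update(text.replace(",", "").split())

    vocab = {w for counts in word_counts.values() for w in counts}

    def score(text, label):
        # P(label) * prod_w P(w | label), with add-one smoothing
        p = label_counts[label] / sum(label_counts.values())
        total = sum(word_counts[label].values())
        for w in text.replace(",", "").split():
            p *= (word_counts[label][w] + 1) / (total + len(vocab))
        return p

    print(max(label_counts, key=lambda lbl: score("Kent, WA", lbl)))  # address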
46. The LSD Architecture
[Diagram. Training phase: the mediated schema and the manually matched source schemas yield training data for Base-Learner1 ... Base-Learnerk; training produces a hypothesis per base learner and, via the Meta-Learner, weights for the base learners. Matching phase: the base learners emit predictions for instances; the Meta-Learner and Prediction Combiner turn these into predictions for schema elements; the Constraint Handler applies domain constraints and outputs the mappings.]
47. Training the Base Learners
Mediated schema: address, price, agent-name, agent-phone, office-phone, description
realestate.com: location, price, contact-name, contact-phone, office, comments

    location   | price | contact-name | contact-phone  | office         | comments
    Miami, FL  | 250K  | James Smith  | (305) 729 0831 | (305) 616 1822 | Fantastic house
    Boston, MA | 320K  | Mike Doan    | (617) 253 1429 | (617) 112 2315 | Great location
48. Meta-Learner: Stacking [Wolpert '92, Ting & Witten '99]
- Training
  - uses training data to learn weights, one for each (base learner, mediated-schema element) pair
  - e.g., weight(Name-Learner, address) = 0.2; weight(Naive-Bayes, address) = 0.8
- Matching: combine predictions of base learners
  - computes a weighted average of base-learner confidence scores (see the sketch below)
- Example: for the column "area" (instances "Seattle, WA", "Kent, WA", "Bend, OR"):
  - Name Learner: (address, 0.4); Naive Bayes: (address, 0.9)
  - Meta-Learner: (address, 0.4 × 0.2 + 0.9 × 0.8 = 0.8)
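A minimal Python sketch of this combination step (numbers mirror the example above; the dictionary layout is my own):

    # The meta-learner scales each base learner's confidence by the
    # learned per-(learner, element) weight and sums the results.
    weights = {("name_learner", "address"): 0.2,
               ("naive_bayes", "address"): 0.8}
    base_scores = {"name_learner": {"address": 0.4},
                   "naive_bayes": {"address": 0.9}}

    def combine(label):
        return sum(weights[(learner, label)] * scores[label]
                   for learner, scores in base_scores.items())

    print(round(combine("address"), 2))  # 0.4*0.2 + 0.9*0.8 = 0.8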
49. The LSD Architecture
[Diagram repeated from slide 46: training phase and matching phase.]
50. Applying the Learners
- homes.com schema: area, sold-at, contact-agent, extra-info
- For the column "area" (instances "Seattle, WA", "Kent, WA", "Bend, OR"):
  - the base learners score each instance and the Meta-Learner combines them, giving per-instance predictions (address, 0.8), (description, 0.2); (address, 0.6), (description, 0.4); (address, 0.7), (description, 0.3)
  - the Prediction Combiner folds these into one prediction per element: area => (address, 0.7), (description, 0.3)
- Combined predictions for the remaining elements (see the sketch after this slide):
  - sold-at => (price, 0.9), (agent-phone, 0.1)
  - contact-agent => (agent-phone, 0.9), (description, 0.1)
  - extra-info => (address, 0.6), (description, 0.4)
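The slide does not spell out the Prediction Combiner's aggregation function, but simple averaging reproduces its numbers; a Python sketch under that assumption:

    # Fold the meta-learner's per-instance scores into one prediction
    # for the whole schema element.
    instance_preds = [  # meta-learner output for the three values in `area`
        {"address": 0.8, "description": 0.2},
        {"address": 0.6, "description": 0.4},
        {"address": 0.7, "description": 0.3},
    ]

    element_pred = {
        label: round(sum(p[label] for p in instance_preds) / len(instance_preds), 2)
        for label in instance_preds[0]
    }
    print(element_pred)  # {'address': 0.7, 'description': 0.3}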
51. Domain Constraints
- Encode user knowledge about the domain
- Specified only once, by examining the mediated schema
- Examples:
  - at most one source-schema element can match address
  - if a source-schema element matches house-id, then it is a key
  - avg-value(price) > avg-value(num-baths)
- Given a mapping combination, can verify whether it satisfies a given constraint
  - e.g., the combination area = address, sold-at = price, contact-agent = agent-phone, extra-info = address
52. The Constraint Handler
- Predictions from the Prediction Combiner:
  - area => (address, 0.7), (description, 0.3)
  - sold-at => (price, 0.9), (agent-phone, 0.1)
  - contact-agent => (agent-phone, 0.9), (description, 0.1)
  - extra-info => (address, 0.6), (description, 0.4)
- Domain constraint: at most one element matches address
- Scoring mapping combinations (product of confidences):
  - area = description, sold-at = agent-phone, contact-agent = description, extra-info = description: 0.3 × 0.1 × 0.1 × 0.4 = 0.0012
  - area = address, sold-at = price, contact-agent = agent-phone, extra-info = description: 0.7 × 0.9 × 0.9 × 0.4 = 0.2268
  - area = address, sold-at = price, contact-agent = agent-phone, extra-info = address: 0.7 × 0.9 × 0.9 × 0.6 = 0.3402, but this violates the constraint, so the 0.2268 combination is chosen
- Searches the space of mapping combinations efficiently (a brute-force sketch follows below)
- Can handle arbitrary constraints
- Also used to incorporate user feedback
  - e.g., "sold-at does not match price"
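A brute-force Python sketch of this search (LSD explores the space more efficiently; this is only meant to reproduce the numbers above):

    from itertools import product

    # Score each mapping combination by the product of its confidences
    # and drop combinations that violate a domain constraint.
    preds = {
        "area":          [("address", 0.7), ("description", 0.3)],
        "sold-at":       [("price", 0.9), ("agent-phone", 0.1)],
        "contact-agent": [("agent-phone", 0.9), ("description", 0.1)],
        "extra-info":    [("address", 0.6), ("description", 0.4)],
    }

    def satisfies_constraints(combo):
        # Domain constraint: at most one element matches address.
        return sum(1 for label, _ in combo if label == "address") <= 1

    best, best_score = None, 0.0
    for combo in product(*preds.values()):
        if not satisfies_constraints(combo):
            continue
        score = 1.0
        for _, conf in combo:
            score *= conf
        if score > best_score:
            best, best_score = combo, score

    print({e: lbl for e, (lbl, _) in zip(preds, best)}, round(best_score, 4))
    # area=address, sold-at=price, contact-agent=agent-phone,
    # extra-info=description; score 0.2268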
53. The Current LSD System
- Can also handle data in XML format
  - matches XML DTDs
- Base learners:
  - Naive Bayes [Duda & Hart-93, Domingos & Pazzani-97]
    - exploits frequencies of words & symbols
  - WHIRL nearest-neighbor classifier [Cohen & Hirsh, KDD-98]
    - employs an information-retrieval similarity metric
  - Name Learner [SIGMOD-01]
    - matches elements based on their names
  - County-Name Recognizer [SIGMOD-01]
    - stores all U.S. county names
  - XML Learner [SIGMOD-01]
    - exploits the hierarchical structure of XML data
54. Empirical Evaluation
- Four domains:
  - Real Estate I & II, Course Offerings, Faculty Listings
- For each domain:
  - created a mediated schema & domain constraints
  - chose five sources
  - extracted & converted data into XML
  - mediated schemas: 14-66 elements; source schemas: 13-48 elements
- Ten runs for each domain; in each run:
  - manually provided 1-1 matches for 3 sources
  - asked LSD to propose matches for the remaining 2 sources
  - accuracy = % of 1-1 matches correctly identified
- LSD achieves 66% - 97% accuracy on real-world data
55. A Brief Research History
- Many early ad-hoc solutions
- Converged into two approaches
  - data warehousing vs. virtual DI systems
- Semi-structured data, XML
- Wrappers, info extraction
- Other issues: query optimization, schema matching, ...
- Current directions
  - DI for specialized domains (e.g., bioinformatics)
  - on-the-fly DI, simplified queries (e.g., entity-centric)
  - simplify integration tasks (e.g., inexact answers)
- New types of data sharing systems
  - P2P systems, Semantic Web
56. What You Should Know
- Data integration is a very active research area in DB
- Strong industry needs
- Many interesting research questions
- Touches many other fields: artificial intelligence (ontologies, machine learning, ...), information retrieval, ...
- Web data integration is a very important special case