Title: Lecture 18: Data Integration
1. Lecture 18: Data Integration
Nov. 2, 2007, ChengXiang Zhai
Most slides are taken from AnHai Doan's presentation
2. The General Problem
- How can we access a set of heterogeneous, distributed, autonomous databases as if accessing a single database?
- Arises in numerous contexts
  - on the Web, at enterprises, military, scientific cooperation, bio-informatics domains, e-commerce, etc.
- Currently very hot
  - in both database research and industry
3. Current State of Affairs
- Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money.
- Long-standing challenge in the DB community
- AI/WWW communities are on board
- Annual workshops, vision papers, ...
- Companies
  - Informatica, many others, ...
4. A Brief Research History
- Many early ad-hoc solutions
- Converged into two approaches
  - data warehousing vs. virtual DI systems
- Semi-structured data, XML
- Wrappers, info extraction
- Other issues: query optimization, schema matching, ...
- Current directions
  - DI for specialized domains (e.g., bioinformatics)
  - on-the-fly DI, entity-centric DI
  - simplify integration tasks
- New types of data sharing systems
  - P2P systems, Semantic Web
5. Data Warehouse Architecture
[Diagram: user queries and OLAP / decision support / data cubes / data mining run against a relational database (the warehouse), which is populated by data extraction programs and data cleaning/scrubbing from multiple data sources.]
6. Virtual Integration Architecture
[Diagram: user queries are posed against the mediated schema; the mediator (reformulation engine, optimizer, data source catalog, execution engine) sends sub-queries through wrappers to the underlying data sources.]
Sources can be relational, hierarchical (IMS), structured files, web sites.
7. A Brief Research History
- Many early ad-hoc solutions
- Converged into two approaches
  - data warehousing vs. virtual DI systems
- Semi-structured data, XML
- Wrappers, info extraction
- Other issues: query optimization, schema matching, ...
- Current directions
  - DI for specialized domains (e.g., bioinformatics)
  - on-the-fly DI, entity-centric DI
  - simplify integration tasks
- New types of data sharing systems
  - P2P systems, Semantic Web
8. Semi-structured Data
- What should be the underlying data model for DI contexts?
  - the relational model is not an ideal choice
- Developed semi-structured data models
  - started with OEM (the object exchange model)
- Then XML came along
  - it is now the most well-known semi-structured data model
  - generating much research in the DB community
- To be covered later
9. A Brief Research History
- Many early ad-hoc solutions
- Converged into two approaches
  - data warehousing vs. virtual DI systems
- Semi-structured data, XML
- Wrappers, info extraction
- Other issues: query optimization, schema matching, ...
- Current directions
  - DI for specialized domains (e.g., bioinformatics)
  - on-the-fly DI, entity-centric DI
  - simplify integration tasks
- New types of data sharing systems
  - P2P systems, Semantic Web
10. Wrapper Programs
- Task: communicate with the data sources and do format translations.
- They are built w.r.t. a specific source.
- They can sit either at the source or at the mediator.
- Often hard to build (very little science).
- Can be intelligent: perform source-specific optimizations.
11. Example
A wrapper transforms the HTML fragment

    <b> Introduction to DB </b> <i> Phil Bernstein </i>
    <i> Eric Newcomer </i> Addison Wesley, 1999

into the XML record

    <book>
      <title> Introduction to DB </title>
      <author> Phil Bernstein </author>
      <author> Eric Newcomer </author>
      <publisher> Addison Wesley </publisher>
      <year> 1999 </year>
    </book>
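To make the transformation concrete, here is a minimal Python sketch of such a wrapper (illustrative only, not from the lecture; real wrappers need far more robust parsing than these regular expressions):

    import re

    # Toy wrapper for the listing above: pull the fields out of the
    # HTML fragment and emit the XML record.
    html = ("<b> Introduction to DB </b> <i> Phil Bernstein </i> "
            "<i> Eric Newcomer </i> Addison Wesley, 1999")

    title = re.search(r"<b>(.*?)</b>", html).group(1).strip()
    authors = [a.strip() for a in re.findall(r"<i>(.*?)</i>", html)]
    publisher, year = re.search(r"</i>\s*([^,<]+),\s*(\d{4})", html).groups()

    print("<book>")
    print(f"  <title> {title} </title>")
    for a in authors:
        print(f"  <author> {a} </author>")
    print(f"  <publisher> {publisher.strip()} </publisher>")
    print(f"  <year> {year} </year>")
    print("</book>")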
12. Wrapper Construction
- Huge amount of research in the past decade
- Two major approaches:
  - supervised learning: typically requires some hand-labeled data
  - unsupervised learning: data-intensive, completely automatic
- Different focuses:
  - pull out each record (i.e., segment the page into records)
  - pull out the fields in each record
  - remove junk portions (ads, etc.)
- Current solutions are still brittle
- Unclear whether standards such as XML and Web services will eliminate the problem
  - the need likely will still remain
13. Information Extraction
- If the source cannot be wrapped with a grammar or some easy-to-parse rules
  - must do information extraction
- A huge body of research in the AI community
14. A Brief Research History
- Many early ad-hoc solutions
- Converged into two approaches
  - data warehousing vs. virtual DI systems
- Semi-structured data, XML
- Wrappers, info extraction
- Other issues: query optimization, schema matching, ...
- Current directions
  - DI for specialized domains (e.g., bioinformatics)
  - on-the-fly DI, entity-centric DI
  - simplify integration tasks
- New types of data sharing systems
  - P2P systems, Semantic Web
15. Virtual Integration Architecture
[Diagram repeated from slide 6: user queries -> mediated schema -> mediator (reformulation engine, optimizer, data source catalog, execution engine) -> wrappers -> data sources.]
16. Data Source Catalog
- Contains all meta-information about the sources:
  - logical source contents (books, new cars)
  - source capabilities (can answer SQL queries)
  - source completeness (has all books)
  - physical properties of source and network
  - statistics about the data (like in an RDBMS)
  - source reliability
  - mirror sources
  - update frequency
17. Content Descriptions
- User queries refer to the mediated schema.
- Data is stored in the sources in a local schema.
- Content descriptions provide the semantic mappings between the different schemas.
- The data integration system uses the descriptions to translate user queries into queries on the sources.
18. Approaches to Specifying Source Descriptions
- Global-as-view: express the mediated-schema relations as a set of views over the data source relations
- Local-as-view: express the source relations as views over the mediated schema
- The two can be combined with no additional cost
19. Global-as-View
- Mediated schema:
  - Movie(title, dir, year, genre)
  - Schedule(cinema, title, time)
- Create View Movie AS
    select * from S1                          -- S1(title, dir, year, genre)
    union
    select * from S2                          -- S2(title, dir, year, genre)
    union
    select S3.title, S3.dir, S4.year, S4.genre
    from S3, S4                               -- S3(title, dir), S4(title, year, genre)
    where S3.title = S4.title
20. Global-as-View: Example 2
- Mediated schema:
  - Movie(title, dir, year, genre)
  - Schedule(cinema, title, time)
- Create View Movie AS
    select title, dir, year, NULL
    from S1                                   -- S1(title, dir, year)
    union
    select title, dir, NULL, genre
    from S2                                   -- S2(title, dir, genre)
21. Global-as-View: Example 3
- Mediated schema:
  - Movie(title, dir, year, genre)
  - Schedule(cinema, title, time)
- Source S4: S4(cinema, genre)
- Create View Movie AS
    select NULL, NULL, NULL, genre
    from S4
- Create View Schedule AS
    select cinema, NULL, NULL
    from S4
- But what if we want to find which cinemas are playing comedies?
22. Global-as-View: Summary
- Query reformulation boils down to view unfolding (see the sketch below).
- Very easy conceptually.
- Can build hierarchies of mediated schemas.
- You sometimes lose information. Not always natural.
- Adding sources is hard: need to consider all other sources that are available.
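A minimal Python sketch of view unfolding (hypothetical helper, not from the lecture), using the GAV definition of Movie from slide 19:

    # Movie is defined as a union of source queries, so a user query over
    # Movie is rewritten by substituting that union for the Movie relation.
    GAV_MOVIE = [
        "SELECT title, dir, year, genre FROM S1",
        "SELECT title, dir, year, genre FROM S2",
        "SELECT S3.title, S3.dir, S4.year, S4.genre FROM S3, S4 "
        "WHERE S3.title = S4.title",
    ]

    def unfold(select_list, where):
        """Unfold SELECT <select_list> FROM Movie WHERE <where>."""
        branches = [f"SELECT {select_list} FROM ({src}) AS Movie WHERE {where}"
                    for src in GAV_MOVIE]
        return "\nUNION\n".join(branches)

    print(unfold("title", "genre = 'Comedy'"))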
23. Local-as-View: Example 1
- Mediated schema:
  - Movie(title, dir, year, genre)
  - Schedule(cinema, title, time)
- Create Source S1 AS
    select * from Movie
- Create Source S3 AS                         -- S3(title, dir)
    select title, dir from Movie
- Create Source S5 AS
    select title, dir, year
    from Movie
    where year > 1960 AND genre = 'Comedy'
24. Local-as-View: Example 2
- Mediated schema:
  - Movie(title, dir, year, genre)
  - Schedule(cinema, title, time)
- Source S4: S4(cinema, genre)
- Create Source S4 AS
    select cinema, genre
    from Movie m, Schedule s
    where m.title = s.title
- Now if we want to find which cinemas are playing comedies, there is hope: the definition tells the system that "select cinema from S4 where genre = 'Comedy'" answers the question.
25. Local-as-View: Summary
- Very flexible: you have the power of the entire query language to define the contents of the source.
- Hence, can easily distinguish between contents of closely related sources.
- Adding sources is easy: they're independent of each other.
- Query reformulation: answering queries using views.
26. Answering Queries Using Views
- Given a set of views V1, ..., Vn and a query Q, can we answer Q using only the answers to V1, ..., Vn?
- Many, many papers on this problem.
- The best performing algorithm: the MiniCon algorithm (Pottinger & Levy, 2000).
- Great survey on the topic: (Halevy, 2001).
27. The General Reformulation Problem
- Given
  - a query Q posed over the mediated schema
  - descriptions of the data sources
- Find
  - a query Q' over the data source relations, such that
    - Q' provides only correct answers to Q, and
    - Q' provides all possible answers to Q given the sources.
28. Query Reformulation
- Generate good logical query plans quickly
- Recent works (a simplified flavor of the bucket idea is sketched below):
  - bucket algorithm [Levy et al.]
  - inverse-rules algorithm [Duschka et al.]
  - MiniCon algorithm [Pottinger et al.]
  - iDRIPS/Streamer [Doan et al.]
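A drastically simplified, bucket-flavored sketch in Python (assumed structure for illustration only; the real algorithms also handle joins, variable mappings, and containment checks):

    # Each LAV view exports some attributes of a mediated relation. For
    # each relation a query needs, collect the views (a "bucket") that
    # cover the attributes the query asks for.
    VIEWS = {
        "S1": {"Movie": {"title", "dir", "year", "genre"}},
        "S3": {"Movie": {"title", "dir"}},
        "S4": {"Movie": {"title", "year", "genre"}},
    }

    def candidate_views(relation, needed_attrs):
        """Views over `relation` exporting every attribute the query needs."""
        return sorted(v for v, defs in VIEWS.items()
                      if relation in defs and needed_attrs <= defs[relation])

    # Query asking for titles and directors: S1 and S3 both qualify.
    print(candidate_views("Movie", {"title", "dir"}))  # ['S1', 'S3']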
29. Query Optimization
- Very related to query reformulation!
- Goal of the optimizer: find a physical plan with minimal cost.
- Key components in optimization:
  - search space of plans
  - search strategy
  - cost model
30. Can We Get RDBMS-Level Optimization for Data Integration, without Statistics?
- Multiple remote sources
- Described and mapped loosely
- Data changes frequently
- Generally, would like to support the same kinds of queries as in a local setting
[Diagram: queries and results flow through a data integration system (mediated schema, source catalog, schema mappings) sitting over remote, autonomous data sources.]
31. Adaptive Query Processing
- Many domains where cost-based query optimization fails:
  - complex queries in traditional databases: estimation error grows exponentially with the number of joins [IC91]
  - querying over the Internet: unpredictable access rates, delays
  - querying external data sources: limited information available about the properties of the source
- Monitor real-world conditions, adapt processing strategy in response
32. What Are the Sources of Inefficiency?
- Delays: we stall waiting for I/O
- Bad estimation of intermediate result sizes
- No info about source cardinalities
- Major challenges:
  - trading off information acquisition (exploration) vs. use (exploitation)
  - extrapolating performance based on what you've seen so far
33. Types of Adaptive Query Processing
- Adaptive scheduling (query scrambling [UF98], dynamic pipeline scheduling [UF01], XJoin [UF00], PH join [RS86, I99], ripple join [HH99])
  - changes CPU scheduling to improve feedback or reduce delays
- Redundant computation (competitive execution [AZ96])
  - compare two ways of executing the query
- Plan partitioning [S76, KD98, I99, M04]
  - break the plan into stages; re-optimize future stages as necessary
- Adaptive info passing [IT05 sub.]
  - pass data between parts of an executing plan
- Adaptive data partitioning
  - break the data into subsets; use a different plan for each subset
  - the only way to reduce overall computation with fine granularity
  - the first (and only) implementation has been eddies [AH00, R03, DH04]
34. Adaptive QP in Summary
- A variety of different techniques, focusing on
  - scheduling
  - comparison & competition
  - information passing
  - data & plan partitioning
- A field that is still fairly open; missing:
  - effective exploration methods
  - a true theory!
    - What's possible? What kinds of queries make sense to adapt?
    - Guarantees of optimality and convergence (perhaps under certain assumptions)
35. A Brief Research History
- Many early ad-hoc solutions
- Converged into two approaches
  - data warehousing vs. virtual DI systems
- Semi-structured data, XML
- Wrappers, info extraction
- Other issues: query optimization, schema matching, ...
- Current directions
  - DI for specialized domains (e.g., bioinformatics)
  - on-the-fly DI, entity-centric DI
  - simplify integration tasks
- New types of data sharing systems
  - P2P systems, Semantic Web
36. Semantic Matches between Schemas
Mediated schema: price, agent-name, address
Source schema (homes.com): listed-price, contact-name, city, state
- 1-1 matches: price = listed-price; agent-name = contact-name
- complex match: address = concat(city, state)
Sample homes.com data:

    listed-price | contact-name | city    | state
    320K         | Jane Brown   | Seattle | WA
    240K         | Mike Smith   | Miami   | FL
37. Schema Matching is Ubiquitous!
- Fundamental problem in numerous applications
- Databases
  - data integration
  - data translation
  - schema/view integration
  - data warehousing
  - semantic query processing
  - model management
  - peer data management
- AI
  - knowledge bases, ontology merging, information gathering agents, ...
- Web
  - e-commerce
  - marking up data using ontologies (e.g., on the Semantic Web)
38. Why Schema Matching is Difficult
- Schema & data never fully capture semantics!
  - not adequately documented
  - the schema creator has retired to Florida!
- Must rely on clues in the schema & data
  - using names, structures, types, data values, etc.
- Such clues can be unreliable
  - same name, different entities: area => location or square-feet
  - different names, same entity: area & address => location
- Intended semantics can be subjective
  - house-style = house-description?
  - military applications require committees to decide!
- Cannot be fully automated, needs user feedback!
39. Current State of Affairs
- Finding semantic mappings is now a key bottleneck!
  - largely done by hand
  - labor intensive & error prone
  - data integration at GTE [Li & Clifton, 2000]: 40 databases, 27,000 elements, estimated time 12 years
- Will only be exacerbated:
  - data sharing becomes pervasive
  - translation of legacy data
- Need semi-automatic approaches to scale up!
- Many research projects in the past few years
  - Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, U Wisconsin, NCSU, UIUC, Washington, ...
  - AI: Stanford, Karlsruhe University, NEC Japan, ...
40. LSD [Doan et al., '01]
- Learning Source Descriptions
- Developed at the Univ. of Washington, 2000-2001
- Designed for data integration settings
  - has since been adapted to several other contexts
- Desirable characteristics:
  - learns from previous matching activities
  - exploits multiple types of information in schema and data
  - incorporates domain integrity constraints
  - handles user feedback
  - achieves high matching accuracy (66% - 97%) on real-world data
41. Schema Matching for Data Integration: the LSD Approach
- Suppose a user wants to integrate 100 data sources
- 1. The user
  - manually creates matches for a few sources, say 3
  - shows LSD these matches
- 2. LSD learns from the matches
- 3. LSD predicts matches for the remaining 97 sources
42. Learning from the Manual Matches
Mediated schema: price, agent-name, agent-phone, office-phone, description
Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments

    listed-price | contact-name | contact-phone  | office         | comments
    250K         | James Smith  | (305) 729 0831 | (305) 616 1822 | Fantastic house
    320K         | Mike Doan    | (617) 253 1429 | (617) 112 2315 | Great location

Hypotheses learned from the manual matches:
- If "office" occurs in the element name => office-phone
- If "fantastic" & "great" occur frequently in data instances => description

homes.com: sold-at, contact-agent, extra-info

    sold-at | contact-agent  | extra-info
    350K    | (206) 634 9435 | Beautiful yard
    230K    | (617) 335 4243 | Close to Seattle
43. Must Exploit Multiple Types of Information!
[Figure repeated from slide 42: the "office" rule uses element names while the "fantastic"/"great" rule uses data values; no single type of clue suffices.]
44. Multi-Strategy Learning
- Use a set of base learners
  - each exploits certain types of information well
- To match a schema element of a new source
  - apply the base learners
  - combine their predictions using a meta-learner
- Meta-learner
  - uses training sources to measure base-learner accuracy
  - weighs each learner based on its accuracy
45. Base Learners
- Name Learner
  - training: (location, address), (contact name, name)
  - matching: agent-name => (name, 0.7), (phone, 0.3)
- Naive Bayes Learner
  - training: ("Seattle, WA", address), ("250K", price)
  - matching: "Kent, WA" => (address, 0.8), (name, 0.2)
- Labels are weighted by confidence scores (a toy Naive Bayes base learner is sketched below)
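The sketch below is my own minimal Python implementation, not LSD's; the training pairs mirror the slide, while the tokenization and add-one smoothing are assumptions:

    from collections import Counter, defaultdict

    # Learn word frequencies per mediated-schema label from the manually
    # matched sources, then score a value from a new source.
    train = [("Seattle, WA", "address"), ("Miami, FL", "address"),
             ("250K", "price"), ("320K", "price")]

    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in train:
        label_counts[label] += 1
        word_counts[label].update(text.replace(",", "").split())

    vocab = {w for counts in word_counts.values() for w in counts}

    def score(text, label):
        # P(label) * prod_w P(w | label), with add-one smoothing
        p = label_counts[label] / sum(label_counts.values())
        total = sum(word_counts[label].values())
        for w in text.replace(",", "").split():
            p *= (word_counts[label][w] + 1) / (total + len(vocab))
        return p

    print(max(label_counts, key=lambda lbl: score("Kent, WA", lbl)))  # address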
46. The LSD Architecture
[Diagram. Training phase: the mediated schema and the manually matched source schemas yield training data for Base-Learner1 ... Base-Learnerk; training produces a hypothesis per base learner and, via the Meta-Learner, weights for the base learners. Matching phase: the base learners emit predictions for instances; the Meta-Learner and Prediction Combiner turn these into predictions for schema elements; the Constraint Handler applies domain constraints and outputs the mappings.]
47. Training the Base Learners
Mediated schema: address, price, agent-name, agent-phone, office-phone, description
realestate.com: location, price, contact-name, contact-phone, office, comments

    location   | price | contact-name | contact-phone  | office         | comments
    Miami, FL  | 250K  | James Smith  | (305) 729 0831 | (305) 616 1822 | Fantastic house
    Boston, MA | 320K  | Mike Doan    | (617) 253 1429 | (617) 112 2315 | Great location
48. Meta-Learner: Stacking [Wolpert '92, Ting & Witten '99]
- Training
  - uses training data to learn weights, one for each (base learner, mediated-schema element) pair
  - e.g., weight(Name-Learner, address) = 0.2; weight(Naive-Bayes, address) = 0.8
- Matching: combine predictions of base learners
  - computes a weighted average of base-learner confidence scores (see the sketch below)
- Example: for the column "area" (instances "Seattle, WA", "Kent, WA", "Bend, OR"):
  - Name Learner: (address, 0.4); Naive Bayes: (address, 0.9)
  - Meta-Learner: (address, 0.4 × 0.2 + 0.9 × 0.8 = 0.8)
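A minimal Python sketch of this combination step (numbers mirror the example above; the dictionary layout is my own):

    # The meta-learner scales each base learner's confidence by the
    # learned per-(learner, element) weight and sums the results.
    weights = {("name_learner", "address"): 0.2,
               ("naive_bayes", "address"): 0.8}
    base_scores = {"name_learner": {"address": 0.4},
                   "naive_bayes": {"address": 0.9}}

    def combine(label):
        return sum(weights[(learner, label)] * scores[label]
                   for learner, scores in base_scores.items())

    print(round(combine("address"), 2))  # 0.4*0.2 + 0.9*0.8 = 0.8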
49. The LSD Architecture
[Diagram repeated from slide 46: training phase and matching phase.]
50. Applying the Learners
- homes.com schema: area, sold-at, contact-agent, extra-info
- For the column "area" (instances "Seattle, WA", "Kent, WA", "Bend, OR"):
  - the base learners score each instance and the Meta-Learner combines them, giving per-instance predictions (address, 0.8), (description, 0.2); (address, 0.6), (description, 0.4); (address, 0.7), (description, 0.3)
  - the Prediction Combiner folds these into one prediction per element: area => (address, 0.7), (description, 0.3)
- Combined predictions for the remaining elements (see the sketch after this slide):
  - sold-at => (price, 0.9), (agent-phone, 0.1)
  - contact-agent => (agent-phone, 0.9), (description, 0.1)
  - extra-info => (address, 0.6), (description, 0.4)
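The slide does not spell out the Prediction Combiner's aggregation function, but simple averaging reproduces its numbers; a Python sketch under that assumption:

    # Fold the meta-learner's per-instance scores into one prediction
    # for the whole schema element.
    instance_preds = [  # meta-learner output for the three values in `area`
        {"address": 0.8, "description": 0.2},
        {"address": 0.6, "description": 0.4},
        {"address": 0.7, "description": 0.3},
    ]

    element_pred = {
        label: round(sum(p[label] for p in instance_preds) / len(instance_preds), 2)
        for label in instance_preds[0]
    }
    print(element_pred)  # {'address': 0.7, 'description': 0.3}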
51. Domain Constraints
- Encode user knowledge about the domain
- Specified only once, by examining the mediated schema
- Examples:
  - at most one source-schema element can match address
  - if a source-schema element matches house-id, then it is a key
  - avg-value(price) > avg-value(num-baths)
- Given a mapping combination, can verify whether it satisfies a given constraint
  - e.g., the combination area = address, sold-at = price, contact-agent = agent-phone, extra-info = address
52. The Constraint Handler
- Predictions from the Prediction Combiner:
  - area => (address, 0.7), (description, 0.3)
  - sold-at => (price, 0.9), (agent-phone, 0.1)
  - contact-agent => (agent-phone, 0.9), (description, 0.1)
  - extra-info => (address, 0.6), (description, 0.4)
- Domain constraint: at most one element matches address
- Scoring mapping combinations (product of confidences):
  - area = description, sold-at = agent-phone, contact-agent = description, extra-info = description: 0.3 × 0.1 × 0.1 × 0.4 = 0.0012
  - area = address, sold-at = price, contact-agent = agent-phone, extra-info = description: 0.7 × 0.9 × 0.9 × 0.4 = 0.2268
  - area = address, sold-at = price, contact-agent = agent-phone, extra-info = address: 0.7 × 0.9 × 0.9 × 0.6 = 0.3402, but this violates the constraint, so the 0.2268 combination is chosen
- Searches the space of mapping combinations efficiently (a brute-force sketch follows below)
- Can handle arbitrary constraints
- Also used to incorporate user feedback
  - e.g., "sold-at does not match price"
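A brute-force Python sketch of this search (LSD explores the space more efficiently; this is only meant to reproduce the numbers above):

    from itertools import product

    # Score each mapping combination by the product of its confidences
    # and drop combinations that violate a domain constraint.
    preds = {
        "area":          [("address", 0.7), ("description", 0.3)],
        "sold-at":       [("price", 0.9), ("agent-phone", 0.1)],
        "contact-agent": [("agent-phone", 0.9), ("description", 0.1)],
        "extra-info":    [("address", 0.6), ("description", 0.4)],
    }

    def satisfies_constraints(combo):
        # Domain constraint: at most one element matches address.
        return sum(1 for label, _ in combo if label == "address") <= 1

    best, best_score = None, 0.0
    for combo in product(*preds.values()):
        if not satisfies_constraints(combo):
            continue
        score = 1.0
        for _, conf in combo:
            score *= conf
        if score > best_score:
            best, best_score = combo, score

    print({e: lbl for e, (lbl, _) in zip(preds, best)}, round(best_score, 4))
    # area=address, sold-at=price, contact-agent=agent-phone,
    # extra-info=description; score 0.2268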
53. The Current LSD System
- Can also handle data in XML format
  - matches XML DTDs
- Base learners:
  - Naive Bayes [Duda & Hart-93, Domingos & Pazzani-97]
    - exploits frequencies of words & symbols
  - WHIRL nearest-neighbor classifier [Cohen & Hirsh, KDD-98]
    - employs an information-retrieval similarity metric
  - Name Learner [SIGMOD-01]
    - matches elements based on their names
  - County-Name Recognizer [SIGMOD-01]
    - stores all U.S. county names
  - XML Learner [SIGMOD-01]
    - exploits the hierarchical structure of XML data
54. Empirical Evaluation
- Four domains:
  - Real Estate I & II, Course Offerings, Faculty Listings
- For each domain:
  - created a mediated schema & domain constraints
  - chose five sources
  - extracted & converted data into XML
  - mediated schemas: 14-66 elements; source schemas: 13-48 elements
- Ten runs for each domain; in each run:
  - manually provided 1-1 matches for 3 sources
  - asked LSD to propose matches for the remaining 2 sources
  - accuracy = % of 1-1 matches correctly identified
- LSD achieves 66% - 97% accuracy on real-world data
55. A Brief Research History
- Many early ad-hoc solutions
- Converged into two approaches
  - data warehousing vs. virtual DI systems
- Semi-structured data, XML
- Wrappers, info extraction
- Other issues: query optimization, schema matching, ...
- Current directions
  - DI for specialized domains (e.g., bioinformatics)
  - on-the-fly DI, simplified queries (e.g., entity-centric)
  - simplify integration tasks (e.g., inexact answers)
- New types of data sharing systems
  - P2P systems, Semantic Web
56. What You Should Know
- Data integration is a very active research area in DB
- Strong industry needs
- Many interesting research questions
- Touches many other fields: artificial intelligence (ontologies, machine learning, ...), information retrieval, ...
- Web data integration is a very important special case