Lecture 18: Data Integration - PowerPoint PPT Presentation

CS511 Advanced Database Management Systems

Transcript and Presenter's Notes

1
Lecture 18: Data Integration
Nov. 2, 2007, ChengXiang Zhai
Most slides are taken from AnHai Doan's
presentation
2
The General Problem
  • How can we access a set of heterogeneous,
    distributed, autonomous databases as if accessing
    a single database?
  • Arises in numerous contexts
  • on the Web, at enterprises, military, scientific
    cooperation, bio-informatics domains, e-commerce,
    etc.
  • Currently very hot
  • in both database research and industry

3
Current State of Affairs
  • Mostly ad-hoc programming: create a special
    solution for every case; pay consultants a lot of
    money.
  • Long-standing challenge in the DB community
  • AI/WWW communities are on board
  • Annual workshops, vision papers, ...
  • Companies
  • Informatica, many others, ...

4
A Brief Research History
  • Many early ad-hoc solutions
  • Converged into two approaches
  • data warehousing vs. virtual DI systems
  • Semi-structured data, XML
  • Wrappers, info extraction
  • Other issues: query optimization, schema
    matching, ...
  • Current directions
  • DI for specialized domains (e.g., bioinformatics)
  • on-the-fly DI, entity-centric DI
  • simplify integration tasks
  • New types of data sharing systems
  • P2P systems, Semantic Web

5
Data Warehouse Architecture
User queries (OLAP / decision support / data cubes / data mining)
        |
Relational database (warehouse)
        |
Data cleaning / scrubbing
        |
Data extraction programs
        |
Data source    Data source    Data source
6
Virtual Integration Architecture
User queries
        |
Mediated schema
        |
Mediator: reformulation engine, optimizer,
          data source catalog, execution engine
        |
wrapper    wrapper    wrapper
        |
Data source    Data source    Data source

Sources can be relational, hierarchical (IMS),
structured files, web sites.
7
A Brief Research History
(outline repeated from slide 4)

8
Semi-structured Data
  • What should be the underlying data model for DI
    contexts?
  • relational model is not an ideal choice
  • Developed semi-structured data model
  • started with the OEM (object exchange model)
  • Then XML came along
  • It is now the most well-known semi-structured
    data model
  • Generating much research in the DB community
  • To be covered later

9
A Brief Research History
(outline repeated from slide 4)

10
Wrapper Programs
  • Task: communicate with the data sources and do
    format translations.
  • They are built w.r.t. a specific source.
  • They can sit either at the source or at the
    mediator.
  • Often hard to build (very little science).
  • Can be intelligent: perform source-specific
    optimizations.

11
Example
Transform

  <b> Introduction to DB </b> <i> Phil Bernstein </i>
  <i> Eric Newcomer </i> Addison Wesley, 1999

into

  <book>
    <title> Introduction to DB </title>
    <author> Phil Bernstein </author>
    <author> Eric Newcomer </author>
    <publisher> Addison Wesley </publisher>
    <year> 1999 </year>
  </book>
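The translation on this slide can be sketched in a few lines of Python. This is not from the lecture; the function name and regular expressions are illustrative, and a real wrapper would have to cope with far messier markup.

```python
import re

def wrap_book(html: str) -> str:
    # Toy wrapper: this source marks the title in <b> and authors in <i>;
    # publisher and year trail as plain text ("Addison Wesley, 1999").
    title = re.search(r"<b>\s*(.*?)\s*</b>", html).group(1)
    authors = re.findall(r"<i>\s*(.*?)\s*</i>", html)
    tail = re.sub(r"<(b|i)>.*?</\1>", "", html).strip()
    publisher, year = [s.strip() for s in tail.rsplit(",", 1)]
    parts = [f"<title>{title}</title>"]
    parts += [f"<author>{a}</author>" for a in authors]
    parts += [f"<publisher>{publisher}</publisher>", f"<year>{year}</year>"]
    return "<book>" + "".join(parts) + "</book>"

src = ("<b> Introduction to DB </b> <i> Phil Bernstein </i> "
      "<i> Eric Newcomer </i> Addison Wesley, 1999")
print(wrap_book(src))
```

Because the parsing rules are hard-coded to one source's conventions, a wrapper like this breaks as soon as the source changes its layout — which is exactly why wrapper construction gets its own research area on the next slides.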
12
Wrapper Construction
  • Huge amount of research in the past decade
  • Two major approaches
  • supervised learning: typically requires some
    hand-labeled data
  • unsupervised learning: data-intensive, completely
    automatic
  • Different focuses
  • pull out each record (i.e., segment page into
    records)
  • pull out fields in each record
  • remove junk portions (ads, etc.)
  • Current solutions are still brittle
  • Unclear whether standards such as XML Web
    services will eliminate the problem
  • the need likely will still remain

13
Information Extraction
  • If the source cannot be wrapped with a grammar or
    some easy-to-parse rules
  • must do information extraction
  • Huge research in the AI community

14
A Brief Research History
(outline repeated from slide 4)

15
Virtual Integration Architecture
(diagram repeated from slide 6)
16
Data Source Catalog
  • Contains all meta-information about the sources
  • Logical source contents (books, new cars).
  • Source capabilities (can answer SQL queries)
  • Source completeness (has all books).
  • Physical properties of source and network.
  • Statistics about the data (like in an RDBMS)
  • Source reliability
  • Mirror sources
  • Update frequency.

17
Content Descriptions
  • User queries refer to the mediated schema.
  • Data is stored in the sources in a local schema.
  • Content descriptions provide the semantic
    mappings between the different schemas.
  • Data integration system uses the descriptions to
    translate user queries into queries on the
    sources.

18
Approaches to Specifying Source Descriptions
  • Global-as-view: express the mediated schema
    relations as a set of views over the data source
    relations.
  • Local-as-view: express the source relations as
    views over the mediated schema.
  • Can be combined with no additional cost.

19
Global-as-View
  • Mediated schema
  • Movie(title, dir, year, genre),
  • Schedule(cinema, title, time).
  • Create View Movie AS
      select * from S1                -- S1(title, dir, year, genre)
      union
      select * from S2                -- S2(title, dir, year, genre)
      union
      select S3.title, S3.dir, S4.year, S4.genre
      from S3, S4                     -- S3(title, dir), S4(title, year, genre)
      where S3.title = S4.title

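The unfolding semantics of a GAV view can be sketched in Python; this is not from the slides, and the sample rows are invented for illustration.

```python
# GAV sketch: the mediated Movie relation is computed from the sources.
S1 = [("Star Wars", "Lucas", 1977, "SciFi")]   # S1(title, dir, year, genre)
S2 = [("Alien", "Scott", 1979, "SciFi")]       # S2(title, dir, year, genre)
S3 = [("Vertigo", "Hitchcock")]                # S3(title, dir)
S4 = [("Vertigo", 1958, "Thriller")]           # S4(title, year, genre)

def movie():
    # select * from S1  union  select * from S2
    rows = list(S1) + list(S2)
    # union select S3.title, S3.dir, S4.year, S4.genre
    # from S3, S4 where S3.title = S4.title
    rows += [(t3, d, y, g) for (t3, d) in S3
             for (t4, y, g) in S4 if t3 == t4]
    return rows

# Query reformulation under GAV is just view unfolding: a query posed
# over Movie runs directly over this computed relation.
print([t for (t, d, y, g) in movie() if g == "SciFi"])
```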
20
Global-as-View Example 2
  • Mediated schema
  • Movie(title, dir, year, genre),
  • Schedule(cinema, title, time).
  • Create View Movie AS
      select title, dir, year, NULL   -- S1(title, dir, year)
      from S1
      union
      select title, dir, NULL, genre  -- S2(title, dir, genre)
      from S2

21
Global-as-View Example 3
  • Mediated schema
  • Movie(title, dir, year, genre),
  • Schedule(cinema, title, time).
  • Source S4: S4(cinema, genre)
  • Create View Movie AS
      select NULL, NULL, NULL, genre
      from S4
  • Create View Schedule AS
      select cinema, NULL, NULL
      from S4
  • But what if we want to find which cinemas are
    playing comedies?

22
Global-as-View Summary
  • Query reformulation boils down to view unfolding.
  • Very easy conceptually.
  • Can build hierarchies of mediated schemas.
  • You sometimes lose information. Not always
    natural.
  • Adding sources is hard. Need to consider all
    other sources that are available.

23
Local-as-View Example 1
  • Mediated schema
  • Movie(title, dir, year, genre),
  • Schedule(cinema, title, time).
  • Create Source S1 AS
      select * from Movie
  • Create Source S3 AS               -- S3(title, dir)
      select title, dir from Movie
  • Create Source S5 AS
      select title, dir, year
      from Movie
      where year > 1960 AND genre = 'Comedy'

24
Local-as-View Example 2
  • Mediated schema
  • Movie(title, dir, year, genre),
  • Schedule(cinema, title, time).
  • Source S4: S4(cinema, genre)
  • Create Source S4 AS
      select cinema, genre
      from Movie m, Schedule s
      where m.title = s.title
  • Now if we want to find which cinemas are playing
    comedies, there is hope!

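Why there is hope: once S4 is described as a view over the mediated schema, a reformulator can see that "which cinemas play comedies" is answerable from S4 alone. A minimal Python sketch (not from the slides; the sample rows are invented):

```python
# S4(cinema, genre), defined in LAV style as:
#   select cinema, genre from Movie m, Schedule s where m.title = s.title
S4 = [("Odeon", "Comedy"), ("Rex", "Drama"), ("Alibi", "Comedy")]

def cinemas_playing(genre):
    # Reformulated query: select distinct cinema from S4 where genre = <genre>
    return sorted({cinema for cinema, g in S4 if g == genre})

print(cinemas_playing("Comedy"))
```

Contrast with the GAV version of S4 on slide 21, where the join between Movie and Schedule was lost in the view definition and the same query could not be answered.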
25
Local-as-View Summary
  • Very flexible. You have the power of the entire
    query language to define the contents of the
    source.
  • Hence, can easily distinguish between contents of
    closely related sources.
  • Adding sources is easy: they're independent of
    each other.
  • Query reformulation = answering queries using views.

26
Answering Queries Using Views
  • Given a set of views V1, ..., Vn, and a query Q, can
    we answer Q using only the answers to V1, ..., Vn?
  • Many, many papers on this problem.
  • The best performing algorithm: the MiniCon
    Algorithm (Pottinger & Levy, 2000).
  • Great survey on the topic (Halevy, 2001).

27
The General Reformulation Problem
  • Given
  • A query Q posed over the mediated schema
  • Descriptions of the data sources
  • Find
  • A query Q' over the data source relations, such
    that
  • Q' provides only correct answers to Q, and
  • Q' provides all possible answers to Q given the
    sources.

28
Query Reformulation
  • Generate good logical query plans quickly
  • Recent work
  • bucket algorithm [Levy et al.]
  • inverse-rules algorithm [Duschka et al.]
  • MiniCon algorithm [Pottinger et al.]
  • iDRIPS/Streamer [Doan et al.]

29
Query Optimization
  • Very related to query reformulation
  • Goal of the optimizer: find a physical plan with
    minimal cost.
  • Key components in optimization
  • Search space of plans
  • Search strategy
  • Cost model

30
Can We Get RDBMS-Level Optimization for Data
Integration, without Statistics?
  • Multiple remote sources
  • Described and mapped loosely
  • Data changes frequently
  • Generally, would like to support same kinds of
    queries as in a local setting

Query / Results
        |
Data Integration System
(mediated schema, source catalog, schema mappings)
        |
Remote, autonomous data sources
31
Adaptive Query Processing
  • Many domains where cost-based query optimization
    fails
  • Complex queries in traditional databases:
    estimation error grows exponentially with joins
    [IC91]
  • Querying over the Internet: unpredictable access
    rates, delays
  • Querying external data sources: limited
    information available about properties of the
    source
  • Monitor real-world conditions, adapt processing
    strategy in response

32
What Are the Sources of Inefficiency?
  • Delays: we stall waiting for I/O
  • Bad estimation of intermediate result sizes
  • No info about source cardinalities
  • Major challenges
  • Trading off information acquisition (exploration)
    vs. use (exploitation)
  • Extrapolating performance based on what you've
    seen so far

33
Types of Adaptive Query Processing
  • Adaptive scheduling (query scrambling [UF98],
    dynamic pipeline scheduling [UF01], XJoin [UF00],
    PH join [RS86, I99], ripple join [HH99])
  • Changes CPU scheduling to improve feedback or
    reduce delays
  • Redundant computation (competitive execution [AZ96])
  • Compare two ways of executing the query
  • Plan partitioning [S76, KD98, I99, M04]
  • Break the plan into stages; re-optimize future
    stages as necessary
  • Adaptive info passing ([IT05], sub.)
  • Pass data between parts of an executing plan
  • Adaptive data partitioning
  • Break the data into subsets; use a different plan
    for each subset
  • The only way to reduce overall computation with
    fine granularity
  • First (and only) implementation has been eddies
    [AH00, R03, DH04]

34
Adaptive QP in Summary
  • A variety of different techniques, focusing on
  • Scheduling
  • Comparison & competition
  • Information passing
  • Data & plan partitioning
  • A field that is still fairly open; missing:
  • Effective exploration methods
  • A true theory!
  • What's possible? What kinds of queries make
    sense to adapt?
  • Guarantees of optimality and convergence (perhaps
    under certain assumptions)

35
A Brief Research History
(outline repeated from slide 4)

36
Semantic Matches between Schemas
Mediated schema:  price, agent-name, address
homes.com schema: listed-price, contact-name, city, state

1-1 match:     price ~ listed-price
complex match: address ~ (city, state)

Sample homes.com data:
  listed-price  contact-name  city     state
  320K          Jane Brown    Seattle  WA
  240K          Mike Smith    Miami    FL
37
Schema Matching is Ubiquitous!
  • Fundamental problem in numerous applications
  • Databases
  • data integration
  • data translation
  • schema/view integration
  • data warehousing
  • semantic query processing
  • model management
  • peer data management
  • AI
  • knowledge bases, ontology merging, information
    gathering agents, ...
  • Web
  • e-commerce
  • marking up data using ontologies (e.g., on
    Semantic Web)

38
Why Schema Matching is Difficult
  • Schema & data never fully capture semantics!
  • not adequately documented
  • schema creator has retired to Florida!
  • Must rely on clues in schema & data
  • using names, structures, types, data values, etc.
  • Such clues can be unreliable
  • same names => different entities: area =>
    location or square-feet
  • different names => same entity: area &
    address => location
  • Intended semantics can be subjective
  • house-style = house-description?
  • military applications require committees to
    decide!
  • Cannot be fully automated, needs user feedback!

39
Current State of Affairs
  • Finding semantic mappings is now a key
    bottleneck!
  • largely done by hand
  • labor intensive & error prone
  • data integration at GTE [Li & Clifton, 2000]
  • 40 databases, 27,000 elements, estimated time: 12
    years
  • Will only be exacerbated
  • data sharing becomes pervasive
  • translation of legacy data
  • Need semi-automatic approaches to scale up!
  • Many research projects in the past few years
  • Databases IBM Almaden, Microsoft Research, BYU,
    George Mason, U of Leipzig, U
    Wisconsin, NCSU, UIUC, Washington, ...
  • AI Stanford, Karlsruhe University, NEC Japan,
    ...

40
LSD [Doan et al. 01]
  • Learning Source Descriptions
  • Developed at Univ of Washington, 2000-2001
  • Designed for data integration settings
  • has been adapted to several other contexts
  • Desirable characteristics
  • learn from previous matching activities
  • exploit multiple types of information in schema
    and data
  • incorporate domain integrity constraints
  • handle user feedback
  • achieves high matching accuracy (66%--97%) on
    real-world data

41
Schema Matching for Data Integration: the LSD
Approach
  • Suppose user wants to integrate 100 data sources
  • 1. User
  • manually creates matches for a few sources, say 3
  • shows LSD these matches
  • 2. LSD learns from the matches
  • 3. LSD predicts matches for remaining 97 sources

42
Learning from the Manual Matches
Mediated schema:          price, agent-name, agent-phone,
                          office-phone, description
Schema of realestate.com: listed-price, contact-name, contact-phone,
                          office, comments

Learned rule: if "office" occurs in the element name => office-phone

realestate.com data:
  listed-price  contact-name  contact-phone   office          comments
  250K          James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
  320K          Mike Doan     (617) 253 1429  (617) 112 2315  Great location

Learned rule: if "fantastic" & "great" occur frequently in
data instances => description

homes.com data:
  sold-at  contact-agent   extra-info
  350K     (206) 634 9435  Beautiful yard
  230K     (617) 335 4243  Close to Seattle
43
Must Exploit Multiple Types of Information!
(same figure as slide 42)
44
Multi-Strategy Learning
  • Use a set of base learners
  • each exploits certain types of information well
  • To match a schema element of a new source
  • apply base learners
  • combine their predictions using a meta-learner
  • Meta-learner
  • uses training sources to measure base learner
    accuracy
  • weighs each learner based on its accuracy

45
Base Learners
  • Name Learner
  • training: (location, address),
    (contact-name, name)
  • matching: agent-name => (name, 0.7), (phone, 0.3)
  • Naive Bayes Learner
  • training: ("Seattle, WA", address),
    ("250K", price)
  • matching: "Kent, WA" =>
    (address, 0.8), (name, 0.2)

(labels weighted by confidence score)
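To give a flavor of what a base learner computes, here is a toy name matcher in Python. It is only a sketch: the real Name Learner is trained on labeled matches, and the token-overlap scoring and normalization here are assumptions, not LSD's actual method.

```python
def name_learner(source_name, mediated_elements):
    # Toy name matcher: score each mediated-schema element by token
    # overlap (Jaccard) with the source element's name, then normalize
    # the scores into confidences that sum to 1.
    src = set(source_name.replace("-", " ").split())
    scores = {}
    for m in mediated_elements:
        toks = set(m.replace("-", " ").split())
        union = src | toks
        scores[m] = len(src & toks) / len(union) if union else 0.0
    total = sum(scores.values()) or 1.0
    return {m: s / total for m, s in scores.items()}

print(name_learner("agent-name", ["name", "agent-phone", "address"]))
```

As on the slide, the output is a set of candidate labels weighted by confidence, which is exactly the shape of prediction the meta-learner consumes.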
46
The LSD Architecture
(Architecture diagram)
Training phase: mediated schema + source schemas ->
  training data for base learners -> Base-Learner1 ... Base-Learnerk,
  Meta-Learner -> Hypothesis1 ... Hypothesisk, weights for base learners
Matching phase: Base-Learner1 ... Base-Learnerk -> predictions for
  instances -> Prediction Combiner -> predictions for elements ->
  Constraint Handler (uses domain constraints) -> mappings
47
Training the Base Learners
Mediated schema: address, price, agent-name, agent-phone,
                 office-phone, description
realestate.com:  location, price, contact-name, contact-phone,
                 office, comments

  location    price  contact-name  contact-phone   office          comments
  Miami, FL   250K   James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
  Boston, MA  320K   Mike Doan     (617) 253 1429  (617) 112 2315  Great location
48
Meta-Learner: Stacking [Wolpert 92, Ting & Witten 99]
  • Training
  • uses training data to learn weights
  • one for each (base-learner, mediated-schema
    element) pair
  • weight(Name-Learner, address) = 0.2
  • weight(Naive-Bayes, address) = 0.8
  • Matching: combine predictions of base learners
  • computes weighted average of base-learner
    confidence scores

Example: element "area", instances "Seattle, WA", "Kent, WA", "Bend, OR"
  Name Learner: (address, 0.4)
  Naive Bayes:  (address, 0.9)
  Meta-Learner: (address, 0.4 × 0.2 + 0.9 × 0.8 = 0.8)
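The weighted combination the meta-learner performs is just a dot product; a minimal sketch with the weights and confidences from this slide:

```python
# Meta-learner combination for label "address" on element "area".
# Weights are learned per (base-learner, mediated-schema element) pair.
weights     = {"Name-Learner": 0.2, "Naive-Bayes": 0.8}
confidences = {"Name-Learner": 0.4, "Naive-Bayes": 0.9}

score = sum(weights[l] * confidences[l] for l in weights)
print(round(score, 2))  # 0.4 × 0.2 + 0.9 × 0.8 = 0.8
```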
49
The LSD Architecture
(architecture diagram repeated from slide 46)
50
Applying the Learners
homes.com schema: area, sold-at, contact-agent, extra-info

Element "area", instances "Seattle, WA", "Kent, WA", "Bend, OR":
  per-instance predictions from the base learners and Meta-Learner,
  e.g. (address, 0.8), (description, 0.2); (address, 0.6),
  (description, 0.4); (address, 0.7), (description, 0.3)
  Prediction Combiner => (address, 0.7), (description, 0.3)

Element-level predictions for homes.com:
  area          => (address, 0.7), (description, 0.3)
  sold-at       => (price, 0.9), (agent-phone, 0.1)
  contact-agent => (agent-phone, 0.9), (description, 0.1)
  extra-info    => (address, 0.6), (description, 0.4)
51
Domain Constraints
  • Encode user knowledge about domain
  • Specified only once, by examining mediated schema
  • Examples
  • at most one source-schema element can match
    address
  • if a source-schema element matches house-id then
    it is a key
  • avg-value(price) > avg-value(num-baths)
  • Given a mapping combination
  • can verify if it satisfies a given constraint

Example combination:
  area => address, sold-at => price,
  contact-agent => agent-phone, extra-info => address
52
The Constraint Handler
Predictions from the Prediction Combiner:
  area          => (address, 0.7), (description, 0.3)
  sold-at       => (price, 0.9), (agent-phone, 0.1)
  contact-agent => (agent-phone, 0.9), (description, 0.1)
  extra-info    => (address, 0.6), (description, 0.4)

Domain constraint: at most one element matches address.

Scoring candidate combinations (product of confidences):
  area => description, sold-at => agent-phone,
    contact-agent => description, extra-info => description:
    0.3 × 0.1 × 0.1 × 0.4 = 0.0012
  area => address, sold-at => price, contact-agent => agent-phone,
    extra-info => description:
    0.7 × 0.9 × 0.9 × 0.4 = 0.2268
  area => address, sold-at => price, contact-agent => agent-phone,
    extra-info => address:
    0.7 × 0.9 × 0.9 × 0.6 = 0.3402 (violates the constraint)
  • Searches space of mapping combinations
    efficiently
  • Can handle arbitrary constraints
  • Also used to incorporate user feedback
  • e.g., "sold-at does not match price"

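The search this slide describes can be sketched as brute-force enumeration over the Prediction Combiner's outputs. This is only an illustration: the real constraint handler searches the space of combinations efficiently rather than exhaustively.

```python
from itertools import product

# Predictions from the Prediction Combiner (numbers from the slide).
predictions = {
    "area":          [("address", 0.7), ("description", 0.3)],
    "sold-at":       [("price", 0.9), ("agent-phone", 0.1)],
    "contact-agent": [("agent-phone", 0.9), ("description", 0.1)],
    "extra-info":    [("address", 0.6), ("description", 0.4)],
}

def best_mapping(predictions):
    # Enumerate every combination, score it as the product of the chosen
    # confidences, and enforce the domain constraint
    # "at most one source element matches address".
    elems = list(predictions)
    best, best_score = None, -1.0
    for choice in product(*predictions.values()):
        labels = [label for label, _ in choice]
        if labels.count("address") > 1:
            continue  # violates the constraint
        score = 1.0
        for _, conf in choice:
            score *= conf
        if score > best_score:
            best, best_score = dict(zip(elems, labels)), score
    return best, round(best_score, 4)

mapping, score = best_mapping(predictions)
print(mapping, score)  # extra-info falls back to description; score 0.2268
```

The unconstrained best combination (score 0.3402) maps both area and extra-info to address; the constraint forces extra-info down to its second choice, leaving the 0.2268 combination as the winner.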
53
The Current LSD System
  • Can also handle data in XML format
  • matches XML DTDs
  • Base learners
  • Naive Bayes [Duda & Hart 93; Domingos & Pazzani 97]
  • exploits frequencies of words & symbols
  • WHIRL nearest-neighbor classifier [Cohen & Hirsh,
    KDD-98]
  • employs an information-retrieval similarity metric
  • Name Learner [SIGMOD-01]
  • matches elements based on their names
  • County-Name Recognizer [SIGMOD-01]
  • stores all U.S. county names
  • XML Learner [SIGMOD-01]
  • exploits hierarchical structure of XML data

54
Empirical Evaluation
  • Four domains
  • Real Estate I & II, Course Offerings, Faculty
    Listings
  • For each domain
  • created mediated schema & domain constraints
  • chose five sources
  • extracted & converted data into XML
  • mediated schemas: 14--66 elements; source
    schemas: 13--48
  • Ten runs for each domain; in each run
  • manually provided 1-1 matches for 3 sources
  • asked LSD to propose matches for remaining 2
    sources
  • accuracy = % of 1-1 matches correctly identified
  • LSD achieves 66%--97% accuracy on real-world data

55
A Brief Research History
  • Many early ad-hoc solutions
  • Converged into two approaches
  • data warehousing vs. virtual DI systems
  • Semi-structured data, XML
  • Wrappers, info extraction
  • Other issues: query optimization, schema
    matching, ...
  • Current directions
  • DI for specialized domains (e.g., bioinformatics)
  • on-the-fly DI, simplified queries (e.g.,
    entity-centric)
  • simplify integration tasks (e.g., inexact
    answers)
  • New types of data sharing systems
  • P2P systems, Semantic Web

56
What You Should Know
  • Data integration is a very active research area
    in DB
  • Strong industry needs
  • Many interesting research questions
  • Touches many other fields: artificial
    intelligence (ontologies, machine learning, ...),
    information retrieval, ...
  • Web data integration is a very important special
    case