Data Integration: A Status Report

1
Data Integration: A Status Report
  • Alon Halevy
  • University of Washington, Seattle
  • BTW 2003

2
Data Integration Report
  • Recent progress:
  • Mediation languages
  • Query processing (XML and other)
  • Commercial
  • Current challenges:
  • Flexible architectures: peer data management.
  • Getting to the root of semantic heterogeneity:
    schema mapping.

3
Data Integration Systems
  • This is one possible architecture (virtual
    integration).
  • Only the logical mediated schema is central. The
    data stays at the sources.

4
Motivation and Activity
  • Application areas of data integration:
  • Enterprise information integration ()
  • The government
  • Data sources on the web
  • Scientific data sharing.
  • Many research projects:
  • Mine: Information Manifold, Tukwila, LSD.
  • Companies:
  • Many startups, big guys getting in.

5
Outline
  • Recent progress:
  • Mediation languages
  • Adaptive query processing
  • XML data management
  • Commercial
  • Current challenges:
  • Flexible architectures: peer data management.
  • Getting to the root of semantic heterogeneity:
    schema mapping.
  • Crossing the Structure Chasm.

6
Mediation Languages
  • Goal: a language for specifying semantic
    relationships.
  • [Figure: the mediated schema.]
7
Global-as-View (GAV)
  • Create View Actor AS R1 Union Select A, B From
    S2 Union ...
  • [Figure: mediated schema (Title, Actor, ...) over
    sources R1 to R5.]
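
A minimal sketch of the GAV idea, assuming two toy sources: the mediated relation Actor is defined as a view over the sources, and a query on the mediated schema is answered by unfolding that view. The rows, the reading of columns A and B as title and actor name, and the helpers `actor_view` and `query_actor` are invented for illustration; only the shape of the view follows the slide.

```python
# Toy sources; all rows and column meanings are made up for illustration.
R1 = [("Metropolis", "Brigitte Helm")]
S2 = [
    {"A": "Run Lola Run", "B": "Franka Potente", "C": 1998},
    {"A": "Metropolis", "B": "Brigitte Helm", "C": 1927},
]


def actor_view():
    """GAV: the mediated relation Actor is a view over the sources.

    Mirrors the shape 'R1 UNION SELECT A, B FROM S2'.
    """
    from_r1 = set(R1)
    from_s2 = {(row["A"], row["B"]) for row in S2}
    return from_r1 | from_s2  # set union drops duplicate tuples


def query_actor(title):
    """Answer a query over the mediated schema by unfolding the view."""
    return sorted(name for (t, name) in actor_view() if t == title)


if __name__ == "__main__":
    print(query_actor("Run Lola Run"))
```

Because the view is written over the sources, query answering is plain view unfolding; the price is that the view must be edited whenever a source is added.
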
8
Local-as-View (LAV)
(GLAV)
  • Create View R5 as Select * From Movie Where
    lang = 'German'
  • Create View R1 as Select title, name From Title
    Join Actor Where Year > 1970
  • [Figure: mediated schema (Title, Actor) over
    sources R1 to R5.]
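
To make the direction of LAV concrete, the sketch below evaluates the two source descriptions over a small, hypothetical instance of the mediated schema. In a real system the mediated relations are virtual and queries are answered by rewriting them in terms of the views; this only shows what each source is declared to contain. The rows, the join on title, and the function names are assumptions.

```python
# Hypothetical instance of the mediated schema (virtual in a real system).
MOVIE = [
    {"title": "Run Lola Run", "lang": "German"},
    {"title": "Heat", "lang": "English"},
    {"title": "Metropolis", "lang": "German"},
]
TITLE = [
    {"title": "Run Lola Run", "year": 1998},
    {"title": "Heat", "year": 1995},
    {"title": "Metropolis", "year": 1927},
]
ACTOR = [
    {"title": "Run Lola Run", "name": "Franka Potente"},
    {"title": "Heat", "name": "Al Pacino"},
    {"title": "Metropolis", "name": "Brigitte Helm"},
]


def source_r5():
    """R5 is described as: all Movie tuples whose language is German."""
    return [m for m in MOVIE if m["lang"] == "German"]


def source_r1():
    """R1 is described as: (title, name) from Title joined with Actor,
    restricted to years after 1970. The join on title is an assumption."""
    return [
        (t["title"], a["name"])
        for t in TITLE
        for a in ACTOR
        if t["title"] == a["title"] and t["year"] > 1970
    ]
```

Each source is described independently of the others, so adding a source means adding one view rather than editing the mediated schema's definition.
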
9
Adaptive Query Processing
  • Problem: no statistics, unstable network.
  • Cannot plan first and then execute.
  • Need to adapt the plan during execution.
  • The idea was already in Ingres (1976).
  • Proposed before data integration:
  • Cole and Graefe (choose nodes)
  • Kabra and DeWitt (mid-query re-optimization).

10
Convergent Query Processing (Zack Ives, Ph.D.
2002, U. Penn)
  • The processor starts with an initial plan.
  • Monitors execution, accumulating statistics.
  • Switches plans when a better one is found.
  • Reuses intermediate results.
  • Final cleanup phase.
  • Possible transformation types:
  • Plan partitioning, data partitioning, low-level
    rescheduling.
  • Can be aggressive (e.g., with aggregations).
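
The monitor-and-switch loop can be illustrated with a deliberately small stand-in: a filter pipeline that starts with a given predicate order, collects pass rates while it runs, and reorders the predicates between batches. This is not the convergent query processing algorithm itself; the function `adaptive_filter`, the batch size, and the reordering rule are assumptions chosen to keep the sketch short.

```python
def adaptive_filter(rows, predicates, batch_size=100):
    """Apply every predicate to every row, re-planning between batches.

    predicates: dict mapping a name to a function(row) -> bool.
    Returns (matching_rows, history), where history records the
    predicate order used for each batch.
    """
    order = list(predicates)          # initial plan: the given order
    seen = {name: 0 for name in order}
    passed = {name: 0 for name in order}
    out, history = [], []
    for start in range(0, len(rows), batch_size):
        history.append(tuple(order))
        for row in rows[start:start + batch_size]:
            for name in order:
                seen[name] += 1
                if predicates[name](row):
                    passed[name] += 1
                else:
                    break             # row rejected, skip later predicates
            else:
                out.append(row)       # no predicate rejected the row
        # Switch plan: lowest observed pass rate goes first.
        order.sort(key=lambda n: passed[n] / seen[n] if seen[n] else 1.0)
    return out, history


if __name__ == "__main__":
    rows = list(range(1000))
    preds = {"even": lambda r: r % 2 == 0, "rare": lambda r: r % 50 == 0}
    result, plans = adaptive_filter(rows, preds)
    print(len(result), plans[:2])
```

Note that the pass rate of a later predicate is only measured on rows that survived the earlier ones, so the statistics are biased by the current plan; a real adaptive processor has to account for that.
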

11
XML Query Processing
  • XML facilitates integration.
  • Mediator query processor may manipulate XML
    directly.
  • Progress on:
  • Publishing to XML, XML views on relations
  • Physical algebras for manipulating XML
  • Optimization of XQuery.
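
As a small illustration of publishing relational data as XML, the sketch below wraps rows in elements using Python's standard `xml.etree.ElementTree`. The element naming scheme (relation name plus "s" for the root, one child element per column) and the helper `publish` are assumptions, not a scheme taken from the talk.

```python
import xml.etree.ElementTree as ET


def publish(relation_name, rows):
    """Publish a list of row dicts as an XML tree.

    The root is named '<relation_name>s'; each row becomes one
    '<relation_name>' element with one child element per column.
    """
    root = ET.Element(relation_name + "s")
    for row in rows:
        elem = ET.SubElement(root, relation_name)
        for column, value in row.items():
            ET.SubElement(elem, column).text = str(value)
    return root


movies = publish("movie", [{"title": "Heat", "year": 1995}])
xml_text = ET.tostring(movies, encoding="unicode")
titles = [t.text for t in movies.findall("./movie/title")]
```

Once the data is in this form, a mediator can navigate it with path expressions, as the `findall` call does on a very small scale.
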

12
The Commercial World
  • Some startups:
  • Nimble, MetaMatrix, Calixa, Enosys,
  • Big guys making announcements:
  • IBM, BEA, MS (Oracle still being defiant).
  • Progress: analysts have a buzzword, EII.
  • Challenges:
  • Integration with EAI?
  • Yet another middleware?
  • Horizontal vs. vertical?

13
Outline
  • Recent progress:
  • Mediation languages
  • Adaptive query processing
  • XML data management
  • Commercial
  • Current challenges:
  • Flexible architectures: peer data management.
  • Getting to the root of semantic heterogeneity:
    schema mapping.

14
Peer Data-Management
  • PDMS: a network of peers.
  • Peers can:
  • Export base data
  • Provide views on base data
  • Serve as logical mediators for other peers
  • A peer can be both a server and a client.
  • Semantic relationships are specified locally
    (between small sets of peers).

15
Network of Mappings (Piazza)
  • [Figure: peers CiteSeer, UW, Stanford, DBLP,
    Leipzig, Saarbruecken, and Berlin, connected by
    GAV, LAV, and GLAV mappings.]
16
Advantages of PDMS
  • No need for a central mediated schema.
  • Can map data opportunistically, as is most
    convenient.
  • Queries are posed using the peer's schema.
    Answers come from anywhere in the system.
  • Semantic Web.
  • This is not P2P file sharing:
  • Data has rich semantics.
  • Membership is not as dynamic.
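
A rough sketch of how a query posed in one peer's vocabulary could reach data elsewhere: mappings are stored pairwise, and reformulation follows them transitively. The peer names come from the slides, but the topology, the attribute names, and the restriction to simple attribute renamings are assumptions; real PDMS mappings are GAV, LAV, or GLAV formulas.

```python
from collections import deque

# Locally specified mappings: (peer, attribute) -> [(neighbor, attribute)].
# The topology and attribute names are hypothetical.
MAPPINGS = {
    ("UW", "paper_title"): [("DBLP", "title")],
    ("DBLP", "title"): [("CiteSeer", "doc_title"), ("Leipzig", "titel")],
    ("Leipzig", "titel"): [("Berlin", "titel")],
}


def reformulate(peer, attribute):
    """Return the name of `attribute` at every peer reachable via mappings."""
    found = {peer: attribute}
    queue = deque([(peer, attribute)])
    while queue:
        node = queue.popleft()
        for next_peer, next_attr in MAPPINGS.get(node, []):
            if next_peer not in found:   # ignore redundant paths
                found[next_peer] = next_attr
                queue.append((next_peer, next_attr))
    return found


if __name__ == "__main__":
    print(reformulate("UW", "paper_title"))
```

No peer holds a global schema here: UW only knows its mapping to DBLP, yet its query can be answered with data from Berlin.
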

17
Schema Mediation
  • When can LAV and GAV be combined to form such a
    network structure? (ICDE-03; WWW-03 for XML)
  • [Figure: the Piazza peer network with GAV, LAV,
    and GLAV mappings.]
18
Query Optimization
  • Problems:
  • Redundant paths
  • Expensive reformulation.
  • Possible solution:
  • Pre-compose some paths.
  • [Figure: the Piazza peer network.]
19
Mapping Composition
  • Incredibly subtle! (with Madhavan)
  • In general, composition can be an infinite set of
    GLAV formulas.
  • Results:
  • Finite in many cases
  • Even when infinite, often has a finite, useful
    encoding.
  • Hence, compositions can usually be pre-optimized.
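
For intuition only, here is the easiest case of composition, where each mapping is a plain attribute renaming and the composed mapping is obviously finite. The subtle cases mentioned above arise with GLAV formulas and are not captured by this sketch; the attribute names and the helper `compose` are hypothetical.

```python
def compose(m_ab, m_bc):
    """Compose an A->B renaming with a B->C renaming into A->C.

    Attributes of A whose B-counterpart has no mapping into C are dropped.
    """
    return {a: m_bc[b] for a, b in m_ab.items() if b in m_bc}


# Hypothetical pairwise mappings along one path in the peer network.
uw_to_dblp = {"paper_title": "title", "writer": "author", "pages": "pp"}
dblp_to_citeseer = {"title": "doc_title", "author": "creator"}

# Pre-composed once, so queries from UW need not be reformulated via DBLP.
uw_to_citeseer = compose(uw_to_dblp, dblp_to_citeseer)
```
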

20
Management of Updates (with Mork, Gribble)
  • Problem: when updates are generated, we don't
    know who will use them.
  • Solution:
  • Represent updates as first-class citizens
  • Complement with boosters
  • Rules for usage.
  • [Figure: the Piazza peer network.]
21
Other Research Issues
  • Intelligent data placement
  • Management of mapping networks
  • Improving networks: finding additional
    connections.
  • Indexing of views
  • [Figure: the Piazza peer network.]
22
Schema Matching/Mapping
  • Given:
  • S1 and S2: a pair of schemas/DTDs/ontologies
  • Possibly, accompanying data instances
  • Additional domain knowledge
  • Find:
  • A match between S1 and S2:
  • A set of correspondences between the terms.
  • Ultimately, a mapping:
  • Should enable translating data between the
    schemas.

23
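Example: House Listings
  • [Figure: two house schemas. One lists house,
    address, num-baths, Water view, with the values
    Lake and Mountains. The other lists house,
    location, view, full-baths, half-baths, with the
    values front and back. Arrows are labeled "1-1
    mapping", "non 1-1 mapping", and "?".]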
24
Motivations
  • Heart of any data sharing architecture:
  • Virtual, warehouse, messaging, web services,
    semantic web
  • Translation of legacy data, EAI,
  • Key operator in model management:
  • Algebra for manipulating models of data
  • See Bernstein, CIDR-03; Melnik et al., SIGMOD-03.
  • Currently a bottleneck: done mostly by hand.

25
Approaches to Matching
  • Matching is hard because the schema does not
    fully capture the semantics.
  • Many techniques have been proposed. They consider
    similarities in:
  • Attribute names (synonyms)
  • Data values, data types
  • Relationships between columns
  • Structural similarities
  • Anything a human expert would try!
  • Hence, let's try to simulate a human.
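
Two of the signals listed above, attribute names and data values, can be combined in a few lines. The sketch below scores candidate pairs with a weighted sum of string similarity (from the standard `difflib` module) and value overlap, and keeps the best candidate per attribute. The weights, the sample values, and the function names are assumptions, and the sketch only proposes 1-1 correspondences.

```python
from difflib import SequenceMatcher


def name_sim(a, b):
    """String similarity of two attribute names, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_sim(vals_a, vals_b):
    """Jaccard overlap of the two attributes' sample values."""
    a, b = set(vals_a), set(vals_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def match(schema1, schema2, w_name=0.5, w_value=0.5):
    """For each attribute of schema1, pick the best-scoring one in schema2."""
    result = {}
    for a1, vals1 in schema1.items():
        scores = {
            a2: w_name * name_sim(a1, a2) + w_value * value_sim(vals1, vals2)
            for a2, vals2 in schema2.items()
        }
        result[a1] = max(scores, key=scores.get)
    return result


# Attribute names follow the house-listing example; values are invented.
S1 = {"address": ["12 Lake St", "5 Pine Ave"], "num-baths": [2, 3]}
S2 = {
    "location": ["12 Lake St", "9 Elm Rd"],
    "full-baths": [2, 1],
    "half-baths": [0, 1],
}
```

Here the names "address" and "location" share almost nothing, so it is the overlapping values that pair them, which is the reason for combining several kinds of evidence.
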

26
Philosophy of Solutions
  • Effective schema matching requires a principled
    combination of techniques.
  • Like human experts, the matcher should improve
    over time:
  • Learn from seeing many schemas and matches.
  • LSD: Doan, Ph.D. 2002, U. of Illinois
  • COMA: Do et al.

27
Corpus-Based Solution (Madhavan, Bernstein, Chen,
Halevy, Shenoy)
  • Collect a corpus of schemas and matches.
  • Learn from the corpus:
  • Create a classifier for every corpus element
  • Use multi-strategy learning.
  • Given S1 and S2:
  • Compare each schema element to corpus elements.
  • If two elements' similarity vectors are close,
    then maybe they match each other.
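
The similarity-vector idea can be sketched as follows: each schema element is scored against every corpus element, and two elements are compared by the cosine of their score vectors. Plain string similarity stands in for the learned per-element classifiers, which is a strong simplification; the corpus contents and all function names are invented.

```python
from difflib import SequenceMatcher
from math import sqrt

# A tiny, made-up corpus of element names.
CORPUS = ["address", "street", "city", "bathrooms", "price", "phone"]


def similarity_vector(element):
    """Score `element` against every corpus element.

    String similarity is a stand-in for one trained classifier
    per corpus element.
    """
    return [SequenceMatcher(None, element, c).ratio() for c in CORPUS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def closeness(e1, e2):
    """How close two schema elements are, judged through the corpus."""
    return cosine(similarity_vector(e1), similarity_vector(e2))


if __name__ == "__main__":
    print(closeness("num-baths", "full-baths"))
    print(closeness("num-baths", "location"))
```

The two elements are never compared to each other directly; the corpus acts as the shared frame of reference.
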

28
Learning from Corpus vs. Learning from the schemas
29
Finding Different Matches
30
Other Corpus Based Tools
  • Conjecture: a corpus of schemas can be the basis
    for many useful tools.
  • Auto-complete:
  • I start creating a schema (or show sample data),
    and the tool suggests a completion.
  • Query reformulation:
  • I ask a query using my terminology, and it gets
    reformulated appropriately.
  • Improving structured queries over structured web
    sites (and focused crawling, a la BINGO!)

31
The Corpus
  • Contents:
  • Schemas, ontologies, meta-data, data, queries.
  • Sample statistics:
  • How often does a word appear as a relation name?
  • When it does, what tend to be the attribute
    names?
  • What other tables are there? What are the foreign
    keys?
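
Statistics of this kind are simple counts once the corpus is in a uniform shape. The sketch below assumes each schema is a dictionary from relation name to attribute names; the corpus contents and the function names are invented, and foreign keys are left out.

```python
from collections import Counter

# Hypothetical corpus: one dict per schema, relation name -> attributes.
CORPUS = [
    {"house": ["address", "price", "num_baths"], "agent": ["name", "phone"]},
    {"house": ["location", "price"], "school": ["name", "district"]},
    {"listing": ["address", "price"]},
]


def relation_name_counts(corpus):
    """How often does each word appear as a relation name?"""
    return Counter(rel for schema in corpus for rel in schema)


def attribute_counts(corpus, relation):
    """When `relation` appears, which attribute names does it tend to have?"""
    counts = Counter()
    for schema in corpus:
        counts.update(schema.get(relation, []))
    return counts


def co_occurring_relations(corpus, relation):
    """Which other tables appear in schemas that contain `relation`?"""
    counts = Counter()
    for schema in corpus:
        if relation in schema:
            counts.update(r for r in schema if r != relation)
    return counts
```

An auto-complete tool could, for instance, suggest `attribute_counts(CORPUS, "house").most_common()` when a user starts a relation called "house".
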

32
Conclusion: Crossing the Structure Chasm
  • Data authoring, querying, and sharing are
    everywhere, and are done by novices too.
  • The Semantic Web is the extreme example.
  • [Figure: corpus of schemas.]
33
Some References
  • www.cs.washington.edu/homes/alon
  • Piazza: WebDB-01, ICDE-03, WWW-03
  • The Structure Chasm: CIDR-03
  • Mediation surveys: VLDB Journal 01;
  • Lenzerini, PODS-02 tutorial.
  • Schema matching:
  • Rahm and Bernstein, VLDB Journal 01.