Information Integration: A Status Report

Description:

Big guys making announcements: IBM, BEA, MS, (Oracle still being defiant) ... What other tables are there? Support a KR-style interface on the corpus (OKBC-like) ...

Slides: 43

Provided by: uw3

Learn more at: https://homes.cs.washington.edu

1
Information Integration: A Status Report

Alon Halevy
University of Washington, Seattle
IJCAI 2003

2
Mediated Schema
[Figure: a mediated schema (Entity, Sequenceable Entity, Structured Vocabulary, Gene, Phenotype, Experiment, Nucleotide Sequence, Microarray Experiment, Protein) over the data sources OMIM, Swiss-Prot, HUGO, GO, GeneClinics, Entrez, LocusLink, GEO.]
Query: For the microarray experiment I just ran, what are the related nucleotide sequences, and for what protein do they code?
3
Motivation and Activity

Application areas of data integration:
Enterprise information integration (EII)
The government
Data sources on the web
Scientific data sharing
Several data sharing architectures:
Virtual data integration, warehousing, message passing, web services.
Many research projects:
Mine: Information Manifold, Tukwila, LSD, Piazza.
EII: a new industry buzzword.

4
Today's Agenda

Recent progress:
Mediation languages
Query processing (XML and other)
Some lessons from commercial world.
Current challenges:
Enabling large-scale data sharing: peer data management systems.
The age-old problem: semantic heterogeneity.
A new agenda item for AI: corpus-based KR.
AI is more vital than ever for progress here!

5
Mediation Languages
Goal: a language for specifying semantic relationships (not full FOL).
[Figure: mediated schema.]
Assume data at the sources is structured (or seems so).
6
Global-as-View (GAV)
Actor(x,y) :- R1(x,y,z)
Actor(x,y) :- R2(x,z), R3(z,y)
[Figure: mediated schema (Title, Actor, ...) over sources R1 to R5.]
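To make the GAV idea concrete, here is a minimal sketch of how a query over the mediated relation Actor can be answered by unfolding the two Actor rules on this slide into queries over the sources. The source contents and the helper names (`actor`, `actors_of`) are hypothetical and chosen only for illustration.

```python
# GAV: each mediated relation is defined as a view over the sources.
#   Actor(x,y) :- R1(x,y,z)
#   Actor(x,y) :- R2(x,z), R3(z,y)

# Hypothetical source contents, for illustration only.
R1 = {("Casablanca", "Bogart", 1942)}
R2 = {("Vertigo", "m1")}
R3 = {("m1", "Stewart"), ("m2", "Kelly")}


def actor():
    """Unfold both GAV rules for Actor and union their results."""
    from_r1 = {(x, y) for (x, y, _z) in R1}
    from_r2_r3 = {(x, y) for (x, z) in R2 for (z2, y) in R3 if z == z2}
    return from_r1 | from_r2_r3


def actors_of(title):
    """A query over the mediated schema: who acts in `title`?"""
    return sorted(y for (x, y) in actor() if x == title)
```

Because the mediated relation is defined directly in terms of the sources, reformulation here is plain view unfolding; the cost of this simplicity is that adding a source means revisiting the rules for the mediated schema.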
7
Local-as-View (LAV,GLAV)
R1(x,y,z) :- Title(x,y), Actor(x,z), y < 1970
R5(x,y,z) :- Movie(x,y,French)
[Figure: mediated schema (Title, Actor) over sources R1 to R5.]
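In LAV the sources are described as views over the mediated schema, so answering a query needs a rewriting step. One known approach is to invert the source descriptions; the sketch below does this for the R1 description on this slide only. It assumes, purely for illustration, that the second attribute of Title is a year; the data and function names are hypothetical. Since every variable in the body of the R1 description also appears in its head, the inverse rules need no placeholder (Skolem) values.

```python
# LAV source description from the slide:
#   R1(x,y,z) :- Title(x,y), Actor(x,z), y < 1970
# Inverse rules derived from it:
#   Title(x,y) :- R1(x,y,z)
#   Actor(x,z) :- R1(x,y,z)

# Hypothetical contents of source R1 (movie, year, actor).
R1 = {("m1", 1942, "Bogart"), ("m2", 1958, "Stewart"), ("m2", 1958, "Novak")}


def invert_r1(r1):
    """Derive the mediated-schema facts that R1's description guarantees."""
    title = {(x, y) for (x, y, _z) in r1}
    actor = {(x, z) for (x, _y, z) in r1}
    return title, actor


def query_before(year, r1):
    """q(x,z) :- Title(x,y), Actor(x,z), y < year, evaluated over derived facts."""
    title, actor = invert_r1(r1)
    return sorted(
        (x, z) for (x, y) in title for (x2, z) in actor if x == x2 and y < year
    )
```

For example, `query_before(1950, R1)` returns only the pair for movie `m1`. The answers are sound but possibly incomplete: R1 says nothing about movies from 1970 onward.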
8
Mediation Languages Summary

A lot of nice theory and practical algorithms.
Careful choice of expressive power mattered.
Algorithms for answering queries using views are
in every commercial DBMS.
Description Logics also an attractive formalism
for mediation.
Bottleneck is coming up with the mapping
expressions.

9
Outline

Recent progress
Mediation languages
Query processing (XML and other)
Some lessons from commercial world.
Current challenges
Enabling large-scale data sharing: peer data management systems.
The age-old problem: semantic heterogeneity.
A new agenda item for AI: corpus-based KR.

10
Adaptive Query Processing

Problem: no stats, network unstable.
Cannot plan and then execute.
Need to adapt the plan during execution.
Ideas already in:
Ingres (1976) (early database system)
Interleaving planning and execution (AI)
Key question: when to adapt, and at what granularity?
For every tuple? At materialization points?
See Ives et al. 2002 for our solution.
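As a toy illustration of adapting at materialization points (this is not the algorithm of Ives et al., only a sketch of the general idea), the code below decides the next join only after the previous intermediate result has been materialized, using sizes observed at run time instead of precomputed statistics. It assumes, for simplicity, that all relations join on one shared key; all names and data are hypothetical.

```python
from collections import Counter


def join(left, right, key):
    """Hash join of two lists of dict rows on a shared key."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def output_size(current, candidate, key):
    """Size of current JOIN candidate, computed by counting key matches."""
    counts = Counter(r[key] for r in candidate)
    return sum(counts[l[key]] for l in current)


def adaptive_join(relations, key):
    """Pick each next join after seeing the materialized intermediate result."""
    remaining = dict(relations)
    name = min(remaining, key=lambda n: len(remaining[n]))
    current, order = remaining.pop(name), [name]
    while remaining:  # each loop iteration is a materialization point
        name = min(remaining, key=lambda n: output_size(current, remaining[n], key))
        current = join(current, remaining.pop(name), key)
        order.append(name)
    return current, order


# Hypothetical data for In-stock (I), Orders (O), Shipping (S).
in_stock = [{"item": i, "qty": 1} for i in range(5)]
orders = [{"item": i % 2, "order": i} for i in range(6)]
shipping = [{"item": 0, "carrier": "x"}]
result, order = adaptive_join({"I": in_stock, "O": orders, "S": shipping}, "item")
```

Adapting per tuple would react faster but costs more bookkeeping; adapting only at materialization points, as here, is the coarser end of the granularity question raised on the slide.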

11
Convergent Query Processing [Ives et al., 2002]
Join: In-stock, Orders, Shipping (I ⋈ O ⋈ S)
[Figure: query plans for I ⋈ O ⋈ S; surviving labels: I, OS, IO.]
12
XML Query Processing

XML facilitates integration.
Mediator query processor may manipulate XML
directly.
Challenges:
XML is not flat but nested: path queries.
Can be irregular: doesn't adhere to a strict schema.
Progress:
Defining and optimizing XQuery.
Going back and forth between XML and relational.
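The two challenges named above, nesting and irregularity, show up even in a tiny example. The sketch below uses Python's standard `xml.etree.ElementTree` with simple path expressions (not XQuery); the document and element names are hypothetical. One book has two authors and no price, which a flat relational row would not represent directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, irregular document: the second book has two authors and no price.
DOC = """
<catalog>
  <book><title>A</title><author><last>Smith</last></author><price>10</price></book>
  <book><title>B</title><author><last>Jones</last></author><author><last>Lee</last></author></book>
</catalog>
"""


def authors_by_title(xml_text):
    """Path query over nested data: book/author/last, grouped by book title."""
    root = ET.fromstring(xml_text)
    return {
        book.findtext("title"): [a.findtext("last") for a in book.findall("./author")]
        for book in root.findall("./book")
    }


def prices(xml_text):
    """Irregular data: findtext returns None where an element is missing."""
    root = ET.fromstring(xml_text)
    return {b.findtext("title"): b.findtext("price") for b in root.findall("./book")}
```

Mapping such data to relations means choosing how to represent the repeated author element and the missing price, which is part of the "back and forth" work mentioned above.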

13
The Commercial World

Some startups:
Nimble, MetaMatrix, Calixa, Composite, Enosys
Big guys making announcements:
IBM, BEA, MS (Oracle still being defiant).
Integration technology in different layers:
E.g., reporting companies want it (Actuate)
Progress: analysts have a buzzword, EII.
Challenges:
Integration with EAI?
Yet another middleware?
Horizontal vs. vertical?

14
What Worked?

Performance was not an issue.
Tools, tools, tools:
For managing sources and creating mediated schemas.
XML query processing was needed.
Concordance: need common keys to join sources.
Active research area!

15
Outline

Recent progress
Mediation languages
Query processing (XML and other)
Some lessons from commercial world.
Current challenges
Enabling large-scale data sharing: peer data management systems.
The age-old problem: semantic heterogeneity.
A new agenda item for AI: corpus-based KR.

16
Limitations of Mediated Schema
[Figure: mediated schema.]
17
Peer Data-Management

PDMS: a network of peers (data sources).
Peers can:
Export base data, or combinations of data
Serve as logical mediators for other peers
A peer can be both a server and a client.
Semantic relationships are specified locally (between small sets of peers).
This is a Semantic Web (from a different angle).

18
Network of Mappings (Piazza)
[Figure: network of peers (CiteSeer, UW, Stanford, DBLP, Paris, Roma, Vienna) connected by GAV, LAV, and GLAV mappings.]
19
Advantages of PDMS

No need for a central mediated schema.
Can map data opportunistically, as is most convenient.
Queries are posed using the peer's schema.
Answers come from anywhere in the system.
Infrastructure for Semantic Web applications.
This is not P2P file sharing:
Data has rich semantics.
Membership is not as dynamic.
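The following sketch illustrates the claim that a query posed in one peer's schema can collect answers from anywhere in the network by following local mappings transitively. It is a deliberate simplification: mappings here are plain attribute renamings between pairs of peers, far weaker than GAV, LAV, or GLAV formulas, and the data, attribute names, and the `answer` function are all hypothetical.

```python
from collections import deque

# Hypothetical peer data; each peer uses its own attribute names.
DATA = {
    "UW": [{"title": "Piazza", "year": 2003}],
    "DBLP": [{"name": "Tukwila", "yr": 1999}],
    "CiteSeer": [{"paper": "LSD", "published": 2001}],
}

# Local mappings between pairs of peers: attribute renamings only.
MAPPINGS = {
    ("UW", "DBLP"): {"title": "name", "year": "yr"},
    ("DBLP", "CiteSeer"): {"name": "paper", "yr": "published"},
}


def answer(peer, attrs):
    """Answer a projection query posed in `peer`'s schema across the network."""
    results = []
    seen = {peer}  # visit each peer once, so redundant paths are not re-explored
    queue = deque([(peer, {a: a for a in attrs})])
    while queue:
        p, local = queue.popleft()  # local: query attribute -> p's attribute name
        for row in DATA.get(p, []):
            results.append(tuple(row[local[a]] for a in attrs))
        for (src, dst), renaming in MAPPINGS.items():
            if src == p and dst not in seen and all(local[a] in renaming for a in attrs):
                seen.add(dst)
                queue.append((dst, {a: renaming[local[a]] for a in attrs}))
    return results
```

Note that UW has no direct mapping to CiteSeer; the CiteSeer answer is reached by chaining the UW to DBLP and DBLP to CiteSeer mappings, with no central mediated schema involved.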

20
Schema Mediation for PDMS
When can LAV and GAV be combined to form such a network structure? (Semantics not yet obvious.)
See ICDE-03; WWW-03 for XML.
[Figure: network of peers (CiteSeer, UW, Stanford, DBLP, Paris, Roma, Vienna) connected by GAV, LAV, and GLAV mappings.]
21
Efficient Query Answering

Problems:
Redundant paths
Expensive reformulation

Possible solution:
Pre-compose some paths

[Figure: network of peers (CiteSeer, UW, Stanford, DBLP, Paris, Roma, Vienna).]
22
Mapping Composition [Jayant Madhavan and Halevy, VLDB 2003]

Incredibly subtle!
In general, composition can be an infinite set of
GLAV formulas.
Results
Finite in many cases
Even when infinite, often has finite, useful
encoding.
Hence, compositions can usually be pre-optimized.

23
Other Research Issues
Intelligent data placement
Management of mapping networks
Improving networks: finding additional connections
Handling inconsistencies
[Figure: network of peers (CiteSeer, UW, Stanford, DBLP, Leipzig, Saarbruecken, Berlin).]
24
PDMS-Related Projects

Hyperion (Toronto)
PeerDB (Singapore)
Local relational models (Trento)
Edutella (Hannover, Germany)
Semantic Gossiping (EPFL Zurich)
Raccoon (UC Irvine)
Orchestra (Ives, U. Penn)

25
Outline

Recent progress
Mediation languages
Query processing (XML and other)
Some lessons from commercial world.
Current challenges
Enabling large-scale data sharing: peer data management systems.
The age-old problem: semantic heterogeneity.
A new agenda item for AI: corpus-based KR.

26
Schema/Ontology Matching
[Figure: Consumer, Mediator, and three Data Sources.]

Schema heterogeneity: a key roadblock for information integration.
Different data sources speak their own schema.
Mapping is key to any data sharing architecture.

27
Schema Matching
Inventory Database A:
Books(Title, ISBN, Price, DiscountPrice, Edition)
Authors(ISBN, FirstName, LastName)
BookCategories(ISBN, Category)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
Artists(ASIN, ArtistName, GroupName)
CDCategories(ASIN, Category)

Inventory Database B:
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)

Schema matching: discovering correspondences between similar elements.
Eventually: BooksAndMusic(x:Title, ...) = Books(x:Title, ...) ∪ CDs(x:Album, ...)

28
Typical Approaches

Multiple sources of evidence in the schemas:
Schema element names
BooksAndCDs/Categories ~ BookCategories/Category
Descriptions and documentation
ItemID: unique identifier for a book or a CD
ISBN: unique identifier for any book
Data types, data instances
DateTime ≠ Integer,
addresses have similar formats
Schema structure
All books have similar attributes
Use domain knowledge

In isolation, techniques are incomplete or brittle.
Combine multiple techniques to exploit all available evidence.
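A small sketch of why combining evidence helps, using two of the evidence types listed above: element names and data instances. The attribute names come from the inventory schemas shown earlier; the sample values, the equal weights, and the function names are assumptions made for illustration, and this is not the matcher described in the talk.

```python
from difflib import SequenceMatcher


def name_score(a, b):
    """Name evidence: string similarity of the two element names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def instance_score(values_a, values_b):
    """Instance evidence: Jaccard overlap of sample data values."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def combined(a, b, samples, w_name=0.5, w_inst=0.5):
    """Weighted combination of the two scores (weights are arbitrary here)."""
    return w_name * name_score(a, b) + w_inst * instance_score(samples[a], samples[b])


def best_match(source_attr, targets, samples):
    """Pick the target element with the highest combined score."""
    return max(targets, key=lambda t: combined(source_attr, t, samples))


# Hypothetical sample values for CDs.Album and two BooksAndMusic attributes.
SAMPLES = {
    "Album": ["Kind of Blue", "Abbey Road"],
    "Title": ["Abbey Road", "Dune"],
    "Author": ["Herbert", "Beatles"],
}
```

On names alone, "Album" looks closer to "Author" than to "Title"; the shared data value tips `best_match("Album", ["Title", "Author"], SAMPLES)` toward "Title". This is the brittleness of a single technique that the slide warns about.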
29
Philosophy of Solutions

Effective schema matching requires a principled
combination of techniques.
Like human experts, the matcher should improve
over time
LSD:
Mapping data sources to a mediated schema.
Use a few mappings as training examples to learn hypotheses for elements of the mediated schema.
See Doan et al., SIGMOD-2001, MLJ-2003.
Next step: corpus-based matching.

30
Corpus-Based Matching
Collection of schemas and mappings
31
Mapping Knowledge Base
Data Instances Learner
Structure Learner
Name Learner
Data Type Learner
Description Learner
Meta Learner
[Figure: for each corpus element C1 ... CN, the learners NL, DIL, DTL, DL, and SL and the meta learner ML feed the Mapping Knowledge Base.]
32
Preliminary results: corpus is useful
33
With and without the corpus
34
Outline