Data Integration: The Teenage Years

1
Data Integration: The Teenage Years
  • Alon Halevy (Google)
  • Anand Rajaraman (Kosmix)
  • Joann Ordille (Avaya)
  • VLDB 2006

2
Agenda
  • A few perspectives on the last 10 years: technical, commercial
  • Perspectives from our personal paths
  • Wild speculations about the future
  • This is not a survey on data integration
  • (See the paper in the proceedings for another
    non-survey)

3
Acknowledgements
  • Other members of the Information Manifold
    Project
  • Jaewoo Kang (NCSU, Korea Univ.)
  • Divesh Srivastava (AT&T Labs)
  • Shuky Sagiv (Hebrew U.)
  • Tom Kirk

4
Acknowledgements
  • To the SIGMOD 1996 Program committee
  • For rejecting the earlier version of the paper.

5
Timeline
(Timeline figure: 1995 to 2006)
6
Data Integration
7
The Information Manifold
  • Goal: integrate data from multiple sources on the web
  • Find the Woody Allen movies playing in my area,
    and their reviews
  • Need to describe the data sources
  • Contents, constraints, access patterns

8
Design time: mediated schema, semantic mappings
Run time: query reformulation, optimization and execution
9
Semantic Mappings, a.k.a. Source Descriptions
Mediated Schema
  CD: ASIN, Title, Genre, ...
  Artist: ASIN, name, ...
Mappings: logic
Sources
  Books: Title, ISBN, Price, DiscountPrice, Edition
  CDs: Album, ASIN, Price, DiscountPrice, Studio
  Authors: ISBN, FirstName, LastName
  Artists: ASIN, ArtistName, GroupName
  BookCategories: ISBN, Category
  CDCategories: ASIN, Category
10
Global-as-View (GAV)
Mapping
  CD(A,T,G) :- R1(A,T,G)
  CD(A,T,G) :- R2(A,T), R3(T,G)
Mediated Schema
  CD: ASIN, Title, Genre, ...
  Artist: ASIN, name, ...
Sources: R1, R2, R3, R4, R5
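To make the GAV idea concrete, here is a minimal Python sketch that unfolds the mediated relation CD into the sources, following the two rules CD(A,T,G) :- R1(A,T,G) and CD(A,T,G) :- R2(A,T), R3(T,G). The source contents are invented placeholders, and the function name cd_gav is hypothetical; nothing here comes from the Information Manifold code.

```python
# Hypothetical source contents (placeholders, not real catalog data).
R1 = {("A1", "Title One", "Jazz")}      # R1(ASIN, Title, Genre)
R2 = {("A2", "Title Two")}              # R2(ASIN, Title)
R3 = {("Title Two", "Rock")}            # R3(Title, Genre)


def cd_gav(r1, r2, r3):
    """Evaluate CD as the union of the two GAV rule bodies."""
    from_r1 = set(r1)                                    # CD(A,T,G) :- R1(A,T,G)
    from_join = {(a, t, g)                               # CD(A,T,G) :- R2(A,T), R3(T,G)
                 for (a, t) in r2
                 for (t2, g) in r3
                 if t == t2}
    return from_r1 | from_join


cd = cd_gav(R1, R2, R3)
```

Under GAV, a query over CD is answered by substituting these definitions, which is why query reformulation is straightforward but adding a new source means revisiting the mediated-schema definitions.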
11
Local-as-View (LAV)
Mapping
  R1(A,T,G) :- CD(A,T,G,Y), Artist(A,N), Y < 1970
  R2(A,T) :- CD(A,T,"French",Y)
Mediated Schema
  CD: ASIN, Title, Genre, Year
  Artist: ASIN, Name, ...
Sources: R1, R2, R3, R4, R5
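A small Python sketch of what the LAV descriptions say: each source is described as a view over the mediated schema, here R1(A,T,G) :- CD(A,T,G,Y), Artist(A,N), Y < 1970 and R2(A,T) :- CD(A,T,"French",Y). The mediated instance below is an invented placeholder, and the function names are hypothetical. Under the usual open-world reading of LAV, a real source may hold only a subset of what the view computes.

```python
# Hypothetical mediated-schema instance (placeholder data).
CD = {
    ("A1", "Title One", "Jazz", 1959),
    ("A2", "Title Two", "French", 1965),
    ("A3", "Title Three", "French", 2001),
}
ARTIST = {("A1", "Artist One"), ("A2", "Artist Two"), ("A3", "Artist Three")}


def r1_view(cd, artist):
    """R1(A,T,G) :- CD(A,T,G,Y), Artist(A,N), Y < 1970."""
    known = {a for (a, _name) in artist}
    return {(a, t, g) for (a, t, g, y) in cd if a in known and y < 1970}


def r2_view(cd):
    """R2(A,T) :- CD(A,T,"French",Y)."""
    return {(a, t) for (a, t, g, _y) in cd if g == "French"}


r1 = r1_view(CD, ARTIST)
r2 = r2_view(CD)
```

Because the descriptions are written per source, adding a source only requires one new view; the cost moves to query time, where the system has to answer queries using views.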
12
Query Answering in LAV: Answering Queries Using Views
  • Given a set of views V1, ..., Vn,
  • And a query Q,
  • Can we answer Q using only the answers to V1, ..., Vn?

13
AQUV (I)
  • [Larson et al., 85 & 87], [Tsatalos et al., 94],
    [Chaudhuri et al., 95], ...
  • Focus on AQUV for
  • Query optimization
  • Supporting physical data independence
  • Every commercial DBMS supports AQUV.

14
AQUV (II)
  • AQUV for data integration
  • Find maximally contained rewriting
  • Not necessarily equivalent rewriting
  • Algorithms
  • Bucket algorithm [LRO, 96]
  • Inverse rules [Duschka, 97]
  • MiniCon [Pottinger & Halevy, 2000]
  • Views and security [Miklau & Suciu, 04]

Survey: [Halevy, VLDB Journal, 2001]
15
Some Subsequent Results
  • Semantics of data integration
  • [Abiteboul & Duschka, 1998]: certain answers
  • Open vs. closed world assumption
  • CWA is bad complexity news!

Survey: [Lenzerini, PODS 2002]
16
Certain Answers
Mediated schema: Route(Origin, Destination)
Source 1: Origins = {SF, NY}
Source 2: Destinations = {Seattle, Seoul}
Query: Route(SF, Seattle)?
Possible databases:
  Database 1
    Origin   Destination
    SF       Seattle
    NY       Seoul
  Database 2
    Origin   Destination
    SF       Seoul
    NY       Seattle
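The example can be checked mechanically. The Python sketch below enumerates Route instances consistent with the two sources and intersects them to get the certain answers. It makes two simplifying assumptions that are not on the slide: only the values mentioned by the sources are considered, and each source is read as the exact projection of Route (a closed-world reading). The function names are hypothetical.

```python
from itertools import combinations

ORIGINS = {"SF", "NY"}               # Source 1
DESTINATIONS = {"Seattle", "Seoul"}  # Source 2


def possible_databases(origins, destinations):
    """All Route instances whose projections equal the two sources."""
    candidates = sorted((o, d) for o in origins for d in destinations)
    databases = []
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            route = set(subset)
            if ({o for o, _ in route} == origins
                    and {d for _, d in route} == destinations):
                databases.append(route)
    return databases


def certain_answers(databases):
    """A tuple is certain only if it appears in every possible database."""
    return set.intersection(*databases) if databases else set()


dbs = possible_databases(ORIGINS, DESTINATIONS)
certain = certain_answers(dbs)
is_certain = ("SF", "Seattle") in certain
```

Both databases shown on the slide are among the enumerated ones, and since they share no tuple, Route(SF, Seattle) is possible but not certain.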
17
Some Subsequent Results
  • Limitations due to binding patterns
  • Input title, get book info [Rajaraman et al., 95]
  • Additional query processing capabilities
  • Form applies multiple predicates
  • Disjunction, negation in sources.
  • Ordering sources, probabilistic mappings
  • [Florescu et al., 97], [Doan et al.], [Dong et al.]
  • GLAV [Millstein et al., 99]

Survey: [Lenzerini, PODS 2002]
18
A word on Description Logics
  • Selecting relevant sources requires reasoning.
  • Description logics to the rescue
  • [Catarci and Lenzerini, 93]
  • Information Manifold
  • Combined the Classic DL with Datalog (CARIN)
  • See AAAI-96 (not SIGMOD)
  • Brought DL and DB closer together.
  • A very active area of research today.

19

(Timeline figure: 1995 to 2006; label: XML)
20
XML and Semi-structured Data
  • TSIMMIS: semi-structured data for integration.
  • XML whetted the integration appetites
  • We have the syntax
  • Now just solve the silly semantics problems
  • Don't bother: we'll all standardize on DTDs.
  • XML will have a significant role in the data
    integration industry and research.

21

(Timeline figure: 1995 to 2006; label: XML)
22
Back in the Lab
  • Two observations
  • Who's going to write all these LAV/GAV formulas?
  • This was the bottleneck.
  • Once we have mappings, how can we execute
    queries?
  • Traditional plan-then-execute doesn't work.

23
Semantic Mappings
Databases: Inventory Database A, Inventory Database B
Tables
  Books: Title, ISBN, Price, DiscountPrice, Edition
  Authors: ISBN, FirstName, LastName
  BookCategories: ISBN, Category
  CDs: Album, ASIN, Price, DiscountPrice, Studio
  CDCategories: ASIN, Category
  Artists: ASIN, ArtistName, GroupName
  BooksAndMusic: Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords
Standards are great, but there are too many of them.
24
Techniques for Schema Mapping (survey by Rahm and Bernstein, VLDBJ 2001)
  • Compare schema elements based on
  • Names (or n-grams)
  • Data types and instances
  • Text descriptions, integrity constraints
  • Combine multiple techniques
  • MOMIS, Cupid, LSD, COMA
  • Create mappings from matches
  • Clio @ IBM [Miller]
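As an illustration of name-based matching with n-grams, the sketch below scores two schema element names by the Jaccard overlap of their character trigrams. It is a generic textbook-style measure, not the method of any of the systems listed above, and the function names are hypothetical. The candidate list reuses the BooksAndMusic column names from the earlier Semantic Mappings slide.

```python
def ngrams(name, n=3):
    """Set of lowercase character n-grams of a schema element name."""
    s = name.lower()
    if len(s) < n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def name_similarity(a, b, n=3):
    """Jaccard overlap of the two names' n-gram sets, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)


def best_match(name, candidates, n=3):
    """Candidate with the highest n-gram similarity to `name`."""
    return max(candidates, key=lambda c: name_similarity(name, c, n))


BOOKS_AND_MUSIC = ["Title", "Author", "Publisher", "ItemID", "ItemType",
                   "SuggestedPrice", "Categories", "Keywords"]

price_match = best_match("DiscountPrice", BOOKS_AND_MUSIC)
category_match = best_match("Category", BOOKS_AND_MUSIC)
isbn_vs_itemid = name_similarity("ISBN", "ItemID")
```

Names alone find DiscountPrice/SuggestedPrice and Category/Categories, but give ISBN and ItemID a score of zero, which is one reason to combine name matching with data types, instances, and other cues.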

25
A Machine Learning Approach [Doan et al., 2001],
ACM Distinguished Dissertation 2003
(Figure: mediated schema; given matches, predict new ones)
  • Many mapping tasks are repetitive
  • Learn from previous experience
  • Build a classifier for every element of the
    mediated schema.
  • Many kinds of cues => meta-strategy learning

26
Matching Real-Estate Sources
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data
  location: Miami, FL; Boston, MA; ...
  listed-price: 250,000; 110,000; ...
  phone: (305) 729 0831; (617) 253 1429; ...
  comments: Fantastic house; Great location; ...
Learned hypotheses
  If "phone" occurs in the name => agent-phone
  If "fantastic" & "great" occur frequently in data values => description
homes.com data
  price: 550,000; 320,000; ...
  contact-phone: (278) 345 7215; (617) 335 2315; ...
  extra-info: Beautiful yard; Great beach; ...
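The learn-then-predict loop can be sketched in a few lines of Python. This is a deliberately simple stand-in, not the learner from the cited work: it builds one feature set per mediated-schema element from the realestate.com columns (given matches), then labels each homes.com column with the element whose features overlap most. The featurization (lowercased tokens with digits replaced by 9) and all function names are assumptions made for illustration; the sample values are the ones on the slide.

```python
import re


def features(values):
    """Lowercased tokens with every digit replaced by '9' (a crude shape feature)."""
    feats = set()
    for value in values:
        for token in value.lower().split():
            feats.add(re.sub(r"\d", "9", token))
    return feats


def train(labeled_columns):
    """Build one feature set per mediated-schema element from given matches."""
    model = {}
    for element, values in labeled_columns:
        model.setdefault(element, set()).update(features(values))
    return model


def predict(model, values):
    """Mediated-schema element whose features best overlap (Jaccard) the column."""
    f = features(values)

    def jaccard(g):
        union = f | g
        return len(f & g) / len(union) if union else 0.0

    return max(model, key=lambda element: jaccard(model[element]))


# Given matches: realestate.com columns labeled with mediated-schema elements.
model = train([
    ("address", ["Miami, FL", "Boston, MA"]),
    ("price", ["250,000", "110,000"]),
    ("agent-phone", ["(305) 729 0831", "(617) 253 1429"]),
    ("description", ["Fantastic house", "Great location"]),
])

# Predict new matches for homes.com columns.
predictions = {
    "price": predict(model, ["550,000", "320,000"]),
    "contact-phone": predict(model, ["(278) 345 7215", "(617) 335 2315"]),
    "extra-info": predict(model, ["Beautiful yard", "Great beach"]),
}
```

A real system would combine several learners (names, value formats, word frequencies) and weigh their votes, which is the meta-strategy idea mentioned on the previous slide.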
27
Reference Reconciliation: To Join or Not to Join?
  • Many ways to refer to the same object in the
    world
  • IBM, International Business Machines
  • Alon Levy, Alon Halevy
  • Automated methods are a necessity
  • Can't go through all the data manually
  • Very active area in ML, KDD, DB, UAI, ...
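A toy Python sketch of automated reference reconciliation for the two examples above. It combines an acronym test with a character-level similarity from the standard library's difflib.SequenceMatcher. The 0.85 threshold and the function names are arbitrary assumptions; practical systems use far richer evidence than string similarity.

```python
from difflib import SequenceMatcher


def is_acronym(short, long_name):
    """True if `short` equals the initials of the words in `long_name`."""
    initials = "".join(word[0] for word in long_name.split()).upper()
    return short.upper() == initials


def same_reference(a, b, threshold=0.85):
    """Guess whether two strings refer to the same real-world object."""
    if is_acronym(a, b) or is_acronym(b, a):
        return True
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold


ibm = same_reference("IBM", "International Business Machines")
alon = same_reference("Alon Levy", "Alon Halevy")
different = same_reference("Alon Levy", "Anand Rajaraman")
```

Deciding that two references match is what licenses joining their records, hence the slide's question of whether to join or not.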

28
Query Processing: To Plan or to Execute?
  • In addition to distributed query processing
    issues
  • Few statistics, if any.
  • Network behavior issues: latency, burstiness, ...
  • Garlic @ IBM
  • Adaptive query processing
  • Stonebraker saw it coming in Ingres.
  • Revivals by Graefe (1993) and DeWitt (1998).
  • Query scrambling [Urhan & Franklin]
  • Eddies [Avnur & Hellerstein]
  • Convergent query processing [Ives et al.]
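To give a feel for adaptive query processing when no statistics are available up front, the sketch below reorders predicates per tuple based on pass rates observed so far. It is loosely inspired by the idea of routing tuples adaptively and is not the Eddies algorithm or any of the cited systems; the prior counts, predicate names, and function name are assumptions.

```python
def adaptive_filter(tuples, predicates):
    """Apply every predicate to each tuple, trying the most selective one first.

    `predicates` maps a name to a boolean function. Selectivity is estimated
    at run time from the tuples seen so far, starting from a neutral prior.
    """
    stats = {name: {"passed": 1, "seen": 2} for name in predicates}
    output = []
    for t in tuples:
        # Lowest observed pass rate first, so failing tuples are dropped early.
        order = sorted(predicates,
                       key=lambda n: stats[n]["passed"] / stats[n]["seen"])
        for name in order:
            stats[name]["seen"] += 1
            if predicates[name](t):
                stats[name]["passed"] += 1
            else:
                break
        else:
            output.append(t)  # the tuple passed every predicate
    return output, stats


predicates = {
    "is_even": lambda x: x % 2 == 0,
    "at_least_90": lambda x: x >= 90,
}
result, stats = adaptive_filter(range(100), predicates)
```

The result is the same as with any fixed predicate order; only the amount of work changes, because the highly selective predicate is soon tried first.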

29

(Timeline figure: 1995 to 2006; label: XML)
30
Commercialization
  • Late 90s: anything goes.
  • Want money from VCs?
  • Say "XML" 3 times loud and clear.
  • Academia at the forefront
  • Nimble (UW), Cohera (Berkeley), Enosys (UCSD), ...
  • Big companies took notice
  • Some faster than others

31
Commercialization Retrospective (see
Panel-of-Experts, SIGMOD 05)
  • Uphill battle vs. the warehousing folks
  • Virtual integration was more pay-as-you-go
  • Another battle with the EAI folks
  • Should really be a symbiosis there.
  • Go vertical or horizontal?
  • Obvious: go vertical if you can find the right
    one.
  • The technology worked
  • But it's all in the timing

32
After $30M
(Figure: Nimble architecture.
Front-End: Lens Builder, User Applications, Lens File, InfoBrowser, Software Developers Kit, NIMBLE APIs, Management Tools.
Integration Layer: Nimble Integration Engine, Metadata Server, Compiler, Executor, Cache, Security Tools, Common XML View, Integration Builder, Concordance Developer, Data Administrator.)
33

(Timeline figure: 1995 to 2006; labels: NASDAQ, XML)
34
So Back in the Lab
  • Model management
  • Peer data management systems
  • Data exchange

35
Model Management [Bernstein et al.]
  • Generic infrastructure for managing schemas and
    mappings
  • Manipulate models and mappings as bulk objects
  • Operators to create & compose mappings, merge &
    diff models
  • Short operator scripts can solve schema
    integration, schema evolution, reverse
    engineering, etc.
  • First challenge: semantics of operators.

36
Peer Data Management Systems
(Figure: a network of peers, UW (Wisconsin), Stanford, Berkeley, DBLP, CiteSeer, UW (Washington), and UW (Waterloo), connected by LAV and GLAV mappings)
37
PDMS-Related Projects
  • Piazza (Washington)
  • Hyperion (Toronto)
  • PeerDB (Singapore)
  • Local relational models (Trento, Toronto)
  • Active XML (INRIA)
  • Edutella (Hannover, Germany)
  • Semantic Gossiping (EPFL Lausanne)
  • Raccoon (UC Irvine)
  • Orchestra (U. Penn)

38
PDMS Challenges
  • Semantics
  • Careful about cycles
  • Optimization
  • Compose mappings
  • Prune paths
  • Manage networks
  • Consistency
  • Quality
  • Caching

(Figure: peer network of UW (Wisconsin), Stanford, Berkeley, DBLP, UW (Washington), CiteSeer, and UW (Waterloo))
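The cycle issue can be illustrated with a short Python sketch that follows mappings outward from one peer and records a chain of mappings to each reachable peer. The peer names are the ones in the figure, but the edges between them are invented for illustration, and the function name is hypothetical. Each peer is expanded once, so cycles in the mapping graph cannot cause infinite reformulation.

```python
from collections import deque

# Hypothetical mapping graph: peer -> peers it has mappings to.
MAPPINGS = {
    "UW (Washington)": ["Stanford", "DBLP"],
    "Stanford": ["Berkeley", "UW (Washington)"],   # closes a cycle
    "Berkeley": ["UW (Wisconsin)"],
    "DBLP": ["CiteSeer"],
    "CiteSeer": ["UW (Waterloo)", "DBLP"],         # closes another cycle
    "UW (Wisconsin)": [],
    "UW (Waterloo)": [],
}


def mapping_paths(graph, start):
    """Breadth-first search: one shortest chain of mappings per reachable peer."""
    paths = {start: [start]}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for neighbor in graph.get(peer, []):
            if neighbor not in paths:      # already reached: skip, avoids cycles
                paths[neighbor] = paths[peer] + [neighbor]
                queue.append(neighbor)
    return paths


paths = mapping_paths(MAPPINGS, "UW (Washington)")
```

Each recorded chain is a candidate for mapping composition; skipping peers that were already reached is a very coarse form of path pruning.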
39
Data Exchange
(Figure: source schema S, target schema T, and a mapping M between them)
  • Key question: given an instance of S and a
    mapping M, create an instance for T.
  • [Fagin, Kolaitis, Popa & Tan]
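A minimal Python sketch of the key question, under an assumed mapping that is not on the slide: the single rule Emp(name) -> exists M . Mgr(name, M). The target instance is built by inventing a fresh placeholder value (a labeled null) for each unknown manager. The relation names, rule, and function name are all hypothetical.

```python
from itertools import count


def exchange(source_emp, fresh=None):
    """Build a target Mgr instance from a source Emp instance.

    Applies the assumed rule Emp(name) -> exists M . Mgr(name, M) once per
    source tuple, using labeled nulls N1, N2, ... for the unknown managers.
    """
    if fresh is None:
        fresh = count(1)
    target_mgr = set()
    for (name,) in sorted(source_emp):
        target_mgr.add((name, f"N{next(fresh)}"))   # N<i> is a labeled null
    return target_mgr


source = {("Alice",), ("Bob",)}
target = exchange(source)
```

The nulls record that a manager exists without committing to who it is, so the target instance says no more than the mapping requires.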

40

(Timeline figure: 1995 to 2006; label: XML)
41

(Timeline figure: 1995 to 2006; labels: XML, ?)
42
2006 Status Report: The People Angle
  • Joann @ Avaya
  • Integrating communications into business
    processes
  • Anand @ Kosmix
  • Creating a new kind of search company
  • Alon @ Google
  • Working for Joann's old boss
  • Deep web evangelist
  • Pondering data management for the masses

43
2006 Status Report: Enterprise Angle
  • Enterprise Information Integration is
    established
  • IBM, BEA, Oracle, MetaMatrix, Composite,
    Actuate, ...
  • Impact on design tools
  • IBM Rational Data Architect
  • ADO.NET v. 3

44
Forrester Says
  • "Enterprises are facing the growing challenges of
    using disparate sources of data managed by
    different applications, including problems with
    data integration, security, performance,
    availability and quality.... New technology is
    emerging that Forrester has coined "information
    fabric," a term defined as a virtualized data
    layer that integrates heterogeneous data and
    content repositories in real time.... The
    potential benefits of this technology are so
    great that enterprises should develop a strategy
    to leverage information fabric technology as it
    becomes more widely available."

45
2006 Status Report: Web Angle
  • Vertical search engines: one domain
  • At scale: need even better source descriptions
  • Deep web can be surfaced
  • Terminology: data integration = mashups!

46
  • Wikipedia
  • A mashup is a website or Web 2.0 application that
    uses content from more than one source to create
    a completely new service. This is akin to
    transclusion.

50
Looking Ahead
  • Data management from the enterprise to the
    masses
  • Challenges
  • Databases of everything
  • Need support for collaboration
  • Help people structure their data
  • Pay-as-you-go data management

51
Pay-as-you-go Data Management
Dataspaces [Franklin, Halevy, Maier]; see PODS 2006
(Figure: benefit plotted against investment (time, cost), with one curve for dataspaces and one for data integration solutions. Artist: Mike Franklin)
52
Big Carrots
53
Reusing Human Attention
  • Principle
  • User action = statement of semantic relationship
  • Leverage actions to infer other semantic
    relationships
  • Examples
  • Providing a semantic mapping
  • Infer other mappings
  • Writing a query
  • Infer content of sources, relationships between
    sources
  • Creating a digital workspace
  • Infer relatedness of documents/sources
  • Infer co-reference between objects in the
    dataspace
  • Annotating, cutting & pasting, browsing among
    docs

54
Conclusion
  • We've done extremely well as a community!
  • Next challenge: data management and integration
    tools for the masses