Title: Data Integration: The Teenage Years
1Data Integration: The Teenage Years
- Alon Halevy (Google)
- Anand Rajaraman (Kosmix)
- Joann Ordille (Avaya)
- VLDB 2006
2Agenda
- A few perspectives on the last 10 years
- Technical, commercial
- Perspectives from our personal paths
- Wild speculations about the future
- This is not a survey on data integration
- (See the paper in the proceedings for another
non-survey)
3Acknowledgements
- Other members of the Information Manifold Project
- Jaewoo Kang (NCSU, Korea Univ.)
- Divesh Srivastava (AT&T Labs)
- Shuky Sagiv (Hebrew U.)
- Tom Kirk
4Acknowledgements
- To the SIGMOD 1996 Program committee
- For rejecting the earlier version of the paper.
5Timeline
[Timeline figure: 1995 to 2006]
6Data Integration
7The Information Manifold
- Goal: integrate data from multiple sources on the web
- Find the Woody Allen movies playing in my area, and their reviews
- Need to describe the data sources
- Contents, constraints, access patterns
8Design time: mediated schema, semantic mappings
Run time: query reformulation, optimization and execution
9Semantic Mappings, a.k.a. Source Descriptions
Mediated Schema
CD: ASIN, Title, Genre, ...
Artist: ASIN, Name, ...
logic
Books: Title, ISBN, Price, DiscountPrice, Edition
CDs: Album, ASIN, Price, DiscountPrice, Studio
Authors: ISBN, FirstName, LastName
Artists: ASIN, ArtistName, GroupName
BookCategories: ISBN, Category
CDCategories: ASIN, Category
10Global-as-View (GAV)
Mapping
CD(A,T,G) :- R1(A,T,G)
CD(A,T,G) :- R2(A,T), R3(T,G)
Mediated Schema
CD: ASIN, Title, Genre, ...
Artist: ASIN, Name, ...
Sources: R1, R2, R3, R4, R5
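A minimal Python sketch of how the two GAV rules above could be evaluated: the mediated relation CD is the union of R1 and the join of R2 with R3 on title. The source rows, the ASIN-like keys, and the function names are made up for illustration; they are not from the talk.

```python
# Hypothetical source contents (made-up rows, not from the slides).
R1 = {("B001", "Kind of Blue", "Jazz")}
R2 = {("B002", "La Vie en Rose")}
R3 = {("La Vie en Rose", "Chanson")}

def cd():
    """CD(A,T,G) :- R1(A,T,G)   and   CD(A,T,G) :- R2(A,T), R3(T,G)."""
    from_r1 = set(R1)
    # Join R2 and R3 on the shared title variable T.
    from_join = {(a, t, g) for (a, t) in R2 for (t2, g) in R3 if t == t2}
    return from_r1 | from_join

def cds_in_genre(genre):
    """A query over the mediated schema, answered by unfolding CD."""
    return sorted(t for (_, t, g) in cd() if g == genre)
```

Because GAV defines each mediated relation as a query over the sources, answering a mediated-schema query reduces to unfolding those definitions, as `cds_in_genre` does here.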
11Local-as-View (LAV)
Mapping
R1(A,T,G) :- CD(A,T,G,Y), Artist(A,N), Y < 1970
R2(A,T) :- CD(A,T,French,Y)
Mediated Schema
CD: ASIN, Title, Genre, Year
Artist: ASIN, Name, ...
Sources: R1, R2, R3, R4, R5
12Query Answering in LAV: Answering Queries Using Views
- Given a set of views V1, ..., Vn,
- And a query Q,
- Can we answer Q using only the answers to V1, ..., Vn?
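As a rough illustration of the first step most rewriting algorithms share, the Python sketch below collects, for each query subgoal, the views whose definitions mention the same mediated-schema relation. It uses the R1 and R2 descriptions from the LAV slide. This is only a coarse candidate filter under simplifying assumptions: it ignores variable mappings, constants, and comparison predicates, and it does not check containment. The function name and data layout are hypothetical.

```python
# Each view (source description) lists the mediated relations in its body.
VIEWS = {
    "R1": ["CD", "Artist"],  # R1(A,T,G) :- CD(A,T,G,Y), Artist(A,N), Y < 1970
    "R2": ["CD"],            # R2(A,T)   :- CD(A,T,French,Y)
}

def candidate_views(query_subgoals, views=VIEWS):
    """Map each query subgoal to the views whose body mentions that relation."""
    return {
        rel: sorted(v for v, body in views.items() if rel in body)
        for rel in query_subgoals
    }

# Q(T) :- CD(A,T,G,Y), Artist(A,N)
buckets = candidate_views(["CD", "Artist"])
```

A full algorithm would then combine one candidate per subgoal and keep only the combinations that are contained in the query.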
13AQUV (I)
- Larson et al., 85, 87; Tsatalos et al., 94; Chaudhuri et al., 95; ...
- Focus on AQUV for
- Query optimization
- Supporting physical data independence
- Every commercial DBMS supports AQUV.
14AQUV (II)
- AQUV for data integration
- Find maximally contained rewriting
- Not necessarily equivalent rewriting
- Algorithms
- Bucket algorithm [LRO, 96]
- Inverse rules [Duschka, 97]
- MiniCon [Pottinger and Halevy, 2000]
- Views and security [Miklau and Suciu, 04]
Survey: [Halevy, VLDB Journal, 2001]
15Some Subsequent Results
- Semantics of data integration
- [Abiteboul & Duschka, 1998]: certain answers
- Open vs. closed world assumption
- CWA is bad complexity news!
Survey: [Lenzerini, PODS 2002]
16Certain Answers
Mediated schema: Route(Origin, Destination)
Source 1: Origins = SF, NY
Source 2: Destinations = Seattle, Seoul
Query: Route(SF, Seattle)?
Possible databases:
Database 1
Origin   Destination
SF       Seattle
NY       Seoul
Database 2
Origin   Destination
SF       Seoul
NY       Seattle
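The example above can be checked by brute force. The Python sketch below enumerates every Route instance consistent with both sources and intersects them; a tuple is a certain answer only if it appears in all of them. It assumes both sources are complete and that the only constants are the four cities listed, which is a simplification chosen to keep the enumeration finite. It finds more possible databases than the two drawn on the slide, but the conclusion is the same.

```python
from itertools import combinations

ORIGINS = {"SF", "NY"}
DESTINATIONS = {"Seattle", "Seoul"}

def possible_databases():
    """Return every Route instance whose projections match both sources exactly."""
    pairs = [(o, d) for o in sorted(ORIGINS) for d in sorted(DESTINATIONS)]
    dbs = []
    for k in range(1, len(pairs) + 1):
        for routes in combinations(pairs, k):
            if ({o for o, _ in routes} == ORIGINS
                    and {d for _, d in routes} == DESTINATIONS):
                dbs.append(set(routes))
    return dbs

def certain_answers():
    """Tuples of Route that hold in every possible database."""
    return set.intersection(*possible_databases())
```

Since Database 2 on the slide does not contain (SF, Seattle), that tuple cannot be a certain answer, and the intersection comes out empty.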
17Some Subsequent Results
- Limitations due to binding patterns
- Input title, get book info [Rajaraman et al., 95]
- Additional query processing capabilities
- Form applies multiple predicates
- Disjunction, negation in sources.
- Ordering sources, probabilistic mappings
- Florescu et al., 97; Doan et al.; Dong et al.
- GLAV [Millstein et al., 99]
Survey: [Lenzerini, PODS 2002]
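To make the binding-pattern limitation concrete, here is a small Python sketch that checks whether a left-to-right plan is executable: each source call needs certain attributes to be bound already and binds new ones in return. The representation of a plan as (needs, produces) pairs is an assumption made for this illustration.

```python
def executable(plan, initially_bound):
    """Check a left-to-right plan where each step is a (needs, produces) pair."""
    bound = set(initially_bound)
    for needs, produces in plan:
        if not set(needs) <= bound:
            return False  # the source cannot be called: a required input is unbound
        bound |= set(produces)
    return True

# A source that takes a title as input and returns book info.
book_source = (["title"], ["isbn", "price"])
```

With the title supplied by the query, `executable([book_source], ["title"])` succeeds; without it, the plan needs another source that produces titles first.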
18A word on Description Logics
- Selecting relevant sources: reasoning.
- Description logics to the rescue
- Catarci and Lenzerini, 93
- Information Manifold
- Combined the Classic DL with Datalog (CARIN)
- See AAAI-96 (not SIGMOD)
- Brought DL and DB closer together.
- A very active area of research today.
19 XML
[Timeline figure: 1995 to 2006]
20XML and Semi-structured Data
- TSIMMIS: semi-structured data for integration.
- XML whetted the integration appetites
- We have the syntax
- Now just solve the silly semantics problems
- Don't bother: we'll all standardize on DTDs.
- XML will have a significant role in the data integration industry and research.
21 XML
[Timeline figure: 1995 to 2006]
22Back in the Lab
- Two observations
- Who's going to write all these LAV/GAV formulas?
- This was the bottleneck.
- Once we have mappings, how can we execute queries?
- Traditional plan-then-execute doesn't work.
23Semantic Mappings
Books: Title, ISBN, Price, DiscountPrice, Edition
Authors: ISBN, FirstName, LastName
BooksAndMusic: Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords
BookCategories: ISBN, Category
CDCategories: ASIN, Category
CDs: Album, ASIN, Price, DiscountPrice, Studio
Artists: ASIN, ArtistName, GroupName
Inventory Database A
Inventory Database B
Standards are great, but there are too many of them.
24Techniques for Schema Mapping (Survey by Rahm and Bernstein, VLDBJ 2001)
- Compare schema elements based on
- Names (or n-grams)
- Data types and instances
- Text descriptions, integrity constraints
- Combine multiple techniques
- MOMIS, Cupid, LSD, COMA
- Create mappings from matches
- Clio @ IBM (Miller)
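One of the cues listed above, comparing element names by n-grams, can be sketched in a few lines of Python. The sketch uses Jaccard overlap of character trigrams; the choice of trigrams, the Jaccard measure, and the function names are assumptions for illustration, not the method of any particular system named on the slide.

```python
def ngrams(name, n=3):
    """Set of lowercase character n-grams of a schema element name."""
    s = name.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def name_similarity(a, b, n=3):
    """Jaccard overlap of the n-gram sets of two element names."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def best_match(element, candidates):
    """Candidate whose name is most similar to the given element."""
    return max(candidates, key=lambda c: name_similarity(element, c))
```

On the inventory schemas, `best_match("DiscountPrice", ["SuggestedPrice", "Keywords", "ItemID"])` picks `SuggestedPrice`, which shows both the appeal and the limits of name-based cues: the names are similar, but whether the two prices mean the same thing still needs other evidence.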
25A Machine Learning Approach (Doan et al., 2001; ACM Distinguished Dissertation 2003)
Mediated schema
Given matches
Predict new ones
- Many mapping tasks are repetitive
- Learn from previous experience
- Build a classifier for every element of the mediated schema.
- Many kinds of cues => meta-strategy learning
26Matching Real-Estate Sources
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com
listed-price: 250,000; 110,000; ...
location: Miami, FL; Boston, MA; ...
phone: (305) 729 0831; (617) 253 1429; ...
comments: Fantastic house; Great location; ...
Learned hypotheses
If "phone" occurs in the name => agent-phone
If "fantastic" and "great" occur frequently in data values => description
homes.com
price: 550,000; 320,000; ...
contact-phone: (278) 345 7215; (617) 335 2315; ...
extra-info: Beautiful yard; Great beach; ...
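The two learned hypotheses above can be applied to the homes.com columns with a short Python sketch. Here the hypotheses are hand-coded rules standing in for trained classifiers, and the `min_share` threshold (the fraction of values that must contain a cue word) is an assumption chosen for this example.

```python
def predict(column_name, values, cue_words=("fantastic", "great"), min_share=0.5):
    """Apply the slide's two hypotheses to one source column."""
    # Hypothesis 1: "phone" occurs in the element name.
    if "phone" in column_name.lower():
        return "agent-phone"
    # Hypothesis 2: cue words occur frequently in the data values.
    hits = sum(any(w in v.lower() for w in cue_words) for v in values)
    if values and hits / len(values) >= min_share:
        return "description"
    return None  # no hypothesis fires

homes = {
    "price": ["550,000", "320,000"],
    "contact-phone": ["(278) 345 7215", "(617) 335 2315"],
    "extra-info": ["Beautiful yard", "Great beach"],
}
matches = {col: predict(col, vals) for col, vals in homes.items()}
```

With only these two rules, `price` stays unmatched; a fuller matcher would need a learner for every mediated-schema element, as the previous slide notes.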
27Reference Reconciliation: To Join or Not to Join?
- Many ways to refer to the same object in the world
- IBM, International Business Machines
- Alon Levy, Alon Halevy
- Automated methods are a necessity
- Can't go through all the data manually
- Very active area in ML, KDD, DB, UAI, ...
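A minimal Python sketch of an automated check for the two examples above: an acronym test covers "IBM", and a string-similarity test from the standard library's `difflib` covers the name variants. The 0.85 threshold and the function names are assumptions; real reconciliation systems combine many more signals.

```python
from difflib import SequenceMatcher

def acronym(name):
    """Initial letters of each word, uppercased."""
    return "".join(word[0] for word in name.split()).upper()

def same_entity(a, b, threshold=0.85):
    """Guess whether two strings refer to the same real-world object."""
    if a.upper() == acronym(b) or b.upper() == acronym(a):
        return True
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
```

Both pairs from the slide are accepted under these settings, while an unrelated pair such as "Alon Levy" and "IBM" is rejected.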
28Query Processing: To Plan or to Execute?
- In addition to distributed query processing issues
- Few statistics, if any.
- Network behavior issues: latency, burstiness, ...
- Garlic @ IBM
- Adaptive query processing
- Stonebraker saw it coming in Ingres.
- Revivals by Graefe (1993) and DeWitt (1998).
- Query scrambling [Urhan & Franklin]
- Eddies [Avnur & Hellerstein]
- Convergent query processing [Ives et al.]
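The idea behind adaptive processing, reordering work at run time from observed behavior instead of relying on statistics collected up front, can be illustrated with a toy Python sketch. It routes each tuple through a set of filters, trying first the filter that has rejected the largest share of tuples so far. This is loosely in the spirit of per-tuple routing and is not the algorithm from any of the systems named above; it handles only selections, not joins.

```python
def adaptive_filter(tuples, predicates):
    """Route each tuple through the predicates, most selective (so far) first."""
    seen = [0] * len(predicates)    # tuples each predicate has examined
    passed = [0] * len(predicates)  # tuples each predicate has let through
    out = []
    for t in tuples:
        # Re-rank for every tuple using observed pass rates (lower goes first).
        order = sorted(range(len(predicates)),
                       key=lambda i: passed[i] / seen[i] if seen[i] else 0.0)
        for i in order:
            seen[i] += 1
            if not predicates[i](t):
                break  # tuple rejected; skip the remaining predicates
            passed[i] += 1
        else:
            out.append(t)  # tuple passed every predicate
    return out
```

The result is the same for any ordering of the predicates; only the amount of work changes as the ordering adapts.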
29 XML
[Timeline figure: 1995 to 2006]
30Commercialization
- Late 90s: anything goes.
- Want money from VCs?
- Say XML 3 times, loud and clear.
- Academia at the forefront
- Nimble (UW), Cohera (Berkeley), Enosys (UCSD), ...
- Big companies took notice
- Some faster than others
31Commercialization Retrospective (See Panel-of-Experts, SIGMOD 05)
- Uphill battle vs. the warehousing folks
- Virtual integration was more pay-as-you-go
- Another battle with the EAI folks
- Should really be a symbiosis there.
- Go vertical or horizontal?
- Obvious: go vertical if you can find the right one.
- The technology worked
- But it's all in the timing
32After 30M
[Architecture diagram of the Nimble system. Labels: Front-End, Lens Builder, User Applications, Lens File, InfoBrowser, Software Developers Kit, NIMBLE APIs, Management Tools, Integration Layer, Nimble Integration Engine, Metadata Server, Compiler, Executor, Cache, Security Tools, Common XML View, Integration Builder, Concordance Developer, Data Administrator]
33 NASDAQ
XML
[Timeline figure: 1995 to 2006]
34So Back in the Lab
- Model management
- Peer data management systems
- Data exchange
35Model Management (Bernstein et al.)
- Generic infrastructure for managing schemas and mappings
- Manipulate models and mappings as bulk objects
- Operators to create and compose mappings, merge and diff models
- Short operator scripts can solve schema integration, schema evolution, reverse engineering, etc.
- First challenge: semantics of operators.
36Peer Data Management Systems
[Figure: a network of peers (UW (Wisconsin), Stanford, Berkeley, DBLP, CiteSeer, UW (Washington), UW (Waterloo)) with mappings labeled LAV, GLAV]
37PDMS-Related Projects
- Piazza (Washington)
- Hyperion (Toronto)
- PeerDB (Singapore)
- Local relational models (Trento, Toronto)
- Active XML (INRIA)
- Edutella (Hannover, Germany)
- Semantic Gossiping (EPFL Lausanne)
- Raccoon (UC Irvine)
- Orchestra (U. Penn)
38PDMS Challenges
- Semantics
- Careful about cycles
- Optimization
- Compose mappings
- Prune paths
- Manage networks
- Consistency
- Quality
- Caching
[Figure: peers UW (Wisconsin), Stanford, Berkeley, DBLP, UW (Washington), CiteSeer, UW (Waterloo)]
39Data Exchange
[Figure: source schema S, target schema T, mapping M]
- Key question: given an instance of S and a mapping, create an instance for T.
- Fagin, Kolaitis, Popa & Tan
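A toy Python sketch of the key question above: given a source instance and a mapping, materialize a target instance, inventing placeholder values (labeled nulls) where the mapping says a value exists but the source does not supply it. The schema (a source relation Emp(name) and a target relation Mgr(name, manager)) and the `_N` naming of nulls are hypothetical, chosen only for illustration.

```python
from itertools import count

def exchange(source_emps):
    """Mapping: Emp(name) -> exists m. Mgr(name, m).

    Each employee gets a fresh labeled null for the unknown manager.
    """
    fresh = count(1)
    return {(name, f"_N{next(fresh)}") for name in sorted(source_emps)}

target = exchange({"ann", "bob"})
```

Each null is distinct, so the target instance does not claim that two employees share a manager, a fact the source never stated.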
40 XML
[Timeline figure: 1995 to 2006]
41 XML
?
[Timeline figure: 1995 to 2006]
42 2006 Status Report: The People Angle
- Joann @ Avaya
- Integrating communications into business processes
- Anand @ Kosmix
- Creating a new kind of search company
- Alon @ Google
- Working for Joann's old boss
- Deep web evangelist
- Pondering data management for the masses
43 2006 Status Report: Enterprise Angle
- Enterprise Information Integration is established
- IBM, BEA, Oracle, MetaMatrix, Composite, Actuate, ...
- Impact on design tools
- IBM Rational Data Architect
- ADO.NET v. 3
44Forrester Says
- "Enterprises are facing the growing challenges of
using disparate sources of data managed by
different applications, including problems with
data integration, security, performance,
availability and quality.... New technology is
emerging that Forrester has coined "information
fabric," a term defined as a virtualized data
layer that integrates heterogeneous data and
content repositories in real time.... The
potential benefits of this technology are so
great that enterprises should develop a strategy
to leverage information fabric technology as it
becomes more widely available."
45 2006 Status Report: Web Angle
- Vertical search engines: one domain
- At scale: need even better source descriptions
- Deep web can be surfaced
- Terminology: Data integration mashups!
46Wikipedia
- "A mashup is a website or Web 2.0 application that
uses content from more than one source to create
a completely new service. This is akin to
transclusion."
50Looking Ahead
- Data management from the enterprise to the masses
- Challenges
- Databases of everything
- Need support for collaboration
- Help people structure their data
- Pay-as-you-go data management
51Pay-as-you-go Data Management
Dataspaces [Franklin, Halevy, Maier]; see PODS 2006
[Figure: benefit versus investment (time, cost), with curves for dataspaces and for data integration solutions. Artist: Mike Franklin]
52Big Carrots
53Reusing Human Attention
- Principle
- User action = statement of semantic relationship
- Leverage actions to infer other semantic relationships
- Examples
- Providing a semantic mapping
- Infer other mappings
- Writing a query
- Infer content of sources, relationships between sources
- Creating a digital workspace
- Infer relatedness of documents/sources
- Infer co-reference between objects in the dataspace
- Annotating, cutting and pasting, browsing among docs
54Conclusion
- We've done extremely well as a community!
- Next challenge: data management and integration tools for the masses