Title: Information Integration: A Status Report
1 Information Integration: A Status Report
- Alon Halevy
- University of Washington, Seattle
- IJCAI 2003
2 Mediated Schema
Entity
Sequenceable Entity
Structured Vocabulary
Gene
Phenotype
Experiment
Nucleotide Sequence
Microarray Experiment
Protein
OMIM
Swiss-Prot
HUGO
GO
GeneClinics
Entrez
LocusLink
GEO
Query: For the microarray experiment I just ran,
what are the related nucleotide sequences, and for
what protein do they code?
3 Motivation and Activity
- Application areas of data integration
- Enterprise information integration (EII)
- The government
- Data sources on the web
- Scientific data sharing.
- Several data sharing architectures
- Virtual data integration, warehousing, message passing, web services.
- Many research projects
- Mine: Information Manifold, Tukwila, LSD, Piazza.
- EII: a new industry buzzword.
4 Today's Agenda
- Recent progress
- Mediation languages
- Query processing (XML and other)
- Some lessons from the commercial world.
- Current challenges
- Enabling large-scale data sharing: peer data management systems.
- The age-old problem: semantic heterogeneity.
- A new agenda item for AI: corpus-based KR.
- AI is more vital than ever for progress here!
5 Mediation Languages
Goal
Language for Specifying Semantic Relationships
(not full FOL)
Mediated Schema
Assume data at the sources is structured (or
seems so).
6 Global-as-View (GAV)
Actor(x,y) :- R1(x,y,z)
Actor(x,y) :- R2(x,z), R3(z,y)
Mediated Schema
Title, Actor,
R1
R2
R3
R4
R5
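A minimal sketch of how a GAV mediator could answer a query over Actor: each mediated relation is defined as a view over the sources, so the query is unfolded into the two rules above. The tuples and the helper names `actor` and `actors_in` are hypothetical, chosen only for illustration.

```python
# Toy source relations (hypothetical data, not from the talk).
R1 = {("Casablanca", "Bogart", 1942)}   # R1(x, y, z): title, actor, extra column
R2 = {("Amelie", "m7")}                 # R2(x, z): title, join key
R3 = {("m7", "Tautou")}                 # R3(z, y): join key, actor


def actor():
    """Unfold the GAV definition of Actor(x, y) into source queries."""
    # Actor(x,y) :- R1(x,y,z)
    from_r1 = {(x, y) for (x, y, _z) in R1}
    # Actor(x,y) :- R2(x,z), R3(z,y)
    from_r2_r3 = {(x, y) for (x, z) in R2 for (z2, y) in R3 if z == z2}
    return from_r1 | from_r2_r3


def actors_in(title):
    """Query over the mediated schema: who acts in the given title?"""
    return {y for (x, y) in actor() if x == title}
```

Query answering here is plain view unfolding, which is why GAV reformulation is simple; the cost is that adding a source means editing the definition of Actor.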
7 Local-as-View (LAV, GLAV)
R1(x,y,z) :- Title(x,y), Actor(x,z), y < 1970
R5(x,y,z) :- Movie(x,y,"French")
Mediated Schema
Title, Actor
R1
R2
R3
R4
R5
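A minimal sketch of the LAV direction, assuming the description of R1 above: every R1 tuple implies one Title fact and one Actor fact in the mediated schema, and a query is then evaluated over those derived facts. This is only in the spirit of inverse-rule style algorithms; the data and the function names are hypothetical.

```python
# Toy contents of source R1(x, y, z): title, year, actor (hypothetical data).
# By its LAV description, R1 only holds movies with year < 1970.
R1 = {("Casablanca", 1942, "Bogart"), ("Psycho", 1960, "Perkins")}


def certain_facts(r1):
    """Derive mediated-schema facts implied by the R1 tuples."""
    title = {(x, y) for (x, y, _z) in r1}   # Title(x, y)
    actor = {(x, z) for (x, _y, z) in r1}   # Actor(x, z)
    return title, actor


def actors_before(year_limit, r1):
    """Query: Title(x, y), Actor(x, z), y < year_limit."""
    title, actor = certain_facts(r1)
    return {(x, z)
            for (x, y) in title if y < year_limit
            for (x2, z) in actor if x == x2}
```

Adding a new source only requires a new description; the mediated schema and the other descriptions stay untouched, which is the usual argument for LAV.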
8 Mediation Languages: Summary
- A lot of nice theory and practical algorithms.
- Careful choice of expressive power mattered.
- Algorithms for answering queries using views are in every commercial DBMS.
- Description Logics are also an attractive formalism for mediation.
- The bottleneck is coming up with the mapping expressions.
9 Outline
- Recent progress
- Mediation languages
- Query processing (XML and other)
- Some lessons from the commercial world.
- Current challenges
- Enabling large-scale data sharing: peer data management systems.
- The age-old problem: semantic heterogeneity.
- A new agenda item for AI: corpus-based KR.
10 Adaptive Query Processing
- Problem: no statistics, and the network is unstable
- Cannot plan and then execute
- Need to adapt the plan during execution.
- Ideas already in
- Ingres (1976) (early database system)
- Interleaving planning and execution (AI)
- Key question: when to adapt, and at what granularity
- For every tuple? At materialization points?
- See Ives et al. 2002 for our solution.
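A minimal sketch of adapting a plan during execution, assuming no statistics up front: predicates are applied in some initial order, pass rates are observed, and the order is revised at batch boundaries (a stand-in for adaptation points). This is a generic illustration, not the algorithm of Ives et al.; `adaptive_filter` and `batch_size` are hypothetical names.

```python
def adaptive_filter(rows, predicates, batch_size=100):
    """Apply all predicates to rows, re-ordering them between batches."""
    stats = [{"seen": 0, "passed": 0} for _ in predicates]
    order = list(range(len(predicates)))   # initial plan: order as given
    out = []
    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            keep = True
            for i in order:
                stats[i]["seen"] += 1
                if predicates[i](row):
                    stats[i]["passed"] += 1
                else:
                    keep = False
                    break                   # short-circuit on first failure
            if keep:
                out.append(row)
        # Adaptation point: run the most selective predicate first.
        order.sort(key=lambda i: stats[i]["passed"] / max(stats[i]["seen"], 1))
    return out, order


rows = list(range(1000))
preds = [lambda r: r % 2 == 0, lambda r: r % 10 == 0]
result, final_order = adaptive_filter(rows, preds)
```

The result is the same under any order; only the work done changes. Adapting per batch rather than per tuple is one point on the granularity spectrum the slide asks about.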
11 Convergent Query Processing [Ives et al., 2002]
Join: In-stock, Orders, Shipping
(I ⋈ O ⋈ S)
[Figure: plan diagram; surviving labels: I, OS, IO]
12 XML Query Processing
- XML facilitates integration.
- Mediator query processor may manipulate XML directly.
- Challenges
- XML is not flat but nested: path queries.
- Can be irregular: doesn't adhere to a strict schema.
- Progress
- Defining and optimizing XQuery.
- Going back and forth between XML and relational.
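A minimal sketch of the two challenges, using Python's standard `xml.etree.ElementTree`: a path query reaches `book` elements at different nesting depths, and a missing `author` element has to be tolerated because the data follows no strict schema. The document content is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, irregular document: one book is nested in a shelf,
# and it has no author element.
DOC = """
<library>
  <book><title>Book A</title><author>Author A</author></book>
  <shelf>
    <book><title>Book B</title></book>
  </shelf>
</library>
"""

root = ET.fromstring(DOC)

# Path query: titles of all books, at any depth.
titles = [t.text for t in root.findall(".//book/title")]

# Irregular data: author may be absent, so ask for it with a default.
authors = [b.findtext("author", default=None) for b in root.iter("book")]
```

Flattening this into relations would need either nullable columns or separate tables per nesting level, which is the "back and forth" problem in miniature.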
13 The Commercial World
- Some startups
- Nimble, MetaMatrix, Calixa, Composite, Enosys
- Big guys making announcements
- IBM, BEA, MS, (Oracle still being defiant).
- Integration technology in different layers
- E.g., reporting companies want it (Actuate)
- Progress: analysts now have a buzzword (EII).
- Challenges
- Integration with EAI?
- Yet another middleware?
- Horizontal vs. vertical?
14 What Worked?
- Performance was not an issue.
- Tools, tools, tools
- For managing sources and creating mediated schemas.
- XML query processing was needed.
- Concordance: need common keys to join sources
- Active research area!
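A minimal sketch of the concordance point: two sources identify the same entity with different keys, so a join needs an explicit table of key pairs. All source names, keys and values here are hypothetical.

```python
# Two hypothetical sources that key the same customers differently.
crm = {"C-17": {"name": "Acme Corp"}, "C-42": {"name": "Globex"}}
billing = {"9001": {"balance": 250.0}, "9002": {"balance": 0.0}}

# Concordance table: (crm key, billing key) pairs for the same entity.
concordance = [("C-17", "9001"), ("C-42", "9002")]


def join_via_concordance(left, right, pairs):
    """Join two keyed sources through a table of corresponding keys."""
    rows = []
    for left_key, right_key in pairs:
        if left_key in left and right_key in right:
            rows.append({**left[left_key], **right[right_key]})
    return rows


joined = join_via_concordance(crm, billing, concordance)
```

Building and maintaining the concordance table is the hard part; the join itself is trivial once it exists.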
15 Outline
- Recent progress
- Mediation languages
- Query processing (XML and other)
- Some lessons from the commercial world.
- Current challenges
- Enabling large-scale data sharing: peer data management systems.
- The age-old problem: semantic heterogeneity.
- A new agenda item for AI: corpus-based KR.
16 Limitations of Mediated Schema
Mediated Schema
17 Peer Data Management
- PDMS: a network of peers (data sources)
- Peers can
- Export base data, or combinations of data
- Serve as logical mediators for other peers
- A peer can be both a server and a client.
- Semantic relationships are specified locally (between small sets of peers).
- This is a Semantic Web (from a different angle)
18 Network of Mappings (Piazza)
CiteSeer
UW
Stanford
GAV, LAV, GLAV
DBLP
Paris
Roma
Vienna
19 Advantages of PDMS
- No need for a central mediated schema.
- Can map data opportunistically, as is most convenient.
- Queries are posed using the peer's schema. Answers come from anywhere in the system.
- Infrastructure for Semantic Web applications
- This is not P2P file sharing.
- Data has rich semantics
- Membership is not as dynamic.
20 Schema Mediation for PDMS
When can LAV and GAV be combined to form such a
network structure? (The semantics are not yet obvious.)
CiteSeer
UW
Stanford
GAV, LAV, GLAV
See ICDE-03; WWW-03 for XML
DBLP
Paris
Roma
Vienna
21 Efficient Query Answering
- Problems
- redundant paths
- expensive reformulation.
CiteSeer
UW
Stanford
- Possible solution
- Pre-compose some paths
DBLP
Paris
Roma
Vienna
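A minimal sketch of why redundant paths arise: if mappings form a graph over peers, a query posed at one peer can reach another along several mapping chains, and each chain is a separate reformulation. The edges below are hypothetical and are not the mappings shown in the figure.

```python
# Hypothetical directed mapping graph between peers.
MAPPINGS = {
    "UW": ["Stanford", "DBLP"],
    "Stanford": ["DBLP"],
    "DBLP": ["CiteSeer"],
    "CiteSeer": [],
}


def simple_paths(graph, src, dst, path=None):
    """Enumerate all cycle-free mapping chains from src to dst."""
    path = (path or []) + [src]
    if src == dst:
        return [path]
    found = []
    for nxt in graph.get(src, []):
        if nxt not in path:              # avoid revisiting a peer
            found.extend(simple_paths(graph, nxt, dst, path))
    return found


paths = simple_paths(MAPPINGS, "UW", "CiteSeer")
```

Both chains end at the same peer, so following both may duplicate work; pre-composing the frequently used chain is the kind of remedy the slide points to.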
22 Mapping Composition [Jayant Madhavan and Halevy, VLDB 2003]
- Incredibly subtle!
- In general, the composition can be an infinite set of GLAV formulas.
- Results
- Finite in many cases
- Even when infinite, it often has a finite, useful encoding.
- Hence, compositions can usually be pre-optimized.
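A minimal sketch of pre-composing a path, restricted to a far simpler setting than the one on this slide: each mapping is only a one-to-one attribute renaming, so composition is a dictionary lookup. None of the subtlety of GLAV composition appears here; the attribute names are hypothetical.

```python
def compose(m_ab, m_bc):
    """Compose two attribute-renaming mappings A->B and B->C into A->C.

    Attributes of A whose image in B has no counterpart in C are dropped.
    """
    return {a: m_bc[b] for a, b in m_ab.items() if b in m_bc}


# Hypothetical attribute correspondences between three peers.
uw_to_dblp = {"paper_title": "title", "writer": "author", "room": "office"}
dblp_to_citeseer = {"title": "doc_title", "author": "creator"}

# Pre-composed once, then reused for every query along this path.
uw_to_citeseer = compose(uw_to_dblp, dblp_to_citeseer)
```

With joins and existential variables in the mappings, the composed mapping is no longer a simple lookup table, which is where the finiteness results matter.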
23 Other Research Issues
- Intelligent data placement
- Management of mapping networks
- Improving networks: finding additional connections
- Handling inconsistencies
CiteSeer
UW
Stanford
DBLP
Leipzig
Saarbruecken
Berlin
24 PDMS-Related Projects
- Hyperion (Toronto)
- PeerDB (Singapore)
- Local relational models (Trento)
- Edutella (Hannover, Germany)
- Semantic Gossiping (EPFL, Lausanne)
- Raccoon (UC Irvine)
- Orchestra (Ives, U. Penn)
25 Outline
- Recent progress
- Mediation languages
- Query processing (XML and other)
- Some lessons from the commercial world.
- Current challenges
- Enabling large-scale data sharing: peer data management systems.
- The age-old problem: semantic heterogeneity.
- A new agenda item for AI: corpus-based KR.
26 Schema/Ontology Matching
Data Source
Consumer
Mediator
Data Source
Data Source
- Schema heterogeneity is a key roadblock for information integration
- Different data sources speak their own schema
- Mapping is key to any data sharing architecture
27 Schema Matching
Books(Title, ISBN, Price, DiscountPrice, Edition)
Authors(ISBN, FirstName, LastName)
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
Artists(ASIN, ArtistName, GroupName)
Inventory Database A
Inventory Database B
- Schema matching: discovering correspondences between similar elements
- Eventually: BooksAndMusic(x: Title, ...) = Books(x: Title, ...) ∪ CDs(x: Album, ...)
28 Typical Approaches
- Multiple sources of evidence in the schemas
- Schema element names
- BooksAndCDs/Categories ~ BookCategories/Category
- Descriptions and documentation
- ItemID: unique identifier for a book or a CD
- ISBN: unique identifier for any book
- Data types, data instances
- DateTime ≠ Integer
- Addresses have similar formats
- Schema structure
- All books have similar attributes
- Use domain knowledge
In isolation, techniques are incomplete or brittle.
Combine multiple techniques to exploit all
available evidence.
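A minimal sketch of combining two of the evidence sources listed above: element names (token overlap) and data instances (value types). The scoring functions, the equal weights and the sample values are assumptions made for illustration, not a method from the talk.

```python
import re


def tokens(name):
    """Split a CamelCase element name into lowercase tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return {p.lower() for p in parts}


def name_score(a, b):
    """Jaccard overlap of name tokens."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def type_score(values_a, values_b):
    """1.0 if the sample instances have the same Python types, else 0.0."""
    same = {type(v) for v in values_a} == {type(v) for v in values_b}
    return 1.0 if same else 0.0


def combined(a, b, values_a, values_b):
    """Equal-weight combination of name and type evidence (assumed weights)."""
    return 0.5 * name_score(a, b) + 0.5 * type_score(values_a, values_b)
```

Name evidence alone scores SuggestedPrice against DiscountPrice at 1/3 and against ISBN at 0; adding type evidence separates numeric prices from string identifiers even when names share nothing.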
29 Philosophy of Solutions
- Effective schema matching requires a principled combination of techniques.
- Like human experts, the matcher should improve over time
- LSD
- Mapping data sources to a mediated schema.
- Use a few mappings as training examples to learn hypotheses for elements of the mediated schema.
- See Doan et al., SIGMOD-2001, MLJ-2003
- Next step: corpus-based matching.
30 Corpus-Based Matching
Collection of schemas and mappings
31 Mapping Knowledge Base
Learners: Name Learner (NL), Data Instances Learner (DIL), Data Type Learner (DTL), Description Learner (DL), Structure Learner (SL), Meta Learner (ML)
[Figure: corpus elements C1 ... CN, each with learners NL, DIL, DTL, DL, SL, ML; Mapping Knowledge Base]
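A minimal sketch of a meta learner over base learners, assuming a deliberately simple scheme: each base learner is weighted by its accuracy on a few known mappings, and predictions are combined by weighted vote. This is a stand-in for illustration and is not claimed to be the combination method used in LSD; the learners and training examples are hypothetical.

```python
def learn_weights(training, learners):
    """Weight each base learner by its accuracy on known mappings."""
    weights = {}
    for name, predict in learners.items():
        correct = sum(predict(src) == target for src, target in training)
        weights[name] = correct / len(training)
    return weights


def meta_predict(src, learners, weights):
    """Weighted vote over the base learners' predicted labels."""
    votes = {}
    for name, predict in learners.items():
        label = predict(src)
        votes[label] = votes.get(label, 0.0) + weights[name]
    return max(votes, key=votes.get)


# Two toy base learners that map a source element to a mediated-schema label.
learners = {
    "name": lambda e: "price" if "price" in e["name"].lower() else "title",
    "type": lambda e: "price" if isinstance(e["sample"], float) else "title",
}

# A few known mappings used as training examples (hypothetical).
training = [
    ({"name": "DiscountPrice", "sample": 7.5}, "price"),
    ({"name": "Album", "sample": "Some Album"}, "title"),
    ({"name": "Cost", "sample": 3.0}, "price"),
]

weights = learn_weights(training, learners)
guess = meta_predict({"name": "Amount", "sample": 12.0}, learners, weights)
```

The name learner misses "Cost" during training, so the type learner earns the larger weight and wins the vote on the unseen element "Amount".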
32 Preliminary Results: Corpus Is Useful
33 With and Without the Corpus
34 Outline
- Recent progress
- Mediation languages
- Query processing (XML and other)
- Some lessons from the commercial world.
- Current challenges
- Enabling large-scale data sharing: peer data management systems.
- The age-old problem: semantic heterogeneity.
- A new agenda item for AI: corpus-based KR.
35 Corpus vs. Traditional KR
- A large corpus of uncoordinated knowledge fragments
- vs.
- Carefully designed knowledge base
- Can a corpus offer a more attractive solution for some KR problems?
36 Pause: KR vs. Corpus
- Knowledge base
- Hard to engineer, brittle at the boundaries
- Only one way of saying things.
- Corpus
- Easier to build, coverage not predefined.
- Many views of the domain.
- See proceedings for full argument.
37 Corpus-based KR
- Contents
- Schemas, ontologies, meta-data, data, queries, mappings.
- Collect statistics on the corpus
- How often does a word appear as a relation name?
- When it does, what tend to be the attribute names? What other tables are there?
- Support a KR-style interface on the corpus (OKBC-like)
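A minimal sketch of the statistics listed above, over a tiny hypothetical corpus of schemas: how often a word appears as a relation name, how often as an attribute name, and which attributes accompany a given relation name.

```python
from collections import Counter, defaultdict

# Hypothetical corpus: each schema maps relation names to attribute lists.
CORPUS = [
    {"Student": ["studentID", "GPA", "address"]},
    {"GPA": ["studentID", "value"]},
    {"Student": ["studentID", "name", "GPA"]},
]

relation_count = Counter()          # word used as a relation name
attribute_count = Counter()         # word used as an attribute name
attrs_of = defaultdict(Counter)     # attributes seen under each relation name

for schema in CORPUS:
    for relation, attributes in schema.items():
        relation_count[relation] += 1
        for attribute in attributes:
            attribute_count[attribute] += 1
            attrs_of[relation][attribute] += 1
```

Here GPA shows up once as a relation and twice as an attribute, which is the kind of evidence a KR-style interface could expose when asked how two terms are related.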
38 Other Applications of C-B-KR
- Question answering on the web
- Focused crawling
- Natural language interfaces to DBs
- Schema and ontology authoring
- Semantic query optimization.
- Whenever we need knowledge to help us rank
multiple answers/plans.
39 Example Queries
- How are two terms related?
- GPA(studentID, value)
- Student(studentID, GPA, address)
- Find different ways of saying the same thing
- Class(Lexus, Luxury)
- LuxuryCar(Lexus, Toyota)
- When do two terms play similar roles?
- IJCAIReview(p1, rev2, accept)
- AIJReferees(round2, p3, rev4, reject)
40 Challenges for C-B-KR
- Building the corpus.
- How focused should the corpus be?
- Is human tuning needed or helpful?
- How do we accommodate inference?
- How do we leverage traditional KR?
41 Summary
- The vision: data authoring, querying and sharing by everyone.
- We got the plumbing to work. To go further, we need AI techniques.
- Challenge: cross the structure chasm
- It's hard to author and query structured data!
- PDMS: architecture for ad-hoc sharing.
- Ontology/schema matching is key!
- Are we providing the right tools?
- Corpus-based knowledge representation.
- We need benchmarks!
42 Some References
- www.cs.washington.edu/homes/alon
- Piazza: ICDE-03, WWW-03, VLDB-03
- The Structure Chasm: CIDR-03
- Mediation surveys: VLDB Journal 01
- Lenzerini tutorial.
- Schema matching
- Rahm and Bernstein, VLDB Journal 01.
- Workshops: IJCAI, Semantic Web Conf.
- Teaching integration to undergraduates: SIGMOD Record, September 2003.