Title: Integrating Multiple Data Sources using a Standardized XML Dictionary
1 Integrating Multiple Data Sources using a Standardized XML Dictionary
Ramon Lawrence, University of Manitoba, umlawren_at_cs.umanitoba.ca
Supervisor: Dr. Ken Barker
TRLabs - Winnipeg
2 Outline
- Introduction, Motivation, and Background
- Integration architecture components
- Integration architecture
- Example integration
- Applications to the WWW
- Future work and conclusions
- Demonstration of Unity
3 Introduction
- Integration of data is required when accessing multiple databases within an organization or on the WWW.
- Our focus is automatically combining database schemas using schema integration.
- Schema integration requires knowledge of data semantics and use of metadata.
4 Motivation
- Organizations have several database systems which must interoperate.
- Users often access multiple Web databases whose knowledge must be integrated and presented in a useful form.
- Data warehouses and OLAP systems require data semantics to be understood and data to be cleansed and summarized.
5 Background
- Schema integration involves combining diverse database schemas into an integrated view by resolving conflicts.
- Schema conflicts include naming, structural, and semantic conflicts.
- Schema integration is required for database interoperability, but it is currently a manual process.
6 MDBS Architecture
- Global Transaction Manager (GTM)
  - processes global transactions
  - ensures information in all LDBSs is consistent
  - submits subtransactions to the GTSs for each LDBS
- Global Transaction Servers (GTSs)
  - one for each LDBS
  - converts subtransactions from the GTM into a form usable by the LDBS and vice versa
- Local Database Systems (LDBSs)
  - databases combined into MDBS
  - unchanged, as they still process local transactions
[Figure: MDBS architecture diagram. Labels: Global Transactions, GTM, subtransactions, Local Transactions]
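The division of work between the GTM and the GTSs can be sketched roughly as follows. This is only an illustration under stated assumptions: the class names, the `to_local` conversion function, and the LDBS names `ldbs1` and `ldbs2` are hypothetical, and the consistency guarantees the GTM provides are not modeled.

```python
from collections import defaultdict


class GlobalTransactionServer:
    """One GTS per LDBS: converts subtransactions into the local form."""

    def __init__(self, ldbs_name, to_local):
        self.ldbs_name = ldbs_name
        self.to_local = to_local  # hypothetical conversion function

    def submit(self, operations):
        # Convert each global operation into a form usable by the LDBS.
        return [self.to_local(op) for op in operations]


class GlobalTransactionManager:
    """Splits a global transaction into one subtransaction per LDBS."""

    def __init__(self, servers):
        self.servers = {server.ldbs_name: server for server in servers}

    def execute(self, global_transaction):
        # global_transaction is a list of (ldbs_name, operation) pairs.
        subtransactions = defaultdict(list)
        for ldbs_name, operation in global_transaction:
            subtransactions[ldbs_name].append(operation)
        return {
            name: self.servers[name].submit(ops)
            for name, ops in subtransactions.items()
        }


gtm = GlobalTransactionManager([
    GlobalTransactionServer("ldbs1", str.upper),
    GlobalTransactionServer("ldbs2", str.lower),
])
result = gtm.execute([
    ("ldbs1", "read claim"),
    ("ldbs2", "Read Payment"),
    ("ldbs1", "write claim"),
])
```

Local transactions bypass this path entirely, which is why the LDBSs can remain unchanged.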
7 Previous Work
- Research systems
  - integrating systems by logical rules (Sheth)
  - defining global dictionaries (Castano)
  - Carnot Project using the Cyc knowledge base
- Industrial systems and standards
  - Metadata Interchange Specification (MDIS)
  - XML, BizTalk, E-commerce portals
8 Architecture Objective
- The objective of our architecture is to provide a system for automatically integrating diverse relational schemas into a multidatabase.
- Desirable properties
  - individual mappings: information sources integrated one-at-a-time and independently
  - global view constructed for query transparency
  - handles schema conflicts, including semantic, structural, and naming conflicts
  - automated global integration: global view constructed efficiently and automatically
9 The Idea
- The major idea is that schema conflicts can be resolved if we
  - eliminate all naming conflicts
  - define a language capable of determining schema equivalence and performing transformations
- With these two properties, schema conflicts can be resolved automatically at the global level.
10 Architecture Components: The Global Dictionary
- A global dictionary (GD) provides standardized terms to capture data semantics.
- Hierarchy of terms related by IS-A or Has-A links
- Contains base set of common database concepts, but new concepts can be added
- A GD term is a single, unambiguous semantic definition.
- Several GD entries for a single English word are required if the word has multiple definitions.
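One way such a dictionary might be represented is sketched below. The class and method names (`Term`, `GlobalDictionary`, `add_term`, `add_has_a`, `is_a`), the sample terms and definitions, and the direction of the Claimant/Customer link are all assumptions made for illustration, not details of the actual system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """One GD entry: a single, unambiguous semantic definition."""
    name: str
    definition: str
    is_a: Optional[str] = None                      # parent term via IS-A
    has_a: List[str] = field(default_factory=list)  # component terms via Has-A


class GlobalDictionary:
    def __init__(self) -> None:
        self.terms: Dict[str, Term] = {}

    def add_term(self, name: str, definition: str,
                 is_a: Optional[str] = None) -> None:
        if name in self.terms:
            raise ValueError(f"term already defined: {name}")
        if is_a is not None and is_a not in self.terms:
            raise KeyError(f"unknown parent term: {is_a}")
        self.terms[name] = Term(name, definition, is_a)

    def add_has_a(self, whole: str, part: str) -> None:
        self.terms[whole].has_a.append(self.terms[part].name)

    def is_a(self, term: str, ancestor: str) -> bool:
        """True if term is ancestor or reaches it by following IS-A links."""
        current: Optional[str] = term
        while current is not None:
            if current == ancestor:
                return True
            current = self.terms[current].is_a
        return False


gd = GlobalDictionary()
gd.add_term("Customer", "a party that buys goods or services")
gd.add_term("Claimant", "a customer who files a claim", is_a="Customer")
# Two senses of the English word "claim" need two separate entries.
gd.add_term("Claim (insurance)", "a request for payment under a policy")
gd.add_term("Claim (assertion)", "a statement that something is true")
gd.add_has_a("Claim (insurance)", "Claimant")
```

Keying each entry by a sense-qualified name keeps every term unambiguous while still allowing new concepts to be added to the base set.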
11 Architecture Components: Using the Global Dictionary
- GD terms are used to build semantic names to describe the semantics of schema elements.
- Semantic names have the form
  - semantic name: CT CT ,CT CN
  - CT: context term, CN: concept name
  - each CT and CN is a single term from the GD
- Semantic names are included in specifications describing a data source.
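A semantic name can be modeled as a sequence of context terms plus a concept name, each checked against the GD. The sketch below is illustrative only: the `SemanticName` class, the optional concept name, and the bracketed `[CT;CT] CN` rendering are assumptions rather than the notation defined by the system.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class SemanticName:
    context_terms: Tuple[str, ...]      # the CTs, outermost context first
    concept_name: Optional[str] = None  # the CN; assumed optional here

    def terms(self) -> Tuple[str, ...]:
        if self.concept_name is None:
            return self.context_terms
        return self.context_terms + (self.concept_name,)

    def validate(self, gd_terms: Set[str]) -> None:
        """Reject any CT or CN that is not a term in the global dictionary."""
        unknown = [t for t in self.terms() if t not in gd_terms]
        if unknown:
            raise ValueError(f"terms not in global dictionary: {unknown}")

    def __str__(self) -> str:
        # Rendering syntax is an assumption, not taken from the slides.
        context = "[" + ";".join(self.context_terms) + "]"
        if self.concept_name is None:
            return context
        return f"{context} {self.concept_name}"


gd_terms = {"Claim", "Payment", "Amount", "Id"}
paid_amount = SemanticName(("Claim", "Payment"), "Amount")
paid_amount.validate(gd_terms)
label = str(paid_amount)
```

Because every component must be a GD term, two designers describing the same concept end up with comparable names even when their physical field names differ.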
12 Architecture Components: X-Specs
- Database metadata and semantic names are combined into specifications called X-Specs
  - stored and transmitted using XML
  - contains information on a relational schema
  - organized into database, table, and field levels
  - stores semantic names to describe and integrate schema elements
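To make the three levels concrete, the sketch below builds an X-Spec-like XML document for the ABC claims table used in the integration example. The element and attribute names (`xspec`, `database`, `table`, `field`, `semanticName`) and the semantic names assigned to each field are hypothetical; the real X-Spec format may differ.

```python
import xml.etree.ElementTree as ET


def build_xspec(database, tables):
    """Build an X-Spec-like document with database, table, and field levels."""
    root = ET.Element("xspec")
    db = ET.SubElement(root, "database", name=database)
    for table_name, (table_semantic, fields) in tables.items():
        table = ET.SubElement(
            db, "table", name=table_name, semanticName=table_semantic
        )
        for field_name, field_semantic in fields.items():
            ET.SubElement(
                table, "field", name=field_name, semanticName=field_semantic
            )
    return root


abc = build_xspec("ABC", {
    "Claims_tb": ("[Claim]", {
        "claim_id": "[Claim] Id",
        "claimant": "[Claim;Claimant] Name",
        "net_amount": "[Claim] Net amount",
        "paid_amount": "[Claim;Payment] Amount",
    }),
})
xml_text = ET.tostring(abc, encoding="unicode")
```

Since the result is plain XML text, it can be stored or transmitted to an integration site without access to the underlying database.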
13 Architecture Components: Integrating X-Specs
- Each database to be integrated is described using an X-Spec.
- Identical concepts in different databases are identified by similar semantic names.
- Concepts with identical (or hierarchically related) semantic names are combined regardless of their physical representation in the individual databases.
14 Integration Architecture
- Our integration architecture consists of two separate phases
  - capture process: X-Specs are constructed for each data source independently
  - integration process: X-Specs are combined using the integration algorithm, which matches semantic names using the global dictionary
15 Integration Architecture: The Capture Process
- Capture process involves
  - automatically extracting the schema information and metadata using a specification editor
  - assigning semantic names to each schema element (tables and fields) to capture their semantics
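The extraction step could look something like the sketch below, which uses SQLite from the Python standard library as a stand-in for whatever RDBMS is being captured. The `extract_schema` function, the dictionary layout, and the column types in the sample table are assumptions; the specification editor itself is not reproduced here.

```python
import sqlite3


def extract_schema(conn):
    """Read table and column metadata; semantic names are left for the DBA."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA arguments cannot be bound as parameters, so quote the identifier.
        quoted = table.replace('"', '""')
        columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
        schema[table] = {
            "semantic_name": None,
            "fields": {
                # Each row is (cid, name, type, notnull, dflt_value, pk).
                column[1]: {"type": column[2], "semantic_name": None}
                for column in columns
            },
        }
    return schema


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Claims_tb (claim_id INTEGER PRIMARY KEY, claimant TEXT, "
    "net_amount REAL, paid_amount REAL)"
)
schema = extract_schema(conn)
# The DBA then assigns semantic names by looking up terms in the global dictionary.
schema["Claims_tb"]["semantic_name"] = "[Claim]"
schema["Claims_tb"]["fields"]["claim_id"]["semantic_name"] = "[Claim] Id"
```

Only the second half of the capture process needs a person: structure comes from the catalog, while meaning comes from the DBA's choice of dictionary terms.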
16 Integration Architecture: The Capture Process
[Figure: capture process diagram. Labels: Relational Schema, Automatic Extraction, X-Spec, Specification Editor, DBA, Lookup of terms, Global Dictionary]
17 Integration Architecture: The Integration Process
- Integration process involves
  - automatically identifying identical concepts by matching semantic names
  - constructing a global view of database concepts consisting of a hierarchy of concept terms
  - resolving structural differences during query generation and submission (e.g. a concept may be represented as a table in one database and a field (attribute) in another)
18 Integration Architecture: The Integration Process
[Figure: integration process diagram. Labels: Client, Integration Site, Subtransactions, X-Spec, RDBMS]
19 Integration Architecture: Benefits
- The benefits of the two-phase architecture are
  - Dynamic integration: schemas integrated as needed
  - X-Specs are constructed only once and independently of each other
  - Automatic conflict resolution by integrating based on semantic name rather than physical structure
  - Users are isolated from system names and organization by querying through a global view using semantic names for concepts
20 Integration Example
- Two claims databases to be integrated
  - ABC Company: Claims_tb(claim_id, claimant, net_amount, paid_amount)
  - XYZ Company: T_claims(id, customer, claim_amt), T_payments(cid, pid, amount)
- First step is to construct X-Specs for each database.
21 Integration Example: ABC Database X-Spec
22 Integration Example: XYZ Database X-Spec
23 Integration Example: Integrated View
- Global view after integration
  - Claim
    - Id
    - Net amount
    - Customer
      - name
    - Payment
      - id
      - amount
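A view like this could be derived by grouping schema elements from both databases under their semantic names. The sketch below is a simplification under stated assumptions: the semantic names assigned to each ABC and XYZ field, the `IS_A` table, and the choice to generalize every term to its topmost IS-A ancestor are all hypothetical, and `T_payments.cid` is omitted for brevity.

```python
from collections import defaultdict

# Hypothetical IS-A links from the global dictionary: child term -> parent term.
IS_A = {"Claimant": "Customer"}


def generalize(term):
    """Follow IS-A links so hierarchically related terms compare equal."""
    while term in IS_A:
        term = IS_A[term]
    return term


def integrate(xspecs):
    """Group physical schema elements by their (generalized) semantic name."""
    view = defaultdict(list)
    for database, elements in xspecs.items():
        for location, (context_terms, concept_name) in elements.items():
            key = (tuple(generalize(t) for t in context_terms), concept_name)
            view[key].append(f"{database}.{location}")
    return dict(view)


xspecs = {
    "ABC": {
        "Claims_tb.claim_id": (("Claim",), "Id"),
        "Claims_tb.claimant": (("Claim", "Claimant"), "Name"),
        "Claims_tb.net_amount": (("Claim",), "Net amount"),
        "Claims_tb.paid_amount": (("Claim", "Payment"), "Amount"),
    },
    "XYZ": {
        "T_claims.id": (("Claim",), "Id"),
        "T_claims.customer": (("Claim", "Customer"), "Name"),
        "T_claims.claim_amt": (("Claim",), "Net amount"),
        "T_payments.pid": (("Claim", "Payment"), "Id"),
        "T_payments.amount": (("Claim", "Payment"), "Amount"),
    },
}
global_view = integrate(xspecs)
```

Each key of `global_view` corresponds to one node of the integrated hierarchy, and its value lists every physical field that represents that concept.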
24 Integration Example: Discussion
- Important points
  - system and field names are not presented to the user, who queries based on semantic names
  - database structure is not shown to the user
  - different physical representations for the same concept are combined (e.g. payment (attribute) in ABC with payment table in XYZ database)
  - hierarchically related concepts (customer vs. claimant) are combined based on their IS-A relationship in the global dictionary
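The first and third points can be illustrated with a small query-translation sketch: the user names concepts, and the system substitutes physical tables and fields per database. The `LOCATIONS` table, the semantic-name strings, and the `generate_subqueries` function are hypothetical, and joining the per-table results (for example, XYZ claims with their payments) is deliberately left out.

```python
# Hypothetical mapping from semantic names to physical (database, table, field)
# locations, as an integration step might record it.
LOCATIONS = {
    "[Claim] Id": [("ABC", "Claims_tb", "claim_id"), ("XYZ", "T_claims", "id")],
    "[Claim;Payment] Amount": [
        ("ABC", "Claims_tb", "paid_amount"),
        ("XYZ", "T_payments", "amount"),
    ],
}


def generate_subqueries(semantic_names, locations=LOCATIONS):
    """Translate a query over semantic names into one SELECT per physical table."""
    columns = {}
    for name in semantic_names:
        for database, table, field in locations[name]:
            columns.setdefault((database, table), []).append(
                f'{field} AS "{name}"'
            )
    return {
        target: f"SELECT {', '.join(cols)} FROM {target[1]}"
        for target, cols in columns.items()
    }


subqueries = generate_subqueries(["[Claim] Id", "[Claim;Payment] Amount"])
```

In this sketch ABC answers the request from a single table, while XYZ needs one subquery per table because payment is stored as a separate table there.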
25 Applications to the WWW
- Integrating diverse data sources is involved in constructing a data warehouse and other operational systems.
- The WWW is a diverse organization of databases which users access.
- Automatically integrating web data sources by a browser or portal reduces query complexity and the effort of integrating results for the user.
26 Conclusions
- Automatic integration of database schemas is possible by using a global dictionary of terms and constructing semantic names for schema elements.
- Integration of data sources has applications to the WWW and construction of data warehouses.
27 Future Work
- The integration architecture is evolving with standards on XML and captures metadata information in XML documents.
- The system is being tested on sample problems, and a query mechanism is a work in progress.
- We are refining a prototype of the system called Unity.