Title: A Semantic Approach to XML-Based Data Integration
1A Semantic Approach to XML-Based Data Integration
- A paper by Patricia Rodríguez-Gianolli and John
Mylopoulos, University of Toronto
- Speaker: Yi Lu
- Wayne State University
2Outline
- Introduction
- The DIXSE framework
- The mapping language
- A case study
- System architecture of DIXSE
- Conclusion
3Introduction
- Integrating data from multiple heterogeneous data
sources has been a major focus of database
research for more than two decades.
- With the widespread acceptance of the Web, interest
in data integration has been renewed, with a focus
on semi-structured data.
- Little has been proposed for the integration of
XML documents.
4Introduction
- Two categories of data integration: traditional schema
integration and semi-structured data integration.
The key to successful data integration is the
identification of inter-schema relationships.
- In traditional schema integration, the
identification of inter-schema relationships can
be done at different levels of abstraction.
- Inter-schema relationships are identified
differently in data integration systems for
semi-structured data, because such data lacks a schema.
5Introduction
- DIXSE stands for Data Integration for XML based
on Schematic knowlEdge. It blends techniques
from conventional and semi-structured data
integration systems and provides semi-automatic
integration of XML data.
- The main steps of DIXSE:
- DIXSE semi-automatically derives a common
semantic description from the input DTDs, and
allows the user to enrich and fine-tune this
description.
- DIXSE automatically generates wrappers for XML
documents that conform to these DTDs and populates
the conceptual schema.
6System Information Flow in DIXSE
7XML and Telos
8DIXSE Framework
- Consists of a data model and a derivation
mechanism.
- The data model supports concepts such as entity
classes, attributes, and mappings for
representing conceptual schemas as collections
of interrelated entity classes.
- The mechanism exploits the schema information
provided by the DTD and generates a DIXSE
conceptual schema as output. A set of heuristic
rules drives the derivation process.
9DIXSE Framework Data Model
- The data model represents conceptual schemas as collections of
entity types and their attributes.
- The model supports four main concepts: entity
classes, entity attributes, mappings and document
types.
- Entity classes represent types of objects or
concepts found in XML DTDs.
- In addition, the data model offers two
structuring facilities to capture the semantics
of entity attributes:
- attribute categories
- attribute constraints
10Attribute Categories
- There are three types of attributes (or
categories), namely components, properties and
links.
- An attribute is a component when it represents the
content (or structure) of an entity.
- It is a property when it represents information
about the content of an entity.
- It is a link when it represents
intra-document or inter-document information.
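A minimal Python sketch of how entity classes and their categorized attributes could be represented. The class and field names (Category, Attribute, EntityClass, value_type) and the sample "Article" entity are illustrative assumptions, not definitions taken from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Category(Enum):
    COMPONENT = "component"  # content or structure of an entity
    PROPERTY = "property"    # information about the content of an entity
    LINK = "link"            # intra-document or inter-document information


@dataclass
class Attribute:
    name: str
    category: Category
    value_type: str  # hypothetical field: name of the attribute's value type


@dataclass
class EntityClass:
    name: str
    attributes: List[Attribute] = field(default_factory=list)

    def by_category(self, category: Category) -> List[str]:
        """Return the names of all attributes in the given category."""
        return [a.name for a in self.attributes if a.category is category]


# Hypothetical entity class, made up for illustration only.
article = EntityClass("Article", [
    Attribute("title", Category.COMPONENT, "Title"),
    Attribute("language", Category.PROPERTY, "String"),
    Attribute("cites", Category.LINK, "Article"),
])
```

Calling article.by_category(Category.LINK) then lists only the attributes that point to other entities, which is the kind of distinction the three categories are meant to capture.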
11Attribute Constraints
- These constraints are inspired by the constraints
that XML itself imposes on elements and
attributes:
- exactlyOne
- atMostOne
- zeroOrMore
- oneOrMore
- union
- fixed
- idref
- xLink
- key
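Since the constraints are modeled on what XML itself imposes, one plausible reading is that several of them line up with DTD syntax. The Python sketch below assumes such a correspondence: occurrence indicators map to the four cardinality constraints, and ATTLIST declarations map to fixed, idref and key. This correspondence, and the function names, are assumptions for illustration; the slides do not spell it out, and union and xLink are left out because their mapping is not described here.

```python
from typing import Optional, Tuple

# Assumed correspondence between DTD occurrence indicators and constraints.
OCCURRENCE_TO_CONSTRAINT = {
    "": "exactlyOne",
    "?": "atMostOne",
    "*": "zeroOrMore",
    "+": "oneOrMore",
}


def particle_constraint(particle: str) -> Tuple[str, str]:
    """Split a content particle such as 'author+' into (name, constraint)."""
    particle = particle.strip()
    suffix = particle[-1] if particle and particle[-1] in "?*+" else ""
    name = particle[:-1] if suffix else particle
    return name, OCCURRENCE_TO_CONSTRAINT[suffix]


def attribute_constraint(att_type: str, default_decl: str) -> Optional[str]:
    """Map an ATTLIST attribute type and default declaration to a constraint.

    Assumption: #FIXED -> fixed, ID -> key, IDREF/IDREFS -> idref.
    Returns None when no constraint in this sketch applies.
    """
    if default_decl.startswith("#FIXED"):
        return "fixed"
    if att_type == "ID":
        return "key"
    if att_type in ("IDREF", "IDREFS"):
        return "idref"
    return None
```

For example, particle_constraint("abstract?") yields the pair ("abstract", "atMostOne").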
12Mappings and Documents
- A mapping in the XML Framework describes a
conceptual schema of the information represented
by a given XML DTD, typically authored for a
given context.
- A document type describes a given XML DTD and a
collection of mappings (i.e. conceptual schemas)
attached to it. Contexts are represented in the
data model as distinguishing attributes (string
names) of document types, mappings and entity
classes.
13DIXSE Framework Mechanism
- The DIXSE framework offers a mechanism to derive
a default conceptual schema of the information
represented by an XML DTD. This mapping is derived
purely from the schematic knowledge
offered by DTDs, and thus captures the semantics
conveyed by the data only partially.
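To make the idea of deriving a schema from a DTD alone more concrete, here is a toy Python sketch. It is not the paper's default mapping rules (DR1 to DR7), which are not reproduced in this text. It assumes two simple heuristics: an element with element children becomes an entity class whose children are its components, and a text-only element is treated as a plain value. Only comma-separated sequence content models are handled, and the sample DTD is made up.

```python
import re

# Matches declarations such as <!ELEMENT article (title, author+)>
# with a single, non-nested, parenthesized content model.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s+\(([^()]*)\)\s*>")


def derive_default_schema(dtd: str) -> dict:
    """Return {entity class name: [component names]} for a simple DTD."""
    schema = {}
    for name, content in ELEMENT_DECL.findall(dtd):
        # Drop occurrence indicators; only the child names are kept here.
        children = [p.strip().rstrip("?*+") for p in content.split(",") if p.strip()]
        if children == ["#PCDATA"]:
            continue  # assumed heuristic: text-only elements are plain values
        schema[name] = children
    return schema


# Hypothetical DTD fragment used only for illustration.
SAMPLE_DTD = """
<!ELEMENT article (title, author+, abstract?)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT abstract (#PCDATA)>
"""

schema = derive_default_schema(SAMPLE_DTD)
```

Because the sketch sees only the DTD, it cannot tell, for instance, whether author denotes a person or an organization; that is the kind of semantics a purely schema-based derivation captures only partially.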
14DIXSE Framework Mechanism (Cont)
15DIXSE Framework Mechanism -- Default Mapping
Rule 1 (DR1)
16DIXSE Framework Mechanism -- Default Mapping
Rule 2 (DR2)
17DIXSE Framework Mechanism -- Default Mapping
Rule 3 (DR3)
18DIXSE Framework Mechanism -- Default Mapping
Rule 4 (DR4)
19DIXSE Framework Mechanism -- Default Mapping
Rule 5 (DR5)
20DIXSE Framework Mechanism -- Default Mapping
Rule 6 (DR6)
21DIXSE Framework Mechanism -- Default Mapping
Rule 7 (DR7)
22Default Mapping for Sigmod Record
23The Mapping Language
- DIXSE allows the default mapping to be customized
through DIXml, a simple mapping language for writing
mapping specifications.
- DIXml is a declarative mapping language for
specifying a mapping or conceptual schema of the
information represented by a given XML DTD. This
specification annotates a DTD with simple
instructions for generating entity classes from
DTD element type declarations.
24The Mapping Language (Cont)
- DIXml is itself an XML language and provides its own
vocabulary to describe DIXSE mappings. Its two main
elements are directive and DIXSEmapping.
- directive represents a DIXml directive rule,
where the target is the value of the directive
element's attribute and the action is the
element's contents.
- DIXSEmapping represents the mapping itself by
encompassing the collection of specified
directive rules.
25The Mapping Language (Cont)
- A directive consists of two parts: the target
element and the action body. The target
identifies the XML element addressed by the rule.
The action body describes how this target element
should be mapped into a DIXSE conceptual
representation. There are five different
directive actions, namely default, create-class,
create-attribute, inline, and ignore.
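The following Python sketch shows what reading a set of directives might look like. The element names DIXSEmapping and directive and the five action names come from the slides. The attribute name target, the encoding of each action as an empty child element, and the sample targets are assumptions, since the concrete DIXml syntax is not reproduced in this text.

```python
import xml.etree.ElementTree as ET

# Hypothetical DIXml document; the real syntax may differ.
DIXML = """
<DIXSEmapping>
  <directive target="article"><create-class/></directive>
  <directive target="title"><create-attribute/></directive>
  <directive target="authors"><inline/></directive>
  <directive target="toindex"><ignore/></directive>
  <directive target="abstract"><default/></directive>
</DIXSEmapping>
"""

# The five directive actions named in the slides.
ACTIONS = {"default", "create-class", "create-attribute", "inline", "ignore"}


def read_directives(dixml: str) -> dict:
    """Return {target element name: action name} for each directive rule."""
    rules = {}
    for directive in ET.fromstring(dixml.strip()).iter("directive"):
        children = list(directive)
        if len(children) != 1 or children[0].tag not in ACTIONS:
            raise ValueError("each directive needs exactly one known action")
        rules[directive.get("target")] = children[0].tag
    return rules


rules = read_directives(DIXML)
```

Keeping the rules in a plain dictionary keyed by target element makes it easy to fall back to the default action for any DTD element that has no directive of its own.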
26The Mapping Language (Cont)
27The Mapping Language (Cont) -- The Default
Directive Action
28The Mapping Language (Cont) -- The
Create-class Directive Action
29The Mapping Language (Cont) -- The
Create-class Directive Action
30The Mapping Language (Cont) -- The
Create-attribute Directive Action
31The Mapping Language (Cont) -- The Inline
Directive Action
32The Mapping Language (Cont) -- The Ignore
Directive Action
33A Case Study (Cont)
34A Case Study (Cont)
35A Case Study (Cont)
36System Architecture
37System Architecture (Cont)
- Schema Engine and Document Loader
- The Schema Engine subsystem includes five components:
the DTD Parser, the XML Parser, the Schema
Derivator, the Schema Generator and the XSL
Wrapper Generator.
- The Document Loader consists of the XSL Processor
and the Data Integrator.
- The communication between these two subsystems is
accomplished through the Catalog Manager and the
XSL Wrapper Repository.
38Conclusion
- The paper proposes a semantic framework for XML
data integration called DIXSE. The DIXSE
framework offers a tool that semi-automatically
generates a conceptual
schema from several XML DTDs.
- DIXSE differs from other data integration
systems:
- It exploits the structural information provided by
DTDs.
- It is based on schema integration, like
conventional data integration systems, but allows
the user to enrich and fine-tune the schematic
knowledge.
- It employs a specialized object-based repository
to store an integrated and semantically richer
version of the data.