CS 630 In-Depth Study on Integrating Disparate Databases Using XML
Bindumadhuri Arla


What are the different types of databases?
Text retrieval systems
Relational or object database systems
Semistructured/XML database systems
Why do we need to integrate them?
How do we integrate them?
Data warehousing: a data warehouse is a repository
of integrated information, available for queries
and analysis. Data and information are extracted
from heterogeneous sources as they are generated,
making it easier and more efficient to run queries
over data that originally came from different
sources.

The Common Warehouse Metamodel (CWM) XML package
defines the metamodel classes needed to support the
description of XML documents as data resources
in a data warehouse.
The CWM XML metamodel does not contain an XML
document with data to be interchanged; rather, it
contains the XML DTD that describes the structure
of XML documents that can be interchanged.
XMI
The XML Metadata Interchange (XMI) specification
allows warehouse metadata and the CWM metamodel to
be interchanged using the W3C's Extensible Markup
Language (XML).
XMI is used to transform the CWM metamodel into
a CWM Document Type Definition (DTD), to transfer
instances of warehouse metadata that conform to
the CWM metamodel as XML documents based on
the CWM DTD, and to transform the CWM metamodel
itself into an XML document, based on the Meta
Object Facility (MOF) DTD, for interchange between
MOF-compliant repositories.

Text Retrieval
Text retrieval systems are concerned with the
management and query-based retrieval of
collections of unstructured text documents.
Relational or object-oriented
Relational or object-oriented database
systems are concerned with the management of
structured or strictly typed data, i.e., data
that conforms to a well-defined schema.
Semistructured
Semistructured databases are
designed to efficiently manage data that only
partially conforms to a schema, or whose schema
can evolve rapidly.

The three common integration architectural
approaches
Layered approach
Here an information management system (IMS) of one
type is implemented as an application that operates
over an IMS of another type.
Loose coupling or middleware approach
A middleware layer provides a uniform, common
query interface to a collection of diverse
management systems.
Extension approach
Here one type of system is extended, natively or
through the use of plug-in models, to support
operators, features, or data types normally found
in another.

Text Retrieval and Relational/Object Database
Systems
Integrating information retrieval (IR) systems with
database systems is a challenging task.
Fundamental differences in the query and
retrieval models (precisely defined declarative
queries and exact answers in databases versus
imprecise queries and approximate retrieval in IR
systems) have resulted in vastly different query
languages, index structures, storage formats, and
query processing techniques.
The database community has favored the extension
architecture, with the aim of efficiently providing
IR features within the DBMS framework.
In contrast, the IR community has shown a
preference for the layered architecture, with the
aim of exploiting DBMS features (concurrency
control, recovery, security, transaction
semantics, robustness, etc.) to build more
scalable and robust text retrieval systems.

Extensions to Database Systems
The extension-based approach of the DB community
has led to work on extended relational models and
algebras, extensions to database query languages,
new index structures and data types, and query
execution strategies for optimizing IR-style
text operations.
Extensions to the relational model fall into two
categories:
- nested (non-first normal form, or NF²)
relational models, which capture hierarchical
document structure, and
- probabilistic models, which incorporate
uncertainty and imprecision into the DBMS framework.
Such probabilistic extensions, though promising,
require substantial changes to the core query
processing algorithms of database systems and, as
a result, are not yet part of actual
implementations.
The technique of cooperative indexing provides a
framework for scalable integration of IR and
database systems. In this approach, the
IR extension components define how documents are
processed, how index terms are extracted and
stemmed, and the kinds of information
that are associated with each index entry.

Layering IR systems atop Database Systems
Some of the earliest attempts at integrating IR
and DB systems treated a text
retrieval system as a database application that
was implemented on top of a
standard relational DBMS.
The inverted index, the lexicon, and other term
frequency statistics were stored
in standard database tables.
IR queries were translated into SQL queries over
these tables and executed by
the database.
In addition, several prototype text retrieval
systems have been built using object database
systems. Since the object data model natively
supports nesting, in addition to collection types
and sets, systems for content-based retrieval of
structured documents could be effectively
implemented on top of OODB systems.
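As a rough illustration of the layered approach described on this slide, the sketch below stores an inverted index in an ordinary relational table and translates a keyword query into SQL. It uses Python's built-in sqlite3 module; the table name postings, the function names, and the scoring by summed term frequency are assumptions made for the example, not details taken from any of the systems surveyed.

```python
import re
import sqlite3

def build_index(conn, docs):
    """Store an inverted index (term, doc_id, term frequency) in a plain table."""
    conn.execute("CREATE TABLE postings (term TEXT, doc_id INTEGER, tf INTEGER)")
    for doc_id, text in docs.items():
        counts = {}
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            counts[term] = counts.get(term, 0) + 1
        conn.executemany(
            "INSERT INTO postings VALUES (?, ?, ?)",
            [(term, doc_id, tf) for term, tf in counts.items()],
        )

def search(conn, query):
    """Translate a keyword query into SQL and rank documents by summed term frequency."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    placeholders = ", ".join("?" for _ in terms)
    sql = (
        "SELECT doc_id, SUM(tf) AS score FROM postings "
        f"WHERE term IN ({placeholders}) "
        "GROUP BY doc_id ORDER BY score DESC, doc_id ASC"
    )
    return conn.execute(sql, terms).fetchall()

conn = sqlite3.connect(":memory:")
build_index(conn, {
    1: "XML query languages for XML data",
    2: "relational query processing",
    3: "text retrieval systems",
})
results = search(conn, "xml query")  # list of (doc_id, score) pairs
```

The database does all the retrieval work here, which is the point of the layered design: concurrency control and recovery come for free, at the cost of expressing ranking in SQL.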

XML and Relational Database
XML is a meta-language designed to be both human
and computer readable and it
can be understood by virtually any software
application.
A relational database is a collection of
inter-related tables, each consisting of rows
and columns.
Data is stored inside these tables and all
operations on data are performed on the
tables themselves, producing additional tables as
a result.
When storing XML documents in a relational
database, the various XML elements
and attributes must be mapped to a pre-determined
structure according to
information contained in a DTD or XML schema.
Relational databases are particularly good at
storing and querying highly structured information.
An RDBMS stores data efficiently and with minimal
redundancy because, in a normalized schema, each
unit of information is saved in only one place.

RDBMSs are also known for their reliability and
scalability and can be accessed by a very large
number of concurrent users.
However, RDBMSs were never designed to handle the
semistructured content often stored as XML.
Semistructured content is information considered
to be more document-centric, that is, content of a
more unpredictable, less structured nature: content
that varies in length and type, with elements or
attributes that are often empty or missing, and
whose ordering is important.
Examples of document-centric XML documents
include books, technical
documentation, legal briefs, patient health
records and news content.

XML is a versatile markup language, capable of
labeling the information content
of diverse data sources including structured and
semi-structured documents,
relational databases, and object repositories.
A query language that uses the structure of XML
intelligently can express queries
across all these kinds of data, whether
physically stored in XML or viewed as
XML via middleware.
The W3C XQuery specification describes such a query
language, XQuery, which is designed to be broadly
applicable across many types of XML data sources.
The major issues in using XML for data integration
are:
Schema management: when mapping heterogeneous
data sets, mappings are created between their
schemas.
Correspondence management: to integrate data
sources, correspondences are identified between
their schema elements. Automating this step is
called schema matching.
Mapping management: to establish a meaning for the
correspondences, inter-schema constraints are
established. Containment constraints are
established by the mapping.
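To make the idea of schema matching concrete, here is a minimal sketch that proposes correspondences between two sets of field names using plain string similarity from Python's difflib. The field names, the 0.6 threshold, and the name-only heuristic are all assumptions chosen for illustration; practical matchers also draw on data types, instance data, and structure.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lower-case a field name and drop common separators before comparing."""
    return name.lower().replace("_", "").replace("-", "")

def match_schemas(source_fields, target_fields, threshold=0.6):
    """Propose (source, target, score) correspondences by name similarity."""
    matches = []
    for source in source_fields:
        best, best_score = None, 0.0
        for target in target_fields:
            score = SequenceMatcher(None, normalize(source), normalize(target)).ratio()
            if score > best_score:
                best, best_score = target, score
        # Keep only candidates that clear the (assumed) similarity threshold.
        if best is not None and best_score >= threshold:
            matches.append((source, best, round(best_score, 2)))
    return matches

correspondences = match_schemas(
    ["cust_name", "phone_no", "zip"],
    ["CustomerName", "PhoneNumber", "PostalCode"],
)
```

In this example "zip" and "PostalCode" are left unmatched because their names share almost no characters, which shows why automatic matching usually needs a human to confirm or complete the correspondences.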

Text Retrieval and Semistructured Database
Systems
Since XML shares the same graph-based data model
as several other semistructured database query
languages, most of the work to date on querying,
indexing, and searching XML corpora has its origins
in the database community. Fuhr and Grossjohann
refer to this approach as the data-centric view of
XML.
The alternative, document-centric approach treats
an XML corpus as a collection of logically
structured text documents. By extending IR models
and indexes to encode the structure and semantics
of XML documents, it becomes possible to apply
well-known IR techniques and support keyword
searches, similarity-based retrieval, automatic
classification, and clustering of XML corpora.
Another approach integrates keyword search with
XML query processing by extending the XML-QL query
language with a new contains predicate for
keyword-based search operations. This work defines
the precise semantics of the extended query
language and describes how to efficiently execute
queries that involve keyword search as well as
non-text operations.
XIRQL is an extension of the XQL query language
that supports IR-related features such as
weighting, ranking, relevance-oriented search, and
vague predicates.
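The combination of a structural query with a keyword predicate can be sketched in a few lines. The example below is a Python analogue built on xml.etree.ElementTree, not XML-QL or XIRQL syntax; the sample corpus and the helper names contains and query are hypothetical.

```python
import xml.etree.ElementTree as ET

CORPUS = """
<library>
  <book><title>XML Query Processing</title>
        <abstract>Indexing and searching XML corpora.</abstract></book>
  <book><title>Transaction Recovery</title>
        <abstract>Concurrency control in database systems.</abstract></book>
</library>
"""

def contains(element, keyword):
    """True if the keyword occurs anywhere in the element's text content."""
    text = " ".join(element.itertext()).lower()
    return keyword.lower() in text

def query(root, path, keyword, project):
    """Apply a structural step (path), a keyword predicate, then project one child."""
    return [
        element.findtext(project)
        for element in root.findall(path)
        if contains(element, keyword)
    ]

root = ET.fromstring(CORPUS)
titles = query(root, "./book", "searching", "title")
```

The predicate here is a plain substring test with exact answers. The IR-style extensions mentioned on this slide would instead attach weights to matches and return ranked results.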

Relational/Object and Semistructured Database
Systems
The integration of XML with relational and
object database systems is an extremely active
research area. Techniques for mapping the XML
(semistructured) data model to the relational or
object data model, for exporting relational data as
XML documents, for providing XML views of
relational data, and for extending relational
query engines to process queries over XML data are
topics currently being investigated by the research
community.
In the commercial arena, most major relational
database vendors already provide support for an
XML data type to natively store and manage XML
documents, as well as some primitive programming
APIs for importing and exporting XML documents to
and from database tables.

Extensions to Relational/Object Database Systems
Since XML is intended as a language for
inter-enterprise information interchange, it
is natural that techniques for publishing
relational data as XML documents are in great
demand. Several commercial tools already provide
this functionality, but with some limitations.
Oracle's XSQL tool generates a fixed canonical
mapping of the relational data into XML documents,
by mapping each relation and attribute to an XML
element and nesting tuple elements within table
elements.
IBM's DB2 XML Extender supports a language for
composing relational data into arbitrary XML, as
well as for decomposing XML documents into
relations.
In general, there are two parts to designing a
system for publishing relations as XML documents.
The first is a language for specifying the
conversion/mapping from relations to XML
documents.
The second is an efficient implementation
strategy to actually carry out the conversion.
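A fixed canonical mapping of the kind described on this slide (one element per relation, one nested element per tuple, one child element per attribute) can be sketched as follows. This is a generic illustration using Python's sqlite3 and xml.etree.ElementTree, not the output format of Oracle's XSQL tool; the element name row, the sample table, and the function name are assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

def table_to_xml(conn, table):
    """Canonical mapping: <table> wraps one <row> per tuple, one child per column.

    The table name is interpolated directly, so it must be trusted, and column
    names must be valid XML element names for this sketch to work.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY rowid')
    columns = [description[0] for description in cursor.description]
    table_element = ET.Element(table)
    for row in cursor:
        row_element = ET.SubElement(table_element, "row")
        for column, value in zip(columns, row):
            column_element = ET.SubElement(row_element, column)
            column_element.text = "" if value is None else str(value)
    return table_element

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO author VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
xml_text = ET.tostring(table_to_xml(conn, "author"), encoding="unicode")
```

Because the shape of the output is dictated entirely by the table, a fixed mapping like this cannot target an arbitrary DTD, which is the limitation that mapping languages are meant to address.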

The SilkRoute system is one of the earliest
research prototypes that supported automatic
XML generation from relational tables. SilkRoute
used a language called RXL, based on a combination
of SQL and XML query languages, for specifying
mappings of relational tables to arbitrary XML
DTDs. Using this language, it is possible to define
XML views of the relational data. SilkRoute
efficiently executes queries over these XML views
by materializing only the portion of the XML that
is required to answer the query.
Ozone is an extension of an object database system
that handles both structured and semistructured
data.
Layered systems
To design a database-backed XML repository, one
must precisely define
(i) a mapping from XML documents to tables or
objects,
(ii) an algorithm for translating queries over XML
documents into SQL or OQL queries over the
underlying database, and
(iii) a mechanism for translating the result of
the database query execution into XML.
There are several proposals for implementing XML
repositories on top of relational and object
database systems, differing in their choices for
(i), (ii), and (iii).

There are three basic alternatives for mapping
XML documents into relational tables.
The simplest, and least useful, mapping is to
store an entire XML document as a single database
attribute.
Another possibility is to interpret XML documents
as graph structures and supply a relational schema
that can store such graphs.
A third approach is to map the structure of the
XML documents (e.g., as expressed in a DTD) into a
corresponding relational schema and to store the
documents based on these mappings.
Only the last approach allows the repository to
fully exploit the query processing and
optimization capabilities of the underlying
database system.
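The second alternative, storing the document as a graph in a generic relational schema, can be sketched with a single edge-style table plus a translator from simple child-axis paths to SQL self-joins. The table layout (id, parent, tag, text), the function names, and the sample document are assumptions for illustration; attributes and mixed content are ignored to keep the sketch short.

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred(conn, xml_text):
    """Store the document tree as rows of (id, parent, tag, text)."""
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER, tag TEXT, text TEXT)"
    )
    next_id = 0

    def visit(element, parent_id):
        nonlocal next_id
        next_id += 1
        node_id = next_id  # ids follow document order
        conn.execute(
            "INSERT INTO node VALUES (?, ?, ?, ?)",
            (node_id, parent_id, element.tag, (element.text or "").strip()),
        )
        for child in element:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)

def path_query(conn, tags):
    """Translate a path such as ['catalog', 'book', 'title'] into one self-join per step."""
    joins = []
    where = ["n0.parent IS NULL", "n0.tag = ?"]
    params = [tags[0]]
    for i, tag in enumerate(tags[1:], start=1):
        joins.append(f"JOIN node n{i} ON n{i}.parent = n{i - 1}.id")
        where.append(f"n{i}.tag = ?")
        params.append(tag)
    last = len(tags) - 1
    sql = (
        f"SELECT n{last}.text FROM node n0 {' '.join(joins)} "
        f"WHERE {' AND '.join(where)} ORDER BY n{last}.id"
    )
    return [row[0] for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
shred(conn, "<catalog><book><title>XML</title></book><book><title>SQL</title></book></catalog>")
titles = path_query(conn, ["catalog", "book", "title"])
```

Every path step costs a join against the same large table, which hints at why a schema derived from the document structure tends to give the optimizer more to work with.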
Techniques for mapping XML documents into object
databases tend to be considerably simpler.
Two key challenges need to be addressed:
OODB systems are generally strongly typed,
whereas XML, being semistructured, is not. As a
result, the object model of the database must most
often be extended before it can be used to
implement the XML repository.
Many OODB systems support only simple path
expressions whereas most XML

Conclusion
Integration of semistructured data in general,
and XML in particular, with the
relational/object database world has received a
lot more attention than
corresponding integration with text retrieval
systems .
This will change once efforts to incorporate
IR-style operators and structured text retrieval
models into XML query languages bear fruit.
The survey paper discussed above concentrated
mainly on pairs of systems.

My study is narrowed to integrating XML and
relational databases
Most of the world's data repositories are
relational databases.
XML is a useful syntax for data exchange.
Tools are needed to facilitate transforming
relational data into XML.
The other two papers I am going to discuss cover
how to integrate relational databases and XML:
Integrating XML and Relational Database
Technologies: A Position Paper,
by Giovanni Guardalben, HiT Software, Inc.
X-Ray: Towards Integrating XML and Relational
Database Systems,
technical report by Gerti Kappel

HiT Allora
Techniques for modeling
User-defined mapping
- SQL to materialize XML is generated
automatically based on the mapping and the
relational catalog.
- Mapping definitions trigger the use of
referential constraints to define joins and
outer joins.
- Dynamic SQL creation can use user-defined
predicates and parameterized predicates, as well
as scripts.
- Portability across multiple RDBMSs.
Schema-based
- Define a virtual XML Schema-based collection of
relational tables (default database).
Query normalization
- Translate XQuery constructs into SQL queries
based on the default database.
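The idea of generating SQL from a mapping definition and referential constraints can be sketched as below. Everything here is hypothetical: the MAPPING and FOREIGN_KEYS structures, the generate_sql function, and the sample tables are invented for illustration and do not reflect Allora's actual mapping format or API.

```python
import sqlite3

# Hypothetical mapping definition: XML element name -> (table, column).
MAPPING = {
    "orderId": ("orders", "id"),
    "customerName": ("customers", "name"),
}
# Hypothetical referential constraint: (child table, fk column, parent table, pk column).
FOREIGN_KEYS = [("orders", "customer_id", "customers", "id")]

def generate_sql(mapping, foreign_keys, root_table):
    """Build a SELECT whose outer joins follow the foreign keys between mapped tables."""
    select = ", ".join(
        f'{table}.{column} AS "{element}"' for element, (table, column) in mapping.items()
    )
    mapped_tables = {table for table, _ in mapping.values()}
    joins = [
        f"LEFT OUTER JOIN {parent} ON {child}.{fk} = {parent}.{pk}"
        for child, fk, parent, pk in foreign_keys
        if child == root_table and parent in mapped_tables
    ]
    return f"SELECT {select} FROM {root_table} " + " ".join(joins)

sql = generate_sql(MAPPING, FOREIGN_KEYS, "orders")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("INSERT INTO customers VALUES (10, 'Acme')")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, None)])
rows = conn.execute(sql).fetchall()
```

The outer join keeps the order that has no customer, so the corresponding XML element would simply be empty or omitted rather than causing the whole order to disappear.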

HiT Software's Allora is a family of XML-to-RDBMS
integration middleware. It natively supports
Sun's Java platform (jAllora), and Microsoft's
Windows COM architecture (winAllora).
Allora is a set of tools that can be integrated
with server applications such as IXIASOFT's
TEXTML Server. It aims to solve the problem of
integrating any relational data source with any XML
schema document.
It can be used by Java application servers,
JavaServer Pages applications, Visual Basic
applications, C/C++/.NET applications, and Active
Server Pages applications.
Allora features two major software components:
the Allora Mapper and the Allora Engine.

The figure shows a typical system
architecture of an end-to-end XML
authoring and publishing application.
In this architecture, winAllora serves
as a bridge between legacy content or
XML content generated on the fly and
stored in an RDBMS.
winAllora also provides a conversion
agent for XML content extracted from
TEXTML Server and exported to an
RDBMS.


HiT Software's winAllora Mapper is a GUI
application that allows creation and
editing of mapping definitions between a W3C
Schema or DTD instance and table
fields from a relational database.
Data mapping is done by dragging and dropping
items between the XML structure and the database
structure.
The Mapper also supports retrieval and definition
of referential integrity constraints.
These are used to perform meaningful joins among
different relational tables.
winAllora also offers a complete set of APIs for
programmatically automating the
integration process using VBScript.
Design-time mapping tools such as winAllora use
scripting to provide the freedom to
operate on either SQL data (marshalling) or XML
data (unmarshalling).


winAllora Engine
The winAllora Engine imports and exports the data
to and from the XML
structure.
The Engine is a set of COM interfaces that take
the mapping definitions from
the Mapper and process them to perform XML
export or import.
winAllora supports any ODBC- or OLE DB-compliant
database, including most relational databases such
as Oracle, DB2, MS SQL Server, Sybase, Pervasive,
PointBase, and TimesTen.
Import from RDBMS
When the Import process is started, winAllora
deposits the relational data in
XML format according to a specific DTD or XML
schema.
Export to RDBMS
Users of Content Management applications would
modify XML documents stored
in TEXTML Server and would like the changes to be
reflected in other
applications or databases.

The winAllora Export to RDBMS feature enables
TEXTML Server to export XML documents to just
about any RDBMS.
winAllora can perform insert, update, and delete
operations on the relational database and still
enforce its referential integrity. No programming
is necessary, as all the logic is included in the
mapping definition created in the winAllora Mapper.
HiT Software Allora framework for XML-to-RDBMS
integration
- Mapping XML to RDBMS (queries and DBMS catalogs)
- Marshalling and unmarshalling XML
- GUI Mapper
- Relational storage creation from XML Schema
- XML Schema creation from relational catalogs
- Data type portability across heterogeneous
RDBMSs
- Support for XML SQL expressions
- Support for scripting and parameterized queries
- Support for XML QBE queries
Future
- XQuery support based on XML Schema-to-RDBMS
mapping

References
Sriram Raghavan and Hector Garcia-Molina, Integrating Diverse Information Management Systems: A Brief Survey. http://www-db.stanford.edu/rsram/pubs/deb01/deb01.pdf
White papers from HiT Software, http://www.hitsw.com