Title: CS 630 In-Depth Study on Integrating Disparate Databases Using XML, by Bindumadhuri Arla
1- CS 630 In-Depth Study on Integrating Disparate Databases Using XML
Bindumadhuri Arla
2- What are the different types of databases?
- Text retrieval systems
- Relational or object database systems
- Semistructured/XML database systems
- Why do we need to integrate them?
- How do we integrate them?
- Data warehousing: a data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated, making it easier and more efficient to run queries over data that originally came from different sources.
3- The CWM (Common Warehouse Metamodel) XML package defines the metamodel classes needed to support the description of XML documents as data resources in a data warehouse.
- The CWM XML metamodel does not contain an XML document with data to be interchanged; rather, it contains the XML DTD that describes the structure of XML documents that can be interchanged.
- XMI (XML Metadata Interchange)
- This specification allows warehouse metadata and the CWM metamodel to be interchanged using the W3C's Extensible Markup Language (XML).
- XMI is used to transform the CWM metamodel into a CWM Document Type Definition (DTD), to transfer instances of warehouse metadata that conform to the CWM metamodel as XML documents based on the CWM DTD, and to transform the CWM metamodel itself into an XML document, based on the MOF (Meta Object Facility) DTD, for interchange between MOF-compliant repositories.
4- Text retrieval
- Text retrieval systems are concerned with the management and query-based retrieval of collections of unstructured text documents.
- Relational or object-oriented
- Relational or object-oriented database systems are concerned with the management of structured or strictly typed data, i.e., data that conforms to a well-defined schema.
- Semistructured data
- Finally, semistructured databases are designed to efficiently manage data that only partially conforms to a schema, or whose schema can evolve rapidly.
5- The three common integration architectures
- Layered approach
- An information management system (IMS) of one type is implemented as an application that operates over an IMS of another type.
- Loosely coupled or middleware approach
- A middleware layer provides a uniform, common query interface to a collection of diverse management systems.
- Extension approach
- One type of system is extended, natively or through the use of plug-in models, to support operators, features, or data types normally found in another.
6- Text Retrieval and Relational/Object Database Systems
- Here integration is a challenging and difficult task.
- Fundamental differences in the query and retrieval models (precisely defined declarative queries and exact answers in databases versus imprecise queries and approximate retrieval in IR systems) have resulted in vastly different query languages, index structures, storage formats, and query processing techniques.
- The database community has favored the extension architecture, with the aim of efficiently providing IR features within the DBMS framework.
- In contrast, the IR community has shown a preference for the layered architecture, with the aim of exploiting DBMS features (concurrency control, recovery, security, transaction semantics, robustness, etc.) to build more scalable and robust text retrieval systems.
7- Extensions to Database Systems
- The extension-based approach of the DB community has led to work on extended relational models and algebras, extensions to database query languages, new index structures and data types, and query execution strategies for optimizing IR-style text operations.
- Extensions to the relational model fall into two categories:
- Nested (non-first normal form, or NF²) relational models, which capture hierarchical document structure.
- Probabilistic models, which incorporate uncertainty and imprecision into the DBMS framework. Such probabilistic extensions, though promising, require substantial changes to the core query processing algorithms of database systems and, as a result, are not yet part of actual implementations.
- In the technique of cooperative indexing, a framework is provided for scalable integration of IR and database systems. In this approach, the IR extension components define how documents are processed, how index terms are extracted and stemmed, and the kinds of information that are associated with each index entry.
8- Layering IR Systems atop Database Systems
- Some of the earliest attempts at integrating IR and DB systems treated a text retrieval system as a database application implemented on top of a standard relational DBMS.
- The inverted index, the lexicon, and other term frequency statistics were stored in standard database tables.
- IR queries were translated into SQL queries over these tables and executed by the database.
- In addition, several prototype text retrieval systems have been built using object database systems. Since the object data model natively supports nesting, in addition to collection types and sets, systems for content-based retrieval of structured documents could be effectively implemented on top of OODB systems.
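The layered idea above can be illustrated with a minimal sketch, assuming a single hypothetical postings table in SQLite; the table name, columns, and sample documents are illustrative and not taken from any of the systems surveyed. A conjunctive keyword query is translated into one SQL statement over the stored inverted index.

```python
import sqlite3

# Hypothetical sample collection: document id -> text.
docs = {
    1: "xml data integration",
    2: "relational database systems",
    3: "xml and relational database integration",
}

conn = sqlite3.connect(":memory:")
# Assumed schema: one row per (term, document) posting with its term frequency.
conn.execute("CREATE TABLE postings (term TEXT, doc_id INTEGER, tf INTEGER)")

for doc_id, text in docs.items():
    counts = {}
    for term in text.split():
        counts[term] = counts.get(term, 0) + 1
    conn.executemany(
        "INSERT INTO postings VALUES (?, ?, ?)",
        [(term, doc_id, n) for term, n in counts.items()],
    )


def search(terms):
    """Translate an AND keyword query into SQL over the postings table."""
    terms = sorted(set(terms))
    placeholders = ",".join("?" for _ in terms)
    sql = (
        f"SELECT doc_id FROM postings WHERE term IN ({placeholders}) "
        "GROUP BY doc_id HAVING COUNT(DISTINCT term) = ? ORDER BY doc_id"
    )
    return [row[0] for row in conn.execute(sql, (*terms, len(terms)))]


print(search(["xml", "integration"]))  # prints [1, 3]
```

The database does all the matching here; the retrieval layer only builds the SQL, which is the division of labor the layered architecture relies on.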
9- XML and Relational Database
- XML is a meta-language designed to be both human- and machine-readable, and it can be processed by virtually any software application.
- A relational database is a collection of interrelated tables, each consisting of rows and columns.
- Data is stored inside these tables, and all operations on data are performed on the tables themselves, producing additional tables as a result.
- When storing XML documents in a relational database, the various XML elements and attributes must be mapped to a predetermined structure according to information contained in a DTD or XML schema.
- Relational databases are particularly good for storing and querying highly structured information.
- An RDBMS stores data efficiently and with little redundancy because, in a normalized design, each unit of information is saved in only one place.
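As a minimal sketch of mapping XML elements and attributes to a predetermined relational structure, the example below stores each `<order>` element of a small, hypothetical document as one row. The document, the `orders` table, and the element-to-column choices are assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical data-centric XML document.
XML_DOC = """
<orders>
  <order id="1"><customer>Acme</customer><total>120.50</total></order>
  <order id="2"><customer>Globex</customer><total>75.00</total></order>
</orders>
"""

conn = sqlite3.connect(":memory:")
# Assumed target structure: the id attribute and two child elements
# each map to one typed column.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")

root = ET.fromstring(XML_DOC)
records = [
    (int(o.get("id")), o.findtext("customer"), float(o.findtext("total")))
    for o in root.findall("order")
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)

rows = conn.execute("SELECT id, customer, total FROM orders ORDER BY id").fetchall()
print(rows)  # prints [(1, 'Acme', 120.5), (2, 'Globex', 75.0)]
```

This works cleanly only because every `<order>` has the same shape; documents with optional, repeated, or order-sensitive content need a richer mapping.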
10- RDBMSs are also known for their reliability and scalability and can be accessed by a very large number of concurrent users.
- RDBMSs were never designed to handle the semistructured content often stored as XML.
- Semistructured content is information considered to be more document-centric, that is, content of a more unpredictable, less structured nature: content that varies in length and type, with elements or attributes that are often empty or missing, and whose ordering is important.
- Examples of document-centric XML documents include books, technical documentation, legal briefs, patient health records, and news content.
11- XML is a versatile markup language, capable of labeling the information content of diverse data sources, including structured and semistructured documents, relational databases, and object repositories.
- A query language that uses the structure of XML intelligently can express queries across all these kinds of data, whether physically stored in XML or viewed as XML via middleware.
- The W3C XQuery specification describes such a query language, XQuery, which is designed to be broadly applicable across many types of XML data sources.
- The major issues in using XML for data integration are:
- Schema management: when mapping heterogeneous data sets, mappings are created between their schemas.
- Correspondence management: to integrate data sources, correspondences are made between them. Automating this step is called schema matching.
- Mapping management: to establish a meaning for correspondences, inter-schema constraints are established. Containment constraints are established by mapping.
12- Text Retrieval and Semistructured Database Systems
- Since XML shares its graph-based data model with several other semistructured data models and query languages, most of the work to date on querying, indexing, and searching XML corpora has its origins in the database community. Fuhr and Grossjohann refer to this approach as the data-centric view of XML.
- The alternative, document-centric approach treats an XML corpus as a collection of logically structured text documents. By extending IR models and indexes to encode the structure and semantics of XML documents, it becomes possible to apply well-known IR techniques and support keyword searches, similarity-based retrieval, automatic classification, and clustering of XML corpora.
- Another approach integrates keyword search with XML query processing. This work extends the XML-QL query language by introducing a new contains predicate for keyword-based search operations. It defines the precise semantics of the extended query language and describes how to efficiently execute queries that involve keyword search as well as non-text operations.
- XIRQL is an extension to the XQL query language that supports IR-related features such as weighting, ranking, relevance-oriented search, and vague predicates.
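To make the idea of mixing keyword search with non-text conditions concrete, here is a minimal sketch in Python rather than XML-QL. The `contains` function below is a hypothetical stand-in with simplified semantics (case-insensitive word match anywhere in an element's subtree); it is not the predicate as defined for the extended XML-QL, and the sample corpus is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample corpus.
CORPUS = """
<library>
  <book year="2001"><title>Querying XML Data</title><abstract>Keyword search over semistructured data.</abstract></book>
  <book year="1998"><title>Relational Design</title><abstract>Normalization and keys.</abstract></book>
  <book year="2002"><title>Text Retrieval</title><abstract>Ranking and keyword search.</abstract></book>
</library>
"""


def contains(element, keyword):
    """Assumed semantics: keyword occurs as a word in the subtree's text."""
    words = " ".join(element.itertext()).lower().split()
    return keyword.lower() in (w.strip(".,;:") for w in words)


def query(root, keyword, min_year):
    """Combine a keyword predicate with a non-text (attribute value) condition."""
    return [
        book.findtext("title")
        for book in root.findall("book")
        if int(book.get("year")) >= min_year and contains(book, keyword)
    ]


root = ET.fromstring(CORPUS)
print(query(root, "keyword", 2000))  # prints ['Querying XML Data', 'Text Retrieval']
```

A real engine would answer the keyword part from an index instead of scanning every subtree, which is where the efficient execution strategies mentioned above come in.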
13- Relational/Object and Semistructured Database Systems
- The integration of XML with relational and object database systems is an extremely active research area. Techniques for mapping the XML (semistructured) data model to the relational or object data model, for exporting relational data as XML documents, for providing XML views of relational data, and for extending relational query engines to process queries over XML data are topics currently being investigated by the research community.
- In the commercial arena, most major relational database vendors already provide support for an XML data type to natively store and manage XML documents, as well as some primitive programming APIs for importing and exporting XML documents to and from database tables.
14- Extensions to Relational/Object Database Systems
- Since XML is intended as a language for inter-enterprise information interchange, it is natural that techniques for publishing relational data as XML documents are in great demand. Several commercial tools already provide this functionality, but with some limitations.
- Oracle's XSQL tool generates a fixed canonical mapping of the relational data into XML documents, by mapping each relation and attribute to an XML element and nesting tuple elements within table elements.
- IBM's DB2 XML Extender supports a language for composing relational data into arbitrary XML as well as for decomposing XML documents into relations.
- In general, there are two parts to designing a system for publishing relations as XML documents.
- The first is a language to specify the conversion/mapping from relations to XML documents.
- The second is an efficient implementation strategy to actually carry out the conversion.
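A fixed canonical mapping of the kind described above can be sketched in a few lines: the table becomes an element, each tuple becomes a nested element, and each attribute becomes a child element. This is only an illustration of the general idea; the element names (for example `row`) are assumptions and do not reproduce the output format of any particular tool.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn, table):
    """Canonical mapping: <table><row><column>value</column>...</row>...</table>."""
    # The table name is interpolated directly, so it must come from trusted code.
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY rowid")
    columns = [description[0] for description in cursor.description]
    table_el = ET.Element(table)
    for record in cursor:
        row_el = ET.SubElement(table_el, "row")  # "row" is an assumed tag name
        for name, value in zip(columns, record):
            ET.SubElement(row_el, name).text = "" if value is None else str(value)
    return ET.tostring(table_el, encoding="unicode")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept (deptno INTEGER, dname TEXT)")
conn.executemany("INSERT INTO dept VALUES (?, ?)", [(10, "Sales"), (20, "Research")])

xml_text = table_to_xml(conn, "dept")
print(xml_text)
```

Because the shape of the output is fixed by the table, a mapping like this cannot target an arbitrary DTD, which is the limitation that mapping languages address.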
15- SilkRoute is one of the earliest research prototypes that supported automatic XML generation from relational tables. SilkRoute used a language called RXL, based on a combination of SQL and XML query languages, for specifying mappings of relational tables to arbitrary XML DTDs. Using this language, it is possible to define XML views of the relational data. SilkRoute efficiently executes queries over these XML views by materializing only the portion of the XML that is required to answer the query.
- Ozone is an extension of an object database system that handles both structured and semistructured data.
- Layered systems
- To design a database-backed XML repository, one must precisely define (i) a mapping from XML documents to tables or objects, (ii) an algorithm for translating queries over XML documents into SQL or OQL queries over the underlying database, and (iii) a mechanism for translating the result of database query execution into XML.
- There are several proposals for implementing XML repositories on top of relational and object database systems, differing in their choices for (i), (ii), and (iii).
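The three ingredients (i), (ii), and (iii) can be sketched end to end for one very restricted query shape. Everything here is an assumption made for illustration: the path-to-column mapping, the `orders` table, and the tiny path syntax `/a/b[c='v']/d` that the translator accepts.

```python
import re
import sqlite3

# (i) Hypothetical mapping from XML paths to (table, column).
PATH_TO_COLUMN = {
    "/orders/order/customer": ("orders", "customer"),
    "/orders/order/total": ("orders", "total"),
}


def translate(path_query):
    """(ii) Translate /a/b[c='v']/d into a parameterized SQL query."""
    match = re.fullmatch(r"(/\w+/\w+)\[(\w+)='([^']*)'\]/(\w+)", path_query)
    if match is None:
        raise ValueError("unsupported query shape")
    base, pred_child, value, result_child = match.groups()
    table, pred_col = PATH_TO_COLUMN[f"{base}/{pred_child}"]
    _, result_col = PATH_TO_COLUMN[f"{base}/{result_child}"]
    sql = f"SELECT {result_col} FROM {table} WHERE {pred_col} = ?"
    return sql, (value,), result_child


def run(conn, path_query):
    sql, params, tag = translate(path_query)
    # (iii) Wrap each result value back into an XML element.
    return [f"<{tag}>{row[0]}</{tag}>" for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, "Acme", 120.5), (2, "Globex", 75.0)]
)
print(run(conn, "/orders/order[customer='Acme']/total"))  # prints ['<total>120.5</total>']
```

Real proposals differ mainly in how general each of the three pieces is; this sketch fixes all three to the simplest possible choice.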
16- There are three basic alternatives for mapping XML documents into relational tables.
- The simplest, and least useful, mapping is to store an entire XML document as a single database attribute.
- Another possibility is to interpret XML documents as graph structures and supply a relational schema that can store such graphs.
- A third approach is to map the structure of the XML documents (e.g., expressed as a DTD) into a corresponding relational schema and to store the documents based on these mappings.
- Only the last approach allows the repository to fully exploit the query processing and optimization capabilities of the underlying database system.
- Techniques for mapping XML documents into object databases tend to be considerably simpler.
- Two key challenges that need to be addressed are:
- OODB systems are generally strongly typed whereas XML, being semistructured, is not. As a result, most often, the object model of the database must be extended before it can be used for implementing the XML repository.
- Many OODB systems support only simple path expressions whereas most XML
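The second alternative for relational storage, treating the document as a graph, can be sketched with a single generic edge table. The schema below (parent, node, tag, text) is one assumed layout among many possible ones, and attributes are ignored to keep the example short.

```python
import sqlite3
import xml.etree.ElementTree as ET


def load_edges(conn, xml_text):
    """Store an XML tree as one (parent, node, tag, text) row per element."""
    conn.execute("CREATE TABLE edge (parent INTEGER, node INTEGER, tag TEXT, text TEXT)")
    counter = 0

    def visit(element, parent_id):
        nonlocal counter
        counter += 1
        node_id = counter
        conn.execute(
            "INSERT INTO edge VALUES (?, ?, ?, ?)",
            (parent_id, node_id, element.tag, (element.text or "").strip()),
        )
        for child in element:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)


conn = sqlite3.connect(":memory:")
load_edges(conn, "<book><title>XML</title><author><name>Ann</name></author></book>")

# The path book/author/name becomes one self-join per path step.
PATH_SQL = """
SELECT n.text FROM edge b
JOIN edge a ON a.parent = b.node AND a.tag = 'author'
JOIN edge n ON n.parent = a.node AND n.tag = 'name'
WHERE b.tag = 'book' AND b.parent IS NULL
"""
names = [row[0] for row in conn.execute(PATH_SQL)]
print(names)  # prints ['Ann']
```

The schema accepts any document without a DTD, but every path step costs a self-join, which illustrates why the structure-based mapping gives the database optimizer more to work with.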
17- Conclusion
- Integration of semistructured data in general, and XML in particular, with the relational/object database world has received a lot more attention than the corresponding integration with text retrieval systems.
- This will change once efforts to incorporate IR-style operators and structured text retrieval models into XML query languages bear fruit.
- The paper surveyed above concentrates mainly on pairs of systems.
18- My study is narrowed to integrating XML and relational databases.
- Most of the world's data repositories are relational databases.
- XML is a useful syntax for data exchange.
- Tools are needed to facilitate transforming relational data into XML.
- The other two papers I am going to discuss are on how to integrate relational databases and XML:
- Integrating XML and Relational Database Technologies: A Position Paper, by Giovanni Guardalben, HiT Software, Inc.
- X-Ray: Towards Integrating XML and Relational Database Systems, technical report by Gerti Kappel
19- HiT Allora
- Techniques for modeling
- User-defined mapping
- The SQL to materialize XML is generated automatically based on the mapping and the relational catalog
- Mapping definitions trigger the use of referential constraints to define joins/outer joins
- Dynamic SQL creation can use user-defined predicates and parametric predicates as well as scripts
- Portability across multiple RDBMSs
- Schema-based
- Define a virtual XML Schema-based collection of relational tables (default database)
- Query normalization
- Translate XQuery constructs into SQL queries based on the default database
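The idea of generating SQL from a mapping plus referential constraints can be sketched generically as follows. This is not Allora's mapping format or API; the `MAPPING` and `FOREIGN_KEYS` structures, the table names, and `build_sql` are all hypothetical, and the foreign keys are hard-coded where a real tool would read them from the relational catalog.

```python
import sqlite3

# Hypothetical mapping definition: XML element name -> (table, column).
MAPPING = {
    "customerName": ("customer", "name"),
    "orderTotal": ("orders", "total"),
}
# Referential constraints, hard-coded for the sketch:
# (child table, child column, parent table, parent column).
FOREIGN_KEYS = [("orders", "customer_id", "customer", "id")]


def build_sql(mapping, foreign_keys):
    """Generate a SELECT whose outer-join condition comes from the foreign key."""
    select = ", ".join(f"{t}.{c} AS {el}" for el, (t, c) in mapping.items())
    child, child_col, parent, parent_col = foreign_keys[0]
    if {t for t, _ in mapping.values()} != {child, parent}:
        raise ValueError("this sketch handles a single two-table join")
    return (
        f"SELECT {select} FROM {parent} "
        f"LEFT OUTER JOIN {child} ON {child}.{child_col} = {parent}.{parent_col} "
        f"ORDER BY {parent}.{parent_col}, {child}.{child_col}"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customer(id), total REAL)"
)
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
conn.execute("INSERT INTO orders VALUES (1, 1, 120.5)")

rows = conn.execute(build_sql(MAPPING, FOREIGN_KEYS)).fetchall()
print(rows)  # prints [('Acme', 120.5), ('Globex', None)]
```

The outer join keeps customers that have no orders, so the corresponding XML element can still be emitted with empty content.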
20- HiT Software's Allora is a family of XML-to-RDBMS integration middleware. It natively supports Sun's Java platform (jAllora) and Microsoft's Windows COM architecture (winAllora).
- Allora is a set of tools that can be integrated with server applications such as IXIASOFT's TEXTML Server. It aims to solve the problem of integrating any relational data source with any XML schema document.
- It can be used by Java application servers, Java Server Page applications, Visual Basic applications, C/C++/.NET applications, and Active Server Page applications.
- Allora features two major software components:
- the Allora Mapper and
- the Allora Engine.
21- The figure shows a typical system architecture of an end-to-end XML authoring and publishing application.
- In this architecture, winAllora serves as a bridge between legacy content or XML content generated on the fly and stored in an RDBMS.
- winAllora also provides a conversion agent for XML content extracted from TEXTML Server and exported to an RDBMS.
23- HiT Software's winAllora Mapper is a GUI application that allows creation and editing of mapping definitions between a W3C Schema or DTD instance and table fields from a relational database.
- Data mapping is done by dragging and dropping items between the XML structure and the database structure.
- The Mapper also supports retrieval and definition of referential integrity constraints. These are used to perform meaningful joins among different relational tables.
- winAllora also offers a complete set of APIs for programmatically automating the integration process using VBScript.
- Design-time mapping tools such as winAllora use scripting to provide the freedom to operate on either SQL data (marshalling) or XML data (unmarshalling).
25- winAllora Engine
- The winAllora Engine imports and exports data to and from the XML structure.
- The Engine is a set of COM interfaces that take the mapping definitions from the Mapper and process them to perform XML export or import.
- winAllora supports any ODBC- and OLE DB-compliant database, including most relational databases such as Oracle, DB2, MS SQL Server, Sybase, Pervasive, PointBase, or TimesTen.
- Import from RDBMS
- When the import process is started, winAllora deposits the relational data in XML format according to a specific DTD or XML schema.
- Export to RDBMS
- Users of content management applications modify XML documents stored in TEXTML Server and would like the changes to be reflected in other applications or databases.
26- The winAllora Export to RDBMS feature enables TEXTML Server to export XML documents to just about any RDBMS.
- winAllora can perform insert, update, and delete operations on the relational database and still enforce its referential integrity. No programming is necessary, as all the logic is included in the mapping definition of the winAllora Mapper.
- HiT Software Allora framework for XML-to-RDBMS integration
- Mapping XML to RDBMS (queries and DBMS catalogs)
- Marshalling and unmarshalling XML
- GUI Mapper
- Relational storage creation from XML Schema
- XML Schema creation from relational catalogs
- Data type portability across heterogeneous RDBMSs
- Support for XML SQL expressions
- Support for scripting and parametric queries
- Support for XML QBE queries
- Future
- XQuery support based on XML Schema-to-RDBMS mapping
27- References
- Sriram Raghavan and Hector Garcia-Molina, "Integrating Diverse Information Management Systems: A Brief Survey", http://www-db.stanford.edu/rsram/pubs/deb01/deb01.pdf
- White papers from HiT Software, http://www.hitsw.com