CS 630 InDepth Study On Integrating disparate databases using XML Bindumadhuri Arla PowerPoint PPT Presentation

1
CS 630: In-Depth Study on Integrating Disparate Databases Using XML
Bindumadhuri Arla
2
  • What are the different types of databases?
  • Text retrieval systems
  • Relational or object database systems
  • Semistructured/XML database systems
  • Why do we need to integrate them?
  • How do we integrate them?
  • Data warehousing: a data warehouse is a repository of integrated
    information, available for queries and analysis. Data and
    information are extracted from heterogeneous sources as they are
    generated, making it easier and more efficient to run queries
    over data that originally came from different sources.

3
  • The CWM (Common Warehouse Metamodel) XML package defines the
    metamodel classes needed to support the description of XML
    documents as data resources in a data warehouse.
  • The CWM XML metamodel does not contain an XML document with data
    to be interchanged; rather, it contains the XML DTD that
    describes the structure of XML documents that can be
    interchanged.
  • XMI
  • This specification allows warehouse metadata and the CWM
    metamodel to be interchanged using the W3C's Extensible Markup
    Language (XML).
  • XMI is used to transform the CWM metamodel into a CWM Document
    Type Definition (DTD), to transfer instances of warehouse
    metadata that conform to the CWM metamodel as XML documents,
    based on the CWM DTD, and to transform the CWM metamodel itself
    into an XML document, based on the MOF DTD, for interchange
    between MOF-compliant repositories.

4
  • Text Retrieval
  • Text retrieval systems are concerned with the
    management and query-based retrieval of
    collections of unstructured text documents.
  • Relational or object-oriented
  • Relational or object-oriented database
    systems are concerned with the management of
    structured or strictly typed data, i.e., data
    that conforms to a well-defined schema.
  • Semistructured data
  • Finally, semistructured databases are
    designed to efficiently manage data that only
    partially conforms to a schema, or whose schema
    can evolve rapidly.

5
  • The three common architectural approaches to integration
  • Layered approach
  • Here an information management system (IMS) of one type is
    implemented as an application that operates over an IMS of
    another type.
  • Loosely coupled or middleware approach
  • A middleware layer provides a uniform, common query interface to
    a collection of diverse management systems.
  • Extension approach
  • Here one type of system is extended, natively or through the use
    of plug-in models, to support operators, features, or data types
    normally found in another.
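
As a rough illustration of the middleware approach, the Python sketch below puts one common query interface in front of two different systems: a relational table (SQLite) and an XML document. The class names, the `find_by_author` method, and the sample data are all hypothetical and chosen only for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET


class RelationalSource:
    """Wraps a relational database behind the common interface."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE book (title TEXT, author TEXT)")
        self.conn.execute("INSERT INTO book VALUES ('SQL Basics', 'Lee')")

    def find_by_author(self, author):
        rows = self.conn.execute(
            "SELECT title FROM book WHERE author = ?", (author,)
        )
        return [title for (title,) in rows]


class XmlSource:
    """Wraps an XML document behind the same interface."""

    def __init__(self):
        self.root = ET.fromstring(
            "<library>"
            "<book><title>XML Primer</title><author>Lee</author></book>"
            "<book><title>IR Methods</title><author>Kim</author></book>"
            "</library>"
        )

    def find_by_author(self, author):
        return [
            book.findtext("title")
            for book in self.root.findall("book")
            if book.findtext("author") == author
        ]


class Middleware:
    """Uniform query interface over a collection of diverse sources."""

    def __init__(self, sources):
        self.sources = sources

    def find_by_author(self, author):
        results = []
        for source in self.sources:
            results.extend(source.find_by_author(author))
        return results


middleware = Middleware([RelationalSource(), XmlSource()])
titles = middleware.find_by_author("Lee")
print(titles)
```

The caller never sees SQL or XML; each wrapper translates the common request into its own system's query mechanism, which is the essence of the loosely coupled design.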

6
  • Text Retrieval and Relational/Object Database Systems
  • Here integration is a challenging and difficult task.
  • Fundamental differences in the query and retrieval models
    (precisely defined declarative queries and exact answers in
    databases versus imprecise queries and approximate retrieval in
    IR systems) have resulted in vastly different query languages,
    index structures, storage formats, and query processing
    techniques.
  • The database community has favored the extension architecture,
    with the aim of efficiently providing IR features within the
    DBMS framework.
  • In contrast, the IR community has shown a preference for the
    layered architecture, with the aim of exploiting DBMS features
    (concurrency control, recovery, security, transaction semantics,
    robustness, etc.) to build more scalable and robust text
    retrieval systems.

7
  • Extensions to Database Systems
  • The extension-based approach of the DB community has led to work
    on extended relational models and algebras, extensions to
    database query languages, new index structures and data types,
    and query execution strategies for optimizing IR-style text
    operations.
  • Extensions to the relational model fall into two categories:
  • nested (non-first normal form, or NF²) relational models, which
    capture hierarchical document structure, and
  • probabilistic models, which incorporate uncertainty and
    imprecision into the DBMS framework. Such probabilistic
    extensions, though promising, require substantial changes to the
    core query processing algorithms of database systems and, as a
    result, are not yet part of actual implementations.
  • The technique of cooperative indexing provides a framework for
    scalable integration of IR and database systems. In this
    approach, the IR extension components define how documents are
    processed, how index terms are extracted and stemmed, and the
    kinds of information that are associated with each index entry.

8
  • Layering IR Systems atop Database Systems
  • Some of the earliest attempts at integrating IR and DB systems
    treated a text retrieval system as a database application that
    was implemented on top of a standard relational DBMS.
  • The inverted index, the lexicon, and other term frequency
    statistics were stored in standard database tables.
  • IR queries were translated into SQL queries over these tables
    and executed by the database.
  • In addition, several prototype text retrieval systems have been
    built using object database systems. Since the object data model
    natively supports nesting, in addition to collection types and
    sets, systems for content-based retrieval of structured
    documents could be effectively implemented on top of OODB
    systems.
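
A minimal sketch of this layering, assuming a made-up three-document collection and a hypothetical `postings` table: the inverted index lives in an ordinary SQLite table, and a conjunctive keyword query is translated into a single SQL statement. It illustrates the idea only and is not taken from any of the systems mentioned.

```python
import sqlite3

docs = {
    1: "xml databases store semistructured data",
    2: "relational databases store structured data",
    3: "text retrieval ranks documents",
}

conn = sqlite3.connect(":memory:")
# The inverted index is an ordinary table: one row per (term, document)
# pair, with the term frequency kept as a simple statistic.
conn.execute("CREATE TABLE postings (term TEXT, doc_id INTEGER, tf INTEGER)")

for doc_id, text in docs.items():
    words = text.split()
    for term in set(words):
        conn.execute(
            "INSERT INTO postings VALUES (?, ?, ?)",
            (term, doc_id, words.count(term)),
        )


def search(query):
    """Translate an all-terms-must-match keyword query into one SQL query."""
    terms = sorted(set(query.lower().split()))
    placeholders = ", ".join("?" for _ in terms)
    sql = (
        "SELECT doc_id FROM postings "
        f"WHERE term IN ({placeholders}) "
        "GROUP BY doc_id "
        "HAVING COUNT(DISTINCT term) = ? "
        "ORDER BY doc_id"
    )
    rows = conn.execute(sql, (*terms, len(terms)))
    return [doc_id for (doc_id,) in rows]


print(search("databases store"))
print(search("xml data"))
```

Because the index is just a table, the retrieval layer inherits the database's storage, concurrency control, and recovery without implementing any of them itself.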

9
  • XML and Relational Databases
  • XML is a meta-language designed to be both human- and
    machine-readable, and it can be understood by virtually any
    software application.
  • A relational database is a collection of interrelated tables,
    each consisting of rows and columns.
  • Data is stored inside these tables, and all operations on data
    are performed on the tables themselves, producing additional
    tables as a result.
  • When XML documents are stored in a relational database, the
    various XML elements and attributes must be mapped to a
    predetermined structure according to information contained in a
    DTD or XML schema.
  • Relational databases are particularly good for storing and
    querying highly structured information.
  • An RDBMS stores data efficiently and with no redundancy because
    each unit of information is saved in only one place
    (normalization).

10
  • RDBMSs are also known for their reliability and scalability and
    can be accessed by a very large number of concurrent users.
  • RDBMSs were never designed to handle the semistructured content
    that is often stored as XML.
  • Semistructured content is information considered to be more
    document-centric, that is, content of a more unpredictable, less
    structured nature: content that varies in length and type, with
    elements or attributes that are often empty or missing, and
    whose ordering is important.
  • Examples of document-centric XML documents include books,
    technical documentation, legal briefs, patient health records,
    and news content.

11
  • XML is a versatile markup language, capable of labeling the
    information content of diverse data sources, including
    structured and semistructured documents, relational databases,
    and object repositories.
  • A query language that uses the structure of XML intelligently
    can express queries across all these kinds of data, whether
    physically stored in XML or viewed as XML via middleware.
  • The W3C XQuery specification describes such a query language,
    XQuery, which is designed to be broadly applicable across many
    types of XML data sources.
  • The major issues in using XML for data integration are:
  • Schema management: when heterogeneous data sets are mapped,
    mappings are created between their schemas.
  • Correspondence management: to integrate data sources,
    correspondences are made between them. Automating this step is
    called schema matching.
  • Mapping management: to establish a meaning for correspondences,
    inter-schema constraints are established. Containment
    constraints are established by mapping.
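
To make the idea of schema matching concrete, here is a deliberately naive sketch that proposes correspondences between relational column names and XML element names whenever their normalized names are equal. The field names are invented, and real schema matchers would also draw on data types, instance values, and structure; this only shows the shape of the problem.

```python
import re


def normalize(name):
    """Lower-case a field name and drop separators: 'Cust_Name' -> 'custname'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def match_schemas(source_fields, target_fields):
    """Return correspondences between fields whose normalized names are equal."""
    targets = {normalize(field): field for field in target_fields}
    return {
        field: targets[normalize(field)]
        for field in source_fields
        if normalize(field) in targets
    }


relational_columns = ["cust_id", "Cust_Name", "phone"]
xml_elements = ["CustId", "custName", "email"]
correspondences = match_schemas(relational_columns, xml_elements)
print(correspondences)
```

Fields left unmatched (`phone` and `email` here) are exactly the cases where a person, or a more sophisticated matcher, still has to decide.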

12
  • Text Retrieval and Semistructured Database Systems
  • Since XML shares the same graph-based data model as several
    other semistructured database query languages, most of the work
    to date on querying, indexing, and searching XML corpora has its
    origins in the database community. Fuhr and Grossjohann refer to
    this approach as the data-centric view of XML.
  • The alternative, document-centric approach treats an XML corpus
    as a collection of logically structured text documents. By
    extending IR models and indexes to encode the structure and
    semantics of XML documents, it becomes possible to apply
    well-known IR techniques and support keyword searches,
    similarity-based retrieval, automatic classification, and
    clustering of XML corpora.
  • Another approach integrates keyword searches with XML query
    processing. This work extends the XML-QL query language by
    introducing a new contains predicate for keyword-based search
    operations. It defines the precise semantics of the extended
    query language and describes how to efficiently execute queries
    that involve keyword search as well as non-text operations.
  • XIRQL is an extension to the XQL query language that supports
    IR-related features such as weighting, ranking,
    relevance-oriented search, and vague predicates.
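
The idea of combining a keyword predicate with a structural XML query can be sketched in a few lines of Python. This is not XML-QL or XIRQL syntax; the `contains` helper, the sample corpus, and the element names are assumptions made purely for illustration, using the limited path syntax of the standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

corpus = ET.fromstring(
    "<articles>"
    "<article year='2001'><title>XML views</title>"
    "<body>Publishing relational data as XML documents</body></article>"
    "<article year='1999'><title>Ranking</title>"
    "<body>Vague predicates and relevance ranking</body></article>"
    "<article year='2001'><title>Indexing</title>"
    "<body>Inverted index structures for text</body></article>"
    "</articles>"
)


def contains(element, keyword):
    """Keyword predicate: true if the word occurs anywhere inside the element."""
    words = " ".join(element.itertext()).lower().split()
    return keyword.lower() in words


# A structural condition (the year attribute) combined with a keyword condition.
hits = [
    article.findtext("title")
    for article in corpus.findall("article[@year='2001']")
    if contains(article, "relational")
]
print(hits)
```

A production system would answer the keyword part from a text index rather than by scanning each element, which is where the efficient-execution work mentioned above comes in.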
13
  • Relational/Object and Semistructured Database Systems
  • The integration of XML with relational and object database
    systems is an extremely active research area. Techniques for
    mapping the XML (semistructured) data model to the relational or
    object data model, for exporting relational data as XML
    documents, for providing XML views of relational data, and for
    extending relational query engines to process queries over XML
    data are topics currently being investigated by the research
    community.
  • In the commercial arena, most major relational database vendors
    already provide support for an XML data type to natively store
    and manage XML documents, as well as some primitive programming
    APIs for importing and exporting XML documents to and from
    database tables.

14
  • Extensions to Relational/Object Database Systems
  • Since XML is intended as a language for inter-enterprise
    information interchange, it is natural that techniques for
    publishing relational data as XML documents are in great demand.
    Several commercial tools already provide this functionality, but
    with some limitations.
  • Oracle's XSQL tool generates a fixed canonical mapping of the
    relational data into XML documents, by mapping each relation and
    attribute to an XML element and nesting tuple elements within
    table elements.
  • IBM's DB2 XML Extender supports a language for composing
    relational data into arbitrary XML as well as for decomposing
    XML documents into relations.
  • In general, there are two parts to designing a system for
    publishing relations as XML documents.
  • The first is a language to specify the conversion/mapping from
    relations to XML documents.
  • The second is an efficient implementation strategy to actually
    carry out the conversion.
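
The fixed canonical mapping described on this slide (one element per relation, one nested element per tuple, one element per attribute) can be sketched as follows. The `employee` table, the `row` element name, and the `table_to_xml` helper are assumptions for illustration; this is not the actual output format of Oracle's tool or of any other product.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)", [(1, "Ann"), (2, "Bob")]
)


def table_to_xml(conn, table):
    """Canonical mapping: table element > tuple elements > attribute elements."""
    # The table name is interpolated directly, so it must come from trusted code.
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY 1")
    columns = [col[0] for col in cursor.description]
    table_el = ET.Element(table)
    for row in cursor:
        row_el = ET.SubElement(table_el, "row")
        for column, value in zip(columns, row):
            # NULLs become empty elements in this sketch.
            ET.SubElement(row_el, column).text = "" if value is None else str(value)
    return table_el


xml_text = ET.tostring(table_to_xml(conn, "employee"), encoding="unicode")
print(xml_text)
```

The mapping is fixed: the shape of the XML follows the shape of the table, which is the limitation that mapping languages for arbitrary XML are meant to remove.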

15
  • The SilkRoute system is one of the earliest research prototypes
    that supported automatic XML generation from relational tables.
    SilkRoute used a language called RXL, based on a combination of
    SQL and XML query languages, for specifying mappings of
    relational tables to arbitrary XML DTDs. Using this language, it
    is possible to define XML views of the relational data.
    SilkRoute efficiently executes queries over these XML views by
    materializing only the portion of the XML that is required to
    answer the query.
  • Ozone is an extension of an object database system that handles
    both structured and semistructured data.
  • Layered systems
  • To design a database-backed XML repository, one must precisely
    define
  • (i) a mapping from XML documents to tables or objects,
  • (ii) an algorithm for translating queries over XML documents
    into SQL or OQL queries over the underlying database, and
  • (iii) a mechanism for translating the result of executing the
    database query into XML.
  • There are several proposals for implementing XML repositories on
    top of relational and object database systems, differing in
    their choices for (i), (ii), and (iii).
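
The three components of a layered XML repository can be seen together in a toy sketch. Everything here is an assumption for illustration: the `MAPPING` dictionary, the tiny path grammar `/element[column=value]/column`, and the `store`, `translate`, and `query` helpers. Real proposals support far richer query languages.

```python
import sqlite3
import xml.etree.ElementTree as ET

# (i) Mapping from XML to tables: each <book> becomes a row of table "book",
#     and each listed child element becomes a column.
MAPPING = {"book": ["title", "year"]}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (title TEXT, year TEXT)")


def store(xml_text):
    """Shred a document into rows according to MAPPING."""
    root = ET.fromstring(xml_text)
    for element, columns in MAPPING.items():
        for node in root.iter(element):
            values = [node.findtext(column) for column in columns]
            marks = ", ".join("?" for _ in columns)
            conn.execute(f"INSERT INTO {element} VALUES ({marks})", values)


def translate(path):
    """(ii) Translate a path such as '/book[year=2001]/title' into SQL."""
    element, rest = path.strip("/").split("[", 1)
    condition, column = rest.split("]/", 1)
    where_column, value = condition.split("=", 1)
    if column not in MAPPING[element] or where_column not in MAPPING[element]:
        raise ValueError("path refers to an unmapped element")
    sql = (
        f"SELECT {column} FROM {element} "
        f"WHERE {where_column} = ? ORDER BY {column}"
    )
    return sql, (value,), column


def query(path):
    """(iii) Run the translated SQL and wrap the rows back into XML."""
    sql, params, column = translate(path)
    result = ET.Element("result")
    for (value,) in conn.execute(sql, params):
        ET.SubElement(result, column).text = value
    return ET.tostring(result, encoding="unicode")


store(
    "<library>"
    "<book><title>XML Primer</title><year>2001</year></book>"
    "<book><title>SQL Basics</title><year>1999</year></book>"
    "</library>"
)
answer = query("/book[year=2001]/title")
print(answer)
```

The application sees only XML in and XML out; the relational engine does the filtering, which is the point of the layered design.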

16
  • There are three basic alternatives for mapping XML documents
    into relational tables.
  • The simplest, and least useful, mapping is to store an entire
    XML document as a single database attribute.
  • Another possibility is to interpret XML documents as graph
    structures and supply a relational schema that can store such
    graphs.
  • A third approach is to map the structure of the XML documents
    (e.g., expressed as a DTD) into a corresponding relational
    schema and to store the documents based on these mappings.
  • Only the last approach allows the repository to fully exploit
    the query processing and optimization capabilities of the
    underlying database system.
  • Techniques for mapping XML documents into object databases tend
    to be considerably simpler.
  • Two key challenges need to be addressed:
  • OODB systems are generally strongly typed whereas XML, being
    semistructured, is not. As a result, the object model of the
    database must most often be extended before it can be used for
    implementing the XML repository.
  • Many OODB systems support only simple path
    expressions whereas most XML
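
The second relational alternative on this slide, storing XML as a graph in a generic relational schema, can be sketched with a single node table. The table layout (`parent_id`, `position`, `tag`, `content`) and the sample memo are assumptions for illustration only.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# One generic table holds any XML document as a graph: each row is a node
# with a pointer to its parent and its position among its siblings.
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, "
    "position INTEGER, tag TEXT, content TEXT)"
)


def store(element, parent_id=None, position=0):
    """Insert an element and, recursively, all of its children."""
    cursor = conn.execute(
        "INSERT INTO node (parent_id, position, tag, content) "
        "VALUES (?, ?, ?, ?)",
        (parent_id, position, element.tag, (element.text or "").strip()),
    )
    node_id = cursor.lastrowid
    for child_position, child in enumerate(element):
        store(child, node_id, child_position)
    return node_id


store(ET.fromstring(
    "<memo><to>Ann</to><body>Hello</body><to>Bob</to></memo>"
))

# Find every <to> directly under <memo>: one self-join per path step.
rows = conn.execute(
    "SELECT c.content FROM node AS p "
    "JOIN node AS c ON c.parent_id = p.id "
    "WHERE p.tag = 'memo' AND c.tag = 'to' "
    "ORDER BY c.position"
).fetchall()
recipients = [content for (content,) in rows]
print(recipients)
```

No DTD is needed to store a document this way, but every step of a path query costs a self-join, which suggests why a structure-based mapping lets the database optimizer do more.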

17
  • Conclusion
  • Integration of semistructured data in general, and XML in
    particular, with the relational/object database world has
    received far more attention than the corresponding integration
    with text retrieval systems.
  • This will change once efforts to incorporate IR-style operators
    and structured text retrieval models into XML query languages
    bear fruit.
  • The paper summarized above concentrated mainly on pairs of
    systems.

18
  • My study is narrowed to integrating XML and relational
    databases.
  • Most of the world's data repositories are relational databases.
  • XML is a useful syntax for data exchange.
  • Tools are needed to facilitate transforming relational data into
    XML.
  • The other two papers I am going to discuss are on how to
    integrate relational databases and XML:
  • Integrating XML and Relational Database Technologies: A Position
    Paper, by Giovanni Guardalben, HiT Software, Inc.
  • X-Ray: Towards Integrating XML and Relational Database Systems,
    technical report by Gerti Kappel

19
  • HiT Allora
  • Techniques for modeling
  • User-defined mapping
  • SQL to materialize XML is generated automatically, based on the
    mapping and the relational catalog
  • Mapping definitions trigger the use of referential constraints
    to define joins/outer joins
  • Dynamic SQL creation can use user-defined predicates and
    parametric predicates as well as scripts
  • Portability across multiple RDBMSs
  • Schema-based
  • Define a virtual XML Schema-based collection of relational
    tables (default database)
  • Query normalization
  • Translate XQuery constructs into SQL queries based on the
    default database
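
The idea of generating SQL from a mapping plus the relational catalog can be illustrated with SQLite, whose `PRAGMA foreign_key_list` exposes referential constraints. This is only a sketch of the general technique under invented tables and a hypothetical `mapping` dictionary and `generate_sql` helper; it does not reproduce how Allora itself works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        item TEXT
    );
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 'disk');
""")

# Hypothetical user-defined mapping: XML element name -> table.column
mapping = {"name": "customer.name", "item": "orders.item"}


def generate_sql(conn, parent, child, mapping):
    """Build a SELECT whose join condition comes from the catalog's foreign keys."""
    # PRAGMA foreign_key_list rows start with: id, seq, table, from, to
    for _, _, ref_table, from_col, to_col, *_ in conn.execute(
        f"PRAGMA foreign_key_list({child})"
    ):
        if ref_table == parent:
            columns = ", ".join(f"{col} AS {tag}" for tag, col in mapping.items())
            return (
                f"SELECT {columns} FROM {parent} "
                f"LEFT OUTER JOIN {child} "
                f"ON {child}.{from_col} = {parent}.{to_col} "
                f"ORDER BY {parent}.{to_col}"
            )
    raise ValueError(f"no referential constraint from {child} to {parent}")


sql = generate_sql(conn, "customer", "orders", mapping)
print(sql)
rows = conn.execute(sql).fetchall()
print(rows)
```

The outer join keeps customers that have no orders, so the generated XML can still contain a customer element with no nested order elements.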

20
  • HiT Software's Allora is a family of XML-to-RDBMS integration
    middleware. It natively supports Sun's Java platform (jAllora)
    and Microsoft's Windows COM architecture (winAllora).
  • Allora is a set of tools that can be integrated with server
    applications such as IXIASOFT's TEXTML Server. It aims to solve
    the problem of integrating any relational data source with any
    XML schema document.
  • It can be used by Java application servers, Java Server Page
    applications, Visual Basic applications, C/C++/.NET
    applications, and Active Server Page applications.
  • Allora features two major software components:
  • the Allora Mapper and
  • the Allora Engine.

21
  • The figure shows a typical system architecture of an end-to-end
    XML authoring and publishing application.
  • In this architecture, winAllora serves as a bridge between
    legacy content or XML content generated on the fly and stored in
    an RDBMS.
  • winAllora also provides a conversion agent for XML content
    extracted from TEXTML Server and exported to an RDBMS.

22
(No Transcript)
23
  • HiT Software's winAllora Mapper is a GUI application that allows
    creation and editing of mapping definitions between a W3C Schema
    or DTD instance and table fields from a relational database.
  • Data mapping is done by dragging and dropping items between the
    XML structure and the database structure.
  • The Mapper also supports retrieval and definition of referential
    integrity constraints. These are used to perform meaningful
    joins among different relational tables.
  • winAllora also offers a complete set of APIs for
    programmatically automating the integration process using
    VBScript.
  • Design-time mapping tools such as winAllora use scripting to
    provide the freedom to operate on either SQL data (marshalling)
    or XML data (unmarshalling).

24
(No Transcript)
25
  • winAllora Engine
  • The winAllora Engine imports and exports data to and from the
    XML structure.
  • The Engine is a set of COM interfaces that take the mapping
    definitions from the Mapper and process them to perform XML
    export or import.
  • winAllora supports any ODBC- or OLE DB-compliant database,
    including most relational databases such as Oracle, DB2, MS-SQL
    Server, Sybase, Pervasive, Pointbase, or Timesten.
  • Import from RDBMS
  • When the import process is started, winAllora deposits the
    relational data in XML format according to a specific DTD or XML
    schema.
  • Export to RDBMS
  • Users of content management applications modify XML documents
    stored in TEXTML Server and would like the changes to be
    reflected in other applications or databases.
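
The export direction described on this slide, where changes in XML documents are written back to relational tables, can be sketched generically as follows. The tables, the XML layout, and the `export_to_rdbms` helper are invented for illustration and say nothing about winAllora's own interfaces; the sketch only shows why parent rows must be written before child rows when the database enforces referential integrity.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce references
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        item TEXT
    );
""")


def export_to_rdbms(xml_text):
    """Write each <customer> and its nested <order> elements as rows."""
    root = ET.fromstring(xml_text)
    with conn:  # one transaction: commit on success, roll back on error
        for customer in root.findall("customer"):
            customer_id = int(customer.get("id"))
            conn.execute(
                "INSERT INTO customer VALUES (?, ?)",
                (customer_id, customer.findtext("name")),
            )
            # The parent row goes in first so child rows satisfy the foreign key.
            for order in customer.findall("order"):
                conn.execute(
                    "INSERT INTO orders VALUES (?, ?, ?)",
                    (int(order.get("id")), customer_id, order.findtext("item")),
                )


export_to_rdbms(
    "<customers>"
    '<customer id="1"><name>Ann</name>'
    '<order id="10"><item>disk</item></order>'
    '<order id="11"><item>tape</item></order>'
    "</customer>"
    "</customers>"
)
order_rows = conn.execute(
    "SELECT customer_id, item FROM orders ORDER BY id"
).fetchall()
print(order_rows)
```

Wrapping the whole document in one transaction means a constraint violation on any row leaves the database unchanged rather than half-updated.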

26
  • The winAllora Export to RDBMS feature enables TEXTML Server to
    export XML documents to just about any RDBMS.
  • winAllora can perform insert, update, and delete operations on
    the relational database and still enforce its referential
    integrity. No programming is necessary, as all the logic is
    included in the mapping definition of the winAllora Mapper.
  • HiT Software Allora Framework for XML-to-RDBMS Integration
  • Mapping XML to RDBMS (queries and DBMS catalogs)
  • Marshalling and unmarshalling XML
  • GUI Mapper
  • Relational storage creation from XML Schema
  • XML Schema creation from relational catalogs
  • Data type portability across heterogeneous RDBMSs
  • Support for XML SQL expressions
  • Support for scripting and parametric queries
  • Support for XML QBE queries
  • Future
  • XQuery support based on XML Schema-to-RDBMS mapping

27
  • References
  • Sriram Raghavan and Hector Garcia-Molina, Integrating Diverse
    Information Management Systems: A Brief Survey,
    http://www-db.stanford.edu/rsram/pubs/deb01/deb01.pdf
  • White papers from HiT Software, http://www.hitsw.com