Title: Information Extraction and Integration from Heterogeneous, Distributed, Autonomous Information Sourc
1Information Extraction and Integration from Heterogeneous, Distributed, Autonomous Information Sources: A Federated Ontology-Driven Query-Centric Approach
By Jaime Castillo, Adrian Silvescu, Doina Caragea, Jyotishman Pathak, Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University, USA
- Presented By
- Ronak Shah
- Graduate Student
- University of Southern California.
2Presentation Overview
- Introduction
- Data Integration Systems
- Data Integration in INDUS
- Implementation of INDUS
- Summary and Discussion
3Introduction
- Development of high-throughput data acquisition
- Advances in digital storage, computing and communication technologies
- Opportunity in data-driven knowledge acquisition and decision making
4- Challenges in using increasing amounts of data from disparate sources
5- Issue
- Data repositories are large, dynamic and physically distributed
- Neither desirable nor feasible to gather all data in a centralized location for analysis
- Solution
- Algorithms to efficiently extract the relevant information from disparate sources on demand
6- Issue
- Data sources are autonomously owned and operated
- Range of operations and precise mode of allowed interactions can be diverse
- Solution
- Strategy for obtaining required information within operational constraints
7- Issue
- Data sources are heterogeneous in structure and content
- Each data source uses its own ontology
- Solution
- Effective integration of data from different sources, bridging the semantic and syntactic mismatches among the data sources
8- Issue
- No Single Universal Ontology
- Solution
- Methods for context-dependent dynamic information
extraction and integration from distributed data
based on user-specified ontologies for knowledge
acquisition and decision making
9Data Integration Systems
- Provide users with seamless and flexible access to information from multiple autonomous, distributed and heterogeneous data sources
- Unified query interface
- Allow users to specify what information is needed without mentioning how and from where to obtain it
10Data Integration Systems should provide
mechanisms for
- Communication and interaction with each data source
- Specification of a query, expressed in terms of a user-specified ontology, across disparate sources
- Mapping between user and data-source-specific ontologies
- Transformation of a query into a plan
- Integration and presentation of the results in the vocabulary known to the user
11Two Classes of Approach to Data Integration
- Data Warehousing
- Data Federation
12Data Warehousing
- Data is collected from disparate sources, mapped to a common structure and stored in a central location
- Need to periodically update the warehouse in order to ensure that the data is accurate
- Not possible to analyze the same data from different perspectives
- Each user queries the warehouse using a common vocabulary and a common query interface
13Data Federation
- Data is gathered directly from the data sources
- Results are up-to-date
- Allows users to impose their own ontology on data from disparate sources
- Provides two sets of operations
- get()
- transform()
- Operations capable of dealing with syntactic and
semantic mismatches between global ontology and
source specific ontology
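The slide names get() and transform() but does not give their signatures. A minimal sketch of how the pair might divide the work, assuming get() fetches raw records at query time and transform() maps one record into the user's vocabulary. The source data, field names and unit conversion are all hypothetical.

```python
# Hypothetical source: it names fields its own way and records
# temperature in Fahrenheit (a syntactic and a semantic mismatch).
LOCAL_SOURCE = [
    {"stn": "A1", "temp_f": 68.0},
    {"stn": "B2", "temp_f": 86.0},
]


def get(source):
    """Fetch raw records from a data source at query time (no central copy)."""
    return list(source)


def transform(record):
    """Map one source record to the user's ontology: renamed field, Celsius."""
    return {
        "station": record["stn"],
        "temp_c": round((record["temp_f"] - 32) * 5 / 9, 1),
    }


# The answer is built on demand, in the user's vocabulary.
results = [transform(r) for r in get(LOCAL_SOURCE)]
print(results)
```

Because the data is read when the query runs, the answer reflects the current state of the source, which is the up-to-date property listed above.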
14Approaches to deal with semantic mismatches
between global and local ontologies
- Source Centric Approach
- Query Centric Approach
15Source Centric Approach
- Individual data source determines how the concepts in the local ontology are mapped to concepts in the global ontology
- User has little control over the true meaning of concepts in the global ontology
- User is not responsible for specifying the transformation between global concepts and local concepts
16Query Centric Approach
- Concepts in the global ontology are defined in terms of concepts in the local ontologies
- Suited for applications which require users to impose their own ontologies to flexibly interpret and analyze data from different sources
- User or administrator needs to specify precisely how global concepts can be composed from local concepts
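A small illustration of the query-centric idea, under stated assumptions: two hypothetical users each define the same global concept ("highly expressed gene") over one local concept, and so read the same data differently. The data, user names and thresholds are invented for the example.

```python
# Local (source) concept: hypothetical gene-expression readings.
LOCAL_READINGS = [
    {"gene": "g1", "level": 2.5},
    {"gene": "g2", "level": 7.0},
    {"gene": "g3", "level": 4.1},
]

# Each user specifies how the global concept is composed from the local one.
USER_DEFINITIONS = {
    "alice": lambda reading: reading["level"] > 4.0,
    "bob": lambda reading: reading["level"] > 6.0,
}


def highly_expressed(user):
    """Instances of the global concept under the given user's definition."""
    rule = USER_DEFINITIONS[user]
    return [r["gene"] for r in LOCAL_READINGS if rule(r)]


print(highly_expressed("alice"))
print(highly_expressed("bob"))
```

The source is untouched in both cases; only the user-side definition changes, which is what gives the user control over the meaning of global concepts.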
17INDUS
- Intelligent Data Understanding System
- Environment for data-driven extraction and integration from heterogeneous, distributed and autonomous information sources
- Users are able to flexibly interpret and analyze the data from various perspectives
- Motivated by the requirements of applications such as scientific discovery
- Provides a federated, query-centric approach to data integration using user-specified ontologies
18Data Integration In INDUS
- 3-Layer Architecture
- Physical Layer
- Ontological Layer
- User Interface Layer
19Physical Layer
- Allows the system to communicate with the information sources
- Based on a federated database architecture
20Ontological Layer
- Contains global ontologies specified by the users and their mappings to local ontologies
- Transforms queries expressed in terms of global ontologies into execution plans
21User Interface Layer
- Enables the user to interact with the system
- Define ontologies
- Post query and receive information
- Hides the complexity from the user
22Some INDUS-Related Terminology
- Concepts
- Ground Concepts
- Compound concepts
- Global Ontology
- Queries
23Concepts
- Equivalent to the mathematical notion of a relation
- A subset of the Cartesian product of a list of domains, where each domain is finite
- Stored as instances in relational databases
- Two types of concepts
- Ground concept
- - Ground concepts are those whose instances can be retrieved from one or more data sources using a set of predefined operations
- Compound concept
- - The definition of a compound concept X specifies the set of operations that must be applied over a set of instances of other previously defined concepts in order to determine the set of instances of X
- - INDUS uses four operations to define new compound concepts, namely selection, projection, vertical integration and horizontal integration
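The four operations can be sketched over concepts held as lists of records. This is an illustration, not INDUS code: it assumes vertical integration behaves like a relational join on a shared key and horizontal integration like a union of instances, which the slides do not spell out. All function names and data are hypothetical.

```python
def select(instances, predicate):
    """Selection: keep the instances that satisfy a condition."""
    return [row for row in instances if predicate(row)]


def project(instances, attributes):
    """Projection: keep only the named attributes of each instance."""
    return [{a: row[a] for a in attributes} for row in instances]


def vertical_integration(left, right, key):
    """Join-like: combine attributes of instances that share a key value."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def horizontal_integration(first, second):
    """Union-like: pool instances with the same attributes, no duplicates."""
    merged = []
    for row in first + second:
        if row not in merged:
            merged.append(row)
    return merged


# Ground concepts (hypothetical instances from three sources).
site1 = [{"id": "P1", "organism": "yeast"}, {"id": "P2", "organism": "human"}]
site2 = [{"id": "P2", "organism": "human"}, {"id": "P3", "organism": "human"}]
lengths = [{"id": "P1", "length": 120}, {"id": "P2", "length": 300},
           {"id": "P3", "length": 95}]

# Compound concepts defined by applying operations to earlier concepts.
proteins = horizontal_integration(site1, site2)
annotated = vertical_integration(proteins, lengths, "id")
long_human = project(
    select(annotated, lambda r: r["organism"] == "human" and r["length"] > 100),
    ["id"],
)
print(long_human)
```

Each compound concept is defined only in terms of previously defined ones, matching the definition of a compound concept X given above.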
24Global Ontology
- Global ontology consists of the set of concepts that are used to describe entities and relationships in the domain of discourse
- It can be customized to suit the needs of a user or a group of users
- Queries are expressed in terms of concepts
- Can be extended by defining new concepts
- Hides the complexity of accessing and retrieving the information from the data sources
- Mapping the semantics of concepts in the global and user-defined ontologies helps to resolve the semantic mismatches
- It resides in the ontological layer
25Queries
- A query allows access to the instances of the respective concept
- Generally specified by selection and projection
- The query is expressed as an expression tree before being answered
- Internal nodes represent operations and leaf nodes represent ground concepts
- The query is executed after the plan (tree) is created
- INDUS provides instantiators that are able to interact with the data sources and retrieve the information from them
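The expression tree described above can be sketched as follows. This is a minimal illustration under assumptions: the class names, the execute() function and the sample data are hypothetical and not taken from INDUS. Leaves call an instantiator to obtain ground-concept instances, and internal nodes apply an operation to their children's results.

```python
from dataclasses import dataclass
from typing import Callable, List

Rows = List[dict]


@dataclass
class Ground:
    """Leaf node: a ground concept whose instances come from an instantiator."""
    name: str
    instantiator: Callable[[], Rows]


@dataclass
class Operation:
    """Internal node: an operation applied to the results of its children."""
    name: str
    apply: Callable[..., Rows]
    children: list


def execute(node):
    """Evaluate the plan bottom-up: leaves first, then the operations above."""
    if isinstance(node, Ground):
        return node.instantiator()
    inputs = [execute(child) for child in node.children]
    return node.apply(*inputs)


# Plan for "ids of proteins longer than 100": projection over selection.
protein = Ground("Protein", lambda: [{"id": "P1", "length": 80},
                                     {"id": "P2", "length": 300}])
selection = Operation(
    "selection", lambda rows: [r for r in rows if r["length"] > 100], [protein])
plan = Operation(
    "projection", lambda rows: [{"id": r["id"]} for r in rows], [selection])

answer = execute(plan)
print(answer)
```

The tree is built completely before execute() is called, mirroring the slide's point that the query runs only after the plan exists.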
26Implementation of INDUS
- Implementation of the data integration component of INDUS
- Five principal modules
- Graphical User Interface
- Common Global Ontology Area
- Instantiator Library
- Query Resolution
- User Workspace
27Implementation of INDUS
28Graphical User Interface
- Allows the user to interact with INDUS
- Enables the user to describe ontologies, define operational definitions of ground concepts, compound concepts and queries, register the iterators and execute queries
29Common Global Ontology Area
- Manages the repository where definitions of ontologies, ground concepts, compound concepts, queries and iterator signatures are stored
30Instantiator Library
- Contains sets of functions used to interact with the individual data sources
- Each instantiator is based on an iterator
- Iterator interacts directly with the data sources
- Iterator is implemented as a Java class
- Instantiator supplies parameters to control the behavior of the iterator
- Maps the instances returned by the iterator to instances of the corresponding ground concept
- Functionality corresponds to that of a wrapper
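The split between iterator and instantiator can be sketched as below. INDUS iterators are Java classes; this is only a Python analogue for illustration. The class name, the CSV source, the delimiter parameter and the field names are all assumptions.

```python
import csv
import io

# Hypothetical source content; a real iterator would read a file,
# a database or a web service.
RAW_CSV = "acc,len\nP1,120\nP2,300\n"


class CsvIterator:
    """Iterator: talks to the data source directly and yields raw records."""

    def __init__(self, text, delimiter=","):
        self._reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration from the reader ends the iteration.
        return next(self._reader)


def protein_instantiator():
    """Instantiator: sets the iterator's parameters and maps each raw
    record to an instance of the ground concept (id, length)."""
    iterator = CsvIterator(RAW_CSV, delimiter=",")
    return [{"id": rec["acc"], "length": int(rec["len"])} for rec in iterator]


instances = protein_instantiator()
print(instances)
```

Only the iterator knows the source format. The instantiator sees records and emits ground-concept instances, which is the wrapper role described above.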
31Query Resolution Module
- Accepts a query expressed in terms of concepts in the global ontology as input
- Returns the answer to the query, constructed from the relevant data sources
32User Workspace
- Manages a private workspace where users store answers to posted queries
- Partial results are also stored
- The set of instances associated with each ground concept present in the expression tree is stored as a populated relational table
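Storing ground-concept instances as a populated relational table can be sketched as follows. The slides do not name the database used for the workspace; this example assumes an in-memory SQLite database, and the table name and columns are hypothetical.

```python
import sqlite3

# Stands in for one user's private workspace.
workspace = sqlite3.connect(":memory:")


def store_instances(conn, table, rows):
    """Materialize the instances of one ground concept as a populated table.

    The table name is interpolated directly, so it must come from trusted
    code, never from user input.
    """
    conn.execute(f"CREATE TABLE {table} (id TEXT PRIMARY KEY, length INTEGER)")
    conn.executemany(f"INSERT INTO {table} VALUES (:id, :length)", rows)
    conn.commit()


store_instances(workspace, "protein", [
    {"id": "P1", "length": 120},
    {"id": "P2", "length": 300},
])

# Later operations in the plan can run against the stored partial result.
answer = workspace.execute(
    "SELECT id FROM protein WHERE length > 200"
).fetchall()
print(answer)
```

Keeping partial results in tables lets the remaining operations of a plan be expressed as ordinary relational queries over them.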
33Advantages of INDUS Design
- Modular design ensures that each module can be updated and alternative implementations easily explored
- Enables INDUS to use different network architectures
34Technologies used to develop INDUS
- JSP for developing the Graphical User Interface
- Hosted on an Apache Tomcat 4.0 web server
- Relational database for the common global ontology area
- ODBC and JDBC protocols to share the ontology with other applications
- Iterators and the resolution algorithm are implemented in Java
- All components of INDUS are platform independent
35Different Roles
- Domain Scientist
- Ontology Engineer
- Administrator
- Developer
- Each user may play multiple roles
36Domain Scientist
- Define ontologies, compound concepts and queries
- Execute queries and manipulate the retrieved data
- Should be familiar with the relevant domain, data sources and their capabilities
- No programming knowledge required
37Ontology Engineer
- Program new iterators
- Define ground concepts associated with new iterators
- Define new modes of interaction with existing data sources
38Administrator
- Install INDUS software
- Set up and manage databases
- Add new users to the system
39Developer
- Add new compositional operations to INDUS
- Modify the graphical user interface module and
query resolution module
40Summary
- Described the design and implementation of the data integration component of INDUS
- INDUS implements a federated, query-centric approach for data integration
- Information extraction operations to be executed are dynamically determined on the basis of the user-supplied ontology and query
41Related Work
- Early work on multi-database systems focused on relational and object-oriented databases
- Recently, mediators and wrappers have been developed to integrate information from disparate sources
42Information Manifold System
- Developed at AT&T Bell Laboratories
- Heterogeneous data integration system offering a unified interface for retrieving information from the WWW and internal sources
- Source-centric approach
- Allows definition of only one capability record per data source
- Only the equality operator is supported
43TSIMMIS
- Stanford-IBM Manager of Multiple Information Sources
- Based on the concepts of wrappers and mediators
- Uses the Object Exchange Model (OEM)
- Query-centric approach
44TAMBIS
- Transparent Access to Multiple Bioinformatics Information Sources
- Ontology-centered system for evaluating queries
- Offers access to multiple heterogeneous bioinformatics data sources
- Three-layer wrapper/mediator architecture
- Uses the description logic language GRAIL
- Returns the answer to the query as an HTML file
45Work in Progress
- Development of the INDUS prototype into a platform to support exploratory data analysis and knowledge acquisition from heterogeneous, distributed information sources
- Extending the information extraction framework to support extraction of sufficient statistics
- Extending recently developed algorithms to work with heterogeneous, distributed information sources
- Performance improvements using more sophisticated query optimization methods
- Exploration of the use of emerging frameworks for data and metadata description, ontologies and registry services, being developed as part of the Semantic Web project
- Exploration of methods for automatically learning mappings between data sources
- Extension of INDUS to support information integration in peer-to-peer environments and distributed sensor networks